Large-Scale Structural and Textual Similarity-Based Mining of Knowledge Graph to
Predict Drug-Drug Interactions
Ibrahim Abdelaziz (a,*), Achille Fokoue (b), Oktie Hassanzadeh (b), Ping Zhang (b), Mohammad Sadoghi (c)
(a) King Abdullah University of Science & Technology, KSA
(b) IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
(c) Department of Computer Science, Purdue University, West Lafayette, IN, USA
Abstract
Drug-Drug Interactions (DDIs) are a major cause of preventable Adverse Drug Reactions (ADRs), causing a significant burden on
the patients’ health and the healthcare system. It is widely known that clinical studies cannot sufficiently and accurately identify
DDIs for new drugs before they are made available on the market. In addition, existing public and proprietary sources of DDI
information are known to be incomplete and/or inaccurate and so not reliable. As a result, there is an emerging body of research
on in-silico prediction of drug-drug interactions. In this paper, we present Tiresias, a large-scale similarity-based framework that
predicts DDIs through link prediction. Tiresias takes in various sources of drug-related data and knowledge as inputs, and provides
DDI predictions as outputs. The process starts with semantic integration of the input data that results in a knowledge graph
describing drug attributes and relationships with various related entities such as enzymes, chemical structures, and pathways. The
knowledge graph is then used to compute several similarity measures between all the drugs in a scalable and distributed framework.
In particular, Tiresias utilizes two classes of features in a knowledge graph: local and global features. Local features are derived
from the information directly associated to each drug (i.e., one hop away) while global features are learnt by minimizing a global
loss function that considers the complete structure of the knowledge graph. The resulting similarity metrics are used to build
features for a large-scale logistic regression model to predict potential DDIs. We highlight the novelty of our proposed Tiresias and
perform thorough evaluation of the quality of the predictions. The results show the effectiveness of Tiresias in both predicting new
interactions among existing drugs as well as newly developed drugs.
Keywords: Drug Interaction, Similarity-Based, Link Prediction
1. Introduction
Adverse drug reactions (ADRs) are now the fourth leading cause of death in the
United States, surpassing complex diseases such as diabetes, pneumonia, and AIDS [1].
Over two million ADRs are reported in the U.S. annually, resulting in 100,000 deaths
every year. Furthermore, $136 billion is spent on treating complications arising
from ADRs. In fact, the cost of care for attempting to reverse ADR symptoms is higher
than the cost of diabetic and cardiovascular care combined. More importantly, a
detailed analysis of ADR incidents reveals that approximately 3 to 5% of all
in-hospital medication errors are due to “preventable” drug-drug interactions (DDIs) [1].
Therefore, a natural question arises as to why so many preventable DDIs continue
to plague patients and the healthcare system as a whole. The answer is twofold.
First, despite the advances made in drug development and safety, clinical trials
often fail to reveal rare toxicity of certain drugs given the limited size and
length of these studies. For instance, a typical trial for any drug is limited to
only 1,500 patients for a rather short period of time. Therefore, such trials fail
to show the actual impact of the drug once it is offered to millions of patients
for a much longer period of time. These concerns are further exacerbated because it
is well known that adverse reactions increase exponentially when taking four or
more drugs simultaneously [1]. Consequently, the rare toxicity of newly developed
drugs cannot be established until after the drug becomes widely available in the
market. Second, to make matters worse, healthcare providers often fail to report
ADRs because they have a misconception that all severe adverse reactions are
already known when a drug is brought to the market [1].
* Corresponding author
Email addresses: ibrahim.abdelaziz@kaust.edu.sa (Ibrahim Abdelaziz),
achille@us.ibm.com (Achille Fokoue), hassanzadeh@us.ibm.com (Oktie Hassanzadeh),
pzhang@us.ibm.com (Ping Zhang), msadoghi@purdue.edu (Mohammad Sadoghi)
Recently, there has been growing interest in computationally predicting potential
DDIs [2, 3, 4, 5, 6, 7, 8]. These approaches are broadly classified as either
similarity-based (e.g., [2, 6, 7]) or feature-based (e.g., [3]) DDI prediction
methods. However, a set of significant challenges and shortcomings has been mostly
overlooked by prior work. We summarize these limitations as follows:
Problem 1: Inability to make predictions for newly developed drugs. Prior work
either (i) is fundamentally unable to make predictions for newly developed drugs
(i.e., drugs for which no or very limited information about interacting drugs is
available) [8] or (ii) could conceptually predict drugs interacting with a new
drug, but has not been tested for this scenario [2, 7]. Similarity-based approaches
(e.g., [2, 7]) can clearly be applied to drugs without any known interacting drugs.
However, in the commonly used 10-fold cross validation evaluation, prior work using
similarity-based approaches has hidden drug-drug interaction associations and not
drugs. Thus, the large majority of drugs used at testing are also known during the
training phase, which is an inappropriate evaluation strategy to simulate the
introduction of a newly developed drug. In our experimental evaluation, we show
that the prediction quality of the basic similarity-based approaches drops
noticeably when hiding drugs instead of drug-drug associations.
Preprint submitted to Web Semantics, March 8, 2017
Problem 2: Ignoring the skewed distribution of interacting
drug pairs. Most prior work [2, 6, 7] assumes a priori a balanced distribution of
interacting drug pairs at training or at testing. There is no reason to believe
that the prevalence of pairs of interacting drugs in the set of all drug pairs is
close to 50% (as is often falsely assumed in past studies).
Problem 3: Discarding many relevant data sources and in-
completeness of similarity measures. Existing techniques
[3, 2, 7, 6] have relied on a limited number of data sources (pri-
marily DrugBank) for creating drug similarity measures. Since
various data sources provide only partial information about a
subset of drugs of interest, the resulting drug similarity mea-
sures exhibit varying levels of incompleteness. This incom-
pleteness of similarity measures, which has been for the most
part overlooked by prior work, is already an issue even when a
single data source such as DrugBank is used. The reason is that
not all the attributes needed by a given similarity measure are
available for all drugs. Without any additional machine learn-
ing features, the learning algorithm cannot distinguish between
a low similarity value between two drugs due to incomplete
data about at least one of the drugs or real dissimilarity between
them.
Problem 4: Usage of inappropriate evaluation metrics. Existing work [2, 6, 7]
mainly uses the area under the ROC curve (AUROC) as the evaluation metric to assess
the quality of predictions. These studies often justify their reliance on a
balanced testing dataset by observing that AUROC is not sensitive to the ratio of
positive to negative examples. However, as shown in [9] and reinforced in our
experimental evaluation section, AUROC is not appropriate for skewed distributions.
Metrics designed specifically for skewed distributions, such as precision and
recall, F-score, or area under the Precision-Recall curve (AUPR), should be used
instead. Unfortunately, when prior work uses these metrics, it does so on a
balanced testing dataset, which results in artificially high values. For example,
for a trivial classifier that reports all pairs of drugs as interacting, recall is
1, precision 0.5 and F-score 0.67. As shown in our evaluation, on an unbalanced
testing dataset (with the prevalence of drug-drug interactions ranging from 10% to
30%), the basic similarity-based prediction produces excellent AUROC values, but
mediocre F-score or AUPR.
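The arithmetic behind the trivial-classifier example can be checked with a short sketch. The function name and the chosen prevalence values are illustrative assumptions; the only fact used is that a classifier labeling every pair as interacting has recall 1 and precision equal to the prevalence of positives.

```python
def trivial_classifier_metrics(prevalence):
    """Precision, recall and F-score of a classifier that labels every
    drug pair as interacting (hypothetical helper for illustration)."""
    precision = prevalence  # TP / (TP + FP) equals the share of true positives
    recall = 1.0            # no interacting pair is ever missed
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Balanced test set (50%) versus the skewed prevalences discussed above.
for prevalence in (0.5, 0.3, 0.1):
    p, r, f = trivial_classifier_metrics(prevalence)
    print(f"prevalence={prevalence:.1f} precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

On a balanced test set the trivial classifier already reaches an F-score of about 0.67, while at 10% prevalence it falls to about 0.18, which is why scores reported on balanced data look artificially high.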
To address these shortcomings, we present an extension of our system Tiresias, a
large-scale similarity-based framework that predicts DDIs through link prediction
[10, 11]. Tiresias begins with a semantic integration of the input data that
results in a knowledge graph describing drug attributes and relationships with
various related entities such as enzymes, chemical structures, and pathways. The
knowledge graph is then used to compute several similarity measures between all the
drugs in a scalable and distributed framework. In the original Tiresias, we
primarily relied on a carefully engineered set of local drug similarity features
derived from the information directly associated with each drug (i.e., one hop
away). In this paper, we go beyond our original Tiresias in three ways. (i) We
introduce an enriched set of similarity features by extending the set of local
features. The newly added features include 8 new chemical structure-based
similarity measures, drug side effects, physiological effects, targets,
metabolizing enzyme and MeSH-based similarity. (ii) More importantly, we enrich our
knowledge graph with both structured and unstructured data sources (e.g.,
DailyMed¹). We also introduce the notion of global features to capture the
structural and textual features of our knowledge graph. These features are learnt
by minimizing a global loss function that considers the complete structure of the
knowledge graph. (iii) Finally, we provide a richer set of experiments to evaluate
Tiresias in both the newly developed and existing drug scenarios, which demonstrate
the effectiveness of our newly developed local and global features. Below we
summarize the key contributions of Tiresias.
Broader set of data sources: Tiresias introduces a first-of-its-kind
semantic integration of a comprehensive set of structured and
unstructured data sources. We exploit information originating
from multiple linked data sources including, e.g., DrugBank,
UMLS, DailyMed, Uniprot and CTD (cf. Section 4) to con-
struct a knowledge graph. This integrated knowledge graph
describes drug attributes and relationships with various related
entities such as enzymes, chemical structures, and pathways.
Extensive set of novel similarity measures: We utilize the
integrated knowledge graph to compute several similarity mea-
sures between all the drugs (cf. Section 5). We develop new
drug-drug similarity measures based on various properties of
drugs including metabolic and signaling pathways, drug mechanism
of action and physiological effects. We also define a new
class of global drug features by learning low-dimensional em-
beddings of drugs from textual and graph-based datasets.
Handling Data Skewness and Incompleteness: We build a large-scale and distributed
logistic regression learning model (in Apache Spark) to predict the existence of
DDIs. Our model efficiently handles the skewed distribution of DDIs and data
incompleteness through (i) a combination of case control sampling for rare events
and (ii) a new class of calibration features. First, in Section 6, we present a
systematic methodology to estimate the true prevalence of interacting drug pairs in
the set of all drug pairs, which we discover to range between 10% and 30%. Second,
to address the incompleteness of similarity measures, which affects prediction
quality as measured by precision and recall, F-score, etc., we introduce a new
class of features, called calibration features (cf. Section 5.3), that captures
the relative completeness of the drug-drug similarity measures.
¹ https://dailymed.nlm.nih.gov/dailymed/
Extending prediction to newly developed drugs: Given the extensive
set of similarity-based features in Tiresias, we demonstrate
that our framework is capable of dealing with drugs without any
known interacting drugs. We further show that techniques de-
veloped in Tiresias significantly improve the prediction quality
for new drugs not seen at training.
Comprehensive evaluation: We conduct detailed evaluations
with real data assuming skewed data distribution and using
proper evaluation metrics including precision, recall, F-score
and AUPR. For newly developed drugs, using standard 10-fold
cross validation, Tiresias is able to achieve DDI prediction with
an average F-Score of 0.74 (vs. 0.65 for the baseline) and area
under PR curve of 0.82 (vs. 0.78 for the baseline). For the
existing drugs scenario, Tiresias is able to achieve an F-Score
value of 0.85 (vs. 0.75 for the baseline) and AUPR of 0.92 (vs.
0.87 for the baseline). The performance becomes even better
as we include our global embedding-based features; F-score in-
creases to 0.89 while AUPR increases to 0.97. Additionally, we
introduce a novel retrospective analysis to demonstrate the ef-
fectiveness of our approach to predict correct, but yet unknown
DDIs. Up to 68% of all DDIs found after 2011 were correctly
predicted using only DDIs known in 2011 as positive examples
in training (cf. Section 7).
The rest of the paper is organized as follows. Section 2 dis-
cusses the preliminaries of similarity-based DDI approaches as
well as text and graph embedding techniques. In Section 3,
we give an overview of the main components of Tiresias and
highlight its computation phases. Section 4 describes the data
integration phase and how Tiresias handles the associated integration
challenges. Then, we describe the different extracted
features required for model building in Section 5. Section 6
shows how Tiresias handles unbalanced data distributions while
Section 7 presents the experimental evaluation. Finally, Section
8 surveys the related work and we conclude in Section 9.
2. Background
In this section, we discuss the main ideas of similarity-based
DDI approaches. We also discuss a recent line of research
which aims at learning low-dimensional embeddings of enti-
ties of textual and graph-based datasets. These embedding ap-
proaches are used in Tiresias to define a new family of global
features for comparing drugs.
2.1. Similarity-based DDI predictions
Similar to content-based recommender systems, the core idea
of similarity-based approaches [2, 6, 7] is to predict the exis-
tence of an interaction between a candidate pair of drugs by
comparing it against known interacting pairs of drugs. Finding
known interacting drugs that are very similar to the candidate
pair provides supporting evidence in favor of the existence of a
drug-drug interaction between the two candidate drugs.
These approaches first define a variety of drug similarity measures to compare
drugs. A drug similarity measure $sim$ is a function that takes as input two drugs
and returns a real number between 0 (no similarity between the two drugs) and 1
(perfect match between the two drugs) indicating the similarity between the two
drugs. $SIM$ denotes the set of all drug similarity measures. Entities of interest
for drug-drug interaction prediction are not single drugs, but rather pairs of
drugs. Thus, drug similarity measures in $SIM$ need to be extended to produce
drug-drug similarity measures that compare two pairs of drugs (e.g., a pair of
candidate drugs against an already known interacting pair of drugs). Given two drug
similarity measures $sim_1$ and $sim_2$ in $SIM$, we can define a new drug-drug
similarity measure, denoted $sim_1 \otimes sim_2$, that takes as input two pairs of
drugs $(a_1, a_2)$ and $(b_1, b_2)$ and returns the similarity between the two
pairs of drugs computed as follows:
$$(sim_1 \otimes sim_2)((a_1, a_2), (b_1, b_2)) = avg(sim_1(a_1, b_1), sim_2(a_2, b_2))$$
where $avg$ is an average or mean function such as the geometric mean or the
harmonic mean. In other words, the first drug similarity measure ($sim_1$) is used
to compare the first element of each pair and the second drug similarity measure
($sim_2$) is used to compare the second element of each pair. Finally, the results
of the two comparisons are combined using, for example, the harmonic or geometric
mean. The set of all drug-drug similarity measures thus defined by combining drug
similarity measures in $SIM$ is denoted
$SIM^2 = \{sim_1 \otimes sim_2 \mid sim_1 \in SIM \wedge sim_2 \in SIM\}$.
Given a set $KDDI$ of known drug-drug interactions, a drug-drug similarity measure
$sim_1 \otimes sim_2 \in SIM^2$, and a candidate drug pair $(d_1, d_2)$, the
prediction based solely on $sim_1 \otimes sim_2$ that $d_1$ and $d_2$ interact,
denoted $predict[sim_1 \otimes sim_2, KDDI](d_1, d_2)$, is computed as the
arithmetic mean of the similarity values between $(d_1, d_2)$ and the top-$k$ most
similar known interacting drug pairs to $(d_1, d_2)$:
$$predict[sim_1 \otimes sim_2, KDDI](d_1, d_2) = amean(top_k\{(sim_1 \otimes sim_2)((d_1, d_2), (x, y)) \mid (x, y) \in KDDI \setminus \{(d_1, d_2)\}\})$$
where $amean$ is the arithmetic mean and, in most cases, $k$ is equal to 1. The
power of similarity-based approaches stems from not relying on a single
similarity-based prediction, but from combining all the individual independent
predictions $predict[sim_1 \otimes sim_2, KDDI]$ for all
$sim_1 \otimes sim_2 \in SIM^2$ into a single score that indicates the level of
confidence in the existence of a drug-drug interaction. This combination is
typically done through machine learning (e.g., logistic regression): the training
is performed using $KDDI$ as the ground truth and, given a drug pair $(d_1, d_2)$,
its feature vector consists of $predict[sim_1 \otimes sim_2, KDDI](d_1, d_2)$ for
all $sim_1 \otimes sim_2 \in SIM^2$.
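The definitions above can be illustrated with a minimal sketch. It assumes the geometric mean as the averaging function, a toy similarity table with made-up values, and hypothetical function names; it is not the Tiresias implementation, and it compares pairs only in the order given.

```python
import math

# Toy, made-up drug similarity values (symmetric); identical drugs score 1.0.
TOY_SIMILARITY = {
    frozenset({"Salsalate", "Aspirin"}): 0.9,
    frozenset({"Dicoumarol", "Warfarin"}): 0.7,
}


def toy_sim(x, y):
    """A stand-in drug similarity measure returning a value in [0, 1]."""
    return 1.0 if x == y else TOY_SIMILARITY.get(frozenset({x, y}), 0.0)


def pair_similarity(sim1, sim2, pair_a, pair_b):
    """Drug-drug similarity of two pairs: geometric mean of the
    element-wise drug similarities (first with first, second with second)."""
    (a1, a2), (b1, b2) = pair_a, pair_b
    return math.sqrt(sim1(a1, b1) * sim2(a2, b2))


def predict(sim1, sim2, known_ddis, candidate, k=1):
    """Arithmetic mean of the top-k similarities between the candidate
    pair and the known interacting pairs (excluding the candidate itself)."""
    scores = sorted(
        (pair_similarity(sim1, sim2, candidate, pair)
         for pair in known_ddis if pair != candidate),
        reverse=True,
    )
    top = scores[:k]
    return sum(top) / len(top) if top else 0.0


known = [("Aspirin", "Gliclazide"), ("Aspirin", "Dicoumarol")]
print(predict(toy_sim, toy_sim, known, ("Salsalate", "Warfarin")))
print(predict(toy_sim, toy_sim, known, ("Salsalate", "Gliclazide")))
```

Each (sim1, sim2) combination yields one such score, and the collection of scores forms the feature vector of the candidate pair.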
Similarity-based methods have a number of clear advantages: (i) compared with
direct feature vector-based approaches, similarity-based approaches do not need
complex and difficult feature extraction or selection (e.g., generating and
combining features for a drug pair); (ii) many drug similarity measures, such as
chemical structure similarity [4], target protein similarity [5], and side-effect
similarity [6], have already been fully developed and are widely used; (iii)
similarity-based approaches can be directly related to well-developed kernel
methods, which can provide high-performance prediction results; (iv) different
similarity measures can easily be combined. For example, we generate a drug
chemical-protein interactome (CPI) similarity measure based on the concept of the
DDI-CPI server [3], which shows the flexibility of our method in integrating
multiple drug information resources.
2.2. Textual Embedding
Recently, the word2vec model has attracted a lot of attention for constructing
embeddings of textual data [12]. It aims at learning high-quality word vectors
(embeddings) from huge data sets with billions of words. Word2vec is a two-layer
neural network used for computing vector representations of words. Each word vector
is trained to maximize the log probability of the word given the context word(s)
occurring within a fixed-size window. Word2vec proposes two different architectures
that can be utilized to obtain word vectors: Continuous Bag-of-Words (CBOW) and the
Skip-gram model [12]. CBOW tries to predict the word given its context, while
Skip-gram tries to predict the context given a word. In Skip-gram, the context is
not limited to the word’s immediate context; rather, training instances can be
created by skipping a constant number of words. CBOW is several times faster than
Skip-gram, which tends to be more accurate [12] due to the more generalizable
contexts generated.
The training objective of the Skip-gram model is to find word representations that
are useful for predicting the surrounding words in a sentence or a document. Given
a sequence of words $w_1, w_2, w_3, \ldots, w_T$, the Skip-gram model tries to
maximize the average log probability as follows:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$
where $c$ is the size of the training context while $p(w_{t+j} \mid w_t)$ is
defined using the softmax function as follows:
$$p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w=1}^{W} \exp({v'_w}^{\top} v_{w_I})}$$
$v_w$ and $v'_w$ refer to the input and output vector representations of $w$,
respectively, while $W$ is the number of words in the vocabulary. $v_w$ comes from
the input-to-hidden layer weight matrix while $v'_w$ comes from the
hidden-to-output layer weight matrix. The training objective is to maximize the
conditional probability of observing the actual output word $w_O$ given the input
context word $w_I$ with regard to the weights. With the existence of two vector
representations for each word in the vocabulary, learning the input vectors becomes
cheap, but learning the output vectors is very expensive. Consequently, more
efficient approximation techniques like hierarchical softmax and negative sampling
[12] can be utilized.
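As a rough illustration of the two formulas above, the following sketch evaluates the full softmax and the average log probability on a tiny vocabulary. The function names and the toy vectors are assumptions made for this example; it computes the objective only and does no training, and practical implementations rely on hierarchical softmax or negative sampling instead of the full sum.

```python
import math


def softmax_probability(output_vectors, input_vector, target_index):
    """p(w_O | w_I): softmax over the dot products between every output
    vector v'_w and the input vector v_{w_I}."""
    scores = [sum(o * i for o, i in zip(out, input_vector))
              for out in output_vectors]
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    return exps[target_index] / sum(exps)


def average_log_probability(word_ids, input_vectors, output_vectors, c):
    """Skip-gram objective: average over positions t of the summed log
    probabilities of the words within c positions of w_t."""
    total = 0.0
    for t, center in enumerate(word_ids):
        for j in range(-c, c + 1):
            if j == 0 or not 0 <= t + j < len(word_ids):
                continue  # skip the center word and out-of-range positions
            total += math.log(softmax_probability(
                output_vectors, input_vectors[center], word_ids[t + j]))
    return total / len(word_ids)


# Three-word vocabulary with all-zero vectors: every word is equally likely.
zero_vectors = [[0.0, 0.0] for _ in range(3)]
print(softmax_probability(zero_vectors, [0.0, 0.0], 1))
print(average_log_probability([0, 1, 2], zero_vectors, zero_vectors, c=1))
```

The denominator sums over the whole vocabulary for every prediction, which is the cost that the approximation techniques mentioned above are meant to avoid.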
One of the advantages of word2vec is its ability to group similar words together in
the vector space. Moreover, it can automatically learn concepts and predict
semantic relationships between words using simple algebraic operations on the word
vectors [12, 13]. For example, vector(Germany) + vector(capital) is close to
vector(Berlin) and vector(Russia) + vector(river) is close to vector(Volga River).
Furthermore, vector(King) - vector(Man) + vector(Woman) results in a vector close
to vector(Queen). Similarly, vector(Einstein) - vector(scientist) + vector(Picasso)
results in a vector close to vector(painter).
2.3. Graph Embedding
Several approaches [14, 15, 16] have been proposed to embed
an input knowledge graph into a continuous vector space while
preserving certain properties of the original graph. The output
of these techniques is a vector representation for each entity
and relation in the input graph. Each entity is represented as a
point in the vector space while each relation is modeled as an
operation (e.g. translation, projection, etc) in that space.
TransE [14] is an energy-based model for learning low-dimensional embeddings of
entities. TransE treats a triple $(s, p, o)$ as a relation-specific translation
from a head entity (subject) to a tail entity (object). The translation function is
modeled as a simple addition of the vectors that correspond to the head entity
($s$) and the predicate relation ($p$). When $(s, p, o)$ holds, TransE tries to
make $o$ the nearest neighbor of $s + p$, and to keep $o$ far away from $s + p$
otherwise. To learn these embeddings, TransE minimizes a max-margin-based ranking
cost function as follows:
$$L = \sum_{(s,p,o) \in S} \ \sum_{(s',p,o') \in S'_{(s,p,o)}} [\gamma + d(s + p, o) - d(s' + p, o')]_+$$
where $L$ is the loss function to be minimized, $\gamma > 0$ is a margin
hyperparameter, $[x]_+$ denotes the positive part of $x$, and $d$ measures the
dissimilarity between the translation and the object embedding, which can be the
$L_1$ or $L_2$ distance. $S'_{(s,p,o)}$ is the set of corrupted triples, which is
drawn from the set of training triples with either the head or tail replaced by a
random entity (but not both at the same time). TransE is an easy model to train
since it relies on a reduced set of parameters. At the same time, it is an
efficient model that achieves high prediction performance. TransH [15] improves the
performance of TransE when dealing with reflexive, one-to-many, many-to-one and
many-to-many relations. TransH shows better performance compared to TransE at the
cost of a higher computational complexity.
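The TransE ranking loss can be sketched in a few lines. The embeddings, entity names and corrupted triples below are toy values chosen for illustration, the L1 distance is assumed for d, and the sketch evaluates the loss only (no gradient updates or negative sampling).

```python
def l1_distance(u, v):
    """L1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def translate(subject_vec, predicate_vec):
    """TransE translation: s + p."""
    return [a + b for a, b in zip(subject_vec, predicate_vec)]


def transe_loss(training, entity, relation, gamma=1.0):
    """Max-margin ranking loss. `training` is a list of
    ((s, p, o), corrupted_triples) where each corrupted triple replaces
    either the head or the tail of (s, p, o)."""
    loss = 0.0
    for (s, p, o), corrupted in training:
        positive = l1_distance(translate(entity[s], relation[p]), entity[o])
        for s_c, _, o_c in corrupted:
            negative = l1_distance(
                translate(entity[s_c], relation[p]), entity[o_c])
            loss += max(0.0, gamma + positive - negative)  # [x]_+
    return loss


# Toy 2-dimensional embeddings (hypothetical values).
entity = {"drugA": [0.0, 0.0], "enzymeX": [1.0, 0.0], "geneY": [0.0, 3.0]}
relation = {"targets": [1.0, 0.0]}
training = [(
    ("drugA", "targets", "enzymeX"),
    [("drugA", "targets", "geneY"), ("geneY", "targets", "enzymeX")],
)]
print(transe_loss(training, entity, relation, gamma=1.0))
print(transe_loss(training, entity, relation, gamma=5.0))
```

With a margin of 1 the corrupted triples are already far enough from the true triple and contribute nothing; raising the margin to 5 makes both of them contribute to the loss.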
Recently, Nickel et al. introduced HolE [16], a compositional vector space based
model for knowledge graphs. The meaning and representation of entities in
compositional models do not vary according to their position in the compositional
representation. Furthermore, the representations of all entities and relations are
learned jointly. This allows the model to propagate information between triples,
which captures global dependencies in the data. HolE combines the expressive power
of the tensor product with the efficiency and simplicity of TransE. It represents a
pair of entities $(a, b)$ using the circular correlation (a compression of the
tensor product) of their vectors as follows:
$$a \circ b = a \star b,$$
where $\star : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ denotes circular
correlation, which is defined as:
$$[a \star b]_k = \sum_{i=0}^{d-1} a_i \, b_{(k+i) \bmod d}.$$
Consequently, the probability of a triple is modeled as:
$$\Pr(\phi_p(s, o) = 1 \mid \Theta) = \sigma(r_p^{\top} (e_s \star e_o)),$$
where $r_p$ and $e_i$ are vector representations of relations and entities,
$\sigma(x) = 1/(1 + \exp(-x))$ denotes the logistic function, and
$\Theta = \{e_i\}_{i=1}^{n_e} \cup \{r_k\}_{k=1}^{n_r}$ denotes the set of all
embeddings. $\circ$ denotes the compositional operator, which creates a composite
vector representation for the pair $(s, o)$ from the embeddings $e_s, e_o$. HolE is
shown to handle relatively large knowledge graphs and provide better performance
compared to state-of-the-art embedding techniques.
Figure 1: Overview of Tiresias: a large-scale similarity-based DDI prediction framework.
[Phases shown in the figure: Ingest; Build features; Build adjusted logistic
regression models; Select model & threshold; Predict.]
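A direct, unoptimized transcription of the circular correlation and of the HolE triple probability might look as follows. The function names and toy vectors are assumptions for illustration; the loop is quadratic in the dimension d, and faster FFT-based formulations exist but are not shown here.

```python
import math


def circular_correlation(a, b):
    """[a * b]_k = sum_i a_i * b_((k + i) mod d), computed directly."""
    d = len(a)
    return [sum(a[i] * b[(k + i) % d] for i in range(d)) for k in range(d)]


def hole_probability(r_p, e_s, e_o):
    """Probability of a triple: logistic function applied to the dot
    product of the relation vector and the composite pair vector."""
    composite = circular_correlation(e_s, e_o)
    score = sum(r * c for r, c in zip(r_p, composite))
    return 1.0 / (1.0 + math.exp(-score))


# Circular correlation is not commutative, so (s, o) and (o, s) differ.
print(circular_correlation([1, 0, 0], [0, 1, 0]))
print(circular_correlation([0, 1, 0], [1, 0, 0]))
print(hole_probability([0.0, 0.0, 0.0], [1, 2, 3], [4, 5, 6]))
```

The composite vector has the same dimension d as the entity embeddings, which is what makes it a compression of the d-by-d tensor product.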
3. Tiresias Overview
Figure 1 shows the architecture of Tiresias. It consists of five
key phases (the arrows in Figure 1). We describe each
phase in detail below.
Ingestion: In this phase, data originating from multiple sources are ingested and
integrated to create various drug similarity measures (represented as blue tables
in Figure 1) and a known DDIs table. Similarity measures are not necessarily
complete in the sense that some drug pairs may be missing from the similarity
tables displayed in Figure 1. The known DDIs table, denoted $KDDI$, contains the
set of 12,104 drug pairs already known to interact in DrugBank. In the 10-fold
cross validation of our approach, $KDDI$ is randomly split into 3 disjoint subsets:
$KDDI_{train}$, $KDDI_{val}$, and $KDDI_{test}$, representing the sets of positive
examples respectively used in the training, validation and testing (or prediction)
phases. Contrary to most prior work, which partitions $KDDI$ on the DDI
associations instead of on drugs, our partitioning simulates the scenario of the
introduction of newly developed drugs for which no interacting drugs are known. In
particular, each pair $(d_1, d_2)$ in $KDDI_{test}$ is such that either $d_1$ or
$d_2$ does not appear in $KDDI_{train}$ or $KDDI_{val}$.
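One way to obtain a partition with this property is to hold out drugs rather than pairs. The sketch below is an assumption about how such a split could be produced for a single fold; the function name, the held-out fraction and the toy pairs are hypothetical, and the exact sampling procedure used by Tiresias is not specified in this section.

```python
import random


def split_by_drug(known_ddis, held_out_fraction=0.1, seed=0):
    """Hold out a random set of drugs; every pair touching a held-out drug
    goes to the test set, so those drugs are unseen during training."""
    rng = random.Random(seed)
    drugs = sorted({drug for pair in known_ddis for drug in pair})
    n_held_out = max(1, int(len(drugs) * held_out_fraction))
    held_out = set(rng.sample(drugs, n_held_out))
    test = [p for p in known_ddis
            if p[0] in held_out or p[1] in held_out]
    train_val = [p for p in known_ddis
                 if p[0] not in held_out and p[1] not in held_out]
    return train_val, test, held_out


# Toy interaction pairs with placeholder drug names.
toy_pairs = [("A", "B"), ("A", "C"), ("B", "C"),
             ("C", "D"), ("D", "E"), ("E", "F")]
train_val, test, held_out = split_by_drug(toy_pairs, held_out_fraction=0.34)
print("held-out drugs:", sorted(held_out))
print("train/validation pairs:", train_val)
print("test pairs:", test)
```

Splitting on pairs instead would let most test drugs also appear in training pairs, which is the evaluation weakness described under Problem 1.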
Feature Building: Given a pair of drugs $(d_1, d_2)$, we construct its machine
learning feature vector derived from the drug similarity measures and the set of
DDIs known at training. Like previous similarity-based approaches, for a drug
candidate pair $(d_1, d_2)$ and a drug-drug similarity measure
$sim_1 \otimes sim_2 \in SIM^2$, we create a feature that indicates the similarity
value of the known pair of interacting drugs most similar to $(d_1, d_2)$ (see
Section 5.2). Unlike prior work, we introduce new calibration features to address
the issue of the incompleteness of the similarity measures and to provide more
information about the distribution of the similarity values between a drug
candidate pair and all known interacting drug pairs, not just the maximum value
(see Section 5.3).
Logistic Regression Model: As a result of relying on more data sources, using more
similarity measures, and introducing new calibration features, we have
significantly more features (1014) than prior work (e.g., [2] uses only 49
features). Thus, there is an increased risk of overfitting, which we address by
performing L2 model regularization. Since the optimal regularization parameter is
not known a priori, in the model generation phase we build 8 different logistic
regression models using 8 different regularization values. To address issues
related to the skewed distribution of DDIs (for an assumed prevalence of DDIs lower
than 17%), we make some adjustments to logistic regression (see Section 6).
Figure 2: Semantic curation and linkage of data from a variety of sources on the Web.
[Labels shown in the figure: Rheumatoid Arthritis; Acetaminophen; Gene; TP53;
Linked Data Source.]
Model Selection: The goals of this phase are twofold. First,
in this phase, we select the best of the eight models (i.e., the best
regularization parameter value) built in the model generation
phase by choosing the model producing the best F-score on the
validation data. Second, we also select the optimal threshold
as the threshold at which the best F-score is obtained on the
validation data evaluated on the selected model.
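The selection step can be sketched as follows, assuming each candidate model has already produced a score for every validation pair. The model names, scores and labels are made up, and ties are broken in favor of the larger threshold simply because tuples are compared lexicographically; none of this is taken from the Tiresias code.

```python
def f_score(labels, scores, threshold):
    """F-score when pairs with score >= threshold are labeled interacting."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labels, scores):
    """Best (F-score, threshold) over the observed score values."""
    return max((f_score(labels, scores, t), t) for t in set(scores))


def select_model(labels, scores_by_model):
    """Pick the model whose best validation F-score is highest and
    return (F-score, threshold, model name)."""
    best = None
    for name, scores in scores_by_model.items():
        f, t = best_threshold(labels, scores)
        if best is None or f > best[0]:
            best = (f, t, name)
    return best


# Validation labels and scores from two hypothetical regularization values.
validation_labels = [1, 1, 0, 0, 0]
validation_scores = {
    "lambda_1": [0.9, 0.8, 0.3, 0.2, 0.1],
    "lambda_2": [0.9, 0.2, 0.8, 0.3, 0.1],
}
print(select_model(validation_labels, validation_scores))
```

In the prediction phase, the selected model and threshold would then be applied unchanged to the candidate pairs.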
Prediction: Let $f$ denote the logistic function selected in the model selection
phase and $\eta$ the confidence threshold selected in the same phase. In the
prediction phase, for each candidate drug pair $(d_1, d_2)$, we first get its
feature vector $v$ computed in the feature building phase. $f(v)$ then indicates
the probability that the two drugs $d_1$ and $d_2$ interact, and the pair
$(d_1, d_2)$ is labeled as interacting iff $f(v) \ge \eta$.
4. Data Integration
4.1. Datasets
We form our knowledge graph by integrating data from a variety of web sources.
These sources come in different formats including XML, relational, graph and CSV
formats. As partially shown in Figure 2, our data comes from a variety of sources:
(i) DrugBank [17] offers data about known drugs and diseases. (ii) DailyMed²
provides high-quality information about marketed drugs in the United States. (iii)
The Comparative Toxicogenomics Database [18] provides information about gene
interaction. (iv) Uniprot [19] provides details about the functions and structure
of genes. (v) The BioGRID database collects genetic and protein interactions [20].
(vi) The Unified Medical Language System [21] is the largest repository of
biomedical vocabularies including NCBI taxonomy and Gene Ontology (GO). (vii)
Medical Subject Headings (MeSH) [22], and (viii) the National Drug File - Reference
Terminology (NDF-RT), which classifies drugs with multi-category reference models
such as cellular or molecular interactions and therapeutic categories [23].
² https://dailymed.nlm.nih.gov/dailymed/
4.2. Addressing Integration Challenges
One of the salient features of our Tiresias framework is that it
leverages many available sources on the Web. More importantly,
there is a crucial need to connect these disparate sources in order
to create a knowledge graph that is continuously enriched as more
sources are ingested. Notably, the life science community has
already recognized the importance of data integration and taken
the first step of employing the Linked Open Data methodology for
connecting identical entities across different sources. However,
most of the existing linkages in the scientific domain are created
statically, which results in many outdated or even non-existent
links over time.
Therefore, even when the data is presumably linked, we are
forced to verify these links. Furthermore, there are a number of
fundamental challenges that must be addressed to construct a
unified view of the data with rich interconnectedness and
semantics [24], that is, a knowledge graph. For example, we employ
entity resolution methodology either through syntactical
disambiguation (e.g., cosine similarity, edit distance, or language
model techniques [25]) or through semantic analysis by examining
the conceptual properties of entities [21]. These techniques
are not only essential to identify similar entities but also
instrumental in designing and capturing similarities among entities
in order to engineer the features necessary to enable DDI prediction.
As part of our knowledge graph curation task, we identify
which attributes or columns refer to which real-world entities
(i.e., data instances). Therefore, our constructed knowledge
graph possesses a clear notion of what the entities are, and what
relations exist for each instance, in order to capture the data
interconnectedness. These may be relations to other entities, or
relations of the attributes of the entity to data values. As
an example, in our ingested and curated data, we have a table
for Drug with the columns Name, Targets, and Symptomatic
Treatment. Our knowledge graph has an identifier for the real-world
drug Methotrexate, and captures its attributes such as
Molecular Structure or Mechanism of Action, as well as relations
to other entities including Genes that Methotrexate targets
(e.g., DHFR), and subsequently, Conditions that it treats, such
as Osteosarcoma (bone cancer), that are reachable through its
target genes, as demonstrated in Figure 2. We then encode and
store the integrated graph in RDF format, which is used as input
to Apache Spark for similarity calculation and model building.
Constructing a rich knowledge graph is a necessary step before
building our prediction model, as discussed next.
5. Feature Engineering
In this section, we describe the drug similarity measures used
to compare drugs and how various machine learning features
are generated from them.
5.1. Drug Similarity and Drug-Drug Similarity Measures
To measure the similarity between two drugs, Tiresias uses a
set of features that are divided into two categories based on the
way they are generated: local and global similarity-based features.
Local features are engineered from the information directly
associated with each drug, e.g., chemical structure, side effects,
and drug targets. Global features, on the other hand, are obtained
by embedding drugs in low-dimensional vector spaces. We learn a
vector representation for each drug such that the similarity
between two drugs is defined as the cosine similarity between their
corresponding vectors. To construct these vector representations,
we minimize a global loss function that considers all facts
(including structural properties of graphs) in the dataset. We
describe each category in detail below.
5.1.1. Local Similarity-based Features
Based on the available information about drugs, we manually
selected the following drug similarity measures to compare two
drugs.
Chemical-Protein Interactome (CPI) Profile based Similarity:
The Chemical-Protein Interactome (CPI) profile of a drug
$d$, denoted $cpi(d)$, is a vector indicating how well its chemical
structure docks or binds with about 611 human Protein
Data Bank (PDB) structures associated with drug-drug
interactions [3]. The CPI profile based similarity of two drugs $d_1$
and $d_2$ is computed as the cosine similarity between the
mean-centered versions of the vectors $cpi(d_1)$ and $cpi(d_2)$.
Mechanism of Action based Similarity: For a drug $d$, we collect
all its mechanisms of action obtained from NDF-RT. To
discount popular terms, Inverse Document Frequency (IDF) is
used to assign more weight to relatively rare mechanisms of
action:
$$IDF(t, Drugs) = \log \frac{|Drugs| + 1}{DF(t, Drugs) + 1}$$
where $Drugs$ is the set of all drugs, $t$ is a mechanism of action,
and $DF(t, Drugs)$ is the number of drugs with the mechanism of
action $t$. The IDF-weighted mechanism of action vector of a drug
$d$ is a vector $moa(d)$ whose components are mechanisms of action.
The value of a component $t$ of $moa(d)$, denoted $moa(d)[t]$, is zero
if $t$ is not a known mechanism of action of $d$; otherwise, it is
$IDF(t, Drugs)$. The mechanism of action based similarity measure
of two drugs $d_1$ and $d_2$ is the cosine similarity of the vectors
$moa(d_1)$ and $moa(d_2)$.
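As an illustration, the IDF weighting and cosine similarity described above can be sketched in a few lines of Python. The drug and term names used with it are hypothetical placeholders, and representing each drug as a set of terms in a dictionary is an assumption of this sketch, not the data model used by Tiresias.

```python
import math

def idf(term, drug_terms):
    """IDF(t, Drugs) = log((|Drugs| + 1) / (DF(t, Drugs) + 1))."""
    df = sum(1 for terms in drug_terms.values() if term in terms)
    return math.log((len(drug_terms) + 1) / (df + 1))

def idf_vector(drug, drug_terms):
    """Sparse IDF-weighted vector: terms the drug lacks are implicit zeros."""
    return {t: idf(t, drug_terms) for t in drug_terms[drug]}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def idf_cosine_similarity(d1, d2, drug_terms):
    """Similarity of two drugs over one term type (e.g., mechanisms of action)."""
    return cosine(idf_vector(d1, drug_terms), idf_vector(d2, drug_terms))
```

Because the same construction is reused for physiological effects, pathways, side effects, enzymes, ATC codes, and MeSH terms, only the `drug_terms` mapping would change from one measure to the next.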
Physiological Effect based Similarity: For a drug $d$, we collect
all its physiological effects obtained from NDF-RT. The
physiological effect based similarity measure of two drugs $d_1$
and $d_2$ is defined as the cosine similarity of the IDF-weighted
physiological effect vectors of the two drugs, which are computed
in the same way as the IDF-weighted mechanism of action vector
described in the previous paragraph.
Pathways based Similarity: Information about pathways affected
by drugs is obtained from the CTD database. The pathways
based similarity of two drugs is defined as the cosine similarity
between the IDF-weighted pathway vectors of the two drugs,
which are computed in a similar way as the IDF-weighted
mechanism of action vectors.
Side Effect based Similarity: Side effects associated with
a drug are obtained from the SIDER database of drug side
effects [26]. The side effect based similarity of two drugs is
defined as the cosine similarity between the IDF-weighted side
effect vectors of the two drugs, which are computed in a similar
way as the IDF-weighted mechanism of action vectors of drugs.
Metabolizing Enzyme based Similarities: Information about
enzymes responsible for the metabolism of drugs is obtained
from DrugBank. We define two drug similarity measures re-
lated to metabolizing enzymes.
The first measure compares drugs based on the commonal-
ity of the metabolizing enzymes they interact with. How-
ever, it does not take into account the nature of the inter-
action (i.e., inhibitor, substrate, or inducer). It is formally
defined as the cosine similarity between the IDF-weighted
metabolizing enzyme vectors of two drugs, which are
computed in a similar way as the IDF-weighted mecha-
nism of action vectors of drugs.
The second measure takes into account the nature of the
interaction. For example, suppose drug $d_1$ interacts with a single
metabolizing enzyme $e$ by acting as an inhibitor, and drug
$d_2$ also interacts only with the same enzyme $e$, but as an
inducer. According to the first measure, $d_1$ and $d_2$ have
a similarity value of 1. However, once the nature of the
interaction with the enzyme is taken into account, it is clear
that $d_1$ and $d_2$ are actually very dissimilar. Formally, to
take into account the nature of the interaction, we modify
the IDF-weighted metabolizing enzyme vector $me(d)$ of a
drug $d$ by multiplying by $-1$ the value of each component
corresponding to an enzyme that is inhibited by the drug.
The similarity between two drugs is then defined as the
normalized cosine similarity between the modified
IDF-weighted metabolizing enzyme vectors of the two drugs
(normalization ensures that the value remains in the $[0, 1]$
range instead of the $[-1, 1]$ range).
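The inhibitor/inducer example can be checked with a small Python sketch. The text does not spell out the normalization, so mapping the cosine $c$ to $(c + 1) / 2$ is an assumption here, as are the function names, the action labels, and the IDF weight used in the usage example.

```python
import math

def signed_enzyme_vector(enzyme_actions, idf_weights):
    """Negate the IDF weight of every enzyme that the drug inhibits.

    enzyme_actions maps enzyme -> "inhibitor", "substrate", or "inducer".
    """
    return {e: (-idf_weights[e] if action == "inhibitor" else idf_weights[e])
            for e, action in enzyme_actions.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[e] for e, w in u.items() if e in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def normalized_cosine(u, v):
    """Assumed normalization: map the cosine from [-1, 1] to [0, 1]."""
    return (cosine(u, v) + 1.0) / 2.0

# One shared enzyme "E": d1 inhibits it, d2 induces it.
idf_weights = {"E": 1.5}  # hypothetical IDF weight
d1 = signed_enzyme_vector({"E": "inhibitor"}, idf_weights)
d2 = signed_enzyme_vector({"E": "inducer"}, idf_weights)
similarity = normalized_cosine(d1, d2)  # lowest possible value under this normalization
```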
Drug Target based Similarities: Information about proteins
targeted by a drug is obtained from DrugBank. We define three
drug similarity measures related to drug targets. The first two
are constructed in a similar way as the two metabolizing enzyme
based similarities. The first similarity ignores the nature
of the action of the drug on a protein target (i.e., inhibition
or activation), whereas the second takes it into account. The
third similarity measure compares drugs based on the molecular
functions of their protein targets as defined in Uniprot using
Gene Ontology (GO) annotations. Specifically, the third
similarity measure is computed as the Resnik semantic similarity [27],
using the csbl.go R package [28].
Chemical Structure Similarity: Fingerprinting is now considered
an important tool for judging the similarity of drugs'
chemical structures. Therefore, we define a similarity measure
for comparing two drugs based on the fingerprints of their
chemical structures. The chemical structures of the drugs are
obtained from DrugBank in the SMILES format. We use the
Chemistry Development Kit (CDK, footnote 3) [29], with default
settings, to compute the fingerprints of the molecular structures
of drugs as bit vectors. The chemical structure similarity of two
drugs is then computed as the Jaccard similarity (or Tanimoto
coefficient) of their fingerprints. There are several approaches for
computing fingerprints. Instead of using a single method,
we use 9 types of fingerprints: path-based, circular, shortest
path, MACCS, EState, Extended, KlekotaRoth, Pubchem, and
substructure fingerprints. More details about each fingerprint
type can be found in [30].
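For bit-vector fingerprints, the Jaccard (Tanimoto) coefficient is the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below assumes each fingerprint has already been computed and is represented as the set of indices of its 1-bits. It does not call the CDK, and the bit indices in the usage example are made up.

```python
def tanimoto(fp1, fp2):
    """Jaccard/Tanimoto coefficient of two fingerprints.

    fp1 and fp2 are sets holding the indices of the bits set to 1.
    Two empty fingerprints are given a similarity of 0.0 by convention here.
    """
    union = fp1 | fp2
    if not union:
        return 0.0
    return len(fp1 & fp2) / len(union)

# Hypothetical fingerprints sharing two of four distinct bits.
similarity = tanimoto({1, 2, 3}, {2, 3, 4})
```

With nine fingerprint types, this function would be applied once per type, giving nine chemical structure similarity values per pair of drugs.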
Anatomical Therapeutic Chemical (ATC) Classification
System based Similarity: ATC [31] is a classification of the
active ingredients of drugs according to the organs that they
affect as well as their chemical, pharmacological, and therapeutic
characteristics. The classification consists of multiple trees
representing different organs or systems affected by drugs, and
different therapeutic and chemical properties of drugs. The ATC
codes associated with each drug are obtained from DrugBank.
For a given drug, we collect all its ATC codes from DrugBank
to build an ATC code vector (the most specific ATC codes
associated with the drug, i.e., leaves of the classification tree, and
also all the ancestor codes are included). The ATC based
similarity of two drugs is defined as the cosine similarity between
the IDF-weighted ATC code vectors of the two drugs, which
are computed in a similar way as the IDF-weighted mechanism of
action vectors.
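Including the ancestor codes can be sketched as follows. The sketch relies on the standard five-level layout of ATC codes (prefixes of 1, 3, 4, 5, and 7 characters); the function name is hypothetical, and the actual Tiresias code may derive ancestors differently, for example from the classification tree itself.

```python
# Character lengths of the five ATC levels, e.g. A / A10 / A10B / A10BA / A10BA02.
ATC_PREFIX_LENGTHS = (1, 3, 4, 5, 7)

def atc_with_ancestors(code):
    """Return an ATC code together with all of its ancestor codes."""
    return {code[:n] for n in ATC_PREFIX_LENGTHS if n <= len(code)}

def atc_terms(codes):
    """Terms of the ATC code vector of a drug that has one or more ATC codes."""
    terms = set()
    for code in codes:
        terms |= atc_with_ancestors(code)
    return terms
```

Two drugs whose leaf codes differ but share a third-level group then still overlap on the shared ancestor terms, which is what lets the cosine similarity reflect closeness in the hierarchy.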
Footnote 3: http://cdk.github.io/cdk/
MeSH based Similarity: DrugBank associates each drug with
a set of relevant MeSH [22] (Medical Subject Heading) terms.
The MeSH based similarity of two drugs is defined as the cosine
similarity between the IDF-weighted MeSH vectors of the two
drugs, which are computed in a similar way as IDF-weighted
mechanism of action vectors of drugs.
5.1.2. Global Similarity-based Features
Tiresias relies on another set of features to compare
two drugs. We use graph and word embedding techniques
(see Section 2) to get a vector representation for each drug.
Recall that these techniques minimize a global loss function that
considers all the facts in the dataset. Therefore, the obtained
vector representations for the entities capture global dependencies
in the data and inherit semantics that go beyond just
considering the direct neighbours. We describe below how we utilize
these techniques to define a new set of global similarity features
for comparing drugs.
Word Embedding-based Features: We exploit DailyMed (footnote 4)
and DrugBank (footnote 5) to construct a textual embedding for each
drug in order to obtain a numerical representation in the vector
space model, thus enabling a computational framework to compare
drugs in the learned embedding space.
To avoid any possible information leakage, we remove the
drug interaction information from each drug in each dataset.
Then, we use word2vec [12] on both the DailyMed and DrugBank
corpora to obtain a vector representation for each drug. We use
the Skip-gram architecture since it has been reported to yield more
accurate representations than CBOW. As a result, we have
two vector representations for each drug, one per
database. Then, we define a word-embedding based similarity
between a pair of drugs $(d_1, d_2)$, which is calculated as the
cosine similarity between the two vectors that correspond to
$d_1$ and $d_2$, respectively.
Graph Embedding-based Features: Similarly, we utilize
graph embedding techniques to learn a vector representation
of the drugs from our integrated knowledge graph (see
Figure 2). We use two graph embedding techniques: TransH [15]
and HolE [16]. For each method, we define a drug-drug similarity
measure that is calculated as the cosine similarity between
the corresponding drug vectors. We show in Section 7 the
effect of adding these features to Tiresias.
Most of the previously defined drug similarity measures rely
on both cosine similarity and IDF (to discount popular terms).
We have evaluated our system by replacing cosine by other
similarity metrics such as weighted Jaccard or soft cosine
similarity [32] (when components of the vectors are elements of a
taxonomical hierarchy, e.g., Mechanism of Action or Physiological
Effect) without any noticeable improvement of the quality
of our predictions. We have also tried using information-theoretic
means to discount popular terms (e.g., entropy based
weighting) instead of IDF without any noticeable improvement
of the quality of our predictions.
Footnote 4: https://dailymed.nlm.nih.gov/dailymed/
Footnote 5: http://www.drugbank.ca/releases/latest
The set of all drug similarity measures is denoted $SIM$. As
explained in the background section (Section 2), drug similarity
measures in $SIM$ need to be extended to produce drug-drug similarity
measures that compare two pairs of drugs (e.g., a pair of candidate
drugs against an already known interacting pair of drugs).
$SIM^2$ denotes the set of all drug-drug similarity measures
derived from $SIM$ as explained in Section 2.
5.2. Top-k Similarity-based Features
Like previous similarity-based approaches, for a given drug
candidate pair $(d_1, d_2)$, a set $KDDI_{train}$ of DDIs known at
training, and a drug-drug similarity measure $sim_1 \otimes sim_2 \in SIM^2$,
we create a similarity-based feature, denoted $abs_{sim_1 \otimes sim_2}$ and
computed as the similarity value between $(d_1, d_2)$ and the most
similar known interacting drug pair to $(d_1, d_2)$ in $KDDI_{train}$. In
other words,
$$abs_{sim_1 \otimes sim_2}(d_1, d_2) = \max(D_{sim_1 \otimes sim_2}(d_1, d_2))$$
where $D_{sim_1 \otimes sim_2}(d_1, d_2)$ is the set of all the similarity values
between $(d_1, d_2)$ and all known DDIs:
$$D_{sim_1 \otimes sim_2}(d_1, d_2) = \{\, sim_1 \otimes sim_2((d_1, d_2), (x, y)) \mid (x, y) \in KDDI_{train} - \{(d_1, d_2)\} \,\} \quad (1)$$
Note that these similarity-based features are computed using
only DDIs known at training (i.e., $KDDI_{train}$).
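A minimal Python sketch of this feature follows. How two drug similarity measures are combined into a drug-drug similarity measure is defined in Section 2, which is not reproduced here, so the geometric mean used in `pair_similarity` is an assumption of this sketch; the function names are hypothetical as well.

```python
import math

def pair_similarity(sim1, sim2, pair, known):
    """Similarity between a candidate pair and a known interacting pair.

    Assumption: sim1 compares the first drugs, sim2 the second drugs,
    and the two values are combined with a geometric mean.
    """
    (d1, d2), (x, y) = pair, known
    return math.sqrt(sim1(d1, x) * sim2(d2, y))

def similarity_distribution(sim1, sim2, pair, kddi_train):
    """All similarity values between the candidate and the known DDIs (equation (1))."""
    return [pair_similarity(sim1, sim2, pair, known)
            for known in kddi_train if known != pair]

def abs_feature(sim1, sim2, pair, kddi_train):
    """Similarity to the most similar known interacting pair (0.0 if none exists)."""
    return max(similarity_distribution(sim1, sim2, pair, kddi_train), default=0.0)
```

Note that the candidate pair itself is excluded from the known DDIs, so a pair that is already known to interact cannot trivially match itself.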
5.3. Calibration Features
Calibration of top-k similarity-based features: For a drug
candidate pair $(d_1, d_2)$, a high value of the similarity-based
feature $abs_{sim_1 \otimes sim_2}(d_1, d_2)$ is a clear indication of the presence of at
least one known interacting drug pair very similar to $(d_1, d_2)$
according to the drug-drug similarity measure $sim_1 \otimes sim_2$. However,
this feature value provides to the machine learning algorithm
only a limited view of the distribution $D_{sim_1 \otimes sim_2}(d_1, d_2)$
of all the similarity values between $(d_1, d_2)$ and all known DDIs
(see equation (1)).
For example, with access only to $\max(D_{sim_1 \otimes sim_2}(d_1, d_2))$,
there is no way to differentiate between a case where that maximum
value is a significant outlier (i.e., many standard deviations
away from the mean of $D_{sim_1 \otimes sim_2}(d_1, d_2)$) and the case where it
is not too far from the mean value of $D_{sim_1 \otimes sim_2}(d_1, d_2)$. Since
it would be impractical to have a feature for each data point
in $D$ (overfitting and scalability issues), we instead summarize
the distribution $D_{sim_1 \otimes sim_2}(d_1, d_2)$ by introducing the following
features to capture its mean and standard deviation:
$$avg_{sim_1 \otimes sim_2}(d_1, d_2) = \mathrm{mean}(D_{sim_1 \otimes sim_2}(d_1, d_2))$$
$$std_{sim_1 \otimes sim_2}(d_1, d_2) = \mathrm{stdev}(D_{sim_1 \otimes sim_2}(d_1, d_2))$$
To calibrate the absolute maximum value computed by
$abs_{sim_1 \otimes sim_2}(d_1, d_2)$, we introduce a calibration feature, denoted
$rel_{sim_1 \otimes sim_2}$, that corresponds to the z-score of the maximum
similarity value between the candidate and a known DDI (i.e., it
indicates the number of standard deviations $\max(D)$ is from the
mean of $D$):
$$rel_{sim_1 \otimes sim_2}(d_1, d_2) = \frac{abs_{sim_1 \otimes sim_2}(d_1, d_2) - avg_{sim_1 \otimes sim_2}(d_1, d_2)}{std_{sim_1 \otimes sim_2}(d_1, d_2)}$$
Finally, for a candidate pair $(d_1, d_2)$, we add a boolean feature,
denoted $con_{sim_1 \otimes sim_2}(d_1, d_2)$, that indicates whether the
most similar known interacting drug pair contains $d_1$ or $d_2$.
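The calibration features for one candidate pair can be sketched as below, given the similarity values to each known interacting pair. The text does not say whether the standard deviation is the population or the sample version, so the use of `statistics.pstdev` is an assumption, as is returning a z-score of 0.0 when the standard deviation is zero; the function name and dictionary keys are hypothetical.

```python
from statistics import mean, pstdev

def calibration_features(pair, scored_known_pairs):
    """Summarize the similarity distribution of one candidate pair.

    scored_known_pairs is a non-empty list of (known_pair, similarity) tuples.
    """
    values = [value for _, value in scored_known_pairs]
    best_pair, top = max(scored_known_pairs, key=lambda item: item[1])
    avg = mean(values)
    std = pstdev(values)  # assumption: population standard deviation
    return {
        "abs": top,
        "avg": avg,
        "std": std,
        # z-score of the maximum; 0.0 when all values are identical
        "rel": (top - avg) / std if std > 0 else 0.0,
        # does the most similar known pair contain d1 or d2?
        "con": pair[0] in best_pair or pair[1] in best_pair,
    }
```

For instance, a maximum of 0.7 among values that otherwise all equal 0.2 is a clear outlier, whereas the same maximum among values clustered around 0.65 is not; the `abs` feature is identical in both cases, but `rel` separates them.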
Calibration of drug-drug similarity measures: Features
described so far capture similarity values between a drug candidate
pair and known DDIs. As such, a high feature value for a
given candidate pair $(d_1, d_2)$ does not necessarily indicate that
the two drugs are likely to interact. For example, it could be
the case that, for a given drug-drug similarity measure, $(d_1, d_2)$
is actually very similar to most drug pairs (whether or not they
are known to interact). Likewise, a low feature value does not
necessarily indicate a reduced likelihood of drug-drug interaction
if $(d_1, d_2)$ has a very low similarity value with respect to
most drug pairs (whether or not they are known to interact).
In particular, such a low overall similarity between $(d_1, d_2)$ and
most drug pairs is often due to the incompleteness of the
similarity measures considered. For a drug-drug similarity measure
$sim_1 \otimes sim_2 \in SIM^2$ and a candidate pair $(d_1, d_2)$, we
introduce a new calibration feature, denoted $base_{sim_1 \otimes sim_2}$, to serve
as a baseline measurement of the average similarity
between the candidate pair $(d_1, d_2)$ and any other pair of drugs
(whether or not they are known to interact). The exact expression
of $base_{sim_1 \otimes sim_2}(d_1, d_2)$ is as follows:
$$base_{sim_1 \otimes sim_2}(d_1, d_2) = \frac{\sum_{(x,y) \neq (d_1,d_2) \,\wedge\, x \neq y} sim_1 \otimes sim_2((d_1, d_2), (x, y))}{|Drugs|(|Drugs| - 1)/2 - 1}$$
The evaluation of this expression is quadratic in the number
of drugs $|Drugs|$, which results in a significant runtime
performance degradation without any noticeable gain in the quality of
the predictions as compared to the following approximation of
$base_{sim_1 \otimes sim_2}$ (with a linear time complexity):
$$base_{sim_1 \otimes sim_2}(d_1, d_2) \approx hm\left(\frac{\sum_{x \neq d_1} sim_1(d_1, x)}{|Drugs| - 1},\ \frac{\sum_{y \neq d_2} sim_2(d_2, y)}{|Drugs| - 1}\right)$$
where $hm$ denotes the harmonic mean. In other words,
$base_{sim_1 \otimes sim_2}(d_1, d_2)$ is approximated as the harmonic mean of
1) the arithmetic mean of the similarity between $d_1$ and all other
drugs computed using $sim_1$, and 2) the arithmetic mean of the
similarity between $d_2$ and all other drugs computed using $sim_2$.
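The linear-time approximation can be sketched in Python as follows. The function names are hypothetical, and returning 0.0 when both means are zero is a convention chosen for this sketch.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers (0.0 if both are zero)."""
    return 2 * a * b / (a + b) if a + b > 0 else 0.0

def mean_similarity(sim, drug, drugs):
    """Arithmetic mean of the similarity between one drug and all other drugs."""
    others = [x for x in drugs if x != drug]
    return sum(sim(drug, x) for x in others) / len(others)

def base_feature(sim1, sim2, pair, drugs):
    """Linear-time approximation of the baseline calibration feature."""
    d1, d2 = pair
    return harmonic_mean(mean_similarity(sim1, d1, drugs),
                         mean_similarity(sim2, d2, drugs))
```

Each call scans the drug list twice, once per drug of the pair, instead of enumerating all drug pairs; the two per-drug means could also be cached and reused across candidate pairs.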
6. Dealing with Unbalanced Data
In evaluating any machine learning system, the testing data
should ideally be representative of the real data. In particular,
for our binary classifier that predicts whether a pair of drugs
interacts, the fraction of positive examples in the testing data
should be as close as possible to the prevalence, or fraction, of
DDIs in the set of all pairs of drugs. Although the ratio of
positive to negative examples in the testing data has limited impact
on the area under the ROC curve, as shown in the experimental
evaluation, it has a significant impact on other key quality
metrics more appropriate for skewed distributions (e.g., precision
and recall, F-score, and area under the precision-recall curve).
Unfortunately, the exact prevalence of DDIs in the set of all
drug pairs is unknown. Here, we provide upper and lower
bounds on the true prevalence of DDIs in the set of all drug
pairs. Then, we discuss logistic regression adjustments to deal
with the skewed distribution of DDIs.
Upper bound: The FDA Adverse Event Reporting System
(FAERS) is a database that contains information on adverse
events submitted to the FDA. It is designed to support the FDA's
post-marketing safety surveillance program for drugs and
therapeutic biological products. Mined from FAERS, TWOSIDES [33]
is a dataset containing only side effects caused by the
combination of drugs rather than by any single drug. When used as
the set of known DDIs, TWOSIDES [33] contains many false
positives, since some of its DDIs are observed in FAERS without
rigorous clinical validation. Thus, we use TWOSIDES to estimate the
upper bound of the DDI prevalence. There are 645 drugs and
63,473 distinct pairwise DDIs in the dataset, out of
645 × 644 / 2 = 207,690 possible drug pairs. Thus, the upper
bound of the DDI prevalence is about 30%.
Lower bound: We used a DDI data set from Gottlieb et
al. [2] to estimate the lower bound of the DDI prevalence.
The data set was extracted from DrugBank [17] and the
http://drugs.com website (excluding DDIs tagged as minor),
updated by Cerner Multum. DDIs in this data set are
extracted from drug package inserts (accurate but far from
complete), so the data set contains some false negatives.
There are 1,227 drugs and 74,104 distinct pairwise DDIs in the
dataset, out of 1,227 × 1,226 / 2 = 752,151 possible drug pairs.
Thus, the lower bound of the DDI prevalence is about
10%.
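Both bounds follow from dividing the number of known DDIs by the number of unordered drug pairs, $n(n-1)/2$. A short Python check of that arithmetic, using only the counts quoted above (the function name is hypothetical):

```python
from math import comb

def ddi_prevalence(num_drugs, num_ddis):
    """Fraction of all unordered drug pairs that are known to interact."""
    return num_ddis / comb(num_drugs, 2)

upper_bound = ddi_prevalence(645, 63473)    # TWOSIDES counts
lower_bound = ddi_prevalence(1227, 74104)   # Gottlieb et al. counts
```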
Modified logistic regression to handle unbalanced data:
For a given assumed low prevalence of DDIs $\tau_a$, it is often
advantageous to train our logistic regression classifier on a
training set with a higher fraction $\tau_t$ of positive examples and to later
adjust the model parameters accordingly. The main motivation
for this case-control sampling approach for rare events [34] is
to improve the runtime performance of the model building phase
since, for the same number of positive examples, the higher
fraction $\tau_t$ of positive examples yields a smaller total number of
examples at training. Furthermore, for an assumed prevalence
$\tau_a \leq 0.17$, the quality of the predictions is only marginally
affected by the use of a training set with a ratio of one positive
example to 5 negative examples (i.e., $\tau_t \approx 0.17$).
A logistic regression model with parameters $\beta_0, \beta_1, \ldots, \beta_n$
trained on a training sample with a prevalence of positive
examples of $\tau_t$ instead of $\tau_a$ is then converted into the final model
with parameters $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_n$ by correcting the intercept
$\hat{\beta}_0$ as indicated in [34]:
$$\hat{\beta}_0 = \beta_0 + \log \frac{\tau_a}{1 - \tau_a} - \log \frac{\tau_t}{1 - \tau_t}$$
The other parameters are unchanged: $\hat{\beta}_i = \beta_i$ for $i \geq 1$.
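The intercept correction amounts to shifting the intercept by the difference between the log-odds of the assumed prevalence and the log-odds of the training prevalence. A minimal Python sketch, with hypothetical function names:

```python
import math

def log_odds(p):
    """log(p / (1 - p)) for a prevalence p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def corrected_intercept(beta0, tau_a, tau_t):
    """Adjust an intercept learned at training prevalence tau_t
    to the assumed real prevalence tau_a; other coefficients are unchanged."""
    return beta0 + log_odds(tau_a) - log_odds(tau_t)
```

When the assumed prevalence is lower than the training prevalence, the correction is negative, so every predicted probability is lowered; when the two prevalences are equal, the intercept is left as is.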
We have tried more advanced adjustments for rare events
discussed in [34] (e.g., weighted logistic regression and ReLogit,
see footnote 6), but the overall improvement of the quality of our
predictions was only marginal.
7. Evaluation
To assess the quality of our DDI predictions, we perform two
types of experiments. First, we perform a retrospective analysis
that shows the ability of our system to discover valid but
previously unknown drug-drug interactions. Then, a 10-fold cross
validation is performed to assess the performance of Tiresias for
the newly developed and existing drugs scenarios. Furthermore, we
measure the predictive power of the individual local and global
features and show how they affect the system performance.
Finally, we evaluate the effect of adding the global embedding-based
features on the performance of Tiresias and the baseline.
Hardware Setup: We deployed Tiresias on a local cluster of 8
machines. Each machine is equipped with 512GB of RAM and
4 Intel Xeon E5-4650 CPUs of 2.4GHz; 10 cores each. The
machines run a 64-bit Redhat Linux and are interconnected by
a 1Gbps Ethernet switch.
Implementation Details: Tiresias is written entirely in Scala.
It uses the Apache Spark (v1.6) scalable machine learning library
(MLlib) for building the logistic regression model. MLlib
provides APIs that facilitate combining multiple algorithms into a
single pipeline. Figure 3 shows the Spark MLlib pipeline that
Tiresias uses at training and testing. In the training phase,
Tiresias receives as input our integrated knowledge graph, which
includes the set of DDIs as ground truth. This input goes
to the CrossValidator module, which is responsible for splitting
the data, calling the different similarity pipelines to generate
the features, building the classification model, and selecting the
best model. The CrossValidator module selects the best of the
eight models built in the model generation phase, namely the
model that produces the best F-score on the validation data. In
the testing phase, Tiresias gets as input the pair of drugs under
investigation, builds their feature vector, and consults the model
to check whether they interact or not. The output is the drug
pair and the reference similarity features, along with the
probability of their interaction per feature. As for runtime,
Tiresias finishes all the 10-fold cross validation
iterations in less than 2 hours.
Datasets Statistics: Our integrated knowledge graph consists
of 160K triples representing information about 2,600
approved drugs. Tiresias uses this graph for computing local
similarity features as well as for constructing the global
embedding-based similarity features using the TransH [15] and
HolE [16] graph embedding techniques. For word2vec-based
embedding features, we used the textual versions of DrugBank
and DailyMed. For DrugBank, we used the July 2016 release,
which is approximately 500MB of drug-associated textual
information. As for DailyMed, we used its full release, which
corresponds to around 17K parsed text files whose total size is
almost 2GB.
Footnote 6: http://gking.harvard.edu/relogit
Figure 3: Tiresias Spark MLLib Pipeline
Figure 4: Retrospective Evaluation: predictions using only known DDIs as of
2011. Tiresias correctly predicts up to 68% of the DDIs found after 2011.
(Plot: fraction of DDIs correctly predicted against DDI prevalence at
training/validation, from 0.1 to 0.3; series: with calibration, no calibration.)
7.1. Retrospective Analysis
We perform a retrospective evaluation using as the set of
known DDIs (KDDI) only pairs of interacting drugs present
in an earlier version of DrugBank (January 2011). For different
DDI prevalences at training/validation, Figure 4 shows the
fraction of the total of 713 DDIs added to DrugBank between
January 2011 and December 2014 that our approach can discover
based only on DDIs known in January 2011. Figure 4 shows that we
can correctly predict up to 68% of the DDIs discovered after January
2011, which demonstrates the ability of our system to discover
valid but previously unknown drug-drug interactions.
7.2. DDI Prediction Performance
In this section, we evaluate the DDI prediction performance
of Tiresias for the newly developed and existing drugs scenarios.
We begin by describing how the data is partitioned into
training/testing sets, followed by an analysis of the Tiresias
predictions.
Competitors: In our experiments, we compare against a baseline
system that is representative of existing similarity-based
DDI prediction methods. This baseline is a version of our
system that uses as input the same integrated knowledge graph
and utilizes the same set of local features discussed in Section
5. Notice that existing DDI prediction methods use features
similar to the set of local features that we have in Tiresias (see
Section 8 for details). For example, INDI [2] uses chemical-based,
side-effect based, and ATC-based similarities. Similarly,
[7] uses chemical fingerprints, while [6] uses side effects and
chemical structures. Notice that the baseline assumes 50% DDI
prevalence at training and does not include the calibration
features, the global embedding features, or the techniques for
handling unbalanced data distribution. Furthermore, as shown
previously in Section 1, these systems have one or more of the
following problems: inability to make predictions for newly
developed drugs, the assumption of a balanced data distribution,
usage of limited data sources, and usage of inappropriate
evaluation metrics. As shown in the next sections, Tiresias
significantly outperforms the baseline for DDI prediction in both
the existing and newly developed drug scenarios.
Figure 5: Evaluating Tiresias performance for the newly developed drugs scenario. Using calibration features with unbalanced training/validation data, Tiresias significantly outperforms the baseline.
Panels: (a) F-score for new drugs (all); (b) AUPR for new drugs (all); (c) F-score for new drugs (nocal); (d) AUPR for new drugs (nocal). In each panel, the x-axis is the DDI prevalence at testing (0.1 to 0.3) and the series are 10%, 20%, 30%, and 50% (baseline) train prevalence.
7.2.1. Data Partitioning
In the 10-fold cross validation of our approach, to simulate
the introduction of a newly developed drug for which no
interacting drugs are known, 10% of the drugs appearing as the first
element of a pair of drugs in the set $KDDI$ of all known drug
pairs are hidden, rather than hiding 10% of the drug-drug
relations as done in [2, 8, 7]. Since the drug-drug interaction
relation is symmetric, we consider, without loss of generality, only
drug candidate pairs $(d_1, d_2)$ where the canonical name of $d_1$ is
less than or equal to the canonical name of $d_2$ according to the
lexicographic order (i.e., $d_1 \leq d_2$). In particular, pairs of drugs
$(d_1, d_2)$ in $KDDI$ are such that $d_1 \leq d_2$. $Drugs_{test}$ denotes the set
of hidden drugs that act as the newly developed drugs for which
no DDIs are known at training or validation. This results in two
subsets of $KDDI$:
$$KDDI_{test} = \{(d_1, d_2) \mid d_1 \in Drugs_{test} \wedge d_2 \notin Drugs_{test} \wedge (d_1, d_2) \in KDDI\}$$
$$KDDI_x = \{(d_1, d_2) \mid d_1 \notin Drugs_{test} \wedge d_2 \notin Drugs_{test} \wedge (d_1, d_2) \in KDDI\}$$
$KDDI_{test}$ is the set of known interacting pairs to use at testing
(positive examples). Likewise, $KDDI_x$ is further split into
validation and training sets by hiding 10% of the drugs appearing as
the first element of a pair of drugs in $KDDI_x$ ($Drugs_{val}$ denotes
this set of hidden drugs that act as the drugs for which no DDIs are
known at training). This results in two subsets of $KDDI_x$:
$$KDDI_{val} = \{(d_1, d_2) \mid d_1 \in Drugs_{val} \wedge d_2 \notin Drugs_{val} \wedge (d_1, d_2) \in KDDI_x\}$$
$$KDDI_{train} = \{(d_1, d_2) \mid d_1 \notin Drugs_{val} \wedge d_2 \notin Drugs_{val} \wedge (d_1, d_2) \in KDDI_x\}$$
$KDDI_{val}$ corresponds to the set of known interacting
drugs used in the model validation phase, and $KDDI_{train}$ is the
set of known interacting drugs used at training.
The training data set consists of (i) known interacting drugs
in $KDDI_{train}$ as positive examples, and (ii) randomly generated
pairs of drugs $(d_1, d_2)$ not already known to interact (i.e., not in
$KDDI$) such that the drugs $d_1$ and $d_2$ appear in $KDDI_{train}$ (as
negative examples).
The validation data set consists of (i) the known interacting
drug pairs in $KDDI_{val}$ as positive examples, and (ii) negative
examples that are randomly generated pairs of drugs $(d_1, d_2)$
not already known to interact (i.e., not in $KDDI$) such that $d_1$
is the first drug in at least one pair in $KDDI_{val}$ (i.e., a drug only
seen at validation but not at training) and $d_2$ appears (as first
or second element) in at least one pair in $KDDI_{train}$ (i.e., $d_2$ is
known at training).
The testing data set consists of (i) the known interacting
drug pairs in $KDDI_{test}$ as positive examples, and (ii) negative
examples that are randomly generated pairs of drugs $(d_1, d_2)$
not already known to interact (i.e., not in $KDDI$) such that
$d_1$ is the first drug in at least one pair in $KDDI_{test}$ (i.e., a
drug only seen at testing but not at training or validation) and
$d_2$ appears (as first or second element) in at least one pair in
$KDDI_{train} \cup KDDI_{val}$ (i.e., $d_2$ is known at training or at
validation).
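The nested drug-level split can be sketched in Python as follows. This is an illustrative sketch rather than the Tiresias code: the function name, the rounding of the 10% fraction, and the use of `random.Random` for sampling are assumptions, and negative example generation is omitted.

```python
import random

def split_by_hidden_drugs(kddi, fraction, rng):
    """Hide a fraction of the drugs that appear first in some known pair.

    Returns (hidden drugs, held-out pairs, remaining pairs). Held-out pairs
    have a hidden first drug and a visible second drug; remaining pairs
    contain no hidden drug. Pairs whose second drug is hidden but whose first
    drug is not are dropped, following the set definitions in the text.
    """
    first_drugs = sorted({d1 for d1, _ in kddi})
    k = max(1, round(fraction * len(first_drugs)))  # assumption: round, at least one
    hidden = set(rng.sample(first_drugs, k))
    held_out = {(d1, d2) for d1, d2 in kddi if d1 in hidden and d2 not in hidden}
    rest = {(d1, d2) for d1, d2 in kddi if d1 not in hidden and d2 not in hidden}
    return hidden, held_out, rest

def partition(kddi, rng, fraction=0.1):
    """Split known DDIs into training, validation, and testing positives."""
    drugs_test, kddi_test, kddi_x = split_by_hidden_drugs(kddi, fraction, rng)
    drugs_val, kddi_val, kddi_train = split_by_hidden_drugs(kddi_x, fraction, rng)
    return drugs_test, drugs_val, kddi_train, kddi_val, kddi_test
```

A call such as `partition(kddi, random.Random(0))` expects `kddi` to be a set of `(d1, d2)` tuples with `d1 <= d2`, and guarantees that no training pair mentions a drug hidden for validation or testing.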
7.2.2. DDI Prediction for Newly Developed Drugs
Contrary to prior work, in our evaluation, the ratio of positive
examples to randomly generated negative examples is not 1 to
1. Instead, the assumed prevalence of DDIs at training and
validation is the same and is in the set {10%, 20%, 30%, 50%}. DDI
prevalence is a quantitative measure of the percentage of DDIs
existing in the dataset. For example, 20% DDI train prevalence
corresponds to a training dataset in which 20% of all drug pairs
are interacting. Similarly, 10% DDI prevalence at testing means
that 10% of the drug pairs in the testing dataset are positive
examples. For a given DDI prevalence at training and
validation, we evaluate the quality of our predictions on
testing data sets with varying prevalence of DDIs (ranging from
10% to 30%). 50% DDI prevalence at training and validation
is used here to assess the quality of prior work (which relies on a
balanced distribution of positive and negative examples at
training) when the testing data is unbalanced.
Figure 6: Evaluating Tiresias performance for existing drugs scenario. For a fixed DDI prevalence at training/validation, using calibration features is always better. [Panels: (a) F-score for existing drugs (all features); (b) AUPR for existing drugs (all features); (c) F-score for existing drugs (no calibration features); (d) AUPR for existing drugs (no calibration features). Each panel plots the metric against DDI prevalence at testing (0.1 to 0.3) for 10%, 20%, 30%, and 50% (baseline) train prevalence.]
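To make the prevalence definition concrete, here is a small sketch of how many negative pairs would have to be sampled for a given number of positive pairs so that the data set reaches a target prevalence. The helper name `negatives_needed` and the example count of 100 positives are hypothetical; the relation itself follows directly from prevalence = positives / (positives + negatives).

```python
def negatives_needed(n_pos, prevalence):
    """Number of negatives so that n_pos / (n_pos + n_neg) is about `prevalence`."""
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be strictly between 0 and 1")
    return round(n_pos * (1 - prevalence) / prevalence)

# Prevalence levels used in the evaluation, applied to 100 positive pairs.
for p in (0.10, 0.20, 0.30, 0.50):
    n_neg = negatives_needed(100, p)
    realized = 100 / (100 + n_neg)
    print(f"target {p:.0%}: {n_neg} negatives, realized prevalence {realized:.3f}")
```

At 50% prevalence this reduces to the balanced 1-to-1 setting of prior work, while 10% prevalence requires nine negatives per positive.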
For a given assumed DDI prevalence at training/validation
and a DDI prevalence at testing, to get robust results and show
the effectiveness of our calibration-based features, we perform
not one, but five 10-fold cross validations with all the features
described in Section 5 (see Figures 5(a) and 5(b)) and five
10-fold cross validations without calibration features (see
Figures 5(c) and 5(d)). Results reported in Figure 5 represent
the average over the five 10-fold cross validations.
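The repeated cross-validation protocol can be sketched as follows. This is a generic illustration, not the authors' code: `evaluate` is a hypothetical callback that trains on one part and scores on the other, and the sketch splits a flat list of examples, whereas the paper hides drugs (this section) or drug-drug associations (Section 7.2.3).

```python
import random
from statistics import mean

def kfold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_kfold_mean(examples, evaluate, k=10, repeats=5, seed=0):
    """Average evaluate(train, test) over `repeats` independent k-fold runs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for held_out in kfold_indices(len(examples), k, rng):
            held = set(held_out)
            train = [e for i, e in enumerate(examples) if i not in held]
            test = [examples[i] for i in held_out]
            scores.append(evaluate(train, test))
    return mean(scores), len(scores)

# Placeholder metric: the fraction of examples held out in each fold.
avg, n_runs = repeated_kfold_mean(
    list(range(100)), lambda train, test: len(test) / (len(train) + len(test)))
```

With five repetitions of 10 folds, each reported number is an average over 50 train/test runs.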
The key results from our evaluation are as follows:
• Regardless of the DDI prevalence used at training and
  validation (provided that it is between 10% and 30%, i.e., the
  lower and upper bounds of the true prevalence of DDIs
  over the set of all drug pairs), our approach using
  calibration features (solid lines in Figures 5(a) and 5(b))
  and unbalanced training/validation data (non-black lines)
  significantly outperforms the baseline representing prior
  similarity-based approaches (e.g., [2]) that rely on
  balanced training data without calibration-based features (the
  dotted black line with crosses as markers). For an
  assumed DDI prevalence at training ranging from 10% to
  50%, the average F-score (resp. AUPR) over testing data
  with prevalence between 10% and 30% varies from
  0.73 to 0.74 (resp. 0.821 to 0.825) when all features are
  used. However, when calibration features are not used and
  the training is done on balanced data, the average F-score
  (resp. AUPR) over testing data with prevalence between
  10% and 30% is 0.65 (resp. 0.78) (see footnote 7). The
  difference from the baseline grows as the testing data
  distribution becomes more skewed.
• For a fixed DDI prevalence at training/validation, using
  calibration features is always better in terms of F-score
  or AUPR (compare Figures 5(a) and 5(b) with Figures 5(c)
  and 5(d), which show the F-score and AUPR values of Tiresias
  with and without calibration features, respectively).
• As pointed out in prior work, the area under the ROC curve
  (AUROC) is not affected by the prevalence of DDIs at
  training/validation or testing. It remains constant at about 0.92
  with calibration features and 0.90 without calibration
  features.
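The last observation can be illustrated with a toy example. The sketch below uses made-up scores (not results from the paper) and only varies the prevalence at testing: replicating the negative examples lowers the prevalence, leaves the AUROC unchanged, and lowers precision at a fixed threshold, which is why precision-based metrics such as F-score and AUPR are the ones that react to skew.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

def precision_at(pos_scores, neg_scores, threshold):
    """Precision when every score >= threshold is predicted as an interaction."""
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    return tp / (tp + fp)

# Hypothetical classifier scores for interacting and non-interacting pairs.
pos = [0.9, 0.8, 0.6, 0.4]
neg = [0.7, 0.5, 0.3, 0.2]

balanced = (auroc(pos, neg), precision_at(pos, neg, 0.5))        # 50% prevalence
skewed = (auroc(pos, neg * 9), precision_at(pos, neg * 9, 0.5))  # 10% prevalence
```

Here `balanced` and `skewed` share the same AUROC, while the precision in `skewed` is markedly lower.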
[Footnote 7] Precision (resp. recall) varies from 0.84 to 0.70 (resp. 0.66 to 0.78) with
calibration features and unbalanced training set. Precision (resp. recall) is at
0.54 (resp. 0.84) on balanced training without calibration.
Table 1: Effect of adding embedding-based features to Tiresias at 10% DDI prevalence at training and testing
Features                                  | Precision | Recall | F-score | ROC   | AUPR
Individual Features
Tiresias - Local Features (TLF)           | 0.815     | 0.812  | 0.813   | 0.974 | 0.887
Global Features - HolE only               | 0.785     | 0.766  | 0.775   | 0.961 | 0.841
Global Features - DailyMed-word2vec only  | 0.585     | 0.641  | 0.611   | 0.885 | 0.611
Global Features - DrugBank-word2vec only  | 0.381     | 0.593  | 0.463   | 0.852 | 0.412
Combination of 2 Features
TLF + DrugBank-word2vec                   | 0.817     | 0.815  | 0.816   | 0.975 | 0.890
TLF + DailyMed-word2vec                   | 0.827     | 0.817  | 0.822   | 0.977 | 0.895
TLF + TransH                              | 0.832     | 0.821  | 0.826   | 0.976 | 0.896
TLF + HolE                                | 0.860     | 0.835  | 0.847   | 0.981 | 0.918
Combination of 3 Features
TLF + TransH + DailyMed-word2vec          | 0.841     | 0.820  | 0.830   | 0.977 | 0.901
TLF + HolE + DailyMed-word2vec            | 0.871     | 0.831  | 0.850   | 0.980 | 0.917
TLF + HolE + DrugBank-word2vec            | 0.867     | 0.832  | 0.849   | 0.980 | 0.917
TLF + HolE + DailyMed + TransH            | 0.866     | 0.838  | 0.851   | 0.981 | 0.919
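As a quick consistency check on Table 1, and assuming the reported F-score is the usual harmonic mean of precision and recall (this section does not restate the definition), the tabulated F-scores can be reproduced from the precision and recall columns. The reported values are averages over cross-validation runs, so small rounding differences on other rows would not be surprising.

```latex
F = \frac{2 \, P \, R}{P + R}

F_{\mathrm{HolE}} = \frac{2 \times 0.785 \times 0.766}{0.785 + 0.766}
                  = \frac{1.20262}{1.551} \approx 0.775

F_{\mathrm{TLF+HolE}} = \frac{2 \times 0.860 \times 0.835}{0.860 + 0.835}
                      = \frac{1.4362}{1.695} \approx 0.847
```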
7.2.3. DDI Prediction for Existing Drugs
We also perform 10-fold cross validation evaluations hiding
drug-drug associations instead of drugs. Figures 6(a) and 6(b)
show the F-score and AUPR values of Tiresias with calibration
features, and Figures 6(c) and 6(d) show them without
calibration features. The results in
Figure 6 show that, even when predictions are made only on
drugs with some known interacting drugs, the combination of
unbalanced training/validation data and calibration features
remains superior to the baseline.
7.3. Individual Features Prediction Power
In this experiment, we test the effect of using each local
feature individually on the overall performance. Our experiments
show that no similarity measure by itself has good predictive
power. We found that ATC-based similarity is the best, with 0.58
F-score and 0.56 AUPR. Removing any given local similarity
measure has limited impact on the quality of the predictions.
The greatest decrease was 1% in the F-score and AUPR
values, after removing ATC-based similarity.
We also tested the prediction power of each global
embedding-based similarity feature individually (see the upper part
of Table 1) and compared it against Tiresias using only the Local
Features (TLF) discussed in Section 5. As the table shows, the
HolE-based graph embedding feature is the most powerful
individual feature; on its own it achieves an F-score
of 77.5%. However, the HolE-based feature is still inferior
to Tiresias using all local features, which has an F-score
of 81.3%. Moreover, the word2vec embeddings, whether based on
DailyMed or DrugBank, did not perform well by themselves
compared to HolE or the local features. Consequently, we
believe that the good prediction performance of Tiresias results
from combining all the features together rather than from any
individual feature (see Section 7.4).
7.4. Combining Local and Global Embedding Features
In this experiment, we measure the effect of adding the global
embedding-based features to Tiresias. These features include
both word and graph embedding features as discussed in
Section 5. Specifically, we compare against TLF, which is Tiresias
using all our supervised local features. Table 1 shows the effect
of adding each embedding-based feature individually, followed
by combining multiple features together. The word2vec
features yield a modest improvement over the TLF F-score
of 0.3% and 1% for DrugBank and DailyMed,
respectively. On the other hand, using graph embedding-based
features improved the performance of TLF significantly. TransH
improved the F-score of TLF by 1.3% while HolE
increased it by 3.4%. We also tested the effect of
combining these features together. The lower part of the table
shows Tiresias performance when we combine word and graph
embedding features, which generally gives better
performance than using these features individually. The best
performance is obtained when we combine DailyMed word2vec
with HolE and TransH graph embeddings, which gains almost
a 4% F-score improvement over Tiresias using only local
features.
We also show how the DDI prevalence ratio affects the
performance of Tiresias when using the embedding features. In
this experiment, we use all supervised features in addition to
DailyMed word2vec and HolE graph embedding. This version
is called Tiresias using both Local and Global Features (TLGF).
Figure 7 shows how Tiresias performance varies as we change
the DDI prevalence at training and testing. We use the lower-
and upper-bound values of DDI prevalence at training used in
Figures 5 and 6, which are 10% and 50%, respectively. As
Figure 7 demonstrates, introducing the embedding features
significantly improves the performance of both TLF and the
baseline. For example, at 10% DDI prevalence at testing, the
F-score of Tiresias increased from 0.81 with local features only
(TLF) to 0.85 with both local and global features (TLGF), and
AUPR increased from 0.88 to 0.91. Similarly, the F-score of the
baseline improved from 0.69 to 0.74 while its AUPR improved
from 0.87 to 0.91.
Figure 7: Effect of DDI prevalence ratio on Tiresias when using local and global embedding-based similarity features. At the lower and upper values for training prevalence, embedding features significantly improve the performance of both the baseline and TLF. [Panels: (a) F-score for existing drugs; (b) AUPR for existing drugs. Each panel plots the metric against DDI prevalence at testing (0.1 to 0.3) for 10% train prevalence (TLF), 10% train prevalence (TLGF), 50% train prevalence (baseline), and 50% train prevalence (baseline + G. Embed).]
7.5. Discussion
In this study, we proposed Tiresias, a large-scale system for
predicting DDIs. Our results show that Tiresias is effective in
predicting new interactions among existing as well as newly
developed drugs. We summarize our main findings below:
• For predicting DDIs among newly developed drugs and
  using only known DDIs before 2011 for training, Tiresias
  was able to correctly predict 68% of the DDIs found after
  2011.
• We have also shown how the assumption of a balanced
  distribution of interacting drug pairs severely affects the
  performance. This shows the importance and the
  effectiveness of the proposed calibration features and the
  techniques we proposed for handling unbalanced datasets.
• Our experiments show that Tiresias outperforms existing
  systems by up to 11% for the existing drugs scenario and
  by up to 15% for the newly developed drugs scenario.
• We also show that the high prediction power of Tiresias
  comes from the combination of the various proposed local
  and global similarity features and not from any individual
  feature.
• Finally, we show that our newly introduced global
  embedding-based similarity features could enhance the
  performance further, increasing the F-score by 4% and 5%
  over Tiresias using local features only (TLF)
  and the baseline, respectively.
In the current version of Tiresias, we only predicted whether
two drugs interact or not without providing further information
about the type of interaction. In future work, we plan to extend
our model to give detailed predictions of the nature, severity,
and cause of DDIs. We also plan to investigate the possibility of
identifying subsets of features that characterize DDIs of certain
groups of drugs.
8. Related Work
In this section, we briefly review existing computational
approaches for predicting DDIs. We discuss mainly
feature vector-based and similarity-based DDI prediction methods.
Note that all the methods discussed in this section have one or
more of the following shortcomings (see Section 1 for details):
(i) inability to make predictions for newly developed drugs,
(ii) ignoring the skewed distribution of interacting drug pairs,
(iii) discarding many relevant data sources, and (iv) usage of
inappropriate evaluation metrics.
8.1. Direct feature vector-based approaches
The inputs of general machine learning methods are
instances, which can be represented by feature vectors. In our
setting, instances are pairs of drugs, and their feature vectors
can be generated by directly combining features of two drugs
(e.g., chemical descriptors of two drugs). With these inputs,
any standard machine learning method (e.g., logistic regression,
support vector machines) can be used to build models for
predicting drug-drug interactions. Luo et al. [3], for example,
propose a feature vector-based DDI prediction server that makes
real-time DDI predictions based only on molecular structure.
Given the molecular structure of a drug d, the server docks it
across 611 human proteins to calculate a docking score of the
molecule to each human protein target. This produces a
611-dimensional docking vector v(d). The feature vector associated
with a pair of drugs (d1, d2) is then computed as the
concatenation of the two vectors v(d1) + v(d2) and |v(d1) - v(d2)| to
produce a 1222-dimensional vector (here, for a vector x, |x| denotes
the vector obtained by taking the absolute value of each
component of x). Finally, a logistic regression model is built based
on these features for DDI predictions. The model can suggest
potential DDIs between a user's molecule and a library of 2515
drug molecules.
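The pair feature construction described above can be sketched in a few lines. The function name `pair_features` and the toy two-component vectors are illustrative assumptions; only the sum and absolute-difference concatenation comes from the description of [3].

```python
def pair_features(v1, v2):
    """Concatenate v1 + v2 and |v1 - v2| (both taken component-wise)."""
    if len(v1) != len(v2):
        raise ValueError("docking vectors must have the same length")
    summed = [a + b for a, b in zip(v1, v2)]
    abs_diff = [abs(a - b) for a, b in zip(v1, v2)]
    return summed + abs_diff

# Toy 2-dimensional "docking vectors" (hypothetical values).
features = pair_features([1.0, -2.0], [3.0, 0.5])

# With 611-dimensional docking vectors the result has 2 * 611 = 1222 components.
full_size = len(pair_features([0.0] * 611, [0.0] * 611))
```

Both halves are symmetric in the two drugs, so the feature vector does not depend on the order in which the pair is given.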
8.2. Similarity-based DDI prediction approaches
Similarity-based approaches [2, 6, 7] try to predict whether a
candidate pair of drugs interacts by comparing it against known
interacting pairs of drugs. Finding known interacting drugs
that are very similar to the candidate pair provides supporting
evidence for the existence of a drug-drug interaction between
the two candidate drugs. INDI (INferring Drug Interactions)
[2] has three phases: construction of drug-drug similarity
measures, building classification features, and applying a classifier
to predict new DDIs. INDI used seven drug-drug similarity
measures, including chemical similarity, similarities based on
registered and predicted side effects, the Anatomical Therapeutic
Chemical (ATC) classification system, and three similarity
measures constructed between drug targets. These seven measures
are then combined pairwise into 49 (7 × 7) features that calculate
the maximum similarity between the query drug pair and all the
known DDIs existing in the database. Vilar et al. [7] proposed a
similarity-based modeling protocol that uses several similarity
measures to predict novel DDIs. These measures include structure
similarity, interaction profile fingerprints, 3D pharmacophoric
similarity, drug-target similarity, and adverse drug effects
similarity. This proposed protocol is a multi-type predictor that can
isolate the pharmacological or clinical effect associated with
the predicted interactions. Zhang et al. [6] proposed an
integrative label propagation framework to predict DDIs. It
integrates multiple similarity measures, including side
effects extracted from prescription drugs, side effects extracted
from the FDA Adverse Event Reporting System, and chemical
structures from PubChem. In addition to predicting DDIs, their
proposed method is also able to rank drug information sources
based on their contributions to the prediction.
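The general idea of turning drug-drug similarity measures into pair-level features can be sketched as follows. This is a simplified illustration, not INDI's actual code: combining the two similarities with a geometric mean, skipping the query pair itself, and the toy similarity functions and drug names are all assumptions made for the example.

```python
from itertools import product
from math import sqrt

def pair_similarity_features(query, known_ddis, measures):
    """One feature per ordered pair of measures: the best match among known DDIs."""
    d1, d2 = query
    features = []
    for sim_a, sim_b in product(measures, repeat=2):
        best = 0.0
        for k1, k2 in known_ddis:
            if (k1, k2) == (d1, d2):
                continue  # do not let a pair support itself
            # Assumed combination rule: geometric mean of the two similarities.
            best = max(best, sqrt(sim_a(d1, k1) * sim_b(d2, k2)))
        features.append(best)
    return features

# Two toy similarity measures with values in [0, 1] (hypothetical).
def sim_exact(a, b):
    return 1.0 if a == b else 0.0

def sim_initial(a, b):
    return 1.0 if a[0] == b[0] else 0.25

known = [("aspirin", "warfarin"), ("ibuprofen", "lithium")]
feats = pair_similarity_features(("acetaminophen", "warfarin"), known,
                                 [sim_exact, sim_initial])
```

With two measures the sketch yields 2 × 2 = 4 features; with seven measures it would yield the 49 features mentioned above.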
9. Conclusion
In this paper, we proposed Tiresias, a large-scale
computational framework that predicts DDIs through similarity-based
link prediction. Tiresias addresses the limitations of
existing approaches by: (i) utilizing information from various data
sources, (ii) using a larger set of local and global similarity
features, (iii) handling data skewness and the incompleteness of
similarity measures, and (iv) being able to make DDI predictions for
existing drugs as well as newly developed drugs. We
extensively evaluated Tiresias using real datasets to assess its
performance. Experimental results clearly show the effectiveness
of Tiresias in predicting new interactions among both newly
developed and existing drugs. They also show that the
combination of locally and globally generated drug similarity features
improves the performance of Tiresias significantly. The
predictions provided by Tiresias will help clinicians avoid
hazardous DDIs in their prescriptions and will aid pharmaceutical
companies in designing large-scale clinical trials by assessing
potentially hazardous drug combinations.
10. References
[1] D. Flockhart, P. Honig, S. Yasuda, C. Rosebraugh, Preventable adverse
drug reactions: A focus on drug interactions, Centers for Education &
Research on Therapeutics 452.
[2] A. Gottlieb, G. Y. Stein, Y. Oron, E. Ruppin, R. Sharan, Indi: a com-
putational framework for inferring drug interactions and their associated
recommendations, Molecular systems biology 8 (1) (2012) 592.
[3] H. Luo, P. Zhang, H. Huang, J. Huang, E. Kao, L. Shi, L. He, L. Yang,
Ddi-cpi, a server that predicts drug-drug interactions through implement-
ing the chemical-protein interactome, Nucleic Acids Research 42 (2014)
W46–W52.
[4] P. Zhang, P. Agarwal, Z. Obradovic, Computational drug repositioning by
ranking and integrating multiple data sources, in: Machine Learning and
Knowledge Discovery in Databases, Springer, 2013, pp. 579–594.
[5] P. Zhang, F. Wang, J. Hu, R. Sorrentino, Towards personalized medicine:
Leveraging patient similarity and drug similarity analytics, AMIA Sum-
mits on Translational Science Proceedings 2014 (2014) 132.
[6] P. Zhang, F. Wang, J. Hu, R. Sorrentino, Label propagation prediction of
drug-drug interactions based on clinical side effects, Scientific reports 5
(2015) 12339.
[7] S. Vilar, E. Uriarte, L. Santana, T. Lorberbaum, G. Hripcsak, C. Fried-
man, N. P. Tatonetti, Similarity-based modeling in large-scale prediction
of drug-drug interactions, Nature protocols 9 (9) (2014) 2147–2163.
[8] S. Vilar, E. Uriarte, L. Santana, N. P. Tatonetti, C. Friedman, Detection of
drug-drug interactions by modeling interaction profile fingerprints, PloS
one 8 (3) (2013) 1–11.
[9] J. Davis, M. Goadrich, The relationship between precision-recall and roc
curves, in: Proceedings of the 23rd international conference on Machine
learning, ACM, 2006, pp. 233–240.
[10] A. Fokoue, M. Sadoghi, O. Hassanzadeh, P. Zhang, Predicting drug-drug
interactions through large-scale similarity-based link prediction, in: In-
ternational Semantic Web Conference, Springer, 2016, pp. 774–789.
[11] A. Fokoue, O. Hassanzadeh, M. Sadoghi, P. Zhang, Predicting drug-drug
interactions through similarity-based link prediction over web data, in:
Proceedings of the 25th International Conference on World Wide Web,
WWW 2016, Montreal, Canada, April 11-15, 2016, Companion Volume,
2016, pp. 175–178.
[12] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word
representations in vector space, arXiv preprint arXiv:1301.3781.
[13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed
representations of words and phrases and their compositionality, in: Ad-
vances in neural information processing systems, 2013, pp. 3111–3119.
[14] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko,
Translating embeddings for modeling multi-relational data, in: Advances
in Neural Information Processing Systems, 2013, pp. 2787–2795.
[15] Z. Wang, J. Zhang, J. Feng, Z. Chen, Knowledge graph embedding by
translating on hyperplanes, in: AAAI, Citeseer, 2014, pp. 1112–1119.
[16] M. Nickel, L. Rosasco, T. Poggio, Holographic embeddings of knowledge
graphs, arXiv preprint arXiv:1510.04935.
[17] C. Knox, V. Law, T. Jewison, P. Liu, S. Ly, A. Frolkis, A. Pon, K. Banco,
C. Mak, V. Neveu, et al., Drugbank 3.0: a comprehensive resource for
’omics’ research on drugs, Nucleic acids research 39 (suppl 1) (2011)
D1035–D1041.
[18] A. P. Davis, C. G. Murphy, C. A. Saraceni-Richards, M. C. Rosenstein,
T. C. Wiegers, C. J. Mattingly, Comparative toxicogenomics database: a
knowledgebase and discovery tool for chemical–gene–disease networks,
Nucleic acids research 37 (suppl 1) (2009) D786–D792.
[19] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker, B. Boeckmann,
S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, et al., Uniprot:
the universal protein knowledgebase, Nucleic acids research 32 (suppl 1)
(2004) D115–D119.
[20] A. Chatr-aryamontri, B.-J. Breitkreutz, R. Oughtred, L. Boucher,
S. Heinicke, D. Chen, C. Stark, A. Breitkreutz, N. Kolas, L. O’Donnell,
et al., The BioGRID interaction database: 2015 update, Nucleic acids
research 43 (D1) (2015) D470–D478.
[21] O. Bodenreider, The unified medical language system (UMLS): integrat-
ing biomedical terminology, Nucleic acids research 32 (suppl 1) (2004)
D267–D270.
[22] C. E. Lipscomb, Medical subject headings (mesh), Bulletin of the Medical
Library Association 88 (3) (2000) 265.
[23] S. H. Brown, P. L. Elkin, S. Rosenbloom, C. Husser, B. Bauer, M. Lin-
coln, J. Carter, M. Erlbaum, M. Tuttle, VA National Drug File Reference
Terminology: a cross-institutional content coverage study, Medinfo 11 (Pt
1) (2004) 477–81.
[24] M. Sadoghi, K. Srinivas, O. Hassanzadeh, Y. Chang, M. Canim,
A. Fokoue, Y. A. Feldman, Self-curating databases, in: Proceedings of the
19th International Conference on Extending Database Technology, EDBT
2016, Bordeaux, France, March 15-16, 2016, 2016, pp. 467–472.
[25] A. Chandel, O. Hassanzadeh, N. Koudas, M. Sadoghi, D. Srivastava,
Benchmarking declarative approximate selection predicates, in: ACM
SIGMOD International Conference on Management of Data, SIGMOD
’07, 2007, pp. 353–364.
[26] M. Kuhn, M. Campillos, I. Letunic, L. J. Jensen, P. Bork, A side effect
resource to capture phenotypic effects of drugs, Molecular systems biology
6 (1) (2010) 343.
[27] P. Resnik, et al., Semantic similarity in a taxonomy: An information-
based measure and its application to problems of ambiguity in natural
language, J. Artif. Intell. Res.(JAIR) 11 (1999) 95–130.
[28] K. Ovaska, M. Laakso, S. Hautaniemi, Fast gene ontology based cluster-
ing for microarray experiments, BioData mining 1 (1) (2008) 11.
[29] C. Steinbeck, C. Hoppe, S. Kuhn, M. Floris, R. Guha, E. L. Willighagen,
Recent developments of the chemistry development kit (cdk)-an open-
source java library for chemo-and bioinformatics, Current pharmaceutical
design 12 (17) (2006) 2111–2120.
[30] C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann, E. Willigha-
gen, The chemistry development kit (cdk): An open-source java library
for chemo-and bioinformatics, Journal of chemical information and com-
puter sciences 43 (2) (2003) 493–500.
[31] A. Skrbo, B. Begović, S. Skrbo, [classification of drugs using the atc
system (anatomic, therapeutic, chemical classification) and the latest
changes]., Medicinski arhiv 58 (1 Suppl 2) (2003) 138–141.
[32] G. Sidorov, A. Gelbukh, H. Gómez-Adorno, D. Pinto, Soft similarity and
soft cosine measure: Similarity of features in vector space model,
Computación y Sistemas 18 (3) (2014) 491–504.
[33] N. P. Tatonetti, P. P. Ye, R. Daneshjou, R. B. Altman, Data-driven
prediction of drug effects and interactions, Science translational medicine
4 (125) (2012) 125ra31–125ra31.
[34] G. King, L. Zeng, Logistic regression in rare events data, Political Anal-
ysis 9 (2) (2001) 137–163.
17
... Knowledge graph embedding (KGE) is a technique that represents entities and relations within a KG in a continuous vector space [1]. This technique is conducive to numerous downstream applications such as KG completion [2], disease diagnosis [3,4], recommender systems [5], and question-answering systems [6]. With the promulgation of general data protection regulation (GDPR) [7], KGs from multiple sources are no longer stored centrally on a single device as a complete KG. ...
... The motivation behind it is to preserve the structure information and underlying semantic information of the KG [2,[26][27][28][29]. The learned embedding vectors by KGE models can be effectively applied in many downstream tasks, such as disease diagnosis [3,4], question answering [6], recommendation system [5,30,31] and knowledge graph completion [32]. ...
... We first investigate the impact of batch size and local epochs on the performance of PFedEG+ on FB15k237-Fed5, with RotatE as the KGE method. Three different local epoch (1,3,5), are chosen, along with three different batch sizes (64, 256, 512). The MRR values of PFedEG+ on validation set are presented in Figure 4a and Figure 4b. ...
Preprint
Federated Knowledge Graph Embedding (FKGE) has recently garnered considerable interest due to its capacity to extract expressive representations from distributed knowledge graphs, while concurrently safeguarding the privacy of individual clients. Existing FKGE methods typically harness the arithmetic mean of entity embeddings from all clients as the global supplementary knowledge, and learn a replica of global consensus entities embeddings for each client. However, these methods usually neglect the inherent semantic disparities among distinct clients. This oversight not only results in the globally shared complementary knowledge being inundated with too much noise when tailored to a specific client, but also instigates a discrepancy between local and global optimization objectives. Consequently, the quality of the learned embeddings is compromised. To address this, we propose Personalized Federated knowledge graph Embedding with client-wise relation Graph (PFedEG), a novel approach that employs a client-wise relation graph to learn personalized embeddings by discerning the semantic relevance of embeddings from other clients. Specifically, PFedEG learns personalized supplementary knowledge for each client by amalgamating entity embedding from its neighboring clients based on their "affinity" on the client-wise relation graph. Each client then conducts personalized embedding learning based on its local triples and personalized supplementary knowledge. We conduct extensive experiments on four benchmark datasets to evaluate our method against state-of-the-art models and results demonstrate the superiority of our method.
... However, more than single similarity metrics and singular information sources may be required for accurately predicting DDIs. Abdelaziz et al. [15] fused multiple drug features to compute drug similarity and accurately predicted DDIs based on the fused similarity in 2017. Vilar et al. [16] integrated drug similarity information extracted from different sources in 2014. ...
... The initial representation matrix of the knowledge graph G is as follows: (15) where l is set to 1. To gather information about the common properties of each node in relation to similar or identical edges corresponding to tail entities, one only needs to aggregate information from one layer. ...
Article
Full-text available
The combined use of multiple medications is common in treatment, which may lead to severe drug–drug interactions (DDIs). Deep learning methods have been widely used to predict DDIs in recent years. However, current models need help to fully understand the characteristics of drugs and the relationships between these characteristics, resulting in inaccurate and inefficient feature representations. Beyond that, existing studies predominantly focus on analyzing a single DDIs, failing to explore multiple similar DDIs simultaneously, thus limiting the discovery of common mechanisms underlying DDIs. To address these limitations, this research proposes a method based on M-Transformer and knowledge graph for predicting DDIs, comprising a dual-pathway approach and neural network. In the first pathway, we leverage the interpretability of the transformer to capture the intricate relationships between drug features using the multi-head attention mechanism, identifying and discarding redundant information to obtain a more refined and information-dense drug representation. However, due to the potential difficulty for a single transformer model to understand features from multiple semantic spaces, we adopted M-Transformer to understand the structural and pharmacological information of the drug as well as the connections between them. In the second pathway, we constructed a drug–drug interaction knowledge graph (DDIKG) using drug representation vectors obtained from M-Transformer as nodes and DDI types as edges. Subsequently, drug edges with similar interactions were aggregated using a graph neural network (GNN). This facilitates the exploration and extraction of shared mechanisms underlying drug–drug interactions. Extensive experiments demonstrate that our MTrans model accurately predicts DDIs and outperforms state-of-the-art models.
... Current DDI prediction methods primarily rely on integrating multiple databases to obtain drug features, such as similarity features [6], adverse or side effects [7] and multi-task learning [8]. They make the assumption that drugs with similar representations will be having similar DDIs. ...
Article
Full-text available
Effective drug combination prediction is crucial for the success of drug discovery, but it is a challenging task due to drug-drug interactions and potential adverse drug reactions. In this work, a novel technique to DDI prediction using knowledge graph-based approach called KGAT is proposed, which utilizes attention mechanisms with graph convolution layers to capture important features and correlations between drugs and other entities such as targets and genes. Our model employs attention mechanisms to prioritize significant interactions and aggregates information through sum, mean, and max operations to enhance prediction accuracy. This allows KGAT to effectively mine high-order structures and semantic relationships within the knowledge graph. We evaluate our model on the KEGG dataset and compare its performance with existing state-of-the-art methods. The results show that KGAT outperforms these methods. Additionally, our approach has several advantages, including simplicity, interpretability, and low-dimensional complexity, making it a promising tool for accelerating drug discovery and development. By identifying novel drug combinations with improved efficacy and safety profiles, our approach has the potential to improve patient outcomes and support safer drug development. Our study highlights the potential of attention mechanisms in knowledge graph-based drug combination prediction, and we believe that KGAT can serve as a valuable framework for future research in this field.
... A knowledge graph (KG) describes real-world facts in the form of triples (head entity, relation, tail entity). Knowledge graph embedding (KGE) aims to encode entities and relations in the KG into continuous vector representations which capture the semantic meanings and relationships inherent in the graph structure, enabling various downstream tasks such as disease diagnosis [1,21], recommendation system [28,29,32], question answering system [10] and so on. ...
Preprint
Federated Knowledge Graph Embedding (FKGE) aims to facilitate collaborative learning of entity and relation embeddings from distributed Knowledge Graphs (KGs) across multiple clients, while preserving data privacy. Training FKGE models with higher dimensions is typically favored due to their potential for achieving superior performance. However, high-dimensional embeddings present significant challenges in terms of storage resource and inference speed. Unlike traditional KG embedding methods, FKGE involves multiple client-server communication rounds, where communication efficiency is critical. Existing embedding compression methods for traditional KGs may not be directly applicable to FKGE as they often require multiple model trainings which potentially incur substantial communication costs. In this paper, we propose a light-weight component based on Knowledge Distillation (KD) which is titled FedKD and tailored specifically for FKGE methods. During client-side local training, FedKD facilitates the low-dimensional student model to mimic the score distribution of triples from the high-dimensional teacher model using KL divergence loss. Unlike traditional KD way, FedKD adaptively learns a temperature to scale the score of positive triples and separately adjusts the scores of corresponding negative triples using a predefined temperature, thereby mitigating teacher over-confidence issue. Furthermore, we dynamically adjust the weight of KD loss to optimize the training process. Extensive experiments on three datasets support the effectiveness of FedKD.
... KGE methods have been shown to provide competitive performance in the DDI prediction task. Among others, Tiresias [25] first integrated various drug-related variables into a BioKG, which was then used to compute several similarity measures among all drugs and predict potential DDIs using a logistic regression classifier. Celebi et al. [26] applied several classical KGE models, such as TransE [27] and TransD [28], to predict potential interactions between drugs. BERTKG-DDIs [29] builds on the classical KGE models, combining the interactions of drug embeddings with other biomedical entities and a domain-specific BioBERT embedding-based Relation Classification (RC) architecture. ...
Article
In biomedicine, a critical task is to decode Drug–Drug Interactions (DDIs) from complex biomedical texts. The scientific community employs Knowledge Graph Embedding (KGE) methods, enhanced with advanced neural network technologies, including capsule networks. However, existing methodologies primarily focus on the structural details of individual entities or relations within Biomedical Knowledge Graphs (BioKGs), overlooking the overall structural context of BioKGs, molecular structures, positional features of drug pairs, and their critical Relational Mapping Properties. To tackle these challenges, this study presents HSTrHouse, an innovative hierarchical self-attention BioKG embedding framework. This architecture integrates self-attention mechanisms with advanced neural network technologies, including Convolutional Neural Network (CNN) and Graph Neural Network (GNN), for enhanced computational modeling in biomedical contexts. The model bifurcates the BioKGs into entity and relation layers for structural analysis. It employs self-attention across these layers, utilizing PubMedBERT and CNN for position feature extraction, and a GNN for drug pair molecular structure analysis. The position and molecular structure features are then connected and integrated into the self-attention calculation of entities and relations. Next, the output of the self-attention layer is combined with the connected position and molecular structure feature vectors to obtain the final representation vector. Finally, to model the Relational Mapping Properties (RMPs), the representation vector is embedded into the complex vector space using Householder projections to obtain the BioKGs model. The paper validates HSTrHouse’s efficacy by comparing it with advanced models on three standard BioKGs for DDIs research.
... In general, the process of learning entity embeddings from networks often involves knowledge graph embedding methods, which leverage various types of relations to generate embeddings of entities, including multi-relational embeddings. For instance, Tiresias [16] harnesses TransH [17] and HolE [18] to embed drugs and their relations. The approach using graph neural networks learns entity embeddings by aggregating information from neighboring nodes. ...
Conference Paper
Drug-Drug Interactions (DDIs) are a major cause of preventable adverse drug reactions and a huge burden on public health and the healthcare system. On the other hand, there is a large amount of drug-related (open) data published on the Web, describing various properties of drugs and their relationships to other drugs, genes, diseases, and related concepts and entities. In this demonstration, we describe an end-to-end system we have designed to take in various Web data sources as input and provide as output a prediction of DDIs along with an explanation of why two drugs may interact. The system first creates a knowledge graph out of input data sources through large-scale semantic integration, and then performs link prediction among drug entities in the graph through large-scale similarity analysis and machine learning. The link prediction is performed using a logistic regression model over several similarity matrices built using different drug similarity measures. We present both the efficient link prediction framework implemented in Apache Spark, and our APIs and Web interface for predicting DDIs and exploring their potential causes and nature.
Conference Paper
Drug-Drug Interactions (DDIs) are a major cause of preventable adverse drug reactions (ADRs), causing a significant burden on the patients’ health and the healthcare system. It is widely known that clinical studies cannot sufficiently and accurately identify DDIs for new drugs before they are made available on the market. In addition, existing public and proprietary sources of DDI information are known to be incomplete and/or inaccurate and so not reliable. As a result, there is an emerging body of research on in-silico prediction of drug-drug interactions. We present Tiresias, a framework that takes in various sources of drug-related data and knowledge as inputs, and provides DDI predictions as outputs. The process starts with semantic integration of the input data that results in a knowledge graph describing drug attributes and relationships with various related entities such as enzymes, chemical structures, and pathways. The knowledge graph is then used to compute several similarity measures between all the drugs in a scalable and distributed framework. The resulting similarity metrics are used to build features for a large-scale logistic regression model to predict potential DDIs. We highlight the novelty of our proposed approach and perform thorough evaluation of the quality of the predictions. The results show the effectiveness of Tiresias in both predicting new interactions among existing drugs and among newly developed and existing drugs.
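The last two steps of this pipeline, similarity features feeding a logistic regression classifier, can be sketched on toy data. The sketch assumes Jaccard similarity over hypothetical enzyme and pathway sets as the only two features and trains the classifier with plain stochastic gradient descent. Tiresias itself uses many more similarity measures and a distributed implementation; the drug names, attribute sets, labels, and hyperparameters below are invented for illustration only.

```python
import math
from typing import Dict, List, Set, Tuple

# Hypothetical local (one-hop) attributes of four drugs.
attributes: Dict[str, Dict[str, Set[str]]] = {
    "drugA": {"enzymes": {"CYP3A4", "CYP2D6"}, "pathways": {"p1", "p2"}},
    "drugB": {"enzymes": {"CYP3A4", "CYP2D6"}, "pathways": {"p1"}},
    "drugC": {"enzymes": {"CYP1A2"}, "pathways": {"p3"}},
    "drugD": {"enzymes": {"CYP1A2", "CYP3A4"}, "pathways": {"p3", "p2"}},
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(drug_a: str, drug_b: str) -> List[float]:
    """One similarity value per attribute type for a drug pair."""
    return [jaccard(attributes[drug_a][k], attributes[drug_b][k])
            for k in ("enzymes", "pathways")]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X: List[List[float]], y: List[int],
                   lr: float = 0.5, epochs: int = 2000
                   ) -> Tuple[List[float], float]:
    """Fit weights and bias by stochastic gradient descent on log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict(w: List[float], b: float, x: List[float]) -> float:
    """Predicted probability that the pair interacts."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy training pairs with made-up interaction labels (1 = interacts).
train_pairs = [("drugA", "drugB", 1), ("drugC", "drugD", 1),
               ("drugA", "drugC", 0), ("drugB", "drugC", 0)]
X = [pair_features(a, b) for a, b, _ in train_pairs]
y = [label for _, _, label in train_pairs]
weights, bias = train_logistic(X, y)

if __name__ == "__main__":
    # Score a held-out candidate pair.
    print(predict(weights, bias, pair_features("drugA", "drugD")))
```

Because the classifier is linear in the similarity features, the learned weights indicate how much each similarity measure contributes to a prediction.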
Article
Learning embeddings of entities and relations is an efficient and versatile method to perform machine learning on relational data such as knowledge graphs. In this work, we propose holographic embeddings (HolE) to learn compositional vector space representations of entire knowledge graphs. The proposed method is related to holographic models of associative memory in that it employs circular correlation to create compositional representations. By using correlation as the compositional operator, HolE can capture rich interactions while remaining efficient to compute, easy to train, and scalable to very large datasets. In extensive experiments, we show that holographic embeddings are able to outperform state-of-the-art methods for link prediction in knowledge graphs and relational learning benchmark datasets.
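As a small illustration of the compositional operator named in this abstract, the sketch below computes circular correlation, [a ⋆ b]_k = Σ_i a_i · b_((k+i) mod d), and uses it to score a triple as sigmoid(r · (h ⋆ t)). The direct double loop is chosen for readability; the function names and the example vectors are assumptions for illustration, and a practical implementation would typically compute the correlation with fast Fourier transforms.

```python
import math
from typing import List


def circular_correlation(a: List[float], b: List[float]) -> List[float]:
    """[a * b]_k = sum_i a[i] * b[(k + i) mod d], computed directly."""
    d = len(a)
    return [sum(a[i] * b[(k + i) % d] for i in range(d)) for k in range(d)]


def hole_score(h: List[float], r: List[float], t: List[float]) -> float:
    """Probability-like score of the triple (h, r, t):
    sigmoid of the relation vector dotted with the correlation of h and t."""
    composed = circular_correlation(h, t)
    z = sum(ri * ci for ri, ci in zip(r, composed))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Hypothetical 3-dimensional embeddings.
    head = [1.0, 0.0, 0.0]
    tail = [0.0, 1.0, 0.0]
    relation = [0.0, 2.0, 0.0]
    print(hole_score(head, relation, tail))  # forward direction
    print(hole_score(tail, relation, head))  # reversed direction
```

Circular correlation is not commutative, so swapping head and tail generally changes the score, which lets the model represent asymmetric relations.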
Article
Drug-drug interaction (DDI) is an important topic for public health, and thus attracts attention from both academia and industry. Here we hypothesize that clinical side effects (SEs) provide a human phenotypic profile and can be translated into the development of computational models for predicting adverse DDIs. We propose an integrative label propagation framework to predict DDIs by integrating SEs extracted from package inserts of prescription drugs, SEs extracted from the FDA Adverse Event Reporting System, and chemical structures from PubChem. Experimental results based on hold-out validation demonstrated the effectiveness of the proposed algorithm. In addition, the new algorithm also ranked drug information sources based on their contributions to the prediction, thus not only confirming that SEs are important features for DDI prediction but also paving the way for building more reliable DDI prediction models by prioritizing multiple data sources. By applying the proposed algorithm to 1,626 small-molecule drugs which have one or more SE profiles, we obtained 145,068 predicted DDIs. The predicted DDIs will help clinicians avoid hazardous drug interactions in their prescriptions and will help pharmaceutical companies design large-scale clinical trials by assessing potentially hazardous drug combinations. All data sets and predicted DDIs are available at http://astro.temple.edu/~tua87106/ddi.html.
Article
We deal with embedding a large-scale knowledge graph composed of entities and relations into a continuous vector space. TransE is a promising method proposed recently, which is very efficient while achieving state-of-the-art predictive performance. We discuss some mapping properties of relations which should be considered in embedding, such as reflexive, one-to-many, many-to-one, and many-to-many. We note that TransE does not do well in dealing with these properties. Some complex models are capable of preserving these mapping properties but sacrifice efficiency in the process. To make a good trade-off between model capacity and efficiency, in this paper we propose TransH, which models a relation as a hyperplane together with a translation operation on it. In this way, we can well preserve the above mapping properties of relations with almost the same model complexity as TransE. Additionally, as a practical knowledge graph is often far from complete, how to construct negative examples to reduce false negative labels in training is very important. Utilizing the one-to-many/many-to-one mapping property of a relation, we propose a simple trick to reduce the possibility of false negative labeling. We conduct extensive experiments on link prediction, triplet classification and fact extraction on benchmark datasets like WordNet and Freebase. Experiments show that TransH delivers significant improvements over TransE on predictive accuracy with comparable capability to scale up.
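The idea of modeling a relation as a hyperplane with a translation on it can be sketched as follows. Head and tail embeddings are first projected onto the relation's hyperplane (unit normal w_r), and the score is the squared distance ‖h⊥ + d_r − t⊥‖², where lower means more plausible. The function names and example vectors are assumptions for illustration.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: List[float]) -> List[float]:
    """Scale v to unit length (v must be non-zero)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def project(v: List[float], w: List[float]) -> List[float]:
    """Project v onto the hyperplane whose unit normal is w."""
    scale = dot(w, v)
    return [x - scale * wi for x, wi in zip(v, w)]


def transh_score(h: List[float], t: List[float],
                 w_r: List[float], d_r: List[float]) -> float:
    """Squared L2 distance between the translated projected head and the
    projected tail. Lower scores indicate more plausible triples."""
    w = normalize(w_r)
    h_p = project(h, w)
    t_p = project(t, w)
    diff = [hp + d - tp for hp, d, tp in zip(h_p, d_r, t_p)]
    return dot(diff, diff)


if __name__ == "__main__":
    normal = [0.0, 0.0, 1.0]       # hyperplane normal of the relation
    translation = [1.0, 0.0, 0.0]  # translation lying on the hyperplane
    head = [1.0, 2.0, 5.0]
    # Two tails that differ only along the normal direction.
    print(transh_score(head, [2.0, 2.0, -7.0], normal, translation))
    print(transh_score(head, [2.0, 2.0, 3.0], normal, translation))
```

Both tails in the usage example receive the same score because they differ only along the hyperplane normal. This is how a single head can be matched to several distinct tails under a one-to-many relation.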
Conference Paper
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e., it takes less than a day to learn high-quality word vectors from a 1.6-billion-word data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
Conference Paper
Drug repositioning helps identify new indications for marketed drugs and clinical candidates. In this study, we proposed an integrative computational framework to predict novel drug indications for both approved drugs and clinical molecules by integrating chemical, biological and phenotypic data sources. We defined different similarity measures for each of these data sources and utilized a weighted k-nearest neighbor algorithm to transfer similarities of nearest neighbors to prediction scores for a given compound. A large margin method was used to combine individual metrics from multiple sources into a global metric. A large-scale study was conducted to repurpose 1007 drugs against 719 diseases. Experimental results showed that the proposed algorithm outperformed similar previously developed computational drug repositioning approaches. Moreover, the new algorithm also ranked drug information sources based on their contributions to the prediction, thus paving the way for prioritizing multiple data sources and building more reliable drug repositioning models.
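The weighted k-nearest neighbor step can be illustrated with a minimal sketch. It assumes a precomputed drug-drug similarity table and scores a candidate indication as the similarity-weighted share of the k most similar drugs that are already known to treat the disease. The abstract does not give the exact weighting scheme, so this formula, the function name, and the toy data are assumptions for illustration.

```python
from typing import Dict, Set


def knn_score(query: str, disease: str,
              similarity: Dict[str, Dict[str, float]],
              known: Dict[str, Set[str]], k: int = 2) -> float:
    """Similarity-weighted vote of the k nearest neighbors of `query`
    for the indication `disease`. Returns a value in [0, 1]."""
    sims = similarity[query]
    neighbors = sorted((d for d in sims if d != query),
                       key=lambda d: sims[d], reverse=True)[:k]
    total = sum(sims[d] for d in neighbors)
    if total == 0:
        return 0.0
    votes = sum(sims[d] for d in neighbors
                if disease in known.get(d, set()))
    return votes / total


# Hypothetical similarities and known indications.
toy_similarity = {"q": {"a": 0.9, "b": 0.6, "c": 0.1}}
toy_known = {"a": {"x"}, "b": {"y"}, "c": {"x", "y"}}

if __name__ == "__main__":
    print(knn_score("q", "x", toy_similarity, toy_known))
    print(knn_score("q", "y", toy_similarity, toy_known))
```

With k = 2, only the two most similar drugs vote, so the weakly similar drug "c" does not influence either score.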