Deep learning is rapidly advancing many areas of science and technology with multiple success stories in image, text, voice and video recognition, robotics and autonomous driving. In this paper we demonstrate how deep neural networks (DNN) trained on large transcriptional response data sets can classify various drugs to therapeutic categories solely based on their transcriptional profiles. We used the perturbation samples of 678 drugs across A549, MCF-7 and PC-3 cell lines from the LINCS project and linked those to 12 therapeutic use categories derived from MeSH. To train the DNN, we utilized both gene level transcriptomic data and transcriptomic data processed using a pathway activation scoring algorithm, for a pooled dataset of samples perturbed with different concentrations of the drug for 6 and 24 hours. When applied to normalized gene expression data for “landmark genes,” DNN showed cross-validation mean F1 scores of 0.397, 0.285 and 0.234 on 3-, 5- and 12-category classification problems, respectively. At the pathway level DNN performed best with cross-validation mean F1 scores of 0.701, 0.596 and 0.546 on the same tasks. In both gene and pathway level classification, DNN convincingly outperformed support vector machine (SVM) model on every multiclass classification problem. For the first time we demonstrate a deep learning neural net trained on transcriptomic data to recognize pharmacological properties of multiple drugs across different biological systems and conditions. We also propose using deep neural net confusion matrices for drug repositioning. This work is a proof of principle for applying deep learning to drug discovery and development.
Deep Learning Applications for Predicting Pharmacological
Properties of Drugs and Drug Repurposing Using Transcriptomic Data
Alexander Aliper,*
Sergey Plis,
Artem Artemov,
Alvaro Ulloa,
Polina Mamoshina,
and Alex Zhavoronkov*
Insilico Medicine, ETC, B301, Johns Hopkins University, Baltimore, Maryland 21218, United States
Datalytic Solutions, 1101 Yale Boulevard NE, Albuquerque, New Mexico 87106, United States
The Mind Research Network, Albuquerque, New Mexico 87106, United States
The Biogerontology Research Foundation, Oxford, U.K.
Supporting Information
ABSTRACT: Deep learning is rapidly advancing many areas
of science and technology with multiple success stories in
image, text, voice and video recognition, robotics, and
autonomous driving. In this paper we demonstrate how deep
neural networks (DNN) trained on large transcriptional
response data sets can classify various drugs into therapeutic
categories solely based on their transcriptional profiles. We
used the perturbation samples of 678 drugs across A549,
MCF-7, and PC-3 cell lines from the LINCS Project and
linked those to 12 therapeutic use categories derived from
MeSH. To train the DNN, we utilized both gene level transcriptomic data and transcriptomic data processed using a pathway
activation scoring algorithm, for a pooled data set of samples perturbed with different concentrations of the drug for 6 and 24
hours. In both pathway and gene level classification, DNN achieved high classification accuracy and convincingly outperformed
the support vector machine (SVM) model on every multiclass classification problem; however, models based on pathway level
data performed significantly better. For the first time we demonstrate a deep learning neural net trained on transcriptomic data to
recognize pharmacological properties of multiple drugs across different biological systems and conditions. We also propose using
deep neural net confusion matrices for drug repositioning. This work is a proof of principle for applying deep learning to drug
discovery and development.
KEYWORDS: deep learning, DNN, predictor, drug repurposing, drug discovery, confusion matrix, deep neural networks
Drug discovery and development is a complicated, time- and
resource-consuming process, and various computational
approaches are regularly being developed to improve it. In
silico drug discovery has evolved over the past decade and
offers a targeted, efficient approach compared to those of the
past, which often relied on either identifying active ingredients
in traditional remedies or, in many cases, serendipitous
discovery. Modern methods include data mining, structure
modeling (homology modeling), traditional machine learning
(ML), and its biologically inspired branch technique, deep
learning (DL). DL methods model high-level representations of
data using deep neural networks (DNNs). DNNs are flexible multilayer
systems of connected and interacting artificial neurons that
perform various data transformations. They have several hidden
layers of neurons, and varying the number of layers adjusts the
level of data abstraction. DL now plays a dominant role in the
areas of physics and speech, signal, image, video, and text
mining and recognition,
improving state-of-the-art performance by more than 30%,
where the prior decade struggled to
obtain 1-2% improvements. Traditional machine learning
approaches have achieved significant levels of classification
accuracy, but at the price of manually selected and tuned
features. Arguably, feature engineering is the dominating
research component in practical applications of ML. In
contrast, the power of NNs is in automatic feature learning
from massive data sets. Not only does this reduce the need for
laborious manual feature engineering, but it also allows learning
task-optimal features.
Received: March 18, 2016
Revised: May 13, 2016
Accepted: May 20, 2016
© XXXX American Chemical Society. DOI: 10.1021/acs.molpharmaceut.6b00248
Mol. Pharmaceutics XXXX, XXX, XXXXXX
This is an open access article published under an ACS AuthorChoice License, which permits
copying and redistribution of the article or any adaptations for non-commercial purposes.
Modern biology has entered the era of Big Data, wherein
data sets are too large, high-dimensional, and complex for
classical computational biology methods. The ability to learn at
the higher levels of abstraction made DL a promising and
effective tool for working with biological and chemical data.
Methods using DL architecture are capable of dealing with
sparse and complex information, which is especially in demand
in the analysis of high-dimensional gene expression data. The
“curse of dimensionality” is one of the major problems of gene
expression data; it can be addressed by feature selection
using standard data projection methods such as PCA or
more biologically relevant approaches such as pathway analysis.
DL methods demonstrate state-of-the-art performance in extracting
features from sparse transcriptomics data (both mRNA and miRNA),
in classifying cancer using gene expression data, and in
predicting splicing code patterns. DL has been effectively
applied in biomodeling and structural genomics to predict
protein 3-D structure from protein sequence (for ordered or
disordered proteins, the latter lacking a fixed 3-D structure). DL
may become an essential tool for development of new drugs.
DL approaches were successfully implemented to predict
drug-target interactions, model reaction properties,
and calculate toxicity of drugs. As deep networks
incorporate more features from biology, application breadth
and accuracy will likely increase.
Drug repurposing or target extension allows prediction of
new potential applications of medications or even new
therapeutic classes of drugs using gene expression data before
and after treatment (e.g., before and after incubation of a cell
line with multiple drugs). There are multiple in silico
approaches to drug discovery and classification, and
many attempts were made to predict transcriptional response
with functional properties of drugs. In this study we
addressed this problem by classifying various drugs into
therapeutic categories with DNN solely based on their
transcriptional profiles. We used the perturbation samples of
678 drugs across A549, MCF-7, and PC-3 cell lines from the
LINCS Project and linked those to 12 therapeutic use
categories derived from the MeSH therapeutic use section (Figure
1). After that we independently used both gene expression level
data for “landmark genes” and pathway activation scores to
train a DNN classifier.
The main aim of this study was to apply DL methods and estimate
their accuracy in classifying various drugs into therapeutic
categories solely based on their transcriptional profiles. In total,
we analyzed 26,420 drug perturbation samples for three cell
lines from the Broad LINCS database. All samples were
assigned to 12 specific therapeutic use categories according to
the MeSH classification of the particular drug (Supplementary
Table 1). Since a number of drugs were present in multiple
categories, we considered only those drugs that belong to a
single category. To increase the number of samples in each of the
categories and to make the classification more robust for each
given drug, we aggregated all samples corresponding to all
possible perturbation time, perturbation concentration, and cell
line parameters (Supplementary Table 2).
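The drug selection and pooling described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the drug names, category assignments, and sample tuples are hypothetical, and only the two rules stated in the text (keep drugs with exactly one category, pool samples across cell line, concentration, and time) are implemented.

```python
from collections import defaultdict

# Hypothetical drug -> MeSH therapeutic-use categories (illustrative only).
drug_categories = {
    "drug_a": {"antineoplastic"},
    "drug_b": {"cardiovascular", "central nervous system"},
    "drug_c": {"cardiovascular"},
}

# Hypothetical perturbation samples: (drug, cell line, concentration, hours).
samples = [
    ("drug_a", "A549", "10 uM", 6),
    ("drug_a", "MCF7", "10 uM", 24),
    ("drug_b", "PC3", "5 uM", 6),
    ("drug_c", "PC3", "1 uM", 24),
]


def single_category_drugs(categories):
    """Keep only drugs annotated with exactly one category."""
    return {
        drug: next(iter(cats))
        for drug, cats in categories.items()
        if len(cats) == 1
    }


def pool_by_category(all_samples, drug_to_category):
    """Pool samples across cell line, concentration, and time per category."""
    pooled = defaultdict(list)
    for sample in all_samples:
        drug = sample[0]
        if drug in drug_to_category:
            pooled[drug_to_category[drug]].append(sample)
    return dict(pooled)


labels = single_category_drugs(drug_categories)
pooled = pool_by_category(samples, labels)
```

In this toy input, drug_b is dropped because it carries two categories, and both drug_a samples end up in the same class regardless of cell line or exposure time.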
When dealing with transcriptional data at the gene level, a
common problem is the so-called “curse of dimensionality.”
Indeed, when we applied DNN to gene level data for the whole
data set of 12,797 genes, it did not perform very well, achieving
only 0.24 mean F1 score on 12 classes. So our first step was
proper feature selection. Here we investigated two approaches:
pathway activation scoring and using “landmark genes” as new
features.
Pathway Level. For pathway level analysis we used a
previously established pathway analysis method called
OncoFinder. It preserves biological function and allows for
dimensionality reduction at the same time. In contrast to other
pathway analysis tools, which mostly implement pathway
enrichment analysis, OncoFinder performs quantitative
estimation of signaling pathway activation strength; the
resulting value indicates how strongly the pathway is up- or
downregulated, with its sign giving the direction. All perturbation
samples were analyzed with
this tool and for each sample we calculated pathway activation
profiles for 271 signaling pathways. Samples with zero pathway
activation score for all of the pathways were considered as
insignificantly perturbed and were excluded from further
analysis. That resulted in a final data set containing 308, 454,
and 433 drugs for A549, MCF7, and PC3 cell lines,
respectively, and totalling 9352 samples (Supplementary
Table 2).
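The exclusion rule for insignificantly perturbed samples is simple enough to state in code. The sketch below assumes each sample is represented by a list of pathway activation scores; the sample identifiers and values are hypothetical, and the text uses 271 pathways rather than three.

```python
def drop_unperturbed(pas_profiles):
    """Remove samples whose pathway activation score is zero for every pathway."""
    return {
        sample_id: scores
        for sample_id, scores in pas_profiles.items()
        if any(score != 0 for score in scores)
    }


# Hypothetical profiles with 3 pathways each.
profiles = {
    "s1": [0.0, 0.0, 0.0],   # no pathway perturbed: excluded
    "s2": [1.2, 0.0, -0.4],  # up- and downregulated pathways: kept
    "s3": [0.0, -2.1, 0.0],  # one downregulated pathway: kept
}
kept = drop_unperturbed(profiles)
```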
Using this data set we built a deep learning classifier based
only on pathway activation scores for drug perturbation profiles
of 3 cell lines: A549, MCF-7, and PC-3. By building a classifier
on a pooled data set with different cell lines, drug
concentrations, and perturbation times, we are able to estimate
the classification performance in recognizing complex drug
action patterns across different biological conditions. For the 3-
class classification problem we chose the most abundant
categories: antineoplastic, cardiovascular, and central nervous
system agents. DNN achieved a 10-fold cross-validation mean F1
score of 0.701. We compared the results of DNN to another
popular classification algorithm, the support vector machine
(SVM), trained via nested 3-fold cross validation for several
hyperparameters (see Materials and Methods). On 3-class
classification problem SVM performed with mean F1 score of
Addition of gastrointestinal and anti-infective classes
decreased the mean F1 score of DNN to 0.596. Mean F1
score for SVM dropped as well, down to 0.417.
Figure 1. Study design. Gene expression data from LINCS Project was linked to 12 MeSH therapeutic use categories. DNN was trained separately
on gene expression level data for “landmark genes” and pathway activation scores for significantly perturbed samples, forming input layers of 977 and
271 neural nodes, respectively.
When all 12 classes were considered, the classification
performance of the neural network decreased only slightly, with a
cross-validation mean F1 score of 0.546. SVM performed with a
cross-validation mean F1 score of 0.366 on the same 12-class
classification problem. The performance comparison of DNN and SVM
on the investigated classification problems is depicted in Figure 2a-c.
These results indicate that our model performance far exceeds
random chance, and we can conclude that DNN outperformed
SVM on every multiclass classification problem.
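For readers unfamiliar with the metric, a mean F1 score over classes can be computed from a confusion matrix as sketched below. This assumes that "mean F1" denotes the unweighted (macro) average of per-class F1 scores, which the text does not state explicitly; the toy matrix is hypothetical and follows the convention that rows are true classes and columns are predicted classes.

```python
def per_class_f1(confusion):
    """confusion[i][j] = number of samples with true class i predicted as class j."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # true k, predicted otherwise
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true otherwise
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def mean_f1(confusion):
    """Unweighted (macro) average of the per-class F1 scores."""
    scores = per_class_f1(confusion)
    return sum(scores) / len(scores)


# Hypothetical 3-class confusion matrix.
toy = [
    [8, 1, 1],
    [2, 6, 2],
    [1, 3, 6],
]
toy_scores = per_class_f1(toy)
toy_mean = mean_f1(toy)
```

Because every class contributes equally to a macro average, poorly recognized small categories pull the mean down more than they would under sample-weighted accuracy.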
Landmark Gene Level. In our second feature selection
approach we used a data set containing normalized gene
expression data for 977 “landmark genes.” According to the
authors of the LINCS Project, these genes can capture approximately 80%
of the information and possess great inferential value. For fair
comparison we trained DNN exactly the same way we did on
the pathway level. We used the same data set of 9352
significantly perturbed samples and tested the performance of
DNN on the same classification problems. DNN trained on
“landmark gene” data performed with 10-fold cross-validation
mean F1 scores of 0.397, 0.285, and 0.234 for 3-, 5-, and 12-
class classification tasks, respectively. The SVM model showed
mean F1 scores of 0.372, 0.238, and 0.202 for the respective tasks
(Figure 2d-f).
Figure 2. Classification results. Classification performance of DNN and SVM trained on signaling pathways (a, b, c) and landmark genes (d, e, f) for
3, 5, and 12 drug classes, respectively, after 10-fold cross validation. Training and validation set results are shown in gray and green colors,
respectively.
DNN as Drug Repurposing Tool. Here we looked more closely
at the classification results at the pathway level, since the
DNN model worked best with pathways as features. To
determine which of the 12 therapeutic use categories are the
most detectable by DNN, we calculated the 10-fold cross-validation
classification accuracy of each category. Antineoplastic agents
turned out to be the most “recognizable” category by a large
margin, with 0.686 accuracy on 12 classes. This was followed by
anti-infective, central nervous system, and dermatologic
categories, with 0.513, 0.506, and 0.505 accuracy, respectively.
The least “recognizable” on the same number of classes was
hematologic agents, with accuracy of 0.23. On 3- and 5-class
classification problems, the antineoplastic category was
on top as well, with accuracy of 0.82 and 0.742. Separability of
therapeutic categories by DNN can be illustrated with
confusion matrices (Figure 3). Here we observed that
cardiovascular drugs were relatively often misclassified
as central nervous system and antineoplastic agents. In contrast,
the level of false positives for the antineoplastic category was
relatively small. On closer inspection,
some of these misclassified false positive drugs may in fact
represent a possibility for drug repurposing. For instance, the
well-known muscarinic receptor antagonist otenzepad was
misclassified as a central nervous system agent, but despite its
obvious role in brain function, according to the MeSH
therapeutic use section, it is only used against cardiac
arrhythmia. Another example is the vasodilator pinacidil, a
cyanoguanidine drug that opens ATP-sensitive potassium
channels, which was misclassified as a central nervous system
agent in several cross-validation iterations, although it is used
only in cardiovascular conditions. It is known that potassium
channels play important roles in different brain regions, and
pinacidil might influence some of them. These cases
hint that imperfect accuracy here might not be a bad
thing and that the DNN model could serve as a powerful drug
repositioning tool.
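One way to turn this observation into a screening rule is to flag drugs that are misclassified into the same wrong category in a sizeable share of held-out samples. The sketch below is an assumption-laden illustration, not the procedure used in the study: the function name, the 0.5 threshold, and the toy predictions are all hypothetical.

```python
from collections import Counter, defaultdict


def repurposing_candidates(predictions, min_fraction=0.5):
    """Flag drugs repeatedly assigned to one particular wrong category.

    predictions: iterable of (drug, true_category, predicted_category),
    one entry per held-out sample. Returns {drug: (category, fraction)}
    where fraction is the share of that drug's samples assigned to its
    most frequent wrong category. min_fraction is an arbitrary cutoff.
    """
    by_drug = defaultdict(list)
    truth = {}
    for drug, true_cat, pred_cat in predictions:
        by_drug[drug].append(pred_cat)
        truth[drug] = true_cat

    candidates = {}
    for drug, preds in by_drug.items():
        wrong = Counter(p for p in preds if p != truth[drug])
        if not wrong:
            continue
        category, count = wrong.most_common(1)[0]
        fraction = count / len(preds)
        if fraction >= min_fraction:
            candidates[drug] = (category, fraction)
    return candidates


# Hypothetical held-out predictions.
toy_predictions = [
    ("drug_x", "cardiovascular", "central nervous system"),
    ("drug_x", "cardiovascular", "central nervous system"),
    ("drug_x", "cardiovascular", "cardiovascular"),
    ("drug_y", "antineoplastic", "antineoplastic"),
    ("drug_y", "antineoplastic", "cardiovascular"),
    ("drug_y", "antineoplastic", "antineoplastic"),
]
candidates = repurposing_candidates(toy_predictions)
```

A consistent wrong label across concentrations, time points, and cell lines is more suggestive than a single stray misclassification, which is why the rule thresholds on the fraction rather than on the mere presence of an error.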
With increasing availability of big data and GPU computing, the
entire field of deep learning is experiencing very rapid
development, and the breadth of DNN applications goes far
beyond text, voice, and image recognition problems. In this
paper we explored the possibility of using DL to classify various
drugs into therapeutic categories solely based on their
transcriptomic data. To our knowledge, this is the first DL
model to map transcriptomic data onto therapeutic categories.
DNN trained on gene level data did not perform very well,
achieving only 0.24 F1 score on 12 classes. Thus, as a way to
reduce dimensionality and keep biological relevance, we
decided to apply pathway activation scoring. Translation of
perturbation profiles onto the pathway level turned out to be
very beneficial. Pathways served as excellent features, and we
were able to exclude insignificantly perturbed samples and
demonstrate the ability of a deep neural network to recognize
sophisticated drug action mechanisms on the pathway level.
The power of this approach is further highlighted by the fact
that it performs with high accuracy on a pooled data set with
samples from cells treated with different drug concentrations and
perturbation times. Furthermore, the DNN achieves significant
classification accuracy even across different cell lines. Its
performance turned out to be better than SVM on every
multiclass classification problem. When we used the same set of
significantly perturbed samples that we selected with the
pathway activation approach, and trained DNN on a data set
with gene expression data for landmark genes, the results
turned out to be significantly worse. Hence, we can conclude
that pathway level data is better suited to DNN and
more suitable for classifying drugs into therapeutic use
categories. Proper comparison to a reference group of samples
plays an essential role, and merely normalized gene expression
data from perturbed samples is not sufficient for complex
classification tasks.
Figure 3. Validation confusion matrix representing deep neural network classification performance over a set of drugs profiled for A549, MCF7, and
PC3 cell lines, belonging to 3 (a), 5 (b), and 12 (c) therapeutic classes. C(i,j) element is a sample count of how many times i was the truth and j was
predicted.
It is possible to interpret the classification results from
different angles. For instance, as shown in the confusion
matrices (Figure 3), the “misclassified” samples for a
certain drug might in fact be an indication of its potential for
novel use, or repurposing, in these exact “incorrectly” assigned
conditions. Misclassification, therefore, may lead to unexpected
new discoveries. This approach opens a great avenue for
application of DL in the drug repurposing field.
Data Collection. In this study, we performed the analysis of
gene expression data produced by the LINCS Project
participants. We utilized the
level 3 (Q2NORM) gene expression data for three cell lines:
MCF7, A549, and PC3. Q2NORM data contains the
expression levels of both directly measured landmark
transcripts and inferred genes, which were normalized using
invariant set scaling followed by quantile normalization.
Mapping probes onto official HGNC gene symbols resulted
in a gene expression data set comprising 12,797 genes total.
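Quantile normalization, the last preprocessing step mentioned above, forces every sample to share the same distribution of expression values. The sketch below shows the generic technique only, under simplifying assumptions (ties are broken by gene order, no invariant set scaling is applied); it is not the LINCS processing pipeline, and the input values are hypothetical.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of equal-length expression vectors.

    Each value is replaced by the mean, across samples, of the values
    that share its rank within their own sample.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(s[r] for s in sorted_samples) / len(samples)
        for r in range(n_genes)
    ]
    normalized = []
    for s in samples:
        # Gene indices ordered from lowest to highest expression.
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two hypothetical samples measured over three genes.
raw = [
    [5.0, 2.0, 3.0],
    [4.0, 1.0, 6.0],
]
norm = quantile_normalize(raw)
```

After normalization both samples contain the same set of values, and only the assignment of values to genes differs between them.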
Drug Selection. To link the drugs profiled by the LINCS
Project to medical conditions, we utilized the MeSH (medical
subject headings) classification (
mesh/). We picked only those drugs that had a link to a
disease in the “therapeutic use” section of the classification tree.
Drugs belonging to several categories simultaneously were
excluded from the analysis.
Gene Expression Analysis. All samples of the selected drugs
collected for three cell lines were grouped by the
combination of the following parameters: cell line, drug,
perturbation concentration, and perturbation time. This
resulted in a total number of 26,420 samples. For gene level
analysis we trained our models on both the whole data set of
12,797 genes and a subset of 977 so-called “landmark genes”
defined in the LINCS Project.
In pathway level analysis, for each given case sample group
we generated a reference group consisting of samples perturbed
with DMSO that came from the same RNA plate as samples
from the case group. After that, each case sample group was
independently analyzed using an algorithm called
OncoFinder. Taking the preprocessed gene expression data as an
input, it allows for cross-platform data set comparison with low
error rate and has the ability to obtain functional features of
intracellular regulation using mathematical estimations. For
each investigated sample group it performs a case-reference
comparison using Student’s t test, generates the list of
significantly differentially expressed genes, and calculates the
pathway activation strength (PAS), a value which serves as a
qualitative measure of pathway activation. Positive and negative
PAS values indicate pathway up- and downregulation,
respectively. In this study the genes with FDR-adjusted
p-value <0.05 were considered significantly differentially
expressed. Samples with zero pathway activation score for all of
the pathways were considered as insignificantly perturbed and
were excluded from further analysis. The filtered data set
contained 308, 454, and 433 drugs for A549, MCF7, and PC3
cell lines, respectively, and comprised 9352 samples total
(Supplementary Tables 1 and 2).
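The FDR adjustment step can be illustrated with the Benjamini-Hochberg procedure. Note that the text says only "FDR-adjusted p-value" and does not name the method, so Benjamini-Hochberg is an assumption here; the p-values in the example are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a case-reference t test.
p_raw = [0.001, 0.04, 0.03, 0.5]
p_adj = benjamini_hochberg(p_raw)
significant = [q < 0.05 for q in p_adj]
```

In this toy example only the first gene passes the 0.05 threshold after adjustment, although three of the four raw p-values are below 0.05.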
Classification Methods. Among the multitude of available
classification methods we have employed two that are highly
robust and widely successful in fields outside drug prediction:
support vector machine (SVM) and deep neural network.
SVMs are a celebrated classification method for their flexibility
and ease of use, while
deep learning approaches are continuing to break records in
many pattern recognition tasks.
Flexibility of SVM, as a maximum margin classifier, is in part
reflected in the ability to select a kernel that fits the data
best and to choose a soft-margin parameter that allows for best
generalization. However, these parameters are not evident for
any given data set. It has been previously shown that radial basis
function (RBF) kernel SVMs perform well on largely different
data. In order to be more objective in the selection of kernel, we
let the nested cross validation choose among three kernels. We
used grid search for hyperparameter optimization. We have
trained the SVM via nested 3-fold cross validation for
hyperparameters that include kernel type (linear, RBF, or
polynomial) and soft-margin parameter C, embedded in the
outer 10-fold cross-validation loop. For each fold, the algorithm
could have selected different kernels and different soft-margin
parameters. Nevertheless, RBF was indeed the most preferred.
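The nested scheme described above (an inner 3-fold grid search inside an outer 10-fold loop) can be sketched in a classifier-agnostic way. Everything below is illustrative: `fit_and_score` is a hypothetical callback standing in for "train a model with these hyperparameters and return its validation F1", the C values are invented for the example, and folds are contiguous without shuffling or stratification.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def nested_cv(n_samples, grid, fit_and_score, outer_k=10, inner_k=3):
    """Select hyperparameters by inner CV, then score them on each outer fold."""
    results = []
    for outer_train, outer_test in k_fold_indices(n_samples, outer_k):
        best_params, best_score = None, float("-inf")
        for params in grid:
            inner_scores = []
            for tr_pos, va_pos in k_fold_indices(len(outer_train), inner_k):
                train_idx = [outer_train[p] for p in tr_pos]
                val_idx = [outer_train[p] for p in va_pos]
                inner_scores.append(fit_and_score(params, train_idx, val_idx))
            mean_score = sum(inner_scores) / len(inner_scores)
            if mean_score > best_score:
                best_params, best_score = params, mean_score
        # The outer test fold is never seen during hyperparameter selection.
        results.append((best_params, fit_and_score(best_params, outer_train, outer_test)))
    return results


# Kernel types follow the text; the C values are hypothetical.
grid = [
    {"kernel": kernel, "C": c}
    for kernel, c in product(["linear", "rbf", "poly"], [0.1, 1.0, 10.0])
]


def dummy_fit_and_score(params, train_idx, test_idx):
    """Stand-in scorer that simply prefers one configuration."""
    return 1.0 if params["kernel"] == "rbf" and params["C"] == 1.0 else 0.5


results = nested_cv(30, grid, dummy_fit_and_score)
```

Because selection happens separately inside each outer fold, different folds may pick different kernels and C values, which matches the remark in the text that the algorithm could have selected different settings per fold.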
The deep learning method used in our work was the standard
fully connected multilayer perceptron with 977 input nodes for
gene expression level data and 271 for pathway activation
scores. Similarly to SVM, we used grid search for
hyperparameter optimization. We used cross entropy as the cost function
and AdaDelta as the cost function optimizer. We searched for the
optimal number of layers, number of hidden units, and dropout
rejection rate in a nested cross-validation framework. For the
hyperparameter search, the number of layers varied from 3 to 8,
where the number of hidden units per layer was reduced with
the depth of the network to half that of the previous layer. We set the
search for the starting hidden layer from 300 to 900 units in steps of
150. Each layer was initialized with the Glorot uniform
initializer. The experiment shows that the best combination
of parameters was 3 hidden layers with 200 units in each with
rectified linear activation function. The dropout rejection ratio
was tested for 20% and 80% at each layer. We chose an
antisymmetric activation function, hyperbolic tangent, as the
nonlinear function of hidden neurons because the data is
normalized to zero mean and unit variance; thus we rely on
deviations from the mean to train the network. Also, it has been
reported that the hyperbolic tangent speeds up convergence
compared to sigmoid functions. The number of output nodes
was equal to the number of classes to explore in each particular
experiment, with a softmax activation function. The code that
implements the feed-forward neural network used in our
experiments is publicly available at
alvarouc/mlp (commit: 8b07a1a18b17ca530fdcb482fce-
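The architecture search space described above can be enumerated as in the following sketch. It is not taken from the authors' repository. It assumes that "number of layers" counts hidden layers and that halving uses integer division; neither detail is stated in the text, and the function and variable names are hypothetical.

```python
from itertools import product


def hidden_layer_widths(first_width, n_layers):
    """Halve the number of units at each successive hidden layer."""
    widths = []
    width = first_width
    for _ in range(n_layers):
        widths.append(width)
        width = max(1, width // 2)  # integer halving is an assumption
    return widths


layer_counts = range(3, 9)           # 3 to 8 layers
first_widths = range(300, 901, 150)  # 300, 450, 600, 750, 900
dropout_rates = (0.2, 0.8)           # the two tested dropout ratios

search_space = [
    {"widths": hidden_layer_widths(width, n), "dropout": rate}
    for n, width, rate in product(layer_counts, first_widths, dropout_rates)
]
```

Under these assumptions the grid contains 60 configurations, each of which would be trained and scored inside the nested cross-validation loop.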
Supporting Information
The Supporting Information is available free of charge on the
ACS Publications website at DOI: 10.1021/acs.molpharma-
MeSH category stratification binary matrix (XLSX)
Number of drugs selected for A549, MCF7, and PC3 cell
lines (XLSX)
Corresponding Author
The authors declare no competing financial interest.
We would like to thank Dr. Leslie C. Jellen of Insilico Medicine
for editing this manuscript for language and style. Authors
thank Mark Berger and NVIDIA Corporation for providing
valuable advice and high performance GPU equipment for deep
learning applications.
(1) Loging, W.; Harland, L.; Williams-Jones, B. High-Throughput
Electronic Biology: Mining Information for Drug Discovery. Nat. Rev.
Drug Discovery 2007,DOI: 10.1038/nrd2345.
(2) Kirchmair, J.; Göller, A. H.; Lang, D.; Kunze, J.; Testa, B.; Wilson,
I. D.; Glen, R. C.; Schneider, G. Predicting Drug Metabolism:
Experiment and/or Computation? Nat. Rev. Drug Discovery 2015,14,
(3) Schirle, M.; Jenkins, J. L. Identifying Compound Efficacy Targets
in Phenotypic Drug Discovery. Drug Discovery Today 2016,21, 82.
(4) LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015,
521 (7553), 436444.
(5) Baldi, P.; Sadowski, P.; Whiteson, D. Searching for Exotic
Particles in High-Energy Physics with Deep Learning. Nat. Commun.
2014,5, 4308.
(6) Schmidhuber, J. Deep Learning in Neural Networks: An
Overview. Neural Networks 2015,61,85117.
(7) Mamoshina, P.; Vieira, A.; Putin, E.; Zhavoronkov, A.
Applications of Deep Learning in Biomedicine. Mol. Pharmaceutics
2016,13 (5), 14451454.
(8) Hira, Z. M.; Gillies, D. F. A Review of Feature Selection and
Feature Extraction Methods Applied on Microarray Data. Adv. Bioinf.
2015,2015, 198363.
(9) Ibrahim, R.; Rania, I.; Yousri, N. A.; Ismail, M. A.; El-Makky, N.
M. Multi-Level gene/MiRNA Feature Selection Using Deep Belief
Nets and Active Learning. In 2014 36th Annual International
Conference of the IEEE Engineering in Medicine and Biology Society;
(10) Fakoor, R.; Ladhak, F.; Nazi, A.; Huber, M. Using Deep Learning
to Enhance Cancer Diagnosis and Classication.InProceedings of the
International Conference on Machine Learning; 2013.
(11) Leung, M. K. K.; Xiong, H. Y.; Lee, L. J.; Frey, B. J. Deep
Learning of the Tissue-Regulated Splicing Code. Bioinformatics 2014,
30 (12), i121i129.
(12) Lyons, J.; Dehzangi, A.; Heffernan, R.; Sharma, A.; Paliwal, K.;
Sattar, A.; Zhou, Y.; Yang, Y. Predicting Backbone CαAngles and
Dihedrals from Protein Sequences by Stacked Sparse Auto-Encoder
Deep Neural Network. J. Comput. Chem. 2014,35 (28), 20402046.
(13) Wang, S.; Weng, S.; Ma, J.; Tang, Q. DeepCNF-D: Predicting
Protein Order/Disorder Regions by Weighted Deep Convolutional
Neural Fields. Int. J. Mol. Sci. 2015,16 (8), 1731517330.
(14) Lusci, A.; Pollastri, G.; Baldi, P. Deep Architectures and Deep
Learning in Chemoinformatics: The Prediction of Aqueous Solubility
for Drug-like Molecules. J. Chem. Inf. Model. 2013,53 (7), 15631575.
(15) Wang, C.; Caihua, W.; Juan, L.; Fei, L.; Yafang, T.; Zixin, D.;
Qian-Nan, H. Pairwise Input Neural Network for Target-Ligand
Interaction Prediction.In2014 IEEE International Conference on
Bioinformatics and Biomedicine (BIBM); 2014.
(16) Hughes, T. B.; Miller, G. P.; Swamidass, S. J. Modeling
Epoxidation of Drug-like Molecules with a Deep Machine Learning
Network. ACS Cent. Sci. 2015,1(4), 168180.
(17) Xu, Y.; Dai, Z.; Chen, F.; Gao, S.; Pei, J.; Lai, L. Deep Learning
for Drug-Induced Liver Injury. J. Chem. Inf. Model. 2015,55 (10),
(18) Solovyeva, K. P.; Karandashev, I. M.; Zhavoronkov, A.; Dunin-
Barkowski, W. L. Models of Innate Neural Attractors and Their
Applications for Neural Information Processing. Front. Syst. Neurosci.
2015,DOI: 10.3389/fnsys.2015.00178.
(19) Newby, D.; Freitas, A. A.; Ghafourian, T. Comparing Multilabel
Classification Methods for Provisional Biopharmaceutics Class
Prediction. Mol. Pharmaceutics 2015,12 (1), 87102.
(20) Wenlock, M. C.; Barton, P. In Silico Physicochemical Parameter
Predictions. Mol. Pharmaceutics 2013,10 (4), 12241235.
(21) Broccatelli, F.; Cruciani, G.; Benet, L. Z.; Oprea, T. I. BDDCS
Class Prediction for New Molecular Entities. Mol. Pharmaceutics 2012,
9(3), 570580.
(22) Herrera-Ruiz, D.; Faria, T. N.; Bhardwaj, R. K.; Timoszyk, J.;
Gudmundsson, O. S.; Moench, P.; Wall, D. A.; Smith, R. L.; Knipp, G.
T. A Novel hPepT1 Stably Transfected Cell Line: Establishing a
Correlation between Expression and Function. Mol. Pharmaceutics
2004,1(2), 136144.
(23) Iskar, M.; Zeller, G.; Blattmann, P.; Campillos, M.; Kuhn, M.;
Kaminska, K. H.; Runz, H.; Gavin, A.-C.; Pepperkok, R.; van Noort,
V.; Bork, P. Characterization of Drug-Induced Transcriptional
Modules: Towards Drug Repositioning and Functional Under-
standing. Mol. Syst. Biol. 2013,9, 662.
(24) Kutalik, Z.; Beckmann, J. S.; Bergmann, S. A Modular Approach
for Integrative Analysis of Large-Scale Gene-Expression and Drug-
Response Data. Nat. Biotechnol. 2008,26 (5), 531539.
(25) Spirin, P. V.; Lebedev, T. D.; Orlova, N. N.; Gornostaeva, A. S.;
Prokofjeva, M. M.; Nikitenko, N. A.; Dmitriev, S. E.; Buzdin, A. A.;
Borisov, N. M.; Aliper, A. M.; Garazha, A. V.; Rubtsov, P. M.; Stocking,
C.; Prassolov, V. S. Silencing AML1-ETO Gene Expression Leads to
Simultaneous Activation of Both pro-Apoptotic and Proliferation
Signaling. Leukemia 2014,28 (11), 22222228.
(26) Zhu, Q.; Izumchenko, E.; Aliper, A. M.; Makarev, E.; Paz, K.;
Buzdin, A. A.; Zhavoronkov, A. A.; Sidransky, D. Pathway Activation
Strength Is a Novel Independent Prognostic Biomarker for Cetuximab
Sensitivity in Colorectal Cancer Patients. Hum. Genome Var. 2015,2,
(27) Buzdin, A. A.; Zhavoronkov, A. A.; Korzinkin, M. B.; Venkova,
L. S.; Zenin, A. A.; Smirnov, P. Y.; Borisov, N. M. Oncofinder, a new
method for the analysis of intracellular signaling pathway activation
using transcriptomic data. Front. Genet. 2014,5, 55.
(28) Artemov, A.; Aliper, A.; Korzinkin, M.; Lezhnina, K.; Jellen, L.;
Zhukov, N.; Roumiantsev, S.; Gaifullin, N.; Zhavoronkov, A.; Borisov,
N.; Buzdin, A. A Method for Predicting Target Drug Efficiency in
Cancer Based on the Analysis of Signaling Pathway Activation.
Oncotarget 2015,6(30), 2934729356.
(29) Venkova, L.; Aliper, A.; Suntsova, M.; Kholodenko, R.; Shepelin,
D.; Borisov, N.; Malakhova, G.; Vasilov, R.; Roumiantsev, S.;
Zhavoronkov, A.; Buzdin, A. Combinatorial High-Throughput
Experimental and Bioinformatic Approach Identifies Molecular
Pathways Linked with the Sensitivity to Anticancer Target Drugs.
Oncotarget 2015,6(29), 2722727238.
(30) Makarev, E.; Cantor, C.; Zhavoronkov, A.; Buzdin, A.; Aliper,
A.; Csoka, A. B. Pathway Activation Profiling Reveals New Insights
Molecular Pharmaceutics Article
DOI: 10.1021/acs.molpharmaceut.6b00248
Mol. Pharmaceutics XXXX, XXX, XXXXXX
into Age-Related Macular Degeneration and Provides Avenues for
Therapeutic Interventions. Aging 2014,6(12), 10641075.
(31) Combrisson, E.; Jerbi, K. Exceeding Chance Level by Chance:
The Caveat of Theoretical Chance Levels in Brain Signal Classification
and Statistical Assessment of Decoding Accuracy. J. Neurosci. Methods
2015,250, 126136.
(32) Langmead, C. J.; Watson, J.; Reavill, C. Muscarinic Acetylcho-
line Receptors as CNS Drug Targets. Pharmacol. Ther. 2008,117 (2),
(33) Volpicelli, L. A.; Levey, A. I. Muscarinic Acetylcholine Receptor
Subtypes in Cerebral Cortex and Hippocampus. Prog. Brain Res. 2004,
(34) Trimmer, J. S. Subcellular Localization of K+ Channels in
Mammalian Brain Neurons: Remarkable Precision in the Midst of
Extraordinary Complexity. Neuron 2015,85 (2), 238256.
(35) Gray, K. A.; Yates, B.; Seal, R. L.; Wright, M. W.; Bruford, E. A. The HGNC Resources in 2015. Nucleic Acids Res.
2015,43 (D1), D1079D1085.
(36) Boser, B. E.; Guyon, I. M.; Vapnik, V. N. A Training Algorithm
for Optimal Margin Classifiers. Proc. 5th Annu. Workshop Comput.
Learn. Theory - COLT 92 1992, 144.
(37) Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn.
1995, 20 (3), 273–297.
(38) Hinton, G. E.; Salakhutdinov, R. R. Reducing the Dimensionality
of Data with Neural Networks. Science 2006, 313 (5786), 504
(39) Zeiler, M. D. ADADELTA: An Adaptive Learning Rate Method.
arXiv 2012, 6.
(40) Glorot, X.; Bengio, Y. Understanding the difficulty of training
deep feedforward neural networks. Int. Conf. Artif. Intell. Stat. 2010,
... However, some MeSH terms, such as 'tumor suppressor protein p53' (D016159) in category D, refer to the protein that is directly implicated in disease. To ensure that the chemicals used are all drugs, we retained MeSH terms in D category under the top terms 'Chemical Actions and Uses', 'Organic Chemicals' and 'Compounds, Heterocyclic', which had been successfully used in drug analysis [23]. ...
... To test whether the inclusion of unclassified drugs without definite 'therapeutic uses' annotation in MeSH [23] would affect the performance of DSEATM, we compared the performance metrics between 'Classified drugs' and 'All drugs' (classified and unclassified drugs). The average recall rate (0.73 versus 0.68, P = 2.9e−09) and AUC score (0.67 versus 0.64, P < 2.2e−16) of 'All drugs' were both higher than those of 'Classified drugs' (Supplementary Figure S7b, Figure 5A). ...
Disease pathogenesis is always a major topic in biomedical research. With the exponential growth of biomedical information, drug effect analysis for specific phenotypes has shown great promise in uncovering disease-associated pathways. However, this method has only been applied to a limited number of drugs. Here, we extracted the data of 4634 diseases, 3671 drugs, 112 809 disease-drug associations and 81 527 drug-gene associations by text mining of 29 168 919 publications. On this basis, we proposed a 'Drug Set Enrichment Analysis by Text Mining (DSEATM)' pipeline and applied it to 3250 diseases, which outperformed the state-of-the-art method. Furthermore, disease pathways enriched by DSEATM were similar to those obtained using the TCGA cancer RNA-seq differentially expressed genes. In addition, the drug number, which showed a remarkable positive correlation of 0.73 with the AUC, plays a determining role in the performance of DSEATM. Taken together, DSEATM is an auspicious and accurate disease research tool that offers fresh insights.
... We prepared the drug selection dataset for the COVID-19 treatment from different sources including drug banks, news reports, and the existing literature [33][34][35][36][37]. Primarily, potential drugs recommended for COVID-19 by the World Health Organization (WHO) including Remdesivir, Umifenovir, acetaminophen, and Favipiravir were considered. Later, the data from studies [34][35][36][37] helped us to create interactions between the drugs and the collected clinical data. Other demographic patient information including gender, age, height, weight, exercise habits, country, food habits, COVID-19 infected data, and the co-morbidities were collected and are presented in Table 1. ...
The importance of online recommender systems for drugs, medical professionals, and hospitals is growing. Today, the majority of people use online consultations for drug recommendations for all types of health issues. Emergencies such as pandemics, floods, or cyclones can be helped by the medical recommender system. In the era of machine learning (ML), recommender systems produce more accurate, quick, and reliable clinical predictions with minimal costs. As a result, these systems maintain better performance, integrity, and privacy of patient data in the decision-making process and provide precise information at any time. Therefore, we present drug recommender systems with a stacked artificial neural network (ANN) model to improve the fairness and safety of treatment for infectious diseases. To reduce side effects, drugs are recommended based on a patient’s previous health profile, lifestyle, and habits. The proposed system produced results with 97.5% accuracy. A system such as this could be useful in recommending safe medicines to patients, especially during health emergencies.
... The advance of generative adversarial networks (GANs) accelerated the process of target discovery using transcriptomic data and de novo molecular design (Aliper et al., 2016;West et al., 2018;Vanhaelen et al., 2020). PandaOmics was a cloud-based target discovery platform that incorporated multiple scores developed using transcriptomic and proteomic data, text data including grants, scientific literature, publications, patents, stock reports, molecular data, as well as multiple meta-data repositories. ...
Amyotrophic lateral sclerosis (ALS) is a severe neurodegenerative disease with ill-defined pathogenesis, calling for urgent developments of new therapeutic regimens. Herein, we applied PandaOmics, an AI-driven target discovery platform, to analyze the expression profiles of central nervous system (CNS) samples (237 cases; 91 controls) from public datasets, and direct iPSC-derived motor neurons (diMNs) (135 cases; 31 controls) from Answer ALS. Seventeen high-confidence and eleven novel therapeutic targets were identified and will be released onto ALS.AI. Among the proposed targets screened in the c9ALS Drosophila model, we verified 8 unreported genes (KCNB2, KCNS3, ADRA2B, NR3C1, P2RY14, PPP3CB, PTPRC, and RARA) whose suppression strongly rescues eye neurodegeneration. Dysregulated pathways identified from CNS and diMN data characterize different stages of disease development. Altogether, our study provides new insights into ALS pathophysiology and demonstrates how AI speeds up the target discovery process, and opens up new opportunities for therapeutic interventions.
... In 2016, Mayr et al. [34] developed the DeepTox pipeline, winner of the Tox21 Data Challenge on a dataset comprising 12,000 compounds. They used deep learning to predict the toxic effects of compounds, which were described by a large number of descriptors. DeepTox obtained the best performance among all the computational methods it was compared with on nine of the fifteen targets, and the authors indicated that a single-task DNN is less efficient than a multitask DNN. In [35], Aliper et al. presented a method based on a deep neural network to classify drugs into therapeutic categories; gene-level data and transcriptomic profiles from different cell lines are used as input to classify drugs into 12 therapeutic categories such as functionality, efficacy, and toxicity. Wallach et al. [36] suggested an efficient CNN-based method named AtomNet, which aims to predict the bioactivity of small molecules by combining ligand and target structure information as input data. The limitation of this approach is that it requires knowledge of ligand complex structures, which contain the coordinates of each atom in the binding site of the receptor. ...
Predicting biological activity and molecular properties is one of the most important goals in the pharmaceutical and bioinformatics field in order to discover potential new drugs. Although machine learning methods have been used in drug discovery for a long time with good efficacy, the use of deep learning has proven its superiority in most cases. In this paper, we present a virtual screening procedure based on deep learning that aims to classify a set of chemical compounds as regards their biological activity on a particular receptor. The molecules are described with 2D pharmacophore fingerprints, which use the coordinates of atoms in 2D space to calculate them. Two deep learning models are proposed, the first is a deep neural network that uses as input data the fingerprint represented as a 1D vector. The second model is a convolutional neural network that uses the same input data after reshaping it into a 2D vector. Our models were trained on a dataset of active and inactive chemical compounds on cyclin A kinase1 receptor a very important protein family. The results have proven that the proposed models are efficient and comparable with some widely used machine learning methods in drug discovery.
Artificial intelligence (AI) is composed of a number of supervised and unsupervised computational learning techniques. At present, cancer research utilizes some of these techniques to analyze and cluster large datasets, such as those resulting from genomic or proteomic analyses. AI is also beginning to enable automation of research tasks, increasing efficiency and reproducibility in classifications, and extending capabilities in basic science and clinical research beyond what is readily achievable by human investigators alone. The complex and nonlinear interactions between cancer and its environment can be modeled with AI techniques, allowing new discoveries in the molecular pathways and signaling networks leading to cancer and its metastasis to local and distant sites. Meanwhile, in cancer drug discovery and clinical trials, AI models predict the chemical, biologic, and pharmaceutical properties of small-molecule candidates and new applications of existing drugs and identify characteristics of patients in whom higher drug efficacy might be expected in clinical trials. Through facilitating analysis of sources of big data and increasing the rate of discovery in cancer research, AI promises to drive the field of cancer research forward at an exponential rate over the upcoming decade.
Repositioned or repurposed drugs account for a substantial part of the drugs entering the approval pipeline, which indicates that drug repositioning has huge market potential and value. Computational technologies such as machine learning methods have accelerated the process of drug repositioning in the last few decades. The repositioning potential of Type 2 Diabetes Mellitus (T2DM) drugs for various diseases such as cancer, neurodegenerative diseases, and cardiovascular diseases has been widely studied. Hence, a summary of antidiabetic drug repurposing is of great significance. In this review, we focus on the machine learning methods for the development of new T2DM drugs and give an overview of the repurposing potential of the existing antidiabetic agents.
Rare diseases are a group of unusual pathologies in the world population, hence their name. They are considered the great neglected field of pharmaceutical research. To date, over 6,000 rare diseases have been identified and most of them lack treatment. The fact that they are so rare in the population does not encourage research efforts since their treatments are not in high demand. This work aims to analyze potential drug repositioning strategies that could be applied to these types of diseases. That is, discovering if existing drugs currently used for treating certain diseases can be employed to treat rare diseases. This process has been carried out using computational methods that compute similarities between rare diseases and other diseases, considering biological characteristics such as genes, proteins, and symptoms. The obtained potential drug repositioning hypotheses have been contrasted with related clinical trials found in scientific literature published to date.
The development of new molecules is a multi-stage process and clinical trials to verify their efficacy cost billions of dollars each year. Machine learning is a tool that is rapidly advancing in image, voice, and text recognition, and working in silico would increase the ability to predict and prioritize a drug's function. In this research we asked whether the function of therapeutic drugs can be predicted from the stereochemical configuration of the molecule. We use convolutional neural networks to predict the therapeutic use of drugs, trained with both two-dimensional and three-dimensional information of their chemical structure. The model trained with only six views of the 3D information of the molecular structure improved the accuracy by 10 over the model trained with the 2D information.
Since the advent of high-throughput omics technologies, various molecular data such as genes, transcripts, proteins, and metabolites have been made widely available to researchers. This has afforded clinicians, bioinformaticians, statisticians, and data scientists the opportunity to apply their innovations in feature mining and predictive modeling to a rich data resource to develop a wide range of generalizable prediction models. What has become apparent over the last 10 years is that researchers have adopted deep neural networks (or “deep nets”) as their preferred paradigm of choice for complex data modeling due to the superiority of performance over more traditional statistical machine learning approaches, such as support vector machines. A key stumbling block, however, is that deep nets inherently lack transparency and are considered to be a “black box” approach. This naturally makes it very difficult for clinicians and other stakeholders to trust their deep learning models even though the model predictions appear to be highly accurate. In this chapter, we therefore provide a detailed summary of the deep net architectures typically used in omics research, together with a comprehensive summary of the notable “deep feature mining” techniques researchers have applied to open up this black box and provide some insights into the salient input features and why these models behave as they do. We group these techniques into the following three categories: (a) hidden layer visualization and interpretation; (b) input feature importance and impact evaluation; and (c) output layer gradient analysis. 
While we find that omics researchers have made some considerable gains in opening up the black box through interpretation of the hidden layer weights and node activations to identify salient input features, we highlight other approaches for omics researchers, such as employing deconvolutional network-based approaches and development of bespoke attribute impact measures to enable researchers to better understand the relationships between the input data and hidden layer representations formed and thus the output behavior of their deep nets.
Intrinsically disordered proteins or protein regions are involved in key biological processes including regulation of transcription, signal transduction, and alternative splicing. Accurately predicting order/disorder regions ab initio from the protein sequence is a prerequisite step for further analysis of functions and mechanisms for these disordered regions. This work presents a learning method, weighted DeepCNF (Deep Convolutional Neural Fields), to improve the accuracy of order/disorder prediction by exploiting the long-range sequential information and the interdependency between adjacent order/disorder labels and by assigning different weights for each label during training and prediction to solve the label imbalance issue. Evaluated by the CASP9 and CASP10 targets, our method obtains 0.855 and 0.898 AUC values, which are higher than the state-of-the-art single ab initio predictors.
A new generation of anticancer therapeutics called target drugs has quickly developed in the 21st century. These drugs are tailored to inhibit cancer cell growth, proliferation, and viability by specific interactions with one or a few target proteins. However, despite formally known molecular targets for every "target" drug, patient response to treatment remains largely individual and unpredictable. Choosing the most effective personalized treatment remains a major challenge in oncology and is still largely trial and error. Here we present a novel approach for predicting target drug efficacy based on the gene expression signature of the individual tumor sample(s). The enclosed bioinformatic algorithm detects activation of intracellular regulatory pathways in the tumor in comparison to the corresponding normal tissues. According to the nature of the molecular targets of a drug, it predicts whether the drug can prevent cancer growth and survival in each individual case by blocking the abnormally activated tumor-promoting pathways or by reinforcing internal tumor suppressor cascades. To validate the method, we compared the distribution of predicted drug efficacy scores for five drugs (Sorafenib, Bevacizumab, Cetuximab, Sorafenib, Imatinib, Sunitinib) and seven cancer types (Clear Cell Renal Cell Carcinoma, Colon cancer, Lung adenocarcinoma, non-Hodgkin Lymphoma, Thyroid cancer and Sarcoma) with the available clinical trials data for the respective cancer types and drugs. The percent of responders to a drug treatment correlated significantly (Pearson's correlation 0.77 p = 0.023) with the percent of tumors showing high drug scores calculated with the current algorithm.
Effective choice of anticancer drugs is an important problem of modern medicine. We developed a method termed OncoFinder for the analysis of a new type of biomarker reflecting activation of intracellular signaling and metabolic molecular pathways. These biomarkers may be linked with the sensitivity to anticancer drugs. In this study, we compared the experimental data obtained in our laboratory and in the Genomics of Drug Sensitivity in Cancer (GDS) project for testing response to anticancer drugs and transcriptomes of various human cell lines. The microarray-based profiling of transcriptomes was performed for the cell lines before the addition of drugs to the medium, and experimental growth inhibition curves were built for each drug, featuring characteristic IC50 values. We assayed here four target drugs (Pazopanib, Sorafenib, Sunitinib and Temsirolimus) and 238 different cell lines, of which 11 were profiled in our laboratory and 227 in the GDS project. Using the OncoFinder-processed transcriptomic data on ~600 molecular pathways, we identified pathways showing significant correlation between pathway activation strength (PAS) and IC50 values for these drugs. Correlations reflect relationships between response to drug and pathway activation features. We intersected the results and found molecular pathways significantly correlated in both our assay and the GDS project. For most of these pathways, we generated molecular models of their interaction with known molecular target(s) of the respective drugs. For the first time, our study uncovered mechanisms underlying cancer cell response to drugs at the high-throughput molecular interactomic level.
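The correlation step mentioned in this abstract (pathway activation strength against IC50 across cell lines) can be illustrated with a minimal sketch. The abstract does not state which correlation measure was used, so Pearson's coefficient is an assumption here. The values below are hypothetical placeholders, not data from the study, and `pearson_r` is an illustrative helper, not part of OncoFinder.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


# Hypothetical example: one pathway's PAS and one drug's log-IC50
# measured in five cell lines (made-up numbers for illustration only).
pas = [1.0, 2.0, 3.0, 4.0, 5.0]
log_ic50 = [2.1, 3.9, 6.2, 8.1, 9.8]

r = pearson_r(pas, log_ic50)
print(f"Pearson r between PAS and log-IC50: {r:.3f}")
```

In a real analysis this calculation would be repeated for every pathway and drug pair, with a significance test and multiple-testing correction before any pathway is called correlated.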
The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensures high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
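The central idea of this abstract, mapping inputs non-linearly into a feature space where a linear decision surface suffices, can be shown with a toy sketch. It uses an explicit degree-2 polynomial map and a hand-picked weight vector. Both are assumptions chosen for illustration: this is not a trained support-vector machine, and no margin optimization is performed.

```python
def poly_features(x1, x2):
    """Explicit degree-2 polynomial map: (x1, x2) -> (x1, x2, x1 * x2)."""
    return (x1, x2, x1 * x2)


# XOR-style problem: no straight line in the original 2-D input space
# separates the two classes.
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [1, -1, -1, 1]  # +1 when both coordinates share a sign

# Hand-picked (not learned) linear surface in the 3-D feature space:
# score = w . phi(x) + b, with all weight on the x1 * x2 coordinate.
w = (0.0, 0.0, 1.0)
b = 0.0


def decide(x1, x2):
    """Classify a point with a linear rule applied in feature space."""
    phi = poly_features(x1, x2)
    score = sum(wi * pi for wi, pi in zip(w, phi)) + b
    return 1 if score >= 0 else -1


predictions = [decide(x1, x2) for x1, x2 in points]
print(predictions)
```

A support-vector machine would learn `w` and `b` from the data by maximizing the margin, and would typically use a kernel function instead of computing the feature map explicitly.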
Increases in throughput and installed base of biomedical research equipment led to a massive accumulation of -omics data known to be highly variable, high-dimensional, and sourced from multiple often incompatible data platforms. While this data may be useful for biomarker identification and drug discovery, the bulk of it remains underutilized. Deep neural networks (DNNs) are efficient algorithms based on the use of compositional layers of neurons, with advantages well matched to the challenges -omics data presents. While achieving state-of-the-art results and even surpassing human accuracy in many challenging tasks, the adoption of deep learning in biomedicine has been comparatively slow. Here, we discuss key features of deep learning that may give this approach an edge over other machine learning methods. We then consider limitations and review a number of applications of deep learning in biomedical studies demonstrating proof of concept and practical utility.
Predicting the interactions between proteins (targets) and small molecules (ligands) is a critical task for in silico drug discovery. In this work, we consider the target binding site instead of the whole target and propose a pairwise input neural network (PINN) for constructing the site-ligand interaction prediction model. Unlike the ordinary artificial neural network (ANN), which takes one vector as input, the proposed PINN can accept a pair of vectors as the input, corresponding to a binding site and a ligand respectively. The 5-CV evaluation results show that PINN outperforms other representative target-ligand interaction prediction methods.
Drug-induced liver injury (DILI) has been the single most frequent cause of safety-related drug marketing withdrawals for the past 50 years. Recently, deep learning (DL) has been successfully applied in many fields due to its exceptional and automatic learning ability. In this study, DILI prediction models were developed using DL architectures, and the best model trained on 475 drugs predicted an external validation set of 198 drugs with an accuracy of 86.9%, sensitivity of 82.5%, specificity of 92.9%, and area under the curve of 0.955, which is better than the performance of previously described DILI prediction models. Furthermore, with deep analysis, we also identified important molecular features that are related to DILI. Such DL models could improve the prediction of DILI risk in humans. The DL DILI prediction models are freely available at