NeoDTI: neural integration of neighbor information from a heterogeneous network for discovering new drug-target interactions

Article in Bioinformatics 35(1):104-111 · January 2019
Abstract
Motivation: Accurately predicting drug-target interactions (DTIs) in silico can guide the drug discovery process and thus facilitate drug development. Computational approaches for DTI prediction that adopt the systems biology perspective generally exploit the rationale that the properties of drugs and targets can be characterized by their functional roles in biological networks.
Results: Inspired by recent advances in information passing and aggregation techniques that generalize convolutional neural networks to mine large-scale graph data and greatly improve the performance of many network-related prediction tasks, we develop a new nonlinear end-to-end learning model, called NeoDTI, that integrates diverse information from heterogeneous network data and automatically learns topology-preserving representations of drugs and targets to facilitate DTI prediction. The substantial improvement in prediction performance over other state-of-the-art DTI prediction methods, as well as several novel predicted DTIs supported by evidence from previous studies, demonstrates the superior predictive power of NeoDTI. In addition, NeoDTI is robust against a wide range of hyperparameter choices and is ready to integrate more drug- and target-related information (e.g. compound-protein binding affinity data). All these results suggest that NeoDTI can offer a powerful and robust tool for drug development and drug repositioning.
Availability and implementation: The source code and data used in NeoDTI are available at: https://github.com/FangpingWan/NeoDTI.
Supplementary information: Supplementary data are available at Bioinformatics online.
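The idea of integrating neighbor information can be illustrated with a toy sketch. This is not the authors' implementation: the published model learns its parameters end to end, whereas the sketch below has no trainable parameters and assumes a plain mean over neighbors, a tanh nonlinearity, and an inner-product score. The node names (`drug_a`, `target_x`, and so on), the two-dimensional embeddings, and the helper functions `aggregate` and `score` are all hypothetical.

```python
import math

def aggregate(node, embeddings, neighbors):
    """Mix a node's embedding with the mean of its neighbors' embeddings."""
    own = embeddings[node]
    nbrs = neighbors.get(node, [])
    mean = [0.0] * len(own)
    for n in nbrs:
        for i, value in enumerate(embeddings[n]):
            mean[i] += value / len(nbrs)
    # Nonlinearity, then L2 normalization so scores stay comparable.
    updated = [math.tanh(o + m) for o, m in zip(own, mean)]
    norm = math.sqrt(sum(u * u for u in updated)) or 1.0
    return [u / norm for u in updated]

def score(drug_vec, target_vec):
    """Inner product used here as a stand-in interaction score."""
    return sum(d * t for d, t in zip(drug_vec, target_vec))

# Hypothetical starting embeddings (assumed values, not learned).
embeddings = {
    "drug_a": [1.0, 0.0],
    "drug_b": [0.0, 1.0],
    "drug_c": [0.5, 0.5],   # new drug with no known target
    "target_x": [1.0, 0.0],
    "target_y": [0.0, 1.0],
}

# Hypothetical heterogeneous network: drug-target and drug-drug edges.
neighbors = {
    "drug_a": ["target_x"],
    "drug_b": ["target_y"],
    "drug_c": ["drug_a"],   # drug_c is only known to be similar to drug_a
    "target_x": ["drug_a"],
    "target_y": ["drug_b"],
}

# One synchronous round of neighborhood aggregation.
updated = {node: aggregate(node, embeddings, neighbors) for node in embeddings}

before = (score(embeddings["drug_c"], embeddings["target_x"]),
          score(embeddings["drug_c"], embeddings["target_y"]))
after = (score(updated["drug_c"], updated["target_x"]),
         score(updated["drug_c"], updated["target_y"]))
```

Before aggregation, `drug_c` scores both targets equally; after one round it leans toward `target_x`, the target of its neighbor `drug_a`. That is the intuition behind using network neighborhoods to rank candidate interactions for drugs with few or no known targets.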

  • ... Compared to previous works, we focus on the special topic of machine learning methods used in DTI prediction. Besides, we utilize a hierarchical classification scheme and summarize several latest prediction methods such as [20][21][22][23] which are hardly mentioned in any previous review. In particular, review [17] is written only from a narrow viewpoint, namely similarity-based approaches, which are a subclass of machine learning methods. ...
    Article
    Identifying drug-target interactions will greatly narrow down the scope of search of candidate medications, and thus can serve as the vital first step in drug discovery. Considering that in vitro experiments are extremely costly and time-consuming, high efficiency computational prediction methods could serve as promising strategies for drug-target interaction (DTI) prediction. In this review, our goal is to focus on machine learning approaches and provide a comprehensive overview. First, we summarize a brief list of databases frequently used in drug discovery. Next, we adopt a hierarchical classification scheme and introduce several representative methods of each category, especially the recent state-of-the-art methods. In addition, we compare the advantages and limitations of methods in each category. Lastly, we discuss the remaining challenges and future outlook of machine learning in DTI prediction. This article may provide a reference and tutorial insights on machine learning-based DTI prediction for future researchers.
  • ... Moreover, there are a few other interesting methods, such as the text mining-based method [89] and a two-layer graphical model (called restricted Boltzmann machine) [90]. More recently, a number of deep learning-based methods have been developed for DTI prediction [91][92][93][94][95]. ...
    Article
    Drug–target interactions (DTIs) play a crucial role in target-based drug discovery and development. Computational prediction of DTIs can effectively complement experimental wet-lab techniques for the identification of DTIs, which are typically time- and resource-consuming. However, the performances of the current DTI prediction approaches suffer from a problem of low precision and high false-positive rate. In this study, we aim to develop a novel DTI prediction method for improving the prediction performance based on a cascade deep forest (CDF) model, named DTI-CDF, with multiple similarity-based features between drugs and the similarity-based features between target proteins extracted from the heterogeneous graph, which contains known DTIs. In the experiments, we built five replicates of 10-fold cross-validation under three different experimental settings of data sets, namely, corresponding DTI values of certain drugs (SD), targets (ST), or drug-target pairs (SP) in the training sets are missed but existed in the test sets. The experimental results demonstrate that our proposed approach DTI-CDF achieves a significantly higher performance than that of the traditional ensemble learning-based methods such as random forest and XGBoost, deep neural network, and the state-of-the-art methods such as DDR. Furthermore, there are 1352 newly predicted DTIs which are proved to be correct by KEGG and DrugBank databases. The data sets and source code are freely available at https://github.com//a96123155/DTI-CDF.
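The three evaluation settings named in the abstract above (SD, ST, SP) differ in what is withheld from training: whole drugs, whole targets, or individual drug-target pairs. A minimal sketch of one such split follows. It is an illustration only, not the DTI-CDF code: the function `split_pairs`, its parameters, and the toy identifiers are assumptions, and it produces a single train/test split rather than the five replicates of 10-fold cross-validation described in the abstract.

```python
import random

def split_pairs(drugs, targets, setting, test_fraction=0.2, seed=0):
    """Split all drug-target pairs into train and test lists.

    setting = "SD": every pair of a held-out drug goes to the test set.
    setting = "ST": every pair of a held-out target goes to the test set.
    setting = "SP": individual pairs are held out at random.
    """
    rng = random.Random(seed)
    pairs = [(d, t) for d in drugs for t in targets]

    if setting == "SD":
        held = set(rng.sample(drugs, max(1, int(len(drugs) * test_fraction))))
        is_test = lambda pair: pair[0] in held
    elif setting == "ST":
        held = set(rng.sample(targets, max(1, int(len(targets) * test_fraction))))
        is_test = lambda pair: pair[1] in held
    elif setting == "SP":
        held = set(rng.sample(pairs, max(1, int(len(pairs) * test_fraction))))
        is_test = lambda pair: pair in held
    else:
        raise ValueError("setting must be 'SD', 'ST' or 'SP'")

    train = [p for p in pairs if not is_test(p)]
    test = [p for p in pairs if is_test(p)]
    return train, test

# Toy identifiers (hypothetical).
drugs = ["d%d" % i for i in range(10)]
targets = ["t%d" % i for i in range(5)]

train_sd, test_sd = split_pairs(drugs, targets, "SD")
train_st, test_st = split_pairs(drugs, targets, "ST")
train_sp, test_sp = split_pairs(drugs, targets, "SP")
```

Under SD and ST the model is evaluated on drugs or targets it has never seen during training, which is generally a harder test than SP, where both the drug and the target of a test pair may still appear in other training pairs.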
  • ... Meanwhile, deep learning has clearly demonstrated its power in promoting the bioinformatics field [104,5,18], including sequence analysis [198,3,25,156,87,6,80,169,157,158,175], structure prediction and reconstruction [167,90,38,168,180,196,170], biomolecular property and function prediction [85,204,75,4], biomedical image processing and diagnosis [35,66,41,22,160], and biomolecule interaction prediction and systems biology [95,201,203,144,145,67,165,191]. Specifically, regarding sequence analysis, people have used deep learning to predict the effect of noncoding sequence variants [198,166], model the transcription factor binding affinity landscape [25,3,166], improve DNA sequencing [87,154] and peptide sequencing [156], analyze DNA sequence modification [143], and model various post-transcription regulation events, such as alternative polyadenylation [81], alternative splicing [80], transcription starting site [159,157], noncoding RNA [181,7] and transcript boundaries [139]. ...
    Preprint
    Deep learning, which is especially formidable in handling big data, has achieved great success in various fields, including bioinformatics. With the advances of the big data era in biology, it is foreseeable that deep learning will become increasingly important in the field and will be incorporated in vast majorities of analysis pipelines. In this review, we provide both the exoteric introduction of deep learning, and concrete examples and implementations of its representative applications in bioinformatics. We start from the recent achievements of deep learning in the bioinformatics field, pointing out the problems which are suitable to use deep learning. After that, we introduce deep learning in an easy-to-understand fashion, from shallow neural networks to legendary convolutional neural networks, legendary recurrent neural networks, graph neural networks, generative adversarial networks, variational autoencoder, and the most recent state-of-the-art architectures. After that, we provide eight examples, covering five bioinformatics research directions and all the four kinds of data type, with the implementation written in Tensorflow and Keras. Finally, we discuss the common issues, such as overfitting and interpretability, that users will encounter when adopting deep learning methods and provide corresponding suggestions. The implementations are freely available at https://github.com/lykaust15/Deep_learning_examples .
  • ... Inspired by popular neural-network-based approaches [45] and the latest advances in network embedding technologies [46], we employ NNMDA, which could accurately and efficiently predict miRNA-disease associations by integrating neighborhood information based on neural networks. Specifically, network embedding is an effective approach that aims at converting the network into a low-dimensional space while preserving the structural information of the network. ...
    Article
    Identifying disease-related microRNAs (miRNAs) is an essential but challenging task in bioinformatics research. Much effort has been devoted to discovering the underlying associations between miRNAs and diseases. However, most studies mainly focus on designing advanced methods to improve prediction accuracy while neglecting to investigate the link predictability of the relationships between miRNAs and diseases. In this work, we construct a heterogeneous network by integrating neighborhood information in the neural network to predict potential associations between miRNAs and diseases, which also considers the imbalance of datasets. We also employ a new computational method called a neural network model for miRNA-disease association prediction (NNMDA). This model predicts miRNA-disease associations by integrating multiple biological data resources. Comparison of our work with other algorithms reveals the reliable performance of NNMDA. Its average AUC score was 0.937 over 15 diseases in a 5-fold cross-validation, and its AUC was 0.8439 based on leave-one-out cross-validation. The results indicate that NNMDA could be used in evaluating the accuracy of miRNA-disease associations. Moreover, NNMDA was applied to two common human diseases in two types of case studies. In the first type, 26 out of the top 30 predicted miRNAs of lung neoplasms were confirmed by the experiments. In the second type of case study, for new diseases without any known related miRNAs, we selected breast neoplasms as the test example by hiding the association information between the miRNAs and this disease. The results verified 50 out of the top 50 predicted breast-neoplasm-related miRNAs.
  • ... Meanwhile, deep learning has clearly demonstrated its power in promoting the bioinformatics field [104,5,18], including sequence analysis [198,3,25,156,87,6,80,169,157,158,175], structure prediction and reconstruction [167,90,38,168,180,196,170], biomolecular property and function prediction [85,204,75,4], biomedical image processing and diagnosis [35,66,41,22,160], and biomolecule interaction prediction and systems biology [95,201,203,144,145,67,165,191]. Specifically, regarding sequence analysis, people have used deep learning to predict the effect of noncoding sequence variants [198,166], model the transcription factor binding affinity landscape [25,3,166], improve DNA sequencing [87,154] and peptide sequencing [156], analyze DNA sequence modification [143], and model various posttranscription regulation events, such as alternative polyadenylation [81], alternative splicing [80], transcription starting site [159,157], noncoding RNA [181,7] and transcript boundaries [139]. ...
    Article
    Deep learning, which is especially formidable in handling big data, has achieved great success in various fields, including bioinformatics. With the advances of the big data era in biology, it is foreseeable that deep learning will become increasingly important in the field and will be incorporated in vast majorities of analysis pipelines. In this review, we provide both the exoteric introduction of deep learning, and concrete examples and implementations of its representative applications in bioinformatics. We start from the recent achievements of deep learning in the bioinformatics field, pointing out the problems which are suitable to use deep learning. After that, we introduce deep learning in an easy-to-understand fashion, from shallow neural networks to legendary convolutional neural networks, legendary recurrent neural networks, graph neural networks, generative adversarial networks, variational autoencoder, and the most recent state-of-the-art architectures. After that, we provide eight examples, covering five bioinformatics research directions and all the four kinds of data type, with the implementation written in Tensorflow and Keras. Finally, we discuss the common issues, such as overfitting and interpretability, that users will encounter when adopting deep learning methods and provide corresponding suggestions. The implementations are freely available at https://github.com/lykaust15/Deep_learning_examples.
  • ... Then it finds the lowest level of the big matrix that reconstructs the large matrix. NeoDTI [23] predicts new drugs and drug targets by integrating various information in heterogeneous networks and conducting end-to-end learning through a nonlinear model. TL-HGBI [24] has proposed a computational framework, to infer novel treatments for diseases based on a heterogeneous network integrating similarity and association data about diseases, drugs and drug targets. ...
    Article
    Computational drug repositioning plays a vital role in the prediction of drug function. Many new functions discovered have been confirmed. In comparison with traditional drug repositioning, computational drug repositioning shortens the time and reduces labor. Thus, it has received wide attention in recent years. However, prediction remains a considerable challenge. In this paper, a method called HNRD is introduced to predict the link between drugs and diseases. It is based on neighborhood information aggregation in neural networks which combines the similarity of diseases and drugs, the associations between the drugs and diseases. Compared with the state-of-the-art method before, our method has achieved better results, with the best AUC of 0.97 in one of the golden datasets. To better evaluate our approach, we also performed data analysis based on one-to-one association’s prediction and robust analysis by testing on different datasets. All the results prove the excellent performance of prediction. Source codes of this paper are available on https://github.com/heibaipei/HNRD .
  • ... These embeddings are thus well-suited for predicting drug-target interactions by calculating the similarity between embeddings representing the drug and the protein, or by using embeddings as inputs to a machine learning method (Crichton et al., 2018). Alternatively, predictions can be made in an end-to-end fashion, where a neural network learns node embeddings and predicts interactions directly from the graph (Wang and Zeng, 2013;Gao et al., 2018;Wan et al., 2018). Detecting drug-drug interactions, in which the activity of one drug changes, favorably or unfavorably, if taken with another drug, is an important challenge with significant implications for patient mortality and morbidity (Chan and Giaccia, 2011;Guthrie et al., 2015;Han et al., 2017). ...
    Article
    Current technology is producing high throughput biomedical data at an ever-growing rate. A common approach to interpreting such data is through network-based analyses. Since biological networks are notoriously complex and hard to decipher, a growing body of work applies graph embedding techniques to simplify, visualize, and facilitate the analysis of the resulting networks. In this review, we survey traditional and new approaches for graph embedding and compare their application to fundamental problems in network biology with using the networks directly. We consider a broad variety of applications including protein network alignment, community detection, and protein function prediction. We find that in all of these domains both types of approaches are of value and their performance depends on the evaluation measures being used and the goal of the project. In particular, network embedding methods outshine direct methods according to some of those measures and are, thus, an essential tool in bioinformatics research.
  • ... The "deep learning revolution" largely enlightened by the October 2012 ImageNet victory [23] has transformed various industries in human society, including artificial intelligence, health care, online advertising, transportation, and robotics. As the most widely-used and mature model in deep learning, Deep Neural Network (DNN) [15] demonstrates superb performance in complex engineering tasks such as recommendation [9], bio-informatics [39], mastering difficult games like Go [33], and human pose estimation [37]. The capability of approximating continuous mappings and the desirable scalability make DNN a favorable choice in the arsenal of solving large-scale optimization and decision problems in various engineering systems. ...
    Preprint
    We develop DeepOPF as a Deep Neural Network (DNN) approach for solving direct current optimal power flow (DC-OPF) problems. DeepOPF is inspired by the observation that solving DC-OPF for a given power network is equivalent to characterizing a high-dimensional mapping between the load inputs and the dispatch and transmission decisions. We construct and train a DNN model to learn such mapping, then we apply it to obtain optimized operating decisions upon arbitrary load inputs. We adopt uniform sampling to address the over-fitting problem common in generic DNN approaches. We leverage a useful structure in DC-OPF to significantly reduce the mapping dimension, subsequently cutting down the size of our DNN model and the amount of training data/time needed. We also design a post-processing procedure to ensure the feasibility of the obtained solution. Simulation results of IEEE test cases show that DeepOPF always generates feasible solutions with negligible optimality loss, while speeding up the computing time by two orders of magnitude as compared to conventional approaches implemented in a state-of-the-art solver.
  • ... To test the reliability and robustness of NDD, we also assessed its performance on another dataset that is compiled by Wan et al. [39]. This dataset contains only the chemical similarity of drugs. ...
    Article
    Drug-Drug Interaction (DDI) prediction is one of the most critical issues in drug development and health. Proposing appropriate computational methods for predicting unknown DDI with high precision is challenging. We proposed "NDD: Neural network-based method for drug-drug interaction prediction" for predicting unknown DDIs using various information about drugs. Multiple drug similarities based on drug substructure, target, side effect, off-label side effect, pathway, transporter, and indication data are calculated. At first, NDD uses a heuristic similarity selection process and then integrates the selected similarities with a nonlinear similarity fusion method to achieve high-level features. Afterward, it uses a neural network for interaction prediction. The similarity selection and similarity integration parts of NDD have been proposed in previous studies of other problems. Our novelty is to combine these parts with new neural network architecture and apply these approaches in the context of DDI prediction. We compared NDD with six machine learning classifiers and six state-of-the-art graph-based methods on three benchmark datasets. NDD achieved superior performance in cross-validation with AUPR ranging from 0.830 to 0.947, AUC from 0.954 to 0.994 and F-measure from 0.772 to 0.902. Moreover, cumulative evidence in case studies on numerous drug pairs, further confirm the ability of NDD to predict unknown DDIs. The evaluations corroborate that NDD is an efficient method for predicting unknown DDIs. The data and implementation of NDD are available at https://github.com/nrohani/NDD.
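Several of the abstracts in this list report AUC and F-measure. As a reminder of what those numbers measure, here is a small self-contained sketch, assuming binary labels (1 for a known interaction, 0 otherwise) and a fixed decision threshold of 0.5 for the F-measure. The function names and the toy labels and scores are illustrative and do not come from any of the cited papers. AUPR is omitted for brevity.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive is scored
    above a randomly chosen negative (ties count as one half).
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f_measure(labels, scores, threshold=0.5):
    """F1 score after thresholding the scores into binary predictions."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy example: four candidate pairs with made-up scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

toy_auc = auc(labels, scores)
toy_f1 = f_measure(labels, scores)
```

For these four toy pairs the AUC is 0.75 (three of the four positive-negative comparisons are ranked correctly) and the F-measure is 0.8 (two true positives, one false positive, no false negatives). AUC depends only on the ranking of scores, while the F-measure depends on the chosen threshold.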
  • ... Previous studies on drug repositioning using computational approaches have revolutionized the discovery of new uses for existing drugs (25). This has led to the development of several computational tools (26,27), some of which use transcriptomic data (28) or even adverse reaction databases (29), among others. However, the use of computational approaches for food bioactive, nutritional or nutrigenomics studies is scarce. ...
    Article
    Habitual consumption of certain foods has shown beneficial and protective effects against multiple chronic diseases. However, it is not clear by which molecular mechanisms they may exert their beneficial effects. Multiple -omic experiments available in public databases have generated gene expression data following the treatment of human cells with different food nutrients and bioactive compounds. Exploration of such data in an integrative manner offers excellent possibilities for gaining insights into the molecular effects of food compounds and bioactive molecules at the cellular level. Here we present NutriGenomeDB, a web-based application that hosts manually curated gene sets defined from gene expression signatures, after differential expression analysis of nutrigenomics experiments performed on human cells available in the Gene Expression Omnibus (GEO) repository. Through its web interface, users can explore gene expression data with interactive visualizations. In addition, external gene signatures can be connected with nutrigenomics gene sets using a gene pattern-matching algorithm. We further demonstrate how the application can capture the primary molecular mechanisms of a drug used to treat hypertension and thus connect its mode of action with hosted food compounds.
  • ... The "deep learning revolution" largely enlightened by the October 2012 ImageNet victory [1] has transformed various industries in human society, including artificial intelligence, health care, online advertising, transportation, and robotics. As the most widely-used and mature model in deep learning, Deep Neural Network (DNN) [2] demonstrates superb performance in complex engineering tasks such as recommendation [3], bio-informatics [4], mastering difficult games like Go [5], and human pose estimation [6]. The capability of approximating continuous mappings and the desirable scalability make DNN a favorable choice in the arsenal of solving large-scale optimization and decision problems in engineering systems. ...
    Preprint
    We develop DeepOPF as a Deep Neural Network (DNN) approach for solving security-constrained direct current optimal power flow (SC-DCOPF) problems, which are critical for reliable and cost-effective power system operation. DeepOPF is inspired by the observation that solving the SC-DCOPF problem for a given power network is equivalent to depicting a high-dimensional mapping between load inputs and generation and phase-angle outputs. We first construct and train a DNN to learn the mapping between the load inputs and the generations. We then directly compute the phase angles from the generations and loads by using the (linearized) power flow equations. Such a two-step procedure significantly reduces the dimension of the mapping to learn, subsequently cutting down the size of the DNN and the amount of training data/time needed. We further characterize a condition that allows us to tune the size of our neural network according to the desired approximation accuracy of the load-to-generation mapping. Simulation results of IEEE test cases show that DeepOPF always generates feasible solutions with negligible optimality loss, while speeding up the computing time by up to 400x as compared to a state-of-the-art solver.
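The second step of the two-step procedure described above, recovering phase angles from generations and loads through the linearized power flow equations, can be worked through on a tiny example. The sketch below is an assumption-laden illustration rather than DeepOPF itself: the three-bus network, the line susceptances, the load values, and the `predicted_generation` numbers (standing in for the output of a trained DNN) are all made up, and bus 0 is taken as the slack bus with its angle fixed at zero.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ x = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("singular matrix")
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return [x0, x1]

# Hypothetical 3-bus network: each line is (from_bus, to_bus, susceptance).
lines = [(0, 1, 10.0), (0, 2, 10.0), (1, 2, 10.0)]

# Per-unit loads and a stand-in for the DNN's predicted generation.
load = [0.0, 0.0, 1.0]
predicted_generation = [0.5, 0.5, 0.0]

# Build the bus susceptance matrix B for the DC power flow model.
n = 3
B = [[0.0] * n for _ in range(n)]
for i, j, b_ij in lines:
    B[i][i] += b_ij
    B[j][j] += b_ij
    B[i][j] -= b_ij
    B[j][i] -= b_ij

# Net injection at each bus: generation minus load.
injection = [g - d for g, d in zip(predicted_generation, load)]

# Remove the slack bus (bus 0, angle fixed at 0) and solve B_red * theta = P.
B_red = [[B[1][1], B[1][2]], [B[2][1], B[2][2]]]
theta = [0.0] + solve_2x2(B_red, injection[1:])

# Line flows follow directly from the angle differences.
flows = {(i, j): b_ij * (theta[i] - theta[j]) for i, j, b_ij in lines}
```

Because the angles and line flows follow from a linear solve, only the load-to-generation mapping has to be learned, which is the dimension reduction the abstract refers to. In this toy case the solve gives angles of 0, 0 and -0.05 rad, so 0.5 p.u. flows into bus 2 over each of its two lines, matching its 1.0 p.u. load.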
  • ... Feature-based AI/ML methods can be integrated with other approaches to construct an "Ensemble system", as presented in Ezzat et al. (2016), Jiang et al. (2017), and Rayhan et al. (2019). Thus, several comprehensive recent reviews summarized the different studies that predict DTIs using various techniques covering structure-based, similarity-based, network-based, and AI/ML-based methods, as presented in Ezzat et al. (2017, 2018, 2019), Rayhan et al. (2017), Trosset and Cavé (2019), and Wan et al. (2019). Other reviews focused on one aspect, namely similarity-based methods (Ding et al., 2014; Kurgan and Wang, 2018) or feature-based methods (Gupta, 2017). ...
    Article
    The drug development is generally arduous, costly, and success rates are low. Thus, the identification of drug-target interactions (DTIs) has become a crucial step in early stages of drug discovery. Consequently, developing computational approaches capable of identifying potential DTIs with minimum error rate are increasingly being pursued. These computational approaches aim to narrow down the search space for novel DTIs and shed light on drug functioning context. Most methods developed to date use binary classification to predict if the interaction between a drug and its target exists or not. However, it is more informative but also more challenging to predict the strength of the binding between a drug and its target. If that strength is not sufficiently strong, such DTI may not be useful. Therefore, the methods developed to predict drug-target binding affinities (DTBA) are of great value. In this study, we provide a comprehensive overview of the existing methods that predict DTBA. We focus on the methods developed using artificial intelligence (AI), machine learning (ML), and deep learning (DL) approaches, as well as related benchmark datasets and databases. Furthermore, guidance and recommendations are provided that cover the gaps and directions of the upcoming work in this research area. To the best of our knowledge, this is the first comprehensive comparison analysis of tools focused on DTBA with reference to AI/ML/DL.
  • ... For example, in [37] they developed a Deep Belief Network (DBN) model constructed by stacking Restricted Boltzmann Machines (RBMs). Instead of using DBN, a nonlinear end-to-end learning model named NeoDTI [34] was proposed. NeoDTI integrates variety of information from heterogeneous network data and uses topology-preserving based representations of drugs and targets to facilitate drug-target prediction. ...
    Preprint
    Accurately predicting drug-target binding affinity (DTA) in silico is a key task in drug discovery. Most of the conventional DTA prediction methods are simulation-based, which rely heavily on domain knowledge or the assumption of having the 3D structure of the targets, which are often difficult to obtain. Meanwhile, traditional machine learning-based methods apply various features and descriptors, and simply depend on the similarities between drug-target pairs. Recently, with the increasing amount of affinity data available and the success of deep representation learning models on various domains, deep learning techniques have been applied to DTA prediction. However, these methods consider either label/one-hot encodings or the topological structure of molecules, without considering the local chemical context of amino acids and SMILES sequences. Motivated by this, we propose a novel end-to-end learning framework, called DeepGS, which uses deep neural networks to extract the local chemical context from amino acids and SMILES sequences, as well as the molecular structure from the drugs. To assist the operations on the symbolic data, we propose to use advanced embedding techniques (i.e., Smi2Vec and Prot2Vec) to encode the amino acids and SMILES sequences to a distributed representation. Meanwhile, we suggest a new molecular structure modeling approach that works well under our framework. We have conducted extensive experiments to compare our proposed method with state-of-the-art models including KronRLS, SimBoost, DeepDTA and DeepCPI. Extensive experimental results demonstrate the superiorities and competitiveness of DeepGS.
  • ... Eq. (3) indicates that, after the network-specific projections by , the inner product of and should reconstruct the original edge weight, , as much as possible. It is noted that a similar reconstruction strategy has been used in [37], [38] to solve the problem of predicting interactions. To preserve the structural features of Heterogeneous Networks (HN), the gene representations and the projection matrices are continually updated by minimizing the difference between the reconstruction and the observed edges. ...
    Article
    Molecular networks embrace diverse biological and functional associations between genes and gene products, which are conducive to identifying novel genes and pathways of a specific disease phenotype. Although great progress has been achieved in high-throughput interactome mapping, the associations among genes are still incomplete, which causes the sparsity of Gene Networks (GNs). Here, we propose a network-based framework, termed NIHO, for optimizing and completing GNs by integrating six genome networks: STRING, ConsensusPathDB, HumanNet, GeneMANIA, GIANT and BioGRID. NIHO learns high-level features of genes from the heterogeneous networks in an end-to-end way that consists of a neural network and matrix completion. Then, the learned low-dimensional representations are used to calculate the geometric proximity of genes in the projected space. Finally, NIHO infers the interactions among genes by analyzing the proximity scores and adds to the GNs those interactions that did not originally exist. The experimental results show that the capability of GNs to recover disease gene sets is significantly improved after processing by NIHO. In addition, we not only examined the proportion of the added interactions but also observed that the performance of NIHO improves as the number of heterogeneous networks increases.
  • ... For instance, authors in [177] proposed a method integrating feature-based and similarity-based machine learning approaches [205,206]. The hybrid methods performed superior to other state-of-the-art methods by optimizing the feature extraction process by extracting the complex hidden features of drugs and targets [134,144,172,173,182,197,201,207,208]. Integrating two machine learning methods in DTI prediction often has a leverage in terms of results as they fully exploit the potential of two methods, simultaneously. ...
    Article
    The task of predicting the interactions between drugs and targets plays a key role in the process of drug discovery. There is a need to develop novel and efficient prediction approaches in order to avoid relying on costly and laborious, yet not always deterministic, experiments alone to determine drug-target interactions (DTIs). These approaches should be capable of identifying the potential DTIs in a timely manner. In this article, we describe the data required for the task of DTI prediction, followed by a comprehensive catalog of machine learning methods and databases that have been proposed and utilized to predict DTIs. The advantages and disadvantages of each set of methods are also briefly discussed. Lastly, the challenges one may face in predicting DTIs using machine learning approaches are highlighted, and we conclude by shedding some light on important future research directions.
  • ... Meanwhile, DL aims to extract high-level feature abstractions from input data, typically using several layers of non-linear transformations, and is a dominant method used in numerous complex learning tasks with large-scale samples in the data science field, such as computer vision, speech recognition, natural language processing (NLP), game playing, and bioinformatics [17][18][19]. Although several DL models have been used to address various learning problems in drug discovery [20][21][22], they rarely fully exploit the currently available large-scale protein and compound data to predict CPI. For example, the computational approaches proposed in the literature [20,21] only use the hand-designed features of compounds and do not take into account the features of targets. ...
    Article
    Full-text available
    Accurate identification of compound-protein interactions (CPIs) in silico may deepen our understanding of the underlying mechanisms of drug action and thus remarkably facilitate drug discovery and development. Conventional similarity- or docking-based computational methods for predicting CPIs rarely exploit latent features from currently available large-scale unlabeled compound and protein data and often limit their usage to relatively small-scale datasets. In the present study, we proposed DeepCPI, a novel general and scalable computational framework that combines effective feature embedding (a technique of representation learning) with powerful deep learning methods to accurately predict CPIs at a large scale. DeepCPI automatically learns the implicit yet expressive low-dimensional features of compounds and proteins from a massive amount of unlabeled data. Evaluations of the measured CPIs in large-scale databases, such as ChEMBL and BindingDB, as well as of the known drug-target interactions from DrugBank, demonstrated the superior predictive performance of DeepCPI. Furthermore, several interactions among small-molecule compounds and three G protein-coupled receptor targets (glucagon-like peptide-1 receptor, glucagon receptor, and vasoactive intestinal peptide receptor) predicted using DeepCPI were experimentally validated. The present study suggests that DeepCPI is a useful and powerful tool for drug discovery and repositioning. The source code of DeepCPI can be downloaded from https://github.com/FangpingWan/DeepCPI.
  • ... The initial list of drug candidates targeting SARS-CoV-2 was first screened using a network-based knowledge mining algorithm modified from our previous work [68,69]. ...
    Preprint
    Full-text available
    The global spread of SARS-CoV-2 creates an urgent need to find effective therapeutics for the treatment of COVID-19. We developed a data-driven drug repositioning framework, which applies both machine learning and statistical analysis approaches to systematically integrate and mine large-scale knowledge graph, literature and transcriptome data to discover potential drug candidates against SARS-CoV-2. The retrospective study using the past SARS-CoV and MERS-CoV data demonstrated that our machine learning based method can successfully predict effective drug candidates against a specific coronavirus. Our in silico screening followed by wet-lab validation indicated that a poly-ADP-ribose polymerase 1 (PARP1) inhibitor, CVL218, currently in Phase I clinical trial, may be repurposed to treat COVID-19. Our in vitro assays revealed that CVL218 can exhibit effective inhibitory activity against SARS-CoV-2 replication without obvious cytopathic effect. In addition, we showed that CVL218 is able to suppress the CpG-induced IL-6 production in peripheral blood mononuclear cells, suggesting that it may also have an anti-inflammatory effect that is highly relevant to the prevention of immunopathology induced by SARS-CoV-2 infection. Further pharmacokinetic and toxicokinetic evaluation in rats and monkeys showed a high concentration of CVL218 in lung and observed no apparent signs of toxicity, indicating the appealing potential of this drug for the treatment of the pneumonia caused by SARS-CoV-2 infection. Moreover, molecular docking simulation suggested that CVL218 may bind to the N-terminal domain of nucleocapsid (N) protein of SARS-CoV-2, providing a possible model to explain its antiviral action. We also proposed several possible mechanisms to explain the antiviral activities of PARP1 inhibitors against SARS-CoV-2, based on the data presented in this study and previous evidence reported in the literature. In summary, the PARP1 inhibitor CVL218 discovered by our data-driven drug repositioning framework can serve as a potential therapeutic agent for the treatment of COVID-19.
  • Article
    Motivation: Quantitative structure-activity relationship (QSAR) and drug-target interaction (DTI) prediction are both commonly used in drug discovery. Collaboration among pharmaceutical institutions can lead to better performance in both QSAR and DTI prediction. However, the drug-related data privacy and intellectual property issues have become a noticeable hindrance for inter-institutional collaboration in drug discovery. Results: We have developed two novel algorithms under secure multiparty computation (MPC), including QSARMPC and DTIMPC, which enable pharmaceutical institutions to achieve high-quality collaboration to advance drug discovery without divulging private drug-related information. QSARMPC, a neural network model under MPC, displays good scalability and performance, and is feasible for privacy-preserving collaboration on large-scale QSAR prediction. DTIMPC integrates drug-related heterogeneous network data and accurately predicts novel DTIs, while keeping the drug information confidential. Under several experimental settings that reflect the situations in real drug discovery scenarios, we have demonstrated that DTIMPC possesses significant performance improvement over the baseline methods, generates novel DTI predictions with supporting evidence from the literature, and shows the feasible scalability to handle growing DTI data. All these results indicate that QSARMPC and DTIMPC can provide practically useful tools for advancing privacy-preserving drug discovery. Availability and implementation: The source codes of QSARMPC and DTIMPC are available on the GitHub: https://github.com/rongma6/QSARMPC_DTIMPC.git. Supplementary information: Supplementary data are available at Bioinformatics online.
  • Article
    Motivation: Systematic identification of molecular targets among known drugs plays an essential role in drug repurposing and understanding of their unexpected side effects. Computational approaches for prediction of drug-target interactions (DTIs) are highly desired in comparison to traditional experimental assays. Furthermore, recent advances of multi-omics technologies and systems biology approaches have generated large-scale heterogeneous, biological networks, which offer unexpected opportunities for network-based identification of new molecular targets among known drugs. Results: In this study, we present a network-based computational framework, termed AOPEDF, an Arbitrary-Order Proximity Embedded Deep Forest approach, for prediction of DTIs. AOPEDF learns a low-dimensional vector representation of features that preserve arbitrary-order proximity from a highly integrated, heterogeneous biological network connecting drugs, targets (proteins), and diseases. In total, we construct a heterogeneous network by uniquely integrating 15 networks covering chemical, genomic, phenotypic, and network profiles among drugs, proteins/targets, and diseases. Then, we build a cascade deep forest classifier to infer new DTIs. Via systematic performance evaluation, AOPEDF achieves high accuracy in identifying molecular targets among known drugs on two external validation sets collected from DrugCentral (area under receiver operating characteristic curve [AUROC] = 0.868) and ChEMBL (AUROC = 0.768) databases, outperforming several state-of-the-art methods. In a case study, we showcase that multiple molecular targets predicted by AOPEDF are associated with mechanism-of-action of substance abuse disorder for several marketed drugs (such as aripiprazole, risperidone, and haloperidol). Availability: Source code and data can be downloaded from https://github.com/ChengF-Lab/AOPEDF. Supplementary information: Supplementary data are available online at Bioinformatics.
  • Article
    Groundnut is one of the most important and popular oilseed foods in the agricultural field, and its botanical name is Arachis hypogaea L. The pod of a mature groundnut contains approximately 1–5 seeds with 57% oil and 25% protein content. The oil obtained from the groundnut is widely used for cooking and losing body weight, and its fats are widely used for making soaps. Groundnut cultivation is affected by different kinds of diseases caused by fungi, viruses, and bacteria. These diseases affect the leaf, root and stem of the groundnut plant and lead to heavy loss in yield. Moreover, a large number of diseases affect the leaf and root, such as Alternaria, Pestalotiopsis, Bud necrosis, tikka, Phyllosticta, Rust, Pepper spot, Choanephora, and early and late leaf spot. To overcome these issues, we introduce an efficient method based on a deep convolutional neural network (DCNN) because it automatically detects the important features without any human supervision. The DCNN procedure can detect plant disease by using a deep learning process. Moreover, the DCNN training and testing process demonstrates accurate groundnut disease determination and classification results. The groundnut leaf disease images are chosen from the plant village dataset and used for the training and testing process. The stochastic gradient descent with momentum method is used for training, and it has shown the better performance of the proposed DCNN. From the comparison analysis, the 6th combined layer of the proposed DCNN delivers a 95.28% accuracy value. Ultimately, the groundnut disease classification with the overall performance of the proposed DCNN provides 99.88% accuracy.
  • Deep learning with feature embedding for compound-protein interaction prediction. bioRxiv
    • F Wan
    • J Zeng
    Wan, F. and Zeng, J. (2016). Deep learning with feature embedding for compound-protein interaction prediction. bioRxiv, page 086033.
    • A P Davis
    • C G Murphy
    • R Johnson
    • J M Lay
    • K Lennon-Hopkins
    • C Saraceni-Richards
    • D Sciaky
    • B L King
    • M C Rosenstein
    • T C Wiegers
    Davis, A. P., Murphy, C. G., Johnson, R., Lay, J. M., Lennon-Hopkins, K., Saraceni-Richards, C., Sciaky, D., King, B. L., Rosenstein, M. C., Wiegers, T. C., et al. (2012). The comparative toxicogenomics database: update 2013. Nucleic acids research, 41(D1), D1104-D1114.
  • Human protein reference database–2009 update
    • T Keshava Prasad
    • R Goel
    • K Kandasamy
    • S Keerthikumar
    • S Kumar
    • S Mathivanan
    • D Telikicherla
    • R Raju
    • B Shafreen
    • A Venugopal
    Keshava Prasad, T., Goel, R., Kandasamy, K., Keerthikumar, S., Kumar, S., Mathivanan, S., Telikicherla, D., Raju, R., Shafreen, B., Venugopal, A., et al. (2008). Human protein reference database–2009 update. Nucleic acids research, 37(suppl_1), D767-D772.
  • Deepwalk: Online learning of social representations
    • B Perozzi
    • R Al-Rfou
    • S Skiena
    Perozzi, B., Al-Rfou, R., and Skiena, S. (2014). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701-710. ACM.
  • Article
    Median lethal death, LD50, is a general indicator of compound acute oral toxicity (AOT). Various in silico methods were developed for AOT prediction to reduce costs and time. In this study, we developed an improved molecular graph encoding convolutional neural networks (MGE-CNN) architecture to develop three types of high-quality AOT models: regression model (deepAOT-R), multi-classification model (deepAOT-C) and multi-task model (deepAOT-CR). These predictive models highly outperformed previously reported models. For the two external data sets containing 1673 (test set I) and 375 (test set II) compounds, the R2 and mean absolute error (MAE) of deepAOT-R on test set I were 0.864 and 0.195, and the prediction accuracy of deepAOT-C was 95.5% and 96.3% on test sets I and II, respectively. The two external prediction accuracies of deepAOT-CR are 95.0% and 94.1%, while the R2 and MAE are 0.861 and 0.204 for test set I, respectively. We then performed forward and backward exploration of deepAOT models for deep fingerprints, which could support shallow machine learning methods more efficiently than traditional fingerprints or descriptors. We further performed automatic feature learning, a key essence of deep learning, to map the corresponding activation values into fragment space and derive AOT-related chemical substructures by reverse mining of the features. Our deep learning architecture for AOT is generally applicable in predicting and exploring other toxicity or property endpoints of chemical compounds. The two deepAOT models are freely available at http://repharma.pku.edu.cn/DLAOT/DLAOThome.php.
  • Article
    Full-text available
    The emergence of large-scale genomic, chemical and pharmacological data provides new opportunities for drug discovery and repositioning. In this work, we develop a computational pipeline, called DTINet, to predict novel drug–target interactions from a constructed heterogeneous network, which integrates diverse drug-related information. DTINet focuses on learning a low-dimensional vector representation of features, which accurately explains the topological properties of individual nodes in the heterogeneous network, and then makes prediction based on these representations via a vector space projection scheme. DTINet achieves substantial performance improvement over other state-of-the-art methods for drug–target interaction prediction. Moreover, we experimentally validate the novel interactions between three drugs and the cyclooxygenase proteins predicted by DTINet, and demonstrate the new potential applications of these identified cyclooxygenase inhibitors in preventing inflammatory diseases. These results indicate that DTINet can provide a practically useful tool for integrating heterogeneous information to predict new drug–target interactions and repurpose existing drugs.
  • Conference Paper
    We study the problem of representation learning in heterogeneous networks. Its unique challenges come from the existence of multiple types of nodes and links, which limit the feasibility of the conventional network embedding techniques. We develop two scalable representation learning models, namely metapath2vec and metapath2vec++. The metapath2vec model formalizes meta-path-based random walks to construct the heterogeneous neighborhood of a node and then leverages a heterogeneous skip-gram model to perform node embeddings. The metapath2vec++ model further enables the simultaneous modeling of structural and semantic correlations in heterogeneous networks. Extensive experiments show that metapath2vec and metapath2vec++ are able to not only outperform state-of-the-art embedding models in various heterogeneous network mining tasks, such as node classification, clustering, and similarity search, but also discern the structural and semantic correlations between diverse network objects.
  • Article
    Full-text available
    Low-dimensional embeddings of nodes in large graphs have proved extremely useful in a variety of prediction tasks, from content recommendation to identifying protein functions. However, most existing approaches require that all nodes in the graph are present during training of the embeddings; these previous approaches are inherently transductive and do not naturally generalize to unseen nodes. Here we present GraphSAGE, a general, inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, we learn a function that generates embeddings by sampling and aggregating features from a node's local neighborhood. Our algorithm outperforms strong baselines on three inductive node-classification benchmarks: we classify the category of unseen nodes in evolving information graphs based on citation and Reddit post data, and we show that our algorithm generalizes to completely unseen graphs using a multi-graph dataset of protein-protein interactions.
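The sample-and-aggregate idea described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the GraphSAGE reference implementation: the function names (`sample_neighbors`, `mean_aggregate`, `sage_embed`) are hypothetical, a mean aggregator is assumed, and the learned weight matrix and nonlinearity that would follow the concatenation are omitted.

```python
import random


def sample_neighbors(adj, node, k, rng):
    """Sample up to k neighbors of `node` uniformly without replacement."""
    nbrs = adj[node]
    if len(nbrs) <= k:
        return list(nbrs)
    return rng.sample(nbrs, k)


def mean_aggregate(features, nodes, dim):
    """Element-wise mean of the feature vectors of `nodes` (zeros if empty)."""
    if not nodes:
        return [0.0] * dim
    return [sum(features[n][d] for n in nodes) / len(nodes) for d in range(dim)]


def sage_embed(adj, features, node, k, rng):
    """Concatenate a node's own features with the mean of sampled neighbor features.

    Hypothetical one-layer sketch: a trained model would apply a learned
    linear map and a nonlinearity to this concatenation.
    """
    dim = len(features[node])
    sampled = sample_neighbors(adj, node, k, rng)
    return list(features[node]) + mean_aggregate(features, sampled, dim)


# Toy graph (illustrative data only).
adj = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
features = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [2.0, 4.0]}

# A node that appears later can be embedded with the same function,
# because the function depends only on features and local neighborhood.
adj["d"] = ["a", "c"]
features["d"] = [0.0, 1.0]
print(sage_embed(adj, features, "d", k=2, rng=random.Random(0)))
```

Because the embedding is computed by a function of node features and the local neighborhood rather than looked up in a per-node table, the same code applies to nodes added after the function was defined, which is the inductive property the abstract emphasizes.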
  • Article
    Supervised learning on molecules has incredible potential to be useful in chemistry, drug discovery, and materials science. Luckily, several promising and closely related neural network models invariant to molecular symmetries have already been described in the literature. These models learn a message passing algorithm and aggregation function to compute a function of their entire input graph. At this point, the next step is to find a particularly effective variant of this general approach and apply it to chemical prediction benchmarks until we either solve them or reach the limits of the approach. In this paper, we reformulate existing models into a single common framework we call Message Passing Neural Networks (MPNNs) and explore additional novel variations within this framework. Using MPNNs we demonstrate state of the art results on an important molecular property prediction benchmark, results we believe are strong enough to justify retiring this benchmark.
  • Article
    Full-text available
    Model: 2015, 55, 263-274). However, the applicability of these techniques has been limited by the requirement for large amounts of training data. In this work, we demonstrate how one-shot learning can be used to significantly lower the amounts of data required to make meaningful predictions in drug discovery applications. We introduce a new architecture, the iterative refinement long short-term memory, that, when combined with graph convolutional neural networks, significantly improves learning of meaningful distance metrics over small-molecules. We open source all models introduced in this work as part of DeepChem, an open-source framework for deep-learning in drug discovery (Ramsundar, B. deepchem.io. https://github.com/deepchem/deepchem, 2016).
  • Article
    Decades of costly failures in translating drug candidates from preclinical disease models to human therapeutic use warrant reconsideration of the priority placed on animal models in biomedical research. Following an international workshop attended by experts from academia, government institutions, research funding bodies, and the corporate and nongovernmental organisation (NGO) sectors, in this consensus report, we analyse, as case studies, five disease areas with major unmet needs for new treatments. In view of the scientifically driven transition towards a human pathway-based paradigm in toxicology, a similar paradigm shift appears to be justified in biomedical research. There is a pressing need for an approach that strategically implements advanced, human biology-based models and tools to understand disease pathways at multiple biological scales. We present recommendations to help achieve this.
  • Article
    We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.
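A single layer of the kind of graph convolution summarized above can be written out as follows. This is a minimal pure-Python sketch assuming the commonly used propagation rule with self-loops and symmetric degree normalization, ReLU(D^(-1/2) (A + I) D^(-1/2) H W); the helper names (`matmul`, `gcn_layer`) and the toy inputs are hypothetical, and a practical implementation would use sparse matrix operations.

```python
import math


def matmul(a, b):
    """Dense matrix product of two lists-of-lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj, feats, weight):
    """One graph convolution layer on a dense adjacency matrix.

    adj:    n x n adjacency matrix (0/1 entries, no self-loops)
    feats:  n x f node feature matrix H
    weight: f x h weight matrix W (would be learned in practice)
    """
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization: entry (i, j) is divided by sqrt(deg_i * deg_j).
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    # ReLU nonlinearity.
    return [[max(0.0, v) for v in row] for row in out]


# Two connected nodes, one input feature, two output channels.
print(gcn_layer([[0, 1], [1, 0]], [[1.0], [3.0]], [[1.0, -1.0]]))
# → [[2.0, 0.0], [2.0, 0.0]]
```

Each output row mixes a node's own features with those of its direct neighbors only, which is why the cost grows with the number of edges when the normalized adjacency matrix is stored sparsely.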
  • Article
    Computational prediction of compound-protein interactions (CPIs) is of great importance for drug design as the first step in in-silico screening. We previously proposed chemical genomics-based virtual screening (CGBVS), which predicts CPIs by using a support vector machine (SVM). However, the CGBVS has problems when training using more than a million datasets of CPIs since SVMs require an exponential increase in the calculation time and computer memory. To solve this problem, we propose the CGBVS-DNN, in which we use deep neural networks, a kind of deep learning technique, instead of the SVM. Deep learning does not require learning all input data at once because the network can be trained with small mini-batches. Experimental results show that the CGBVS-DNN outperformed the original CGBVS with a quarter million CPIs. Results of cross-validation show that the accuracy of the CGBVS-DNN reaches up to 98.2 % (σ<0.01) with 4 million CPIs.
  • Article
    The identification of interactions between compounds and proteins plays an important role in network pharmacology and drug discovery. However, experimentally identifying compound-protein interactions (CPIs) is generally expensive and time-consuming, so computational approaches have been introduced. Among these, machine-learning based methods have achieved considerable success. However, due to the nonlinear and imbalanced nature of biological data, many machine learning approaches have their own limitations. Recently, deep learning techniques have shown advantages over many state-of-the-art machine learning methods in some applications. In this study, we aim at improving the performance of CPI prediction based on deep learning, and propose a method called DL-CPI (the abbreviation of Deep Learning for Compound-Protein Interactions prediction), which employs a deep neural network (DNN) to effectively learn the representations of compound-protein pairs. Extensive experiments show that DL-CPI can learn useful features of compound-protein pairs by a layerwise abstraction, and thus achieves better prediction performance than existing methods on both balanced and imbalanced datasets.
  • Article
    Convolutional neural networks (CNNs) have greatly improved state-of-the-art performances in a number of fields, notably computer vision and natural language processing. In this work, we are interested in generalizing the formulation of CNNs from low-dimensional regular Euclidean domains, where images (2D), videos (3D) and audios (1D) are represented, to high-dimensional irregular domains such as social networks or biological networks represented by graphs. This paper introduces a formulation of CNNs on graphs in the context of spectral graph theory. We borrow the fundamental tools from the emerging field of signal processing on graphs, which provides the necessary mathematical background and efficient numerical schemes to design localized graph filters efficient to learn and evaluate. As a matter of fact, we introduce the first technique that offers the same computational complexity as standard CNNs, while being universal to any graph structure. Numerical experiments on MNIST and 20NEWS demonstrate the ability of this novel deep learning system to learn local, stationary, and compositional features on graphs, as long as the graph is well-constructed.
  • Article
    Full-text available
    Motivation: Identifying drug–target interactions is an important task in drug discovery. To reduce heavy time and financial cost in experimental way, many computational approaches have been proposed. Although these approaches have used many different principles, their performance is far from satisfactory, especially in predicting drug–target interactions of new candidate drugs or targets. Methods: Approaches based on machine learning for this problem can be divided into two types: feature-based and similarity-based methods. Learning to rank is the most powerful technique in the feature-based methods. Similarity-based methods are well accepted, due to their idea of connecting the chemical and genomic spaces, represented by drug and target similarities, respectively. We propose a new method, DrugE-Rank, to improve the prediction performance by nicely combining the advantages of the two different types of methods. That is, DrugE-Rank uses LTR, for which multiple well-known similarity-based methods can be used as components of ensemble learning. Results: The performance of DrugE-Rank is thoroughly examined by three main experiments using data from DrugBank: (i) cross-validation on FDA (US Food and Drug Administration) approved drugs before March 2014; (ii) independent test on FDA approved drugs after March 2014; and (iii) independent test on FDA experimental drugs. Experimental results show that DrugE-Rank outperforms competing methods significantly, especially achieving more than 30% improvement in Area under Prediction Recall curve for FDA approved new drugs and FDA experimental drugs. Availability: http://datamining-iip.fudan.edu.cn/service/DrugE-Rank Contact: zhusf@fudan.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online.
  • Conference Paper
    Network based prediction of interaction between drug compounds and target proteins is a core step in the drug discovery process. The availability of drug–target interaction data has boosted the development of machine learning methods for the in silico prediction of drug–target interactions. In this paper we focus on the crucial issue of data bias. We show that four popular datasets contain a bias because of the way they have been constructed: all drug compounds and target proteins have at least one interaction and some of them have only a single interaction. We show that this bias can be exploited by prediction methods to achieve an optimistic generalization performance as estimated by cross-validation procedures, in particular leave-one-out cross validation. We discuss possible ways to mitigate the effect of this bias, in particular by adapting the validation procedure. In general, results indicate that the data bias should be taken into account when assessing the generalization performance of machine learning methods for the in silico prediction of drug–target interactions. The datasets and source code for this article are available at http://cs.ru.nl/~tvanlaarhoven/bias2014/
  • Article
    Many questions about the biological activity and availability of small molecules remain inaccessible to investigators who could most benefit from their answers. To narrow the gap between chemoinformatics and biology, we have developed a suite of ligand annotation, purchasability, target and biology association tools, incorporated into ZINC and meant for investigators who are not computer specialists. The new version contains over 120 million purchasable "drug-like" compounds - effectively all organic molecules that are for sale - a quarter of which are available for immediate delivery. ZINC connects purchasable compounds to high-value ones such as metabolites, drugs, natural products and annotated compounds from the literature. Compounds may be accessed by the genes they are annotated for, as well as the major and minor target classes to which those genes belong. It offers new analysis tools that are easy for non-specialists yet with few limitations for experts. ZINC retains its original 3D roots - all molecules are available in biologically relevant, ready-to-dock formats. ZINC is freely available at zinc15.docking.org.
  • Article
    The emergence of network medicine not only offers more opportunities for better and more complete understanding of the molecular complexities of diseases, but also serves as a promising tool for identifying new drug targets and establishing new relationships among diseases that enable drug repositioning. Computational approaches for drug repositioning by integrating information from multiple sources and multiple levels have the potential to provide great insights to the complex relationships among drugs, targets, disease genes and diseases at a system level.
  • Article
    Full-text available
    Motivation: Most existing methods for predicting causal disease genes rely on specific type of evidence, and are therefore limited in terms of applicability. More often than not, the type of evidence available for diseases varies; for example, we may know linked genes, keywords associated with the disease obtained by mining text, or co-occurrence of disease symptoms in patients. Similarly, the type of evidence available for genes varies; for example, specific microarray probes convey information only for certain sets of genes. In this article, we apply a novel matrix-completion method called Inductive Matrix Completion to the problem of predicting gene-disease associations; it combines multiple types of evidence (features) for diseases and genes to learn latent factors that explain the observed gene–disease associations. We construct features from different biological sources such as microarray expression data and disease-related textual data. A crucial advantage of the method is that it is inductive; it can be applied to diseases not seen at training time, unlike traditional matrix-completion approaches and network-based inference methods that are transductive. Results: Comparison with state-of-the-art methods on diseases from the Online Mendelian Inheritance in Man (OMIM) database shows that the proposed approach is substantially better: it has close to one-in-four chance of recovering a true association in the top 100 predictions, compared to the recently proposed Catapult method (second best) that has <15% chance. We demonstrate that the inductive method is particularly effective for a query disease with no previously known gene associations, and for predicting novel genes, i.e. genes that are previously not linked to diseases. Thus the method is capable of predicting novel genes even for well-characterized diseases. We also validate the novelty of predictions by evaluating the method on recently reported OMIM associations and on associations recently reported in the literature. Availability: Source code and datasets can be downloaded from http://bigdata.ices.utexas.edu/project/gene-disease. Contact: naga86@cs.utexas.edu
  • Conference Paper
    We address the problem of predicting new drug-target interactions from three inputs: known interactions, similarities over drugs and those over targets. This setting has been considered by many methods, which however share the limitation of allowing only one similarity matrix over drugs and one over targets. The key idea of our approach is to use more than one similarity matrix over drugs as well as over targets, where weights over the multiple similarity matrices are estimated from data to automatically select similarities that are effective for improving the performance of predicting drug-target interactions. We propose a factor model, named Multiple Similarities Collaborative Matrix Factorization (MSCMF), which projects drugs and targets into a common low-rank feature space, which is further consistent with weighted similarity matrices over drugs and those over targets. These two low-rank matrices and weights over similarity matrices are estimated by an alternating least squares algorithm. Our approach allows drug-target interactions to be predicted by the two low-rank matrices collaboratively and allows similarities that are important for predicting drug-target interactions to be detected. This approach is general and applicable to any binary relations with similarities over elements, which are found in many applications, such as recommender systems. In fact, MSCMF is an extension of weighted low-rank approximation for one-class collaborative filtering. We extensively evaluated the performance of MSCMF by using both synthetic and real datasets. Experimental results showed nice properties of MSCMF on selecting similarities useful in improving the predictive performance and the performance advantage of MSCMF over six state-of-the-art methods for predicting drug-target interactions.
  • Article
    The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
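    The two techniques named in this abstract, subsampling of frequent words and negative sampling, can be illustrated with a short sketch. The abstract itself gives no formulas; the discard probability 1 - sqrt(t / f(w)) and the log-sigmoid objective below follow the commonly cited formulation of these techniques, and the function names and the default threshold t = 1e-5 are assumptions made for illustration only.

```python
import math


def discard_probability(word_freq, t=1e-5):
    """Chance of dropping one occurrence of a word whose relative
    corpus frequency is word_freq (assumed formula: 1 - sqrt(t / f))."""
    return max(0.0, 1.0 - math.sqrt(t / word_freq))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(center, context, negatives):
    """Loss for one (center, context) pair with sampled negative words.

    The observed pair is pushed toward a high dot product and each
    negative sample toward a low one; no softmax over the full
    vocabulary is needed.
    """
    loss = -math.log(sigmoid(dot(context, center)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(neg, center)))
    return loss


# Illustrative values only: a frequent word is dropped most of the time,
# a rare word is always kept.
frequent = discard_probability(1e-3)
rare = discard_probability(1e-6)
aligned = negative_sampling_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
neutral = negative_sampling_loss([0.0, 0.0], [0.0, 0.0], [[0.0, 0.0]])
```

    In this sketch the loss falls as the context vector aligns with the center vector and the negative vector points away from it, which is the behaviour the training procedure is meant to reward.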
  • Article
    In silico discovery of interactions between drug compounds and target proteins is of core importance for improving the efficiency of the laborious and costly experimental determination of drug-target interaction. Drug-target interaction data are available for many classes of pharmaceutically useful target proteins including enzymes, ion channels, GPCRs and nuclear receptors. However, current drug-target interaction databases contain a small number of drug-target pairs which are experimentally validated interactions. In particular, for some drug compounds (or targets) there is no available interaction. This motivates the need for developing methods that predict interacting pairs with high accuracy also for these 'new' drug compounds (or targets). We show that a simple weighted nearest neighbor procedure is highly effective for this task. We integrate this procedure into a recent machine learning method for drug-target interaction we developed in previous work. Results of experiments indicate that the resulting method predicts true interactions with high accuracy also for new drug compounds and achieves results comparable or better than those of recent state-of-the-art algorithms. Software is publicly available at http://cs.ru.nl/~tvanlaarhoven/drugtarget2013/.
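    A weighted nearest neighbor procedure of the kind described above can be sketched as follows. The abstract does not specify the weighting scheme; the rank-based geometric decay used here (weight decay ** rank, with the most similar known drug at rank 0) and the name wnn_profile are assumptions for illustration, not the authors' published implementation.

```python
def wnn_profile(similarities, profiles, decay=0.7):
    """Infer an interaction profile for a drug with no known targets.

    similarities[i] is the similarity of the new drug to known drug i;
    profiles[i] is that drug's 0/1 interaction vector over targets.
    Known drugs are ranked by similarity, and the profile of the drug
    at rank r contributes with weight decay ** r (assumed scheme).
    """
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    n_targets = len(profiles[0])
    scores = [0.0] * n_targets
    for rank, i in enumerate(order):
        weight = decay ** rank
        for t in range(n_targets):
            scores[t] += weight * profiles[i][t]
    return scores


# Three known drugs, two targets; the new drug is most similar to drug 1.
scores = wnn_profile([0.2, 0.9, 0.5],
                     [[1, 0], [0, 1], [1, 1]],
                     decay=0.5)
```

    The inferred scores can then stand in for the missing interaction profile when a downstream predictor needs label information for the new drug.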
  • Article
    Motivation: In silico prediction of drug-target interactions plays an important role toward identifying and developing new uses of existing or abandoned drugs. Network-based approaches have recently become a popular tool for discovering new drug-target interactions (DTIs). Unfortunately, most of these network-based approaches can only predict binary interactions between drugs and targets, and information about different types of interactions has not been well exploited for DTI prediction in previous studies. On the other hand, incorporating additional information about drug-target relationships or drug modes of action can improve prediction of DTIs. Furthermore, the predicted types of DTIs can broaden our understanding about the molecular basis of drug action. Results: We propose a first machine learning approach to integrate multiple types of DTIs and predict unknown drug-target relationships or drug modes of action. We cast the new DTI prediction problem into a two-layer graphical model, called restricted Boltzmann machine, and apply a practical learning algorithm to train our model and make predictions. Tests on two public databases show that our restricted Boltzmann machine model can effectively capture the latent features of a DTI network and achieve excellent performance on predicting different types of DTIs, with the area under precision-recall curve up to 89.6. In addition, we demonstrate that integrating multiple types of DTIs can significantly outperform other predictions either by simply mixing multiple types of interactions without distinction or using only a single interaction type. Further tests show that our approach can infer a high fraction of novel DTIs that has been validated by known experiments in the literature or other databases. These results indicate that our approach can have highly practical relevance to DTI prediction and drug repositioning, and hence advance the drug discovery process. 
Availability: Software and datasets are available on request. Supplementary information: Supplementary data are available at Bioinformatics online.
  • Article
    Motivation: The identification of drug–target interaction (DTI) represents a costly and time-consuming step in drug discovery and design. Computational methods capable of predicting reliable DTI play an important role in the field. Recently, recommendation methods relying on network-based inference (NBI) have been proposed. However, such approaches implement naive topology-based inference and do not take into account important features within the drug–target domain. Results: In this article, we present a new NBI method, called domain tuned-hybrid (DT-Hybrid), which extends a well-established recommendation technique by domain-based knowledge including drug and target similarity. DT-Hybrid has been extensively tested using the latest version of an experimentally validated DTI database obtained from DrugBank. Comparison with other recently proposed NBI methods clearly shows that DT-Hybrid is capable of predicting more reliable DTIs. Availability: DT-Hybrid has been developed in R and it is available, along with all the results on the predictions, through an R package at the following URL: http://sites.google.com/site/ehybridalgo/. Contact: apulvirenti@dmi.unict.it Supplementary information: Supplementary data are available at Bioinformatics online.
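    For orientation, the plain topology-based inference that the abstract contrasts with DT-Hybrid can be sketched as a two-step resource diffusion on the bipartite drug-target graph. This is a minimal sketch of the baseline idea only, written in Python for illustration; it does not include the similarity-based tuning that DT-Hybrid adds, it is not the authors' R package, and the name nbi_scores is hypothetical.

```python
def nbi_scores(adjacency, drug):
    """Score all targets for one drug by two-step resource diffusion.

    adjacency[d][t] is 1 if drug d is known to interact with target t.
    Each known target of the query drug starts with one unit of
    resource, spreads it equally over the drugs linked to it, and each
    drug then spreads what it received equally over its own targets.
    """
    n_drugs = len(adjacency)
    n_targets = len(adjacency[0])
    target_degree = [sum(adjacency[d][t] for d in range(n_drugs))
                     for t in range(n_targets)]
    drug_degree = [sum(row) for row in adjacency]

    # Step 1: targets -> drugs
    drug_resource = [0.0] * n_drugs
    for t in range(n_targets):
        if adjacency[drug][t]:
            for d in range(n_drugs):
                if adjacency[d][t]:
                    drug_resource[d] += 1.0 / target_degree[t]

    # Step 2: drugs -> targets
    target_resource = [0.0] * n_targets
    for d in range(n_drugs):
        if drug_degree[d] == 0:
            continue
        for t in range(n_targets):
            if adjacency[d][t]:
                target_resource[t] += drug_resource[d] / drug_degree[d]
    return target_resource


# Two drugs, three targets. Drug 0 has no known link to target 2,
# but receives a nonzero score for it through the shared target 1.
toy_scores = nbi_scores([[1, 1, 0],
                         [0, 1, 1]], drug=0)
```

    Because the scores depend only on which links exist, this baseline ignores chemical and sequence similarity, which is the gap the abstract says DT-Hybrid addresses.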
  • Article
    Motivation: In silico methods provide efficient ways to predict possible interactions between drugs and targets. A supervised learning approach, the bipartite local model (BLM), has recently been shown to be effective in predicting drug-target interactions. However, for drug-candidate compounds or target-candidate proteins that currently have no known interactions, its purely 'local' model cannot be learned, and hence BLM may fail to make correct predictions for such new candidates. Results: We present a simple procedure called neighbor-based interaction-profile inferring (NII) and integrate it into the existing BLM method to handle the new candidate problem. Specifically, the inferred interaction profile is treated as label information and is used for model learning of new candidates. This functionality is particularly important in practice to find targets for new drug-candidate compounds and identify targeting drugs for new target-candidate proteins. Consistently good performance of the new BLM-NII approach has been observed in experiments on the prediction of interactions between drugs and four categories of target proteins. In particular, for nuclear receptors, BLM-NII achieves the most significant improvement, as this dataset contains many drugs/targets with no interactions in the cross-validation. This demonstrates the effectiveness of the NII strategy and also shows the great potential of BLM-NII for prediction of compound-protein interactions. Contact: jpmei@ntu.edu.sg Supplementary information: Supplementary data are available at Bioinformatics online.
  • Article
    The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) provides information about interactions between environmental chemicals and gene products and their relationships to diseases. Chemical–gene, chemical–disease and gene–disease interactions manually curated from the literature are integrated to generate expanded networks and predict many novel associations between different data types. CTD now contains over 15 million toxicogenomic relationships. To navigate this sea of data, we added several new features, including DiseaseComps (which finds comparable diseases that share toxicogenomic profiles), statistical scoring for inferred gene–disease and pathway–chemical relationships, filtering options for several tools to refine user analysis and our new Gene Set Enricher (which provides biological annotations that are enriched for gene sets). To improve data visualization, we added a Cytoscape Web view to our ChemComps feature, included color-coded interactions and created a ‘slim list’ for our MEDIC disease vocabulary (allowing diseases to be grouped for meta-analysis, visualization and better data management). CTD continues to promote interoperability with external databases by providing content and cross-links to their sites. Together, this wealth of expanded chemical–gene–disease data, combined with novel ways to analyze and view content, continues to help users generate testable hypotheses about the molecular mechanisms of environmental diseases.
  • Conference Paper
    Receiver Operator Characteristic (ROC) curves are commonly used to present results for binary decision problems in machine learning. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm's performance. We show that a deep connection exists between ROC space and PR space, such that a curve dominates in ROC space if and only if it dominates in PR space. A corollary is the notion of an achievable PR curve, which has properties much like the convex hull in ROC space; we show an efficient algorithm for computing this curve. Finally, we also note differences in the two types of curves are significant for algorithm design. For example, in PR space it is incorrect to linearly interpolate between points. Furthermore, algorithms that optimize the area under the ROC curve are not guaranteed to optimize the area under the PR curve.
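    The remark that linear interpolation is incorrect in PR space can be made concrete with a small sketch. Between two operating points A and B, the intermediate points are obtained by adding true positives one at a time while false positives grow at the constant rate (FP_B - FP_A) / (TP_B - TP_A); precision then changes non-linearly with recall. The function name and the example counts are chosen for illustration, and the sketch assumes TP_B > TP_A and TP_A + FP_A > 0.

```python
def pr_interpolate(tp_a, fp_a, tp_b, fp_b, total_pos):
    """Return (recall, precision) points from A to B, both inclusive.

    total_pos is the number of positive examples in the dataset.
    """
    if tp_b <= tp_a:
        raise ValueError("point B must have more true positives than A")
    if tp_a + fp_a == 0:
        raise ValueError("precision is undefined at point A")
    # False positives gained per additional true positive.
    slope = (fp_b - fp_a) / (tp_b - tp_a)
    points = []
    for x in range(tp_b - tp_a + 1):
        tp = tp_a + x
        fp = fp_a + slope * x
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


# A = (TP 5, FP 5), B = (TP 10, FP 30), with 20 positives in total.
curve = pr_interpolate(5, 5, 10, 30, 20)

# A straight line between the end points, evaluated at the recall of
# the second point, overestimates the achievable precision there.
(r0, p0), (r1, p1) = curve[0], curve[-1]
linear_guess = p0 + (p1 - p0) * (curve[1][0] - r0) / (r1 - r0)
```

    With these counts the correctly interpolated precision at recall 0.3 is 0.375, whereas the straight line suggests about 0.45, so a linearly interpolated area under the PR curve would be too optimistic.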
  • Article
    The in silico prediction of potential interactions between drugs and target proteins is of core importance for the identification of new drugs or novel targets for existing drugs. However, only a tiny portion of all drug-target pairs in current datasets are experimentally validated interactions. This motivates the need for developing computational methods that predict true interaction pairs with high accuracy. We show that a simple machine learning method that uses the drug-target network as the only source of information is capable of predicting true interaction pairs with high accuracy. Specifically, we introduce interaction profiles of drugs (and of targets) in a network, which are binary vectors specifying the presence or absence of interaction with every target (drug) in that network. We define a kernel on these profiles, called the Gaussian Interaction Profile (GIP) kernel, and use a simple classifier, (kernel) Regularized Least Squares (RLS), for predicting drug-target interactions. We comparatively test the effectiveness of RLS with the GIP kernel on four drug-target interaction networks used in previous studies. The proposed algorithm achieves area under the precision-recall curve (AUPR) up to 92.7, significantly improving over results of state-of-the-art methods. Moreover, we show that also using kernels based on chemical and genomic information further increases accuracy, with a clear improvement on small datasets. These results substantiate the relevance of the network topology (in the form of interaction profiles) as a source of information for predicting drug-target interactions. Software and Supplementary Material are available at http://cs.ru.nl/~tvanlaarhoven/drugtarget2011/. Contact: tvanlaarhoven@cs.ru.nl; elenam@cs.ru.nl. Supplementary data are available at Bioinformatics online.
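    A Gaussian kernel over binary interaction profiles, as described above, can be sketched in a few lines. The abstract does not state the bandwidth; normalizing it by the mean squared norm of the profiles, as done here, is an assumption, and the names gip_kernel and gamma_prime are illustrative. The RLS classifier that would consume this kernel is not shown.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian kernel matrix over binary interaction profiles.

    profiles[i] is the 0/1 vector of drug i over all targets (or of
    target i over all drugs). The bandwidth gamma is gamma_prime
    divided by the mean squared norm of the profiles (assumed
    normalization), so it adapts to how dense the network is.
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2
                          for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel


# Drugs 0 and 1 share the same targets; drug 2 shares none with them.
K = gip_kernel([[1, 0, 1],
                [1, 0, 1],
                [0, 1, 0]])
```

    Drugs with identical profiles get kernel value 1, and the value decays toward 0 as their known targets diverge, which is how the network topology alone enters the prediction.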
  • Article
    DrugBank (http://www.drugbank.ca) is a richly annotated database of drug and drug target information. It contains extensive data on the nomenclature, ontology, chemistry, structure, function, action, pharmacology, pharmacokinetics, metabolism and pharmaceutical properties of both small molecule and large molecule (biotech) drugs. It also contains comprehensive information on the target diseases, proteins, genes and organisms on which these drugs act. First released in 2006, DrugBank has become widely used by pharmacists, medicinal chemists, pharmaceutical researchers, clinicians, educators and the general public. Since its last update in 2008, DrugBank has been greatly expanded through the addition of new drugs, new targets and the inclusion of more than 40 new data fields per drug entry (a 40% increase in data 'depth'). These data field additions include illustrated drug-action pathways, drug transporter data, drug metabolite data, pharmacogenomic data, adverse drug response data, ADMET data, pharmacokinetic data, computed property data and chemical classification data. DrugBank 3.0 also offers expanded database links, improved search tools for drug-drug and food-drug interaction, new resources for querying and viewing drug pathways and hundreds of new drug entries with detailed patent, pricing and manufacturer data. These additions have been complemented by enhancements to the quality and quantity of existing data, particularly with regard to drug target, drug description and drug action data. DrugBank 3.0 represents the result of 2 years of manual annotation work aimed at making the database much more useful for a wide range of 'omics' (i.e. pharmacogenomic, pharmacoproteomic, pharmacometabolomic and even pharmacoeconomic) applications.
  • Article
    Predicting drug-protein interactions from heterogeneous biological data sources is a key step for in silico drug discovery. The difficulty of this prediction task lies in the rarity of known drug-protein interactions and myriad unknown interactions to be predicted. To meet this challenge, a manifold regularization semi-supervised learning method is presented to tackle this issue by using labeled and unlabeled information which often generates better results than using the labeled data alone. Furthermore, our semi-supervised learning method integrates known drug-protein interaction network information as well as chemical structure and genomic sequence data. Using the proposed method, we predicted certain drug-protein interactions on the enzyme, ion channel, GPCRs, and nuclear receptor data sets. Some of them are confirmed by the latest publicly available drug targets databases such as KEGG. We report encouraging results of using our method for drug-protein interaction network reconstruction which may shed light on the molecular interaction inference and new uses of marketed drugs.
  • Article
    Extended-connectivity fingerprints (ECFPs) are a novel class of topological fingerprints for molecular characterization. Historically, topological fingerprints were developed for substructure and similarity searching. ECFPs were developed specifically for structure-activity modeling. ECFPs are circular fingerprints with a number of useful qualities: they can be very rapidly calculated; they are not predefined and can represent an essentially infinite number of different molecular features (including stereochemical information); their features represent the presence of particular substructures, allowing easier interpretation of analysis results; and the ECFP algorithm can be tailored to generate different types of circular fingerprints, optimized for different uses. While the use of ECFPs has been widely adopted and validated, a description of their implementation has not previously been presented in the literature.
  • Article
    The molecular understanding of phenotypes caused by drugs in humans is essential for elucidating mechanisms of action and for developing personalized medicines. Side effects of drugs (also known as adverse drug reactions) are an important source of human phenotypic information, but so far research on this topic has been hampered by insufficient accessibility of data. Consequently, we have developed a public, computer-readable side effect resource (SIDER) that connects 888 drugs to 1450 side effect terms. It contains information on frequency in patients for one-third of the drug-side effect pairs. For 199 drugs, the side effect frequency of placebo administration could also be extracted. We illustrate the potential of SIDER with a number of analyses. The resource is freely available for academic research at http://sideeffects.embl.de.
  • Article
    In silico prediction of drug-target interactions from heterogeneous biological data is critical in the search for drugs for known diseases. This problem is currently being attacked from many different points of view, a strong indication of its current importance. Precisely, being able to predict new drug-target interactions with both high precision and accuracy is the holy grail, a fundamental requirement for in silico methods to be useful in a biological setting. This, however, remains extremely challenging due to, amongst other things, the rarity of known drug-target interactions. We propose a novel supervised inference method to predict unknown drug-target interactions, represented as a bipartite graph. We use this method, known as bipartite local models to first predict target proteins of a given drug, then to predict drugs targeting a given protein. This gives two independent predictions for each putative drug-target interaction, which we show can be combined to give a definitive prediction for each interaction. We demonstrate the excellent performance of the proposed method in the prediction of four classes of drug-target interaction networks involving enzymes, ion channels, G protein-coupled receptors (GPCRs) and nuclear receptors in human. This enables us to suggest a number of new potential drug-target interactions. An implementation of the proposed algorithm is available upon request from the authors. Datasets and all prediction results are available at http://cbio.ensmp.fr/~yyamanishi/bipartitelocal/.
  • Article
    We describe the testing and release of AutoDock4 and the accompanying graphical user interface AutoDockTools. AutoDock4 incorporates limited flexibility in the receptor. Several tests are reported here, including a redocking experiment with 188 diverse ligand-protein complexes and a cross-docking experiment using flexible sidechains in 87 HIV protease complexes. We also report its utility in analysis of covalently bound ligands, using both a grid-based docking method and a modification of the flexible sidechain technique.
  • Article
    Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman Hall, 1995. [WF74] R.A. Wagner and M.J. Fischer. The String to String Correction Problem. Journal of the ACM, 21(1):168--173, 1974. [WM92] S. Wu and U. Manber. Fast Text Searching Allowing Errors. Communications of the ACM, 10(35):83--91, 1992. [KOS+00] S. Kurtz, E. Ohlebusch, J. Stoye, C. Schleiermacher, and R. Giegerich. Computation and Visualization of Degenerate Repeats in Complete Genomes. In ...
  • Article
    The secretory isozyme of human carbonic anhydrase (hCA, EC 4.2.1.1), hCA VI, has been cloned, expressed, and purified in a bacterial expression system. The kinetic parameters for the CO2 hydration reaction proved hCA VI to possess a kcat of 3.4 × 10^5 s^-1 and kcat/KM of 4.9 × 10^7 M^-1 s^-1 (at pH 7.5 and 20 °C). hCA VI has a significant catalytic activity for the physiological reaction on the same order of magnitude as the ubiquitous isoform CA I or the transmembrane, tumor-associated isozyme CA IX. A series of sulfonamides and one sulfamate have been tested for their interaction with this isozyme. Simple benzenesulfonamides were rather ineffective hCA VI inhibitors, with inhibition constants in the range of 1090-6680 nM. Better inhibitors were detected among such derivatives bearing 2- or 4-amino-, 4-aminomethyl-, or 4-hydroxymethyl moieties or among halogenated sulfanilamides (KI values of 608-955 nM). Some clinically used compounds, such as acetazolamide, methazolamide, ethoxzolamide, dichlorophenamide, dorzolamide, brinzolamide, topiramate, sulpiride, and indisulam, or the orphan drug benzolamide, showed effective hCA VI inhibitory activity, with inhibition constants of 0.8-79 nM. The best inhibitors were brinzolamide and sulpiride (KI values of 0.8-0.9 nM), the latter compound being also a CA VI-selective inhibitor. The metallic taste reported as a side effect after the treatment with systemic sulfonamides may be due to the inhibition of the salivary CA VI. Some of the compounds investigated in this study might be used as additives in toothpastes for reducing the acidification produced by the relevant CO2 hydrase activity of enamel CA VI, which leads to the formation of protons and bicarbonate and may have a role in cariogenesis.
  • Article
    The identification of protein function based on biological information is an area of intense research. Here we consider a complementary technique that quantitatively groups and relates proteins based on the chemical similarity of their ligands. We began with 65,000 ligands annotated into sets for hundreds of drug targets. The similarity score between each set was calculated using ligand topology. A statistical model was developed to rank the significance of the resulting similarity scores, which are expressed as a minimum spanning tree to map the sets together. Although these maps are connected solely by chemical similarity, biologically sensible clusters nevertheless emerged. Links among unexpected targets also emerged, among them that methadone, emetine and loperamide (Imodium) may antagonize muscarinic M3, alpha2 adrenergic and neurokinin NK2 receptors, respectively. These predictions were subsequently confirmed experimentally. Relating receptors by ligand chemistry organizes biology to reveal unexpected relationships that may be assayed using the ligands themselves.
  • Article
    Colony stimulating factor-1 (CSF1) and its receptor (CSF1-R) are important in mammary gland development and have been implicated in breast carcinogenesis. In a nested case-control study in the Nurses' Health Study of 726 breast cancer cases diagnosed between June 1, 1992, and June 1, 1998, and 734 matched controls, we prospectively evaluated whether circulating levels of CSF1 (assessed in 1989-1990) are associated with breast cancer risk. The association varied by menopausal status (P(heterogeneity) = 0.009). CSF1 levels in the highest quartile (versus lowest) were associated with an 85% reduced risk of premenopausal breast cancer [relative risk (RR), 0.15; 95% confidence interval (95% CI), 0.03-0.85; P(trend) = 0.02]. In contrast, CSF1 levels in the highest quartile conferred a 33% increased risk of postmenopausal breast cancer (RR, 1.33; 95% CI, 0.96-1.86; P(trend) = 0.11), with greatest risk for invasive (RR, 1.45; 95% CI, 1.02-2.07; P(trend) = 0.06) and ER+/PR+ tumors (RR, 1.72; 95% CI, 1.11-2.66; P(trend) = 0.04). Thus, the association of circulating CSF1 levels and breast cancer varies by menopausal status.