Conference Paper

Predicting Drug Target Interaction by Integrating Drug Fingerprint and Drug Side Effect Using Machine Learning

Abstract

Drug discovery is an important step that precedes drug development: it is the process of identifying and testing a drug before medical use. Drugs cure diseases by interacting with a target, which is a protein in human cells. Laboratory experiments to discover drugs and their applications consume considerable cost and time. Machine learning has improved the process of drug discovery and the prediction of drug-target interactions, which helps in predicting new drugs and finding more applications for old drugs. Predicting drug-target interactions starts with studying the nature of drugs and their properties. Most existing datasets cover drugs, targets and their interactions. We compiled our own dataset to include side effects as a drug feature. The dataset contains 400 drugs, 794 targets and 3990 side effects. In this study, a machine-learning model is implemented using three different classifiers: Decision Tree, Random Forest (RF) and K-Nearest Neighbors (K-NN). Drug fingerprints and side effects were used as input features to train the model. Three experiments were conducted, using fingerprints alone, side effects alone, and both together. Results showed improved prediction when integrating both drug fingerprint and side effect. K-NN scored the best results across the three experiments, with an average accuracy of 94.69%.
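The idea of integrating fingerprint and side-effect features can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state the distance metric, the value of k, or how the prediction task is framed, so the example assumes one binary label per drug for a single target, Hamming distance, and k = 3. All data values and function names are invented for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Count the positions where two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training rows nearest to query."""
    ranked = sorted(range(len(train_X)), key=lambda i: hamming(train_X[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (assumed): 4-bit structural fingerprints and 3-bit side-effect profiles.
fingerprints = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
side_effects = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
interacts = [1, 1, 0, 0]  # assumed label: does the drug bind one fixed target?

# Integration step: concatenate both feature blocks into one vector per drug.
combined = [fp + se for fp, se in zip(fingerprints, side_effects)]

query = [1, 1, 0, 0] + [1, 1, 0]  # a new drug described by both feature types
prediction = knn_predict(combined, interacts, query, k=3)
```

Running the same function on `fingerprints` alone or `side_effects` alone corresponds to the other two experimental settings described in the abstract.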

Article
Analysis of drug–target interactions (DTIs) is of great importance in developing new drug candidates for known protein targets or discovering new targets for old drugs. However, the experimental approaches for identifying DTIs are expensive, laborious and challenging. In this study, we report a novel computational method for predicting DTIs using the highly discriminative information of drug-target interactions and our newly developed discriminative vector machine (DVM) classifier. More specifically, each target protein sequence is transformed as the position-specific scoring matrix (PSSM), in which the evolutionary information is retained; then the local binary pattern (LBP) operator is used to calculate the LBP histogram descriptor. For a drug molecule, a novel fingerprint representation is utilized to describe its chemical structure information representing existence of certain functional groups or fragments. When applying the proposed method to the four datasets (Enzyme, GPCR, Ion Channel and Nuclear Receptor) for predicting DTIs, we obtained good average accuracies of 93.16%, 89.37%, 91.73% and 92.22%, respectively. Furthermore, we compared the performance of the proposed model with that of the state-of-the-art SVM model and other previous methods. The achieved results demonstrate that our method is effective and robust and can be taken as a useful tool for predicting DTIs.
Article
In this work, we propose a dual-network integrated logistic matrix factorization (DNILMF) algorithm to predict potential drug-target interactions (DTI). The prediction procedure consists of four steps: (1) inferring new drug/target profiles and constructing profile kernel matrix; (2) diffusing drug profile kernel matrix with drug structure kernel matrix; (3) diffusing target profile kernel matrix with target sequence kernel matrix; and (4) building DNILMF model and smoothing new drug/target predictions based on their neighbors. We compare our algorithm with the state-of-the-art method based on the benchmark dataset. Results indicate that the DNILMF algorithm outperforms the previously reported approaches in terms of AUPR (area under precision-recall curve) and AUC (area under curve of receiver operating characteristic) based on the 5 trials of 10-fold cross-validation. We conclude that the performance improvement depends on not only the proposed objective function, but also the used nonlinear diffusion technique which is important but under studied in the DTI prediction field. In addition, we also compile a new DTI dataset for increasing the diversity of currently available benchmark datasets. The top prediction results for the new dataset are confirmed by experimental studies or supported by other computational research.
Article
The systems-level characterization of drug-target associations in myocardial infarction (MI) has not been reported to date. We report a computational approach that combines different sources of drug and protein interaction information to assemble the myocardial infarction drug-target interactome network (My-DTome). My-DTome comprises approved and other drugs interlinked in a single, highly-connected network with modular organization. We show that approved and other drugs may both be highly connected and represent network bottlenecks. This highlights influential roles for such drugs on seemingly unrelated targets and pathways via direct and indirect interactions. My-DTome modules are associated with relevant molecular processes and pathways. We find evidence that these modules may be regulated by microRNAs with potential therapeutic roles in MI. Different drugs can jointly impact a module. We provide systemic insights into cardiovascular effects of non-cardiovascular drugs. My-DTome provides the basis for an alternative approach to investigate new targets and multidrug treatment in MI.
Article
To facilitate the study of interactions between proteins and chemicals, we have created STITCH, an aggregated database of interactions connecting over 300 000 chemicals and 2.6 million proteins from 1133 organisms. Compared to the previous version, the number of chemicals with interactions and the number of high-confidence interactions both increase 4-fold. The database can be accessed interactively through a web interface, displaying interactions in an integrated network view. It is also available for computational studies through downloadable files and an API. As an extension in the current version, we offer the option to switch between two levels of detail, namely whether stereoisomers of a given compound are shown as a merged entity or as separate entities. Separate display of stereoisomers is necessary, for example, for carbohydrates and chiral drugs. Combining the isomers increases the coverage, as interaction databases and publications found through text mining will often refer to compounds without specifying the stereoisomer. The database is accessible at http://stitch.embl.de/.
Article
In silico prediction of drug-target interactions from heterogeneous biological data is critical in the search for drugs and therapeutic targets for known diseases such as cancers. There is therefore a strong incentive to develop new methods capable of detecting these potential drug-target interactions efficiently. In this article, we investigate the relationship between the chemical space, the pharmacological space and the topology of drug-target interaction networks, and show that drug-target interactions are more correlated with pharmacological effect similarity than with chemical structure similarity. We then develop a new method to predict unknown drug-target interactions from chemical, genomic and pharmacological data on a large scale. The proposed method consists of two steps: (i) prediction of pharmacological effects from chemical structures of given compounds and (ii) inference of unknown drug-target interactions based on the pharmacological effect similarity in the framework of supervised bipartite graph inference. The originality of the proposed method lies in the prediction of potential pharmacological similarity for any drug candidate compounds and in the integration of chemical, genomic and pharmacological data in a unified framework. In the results, we make predictions for four classes of important drug-target interactions involving enzymes, ion channels, GPCRs and nuclear receptors. Our comprehensively predicted drug-target interaction networks enable us to suggest many potential drug-target interactions and to increase research productivity toward genomic drug discovery. Datasets and all prediction results are available at http://cbio.ensmp.fr/~yyamanishi/pharmaco/. Softwares are available upon request.
Article
BRENDA (BRaunschweig ENzyme DAtabase) represents a comprehensive collection of enzyme and metabolic information, based on primary literature. The database contains data from at least 83 000 different enzymes from 9800 different organisms, classified in ∼4200 EC numbers. BRENDA includes biochemical and molecular information on classification and nomenclature, reaction and specificity, functional parameters, occurrence, enzyme structure, application, engineering, stability, disease, isolation and preparation, links and literature references. The data are extracted and evaluated from ∼46 000 references, which are linked to PubMed as long as the reference is cited in PubMed. In the past year BRENDA has undergone major changes including a large increase in updating speed with >50% of all data updated in 2002 or in the first half of 2003, the development of a new EC-tree browser, a taxonomy-tree browser, a chemical substructure search engine for ligand structure, the development of controlled vocabulary, an ontology for some information fields and a thesaurus for ligand names. The database is accessible free of charge to the academic community at http://www.brenda.uni-koeln.de.
Article
DrugBank is a unique bioinformatics/cheminformatics resource that combines detailed drug (i.e. chemical) data with comprehensive drug target (i.e. protein) information. The database contains >4100 drug entries including >800 FDA approved small molecule and biotech drugs as well as >3200 experimental drugs. Additionally, >14 000 protein or drug target sequences are linked to these drug entries. Each DrugCard entry contains >80 data fields with half of the information being devoted to drug/chemical data and the other half devoted to drug target or protein data. Many data fields are hyperlinked to other databases (KEGG, PubChem, ChEBI, PDB, Swiss-Prot and GenBank) and a variety of structure viewing applets. The database is fully searchable supporting extensive text, sequence, chemical structure and relational query searches. Potential applications of DrugBank include in silico drug target discovery, drug design, drug docking or screening, drug metabolism prediction, drug interaction prediction and general pharmaceutical education. DrugBank is available at http://redpoll.pharmacy.ualberta.ca/drugbank/.
Article
The increasing amount of genomic and molecular information is the basis for understanding higher-order biological systems, such as the cell and the organism, and their interactions with the environment, as well as for medical, industrial and other practical applications. The KEGG resource (http://www.genome.jp/kegg/) provides a reference knowledge base for linking genomes to biological systems, categorized as building blocks in the genomic space (KEGG GENES) and the chemical space (KEGG LIGAND), and wiring diagrams of interaction networks and reaction networks (KEGG PATHWAY). A fourth component, KEGG BRITE, has been formally added to the KEGG suite of databases. This reflects our attempt to computerize functional interpretations as part of the pathway reconstruction process based on the hierarchically structured knowledge about the genomic, chemical and network spaces. In accordance with the new chemical genomics initiatives, the scope of KEGG LIGAND has been significantly expanded to cover both endogenous and exogenous molecules. Specifically, RPAIR contains curated chemical structure transformation patterns extracted from known enzymatic reactions, which would enable analysis of genome-environment interactions, such as the prediction of new reactions and new enzyme genes that would degrade new environmental compounds. Additionally, drug information is now stored separately and linked to new KEGG DRUG structure maps.
Article
TarFisDock is a web-based tool for automating the procedure of searching for small molecule-protein interactions over a large repertoire of protein structures. It offers PDTD (potential drug target database), a target database containing 698 protein structures covering 15 therapeutic areas and a reverse ligand-protein docking program. In contrast to conventional ligand-protein docking, reverse ligand-protein docking aims to seek potential protein targets by screening an appropriate protein database. The input file of this web server is the small molecule to be tested, in standard mol2 format; TarFisDock then searches for possible binding proteins for the given small molecule by use of a docking approach. The ligand-protein interaction energy terms of the program DOCK are adopted for ranking the proteins. To test the reliability of the TarFisDock server, we searched the PDTD for putative binding proteins for vitamin E and 4H-tamoxifen. The top 2 and 10% candidates of vitamin E binding proteins identified by TarFisDock respectively cover 30 and 50% of reported targets verified or implicated by experiments; and 30 and 50% of experimentally confirmed targets for 4H-tamoxifen appear amongst the top 2 and 5% of the TarFisDock predicted candidates, respectively. Therefore, TarFisDock may be a useful tool for target identification, mechanism study of old drugs and probes discovered from natural products. TarFisDock and PDTD are available at http://www.dddc.ac.cn/tarfisdock/.
Article
The molecular basis of drug action is often not well understood. This is partly because the very abundant and diverse information generated in the past decades on drugs is hidden in millions of medical articles or textbooks. Therefore, we developed a one-stop data warehouse, SuperTarget that integrates drug-related information about medical indication areas, adverse drug effects, drug metabolization, pathways and Gene Ontology terms of the target proteins. An easy-to-use query interface enables the user to pose complex queries, for example to find drugs that target a certain pathway, interacting drugs that are metabolized by the same cytochrome P450 or drugs that target the same protein but are metabolized by different enzymes. Furthermore, we provide tools for 2D drug screening and sequence comparison of the targets. The database contains more than 2500 target proteins, which are annotated with about 7300 relations to 1500 drugs; the vast majority of entries have pointers to the respective literature source. A subset of these drugs has been annotated with additional binding information and indirect interactions and is available as a separate resource called Matador. SuperTarget and Matador are available at http://insilico.charite.de/supertarget and http://matador.embl.de.
Article
DrugBank is a richly annotated resource that combines detailed drug data with comprehensive drug target and drug action information. Since its first release in 2006, DrugBank has been widely used to facilitate in silico drug target discovery, drug design, drug docking or screening, drug metabolism prediction, drug interaction prediction and general pharmaceutical education. The latest version of DrugBank (release 2.0) has been expanded significantly over the previous release. With ∼4900 drug entries, it now contains 60% more FDA-approved small molecule and biotech drugs including 10% more ‘experimental’ drugs. Significantly, more protein target data has also been added to the database, with the latest version of DrugBank containing three times as many non-redundant protein or drug target sequences as before (1565 versus 524). Each DrugCard entry now contains more than 100 data fields with half of the information being devoted to drug/chemical data and the other half devoted to pharmacological, pharmacogenomic and molecular biological data. A number of new data fields, including food–drug interactions, drug–drug interactions and experimental ADME data have been added in response to numerous user requests. DrugBank has also significantly improved the power and simplicity of its structure query and text query searches. DrugBank is available at http://www.drugbank.ca
Article
Cervical cancer is the fourth most common malignant disease in women worldwide. In most cases, cervical cancer symptoms are not noticeable in the early stages. Many factors increase the risk of developing cervical cancer, such as Human Papilloma Virus (HPV), Sexually Transmitted Diseases (STD) and smoking. Identifying those factors and building a classification model to determine whether a case is cervical cancer or not is a challenging research problem. This study aims at using cervical cancer risk factors to build a classification model using the Random Forest (RF) classification technique with the Synthetic Minority Oversampling Technique (SMOTE) and two feature reduction techniques, Recursive Feature Elimination (RFE) and Principal Component Analysis (PCA). Most medical datasets are imbalanced because the number of patients is much smaller than the number of non-patients. Because the dataset used is imbalanced, SMOTE is applied to address this problem. The dataset consists of 32 risk factors and 4 target variables: Hinselmann, Schiller, Cytology and Biopsy. After comparing the results, we find that combining the random forest classification technique with SMOTE improves the classification performance.
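The oversampling step mentioned above can be sketched as follows. This is a minimal illustration of the core SMOTE idea (a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours), not the cited study's pipeline. The function name, the toy points, k = 2 and the fixed seed are assumptions.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of base, by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class (assumed values): four points on a unit square.
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote(minority, n_new=5)
```

Because every synthetic point is an interpolation between two existing minority samples, the new points stay inside the region already occupied by the minority class.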
Article
Predicting the role of a protein is one of the most challenging problems. Few approaches are available for predicting the role of an unknown protein as a drug target or vaccine candidate. We propose here a Naïve Bayes probabilistic classifier, a promising method for reliable predictions. This method is tested on the proteins identified in our mass spectrometry based membrane proteomics study of the Leishmania donovani parasite, which causes a fatal disease (Visceral Leishmaniasis) in humans all around the world. Most of the vaccine/drug targets belonging to membrane proteins are represented as key players in the pathogenesis of Leishmania infection. Analyses of our previous results, using the Naïve Bayes probabilistic classifier, indicate that this method predicts the role of unknown/hypothetical proteins (as drug targets/vaccine candidates) with significantly higher precision. We have employed this method in order to provide probabilistic predictions of unknown/hypothetical proteins as targets. This study reports the unknown/hypothetical proteins of the Leishmania membrane fraction as potential drug targets and vaccine candidates, which is vital information for this parasite. Future molecular studies and characterization of these potent targets may produce a recombinant therapeutic/prophylactic tool against Visceral Leishmaniasis. These unknown/hypothetical proteins may open a vast research field to be exploited for novel treatment strategies.
Conference Paper
This paper introduces a framework for inference of timed temporal logic properties from data. The dataset is given as a finite set of pairs of finite-time system traces and labels, where the labels indicate whether the traces exhibit some desired behavior (e.g., a ship traveling along a safe route). We propose a decision-tree based approach for learning signal temporal logic classifiers. The method produces binary decision trees that represent the inferred formulae. Each node of the tree contains a test associated with the satisfaction of a simple formula, optimally tuned from a predefined finite set of primitives. Optimality is assessed using heuristic impurity measures, which capture how well the current primitive splits the data with respect to the traces' labels. We propose extensions of the usual impurity measures from machine learning literature to handle classification of system traces by leveraging upon the robustness degree concept. The proposed incremental construction procedure greatly improves the execution time and the accuracy compared to existing algorithms. We present two case studies that illustrate the usefulness and the computational advantages of the algorithms. The first is an anomaly detection problem in a maritime environment. The second is a fault detection problem in an automotive powertrain system.
Article
Identification of drug-target interactions is an important process in drug discovery. Although high-throughput screening and other biological assays are becoming available, experimental methods for drug-target interaction identification remain extremely costly, time-consuming and challenging. Therefore, various computational models have been developed to predict potential drug-target associations on a large scale. In this review, databases and web servers involved in drug-target identification and drug discovery are summarized. In addition, we introduce some state-of-the-art computational models for drug-target interaction prediction, including network-based and machine learning-based methods. In particular, for the machine learning-based methods, much attention is paid to supervised and semi-supervised models, which differ essentially in their adoption of negative samples. Although many effective computational models have brought significant improvements in drug-target interaction prediction, both network-based and machine learning-based methods have their own disadvantages. Furthermore, we discuss the future directions of network-based drug discovery and the network approach for personalized drug discovery based on personalized medicine, genome sequencing, tumor clone-based networks and cancer hallmark-based networks. Finally, we discuss a new evaluation and validation framework and the formulation of the drug-target interaction prediction problem as a more realistic regression problem based on quantitative bioactivity data. © The Author 2015. Published by Oxford University Press.
Article
Predicting drug–target interaction using computational approaches is an important step in drug discovery and repositioning. To predict whether there will be an interaction between a drug and a target, most existing methods identify similar drugs and targets in the database. The prediction is then made based on the known interactions of these drugs and targets. This idea is promising. However, there are two shortcomings that have not yet been addressed appropriately. Firstly, most of the methods only use 2D chemical structures and protein sequences to measure the similarity of drugs and targets respectively. However, this information may not fully capture the characteristics determining whether a drug will interact with a target. Secondly, there are very few known interactions, i.e. many interactions are “missing” in the database. Existing approaches are biased towards known interactions and have no good solutions to handle possibly missing interactions which affect the accuracy of the prediction. In this paper, we enhance the similarity measures to include non-structural (and non-sequence-based) information and introduce the concept of a “super-target” to handle the problem of possibly missing interactions. Based on evaluations on real data, we show that our similarity measure is better than the existing measures and our approach is able to achieve higher accuracy than the two best existing algorithms, WNN-GIP and KBMF2K. Our approach is available at http://web.hku.hk/∼liym1018/projects/drug/drug.html or http://www.bmlnwpu.org/us/tools/PredictingDTI_S2/METHODS.html.
Conference Paper
Predicting drug-target interaction using computational approaches is an important step in drug discovery and repositioning. To predict whether there will be an interaction between a drug and a target, most existing methods identify similar drugs and targets in the database. The prediction is then made based on the known interactions of these drugs and targets. This idea is promising. However, there are two shortcomings that have not yet been addressed appropriately. Firstly, most of the methods only use 2D chemical structures and protein sequences to measure the similarity of drugs and targets respectively. However, this information may not fully capture the characteristics determining whether a drug will interact with a target. Secondly, there are very few known interactions, i.e. many interactions are "missing" in the database. Existing approaches are biased towards known interactions and have no good solutions to handle possibly missing interactions which affect the accuracy of the prediction. In this paper, we enhance the similarity measures to include non-structural (and non-sequence-based) information and introduce the concept of a "super-target" to handle the problem of possibly missing interactions. Based on evaluations on real data, we show that our similarity measure is better than the existing measures and our approach is able to achieve higher accuracy than the two best existing algorithms, WNN-GIP and KBMF2K. Our approach is available at http://web.hku.hk/∼liym1018/projects/drug/drug.html or http://www.bmlnwpu.org/us/tools/PredictingDTI_S2/METHODS.html. Copyright © 2015. Published by Elsevier Inc.
Article
PubChem is an open repository for experimental data identifying the biological activities of small molecules. PubChem contents include more than: 1000 bioassays, 28 million bioassay test outcomes, 40 million substance contributed descriptions, and 19 million unique compound structures contributed from over 70 depositing organizations. PubChem provides a significant, publicly accessible platform for mining the biological information of small molecules.
Article
Traditionally, most drugs have been discovered using phenotypic or target-based screens. Subsequently, their indications are often expanded on the basis of clinical observations, providing additional benefit to patients. This review highlights computational techniques for systematic analysis of transcriptomics (Connectivity Map, CMap), side effects, and genetics (genome-wide association study, GWAS) data to generate new hypotheses for additional indications. We also discuss data domains such as electronic health records (EHRs) and phenotypic screening that we consider promising for novel computational repositioning methods. Clinical Pharmacology & Therapeutics (2013); advance online publication 27 February 2013. doi:10.1038/clpt.2013.1.
Article
The identification of interactions between drugs and target proteins plays a key role in the process of genomic drug discovery. It is both time-consuming and costly to determine drug-target interactions by experiments alone. Therefore, there is an urgent need to develop new in silico prediction approaches capable of identifying these potential drug-target interactions in a timely manner. In this article, we aim at extending current structure-activity relationship (SAR) methodology to fulfill such requirements. In some sense, a drug-target interaction can be regarded as an event or property triggered by many influence factors from drugs and target proteins. Thus, each interaction pair can be represented theoretically by using these factors, which are based on the structural and physicochemical properties of both drugs and proteins. To realize this, drug molecules are encoded with MACCS substructure fingerprints representing the existence of certain functional groups or fragments, and proteins are encoded with some biochemical and physicochemical properties. Four classes of drug-target interaction networks in humans, involving enzymes, ion channels, G-protein-coupled receptors (GPCRs) and nuclear receptors, are independently used for establishing predictive models with support vector machines (SVMs). The SVM models gave prediction accuracies of 90.31%, 88.91%, 84.68% and 83.74% for the four datasets, respectively. In conclusion, the results demonstrate the ability of our proposed method to predict drug-target interactions, and show a general compatibility between the new scheme and current SAR methodology. They open the way to a host of new investigations on the diversity analysis and prediction of drug-target interactions.
Article
Access to unified datasets of protein and genetic interactions is critical for interrogation of gene/protein function and analysis of global network properties. BioGRID is a freely accessible database of physical and genetic interactions available at http://www.thebiogrid.org. BioGRID release version 2.0 includes >116 000 interactions from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens. Over 30 000 interactions have recently been added from 5778 sources through exhaustive curation of the Saccharomyces cerevisiae primary literature. An internally hyper-linked web interface allows for rapid search and retrieval of interaction data. Full or user-defined datasets are freely downloadable as tab-delimited text files and PSI-MI XML. Pre-computed graphical layouts of interactions are available in a variety of file formats. User-customized graphs with embedded protein, gene and interaction attributes can be constructed with a visualization system called Osprey that is dynamically linked to the BioGRID.
Article
Targets for drugs have so far been predicted on the basis of molecular or cellular features, for example, by exploiting similarity in chemical structure or in activity across cell lines. We used phenotypic side-effect similarities to infer whether two drugs share a target. Applied to 746 marketed drugs, a network of 1018 side effect–driven drug-drug relations became apparent, 261 of which are formed by chemically dissimilar drugs from different therapeutic indications. We experimentally tested 20 of these unexpected drug-drug relations and validated 13 implied drug-target relations by in vitro binding assays, of which 11 reveal inhibition constants equal to less than 10 micromolar. Nine of these were tested and confirmed in cell assays, documenting the feasibility of using phenotypic information to infer molecular interactions and hinting at new uses of marketed drugs.
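As a rough illustration of relating drugs by side-effect similarity, the sketch below scores drug pairs by the overlap of their side-effect sets. It uses the plain Jaccard index as a stand-in, which is an assumption: the abstract does not specify the similarity measure. The drug names, side-effect terms and the 0.5 threshold are invented.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(profiles, threshold):
    """Return (drug, drug, score) for every pair scoring at or above threshold."""
    names = sorted(profiles)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = jaccard(profiles[first], profiles[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs


# Hypothetical side-effect profiles for three drugs.
side_effects = {
    "drug_A": {"nausea", "headache", "dizziness"},
    "drug_B": {"nausea", "headache", "rash"},
    "drug_C": {"insomnia"},
}

pairs = similar_pairs(side_effects, threshold=0.5)
```

Pairs that pass the threshold are candidates for sharing a target; in the study described above, such candidates were then checked experimentally.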
A distributed and privatized framework for drug-target interaction prediction
  • C Lan
  • S Chandrasekaran
  • J Huan
Lan, C., Chandrasekaran, S., Huan, J.: A distributed and privatized framework for drug-target interaction prediction. In: International Conference on Bioinformatics and Biomedicine (BIBM), pp. 731-734. IEEE (2016)