Article

A Multi-View Deep Neural Network Model for Chemical-Disease Relation Extraction From Imbalanced Datasets

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

Understanding the chemical-disease relations (CDR) is a crucial task in various biomedical domains. Manual mining of these information from biomedical literature is costly and time-consuming. To address these issues, various researches have been carried out to design an efficient automatic tool. In this paper, we propose a multi-view based deep neural network model for CDR task. Typically, multiple representations (or views) of the datasets are not available for this task. So, we train multiple conceptually different deep neural network models on the dataset to generate different abstract features, treated as different views. A novel loss function, "Penalized LF", is defined to address the problem of imbalance dataset. The proposed loss function is generic in nature. The model is designed as a combination of Convolution Neural Network (CNN) and Bidirectional Long Short Term Memory (Bi-LSTM) network along with a Multi-Layer Perceptron (MLP). To show the efficacy of our proposed model, we have compared it with six baseline models and other state-of-the-art techniques, on "chemicals-and-disease-DFE" dataset, a free text dataset created by Li et al. from BioCreative V Chemical Disease Relation dataset. Results show that the proposed model attains highest F1-score for individual classes, proving its efficiency in handling class imbalance problem in the dataset. To further demonstrate the efficacy of the proposed model, we have presented results on BioCreative V dataset and two Protein-Protein Interaction Identification (PPI) datasets, viz., AiMed and BioInfer. All these results are also compared with the state-of-the-art models.

No full-text available

Request Full-text Paper PDF

To read the full-text of this research,
you can request a copy directly from the authors.

... To effectively alleviate class imbalance, Ye and Luo [28] present a general ranking-based multilabel learning framework combined with the convolutional neural network (CNN). Mitra et al. [29] propose a multiview-based deep neural network model, which combines CNN and Bidirectional Long Short Term Memory (Bi-LSTM) network along with a multilayer perceptron (MLP). Shi et al. [30] propose an advanced graph neural network, which assigns higher weights to those direct neighbor words that contribute more to relation prediction through breadth exploration. ...
Article
Full-text available
Distant supervision (DS) has been widely used for relation extraction (RE), which automatically generates large-scale labeled data. However, there is a wrong labeling problem, which affects the performance of RE. Besides, the existing method suffers from the lack of useful semantic features for some positive training instances. To address the above problems, we propose a novel RE model with sentence selection and interaction representation for distantly supervised RE. First, we propose a pattern method based on the relation trigger words as a sentence selector to filter out noisy sentences to alleviate the wrong labeling problem. After clean instances are obtained, we propose the interaction representation using the word-level attention mechanism-based entity pairs to dynamically increase the weights of the words related to entity pairs, which can provide more useful semantic information for relation prediction. The proposed model outperforms the strongest baseline by 2.61 in F1-score on a widely used dataset, which proves that our model performs significantly better than the state-of-the-art RE systems.
... Tao et al. [7] proposed crowdsourcing and machine learning approaches for extracting entities indicating potential food-borne outbreaks from social media using the dual-task BERTweet model. Mitra et al. [8] adopted a multiview deep neural network model for chemical-disease relation extraction from imbalanced datasets. ...
Article
Full-text available
Dealing with food safety issues in time through online public opinion incidents can reduce the impact of incidents and protect human health effectively. Therefore, by the smart technology of extracting the entity relationship of public opinion events in the food field, the knowledge graph of the food safety field is constructed to discover the relationship between food safety issues. To solve the problem of multi-entity relationships in food safety incident sentences for few-shot learning, this paper adopts the pipeline-type extraction method. Entity relationship is extracted from Bidirectional Encoder Representation from Transformers (BERTs) joined Bidirectional Long Short-Term Memory (BLSTM), namely, the BERT-BLSTM network model. Based on the entity relationship types extracted from the BERT-BLSTM model and the introduction of Chinese character features, an entity pair extraction model based on the BERT-BLSTM-conditional random field (CRF) is established. In this paper, several common deep neural network models are compared with the BERT-BLSTM-CRF model with a food public opinion events dataset. Experimental results show that the precision of the entity relationship extraction model based on BERT-BLSTM-CRF is 3.29%∼23.25% higher than that of other models in the food public opinion events dataset, which verifies the validity and rationality of the model proposed in this paper.
... In [5], [11], authors proposed two MTMV algorithms, the bipartite graph based MTMV algorithm which can only deal with non-negative data and the semi-nonnegative matrix tri-factorization based MTMV algorithm which is a general algorithm that can deal with negative values. A multi-view based deep neural network model for extracting chemicaldisease relations from imbalanced datasets is developed in [12]. The problem of selecting a single solution from the final Pareto optimal front in connection with multi-view multiobjective optimization based clustering technique is addressed in [13]. ...
Conference Paper
In recent years, multi-view multi-task clustering has received much attention. There are several real-life problems that involve both multi-view clustering and multi-task clustering , i.e., the tasks are closely related, and each task can be analyzed using multiple views. Traditional multi-task multi-view clustering algorithms use single-objective optimization based approaches and cannot apply too-many regularization terms. However, these problems are inherently some multi-objective optimization problems because conflict may be between different views within a given task and also between different tasks, necessitating a trade-off. Based on these observations, in this paper, we have proposed a novel multi-task multi-view multi-objective optimization (MTMV-MO) based clustering algorithm which simultaneously optimizes three objectives, i.e., within-view task relation, within-task view relation and the quality of the clusters obtained. The proposed methodology (MTMV-MO) is evaluated on four different datasets and the results are compared with five state-of-the-art algorithms in terms of Adjusted Rand Index (ARI) and Classification Accuracy (%CoA). MTMV-MO illustrates an improvement of 1.5-2% in terms of ARI and 4-5% in terms of %CoA compared to the state-of-the-art algorithms.
Article
The identification of chemical-disease association types is helpful not only to discovery lead compounds and study drug repositioning, but also to treat disease and decipher pathomechanism. It is very urgent to develop computational method for identifying potential chemical-disease association types, since wet methods are usually expensive, laborious and time-consuming. In this study, molecular fingerprint, gene ontology and pathway are utilized to characterize chemicals and diseases. A novel predictor is proposed to recognize potential chemical-disease associations at the first layer, and further distinguish whether their relationships belong to biomarker or therapeutic relations at the second layer. The prediction performance of current method is assessed using the benchmark dataset based on ten-fold cross-validation. The practical prediction accuracies of the first layer and the second layer are 78.47% and 72.07%, respectively. The recognition ability for lead compounds, new drug indications, potential and true chemical-disease association pairs has also been investigated and confirmed by constructing a variety of datasets and performing a series of experiments. It is anticipated that the current method can be considered as a powerful high-throughput virtual screening tool for drug researches and developments.
Article
In the field of artificial intelligence, classification algorithms tend to be biased toward the majority class samples when encountering imbalanced data, resulting in low recognition rates for minority class samples. Undersampling techniques address this issue by decreasing the number of majority class samples to balance the original data distribution before the dataset is learned. However, current clustering-based undersampling methods have limitations that directly affect the original imbalanced dataset and the final classification performance. To address these problems, we propose a novel three-stage undersampling framework with denoising, fuzzy c-means clustering, and representative sample selection (UFFDFR). This framework improves the classification performance on imbalanced data by removing noise and unrepresentative samples from the majority class. Experiments on 15 different imbalanced datasets demonstrate that UFFDFR effectively removed noise and unrepresentative majority class samples and improved classification performance. Furthermore, UFFDFR outperformed three classic and three state-of-the-art clustering-based undersampling methods in terms F-measure, G-mean, and AUC for five classification algorithms, which was confirmed by the Friedman and Nemenyi post-hoc statistical tests.
Article
Extracting the chemical-induced disease relation from literatures is important for biomedical research. On one hand, it is challenging to capture the interactions among remote words and the long-distance information is not adequately exploited by existing systems for document-level relation extraction. On the other hand, there is some information particularly important to the target relations in documents, which should attract more attention than the less relevant information for the relation extraction. However, this issue is not well addressed in existing methods. In this paper, we present a method that integrates a hybrid graph and a hierarchical concentrative attention to overcome these problems. The hybrid graph is constructed by synthesizing the syntactic graph and Abstract Meaning Representation graph to acquire the long-distance information for document-level relation extraction. Meanwhile, the concentrative attention is used to focus on the most important information, and alleviate the disturbance brought by the less relevant items in the document. The experimental results demonstrate that our model yields competitive performance on the dataset of chemical-induced disease relations.
Article
The identification of causal relationships between events or entities within biomedical texts is of great importance for creating scientific knowledge bases and is also a fundamental natural language processing (NLP) task. A causal (cause-effect) relation is defined as an association between two events in which the first must occur before the second. Although this task is an open problem in artificial intelligence, and despite its important role in information extraction from the biomedical literature, very few works have considered this problem. However, with the advent of new techniques in machine learning, especially deep neural networks, research increasingly addresses this problem. This paper summarizes state-of-the-art research, its applications, existing datasets, and remaining challenges. For this survey we have implemented and evaluated various techniques including a Multiview CNN (MVC), attention-based BiLSTM models and state-of-the-art word embedding models, such as those obtained with bidirectional encoder representations (ELMo) and transformer architectures (BioBERT). In addition, we have evaluated a graph LSTM as well as a baseline rule based system. We have investigated the class imbalance problem as an innate property of annotated data in this type of task. The results show that a considerable improvement of the results of state-of-the-art systems can be achieved when a simple random oversampling technique for data augmentation is used in order to reduce class imbalance.
Article
Search Results Clustering (SRC) is a well-known problem in the field of information retrieval and refers to the clustering of web-snippets for a given query based on some similarity/dissimilarity measure. In this current study, we have posed Search Results Clustering problem as a multi-view clustering problem and solved it from an optimization point of view. Various views based on syntactic and semantic similarity measures were considered while performing the clustering. In contrast to existing algorithms, three new views based on word mover distance, textual-entailment, and universal sentence encoder, measuring semantics while performing clustering, are incorporated in our framework. Different quality measures computed on clusters generated by different views are optimized simultaneously using multi-objective binary differential evolution (MBDE) framework. MBDE comprises a set of solutions and each solution is composed of two parts corresponding to different views. An agreement index checking the accordance between partitionings of different views is also optimized to obtain a consensus partitioning. The proposed approach is automatic in nature as it is capable of detecting the number of clusters for any query in an automatic way. Experiments are performed on three benchmark multi-view datasets corresponding to web search results and evaluated using well-known F-measure metric. Results obtained illustrate that our approach outperforms state-of-the-art techniques.
Article
Full-text available
Knowledge extracted from the protein-protein interaction network can help researchers reveal the molecular mechanisms of biological processes. With the rapid growth in the volume of biomedical literature, manually detecting and annotating protein-protein interactions (PPIs) from raw literature has become increasingly difficult. Hence, automatically extracting protein-protein interactions by machine learning methods from raw literature has gained significance in biomedical research. In the present study, we propose a novel PPI extraction method based on the residual convolutional neural network. This is the first time that the residual convolutional neural network is applied to the PPI extraction task. Additionally, previous state-of-the-art PPI extraction models heavily rely on parsing results from natural language processing tools such as dependency parsers. Our model does not rely on any parsing tools. We evaluated our model based on five benchmark PPI extraction corpora, AIMed, BioInfer, HPRD50, IEPA and LLL. The experimental results showed that our model achieved the best results compared with previous kernel-based and convolutional neural network based PPI extraction models. Compared with previous recurrent neural network based PPI extraction models, our model achieved better or comparable performance.
Article
Full-text available
Epigenome-wide association studies (EWASs) have become increasingly popular for studying DNA methylation (DNAm) variations in complex diseases. The Illumina methylation arrays provide an economical, high-throughput and comprehensive platform for measuring methylation status in EWASs. A number of software tools have been developed for identifying disease-associated differentially methylated regions (DMRs) in the epigenome. However, in practice, we found these tools typically had multiple parameter settings that needed to be specified and the performance of the software tools under different parameters was often unclear. To help users better understand and choose optimal parameter settings when using DNAm analysis tools, we conducted a comprehensive evaluation of 4 popular DMR analysis tools under 60 different parameter settings. In addition to evaluating power, precision, area under precision-recall curve, Matthews correlation coefficient, F1 score and type I error rate, we also compared several additional characteristics of the analysis results, including the size of the DMRs, overlap between the methods and execution time. The results showed that none of the software tools performed best under their default parameter settings, and power varied widely when parameters were changed. Overall, the precision of these software tools were good. In contrast, all methods lacked power when effect size was consistent but small. Across all simulation scenarios, comb-p consistently had the best sensitivity as well as good control of false-positive rate.
Article
Full-text available
Background: Extracting relationships between chemicals and diseases from unstructured literature have attracted plenty of attention since the relationships are very useful for a large number of biomedical applications such as drug repositioning and pharmacovigilance. A number of machine learning methods have been proposed for chemical-induced disease (CID) extraction due to some publicly available annotated corpora. Most of them suffer from time-consuming feature engineering except deep learning methods. In this paper, we propose a novel document-level deep learning method, called recurrent piecewise convolutional neural networks (RPCNN), for CID extraction. Results: Experimental results on a benchmark dataset, the CDR (Chemical-induced Disease Relation) dataset of the BioCreative V challenge for CID extraction show that the highest precision, recall and F-score of our RPCNN-based CID extraction system are 65.24, 77.21 and 70.77%, which is competitive with other state-of-the-art systems. Conclusions: A novel deep learning method is proposed for document-level CID extraction, where domain knowledge, piecewise strategy, attention mechanism, and multi-instance learning are combined together. The effectiveness of the method is proved by experiments conducted on a benchmark dataset.
Article
Full-text available
Advancements in genomic research such as high-throughput sequencing techniques have driven modern genomic studies into "big data" disciplines. This data explosion is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in a variety of fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning since we are expecting from deep learning a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with a proper deep architecture, and remark on practical considerations of developing modern deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research, as well as pointing out potential opportunities and obstacles for future genomics applications.
Article
Full-text available
The problem of predicting the location of users on large social networks like Twitter has emerged from real-life applications such as social unrest detection and online marketing. Twitter user geolocation is a difficult and active research topic with a vast literature. Most of the proposed methods follow either a content-based or a network-based approach. The former exploits user-generated content while the latter utilizes the connection or interaction between Twitter users. In this paper, we introduce a novel method combining the strength of both approaches. Concretely, we propose a multi-entry neural network architecture named MENET leveraging the advances in deep learning and multiview learning. The generalizability of MENET enables the integration of multiple data representations. In the context of Twitter user geolocation, we realize MENET with textual, network, and metadata features. Considering the natural distribution of Twitter users across the concerned geographical area, we subdivide the surface of the earth into multi-scale cells and train MENET with the labels of the cells. We show that our method outperforms the state of the art by a large margin on three benchmark datasets.
Article
Full-text available
Automatic extraction of protein-protein interaction (PPI) pairs from biomedical literature is a widely examined task in biological information extraction. Currently, many kernel based approaches such as linear kernel, tree kernel, graph kernel and combination of multiple kernels has achieved promising results in PPI task. However, most of these kernel methods fail to capture the semantic relation information between two entities. In this paper, we present a special type of tree kernel for PPI extraction which exploits both syntactic (structural) and semantic vectors information known as Distributed Smoothed Tree kernel (DSTK). DSTK comprises of distributed trees with syntactic information along with distributional semantic vectors representing semantic information of the sentences or phrases. To generate robust machine learning model composition of feature based kernel and DSTK were combined using ensemble support vector machine (SVM). Five different corpora (AIMed, BioInfer, HPRD50, IEPA, and LLL) were used for evaluating the performance of our system. Experimental results show that our system achieves better f-score with five different corpora compared to other state-of-the-art systems.
Article
Full-text available
This article describes our work on the BioCreative-V chemical–disease relation (CDR) extraction task, which employed a maximum entropy (ME) model and a convolutional neural network model for relation extraction at inter- and intra-sentence level, respectively. In our work, relation extraction between entity concepts in documents was simplified to relation extraction between entity mentions. We first constructed pairs of chemical and disease mentions as relation instances for training and testing stages, then we trained and applied the ME model and the convolutional neural network model for inter- and intra-sentence level, respectively. Finally, we merged the classification results from mention level to document level to acquire the final relations between chemical and disease concepts. The evaluation on the BioCreative-V CDR corpus shows the effectiveness of our proposed approach. Database URL:http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/
Article
Full-text available
The plethora of biomedical relations which are embedded in medical logs (records) demands researchers’ attention. Previous theoretical and practical focuses were restricted on traditional machine learning techniques. However, these methods are susceptible to the issues of “vocabulary gap” and data sparseness and the unattainable automation process in feature extraction. To address aforementioned issues, in this work, we propose a multichannel convolutional neural network (MCCNN) for automated biomedical relation extraction. The proposed model has the following two contributions: (1) it enables the fusion of multiple (e.g., five) versions in word embeddings; (2) the need for manual feature engineering can be obviated by automated feature learning with convolutional neural network (CNN). We evaluated our model on two biomedical relation extraction tasks: drug-drug interaction (DDI) extraction and protein-protein interaction (PPI) extraction. For DDI task, our system achieved an overall f -score of 70.2% compared to the standard linear SVM based system (e.g., 67.0%) on DDIExtraction 2013 challenge dataset. And for PPI task, we evaluated our system on Aimed and BioInfer PPI corpus; our system exceeded the state-of-art ensemble SVM system by 2.7% and 5.6% on f -scores.
Article
Full-text available
Background Due to the importance of identifying relations between chemicals and diseases for new drug discovery and improving chemical safety, there has been a growing interest in developing automatic relation extraction systems for capturing these relations from the rich and rapid-growing biomedical literature. In this work we aim to build on current advances in named entity recognition and a recent BioCreative effort to further improve the state of the art in biomedical relation extraction, in particular for the chemical-induced disease (CID) relations. ResultsWe propose a rich-feature approach with Support Vector Machine to aid in the extraction of CIDs from PubMed articles. Our feature vector includes novel statistical features, linguistic knowledge, and domain resources. We also incorporate the output of a rule-based system as features, thus combining the advantages of rule- and machine learning-based systems. Furthermore, we augment our approach with automatically generated labeled text from an existing knowledge base to improve performance without additional cost for corpus construction. To evaluate our system, we perform experiments on the human-annotated BioCreative V benchmarking dataset and compare with previous results. When trained using only BioCreative V training and development sets, our system achieves an F-score of 57.51 %, which already compares favorably to previous methods. Our system performance was further improved to 61.01 % in F-score when augmented with additional automatically generated weakly labeled data. Conclusions Our text-mining approach demonstrates state-of-the-art performance in disease-chemical relation extraction. More importantly, this work exemplifies the use of (freely available) curated document-level annotations in existing biomedical databases, which are largely overlooked in text-mining system development.
Article
Full-text available
The state-of-the-art methods for protein-protein interaction (PPI) extraction are primarily based on kernel methods, and their performances strongly depend on the handcraft features. In this paper, we tackle PPI extraction by using convolutional neural networks (CNN) and propose a shortest dependency path based CNN (sdpCNN) model. The proposed method ( 1 ) only takes the sdp and word embedding as input and ( 2 ) could avoid bias from feature selection by using CNN. We performed experiments on standard Aimed and BioInfer datasets, and the experimental results demonstrated that our approach outperformed state-of-the-art kernel based methods. In particular, by tracking the sdpCNN model, we find that sdpCNN could extract key features automatically and it is verified that pretrained word embedding is crucial in PPI task.
Article
Full-text available
Protein-Protein Interactions (PPIs) information extraction from biomedical literature helps unveil the molecular mechanisms of biological processes. Machine learning methods have been the most popular ones in PPI extraction area. However, these methods are still feature engineering-based, which means that their performances are also heavily dependent on the appropriate feature selection which is still a skill-dependent task. This paper presents a deep neural network-based approach which can learn complex and abstract features automatically from unlabelled data by unsupervised representation learning methods. This approach first employs the training algorithm of auto-encoders to initialise the parameters of a deep multilayer neural network. Then the gradient descent method using back propagation is applied to train this deep multilayer neural network model. Experimental results on five public PPI corpora show that our method can achieve better performance than can a multilayer neural network: on two 'toughest handling' corpora AImed and BioInfer, the former outperforms the latter with the improvements of 3.10 and 2.89 percentage units in F-score, respectively. In addition, the performance comparison with APG also verifies the effectiveness of our method.
Article
Full-text available
Relations between chemicals and diseases are one of the most queried biomedical interactions. Although expert manual curation is the standard method for extracting these relations from the literature, it is expensive and impractical to apply to large numbers of documents, and therefore alternative methods are required. We describe here a crowdsourcing workflow for extracting chemical-induced disease relations from free text as part of the BioCreative V Chemical Disease Relation challenge. Five non-expert workers on the CrowdFlower platform were shown each potential chemical-induced disease relation highlighted in the original source text and asked to make binary judgments about whether the text supported the relation. Worker responses were aggregated through voting, and relations receiving four or more votes were predicted as true. On the official evaluation dataset of 500 PubMed abstracts, the crowd attained a 0.505 F-score (0.475 precision, 0.540 recall), with a maximum theoretical recall of 0.751 due to errors with named entity recognition. The total crowdsourcing cost was $1290.67 ($2.58 per abstract) and took a total of 7 h. A qualitative error analysis revealed that 46.66% of sampled errors were due to task limitations and gold standard errors, indicating that performance can still be improved. All code and results are publicly available at https://github.com/SuLab/crowd_cid_relex Database URL: https://github.com/SuLab/crowd_cid_relex
Article
Full-text available
Identifying chemical–disease relations (CDR) from biomedical literature could improve chemical safety and toxicity studies. This article proposes a novel syntactic and semantic information exploitation method for CDR extraction. The proposed method consists of a feature-based model, a tree kernel-based model and a neural network model. The feature-based model exploits lexical features, the tree kernel-based model captures syntactic structure features, and the neural network model generates semantic representations. The motivation of our method is to fully utilize the nice properties of the three models to explore diverse information for CDR extraction. Experiments on the BioCreative V CDR dataset show that the three models are all effective for CDR extraction, and their combination could further improve extraction performance. Database URL: http://www.biocreative.org/resources/corpora/biocreative-v-cdr-corpus/.
Article
Full-text available
We describe our approach to the chemical–disease relation (CDR) task in the BioCreative V challenge. The CDR task consists of two subtasks: automatic disease-named entity recognition and normalization (DNER), and extraction of chemical-induced diseases (CIDs) from Medline abstracts. For the DNER subtask, we used our concept recognition tool Peregrine, in combination with several optimization steps. For the CID subtask, our system, which we named RELigator, was trained on a rich feature set, comprising features derived from a graph database containing prior knowledge about chemicals and diseases, and linguistic and statistical features derived from the abstracts in the CDR training corpus. We describe the systems that were developed and present evaluation results for both subtasks on the CDR test set. For DNER, our Peregrine system reached an F-score of 0.757. For CID, the system achieved an F-score of 0.526, which ranked second among 18 participating teams. Several post-challenge modifications of the systems resulted in substantially improved F-scores (0.828 for DNER and 0.602 for CID). RELigator is available as a web service at http://biosemantics.org/index.php/software/religator.
Article
Full-text available
Understanding the relations between chemicals and diseases is crucial in various biomedical tasks such as new drug discoveries and new therapy developments. While manually mining these relations from the biomedical literature is costly and time-consuming, such a procedure is often difficult to keep up-to-date. To address these issues, the BioCreative-V community proposed a challenging task of automatic extraction of chemical-induced disease (CID) relations in order to benefit biocuration. This article describes our work on the CID relation extraction task on the BioCreative-V tasks. We built a machine learning based system that utilized simple yet effective linguistic features to extract relations with maximum entropy models. In addition to leveraging various features, the hypernym relations between entity concepts derived from the Medical Subject Headings (MeSH)-controlled vocabulary were also employed during both training and testing stages to obtain more accurate classification models and better extraction performance, respectively. We demoted relation extraction between entities in documents to relation extraction between entity mentions. In our system, pairs of chemical and disease mentions at both intra- and inter-sentence levels were first constructed as relation instances for training and testing, then two classification models at both levels were trained from the training examples and applied to the testing examples. Finally, we merged the classification results from mention level to document level to acquire final relations between chemicals and diseases. Our system achieved promising F-scores of 60.4% on the development dataset and 58.3% on the test dataset using gold-standard entity annotations, respectively. Database URL: https://github.com/JHnlp/BC5CIDTask
Article
Full-text available
Awareness of the adverse effects of chemicals is important in biomedical research and healthcare. Text mining can allow timely and low-cost extraction of this knowledge from the biomedical literature. We extended our text mining solution, LeadMine, to identify diseases and chemical-induced disease relationships (CIDs). LeadMine is a dictionary/grammar-based entity recognizer and was used to recognize and normalize both chemicals and diseases to Medical Subject Headings (MeSH) IDs. The disease lexicon was obtained from three sources: MeSH, the Disease Ontology and Wikipedia. The Wikipedia dictionary was derived from pages with a disease/symptom box, or those where the page title appeared in the lexicon. Composite entities (e.g. heart and lung disease) were detected and mapped to their composite MeSH IDs. For CIDs, we developed a simple pattern-based system to find relationships within the same sentence. Our system was evaluated in the BioCreative V Chemical–Disease Relation task and achieved very good results for both disease concept ID recognition (F1-score: 86.12%) and CIDs (F1-score: 52.20%) on the test set. As our system was over an order of magnitude faster than other solutions evaluated on the task, we were able to apply the same system to the entirety of MEDLINE allowing us to extract a collection of over 250 000 distinct CIDs.
Article
Full-text available
Mining chemical-induced disease relations embedded in the vast biomedical literature could facilitate a wide range of computational biomedical applications, such as pharmacovigilance. The BioCreative V organized a Chemical Disease Relation (CDR) Track regarding chemical-induced disease relation extraction from biomedical literature in 2015. We participated in all subtasks of this challenge. In this article, we present our participation system Chemical Disease Relation Extraction SysTem (CD-REST), an end-to-end system for extracting chemical-induced disease relations in biomedical literature. CD-REST consists of two main components: (1) a chemical and disease named entity recognition and normalization module, which employs the Conditional Random Fields algorithm for entity recognition and a Vector Space Model-based approach for normalization; and (2) a relation extraction module that classifies both sentence-level and document-level candidate drug–disease pairs by support vector machines. Our system achieved the best performance on the chemical-induced disease relation extraction subtask in the BioCreative V CDR Track, demonstrating the effectiveness of our proposed machine learning-based approaches for automatic extraction of chemical-induced disease relations in biomedical literature. The CD-REST system provides web services using HTTP POST request. The web services can be accessed from http://clinicalnlptool.com/cdr. The online CD-REST demonstration system is available at http://clinicalnlptool.com/cdr/cdr.html. Database URL: http://clinicalnlptool.com/cdr; http://clinicalnlptool.com/cdr/cdr.html
Article
Full-text available
Manually curating chemicals, diseases and their relationships is significantly important to biomedical research, but it is plagued by its high cost and the rapid growth of the biomedical literature. In recent years, there has been a growing interest in developing computational approaches for automatic chemical-disease relation (CDR) extraction. Despite these attempts, the lack of a comprehensive benchmarking dataset has limited the comparison of different techniques in order to assess and advance the current state-of-the-art. To this end, we organized a challenge task through BioCreative V to automatically extract CDRs from the literature. We designed two challenge tasks: disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. To assist system development and assessment, we created a large annotated text corpus that consisted of human annotations of chemicals, diseases and their interactions from 1500 PubMed articles. 34 teams worldwide participated in the CDR task: 16 (DNER) and 18 (CID). The best systems achieved an F-score of 86.46% for the DNER task—a result that approaches the human inter-annotator agreement (0.8875)—and an F-score of 57.03% for the CID task, the highest results ever reported for such tasks. When combining team results via machine learning, the ensemble system was able to further improve over the best team results by achieving 88.89% and 62.80% in F-score for the DNER and CID task, respectively. Additionally, another novel aspect of our evaluation is to test each participating system’s ability to return real-time results: the average response time for each team’s DNER and CID web service systems were 5.6 and 9.3 s, respectively. Most teams used hybrid systems for their submissions based on machining learning. Given the level of participation and results, we found our task to be successful in engaging the text-mining research community, producing a large annotated corpus and improving the results of automatic disease recognition and CDR extraction. Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
Article
Full-text available
Biocuration activities have been broadly categorized into the selection of relevant documents, the annotation of biological concepts of interest and identification of interactions between the concepts. Text mining has been shown to have a potential to significantly reduce the effort of biocurators in all the three activities, and various semi-automatic methodologies have been integrated into curation pipelines to support them. We investigate the suitability of Argo, a workbench for building text-mining solutions with the use of a rich graphical user interface, for the process of biocuration. Central to Argo are customizable workflows that users compose by arranging available elementary analytics to form task-specific processing units. A built-in manual annotation editor is the single most used biocuration tool of the workbench, as it allows users to create annotations directly in text, as well as modify or delete annotations created by automatic processing components. Apart from syntactic and semantic analytics, the ever-growing library of components includes several data readers and consumers that support well-established as well as emerging data interchange formats such as XMI, RDF and BioC, which facilitate the interoperability of Argo with other platforms or resources. To validate the suitability of Argo for curation activities, we participated in the BioCreative IV challenge whose purpose was to evaluate Web-based systems addressing user-defined biocuration tasks. Argo proved to have the edge over other systems in terms of flexibility of defining biocuration tasks. As expected, the versatility of the workbench inevitably lengthened the time the curators spent on learning the system before taking on the task, which may have affected the usability of Argo. The participation in the challenge gave us an opportunity to gather valuable feedback and identify areas of improvement, some of which have already been introduced. Database URL: http://argo.nactem.ac.uk
Article
Full-text available
Improving the prediction of chemical toxicity is a goal common to both environmental health research and pharmaceutical drug development. To improve safety detection assays, it is critical to have a reference set of molecules with well-defined toxicity annotations for training and validation purposes. Here, we describe a collaboration between safety researchers at Pfizer and the research team at the Comparative Toxicogenomics Database (CTD) to text mine and manually review a collection of 88 629 articles relating over 1 200 pharmaceutical drugs to their potential involvement in cardiovascular, neurological, renal and hepatic toxicity. In 1 year, CTD biocurators curated 2 54 173 toxicogenomic interactions (1 52 173 chemical–disease, 58 572 chemical–gene, 5 345 gene–disease and 38 083 phenotype interactions). All chemical–gene–disease interactions are fully integrated with public CTD, and phenotype interactions can be downloaded. We describe Pfizer’s text-mining process to collate the articles, and CTD’s curation strategy, performance metrics, enhanced data content and new module to curate phenotype information. As well, we show how data integration can connect phenotypes to diseases. This curation can be leveraged for information about toxic endpoints important to drug safety and help develop testable hypotheses for drug–disease events. The availability of these detailed, contextualized, high-quality annotations curated from seven decades’ worth of the scientific literature should help facilitate new mechanistic screening assays for pharmaceutical compound survival. This unique partnership demonstrates the importance of resource sharing and collaboration between public and private entities and underscores the complementary needs of the environmental health science and pharmaceutical communities. Database URL: http://ctdbase.org/
Article
Full-text available
The past decade has witnessed the modern advances of high-throughput technology and rapid growth of research capacity in producing large-scale biological data, both of which were concomitant with an exponential growth of biomedical literature. This wealth of scholarly knowledge is of significant importance for researchers in making scientific discoveries and healthcare professionals in managing health-related matters. However, the acquisition of such information is becoming increasingly difficult due to its large volume and rapid growth. In response, the National Center for Biotechnology Information (NCBI) is continuously making changes to its PubMed Web service for improvement. Meanwhile, different entities have devoted themselves to developing Web tools for helping users quickly and efficiently search and retrieve relevant publications. These practices, together with maturity in the field of text mining, have led to an increase in the number and quality of various Web tools that provide comparable literature search service to PubMed. In this study, we review 28 such tools, highlight their respective innovations, compare them to the PubMed system and one another, and discuss directions for future development. Furthermore, we have built a website dedicated to tracking existing systems and future advances in the field of biomedical literature search. Taken together, our work serves information seekers in choosing tools for their needs and service providers and developers in keeping current in the field. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/search
Article
Full-text available
This article reports on a detailed investigation of PubMed users’ needs and behavior as a step toward improving biomedical information retrieval. PubMed is providing free service to researchers with access to more than 19 million citations for biomedical articles from MEDLINE and life science journals. It is accessed by millions of users each day. Efficient search tools are crucial for biomedical researchers to keep abreast of the biomedical literature relating to their own research. This study provides insight into PubMed users’ needs and their behavior. This investigation was conducted through the analysis of one month of log data, consisting of more than 23 million user sessions and more than 58 million user queries. Multiple aspects of users’ interactions with PubMed are characterized in detail with evidence from these logs. Despite having many features in common with general Web searches, biomedical information searches have unique characteristics that are made evident in this study. PubMed users are more persistent in seeking information and they reformulate queries often. The three most frequent types of search are search by author name, search by gene/protein, and search by disease. Use of abbreviation in queries is very frequent. Factors such as result set size influence users’ decisions. Analysis of characteristics such as these plays a critical role in identifying users’ information needs and their search habits. In turn, such an analysis also provides useful insight for improving biomedical information retrieval. Database URL: http://www.ncbi.nlm.nih.gov/PubMed
Article
Full-text available
The Comparative Toxicogenomics Database (CTD) is a curated database that promotes understanding about the effects of environmental chemicals on human health. Biocurators at CTD manually curate chemical-gene interactions, chemical-disease relationships and gene-disease relationships from the literature. This strategy allows data to be integrated to construct chemical-gene-disease networks. CTD is unique in numerous respects: curation focuses on environmental chemicals; interactions are manually curated; interactions are constructed using controlled vocabularies and hierarchies; additional gene attributes (such as Gene Ontology, taxonomy and KEGG pathways) are integrated; data can be viewed from the perspective of a chemical, gene or disease; results and batch queries can be downloaded and saved; and most importantly, CTD acts as both a knowledgebase (by reporting data) and a discovery tool (by generating novel inferences). Over 116,000 interactions between 3900 chemicals and 13,300 genes have been curated from 270 species, and 5900 gene-disease and 2500 chemical-disease direct relationships have been captured. By integrating these data, 350,000 gene-disease relationships and 77,000 chemical-disease relationships can be inferred. This wealth of chemical-gene-disease information yields testable hypotheses for understanding the effects of environmental chemicals on human health. CTD is freely available at http://ctd.mdibl.org.
Article
Full-text available
Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. This strategy is particularly attractive for extracting data relevant to genes of the human genome from the 11 million abstracts in Medline. However, extraction efforts have been frustrated by the lack of conventions for describing human genes and proteins. We have developed and evaluated a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting information on interactions between the proteins. We used a variety of machine learning methods to automatically develop information extraction systems for extracting information on gene/protein name, function and interactions from Medline abstracts. We present cross-validated results on identifying human proteins and their interactions by training and testing on a set of approximately 1000 manually-annotated Medline abstracts that discuss human genes/proteins. We demonstrate that machine learning approaches using support vector machines and maximum entropy are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions with higher precision than manually-developed rules. Our results show that it is promising to use machine learning to automatically build systems for extracting information from biomedical text. The results also give a broad picture of the relative strengths of a wide variety of methods when tested on a reasonably large human-annotated corpus.
Article
Full-text available
Lately, there has been a great interest in the application of information extraction methods to the biomedical domain, in particular, to the extraction of relationships of genes, proteins, and RNA from scientific publications. The development and evaluation of such methods requires annotated domain corpora. We present BioInfer (Bio Information Extraction Resource), a new public resource providing an annotated corpus of biomedical English. We describe an annotation scheme capturing named entities and their relationships along with a dependency analysis of sentence syntax. We further present ontologies defining the types of entities and relationships annotated in the corpus. Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, as well as syntactic dependencies. Supporting software is provided with the corpus. The corpus is unique in the domain in combining these annotation types for a single set of sentences, and in the level of detail of the relationship annotation. We introduce a corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers. The corpus will be maintained and further developed with a current version being available at http://www.it.utu.fi/BioInfer.
Article
Full-text available
This paper provides guidance to some of the concepts surrounding recurrent neural networks. Contrary to feedforward networks, recurrent networks can be sensitive, and be adapted to past inputs. Backpropagation learning is described for feedforward networks, adapted to suit our (probabilistic) modeling needs, and extended to cover recurrent networks. The aim of this brief paper is to set the scene for applying and understanding recurrent neural networks.
There is a large body of works on multi-view clustering which exploit multiple representations (or views) of the same input data for better convergence. These multiple views can come from multiple modalities (image, audio, text) or different feature subsets. Obtaining one consensus partitioning after considering different views is usually a non-trivial task. Recently, multi-objective based multi-view clustering methods have suppressed the performance of single objective based multi-view clustering techniques. One key problem is that it is difficult to select a single solution from a set of alternative partitionings generated by multi-objective techniques on the final Pareto optimal front. In this paper, we propose a novel multi-objective based multi-view clustering framework which overcomes the problem of selecting a single solution in multi-objective based techniques. In particular, our proposed framework has three major components: (i) multi-view based multi-objective algorithm, Multiview-AMOSA, for initial clustering of data points; (ii) a generative model for generating a combined solution having probabilistic labels; and (iii) K-means algorithm for obtaining the final labels. As the first component, we have adopted a recently developed multi-view based multi-objective clustering algorithm to generate different possible consensus partitionings of a given dataset taking into account different views. A generative model is coupled with the first component to generate a single consensus partitioning after considering multiple solutions. It exploits the latent subsets of the non-dominated solutions obtained from the multi-objective clustering algorithm and combines them to produce a single probabilistic labeled solution. Finally, a simple clustering algorithm, namely K-means, is applied on the generated probabilistic labels to obtain the final cluster labels. Experimental validation of our proposed framework is carried out over several benchmark datasets belonging to three different domains; UCI datasets, multiview datasets, search result clustering datasets and patient stratification datasets. Experimental results show that our proposed framework achieves an improvement of around 2%-4% over different evaluation metrics in all the four domains in comparison to state-of-the art methods.
Article
Knowledge about protein–protein interactions is essential for understanding the biological processes such as metabolic pathways, DNA replication, and transcription etc. However, a majority of the existing Protein–Protein Interaction (PPI) systems are dependent primarily on the scientific literature, which is not yet accessible as a structured database. Thus, efficient information extraction systems are required for identifying PPI information from the large collection of biomedical texts. In this paper, we present a novel method based on attentive deep recurrent neural network, which combines multiple levels of representations exploiting word sequences and dependency path related information to identify protein–protein interaction (PPI) information from the text. We use the stacked attentive bi-directional long short term memory (Bi-LSTM) as our recurrent neural network to solve the PPI identification problem. This model leverages joint modeling of proteins and relations in a single unified framework, which is named as the ‘Attentive Shortest Dependency Path LSTM’ (Att-sdpLSTM) model. Experimentation of the proposed technique was conducted on five popular benchmark PPI datasets, namely AiMed, BioInfer, HPRD50, IEPA, and LLL. The evaluation shows the F1-score values of 93.29%, 81.68%, 78.73%, 76.25%, & 83.92% on AiMed, BioInfer, HPRD50, IEPA, and LLL dataset, respectively. Comparisons with the existing systems show that our proposed approach attains state-of-the-art performance.
Article
Chemical-disease relation (CDR) extraction is significantly important to various areas of biomedical research and health care. Nowadays, many large-scale biomedical knowledge bases (KBs) containing triples about entity pairs and their relations have been built. KBs are important resources for biomedical relation extraction. However, previous research pays little attention to prior knowledge. In addition, the dependency tree contains important syntactic and semantic information, which helps to improve relation extraction. So how to effectively use it is also worth studying. In this paper, we propose a novel convolutional attention network (CAN) for CDR extraction. Firstly, we extract the shortest dependency path (SDP) between chemical and disease pairs in a sentence, which includes a sequence of words, dependency directions, and dependency relation tags. Then the convolution operations are performed on the SDP to produce deep semantic dependency features. After that, an attention mechanism is employed to learn the importance/weight of each semantic dependency vector related to knowledge representations learned from KBs. Finally, in order to combine dependency information and prior knowledge, the concatenation of weighted semantic dependency representations and knowledge representations is fed to the softmax layer for classification. Experiments on the BioCreative V CDR dataset show that our method achieves comparable performance with the state-of-the-art systems, and both dependency information and prior knowledge play important roles in CDR extraction task.
Article
Automatically extracting the relationships between chemicals and diseases is significantly important to various areas of biomedical research and health care. Biomedical experts have built many large-scale knowledge bases (KBs) to advance the development of biomedical research. KBs contain huge amounts of structured information about entities and relationships, therefore plays a pivotal role in chemical-disease relation (CDR) extraction. However, previous researches pay less attention to the prior knowledge existing in KBs. This paper proposes a neural network-based attention model (NAM) for CDR extraction, which makes full use of context information in documents and prior knowledge in KBs. For a pair of entities in a document, an attention mechanism is employed to select important context words with respect to the relation representations learned from KBs. Experiments on the BioCreative V CDR dataset show that combining context and knowledge representations through the attention mechanism, could significantly improve the CDR extraction performance while achieve comparable results with state-of-the-art systems.
Conference Paper
Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates $backslash$emphdeep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
Article
Thesupport-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensures high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data.High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Article
The automatic extraction of protein–protein interactions (PPIs) reported in scientific publications are of great significance for biomedical researchers in that they could efficiently grasp the recent research results about biochemical events and molecular processes for conducting their original studies. This article introduces a deep convolutional neural network (DCNN) equipped with various feature embeddings to battle the limitations of the existing machine learning-based PPI extraction methods. The proposed model learns and optimises word embeddings based on the publicly available word vectors and also exploits position embeddings to identify the locations of the target protein names in sentences. Furthermore, it can employ various linguistic feature embeddings to improve the PPI extraction. The intensive experiments using AIMed data set known as the most difficult collection not only show the superiority of the suggested model but also indicate important implications in optimising the network parameters and hyperparameters.
Conference Paper
Various factors, such as identity, view, and illumination, are coupled in face images. Disentangling the identity and view representations is a major challenge in face recognition. Existing face recognition systems either use handcrafted features or learn features discriminatively to improve recognition accuracy. This is different from the behavior of primate brain. Recent studies [5, 19] discovered that primate brain has a face-processing network, where view and identity are processed by different neurons. Taking into account this instinct, this paper proposes a novel deep neural net, named multi-view perceptron (MVP), which can untangle the identity and view features, and in the meanwhile infer a full spectrum of multi-view images, given a single 2D face image. The identity features of MVP achieve superior performance on the MultiPIE dataset. MVP is also capable to interpolate and predict images under viewpoints that are unobserved in the training data.
Chemical-disease relations extraction based on the shortest dependency path tree
  • H Zhou
  • H Deng
  • J He
H. Zhou, H. Deng, and J. He, "Chemical-disease relations extraction based on the shortest dependency path tree," in Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, Sevilla, Spain, 2015, pp. 214-9.
Multi-view convolutional neural networks for 3d shape recognition
  • H Su
  • S Maji
  • E Kalogerakis
  • E Learned-Miller
H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, "Multi-view convolutional neural networks for 3d shape recognition," in Proceedings of the IEEE international conference on computer vision, 2015, pp. 945-953.
Distributional semantics resources for biomedical text processing
  • S Moen
  • T S S Ananiadou
S. Moen and T. S. S. Ananiadou, "Distributional semantics resources for biomedical text processing," in Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan, 2013, pp. 39-43.
Evaluating impact of re-training a lexical disambiguation model on domain adaptation of an hpsg parser
  • T Hara
  • Y Miyao
  • J Tsujii
T. Hara, Y. Miyao, and J. Tsujii, "Evaluating impact of re-training a lexical disambiguation model on domain adaptation of an hpsg parser," in Proceedings of the Tenth International Conference on Parsing Technologies. Association for Computational Linguistics, 2007, pp. 11-22. [Online]. Available: http://aclweb.org/anthology/W07-2202
A method for stochastic optimization
  • D Kinga
  • J B Adam
D. Kinga and J. B. Adam, "A method for stochastic optimization," in International Conference on Learning Representations (ICLR), vol. 5, 2015.
Proteinprotein interaction extraction by leveraging multiple kernels and parsers
  • M Miwa
  • R Stre
  • Y Miyao
  • J Tsujii
M. Miwa, R. Stre, Y. Miyao, and J. Tsujii, "Proteinprotein interaction extraction by leveraging multiple kernels and parsers," International Journal of Medical Informatics, vol. 78, no. 12, pp. e39 -e46, 2009, mining of Clinical and Biomedical Text and Data Special Issue. [Online]. Available: http://www.sciencedirect.com/science/article/ pii/S1386505609000768