Discovering Patterns to Extract Protein-Protein Interactions from Full Texts

University of Waterloo, Waterloo, Ontario, Canada
Bioinformatics (Impact Factor: 4.98). 12/2004; 20(18):3604-12. DOI: 10.1093/bioinformatics/bth451
Source: PubMed


Although there are several databases storing protein-protein interactions, most such data still exist only in the scientific literature, scattered across natural-language text that resists data mining. Extracting protein pathways from the literature takes considerable time and labor. Our aim is to develop a robust and powerful methodology to mine protein-protein interactions from biomedical texts.
We present a novel and robust approach for extracting protein-protein interactions from literature. Our method uses a dynamic programming algorithm to compute distinguishing patterns by aligning relevant sentences and key verbs that describe protein interactions. A matching algorithm is designed to extract the interactions between proteins. Equipped only with a dictionary of protein names, our system achieves a recall rate of 80.0% and precision rate of 80.5%.
The program is available on request from the authors.
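
The alignment idea described above can be illustrated with a small sketch. This is not the authors' program: it assumes a standard global alignment (Needleman-Wunsch style) over two tag sequences, with hypothetical scoring values (`match`, `mismatch`, `gap`), a hypothetical wildcard symbol `*` for positions where the two sentences differ, and hypothetical function names `align_pattern` and `matches`. The input sequences are assumed to be already tagged, with protein names replaced by `PROTEIN` via a dictionary lookup.

```python
from typing import List

GAP = "*"  # hypothetical wildcard for positions where the sentences differ


def align_pattern(a: List[str], b: List[str],
                  match: int = 2, mismatch: int = -1, gap: int = -1) -> List[str]:
    """Globally align two tag sequences and return their shared pattern."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right cell, keeping tags that agree.
    pattern: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            pattern.append(a[i - 1] if a[i - 1] == b[j - 1] else GAP)
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pattern.append(GAP)
            i -= 1
        else:
            pattern.append(GAP)
            j -= 1
    pattern.reverse()

    # Collapse runs of wildcards into a single wildcard.
    collapsed: List[str] = []
    for tok in pattern:
        if tok == GAP and collapsed and collapsed[-1] == GAP:
            continue
        collapsed.append(tok)
    return collapsed


def matches(pattern: List[str], tokens: List[str]) -> bool:
    """Check whether a pattern covers a whole tag sequence.

    A wildcard is assumed to match zero or more tags.
    """
    # reach[j] is True when the pattern read so far can consume tokens[:j].
    reach = [True] + [False] * len(tokens)
    for p in pattern:
        if p == GAP:
            seen = False
            new = []
            for r in reach:
                seen = seen or r
                new.append(seen)
        else:
            new = [False] + [reach[j] and tokens[j] == p for j in range(len(tokens))]
        reach = new
    return reach[-1]


# "A interacts with B" and "A directly interacts with the B", already tagged.
s1 = ["PROTEIN", "interact", "IN", "PROTEIN"]
s2 = ["PROTEIN", "RB", "interact", "IN", "DT", "PROTEIN"]
pattern = align_pattern(s1, s2)
print(pattern)
print(matches(pattern, ["PROTEIN", "interact", "IN", "PROTEIN"]))
```

In this sketch the key verb and the protein placeholders survive the alignment, while the words that vary between sentences become wildcards. A real system would also need tokenization, tagging, and a way to rank or filter the learned patterns, none of which is shown here.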

Available from: Donald G Payan
    • "The simplest method is cooccurrence (Matsuo and Ishizuka 2004), which achieves high recall but low precision. Conversely, approaches based on rule and pattern can improve precision but greatly lower recall (Huang et al. 2004). Furthermore, these rules or patterns, which are generated from training dataset, are not always applicable to other data. "
    ABSTRACT: Long non-coding RNAs (lncRNAs) play important roles in regulation at the transcriptional and post-transcriptional levels. Knowledge of lncRNA-protein interactions (LPIs) is crucial for biologists to explain biological mechanisms and guide experiments. Since most newly discovered LPIs can be extracted from the biomedical literature, LPI extraction by text mining is highly relevant. In this study, we apply a feature-based text mining method to extract LPIs from the biomedical literature. Our method is composed of three steps. First, we pre-process text from three databases to convert it into structured representations. Second, we extract a set of features from the structured sentence representations and use them to generate feature vectors for candidate LPI pairs. Finally, a random forest classifier is trained on the feature vectors. When evaluated on our dataset, the method achieves an F-score of 79.3%, and the results suggest that, as the first text mining approach for this task, the proposed method can efficiently extract LPIs from the biomedical literature.
    Full-text · Article · Dec 2015
    • "Heuristic rules based on morphological clues and domain specific knowledge were incorporated to remove the negative interactions. Huang et al. [4] employed dynamic programming to learn PPI patterns based on POS tags automatically. Their results gave precision of 80.5% and recall of 80.0%. "
    ABSTRACT: Biomedical relation extraction aims to uncover high-quality relations from life science literature with high accuracy and efficiency. Early biomedical relation extraction tasks focused on capturing binary relations, such as protein-protein interactions, which are crucial for virtually every process in a living cell. Information about these interactions provides the foundations for new therapeutic approaches. In recent years, interest has shifted to the extraction of complex relations such as biomolecular events. Complex relations go beyond binary relations: they involve more than two arguments and may also take another relation as an argument. In this paper, we conduct a thorough survey of the research in biomedical relation extraction. We first present a general framework for biomedical relation extraction and then discuss the approaches proposed for binary and complex relation extraction, with a focus on the latter since it is a much more difficult task than binary relation extraction. Finally, we discuss the challenges we face with complex relation extraction and outline possible solutions and future directions.
    Full-text · Article · Aug 2014 · Computational and Mathematical Methods in Medicine
    • "The BioNER system applies the biomarker-gene and disease-biomarker dictionaries using fuzzy- and pattern-matching methods to find and uniquely identify entity mentions in the literature [23–25]. Firstly, our BioNER receives the dictionary type to extract mentions and a list of document identifiers (obtained in the document selection step). "
    ABSTRACT: The biomedical literature represents a rich source of biomarker information. However, both the size of literature databases and their lack of standardization hamper the automatic exploitation of the information contained in these resources. Text mining approaches have proven to be useful for the exploitation of information contained in the scientific publications. Here, we show that a knowledge-driven text mining approach can exploit a large literature database to extract a dataset of biomarkers related to diseases covering all therapeutic areas. Our methodology takes advantage of the annotation of MEDLINE publications pertaining to biomarkers with MeSH terms, narrowing the search to specific publications and, therefore, minimizing the false positive ratio. It is based on a dictionary-based named entity recognition system and a relation extraction module. The application of this methodology resulted in the identification of 131,012 disease-biomarker associations between 2,803 genes and 2,751 diseases, and represents a valuable knowledge base for those interested in disease-related biomarkers. Additionally, we present a bibliometric analysis of the journals reporting biomarker related information during the last 40 years.
    Full-text · Article · Apr 2014 · BioMed Research International