Discovering Patterns to Extract Protein-Protein Interactions from Full Texts

University of Waterloo, Waterloo, Ontario, Canada
Bioinformatics (Impact Factor: 4.98). 12/2004; 20(18):3604-12. DOI: 10.1093/bioinformatics/bth451
Source: PubMed


Although several databases store protein-protein interactions, most such data still exist only in the scientific literature, scattered across natural-language text that resists data mining. Extracting protein pathways from the literature therefore takes considerable time and labor. Our aim is to develop a robust and powerful methodology to mine protein-protein interactions from biomedical texts.
We present a novel and robust approach for extracting protein-protein interactions from literature. Our method uses a dynamic programming algorithm to compute distinguishing patterns by aligning relevant sentences and key verbs that describe protein interactions. A matching algorithm is designed to extract the interactions between proteins. Equipped only with a dictionary of protein names, our system achieves a recall rate of 80.0% and precision rate of 80.5%.
The program is available on request from the authors.
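
The abstract names two components: a dynamic programming alignment that derives patterns from sentences, and a matching step that applies those patterns. It does not give the scoring scheme or the pattern format, so the following is only a minimal sketch under stated assumptions. It uses a generic global alignment with hypothetical scores, hypothetical function names (`learn_pattern`, `matches`), a `*` wildcard, and protein names already replaced by a `PROTEIN` token from a dictionary. It should not be read as the authors' implementation.

```python
from typing import List, Sequence

WILDCARD = "*"  # assumption: stands for any run of tokens (possibly empty)


def learn_pattern(a: Sequence[str], b: Sequence[str],
                  match: int = 2, mismatch: int = -1, gap: int = -1) -> List[str]:
    """Globally align two token sequences and keep the shared tokens.

    Positions where the sentences disagree (mismatch or gap) become a wildcard.
    The score values are illustrative, not taken from the paper.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pattern: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and a[i - 1] == b[j - 1]
                and score[i][j] == score[i - 1][j - 1] + match):
            pattern.append(a[i - 1])
            i, j = i - 1, j - 1
        elif (i > 0 and j > 0 and a[i - 1] != b[j - 1]
                and score[i][j] == score[i - 1][j - 1] + mismatch):
            pattern.append(WILDCARD)
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pattern.append(WILDCARD)
            i -= 1
        else:
            pattern.append(WILDCARD)
            j -= 1
    pattern.reverse()

    # Collapse consecutive wildcards into one.
    collapsed: List[str] = []
    for tok in pattern:
        if tok == WILDCARD and collapsed and collapsed[-1] == WILDCARD:
            continue
        collapsed.append(tok)
    return collapsed


def matches(pattern: Sequence[str], tokens: Sequence[str]) -> bool:
    """Return True if the whole token sequence fits the pattern."""
    if not pattern:
        return not tokens
    head, rest = pattern[0], pattern[1:]
    if head == WILDCARD:
        return any(matches(rest, tokens[k:]) for k in range(len(tokens) + 1))
    return bool(tokens) and tokens[0] == head and matches(rest, tokens[1:])


s1 = ["PROTEIN", "interacts", "with", "PROTEIN"]
s2 = ["PROTEIN", "directly", "interacts", "with", "the", "PROTEIN"]
pattern = learn_pattern(s1, s2)
print(pattern)
print(matches(pattern, ["PROTEIN", "strongly", "interacts", "with", "PROTEIN"]))
```

In this sketch the key verb ("interacts") survives the alignment because both sentences share it, while the words that differ are generalized away. A real system would also need to score and filter the resulting patterns.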


Available from: Donald G Payan,
  • Source
    • "Heuristic rules based on morphological clues and domain specific knowledge were incorporated to remove the negative interactions. Huang et al. [4] employed dynamic programming to learn PPI patterns based on POS tags automatically. Their results gave precision of 80.5% and recall of 80.0%. "
    ABSTRACT: Biomedical relation extraction aims to uncover high-quality relations from life science literature with high accuracy and efficiency. Early biomedical relation extraction tasks focused on capturing binary relations, such as protein-protein interactions, which are crucial for virtually every process in a living cell. Information about these interactions provides the foundations for new therapeutic approaches. In recent years, interest has shifted toward the extraction of complex relations such as biomolecular events. Complex relations go beyond binary relations: they involve more than two arguments and may also take another relation as an argument. In this paper, we conduct a thorough survey of the research in biomedical relation extraction. We first present a general framework for biomedical relation extraction and then discuss the approaches proposed for binary and complex relation extraction, focusing on the latter since it is a much more difficult task than binary relation extraction. Finally, we discuss the challenges facing complex relation extraction and outline possible solutions and future directions.
    Computational and Mathematical Methods in Medicine 08/2014; 2014:298473. DOI:10.1155/2014/298473 · 0.77 Impact Factor
  • Source
    • "The BioNER system applies the biomarker-gene and disease-biomarker dictionaries using fuzzy- and pattern-matching methods to find and uniquely identify entity mentions in the literature [23–25]. Firstly, our BioNER receives the dictionary type to extract mentions and a list of document identifiers (obtained in the document selection step). "
    ABSTRACT: The biomedical literature represents a rich source of biomarker information. However, both the size of literature databases and their lack of standardization hamper the automatic exploitation of the information contained in these resources. Text mining approaches have proven to be useful for the exploitation of information contained in the scientific publications. Here, we show that a knowledge-driven text mining approach can exploit a large literature database to extract a dataset of biomarkers related to diseases covering all therapeutic areas. Our methodology takes advantage of the annotation of MEDLINE publications pertaining to biomarkers with MeSH terms, narrowing the search to specific publications and, therefore, minimizing the false positive ratio. It is based on a dictionary-based named entity recognition system and a relation extraction module. The application of this methodology resulted in the identification of 131,012 disease-biomarker associations between 2,803 genes and 2,751 diseases, and represents a valuable knowledge base for those interested in disease-related biomarkers. Additionally, we present a bibliometric analysis of the journals reporting biomarker related information during the last 40 years.
    BioMed Research International 04/2014; 2014:253128. DOI:10.1155/2014/253128 · 1.58 Impact Factor
  • Source
    • "GIS determines the relationship described in sentences via the sentence expression patterns. However, most bioinformatic tools are developed to analyse gene–gene and protein–protein interactions, but not the specific TF–TGs relationship (Blaschke and Valencia, 2001; Huang et al., 2004; Kim et al., 2008; Seki and Mostafa, 2009; Polajnar et al., 2011). The method used to extract protein–protein information from biological texts is closely related to the methods presented herein and has received significant attention in recent years. "
    ABSTRACT: The large amount of textual knowledge in the existing biomedical literature is growing rapidly, and the creation of manual patterns from the available literature is becoming more difficult. There is an increasing demand to extract potential generic regulatory relationships from unlabelled data sets. In this paper, we describe a Semi-Supervised, Weighted Pattern Learning method (SSWPL) to extract such generic regulatory information from the literature. SSWPL can build new regulatory patterns according to predefined initial patterns from unlabelled data in the literature. These constructed regulatory patterns are then used to extract generic regulatory information from PubMed abstracts. The results presented herein demonstrate that our method can be utilised to effectively extract generic regulatory relationships from the literature by using learned, weighted patterns through semi-supervised pattern learning.
    International Journal of Data Mining and Bioinformatics 01/2014; 9(4):401 - 416. DOI:10.1504/IJDMB.2014.062147 · 0.50 Impact Factor