Automatically extracting functionally equivalent proteins from SwissProt.

Research Department of Structural & Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK.
BMC Bioinformatics (impact factor: 2.75). 11/2008; 9:418. DOI:10.1186/1471-2105-9-418
Source: PubMed

ABSTRACT There is a frequent need to obtain sets of functionally equivalent homologous proteins (FEPs) from different species. While orthology usually implies functional equivalence, this is not always true; therefore datasets of orthologous proteins are not appropriate. The information relevant to extracting FEPs is contained in databanks such as UniProtKB/Swiss-Prot, and a manual analysis of these data allows FEPs to be extracted on a one-off basis. However, there has been no resource allowing the easy, automatic extraction of groups of FEPs - for example, all instances of protein C.
We have developed FOSTA, an automatically generated database of FEPs annotated as having the same function in UniProtKB/Swiss-Prot, which can be used for large-scale analysis. The method builds a candidate list of homologues and filters out functionally diverged proteins on the basis of functional annotations using a simple text mining approach.
Large scale evaluation of our FEP extraction method is difficult as there is no gold-standard dataset against which the method can be benchmarked. However, a manual analysis of five protein families confirmed a high level of performance. A more extensive comparison with two manually verified functional equivalence datasets also demonstrated very good performance.
In summary, FOSTA provides an automated analysis of annotations in UniProtKB/Swiss-Prot that enables groups of proteins already annotated as functionally equivalent to be extracted. Our results demonstrate that the vast majority of UniProtKB/Swiss-Prot functional annotations are of high quality, and that FOSTA can interpret annotations successfully. Where FOSTA is not successful, we are able to highlight inconsistencies in UniProtKB/Swiss-Prot annotation. Most of these would have presented equal difficulties for manual interpretation of annotations. We discuss limitations and possible future extensions to FOSTA, and recommend changes to the UniProtKB/Swiss-Prot format that would facilitate text mining of its annotations.
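
The abstract describes the method only in outline: build a candidate list of homologues, then filter out functionally diverged proteins by comparing their functional annotations. The following Python fragment is a minimal sketch of that idea and not FOSTA's actual implementation. The `Entry` record, the stop-word list, the exact-match rule on normalised descriptions, and the `EX000n` accessions are all assumptions made for illustration; the candidate list is assumed to have been produced beforehand by a homology search.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    accession: str    # hypothetical identifier, not a real accession
    species: str
    description: str  # free-text protein name taken from the annotation


# Assumed qualifiers that do not change the annotated function.
_STOPWORDS = {"precursor", "fragment", "putative", "probable"}


def normalise(description):
    """Lower-case, tokenise, and drop qualifier words from a description."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    return frozenset(t for t in tokens if t not in _STOPWORDS)


def functionally_equivalent(query, candidates):
    """Keep one candidate per other species whose normalised description
    matches the query's; everything else is treated as diverged."""
    target = normalise(query.description)
    kept = {}
    for cand in candidates:
        if cand.species == query.species:
            continue  # only proteins from different species are wanted
        if normalise(cand.description) == target:
            kept.setdefault(cand.species, cand)  # first match per species
    return list(kept.values())


# Illustrative data only; the descriptions are invented for this sketch.
query = Entry("EX0001", "human", "Vitamin K-dependent protein C precursor")
candidates = [
    Entry("EX0002", "mouse", "Vitamin K-dependent protein C"),
    Entry("EX0003", "rat", "Vitamin K-dependent protein S"),
    Entry("EX0004", "bovine", "Vitamin K-dependent protein C (Fragment)"),
]

for hit in functionally_equivalent(query, candidates):
    print(hit.accession, hit.species)
```

Exact matching on normalised descriptions is deliberately strict: it rejects the protein S homologue, but it would also miss synonyms, which is one reason a real system needs more than this sketch.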



Keywords

automated analysis
candidate list
different species
extracting FEPs
functional annotations
functional equivalence
functional equivalence datasets
functionally diverged proteins
functionally equivalent
functionally equivalent homologous proteins
generated database
gold-standard dataset
good performance
Large scale evaluation
large-scale analysis
manual analysis
one-off basis
orthologous proteins
possible future extensions
UniProtKB/Swiss-Prot functional annotations

Lisa McMillan