A systematic study of genome context methods: calibration, normalization and combination

Artificial Intelligence Center, SRI International, Menlo Park, California, USA.
BMC Bioinformatics (Impact Factor: 2.67). 10/2010; 11:493. DOI: 10.1186/1471-2105-11-493
Source: PubMed

ABSTRACT Genome context methods have been introduced in the last decade as automatic methods to predict functional relatedness between genes in a target genome using the patterns of existence and relative locations of the homologs of those genes in a set of reference genomes. Much work has been done in the application of these methods to different bioinformatics tasks, but few papers present a systematic study of the methods and their combination necessary for their optimal use.
We present a thorough study of the four main families of genome context methods found in the literature: phylogenetic profile, gene fusion, gene cluster, and gene neighbor. We find that for most organisms the gene neighbor method outperforms the phylogenetic profile method by as much as 40% in sensitivity, being competitive with the gene cluster method at low sensitivities. Gene fusion is generally the worst performing of the four methods. A thorough exploration of the parameter space for each method is performed and results across different target organisms are presented. We propose applying normalization procedures, such as those used on microarray data, to the genome context scores. We show that substantial gains can be achieved from the use of a simple normalization technique. In particular, the sensitivity of the phylogenetic profile method is improved by around 25% after normalization, resulting, to our knowledge, in the best-performing phylogenetic profile system in the literature. Finally, we show results from combining the various genome context methods into a single score. When using a cross-validation procedure to train the combiners, with both original and normalized scores as input, a decision tree combiner results in gains of up to 20% with respect to the gene neighbor method. Overall, this represents a gain of around 15% over what can be considered the state of the art in this area: the four original genome context methods combined using a procedure like that used in the STRING database. Unfortunately, we find that these gains disappear when the combiner is trained only with organisms that are phylogenetically distant from the target organism.
Our experiments indicate that gene neighbor is the best individual genome context method and that gains from the combination of individual methods are very sensitive to the training data used to obtain the combiner's parameters. If adequate training data is not available, using the gene neighbor score by itself instead of a combined score might be the best choice.
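
As an illustration of the phylogenetic profile idea described above (scoring gene pairs by the pattern of presence of their homologs across reference genomes), the following is a minimal sketch only. The toy profiles, the gene names, and the choice of Jaccard similarity as the scoring function are assumptions made for this example; the abstract does not state which scoring function, normalization, or combiner settings the study used.

```python
from itertools import combinations

# Hypothetical toy data: each tuple is a presence/absence profile over six
# reference genomes (1 = a homolog of the gene exists in that genome).
profiles = {
    "geneA": (1, 1, 0, 1, 0, 1),
    "geneB": (1, 1, 0, 1, 0, 0),
    "geneC": (0, 0, 1, 0, 1, 0),
}


def jaccard(p, q):
    """Fraction of genomes holding both homologs among those holding either.

    Jaccard similarity is an assumed stand-in for a profile score here.
    """
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def pairwise_scores(gene_profiles):
    """Score every unordered gene pair; a higher score suggests relatedness."""
    return {
        (g, h): jaccard(gene_profiles[g], gene_profiles[h])
        for g, h in combinations(sorted(gene_profiles), 2)
    }


scores = pairwise_scores(profiles)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

In this toy setting, genes whose homologs co-occur across the reference genomes receive a high score, while genes with complementary profiles receive a score of zero.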

  • ABSTRACT: Protein function prediction is one of the most challenging problems in the post-genomic era. With the advances in high-throughput techniques, the number of newly identified proteins has been increasing exponentially. However, the functional characterization of these new proteins has not increased in the same proportion. To fill this gap, a large number of computational methods have been proposed in the literature. Early approaches explored homology relationships to associate known functions with the newly discovered proteins. Nevertheless, these approaches tend to fail when a new protein is considerably different (divergent) from other known ones. Accordingly, more accurate approaches that use expressive data representations and exploit sophisticated computational techniques are urgently required. With these points in mind, this review provides a comprehensible description of machine learning approaches that are currently applied to protein function prediction problems. We start by defining several problems involved in understanding aspects of protein function, and describing how machine learning can be applied to these problems. We aim to present, in a systematic framework, the role of these techniques in protein function inference, which is sometimes difficult to follow because of the rapid evolution of the field. To this end, we highlight the most representative contributions and the recent advancements, and provide an insightful categorization and classification of machine learning methods in functional proteomics.
    06/2013; 7. DOI:10.2174/18722083113079990006
  • ABSTRACT: Annexins are Ca(2+)-binding, membrane-interacting proteins, widespread among eukaryotes, consisting usually of four structurally similar repeated domains. It is accepted that vertebrate annexins derive from a double genome duplication event. It has been postulated that a single domain annexin, if found, might represent a molecule related to the hypothetical ancestral annexin. The recent discovery of a single-domain annexin in a bacterium, Cytophaga hutchinsonii, apparently confirmed this hypothesis. Here, we present a more complex picture. Using remote sequence similarity detection tools, a survey of bacterial genomes was performed in search of annexin-like proteins. In total, we identified about thirty annexin homologues, including single-domain and multi-domain annexins, in seventeen bacterial species. The thorough search yielded, besides the known annexin homologue from C. hutchinsonii, homologues from the Bacteroidetes/Chlorobi phylum, from Gemmatimonadetes, from beta- and delta-Proteobacteria, and from Actinobacteria. The sequences of bacterial annexins exhibited remote but statistically significant similarity to sequence profiles built from the eukaryotic ones. Some bacterial annexins are equipped with additional, different domains, for example those characteristic of toxins. The variation in bacterial annexin sequences, much wider than that observed in eukaryotes, and different domain architectures suggest that annexins found in bacteria may actually descend from an ancestral bacterial annexin, from which eukaryotic annexins also originate. The hypothesis of an ancient origin of bacterial annexins has to be reconciled with the fact that remarkably few bacterial strains possess annexin genes compared to the thousands of known bacterial genomes and with the patchy, anomalous phylogenetic distribution of bacterial annexins. Thus, a massive annexin gene loss in several bacterial lineages or very divergent evolution would appear a likely explanation. Alternative evolutionary scenarios, involving horizontal gene transfer between bacteria and protozoan eukaryotes, in either direction, appear much less likely. Altogether, current evidence does not allow unequivocal judgement as to the origin of bacterial annexins.
    PLoS ONE 01/2014; 9(1):e85428. DOI:10.1371/journal.pone.0085428 · 3.53 Impact Factor
  • ABSTRACT: One of the most pressing challenges in the post-genomic era is the identification and characterization of protein-protein interactions (PPIs), as these are essential to understanding the cellular physiology of health and disease. Experimental techniques suitable for characterizing PPIs (X-ray crystallography or nuclear magnetic resonance spectroscopy, among others) are usually laborious, time-consuming and often difficult to apply to membrane proteins, and therefore require accurate prediction of the candidate interacting partners. High-throughput experimental methods (yeast two-hybrid and affinity purification) suffer from the same shortcomings, and can also lead to high rates of false positive and false negative results. Therefore, reliable tools for predicting PPIs are needed. The operon structure of the genome of the eukaryote Caenorhabditis elegans is a valuable, though underused, tool for identifying physically or functionally interacting proteins. Based on the concept that genes organized in the same operon may encode physically or functionally related proteins, this approach is easy to apply and, importantly, gives a limited number of candidate partners for a given protein, allowing for focused experimental verification. Moreover, this approach can be successfully used to predict PPIs in the human system, including those of membrane proteins. © 2014 S. Karger AG, Basel.
    Cellular Physiology and Biochemistry 01/2013; 32(7):41-56. DOI:10.1159/000356623 · 3.55 Impact Factor
