A systematic study of genome context methods: calibration, normalization and combination.

Artificial Intelligence Center, SRI International, Menlo Park, California, USA.
BMC Bioinformatics (Impact Factor: 3.02). 10/2010; 11:493. DOI:10.1186/1471-2105-11-493
Source: PubMed

ABSTRACT Genome context methods have been introduced in the last decade as automatic methods to predict functional relatedness between genes in a target genome using the patterns of existence and relative locations of the homologs of those genes in a set of reference genomes. Much work has been done in the application of these methods to different bioinformatics tasks, but few papers present a systematic study of the methods and their combination necessary for their optimal use.
We present a thorough study of the four main families of genome context methods found in the literature: phylogenetic profile, gene fusion, gene cluster, and gene neighbor. We find that for most organisms the gene neighbor method outperforms the phylogenetic profile method by as much as 40% in sensitivity, being competitive with the gene cluster method at low sensitivities. Gene fusion is generally the worst performing of the four methods. A thorough exploration of the parameter space for each method is performed and results across different target organisms are presented. We propose the use of normalization procedures as those used on microarray data for the genome context scores. We show that substantial gains can be achieved from the use of a simple normalization technique. In particular, the sensitivity of the phylogenetic profile method is improved by around 25% after normalization, resulting, to our knowledge, on the best-performing phylogenetic profile system in the literature. Finally, we show results from combining the various genome context methods into a single score. When using a cross-validation procedure to train the combiners, with both original and normalized scores as input, a decision tree combiner results in gains of up to 20% with respect to the gene neighbor method. Overall, this represents a gain of around 15% over what can be considered the state of the art in this area: the four original genome context methods combined using a procedure like that used in the STRING database. Unfortunately, we find that these gains disappear when the combiner is trained only with organisms that are phylogenetically distant from the target organism.
Our experiments indicate that gene neighbor is the best individual genome context method and that gains from the combination of individual methods are very sensitive to the training data used to obtain the combiner's parameters. If adequate training data is not available, using the gene neighbor score by itself instead of a combined score might be the best choice.

0 0
  • [show abstract] [hide abstract]
    ABSTRACT: Annexins are Ca(2+)-binding, membrane-interacting proteins, widespread among eukaryotes, consisting usually of four structurally similar repeated domains. It is accepted that vertebrate annexins derive from a double genome duplication event. It has been postulated that a single domain annexin, if found, might represent a molecule related to the hypothetical ancestral annexin. The recent discovery of a single-domain annexin in a bacterium, Cytophaga hutchinsonii, apparently confirmed this hypothesis. Here, we present a more complex picture. Using remote sequence similarity detection tools, a survey of bacterial genomes was performed in search of annexin-like proteins. In total, we identified about thirty annexin homologues, including single-domain and multi-domain annexins, in seventeen bacterial species. The thorough search yielded, besides the known annexin homologue from C. hutchinsonii, homologues from the Bacteroidetes/Chlorobi phylum, from Gemmatimonadetes, from beta- and delta-Proteobacteria, and from Actinobacteria. The sequences of bacterial annexins exhibited remote but statistically significant similarity to sequence profiles built of the eukaryotic ones. Some bacterial annexins are equipped with additional, different domains, for example those characteristic for toxins. The variation in bacterial annexin sequences, much wider than that observed in eukaryotes, and different domain architectures suggest that annexins found in bacteria may actually descend from an ancestral bacterial annexin, from which eukaryotic annexins also originate. The hypothesis of an ancient origin of bacterial annexins has to be reconciled with the fact that remarkably few bacterial strains possess annexin genes compared to the thousands of known bacterial genomes and with the patchy, anomalous phylogenetic distribution of bacterial annexins. Thus, a massive annexin gene loss in several bacterial lineages or very divergent evolution would appear a likely explanation. Alternative evolutionary scenarios, involving horizontal gene transfer between bacteria and protozoan eukaryotes, in either direction, appear much less likely. Altogether, current evidence does not allow unequivocal judgement as to the origin of bacterial annexins.
    PLoS ONE 01/2014; 9(1):e85428. · 3.73 Impact Factor
  • Source
    [show abstract] [hide abstract]
    ABSTRACT: Phylogenetic profiles express the presence or absence of genes and their homologs across a number of reference genomes. They have emerged as an elegant representation framework for comparative genomics and have been used for the genome-wide inference and discovery of functionally linked genes or metabolic pathways. As the number of reference genomes grows, there is an acute need for faster and more accurate methods for phylogenetic profile analysis with increased performance in speed and quality. We propose a novel, efficient method for the detection of genomic idiosyncrasies, i.e. sets of genes found in a specific genome with peculiar phylogenetic properties, such as intra-genome correlations or inter-genome relationships. Our algorithm is a four-step process where genome profiles are first defined as fuzzy vectors, then discretized to binary vectors, followed by a de-noising step, and finally a comparison step to generate intra- and inter-genome distances for each gene profile. The method is validated with a carefully selected benchmark set of five reference genomes, using a range of approaches regarding similarity metrics and pre-processing stages for noise reduction. We demonstrate that the fuzzy profile method consistently identifies the actual phylogenetic relationship and origin of the genes under consideration for the majority of the cases, while the detected outliers are found to be particular genes with peculiar phylogenetic patterns. The proposed method provides a time-efficient and highly scalable approach for phylogenetic stratification, with the detected groups of genes being either similar to their own genome profile or different from it, thus revealing atypical evolutionary histories.
    PLoS ONE 01/2013; 8(1):e52854. · 3.73 Impact Factor
  • Source
    [show abstract] [hide abstract]
    ABSTRACT: Systematic determination of gene function is an essential step in fully understanding the precise contribution of each gene for the proper execution of molecular functions in the cell. Gene functional linkage is defined as to describe the relationship of a group of genes with similar functions. With thousands of genomes sequenced, there arises a great opportunity to utilize gene evolutionary information to identify gene functional linkages. To this end, we established a computational method (called TRACE) to trace gene footprints through a gene functional network constructed from 341 prokaryotic genomes. TRACE performance was validated and successfully tested to predict enzyme functions as well as components of pathway. A so far undescribed chromosome partitioning-like protein ro03654 of an oleaginous bacteria Rhodococcus sp. RHA1 (RHA1) was predicted and verified experimentally with its deletion mutant showing growth inhibition compared to RHA1 wild type. In addition, four proteins were predicted to act as prokaryotic SNARE-like proteins, and two of them were shown to be localized at the plasma membrane. Thus, we believe that TRACE is an effective new method to infer prokaryotic gene functional linkages by tracing evolutionary events.
    PLoS ONE 01/2013; 8(6):e66817. · 3.73 Impact Factor

Full-text (2 Sources)

Available from