A systematic study of genome context methods: calibration, normalization and combination

Artificial Intelligence Center, SRI International, Menlo Park, California, USA.
BMC Bioinformatics (Impact Factor: 2.67). 10/2010; 11:493. DOI: 10.1186/1471-2105-11-493
Source: DBLP

ABSTRACT Genome context methods have been introduced in the last decade as automatic methods to predict functional relatedness between genes in a target genome using the patterns of existence and relative locations of the homologs of those genes in a set of reference genomes. Much work has been done in the application of these methods to different bioinformatics tasks, but few papers present a systematic study of the methods and their combination necessary for their optimal use.
We present a thorough study of the four main families of genome context methods found in the literature: phylogenetic profile, gene fusion, gene cluster, and gene neighbor. We find that for most organisms the gene neighbor method outperforms the phylogenetic profile method by as much as 40% in sensitivity, being competitive with the gene cluster method at low sensitivities. Gene fusion is generally the worst performing of the four methods. A thorough exploration of the parameter space for each method is performed and results across different target organisms are presented. We propose the use of normalization procedures as those used on microarray data for the genome context scores. We show that substantial gains can be achieved from the use of a simple normalization technique. In particular, the sensitivity of the phylogenetic profile method is improved by around 25% after normalization, resulting, to our knowledge, on the best-performing phylogenetic profile system in the literature. Finally, we show results from combining the various genome context methods into a single score. When using a cross-validation procedure to train the combiners, with both original and normalized scores as input, a decision tree combiner results in gains of up to 20% with respect to the gene neighbor method. Overall, this represents a gain of around 15% over what can be considered the state of the art in this area: the four original genome context methods combined using a procedure like that used in the STRING database. Unfortunately, we find that these gains disappear when the combiner is trained only with organisms that are phylogenetically distant from the target organism.
Our experiments indicate that gene neighbor is the best individual genome context method and that gains from the combination of individual methods are very sensitive to the training data used to obtain the combiner's parameters. If adequate training data is not available, using the gene neighbor score by itself instead of a combined score might be the best choice.

  • Source
    • "Finally, a score is computed as a P-value for the observed distances. See Bowers et al. (2004) or Ferrer et al. (2010) for details on this method and experimental results. The output of this stage is a number for each gene pair (G i ,G j ) that is assumed to be monotonically related with the likelihood that the two genes participate in the same subsystem. "
    [Show abstract] [Hide abstract]
    ABSTRACT: Key problems for computational genomics include discovering novel pathways in genome data, and discovering functional interaction partners for genes to define new members of partially elucidated pathways. We propose a novel method for the discovery of subsystems from annotated genomes. For each gene pair, a score measuring the likelihood that the two genes belong to a same subsystem is computed using genome context methods. Genes are then grouped based on these scores, and the resulting groups are filtered to keep only high-confidence groups. Since the method is based on genome context analysis, it relies solely on structural annotation of the genomes. The method can be used to discover new pathways, find missing genes from a known pathway, find new protein complexes or other kinds of functional groups and assign function to genes. We tested the accuracy of our method in Escherichia coli K-12. In one configuration of the system, we find that 31.6% of the candidate groups generated by our method match a known pathway or protein complex closely, and that we rediscover 31.2% of all known pathways and protein complexes of at least 4 genes. We believe that a significant proportion of the candidates that do not match any known group in E.coli K-12 corresponds to novel subsystems that may represent promising leads for future laboratory research. We discuss in-depth examples of these findings. Predicted subsystems are available at Supplementary data are available at Bioinformatics online.
    Bioinformatics 07/2011; 27(18):2478-85. DOI:10.1093/bioinformatics/btr428 · 4.62 Impact Factor
  • Source
  • [Show abstract] [Hide abstract]
    ABSTRACT: Determining the network of physical protein associations is an important first step in developing mechanistic evidence for elucidating biological pathways. Despite rapid advances in the field of high throughput experiments to determine protein interactions, the majority of associations remain unknown. Here we describe computational methods for significantly expanding protein association networks. We describe methods for integrating multiple independent sources of evidence to obtain higher quality predictions and we compare the major publicly available resources available for experimentalists to use.
    Physical Biology 06/2011; 8(3):035008. DOI:10.1088/1478-3975/8/3/035008 · 3.14 Impact Factor
Show more

Preview (3 Sources)

Available from