A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases.

Bioinformatics Research Group, SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA.
BMC Bioinformatics (Impact Factor: 3.02). 07/2004; 5:76. DOI: 10.1186/1471-2105-5-76
Source: PubMed

ABSTRACT The PathoLogic program constructs Pathway/Genome databases by using a genome's annotation to predict the set of metabolic pathways present in an organism. PathoLogic determines the set of reactions composing those pathways from the enzymes annotated in the organism's genome. Most annotation efforts fail to assign function to 40-60% of sequences. In addition, large numbers of sequences may have non-specific annotations (e.g., thiolase family protein). Pathway holes occur when a genome appears to lack the enzymes needed to catalyze reactions in a pathway. If a protein has not been assigned a specific function during the annotation process, any reaction catalyzed by that protein will appear as a missing enzyme or pathway hole in a Pathway/Genome database.
We have developed a method that efficiently combines homology and pathway-based evidence to identify candidates for filling pathway holes in Pathway/Genome databases. Our program not only identifies potential candidate sequences for pathway holes, but combines data from multiple, heterogeneous sources to assess the likelihood that a candidate has the required function. Our algorithm emulates the manual sequence annotation process, considering not only evidence from homology searches, but also considering evidence from genomic context (i.e., is the gene part of an operon?) and functional context (e.g., are there functionally-related genes nearby in the genome?) to determine the posterior belief that a candidate has the required function. The method can be applied across an entire metabolic pathway network and is generally applicable to any pathway database. The program uses a set of sequences encoding the required activity in other genomes to identify candidate proteins in the genome of interest, and then evaluates each candidate by using a simple Bayes classifier to determine the probability that the candidate has the desired function. We achieved 71% precision at a probability threshold of 0.9 during cross-validation using known reactions in computationally-predicted pathway databases. After applying our method to 513 pathway holes in 333 pathways from three Pathway/Genome databases, we increased the number of complete pathways by 42%. We made putative assignments to 46% of the holes, including annotation of 17 sequences of previously unknown function.
Our pathway hole filler can be used not only to increase the utility of Pathway/Genome databases to both experimental and computational researchers, but also to improve predictions of protein function.

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic models. To overcome these problems, we have developed the PlantSEED, an integrated, metabolism-centric database to support subsystems-based annotation and metabolic model reconstruction for plant genomes. PlantSEED combines SEED subsystems technology, first developed for microbial genomes, with refined protein families and biochemical data to assign fully consistent functional annotations to orthologous genes, particularly those encoding primary metabolic pathways. Seamless integration with its parent, the prokaryotic SEED database, makes PlantSEED a unique environment for cross-kingdom comparative analysis of plant and bacterial genomes. The consistent annotations imposed by PlantSEED permit rapid reconstruction and modeling of primary metabolism for all plant genomes in the database. This feature opens the unique possibility of model-based assessment of the completeness and accuracy of gene annotation and thus allows computational identification of genes and pathways that are restricted to certain genomes or need better curation. We demonstrate the PlantSEED system by producing consistent annotations for 10 reference genomes. We also produce a functioning metabolic model for each genome, gapfilling to identify missing annotations and proposing gene candidates for missing annotations. Models are built around an extended biomass composition representing the most comprehensive published to date. To our knowledge, our models are the first to be published for seven of the genomes analyzed.
    Proceedings of the National Academy of Sciences of the United States of America. 06/2014;
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The availability of thousands of sequenced genomes has revealed the diversity of biochemical solutions to similar chemical problems. Even for molecules at the heart of metabolism, such as cofactors, the pathway enzymes first discovered in model organisms like Escherichia coli or Saccharomyces cerevisiae are often not universally conserved. Tetrahydrofolate (THF) (or its close relative tetrahydromethanopterin) is a universal and essential C1-carrier that most microbes and plants synthesize de novo. The THF biosynthesis pathway and enzymes are, however, not universal and alternate solutions are found for most steps, making this pathway a challenge to annotate automatically in many genomes. Comparing THF pathway reconstructions and functional annotations of a chosen set of folate synthesis genes in specific prokaryotes revealed the strengths and weaknesses of different microbial annotation platforms. This analysis revealed that most current platforms fail in metabolic reconstruction of variant pathways. However, all the pieces are in place to quickly correct these deficiencies if the different databases were built on each other's strengths.
    Computational and Structural Biotechnology Journal. 06/2014;
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The increased demand and consumption of fossil fuels have raised interest in finding renewable energy sources throughout the globe. Much focus has been placed on optimizing microorganisms and primarily microalgae, to efficiently produce compounds that can substitute for fossil fuels. However, the path to achieving economic feasibility is likely to require strain optimization through using available tools and technologies in the fields of systems and synthetic biology. Such approaches invoke a deep understanding of the metabolic networks of the organisms and their genomic and proteomic profiles. The advent of next generation sequencing and other high throughput methods has led to a major increase in availability of biological data. Integration of such disparate data can help define the emergent metabolic system properties, which is of crucial importance in addressing biofuel production optimization. Herein, we review major computational tools and approaches developed and used in order to potentially identify target genes, pathways, and reactions of particular interest to biofuel production in algae. As the use of these tools and approaches has not been fully implemented in algal biofuel research, the aim of this review is to highlight the potential utility of these resources toward their future implementation in algal research.
    BioMed Research International 09/2014; 2014((2014)):1-12. · 2.71 Impact Factor


1 Download
Available from