Meeting report: a workshop on Best Practices in Genome Annotation

Informatics, J. Craig Venter Institute, Rockville, MD 20850 USA, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK and The Arabidopsis Information Resource, Carnegie Institution of Washington, Stanford, CA 94305 USA.
Database: The Journal of Biological Databases and Curation (Impact Factor: 3.37). 01/2010; 2010:baq001. DOI: 10.1093/database/baq001
Source: PubMed


Efforts to annotate the genomes of a wide variety of model organisms are currently carried out by sequencing centers, model organism databases and academic/institutional laboratories around the world. Different annotation methods and tools have been developed over time to meet the needs of biologists faced with the task of annotating biological data. While standardized methods are essential for consistent curation within each annotation group, methods and tools can differ between groups, especially when the groups are curating different organisms. Biocurators from several institutes met at the Third International Biocuration Conference in Berlin, Germany, in April 2009 and hosted the 'Best Practices in Genome Annotation: Inference from Evidence' workshop to share their strategies, pipelines, standards and tools. This article documents the material presented in the workshop.



Available from: Linda I Hannick,
    • "To gain full use of the genomic information, the next step after sequencing is to annotate genes encoded in each genome. Several genome annotation pipelines have been established and actively used in a number of genome projects [4-6]. One of the most important products of genome annotation is a set of amino acid sequences translated from predicted protein-coding genes. "
    ABSTRACT: Background: Accurate computational identification of eukaryotic gene organization is a long-standing problem. Despite the fundamental importance of precise annotation of genes encoded in newly sequenced genomes, the accuracy of predicted gene structures has not been critically evaluated, mostly due to the scarcity of proper assessment methods. Results: We present a gene-structure-aware multiple sequence alignment method for gene prediction using amino acid sequences translated from homologous genes from many genomes. The approach provides rich information concerning the reliability of each predicted gene structure. We have also devised an iterative method that attempts to improve the structures of suspiciously predicted genes based on a spliced alignment algorithm using consensus sequences or reliable homologs as templates. Application of our methods to cytochrome P450 and ribosomal proteins from 47 plant genomes indicated that 50-60% of the annotated gene structures are likely to contain some defects. Whereas more than half of the defect-containing genes may be intrinsically broken, i.e., they are pseudogenes or gene fragments, located in unfinished sequencing areas, or corresponding to non-productive isoforms, the defects found in a majority of the remaining gene candidates can be remedied by our iterative refinement method. Conclusions: Refinement of eukaryotic gene structures mediated by gene-structure-aware multiple protein sequence alignment is a useful strategy to dramatically improve the overall prediction quality of a set of homologous genes. Our method will be applicable to various families of protein-coding genes if their domain structures are evolutionarily stable. It is also feasible to apply our method to gene families from all kingdoms of life, not just plants.
    BMC Bioinformatics 06/2014; 15(1):189. DOI:10.1186/1471-2105-15-189 · 2.58 Impact Factor
    • "However, the accuracy of the annotation also relies on the automated pipeline used [38], some predicted genes could be dissimilar to anything in the reference databases as they could have evolved extensively, represent uncharacterized sequences, or be misidentified [39]. Reference databases and computational methods constituting annotation pipelines are constantly developed, and there is hence a need to reprocess genome annotations on a regular basis to improve their quality and completeness [39], [40]. As it was not the primary objective in this study, the genomic sequence of the bacterium remains as draft. "
    ABSTRACT: Following the isolation, cultivation and characterization of the rumen bacterium Anaerovibrio lipolyticus in the 1960s, it has been recognized as one of the major species involved in lipid hydrolysis in ruminant animals. However, there has been limited characterization of the lipases from the bacterium, despite the importance of understanding lipolysis and its impact on subsequent biohydrogenation of polyunsaturated fatty acids by rumen microbes. This study describes the draft genome of Anaerovibrio lipolytica 5ST, and the characterization of three lipolytic genes and their translated protein. The uncompleted draft genome was 2.83 Mbp and comprised of 2,673 coding sequences with a G+C content of 43.3%. Three putative lipase genes, alipA, alipB and alipC, encoding 492-, 438- and 248- amino acid peptides respectively, were identified using RAST. Phylogenetic analysis indicated that alipA and alipB clustered with the GDSL/SGNH family II, and alipC clustered with lipolytic enzymes from family V. Subsequent expression and purification of the enzymes showed that they were thermally unstable and had higher activities at neutral to alkaline pH. Substrate specificity assays indicated that the enzymes had higher hydrolytic activity against caprylate (C8), laurate (C12) and myristate (C14).
    PLoS ONE 08/2013; 8(8):e69076. DOI:10.1371/journal.pone.0069076 · 3.23 Impact Factor
    • "Despite an increase in associations between human gut microbial functions and host physiologies, little is known about the age-related microbial activities involved in the development and progression of the human microbiota [16,21,24]. While more and more microbial species are linked to particular characteristics [25], few studies show prediction of novel metagenomic samples using the gene content [22,26], even though there are a variety of ways to identify open reading frames and genes [27,28]. Since the consortia of microbiota is under conjoint influences of various forces, it is difficult to extract the effect caused by aging or to predict the host age based on microbial gene content [29,30]. "
    ABSTRACT: Background Human gut microbial functions are often associated with various diseases and host physiologies. Aging, a less explored factor, is also suspected to affect or be affected by microbiome alterations. By combining functional feature selection with supervised classification, we aim to facilitate identification of age-related functional characteristics in metagenomes from several human gut microbiome studies (MetaHIT, MicroAge, MicroObes, Kurokawa et al.’s and Gill et al.’s dataset). Results We apply two feature selection methods, term frequency-inverse document frequency (TF-iDF) and minimum-redundancy maximum-relevancy (mRMR), to identify functional signatures that differentiate metagenomes by age. After features are reduced, we use a support vector machine (SVM) to predict host age of new metagenomes. Functional features are from protein families (Pfams), Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, KEGG ontologies and the Gene Ontology (GO) database. Initial investigations demonstrate that ordination of the functional principal components shows great overlap between different age groups. However, when feature selection is applied, mRMR tightens the ordination cluster for each age group, and TF-iDF offers better linear separation. Both TF-iDF and mRMR were used in conjunction with a SVM classifier and achieved areas under receiver operating characteristic curves (AUCs) 10 to 15% above chance to classify individuals above/below mid-ages (about 38 to 43 years old) using Pfams. Better performance around mid-ages is also observed when using other functional categories and age-balanced dataset. We also identified some age-related Pfams that improved age discrimination at age 65 with another feature selection method called LEfSe, on an age-balanced dataset. 
The selected functional characteristics identify a broad range of age-relevant metabolisms, such as reduced vitamin B12 synthesis, reduced activity of reductases, increased DNA damage, occurrences of stress responses and immune system compromise, and upregulated glycosyltransferases in the aging population. Conclusions Feature selection can yield biologically meaningful results when used in conjunction with classification, and makes age classification of new human gut metagenomes feasible. While we demonstrate the promise of this approach, the data-dependent prediction performance could be further improved. We hypothesize that while the Qin et al. dataset is the most comprehensive to date, even deeper sampling is needed to better characterize and predict the microbiomes’ functional content.
    01/2013; 1(1). DOI:10.1186/2049-2618-1-2