Community annotation: Procedures, protocols, and supporting tools

Department of Animal Science, Texas A&M University, College Station, Texas 77843, USA.
Genome Research (Impact Factor: 14.63). 12/2006; 16(11):1329-33. DOI: 10.1101/gr.5580606
Source: PubMed


Investigators at the Baylor College of Medicine Human Genome Sequencing Center (BCM-HGSC) and BeeBase organized a community-wide effort to manually annotate the honey bee (Apis mellifera) genome. Although various strategies for manual annotation have been used in the past, the value of dispersed community annotation has not yet been demonstrated. Here we make a case for the merit of dispersed community annotation. We present annotation procedures, standard protocols, and tools used for sequence analysis, data submission, and data management. We also report lessons learned from this dispersed community annotation effort for a metazoan genome.

  • Source
    • "Their phylogeny and branch information with D. mel are shown in supplementary figure S17, Supplementary Material online. The gene annotations were downloaded from BeeBase (Elsik et al. 2006), BeetleBase (Wang et al. 2007), SilkDB (Wang et al. 2005), and Ensembl (Hubbard et al. 2007), respectively. The species tree of a previous study (Zdobnov and Bork 2007) was used. "
    ABSTRACT: Widespread premature termination codon mutations (PTCs) were recently observed in human and fly populations. We took advantage of the population resequencing data in the Drosophila Genetic Reference Panel (DGRP) to investigate how the expression profile and the evolutionary age of genes shaped the allele frequency distribution of PTCs. After generating a high-quality dataset of PTCs, we clustered genes harboring PTCs into three categories: genes encoding low-frequency PTCs (≤ 1.5%), moderate-frequency PTCs (1.5%-10%) and high-frequency PTCs (> 10%). All three groups show narrow transcription compared to PTC-free genes, with the moderate- and high-PTC frequency groups showing a pronounced pattern. Moreover, nearly half (42%) of the PTC-encoding genes are not expressed in any tissue. Interestingly, the moderate-frequency PTC group is strongly enriched for genes expressed in midgut, whereas genes harboring high-frequency PTCs tend to have sex-specific expression. We further find that although young genes born in the last 60 million years (Myr) compose a mere 9% of the genome, they represent 16%, 30% and 50% of the genes containing low-, moderate- and high-frequency PTCs, respectively. Among DNA-based and RNA-based duplicated genes, the child copy is approximately twice as likely to contain PTCs as the parent copy, whereas young de novo genes are as likely to encode PTCs as DNA-based duplicated new genes. Based on these results, we conclude that expression profile and gene age jointly shaped the landscape of PTC-mediated gene loss. Therefore, we propose that new genes may need a long time to become stably maintained after the origination.
    Molecular Biology and Evolution 11/2014; 32(1). DOI:10.1093/molbev/msu299 · 9.11 Impact Factor
  • Source
    • "Integration of these data sets into a common platform has been inhibited by the need for systematic manual curation (5–8) of information from unstructured data sources (published articles and supplementary literature) and from structured entities (databases and other structured data sets). The massive volumes of dynamic bioinformatics data pose serious challenges to biocurators. "
    ABSTRACT: A large repertoire of gene-centric data has been generated in the field of zebrafish biology. Although the bulk of these data are available in the public domain, most of them are not readily accessible or available in nonstandard formats. One major challenge is to unify and integrate these widely scattered data sources. We tested the hypothesis that active community participation could be a viable option to address this challenge. We present here our approach to create standards for assimilation and sharing of information and a system of open standards for database intercommunication. We have attempted to address this challenge by creating a community-centric solution for zebrafish gene annotation. The Zebrafish GenomeWiki is a ‘wiki’-based resource, which aims to provide an altruistic shared environment for collective annotation of the zebrafish genes. The Zebrafish GenomeWiki has features that enable users to comment, annotate, edit and rate this gene-centric information. The credits for contributions can be tracked through a transparent microattribution system. In contrast to other wikis, the Zebrafish GenomeWiki is a ‘structured wiki’ or rather a ‘semantic wiki’. The Zebrafish GenomeWiki implements a semantically linked data structure, which in the future would be amenable to semantic search. Database URL:
    Database The Journal of Biological Databases and Curation 02/2014; 2014:bau011. DOI:10.1093/database/bau011 · 3.37 Impact Factor
  • Source
    • "The contigs consisting of two and five ESTs, respectively, were both over 200 bp in size but lacked an apparent coding potential. As these expression tags had not been computationally predicted as genes in the Official Gene set 2.0 for the honey bee [1], [17] we originally simply named them according to the genome scaffold they were located in, already considering that they might represent long noncoding RNAs [16]. "
    ABSTRACT: Division of labor in social insect colonies relies on a strong reproductive bias that favors queens. Although the ecological and evolutionary success attained through caste systems is well sketched out in terms of ultimate causes, the molecular and cellular underpinnings driving the development of caste phenotypes are still far from understood. Recent genomics approaches on honey bee developmental biology revealed a set of genes that are differentially expressed in larval ovaries and associated with transgressive ovary size in queens and massive cell death in workers. Amongst these, two contigs called special attention, both being over 200 bp in size and lacking apparent coding potential. Herein, we obtained their full cDNA sequences. These and their secondary structure characteristics placed in evidence that they are bona fide long noncoding RNAs (lncRNA) differentially expressed in larval ovaries, thus named lncov1 and lncov2. Genomically, both map within a previously identified QTL on chromosome 11, associated with transgressive ovary size in honey bee workers. As lncov1 was over-expressed in worker ovaries we focused on this gene. Real-time qPCR analysis on larval worker ovaries evidenced an expression peak coinciding with the onset of autophagic cell death. Cellular localization analysis through fluorescence in situ hybridization revealed perinuclear spots resembling omega speckles known to regulate trafficking of RNA-binding proteins. With only four lncRNAs known so far in honey bees, two expressed in the ovaries, these findings open a novel perspective on regulatory factors acting in the fine tuning of developmental processes underlying phenotypic plasticity related to social life histories.
    PLoS ONE 10/2013; 8(10):e78915. DOI:10.1371/journal.pone.0078915 · 3.23 Impact Factor