The Gene Ontology's Reference Genome Project: A Unified Framework for Functional Annotation across Species

PLoS Computational Biology (Impact Factor: 4.83). 01/2009; 5. DOI: 10.1371/journal.pcbi.1000431
Source: OAI

ABSTRACT The Gene Ontology (GO) is a collaborative effort that provides structured vocabularies for annotating the molecular function, biological role, and cellular location of gene products in a highly systematic way and in a species-neutral manner with the aim of unifying the representation of gene function across different organisms. Each contributing member of the GO Consortium independently associates GO terms to gene products from the organism(s) they are annotating. Here we introduce the Reference Genome project, which brings together those independent efforts into a unified framework based on the evolutionary relationships between genes in these different organisms. The Reference Genome project has two primary goals: to increase the depth and breadth of annotations for genes in each of the organisms in the project, and to create data sets and tools that enable other genome annotation efforts to infer GO annotations for homologous genes in their organisms. In addition, the project has several important incidental benefits, such as increasing annotation consistency across genome databases, and providing important improvements to the GO's logical structure and biological content.

  • Source
    • "This work introduced an interesting new tool for analyzing and visualizing the gene datasets. Because the typical Arabidopsis gene ontology (GO) (Gaudet et al., 2009) annotation provided limited understanding regarding which class of genes was important in dormancy and germination, the authors reannotated the genes in relation to previously described roles in germination- and dormancy-related terms (Microsoft Excel TAGGIT macro) (Carrera et al., 2007). This TAGGIT workflow was used for reanalyzing new and previous microarray data and has been shown to give a distinct visual gene signature for dormant and after-ripened seeds (Holdsworth et al., 2008b)."
    ABSTRACT: The success of flowering plants in most ecosystems could be seen as a result of the emergence of the seed structure as a reproduction vehicle. If plants are broadly defined as immobile living organisms, in contrast to animals, the seed would be the exception to that definition. The seed, enclosing an embryonic plant and nutrient sources, represents the final stage of the plant reproduction and allows the safe dispersion of the progeny. For doing that, the seed needs to survive a challenging and changing environment and to preserve the next generation until conditions are favorable for survival. The delay between seed formation and seed germination is one of the most important times during the entire plant cycle and has to be carefully synchronized with the environment to maximize seedling survival. This timing is principally determined by seed dormancy, which is a biological condition (physiological, morphological, and physical) that temporarily blocks germination, keeping the seed quiescent (Baskin and Baskin, 2004). Physiological dormancy is the most common form and generally includes components of embryo- and seed coat–based dormancy. Seed dormancy is an adaptive trait with high variability across species that has enormous importance in both wild and domesticated plants. Seed dormancy programs determine the ecological niche in which the seed germinates and prospers and are related to different factors such as climate, moisture, soil characteristics, light, temperature, nutrients, abiotic and biotic stress factors, and many others (Finch- Savage and Leubner-Metzger, 2006). In domesticated species, the control of seed dormancy is crucial and influences important agricultural traits, such as uniform germination and stand establishment, preharvest sprouting susceptibility (Gubler et al., 2005), and seed storage requirements.
    Seed Genomics, Edited by P. W. Becraft, 02/2013: chapter High-Throughput Genetic Dissection of Seed Dormancy: pages 111-122; Wiley-Blackwell, Oxford, UK., ISBN: 9780470960158
  • Source
    • "Finally, in order to provide a standard framework for data integration and a reliable engine for SNPs selection, the database has been built on a strong ontology layer. Whenever available, data have been annotated using ontological terms: Gene Ontology [22] for genes and KEGG Pathway ontology (derived from the hierarchical organization of KEGG pathways) for pathways are just some of the hierarchically structured vocabularies that underlie the infrastructure. Additionally, ontology structures allow to improve the performance of statistical and analytical evaluations by means of the graphs that undergo the hierarchically structured vocabularies and that shed light on the relationships between biological components. "
    ABSTRACT: The identification of genes and SNPs involved in human diseases remains a challenge. Many public resources, databases and applications, collect biological data and perform annotations, increasing the global biological knowledge. The need of SNPs prioritization is emerging with the development of new high-throughput genotyping technologies, which allow to develop customized disease-oriented chips. Therefore, given a list of genes related to a specific biological process or disease as input, a crucial issue is finding the most relevant SNPs to analyse. The selection of these SNPs may rely on the relevant a-priori knowledge of biomolecular features characterising all the annotated SNPs and genes of the provided list. The bioinformatics approach described here allows to retrieve a ranked list of significant SNPs from a set of input genes, such as candidate genes associated with a specific disease. The system enriches the genes set by including other genes, associated to the original ones by ontological similarity evaluation. The proposed method relies on the integration of data from public resources in a vertical perspective (from genomics to systems biology data), the evaluation of features from biomolecular knowledge, the computation of partial scores for SNPs and finally their ranking, relying on their global score. Our approach has been implemented into a web based tool called SNPRanker, which is accessible through at the URL . An interesting application of the presented system is the prioritisation of SNPs related to genes involved in specific pathologies, in order to produce custom arrays.
    Journal of integrative bioinformatics 01/2010; 7(3). DOI:10.2390/biecoll-jib-2010-138
  • Source
    ABSTRACT: Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate information retrieval from the literature would thus provide a significant boost to manual curation efforts. We employ the Textpresso category-based information retrieval and extraction system, developed by WormBase, to explore how Textpresso might improve the efficiency with which we manually curate C. elegans proteins to the Gene Ontology's Cellular Component Ontology. Using a training set of sentences that describe results of localization experiments in the published literature, we generated three new curation task-specific categories (Cellular Components, Assay Terms, and Verbs) containing words and phrases associated with reports of experimentally determined subcellular localization. We compared the results of manual curation to that of Textpresso queries that searched the full text of articles for sentences containing terms from each of the three new categories plus the name of a previously uncurated C. elegans protein, and found that Textpresso searches identified curatable papers with recall and precision rates of 79.1% and 61.8%, respectively (F-score of 69.5%), when compared to manual curation. Within those documents, Textpresso identified relevant sentences with recall and precision rates of 30.3% and 80.1% (F-score of 44.0%). From returned sentences, curators were able to make 66.2% of all possible experimentally supported GO Cellular Component annotations with 97.3% precision (F-score of 78.8%). Measuring the relative efficiencies of Textpresso-based versus manual curation we find that Textpresso has the potential to increase curation efficiency by at least 8-fold, and perhaps as much as 15-fold, given differences in individual curatorial speed.
Textpresso is an effective tool for improving the efficiency of manual, experimentally based curation. Incorporating a Textpresso-based Cellular Component curation pipeline at WormBase has allowed us to transition from strictly manual curation of this data type to a more efficient pipeline of computer-assisted validation. Continued development of curation task-specific Textpresso categories will provide an invaluable resource for genomics databases that rely heavily on manual curation.
    BMC Bioinformatics 02/2009; 10:228. DOI:10.1186/1471-2105-10-228 · 2.67 Impact Factor
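The F-scores quoted in the Textpresso abstract above appear consistent with the usual balanced F-score, the harmonic mean of precision and recall. The abstract does not state its formula, so this is an assumption, and the helper name `f_score` is illustrative only. A minimal sketch that recomputes the three values from the rounded percentages given in the abstract:

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall (assumed formula)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as quoted in the abstract, expressed as fractions.
reported = {
    "curatable papers": (0.618, 0.791),
    "relevant sentences": (0.801, 0.303),
    "annotations made": (0.973, 0.662),
}

for label, (p, r) in reported.items():
    print(f"{label}: F = {100 * f_score(p, r):.1f}%")
```

Because the inputs are already rounded to one decimal place, a recomputed value can differ from the published figure in the last digit.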