Conference Paper

Estimating the Quality of Ontology-Based Annotations by Considering Evolutionary Changes

DOI: 10.1007/978-3-642-02879-3_7
Conference: Data Integration in the Life Sciences, 6th International Workshop, DILS 2009, Manchester, UK, July 20-22, 2009. Proceedings
Source: DBLP


Ontology-based annotations associate objects, such as genes and proteins, with well-defined ontology concepts to semantically
and uniformly describe object properties. Such annotation mappings are utilized in different applications and analysis studies
whose results strongly depend on the quality of the used annotations. To study the quality of annotations we propose a generic
evaluation approach considering the annotation generation methods (provenance) as well as the evolution of ontologies, object
sources, and annotations. Thus, it facilitates the identification of reliable annotations, e.g., for use in analysis applications.
We evaluate our approach for functional protein annotations in Ensembl and Swiss-Prot using the Gene Ontology.



Available from: Anika Groß
    • "Likewise, measures of accuracy based on term specificity have been called into question [5]. Other approaches that address annotation error rates or accuracy such as [6] and [8] downplay the role of ontology structural quality, and ignore the effect that the ontology structure can have on real-world applications. "
    ABSTRACT: The Gene Ontology and its associated annotations are critical tools for interpreting lists of genes. Here, we introduce a method for evaluating the Gene Ontology annotations and structure based on the impact they have on gene set enrichment analysis, along with an example implementation. This task-based approach yields quantitative assessments grounded in experimental data and anchored tightly to the primary use of the annotations. Applied to specific areas of biological interest, our framework allowed us to understand the progress of annotation and structural ontology changes from 2004 to 2012. Our framework was also able to determine that the quality of annotations and structure in the area under test have been improving in their ability to recall underlying biological traits. Furthermore, we were able to distinguish between the impact of changes to the annotation sets and ontology structure. Our framework and implementation lay the groundwork for a powerful tool in evaluating the usefulness of the Gene Ontology. We demonstrate both the flexibility and the power of this approach in evaluating the current and past state of the Gene Ontology as well as its applicability in developing new methods for creating gene annotations.
    Journal of Biomedical Semantics 04/2013; 4 Suppl 1(Suppl 1):S4. DOI:10.1186/2041-1480-4-S1-S4 · 2.26 Impact Factor
    • "Annotations are frequently assigned a score, using a variety of methods. These approaches include assigning confidence scores to annotations based on their stability (Gross et al., 2009) or combining the breadth (coverage of gene product) and the depth (level of detail) for the terms in the Gene Ontology (GO) (Buza et al. 2008). However, while deeper nodes within an ontology are generally more specialized, these measures are problematic; first GO has three root domains and second an ontology, such as GO, is a graph not a tree, therefore depth is not necessarily meaningful. "
    ABSTRACT: Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually, however manual curation is costly, time consuming and requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid un-annotated entries. Both manual and automated annotations vary in quality between databases and annotators, making assessment of annotation reliability problematic for users. The community lacks a generic measure for determining annotation quality and correctness, which we look at addressing within this article. Specifically we investigate word reuse within bulk textual annotations and relate this to Zipf's Principle of Least Effort. We use the UniProt Knowledgebase (UniProtKB) as a case study to demonstrate this approach since it allows us to compare annotation change, both over time and between automated and manually curated annotations. By applying power-law distributions to word reuse in annotation, we show clear trends in UniProtKB over time, which are consistent with existing studies of quality on free text English. Further, we show a clear distinction between manual and automated analysis and investigate cohorts of protein records as they mature. These results suggest that this approach holds distinct promise as a mechanism for judging annotation quality. Source code is available at the authors website:
    Bioinformatics 09/2012; 28(18):i562-i568. DOI:10.1093/bioinformatics/bts372 · 4.98 Impact Factor
    • "Typical ontology changes include the addition of new categories and relationships as well as the revision of the existing structure (Hartung et al., 2012b, 2008). These ontological modifications can trigger changes in the annotation (Gross et al., 2009), e.g. when a category is removed, the annotations need to be moved or deleted. Further, annotations may be edited to reflect new knowledge or to eliminate inconsistencies Dolan et al. (2005). "
    ABSTRACT: Ontologies are used in the annotation and analysis of biological data. As knowledge accumulates, ontologies and annotation undergo constant modifications to reflect this new knowledge. These modifications may influence the results of statistical applications such as functional enrichment analyses that describe experimental data in terms of ontological groupings. Here, we investigate to what degree modifications of the Gene Ontology (GO) impact these statistical analyses for both experimental and simulated data. The analysis is based on new measures for the stability of result sets and considers different ontology and annotation changes. Our results show that past changes in the GO are non-uniformly distributed over different branches of the ontology. Considering the semantic relatedness of significant categories in analysis results allows a more realistic stability assessment for functional enrichment studies. We observe that the results of term-enrichment analyses tend to be surprisingly stable despite changes in ontology and annotation. Supplementary information: Supplementary Data are available at Bioinformatics online.
    Bioinformatics 09/2012; 28(20):2671-7. DOI:10.1093/bioinformatics/bts498 · 4.98 Impact Factor