Conference Paper

Estimating the Quality of Ontology-Based Annotations by Considering Evolutionary Changes.

DOI: 10.1007/978-3-642-02879-3_7
Conference: Data Integration in the Life Sciences, 6th International Workshop, DILS 2009, Manchester, UK, July 20-22, 2009. Proceedings
Source: DBLP

ABSTRACT Ontology-based annotations associate objects, such as genes and proteins, with well-defined ontology concepts to semantically
and uniformly describe object properties. Such annotation mappings are utilized in different applications and analysis studies
whose results strongly depend on the quality of the used annotations. To study the quality of annotations we propose a generic
evaluation approach considering the annotation generation methods (provenance) as well as the evolution of ontologies, object
sources, and annotations. Thus, it facilitates the identification of reliable annotations, e.g., for use in analysis applications.
We evaluate our approach for functional protein annotations in Ensembl and Swiss-Prot using the Gene Ontology.

Available from: Anika Groß, Aug 08, 2015
  • Source
    • "Annotations are frequently assigned a score, using a variety of methods. These approaches include assigning confidence scores to annotations based on their stability (Gross et al., 2009) or combining the breadth (coverage of gene product) and the depth (level of detail) for the terms in the Gene Ontology (GO) (Buza et al. 2008). However, while deeper nodes within an ontology are generally more specialized, these measures are problematic; first GO has three root domains and second an ontology, such as GO, is a graph not a tree, therefore depth is not necessarily meaningful. "
    ABSTRACT: Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually; however, manual curation is costly and time-consuming and requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid un-annotated entries. Both manual and automated annotations vary in quality between databases and annotators, making assessment of annotation reliability problematic for users. The community lacks a generic measure for determining annotation quality and correctness, which we look at addressing within this article. Specifically, we investigate word reuse within bulk textual annotations and relate this to Zipf's Principle of Least Effort. We use the UniProt Knowledgebase (UniProtKB) as a case study to demonstrate this approach since it allows us to compare annotation change, both over time and between automated and manually curated annotations. By applying power-law distributions to word reuse in annotation, we show clear trends in UniProtKB over time, which are consistent with existing studies of quality on free text English. Further, we show a clear distinction between manual and automated analysis and investigate cohorts of protein records as they mature. These results suggest that this approach holds distinct promise as a mechanism for judging annotation quality. Source code is available at the authors' website:
    Bioinformatics 09/2012; 28(18):i562-i568. DOI:10.1093/bioinformatics/bts372 · 4.62 Impact Factor
  • Source
    ABSTRACT: Matching life science ontologies to determine ontology mappings has recently become an active field of research. The large size of existing ontologies and the application of complex match strategies for obtaining high-quality mappings make ontology matching a resource- and time-intensive process. To improve performance we investigate different approaches for parallel matching on multiple compute nodes. In particular, we consider inter-matcher and intra-matcher parallelism as well as the parallel execution of element- and structure-level matching. We implemented a distributed infrastructure for parallel ontology matching and evaluate different approaches for parallel matching of large life science ontologies in the field of anatomy and molecular biology.
    Data Integration in the Life Sciences, 7th International Conference, DILS 2010, Gothenburg, Sweden, August 25-27, 2010. Proceedings; 01/2010
  • Source
    ABSTRACT: Ontologies are increasingly used to structure and semantically describe entities of domains, such as genes and proteins in life sciences. Their increasing size and the high frequency of updates resulting in a large set of ontology versions necessitate efficient management and analysis of this data. We present GOMMA, a generic infrastructure for managing and analyzing life science ontologies and their evolution. GOMMA utilizes a generic repository to uniformly and efficiently manage ontology versions and different kinds of mappings. Furthermore, it provides components for ontology matching and determining evolutionary ontology changes. These components are used by analysis tools, such as the Ontology Evolution Explorer (OnEX) and the detection of unstable ontology regions. We introduce the component-based infrastructure and show analysis results for selected components and life science applications. GOMMA is available at GOMMA provides a comprehensive and scalable infrastructure to manage large life science ontologies and analyze their evolution. Key functions include a generic storage of ontology versions and mappings, support for ontology matching and determining ontology changes. The supported features for analyzing ontology changes are helpful to assess their impact on ontology-dependent applications such as for term enrichment. GOMMA complements OnEX by providing functionalities to manage various versions of mappings between two ontologies and allows combining different match approaches.
    Journal of Biomedical Semantics 09/2011; 2(1):6. DOI:10.1186/2041-1480-2-6 · 2.62 Impact Factor