Article

A checklist-based approach for quality assessment of scientific information


Abstract

The Semantic Web is becoming a major platform for disseminating and sharing scientific data and results. The quality of this information is a critical factor in selecting and reusing it. Existing quality assessment approaches in the Semantic Web largely focus on using general quality dimensions (accuracy, relevancy, etc.) to establish quality metrics. However, specific quality assessment tasks may not fit into these dimensions, and scientists may find them too general for expressing their specific needs. We therefore present a checklist-based approach that allows specific quality requirements to be expressed directly, freeing users from the constraints of the existing quality dimensions. We demonstrate our approach with two scenarios and share lessons learned about the different Semantic Web technologies tested during our implementation.
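The core idea can be illustrated with a minimal sketch, assuming a toy representation in which each checklist item pairs a user-defined requirement (a predicate over a metadata record) with a severity level, rather than a fixed quality dimension such as "accuracy"; all names here are hypothetical, not the paper's actual model.

```python
from dataclasses import dataclass
from typing import Callable

# Severity levels for checklist items (hypothetical labels).
MUST, SHOULD, COULD = "MUST", "SHOULD", "COULD"

@dataclass
class ChecklistItem:
    description: str
    level: str                      # MUST / SHOULD / COULD
    check: Callable[[dict], bool]   # user-defined predicate over the metadata

def evaluate(checklist, metadata):
    """Return (satisfied, failed descriptions); only MUST failures block satisfaction."""
    failures = [item for item in checklist if not item.check(metadata)]
    satisfied = all(item.level != MUST for item in failures)
    return satisfied, [f.description for f in failures]

checklist = [
    ChecklistItem("has a creator", MUST, lambda m: "creator" in m),
    ChecklistItem("has a licence", SHOULD, lambda m: "licence" in m),
]
ok, missing = evaluate(checklist, {"creator": "A. Scientist"})
# A missing SHOULD item is reported but does not make the record fail.
```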


... In addition, we used the minimal information model "Minim", also in Semantic Web format, to specify which elements in an RO we consider "must haves", "should haves" and "could haves" according to user-defined requirements [23]. A checklist service subsequently queries the Minim annotations as an aid to making ROs sufficiently complete [24]. ...
... When building an RO in myExperiment, users are provided with a mechanism of quality assurance by our so-called checklist evaluation tool, which is built upon the Minim checklist ontology [23,44] and defined using the Web Ontology Language. Its basic function is to assess that all required information and descriptions about the aggregated resources are present and complete. ...
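A Minim-style completeness check might look like the following sketch; the requirement levels and element names are simplified assumptions for illustration, not the actual ontology terms queried by the checklist service.

```python
# Each requirement names an element the RO must/should/could aggregate
# (illustrative values, not actual Minim terms).
minim_requirements = [
    ("must",   "workflow definition"),
    ("must",   "input dataset"),
    ("should", "hypothesis annotation"),
    ("could",  "sketch diagram"),
]

def checklist_report(ro_contents, requirements):
    """Group missing elements by requirement level, as a checklist service might."""
    report = {"must": [], "should": [], "could": []}
    for level, element in requirements:
        if element not in ro_contents:
            report[level].append(element)
    return report

ro = {"workflow definition", "input dataset", "sketch diagram"}
report = checklist_report(ro, minim_requirements)
complete = not report["must"]   # "sufficiently complete" if no MUST item fails
```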
Article
Full-text available
One of the main challenges for biomedical research lies in the computer-assisted integrative study of large and increasingly complex combinations of data in order to understand molecular mechanisms. The preservation of the materials and methods of such computational experiments with clear annotations is essential for understanding an experiment, and this is increasingly recognized in the bioinformatics community. Our assumption is that offering means of digital, structured aggregation and annotation of the objects of an experiment will provide necessary meta-data for a scientist to understand and recreate the results of an experiment. To support this we explored a model for the semantic description of a workflow-centric Research Object (RO), where an RO is defined as a resource that aggregates other resources, e.g., datasets, software, spreadsheets, text, etc. We applied this model to a case study where we analysed human metabolite variation by workflows. We present the application of the workflow-centric RO model for our bioinformatics case study. Three workflows were produced following recently defined Best Practices for workflow design. By modelling the experiment as an RO, we were able to automatically query the experiment and answer questions such as “which particular data was input to a particular workflow to test a particular hypothesis?”, and “which particular conclusions were drawn from a particular workflow?”. Applying a workflow-centric RO model to aggregate and annotate the resources used in a bioinformatics experiment, allowed us to retrieve the conclusions of the experiment in the context of the driving hypothesis, the executed workflows and their input data. The RO model is an extendable reference model that can be used by other systems as well. The Research Object is available at http://www.myexperiment.org/packs/428 The Wf4Ever Research Object Model is available at http://wf4ever.github.io/ro
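The kind of automatic querying described above can be sketched with plain Python tuples standing in for RDF triples; the predicate names below are illustrative, not the actual wf4ever vocabulary.

```python
# Toy triple store for an annotated Research Object (illustrative predicates).
triples = [
    ("workflow1", "hasInput", "metabolite_dataset.csv"),
    ("workflow1", "testsHypothesis", "hypothesis1"),
    ("workflow1", "drewConclusion", "conclusion1"),
]

def query(triples, subject=None, predicate=None, obj=None):
    """Match triples against an optional subject/predicate/object pattern."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# "Which particular data was input to a particular workflow?"
inputs = [o for _, _, o in query(triples, subject="workflow1", predicate="hasInput")]
```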
... Everything in this RO, as well as the RO itself, is uniquely identified and can be referred to. This list of 5 rules was implemented as a checklist and whether an RO is compliant with this checklist can be automatically assessed using the RO quality assessment tool [51]. ...
Article
Full-text available
Motivation: Reproducing the results from a scientific paper can be challenging due to the absence of the data and the computational tools required for their analysis. In addition, details of the procedures used to obtain the published results can be difficult to discern because experiments are reported in natural language. The Investigation/Study/Assay (ISA), Nanopublications (NP), and Research Objects (RO) models are conceptual data modelling frameworks that can structure such information from scientific papers. Computational workflow platforms can also be used to reproduce analyses of data in a principled manner. We assessed the extent to which the ISA, NP, and RO models, together with the Galaxy workflow system, can capture the experimental processes and reproduce the findings of a previously published paper reporting on the development of SOAPdenovo2, a de novo genome assembler. Results: Executable workflows were developed using Galaxy, which reproduced results consistent with the published findings. A structured representation of the information in the SOAPdenovo2 paper was produced by combining the use of the ISA, NP, and RO models. By structuring the information in the published paper using these data and scientific workflow modelling frameworks, it was possible to explicitly declare elements of experimental design, variables, and findings. The models served as guides in the curation of scientific information, and this led to the identification of inconsistencies in the original published paper, thereby allowing its authors to publish corrections in the form of an erratum.
Availability: SOAPdenovo2 scripts, data, and results are available through the GigaScience Database: http://dx.doi.org/10.5524/100044; the workflows are available from GigaGalaxy: http://galaxy.cbiit.cuhk.edu.hk; and the representations using the ISA, NP, and RO models are available through the SOAPdenovo2 case study website http://isa-tools.github.io/soapdenovo2/. Contact: philippe.rocca-serra@oerc.ox.ac.uk and susanna-assunta.sansone@oerc.ox.ac.uk.
Conference Paper
Workflows have become a popular means for implementing experiments in the computational sciences. They are beneficial over other forms of implementation, as they require a formalisation of the experiment process, provide a standard set of functions, and abstract over the underlying system. Thus, they facilitate the understandability and repeatability of experimental research. In addition, metadata standards such as Research Objects, which allow further metadata about the research process to be attached, should enable better reproducibility of experiments. However, as several studies have shown, merely implementing an experiment as a workflow in a workflow engine is not sufficient to achieve these goals, as a number of challenges and pitfalls remain. In this paper, we want to quantify how many workflow executions are easy to repeat. To this end, we automatically obtain and analyse a set of almost 1,500 workflows available in the myExperiment platform, focusing on those authored in the Taverna workflow language. We provide statistics on the types of processing steps used, and investigate which vulnerabilities they face with regard to re-execution. We then try to automatically execute the workflows. From these results, we identify the most common causes of failure and analyse how they can be countered with existing or yet-to-be-developed approaches.
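The analysis described above amounts to attempting each workflow and tallying failure causes; the following sketch illustrates the idea, with failure categories that are assumptions on our part rather than the paper's exact taxonomy.

```python
from collections import Counter

def try_execute(workflow):
    """Stand-in for re-running a Taverna workflow; returns a failure cause or None."""
    if workflow.get("remote_service_down"):
        return "unavailable remote service"
    if workflow.get("missing_input"):
        return "missing example input"
    return None  # re-execution succeeded

workflows = [
    {"id": 1, "remote_service_down": True},
    {"id": 2, "missing_input": True},
    {"id": 3},
]
# Tally the most common causes of failure across the corpus.
failures = Counter(cause for wf in workflows
                   if (cause := try_execute(wf)) is not None)
```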
Article
Full-text available
The Web of Linked Data grows rapidly and already contains data originating from hundreds of data sources. The quality of data from those sources is very diverse, as values may be out of date, incomplete or incorrect. Moreover, data sources may provide conflicting values for a single real-world object. In order for Linked Data applications to consume data from this global data space in an integrated fashion, a number of challenges have to be overcome. One of these challenges is to rate and to integrate data based on their quality. However, quality is a very subjective matter, and finding a canonical judgement that is suitable for each and every task is not feasible. To simplify the task of consuming high-quality data, we present Sieve, a framework for flexibly expressing quality assessment methods as well as fusion methods. Sieve is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity Resolution, all crucial preliminaries for quality assessment and fusion. We demonstrate Sieve in a data integration scenario importing data from the English and Portuguese versions of DBpedia, and discuss how we increase completeness, conciseness and consistency through the use of our framework.
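Quality-driven fusion in the spirit of Sieve can be sketched as follows: when sources conflict on a property, keep the value from the source with the best quality score. Scoring by recency is just one example metric here, and none of these names reflect Sieve's actual API.

```python
# Conflicting values for one real-world object, from two DBpedia editions
# (illustrative records, not real DBpedia data).
conflicting = [
    {"source": "dbpedia-en", "population": 2868000, "last_updated": 2014},
    {"source": "dbpedia-pt", "population": 2851000, "last_updated": 2012},
]

def fuse_by_recency(records, prop):
    """Resolve a conflict by keeping prop from the most recently updated source."""
    best = max(records, key=lambda r: r["last_updated"])
    return best[prop], best["source"]

value, source = fuse_by_recency(conflicting, "population")
```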
Conference Paper
Full-text available
A workflow-centric research object bundles a workflow, the provenance of the results obtained by its enactment, other digital objects that are relevant for the experiment (papers, datasets, etc.), and annotations that semantically describe all these objects. In this paper, we propose a model to specify workflow-centric research objects, and show how the model can be grounded using semantic technologies and existing vocabularies, in particular the Object Reuse and Exchange (ORE) model and the Annotation Ontology (AO). We describe the life-cycle of a research object, which resembles the life-cycle of a scientific experiment.
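A research object of this kind can be pictured as an aggregation plus annotations, loosely following the ORE aggregation idea; the class and field names below are illustrative stand-ins, not the ORE/AO vocabulary.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchObject:
    """Toy workflow-centric RO: an aggregation of resources plus annotations."""
    uri: str
    aggregates: list = field(default_factory=list)   # datasets, papers, workflows...
    annotations: list = field(default_factory=list)  # (target, property, value) triples

    def annotate(self, target, prop, value):
        self.annotations.append((target, prop, value))

ro = ResearchObject("http://example.org/ro/1")
ro.aggregates += ["workflow.t2flow", "results.csv"]
# Semantically link the results back to the workflow that produced them.
ro.annotate("results.csv", "wasGeneratedBy", "workflow.t2flow")
```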