ABSTRACT: Background
One of the main challenges for biomedical research lies in the computer-assisted integrative study of large and increasingly complex combinations of data in order to understand molecular mechanisms. The preservation of the materials and methods of such computational experiments with clear annotations is essential for understanding an experiment, and this is increasingly recognized in the bioinformatics community. Our assumption is that offering means of digital, structured aggregation and annotation of the objects of an experiment will provide necessary meta-data for a scientist to understand and recreate the results of an experiment. To support this we explored a model for the semantic description of a workflow-centric Research Object (RO), where an RO is defined as a resource that aggregates other resources, e.g., datasets, software, spreadsheets, text, etc. We applied this model to a case study where we analysed human metabolite variation by workflows.
We present the application of the workflow-centric RO model for our bioinformatics case study. Three workflows were produced following recently defined Best Practices for workflow design. By modelling the experiment as an RO, we were able to automatically query the experiment and answer questions such as "which particular data was input to a particular workflow to test a particular hypothesis?", and "which particular conclusions were drawn from a particular workflow?".
Applying a workflow-centric RO model to aggregate and annotate the resources used in a bioinformatics experiment allowed us to retrieve the conclusions of the experiment in the context of the driving hypothesis, the executed workflows and their input data. The RO model is an extendable reference model that can be used by other systems as well.
The Research Object is available at http://www.myexperiment.org/packs/428. The Wf4Ever Research Object Model is available at http://wf4ever.github.io/ro.
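The kind of query described in the abstract above can be illustrated with a minimal sketch. It assumes a Research Object reduced to a plain set of (subject, predicate, object) annotations; the identifiers and predicate names (`aggregates`, `usedInput`, `testsHypothesis`, `drawnFrom`) are hypothetical and are not terms taken from the Wf4Ever RO model.

```python
# Hypothetical, simplified stand-in for a workflow-centric Research Object:
# a set of (subject, predicate, object) annotations over aggregated resources.
# All identifiers and predicate names are illustrative assumptions.
ANNOTATIONS = {
    ("ro:example", "aggregates", "wf:workflow1"),
    ("ro:example", "aggregates", "data:input1"),
    ("ro:example", "aggregates", "doc:hypothesis1"),
    ("ro:example", "aggregates", "doc:conclusion1"),
    ("wf:workflow1", "usedInput", "data:input1"),
    ("wf:workflow1", "testsHypothesis", "doc:hypothesis1"),
    ("doc:conclusion1", "drawnFrom", "wf:workflow1"),
}


def objects(triples, subject, predicate):
    """Return all objects o such that (subject, predicate, o) is annotated."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def subjects(triples, predicate, obj):
    """Return all subjects s such that (s, predicate, obj) is annotated."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


def inputs_for_hypothesis(triples, workflow, hypothesis):
    """Which data was input to `workflow` to test `hypothesis`?"""
    if hypothesis not in objects(triples, workflow, "testsHypothesis"):
        return []  # the workflow is not annotated as testing this hypothesis
    return objects(triples, workflow, "usedInput")


def conclusions_from(triples, workflow):
    """Which conclusions were drawn from `workflow`?"""
    return subjects(triples, "drawnFrom", workflow)
```

In practice such annotations would more likely be stored as RDF and queried with SPARQL; the plain-Python form is used here only to keep the sketch self-contained.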
ABSTRACT: Semantic annotation of web services has been proposed as a solution to the problem of discovering services to fit a particular need, and reusing them appropriately. While there exist tools that assist human users in the annotation task, e.g., Radiant and Meteor-S, no semantic annotation proposal considers the problem of verifying the accuracy of the resulting annotations. Early evidence from workflow compatibility checking suggests that the proportion of annotations that are inaccurate is high, and yet no tools exist to help annotators to test the results of their work systematically before they are deployed for public use. In this paper, we adapt techniques from conventional software testing to the verification of semantic annotations for web service input and output parameters. We present an algorithm for the testing process, and discuss ways in which manual effort from the annotator during testing can be reduced. We also present two adequacy criteria for specifying test cases used as input for the testing process. These criteria are based on structural coverage of the domain ontology used for annotation. The results of an evaluation exercise, based on a collection of annotations for bioinformatics web services, show that defects can be successfully detected by the technique.
IEEE Transactions on Services Computing 07/2014; · 2.46 Impact Factor
ABSTRACT: Research in life sciences is increasingly being conducted in a digital and
online environment. In particular, life scientists have been pioneers in
embracing new computational tools to conduct their investigations. To support
the sharing of digital objects produced during such research investigations, we
have witnessed in the last few years the emergence of specialized repositories,
e.g., DataVerse and FigShare. Such repositories provide users with the means to
share and publish datasets that were used or generated in research
investigations. While these repositories have proven their usefulness,
interpreting and reusing evidence for most research results is a challenging
task. Additional contextual descriptions are needed to understand how those
results were generated and/or the circumstances under which they were
concluded. Because of this, scientists are calling for models that go beyond
the publication of datasets to systematically capture the life cycle of
scientific investigations and provide a single entry point to access the
information about the hypothesis investigated, the datasets used, the
experiments carried out, the results of the experiments, the people involved in
the research, etc. In this paper we present the Research Object (RO) suite of
ontologies, which provide a structured container to encapsulate research data
and methods along with essential metadata descriptions. Research Objects are
portable units that enable the sharing, preservation, interpretation and reuse
of research investigation results. The ontologies we present have been designed
in the light of requirements that we gathered from life scientists. They have
been built upon existing popular vocabularies to facilitate interoperability.
Furthermore, we have developed tools to support the creation and sharing of
Research Objects, thereby promoting and facilitating their adoption.
ABSTRACT: Provenance is a critical ingredient for establishing trust in published scientific content. This is true whether we are considering a data set, a computational workflow, a peer-reviewed publication or a simple scientific claim with supportive evidence. Existing vocabularies such as DC Terms and the W3C PROV-O are domain-independent and general-purpose, and they allow and encourage extensions to cover more specific needs. We identify the specific need to distinguish between the various roles assumed by agents manipulating digital artifacts, such as author, contributor and curator.
We present the Provenance, Authoring and Versioning ontology (PAV): a lightweight ontology for capturing just enough descriptions essential for tracking the provenance, authoring and versioning of web resources. We argue that such descriptions are essential for digital scientific content. PAV distinguishes between contributors, authors and curators of content and creators of representations in addition to the provenance of originating resources that have been accessed, transformed and consumed. We explore five projects (and communities) that have adopted PAV illustrating their usage through concrete examples. Moreover, we present mappings that show how PAV extends the PROV-O ontology to support broader interoperability.
The authors strived to keep PAV lightweight and compact by including only those terms that have proven to be pragmatically useful in existing applications, and by recommending terms from existing ontologies when plausible.
We analyze and compare PAV with related approaches, namely Provenance Vocabulary, DC Terms and BIBFRAME. We identify similarities and analyze their differences with PAV, outlining strengths and weaknesses of our proposed model. We specify SKOS mappings that align PAV with DC Terms.
ABSTRACT: Dataspace management systems (DSMSs) hold the promise of pay-as-you-go data integration. We describe a comprehensive model of DSMS functionality using an algebraic style. We begin by characterizing a dataspace life cycle and highlighting opportunities for both automation and user-driven improvement techniques. Building on the observation that many of the techniques developed in model management are of use in data integration contexts as well, we briefly introduce the model management area and explain how previous work on both data integration and model management needs extending if the full dataspace life cycle is to be supported. We show that many model management operators already enable important functionality (e.g., the merging of schemas, the composition of mappings, etc.) and formulate these capabilities in an algebraic structure, thereby giving rise to the notion of the core functionality of a DSMS as a many-sorted algebra. Given this view, we show how core tasks in the dataspace life cycle can be enacted by means of algebraic programs. An extended case study illustrates how such algebraic programs capture a challenging, practical scenario.
ABSTRACT: Reusing and repurposing scientific workflows for novel scientific experiments is nowadays facilitated by workflow repositories. Such repositories allow scientists to find existing workflows and re-execute them. However, workflow input parameters often need to be adjusted to the research problem at hand. Adapting these parameters may become a daunting task due to the infinite combinations of their values in a wide range of applications. Thus, a scientist may prefer to use an automated optimization mechanism to adjust the workflow set-up and improve the result. Currently, automated optimizations must be started from scratch as optimization meta-data are not stored together with workflow provenance data. This important meta-data is lost and can neither be reused nor assessed by other researchers. In this paper we present a novel approach to capture optimization meta-data by extending the Research Object model and reusing the W3C standards. We validate our proposal through a real-world use case taken from the biodiversity domain, and discuss the exploitation of our solution in the context of existing e-Science infrastructures.
Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science; 11/2013
ABSTRACT: Thanks to the proliferation of computational techniques and the availability of datasets, data-intensive research has become commonplace in science. Sharing and re-use of datasets is key to scientific progress. A critical requirement for enabling data re-use is for data to be accompanied by lineage metadata that describes the context in which data is produced, the source datasets from which it was derived and the tooling or settings involved in its generation. By and large, this metadata is provided through a manual curation process, which is tedious, repetitive and time consuming. In this paper, we explore the problem of curating data artifacts generated from scientific workflows, which have become an established method for organizing computational data analyses. Most workflow systems can be instrumented to gather provenance, i.e., lineage, information about the data artifacts generated as a result of their execution. While this form of raw provenance provides elaborate information on localized lineage traced during a run in the form of data derivation or activity causality relations, it is of little use when one needs to report on lineage in a broader scientific context. Consequently, datasets resulting from workflow-based analyses also require manual curation prior to their publishing. We argue that by making the analysis process explicit, workflow-based investigations provide an opportunity for semi-automating data curation. In this paper we introduce a novel approach that semi-automates curation through a special kind of workflow, which we call a Labeling Workflow. Using 1) the description of a scientific workflow, 2) a set of semantic annotations characterizing the data processing in workflows, and 3) a library of label handling functions, we devise a Labeling Workflow, which can be executed over raw provenance in order to curate the data artifacts it refers to.
We semi-formally describe the elements of our solution, and showcase its usefulness using an example from Biodiversity.
ABSTRACT: While workflow technology has gained momentum in the last decade as a means for specifying and enacting computational experiments in modern science, reusing and repurposing existing workflows to build new scientific experiments is still a daunting task. This is partly due to the difficulty that scientists experience when attempting to understand existing workflows, which contain several data preparation and adaptation steps in addition to the scientifically significant analysis steps. One way to tackle the understandability problem is through providing abstractions that give a high-level view of activities undertaken within workflows. As a first step towards abstractions, we report in this paper on the results of a manual analysis performed over a set of real-world scientific workflows from the Taverna and Wings systems. Our analysis has resulted in a set of scientific workflow motifs that outline i) the kinds of data intensive activities that are observed in workflows (data oriented motifs), and ii) the different manners in which activities are implemented within workflows (workflow oriented motifs). These motifs can be useful to inform workflow designers on the good and bad practices for workflow development, to inform the design of automated tools for the generation of workflow abstractions, etc.
Future Generation Computer Systems 09/2013; · 2.64 Impact Factor
ABSTRACT: One aspect of the vision of dataspaces has been articulated as providing various benefits of classical data integration with reduced up-front costs. In this paper, we present techniques that aim to support schema mapping specification through interaction with end users in a pay-as-you-go fashion. In particular, we show how schema mappings that are obtained automatically using existing matching and mapping generation techniques can be annotated with metrics estimating their fitness to user requirements using feedback on query results obtained from end users. Using the annotations computed on the basis of user feedback, and given user requirements in terms of precision and recall, we present a method for selecting the set of mappings that produce results meeting the stated requirements. In doing so, we cast mapping selection as an optimization problem. Feedback may reveal that the quality of schema mappings is poor. We show how mapping annotations can be used to support the derivation of better quality mappings from existing mappings through refinement. An evolutionary algorithm is used to efficiently and effectively explore the large space of mappings that can be obtained through refinement. User feedback can also be used to annotate the results of the queries that the user poses against an integration schema. We show how estimates for precision and recall can be computed for such queries. We also investigate the problem of propagating feedback about the results of (integration) queries down to the mappings used to populate the base relations in the integration schema.
Information Systems 07/2013; 38(5):656–687. · 1.77 Impact Factor
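The idea of annotating mappings with precision and recall estimates derived from user feedback, as described in the abstract above, can be sketched as follows. This is only an illustration under stated assumptions: the `Feedback` record, the way feedback is counted, and the threshold filter in `select_mappings` are hypothetical simplifications, and the filter stands in for the optimization-based selection the abstract mentions rather than reproducing it.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    """User feedback counts for the results of one mapping (hypothetical shape)."""
    true_positives: int   # returned tuples the user marked as expected
    false_positives: int  # returned tuples the user marked as not expected
    false_negatives: int  # expected tuples the user reported as missing


def estimate_precision(fb: Feedback) -> float:
    """Fraction of annotated returned tuples that were expected."""
    returned = fb.true_positives + fb.false_positives
    return fb.true_positives / returned if returned else 0.0


def estimate_recall(fb: Feedback) -> float:
    """Fraction of annotated expected tuples that were returned."""
    expected = fb.true_positives + fb.false_negatives
    return fb.true_positives / expected if expected else 0.0


def select_mappings(feedback_by_mapping, min_precision, min_recall):
    """Keep mappings whose estimates meet both user thresholds.

    A plain threshold filter, used here as a simplified stand-in for
    casting mapping selection as an optimization problem.
    """
    return sorted(
        name
        for name, fb in feedback_by_mapping.items()
        if estimate_precision(fb) >= min_precision
        and estimate_recall(fb) >= min_recall
    )
```

Because the estimates depend only on the tuples that users have annotated so far, they would be expected to become more reliable as more feedback accumulates, which fits the pay-as-you-go setting.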
ABSTRACT: The Taverna workflow tool suite (http://www.taverna.org.uk) is designed to combine distributed Web Services and/or local tools into complex analysis pipelines. These pipelines can be executed on local desktop machines or through larger infrastructure (such as supercomputers, Grids or cloud environments), using the Taverna Server. In bioinformatics, Taverna workflows are typically used in the areas of high-throughput omics analyses (for example, proteomics or transcriptomics), or for evidence gathering methods involving text mining or data mining. Through Taverna, scientists have access to several thousand different tools and resources that are freely available from a large range of life science institutions. Once constructed, the workflows are reusable, executable bioinformatics protocols that can be shared, reused and repurposed. A repository of public workflows is available at http://www.myexperiment.org. This article provides an update to the Taverna tool suite, highlighting new features and developments in the workbench and the Taverna Server.
Nucleic Acids Research 05/2013; · 8.81 Impact Factor
ABSTRACT: This paper presents an extension to the W3C PROV provenance model, aimed at representing process structure. Although the modelling of process structure is out of the scope of the PROV specification, it is beneficial when capturing and analyzing the provenance of data that is produced by programs or other formally encoded processes. In the paper, we motivate the need for such an extended model in the context of an ongoing large data federation and preservation project, DataONE, where provenance traces of scientific workflow runs are captured and stored alongside the data products. We introduce new provenance relations for modelling process structure along with their usage patterns, and present sample queries that demonstrate their benefit.
Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance; 04/2013
ABSTRACT: Many scientists are using workflows to systematically design and run computational experiments. Once the workflow is executed, the scientist may want to publish the dataset generated as a result, to be, e.g., reused by other scientists as input to their experiments. In doing so, the scientist needs to curate such a dataset by specifying metadata information that describes it, e.g., its derivation history, origins and ownership. To assist the scientist in this task, we explore in this paper the use of provenance traces collected by workflow management systems when enacting workflows. Specifically, we identify the shortcomings of such raw provenance traces in supporting the data publishing task, and propose an approach whereby distilled, yet more informative, provenance traces that are fit for the data publishing task can be derived.
Proceedings of the Joint EDBT/ICDT 2013 Workshops; 03/2013
ABSTRACT: We describe a corpus of provenance traces that we have collected by executing 120 real world scientific workflows. The workflows are from two different workflow systems, Taverna and Wings, and 12 different application domains (see Figure 1). Table 1 provides a summary of this PROV-corpus.
Proceedings of the Joint EDBT/ICDT 2013 Workshops; 01/2013
ABSTRACT: Provenance, a form of structured metadata designed to record the origin or source of information, can be instrumental in deciding whether information is to be trusted, how it can be integrated with other diverse information sources, and how to establish attribution of information to authors throughout its history. The PROV set of specifications, produced by the World Wide Web Consortium (W3C), is designed to promote the publication of provenance information on the Web, and offers a basis for interoperability across diverse provenance management systems. The PROV provenance model is deliberately generic and domain-agnostic, but extension mechanisms are available and can be exploited for modelling specific domains. This tutorial provides an account of these specifications. Starting from intuitive and informal examples that present idiomatic provenance patterns, it progressively introduces the relational model of provenance along with the constraints model for validation of provenance documents, and concludes with example applications that show the extension points in use.
ABSTRACT: Scientific workflows have become the workhorse of Big Data analytics for scientists. As well as being repeatable and optimizable pipelines that bring together datasets and analysis tools, workflows make up an important part of the provenance of data generated from their execution. By faithfully capturing all stages in the analysis, workflows play a critical part in building up the audit-trail (a.k.a. provenance) meta-data for derived datasets and contribute to the veracity of results. Provenance is essential for reporting results, reporting the method followed, and adapting to changes in the datasets or tools. These functions, however, are hampered by the complexity of workflows and consequently the complexity of data-trails generated from their instrumented execution. In this paper we propose the generation of workflow description summaries in order to tackle workflow complexity. We elaborate reduction primitives for summarizing workflows, and show how primitives, as building blocks, can be used in conjunction with semantic workflow annotations to encode different summarization strategies. We report on the effectiveness of the method through experimental evaluation using real-world workflows from the Taverna system.
Big Data (BigData Congress), 2013 IEEE International Congress on; 01/2013
ABSTRACT: Scientific workflows are often data intensive. The data sets obtained by enacting scientific workflows have several applications, e.g., they can be used to identify data correlations or to understand phenomena, and therefore are worth storing in repositories for future analyses. Our experience suggests that such datasets often contain duplicate records. Indeed, scientists tend to enact the same workflow multiple times using the same or overlapping datasets, which gives rise to duplicates in workflow results. The presence of duplicates may increase the complexity of interpreting and analysing workflow results. Moreover, it unnecessarily increases the size of datasets within workflow results repositories. In this paper, we present an approach whereby duplicate detection is guided by workflow provenance traces. The hypothesis that we explore and exploit is that the operations that compose a workflow are likely to produce the same (or overlapping) dataset given the same (or overlapping) dataset. A preliminary analytic and empirical validation shows the effectiveness and applicability of the method proposed.
Proceedings of the 4th international conference on Provenance and Annotation of Data and Processes; 06/2012
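The hypothesis stated in the abstract above, that an operation is likely to produce the same output when given the same input, can be illustrated with a small sketch. The trace format (tuples of operation name, input identifiers and output identifier) and the function name are assumptions made for illustration, and the sketch covers only the identical-input case, not overlapping datasets.

```python
from collections import defaultdict


def candidate_duplicates(invocations):
    """Group outputs of invocations that ran the same operation on the same inputs.

    `invocations` is an iterable of (operation, inputs, output_id) tuples,
    a hypothetical flattening of a workflow provenance trace. Outputs that
    share an (operation, inputs) key are reported as candidate duplicates,
    on the assumption that the operation behaves deterministically.
    """
    groups = defaultdict(list)
    for operation, inputs, output_id in invocations:
        # Input order is ignored here; that is itself an assumption.
        groups[(operation, frozenset(inputs))].append(output_id)
    return [sorted(outputs) for outputs in groups.values() if len(outputs) > 1]
```

The appeal of such an approach is that candidate duplicates can be found from the trace alone, without comparing the contents of the output datasets record by record.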
ABSTRACT: PROV is a specification, promoted by the World Wide Web Consortium, for recording the provenance of web resources. It includes a schema, consistency constraints and inference rules on the schema, and a language for recording provenance facts. In this paper we describe an implementation of PROV that is based on the DLV Datalog engine. We argue that the deductive databases paradigm, which underpins the Datalog model, is a natural choice for expressing at the same time (i) the intensional features of the provenance model, namely its consistency constraints and inference rules, (ii) its extensional features, i.e., sets of provenance facts (called a provenance graph), and (iii) declarative recursive queries on the graph. The deductive and constraint solving capability of DLV can be used to validate a graph against the constraints, and to derive new provenance facts. We provide an encoding of the PROV rules as Datalog rules and constraints, and illustrate the use of deductive capabilities both for queries and for constraint validation, namely to detect inconsistencies in the graphs. The DLV code, along with a parser to map the PROV assertion language to Datalog syntax, is publicly available.
Proceedings of the 4th international conference on Provenance and Annotation of Data and Processes; 06/2012
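A recursive query over a provenance graph, of the kind the abstract above refers to, can be sketched in Python as a naive fixpoint computation. The sketch mimics a pair of Datalog-style rules, `ancestor(X, Y) :- wasDerivedFrom(X, Y)` and `ancestor(X, Z) :- wasDerivedFrom(X, Y), ancestor(Y, Z)`; the `ancestor` predicate and the function name are illustrative assumptions, and this is neither the authors' DLV encoding nor a PROV inference rule.

```python
def derivation_closure(facts):
    """Compute all direct and indirect derivation pairs.

    `facts` is a set of (derived, source) pairs standing in for
    wasDerivedFrom assertions. The loop repeatedly applies the recursive
    rule until no new pair is produced (a naive fixpoint evaluation).
    """
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for derived, middle in list(closure):
            for source_of, source in facts:
                if middle == source_of and (derived, source) not in closure:
                    closure.add((derived, source))
                    changed = True
    return closure
```

A Datalog engine evaluates such rules declaratively; the explicit loop only makes the fixpoint iteration visible.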
ABSTRACT: A workflow-centric research object bundles a workflow, the provenance of the results obtained by its enactment, other digital objects that are relevant for the experiment (papers, datasets, etc.), and annotations that semantically describe all these objects. In this paper, we propose a model to specify workflow-centric research objects, and show how the model can be grounded using semantic technologies and existing vocabularies, in particular the Object Reuse and Exchange (ORE) model and the Annotation Ontology (AO). We describe the life-cycle of a research object, which resembles the life-cycle of a scientific experiment.
Second International Conference on the Future of Scholarly Communication and Scientific Publishing (Sepublica 2012); 05/2012