Towards the Preservation of Scientific Workflows

David De Roure
Oxford e-Research Centre
University of Oxford
Oxford, UK
david.deroure@oerc.ox.ac.uk
Khalid Belhajjame
School of Computer Science
University of Manchester
Manchester, UK
khalidb@cs.man.ac.uk
Paolo Missier
School of Computing Science
Newcastle University
Newcastle upon Tyne, UK
paolo.missier@ncl.ac.uk
José Manuel Gómez-Pérez
iSOCO
Madrid, Spain
jmgomez@isoco.com
Raúl Palma
Poznań Supercomputing and Networking Center
Poznań, Poland
palma@man.poznan.pl
José Enrique Ruiz
Instituto de Astrofísica de Andalucía
Granada, Spain
jer@iaa.es
Kristina Hettne & Marco Roos
Leiden University Medical Center
Leiden, NL
{k.m.hettne,m.roos}@lumc.nl
Graham Klyne
Department of Zoology
University of Oxford
Oxford, UK
graham.klyne@zoo.ox.ac.uk
Carole Goble
School of Computer Science
University of Manchester
Manchester, UK
carole.goble@manchester.ac.uk
ABSTRACT
Some of the shared digital artefacts of digital research are
executable in the sense that they describe an automated pro-
cess which generates results. One example is the compu-
tational scientific workflow which is used to conduct auto-
mated data analysis, predictions and validations. We de-
scribe preservation challenges of scientific workflows, and
suggest a framework to discuss the reproducibility of work-
flow results. We describe curation techniques that can be
used to avoid the ‘workflow decay’ that occurs when steps
of the workflow are vulnerable to external change. Our ap-
proach makes extensive use of provenance information and
also considers aggregate structures called Research Objects
as a means for promoting workflow preservation.
Categories and Subject Descriptors
H.3.5 [Online Information Services]: Data sharing; H.5.3
[Group and Organization Interfaces]: Collaborative com-
puting
1. INTRODUCTION
Research is being conducted in an increasingly digital and
online environment. Consequently we are seeing the emer-
gence of new digital artefacts. In some respects these objects
can be regarded as data; however some warrant particular
attention, such as when the object includes a description
of some part of the research method that is captured as a
computational process. Processes encapsulate the knowledge
related to the generation, (re)use and general transformation
of data in experimental sciences. For example, an
object might contain raw data, the description of a computational
analysis process and the results of executing that
process, thus offering the capability to reproduce and reuse
the research process. Processes are key to the understanding
and evolution of science; consequently, just as the scientific
community needs to curate and preserve data, so we should
preserve and curate the associated processes [5]. The problem,
as observed by Donoho et al., is that “current computational
science practice does not generate routinely verifiable
knowledge” [3].
In this paper we focus on computational scientific work-
flows which are increasingly becoming part of the scholarly
knowledge cycle [11]. A computational scientific workflow
is a precise, executable description of a scientific procedure
– a multi-step process to coordinate multiple components
and tasks, like a script. Each task represents the execu-
tion of a computational process, such as running a program,
submitting a query to a database, submitting a job to a com-
putational facility, or invoking a service over the Web to use
a remote resource. Data output from one task is consumed
by subsequent tasks according to a predefined graph topol-
ogy that orchestrates the flow of data. The components
(the dataset, service, facility or code) might be local and
hosted along with the workflow, or remote (public reposito-
ries hosted by third parties) [9].
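As a rough illustration of this definition, the sketch below models a workflow as a set of tasks plus a graph topology that decides which outputs each task consumes. It is not how Taverna or any other workflow system is implemented; the task names (read_data, scale, summarise) and the helper run_workflow are hypothetical and chosen only for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks: each receives the outputs of its upstream tasks
# (keyed by task name) and returns its own output.
def read_data(inputs):
    return [1, 2, 3]

def scale(inputs):
    return [10 * x for x in inputs["read_data"]]

def summarise(inputs):
    return {"count": len(inputs["read_data"]), "total": sum(inputs["scale"])}

TASKS = {"read_data": read_data, "scale": scale, "summarise": summarise}

# Graph topology: task name -> tasks whose output it consumes.
DEPENDS_ON = {
    "read_data": [],
    "scale": ["read_data"],
    "summarise": ["read_data", "scale"],
}

def run_workflow(tasks, depends_on):
    """Run every task once, in an order that respects the data flow."""
    outputs = {}
    for name in TopologicalSorter(depends_on).static_order():
        upstream = {dep: outputs[dep] for dep in depends_on[name]}
        outputs[name] = tasks[name](upstream)
    return outputs

results = run_workflow(TASKS, DEPENDS_ON)
print(results["summarise"])
```

In a real workflow any of these tasks could be a database query or a remote service call, which is exactly where the dependence on third-party resources enters.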
Workflows have become an important tool in many ar-
eas, notably in the Life Sciences where tools like Taverna [7]
are popular. From a researcher’s standpoint, workflows are a
transparent means of encoding an in silico scientific method
that supports reproducible science and the sharing and
replication of best practice and know-how through reuse.
However, the preservation of scientific methods in the form
of computational workflows faces challenges that stem
precisely from their executable nature and their vulnerability
to the volatility of the resources (data and services)
required for their execution. Changes made by third parties
to the workflow components may lead to a decay in the ability
of the workflow to be executed and consequently hinder
the repeatability and reproducibility of its results.
This paper highlights such challenges and states the promi-
nent role of information quality evaluation and curation in
order to diagnose and react to workflow decay. Although we
draw on our specific experience with workflows, the frame-
work in this paper is designed for a more generalised no-
tion of executable objects which we refer to as Research Ob-
jects [1].
We begin by discussing the difficulties underlying scientific
workflow preservation (in Sec. 2). We go on to highlight the
role that Research Objects, as artefacts that bundle work-
flows together with other resources, can play in ensuring the
preservation of scientific workflows (in Sec. 3). We close the
paper by discussing our ongoing work (in Sec. 4).
2. PRESERVATION CHALLENGES
To illustrate preservation needs in scientific workflows, we
use an example workflow from the field of astronomy, which
is used to extract a list of companion galaxies. The workflow
is illustrated in Figure 1. It starts by running two activities
in parallel: the first extracts a list of companion galaxies
by querying the public Virtual Observatory (VO) database,
and the second extracts a second list of companion
galaxies by invoking a web service. The results of the two
activities are then cross-matched to obtain an improved list
of companion galaxies.
Figure 1: Extracting companion galaxies.
The content of the VO database is subject to update, and
the implementation of the web service responsible for detecting
companion galaxies is also subject to modifications.
Thus it is possible, and likely, that the workflow produces
different lists of companion galaxies when run at different
times. It is important therefore to record the provenance of
workflow outputs; i.e. the sources of information and processes
involved in producing a particular list [12].
Should the VO database become unavailable or alter its
interface so that the workflow can no longer access it, the
workflow will become inoperable. This workflow decay
is a fundamental challenge for the preservation of scientific
workflows. Even though a workflow description remains unchanged,
and may still have value in helping interpret results,
the execution of that workflow may fail or yield different
results. This is due to dependencies on resources outside
the immediate context of the object which are subject to
independent change. Further use cases can be found in [13]
for bioinformatics and [14] for astronomy.
Gil et al. observe that “It must be possible to re-execute
workflows many years later and obtain the same results.
This requirement poses challenges in terms of creating a
stable layer of abstraction over a rapidly evolving infras-
tructure while providing the flexibility needed to address
evolving requirements and applications and to support new
capabilities” [4].
This abstraction approach insulates workflows from some change,
but we will still experience decay when the execution is
dependent on resources and services that use independently
controlled resources. For example, service providers such
as the European Bioinformatics Institute (EMBL-EBI) routinely
update their service offerings, and must do so in response
to developments in the field of life science. Resources become
obsolete or are no longer sustained. Even workflows that
depend on local components are still vulnerable to changes
in operating systems, data management sustainability and
access to computational infrastructure. We note that workflows
have many of the properties of software, such as the
composition of components with external dependencies, and
hence some aspects of software preservation [10] are applicable.
We also observe that the above requirement is actually
quite stringent. What must be possible is to reproduce an
experiment; it is helpful if rerunning a workflow produces the
same results, but this is not the only way to do so.
We need a means to i) evaluate the current status of the
resources upon which the workflow depends and ii) react
to any signs of diagnosed decay in order to ensure workflow
execution. In the Wf4Ever project¹ we are addressing
this twofold goal through the combination of techniques
for computing information quality (more specifically,
the integrity and authenticity of the associated resources)
and curation techniques. Foreseeing the case where actual
reproducibility cannot be achieved despite such efforts, we
propose partial reproducibility as the means required to
play back workflow execution based on the provenance of
previous executions.
2.1 Reproducibility in scientific workflows
To provide a framework for this discussion we briefly anal-
yse, in abstract terms, the key scenarios that arise when
attempting to reproduce a workflow execution. The short
formalism that follows identifies four cases for consideration
here and can readily be used to discuss other cases.
Let $W_{S,D}$ denote a workflow $W$ with dependencies on a
set of services $S$ and on a data state $D$. A typical example
would be a bioinformatics workflow that depends on a set $S$
of EBI services, some of which provide query capabilities into
some of the EBI databases. Here $D$ represents the content of
those databases. Let $\mathit{exec}(W_{S,D}, d, t)$ denote the execution
of $W$ on input dataset $d$ at time $t$.
As noted earlier, both service specification and implementation
will evolve over time (and some services may be retired),
and the state of the databases will change as well.
Let $S'$ and $D'$ denote the new service and data dependencies
at some later time $t'$ (possibly months, or years). At this
time, an investigator may be interested in using $W$ with the
following goals, and corresponding concrete options:
1. Updated workflow on original data. To update the old
outcomes using the current, updated state of services
and databases (possibly, to compare with the original
outcomes): $\mathit{exec}(W_{S',D'}, d, t')$.
2. Updated workflow on new data. To test the workflow in
its current state on a new dataset: $\mathit{exec}(W_{S',D'}, d', t')$.
3. Original workflow on new data. To replicate the original
experiment on a new dataset $d'$: $\mathit{exec}(W_{S,D}, d', t')$.
4. Original workflow on original data. To confirm earlier
claims on the original outcomes. This translates into
$\mathit{exec}(W_{S,D}, d, t')$, i.e., the same input $d$ is used on $W$’s
original configuration.
¹ http://www.wf4ever-project.org
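The four cases can be read as a small lookup over two questions: is the workflow run against its original or its evolved dependencies, and is the input the original or a new dataset? The sketch below encodes that reading, writing S' and D' for the evolved services and data state and d' for a new dataset. The names Execution, CASES and classify are hypothetical, and the "other case" branch is only one possible way of labelling combinations that the list does not cover.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Execution:
    services: str    # "S" (original) or "S'" (evolved)
    data_state: str  # "D" (original) or "D'" (evolved)
    dataset: str     # "d" (original input) or "d'" (new input)

CASES = {
    ("updated", "d"): "1. Updated workflow on original data",
    ("updated", "d'"): "2. Updated workflow on new data",
    ("original", "d'"): "3. Original workflow on new data",
    ("original", "d"): "4. Original workflow on original data",
}

def classify(run: Execution) -> str:
    """Map an execution onto one of the four cases, or flag it as another case."""
    env = (run.services, run.data_state)
    if env == ("S", "D"):
        configuration = "original"
    elif env == ("S'", "D'"):
        configuration = "updated"
    else:
        # e.g. evolved services over the original data state
        return "Other case: mixed dependencies " + " / ".join(env)
    return CASES.get((configuration, run.dataset), "Other case: unknown dataset")

print(classify(Execution("S'", "D'", "d")))
print(classify(Execution("S", "D", "d'")))
print(classify(Execution("S'", "D", "d")))
```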
Different issues arise in each of these four cases. Cases
(1) and (2) highlight workflow decay, primarily due to the
evolution $S \to S'$. This is a difficult problem, which involves
the evaluation of the integrity and authenticity of $S$ and $D$ as
they evolve into $S'$ and $D'$ respectively, and some form of
ongoing curation of $W$ in order to make it compatible with $S'$.
We describe three approaches to curation below (Sec. 3.1),
amongst which the first two have been investigated in the
context of the myExperiment workflow repository [2]. For
information quality, we propose provenance as an important
type of evidence that can support the detection of workflow
decay with respect to external resources $S$ and $D$ (Sec.
2.3). By providing scientists with such provenance-enabled
diagnosis, we aim at feeding curation systems with accurate
information on what is causing workflow decay, how and why.
Cases (3) and (4) are increasingly relevant in e-Science,
as they are paradigmatic of the emerging executable publi-
cations [8] scenario. In an executable paper, some of the
quantitative results (tables, charts) that appear in the pub-
lication are not statically part of the text, but dynamically
linked to the process that produced them. In our case, the
results $\mathit{exec}(W_{S,D}, d, t')$ are published in the paper, but they
are also linked to $W_{S,D}$ as well as the input $d$. The intent of
this emerging form of “active publication” is precisely to let
readers replicate, entirely or in part, the computational por-
tion of an experiment in order to reproduce its results. For
example, Koop et al. [8] proposed a method that automati-
cally captures provenance information of the experiments in
order to assist authors by integrating and updating experi-
ment results into the paper as they write it.
Supporting this scenario is not simple as it requires the
entire set of original resources $S$ and $D$ to be available at
time $t'$, along with the guarantee that a suitable runtime
environment can be provided for the services, as well as any
other software component in $S$. Although approaches based
on Virtual Machines (VM) are common in this case [6], the
high volume of state data, along with third party services
that cannot be replicated locally, and the potentially high
cost of execution for computationally expensive workflows,
may make this approach infeasible. For example, modeling
3D data of galaxies [14] involves the manipulation of large
data cubes, the size of which may reach tens of TB. Partial
reproducibility alleviates the problem in practical cases.
2.2 Partial Reproducibility
Consider the astronomy workflow presented in Figure 1.
For (3) and (4), insisting on executing $W$ in its original
environment is not always feasible and may not be needed. An
executable paper may, for example, provide limited workflow
execution capability to readers, permitting only execution of
lightweight tasks, such as analysis and charting of tabular
data, as opposed to compute-intensive simulations.
This corresponds to splitting the workflow into two
portions (top/bottom), where only the latter is made available
for readers to experiment with, while they will still have
to rely upon the usual peer-review guarantees regarding the
correctness of the top portion of the workflow.
Executing $W$ is unnecessary provided that a complete
and reliable provenance trace has been recorded at time
$t$. By combining provenance traces with partitioned executable
workflow fragments, provenance can be used to “play
back” the original execution and be queried to inspect all
data dependencies that resulted from that execution: (i)
the provenance is recorded from the execution of segments
that are heavily dependent on $S$ and $D$, which are then
omitted, and (ii) a VM approach is used for the remaining
segments, which are executable. Partitioning requires that
the executable segments be found downstream (in terms of
the directed graph that represents the workflow structure)
from the omitted segments. This places a requirement on
workflow design. Minimising the cost associated with the
reproducibility of a workflow, which involves $S$, $D$, and the
actual cost of execution (which may well be a monetary cost, for
example in the case of cloud-based computations), presents
the challenge of finding the optimal partitioning. These are
just two of the challenges arising from taking this pragmatic
approach.
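A minimal sketch of this play-back idea, applied to a workflow shaped like the one in Figure 1, might look as follows. It assumes that the provenance trace is simply a mapping from task name to recorded output, which is a strong simplification, and every name in it (partition_is_valid, replay, query_vo, detect_service, cross_match, the galaxy labels) is a hypothetical placeholder rather than part of any existing system.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def partition_is_valid(depends_on, omitted):
    """True if no omitted task consumes the output of an executable task,
    i.e. the executable segments all lie downstream of the omitted ones."""
    return all(dep in omitted for task in omitted for dep in depends_on[task])

def replay(tasks, depends_on, omitted, trace):
    """Play back omitted tasks from the trace and execute the remaining ones."""
    if not partition_is_valid(depends_on, omitted):
        raise ValueError("executable segments must lie downstream of omitted ones")
    outputs = {}
    for name in TopologicalSorter(depends_on).static_order():
        if name in omitted:
            outputs[name] = trace[name]  # recorded at the original time t
        else:
            upstream = {dep: outputs[dep] for dep in depends_on[name]}
            outputs[name] = tasks[name](upstream)
    return outputs

# Two upstream activities depend on S and D; only the cross-match is executable.
depends_on = {
    "query_vo": [],
    "detect_service": [],
    "cross_match": ["query_vo", "detect_service"],
}
omitted = {"query_vo", "detect_service"}
trace = {  # made-up recorded outputs, for illustration only
    "query_vo": ["galaxy-a", "galaxy-b"],
    "detect_service": ["galaxy-b", "galaxy-c"],
}
tasks = {
    "cross_match": lambda up: sorted(set(up["query_vo"]) & set(up["detect_service"])),
}

replayed = replay(tasks, depends_on, omitted, trace)
print(replayed["cross_match"])
```

The validity check encodes the design requirement stated above: if an omitted task needed the output of an executable one, the recorded trace could not stand in for it consistently.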
2.3 Information quality evaluation
In order to detect workflow decay with respect to the evolution
of the tuple $(S,D)$ of services and data needed for
workflow execution, we focus on two main aspects relevant
for information quality: integrity and authenticity. Integrity
refers to the quality or condition of being whole, complete
and unaltered, while authenticity concerns the lineage of data.
One of the main sources required to evaluate information
quality is provenance information (in our case about $S$ and
$D$), which offers the means to verify the evolution of data and
services, to analyse the processes that led to their current
status, and to decide whether they are still consistent with
a given workflow. We build on provenance to compute the
integrity and authenticity of workflows with respect to $(S,D)$,
thus providing scientists with accurate information about
what is causing the workflow decay due to changes in such
resources, how and why.
We can use and extend existing provenance vocabularies
like the Open Provenance Model² to record and reason
about provenance metadata relevant to the diagnosis
of workflow decay. Additional challenges include providing
scientists with the means to interpret easily the results of
such analysis and to assist them in the early diagnosis of
workflow decay and the selection of the most appropriate
curation techniques.
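One simple way to approximate an integrity check is to store a content fingerprint of each resource alongside the provenance of an execution and to compare it with the resource's current state later. The sketch below uses SHA-256 digests for this purpose. This is a stand-in technique chosen for illustration, not the information quality model proposed here, and it does not use the Open Provenance Model; the function names and sample resources are hypothetical.

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """Content fingerprint stored next to the provenance of an execution."""
    return hashlib.sha256(content).hexdigest()

def diagnose_decay(recorded, current):
    """Compare recorded fingerprints with the current state of each resource.

    recorded: resource name -> fingerprint taken at the original execution
    current:  resource name -> current content, or None if unavailable
    """
    report = {}
    for name, old_fingerprint in recorded.items():
        content = current.get(name)
        if content is None:
            report[name] = "unavailable"
        elif fingerprint(content) != old_fingerprint:
            report[name] = "changed"
        else:
            report[name] = "unchanged"
    return report

# Made-up resources: a service description and a database snapshot.
recorded = {
    "service-interface": fingerprint(b"operation: find_companions(ra, dec)"),
    "database-snapshot": fingerprint(b"release 1"),
    "helper-script": fingerprint(b"print('ok')"),
}
current = {
    "service-interface": b"operation: find_companions(ra, dec, radius)",
    "database-snapshot": None,
    "helper-script": b"print('ok')",
}

report = diagnose_decay(recorded, current)
print(report)
```

A report of this kind says what has changed but not how or why; answering those questions is where richer provenance about the resources themselves is needed.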
3. PRESERVING WORKFLOWS USING RESEARCH OBJECTS
3.1 Preservation in Practice
The myExperiment³ social website for finding, storing and
sharing workflows has been in operation since 2007 and holds
the largest public collection of scientific workflows [2]. As
such it provides a useful case study in workflow decay and
preservation, supporting two main mechanisms.
First, the continual downloading and uploading of work-
flows provides a community curation mechanism for work-
flows that are reused, and these in turn can act as examples
to inform people updating other workflows. Expert cura-
tors, e.g. scientists, are involved in annotating workflows,
by tagging and providing exemplars.
The second mechanism is assistive curation using semi-automated
processes to perform ‘housekeeping’ on the corpus
of workflows. For example, when a service provider announces
that a service is deprecated and will be removed or
replaced on a certain date, the workflows affected by this can
be tagged accordingly and replacement advice propagated to
the appropriate users. Potentially this could progress to autonomic
curation where workflows could be executed and
repaired automatically, for example when services change.
The assistive approach keeps the ‘human in the loop’ and
the Wf4Ever project is pursuing this approach by focusing
on recommendations for curation and repair; for example a
replacement for a service can be confirmed using provenance
logs.
² http://openprovenance.org
³ http://www.myexperiment.org
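The housekeeping step described above can be sketched as a pass over a workflow corpus that tags affected workflows and produces advice for their owners, without modifying the workflows themselves. The data layout, the function tag_affected_workflows and the service names are hypothetical; this is not the myExperiment data model or API.

```python
def tag_affected_workflows(workflows, deprecated, replacement=None):
    """Tag workflows that use a deprecated service and collect advice.

    workflows: name -> {"owner": str, "services": set of str, "tags": set of str}
    Returns (owner, workflow name, advice) tuples. Workflows are only tagged,
    never rewritten, so the decision to repair stays with a person.
    """
    notices = []
    for name, workflow in sorted(workflows.items()):
        if deprecated in workflow["services"]:
            workflow["tags"].add(f"uses-deprecated:{deprecated}")
            if replacement:
                advice = f"consider replacing {deprecated} with {replacement}"
            else:
                advice = f"{deprecated} is deprecated; no replacement known"
            notices.append((workflow["owner"], name, advice))
    return notices

# Made-up corpus for illustration.
workflows = {
    "wf-a": {"owner": "alice", "services": {"search-v1", "align"}, "tags": set()},
    "wf-b": {"owner": "bob", "services": {"align"}, "tags": set()},
}

notices = tag_affected_workflows(workflows, "search-v1", replacement="search-v2")
print(notices)
```

Returning notices rather than patched workflows mirrors the assistive approach; an autonomic variant would instead attempt the substitution and re-execute the workflow to check the result.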
3.2 Research Objects
Workflow specifications are insufficient for guaranteeing
the preservation of scientific workflows. The reproducibility
strategies listed in Sec. 2.1 show that, in addition to the
workflow specification, we need information about the components
that implement workflow steps and about the data used and
produced as a result of workflow enactment.
In practice myExperiment users sometimes choose to aggregate
workflows with associated data (in ‘packs’) and this
provides a powerful means to track $S$ and $D$. Building on
packs, to cater for workflow preservation we use the notion of
a Research Object, which can be viewed as an aggregation of
resources that bundles the workflow specification and additional
auxiliary resources. These may include input and output
data, which enable workflows to be validated.
The elements that compose a Research Object may differ
from one to another, and this difference may have consequences
on the level of reproducibility that can be guaranteed.
At one end of the spectrum, the Research Object is
represented by a paper. As we progress to the other end
the Research Object is enriched to include elements such as
the workflow implementing the computation, annotations
describing the experiment implemented and the hypothe-
sis investigated, and provenance traces of past executions
of the workflow. Assessing the reproducibility of compu-
tations described using electronic papers can be tedious: a
paper may just sketch the method implemented by the com-
putation in question, without delving into details that are
necessary to check that the results obtained, or claimed, in
the paper can be reproduced. Verifying the reproducibility
of Research Objects at the other end of the spectrum is less
difficult. The provenance trace provides data examples to
re-enact the workflow and a means to verify that the results
of workflow executions are comparable with prior results.
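To make the richer end of this spectrum concrete, the sketch below bundles a workflow with annotations and provenance traces, and uses the traces as data examples to re-enact the workflow and compare the outcome with prior results. The classes ProvenanceTrace and ResearchObject and their fields are hypothetical simplifications for illustration, not the Research Object model itself, and exact equality is only one possible notion of "comparable".

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ProvenanceTrace:
    inputs: dict   # data consumed by a past execution
    outputs: dict  # results that execution produced

@dataclass
class ResearchObject:
    workflow: Callable[[dict], dict]                 # executable description
    annotations: dict = field(default_factory=dict)  # e.g. hypothesis, experiment
    traces: list = field(default_factory=list)       # provenance of past runs

    def verify(self):
        """Re-enact the workflow on each recorded input and report whether
        the new outputs equal the recorded ones."""
        return [self.workflow(trace.inputs) == trace.outputs for trace in self.traces]

def mean_workflow(inputs):
    values = inputs["values"]
    return {"mean": sum(values) / len(values)}

research_object = ResearchObject(
    workflow=mean_workflow,
    annotations={"hypothesis": "placeholder text"},
    traces=[
        ProvenanceTrace({"values": [2, 4, 6]}, {"mean": 4.0}),
        ProvenanceTrace({"values": [1, 2]}, {"mean": 9.9}),  # deliberately stale
    ],
)

checks = research_object.verify()
print(checks)
```

A Research Object holding only a paper offers nothing for verify to work with, which is the practical difference between the two ends of the spectrum.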
To ensure the preservation of a workflow and the repro-
ducibility of its results, the Research Object needs to be
managed and curated throughout the lifecycle of the asso-
ciated workflow. The provenance of the Research Object
elements (i.e., workflow, data sets and web services) is key
to understanding, comparing and debugging scientific work-
flows and to verifying the validity of a claim made within the
context of a Research Object by revealing the data inputs
used to yield a given workflow result. We need to support
the logging, browsing and querying of the provenance linking
components of Research Objects and the traces of workflow
executions.
4. CONCLUSIONS
As research practice evolves we anticipate a growing quan-
tity and diversity of executable objects, in particular compu-
tational scientific workflows. We outlined the challenges un-
derlying the preservation of scientific workflows and sketched
preliminary solutions that can be adopted for that purpose.
We used the concept of Research Object as an abstraction
for the management of executable objects throughout their
life-cycle. We anticipate that this work will give rise to rec-
ommendations and best practices for authors and curators
of scientific workflows to meet preservation requirements.
We are investigating the reproducibility and curation strate-
gies reported in this paper and developing a software archi-
tecture and reference implementation for workflow preserva-
tion. The development of the reference implementation will
rest on existing developments in scientific workflow reposi-
tories, digital libraries and preservation systems. In particu-
lar, we will build on well-established digital libraries, such as
dLibra⁴, to extend the myExperiment workflow repository
with further preservation capabilities.
5. ACKNOWLEDGMENTS
Wf4Ever is funded by the Seventh Framework Programme
of the European Commission (Digital Libraries and Digital
Preservation area ICT-2009.4.1 project reference 270192).
myExperiment is funded by UK JISC. The dLibra Digital
Library Framework has been produced by the Poznań Supercomputing
and Networking Center since 1999. We are
grateful to all our collaborators in these projects.
6. REFERENCES
[1] S. Bechhofer, J. Ainsworth, et al. Why linked data is not
enough for scientists. In IEEE Sixth International
Conference on e-Science, pages 300–307, 2010.
[2] D. De Roure, C. Goble, and R. Stevens. The design and
realisation of the myExperiment virtual research
environment for social sharing of workflows. Future
Generation Computer Systems, 25(5):561–567, 2009.
[3] D. L. Donoho, A. Maleki, I. Rahman, et al. Reproducible
research in computational harmonic analysis. Computing in
Science and Engineering, 11:8–18, January 2009.
[4] Y. Gil, E. Deelman, et al. Examining the challenges of
scientific workflows. IEEE Computer, 40:24–32, Dec. 2007.
[5] C. Goble and D. De Roure. Curating scientific web services
and workflows. Educause Review, 43(5), 2008.
[6] P. J. Guo and D. Engler. CDE: Using System Call
Interposition to Automatically Create Portable Software
Packages. In Proc. USENIX Annual Tech. Conf., 2011.
[7] D. Hull, K. Wolstencroft, R. Stevens, et al. Taverna: a tool
for building and running workflows of services. Nucleic
Acids Research, 34(suppl 2):W729–W732, 1 July 2006.
[8] D. Koop et al. A provenance-based infrastructure to
support the life cycle of executable papers. Procedia
Computer Science, 4:648–657, 2011. Proceedings of the
International Conference on Computational Science.
[9] B. Ludäscher et al. Scientific process automation and
workflow management. In Scientific Data Management,
Computational Science Series. Chapman & Hall, 2009.
[10] B. Matthews et al. A framework for software preservation.
International Journal of Digital Curation, 5(1), 2010.
[11] J. P. Mesirov. Accessible reproducible research. Science,
327(5964):415–416, 2010.
[12] P. Missier. Modelling and computing the quality of
information in e-science. PhD thesis, University of
Manchester, 2008.
[13] M. Roos. Genomics Workflow Preservation Requirements.
Technical report, Deliverable D6.1, Wf4Ever project, 2011.
[14] L. Verdes-Montenegro. Astronomy Workflow Preservation
Requirements. Technical report, Deliverable 5.1, Wf4Ever
project, 2011.
⁴ http://dlibra.psnc.pl
... However a generalised set of rules and recommendations to achieve this is still a challenge to be met as workflow implementation, storage, sharing and reuse significantly varies depending on the choice of approach and platform used by the researcher. A common phenomenon to every approach however is 'workflow decay'[59]caused by the factors such as the evolution of technical environment used to implement a workflow, updates in the state of external factors such as databases and unavailability of third party web resources. Our study contributes to understanding the requirements of reproducibility of genomic workflows by investigating a set of assumptions evident from practical implementation of the case study and providing standardised recommendations for computational genomic workflow studies. ...
... Genomic data analysis has grown complex with the increased involvement of customized scripts and online resources needed to carry out difficult tasks, increasing both the technical knowledge required and the chance that something will break. One of the major reasons for non-reproducibility of workflows is use of volatile third party resources such as databases, tools or websites[59]. Many workflows cannot be run because the third party resources they rely on are no longer available and the results could only be reproduced using the specific version of the software, hence rendering workflows unusable. ...
Article
Full-text available
Background Computational bioinformatics workflows are extensively used to analyse genomics data, with different approaches available to support implementation and execution of these workflows. Reproducibility is one of the core principles for any scientific workflow and remains a challenge, which is not fully addressed. This is due to incomplete understanding of reproducibility requirements and assumptions of workflow definition approaches. Provenance information should be tracked and used to capture all these requirements supporting reusability of existing workflows. Results We have implemented a complex but widely deployed bioinformatics workflow using three representative approaches to workflow definition and execution. Through implementation, we identified assumptions implicit in these approaches that ultimately produce insufficient documentation of workflow requirements resulting in failed execution of the workflow. This study proposes a set of recommendations that aims to mitigate these assumptions and guides the scientific community to accomplish reproducible science, hence addressing reproducibility crisis. Conclusions Reproducing, adapting or even repeating a bioinformatics workflow in any environment requires substantial technical knowledge of the workflow execution environment, resolving analysis assumptions and rigorous compliance with reproducibility requirements. Towards these goals, we propose conclusive recommendations that along with an explicit declaration of workflow specification would result in enhanced reproducibility of computational genomic analyses. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1747-0) contains supplementary material, which is available to authorized users.
... Néanmoins, ces approches -fortement techniques -font apparaître le problème important de la dégradation des workflows dans le temps, et s'y confrontent (D. DE ROURE et al., 2011). Ces workflows reposent sur des ressources ad-hoc ou externes pour mener à bien ces analyses qui, avec le temps, peuvent être modifiées, altérées, supprimées ou simplement inaccessibles, empêchant ainsi leur bonne exécution. Cette dégradation s'explique majoritairement par le fait que les workflows manquent d'informations, à la fois te ...
Thesis
Cette thèse en informatique porte sur la problématique de la capitalisation des processus d’analyse de traces d’apprentissage au sein de la communauté des Learning Analytics (LA). Il s’agit de permettre de partager, adapter et réutiliser ces processus d’analyse de traces.Actuellement, cette capitalisation est limitée par deux facteurs importants : les processus d’analyse sont dépendants des outils d’analyse qui les mettent en œuvre - leur contexte technique - et du contexte pédagogique pour lequel ils sont menés. Cela empêche de les partager, mais aussi de les réexploiter simplement en dehors de leurs contextes initiaux, quand bien même les nouveaux contextes seraient similaires.L’objectif de cette thèse est de fournir des modélisations et des méthodes permettant la capitalisation des processus d’analyse de traces d’apprentissage, ainsi que d’assister les différents acteurs de l’analyse, notamment durant la phase de réutilisation. Pour cela, nous répondons aux trois verrous scientifiques suivant : comment partager et combiner des processus d’analyse mis en œuvre dans différents outils d’analyse ? ; comment permettre de réexploiter un processus d’analyse existant pour répondre à un autre besoin d’analyse ? ; comment assister les différents acteurs lors de l’élaboration et de l’exploitation de processus d’analyse ?Notre première contribution, issue d’une synthèse de l’état de l’art, est la formalisation d’un cycle d'élaboration et d'exploitation des processus d'analyse, afin d'en définir les différentes étapes, les différents acteurs et leurs différents rôles. Cette formalisation est accompagnée d’une définition de la capitalisation et de ses propriétés.Notre deuxième contribution répond au premier verrou lié à la dépendance technique des processus d’analyse actuels, et à leur partage. Nous proposons un méta-modèle qui permet de décrire les processus d’analyse indépendamment des outils d’analyse. 
Ce méta-modèle formalise la description des opérations utilisées dans les processus d'analyse, des processus eux-mêmes et des traces utilisées, afin de s’affranchir des contraintes techniques occasionnées par ces outils. Ce formalisme commun aux processus d’analyse permet aussi d’envisager leur partage. Il a été mis en œuvre et évalué dans un de nos prototypes.Notre troisième contribution traite le deuxième verrou sur la réexploitation des processus d’analyse. Nous proposons un framework ontologique pour les processus d'analyse, qui permet d'introduire de manière structurée des éléments sémantiques dans la description des processus d'analyse. Cette approche narrative enrichit ainsi le formalisme précédent et permet de satisfaire les propriétés de compréhension, d’adaptation et de réutilisation nécessaires à la capitalisation. Cette approche ontologique a été mise en œuvre et évaluée dans un autre de nos prototypes.Enfin, notre dernière contribution répond au dernier verrou identifié et concerne de nouvelles pistes d’assistances aux acteurs, notamment une nouvelle méthode de recherche des processus d’analyse, s’appuyant sur nos propositions précédentes. Nous exploitons le cadre ontologique de l’approche narrative pour définir des règles d’inférence et des heuristiques permettant de raisonner sur les processus d’analyse dans leur ensemble (e.g. étapes, configurations) lors de la recherche. Nous utilisons également le réseau sémantique sous-jacent à cette modélisation ontologique pour renforcer l’assistance aux acteurs en leur fournissant des outils d’inspection et de compréhension lors de la recherche. Cette assistance a été mise en œuvre dans un de nos prototypes, et évaluée empiriquement.
... Their algorithm queries provenance traces generated by workflow executions in the e-Science Central platform. PDIFF assists reproducibility analysis by identifying possible causes of divergent results such as data and workflow evolution, service version upgrades (problems related to the workflow decay [Roure et al. 2011]), and also non-deterministic behavior in some of the services. It is also able to compare three different types of files: text or CSV, XML, and mathematical models. ...
Article
Until recently, manually capturing and storing provenance from scientific experiments was a constant concern for scientists. With the advent of computational experiments (modeled as scientific workflows) and Scientific Workflow Management Systems, produced and consumed data, as well as the provenance of a given experiment, are automatically managed, so provenance capture and storage in such a context is no longer a major concern. As with several existing big data problems, the focus is now on how to analyze the large amounts of provenance data generated by workflow executions and how to extract useful knowledge from this data. In this context, this article surveys the current state of the art on provenance analytics by presenting the key initiatives that have been taken to support provenance data analysis. We also contribute by proposing a taxonomy to classify elements related to provenance analytics.
... The Research Object Model [2,12] is a comprehensive standard defining the concept of a research object as a bundle of artifacts, specifying a complete digital record of a piece of research. Implementations of the standard have primarily focused on structured workflow objects [13][14][15], and only recently have been extended for general applications (i.e., applications executed without a formal workflow system). ...
Article
Science is conducted collaboratively, often requiring the sharing of knowledge about computational experiments. When experiments include only datasets, they can be shared using Uniform Resource Identifiers (URIs) or Digital Object Identifiers (DOIs). An experiment, however, seldom includes only datasets, but more often includes software, its past execution, provenance, and associated documentation. The Research Object has recently emerged as a comprehensive and systematic method for aggregation and identification of diverse elements of computational experiments. While a necessary method, mere aggregation is not sufficient for the sharing of computational experiments. Other users must be able to easily recompute on these shared research objects. Computational provenance is often the key to enable such reuse. In this paper, we show how reusable research objects can utilize provenance to correctly repeat a previous reference execution, to construct a subset of a research object for partial reuse, and to reuse existing contents of a research object for modified reuse. We describe two methods to summarize provenance that aid in understanding the contents and past executions of a research object. The first method obtains a process-view by collapsing low-level system information, and the second method obtains a summary graph by grouping related nodes and edges with the goal to obtain a graph view similar to application workflow. Through detailed experiments, we show the efficacy and efficiency of our algorithms.
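The second summarization method described above, grouping related nodes and edges into a summary graph, can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `summarize`, the `group_of` mapping, and the example node names are all hypothetical, and the grouping is assumed to be supplied by hand rather than derived automatically.

```python
from collections import defaultdict


def summarize(edges, group_of):
    """Collapse a provenance graph into a summary graph.

    edges: iterable of (source, target) node pairs.
    group_of: dict mapping a node to a group label; nodes without an
              entry are kept as their own group.
    Returns a dict mapping (source_group, target_group) to the number of
    original edges merged into that summary edge. Edges whose endpoints
    fall in the same group are dropped.
    """
    summary = defaultdict(int)
    for src, dst in edges:
        g_src = group_of.get(src, src)
        g_dst = group_of.get(dst, dst)
        if g_src != g_dst:
            summary[(g_src, g_dst)] += 1
    return dict(summary)


# Hypothetical low-level trace: two processes and a temporary file that
# together make up one application-level "align" step.
edges = [
    ("reads.fq", "proc:101"),
    ("proc:101", "tmp.sam"),
    ("tmp.sam", "proc:102"),
    ("proc:102", "aligned.bam"),
]
group_of = {"proc:101": "align", "tmp.sam": "align", "proc:102": "align"}
summary = summarize(edges, group_of)
```

In this sketch the four low-level edges reduce to two summary edges, one into the `align` group and one out of it, which is closer to how a user would describe the workflow step.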
... These richly annotated objects are what we call workflow-centric research objects. The notion of Research Object has been introduced in previous work [20,19,1]; here we focus on Research Objects that encapsulate scientific workflows (hence workflow-centric). In particular, we build on earlier work on myExperiment packs, which are bundles that contain elements such as workflows, documents and presentations [15]. ...
Article
A workflow-centric research object bundles a workflow, the provenance of the results obtained by its enactment, other digital objects that are relevant for the experiment (papers, datasets, etc.), and annotations that semantically describe all these objects. In this paper, we propose a model to specify workflow-centric research objects, and show how the model can be grounded using semantic technologies and existing vocabularies, in particular the Object Reuse and Exchange (ORE) model and the Annotation Ontology (AO). We describe the life-cycle of a research object, which resembles the life-cycle of a scientific experiment.
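The bundle described above, an aggregation of resources plus annotations that describe them, can be sketched as a plain data structure. This is only an illustration under stated assumptions: it does not use the ORE or AO vocabularies themselves, and the class names, method names, and `example.org` URIs are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    target: str  # URI of the annotated resource
    body: str    # free-text or structured description of the target


@dataclass
class ResearchObject:
    uri: str
    aggregates: list = field(default_factory=list)   # URIs of bundled resources
    annotations: list = field(default_factory=list)  # Annotation instances

    def aggregate(self, resource_uri):
        """Add a resource (workflow, provenance trace, paper, ...) once."""
        if resource_uri not in self.aggregates:
            self.aggregates.append(resource_uri)

    def annotate(self, target, body):
        """Attach an annotation to the bundle itself or to a bundled resource."""
        if target != self.uri and target not in self.aggregates:
            raise ValueError(f"{target} is not part of this research object")
        self.annotations.append(Annotation(target, body))

    def annotations_for(self, target):
        """Return the bodies of all annotations on a given resource."""
        return [a.body for a in self.annotations if a.target == target]


# Hypothetical example bundle.
ro = ResearchObject("http://example.org/ro/1")
ro.aggregate("http://example.org/ro/1/workflow.t2flow")
ro.aggregate("http://example.org/ro/1/provenance.ttl")
ro.annotate("http://example.org/ro/1/workflow.t2flow", "Main analysis workflow")
```

Rejecting annotations on resources outside the bundle is a design choice of this sketch; it keeps every description tied to something the research object actually aggregates.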
... Resources required for executing workflows, like services and data, can be either local and hosted along with the workflow or remote, such as public repositories or web services hosted by third parties. Over time these workflows, particularly those from the life sciences domain, are notably subject to a decayed or reduced ability to be executed or produce the same results [32]. This is what we call workflow decay. ...
Article
Analysis of “omics” data is often a long and segmented process, encompassing multiple stages from initial data collection to processing, quality control and visualization. The cross-modal nature of recent genomic analyses renders this process challenging to both automate and standardize; consequently, users often resort to manual interventions that compromise data reliability and reproducibility. This in turn can produce multiple versions of datasets across storage systems. As a result, scientists can lose significant time and resources trying to execute and monitor their analytical workflows and encounter difficulties sharing versioned data. In 2015, the Ludmer Centre for Neuroinformatics and Mental Health at McGill University brought together expertise from the Douglas Mental Health University Institute, the Lady Davis Institute and the Montreal Neurological Institute (MNI) to form a genetics/epigenetics working group. The objectives of this working group are to: (i) design an automated and seamless process for (epi)genetic data that consolidates heterogeneous datasets into the LORIS open-source data platform; (ii) streamline data analysis; (iii) integrate results with provenance information; and (iv) facilitate structured and versioned sharing of pipelines for optimized reproducibility using high-performance computing (HPC) environments via the CBRAIN processing portal. 
This article outlines the resulting generalizable “omics” framework and its benefits, specifically, the ability to: (i) integrate multiple types of biological and multi-modal datasets (imaging, clinical, demographics and behavioral); (ii) automate the process of launching analysis pipelines on HPC platforms; (iii) remove the bioinformatic barriers that are inherent to this process; (iv) ensure standardization and transparent sharing of processing pipelines to improve computational consistency; (v) store results in a queryable web interface; (vi) offer visualization tools to better view the data; and (vii) provide the mechanisms to ensure usability and reproducibility. This framework for workflows facilitates brain research discovery by reducing human error through automation of analysis pipelines and seamless linking of multimodal data, allowing investigators to focus on research instead of data handling.
Article
Accurate and comprehensive storage of provenance information is a basic requirement for modern scientific computing. A significant effort in recent years has developed robust theories and standards for the representation of these traces across a variety of execution platforms. Whilst these are necessary to enable repeatability, they do not exploit the captured information to its full potential. This data is increasingly being captured from applications hosted on Cloud Computing platforms, which offer large scale computing resources without significant up front costs. Medical applications, which generate large datasets, are also suited to cloud computing as the practicalities of storing and processing such data locally are becoming increasingly challenging. This paper shows how provenance can be captured from medical applications, stored using a graph database and then used to answer audit questions and enable repeatability. This static provenance will then be combined with performance data to predict future workloads, inform decision makers and reduce latency. Finally, cost models which are based on real world cloud computing costs will be used to determine optimum strategies for data retention over potentially extended periods of time.
Article
As publishers establish a greater online presence as well as infrastructure to support the distribution of more varied information, the idea of an executable paper that enables greater interaction has developed. An executable paper provides more information for computational experiments and results than the text, tables, and figures of standard papers. Executable papers can bundle computational content that allows readers and reviewers to interact with, validate, and explore experiments. By including such content, authors facilitate future discoveries by lowering the barrier to reproducing and extending results. We present an infrastructure for creating, disseminating, and maintaining executable papers. Our approach is rooted in provenance, the documentation of exactly how data, experiments, and results were generated. We seek to improve the experience for everyone involved in the life cycle of an executable paper. The automated capture of provenance information allows authors to easily integrate and update results into papers as they write, and also helps reviewers better evaluate approaches by enabling them to explore experimental results by varying parameters or data. With a provenance-based system, readers are able to examine exactly how a result was developed to better understand and extend published findings.
Article
Scientific computation is emerging as absolutely central to the scientific method. Unfortunately, it's error-prone and currently immature: traditional scientific publication is incapable of finding and rooting out errors in scientific computation, which must be recognized as a crisis. An important recent development and a necessary response to the crisis is reproducible computational research in which researchers publish the article along with the full computational environment that produces the results. In this article, the authors review their approach and how it has evolved over time, discussing the arguments for and against working reproducibly.
Conference Paper
It can be painfully hard to take software that runs on one person's machine and get it to run on another machine. Online forums and mailing lists are filled with discussions of users' troubles with compiling, installing, and configuring software and its myriad dependencies. To eliminate this dependency problem, we created a system called CDE that uses system call interposition to monitor the execution of x86-Linux programs and package up the Code, Data, and Environment required to run them on other x86-Linux machines. Creating a CDE package is completely automatic, and running programs within a package requires no installation, configuration, or root permissions. Hundreds of people in both academia and industry have used CDE to distribute software, demo prototypes, make their scientific experiments reproducible, run software natively on older Linux distributions, and deploy experiments to compute clusters.
Article
In this paper we suggest that the full scientific potential of workflows will be achieved through mechanisms for sharing and collaboration, empowering scientists to spread their experimental protocols and to benefit from those of others. To facilitate this process we have designed and built the myExperiment Virtual Research Environment for collaboration and sharing of workflows and experiments. In contrast to systems which simply make workflows available, myExperiment provides mechanisms to support the sharing of workflows within and across multiple communities. It achieves this by adopting a social web approach which is tailored to the particular needs of the scientist. We present the motivation, design and realisation of myExperiment.
Article
Scientific data stands to represent a significant portion of the linked open data cloud and science itself stands to benefit from the data fusion capability that this will afford. However, simply publishing linked data into the cloud does not necessarily meet the requirements of reuse. Publishing has requirements of provenance, quality, credit, attribution, and methods in order to provide the reproducibility that allows validation of results. In this paper we make the case for a scientific data publication model on top of linked data and introduce the notion of Research Objects as first class citizens for sharing and publishing.
Article
Scientific publications have at least two goals: (i) to announce a result and (ii) to convince readers that the result is correct. Mathematics papers are expected to contain a proof complete enough to allow knowledgeable readers to fill in any details. Papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension.