Conference Paper

A Framework for Scientific Workflow Reproducibility in the Cloud


Abstract

Workflow is a well-established means by which to capture scientific methods in an abstract graph of interrelated processing tasks. The reproducibility of scientific workflows is therefore fundamental to reproducible e-Science. However, the ability to record all the required details so as to make a workflow fully reproducible is a long-standing problem that is very difficult to solve. In this paper, we introduce an approach that integrates system description, source control, container management and automatic deployment techniques to facilitate workflow reproducibility. We have developed a framework that leverages this integration to support workflow execution, re-execution and reproducibility in the cloud and in a personal computing environment. We demonstrate the effectiveness of our approach by examining various aspects of repeatability and reproducibility on real scientific workflows. The framework allows workflow and task images to be captured automatically, which improves not only repeatability but also runtime performance. It also gives workflows portability across different cloud environments. Finally, the framework can also track changes in the development of tasks and workflows to protect them from unintentional failures.
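The abstract above describes the automatic capture of task and workflow images. As a minimal, hypothetical sketch of one way such a capability could work (not the paper's actual implementation; the registry name and helper function are invented for illustration), a captured image can be identified deterministically from the task repository and commit it was built from, so that a later re-execution resolves exactly the same image:

# Minimal sketch (not the paper's implementation): derive a deterministic
# image tag for a workflow task from its repository URL and commit id, so a
# previously captured container image can be reused on re-execution.
import hashlib

def task_image_tag(repo_url: str, commit: str, registry: str = "example.registry.io") -> str:
    """Return a content-addressed image reference for one task version."""
    digest = hashlib.sha256(f"{repo_url}@{commit}".encode()).hexdigest()[:12]
    return f"{registry}/workflow-tasks:{digest}"

if __name__ == "__main__":
    # Hypothetical task repository pinned to an exact commit.
    print(task_image_tag("https://github.com/example/fastqc-task.git",
                         "3f2a9c1d7b8e4a5f6c0d1e2f3a4b5c6d7e8f9a0b"))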


... Our framework [14] consists of six main components split into two layers (Fig. 1). The upper, repository layer includes the Task, Workflow, Core and Image Repositories, while the lower, deployment layer includes the Enactment Engine, the Automatic Image Creation facility (AIC) and the newly added Optimizer. ...
... The Task and Workflow Repositories are used to store and version the user-defined workflow and task descriptors that are specified using TOSCA. Importantly, the Task, Workflow and Core Repositories are backed by a version control platform such as GitHub. This provides the ability to track the complete history of workflow and task changes, and allows the framework to control changes that could potentially affect the workflow's reproducibility; for more details please see [14]. ...
... Building a workflow using TOSCA starts by defining ... Our approach maintains each workflow and workflow task in a separate code repository [14]. This brings multiple benefits: the repositories mark clear boundaries between components, offer independent version control and allow for easy referencing and sharing. ...
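The excerpts above describe keeping each workflow and each task in its own version-controlled repository so that a workflow can reference an exact, immutable version. A minimal sketch of that idea, assuming a GitHub-hosted task repository and a hypothetical descriptor file name (illustrative only, not the framework's API):

# Minimal sketch, not the framework's actual API: pin a TOSCA task descriptor
# to an exact commit in its own GitHub repository so that later re-executions
# resolve precisely the same version.
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskRef:
    owner: str
    repo: str
    commit: str                      # full SHA keeps the reference immutable
    descriptor: str = "task.yaml"    # hypothetical descriptor file name

    def raw_url(self) -> str:
        # Standard raw-content URL pattern for a file at a specific commit.
        return (f"https://raw.githubusercontent.com/"
                f"{self.owner}/{self.repo}/{self.commit}/{self.descriptor}")

if __name__ == "__main__":
    ref = TaskRef("example-org", "align-reads-task",
                  "9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b")
    print(ref.raw_url())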
Article
Full-text available
Scientific workflows play a vital role in modern science as they enable scientists to specify, share and reuse computational experiments. To maximise the benefits, workflows need to support the reproducibility of the experimental methods they capture. Reproducibility enables effective sharing as scientists can re-execute experiments developed by others and quickly derive new or improved results. However, achieving reproducibility in practice is problematic – previous analyses highlight issues due to uncontrolled changes in the input data, configuration parameters, workflow description and the software used to implement the workflow tasks. The resulting problems have become known as workflow decay. In this paper we present a novel framework that addresses workflow decay through the integration of system description, version control, container management and automated deployment techniques. It then introduces a set of performance optimisation techniques that significantly reduce the runtime overheads caused by making workflows reproducible. The resulting system significantly improves the performance, repeatability and also the ability to share and re-use workflows by combining a method to uniquely identify task and workflow images with an automated image capture facility and a multi-level cache. The system is evaluated through an extensive set of experiments that validate the approach and highlight the key benefits of the proposed optimisations. This includes methods for reducing the runtime of workflows by up to an order of magnitude in cases where they are enacted concurrently on the same host VM and in different Clouds, and where they share tasks.
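To illustrate the combination of uniquely identified task images with a multi-level cache described above, the following is a simplified sketch; the two cache levels and the rebuild fallback are assumptions for illustration, not the paper's exact design:

# Minimal sketch of a multi-level image cache keyed by a task's unique image
# id; the lookup order (host cache, shared registry, rebuild) mirrors the idea
# of reusing captured images to cut runtime.
class ImageCache:
    def __init__(self):
        self.host_cache: dict[str, str] = {}      # image id -> local tag
        self.registry: dict[str, str] = {}        # image id -> remote tag

    def resolve(self, image_id: str, build) -> str:
        if image_id in self.host_cache:           # level 1: same host VM
            return self.host_cache[image_id]
        if image_id in self.registry:             # level 2: shared registry
            local = self.registry[image_id]       # "pull" into the host cache
            self.host_cache[image_id] = local
            return local
        image = build(image_id)                   # miss: build and publish
        self.host_cache[image_id] = image
        self.registry[image_id] = image
        return image

if __name__ == "__main__":
    cache = ImageCache()
    build = lambda image_id: f"built:{image_id}"
    print(cache.resolve("task-abc123", build))    # cold: builds the image
    print(cache.resolve("task-abc123", build))    # warm: host cache hit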
... Although workflows are an efficient tool to organize and coordinate various tasks, it is hard to port a workflow and its execution environment to others because each task in the workflow has dependencies. To address the issue, various workflow systems and related schemes have been proposed for the cloud computing environment [24] and the grid computing environment [25] to enhance portability, i.e., to make it easy to re-implement the environment by capturing dependencies and virtualizing the underlying system (in cloud computing) [24] or by sharing hardware configured for the required environment with other organizations (in grid computing) [25]. However, there is room for improvement in those approaches. ...
... However, there is room for improvement in those approaches. For example, the proposed approach [24] may increase system complexity because of its component-reusing mechanism. Even though those proposals achieve a certain level of portability, their reusability can still be improved. ...
Article
Full-text available
As Artificial Intelligence (AI) is becoming ubiquitous in many applications, serverless computing is also emerging as a building block for developing cloud-based AI services. Serverless computing has received much interest because of its simplicity, scalability, and resource efficiency. However, due to the trade-off with resource efficiency, serverless computing suffers from the cold start problem, that is, a latency between a request arrival and function execution that is incurred by resource provisioning. In serverless computing, functions can be composed into workflows to process a complex task, and the cold start problem has a significant influence on workflow response time because a cold start can occur in each function. Function fusion can be one of the solutions to mitigate the cold start latency of a workflow. If two functions are fused into a single function, the cold start of the second function is removed; however, if parallel functions are fused, the workflow response time can increase because the parallel functions run sequentially even though the cold start latency is reduced. This study presents an approach to mitigate the cold start latency of a workflow using function fusion while considering parallel runs. First, we identify three latencies that affect response time, present a workflow response time model considering these latencies, and efficiently find a fusion solution that can optimize the response time under cold starts. Our method shows a response time of 28–86% of the response time of the original workflow in five workflows.
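A toy model of the fusion trade-off described above may help; it is illustrative only and not the authors' three-latency response-time model. Fusing a sequential pair removes the downstream cold start, while fusing a parallel pair serialises work that would otherwise overlap:

# Toy illustration of the trade-off (not the authors' actual model): fusing a
# sequential pair of functions removes one cold start, whereas fusing a
# parallel pair forces the branches to run back to back.
def chain_unfused(run_a, run_b, cold):
    return (cold + run_a) + (cold + run_b)    # second function pays its own cold start

def chain_fused(run_a, run_b, cold):
    return cold + run_a + run_b               # single container, single cold start

def parallel_unfused(run_a, run_b, cold):
    return max(cold + run_a, cold + run_b)    # branches start (and warm up) together

def parallel_fused(run_a, run_b, cold):
    return cold + run_a + run_b               # fusion serialises the branches

if __name__ == "__main__":
    run_a, run_b, cold = 0.4, 0.6, 1.5
    print(chain_unfused(run_a, run_b, cold), chain_fused(run_a, run_b, cold))        # 4.0 vs 2.5
    print(parallel_unfused(run_a, run_b, cold), parallel_fused(run_a, run_b, cold))  # 2.1 vs 2.5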
... The former method is used in RO-Manager [16], a tool that uses the RO-Bundle specification [6]. A more recent approach relies on user action to create the topology, relationship, and node specifications based on a standard [17] that are eventually translated to a container [18]. In this paper, we focus on automatically creating research objects using AV. ...
Article
Full-text available
Science is conducted collaboratively, often requiring the sharing of knowledge about computational experiments. When experiments include only datasets, they can be shared using Uniform Resource Identifiers (URIs) or Digital Object Identifiers (DOIs). An experiment, however, seldom includes only datasets, but more often includes software, its past execution, provenance, and associated documentation. The Research Object has recently emerged as a comprehensive and systematic method for aggregation and identification of diverse elements of computational experiments. While a necessary method, mere aggregation is not sufficient for the sharing of computational experiments. Other users must be able to easily recompute on these shared research objects. Computational provenance is often the key to enable such reuse. In this paper, we show how reusable research objects can utilize provenance to correctly repeat a previous reference execution, to construct a subset of a research object for partial reuse, and to reuse existing contents of a research object for modified reuse. We describe two methods to summarize provenance that aid in understanding the contents and past executions of a research object. The first method obtains a process-view by collapsing low-level system information, and the second method obtains a summary graph by grouping related nodes and edges with the goal to obtain a graph view similar to application workflow. Through detailed experiments, we show the efficacy and efficiency of our algorithms.
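As a rough illustration of the "process view" summarisation described above (not the paper's algorithm), low-level file-access records can be collapsed so that only process-to-process dataflow edges remain; the trace format and names below are invented:

# Minimal sketch of a process-view summary: collapse file-access records in a
# provenance trace so that only process-to-process edges remain.
from collections import defaultdict

def process_view(events):
    """events: (process, action, path) triples recorded during execution."""
    reads, writes = defaultdict(set), defaultdict(set)
    for proc, action, path in events:
        (reads if action == "read" else writes)[proc].add(path)
    # Derive a process-level dataflow edge p -> q when q reads what p wrote.
    edges = {(p, q) for p in writes for q in reads if p != q and writes[p] & reads[q]}
    return sorted(edges)

if __name__ == "__main__":
    trace = [("align", "read", "sample.fastq"), ("align", "write", "sample.bam"),
             ("sort", "read", "sample.bam"), ("sort", "write", "sorted.bam")]
    print(process_view(trace))   # [('align', 'sort')]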
... Further, besides reproducing the exact process, we could easily explore how the experiment behaves with different input datasets, execution arguments and environments. Cloud-based reproducibility [2] has been a major approach for reproducible computing services because the full stack of the computation environment, including data, software and hardware, can all be provisioned and shared via various cloud services. Cloud computing is a promising way to support reproducibility. ...
Preprint
Full-text available
Cloud computing has become a major approach to enable reproducible computational experiments because of its support of on-demand hardware and software resource provisioning. Yet there are still two main difficulties in reproducing big data applications in the cloud. The first is how to automate end-to-end execution of big data analytics in the cloud, including virtual distributed environment provisioning, network and security group setup, and big data analytics pipeline description and execution. The second is that an application developed for one cloud, such as AWS or Azure, is difficult to reproduce in another cloud, i.e., the vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automatic scalable big data application execution and reproducibility, and utilize the adapter design pattern to enable application portability and reproducibility across different clouds. Based on the approach, we propose and develop an open-source toolkit that supports 1) on-demand distributed hardware and software environment provisioning, 2) automatic data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproducibility of existing executions in the same environment or a different environment. We did extensive experiments on both AWS and Azure using three big data analytics applications that run on a virtual CPU/GPU cluster. Three main behaviors of our toolkit were benchmarked: i) execution overhead ratio for reproducibility support, ii) differences of reproducing the same application on AWS and Azure in terms of execution time, budgetary cost and cost-performance ratio, iii) differences between scale-out and scale-up approaches for the same application on AWS and Azure.
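The adapter design pattern mentioned above can be sketched as follows; the back-end classes and return values are placeholders and no real AWS or Azure SDK calls are made:

# Minimal sketch of the adapter idea: the pipeline talks to one interface and
# each adapter maps it onto a provider-specific back end (stubs only here).
from abc import ABC, abstractmethod

class CloudAdapter(ABC):
    @abstractmethod
    def provision_cluster(self, nodes: int) -> str: ...
    @abstractmethod
    def store_artifact(self, name: str, data: bytes) -> str: ...

class AwsAdapter(CloudAdapter):
    def provision_cluster(self, nodes: int) -> str:
        return f"aws-cluster({nodes} nodes)"          # would call the AWS APIs
    def store_artifact(self, name: str, data: bytes) -> str:
        return f"s3://example-bucket/{name}"          # illustrative URI only

class AzureAdapter(CloudAdapter):
    def provision_cluster(self, nodes: int) -> str:
        return f"azure-cluster({nodes} nodes)"        # would call the Azure APIs
    def store_artifact(self, name: str, data: bytes) -> str:
        return f"https://example.blob.core.windows.net/run/{name}"

def reproduce_run(adapter: CloudAdapter) -> None:
    print(adapter.provision_cluster(nodes=4))
    print(adapter.store_artifact("config.json", b"{}"))

if __name__ == "__main__":
    for backend in (AwsAdapter(), AzureAdapter()):
        reproduce_run(backend)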
... TOSCA [15] is an OASIS standard to describe the topology of cloud-based applications towards portable, reproducible application deployments. Qasha et al. [21] combine two execution-environment reproducibility techniques (i.e., logical and physical preservation) of scientific workflows using TOSCA in a container-based approach. In addition to the plain reproducibility concerns, our middleware architecture employs reflection concepts to reconfigure deployment plans, resulting in efficient execution environments. ...
Article
The simulation and optimization of complex engineering designs in automotive or aerospace involves multiple mathematical tools, long-running workflows and resource-intensive computations on distributed infrastructures. Finding the optimal deployment in terms of task distribution, parallelization, collocation and resource assignment for each execution is a step-wise process involving both human input with domain-specific knowledge about the tools and the acquisition of new knowledge based on the actual execution history. In this paper, we present a policy-driven adaptive and reflective middleware that supports smart cloud-based deployment and execution of engineering workflows. This middleware supports deep inspection of the workflow task structure and execution, as well as of the very specific mathematical tools, their executions and used parameters. The reflective capabilities are based on multiple meta-models to reflect workflow structure, deployment, execution and resources. Adaptive deployment is driven by both human input in the form of meta-data annotations and adaptation policies that reason over the actual execution history of the workflows. We validate and evaluate this middleware in real-life application cases and scenarios in the domain of aeronautics.
... Methodology, or best practice, in reproducibility usually discusses general principles [5] or combines tools and platforms to explore best practice [6]. Generally speaking, such methodologies are hard to follow as they tend to be idealised, and researchers may not be familiar with, or able to set up, the toolchain used. ...
Preprint
The reproducibility of scientific experiments is vital for the advancement of disciplines that build on previous work. To achieve this goal, many researchers focus on complex methodology and self-invented tools which are difficult to use in practice. In this article, we introduce the DevOps infrastructure from the software engineering community and show how DevOps can be used effectively to reproduce experiments for computer science related disciplines. DevOps can be enabled using freely available cloud computing machines for medium-sized experiments and self-hosted computing engines for large-scale computing, thus empowering researchers to share their experimental results with others in a more reliable way.
... A similar approach for reproducing scientific workflows on the cloud, using TOSCA, is proposed by Qasha et al. in [19]. The authors propose a framework that aims to use GitHub and DockerHub as repositories for storing workflows and tasks. ...
... Such platforms include Nextflow [17], Bwb [18] and Pachyderm [19]. While containerization addresses many of the issues outlined above and has facilitated the execution of generic workflows in a language- and cloud-agnostic manner [20,21], IaaS services still require users to deploy and manage clusters. Resource management tools like Docker Swarm and Kubernetes are mature and widely used technologies that help manage container orchestration and even support auto-scaling of resources, but they still require an installation and configuration process that may be cumbersome, or in the case of managed solutions, expensive. ...
Conference Paper
Full-text available
Scientific and commercial applications are increasingly being executed in the cloud, but the difficulties associated with cluster management render on-demand resources inaccessible or inefficient to many users. Recently, the serverless execution model, in which the provisioning of resources is abstracted from the user, has gained prominence as an alternative to traditional cyberinfrastructure solutions. With its inherent elasticity, the serverless paradigm constitutes a promising computational model for scientific workflows, allowing domain specialists to develop and deploy workflows that are subject to varying workloads and intermittent usage without the overhead of infrastructure maintenance. We present the Serverless Workflow Enablement and Execution Platform (SWEEP), a cloud-agnostic workflow management system with a purely serverless execution model that allows users to define, run and monitor generic cloud-native workflows. We demonstrate the use of SWEEP on workflows from two disparate scientific domains and present an evaluation of performance and scaling.
... This step helps ensure that if discrepancies are noticed in the future they can be attributed to something other than the computing environment. More thorough discussion of software containers is beyond the scope of this article and is available elsewhere (Chamberlain and Schommer 2014; Boettiger 2015, 2017; Di Tommaso et al. 2015; Hung et al. 2016; Qasha et al. 2017). ...
Article
Full-text available
The ability to replicate scientific experiments is a cornerstone of the scientific method. Sharing ideas, workflows, data, and protocols facilitates testing the generalizability of results, increases the speed that science progresses, and enhances quality control of published work. Fields of science such as medicine, the social sciences, and the physical sciences have embraced practices designed to increase replicability. Granting agencies, for example, may require data management plans and journals may require data and code availability statements along with the deposition of data and code in publicly available repositories. While many tools commonly used in replicable workflows such as distributed version control systems (e.g., 'git') or script programming languages for data cleaning and analysis may have a steep learning curve, their adoption can increase individual efficiency and facilitate collaborations both within entomology and across disciplines. The open science movement is developing within the discipline of entomology, but practitioners of these concepts or those desiring to work more collaboratively across disciplines may be unsure where or how to embrace these initiatives. This article is meant to introduce some of the tools entomologists can incorporate into their workflows to increase the replicability and openness of their work. We describe these tools and others, recommend additional resources for learning more about these tools, and discuss the benefits to both individuals and the scientific community and potential drawbacks associated with implementing a replicable workflow.
... It may require redeploying P on a new infrastructure and ensuring that the system and software dependencies are maintained correctly, or that the results obtained using new versions of third party libraries remain valid. Addressing these architectural and reproducibility issues is a research area of growing interest [23,21,22,46]. ...
Preprint
The value of knowledge assets generated by analytics processes using Data Science techniques tends to decay over time, as a consequence of changes in the elements the process depends on: external data sources, libraries, and system dependencies. For large-scale problems, refreshing those outcomes through greedy re-computation is both expensive and inefficient, as some changes have limited impact. In this paper we address the problem of refreshing past process outcomes selectively, that is, by trying to identify the subset of outcomes that will have been affected by a change, and by only re-executing fragments of the original process. We propose a technical approach to address the selective re-computation problem by combining multiple techniques, and present an extensive experimental study in Genomics, namely variant calling and their clinical interpretation, to show its effectiveness. In this case study, we are able to decrease the number of required re-computations on a cohort of individuals from 495 (blind) down to 71, and to reduce runtime by at least 60% relative to the naïve blind approach, and in some cases by 90%. Starting from this experience, we then propose a blueprint for a generic re-computation meta-process that makes use of process history metadata to make informed decisions about selective re-computations in reaction to a variety of changes in the data.
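A minimal sketch of the selective re-computation idea (illustrative, not the paper's method): given a dependency graph from inputs to outcomes, only the outcomes reachable from a changed element are scheduled for re-execution:

# Minimal sketch: identify the outcomes affected by a change by traversing a
# dependency graph forward from the changed elements.
from collections import deque

def affected_outcomes(deps: dict[str, set[str]], changed: set[str], outcomes: set[str]) -> set[str]:
    """deps maps a node to the nodes that directly depend on it."""
    seen, queue = set(changed), deque(changed)
    while queue:
        for nxt in deps.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen & outcomes

if __name__ == "__main__":
    # Hypothetical genomics-style dependency graph.
    deps = {"reference-db-v2": {"annotate"}, "annotate": {"report-A", "report-B"},
            "raw-reads": {"align"}, "align": {"report-A"}}
    print(affected_outcomes(deps, changed={"reference-db-v2"},
                            outcomes={"report-A", "report-B"}))  # both reports affected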
... In [16], a custom OWL ontology for describing the computational infrastructure used in a computational experiment is proposed. In [14], TOSCA specifications of the underlying infrastructure are used to deploy workflows on different cloud platforms in a portable way. The problem with such proprietary solutions that focus on portability and abstraction [1,12] is their maintainability, a quality crucial for reproducibility. ...
Chapter
We propose a comprehensive solution for reproducibility of scientific workflows. We focus particularly on Kubernetes-managed container clouds, increasingly important in scientific computing. Our solution addresses conservation of the scientific procedure, scientific data, execution environment and experiment deployment, while using standard tools in order to avoid maintainability issues that can obstruct reproducibility. We introduce an Experiment Digital Object (EDO), a record published in an open science repository that contains artifacts required to reproduce an experiment. We demonstrate a variety of reproducibility scenarios, including experiment repetition (same experiment and conditions) and replication (same experiment, different conditions), and propose a smart reuse scenario in which a previous experiment is partially replayed and partially re-executed. The approach is implemented in the HyperFlow workflow management system and experimentally evaluated using a genomic scientific workflow. The experiment is published as an EDO record on the Zenodo platform.
... This resulted in significant improvement in the execution performance of workflows in the cloud. Another example of new, innovative ways of executing workflows in clouds is described in [31] and [32] by Qasha et al. This approach also works based on service orchestration, but the workflow tasks are submitted to the cloud in a novel way. ...
Article
Full-text available
The paper describes a new cloud-oriented workflow system called Flowbster. It was designed to create efficient data pipelines in clouds by which large compute-intensive data sets can be processed efficiently. The Flowbster workflow can be deployed in the target cloud as a virtual infrastructure through which the data to be processed flows and, as it flows through the workflow, it is transformed as the business logic of the workflow defines. Instead of using the enactor-based workflow concept, Flowbster applies the service choreography concept where the workflow nodes communicate directly with each other. Workflow nodes are able to recognize whether they can be activated with a certain data set without the interaction of a central control service like the enactor in service orchestration workflows. As a result, Flowbster workflows implement a much more efficient data path through the workflow than service orchestration workflows. A Flowbster workflow works as a data pipeline, enabling the exploitation of pipeline parallelism, workflow parallel-branch parallelism and node scalability parallelism. The Flowbster workflow can be deployed in the target cloud on demand based on the underlying Occopus cloud deployment and orchestrator tool. Occopus guarantees that the workflow can be deployed in several major types of IaaS clouds (OpenStack, OpenNebula, Amazon, CloudSigma). It takes care not only of deploying the nodes of the workflow but also of maintaining their health by using various health-checking options. Flowbster also provides an intuitive graphical user interface for end-user scientists. This interface hides the low-level cloud-oriented layers and hence users can concentrate on the business logic of their data processing applications without having detailed knowledge of the underlying cloud infrastructure.
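The choreography concept described above, where a node activates itself once its inputs have arrived and forwards results directly to its successors, can be sketched as follows (a toy illustration, not Flowbster's implementation):

# Minimal choreography sketch: each node tracks its own pending inputs and
# fires as soon as they are all present, with no central enactor.
class Node:
    def __init__(self, name, inputs, func, successors=()):
        self.name, self.func = name, func
        self.pending = set(inputs)
        self.values = {}
        self.successors = list(successors)   # (successor node, target port)

    def receive(self, port, value):
        self.values[port] = value
        self.pending.discard(port)
        if not self.pending:                 # all inputs arrived: activate
            result = self.func(**self.values)
            for succ, port_name in self.successors:
                succ.receive(port_name, result)

if __name__ == "__main__":
    sink = Node("sink", ["x"], lambda x: print("result:", x))
    double = Node("double", ["a"], lambda a: a * 2, successors=[(sink, "x")])
    double.receive("a", 21)                  # prints: result: 42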
... TOSCA [14] is an OASIS standard to describe the topology of cloud-based applications towards portable, reproducible application deployments. Qasha et al. [10] combine two execution-environment reproducibility techniques (i.e., logical and physical preservation) of scientific workflows using TOSCA in a container-based approach. In addition to the plain reproducibility concerns, our middleware architecture employs reflection concepts to reconfigure deployment plans, resulting in efficient execution environments. ...
Conference Paper
Full-text available
The simulation and optimization of complex engineering designs in automotive or aerospace involves multiple mathematical tools, long-running workflows and resource-intensive computations on distributed infrastructures. Finding the optimal deployment in terms of task distribution, parallelization, collocation and resource assignment for each execution is a step-wise process involving both human input with domain-specific knowledge about the tools and the acquisition of new knowledge based on the actual execution history. In this paper, we present motivating scenarios as well as an architecture for adaptive and reflective middleware that supports smart cloud-based deployment and execution of engineering workflows. This middleware supports deep inspection of the workflow task structure and execution, as well as of the very specific mathematical tools, their executions and used parameters. The reflective capabilities are based on multiple meta-models to reflect workflow structure, deployment, execution and resources. Adaptive deployment is driven by both human input in the form of meta-data annotations and the actual execution history of the workflows.
... Other methods to build and reuse containers, such as the Topology and Orchestration Specification for Cloud Applications [24], still rely on the user creating the topology, relationship, and node specifications, which are eventually translated to Dockerfiles [25]. In our case, Docker is merely a wrapper for standardization since application virtualization creates a self-contained container and the translation to Dockerfiles from the collected provenance is fairly straightforward. ...
Article
Scientists often need to share their work. Typically, their data is shared in the form of Uniform Resource Identifiers (URIs) or Digital Object Identifiers (DOIs). Scientists' work, however, may not be limited to data, but can also include code, provenance, documents, etc. The Research Object has recently emerged as a method for the identification, aggregation, and exchange of this scholarly information on the Web. Several science communities now engage in a formal process to create research objects and share them on scholarly exchange websites such as Figshare or Hydroshare, but often sharing is not sufficient for scientists. They need to compute further on the shared information. In this paper, we present the sciunit, a reusable research object whose contents can be re-computed, and thus measured. We describe how to efficiently create, store, and repeat computational work with sciunits. We show through experiments that sciunits can replicate and re-run computational applications with minimal storage and processing overhead. Finally, we provide an overview of sharing and reproducible cyberinfrastructure based on sciunits, increasingly being used in the domain of geosciences.
... This resulted in increased execution performance of workflows in the cloud. Another example of new, innovative ways of executing workflows in clouds is described in [25] and [26] by Qasha et al. This approach targets service orchestration, but the workflow tasks are submitted to the cloud in a novel way. ...
Article
The reproducibility of scientific experiments is crucial for corroborating, consolidating and reusing new scientific discoveries. However, the constant pressure to publish results (Fanelli, 2010) has removed reproducibility from the agenda of many researchers: in a recent survey published in Nature (with more than 1500 scientists), over 70% of the participants admitted to having failed to reproduce the work of another colleague at some point in time (Baker, 2016). Analyses from psychology and cancer biology show reproducibility rates below 40% and 10%, respectively (Collaboration, 2015) (Begley & Lee, 2012). As a consequence, retractions of publications have occurred in recent years in several disciplines (Marcus & Oransky, 2014) (Rockoff, 2015), and the general public is now skeptical about scientific studies on topics like pesticides, depression drugs or flu pandemics (American, 2010).
Article
Emerging interdisciplinary data-intensive science gateway applications in engineering fields (e.g., bioinformatics, cybermanufacturing) demand the use of high-performance computing resources. However, to mitigate operational costs and management efforts for these science gateway applications, there is a need to effectively deploy them on federated heterogeneous resources managed by external Cloud Service Providers (CSPs). In this paper, we present a novel methodology to deliver fast, automatic and flexible resource provisioning services for such application-owners with limited expertise in composing and deploying suitable cloud architectures. Our methodology features a Component Abstraction Model to implement intelligent resource ‘abstractions’ coupled with ‘reusable’ hardware and software configuration in the form of “custom templates” to simplify heterogeneous resource management efforts. It also features a novel middleware that provides services via a set of recommendation schemes for a context-aware requirement-collection questionnaire. Recommendations match the requirements to available resources and thus assist novice and expert users to make relevant configuration selections with CSP collaboration. To evaluate our middleware, we study the impact of user preferences in requirement collection, jobs execution and resource adaptation for a real-world manufacturing application on Amazon Web Services and the GENI cloud platforms. Our experiment results show that our scheme improves the resource recommendation accuracy in the manufacturing science gateway application by up to 21% compared to the existing schemes. We also show the impact of custom templates knowledgebase maturity at the CSP side for handling novice and expert user preferences in terms of the resource recommendation accuracy.
Article
Full-text available
The reproducibility of computational environmental models is an important challenge that calls for open and reusable code and data, well-documented workflows, and controlled environments that allow others to verify published findings. This requires an ability to document and share raw datasets, data preprocessing scripts, model inputs, outputs, and the specific model code with all associated dependencies. HydroShare and GeoTrust, two scientific cyberinfrastructures under development, can be used to improve reproducibility in computational hydrology. HydroShare is a web-based system for sharing hydrologic data and models as digital resources including detailed, hydrologic-specific resource metadata. GeoTrust provides tools for scientists to efficiently reproduce and share geoscience applications. This paper outlines a use case example, which focuses on a workflow that uses the MODFLOW model, to demonstrate how the cyberinfrastructures HydroShare and GeoTrust can be integrated in a way that easily and efficiently reproduces computational workflows.
Chapter
Service composition is a popular approach for building software applications from several individual services. Using imperative workflow technologies, service compositions can be specified as workflow models comprising activities that are implemented, e.g., by service calls or scripts. While scripts are typically included in the workflow model itself and can be executed directly by the workflow engine, the required services must be deployed in a separate step. Moreover, to enable their invocation, an additional step is required to configure the workflow model regarding the endpoints of the deployed services, i.e., IP-address, port, etc. However, a manual deployment of services and configuration of the workflow model are complex, time-consuming, and error-prone tasks. In this paper, we present an approach that enables defining service compositions in a self-contained manner using imperative workflow technology. For this, the workflow models can be packaged with all necessary deployment models and software artifacts that implement the required services. As a result, the service deployment in the target environment where the workflow is executed as well as the configuration of the workflow with the endpoint information of the services can be automated completely. We validate the technical feasibility of our approach by a prototypical implementation based on the TOSCA standard and OpenTOSCA.
Article
Workflow engines are commonly used to orchestrate large-scale scientific computations such as, but not limited to, weather, climate, natural disasters, food safety, and territorial management. However, implementing, managing, and executing real-world scientific applications in the form of workflows on multiple infrastructures (servers, clusters, cloud) remains a challenge. In this paper, we present DagOnStar (Directed Acyclic Graph On Anything), a lightweight Python library implementing a workflow paradigm based on parallel patterns that can be executed on any combination of local machines, on-premise high performance computing clusters, containers, and cloud-based virtual infrastructures. DagOnStar is designed to minimize data movement to reduce the application storage footprint. A case study based on a real-world application is explored to illustrate the use of this novel workflow engine: a containerized weather data collection application deployed on multiple infrastructures. An experimental comparison with other state-of-the-art workflow engines shows that DagOnStar can run workflows on multiple types of infrastructure with an improvement of 50.19% in run time when using a parallel pattern with eight task-level workers.
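As a rough sketch of the dataflow-style, task-parallel execution model described above (plain Python, not DagOnStar's actual API), tasks are dispatched to a thread pool as soon as their predecessors have completed:

# Minimal DAG runner sketch: tasks whose prerequisites are done are submitted
# to a thread pool; the DAG and task bodies below are invented for illustration.
from concurrent.futures import ThreadPoolExecutor

def run_dag(tasks, deps, workers=8):
    """tasks: name -> callable; deps: name -> list of prerequisite names."""
    results = {}
    remaining = dict(deps)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while remaining:
            ready = [t for t, d in remaining.items() if all(p in results for p in d)]
            futures = {t: pool.submit(tasks[t], *(results[p] for p in remaining[t]))
                       for t in ready}
            for t, f in futures.items():
                results[t] = f.result()
                del remaining[t]
    return results

if __name__ == "__main__":
    tasks = {"fetch": lambda: 10, "double": lambda x: x * 2,
             "square": lambda x: x * x, "merge": lambda a, b: a + b}
    deps = {"fetch": [], "double": ["fetch"], "square": ["fetch"],
            "merge": ["double", "square"]}
    print(run_dag(tasks, deps)["merge"])          # 120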
Chapter
Full-text available
One critical aspect of science is the ability of another researcher to reproduce the same experiment. In order to do so, the same environment, variables, data, and setup should be considered. The method tells how the original researchers planned and did their research, but how can others replicate or even advance the previous research? The scientific community has been focusing on efforts to increase transparency and reproducibility and to develop a “culture of reproducibility.” When researchers share their data and their workflow, and co-evolve a way of doing research, all the players win. The value co-creation is established in a business ecosystem. The actors who are part of the business platform can, through co-creation, leverage the advantages of one or more partners that make up the platform. Thus, the knowledge created from the interaction between the different technological domains and the knowledge shared on the platform can improve all the research and researchers. Accordingly, this chapter proposes a business ecosystem model to ensure research repeatability.
Article
Full-text available
In the past decades, one of the most common forms of addressing reproducibility in scientific workflow-based computational science has consisted of tracking the provenance of the produced and published results. Such provenance allows inspecting intermediate and final results, improves understanding, and permits replaying a workflow execution. Nevertheless, this approach does not provide any means for capturing and sharing the very valuable knowledge about the experimental equipment of a computational experiment, i.e., the execution environment in which the experiments are conducted. In this work, we propose a novel approach based on semantic vocabularies that describes the execution environment of scientific workflows, so as to conserve it. We define a process for documenting the workflow application and its related management system, as well as their dependencies. Then we apply this approach over three different real workflow applications running in three distinct scenarios, using public, private, and local Cloud platforms. In particular, we study one astronomy workflow and two life science workflows for genomic information analysis. Experimental results show that our approach can reproduce an equivalent execution environment of a predefined virtual machine image on all evaluated computing platforms.
Article
Full-text available
Virtualization is becoming increasingly important in bioscience, enabling assembly and provisioning of complete computer setups, including operating system, data, software, and services packaged as virtual machine images (VMIs). We present an open catalog of VMIs for the life sciences, where scientists can share information about images and optionally upload them to a server equipped with a large file system and fast Internet connection. Other scientists can then search for and download images that can be run on the local computer or in a cloud computing environment, providing easy access to bioinformatics environments. We also describe applications where VMIs aid life science research, including distributing tools and data, supporting reproducible analysis, and facilitating education. BioImg.org is freely available at: https://bioimg.org.
Conference Paper
Full-text available
Scientific workflows play an increasingly important role in building scientific applications, while cloud computing provides on-demand access to large compute resources. Combining the two offers the potential to increase dramatically the ability to quickly extract new results from the vast amounts of scientific data now being collected. However, with the proliferation of cloud computing platforms and workflow management systems, it becomes more and more challenging to define workflows so they can reliably run in the cloud and be reused easily. This paper shows how TOSCA, a new standard for cloud service management, can be used to systematically specify the components and life cycle management of scientific workflows by mapping the basic elements of a real workflow onto entities specified by TOSCA. Ultimately, this will enable workflow definitions that are portable across clouds, resulting in the greater reusability and reproducibility of workflows.
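To make the mapping idea concrete, the following sketch mirrors a TOSCA-like topology as a Python data structure; the node-type names, artifacts and the "input_from" relationship are purely illustrative and are not the paper's actual TOSCA definitions:

# Illustrative only: a workflow expressed as a topology of node templates,
# with task ordering derived from the dataflow relationships.
topology = {
    "node_templates": {
        "FastQC_Task": {
            "type": "workflow.nodes.Task",            # illustrative type name
            "artifacts": {"image": "example/fastqc:1.0"},
            "requirements": [{"host": "Docker_Runtime"}],
        },
        "Align_Task": {
            "type": "workflow.nodes.Task",
            "artifacts": {"image": "example/bwa:0.7"},
            "requirements": [{"host": "Docker_Runtime"},
                             {"input_from": "FastQC_Task"}],   # dataflow edge
        },
        "Docker_Runtime": {"type": "workflow.nodes.ContainerRuntime"},
    }
}

def execution_order(topo):
    """Derive a task ordering from the input_from relationships."""
    tasks = {n: [r["input_from"] for r in t.get("requirements", []) if "input_from" in r]
             for n, t in topo["node_templates"].items() if t["type"].endswith("Task")}
    done, order = set(), []
    while len(order) < len(tasks):
        for name, preds in tasks.items():
            if name not in done and all(p in done for p in preds):
                done.add(name)
                order.append(name)
    return order

if __name__ == "__main__":
    print(execution_order(topology))   # ['FastQC_Task', 'Align_Task']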
Article
Full-text available
Scientific workflows are a popular mechanism for specifying and automating data-driven in silico experiments. A significant aspect of their value lies in their potential to be reused. Once shared, workflows become useful building blocks that can be combined or modified for developing new experiments. However, previous studies have shown that storing workflow specifications alone is not sufficient to ensure that they can be successfully reused, without being able to understand what the workflows aim to achieve or to re-enact them. To gain an understanding of the workflow, and how it may be used and repurposed for their needs, scientists require access to additional resources such as annotations describing the workflow, datasets used and produced by the workflow, and provenance traces recording workflow executions.
Article
Full-text available
As computational work becomes more and more integral to many aspects of scientific research, computational reproducibility has become an issue of increasing importance to computer systems researchers and domain scientists alike. Though computational reproducibility seems more straightforward than replicating physical experiments, the complex and rapidly changing nature of computer environments makes being able to reproduce and extend such work a serious challenge. In this paper, I explore common reasons that code developed for one research project cannot be successfully executed or extended by subsequent researchers. I review current approaches to these issues, including virtual machines and workflow systems, and their limitations. I then examine how the popular emerging technology Docker combines several areas from systems research, such as operating system virtualization, cross-platform portability, modular re-usable elements, versioning, and a 'DevOps' philosophy, to address these challenges. I illustrate this with several examples of Docker use with a focus on the R statistical environment.
Article
Full-text available
One of the foundations of science is that researchers must publish the methodology used to achieve their results so that others can attempt to reproduce them. This has the added benefit of allowing methods to be adopted and adapted for other purposes. In the field of e-Science, services, often choreographed through workflows, process data to generate results. The reproduction of results is often not straightforward as the computational objects may not be made available or may have been updated since the results were generated. For example, services are often updated to fix bugs or improve algorithms. This paper addresses these problems in three ways. Firstly, it introduces a new framework to clarify the range of meanings of 'reproducibility'. Secondly, it describes a new algorithm, PDIFF, that uses a comparison of workflow provenance traces to determine whether an experiment has been reproduced; the main innovation is that if this is not the case then the specific point(s) of divergence are identified through graph analysis, assisting any researcher wishing to understand those differences. One key feature is support for user-defined, semantic data comparison operators. Finally, the paper describes an implementation of PDIFF that leverages the power of the e-Science Central platform that enacts workflows in the cloud. As well as automatically generating a provenance trace for consumption by PDIFF, the platform supports the storage and reuse of old versions of workflows, data and services; the paper shows how this can be powerfully exploited to achieve reproduction and reuse. Copyright © 2013 John Wiley & Sons, Ltd.
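A simplified sketch of divergence detection between two provenance traces (illustrative only, not the PDIFF algorithm): the traces are walked step by step and the first point where a user-supplied semantic comparator reports a difference is returned:

# Minimal sketch: compare two execution traces and report the first point of
# divergence, allowing per-step semantic comparators.
def first_divergence(trace_a, trace_b, comparators=None):
    """Each trace is a list of (step_name, output) pairs in execution order."""
    comparators = comparators or {}
    for i, ((name_a, out_a), (name_b, out_b)) in enumerate(zip(trace_a, trace_b)):
        if name_a != name_b:
            return i, f"structure differs: {name_a} vs {name_b}"
        same = comparators.get(name_a, lambda x, y: x == y)
        if not same(out_a, out_b):
            return i, f"outputs of '{name_a}' differ"
    return None, "traces match"

if __name__ == "__main__":
    a = [("clean", "v1.csv"), ("model", 0.9312), ("report", "ok")]
    b = [("clean", "v1.csv"), ("model", 0.9309), ("report", "ok")]
    # Semantic comparator: model scores within 1e-2 count as reproduced.
    print(first_divergence(a, b, {"model": lambda x, y: abs(x - y) < 1e-2}))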
Conference Paper
Full-text available
Workflows provide a popular means for preserving scientific methods by explicitly encoding their process. However, some of them are subject to a decay in their ability to be re-executed or reproduce the same results over time, largely due to the volatility of the resources required for workflow executions. This paper provides an analysis of the root causes of workflow decay based on an empirical study of a collection of Taverna workflows from the myExperiment repository. Although our analysis was based on a specific type of workflow, the outcomes and methodology should be applicable to workflows from other systems, at least those whose executions also rely largely on accessing third-party resources. Based on our understanding about decay we recommend a minimal set of auxiliary resources to be preserved together with the workflows as an aggregation object and provide a software tool for end-users to create such aggregations and to assess their completeness.
Article
Full-text available
For cloud services to be portable, their management must also be portable to the targeted environment, as must the application components themselves. Here, the authors show how plans in the Topology and Orchestration Specification for Cloud Applications (TOSCA) can enable portability of these operational aspects.
Conference Paper
Full-text available
We describe ReproZip, a tool that makes it easier for authors to publish reproducible results and for reviewers to validate these results. By tracking operating system calls, ReproZip systematically captures detailed provenance of existing experiments, including data dependencies, libraries used, and configuration parameters. This information is combined into a package that can be installed and run on a different environment. An important goal that we have for ReproZip is usability. Besides simplifying the creation of reproducible results, the system also helps reviewers. Because the package is self-contained, reviewers need not install any additional software to run the experiments. In addition, ReproZip generates a workflow specification for the experiment. This not only enables reviewers to execute this specification within a workflow system to explore the experiment and try different configurations, but also the provenance kept by the workflow system can facilitate communication between reviewers and authors.
Conference Paper
Full-text available
Computational experiments have become an integral part of the scientific method, but reproducing, archiving, and querying them is still a challenge. The first barrier to a wider adoption is the fact that it is hard both for authors to derive a compendium that encapsulates all the components needed to reproduce a result and for reviewers to verify the results. In this tutorial, we will present a series of guidelines and, through hands-on examples, review existing tools to help authors create reproducible results. We will also outline open problems and new directions for database-related research having to do with querying computational experiments.
Article
Full-text available
SIGMOD has offered, since 2008, to verify the experiments published in the papers accepted at the conference. This year, we have been in charge of reproducing the experiments provided by the authors (repeatability), and exploring changes to experiment parameters (workability). In this paper, we assess the SIGMOD repeatability process in terms of participation, review process and results. While the participation is stable in terms of number of submissions, we find this year a sharp contrast between the high participation from Asian authors and the low participation from American authors. We also find that most experiments are distributed as Linux packages accompanied by instructions on how to setup and run the experiments. We are still far from the vision of executable papers.
Article
Full-text available
Scientific workflow systems have become a necessary tool for many applications, enabling the composition and execution of complex analyses on distributed resources. Today there are many workflow systems, often with overlapping functionality. A key issue for potential users of workflow systems is the need to be able to compare the capabilities of the various available tools. There can be confusion about system functionality and the tools are often selected without a proper functional analysis. In this paper we extract a taxonomy of features from the way scientists make use of existing workflow systems and we illustrate this feature set by providing some examples taken from existing workflow systems. The taxonomy provides end users with a mechanism by which they can assess the suitability of workflows in general and how they might use these features to make an informed choice about which workflow system would be a good choice for their particular application.
Article
Full-text available
Increased reliance on computational approaches in the life sciences has revealed grave concerns about how accessible and reproducible computation-reliant results truly are. Galaxy http://usegalaxy.org, an open web-based platform for genomic research, addresses these problems. Galaxy automatically tracks and manages data provenance and provides support for capturing the context and intent of computational methods. Galaxy Pages are interactive, web-based documents that provide users with a medium to communicate a complete computational analysis.
Article
Full-text available
myExperiment (http://www.myexperiment.org) is an online research environment that supports the social sharing of bioinformatics workflows. These workflows are procedures consisting of a series of computational tasks using web services, which may be performed on data from its retrieval, integration and analysis, to the visualization of the results. As a public repository of workflows, myExperiment allows anybody to discover those that are relevant to their research, which can then be reused and repurposed to their specific requirements. Conversely, developers can submit their workflows to myExperiment and enable them to be shared in a secure manner. Since its release in 2007, myExperiment currently has over 3500 registered users and contains more than 1000 workflows. The social aspect to the sharing of these workflows is facilitated by registered users forming virtual communities bound together by a common interest or research project. Contributors of workflows can build their reputation within these communities by receiving feedback and credit from individuals who reuse their work. Further documentation about myExperiment, including its REST web service, is available from http://wiki.myexperiment.org. Feedback and requests for support can be sent to [email protected]
Conference Paper
Provenance has been considered as a means to achieve scientific workflow reproducibility to verify the workflow processes and results. Cloud computing provides a new computing paradigm for workflow execution by offering a dynamic and scalable environment with on-demand resource provisioning. In the absence of Cloud infrastructure information, achieving workflow reproducibility on the Cloud becomes a challenge. This paper presents a framework, named ReCAP, to capture the Cloud infrastructure information and to interlink it with the workflow provenance to establish the Cloud-Aware Provenance (CAP). This paper identifies different scenarios of using the Cloud for workflow execution and presents different mapping approaches. The reproducibility of the workflow execution is performed by re-provisioning similar Cloud resources using CAP, re-executing the workflow, and comparing the outputs of the workflows. Finally, this paper also presents the evaluation of ReCAP in terms of captured provenance, workflow execution time and workflow output comparison.
Article
Dataflow-style workflows offer a simple, high-level programming model for flexible prototyping of scientific applications as an attractive alternative to low-level scripting. At the same time, workflow management systems (WfMS) may support data parallelism over big datasets by providing scalable, distributed deployment and execution of the workflow over a cloud infrastructure. In theory, the combination of these properties makes workflows a natural choice for implementing Big Data processing pipelines, common for instance in bioinformatics. In practice, however, correct workflow design for parallel Big Data problems can be complex and very time-consuming.
Conference Paper
The value of workflows to the scientific community spans over time and space. Not only results but also performance and resource consumption of a workflow need to be replayed over time and in varying environments. Achieving such repeatability in practice is challenging due to changes in software and infrastructure over time. In this work, we introduce a new abstraction that builds on the concept of virtual appliance to enable workflow repeatability. We have also developed a novel architecture to leverage this abstraction and realized it into a system implementation that supports a popular workflow management system and builds on a federated in-production environment. We demonstrate the effectiveness of our approach by examining various aspects of workflow repeatability. Our results show that workflows can be replayed with 2% fidelity when considering their walltime as performance metric.
Book
Portability and automated management of composite applications are major concerns of today's enterprise IT. These applications typically consist of heterogeneous distributed components combined to provide the application's functionality. This architectural style challenges the operation and management of the application as a whole and requires new concepts for deployment, configuration, operation, and termination. The upcoming OASIS Topology and Orchestration Specification for Cloud Applications (TOSCA) standard provides new ways to enable portable automated deployment and management of composite applications. TOSCA describes the structure of composite applications as topologies containing their components and their relationships. Plans capture management tasks by orchestrating management operations exposed by the components. This chapter provides an overview of the concepts and usage of TOSCA.
Article
Computational reproducibility depends on the ability to not only isolate necessary and sufficient computational artifacts but also to preserve those artifacts for later re-execution. Both isolation and preservation present challenges in large part due to the complexity of existing software and systems as well as the implicit dependencies, resource distribution, and shifting compatibility of systems that result over time, all of which conspire to break the reproducibility of an application. Sandboxing is a technique that has been used extensively in OS environments in order to isolate computational artifacts. Several tools were proposed recently that employ sandboxing as a mechanism to ensure reproducibility. However, none of these tools preserve the sandboxed application for re-distribution to a larger scientific community, an aspect that is equally crucial for ensuring reproducibility as sandboxing itself. In this paper, we describe a framework of combined sandboxing and preservation, which is not only efficient and invariant, but also practical for large-scale reproducibility. We present case studies of complex high-energy physics applications and show how the framework can be useful for sandboxing, preserving, and distributing applications. We report on the completeness, performance, and efficiency of the framework, and suggest possible standardization approaches.
Article
Docker promises the ability to package applications and their dependencies into lightweight containers that move easily between different distros, start up quickly and are isolated from each other.
Article
As science becomes increasingly computational, reproducibility has become increasingly difficult, perhaps surprisingly. In many contexts, virtualization and cloud computing can mitigate the issues involved without significant overhead to the researcher, enabling the next generation of rigorous and reproducible computational science.
Article
Researchers working on the planning, scheduling, and execution of scientific workflows need access to a wide variety of scientific workflows to evaluate the performance of their implementations. This paper provides a characterization of workflows from six diverse scientific applications, including astronomy, bioinformatics, earthquake science, and gravitational-wave physics. The characterization is based on novel workflow profiling tools that provide detailed information about the various computational tasks that are present in the workflow. This information includes I/O, memory and computational characteristics. Although the workflows are diverse, there is evidence that each workflow has a job type that consumes the most amount of runtime. The study also uncovered inefficiency in a workflow component implementation, where the component was re-reading the same data multiple times.
OASIS, "Topology and Orchestration Specification for Cloud Applications Version 1.0," pp. 1–114, 2013.
V. Stodden, F. Leisch, and R. D. Peng, Implementing Reproducible Research. CRC Press, 2014.
R. Chamberlain and J. Schommer, "Using Docker to support Reproducible Research (submission to WSSSPE2)," pp. 1–4, 2014.