Conference Paper

Extending the Galaxy portal with parallel and distributed execution capability


Abstract

The Galaxy platform is a web-based science portal for scientific computing supporting the life-sciences user community. While user-friendly and intuitive for small- to medium-scale computations, it currently has limited support for large-scale parallel and distributed computing. The Swift parallel scripting framework can compose ordinary applications into parallel scripts that run on distributed and high-performance computing platforms at multiple scales. In complex distributed environments, the user-facing end of the application lifecycle often slows because of the technical complexity introduced by scale, access methods, and resource-management nuances. Galaxy offers a simple way of designing, composing, executing, reusing, and reproducing application runs. Integrating the Swift and Galaxy systems can accelerate science and bring the respective user communities together in an interactive, user-friendly, parallel and distributed data-analysis environment enabled on a broad range of computational infrastructures.


... 1) User constructs her workflow using a simple and intuitive interface such as Galaxy [1]. ...
... User provides the workflow either as a Swift script or using some other interface. In the case of the latter, an approach such as [1] is used to obtain a Swift script. In the Swift script, a workflow is represented as tasks, each of which consumes or produces a number of files. ...
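The file-consuming/file-producing task model described in the snippet above can be sketched as follows. This is an illustrative Python model (with in-memory stand-ins for files), not actual Swift script syntax; the function and variable names are hypothetical.

```python
def run_workflow(tasks, store):
    """Repeatedly run any task whose input 'files' are all present in `store`.

    Each task is a tuple (inputs, outputs, fn); fn maps input contents to
    output contents, mimicking an application invocation over files. The
    dependency graph is implicit in the file relationships, as in Swift.
    """
    pending = list(tasks)
    while pending:
        runnable = [t for t in pending if all(i in store for i in t[0])]
        if not runnable:
            raise RuntimeError("unsatisfiable dependencies among tasks")
        for inputs, outputs, fn in runnable:
            results = fn(*(store[i] for i in inputs))
            store.update(zip(outputs, results))
        pending = [t for t in pending if t not in runnable]

# Demo: the second task runs first because the first one's input
# ("split.txt") does not exist until the second task produces it.
store = {"raw.txt": "a,b,c"}
tasks = [
    (["split.txt"], ["count.txt"], lambda s: (str(len(s.split())),)),
    (["raw.txt"], ["split.txt"], lambda s: (" ".join(s.split(",")),)),
]
run_workflow(tasks, store)
assert store["count.txt"] == "3"
```

Because execution order follows only from data availability, the sketch is implicitly parallel in the same sense as Swift: any tasks whose inputs are ready could run concurrently.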
Conference Paper
Workflows play an important role in expressing and executing scientific applications. In recent years, a variety of computational sites and resources have emerged, and users often have access to multiple resources that are geographically distributed. These computational sites are heterogeneous in nature, and the performance of different tasks in a workflow varies from one site to another. Additionally, users typically have a limited resource allocation at each site. In such cases, a judicious scheduling strategy is required to map tasks in the workflow to resources so that the workload is balanced among sites and data transfer overhead is minimized. Most existing systems either run the entire workflow in a single site, use naive approaches to distribute the tasks across sites, or leave it to the user to optimize the allocation of tasks to distributed resources. This results in a significant loss in productivity for a scientist. In this paper, we propose a multi-site workflow scheduling technique that uses performance models to predict the execution time on different resources and dynamic probes to identify the achievable network throughput between sites. We evaluate our approach using real-world applications in a distributed environment using the Swift distributed execution framework and show that our approach improves the execution time by up to 60% compared to the default schedule.
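The core decision in the scheduling technique described above (combining a per-site performance model with probed network throughput) can be sketched as a cost minimization. This is a simplified illustration with hypothetical names, not the paper's actual algorithm or code.

```python
def pick_site(task_size_mb, exec_model, throughput, sites):
    """Choose the site minimizing predicted execution + data-transfer time.

    exec_model[site] : predicted runtime in seconds (performance model)
    throughput[site] : probed achievable network throughput in MB/s
    """
    def cost(site):
        transfer = task_size_mb / throughput[site]  # time to ship inputs
        return exec_model[site] + transfer
    return min(sites, key=cost)

# Demo: site B computes faster, but its slow link makes A the better choice
# for a 100 MB task (A: 100 + 100/10 = 110 s; B: 60 + 100/1 = 160 s).
best = pick_site(
    task_size_mb=100,
    exec_model={"A": 100.0, "B": 60.0},
    throughput={"A": 10.0, "B": 1.0},
    sites=["A", "B"],
)
assert best == "A"
```

A real scheduler would also account for per-site allocation caps and current load, but the trade-off between compute speed and transfer cost is the essence of the approach.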
... Visual workflow management environments: Many workflow systems (e.g., Taverna (Oinn et al. 2004), Kepler (McPhillips et al. 2009), Galaxy (Goecks et al. 2010), and CloudSME (Taylor et al.)) are based on comparable data-driven computing models but lack Swift's scalability, its simple generality for supporting arbitrary applications, and its provider-based architecture for broad platform support. Recent work has integrated Swift's execution model into the Galaxy user interface model, to provide the best benefits of both workflow models (Maheshwari et al. 2013a, Maheshwari et al. 2013b). ...
Conference Paper
As high-performance computing resources have become increasingly available, new modes of computational processing and experimentation have become possible. This tutorial presents the Extreme-scale Model Exploration with Swift/T (EMEWS) framework for combining existing capabilities for model exploration approaches (e.g., model calibration, metaheuristics, data assimilation) and simulations (or any "black box" application code) with the Swift/T parallel scripting language to run scientific workflows on a variety of computing resources, from desktop to academic clusters to Top 500 level supercomputers. We will present a number of use-cases, starting with a simple agent-based model parameter sweep, and ending with a complex adaptive parameter space exploration workflow coordinating ensembles of distributed simulations. The use-cases are published on a public repository for interested parties to download and run on their own.
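The simplest use-case mentioned above, an agent-based model parameter sweep, amounts to evaluating a black-box model over the cross product of parameter values. Below is a minimal serial sketch of that pattern (hypothetical function names; EMEWS itself distributes such evaluations with Swift/T rather than running them in a loop).

```python
from itertools import product

def parameter_sweep(model, grid):
    """Run a black-box `model` over the cross product of parameter values.

    `grid` maps parameter names to lists of candidate values; the model
    receives one dict per combination and returns a scalar result.
    """
    keys = sorted(grid)
    return {
        combo: model(dict(zip(keys, combo)))
        for combo in product(*(grid[k] for k in keys))
    }

# Demo with a trivial stand-in model: result = x * y.
results = parameter_sweep(lambda p: p["x"] * p["y"],
                          {"x": [1, 2], "y": [3, 4]})
assert results[(2, 4)] == 8
assert len(results) == 4  # full cross product of the grid
```

Adaptive exploration, as in the later use-cases, replaces the fixed grid with a loop in which each batch of parameter points is chosen based on previous results.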
Article
In recent years, a variety of computational sites and resources have emerged, and users often have access to multiple resources that are distributed. These sites are heterogeneous in nature, and the performance of different tasks in a workflow varies from one site to another. Additionally, users typically have a limited resource allocation at each site, capped by administrative policies. In such cases, a judicious scheduling strategy is required to map tasks in the workflow to resources so that the workload is balanced among sites and data transfer overhead is minimized. Most existing systems either run the entire workflow in a single site, use naïve approaches to distribute the tasks across sites, or leave it to the user to optimize the allocation of tasks to distributed resources. This results in a significant loss in productivity. We propose a multi-site workflow scheduling technique that uses performance models to predict the execution time on resources and dynamic probes to identify the achievable network throughput between sites. We evaluate our approach using real-world applications with the Swift parallel and distributed execution framework, in two distinct computational environments: geographically distributed clusters and multiple clouds. We show that our approach improves resource utilization and reduces execution time compared to the default schedule.
Article
This document is obsolete. The definitive document is Standard ECMA-404 The JSON Data Interchange Syntax. JavaScript Object Notation (JSON) is a lightweight, text-based, language-independent data interchange format. It was derived from the ECMAScript Programming Language Standard. JSON defines a small set of formatting rules for the portable representation of structured data.
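The small set of formatting rules mentioned above covers objects, arrays, strings, numbers, booleans, and null, and any conforming implementation round-trips such data. A short example using Python's standard-library parser:

```python
import json

# A structured record using every JSON value category except null.
record = {"tool": "Galaxy", "parallel": True, "sites": ["A", "B"], "jobs": 3}

text = json.dumps(record)           # serialize to a JSON text
assert json.loads(text) == record   # parse it back; structure is preserved
```

This language independence is what makes JSON suitable as an interchange format between components written in different languages, such as a web portal and a workflow engine.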
Article
Science gateways have dramatically simplified the work required by science communities to run their codes on TeraGrid resources. Gateway development typically spans the duration of a particular grant, with the first production runs occurring some months after the award and concluding near the end of the project. Scientists use gateways as a means to interface with large resources. Our gateway infrastructure facilitates this by hiding the various details of the underlying resources and presenting an intuitive way to interact with them. In this paper, we present our work on GPSI, a general-purpose science gateway infrastructure that can be easily customized to meet the needs of an application, reducing time to deployment and improving scientific productivity. Our contribution in this paper is two-fold: we elaborate our vision for a user-driven gateway infrastructure that includes components required by multiple science domains, thus aiding the speedy development of gateways, and we present our experience in moving from our initial portal implementations to the current effort based on Python [15] and Django [16].
Article
This paper presents the design, implementation, and usage of a virtual laboratory for medical image analysis. It is fully based on the Dutch grid, which is part of the Enabling Grids for E-sciencE (EGEE) production infrastructure and driven by the gLite middleware. The adopted service-oriented architecture decouples the user-friendly clients running on the user's workstation from the complexity of the grid applications and infrastructure. Data are stored on grid resources and can be browsed and viewed interactively by the user with the Virtual Resource Browser (VBrowser). Data analysis pipelines are described as Scufl workflows and enacted on the grid infrastructure transparently using the MOTEUR workflow management system. VBrowser plug-ins allow for easy experiment monitoring and error detection. Because of the strict compliance with the grid authentication model, all operations are performed on behalf of the user, ensuring basic security and facilitating collaboration across organizations. The system has been operational and in daily use for eight months (as of December 2008), with six users, leading to the submission of 9000 jobs per month on average and the production of several terabytes of data.
Article
Accessing and analyzing the exponentially expanding genomic sequence and functional data pose a challenge for biomedical researchers. Here we describe an interactive system, Galaxy, that combines the power of existing genome annotation databases with a simple Web portal to enable users to search remote resources, combine data from independent queries, and visualize the results. The heart of Galaxy is a flexible history system that stores the queries from each user; performs operations such as intersections, unions, and subtractions; and links to other computational tools. Galaxy can be accessed at http://g2.bx.psu.edu.
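The history operations named above (intersections, unions, and subtractions across stored queries) can be illustrated with two query results modeled simply as sets of (chromosome, start, end) records. This is a toy sketch; Galaxy's actual tools operate on interval files with richer overlap semantics.

```python
# Two hypothetical query results, each a set of genomic interval records.
query_a = {("chr1", 100, 200), ("chr2", 50, 80)}
query_b = {("chr1", 100, 200), ("chr3", 10, 30)}

both = query_a & query_b    # intersection: records returned by both queries
either = query_a | query_b  # union: records returned by either query
only_a = query_a - query_b  # subtraction: records unique to the first query

assert both == {("chr1", 100, 200)}
assert len(either) == 3
assert only_a == {("chr2", 50, 80)}
```

Chaining such operations over stored query results is what lets a user combine data from independent queries without re-running them.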
Article
In 2002, the National Science Foundation established the Network for Computational Nanotechnology (NCN), a network of universities supporting the National Nanotechnology Initiative by bringing computational tools online, making the tools easy to use, and supporting the tools with educational materials. Along the way, NCN created a unique cyberinfrastructure to support its Web site, nanoHUB.org, where researchers, educators, and professionals collaborate, share resources, and solve real nanotechnology problems. In 2007, nanoHUB.org served more than 56,000 users from 172 countries. In this article, the authors share their experiences in developing this cyberinfrastructure and using it, particularly in an educational context.
Article
Scientists, engineers, and statisticians must execute domain-specific application programs many times on large collections of file-based data. This activity requires complex orchestration and data management as data is passed to, from, and among application invocations. Distributed and parallel computing resources can accelerate such processing, but their use further increases programming complexity. The Swift parallel scripting language reduces these complexities by making file system structures accessible via language constructs and by allowing ordinary application programs to be composed into powerful parallel scripts that can efficiently utilize parallel and distributed resources. We present Swift’s implicitly parallel and deterministic programming model, which applies external applications to file collections using a functional style that abstracts and simplifies distributed parallel execution.
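The functional style described above, applying an ordinary application across a collection of files with deterministic results, can be sketched with a parallel map. This is a Python illustration of the idea, not Swift itself; `map_app` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor

def map_app(app, files):
    """Apply an ordinary 'application' to each file in a collection.

    Invocations may run concurrently, but results are returned in input
    order regardless of completion order, so the output is deterministic,
    mirroring Swift's implicitly parallel, deterministic model.
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(app, files))

# Demo with a trivial stand-in for an application invocation.
out = map_app(str.upper, ["a.txt", "b.txt", "c.txt"])
assert out == ["A.TXT", "B.TXT", "C.TXT"]
```

In Swift the same pattern is expressed over real file collections, with the runtime dispatching each invocation to parallel or distributed resources.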
Decision Support System for Agrotechnology Transfer Version 4.0: Crop Model Documentation
J. W. Jones, G. Hoogenboom, P. Wilkens, C. Porter, and G. Tsuji, editors. Decision Support System for Agrotechnology Transfer Version 4.0: Crop Model Documentation. University of Hawaii, 2003.