Conference Paper

Towards Interactive, Reproducible Analytics at Scale on HPC Systems

Article
Rich user interfaces like Jupyter have the potential to make interacting with a supercomputer easier and more productive, consequently attracting new kinds of users and helping to expand the application of supercomputing to new science domains. For the scientist user, the ideal rich user interface delivers a familiar, responsive, introspective, modular, and customizable platform upon which to build, run, capture, document, re-run, and share analysis workflows. From the provider or system administrator perspective, such a platform would also be easy to configure, deploy securely, update, customize, and support. Jupyter checks most if not all of these boxes. But from the perspective of leadership computing organizations that provide supercomputing power to users, such a platform should also make the unique features of a supercomputer center more accessible to users and more composable with high-performance computing (HPC) workflows. Project Jupyter's (https://jupyter.org/about) core design philosophy of extensibility, abstraction, and agnostic deployment has allowed HPC centers like the National Energy Research Scientific Computing Center (NERSC) to bring in advanced supercomputing capabilities that can extend the interactive notebook environment. This has enabled a rich scientific discovery platform, particularly for experimental facility data analysis and machine learning problems.
Article
Full-text available
Bringing HEP computing to HPC can be difficult. Software stacks are often very complicated, with numerous dependencies that are difficult to install on an HPC system. To address this issue, NERSC has created Shifter, a framework that delivers Docker-like functionality to HPC. It works by extracting images from native formats and converting them to a common format that is optimally tuned for the HPC environment. We have used Shifter to deliver the CVMFS software stack for ALICE, ATLAS, and STAR on the supercomputers at NERSC. As well as enabling the distribution of multi-TB CVMFS stacks to HPC, this approach also offers performance advantages. Software startup times are significantly reduced, and load times scale with minimal variation to thousands of nodes. We profile several successful examples of scientists using Shifter to make scientific analysis easily customizable and scalable. We describe the Shifter framework and several efforts in HEP and NP to use Shifter to deliver their software on the Cori HPC system.
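As an illustrative sketch only (the image name is a placeholder, not one of the CVMFS stacks above), the basic Shifter workflow is to pull a Docker image into Shifter's HPC-tuned format with shifterimg and then run commands inside it with shifter; it is wrapped in Python here for consistency with the other examples.

    import subprocess

    # Illustrative sketch; the image name is a placeholder.
    # Convert a Docker Hub image into Shifter's HPC-optimized format...
    subprocess.run(["shifterimg", "pull", "docker:centos:7"], check=True)

    # ...then run a command inside the converted image.
    subprocess.run(["shifter", "--image=docker:centos:7",
                    "cat", "/etc/os-release"], check=True)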
Poster
Full-text available
Parsl is a parallel scripting library for Python that provides a simple model for describing and executing dataflow-based scripts over arbitrary execution resources such as clusters, grids, and high-performance systems. Parsl's execution layer abstracts the differences between providers, enabling the provisioning and management of compute nodes. In this poster we describe the development of a new execution provider for Parsl that is designed to support Amazon Web Services (AWS) and Microsoft Azure. This provider supports the transparent execution of implicitly parallel Python-based scripts using elastic cloud resources. We demonstrate that Parsl is capable of executing thousands of applications per second over this elastic execution fabric.
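As a rough sketch of Parsl's programming model (using a local thread-pool executor rather than the AWS/Azure provider described in the poster), a Parsl app is an ordinary Python function whose calls return futures and execute over whatever executor the loaded configuration provides.

    import parsl
    from parsl import python_app
    from parsl.config import Config
    from parsl.executors import ThreadPoolExecutor

    # Minimal local configuration; a real deployment would configure an
    # executor backed by an elastic cloud or cluster provider instead.
    parsl.load(Config(executors=[ThreadPoolExecutor(max_threads=4)]))

    @python_app
    def square(x):
        return x * x

    # Calls return futures immediately; Parsl schedules the work over the executor.
    futures = [square(i) for i in range(10)]
    print([f.result() for f in futures])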
Article
Full-text available
Here we present Singularity, software developed to bring containers and reproducibility to scientific computing. Using Singularity containers, developers can work in reproducible environments of their choosing and design, and these complete environments can easily be copied and executed on other platforms. Singularity is an open source initiative that harnesses the expertise of system and software engineers and researchers alike, and integrates seamlessly into common workflows for both of these groups. As its primary use case, Singularity brings mobility of computing to both users and HPC centers, providing a secure means to capture and distribute software and compute environments. This ability to create and deploy reproducible environments across these centers, a previously unmet need, makes Singularity a game changing development for computational science.
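For illustration, a minimal sketch of the Singularity workflow (Singularity 3.x command names; the image name and command are placeholders): build an image from a Docker source, then execute the same environment anywhere the image file is copied.

    import subprocess

    # Illustrative sketch; image name and command are placeholders.
    # Build a portable image file from a Docker Hub source...
    subprocess.run(["singularity", "build", "analysis.sif",
                    "docker://python:3.8-slim"], check=True)

    # ...then run a command inside that reproducible environment.
    subprocess.run(["singularity", "exec", "analysis.sif",
                    "python3", "--version"], check=True)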
Article
Full-text available
The increasing concern for the availability of scientific data has resulted in a number of initiatives promoting the archival and curation of datasets as a legitimate research outcome. Among them, dataset repositories fill the gap of providing long-term preservation of diverse kinds of data along with their meta-descriptions, and support citation. Unsurprisingly, concerns about quality arise here just as they do in the publication of papers. However, repositories support a larger variety of use cases, and many of them apply only minimal control to the data uploaded by users. One approach to quality control in repositories is to let communities of users filter the resources relevant to them, at the same time providing some form of trust to users of the data. However, little is known about the extent to which this social approach, which relies on communities self-organizing, actually contributes to effective organization inside repositories. This paper reports the results of a study on the Zenodo repository, describing its main contents and how communities have emerged naturally around the deposited contents.
Article
Full-text available
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.
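For illustration, a minimal sketch of the graph-and-session style described here (TensorFlow 1.x API; later 2.x releases expose it under tf.compat.v1): a computation is expressed as a dataflow graph, and the runtime executes it on whatever devices are available.

    import tensorflow as tf

    # Express a small computation as a dataflow graph (TensorFlow 1.x style).
    x = tf.constant([[1.0, 2.0]])
    w = tf.constant([[3.0], [4.0]])
    y = tf.matmul(x, w)

    # The session executes the graph, handling device placement.
    with tf.Session() as sess:
        print(sess.run(y))  # -> [[11.]]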
Article
Full-text available
On September 14, 2015 at 09:50:45 UTC the two detectors of the Laser Interferometer Gravitational-Wave Observatory simultaneously observed a transient gravitational-wave signal. The signal sweeps upwards in frequency from 35 to 250 Hz with a peak gravitational-wave strain of 1.0 × 10^-21. It matches the waveform predicted by general relativity for the inspiral and merger of a pair of black holes and the ringdown of the resulting single black hole. The signal was observed with a matched-filter signal-to-noise ratio of 24 and a false alarm rate estimated to be less than 1 event per 203 000 years, equivalent to a significance greater than 5.1σ. The source lies at a luminosity distance of 410 (+160/-180) Mpc corresponding to a redshift z = 0.09 (+0.03/-0.04). In the source frame, the initial black hole masses are 36 (+5/-4) M_⊙ and 29 (+4/-4) M_⊙, and the final black hole mass is 62 (+4/-4) M_⊙, with 3.0 (+0.5/-0.5) M_⊙ c^2 radiated in gravitational waves. All uncertainties define 90% credible intervals. These observations demonstrate the existence of binary stellar-mass black hole systems. This is the first direct detection of gravitational waves and the first observation of a binary black hole merger.
Chapter
Large scale experimental science workflows require support for a unified, interactive, real-time platform that can manage a distributed set of resources connected to High Performance Computing (HPC) systems. What is needed is a tool that provides the ease-of-use and interactivity of a web science gateway, while providing the scientist the ability to build custom, ad-hoc workflows in a composable way. The Jupyter platform can play a key role here to enable the ingestion and analysis of real-time streaming data, integrate with HPC resources in a closed-loop, and enable interactive ad-hoc analyses with running workflows.
Article
An improved architecture and enthusiastic user base are driving uptake of the open-source web tool.
Conference Paper
The Minnesota Supercomputing Institute has implemented JupyterHub and the Jupyter notebook server as a general-purpose point-of-entry to interactive high performance computing services. This mode of operation runs counter to traditional job-oriented HPC operations, but offers significant advantages for ease-of-use, data exploration, prototyping, and workflow development. From the user perspective, these features bring the computing cluster nearer to parity with emerging cloud computing options. On the other hand, retreating from fully-scheduled, job-based resource allocation poses challenges for resource availability and utilization efficiency, and can involve tools and technologies outside the typical core competencies of a supercomputing center's operations staff. MSI has attempted to mitigate these challenges by adopting Jupyter as a common technology platform for interactive services, capable of providing command-line, graphical, and workflow-oriented access to HPC resources while still integrating with job scheduling systems and using existing compute resources. This paper describes the mechanisms that MSI has put in place, advantages for research and instructional uses, and lessons learned.
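One common way to integrate JupyterHub with a batch scheduler, shown here as a hedged sketch rather than MSI's actual configuration, is the batchspawner package, which submits each user's notebook server as a scheduler job (the partition, runtime, and memory values below are placeholders).

    # Sketch of a jupyterhub_config.py using batchspawner's SlurmSpawner;
    # the resource requests are placeholders, not MSI's settings.
    c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"
    c.SlurmSpawner.req_partition = "interactive"
    c.SlurmSpawner.req_runtime = "02:00:00"
    c.SlurmSpawner.req_memory = "4gb"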
Conference Paper
Supercomputing centers are seeing increasing demand for user-defined software stacks (UDSS), instead of or in addition to the stack provided by the center. These UDSS support user needs such as complex dependencies or build requirements, externally required configurations, portability, and consistency. The challenge for centers is to provide these services in a usable manner while minimizing the risks: security, support burden, missing functionality, and performance. We present Charliecloud, which uses the Linux user and mount namespaces to run industry-standard Docker containers with no privileged operations or daemons on center resources. Our simple approach avoids most security risks while maintaining access to the performance and functionality already on offer, doing so in just 800 lines of code. Charliecloud promises to bring an industry-standard UDSS user workflow to existing, minimally altered HPC resources.
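As a rough sketch of the runtime step only (command names vary across Charliecloud releases, and the image path is a placeholder): once a Docker image has been flattened into an unpacked directory tree, ch-run enters it using unprivileged user and mount namespaces.

    import subprocess

    # Illustrative sketch; /var/tmp/hello stands for an image directory
    # previously built with Docker and unpacked using Charliecloud's tools.
    # ch-run relies on user and mount namespaces only (no setuid, no daemons).
    subprocess.run(["ch-run", "/var/tmp/hello", "--",
                    "cat", "/etc/os-release"], check=True)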
Technical Report
TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.
Article
This paper describes DataONE, a federated data network that is being built to improve access to, and preserve data about, life on Earth and the environment that sustains it. DataONE supports science by: (1) engaging the relevant science, library, data, and policy communities; (2) facilitating easy, secure, and persistent storage of data; and (3) disseminating integrated and user-friendly tools for data discovery, analysis, visualization, and decision-making. The paper provides an overview of the DataONE architecture and community engagement activities. The role of identifiers in DataONE and the policies and procedures involved in data submission, curation, and citation are discussed for one of the affiliated data centers. Finally, the paper highlights EZID, a service that enables digital object producers to easily obtain and manage long-term identifiers for their digital content.
Article
Docker promises the ability to package applications and their dependencies into lightweight containers that move easily between different distros, start up quickly and are isolated from each other.
Article
As science becomes increasingly computational, reproducibility has become increasingly difficult, perhaps surprisingly. In many contexts, virtualization and cloud computing can mitigate the issues involved without significant overhead to the researcher, enabling the next generation of rigorous and reproducible computational science.
Article
Research and practice in digital libraries (DL) has exploded worldwide in the 1990s. Substantial research funding has become available, libraries are actively involved in DL projects and conferences, journals and online news lists proliferate. This article explores reasons for these developments and the influence of key players, while speculating on future directions. We find that the term 'digital library' is used in two distinct senses. In general, researchers view digital libraries as content collected on behalf of user communities, while practicing librarians view digital libraries as institutions or services. Tensions exist between these communities over the scope and concept of the term 'library'. Research-oriented definitions serve to build a community of researchers and to focus attention on problems to be addressed; these definitions have expanded considerably in scope throughout the 1990s. Library community definitions are more recent and serve to focus attention on practical challenges to be addressed in the transformation of research libraries and universities. Future trends point toward the need for extensive research in digital libraries and for the transformation of libraries as institutions. The present ambiguity of terminology is hindering the advance of research and practice in digital libraries and in our ability to communicate the scope and significance of our work.
Article
We introduce a set of integrated developments in web application software, networking, data citation standards, and statistical methods designed to put some of the universe of data and data sharing practices on somewhat firmer ground. We have focused on social science data, but aspects of what we have developed may apply more widely. The idea is to facilitate the public distribution of persistent, authorized, and verifiable data, with powerful but easy-to-use technology, even when the data are confidential or proprietary. We intend to solve some of the sociological problems of data sharing via technological means, with the result intended to benefit both the scientific community and the sometimes apparently contradictory goals of individual researchers.
Article
The revolution the Web has brought to information dissemination is due not so much to the availability of data (huge amounts of information have long been available in libraries) as to the improved efficiency of accessing that information. The Web promises to make more scientific articles more easily available. By making the context of citations easily and quickly browsable, autonomous citation indexing (ACI) can help to evaluate the importance of individual contributions more accurately and quickly. Digital libraries incorporating ACI can help organize scientific literature and may significantly improve the efficiency of dissemination and feedback. ACI may also help speed the transition to scholarly electronic publishing.
Understanding interactive and reproducible computing with Jupyter tools at facilities
  • D Paine
  • L Ramakrishnan
Kale: A system for enabling human-in-the-loop interactivity in HPC workflows
  • S Cholia
  • M Henderson
  • O Evans
  • F Pérez
Introduction to Tornado
  • M Dory
  • A Parrish
  • B Berg
Workflows for e-Science: scientific workflows for grids
  • I J Taylor
  • E Deelman
  • D B Gannon
  • M Shields
Burrito: Wrapping your lab notebook in computational infrastructure
  • P J Guo
  • M I Seltzer
Pangeo: a big-data ecosystem for scalable earth system science
  • J Hamman
  • M Rocklin
  • R Abernathey