Josh Milthorpe
PhD (Australian National University)
Oak Ridge National Laboratory | ORNL · Computer Science and Mathematics Division
About
40 Publications · 7,305 Reads
178 Citations
Introduction
My research interests lie in high-productivity, high-performance scientific programming, including parallel programming models, physical simulation, and numerical computing.
Additional affiliations
February 2017 - March 2022
May 2014 - December 2016
March 2013 - December 2013
Publications (40)
OpenCL is an attractive programming model for high-performance computing systems composed of heterogeneous compute devices, with wide support from hardware vendors allowing portability of application codes. For accelerator designers and HPC integrators, understanding the performance characteristics of scientific workloads is of utmost importance. H...
Driven by increasing core count and decreasing meantime to failure in supercomputers, HPC runtime systems must improve support for dynamic task-parallel execution and resilience to failures. The async-finish task model, adapted for distributed systems as the asynchronous partitioned global address space programming model, provides a simple way to d...
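The async-finish task model mentioned above can be illustrated with a small sketch. This is not the paper's runtime: it is a hypothetical mapping of `async` (spawn a child task) and `finish` (join all children spawned in a scope) onto Python's standard thread pool, with all names chosen for illustration.

```python
# Illustrative async-finish sketch using Python's standard library.
# 'Finish' and 'async_' are hypothetical names mirroring X10 terminology.
from concurrent.futures import ThreadPoolExecutor

class Finish:
    """A 'finish' scope: joins every 'async' task spawned inside it."""
    def __init__(self, executor):
        self._executor = executor
        self._futures = []

    def async_(self, fn, *args):
        # 'async': spawn a child task registered with this finish scope
        self._futures.append(self._executor.submit(fn, *args))

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        for f in self._futures:   # block until all children complete
            f.result()

# usage: sum the squares of 0..99 in four parallel chunks
partial = [0] * 4

def work(i):
    partial[i] = sum(x * x for x in range(i * 25, (i + 1) * 25))

with ThreadPoolExecutor(max_workers=4) as executor:
    with Finish(executor) as f:
        for i in range(4):
            f.async_(work, i)     # tasks run concurrently...
total = sum(partial)              # ...and are guaranteed done here
```

The key property sketched here is the one the abstract relies on: leaving the finish scope is a synchronization point for every task transitively spawned within it.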
SYCL is a single-source programming model for heterogeneous systems; it promises improved maintainability, productivity, and opportunity for compiler optimization, when compared to accelerator specific programming models. Several implementations of the SYCL standard have been developed over the past few years, including several backends using conte...
A performance-portable application can run on a variety of different hardware platforms, achieving an acceptable level of performance without requiring significant rewriting for each platform. Several performance-portable programming models are now suitable for high-performance scientific application development, including OpenMP and Kokkos. Chapel...
Most contemporary HPC programming models assume an inelastic runtime in which the resources allocated to an application remain fixed throughout its execution. Conversely, elastic runtimes can expand and shrink resources based on availability and/or dynamic application requirements. In this paper, we implement elasticity for PaRSEC, a task-based dat...
We demonstrate Dagster, a system that implements a new approach to scheduling interdependent (Boolean) SAT search activities in high-performance computing (HPC) environments. Our system takes as input a set of disjunctive clauses (i.e., DIMACS CNF) and a labelled directed acyclic graph (DAG) structure describing how the clauses are decomposed into...
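The scheduling idea described above — activities constrained by a labelled DAG — can be sketched as a plain topological dispatcher. This is an illustration of DAG-driven scheduling in general, not Dagster's actual API; all names are hypothetical.

```python
# Hypothetical sketch of DAG-driven activity scheduling: a node may run
# only after all of its predecessors have completed.
from collections import defaultdict, deque

def schedule(dag, run):
    """dag: {node: [predecessors]}; run(node) executes one activity.
    Returns the order in which activities were dispatched."""
    indegree = {n: len(preds) for n, preds in dag.items()}
    successors = defaultdict(list)
    for n, preds in dag.items():
        for p in preds:
            successors[p].append(n)
    ready = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        n = ready.popleft()
        run(n)                    # execute this search activity
        order.append(n)
        for s in successors[n]:   # release any now-unblocked successors
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return order

# diamond-shaped decomposition: root splits, results combine at 'join'
dag = {"root": [], "left": ["root"], "right": ["root"],
       "join": ["left", "right"]}
order = schedule(dag, run=lambda n: None)
```

In a real HPC deployment the `run` callback would dispatch a SAT search activity to a worker core rather than executing inline.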
The task-based dataflow programming model has emerged as an alternative to the process-centric programming model for extreme-scale applications. However, load balancing is still a challenge in task-based dataflow runtimes. In this paper, we present extensions to the PaRSEC runtime to demonstrate that distributed work stealing is an effective load-b...
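The work-stealing policy referred to above follows a well-known pattern: each worker pops tasks from the tail of its own deque and, when idle, steals from the head of a victim's deque. The sketch below is a single-threaded simulation of that pattern for clarity; it is not the PaRSEC implementation.

```python
# Minimal work-stealing sketch: local pops from the tail, steals from
# the head of a randomly chosen victim. Single-threaded simulation.
import random
from collections import deque

def run_with_stealing(queues, execute, rng=random.Random(0)):
    """queues: one deque of tasks per worker.
    Returns the list of tasks each worker ended up executing."""
    executed = [[] for _ in queues]
    while any(queues):
        for wid, q in enumerate(queues):
            if q:
                task = q.pop()                        # local pop (tail)
            else:
                victims = [v for v in queues if v]
                if not victims:
                    break                             # nothing left to steal
                task = rng.choice(victims).popleft()  # steal (head)
            executed[wid].append(task)
            execute(task)
    return executed

# worker 0 starts with all the work; stealing spreads it around
queues = [deque(range(8)), deque(), deque()]
done = run_with_stealing(queues, execute=lambda t: None)
```

Popping locally from one end and stealing from the other minimizes contention between an owner and its thieves — the property that makes this an effective load-balancing mechanism in task-based runtimes.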
We describe Dagster, a system that implements a new approach to scheduling interdependent (Boolean) SAT search activities in high-performance computing (HPC) environments. This system allows practitioners to solve challenging problems by efficiently distributing search effort across computing cores in a customizable way. Our solver takes as input a...
High-performance computing developers are faced with the challenge of optimizing the performance of OpenCL workloads on diverse architectures. The Architecture-Independent Workload Characterization (AIWC) tool is a plugin for the Oclgrind OpenCL simulator that gathers metrics of OpenCL programs that can be used to understand and predict program per...
Cloud computing has made the resources needed to execute large-scale in-memory distributed computations widely available. Specialized programming models, e.g., MapReduce, have emerged to offer transparent fault tolerance and fault recovery for specific computational patterns, but they sacrifice generality. In contrast, the Resilient X10 programming...
Measuring performance-critical characteristics of application workloads is important both for developers, who must understand and optimize the performance of codes, as well as designers and integrators of HPC systems, who must ensure that compute architectures are suitable for the intended workloads. However, if these workload characteristics are t...
OpenCL is an attractive model for heterogeneous high-performance computing systems, with wide support from hardware vendors and significant performance portability. To support efficient scheduling on HPC systems it is necessary to perform accurate performance predictions for OpenCL workloads on varied compute devices, which is challenging due to...
For reasons of both performance and energy efficiency, high performance computing (HPC) hardware is becoming increasingly heterogeneous. The OpenCL framework supports portable programming across a wide range of computing devices and is gaining influence in programming next-generation accelerators. To characterize the performance of these devices ac...
The Architecture Independent Workload Characterization (AIWC) tool was introduced to collect and benchmark accelerator performance. The AIWC tool is intended to facilitate high-performance computing (HPC) systems, which are becoming increasingly heterogeneous at the node level. This heterogeneity is evidenced by cutting-edge systems with fast inter...
Low-power system-on-chip (LPSoC) processors provide an interesting alternative as building blocks for future HPC systems due to their high energy efficiency. However, understanding their performance-energy trade-offs and minimizing the energy-to-solution for an application running across the heterogeneous devices of an LPSoC remains a challenge. In...
The X10 programming language offers a simple but expressive model of concurrency and distribution. Domain Specific Languages embedded in X10 (eDSL) can build upon this model to offer scheduling and placement facilities tailored to particular patterns of applications, e.g. stencils or graph traversals. They exploit X10's rich type system and closure...
Many PGAS languages and libraries rely on high performance transport layers such as GASNet and MPI to achieve low communication latency, portability and scalability. As systems increase in scale, failures are expected to become normal events rather than exceptions. Unfortunately, GASNet and standard MPI do not provide fault tolerance capabilities....
The Asynchronous Partitioned Global Address Space (APGAS) programming model enables programmers to express the parallelism and locality necessary for high performance scientific applications on extreme-scale systems. We used the well-known LULESH hydrodynamics proxy application to explore the performance and programmability of the APGAS model as ex...
X10 programs have achieved high efficiency on petascale clusters by making significant use of parallelism between places, however, there has been less focus on exploiting local parallelism within a place. This paper introduces a standard mechanism - foreach - for efficient local parallel iteration in X10, including support for worker-local data. Li...
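The combination the abstract describes — parallel iteration with worker-local data, combined once at the end — can be sketched in Python's standard library. This is an illustration of the pattern, not X10's `foreach`; note that CPython threads will not actually speed up CPU-bound work, so this shows semantics rather than performance.

```python
# Sketch of parallel iteration with worker-local accumulators: each
# worker thread sums into its own local cell, and the locals are
# combined once after the loop. Illustrative names throughout.
import threading
from concurrent.futures import ThreadPoolExecutor

def foreach_sum(items, f, workers=4):
    local = threading.local()
    locals_seen = []
    lock = threading.Lock()

    def body(x):
        if not hasattr(local, "acc"):
            local.acc = [0]              # worker-local accumulator
            with lock:                   # register it exactly once
                locals_seen.append(local.acc)
        local.acc[0] += f(x)             # no contention on the hot path

    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(body, items))        # parallel iteration
    return sum(acc[0] for acc in locals_seen)   # combine worker locals

total = foreach_sum(range(100), lambda x: x)
```

The design point is the same one worker-local data serves in the paper: the per-element update touches only thread-private state, so no lock is needed inside the loop body.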
APGAS (Asynchronous Partitioned Global Address Space) is a model for concurrent and distributed programming, known primarily as the foundation of the X10 programming language. In this paper, we present an implementation of this model as an embedded domain-specific language for Scala. We illustrate common usage patterns and contrast with alternative...
High performance computing is a key technology that enables large-scale physical simulation in modern science. While great advances have been made in methods and algorithms for scientific computing, the most commonly used programming models encourage a fragmented view of computation that maps poorly to the underlying computer architecture. Scienti...
The Global Matrix Library (GML) is a distributed matrix library in the X10 language. GML is designed to simplify the development of scalable linear algebra applications. By hiding the communication and parallelism details, GML programs are written in a sequential style that is easy for non-expert programmers to use and understand. Resilience is beco...
Use of the modern parallel programming language X10 for computing long-range Coulomb and exchange interactions is presented. By using X10, a partitioned global address space language with support for task parallelism and the explicit representation of data locality, the resolution of the Ewald operator can be parallelized in a straightforward manne...
Associated Legendre polynomials and spherical harmonics are central to calculations in many fields of science and mathematics - not only chemistry but computer graphics, magnetics, seismology and geodesy. There are a number of algorithms for these functions published since 1960 but none of them satisfy our requirements. In this paper, we present a c...
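As background to the abstract above, the textbook way to evaluate these functions is the standard upward recurrence in degree. The sketch below shows that classical recurrence (it is not the paper's algorithm, which addresses requirements the classical forms do not meet); the Condon-Shortley phase is included.

```python
# Classical three-term recurrence for the unnormalized associated
# Legendre polynomial P_l^m(x), stable for modest l. Illustrative only.
import math

def assoc_legendre(l, m, x):
    """P_l^m(x) for 0 <= m <= l and |x| <= 1 (Condon-Shortley phase)."""
    # seed: P_m^m = (-1)^m (2m-1)!! (1 - x^2)^(m/2)
    pmm = 1.0
    somx2 = math.sqrt((1.0 - x) * (1.0 + x))
    fact = 1.0
    for _ in range(m):
        pmm *= -fact * somx2
        fact += 2.0
    if l == m:
        return pmm
    # next degree: P_{m+1}^m = x (2m+1) P_m^m
    pmmp1 = x * (2 * m + 1) * pmm
    if l == m + 1:
        return pmmp1
    # upward recurrence in l:
    # (l - m) P_l^m = x (2l - 1) P_{l-1}^m - (l + m - 1) P_{l-2}^m
    for ll in range(m + 2, l + 1):
        pmm, pmmp1 = pmmp1, (x * (2 * ll - 1) * pmmp1
                             - (ll + m - 1) * pmm) / (ll - m)
    return pmmp1
```

For example, P_2^0(x) = (3x^2 - 1)/2, so assoc_legendre(2, 0, 0.5) evaluates to -0.125.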
Effective support for array-based programming has long been one of the central design concerns of the X10 programming language. After significant research and exploration, X10 has adopted an approach based on providing arrays via user definable and extensible class libraries. This paper surveys the range of array abstractions available to the progr...
The fast multipole method (FMM) is a complex, multi-stage algorithm over a distributed tree data structure, with multiple levels of parallelism and inherent data locality. X10 is a modern partitioned global address space language with support for asynchronous activities. The parallel tasks comprising FMM may be expressed in X10 using a scalable pat...
Use of the resolution of Ewald operator method for computing long-range Coulomb and exchange interactions is presented. We show that the accuracy of this method can be controlled by a single parameter in a manner similar to that used by conventional algorithms that compute two-electron integrals. Significant performance advantages over conventional...
The use of ghost regions is a common feature of many distributed grid applications. A ghost region holds local read-only copies of remotely-held boundary data which are exchanged and cached many times over the course of a computation. X10 is a modern parallel programming language intended to support productive development of distributed application...
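The ghost-region pattern the abstract describes can be shown in one dimension: each subdomain keeps a read-only ghost cell at each end, refreshed from its neighbours' boundary cells before each stencil sweep. The sketch below is a shared-memory toy with periodic boundaries, not the paper's distributed X10 implementation.

```python
# 1-D ghost-region exchange sketch. Each subdomain is a list shaped
# [ghost_lo, interior..., ghost_hi]; ghosts are copies of neighbour
# boundary cells, refreshed before each computation step.
def exchange_ghosts(subdomains):
    """Refresh both ghost cells of every subdomain (periodic domain)."""
    n = len(subdomains)
    for i, s in enumerate(subdomains):
        left = subdomains[(i - 1) % n]
        right = subdomains[(i + 1) % n]
        s[0] = left[-2]    # low ghost <- neighbour's last interior cell
        s[-1] = right[1]   # high ghost <- neighbour's first interior cell

# two subdomains over the periodic array [1, 2, 3, 4]
a = [0, 1, 2, 0]   # ghosts initialised to 0
b = [0, 3, 4, 0]
exchange_ghosts([a, b])
```

After the exchange, `a` holds [4, 1, 2, 3] and `b` holds [2, 3, 4, 1]: each subdomain can now apply a nearest-neighbour stencil to its interior without touching remote data, which is exactly why the exchange is worth optimizing when it is repeated many times over a computation.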
X10 is an emerging Partitioned Global Address Space (PGAS) language intended to increase significantly the productivity of developing scalable HPC applications. The language has now matured to a point where it is meaningful to consider writing large scale scientific application codes in X10. This paper reports our experiences writing three codes fr...
Interval arithmetic is an alternative computational paradigm that enables arithmetic operations to be performed with guaranteed error bounds. In this paper, interval arithmetic is used to compare the accuracy of various methods for computing the electrostatic energy for a system of point charges. A number of summation approaches that scale as O(N...
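The guaranteed-bounds idea behind the interval papers above can be sketched with outward rounding: after each operation, the lower bound is nudged down one ulp and the upper bound up one ulp, so the true real result always stays enclosed. This is an illustration only, not the implementation used in the papers.

```python
# Interval arithmetic sketch with outward rounding via math.nextafter
# (Python 3.9+). Every operation widens the result by one ulp per
# bound, so the exact real-valued result is always enclosed.
import math

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # round bounds outward so the true sum is always enclosed
        return Interval(math.nextafter(self.lo + other.lo, -math.inf),
                        math.nextafter(self.hi + other.hi, math.inf))

    def __mul__(self, other):
        # the product's range is spanned by the four endpoint products
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(math.nextafter(min(products), -math.inf),
                        math.nextafter(max(products), math.inf))

    def __contains__(self, x):
        return self.lo <= x <= self.hi

# 0.1 is not exactly representable in binary floating point, yet the
# interval for ten additions of it still encloses the true value 1.0
tenth = Interval(0.1, 0.1)
one = tenth + tenth + tenth + tenth + tenth + \
      tenth + tenth + tenth + tenth + tenth
```

The cost visible even in this toy — two rounded evaluations (or four endpoint products) per operation — is the performance overhead the abstracts above set out to measure.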
Interval analysis is an alternative to conventional floating-point computations that offers guaranteed error bounds. Despite this advantage, interval methods have not gained widespread use in large scale computational science applications. This paper addresses this issue from a performance perspective, comparing the performance of floating point an...
Interval analysis is an alternative to conventional floating-point computation that offers guaranteed error bounds. Despite this advantage, interval methods have rarely been applied in high performance scientific computing. In part, this is because of the additional cost associated with performing interval operations over the corresponding floating...
In scientific computing, the approximation of continuous physical phenomena by floating-point numbers gives rise to rounding error. The behaviour of rounding errors is difficult to predict, and most scientific applications ignore it. For applications in which the accuracy of the result is critical, this is not an acceptable choice. Interval analys...