About
108 Publications
7,753 Reads
1,329 Citations
Additional affiliations
March 2015 - present
February 2013 - February 2015
August 2008 - December 2012
Publications (108)
We present a randomized differential testing approach to test OpenMP implementations. In contrast to previous work that manually creates dozens of verification and validation tests, our approach is able to randomly generate thousands of tests, exposing OpenMP implementations to a wide range of program behaviors. We represent the space of possible r...
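To make the differential-testing idea above concrete, the C++ sketch below runs the same reduction serially and under OpenMP and compares the results; it is only an illustration of the kind of check a generated test performs, not the paper's random program generator, and the function names are hypothetical.

```cpp
// Illustrative differential check: compute the same reduction serially and
// with OpenMP, then compare. A randomized tester would generate many such
// kernels and run them against different OpenMP implementations.
#include <cstdio>
#include <cmath>
#include <vector>

double serial_sum(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

double openmp_sum(const std::vector<double>& v) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (long i = 0; i < (long)v.size(); ++i) s += v[i];
    return s;
}

int main() {
    std::vector<double> v(1 << 20, 0.1);
    double a = serial_sum(v), b = openmp_sum(v);
    // A difference beyond a small tolerance flags behavior worth inspecting,
    // either in the OpenMP implementation or in the test's own assumptions.
    printf("serial=%.17g openmp=%.17g diff=%.3g\n", a, b, std::fabs(a - b));
    return 0;
}
```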
As scientific codes are ported between GPU platforms, continuous testing is required to ensure numerical robustness and identify numerical differences. Compiler-induced numerical differences occur when a program is compiled and run on different GPUs, and the numerical outcomes are different for the same input. We present a study of compiler-induced...
The Open MPI for Exascale (OMPI-X) project was one of two in the Exascale Computing Project (ECP) focused on advancing the MPI ecosystem. The OMPI-X team worked with other MPI Forum members to champion several important features for inclusion in MPI standard versions 4.0, 4.1, and the upcoming 5.0, in support of the needs of exascale applicatio...
In this report, we describe the current capabilities and research needs related to formal methods at the NNSA labs. In particular, we identify medium-term and long-term research gaps in programming languages, formalization efforts of complex systems, embedded systems verification, hardware verification, cybersecurity, formal methods usability, work...
The emergence of multiple resource management systems, such as SLURM and Kubernetes, for different computational purposes has led to a desire to support a single workflow that spans multiple resource management domains, which can include multiple HPC, edge, and cloud systems across different network domains. Best-of-class tools developed in one domain oft...
As the demand for developing and porting numerical applications to heterogeneous computing platforms increases, such programs may exhibit numerical inconsistencies caused by architectural differences and aggressive compiler optimizations. These numerical inconsistencies can negatively impact reproducibility and debugging. This paper presents Ciel,...
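A minimal, self-contained example of the inconsistency class described above (not Ciel itself): whether a compiler contracts a multiply-add into a fused multiply-add instruction changes how the result is rounded.

```cpp
// Compiler-induced numerical difference in miniature: a*3.0 - 1.0 computed
// with two roundings versus one fused rounding.
#include <cstdio>
#include <cmath>

int main() {
    double a = 1.0 / 3.0;
    volatile double product = a * 3.0;          // forced to round on its own
    double separate = product - 1.0;            // two roundings: exactly 0.0
    double fused    = std::fma(a, 3.0, -1.0);   // one rounding: about -5.6e-17
    // Whether a compiler contracts a*3.0 - 1.0 into an FMA depends on the
    // target architecture and flags, which is one source of such differences.
    printf("separate=%.17g fused=%.17g\n", separate, fused);
    return 0;
}
```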
As we reach the limits of Moore’s law and the end of Dennard scaling, increased emphasis is being given to alternative system architectures and computing paradigms to achieve future performance improvements. Approximate computing, where accuracy is traded for performance, is a promising approach in the post-Moore era. However, for approximate compu...
The Better Scientific Software Fellowship (BSSwF) was launched in 2018 to foster and promote practices, processes, and tools to improve developer productivity and software sustainability of scientific codes. The BSSwF vision is to grow the community with practitioners, leaders, mentors, and consultants to increase the visibility of scientific software....
In recent decades, High Performance Computing (HPC) and simulations have become essential in many areas of engineering and science. Since many HPC applications rely extensively on floating-point arithmetic operations to solve computational problems, many kinds of numerical errors can be introduced during program execution, leading to instabil...
The traditional method to study application resilience to errors in HPC applications uses fault injection (FI), a time-consuming approach. While analytical models have been built to overcome the inefficiencies of FI, they lack accuracy. In this paper, we present PARIS, a machine-learning method to predict application resilience that avoids the time...
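For context on the fault-injection baseline mentioned above, the toy C++ sketch below injects a single bit flip into one intermediate value and checks whether the output changes; it is illustrative only (PARIS replaces such campaigns with a learned model), and the helper shown is hypothetical.

```cpp
// Toy single-bit-flip fault injection: corrupt one intermediate value and
// compare the final output against a clean run.
#include <cstdio>
#include <cstring>
#include <cstdint>

double flip_bit(double x, int bit) {
    uint64_t u;
    std::memcpy(&u, &x, sizeof u);
    u ^= (uint64_t)1 << bit;          // flip one bit of the representation
    std::memcpy(&x, &u, sizeof u);
    return x;
}

int main() {
    double clean = 0.0, faulty = 0.0;
    for (int i = 1; i <= 100; ++i) {
        double term = 1.0 / i;
        clean  += term;
        faulty += (i == 50) ? flip_bit(term, 52) : term;  // inject at iteration 50
    }
    printf("clean=%.15f faulty=%.15f corrupted output: %s\n",
           clean, faulty, clean == faulty ? "no" : "yes");
    return 0;
}
```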
MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on thousands of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effecti...
Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deploying an application incurs overhead by tearing do...
Program synthesis is an active research field in academia, national labs, and industry. Yet, work directly applicable to scientific computing, while having some impressive successes, has been limited. This report reviews the relevant areas of program synthesis work for scientific computing, discusses successes to date, and outlines opportunities fo...
An approach to reproducibility problems related to porting software across machines and compilers.
Compilers optimize OpenMP programs differently than their serial elision. Early outlining of parallel regions and invocation of parallel code via OpenMP runtime functions are two of the most profound differences. Understanding the interplay between compiler optimizations, OpenMP compilation, and application performance is hard and usually requires...
The Exascale Computing Project (ECP) focuses on the development of future exascale‐capable applications. Most ECP applications use the message passing interface (MPI) as their parallel programming model with mini‐apps serving as proxies. This paper explores the explicit usage of MPI in such ECP proxy applications. We empirically analyze 14 proxy ap...
Communication traces collected from MPI applications are an important source of information for performance optimization as they can help analysts determine communication patterns and identify inefficiencies. However, their collection, especially at scale, is time consuming, since it usually requires running the complete target application on a lar...
Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deploying an application incurs overhead by tearing dow...
The difficulty of deep experimentation with Message Passing Interface (MPI) implementations—which are quite large and complex—substantially raises the cost and complexity of proof-of-concept activities and limits the community of potential contributors to new and better MPI features and implementations alike. Our goal is to enable researchers to ex...
Understanding the state-of-the-practice in MPI usage is paramount for many aspects of supercomputing, including optimizing the communication of HPC applications and informing standardization bodies and HPC systems procurements regarding the most important MPI features. Unfortunately, no previous study has characterized the use of MPI on application...
Floating-point arithmetic is widely used in applications from several fields including scientific computing, machine learning, graphics, and finance. Many of these applications are rapidly adopting the use of GPUs to speedup computations. GPUs, however, have limited support to detect floating-point exceptions, which hinders the development of relia...
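As a point of reference for the exception classes involved, the host-side C++ sketch below raises and tests overflow and invalid-operation flags with the standard <cfenv> facilities; detecting the same conditions inside GPU kernels is the gap the summary refers to, and is not shown here.

```cpp
// Host-side view of floating-point exceptions: trigger overflow and invalid
// operations, then query the status flags.
#include <cfenv>
#include <cstdio>

int main() {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double big = 1e308, zero = 0.0;
    volatile double inf          = big * 10.0;   // overflow: result is +infinity
    volatile double not_a_number = zero / zero;  // invalid: result is NaN
    if (std::fetestexcept(FE_OVERFLOW)) printf("FE_OVERFLOW raised (%g)\n", (double)inf);
    if (std::fetestexcept(FE_INVALID))  printf("FE_INVALID raised (%g)\n", (double)not_a_number);
    return 0;
}
```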
Mixed precision computations improve high performance computing throughput for applications that can tolerate decreased mathematical precision in their computations. Native mixed precision computation is commonplace in today's GPGPU accelerators where it is applied to applications with well-known tolerances for reduced mathematical precision. Appli...
Successful HPC software applications are long-lived. When ported across machines and their compilers, these applications often produce different numerical results, many of which are unacceptable. Such variability is also a concern while optimizing the code more aggressively to gain performance. Efficient tools that help locate the program units (fi...
As fault recovery mechanisms become increasingly important in HPC systems, the need for a new recovery model for workflows on these systems grows as well. While the traditional approach in which each system component attempts its own independent recovery after a fault works well at each individual application level, this model does not scale to the...
A technique to perform floating-point mixed-precision tuning on GPU applications.
We present GPUMixer, a tool to perform mixed-precision floating-point tuning on scientific GPU applications. While precision tuning techniques are available, they are designed for serial programs and are accuracy-driven, i.e., they consider configurations that satisfy accuracy constraints, but these configurations may degrade performance. GPUMixer,...
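A small illustration of the precision/accuracy trade-off that mixed-precision tuning navigates, shown CPU-side with hypothetical names rather than GPUMixer's own analysis: the same accumulation carried out in float and in double.

```cpp
// The same reduction in float vs. double: the float result stagnates once
// terms drop below single precision, which a tuner must weigh against the
// performance gained from lower precision.
#include <cstdio>

template <typename T>
T harmonic_sum(long n) {
    T s = 0;
    for (long i = 1; i <= n; ++i) s += T(1) / T(i);
    return s;
}

int main() {
    long n = 10'000'000;
    double d = harmonic_sum<double>(n);
    float  f = harmonic_sum<float>(n);
    printf("double=%.10f float=%.10f abs diff=%.3e\n", d, (double)f, d - (double)f);
    return 0;
}
```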
Large scientific simulations must be able to achieve the full-system potential of supercomputers. When they tap into high-performance features, however, a phenomenon known as non-determinism may be introduced in their program execution, which significantly hampers application development. Pruners is a new toolset to detect and remedy non-determinis...
When an MPI program experiences a failure, the most common recovery approach is to restart all processes from a previous checkpoint and to re-queue the entire job. A disadvantage of this method is that, although the failure occurred within the main application loop, live processes must start again from the beginning of the program, along with new r...
Extreme-scale scientific applications can be more vulnerable to soft errors (transient faults) as high-performance computing systems increase in scale. The common practice to evaluate the resilience to faults of an application is random fault injection, a method that can be highly time consuming. While resilience prediction modeling has been recent...
As high-performance computing systems scale in size and computational power, the danger of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact application state, grows dramatically. Consequently, applications running on HPC systems need to exhibit resilience to such errors. Previous work has found that, for certain...
Floating-point arithmetic is the computational foundation of numerical scientific software. Compiler optimizations that affect floating-point arithmetic can have a significant impact on the integrity, reproducibility, and performance of HPC scientific applications. Unfortunately, the interplay between compiler-induced variability and runtime per...
Scientists from many different fields have been developing Bulk‐Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger‐scale future HPC systems, providing efficient fault‐tolerance mechanisms for this class of applications is paramount. The global‐restart mod...
The running time of GPU kernels depends on an invocation parameter, the number of threads in each thread block. Sometimes the dependence is quite strong, leading to a 50–100% change in execution time for long-running kernels. Until now, it has been an art form to decide on the optimal setting for this parameter. NVIDIA provides a tool for CUDA kernels,...
Compiler-based fault injection (FI) has become a popular technique for resilience studies to understand the impact of soft errors in supercomputing systems. Compiler-based FI frameworks inject faults at a high intermediate-representation level. However, they are less accurate than binary-level, machine-code FI because they lack access to all dynami...
Understanding DRAM errors in high-performance computing (HPC) clusters is paramount to address future HPC resilience challenges. While there have been studies on this topic, previous work has focused on on-node and single-rack characteristics of errors; conversely, few studies have presented insights into the spatial behavior of DRAM errors across...
Maintaining leadership in HPC requires the ability to support simulations at large scales and fidelity. In this study, we detail one of the most significant productivity challenges in achieving this goal, namely the increasing proclivity to bugs, especially in the face of growing hardware and software heterogeneity and sheer system scale. We identi...
Debugging intermittently occurring bugs within MPI applications is challenging, and message races, a condition in which two or more sends race to match with a receive, are one of the common root causes. Many debugging tools have been proposed to help programmers resolve them, but their runtime interference perturbs the timing such that subtle races...
We present a technique to pinpoint scale-dependent integer overflow bugs, a class of bugs in large-scale parallel applications that is hard and time-consuming to detect manually. Rather than detecting integer overflows when applications are deployed at large scale, as existing techniques do, our method forecasts these overflows without requiring th...
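The sketch below shows the bug class in isolation, with hypothetical variable names: a product that fits comfortably in 32 bits during small test runs but exceeds them at production scale.

```cpp
// Scale-dependent integer overflow: the true total needs more than 32 bits
// once the run is large enough, so storing it in a 32-bit int corrupts it.
#include <cstdio>
#include <cstdint>

int main() {
    int64_t num_ranks      = 16384;     // small in test runs, large in production
    int64_t elems_per_rank = 262144;
    int64_t exact   = num_ranks * elems_per_rank;   // 2^32: does not fit in 32 bits
    int32_t trunc32 = (int32_t)exact;               // what a 32-bit 'int' would hold (wraps to 0 here)
    printf("64-bit total=%lld  value seen through 32-bit int=%d\n",
           (long long)exact, trunc32);
    return 0;
}
```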
With complex codes moving to systems of greater on-node parallelism using OpenMP, debugging these codes is becoming increasingly challenging. While debuggers can significantly aid programmers, OpenMP support within existing debuggers is either largely ineffective or unsustainable. The OpenMP tools working group is working to specify a debugging int...
Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and...
This paper presents IPAS, an instruction duplication technique that protects scientific applications from silent data corruption (SDC) in their output. The motivation for IPAS is that, due to natural error masking, only a subset of SDC errors actually affects the output of scientific codes—we call these errors silent output corruption (SOC) errors....
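As a rough illustration of the duplicate-and-compare idea behind instruction duplication, the standalone sketch below (with a hypothetical helper) shows only the shape of the check; IPAS itself inserts the duplicates selectively at the compiler level, which this does not model.

```cpp
// Duplicate-and-compare: compute a value twice and flag a mismatch, which in
// the presence of a transient hardware fault would indicate likely corruption.
#include <cstdio>

double checked_mul(double a, double b, bool* suspected_corruption) {
    double r1 = a * b;
    double r2 = a * b;                     // duplicated computation
    *suspected_corruption = (r1 != r2);    // mismatch => suspected soft error
    return r1;
}

int main() {
    bool bad = false;
    double r = checked_mul(3.0, 7.0, &bad);
    printf("result=%g suspected corruption=%s\n", r, bad ? "yes" : "no");
    return 0;
}
```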
The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM; yet questions related to its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this...
The ability to record and replay program execution helps significantly in debugging non-deterministic MPI applications by reproducing message-receive orders. However, the large amount of data that traditional record-and-replay techniques record precludes their practical applicability to massively parallel applications. In this paper, we propose a new...
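To illustrate what gets recorded, the MPI sketch below logs the matched sender of each wildcard receive on rank 0; a replay run would force the same match order. This is a minimal illustration of the record step, not the paper's compressed encoding.

```cpp
// Record step of record-and-replay: log the actual source matched by each
// MPI_ANY_SOURCE receive so a later run can reproduce the same order.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        std::vector<int> order;
        for (int i = 1; i < size; ++i) {
            int payload;
            MPI_Status st;
            // Nondeterministic match: any sender may arrive first.
            MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            order.push_back(st.MPI_SOURCE);   // the "record" of this run
        }
        for (int s : order) printf("matched sender %d\n", s);
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```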
With complex codes moving to systems of increasing on-node parallelism using OpenMP, debugging these codes is becoming increasingly challenging. While debuggers can significantly aid programmers, existing ones support OpenMP at a low system-thread level, reducing their effectiveness. The previously published draft for a standard OpenMP debugging in...
Experts state that dynamic analysis techniques help programmers find the root cause of bugs in large-scale high-performance computing (HPC) parallel applications. These applications can run detailed numerical simulations that model the real world. The numerical correctness and software reliability of these applications are a major concern for...
Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR's interfaces to distributed arrays, versionin...
The User Level Failure Mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in MPI. Previous work has presented performance evaluations of the interface; yet questions related to its programmability and applicability remain unanswered. In this paper, we present our experience using ULFM in a case study (a large molec...
Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, and thus a fault in one task can quickly propagate to other tasks, making it difficult to debug. Finding the least-progressed tasks can significantly reduce the effort to identify the task where the fault originate...
Neither static nor dynamic data race detection methods, by themselves, have proven to be sufficient for large HPC applications, as they often result in high runtime overheads and/or low race-checking accuracy. While combined static and dynamic approaches can fare better, creating such combinations, in practice, requires attention to many details. S...
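For reference, the defect class both kinds of analysis target looks like the following OpenMP fragment, where an unsynchronized shared update constitutes a data race; real cases in large HPC codes are far less obvious than this illustrative one.

```cpp
// A textbook OpenMP data race: multiple threads increment a shared counter
// without synchronization, so updates can be lost.
#include <cstdio>

int main() {
    long counter = 0;
    #pragma omp parallel for
    for (long i = 0; i < 1000000; ++i)
        counter++;   // racy when compiled with OpenMP: unsynchronized shared update

    // A correct version would use '#pragma omp atomic' or a reduction clause.
    printf("counter=%ld (expected 1000000)\n", counter);
    return 0;
}
```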
Debugging large-scale parallel applications is challenging. Most existing techniques provide little information about