Ignacio Laguna

Ignacio Laguna
Lawrence Livermore National Laboratory | LLNL · Center for Applied Scientific Computing

Ph.D.

About

108
Publications
7,753
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
1,329
Citations
Additional affiliations
March 2015 - present
Lawrence Livermore National Laboratory
Position
  • Computer Scientist
February 2013 - February 2015
Lawrence Livermore National Laboratory
Position
  • PostDoc Position
August 2008 - December 2012
Purdue University West Lafayette
Position
  • Research Assistant

Publications

Publications (108)
Preprint
Full-text available
We present a randomized differential testing approach to test OpenMP implementations. In contrast to previous work that manually creates dozens of verification and validation tests, our approach is able to randomly generate thousands of tests, exposing OpenMP implementations to a wide range of program behaviors. We represent the space of possible r...
Preprint
Full-text available
As scientific codes are ported between GPU platforms, continuous testing is required to ensure numerical robustness and identify numerical differences. Compiler-induced numerical differences occur when a program is compiled and run on different GPUs, and the numerical outcomes are different for the same input. We present a study of compiler-induced...
Article
The Open MPI for Exascale (OMPI-X) project was one of two in the Exascale Computing Project (ECP) focused on advancing the MPI ecosystem. The OMPI-X team worked with other MPI Forum members to champion several important features for inclusion in the MPI 4.0, 4.1, and upcoming 5.0 MPI standard versions, in support of the needs of exascale applicatio...
Technical Report
Full-text available
In this report, we describe the current capabilities and research needs related to formal methods at the NNSA labs. In particular, we identify medium-term and long-term research gaps in programming languages, formalization efforts of complex systems, embedded systems verification, hardware verification, cybersecurity, formal methods usability, work...
Chapter
The emergence of multiple resource management systems, such as SLURM and Kubernetes, for different computational purposes has led to a desire to support a single workflow that spans multiple resource management domains, which can include multiple HPCs, edges, and cloud, over different network domains. Best-of-class tools developed in one domain oft...
Chapter
As the demand for developing and porting numerical applications to heterogeneous computing platforms increases, such programs may exhibit numerical inconsistencies caused by architectural differences and aggressive compiler optimizations. These numerical inconsistencies can negatively impact reproducibility and debugging. This paper presents Ciel,...
Article
As we reach the limits of Moore’s law and the end of Dennard scaling, increased emphasis is being given to alternative system architectures and computing paradigms to achieve future performance improvements. Approximate computing, where accuracy is traded for performance, is a promising approach in the post-Moore era. However, for approximate compu...
Article
Full-text available
The Better Scientific Software Fellowship (BSSwF) was launched in 2018 to foster and promote practices, processes, and tools to improve developer productivity and software sustainability of scientific codes. BSSwF vision is to grow the community with practitioners, leaders, mentors, and consultants to increase the visibility of scientific software....
Conference Paper
In recent decades, High Performance Computing (HPC) and simulations have become determinant in many areas of engineering and science. Since many HPC applications rely extensively on floating-point arithmetic operations to solve computational problems, many kinds of numerical errors can be introduced during the program execution, leading to instabil...
Article
The traditional method to study application resilience to errors in HPC applications uses fault injection (FI), a time-consuming approach. While analytical models have been built to overcome the inefficiencies of FI, they lack accuracy. In this paper, we present PARIS, a machine-learning method to predict application resilience that avoids the time...
Preprint
Full-text available
MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effecti...
Preprint
Full-text available
Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest check-point. However, re-deploying an application incurs overhead by tearing do...
Preprint
Full-text available
Program synthesis is an active research field in academia, national labs, and industry. Yet, work directly applicable to scientific computing, while having some impressive successes, has been limited. This report reviews the relevant areas of program synthesis work for scientific computing, discusses successes to date, and outlines opportunities fo...
Article
An approach to reproducibility problems related to porting software across machines and compilers.
Chapter
Compilers optimize OpenMP programs differently than their serial elision. Early outlining of parallel regions and invocation of parallel code via OpenMP runtime functions are two of the most profound differences. Understanding the interplay between compiler optimizations, OpenMP compilation, and application performance is hard and usually requires...
Article
The Exascale Computing Project (ECP) focuses on the development of future exascale‐capable applications. Most ECP applications use the message passing interface (MPI) as their parallel programming model with mini‐apps serving as proxies. This paper explores the explicit usage of MPI in such ECP proxy applications. We empirically analyze 14 proxy ap...
Article
Communication traces collected from MPI applications are an important source of information for performance optimization as they can help analysts determine communication patterns and identify inefficiencies. However, their collection, especially at scale, is time consuming, since it usually requires running the complete target application on a lar...
Chapter
Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deploying an application incurs overhead by tearing dow...
Chapter
The difficulty of deep experimentation with Message Passing Interface (MPI) implementations—which are quite large and complex—substantially raises the cost and complexity of proof-of-concept activities and limits the community of potential contributors to new and better MPI features and implementations alike. Our goal is to enable researchers to ex...
Conference Paper
Full-text available
Understanding the state-of-the-practice in MPI usage is paramount for many aspects of supercomputing, including optimizing the communication of HPC applications and informing standardization bodies and HPC systems procurements regarding the most important MPI features. Unfortunately, no previous study has characterized the use of MPI on application...
Conference Paper
Full-text available
Floating-point arithmetic is widely used in applications from several fields including scientific computing, machine learning, graphics, and finance. Many of these applications are rapidly adopting the use of GPUs to speedup computations. GPUs, however, have limited support to detect floating-point exceptions, which hinders the development of relia...
Conference Paper
Full-text available
Mixed precision computations improve high performance computing throughput for applications that can tolerate decreased mathematical precision in their computations. Native mixed precision computation is commonplace in today's GPGPU accelerators where it is applied to applications with well-known tolerances for reduced mathematical precision. Appli...
Conference Paper
Successful HPC software applications are long-lived. When ported across machines and their compilers, these applications often produce different numerical results, many of which are unacceptable. Such variability is also a concern while optimizing the code more aggressively to gain performance. Efficient tools that help locate the program units (fi...
Conference Paper
As fault recovery mechanisms become increasingly important in HPC systems, the need for a new recovery model for workflows on these systems grows as well. While the traditional approach in which each system component attempts its own independent recovery after a fault works well at each individual application level, this model does not scale to the...
Presentation
Full-text available
A technique to perform floating-point mixed-precision tuning on GPU applications.
Conference Paper
Full-text available
We present GPUMixer, a tool to perform mixed-precision floating-point tuning on scientific GPU applications. While precision tuning techniques are available, they are designed for serial programs and are accuracy-driven, i.e., they consider configurations that satisfy accuracy constraints, but these configurations may degrade performance. GPUMixer,...
Chapter
We present GPUMixer, a tool to perform mixed-precision floating-point tuning on scientific GPU applications. While precision tuning techniques are available, they are designed for serial programs and are accuracy-driven, i.e., they consider configurations that satisfy accuracy constraints, but these configurations may degrade performance. GPUMixer,...
Article
Large scientific simulations must be able to achieve the full-system potential of supercomputers. When they tap into high-performance features, however, a phenomenon known as non-determinism may be introduced in their program execution, which significantly hampers application development. Pruners is a new toolset to detect and remedy non-determinis...
Article
When an MPI program experiences a failure, the most common recovery approach is to restart all processes from a previous checkpoint and to re-queue the entire job. A disadvantage of this method is that, although the failure occurred within the main application loop, live processes must start again from the beginning of the program, along with new r...
Preprint
Extreme-scale scientific applications can be more vulnerable to soft errors (transient faults) as high-performance computing systems increase in scale. The common practice to evaluate the resilience to faults of an application is random fault injection, a method that can be highly time consuming. While resilience prediction modeling has been recent...
Article
Full-text available
As high-performance computing systems scale in size and computational power, the danger of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact application state, grows dramatically. Consequently, applications running on HPC systems need to exhibit resilience to such errors. Previous work has found that, for certain...
Preprint
Floating-point arithmetic is the computational foundation of numerical scientific software. Compiler optimizations that affect floating-point arithmetic can have a significant impact on the integrity, reproducibility, and performance of HPC scientific applications. Unfortunately the interplay between the compiler-induced variability and runtime per...
Conference Paper
Full-text available
When an MPI program experiences a failure, the most common recovery approach is to restart all processes from a previous checkpoint and to re-queue the entire job. A disadvantage of this method is that, although the failure occurred within the main application loop, live processes must start again from the beginning of the program, along with new r...
Preprint
As high-performance computing systems scale in size and computational power, the danger of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact application state, grows dramatically. Consequently, applications running on HPC systems need to exhibit resilience to such errors. Previous work has found that, for certain...
Article
Scientists from many different fields have been developing Bulk‐Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger‐scale future HPC systems, providing efficient fault‐tolerance mechanisms for this class of applications is paramount. The global‐restart mod...
Conference Paper
The running time of GPU kernels depends on an invocation parameter, the number of threads in each thread block. Sometime the dependence is quite strong leading to 50--100% change in execution time for long-running kernels. Till now, it has been an art form to decide on the optimal setting for this parameter. NVIDIA provides a tool for CUDA kernels,...
Conference Paper
Compiler-based fault injection (FI) has become a popular technique for resilience studies to understand the impact of soft errors in supercomputing systems. Compiler-based FI frameworks inject faults at a high intermediate-representation level. However, they are less accurate than machine code, binary-level FI because they lack access to all dynami...
Conference Paper
Understanding DRAM errors in high-performance computing (HPC) clusters is paramount to address future HPC resilience challenges. While there have been studies on this topic, previous work has focused on on-node and single-rack characteristics of errors; conversely, few studies have presented insights into the spatial behavior of DRAM errors across...
Article
Full-text available
Maintaining leadership in HPC requires the ability to support simulations at large scales and fidelity. In this study, we detail one of the most significant productivity challenges in achieving this goal, namely the increasing proclivity to bugs, especially in the face of growing hardware and software heterogeneity and sheer system scale. We identi...
Conference Paper
Debugging intermittently occurring bugs within MPI applications is challenging, and message races, a condition in which two or more sends race to match with a receive, are one of the common root causes. Many debugging tools have been proposed to help programmers resolve them, but their runtime interference perturbs the timing such that subtle races...
Article
Debugging intermittently occurring bugs within MPI applications is challenging, and message races, a condition in which two or more sends race to match with a receive, are one of the common root causes. Many debugging tools have been proposed to help programmers resolve them, but their runtime interference perturbs the timing such that subtle races...
Data
Full-text available
Article
Full-text available
We present a technique to pinpoint scale-dependent integer overflow bugs, a class of bugs in large-scale parallel applications that is hard and time-consuming to detect manually. Rather than detecting integer overflows when applications are deployed at large scale, as existing techniques do, our method forecasts these overflows without requiring th...
Conference Paper
With complex codes moving to systems of greater on-node parallelism using OpenMP, debugging these codes is becoming increasingly challenging. While debuggers can significantly aid programmers, OpenMP support within existing debuggers is either largely ineffective or unsustainable. The OpenMP tools working group is working to specify a debugging int...
Article
Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and...
Conference Paper
This paper presents IPAS, an instruction duplication technique that protects scientific applications from silent data corruption (SDC) in their output. The motivation for IPAS is that, due to natural error masking, only a subset of SDC errors actually affects the output of scientific codes—we call these errors silent output corruption (SOC) errors....
Article
The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM; yet questions related to its programability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this...
Conference Paper
The ability to record and replay program execution helps significantly in debugging non-deterministic MPI applications by reproducing message-receive orders. However, the large amount of data that traditional record-and-reply techniques record precludes its practical applicability to massively parallel applications. In this paper, we propose a new...
Conference Paper
With complex codes moving to systems of increasing on-node parallelism using OpenMP, debugging these codes is becoming increasingly challenging. While debuggers can significantly aid programmers, existing ones support OpenMP at a low system-thread level, reducing their effectiveness. The previously published draft for a standard OpenMP debugging in...
Article
Experts state that dynamic analysis techniques help programmers in finding the root cause of bugs in large-scale high-performance computing (HPC) parallel applications. These applications can run detailed numerical simulations that model the real world. The numerical correctness and software reliability of these applications is a major concern for...
Article
Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR's interfaces to distributed arrays, versionin...
Article
The User Level Failure Mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in MPI. Previous work has presented performance evaluations of the interface; yet questions related to its programability and applicability remain unanswered. In this paper, we present our experiences on using ULFM in a case study (a large molec...
Conference Paper
Full-text available
Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, and thus a fault in one task can quickly propagate to other tasks, making it difficult to debug. Finding the least-progressed tasks can significantly reduce the effort to identify the task where the fault originate...
Article
Neither static nor dynamic data race detection methods, by themselves, have proven to be sufficient for large HPC applications, as they often result in high runtime overheads and/or low race-checking accuracy. While combined static and dynamic approaches can fare better, creating such combinations, in practice, requires attention to many details. S...
Article
Debugging large-scale parallel applications is challenging. Most existing techniques provide little information about