Michael Allen Heroux

Sandia National Laboratories · Center for Computing Research

PhD Applied Mathematics

About

175
Publications
32,529
Reads
6,685
Citations
Additional affiliations
January 2016 - September 2016
Sandia National Laboratories
Position
  • Senior Researcher
August 1998 - present
St. John's University
Position
  • Scientist in Residence
May 1998 - present
Sandia National Laboratories
Position
  • Distinguished Member of Staff

Publications (175)
Preprint
Full-text available
In this paper, we discuss the need for an integrated software stack that unites artificial intelligence (AI) and modeling and simulation (ModSim) tools to advance scientific discovery. The authors advocate for a unified AI/ModSim software ecosystem that ensures compatibility across a wide range of software on diverse high-performance computing syst...
Article
Full-text available
The Exascale Computing Project (ECP) Software Technology and Co-Design teams addressed the growing complexities in high-performance computing (HPC) by developing scalable software libraries and tools that leverage exascale system capabilities. As we enter the exascale era, the need for reusable, optimized software solutions that can handle the uniq...
Article
Full-text available
Computational and data-enabled science and engineering are revolutionizing advances throughout science and society, at all scales of computing. For example, teams in the U.S. DOE Exascale Computing Project have been tackling new frontiers in modeling, simulation, and analysis by exploiting unprecedented exascale computing capabilities—building an a...
Article
The U.S. Department of Energy (DOE) Exascale Computing Project (ECP) funded the development of new (and the transformation of important existing) applications, libraries, and tools that realized improvement in performance and capabilities of often 100 times or more on emerging exascale computers. This exceptional gain inspired the title of this spe...
Article
The development of scientific software—a cornerstone of long-term collaboration and scientific progress—parallels the development of other types of software but still poses distinct challenges, especially in high-performance computing. Although web searches yield numerous resources on software engineering, there is still a scarcity specifically for...
Article
Emerging exascale architectures and systems will provide a sizable increase in raw computing power for science. To ensure the full potential of these new and diverse architectures, as well as the longevity and sustainability of science applications, we need to embrace software ecosystems as first-class citizens.
Chapter
Full-text available
Productivity and Sustainability Improvement Planning (PSIP) is a lightweight, iterative workflow that allows software development teams to identify development bottlenecks and track progress to overcome them. In this paper, we present an overview of PSIP and how it compares to other software process improvement (SPI) methodologies, and provide two...
Article
Full-text available
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department...
Conference Paper
Software engineering (SWE) for modeling, simulation, and data analytics for computational science and engineering (CSE) is challenging, with ever-more sophisticated, higher fidelity simulation of ever-larger, more complex problems involving larger data volumes, more domains, and more researchers. Targeting both commodity and custom high-end compute...
Article
Software is the key crosscutting technology that enables advances in mathematics, computer science, and domain-specific science and engineering to achieve robust simulations and analysis for science, engineering, and other research fields. However, software itself has not traditionally received focused attention from research communities; rather, s...
Preprint
Software is the key crosscutting technology that enables advances in mathematics, computer science, and domain-specific science and engineering to achieve robust simulations and analysis for science, engineering, and other research fields. However, software itself has not traditionally received focused attention from research communities; rather, s...
Article
Full-text available
Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of...
Article
Performance portability on heterogeneous high-performance computing (HPC) systems is a major challenge faced today by code developers: parallel code needs to be executed correctly as well as with high performance on machines with different architectures, operating systems, and software libraries. The finite element method (FEM) is a popular and fle...
Preprint
Full-text available
Although the “big data” revolution first came to public prominence (circa 2010) in online enterprises like Google, Amazon, and Facebook, it is now widely recognized as the initial phase of a watershed transformation that modern society generally—and scientific and engineering research in particular—are in the process of undergoing. Responding to th...
Article
Agile Development is used for many problems, often with different priorities and challenges. However, generalized engineering methodologies often overlook the particularities of a project. To solve this problem, we have looked at ways engineers have modified development methodologies for a particular focus, and created a generalized framework for l...
Article
Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restart...
Article
Algorithmic differentiation (AD) by source-transformation is an established method for computing derivatives of computational algorithms. Static dataflow analysis is commonly used by AD tools to determine the set of active variables, that is, variables that are influenced by the program input in a differentiable way and have a differentiable influe...
Article
Extreme-scale computational science increasingly demands multiscale and multiphysics formulations. Combining software developed by independent groups is imperative: no single team has resources for all predictive science and decision support capabilities. Scientific libraries provide high-quality, reusable software components for constructing appli...
Poster
Full-text available
M. Heroux* (PI), K. Evans† (Co-PI), R. Bartlett*, J. Campbell‡, B. Collins†, S. Johnson†, A. Prokopenko†, G. Rockefeller‡, M. Young†
Article
We describe an efficient parallel implementation of the selected inversion algorithm for distributed memory computer systems, which we call PSelInv. The PSelInv method computes selected elements of a general sparse matrix A that can be decomposed as A = LU, where L is lower triangular and U is upper triangular. The implementation described in this...
Article
Over the past two decades, computational methods have radically changed the ability of researchers from all areas of scholarship to process and analyze data and to simulate complex systems. But with these advances come challenges that are contributing to broader concerns over irreproducibility in the scholarly literature, among them the lack of tra...
Article
Full-text available
Over the past two decades the field of computational science and engineering (CSE) has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs...
Article
Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and...
Article
In this work, we develop an extension of the Curiously Recurring Template Pattern (CRTP), which allows us to organize three related concepts in a class hierarchy. Generalizations, specializations and special procedures are the concepts that we use to define and implement several tools. We call these tools general template units because they are wel...
Article
This remark describes efficiency improvements to Algorithm 916 [Zaghloul and Ali 2011]. It is shown that the execution time required by the algorithm, when run at its highest accuracy, may be improved by more than a factor of 2. A better accuracy vs efficiency tradeoff scheme is also implemented; this requires the user to supply the number of signi...
Article
Language standards such as C99 and C11, as well as the IEEE Standard for Floating-Point Arithmetic 754 (IEEE Std 754-2008) specify the expected behavior of binary and decimal floating-point arithmetic in computer-programming environments and the handling of special values and exception conditions. Many researchers focus on verifying the compliance...
Article
Domain-decomposition (DD) methods are used in most, if not all, modern parallel implementations of finite element modelling software. In the solver stage, the algebraic additive Schwarz (AAS) domain-decomposition preconditioner represents a fundamental component and its performance and scalability are key to the overall performance of the solution...
Article
Optimal Morse matchings reveal essential structures of cell complexes that lead to powerful tools to study discrete geometrical objects, in particular, discrete 3-manifolds. However, such matchings are known to be NP-hard to compute on 3-manifolds through a reduction to the erasability problem. Here, we refine the study of the complexity of problem...
Article
Full-text available
We present the HPCG benchmark, High Performance Conjugate Gradients, which is aimed at providing a more application-oriented measurement of system performance than the High Performance LINPACK benchmark. We show the model partial differential equation and its discretization as well as the algorithm for iteratively solving it. The performance...
Conference Paper
Full-text available
Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local...
Conference Paper
Full-text available
The economics of software tools have proven challenging to understand for users and stakeholders in CSE. In the past, many funding agencies have supported academic and governmental research that produced high-value (but not necessarily high-quality) software as a byproduct of the proposed research, not as a direct aim of the proposal or line item i...
Article
Full-text available
We describe a new high-performance conjugate-gradient (HPCG) benchmark. HPCG is composed of computations and data-access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the comp...
Article
Full-text available
The performance of a large-scale, production-quality science and engineering application (‘app’) is often dominated by a small subset of the code. Even within that subset, computational and data access patterns are often repeated, so that an even smaller portion can represent the performance-impacting features. If application developers, parallel c...
Conference Paper
Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Previous work has shown that online recovery, even when done in a global manner (i.e., involving all processes), can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarti...
Article
Full-text available
The scientific community relies on the peer review process for assuring the quality of published material, the goal of which is to build a body of work we can trust. Computational journals such as the ACM Transactions on Mathematical Software (TOMS) use this process for rigorously promoting the clarity and completeness of content, and citation of p...
Article
Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR's interfaces to distributed arrays, versionin...
Technical Report
Full-text available
The emergence of high-concurrency architectures offering unprecedented performance has brought many high-performance partial differential equation (PDE) discretization codes to the precipice of a major refactor. To help address this challenge a workshop titled "Algorithms and Abstractions for Assembly in PDE Codes" was held in the Computer Science...
Article
Krylov subspace projection methods are widely used iterative methods for solving large-scale linear systems of equations. Researchers have demonstrated that communication avoiding (CA) techniques can improve Krylov methods' performance on modern computers, where communication is becoming increasingly expensive compared to arithmetic operations. In...
Article
The current system reaction to the loss of a single MPI process is to kill all the remaining processes and restart the application from the most recent checkpoint. This approach will become unfeasible for future extreme scale systems. We address this issue using an emerging resilient computing model called Local Failure Local Recovery (LFLR) that p...
Article
We show how both the tridiagonal and bidiagonal QR algorithms can be restructured so that they become rich in operations that can achieve near-peak performance on a modern processor. The key is a novel, cache-friendly algorithm for applying multiple sets of Givens rotations to the eigenvector/singular vector matrix. This algorithm is then implement...
Article
Full-text available
Large-scale computing platforms have always dealt with unreliability coming from many sources. In contrast applications for large-scale systems have generally assumed a fairly simplistic failure model: The computer is a reliable digital machine, with consistent execution time and infrequent failures that can be handled by occasionally storing a che...
Article
The co-design of architectures and algorithms has been postulated as a strategy for achieving Exascale computing in this decade. Exascale design space exploration is prohibitively expensive, at least partially due to the size and complexity of scientific applications of interest. Application codes can contain millions of lines and involve many libr...
Article
Computational science and engineering application programs are typically large, complex, and dynamic, and are often constrained by distribution limitations. As a means of making tractable rapid explorations of scientific and engineering application programs in the context of new, emerging, and future computing architectures, a suite of “miniapps” h...
Article
Full-text available
The Trilinos Project is an effort to facilitate the design, development, integration and ongoing support of mathematical software libraries within an object-oriented framework. It is intended for large-scale, complex multiphysics engineering and scientific applications. Epetra is one of its basic packages. It provides serial and parallel linear alg...
Article
Preparations for Exascale computing have led to the realization that future computing environments will be significantly different from those that provide Petascale capabilities. This change is driven by energy constraints, which are compelling architects to design systems that will require a significant re-thinking of how algorithms are developed a...
Conference Paper
Preparations for exascale computing have led to the realization that computing environments will be significantly different from those that provide petascale capabilities. This change is driven by energy constraints, which have compelled hardware architects to design systems that will require a significant re-thinking of how application algorithms a...
Conference Paper
The computing community is in the midst of a disruptive architectural change. The advent of manycore and heterogeneous computing nodes forces us to reconsider every aspect of the system software and application stack. To address this challenge there is a broad spectrum of approaches, which we roughly classify as either revolutionary or evolutionary...
Conference Paper
The push to exascale computing is informed by the assumption that the architecture, regardless of the specific design, will be fundamentally different from petascale computers. The Mantevo project has been established to produce a set of proxies, or “miniapps,” which enable rapid exploration of key performance issues that impact a broad set of scie...
Conference Paper
Software lifecycles are becoming an increasingly important issue for computational science & engineering (CSE) software. The process by which a piece of CSE software begins life as a set of research requirements and then matures into a trusted high-quality capability is both commonplace and extremely challenging. Although an implicit lifecycle is o...
Article
Full-text available
MueLu is a library within the Trilinos software project [An overview of Trilinos, Technical Report SAND2003-2927, Sandia National Laboratories, 2003] and provides a framework for parallel multigrid preconditioning methods for large sparse linear systems. ...
Article
Full-text available
Energy increasingly constrains modern computer hardware, yet protecting computations and data against errors costs energy. This holds at all scales, but especially for the largest parallel computers being built and planned today. As processor counts continue to grow, the cost of ensuring reliability consistently throughout an application will becom...
Conference Paper
Full-text available
With the ubiquity of multicore processors, it is crucial that solvers adapt to the hierarchical structure of modern architectures. We present ShyLU, a “hybrid-hybrid” solver for general sparse linear systems that is hybrid in two ways: First, it combines direct and iterative methods. The iterative part is based on approximate Schur complements wher...
Article
We describe methods to determine optimal coarse-grained models of lipid bilayers for use in fluids density functional theory (fluids-DFT) calculations. Both coarse-grained lipid architecture and optimal parametrizations of the models based on experimental measures are discussed in the context of dipalmitoylphosphatidylcholine (DPPC) lipid bilayers...
Article
Full-text available
Since An Overview of the Trilinos Project [ACM Trans. Math. Softw. 31(3) (2005), 397-423] was published in 2005, Trilinos has grown significantly. It now supports the development of a broad collection of libraries for scalable computational science and ...
Article
Software lifecycles are becoming an increasingly important issue for computational science and engineering (CSE) software. The process by which a piece of CSE software begins life as a set of research requirements and then matures into a trusted high-quality capability is both commonplace and extremely challenging. Although an implicit lifecycle is...
Article
Full-text available
A broad range of scientific computation involves the use of difference stencils. In a parallel computing environment, this computation is typically implemented by decomposing the spatial domain, inducing a 'halo exchange' of process-owned boundary data. This approach adheres to the Bulk Synchronous Parallel (BSP) model. Because commonly available a...
Article
We present Tpetra, a Trilinos package for parallel linear algebra primitives implementing the Petra object model. We describe Tpetra's design, based on generic programming via C++ templated types and template metaprogramming. We discuss some benefits of this approach in the context of scientific computing, with illustrations consisting of code and...
Conference Paper
Full-text available
The Xyce Parallel Circuit Simulator, which has demonstrated scalable circuit simulation on hundreds of processors, heavily leverages the high-performance scientific libraries provided by Trilinos. With the move towards multi-core CPUs and GPU technology, retaining this scalability on future parallel architectures will be a challenge. This paper wil...
Article
Full-text available
We present Tpetra, a Trilinos package for parallel linear algebra primitives implementing the Petra object model. We describe Tpetra's design, based on generic programming via C++ templated types and template metaprogramming. We discuss some benefits of this approach in the context of scientific computing, with illustrations consisting of code and...
Article
Full-text available
Since An Overview of the Trilinos Project [ACM Trans. Math. Softw. 31(3) (2005), 397-423] was published in 2005, Trilinos has grown significantly. It now supports the development of a broad collection of libraries for scalable computational science and engineering applications, and a full-featured software infrastructure for rigorous lean/agile sof...
Chapter
There is considerable interest in achieving a 1000-fold increase in supercomputing power in the next decade, but the challenges are formidable. In this paper, the authors discuss some of the driving science and security applications that require Exascale computing (a million trillion operations per second). Key architectural challenges include pow...
Conference Paper
Application performance is determined by a combination of many choices: hardware platform, runtime environment, languages and compilers used, algorithm choice and implementation, and more. In this complicated environment, we find that the use of mini-applications - small self-contained proxies for real applications - is an excellent approach for r...
Conference Paper
With the increasing levels of parallelism in a compute node, it is important to exploit multiple levels of parallelism even within a single compute node. We present ShyLU (pronounced "Shy-loo" for Scalable Hybrid LU), a "hybrid-hybrid" solver for general sparse linear systems that is hybrid in two ways: First, it combines direct and iterative metho...
Conference Paper
Several recent studies discuss potential Exascale architectures, identify key technical challenges and describe research that is beginning to address several of these challenges [1,2]. Co-design is a key element of the U.S. Department of Energy’s strategy to achieve Exascale computing [3]. Architectures research is needed but will not, by itself, m...
Article
Full-text available
It is often observed that software engineering (SE) processes and practices for computational science and engineering (CSE) lag behind other SE areas [7]. This issue has been a concern for funding agencies, since new research increasingly relies upon and produces computational tools. At the same time, CSE research organizations find it difficult to...
Article
Self-similarity is a property of physical systems that describes how to scale parameters such that dissimilar systems appear to be similar. Computer systems are self-similar if certain ratios of computational forces, also known as computational intensities, are equal. Two machines with different computational power, different network bandwidth and...
Article
Full-text available
Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have...
Article
A programming model is a set of software technologies that support the expression of algorithms and provide applications with an abstract representation of the capabilities of the underlying hardware architecture. The primary goals are productivity, portability and performance.
Article
Full-text available
Current iterative methods for solving linear equations assume reliability of data (no "bit flips") and arithmetic (correct up to rounding error). If faults occur, the solver usually either aborts, or computes the wrong answer without indication. System reliability guarantees consume energy or reduce performance. As processor counts continue to...
Conference Paper
Full-text available
Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applicat...
Article
Full-text available
This report summarizes the progress made as part of a one year lab-directed research and development (LDRD) project to fund the research efforts of Bryan Marker at the University of Texas at Austin. The goal of the project was to develop new techniques for automatically tuning the performance of dense linear algebra kernels. These kernels often rep...
Conference Paper
Full-text available
As computational science applications grow more parallel with multi-core supercomputers having hundreds of thousands of computational cores, it will become increasingly difficult for solvers to scale. Our approach is to use hybrid MPI/threaded numerical algorithms to solve these systems in order to reduce the number of MPI tasks and increase the pa...
Article
Full-text available
There is considerable interest in achieving a 1000-fold increase in supercomputing power in the next decade, but the challenges are formidable. In this paper, the authors discuss some of the driving science and security applications that require Exascale computing (a million trillion operations per second). Key architectural challenges include power...
Conference Paper
Full-text available
Multicore nodes have become ubiquitous in just a few years. At the same time, writing portable parallel software for multicore nodes is extremely challenging. Widely available programming models such as OpenMP and Pthreads are not useful for devices such as graphics cards, and more flexible programming models such as RapidMind are only available co...
Article
The Trilinos Project started approximately nine years ago as a small effort to enable research, development and ongoing support of small, related solver software efforts. The 'Tri' in Trilinos was intended to indicate the eventual three packages we planned to develop. In 2007 the project expanded its scope to include any package that was an enablin...
Article
Preparing applications for a transition from petascale to exascale systems will require a very large investment in several areas of software research and development. The introduction of manycore nodes, the abundance of parallelism, an increase in system faults (including soft errors) and a complicated, multi-component software environment are some...
Conference Paper
Full-text available
This paper presents a parallel programming model, Parallel Phase Model (PPM), for next-generation high-end parallel machines based on a distributed memory architecture consisting of a networked cluster of nodes with a large number of cores on each node. PPM has a unified high-level programming abstraction that facilitates the design and implementat...
Article
Analysis of a timing formula for a molecular dynamics kernel reveals an equivalence class of parallel machines with a fixed point that is independent of the particular machine in the class. Three different machines, CRAY, IBM and SGI, are self-similar in that they follow the same path along a performance surface as the processor count and problem s...