# Michael Allen Heroux

Sandia National Laboratories · Center for Computing Research

PhD Applied Mathematics

## About

170 publications · 28,074 reads · 6,145 citations

Additional affiliations

January 2016 - September 2016

August 1998 - present

May 1998 - present

## Publications

Emerging exascale architectures and systems will provide a sizable increase in raw computing power for science. To ensure the full potential of these new and diverse architectures, as well as the longevity and sustainability of science applications, we need to embrace software ecosystems as first-class citizens.

Productivity and Sustainability Improvement Planning (PSIP) is a lightweight, iterative workflow that allows software development teams to identify development bottlenecks and track progress to overcome them. In this paper, we present an overview of PSIP and how it compares to other software process improvement (SPI) methodologies, and provide two...

Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department...

Software engineering (SWE) for modeling, simulation, and data analytics for computational science and engineering (CSE) is challenging, with ever-more sophisticated, higher fidelity simulation of ever-larger, more complex problems involving larger data volumes, more domains, and more researchers. Targeting both commodity and custom high-end compute...

Software is the key crosscutting technology that enables advances in mathematics, computer science, and domain-specific science and engineering to achieve robust simulations and analysis for science, engineering, and other research fields. However, software itself has not traditionally received focused attention from research communities; rather, s...

Over the past four years, the Big Data and Exascale Computing (BDEC) project organized a series of five international workshops that aimed to explore the ways in which the new forms of data-centric discovery introduced by the ongoing revolution in high-end data analysis (HDA) might be integrated with the established, simulation-centric paradigm of...

Performance portability on heterogeneous high-performance computing (HPC) systems is a major challenge faced today by code developers: parallel code needs to be executed correctly as well as with high performance on machines with different architectures, operating systems, and software libraries. The finite element method (FEM) is a popular and fle...

Although the “big data” revolution first came to public prominence (circa 2010) in online enterprises like Google, Amazon, and Facebook, it is now widely recognized as the initial phase of a watershed transformation that modern society generally—and scientific and engineering research in particular—are in the process of undergoing. Responding to th...

Agile Development is used for many problems, often with different priorities and challenges. However, generalized engineering methodologies often overlook the particularities of a project. To solve this problem, we have looked at ways engineers have modified development methodologies for a particular focus, and created a generalized framework for l...

Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restart...

Algorithmic differentiation (AD) by source-transformation is an established method for computing derivatives of computational algorithms. Static dataflow analysis is commonly used by AD tools to determine the set of active variables, that is, variables that are influenced by the program input in a differentiable way and have a differentiable influe...

Extreme-scale computational science increasingly demands multiscale and multiphysics formulations. Combining software developed by independent groups is imperative: no single team has resources for all predictive science and decision support capabilities. Scientific libraries provide high-quality, reusable software components for constructing appli...

M. Heroux (PI), K. Evans (Co-PI), R. Bartlett, J. Campbell, B. Collins, S. Johnson, A. Prokopenko, G. Rockefeller, M. Young

We describe an efficient parallel implementation of the selected inversion algorithm for distributed memory computer systems, which we call PSelInv. The PSelInv method computes selected elements of a general sparse matrix A that can be decomposed as A = LU, where L is lower triangular and U is upper triangular. The implementation described in this...

Over the past two decades, computational methods have radically changed the ability of researchers from all areas of scholarship to process and analyze data and to simulate complex systems. But with these advances come challenges that are contributing to broader concerns over irreproducibility in the scholarly literature, among them the lack of tra...

Over the past two decades the field of computational science and engineering (CSE) has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs...

Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and...

In this work, we develop an extension of the Curiously Recurring Template Pattern (CRTP), which allows us to organize three related concepts in a class hierarchy. Generalizations, specializations and special procedures are the concepts that we use to define and implement several tools. We call these tools general template units because they are wel...

This remark describes efficiency improvements to Algorithm 916 [Zaghloul and Ali 2011]. It is shown that the execution time required by the algorithm, when run at its highest accuracy, may be improved by more than a factor of 2. A better accuracy vs efficiency tradeoff scheme is also implemented; this requires the user to supply the number of signi...

Language standards such as C99 and C11, as well as the IEEE Standard for Floating-Point Arithmetic 754 (IEEE Std 754-2008) specify the expected behavior of binary and decimal floating-point arithmetic in computer-programming environments and the handling of special values and exception conditions. Many researchers focus on verifying the compliance...

Domain-decomposition (DD) methods are used in most, if not all, modern parallel implementations of finite element modelling software. In the solver stage, the algebraic additive Schwarz (AAS) domain-decomposition preconditioner represents a fundamental component and its performance and scalability are key to the overall performance of the solution...

Optimal Morse matchings reveal essential structures of cell complexes that lead to powerful tools to study discrete geometrical objects, in particular, discrete 3-manifolds. However, such matchings are known to be NP-hard to compute on 3-manifolds through a reduction to the erasability problem. Here, we refine the study of the complexity of problem...

We present the HPCG (High Performance Conjugate Gradients) benchmark, which aims to provide a more application-oriented measurement of system performance than the High Performance LINPACK benchmark. We show the model partial differential equation and its discretization as well as the algorithm for iteratively solving it. The performance...
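As a rough illustration of the kind of iterative solve this abstract refers to, the sketch below runs a plain, unpreconditioned conjugate-gradient iteration on a 1D Poisson matrix with stencil (-1, 2, -1). The model problem, the function names `apply_poisson_1d` and `conjugate_gradient`, and the tolerance are assumptions chosen for brevity; they are not taken from the benchmark itself.

```python
def apply_poisson_1d(x):
    """Matrix-free product y = A x for the tridiagonal matrix with stencil (-1, 2, -1)."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite operator apply_a.

    Returns the approximate solution and the number of iterations performed.
    """
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # first search direction
    rr = dot(r, r)
    b_norm = dot(b, b) ** 0.5 or 1.0   # fall back to 1.0 when b is all zeros
    for k in range(max_iter):
        if rr ** 0.5 <= tol * b_norm:
            return x, k
        ap = apply_a(p)
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter


if __name__ == "__main__":
    n = 20
    x_true = [1.0] * n
    b = apply_poisson_1d(x_true)
    x, iters = conjugate_gradient(apply_poisson_1d, b)
    print("iterations:", iters)
    print("max error:", max(abs(a - t) for a, t in zip(x, x_true)))
```

Each iteration costs one sparse matrix-vector product, two dot products, and three vector updates, so memory traffic rather than floating-point throughput tends to dominate the run time.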

Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local...

The economics of software tools have proven challenging to understand for users and stakeholders in CSE. In the past, many funding agencies have supported academic and governmental research that produced high-value (but not necessarily high-quality) software as a byproduct of the proposed research, not as a direct aim of the proposal or line item i...

We describe a new high-performance conjugate-gradient (HPCG) benchmark. HPCG is composed of computations and data-access patterns commonly found in scientific applications. HPCG strives for a better correlation to existing codes from the computational science domain and to be representative of their performance. HPCG is meant to help drive the comp...

The performance of a large-scale, production-quality science and engineering application (‘app’) is often dominated by a small subset of the code. Even within that subset, computational and data access patterns are often repeated, so that an even smaller portion can represent the performance-impacting features. If application developers, parallel c...

Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Previous work has shown that online recovery, even when done in a global manner (i.e., involving all processes), can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarti...

The scientific community relies on the peer review process for assuring the quality of published material, the goal of which is to build a body of work we can trust. Computational journals such as the ACM Transactions on Mathematical Software (TOMS) use this process for rigorously promoting the clarity and completeness of content, and citation of p...

Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR's interfaces to distributed arrays, versionin...

The emergence of high-concurrency architectures offering unprecedented performance has brought many high-performance partial differential equation (PDE) discretization codes to the precipice of a major refactor. To help address this challenge, a workshop titled "Algorithms and Abstractions for Assembly in PDE Codes" was held in the Computer Science...

Krylov subspace projection methods are widely used iterative methods for solving large-scale linear systems of equations. Researchers have demonstrated that communication avoiding (CA) techniques can improve Krylov methods' performance on modern computers, where communication is becoming increasingly expensive compared to arithmetic operations. In...

The current system reaction to the loss of a single MPI process is to kill all the remaining processes and restart the application from the most recent checkpoint. This approach will become unfeasible for future extreme scale systems. We address this issue using an emerging resilient computing model called Local Failure Local Recovery (LFLR) that p...

We show how both the tridiagonal and bidiagonal QR algorithms can be restructured so that they become rich in operations that can achieve near-peak performance on a modern processor. The key is a novel, cache-friendly algorithm for applying multiple sets of Givens rotations to the eigenvector/singular vector matrix. This algorithm is then implement...
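For readers unfamiliar with the basic operation, the sketch below shows a single Givens rotation being computed and applied to two columns of a small dense matrix. It covers only the elementary building block; it does not reproduce the restructured, cache-friendly scheme the abstract describes. The helper names `givens` and `apply_rotation_to_columns` are hypothetical.

```python
import math


def givens(a, b):
    """Return (c, s) such that c*a + s*b = hypot(a, b) and -s*a + c*b = 0."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0      # nothing to eliminate: use the identity rotation
    return a / r, b / r


def apply_rotation_to_columns(q, j, k, c, s):
    """Post-multiply q (a list of rows) in place by the plane rotation on columns j and k.

    Each row's entries (x_j, x_k) become (c*x_j + s*x_k, -s*x_j + c*x_k).
    """
    for row in q:
        xj, xk = row[j], row[k]
        row[j] = c * xj + s * xk
        row[k] = -s * xj + c * xk


if __name__ == "__main__":
    c, s = givens(3.0, 4.0)
    q = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
    apply_rotation_to_columns(q, 0, 2, c, s)
    for row in q:
        print(row)
```

Applying one rotation touches only two columns, so a naive loop over many rotations rereads the matrix repeatedly; that memory-access pattern is the kind of cost a cache-aware reordering would target.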

Large-scale computing platforms have always dealt with unreliability coming from many sources. In contrast, applications for large-scale systems have generally assumed a fairly simplistic failure model: The computer is a reliable digital machine, with consistent execution time and infrequent failures that can be handled by occasionally storing a che...

The co-design of architectures and algorithms has been postulated as a strategy for achieving Exascale computing in this decade. Exascale design space exploration is prohibitively expensive, at least partially due to the size and complexity of scientific applications of interest. Application codes can contain millions of lines and involve many libr...

Computational science and engineering application programs are typically large, complex, and dynamic, and are often constrained by distribution limitations. As a means of making tractable rapid explorations of scientific and engineering application programs in the context of new, emerging, and future computing architectures, a suite of “miniapps” h...

The Trilinos Project is an effort to facilitate the design, development, integration and ongoing support of mathematical software libraries within an object-oriented framework. It is intended for large-scale, complex multiphysics engineering and scientific applications. Epetra is one of its basic packages. It provides serial and parallel linear alg...

Preparations for Exascale computing have led to the realization that future computing environments will be significantly different from those that provide Petascale capabilities. This change is driven by energy constraints, which are compelling architects to design systems that will require a significant re-thinking of how algorithms are developed a...

Preparations for exascale computing have led to the realization that computing environments will be significantly different from those that provide petascale capabilities. This change is driven by energy constraints, which have compelled hardware architects to design systems that will require a significant re-thinking of how application algorithms a...

The computing community is in the midst of a disruptive architectural change. The advent of manycore and heterogeneous computing nodes forces us to reconsider every aspect of the system software and application stack. To address this challenge there is a broad spectrum of approaches, which we roughly classify as either revolutionary or evolutionary...

The push to exascale computing is informed by the assumption that the architecture, regardless of the specific design, will be fundamentally different from petascale computers. The Mantevo project has been established to produce a set of proxies, or “miniapps,” which enable rapid exploration of key performance issues that impact a broad set of scie...

Software lifecycles are becoming an increasingly important issue for computational science & engineering (CSE) software. The process by which a piece of CSE software begins life as a set of research requirements and then matures into a trusted high-quality capability is both commonplace and extremely challenging. Although an implicit lifecycle is o...

MueLu is a library within the Trilinos software project [An overview of Trilinos, Technical Report SAND2003-2927, Sandia National Laboratories, 2003] and provides a framework for parallel multigrid preconditioning methods for large sparse linear systems. ...

Energy increasingly constrains modern computer hardware, yet protecting computations and data against errors costs energy. This holds at all scales, but especially for the largest parallel computers being built and planned today. As processor counts continue to grow, the cost of ensuring reliability consistently throughout an application will becom...

With the ubiquity of multicore processors, it is crucial that solvers adapt to the hierarchical structure of modern architectures. We present ShyLU, a “hybrid-hybrid” solver for general sparse linear systems that is hybrid in two ways: First, it combines direct and iterative methods. The iterative part is based on approximate Schur complements wher...

We describe methods to determine optimal coarse-grained models of lipid bilayers for use in fluids density functional theory (fluids-DFT) calculations. Both coarse-grained lipid architecture and optimal parametrizations of the models based on experimental measures are discussed in the context of dipalmitoylphosphatidylcholine (DPPC) lipid bilayers...

Since An Overview of the Trilinos Project [ACM Trans. Math. Softw. 31(3) (2005), 397-423] was published in 2005, Trilinos has grown significantly. It now supports the development of a broad collection of libraries for scalable computational science and ...

A broad range of scientific computation involves the use of difference stencils. In a parallel computing environment, this computation is typically implemented by decomposing the spatial domain, inducing a 'halo exchange' of process-owned boundary data. This approach adheres to the Bulk Synchronous Parallel (BSP) model. Because commonly available a...
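The halo-exchange pattern mentioned in this abstract can be sketched in a few lines. The example below simulates it in a single Python process: a 1D array is split into contiguous chunks, each chunk receives one ghost value from each neighbor, and a 3-point averaging stencil is then applied locally. The stencil, the chunking, and all function names are assumptions made for illustration; a real code would exchange the ghost values with message passing rather than list indexing.

```python
def stencil_global(u):
    """Reference result: 3-point average at interior points of the full array."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def split(u, parts):
    """Block-decompose u into `parts` contiguous chunks (one per simulated process)."""
    n = len(u)
    bounds = [n * p // parts for p in range(parts + 1)]
    return [u[bounds[p]:bounds[p + 1]] for p in range(parts)]


def halo_exchange(chunks):
    """Return (left_ghost, right_ghost) for each chunk; None marks a physical boundary."""
    ghosts = []
    for p in range(len(chunks)):
        left = chunks[p - 1][-1] if p > 0 else None
        right = chunks[p + 1][0] if p < len(chunks) - 1 else None
        ghosts.append((left, right))
    return ghosts


def stencil_local(chunk, left, right):
    """Apply the stencil to one chunk using only its own data plus its ghost values."""
    padded = ([left] if left is not None else []) + chunk \
        + ([right] if right is not None else [])
    offset = 1 if left is not None else 0
    out = list(chunk)
    for j in range(len(chunk)):
        i = j + offset
        if 0 < i < len(padded) - 1:   # skip points on the physical boundary
            out[j] = (padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
    return out


def bsp_step(u, parts):
    """One superstep: exchange halos first, then compute on every chunk."""
    chunks = split(u, parts)
    ghosts = halo_exchange(chunks)
    result = []
    for chunk, (left, right) in zip(chunks, ghosts):
        result.extend(stencil_local(chunk, left, right))
    return result


if __name__ == "__main__":
    u = [float(i * i) for i in range(10)]
    print(bsp_step(u, 3) == stencil_global(u))
```

The sketch assumes every chunk holds at least one point. Separating `halo_exchange` from `stencil_local` mirrors the communicate-then-compute structure of a bulk-synchronous step.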

We present Tpetra, a Trilinos package for parallel linear algebra primitives implementing the Petra object model. We describe Tpetra's design, based on generic programming via C++ templated types and template metaprogramming. We discuss some benefits of this approach in the context of scientific computing, with illustrations consisting of code and...

The Xyce Parallel Circuit Simulator, which has demonstrated scalable circuit simulation on hundreds of processors, heavily leverages the high-performance scientific libraries provided by Trilinos. With the move towards multi-core CPUs and GPU technology, retaining this scalability on future parallel architectures will be a challenge. This paper wil...

We present Tpetra, a Trilinos package for parallel linear algebra primitives implementing the Petra object model. We describe Tpetra's design, based on generic programming via C++ templated types and template metaprogramming. We discuss some benefits of this approach in the context of scientific computing, with illustrations consisting of code and...

Since An Overview of the Trilinos Project [ACM Trans. Math. Softw. 31(3) (2005), 397-423] was published in 2005, Trilinos has grown significantly. It now supports the development of a broad collection of libraries for scalable computational science and engineering applications, and a full-featured software infrastructure for rigorous lean/agile sof...

There is considerable interest in achieving a 1000-fold increase in supercomputing power in the next decade, but the challenges are formidable. In this paper, the authors discuss some of the driving science and security applications that require Exascale computing (a million trillion operations per second). Key architectural challenges include pow...

Application performance is determined by a combination of many choices: hardware platform, runtime environment, languages and compilers used, algorithm choice and implementation, and more. In this complicated environment, we find that the use of mini-applications - small self-contained proxies for real applications - is an excellent approach for r...

With the increasing levels of parallelism in a compute node, it is important to exploit multiple levels of parallelism even within a single compute node. We present ShyLU (pronounced "Shy-loo" for Scalable Hybrid LU), a "hybrid-hybrid" solver for general sparse linear systems that is hybrid in two ways: First, it combines direct and iterative metho...

Several recent studies discuss potential Exascale architectures, identify key technical challenges and describe research that is beginning to address several of these challenges [1,2]. Co-design is a key element of the U.S. Department of Energy’s strategy to achieve Exascale computing [3]. Architectures research is needed but will not, by itself, m...

Self-similarity is a property of physical systems that describes how to scale parameters such that dissimilar systems appear to be similar. Computer systems are self-similar if certain ratios of computational forces, also known as computational intensities, are equal. Two machines with different computational power, different network bandwidth and...

Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have...

Current iterative methods for solving linear equations assume reliability of data (no "bit flips") and arithmetic (correct up to rounding error). If faults occur, the solver usually either aborts, or computes the wrong answer without indication. System reliability guarantees consume energy or reduce performance. As processor counts continue to...

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applicat...

It is often observed that software engineering (SE) processes and practices for computational science and engineering (CSE) lag behind other SE areas [7]. This issue has been a concern for funding agencies, since new research increasingly relies upon and produces computational tools. At the same time, CSE research organizations find it difficult to...

This report summarizes the progress made as part of a one year lab-directed research and development (LDRD) project to fund the research efforts of Bryan Marker at the University of Texas at Austin. The goal of the project was to develop new techniques for automatically tuning the performance of dense linear algebra kernels. These kernels often rep...

As computational science applications grow more parallel with multi-core supercomputers having hundreds of thousands of computational cores, it will become increasingly difficult for solvers to scale. Our approach is to use hybrid MPI/threaded numerical algorithms to solve these systems in order to reduce the number of MPI tasks and increase the pa...

There is considerable interest in achieving a 1000-fold increase in supercomputing power in the next decade, but the challenges are formidable. In this paper, the authors discuss some of the driving science and security applications that require Exascale computing (a million trillion operations per second). Key architectural challenges include power...

Multicore nodes have become ubiquitous in just a few years. At the same time, writing portable parallel software for multicore nodes is extremely challenging. Widely available programming models such as OpenMP and Pthreads are not useful for devices such as graphics cards, and more flexible programming models such as RapidMind are only available co...

The Trilinos Project started approximately nine years ago as a small effort to enable research, development and ongoing support of small, related solver software efforts. The 'Tri' in Trilinos was intended to indicate the eventual three packages we planned to develop. In 2007 the project expanded its scope to include any package that was an enablin...

Preparing applications for a transition from petascale to exascale systems will require a very large investment in several areas of software research and development. The introduction of manycore nodes, the abundance of parallelism, an increase in system faults (including soft errors) and a complicated, multi-component software environment are some...

This paper presents a parallel programming model, Parallel Phase Model (PPM), for next-generation high-end parallel machines based on a distributed memory architecture consisting of a networked cluster of nodes with a large number of cores on each node. PPM has a unified high-level programming abstraction that facilitates the design and implementat...

Analysis of a timing formula for a molecular dynamics kernel reveals an equivalence class of parallel machines with a fixed point that is independent of the particular machine in the class. Three different machines, CRAY, IBM and SGI, are self-similar in that they follow the same path along a performance surface as the processor count and problem s...

Computational Science and Engineering (CSE) software is typically developed using research funding where the primary focus is research and development of advanced algorithms and modeling capabilities. As a result, formal software engineering is seldom a primary goal. CSE software developers intend to write good software, but often lack the training...

Application performance is determined by a combination of many choices: hardware platform, runtime environment, languages and compilers used, algorithm choice and implementation, and more. In this complicated environment, we find that the use of mini-applications - small self-contained proxies for real applications - is an excellent approach for ra...

Future generations of scalable computers will rely on multicore nodes for a significant portion of overall system performance. At present, most applications and libraries cannot exploit multiple cores beyond running additional MPI processes per node. In this paper we discuss important multicore architecture issues, programming models, algorithms requ...

Bundle-exchange-compute (BEC) is a new virtual shared memory parallel programming environment for distributed-memory machines. Different from and complementary to other global address space (GAS) programming model research efforts, BEC has built-in efficient support for unstructured applications that inherently require high-volume random fine-grain...