
Bruce Hendrickson - Sandia National Laboratories
154 publications, 26,515 reads, 10,519 citations
Publications (154)
It is our view that the state of the art in constructing a large collection
of graph algorithms in terms of linear algebraic operations is mature enough to
support the emergence of a standard set of primitive building blocks. This
paper is a position paper defining the problem and announcing our intention to
launch an open effort to define this sta...
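As a rough illustration of what "graph algorithms in terms of linear algebraic operations" can mean, the sketch below writes breadth-first search as repeated matrix-vector products over a Boolean (OR, AND) semiring. The function name `bfs_levels` and the dense list-of-lists adjacency format are assumptions made for brevity; they are not taken from the paper or from any standard it proposes.

```python
def bfs_levels(adj, source):
    """BFS expressed as repeated Boolean matrix-vector products.

    adj[i][j] is truthy when there is an edge i -> j (dense list of lists,
    chosen only to keep the sketch short).
    Returns a list of BFS levels, with -1 for unreachable vertices.
    """
    n = len(adj)
    level = [-1] * n              # -1 marks "not yet reached"
    frontier = [False] * n
    frontier[source] = True
    level[source] = 0
    depth = 0
    while any(frontier):
        depth += 1
        # next = (A^T * frontier) over the (OR, AND) semiring,
        # masked so that already-visited vertices are excluded.
        nxt = [
            level[j] == -1 and any(frontier[i] and adj[i][j] for i in range(n))
            for j in range(n)
        ]
        for j in range(n):
            if nxt[j]:
                level[j] = depth
        frontier = nxt
    return level
```

A production version would presumably use a sparse matrix format rather than a dense one; the point here is only that the traversal step is a single masked matrix-vector product.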
The Communications Web site, http://cacm.acm.org, features more than a dozen bloggers in the BLOG@CACM community. In each issue of Communications, we'll publish selected posts or excerpts. Follow us on Twitter ...
Given its leading role in high-performance computing for modeling and simulation and its many experimental facilities, the US Department of Energy has a tremendous need for data-intensive science. Locating the challenges and commonalities among three case studies illuminates, in detail, the technical challenges involved in realizing data-intensive...
Communities of vertices within a giant network such as the World Wide Web are likely to be vastly smaller than the network itself. However, Fortunato and Barthélemy have proved that modularity maximization algorithms for community detection may fail to resolve communities with fewer than √(L/2) edges, where L is the number of edges in the entire netw...
Graphs are a general approach for representing information that spans the widest possible range of computing applications. They are particularly important to computational biology, web search, and knowledge discovery. As the sizes of graphs increase, the need to apply advanced mathematical and computational techniques to solve these problems is gro...
In the past two decades, computational methods have emerged as an essential component of the scientific and engineering enterprise. A diverse assortment of scientific applications has been simulated and explored via advanced computational techniques. Computer vendors have built enormous parallel machines to support these activities, and the researc...
Despite decades of activity, parallel computing remains immature. Like much of computer science, advances in the field are driven by a mixture of theoretical insights and technological advances. But in parallel computing, the gap between theory and practice remains disconcertingly wide. Key theoretical concepts in parallel computing were developed...
Large, complex graphs arise in many settings including the Internet, social networks, and communication networks. To study such data sets, the authors explored the use of high-performance computing (HPC) for graph algorithms. They found that the challenges in these applications are quite different from those arising in traditional HPC applications...
BlueGene/L (BG/L), developed through a partnership between IBM and Lawrence Livermore National Laboratory (LLNL), is currently the world's largest system both in terms of scale, with 131,072 processors, and absolute performance, with a peak rate of 367 Tflop/s. BG/L has led the last four Top500 lists with a Linpack rate of 280.6 Tflop/s for the ful...
Over the past half-century, the Applied Mathematics program in the U.S. Department of Energy's Office of Advanced Scientific Computing Research has made significant, enduring advances in applied mathematics that have been essential enablers of modern computational science. Motivated by the scientific needs of the Department of Energy and its predec...
Since the early days of supercomputing, numerical routines have caused the highest demand for computing power anywhere, making their efficient parallelization one of the core methodical tasks in high-performance computing. And still, many of today’s fastest computers in the world are mostly used for the solution of huge systems of equations as they...
In this paper we apply theoretical and practical results from facility
location theory to the problem of community detection in networks. The result
is an algorithm that computes bounds on a minimization variant of local
modularity. We also define the concept of an edge support and a new measure of
the goodness of community structures with respect...
Search-based graph queries, such as finding short paths and isomorphic subgraphs, are dominated by memory latency. If input graphs can be partitioned appropriately, large cluster-based computing platforms can run these queries. However, the lack of compute-bound processing at each vertex of the input graph and the constant need to retrieve neighbor...
Graph algorithms are becoming increasingly important for solving many problems in scientific computing, data mining and other domains. As these problems grow in scale, parallel computing resources are required to meet their computational and memory requirements. Unfortunately, the algorithms, software, and hardware that have worked well for develop...
Latent semantic analysis (LSA) is a method for information retrieval and processing which is based upon the singular value decomposition. It has a geometric interpretation in which objects (e.g. documents and keywords) are placed in a low-dimensional geometric space. In this paper, we derive an alternative algebraic/geometric method for placing obj...
Support theory is a methodology for bounding eigenvalues and generalized eigenvalues of matrices and matrix pencils; such bounds have been stated both in algebraic terms and in combinatorial terms based on embeddings of the underlying graphs of the matrices. In this paper, we present a theorem that demonstrates the connection between these various...
The Cray MTA-2 system provides exceptional performance on a variety of sparse graph algorithms. Unfortunately, it was an extremely expensive platform. Cray is preparing an Eldorado platform that leverages the Cray XT3 network and system infrastructure while integrating a new revision of the MTA-2 processors that is pin compatible with the AMD O...
This paper addresses the problem of partitioning the nonzeros of sparse nonsymmetric and nonsquare matrices in order to efficiently
compute parallel matrix-vector and matrix-transpose-vector multiplies. Our goal is to balance the work per processor while
keeping communications costs low. Although the symmetric partitioning problem has been well-stu...
Many emerging applications are built upon large, unstructured datasets that exhibit highly irregular (or even nearly random) memory access patterns. Examples include informatics applications, and other problems that are often represented by unstructured graph-based data structures. It is well known that these applications are challenging for conven...
Combinatorial algorithms have long played a crucial enabling role in scientific and engineering computations. The importance of discrete algorithms continues to grow with the demands of new applications and advanced architectures. This paper surveys some recent developments in this rapidly changing and highly interdisciplinary field.
Combinatorial algorithms have long played a crucial, albeit under-recognized, role in scientific computing. This impact ranges well beyond the familiar applications of graph algorithms in sparse matrices to include mesh generation, optimization, computational biology and chemistry, data analysis and parallelization. Trends in science and in comput...
Sparse matrix-vector multiplication is the kernel for many scientific computations. Parallelizing this operation requires the matrix to be divided among processors. This division is commonly phrased in terms of graph partitioning. Although this abstraction has proved to be very useful, it has significant flaws and limitations. The cost model implic...
We will discuss our experiences in designing and using a software infrastructure for processing semantic graphs on massively multithreaded computers. We have developed implementations of several algorithms for connected components, subgraph isomorphism, and s-t connectivity. We will discuss their performance on the existing Cray MTA-2, and thei...
A new trend in processor design is increased on-chip support for multithreading in the form of both chip multiprocessors and simultaneous multithreading. Recent research in database systems has begun to explore increased thread-level parallelism made possible by these new multicore and multithreaded processors. The question of how best to use t...
Many emerging large-scale data science applications require searching large graphs distributed across multiple memories and processors. This paper presents a distributed breadth-first search (BFS) scheme that scales for random graphs with up to three billion vertices and 30 billion edges. Scalability was tested on IBM BlueGene/L with 32,768 nodes...
In many applications of parallel computing, distribution of the data unambiguously implies distribution of work among processors. But, there are exceptions where some tasks can be assigned to one of several processors without altering the total volume of communication. In this paper, we study the problem of exploiting this flexibility in assignment...
The method of discrete ordinates is commonly used to solve the Boltzmann transport equation. The solution in each ordinate direction is most efficiently computed by sweeping the radiation flux across the computational grid. For unstructured grids this poses many challenges, particularly when implemented on distributed-memory parallel machines where...
The traditional, serial, algorithm for finding the strongly connected components in a graph is based on depth first search and has complexity which is linear in the size of the graph. Depth first search is difficult to parallelize, which creates a need for a different parallel algorithm for this problem. We describe the implementation of a recently...
Data partitioning and load balancing are important components of parallel computations. Many different partitioning strategies have been developed, with great effectiveness in parallel applications. But the load-balancing problem is not yet solved completely; new applications and architectures require new partitioning features. Existing algorithms...
As the need for complex parallel simulation software grows, better strategies for efficient and effective software development become important. We advocate a toolkit, or 'tinkertoy', approach to parallel application development. By providing efficient implementations of basic services commonly needed by applications, toolkits allow application devel...
Keywords: design and analysis of algorithms; graph algorithms; parallel algorithms; strongly connected components; divide-and-conquer; discrete ordinates method. Abstract: Strongly connected components of a directed graph can be found in optimal linear time by algorithms based on depth first search. Unfortunately, depth first search is difficult to paralleli...
Combinatorial algorithms have long played a pivotal enabling role in many applications of parallel computing. Graph algorithms in particular arise in load balancing, scheduling, mapping and many other aspects of the parallelization of irregular applications. These are still active research areas, mostly due to evolving computational techniques and...
This paper analyses a novel method for constructing preconditioners for diagonally dominant symmetric positive-definite matrices. The method discussed here is based on a simple idea: we construct M by simply dropping offdiagonal non-zeros from A and modifying the diagonal elements to maintain a certain row-sum property. The preconditioners are exte...
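The construction described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the abstract is truncated, so the "row-sum property" is assumed here to mean that each row sum of M equals the corresponding row sum of A, and the `keep` argument (which off-diagonal entries survive) is a hypothetical input whose selection rule the abstract does not give.

```python
def row_sum_preserving_drop(A, keep):
    """Build M from A by dropping off-diagonal entries not listed in `keep`.

    A    : square matrix as a list of lists.
    keep : set of (i, j) off-diagonal positions to retain (hypothetical input;
           how to choose them is not specified in the abstract).
    Each dropped value is added to its row's diagonal, so every row sum of M
    equals the corresponding row sum of A (an assumed reading of the
    "row-sum property").
    """
    n = len(A)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        M[i][i] = A[i][i]
        for j in range(n):
            if i == j or A[i][j] == 0:
                continue
            if (i, j) in keep:
                M[i][j] = A[i][j]       # retained off-diagonal entry
            else:
                M[i][i] += A[i][j]      # fold the dropped entry into the diagonal
    return M
```

If `keep` is symmetric and A is symmetric, M is symmetric as well, which is what a preconditioner for a symmetric positive-definite system would normally need.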
We consider linear systems arising from the use of the finite element method for solving scalar linear elliptic problems. Our main result is that these linear systems, which are symmetric and positive semidefinite, are well approximated by symmetric diagonally dominant matrices. Our framework for defining matrix approximation is support theory. Sig...
We show in this note how support preconditioners can be applied to a class of linear systems arising from use of the finite element method to solve linear elliptic problems. Our technique reduces the problem, which is symmetric and positive definite, to a symmetric positive definite diagonally dominant problem. Significant theory has already been d...
Many parallel applications require periodic redistribution of workloads and associated data. In a distributed memory computer, this redistribution can be difficult if limited memory is available for receiving messages. We propose a model for optimizing the exchange of messages under such circumstances which we call the minimum phase remapping probl...
We present support theory, a set of techniques for bounding extreme eigenvalues and condition numbers for matrix pencils. Our intended application of support theory is to enable proving condition number bounds for preconditioners for symmetric, positive definite systems. One key feature sets our approach apart from most other works: We use support...
Discrete ordinates methods are commonly used to simulate radiation transport for fire or weapons modeling. The computation proceeds by sweeping the flux across a grid. A particular cell can't be computed until all the cells immediately upwind of it are finished. If the directed dependence graph for the grid cells contains a cycle then sweeping meth...
The Zoltan library is a collection of data management services for parallel, unstructured, adaptive, and dynamic applications that is available as open-source software from www.cs.sandia.gov/zoltan. It simplifies the load-balancing, data movement, unstructured-communication, and memory usage difficulties that arise in dynamic applications such as a...
The explosive growth in the availability of information is overwhelming traditional information management systems. Although individual pieces of information have become easy to find, the larger context in which they exist has become harder to track. These contextual questions are ideally suited to visualization since the human visual system is re...
In many applications of parallel computing, distribution of the data unambiguously implies distribution of work among processors. But there are exceptions where some tasks can be assigned to one of several processors without altering the total volume of communication. In this paper, we study the problem of exploiting this flexibility in assignment...
Envelope methods for solving sparse systems of linear equations require the matrix to be reordered so that the nonzeros are near the diagonal. Optimal reorderings are known to be NP-complete, but a variety of heuristics have been proposed. In this paper we describe a multilevel approach for finding small envelope orderings and related ordering prob...
Introduction and General Principles; Philosophy of Zoltan; Coding Principles in Zoltan; Include files; Global Variables; Function Names; Parallel Communication; Memory Management; Errors, Warnings and Return Codes; Zoltan Distribution; CVS; Layout of Directories; Compilation and Makefiles; Load-Balancing Interface and Data Structures; Interface Functions; ID Data...
We present a little-known preconditioning technique, called support-graph preconditioning, and use it to analyze two classes of preconditioners. The technique was first described in a talk by Pravin Vaidya, who did not formally publish his results. Vaidya used the technique to devise and analyze a class of novel preconditioners. The technique was...
Graph partitioning is an important tool for dividing work amongst processors of a parallel machine, but it is unsuitable for some important applications. Specifically, graph partitioning requires the work per processor to be a simple sum of vertex weights. For many applications, this assumption is not true: the work (or memory) is a complex func...
Developing parallel software for unstructured problems continues to be a difficult undertaking, particularly for distributed memory machines. Framework and library support are limited for non-standard applications and developers are often forced to code from scratch. This is particularly true for complex, unstructured applications. In this paper, we s...
This memory cannot be utilized in subsequent phases, decreasing the total memory which is usable for communication, thus potentially increasing the number of phases. Instead, another processor can temporarily move some of its data to this processor to free up space for messages. An example is illustrated in Fig. 3. In this simple example, the top two pr...
The method of discrete ordinates is commonly used to solve the Boltzmann radiation transport equation for applications ranging from simulations of fires to weapons effects. The equations are most efficiently solved by sweeping the radiation flux across the computational grid. For unstructured grids this poses several interesting challenges, particu...
Calculations can naturally be described as graphs in which vertices represent computation and edges reflect data dependencies. By partitioning the vertices of a graph, the calculation can be divided among processors of a parallel computer. However, the standard methodology for graph partitioning minimizes the wrong metric and lacks expressibility....
Three classes of parallel algorithms for short-range classical molecular dynamics are presented and contrasted and their suitability for simulation of molecular systems is discussed. Performance of the algorithms on the Intel Paragon and Cray T3D in benchmark simulations of Lennard-Jones systems and of a macromolecular system is also highlighted....
Parallel computing offers new capabilities for using molecular dynamics (MD) to simulate larger numbers of atoms and longer time scales. In this paper we discuss two methods we have used to implement the embedded atom method (EAM) formalism for molecular dynamics on multiple-instruction/multiple-data (MIMD) parallel computers. The first method (ato...
Grid partitioning is the method of choice for decomposing a wide variety of computational problems into naturally parallel pieces. In problems where computational load on the grid or the grid itself changes as the simulation progresses, the ability to repartition dynamically and in parallel is attractive for achieving higher performance. We describ...
Effective use of a parallel computer requires that a calculation be carefully divided among the processors. This load balancing problem appears in many guises and has been a fervent area of research for the past decade or more. Although great progress has been made, and useful software tools developed, a number of challenges remain. It is the convi...
The standard serial algorithm for strongly connected components is based on depth first search, which is difficult to parallelize.
We describe a divide-and-conquer algorithm for this problem which has significantly greater potential for parallelization.
For a graph with n vertices in which degrees are bounded by a constant, we show the expected ser...
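One way to picture a divide-and-conquer approach to strongly connected components is the forward-backward scheme sketched below: pick a pivot, intersect the set it can reach with the set that can reach it to obtain one component, then split the remaining vertices into three independent subproblems. Because the abstract is truncated, treat this as an illustrative sketch rather than the paper's exact algorithm; the names `reachable` and `scc_divide_and_conquer` and the dict-of-lists graph format are assumptions.

```python
def reachable(start, nodes, edges):
    """Vertices of `nodes` reachable from `start` along `edges` (dict: v -> iterable)."""
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for w in edges.get(v, ()):
            if w in nodes and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def scc_divide_and_conquer(graph):
    """graph: dict mapping each vertex to a list of its successors.

    Returns a list of sets, one per strongly connected component.
    """
    # Build the reversed graph so predecessors can be found by reachability.
    rev = {v: [] for v in graph}
    for v, succs in graph.items():
        for w in succs:
            rev.setdefault(w, []).append(v)

    components = []
    work = [set(graph) | set(rev)]      # explicit worklist instead of recursion
    while work:
        nodes = work.pop()
        if not nodes:
            continue
        pivot = next(iter(nodes))
        desc = reachable(pivot, nodes, graph)   # descendants of the pivot
        pred = reachable(pivot, nodes, rev)     # predecessors of the pivot
        scc = desc & pred                       # the pivot's component
        components.append(scc)
        # Every other component lies wholly inside one of these three sets,
        # so they can be processed independently.
        work.extend([desc - scc, pred - scc, nodes - desc - pred])
    return components
```

The three subproblems share no vertices, so they could be handed to different processors; the reachability searches themselves can use any traversal order and need not be depth first.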
In many important computational mechanics applications, the computation adapts dynamically during the simulation. Examples include adaptive mesh refinement, particle simulations and transient dynamics calculations. When running these kinds of simulations on a parallel computer, the work must be assigned to processors in a dynamic fashion to keep th...
The design of general-purpose dynamic load-balancing tools for parallel applications is more challenging than the design of static partitioning tools. Both algorithmic and software engineering issues arise. We have addressed many of these issues in the design of the Zoltan dynamic load-balancing library. Zoltan has an object-oriented interface that...
The computing power available to scientists and engineers has increased dramatically in the past decade, due in part to progress in making massively parallel computing practical and available. The expectation for these machines has been great. The reality is that progress has been slower than expected. Nevertheless, massively parallel computing is...
A common operation in scientific computing is the multiplication of a sparse, rectangular or structurally nonsymmetric matrix and a vector. In many applications the matrix-transpose-vector product is also required. This paper addresses the efficient parallelization of these operations. We show that the problem can be expressed in terms of partition...
A common operation in scientific computing is the multiplication of a sparse, rectangular or structurally nonsymmetric matrix and a vector. In many applications the matrix-transpose-vector product is also required. This paper addresses the efficient parallelization of these operations. We show that the problem can be expressed in terms of partitio...
Algorithms for finding the prime factors of large composite numbers are of practical importance because of the widespread use of public key cryptosystems whose security depends on the presumed difficulty of the factorisation problem. In recent years ...
A method of data mining represents related items in a multidimensional space. Distance between items in the multidimensional space corresponds to the extent of relationship between the items. The user can select portions of the space to perceive. The user also can interact with and control the communication of the space, focusing attention on aspec...
We describe a general strategy we have found effective for parallelizing solid mechanics simulations. Such simulations often have several computationally intensive parts, including finite element integration, detection of material contacts, and particle interaction if smoothed particle hydrodynamics is used to model highly deforming materials. The...
A number of computational procedures employ multiple grids on which solutions are computed. For example, in multi-physics simulations a primary grid may be used to compute mechanical deformation of an object while a secondary grid is used for thermal conduction calculations. When modeling coupled thermo-mechanical effects, solution data must be int...
This paper addresses the problem of partitioning the nonzeros of sparse nonsymmetric and nonsquare matrices in order to efficiently compute parallel matrix-vector and matrix-transpose-vector multiplies. Our goal is to balance the work per processor while keeping communications costs low. Although the symmetric partitioning problem has been well-s...
The explosive growth in the availability of information is overwhelming traditional information management systems. Although individual pieces of information have become easy to find, the larger context in which they exist has become harder to track. These contextual questions are ideally suited to visualization since the human visual system is rem...
An efficient, scalable, parallel algorithm for treating material surface contacts in solid mechanics finite element programs
has been implemented in a modular way for multiple-instruction, multiple-data (MIMD) parallel computers. The serial contact
detection algorithm that was developed previously for the transient dynamics finite element code PRON...
Transient dynamics simulations are commonly used to model phenomena such as car crashes, underwater explosions, and the response of shipping containers to high-speed impacts. Physical objects in such a simulation are typically represented by Lagrangian meshes because the meshes can move and deform with the objects as they undergo stress. Fluids (ga...
We describe a parallel algorithm for finding the eigenvalues and eigenvectors of a dense symmetric matrix, with an emphasis on the dense linear algebra operations. We follow the traditional three-step process: reduce to tridiagonal form, solve the tridiagonal problem, then backtransform the result. Since the different steps have different algorithm...
We describe our parallelization of PRONTO, Sandia's transient solid dynamics code, via a novel algorithmic approach that utilizes multiple decompositions for different key segments of the computations, including the material contact calculation. This latter calculation is notoriously difficult to perform well in parallel, because it involves dynami...
Many important macroscopic properties of materials depend upon the number of microscopic degrees of freedom. The task of counting the number of such degrees of freedom can be computationally very expensive. We describe a new approach for this calculation which is appropriate for two-dimensional, glass-like networks, building upon recent work in gra...
An efficient, scalable, parallel algorithm for treating contacts in solid mechanics has been applied to interactions between particles in smooth particle hydrodynamics (SPH). The algorithm uses three different decompositions within a single timestep: (1) a static FE-decomposition of mesh elements; (2) a dynamic SPH-decomposition of SPH particles; (...
Graph partitioning is an important abstraction used in solving many scientific computing problems. Unfortunately, the standard partitioning model does not incorporate considerations that are important in many settings. We address this by describing a generalized partitioning model which incorporates the notion of partition skew and is applicable to...
Transient dynamics simulations are commonly used to model phenomena such as car crashes, underwater explosions, and the response of shipping containers to high-speed impacts. Physical objects in such a simulation are typically represented by Lagrangian meshes because the meshes can move and deform with the objects as they undergo stress. Fluids (...
Envelope methods for solving sparse systems of linear equations require the matrix to be reordered so that the nonzeros are near the diagonal. Optimal reorderings are known to be NP-complete, but a variety of heuristics have been proposed. In this paper we describe a multilevel approach for finding small envelope orderings and related ordering pr...
Terminal propagation is a method developed in the circuit placement community for adding constraints to graph partitioning problems. This paper adapts and expands this idea, and applies it to the problem of partitioning data structures among the processors of a parallel computer. We show how the constraints in terminal propagation can be used to en...
Short-range molecular dynamics simulations of molecular systems are commonly parallelized by replicated-data methods, where each processor stores a copy of all atom positions. This enables computation of bonded 2-, 3-, and 4-body forces within the molecular topology to be partitioned among processors straightforwardly. A drawback to such metho...
Simulations of interacting particles are common in science and engineering, appearing in such diverse disciplines as astrophysics, fluid dynamics, molecular physics, and materials science. These simulations are often computationally intensive and so are natural candidates for massively parallel computing. Many-body simulations that directly compute...
The multiplication of a vector by a matrix is the kernel operation in many algorithms used in scientific computation. A fast and efficient parallel algorithm for this calculation is therefore desirable. This paper describes a parallel matrix-vector multiplication algorithm which is particularly well suited to dense matrices or matrices with an i...
Efficient use of a distributed memory parallel computer requires that the computational load be balanced across processors in a way that minimizes interprocessor communication. A new domain mapping algorithm is presented that extends recent work in which ideas from spectral graph theory have been applied to this problem. The generalization of spect...
The graph partitioning problem is that of dividing the vertices of a graph into sets of specified sizes such that few edges cross between sets. This NP-complete problem arises in many important scientific and engineering problems. Prominent examples include the decomposition of data structures for parallel computation, the placement of circuit elem...
Many scientific and engineering applications require a detailed analysis of complex systems with strongly coupled fluid flow, thermal energy transfer, mass transfer, and nonequilibrium chemical reactions. Here we describe the performance of a newly developed application code, SALSA, designed to simulate these complex flows on large-scale parallel mac...
Given a set of objects and a correlation function f reflecting the desire for two items to be near each other, find all sequences π of the items so that correlation preferences are preserved; that is, if π(i) < π(j) < π(k) then f(i,j) ≥ f(i,k) and f(j,k) ≥ f(i,k). This seriation problem has numerous applications, for instance, solv...
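The ordering condition stated in this abstract can be checked directly for a single candidate sequence. The sketch below does only that brute-force check over all triples; it does not find the orderings and is not the paper's algorithm. The function name `is_consistent_ordering` and the callable form of f are assumptions for illustration.

```python
from itertools import combinations


def is_consistent_ordering(order, f):
    """Check the seriation condition for one candidate ordering.

    order : sequence of items, listed left to right.
    f     : symmetric correlation function f(a, b) -> number.
    Returns True when, for every triple i, j, k appearing in that order,
    f(i, j) >= f(i, k) and f(j, k) >= f(i, k).
    """
    # combinations() preserves the input order within each triple.
    for i, j, k in combinations(order, 3):
        if f(i, j) < f(i, k) or f(j, k) < f(i, k):
            return False
    return True
```

For example, with items on a number line and f(a, b) = -|a - b|, the sorted order passes the check, while swapping two interior items makes it fail. The check costs cubic time in the number of items, so it is suited only to verifying small instances.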