Fig 9 - uploaded by Joe Eaton

Source publication

The solution of large sparse linear systems arises in many applications, such as computational fluid dynamics and oil reservoir simulation. In realistic cases the matrices are often so large that they require large scale distributed parallel computing to obtain the solution of interest in a reasonable time. In this paper we discuss the design and i...

## Context in source publication

**Context 1**

... result of packing for the matrix and vector is shown on the right side of Fig 9, where for the matrix the diagonal blocks and the remaining columns are indicated by the solid line and dashed lines, respectively, while for the vector the local and halo elements are indicated by the solid line and dashed lines, respectively. ...
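The packing described in this excerpt can be sketched in a few lines. This is an illustrative stand-in, not the paper's code: the function names and the dict-of-columns row layout are assumptions, and a real implementation would operate on CSR arrays and MPI-rank ownership maps.

```python
# Illustrative sketch of the packing described above: each process splits its
# rows' entries into the diagonal block (columns it owns) and the remaining
# "halo" columns, and its vector into local and halo parts.
# Hypothetical names/layout -- not the paper's actual data structures.

def pack_local_rows(rows, owned_cols):
    """rows: one dict {col: value} per local row; owned_cols: set of owned column ids."""
    diag_block, off_block = [], []
    for row in rows:
        diag_block.append({c: v for c, v in row.items() if c in owned_cols})
        off_block.append({c: v for c, v in row.items() if c not in owned_cols})
    return diag_block, off_block

def split_vector(x, owned_cols):
    """x: dict {index: value}; returns (local part, halo part)."""
    local = {c: v for c, v in x.items() if c in owned_cols}
    halo = {c: v for c, v in x.items() if c not in owned_cols}
    return local, halo
```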

## Similar publications

We show how a scalable preconditioner for the primal discontinuous Petrov-Galerkin (DPG) method can be developed using existing algebraic multigrid (AMG) preconditioning techniques. The stability of the DPG method gives a norm equivalence which allows us to exploit existing AMG algorithms and software. We show how these algebraic preconditioners ca...

## Citations

... Libraries PETSc [23], Eigen [24], AMGX [25] and CUSP [26] are only few of the examples of software libraries developed for solving linear systems. Although these solutions constitute the current state-of-the-art, they are either specific to a single target hardware or support only a single format internally and it is often a significant undertaking to provide full support for a new format. ...

Sparse matrices and linear algebra are at the heart of scientific simulations. More than 70 sparse matrix storage formats have been developed over the years, targeting a wide range of hardware architectures and matrix types. Each format is developed to exploit the particular strengths of an architecture, or the specific sparsity patterns of matrices, and the choice of the right format can be crucial for achieving optimal performance. The adoption of dynamic sparse matrices, which can change the underlying data structure to match the computation at runtime without introducing prohibitive overheads, has the potential to optimize performance through dynamic format selection. In this paper, we introduce Morpheus, a library that provides an efficient abstraction for dynamic sparse matrices. The adoption of dynamic matrices aims to improve the productivity of developers and end-users who do not need to know and understand the implementation specifics of the different formats available, but still want to take advantage of the optimization opportunities to improve the performance of their applications. We demonstrate that by porting HPCG to use Morpheus, and without further code changes, 1) HPCG can now target heterogeneous environments and 2) the performance of the SpMV kernel is improved by up to 2.5x and 7x on CPUs and GPUs respectively, through runtime selection of the best format on each MPI process.
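As an illustration of the dynamic-format idea, here is a minimal sketch of runtime format selection plus a COO-to-CSR conversion. This is NOT the Morpheus API; the function names and the selection heuristic are invented for this example.

```python
# Hypothetical runtime format selection in the spirit of dynamic sparse
# matrices: pick a format from the nonzeros-per-row distribution, then
# convert the internal COO representation to the chosen layout.

def choose_format(row_lengths):
    """ELL suits near-uniform rows (little padding); CSR suits irregular ones.
    The 1.5x threshold is an arbitrary illustrative heuristic."""
    longest = max(row_lengths)
    mean = sum(row_lengths) / len(row_lengths)
    return "ELL" if longest <= 1.5 * mean else "CSR"

def coo_to_csr(nrows, coo):
    """coo: iterable of (row, col, val) triples; rows need not be sorted."""
    ptr = [0] * (nrows + 1)
    for r, _, _ in coo:
        ptr[r + 1] += 1
    for i in range(nrows):           # prefix-sum row counts into offsets
        ptr[i + 1] += ptr[i]
    cols = [0] * len(coo)
    vals = [0.0] * len(coo)
    fill = list(ptr[:-1])            # next free slot per row
    for r, c, v in coo:
        k = fill[r]
        cols[k], vals[k] = c, v
        fill[r] += 1
    return ptr, cols, vals
```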

... The algebraic multigrid (AMG) method is one of the most efficient solution techniques for solving linear systems arising from the discretization of second-order elliptic PDEs. COMSOL Multiphysics and ANSYS SpaceClaim both have AMG solvers to solve large-scale sparse linear systems, and the AMG method is often used as a preconditioner in Krylov subspace solvers [5,6]. Meanwhile, many scholars have proposed algebraic multigrid preconditioned conjugate gradient (AMGPCG) solvers [7][8][9][10], which are used to solve the problem of poor convergence in ill-conditioned linear systems. ...

At present, the electron optical simulator (EOS) takes a long time to solve linear FEM systems. The algebraic multigrid preconditioned conjugate gradient (AMGPCG) method can improve the efficiency of solving such systems. This paper focuses on the implementation of the AMGPCG method in EOS. The aggregation-based scheme, which uses two passes of a pairwise matching algorithm and the K-cycle scheme, is adopted in the aggregation-based algebraic multigrid method. Numerical experiments show the advantages and disadvantages of the AMG algorithm in peak memory and solving efficiency. AMGPCG is more efficient than the iterative methods used in the past and needs only one coarsening when EOS computes the particle motion trajectory.
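For readers unfamiliar with AMGPCG, the outer iteration is a standard preconditioned conjugate gradient. A minimal sketch follows; an AMG preconditioner is far too involved for a few lines, so a Jacobi (diagonal) preconditioner stands in, and any symmetric positive definite preconditioner, including an AMG V-cycle, could be plugged in through the same `precond` callback (names here are illustrative).

```python
# Minimal preconditioned conjugate gradient (PCG) sketch with a pluggable
# preconditioner. Dense Python lists are used for brevity; a real solver
# would use sparse storage and an AMG cycle as precond.

def pcg(A, b, precond, tol=1e-10, maxit=100):
    n = len(b)
    matvec = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = [0.0] * n
    r = list(b)                      # residual for the zero initial guess
    z = precond(r)                   # z = M^{-1} r
    p = list(z)
    rz = dot(r, z)
    for _ in range(maxit):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            break
        z = precond(r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x
```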

... However, efficiently parallelizing the AMG algorithm on massively parallel accelerators such as GPGPUs is a challenging problem [19]. On the GPGPU platform, the sparse approximate inverse (SPAI) preconditioner works well in many real-world applications according to the latest literature [13][21][23]. ...

Sparse Matrix-Vector Multiplication (SpMV) is a critical operation in the iterative solvers used by finite element methods in computer simulation. Since SpMV is a memory-bound algorithm, the efficiency of data movement heavily influences its performance on GPUs. In recent years, much research has been conducted on accelerating SpMV on graphics processing units (GPUs). The optimization methods used in existing studies focus on two areas: improving load balancing between GPU processors, and reducing execution divergence between GPU threads. Although some studies have made preliminary optimizations to input-vector fetching, the effect of explicitly caching the input vector in GPU-based SpMV has not yet been studied in depth. In this study, we aim to minimize the data-movement cost of GPU-based SpMV using a new framework named "explicit caching Hybrid (EHYB)". The EHYB framework achieves significant performance improvements through the following methods: 1. improving the speed of data movement by partitioning the input vector and explicitly caching it in the shared memory of the CUDA kernel; 2. reducing the volume of data movement by storing the major part of the column indices in a compact format. We tested our implementation with sparse matrices derived from FEM applications in different areas. The experimental results show that our implementation outperforms state-of-the-art implementations with significant speedups, and achieves higher FLOPS than the theoretical performance upper bound of existing GPU-based SpMV implementations.
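The explicit-caching idea can be illustrated with a CPU stand-in: split the columns into fixed-width tiles, gather the needed slice of the input vector once per tile (the analogue of staging it in GPU shared memory), and store column indices relative to the tile base so they fit in a compact integer type. The data layout below is an assumption for illustration, not the actual EHYB format.

```python
# CPU illustration of explicitly "caching" slices of the input vector during
# SpMV. Each tile covers tile_width consecutive columns; its slice of x is
# gathered once and reused by every row, and column indices are tile-local.

def spmv_tiled(nrows, tile_width, tiles, x):
    """tiles: list of (col_base, rows), where rows[i] lists (local_col, val)
    pairs for matrix row i restricted to that tile's columns."""
    y = [0.0] * nrows
    for col_base, rows in tiles:
        xs = x[col_base:col_base + tile_width]  # gathered ("cached") slice
        for i, entries in enumerate(rows):
            for local_col, val in entries:
                y[i] += val * xs[local_col]
    return y
```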

... Porting well-established numerical methods to GPUs is therefore an important research topic. In this regard, several works focus on the development of AMG solvers for GPUs: e.g., [9] discusses strategies and experiences for porting the solvers from the hypre software, including AMG solvers [11,30], to GPUs, while [16] presents the AmgX solver from NVIDIA, a reference solver specifically developed for GPUs. Further, block-asynchronous smoothers are studied in [2]. ...

... Regarding the reported speedups of the GPU version over the multithreaded one, they mainly indicate, for the considered machine, the typical performance improvement from transferring the computation to the GPU. We now report on the comparison of the considered method with AmgX, a GPU-only solver by NVIDIA [16]. Two configurations of AmgX are considered, referred to as the aggregation ("Agg") and classical ("Cls") configurations. (Table 3.3 reports the setup time for AGMG (sequential, on CPU), the solve time, and the number of iterations for the hybrid GPU-CPU version of AGMG (AGMG-GPU) and for AmgX applied to the 2D problems; n.c. ...

... With increasing number of devices, the linear (barotropic) solver eventually becomes a bottleneck. We expect to be able to remedy this by using NVIDIA's AmgX library (Naumov et al., 2015), which is optimized from the ground up for multi-GPU, distributed use cases like ours. ...

Even to this date, most earth system models are coded in Fortran, especially those used at the largest compute scales. Our ocean model Veros takes a different approach: it is implemented using the high-level programming language Python. Besides numerous usability advantages, this allows us to leverage modern high-performance frameworks that emerged in tandem with the machine learning boom. By interfacing with the JAX library, Veros is able to run high-performance simulations on both central processing units (CPUs) and graphics processing units (GPUs) through the same code base, with full support for distributed architectures. On CPU, Veros is able to match the performance of a Fortran reference, both on a single process and on hundreds of CPU cores. On GPU, we find that each device can replace dozens to hundreds of CPU cores, at a fraction of the energy consumption. We demonstrate the viability of using GPUs for earth system modeling by integrating a global 0.1° eddy-resolving setup in single precision, where we achieve 1.3 model years per day on a single compute instance with 16 GPUs, comparable to over 2,000 Fortran processes.

... For computations that are naturally or embarrassingly parallel, such as the evaluation of properties for each cell from interpolation tables, exposing this concurrency is often relatively straightforward. For others, including complex multilevel preconditioners (Esler et al. 2012;Naumov et al. 2015), very significant restructuring is required to achieve good performance, often in addition to the domain decomposition methodology used in conventional CPU-based parallelism. ...

Recently, graphics processing units (GPUs) have been demonstrated to provide a significant performance benefit for black-oil reservoir simulation, as well as flash calculations that serve an important role in compositional simulation. A comprehensive approach to compositional simulation based on GPUs has yet to emerge, and the question remains as to whether the benefits observed in black-oil simulation persist with a more complex fluid description. We present a positive answer to this question through the extension of a commercial GPU-based black-oil simulator to include a compositional description based on standard cubic equations of state (EOSs). We describe the motivations for the selected nonlinear formulation, including the choice of primary variables and iteration scheme, and support for both fully implicit methods (FIMs) and adaptive implicit methods (AIMs). We then present performance results on an example sector model and simplified synthetic case designed to allow a detailed examination of runtime and memory scaling with respect to the number of hydrocarbon components and model size, as well as the number of processors. We finally show results from two complex asset models (synthetic and real) and examine performance scaling with respect to GPU generation, demonstrating that performance correlates strongly with GPU memory bandwidth.
NOTE: This paper is published as part of the 2021 SPE Reservoir Simulation Conference Special Issue.

... The OpenFOAM package uses the Nvidia AMGx library for its GPU accelerated linear solves [14]. AMGx has previously demonstrated excellent weak scaling on up to 512 Kepler class GPUs [15], although those simulations consisted of structured grid models whose matrices arise from a 7-point stencil. While their scaling studies addressed elliptic solves, the matrices were well conditioned and thus straightforward to handle by AMG methods. ...

... This contrasts with our study, which focuses on poorly conditioned matrices that lack structure. More specifically, we focus on the strong scaling behavior down to 10^5 DoFs per GPU whereas [15] addressed problems whose minimum size was an order of magnitude larger. Ellipsys3D [16][17][18] was recently validated against Nalu-Wind for turbine blade modelling [19]. ...

The U.S. Department of Energy has identified exascale-class wind farm simulation as critical to wind energy scientific discovery. A primary objective of the ExaWind project is to build high-performance, predictive computational fluid dynamics (CFD) tools that satisfy these modeling needs. GPU accelerators will serve as the computational thoroughbreds of next-generation, exascale-class supercomputers. Here, we report on our efforts in preparing the ExaWind unstructured mesh solver, Nalu-Wind, for exascale-class machines. For computing at this scale, a simple port of the incompressible-flow algorithms to GPUs is insufficient. One needs novel algorithms that are application aware, memory efficient, and optimized for the latest-generation GPU devices to achieve high performance. The results of our efforts are unstructured-mesh simulations of wind turbines that can effectively leverage thousands of GPUs. In particular, we demonstrate a first-of-its-kind, incompressible-flow simulation using Algebraic Multigrid solvers that strong scales to over 4000 GPUs on the Summit supercomputer.

... These requirements severely restrict the available options. In fact, NVIDIA's own AmgX library [27] is currently the only established package that meets all of these criteria. Although the options are limited, AmgX is considered the current state-of-the-art for sparse linear solvers for multi-GPU and supports a number of arbitrarily nested solvers, smoothers, and preconditioners. ...

Recent advances in the development of Eulerian incompressible smoothed particle hydrodynamics (EISPH), such as high-order convergence and natural coupling with Lagrangian formulations, demonstrate its potential as a meshless alternative to traditional computational fluid dynamics (CFD) methods. This work aims to address one of the major outstanding limitations of EISPH, its relatively high computational cost, by providing an implementation that can be deployed on multiple graphical processing units (GPUs). To this end, a pre-existing multi-GPU version of the open-source Lagrangian weakly-compressible code DualSPHysics is converted to an EISPH formulation and integrated with an open-source multi-GPU multigrid solver (AmgX) to treat the pressure Poisson equation implicitly. The most challenging aspect of this work is the integration of AmgX within DualSPHysics, since AmgX is designed for distributed systems and therefore conflicts with the single-node shared memory design of the multi-GPU DualSPHysics code. The present implementation is first validated against the 2D Taylor-Green vortex flow, showing excellent agreement with the analytical solution and demonstrating second-order convergence. Detailed performance tests are then presented to investigate memory consumption and scaling characteristics. The results show approximately 77–86% and 89–91% efficiency on up to four GPUs for strong and weak scaling, respectively. Furthermore, the present implementation is shown to permit problem sizes on the order of 33.5 million (2D) and 10.2 million (3D) particles on four GPUs (64 GB total device memory), which is beyond what has previously been reported in the literature for implicit incompressible SPH on GPUs and demonstrates its potential as an alternative to traditional CFD methods.

... First attempts to port linear solvers to GPU hardware were based on a simple, straightforward approach (Saad, 2013a, 2013b). More recently, some AMG-based solvers running entirely on a single GPU or on multiple GPUs have been proposed, exhibiting very good performance on classic linear algebra problems (Bell et al., 2012; Bernaschi et al., 2019b; Gandham et al., 2014; Naumov et al., 2015). However, in challenging real-world problems such as those arising from structural mechanics or fluid flow in highly heterogeneous formations, standard AMG solvers may be slow to converge or even fail, so that more advanced approaches are needed. ...

The solution of linear systems of equations is a central task in a number of scientific and engineering applications. In many cases the solution of linear systems may take most of the simulation time, thus representing a major bottleneck in the further development of scientific and technical software. For large-scale simulations, nowadays accounting for several millions or even billions of unknowns, it is quite common to resort to preconditioned iterative solvers to exploit their low memory requirements and their, at least potential, parallelism. Approximate inverses have been shown to be robust and effective preconditioners in various contexts. In this work, we show how the adaptive Factored Sparse Approximate Inverse (aFSAI) preconditioner, characterized by a very high degree of parallelism, can be successfully implemented on a distributed-memory computer equipped with GPU accelerators. Taking advantage of GPUs in the adaptive FSAI set-up is not a trivial task; nevertheless, we show through extensive numerical experimentation how the proposed approach outperforms more traditional preconditioners and results in close-to-ideal behavior on challenging linear algebra problems.
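As background, a static FSAI preconditioner on a prescribed lower-triangular pattern can be sketched as follows; the adaptive pattern search that distinguishes aFSAI is omitted, and the names and dense-list layout are illustrative rather than the paper's implementation. For each row i, a small dense system over the pattern is solved and scaled so that (G A G^T)_ii = 1, giving the factored approximate inverse M^{-1} = G^T G ≈ A^{-1}.

```python
# Static FSAI sketch: G is lower triangular on a prescribed pattern; each of
# its rows is obtained independently (hence the high degree of parallelism)
# from a small dense solve over the pattern's submatrix of the SPD matrix A.

def solve_dense(M, b):
    """Gaussian elimination with partial pivoting on copies of M and b."""
    n = len(b)
    M = [row[:] for row in M]
    b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        b[k], b[p] = b[p], b[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n):
                M[r][c] -= f * M[k][c]
            b[r] -= f * b[k]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (b[k] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def fsai(A, patterns):
    """patterns[i]: sorted column indices allowed in row i of G, ending with i."""
    n = len(A)
    G = [[0.0] * n for _ in range(n)]
    for i, P in enumerate(patterns):
        sub = [[A[r][c] for c in P] for r in P]
        e = [0.0] * len(P)
        e[-1] = 1.0
        y = solve_dense(sub, e)
        s = y[-1] ** 0.5             # scale so that (G A G^T)_ii = 1
        for j, c in enumerate(P):
            G[i][c] = y[j] / s
    return G
```

With the full lower-triangular pattern, this construction reproduces the exact inverse Cholesky factor, so G A G^T equals the identity; sparser patterns trade accuracy for cheaper application.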

... Most of the existing AMG software packages (e.g. FASP [40], BoomerAMG [41] and AmgX [42]) are built on it. It has Setup and Solve phases. ...

In this study we construct a time-space finite element (FE) scheme and furnish cost-efficient approximations for one-dimensional multi-term time fractional advection diffusion equations on a bounded domain $\Omega$. Firstly, a fully discrete scheme is obtained by the linear FE method in both the temporal and spatial directions, and many characterizations of the resulting matrix are established. Secondly, the condition number estimate is proved, and an adaptive algebraic multigrid (AMG) method is further developed to lessen the computational cost and analyzed in the classical framework. Finally, some numerical experiments are implemented to reach the saturation error order in the $L^2(\Omega)$-norm sense, and to present theoretical confirmations and predictable behaviors of the proposed algorithm.
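For reference, a multi-term time-fractional advection-diffusion equation of the kind treated above is commonly written as follows; the notation (Caputo derivatives, constant coefficients) is a generic assumption, not necessarily the paper's exact formulation:

```latex
% Generic multi-term time-fractional advection-diffusion equation;
% coefficients and notation are illustrative assumptions.
\sum_{i=1}^{m} c_i \, {}^{C}\!D_t^{\alpha_i} u(x,t)
  \;-\; \kappa \, \partial_x^2 u(x,t)
  \;+\; v \, \partial_x u(x,t) \;=\; f(x,t),
\qquad x \in \Omega,\; t \in (0,T],
```

with fractional orders $0 < \alpha_m \le \cdots \le \alpha_1 \le 1$, diffusivity $\kappa > 0$, advection velocity $v$, and suitable initial and boundary data.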