Conference Paper

Multigrid on GPU: Tackling Power Grid Analysis on parallel SIMT platforms

DOI: 10.1109/ICCAD.2008.4681645 Conference: 2008 International Conference on Computer-Aided Design (ICCAD'08), November 10-13, 2008, San Jose, CA, USA
Source: DBLP


The challenging task of analyzing on-chip power (ground) distribution networks with multi-million node complexity and beyond is key to today's large chip designs. For the first time, we show how to exploit recent massively parallel single-instruction multiple-thread (SIMT) based graphics processing unit (GPU) platforms to tackle power grid analysis with promising performance. Several key enablers, including GPU-specific algorithm design, circuit topology transformation, workload partitioning, and performance tuning, are embodied in our GPU-accelerated hybrid multigrid algorithm, GpuHMD, and its implementation. In particular, a proper interplay between algorithm design and SIMT architecture considerations is shown to be essential to achieving good runtime performance. Unlike standard CPU-based CAD development, care must be taken to balance computation against memory access, reduce random memory access patterns, and simplify flow control to achieve efficiency on the GPU platform. Extensive experiments on industrial and synthetic benchmarks show that the proposed GpuHMD engine can achieve a 100× runtime speedup over a state-of-the-art direct solver and be more than 15× faster than the CPU-based multigrid implementation. The DC analysis of a 1.6 million-node industrial power grid benchmark can be accurately solved in three seconds with less than 50 MB of memory on a commodity GPU. The proposed approach is observed to scale favorably with circuit complexity, at a rate of about one second per million nodes.
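To make the multigrid idea in the abstract concrete, the following is a minimal sketch of a geometric multigrid V-cycle for DC analysis. It is not the GpuHMD algorithm and it runs on the CPU. It assumes a much simpler setting than the paper's: a single 1D power rail with uniform segment conductance `g`, both ends tied to ideal supply pads, and a node count of the form 2^k - 1. All function names (`residual`, `jacobi`, `restrict`, `prolong`, `v_cycle`, `solve_ir_drop`) are hypothetical and chosen only for illustration. The unknown is the IR drop at each node, and the smoother touches only nearest neighbours, which is the kind of regular, branch-light access pattern the abstract says suits SIMT hardware.

```python
def residual(u, b, g):
    """Return r = b - A u for A = g * tridiag(-1, 2, -1), zero drop at both pads."""
    n = len(u)
    r = [0.0] * n
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r[i] = b[i] - g * (2.0 * u[i] - left - right)
    return r


def jacobi(u, b, g, sweeps=2, omega=2.0 / 3.0):
    """Weighted Jacobi smoothing; every node updates independently of the others."""
    for _ in range(sweeps):
        r = residual(u, b, g)
        u = [ui + omega * ri / (2.0 * g) for ui, ri in zip(u, r)]
    return u


def restrict(r):
    """Aggregate fine-level residual currents onto coarse nodes (transpose of prolong)."""
    nc = (len(r) - 1) // 2
    return [0.5 * r[2 * j] + r[2 * j + 1] + 0.5 * r[2 * j + 2] for j in range(nc)]


def prolong(ec, n):
    """Linearly interpolate a coarse-level correction back to n fine nodes."""
    e = [0.0] * n
    for j, v in enumerate(ec):
        e[2 * j] += 0.5 * v
        e[2 * j + 1] += v
        e[2 * j + 2] += 0.5 * v
    return e


def v_cycle(u, b, g):
    """One V-cycle. The coarse rail uses conductance g / 2 (two segments in series)."""
    n = len(u)
    if n == 1:
        return [b[0] / (2.0 * g)]  # exact solve on the coarsest level
    u = jacobi(u, b, g)                              # pre-smoothing
    rc = restrict(residual(u, b, g))                 # move residual to coarse level
    ec = v_cycle([0.0] * len(rc), rc, g / 2.0)       # solve coarse error equation
    u = [ui + ei for ui, ei in zip(u, prolong(ec, n))]
    return jacobi(u, b, g)                           # post-smoothing


def solve_ir_drop(loads, g=1.0, tol=1e-9, max_cycles=50):
    """Iterate V-cycles until the residual shrinks by tol; len(loads) must be 2**k - 1."""
    n = len(loads)
    assert n >= 1 and (n + 1) & n == 0, "node count must be 2**k - 1"
    u = [0.0] * n
    r0 = max(abs(x) for x in loads) or 1.0
    cycles = 0
    while cycles < max_cycles:
        u = v_cycle(u, loads, g)
        cycles += 1
        if max(abs(x) for x in residual(u, loads, g)) <= tol * r0:
            break
    return u, cycles


if __name__ == "__main__":
    drops, cycles = solve_ir_drop([1e-3] * 63)
    print(f"worst IR drop: {max(drops):.6f} V after {cycles} V-cycles")
```

The coarse-level conductance `g / 2` follows from pairing the interpolation with its transpose as the restriction, so no coarse matrix is ever stored. A real power grid is a 2D or 3D mesh with irregular regions, which is where the topology transformation and workload partitioning mentioned in the abstract would come in; this sketch does not attempt either.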

  • Source
    • "Recently, there are also some research works for GPU-based iterative solver for sparse systems [13] [38] [8] [41] [4] [5] [12] [20]. In [38], GMRES solver has been accelerated on GPU by simply parallelizing the computing of polynomial preconditioners. "
    ABSTRACT: In this paper, we propose an efficient parallel dynamic linear solver, called GPU-GMRES, for transient analysis of large linear dynamic systems such as large power grid networks. The new method is based on the preconditioned generalized minimum residual (GMRES) iterative method implemented on heterogeneous CPU–GPU platforms. The new solver is very robust and can be applied to power grids with different structures as well as for general analysis problems for large linear dynamic systems with asymmetric matrices. The proposed GPU-GMRES solver adopts the very general and robust incomplete LU based preconditioner. We show that by properly selecting the right amount of fill-ins in the incomplete LU factors, a good trade-off between GPU efficiency and convergence rate can be achieved for the best overall performance. Such tunable feature can make this algorithm very adaptive to different problems. GPU-GMRES solver properly partitions the major computing tasks in GMRES solver to minimize the data traffic between CPU and GPUs to enhance performance of the proposed method. Furthermore, we propose a new fast parallel sparse matrix–vector (SpMV) multiplication algorithm to further accelerate the GPU-GMRES solver. The new algorithm, called segSpMV, can enjoy full coalesced memory access compared to existing approaches. To further improve the scalability and efficiency, segSpMV method is further extended to multi-GPU platforms, which leads to more scalable and faster multi-GPU GMRES solver. Experimental results on the set of the published IBM benchmark circuits and mesh-structured power grid networks show that the GPU-GMRES solver can deliver order of magnitudes speedup over the direct LU solver, UMFPACK. The resulting multi-GPU-GMRES can also deliver 3–12× speedup over the CPU implementation of the same GMRES method on transient analysis.
    Full-text · Article · Jul 2015 · Integration, the VLSI Journal
  • Source
    • "Nowadays, the emerging multi-core and many-core platforms bring powerful computing resources and opportunities for parallel computing. Even more, cloud computing techniques [34] drive distributed systems scaling to thousands of computing nodes [35]–[37], etc. Distributed computing systems have been incorporated into products of many leading EDA companies and in-house simulators [38]–[42]. However, building scalable and efficient distributed algorithmic framework for transient linear circuit simulation framework is still a challenge to leverage these powerful computing tools. "
    ABSTRACT: In this work, we design an efficient and accurate algorithmic framework using matrix exponentials for time-domain simulation of power delivery network (PDN). Thanks to the explicit exponential time integration scheme with high order approximation of differential equation system, our framework can reuse factorized matrices for adaptive time stepping without loss of accuracy. The key operation of matrix exponential and vector product (MEVP) is computed by proposed efficient rational Krylov subspace method and helps achieve large stepping. With the enhancing capability of time marching and high-order approximation capability, we design R-MATEX, which outperforms the classical PDN simulation method using trapezoidal formulation with fixed step size (TR-FTS). We also propose a distributed computing framework, DR-MATEX, and highly accelerate the simulation speedup by reducing Krylov subspace generations caused by frequent breakpoints from the side of current sources. By virtue of the superposition property of linear system and scaling invariance property of Krylov subspace, DR-MATEX can divide the whole simulation task into subtasks based on the alignments of breakpoints among current sources. Then, the subtasks are processed in parallel at different computing nodes and summed up at the end of simulation to provide the accurate solutions. The experimental results show R-MATEX and DR-MATEX can achieve 11.4x and 68.0x runtime speedups on average over TR-FTS.
    Full-text · Article · May 2015 · IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
  • Source
    • "They used geometric and algebraic multigrid (aMG) for finite-difference type discretisations. More recent publications presenting applications that require multigrid solvers are supersonic flows (aMG, unstructured grids [6]), (interactive) flow simulations for feature film (aMG/gMG, structured [7] [8]), out-of core multigrid for gigapixel image stitching (gMG/aMG, structured [9]), image denoising and optical flow (gMG/aMG, structured [10]), power grid analysis (aMG, structured/unstructured [11]) and electric potential in the human heart (aMG, unstructured [12]). This last paper is similar in spirit to our work, since the authors also reduce (almost) the entire multigrid algorithm to sequences of sparse matrix-vector multiplications. "
    ABSTRACT: We describe our FE-gMG solver, a finite element geometric multigrid approach for problems relying on unstructured grids. We augment our GPU- and multicore-oriented implementation technique based on cascades of sparse matrix–vector multiplication by applying strong smoothers. In particular, we employ Sparse Approximate Inverse (SPAI) and Stabilised Approximate Inverse (SAINV) techniques. We focus on presenting the numerical efficiency of our smoothers in combination with low- and high-order finite element spaces as well as the hardware efficiency of the FE-gMG. For a representative problem and computational grids in 2D and 3D, we achieve a speedup of an average of 5 on a single GPU over a multithreaded CPU code in our benchmarks. In addition, our strong smoothers can deliver a speedup of 3.5 depending on the element space, compared to simple Jacobi smoothing. This can even be enhanced to a factor of 7 when combining the usage of approximate inverse-based smoothers with clever sorting of the degrees of freedom. In total the FE-gMG solver can outperform a simple (multicore-) CPU-based multigrid by a total factor of over 40.
    Preview · Article · Jul 2013 · Computers & Fluids