Conference Paper

Multigrid on GPU: tackling power grid analysis on parallel SIMT platforms.

DOI: 10.1109/ICCAD.2008.4681645 Conference: 2008 International Conference on Computer-Aided Design (ICCAD'08), November 10-13, 2008, San Jose, CA, USA
Source: DBLP

ABSTRACT The challenging task of analyzing on-chip power (ground) distribution networks with multi-million node complexity and beyond is key to todaypsilas large chip designs. For the first time, we show how to exploit recent massively parallel single-instruction multiple-thread (SIMT) based graphics processing unit (GPU) platforms to tackle power grid analysis with promising performance. Several key enablers including GPU-specific algorithm design, circuit topology transformation, workload partitioning, performance tuning are embodied in our GPU-accelerated hybrid multigrid algorithm, GpuHMD, and its implementation. In particular, a proper interplay between algorithm design and SIMT architecture consideration is shown to be essential to achieve good runtime performance. Different from the standard CPU based CAD development, care must be taken to balance between computing and memory access, reduce random memory access patterns and simplify flow control to achieve efficiency on the GPU platform. Extensive experiments on industrial and synthetic benchmarks have shown that the proposed GpuHMD engine can achieve 100times runtime speedup over a state-of-the-art direct solver and be more than 15times faster than the CPU based multigrid implementation. The DC analysis of a 1.6 million-node industrial power grid benchmark can be accurately solved in three seconds with less than 50 MB memory on a commodity GPU. It is observed that the proposed approach scales favorably with the circuit complexity, at a rate about one second per million nodes.

0 Followers
 · 
95 Views
  • Source
    • "They used geometric and algebraic multigrid (aMG) for finite-difference type discretisations. More recent publications presenting applications that require multigrid solvers are supersonic flows (aMG, unstructured grids [6]), (interactive) flow simulations for feature film (aMG/gMG, structured [7] [8]), out-of core multigrid for gigapixel image stitching (gMG/aMG, structured [9]), image denoising and optical flow (gMG/aMG, structured [10]), power grid analysis (aMG, structured/unstructured [11]) and electric potential in the human heart (aMG, unstructured [12]). This last paper is similar in spirit to our work, since the authors also reduce (almost) the entire multigrid algorithm to sequences of sparse matrix-vector multiplications. "
    [Show abstract] [Hide abstract]
    ABSTRACT: We describe our FE-gMG solver, a finite element geometric multigrid approach for problems relying on unstructured grids. We augment our GPU- and multicore-oriented implementation technique based on cascades of sparse matrix–vector multiplication by applying strong smoothers. In particular, we employ Sparse Approximate Inverse (SPAI) and Stabilised Approximate Inverse (SAINV) techniques. We focus on presenting the numerical efficiency of our smoothers in combination with low- and high-order finite element spaces as well as the hardware efficiency of the FE-gMG. For a representative problem and computational grids in 2D and 3D, we achieve a speedup of an average of 5 on a single GPU over a multithreaded CPU code in our benchmarks. In addition, our strong smoothers can deliver a speedup of 3.5 depending on the element space, compared to simple Jacobi smoothing. This can even be enhanced to a factor of 7 when combining the usage of approximate inverse-based smoothers with clever sorting of the degrees of freedom. In total the FE-gMG solver can outperform a simple (multicore-) CPU-based multigrid by a total factor of over 40.
    Computers & Fluids 07/2013; 80:327–332. DOI:10.1016/j.compfluid.2012.01.025 · 1.53 Impact Factor
  • Source
    • "Any changes beyond the system capa­ bility may incur architecture change, circuit redesign or even new chip fabrication with high cost. The application of programmable elements, such as GPU, mitigates the redesign cost, but achieving the system reconfigurability and power efficiency simultaneously still remains as a challenge [11]. * Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial ad­ vantage and that copies bear this notice and the full citation on the first page. "
    [Show abstract] [Hide abstract]
    ABSTRACT: The invention of neuromorphic computing architecture is inspired by the working mechanism of human-brain. Memristor technology revitalized neuromorphic computing system design by efficiently executing the analog Matrix-Vector multiplication on the memristor-based crossbar (MBC) structure. However, programming the MBC to the target state can be very challenging due to the difficulty to real-time monitor the memristor state during the training. In this work, we quantitatively analyzed the sensitivity of the MBC programming to the process variations and input signal noise. We then proposed a noise-eliminating training method on top of a new crossbar structure to minimize the noise accumulation during the MBC training and improve the trained system performance, i.e., the pattern recall rate. A digital-assisted initialization step for MBC training is also introduced to reduce the training failure rate as well as the training time. Experimental results show that our noise-eliminating training method can improve the pattern recall rate. For the tested patterns with 128 x 128 pixels our technique can reduce the MBC training time by 12.6% ~ 14.1% for the same pattern recognition rate, or improve the pattern recall rate by 18.7% ~ 36.2% for the same training time.
    Proceedings of the 50th Annual Design Automation Conference; 05/2013
  • Source
    • "Recently, massive parallel algorithms have been developed on GPU [24], [26]. Parallel GPU software packages for linear systems, such as CUBLAS for dense matrices [27], and CUSPARSE for sparse matrices [27], have also been developed for public use. "
    [Show abstract] [Hide abstract]
    ABSTRACT: This paper presents a design for testability technique to avoid scan shift failure due to flip–flop simultaneous triggering. The proposed technique changes test clock domains of flip–flops in the regions where severe IR-drop problems occur. A massive parallel algorithm using a graphic processor unit is adopted to speed up the IR-drop simulation during optimization. The experimental data on large benchmark circuits show that peak IR-drop values are reduced by 15% on average compared with the circuit after simple MD-SCAN partition. Our proposed technique quickly optimizes a half-million-gate design within two hours.
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 03/2013; 32(4):644. DOI:10.1109/TCAD.2012.2228741 · 1.20 Impact Factor
Show more

Preview

Download
1 Download
Available from