Conference Paper

Multigrid on GPU: tackling power grid analysis on parallel SIMT platforms.

DOI: 10.1109/ICCAD.2008.4681645 Conference: 2008 International Conference on Computer-Aided Design (ICCAD'08), November 10-13, 2008, San Jose, CA, USA
Source: DBLP

ABSTRACT The challenging task of analyzing on-chip power (ground) distribution networks with multi-million-node complexity and beyond is key to today's large chip designs. For the first time, we show how to exploit recent massively parallel single-instruction multiple-thread (SIMT) graphics processing unit (GPU) platforms to tackle power grid analysis with promising performance. Several key enablers, including GPU-specific algorithm design, circuit topology transformation, workload partitioning, and performance tuning, are embodied in our GPU-accelerated hybrid multigrid algorithm, GpuHMD, and its implementation. In particular, a proper interplay between algorithm design and SIMT architecture considerations is shown to be essential for good runtime performance. Unlike standard CPU-based CAD development, care must be taken to balance computation against memory access, reduce random memory access patterns, and simplify flow control to achieve efficiency on the GPU platform. Extensive experiments on industrial and synthetic benchmarks show that the proposed GpuHMD engine achieves a 100× runtime speedup over a state-of-the-art direct solver and is more than 15× faster than a CPU-based multigrid implementation. The DC analysis of a 1.6 million-node industrial power grid benchmark can be accurately solved in three seconds with less than 50 MB of memory on a commodity GPU. The proposed approach is observed to scale favorably with circuit complexity, at a rate of about one second per million nodes.
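The abstract does not spell out the hybrid multigrid flow, but the cycle such an engine accelerates is standard geometric multigrid. Below is a minimal CPU-side sketch of a V-cycle with weighted-Jacobi smoothing on a 1D Poisson model problem; this is an illustrative assumption, not GpuHMD itself, and all function names are invented for the sketch.

```python
import numpy as np

def jacobi(u, f, h, iters=3, w=2.0 / 3.0):
    # weighted Jacobi smoothing for -u'' = f on [0,1], Dirichlet BCs
    for _ in range(iters):
        u[1:-1] += w * 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1] - 2.0 * u[1:-1])
    return u

def v_cycle(u, f, h):
    u = jacobi(u, f, h)                                  # pre-smooth
    r = np.zeros_like(u)                                 # residual f - A u
    r[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    rc = r[::2].copy()                                   # full-weighting restriction
    rc[1:-1] = 0.25 * r[1:-3:2] + 0.5 * r[2:-2:2] + 0.25 * r[3:-1:2]
    ec = np.zeros_like(rc)
    if len(rc) > 3:
        ec = v_cycle(ec, rc, 2.0 * h)                    # recurse on coarse error eq.
    else:
        ec = jacobi(ec, rc, 2.0 * h, iters=50)           # tiny coarsest grid
    e = np.zeros_like(u)                                 # linear-interp prolongation
    e[::2] = ec
    e[1:-1:2] = 0.5 * (ec[:-1] + ec[1:])
    u += e                                               # coarse-grid correction
    return jacobi(u, f, h)                               # post-smooth
```

On the GPU, the smoothing and transfer steps above become data-parallel kernels; the paper's point is that their memory-access regularity, not their operation count, dominates SIMT performance.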

  • ABSTRACT: Sparse linear systems arise in a variety of scientific and engineering problems. In VLSI CAD tools, DC circuit analysis creates large, sparse systems represented by matrices and vectors. The algorithms designed to solve these systems are known to be quite time consuming, and many previous attempts have been made to parallelize them. Graphics cards have evolved from specialized devices into massively parallel, general-purpose computing units. With their parallel architecture and SIMD processing units, they are well suited to high-throughput operations on large matrices. Various APIs have been developed to allow users to access the resources of their GPUs. One relatively new API, OpenCL, provides a high-level abstraction of GPU architecture. With its open standard and support for both CPU and GPU compute devices, OpenCL may become a dominant framework for parallel computing on GPUs. Here, we test an OpenCL implementation of a sparse linear solver for VLSI CAD tools.
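DC circuit analysis boils down to solving G·v = i for the node voltages, where G is the sparse, symmetric positive definite conductance matrix. As a hedged illustration of the kind of kernel such a solver parallelizes, here is a matrix-free conjugate-gradient sketch for a regular resistive mesh in plain NumPy; the mesh model, conductance values, and function names are assumptions for the sketch, not the paper's OpenCL code.

```python
import numpy as np

def matvec(v, n, g=1.0, g_pad=0.1):
    # y = G v for an n x n resistive mesh: neighbor conductance g, plus a
    # per-node tie to the supply with conductance g_pad (keeps G SPD).
    u = v.reshape(n, n)
    y = (g_pad + 4.0 * g) * u
    y[:, 1:] -= g * u[:, :-1]; y[:, :-1] -= g * u[:, 1:]   # left/right neighbors
    y[1:, :] -= g * u[:-1, :]; y[:-1, :] -= g * u[1:, :]   # up/down neighbors
    # boundary nodes have fewer neighbors: correct the diagonal
    y[0, :] -= g * u[0, :]; y[-1, :] -= g * u[-1, :]
    y[:, 0] -= g * u[:, 0]; y[:, -1] -= g * u[:, -1]
    return y.ravel()

def cg_solve(A, b, tol=1e-10, maxiter=500):
    # plain conjugate gradients; A is a matvec callable (matrix-free)
    x = np.zeros_like(b)
    r = b - A(x)
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = A(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

The matvec is the natural GPU target: each output element depends only on a fixed stencil of neighbors, which maps directly onto SIMD lanes.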
  • ABSTRACT: The possibility of porting algorithms to graphics processing units (GPUs) raises significant interest among researchers. The natural next step is to employ multiple GPUs, but communication overhead may limit further performance improvement. In this paper, we investigate techniques for reducing overhead on hybrid CPU–GPU platforms, including careful data layout, usage of GPU memory spaces, and non-blocking communication. In addition, we propose an accurate automatic load-balancing technique for heterogeneous environments. We validate our approach on a hybrid Jacobi solver for the 2D Laplace equation. Experiments carried out using various graphics hardware and types of connectivity confirm that the proposed data layout allows our fastest CUDA kernels to reach the analytical limit for memory bandwidth (up to 106 GB/s on an NVIDIA GTX 480), and that non-blocking communication significantly reduces overhead, allowing for almost linear speed-up even when communication is carried out over relatively slow networks.
    International Journal of Parallel Programming 12/2014; 42(6). DOI:10.1007/s10766-013-0293-2 · 0.50 Impact Factor
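The halo-exchange pattern behind such a multi-GPU Jacobi solver can be mimicked on a single machine. Below is a sketch splitting a 2D Laplace Jacobi sweep into two subdomains with one-row halos; NumPy stands in for the GPU kernels and the device-to-device transfer, and the decomposition shown is an assumption for illustration, not the paper's implementation.

```python
import numpy as np

def jacobi_halo(u_top, u_bot, iters):
    # u_top owns the upper rows plus one halo row at its bottom;
    # u_bot owns the lower rows plus one halo row at its top.
    for _ in range(iters):
        # halo exchange: stand-in for the inter-device transfer
        u_top[-1] = u_bot[1]
        u_bot[0] = u_top[-2]
        for u in (u_top, u_bot):      # one Jacobi sweep per subdomain
            u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                    u[1:-1, :-2] + u[1:-1, 2:])
    return u_top, u_bot
```

Because the halos are swapped before each sweep, the split iteration reproduces single-domain Jacobi exactly; overlapping that exchange with interior computation is what the non-blocking-communication techniques in the paper optimize.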
  • ABSTRACT: This paper presents a representative random walk technique for fast transient IR-drop analysis. It selects only a small number of nodes to model the original network for simulation, so that memory and runtime are significantly reduced. Experimental results on benchmark circuits show that the proposed technique can be up to 330 times faster than a commercial simulator while the average error is less than 10%. Furthermore, exhaustive simulation of all 26 K delay-fault test patterns on a 400 K-gate design can be finished within a week. The proposed technique is very useful for simulating capture cycles to identify test patterns that cause excessive IR drop during at-speed testing.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 09/2014; 22(9):1980-1989. DOI:10.1109/TVLSI.2013.2280616 · 1.14 Impact Factor
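The random-walk view of a resistive network underlying such techniques is classical: a node's voltage equals the expected pad voltage reached by a conductance-weighted walk, minus an I/G term accumulated at each visited node. Below is a toy DC sketch of that estimator; the network, function names, and parameters are illustrative assumptions, and the paper's representative-node selection and transient handling are not modeled.

```python
import numpy as np

def walk_voltage(nbrs, g, pads, currents, start, n_walks=5000, rng=None):
    # nbrs[i]: neighbor node ids; g[i]: matching branch conductances
    # pads: node -> fixed voltage; currents: node -> current drawn at node
    # The node equation v = sum_j (g_j/G) v_j - I/G motivates the walk:
    # hop to a neighbor with probability g_j/G, accumulate -I/G per visit,
    # stop at a pad and add its voltage; average over many walks.
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(n_walks):
        node, acc = start, 0.0
        while node not in pads:
            G = sum(g[node])
            acc -= currents.get(node, 0.0) / G
            p = np.asarray(g[node]) / G
            node = nbrs[node][rng.choice(len(nbrs[node]), p=p)]
        total += acc + pads[node]
    return total / n_walks
```

On a five-node unit-resistance chain with a 1 V pad at one end and 0.1 A drawn at the other, the exact far-end voltage is 1 − 0.1·4 = 0.6 V, and the walk estimate converges to it as the number of walks grows.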

