
Multigrid on GPU: Tackling Power Grid Analysis on

Parallel SIMT Platforms

Zhuo Feng and Peng Li

Department of Electrical and Computer Engineering

Texas A&M University, College Station, TX 77843

Email: {fengzhuo, pli}@neo.tamu.edu

Abstract—The challenging task of analyzing on-chip power (ground)

distribution networks with multi-million node complexity and beyond

is key to today’s large chip designs. For the first time, we show how

to exploit recent massively parallel single-instruction multiple-thread

(SIMT) based graphics processing unit (GPU) platforms to tackle power

grid analysis with promising performance. Several key enablers including

GPU-specific algorithm design, circuit topology transformation, workload

partitioning, and performance tuning are embodied in our GPU-accelerated

hybrid multigrid algorithm, GpuHMD, and its implementation. In

particular, a proper interplay between algorithm design and SIMT

architecture consideration is shown to be essential to achieve good

runtime performance. Different from the standard CPU based CAD

development, care must be taken to balance between computing and

memory access, reduce random memory access patterns and simplify flow

control to achieve efficiency on the GPU platform. Extensive experiments

on industrial and synthetic benchmarks have shown that the proposed

GpuHMD engine can achieve 100X runtime speedup over a state-

of-the-art direct solver and be more than 15X faster than the CPU

based multigrid implementation. The DC analysis of a 1.6 million-node

industrial power grid benchmark can be accurately solved in three

seconds with less than 50MB memory on a commodity GPU. It is observed

that the proposed approach scales favorably with the circuit complexity,

at a rate of about one second per million nodes.

I. INTRODUCTION

The sheer size of present day power/ground distribution networks

makes their analysis and verification extremely runtime and memory

consuming, and at the same time, limits the extent to which these

networks can be optimized. In the past decade, on the standard

general-purpose CPU platform, a body of power grid analysis meth-

ods have been proposed [1], [2], [3], [4], [5], [6], [7], [8] with

various tradeoffs. Recently, the emergence of massively parallel

single-instruction multiple-data (SIMD), or more precisely, single-

instruction multiple-thread (SIMT) [9], based GPU platforms offers

a promising opportunity to address the challenges in large scale power

grid analysis. Today’s commodity GPUs can deliver more than 380

GFLOPS of theoretical computing power and 86GB/s off-chip memory

bandwidth, which are 3-4X greater than those offered by modern-day

general-purpose quad-core microprocessors [9]. The ongoing GPU

performance scaling trend justifies the development of a suitable

subset of CAD applications on such a platform.

However, converting the impressive theoretical GPU computing

power to usable design productivity can be rather nontrivial. Deeply

rooted in graphics applications, the GPU architecture is designed

to deliver high performance for data-parallel computing.

Except for straightforward general-purpose SIMD tasks such as

parallel table lookups, rethinking and re-engineering are required to

express the data parallelism hidden in an application in a suitable

form to be exploited on GPU. For power grid analysis, the above

goal is achieved in the proposed GPU-accelerated hybrid multigrid

algorithm GpuHMD and its implementation via a careful interplay

between algorithm design and SIMT architecture consideration. Such

interplay is essential in the sense that it makes it possible to balance

between computing and memory access, reduce random memory

access patterns and simplify flow control, key to efficient GPU

computing. To the best of our knowledge, GpuHMD is the first reported GPU based power grid analysis tool.

Fig. 1. Overall GpuHMD analysis flow. [The figure depicts the CPU (with device memory) and the GPU (N multiprocessors, each with streaming processors SP1, SP2, ..., an instruction unit, and shared memory) connected by a 2~4 GB/s host link and 86GB/s device memory bandwidth. The flow: 1. Compute residual on the original grid on CPU; 2. Solve the approximate regular grid (grid approximation) with the geometric multigrid solver on GPU; 3. Return the GMD solution to CPU; 4. Correct & smooth on the original grid on CPU.]

As shown in Fig. 1, GpuHMD is built upon a custom geometric

multigrid (MG) algorithm as opposed to a direct solver. Despite

the attempts to develop general-purpose direct matrix solvers on

GPUs [10], so far the progress has been limited for large sparse

problems due to the very natures of GPU such as the inefficiency in

handling random complex data structures and memory access. Being

a multi-level iterative numerical method, multigrid naturally provides

a divide-and-conquer based solution that meets the stringent on-chip

shared memory constraint in GPU. To further enhance the efficiency

of geometric multigrid, a topology regularization step is taken to

convert a possibly irregular 3D grid into a regular 2D structure,

thereby significantly reducing random memory access and thread

divergence. New coarse grid construction, block smoothing strategies,

restriction and prolongation operators are developed geometrically,

maintaining the desirable regularity throughout the entire multigrid

hierarchy. The proposed GpuHMD is referred to as a hybrid approach

in the sense that the entire workload is split between the host (or CPU)

and the GPU. The multigrid hierarchy is purposely made deep such

that more than 95% of the work is pushed onto the GPU. Only a

minimum of small matrix solve, residue computation and smoothing

operation is conducted on the CPU. To remove the possible small

error caused by topology regularization, a few iterative steps between

the CPU and GPU may be performed. Through in-depth theoretical

analysis and empirical data, we show that in practice the required

number of CPU-GPU iterations is typically small and accurate power

grid solutions converge fast.

In this paper, we focus only on the DC analysis of power

grids; however, the same framework can be extended rather straightforwardly to transient analysis. Extensive experiments have shown

the promising potential of GpuHMD: it is typically more than 100X

faster than state-of-the-art direct methods [11] on a PC and 15X faster

than the CPU-based multigrid implementation. We envision that with

978-1-4244-2820-5/08/$25.00 ©2008 IEEE


the future GPU performance improvement and the use of multiple

GPU card systems, network analyses that were impossible in the

past may become feasible, leading to a new level of verification and

design. For instance, it may be possible to facilitate the analysis

of large interconnect dominated nonlinear networks (e.g., power

grids and mesh circuits coupled with devices), where the dominant

linear portion of the problem is efficiently solved in a GpuHMD-like

fashion.

II. BACKGROUND AND OVERVIEW

We first review the power grid analysis problems and the GPU

architecture. Next, an overview of the proposed GpuHMD approach

is provided.

A. Power grid analysis

At the heart of either DC or transient power grid analysis lies the

solution of certain large matrix problems. For instance, a system of

linear equations is formulated in the DC analysis [2]:

GV = I,

(1)

where when appropriately formulated, G is a symmetric positive

definite matrix representing the interconnected resistors; V is the

unknown vector of node voltages; and I is a vector of independent

current sources. The feasible solution of such large linear systems

with tens or even hundreds of millions of unknowns, as seen in

today’s industrial designs, is hampered by the excessive runtime

and memory usage required.

B. GPU matrix solvers?

A basic understanding of the SIMT GPU architecture is instru-

mental for evaluating the potential in applying GPU matrix solvers to

large power grid problems. Consider a recent commodity GPU model,

NVIDIA G80 series. Each card has 16 streaming multiprocessors

(SMs) with each SM containing eight streaming processors (SPs)

running at 1.35GHz. An SP operates in single-instruction, multiple-

thread (SIMT) fashion and has a 32-bit, single-precision floating-

point, multiply-add arithmetic unit [12]. Additionally, an SM has

8192 registers which are dynamically shared by the threads running

on it and can access global, shared, and constant memories. The

bandwidth of the off-chip memory can be as high as 86GB/s, but

the memory bandwidth may reduce significantly under many random

memory accesses. The following programming guidelines play very

important roles for efficient GPU computing [9]:

1) Low control flow overhead: execute the same computation on

many data elements in parallel;

2) High SP floating point arithmetic intensity: perform as many

as possible calculations per memory access;

3) Minimum random memory access: pack data for coalesced

memory access.

Due to the very nature of the SIMT architecture, it remains

a challenge to implement efficient general-purpose sparse matrix

solvers on GPU. In recent such attempts, it is reported that most

of the runtime is spent on data fetching and writing, rather than on data

processing [13], [14]. For instance, traditional iterative methods such

as conjugate gradient and multigrid [13] involve many sparse matrix-

vector computations, leading to rather complex control flows and a

large number of random memory accesses that can result in extremely

inefficient GPU implementations. On the other hand, a problem with

a structured data and memory access pattern can be processed by

GPU rather efficiently. A dense matrix-matrix multiplication kernel on GPU can reach over 90

GFLOPS, which is orders of magnitude faster than on CPU [12].

Considering the above facts, it is unlikely to facilitate efficient power

grid analysis by building around immature general-purpose GPU matrix solvers or implementing existing CPU-oriented power grid analysis methods [1], [2], [4] on GPU.

Fig. 2. Acceleration of GMD solve on GPU. [The figure shows a multigrid V-cycle executed on the GPU: smooth and restrict down the hierarchy, then prolong & correct and smooth back up, with the coarsest-level matrix solve and the convergence check on the CPU; the CPU sends the RHS to the GMD solver and receives the returned solution from the GPU.]

C. Our approach

1) Power grid uniformity: To achieve the best analysis efficiency

on SIMT platforms, understanding the physical properties of practical

power grid designs is critical. It can be expected that if the power

grid can be stored and processed like pixel graphics, the GPU SIMT

platform can be of a significant advantage over the general purpose

CPU platform. Not surprisingly, after examining a set of published

industrial power grids [15], [16], we have found that real-life designs

have a high degree of global uniformity while exhibiting some local

irregularity. Therefore, to maintain regularity on GPU, it is very

natural for us to consider solving an approximate regular power grid

that is close to the original grid. However, this brings up the need

for developing “regular” numerical methods and correction schemes

to guarantee solution accuracy.

2) GPU based geometric multigrid method: Multigrid methods are

among the fastest numerical algorithms for solving large PDE-like

problems [17], where a hierarchy of progressively coarser replicas (fine vs. coarse grids) of the given linear problem is created. Via iterative

updates, the high and low frequency components of the solution

error are quickly damped on the fine and coarse grids, respectively,

contributing to the efficiency of multigrid. When properly designed,

multigrid methods can achieve a linear complexity in the number of

unknowns. The hierarchical iterative nature of multigrid is attractive

to GPU platforms since the GPU on-chip shared memory is rather

limited. Multigrid methods typically fall into two categories, geomet-

ric multigrid (GMD) and algebraic multigrid (AMG). AMG may be

considered a robust black-box method but requires an expensive

setup phase while GMD may be implemented more efficiently if

specific geometric structures of the problem can be exploited. The

key operations of a multigrid method include:

1) Smoothing: point or block iterative methods (e.g. Gauss-Seidel)

applied to damp the solution error on a grid;

2) Restriction: mapping from a fine grid to the next coarser grid

(applied to map the fine grid residue to the coarse grid);

3) Prolongation: mapping from a coarse grid to the next finer grid

(applied to map the coarse grid solution to the fine grid);

4) Correction: use the mapped coarse grid solution to correct the

fine grid solution.

On the k-th level grid with an initial solution of vk, a typical multigrid

cycle MG(k,vk) has the following steps [17]:

1) Apply pre-smoothing to update the solution;

2) Compute the residue on the k-th grid and map it to the k+1-th

coarser grid via restriction;



3) Use the mapped residue to solve the (k+1)-th grid directly if the coarsest level is reached; otherwise apply a multigrid cycle MG(k+1, vk+1) with a zero initial guess vk+1 = 0;

4) Map the solution vk+1 to the k-th grid via prolongation, and

correct the solution vk by adding vk+1;

5) Apply post-smoothing to further improve vk at the k-th level

grid and return the final vk.
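The cycle above can be illustrated with a minimal, self-contained sketch on a 1D Poisson model problem. This is our own toy example, not the paper's 2D regular-grid engine: the names `build_levels`, `smooth`, `restrict`, `prolong`, and `mg_cycle` are ours, and weighted Jacobi stands in for the block smoother developed later in the paper.

```python
import numpy as np

def build_levels(L):
    # 1D Poisson model hierarchy: level k has 2^(L-k) - 1 interior nodes.
    levels, n = [], 2**L - 1
    while n >= 3:
        h = 1.0 / (n + 1)
        A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
        levels.append(A)
        n = (n - 1) // 2
    return levels

def smooth(A, v, f, iters=3, w=2.0 / 3.0):
    # Weighted-Jacobi smoothing: damps the high-frequency error components.
    d = np.diag(A)
    for _ in range(iters):
        v = v + w * (f - A @ v) / d
    return v

def restrict(r):
    # Full weighting: map the fine-grid residue to the next coarser grid.
    return 0.25 * r[:-2:2] + 0.5 * r[1:-1:2] + 0.25 * r[2::2]

def prolong(ec, n):
    # Linear interpolation: map the coarse-grid correction to the fine grid.
    e = np.zeros(n)
    e[1:-1:2] = ec
    e[0::2] = 0.5 * (np.concatenate(([0.0], ec)) + np.concatenate((ec, [0.0])))
    return e

def mg_cycle(k, vk, fk, levels):
    # One multigrid cycle MG(k, vk), following steps 1)-5) above.
    A = levels[k]
    vk = smooth(A, vk, fk)                       # 1) pre-smoothing
    rc = restrict(fk - A @ vk)                   # 2) residue, mapped to level k+1
    if k + 1 == len(levels) - 1:                 # 3) coarsest level: direct solve
        ec = np.linalg.solve(levels[k + 1], rc)
    else:                                        # 3) otherwise recurse, zero guess
        ec = mg_cycle(k + 1, np.zeros_like(rc), rc, levels)
    vk = vk + prolong(ec, len(vk))               # 4) prolongate and correct
    return smooth(A, vk, fk)                     # 5) post-smoothing
```

A handful of such cycles typically reduces the error by orders of magnitude, reflecting the linear-complexity behavior the text describes.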

A GPU-specific GMD method is developed in our approach.

Starting from a regularized power grid, all the key components of

multigrid are realized in a geometrically regular fashion across the

entire multigrid hierarchy, leading to simple flow controls and highly

regular memory access patterns, favoring the GPU implementation.

3) Hybrid multigrid (HMD) iterations: The approximate regular

power grid is solved efficiently using our custom GMD method on

GPU (Fig. 2), where no explicit sparse matrix-vector operations are

needed. The work associated with the GMD constitutes the dominant

workload of the entire GpuHMD approach. To guarantee the accuracy

of the final power grid solution, we further apply HMD iterations

between the GPU and host to remove any error that may arise from

only solving the approximate regular grid. Denoting the true (original) power grid by GridO and the regularized grid by GridR, the HMD iterations involve the following main steps (Fig. 1):

1) (CPU:) Compute the residue of the current solution on GridO

and map the residue to GridR;

2) (GPU:) Solve the GridR problem under the mapped residue

using GMD and return the solution to GridO;

3) (CPU:) Update the GridO solution using the GPU result and

apply additional smoothing;

4) (CPU:) If the solution error is small enough, exit; otherwise

repeat the above steps.

The bulk workload of the entire GpuHMD approach is done on GPU

via solving the regular grid (step-2). Only a fraction of the work such

as simple residue computation and smooth steps is preformed on the

host, where the general-purpose CPU is more efficient in terms of

handling the original (irregular) power grid.
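The CPU-GPU iteration above is, in structure, a defect-correction loop, which the following self-contained Python sketch illustrates on a small 1D resistor chain. Everything here is our own illustration: `hmd_solve`, `chain`, and the choice of a direct solve as the stand-in for the GPU GMD engine are assumptions, not the paper's implementation.

```python
import numpy as np

def chain(gb, gz):
    # Nodal conductance matrix of a resistor chain: gb are branch
    # conductances between neighbors, gz are per-node pad conductances.
    n = len(gz)
    G = np.diag(gz).astype(float)
    for i in range(n - 1):
        G[i, i] += gb[i]; G[i + 1, i + 1] += gb[i]
        G[i, i + 1] -= gb[i]; G[i + 1, i] -= gb[i]
    return G

def hmd_solve(G_orig, I_src, solve_regular, tol=1e-10, max_iter=50):
    # HMD outer loop (steps 1-4 above). solve_regular stands in for the
    # GPU GMD solve of the approximate regular grid.
    v = np.zeros(G_orig.shape[0])
    for it in range(1, max_iter + 1):
        r = I_src - G_orig @ v                 # 1) residue on GridO (CPU)
        v = v + solve_regular(r)               # 2)+3) GridR solve (GPU), correct
        for _ in range(2):                     # additional smoothing on GridO (CPU)
            for i in range(len(v)):
                v[i] += (I_src[i] - G_orig[i] @ v) / G_orig[i, i]
        if np.linalg.norm(I_src - G_orig @ v) <= tol * np.linalg.norm(I_src):
            break                              # 4) error small enough: exit
    return v, it

# Irregular "original grid" vs. its uniform "regular" approximation.
n = 40
gb = 1.0 + 0.1 * np.sin(np.arange(n - 1))
G_orig = chain(gb, np.full(n, 0.2))
G_reg = chain(np.full(n - 1, gb.mean()), np.full(n, 0.2))
I_src = np.full(n, 0.01)
v, iters = hmd_solve(G_orig, I_src, lambda r: np.linalg.solve(G_reg, r))
```

Because the regularized grid is electrically close to the original, each outer iteration shrinks the remaining error by a large factor, matching the fast HMD convergence claimed in the text.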

III. REGULAR GRID APPROXIMATION

We discuss several key issues in converting a three-dimensional

irregular power grid to a two-dimensional regular approximation that

can be processed efficiently on GPU.

A. Mapping to a regular grid

The goal is to map the original 3D irregular power grid to a 2D

regular one such that the electrical property of the original grid can be

well preserved. As such, the regular grid solution can be very close

to the true solution, reducing the number of the GPU-CPU HMD

iterations required.

The mapping procedure has two subsequent steps: 3D irregular

to 2D irregular, and 2D irregular to 2D regular mappings. First, by

neglecting via resistances, all the metal layers in the network are

overlapped on the same 2D plane, forming a collapsed 2D irregular

grid. By analyzing industrial power grid benchmarks, we found

that neglecting via resistances typically does not alter the circuit

solution in any significant way. Nevertheless, the error induced can

be corrected through the HMD iterations. Then, by examining the

pitches in the collapsed 2D irregular grid, a fixed uniform pitch is

chosen for the X and Y directions for the final 2D regular grid, on

which all the circuit elements are mapped to. Consider the simple

example in Fig. 3, where a two-metal-layer irregular grid is mapped

to a single-layer regular grid. The conductance values on the regular

grid can be obtained as follows:

G1 = 2g31 + g21, G2 = 2g31 + g22, G3 = 2g32, G4 = 2g32 + g23, G5 = g33 + g24. (2)

Fig. 3. Cross-section view of mapping a two-layer irregular grid to a single-layer regular grid. [The original two-metal-layer grid, with branch conductances g21-g24 and g31-g33, via resistances, VDD pad conductances gz1 and gz2, and load currents I1 and I2, is collapsed onto a regular grid with node indices (1)-(6) at the average pitch, carrying the same pads gz1, gz2, the same loads I1, I2, and the mapped conductances G1-G5.]

Note that because of the irregularity of the original grid, some of the regular grid nodes may not correspond to any of the original nodes. In this case, small dummy conductances (Gmin = 1e−6) are inserted between such a regular grid node and its neighboring nodes.

Note also that the uniform pitches of the regular grid may be set to the

averaged pitch values in the irregular grid and can be adjusted when

appropriate. Smaller uniform pitch values lead to increased regular

grid size and improved grid approximation. The possible grid size

increase in the regular grid does not significantly impact the overall

runtime efficiency of our approach due to the linear complexity of

the GPU GMD solver. The improved grid approximation, however,

may contribute to faster HMD convergence. As will be demonstrated

later, both the accuracy and efficiency of our GpuHMD algorithm are

not sensitive to the regular grid size. This is the case even when the

regular grid size is varied from 50% to 150% of the original grid

size.

Algorithm 1 3D-irregular-to-2D-regular grid mapping
Input: The original power grid netlist.
Output: The regular grid netlist consisting of all the elements of Gh, Gv, Gz, Iz with their table indices.
1: Extract the horizontal and vertical node pitches from the netlist and compute the average pitches δX and δY;
2: For each circuit element except via resistors:
a) Extract its node locations xi and yi;
b) Compute its regular grid indices by: Ixi = floor[(xi − xmin)/δX], Iyi = floor[(yi − ymin)/δY];
c) Stamp the conductance values into the 2D table based storage.
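Steps 2b and 2c of the algorithm can be sketched in a few lines of Python. The function names `regular_index` and `stamp` and the dictionary-backed table are our illustrative choices; the GPU implementation uses flat arrays instead.

```python
import math
from collections import defaultdict

def regular_index(x, y, x_min, y_min, dX, dY):
    # Step 2b: snap a node location to regular-grid indices using the
    # average pitches dX (δX) and dY (δY).
    return math.floor((x - x_min) / dX), math.floor((y - y_min) / dY)

def stamp(table, key, g):
    # Step 2c: accumulate a conductance into a 2D table; elements mapped
    # onto the same regular-grid branch combine as parallel conductances.
    table[key] += g

# Usage: two stripes landing in the same regular-grid cell add up.
Gh = defaultdict(float)        # horizontal conductance table
grid = (0.0, 0.0, 1.0, 0.5)    # x_min, y_min, δX, δY
stamp(Gh, regular_index(2.3, 0.7, *grid), 4.0)
stamp(Gh, regular_index(2.9, 0.6, *grid), 6.0)
```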

B. Table-based representation of the regular grid

The 2D regular grid is represented by several tables, denoted by

Gh, Gv, Gz and Iz. The simple representation allows for efficient

coalesced memory access to the device memory and is shown to be

critical to the GPU implementation. For a regular grid node N[i,j],

the following four tables are adopted:

Gh[i,j] : Horizontally connected conductance between

node N[i,j] and node N[i + 1,j];

Gv[i,j] : Vertically connected conductance between node

N[i,j] and node N[i,j + 1];



Gz[i,j] : The conductance that connects node N[i,j] and

the voltage sources;

Iz[i,j] : The current source that flows out of node N[i,j].

The mapping procedure is summarized in Algorithm 1.
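To make the four-table representation concrete, the sketch below performs one Jacobi-style relaxation pass driven purely by the tables. The nodal equation used here is what standard nodal analysis implies for this storage scheme, and is our assumption; the paper's actual smoothing kernel is the GPU block smoother of Section V.

```python
import numpy as np

def sweep(V, Gh, Gv, Gz, Iz, VDD=1.0):
    # One Jacobi-style pass over the regular grid using only the tables.
    # Node N[i,j] couples to N[i+1,j] through Gh[i,j], to N[i,j+1] through
    # Gv[i,j], and to the VDD source through Gz[i,j]; Iz[i,j] is drawn out.
    nx, ny = V.shape
    Vn = np.empty_like(V)
    for i in range(nx):
        for j in range(ny):
            num, den = Gz[i, j] * VDD - Iz[i, j], Gz[i, j]
            if i > 0:      num += Gh[i-1, j] * V[i-1, j]; den += Gh[i-1, j]
            if i < nx - 1: num += Gh[i, j] * V[i+1, j];   den += Gh[i, j]
            if j > 0:      num += Gv[i, j-1] * V[i, j-1]; den += Gv[i, j-1]
            if j < ny - 1: num += Gv[i, j] * V[i, j+1];   den += Gv[i, j]
            Vn[i, j] = num / den
    return Vn

# Usage: a small uniform grid with a pad and a load at every node.
nx = ny = 8
Gh, Gv = np.ones((nx - 1, ny)), np.ones((nx, ny - 1))
Gz, Iz = np.full((nx, ny), 0.5), np.full((nx, ny), 0.01)
V = np.full((nx, ny), 1.0)       # start at VDD = 1V
for _ in range(600):
    V = sweep(V, Gh, Gv, Gz, Iz)
```

The regular array indexing is exactly what enables coalesced memory access on the GPU: a thread handling N[i,j] reads only fixed offsets into the four tables.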

IV. GEOMETRIC MULTIGRID ON GPU

While the 2D regular grid can be obtained in a relatively straight-

forward manner, developing an efficient regular grid solver on GPU

is non-trivial. Naive implementations for either data transferring or

processing can lead to severe performance degradation. The proposed

GPU based GMD solver is described below by addressing the key issues raised in Section II-B.

A. Coarse grid generation and inter-grid operators

With the mapped regular 2D grid sitting at the bottom of the

multigrid hierarchy, a set of increasingly coarser grids shall be created

to occupy the higher levels. In this case, the regular grid produced

by the previous mapping step serves as the finest grid in our GMD

method. Ideally, these coarse grids should be created such that the

increasingly global behavior of the finest grid is well preserved using

a decreasing number of grid nodes. Unlike in CPU based multigrid

methods, here, it is critical to carry the regularity of the finest grid

throughout the multigrid hierarchy so as to achieve good efficiency

on the GPU platform. This goal is achieved by taking the following view

of the I/O characteristics of the power grid.

When creating the next coarser grid, we distinguish two types of

wire resistances: resistances connecting a grid node to a VDD source

(or VDD pad conductances) vs. those connecting a grid node to one

of its four neighboring nodes (or internal resistances) on the regular

grid, as shown in Fig. 4. Importantly, the two types of resistances

are handled differently. We maintain the same total current Iz that

flows out of the network and the same total wire conductance (Gz)

that connects the grid to ideal voltage sources (e.g. total VDD pad

conductance). In this way, the same pullup and pulldown strengths are

kept in the coarser grid of a power distribution network. Denote the

voltages of M grid nodes that connect to an ideal voltage source via

a wire resistance by Vi for i = 1,...,M, and the N loading current sources by Izj for j = 1,...,N. The following equation holds:

Σ_{i=1..M} (VDD − Vi) · Gzi = Σ_{j=1..N} Izj. (3)

To maintain approximately the same node voltages Vi at the VDD pad locations in the coarser grid, we ensure that Σ_{i=1..M} Gzi and Σ_{j=1..N} Izj

are unchanged. Consequently, as shown in Fig. 4, both the V DD

pad conductance (Gz) and current loadings (or residues) are summed

up when creating the coarser grid problem. In contrast, internal

conductances are averaged to create a coarser regular grid that

approximately preserves the global behavior of the fine grid.

Using H and h to indicate the fine and coarser grid components, respectively, the coarser grid is created as follows:

Gh^h[i,j] = (1/4) × (Gh^H[2i,2j] + Gh^H[2i+1,2j] + Gh^H[2i,2j+1] + Gh^H[2i+1,2j+1]),
Gv^h[i,j] = (1/4) × (Gv^H[2i,2j] + Gv^H[2i+1,2j] + Gv^H[2i,2j+1] + Gv^H[2i+1,2j+1]),
Gz^h[i,j] = Gz^H[2i,2j] + Gz^H[2i+1,2j] + Gz^H[2i,2j+1] + Gz^H[2i+1,2j+1],
(4)

where i and j denote grid locations, and the numbers of nodes along the horizontal and vertical directions are reduced by a factor of two in the coarser grid.

Fig. 4. VDD pads (Gz) and current sources (residues) in fine/coarse grids. [The figure shows four fine-grid pad conductances Gz1-Gz4 summed into a single coarse-grid pad conductance ΣGzi, and four fine-grid load currents Iz1-Iz4 summed into a single coarse-grid load ΣIzi.]

The restriction and prolongation operators are:

Rh[i,j] = RH[2i,2j] + RH[2i + 1,2j]+

RH[2i,2j + 1] + RH[2i + 1,2j + 1],

(5)

EH[2i,2j] = EH[2i,2j + 1] = EH[2i + 1,2j] =

EH[2i + 1,2j + 1] = Eh[i,j],

(6)

where residues and errors (solution corrections) are denoted by R

and E, respectively. Evidently, the coarser grid problem is defined

completely based on geometry and can be stored in the same regular

table-based representation. In our GMD implementation, the coarsest

grid is solved via a direct method on the host. To reduce the overhead

of this sparse matrix solve on CPU and fully utilize the GPU

computing power, the GMD hierarchy is purposely made deep. In

our implementation, four to five grid levels are used, making the size

of the coarsest problem vary from a few hundred to a few thousand

times smaller than the finest grid. This choice pushes, say, 95% of the overall computation onto the GPU.
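The geometric operators of Eqs. (4)-(6) reduce to simple strided array operations, as the following sketch shows. One assumption is ours: the tables are stored full-size (with an unused, zero last row/column) so that the 2x2 aggregation is uniform over the whole grid.

```python
import numpy as np

def coarsen(Gh, Gv, Gz):
    # Eq. (4): internal conductances are averaged, pad conductances are
    # summed; the node count halves in each direction.
    agg = lambda T: T[0::2, 0::2] + T[1::2, 0::2] + T[0::2, 1::2] + T[1::2, 1::2]
    return 0.25 * agg(Gh), 0.25 * agg(Gv), agg(Gz)

def restrict(R):
    # Eq. (5): fine-grid residues are summed 2x2 onto the coarse grid.
    return R[0::2, 0::2] + R[1::2, 0::2] + R[0::2, 1::2] + R[1::2, 1::2]

def prolong(Ec):
    # Eq. (6): each coarse-grid correction is copied to its four fine nodes.
    return np.repeat(np.repeat(Ec, 2, axis=0), 2, axis=1)
```

Two properties from the text are easy to check numerically: the total pad conductance (and total restricted residue) is preserved across levels, and restriction composed with prolongation scales a correction by exactly four.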

B. Point vs. block smoothers

The choice of smoother is critical in GMD. Typically, point

Gauss-Seidel or weighted Jacobi smoothers are used for CPU based

GMD methods. However, a block based smoother is adopted in our approach to fully utilize the SIMT GPU computing power. On

GPU, a number (more precisely a warp [9]) of threads may be

simultaneously executed in a single-instruction multiple-data fashion

on a multiprocessor. This implies that multiple circuit nodes can be

processed in the smoothing step at the same time. In our approach, a

block of circuit nodes are loaded into a multiprocessor at a time.

Then, multiple threads are launched to simultaneously smooth the

circuit nodes in the block for a number of iterations. As a result,

such a processing step (almost) completely solves the circuit block,

effectively leading to a block smoother. This approach ensures that

a meaningful amount of compute work is done before the data is

released and a new memory access takes place. In other words, it

contributes to efficient GPU computing by increasing the arithmetic

intensity. This block smoother is discussed in detail in Section V.

V. ACCELERATING GMD ON GPU

To gain good efficiency on the GPU platform, care must be taken

to facilitate thread organization, memory and register allocation,

workload balancing as well as hardware-specific algorithms.

A. Thread organization

Through a suitable programming model (e.g. CUDA [9]), threads

shall be packed properly for efficient execution on multiprocessors.

On a multiprocessor, threads are organized in units of blocks, where

the number of blocks should be properly chosen to maximize the

performance. The optimal block size shall be multiples of 32 threads



for a commercial GPU [9]. In our implementation, the actual optimal block size is chosen experimentally.

Fig. 5. Mixed block relaxation (smoother) on GPU. [The figure shows grid blocks streamed from global memory into the shared memories of streaming multiprocessors SM1-SM3 over execution time: Gauss-Seidel iterations proceed among blocks, while weighted Jacobi iterations run within each block on the eight streaming processors SP1-SP8.]

B. Memory and register allocation

Before the GMD solve starts on GPU, 1D tables are allocated on

the CPU to store all the regular grids in the multigrid hierarchy. Then,

the data are transferred to the device (GPU). We bind the conductance

tables (Gh,Gv and Gz) to the texture memory and other data to

the on-board GPU device memory. Texture memory is cached, so

its access latency is significantly smaller than that of the device memory.

However, the texture memory is read-only and cannot be used for

solution updates. Therefore, residues, solution and error vectors are

stored in the device memory. Since the device memory is not cached,

coalesced memory accesses are employed to achieve the best memory

bandwidth.

The fast on-chip shared memory and registers are very limited

resources on GPU. If the required shared memory and registers

exceed what is available, the kernel launch will fail. Moreover, more than one block of threads should be run on the same streaming multiprocessor (SM), which hides the memory read/write latency better and leads to much higher throughput.

With this in mind, all components of our GPU GMD method are

developed carefully to fully utilize GPU resources. As an example,

in the smoothing steps, the solution and right hand side (RHS) vectors

are loaded from the global memory to the shared memory, while the

resistance grid data are loaded from the texture memory to the

registers. The above scheme allows more than two blocks of threads

to be launched concurrently within the resource limitation on an SM.

Otherwise, if the grid data were loaded into the shared memory, only one block of threads could be run, exposing more of the memory access latency.

C. Mixed block-wise smoother

In our GMD solver, the relaxation (smoothing) steps dominate

the overall computation. Hence, an efficient implementation of the

smoother is critical. On CPU, point-wise iterative methods such

as Gauss-Seidel or weighted Jacobi are often adopted. However,

to improve the arithmetic intensity and work better with efficient

coalesced (block) memory accesses and control flows, global block

Gauss-Seidel iteration (GBG iteration) and local block weighted

Jacobi iteration (LBJ iteration) schemes are introduced.

As illustrated in Fig. 5, during each GBG iteration, the whole 2D

regular grid is partitioned into small blocks which are subsequently

transferred to streaming processors. Next, k LBJ iterations are

conducted within each block locally. Since only the threads within

the same thread block can share the data with each other, the solution

of this local block cannot be shared by others unless it is sent back to

the global memory. As processed block solutions are written back to

the global memory, the smoothing of subsequent blocks will be based

upon the most recent solutions of the neighboring blocks. Therefore,

from this global point of view, the smoother is a block Gauss-Seidel

iterative (or GBG) method. On the other hand, when each block

is being smoothed, all its nodes are processed by multiple threads

simultaneously in a weighted Jacobi fashion, referred to as LBJ

iterations. The above mixed block-iteration scheme has been carefully

tailored for our GPU based GMD engine, particularly through the

following considerations:

1) To increase the arithmetic intensity, we perform k LBJ iterations for each global memory access. k can be determined based upon the block size: a larger block size may warrant more local iterations. However, excessive local iterations may not

help the overall convergence since the boundary information is

not updated.

2) To hide the memory latency and thread synchronization time,

we allow two or more blocks to run concurrently on each

multiprocessor to avoid idle processors during the thread syn-

chronization and device memory access.

The block size may impact the overall performance significantly. An overly large block size may lead to slow convergence, while an overly small one may cause poor memory efficiency and shared memory bank conflicts. To minimize shared memory and register bank conflicts,

block sizes such as 4 × 4 or 8 × 8 are observed to offer good

performance.
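The mixed GBG/LBJ scheme can be captured in a short sequential sketch: blocks are visited in order (Gauss-Seidel among blocks), and within each block k Jacobi iterations run with the values outside the block frozen, mimicking a thread block working out of shared memory. This is our serial illustration of the scheme, with a uniform unit-conductance grid, a pad conductance Gz at every node, and the name `gbg_sweep` all assumed for the example.

```python
import numpy as np

def gbg_sweep(V, Iz, B=4, k=4, Gz=0.5, VDD=1.0):
    # One global block Gauss-Seidel (GBG) pass. Each block runs k local
    # weighted-Jacobi (LBJ) iterations; neighbors outside the block are
    # read from the global array V (frozen, like a thread block's view of
    # global memory), and the finished block is written back before the
    # next block starts, giving Gauss-Seidel behavior among blocks.
    n = V.shape[0]
    for bi in range(0, n, B):
        for bj in range(0, n, B):
            blk = V[bi:bi+B, bj:bj+B].copy()
            for _ in range(k):                      # LBJ iterations
                new = np.empty_like(blk)
                for i in range(B):
                    for j in range(B):
                        gi, gj = bi + i, bj + j
                        num, den = Gz * VDD - Iz[gi, gj], Gz
                        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                            ni, nj = gi + di, gj + dj
                            if 0 <= ni < n and 0 <= nj < n:
                                li, lj = i + di, j + dj
                                inside = 0 <= li < B and 0 <= lj < B
                                num += blk[li, lj] if inside else V[ni, nj]
                                den += 1.0
                        new[i, j] = num / den
                blk = new
            V[bi:bi+B, bj:bj+B] = blk               # write back: GBG update
    return V
```

The k inner iterations amortize one global read/write over several arithmetic passes, which is exactly the arithmetic-intensity argument of consideration 1) above.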

D. Dummy grid nodes

As discussed before, GPU data processing favors block-like op-

erations. If the grid dimensions are not multiples of the block size,

extra handling is required. For example, assume one smoothing kernel

of the GMD solver is executed on all multigrid levels based on

8 × 8 thread blocks. Then, all the grid widths and heights need

to be modified to be multiples of the block size. To this end,

certain dummy grids can be attached to the periphery of the original

grid. It is important to isolate these dummy grids from the original

grid, as shown in Fig. 6. Otherwise, the GMD convergence can be

significantly impacted.

Fig. 6. Appending dummy grid nodes for a chosen block size. [Original grid + isolated dummy grid = final grid whose dimensions are multiples of the block size.]
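The padding step amounts to rounding each grid dimension up to a multiple of the block size and filling the new band with isolated nodes. The sketch below is illustrative; `pad_to_block` is our name, and it assumes full-size tables (unused last row/column zero) so that zero-padded coupling conductances leave the dummy nodes electrically disconnected from the original grid.

```python
import numpy as np

def pad_to_block(Gh, Gv, Gz, Iz, B=8, g_dummy=1e-6):
    # Pad all tables so the grid width/height are multiples of the thread
    # block size B. Dummy nodes get only a tiny grounded conductance
    # g_dummy and zero coupling conductances, isolating them from the
    # original grid (Fig. 6) so GMD convergence is unharmed.
    nx, ny = Gz.shape
    px, py = (-nx) % B, (-ny) % B
    pad = lambda T, fill: np.pad(T, ((0, px), (0, py)), constant_values=fill)
    return pad(Gh, 0.0), pad(Gv, 0.0), pad(Gz, g_dummy), pad(Iz, 0.0)
```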

VI. HYBRID MULTIGRID FOR POWER GRID ANALYSIS

Although solving the mapped 2D regular grids on GPU typically

provides fairly accurate results, the solution quality may not be

completely guaranteed since grid approximations can lead to various

accuracy levels. To have a robust error control scheme, interactions

between the 2D regular grid and the original 3D irregular grid

are important. In this work, we propose a hybrid multigrid (HMD)

analysis framework to iteratively correct the error components that

are caused by grid approximation. The main steps of our HMD flow

are shown in Fig. 1 and Fig. 7, and also outlined in Section II-C.
