
Parallel Hyperbolic PDE Simulation on Clusters: Cell versus GPU

Scott Rostrup and Hans De Sterck

Department of Applied Mathematics, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada

Abstract

Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational per-

formance. Two technologies that have received significant attention are IBM’s Cell Processor and NVIDIA’s CUDA programming

model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial

differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The

message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of

the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data

layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code perfor-

mance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and

GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors

or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32

Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some

preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper

provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight

into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides

insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.

Keywords: parallel performance, Cell processor, GPU, hyperbolic system, code optimization

PROGRAM SUMMARY

Program Title: SWsolver

Journal Reference:

Catalogue identifier:

Licensing provisions: GPL v3

Programming language: C, CUDA

Computer: Parallel Computing Clusters. Individual compute nodes

may consist of x86 CPU, Cell processor, or x86 CPU with attached

NVIDIA GPU accelerator.

Operating system: Linux

RAM: Tested on problems requiring up to 4 GB per compute node.

Number of processors used: Tested on 1-128 x86 CPU cores, 1-32 Cell

Processors, and 1-32 NVIDIA GPUs.

Keywords: Parallel Computing, Cell Processor, GPU, Hyperbolic

PDEs

Classification: 12

External routines/libraries: MPI, CUDA, IBM Cell SDK

Subprograms used: numdiff (for test run)

Nature of problem:

MPI-parallel simulation of Shallow Water equations using high-

resolution 2D hyperbolic equation solver on regular Cartesian grids

for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA.

Solution method:

SWsolver provides 3 implementations of a high-resolution 2D Shallow

Water equation solver on regular Cartesian grids, for CPU, Cell Pro-

cessor, and NVIDIA GPU. Each implementation uses MPI to divide

work across a parallel computing cluster.

Running time:

The test run provided should run in a few seconds on all architectures.

In the results section of the manuscript a comprehensive analysis of

performance for different problem sizes and architectures is given.

1. Introduction

Recent microprocessor advances have focused on increas-

ing parallelism rather than frequency, resulting in the develop-

ment of highly parallel architectures such as graphics process-

ing units (GPUs) [1, 2] and IBM’s Cell processor [3, 4]. Their

potential for excellent performance on computation-intensive

scientific applications coupled with their availability as com-

modity hardware has led researchers to adapt computational

kernels to these parallel architectures, which are often referred

to as accelerator architectures.

This paper investigates mapping high-resolution finite vol-

ume methods for nonlinear hyperbolic partial differential equa-

tion (PDE) systems [5] onto two different types of accelerator

architecture, namely, IBM's Cell processor and NVIDIA GPUs.

Performance on these architectures is then compared with per-

formance on Intel x86 central processing units (CPUs). The

accelerator architectures are investigated as both stand-alone

computational accelerators and as components of parallel clus-

ters. A high-resolution explicit numerical scheme is implemented for a relatively simple but representative model problem

Preprint submitted to Computer Physics Communications, July 25, 2010


in this class, namely, the shallow water equations. The numeri-

cal method is implemented on two-dimensional (2D) structured

grids, for three architectures (x86 CPU, GPU, and Cell), and in

parallel using the message passing interface (MPI).

A major goal of this paper is to compare the computational

performance that can be obtained on clusters with these three

types of architectures, for a 2D model problem that is represen-

tative of a large class of structured grid based simulation algo-

rithms. Simulations of this type are widely used in many ar-

eas of computational science and engineering. Another impor-

tant goal is to provide computational scientists and engineers

who are considering porting their codes to accelerator environ-

ments with insight into techniques for optimizing structured

grid based explicit algorithms on clusters with Cell and GPU

accelerators, and into the learning curve and programming ef-

fort involved. It was also our aim to write this paper in a way

that is accessible to computational scientists who may not have

specific background in Cell or GPU computing.

There is extensive related work in the literature on the use

of Cell processors and GPUs for scientific computing applica-

tions. Many of the papers in the literature deal with optimized

implementations for either Cell processors [6, 7, 8, 9] or GPUs

[10, 11, 12, 13, 14, 15]. Most of these papers deal with stan-

dalone or shared-memory hardware configurations, and do not

involve distributed memory communication and MPI. Related

work in the computational fluid dynamics area can be found in

[16, 17, 18, 19]. Work that directly compares Cell with GPU

performance is not widespread [20], and applications on par-

allel clusters with Cell and GPU accelerators have only more

recently started to come to the forefront [21, 22]. Our paper

goes further than existing work in comparing Cell with GPU

performance on clusters with MPI, and these are relevant ex-

tensions of existing work since large clusters with accelerators

are already being deployed and appear to be a promising direc-

tion for the future.

In our approach we have developed a unified code frame-

work for our model problem, for hardware platforms that in-

clude distributed memory clusters with x86 CPU, Cell and GPU

components. Several levels of parallelism are exploited (see

Fig. 1). At the coarsest level of parallelism, we partition the

computational domain over the distributed memory nodes of

the cluster and use MPI for communication. We carry out per-

formance tests on clusters provided by Ontario’s Shared Hi-

erarchical Academic Research Computing Network (SHARC-

NET, [23]) and the Juelich Supercomputing Centre (JSC, [24]).

These clusters have two CPUs, Cell processors or GPUs per

cluster node. At finer levels of parallelism, we exploit the par-

allel acceleration features provided by x86 CPUs, and Cell and

GPU devices. The x86 CPUs we use feature four cores per

CPU, and the cores provide single instruction, multiple data

(SIMD) vector parallelism through streaming SIMD extensions

(SSE). The Cell processors feature eight SIMD vector proces-

sor cores. The GPUs feature dozens of streaming multiproces-

sors with single instruction multiple thread (SIMT) parallelism.

We exploit these different levels of parallelism through opti-

mization of data layout, data flow and data-parallel instructions.

Our development code is available on our website [25] and via

the Computer Programs in Physics (CPiP) program library. We

report runtime performance results for the various levels of op-

timization performed, and first compare Cell and GPU perfor-

mance to performance on a single CPU core, as is customary

in the literature. We also compare CPU, Cell and GPU performance on a chip-by-chip basis, on a node-by-node basis (i.e., on

single cluster nodes without MPI), and on clusters (with MPI).

Our GPU cluster results use NVIDIA Tesla GPUs with GT200

architecture, but we also include some results on recently in-

troduced NVIDIA GPUs with the next-generation Fermi archi-

tecture. Our Fermi results are preliminary: we did not further

optimize our code for the Fermi platform, but found it interest-

ing to include results that show how a code developed on the

GT200 architecture performs on Fermi. We conclude on the

suitability of the accelerator architectures studied for the appli-

cation class considered, and discuss the speed-up that may be

gained on current and future accelerator architectures for this

class of applications.

The rest of this paper is organized as follows. In Section

2 we briefly describe the class of scientific computing prob-

lems we target in this study, and the specific model problem we

have implemented. Section 3 gives a brief overview of the as-

pects of the CPU, Cell and GPU architectures that are important

for code optimization. Section 4 describes how our simulation

code implementation was optimized for the architectures un-

der consideration. Section 5 describes the clusters we use and

compares performance of the optimized simulation code on the

CPU, Cell and GPU platforms, and Section 6 formulates con-

clusions.

2. Hyperbolic PDE Simulation Problem

In this paper we target acceleration of a class of structured

grid simulations in which grid quantities are evolved from step

to step using information from nearby grid cells. One appli-

cation area where this type of successive short-range updates

are used is fluid and plasma simulation with explicit time inte-

gration, but there are many other use cases with this pattern in

the computational science and engineering field. The particu-

lar problems we study are nonlinear hyperbolic PDE systems,

which require storage of multiple unknowns in each grid cell,

and which involve a relatively large number of floating point

operations (FLOPS) per grid cell in each time step. (Note that,

in this paper, we will write FLOPS/s when we mean floating

point operations per second.) For ease of implementation and

experimentation, we chose a relatively simple fluid simulation

problem and a relatively simple but commonly used algorithmic

approach. However, these choices are representative of a large

class of existing simulation codes, and our approach can eas-

ily be generalized. Therefore, many of our findings carry over

to this general class of simulation problems. In particular, we

chose to investigate shallow water flow on 2D Cartesian grids,

using a high-resolution finite volume method with explicit time

integration [5].

Our code computes numerical solutions of the shallow water


Figure 1: General overview of the different levels of parallelism exploited. At the coarsest level of parallelism (left) we partition

the computational domain over the distributed memory nodes of the cluster and use MPI for communication between neighboring

partitions. At the finest level of parallelism (right), we utilize SIMD vectors (CPU and Cell) or SIMT thread parallelism (GPU). At

intermediate levels, we use Local Store-sized blocks of data (Cell) or thread blocks (GPU). The actual details of the different levels

of parallelism depend on the platform and are represented more explicitly in Figs. 4 (CPU), 5 (Cell), and 7 (GPU).

equations, which are given by

\[
\frac{\partial}{\partial t}
\begin{bmatrix} h \\ hu \\ hv \end{bmatrix}
+ \frac{\partial}{\partial x}
\begin{bmatrix} hu \\ hu^2 + \frac{gh^2}{2} \\ huv \end{bmatrix}
+ \frac{\partial}{\partial y}
\begin{bmatrix} hv \\ huv \\ hv^2 + \frac{gh^2}{2} \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix},
\tag{1}
\]

where h is the height of the water, g is gravity, and u and v represent the fluid velocities. The gravitational constant g is taken to be one in the test simulations reported in this paper. The shallow water system is a nonlinear system of hyperbolic conservation laws [5], and given an initial condition, a 2D domain and appropriate boundary conditions, it describes the evolution in time of the unknown functions h(x,y,t), u(x,y,t) and v(x,y,t). We discretize the equations on a rectangular domain with a structured Cartesian grid, and evolve the solution numerically in time using a finite volume numerical method with explicit time integration [5]. In what follows we write U = [h hu hv]^T. We update the solution in each grid cell (i, j) using an explicit difference method. One approach to this problem is to use so-called unsplit methods of the form

\[
U^{n+1}_{i,j} = U^{n}_{i,j}
- \frac{\Delta t}{\Delta x}\left(F^{n}_{i+\frac{1}{2},j} - F^{n}_{i-\frac{1}{2},j}\right)
- \frac{\Delta t}{\Delta y}\left(G^{n}_{i,j+\frac{1}{2}} - G^{n}_{i,j-\frac{1}{2}}\right).
\tag{2}
\]

Here, i, j are the spatial grid indices and n is the temporal index, and F and G stand for numerical approximations to the fluxes of Eq. (1) in the x and y directions, respectively. The vector U^{n}_{i,j} is the vector of three unknown function values in cell (i, j) at time level n. Alternatively, one can consider a dimensional splitting approach

\[
U^{*}_{i,j} = U^{n}_{i,j}
- \frac{\Delta t}{\Delta x}\left(F^{n}_{i+\frac{1}{2},j} - F^{n}_{i-\frac{1}{2},j}\right),
\qquad
U^{n+1}_{i,j} = U^{*}_{i,j}
- \frac{\Delta t}{\Delta y}\left(G^{*}_{i,j+\frac{1}{2}} - G^{*}_{i,j-\frac{1}{2}}\right),
\tag{3}
\]

and this is the method we chose to implement. An advantage of the dimensional splitting approach is that Eq. (3) leads to accuracy that is in practice close to second-order time accuracy

(see [5], pp. 386, 388, 444) without the need for a two-stage

time integration. We use an expression for the numerical fluxes

F and G ([5], p. 121, Eqs. (6.59)-(6.60)) that is second-order

accurate away from discontinuities, utilizing a Roe Riemann

solver ([5], p. 481) with flux limiter. The update formula for

any point (i, j) on the grid involves values from two neighbor-

ing grid points in each of the up, down, left and right directions,

leading to a nine-point stencil for grid cell updates. For paral-

lel implementations, this means that two layers of ghost cells

need to be communicated between blocks after each iteration

[5]. For numerical stability, the timestep size is limited by the

well-known Courant-Friedrichs-Lewy condition, which implies

that the timestep size must decrease proportional to the spatial

grid size as the grid is refined. Grid cell updates may be com-

puted in parallel and the arithmetic density per grid point is

high (see Table 1), which, along with the structured nature of

the grid data, makes this algorithm a good candidate for accel-

eration on Cell or GPU. The arithmetic density is computed by

calculating the minimum number of floating point operations

necessary to update all grid cells. That is, flux calculations are

counted once per cell interface and the calculation of interme-

diate results that may be reused is not counted multiple times in

the number of operations. This is a flat operation count: no spe-

cial consideration is given to square root or division operations.

It is useful to point out that, among the 360 FLOPS per grid

cell, there are 2 square roots and 16 divisions. This is important

since square roots and divisions may be evaluated in software

or on a restricted number of processor sub-components on Cell

and GPU devices (depending on the precision, see below), so

actual arithmetic density on those platforms may effectively be

higher than what is reported in Table 1. Note that our algorithm

has such a high effective arithmetic density for several reasons:

we have a coupled system of three PDEs (3×9=27 values en-

ter into the formula to update each grid value, instead of just

9 for uncoupled equations solved with the same accuracy), the


system is highly nonlinear and requires sophisticated numerical

flux formulas based on Riemann solvers ([5], p. 481), and the

flux formulas involve square roots and divisions. Since our al-

gorithm is implemented in two passes, the minimum number of

memory operations is each grid cell being read twice, and then

stored twice, in each timestep.

Precision                 SP          DP
FLOPS per grid cell       360         360
Memory per grid cell      48 Bytes    96 Bytes
FLOPS/Byte                7.5         3.75

Table 1: The compute kernel requires a minimum of 7.5 and

3.75 FLOPS per Byte of data loaded or stored in single preci-

sion (SP) and double precision (DP), respectively.
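The figures in Table 1 follow directly from the two-pass structure: each of the three unknowns per cell is read twice and stored twice per timestep. A quick arithmetic check (the function name is illustrative):

```c
/* Arithmetic density of the kernel: 360 FLOPS per grid cell against
 * the minimum memory traffic of the two-pass scheme, i.e. each of the
 * 3 unknowns read twice and stored twice per timestep (4 accesses). */
double flops_per_byte(int bytes_per_value) {
    const double flops = 360.0;              /* per grid cell per timestep */
    double bytes = 3 * 4 * bytes_per_value;  /* 3 unknowns x 4 accesses */
    return flops / bytes;
}
/* flops_per_byte(4) -> 7.5 (SP); flops_per_byte(8) -> 3.75 (DP) */
```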

The test problem used for the simulations in this paper has

initial conditions

\[
u(x,y,0) = v(x,y,0) = 0,
\qquad
h(x,y,0) = \frac{1}{4}\left(\frac{x}{L} + \frac{y}{W}\right) + 1,
\]

on a square domain Ω = [−L,L] × [−W,W]. Boundary condi-

tions are perfect walls [5].
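A minimal sketch of grid initialization for this test problem, assuming cell-centered coordinates, separate row-major arrays for the three conserved quantities, and the sloped still-water surface h(x,y,0) = (x/L + y/W)/4 + 1; the layout and function name are illustrative, not the paper's:

```c
/* Initialize the test problem on an nx-by-ny cell-centered grid over
 * the domain [-L,L] x [-W,W]: fluid at rest (u = v = 0) with a
 * linearly sloped surface height. */
void init_grid(double *h, double *hu, double *hv,
               int nx, int ny, double L, double W) {
    double dx = 2.0 * L / nx, dy = 2.0 * W / ny;
    for (int j = 0; j < ny; j++) {
        for (int i = 0; i < nx; i++) {
            double x = -L + (i + 0.5) * dx;  /* cell-center coordinates */
            double y = -W + (j + 0.5) * dy;
            h[j * nx + i]  = 0.25 * (x / L + y / W) + 1.0;
            hu[j * nx + i] = 0.0;            /* u(x,y,0) = 0 */
            hv[j * nx + i] = 0.0;            /* v(x,y,0) = 0 */
        }
    }
}
```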

As noted above, we have chosen a relatively simple set of hy-

perbolic equations for this optimization and performance study

paper. However, more complicated hyperbolic systems, includ-

ing the compressible Euler and Magnetohydrodynamics equa-

tions which are widely used for fluid and plasma simulations,

can be approximated numerically by the same or similar meth-

ods, and extension of our approach from 2D to 3D body-fitted

structured grids or to unsplit explicit methods is also not diffi-

cult. We have deliberately chosen this relatively simple model

problem for this paper because its simplicity allows us to ex-

plain the essential aspects of optimizing structured grid prob-

lems for Cell and GPU architectures, without being distracted

by non-essential details of a more complicated application.

Similarly, readers can easily investigate and comprehend the

details of our implementation in the simulation code that we

provide, without being overwhelmed by complications of the

application. However, the approach and conclusions of our pa-

per carry over directly to a broad class of important fluid and

plasma simulation problems and algorithms.

3. Hardware Description

In this section we give a brief overview of the aspects of the

x86 CPU, IBM Cell and NVIDIA GPU architectures that are

important for optimization of our algorithmic approach.

3.1. Intel Xeon CPU

The Intel Xeon E5430 processors have four cores, and the

particular features that are important in the context of this pa-

per are the cache-based architecture and the SIMD vector paral-

lelism provided through the streaming SIMD extensions (SSE)

mechanism. Each core has SIMD vector units that are 128 bits

wide and are capable of performing four single precision cal-

culations or two double precision calculations at the same time.

While compiler features are being developed that can automat-

ically exploit this functionality, we found that for good perfor-

mance it is at present still necessary to explicitly call intrinsic

library functions that access these SIMD capabilities efficiently

(see Section 4.1). The Intel Xeon E5430 quad-core processors

used in this study have a clockrate of 2.66GHz, a 12MB L2

cache, and each core has a 16KB L1 cache.
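As a concrete illustration of the intrinsics style referred to above, the following sketch adds two float arrays four elements at a time using the 128-bit SSE intrinsics from <xmmintrin.h>; the function is illustrative, not the paper's flux kernel, and n is assumed to be a multiple of 4:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Element-wise addition of two float arrays, four lanes per 128-bit
 * SSE register. Unaligned loads/stores are used so no alignment of
 * the input pointers is assumed. */
void add4(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* 4 adds at once */
    }
}
```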

3.2. Cell Processor

The Cell Broadband Engine Architecture (CBEA), devel-

oped jointly by IBM, Sony, and Toshiba is a microproces-

sor design focused on maximizing computational throughput

and memory bandwidth while minimizing power consumption

[3, 4]. The first implementation of the CBEA is the Cell pro-

cessor and it has been used successfully in several large-scale

scientific computing clusters [26, 27], notably Los Alamos Na-

tional Laboratory’s petaflop-scale system Roadrunner [28].

The heterogeneous multi-core design of the Cell processor

may be thought of as a network on a chip, with different cores

specialized for different computational tasks (Fig. 2). Since the

Cell processor is designed for high computational throughput

applications, eight of its nine processor cores are vector proces-

sors, called synergistic processing elements (SPEs). The other

core is a more conventional (and relatively slow) CPU, called

the PowerPC processing element (PPE). The PPE has a 64-bit

processor (called the PowerPC processing unit (PPU)) as well

as a memory subsystem containing a 512KB L2 cache. The

PPU runs the operating system and is suitable for general pur-

pose computing. However, in practice its main task is to coor-

dinate the activities of the SPEs.

Communication on the chip is carried out through the ele-

ment interconnect bus. It has a high bandwidth (204.8 GB/s)

and connects the PPE, SPEs, and main memory through a four-

channel ring topology, with two channels going in each direc-

tion (Fig. 2). For main memory the Cell uses Rambus XDR

DRAM memory which delivers 25.6 GB/s maximum band-

width on two 32-bit channels of 12.8 GB/s each.

The SPE is the main computational workhorse of the Cell

Processor. It has a 3.2GHz SIMD processor (called the syn-

ergistic processing unit (SPU)) that operates on 128-bit wide

vectors which it stores in its 128 128-bit registers.

Each SPE has 256KB of on-chip memory called the Local

Store (LS). The SPU draws on the LS for both its instructions

and data: if data is not in the LS it has no automatic mechanism

to look for it in main memory. All data transfers between the LS

and main memory are controlled via software-controlled direct

memory access (DMA) commands. Each SPE has a memory

flow controller that takes care of DMAs and operates indepen-

dently of the SPU. DMAs may also transfer data directly be-

tween the local stores of different SPEs.
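The resulting data flow is typically double buffered, so that the transfer of the next block of data overlaps computation on the current one. A portable sketch of this pattern follows; on a real SPE the memcpy calls would be asynchronous mfc_get/mfc_put DMA commands from the Cell SDK, and the block size and function names here are illustrative:

```c
#include <string.h>

#define CHUNK 1024  /* elements per Local Store block (illustrative) */

/* Placeholder for the flux kernel operating on one Local Store block. */
static void compute(float *block, int n) {
    for (int i = 0; i < n; i++)
        block[i] *= 2.0f;
}

/* Double-buffered processing of n_chunks blocks from "main memory".
 * memcpy stands in for mfc_get/mfc_put; on the SPE the prefetch of
 * chunk c+1 would proceed concurrently with compute() on chunk c. */
void process(float *main_mem, int n_chunks) {
    float ls[2][CHUNK];                               /* two LS buffers */
    memcpy(ls[0], main_mem, CHUNK * sizeof(float));   /* "mfc_get" chunk 0 */
    for (int c = 0; c < n_chunks; c++) {
        int cur = c & 1, nxt = cur ^ 1;
        if (c + 1 < n_chunks)                         /* prefetch next chunk */
            memcpy(ls[nxt], main_mem + (c + 1) * CHUNK,
                   CHUNK * sizeof(float));
        compute(ls[cur], CHUNK);
        memcpy(main_mem + c * CHUNK, ls[cur],         /* "mfc_put" results */
               CHUNK * sizeof(float));
    }
}
```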

The SPU has only static branch predicting capabilities and

has no other registers besides the 128-bit registers. It sup-

ports both single and double precision floating point instruc-

tions. However, hardware support for transcendental functions


Figure 2: Hardware diagram of the Cell processor. The 8 SPEs are the SIMD vector processors, the PPE is the PowerPC CPU,

and the rings illustrate the four-channel ring topology of the Element Interconnect Bus. Also shown is the XDR DRAM memory

interface to the Cell blade main memory, and the I/O interfaces which allow two Cell processors on one blade to share SPEs.

is only available in the form of reduced precision approxima-

tions of reciprocals and reciprocal square roots. Full single and

double precision transcendentals must be evaluated in software.

Most Cell tests in this paper are performed on the cluster de-

scribed in Section 5.1.2 with PowerXCell 8i processors, but we

also include some tests on Cell processors in Sony’s PlayStation

3, which are an earlier generation of the Cell processor with less

hardware support for double precision calculations, and which

have two of their SPEs disabled.

3.3. NVIDIA GPUs and CUDA Programming Model

GPUs are not, as their name would suggest, solely used for

graphics applications: NVIDIA Tesla GPUs have evolved to be

general purpose high-throughput data-parallel computing de-

vices [1]. The GPU attaches to a host CPU system via the PCI

Express bridge as an add-on computational accelerator with its

own separate DRAM (up to 4GB), which we call GPU global

memory, and some specialized on-chip memory. Programs may

be developed to make use of the GPU by using NVIDIA’s

CUDA programming model which provides extensions to the

C programming language [29, 30]. (CUDA stands for compute

unified device architecture.) The GPU is incorporated into a

program’s execution by calling what is known as a kernel func-

tion from within the CPU host code. A kernel is defined sim-

ilarly to a normal C function but when called, a user-specified

number of threads are spawned, each of which executes the ker-

nel function on the GPU in parallel. The threads are mapped

into groups of up to 512 called thread blocks, and the threads

within a thread block are grouped into smaller groups of 32

threads called warps.

The NVIDIA GT200 architecture uses a hierarchical organiza-

tion of thread processors and memory to implement a single in-

struction multiple thread (SIMT) streaming multiprocessor de-

sign, shown schematically in Fig. 3. Threads are farmed out to

the hundreds of identical scalar processors (SPs) on the GPU.

(The Tesla T10 GPU we use has 240 SPs.) Ideally, many more

threads are spawned than the number of SPs. The SPs are or-

ganized into blocks of eight, called streaming multiprocessors

(SMs). Each SM in addition to the eight SPs has a special func-

tion unit (SFU) for computing transcendental functions and a

double precision unit (DP) which can also act as an SFU. Each

SM also has a block of local memory called shared memory vis-

ible to all threads within a thread block, and a scheduling unit

used to schedule warps. The GPU is capable of swapping warps

into and out of context without any performance overhead. This

functionality provides an important method of hiding memory

and instruction latency on the GPU hardware.

When a kernel function is called, it is initiated on the GPU by

mapping multiple thread blocks onto the SMs. Thread blocks

are divided on the SMs into groups of 32 threads called warps

and execution proceeds in a SIMT fashion within each warp.

Threads within a thread block may be synchronized if neces-

sary. However, there is no generally efficient mechanism for

synchronization across the thread blocks within a kernel func-

tion.
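A minimal kernel launch illustrating this execution model, with one thread spawned per array element; the scaling kernel, array size, and block size (a multiple of the 32-thread warp size) are illustrative, not the paper's solver:

```cuda
#include <cuda_runtime.h>

/* Kernel: each spawned thread handles one element. */
__global__ void scale(float *u, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)                    /* guard: the last block may be partial */
        u[i] *= factor;
}

int main(void) {
    int n = 1 << 20;
    float *d_u;
    cudaMalloc(&d_u, n * sizeof(float));
    /* ... initialize d_u ... */
    int threads = 256;                         /* threads per block (8 warps) */
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_u, n, 2.0f);  /* one thread per element */
    cudaDeviceSynchronize();
    cudaFree(d_u);
    return 0;
}
```

Within each block of 256 threads the hardware executes eight warps in SIMT fashion; blocks are scheduled onto SMs independently, which is why no general cross-block synchronization is available inside a kernel.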

3.3.1. Fermi GPU

The Fermi architecture, released in the spring of 2010, is

NVIDIA’s next-generation GPU. It is the successor of the

GT200 architecture described above, and is the first in which

NVIDIA focused on general-purpose computation perfor-

mance. The main improvements to note for this paper are the

full IEEE floating point compliance, the improved double pre-

cision performance, and the addition of a cache hierarchy. The

double precision performance on Fermi is half the speed of sin-

gle precision, bringing it in line with most CPUs. The addition

of a cache hierarchy, consisting of a global L2 cache, as well

as a per-SM L1 cache gives more flexibility in non-uniform

memory accesses. The Fermi C2050 features 448 SPs orga-

nized in 14 SMs. Each SM has 32 SPs, 16 DPs, and 4 SFUs.

The Fermi C2050 features a 1.15 GHz clock speed which is

slower than the Tesla T10’s 1.30 GHz. For the rest of the de-
