
Parallel Hyperbolic PDE Simulation on Clusters: Cell versus GPU
Scott Rostrup and Hans De Sterck
Department of Applied Mathematics, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
Abstract
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speedup that may be gained on current and future accelerator architectures for this class of applications.
Keywords: parallel performance, Cell processor, GPU, hyperbolic system, code optimization
PROGRAM SUMMARY
Program Title: SWsolver
Journal Reference:
Catalogue identifier: AEGY_v1_0
Licensing provisions: GPL v3
Programming language: C, CUDA
Computer: Parallel Computing Clusters. Individual compute nodes
may consist of x86 CPU, Cell processor, or x86 CPU with attached
NVIDIA GPU accelerator.
Operating system: Linux
RAM: Tested on problems requiring up to 4 GB per compute node.
Number of processors used: Tested on 1–128 x86 CPU cores, 1–32 Cell
processors, and 1–32 NVIDIA GPUs.
Keywords: Parallel Computing, Cell Processor, GPU, Hyperbolic
PDEs
Classification: 12
External routines/libraries: MPI, CUDA, IBM Cell SDK
Subprograms used: numdiff (for test run)
Nature of problem:
MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA.
Solution method:
SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Running time:
The test run provided should run in a few seconds on all architectures.
In the results section of the manuscript a comprehensive analysis of
performance for different problem sizes and architectures is given.
1. Introduction
Recent microprocessor advances have focused on increasing parallelism rather than frequency, resulting in the development of highly parallel architectures such as graphics processing units (GPUs) [1, 2] and IBM's Cell processor [3, 4]. Their potential for excellent performance on computation-intensive scientific applications coupled with their availability as commodity hardware has led researchers to adapt computational kernels to these parallel architectures, which are often referred to as accelerator architectures.
This paper investigates mapping high-resolution finite volume methods for nonlinear hyperbolic partial differential equation (PDE) systems [5] onto two different types of accelerator architecture, namely, IBM's Cell processor and NVIDIA GPUs. Performance on these architectures is then compared with performance on Intel x86 central processing units (CPUs). The accelerator architectures are investigated as both standalone computational accelerators and as components of parallel clusters. A high-resolution explicit numerical scheme is implemented for a relatively simple but representative model problem
Preprint submitted to Computer Physics Communications, July 25, 2010
in this class, namely, the shallow water equations. The numerical method is implemented on two-dimensional (2D) structured grids, for three architectures (x86 CPU, GPU, and Cell), and in parallel using the message passing interface (MPI).
A major goal of this paper is to compare the computational performance that can be obtained on clusters with these three types of architectures, for a 2D model problem that is representative of a large class of structured grid based simulation algorithms. Simulations of this type are widely used in many areas of computational science and engineering. Another important goal is to provide computational scientists and engineers who are considering porting their codes to accelerator environments with insight into techniques for optimizing structured grid based explicit algorithms on clusters with Cell and GPU accelerators, and into the learning curve and programming effort involved. It was also our aim to write this paper in a way that is accessible to computational scientists who may not have specific background in Cell or GPU computing.
There is extensive related work in the literature on the use of Cell processors and GPUs for scientific computing applications. Many of the papers in the literature deal with optimized implementations for either Cell processors [6, 7, 8, 9] or GPUs [10, 11, 12, 13, 14, 15]. Most of these papers deal with standalone or shared-memory hardware configurations, and do not involve distributed memory communication and MPI. Related work in the computational fluid dynamics area can be found in [16, 17, 18, 19]. Work that directly compares Cell with GPU performance is not widespread [20], and applications on parallel clusters with Cell and GPU accelerators have only more recently started to come to the forefront [21, 22]. Our paper goes further than existing work in comparing Cell with GPU performance on clusters with MPI, and these are relevant extensions of existing work since large clusters with accelerators are already being deployed and appear to be a promising direction for the future.
In our approach we have developed a unified code framework for our model problem, for hardware platforms that include distributed memory clusters with x86 CPU, Cell and GPU components. Several levels of parallelism are exploited (see Fig. 1). At the coarsest level of parallelism, we partition the computational domain over the distributed memory nodes of the cluster and use MPI for communication. We carry out performance tests on clusters provided by Ontario's Shared Hierarchical Academic Research Computing Network (SHARCNET, [23]) and the Juelich Supercomputing Centre (JSC, [24]). These clusters have two CPUs, Cell processors or GPUs per cluster node. At finer levels of parallelism, we exploit the parallel acceleration features provided by x86 CPUs, and Cell and GPU devices. The x86 CPUs we use feature four cores per CPU, and the cores provide single instruction, multiple data (SIMD) vector parallelism through streaming SIMD extensions (SSE). The Cell processors feature eight SIMD vector processor cores. The GPUs feature dozens of streaming multiprocessors with single instruction multiple thread (SIMT) parallelism. We exploit these different levels of parallelism through optimization of data layout, data flow and data-parallel instructions. Our development code is available on our website [25] and via the Computer Programs in Physics (CPiP) program library. We report runtime performance results for the various levels of optimization performed, and first compare Cell and GPU performance to performance on a single CPU core, as is customary in the literature. We also compare CPU, Cell and GPU performance on a chip-by-chip basis, on a node-by-node basis (i.e., on single cluster nodes without MPI), and on clusters (with MPI). Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but we also include some results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture. Our Fermi results are preliminary: we did not further optimize our code for the Fermi platform, but found it interesting to include results that show how a code developed on the GT200 architecture performs on Fermi. We conclude on the suitability of the accelerator architectures studied for the application class considered, and discuss the speedup that may be gained on current and future accelerator architectures for this class of applications.
The rest of this paper is organized as follows. In Section 2 we briefly describe the class of scientific computing problems we target in this study, and the specific model problem we have implemented. Section 3 gives a brief overview of the aspects of the CPU, Cell and GPU architectures that are important for code optimization. Section 4 describes how our simulation code implementation was optimized for the architectures under consideration. Section 5 describes the clusters we use and compares performance of the optimized simulation code on the CPU, Cell and GPU platforms, and Section 6 formulates conclusions.
2. Hyperbolic PDE Simulation Problem
In this paper we target acceleration of a class of structured grid simulations in which grid quantities are evolved from step to step using information from nearby grid cells. One application area where this type of successive short-range update is used is fluid and plasma simulation with explicit time integration, but there are many other use cases with this pattern in the computational science and engineering field. The particular problems we study are nonlinear hyperbolic PDE systems, which require storage of multiple unknowns in each grid cell, and which involve a relatively large number of floating point operations (FLOPS) per grid cell in each time step. (Note that, in this paper, we will write FLOPS/s when we mean floating point operations per second.) For ease of implementation and experimentation, we chose a relatively simple fluid simulation problem and a relatively simple but commonly used algorithmic approach. However, these choices are representative of a large class of existing simulation codes, and our approach can easily be generalized. Therefore, many of our findings carry over to this general class of simulation problems. In particular, we chose to investigate shallow water flow on 2D Cartesian grids, using a high-resolution finite volume method with explicit time integration [5].
Figure 1: General overview of the different levels of parallelism exploited. At the coarsest level of parallelism (left) we partition the computational domain over the distributed memory nodes of the cluster and use MPI for communication between neighboring partitions. At the finest level of parallelism (right), we utilize SIMD vectors (CPU and Cell) or SIMT thread parallelism (GPU). At intermediate levels, we use Local Store-sized blocks of data (Cell) or thread blocks (GPU). The actual details of the different levels of parallelism depend on the platform and are represented more explicitly in Figs. 4 (CPU), 5 (Cell), and 7 (GPU).

Our code computes numerical solutions of the shallow water equations, which are given by

\frac{\partial}{\partial t} \begin{bmatrix} h \\ hu \\ hv \end{bmatrix}
+ \frac{\partial}{\partial x} \begin{bmatrix} hu \\ hu^2 + \frac{gh^2}{2} \\ huv \end{bmatrix}
+ \frac{\partial}{\partial y} \begin{bmatrix} hv \\ huv \\ hv^2 + \frac{gh^2}{2} \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix},   (1)

where h is the height of the water, g is gravity, and u and v represent the fluid velocities. The gravitational constant g is taken to be one in the test simulations reported in this paper. The shallow water system is a nonlinear system of hyperbolic conservation laws [5], and given an initial condition, a 2D domain and appropriate boundary conditions, it describes the evolution in time of the unknown functions h(x,y,t), u(x,y,t) and v(x,y,t). We discretize the equations on a rectangular domain with a structured Cartesian grid, and evolve the solution numerically in time using a finite volume numerical method with explicit time integration [5]. In what follows we write U = [h\; hu\; hv]^T. We update the solution in each grid cell (i, j) using an explicit difference method. One approach to this problem is to use so-called unsplit methods of the form

U^{n+1}_{i,j} = U^n_{i,j}
- \frac{\Delta t}{\Delta x}\left(F^n_{i+\frac{1}{2},j} - F^n_{i-\frac{1}{2},j}\right)
- \frac{\Delta t}{\Delta y}\left(G^n_{i,j+\frac{1}{2}} - G^n_{i,j-\frac{1}{2}}\right).   (2)

Here, i, j are the spatial grid indices and n is the temporal index, and F and G stand for numerical approximations to the fluxes of Eq. (1) in the x and y directions, respectively. The vector U^n_{i,j} is the vector of three unknown function values in cell (i, j) at time level n. Alternatively, one can consider a dimensional splitting approach

U^{*}_{i,j} = U^n_{i,j} - \frac{\Delta t}{\Delta x}\left(F^n_{i+\frac{1}{2},j} - F^n_{i-\frac{1}{2},j}\right),
U^{n+1}_{i,j} = U^{*}_{i,j} - \frac{\Delta t}{\Delta y}\left(G^{*}_{i,j+\frac{1}{2}} - G^{*}_{i,j-\frac{1}{2}}\right),   (3)

and this is the method we chose to implement. An advantage of the dimensional splitting approach is that Eq. (3) leads to accuracy that is in practice close to second-order time accuracy (see [5], pp. 386, 388, 444) without the need for a two-stage time integration. We use an expression for the numerical fluxes F and G ([5], p. 121, Eqs. (6.59)-(6.60)) that is second-order
accurate away from discontinuities, utilizing a Roe Riemann solver ([5], p. 481) with flux limiter. The update formula for any point (i, j) on the grid involves values from two neighboring grid points in each of the up, down, left and right directions, leading to a nine-point stencil for grid cell updates. For parallel implementations, this means that two layers of ghost cells need to be communicated between blocks after each iteration [5]. For numerical stability, the timestep size is limited by the well-known Courant-Friedrichs-Lewy condition, which implies that the timestep size must decrease proportional to the spatial grid size as the grid is refined. Grid cell updates may be computed in parallel and the arithmetic density per grid point is high (see Table 1), which, along with the structured nature of the grid data, makes this algorithm a good candidate for acceleration on Cell or GPU. The arithmetic density is computed by calculating the minimum number of floating point operations necessary to update all grid cells. That is, flux calculations are counted once per cell interface and the calculation of intermediate results that may be reused is not counted multiple times in the number of operations. This is a flat operation count: no special consideration is given to square root or division operations. It is useful to point out that, among the 360 FLOPS per grid cell, there are 2 square roots and 16 divisions. This is important since square roots and divisions may be evaluated in software or on a restricted number of processor subcomponents on Cell and GPU devices (depending on the precision, see below), so actual arithmetic density on those platforms may effectively be higher than what is reported in Table 1. Note that our algorithm has such a high effective arithmetic density for several reasons: we have a coupled system of three PDEs (3×9 = 27 values enter into the formula to update each grid value, instead of just 9 for uncoupled equations solved with the same accuracy), the
system is highly nonlinear and requires sophisticated numerical flux formulas based on Riemann solvers ([5], p. 481), and the flux formulas involve square roots and divisions. Since our algorithm is implemented in two passes, the minimum number of memory operations is each grid cell being read twice, and then stored twice, in each timestep.
Precision | FLOPS per grid cell | Memory per grid cell | FLOPS/Byte
SP        | 360                 | 48 Bytes             | 7.5
DP        | 360                 | 96 Bytes             | 3.75

Table 1: The compute kernel requires a minimum of 7.5 and 3.75 FLOPS per Byte of data loaded or stored in single precision (SP) and double precision (DP), respectively.
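The entries in Table 1 follow directly from the counts given above: 360 FLOPS per cell, with each of the three unknowns read twice and stored twice per timestep. A quick worked check (the helper function is ours, purely for illustration):

```c
/* Arithmetic intensity of the kernel: 360 FLOPS per cell per timestep;
 * each of the 3 unknowns (h, hu, hv) is read twice and stored twice
 * because of the two-pass (x sweep, y sweep) implementation. */
static double arithmetic_intensity(int bytes_per_value)
{
    const int flops_per_cell  = 360;
    const int values_per_cell = 3;   /* h, hu, hv */
    const int accesses        = 4;   /* 2 reads + 2 stores */
    return (double)flops_per_cell /
           (values_per_cell * accesses * bytes_per_value);
}
```

With 4-byte floats this gives 360/48 = 7.5 FLOPS/Byte, and with 8-byte doubles 360/96 = 3.75 FLOPS/Byte, as in Table 1.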
The test problem used for the simulations in this paper has initial conditions

u(x,y,0) = v(x,y,0) = 0, \qquad h(x,y,0) = \frac{1}{4}\left(\frac{x}{L} + \frac{y}{W}\right) + 1,

on a square domain Ω = [−L, L] × [−W, W]. Boundary conditions are perfect walls [5].
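A minimal sketch of initializing this test problem on a grid. The reading h = (x/L + y/W)/4 + 1 of the initial height, the cell-centered coordinates, and the row-major array layout are all assumptions made for this sketch rather than details taken from SWsolver.

```c
/* Initialize h, hu, hv on an nx-by-ny cell-centered Cartesian grid over
 * [-L,L] x [-W,W]: tilted free surface h = (x/L + y/W)/4 + 1, fluid at
 * rest (u = v = 0).  Row-major storage is an illustrative assumption. */
static void init_grid(double *h, double *hu, double *hv,
                      int nx, int ny, double L, double W)
{
    double dx = 2.0 * L / nx, dy = 2.0 * W / ny;
    for (int j = 0; j < ny; j++) {
        for (int i = 0; i < nx; i++) {
            double x = -L + (i + 0.5) * dx;  /* cell-center coordinates */
            double y = -W + (j + 0.5) * dy;
            h[j * nx + i]  = 0.25 * (x / L + y / W) + 1.0;
            hu[j * nx + i] = 0.0;            /* initially at rest */
            hv[j * nx + i] = 0.0;
        }
    }
}
```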
As noted above, we have chosen a relatively simple set of hyperbolic equations for this optimization and performance study paper. However, more complicated hyperbolic systems, including the compressible Euler and Magnetohydrodynamics equations which are widely used for fluid and plasma simulations, can be approximated numerically by the same or similar methods, and extension of our approach from 2D to 3D body-fitted structured grids or to unsplit explicit methods is also not difficult. We have deliberately chosen this relatively simple model problem for this paper because its simplicity allows us to explain the essential aspects of optimizing structured grid problems for Cell and GPU architectures, without being distracted by nonessential details of a more complicated application. Similarly, readers can easily investigate and comprehend the details of our implementation in the simulation code that we provide, without being overwhelmed by complications of the application. However, the approach and conclusions of our paper carry over directly to a broad class of important fluid and plasma simulation problems and algorithms.
3. Hardware Description
In this section we give a brief overview of the aspects of the
x86 CPU, IBM Cell and NVIDIA GPU architectures that are
important for optimization of our algorithmic approach.
3.1. Intel Xeon CPU
The Intel Xeon E5430 processors have four cores, and the particular features that are important in the context of this paper are the cache-based architecture and the SIMD vector parallelism provided through the streaming SIMD extensions (SSE) mechanism. Each core has SIMD vector units that are 128 bits wide and are capable of performing four single precision calculations or two double precision calculations at the same time. While compiler features are being developed that can automatically exploit this functionality, we found that for good performance it is at present still necessary to explicitly call intrinsic library functions that access these SIMD capabilities efficiently (see Section 4.1). The Intel Xeon E5430 quad-core processors used in this study have a clock rate of 2.66 GHz, a 12 MB L2 cache, and each core has a 16 KB L1 cache.
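As a small illustration of the intrinsics mentioned above (not code from SWsolver), one 128-bit SSE instruction adds four packed single-precision values at once:

```c
#include <xmmintrin.h>  /* SSE intrinsics, x86 only */

/* Add two arrays of 4 floats with a single 128-bit SSE addition.
 * Illustrative sketch; the paper's kernels apply the same idea to the
 * flux computations. */
static void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);            /* load 4 packed floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); /* 4 additions in one op */
}
```

The double precision counterparts (`_mm_add_pd` etc.) operate on two packed doubles, which is why SSE throughput halves in double precision.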
3.2. Cell Processor
The Cell Broadband Engine Architecture (CBEA), developed jointly by IBM, Sony, and Toshiba, is a microprocessor design focused on maximizing computational throughput and memory bandwidth while minimizing power consumption [3, 4]. The first implementation of the CBEA is the Cell processor and it has been used successfully in several large-scale scientific computing clusters [26, 27], notably Los Alamos National Laboratory's petaflop-scale system Roadrunner [28].
The heterogeneous multicore design of the Cell processor may be thought of as a network on a chip, with different cores specialized for different computational tasks (Fig. 2). Since the Cell processor is designed for high computational throughput applications, eight of its nine processor cores are vector processors, called synergistic processing elements (SPEs). The other core is a more conventional (and relatively slow) CPU, called the PowerPC processing element (PPE). The PPE has a 64-bit processor (called the PowerPC processing unit (PPU)) as well as a memory subsystem containing a 512 KB L2 cache. The PPU runs the operating system and is suitable for general purpose computing. However, in practice its main task is to coordinate the activities of the SPEs.
Communication on the chip is carried out through the element interconnect bus. It has a high bandwidth (204.8 GB/s) and connects the PPE, SPEs, and main memory through a four-channel ring topology, with two channels going in each direction (Fig. 2). For main memory the Cell uses Rambus XDR DRAM memory which delivers 25.6 GB/s maximum bandwidth on two 32-bit channels of 12.8 GB/s each.
The SPE is the main computational workhorse of the Cell processor. It has a 3.2 GHz SIMD processor (called the synergistic processing unit (SPU)) that operates on 128-bit wide vectors which it stores in its 128 128-bit registers.
Each SPE has 256 KB of on-chip memory called the Local Store (LS). The SPU draws on the LS for both its instructions and data: if data is not in the LS it has no automatic mechanism to look for it in main memory. All data transfers between the LS and main memory are controlled via software-controlled direct memory access (DMA) commands. Each SPE has a memory flow controller that takes care of DMAs and operates independently of the SPU. DMAs may also transfer data directly between the local stores of different SPEs.
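Because all Local Store traffic is explicit, a common data-flow pattern on the SPE is double buffering: fetch the next block of grid data while computing on the current one. The following hardware-neutral sketch shows the pattern; the `memcpy` calls stand in for the asynchronous DMA get/put commands (which would overlap with compute on a real SPE), and the block size and function names are illustrative assumptions.

```c
#include <string.h>

#define BLOCK 256  /* illustrative Local Store-sized block, in elements */

/* Placeholder for the real flux kernel operating on one block. */
static void compute(double *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] *= 2.0;
}

/* Process a long main-memory array in blocks using two LS buffers:
 * while block b is being computed, block b+1 is already being fetched
 * into the other buffer. */
static void process_blocks(double *main_mem, int n)
{
    double ls[2][BLOCK];   /* two Local Store buffers */
    int cur = 0;
    int first = n < BLOCK ? n : BLOCK;
    memcpy(ls[cur], main_mem, first * sizeof(double));   /* "DMA get" #0 */
    for (int off = 0; off < n; off += BLOCK) {
        int len = n - off < BLOCK ? n - off : BLOCK;
        int nxt = off + BLOCK;
        if (nxt < n) {     /* prefetch next block into the other buffer */
            int nlen = n - nxt < BLOCK ? n - nxt : BLOCK;
            memcpy(ls[1 - cur], main_mem + nxt, nlen * sizeof(double));
        }
        compute(ls[cur], len);
        memcpy(main_mem + off, ls[cur], len * sizeof(double)); /* "DMA put" */
        cur = 1 - cur;     /* swap buffers */
    }
}
```

With genuinely asynchronous DMA, the transfer latency of each block is hidden behind the computation on the previous one.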
The SPU has only static branch predicting capabilities and has no other registers besides the 128-bit registers. It supports both single and double precision floating point instructions. However, hardware support for transcendental functions
Figure 2: Hardware diagram of the Cell processor. The 8 SPEs are the SIMD vector processors, the PPE is the PowerPC CPU, and the rings illustrate the four-channel ring topology of the Element Interconnect Bus. Also shown is the XDR DRAM memory interface to the Cell blade main memory, and the I/O interfaces which allow two Cell processors on one blade to share SPEs.
is only available in the form of reduced precision approximations of reciprocals and reciprocal square roots. Full single and double precision transcendentals must be evaluated in software. Most Cell tests in this paper are performed on the cluster described in Section 5.1.2 with PowerXCell 8i processors, but we also include some tests on Cell processors in Sony's PlayStation 3, which are an earlier generation of the Cell processor with less hardware support for double precision calculations, and which have two of their SPEs disabled.
3.3. NVIDIA GPUs and CUDA Programming Model
GPUs are not, as their name would suggest, solely used for graphics applications: NVIDIA Tesla GPUs have evolved to be general purpose high-throughput data-parallel computing devices [1]. The GPU attaches to a host CPU system via the PCI Express bridge as an add-on computational accelerator with its own separate DRAM (up to 4 GB), which we call GPU global memory, and some specialized on-chip memory. Programs may be developed to make use of the GPU by using NVIDIA's CUDA programming model which provides extensions to the C programming language [29, 30]. (CUDA stands for compute unified device architecture.) The GPU is incorporated into a program's execution by calling what is known as a kernel function from within the CPU host code. A kernel is defined similarly to a normal C function but when called, a user-specified number of threads are spawned, each of which executes the kernel function on the GPU in parallel. The threads are mapped into groups of up to 512 called thread blocks, and the threads within a thread block are grouped into smaller groups of 32 threads called warps.
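This grouping fixes how each thread finds the grid cell it is responsible for. The standard index arithmetic is sketched below in plain C so it can be checked without a GPU; in a CUDA kernel the block_idx, block_dim and thread_idx arguments would come from the built-in blockIdx, blockDim and threadIdx variables.

```c
/* Map (block index, thread index) to a global thread id and a warp id,
 * following the CUDA execution model: threads come in blocks of up to
 * 512, subdivided into warps of 32. */
#define WARP_SIZE 32

static int global_id(int block_idx, int block_dim, int thread_idx)
{
    /* equivalent of blockIdx.x * blockDim.x + threadIdx.x in a kernel */
    return block_idx * block_dim + thread_idx;
}

static int warp_id(int thread_idx)
{
    return thread_idx / WARP_SIZE;  /* which warp within the block */
}
```

In a structured-grid kernel, the global id (or a 2D analogue of it) is typically used directly as the grid cell index, so that consecutive threads touch consecutive memory locations.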
The NVIDIA GT200 architecture uses a hierarchical organization of thread processors and memory to implement a single instruction multiple thread (SIMT) streaming multiprocessor design, shown schematically in Fig. 3. Threads are farmed out to the hundreds of identical scalar processors (SPs) on the GPU. (The Tesla T10 GPU we use has 240 SPs.) Ideally, many more threads are spawned than the number of SPs. The SPs are organized into blocks of eight, called streaming multiprocessors (SMs). Each SM in addition to the eight SPs has a special function unit (SFU) for computing transcendental functions and a double precision unit (DP) which can also act as an SFU. Each SM also has a block of local memory called shared memory visible to all threads within a thread block, and a scheduling unit used to schedule warps. The GPU is capable of swapping warps into and out of context without any performance overhead. This functionality provides an important method of hiding memory and instruction latency on the GPU hardware.

When a kernel function is called, it is initiated on the GPU by mapping multiple thread blocks onto the SMs. Thread blocks are divided on the SMs into groups of 32 threads called warps and execution proceeds in a SIMT fashion within each warp. Threads within a thread block may be synchronized if necessary. However, there is no generally efficient mechanism for synchronization across the thread blocks within a kernel function.
3.3.1. Fermi GPU
The Fermi architecture, released in the spring of 2010, is NVIDIA's next-generation GPU. It is the successor of the GT200 architecture described above, and is the first in which NVIDIA focused on general-purpose computation performance. The main improvements to note for this paper are the full IEEE floating point compliance, the improved double precision performance, and the addition of a cache hierarchy. The double precision performance on Fermi is half the speed of single precision, bringing it in line with most CPUs. The addition of a cache hierarchy, consisting of a global L2 cache as well as a per-SM L1 cache, gives more flexibility in non-uniform memory accesses. The Fermi C2050 features 448 SPs organized in 14 SMs. Each SM has 32 SPs, 16 DPs, and 4 SFUs. The Fermi C2050 features a 1.15 GHz clock speed which is slower than the Tesla T10's 1.30 GHz. For the rest of the de