Parallel Hyperbolic PDE Simulation on Clusters: Cell versus GPU
Scott Rostrup and Hans De Sterck
Department of Applied Mathematics, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational per-
formance. Two technologies that have received significant attention are IBM’s Cell Processor and NVIDIA’s CUDA programming
model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial
differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The
message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of
the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data
layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code perfor-
mance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and
GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors
or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32
Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some
preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper
provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight
into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides
insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.
Keywords: parallel performance, Cell processor, GPU, hyperbolic system, code optimization
Program Title: SWsolver
Licensing provisions: GPL v3
Programming language: C, CUDA
Computer: Parallel Computing Clusters. Individual compute nodes
may consist of x86 CPU, Cell processor, or x86 CPU with attached
NVIDIA GPU accelerator.
Operating system: Linux
RAM: Tested on problems requiring up to 4 GB per compute node.
Number of processors used: Tested on 1-128 x86 CPU cores, 1-32 Cell
Processors, and 1-32 NVIDIA GPUs.
Keywords: Parallel Computing, Cell Processor, GPU, Hyperbolic
External routines/libraries: MPI, CUDA, IBM Cell SDK
Subprograms used: numdiff (for test run)
Nature of problem:
MPI-parallel simulation of the Shallow Water equations using a high-resolution 2D hyperbolic equation solver on regular Cartesian grids,
for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA.
Solution method:
SWsolver provides 3 implementations of a high-resolution 2D Shallow
Water equation solver on regular Cartesian grids, for CPU, Cell Pro-
cessor, and NVIDIA GPU. Each implementation uses MPI to divide
work across a parallel computing cluster.
Running time:
The test run provided should run in a few seconds on all architectures.
In the results section of the manuscript a comprehensive analysis of
performance for different problem sizes and architectures is given.
Recent microprocessor advances have focused on increas-
ing parallelism rather than frequency, resulting in the develop-
ment of highly parallel architectures such as graphics process-
ing units (GPUs) [1, 2] and IBM’s Cell processor [3, 4]. Their
potential for excellent performance on computation-intensive
scientific applications coupled with their availability as com-
modity hardware has led researchers to adapt computational
kernels to these parallel architectures, which are often referred
to as accelerator architectures.
This paper investigates mapping high-resolution finite vol-
ume methods for nonlinear hyperbolic partial differential equa-
tion (PDE) systems onto two different types of accelerator
architecture, namely, IBM's Cell processor and NVIDIA GPUs.
Performance on these architectures is then compared with per-
formance on Intel x86 central processing units (CPUs). The
accelerator architectures are investigated as both stand-alone
computational accelerators and as components of parallel clus-
ters. A high-resolution explicit numerical scheme is implemented
for a model problem in this class, namely, the shallow water equations. The numeri-
cal method is implemented on two-dimensional (2D) structured
grids, for three architectures (x86 CPU, GPU, and Cell), and in
parallel using the message passing interface (MPI).
A major goal of this paper is to compare the computational
performance that can be obtained on clusters with these three
types of architectures, for a 2D model problem that is represen-
tative of a large class of structured grid based simulation algo-
rithms. Simulations of this type are widely used in many ar-
eas of computational science and engineering. Another impor-
tant goal is to provide computational scientists and engineers
who are considering porting their codes to accelerator environ-
ments with insight into techniques for optimizing structured
grid based explicit algorithms on clusters with Cell and GPU
accelerators, and into the learning curve and programming ef-
fort involved. It was also our aim to write this paper in a way
that is accessible to computational scientists who may not have
specific background in Cell or GPU computing.
There is extensive related work in the literature on the use
of Cell processors and GPUs for scientific computing applica-
tions. Many of the papers in the literature deal with optimized
implementations for either Cell processors [6, 7, 8, 9] or GPUs
[10, 11, 12, 13, 14, 15]. Most of these papers deal with stan-
dalone or shared-memory hardware configurations, and do not
involve distributed memory communication and MPI. Related
work in the computational fluid dynamics area can be found in
[16, 17, 18, 19]. Work that directly compares Cell with GPU
performance is not widespread, and applications on par-
allel clusters with Cell and GPU accelerators have only more
recently started to come to the forefront [21, 22]. Our paper
goes further than existing work in comparing Cell with GPU
performance on clusters with MPI, and these are relevant ex-
tensions of existing work since large clusters with accelerators
are already being deployed and appear to be a promising direc-
tion for the future.
In our approach we have developed a unified code frame-
work for our model problem, for hardware platforms that in-
clude distributed memory clusters with x86 CPU, Cell and GPU
components. Several levels of parallelism are exploited (see
Fig. 1). At the coarsest level of parallelism, we partition the
computational domain over the distributed memory nodes of
the cluster and use MPI for communication. We carry out per-
formance tests on clusters provided by Ontario’s Shared Hi-
erarchical Academic Research Computing Network (SHARC-
NET [23]) and the Juelich Supercomputing Centre (JSC [24]).
These clusters have two CPUs, Cell processors or GPUs per
cluster node. At finer levels of parallelism, we exploit the par-
allel acceleration features provided by x86 CPUs, and Cell and
GPU devices. The x86 CPUs we use feature four cores per
CPU, and the cores provide single instruction, multiple data
(SIMD) vector parallelism through streaming SIMD extensions
(SSE). The Cell processors feature eight SIMD vector proces-
sor cores. The GPUs feature dozens of streaming multiproces-
sors with single instruction multiple thread (SIMT) parallelism.
We exploit these different levels of parallelism through opti-
mization of data layout, data flow and data-parallel instructions.
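As a concrete illustration of the coarsest level, the following C sketch shows one way the ghost-cell exchange between horizontally neighbouring partitions could be expressed with MPI; the identifiers (exchange_halo_x, GHOST, and so on) are illustrative and are not taken from SWsolver.

#include <mpi.h>

#define GHOST 2   /* two ghost layers per side, matching the nine-point stencil */

/* Illustrative sketch: exchange the GHOST ghost-cell columns of one conserved
 * variable with the left and right neighbour ranks (MPI_PROC_NULL at domain
 * boundaries). q is stored row-major with (ny + 2*GHOST) rows of length
 * stride = nx + 2*GHOST, so a ghost region is a strided strip of columns.    */
void exchange_halo_x(double *q, int nx, int ny,
                     int left, int right, MPI_Comm comm)
{
    int stride = nx + 2 * GHOST;
    double *row0 = q + GHOST * stride;          /* first interior row          */
    MPI_Datatype strip;                         /* GHOST columns x ny rows     */

    MPI_Type_vector(ny, GHOST, stride, MPI_DOUBLE, &strip);
    MPI_Type_commit(&strip);

    /* send rightmost interior columns to the right, fill left ghost columns   */
    MPI_Sendrecv(row0 + nx,         1, strip, right, 0,
                 row0,              1, strip, left,  0,
                 comm, MPI_STATUS_IGNORE);

    /* send leftmost interior columns to the left, fill right ghost columns    */
    MPI_Sendrecv(row0 + GHOST,      1, strip, left,  1,
                 row0 + GHOST + nx, 1, strip, right, 1,
                 comm, MPI_STATUS_IGNORE);

    MPI_Type_free(&strip);
}

A corresponding exchange fills the ghost rows in the y direction, and the same pattern is repeated for each of the three conserved variables.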
Our development code is available on our website [25] and via
the Computer Programs in Physics (CPiP) program library. We
report runtime performance results for the various levels of op-
timization performed, and first compare Cell and GPU perfor-
mance to performance on a single CPU core, as is customary
in the literature. We also compare CPU, Cell and GPU perfor-
mance on a chip-by-chip basis, on a node-by-node basis (i.e., on
single cluster nodes without MPI), and on clusters (with MPI).
Our GPU cluster results use NVIDIA Tesla GPUs with GT200
architecture, but we also include some results on recently in-
troduced NVIDIA GPUs with the next-generation Fermi archi-
tecture. Our Fermi results are preliminary: we did not further
optimize our code for the Fermi platform, but found it interest-
ing to include results that show how a code developed on the
GT200 architecture performs on Fermi. We conclude on the
suitability of the accelerator architectures studied for the appli-
cation class considered, and discuss the speed-up that may be
gained on current and future accelerator architectures for this
class of applications.
The rest of this paper is organized as follows. In Section
2 we briefly describe the class of scientific computing prob-
lems we target in this study, and the specific model problem we
have implemented. Section 3 gives a brief overview of the as-
pects of the CPU, Cell and GPU architectures that are important
for code optimization. Section 4 describes how our simulation
code implementation was optimized for the architectures un-
der consideration. Section 5 describes the clusters we use and
compares performance of the optimized simulation code on the
CPU, Cell and GPU platforms, and Section 6 formulates conclusions.
2. Hyperbolic PDE Simulation Problem
In this paper we target acceleration of a class of structured
grid simulations in which grid quantities are evolved from step
to step using information from nearby grid cells. One appli-
cation area where this type of successive short-range update
is used is fluid and plasma simulation with explicit time inte-
gration, but there are many other use cases with this pattern in
the computational science and engineering field. The particu-
lar problems we study are nonlinear hyperbolic PDE systems,
which require storage of multiple unknowns in each grid cell,
and which involve a relatively large number of floating point
operations (FLOPS) per grid cell in each time step. (Note that,
in this paper, we will write FLOPS/s when we mean floating
point operations per second.) For ease of implementation and
experimentation, we chose a relatively simple fluid simulation
problem and a relatively simple but commonly used algorithmic
approach. However, these choices are representative of a large
class of existing simulation codes, and our approach can eas-
ily be generalized. Therefore, many of our findings carry over
to this general class of simulation problems. In particular, we
chose to investigate shallow water flow on 2D Cartesian grids,
using a high-resolution finite volume method with explicit time
integration.
Figure 1: General overview of the different levels of parallelism exploited. At the coarsest level of parallelism (left) we partition
the computational domain over the distributed memory nodes of the cluster and use MPI for communication between neighboring
partitions. At the finest level of parallelism (right), we utilize SIMD vectors (CPU and Cell) or SIMT thread parallelism (GPU). At
intermediate levels, we use Local Store-sized blocks of data (Cell) or thread blocks (GPU). The actual details of the different levels
of parallelism depend on the platform and are represented more explicitly in Figs. 4 (CPU), 5 (Cell), and 7 (GPU).
Our code computes numerical solutions of the shallow water equations, which are given by
$$
\frac{\partial}{\partial t}\begin{bmatrix} h \\ hu \\ hv \end{bmatrix}
+ \frac{\partial}{\partial x}\begin{bmatrix} hu \\ hu^{2} + \tfrac{1}{2} g h^{2} \\ huv \end{bmatrix}
+ \frac{\partial}{\partial y}\begin{bmatrix} hv \\ huv \\ hv^{2} + \tfrac{1}{2} g h^{2} \end{bmatrix}
= 0, \qquad (1)
$$
where h is the height of the water, g is gravity, and u and v represent the fluid velocities; the gravitational constant g is taken to be a fixed value. The shallow water system is a nonlinear system of hyperbolic conservation laws [5], and given an initial condition, a 2D domain and appropriate boundary conditions, it describes the evolution in time of the unknown functions h(x,y,t), u(x,y,t) and v(x,y,t). We discretize the equations on a rectangular domain with a structured Cartesian grid, and evolve the solution numerically in time using a finite volume numerical method with explicit time integration [5]. In what follows we write U = [h hu hv]^T. We update the solution in each grid cell (i, j) using an explicit difference method. One approach to this problem is to use so-called unsplit methods of the form
$$
U^{n+1}_{i,j} = U^{n}_{i,j}
- \frac{\Delta t}{\Delta x}\left( F_{i+1/2,j} - F_{i-1/2,j} \right)
- \frac{\Delta t}{\Delta y}\left( G_{i,j+1/2} - G_{i,j-1/2} \right). \qquad (2)
$$
Here, i, j are the spatial grid indices and n is the temporal index, and F and G stand for numerical approximations to the fluxes of Eq. (1) in the x and y directions, respectively. The vector U^n_{i,j} is the vector of three unknown function values in cell (i, j) at time level n. Alternatively, one can consider a dimensional splitting approach,
$$
U^{*}_{i,j} = U^{n}_{i,j} - \frac{\Delta t}{\Delta x}\left( F_{i+1/2,j} - F_{i-1/2,j} \right), \qquad
U^{n+1}_{i,j} = U^{*}_{i,j} - \frac{\Delta t}{\Delta y}\left( G_{i,j+1/2} - G_{i,j-1/2} \right), \qquad (3)
$$
and this is the method we chose to implement. An advantage of the dimensional splitting approach is that Eq. (3) leads to accuracy that is in practice close to second-order time accuracy (see [5], pp. 386, 388, 444) without the need for a two-stage time integration. We use an expression for the numerical fluxes F and G ([5], p. 121, Eqs. (6.59)-(6.60)) that is second-order accurate away from discontinuities, utilizing a Roe Riemann solver ([5], p. 481) with flux limiter. The update formula for
any point (i, j) on the grid involves values from two neighbor-
ing grid points in each of the up, down, left and right directions,
leading to a nine-point stencil for grid cell updates. For paral-
lel implementations, this means that two layers of ghost cells
need to be communicated between blocks after each iteration.
For numerical stability, the timestep size is limited by the
well-known Courant-Friedrichs-Lewy condition, which implies
that the timestep size must decrease proportional to the spatial
grid size as the grid is refined. Grid cell updates may be com-
puted in parallel and the arithmetic density per grid point is
high (see Table 1), which, along with the structured nature of
the grid data, makes this algorithm a good candidate for accel-
eration on Cell or GPU. The arithmetic density is computed by
calculating the minimum number of floating point operations
necessary to update all grid cells. That is, flux calculations are
counted once per cell interface and the calculation of interme-
diate results that may be reused is not counted multiple times in
the number of operations. This is a flat operation count: no spe-
cial consideration is given to square root or division operations.
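To make the structure of such an update concrete, the following CUDA fragment sketches how one thread per interior cell can apply the x-direction flux difference of the dimensionally split scheme (Eq. (3)) once the limited interface fluxes have been computed; the kernel and array names are illustrative placeholders, one possible data layout is assumed, and the fragment is deliberately simpler than the optimized kernels described in Section 4.

/* Illustrative CUDA sketch (not the optimized SWsolver kernel): one thread per
 * interior cell applies the x-direction flux difference of Eq. (3) to the three
 * conserved variables. Fx_* hold precomputed, limited fluxes at the nx+1 cell
 * interfaces of each of the ny rows.                                           */
__global__ void sweep_x_update(float *h, float *hu, float *hv,
                               const float *Fx_h, const float *Fx_hu,
                               const float *Fx_hv,
                               int nx, int ny, float dtdx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* cell index in x */
    int j = blockIdx.y * blockDim.y + threadIdx.y;   /* cell index in y */
    if (i >= nx || j >= ny) return;

    int c  = j * nx + i;         /* cell-centred index             */
    int fl = j * (nx + 1) + i;   /* interface i-1/2 of this cell   */
    int fr = fl + 1;             /* interface i+1/2 of this cell   */

    h [c] -= dtdx * (Fx_h [fr] - Fx_h [fl]);
    hu[c] -= dtdx * (Fx_hu[fr] - Fx_hu[fl]);
    hv[c] -= dtdx * (Fx_hv[fr] - Fx_hv[fl]);
}

A y-direction sweep and the flux/limiter kernels would be launched in the same way, for example with 16 × 16 thread blocks covering the grid; on the CPU and Cell the analogous per-cell arithmetic is instead carried out over SIMD vectors of cells.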
It is useful to point out that, among the 360 FLOPS per grid
cell, there are 2 square roots and 16 divisions. This is important
since square roots and divisions may be evaluated in software
or on a restricted number of processor sub-components on Cell
and GPU devices (depending on the precision, see below), so
actual arithmetic density on those platforms may effectively be
higher than what is reported in Table 1. Note that our algorithm
has such a high effective arithmetic density for several reasons:
we have a coupled system of three PDEs (3×9=27 values en-
ter into the formula to update each grid value, instead of just
9 for uncoupled equations solved with the same accuracy), the
[Table 11 rows: 2 x Xeon CPU, 2 x Tesla T10 GPU, and 2 x PowerXCell 8i, reported in (a) Single Precision and (b) Double Precision.]
Table 11: Strong scaling performance comparison between architectures in single and double precision using from N=1 to N=16
cluster nodes with MPI. This test uses a 16000 × 10000 grid with L = 160, W = 100 and 100 timesteps. We give the total runtime
(T), the MPI exchange time (Ex), the time to do the actual calculations (Wk), and the scaling of the total runtime (S). For all
platforms we use our most performant code, as listed in Section 5.1.1.
[Table 12 rows: 2 x Xeon CPU, 2 x Tesla T10 GPU, and 2 x PowerXCell 8i, reported in (a) Single Precision and (b) Double Precision.]
Table 12: Weak scaling performance comparison between architectures in single and double precision using from N=1 to N=16
cluster nodes with MPI. This test uses grid sizes of 10000×5000, 10000×10000, 20000×10000, 20000×20000, and 40000×20000
with grid L = W = 800 and 100 timesteps. We give the total runtime (T), the MPI exchange time (Ex), the time to do the actual
calculations (Wk), and the ratio of the total runtime for N=1 divided by the total runtime on N nodes (R). For all platforms we use
our most performant code, as listed in Section 5.1.1.
performance does not scale well because of the poor PPU per-
formance on the inter-node MPI calls.
6. Conclusions and Future Directions
We have shown how a numerical simulation method for non-
linear hyperbolic PDE systems can be implemented efficiently
for the Cell processor and NVIDIA GPUs using CUDA. We
have described memory layout, communication patterns and
optimization steps that were performed to exploit the low-level
parallelism of these two platforms. A coarse-level layer of MPI
parallelism was added to obtain a hybrid parallel code that can
be executed efficiently on GPU or Cell clusters. Performance
tests were conducted on JSC’s Cell cluster system ‘JUICEnext’
and SHARCNET’s GPU cluster ‘angel’.
Compared to a reference CPU implementation with cache
and SSE optimization executed on one Xeon core, significant
speed-ups were obtained by both the Cell processor and GPU
implementations. In single precision the Tesla T10 GPU imple-
mentation (GT200 architecture) gave almost 32× speed-up, and
the Cell provided a 22× speed-up. In double precision the Tesla
GPU gave a 13× speed-up over a single Xeon core, while the
Cell gave just under 7.5× better performance (Table 5).
In a chip-to-chip comparison of single precision results, the
Cell processor was 5× faster than a quad-core Xeon implemen-
tation, and the Tesla GPU was 8× faster. In double precision
the Cell achieves a 2× speed-up, and the Tesla GPU roughly 3×
(Table 5). For our CUDA code, recently released GPUs with
next-generation Fermi architecture improve on the Tesla results
(GT200 architecture) by a factor of about two (in preliminary
results without further Fermi-specific optimization).
In a cluster-to-cluster comparison, the speed-up ratios remain
roughly the same. However, the Cell implementation scales
poorly on our QS22-cluster due to slow performance of the
PPE. There are work-arounds for the slow PPE, such as the
hybrid AMD-Cell approach used in Roadrunner, or one can re-
duce the number of times that MPI communication must occur
by adding more ghost layers and doing multiple iterations be-
tween MPI communication calls.
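As a sketch of the second work-around, recall that the nine-point stencil of Section 2 consumes two ghost layers per sweep, so exchanging a wider halo once allows several local timesteps before the next exchange; the routine names below are illustrative placeholders, not SWsolver identifiers.

/* Illustrative sketch: amortize MPI halo exchanges over several timesteps.
 * With ghost_width layers exchanged at once and a stencil that consumes two
 * layers per step, ghost_width/2 steps can run between successive exchanges. */
void exchange_halos(int width);   /* wide ghost-layer exchange (placeholder)  */
void local_timestep(void);        /* one dimensionally split update step      */

void advance(int nsteps, int ghost_width)
{
    int steps_per_exchange = ghost_width / 2;   /* e.g. 4 layers -> 2 steps */

    for (int n = 0; n < nsteps; n += steps_per_exchange) {
        exchange_halos(ghost_width);            /* one MPI exchange,          */
        for (int k = 0; k < steps_per_exchange && n + k < nsteps; ++k)
            local_timestep();                   /* then several local steps,  */
    }                                           /* each consuming two layers  */
}                                               /* of the exchanged halo      */

This trades some redundant computation in the overlapping halo regions for fewer, larger messages, which also reduces how often the slow PPE has to drive MPI.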
The overall results of our study demonstrate that Cell and
GPU clusters can be used for efficiently simulating nonlin-
ear hyperbolic PDE systems on structured grids with explicit
timestepping. Our model problem is representative of a large
class of structured grid based local-interaction simulation algo-
rithms with relatively large FLOP counts per byte of data transferred,
and our conclusions carry over to many simulation codes
used in broad areas of computational science and engineering.
Our discussion of Cell and GPU optimization techniques
shows that there is still a significant learning curve and opti-
mization effort involved in porting codes to Cell and GPU en-
vironments. We have found the effort required to obtain sig-
nificant speed-ups smaller on GPU platforms than on Cell plat-
forms, and find the GPU platform generally simpler and more
intuitive to program. Extensive efforts are underway to improve
automatic compiler optimization for both Cell and GPU plat-
forms: the RapidMind platform, the HONEI libraries [20] and
the new OpenCL framework are examples. However, for
the time being, automatic approaches cannot yet provide the
level of speed optimization that we were seeking in this paper.
In future work we will extend our approach to body-fitted
adaptive multi-block codes in three dimensions (3D), which
allow for simulation of real 3D problems with more complex
geometry. Extending our work to unstructured grids or im-
plicit time integration would require different optimization ap-
proaches, since those problems require unstructured data and/or
nonlocal connectivity. Mixed-precision calculations are also an
interesting option, due to the excellent performance of Cell and
GPU on single precision calculations.
Our preliminary results on the recently introduced Fermi
GPU show that GPU calculations can be up to 14 times faster
than quad-core Xeon E5430 CPU calculations in single preci-
sion, and up to 7 times faster in double precision, for our ap-
plication. Improvements in the newest GPUs that are important
for HPC include full IEEE floating point compliance, error cor-
recting memory, and L1 and L2 caches. Direct communication
between GPUs over InfiniBand networks is another exciting development.
Our results support indications that clusters with heteroge-
neous multi-core architectures may become increasingly im-
portant for scientific computing applications in the near future,
especially if one considers their typically low power re-
quirements [26, 22] and the fact that heterogeneous multi-core
architectures may scale more easily to large numbers of on-chip
cores than homogeneous chips with full x86 cores.
This work was made possible by the facilities of the Shared
Hierarchical Academic Research Computing Network (SHAR-
CNET [23]) and the Juelich Supercomputing Centre (JSC [24]). We thank
Merz, John Morton and other members of the SHARCNET
technical staff for their expert advice and technical help. We
also thank Markus Stuermer, Lucian Ivan and Matthias Bolten
for their advice, and Willy Homberg for JSC support.
 E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, NVIDIA Tesla:
A Unified Graphics and Computing Architecture, IEEE Micro 28 (2008)
 J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C.
Phillips, GPU Computing, Proceedings of the IEEE 96 (2008) 879–899.
 D.C. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry,
D. Cox, P. Harvey, P.M. Harvey, H.P. Hofstee, C. Johns, J. Kahle,
A. Kameyama, J. Keaty, Y. Masubuchi, M. Pham, J. Pille, S. Posluszny,
M. Riley, D.L. Stasiak, M. Suzuoki, O. Takahashi, J. Warnock, S. Weitzel,
D. Wendel, and K. Yazawa, Overview of the architecture, circuit design,
and physical implementation of a first-generation Cell processor, IEEE
Journal of Solid-State Circuits 41 (2006) 179–196.
 A. Arevalo, R. M. Matinata, M. Pandian, E. Peri, K. Ruby, F. Thomas,
and C. Almond, Programming the Cell Broadband Engine Architecture:
Examples and Best Practices (IBM Redbooks, 2008).
 R. J. LeVeque, Finite volume methods for hyperbolic problems (Cam-
bridge University Press, 2002).
 C. Benthin, I. Wald, M. Scherbaum, and H. Friedrich, Ray tracing on the
Cell processor, in: Proceedings of the 2006 IEEE Symposium on Interac-
tive Ray Tracing, 2006, 15–23.
 G. De Fabritiis, Performance of the Cell processor for biomolecular sim-
ulations, Computer Physics Communications 176 (2007) 660–664.
 M. Stuermer, J. Goetz, G. Richter, A. Doerfler, and U. Ruede, Fluid flow
simulation on the Cell Broadband Engine using the lattice Boltzmann
method, Computers and Mathematics with Applications 58 (2009) 1062–
 S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick,
Scientific computing kernels on the Cell processor, Int. J. Parallel Pro-
gramming 35 (2007) 263–298.
 M. S. Friedrichs, P. Eastman, V. Vaidyanathan, M. Houston, S. Legrand,
A. L. Beberg, D. L. Ensign, C. M. Bruns, and V. S. Pande, Accelerating
molecular dynamic simulation on graphics processing units, Journal of
Computational Chemistry 30 (2009) 864–872.
 T. Hamada and T. Iitaka, The chamomile scheme: An optimized algo-
rithm for n-body simulations on programmable graphics processing units,
 Lars Nyland, Mark Harris, and Jan Prins, Fast N-Body Simulation with
CUDA, in: GPU gems 3, Addison-Wesley, 2008, Chapter 31, 677–696.
 S. S. Stone, J. P. Haldar, S. C. Tsao, W. Hwu, B. P. Sutton, and Z. P. Liang,
Accelerating advanced MRI reconstructions on GPUs, J. Parallel Distrib.
Comput. 68 (2008) 1307–1318.
 F. Xu and K. Mueller, Real-time 3D computed tomographic reconstruc-
tion using commodity graphics hardware, Physics in Medicine and Biol-
ogy 52 (2007) 3405–3419.
 J. E. Stone, J. Saam, D. J. Hardy, K. L. Vandivort, W.-m. W. Hwu, and
K. Schulten, High performance computation and interactive display of
molecular orbitals on GPUs and multi-core CPUs, in: Proceedings of the
2nd Workshop on General-Purpose Processing on Graphics Processing
Units, ACM International Conference Proceeding Series, vol. 383, 9–18,
 T. Brandvik and G. Pullan, Acceleration of a two-dimensional Euler flow
solver using commodity graphics hardware, in: Proceedings of the Institu-
tion of Mechanical Engineers, Part C: Journal of Mechanical Engineering
Science 221 (2007) 1745–1748.
 T. R. Hagen, K.-A. Lie, and J. R. Natvig, Solving the Euler equations
on Graphics Processing Units, Lecture Notes in Computer Science 3994
 A. Kloeckner, T. Warburton, J. Bridge, and J.S. Hesthaven, High-order
discontinuous Galerkin methods on graphics processors, Journal of Com-
putational Physics, submitted, 2009.
 M. L. Saetra, Solving systems of hyperbolic PDEs using multiple GPUs,
Master’s thesis, University of Oslo, 2007.
 D. van Dyk, M. Geveler, S. Mallach, D. Ribbrock, D. Goeddeke, and C.
Gutwenger, HONEI: A collection of libraries for numerical computations
targeting multiple processor architectures, Computer Physics Communi-
cations, in press, 2009.
 J. C. Phillips, J. E. Stone, and K. Schulten, Adapting a message-driven
parallel application to GPU-accelerated clusters, in: Proceedings of the
2008 ACM/IEEE conference on Supercomputing, 2008, 1–9.
 V. V. Kindratenko, J. J. Enos, G. Shi, M. T. Showerman, G. W.
Arnold, J. E. Stone, J. C. Phillips, and W.-m. Hwu, GPU clusters for high-
performance computing, in: IEEE International Conference on Cluster
 SHARCNET website, www.sharcnet.ca.
 Juelich Supercomputing Centre website, www.fz-juelich.de.
 Scalable Scientific Computing group website, University of Waterloo,
 Green 500 website, www.green500.org.
 Top 500 website, www.top500.org.
 K. J. Barker, K. Davis, A. Hoisie, D. J. Kerbyson, M. Lang, S. Pakin, and
J. C. Sancho, Entering the petaflop era: The architecture and performance
of Roadrunner, in: Proceedings of the 2008 ACM/IEEE conference on
Supercomputing, 2008, 1–11.
 NVIDIA CUDA Programming Guide. Version 2.3.1.
 J. Nickolls, I. Buck, M. Garland, and K. Skadron, Scalable parallel pro-
gramming with CUDA, Queue 6 (2008) 40–53.
 NVIDIA's Next Generation CUDA Compute Architecture: Fermi, white paper.
 S. Rostrup, Solving Hyperbolic PDEs using Accelerator Architectures,
Master's thesis, Department of Applied Mathematics, University of Waterloo.
 S. Rostrup and H. De Sterck, Hybrid MPI-Cell Parallelism for Hyper-
bolic PDE Simulation on a Cell Processor Cluster, in: Proceedings of the
High Performance Computing Symposium, Kingston, Ontario.
 Intel C++ Compiler User and Reference Guides, Ch. 26, pp. 1385-1387.
Document number: 304968-023US.
 D. Callahan, S. Carr, and K. Kennedy, Improving Register Allocation for
Subscripted Variables, SIGPLAN Not. 39, 4 (Apr. 2004), 328-342.
 C. Ding and P. Sweany, Improving Software Pipelining with Unroll-and-
Jam and Memory Reuse Analysis, Master’s thesis, Department of Com-
puter Science, Michigan Technological University, 1996.
 J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco, and
K. Schulten, Accelerating molecular modeling applications with graphics
processors, Journal of Computational Chemistry 28 (2007) 2618–2640.
 S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk,
and W.-m. W. Hwu, Optimization principles and application performance
evaluation of a multithreaded GPU using CUDA, in: Proceedings of the
13th ACM SIGPLAN Symposium on Principles and practice of parallel
programming, 2008, 73–82.
 K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D.
Patterson, J. Shalf and K. Yelick, Stencil computation optimization and
auto-tuning on state-of-the-art multicore architectures, in: Proceedings of
the 2008 ACM/IEEE conference on Supercomputing, 2008, 1–12.
 S.W. Williams, A. Waterman, and D.A. Patterson, Roofline: An Insightful
Visual Performance Model for Floating-Point Programs and Multicore
Architectures (Tech. Report UCB/EECS-2008-134, EECS Department,
University of California, Berkeley, 2008).
 Khronos Group, OpenCL, www.khronos.org/opencl .
 K. Asanovic, R. Bodik, J. Demmel, J. Kubiatowicz, K. Keutzer, E. Lee, G.
Necula, D. Patterson, K. Sen, J. Shalf, J. Wawrzynek, and K. Yelick, The
landscape of parallel computing research: A view from Berkeley (Tech.
Report UCB/EECS-2006-183, EECS Department, University of Califor-
nia, Berkeley, 2006).
 NVIDIA Tesla GPUs To Communicate Faster Over Mellanox In-
finiBand Networks, www.nvidia.com/object/io_1258539409179.html, re-
trieved December 2, 2009.