Commun. Comput. Phys., Vol. 15, No. 2, pp. 285-329, February 2014
doi: 10.4208/cicp.110113.010813a
REVIEW ARTICLE
A Survey on Parallel Computing and its Applications in Data-Parallel Problems Using GPU Architectures
Cristóbal A. Navarro 1,2,∗, Nancy Hitschfeld-Kahler 1 and Luis Mateu 1

1 Department of Computer Science (DCC), Universidad de Chile, Santiago, Chile.
2 Centro de Estudios Científicos (CECS), Valdivia, Chile.

Received 11 January 2013; Accepted (in revised version) 1 August 2013
Available online 10 September 2013
Abstract. Parallel computing has become an important subject in the field of computer science and has proven to be critical when researching high performance solutions. The evolution of computer architectures (multi-core and many-core) towards a higher number of cores can only confirm that parallelism is the method of choice for speeding up an algorithm. In the last decade, the graphics processing unit, or GPU, has gained an important place in the field of high performance computing (HPC) because of its low cost and massive parallel processing power. Super-computing has become, for the first time, available to anyone at the price of a desktop computer. In this paper, we survey the concept of parallel computing and especially GPU computing. Achieving efficient parallel algorithms for the GPU is not a trivial task; there are several technical restrictions that must be satisfied in order to achieve the expected performance. Some of these limitations are consequences of the underlying architecture of the GPU and the theoretical models behind it. Our goal is to present a set of theoretical and technical concepts that are often required to understand the GPU and its massive parallelism model. In particular, we show how this new technology can help the field of computational physics, especially when the problem is data-parallel. We present four examples of computational physics problems: n-body, collision detection, Potts model and cellular automata simulations. These examples well represent the kind of problems that are suitable for GPU computing. By understanding the GPU architecture and its massive parallelism programming model, one can overcome many of the technical limitations found along the way, design better GPU-based algorithms for computational physics problems and achieve speedups that can reach up to two orders of magnitude when compared to sequential implementations.
AMS subject classifications: 68W10, 65Y05, 68Q10, 68N19, 68M14, 00A79, 81Vxx
∗ Corresponding author. Email addresses: crinavar@dcc.uchile.cl (C. A. Navarro), nancy@dcc.uchile.cl (N. Hitschfeld), lmateu@dcc.uchile.cl (L. Mateu)

http://www.global-sci.com/    © 2014 Global-Science Press
Key words: GPU computing, parallel computing, computing models, algorithms, data parallel, massive parallelism, Potts model, Ising model, collision detection, n-body, cellular automata.
Contents

1 Introduction
2 Basic concepts
3 Performance measures
4 Parallel computing models
5 Parallel programming models
6 Architectures
7 Strategy for designing a parallel algorithm
8 GPU Computing
9 Examples of spatial and tiled GPU compatible problems
10 Latest advances and open problems in GPU computing
11 Discussion
1 Introduction
For some computational problems, CPU-based algorithms are not fast enough to give a solution in a reasonable amount of time. Furthermore, these problems can become even larger, to the point that not even a multi-core CPU-based algorithm is fast enough. Problems such as these can be found in science and technology: natural sciences [52, 108, 116] (Physics, Biology, Chemistry), information technologies [115] (IT), geospatial information systems [11, 66] (GIS), structural mechanics problems [12] and even abstract mathematical/computer science (CS) problems [84, 98, 101, 109]. Today, many of these problems can be solved faster and more efficiently by using massive parallel processors.

The story of how massive parallel processors were born is one of a kind, because it combines two fields that were not related at all: computational science and the video-game industry. In science, there is a constant need for solving the largest problems in a reasonable amount of time. This need has led to the construction of massively parallel super-computers for understanding phenomena such as galaxy formation, molecular dynamics and climate change, among others. On the other hand, the video-game industry is in constant need of achieving real-time photo-realistic graphics, with the major restriction of running its algorithms on consumer-level computer hardware. The need for realistic video-games led to the invention of the graphics accelerator, a small parallel processor that handled many graphical computations using hardware-implemented functions. The two needs, combined together, have given birth to one of the most important pieces of hardware for parallel computing: the GPU.
The GPU (graphics processing unit) is the prime exponent of parallel computing. It is physically small enough to fit inside a desktop machine, yet massively parallel like a small-scale super-computer, capable of handling up to thousands of threads in parallel. The GPU is indeed attractive to any scientist, because it is no longer restricted to graphics problems and offers impressive parallel performance at the cost of a desktop computer. It is not a surprise to see GPU-based algorithms achieve considerable amounts of speedup over a classic CPU-based solution [10, 30], even by two orders of magnitude [25, 78].

Some problems do not have a parallel solution [47]. For example, the approximation of $\sqrt{x}$ using the Newton-Raphson method [100] cannot be parallelized because each iteration depends on the value of the previous one; this is the issue of time dependence. Such problems do not benefit from parallelism at all and are best solved using a CPU. On the other hand, there are problems that can be naturally split into many independent sub-problems; e.g., matrix multiplication can be split into several independent multiply-add computations. Such problems are massively parallel, they are very common in computational physics and they are best solved using a GPU. In some cases, these problems become so parallelizable that they receive the name embarrassingly parallel† or pleasingly parallel [87, 97].
One of the most important aspects of parallel computing is its close relation to the underlying hardware and programming models. Typical questions in the field are: What type of problem am I dealing with? Should I use a CPU or a GPU? Is it a MIMD or SIMD architecture? Is it a distributed or shared memory system? What should I use: OpenMP, MPI, CUDA or OpenCL? Why is the performance not what I had expected? Should I use a hierarchical partition? How can I design a parallel algorithm? These questions are indeed important when searching for a high performance solution, and their answers lie in the areas of algorithms, computer architectures, computing models and programming models. GPU computing also brings up additional challenges such as manual cache usage, parallel memory access patterns, communication, thread mapping and synchronization, among others. These challenges are critical for implementing an efficient GPU algorithm.
This paper is a comprehensive survey of basic and advanced topics that are often
required to understand parallel computing and especially GPU computing. The main
contribution is the presentation of massive parallel architectures as a useful technology
for solving computational physics problems. As a result, the reader should become more
confident in the fundamental and technical aspects of GPU computing, with a clear idea
of what types of problems are best suited for GPU computing.
We organized the rest of the sections in the following way: fundamental concepts of parallel computing and theoretical background are presented first, such as basic definitions, performance measures, computing models, programming models and architectures (from Section 2 to Section 6). Advanced concepts of GPU computing start from Section 7 and cover strategies for designing massive parallel algorithms, the massive parallelism programming model and its technical restrictions. We describe four examples of computational physics problems that have been solved using GPU-based algorithms, giving an idea of what types of problems are ideal to be solved on a GPU. Finally, Section 10 is dedicated to the latest advances in the field. We chose this organization because it provides the reader with the needed background on parallel computing, making the GPU computing sections easier to understand.

† The term embarrassingly parallel means that it would be embarrassing not to take advantage of such parallelization. However, in some cases the term has been taken as being embarrassed to make such a parallelization; this meaning is unwanted. An alternative name is pleasingly parallel.
2 Basic concepts
The terms concurrency and parallelism are often debated by the computer science community, and sometimes it has become unclear what the difference is between the two, leading to misunderstandings of very fundamental concepts. Both terms are frequently used in the field of HPC and their difference must be made clear before discussing more advanced concepts along the survey. The following definitions of concurrency and parallelism are consistent and considered correct [14]:
Definition 2.1. Concurrency is a property of a program (at design level) where two or more tasks can be in progress simultaneously.

Definition 2.2. Parallelism is a run-time property where two or more tasks are being executed simultaneously.
There is a difference between being in progress and being executed, since the first one does not necessarily involve being in execution. Let C and P be concurrency and parallelism, respectively; then $P \subset C$. In other words, parallelism requires concurrency, but concurrency does not require parallelism. A nice example where both concepts come into play is the operating system (OS): it is concurrent by design (it performs multi-tasking so that many tasks are in progress at a given time) and, depending on the number of physical processing units, these tasks can run in parallel or not. With these concepts clear, we can now make a simple definition of parallel computing:

Definition 2.3. Parallel computing is the act of solving a problem of size n by dividing its domain into $k \ge 2$ (with $k \in \mathbb{N}$) parts and solving them with p physical processors, simultaneously.
Being able to identify the type of problem is essential in the formulation of a parallel algorithm. Let $P_D$ be a problem with domain $D$. If $P_D$ is parallelizable, then $D$ can be decomposed into $k$ sub-problems:

$$D = d_1 + d_2 + \cdots + d_k = \sum_{i=1}^{k} d_i. \tag{2.1}$$
$P_D$ is a data-parallel problem if $D$ is composed of data elements and solving the problem requires applying a kernel function $f(\cdots)$ to the whole domain:

$$f(D) = f(d_1) + f(d_2) + \cdots + f(d_k) = \sum_{i=1}^{k} f(d_i). \tag{2.2}$$
$P_D$ is a task-parallel problem if $D$ is composed of functions and solving the problem requires applying each function to a common stream of data $S$:

$$D(S) = d_1(S) + d_2(S) + \cdots + d_k(S) = \sum_{i=1}^{k} d_i(S). \tag{2.3}$$
Data-parallel problems are ideal candidates for the GPU. The reason is that the GPU architecture works best when all threads execute the same instructions but on different data. On the other hand, task-parallel problems are best suited for the CPU, because the CPU architecture allows different tasks to be executed on each thread. This classification scheme is critical for achieving the best partition of the problem domain, which is in fact the first step when designing a parallel algorithm. It also provides useful information when choosing the best hardware for the implementation (CPU or GPU). Computational physics problems often classify as data-parallel; therefore they are good candidates for a massive parallelization on GPU. Since the aim of this work is to provide a survey on parallel computing for computational physics, most of the explanations will be in the context of data-parallel problems.
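To make the data-parallel pattern of Eq. (2.2) concrete, the following CUDA sketch (our illustration, not code from the paper) applies a kernel function f to every element d_i of the domain, one thread per element; the squaring function and the array size are hypothetical choices.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical per-element function f(d_i): here, squaring the value.
__device__ float f(float x) { return x * x; }

// Each thread applies f to exactly one element of the domain D.
__global__ void apply_f(float *out, const float *D, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard for the last, partially filled block
        out[i] = f(D[i]);
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hD = (float*)malloc(bytes), *hOut = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) hD[i] = (float)i;

    float *dD, *dOut;
    cudaMalloc(&dD, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dD, hD, bytes, cudaMemcpyHostToDevice);

    int block = 256;
    int grid  = (n + block - 1) / block;    // enough blocks to cover the domain
    apply_f<<<grid, block>>>(dOut, dD, n);
    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);

    printf("f(D[3]) = %f\n", hOut[3]);
    cudaFree(dD); cudaFree(dOut); free(hD); free(hOut);
    return 0;
}

A task-parallel problem, in contrast, would assign a different function to each thread or group of threads, which maps more naturally to a multi-core CPU.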
3 Performance measures
Performance measures consist of a set of metrics that can be used for quantifying the quality of an algorithm. For sequential algorithms, the metrics time and space are sufficient. For parallel algorithms the scenario is a little more complicated. Apart from time and space, metrics such as speedup and efficiency are necessary for studying the quality of a parallel algorithm. Furthermore, when an algorithm cannot be completely parallelized, it is useful to have a theoretical estimate of the maximum speedup possible. In these cases, the laws of Amdahl and Gustafson become useful for such analysis. On the experimental side, metrics such as memory bandwidth and floating point operations per second (Flops) define the performance of a parallel architecture when running a parallel algorithm.

Given a problem of size n, the running time of a parallel algorithm using p processors is denoted:

$$T(n,p). \tag{3.1}$$

From the theoretical point of view, the metrics work and span define the basis for computing other metrics such as speedup and efficiency.
3.1 Work and span

The quality of a parallel algorithm can be defined by two metrics, as stated by Cormen et al. [27]: work and span. Both metrics are important because they give limits to parallel computing and introduce the notion of work. Parallel algorithms have the challenge of being fast, but also of generating the minimum amount of extra work. By doing less extra work, they become more efficient.

Work is defined as the total time needed to execute a parallel algorithm using one processor, denoted $T(n,1)$. Span is defined as the longest time needed to execute a parallel path of computation by one thread, denoted $T(n,\infty)$. Span is the equivalent of measuring time when using an infinite number of processors.

These two metrics provide lower bounds for $T(n,p)$. The work law states the first lower bound:

$$T(n,p) \ge \frac{T(n,1)}{p}. \tag{3.2}$$

That is, the running time of a parallel algorithm must be at least $1/p$ of its work. With the work law, one can realize that parallel algorithms run faster when the work per processor is balanced.

The span law defines the second lower bound for $T(n,p)$:

$$T(n,p) \ge T(n,\infty). \tag{3.3}$$

This means that the time of a parallel algorithm cannot be lower than the span, or the minimal amount of time needed by a processor in an infinite-processor machine.
3.2 Speedup

One of the most important actions in parallel computing is to measure how much faster a parallel algorithm runs with respect to the best sequential one. This measure is known as speedup.

For a problem of size n, the expression for speedup is:

$$S_p = \frac{T_s(n,1)}{T(n,p)}, \tag{3.4}$$

where $T_s(n,1)$ is the time of the best sequential algorithm (i.e., $T_s(n,1) \le T(n,1)$) and $T(n,p)$ is the time of the parallel algorithm with p processors, both solving the same problem. Speedup is upper bounded when n is fixed, because of the work law from Eq. (3.2):

$$S_p \le p. \tag{3.5}$$
If the speedup increases linearly as a function of p, then we speak of linear speedup. Linear
speedup means that the overhead of the algorithm is always in the same proportion with
its running time, for all p. In the particular case of $T(n,p) = T_s(n,1)/p$, we then speak of ideal speedup or perfect linear speedup. It is the maximum theoretical value of speedup a parallel algorithm can achieve when n is fixed. In practice, it is hard to achieve linear speedup, let alone perfect linear speedup, because memory bottlenecks and overhead increase as a function of p. What we find in practice is that most programs achieve sub-linear speedup, that is, $T(n,p) \ge T_s(n,1)/p$. Fig. 1 shows the four possible curves.

Figure 1: The four possible curves for speedup.
For the last three decades it has been debated whether super-linear speedup (i.e., $S_p > p$) is possible or just a fantasy. Super-linear speedup is an important matter in parallel computing and proving its existence would benefit computer science, since parallel machines would be literally more than the sum of their parts (Gustafson's conclusion in [50]). Smith [110] and Faber et al. [35] state that it is not possible to achieve super-linear speedup and that, if such a parallel algorithm existed, then a single-core computation of the same algorithm would be no less than p times slower (leading to linear speedup again). On the opposite side, Parkinson's work [99] on parallel efficiency proposes that super-linear speedup is sometimes possible because the single processor has loop overhead. Gustafson supports super-linear speedup and considers a more general definition of $S_p$, one as the ratio of speeds (speed = work/time) [50] and not the ratio of times as in Eq. (3.4). Gustafson concludes that the definition of work, its assumption of being constant and the assumption of fixed-size speedup as the only model are the causes for thinking of the impossibility of super-linear speedup [51].
It is important to mention that there are three different models of speedup. (1) Fixed-size speedup is the one just explained: it fixes n and varies p, and is the most popular model of speedup. (2) Scaled speedup consists of varying n and p such that the problem size per processor remains constant. Lastly, (3) fixed-time speedup consists of varying n and p such that the amount of work per processor remains constant. Throughout this survey, fixed-size speedup is assumed by default. In cases (2) and (3), speedup becomes a curve over the surface on (n, p).
If a problem cannot be completely parallelized (one of the causes of sub-linear speedup), a partial speedup expression is needed in place of Eq. (3.4). Amdahl and Gustafson each proposed an expression for computing partial speedup. These expressions are known as the laws of speedup.
3.2.1 Amdahl's law

Let c be the fraction (in a/b form or as a real number) of a program that is parallel, $(1-c)$ the fraction that runs sequentially and p the number of processors. Amdahl's law [5] states that for a fixed-size problem the expected overall speedup is given by:

$$S(p) = \frac{1}{(1-c) + \frac{c}{p}}. \tag{3.6}$$

If $p \approx \infty$, Eq. (3.6) becomes:

$$S(p) = \frac{1}{1-c}. \tag{3.7}$$

That is, if a computer has a large number of processors (i.e., a super-computer or a modern GPU), then the maximum speedup is limited by the sequential part of the algorithm (e.g., if c = 4/5 then the maximum speedup is 5x).

Amdahl's law is useful for algorithms that need to scale their performance as a function of the number of processors, fixing the problem size n. This type of scaling is known as strong scaling.
3.2.2 Gustafson's law

Gustafson's law [49] is another useful measure for theoretical performance analysis. This metric does not assume a fixed size of the problem as Amdahl's law did. Instead, it uses the fixed-time model where the work per processor is kept constant when increasing p and n. In Gustafson's law, the time of a parallel program is composed of a sequential part s and a parallel part c executed by p processors:

$$T(p) = s + c. \tag{3.8}$$

If the sequential time for all the computation is $s + cp$, then the speedup is:

$$S(p) = \frac{s + cp}{s + c} = \frac{s}{s+c} + \frac{cp}{s+c}. \tag{3.9}$$

Defining $\alpha$ as the fraction of serial computation, $\alpha = s/(s+c)$, the parallel fraction is $1-\alpha = c/(s+c)$. Finally, Eq. (3.9) becomes the fixed-time speedup $S(p)$:

$$S(p) = \alpha + p(1-\alpha) = p - \alpha(p-1). \tag{3.10}$$

Gustafson's law is important for expanding the knowledge in parallel computing and the definition of speedup. With Gustafson's law, the idea is to increase the work linearly as a function of p and n. Now the problem size is not fixed anymore; instead, the work per processor is fixed. This type of scaling is also known as weak scaling. There are many applications where the size of the problem would actually increase if more computational power were available: weather prediction, computer graphics, Monte Carlo algorithms, particle simulations, etc. Fixing the problem size and measuring time vs. p is mostly done for academic purposes. As the problem size gets larger, the parallel part may grow faster than the serial fraction α.
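The two laws are easy to evaluate numerically. The host-side sketch below (ours, not from the paper; the chosen fractions are hypothetical) prints the fixed-size speedup of Eq. (3.6) next to the fixed-time speedup of Eq. (3.10) for an increasing processor count, which makes the strong-scaling ceiling and the weak-scaling growth easy to compare.

#include <cstdio>

// Amdahl (fixed-size): c is the parallel fraction of the program.
double amdahl(double c, double p)        { return 1.0 / ((1.0 - c) + c / p); }

// Gustafson (fixed-time): alpha is the serial fraction of the program.
double gustafson(double alpha, double p) { return p - alpha * (p - 1.0); }

int main() {
    const double c     = 0.95;       // hypothetical: 95% of the work is parallel
    const double alpha = 1.0 - c;    // corresponding serial fraction
    for (double p = 1; p <= 1024; p *= 2)
        printf("p=%6.0f   Amdahl S(p)=%7.2f   Gustafson S(p)=%8.2f\n",
               p, amdahl(c, p), gustafson(alpha, p));
    return 0;
}

With c = 0.95 the fixed-size speedup saturates near 20x no matter how many processors are added, while the fixed-time speedup keeps growing almost linearly with p.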
While it is true that speedup might be one of the most important measures of parallel computing, there are also other metrics that provide additional information about the quality of a parallel algorithm, such as the efficiency.
3.3 Efficiency

If we divide Eq. (3.5) by p, we get:

$$E_p = \frac{S_p}{p} = \frac{T_s(n,1)}{p\,T(n,p)} \le 1. \tag{3.11}$$

$E_p$ is the efficiency of an algorithm using p processors and it tells how well the processors are being used. $E_p = 1$ is the maximum efficiency and means optimal usage of the computational resources. Maximum efficiency is difficult to achieve in an implemented solution (it is a consequence of the difficulty of achieving perfect linear speedup). Today, efficiency has become as important as speedup, if not more, since it measures how well the hardware is used and it tells which implementations should have priority when competing for limited resources (cluster, supercomputer, workstation).
3.4 FLOPS

The FLOPS metric represents raw arithmetic performance and is measured as the number of floating point operations per second. Let $F_h$ be the peak floating point performance of a known hardware and $F_e$ the floating point performance measured for the implementation of a given algorithm; then $F_c$ is defined as:

$$F_c = \frac{F_e}{F_h}. \tag{3.12}$$

$F_c$ tells us the efficiency of the numerical computation relative to a given hardware. A value of $F_c = 1$ means maximum hardware usage for numerical computations.
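$F_e$ can be estimated by counting the floating point operations a kernel executes and dividing by its measured run time. The sketch below (ours, not from the paper; the peak $F_h$ is a hypothetical datasheet value) uses one fused multiply-add per element, i.e., roughly 2n floating point operations per launch.

#include <cstdio>
#include <cuda_runtime.h>

// One multiply-add per element: roughly 2 floating point operations each.
__global__ void fma_kernel(float *y, const float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 24;
    float *x, *y;                     // left uninitialized; only timing matters here
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start);
    fma_kernel<<<(n + 255) / 256, 256>>>(y, x, 2.0f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double Fe = (2.0 * n) / (ms * 1e-3) / 1e9;   // measured GFlops
    double Fh = 3950.0;                          // hypothetical FP32 peak (GFlops)
    printf("Fe = %.2f GFlops, Fc = %.3f\n", Fe, Fe / Fh);

    cudaFree(x); cudaFree(y);
    return 0;
}

A streaming kernel like this one is memory-bound, so its measured $F_c$ will be far below 1; compute-intensive kernels such as matrix multiplication get much closer to the peak.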
The highest performance reported to date (March 2013) is approximately 17.5 Pflops, by the 'Titan' supercomputer from DOE/SC/Oak Ridge National Laboratory‡. There is high enthusiasm for reaching the Exaflops scale for the first time. It is believed that in the following years, with the help of GPU-based hardware, the goal of Exaflops scale will be achieved.

‡ An updated list of the 500 most powerful super-computers in the world is available at www.top500.org.
Figure 2: Comparison of CPU and GPU single precision floating point performance through the years. Image taken from Nvidia's CUDA C programming guide [93].

Today, high-end CPUs offer up to 240 GFlops (Intel Xeon E5 2690) of numerical performance while high-end GPUs offer approximately 4 TFlops (Nvidia Tesla K20X), equivalent to a super-computer from ten years ago. Due to this big difference in orders of magnitude, people from the HPC community are moving towards GPU computing. Fig. 2 shows how CPU and GPU performance has improved through the years.
3.5 Performance per Watt

In recent years, power consumption has become more important than brute speedup. Today the notion of performance per watt§ is one of the most important measures for choosing hardware and has been the subject of research [7]. The initiative to develop energy efficient hardware began as a way of doing HPC in a responsible manner. The Titan supercomputer runs at the cost of 8.2 MW, offering 2.1 GFlops/W, while a Nvidia Tesla K20X GPU offers 16.8 GFlops/W using 300 W. As systems get larger, there is a substantial loss of performance per Watt. The integration of GPUs into supercomputers has helped these systems to be more energy efficient than before.

§ An updated list of the most energy efficient super-computers is available at www.green500.org.
3.6 Memory Bandwidth

Memory bandwidth is the rate at which data can be transferred between processors and main memory. It is usually measured in GB/s. The memory efficiency $B_c$ of an implementation is computed by dividing the experimental bandwidth $B_e$ by the maximum bandwidth $B_h$ of the hardware:

$$B_c = \frac{B_e}{B_h}. \tag{3.13}$$
A value of $B_c = 1$ means that the application is using the maximum memory bandwidth available on the hardware. Current high-end CPUs have a memory bandwidth in the range 40 GB/s ≤ $B_h$ ≤ 80 GB/s, while high-end GPUs have a memory bandwidth in the range 200 GB/s ≤ $B_h$ ≤ 300 GB/s.

Achieving maximum bandwidth on the GPU is much harder than on the CPU. The main reason is that memory performance is problem-dependent. Data structures have to be aligned in simple patterns so that many chunks of data are read or written simultaneously with minimal hardware cost. Irregular data accesses, alignments and different data chunk sizes result in lower memory bandwidth. The latest GPU architectures, such as Fermi and Kepler, can mitigate this effect by using an L2 cache for global memory (see [29, 92] for more information on the GPU's L2 cache).
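A common way to estimate $B_e$ (again a sketch of ours, not code from the paper) is to time a plain copy kernel and count the bytes read plus the bytes written:

#include <cstdio>
#include <cuda_runtime.h>

// Copy kernel: each element is read once and written once, with coalesced accesses.
__global__ void copy_kernel(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

int main() {
    const int n = 1 << 26;                      // ~64M floats
    float *src, *dst;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy_kernel<<<(n + 255) / 256, 256>>>(dst, src, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double bytes = 2.0 * n * sizeof(float);     // n reads + n writes
    double Be = bytes / (ms * 1e-3) / 1e9;      // GB/s
    double Bh = 250.0;                          // hypothetical peak bandwidth (GB/s)
    printf("Be = %.1f GB/s, Bc = %.2f\n", Be, Be / Bh);

    cudaFree(src); cudaFree(dst);
    return 0;
}

Repeating the kernel several times and averaging, as discussed next, gives a more stable estimate.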
All performance measures depend on the running time $T(n,p)$ of the parallel algorithm. Measuring the mean wall-clock time with a standard error below 5% is a good practice for obtaining an experimental value of $T(n,p)$. For the theoretical case, computing $T(n,p)$ is less trivial because it will depend on the chosen parallel computing model.
4 Parallel computing models
Computing models are abstract computing machines that allow theoretical analysis of algorithms. These models simplify the computational universe to a small set of parameters that define how much time a memory access or a mathematical operation will cost. Theoretical analysis is fundamental for the process of researching new algorithms, since it can tell us which algorithm is asymptotically better. In the case of parallel computing, there are several models available: PRAM, PMH, Bulk Synchronous Parallel and LogP. Each one focuses on a subset of aspects, reducing the number of variables so that mathematical analysis is possible.
4.1 Parallel Random Access Machine (PRAM)
The parallel random access machine, or PRAM, was proposed by Fortune and Wyllie in 1978 [39]. It is inspired by the classic random access machine (RAM) and has been one of the most used models for parallel algorithm design and analysis.

In the 1990s, the PRAM model gained a reputation as an unrealistic model for algorithm design and analysis, because no computer could offer constant memory access times for simultaneous operations, let alone performance scalability. Implementations of PRAM-designed algorithms did not reflect the complexity the model was suggesting. However, in 2006, the model became relevant again with the introduction of general purpose GPU (GPGPU) computing APIs.
Figure 3: In the PRAM model each one of the cores has a complete view of the global memory.
In the PRAM model, there are p processors that operate synchronously over an unlimited memory completely visible to each processor (see Fig. 3). The parameter p does not need to be a constant number; it can also be defined as a function of the problem size n. Each r/w (read/write) operation costs O(1). Different variations of the model exist in order to make it more realistic when modeling parallel algorithms. These variations specify whether memory access can be performed concurrently or exclusively. Four variants of the model exist.
EREW, or Exclusive Read – Exclusive Write, is a variant of PRAM where all read and write operations are performed exclusively, in different places of memory for each processor. The EREW variation can be used when the problem is split into independent tasks, without requiring any sort of communication. For example, vector addition as well as matrix addition can be done with an EREW algorithm.
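As a minimal illustration (our CUDA sketch, launched with the same host setup as the example in Section 2), vector addition maps to an EREW access pattern: every thread reads and writes memory locations that no other thread touches.

// EREW-style access pattern: thread i reads a[i] and b[i] and writes c[i];
// no two threads ever touch the same memory location.
__global__ void vec_add(float *c, const float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}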
CREW, or Concurrent Read – Exclusive Write, is a variant of PRAM where processors can read from common sections of memory but always write to sections exclusive to one another. CREW algorithms are useful for problems based on tilings, where each site computation requires information from neighbor sites. Let k be the number of neighbors per site; then each site will perform at least k reads and one write operation. At the same time, each neighbor site will perform the same number of memory reads and writes. In the end, each site is read concurrently by k other sites but only modified once. This behavior is the main idea of a CREW algorithm. Algorithms for fluid dynamics, cellular automata, PDEs and N-body simulations are compatible with the CREW variation.
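A one-dimensional stencil kernel (again our sketch, host setup omitted) shows the CREW pattern: every site is read concurrently by its k = 2 neighbors but written exactly once, by the thread that owns it.

// CREW-style access pattern: locations of in[] are read concurrently by
// neighboring threads, while each out[i] is written by exactly one thread.
__global__ void stencil_1d(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;  // own site + 2 neighbors
}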
ERCW, or Exclusive Read – Concurrent Write, is a variant of PRAM where processors read from different exclusive sections of memory but write to shared locations. This variant is not as popular as the others because fewer problems benefit from an ERCW algorithm. Nevertheless, important results have been obtained for this variation. Mackenzie and Ramachandran proved that finding the maximum of n numbers has a lower bound of $\Omega(\sqrt{\log n})$ under ERCW [81], while the problem is $\Theta(\log n)$ under EREW/CREW.
CRCW, or Concurrent Read – Concurrent Write, is a variant of PRAM where processors can read and write from the same memory locations. Beame and Hastad have studied optimal solutions using CRCW algorithms [9]. Subramonian [111] presented an O(log n) algorithm for computing the minimum spanning tree.
Concurrent writes are not trivial and must use one of the following protocols:
• Common: all processors write the same value;
• Arbitrary: only one write is successful, the others are not applied;
• Priority: priority values are given to each processor (e.g., rank value), and the pro-
cessor with highest priority will be the one to write;
• Reduction: all writes are reduced by an operator (add, multiply, OR, AND, XOR).
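On GPUs, the Reduction protocol is essentially what atomic operations provide: concurrent writes to the same address are combined by an operator instead of being lost. The kernel below (our sketch; result must be zero-initialized, and an efficient implementation would first reduce within each block in shared memory) accumulates every element into a single location.

// CRCW "Reduction" write protocol in practice: many threads write to the same
// location *result and the concurrent writes are resolved by addition.
__global__ void sum_reduce(float *result, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(result, in[i]);
}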
Over the last decades, Uzi Vishkin has been one of the main supporters of the PRAM
model. He proposed an on-chip architecture based on PRAM [120] as well as the notion
of explicit Multi-threading for achieving efficient implementations of PRAM algorithms
[121].
4.2 Parallel Memory Hierarchy (PMH)

The Parallel Memory Hierarchy model, or PMH, was proposed in 1993 by Alpern et al. [4] and inspired by related works [2, 3] (the HMM and UMH memory models). This model was proposed to deal with the problems of PRAM regarding constant-time memory operations. Current CPUs (such as the Intel Xeon E5 series or AMD's Opteron 6000 series) have memory hierarchies composed of registers and L1, L2 and L3 caches. GPUs such as the Nvidia GTX 680 or AMD's Radeon HD 7850 also have a memory hierarchy composed of registers, L1 and L2 caches and the global memory. Indeed, the memory hierarchy should be considered when designing a parallel algorithm in order to match the theoretical complexity bounds.

The PMH model is defined by a hierarchical tree of memory modules. The leaves of the tree correspond to processors and the internal nodes represent memory modules. Modules closer to the processors are fast but small, and modules far from the processors are slow but larger. For the i-th level module, the following parameters are defined: $s_i$, the number of items per block (or block size); $n_i$, the number of blocks; $l_i$, the latency; and $c_i$, the child count. In practice, it is easier to model an algorithm by using the uniform parallel memory hierarchy (UPMH), which is a simplified version of the PMH model. The UPMH model defines a complete τ-ary tree (see Fig. 4).
In UPMH, composite parameters are used, such as the aspect ratio $\alpha = n_i/s_i$, the packing factor $\rho = s_i/s_{i-1}$ and the branching factor $\tau$, which is the tree arity. Additionally, the UPMH model defines the transfer cost $t_i$ as a function of the tree level: $t_i = f(i)$. Typical values of the transfer cost function are $f(i) = 1, i, \rho^i$. The function $f(i) = \rho^i$ is considered a realistic transfer cost function for modern architectures. Usually, the model is referred to as UPMH$_{\alpha,\rho,f(i),\tau}$ to indicate its four parameters. This model has proven to be more realistic than PRAM, but harder for analyzing algorithms.
Alpern et al. showed that an unblocked matrix multiplication algorithm (i.e., the basic matrix multiplication algorithm) can cost $\Omega(N^5/p)$ time instead of $O(N^3/p)$ [13] as in PRAM. In the same work, the authors prove that a parallel block-based matrix multiplication algorithm (see Fig. 5) can indeed achieve the desired $O(N^3/p)$ upper bound by reusing the data from the fastest memory modules.

Figure 4: The uniform parallel memory hierarchy tree.

Figure 5: The classic algorithm processes sticks of computation, as seen on the left side. The blocked version computes sub-cubes of the domain in parallel, taking advantage of locality.

The entire proof of the matrix multiplication algorithm for one processor can be found in the work of Alpern et al. [3]. The UPMH model can be considered a complement to other models such as PRAM or BSP.
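The following CUDA kernel (our sketch, assuming square matrices whose side N is a multiple of the tile size) is in the spirit of the blocked algorithm of Fig. 5: each thread block loads tiles of A and B into fast shared memory and reuses them, instead of streaming every operand from global memory.

#define TILE 16

// Blocked matrix multiplication: each thread block multiplies TILE x TILE tiles
// of A and B kept in shared memory, reusing every loaded element TILE times
// before going back to the slower global memory.
__global__ void matmul_tiled(float *C, const float *A, const float *B, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Cooperative load of one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

The kernel is launched with dim3 block(TILE, TILE) and dim3 grid(N/TILE, N/TILE). An unblocked version would read each element of A and B N times from global memory; the shared-memory tiles reduce that traffic by a factor of TILE.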
4.3 Bulk Synchronous Parallel (BSP)

The Bulk Synchronous Parallel model, or BSP, is a parallel computing model focused on communication, published in 1990 by Leslie Valiant [119]. Synchronization and communication are considered high priority in the cost equation. The model consists of a number of processors with fast local memory, connected through a network and capable of sending and receiving messages to and from any other processor. A BSP-based algorithm is composed of super-steps (see Fig. 6).
Figure 6: A representation of a super-step; processing, communication and a global synchronization barrier.
A super-step is a parallel block of computation composed of three steps:

• Local computation: p processors perform up to L local computations;
• Global communication: processors can send and receive data among them;
• Barrier synchronization: wait for all other processors to reach the barrier.

The cost c of a super-step using p processors is defined as:

$$c = \max_{i=1}^{p}(w_i) + g\,\max_{i=1}^{p}(h_i) + l, \tag{4.1}$$
where $w_i$ is the computation time of the i-th processor, $h_i$ is the number of messages used by the i-th processor, g is the capability of the network and l is the cost of the barrier synchronization. In practice, g and l are computed empirically and are available for each architecture as lookup values. For an algorithm composed of S super-steps, the final cost is the sum of all the super-step costs:

$$C = \sum_{i=1}^{S} c_i. \tag{4.2}$$
4.4 LogP
The LogP model was proposed in 1993 by Culler et al. [31]. Similar to BSP, it focuses on modeling the cost of communication among a set of distributed processors (i.e., a network of computers). In this model, local operations cost one unit of time but the network has latency and overhead. The following parameters are defined:
• latency (L): the latency for communicating a message containing a word (or small
number of words) from its source to its target processor;
• overhead (o): the amount of time a processor spends in communication (sending or
receiving). During this time, the processor cannot perform other operations;
• gap (g): the minimum amount of time between successive messages in a given processor;

• processors (P): the number of processors.

All parameters, except for the processor count (P), are measured in cycles. Fig. 7 illustrates the model with an example of communication with one-word messages. The LogP model is similar to BSP, with the difference that BSP uses global synchronization barriers while LogP synchronizes by pairs of processors. Another difference is that LogP considers a message overhead when sending and receiving. Choosing the right model (BSP or LogP) depends on whether global or local synchronization barriers are predominant and on whether the communication overhead is significant or not.
Figure 7: An example communication using the LogP model.
Parallel computing models are useful for analyzing the running time of a parallel algorithm as a function of n, p and other parameters specific to the chosen model. But there are also other important aspects to be considered, related to the styles of parallel programming. These styles are well explained by the parallel programming models.
5 Parallel programming models
Parallel computing should also be analyzed in terms of how processors communicate and how they are programmed. For example, PRAM and UPMH use the shared memory model, while LogP and BSP use a message passing model. These two models are actually parallel programming models. A parallel programming model¶ is an abstraction of the programmable aspects of a computing model. While computing models from Section 4 are useful for algorithm design and analysis (i.e., computing time complexity), parallel programming models are useful for the implementation of such an algorithm. The following parallel programming models are the most important because they have been implemented by modern APIs.

¶ Some of the literature may treat the concept of parallel programming model as equal to computing model. In this survey we denote a difference between the two; thus Sections 4 and 5.
5.1 Shared memory
In the shared memory model, threads can read and write asynchronously within a common memory. This programming model works naturally with the PRAM computing model and it is mostly useful for multi-core and GPU based solutions. A well known API for CPUs is the Open Multiprocessing interface, or OpenMP [18], which is based on the Unix pthreads implementation [90]. In the case of GPUs, OpenCL [65] and CUDA [93] are the most common.

Many times, a shared memory parallel algorithm needs to manage non-deterministic behavior from multiple concurrent threads (the operating system thread scheduling is considered non-deterministic). When concurrent threads read and write on the same memory locations, one must supply an explicit synchronization and control mechanism such as monitors [54], semaphores [34], atomic operations and mutexes (binary semaphores). These control primitives allow threads to lock and work on shared resources without other threads interfering, making the algorithm consistent. In some scenarios the programmer must also be aware of the shared memory consistency model. These models define the rules and the strategy used to maintain consistency on shared memory. A detailed explanation of consistency models is available in the work of Adve et al. [1].

In the case of GPUs, one can use atomic operations, synchronization barriers and memory fences [93].
5.2 Message passing
In a message passing programming model, or distributed model, processors communicate asynchronously or synchronously by sending and receiving messages containing words of data. In this model, emphasis is placed on communication and synchronization, making distributed computing the main application of the model. Dijkstra introduced many new ideas for consistent concurrency on distributed systems based on exclusion mechanisms [33]. This programming model works naturally with the BSP and LogP models, which were built with the same paradigm.

The standard interface for message passing is the Message Passing Interface, or MPI [41]. MPI is used for handling communication in CPU distributed applications and is also used to distribute the work when using multiple GPUs.
5.3 Implicit
Implicit parallelism refers to compilers or high-level tools that are capable of achieving a degree of parallelism automatically from a sequential piece of source code. The advantage of implicit parallelism is that all the hard work is done by the tool or compiler, achieving practically the same performance as a manual parallelization. The disadvantage, however, is that it only works for simple problems such as for loops with independent iterations. Kim et al. [68] describe the structure of a compiler capable of implicit and explicit parallelism. In their work, the authors address the two main problems for achieving their goal: (1) how to integrate the parallelizing preprocessor with the code generator and (2) when to generate explicit and when to generate implicit threads.

Map-Reduce [32] is a well known implicit parallel programming tool (sometimes considered a programming model itself) and has been used in frameworks such as Hadoop, with outstanding results at processing large data-sets over distributed systems. Functional languages such as Haskell or Racket also benefit from the parallel map-reduce model. In the 1990s, High Performance Fortran (HPF) [77] was a famous implicit parallel API. OpenMP can also be considered semi-implicit, since its parallelism is based on hints given by the programmer. Today, pH (parallel Haskell) is probably the first fully implicit parallel programming language [91]. Automatic parallelization is hard to achieve for algorithms not based on simple loops and has become a research topic in the last twenty years [48, 74, 79, 105].
5.4 Algorithmic skeletons
Algorithmic skeletons provide an important abstraction layer for implementing a parallel algorithm. With this abstraction, the programmer can focus more on the strategy of the algorithm rather than on the technical problems regarding parallel programming. Algorithmic skeletons, also known as parallelism patterns, were proposed by Cole in 1989 and published in 1991 [24]. This model is based on a set of available parallel computing patterns known as skeletons, implemented as higher order functions that receive other functions. The critical step when using algorithmic skeletons is to choose the right pattern for a given problem. The following patterns are some of the most important for parallel computing:
• Farm: or parallel map, is a master-slave pattern where a function $f()$ is replicated to many slaves so that slave $s_i$ applies $f(x_i)$ to sub-problem $x_i$;

• Pipeline: or function decomposition, is a staged pattern where $f()_1 \to f()_2 \to \cdots \to f()_n$ are parts of a bigger logic that works as a pipeline. Each stage of the pipeline can work in parallel;

• Parallel tasks: in this pattern, $f()_1, f()_2, \cdots, f()_n$ are different tasks to be performed in parallel. These tasks can run completely independently or can include critical sections;

• Divide and Conquer: this is a recursive pattern where a problem $A$, a divide function $d: A \to \{a_1,a_2,\cdots,a_k\}$ and a combine function $c: \{a_1,a_2,\cdots,a_k\}, f() \to f(\{a_1,a_2,\cdots,a_k\})$ are passed as parameters to the skeleton. The skeleton then applies a divide and conquer approach, spawning parallel branches of computation as the recursion tree grows:

$$r(A,d,c,f) = c(\{r(d(A)_1,d,c),\, r(d(A)_2,d,c),\, \cdots,\, r(d(A)_k,d,c)\},\, f), \tag{5.1}$$
$$r(A',d,c,f) = c(d(A'),\, f). \tag{5.2}$$

The recursion stops when the smallest sub-problems $d(A')$ are reached.
There are also basic skeleton patterns for managing while and for loops as well as conditional statements. The advantage of algorithmic skeletons is their ability to be combined or nested to make more complex patterns (because they are higher order functions). Their limitation is that the abstraction layers include an overhead cost in performance.
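On a GPU, the Farm (parallel map) pattern can be written directly as a higher order kernel. The sketch below (ours, with a hypothetical Square functor) receives the function to apply as a template parameter; host setup is as in the earlier examples.

// A Farm / parallel-map skeleton as a higher order CUDA kernel: the function
// object F is applied independently to every element of the input.
template <typename F>
__global__ void map_skeleton(F f, float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = f(in[i]);
}

// Hypothetical function object passed to the skeleton.
struct Square {
    __device__ float operator()(float x) const { return x * x; }
};

// Usage (host side): map_skeleton<<<grid, block>>>(Square(), d_out, d_in, n);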
In the previous two sections we covered computing models and programming models, which are useful for algorithm analysis and programming, respectively. It is critical, however, when implementing a high performance solution, to know how the underlying architecture actually works.
6 Architectures
Computer architectures define the way processors work, communicate and how memory is organized, all in the context of a fully working computer (note that a working computer is an implementation of an architecture). Normally, a computer architecture is well described by one or two computing models. It is important to say that the goal of computer architectures is not to implement an existing computing model. In fact, it is the other way around; computing models try to model actual computer architectures. The final goal of a computer architecture is to make a computer run programs efficiently and as fast as possible. In the past, implementations achieved higher performance automatically because the hardware industry increased the processor's frequency. At that time there were not many changes regarding the architecture. Now, computer architectures have evolved into parallel machines because the single core clock speed has reached its limit in frequency‖ [104]. Today the most important architectures are the multi-core and many-core, represented by the CPU and GPU, respectively.
Unfortunately, sequential implementations will no longer run faster by just buying better hardware. They must be re-designed as parallel algorithms that can scale their performance as more processors become available. Aspects such as the type of instruction/data streams, memory organization and processor communication indeed help in achieving a better implementation.
6.1 Flynn’s taxonomy
Computer architectures can be classified by using Flynn's taxonomy [38]. Flynn realized that all architectures can be classified into four categories. This classification depends on two aspects: the number of instruction streams and the number of data streams that can be handled in parallel. He ended up with four classifications.

SISD, or single instruction single data stream, can only perform one instruction on one data stream at a time. There is no parallelism at all. Old single core CPUs of the 1950s, based on the original Von Neumann architecture, were all SISD types. Intel processors from the 8086 to the 80486 were also SISD.

‖ Above 4.0 GHz of frequency, silicon transistors can become too hot for conventional cooling systems.
SIMD, or single instruction multiple data streams, can handle only one instruction but apply it to many data streams simultaneously. These architectures allow data parallelism, which is useful in science for applying a mathematical model to different parts of the problem domain. Vector computers in the 70's and 80's were the first to implement this architecture. Nowadays, GPUs are considered an evolved SIMD architecture because they work with many SIMD batches simultaneously. SIMD has also been supported on CPUs as instruction sets, such as MMX, 3DNow!, SSE and AVX. These instruction sets allow parallel integer or floating point operations over small arrays of data.

MISD, or multiple instruction single data stream, can handle different tasks over the same stream of data. These architectures are not so common and often end up being implemented for specific scenarios such as digital attack systems (e.g., to break a data encryption), or fault tolerance systems and space flight controllers (NASA).

MIMD, or multiple instruction multiple data streams, is the most flexible architecture. It can handle one different instruction for each data stream and can achieve any type of parallelism. However, the complexity of the physical implementation is high, and often the overhead involved in handling such different tasks and data streams becomes a problem when trying to scale with the number of cores. Modern CPUs fall into this category (Intel, AMD and ARM multi-cores) and newer GPU architectures are partially implementing this architecture. The MIMD concept can be divided into SPMD (single program multiple data) and MPMD (multiple programs multiple data). SPMD occurs when a single program is being executed on different processors. The key difference compared to SIMD is that in SPMD each processor can be at a different stage of the execution or at different paths of the code caused by conditional branching. MPMD occurs when different independent programs are being run on multiple processors.
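The following kernel (our sketch) shows the SPMD style on a GPU: every thread runs the same program, but a data-dependent branch lets threads follow different paths of the code.

// SPMD in practice: all threads execute the same program, yet conditional
// branching allows each thread to follow a different execution path.
__global__ void spmd_branch(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (data[i] < 0.0f)               // threads may diverge here ...
        data[i] = -data[i];
    else
        data[i] = sqrtf(data[i]);     // ... and execute different instructions
}

On current GPUs, threads of the same warp that take different branches are serialized, so the underlying SIMD nature still shows through at warp granularity.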
6.2 Memory architectures and organizations
There are two forms of memory organization: shared and distributed. In distributed memory, each node has its own memory architecture and is completely independent from other nodes. Communication is based on messages between nodes through a network. In a distributed memory scenario, the network plays an important role and its topology differs depending on the context. Some common topologies are bus, star, ring, mesh, hypercube and tree. Also, hybrid topologies are built based on the basic ones already mentioned.

In a shared memory organization, processors communicate through a common bank of global memory, not needing explicit messages as in a distributed memory scheme. Today, two architectures are mostly used: UMA and NUMA.

Uniform Memory Access, or UMA, consists of a shared memory in which the access for any processor takes the same amount of time no matter the data location. UMA is also known as Symmetric Multi-Processing or SMP. The main disadvantage of UMA is the low scalability when increasing the number of processors. This occurs because of the single memory controller shared by all processors.
Non-Uniform Memory Access, or NUMA, is an architecture where the access time to shared memory depends on the location of the data relative to the processor. This means that the memory that is closer to a processor is accessed much faster than memory closer to another processor (i.e., cost is a function of distance). To take advantage of NUMA, the problem must be split into independent chunks of data, each one assigned to a unique CPU. Also, global read-only data is better replicated than shared. In practice, all NUMA architectures implement a hardware cache-coherence logic and become cache-coherent NUMA, or ccNUMA.

One can find the SMP architecture in many desktop computers with dual core hardware, and the NUMA architecture in modern multicore machines with two or more CPU sockets (e.g., AMD Opteron and Intel Xeon). Fig. 8 shows the concepts of UMA and NUMA.
Figure 8: The UMA (aka SMP) and NUMA memory architectures.
Finally, shared and distributed memory architectures can also be mixed, leading to a
hybrid configuration which is useful for MPI + OpenMP or MPI + GPGPU solutions.
6.3 Technical details of modern CPU and GPU architectures

The differences between multi-core and many-core architectures can be visualized in the schematics of modern CPUs and GPUs.

Current high-end CPUs, such as the Xeon E5 2600, are built with many interconnections between the cores, providing flexibility in communication (see Fig. 9). Each core has local L1 and L2 caches of 64 KB and 256 KB, respectively, and in the center of the chip there is a bigger L3 cache of size 20 MB, shared by all cores through a ring scheme [57]. The Quick-Path Interconnect, or QPI, section of the chip (known as Hyper-Transport for AMD processors) implements part of the NUMA memory architecture. The PCI module handles communication with the PCI ports and, finally, the Integrated Memory Controller, or IMC, handles the memory accesses to its section of RAM, completing the rest of the NUMA architecture.
On the other hand, modern GPUs such as the Tesla K20X have a completely different chip schematic that is oriented to massive parallelism. Fig. 10 shows the schematic of an Nvidia Tesla K20X GPU as well as its actual chip. The cores of the GPU are grouped into SMX units, or next generation streaming multiprocessors. The most important aspects that characterize a GPU are inside the SMX units (see Fig. 11).

Figure 9: On the left, an Intel Xeon E5 2600 (2012) chip schematic, inspired from [28]. On the right, the processor die [57].

Figure 10: On the left, Nvidia's Tesla K20X GPU schematic. On the right, a picture of its chip [29].
An SMX is the smallest unit capable of performing parallel computing. The main difference between a low-end GPU and a high-end GPU of the same architecture is the number of SMX units inside the chip. In the case of Tesla K20 GPUs, each SMX unit is composed of 192 cores (represented by the C boxes). The architecture was built for a maximum of 15 SMX, giving a maximum of 2,880 cores. However, in practice some SMX are deactivated because of production issues.

The cores of an SMX are 32-bit units that can perform basic integer and single precision (FP32) floating point arithmetic. Additionally, there are 32 special function units, or SFU, that perform special mathematical operations such as log, sqrt, sin and cos, among others. Each SMX also has 64 double precision floating point units (represented as DPC boxes), known as FP64, and 32 LD/ST (load/store) units for writing and reading memory.
Figure 11: A diagram of a next-generation streaming multiprocessor (SMX). Image inspired by Nvidia's CUDA C programming guide [93].
Numerical performance of GPUs is classified into two categories: FP32 and FP64 performance. The FP32 performance is always greater than the FP64 performance. This is actually a problem for massive parallel architectures because they must spend chip surface on special units of computation to increase FP64 performance. The Tesla K20X GPU can achieve close to 4 TFlops of FP32 performance but only 1.1 TFlops in FP64 mode.

Current GPUs such as the Tesla K20 implement a four-level memory hierarchy: (1) registers, (2) L1 cache, (3) L2 cache and (4) global memory. All levels, except for the global memory, reside in the GPU chip. The L2 cache is automatic and it improves memory accesses on global memory. The L1 cache is manual, there is one per SMX, and it can be as fast as the registers. Kepler and Fermi based GPUs have L1 caches of size 64 KB that are split into 16 KB of programmable shared memory and 48 KB of automatic cache, or vice versa.
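On the programming side, this split is requested per kernel through the CUDA runtime. The snippet below is a minimal sketch with a placeholder kernel, showing how the preference is typically expressed; the driver treats it as a hint rather than a guarantee.

#include <cuda_runtime.h>

__global__ void my_kernel(float *data) { /* placeholder kernel body */ }

void configure_cache() {
    // Prefer 48KB of shared memory and 16KB of L1 for this kernel (Fermi/Kepler).
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
    // The opposite split is requested with cudaFuncCachePreferL1.
}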
6.4 The fundamental difference between CPU and GPU architectures
Modern CPUs have evolved towards parallel processing, implementing the MIMD architecture. Most of their die surface is reserved for control units and cache, leaving a small area for the numerical computations. The reason is that a CPU performs such different tasks that having advanced cache and control mechanisms is the only way to achieve an overall good performance.

On the other hand, the GPU has a SIMD-based architecture that can be well represented by the PRAM and UPMH models (Sections 4.1 and 4.2, respectively). The main goal of a GPU architecture is to achieve high performance through massive parallelism. Contrary to the CPU, the die surface of the GPU is mostly occupied by ALUs and a minimal region is reserved for control and cache (see Fig. 12). Efficient algorithms designed for GPUs have reported up to 100x speedups over CPU implementations [25, 78].
Figure 12: The GPU architecture differs from that of the CPU because its layout is dedicated to placing many small cores, giving little space for control and cache units.
This difference in architecture has a direct consequence: the GPU is much more restrictive than the CPU, but it is much more powerful if the solution is carefully designed for it. The latest GPU architectures, such as Nvidia's Fermi and Kepler, have added a significant degree of flexibility by incorporating an L2 cache for handling irregular memory accesses and by improving the performance of atomic operations. However, this flexibility is still far from the one found in CPUs.

Indeed there is a trade-off between flexibility and computing power. Current CPUs struggle to maintain a balance between computing power and general purpose functionality, while GPUs aim at massive parallel arithmetic computations, introducing many restrictions. Some of these restrictions are overcome at the implementation phase, while some others must be treated when the problem is being parallelized. It is always a good idea to follow a strategy for designing a parallel algorithm.
7 Strategy for designing a parallel algorithm
Designing a new algorithm is not a simple task. In fact, it is considered an art [70,95] that
involves a combination of mathematical background, creativity, discipline, passion and
probably other unclassifiable abilities. In parallel computing the scenario is no different;
there is no golden rule for designing perfect parallel algorithms.
There are some formal strategies that are frequently used for creating efficient parallel
algorithms. Leighton and Thomson [76] have contributed considerably to the field by
pointing out how data structures, architectures and algorithms relate when facing the act
of implementing a parallel algorithm. In 1995, Foster [40] identified a four-step strategy
that is present in many well designed parallel algorithms: partitioning, communication,
agglomeration and mapping (see Fig. 13).
7.1 Partitioning
The first step when designing a parallel algorithm is to split the problem into parallel sub-problems.
In partitioning, the goal is to find the best possible partition; one that generates
the largest number of sub-problems (at this point, communication is not considered yet).
Identifying the domain type is critical for achieving a good partition of a problem. If
the problem is data-parallel, then the data is partitioned and we speak of data-parallelism.
On the other hand, if the problem is task-parallel, then the functionality is partitioned and
we speak of task-parallelism. Most computational physics problems based on
simulations are suitable for a data-parallelism approach, while problems such as parallel
graph traversal, communication flows, traffic management, security and fault tolerance
often fall into the task-parallelism approach.
7.2 Communication
After partitioning, communication is defined between the sub-problems (task or data
type). There are two types of communication: local communication and global communication.
In local communication, sub-problems communicate with their neighbors using a certain
geometric or functional pattern. Global communication involves broadcasts, reductions or
global variables. In this phase, all types of communication problems are handled; from
race conditions handled by critical sections or atomic operations, to synchronization barriers
that ensure that the strategy of computation is working up to this point.
7.3 Agglomeration
At this point, there is a chance that the sub-problems may not generate enough work to
become a thread of computation (given a computer architecture). This aspect is often
known as the granularity of an algorithm [19]. A fine-grained algorithm divides the problem
into a massive amount of small jobs, increasing parallelism as well as communication
overhead. A coarse-grained algorithm divides the problem into fewer but larger jobs, reducing
communication overhead as well as parallelism. Agglomeration seeks to find the best
level of granularity by grouping sub-problems into larger ones. A parallel algorithm running
on a multi-core CPU should produce larger agglomerations than the same algorithm
designed for a GPU.
7.4 Mapping
Eventually, all agglomerations need to be processed by the available cores of the
computer. The distribution of agglomerations to the different cores is specified by the
mapping. The mapping step is the last one of Foster's strategy and consists of assigning
agglomerations to processors with a certain pattern. The simplest pattern is the 1-to-1
geometric mapping between agglomerations and processors, that is, to assign agglomeration
$k_i$ to processor $p_i$. Higher complexity mapping patterns lead to higher hardware
overhead and unbalanced work. The challenge is to achieve the most balanced and simple
patterns for complex problems.
Fig. 13 illustrates all four steps using a data-partition based problem on a dual core
architecture ($c_0$ and $c_1$).
Figure 13: Foster’s diagram of the design steps used in a parallelization process.
Foster's strategy is well suited for computational physics because it handles data-parallel
problems in a natural way. At the same time, Foster's strategy also works well
for designing massively parallel GPU-based algorithms. In order to apply this strategy, it
is necessary to know how the massive parallelism programming model maps the computational
resources to a data-parallel problem and how to overcome the technical restrictions
when programming the GPU.
8 GPU Computing
GPU computing is the utilization of the GPU as a general purpose unit for solving a given
problem, unrestricted to the graphical context. It is also known by its acronym: GPGPU
(General-Purpose computing on Graphics Processing Units). The goal of GPU computing
is to achieve the highest performance for data-parallel problems through a massively
parallel algorithm that runs on the GPU.
GPU computing started as a research field for computer graphics (CG) in the early
2000s and gained high importance as a general purpose parallel processor [80]. In 2001,
for the first time the graphics processing unit was built upon a programmable architecture,
permitting programmable lighting [22, 64, 67], shadow [23] and geometry [107] effects
to be computed and rendered in real-time. These effects were achieved using a high
level shading language such as GLSL (OpenGL Shading Language) [83], HLSL (High-Level
Shading Language) [94] and CG (C for Graphics) [82]. At that time, the massive
parallelism paradigm was already in the minds of the CG researchers who were designing
per-vertex and per-fragment algorithms for sets of millions of primitives. As
the years passed, the scientific community became interested in the power of GPUs and
their low cost compared to other solutions (clusters, super-computers). However, adapting
a scientific problem to a graphics environment was hard and challenging from the technical
side. In the early days, the act of adapting different kinds of problems to the GPU
was considered hacking the GPU.
In 2002, McCool et al. published a paper detailing a meta-programming GPGPU language
named Sh [85]. In 2004, Buck et al. proposed Brook for GPUs, also known as
Brook-GPU [15]. This was an extension of the C language that allowed general purpose
programming on programmable GPUs. Both Sh and Brook-GPU played a fundamental
role in expanding the idea of GPU computing by hiding the graphical context of shading
languages.
In 2006, another general purpose GPU computing API was released, this time
by Nvidia, named CUDA (Compute Unified Device Architecture) [93]. Technically,
the CUDA API is an extension of the C language and compiles general purpose code to
be executed on the GPU (based on the shared memory programming model). The release
of CUDA became an important milestone in the history of GPU computing because it
was the first API that offered effective documentation for getting started in the field. The
CUDA acronym refers to the general purpose architecture of Nvidia's GPUs [29], suitable
for GPU computing. At the moment, only Nvidia GPUs support CUDA.
In 2008, an open standard was released under the name of OpenCL (Open
Computing Language), allowing the creation of multi-platform, massively parallel code
[65]. Its programming model is similar to that of CUDA but uses different names for the
same structures. The programming model behind CUDA and OpenCL is a key aspect of
GPU computing because it defines several components that are essential for implementing
a massively parallel algorithm.
8.1 The massive parallelism programming model
The programming models explained in Section 5 are necessary but not sufficient for un-
derstanding the programming model of the GPU. There are important aspects regarding
thread and memory organization that are relevant to the implementation of a GPU-based
algorithm. This section covers these aspects.
The GPU programming model is characterized by its high level of parallelism, thus
the name massive parallelism programming model. This model is an abstract layer that lies
on top of the GPU's architecture. It allows the design of massively parallel algorithms
independent of how many physical processing units are available or how the execution
order of threads is scheduled.
The abstraction is achieved by the space of computation, defined as a discrete space
where a massive amount of threads is organized. In CUDA, the space of computation
is composed of a grid, blocks and threads. In OpenCL, these are the work-space, work-group and
work-item, respectively. A grid is a discrete k-dimensional (with k = 1,2,3) box type structure
that defines the size and volume of the space of computation. Each element of the
grid is a block. Blocks are smaller $k'$-dimensional (with $k' = 1,2,3$) structures identified
by their coordinate relative to the grid. Each block contains many spatially organized
threads. Finally, each thread has a coordinate relative to the block to which it belongs.
This coordinate system characterizes the space of computation and serves to map the
threads to the different locations of the problem. Fig. 14 illustrates an example of a two-dimensional
space of computation. Each block has access to a small local memory; in
CUDA it is known as the shared memory (in OpenCL it is known simply as the local memory).
In practical terms, the shared memory works as a manual cache. It is important to make
good use of this fast memory in order to achieve the peak performance of the GPU.
Figure 14: The massive parallelism programming model presented as a 2D model including grid, blocks and threads.
Image inspired by Nvidia's CUDA C programming guide [93].
The programming work-flow of GPU computing is viewed as a host-device relationship
between the CPU and GPU, respectively. A host program (e.g., a C program) uploads
the problem into the device (GPU memory), and then invokes a kernel (a function written
to run on the GPU), passing as parameters the grid and block sizes. The host program can
work in a synchronous or asynchronous manner, depending on whether the result from the GPU is
needed for the next step of the computation or not. When the kernel has finished on the GPU,
the result data is copied back from device to host. Fig. 15 summarizes this work-flow.
Figure 15: The GPU’s main function, named kernel, is invoked from the CPU host code.
8.2 Thread managing and GPU concurrency
Actual GPUs manage threads in small groups that work in SIMD mode . For AMD GPUs,
these groups are known as wavefronts and its size is 64 threads for their actual GCN
(graphics core ne x t) architecture. For Nvidia GPUs, these groups are known as warps
and the actual architectures such as Fermi and Kepler work with a size of 32 threads. The
OpenCL standard uses a more descriptive name; SIMD width. For simplicity reasons, we
will refer to these groups as warps.
Both AMD's and Nvidia's GPUs support some degree of concurrency for handling
the entire space of computation. Most of the time, there will be more threads than what
can really be processed in parallel. While all threads are in progress (concurrency), only
a subset is really working in parallel. The maximum number of parallel threads running
on a GPU normally corresponds to the number of processing units. However, the
maximum number of concurrent threads is much higher. For example, the Geforce GTX
580 GPU can process up to 512 threads in parallel, but can handle up to 24,576 concurrent
threads. For most problems, it is recommended to oversubscribe the parallel computing
capacity. The reason is that the GPU's thread scheduler is smart enough to switch idle
warps (i.e., warps that are waiting for a memory access or a special function unit result, such
as sqrt()) with new ones ready for computation. In other words, there is a small pipeline
of numerical computation and memory accesses that the scheduler tries to keep busy
all the time.
8.3 Technical considerations for a GPU implementation
The GPU computing community frequently uses the terms coalesced memory, thread coarsening,
padding and branching. These are the most critical technical considerations
that must be taken into account in order to achieve the best performance on the
GPU.
Coalesced memory refers to a desired scenario where consecutive threads access consecutive
data chunks of 4, 8 or 16 bytes. When this access pattern is achieved, memory
bandwidth increases, making the implementation more efficient. In every other case,
memory performance suffers a penalty. Many algorithms require irregular access patterns
with crossed relations between the chunks of data and the threads. These algorithms
are the hardest to optimize for the GPU and are considered great challenges in HPC research
[95].
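The following pair of kernels sketches the difference (a simplified example; to be compiled as part of a .cu file and launched as in the previous listing):

// Coalesced: consecutive threads read consecutive 4-byte elements.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: consecutive threads read elements separated by a stride, so
// each warp touches many memory segments and the effective bandwidth drops.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];
}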
Thread coarsening is the act of reducing the fine-grained scheme used in a solution
by increasing the work per thread. As a result, the number of registers used per block increases,
allowing computations saved in registers to be reused. Choosing the right amount
of work per thread normally requires some experimental tuning.
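One common way to coarsen a kernel is a grid-stride loop, where each thread processes several elements and the launch uses fewer blocks than elements; the amount of work per thread is then tuned by choosing the number of blocks launched (a simplified sketch, not a prescription):

// Fine-grained: one element per thread.
__global__ void axpyFine(const float *x, float *y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Coarsened: each thread iterates with a grid-wide stride, keeping its state in
// registers and amortizing the index arithmetic over several elements.
__global__ void axpyCoarse(const float *x, float *y, float a, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}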
Padding is the act of adjusting the problem size on each dimension, $n_d$, into one that
is a multiple of the block size; $n'_d = \rho\lceil n_d/\rho\rceil$ (where $\rho$ is the number of threads per block
per dimension), so that the problem now fits tightly in the grid (the block size per dimension
is a multiple of the warp size). An important requirement is that the extra dummy
data must not affect the original result. With padding, one can avoid putting conditional
statements in the kernel that would lead to unnecessary branching.
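A host-side fragment illustrating the padded sizes (the variable names are illustrative assumptions):

// rho: threads per block per dimension; n: original problem size per dimension.
int rho   = 16;
int n     = 1000;
int n_pad = rho * ((n + rho - 1) / rho);   // n' = rho * ceil(n / rho) = 1008

// The padded grid covers the domain exactly, so the kernel needs no boundary
// conditionals, provided the extra dummy cells cannot affect the result.
dim3 block(rho, rho);
dim3 grid(n_pad / rho, n_pad / rho);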
Branching is an effect caused when conditional statements in the kernel code lead to
sequential execution of the if and else parts. It has a negative impact on performance and
should be avoided whenever possible. Branching occurs because all
threads within a warp execute in lock-step mode and will run completely in parallel
only if they follow the same execution path in the kernel code (SIMD computation). If
any conditional statement breaks the execution into two or more paths, then the paths
are executed sequentially. Conditionals can be safely used if one can guarantee that the
whole warp will follow the same execution path. Additionally, operations such
as clamp, min, max, modulo and bit-shifts are hardware implemented, cause no branching
and can be used to avoid simple conditionals.
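For example, a clamp written with conditionals can be replaced by the hardware functions fminf/fmaxf (a simplified sketch):

// Branching version: threads of the same warp may take different paths.
__global__ void clampBranch(float *a, int n, float lo, float hi) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (a[i] < lo)      a[i] = lo;
        else if (a[i] > hi) a[i] = hi;
    }
}

// Branch-free version: fminf/fmaxf map to hardware instructions, so all
// threads of the warp follow the same execution path.
__global__ void clampNoBranch(float *a, int n, float lo, float hi) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = fminf(fmaxf(a[i], lo), hi);
}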
9 Examples of spatial and tiled GPU compatible problems
The last three sections have covered different aspects of parallel computing with special
emphasis on GPU computing. In several instances we have mentioned that computational
physics is a field that can benefit greatly from GPU-based algorithms because
many of its problems are data-parallel. The following subsections describe four examples
of computational physics problems that benefit from GPU computing because of
their data-parallelism.
9.1 The n-body problem
The n-body problem is an interaction problem where $n$ bodies are affected by gravity in
a $k$-dimensional space (usually $k=1,2,3$). (Although other, non-astrophysical problems can
also be solved with an n-body simulation, here we refer to the astrophysical n-body problem.)
The problem is relevant for this survey because
it is one of the most emblematic data-parallel problems found in computational physics
[58], and thus an ideal candidate for GPU computing. Additionally, some of the algorithmic
challenges present in the n-body problem, such as partitioning space, are generic to
every spatial interaction problem.
The n-body problem states that for each particle $q_i$, its force $F_i$ is given by its interactions
with the other $n-1$ particles:
$$F_i = G\sum_{k\neq i}\frac{m_i m_k (q_k - q_i)}{|q_k - q_i|^3}, \qquad i = 1,\cdots,n. \qquad (9.1)$$
The gravitational constant is $G = 6.67\times 10^{-11}\,\mathrm{N\,(m/kg)^2}$.
Forces can be computed simultaneously for all particles. The trick is to save the resulting
positions in a different array so that no synchronization is needed. The naive solution
to this problem requires an $O(n^2)$ algorithm; for each particle, sum up the contributions
of the other $n-1$. A GPU solution for this algorithm can indeed be achieved by splitting
the problem domain into one job per particle, resulting in an $O(n^2/p)$ algorithm (where $p$
is the amount of parallelism provided by the GPU). The book GPU Gems 3 includes a
chapter on the GPU-based implementation of the $O(n^2)$ algorithm [89]. But GPU algorithms
should not stop at the simplest parallelization. As in the sequential case, GPU
algorithms can also improve by using more advanced data structures such as trees.
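As an illustration of the one-job-per-particle partition (a simplified sketch, not the implementation of [89]; the float4 layout with the mass in the w component and the softening term EPS are assumptions), a direct $O(n^2)$ force kernel can be written as:

#include <cuda_runtime.h>

#define EPS 1e-9f   // softening term to avoid division by zero (assumption)

// One thread per particle i: accumulate the contributions of the other n-1
// bodies following Eq. (9.1). Positions are float4 (x, y, z, mass); the result
// is written to a separate array so no synchronization is needed.
__global__ void nbodyForces(const float4 *pos, float3 *force, float G, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float fx = 0.0f, fy = 0.0f, fz = 0.0f;
    for (int k = 0; k < n; k++) {
        if (k == i) continue;
        float4 pk = pos[k];
        float dx = pk.x - pi.x, dy = pk.y - pi.y, dz = pk.z - pi.z;
        float d2 = dx*dx + dy*dy + dz*dz + EPS;
        float s  = G * pi.w * pk.w * rsqrtf(d2) / d2;  // G m_i m_k / |q_k - q_i|^3
        fx += s * dx; fy += s * dy; fz += s * dz;
    }
    force[i].x = fx; force[i].y = fy; force[i].z = fz;
}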
A faster hierarchical algorithm can be achieved by sacrificing some numerical precision.
In Eq. (9.1), one can see that the contributions in the summation decrease quadratically
as the distance from $q_k$ to $q_i$ increases. As a result, particles very far from $q_i$ make
no significant contribution to $F_i$. On the other hand, particles close to $q_i$ make
most of the contribution. This analysis of the significance of contributions as a function
of distance is the key to designing a faster n-body algorithm.
The Barnes-Hut tree-code algorithm [6] is a well known spatial partitioning solution
that achieves $O(n\log_k n)$ average time per time step. It uses a $2^k$-tree, known as a quad-tree
for $k=2$ and an oct-tree for $k=3$. The tree data structure is used for storing average
measures for clusters of points far from the reference point. The bigger the distance from
the reference point, the bigger the cluster (see Fig. 16). The algorithm works as follows:
each internal node of the tree structure contains $2^k$ children, its position relative to the center of the
contained sub-space and an average measure of the contained particles. Computing the
interaction of a particle with position $\vec{p}_i$ requires traversing the tree starting from the
root. At each internal node with average position $n_j$, the following value is computed:
$$c = \frac{s}{|p_i - n_j|}, \qquad (9.2)$$
where $s$ is the length of the sub-region contained by node $n_j$. If $c > \theta$, the node is
too close relative to its size and it is necessary to look into the children of $n_j$ recursively. If $c\leq\theta$,
all the particles below $n_j$ are far enough away and the average contribution stored at $n_j$ is used, stopping
there. The value $\theta\in[0,1]$ corresponds to the accuracy of the simulation. Lower values of $\theta$
lead to slower but more precise simulations, and higher values of $\theta$ lead to faster but less accurate
simulations.
Figure 16: A galaxy simulation (left) and its computed quad-tree structure (right).
A parallelization of the algorithm is done by first building the tree and then assigning
$m$ bodies to each of the $p$ processors such that $n=mp$. Each parallel computation uses the
tree as read-only data to obtain the total force applied to it (a CREW algorithm).
The cost of the algorithm under the PRAM model is $O((n\log_k n)/p) = O(m\log_k n)$. In
the best possible case, when $p = n$ processors, it becomes $O(\log_k n)$. The solution of Bédorf et al. [10]
implements all parts of the algorithm on the GPU, achieving up to 20x speedup compared
to a sequential execution. The n-body problem has also been solved with hierarchical
algorithms running on supercomputers with thousands of GPUs [126]. The authors do
not report any speedup value; however, they report up to 1.0 Pflops with an efficiency of
74% using 4096 GPUs.
9.2 Collision detection
At first glance, collision detection can be seen as equivalent to the n-body problem, but
it is not. In a collision detection problem, the goal is to detect pairs of collisions rather
than computing a magnitude for each body as in the n-body problem. This problem is important for
computational physics because it is fundamental for the simulation of rigid bodies [53].
Performing collision detection on the GPU brings the opportunity to achieve real-time
simulation and visualization for many interactive problems that are based on collisions.
The collision detection problem is defined as: given n bodies at a time step t, compute
all collision pairs where bodies intersect, then perform an action in step t+1 for each colliding
object. Similar to the n-body problem, a brute force $O(n^2)$ algorithm is sufficient but not
necessary. Practical cases show that bodies collide with their closest neighbors before
anything else; this opens the opportunity for faster algorithms.
The kd-tree is a spatial binary tree that recursively partitions a k-dimensional space in two parts.
Kd-trees do not partition around a point as the $2^k$-tree used in the n-body
simulation does; instead, they partition along a plane, forcing the tree to remain binary at
every level. A partition can also be non-centered, see Fig. 17.
Figure 17: Building a 2D kd-tree. Partitions can be non-centered for better balance.
A search costs $O(\log n)$ time, an improvement over the brute-force $O(n)$ search. Before any algorithm begins,
the kd-tree is built in $O(n\log n)$ time using a linear-time median-finding algorithm [27].
Then, the problem can be solved with a parallel divide and conquer strategy: start with a
single thread at the root of the kd-tree and recursively spawn two threads that will access
the two children of the tree. Since we are in the massive parallelism programming model,
such an amount of threads is indeed possible and will not become a problem. When a
thread hits a leaf of the tree, it checks the collisions inside that region of space. Since
that region has a constant number of elements (e.g., fewer than 5), the computation
costs $O(1)$. The cost of an algorithm of this type is theoretically $O(\log n/p)$ because
all branches are computed in parallel using $p$ processors. If $p\approx n$, then $T(n,p)\approx O(\log n)$.
This problem can also be solved with a $2^k$-tree in a similar way as the n-body problem.
Today, parallel divide and conquer algorithms can be implemented fully in modern
GPU architectures like the GK110 from Nvidia. A survey on collision detection methods is
available in [60] and GPU implementations in [69] and [96].
9.3 Probabilistic Potts model simulations
The Potts model is a spin based model and one of the most important in statistical mechanics
[125]. It allows the study of phase transitions for different lattice topologies. Exact
methods are characterized by being NP-hard [123] and intractable. Instead, Monte Carlo
based simulations are often used because of their polynomial time.
The most popular Monte Carlo algorithm for the Potts model is the Metropolis algorithm
[86]. The main idea is that each spin uses only neighborhood information (no global
data), making each computation independent and fully parallelizable. The Metropolis
algorithm is relevant for GPU computing because it introduces the concept of stencil computation.
A stencil is a fixed pattern defined along the problem domain, in this case the
lattice. Stencil computation means applying a kernel to each element of the stencil using
just the neighborhood information. Stencil computations are data-parallel problems, and
therefore, good candidates for a massively parallel solution on the GPU.
When parallelizing the Metropolis algorithm, the partition is done at the particle level,
that is, every particle is an independent job. Communication is only local: each spin reads from
its neighbor particles. Agglomeration will be high for a CPU simulation and low for a GPU-based
one. The mapping is 1-to-1 for a square lattice because each agglomeration is assigned
to a different thread. The theoretical complexity of such a parallel algorithm under
the PRAM model using the CREW variation is $O(n/p)$ per step.
The algorithm starts with a random initial state $\sigma^{t=0}$ for the lattice and defines a transition
from $\sigma^{t}$ to $\sigma^{t+1}$ as follows: for each vertex $v_i\in V$ with spin value $s_i\in[1\cdots q]$, compute
the present and future energies $h(s_i^t)$ and $h(r)$, respectively (with $r = rand(0\cdots q)$
and $r\neq s_i^t$). The spin energy is computed by summing up the contributions from the
neighborhood $\langle s_i\rangle$:
$$h(s_i) = -J\sum_{s_k\in\langle s_i\rangle}\delta_{s_k,s_i}. \qquad (9.3)$$
Once $h(s_i^t)$ and $h(r)$ are computed, the new value $s_i^{t+1}$ is chosen with a probability based
on $\Delta h = h(r)-h(s_i^t)$:
$$P(s_i^{t+1}\Rightarrow r) = \begin{cases} 1 & \text{if } \Delta h \leq 0,\\ e^{-\frac{\Delta h}{\kappa T}} & \text{if } \Delta h > 0. \end{cases} \qquad (9.4)$$
When $\Delta h\leq 0$, the new value of the spin becomes the random value $r$ with full probability
of success. If $\Delta h > 0$, the new value of the spin may become $r$ with probability $e^{-\frac{\Delta h}{\kappa T}}$ (with
$\kappa$ being the Boltzmann constant); otherwise it remains the same (see Fig. 18).
Figure 18: On the left, the probability function. On the right, an example of an evolved lattice using q = 2 spin
states per site [21].
The simulation stops when the lattice reaches thermodynamic equilibrium. In practice,
equilibrium occurs when $\sigma^{t+1}$ is similar to $\sigma^{t}$.
Achieving massively parallel Potts model simulations using GPUs is currently a research
topic [102,117]. Ferrero et al. have proposed a GPU-based version of the Metropolis
algorithm [37].
Apart from the Metropolis algorithm, there also exist cluster based algorithms for the
Potts model, such as the multi-cluster and single-cluster algorithms proposed by Swendsen
et al. [113] and Wolff [124], respectively. These algorithms are not local; instead, they
generate random clusters of spins identified by a common label (in the Wolff algorithm, it
is just one cluster). The idea is to flip the clusters to a new random state. The advantage
of cluster algorithms is that they do not suffer from auto-correlations and converge
much faster to equilibrium than Metropolis. However, they present additional challenges
for parallelization. Komura et al. have proposed GPU based versions of such
algorithms [71, 72].
9.4 Cellular Automata simulation
Cellular Automata (CA) were first formulated by John von Neumann, Konrad
Zuse and Stanislaw Ulam in the 1940s (see [88, 122]). CA are important for GPU
computing because they are inherently data-parallel dynamic systems based on stencil
computations. CA are used for the simulation of galaxy formation, typhoon propagation,
fluid dynamics [61,118], biological dynamics [45], social interactions, and theoretical parallel
machines. In general, CA can simulate any problem that can be reduced to a system
of cells interacting under a local rule. It has been proven that even the simplest CA are
capable of being Turing complete machines [26].
CA are defined as a dynamical discrete space, where each cell $c_i^t$ has one of $k$ possible
states at time $t$. The computation of the next state $c_i^{t+1}$ is given by:
$$c_i^{t+1} = f(\langle c_i\rangle^t), \qquad (9.5)$$
where $f(\langle c_i\rangle^t)$ is the local transition function or local rule applied to the neighborhood
$\langle c_i\rangle$. The challenge of CA research is to discover new rules that could simulate events
of our universe and help in the quest of understanding how complex and chaotic systems
work.
In a CA simulation, every cell can compute its next state for time step t+1 independently,
using only the information from its neighborhood (including itself) at time step
t. It is important to note that during the simulation some cells are in a quiescent state, which
means that they will remain in that same state as long as the neighborhood is in a quiescent
state too (in some CA this is considered the "dead" state).
A simple parallel algorithm can assign $p$ processors to the $n$ different automata
and perform one step of simulation in a theoretical time of $O(n/p)$ under the PRAM model
using the CREW variation. In the ideal case, with $p\approx n$, one step of evolution takes
$O(1)$ time. Fig. 19 shows a time-space evolution of a well known one-dimensional CA,
rule 161, as well as a snapshot of John Conway's game of life [42] in its present state. Both
images were taken from a GPU application implemented by the authors.
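As an example of a stencil-based CA step (a minimal sketch for Conway's game of life on a periodic L x L grid of 0/1 cells; this is not the authors' application), each thread reads its Moore neighborhood at time t and writes the state for t+1 into a second lattice:

#include <cuda_runtime.h>

// One step of the game of life: cur holds time t, next receives time t+1.
__global__ void lifeStep(const unsigned char *cur, unsigned char *next, int L) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= L || y >= L) return;

    int alive = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            if (dx == 0 && dy == 0) continue;
            int nx = (x + dx + L) % L;               // periodic boundaries
            int ny = (y + dy + L) % L;
            alive += cur[ny * L + nx];
        }

    unsigned char c = cur[y * L + x];
    next[y * L + x] = (alive == 3 || (c && alive == 2)) ? 1 : 0;
}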
An efficient algorithm is one that only processes non-quiescent cells; otherwise there
is a substantial waste of computation per time step. Processing $b$ non-quiescent cells
with $p$ processors costs at least $\Omega(b/p)$. In many CA, the space is much bigger than the number
of non-quiescent cells, therefore $b\ll n$. For these cases, the space can be treated as a
sparse matrix. However, adapting such a strategy to a massively parallel
solution is not simple, and problems such as access patterns and neighbor data layout
must be solved.
Related work has focused on solving the main challenges for efficient GPU implementations
and comparing performance against multicore solutions (CPU) [59, 106].
Figure 19: On the left, the time-space of the elementary CA 161. On the right, John Conway's game of life.
Tran et al. have identified the main challenges of Cellular Automata simulation on the
GPU [59]. Another interesting topic is how to efficiently simulate CA working on other topologies
such as hexagonal or triangular ones [8,43], where the GPU's space of computation does not
match the problem domain exactly. Furthermore, doing CA simulations on non-Euclidean
topologies is an even more complex problem [130].
10 Latest advances and open problems in GPU computing
In the field of computational physics, $O(n)$ cost algorithms (the fast multi-pole expansion
method [127–129]) have been implemented on the GPU for n-body simulations. Single
GPU implementations have been proposed for achieving high performance Potts model
simulations [116, 117], even with biological applications [20]. Multi-GPU based implementations
have also been proposed for the Potts model [72] and for n-body simulations
[126]. In a multi-GPU scenario, two levels of parallelism are used: distributed and
local. Distributed parallelism is in charge of doing a coarse grained partition of the problem,
mapping the sub-problems to compute nodes and handling the communication across the
super-computer or cluster. Local parallelism is in charge of solving each sub-problem independently
with a single GPU. Multi-GPU based algorithms have the advantage of computing
solutions to large scale problems that cannot fit in a single machine's memory. The
main challenge for multi-GPU methods is to achieve efficient distributed parallelism (e.g.,
hiding the data communication cost by overlapping communication with computation).
Cellular Automata are now being used as a model for fast parallel simulation of physical
phenomena, traffic simulation and image segmentation, among others [36,44,63,73].
In the field of computer graphics, new algorithms have been proposed for building kd-trees
or oct-trees on the GPU to achieve real-time ray-tracing [56, 62, 131], as well as real-time
methods for 3D reconstruction and level set segmentation [46, 103]. The field of programming
languages has contributed to parallel computing by proposing high level
parallel languages (i.e., to abstract the programmer so that the job
of partitioning, communication, agglomeration and mapping is part of the compiler or
framework [17, 112]). Tools for automatically converting CPU code into GPU code are
now becoming popular [55] and useful for fields that use parallelism as a high level tool
and not as a goal of their research. On a more theoretical level, a new GPU-based computational
model has also been proposed: the K-model [16], which serves for analyzing
GPU-based algorithms.
Architectural advances in parallel computing have focused on combining the best of
the CPU and GPU worlds. Parallel GPU architectures are now making massive
parallelism possible by using thousands of cores, but also with flexible work-flows, access patterns
and efficient cache predictions. The latest GPU architectures have included dynamic
parallelism [29]; a feature that makes it possible for the GPU to schedule additional
work for itself by using a command processor, without needing to send data back and
forth between host and device. This means that recursive hierarchical partition of the domain
will be possible on the fly, without needing the CPU to control each step. Lastly, one
of the most important revolutions in computer architecture (affecting parallel computing
directly) is the Hybrid Memory Cube (HMC) project [114]. HMC is a three-dimensional
memory architecture that promises 15x better performance than DDR3 memory, requiring
70% less energy per bit.
There are still open problems in GPU computing. Most of them exist because of the
present limitations of the massive parallelism model. In a parallel SIMD architecture, some
data structures do not work so well. Tree implementations have been partially solved for
the GPU, but data structures such as classic dynamic arrays, heaps, hash tables and complex
graphs are not performance-friendly yet and need research for efficient GPU usage. Another
problem is the fact that some sequential algorithms are so complex that porting
them to a parallel version will lead to no improvement at all. In these cases, a complete
redesign of the algorithm must be done. The last open problem we have identified is
the act of mapping the space of computation (i.e., the grid of blocks) to different kinds
of problem domains (i.e., geometries). A naive space of computation can always build
a bounding box around the domain and discard the non-useful blocks of computation.
Non-Euclidean geometry is a special case that illustrates this problem; finding an efficient
map from each block of the grid to the fractal problem domain is not trivial. The only way
of solving this problem is to find an efficient mapping function from the space of computation
to the problem domain, or to modify the problem domain so that it becomes similar
to a Euclidean box. With dynamic parallelism, it will be possible for the first time to build
fractal spaces of computation for non-Euclidean problems, increasing efficiency.
11 Discussion
Over the past 40 years, parallel computing has evolved significantly from being a matter
of highly equipped data centers and supercomputers to almost every electronic device
that uses a CPU or GPU. Today, the field of parallel computing is having one of its best
moments in the history of computing, and its importance will only grow as long as computer
architectures keep evolving towards a higher number of processors. Speedup and efficiency are
the most important measures of a parallel solution and will continue to be in the following
years, especially efficiency, now that power consumption is a serious matter for every
branch of science. However, to achieve such desired levels of performance, implementations
need to be aware of the underlying architecture. The PRAM model is one of the
most popular theoretical models, but it lacks consideration of the different memory
hierarchies that are present in every computer architecture today. When using the shared
memory programming model, it is important to combine the PMH and PRAM models so
that the implementation behaves like the theoretical bounds. The BSP and LogP models
are useful when designing distributed algorithms where communication is a critical
aspect. As with PMH and PRAM, BSP and LogP can be combined for better modeling
of a distributed algorithm. That is, one can include message overheads, gaps and global
synchronization barriers all in one problem.
Automatic parallelization is now supported by some high level languages such as pH.
Also, modern APIs can parallelize loop sections of programs by just adding a directive in
the program code (e.g., OpenMP). This trend will continue to evolve, each time including
more and more patterns in the implicit parallelization programming model, so that in the
end only complex problems of science and engineering will need manually optimized
parallelization.
GPU computing is a young branch of parallel computing that is growing every year.
People from the HPC community have realized that many problems can be solved with
energy-efficient hardware using the massive parallelism paradigm. For some problems,
GPU implementations have reported impressive speedups of up to 100x; results accepted
by most, though questioned by Lee et al. [75]. The cost of achieving such extreme performance
is a complex re-design. For some problems, a port from the sequential version
is not enough, and a re-design of the algorithm is needed in order to use the massive
parallel capabilities of the GPU architecture. In general, data-parallel problems are well
suited for a massively parallel execution on the GPU; that is, problems where the domain can
be split into several independent sub-problems. The split process can be an arbitrary distribution
or a recursive divide and conquer strategy. Time dependent problems with little
work to be done at each time step are not well suited for GPU computing and will lead
to inefficient performance. Examples are the computation of small dynamical systems,
numerical series, iterative algorithms with little work per iteration, and graph traversal
when the structure has only a few connections.
It is important to note how natural-scientific and human-type problems differ in
the light of parallel computing. Science problems benefit from parallelism by using a
data-parallel approach without sacrificing complexity as a result. This is because many
physical phenomena follow the idea of a dynamic system where simple rules can exhibit
complex behavior in the long term. On the other side, human-type problems are similar
to a complex graph structure with many connections and synchronization problems.
Such a level of complexity in communication makes parallel algorithms perform slowly.
There is still much work to be done in the field of parallel computing. The challenge
for massive parallel architectures in the following years is to become more flexible
and energy efficient. At the same time, the challenge for computer science researchers
will be to design more efficient algorithms that exploit the features of these new architectures.
Acknowledgments
Special thanks to Conicyt for funding the PhD program of the first author Cristóbal A.
Navarro and to the reviewers of the Communications in Computational Physics journal for improving
the quality of this survey. This work was supported by Fondecyt Project No.
1120495. Finally, thanks to Renato Cerro for improving the English of this manuscript.
References
[1] S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. Computer, 29(12):66–76, December 1996.
[2] A. Aggarwal, B. Alpern, A. Chandra, and M. Snir. A model for hierarchical memory. In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, STOC '87, pages 305–314, New York, NY, USA, 1987. ACM.
[3] B. Alpern, L. Carter, E. Feig, and T. Selker. The uniform memory hierarchy model of computation. Algorithmica, 12:72–109, 1994. doi:10.1007/BF01185206.
[4] B. Alpern, L. Carter, and J. Ferrante. Modeling parallel computers as memory hierarchies. In Proc. Programming Models for Massively Parallel Computers, pages 116–123. IEEE Computer Society Press, 1993.
[5] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, AFIPS '67 (Spring), pages 483–485, New York, NY, USA, 1967. ACM.
[6] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(6096):446–449, December 1986.
[7] L. A. Barroso. The price of performance. Queue, 3(7):48–53, September 2005.
[8] C. Bays. Cellular automata in triangular, pentagonal and hexagonal tessellations. In Robert A. Meyers, editor, Computational Complexity, pages 434–442. Springer New York, 2012.
[9] P. Beame and J. Hastad. Optimal bounds for decision problems on the CRCW PRAM. In Proceedings of the 19th ACM Symposium on Theory of Computing, pages 25–27. ACM.
[10] J. Bédorf, E. Gaburov, and S. P. Zwart. A sparse octree gravitational n-body code that runs entirely on the GPU processor. J. Comput. Phys., 231(7):2825–2839, April 2012.
[11] A. Bernhardt, A. Maximo, L. Velho, H. Hnaidi, and M.-P. Cani. Real-time terrain modeling using CPU-GPU coupled computation. In Proceedings of the 2011 24th SIBGRAPI Conference on Graphics, Patterns and Images, SIBGRAPI '11, pages 64–71, Washington, DC, USA, 2011. IEEE Computer Society.
[12] Z. Bittnar, J. Kruis, J. Němeček, B. Patzák, and D. Rypl. Civil and structural engineering computing: 2001. Chapter: Parallel and distributed computations for structural mechanics: a review, pages 211–233. Saxe-Coburg Publications, 2001.
[13] L. Carter and B. Alpern. The RAM model considered harmful: towards a science of performance programming, 1994.
[14] C. P. Breshears. The Art of Concurrency – A Thread Monkey's Guide to Writing Parallel Applications. O'Reilly, 2009.
[15] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: stream computing on graphics hardware. ACM Trans. Graph., 23(3):777–786, August 2004.
[16] G. Capannini, F. Silvestri, and R. Baraglia. K-model: A new computational model for stream processors. In Proceedings of the 2010 IEEE 12th International Conference on High Performance Computing and Communications, HPCC '10, pages 239–246, Washington, DC, USA, 2010. IEEE Computer Society.
[17] B. L. Chamberlain. Chapel (Cray Inc. HPCS language). In Encyclopedia of Parallel Computing, pages 249–256. 2011.
[18] B. Chapman, G. Jost, and R. van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). The MIT Press, 2007.
[19] D.-K. Chen, H.-M. Su, and P.-C. Yew. The