
Iterative SLE Solvers over a CPU-GPU Platform

Alécio P. D. Binotto∗†, Christian Daniel∗, Daniel Weber∗, Arjan Kuijper∗, André Stork∗, Carlos Pereira†, Dieter Fellner∗

∗Fraunhofer IGD

Technische Universität Darmstadt, Darmstadt, Germany

†Institute of Informatics

UFRGS - Federal University of Rio Grande do Sul, Porto Alegre, Brazil

Email: abinotto@inf.ufrgs.br, c.daniel42@googlemail.com, daniel.weber@igd.fraunhofer.de,

arjan.kuijper@igd.fraunhofer.de, andre.stork@igd.fraunhofer.de, cpereira@ece.ufrgs.br, dieter.fellner@igd.fraunhofer.de

Abstract—GPUs (Graphics Processing Units) have become one of the main co-processors that brought high performance computing to the desktop. Together with multi-core CPUs, they form a powerful heterogeneous execution platform for massive calculations. To improve application performance and exploit this heterogeneity, distributing the workload in a balanced way over the PUs (Processing Units) plays an important role. This problem is challenging, however, since the cost of a task on a PU is non-deterministic and can be influenced by several parameters that are not known a priori, like the size of the problem domain. We present a comparison of iterative SLE (Systems of Linear Equations) solvers, used in many scientific and engineering applications, on a heterogeneous CPU-GPU platform, and characterize the scenarios in which the solvers obtain better performance. A new technique to improve memory access in the matrix-vector multiplication used by SLE solvers on GPUs is described and compared to standard implementations for CPU and GPUs. The timing profiles are analyzed, and break-even points based on problem size are identified for this implementation, indicating when it is faster to use the GPU instead of the CPU. Preliminary results show the importance of this study applied to a real-time CFD (Computational Fluid Dynamics) application with geometry modification.

Keywords-Graphics processors, solvers for SLEs, parallel processing, real-time CFD.

I. INTRODUCTION

Due to timing constraints, modern applications usually require high performance platforms to handle distinct scientific and engineering algorithms and massive calculations. The development of low-cost desktop-based accelerators (e.g., the GPU) offers alternative execution platforms aiming at better performance. For example, the Nvidia GTX285 GPU provides a peak performance of 1062 Gflop/s in single precision (89 Gflop/s in double precision) [1]. The resulting heterogeneous platforms can be considered a powerful asymmetric multi-core cluster, intensified by the new generation of multi-core CPUs (e.g., the Intel Core2Quad provides around 100 Gflop/s [2]); the challenge is to program applications that efficiently use all the resources available from the PUs.

Motivated by a 3D CFD simulation with real-time performance and interactive geometry modification requirements, this paper compares the performance of three implemented iterative SLE solvers (Jacobi, Red-Black Gauss-Seidel, and Conjugate Gradient) executed on a scenario with a CPU and multiple GPUs. Specifically, to explore the best performance of the GPUs, we present a method to improve the locality of GPU memory accesses, tailored to the SLE solvers. For that purpose, we take advantage of the GPU shared memory, which is a small cache-like memory with high access bandwidth.

In addition to the analysis of the solvers' performance with and without our data access method for the GPU, we point out that there are scenarios in which the CPU provides better performance, depending partially on the amount of data to be processed. In other cases, the GPU has better computation times, and in some of those, 1 GPU performs better than 2 GPUs. Our implementation achieves performance comparable to, and in some cases better than, state-of-the-art works.

Based on the performance results, we simply apply a statically coded scheduling over the CPU and GPUs for the CFD simulator, assigning the solvers to a PU based on the break-even points. We note, however, that more elaborate dynamic techniques for desktop platforms composed of a CPU and co-processors are an important aspect for future work [3], improving on the current high-level static programming and scheduling used by CUDA [4] or OpenCL [5].

In summary, the main contributions of this paper are:

1) The implementation and comparison of three different iterative methods to solve SLEs on a CPU and multiple GPUs, applied to a real-time CFD simulation with a geometry modification example;

2) The development of a data access strategy for the solvers on the GPU aiming at memory coalescing using the shared memory;

3) The analysis of the solvers' characteristics and their performance on a CPU-GPU platform, stating the conditions under which the solvers obtain better execution performance (i.e., finding a so-called performance break-even point that indicates the best PU to be used).

The paper is organized as follows. Section 2 presents the requirements of the real-time CFD application, used as the motivation for this work, and its comparison to a traditional

2010 12th IEEE International Conference on High Performance Computing and Communications. 978-0-7695-4214-0/10 $26.00 © 2010 IEEE. DOI 10.1109/HPCC.2010.40


Figure 1. Real-time CFD application: (a) velocity field slice visualization of the 3D simulation; (b) pressure slice visualization of the 3D simulation; (c) velocity field visualization of the 2D simulation with real-time geometry modification

approach. Section 3 discusses related work on the solvers and their GPU implementations, as well as on distributed platforms oriented to the GPU. Next, Section 4 provides a background overview of the numerical SLE solvers, followed by their new implementation features on the GPU and the CPU in Section 5. Section 6 discusses the experimental results based on a performance analysis oriented to the heterogeneous platform approach, presenting the PUs' break-even points. Finally, conclusions and directions for future research are described in Section 7.

II. MOTIVATION: REAL-TIME CFD APPLICATION

A Computational Fluid Dynamics (CFD) case study is carried out to exemplify the implementation of the solvers and the need for an asymmetric CPU-GPU platform approach to optimize performance. This CFD simulation is tailored towards industrial prototyping, where a default flow simulation is typically used in conjunction with detailed geometrical models in later stages of product development. The average calculation time for a traditional flow simulation is about 12 hours on a cluster of several computers, with a deviation in accuracy, i.e., the deviation between the simulated aerodynamic behavior and the subsequent real prototype, of about ±5%. This means that the CFD model in use gives the engineers high precision, but they might accept less precision in order to save simulation time in early stages of product development. Based on that scenario, there is a need to speed up the flow simulation considerably, inserting a new phase in the early stages of product development. Within this new phase, a 3D real-time CFD simulator with surface modification is expected to reduce the number of times that accurate CFD models must be reevaluated in later stages of product development.

To reach this goal, we developed a 3D real-time CFD simulation based on [6]. Its implementation uses the iterative SLE solvers on a CPU-GPU heterogeneous platform, presented in the next sections. Additionally, it chooses the solver at runtime based on the break-even points. Solving SLEs in the diffusion and projection phases is the most time consuming part of the CFD approach and depends directly on the domain size. Being efficient with huge domain sizes as well as with small problems is important for real-time applications, leading to the assumption that a CPU-GPU heterogeneous platform might offer a better execution scenario than homogeneous ones.

Fig.1(a) shows a slice of the developed 3D simulation representing the velocity visualization, and Fig.1(b) illustrates the pressure visualization. The real-time geometry modification is shown in Fig.1(c), over three time instances, in 2D. The 3D model modification is analogous and also performs at interactive rates.

III. RELATED WORK

SLE Solvers using the GPU: A number of works have contributed strategies for solving systems of equations on the GPU. Krüger and Westermann [7] provided data structures and operators for a linear algebra toolbox on the GPU for the Conjugate Gradient algorithm. Bolz et al. [8] presented an application of those algorithms oriented to problems on unstructured grids, extending it with a Multigrid solver for regular grids. Both approaches used shaders for programming the graphics pipeline and textures for data storage. In addition, [9] presented a symmetric sparse system solver and compared its performance on CPUs and GPUs, a strategy also followed by [10], but with a deeper analysis of several formats for sparse matrix-vector multiplication. The authors of [11] recently presented three parallel algorithms for solving tridiagonal linear systems on a GPU using its shared memory, obtaining a 12x speedup over a multi-threaded CPU solver.

Volkov and Demmel [12] presented a performance benchmark of linear algebra algorithms implemented on GPUs and their comparison to CPUs, noting that a hybrid architecture is more appropriate (even though the GPU outperforms the CPU in several circumstances). Introducing multiple GPUs, [13] described a method for the Conjugate Gradient, obtaining fast results when working with data decomposition, and [14] improved it with a parallel pre-conditioner that outperformed classical ones, like over-relaxation, on the GPU.


These works show that research is still needed to directly compare the performance of the aforementioned SLE solvers on a CPU-GPUs platform.

Distributed and Heterogeneous Platforms: the authors of [3] presented a performance comparison with a static domain size partition to be computed by the CPU-GPU platform, but applied to finite element solvers in solid mechanics. In this line, Song, Yarkhan, and Dongarra [15] described a dynamic task scheduling approach to execute dense linear algebra algorithms (based on factorization methods) on a distributed-memory CPU cluster. Recently, [16] presented a hybrid CPU-GPU implementation of Cholesky, LU, and QR factorizations, assigning independent functions to the PUs in a static way, i.e., scheduling sequential functions to the CPU and data-parallel ones to the GPU.

Therefore, there is also a need to study the performance of iterative solvers on an asymmetric CPU-GPUs platform. This paper contributes a performance analysis of iterative solvers on a CPU-GPU platform, in particular through the creation of a strategy for data access on the GPU.

IV. METHODS FOR SOLVING SLES

To determine the velocity and pressure fields around objects (e.g., planes), we use a simple setup for a fluid simulator with the incompressible Euler equations

∂u/∂t = −(u · ∇)u − ∇p + f    (1)

∇ · u = 0,    (2)

with velocity u, pressure p, external forces f, and assuming constant density ρ, as extensively described in [6], [17], [18], and [19]. Equation (1) is decomposed by operator splitting, i.e., the different effects are computed separately from each other.

We focus on the pressure correction step because it is the most time consuming part of the simulation loop. In this last part of a timestep ∆t, the intermediate velocity û is corrected. More precisely, we compute the pressure while simultaneously satisfying the constraint given by Equation (2). This is achieved by solving the Poisson equation

∇²p = (1/∆t) ∇ · û    (3)

with homogeneous Neumann boundary conditions ∂p/∂n = 0. Afterwards, the velocity u is corrected by subtracting the gradient of the pressure: u = û − ∆t∇p.

A simple spatial discretization of Equation (3) results in a large system of linear equations

Ax = b,    (4)

where A is the matrix of coefficients related to the derivative operations, b is the vector related to p and û, x is the vector of unknowns to be solved, and the dimension depends on the number of degrees of freedom. Using a regular Cartesian grid and approximating the derivatives by finite differences leads to the sparsity pattern of A shown in Fig.2 and known as the 7-point Laplacian. The system matrix A is positive semi-definite and symmetric [19], a pattern also used by [20].

Figure 2. Sparsity pattern of a finite difference discretization in 3 dimensions
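To make the structure of Equation (4) concrete, the sketch below builds the dense 7-point Laplacian for a tiny grid and checks the symmetry and positive semi-definiteness stated above. This is a minimal NumPy illustration, not the paper's code; the function name and the Neumann-style diagonal (counting existing neighbors) are our assumptions.

```python
import numpy as np

def laplacian_7pt(nx, ny, nz):
    """Dense 7-point Laplacian with homogeneous Neumann boundary conditions.

    Illustrative only: the paper stores A as seven vectors on the GPU; here
    we build the full matrix for a tiny grid to inspect its properties."""
    n = nx * ny * nz
    A = np.zeros((n, n))

    def ix(i, j, k):  # lexicographic index of cell (i, j, k)
        return k * nx * ny + j * nx + i

    for k in range(nz):
        for j in range(ny):
            for i in range(nx):
                row = ix(i, j, k)
                for di, dj, dk in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                                   (0, -1, 0), (0, 0, 1), (0, 0, -1)):
                    ii, jj, kk = i + di, j + dj, k + dk
                    if 0 <= ii < nx and 0 <= jj < ny and 0 <= kk < nz:
                        A[row, ix(ii, jj, kk)] = -1.0
                        A[row, row] += 1.0  # diagonal counts existing neighbors
    return A

A = laplacian_7pt(3, 3, 3)
assert np.allclose(A, A.T)                  # symmetric
assert np.linalg.eigvalsh(A).min() > -1e-9  # positive semi-definite
```

On the interior of the grid, each row couples to exactly six neighbors, which is the sparsity pattern of Fig.2.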

There are several choices for computing or approximating the solution of Equation (4). Direct methods are not appropriate because of the huge dimension, which is n×n, with n being the number of unknowns. We analyze explicit iterative methods, namely Jacobi, Gauss-Seidel, and the Conjugate Gradient, and give only a brief overview (for detailed information, please refer to [21]).

Jacobi Method: The Jacobi method iteratively improves an initial guess x⁰ for the SLE. For one complete iteration, the next approximation x_i^(m+1) is computed by rearranging and isolating each equation of the SLE:

x_i^(m+1) = (1/A_ii) ( b_i − Σ_{j=1, j≠i}^{n} A_ij x_j^(m) ),   i = 1, …, n.    (5)

As the system matrix A has the regular pattern depicted in Fig.2, the sum consists of only six values. The iteration needs two vectors, x^(m+1) and x^(m), for storing the approximation before and after each iteration. The convergence of this iterative method is slow.
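A single Jacobi sweep, Equation (5), can be sketched in a few lines of NumPy. This dense version is for illustration only (the paper's GPU kernels touch just the seven non-zero entries per row); the function name and the small test system are our own choices:

```python
import numpy as np

def jacobi_step(A, b, x):
    """One Jacobi sweep, Eq. (5): x_i <- (b_i - sum_{j != i} A_ij x_j) / A_ii.

    Returns a new vector, matching the two-vector (x^(m), x^(m+1)) scheme."""
    D = np.diag(A)
    return (b - (A @ x - D * x)) / D

# tiny diagonally dominant example (hypothetical data)
A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
x = np.zeros(3)
for _ in range(100):
    x = jacobi_step(A, b, x)
assert np.allclose(A @ x, b, atol=1e-8)
```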

Red-Black Gauss-Seidel (GS): In contrast to the Jacobi method, the Gauss-Seidel method uses the values already computed to advance to a new approximation:

x_i^(m+1) = (1/A_ii) ( b_i − Σ_{j=i+1}^{n} A_ij x_j^(m) − Σ_{j=1}^{i−1} A_ij x_j^(m+1) ),   i = 1, …, n.    (6)

Thus, the sum is split into components containing old and new approximations. One advantage is that only one vector x is needed, holding both old and new results. However, a data dependency arises: because each computation needs the newly calculated values, the method as stated cannot be parallelized.

The Jacobi and Gauss-Seidel methods have no restriction concerning the order in which Equation (4) is solved. An ordinary approach is lexicographic ordering, i.e., the equations are computed in the order in which the unknowns are stored. A specific reordering of those equations removes the data dependency in the iteration and makes it amenable to parallelization.

The Red-Black Gauss-Seidel iteration divides the unknowns into a red and a black set such that all neighbors of a red cell are black and vice versa. As a consequence of this classification, the computation of one type of cell only needs the other type as input. One complete iteration is then split into a red and a black iteration, which process the equations (2i) and (2i+1), respectively.
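The red-black splitting can be sketched as two vectorized half-sweeps. The sketch below is a dense NumPy stand-in on hypothetical data; it assumes the even/odd coloring is valid, i.e., that red unknowns couple only to black ones (true for the 1D example used here, and for the 7-point pattern under a suitable ordering):

```python
import numpy as np

def red_black_gs_step(A, b, x):
    """One Red-Black Gauss-Seidel iteration (Eq. (6) after reordering).

    Even ("red") and odd ("black") unknowns are updated in two half-sweeps;
    within a half-sweep all updates are independent, hence parallelizable."""
    D = np.diag(A)
    for colour in (slice(0, None, 2), slice(1, None, 2)):  # red, then black
        # row sums use the other colour; add back the diagonal contribution
        x[colour] = (b[colour] - A[colour] @ x + D[colour] * x[colour]) / D[colour]
    return x

# 1D Poisson-like test system (hypothetical data)
n = 5
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = np.zeros(n)
for _ in range(500):
    x = red_black_gs_step(A, b, x)
assert np.allclose(A @ x, b, atol=1e-8)
```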

Conjugate Gradient (CG): The method of conjugate gradients combines the ideas of steepest descent and the method of conjugate directions. The first algorithm iteratively minimizes the functional E = ½ xᵀAx − xᵀb by using a direction that reduces the error optimally in one iteration. This amounts to solving Ax = b in the case that A is positive definite and symmetric. The second algorithm uses conjugate search directions, which are perpendicular to all the previous ones, in order to optimally exploit the search space. The combination of these two approaches leads to the conjugate gradient algorithm, which minimizes the distance to the true solution in each iteration (detailed information can be found in [22]). The algorithm can be assembled from a few basic operations, like dot products, scalar multiplication, sums of vectors, and the matrix-vector multiplication

y_i = Σ_{j=1}^{n} A_ij x_j,    (7)

which is the most time consuming part [9].
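The CG loop itself is short; apart from the matrix-vector product, its building blocks (dot products, axpys, reductions) are simple sequential memory operations. A minimal unpreconditioned NumPy sketch, with names of our choosing:

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-8, max_iter=1000):
    """Unpreconditioned CG for a symmetric positive definite system.

    `matvec` computes Eq. (7), the dominant cost of the algorithm."""
    x = np.zeros_like(b)
    r = b - matvec(x)       # residual
    d = r.copy()            # search direction
    rr = r @ r
    for _ in range(max_iter):
        if np.sqrt(rr) < tol:
            break
        Ad = matvec(d)
        alpha = rr / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        rr_new = r @ r
        d = r + (rr_new / rr) * d   # conjugate direction update
        rr = rr_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: A @ v, b)
assert np.allclose(A @ x, b, atol=1e-6)
```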

V. IMPLEMENTATION

In this section, we present a novel implementation of the algorithms on the GPU using CUDA 2.0, exploring its shared memory. It is worth noting that another kind of GPU cache could be used: the texture cache. The work of [23], for example, employed such a computer graphics-based approach, but focused on a fluid particle simulation. However, that strategy requires an extra method to organize the data, since 3D textures are read-only. In this work, we explore the GPU using the shared memory cache, like other approaches presented in the related work.

In order to obtain the full computational power provided by the GPUs, some requirements have to be met: global memory access has to be coalesced, multiple accesses to global memory should be buffered in shared memory, data should be available independently of the thread, and branching of threads within one block should be avoided. In order to meet the last requirement, a buffer of zeroes is introduced to the matrix, ensuring that no illegal data access may occur while obliterating the need for boundary queries. Although this approach had already been proposed by [24], it was not mentioned how the size of the buffer has to be adjusted for different problem domains. This size is crucial to meet the requirement of an aligned starting address in order to enable coalesced access to the data. The same holds for the number of threads in one CUDA block.

In addition, our approach uses ghost cells (padded area) only in "front" of and "behind" the problem domain; no additional ghost cells between layers are introduced, reducing the memory load and simplifying computations. The ghost cells, in our case, are not used for any kind of boundary conditions (as opposed to [24]) and serve the sole purpose of avoiding illegal memory accesses. The implementation of boundary conditions is achieved by a modification of the matrix A, where a zero is stored in those entries that would lead to a multiplication of a border element of the domain by a cell from another control volume in a different layer or line. This meets the third requirement.

To fulfill the first two requirements, the use of the GPU shared memory is also improved with respect to the related work, since shared memory improves performance in the cases where the same data has to be accessed multiple times, or with the goal of allowing coalesced loading/writing of the data. Basically, the GPU, just like the CPU, has several layers of memory that differ in size and bandwidth. The shared memory of the GPU may be compared to the cache of a CPU in terms of speed and size. However, as opposed to the CPU cache, the GPU shared memory is not automatically managed; taking advantage of the higher bandwidth offered by the shared memory must be done explicitly by the developer. On GPUs, accesses to multiple elements should be aligned and operate on consecutively stored data, resulting in much higher bandwidths than accesses to elements that are either not aligned or not stored consecutively. In CUDA, this is described as coalesced access [4].

Therefore, these accesses should be coalesced in order to obtain the best performance from the GPU. Thus, even when data is accessed only once, by a single thread, it may be advantageous to first load the data into shared memory using a coalesced access pattern. In this work, instead of using the shared memory for coalesced loading of the matrix A, the adapted way in which A is internally represented allows coalesced access. This enables the use of bigger CUDA block sizes and of CUDA latency hiding mechanisms, both of which improve the speed of computations.

Furthermore, our implementation loads only 5 rows of the vector x (plus the padding) into shared memory, allowing multiple threads to access the same cells and increasing the limit on the CUDA block size. Additionally, to enable the latency hiding mechanisms present in CUDA, we always work with batches at the limit of shared memory, allowing the GPU to switch to another block while the current block is waiting for data. This approach works when the shared memory size accommodates multiple CUDA

blocks (and no other data is loaded) and, in combination with the coalesced loading model presented here, leads to a significant speedup of computations, as idle time is prevented.

Figure 3. Enabling memory coalescing access: (a) simple loading, where different threads access different addresses; (b) improved loading, where coalesced access is partially achieved; (c) final loading strategy, where the starting address for each block is aligned to a multiple of 128

The coalesced access to the main memory is achieved by splitting the matrix A into 7 vectors (A0 to A6) and ensuring that the starting addresses for data access in a kernel are always aligned.

Fig.3 shows how the coalescing strategy was developed. Starting from a naive approach in which each thread would have to read/write seven contiguous entries and, therefore, not fulfill the requirements for coalesced loading (Fig.3(a)), a storing/loading strategy was developed in which a series of threads accesses a contiguous part of the global memory in an ordered fashion (Fig.3(b)). However, each thread then no longer accesses its own entries.

Taking threads T1 and T2 as an example: using shared memory, T1 may load data that will only be processed by T2 and not by T1 itself, a typical example of strided memory access. This method almost achieves coalesced access, but will not work on the first generation of CUDA-capable devices, since the consecutive segment of each block starts at position

7 · blocksize · sizeof(float),    (8)

allowing only the threads in the very first block to achieve coalesced loading. For further blocks, the starting address is not necessarily a multiple of the block size (128). To overcome this problem, the matrix was decomposed into seven single vectors, ensuring that the starting addresses for data access in a kernel are always aligned and allowing coalesced access on every CUDA device (Fig.3(c)).

Figure 4. Representation by seven linear vectors (A0-A6) of the matrix resulting from a regular grid

For such an implementation, the lexicographic ordering i = 0, …, n is replaced by a component-wise representation (i, j, k) with i = 0, …, nx, j = 0, …, ny, k = 0, …, nz, and n = nx·ny·nz, which accounts for the three-dimensional setting. In this representation, the neighbors of one cell (i, j, k) are (i±1, j, k), (i, j±1, k), and (i, j, k±1).

In CUDA, threads are numbered in a specific disjoint pattern, which allows constructing consecutive indices ix. Here, one ix represents the ix-th equation of the SLE and simultaneously implies a position (i, j, k) in the simulation domain:

ix = k·nx·ny + j·nx + i.    (9)

The vector of unknowns x and the right hand side b represent the positions and are simply stored in this linear pattern. The system matrix A can be represented by seven vectors of length nx·ny·nz due to the implicit topology of the simple Cartesian grid (Fig.3(c) and Fig.4).
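With A stored as seven vectors, the matrix-vector product of Equation (7) reduces to seven element-wise multiply-adds over shifted views of x. The sketch below is a NumPy stand-in for the CUDA kernel; the assignment of A0-A6 to the diagonal and the six neighbor couplings is an assumed ordering (the paper does not fix which vector is which), and cross-boundary entries are assumed to be zeroed by the matrix modification described above:

```python
import numpy as np

def spmv_7pt(A0, A1, A2, A3, A4, A5, A6, x, nx, ny):
    """y = A x for the 7-point pattern, with A stored as seven linear vectors.

    A0: diagonal; A1/A2: (i-1)/(i+1) couplings; A3/A4: (j-1)/(j+1);
    A5/A6: (k-1)/(k+1). Entries that would reach across the domain
    boundary must be zero."""
    y = A0 * x
    s = nx * ny                      # one layer of the grid
    y[1:]   += A1[1:]   * x[:-1]     # neighbor (i-1, j, k)
    y[:-1]  += A2[:-1]  * x[1:]      # neighbor (i+1, j, k)
    y[nx:]  += A3[nx:]  * x[:-nx]    # neighbor (i, j-1, k)
    y[:-nx] += A4[:-nx] * x[nx:]     # neighbor (i, j+1, k)
    y[s:]   += A5[s:]   * x[:-s]     # neighbor (i, j, k-1)
    y[:-s]  += A6[:-s]  * x[s:]      # neighbor (i, j, k+1)
    return y

# 1D check (ny = nz = 1): the seven-vector product matches the dense matrix
nx = 4
A0 = 2.0 * np.ones(nx)
A1 = -np.ones(nx); A1[0] = 0.0   # no (i-1) neighbor at the left boundary
A2 = -np.ones(nx); A2[-1] = 0.0  # no (i+1) neighbor at the right boundary
Z = np.zeros(nx)
x = np.arange(1.0, nx + 1)
dense = 2.0 * np.eye(nx) - np.eye(nx, k=1) - np.eye(nx, k=-1)
assert np.allclose(spmv_7pt(A0, A1, A2, Z, Z, Z, Z, x, nx, 1), dense @ x)
```

Each of the seven terms reads a contiguous slice of x, which is what makes the per-vector accesses coalescible on the GPU.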

The essential part of the Jacobi and the Gauss-Seidel method is algorithmically equivalent to a matrix-vector product. All of the algorithms iterate over the non-zero entries of the sparse matrix and combine them with some data. Therefore, we focus on the computation of Equation (7) on the GPU.

For computing one component y_(i,j,k) of Equation (7), the following data need to be accessed: the memory of y_(i,j,k) for writing the result; the matrix entries A_(i,j,k), A_(i±1,j,k), A_(i,j±1,k), A_(i,j,k±1); and the corresponding entry x_(i,j,k) with its neighbors x_(i±1,j,k), x_(i,j±1,k), x_(i,j,k±1). Those access

patterns can be interpreted such that only data from the adjacent cells is needed. In that way, an iteration or a matrix-vector multiplication can be executed for one equation with a coalesced memory access pattern, except for the values x_(i±1,j,k), x_(i,j±1,k), x_(i,j,k±1), located at the adjacent cells. We use the shared memory to buffer the accesses to x_(i±1,j,k). For the remaining values x_(i,j±1,k), x_(i,j,k±1), we note that the access is not coalesced in that pattern.

Figure 5. Performance of the solvers on the GPUs: (a) without a coalesced access strategy; (b) with our (new) approach; (c) CG with and without our approach; (d) on the GTX285 with our approach; (e) break-even point on the GTX285 with our approach (zoom of the red circle area of (d)); (f) break-even point on the 8800GT with our approach

VI. EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we test our implementation with respect to performance on the CPU and the GPUs. Of special interest are the conditions under which the solvers obtain better execution performance, i.e., the break-even points, which can be considered the decision points for scheduling a solver on a PU.

Three heterogeneous PUs were used in the experiments:

•a 4-core CPU (Intel Q6600) at 2.4GHz with 8MB of L2 cache and 4GB of main memory with 6.4GB/s of bandwidth;

•a Geforce 8800GT GPU (14 streaming multiprocessors, 112 cores, with a core clock frequency of 600MHz and 512MB of memory with a bandwidth of 57.6GB/s);

•a Geforce GTX285 GPU (30 streaming multiprocessors, 240 cores, with a core clock frequency of 1476MHz and 1000MB of memory with a bandwidth of 159.6GB/s).

The PUs communicated via PCIe x16, which bounds the bandwidth of the CPU-GPU link to 4GB/s. The measure of convergence used was the residual (the residual vector is squared, the values are summed, and the square root is taken as the accuracy), required to be smaller than 1e-4.
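For reference, this convergence measure can be written out as follows (a trivial NumPy sketch with our own naming, not the production code):

```python
import numpy as np

def residual_norm(A, x, b):
    """Convergence measure used in the experiments: the 2-norm of the
    residual vector (entries squared, summed, then square-rooted).
    The iteration stops once this value drops below 1e-4."""
    r = b - A @ x
    return np.sqrt(np.sum(r * r))

# residual of the zero guess for b = (3, 4) is the classic 3-4-5 norm
assert np.isclose(residual_norm(np.eye(2), np.zeros(2), np.array([3.0, 4.0])), 5.0)
```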

Fig.5(a) shows the performance of the solvers without exploiting memory coalescing, and Fig.5(b) shows the results with our strategy for data locality on the GPU. In particular, for 8M unknowns, the Jacobi method took 406 milliseconds (ms) on the GTX285 with our strategy and 609ms with the common approach, and 2637ms with our strategy versus 4370ms without it on the 8800GT. The GS with our coalescing strategy executed in 5139ms on the 8800GT and in 586ms on the GTX285.

The experiments showed that the CG solver obtained the best performance on the GPUs. For the same 8M unknowns, the CG took 2042ms with its default implementation and 1198ms with our memory strategy on the 8800GT, and 369ms without our strategy versus 313ms with it on the GTX285 (Fig.5(c)). For comparison (not shown in the figures), a Jacobi-preconditioned CG took 39219ms on the CPU.

The CG naturally converges faster than the other solvers for a sufficiently large problem on the GPU; for small problems, Fig.7 shows its performance on the PUs. The CPU obtained better performance up to 3K unknowns compared to the GTX285 and up to 7K unknowns compared to the 8800GT. In the cases where the CPU performed better, too few threads were launched to enable latency

hiding on the GPU. After these break-even points, the GPUs' processing power was fully utilized.

Figure 6. Performance of the solvers using 2 GTX285 GPUs: (a) comparison with one GPU; (b) speedup using 2 GPUs

Taking the PU that obtained the best overall performance, Fig.5(d) depicts the performance of the solvers on the GTX285. The red-circled area of the figure indicates that the performance behavior for small numbers of unknowns differs from the overall tendency. The CG becomes faster than the Jacobi and the GS after crossing the threshold of approximately 500K unknowns, as shown in Fig.5(e). This gain indicates that, in the CG algorithm, many operations are always 'naturally' coalesced, since a sequential strategy is used, i.e., vector-vector operations in which one block of threads can always load a sequential segment of data to be computed. The same holds for the reduction kernel used to sum up the values of a vector. The Jacobi algorithm (and, thus, the matrix-multiply kernel in the CG) also profits from such a loading strategy. However, the CG only needs an accelerated strategy for the matrix-multiply operation, since all its other computations are already coalesced.

For the Geforce 8800GT, Fig.5(f) shows a break-even point of 140K unknowns, after which the CG is faster than the other solvers.

The implementation was extended to a multi-GPU approach. In general, an SLE cannot simply be divided to be computed in parts by different devices, since each element depends on other elements. Nevertheless, in the case of structured grids, the elements needed to compute one iteration on one part of the SLE are known. The earliest element needed is one layer of elements ahead of the starting element of the partial x vector, and the last element needed is one layer of elements behind the last element of the partial x vector. Those are the elements that have to be loaded in addition to the elements that will be computed, and they have the same size as the padding used to avoid illegal memory accesses. This way, instead of filling the padding with zero entries, it is filled with the current values of x. Moreover, those elements have to be updated after each iteration. When using more than one GPU, this communication is expected to diminish performance, resulting in a speedup smaller than the number of available GPUs.
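The layer exchange described above can be sketched as follows. This is a NumPy stand-in for the GPU "download"/"upload" transfers, with names of our own choosing; each device's partial x vector is assumed to carry one ghost layer of `layer` (= nx·ny) elements in front of and behind its owned range:

```python
import numpy as np

def exchange_halos(parts, layer):
    """Update the ghost layers of neighboring partitions after one iteration.

    Each entry of `parts` holds [front ghost | owned layers | back ghost],
    each ghost being `layer` elements. A device's last/first owned layer is
    copied into its neighbor's ghost layer."""
    for left, right in zip(parts[:-1], parts[1:]):
        right[:layer] = left[-2 * layer:-layer]  # left's last owned layer
        left[-layer:] = right[layer:2 * layer]   # right's first owned layer
    return parts

# two partitions of a 4-layer domain, layer size 2 (hypothetical numbers)
layer = 2
x = np.arange(8.0)  # global vector, 4 layers of 2 elements
p0 = np.concatenate([np.zeros(layer), x[:4], np.zeros(layer)])
p1 = np.concatenate([np.zeros(layer), x[4:], np.zeros(layer)])
exchange_halos([p0, p1], layer)
assert np.allclose(p0[-layer:], x[4:6])  # p0 now sees p1's first layer
assert np.allclose(p1[:layer], x[2:4])   # p1 now sees p0's last layer
```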

For a system with two co-processors, exactly one layer of control volumes must be exchanged: this layer has to be "downloaded" from each device and then "uploaded" to the other device after each iteration. Fig. 6(a) illustrates that a 2-GPU implementation needs about 2M unknowns to be faster than the execution on one GPU, measured with two GTX 285 GPUs and the CG as benchmark. With fewer than 2M elements, the increased communication effort cannot be compensated by the combined higher processing power.
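This trade-off can be captured by a crude per-iteration cost model; all parameters below are hypothetical placeholders, not values measured on the GTX 285:

```python
def two_gpu_is_faster(n, t_compute, layer, t_transfer):
    """Crude per-iteration cost model for the 2-GPU split.

    n          : number of unknowns
    t_compute  : time to update one unknown (assumed constant)
    layer      : number of elements in one layer of control volumes
    t_transfer : time to move one element over the bus (assumed constant)

    One GPU processes all n unknowns; two GPUs process n/2 each but must
    download and upload one border layer per device after every iteration.
    """
    t_one = n * t_compute
    t_comm = 4 * layer * t_transfer        # 2 downloads + 2 uploads
    t_two = (n / 2) * t_compute + t_comm
    return t_two < t_one
```

With such a model the break-even size is simply the n at which the halved compute time first outweighs the fixed exchange cost, qualitatively matching the 2M-unknown threshold observed in Fig. 6(a).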

The multi-GPU approach demonstrates that the speedup depends on the problem size, as illustrated in Fig. 6(b). In our case, each GPU computed half of the elements in the domain plus the border elements. The maximum speedup achieved was approximately 1.7 for 8M unknowns. This result is comparable to the work of [24], which reached a speedup of 1.5 with 2 GPUs for the same domain size. Although we obtained a better speedup, a direct comparison is difficult due to system differences (they use a Tesla C870 GPU on an Nvidia Tesla S870 server).

For a system with more than two GPUs, twice this communication bandwidth is necessary, since the inner devices need information from both neighbors. This communication is performed in parallel, however, because each device is controlled by its own thread. Thus, adding more than three devices increases the speed of the computation but has no further effect on the communication time. The initial time needed to upload the data to the devices also decreases, since each device only needs its respective part of the right-hand side and the matrix (as opposed to a single-GPU implementation, where all data has to be transferred to one GPU), and the data transfer to each co-processor is performed in parallel.

Consequently, the bandwidth at which data is transferred to the GPU depends directly on the size of the data to be transferred.

Figure 7. Performance break-even point on CPU and GPU (time in ms over the number of unknowns; curves: PCG on CPU, Conjugate Gradient on the 8800, Conjugate Gradient on the GTX 285)

A small problem can be processed by only a limited number of GPU threads, under-utilizing the GPU; in those cases, the CPU achieves better performance. This is mainly due to the memory bandwidths in the different stages of communication. First, the data has to be sent from CPU memory to the GPU and then loaded from GPU main memory into shared memory or into the GPU's registers and textures. For all of these stages, the real bandwidth depends on the size of the data chunk being processed. Second, CUDA enables the GPU to switch between blocks while some of them are waiting for data: an idle block is set to an inactive state until its data has arrived, and in the meantime another block may be processed. This almost eliminates idle time on the PU. If the problem is so small that there are not enough blocks to fill the shared memory, this latency-hiding strategy cannot be employed efficiently.
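The effect can be illustrated with a toy occupancy check; the thresholds and device parameters below are assumptions for illustration, not CUDA occupancy rules or measured values:

```python
import math

def latency_hiding_possible(n, threads_per_block=256, num_sms=30,
                            min_blocks_per_sm=2):
    """Toy check for whether latency hiding can work (assumed parameters).

    The scheduler can only swap a waiting block for a ready one if each
    multiprocessor has more than one resident block. With one thread per
    unknown, a small problem simply does not generate enough blocks.
    """
    blocks = math.ceil(n / threads_per_block)
    return blocks >= num_sms * min_blocks_per_sm
```

Under these assumed numbers, a problem of a few thousand unknowns leaves multiprocessors with nothing to switch to, which is exactly the regime where the CPU wins in Fig. 7.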

Larger data arrays are therefore transmitted to the GPU more effectively, improving bandwidth. This implies that data should always be grouped into larger chunks to exploit the best possible bandwidth and to be processed more efficiently by the GPU memory controller. Fig. 8 shows how important the size of the data chunk is for reaching the optimal bandwidth, which differs from the theoretical one.
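A simple latency-plus-throughput model reproduces this behavior qualitatively (the latency and peak values are illustrative only, not the measurements behind Fig. 8):

```python
def effective_bandwidth_gb_s(chunk_bytes, latency_s=1e-5, peak_gb_s=5.0):
    """Latency-plus-throughput transfer model (illustrative numbers).

    Each transfer pays a fixed setup latency; only for large chunks is
    that overhead amortized, so the effective rate approaches the peak
    bandwidth of the bus only asymptotically.
    """
    seconds = latency_s + chunk_bytes / (peak_gb_s * 1e9)
    return chunk_bytes / seconds / 1e9   # GB/s
```

For a 4 KB chunk the fixed latency dominates and the effective rate is a small fraction of the peak, while a 64 MB chunk comes within a few percent of it, mirroring the gap between real and theoretical bandwidth discussed above.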

VII. CONCLUSIONS AND FUTURE DIRECTIONS

This paper presents a performance evaluation of iterative SLE solvers applied to a real-time CFD application on an asymmetric desktop platform composed of a CPU and GPUs. Based on the experimental results, it was observed that performance is directly influenced by the domain size (number of unknowns). For a GPU approach, the characteristics of the solvers must be analyzed; even more important are the management of memory accesses, in order to obtain optimal gains in the calculations, and the size of the data chunks on the GPU. Break-even points, i.e., the problem sizes at which one PU starts to outperform another, were discussed. We also achieved a better speedup with multiple GPUs than related work.

Figure 8. Real consumed bandwidth

The developed mechanisms accomplished the three main goals proposed in this paper: comparing the performance of different solvers on a heterogeneous platform, describing a GPU data access strategy for iterative solvers that enables memory coalescing, and analyzing the scenarios in which each solver obtains better performance.

Future research directions include a further analysis of how the CFD application benefits from such tuning of solvers and from dynamic scheduling using a hybrid approach. Dynamically adapting the solvers' convergence in order to continuously ensure an acceptable trade-off between the real-time requirement and the precision of the solution is part of ongoing work. Additionally, automatic scheduling and dynamic load balancing of high-level functions on a CPU-GPU platform can further accelerate this kind of application, which involves several massive calculations. Details of this strategy are introduced in our related work [25].

Regarding the platform, an increased number of GPUs can be used as well as, more challengingly, other types of PUs. An important improvement is to extend the solvers to exploit modern multi-core CPUs, in order to compare the presented GPU method with a concurrent implementation over multiple CPU cores. Besides, the use of a PCIe 2.0 bus could increase overall performance, but the employed bandwidth does not affect the findings of this work.

Finally, the results indicate that the core functions introduced in this paper can easily be modified to produce a highly efficient multigrid algorithm that uses the GPU for the computationally intensive parts. This supports the assumption that our findings generalize well and apply to a variety of other problems.

ACKNOWLEDGMENT

We would like to thank all the reviewers for their detailed suggestions and comments. Alécio Binotto thanks the support given by a DAAD fellowship and the Programme Alβan, scholarship no. E07D402961BR.


REFERENCES

[1] Nvidia, "Nvidia GeForce series," http://www.nvidia.com/geforce [accessed 14.05.2010].

[2] Intel, "Intel Core2Quad processors," http://www.intel.com/products/processor/core2quad/ [accessed 14.05.2010].

[3] D. Göddeke, H. Wobker, R. Strzodka, J. Mohd-Yusof, P. McCormick, and S. Turek, "Co-processor acceleration of an unmodified parallel solid mechanics code with FeastGPU," Journal of Computational Science and Engineering, vol. 4, no. 4, pp. 254–269, 2009.

[4] Nvidia, "CUDA architecture," http://www.nvidia.com/cuda [accessed 14.05.2010].

[5] Khronos, "OpenCL architecture," http://www.khronos.org/opencl/ [accessed 14.05.2010].

[6] J. Stam, “Stable ﬂuids,” in SIGGRAPH ’99: Proceedings

of the 26th annual conference on Computer graphics and

interactive techniques. New York, NY, USA: ACM

Press/Addison-Wesley Publishing Co., 1999, pp. 121–128.

[7] J. Krüger and R. Westermann, GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley, 2005, ch. 44, A GPU Framework for Solving Systems of Linear Equations, pp. 703–718.

[8] J. Bolz, I. Farmer, E. Grinspun, and P. Schröder, "Sparse matrix solvers on the GPU: conjugate gradients and multigrid," in SIGGRAPH '05: ACM SIGGRAPH 2005 Courses. New York, NY, USA: ACM, 2005, p. 171.

[9] L. Buatois, G. Caumon, and B. Lévy, "Concurrent number cruncher: An efficient sparse linear solver on the GPU," in High Performance Computation Conference - HPCC'07, Houston, USA, 2007, pp. 358–371.

[10] N. Bell and M. Garland, “Implementing sparse matrix-vector

multiplication on throughput-oriented processors,” in SC ’09:

Proceedings of the Conference on High Performance Com-

puting Networking, Storage and Analysis. New York, NY,

USA: ACM, 2009, pp. 1–11.

[11] Y. Zhang, J. Cohen, and J. D. Owens, “Fast tridiagonal solvers

on the gpu,” in PPoPP ’10: Proceedings of the 15th ACM

SIGPLAN symposium on Principles and practice of parallel

programming. New York, NY, USA: ACM, 2010, pp. 127–

136.

[12] V. Volkov and J. W. Demmel, “Benchmarking gpus to tune

dense linear algebra,” in SC ’08: Proceedings of the 2008

ACM/IEEE conference on Supercomputing. Piscataway, NJ,

USA: IEEE Press, 2008, pp. 1–11.

[13] A. Cevahir, A. Nukada, and S. Matsuoka, “Fast conjugate

gradients with multiple gpus,” in ICCS ’09: Proceedings of

the 9th International Conference on Computational Science.

Berlin, Heidelberg: Springer-Verlag, 2009, pp. 893–903.

[14] M. Ament, G. Knittel, D. Weiskopf, and W. Strasser, “A

parallel preconditioned conjugate gradient solver for the pois-

son problem on a multi-gpu platform,” Parallel, Distributed,

and Network-Based Processing, Euromicro Conference on,

pp. 583–592, 2010.

[15] F. Song, A. YarKhan, and J. Dongarra, “Dynamic task

scheduling for linear algebra algorithms on distributed-

memory multicore systems,” in SC’09 The International

Conference for High Performance Computing, Networking,

Storage and Analysis, Portland, OR, 2009, pp. 1–10.

[16] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra, “Dense

linear algebra solvers for multicore with gpu accelerators,” in

Parallel Distributed Processing, Workshops and Phd Forum

(IPDPSW), 2010 IEEE International Symposium on, 2010,

pp. 1 –8.

[17] R. Fedkiw, J. Stam, and H. W. Jensen, “Visual simulation of

smoke,” in SIGGRAPH ’01: Proceedings of the 28th annual

conference on Computer graphics and interactive techniques.

New York, NY, USA: ACM, 2001, pp. 15–22.

[18] K. Crane, I. Llamas, and S. Tariq, “Real-

time simulation and rendering of 3d ﬂuids,” in

GPU Gems 3, H. Nguyen, Ed. Addison Wesley

Professional, August 2007, ch. 30. [Online]. Available:

http://my.safaribooksonline.com/9780321545428/ch29

[19] R. Bridson, Fluid Simulation for Computer Graphics. A K

Peters, 2008.

[20] T. Jost, S. Contassot-Vivier, and S. Vialle, “An efﬁcient

multialgorithms sparse linear solver for gpus,” in EuroGPU

minisymposium of the International Conference on Parallel

Computing, ParCo2009, 2009, pp. 1–8.

[21] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. Philadelphia, PA: SIAM, 1994.

[22] J. R. Shewchuk, “An introduction to the conjugate

gradient method without the agonizing pain,” Pittsburgh,

PA, USA, Tech. Rep., 1994. [Online]. Available:

http://portal.acm.org/citation.cfm?id=865018

[23] J. M. Cohen, S. Tariq, and S. Green, “Interactive ﬂuid-

particle simulation using translating eulerian grids,” in I3D

’10: Proceedings of the 2010 ACM SIGGRAPH symposium

on Interactive 3D Graphics and Games. New York, NY,

USA: ACM, 2010, pp. 15–22.

[24] J. Thibault and I. Senocak, "CUDA implementation of a Navier-Stokes solver on multi-GPU desktop platforms for incompressible flows," in 47th AIAA Aerospace Sciences Meeting. American Institute of Aeronautics and Astronautics, 2009, pp. 1–15.

[25] A. P. D. Binotto, C. E. Pereira, and D. W. Fellner, “Towards

dynamic reconﬁgurable load-balancing for hybrid desktop

platforms,” in Parallel Distributed Processing, Workshops and

Phd Forum (IPDPSW), 2010 IEEE International Symposium

on, 2010, pp. 1 –4.
