
Adaptable Particle-in-Cell Algorithms for Graphical Processing Units

Viktor K. Decyk1,2 and Tajendra V. Singh2

1Department of Physics and Astronomy

2Institute for Digital Research and Education

University of California, Los Angeles

Los Angeles, CA 90095

Abstract

We developed new parameterized Particle-in-Cell algorithms and data structures for

emerging multi-core and many-core architectures. Four parameters allow tuning of this PIC

code to different hardware configurations. Particles are kept ordered at each time step. The first

application of these algorithms is to NVIDIA Graphical Processing Units, where speedups of

about 15-25 compared to an Intel Nehalem processor were obtained for a simple 2D electrostatic

code. Electromagnetic codes are expected to get higher speedups due to their greater

computational intensity.

Introduction

Computer architectures are rapidly evolving to include more and more processor cores. The

designs of these new architectures also vary considerably. These include incremental evolutions

of existing architectures, heterogeneous mixtures of processors such as the IBM Cell or AMD

APU, and accelerators, such as NVIDIA’s Graphical Processing Units (GPUs). As the most

advanced computers move from petaflops to exaflops in the next decade, it is likely that they will

consist of complex nodes with such multi-core and many-core processors. This variety results in

great challenges for application programmers. Not only is parallel computing challenging in its

own right, but there is also a growing variety of programming models, as older

programming models prove inadequate and new ones are proposed. These are disruptive times.

Application developers resist developing different codes for different architectures. To cope

with this bewildering variety, they need to find some common features that they can use to

parameterize their application algorithms to adapt to different architectures. In our opinion, the

common feature that we can exploit in future advanced architectures is that processors will be

organized in a hierarchy. At the lowest level, they will consist of a number of tightly coupled

SIMD cores, which are all executing the same instruction in lock step and sharing fast memory.

At the next level will be loosely coupled groups of SIMD cores that share a slower memory.

Finally, there will be accumulations of groups that share no memory and communicate with

message-passing. Furthermore, accessing global memory may be even more of a bottleneck than

it is now, since memory speeds will not keep up with processor counts, so that lightweight

threads to hide memory latency will be necessary. We think that a cluster of GPUs is the closest

currently existing architecture to this future and therefore a good choice for experimentation and

development.


Among the important applications in plasma physics are Particle-in-Cell (PIC) codes. These

codes model plasmas at the microscopic level by self-consistently integrating the trajectories of

charged particles with fields that the particles themselves produce. PIC applications are very

compute intensive, as the number of particles in these codes can vary from many thousands to

many billions. Such codes are used in many different areas of plasma physics, including space

plasma physics, advanced accelerators, and fusion energy research. The US Department of

Energy allocated over 200 million node hours on its most advanced computers to PIC codes

in their INCITE program in 2010, about 12% of the total. This paper will report on new

algorithms that we have developed for PIC codes that work well on GPUs, and that we feel are

adaptable to other future architectures as they evolve. This field is very new and a variety of

approaches to developing PIC codes on GPUs are under development [1-5].

Graphical Processing Units

GPUs consist of a number of SIMD multi-processors (30 on the Tesla C1060). Although

each SIMD multi-processor has 8 cores, the hardware executes 32 threads together in what

NVIDIA calls a warp. Adjacent threads are organized into blocks, which are normally between

32 and 512 threads in size. Blocks can share a small (16 KByte), fast memory, which has a

latency of a few clock cycles, and have fast barrier synchronization. The GPU has a global

memory (4 GBytes on the C1060) accessible by any thread. High memory bandwidth is

possible, but is typically achieved when 16 adjacent threads read memory within the same 64

byte block, which NVIDIA calls data coalescing. It is usually achieved by having adjacent

threads read memory with stride 1 access. Memory has substantial latency (hundreds of clock

cycles), which is hidden by lightweight threads that can be swapped out in one clock cycle.

Thousands of threads can be supported simultaneously and millions can be outstanding. Fine

grain parallelism is needed to make efficient use of the hardware. Because access to global

memory is usually the bottleneck, streaming algorithms are optimal, where global memory is

read only once.

NVIDIA has developed a programming model called CUDA for this device, which is based

on C with extensions. OpenCL is also supported for code that must also run on other

devices, such as processors with Intel SSE. In addition, the Portland Group has developed a Fortran compiler for

GPUs.

Streaming Algorithm for PIC

Particle-in-Cell codes are a type of N-body code (where all particles are interacting with all

the others), but they differ from molecular dynamics (MD) codes. In MD codes, where particles

interact directly with each other, the calculation is of order N^2. In PIC codes, particles interact

via the electric and magnetic fields that they produce. This makes the calculation of order N, and

many more particles can be used than in typical MD, up to a trillion currently [6].

A PIC code has three major steps in the inner loop. In the first step, a charge or current

density is deposited on a grid. This involves an inverse interpolation (scatter operation) from the

particle position to the nearest grid points. The second step involves solving a differential

equation (Maxwell’s equations or a subset) to obtain the electric and magnetic fields on the grid


from the current or charge density. Finally, the particle acceleration is obtained using Newton’s

law and the particle positions are updated. This involves an interpolation (gather operation) to

obtain the fields at the particle position from the fields on the nearest grid points. Thus PIC

codes have two data structures, particles and fields, that need to communicate with one another.

Usually, most of the time is spent in steps one and three, and most of the CPU time is spent in

only a few subroutines. Textbooks are available which describe such codes [7-8].

PIC codes have been parallelized for distributed memory computers for many years [9], and

they have obtained good scaling with up to 300,000 processors for large problem sizes. The

parallelization is usually achieved by coarse grain domain decomposition, keeping particles and

the fields they need on the same node.

To achieve a streaming algorithm for PIC requires that particles and fields each be read and

written only once. For particles this is the usual case. However, for fields this is not so, since

there are many particles per cell and different particles are at different locations in space and read

different fields. The only way to achieve a streaming algorithm is to keep particles ordered, so

that all the particles which would interpolate to the same grid points are stored together. Then

the fields the particles need can be read only once and saved in registers or a small local array.

Fine grain parallelism can thus be implemented. We will illustrate this with a 2D electrostatic

code, which uses only Poisson’s equation to obtain an electric field, and is derived from one of

the codes in the UPIC framework [10].

Parallel Charge Deposit

We will begin with the first step of the PIC code, the charge deposit. The original Fortran

listing of this procedure with bi-linear interpolation in 2 dimensions (2D) is shown below.

      dimension part(4,nop), q(nx+1,ny+1) ! nop = number of particles
                                          ! nx, ny = number of grid points
      do j = 1, nop
         n = part(1,j)                    ! extract x grid point
         m = part(2,j)                    ! extract y grid point
         dxp = qm*(part(1,j) - real(n))   ! find weights
         dyp = part(2,j) - real(m)
         n = n + 1; m = m + 1             ! add 1 for Fortran
         amx = qm - dxp
         amy = 1.0 - dyp
         q(n+1,m+1) = q(n+1,m+1) + dxp*dyp ! deposit
         q(n,m+1) = q(n,m+1) + amx*dyp
         q(n+1,m) = q(n+1,m) + dxp*amy
         q(n,m) = q(n,m) + amx*amy
      enddo

In this code, the charge on the particle is split into 4 parts, which are then deposited to the 4

nearest grid points. A particle spatial co-ordinate consists of an integer part which contains the

lower grid point, and the deviation from the point. The algorithm first separates the integer part

and the deviation from the particle spatial co-ordinate. The integer part is used to address the

grid points, and the amount deposited is proportional to how close the particle is to that grid point.


Since particles are normally not ordered, each particle will deposit to a different location in

memory.

In the new adaptable streaming algorithm, we need a new data structure. Since particles can

be processed in any order, we can partition them into independent thread groups, and store them

in an array declared as follows:

dimension partc(lth,4,nppmax,mth)

where lth refers to tightly coupled threads, either SIMD cores or thread blocks, while mth

refers to loosely coupled groups of SIMD cores or a grid of thread blocks in CUDA. Because

lth is the most rapidly varying dimension, particles with adjacent values of the first index are

stored in adjacent locations in memory. This is important in achieving stride 1 memory access

(or data coalescing in CUDA). In C, the dimensions would be reversed. The total number of

independent threads is the product lth*mth. The parameter nppmax refers to the maximum

number of particles in each thread. Note that lth is a tunable parameter that we can set to

match the computer architecture.

The charge density, on the other hand, has a data dependency or data hazard, since particles

in different threads can attempt to simultaneously update the same grid point. There are several

possible methods to deal with this data dependency. One way is to use atomic updates, if they

are supported, which treat an instruction such as s = s + x as uninterruptible; locks on

memory can achieve the same goal. Another method is to determine which of several possible

writes actually occurred, and then retry those writes which did not [4]. The last

method is to partition memory with extra guard cells so that each thread writes to a different

location, then add up those locations that refer to the same grid point. This is what is done on

distributed memory computers [9]. Since atomic updates are very slow on the

current NVIDIA hardware, and the second method seemed to be costly with SIMD processors, we

decided on the third method initially.

If particles are sorted by grid, then we can partition the charge density the same way as the

particles.

dimension qs(lth,4,mth)  ! number of threads: lth*mth = nx*ny

where each particle at some grid location can deposit in 4 different locations, 3 of them guard

cells. Note that if particles are sorted by grid, then the integer part of the address does not have

to be stored, just the deviation. This allows one to get greater precision for the spatial co-

ordinates, important when using single precision. The particle co-ordinates would always lie in

the range 0 < x < 1, and 0 < y < 1. The parallel deposit subroutine is shown below:


      dimension s(4)  ! local accumulation array s
      do m = 1, mth   ! outer loops can be done in parallel
      do l = 1, lth
         s(1) = 0.0   ! zero out accumulation array s
         s(2) = 0.0
         s(3) = 0.0
         s(4) = 0.0
         do j = 1, npp(l,m)        ! loop over particles
            dxp = partc(l,1,j,m)   ! find weights
            dyp = partc(l,2,j,m)
            dxp = qm*dxp
            amy = 1.0 - dyp
            amx = qm - dxp
            s(1) = s(1) + dxp*dyp  ! accumulate charge
            s(2) = s(2) + amx*dyp
            s(3) = s(3) + dxp*amy
            s(4) = s(4) + amx*amy
         enddo
         qs(l,1,m) = qs(l,1,m) + s(1)  ! deposit charge
         qs(l,2,m) = qs(l,2,m) + s(2)
         qs(l,3,m) = qs(l,3,m) + s(3)
         qs(l,4,m) = qs(l,4,m) + s(4)
      enddo
      enddo

In this algorithm, the particles at a particular cell first deposit to a local accumulation array s, of

size 4 words. When all the particles are processed, the local accumulation array is added to the

charge density array qs. The loops over thread indices l and m can be done in parallel, where

the variable npp(l,m) contains the number of actual particles assigned to that thread and

where npp(l,m) < nppmax. When the deposit is completed, the 4 locations in qs need to

be added to the appropriate locations in the array q.

The algorithm can be generalized in several ways. For example, if the cost of maintaining

the particle order depends on how many particles are leaving a cell, then this cost can be reduced by

defining a sorting cell to contain multiple grid points. If we define the parameters

ngpx and ngpy to describe the number of grid points in x and y in a cell, respectively, then

particles will have co-ordinates 0 < x < ngpx and 0 < y < ngpy, stored in arbitrary order within

the cell. This also reduces the number of duplicate guard cells needed.

In that case, we have to enlarge the charge density array qs and local accumulation array s:

dimension qs(lth,(ngpx+1)*(ngpy+1),mth), s((ngpx+1)*(ngpy+1))

! number of threads: lth*mth = ((nx-1)/ngpx+1)*((ny-1)/ngpy+1)

The algorithm would also have to be modified to determine which grid point within a cell the particle

belongs to and deposit the charge to the appropriate grid points. The structure of the code would

remain the same, however, first accumulating all the particles in a cell in the local array s, then

adding to the density array qs. There are no data dependencies anywhere in this procedure.
