Adaptable Particle-in-Cell Algorithms for Graphical Processing Units
Viktor K. Decyk (1,2) and Tajendra V. Singh (2)
(1) Department of Physics and Astronomy
(2) Institute for Digital Research and Education
University of California, Los Angeles
Los Angeles, CA 90095
Abstract
We developed new parameterized Particle-in-Cell algorithms and data structures for
emerging multi-core and many-core architectures. Four parameters allow tuning of this PIC
code to different hardware configurations. Particles are kept ordered at each time step. The first
application of these algorithms is to NVIDIA Graphical Processing Units, where speedups of
about 15-25 compared to an Intel Nehalem processor were obtained for a simple 2D electrostatic
code. Electromagnetic codes are expected to get higher speedups due to their greater
computational intensity.
Introduction
Computer architectures are rapidly evolving to include more and more processor cores. The
designs of these new architectures also vary considerably. These include incremental evolutions
of existing architectures, heterogeneous mixtures of processors such as the IBM Cell or AMD
APU, and accelerators, such as NVIDIA’s Graphical Processing Units (GPUs). As the most
advanced computers move from petaflops to exaflops in the next decade, it is likely that they will
consist of complex nodes with such multi-core and many-core processors. This variety results in
great challenges for application programmers. Not only is parallel computing challenging in its
own right, there is also a growing variety of programming models evolving, as older
programming models prove inadequate and new ones are proposed. These are disruptive times.
Application developers resist developing different codes for different architectures. To cope
with this bewildering variety, they need to find some common features that they can use to
parameterize their application algorithms to adapt to different architectures. In our opinion, the
common feature that we can exploit in future advanced architectures is that processors will be
organized in a hierarchy. At the lowest level, they will consist of a number of tightly coupled
SIMD cores, which are all executing the same instruction in lock step and sharing fast memory.
At the next level will be loosely coupled groups of SIMD cores that share a slower memory.
Finally, there will be accumulations of groups that share no memory and communicate with
message-passing. Furthermore, accessing global memory may be even more of a bottleneck than
it is now, since memory speeds will not keep up with processor counts, so that lightweight
threads to hide memory latency will be necessary. We think that a cluster of GPUs is the closest
currently existing architecture to this future and therefore a good choice for experimentation and
development.
Particle-in-Cell (PIC) codes are among the important applications in plasma physics. These
codes model plasmas at the microscopic level by self-consistently integrating the trajectories of
charged particles with fields that the particles themselves produce. PIC applications are very
compute intensive, as the number of particles in these codes can vary from many thousands to
many billions. Such codes are used in many different areas of plasma physics, including space
plasma physics, advanced accelerators, and fusion energy research. The US Department of
Energy allocated over 200 million node hours on their most advanced computers to PIC codes
in their INCITE program in 2010, about 12% of the total. This paper will report on new
algorithms that we have developed for PIC codes that work well on GPUs, and that we feel are
adaptable to other future architectures as they evolve. This field is very new and a variety of
approaches to developing PIC codes on GPUs are under development [1-5].
Graphical Processing Units
GPUs consist of a number of SIMD multi-processors (30 on the Tesla C1070). Although
each SIMD multi-processor has 8 cores, the hardware executes 32 threads together in what
NVIDIA calls a warp. Adjacent threads are organized into blocks, which are normally between
32 and 512 threads in size. Blocks can share a small (16 KByte), fast memory, which has a
latency of a few clock cycles, and have fast barrier synchronization. The GPU has a global
memory (4 GBytes on the C1070) accessible by any thread. High memory bandwidth is
possible, but is achieved typically when 16 adjacent threads read memory within the same 64
byte block, which NVIDIA calls data coalescing. It is usually achieved by having adjacent
threads read memory with stride 1 access. Memory has substantial latency (hundreds of clock
cycles), which is hidden by lightweight threads that can be swapped out in one clock cycle.
Thousands of threads can be supported simultaneously and millions can be outstanding. Fine
grain parallelism is needed to make efficient use of the hardware. Because access to global
memory is usually the bottleneck, streaming algorithms are optimal, where global memory is
read only once.
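The effect of stride on coalescing can be illustrated with a deliberately simplified model, sketched here in Python. It only counts how many aligned 64-byte segments a group of 16 adjacent threads touches for a given stride. The function name and the assumption of 4-byte words starting at an aligned base address are ours, and the real hardware rules are more detailed than this.

```python
def segments_touched(stride_words, threads=16, word_bytes=4, segment_bytes=64):
    """Count the distinct aligned memory segments read by adjacent threads.

    Simplified model (an assumption, not the exact hardware rule):
    thread t reads one word at index t*stride_words from an aligned base.
    """
    addresses = [t * stride_words * word_bytes for t in range(threads)]
    return len({addr // segment_bytes for addr in addresses})


unit_stride = segments_touched(1)   # all 16 reads fall in a single segment
stride_four = segments_touched(4)   # the same reads are spread over 4 segments
```

In this model, stride 1 access needs one segment per group of 16 threads, while larger strides need proportionally more memory transactions for the same amount of useful data.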
NVIDIA has developed a programming model called CUDA for this device, which is based
on C, with extensions. OpenCL is also supported for code which will also be run on other
devices, such as Intel SSE. In addition, the Portland Group has developed a Fortran compiler for
GPUs.
Streaming Algorithm for PIC
Particle-in-Cell codes are a type of N-body code (where all particles are interacting with all
the others), but they differ from molecular dynamics (MD) codes. In MD codes, where particles
interact directly with each other, the calculation is of order N^2. In PIC codes, particles interact
via the electric and magnetic fields that they produce. This makes the calculation of order N, and
many more particles can be used than in typical MD, up to a trillion currently [6].
A PIC code has three major steps in the inner loop. In the first step, a charge or current
density is deposited on a grid. This involves an inverse interpolation (scatter operation) from the
particle position to the nearest grid points. The second step involves solving a differential
equation (Maxwell’s equations or a subset) to obtain the electric and magnetic fields on the grid
from the current or charge density. Finally, the particle acceleration is obtained using Newton’s
law and the particle positions are updated. This involves an interpolation (gather operation) to
obtain the fields at the particle position from the fields on the nearest grid points. Thus PIC
codes have two data structures, particles and fields, that need to communicate with one another.
Usually, most of the time is spent in steps one and three, and most of the CPU time is spent in
only a few subroutines. Textbooks are available which describe such codes [7-8].
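As an illustration of the gather operation in the third step, the following Python sketch interpolates a grid quantity to a particle position with bi-linear weights. It is a minimal sketch, not the code used in the paper: the function name gather, the 0-based indexing, and the nested-list layout field[m][n] are assumptions made for the example.

```python
def gather(field, x, y):
    """Bi-linear interpolation of a grid quantity to the position (x, y).

    field[m][n] is assumed to hold the value at grid point x = n, y = m.
    """
    n, m = int(x), int(y)              # lower grid point
    dxp, dyp = x - n, y - m            # deviation from the lower grid point
    amx, amy = 1.0 - dxp, 1.0 - dyp    # complementary weights
    return (amx * amy * field[m][n] + dxp * amy * field[m][n + 1]
            + amx * dyp * field[m + 1][n] + dxp * dyp * field[m + 1][n + 1])


# A field that is linear in x and y is reproduced exactly by the interpolation.
field = [[n + 2.0 * m for n in range(4)] for m in range(3)]
value = gather(field, 1.25, 0.5)
```

Each particle reads four grid values here, which is why the order of the particles determines how often the same field values are fetched from memory.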
PIC codes have been parallelized for distributed memory computers for many years [9], and
they have obtained good scaling with up to 300,000 processors for large problem sizes. The
parallelization is usually achieved by coarse grain domain decomposition, keeping particles and
the fields they need on the same node.
Achieving a streaming algorithm for PIC requires that particles and fields each be read and
written only once. For particles this is the usual case. However, for fields this is not so, since
there are many particles per cell and different particles are at different locations in space and read
different fields. The only way to achieve a streaming algorithm is to keep particles ordered, so
that all the particles which would interpolate to the same grid points are stored together. Then
the fields the particles need can be read only once and saved in registers or a small local array.
Fine grain parallelism can thus be implemented. We will illustrate this with a 2D electrostatic
code, which uses only Poisson’s equation to obtain an electric field, and is derived from one of
the codes in the UPIC framework [10].
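The idea of keeping particles ordered can be sketched in Python as a simple binning step that groups particles by the cell containing them. This is only an illustration: the function name order_by_cell, the flattened cell index m*nx + n, and the choice to store just the deviation from the lower grid point are assumptions of the example, and a production code would maintain the order incrementally rather than rebuild it.

```python
def order_by_cell(particles, nx, ny):
    """Group particles by cell so that each cell's fields are needed only once.

    particles: list of (x, y) with 0 <= x < nx and 0 <= y < ny.
    Returns a list where cells[m*nx + n] holds the (dx, dy) deviations of the
    particles whose lower grid point is (n, m).
    """
    cells = [[] for _ in range(nx * ny)]
    for x, y in particles:
        n, m = int(x), int(y)                  # cell containing the particle
        cells[m * nx + n].append((x - n, y - m))
    return cells


cells = order_by_cell([(0.5, 0.25), (2.75, 1.5), (0.25, 0.75)], nx=4, ny=2)
```

Once particles are grouped this way, all particles in cells[k] interpolate to the same four grid points, so those field values can be loaded once and reused.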
Parallel Charge Deposit
We will begin with the first step of the PIC code, the charge deposit. The original Fortran
listing of this procedure with bi-linear interpolation in 2 dimensions (2D) is shown below.
dimension part(4,nop), q(nx+1,ny+1)  ! nop = number of particles
                                     ! nx, ny = number of grid points
do j = 1, nop
   n = part(1,j)                     ! extract x grid point
   m = part(2,j)                     ! extract y grid point
   dxp = qm*(part(1,j) - real(n))    ! find weights
   dyp = part(2,j) - real(m)
   n = n + 1; m = m + 1              ! add 1 for Fortran
   amx = qm - dxp
   amy = 1.0 - dyp
   q(n+1,m+1) = q(n+1,m+1) + dxp*dyp ! deposit
   q(n,m+1) = q(n,m+1) + amx*dyp
   q(n+1,m) = q(n+1,m) + dxp*amy
   q(n,m) = q(n,m) + amx*amy
enddo
In this code, the charge on the particle is split into 4 parts, which are then deposited to the 4
nearest grid points. A particle spatial co-ordinate consists of an integer part which contains the
lower grid point, and the deviation from the point. The algorithm first separates the integer part
and the deviation from the particle spatial co-ordinate. The integer part is used to address the
grid points, and the amount deposited is proportional to how close the particle is to that grid point.
Since particles are normally not ordered, each particle will deposit to a different location in
memory.
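For readers who want to experiment with the deposit, the Fortran loop above can be transcribed into Python roughly as follows. This is a sketch under a few assumptions of ours: 0-based indexing, a nested list q[m][n] instead of the Fortran array q(n,m), and the function name deposit. The variable names follow the listing.

```python
def deposit(particles, nx, ny, qm=1.0):
    """Deposit charge qm per particle onto an (nx+1) x (ny+1) grid.

    particles: list of (x, y) positions with 0 <= x < nx and 0 <= y < ny.
    Returns q as a nested list indexed q[m][n] (y index first).
    """
    q = [[0.0] * (nx + 1) for _ in range(ny + 1)]
    for x, y in particles:
        n, m = int(x), int(y)        # integer part: lower grid point
        dxp = qm * (x - n)           # weights, with the charge folded into x
        dyp = y - m
        amx = qm - dxp
        amy = 1.0 - dyp
        q[m + 1][n + 1] += dxp * dyp
        q[m + 1][n] += amx * dyp
        q[m][n + 1] += dxp * amy
        q[m][n] += amx * amy
    return q


q = deposit([(1.25, 0.5)], nx=4, ny=2, qm=2.0)
```

The four weights always sum to qm, so the total charge on the grid equals the total charge of the particles regardless of where they sit within a cell.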
In the new adaptable streaming algorithm, we need a new data structure. Since particles can
be processed in any order, we can partition them into independent thread groups, and store them
in an array declared as follows:
dimension partc(lth,4,nppmax,mth)
where lth refers to tightly coupled threads, either SIMD cores or thread blocks, while mth
refers to loosely coupled groups of SIMD cores or a grid of thread blocks in CUDA. Because
lth is the most rapidly varying dimension, particles with adjacent values of the first index are
stored in adjacent locations in memory. This is important in achieving stride 1 memory access
(or data coalescing in CUDA). In C, the dimensions would be reversed. The total number of
independent threads is the product lth*mth. The parameter nppmax refers to the maximum
number of particles in each thread. Note that lth is a tunable parameter that we can set to
match the computer architecture.
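The claim about stride 1 access follows from Fortran's column-major storage order, and can be checked with a small Python sketch that computes the linear memory offset of an element of partc. The helper name fortran_offset and the sizes used in the example (lth = 32, nppmax = 10) are illustrative assumptions, not values from the paper.

```python
def fortran_offset(l, i, j, m, lth, nppmax, ncoord=4):
    """0-based linear offset of partc(l,i,j,m) for dimension
    partc(lth,ncoord,nppmax,mth), using 1-based Fortran indices and
    column-major storage (first index varies fastest)."""
    return (l - 1) + lth * ((i - 1) + ncoord * ((j - 1) + nppmax * (m - 1)))


lth, nppmax = 32, 10   # illustrative sizes only
step_l = fortran_offset(2, 1, 1, 1, lth, nppmax) - fortran_offset(1, 1, 1, 1, lth, nppmax)
step_i = fortran_offset(1, 2, 1, 1, lth, nppmax) - fortran_offset(1, 1, 1, 1, lth, nppmax)
```

Adjacent values of l are one word apart in memory, while adjacent co-ordinates of the same particle are lth words apart, so threads l = 1, 2, 3, ... reading the same co-ordinate of their respective particles access consecutive words.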
The charge density, on the other hand, has a data dependency or data hazard, since particles
in different threads can attempt to simultaneously update the same grid point. There are several
possible methods to deal with this data dependency. One way is to use atomic updates, where the
hardware supports them, which treat an instruction such as s = s + x as uninterruptible. Locks
on memory can achieve the same goal. Another method is to determine which of several possible
writes actually occurred, and then try again for those writes which did not occur [4]. The last
method is to partition memory with extra guard cells so that each thread writes to a different
location, then add up those locations that refer to the same grid point. This is what is done on
distributed memory computers [9]. Since atomic updates are considered to be very slow in the
current NVIDIA hardware, and the second method seemed to be costly with SIMD processors, we
decided on the third method initially.
If particles are sorted by grid, then we can partition the charge density the same way as the
particles.
dimension qs(lth,4,mth)  ! number of threads: lth*mth = nx*ny
where each particle at some grid location can deposit in 4 different locations, 3 of them guard
cells. Note that if particles are sorted by grid, then the integer part of the address does not have
to be stored, just the deviation. This allows one to get greater precision for the spatial
co-ordinates, which is important when using single precision. The particle co-ordinates would always lie in
the range 0 ≤ x < 1, and 0 ≤ y < 1. The parallel deposit subroutine is shown below:
dimension s(4)                        ! local accumulation array s
do m = 1, mth                         ! outer loops can be done in parallel
   do l = 1, lth
      s(1) = 0.0                      ! zero out accumulation array s
      s(2) = 0.0
      s(3) = 0.0
      s(4) = 0.0
      do j = 1, npp(l,m)              ! loop over particles
         dxp = partc(l,1,j,m)         ! find weights
         dyp = partc(l,2,j,m)
         dxp = qm*dxp
         amy = 1.0 - dyp
         amx = qm - dxp
         s(1) = s(1) + dxp*dyp        ! accumulate charge
         s(2) = s(2) + amx*dyp
         s(3) = s(3) + dxp*amy
         s(4) = s(4) + amx*amy
      enddo
      qs(l,1,m) = qs(l,1,m) + s(1)    ! deposit charge
      qs(l,2,m) = qs(l,2,m) + s(2)
      qs(l,3,m) = qs(l,3,m) + s(3)
      qs(l,4,m) = qs(l,4,m) + s(4)
   enddo
enddo
In this algorithm, the particles at a particular cell first deposit to a local accumulation array s, of
size 4 words. When all the particles are processed, the local accumulation array is added to the
charge density array qs. The loops over thread indices l and m can be done in parallel, where
the variable npp(l,m) contains the number of actual particles assigned to that thread and
where npp(l,m) ≤ nppmax. When the deposit is completed, the 4 locations in qs need to
be added to the appropriate locations in the array q.
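The two stages described above, per-cell accumulation followed by adding the guard cells into q, can be sketched in Python as follows. The mapping from a thread to a cell is not spelled out in the text, so this sketch assumes one flattened cell index k = m*nx + n per thread and a nested list q[m][n]. The function names are ours.

```python
def deposit_sorted(cells, qm=1.0):
    """Stage 1: each cell (one thread) accumulates into its own 4-word array.

    cells[k] holds the (dx, dy) deviations of the particles in cell k.
    Returns qs, where qs[k] plays the role of qs(l,1:4,m) in the listing.
    """
    qs = []
    for cell in cells:               # iterations are independent of each other
        s = [0.0] * 4                # local accumulation array
        for dx, dy in cell:
            dxp = qm * dx
            dyp = dy
            amx = qm - dxp
            amy = 1.0 - dyp
            s[0] += dxp * dyp
            s[1] += amx * dyp
            s[2] += dxp * amy
            s[3] += amx * amy
        qs.append(s)
    return qs


def add_guard_cells(qs, nx, ny):
    """Stage 2: add each cell's 4 partial sums to the shared grid q[m][n]."""
    q = [[0.0] * (nx + 1) for _ in range(ny + 1)]
    for k, s in enumerate(qs):
        m, n = divmod(k, nx)         # assumed cell ordering: k = m*nx + n
        q[m + 1][n + 1] += s[0]
        q[m + 1][n] += s[1]
        q[m][n + 1] += s[2]
        q[m][n] += s[3]
    return q


cells = [[(0.5, 0.5)], [], [], [(0.25, 0.75), (0.5, 0.5)]]   # 2 x 2 cells
q = add_guard_cells(deposit_sorted(cells), nx=2, ny=2)
```

Only the second stage touches grid points shared between cells. In this serial sketch that is harmless; a parallel version would need to organize the final sum so that no two threads update the same element of q at once.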
The algorithm can be generalized in several ways. For example, if the cost of maintaining
the particle order depends on how many particles are leaving a grid, then this can be reduced by
defining a sorting cell to contain multiple grid points. For example, if we define the parameters
ngpx and ngpy to describe the number of gridpoints in x and y in a cell, respectively, then
particles will have co-ordinates 0 ≤ x < ngpx, and 0 ≤ y < ngpy, stored in arbitrary order within
the cell. This also reduces the number of duplicate guard cells needed.
In that case, we have to enlarge the charge density array qs and local accumulation array s:
dimension qs(lth,(ngpx+1)*(ngpy+1),mth), s((ngpx+1)*(ngpy+1))
! number of threads: lth*mth = ((nx-1)/ngpx+1)*((ny-1)/ngpy+1)
The algorithm would also have to be modified to determine which grid within a cell the particle
belongs to and deposit the charge to the appropriate grid points. The structure of the code would
remain the same, however, first accumulating all the particles in a cell in the local array s, then
adding to the density array qs. There are no data dependencies anywhere in this procedure.
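A possible shape for the generalized accumulation is sketched below in Python. The thread count uses the integer-division formula given above. The ordering of grid points inside the local array s (x varying fastest) and the function names are assumptions made for this example, since the text does not fix them.

```python
def num_sorting_cells(nx, ny, ngpx, ngpy):
    """Number of sorting cells (threads), using integer division as in Fortran."""
    return ((nx - 1) // ngpx + 1) * ((ny - 1) // ngpy + 1)


def accumulate_cell(cell, ngpx, ngpy, qm=1.0):
    """Accumulate all particles of one sorting cell into a local array s.

    cell: list of (x, y) with 0 <= x < ngpx and 0 <= y < ngpy, in any order.
    s has (ngpx+1)*(ngpy+1) entries; index = m*(ngpx+1) + n is an assumed layout.
    """
    row = ngpx + 1
    s = [0.0] * (row * (ngpy + 1))
    for x, y in cell:
        n, m = int(x), int(y)        # grid point within the sorting cell
        dxp = qm * (x - n)
        dyp = y - m
        amx = qm - dxp
        amy = 1.0 - dyp
        base = m * row + n
        s[base] += amx * amy
        s[base + 1] += dxp * amy
        s[base + row] += amx * dyp
        s[base + row + 1] += dxp * dyp
    return s


s = accumulate_cell([(1.5, 0.25)], ngpx=2, ngpy=2)
```

With ngpx = ngpy = 1 this reduces to the 4-word accumulation array of the basic algorithm, apart from the ordering of the entries.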