Particle-in-Cell Plasma Simulation
on the Intel Xeon Phi in PICADOR
Evgeny Efimenko,
Sergey Bastrakov,
Arkady Gonoskov,
Iosif Meyerov,
Michail Shiryaev,
Igor Surmin
ICNSP 2013
Agenda
Introduction to Xeon Phi
PICADOR Overview
Optimization Techniques
Performance Analysis
Scaling Efficiency
HPC Hardware Environment
CPU:
– Peak 100-200 GFLOPS
– 4-16 cores (x86)
– SIMD, branch prediction, out-of-order execution
– C++, Fortran, MPI, OpenMP, TBB
GPU:
– Peak 1-2 TFLOPS
– 500-1000 cores (GPU)
– Scalar, simple, low-frequency cores
– CUDA C, CUDA Fortran, OpenCL, OpenACC
Xeon Phi:
– Peak 1 TFLOPS, 61 cores (x86)
– Wide SIMD, simple, low-frequency cores
– C++, Fortran, MPI, OpenMP, TBB, OpenCL
Intel Xeon Phi Overview
Product family of the Intel Many Integrated Core (MIC) architecture, launched in 2012.
61 cores on shared memory with Linux on board.
Pentium-like x86 cores with 512-bit vector units.
~1 TFLOPS peak in DP, ~2 TFLOPS in SP.
8 to 16 GB GDDR5 memory, 240 to 350 GB/s bandwidth.
Standard programming tools: C++, Fortran, MPI, OpenMP.
Image source: http://software.intel.com/sites/default/files/xeonPhiHardware.jpg
Xeon Phi Architecture Overview
Image source: http://www.extremetech.com/wp-content/uploads/2012/07/Intel-Mic-640x353.jpg
Levels of Parallelism on Xeon Phi
61 cores, 4 hardware threads per core.
– Launch one or several MPI processes with several threads per process (e.g. OpenMP), as sketched below.
– Run at least 2 threads per core for efficiency.
512-bit vector units: capable of 8 simultaneous arithmetic operations in DP, 16 in SP.
– Use SIMD instructions to benefit from the vector units.
– The compiler may generate vector instructions to execute several loop iterations in parallel.
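A minimal sketch of combining these two levels of parallelism (MPI processes plus OpenMP threads); the program only reports how many threads each rank runs. The intended build (Intel MPI wrapper with -mmic for native mode, mentioned in the comment) and all names are illustrative assumptions, not PICADOR's actual startup code.

#include <mpi.h>
#include <omp.h>
#include <cstdio>

// Intended for native mode on the card (Intel MPI + Intel compiler with -mmic assumed).
int main(int argc, char** argv) {
    int provided = 0, rank = 0;
    // Request threaded MPI: each rank will spawn many OpenMP threads.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel
    {
        // Typical Xeon Phi configurations: 1 rank x 240 threads, or a few ranks
        // with 60+ threads each, i.e. 2-4 threads per core.
        #pragma omp master
        printf("rank %d runs %d OpenMP threads\n", rank, omp_get_num_threads());
    }
    MPI_Finalize();
    return 0;
}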
512-bit Vector Units
FMA(a, b, c) = a × b + c with one round-off.
For peak performance, do vector FMAs in all cores:
1064.8 GFLOPS in DP = 61 cores × 1.091 GHz × 1 FMA issue per cycle × 8 FMAs per vector × 2 FLOP per FMA
(for Intel Xeon Phi 7110X)
One 512-bit vector operation:
[x7 x6 x5 x4 x3 x2 x1 x0] + [y7 y6 y5 y4 y3 y2 y1 y0] = [z7 z6 z5 z4 z3 z2 z1 z0]
(8 double-precision values per operand)
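As a hedged illustration of one such 512-bit operation in code, the sketch below uses the Intel compiler's _mm512_* intrinsics to fuse a multiply and an add over 8 doubles at a time; the function and array names are illustrative, and the arrays are assumed 64-byte aligned with a length that is a multiple of 8.

#include <immintrin.h>

// z[i] = x[i] * y[i] + z[i], 8 doubles per vector FMA.
void fmaArrays(const double* x, const double* y, double* z, int n) {
    for (int i = 0; i < n; i += 8) {
        __m512d vx = _mm512_load_pd(x + i);
        __m512d vy = _mm512_load_pd(y + i);
        __m512d vz = _mm512_load_pd(z + i);
        vz = _mm512_fmadd_pd(vx, vy, vz);  // one fused multiply-add: 16 FLOP
        _mm512_store_pd(z + i, vz);
    }
}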
How To Benefit from Xeon Phi?
Used levels of parallelism   | Peak, GFLOPS | Equivalent peak
Multi-threaded + vector      | 1064         | 5 CPUs, GPU
Multi-threaded + scalar      | 133          | CPU
Single-threaded + vector     | 17           | CPU core
Single-threaded + scalar     | 2            | 1/10 CPU core
Essential: good scaling on shared memory up to at least 120 threads.
Strongly recommended: vectorization.
Overall peak of Xeon Phi is ~5x the peak of a CPU, but single-threaded scalar performance is ~0.1x of a CPU core.
A Note on Parallelism
A fast sequential operation on the CPU can become a bottleneck on Xeon Phi.
E.g. PML, boundary pulse generator, particle migration between cells, periodic boundary conditions (on a single node).
Need to make everything parallel on Xeon Phi, or use offload mode and perform sequential operations on the host CPU. Similar to GPU usage.
Programming for Xeon Phi
Requires the Intel C++ / Fortran compiler.
Native mode: use as a 61-core CPU under Linux:
C++ / Fortran + OpenMP / TBB / Cilk Plus
Symmetric mode: use as a cluster node that can host several MPI processes:
C++ / Fortran + MPI + OpenMP / TBB / Cilk Plus
Offload mode: use as a coprocessor that runs kernels launched from the host CPU, similar to GPGPU:
C++ / Fortran + offload compiler directives + OpenMP / TBB / Cilk Plus
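A minimal sketch of the offload mode, assuming the Intel C++ compiler on a host with a Xeon Phi card; the saxpy-like kernel, array sizes, and names are illustrative only, not a PICADOR routine.

#include <cstdio>

const int N = 1 << 20;
static double x[N], y[N];

int main() {
    for (int i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 2.0; }
    // Copy x and y to the coprocessor, run the kernel there with OpenMP
    // threads, and copy y back when the offload region ends.
    #pragma offload target(mic) in(x : length(N)) inout(y : length(N))
    {
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            y[i] += 3.0 * x[i];
    }
    printf("y[0] = %f\n", y[0]);
    return 0;
}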
Optimization for Xeon Phi
Optimization recommendations for CPUs also apply to Xeon Phi: multi-threading, vectorization, cache-friendliness, etc.
But there are nuances on Xeon Phi vs. CPU that make it more challenging:
– Many more threads: 120 to 240 vs. 4 to 16.
– Wider vector units: 512-bit vs. 128- to 256-bit.
Example
Consider a multi-threaded PIC code with 99% scaling efficiency on 8 cores, running at 30% of CPU peak.
In the ideal case: 30% of Xeon Phi peak ≈ 320 GFLOPS.
In reality:
– Only 25% of the arithmetic is vectorized (not all loops, not the full 512-bit width): −240 GFLOPS.
– 70% scaling efficiency on 61 cores: −25 GFLOPS.
Overall: about 5% of peak on Xeon Phi, i.e. 55 GFLOPS.
Reality vs. ideal case ≈ 1/6.
Particle-in-Cell
The PIC computational loop per time step:
Field solver (FDTD, NDF) → Lorentz force computation (1st-order field interpolation) → Particle push (Boris) → Current deposition (1st order) → back to the field solver.
PICADOR Overview
PICADOR – a tool for large-scale 3D Particle-in-Cell plasma simulation.
Under development since 2010.
Aimed at heterogeneous cluster systems with CPUs, GPUs, and Xeon Phi accelerators.
– C++, MPI, OpenMP, CUDA
Design idea:
– Flexible and extendable high-level architecture.
– Heavily tuned low-level performance-critical routines.
* S. Bastrakov, R. Donchenko, A. Gonoskov, E. Efimenko, A. Malyshev, I. Meyerov, I. Surmin. Particle-in-cell plasma simulation on heterogeneous cluster systems. Journal of Computational Science, 3 (2012), 474-479.
PICADOR Time Distribution
MPI exchanges and single-node performance can be optimized independently.
Thus we consider only the single-node case.
[Chart: PICADOR time distribution on CPU, GPU, and Xeon Phi]
Baseline Implementation
Global array for field and current density values.
Particles in each cell are stored in a separate array and processed by cells in parallel.
Particle push pseudo-code:
#pragma omp parallel for
for each cell
    for each particle in the cell
        interpolate field (1st order)
        push (Boris)
        check if particle leaves cell
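As a hedged sketch of what the "push (Boris)" step does for a single particle: a half electric kick, a magnetic rotation, a second half kick, and a coordinate update. The non-relativistic form is shown; the Vec3 type and all names are illustrative, not PICADOR's actual data structures.

struct Vec3 { double x, y, z; };

static Vec3 cross(const Vec3& a, const Vec3& b) {
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}

// Advance velocity v and position r by one time step dt in fields E, B;
// qm is the particle's charge-to-mass ratio.
void borisPush(Vec3& r, Vec3& v, const Vec3& E, const Vec3& B,
               double qm, double dt) {
    const double h = 0.5 * qm * dt;
    // Half acceleration by the electric field.
    Vec3 vMinus = { v.x + h * E.x, v.y + h * E.y, v.z + h * E.z };
    // Rotation around the magnetic field direction.
    Vec3 t = { h * B.x, h * B.y, h * B.z };
    double t2 = t.x * t.x + t.y * t.y + t.z * t.z;
    Vec3 w = cross(vMinus, t);
    Vec3 vPrime = { vMinus.x + w.x, vMinus.y + w.y, vMinus.z + w.z };
    Vec3 s = { 2.0 * t.x / (1.0 + t2), 2.0 * t.y / (1.0 + t2), 2.0 * t.z / (1.0 + t2) };
    Vec3 u = cross(vPrime, s);
    Vec3 vPlus = { vMinus.x + u.x, vMinus.y + u.y, vMinus.z + u.z };
    // Second half acceleration by the electric field.
    v = { vPlus.x + h * E.x, vPlus.y + h * E.y, vPlus.z + h * E.z };
    // Coordinate update with the new velocity.
    r = { r.x + v.x * dt, r.y + v.y * dt, r.z + v.z * dt };
}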
“No Effort” Port to Xeon Phi
The code already uses OpenMP, so we can use native mode on Xeon Phi with no modifications.
Rebuild with the Intel C++ Compiler and the -mmic option.
– Some libraries are not tested on Xeon Phi and may require effort (even HDF5 needed a hotfix).
Time to port PICADOR to Xeon Phi: 1 day.
Performance on Xeon Phi is close to 16 CPU cores.
Time to develop our GPU-based implementation with similar performance: ~2 months.
– Need to maintain C++ and CUDA versions of all computationally intensive functions.
Benchmark
Measure performance on a benchmark: 40x40x40 grid, 50 particles per cell, DP.
Hardware:
– 2x 4-core Intel Xeon 5150, peak 68 GFLOPS *.
– 2x 8-core Intel Xeon E5-2690, peak 371 GFLOPS **.
– 61-core Intel Xeon Phi 7110X, peak 1076 GFLOPS **.
* UNN, ** MVS-10P at JSC RAS
Measure two metrics:
– Time per iteration per particle (ns per particle update).
– % of peak performance.
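A minimal sketch of how the first metric can be computed from a timed run; the function and parameter names are illustrative assumptions.

// Average cost of one particle update in nanoseconds: total wall time
// divided by the number of particle updates performed over the run.
double nsPerParticleUpdate(double wallTimeSeconds,
                           long long timeSteps, long long particles) {
    return wallTimeSeconds * 1e9 / (double)(timeSteps * particles);
}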
Baseline Performance
[Chart: time per particle update (ns): 853, 108, 457, 59, 62]
% of peak: 8%, 7.9%, 5.5%, 2.6%, 0.9%
Employing Locality
Particle-grid interactions are spatially local.
When a cell is processed, preload field values / accumulate current locally (see the sketch after the pseudo-code).
#pragma omp parallel for
for each cell
    preload 27 local field values
    for each particle in the cell
        locally interpolate field
        push (Boris)
        check if particle leaves cell
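A minimal sketch of the per-cell preload step, assuming a scalar field stored in a row-major global array and an interior cell whose 3x3x3 neighbourhood is valid; the names and layout are illustrative, not PICADOR's actual data structures.

// Copy the 27 field values any particle of cell (ci, cj, ck) may touch
// into a small local array that stays hot in cache.
void preloadLocalField(const double* field, int ny, int nz,
                       int ci, int cj, int ck, double local[27]) {
    int idx = 0;
    for (int di = -1; di <= 1; ++di)
        for (int dj = -1; dj <= 1; ++dj)
            for (int dk = -1; dk <= 1; ++dk) {
                int i = ci + di, j = cj + dj, k = ck + dk;
                local[idx++] = field[(i * ny + j) * nz + k];  // row-major indexing
            }
}
// Particles of the cell then interpolate from local[] instead of gathering
// from the global array; current can be accumulated into a similar local
// buffer and added to the global array once per cell.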
Impact of Employing Locality
[Chart: time per particle update (ns): 347, 45, 172, 15, 17]
% of peak: 19.7%, 19.1%, 14.6%, 10.7%, 3.1%
Speedup over baseline: 2.5x, 2.4x, 2.7x, 4x, 3.6x
Employing Loop Vectorization
Loop vectorization – simultaneous execution of several iterations on the vector units via SIMD instructions.
The compiler may perform vectorization of the innermost loop.
Main requirements (for most cases):
– No data dependencies between iterations.
– All internal function calls are inlined.
– No internal conditional statements.
[Diagram: consecutive loop iterations 0, 1, 2, ... packed into one vector operation]
Employing Loop Vectorization
Ensure inlining of the interpolation and push functions.
Subdivide the loop over particles into three loops (interpolation, push, check) to avoid conditionals and data dependencies between iterations.
Use compiler directives to guarantee that pointers do not alias; otherwise the compiler suspects a dependence.
Employ loop tiling for interpolation and push to save the interpolated fields between the two loops.
After these modifications the push loop is vectorized; the interpolation loop can be vectorized but turned out to be inefficient.
Employing Vectorization
#pragma omp parallel for
for each cell
    preload 27 local field values
    for each particle tile
        #pragma novector
        for each particle in the tile
            locally interpolate field
        #pragma ivdep
        for each particle in the tile
            push (Boris)
    for each particle in the cell
        check if particle leaves cell
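A hedged C++ sketch of the kind of inner loop the compiler vectorizes after these modifications: independent iterations, no calls, no branches, and pointers declared as non-aliasing. The simplified 1D push and all names are illustrative, not PICADOR's actual routines.

// Push over one tile of particles, using fields already interpolated into
// the field[] buffer by the preceding loop.
void pushTile(double* __restrict pos, double* __restrict vel,
              const double* __restrict field, int n, double qmdt, double dt) {
    #pragma ivdep  // promise the compiler there are no cross-iteration dependencies
    for (int i = 0; i < n; ++i) {
        vel[i] += qmdt * field[i];  // acceleration from the interpolated field
        pos[i] += vel[i] * dt;      // coordinate update
    }
}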
Impact of Employing Vectorization
[Chart: time per particle update (ns): 199, 25, 95, 8, 12]
% of peak: 34.3%, 34.1%, 26.4%, 20.9%, 4.5%
Speedup over the previous version: 1.7x, 1.8x, 1.8x, 2x, 1.4x
More Threads on Xeon Phi
1 thread per core is a priori inefficient because of the hardware: every clock cycle an instruction is issued for a different thread.
[Chart: time per particle update (ns) for 1, 2, 3, 4 threads per core: 20.3, 13.2, 11.2, 10.6]
Scaling on CPUs
2x 8-core CPUs on shared memory, 1 MPI process.
Hyper-threading significantly improves scaling.
Ideal scaling up to 8 cores (1 CPU) with HT.
Much worse scaling from 1 CPU to 2 CPUs.
[Chart: parallel scaling; efficiency annotations: 83%, 73%]
Scaling on CPUs
For several CPUs on shared memory the best option is to launch one MPI process per CPU.
With this option: ideal scaling from 1 to 16 cores.
[Chart: scaling with 1 MPI process vs. 2 MPI processes; efficiency annotations: 99%, 83%, 73%]
Scaling on Xeon Phi
Keep 4 threads per core.
Use 1 MPI process.
[Chart: scaling on Xeon Phi; efficiency annotation: 72%]
MPI + OpenMP Configuration
Compare different process/thread configurations.
[Chart: run time in seconds for different MPI process / OpenMP thread configurations]
Optimum # Particles per Cell on Xeon Phi
We vectorize the loop over particles in a cell.
It is efficient only if there are enough iterations.
Performance nearly stabilizes with > 100 particles per cell.
Supercells
A widely used idea in GPU-based implementations: store and handle particles from neighboring cells together.
Select the supercell size to get ~100 particles per supercell.
Result: 1.5x speedup.
Summary
We ported a parallel 3D PIC code to Xeon Phi.
Optimizations were beneficial for both CPU and Xeon Phi:
–2.5x to 4x speedup due to memory access locality.
–1.4x to 2x speedup due to vectorization.
Final performance metrics:
–34% of peak, 50 ns / particle update on 4-core Xeon.
–21% of peak, 15 ns / particle update on 8-core Xeon.
–6% of peak, 12 ns / particle update on Xeon Phi.
Reasons for lower efficiency on Xeon Phi vs. CPU:
–Worse scaling: 72% vs. 99%.
–Subpar vectorization efficiency: 1.4x (ideal 8x).
Xeon Phi Specific Optimizations To Try
Improve data layout and alignment for better utilization of vector loads and vectorization.
– Array of full particles, or separate arrays of positions and velocities, or separate arrays of everything?
Another interpolation scheme: first interpolate from the Yee grid to a cell-aligned grid, then use it for the particles.
– In progress; a 15% benefit so far, more expected.
Apply vectorization not to loops over particles but to the internal operations on 3D vectors.
– Can process two pairs of 3D vectors in one operation.
Use #pragma simd for full vectorization control (see the sketch below).
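A minimal sketch of the last two ideas, assuming the Intel compiler's #pragma simd and a structure-of-arrays particle layout; the ParticlesSoA type and the trivial drift kernel are illustrative candidates, not PICADOR's chosen layout.

// One possible structure-of-arrays layout: separate arrays of positions
// and velocities, which keeps vector loads contiguous and easy to align.
struct ParticlesSoA {
    double *x, *y, *z;     // positions
    double *vx, *vy, *vz;  // velocities
};

// #pragma simd asks the compiler to vectorize regardless of its own
// heuristics (the programmer guarantees independent iterations).
void drift(ParticlesSoA p, int n, double dt) {
    #pragma simd
    for (int i = 0; i < n; ++i) {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
        p.z[i] += p.vz[i] * dt;
    }
}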
THANK YOU FOR YOUR ATTENTION!
evgeny.efimenko@gmail.com