CAAR for Frontier - An ORNL Project
Jan 2020
TECHNICAL REPORT
Analysis of PIConGPU’s Three Most Intensive Kernels from NVProf on Summit
Matt Leinhauser1, Sergei Bastrakov2, Rene Widera3, Alexander Debus4, Michael Bussmann5, Guido Juckeland6,
Arghya Chatterjee7, Sunita Chandrasekaran8
1, 8 University of Delaware (UDEL)
2, 3, 4, 5, 6 Helmholtz-Zentrum Dresden-Rossendorf (HZDR)
7 Oak Ridge National Laboratory (ORNL)
Contact: {mattl, schandra}@udel.edu, chatterjeea@ornl.gov
CAAR PI: Sunita Chandrasekaran, University of Delaware, USA
CAAR Liaison: Arghya Chatterjee, Oak Ridge National Laboratory, USA
This technical report summarizes findings from analyzing PIConGPU's three most intensive kernels with the NVProf profiler on the Summit system at Oak Ridge National Laboratory (ORNL). The kernels, Current Deposition (also known as Compute Current), Particle Push (Move and Mark), and Shift Particles, are known to be some of the biggest kernel issues in PIConGPU. The Current Deposition and Particle Push kernels are also essential, as both set up the particle attributes for running any physics simulation with PIConGPU, so it is crucial to improve the performance of these two kernels. Below are some suggestions on how to improve these kernels. This analysis was performed using the minimum dimensions, a grid size of 240 x 272 x 224, and 10 time steps on the Mid-November Figure of Merit (FOM) run setup. The Traveling Wave Electron Acceleration (TWEAC) science case used in this run is a representative science case for PIConGPU. This execution can be used for baseline analysis on AMD MI50/MI60 systems.
Compute Current Kernel:
- In this kernel, each thread block processes a supercell; particle current contributions are accumulated in shared memory using atomic operations and then written back to global memory (a minimal sketch of this pattern is shown below).
- 57.4% of runtime
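To make the pattern above concrete, here is a minimal CUDA sketch of per-supercell accumulation in shared memory, assuming a flattened one-dimensional tile and a single current component; all names and sizes are illustrative and do not correspond to PIConGPU's actual FieldJ kernel.

// Simplified, hypothetical sketch of the accumulation pattern described above.
constexpr int CELLS_PER_SUPERCELL = 256; // assumed supercell volume, e.g. 8 x 8 x 4

__global__ void depositCurrentSketch(
    float* globalJx,             // one current component on the global grid
    const int* particleCellIdx,  // cell index (within the supercell) of each particle
    const float* particleJx,     // per-particle current contribution
    int numParticles,            // particles in this supercell
    int supercellOffset)         // offset of this supercell's tile in globalJx
{
    // One thread block handles one supercell; its current tile lives in shared memory.
    __shared__ float tileJx[CELLS_PER_SUPERCELL];

    // Cooperatively zero the tile.
    for (int c = threadIdx.x; c < CELLS_PER_SUPERCELL; c += blockDim.x)
        tileJx[c] = 0.0f;
    __syncthreads();

    // Each thread accumulates its particles' contributions with shared-memory atomics.
    for (int p = threadIdx.x; p < numParticles; p += blockDim.x)
        atomicAdd(&tileJx[particleCellIdx[p]], particleJx[p]);
    __syncthreads();

    // Write the tile back to the global field; atomics guard against neighboring
    // supercells contributing to the same border cells.
    for (int c = threadIdx.x; c < CELLS_PER_SUPERCELL; c += blockDim.x)
        atomicAdd(&globalJx[supercellOffset + c], tileJx[c]);
}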
According to an application profile analysis using NVProf:
- The kernel performance is bound by Instruction and Memory Latency
- NOTE: This figure was obtained by running PIConGPU with a grid size reduced by roughly a factor of four in each dimension (64 x 60 x 56), because at the full grid size NVProf was not able to calculate the performance limiter of the kernel. We believe this happened because the profile could not complete in the maximum allotted time.
- Its achieved occupancy is 49.7% and its theoretical occupancy is 50%
- The kernel achieves 31.78 active warps out of the theoretical 32 warps per SM
- This is really good.
- Warps are primarily being prevented from executing by execution dependency, followed
by instruction fetch, and then synchronization.
- If execution dependency improves, instruction fetch is more than likely to also
improve.
- NVProf states GPU Utilization is mainly being limited by Register Usage
- The number of warps per SM stays virtually the same up to about 48,000 bytes of shared memory per block.
- Increasing the number of threads per block to roughly 576 offers the most warps per SM. However, decreasing the number of threads per block to 384 or roughly 288 yields a similar number of warps.
- The most time was spent in the CUDA atomic add file (on shared memory) followed by
Vector.hpp
- Divergent branches occurred on line 148 of EmZ.hpp, line 99 of ForEachIdx.hpp, and
line 77 of the CUDA atomic add file.
- Line 148 of EmZ.hpp is an if statement that detects whether a second virtual particle is needed.
- If true, the code goes into a for loop and then performs the deposit function.
- The divergence of this line was calculated at 93.1% (403,212 divergent executions out of 432,978 total executions)
- Based on this percentage, improving this line, or rather the condition it checks, should lead to increased warp performance.
- The EmZ current deposition scheme performs the underlying deposition procedure either once or twice for each macroparticle, depending on particle movement relative to the grid. Since we do not impose any particle ordering inside the supercell (nor is there a simple way to do it), the high divergence rate here is no surprise: it only takes one particle of each kind in a warp to cause divergence (see the sketch at the end of this section).
- Getting rid of this divergence should indeed benefit performance; it is not clear how to do it in reasonable time, though.
- Line 99 of ForEachIdx.hpp is an if statement within a for loop. If the condition
is true, the code will go into another for loop.
- The divergence of this line was calculated at 6.2% (2,160 divergent
executions out of 34,560 total executions)
- Based on this percentage, it does not seem like improving this line
will really lead to increased performance.
- Line 77 of the CUDA atomic add file is a return statement in the function
atomicAdd of type float_X
- This is the CUDA atomicAdd function on shared memory, so it is not something we can fix.
- Global Memory has alignment and access pattern issues at lines 314 and 316 of FieldJ.kernel and at lines 86 and 368 of Vector.hpp
- Line 314 and Line 316 of FieldJ.kernel load particle attributes from global memory into registers.
- Is the buffer storing these attributes misaligned?
- The variables are both scalar attributes (4 bytes).
- Line 86 of Vector.hpp is an assignment of the variable dest[d] in a for loop of the struct CopyElementWise, which is a functor to copy an object element-wise.
- We are setting the destination to be equal to the source inside a for loop.
- Line 368 of Vector.hpp is a variable update
- Shared Memory has alignment and access pattern issues with CUDA's atomicAdd.
- Figures (not reproduced here): Function Unit Utilization, Instruction Execution Counts, Floating-Point Operation Counts
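For reference, the divergent branch discussed above (EmZ.hpp line 148) has roughly the following shape; the sketch below uses invented names (it is not PIConGPU code) and is only meant to illustrate why a warp containing both kinds of particles serializes both code paths.

// Hypothetical illustration of the branch behind the 93.1% divergence at EmZ.hpp line 148.
struct VirtualParticle { /* data describing one trajectory segment */ };

__device__ void deposit(const VirtualParticle& vp)
{
    // ... perform the underlying deposition for one trajectory segment ...
}

__device__ void depositMacroParticleSketch(
    const VirtualParticle& first,
    const VirtualParticle& second,
    bool needsSecondVirtualParticle)
{
    // Every macroparticle deposits at least once ...
    deposit(first);

    // ... but only particles whose trajectory was split by a cell crossing deposit a
    // second virtual particle. Because particles are not ordered by this predicate
    // inside a supercell, nearly every warp contains both kinds and executes both paths.
    if (needsSecondVirtualParticle)
        deposit(second);
}

Sorting or binning the macroparticles of a supercell by this predicate before deposition would make warps homogeneous, but, as noted above, it is not clear how to do that cheaply every time step.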
Move and Mark Kernel:
- 19.0% of runtime
- The kernel performance is bound by Memory Bandwidth.
- Its compute utilization is significantly lower than its memory utilization
- Its achieved occupancy is 37.5% and its theoretical occupancy is 37.5%.
- The kernel achieves 23.97 active warps out of the theoretical 24 warps per SM.
- This is really good.
- Warps are primarily being prevented from executing because the pipe is busy, followed
by execution dependency, and then instruction fetch.
- The most time was spent in the file AssignedTrilinearInterpolation.hpp
followed by PCS.hpp
- PCS.hpp is the form factor used for the FOM setup.
- Global Memory has alignment and access pattern issues at lines 449, 507, 541, and 557
of Particles.kernel, lines 55 and 98 of particlePusherVay.hpp, line 112 of
ForEachIdx.hpp and line 86 of Vector.hpp
- See if we can improve the particle memory buffers.
- In the future, see if we can store each X, Y, and Z separately (see the layout sketch at the end of this section).
- It seems like the reasons for the alignment and access pattern issues are the same.
- Line 449, Line 507, Line 541, and Line 557 are all updates to particle attributes.
- Line 55 of particlePusherVay.hpp is a variable assignment.
- Based on the ratio of actual to ideal, fixing the global memory issue with
this line does not seem like a priority.
- Line 98 of particlePusherVay.hpp is an array index update.
- Line 112 of ForEachIdx.hpp is a functor initialization inside a for loop
- This can be solved with LLAMA (the Low-Level Abstraction of Memory Access library)
- Line 86 of Vector.hpp is an assignment of the variable dest[d] in a for loop of the struct CopyElementWise, which is a functor to copy an object element-wise. In line 86, we are setting the destination to be equal to the source.
- Shared Memory has alignment and access pattern issues at line 103 of
AssignedTrilinearInterpolation.hpp
- The shared memory allocations are done in a sub-optimal order, causing the buffers to be misaligned.
- We can make changes to the code to make sure E and B are aligned as
we want.
- The total number of shared memory loads and stores is also very high, though this is expected.
- It is also really good that we are maximizing the amount of shared memory, which we intended to do.
- Using less shared memory per block will increase the number of warps per SM (as shown in the NVProf occupancy graph, not reproduced here).
- However, this might not be the best idea for other architectures.
- What we can try is a slightly smaller supercell: the area to load is then smaller, reducing shared memory usage.
- Increasing the number of threads per block from 256 to 512 will increase the number of
warps per SM.
- Currently this is tied to the number of cells per supercell/the number of particles
in a buffer.
- This could be decoupled, which would allow us to increase the number of threads per block and thus the number of warps (see the occupancy sketch at the end of this section).
- Divergent branches occurred on line 562 of Particles.kernel, line 99 of
ForEachIdx.hpp, and line 195 of atomic.hpp.
- Line 562 of Particles.kernel is an if statement.
- The divergence of this line was calculated at 48.3% (5,533,444 divergent
executions out of 11,451,623 total executions)
- In this case, some particles leave the cell and some do not, thus causing
the divergence.
- Line 99 of ForEachIdx.hpp is an if statement within a for loop. If the condition
is true, the code will go into another for loop.
- The divergence of this line was calculated at 12.5% (57,120 divergent
executions out of 456,960 total executions)
- Based on this percentage, it does not seem like improving this line
will really lead to increased performance.
- Line 195 of atomic.hpp is an if statement nested inside another if statement.
- The divergence of this line was calculated at roughly 96.6% (5,342,724 divergent executions out of 5,533,530 total executions)
- Figures (not reproduced here): Function Unit Utilization, Instruction Execution Counts, Floating-Point Operation Counts
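As a sketch of the layout change suggested above for the particle memory buffers (storing each X, Y, and Z separately), compare an array-of-structs position buffer with a struct-of-arrays one; the types and names below are illustrative, not PIConGPU's frame data structures, and LLAMA is one way to express such a layout change without rewriting the kernels.

// Array-of-Structs: one 12-byte position per particle. Adjacent threads then access
// addresses 12 bytes apart, so a warp's loads span several cache lines and are not
// naturally aligned to the memory segments the hardware fetches.
struct ParticlePosAoS { float x, y, z; };

// Struct-of-Arrays: each component lives in its own contiguous, aligned buffer.
// Adjacent threads access adjacent 4-byte elements, giving fully coalesced loads.
struct ParticlePosSoA
{
    float* x; // numParticles elements each
    float* y;
    float* z;
};

__global__ void pushXSketch(ParticlePosSoA pos, const float* vx, float dt, int numParticles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numParticles)
        pos.x[i] += vx[i] * dt; // thread i touches element i of a contiguous array
}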
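The effect of moving from 256 to 512 threads per block on the number of resident warps can also be checked with the CUDA occupancy API before touching the supercell/particle-buffer coupling. The helper below is a hypothetical sketch; moveAndMarkKernel stands in for the real kernel symbol.

#include <cstdio>
#include <cuda_runtime.h>

// Report how many blocks (and warps) of a given size fit on one SM for a kernel,
// taking its register and shared memory usage into account.
template<typename Kernel>
void reportWarpsPerSM(Kernel kernel, int threadsPerBlock, size_t dynamicSharedBytes)
{
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, kernel, threadsPerBlock, dynamicSharedBytes);
    std::printf("%4d threads/block -> %d blocks/SM, %d warps/SM\n",
                threadsPerBlock, blocksPerSM, blocksPerSM * threadsPerBlock / 32);
}

// Usage, comparing the current and a larger block size (hypothetical kernel name):
//   reportWarpsPerSM(moveAndMarkKernel, 256, sharedBytesPerBlock);
//   reportWarpsPerSM(moveAndMarkKernel, 512, sharedBytesPerBlock);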
Shift Particles Kernel:
- 11.2% of runtime
- The kernel performance is bound by Instruction and Memory Latency
- The most time was spent in the file ParticlesBase.kernel followed by
CopyIdentifier.hpp
- Warps are primarily being prevented from executing because of synchronization,
followed by memory dependency, and then execution dependency.
- Global Memory has alignment and access pattern issues at lines 166, 214, 221, 384, 403, 425, 432, 655, 657, 704, 779, 780, and 833 of ParticlesBase.kernel, and at line 41 of CopyIdentifier.hpp
- Line 166, Line 214, Line 221, Line 384, Line 403, Line 425, Line 432, Line 655, Line 657, Line 704, Line 779, Line 780, and Line 833 of ParticlesBase.kernel are all updates to particle attributes.
- Line 41 of CopyIdentifier.hpp simply copies the particle attributes when the particles are shifted.
- There were no issues found with Shared Memory. This is expected because it is not
really used for this kernel.
- Divergent branches occurred on line 99 of ForEachIdx.hpp, line 102 of atomic.hpp,
line 63 of warp.hpp, and lines 147, 193, 384, 393, 403, 657, 705, 727, 741, 769, 809,
and 825 of ParticlesBase.kernel
- Line 99 of ForEachIdx.hpp is an if statement within a for loop. If the condition is true, the code will go into another for loop.
- Line 102 of atomic.hpp is an if statement. If the condition is true, the code updates the
value of the variable result.
- Line 63 of warp.hpp is a return statement inside an if statement.
- The divergence probably occurs because, depending on whether the condition is true or false, a different result can be returned.
- Line 147, Line 193, Line 384, Line 393, Line 403, Line 657, Line 705, Line 727, Line 741,
Line 769, Line 809, Line 825 of ParticlesBase.kernel are all if statements.
- NVProf states GPU Utilization is mainly being limited by Register Usage (see the __launch_bounds__ sketch at the end of this section).
- Increasing or decreasing block size would not help increase the number of warps
executed per SM.
- The same is true for increasing or decreasing the amount of shared memory per block.
- Figures (not reproduced here): Function Unit Utilization, Instruction Execution Counts, Floating-Point Operation Counts
- No floating-point operations were calculated
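Since register usage is the reported occupancy limiter here and neither block size nor shared memory changes help, one knob worth experimenting with is __launch_bounds__, which lets the compiler trade registers per thread for resident blocks per SM. The kernel name and bounds below are illustrative only; any gain has to be re-checked with the profiler, because capping registers can introduce spills to local memory.

// Hypothetical sketch: cap the block size and request a minimum number of resident
// blocks per SM so the compiler limits register usage accordingly.
__global__ void __launch_bounds__(256, 4) // <= 256 threads/block, aim for >= 4 blocks/SM
shiftParticlesSketch(int* particleCounts)
{
    // ... kernel body; registers per thread are now limited so that four 256-thread
    // blocks can be resident on one SM at the same time ...
}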