CAAR for Frontier - An ORNL Project
May 2020
TECHNICAL REPORT
Analysis of PIConGPU’s Three Most Intensive Kernels from NSight Systems and
NSight Compute on Summit
Matthew Leinhauser1, Jeffrey Young2, Sergei Bastrakov3, Rene Widera4, Alexander Debus5, Michael Bussmann6,
Guido Juckeland7, Arghya Chatterjee8, Sunita Chandrasekaran9
1, 9 University of Delaware (UDEL)
2 Georgia Institute of Technology
3, 4, 5, 6, 7 Helmholtz-Zentrum Dresden-Rossendorf (HZDR)
8 Oak Ridge National Laboratory (ORNL)
Contact: {mattl, schandra}@udel.edu, chatterjeea@ornl.gov
Three kernels, Current Deposition (also known as Compute Current), Particle Push (Move and
Mark), and Shift Particles, are known to be among the most time-consuming kernels in
PIConGPU. The Current Deposition and Particle Push kernels together set up the particle
attributes for running any physics simulation with PIConGPU, so improving the performance of
these two kernels is crucial. Below are some suggestions on how to improve these kernels.
This analysis was performed using a grid size of 240 x 272 x 224 and 10 time steps with the
Mid-November Figure of Merit (FOM) run setup. The Traveling Wave Electron Acceleration
(TWEAC) science case used in this run is a representative science case for PIConGPU. This
execution can also be used for baseline analysis on AMD MI50/MI60 systems. As of the time of
writing, the PIConGPU application makes limited use of NSight Systems features, so this report
mainly focuses on insights garnered from NSight Compute. For this analysis, we run the "full"
metric set to gather NSight Compute metrics. We then compare the profiling data from NSight
Compute to that of NVProf.
High-level Takeaways From this Analysis
- NVProf showed us several application-level insights on how to possibly tweak PIConGPU for better performance on Summit.
  - Using NSight Compute showed us similar metrics and reinforced that our initial analysis with NVProf was valid. NVProf profiling has more overhead since it captures a larger number of metrics by default, but both tools can be used on Summit to do useful analysis within the available 2-hour runtime limit.
- NSight Compute in general offers similar functionality to NVProf. Some of the high-level metrics, such as Speed of Light, are not extremely useful for analyzing a code that has already been optimized for Summit and NVIDIA GPUs in general.
  - NSight Compute can be run with less overhead by using its more robust customization options with default metric sets. New metric sets are easy to create and combine for analysis runs. 100 PIConGPU time steps can be profiled with NSight Compute within the 2-hour time limit, compared to 1 time step with NVProf's default-level profiling.
  - NSight Systems provides aggregate-level statistics for applications but currently has some issues with separating long C++ templated kernel names that make it tough to break these statistics down by kernel (see Possible Profiling Tool Enhancements).
- Comparing profilers that support AMD hardware to NVIDIA profilers would be much more insightful than comparing NVProf and NSight Compute outputs.
  - PIConGPU has been optimized for CUDA GPUs for many years, so further insights from CUDA profilers may be limited.
  - A useful comparison of profilers that support NVIDIA hardware and AMD hardware would look at whether the same metrics, like shared memory utilization, show up as important predictors of performance on AMD GPUs.
  - We also anticipate NVIDIA's new roofline analysis capability in NSight Compute 2020.1 to be a useful tool for further analysis of PIConGPU.
More Detailed Analysis of Kernels
Compute Current Kernel:
- 57.3% of runtime
- We are assuming this number is correct based on our understanding of the
application
In this kernel, each thread block processes a supercell; particle contributions are accumulated
in shared memory using atomic operations and then written back to global memory.
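As a rough illustration of this accumulation pattern, consider the CUDA sketch below. This is a hedged sketch only: PIConGPU's actual kernel is written against the alpaka abstraction layer and handles 3D fields and particle shapes, so the kernel name, particle layout, and supercell size here are hypothetical.

```cuda
// Hypothetical sketch of the per-supercell accumulation pattern described
// above; not PIConGPU's actual code.
#define CELLS_PER_SUPERCELL 256  // assumed value, for illustration only

__global__ void depositCurrentSketch(const float* particleCharge,
                                     const int* particleCell,      // cell index inside the supercell
                                     const int* supercellOffsets,  // CSR-style particle ranges
                                     float* globalCurrent)
{
    // One thread block per supercell: zero a shared-memory accumulator.
    __shared__ float localCurrent[CELLS_PER_SUPERCELL];
    for (int c = threadIdx.x; c < CELLS_PER_SUPERCELL; c += blockDim.x)
        localCurrent[c] = 0.0f;
    __syncthreads();

    // Each thread strides over this supercell's particles and accumulates
    // their contributions with shared-memory atomics.
    const int begin = supercellOffsets[blockIdx.x];
    const int end   = supercellOffsets[blockIdx.x + 1];
    for (int p = begin + threadIdx.x; p < end; p += blockDim.x)
        atomicAdd(&localCurrent[particleCell[p]], particleCharge[p]);
    __syncthreads();

    // Write the per-supercell sums back to global memory.
    for (int c = threadIdx.x; c < CELLS_PER_SUPERCELL; c += blockDim.x)
        atomicAdd(&globalCurrent[blockIdx.x * CELLS_PER_SUPERCELL + c],
                  localCurrent[c]);
}
```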
Launch Statistics:
- At launch:
- The size of the kernel grid is 2280
- The block size is 512
- There are 1,167,360 threads
- There are 14.25 waves per SM
- There are 54 registers per thread
- The Static Shared Memory per block is 18.26 KB/block
- There is no Dynamic Shared Memory per block (0 KB/block)
- The Shared Memory Configuration Size is 65.54 KB
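As a consistency check on these numbers (assuming Summit's V100 GPUs with 80 SMs each, and the register-imposed limit of 2 resident blocks per SM reported under Occupancy below):

```latex
2280~\text{blocks} \times 512~\text{threads/block} = 1{,}167{,}360~\text{threads},
\qquad
\frac{2280~\text{blocks}}{80~\text{SMs} \times 2~\text{blocks/SM}} = 14.25~\text{waves per SM}
```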
GPU Speed of Light:
According to an application analysis done by NSight Compute:
- Memory is more heavily utilized than Compute: check memory replay (coalescing)
metrics to make sure you're efficiently utilizing the bytes transferred. Also consider
whether it is possible to do more work per memory access (kernel fusion; see the sketch
after this list) or whether there are values you can (re)compute.
- The Streaming Multiprocessor (SM) Speed of Light (SOL) is 47.21%
- This is also defined as the SM throughput (assuming ideal load balancing across
SMSPs)
- For each unit, the Speed Of Light (SOL) reports the achieved percentage of
utilization with respect to the theoretical maximum.
- 66.43% of the theoretical maximum memory throughput was utilized (Memory SOL)
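The kernel fusion suggestion above amounts to merging back-to-back kernels so that intermediate values stay in registers instead of taking a round trip through DRAM. A generic sketch of the idea (deliberately unrelated to PIConGPU's actual kernels):

```cuda
// Unfused: y is written to DRAM by scale() and read back by shift().
__global__ void scale(float* y, const float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0f * x[i];
}

__global__ void shift(float* z, const float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = y[i] + 1.0f;
}

// Fused: the intermediate value never leaves registers, roughly halving
// the DRAM traffic for this pair of operations.
__global__ void scaleShift(float* z, const float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = 2.0f * x[i] + 1.0f;
}
```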
Compute Workload Analysis:
- The SM is busy 48.99% of the time.
- The issue slots were busy 48.99% of the time.
- The pipe was utilized most by Fused-Multiply-Add (FMA) instructions (39.02%) and least
by Address Divergence Unit (ADU) instructions (0.04%).
- 1.87 warp instructions were executed per cycle
Memory Workload Analysis:
- The memory throughput is 51.06 GB/second
- 37.32% of the available communication bandwidth between the SM, caches, and DRAM
was utilized
  - This percentage does not necessarily limit the kernel's performance
- Memory instruction issue throughput reached 39.71% of its maximum
- Shared Memory Loads had 113,400,037 bank conflicts (see the padding sketch after this list)
- Shared Memory Loads had a peak utilization of 67.13%
- Shared Memory Stores had 0 bank conflicts
- Shared Memory Stores had a peak utilization of 38.25%
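The load bank-conflict count suggests that many warp-level shared memory accesses land in the same bank. A standard, generic mitigation is to pad shared arrays; the classic matrix-transpose sketch below (not PIConGPU code; assumes 32x32 thread blocks and n divisible by 32) shows the idea:

```cuda
__global__ void transposeSketch(const float* in, float* out, int n)
{
    // The "+ 1" padding column shifts each row of the tile by one bank,
    // so the column-wise reads below touch 32 distinct banks instead of
    // causing a 32-way bank conflict.
    __shared__ float tile[32][32 + 1];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    // Transposed write: each warp reads a column of the tile.
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}
```

Whether a padding scheme like this maps onto PIConGPU's shared-memory layout is something the developers would need to evaluate.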
Occupancy:
- Theoretical Occupancy: 50%
- Achieved Occupancy: 49.65%
- Theoretical Active Warps per SM: 32
- Achieved Active Warps per SM: 31.78
- Block Limit (Registers): 2
- Block Limit (Shared Memory): 5
- Block Limit (Warps): 4
- Block Limit (SM): 32
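The 50% theoretical occupancy follows from the register budget. Assuming a V100 SM (65,536 registers, at most 64 resident warps) and ignoring register allocation-granularity rounding:

```latex
\left\lfloor \frac{65{,}536~\text{registers}}{54~\text{registers/thread} \times 512~\text{threads}} \right\rfloor
= 2~\text{blocks/SM},
\qquad
2 \times \frac{512}{32} = 32~\text{warps} = \frac{32}{64} = 50\%
```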
- Looking at the warp occupancy vs. registers per thread graph, we achieve the highest
warp occupancy by decreasing the number of registers per thread into the 0-31 range;
reducing it into the 32-40 range would also improve occupancy (a register-capping sketch
follows this list).
- Looking at the warp occupancy vs. block size graph, increasing the block size from 512
to 576 would increase the number of warps. Similarly, decreasing the block size from 512
to 384 would also yield better warp occupancy.
- From the warp occupancy vs. shared memory graph, we achieve the highest number of
warps when using no shared memory per block.
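If reducing register pressure is attempted, one generic lever is CUDA's __launch_bounds__ qualifier (or nvcc's -maxrregcount flag). This is a hedged sketch only, since PIConGPU's kernels are actually declared behind the alpaka abstraction layer and the kernel below is hypothetical:

```cuda
// Ask the compiler to keep register usage low enough that 512-thread
// blocks can launch with at least 3 of them resident per SM, which forces
// at most 42 registers per thread on V100. Any spills go to local memory,
// so the net performance effect must be measured, not assumed.
__global__ void __launch_bounds__(512, 3) computeCurrentCapped(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;  // placeholder body
}
```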
Scheduler Statistics:
- According to an application analysis done by NSight Compute:
- Every scheduler is capable of issuing one instruction per cycle, but for this kernel
each scheduler only issues an instruction every 2.0 cycles. This might leave
hardware resources underutilized and may lead to less optimal performance. Out
of the maximum of 16 warps per scheduler, this kernel allocates an average of
7.95 active warps per scheduler, but only an average of 0.76 warps were eligible
per cycle. Eligible warps are the subset of active warps that are ready to issue
their next instruction. Every cycle with no eligible warp results in no instruction
being issued and the issue slot remains unused. To increase the number of
eligible warps either increase the number of active warps or reduce the time the
active warps are stalled.
- This kernel achieved an average of 7.95 active warps out of the theoretical maximum of 16 warps per scheduler
- 0.76 warps were eligible per scheduler per cycle, but only 0.49 were issued
- 50.97% of the time, 0 warps were eligible to execute
- Conversely, 49.03% of the time, one or more warps were eligible to execute
Warp State Statistics:
- According to an application analysis done by NSight Compute:
- Instructions are executed in warps, which are groups of 32 threads. Optimal
instruction throughput is achieved if all 32 threads of a warp execute the same
instruction. The chosen launch configuration, early thread completion, and
divergent flow control can significantly lower the number of active threads in a
warp per cycle. This kernel achieves an average of 20.6 threads being active per
cycle. This is further reduced to 18.9 threads per warp due to predication. The
compiler may use predication to avoid an actual branch. Instead, all instructions
are scheduled, but a per-thread condition code or predicate controls which
threads execute the instructions. Try to avoid different execution paths within a
warp when possible. In addition, assure your kernel makes use of Independent
Thread Scheduling, which allows a warp to reconverge after a data-dependent
conditional block by explicitly calling __syncwarp().
- On average each warp of this kernel spends 5.1 cycles being stalled waiting for a
scoreboard dependency on an MIO operation (not to TEX or L1). This represents
about 31.2% of the total average of 16.2 cycles between issuing two instructions.
The primary reason for a high number of stalls due to short scoreboards is
typically memory operations to shared memory, but other contributors include
frequent execution of special math instructions (e.g. MUFU) or dynamic
branching (e.g. BRX, JMX). Consult the Memory Workload Analysis section to
verify if there are shared memory operations and reduce bank conflicts, if
reported.
- 16.21 warp cycles elapsed per issued instruction
- The average number of active threads per warp was 20.60
  - According to NSight Compute, this number is caused by divergence; if that
divergence is reduced or eliminated, this kernel will achieve a more optimal
number of threads per warp (a generic divergence sketch follows below).
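As a generic illustration of the divergence and explicit reconvergence that NSight Compute describes (hypothetical code, not taken from PIConGPU):

```cuda
__global__ void divergenceSketch(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    if (v > 0.0f)      // data-dependent branch: lanes of a warp may split,
        v = sqrtf(v);  // and only some lanes execute this path

    __syncwarp();  // explicit reconvergence point; under Volta's Independent
                   // Thread Scheduling the warp is not guaranteed to
                   // reconverge on its own, which matters if lanes cooperate
                   // (e.g. via warp shuffles) afterwards

    if (i < n)
        out[i] = v;
}
```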
Move and Mark Kernel:
- 19.0% of runtime
- We are assuming this number is correct based on our understanding of the
application
Launch Statistics:
- At launch:
- The size of the kernel grid is 57,120
- The block size is 256
- There are 14,622,720 threads
- There are 238 waves per SM
- There are 64 registers per thread
- The Static Shared Memory per block is 27.66 KB/block
- There is no Dynamic Shared Memory per block (0 KB/block)
- The Shared Memory Configuration Size is 98.30 KB
GPU Speed of Light:
According to an application analysis done by NSight Compute:
- The kernel is utilizing greater than 80.0% of the available compute or memory
performance of the device. To further improve performance, work will likely need to be
shifted from the most utilized to another unit.
- The Streaming Multiprocessor (SM) Speed of Light (SOL) is 44.84%
- 90.93% of the theoretical maximum memory throughput was utilized (Memory SOL)
Compute Workload Analysis:
- The SM is busy 44.90% of the time.
- The issue slots were busy 43.34% of the time.
- The pipe was utilized most by Fused-Multiply-Add (FMA) instructions (44.90%) and least
by Address Divergence Unit (ADU) instructions (0.03%).
- 1.73 warp instructions were executed per cycle
Memory Workload Analysis:
- The memory throughput is 218.53 GB/second
- 43.41% of the available communication bandwidth between the SM, caches, and DRAM
was utilized
- Memory instruction issue throughput reached 43.43% of its maximum
- Shared Memory Loads had 4,931,034,721 bank conflicts
- Shared Memory Loads had a peak utilization of 87.50%
- Shared Memory Stores had 345,967 bank conflicts
- Shared Memory Stores had a peak utilization of 0.20%
Occupancy:
- Theoretical Occupancy: 37.50%
- Achieved Occupancy: 37.44%
- Theoretical Active Warps per SM: 24
- Achieved Active Warps per SM: 23.96
- Block Limit (Registers): 4
- Block Limit (Shared Memory): 3
- Block Limit (Warps): 8
- Block Limit (SM): 32
- Looking at the warp occupancy vs. registers per thread graph, we are already achieving
the highest warp occupancy possible for the current register usage.
- Looking at the warp occupancy vs. block size graph, increasing the block size from 256
to 512 would increase the number of warps.
  - Currently, the block size is tied to the number of cells per supercell/the number of
particles in a buffer.
  - Decoupling these could allow us to increase the number of threads per block, thus
increasing the number of warps.
- From the warp occupancy vs. shared memory graph, increasing the amount of shared
memory per block beyond 24,448 bytes would decrease the number of achieved warps
from 32 to 24.
Scheduler Statistics:
- According to an application analysis done by NSight Compute:
- Every scheduler is capable of issuing one instruction per cycle, but for this kernel
each scheduler only issues an instruction every 2.3 cycles. This might leave
hardware resources underutilized and may lead to less optimal performance. Out
of the maximum of 16 warps per scheduler, this kernel allocates an average of
5.99 active warps per scheduler, but only an average of 1.02 warps were eligible
per cycle. Eligible warps are the subset of active warps that are ready to issue
their next instruction. Every cycle with no eligible warp results in no instruction
being issued and the issue slot remains unused. To increase the number of
eligible warps either increase the number of active warps or reduce the time the
active warps are stalled.
- This kernel achieved an average of 5.99 active warps out of the theoretical maximum of 16 warps per scheduler
- 1.02 warps were eligible per scheduler per cycle, but only 0.43 were issued
- 56.66% of the time, 0 warps were eligible to execute
- Conversely, 43.34% of the time, one or more warps were eligible to execute
Warp State Statistics:
- According to an application analysis done by NSight Compute:
- On average each warp of this kernel spends 4.5 cycles being stalled waiting for
the MIO instruction queue to be not full. This represents about 32.9% of the total
average of 13.8 cycles between issuing two instructions. This stall reason is high
in cases of extreme utilization of the MIO pipelines, which include special math
instructions, dynamic branches, as well as shared memory instructions.
- 13.82 warp cycles elapsed per issued instruction
- The average number of active threads per warp was 31.87
Shift Particles Kernel:
- 11.3% of runtime
- We are assuming this number is correct based on our understanding of the
application
In this kernel, each thread block processes a supercell; particle contributions are accumulated
in shared memory using atomic operations and then written back to global memory.
Launch Statistics:
- At launch:
- The size of the kernel grid is 2280
- The block size is 256
- There are 583,680 threads
- There are 7.12 waves per SM
- There are 56 registers per thread
- The Static Shared Memory per block is 2.66 KB/block
- There is no Dynamic Shared Memory per block (0 KB/block)
- The Shared Memory Configuration Size is 16.38 KB
GPU Speed of Light:
According to an application analysis done by NSight Compute:
- This kernel exhibits low compute throughput and memory bandwidth utilization relative to
the peak performance of this device. Achieved compute throughput and/or memory
bandwidth below 60.0% of peak typically indicate latency issues.
- The Streaming Multiprocessor (SM) Speed of Light (SOL) is 10.51%
- 14.02% of the theoretical maximum memory throughput was utilized (Memory SOL)
Compute Workload Analysis:
- The SM is busy 11.11% of the time.
- The issue slots were busy 11.11% of the time.
- The pipe was utilized most by Load/Store Unit (LSU) instructions (10.16%) and least by
Texture Memory (TEX) instructions (0.01%).
- 0.40 warp instructions were executed per cycle
Memory Workload Analysis:
- The memory throughput is 120.01 GB/second
- 14.02% of the available communication bandwidth between the SM, caches, and DRAM
was utilized
- Memory instruction issue throughput reached 9.61% of its maximum
- The Shared Memory Loads had 798,480 bank conflicts
- The Shared Memory Loads had a peak utilization of 2.45%
- The Shared Memory Stores had 781,119 bank conflicts
- The Shared Memory Stores had a peak utilization of 0.85%
Occupancy:
- Theoretical Occupancy: 50%
- Achieved Occupancy: 47.92%
- Theoretical Active Warps per SM: 32
- Achieved Active Warps per SM: 30.67
- Block Limit (Registers): 4
- Block Limit (Shared Memory): 34
- Block Limit (Warps): 8
- Block Limit (SM): 32
- Looking at the warp occupancy vs. registers per thread graph, decreasing the number of
registers per thread yields higher warp occupancy; the highest warp occupancy is
achieved with no more than 32 registers per thread.
- Looking at the warp occupancy vs. block size graph, increasing the block size from 256
to 288 would increase the number of warps.
  - Since we are already achieving almost the theoretical maximum warp occupancy, this
does not seem like a priority.
- From the warp occupancy vs. shared memory graph, the greatest warp occupancy is
achieved by not using any shared memory per block.
  - If we needed to incorporate shared memory, we should try not to use more than
24,448 bytes per block.
Scheduler Statistics:
- According to an application analysis done by NSight Compute:
- Every scheduler is capable of issuing one instruction per cycle, but for this kernel
each scheduler only issues an instruction every 8.9 cycles. This might leave
hardware resources underutilized and may lead to less optimal performance. Out
of the maximum of 16 warps per scheduler, this kernel allocates an average of
7.71 active warps per scheduler, but only an average of 0.13 warps were eligible
per cycle. Eligible warps are the subset of active warps that are ready to issue
their next instruction. Every cycle with no eligible warp results in no instruction
being issued and the issue slot remains unused. To increase the number of
eligible warps either increase the number of active warps or reduce the time the
active warps are stalled.
- This kernel achieved an average of 7.71 active warps out of the theoretical maximum of 16 warps per scheduler
- 0.13 warps were eligible per scheduler per cycle, and 0.11 were issued
  - This shows that nearly all eligible warps were actually issued in this scenario
- 88.83% of the time, 0 warps were eligible to execute
- Conversely, 11.17% of the time, one or more warps were eligible to execute
Warp State Statistics:
- According to an application analysis done by NSight Compute:
- On average each warp of this kernel spends 46.2 cycles being stalled waiting for
sibling warps at a CTA barrier. This represents about 67.0% of the total average
of 69.0 cycles between issuing two instructions. A high number of warps waiting
at a barrier is commonly caused by diverging code paths before a barrier that
causes some warps to wait a long time until other warps reach the
synchronization point. Whenever possible try to divide up the work into blocks of
uniform workloads. Use the Source View's sampling columns to identify which
barrier instruction causes the most stalls and optimize the code executed before
that synchronization point first.
- Instructions are executed in warps, which are groups of 32 threads. Optimal
instruction throughput is achieved if all 32 threads of a warp execute the same
instruction. The chosen launch configuration, early thread completion, and
divergent flow control can significantly lower the number of active threads in a
warp per cycle. This kernel achieves an average of 16.9 threads being active per
cycle. This is further reduced to 16.0 threads per warp due to predication. The
compiler may use predication to avoid an actual branch. Instead, all instructions
are scheduled, but a per-thread condition code or predicate controls which
threads execute the instructions. Try to avoid different execution paths within a
warp when possible. In addition, assure your kernel makes use of Independent
Thread Scheduling, which allows a warp to reconverge after a data-dependent
conditional block by explicitly calling __syncwarp().
- 69.00 warp cycles elapsed per issued instruction
- The average number of active threads per warp was 16.88
  - According to NSight Compute, this number is caused by divergence; if that
divergence is reduced or eliminated, this kernel will achieve a more optimal
number of threads per warp (a barrier-stall sketch follows below).
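A generic illustration of the barrier-stall pattern flagged above (hypothetical code; workPerThread stands in for data-dependent per-particle work):

```cuda
__global__ void barrierStallSketch(const int* workPerThread, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    // Uneven trip counts mean warps reach the barrier at very different
    // times; the fast warps then stall at __syncthreads().
    if (i < n)
        for (int k = 0; k < workPerThread[i]; ++k)
            acc += expf(-0.001f * k);

    __syncthreads();  // every warp in the block waits for the slowest one

    if (i < n)
        out[i] = acc;
}
```

For a particle code, NSight Compute's advice to divide the work into blocks of uniform workloads corresponds to keeping per-block particle counts balanced.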
NVProf Comparison
Previously, we had generated an analysis report using NVProf. That analysis was performed
using a grid size of 240 x 272 x 224 and 10 time steps on the Mid-November Figure of Merit
(FOM) run setup. The Traveling Wave Electron Acceleration (TWEAC) science case used in this
run is a representative science case for PIConGPU. This execution can be used for baseline
analysis on AMD MI50/MI60 systems. Our initial expectation was that we should see similar, if
not the same, results for the metrics we saw using NVProf. We point out similarities and
differences found across both tools for the three kernels below.
Compute Current Kernel:
Similarities:
- The GPU Utilization percentages were the same across both tools
- The Occupancy numbers and percentages were the same
- The impact of varying shared memory usage per block (figures and numbers) was the same
- The impact of varying block size (figures and numbers) was the same
Differences:
- NVProf stated this kernel's performance is bound by Instruction and Memory Latency.
NSight Compute did not explicitly mention instruction and memory latency; regarding
memory, it did say to check memory replay (coalescing) metrics and to consider whether
it is possible to do more work per memory access (kernel fusion) or whether there are
values we can (re)compute.
Move and Mark Kernel:
Similarities:
- The GPU Utilization percentages were the same across both tools
- NVProf and NSight Compute both said this kernel's performance was bound by Memory Bandwidth.
  - While NSight Compute did not explicitly say that ("The kernel is utilizing greater
than 80.0% of the available compute or memory performance of the device."), it
seems reasonably clear from looking at the chart that this is implied.
- The Occupancy numbers and percentages were virtually the same.
  - The only differences occurred with NVProf rounding numbers
- The impact of varying block size (figures and numbers) was the same
Differences:
- The shape of the varying-shared-memory-per-block graphs was the same; however, the
warp occupancy achieved at a given amount of shared memory differed.
  - NVProf stated we were achieving 24 warps at 27,000 bytes of shared memory per block
  - NSight Compute said we were achieving 32 warps using 0 bytes of shared memory per block
Shift Particles Kernel:
Similarities:
- The GPU Utilization percentages were the same across both tools
- NVProf and NSight Compute both indicated this kernel's performance was bound by Instruction and Memory Latency.
  - While NSight Compute did not explicitly say that, its Speed of Light message
(quoted under Differences below) states that utilization below 60.0% of peak
typically indicates latency issues.
- The impact of varying block size (figures and numbers) was the same
Differences:
- NSight Compute reported: "This kernel exhibits low compute throughput and memory
bandwidth utilization relative to the peak performance of this device. Achieved compute
throughput and/or memory bandwidth below 60.0% of peak typically indicate latency issues."
- The occupancy numbers vary slightly
  - In NVProf, the achieved active warps is 30.59 vs. 30.67 for NSight Compute
  - In NVProf, the Achieved Occupancy is 47.8% vs. 47.92% for NSight Compute
- The shape of the varying-shared-memory-per-block graphs was the same; however, the
warp occupancy achieved at a given amount of shared memory differed.
  - NVProf stated we were achieving 32 warps at 2,000 bytes of shared memory per block
  - NSight Compute said we were achieving 32 warps using 0 bytes of shared memory per block
Possible Profiling Tool Enhancements
From comparing this report, the NVProf report, and the reports generated by both tools, we
propose the following as possible enhancements for NSight Compute and NSight Systems:
NSight Compute:
- For the GPU Utilization graphs, break down the utilization percentage by operation
(memory, control-flow, arithmetic, etc.) like NVVP did
  - [Figure: NVVP's GPU Utilization graph, not reproduced here]
- Provide documentation that explains each metric and abbreviation found in a generated report
NSight Systems:
- Have the full name of a kernel pop up when hovering over the kernels in the aggregate
stream total (All Streams)
  - Currently, the full name of the kernel is only available after expanding the individual streams
- For templated kernel instantiations, the ability to limit the call-stack trace would also be
useful. Currently, the entire instantiation trace is listed in the popup, which is not very useful.
- Have the ability to reset the screen to the default view after fully extending the kernel
names view, or have the kernel names text wrap
  - Currently, if the kernel names view is fully extended to the right, there is no way to
view the timeline again without restarting the application
- Be able to extend the width of the window arbitrarily. Right now, after extending it beyond
a certain point, the application window turns black.