Multi-Level Performance Instrumentation for
Kokkos Applications using TAU
Sameer Shende
ParaTools, Inc.
Eugene, Oregon
sameer@paratools.com
Nicholas Chaimov
ParaTools, Inc.
Eugene, Oregon
nchaimov@paratools.com
Allen D. Malony
ParaTools, Inc.
Eugene, Oregon
malony@paratools.com
Neena Imam
Oak Ridge National Laboratory
Oak Ridge, Tennessee
imamn@ornl.gov
Abstract—The TAU Performance System® provides a multi-
level strategy for instrumenting Kokkos
applications. Kokkos provides a performance portable API for
expressing parallelism at the node level. TAU uses the Kokkos
profiling system to expose performance factors using user-
specified parallel kernel names for lambda functions or C++
functors. It can also use instrumentation at the OpenMP, CUDA,
pthread, or other runtime levels to expose the implementation
details giving a dual focus of higher-level abstractions as well as
low-level execution dynamics. This multi-level instrumentation
strategy adopted by TAU can highlight performance problems
across multiple layers of the runtime system without modifying
the application binary.
Index Terms—TAU, Kokkos, profiling, tracing, OTF2, instru-
mentation, measurement
I. INTRODUCTION
A major challenge in high-performance computing is per-
formance portability [1]. There are many cross-platform pro-
gramming languages and libraries available which can target
different execution environments, such as traditional CPUs,
many-core accelerators, and GPUs of various architectures,
such as those from NVIDIA, AMD, and Intel. Using such
languages and libraries, it is possible to write a single version
of a code which will run and produce correct results on many
platforms. However, cross-platform code produced in that way
will not necessarily provide acceptable performance on multi-
ple platforms. Different CPUs and accelerator devices provide
different memory hierarchies which may produce different
behavior on the same memory access patterns. Optimizing
code for one device may require a different layout of data
in memory than for another device.
To address this challenge, the Kokkos C++ Performance
Portability Programming ecosystem has been developed [2].
The Kokkos core is a C++ library which enables programmers
to specify operations that occur over data for which the
physical layout in memory is deferred until the backend device
on which the code will execute is known. This allows a
programmer to write a single code which will run on a CPU
or accelerator device in which the data layout will adapt to
the execution environment in which it is actually used.
Given that the motivation behind Kokkos is to provide
performance portability, it is important to be able to measure
performance metrics across many different platforms. This
precludes the use of vendor-specific tools such as NVIDIA’s
visual profiler [3] or Intel’s VTune [4], which by their design
are incapable of operating on another manufacturer’s device.
The TAU Performance System® is a framework for per-
formance instrumentation, measurement, analysis and visu-
alization of codes running on large-scale parallel computing
systems [5]. It supports multiple parallel programming models,
including MPI, OpenMP, and OpenACC, as well as monitoring
of kernel execution and hardware performance counters from
CPUs and GPUs from multiple vendors, including GPUs from
NVIDIA, AMD, and Intel.
This paper describes our integration of TAU with the
Kokkos runtime. By incorporating data from annotations pro-
vided by the Kokkos runtime or by application code, this
integration enables performance data to be presented at a level
of abstraction usable by application developers despite the
layers of C++ templates and runtime components which exist
between the application code as written and the instructions
ultimately executed on the backend device. We provide an
example of Kokkos profiling in TAU to profile the ExaM-
iniMD [6] and CabanaMD [7] molecular dynamics proxy
applications.
II. INTEGRATION OF TAU AND KOKKOS
A. Kokkos
The Kokkos runtime system provides a sophisticated yet simple
interface to observe key runtime events by mapping these
directly to higher-level kernel names (or C++ instantiations
of the template representing the kernel if the name is not
specified) and other data structure, memory, and code region
events. Tools such as TAU can attach to the Kokkos profiling
system and observe key events that take place in the runtime
without any code modifications.
B. TAU
TAU provides support for instrumentation of Kokkos as
well as a broad range of runtimes at the node level including
OpenMP, pthread, OpenACC, CUDA, OpenCL, and HIP. It
supports detailed MPI-level data using both the PMPI profiling
interface and the MPI Tools Information Interface (MPI_T) [8].
This allows TAU to generate detailed performance data at the
runtime level. For example, TAU can track data transfers between the host and
the device (GPU) using callbacks from the CUDA Profiling
Tools Interface (CUPTI) at a coarse granularity, or map the
data volume based on each variable name in OpenACC at
a finer granularity. Similarly, based on the timestamps of
kernels, it can create detailed profiles and event traces for
kernel executions on the GPU. On AMD systems, TAU uses
the ROCTracer and ROCProfiler packages for instrumentation.
For OpenMP, TAU supports the OpenMP Tools Interface
(OMPT) [9] which is now supported in modern compilers
such as Intel v19+ and LLVM Clang v8+. While OMPT
relies on the presence of pragmas close to code regions
that describe the parallelism, Kokkos kernels are created
from template instantiations and map to other layers of the
parallel runtime. Currently, Kokkos supports CUDA, pthread,
and OpenMP, while TAU supports HIP from AMD as well.
When events from multiple instrumentation streams coalesce
within TAU’s performance repository, an event hierarchy is
maintained. Events are grouped into logically related runtime
layers (OpenMP, Kokkos, OpenACC, CUDA, HIP, MPI, etc.)
and caller-callee relationships may be observed using callpath
and callsite profiling. Event filtering is also supported at
runtime using TAU’s plugin architecture [10] for selective instrumentation.
TAU also provides support for event-based sampling at the line
(statement), function, and file level. Sampling may use either
a wallclock timer or hardware performance counter overflow
events from packages such as PAPI [11]. Callstack unwinding
can show the system callstack when a sample is recorded and
the program counter (PC) is correlated back to the source
code. By providing sampling as well as instrumentation hooks
at multiple runtime layers, TAU can highlight performance
problems at different layers of the runtime and the source code.
Such multi-layer instrumentation support can provide more
detailed and powerful information than tools that support only
a single instrumentation point.
C. Runtime-Tool Integration
Application developers need performance-monitoring tools
in order to understand the performance of their application
so that they can determine what optimizations are needed
to increase performance. Many such tools exist for programs
written using traditional programming models. Directly apply-
ing such tools to frameworks like Kokkos, however, will not
generate results which are useful to application developers.
Consider the effect of applying event-based sampling to a
Kokkos application. This will periodically sample the applica-
tion and generate a report of the amount of time spent in
each function. This is unlikely to provide information that
is actionable to an application developer, as the functions
presented will be mostly internal to the runtime or expose
unnecessary and distracting runtime implementation details.
The extensive use of C++ templates results in the creation of
a large number of differently-named functions implementing
each parallel region, which also contain unnecessary
implementation details and prevent easy comparison of results
produced by different backends.
For example, applying event-based sampling to a simple
Kokkos tutorial example reveals that 15% of the application
runtime is spent in an instantiation of a template function named:
std::enable_if<(((Kokkos::Impl::are_integral<int, int>::
value&&((2)==((Kokkos::View<double const*[3], Kokkos::
LayoutRight, Kokkos::Threads, Kokkos::MemoryTraits<3u>
>::{unnamed type#1})2)))&&((Kokkos::View<double const*
[3], Kokkos::LayoutRight, Kokkos::Threads, Kokkos::
MemoryTraits<3u> >::{unnamed type#3})1))&&((Kokkos::
View<double const*[3], Kokkos::LayoutRight, Kokkos::
Threads, Kokkos::MemoryTraits<3u> >::{unnamed type#1})
1))&&(((Kokkos::ViewTraits<double const*[3], Kokkos::
LayoutRight, Kokkos::Threads, Kokkos::MemoryTraits<3u>
>::{unnamed type#2})1)!=(0)), double const&>::type
Kokkos::View<double const*[3], Kokkos::LayoutRight,
Kokkos::Threads, Kokkos::MemoryTraits<3u> >::operator()
<int, int>(int const&, int const&) const
The extremely complicated nested templates which com-
prise the type of this function are a result of the infrastructure
inside Kokkos used to map the logical space of the View
onto a physical memory layout at compile time using template
metaprogramming. This provides no details which are of use
to the developer of an application which uses Kokkos.
To obtain insightful and actionable performance results, a
performance tool must receive metadata from the runtime that
maps runtime behavior back to the application code which
produced it.
D. Kokkos Profiling Hooks
To enable performance tools to be aware of the executing
parallel constructs within a Kokkos application, the Kokkos
runtime provides a profiling interface which allows an external
library to register functions which will be called when runtime
events occur [12]. At initialization, the Kokkos runtime checks
the environment variable KOKKOS_PROFILE_LIBRARY. If
set to the path to a shared library, Kokkos will search the
shared library for any of a set of pre-defined functions which
are then registered as callbacks for certain events.
For example, Kokkos provides the callbacks
extern "C" void kokkosp_begin_parallel_for(
    const char* name, uint32_t devid,
    uint64_t* kernid);
and
extern "C" void kokkosp_end_parallel_for(
    uint64_t kernid);
which are called by the runtime upon entry to and exit from
a parallel_for construct. This provides the performance
tool with a human-readable name for the construct (provided
by the application developer as an argument), an identifier for
the device on which the construct executes, and a unique ID
which is used to distinguish the construct from other constructs
of the same type. The same ID is then passed back to the end callback. These
callbacks are used by TAU to start and stop timers, allowing
human-readable timer names such as
Kokkos::parallel_for ForceFunctor [device=0]
to be used in place of the names of C++ template instantia-
tions.
Similar callback functions are provided for other parallel
constructs (parallel_for, parallel_reduce, parallel_scan), appli-
cation code segments (push and pop regions; create, start, stop,
and destroy sections), and memory management events (data
allocation, de-allocation, as well as deep copy begin and end).
Fig. 1. TAU’s ParaProf thread statistics table shows a CabanaMD profile
Callbacks are also provided which allow the application
to annotate regions within its own code, so that parallel
constructs and memory management events will be associated
with a stack of profiling regions. Currently, TAU supports
the instrumentation of parallel constructs and the push and
pop region code segments. While the execution of parallel
constructs is tracked using TAU timers, the push and pop
operations of code regions are mapped to TAU phases. With
phase-based profiling, we can observe the time spent in all
routines that are called directly or indirectly within a phase.
Both timers and phases can generate accurate exclusive and
inclusive timings in profiles, and show events along a timeline
when TAU’s tracing is enabled. TAU can generate traces in its
native format that may be merged and converted to traces in
SLOG2 format for Jumpshot [13], JSON for Chrome tracing,
and the Paraver trace format. It can also generate OTF2
traces natively for the Vampir trace visualizer [14]. TAU can
also interface with the Score-P measurement library [15] to
generate traces in the OTF2 format or profiles in the CUBEX
format.
E. ExaMiniMD
Kokkos is used for node-level parallelism in the ExaMin-
iMD [16] application. Figure 2 shows a code snippet from
ExaMiniMD that illustrates the Kokkos profiling interface. In
this Molecular Dynamics (MD) proxy application, we see a
region of code in the method CommMPI::update_halo. This
routine is annotated using the Kokkos::Profiling pushRegion
and popRegion calls. These regions are mapped to a TAU
phase as shown in Figure 3. Phase-based profiling is a
powerful feature in TAU [17] that partitions the profiling data,
and shows all routines called directly or indirectly within a
phase. Within this phase, we see calls to MPI and three calls
to Kokkos::parallel_for and OpenMP functions instrumented
using TAU’s OpenMP Tools Interface (OMPT) [9]. In this
case, Kokkos is configured to use the OpenMP backend and
TAU is configured to use OMPT and MPI to match the
Kokkos configuration. Besides profiles, TAU can also generate OTF2 [18]
traces natively. These traces do not require any additional
merging or format conversion and may be directly visualized
in the Vampir visualizer [14], as shown in Figure 5.
F. CabanaMD
To illustrate the integration of TAU and Kokkos, we eval-
uate the performance of CabanaMD. CabanaMD [7] is a
proxy application that uses Kokkos; it is derived from the
ExaMiniMD application and adapted to use the CoPA Cabana
Particle Toolkit. When an unmodified CabanaMD binary is launched
on an IBM Power 9 Linux system with NVIDIA V100 GPUs
using tau_exec, instrumentation is activated at multiple levels.
First, TAU uses the Kokkos profiling interface to highlight
phases such as Comm::update_halo. Figure 1 shows the
profile on thread 0 (host). Here, in the aggregate summary
profile, we see the Kokkos kernel names (e.g., the high-
lighted Kokkos::parallel_for ForceLJCabanaNeigh::compute
[device=0]) as well as statement-level statistics in the ap-
plication routines (LAMMPS_RandomVelocityGeom::reset and
Input::create_lattice), and the time spent in CUDA events
(cudaDeviceSynchronize) as well as MPI events (MPI_Allreduce).
Fig. 2. Snippet of code from the ExaMiniMD proxy application showing the use of the Kokkos API
Fig. 3. TAU’s ParaProf thread statistics table shows the Comm::update_halo phase in ExaMiniMD
We can also observe the time spent in pthread lock calls and
scheduler yield called from within MPI_Allreduce in the
IBM Spectrum MPI library. Figure 6 shows the time spent
in a kernel on a GPU. This requires TAU’s CUPTI support.
With TAU’s tracing enabled, we can see the time spent in
GPU data transfer operations on each of the four V100 GPUs
within the four MPI ranks along a timeline display, as shown
in Figure 7.
III. CONCLUSIONS
A multi-level instrumentation interface that combines events
from multiple runtimes including Kokkos, MPI, pthread, and
CUDA can highlight performance issues from different layers
of the runtime. In this paper, we describe our initial efforts
in supporting instrumentation of unmodified Kokkos appli-
cations using TAU. In the future, we plan to extend support
in TAU for tracking memory and code sections to provide a
holistic view of the inner workings of the runtime.
Fig. 4. TAU’s ParaProf shows the Comm::update_halo phase comprised of OpenMP and MPI routines called directly or indirectly in this phase
Fig. 5. TAU’s OTF2 traces for ExaMiniMD are shown in Vampir trace visualizer
ACKNOWLEDGMENT
This work was supported by the United States Department
of Defense (DoD) and used resources of the Computational
Research and Development Programs, the Oak Ridge Lead-
ership Computing Facility (OLCF) at Oak Ridge National
Laboratory, and the Performance Research Laboratory at the
University of Oregon. This research was supported by the Exas-
cale Computing Project (17-SC-20-SC), a collaborative effort
of the U.S. Department of Energy Office of Science and the
National Nuclear Security Administration. This work benefited
from access to the University of Oregon high performance
computer, Talapas. The authors would like to thank Sam Reeve
(LLNL), for his assistance with CabanaMD.
REFERENCES
[1] C. Kessler, U. Dastgeer, S. Thibault, R. Namyst, A. Richards, U. Dolin-
sky, S. Benkner, J. L. Träff, and S. Pllana, “Programmability and perfor-
mance portability aspects of heterogeneous multi-/manycore systems,”
in Proceedings of the Conference on Design, Automation and Test in
Europe. EDA Consortium, 2012, pp. 1403–1408.
[2] H. C. Edwards, C. R. Trott, and D. Sunderland, “Kokkos: Enabling
manycore performance portability through polymorphic memory access
patterns,” Journal of Parallel and Distributed Computing, vol. 74, no. 12,
pp. 3202–3216, 2014.
[3] “NVIDIA Visual Profiler user guide,”
https://docs.nvidia.com/cuda/profiler-users-guide/index.html, accessed:
2019-09-06.
[4] J. Reinders, “VTune Performance Analyzer Essentials,” Intel Press, 2005.
[5] S. Shende and A. Malony, “The TAU Parallel Performance System,”
International Journal of High Performance Computing Applications,
vol. 20, no. 2, pp. 287–311, 2006.
Fig. 6. TAU’s ParaProf function display window shows the time spent in a kernel executing on four GPUs
[6] A. P. Thompson and C. R. Trott, “A brief description of the Kokkos im-
plementation of the SNAP potential in ExaMiniMD,” SAND2017-12362R,
Tech. Rep., 2017.
[7] “CoPA Cabana - The Exascale Co-Design Center for Particle Applica-
tions Toolkit,” https://github.com/ECP-copa/Cabana, accessed: 2019-09-
06.
[8] S. Ramesh, A. Mahéo, S. Shende, A. D. Malony, H. Subramoni,
A. Ruhela, and D. K. Panda, “MPI Performance Engineering with
the MPI Tool Interface: Integration of MVAPICH and TAU,” Parallel
Computing, vol. 77, pp. 19–37, 2018.
[9] A. E. Eichenberger, J. Mellor-Crummey, M. Schulz, M. Wong, N. Copty,
R. Dietrich, X. Liu, E. Loh, and D. Lorenz, “OMPT: An OpenMP
Tools Application Programming Interface for Performance Analysis,”
in International Workshop on OpenMP. Springer, 2013, pp. 171–185.
[10] A. D. Malony, S. Ramesh, K. A. Huck, N. Chaimov, and S. Shende, “A
plugin architecture for the TAU performance system,” in Proceedings
of the 48th International Conference on Parallel Processing, ICPP
2019, Kyoto, Japan, August 05-08, 2019. ACM, 2019, pp. 90:1–90:11.
[Online]. Available: https://doi.org/10.1145/3337821.3337916
[11] P. Mucci, S. Browne, C. Deane, and G. Ho, “PAPI: A Portable Interface
to Hardware Performance Counters,” in DoD HPCMP Users Group
Conference, 1999, pp. 7–10.
[12] S. D. Hammond, C. R. Trott, D. Ibanez, and D. Sunderland, “Profiling
and debugging support for the Kokkos programming model,” in Interna-
tional Conference on High Performance Computing. Springer, 2018,
pp. 743–754.
[13] C. E. Wu, A. Bolmarcich, M. Snir, D. Wootton, F. Parpia, A. Chan,
E. Lusk, and W. Gropp, “From trace generation to visualization: A
performance framework for distributed parallel systems,” in Proc. of
SC2000: High Performance Networking and Computing, November
2000.
[14] A. Knüpfer, H. Brunst, J. Doleschal, M. Jurenz, M. Lieber, H. Mickler,
M. S. Müller, and W. E. Nagel, “The Vampir performance analysis tool-
set,” in Tools for High Performance Computing. Springer, 2008, pp.
139–155.
[15] A. Knüpfer, C. Rössel, D. an Mey, S. Biersdorff, K. Diethelm, D. Es-
chweiler, M. Geimer, M. Gerndt, D. Lorenz, A. Malony, W. Nagel,
Y. Oleynik, P. Philippen, P. Saviankou, D. Schmidl, S. Shende et al.,
“Score-P: A Joint Performance Measurement Run-Time Infrastructure
for Periscope, Scalasca, TAU, and Vampir,” in Tools for High Perfor-
mance Computing 2011. Springer, 2012, pp. 79–91.
[16] A. P. Thompson and C. Trott, “A brief description of the Kokkos
implementation of the SNAP potential in ExaMiniMD,” 2017.
[17] S. Shende, A. D. Malony, A. Morris, W. Spear, and S. Biersdorff, TAU.
Boston, MA: Springer US, 2011, pp. 2025–2029.
[18] A. Knüpfer, R. Brendel, H. Brunst, H. Mix, and W. E. Nagel, “Intro-
ducing the Open Trace Format (OTF),” in Proceedings of the 6th Inter-
national Conference on Computational Science, ser. Springer Lecture
Notes in Computer Science, vol. 3992, Reading, UK, May 2006, pp.
526–533.
Fig. 7. Jumpshot shows the execution of CabanaMD along a timeline view