Multi-Level Performance Instrumentation for
Kokkos Applications using TAU
Sameer Shende
ParaTools, Inc.
Eugene, Oregon
sameer@paratools.com
Nicholas Chaimov
ParaTools, Inc.
Eugene, Oregon
nchaimov@paratools.com
Allen D. Malony
ParaTools, Inc.
Eugene, Oregon
malony@paratools.com
Neena Imam
Oak Ridge National Laboratory
Oak Ridge, Tennessee
imamn@ornl.gov
Abstract—The TAU Performance System® provides a multi-level instrumentation strategy for instrumenting Kokkos applications. Kokkos provides a performance-portable API for expressing parallelism at the node level. TAU uses the Kokkos profiling system to expose performance factors using user-specified parallel kernel names for lambda functions or C++ functors. It can also use instrumentation at the OpenMP, CUDA, pthread, or other runtime levels to expose implementation details, giving a dual focus on higher-level abstractions as well as low-level execution dynamics. This multi-level instrumentation strategy can highlight performance problems across multiple layers of the runtime system without modifying the application binary.
Index Terms—TAU, Kokkos, profiling, tracing, OTF2, instrumentation, measurement
I. INTRODUCTION
A major challenge in high-performance computing is performance portability [1]. Many cross-platform programming languages and libraries can target different execution environments, such as traditional CPUs, many-core accelerators, and GPUs of various architectures, including those from NVIDIA, AMD, and Intel. Using such languages and libraries, it is possible to write a single version of a code that runs and produces correct results on many platforms. However, cross-platform code produced in this way will not necessarily provide acceptable performance on multiple platforms. Different CPUs and accelerator devices provide different memory hierarchies, which may behave differently under the same memory access patterns; optimizing code for one device may require a different layout of data in memory than for another device.
To address this challenge, the Kokkos C++ Performance Portability Programming ecosystem has been developed [2]. The Kokkos core is a C++ library that enables programmers to specify operations over data whose physical layout in memory is deferred until the backend device on which the code will execute is known. This allows a programmer to write a single code that runs on a CPU or an accelerator device, with the data layout adapting to the execution environment in which it is actually used.
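As an illustrative sketch (ours, not drawn from the paper), a View declared without an explicit layout receives the default layout of the execution space selected at compile time:

#include <Kokkos_Core.hpp>

void allocate_positions(const int N) {
  // An N x 3 array of doubles. No layout is specified, so Kokkos picks
  // the default execution space's preferred layout at compile time:
  // LayoutRight (row-major) for CPU backends, LayoutLeft (column-major)
  // for the CUDA backend, giving coalesced GPU accesses without any
  // change to the source code.
  Kokkos::View<double*[3]> positions("positions", N);
}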
Given that the motivation behind Kokkos is to provide performance portability, it is important to be able to measure performance metrics across many different platforms. This precludes the use of vendor-specific tools such as NVIDIA's Visual Profiler [3] or Intel's VTune [4], which by their design are incapable of operating on another manufacturer's device.
The TAU Performance System® is a framework for performance instrumentation, measurement, analysis, and visualization of codes running on large-scale parallel computing systems [5]. It supports multiple parallel programming models, including MPI, OpenMP, and OpenACC, as well as monitoring of kernel execution and hardware performance counters on CPUs and GPUs from multiple vendors, including NVIDIA, AMD, and Intel.
This paper describes our integration of TAU with the Kokkos runtime. By incorporating annotations provided by the Kokkos runtime or by application code, this integration enables performance data to be presented at a level of abstraction usable by application developers, despite the layers of C++ templates and runtime components that stand between the application code as written and the instructions ultimately executed on the backend device. We demonstrate Kokkos profiling in TAU on the ExaMiniMD [6] and CabanaMD [7] molecular dynamics proxy applications.
II. INTEGRATION OF TAU AND KOKKOS
A. Kokkos
The Kokkos runtime system provides a sophisticated yet simple interface for observing key runtime events, mapping them directly to higher-level kernel names (or to the C++ instantiation of the template representing the kernel if no name is specified) and to other data structure, memory, and code region events. Tools such as TAU can attach to the Kokkos profiling system and observe key events that take place in the runtime without any code modifications.
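For example, a string label passed to a parallel construct becomes the kernel name reported through the profiling system; a sketch (the function and label names here are ours):

#include <Kokkos_Core.hpp>

void scale(Kokkos::View<double*> x, const double a) {
  // The label "scale_x" is the name a profiling-interface tool such as
  // TAU receives for this kernel; omitting the label would expose only
  // the C++ template instantiation of the enclosing lambda.
  Kokkos::parallel_for("scale_x", x.extent(0),
                       KOKKOS_LAMBDA(const int i) { x(i) *= a; });
}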
B. TAU
TAU provides support for instrumenting Kokkos as well as a broad range of runtimes at the node level, including OpenMP, pthread, OpenACC, CUDA, OpenCL, and HIP. It supports detailed MPI-level data using both the PMPI and the MPI Tools Interface (MPI_T) [8]. This allows TAU to generate detailed performance data at the runtime level. For example, TAU can track data transfers between the host and the device (GPU) at coarse granularity using callbacks from the CUDA Profiling Tools Interface (CUPTI), or map data volume to individual variable names in OpenACC at finer granularity. Similarly, based on the timestamps of kernels, it can create detailed profiles and event traces for kernel executions on the GPU. On AMD systems, TAU uses the ROCTracer and ROCProfiler packages for instrumentation.
For OpenMP, TAU supports the OpenMP Tools Interface (OMPT) [9], which is now supported in modern compilers such as Intel v19+ and LLVM Clang v8+. While OMPT relies on pragmas placed close to the code regions that describe the parallelism, Kokkos kernels are created from template instantiations and map onto other layers of the parallel runtime. Currently, Kokkos supports CUDA, pthread, and OpenMP backends, while TAU additionally supports HIP from AMD.
When events from multiple instrumentation streams coalesce within TAU's performance repository, an event hierarchy is maintained. Events are grouped into logically related runtime layers (OpenMP, Kokkos, OpenACC, CUDA, HIP, MPI, etc.), and caller-callee relationships may be observed using callpath and callsite profiling. Event filtering is also supported at runtime using TAU's plugin architecture [10] for selective instrumentation. TAU also provides event-based sampling at the line (statement), function, and file level. Sampling may be driven by a wallclock timer or by hardware performance counter overflow events from packages such as PAPI [11]. Callstack unwinding can show the system callstack when a sample is recorded, and the program counter (PC) is correlated back to the source code. By providing sampling as well as instrumentation hooks at multiple runtime layers, TAU can highlight performance problems at different layers of the runtime and in the source code. Such multi-layer instrumentation support can yield more detailed information than tools that support only a single instrumentation point.
C. Runtime-Tool Integration
Application developers need performance-monitoring tools to understand the behavior of their application and determine which optimizations are needed to increase performance. Many such tools exist for programs written using traditional programming models. Directly applying such tools to frameworks like Kokkos, however, will not generate results that are useful to application developers.
Consider the effect of applying event-based sampling to a Kokkos application. This will periodically sample the application and generate a report of the amount of time spent in each function. This is unlikely to provide information that is actionable for an application developer, as the functions presented will be mostly internal to the runtime or expose unnecessary and distracting runtime implementation details. The extensive use of C++ templates results in a large number of differently named functions implementing each parallel region; these names also contain unnecessary implementation details and prevent easy comparison of results produced by different backends.
For example, applying event-based sampling to a simple Kokkos tutorial example reveals that 15% of the application runtime is spent in an instantiation of a template function named:
std::enable_if<(((Kokkos::Impl::are_integral<int, int>::value
    && ((2) == ((Kokkos::View<double const*[3], Kokkos::LayoutRight,
        Kokkos::Threads, Kokkos::MemoryTraits<3u> >::{unnamed type#1})2)))
    && ((Kokkos::View<double const*[3], Kokkos::LayoutRight,
        Kokkos::Threads, Kokkos::MemoryTraits<3u> >::{unnamed type#3})1))
    && ((Kokkos::View<double const*[3], Kokkos::LayoutRight,
        Kokkos::Threads, Kokkos::MemoryTraits<3u> >::{unnamed type#1})1))
    && (((Kokkos::ViewTraits<double const*[3], Kokkos::LayoutRight,
        Kokkos::Threads, Kokkos::MemoryTraits<3u> >::{unnamed type#2})1) != (0)),
    double const&>::type
Kokkos::View<double const*[3], Kokkos::LayoutRight, Kokkos::Threads,
    Kokkos::MemoryTraits<3u> >::operator()<int, int>(int const&, int const&) const
The extremely complicated nested templates that comprise the type of this function are a result of the infrastructure inside Kokkos used to map the logical space of the View onto a physical memory layout at compile time using template metaprogramming. This provides no details of use to the developer of an application that uses Kokkos.
To obtain insightful and actionable performance results, a performance tool must receive metadata from the runtime that maps runtime behavior back to the application code which produced it.
D. Kokkos Profiling Hooks
To enable performance tools to be aware of the executing
parallel constructs within a Kokkos application, the Kokkos
runtime provides a profiling interface which allows an external
library to register functions which will be called when runtime
events occur [12]. At initialization, the Kokkos runtime checks
the environment variable KOKKOS_PROFILE_LIBRARY. If
set to the path to a shared library, Kokkos will search the
shared library for any of a set of pre-defined functions which
are then registered as callbacks for certain events.
For example, Kokkos provides the callbacks
extern "C" void kokkosp_begin_parallel_for(
const char*name, uint32_t devid, uint64_t
*kernid);
and
extern "C" void kokkosp_end_parallel_for(
uint64_t kernid);
which are called by the runtime upon entry to and exit from a parallel_for construct. This provides the performance tool with a human-readable name for the construct (provided by the application developer as an argument), an identifier for the device on which the construct executes, and a unique ID that distinguishes the construct from other constructs of the same type; the ID is passed again in the end callback. These callbacks are used by TAU to start and stop timers, allowing human-readable timer names such as

Kokkos::parallel_for ForceFunctor [device=0]

to be used in place of the names of C++ template instantiations.
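To make the mechanism concrete, the following is a minimal stand-alone tool library, a sketch of the callback protocol rather than TAU's actual implementation, that can be loaded through KOKKOS_PROFILE_LIBRARY and prints the wall-clock time of every parallel_for:

// minimal_tool.cpp: build with
//   g++ -shared -fPIC -o libminimal_tool.so minimal_tool.cpp
// and run the Kokkos application with
//   KOKKOS_PROFILE_LIBRARY=./libminimal_tool.so
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <utility>

namespace {
using tool_clock = std::chrono::steady_clock;
uint64_t next_id = 0;  // unique IDs handed back to the runtime via kernid
std::map<uint64_t, std::pair<std::string, tool_clock::time_point>> live;
}

extern "C" void kokkosp_begin_parallel_for(const char* name,
                                           uint32_t devid,
                                           uint64_t* kernid) {
  (void)devid;          // device ID is unused in this sketch
  *kernid = next_id++;  // the runtime passes this ID to the end callback
  live[*kernid] = {name, tool_clock::now()};
}

extern "C" void kokkosp_end_parallel_for(uint64_t kernid) {
  auto it = live.find(kernid);
  if (it == live.end()) return;
  const double ms = std::chrono::duration<double, std::milli>(
                        tool_clock::now() - it->second.second).count();
  std::printf("parallel_for %s: %.3f ms\n", it->second.first.c_str(), ms);
  live.erase(it);
}

For brevity the sketch is not thread-safe; a full tool such as TAU also registers the analogous callbacks for the other constructs and events described below.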
Similar callback functions are provided for other parallel constructs (parallel_for, parallel_reduce, parallel_scan), application code segments (push and pop regions; create, start, stop, and destroy sections), and memory management events (data allocation, deallocation, as well as deep-copy begin and end).
Fig. 1. TAU's ParaProf thread statistics table shows a CabanaMD profile.
Callbacks are also provided which allow the application
to annotate regions within its own code, so that parallel
constructs and memory management events will be associated
with a stack of profiling regions. Currently, TAU supports
the instrumentation of parallel constructs and the push and
pop region code segments. While the execution of parallel
constructs is tracked using TAU timers, the push and pop
operations of code regions are mapped to TAU phases. With
phase-based profiling, we can observe the time spent in all
routines that are called directly or indirectly within a phase.
Both timers and phases can generate accurate exclusive and
inclusive timings in profiles, and show events along a timeline
when TAU’s tracing is enabled. TAU can generate traces in its
native format that may be merged and converted to traces in
SLOG2 format for Jumpshot [13], JSON for Chrome tracing,
and the Paraver trace format. It can also generate OTF2
traces natively for the Vampir trace visualizer [14]. TAU can
also interface with the Score-P measurement library [15] to
generate traces in the OTF2 format or profiles in the CUBEX
format.
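On the application side, the region annotation described above amounts to a pair of calls bracketing the code of interest; a representative sketch (ours, with an illustrative region name):

#include <Kokkos_Core.hpp>

void update_halo() {
  // pushRegion/popRegion bracket a logical code region; the Kokkos
  // profiling interface forwards them to the tool as push/pop region
  // callbacks, which TAU maps to a phase enclosing everything executed
  // in between (MPI calls, parallel kernels, and so on).
  Kokkos::Profiling::pushRegion("Comm::update_halo");
  // ... pack buffers, exchange halos, launch Kokkos kernels ...
  Kokkos::Profiling::popRegion();
}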
E. ExaMiniMD
Kokkos is used for node-level parallelism in the ExaMiniMD [16] application. Figure 2 shows a code snippet from ExaMiniMD that illustrates the Kokkos profiling interface. In this molecular dynamics (MD) proxy application, we see a region of code in the method CommMPI::update_halo. This routine is annotated using the Kokkos::Profiling::pushRegion and popRegion calls. These regions are mapped to a TAU phase, as shown in Figure 3. Phase-based profiling is a powerful feature in TAU [17] that partitions the profiling data and shows all routines called directly or indirectly within a phase. Within this phase, we see calls to MPI, three calls to Kokkos::parallel_for, and OpenMP functions instrumented using TAU's OpenMP Tools Interface (OMPT) [9]. In this case, Kokkos is configured to use the OpenMP backend, and TAU is configured to use OMPT and MPI to match. Besides profiles, TAU can also generate OTF2 [18] traces natively. These traces do not require any additional merging or format conversion and may be visualized directly in the Vampir visualizer [14], as shown in Figure 5.
F. CabanaMD
To illustrate the integration of TAU and Kokkos, we evaluate the performance of CabanaMD. CabanaMD [7] is a proxy application that uses Kokkos and is based on the ExaMiniMD application, modified to use the CoPA Cabana Particle Toolkit. When an unmodified CabanaMD binary is launched on an IBM Power 9 Linux system with NVIDIA V100 GPUs using tau_exec, instrumentation is activated at multiple levels. First, TAU uses the Kokkos profiling interface to highlight phases such as Comm::update_halo. Figure 1 shows the profile on thread 0 (host). Here, in the aggregate summary profile, we see the Kokkos kernel names (e.g., the highlighted Kokkos::parallel_for ForceLJCabanaNeigh::compute [device=0]), statement-level statistics in the application routines LAMMPS_RandomVelocityGeom::reset and Input::create_lattice, and the time spent in CUDA events (cudaDeviceSynchronize) as well as MPI events (MPI_Allreduce). We can also observe the time spent in pthread lock calls and in scheduler yields called from within MPI_Allreduce in the IBM Spectrum MPI library. Figure 6 shows the time spent in a kernel on a GPU; this requires TAU's CUPTI support. With TAU's tracing enabled, we can see the time spent in GPU data transfer operations on each of the four V100 GPUs within the four MPI ranks along a timeline display, as shown in Figure 7.
Fig. 2. Snippet of code from the ExaMiniMD proxy application showing the use of the Kokkos API.
Fig. 3. TAU's ParaProf thread statistics table shows the Comm::update_halo phase in ExaMiniMD.
Fig. 6. TAU's ParaProf function display window shows the time spent in a kernel executing on four GPUs.
III. CONCLUSIONS
A multi-level instrumentation interface that combines events from multiple runtimes, including Kokkos, MPI, pthread, and CUDA, can highlight performance issues from different layers of the runtime. In this paper, we describe our initial efforts in supporting instrumentation of unmodified Kokkos applications using TAU. In the future, we plan to extend TAU's support for tracking memory and code sections to provide a holistic view of the inner workings of the runtime.
Fig. 4. TAU's ParaProf shows the Comm::update_halo phase, comprising OpenMP and MPI routines called directly or indirectly in this phase.
Fig. 5. TAU's OTF2 traces for ExaMiniMD shown in the Vampir trace visualizer.
ACKNOWLEDGMENT
This work was supported by the United States Department of Defense (DoD) and used resources of the Computational Research and Development Programs, the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory, and the Performance Research Laboratory at the University of Oregon. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This work benefited from access to the University of Oregon high performance computer, Talapas. The authors would like to thank Sam Reeve (LLNL) for his assistance with CabanaMD.
REFERENCES
[1] C. Kessler, U. Dastgeer, S. Thibault, R. Namyst, A. Richards, U. Dolinsky, S. Benkner, J. L. Träff, and S. Pllana, "Programmability and performance portability aspects of heterogeneous multi-/manycore systems," in Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, 2012, pp. 1403–1408.
[2] H. C. Edwards, C. R. Trott, and D. Sunderland, “Kokkos: Enabling
manycore performance portability through polymorphic memory access
patterns,” Journal of Parallel and Distributed Computing, vol. 74, no. 12,
pp. 3202–3216, 2014.
[3] "NVIDIA Visual Profiler user guide," https://docs.nvidia.com/cuda/profiler-users-guide/index.html, accessed: 2019-09-06.
[4] J. Reinders, "VTune performance analyzer essentials," Intel Press, 2005.
[5] S. Shende and A. Malony, “The TAU Parallel Performance System,”
International Journal of High Performance Computing Applications,
vol. 20, no. 2, pp. 287–311, 2006.
[6] A. P. Thompson and C. R. Trott, "A brief description of the Kokkos implementation of the SNAP potential in ExaMiniMD," SAND2017-12362R, Tech. Rep., 2017.
[7] "CoPA Cabana - The Exascale Co-Design Center for Particle Applications Toolkit," https://github.com/ECP-copa/Cabana, accessed: 2019-09-06.
[8] S. Ramesh, A. Mahéo, S. Shende, A. D. Malony, H. Subramoni, A. Ruhela, and D. K. Panda, "MPI Performance Engineering with the MPI Tool Interface: Integration of MVAPICH and TAU," Parallel Computing, vol. 77, pp. 19–37, 2018.
[9] A. E. Eichenberger, J. Mellor-Crummey, M. Schulz, M. Wong, N. Copty, R. Dietrich, X. Liu, E. Loh, and D. Lorenz, "OMPT: An OpenMP Tools Application Programming Interface for Performance Analysis," in International Workshop on OpenMP. Springer, 2013, pp. 171–185.
[10] A. D. Malony, S. Ramesh, K. A. Huck, N. Chaimov, and S. Shende, “A
plugin architecture for the TAU performance system,” in Proceedings
of the 48th International Conference on Parallel Processing, ICPP
2019, Kyoto, Japan, August 05-08, 2019. ACM, 2019, pp. 90:1–90:11.
[Online]. Available: https://doi.org/10.1145/3337821.3337916
[11] P. Mucci, S. Browne, C. Deane, and G. Ho, “PAPI: A Portable Interface
to Hardware Performance Counters,” in DoD HPCMP Users Group
Conference, 1999, pp. 7–10.
[12] S. D. Hammond, C. R. Trott, D. Ibanez, and D. Sunderland, "Profiling and debugging support for the Kokkos programming model," in International Conference on High Performance Computing. Springer, 2018, pp. 743–754.
[13] C. E. Wu, A. Bolmarcich, M. Snir, D. Wootton, F. Parpia, A. Chan,
E. Lusk, and W. Gropp, “From trace generation to visualization: A
performance framework for distributed parallel systems,” in Proc. of
SC2000: High Performance Networking and Computing, November
2000.
[14] A. Knüpfer, H. Brunst, J. Doleschal, M. Jurenz, M. Lieber, H. Mickler, M. S. Müller, and W. E. Nagel, "The Vampir performance analysis tool-set," in Tools for High Performance Computing. Springer, 2008, pp. 139–155.
[15] A. Knüpfer, C. Rössel, D. an Mey, S. Biersdorff, K. Diethelm, D. Eschweiler, M. Geimer, M. Gerndt, D. Lorenz, A. Malony, W. Nagel, Y. Oleynik, P. Philippen, P. Saviankou, D. Schmidl, S. Shende et al., "Score-P: A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir," in Tools for High Performance Computing 2011. Springer, 2012, pp. 79–91.
[16] A. P. Thompson and C. R. Trott, "A brief description of the Kokkos implementation of the SNAP potential in ExaMiniMD," SAND2017-12362R, Tech. Rep., 2017.
[17] S. Shende, A. D. Malony, A. Morris, W. Spear, and S. Biersdorff, TAU.
Boston, MA: Springer US, 2011, pp. 2025–2029.
[18] A. Knüpfer, R. Brendel, H. Brunst, H. Mix, and W. E. Nagel, "Introducing the Open Trace Format (OTF)," in Proceedings of the 6th International Conference on Computational Science, ser. Springer Lecture Notes in Computer Science, vol. 3992, Reading, UK, May 2006, pp. 526–533.
Fig. 7. Jumpshot shows the execution of CabanaMD along a timeline view.
