Journal of Physics: Conference Series 2438 (2023) 012050 (ACAT-2021), IOP Publishing
doi:10.1088/1742-6596/2438/1/012050
Managing heterogeneous device memory using
C++17 memory resources
S N Swatman¹,², A Krasznahorkay¹ and P Gessinger¹
¹ European Organization for Nuclear Research, Meyrin, Switzerland
² University of Amsterdam, Amsterdam, The Netherlands
E-mail: stephen.nicholas.swatman@cern.ch, attila.krasznahorkay@cern.ch,
paul.gessinger@cern.ch
Abstract. Programmers using the C++ programming language are increasingly taught to
manage memory implicitly through containers provided by the C++ standard library. However,
heterogeneous programming platforms often require explicit allocation and deallocation of
memory. This discrepancy in memory management strategies can be daunting and problematic
for C++ developers who are not already familiar with heterogeneous programming. The C++17
standard introduces the concept of memory resources, which allow the user to control how
standard library containers allocate memory; we believe that this addition to the C++17
standard is a powerful tool towards the unification of memory management for heterogeneous
systems with best-practice C++ development. In this paper, we present vecmem, a library
of memory resources which allows efficient and user-friendly allocation of memory on CUDA,
HIP, and SYCL devices through standard C++ containers. We investigate the design and use
cases of such a library, the potential performance gains over naive memory allocation, and the
limitations of this memory allocation model.
1. Introduction
Heterogeneous software is increasingly becoming a necessity in scientific computing as it allows
tackling data challenges at ever greater scale, and with ever greater efficiency [1]. Unfortunately,
heterogeneity brings its own set of challenges, one of which is increased complexity for developers;
developing software for heterogeneous systems is often very different from developing for more
traditional homogeneous systems, which can be especially daunting to domain scientists who may not
necessarily be experts in computing [2]. As a result, developing and maintaining heterogeneous
software comes at the cost of great human effort, and the resulting software may be more error-
prone and harder to maintain.
One particularly grating aspect of many heterogeneous development platforms is the
management of device memory. Over the last few decades, the C++ language has moved
away from explicit management of memory by the programmer: the use of malloc and free
to dynamically allocate and deallocate memory on the heap, as well as the new and delete
keywords, has become increasingly uncommon [3]. Instead, it is often considered good practice
to rely on implicit memory management, such as through RAII and standard library containers,
which hide the management of memory from the user completely [4]. Meanwhile, heterogeneous
platforms are often designed to be compatible with the C programming language and are
therefore required to be far more explicit. In CUDA, for example, the cudaMalloc and cudaFree
methods are pervasive [5].
This leads to the unfortunate state of heterogeneous memory management today: we teach
aspiring domain scientists to practice modern C++ on homogeneous systems, and then ask
them to work with heterogeneous platforms which do not support those practices. This not only
wastes these scientists’ valuable time, but also increases the potential for errors in the software;
explicit memory management can very easily cause memory safety violations if not used carefully
[6].
We believe that this barrier between homogeneous and heterogeneous memory management
is not only undesirable, but also unnecessary. In this paper, we present vecmem (available at
github.com/acts-project/vecmem), a library which bridges the gap between the ergonomics of
idiomatic C++ and the power of heterogeneous computing. Our library is based around the use
of standard C++ library classes such as vectors and smart pointers with device memory through
memory resources, a C++17 library feature which allows fine-grained control over the allocation
schemes of library classes [7]. We extend these memory resources to cover heterogeneous memory,
and demonstrate how they can be applied to heterogeneous programming to provide a more
comfortable and productive experience to C++ programmers. The project also provides the
CMake [8] build infrastructure for making use of the supported heterogeneous programming
languages on all platforms that those languages themselves support.
2. Host-side Functionality
The objective of the host-side¹ functionality of vecmem is to allow C++ developers to interact
with device-accessible memory in ways that: (1) are as similar as possible to standard C++ code;
(2) support a diverse range of heterogeneous programming platforms; and (3) are sufficiently
performant as to not introduce new bottlenecks. To this end, vecmem takes a compositional
view of memory management and provides two classes of memory resources; our library provides
upstream memory resources which interface directly with the underlying programming platform,
as well as downstream memory resources which wrap additional behaviour around an existing
memory resource. For example, an upstream allocator might interface directly with the CUDA
runtime library, while a downstream allocator may sub-allocate memory allocated by that
upstream allocator in order to avoid costly and unnecessary allocations and deallocations.
Examples of interactions between user code and the heterogeneous platform, directly as well
as through our library, are given in Figure 1.
2.1. General memory resources
We will first examine memory resources in the general sense, the type of which we shall refer to
as M. Memory resources share a common interface which requires the implementation of the
following three methods (a minimal sketch of a conforming resource is given after the list):
Allocation A memory resource must be able to allocate a chunk of memory of a given size
with a given alignment². This allocation is allowed to fail; such failures are, in accordance
with the C++ standard, modelled as exceptions.
Deallocation A memory resource must be able to deallocate a chunk of memory of a given size,
at a given pointer. This method is assumed to always succeed, and as such does not return
any indication of success or failure. Notably, invalid deallocation requests (e.g. double
deallocations, or deallocations where the size does not match the original allocation) are
¹ In this context, the host refers to the processing unit responsible for the core functionality of the system,
commonly a CPU. In contrast, a device refers to some auxiliary processing unit such as a GPU.
² The alignment of an allocation refers to the requirement for that allocation to begin at a memory address which
is a multiple of the alignment size.
Figure 1: Models of interaction between the programmer, platform, and device. (a) User code
interfaces directly with the platform runtime. (b) User code interfaces through a container and a
vecmem upstream resource. (c) User code interfaces through a container and a vecmem downstream
resource.
considered undefined behaviour at the call site, and thus need not be verified by the callee
memory resource.
Equivalence Two memory resources are equivalent if and only if one of the resources can
deallocate memory allocated by the other, and vice versa. A memory resource must be able
to determine whether it is equivalent to another.
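As an illustration, the interface described above could be realised on top of C++17's
std::pmr::memory_resource base class, whose protected virtual functions correspond directly to the
three operations listed; the class below is a minimal sketch for illustration only and is not part of
vecmem.

#include <cstddef>
#include <memory_resource>
#include <new>

// Illustrative only; not part of vecmem. The three operations described
// above map onto the protected virtual functions of std::pmr::memory_resource.
class example_memory_resource final : public std::pmr::memory_resource {
private:
    // Allocation: may fail, and failure is reported by throwing
    // (here, the aligned ::operator new throws std::bad_alloc).
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        return ::operator new(bytes, std::align_val_t{alignment});
    }

    // Deallocation: assumed to always succeed; invalid requests (double
    // frees, mismatched sizes) are undefined behaviour and are not checked.
    void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
        ::operator delete(p, bytes, std::align_val_t{alignment});
    }

    // Equivalence: two resources are equivalent if one can deallocate memory
    // allocated by the other; for this stateless resource, any instance of
    // the same type qualifies.
    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return dynamic_cast<const example_memory_resource*>(&other) != nullptr;
    }
};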
In C++, memory resources can be passed as run-time arguments to certain value constructors
of standard library containers. For this to be possible, the container has to be configured
to support polymorphic allocators at the type level through specific template parameters.
For example, std::vector is a type with two template parameters, the second of which
determines the allocation scheme³. In order to construct vecmem's vector type vecmem::vector,
itself a type with one template parameter, we fix std::pmr::polymorphic_allocator as the second
template argument of std::vector. In addition to containers,
vecmem also extends standard library smart pointers such that they can be used with memory
resources. Since we are effectively re-purposing types from the standard library, we inherit all
the methods that C++ developers are used to. In addition, we inherit certain C++ semantics;
for example, vecmem::vector will automatically deallocate its owned (device) memory when
leaving scope, and we inherit support for move semantics.
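For comparison, the underlying C++17 mechanism can be exercised without vecmem at all; the
following small example, using only the standard library, shows a pmr container drawing its storage
from an explicitly chosen memory resource.

#include <cstddef>
#include <memory_resource>
#include <vector>

int main() {
    // A monotonic resource hands out memory from this buffer first and
    // falls back to the default resource once the buffer is exhausted.
    std::byte buffer[1024];
    std::pmr::monotonic_buffer_resource resource{buffer, sizeof(buffer)};

    // The container's polymorphic allocator forwards every allocation to
    // the resource chosen at construction time.
    std::pmr::vector<int> v{&resource};
    for (int i = 0; i < 100; ++i) {
        v.push_back(i);
    }
    return 0;
}

vecmem's containers use exactly this allocator machinery; the difference is that the memory resource
passed in may hand out device-accessible memory.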
An example of working with memory resources (in this case, one for CUDA managed memory)
is given in Listing 1b. Listing 1a shows idiomatic C++ code for performing the same operations
on a vector contained in host memory; these examples are included to illustrate how well the
semantics of C++ carry over to vecmem.
2.2. Upstream memory resources
In our library, an upstream memory resource is defined as a memory resource that interfaces
directly with a given platform run-time. We denote the type of these resources as M↑, where M↑ <: M⁴.
Note that all upstream memory resources are, in themselves, complete memory resources, and
may be used with containers and other vecmem features directly. Some upstream memory
resources are additionally parameterized. For example, CUDA memory resources may be
configured to act on a specific device, if there are multiple devices present in the system. We
³ This template parameter is usually hidden from developers, as the default argument is sufficient for most
common use cases.
⁴ All upstream memory resources are memory resources, but not all memory resources are upstream memory
resources. As such, M↑ is a subtype of M.
std::vector<int> v;

v.push_back(5);
for (size_t i = 0; i < 100; ++i) {
    v.emplace_back(i);
}

v.resize(150);
std::iota(v.begin(), v.end(), 0);
std::sort(v.begin(), v.end());

(a) Using idiomatic C++ for host memory.

vecmem::cuda::managed_memory_resource m;
vecmem::vector<int> v(&m);

v.push_back(5);
for (size_t i = 0; i < 100; ++i) {
    v.emplace_back(i);
}

v.resize(150);
std::iota(v.begin(), v.end(), 0);
std::sort(v.begin(), v.end());

(b) Using vecmem for shared CUDA memory.

Listing 1: Inserting and manipulating elements of a vector in various kinds of memory using
idiomatic C++ and vecmem.
currently provide nine upstream memory resources: one for host memory, three for the NVIDIA
CUDA platform (one for pinned host memory, one for shared⁵ memory, and one for device
memory) [5], two for AMD HIP (one for device memory and one for shared memory) [10], and
three for SYCL (one for host memory, one for shared memory, and one for device memory) [11].
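As a brief sketch of how these upstream resources appear in user code, consider the following; the
header paths and the device-index constructor argument are assumptions made for illustration and
should not be read as a definitive statement of the vecmem API.

// Header paths and the device-index argument are assumptions made for
// illustration; consult the vecmem documentation for the exact API.
#include <vecmem/containers/vector.hpp>
#include <vecmem/memory/cuda/device_memory_resource.hpp>
#include <vecmem/memory/host_memory_resource.hpp>

int main() {
    // Host memory, behaving much like the default C++ allocator.
    vecmem::host_memory_resource host_mr;

    // CUDA device memory, possibly bound to a specific device index.
    vecmem::cuda::device_memory_resource device_mr{0};

    // Either resource can back a container; note that memory allocated
    // through device_mr would not be accessible from host-side code.
    vecmem::vector<int> host_v{&host_mr};
    host_v.push_back(42);
    return 0;
}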
2.3. Downstream memory resources
To allow users to add additional logic to allocation schemes, vecmem defines downstream memory
resources, the type of which is defined as M → M. Crucially, downstream memory
resources are not, in themselves, fully fledged memory resources; they cannot be used to allocate
or deallocate memory without first applying them in a functional sense to an existing memory
resource. vecmem’s downstream memory resources can be roughly subdivided into the following
categories, based on their intended use:
Caching memory resources provide caching of allocated memory such that it is not
immediately deallocated when it is no longer needed. This allows future allocations to avoid
interactions with the heterogeneous runtime, which are generally orders of magnitude slower
than allocations in host memory. We provide memory resources with buddy allocation [12]
and arena allocation schemes [13].
Utility memory resources can be used to enforce certain rules on allocation schemes. For
example, we provide a memory resource that guarantees that memory is allocated in a
contiguous block of memory.
Conditional memory resources provide branching behaviour to allocation schemes, allowing
developers to encode complex decision-making processes in their memory management
systems. For example, a conditional memory resource may direct small allocations to one
memory resource, and large allocations to another (a sketch of such a resource is given after this list).
Instrumentation memory resources allow developers to instrument, profile, and debug
their memory allocation schemes by providing useful side-effects, such as providing helpful
debug messages, checking whether incoming allocations are valid, and recording the
execution times of individual allocations.
Abstract memory resources have properties that are useful for studying memory resources
in an abstract sense, but which have little use in practice. Examples include the identity memory
resource, which simply forwards any requests upstream, and the terminal memory resource,
which always fails.
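To make the downstream mechanism concrete, the following is a minimal sketch of a conditional
resource in the spirit described above, written against the standard std::pmr::memory_resource
interface; it illustrates the idea and is not vecmem's own implementation.

#include <cstddef>
#include <memory_resource>

// Illustrative only: routes small allocations to one upstream resource and
// large allocations to another. This is not vecmem's own conditional resource.
class size_splitting_resource final : public std::pmr::memory_resource {
public:
    size_splitting_resource(std::size_t threshold,
                            std::pmr::memory_resource& small_mr,
                            std::pmr::memory_resource& large_mr)
        : m_threshold(threshold), m_small(&small_mr), m_large(&large_mr) {}

private:
    // Pick the upstream resource based on the requested size.
    std::pmr::memory_resource& pick(std::size_t bytes) const {
        return bytes <= m_threshold ? *m_small : *m_large;
    }

    void* do_allocate(std::size_t bytes, std::size_t align) override {
        return pick(bytes).allocate(bytes, align);
    }

    // The size argument identifies which upstream served the allocation.
    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
        pick(bytes).deallocate(p, bytes, align);
    }

    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }

    std::size_t m_threshold;
    std::pmr::memory_resource* m_small;
    std::pmr::memory_resource* m_large;
};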
⁵ In this context, shared memory refers to memory that is, through a sufficiently abstract lens, accessible to
both the host and a device. In CUDA terminology, this is also referred to as managed memory [9].
Figure 2: The construction and consumption of buffer and view types enables elegant crossing
of the host-device barrier. (On the host, a container is turned into a vector view or a vector buffer,
which the device consumes through a device vector.)
The downstream memory resources provided by our library form a monoid ⟨M → M, ∘⟩ under
composition⁶. This property allows for the construction of arbitrarily complex downstream
allocators at run-time, if necessary. For example, the configuration of memory allocation
may be derived from user-supplied configuration, or it may be decided heuristically at run-time;
treating memory management as a hyper-parameter in this fashion can provide an avenue of
optimisation in cases where the performance of an application is highly sensitive to the overhead
incurred by memory allocation.
We believe the compositional design of memory resources is a powerful one, as any
downstream allocator can, in principle, be used with any upstream allocator. This means that
the addition of a single upstream memory resource for a new programming platform allows the
user to create a large number of composite memory resources with different allocation strategies
for different use cases, without any additional development work. Similarly, a new downstream
memory resource can immediately be used with any existing upstream allocators; a programmer
implementing a new downstream allocation strategy can immediately use that strategy with
CUDA, HIP, SYCL, and host memory.
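In code, such a composition might look roughly like the sketch below; the class names, header
paths, and constructor signatures (in particular, a buddy-allocating binary_page_memory_resource
that wraps an upstream resource) are assumptions based on the resources described above rather
than a verbatim reproduction of the vecmem API.

// Class names, headers and constructor signatures are assumptions based on
// the resources described in the text, not a definitive vecmem API.
#include <vecmem/containers/vector.hpp>
#include <vecmem/memory/binary_page_memory_resource.hpp>
#include <vecmem/memory/cuda/device_memory_resource.hpp>

int main() {
    // Upstream: talks directly to the CUDA runtime.
    vecmem::cuda::device_memory_resource upstream;

    // Downstream: a buddy-allocation cache layered on top of the upstream,
    // so repeated allocations avoid round-trips to the CUDA runtime.
    vecmem::binary_page_memory_resource cached{upstream};

    // Any container can be pointed at the composed resource. Device memory
    // is not host-accessible, so only an allocation is performed here; the
    // data itself would be written by device-side code.
    vecmem::vector<float> v{&cached};
    v.reserve(1000);
    return 0;
}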
3. Device-side Functionality
In addition to host-side code related to the allocation of memory, we also provide device-side
functionality for using that memory. Like the host-side interfaces, our goal is to provide a
software development experience that is as close to the homogeneous C++ experience as possible.
Most containers and types defined in the C++ standard library are not available in heterogeneous
environments; even if the memory they use is accessible by the device, the code that operates on
that memory is not usable on the device. To resolve this problem, we provide device-compatible
classes that mimic the behaviour of standard library containers.
In order to cross the barrier between data structures on the host and the device, vecmem
provides light-weight data types that can be passed directly to the device. We provide buffer
types which own the memory they point to, as well as view types, which do not. These buffer
and view types can be generated on the host, and passed to the device. In device code, we
can then construct standard library-like containers around view type objects, which we can
interact with as we would expect to. A schematic view of the different types involved in this
host-device interaction is shown in Figure 2. Additionally, these types allow the user to work
with host-inaccessible memory in ways that we cannot safely do using standard types.
In order to emulate the ergonomics of C++ library containers, we provide high-level features
such as data accesses and iterators. In addition, we provide support for operations which are
⁶ In other words, any two downstream memory resources m and n can be chained to produce m ∘ n, which is
guaranteed to be another downstream memory resource. This property follows from our definition of downstream
resources as endomorphisms on M.
__global__ void kernel(
    vecmem::data::vector_view<float> out,
    vecmem::data::vector_view<float> buf
) {
    vecmem::device_vector<float> out_v(out);
    vecmem::device_vector<float> buf_v(buf);

    buf_v.push_back(123.f);
    out_v[5] = 1.23f;
}

(a) Device-side code.

vecmem::cuda::managed_memory_resource mr;

vecmem::vector<float> out{100, 0, &mr};
vecmem::data::vector_buffer<float> buf{100, 0, mr};

kernel<<<1, 1>>>(vecmem::get_data(out), buf);

(b) Host-side code.

Listing 2: Example of using vecmem device containers in a CUDA application.
non-trivial on heterogeneous devices, such as the resizing of vectors. As long as there is sufficient
capacity in a vector, we allow device-side code to insert elements into it atomically, ensuring
that there are no data races in a massively parallel environment such as a GPU. An example
of passing containers to a CUDA kernel, one of which would only be used on the device by the
kernel, is shown in Listing 2.
4. Conclusion
In this paper, we have presented the design and implementation of the vecmem library, which
aims to bring the software development ergonomics of C++ to heterogeneous device memory.
Our library is built upon the idea of composing upstream and downstream allocators, to reduce
overhead while supporting a large variety of platforms. In addition, we provide support for
device-side containers which emulate the behaviour of standard library containers.
Acknowledgments
This work was partially funded and supported by the CERN Strategic R&D Programme on
Technologies for Future Experiments [14].
References
[1] Owens J D, Luebke D, Govindaraju N, Harris M, Krüger J, Lefohn A E and Purcell T J 2007 Computer
Graphics Forum 26 80–113
[2] Ujaldón M 2016 CUDA achievements and GPU challenges ahead International Conference on Articulated
Motion and Deformable Objects (Springer) pp 207–217
[3] Stroustrup B 1996 A History of C++: 1979–1991 (New York, NY, USA: Association for Computing
Machinery) p 699–769 ISBN 0201895021
[4] Murray R B 1993 C++ strategies and tactics (Addison Wesley Longman Publishing Co., Inc.)
[5] NVIDIA Corporation 2021 CUDA C++ Programming Guide
[6] De Amorim A A, Hriţcu C and Pierce B C 2018 The meaning of memory safety International Conference
on Principles of Security and Trust (Springer) pp 79–105
[7] Dawes B and Meredith A 2016 Adopt Library Fundamentals V1 TS Components for C++17 (R1)
[8] Hoffman W and Martin K 2003 Dr. Dobb’s Journal: Software Tools for the Professional Programmer 28
40–43
[9] Li W, Jin G, Cui X and See S 2015 An evaluation of unified memory technology on NVIDIA GPUs 2015
15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (IEEE) pp 1092–1098
[10] Advanced Micro Devices, Inc 2021 HIP Programming Guide
[11] Reyes R and Lomüller V 2016 SYCL: Single-source C++ accelerator programming Parallel Computing: On
the Road to Exascale (IOS Press) pp 673–682
[12] Knowlton K C 1965 Communications of the ACM 8 623–624
[13] Hanson D R 1990 Software: Practice and Experience 20 5–12
[14] CERN Experimental Physics Department 2018 Strategic R&D Programme on Technologies for Future
Experiments Tech. rep. CERN Geneva URL https://cds.cern.ch/record/2649646