ACAT-2021, Journal of Physics: Conference Series 2438 (2023) 012043, doi:10.1088/1742-6596/2438/1/012043
GPU Accelerated Automatic Differentiation With Clad
Ioana Ifrim, Vassil Vassilev and David J Lange
Department of Physics, Princeton University, Princeton, NJ 08544, USA
E-mail: ii3193@princeton.edu, vassil.vassilev@cern.ch, david.lange@princeton.edu
Abstract.
Automatic Differentiation (AD) is instrumental for science and industry. It is a tool to evaluate the derivative of a function specified through a computer program. AD application domains span from machine learning to robotics to high energy physics. Computing gradients with AD is guaranteed to be more precise than the numerical alternative and to require at most a small constant factor more arithmetic operations than the original function. Moreover, AD applications to domain problems are typically compute bound. They are often limited by the computational requirements of high-dimensional parameter spaces and can thus benefit from parallel implementations on graphics processing units (GPUs).
Clad aims to enable differential analysis for C/C++ and CUDA. It is a compiler-assisted AD tool available both as a compiler extension and in ROOT: it works as a plugin extending the Clang compiler, as a plugin extending the interactive interpreter Cling, and as a Jupyter kernel extension based on xeus-cling.
We demonstrate the advantages of parallel gradient computations on GPUs with Clad. We explain how extending Clad to support CUDA brings forth a new layer of optimization and a proportional speed-up. The gradients of well-behaved C++ functions can be automatically executed on a GPU. The library can be easily integrated into existing frameworks or used interactively. Furthermore, we demonstrate the achieved application performance improvements, including (≈10x) in ROOT histogram fitting, and the corresponding performance gains from offloading to GPUs.
1. Introduction
Many tasks in science and industry are, or can be made, amenable to gradient-based optimization.
Gradient-based optimization is an effective way to find an optimal parametrization for
multiparameter processes that can be expressed with an objective function. The optimization
algorithms rely on a function's derivatives or gradients. Multiple approaches to obtaining
gradients of mathematical functions exist, including manual, numerical, symbolic and
automatic/algorithmic differentiation. Automatic/algorithmic differentiation (AD) is a computer program
transformation that relies on the fact that every program can be decomposed into elementary
operations to which the chain rule can be applied. Unlike numerical differentiation,
AD computes gradients that are exact up to floating-point round-off and free of truncation error;
unlike symbolic differentiation, it avoids expression swell. In addition, AD provides highly
efficient means to control the computational complexity of the produced gradients and derivatives [1].
AD is not new, and has become increasingly popular due to the backpropagation technique
in machine learning (ML). ML models are trained using large data sets, with an optimization
procedure relying on the computation of derivatives for error correction. The scalable AD-
produced gradients combined with recent increases in computational capabilities are an enabling
factor for sciences such as oceanology and geosciences [2]. High Energy Physics (HEP) is well
positioned to benefit from gradient-based optimizations allowing for better parameter fitting,
sensitivity analysis, Monte Carlo simulations and general statistical analysis for uncertainty
propagation through parameter tracking [3]. Data-intensive fields such as ML and HEP advance
their scalability by employing various parallelization techniques and hardware. In ML, it is
common for part, if not all, of the computations to run on GPGPU accelerators, taking
advantage of parallel computing platforms and programming models such as CUDA. HEP has
invested in GPGPU-based systems for real-time data processing [4] and clustering algorithms [5].
It has also started to reorganize its data analysis and statistical software to be more amenable
to parallelism. Packages such as RooFit [6] can see large speed-ups from gradient-based
optimization developments and hardware acceleration support.
The AD program transformation has two distinct modes: forward and reverse accumulation
mode. The forward accumulation mode has a lower computational cost for functions taking
fewer inputs and returning more outputs. The reverse accumulation mode is better suited to
"reducing" functions that have fewer outputs and more inputs. The reverse accumulation mode
is more difficult to implement, but is critical for many use cases as the execution time complexity
of the produced gradient is independent of the number of inputs. The implementation of efficient
reverse accumulation AD for parallel systems such as CUDA is challenging as it is a non-trivial
task to preserve the parallel properties of the original program.
This paper describes our progress with the CUDA support of the compiler-assisted AD tool,
Clad. We demonstrate the integration of Clad with interactive development services such as
Jupyter Notebooks [7], Cling [8], Clang-Repl [9] and ROOT [10].
2. Background
The AD program transformation is particularly useful for gradient-based optimization because
it offers upper-bound guarantees on asymptotic computational complexity. That is, the cost of
reverse accumulation scales linearly as O(m), where m is the number of output variables.
Moreover, the cheap gradient principle [11] states that the "cost of computing the gradient
of a scalar-valued function is nearly the same (often within a factor of 5) as that of simply
computing the function itself" [12]. These guarantees make gradient-descent optimization (driven by
error backpropagation in ML) computationally feasible in areas such as deep learning.
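For concreteness, the two accumulation modes and the cheap gradient principle can be summarized in standard AD notation (a sketch added here for clarity, not a formula from this paper). For $f:\mathbb{R}^n \to \mathbb{R}^m$ with Jacobian $J$,
\[
\text{forward mode:}\quad \dot y = J\,\dot x \quad (\text{one pass per input direction, so } n \text{ passes for the full Jacobian}),
\]
\[
\text{reverse mode:}\quad \bar x = J^{\top}\bar y \quad (\text{one pass per output adjoint, so } m \text{ passes}),
\]
so for a scalar-valued $f$ ($m = 1$) a single reverse pass yields the whole gradient, with
\[
\operatorname{cost}(\nabla f) \;\le\; c \cdot \operatorname{cost}(f), \qquad c \le 5 \text{ in typical cases [11].}
\]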
AD typically starts by building a computational graph, or a directed acyclic graph of
mathematical operations applied to an input. There are two approaches to AD, which differ
primarily in the amount of work done before program execution. Tracing approaches
construct and process a computational graph for the derivative by exploiting the ability to
overload operators in C++; they evaluate the derivative at execution time, for every
function invocation. Source transformation approaches generate a derivative function at compile
time, so the derivative is created and optimized only once. The tracing approach, implemented
in C++ by ADOL-C [13], CppAD [14], and Adept [15], is easier to implement and to adopt into
existing codebases. Source transformation shows better performance, but it usually covers only a
subset of the language; Tapenade [16], for example, relies on a custom language parsing infrastructure.
As a language of choice for machine learning, Python hosts some of the most sophisticated
AD systems. JAX is a trace-based AD system that differentiates a sub-language of Python
oriented towards ML applications [17]. Dex aims to give better AD asymptotic and parallelism
guarantees than JAX for loops with indexing [18]. Trace-based ML AD tools are designed
to exploit characteristics of ML workflows, such as little branching, no branching that depends
on input data or on probabilistic draws at runtime, and almost no adaptive algorithms. However,
investigations into more generalized approaches to AD have shown superior performance [19].
HEP projects that exploit AD include work on MadJAX [20], ACTS [21] and ROOT [22].
Gradients are used in objective-function optimizations, or in the propagation of uncertainty
in different applications. A few projects aim to introduce AD-based pipelines for end-to-end
sensitivity analysis in larger-scale systems such as reconstruction or downstream analysis tasks
[23, 24]. The community organizes events and produces documents to capture the growing interest [25].
Recent advancements in production-quality compilers such as Clang allow tools to reuse the
compiler's language parsing infrastructure, making it easier to implement source transformation AD.
ADIC [26], Enzyme [27] and Clad [28] are compiler-based AD tools using source transformation.
The increasing importance of AD is also evident in newer programming languages such as Swift and
Julia, where it is integrated deeply into the language [29, 30]. Clad is a compiler-based source
transformation AD tool for C++. Clad uses the compiler's high-level program representation,
the abstract syntax tree (AST). It can work as an extension to the Clang compiler or to
the C++ interpreters Cling and Clang-Repl, and is available in the ROOT software
package. Clad implements the forward and reverse accumulation modes and can provide higher-order
derivatives. It produces C++ code for the gradient, allowing easy inspection and verification.
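A minimal usage sketch is shown below (illustrative only: the function f, the input values and the printed output are our own, and the exact execute signatures may vary between Clad versions).

#include "clad/Differentiator/Differentiator.h"
#include <cstdio>

double f(double x, double y) { return x * x * y; }

int main() {
  // Forward mode: derivative of f with respect to x.
  auto df_dx = clad::differentiate(f, "x");
  // Reverse mode: full gradient of f in one pass.
  auto grad_f = clad::gradient(f);
  double dx = 0, dy = 0;
  grad_f.execute(3, 4, &dx, &dy);
  std::printf("df/dx = %g, grad = (%g, %g)\n",
              df_dx.execute(3, 4), dx, dy);  // 24, (24, 9)
  return 0;
}

When compiled with Clang and the Clad plugin enabled, Clad synthesizes the derivative code at compile time, and the generated C++ can be printed for inspection.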
3. Design and Implementation
The Clang frontend builds a high-level program representation in the form of an AST.
Programmatic AST synthesis might look challenging and laborious to implement, but it yields
several advantages. Access to such a high-level code representation makes it possible to
follow the high-level semantics of the algorithm and facilitates domain-specific optimizations.
It can also be used to automatically generate code (re-targeting C++) for hardware acceleration
with corresponding scheduling. Clang supports an entire family of languages, such as C, C++,
CUDA and OpenMP. For the most part, the AST for these languages is shared, and thus adding
CUDA support requires handling only the CUDA-specific constructs.
The CUDA programming model separates the device (GPU) from the host (CPU) using
language attributes such as __global__, __host__ and __device__. A function to be
differentiated can become a device-executable function, and its execution can be scheduled for
parallel computation via kernels (functions marked with the __global__ attribute). To maintain
this support for the gradient function, it is sufficient for Clad to transfer the relevant attributes
to the generated code and to make the Clad run-time data structures CUDA-compatible by
annotating them in the same way.
#include "clad/Differentiator/Differentiator.h"
#define N 512
using arrtype = double[1];

// A device function to compute a gaussian
__device__ __host__ double gauss(double x, double p, double sigma) {
  double t = -(x - p) * (x - p) / (2 * sigma * sigma);
  return pow(2 * M_PI, -1 / 2.0) * pow(sigma, -0.5) * exp(t);
}

// Tells Clad to create a gradient
auto gauss_g = clad::gradient(gauss, "x, p");

// Forward declare the generated function so
// that the compiler can put it on the device.
void gauss_grad_0_1(double x, double p, double sigma,
                    clad::array_ref<double> _d_x,
                    clad::array_ref<double> _d_p)
    __attribute__((device)) __attribute__((host));

// The compute kernel including scheduling
__global__ void compute(double *x, double *p, double sigma,
                        arrtype dx, arrtype dp) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // Runs `N` different compute units; each unit computes
  // the gradient wrt each parameter set
  if (i < N)
    gauss_grad_0_1(x[i], p[i], sigma, dx[i], dp[i]);
}

int main() {
  // CUDA device memory allocations...
  compute<<<N / 256 + 1, 256>>>(dev_x, dev_p, dev_sigma, dx, dp);
  // Utilize the results...
}
Listing 1: Clad-CUDA support example of a function and generated gradient
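For completeness, the elided host-side part of main could look roughly as follows. This is only an illustrative sketch and not the authors' code: the host buffer names, the input values, the fixed sigma of 1.5 and the one-buffer-per-parameter layout are all assumptions.

// Illustrative host-side setup for the kernel in Listing 1 (assumed names and values).
int main() {
  double host_x[N], host_p[N], host_dx[N], host_dp[N];
  for (int i = 0; i < N; ++i) { host_x[i] = 0.01 * i; host_p[i] = 0.005 * i; }

  double *dev_x, *dev_p, *dx, *dp;
  cudaMalloc(&dev_x, N * sizeof(double));
  cudaMalloc(&dev_p, N * sizeof(double));
  cudaMalloc(&dx, N * sizeof(double));
  cudaMalloc(&dp, N * sizeof(double));

  cudaMemcpy(dev_x, host_x, N * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(dev_p, host_p, N * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemset(dx, 0, N * sizeof(double));  // adjoints accumulate, so start from zero
  cudaMemset(dp, 0, N * sizeof(double));

  compute<<<N / 256 + 1, 256>>>(dev_x, dev_p, /*sigma=*/1.5, dx, dp);
  cudaDeviceSynchronize();

  cudaMemcpy(host_dx, dx, N * sizeof(double), cudaMemcpyDeviceToHost);
  cudaMemcpy(host_dp, dp, N * sizeof(double), cudaMemcpyDeviceToHost);
  // host_dx[i] and host_dp[i] now hold d gauss/dx and d gauss/dp at point i.

  cudaFree(dev_x); cudaFree(dev_p); cudaFree(dx); cudaFree(dp);
  return 0;
}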
In essence, Clad operates on the same footing as the compiler’s template instantiation logic.
As Clad visits the AST, each node is cloned and differentiated. The differentiation rules are
implemented in separate classes depending on the accumulation mode. In Listing 1, Clad
differentiates a multidimensional normal distribution implemented as a device function. In
this particular example, Clad generates the gauss_grad_0_1 gradient, which is executable in a
CUDA environment. The generated gradient of the device function can be called from the compute
kernel in the same way the original function was, preserving the kernel's level of parallelism.
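To illustrate what this node-by-node transformation computes, the reverse-mode derivatives of gauss can be written out by hand as below. This is a mathematical illustration only; Clad's actual generated code differs in structure and naming.

// Hand-derived reverse-mode gradient of gauss (illustration only, not Clad's output).
__device__ __host__
void gauss_grad_by_hand(double x, double p, double sigma,
                        double *_d_x, double *_d_p) {
  double t = -(x - p) * (x - p) / (2 * sigma * sigma);
  double val = pow(2 * M_PI, -1 / 2.0) * pow(sigma, -0.5) * exp(t);
  // d val/d t = val;  d t/d x = -(x - p) / sigma^2;  d t/d p = (x - p) / sigma^2
  *_d_x += val * (-(x - p) / (sigma * sigma));
  *_d_p += val * ((x - p) / (sigma * sigma));
}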
Currently, preserving the parallel properties of the generated gradient is the responsibility of the
user. In general, preserving parallel properties after the AD transformation is still an ongoing
research topic; however, some interesting results can be achieved automatically based on the gradient
content. We are working to automatically preserve the semantics of the generated code, since in some
cases the parallel reads from memory in the original function become parallel writes to memory in
the produced gradient. In addition, the parallelism scheme of the original function might not be
optimal for the gradient function. We are working to use the information available in Clad to find
a greater degree of parallelism for any produced code.
4. Integration and Results
Interactive prototyping and collaborative exchange are the backbone of research. Being
a Clang plugin, Clad integrates easily with the LLVM-based ecosystem. For example, Clad
integrates with Jupyter notebooks and with the interactive C++ interpreters Clang-Repl and Cling.
Figure 1: Clad integrated in interactive environments: (a) Jupyter, (b) Cling, (c) Clang-Repl.
Figure 1a demonstrates that Clad can be used in a Jupyter notebook via xeus-cling, which
enables interactive C++ for Jupyter notebooks. Figure 1b shows that Clad works with the
interactive C++ interpreter Cling. Figure 1c demonstrates that Clad works in a terminal setup
with Clang-Repl available in LLVM (since version 13).
AD with Clad is integrated in ROOT to provide an efficient way to calculate the gradients
required for minimization or function fitting. This facility is available in ROOT's
TFormula class via the GradientPar and HessianPar interfaces. The functionality can also
be used within the TF1 fitting interfaces, as shown in Figure 2a. Replacing the numerical
gradients yields promising results. In Figure 2b we compare ROOT's implementation of numerical
differentiation for generating gradients with AD-produced gradients from Clad.
As expected from AD theory, gradients produced by reverse-mode AD scale better, since the time
complexity of the computation is independent of the number of inputs. The better performance of the
Clad AD approach compared with numerical differentiation via the central finite difference method
has been studied previously [22]. Further performance improvements can be achieved by moving
the second-order derivatives (the Hessian matrix) to Clad as well. Currently, they are computed
numerically, as ROOT's minimizer must first be adapted to use externally provided Hessians.
auto f1 = new TF1("f1", "gaus");
auto h1 = new TH1D("h1", "h1", 1000, -5, 5);
double p1[] = {1, 0, 1.5}, p2[] = {100, 1, 3};
f1->SetParameters(p1); f1->SetParameters(p2);
h1->FillRandom("f1", 100000);
// Enable Clad in TFormula.
f1->GetFormula()->GenerateGradientPar();
auto r2 = h1->Fit(f1, "S G Q N"); // clad
// ...
(a) Example use of Clad in TH1. (b) Scaling of many Gaussian fits.
Figure 2: Comparison of fitting time for sums of Gaussians using gradients produced by Clad
(green) and numerically (red).
We will work with the RooFit development team to integrate Clad into RooFit [31]. RooFit
performs more computationally intensive operations, and the benefit from automatically generated
gradients will have a significant performance impact [32]. Clad could also be used to differentiate
the new batch-based CUDA backend in RooFit, or to re-target C++ gradients for GPU execution.
5. Conclusion and Future Work
In this paper, we introduced automatic differentiation and motivated its advantages for tasks
oriented towards gradient-based optimization. We demonstrated the basic usage of Clad and
described the advantages it gains from being a compiler-assisted tool. We outlined the
advancements in Clad's CUDA support. We demonstrated ease of use via simple Clad use
cases and showed how the tool integrates with interactive systems such as Jupyter, Cling and
Clang-Repl. We provided performance results from using AD in ROOT's histogram fitting logic,
compared against the numerical differentiation approach.
We aim to further advance the GPU support of Clad and to better preserve the parallel
properties of the differentiated algorithms. One of the immediate next steps is to differentiate GPU
kernel code automatically, alongside the already supported device function differentiation. The
planned automatic kernel dispatch will use the information available in Clad to provide an
optimal parallel computation scheme for the produced gradient code. We plan to implement
more advanced program analyses based on the Clang static analyzer infrastructure to allow more
efficient code generation. We are extending the coverage of C++ and CUDA language features,
such as support for polymorphism (virtual functions) and differentiation with respect to
aggregate types (for example, class member variables). We are also working on a Clad-based
backend for statistical modelling with RooFit, in order to provide efficient means of computing gradients.
6. Acknowledgments
This project is supported by the National Science Foundation under Grant OAC-1931408. Some of
the code contributions were facilitated by the 2021 Google Summer of Code program.
References
1. Verma A. An introduction to automatic differentiation. Current Science 2000:804–7
2. Qin J, Liang S, Li X, and Wang J. Development of the adjoint model of a canopy radiative transfer model for sensitivity study and inversion of leaf area index. IEEE Transactions on Geoscience and Remote Sensing 2008; 46:2028–37
3. Ramos A. Automatic differentiation for error analysis. arXiv preprint arXiv:2012.11183 2020
4. vom Bruch D. Real-time data processing with GPUs in high energy physics. Journal of Instrumentation 2020; 15:C06010
5. Chen Z, Di Pilato A, Pantaleo F, and Rovere M. GPU-based Clustering Algorithm for the CMS High Granularity Calorimeter. EPJ Web of Conferences. Vol. 245. EDP Sciences. 2020:05005
6. Hageboeck S. What the new RooFit can do for your analysis. arXiv preprint arXiv:2012.02746 2020
7. Xeus-cling project on GitHub. 2021. Available from: https://github.com/QuantStack/xeus-cling
8. Vasilev V, Canal P, Naumann A, and Russo P. Cling – The New Interactive Interpreter for ROOT 6. Journal of Physics: Conference Series 2012 Dec; 396:052071. doi:10.1088/1742-6596/396/5/052071. Available from: https://doi.org/10.1088/1742-6596/396/5/052071
9. Vassilev V. Cling Transitions to LLVM's Clang-Repl. Blog post. https://root.cern/blog/cling-in-llvm/. 2022
10. Brun R and Rademakers F. ROOT – An object oriented data analysis framework. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 1997; 389. New Computing Techniques in Physics Research V:81–6. doi:10.1016/S0168-9002(97)00048-X. Available from: http://www.sciencedirect.com/science/article/pii/S016890029700048X
11. Griewank A and Walther A. Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008
12. Kakade SM and Lee JD. Provably correct automatic sub-differentiation for qualified programs. Advances in Neural Information Processing Systems 2018; 31
13. Griewank A, Juedes D, and Utke J. Algorithm 755: ADOL-C: A package for the automatic differentiation of algorithms written in C/C++. ACM Transactions on Mathematical Software (TOMS) 1996; 22:131–67
14. Bell BM. CppAD: A package for differentiation of C++ algorithms. 2008. Available from: http://www.coin-or.org/CppAD (accessed 2021)
15. Hogan RJ. Fast reverse-mode automatic differentiation using expression templates in C++. ACM Transactions on Mathematical Software (TOMS) 2014; 40:1–16
16. Hascoet L and Pascual V. The Tapenade automatic differentiation tool: principles, model, and specification. ACM Transactions on Mathematical Software (TOMS) 2013; 39:1–43
17. Frostig R, Johnson MJ, and Leary C. Compiling machine learning programs via high-level tracing. Systems for Machine Learning 2018:23–4
18. Paszke A, Johnson D, Duvenaud D, Vytiniotis D, Radul A, Johnson M, Ragan-Kelley J, and Maclaurin D. Getting to the point: index sets and parallelism-preserving autodiff for pointful array programming. arXiv preprint arXiv:2104.05372 2021
19. RFC: C++ Gradients (TensorFlow Community Proposal). 2021. Available from: https://github.com/tensorflow/community/pull/335
20. Heinrich L and Kagan M. Differentiable Matrix Elements with MadJax. arXiv preprint arXiv:2203.00057 2022
21. Ai X, Allaire C, Calace N, Czirkos A, Ene I, Elsing M, Farkas R, Gagnon LG, Garg R, Gessinger P, et al. A common tracking software project. arXiv preprint arXiv:2106.13593 2021
22. Vassilev V, Efremov A, and Shadura O. Automatic Differentiation in ROOT. EPJ Web of Conferences. Vol. 245. EDP Sciences. 2020:02015
23. De Castro P and Dorigo T. INFERNO: Inference-aware neural optimisation. Computer Physics Communications 2019; 244:170–9
24. Simpson N and Heinrich L. neos: End-to-End-Optimised Summary Statistics for High Energy Physics. arXiv preprint arXiv:2203.05570 2022
25. Baydin AG, Cranmer K, Feickert M, Gray L, Heinrich L, Held A, Melo A, Neubauer M, Pearkes J, Simpson N, Smith N, et al. Differentiable programming in high-energy physics. Submitted as a Snowmass LOI 2020
26. Bischof CH, Roh L, and Mauer-Oats AJ. ADIC: an extensible automatic differentiation tool for ANSI-C. Software: Practice and Experience 1997; 27:1427–56
27. Moses WS and Churavy V. Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients. Advances in Neural Information Processing Systems 33. 2020
28. Vassilev V, Vassilev M, Penev A, Moneta L, and Ilieva V. Clad – Automatic Differentiation Using Clang and LLVM. Journal of Physics: Conference Series 2015; 608:012055. Available from: http://stacks.iop.org/1742-6596/608/i=1/a=012055
29. Saeta B and Shabalin D. Swift for TensorFlow: A portable, flexible platform for deep learning. Proceedings of Machine Learning and Systems 2021; 3
30. JuliaDiff – Differentiation tools in Julia. https://juliadiff.org/. 2022
31. Verkerke W and Kirkby DP. The RooFit toolkit for data modeling. eConf 2003; C0303241:MOLT007. arXiv: physics/0306116 [physics]
32. Bos EP, Burgard CD, Croft VA, Hageboeck S, Moneta L, Pelupessy I, Attema JJ, and Verkerke W. Faster RooFitting: Automated parallel calculation of collaborative statistical models. EPJ Web of Conferences. Vol. 245. EDP Sciences. 2020:06027