
Clad – Automatic Differentiation Using Clang and LLVM

2015 J. Phys.: Conf. Ser. 608 012055
V Vassilev1,2, M Vassilev2, A Penev2, L Moneta1, and V Ilieva3
1CERN, PH-SFT, Geneva, Switzerland
2FMI, University of Plovdiv Paisii Hilendarski, Plovdiv, Bulgaria
3Princeton University, Princeton, NJ, USA
Abstract. Differentiation is ubiquitous in high energy physics, for instance in minimization
algorithms and statistical analysis, in detector alignment and calibration, and in theory.
Automatic differentiation (AD) avoids well-known limitations in round-offs and speed, which
symbolic and numerical differentiation suffer from, by transforming the source code of functions.
We will present how AD can be used to compute the gradient of multi-variate functions and
functor objects. We will explain approaches to implement an AD tool. We will show how
LLVM, Clang and Cling (ROOT’s C++11 interpreter) simplify the creation of such a tool. We
describe how the tool could be integrated within any framework. We will demonstrate a simple
proof-of-concept prototype, called Clad, which is able to generate n-th order derivatives of C++
functions and other language constructs. We also demonstrate how Clad can offload laborious
computations from the CPU using OpenCL.
1. Introduction
Both industry and science often use the mathematical apparatus of differential calculus.
Modeling financial markets, climatic changes or searching for the Higgs boson use function
optimization and thus derivatives. Computing derivative values numerically incurs
precision losses, which stem from the machine’s floating-point representation and from the
stability of the numerical method used. The computation becomes even more fragile when computing
higher-order derivatives. In practice the user (a programmer) must choose the input
values and the stepping delta very carefully, which can be far from trivial. Moreover, the derivative
is hard-coded and becomes a maintenance issue: the developer has to differentiate the function
mentally or with an external tool and then translate the result into the implementation language.
An alternative approach is the so-called symbolic differentiation, which overcomes the issues
described above, but is sometimes slow [1]. It does not offer straightforward framework
integration. For example, the function to be differentiated is hard-coded in the programming
language (e.g. C++), so it has to be translated to the symbolic language and differentiated, and the
result needs to be translated back to the framework’s programming language. Moreover, both
numerical and symbolic differentiation methods suffer when computing gradients and
higher-order derivatives.
Although sometimes overlooked, there is a hybrid approach that lies between the
two extremes and resolves the issues mentioned above: automatic (or algorithmic) differentiation
(AD). A widely accepted definition of AD is “a set of techniques to numerically evaluate the
ACAT2014 IOP Publishing
Journal of Physics: Conference Series 608 (2015) 012055 doi:10.1088/1742-6596/608/1/012055
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd 1
derivative of a function specified by a computer program. AD exploits the fact that every
computer program, no matter how complicated, executes a sequence of elementary arithmetic
operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp,
log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of
arbitrary order can be computed automatically, accurately to working precision, and using at
most a small constant factor more arithmetic operations than the original program.”
This paper is divided into sections as follows: Section 2, Related Concepts, briefly discusses
some of the existing tools in the field and their advantages and disadvantages.
Section 3, Concepts of Clad, lays down the key concepts of our prototype, Clad. Section 4,
Implementation, presents some of the important technical details of the concrete realization.
Section 5, Applications, shows a few use cases in the context of computer graphics and
high-energy physics. Section 6, Computation Offload, describes a conceptual implementation
employing parallel derivative computations. Section 7, Conclusion and Plans, summarizes the
work and outlines future plans.
2. Related Concepts
Informally, the variety of incarnations of AD can be classified into three major classes:
• Implemented via operator overloading – quick to implement, at the cost of an excessive amount
of memory; most compilers evaluate expressions containing overloaded operators exactly as
they are written, without performing any compile-time optimizations [2].
• Implemented via source-to-source transformations – results in faster derivative generation
than operator overloading. The idea is to rewrite an expression in a
particular computer language, such as C++ or Fortran [3]. In order to do so,
the AD tool has to perform compiler-like processes
such as code parsing, code analysis and building intermediate representations. The source transformation
approach is considered harder to implement, but on the other hand it has advantages such
as better compiler optimizations and a relatively lower amount of used memory [4].
• Implemented via compiler modules – an ideal realization of the source-to-source
transformations is implementing AD as compiler modules. Integrating automatic
differentiation capabilities in a compiler combines the advantages of both the operator
overloading and the source transformation approaches [5].
ADOL-C is an AD tool, based on operator overloading. It can produce first and higher order
derivatives of vector functions written in C/C++. One of its advantages is that the derived
functions are valid C/C++ routines. ADOL-C can handle codes based on classes, templates
and other C++ features. The tool supports the computation of standard objects required for
optimization purposes such as gradients, Jacobians and Hessians [6]. ADOL-C uses the concept
of active variables to denote the candidate independent variables of the differentiation. All of those
variables have to be declared with a special variable type. The process of derivation is based
on an ADOL-C specific internal representation, and its start and finish are denoted with calls to
special service routines. Every calculation involving active variables within the derivation
process is recorded in a special data type. Once this is done, ADOL-C proceeds with executing
its internal algorithms for computing the derivatives.
ADIC2 is a project following the source-to-source transformation approach for Fortran, C and C++ based
programs. It is built on top of the OpenAD project, which incorporates multiple, independent
software components [7]. ADIC2 uses the ROSE compiler framework for parsing the input source
code and for generating the corresponding abstract syntax trees (ASTs). Unfortunately,
ADIC2 is closely coupled with ROSE, thus changes in ROSE may result in failures in ADIC2.
The tool uses configuration files specifying parameters and settings related to the differentiation
process. The AST is passed to a dedicated analyzer, which reduces the amount of code
passed to the differentiation algorithms. An XML-based data structure is used to denote
which parts of the source code should be differentiated and which parts should be marked as
statements and thus ignored by the differentiation algorithms. This XML-based representation
is used by another tool, which performs the differentiation and produces a differentiated AST.
ADIC2 again uses the ROSE compiler to create the output source code from the resulting AST.
The Fortran 95 compiler AD incorporates both the operator overloading and the source transformation
techniques, thus it is considered a hybrid compiler. The mathematics support is handled by
a compiler-integrated module which provides overloaded versions of the arithmetic operators.
The user is allowed to select the dependent and independent variables. The compiler uses a
special active data type that holds the function value as well as a vector of
directional derivatives. The overloaded operators as well as the active data types are contained in
a compiler-specific module. The independent variables as well as the sections of the code that need
to be differentiated have to be marked using directive-like statements, in order for the compiler
to recognize them. Unfortunately, no static data flow analysis of the code is conducted, therefore
the achieved efficiency is not optimal [8].
The source-to-source transformation technique is considered the ideal approach for building
large-scale and run-time-critical applications [9]. Furthermore, if the source-to-source
transformers are implemented as a part of an existing compiler, the implementation becomes
extremely efficient and easy to maintain, at the cost of limited portability.
3. Concepts of Clad
An automatic differentiation algorithm takes a function (F) written in a programming language
(L), translates it and yields another function (F′) written in another programming language
(R), where the translation (F) → (F′) follows the rules of differential calculus, so that
F′ is a derivative of F. For many tools (L) matches (R), i.e. the implementation language
of the input function is the same as the programming language of the differentiated function.
This is usually due to design or technical limitations of the implementation. Most
uses of (F′) tend to be in the same framework and programming environment. An interesting
domain of research is when (L) is different from (R) but (L) is compatible with (R). For example,
(F) is written in C/C++ and (F′) is written in OpenCL [10] or CUDA [11].
A derivative is produced by transforming the body of F operation by operation. Every
operation is transformed following the well-known differentiation rules. Every statement in a
C++ function is treated as a standalone transformation entity. The translation employs the
chain rule of differentiation.
The chain rule in differential calculus provides a mathematically proven simplification of the
translation process. It reduces the implementation complexity of the algorithm responsible for
the differentiation. There are two general flavors of implementing the differentiation: top-down
or bottom-up. In the top-down approach (also called forward mode or tangent mode) the
computations are done at every step and the byproducts are thrown away. In the bottom-up
approach (also called reverse mode or adjoint mode) the byproducts of the computations are
stored and reused. This is particularly useful when computing derivatives of the same function
with respect to different independent variables, for example when calculating a function’s gradient.
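\[
\frac{\partial z}{\partial t} = \frac{\partial z}{\partial x}\,\frac{\partial x}{\partial t} + \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial t}, \tag{1}
\]
where $z = f(x, y)$, $x = g(t)$, $y = h(t)$.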
AD works under the assumption that a derivative exists, i.e. it does not check whether the
differentiated function is continuous on a given interval. If the logic of the original function is
mathematically incorrect, AD cannot produce a correct derivative.
4. Implementation
Clad is an AD tool implemented on top of the LLVM compiler infrastructure [12]. The
underlying compiler platform (Clang) builds an internal representation of the source code in the
form of an AST. Parts of the AST are used to communicate with third-party libraries, which can
further specialize the compilation process. Because of the internal design and the communication
through well-defined representations, Clad can be packaged in different ways. It can easily be
turned from a plugin into a standalone tool or into an extension to third-party tools such as
the interactive C++ interpreter Cling [13].
Once Clad receives the necessary information, it can decide whether a derivative is requested
and produce it. It transforms the body of the candidate for differentiation and clones its AST
nodes while applying the differentiation rules. If the body is not present, it tries to find
user-provided directions on how to proceed.
4.1. Usage
Currently, Clad is shipped as a plugin library for the Clang compiler. One can attach the library
to the Clang compiler and it will produce derivatives at compilation time as a part of the current
object file. However, this approach has some limitations: it is bound to a particular compiler
and compiler version, and the project that uses the derivatives must be compiled with the same
compiler. For this reason, Clad can operate in three conceptually different modes (Figure 1),
generating derivatives in different representations. Derivatives can be a part of either:
• an object file – if a derivative is requested, Clad emits it as if it were present in the
source file;
• a source file – Clad can write out valid source code of the derivative into a source file. This
is handy when the user wants to produce the list of the requested derivatives and compile
them with another compiler;
• a shared library – Clad can write out the derivatives into a shared object (dynamic-link
library). This is useful when the user prefers another compiler that is binary compatible
with Clang. For instance, the user wants to compile the application with GCC/ICC but
still wants to use the derivatives.
Figure 1: Clad usage scenarios.
Listing 1 shows how simple it is to use Clad. The example demonstrates how to produce the first
derivative of the power of two (pow2). One needs to include a small header file, which introduces Clad’s
runtime, and use the clad::differentiate function to specify which function needs to be differentiated.
It takes two arguments: the function to differentiate and the position of the independent variable.
In the example below, Clad will fill in the body of pow2_darg0 statement by statement, following
the differentiation rules.
Listings 2 and 3 illustrate the textual mode in Clad. Clad can either use the produced derivative
straight away or write its source code into a file. Listing 2 contains the user code and
Listing 3 contains the automatically generated derivatives.
#include "clad/Differentiator/Differentiator.h"
#include <cstdio> // for printf
double pow2(double x) { return x*x; }
double pow2_darg0(double); // Body will be filled by Clad.
int main() {
  clad::differentiate(pow2, 0);
  printf("Result is %f\n", pow2_darg0(4.2)); // Prints out 8.4
  return 0;
}
Listing 1: Clad will produce the body of the forward declared function.
#include "clad/Differentiator/Differentiator.h"
float example1(float x, float y) {
  return x*x*y*y;
}
void diffExamples() {
  clad::differentiate(example1, 0);
  clad::differentiate(example1, 1);
}
Listing 2: Differentiation of function example1.
float example1_darg0(float x, float y) {
  return (((x + x) * y) * y);
}
float example1_darg1(float x, float y) {
  return ((x * x) * y + x * x * y);
}
Listing 3: Derivatives of the function
Valid C++ functions are generated with a system-defined name inferred from the original function
name, the independent variable of differentiation, and the derivative order. By construction,
the signature of the newly generated function is the same as that of the function being differentiated.
Differentiation may be applied not only to simple functions but also to more complex
language constructs such as:
• Templated C++ constructs;
• Classes and structs (e.g. functors);
• Virtual functions.
4.1.1. Builtin Derivatives Some functions do not have bodies, because they are only forward
declared and their implementation is in a library. The differentiation of some functions could also be
steered for performance improvements. Clad has a mechanism that allows the user to override the default
differentiation policy and provide user-directed substitutions. Listing 4 specializes the default
behavior of the differentiation by replacing the default cosine (cos) with a user-specified one.
The implementation of the user-specific substitutions relies on a namespace with overloaded
semantics. All substitutions need to be inside a special namespace called custom_derivatives and
have to follow specific naming rules. Before Clad builds a derivative, it checks whether there is already a
predefined derivative candidate. If this is the case, it simply uses the one available.
All built-in derivatives rely on this mechanism. Differentiation of trigonometric functions and
of other special functions is done using user substitutions.
4.1.2. Higher Order Derivatives and Mixed Partial Derivatives Clad provides a convenient way
of obtaining the n-th order derivative of a specific function. This is possible by invoking the
templated version of the clad::differentiate function; see, for instance, Listing 4.
Listing 5 produces mixed partial derivatives.
namespace custom_derivatives {
  double sin_darg0(double x) { return my_better_impl::cos(x); }
}
void secondDerivative() {
  clad::differentiate<2>(sin, 0);
}
Listing 4: Generation of the second derivative of sin, using substitutions.
float example2(float x, float y) {
  return x*x*y;
}
float example2_darg0(float x, float y);
auto example_darg0 = clad::differentiate(example2, 0);
auto example_darg0_darg1 = clad::differentiate(example2_darg0, 1);
Listing 5: Generation of mixed partial derivatives.
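For reference, the derivatives requested in Listing 5 can be worked out by hand for example2, i.e. f(x, y) = x²y. This is a plain calculus check and does not show the code that Clad generates.

```latex
\[
f(x, y) = x^2 y, \qquad
\frac{\partial f}{\partial x} = 2xy, \qquad
\frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right) = 2x .
\]
```

Differentiating in the opposite order gives the same result, since ∂f/∂y = x² and its derivative with respect to x is again 2x.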
4.2. Performance
Clad works in synergy with the Clang compiler. Table 1 shows the times to compile the test cases
with and without Clad. We test the overhead in two extreme cases – one, when it differentiates
a function body with many statements (Listing 6), and another, where the body contains a very
large expression (Listing 7). In more realistic scenarios Clad’s overhead is negligible, because
the functions to be differentiated are much shorter and make up a much smaller share of
the code.
Table 1: Compilation times of Listing 6 and Listing 7.
Test              With Clad   No Clad   Clad overhead
Large body        1.007s      0.993s    0.014s ( 1.39%)
Large expression  3.258s      2.432s    0.826s (25.35%)
double f1(double x) {
  x = x + 2; x = x * x; x = x + x;
  // ... repeated 1020 times.
  return x;
}
Listing 6: Performance – Large function body.
double f2(double x) {
  // ... 1020 repetitions.
}
Listing 7: Performance – Large expression.
5. Applications
We explored two major application domains for the automatic differentiator Clad – computer
graphics and high-energy physics.
5.1. Computer Graphics
We embedded Clad in a demonstration path tracer called SmallPT [14]. We investigated how
difficult and laborious the integration would be. SmallPT shoots rays of light from the viewer
towards the scene and computes lighting properties at the intersection points. In order to do
that, it computes three partial derivatives of the function describing the surface. They form
the normal vector at the intersection point between a ray and a surface. We replaced the
hand-written implementation with invocations of Clad. The performance is comparable to the
performance of the hand-written calculations. Calculating the normal vectors using numerical
approximation is about three times slower. The timing was collected from 1010 calculations of the
gradient at random points on a sphere and a hyperbolic solid. Embedding Clad also
improves the flexibility of our modified SmallPT implementation,
because it makes it possible to find derivatives of arbitrary implicit surfaces (not only spheres as in the
original code). Adding a new surface is now easier, because a hand-written derivative
is no longer needed.
5.2. High-Energy Physics
The ROOT Framework [15] is widely adopted in high-energy physics for data analysis. It offers
a wide range of mathematical tools for fitting and minimization. These tools make extensive use of
derivatives, some of which are hand-written while others are calculated numerically. We plan
to adopt Clad in ROOT6 through its C++ interpreter, Cling [13]. We expect a performance
improvement, and Clad is expected to become a gateway for derivative computation on
General-Purpose Computing on Graphics Processing Units (GPGPU). This work is still to be done.
6. Computation Offload
Derivative calculation is a very compute-intensive and time-consuming process. Clad’s immediate
goal is to increase the computational performance and to make use of all the computing power of the
environment. The natural evolution is towards execution on GPGPUs. Computation
can be accelerated by offloading calculations from the CPU to the available
GPGPUs. There are mainly two conceptually different approaches:
• Using built-in facilities of the compiler to guide compilation (usually ‘pragma’
directives) that turn on parallel execution and the automatic offload of specific
computation-intensive parts of the code to selected accelerators. This is the approach in OpenMP 4.0
[16]. The advantage is that the code does not change significantly. If the compiler does not
support these options, it ignores the parallelisation directives gracefully. The disadvantages
are that it is very hard to make full use of the underlying computing architectures, because the
pragma directives cannot specialize the algorithm for all architectures, and that some effort
and re-engineering is still required to obtain parallelism.
• Using a specific programming language (such as OpenCL and CUDA) to talk to the
hardware. This approach requires rewriting the algorithms, or parts of them. The main
advantage of this approach is that the algorithm can fully comply with the architecture of
the target hardware. The disadvantage is that it requires a lot of effort and expertise to
port it to every architecture and hardware.
We target scalability by changing the semantics of the language syntax constructs and, based on
them, we enhance the compiler actions. A key goal for these transformations is to be as transparent
as possible. Most actions are performed automatically at compile time. The goal is to use
most of the advantages of the above-mentioned approaches without forcing major changes in
the user code.
Listing 8 presents an example of a CPU-calculated gradient. In the implementation we compute the
sum of the partial derivatives of the Rosenbrock function.
float rosenbrock(float x[], int size) {
  auto rosenbrockX = clad::differentiate(rosenbrock_func, 0);
  auto rosenbrockY = clad::differentiate(rosenbrock_func, 1);
  float sum = 0;
  for (int i = 0; i < size-1; i++) {
    float one = rosenbrockX.execute(x[i], x[i + 1]);
    float two = rosenbrockY.execute(x[i], x[i + 1]);
    sum += one + two;
  }
  return sum;
}
Listing 8: Rosenbrock function implementation using Clad.
Listing 9 shows how the user specifies that the computation should be offloaded.
The original implementation needs to be transformed into a lambda function, which is a very
straightforward and almost transparent conversion. It is not much harder than putting a
pragma directive in the source code. The code is not trivial to offload, because it has a for-loop,
which should be made parallel to be able to exploit the properties of the GPGPU architecture.
Furthermore, we need to calculate the sum of the results of the calculations, which requires
a reduction on the highly parallel hardware.
float rosenbrock_offloaded(float x[], int size) {
  return clad::experimental_offload([=] {
    auto rosenbrockX = clad::differentiate(rosenbrock_func, 0);
    auto rosenbrockY = clad::differentiate(rosenbrock_func, 1);
    float sum = 0;
    for (int i = 0; i < size-1; i++) {
      float one = rosenbrockX.execute(x[i], x[i + 1]);
      float two = rosenbrockY.execute(x[i], x[i + 1]);
      sum += one + two;
    }
    return sum;
  });
}
Listing 9: Computation offloading – Conceptual implementation using C++11 lambda function.
After the minimalistic transformation is done, Clad should take over and perform the rest
automatically. Every call to clad::experimental_offload is replaced by a call to a function
generated by the Clad plugin. This is done by an AST transformation of the lambda function,
which makes it possible to transmit the parameters and to invoke one or more kernels.
In order to achieve optimal performance, Clad has to keep the computing units busy, which
is far from a trivial task. Computation offload to a GPGPU has to be planned very well and
should be used when processing large data sets. One of the reasons is that all data needs to be
copied through the system bus to a peripheral device, which introduces significant overhead.
Clad-generated code, which passes data and executes kernels, should provide optimal parallel load as
well as minimum data transfer between different types of memory (as it is possible to get too
much latency). The use of read-only and write-only parameters and other similar techniques is
recommended in order to achieve better performance.
The given example was tested for performance with 100 calls to Rosenbrock’s function over
1024*1024*48 float numbers. The resulting benchmarks are shown in Table 2.
Table 2: OpenCL parallel execution results.
Test       Device              Clock    Compute                          Time
Original   Intel i7-2635QM     2GHz     1                                12.466s
Multicore  Intel i7-2635QM     2GHz     4         1024   8192MB   32KB   11.128s
           ATI Radeon 6490M    150MHz   2         256    256MB    32KB   18.183s
           AMD Devastator      844MHz   6         256    2047MB   32KB   15.479s
           NVIDIA Tesla K20m   705MHz   13        1024   4800MB   47KB   10.615s
The benchmarking showed that the multicore CPU run is only slightly faster than the original
single-core computation. This is because of the copying of large amounts of data. Overall, the GPGPU
offload is very sensitive to the particular hardware. This is mainly due to the sequential copying of large
data sets between the host and the device. There is also latency because of the
synchronous manner of execution: the next data transfer waits for the previous computation
to be executed. In the case of NVIDIA Tesla, the CPU offload reached 50%, which is a very good
result for a prototype implementation.
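A rough estimate of the data volume shows why copying matters. The arithmetic below assumes 4-byte float values, which is an assumption about the platform and not a measured figure.

```latex
\[
1024 \times 1024 \times 48 = 50\,331\,648 \ \text{floats}, \qquad
50\,331\,648 \times 4\ \text{bytes} = 201\,326\,592\ \text{bytes} = 192\ \text{MiB}.
\]
```

Under that assumption the input alone occupies 192 MiB, which is close to the 256MB listed for the ATI Radeon 6490M in Table 2.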
7. Conclusion and Plans
Derivative production is important not only in high-energy physics but in many other domains.
Automatic differentiation is an often overlooked approach to computing derivatives. It
eliminates the precision losses of numerical differentiation and it is faster than symbolic
differentiation. It can also reduce the complexity of gradient computations. Existing AD tools
mainly focus on C and only partially on C++, because of the complexity of the language. We presented
an innovative proof-of-concept prototype facilitating automatic differentiation, called Clad. It is
based on the industrial-strength compiler technologies Clang and LLVM. Clad can differentiate
non-trivial C++ routines and it is getting closer to production-grade quality. Adding C support
is trivial; the only work that needs to be done is writing the runtime environment to be C compatible.
We explained how it can be used to produce derivatives of various orders; how to produce
mixed derivatives; how to perform user-based substitutions, steering the differentiation process;
and how to offload computations in heterogeneous environments. There is still a lot of room for
improvement. We have a conceptual implementation in OpenCL, providing a way to offload
computations in heterogeneous environments (making extensive use of GPGPUs). However, this
work is still experimental and it requires a lot of effort for Clad to be made robust.
We plan to generalize the computation of gradients and Jacobians of functions. We plan
to reduce the computational complexity of these computations by making use of the reverse
automatic differentiation mode in cases of many seeds.
Another immediate plan is to integrate Clad into Cling, the C++ interpreter of ROOT6,
which would make Clad available to the entire high-energy physics community. Then, we can
proceed with using Clad in ROOT’s minimization and fitting algorithms.
Acknowledgments
The work was partially facilitated by the Google Summer of Code Program 2013 and 2014,
through the CERN-SFT mentoring organization. The heterogeneous environment benchmarking
is supported by the University of Plovdiv “Paisii Hilendarski” through Fund “Research” under
contract NIS14-FMIIT-002/26.03.2014.
References
[1] Castro M, Vieira R and Biscaia Jr E 2000 Automatic differentiation tools in the dynamic simulation
of chemical engineering processes Brazilian Journal of Chemical Engineering 17 373–382 ISSN 0104-6632 URL
[2] Bartholomew-Biggs M, Brown S, Christianson B and Dixon L 2000 Automatic differentiation of algorithms
Journal of Computational and Applied Mathematics 124 171–190 ISSN 0377-0427 Numerical Analysis 2000.
Vol. IV: Optimization and Nonlinear Equations URL
[3] Gay D 2006 Semiautomatic differentiation for efficient gradient computations Automatic Differentiation:
Applications, Theory, and Implementations (Lecture Notes in Computational Science and Engineering
vol 50) ed Bücker M, Corliss G, Naumann U, Hovland P and Norris B (Springer Berlin Heidelberg) pp
147–158 ISBN 978-3-540-28403-1 URL
[4] Bischof C H, Hovland P D and Norris B 2002 Implementation of automatic differentiation tools. PEPM ed
Thiemann P (ACM) pp 98–107 ISBN 1-58113-455-X URL
[5] Cohen M, Naumann U and Riehme J 2003 Towards differentiation-enabled Fortran 95 compiler technology
Proceedings of the 2003 ACM Symposium on Applied Computing SAC ’03 (New York, NY, USA: ACM)
pp 143–147 ISBN 1-58113-624-2 URL
[6] Walther A and Griewank A 2012 Getting started with ADOL-C Combinatorial Scientific Computing ed
Naumann U and Schenk O (Chapman-Hall CRC Computational Science) chap 7, pp 181–202
[7] Narayanan S H K, Norris B and Winnicka B 2010 ADIC2: Development of a component source transformation
system for differentiating C and C++ Procedia Computer Science 1 1845–1853 ISSN 1877-0509 ICCS 2010
[8] Naumann U and Riehme J 2005 A differentiation-enabled Fortran 95 compiler ACM Trans. Math. Softw. 31
458–474 ISSN 0098-3500 URL
[9] Voßbeck M, Giering R and Kaminski T 2008 Development and first applications of TAC++ Advances in
Automatic Differentiation (Lecture Notes in Computational Science and Engineering vol 64) ed Bischof
C H, Bücker H M, Hovland P D, Naumann U and Utke J (Springer Berlin Heidelberg) pp 187–197 ISBN
978-3-540-68935-5 URL
[10] Gaster B, Howes L, Kaeli D R, Mistry P and Schaa D 2013 Heterogeneous Computing with OpenCL:
Revised OpenCL 1.2 Edition 2nd ed (San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.) ISBN
[11] Cook S 2013 CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs 1st ed (San
Francisco, CA, USA: Morgan Kaufmann Publishers Inc.) ISBN 9780124159334
[12] Lattner C and Adve V 2004 LLVM: A compilation framework for lifelong program analysis & transformation
Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO’04) (Palo
Alto, California)
[13] Vasilev V, Canal P, Naumann A and Russo P 2012 Cling – the new interactive interpreter for ROOT 6 Journal
of Physics: Conference Series 396 052071 URL
[14] Beason K 2014 SmallPT: Global illumination in 99 lines of C++ URL
[15] Brun R and Rademakers F 1997 ROOT – an object oriented data analysis framework Nuclear Instruments
and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated
Equipment 389 81–86 ISSN 0168-9002 New Computing Techniques in Physics Research V URL http:
[16] OpenMP Architecture Review Board 2013 OpenMP application program interface URL
... Recent advancements of production quality compilers like Clang allow tools to reuse the language parsing infrastructure, making it easier to implement source transformation AD. ADIC [26], Enzyme [27] and Clad [28] are compiler-based AD tools using source transformation. The increasing importance of AD is evident in newer programming languages such as Swift and Julia where it is integrated deep into the language [29,30]. ...
Automatic Differentiation (AD) is instrumental for science and industry. It is a tool to evaluate the derivative of a function specified through a computer program. The range of AD application domain spans from Machine Learning to Robotics to High Energy Physics. Computing gradients with the help of AD is guaranteed to be more precise than the numerical alternative and have a low, constant factor more arithmetical operations compared to the original function. Moreover, AD applications to domain problems typically are computationally bound. They are often limited by the computational requirements of high-dimensional parameters and thus can benefit from parallel implementations on graphics processing units (GPUs). Clad aims to enable differential analysis for C/C++ and CUDA and is a compiler-assisted AD tool available both as a compiler extension and in ROOT. Moreover, Clad works as a plugin extending the Clang compiler; as a plugin extending the interactive interpreter Cling; and as a Jupyter kernel extension based on xeus-cling. We demonstrate the advantages of parallel gradient computations on GPUs with Clad. We explain how to bring forth a new layer of optimization and a proportional speed up by extending Clad to support CUDA. The gradients of well-behaved C++ functions can be automatically executed on a GPU. The library can be easily integrated into existing frameworks or used interactively. Furthermore, we demonstrate the achieved application performance improvements, including (≈10x) in ROOT histogram fitting and corresponding performance gains from offloading to GPUs.
... This could be done by following the work presented in [137] for instance, which defines a differentiable array based programming language. Another solution would be to rely on an external system that could perform automatic differentiation at a lower level, such as Clad [148], which performs automatic differentiation of llvm programs or Tapenade [72], which works on Fortran or C programs. ...
This thesis is concerned with modelling languages aimed at assisting with modelling and simulation of systems described in terms of differential equations. These languages can be split into two classes: causal languages, where models are expressed using directed equations; and non-causal languages, where models are expressed using undirected equations. This thesis focuses on two related paradigms: FRP and FHM. FRP is an approach to programming causal time-aware applications that has successfully been used in causal modelling applications; while FHM is an approach to programming non-causal modelling applications. However, both are built on similar principles, namely, the treatment of models as first-class entities, allowing for models to be parametrised by other models or computed at runtime; and support for structurally dynamic models, whose behaviour can change during the simulation. This makes FRP and FHM particularly flexible and expressive approaches to modelling, especially compared to other mainstream languages. Because of their highly expressive and flexible nature, providing efficient implementations of these languages is a challenge. This thesis explores novel implementation techniques aimed at improving the performance of existing implementations of FRP and FHM, and other expressive modelling languages built on similar ideas. In the setting of FRP, this thesis proposes a novel embedded FRP library that uses the implementation approach of synchronous dataflow languages. This allows for significant performance improvement by better handling of the reactive network's topology, which represents a large portion of the runtime in current implementations, especially for applications that make heavy use of continuously varying values, such as modelling applications. In the setting of FHM, this thesis presents the modular compilation of a language based on FHM. 
Due to inherent difficulties with the simulation of systems of undirected equations, previous implementations of FHM and similarly expressive languages were either interpreted or generated code on the fly using just-in-time compilation, two techniques which have runtime overhead over ahead-of-time compilation. This thesis presents a new method for generating code for equation systems which allows for the separate compilation of FHM models. Compared with current approaches to FRP and FHM implementation, there is greater commonality between the implementation approaches described here, suggesting a possible way forward towards a future non-causal modelling language supporting FRP-like features, resulting in an even more expressive modelling language.
... Several computer codes exist in which all or some of these techniques are implemented. They include, for example, TAPENADE [38], Stan Math Library [39], CppAD [40], CasADi [41], ADOL-C [42], Clad [43], Adept [44], autodiff [45]. We focus below on the description of automatic differentiation techniques in which operator overloading is used and the derivatives are computed in a forward-mode approach instead of reverse-mode. ...
This work uses advanced numerical techniques (complex differentiation and automatic differentiation) to efficiently and accurately compute all the required thermodynamic properties of an equation of state without any analytical derivatives, particularly without any handwritten derivatives. It avoids the tedious and error-prone process of symbolic differentiation, thus allowing for more rapid development of new thermodynamic models. The technique presented here was tested with several equations of state (van der Waals, Peng-Robinson, Soave-Redlich-Kwong, PC-SAFT, and cubic-plus-association) and high-accuracy multifluid models. A minimal set of algorithms (critical locus tracing and vapor-liquid equilibrium tracing) were implemented in an extensible and concise open-source C++ library: teqp (for Templated EQuation of state Package). This work demonstrates that highly complicated equations of state can be implemented faster yet with minimal computational overhead and negligible loss in numerical precision compared with the traditional approach that relies on analytical derivatives. We believe that the approach outlined in this work has the potential to establish a new computational standard when implementing computer codes for thermodynamic models.
... Alternatively, AD arithmetics can be added by a modified or special compiler [79,80] or through source-to-source transformation tools [81][82][83][84]. An extensive overview of AD tools can be consulted online [85]. ...
The full optimization of the design and operation of instruments whose functioning relies on the interaction of radiation with matter is a super-human task, given the large dimensionality of the space of possible choices for geometry, detection technology, materials, data-acquisition, and information-extraction techniques, and the interdependence of the related parameters. On the other hand, massive potential gains in performance over standard, "experience-driven" layouts are in principle within our reach if an objective function fully aligned with the final goals of the instrument is maximized by means of a systematic search of the configuration space. The stochastic nature of the involved quantum processes make the modeling of these systems an intractable problem from a classical statistics point of view, yet the construction of a fully differentiable pipeline and the use of deep learning techniques may allow the simultaneous optimization of all design parameters. In this document we lay down our plans for the design of a modular and versatile modeling tool for the end-to-end optimization of complex instruments for particle physics experiments as well as industrial and medical applications that share the detection of radiation as their basic ingredient. We consider a selected set of use cases to highlight the specific needs of different applications.
... Dedicated automatic differentiation tools capable of augmenting existing software, rather than requiring complete software rewrites, are needed. New compiler-based source-translation based AD tools, such as enzyme [108] and CLAD [109], are promising for such tasks. ...
The computational cost for high energy physics detector simulation in future experimental facilities is going to exceed the current available resources. To overcome this challenge, new ideas on surrogate models using machine learning methods are being explored to replace computationally expensive components. Additionally, differentiable programming has been proposed as a complementary approach, providing controllable and scalable simulation routines. In this document, new and ongoing efforts for surrogate models and differential programming applied to detector simulation are discussed in the context of the 2021 Particle Physics Community Planning Exercise (`Snowmass').
... Moreover, it imposes constraints on the compiler and complicates porting to GPU. Nevertheless, the application of the source-to-source tools using a special compiler is promising from a practical point of view [99] and makes it possible to hope for including the HMC into future computation systems. ...
... • A study demonstrating that running AD after optimization results in significant performance gains on a standard machine learning benchmark suite [57] and achieves state-of-the-art performance. Related work Clad is a plugin to the Clang compiler that implements forward mode automatic differentiation on a subset of C/C++ with reverse mode in development [59]. Chen et al. [11] present an end-to-end differentiable model for protein structure prediction. ...
Applying differentiable programming techniques and machine learning algorithms to foreign programs requires developers to either rewrite their code in a machine learning framework, or otherwise provide derivatives of the foreign code. This paper presents Enzyme, a high-performance automatic differentiation (AD) compiler plugin for the LLVM compiler framework capable of synthesizing gradients of statically analyzable programs expressed in the LLVM intermediate representation (IR). Enzyme synthesizes gradients for programs written in any language whose compiler targets LLVM IR including C, C++, Fortran, Julia, Rust, Swift, MLIR, etc., thereby providing native AD capabilities in these languages. Unlike traditional source-to-source and operator-overloading tools, Enzyme performs AD on optimized IR. On a machine-learning focused benchmark suite including Microsoft's ADBench, AD on optimized IR achieves a geometric mean speedup of 4.2 times over AD on IR before optimization allowing Enzyme to achieve state-of-the-art performance. Packaging Enzyme for PyTorch and TensorFlow provides convenient access to gradients of foreign code with state-of-the-art performance, enabling foreign code to be directly incorporated into existing machine learning workflows.
Large-scale computing, including machine learning (ML) and AI, offers great promise in enabling sustainability and resiliency of electric energy systems. At present, however, there is no standardized framework for systematic modeling and simulation of system response over time to different continuous- and discrete-time events and/or changes in equipment status. As a result, there is generally a poor understanding of the effects of candidate technologies on the quality and cost of electric energy services. In this chapter, the authors discuss a unified, physically intuitive multi-layered modeling of system components and their mutual dynamic interactions. The fundamental concept underlying this modeling is the notion of interaction variables whose definition directly lends itself to capturing modular structure needed to manage complexity. As a direct result, the same modeling approach defines an information exchange structure between different system layers, and hence can be used to establish structure for the design of a dedicated computational architecture, including AI methods.
The computation of first and second-order derivatives is a staple in many computing applications, ranging from machine learning to scientific computing. We propose an algorithm to automatically differentiate algorithms written in a subset of C99 code and its efficient implementation as a Python script. We demonstrate that our algorithm enables automatic, reliable, and efficient differentiation of common algorithms used in physical simulation and geometry processing.
Cling is an interactive C++ interpreter, built on top of Clang and LLVM compiler infrastructure. Like its predecessor Cint, Cling realizes the read-print-evaluate-loop concept, in order to leverage rapid application development. Implemented as a small extension to LLVM and Clang, the interpreter reuses their strengths such as the praised concise and expressive compiler diagnostics. We show how to match the interpreter concept to the compiler library and generalize common set of requirements for building up an interactive interpreter. We reason the design and implementation decisions as solution to the challenge of implementing interpreter behaviour as an extension of the compiler library. We present the new features, e.g. how C++11 will come to Cling and how Cint-specific extensions are being adopted. We clarify the state of integration in the ROOT framework and the induced change set. We explain how ROOT dictionaries are simplified due to the new interpreter.
The paper describes the development of the software tool Transformation of Algorithms in C++ (TAC++) for automatic differentiation (AD) of C(++) codes by source-to-source translation. We have transferred to TAC++ a subset of the algorithms from its well-established Fortran equivalent, Transformation of Algorithms in Fortran (TAF). TAC++ features forward and reverse as well as scalar and vector modes of AD. Efficient higher order derivative code is generated by multiple application of TAC++. High performance of the generated derivative code is demonstrated for five examples from application fields covering remote sensing, computer vision, computational finance, and aeronautics. For instance, the run time of the adjoints for simultaneous evaluation of the function and its gradient is between 1.9 and 3.9 times slower than that of the respective function codes. Options for further enhancement are discussed.
Automatic differentiation is a semantic transformation that applies the rules of differential calculus to source code. It thus transforms a computer program that computes a mathematical function into a program that computes the function and its derivatives. Derivatives play an important role in a wide variety of scientific computing applications, including numerical optimization, solution of nonlinear equations, sensitivity analysis, and nonlinear inverse problems. We describe the forward and reverse modes of automatic differentiation and provide a survey of implementation strategies. We describe some of the challenges in the implementation of automatic differentiation tools, with a focus on tools based on source transformation. We conclude with an overview of current research and future opportunities.
We introduce the basic notions of automatic differentiation, describe some extensions which are of interest in the context of nonlinear optimization and give some illustrative examples.
We present a novel approach to generating derivative code for mathematical models implemented as Fortran 95 programs using Automatic Differentiation inside a compiler. This technique allows us to combine the advantages of both operator overloading and source transformation based tools for Automatic Differentiation. Furthermore, the compiler's infrastructure for syntactic, semantic, and static data flow analysis can be built on.
Automatic differentiation is a semantic transformation that applies the rules of differential calculus to source code. It thus transforms a computer program that computes a mathematical function into a program that computes the function and its derivatives. Derivatives play an important role in a wide variety of scientific computing applications, including optimization, solution of nonlinear equations, sensitivity analysis, and nonlinear inverse problems. We describe a simple component architecture for developing tools for automatic differentiation and other mathematically oriented semantic transformations of scientific software. This architecture consists of a compiler-based, language-specific front-end for source transformation, loosely coupled with one or more language-independent "plug-in" transformation modules. The coupling mechanism between the front-end and transformation modules is provided by the XML Abstract Interface Form (XAIF). XAIF provides an abstract, language-independent representation of language constructs common in imperative languages, such as C and Fortran. We describe the use of this architecture in constructing tools for automatic differentiation of Fortran 77 and ANSI C, and we discuss how access to compiler optimization techniques can enable more efficient derivative augmentation.
Heterogeneous Computing with OpenCL teaches OpenCL and parallel programming for complex systems that may include a variety of device architectures: multi-core CPUs, GPUs, and fully-integrated Accelerated Processing Units (APUs) such as AMD Fusion technology. Designed to work on multiple platforms and with wide industry support, OpenCL will help you more effectively program for a heterogeneous future. Written by leaders in the parallel computing and OpenCL communities, this book will give you hands-on OpenCL experience to address a range of fundamental parallel algorithms. The authors explore memory spaces, optimization techniques, graphics interoperability, extensions, and debugging and profiling. Intended to support a parallel programming course, Heterogeneous Computing with OpenCL includes detailed examples throughout, plus additional online exercises and other supporting materials. Explains principles and strategies to learn parallel programming with OpenCL, from understanding the four abstraction models to thoroughly testing and debugging complete applications. Covers image processing, web plugins, particle simulations, video editing, performance optimization, and more. Shows how OpenCL maps to an example target architecture and explains some of the tradeoffs associated with mapping to various architectures. Addresses a range of fundamental programming techniques, with multiple examples and case studies that demonstrate OpenCL extensions for a variety of hardware platforms. © 2013 Advanced Micro Devices. Published by Elsevier Inc. All rights reserved.
Many large-scale computations involve a mesh and first (or sometimes higher) partial derivatives of functions of mesh elements. In principle, automatic differentiation (AD) can provide the requisite partials more efficiently and accurately than conventional finite-difference approximations. AD requires source-code modifications, which may be little more than changes to declarations. Such simple changes can easily give improved results, e.g., when Jacobian-vector products are used iteratively to solve nonlinear equations. When gradients are required (say, for optimization) and the problem involves many variables, “backward AD” in theory is very efficient, but when carried out automatically and straightforwardly, may use a prohibitive amount of memory. In this case, applying AD separately to each element function and manually assembling the gradient pieces — semiautomatic differentiation — can deliver gradients efficiently and accurately. This paper concerns on-going work; it compares several implementations of backward AD, describes a simple operator-overloading implementation specialized for gradient computations, and compares the implementations on some mesh-optimization examples. Ideas from the specialized implementation could be used in fully general source-to-source translators for C and C++.
The ROOT system is an Object Oriented framework for large scale data analysis. ROOT, written in C++, contains, among others, an efficient hierarchical OO database, a C++ interpreter, advanced statistical analysis (multi-dimensional histogramming, fitting, minimization, cluster finding algorithms) and visualization tools. The user interacts with ROOT via a graphical user interface, the command line or batch scripts. The command and scripting language is C++ (using the interpreter) and large scripts can be compiled and dynamically linked in. The OO database design has been optimized for parallel access (reading as well as writing) by multiple processes.