

Clad – Automatic Differentiation Using Clang and LLVM

V Vassilev1,2, M Vassilev2, A Penev2, L Moneta1, and V Ilieva3

1CERN, PH-SFT, Geneva, Switzerland

2FMI, University of Plovdiv Paisii Hilendarski, Plovdiv, Bulgaria

3Princeton University, Princeton, NJ, USA

E-mail: vvasilev@cern.ch

Abstract. Differentiation is ubiquitous in high energy physics, for instance in minimization algorithms and statistical analysis, in detector alignment and calibration, and in theory. Automatic differentiation (AD) avoids the well-known limitations in round-off and speed from which symbolic and numerical differentiation suffer, by transforming the source code of functions. We present how AD can be used to compute the gradient of multi-variate functions and functor objects. We explain approaches to implementing an AD tool and show how LLVM, Clang and Cling (ROOT's C++11 interpreter) simplify the creation of such a tool. We describe how the tool can be integrated within any framework. We demonstrate a simple proof-of-concept prototype, called Clad, which is able to generate n-th order derivatives of C++ functions and other language constructs. We also demonstrate how Clad can offload laborious computations from the CPU using OpenCL.

1. Introduction

Both industry and science routinely use the mathematical apparatus of differential calculus. Modeling financial markets, modeling climate change, and searching for the Higgs boson all rely on function optimization, and thus on derivatives. Numerical calculation of derivative values incurs precision losses, which come from the machine's floating point representation and from the stability of the numerical method used. The fragility of the computation becomes even worse when computing higher order derivatives. In practice the user (a programmer) must choose the input values and the stepping delta very carefully, which can be far from trivial. Moreover, the derivative is hard-coded and becomes a maintenance issue: the developer has to differentiate the function mentally or with an external tool and translate the result into the implementation language.

An alternative approach is so-called symbolic differentiation, which overcomes the above-described issues but is sometimes slow [1]. It also does not offer straightforward framework integration: the function to be differentiated is hardcoded in the programming language (e.g. C++), so it has to be translated to the symbolic language, differentiated, and the result translated back to the framework's programming language. Moreover, both numerical and symbolic differentiation suffer when computing gradients and higher-order derivatives.

Though sometimes overlooked, there is a hybrid approach that lies between the two extremes and resolves the issues mentioned above: automatic (or algorithmic) differentiation (AD). A widely accepted definition of AD is "a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program."

ACAT2014 IOP Publishing

Journal of Physics: Conference Series 608 (2015) 012055 doi:10.1088/1742-6596/608/1/012055

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Published under licence by IOP Publishing Ltd

This paper is divided into sections as follows: Section 2, Related Concepts, briefly discusses some of the existing tools in the field and their advantages and disadvantages. Section 3, Concepts of Clad, lays down the key concepts of our prototype, Clad. Section 4, Implementation, uncovers some of the important technical details of the concrete realization. Section 5, Applications, shows a few use cases in the context of computer graphics and high-energy physics. Section 6, Computation Offload, describes a conceptual implementation employing parallel derivative computations. Section 7 presents conclusions and plans.

2. Related Concepts

Informally, the variety of incarnations of AD can be classified into three major classes:

• Implemented via operator overloading – quick to implement, at the cost of an excessive amount of memory; most compilers evaluate expressions containing overloaded operators exactly as they are written, without performing any compile-time optimizations [2].

• Implemented via source-to-source transformations – results in faster derivative generation than operator overloading. The idea is to rewrite an expression in a particular computer language, such as C++ or Fortran [3]. To do so, the AD tool has to perform compiler-like processes such as code parsing, code analysis, and building intermediate representations. The source-transform approach is considered harder to implement, but it has advantages such as better compiler optimizations and a relatively lower amount of used memory [4].

• Implemented via compiler modules – an ideal realization of the source-to-source transformation is implementing AD as compiler modules. Integrating automatic differentiation capabilities into a compiler combines the advantages of both the operator overloading and source-transform approaches [5].

ADOL-C is an AD tool based on operator overloading. It can produce first and higher order derivatives of vector functions written in C/C++. One of its advantages is that the derived functions are valid C/C++ routines. ADOL-C can handle code based on classes, templates and other C++ features. The tool supports the computation of standard objects required for optimization purposes, such as gradients, Jacobians and Hessians [6]. ADOL-C uses the concept of active variables for denoting the possible independent variables of differentiation. All of these variables have to be declared with a special variable type. The derivation process is based on an ADOL-C-specific internal representation, and its start and finish are denoted by calls to special service routines. Every calculation incorporating active variables within the derivation process is recorded in a special data type. Once this is done, ADOL-C proceeds with executing its internal algorithms for computing the derivatives.

ADIC2 is a project following the source-to-source transform approach for Fortran, C and C++ programs. It is built on top of the OpenAD project, which incorporates multiple independent software components [7]. ADIC2 uses the ROSE compiler framework for parsing the input source code and for generating the corresponding abstract syntax trees (ASTs). Unfortunately, ADIC2 is closely coupled with ROSE, so changes in ROSE may result in failures in ADIC2. The tool uses configuration files specifying parameters and settings related to the differentiation process. The AST is passed to a dedicated analyzer which reduces the amount of code passed to the differentiation algorithms. An XML-based data structure is used to denote which parts of the source code should be differentiated and which parts should be ignored by the differentiation algorithms. This XML-based representation is used by another tool that produces the differentiation and a differentiated AST. ADIC2 again uses the ROSE compiler for creating the output source code from the resulting AST.

The differentiation-enabled Fortran 95 compiler incorporates both the operator overloading and source transform techniques, and is thus considered a hybrid compiler. The mathematics support is handled by a compiler-integrated module which provides overloaded versions of the arithmetic operators. The user is allowed to select the dependent and independent variables. The compiler uses a special active data type that holds the function value as well as a vector of directional derivatives. The overloaded operators as well as the active data types are contained in a compiler-specific module. The independent variables, as well as the sections of code that need to be differentiated, have to be marked using directive-like statements in order for the compiler to recognize them. Unfortunately, no static data flow analysis of the code is conducted, so the achieved efficiency is not optimal [8].

The source-to-source transformation technique is considered the ideal approach for building large-scale, runtime-critical applications [9]. Furthermore, if the source-to-source transformers are implemented as part of an existing compiler, the implementation becomes extremely efficient and easy to maintain, at the cost of limited portability.

3. Concepts of Clad

An automatic differentiation algorithm takes a function (F) written in a programming language (L) and translates it, yielding another function (F′) written in another programming language (R), where the translation (F) → (F′) follows the rules of differential calculus, turning F′ into a derivative. For many tools (L) matches (R), i.e. the implementation language of the input function is the same as the programming language of the differentiated function. Usually this is so because of design or technical limitations of the implementation, and because most uses of (F′) tend to be in the same framework and programming environment. An interesting domain of research is when (L) differs from (R) but (L) is compatible with (R); for example, (F) is written in C/C++ and (F′) is written in OpenCL [10] or CUDA [11].

A derivative is produced by transforming the body of F operation by operation. Every operation is transformed following the well-known differentiation rules. Every statement in a C++ function is treated as a standalone transformation entity. The translation employs the chain rule of differentiation.

The chain rule of differential calculus provides a mathematically proven simplification of the translation process. It reduces the implementation complexity of the algorithm responsible for the differentiation. There are two general flavors of implementing the differentiation: top-down or bottom-up. In the top-down approach (also called forward mode or tangent mode) the computations are done at every step and the byproducts are thrown away. In the bottom-up approach (also called reverse mode or adjoint mode) the computed byproducts are stored and reused. This is particularly useful when computing derivatives of the same function with respect to different independent variables, for example when calculating a function's gradient.

\[
\frac{\partial z}{\partial t} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial t}, \qquad (1)
\]

where z = f(x, y), x = g(t), y = h(t).

AD works under the assumption that a derivative exists, i.e. it does not check whether the differentiated function is continuous on any interval. If the logic of the original function is mathematically incorrect, there is no way for AD to produce a correct derivative.


4. Implementation

Clad is an AD tool implemented on top of the LLVM compiler infrastructure [12]. The underlying compiler platform (Clang) builds an internal representation of the source code in the form of an AST. Parts of the AST are used to communicate with third-party libraries, which can further specialize the compilation process. Because of this internal design and the communication through well-defined representations, Clad can be packaged in different ways: it can easily be transformed from a plugin into a standalone tool, or into an extension of third-party tools such as the interactive C++ interpreter Cling [13].

Once Clad receives the necessary information, it can decide whether a derivative is requested and produce it. It transforms the body of the differentiation candidate, cloning its AST nodes while applying the differentiation rules. If the body is not present, it tries to find user-provided directions on how to proceed.

4.1. Usage

Currently, Clad is shipped as a plugin library for the Clang compiler. One can attach the library to the Clang compiler, and it will produce derivatives at compilation time as part of the current object file. However, this approach has some limitations: it is bound to a particular compiler and compiler version, and the project using the derivatives must be compiled with the same compiler. For this reason, Clad can operate in three conceptually different modes (Figure 1), generating derivatives in different representations. Derivatives can become part of either:

• an object file – if a derivative is requested, Clad emits it as if it were present in the source file;

• a source file – Clad can write out valid source code of the derivative into a source file. This is handy when the user wants to produce the list of requested derivatives and compile them with another compiler;

• a shared library – Clad can write out the derivatives into a shared object (dynamic-link library). This is useful when the user prefers another compiler that is binary compatible with Clang; for instance, the user wants to compile the application with GCC/ICC but still use the generated derivatives.

Figure 1: Clad usage scenarios – clang, with libClad.so attached, turns Foo.cxx into Foo.o, FooDerivatives.cxx, or libFooDerivatives.so.

Listing 1 shows how simple it is to use Clad. The example demonstrates how to produce the first derivative of a squaring function (pow2). One needs to include a small header file introducing Clad's runtime, and to use the clad::differentiate function to specify which function needs to be differentiated. It takes two arguments: the function to differentiate and the position of the independent variable. In the example below, Clad fills in the body of pow2_darg0 statement by statement, following the derivation rules.

Listings 2 and 3 illustrate the textual mode of Clad. Clad can use the produced derivative straight away or write its source code into a file. Listing 2 shows the user code and Listing 3 the automatically generated derivatives.


#include <cstdio>
#include "clad/Differentiator/Differentiator.h"

double pow2(double x) { return x * x; }
double pow2_darg0(double); // Body will be filled by Clad.

int main() {
  clad::differentiate(pow2, 0);
  printf("Result is %f\n", pow2_darg0(4.2)); // Prints out 8.4
  return 0;
}

Listing 1: Clad will produce the body of the forward-declared function.

#include "clad/Differentiator/Differentiator.h"

float example1(float x, float y) {
  return x * x * y * y;
}

void diffExamples() {
  clad::differentiate(example1, 0);
  clad::differentiate(example1, 1);
}

Listing 2: Differentiation of function example1.

float example1_darg0(float x, float y) {
  return (((x + x) * y) * y);
}

float example1_darg1(float x, float y) {
  return ((x * x) * y + x * x * y);
}

Listing 3: Derivatives of the function example1.

Valid C++ functions are generated, with a system-defined name inferred from the original function name, the independent variable of differentiation, and the derivative order. By construction, the signature of the newly generated function is the same as that of the original function. Differentiation may be applied not only to simple functions but also to more complex language constructs such as:

• templated C++ constructs;

• classes and structs (e.g. functors);

• virtual functions.

4.1.1. Builtin Derivatives Some functions have no bodies, because they are only forward declared and their implementation is in a library. The differentiation of some functions may also be steered for performance improvements. Clad has a mechanism that allows overriding the default differentiation policy and providing user-directed substitutions. Listing 4 specializes the default behavior of the differentiation by replacing the default derivative of sine (the built-in cosine) with a user-specified implementation. The implementation of the user-specific substitutions relies on a namespace with overloaded semantics. All substitutions need to be placed inside a special namespace called custom_derivatives and to follow specific naming rules. Before Clad builds a derivative, it checks whether a predefined derivative candidate already exists; if so, it simply uses the one available.

All built-in derivatives rely on this mechanism: differentiation of trigonometric functions and the derivatives of other special functions are done using such substitutions.

4.1.2. Higher Order Derivatives and Mixed Partial Derivatives Clad provides a convenient way of obtaining the n-th order derivative of a specific function, by invoking the templated version of the clad::differentiate function; see Listing 4. Mixed derivatives are produced as shown in Listing 5.


// ...
namespace custom_derivatives {
  double sin_darg0(double x) { return my_better_impl::cos(x); }
}

void secondDerivative() {
  clad::differentiate<2>(sin, 0);
}

Listing 4: Generation of the second derivative of sin, using substitutions.

float example2(float x, float y) {
  return x * x * y;
}

float example2_darg0(float x, float y);

auto example_darg0 = clad::differentiate(example2, 0);
auto example_darg0_darg1 = clad::differentiate(example2_darg0, 1);

Listing 5: Generation of mixed partial derivatives.

4.2. Performance

Clad works in synergy with the Clang compiler. Table 1 shows the times to compile the test cases with and without Clad. We test the overhead in two extreme cases: one where it differentiates a function body with many statements (Listing 6), and another where the body contains a very large expression (Listing 7). In more realistic scenarios Clad's overhead is negligible, because the functions to be differentiated are much shorter and their volume compared to the rest of the code is much smaller.

Table 1: Compilation times of Listing 6 and Listing 7.

Test              With Clad   No Clad   Clad overhead
Large body        1.007s      0.993s    0.014s (1.39%)
Large expression  3.258s      2.432s    0.826s (25.35%)

double f1(double x) {
  x = x + 2; x = x * x; x = x + x;
  // ... repeated 1020 times.
  return x;
}

Listing 6: Performance – Large function body.

double f2(double x) {
  return
    x + 2 + x * x + x + x +
    // ... 1020 repetitions ...
    x;
}

Listing 7: Performance – Large expression.

5. Applications

We explored two major application domains for the automatic diﬀerentiator Clad – computer

graphics and high-energy physics.


5.1. Computer Graphics

We embedded Clad in a demonstration path tracer, called SmallPT [14]. We investigated how

diﬃcult and laborious would the integration be. SmallPT shoots rays of light from the viewer

towards the scene and computes lighting properties at the intersection points. In order to do

that, it computes three partial derivatives of the function describing the surface. They form

the normal vector at the intersection point between a ray and a surface. We replaced the

hand-written implementation with invocations of Clad. The performance is comparable to the

performance of the hand-written calculations. Calculating the normal vectors using numerical

approximation is about three times slower. The timing was collected from 1010 calculations of the

gradient at random points on a sphere and a hyperbolic solid. Another advantage of embedding

Clad allows even to improve the ﬂexibility of our modiﬁcation of SmallPT implementation,

because it allows to ﬁnd derivatives of arbitrary implicit surfaces (not only spheres as in the

original code). The addition of a new surface is now easier, because the hand-written derivative

will not be needed.

5.2. High-Energy Physics

The ROOT Framework [15] is widely adopted in high-energy physics for data analysis. It oﬀers

a wide range of mathematical tools for ﬁtting and minimization. These tools use extensively

derivatives and some of them are hand-written while others are numerically calculated. We plan

on adopting Clad in ROOT6 through its C++ interpreter – Cling [13]. We expect performance

improvement and Clad is expected to become a gateway for derivative computation on General-

Purpose Computing on Graphics Processing Units (GPGPU). This work is still to be done

soon.

6. Computation Oﬄoad

Derivative calculation is a computing-intensive and time-consuming process. Clad's immediate goal is to increase computational performance and to make use of all the computing power of the environment. A natural evolution is execution on GPGPUs: computation can be accelerated by offloading calculations from the CPU to the available GPGPUs. There are two conceptually different approaches:

• Using built-in compiler facilities to guide compilation (usually via 'pragma' directives) to turn on parallel execution and the automatic offload of specific computation-intensive parts of the code to selected accelerators. This is the approach of OpenMP 4.0 [16]. The advantage is that the code does not change significantly; if the compiler does not support these options, it ignores the parallelisation directives gracefully. The disadvantages are that it is very hard to make full use of the underlying computing architectures, because the pragma directives cannot specialize the algorithm for all architectures, and that some effort and re-engineering is still required to get parallelism.

• Using a specific programming language (such as OpenCL or CUDA) to talk to the hardware. This approach entails rewriting the algorithms, or parts of them. The main advantage is that the algorithm can fully comply with the architecture of the target hardware. The disadvantage is that porting it to every architecture and hardware requires a lot of effort and expertise.

We target scalability by changing the semantics of language syntax constructs, and based on them we enhance the compiler actions. A key goal for these transformations is to be as transparent as possible. Most actions are performed automatically, at compile time. The goal is to retain most of the advantages of the above-mentioned approaches without forcing major changes in the user code.


Listing 8 presents an example of a CPU-calculated gradient. The implementation finds a sum of the partial derivatives of the Rosenbrock function.

float rosenbrock(float x[], int size) {
  auto rosenbrockX = clad::differentiate(rosenbrock_func, 0);
  auto rosenbrockY = clad::differentiate(rosenbrock_func, 1);
  float sum = 0;
  for (int i = 0; i < size - 1; i++) {
    float one = rosenbrockX.execute(x[i], x[i + 1]);
    float two = rosenbrockY.execute(x[i], x[i + 1]);
    sum += one + two;
  }
  return sum;
}

Listing 8: Rosenbrock function implementation using Clad.

Listing 9 shows how the user specifies that the computation should be offloaded. The original implementation needs to be transformed into a lambda function, a very straightforward and almost transparent conversion, not much harder than putting a pragma directive in the source code. The code is not trivial to offload, because it contains a for-loop which must be parallelized in order to exploit the properties of the GPGPU architecture. Furthermore, we need to calculate the sum of the results of the calculations, which requires a reduction on the highly parallel hardware.

float rosenbrock_offloaded(float x[], int size) {
  return clad::experimental_offload([=] {
    auto rosenbrockX = clad::differentiate(rosenbrock_func, 0);
    auto rosenbrockY = clad::differentiate(rosenbrock_func, 1);
    float sum = 0;
    for (int i = 0; i < size - 1; i++) {
      float one = rosenbrockX.execute(x[i], x[i + 1]);
      float two = rosenbrockY.execute(x[i], x[i + 1]);
      sum += one + two;
    }
    return sum;
  });
}

Listing 9: Computation offloading – conceptual implementation using a C++11 lambda function.

After this minimalistic transformation is done, Clad takes over and performs the rest automatically. Every call to clad::experimental_offload is replaced by a call to a function generated by the Clad plugin. This is done by an AST transform of the lambda function, allowing the parameters to be transmitted and one or more kernels to be invoked.

In order to achieve optimal performance, Clad has to keep the computing units busy, which is far from a trivial task. Computation offload to a GPGPU has to be planned very well and should happen when processing large data sets. One of the reasons is that all data needs to be copied through the system bus to a peripheral device, which introduces big overheads. Clad-generated code that passes data and executes kernels should provide optimal parallel load as well as minimal data transfer between different types of memory (otherwise latency grows too large). The use of read-only and write-only parameters, and other similar techniques, is recommended in order to achieve better performance.

The example above was tested for performance with 100 calls to the Rosenbrock function over 1024×1024×48 float numbers. The resulting benchmarks are shown in Table 2.

Table 2: OpenCL parallel execution results.

Test       Device             Clock    Compute Units  Work Group  Global Memory  Local Memory  Time
Original   Intel i7-2635QM    2GHz     1              –           –              –             12.466s
Multicore  Intel i7-2635QM    2GHz     4              1024        8192MB         32KB          11.128s
           ATI Radeon 6490M   150MHz   2              256         256MB          32KB          18.183s
           AMD Devastator     844MHz   6              256         2047MB         32KB          15.479s
           NVIDIA Tesla K20m  705MHz   13             1024        4800MB         47KB          10.615s

The benchmarking showed that the multicore CPU run is only slightly faster than the original single-core computation; this is because of the copying of large amounts of data. Overall, the GPGPU offload is very sensitive to the particular hardware, mainly due to the sequential copying of large data sets between the host and the device. There is also latency because of the synchronous manner of execution: the next data transfer waits for the previous computation to finish. In the case of the NVIDIA Tesla, the CPU offload reached 50%, which is a very good result for a prototype implementation.

7. Conclusion and Plans

Derivative production is important not only in high-energy physics but in many other domains. Automatic differentiation is an often overlooked approach to computing derivatives. It eliminates the precision losses of numerical differentiation and is faster than symbolic differentiation; it can also reduce the complexity of gradient computations. The tool mainly focuses on C and, partially, C++, because of the complexity of the language. We presented an innovative proof-of-concept prototype facilitating automatic differentiation, called Clad. It is based on the industrial-strength compiler technologies Clang and LLVM. Clad can differentiate non-trivial C++ routines and is getting closer to production-grade quality. Adding full C support is trivial; the only work that needs to be done is making the runtime environment C compliant.

We explained how it can be used to produce derivatives of various orders; how to produce mixed derivatives; how to perform user-based substitutions steering the differentiation process; and how to offload computations in heterogeneous environments. There is still a lot of room for improvement. We have a conceptual implementation in OpenCL providing a way to offload computations in heterogeneous environments (making extensive use of GPGPUs); however, this work is still experimental and much effort is required to make Clad robust.

We plan to generalize the computation of gradients and Jacobians of functions, and to reduce the computational complexity of these computations by making use of the reverse mode of automatic differentiation in cases with many seeds.

Another immediate plan is to integrate Clad into Cling, the C++ interpreter of ROOT6, which would make Clad available to the entire high-energy physics community. Then we can proceed with using Clad in ROOT's minimization and fitting algorithms.


Acknowledgments

This work was partially facilitated by the Google Summer of Code Program 2013 and 2014, through the CERN-SFT mentoring organization. The heterogeneous environment benchmarking is supported by the University of Plovdiv "Paisii Hilendarski" through Fund "Research" under contract NIS14-FMIIT-002/26.03.2014.

References

[1] Castro M, Vieira R and Biscaia Jr E 2000 Automatic differentiation tools in the dynamic simulation of chemical engineering processes Brazilian Journal of Chemical Engineering 17 373–382 ISSN 0104-6632 URL http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0104-66322000000400002&nrm=iso
[2] Bartholomew-Biggs M, Brown S, Christianson B and Dixon L 2000 Automatic differentiation of algorithms Journal of Computational and Applied Mathematics 124 171–190 ISSN 0377-0427 Numerical Analysis 2000. Vol. IV: Optimization and Nonlinear Equations URL http://www.sciencedirect.com/science/article/pii/S0377042700004222
[3] Gay D 2006 Semiautomatic differentiation for efficient gradient computations Automatic Differentiation: Applications, Theory, and Implementations (Lecture Notes in Computational Science and Engineering vol 50) ed Bücker M, Corliss G, Naumann U, Hovland P and Norris B (Springer Berlin Heidelberg) pp 147–158 ISBN 978-3-540-28403-1 URL http://dx.doi.org/10.1007/3-540-28438-9_13
[4] Bischof C H, Hovland P D and Norris B 2002 Implementation of automatic differentiation tools PEPM ed Thiemann P (ACM) pp 98–107 ISBN 1-58113-455-X URL http://dblp.uni-trier.de/db/conf/pepm/pepm2002.html#BischofHN02
[5] Cohen M, Naumann U and Riehme J 2003 Towards differentiation-enabled Fortran 95 compiler technology Proceedings of the 2003 ACM Symposium on Applied Computing SAC '03 (New York, NY, USA: ACM) pp 143–147 ISBN 1-58113-624-2 URL http://doi.acm.org/10.1145/952532.952564
[6] Walther A and Griewank A 2012 Getting started with ADOL-C Combinatorial Scientific Computing ed Naumann U and Schenk O (Chapman-Hall CRC Computational Science) chap 7, pp 181–202
[7] Narayanan S H K, Norris B and Winnicka B 2010 ADIC2: Development of a component source transformation system for differentiating C and C++ Procedia Computer Science 1 1845–1853 ISSN 1877-0509 ICCS 2010 URL http://www.sciencedirect.com/science/article/pii/S1877050910002073
[8] Naumann U and Riehme J 2005 A differentiation-enabled Fortran 95 compiler ACM Trans. Math. Softw. 31 458–474 ISSN 0098-3500 URL http://doi.acm.org/10.1145/1114268.1114270
[9] Voßbeck M, Giering R and Kaminski T 2008 Development and first applications of TAC++ Advances in Automatic Differentiation (Lecture Notes in Computational Science and Engineering vol 64) ed Bischof C H, Bücker H M, Hovland P D, Naumann U and Utke J (Springer Berlin Heidelberg) pp 187–197 ISBN 978-3-540-68935-5 URL http://dx.doi.org/10.1007/978-3-540-68942-3_17
[10] Gaster B, Howes L, Kaeli D R, Mistry P and Schaa D 2013 Heterogeneous Computing with OpenCL: Revised OpenCL 1.2 Edition 2nd ed (San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.) ISBN 9780124055209
[11] Cook S 2013 CUDA Programming: A Developer's Guide to Parallel Computing with GPUs 1st ed (San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.) ISBN 9780124159334
[12] Lattner C and Adve V 2004 LLVM: A compilation framework for lifelong program analysis & transformation Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO'04) (Palo Alto, California)
[13] Vasilev V, Canal P, Naumann A and Russo P 2012 Cling – the new interactive interpreter for ROOT 6 Journal of Physics: Conference Series 396 052071 URL http://stacks.iop.org/1742-6596/396/i=5/a=052071
[14] Beason K 2014 SmallPT: Global illumination in 99 lines of C++ URL http://www.kevinbeason.com/smallpt/
[15] Brun R and Rademakers F 1997 ROOT – an object oriented data analysis framework Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 389 81–86 ISSN 0168-9002 New Computing Techniques in Physics Research V URL http://www.sciencedirect.com/science/article/pii/S016890029700048X
[16] OpenMP ARB 2013 OpenMP application program interface URL http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf
