
[Graphical abstract: Visual illustration of algorithmic differentiation and overview of the topics covered in this paper.]
OVERVIEW
An introduction to algorithmic differentiation
Assefaw H. Gebremedhin¹ | Andrea Walther²
¹School of Electrical Engineering and Computer Science, Washington State University, Pullman, Washington
²Institute for Mathematics, Paderborn University, Paderborn, Germany
Correspondence
Assefaw H. Gebremedhin, School of
Electrical Engineering and Computer
Science, Washington State University,
Pullman, WA, 99164, USA.
Email: assefaw.gebremedhin@wsu.edu
Funding information
National Science Foundation, Grant/Award
Number: IIS-1553528
Abstract
Algorithmic differentiation (AD), also known as automatic differentiation, is a technology for accurate and efficient evaluation of derivatives of a function given as a computer model. The evaluations of such models are essential building blocks in numerous scientific computing and data analysis applications, including optimization, parameter identification, sensitivity analysis, uncertainty quantification, nonlinear equation solving, and integration of differential equations. We provide an introduction to AD and present its basic ideas and techniques, some of its most important results, the implementation paradigms it relies on, the connection it has to other domains including machine learning and parallel computing, and a few of the major open problems in the area. Topics we discuss include: forward mode and reverse mode of AD, higher-order derivatives, operator overloading and source transformation, sparsity exploitation, checkpointing, cross-country mode, and differentiating iterative processes.
This article is categorized under:
Algorithmic Development > Scalable Statistical Methods
Technologies > Data Preprocessing
KEYWORDS
adjoints, algorithmic differentiation, automatic differentiation, backpropagation, checkpointing,
sensitivities
1 | INTRODUCTION
Efficient calculation of derivative information is an indispensable building block in numerous applications ranging from
methods for solving nonlinear equations to sophisticated simulations in unconstrained and constrained optimization. Suppose
a sufficiently smooth function F : ℝⁿ → ℝᵐ is given. There are a few different techniques one can use to compute, either exactly or approximately, the derivatives of F. In comparing and contrasting the conceivable approaches, it is necessary to take into account the desired accuracy and the computational effort involved.
The finite difference method is a very simple way to obtain approximations of derivatives. Following a truncated Taylor expansion analysis, all that is needed for estimating derivatives using finite differences is function evaluations at different arguments, taking the difference and dividing by a step length (see Quarteroni, Sacco, & Saleri, 2000, Chapter 10.10 for details on this class of methods). The runtime of approximating derivatives using finite differences grows linearly with n, the number of unknowns, which is quite often prohibitively expensive. So the finite difference method is simple, but inexact and slow. On the other hand, we should also note that some optimization algorithms are not strongly affected by the inexactness of the derivative information provided by a finite difference method.
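A minimal sketch (plain Python with an illustrative test function, not drawn from the article) makes the accuracy limitation concrete: the truncation error shrinks as the step length h decreases, but below a certain h floating-point cancellation dominates, so a first-order forward difference cannot do much better than a relative accuracy of about √ε_mach.

```python
import math

def f(x):
    return math.exp(math.sin(x))                  # illustrative test function

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h                  # O(h) truncation error

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2.0 * h)      # O(h^2) truncation error

x = 1.0
exact = math.cos(x) * math.exp(math.sin(x))       # known derivative for comparison
for h in (1e-2, 1e-5, 1e-8, 1e-11):
    # Errors first decrease with h, then grow again once the subtraction
    # of nearly equal function values loses significant digits.
    print(f"h={h:.0e}  fwd err={abs(forward_diff(f, x, h) - exact):.2e}  "
          f"ctr err={abs(central_diff(f, x, h) - exact):.2e}")
```

Each entry of a gradient needs at least one extra function evaluation, which is where the linear growth in n comes from.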
Received: 13 February 2019 | Revised: 26 June 2019 | Accepted: 2 August 2019
DOI: 10.1002/widm.1334
WIREs Data Mining Knowl Discov. 2020;10:e1334. wires.wiley.com/dmkd © 2019 Wiley Periodicals, Inc.
... Its main drawbacks are that it scales poorly with the dimension of the parameter vector, since the PDE needs to be solved once per parameter in addition to the solve for the objective itself, and that it typically only achieves a relative precision on the order of √ε_mach due to cancellation errors, with ε_mach the machine precision. Algorithmic differentiation [8], while not fully black-box, makes use of the existing solver code, automatically generating code for calculating the corresponding derivative. Conceptually, this approach relies on repeated application of the chain rule. ...
... In our case, the so-called backward mode is the most advantageous, as the parameter dimension is much larger than one. This approach computes the gradient of the objective Ĵ to a relative precision on the order of ε_mach, with a theoretical cost upper bound of four times that of evaluating Ĵ itself [8]. However, in the context of Monte Carlo simulations, this approach requires accessing particle states in a time-reversed fashion. ...
Preprint
Full-text available
In PDE-constrained optimization, one aims to find design parameters that minimize some objective, subject to the satisfaction of a partial differential equation. A major challenge is computing gradients of the objective with respect to the design parameters, as applying the chain rule requires computing the Jacobian of the PDE's state with respect to the design parameters. The adjoint method avoids this Jacobian by computing partial derivatives of a Lagrangian. Evaluating these derivatives requires the solution of a second PDE with the adjoint of the constraint's differential operator, resulting in a backwards-in-time simulation. Particle-based Monte Carlo solvers are often used to compute the solution to high-dimensional PDEs. However, such solvers have the drawback of introducing noise to the computed results, thus requiring stochastic optimization methods. To guarantee convergence in this setting, both the constraint and adjoint Monte Carlo simulations should simulate the same particle trajectories. For large simulations, storing full paths from the constraint equation for re-use in the adjoint equation becomes infeasible due to memory limitations. In this paper, we provide a reversible extension to the family of permuted congruential pseudorandom number generators (PCG). We then use such a generator to recompute these time-reversed paths for the heat equation, avoiding these memory issues.
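The reversibility idea can be sketched with the linear congruential core that PCG-style generators build on. This is only an illustration of why such state transitions can be run backwards, not the extension proposed in the preprint, which also handles PCG's permuted output function.

```python
# Reversible linear congruential state transition (the linear core that
# PCG-family generators build on). Because the multiplier A is odd, it
# has a modular inverse mod 2^64, so past states can be recomputed
# backwards instead of being stored.
M = 1 << 64
A = 6364136223846793005      # a widely used 64-bit LCG multiplier
C = 1442695040888963407
A_INV = pow(A, -1, M)        # modular inverse of A (Python 3.8+)

def forward(s):
    return (A * s + C) % M

def backward(s):
    return (A_INV * (s - C)) % M

s0 = 0x123456789
s1 = forward(s0)
assert backward(s1) == s0    # every step is exactly invertible
```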
... In particular, this work is motivated by the use of graph coloring as a preprocessing step for distributed scientific computations such as automatic differentiation [7]. For such applications, assembling the associated graphs on a single node to run a sequential coloring algorithm may not be feasible [3]. ...
... We selected many of the same graphs used by Deveci et al. to allow for direct performance comparisons. We include many graphs from Partial Differential Equation (PDE) problems because they are representative of graphs used with Automatic Differentiation [7], which is a target application for graph coloring algorithms. We also include social network graphs and a web crawl to demonstrate scaling of our methods on irregular real-world datasets. ...
Preprint
Full-text available
Graph coloring is often used in parallelizing scientific computations that run in distributed and multi-GPU environments; it identifies sets of independent data that can be updated in parallel. Many algorithms exist for graph coloring on a single GPU or in distributed memory, but to the best of our knowledge, hybrid MPI+GPU algorithms have been unexplored until this work. We present several MPI+GPU coloring approaches based on the distributed coloring algorithms of Gebremedhin et al. and the shared-memory algorithms of Deveci et al. The on-node parallel coloring uses implementations in KokkosKernels, which provide parallelization for both multicore CPUs and GPUs. We further extend our approaches to compute distance-2 and partial distance-2 colorings, giving the first known distributed, multi-GPU algorithm for these problems. In addition, we propose a novel heuristic to reduce communication for recoloring in distributed graph coloring. Our experiments show that our approaches operate efficiently on inputs too large to fit on a single GPU and scale up to graphs with 76.7 billion edges running on 128 GPUs.
... The reverse mode computes the Jacobian row by row, at a cost that depends on the number of outputs. It requires storing a larger amount of data than the forward mode [15,18], and there are techniques for reducing the memory usage of reverse AD [18]. The two primary implementation approaches for AD are source transformation and operator overloading. ... (A minimal illustration of the two modes follows this entry.)
... The critical step for a successful HMC algorithm is the capability of computing the gradient of the potential energy efficiently enough that the algorithm can perform the desired number of iterations in a reasonable amount of time. In our case, computing derivatives of the forward models with respect to vertex positions by hand is rather cumbersome, so, for convenience, we resort to the technique of automatic differentiation (AD) (Gebremedhin & Walther, 2020; Griewank & Walther, 2008; Sambridge et al., 2007). AD is a computational technique that automatically generates the derivative of a user-provided function, typically on the fly, for almost any code written in a language where such a tool is available. ...
Article
Full-text available
Two‐dimensional modeling of gravity and magnetic anomalies in terms of polygonal bodies is a popular approach to infer possible configurations of geological structures in the subsurface. Alternatively to the traditional trial‐and‐error manual fit of measured data, here we illustrate a probabilistic strategy to solve the inverse problem. First we derive a set of formulae for solving a 2.75‐dimensional forward model, where the polygonal bodies have a given finite lateral extent, and then we devise a Hamiltonian Monte Carlo algorithm to jointly invert gravity and magnetic data for the geometry and properties of the polygonal bodies. This probabilistic approach fully addresses the nonlinearity of the forward model and provides uncertainty estimation. The result of the inversion is a collection of models which represent the posterior distribution, analysis of which provides estimates of sought properties and may reveal different scenarios.
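Returning to the excerpts above on the forward and reverse modes: the following self-contained sketch (illustrative Python, unrelated to the tools cited in these entries) shows both modes on a toy function f(x, y) = sin(x·y) + x². The forward mode carries a (value, derivative) pair through the computation and yields one Jacobian column per sweep; the reverse mode records local partial derivatives and propagates adjoints backwards, obtaining all input sensitivities in one sweep at the price of keeping the recorded data.

```python
import math

# --- Forward mode: propagate a tangent alongside each value. ---
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        return Dual(self.val + o.val, self.dot + o.dot)
    def __mul__(self, o):
        return Dual(self.val * o.val,
                    self.dot * o.val + self.val * o.dot)   # product rule

def dsin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)  # chain rule

def f_dual(x, y):
    return dsin(x * y) + x * x        # f(x, y) = sin(x*y) + x^2

# One sweep per input variable -> one Jacobian column each.
print(f_dual(Dual(2.0, 1.0), Dual(3.0)).dot)   # df/dx = y*cos(x*y) + 2x
print(f_dual(Dual(2.0), Dual(3.0, 1.0)).dot)   # df/dy = x*cos(x*y)

# --- Reverse mode: record local partials, then sweep backwards. ---
class Var:
    def __init__(self, val, parents=()):
        self.val, self.grad, self.parents = val, 0.0, parents
    def __add__(self, o):
        return Var(self.val + o.val, [(self, 1.0), (o, 1.0)])
    def __mul__(self, o):
        return Var(self.val * o.val, [(self, o.val), (o, self.val)])

def vsin(x):
    return Var(math.sin(x.val), [(x, math.cos(x.val))])

def backprop(node, adj=1.0):
    # Propagate the adjoint along every path of the expression graph;
    # real tools use a single reverse sweep over a recorded tape instead.
    node.grad += adj
    for parent, local in node.parents:
        backprop(parent, local * adj)

x, y = Var(2.0), Var(3.0)
backprop(vsin(x * y) + x * x)   # one reverse sweep -> all input partials
print(x.grad, y.grad)           # df/dx and df/dy together
```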
... By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed algorithmically. The interested reader is referred to [GW19] for an overview of AD methods. Alternatively, one can provide the derivatives manually, yet manual differentiation is a cumbersome process that is prone to errors. ...
Thesis
Inertial motion tracking describes the application of estimating the relevant navigation states, e.g., position, velocity, and orientation, of a moving system using inertial measurement data. The measurements are provided by an inertial measurement unit, which describes a collection of sensors measuring linear acceleration and angular velocity in all relevant directions. As a result of technological advances, such sensors are nowadays widely available for a broad range of applications in robotics, automation, animation, ergonomics, and biomechanics. Like every sensor, an inertial measurement unit also suffers from measurement inaccuracies, which cause a drift in estimates when integrated over time in order to obtain the desired navigation state. To compensate for this accumulating error and to obtain estimates with high accuracy instead, information from other sensors is considered in a sensor fusion approach. The Kalman filter, including its extensions for nonlinear systems, represents the workhorse in state-of-the-art sensor fusion approaches. In this thesis, we use moving horizon estimation for the sensor fusion problems arising in inertial motion tracking. Moving horizon estimation is an approach for state estimation which uses the powerful framework of nonlinear numerical optimization over a window of recent measurements in order to increase the robustness and accuracy of the estimates. Estimating the motion in three-dimensional space is a highly nonlinear problem and therefore well-suited for moving horizon estimation. We present and evaluate methods to handle necessary over-parameterizations of the orientation state in moving horizon estimation. We propose several estimator implementations for inertial motion tracking using position, velocity, and magnetometer measurements to aid against drifting estimates. Furthermore, we propose new algorithms beyond the current state-of-the-art motion tracking approaches by embedding problems like sensor calibration and delay compensation into our moving horizon estimation algorithms. The proposed approaches are evaluated and validated by adapting state-of-the-art simulation approaches to our problem and using real-world experiments adopting high-grade sensing technologies. The results of the proposed algorithms demonstrate that moving horizon estimation is not only a valid alternative to traditional Kalman filtering but represents a possible way to decrease the dependency on costly calibration and design processes.
... One approach to obtain both the tangent linear and adjoint based derivatives directly from a code, i.e., the discrete approach, is to use algorithmic differentiation (AD) (Bücker et al., 2006; Griewank, 2008; Baydin et al., 2017; Gebremedhin and Walther, 2020). This used to be commonly termed automatic differentiation since in principle, and for short simple codes, it can indeed be automatic; however, for large complex codes it requires developer intervention, and so "automatic" can be something of a misnomer. ...
Chapter
Optimizing marine renewable energy systems to maximize performance is key to their success. However, a range of physical, environmental, engineering, economic, and computational challenges means that this is not straightforward. This article considers this topic, focusing on those systems whose performance is coupled to the hydrodynamics providing the resource; tidal power represents a clear example of this. In such cases system design must be optimal in relation to the resource's magnitude as well as its spatial and temporal variation, which are all dependent on the system's configuration and operation and so cannot be assumed to be known at the design stage. Designing based on the ambient resource could lead to under-performance. Coupling between the design and the resource has implications for the complexity of the optimization problem and potential hydrodynamical and environmental impacts. This coupling distinguishes many marine energy systems from other renewables which do not impact in any significant manner on the resource. The optimal design of marine energy systems thus represents a challenging and somewhat unique problem. However, feedback also opens up a number of possibilities where the resource can be 'controlled', to maximize the cumulative power obtained from multiple devices or plants, or to achieve some other complementary goal. Design optimization is thus critical, with many issues to consider. Due to the complexity of the problem a computational based solution is a necessity in all but the simplest scenarios. However, the coupled feedback requires that an iterative solution approach be used, which, combined with the vast range of spatial and temporal scales, means that methodological compromises need to be made. These compromises need to be understood, with the correct computational tool used at the appropriate point in the design process. This article reviews these challenges as well as the progress that has been made in addressing them.
Article
Full-text available
Full waveform inversion (FWI) is usually solved as a nonlinear least-squares problem to minimize the discrepancy between the recorded signal and the synthetic data by some gradient-based optimization method. In this paper, we investigate acoustic FWI as training a neural network. We recast the time-domain difference scheme into the forward propagation process of a vanilla recurrent neural network (RNN) and find that the parameters of the RNN coincide with the physical parameters in the wave equation. As a result, the FWI problem is resolved by training such a neural network. Stochastic optimization methods such as the Adam optimizer are applied to train deep neural networks with different structures. We demonstrate our method numerically with three typical models, including the benchmark Marmousi model. The numerical results show that fast convergence and good velocity inversion results can be achieved.
Article
Previous work has introduced the stability analysis of coupled neutronics-depletion solvers for the standard explicit Euler and predictor-corrector methods. The present work is an extension of this analysis to higher-order schemes that are commonly used, including the LE, LE/LI, LE/QI, and their implicit versions where such a method exists. Substepping, extrapolation, and linear and quadratic interpolation are investigated, and their effects on numerical stability are discussed. A realistic, numerically stiff depletion system is considered by applying automatic differentiation to the Chebyshev rational approximation method; accounting for initial nonlinear behaviour, the predictions from the stability analysis match the outcomes of simulation.
Article
Full-text available
We show that, although the conjugate gradient (CG) algorithm has a singularity at the solution, it is possible to differentiate forward through the algorithm automatically by re-declaring all the variables as truncated Taylor series, the type of active variable widely used in automatic differentiation (AD) tools such as ADOL-C. If exact arithmetic is used, this approach gives a complete sequence of correct directional derivatives of the solution, to arbitrary order, in a single cycle of at most n iterations, where n is the number of dimensions. In the inexact case, the approach emphasizes the need for a means by which the programmer can communicate certain conditions involving derivative values directly to an AD tool.
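The active-variable idea described in this abstract can be made concrete with a hand-rolled, first-principles sketch: a truncated Taylor arithmetic of order two, pushed through a simple iteration (Newton's method for the square root rather than CG, so the singularity issue the article addresses does not arise). This is only an illustration of the mechanism; tools such as ADOL-C implement it to arbitrary order.

```python
# Truncated Taylor arithmetic of order 2: a value and two Taylor
# coefficients, x(t) ~ c0 + c1*t + c2*t^2.
class Taylor2:
    def __init__(self, c0, c1=0.0, c2=0.0):
        self.c = [c0, c1, c2]
    def __add__(self, o):
        return Taylor2(*(a + b for a, b in zip(self.c, o.c)))
    def __mul__(self, o):
        a, b = self.c, o.c
        return Taylor2(a[0]*b[0],
                       a[0]*b[1] + a[1]*b[0],
                       a[0]*b[2] + a[1]*b[1] + a[2]*b[0])  # truncated Cauchy product
    def __truediv__(self, o):
        a, b = self.c, o.c
        q0 = a[0] / b[0]                       # solve a = q*b coefficient by coefficient
        q1 = (a[1] - q0 * b[1]) / b[0]
        q2 = (a[2] - q0 * b[2] - q1 * b[1]) / b[0]
        return Taylor2(q0, q1, q2)

# Newton iteration x <- (x + a/x)/2 for sqrt(a), run on Taylor arguments
# with a(t) = 2 + t: the iterates' coefficients converge to those of
# sqrt(2 + t), i.e. [sqrt(2), 1/(2*sqrt(2)), -1/(8*2**1.5)].
a = Taylor2(2.0, 1.0)        # seed da/dt = 1
x = Taylor2(1.0)
half = Taylor2(0.5)
for _ in range(8):
    x = half * (x + a / x)
print(x.c)
```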
Article
Full-text available
Classical reverse-mode automatic differentiation (AD) imposes only a small constant-factor overhead in operation count over the original computation, but has storage requirements that grow, in the worst case, in proportion to the time consumed by the original computation. This storage blowup can be ameliorated by checkpointing, a process that reorders application of classical reverse-mode AD over an execution interval to trade off space versus time. Application of checkpointing in a divide-and-conquer fashion to strategically chosen nested execution intervals can break classical reverse-mode AD into stages, which can reduce the worst-case growth in storage from linear to sublinear. Doing this has been fully automated only for computations of particularly simple form, with checkpoints spanning execution intervals resulting from a limited set of program constructs. Here we show how the technique can be automated for arbitrary computations. The essential innovation is to apply the technique at the level of the language implementation itself, thus allowing checkpoints to span any execution interval.
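To see the space-time trade-off that checkpointing negotiates, consider this minimal sketch (hypothetical code with a uniform checkpoint spacing; the binomial placement of Griewank and Walther, and the staged scheme of this abstract, are more sophisticated). Only every c-th state of a time-stepping loop is stored, and the states inside each segment are recomputed from the nearest checkpoint during the reverse sweep, cutting memory from O(T) to O(T/c + c) at the cost of one extra forward recomputation per segment.

```python
def step(x):
    return 0.5 * x * x + 0.1    # example time step x_{k+1} = step(x_k)

def dstep(x):
    return x                     # d step / dx, the local partial derivative

def reverse_with_checkpoints(x0, T, c):
    """Compute d x_T / d x_0 storing only every c-th state (assumes T % c == 0)."""
    # Forward sweep: keep checkpoints x_0, x_c, x_{2c}, ...
    ckpts = {0: x0}
    x = x0
    for k in range(T):
        x = step(x)
        if (k + 1) % c == 0:
            ckpts[k + 1] = x
    # Reverse sweep, one segment at a time: recompute the states inside
    # the segment from its checkpoint, then apply the chain rule backwards.
    xbar = 1.0                   # seed adjoint d x_T / d x_T
    for seg_end in range(T, 0, -c):
        seg_start = seg_end - c
        xs = [ckpts[seg_start]]
        for _ in range(seg_start, seg_end - 1):
            xs.append(step(xs[-1]))
        for x in reversed(xs):
            xbar *= dstep(x)     # multiply local partials in reverse order
    return xbar

print(reverse_with_checkpoints(0.3, T=20, c=5))
```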
Article
Full-text available
Adjoints are an important computational tool for large-scale sensitivity evaluation, uncertainty quantification, and derivative-based optimization. An essential component of their performance is the storage/recomputation balance in which efficient adjoint checkpointing strategies play a key role. We introduce a novel asynchronous two-level adjoint checkpointing scheme for numerical time discretizations targeted at large-scale numerical simulations. The checkpointing scheme combines bandwidth-limited disk checkpointing and binomial memory checkpointing. Based on assumptions about the target petascale systems, which we later demonstrate to be realistic on the IBM Blue Gene/Q system Mira, we create a model of the predicted performance of the adjoint computation and validate it using the highly scalable Navier-Stokes spectral-element solver Nek5000 on small to moderate subsystems of the Mira supercomputer.
Article
Full-text available
We reexamine the work of Stumm and Walther on multistage algorithms for adjoint computation. We provide an optimal algorithm for this problem when there are two levels of checkpoints, in memory and on disk. Previously, optimal algorithms for adjoint computations were known only for a single level of checkpoints with no writing and reading costs; a well-known example is the binomial checkpointing algorithm of Griewank and Walther. Stumm and Walther extended that binomial checkpointing algorithm to the case of two levels of checkpoints, but they did not provide any optimality results. We bridge the gap by designing the first optimal algorithm in this context. We experimentally compare our optimal algorithm with that of Stumm and Walther to assess the difference in performance.
Article
We present an optimization method for Lipschitz continuous, piecewise smooth (PS) objective functions based on successive piecewise linearization. Since, in many realistic cases, nondifferentiabilities are caused by the occurrence of abs(), max(), and min(), we concentrate on these nonsmooth elemental functions. The method's idea is to locate an optimum of a PS objective function by explicitly handling the kink structure at the level of piecewise linear models. This piecewise linearization can be generated in abs-normal form by a minor extension of standard algorithmic, or automatic, differentiation tools. In this paper it is shown that the new method, when started from within a compact level set, generates a sequence of iterates whose cluster points are all Clarke stationary. Numerical results, including comparisons with other nonsmooth optimization methods, then illustrate the capabilities of the proposed approach.
Article
Automatic differentiation (AD) is an essential primitive for machine learning programming systems. Tangent is a new library that performs AD using source code transformation (SCT) in Python. It takes numeric functions written in a syntactic subset of Python and NumPy as input, and generates new Python functions which calculate a derivative. This approach to automatic differentiation is different from existing packages popular in machine learning, such as TensorFlow and Autograd. Advantages are that Tangent generates gradient code in Python which is readable by the user, easy to understand and debug, and has no runtime overhead. Tangent also introduces abstractions for easily injecting logic into the generated gradient code, further improving usability.
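To make the source-transformation idea tangible, the following pair is a hypothetical, hand-written illustration of the kind of code an SCT tool emits, not Tangent's actual output. The input is an ordinary Python function; the "generated" derivative is again plain Python, with a forward pass that retains intermediates and a reverse pass that accumulates adjoints, which is what makes the result readable and debuggable.

```python
import math

# Input function, written in a plain syntactic subset of Python.
def f(x):
    y = x * x
    z = math.sin(y)
    return z

# The kind of gradient code an SCT tool could generate: ordinary Python,
# forward pass first, then adjoints in reverse order of the statements.
def df(x):
    # forward pass, keeping the intermediates the reverse pass needs
    y = x * x
    # reverse pass
    bz = 1.0                  # seed: d z / d z
    by = bz * math.cos(y)     # d z / d y = cos(y)
    bx = by * 2.0 * x         # d y / d x = 2 x
    return bx

# The generated code is an ordinary function; no tape or runtime tracing.
print(df(1.5), math.cos(1.5 ** 2) * 2 * 1.5)   # both print d/dx sin(x^2)
```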
Article
We revisit an algorithm [called Edge Pushing (EP)] for computing Hessians using Automatic Differentiation (AD) recently proposed by Gower and Mello (Optim Methods Softw 27(2): 233–249, 2012). Here we give a new, simpler derivation for the EP algorithm based on the notion of live variables from data-flow analysis in compiler theory and redesign the algorithm with close attention to general applicability and performance. We call this algorithm Livarh and develop an extension of Livarh that incorporates preaccumulation to further reduce execution time—the resulting algorithm is called Livarhacc. We engineer robust implementations for both algorithms Livarh and Livarhacc within ADOL-C, a widely-used operator overloading based AD software tool. Rigorous complexity analyses for the algorithms are provided, and the performance of the algorithms is evaluated using a mesh optimization application and several kinds of synthetic functions as testbeds. The results show that the new algorithms outperform state-of-the-art sparse methods (based on sparsity pattern detection, coloring, compressed matrix evaluation, and recovery) in some cases by orders of magnitude. We have made our implementation available online as open-source software and it will be included in a future release of ADOL-C.