Automatic differentiation in machine learning: A survey

Abstract

Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in machine learning. Automatic differentiation (AD), also called algorithmic differentiation or simply “autodiff”, is a family of techniques similar to but more general than backpropagation for efficiently and accurately evaluating derivatives of numeric functions expressed as computer programs. AD is a small but established field with applications in areas including computational fluid dynamics, atmospheric sciences, and engineering design optimization. Until very recently, the fields of machine learning and AD have largely been unaware of each other and, in some cases, have independently discovered each other’s results. Despite its relevance, general-purpose AD has been missing from the machine learning toolbox, a situation slowly changing with its ongoing adoption under the names “dynamic computational graphs” and “differentiable programming”. We survey the intersection of AD and machine learning, cover applications where AD has direct relevance, and address the main implementation techniques. By precisely defining the main differentiation techniques and their interrelationships, we aim to bring clarity to the usage of the terms “autodiff”, “automatic differentiation”, and “symbolic differentiation” as these are encountered more and more in machine learning settings.
... At the same time, in the past two decades, the sheer availability of observational data and the increasing ease of collecting them have led to the development of a vast array of machine learning techniques. In particular, thanks to technological advancements in hardware acceleration, automatic differentiation (AD) has become a viable and powerful strategy for large-scale optimization [8]–[10]. Ultimately, one could think of the supervised training of a neural network via backpropagation as the solution to a classic inverse problem in which the hidden-layer parameters (weights and biases) are estimated from a set of observations. ...
... Automatic differentiation (AD) [8]–[10] refers to a set of techniques for algorithmically evaluating the derivatives of a differentiable function expressed as a computer program. It differs from, e.g., numerical and symbolic differentiation methods in that AD performs a non-standard interpretation of the program, in which each algebraic operation is augmented with the calculation of the corresponding derivative. ...
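The non-standard interpretation the snippet describes can be made concrete with dual numbers: each value carries its derivative, and every operation is overloaded to propagate both. A minimal forward-mode sketch in Python (the class and function names are illustrative, not taken from any cited implementation):

```python
import math

class Dual:
    """Dual number: 'val' holds the value, 'dot' the derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    """sin augmented with its derivative, cos."""
    if isinstance(x, Dual):
        return Dual(math.sin(x.val), math.cos(x.val) * x.dot)
    return math.sin(x)

def f(x):
    return x * x + sin(x)

# Seed dx/dx = 1 to obtain f'(2) alongside f(2); analytically f'(x) = 2x + cos(x).
y = f(Dual(2.0, 1.0))
```

Running the same unmodified `f` on a `Dual` input is exactly the "non-standard interpretation" of the program: the arithmetic is re-interpreted to carry derivatives.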
... Hence, the gradient is computed by applying the chain rule while traversing the computational graph from roots to leaves. In this way, reverse-mode AD greatly reduces the number of operations required to differentiate functions with many inputs, compared to the forward mode [10]. ...
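Reverse mode, as the snippet describes, records the graph during the forward pass and then propagates adjoints from the output back to the inputs. A minimal sketch without the topological ordering or operator coverage of a real tape-based system (all names are illustrative):

```python
class Var:
    """Graph node recording local partial derivatives for reverse mode."""
    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents  # pairs (parent Var, d self / d parent)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.val * other.val,
                   [(self, other.val), (other, self.val)])

    def backward(self, seed=1.0):
        # Accumulate chain-rule contributions back toward the leaves.
        # Correct because gradient propagation is linear in the seed,
        # though a real implementation would order the traversal.
        self.grad += seed
        for parent, partial in self.parents:
            parent.backward(seed * partial)

x = Var(3.0)
y = Var(4.0)
z = x * y + x        # z = xy + x, so dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()         # one backward pass yields all input gradients at once
```

One traversal of the graph produces the gradient with respect to every input, which is the source of reverse mode's efficiency for many-input functions.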
Article
Full-text available
Lumped-element models (LEMs) provide a compact characterization of numerous real-world physical systems, including electrical, acoustic, and mechanical systems. However, even when the target topology is known, deriving model parameters that approximate a possibly distributed system often requires educated guesses or dedicated optimization routines. This article presents a general framework for the data-driven estimation of lumped parameters using automatic differentiation. Inspired by recent work on physical neural networks, we propose to explicitly embed a differentiable LEM in the forward pass of a learning algorithm and discover its parameters via backpropagation. The same approach could also be applied to blindly parameterize an approximating model that shares no isomorphism with the target system, for which it would be thus challenging to exploit prior knowledge of the underlying physics. We evaluate our framework on various linear and nonlinear systems, including time- and frequency-domain learning objectives, and consider real- and complex-valued differentiation strategies. In all our experiments, we were able to achieve a near-perfect match of the system state measurements and retrieve the true model parameters whenever possible. Besides its practical interest, the present approach provides a fully interpretable input-output mapping by exposing the topological structure of the underlying physical model, and it may therefore constitute an explainable ad-hoc alternative to otherwise black-box methods.
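The parameter-estimation loop this abstract describes can be illustrated in miniature: fit the time constant of a first-order system to synthetic measurements by gradient descent. Here the gradient is written out by hand via the chain rule; in the framework described above it would instead be obtained by automatic differentiation through the embedded model. All names and values below are illustrative, not from the cited work:

```python
import math

# Synthetic "measurements" of a first-order system y(t) = exp(-t / tau_true).
tau_true = 0.5
ts = [0.1 * k for k in range(20)]
ys = [math.exp(-t / tau_true) for t in ts]

tau = 1.0   # initial guess for the lumped parameter
lr = 0.1
for _ in range(2000):
    # Loss L = mean((pred - y)^2); the gradient below is the chain rule
    # written by hand -- an AD framework would produce it automatically.
    grad = 0.0
    for t, y in zip(ts, ys):
        pred = math.exp(-t / tau)
        # d pred / d tau = pred * t / tau^2
        grad += 2.0 * (pred - y) * pred * t / tau**2
    tau -= lr * grad / len(ts)
```

Because the model is the physical parameterization itself, the recovered `tau` is directly interpretable, which mirrors the explainability argument made in the abstract.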
... Zhang et al. [14] used a tensor differentiator to compute the derivatives of state-space variables, coupling the LSTM network with physics information arising from the governing equations, and applied it to nonlinear structural problems. PINNs usually rely on an automatic differentiation (AD)-based approach [6] for the computation of the gradients. There have been few efforts to compute the physics-driven loss term in the PINN network using derivatives from known numerical approaches such as finite-difference, finite-element, or finite-volume methods. ...
... Now, if we concatenate the snapshots corresponding to all the parameters, we obtain the total snapshot matrix U ≡ [U_1 ... U_{N_µ}], where U ∈ R^(N × m·N_t). Proper Orthogonal Decomposition (POD) [41] can be used to compute the spatial basis vectors Φ by choosing the first n columns of the matrix of left singular vectors W obtained from the Singular Value Decomposition (SVD) of the snapshot matrix U, as shown in Equation 6. ...
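The POD construction in this snippet is a few lines of linear algebra: take the SVD of the snapshot matrix and keep the leading left singular vectors. A small self-contained sketch with synthetic rank-2 snapshots (illustrative data, not from the cited work):

```python
import numpy as np

rng = np.random.default_rng(0)

# Snapshot matrix U: each column is one state snapshot (N dofs x m snapshots),
# built here from two spatial modes so the data have exact rank 2.
x = np.linspace(0.0, 1.0, 50)
modes = np.stack([np.sin(np.pi * x), np.sin(2 * np.pi * x)], axis=1)  # 50 x 2
coeffs = rng.standard_normal((2, 30))                                 # 2 x 30
U = modes @ coeffs                                                    # 50 x 30

# POD basis Phi: the first n left singular vectors of U.
W, s, _ = np.linalg.svd(U, full_matrices=False)
n = 2
Phi = W[:, :n]

# Projecting onto span(Phi) reconstructs the rank-2 data up to round-off.
err = np.linalg.norm(U - Phi @ (Phi.T @ U)) / np.linalg.norm(U)
```

With real data the singular values decay rather than vanish, and `n` is chosen to capture a prescribed fraction of the total "energy" (sum of squared singular values).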
... • The automatic differentiation (AD)-based gradient computation [6] in the conventional PINN framework requires the spatial (i.e., Cartesian) and time coordinates as the input to the neural network, whereas the current approach provides a flexible choice of input. ...
Preprint
Full-text available
This work addresses the development of a physics-informed neural network (PINN) with a loss term derived from a discretized time-dependent reduced-order system. First, the governing equations are discretized using a finite difference scheme (though any other discretization technique can be adopted); they are then projected onto a reduced or latent space using the Proper Orthogonal Decomposition (POD)-Galerkin approach; finally, the residual arising from the discretized reduced-order equation is added as a penalty term alongside the data-driven loss term, using different variants of deep learning methods such as artificial neural networks (ANN) and Long Short-Term Memory (LSTM) networks. The LSTM network has proven very effective for time-dependent problems in a purely data-driven setting; the current work demonstrates its potential over ANN networks within PINNs as well. The advantage of using discretized governing equations instead of their continuous form lies in the flexibility of input to the PINN. Datasets of varying size, from small to large, are used to assess the potential of discretized-physics-informed neural networks when data are sparse or unavailable. The proposed methods are applied to a pitch-plunge airfoil motion governed by rigid-body dynamics and to a one-dimensional viscous Burgers' equation. The current work also demonstrates the prediction capability of the various discretized-physics-informed neural networks outside the domain where data are available or where the governing-equation-based residuals are minimized.
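The loss structure described above, a data-fitting term plus a penalty on the finite-difference residual of the governing equation, can be sketched on a toy problem. Here the "network output" is replaced by the exact solution of du/dt = -u, and the forward-difference residual is small but nonzero because of the O(dt) truncation error; everything below is illustrative, not the paper's actual setup:

```python
import numpy as np

# Hypothetical setup: u_pred stands in for network outputs at uniform steps dt,
# u_data are sparse measurements, and the governing equation du/dt = -u is
# enforced through its finite-difference residual rather than through AD.
dt = 0.1
t = np.arange(0.0, 2.0, dt)
u_true = np.exp(-t)

u_pred = np.exp(-t)        # stand-in for the network's predictions
u_data = u_true[::5]       # sparse observations, every 5th step

# Data-driven term: misfit at the observation points.
data_loss = np.mean((u_pred[::5] - u_data) ** 2)

# Physics term: forward-difference residual of du/dt + u = 0.
residual = (u_pred[1:] - u_pred[:-1]) / dt + u_pred[:-1]
physics_loss = np.mean(residual ** 2)

total_loss = data_loss + physics_loss
```

In training, both terms would be minimized jointly over the network parameters; the residual term supplies a learning signal wherever measurements are missing.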
Article
Full-text available
The Fokker-Planck equation is a partial differential equation that describes how the probability density function of an object's state varies when subject to deterministic and random forces. The solution to this equation is crucial in many space applications, such as space debris trajectory tracking and prediction, guidance navigation and control under uncertainties, space situational awareness, and mission analysis and planning. However, no general closed-form solutions are known, and several methods exist to tackle its solution. In this work, we use a known technique to transform this equation into a set of linear ordinary differential equations in the context of orbital dynamics. In particular, we show the advantages of the applied methodology, which allows one to decouple the time- and state-dependent components and to retain the entire shape of the probability density function through time, in the presence of both deterministic and stochastic dynamics. With this approach, the probability density function values at future times and for different initial conditions can be computed without added cost, provided that some time-independent integrals are solved offline. We showcase the efficacy and use of this method on orbital dynamics examples, also leveraging automatic differentiation to efficiently compute the involved derivatives.
Article
Full-text available
We present ForwardDiff, a Julia package for forward-mode automatic differentiation (AD) featuring performance competitive with low-level languages like C++. Unlike recently developed AD tools in other popular high-level languages such as Python and MATLAB, ForwardDiff takes advantage of just-in-time (JIT) compilation to transparently recompile AD-unaware user code, enabling efficient support for higher-order differentiation and differentiation using custom number types (including complex numbers). For gradient and Jacobian calculations, ForwardDiff provides a variant of vector-forward mode that avoids expensive heap allocation and makes better use of memory bandwidth than traditional vector mode. In our numerical experiments, we demonstrate that for nontrivially large dimensions, ForwardDiff's gradient computations can be faster than a reverse-mode implementation from the Python-based autograd package. We also illustrate how ForwardDiff is used effectively within JuMP, a modeling language for optimization. According to our usage statistics, 41 unique repositories on GitHub depend on ForwardDiff, with users from diverse fields such as astronomy, optimization, finite element analysis, and statistics. This document is an extended abstract that has been accepted for presentation at the AD2016 7th International Conference on Algorithmic Differentiation.
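The vector-forward mode mentioned in this abstract evaluates all n directional derivatives in a single pass by seeding the derivative parts with the rows of the identity matrix. ForwardDiff itself is a Julia package; the Python toy below mirrors only that seeding scheme, with dual values represented as (value, gradient-row) pairs, and all names are illustrative:

```python
import numpy as np

# Dual arithmetic on (value, gradient-row) pairs.
def dmul(a, b):
    # Product rule, applied componentwise to the gradient rows.
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])

def dadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

def jacobian_forward(f, x):
    """Jacobian of f at x via vector-forward mode: a single evaluation
    carries all n directional derivatives, seeded with the rows of I."""
    n = len(x)
    seeded = [(x[i], np.eye(n)[i]) for i in range(n)]
    out = f(seeded)
    return np.array([row for (_, row) in out])

def f(v):
    x, y = v
    return [dmul(x, y), dadd(x, dmul(y, y))]   # f(x, y) = [x*y, x + y^2]

J = jacobian_forward(f, [2.0, 3.0])
# Analytic Jacobian at (2, 3): [[y, x], [1, 2y]] = [[3, 2], [1, 6]]
```

A scalar-forward implementation would instead rerun `f` once per input with a one-hot seed; carrying the whole seed row per value trades memory for fewer passes, which is the trade-off the abstract's allocation remarks are about.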
Book
Two ideas lie gleaming on the jeweler's velvet. The first is the calculus, the second, the algorithm. The calculus and the rich body of mathematical analysis to which it gave rise made modern science possible; but it has been the algorithm that has made possible the modern world. -David Berlinski, The Advent of the Algorithm First there was the concept of integers, then there were symbols for integers: I, II, III, IIII, fttt (what might be called a sticks and stones representation); I, II, III, IV, V (Roman numerals); 1, 2, 3, 4, 5 (Arabic numerals), etc. Then there were other concepts with symbols for them and algorithms (sometimes) for manipulating the new symbols. Then came collections of mathematical knowledge (tables of mathematical computations, theorems of general results). Soon after algorithms came devices that provided assistance for carrying out computations. Then mathematical knowledge was organized and structured into several related concepts (and symbols): logic, algebra, analysis, topology, algebraic geometry, number theory, combinatorics, etc. This organization and abstraction led to new algorithms and new fields like universal algebra. But always our symbol systems reflected and influenced our thinking, our concepts, and our algorithms.
Article
Gradient-based optimization is the foundation of deep learning and reinforcement learning. Even when the mechanism being optimized is unknown or not differentiable, optimization using high-variance or biased gradient estimates is still often the best strategy. We introduce a general framework for learning low-variance, unbiased gradient estimators for black-box functions of random variables, based on gradients of a learned function. These estimators can be jointly trained with model parameters or policies, and are applicable in both discrete and continuous settings. We give unbiased, adaptive analogs of state-of-the-art reinforcement learning methods such as advantage actor-critic. We also demonstrate this framework for training discrete latent-variable models.
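For contrast with the learned estimators proposed here, the baseline technique they improve on, the score-function (REINFORCE) estimator with a constant control variate, can be sketched in a few lines. The names and the target function are illustrative, and this is not the paper's learned-surrogate method:

```python
import random

random.seed(0)

def f(b):
    """Black-box function of a Bernoulli sample (not differentiable in b)."""
    return (b - 0.45) ** 2

def grad_estimate(theta, n=200000, baseline=0.0):
    """Score-function estimator of d/dtheta E[f(b)], b ~ Bernoulli(theta):
    average (f(b) - baseline) * d log p(b; theta) / d theta over samples.
    A constant baseline acts as a simple control variate (variance only)."""
    total = 0.0
    for _ in range(n):
        b = 1.0 if random.random() < theta else 0.0
        score = (b - theta) / (theta * (1.0 - theta))  # d log p / d theta
        total += (f(b) - baseline) * score
    return total / n

theta = 0.3
# Exact gradient: d/dtheta [theta*f(1) + (1-theta)*f(0)] = f(1) - f(0) = 0.1
true_grad = f(1.0) - f(0.0)
est = grad_estimate(theta, baseline=f(0.0))
```

The estimator stays unbiased for any constant baseline; the paper's contribution is, roughly, to replace such crude control variates with a learned, input-dependent one that tracks f closely and so cuts variance much further.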
Conference Paper
Torch7 is a versatile numeric computing framework and machine learning library that extends Lua. Its goal is to provide a flexible environment to design and train learning machines. Flexibility is obtained via Lua, an extremely lightweight scripting language. High performance is obtained via efficient OpenMP/SSE and CUDA implementations of low-level numeric routines. Torch7 can easily be interfaced to third-party software thanks to Lua's light interface.
Conference Paper
How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contribution is two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.
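The reparameterization in the first contribution can be sketched for a one-dimensional Gaussian: writing z = mu + sigma*eps moves the randomness into eps ~ N(0, 1), so the gradient of an expectation flows through the sample itself rather than through the density. A minimal Monte Carlo illustration (the names and the target E[z^2] are illustrative):

```python
import random

random.seed(0)

def grad_reparam(mu, sigma, n=100000):
    """Reparameterized gradient of E[z^2] w.r.t. mu, with z = mu + sigma*eps.
    Since dz/dmu = 1, each sample contributes d(z^2)/dmu = 2z directly."""
    total = 0.0
    for _ in range(n):
        eps = random.gauss(0.0, 1.0)
        z = mu + sigma * eps
        total += 2.0 * z   # chain rule through the sample: 2z * dz/dmu
    return total / n

# For E[z^2] = mu^2 + sigma^2, the exact gradient w.r.t. mu is 2*mu = 3.
g = grad_reparam(mu=1.5, sigma=0.5)
```

Because the estimator differentiates through the sample path, it typically has far lower variance than a score-function estimator of the same gradient, which is what makes stochastic-gradient optimization of the variational bound practical.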