# Alexey Radul's research while affiliated with Google Inc. and other places

**What is this page?**

This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.

It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.

If you're a ResearchGate member, you can follow this page to keep up with this author's work.

If you are this author, and you don't want us to display this page anymore, please let us know.

## Publications (29)

Automatic differentiation (AD) is conventionally understood as a family of distinct algorithms, rooted in two “modes”—forward and reverse—which are typically presented (and implemented) separately. Can there be only one? Following up on the AD systems developed in the JAX and Dex projects, we formalize a decomposition of reverse-mode AD into (i) fo...

Correctly manipulating program terms in a compiler is surprisingly difficult because of the need to avoid name capture. The rapier from "Secrets of the Glasgow Haskell Compiler inliner" is a cutting-edge technique for fast, stateless capture-avoiding substitution for expressions represented with explicit names. It is, however, a sharp tool: its inv...

Automatic differentiation (AD) is conventionally understood as a family of distinct algorithms, rooted in two "modes" -- forward and reverse -- which are typically presented (and implemented) separately. Can there be only one? Following up on the AD systems developed in the JAX and Dex projects, we formalize a decomposition of reverse-mode AD into...

We present a novel programming language design that attempts to combine the clarity and safety of high-level functional languages with the efficiency and parallelism of low-level numerical languages. We treat arrays as eagerly-memoized functions on typed index sets, allowing abstract function manipulations, such as currying, to work on arrays. In c...

We decompose reverse-mode automatic differentiation into (forward-mode) linearization followed by transposition. Doing so isolates the essential difference between forward- and reverse-mode AD, and simplifies their joint implementation. In particular, once forward-mode AD rules are defined for every primitive operation in a source language, only li...
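The decomposition described in this abstract can be illustrated with a small, hedged sketch in plain Python. It is not the authors' implementation: the example function `f`, the helper names `linearize`, `transpose`, and `vjp`, and the naive matrix-based transposition are all assumptions made for illustration. The point is only that a reverse-mode pullback can be obtained by first linearizing (forward mode) and then transposing the resulting linear map.

```python
import math


def f(x):
    """Hypothetical example function from R^2 to R^2."""
    x0, x1 = x
    return [x0 * x1, math.sin(x0)]


def linearize(x):
    """Forward-mode step: return f(x) and the linear map t -> J(x) t."""
    x0, x1 = x
    y = f(x)
    c = math.cos(x0)  # value saved from the forward pass

    def jvp(t):
        t0, t1 = t
        # Hand-written tangent rules for the two outputs of f.
        return [x1 * t0 + x0 * t1, c * t0]

    return y, jvp


def transpose(linear_map, n_in):
    """Naively transpose a linear map by probing it with basis vectors.

    A real system would transpose the linear program structurally;
    building the columns explicitly is an assumption made for brevity.
    """
    columns = []
    for j in range(n_in):
        basis = [0.0] * n_in
        basis[j] = 1.0
        columns.append(linear_map(basis))  # column j of the Jacobian

    def transposed(ct):
        # Entry j of J^T ct is the dot product of ct with column j.
        return [sum(c * v for c, v in zip(ct, col)) for col in columns]

    return transposed


def vjp(x):
    """Reverse mode as linearization followed by transposition."""
    y, jvp = linearize(x)
    return y, transpose(jvp, len(x))


if __name__ == "__main__":
    y, pullback = vjp([2.0, 3.0])
    print(y)
    print(pullback([1.0, 0.0]))  # gradient of the first output, x0 * x1
```

In this sketch only the forward-mode rules inside `jvp` are specific to `f`; the transposition step is generic over linear maps, which mirrors the separation of concerns the abstract describes.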

Probabilistic programming systems generally compute with probability density functions, leaving the base measure of each such function implicit. This usually works, but creates problems in situations where densities with respect to different base measures are accidentally combined or compared. We motivate and clarify the problem in the context of a...

Constant-memory algorithms, also loosely called Markov chains, power the vast majority of probabilistic inference and machine learning applications today. A lot of progress has been made in constructing user-friendly APIs around these algorithms. Such APIs, however, rarely make it easy to research new algorithms of this type. In this work we presen...

We present a general approach to batching arbitrary computations for accelerators such as GPUs. We show orders-of-magnitude speedups using our method on the No U-Turn Sampler (NUTS), a workhorse algorithm in Bayesian statistics. The central challenge of batching NUTS and other Markov chain Monte Carlo algorithms is data-dependent control flow and r...

We describe a simple, low-level approach for embedding probabilistic programming in a deep learning ecosystem. In particular, we distill probabilistic programming down to a single abstraction: the random variable. Our lightweight implementation in TensorFlow enables numerous applications: a model-parallel variational auto-encoder (VAE) with 2nd-ge...

We introduce inference metaprogramming for probabilistic programming languages, including new language constructs, a formalism, and the first demonstration of effectiveness in practice. Instead of relying on rigid black-box inference algorithms hard-coded into the language implementation as in previous probabilistic programming languages, inference...

Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in machine learning. Automatic differentiation (AD), also called algorithmic differentiation or simply “autodiff”, is a family of techniques similar to but more general than backpropagation for efficiently and accurately evaluating derivatives of numeric functions expressed a...

Intelligent systems sometimes need to infer the probable goals of people, cars, and robots, based on partial observations of their motion. This paper introduces a class of probabilistic programs for formulating and solving these problems. The formulation uses randomized path planning algorithms as the basis for probabilistic models of the process b...

Gaussian Processes (GPs) are widely used tools in statistics, machine learning, robotics, computer vision, and scientific computation. However, despite their popularity, they can be difficult to apply; all but the simplest classification or regression applications require specification and inference over complex covariance functions that do not adm...

This is a special collection of problems that were given to select applicants during oral entrance exams to the Department of Mechanics and Mathematics of Moscow State University. These problems were designed to prevent Jewish candidates and other “undesirables” from getting a passing grade, thus preventing them from studying at MSU. Among problems...

Forward Automatic Differentiation (AD) is a technique for augmenting programs to both perform their original calculation and also compute its directional derivative. The essence of Forward AD is to attach a derivative value to each number, and propagate these through the computation. When derivatives are nested, the distinct derivative calculations...
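The idea of attaching a derivative value to each number can be sketched with dual numbers. The following is a minimal illustration in plain Python, not the implementation from the paper: the names `Dual`, `sin`, and `derivative` are hypothetical, only addition, multiplication, and sine are supported, and the sketch makes no attempt to keep nested derivative calculations distinct, which is the issue the abstract goes on to raise.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A number paired with its derivative (tangent) value."""
    primal: float
    tangent: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.primal + other.primal,
                    self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u v' + u' v
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants with zero tangent."""
    return x if isinstance(x, Dual) else Dual(float(x))


def sin(x):
    """Sine with its tangent rule: sin(u)' = cos(u) u'."""
    x = _lift(x)
    return Dual(math.sin(x.primal), math.cos(x.primal) * x.tangent)


def derivative(f, x):
    """Derivative of a scalar function f at x, via a unit input tangent."""
    return f(Dual(float(x), 1.0)).tangent


if __name__ == "__main__":
    print(derivative(lambda x: x * x + 3 * x, 2.0))
    print(derivative(lambda x: sin(x) * x, 0.5))
```

Each arithmetic operation propagates the tangent alongside the ordinary value, so the original calculation and its directional derivative are computed in a single pass.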

We propose extensions to Fortran which integrate forward and reverse Automatic Differentiation (AD) directly into the programming model. Irrespective of implementation technology, embedding AD constructs directly into the language extends the reach and convenience of AD while allowing abstraction of concepts of interest to scientific-computing prac...

We describe an implementation of the Farfel Fortran AD extensions. These extensions integrate forward and reverse AD directly into the programming model, with attendant benefits to flexibility, modularity, and ease of use. The implementation we describe is a "prepreprocessor" that generates input to existing Fortran-based AD tools. In essence, bloc...

This is a special collection of problems that were given to select applicants during oral entrance exams to the math department of Moscow State University. These problems were designed to prevent Jews and other undesirables from getting a passing grade. Among problems that were used by the department to blackball unwanted candidate students, these...

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 167-174). In this dissertation I propose a shift in the foundations of computation. Modern programming systems are not expressive enough. The traditional ima...

Traditionally, distributed computing problems have been solved by partitioning data into chunks small enough to be handled by commodity hardware. However, such partitioning is not possible in cases where there are a high number of dependencies or high dimensionality, such as in reasoning and expert systems, rendering such problems less tractable fo...

We investigate a coin-weighing puzzle that appeared in the all-Russian math Olympiad in 2000. We liked the puzzle because the methods of analysis differ from classical coin-weighing puzzles. We generalize the puzzle by varying the number of participating coins, and deduce a complete solution; perhaps surprisingly, the objective can be achieved in n...

We develop a programming model built on the idea that the basic computational elements are autonomous machines interconnected by shared cells through which they communicate. Each machine continuously examines the cells it is interested in, and adds information to some based on deductions it can make from information from the others. This model make...

Reasoning with probabilistic models is a widespread and successful technique in areas ranging from computer vision, to natural language processing, to bioinformatics. Currently, these reasoning systems are either coded from scratch in general-purpose languages or use formalisms such as Bayesian networks that have limited expressive power. In both cases,...

In the past year we have made serious progress on elaborating the propagator programming model [2, 3]. Things have gotten serious enough to build a system that can be used for real experiments. The most important problem facing a programmer is the revision of an existing program to extend it for some new situation. Unfortunately, the traditional mo...

An ad hoc network is a group of mobile nodes that autonomously establish connectivity via multi-hop wireless links, without relying on any pre-configured network infrastructure. In recent years, a variety of routing protocols for mobile ad hoc networks have been developed. Their main drawback is that they use a large number of routing packets to...

## Citations

... AD can also compute Hessians or other higher-order derivatives or their projections. The pitfalls described in this work apply to all of these AD usages; and we refer to some of the many articles, books, and reviews that discuss AD modes and their relationship [5,6,7,1,4,8,3,9,10,11]. ...

... Hence, commonly used deep learning frameworks like PyTorch [18] or JAX [2] feature a large number of built-in operations on tensors. In this work, we are instead interested in array languages which have only few built-in constructs upon which richer APIs can be constructed, like F̃ [20,21] or Dex [19]. Our implementation of an array programming language as a domain specific language (DSL) closely follows the former. ...

... For example, the efficiency of Monte Carlo methods such as particle filtering is highly dependent on the choice of proposal distributions encoding knowledge of the inference problem, often provided by the user [Snyder et al. 2015]. This need for both library designers and end users to adapt existing solutions to specific settings has led to interest in language and implementation techniques that make it easy to assemble new inference algorithms out of reusable parts, techniques which have been explored with some success in probabilistic programming systems such as Venture [Mansinghka et al. 2018], MonadBayes [Ścibior et al. 2018], and Gen [Cusumano-Towner et al. 2019]. ...

... One potential way forward is to explicitly generate models of thinking processes that augment the world models with which they are thinking, by synthesizing inference programs (M. F. Cusumano-Towner et al., 2019; V. K. Mansinghka et al., 2018) tailored to specific problems. For example, Venture's inference metaprogramming language is designed to enable concise specification of sequential inference processes that combine SMC, dynamic programming, MCMC, gradient-based optimization, and variational inference to perform inference in a sequence of world models and queries that grows dynamically. ...

... Recent work in 'amortized' or 'compiled' inference has studied techniques for offline optimization of proposal distributions in importance sampling or sequential Monte Carlo [2][3][4][5]. Others have applied similar approaches to optimize proposal distributions for use in MCMC [10,20]. However, the focus of these efforts is optimizing over neurally-parameterized distributions that inherit their structure from the generative model being targeted, do not contain their own internal random choices, and are therefore not suited to use with heuristic randomized algorithms. ...

Reference: Using probabilistic programs as proposals

... The two primary limitations are the training cost and the scalability of the models [23]. The point-wise formulation of PINN involves a large number of automatic differentiation operations, leading to the generation of large tensors during the back-propagation process [24]. In the literature [25] and [26], studies have investigated the impact of the number of collocation points in the computational domain on PINN training. ...

... A few of these do not aim to implement automatic differentiation [20,14]. Forward-mode automatic differentiation is realized in [38,39,45]. ...

... The present author's AD research program, done in collaboration with Jeffrey Mark Siskind, has been to bring Type IV into existence, and to make it fast and robust and general and convenient [30]. These efforts have therefore focused not just on formalizations of the AD process suitable for mathematical analysis and integration into compilers, but also on making AD more general and robust, allowing the AD operators to be used as first-class operators in a nested fashion, thus expanding the scope of programs involving derivatives that can be written succinctly and conveniently [4, 5, 32, 33]. We have also explored the limits of speed that Type IV AD integrated into an aggressive compiler can attain [37]. ...

... A six-fold increase in speed (compared with numerical differentiation using center difference) is reported. Nested applications of AD would facilitate compositional approaches to machine learning tasks, where one can, for example, perform gradient optimization on a system of many components that can in turn be internally using other derivatives or performing optimization (Siskind and Pearlmutter, 2008b; Radul et al., 2012). This capability is relevant to, e.g., hyperparameter optimization, where using gradient methods on model selection criteria has been proposed as an alternative to the established grid search and randomized search methods. ...

... It would be far easier for developers of distributed applications not to need to worry about how provenance is handled in their distributed system; this would reduce complexity of program design. Data propagation, a model of concurrent [4] and distributed computation [5], allows for the transformation of programs that use it so they may track provenance. ...