# Jeffrey Pennington | Google Inc.


Doctor of Philosophy

## About

- Publications: 44
- Reads: 24,327 (a 'read' is counted each time someone views a publication summary, such as the title, abstract, and list of authors, clicks on a figure, or views or downloads the full text)
- Citations: 28,417 (since 2016)


## Publications (44)

We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow. The repriorisation map acts directly on parameters, and its analytic simplicity complements the known neural network Gaussian process (NNGP)...

Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open...

We develop a stochastic differential equation, called homogenized SGD, for analyzing the dynamics of stochastic gradient descent (SGD) on a high-dimensional random least squares problem with $\ell^2$-regularization. We show that homogenized SGD is the high-dimensional equivalent of SGD -- for any quadratic statistic (e.g., population risk with qua...
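The setting studied here can be illustrated with a minimal sketch (plain SGD on a random $\ell^2$-regularized least-squares problem; this is not the homogenized SDE itself, and all dimensions, penalties, and step sizes below are illustrative assumptions):

```python
import numpy as np

# Random high-dimensional least-squares problem with l2 regularization
# (an illustrative stand-in for the setting analyzed in the paper).
rng = np.random.default_rng(0)
n, d = 500, 100
X = rng.standard_normal((n, d)) / np.sqrt(d)   # random design matrix
w_star = rng.standard_normal(d)                # ground-truth weights
y = X @ w_star + 0.1 * rng.standard_normal(n)  # noisy targets
lam, lr = 0.01, 0.1                            # ridge penalty, step size

w = np.zeros(d)
for _ in range(5000):
    i = rng.integers(n)                          # sample one row per step
    grad = (X[i] @ w - y[i]) * X[i] + lam * w    # per-sample ridge gradient
    w -= lr * grad
empirical_risk = np.mean((X @ w - y) ** 2)
```

Tracking a statistic like `empirical_risk` over such runs is the kind of quadratic observable whose trajectory homogenized SGD is designed to capture in closed form.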

A significant obstacle in the development of robust machine learning models is covariate shift, a form of distribution shift that occurs when the input distributions of the training and test sets differ while the conditional label distributions remain the same. Despite the prevalence of covariate shift in real-world applications, a theoretical unde...

A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks (NNs) have made a theory of learning dynamics elusive. In this work, we show that for wide NNs the learning dynamics simplify considerably and that, in the infinite width limit,...
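For squared loss and gradient flow, the simplified infinite-width dynamics described here admit a closed form; with the standard neural tangent kernel notation (all symbols assumed: $\Theta$ the NTK, $X$ the training inputs, $y$ the targets, $\eta$ the learning rate), the predictions evolve as

$$
f_t(x) = f_0(x) + \Theta(x, X)\,\Theta(X, X)^{-1}\left(I - e^{-\eta\,\Theta(X, X)\,t}\right)\bigl(y - f_0(X)\bigr),
$$

so the trained network is its initialization plus a kernel-regression correction that converges to exact interpolation as $t \to \infty$.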

Classical learning theory suggests that the optimal generalization performance of a machine learning model should occur at an intermediate model complexity, with simpler models exhibiting high bias and more complex models exhibiting high variance of the predictive function. However, such a simple trade-off does not adequately describe deep learning...

Modern deep learning models have achieved great success in predictive accuracy for many data modalities. However, their application to many real-world tasks is restricted by poor uncertainty estimates, such as overconfidence on out-of-distribution (OOD) data and ungraceful failure under distributional shift. Previous benchmarks have found that ense...

Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well. An emerging paradigm for describing this unexpected behavior is in terms of a \emph{double descent} curv...

We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but und...

Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes. In this work, we show that these common perceptions can be completely false in the early phase of learning. In particular, we formally prove...

Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) as the width of all layers becomes large. However, many BNN applications are concerned with the BNN function space posterior. While some empirical evidence of the posterior convergence was provided in the original w...

The recent striking success of deep neural networks in machine learning raises profound questions about the theoretical principles underlying their success. For example, what can such deep networks compute? How can we train them? How does information propagate through them? Why can they generalize? And how can we teach them to imagine? We review re...

The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effec...

A fundamental goal in deep learning is the characterization of trainability and generalization of neural networks as a function of their architecture and hyperparameters. In this paper, we discuss these challenging issues in the context of wide neural networks at large depths where we will see that the situation simplifies considerably. To do this,...

We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. We find that gradient signals grow exponentially in depth and that these exploding gradients...

A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width l...

Training recurrent neural networks (RNNs) on long sequence tasks is plagued with difficulties arising from the exponential explosion or vanishing of signals as they propagate forward or backward through the network. Many techniques have been proposed to ameliorate these issues, including various algorithmic and architectural modifications. Two of t...

There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corre...

In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While res...

Recurrent neural networks have gained widespread use in modeling sequence data across various domains. While many successful recurrent architectures employ a notion of gating, the exact mechanism that enables such remarkable performance is not well understood. We develop a theory for signal propagation in recurrent networks after random initializat...

Recent work has shown that tight concentration of the entire spectrum of singular values of a deep network's input-output Jacobian around one at initialization can speed up learning by orders of magnitude. Therefore, to guide important design choices, it is important to build a full theoretical understanding of the spectra of Jacobians at initializ...

In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensi...

Many important problems are characterized by the eigenvalues of a large matrix. For example, the difficulty of many optimization problems, such as those arising from the fitting of large models in statistics and machine learning, can be investigated via the spectrum of the Hessian of the empirical loss function. Network data can be understood via t...

It is well known that the initialization of weights in deep neural networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is $O(1)$ is essential for avoiding the exponential vanishing or explosion of gradients. The stronger condition that all singular values...
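A minimal NumPy sketch of the kind of initialization this condition motivates (orthogonal weight matrices, so every singular value equals the chosen gain; the function name and `gain` parameter are illustrative):

```python
import numpy as np

def orthogonal_init(n_out, n_in, gain=1.0, rng=None):
    # Draw a Gaussian matrix and orthogonalize it via QR decomposition,
    # so every singular value of the returned matrix equals `gain`.
    rng = np.random.default_rng(rng)
    a = rng.standard_normal((max(n_out, n_in), min(n_out, n_in)))
    q, r = np.linalg.qr(a)
    # Fix the sign ambiguity of QR so the orthogonal factor is
    # uniformly (Haar) distributed.
    q *= np.sign(np.diag(r))
    if n_out < n_in:
        q = q.T
    return gain * q

W = orthogonal_init(256, 128)
s = np.linalg.svd(W, compute_uv=False)  # all singular values equal 1
```

With Gaussian initialization the singular values of each layer spread out with depth; the orthogonal construction above pins them all at one, which is the stronger condition the abstract refers to.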

A deep fully-connected neural network with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP) in the limit of infinite network width. This correspondence enables exact Bayesian inference for neural networks on regression tasks by means of straightforward matrix computations. For single hidden-layer networks, the covariance...
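For ReLU activations, the layer-to-layer covariance recursion behind this correspondence has a closed form (the arc-cosine kernel); a minimal sketch under that assumption, with illustrative depth and variance parameters:

```python
import numpy as np

def nngp_relu_kernel(x1, x2, depth=3, sigma_w2=2.0, sigma_b2=0.0):
    # NNGP covariance between two inputs for a deep ReLU network,
    # computed by iterating the exact layer-wise recursion.
    d = len(x1)
    k11 = sigma_b2 + sigma_w2 * np.dot(x1, x1) / d
    k22 = sigma_b2 + sigma_w2 * np.dot(x2, x2) / d
    k12 = sigma_b2 + sigma_w2 * np.dot(x1, x2) / d
    for _ in range(depth):
        c = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)
        theta = np.arccos(c)
        # Closed-form Gaussian expectation E[relu(u) relu(v)] for (u, v)
        # with covariance [[k11, k12], [k12, k22]].
        ev = np.sqrt(k11 * k22) * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta)
        ) / (2 * np.pi)
        k12 = sigma_b2 + sigma_w2 * ev
        k11 = sigma_b2 + sigma_w2 * k11 / 2.0
        k22 = sigma_b2 + sigma_w2 * k22 / 2.0
    return k12
```

Evaluating this kernel on all pairs of training and test points is the "straightforward matrix computation" that makes exact Bayesian regression with an infinitely wide network possible.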

A number of recent papers have provided evidence that practical design questions about neural networks may be tackled theoretically by studying the behavior of random networks. However, until now the tools available for analyzing random neural networks have been relatively ad-hoc. In this work, we show that the distribution of pre-activations in ra...

We describe the hexagon function bootstrap for solving for six-gluon scattering amplitudes in the large $N_c$ limit of ${\cal N}=4$ super-Yang-Mills theory. In this method, an ansatz for the finite part of these amplitudes is constrained at the level of amplitudes, not integrands, using boundary information. In the near-collinear limit, the dual pi...

We present the four-loop remainder function for six-gluon scattering with maximal helicity violation in planar N=4 super-Yang-Mills theory, as an analytic function of three dual-conformal cross ratios. The function is constructed entirely from its analytic properties, without ever inspecting any multi-loop integrand. We employ the same approach use...

We introduce a generating function for the coefficients of the leading logarithmic BFKL Green's function in transverse-momentum space, order by order in \(\alpha_s\), in terms of single-valued harmonic polylogarithms. As an application, we exhibit fully analytic azimuthal-angle and transverse-momentum distributions for Mueller-Navelet jet cross sections...

We present the three-loop remainder function, which describes the scattering of six gluons in the maximally-helicity-violating configuration in planar N=4 super-Yang-Mills theory, as a function of the three dual conformal cross ratios. The result can be expressed in terms of multiple Goncharov polylogarithms. We also employ a more restricted class...

The three-loop four-point function of stress-tensor multiplets in N=4 super Yang-Mills theory contains two so far unknown, off-shell, conformal integrals, in addition to the known, ladder-type integrals. In this paper we evaluate the unknown integrals, thus obtaining the three-loop correlation function analytically. The integrals have the generic s...

We present an all-orders formula for the six-point amplitude of planar maximally supersymmetric N=4 Yang-Mills theory in the leading-logarithmic approximation of multi-Regge kinematics. In the MHV helicity configuration, our results agree with an integral formula of Lipatov and Prygarin through at least 14 loops. A differential equation linking the...

We argue that the natural functions for describing the multi-Regge limit of six-gluon scattering in planar \( \mathcal{N}=4 \) super Yang-Mills theory are the single-valued harmonic polylogarithmic functions introduced by Brown. These functions depend on a single complex variable and its conjugate, \((w, w^*)\). Using these functions, and formulas due...

We introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Our method learns vector space representations for multi-word phrases. In sentiment prediction tasks these representations outperform other state-of-the-art approaches on commonly used datasets, such as mov...

Paraphrase detection is the task of examining two sentences and determining whether they have the same meaning. In order to obtain high accuracy on this task, thorough syntactic and semantic analysis of the two statements is needed. We introduce a method for paraphrase detection based on recursive autoencoders (RAE). Our unsupervised RAEs are based...

We use a combination of a 't Hooft limit and numerical methods to find non-perturbative solutions of exactly solvable string theories, showing that perturbative solutions in different asymptotic regimes are connected by smooth interpolating functions. Our earlier perturbative work showed that a large class of minimal string theories arise as specia...

We uncover a remarkable role that an infinite hierarchy of non-linear differential equations plays in organizing and connecting certain \(\hat{c}<1\) string theories non-perturbatively. We are able to embed the type 0A and 0B (A,A) minimal string theories into this single framework. The string theories arise as special limits of a rich system of equati...

Plasma can be used as a convenient medium for manipulating intense charged particle beams, e.g., for ballistic focusing and steering, because the plasma can effectively reduce the self-space charge potential and self-magnetic field of the beam pulse. We previously developed a reduced analytical model of beam charge and current neutralization for an...

An analytical model is developed to describe the self-magnetic field of a finite-length ion beam pulse propagating in a cold background plasma in a solenoidal magnetic field. Previously, we developed an analytical model to describe the current neutralization of a beam pulse propagating in a background plasma. In the presence of an applied magnetic...

Type 0A string theory in the (2,4k) superconformal minimal model backgrounds, with background ZZ D-branes or R-R fluxes can be formulated non-perturbatively. The branes and fluxes have a description as threshold bound states in an associated one-dimensional quantum mechanics which has a supersymmetric structure, familiar from studies of the general...

We study the Type 0A string theory in the (2, 4k) superconformal minimal model backgrounds, focusing on the fully non-perturbative string equations which define the partition function of the model. The equations admit a parameter, Γ, which in the spacetime interpretation controls the number of background D-branes, or R-R flux units, depending upo...