Jeffrey Pennington
Google
Doctor of Philosophy
About

44 Publications
24,327 Reads
28,417 Citations
Since 2016: 26 research items, 27,455 citations
[Citations per year chart, 2016-2022]
Publications (44)
Preprint
Full-text available
We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow. The repriorisation map acts directly on parameters, and its analytic simplicity complements the known neural network Gaussian process (NNGP)...
Preprint
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open...
Preprint
We develop a stochastic differential equation, called homogenized SGD, for analyzing the dynamics of stochastic gradient descent (SGD) on a high-dimensional random least squares problem with $\ell^2$-regularization. We show that homogenized SGD is the high-dimensional equivalence of SGD -- for any quadratic statistic (e.g., population risk with qua...
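As a rough, self-contained illustration of the setting this abstract describes (not of the homogenized SDE itself), the sketch below runs plain SGD on a random least-squares problem with $\ell^2$-regularization and reports a quadratic test statistic; all dimensions, step sizes, and variable names are illustrative assumptions.

```python
# Minimal sketch (not the paper's homogenized SDE): plain SGD on a random
# least-squares problem with l2 regularization, the setting the abstract
# describes. Sizes and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, delta, lr, steps = 2000, 500, 0.1, 0.05, 5000

A = rng.standard_normal((n, d)) / np.sqrt(d)   # random design matrix
x_star = rng.standard_normal(d)                # ground-truth parameters
b = A @ x_star + 0.1 * rng.standard_normal(n)  # noisy targets

x = np.zeros(d)
for t in range(steps):
    i = rng.integers(n)                              # sample one row per step
    grad = (A[i] @ x - b[i]) * A[i] + delta * x      # per-sample gradient + l2 term
    x -= lr * grad

# a typical quadratic statistic: excess risk on fresh data
A_test = rng.standard_normal((n, d)) / np.sqrt(d)
risk = 0.5 * np.mean((A_test @ x - A_test @ x_star) ** 2)
print(f"excess test risk after SGD: {risk:.4f}")
```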
Preprint
Full-text available
A significant obstacle in the development of robust machine learning models is covariate shift, a form of distribution shift that occurs when the input distributions of the training and test sets differ while the conditional label distributions remain the same. Despite the prevalence of covariate shift in real-world applications, a theoretical unde...
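A minimal toy construction of covariate shift as defined in this abstract: the conditional p(y|x) is shared between train and test while the input distribution p(x) is translated. The distributions, the regression function, and the polynomial model below are illustrative assumptions, not taken from the paper.

```python
# Toy covariate shift: identical conditional p(y|x) for train and test,
# only the input distribution moves. All choices here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def label(x):                      # shared conditional: y = f(x) + noise
    return np.sin(3 * x) + 0.1 * rng.standard_normal(x.shape)

x_train = rng.normal(loc=0.0, scale=1.0, size=1000)   # training inputs
x_test  = rng.normal(loc=2.0, scale=1.0, size=1000)   # shifted test inputs
y_train, y_test = label(x_train), label(x_test)

# Fit a simple polynomial model on the training distribution ...
coeffs = np.polyfit(x_train, y_train, deg=5)

# ... and observe the error gap induced purely by the shift in p(x).
mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
print(f"train MSE: {mse(x_train, y_train):.3f}, shifted test MSE: {mse(x_test, y_test):.3f}")
```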
Article
A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks (NNs) have made a theory of learning dynamics elusive. In this work, we show that for wide NNs the learning dynamics simplify considerably and that, in the infinite width limit,...
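For reference, one standard way to write the resulting infinite-width dynamics for squared loss under continuous-time gradient descent is the linearized (kernel) solution below, with $\Theta$ the neural tangent kernel, $(X, Y)$ the training set, $f_0$ the network at initialization, and $\eta$ the learning rate; the notation is a common convention rather than a verbatim excerpt from the paper.

```latex
% Linearized (infinite-width) prediction under squared loss and gradient flow
f_t(x) \;=\; f_0(x) \;+\; \Theta(x, X)\,\Theta(X, X)^{-1}
      \bigl(I - e^{-\eta\,\Theta(X, X)\,t}\bigr)\bigl(Y - f_0(X)\bigr)
```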
Preprint
Classical learning theory suggests that the optimal generalization performance of a machine learning model should occur at an intermediate model complexity, with simpler models exhibiting high bias and more complex models exhibiting high variance of the predictive function. However, such a simple trade-off does not adequately describe deep learning...
Preprint
Modern deep learning models have achieved great success in predictive accuracy for many data modalities. However, their application to many real-world tasks is restricted by poor uncertainty estimates, such as overconfidence on out-of-distribution (OOD) data and ungraceful failing under distributional shift. Previous benchmarks have found that ense...
Preprint
Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well. An emerging paradigm for describing this unexpected behavior is in terms of a \emph{double descent} curv...
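The double descent shape referred to here can be reproduced in a few lines in the simplest random-features setting; the sketch below sweeps the number of features p past the interpolation threshold p = n for minimum-norm least squares. It is a generic illustration under illustrative sizes and noise levels, not the decomposition analyzed in the paper.

```python
# Generic double-descent curve for minimum-norm least squares on random
# features; test error typically peaks near p = n (the interpolation
# threshold) and then descends again. Sizes and noise are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, noise = 100, 40, 0.5

X = rng.standard_normal((n, d))
w = rng.standard_normal(d) / np.sqrt(d)
y = X @ w + noise * rng.standard_normal(n)
X_te = rng.standard_normal((1000, d))
y_te = X_te @ w

V = rng.standard_normal((d, 300))              # fixed random projection directions
for p in [10, 50, 90, 100, 110, 150, 300]:     # model-size sweep around p = n
    feats = lambda A: np.tanh(A @ V[:, :p] / np.sqrt(d))
    beta = np.linalg.pinv(feats(X)) @ y        # min-norm interpolator when p >= n
    err = np.mean((feats(X_te) @ beta - y_te) ** 2)
    print(f"p = {p:3d}  test MSE = {err:.3f}")
```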
Preprint
We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but und...
Preprint
Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes. In this work, we show that these common perceptions can be completely false in the early phase of learning. In particular, we formally prove...
Preprint
Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) as the width of all layers becomes large. However, many BNN applications are concerned with the BNN function space posterior. While some empirical evidence of the posterior convergence was provided in the original w...
Article
The recent striking success of deep neural networks in machine learning raises profound questions about the theoretical principles underlying their success. For example, what can such deep networks compute? How can we train them? How does information propagate through them? Why can they generalize? And how can we teach them to imagine? We review re...
Preprint
The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effec...
Preprint
A fundamental goal in deep learning is the characterization of trainability and generalization of neural networks as a function of their architecture and hyperparameters. In this paper, we discuss these challenging issues in the context of wide neural networks at large depths where we will see that the situation simplifies considerably. To do this,...
Preprint
We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. We find that gradient signals grow exponentially in depth and that these exploding gradients...
Preprint
A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width l...
Preprint
Training recurrent neural networks (RNNs) on long sequence tasks is plagued with difficulties arising from the exponential explosion or vanishing of signals as they propagate forward or backward through the network. Many techniques have been proposed to ameliorate these issues, including various algorithmic and architectural modifications. Two of t...
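The exponential explosion or vanishing mentioned here is easiest to see in a purely linear recurrence, where repeated application of the recurrent weights rescales the hidden state by roughly gain^T; the width, gains, and sequence length below are illustrative assumptions.

```python
# Forward signal in a linear recurrence: the hidden-state norm grows or
# shrinks exponentially with sequence length depending on the recurrent
# weight scale (spectral radius ~ gain). Sizes and gains are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, T = 128, 100

for gain in [0.9, 1.0, 1.1]:
    W = gain * rng.standard_normal((n, n)) / np.sqrt(n)   # recurrent weights
    h0 = rng.standard_normal(n)                           # initial hidden state
    h = h0.copy()
    for t in range(T):
        h = W @ h                                         # one recurrent step
    print(f"gain {gain}: ||h_T|| / ||h_0|| = {np.linalg.norm(h) / np.linalg.norm(h0):.3e}")
```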
Preprint
There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corre...
Preprint
Full-text available
In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While res...
Preprint
Full-text available
Recurrent neural networks have gained widespread use in modeling sequence data across various domains. While many successful recurrent architectures employ a notion of gating, the exact mechanism that enables such remarkable performance is not well understood. We develop a theory for signal propagation in recurrent networks after random initializat...
Article
Full-text available
Recent work has shown that tight concentration of the entire spectrum of singular values of a deep network's input-output Jacobian around one at initialization can speed up learning by orders of magnitude. Therefore, to guide important design choices, it is important to build a full theoretical understanding of the spectra of Jacobians at initializ...
Article
Full-text available
In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensi...
Article
Full-text available
Many important problems are characterized by the eigenvalues of a large matrix. For example, the difficulty of many optimization problems, such as those arising from the fitting of large models in statistics and machine learning, can be investigated via the spectrum of the Hessian of the empirical loss function. Network data can be understood via t...
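A generic instance of the kind of spectral question described here is the eigenvalue density of a large sample-covariance (Wishart) matrix, whose bulk follows the Marchenko-Pastur law in the high-dimensional limit; the sketch below compares empirical and theoretical spectral edges. It is a standard textbook example under illustrative sizes, not the specific technique of the paper.

```python
# Empirical spectrum of a large sample-covariance matrix vs the
# Marchenko-Pastur bulk edges. Sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, p = 4000, 1000                                # samples, dimensions (ratio 1/4)
X = rng.standard_normal((n, p))
eigs = np.linalg.eigvalsh(X.T @ X / n)           # sample-covariance eigenvalues

ratio = p / n
lam_minus, lam_plus = (1 - np.sqrt(ratio)) ** 2, (1 + np.sqrt(ratio)) ** 2
print(f"empirical edges:        [{eigs.min():.3f}, {eigs.max():.3f}]")
print(f"Marchenko-Pastur edges: [{lam_minus:.3f}, {lam_plus:.3f}]")
```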
Article
Full-text available
It is well known that the initialization of weights in deep neural networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is $O(1)$ is essential for avoiding the exponential vanishing or explosion of gradients. The stronger condition that all singular values...
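For a deep linear network the input-output Jacobian is simply the product of the weight matrices, so the property discussed here can be checked directly: scaled Gaussian initialization keeps the mean squared singular value near one while the spectrum spreads with depth, whereas orthogonal initialization keeps every singular value exactly one. The width and depth in the sketch are illustrative assumptions.

```python
# Singular values of the input-output Jacobian of a deep *linear* network
# (the product of its weight matrices) under Gaussian vs orthogonal init.
# Width and depth are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 30

def jacobian(init):
    J = np.eye(n)
    for _ in range(depth):
        if init == "gaussian":
            W = rng.standard_normal((n, n)) / np.sqrt(n)   # variance 1/n
        else:                                              # random orthogonal
            W, _ = np.linalg.qr(rng.standard_normal((n, n)))
        J = W @ J
    return J

for init in ["gaussian", "orthogonal"]:
    s = np.linalg.svd(jacobian(init), compute_uv=False)
    print(f"{init:>10}: mean sq singular value = {np.mean(s**2):.3f}, "
          f"max = {s.max():.2f}, min = {s.min():.2e}")
```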
Article
Full-text available
A deep fully-connected neural network with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP) in the limit of infinite network width. This correspondence enables exact Bayesian inference for neural networks on regression tasks by means of straightforward matrix computations. For single hidden-layer networks, the covariance...
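A hedged sketch of the inference recipe this abstract describes, using the well-known arc-cosine (ReLU) recursion for the layer-wise covariance followed by the ordinary GP posterior-mean formula; the depth, weight and bias variances, and toy data are illustrative assumptions rather than choices from the paper.

```python
# Exact GP regression with a deep ReLU NNGP covariance: the kernel is built
# by the standard arc-cosine recursion layer by layer, then the usual GP
# posterior mean gives the "infinitely wide network" predictions.
# Depth, variances, and the toy data are illustrative assumptions.
import numpy as np

def nngp_kernel(X1, X2, depth=3, sigma_w2=2.0, sigma_b2=0.0):
    # layer-0 covariance: scaled inner products of the inputs
    K12 = sigma_b2 + sigma_w2 * X1 @ X2.T / X1.shape[1]
    K11 = sigma_b2 + sigma_w2 * np.sum(X1 * X1, 1) / X1.shape[1]
    K22 = sigma_b2 + sigma_w2 * np.sum(X2 * X2, 1) / X2.shape[1]
    for _ in range(depth):
        norms = np.sqrt(np.outer(K11, K22))
        theta = np.arccos(np.clip(K12 / norms, -1.0, 1.0))
        # ReLU (arc-cosine) recursion for cross-covariances and diagonals
        K12 = sigma_b2 + sigma_w2 / (2 * np.pi) * norms * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
        K11 = sigma_b2 + sigma_w2 * K11 / 2
        K22 = sigma_b2 + sigma_w2 * K22 / 2
    return K12

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5)); y = np.sin(X[:, 0])   # toy regression data
X_test = rng.standard_normal((10, 5))

K_xx = nngp_kernel(X, X)
K_tx = nngp_kernel(X_test, X)
noise = 1e-3                                  # observation-noise variance
mean = K_tx @ np.linalg.solve(K_xx + noise * np.eye(len(X)), y)
print(mean)                                   # posterior-mean predictions
```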
Article
Full-text available
A number of recent papers have provided evidence that practical design questions about neural networks may be tackled theoretically by studying the behavior of random networks. However, until now the tools available for analyzing random neural networks have been relatively ad-hoc. In this work, we show that the distribution of pre-activations in ra...
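The central object in this line of analysis is a layer-to-layer "variance map" for the pre-activation distribution; the sketch below iterates q_{l+1} = sigma_w^2 E_z[phi(sqrt(q_l) z)^2] + sigma_b^2 for a tanh nonlinearity, approximating the Gaussian expectation by Monte Carlo. The (sigma_w^2, sigma_b^2) values are illustrative assumptions.

```python
# Mean-field variance map for random tanh networks:
#   q_{l+1} = sigma_w^2 * E_{z~N(0,1)}[tanh(sqrt(q_l) z)^2] + sigma_b^2
# The Gaussian expectation is estimated by Monte Carlo; parameter values
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)            # samples for the Gaussian average

def iterate_variance(sigma_w2, sigma_b2, q0=1.0, depth=30):
    q = q0
    for _ in range(depth):
        q = sigma_w2 * np.mean(np.tanh(np.sqrt(q) * z) ** 2) + sigma_b2
    return q

for sigma_w2 in [1.0, 1.5, 2.0, 3.0]:
    print(f"sigma_w^2 = {sigma_w2}: fixed-point variance ~ {iterate_variance(sigma_w2, 0.05):.3f}")
```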
Article
We describe the hexagon function bootstrap for solving for six-gluon scattering amplitudes in the large $N_c$ limit of ${\cal N}=4$ super-Yang-Mills theory. In this method, an ansatz for the finite part of these amplitudes is constrained at the level of amplitudes, not integrands, using boundary information. In the near-collinear limit, the dual pi...
Article
We present the four-loop remainder function for six-gluon scattering with maximal helicity violation in planar N=4 super-Yang-Mills theory, as an analytic function of three dual-conformal cross ratios. The function is constructed entirely from its analytic properties, without ever inspecting any multi-loop integrand. We employ the same approach use...
Article
Full-text available
We introduce a generating function for the coefficients of the leading logarithmic BFKL Green's function in transverse-momentum space, order by order in alpha_s, in terms of single-valued harmonic polylogarithms. As an application, we exhibit fully analytic azimuthal-angle and transverse-momentum distributions for Mueller-Navelet jet cross sections...
Article
Full-text available
We present the three-loop remainder function, which describes the scattering of six gluons in the maximally-helicity-violating configuration in planar N=4 super-Yang-Mills theory, as a function of the three dual conformal cross ratios. The result can be expressed in terms of multiple Goncharov polylogarithms. We also employ a more restricted class...
Article
Full-text available
The three-loop four-point function of stress-tensor multiplets in N=4 super Yang-Mills theory contains two so far unknown, off-shell, conformal integrals, in addition to the known, ladder-type integrals. In this paper we evaluate the unknown integrals, thus obtaining the three-loop correlation function analytically. The integrals have the generic s...
Article
Full-text available
We present an all-orders formula for the six-point amplitude of planar maximally supersymmetric N=4 Yang-Mills theory in the leading-logarithmic approximation of multi-Regge kinematics. In the MHV helicity configuration, our results agree with an integral formula of Lipatov and Prygarin through at least 14 loops. A differential equation linking the...
Article
Full-text available
We argue that the natural functions for describing the multi-Regge limit of six-gluon scattering in planar \( \mathcal{N}=4 \) super Yang-Mills theory are the single-valued harmonic polylogarithmic functions introduced by Brown. These functions depend on a single complex variable and its conjugate, (w, w*). Using these functions, and formulas due...

Conference Paper
We introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Our method learns vector space representations for multi-word phrases. In sentiment prediction tasks these representations outperform other state-of-the-art approaches on commonly used datasets, such as mov...
Article
Full-text available
Paraphrase detection is the task of examining two sentences and determining whether they have the same meaning. In order to obtain high accuracy on this task, thorough syntactic and semantic analysis of the two statements is needed. We introduce a method for paraphrase detection based on recursive autoencoders (RAE). Our unsupervised RAEs are based...
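A minimal sketch of a single recursive-autoencoder composition step of the kind described: two child phrase vectors are encoded into a parent vector, decoded back, and scored by reconstruction error. The dimensions, the tanh nonlinearity, and the omission of biases and normalization are illustrative simplifications, not the paper's exact architecture.

```python
# One recursive-autoencoder step: compose two child phrase vectors into a
# parent, decode back, and measure reconstruction error. Weights, sizes,
# and the missing biases/normalization are illustrative simplifications.
import numpy as np

rng = np.random.default_rng(0)
d = 50                                         # word/phrase embedding size
W_enc = rng.standard_normal((d, 2 * d)) * 0.1  # encoder weights
W_dec = rng.standard_normal((2 * d, d)) * 0.1  # decoder weights

def compose(c1, c2):
    children = np.concatenate([c1, c2])
    parent = np.tanh(W_enc @ children)               # composed phrase vector
    recon = np.tanh(W_dec @ parent)                  # reconstructed children
    err = np.sum((recon - children) ** 2)            # reconstruction loss
    return parent, err

c1, c2 = rng.standard_normal(d), rng.standard_normal(d)
parent, err = compose(c1, c2)
print(parent.shape, f"reconstruction error = {err:.2f}")
```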
Article
Full-text available
We use a combination of a 't Hooft limit and numerical methods to find non-perturbative solutions of exactly solvable string theories, showing that perturbative solutions in different asymptotic regimes are connected by smooth interpolating functions. Our earlier perturbative work showed that a large class of minimal string theories arise as specia...
Article
Full-text available
We uncover a remarkable role that an infinite hierarchy of non-linear differential equations plays in organizing and connecting certain {hat c}<1 string theories non-perturbatively. We are able to embed the type 0A and 0B (A,A) minimal string theories into this single framework. The string theories arise as special limits of a rich system of equati...
Conference Paper
Full-text available
Plasma can be used as a convenient medium for manipulating intense charged particle beams, e.g., for ballistic focusing and steering, because the plasma can effectively reduce the self-space charge potential and self-magnetic field of the beam pulse. We previously developed a reduced analytical model of beam charge and current neutralization for an...
Article
An analytical model is developed to describe the self-magnetic field of a finite-length ion beam pulse propagating in a cold background plasma in a solenoidal magnetic field. Previously, we developed an analytical model to describe the current neutralization of a beam pulse propagating in a background plasma. In the presence of an applied magnetic...
Article
Full-text available
Type 0A string theory in the (2,4k) superconformal minimal model backgrounds, with background ZZ D-branes or R-R fluxes can be formulated non-perturbatively. The branes and fluxes have a description as threshold bound states in an associated one-dimensional quantum mechanics which has a supersymmetric structure, familiar from studies of the general...
Article
Full-text available
We study the Type 0A string theory in the (2, 4k) superconformal minimal model backgrounds, focusing on the fully non-perturbative string equations which define the partition function of the model. The equations admit a parameter, Γ, which in the spacetime interpretation controls the number of background D-branes, or R-R flux units, depending upo...
