## About

Publications: 155

Reads: 16,435

Citations: 15,049

## Publications

Saliency methods compute heat maps that highlight portions of an input that were most *important* for the label assigned to it by a deep net. Evaluations of saliency methods convert this heat map into a new *masked input* by retaining the $k$ highest-ranked pixels of the original input and replacing the rest with "uninforma...

It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural...

Influence functions estimate effect of individual data points on predictions of the model on test data and were adapted to deep learning in Koh and Liang [2017]. They have been used for detecting data poisoning, detecting helpful and harmful examples, influence of groups of datapoints, etc. Recently, Ilyas et al. [2022] introduced a linear regressi...

As part of the effort to understand implicit bias of gradient descent in overparametrized models, several results have shown how the training trajectory on the overparametrized model can be understood as mirror descent on a different objective. The main result here is a characterization of this phenomenon under a notion termed commuting parametriza...

Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives mathematical analysis and supportin...

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there wer...

Deep learning experiments in Cohen et al. (2021) using deterministic Gradient Descent (GD) revealed an *Edge of Stability (EoS)* phase when learning rate (LR) and sharpness (i.e., the largest eigenvalue of the Hessian) no longer behave as in traditional optimization. Sharpness stabilizes around $2/\text{LR}$ and loss goes up and down across iterati...

Contrastive learning is a popular form of self-supervised learning that encourages augmentations (views) of the same input to have more similar representations compared to augmentations of different inputs. Recent attempts to theoretically explain the success of contrastive learning on downstream classification tasks prove guarantees depending on p...

Gradient inversion attack (or input recovery from gradient) is an emerging threat to the security and privacy preservation of Federated learning, whereby malicious eavesdroppers or participants in the protocol can recover (partially) the clients' private data. This paper evaluates existing attacks and defenses. We find that some attacks make strong...

The generalization mystery of overparametrized deep nets has motivated efforts to understand how gradient descent (GD) converges to low-loss solutions that generalize well. Real-life neural networks are initialized from small random values and trained with cross-entropy loss for classification (unlike the "lazy" or "NTK" regime of training where an...

Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $\eta$, SGD tracks Gradient Descent (GD) until it gets close to such...

It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Ito Stochastic Differential Equations (SDEs). But formal justification for this approximation (e.g., (Li et al., 2019a)) only ap...

Convolutional neural networks often dominate fully-connected counterparts in generalization performance, especially on image classification tasks. This is often explained in terms of 'better inductive bias'. However, this has not been made mathematically rigorous, and the hurdle is that the fully connected net can always simulate the convolutional...

An unsolved challenge in distributed or federated learning is to effectively mitigate privacy risks without slowing down training or reducing accuracy. In this paper, we propose TextHide aiming at addressing this challenge for natural language understanding tasks. It requires all participants to add a simple encryption step to prevent an eavesdropp...

Autoregressive language models pretrained on large corpora have been successful at solving downstream tasks, even with zero-shot usage. However, there is little theoretical justification for their success. This paper considers the following questions: (1) Why should learning the distribution of natural language help with downstream classification t...

Recent works (e.g., (Li and Arora, 2020)) suggest that the use of popular normalization schemes (including Batch Normalization) in today's deep learning can move it far from a traditional optimization viewpoint, e.g., use of exponentially increasing learning rates. The current paper highlights other ways in which behavior of normalized nets departs...

One popular trend in meta-learning is to learn from many training tasks a common initialization for a gradient-based method that can be used to solve a new task with few samples. The theory of meta-learning is still in its early stages, with several recent learning-theoretic analyses of methods such as Reptile [Nichol et al., 2018] being for convex...

A common strategy in modern learning systems is to learn a representation that is useful for many tasks, a.k.a. representation learning. We study this strategy in the imitation learning setting for Markov decision processes (MDPs) where multiple experts' trajectories are available. We formulate representation learning as a bi-level optimization pro...

Adversarial training is a popular method to give neural nets robustness against adversarial perturbations. In practice adversarial training leads to low robust training loss. However, a rigorous explanation for why this happens under natural conditions is still missing. Recently a convergence theory for standard (non-adversarial) supervised trainin...

Recent research shows that for training with $\ell_2$ loss, convolutional neural networks (CNNs) whose width (number of channels in convolutional layers) goes to infinity correspond to regression with respect to the CNN Gaussian Process kernel (CNN-GP) if only the last layer is trained, and correspond to regression with respect to the Convolutional...

Intriguing empirical evidence exists that deep learning can work well with exotic schedules for varying the learning rate. This paper suggests that the phenomenon may be due to Batch Normalization or BN (Ioffe & Szegedy, 2015), which is ubiquitous and provides benefits in optimization and generalization across all standard architectures. The following...

Recent research shows that the following two models are equivalent: (a) infinitely wide neural networks (NNs) trained under $\ell_2$ loss by gradient descent with infinitesimally small learning rate, and (b) kernel regression with respect to so-called Neural Tangent Kernels (NTKs) (Jacot et al., 2018). An efficient algorithm to compute the NTK, as well as its...

Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima---at least those discovered by gradient-based optimization---turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piece-wise linear, with as few as two segments. We give mathematical expla...

Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low "complexity." We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing --- a model referr...

There is great interest in *saliency methods* (also called *attribution methods*), which give "explanations" for a deep net's decision, by assigning a *score* to each feature/pixel in the input. Their design usually involves credit-assignment via the gradient of the output with respect to input. Recently Adebayo et al. [arXiv:1810.03292] questioned...

How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its "width" --- namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers --- is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoreti...

Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the well-known word2vec embedding algorithm: leveraging availability of pairs of semantically "similar" data points and "negative samples," the learner...

Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: (i) Using a tighter characterization of training speed...

Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability t...

We analyze speed of convergence to global optimum for gradient descent training a deep linear neural network (parameterized as $x\mapsto W_N \cdots W_1x$) by minimizing the $\ell_2$ loss over whitened data. Convergence at a linear rate is guaranteed when the following hold: (i) dimensions of hidden layers are at least the minimum of the input and o...

Motivations like domain adaptation, transfer learning, and feature learning have fueled interest in inducing embeddings for rare or unseen words, n-grams, synsets, and other textual features. This paper introduces a la carte embedding, a simple and general alternative to the usual word2vec-based approaches for building such representations that is...

A first line of attack in exploratory data analysis is data visualization, i.e., generating a 2-dimensional representation of data that makes clusters of similar points visually identifiable. Standard Johnson-Lindenstrauss dimensionality reduction does not produce data visualizations. The t-SNE heuristic of van der Maaten and Hinton, which is based...

Conventional wisdom in deep learning states that increasing depth improves expressiveness but complicates optimization. This paper suggests that, sometimes, increasing depth can speed up optimization. The effect of depth on optimization is decoupled from expressiveness by focusing on settings where additional layers amount to overparameterization -...

Deep nets generalize well despite having more parameters than the number of training samples. Recent works try to give an explanation using PAC-Bayes and Margin-based analyses, but do not as yet result in sample complexity bounds better than naive parameter counting. The current paper shows generalization bounds that are orders of magnitude better i...

Encoder-decoder GAN architectures (e.g., BiGAN and ALI) seek to add an inference mechanism to the GAN setup, consisting of a small encoder deep net that maps data-points to their succinct encodings. The intuition is that being forced to train an encoder alongside the usual generator forces the system to learn meaningful mappings from the code to...

Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Learning the model ---that is, the mapping from hidden variables to visible ones and vice versa---is NP-hard even in very sim...

There is general consensus that learning representations is useful for a variety of reasons, e.g. efficient use of labeled data (semi-supervised learning), transfer learning and understanding hidden structure of data. Popular techniques for representation learning include clustering, manifold learning, kernel-learning, autoencoders, Boltzmann machi...

Several research groups have shown how to map fMRI responses to the meanings of presented stimuli. This paper presents new methods for doing so when only a natural language annotation is available as the description of the stimulus. We study fMRI data gathered from subjects watching an episode of BBC's Sherlock (Chen et al., 2017), and learn bidirec...

This work presents an unsupervised approach for improving WordNet that builds upon recent advances in document and sense representation via distributional semantics. We apply our methods to construct Wordnets in French and Russian, languages which both lack good manual constructions. These are evaluated on two new 600-word test sets for word-to-sy...

This paper makes progress on several open theoretical issues related to Generative Adversarial Networks. A definition is provided for what it means for the training to generalize, and it is shown that generalization is not guaranteed for the popular distances between distributions such as Jensen-Shannon or Wasserstein. We introduce a new metric cal...

Deep neural nets have caused a revolution in many classification tasks. A related ongoing revolution---also theoretically not understood---concerns their ability to serve as generative models for complicated types of data such as images and texts. These models are trained using ideas like variational autoencoders and Generative Adversarial Networks...

Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Finding parameters with the maximum likelihood is NP-hard even in very simple settings. In recent years, provably efficient a...

Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods. Many use nonlinear operations on co-occurrence statistics, and have hand-tuned hyperparameters and reweighting methods.
This paper proposes a new generative model, a dynamic version of the log-linear topic model of Mnih and Hinton (2007). The...

This work provides support for the notion that distributional methods of representing word meaning from computational linguistics are useful for capturing neural correlates of real life multi-sensory stimuli, where the stimuli ---in this case, a movie being watched by the human subjects--- have been given text annotations. We present an approach to...

Recently, there has been considerable progress on designing algorithms with provable guarantees -- typically using linear algebraic methods -- for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be more challenging. Here we take a first step towards provable inference in topic models. We l...

Semidefinite programs (SDPs) have been used in many recent approximation algorithms. We develop a general primal-dual approach to solve SDPs using a generalization of the well-known multiplicative weights update rule to symmetric matrices. For a number of problems, such as SPARSEST CUT and BALANCED SEPARATOR in undirected and directed weighted grap...

Word embeddings are ubiquitous in NLP and information retrieval, but it's
unclear what they represent when the word is polysemous, i.e., has multiple
senses. Here it is shown that multiple word senses reside in linear
superposition within the word embedding and can be recovered by simple sparse
coding.
The success of the method ---which applies to...

In the nonnegative matrix factorization (NMF) problem we are given an $n \times m$ nonnegative matrix $M$ and an integer $r > 0$. Our goal is to express $M$ as $A W$, where $A$ and $W$ are nonnegative matrices of size $n \times r$ and $r \times m$, respectively. In some applications, it makes sense to ask instead for the product $AW$ to approximate...

Generative model approaches to deep learning are of interest in the quest for
both better understanding as well as training methods requiring fewer labeled
samples.
Recent works use generative model approaches to produce the deep net's input
given the value of a hidden layer several levels above. However, there is no
accompanying "proof of correctn...

Sparse coding is a basic task in many fields including signal processing,
neuroscience and machine learning where the goal is to learn a basis that
enables a sparse representation of a given set of data, if one exists. Its
standard formulation is as a non-convex optimization problem which is solved in
practice by heuristics based on alternating min...

The papers of Mikolov et al. 2013 as well as subsequent works have led to
dramatic progress in solving word analogy tasks using semantic word embeddings.
This leverages linear structure that is often found in the word embeddings,
which is surprising since the training method is usually nonlinear. There were
attempts ---notably by Levy and Goldberg...

We give algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others. Our generative model is an $n$ node multilayer neural net that has degree at most $n^{\gamma}$ for some $\gamma < 1$ and each edge has a random edge weight in $[-1,1]$. Our algorithm learns $...

In dictionary learning, also known as sparse coding, the algorithm is given
samples of the form $y = Ax$ where $x\in \mathbb{R}^m$ is an unknown random
sparse vector and $A$ is an unknown dictionary matrix in $\mathbb{R}^{n\times
m}$ (usually $m > n$, which is the overcomplete case). The goal is to learn $A$
and $x$. This problem has been studied i...

We give algorithms with provable guarantees that learn a class of deep nets
in the generative model view popularized by Hinton and others. Our generative
model is an $n$ node multilayer neural net that has degree at most $n^{\gamma}$
for some $\gamma <1$ and each edge has a random edge weight in $[-1,1]$. Our
algorithm learns *almost all* netwo...

A matrix $A \in \mathbb{R}^{n \times m}$ is said to be $\mu$-incoherent if each pair
of columns has inner product at most $\mu / \sqrt{n}$. Starting with the
pioneering work of Donoho and Huo such matrices (often called
*dictionaries*) have played a central role in signal processing, statistics and
machine learning. They allow *sparse recovery*: th...

We give a new $(1+\epsilon)$-approximation for sparsest cut problem on
graphs where small sets expand significantly more than the sparsest cut
(sets of size $n/r$ expand by a factor $\sqrt{\log n\log r}$ bigger, for
some small $r$; this condition holds for many natural graph families).
We give two different algorithms. One involves Guruswami-Sinop...

Topic models provide a useful method for dimensionality reduction and
exploratory data analysis in large text corpora. Most approaches to topic model
inference have been based on a maximum likelihood objective. Efficient
algorithms exist that approximate this objective, but they have no provable
guarantees. Recently, algorithms have been introduced...

Suppose we are given an oracle that claims to approximate the permanent for most matrices X, where X is chosen from the Gaussian ensemble (the matrix entries are i.i.d. univariate complex Gaussians). Can we test that the oracle satisfies this claim? This paper gives a polynomial-time algorithm for the task.
The oracle-testing problem is of interest...

We present a new algorithm for Independent Component Analysis (ICA) which has
provable performance guarantees. In particular, suppose we are given samples of
the form $y = Ax + \eta$ where $A$ is an unknown $n \times n$ matrix and $x$ is
a random variable whose components are independent and have a fourth moment
strictly less than that of a standar...

Topic Modeling is an approach used for automatic comprehension and
classification of data in a variety of settings, and perhaps the canonical
application is in uncovering thematic structure in a corpus of documents. A
number of foundational works both in machine learning and in theory have
suggested a probabilistic model for documents, whereby docu...

Algorithms in varied fields use the idea of maintaining a
distribution over a certain set and use the multiplicative
update rule to iteratively change these weights. Their analyses
are usually very similar and rely on an exponential potential
function.
In this survey we present a simple meta-algorithm that unifies many
of these disparate algorithm...

Linear programming (LP) decoding for low-density parity-check codes (and related domains such as compressed sensing) has received increased attention over recent years because of its practical performance (coming close to that of iterative decoding algorithms) and its amenability to finite-blocklength analysis. Several works starting with the work of F...

A community in a social network is usually understood to be a group of nodes more densely connected with each other than with the rest of the network. This is an important concept in most domains where networks arise: social, technological, biological, etc. For many years algorithms for finding communities implicitly assumed communities are nonover...

The Nonnegative Matrix Factorization (NMF) problem has a rich history spanning quantum mechanics, probability theory, data analysis, polyhedral combinatorics, communication complexity, demography, chemometrics, etc. In the past decade NMF has become enormously popular in machine learning, where the factorization is computed using a variety of local...

How to color 3-colorable graphs with few colors is a problem of longstanding interest. The best polynomial-time algorithm uses $n^{0.2072}$ colors. There are no indications that coloring using, say, $O(\log n)$ colors is hard. It has been suggested that SDP hierarchies could be used to design algorithms that use $n^{\varepsilon}$ colors for arbitrarily small $\varepsilon > 0$.
We e...

We give new algorithms for a variety of randomly-generated instances of computational problems using a linearization technique that reduces to solving a system of linear equations.
These algorithms are derived in the context of learning with structured noise, a notion introduced in this paper. This notion is best illustrated with the learning pari...

This paper introduces notions from computational complexity into the study of financial derivatives. Traditional economics argues that derivatives, like CDOs and CDSs, ameliorate the negative costs imposed due to asymmetric information between buyers and sellers. This is because securitization via these derivatives allows the informed party to fi...

In a well-known paper [ARV], Arora, Rao and Vazirani obtained an $O(\sqrt{\log n})$
approximation to the Balanced Separator problem and Uniform Sparsest Cut.
At the heart of their result is a geometric statement about sets of points that
satisfy triangle inequalities, which also underlies subsequent work on
approximation algorithms and geometric embeddi...

Subexponential time approximation algorithms are presented for the Unique Games and Small-Set Expansion problems. Specifically, for some absolute constant c, the following two algorithms are presented.
(1) An $\exp(k n^{\varepsilon})$-time algorithm that, given as input a k-alphabet unique game on n variables that has an assignment satisfying $1-\varepsilon^c$ fractio...

Computing approximate solutions for NP-hard problems is an important research endeavor. Since the work of Goemans-Williamson in 1993, semidefinite programming (a form of convex programming in which the variables are vector inner products) has been used to design the current best approximation algorithms for problems such as MAX-CUT, MAX-3SAT, SPARS...

This paper shows how to compute $O(\sqrt{\log n})$-approximations to the sparsest cut and balanced separator problems in $\tilde{O}(n^2)$ time, thus improving upon the recent algorithm of S. Arora, S. Rao and U. Vazirani [Proceedings of the 36th annual ACM symposium on theory of computing (STOC 2004), 222–231 (2004; Zbl 1192.68467)]. Their algorithm uses semidefinit...

We propose the study of graphs that are defined by low-complexity distributed and deterministic agents. We suggest that this
viewpoint may help introduce the element of individual choice in models of large scale social networks. This viewpoint may also provide interesting new classes of graphs for which to
design algorithms.
We focus largely on th...

This beginning graduate textbook describes both recent achievements and classical results of computational complexity theory. Requiring essentially no background apart from mathematical maturity, the book can be used as a reference for self-study for anyone interested in complexity, including physicists, mathematicians, and other scientists, as wel...

Linear programming decoding for low-density parity check codes (and related domains such as compressed sensing) has received increased attention over recent years because of its practical performance (coming close to that of iterative decoding algorithms) and its amenability to finite-blocklength analysis. Several works starting with the work of Fe...

Graph partitioning is the computational problem of dividing the vertices of a graph into two large pieces while minimizing the number of edges between them. Partitioning has applications in computer vision, data analysis, image segmentation, and image analysis. The geometric approach to partitioning starts by drawing the graph in a geometric space by k...

We present an efficient algorithm to find a good solution to the Unique Games problem when the constraint graph is an expander.
We introduce a new analysis of the standard SDP in this case that involves correlations among distant vertices. It also leads to a parallel repetition theorem for unique games when the graph is an expander.

We show that every n-point metric of negative type (in particular, every n-point subset of $L_1$) admits a Fréchet embedding into Euclidean space with distortion $O(\sqrt{\log n}\cdot \log \log n)$, a result which is tight up to the $O(\log \log n)$ factor, even for Euclidean metrics. This strengthens our recent work on the Euclidean distortion of met...

Semidefinite programs (SDP) have been used in many recent approximation algorithms. We develop a general primal-dual approach to solve SDPs using a generalization of the well-known multiplicative weights update rule to symmetric matrices. For a number of problems, such as Sparsest Cut and Balanced Separator in undirected and directed weighted graph...

For any ε>0 we give a (2+ε)-approximation algorithm for the problem of finding a minimum tree spanning any k vertices in a graph (k-MST), improving a 3-approximation algorithm by N. Garg [A 3-approximation for the minimum tree spanning k vertices. In: Proceedings of the 37th IEEE Symp. on Foundations of Computer Science (FOCS), 302–309 (1996)]. As in...

In the Euclidean traveling salesman problem, we are given n nodes in $\mathbb{R}^2$ (more generally, in $\mathbb{R}^d$) and desire the minimum cost salesman tour for these nodes, where the cost of the edge between nodes $(x_1,y_1)$ and $(x_2,y_2)$ is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.
The decision version of the problem (“Does a tour of cost ≤ C exist?”) is NP-hard [65...

Motivated by applications in combinatorial optimization, we initiate a study of the extent to which the global properties of a metric space (especially, embeddability in $\ell_1$ with low distortion) are determined by the properties of small subspaces. We note connections to similar issues studied already in Ramsey theory, complexity theory (especially...

Proving integrality gaps for linear relaxations of NP optimization
problems is a difficult task and usually undertaken on a
case-by-case basis. We initiate a more systematic approach. We
prove an integrality gap of $2 -o(1)$ for three families of linear
relaxations for VERTEX COVER, and our methods seem
relevant to other problems as well.

We describe a simple random-sampling based procedure for producing sparse matrix approximations. Our procedure and analysis
are extremely simple: the analysis uses nothing more than the Chernoff-Hoeffding bounds. Despite the simplicity, the approximation
is comparable and sometimes better than previous work.
Our algorithm computes the sparse matri...

We describe how to color every 3-colorable graph with $O(n^{0.2111})$ colors, thus improving an algorithm of Blum and Karger from almost a decade ago. Our analysis uses new geometric ideas inspired by the recent work of Arora, Rao, and Vazirani on SPARSEST CUT, and these ideas show promise of leading to further improvements.