Sanjeev Arora

Sanjeev Arora
Princeton University | PU · Department of Computer Science

About

155
Publications
16,435
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
15,049
Citations
Citations since 2017
45 Research Items
6368 Citations
201720182019202020212022202302004006008001,000
201720182019202020212022202302004006008001,000
201720182019202020212022202302004006008001,000
201720182019202020212022202302004006008001,000
Introduction

Publications

Publications (155)
Preprint
Full-text available
Saliency methods compute heat maps that highlight portions of an input that were most {\em important} for the label assigned to it by a deep net. Evaluations of saliency methods convert this heat map into a new {\em masked input} by retaining the $k$ highest-ranked pixels of the original input and replacing the rest with \textquotedblleft uninforma...
Preprint
It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural...
Preprint
Full-text available
Influence functions estimate effect of individual data points on predictions of the model on test data and were adapted to deep learning in Koh and Liang [2017]. They have been used for detecting data poisoning, detecting helpful and harmful examples, influence of groups of datapoints, etc. Recently, Ilyas et al. [2022] introduced a linear regressi...
Preprint
As part of the effort to understand implicit bias of gradient descent in overparametrized models, several results have shown how the training trajectory on the overparametrized model can be understood as mirror descent on a different objective. The main result here is a characterization of this phenomenon under a notion termed commuting parametriza...
Preprint
Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives mathematical analysis and supportin...
Preprint
Full-text available
Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there wer...
Preprint
Full-text available
Deep learning experiments in Cohen et al. (2021) using deterministic Gradient Descent (GD) revealed an {\em Edge of Stability (EoS)} phase when learning rate (LR) and sharpness (\emph{i.e.}, the largest eigenvalue of Hessian) no longer behave as in traditional optimization. Sharpness stabilizes around $2/$LR and loss goes up and down across iterati...
Preprint
Contrastive learning is a popular form of self-supervised learning that encourages augmentations (views) of the same input to have more similar representations compared to augmentations of different inputs. Recent attempts to theoretically explain the success of contrastive learning on downstream classification tasks prove guarantees depending on p...
Preprint
Full-text available
Gradient inversion attack (or input recovery from gradient) is an emerging threat to the security and privacy preservation of Federated learning, whereby malicious eavesdroppers or participants in the protocol can recover (partially) the clients' private data. This paper evaluates existing attacks and defenses. We find that some attacks make strong...
Preprint
The generalization mystery of overparametrized deep nets has motivated efforts to understand how gradient descent (GD) converges to low-loss solutions that generalize well. Real-life neural networks are initialized from small random values and trained with cross-entropy loss for classification (unlike the "lazy" or "NTK" regime of training where an...
Preprint
Full-text available
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function $L$ can form a manifold. Intuitively, with a sufficiently small learning rate $\eta$, SGD tracks Gradient Descent (GD) until it gets close to such...
Preprint
Full-text available
It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Ito Stochastic Differential Equations (SDEs). But formal justification for this approximation (e.g., (Li et al., 2019a)) only ap...
Preprint
Convolutional neural networks often dominate fully-connected counterparts in generalization performance, especially on image classification tasks. This is often explained in terms of 'better inductive bias'. However, this has not been made mathematically rigorous, and the hurdle is that the fully connected net can always simulate the convolutional...
Preprint
Full-text available
An unsolved challenge in distributed or federated learning is to effectively mitigate privacy risks without slowing down training or reducing accuracy. In this paper, we propose TextHide aiming at addressing this challenge for natural language understanding tasks. It requires all participants to add a simple encryption step to prevent an eavesdropp...
Preprint
Autoregressive language models pretrained on large corpora have been successful at solving downstream tasks, even with zero-shot usage. However, there is little theoretical justification for their success. This paper considers the following questions: (1) Why should learning the distribution of natural language help with downstream classification t...
Preprint
Recent works (e.g., (Li and Arora, 2020)) suggest that the use of popular normalization schemes (including Batch Normalization) in today's deep learning can move it far from a traditional optimization viewpoint, e.g., use of exponentially increasing learning rates. The current paper highlights other ways in which behavior of normalized nets departs...
Preprint
One popular trend in meta-learning is to learn from many training tasks a common initialization for a gradient-based method that can be used to solve a new task with few samples. The theory of meta-learning is still in its early stages, with several recent learning-theoretic analyses of methods such as Reptile [Nichol et al., 2018] being for convex...
Preprint
A common strategy in modern learning systems is to learn a representation that is useful for many tasks, a.k.a. representation learning. We study this strategy in the imitation learning setting for Markov decision processes (MDPs) where multiple experts' trajectories are available. We formulate representation learning as a bi-level optimization pro...
Preprint
Adversarial training is a popular method to give neural nets robustness against adversarial perturbations. In practice adversarial training leads to low robust training loss. However, a rigorous explanation for why this happens under natural conditions is still missing. Recently a convergence theory for standard (non-adversarial) supervised trainin...
Preprint
Recent research shows that for training with $\ell_2$ loss, convolutional neural networks (CNNs) whose width (number of channels in convolutional layers) goes to infinity correspond to regression with respect to the CNN Gaussian Process kernel (CNN-GP) if only the last layer is trained, and correspond to regression with respect to the Convolutional...
Preprint
Intriguing empirical evidence exists that deep learning can work well with exoticschedules for varying the learning rate. This paper suggests that the phenomenonmay be due to Batch Normalization or BN(Ioffe & Szegedy, 2015), which is ubiquitous and provides benefits in optimization and generalization across all standardarchitectures. The following...
Preprint
Recent research shows that the following two models are equivalent: (a) infinitely wide neural networks (NNs) trained under l2 loss by gradient descent with infinitesimally small learning rate (b) kernel regression with respect to so-called Neural Tangent Kernels (NTKs) (Jacot et al., 2018). An efficient algorithm to compute the NTK, as well as its...
Preprint
Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima---at least those discovered by gradient-based optimization---turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piece-wise linear, with as few as two segments. We give mathematical expla...
Preprint
Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low "complexity." We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing --- a model referr...
Preprint
There is great interest in *saliency methods* (also called *attribution methods*), which give "explanations" for a deep net's decision, by assigning a *score* to each feature/pixel in the input. Their design usually involves credit-assignment via the gradient of the output with respect to input. Recently Adebayo et al. [arXiv:1810.03292] questioned...
Preprint
How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its "width" --- namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers --- is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoreti...
Preprint
Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the well-known word2vec embedding algorithm: leveraging availability of pairs of semantically "similar" data points and "negative samples," the learner...
Preprint
Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: (i) Using a tighter characterization of training speed...
Preprint
Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability t...
Preprint
We analyze speed of convergence to global optimum for gradient descent training a deep linear neural network (parameterized as $x\mapsto W_N \cdots W_1x$) by minimizing the $\ell_2$ loss over whitened data. Convergence at a linear rate is guaranteed when the following hold: (i) dimensions of hidden layers are at least the minimum of the input and o...
Preprint
Motivations like domain adaptation, transfer learning, and feature learning have fueled interest in inducing embeddings for rare or unseen words, n-grams, synsets, and other textual features. This paper introduces a la carte embedding, a simple and general alternative to the usual word2vec-based approaches for building such representations that is...
Article
A first line of attack in exploratory data analysis is data visualization, i.e., generating a 2-dimensional representation of data that makes clusters of similar points visually identifiable. Standard Johnson-Lindenstrauss dimensionality reduction does not produce data visualizations. The t-SNE heuristic of van der Maaten and Hinton, which is based...
Article
Full-text available
Conventional wisdom in deep learning states that increasing depth improves expressiveness but complicates optimization. This paper suggests that, sometimes, increasing depth can speed up optimization. The effect of depth on optimization is decoupled from expressiveness by focusing on settings where additional layers amount to overparameterization -...
Article
Deep nets generalize well despite having more parameters than the number of training samples. Recent works try to give an explanation using PAC-Bayes and Margin-based analyses, but do not as yet result in sample complexity bounds better than naive parameter counting. The current paper shows generalization bounds that're orders of magnitude better i...
Article
Encoder-decoder GANs architectures (e.g., BiGAN and ALI) seek to add an inference mechanism to the GANs setup, consisting of a small encoder deep net that maps data-points to their succinct encodings. The intuition is that being forced to train an encoder alongside the usual generator forces the system to learn meaningful mappings from the code to...
Conference Paper
Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Learning the model ---that is, the mapping from hidden variables to visible ones and vice versa---is NP-hard even in very sim...
Article
There is general consensus that learning representations is useful for a variety of reasons, e.g. efficient use of labeled data (semi-supervised learning), transfer learning and understanding hidden structure of data. Popular techniques for representation learning include clustering, manifold learning, kernel-learning, autoencoders, Boltzmann machi...
Article
Full-text available
Several research groups have shown how to map fMRI responses to the meanings of presented stimuli. This paper presents new methods for doing so when only a natural language annotation is available as the description of the stimulus. We study fMRI data gathered from subjects watching an episode of BBCs Sherlock (Chen et al., 2017), and learn bidirec...
Article
This work presents an unsupervised approach for improving WordNet that builds upon recent advances in document and sense representation via distributional semantics. We apply our methods to construct Wordnets in French and Russian, languages which both lack good manual constructions.1 These are evaluated on two new 600-word test sets for word-to-sy...
Article
This paper makes progress on several open theoretical issues related to Generative Adversarial Networks. A definition is provided for what it means for the training to generalize, and it is shown that generalization is not guaranteed for the popular distances between distributions such as Jensen-Shannon or Wasserstein. We introduce a new metric cal...
Article
Deep neural nets have caused a revolution in many classification tasks. A related ongoing revolution---also theoretically not understood---concerns their ability to serve as generative models for complicated types of data such as images and texts. These models are trained using ideas like variational autoencoders and Generative Adversarial Networks...
Article
Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Finding parameters with the maximum likelihood is NP-hard even in very simple settings. In recent years, provably efficient a...
Article
Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods. Many use nonlinear operations on co-occurrence statistics, and have hand-tuned hyperparameters and reweighting methods. This paper proposes a new generative model, a dynamic version of the log-linear topic model of Mnih and Hinton (2007). The...
Article
Full-text available
This work provides support for the notion that distributional methods of representing word meaning from computational linguistics are useful for capturing neural correlates of real life multi-sensory stimuli, where the stimuli ---in this case, a movie being watched by the human subjects--- have been given text annotations. We present an approach to...
Article
Recently, there has been considerable progress on designing algorithms with provable guarantees -- typically using linear algebraic methods -- for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be more challenging. Here we take a first step towards provable inference in topic models. We l...
Article
Semidefinite programs (SDPs) have been used in many recent approximation algorithms. We develop a general primal-dual approach to solve SDPs using a generalization of the well-known multiplicative weights update rule to symmetric matrices. For a number of problems, such as SPARSEST CUT and BALANCED SEPARATOR in undirected and directed weighted grap...
Article
Full-text available
Word embeddings are ubiquitous in NLP and information retrieval, but it's unclear what they represent when the word is polysemous, i.e., has multiple senses. Here it is shown that multiple word senses reside in linear superposition within the word embedding and can be recovered by simple sparse coding. The success of the method ---which applies to...
Article
In the nonnegative matrix factorization (NMF) problem we are given an $n \times m$ nonnegative matrix $M$ and an integer $r > 0$. Our goal is to express $M$ as $A W$, where $A$ and $W$ are nonnegative matrices of size $n \times r$ and $r \times m$, respectively. In some applications, it makes sense to ask instead for the product $AW$ to approximate...
Article
Full-text available
Generative model approaches to deep learning are of interest in the quest for both better understanding as well as training methods requiring fewer labeled samples. Recent works use generative model approaches to produce the deep net's input given the value of a hidden layer several levels above. However, there is no accompanying "proof of correctn...
Article
Sparse coding is a basic task in many fields including signal processing, neuroscience and machine learning where the goal is to learn a basis that enables a sparse representation of a given set of data, if one exists. Its standard formulation is as a non-convex optimization problem which is solved in practice by heuristics based on alternating min...
Article
Full-text available
The papers of Mikolov et al. 2013 as well as subsequent works have led to dramatic progress in solving word analogy tasks using semantic word embeddings. This leverages linear structure that is often found in the word embeddings, which is surprising since the training method is usually nonlinear. There were attempts ---notably by Levy and Goldberg...
Conference Paper
We give algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others. Our generative model is an $n$ node multilayer neural net that has degree at most $n^backslashgamma$ for some $backslashgamma textless1$ and each edge has a random edge weight in $[-1,1]$. Our algorithm learns $...
Article
In dictionary learning, also known as sparse coding, the algorithm is given samples of the form $y = Ax$ where $x\in \mathbb{R}^m$ is an unknown random sparse vector and $A$ is an unknown dictionary matrix in $\mathbb{R}^{n\times m}$ (usually $m > n$, which is the overcomplete case). The goal is to learn $A$ and $x$. This problem has been studied i...
Article
We give algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others. Our generative model is an $n$ node multilayer neural net that has degree at most $n^{\gamma}$ for some $\gamma <1$ and each edge has a random edge weight in $[-1,1]$. Our algorithm learns {\em almost all} netwo...
Article
A matrix $A \in \R^{n \times m}$ is said to be $\mu$-incoherent if each pair of columns has inner product at most $\mu / \sqrt{n}$. Starting with the pioneering work of Donoho and Huo such matrices (often called {\em dictionaries}) have played a central role in signal processing, statistics and machine learning. They allow {\em sparse recovery}: th...
Article
We give a new $(1+\epsilon)$-approximation for sparsest cut problem on graphs where small sets expand significantly more than the sparsest cut (sets of size $n/r$ expand by a factor $\sqrt{\log n\log r}$ bigger, for some small $r$; this condition holds for many natural graph families). We give two different algorithms. One involves Guruswami-Sinop...
Article
Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model inference have been based on a maximum likelihood objective. Efficient algorithms exist that approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced...
Article
Full-text available
Suppose we are given an oracle that claims to approximate the permanent for most matrices X, where X is chosen from the Gaussian ensemble (the matrix entries are i.i.d. univariate complex Gaussians). Can we test that the oracle satisfies this claim? This paper gives a polynomial-time algorithm for the task. The oracle-testing problem is of interest...
Article
We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form $y = Ax + \eta$ where $A$ is an unknown $n \times n$ matrix and $x$ is a random variable whose components are independent and have a fourth moment strictly less than that of a standar...
Article
Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby docu...
Article
Full-text available
Algorithms in varied fields use the idea of maintaining a distribution over a certain set and use the multiplicative update rule to iteratively change these weights. Their analyses are usually very similar and rely on an exponential potential function. In this survey we present a simple meta-algorithm that unifies many of these disparate algorithm...
Article
Linear programming (LP) decoding for low-density parity-check codes (and related domains such as compressed sensing) has received increased attention over recent years because of its practical performancecoming close to that of iterative decoding algorithmsand its amenability to finite-blocklength analysis. Several works starting with the work of F...
Article
A community in a social network is usually understood to be a group of nodes more densely connected with each other than with the rest of the network. This is an important concept in most domains where networks arise: social, technological, biological, etc. For many years algorithms for finding communities implicitly assumed communities are nonover...
Article
Full-text available
The Nonnegative Matrix Factorization (NMF) problem has a rich history spanning quantum mechanics, probability theory, data analysis, polyhedral combinatorics, communication complexity, demography, chemometrics, etc. In the past decade NMF has become enormously popular in machine learning, where the factorization is computed using a variety of local...
Chapter
How to color 3 colorable graphs with few colors is a problem of longstanding interest. The best polynomial-time algorithm uses n 0.2072 colors. There are no indications that coloring using say O(logn) colors is hard. It has been suggested that SDP hierarchies could be used to design algorithms that use n ε colors for arbitrarily small ε > 0. We e...
Conference Paper
We give new algorithms for a variety of randomly-generated instances of computational problems using a linearization technique that reduces to solving a system of linear equations. These algorithms are derived in the context of learning with structured noise, a notion introduced in this paper. This notion is best illustrated with the learning pari...
Article
This paper introduces notions from computational complexity into the study of financial derivatives. Tradi- tional economics argues that derivatives, like CDOs and CDSs, ameliorate the negative costs imposed due to asymmetric information between buyers and sellers. This is because securitization via these derivatives allows the informed party to fi...
Article
In a well-known paper[ARV], Arora, Rao and Vazirani obtained an O(sqrt(log n)) approximation to the Balanced Separator problem and Uniform Sparsest Cut. At the heart of their result is a geometric statement about sets of points that satisfy triangle inequalities, which also underlies subsequent work on approximation algorithms and geometric embeddi...
Conference Paper
Subexponential time approximation algorithms are presented for the Unique Games and Small-Set Expansion problems. Specifically, for some absolute constant c, the following two algorithms are presented. (1) An exp(kn&epsi;)-time algorithm that, given as input a k-alphabet unique game on n variables that has an assignment satisfying 1-&epsi;c fractio...
Conference Paper
Computing approximate solutions for NP-hard problems is an important research endeavor. Since the work of Goemans-Williamson in 1993, semidefinite programming (a form of convex programming in which the variables are vector inner products) has been used to design the current best approximation algorithms for problems such as MAX-CUT, MAX-3SAT, SPARS...
Article
This paper shows how to compute O(logn)-approximations to the sparsest cut and balanced separator problems in O ˜(n 2 ) time, thus improving upon the recent algorithm of S. Arora, S. Rao and U. Vazirani [Proceedings of the 36th annual ACM symposium on theory of computing (STOC 2004), 222–231 (2004; Zbl 1192.68467)]. Their algorithm uses semidefinit...
Conference Paper
We propose the study of graphs that are defined by low-complexity distributed and deterministic agents. We suggest that this viewpoint may help introduce the element of individual choice in models of large scale social networks. This viewpoint may also provide interesting new classes of graphs for which to design algorithms. We focus largely on th...
Book
This beginning graduate textbook describes both recent achievements and classical results of computational complexity theory. Requiring essentially no background apart from mathematical maturity, the book can be used as a reference for self-study for anyone interested in complexity, including physicists, mathematicians, and other scientists, as wel...
Conference Paper
Full-text available
Linear programming decoding for low-density parity check codes (and related domains such as compressed sensing) has received increased attention over recent years because of its practical performance |coming close to that of iterative de- coding algorithms| and its amenability to nite-blocklength analysis. Several works starting with the work of Fe...
Article
Graph partitioning is a computational problems to divide the vertices of a graph into two large pieces with minimum number of the edges. The application of partitioning can be used for computer vision, data analysis, image segmentation, and image analysis. The geometric approach of partitioning start with drawing the graph in a geometric space by k...
Conference Paper
We present an efficient algorithm to find a good solution to the Unique Games problem when the constraint graph is an expander. We introduce a new analysis of the standard SDP in this case that involves correlations among distant vertices. It also leads to a parallel repetition theorem for unique games when the graph is an expander.
Article
We present an efficient algorithm to find a good solution to the Unique Games problem when the constraint graph is an expander. We introduce a new analysis of the standard SDP in this case that involves correlations among distant vertices. It also leads to a parallel repetition theorem for unique games when the graph is an expander.
Article
We show that every n-point metric of negative type (in particular, every n-point subset of L 1) admits a Fréchet embedding into Euclidean space with distortion \(O(\sqrt{\log n}\cdot \log \log n)\) , a result which is tight up to the O(log log n) factor, even for Euclidean metrics. This strengthens our recent work on the Euclidean distortion of met...
Conference Paper
Semidefinite programs (SDP) have been used in many recent approximation algorithms. We develop a general primal-dual approach to solve SDPs using a generalization of the well-known multiplicative weights update rule to symmetric matrices. For a number of problems, such as Sparsest Cut and Balanced Separator in undirected and directed weighted graph...
Article
Full-text available
For any ε>0 we give a (2+ε)-approximation algorithm for the problem of finding a minimum tree spanning any k vertices in a graph (k-MST), improving a 3-approximation algorithm by N. Garg [A 3-approximation for the minimum tree spanning k vertices. in: Proceedings of the 37 IEEE Symp. on Foundations of Computer Science (FOCS), 302–309 (1996)]. As in...
Chapter
In the Euclidean traveling salesman problem, we are given n nodes in ℝ2 (more generally, in ℝd and desire the minimum cost salesman tour for these nodes, where the cost of the edge between nodes (x1,y1) and (x2,y2) is \( \sqrt {(x_1 - x_2 )^2 + (y_1 - y_2 )^2 } \) The decision version of the problem (“Does a tour of cost ≤ C exist?”) is NP-hard [65...
Conference Paper
Full-text available
Motivated by applications in combinatorial optimization, we initiate a study of the extent to which the global properties of a metric space (especially, embeddability in l(1) with low distortion) are determined by the properties of small subspaces. We note connections to similar issues studied already in Ramsey theory, complexity theory (especially...
Article
Full-text available
Proving integrality gaps for linear relaxations of NP optimization problems is a difficult task and usually undertaken on a case-by-case basis. We initiate a more systematic approach. We prove an integrality gap of $2 -o(1)$ for three families of linear relaxations for VERTEX COVER, and our methods seem relevant to other problems as well.
Article
Full-text available
Conference Paper
Full-text available
We describe a simple random-sampling based procedure for producing sparse matrix approximations. Our procedure and analysis are extremely simple: the analysis uses nothing more than the Chernoff-Hoeffding bounds. Despite the simplicity, the approximation is comparable and sometimes better than previous work. Our algorithm computes the sparse matri...
Conference Paper
We describe how to color every 3-colorable graph with O(n0.2111) colors, thus improving an algorithm of Blum and Karger from almost a decade ago. Our analysis uses new geometric ideas inspired by the recent work of Arora, Rao, and Vazirani on SPARSEST CUT, and these ideas show promise of leading to further improvements.