Léon Bottou

Microsoft, Washington, United States

Publications (145)

  • Léon Bottou · Frank E. Curtis · Jorge Nocedal
    ABSTRACT: This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques that diminish noise in the stochastic directions and methods that make use of second-order derivative approximations.
    Article · Jun 2016
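The basic SG iteration at the heart of this survey can be sketched in a few lines. The quadratic toy objective, step size, and epoch count below are illustrative assumptions for the sketch, not details taken from the paper.

```python
import random

def sgd(grad, w, samples, lr=0.1, epochs=50, seed=0):
    """Plain stochastic gradient: one sampled gradient per step."""
    rng = random.Random(seed)
    for _ in range(epochs):
        # Visit the samples in a fresh random order each epoch.
        for x in rng.sample(samples, len(samples)):
            w = w - lr * grad(w, x)
    return w

# Toy least-squares problem: minimize the average of (w - x)^2,
# whose minimizer is the sample mean (here 2.5).
data = [1.0, 2.0, 3.0, 4.0]
grad = lambda w, x: 2.0 * (w - x)
w_star = sgd(grad, 0.0, data)
```

With a constant step size the iterates do not converge exactly but settle in a small neighborhood of the sample mean, which is the noise-versus-progress trade-off the survey analyzes.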
  • David Lopez-Paz · Robert Nishihara · Soumith Chintala · [...] · Léon Bottou
    ABSTRACT: The purpose of this paper is to point out and assay observable causal signals within collections of static images. We achieve this goal in two steps. First, we take a learning approach to observational causal inference, and build a classifier that achieves state-of-the-art performance on finding the causal direction between pairs of random variables, when given samples from their joint distribution. Second, we use our causal direction finder to effectively distinguish between features of objects and features of their contexts in collections of static images. Our experiments demonstrate the existence of (1) a relation between the direction of causality and the difference between objects and their contexts, and (2) observable causal signals in collections of static images.
    Article · May 2016
  • ABSTRACT: Distillation (Hinton et al., 2015) and privileged information (Vapnik & Izmailov, 2015) are two techniques that enable machines to learn from other machines. This paper unifies these two techniques into generalized distillation, a framework to learn from multiple machines and data representations. We provide theoretical and causal insight about the inner workings of generalized distillation, extend it to unsupervised, semisupervised and multitask learning scenarios, and illustrate its efficacy on a variety of numerical simulations on both synthetic and real-world data.
    Article · Nov 2015
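The distillation half of this framework trains a student on a mixture of the hard-label loss and a cross-entropy against the teacher's temperature-softened outputs. The sketch below illustrates that loss in miniature; the temperature and mixing weight are hypothetical defaults, not values from the paper.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax used in distillation."""
    z = [l / T for l in logits]
    m = max(z)  # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=2.0, lam=0.5):
    """Mix of hard-label cross-entropy and cross-entropy against
    the teacher's tempered (softened) predictions."""
    p_soft = softmax(teacher_logits, T)
    q_soft = softmax(student_logits, T)
    q_hard = softmax(student_logits, 1.0)
    soft_ce = -sum(p * math.log(q) for p, q in zip(p_soft, q_soft))
    hard_ce = -math.log(q_hard[hard_label])
    return (1 - lam) * hard_ce + lam * soft_ce
```

A student whose logits agree with the teacher's incurs a lower loss than one that contradicts both the teacher and the hard label, which is what drives the imitation.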
  • Robert Nishihara · David Lopez-Paz · Léon Bottou
    ABSTRACT: Algorithms for hyperparameter optimization abound, all of which work well under different and often unverifiable assumptions. Motivated by the general challenge of sequentially choosing which algorithm to use, we study the more specific task of choosing among distributions to use for random hyperparameter optimization. This work is naturally framed in the extreme bandit setting, which deals with sequentially choosing which distribution from a collection to sample in order to minimize (maximize) the single best cost (reward). Whereas the distributions in the standard bandit setting are primarily characterized by their means, a number of subtleties arise when we care about the minimal cost as opposed to the average cost. For example, there may not be a well-defined "best" distribution as there is in the standard bandit setting. The best distribution depends on the rewards that have been obtained and on the remaining time horizon. Whereas in the standard bandit setting, it is sensible to compare policies with an oracle which plays the single best arm, in the extreme bandit setting, there are multiple sensible oracle models. We define a sensible notion of regret in the extreme bandit setting, which turns out to be more subtle than in the standard bandit setting. We then prove that no policy can asymptotically achieve no regret. Furthermore, we show that in the worst case, no policy can be guaranteed to perform better than the policy of choosing each distribution equally often.
    Article · Aug 2015
  • Maxime Oquab · Leon Bottou · Ivan Laptev · Josef Sivic
    Conference Paper · Jun 2015
  • Léon Bottou
    ABSTRACT: This chapter shows how returning to the combinatorial nature of the Vapnik–Chervonenkis bounds provides simple ways to increase their accuracy, take into account properties of the data and of the learning algorithm, and provide empirically accurate estimates of the deviation between training error and test error.
    Chapter · Jan 2015
  • Alekh Agarwal · Leon Bottou
    ABSTRACT: This paper presents a lower bound for optimizing a finite sum of $n$ functions, where each function is $L$-smooth and the sum is $\mu$-strongly convex. We show that no algorithm can reach an error $\epsilon$ in minimizing all functions from this class in fewer than $\Omega(n + \sqrt{n(\kappa-1)}\log(1/\epsilon))$ iterations, where $\kappa=L/\mu$ is a surrogate condition number. We then compare this lower bound to upper bounds for recently developed methods specializing to this setting. When the functions involved in this sum are not arbitrary, but based on i.i.d. random data, then we further contrast these complexity results with those for optimal first-order methods to directly optimize the sum. The conclusion we draw is that a lot of caution is necessary for an accurate comparison. In interest of completeness, we also provide a self-contained proof of the classical result on optimizing smooth and strongly convex functions under a first-order oracle.
    Article · Oct 2014
  • ABSTRACT: Quick interaction between a human teacher and a learning machine presents numerous benefits and challenges when working with web-scale data. The human teacher guides the machine towards accomplishing the task of interest. The learning machine leverages big data to find examples that maximize the training value of its interaction with the teacher. When the teacher is restricted to labeling examples selected by the machine, this problem is an instance of active learning. When the teacher can provide additional information to the machine (e.g., suggestions on what examples or predictive features should be used) as the learning task progresses, then the problem becomes one of interactive learning. To accommodate the two-way communication channel needed for efficient interactive learning, the teacher and the machine need an environment that supports an interaction language. The machine can access, process, and summarize more examples than the teacher can see in a lifetime. Based on the machine's output, the teacher can revise the definition of the task or make it more precise. Both the teacher and the machine continuously learn and benefit from the interaction. We have built a platform to (1) produce valuable and deployable models and (2) support research on both the machine learning and user interface challenges of the interactive learning problem. The platform relies on a dedicated, low-latency, distributed, in-memory architecture that allows us to construct web-scale learning machines with quick interaction speed. The purpose of this paper is to describe this architecture and demonstrate how it supports our research efforts. Preliminary results are presented as illustrations of the architecture but are not the primary focus of the paper.
    Full-text Article · Sep 2014
  • Maxime Oquab · Léon Bottou · Ivan Laptev · Josef Sivic
    ABSTRACT: Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large-scale visual recognition challenge (ILSVRC2012). The success of CNNs is attributed to their ability to learn rich mid-level image representations as opposed to hand-designed low-level features used in other image classification methods. Learning CNNs, however, amounts to estimating millions of parameters and requires a very large number of annotated image samples. This property currently prevents application of CNNs to problems with limited training data. In this work we show how image representations learned with CNNs on large-scale annotated datasets can be efficiently transferred to other visual recognition tasks with limited amount of training data. We design a method to reuse layers trained on the ImageNet dataset to compute mid-level image representation for images in the PASCAL VOC dataset. We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on Pascal VOC 2007 and 2012 datasets. We also show promising results for object and action localization.
    Article · Jun 2014
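The transfer recipe, reusing frozen pretrained layers and training only a new classifier on top, can be sketched in miniature. Here a fixed nonlinear map stands in for the transferred CNN layers and a perceptron-trained linear head stands in for the adapted classifier; both are illustrative stand-ins, not the paper's architecture.

```python
def pretrained_features(x):
    """Stand-in for frozen mid-level features: a fixed nonlinear map
    (hypothetical; in the paper these would be CNN activations)."""
    return [x[0] * x[1], x[0] + x[1], 1.0]

def predict(w, x):
    """Linear head applied on top of the frozen features."""
    f = pretrained_features(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) > 0 else -1

def train_head(data, lr=0.5, epochs=100):
    """Train only the new linear layer; the feature map never changes."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            if predict(w, x) != y:  # perceptron update on the head only
                f = pretrained_features(x)
                w = [wi + lr * y * fi for wi, fi in zip(w, f)]
    return w

# XOR-like task: linearly inseparable in the raw inputs,
# separable in the fixed feature space.
data = [((-1.0, -1.0), -1), ((1.0, 1.0), -1),
        ((-1.0, 1.0), 1), ((1.0, -1.0), 1)]
w_head = train_head(data)
```

The point of the sketch is that a good fixed representation makes the downstream task easy for a small model trained on little data.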
  • Antoine Bordes · Léon Bottou · Ronan Collobert · [...] · Luke Zettlemoyer
    ABSTRACT: The 2014 Special Issue of Machine Learning discusses several papers on learning semantics. The first paper of the special issue, 'From Machine Learning to Machine Reasoning' by Léon Bottou is an essay which attempts to bridge trainable systems, like neural networks, and sophisticated 'all-purpose' inference mechanisms, such as logical or probabilistic inference. The paper 'Learning Perceptually Grounded Word Meanings from Unaligned Parallel Data' by Stefanie Tellex, Pratiksha Thaker, Joshua Joseph and Nicholas Roy describes an approach to map natural language commands to actions for a forklift control task. The paper 'Interactive Relational Reinforcement Learning of Concept Semantics' by Matthias Nickles and Achim Rettinger presents a Relational Reinforcement Learning (RRL) approach for learning denotational concept semantics using symbolic interaction of artificial agents with human users.
    Full-text Article · Feb 2014 · Machine Learning
  • Douwe Kiela · Léon Bottou
    Conference Paper · Jan 2014
  • ABSTRACT: This paper proposes a novel parallel stochastic gradient descent (SGD) method that is obtained by applying parallel sets of SGD iterations (each set operating on one node using the data residing in it) for finding the direction in each iteration of a batch descent method. The method has strong convergence properties. Experiments on datasets with high dimensional feature spaces show the value of this method.
    Full-text Article · Nov 2013
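The idea, each node running SGD locally on its own data shard and the local results being combined into a single descent direction for an outer batch iteration, can be sketched as follows. The averaging rule, toy objective, and step counts are illustrative assumptions rather than the paper's exact algorithm.

```python
def local_sgd_direction(w, shard, grad, lr=0.1, steps=20):
    """Run SGD on one node's shard and return the displacement."""
    v = w
    for _ in range(steps):
        for x in shard:
            v = v - lr * grad(v, x)
    return v - w

def parallel_sgd(w, shards, grad, outer_iters=10):
    """Outer loop: average the per-node SGD displacements and
    use the average as the descent direction."""
    for _ in range(outer_iters):
        dirs = [local_sgd_direction(w, s, grad) for s in shards]
        w = w + sum(dirs) / len(dirs)
    return w

# Toy 1-D least squares: the global minimizer is the overall mean (2.5),
# even though each node only ever sees its own shard.
grad = lambda w, x: 2.0 * (w - x)
shards = [[1.0, 2.0], [3.0, 4.0]]
w_final = parallel_sgd(0.0, shards, grad)
```

Each node's local run drifts toward its own shard's mean; averaging the displacements steers the shared iterate toward the global solution with only one round of communication per outer iteration.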
  • Léon Bottou · Jonas Peters · Joaquin Quiñonero-Candela · [...] · Ed Snelson
    ABSTRACT: This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine. © 2013 Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard and Ed Snelson.
    Article · Nov 2013 · Journal of Machine Learning Research
  • ABSTRACT: This paper gives a novel approach to the distributed training of linear classifiers. At each iteration, the nodes minimize approximate objective functions and combine the resulting minimizers to form a descent direction to move. The method is shown to have $O(\log(1/\epsilon))$ time convergence. The method can be viewed as an iterative parameter mixing method. A special instantiation yields a parallel stochastic gradient descent method with strong convergence. When communication times between nodes are large, our method is much faster than the SQM method, which uses distributed computation only for function and gradient calls.
    Full-text Article · Oct 2013
  • ABSTRACT: Training examples are not all equally informative. Active learning strategies leverage this observation in order to massively reduce the number of examples that need to be labeled. We leverage the same observation to build a generic strategy for parallelizing learning algorithms. This strategy is effective because the search for informative examples is highly parallelizable and because we show that its performance does not deteriorate when the sifting process relies on a slightly outdated model. Parallel active learning is particularly attractive to train non-linear models with non-linear representations because there are few practical parallel learning algorithms for such models. We report preliminary experiments using both kernel SVMs and SGD-trained neural networks.
    Article · Oct 2013
  • Léon Bottou
    ABSTRACT: This short contribution presents the first paper in which Vapnik and Chervonenkis describe the foundations of Statistical Learning Theory (Vapnik, Chervonenkis (1968) Proc USSR Acad Sci 181(4): 781–783).
    Chapter · Jan 2013
  • Léon Bottou · Jonas Peters · Joaquin Quiñonero-Candela · [...] · Ed Snelson
    ABSTRACT: This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select changes that improve both the short-term and long-term performance of such systems. This work is illustrated by experiments carried out on the ad placement system associated with the Bing search engine.
    Full-text Article · Sep 2012
  • Léon Bottou
    ABSTRACT: Chapter 1 strongly advocates the stochastic back-propagation method to train neural networks. This is in fact an instance of a more general technique called stochastic gradient descent (SGD). This chapter provides background material, explains why SGD is a good learning algorithm when the training set is large, and provides useful recommendations.
    Chapter · Jan 2012
  • ABSTRACT: We describe and evaluate two algorithms for the Neyman-Pearson (NP) classification problem, which has recently been shown to be of particular importance for bipartite ranking problems. NP classification is a nonconvex problem involving a constraint on the false negative rate. We investigate a batch algorithm based on DC programming and a stochastic gradient method well suited for large-scale datasets. Empirical evidence illustrates the potential of the proposed methods.
    Article · Apr 2011 · ACM Transactions on Intelligent Systems and Technology
  • Ronan Collobert · Jason Weston · Leon Bottou · [...] · Pavel Kuksa
    ABSTRACT: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
    Full-text Article · Mar 2011 · Journal of Machine Learning Research

Publication Stats

12k Citations


  • 2014
    • Microsoft
Washington, United States
  • 2003-2006
    • NEC Laboratories America
      Princeton, New Jersey, United States
  • 1992-2001
    • AT&T Labs Research
      Austin, Texas, United States
  • 1998
    • Ecole Normale Supérieure de Paris
      Paris, Île-de-France, France