Léon Bottou

Microsoft, United States

Publications (139) · 66.36 Total impact

  • ABSTRACT: Distillation (Hinton et al., 2015) and privileged information (Vapnik & Izmailov, 2015) are two techniques that enable machines to learn from other machines. This paper unifies these two techniques into generalized distillation, a framework to learn from multiple machines and data representations. We provide theoretical and causal insight about the inner workings of generalized distillation, extend it to unsupervised, semi-supervised and multitask learning scenarios, and illustrate its efficacy on a variety of numerical simulations on both synthetic and real-world data.
    No preview · Article · Nov 2015
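    A minimal sketch of the distillation objective underlying the entry above, assuming a teacher already trained on a privileged or richer representation; the array names, the temperature T, and the imitation weight lam are illustrative choices in the spirit of Hinton et al. (2015), not code from the paper.

    ```python
    import numpy as np

    def softmax(z, T=1.0):
        # temperature-scaled softmax over the last axis
        z = z / T
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def distillation_loss(student_logits, teacher_logits, y_onehot, T=2.0, lam=0.5):
        """Blend the hard-label loss with imitation of the teacher's softened predictions."""
        hard = -(y_onehot * np.log(softmax(student_logits) + 1e-12)).sum(axis=-1).mean()
        soft = -(softmax(teacher_logits, T) * np.log(softmax(student_logits, T) + 1e-12)).sum(axis=-1).mean()
        return (1.0 - lam) * hard + lam * soft
    ```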
  • Source
    Robert Nishihara · David Lopez-Paz · Léon Bottou
    ABSTRACT: Algorithms for hyperparameter optimization abound, all of which work well under different and often unverifiable assumptions. Motivated by the general challenge of sequentially choosing which algorithm to use, we study the more specific task of choosing among distributions to use for random hyperparameter optimization. This work is naturally framed in the extreme bandit setting, which deals with sequentially choosing which distribution from a collection to sample in order to minimize (maximize) the single best cost (reward). Whereas the distributions in the standard bandit setting are primarily characterized by their means, a number of subtleties arise when we care about the minimal cost as opposed to the average cost. For example, there may not be a well-defined "best" distribution as there is in the standard bandit setting. The best distribution depends on the rewards that have been obtained and on the remaining time horizon. Whereas in the standard bandit setting it is sensible to compare policies with an oracle which plays the single best arm, in the extreme bandit setting there are multiple sensible oracle models. We define a sensible notion of regret in the extreme bandit setting, which turns out to be more subtle than in the standard bandit setting. We then prove that no policy can asymptotically achieve no regret. Furthermore, we show that in the worst case, no policy can be guaranteed to perform better than the policy of choosing each distribution equally often.
    Preview · Article · Aug 2015
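    The last claim in the abstract singles out the policy that samples each candidate distribution equally often. A toy sketch of that uniform policy for random hyperparameter search is given below; `distributions` and `evaluate` are hypothetical placeholders, and the loop simply tracks the single best (minimal) cost, which is the extreme-bandit objective.

    ```python
    import random

    def uniform_extreme_bandit(distributions, evaluate, horizon):
        """Round-robin over candidate sampling distributions, keeping the single best cost.
        `distributions` is a list of zero-argument callables that each draw one hyperparameter
        setting; `evaluate` maps a setting to a scalar cost. Both are placeholder names."""
        best_cost, best_setting = float("inf"), None
        for t in range(horizon):
            k = t % len(distributions)            # each distribution is played equally often
            setting = distributions[k]()
            cost = evaluate(setting)
            if cost < best_cost:                  # extreme-bandit objective: the minimal cost seen
                best_cost, best_setting = cost, setting
        return best_setting, best_cost

    # toy usage: two hypothetical ways to sample a learning rate
    dists = [lambda: 10 ** random.uniform(-4, -1), lambda: random.uniform(0.0, 0.1)]
    print(uniform_extreme_bandit(dists, evaluate=lambda lr: (lr - 0.01) ** 2, horizon=20))
    ```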
  • Source
    Maxime Oquab · Léon Bottou · Ivan Laptev · Josef Sivic

    Preview · Article · Jun 2015
  • Source
    Alekh Agarwal · Leon Bottou
    ABSTRACT: This paper presents a lower bound for optimizing a finite sum of $n$ functions, where each function is $L$-smooth and the sum is $\mu$-strongly convex. We show that no algorithm can reach an error $\epsilon$ in minimizing all functions from this class in fewer than $\Omega(n + \sqrt{n(\kappa-1)}\log(1/\epsilon))$ iterations, where $\kappa=L/\mu$ is a surrogate condition number. We then compare this lower bound to upper bounds for recently developed methods specializing to this setting. When the functions involved in this sum are not arbitrary, but based on i.i.d. random data, then we further contrast these complexity results with those for optimal first-order methods to directly optimize the sum. The conclusion we draw is that a lot of caution is necessary for an accurate comparison. In interest of completeness, we also provide a self-contained proof of the classical result on optimizing smooth and strongly convex functions under a first-order oracle.
    Preview · Article · Oct 2014
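    For readability, the complexity statement from the abstract in display form, together with the commonly cited upper bound for incremental-gradient methods such as SAG or SVRG; the latter is quoted from the general literature as a point of comparison, not from this paper.

    \[
    \text{lower bound: } \Omega\!\left(n + \sqrt{n(\kappa-1)}\,\log\frac{1}{\epsilon}\right),
    \qquad \kappa = \frac{L}{\mu},
    \qquad\text{versus}\qquad
    O\!\left((n + \kappa)\,\log\frac{1}{\epsilon}\right)\ \text{for SAG/SVRG-type methods.}
    \]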
  • Source
    ABSTRACT: Quick interaction between a human teacher and a learning machine presents numerous benefits and challenges when working with web-scale data. The human teacher guides the machine towards accomplishing the task of interest. The learning machine leverages big data to find examples that maximize the training value of its interaction with the teacher. When the teacher is restricted to labeling examples selected by the machine, this problem is an instance of active learning. When the teacher can provide additional information to the machine (e.g., suggestions on what examples or predictive features should be used) as the learning task progresses, then the problem becomes one of interactive learning. To accommodate the two-way communication channel needed for efficient interactive learning, the teacher and the machine need an environment that supports an interaction language. The machine can access, process, and summarize more examples than the teacher can see in a lifetime. Based on the machine's output, the teacher can revise the definition of the task or make it more precise. Both the teacher and the machine continuously learn and benefit from the interaction. We have built a platform to (1) produce valuable and deployable models and (2) support research on both the machine learning and user interface challenges of the interactive learning problem. The platform relies on a dedicated, low-latency, distributed, in-memory architecture that allows us to construct web-scale learning machines with quick interaction speed. The purpose of this paper is to describe this architecture and demonstrate how it supports our research efforts. Preliminary results are presented as illustrations of the architecture but are not the primary focus of the paper.
    Full-text · Article · Sep 2014
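    A toy sketch of the teacher-machine loop described above, with the machine sifting a large pool for its least confident examples and the teacher labeling them; `model`, `teacher`, and the pool interface are hypothetical placeholders and do not reflect the platform's actual architecture or API.

    ```python
    def interactive_learning_loop(model, unlabeled_pool, teacher, rounds=10, batch=25):
        """Toy sketch of the interaction loop: the machine picks the examples it finds most
        valuable (here: lowest-confidence predictions) and the teacher labels them or revises
        the task. All object interfaces are hypothetical placeholders."""
        labeled = []
        for _ in range(rounds):
            # machine side: summarize the pool and pick the most informative items
            scored = sorted(unlabeled_pool, key=lambda x: model.confidence(x))
            queries = scored[:batch]
            # teacher side: provide labels (or, in interactive learning, new features / task revisions)
            labeled += [(x, teacher.label(x)) for x in queries]
            unlabeled_pool = [x for x in unlabeled_pool if x not in queries]
            model.fit(labeled)
        return model
    ```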
  • Source
    Maxime Oquab · Léon Bottou · Ivan Laptev · Josef Sivic
    ABSTRACT: Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large-scale visual recognition challenge (ILSVRC2012). The success of CNNs is attributed to their ability to learn rich mid-level image representations as opposed to the hand-designed low-level features used in other image classification methods. Learning CNNs, however, amounts to estimating millions of parameters and requires a very large number of annotated image samples. This property currently prevents application of CNNs to problems with limited training data. In this work we show how image representations learned with CNNs on large-scale annotated datasets can be efficiently transferred to other visual recognition tasks with a limited amount of training data. We design a method to reuse layers trained on the ImageNet dataset to compute mid-level image representations for images in the PASCAL VOC dataset. We show that despite differences in image statistics and tasks between the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on the PASCAL VOC 2007 and 2012 datasets. We also show promising results for object and action localization.
    Preview · Article · Jun 2014
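    A minimal PyTorch/torchvision sketch of the layer-reuse recipe (pretrained ImageNet backbone, new classification head for the 20 PASCAL VOC classes). The paper used its own implementation; the library calls below (torchvision >= 0.13 weights API assumed) only illustrate the transfer idea.

    ```python
    import torch
    import torch.nn as nn
    from torchvision import models

    # Load convolutional and fully-connected layers pretrained on ImageNet.
    backbone = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

    # Freeze the transferred layers: they act as a fixed mid-level representation.
    for p in backbone.parameters():
        p.requires_grad = False

    # Replace the final classifier layer with a new head for the 20 PASCAL VOC classes.
    num_features = backbone.classifier[-1].in_features
    backbone.classifier[-1] = nn.Linear(num_features, 20)

    # Only the new head is trained on the small target dataset.
    optimizer = torch.optim.SGD(backbone.classifier[-1].parameters(), lr=1e-3, momentum=0.9)
    ```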
  • Source
    ABSTRACT: The 2014 Special Issue of Machine Learning discusses several papers on learning semantics. The first paper of the special issue, 'From Machine Learning to Machine Reasoning' by Léon Bottou is an essay which attempts to bridge trainable systems, like neural networks, and sophisticated 'all-purpose' inference mechanisms, such as logical or probabilistic inference. The paper 'Learning Perceptually Grounded Word Meanings from Unaligned Parallel Data' by Stefanie Tellex, Pratiksha Thaker, Joshua Joseph and Nicholas Roy describes an approach to map natural language commands to actions for a forklift control task. The paper 'Interactive Relational Reinforcement Learning of Concept Semantics' by Matthias Nickles and Achim Rettinger presents a Relational Reinforcement Learning (RRL) approach for learning denotational concept semantics using symbolic interaction of artificial agents with human users.
    Full-text · Article · Feb 2014 · Machine Learning
  • Source
    ABSTRACT: This paper proposes a novel parallel stochastic gradient descent (SGD) method that is obtained by applying parallel sets of SGD iterations (each set operating on one node using the data residing in it) for finding the direction in each iteration of a batch descent method. The method has strong convergence properties. Experiments on datasets with high dimensional feature spaces show the value of this method.
    Full-text · Article · Nov 2013
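    A rough sketch of the idea stated in the abstract: each node runs a few SGD steps on its local shard to propose a move, and the averaged proposal is used as the direction of a batch descent step. Function names, the step-size rule, and the descent test are illustrative simplifications, not the paper's algorithm.

    ```python
    import numpy as np

    def local_sgd_direction(w, shard, grad_fn, lr=0.1, steps=50):
        """Run a few SGD steps on one node's local data and return the proposed move."""
        w_local = w.copy()
        for i in np.random.randint(len(shard), size=steps):
            w_local -= lr * grad_fn(w_local, shard[i])
        return w_local - w

    def parallel_sgd_outer_loop(w, shards, grad_fn, full_grad_fn, outer_iters=20, step=1.0):
        """Sketch of the hybrid scheme: local SGD proposes directions, the master takes a
        batch step along the averaged proposal when it is a descent direction."""
        for _ in range(outer_iters):
            directions = [local_sgd_direction(w, s, grad_fn) for s in shards]  # parallel in practice
            d = np.mean(directions, axis=0)
            if np.dot(full_grad_fn(w), d) < 0:   # accept only descent directions for the full objective
                w = w + step * d
        return w
    ```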
  • ABSTRACT: This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine.
    No preview · Article · Nov 2013 · Journal of Machine Learning Research
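    The workhorse of this line of work is the importance-sampling (inverse propensity) estimate of how a modified randomized policy would have performed on logged data. The sketch below shows that estimator in generic form with made-up arrays; the clipping constant is a common variance-control device, not a specific recommendation of the paper.

    ```python
    import numpy as np

    def counterfactual_estimate(rewards, logged_prob, new_prob, clip=10.0):
        """Importance-sampling estimate of the reward a new randomized policy would collect.
        `rewards[i]` is the outcome logged for decision i, `logged_prob[i]` the probability
        the deployed policy gave to that decision, and `new_prob[i]` the probability the
        proposed policy would give to it. Clipping the weights trades variance for bias."""
        w = np.clip(new_prob / logged_prob, 0.0, clip)
        return float(np.mean(w * rewards))

    # toy usage with made-up logs
    r = np.array([1.0, 0.0, 1.0, 0.0])
    p_old = np.array([0.5, 0.5, 0.25, 0.25])
    p_new = np.array([0.8, 0.2, 0.5, 0.1])
    print(counterfactual_estimate(r, p_old, p_new))
    ```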
  • Source
    ABSTRACT: This paper gives a novel approach to the distributed training of linear classifiers. At each iteration, the nodes minimize approximate objective functions and combine the resulting minimizers to form a descent direction to move along. The method is shown to have $O(\log(1/\epsilon))$ time convergence. The method can be viewed as an iterative parameter mixing method. A special instantiation yields a parallel stochastic gradient descent method with strong convergence. When communication times between nodes are large, our method is much faster than the SQM method, which uses distributed computation only for function and gradient calls.
    Full-text · Article · Oct 2013
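    A schematic rendering of one outer iteration as described above: local approximate problems are solved in parallel, the minimizers are averaged, and the averaged move is used as a descent direction (falling back to the gradient if it is not one). `local_solve` and `full_grad` are hypothetical placeholders.

    ```python
    import numpy as np

    def parameter_mixing_step(w, shards, local_solve, full_grad, step=1.0):
        """One outer iteration of the iterative-parameter-mixing view (names are placeholders).
        Each node solves an approximate local problem built around the current iterate and
        the averaged local minimizer defines the move."""
        local_minimizers = [local_solve(w, shard) for shard in shards]   # parallel in practice
        d = np.mean(local_minimizers, axis=0) - w
        if np.dot(full_grad(w), d) >= 0:     # fall back to the plain gradient if mixing fails to descend
            d = -full_grad(w)
        return w + step * d
    ```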
  • Source
    ABSTRACT: Training examples are not all equally informative. Active learning strategies leverage this observation in order to massively reduce the number of examples that need to be labeled. We leverage the same observation to build a generic strategy for parallelizing learning algorithms. This strategy is effective because the search for informative examples is highly parallelizable and because we show that its performance does not deteriorate when the sifting process relies on a slightly outdated model. Parallel active learning is particularly attractive for training nonlinear models with nonlinear representations, because there are few practical parallel learning algorithms for such models. We report preliminary experiments using both kernel SVMs and SGD-trained neural networks.
    Preview · Article · Oct 2013
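    A toy sketch of the sifting strategy for a linear hinge-loss model: workers filter their shards for examples on which a slightly stale copy of the model is uncertain, and the trainer updates only on those. The staleness, margin threshold, and update rule are illustrative, not the paper's exact protocol.

    ```python
    import numpy as np

    def sift_informative(examples, stale_w, margin=1.0):
        """Worker-side filter: keep (x, y) pairs on which a slightly outdated linear model is
        uncertain (inside the margin). This is the easily parallelized step."""
        return [(x, y) for x, y in examples if y * np.dot(stale_w, x) < margin]

    def train_with_sifting(shards, w, lr=0.1, rounds=5):
        """Overall loop: sift in parallel with a stale copy of the model, then apply SGD
        updates only on the selected examples (hypothetical setup)."""
        for _ in range(rounds):
            stale_w = w.copy()                        # workers may lag behind the trainer
            selected = [ex for shard in shards for ex in sift_informative(shard, stale_w)]
            for x, y in selected:
                if y * np.dot(w, x) < 1.0:            # hinge-loss SGD update on informative examples
                    w += lr * y * x
        return w
    ```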
  • Léon Bottou
    ABSTRACT: This short contribution presents the first paper in which Vapnik and Chervonenkis describe the foundations of Statistical Learning Theory (Vapnik, Chervonenkis (1968) Proc USSR Acad Sci 181(4): 781–783).
    No preview · Chapter · Jan 2013
  • Source
    ABSTRACT: This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select changes that improve both the short-term and long-term performance of such systems. This work is illustrated by experiments carried out on the ad placement system associated with the Bing search engine.
    Full-text · Article · Sep 2012
  • Léon Bottou
    ABSTRACT: Chapter 1 strongly advocates the stochastic back-propagation method to train neural networks. This is in fact an instance of a more general technique called stochastic gradient descent (SGD). This chapter provides background material, explains why SGD is a good learning algorithm when the training set is large, and provides useful recommendations.
    No preview · Chapter · Jan 2012
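    One concrete recommendation from this line of work is the decreasing step size gamma_t = gamma_0 / (1 + gamma_0 * lambda * t) for lambda-regularized objectives. The sketch below applies it to a linear SVM trained by plain SGD; the data interface and default constants are illustrative.

    ```python
    import numpy as np

    def sgd_svm(X, y, lam=1e-4, gamma0=0.1, epochs=5):
        """Plain SGD for a lambda-regularized linear SVM with the decreasing step size
        gamma_t = gamma0 / (1 + gamma0 * lam * t)."""
        n, d = X.shape
        w, t = np.zeros(d), 0
        for _ in range(epochs):
            for i in np.random.permutation(n):
                gamma = gamma0 / (1.0 + gamma0 * lam * t)
                margin = y[i] * np.dot(w, X[i])
                grad = lam * w - (y[i] * X[i] if margin < 1.0 else 0.0)
                w -= gamma * grad
                t += 1
        return w
    ```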
  • Source
    ABSTRACT: We describe and evaluate two algorithms for the Neyman-Pearson (NP) classification problem, which has recently been shown to be of particular importance for bipartite ranking problems. NP classification is a nonconvex problem involving a constraint on the false negative rate. We investigate a batch algorithm based on DC programming and a stochastic gradient method well suited to large-scale datasets. Empirical evidence illustrates the potential of the proposed methods.
    Preview · Article · Apr 2011 · ACM Transactions on Intelligent Systems and Technology
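    In the notation used here (not necessarily the paper's), Neyman-Pearson classification constrains one error type while minimizing the other:

    \[
    \min_{f}\ \Pr\bigl(f(X) = +1 \mid Y = -1\bigr)
    \quad\text{subject to}\quad
    \Pr\bigl(f(X) = -1 \mid Y = +1\bigr) \le \alpha,
    \]
    i.e., minimize the false positive rate subject to the false negative rate staying below a prescribed level $\alpha$.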
  • Source
    ABSTRACT: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
    Full-text · Article · Mar 2011 · Journal of Machine Learning Research
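    A minimal PyTorch sketch in the spirit of the architecture described above: a word-embedding lookup table, a hidden layer over a concatenated window of embeddings, and one output score per tag. All sizes are illustrative, and the real system adds further ingredients (additional discrete features, a sentence-level convolutional variant, structured inference over tag sequences).

    ```python
    import torch
    import torch.nn as nn

    class WindowTagger(nn.Module):
        """Window-approach tagger: embedding lookup, hidden layer over the concatenated
        window, per-tag scores. Sizes are illustrative, not the paper's."""
        def __init__(self, vocab_size=50000, emb_dim=50, window=5, hidden=300, n_tags=45):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)   # learned largely from unlabeled text in the paper
            self.hidden = nn.Linear(window * emb_dim, hidden)
            self.out = nn.Linear(hidden, n_tags)

        def forward(self, word_ids):                          # word_ids: (batch, window)
            x = self.embed(word_ids).flatten(start_dim=1)     # concatenate the window embeddings
            return self.out(torch.tanh(self.hidden(x)))       # one score per tag

    scores = WindowTagger()(torch.randint(0, 50000, (2, 5)))  # toy batch of two 5-word windows
    print(scores.shape)                                        # torch.Size([2, 45])
    ```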
  • Source
    Seyda Ertekin · Léon Bottou · C Lee Giles
    ABSTRACT: In this paper, we propose a nonconvex online Support Vector Machine (SVM) algorithm (LASVM-NC) based on the Ramp Loss, which strongly suppresses the influence of outliers. Then, again in the online learning setting, we propose an outlier filtering mechanism (LASVM-I) based on approximating nonconvex behavior in convex optimization. These two algorithms are built upon another novel SVM algorithm (LASVM-G) that is capable of generating accurate intermediate models in its iterative steps by leveraging the duality gap. We present experimental results demonstrating the merit of our frameworks in achieving significant robustness to outliers in noisy data classification where mislabeled training instances are in abundance. Experimental evaluation shows that the proposed approaches yield a more scalable online SVM algorithm with sparser models and less computational running time, both in the training and recognition phases, without sacrificing generalization performance. We also point out the relation between nonconvex optimization and min-margin active learning.
    Preview · Article · Mar 2011 · IEEE Transactions on Pattern Analysis and Machine Intelligence
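    The Ramp Loss mentioned above is usually written as a difference of two hinge losses, which is what makes the DC (difference-of-convex) treatment and the outlier suppression possible; the notation follows the standard presentation rather than this paper specifically:

    \[
    H_s(z) = \max(0,\, s - z), \qquad
    R_s(z) = H_1(z) - H_s(z) = \min\bigl(1 - s,\ \max(0,\, 1 - z)\bigr), \qquad s < 1,
    \]
    so the loss is flat for margins $z = y\,f(x)$ below $s$, capping the influence of outliers and mislabeled points.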
  • Source
    Leon Bottou
    ABSTRACT: A plausible definition of “reasoning” could be “algebraically manipulating previously acquired knowledge in order to answer a new question”. This definition covers first-order logical inference or probabilistic inference. It also includes much simpler manipulations commonly used to build large learning systems. For instance, we can build an optical character recognition system by first training a character segmenter, an isolated character recognizer, and a language model, using appropriate labelled training sets. Adequately concatenating these modules and fine tuning the resulting system can be viewed as an algebraic operation in a space of models. The resulting model answers a new question, that is, converting the image of a text page into a computer readable text. This observation suggests a conceptual continuity between algebraically rich inference systems, such as logical or probabilistic inference, and simple manipulations, such as the mere concatenation of trainable learning systems. Therefore, instead of trying to bridge the gap between machine learning systems and sophisticated “all-purpose” inference mechanisms, we can instead algebraically enrich the set of manipulations applicable to training systems, and build reasoning capabilities from the ground up.
    Preview · Article · Feb 2011 · Machine Learning
  • Source
    Nicolas Usunier · Antoine Bordes · Léon Bottou

    Full-text · Article · Apr 2010