L. Bottou

Microsoft, Washington, United States

Publications (122) · 60.77 total impact

  • Alekh Agarwal, Leon Bottou
    ABSTRACT: This paper presents a lower bound for optimizing a finite sum of $n$ functions, where each function is $L$-smooth and the sum is $\mu$-strongly convex. We show that no algorithm can reach an error $\epsilon$ in minimizing all functions from this class in fewer than $\Omega(n + \sqrt{n(\kappa-1)}\log(1/\epsilon))$ iterations, where $\kappa=L/\mu$ is a surrogate condition number. We then compare this lower bound to upper bounds for recently developed methods specializing to this setting. When the functions involved in this sum are not arbitrary, but based on i.i.d. random data, we further contrast these complexity results with those for optimal first-order methods that directly optimize the sum. The conclusion we draw is that considerable caution is necessary for an accurate comparison. In the interest of completeness, we also provide a self-contained proof of the classical result on optimizing smooth and strongly convex functions under a first-order oracle.
    10/2014;
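    For context, here is a back-of-the-envelope comparison (ours, not the paper's): a classical accelerated first-order method treats the sum as a single $L$-smooth, $\mu$-strongly convex function and needs $O(\sqrt{\kappa}\log(1/\epsilon))$ iterations, each costing $n$ component-gradient evaluations. Ignoring constants and the difference between $\kappa$ and $\kappa-1$, the finite-sum lower bound above is therefore smaller than that budget by roughly a factor of $\sqrt{n}$:

        $$ \Omega\bigl(n + \sqrt{n(\kappa-1)}\,\log(1/\epsilon)\bigr) \quad\text{vs.}\quad O\bigl(n\sqrt{\kappa}\,\log(1/\epsilon)\bigr). $$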
  • Source
    ABSTRACT: Quick interaction between a human teacher and a learning machine presents numerous benefits and challenges when working with web-scale data. The human teacher guides the machine towards accomplishing the task of interest. The learning machine leverages big data to find examples that maximize the training value of its interaction with the teacher. When the teacher is restricted to labeling examples selected by the machine, this problem is an instance of active learning. When the teacher can provide additional information to the machine (e.g., suggestions on what examples or predictive features should be used) as the learning task progresses, then the problem becomes one of interactive learning. To accommodate the two-way communication channel needed for efficient interactive learning, the teacher and the machine need an environment that supports an interaction language. The machine can access, process, and summarize more examples than the teacher can see in a lifetime. Based on the machine's output, the teacher can revise the definition of the task or make it more precise. Both the teacher and the machine continuously learn and benefit from the interaction. We have built a platform to (1) produce valuable and deployable models and (2) support research on both the machine learning and user interface challenges of the interactive learning problem. The platform relies on a dedicated, low-latency, distributed, in-memory architecture that allows us to construct web-scale learning machines with quick interaction speed. The purpose of this paper is to describe this architecture and demonstrate how it supports our research efforts. Preliminary results are presented as illustrations of the architecture but are not the primary focus of the paper.
    09/2014;
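    The loop below is a minimal sequential sketch of the teacher/machine interaction described above, using uncertainty sampling as the machine's sifting criterion. All names (interactive_loop, teacher_label) are illustrative; the actual platform is a distributed, low-latency, in-memory system, not this toy.

        # Toy interactive-learning loop: the machine picks the pool examples it
        # is least certain about, the teacher labels them, and the model refits.
        # Assumes the random seed batch contains both classes.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def interactive_loop(X_pool, teacher_label, n_rounds=10, batch=5):
            idx = list(np.random.choice(len(X_pool), batch, replace=False))
            labels = [teacher_label(i) for i in idx]
            model = LogisticRegression()
            for _ in range(n_rounds):
                model.fit(X_pool[idx], labels)
                probs = model.predict_proba(X_pool)[:, 1]
                uncertainty = -np.abs(probs - 0.5)      # near 0.5 = uncertain
                seen = set(idx)
                picked = [i for i in np.argsort(uncertainty)[::-1]
                          if i not in seen][:batch]
                idx += picked                             # teacher labels the
                labels += [teacher_label(i) for i in picked]  # machine's picks
            return model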
  • Source
    ABSTRACT: A plausible definition of "reasoning" could be "algebraically manipulating previously acquired knowledge in order to answer a new question". This definition covers first-order logical inference or probabilistic inference. It also includes much simpler manipulations commonly used to build large learning systems.
    Machine Learning 02/2014; · 1.47 Impact Factor
  • Source
    ABSTRACT: This paper proposes a novel parallel stochastic gradient descent (SGD) method in which parallel sets of SGD iterations (each set operating on one node, using the data residing on it) find the direction taken in each iteration of a batch descent method. The method has strong convergence properties. Experiments on datasets with high-dimensional feature spaces show the value of this method.
    11/2013;
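    A hedged sketch of the idea as we read it (names and step sizes are ours, not the paper's): each node runs a short SGD pass on its local shard, and the averaged local solutions define the direction of one outer batch-descent step.

        import numpy as np

        def local_sgd(w, X, y, lr=0.1):
            # one SGD epoch on a node's local data (hinge-loss subgradients)
            for i in np.random.permutation(len(X)):
                if y[i] * X[i].dot(w) < 1:
                    w = w + lr * y[i] * X[i]
            return w

        def outer_step(w, shards, step=0.5):
            # run SGD on every node (in parallel in practice, sequential here),
            # then move along the direction given by the averaged solutions
            local_ws = [local_sgd(w.copy(), X, y) for X, y in shards]
            direction = np.mean(local_ws, axis=0) - w
            return w + step * direction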
  • Source
    ABSTRACT: This paper gives a novel approach to the distributed training of linear classifiers. At each iteration, the nodes minimize approximate objective functions and combine the resulting minimizers to form a descent direction along which to move. The method is shown to converge in $O(\log(1/\epsilon))$ time and can be viewed as an iterative parameter mixing method. A special instantiation yields a parallel stochastic gradient descent method with strong convergence. When communication times between nodes are large, our method is much faster than the SQM method, which uses distributed computation only for function and gradient calls.
    10/2013;
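    Iterative parameter mixing, which this method generalizes, looks roughly as follows; the local solver here is a simple ridge surrogate regularized toward the current iterate, standing in for the paper's approximate objective functions.

        import numpy as np

        def local_minimize(w0, X, y, lam=0.1):
            # least-squares surrogate kept close to the shared iterate w0
            n, d = X.shape
            A = X.T @ X / n + lam * np.eye(d)
            b = X.T @ y / n + lam * w0
            return np.linalg.solve(A, b)

        def parameter_mixing(shards, dim, rounds=20):
            w = np.zeros(dim)
            for _ in range(rounds):
                # each node minimizes its approximate objective...
                local_ws = [local_minimize(w, X, y) for X, y in shards]
                w = np.mean(local_ws, axis=0)  # ...and the minimizers are mixed
            return w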
  • Source
    ABSTRACT: Training examples are not all equally informative. Active learning strategies leverage this observation in order to massively reduce the number of examples that need to be labeled. We leverage the same observation to build a generic strategy for parallelizing learning algorithms. This strategy is effective because the search for informative examples is highly parallelizable and because, as we show, its performance does not deteriorate when the sifting process relies on a slightly outdated model. Parallel active learning is particularly attractive for training models with nonlinear representations, because there are few practical parallel learning algorithms for such models. We report preliminary experiments using both kernel SVMs and SGD-trained neural networks.
    10/2013;
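    A sketch of the sifting strategy under stated assumptions (linear model, hinge loss, margin-based informativeness); in the real system the sift runs on parallel workers whose model snapshot lags behind the trainer.

        import numpy as np

        def training_loop(chunks, dim, lr=0.01, refresh_every=10):
            # chunks: iterable of (X, y) minibatches with labels in {-1, +1}
            w, snapshot = np.zeros(dim), np.zeros(dim)
            for t, (X, y) in enumerate(chunks):
                keep = np.abs(X @ snapshot) < 1.0       # sift with the stale
                for x_i, y_i in zip(X[keep], y[keep]):  # model: keep uncertain
                    if y_i * x_i.dot(w) < 1:            # points only
                        w += lr * y_i * x_i             # hinge-loss SGD update
                if t % refresh_every == 0:
                    snapshot = w.copy()                 # workers get a refresh
            return w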
  • ABSTRACT: This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and to predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine.
    Journal of Machine Learning Research 01/2013; 14(1):3207-3260. · 3.42 Impact Factor
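    One standard ingredient of such counterfactual analysis is importance-weighted (inverse propensity) estimation over randomized logs. The sketch below illustrates that generic ingredient only, not the paper's actual estimators or confidence intervals.

        import numpy as np

        def ips_estimate(logs, new_policy_prob):
            # logs: list of (context, action, reward, p_logged), where p_logged
            # is the probability with which the logging policy chose the action
            weighted_rewards = [
                new_policy_prob(context, action) / p_logged * reward
                for context, action, reward, p_logged in logs
            ]
            # average importance-weighted reward = estimated value of the new
            # policy, had it been running instead of the logged one
            return np.mean(weighted_rewards)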
  • Source
    ABSTRACT: This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select changes that improve both the short-term and long-term performance of such systems. This work is illustrated by experiments carried out on the ad placement system associated with the Bing search engine.
    09/2012;
  • Source
    ABSTRACT: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks, including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
    Computing Research Repository (CoRR), 03/2011;
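    A toy sketch of the window approach behind such a unified tagger: embed the words of a fixed window, concatenate the embeddings, and score the tags with a small network. All dimensions and names here are illustrative, not those of the released system.

        import numpy as np

        rng = np.random.default_rng(0)
        VOCAB, EMB, WINDOW, HIDDEN, TAGS = 10_000, 50, 5, 300, 17

        E  = rng.normal(0, 0.1, (VOCAB, EMB))       # learned word embeddings
        W1 = rng.normal(0, 0.1, (WINDOW * EMB, HIDDEN))
        W2 = rng.normal(0, 0.1, (HIDDEN, TAGS))

        def tag_scores(word_ids):                   # ids of WINDOW words
            x = E[word_ids].reshape(-1)             # concatenated embeddings
            h = np.tanh(x @ W1)                     # hidden representation
            return h @ W2                           # one score per tag

        print(tag_scores(rng.integers(0, VOCAB, WINDOW)).shape)  # (17,)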
  • Source
    S. Ertekin, L. Bottou, C.L. Giles
    ABSTRACT: In this paper, we propose a nonconvex online Support Vector Machine (SVM) algorithm (LASVM-NC) based on the ramp loss, which strongly suppresses the influence of outliers. Then, again in the online learning setting, we propose an outlier filtering mechanism (LASVM-I) based on approximating nonconvex behavior in convex optimization. These two algorithms are built upon another novel SVM algorithm (LASVM-G) that is capable of generating accurate intermediate models in its iterative steps by leveraging the duality gap. We present experimental results demonstrating the merit of our frameworks in achieving significant robustness to outliers in noisy data classification where mislabeled training instances abound. The experimental evaluation shows that the proposed approaches yield a more scalable online SVM algorithm with sparser models and lower computational running time, in both the training and recognition phases, without sacrificing generalization performance. We also point out the relation between nonconvex optimization and min-margin active learning.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 03/2011; · 4.80 Impact Factor
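    The ramp loss underlying LASVM-NC is, as is standard, a difference of two hinge losses, which is exactly what makes it flatten out for badly misclassified outliers:

        import numpy as np

        def hinge(z, a):
            return np.maximum(0.0, a - z)

        def ramp(z, s=-1.0):
            # flat (= 1 - s) for margins z <= s, so extreme outliers stop
            # contributing gradient; the convex-minus-convex form is what
            # makes DC-style optimization applicable
            return hinge(z, 1.0) - hinge(z, s)

        print(ramp(np.linspace(-3, 3, 7)))  # capped for very negative margins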
  • Source
    Leon Bottou
    ABSTRACT: A plausible definition of "reasoning" could be "algebraically manipulating previously acquired knowledge in order to answer a new question". This definition covers first-order logical inference or probabilistic inference. It also includes much simpler manipulations commonly used to build large learning systems. For instance, we can build an optical character recognition system by first training a character segmenter, an isolated character recognizer, and a language model, using appropriate labeled training sets. Adequately concatenating these modules and fine tuning the resulting system can be viewed as an algebraic operation in a space of models. The resulting model answers a new question, that is, converting the image of a text page into a computer readable text. This observation suggests a conceptual continuity between algebraically rich inference systems, such as logical or probabilistic inference, and simple manipulations, such as the mere concatenation of trainable learning systems. Therefore, instead of trying to bridge the gap between machine learning systems and sophisticated "all-purpose" inference mechanisms, we can instead algebraically enrich the set of manipulations applicable to training systems, and build reasoning capabilities from the ground up.
    Machine Learning 02/2011; · 1.47 Impact Factor
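    A deliberately small illustration of the "concatenation" the abstract mentions (shapes and modules are invented for the example): two separately trained stages are composed into a pipeline that answers a question neither answers alone; fine-tuning that composition end to end is then an operation in the space of models.

        import numpy as np

        class Linear:
            def __init__(self, d_in, d_out, rng):
                self.W = rng.normal(0, 0.1, (d_in, d_out))
            def __call__(self, x):
                return np.tanh(x @ self.W)

        rng = np.random.default_rng(0)
        segmenter  = Linear(64, 32, rng)   # stands in for a trained segmenter
        recognizer = Linear(32, 10, rng)   # stands in for a trained recognizer

        def pipeline(x):                   # composition as an algebraic step
            return recognizer(segmenter(x))

        print(pipeline(rng.normal(size=64)).shape)  # (10,)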
  • ABSTRACT: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks, including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
    Journal of Machine Learning Research 02/2011; 12:2493-2537. · 3.42 Impact Factor
  • Source
    ABSTRACT: We describe and evaluate two algorithms for the Neyman-Pearson (NP) classification problem, which has recently been shown to be of particular importance for bipartite ranking problems. NP classification is a nonconvex problem involving a constraint on the false negative rate. We investigate a batch algorithm based on DC programming and a stochastic gradient method well suited to large-scale datasets. Empirical evidence illustrates the potential of the proposed methods.
    ACM Transactions on Intelligent Systems and Technology (TIST), 01/2011; 2:28.
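    The structure of the problem (control false positives subject to a bound on the false negative rate) can be conveyed with a small penalty-style stochastic gradient sketch; this mirrors the constraint handling in spirit only, not the paper's DC-programming or stochastic gradient algorithms.

        import numpy as np

        def np_sgd(X, y, alpha=0.05, lr=0.01, epochs=20):
            # y in {-1, +1}; alpha bounds the false negative rate
            w, pos_weight = np.zeros(X.shape[1]), 1.0
            for _ in range(epochs):
                for i in np.random.permutation(len(X)):
                    if y[i] * X[i].dot(w) < 1:           # reweighted hinge
                        c = pos_weight if y[i] > 0 else 1.0
                        w += lr * c * y[i] * X[i]        # subgradient step
                fn_rate = np.mean((X[y > 0] @ w) < 0)    # false negative rate
                if fn_rate > alpha:                      # constraint violated:
                    pos_weight *= 1.5                    # penalize misses more
            return w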
  • Source
    Seyda Ertekin, Leon Bottou, C. Lee Giles
    ABSTRACT: Technological advances in the past several decades have facilitated the generation and storage of digital information. As the amount of digital information increases, there arises the need for more effective tools to find, filter, and manage these resources. Developing fast and highly accurate algorithms for the automatic classification of digital data has therefore become an important part of machine learning and knowledge discovery research. This poster presents a fast online Support Vector Machine (SVM) classifier algorithm that offers a substantial speed improvement over classical (batch) SVMs and other online SVM algorithms, while preserving the classification accuracy of state-of-the-art SVM solvers. The speed improvement and the lower memory demands of the online learning setting make SVMs applicable to very large data sets.
    01/2010;
  • Source
    ABSTRACT: The SGD-QN algorithm described in (Bordes et al., 2009) contains a subtle flaw that prevents it from reaching its design goals. Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). This document clarifies the situation, proposes a corrected algorithm, and evaluates its performance.
    Journal of Machine Learning Research 01/2010; 11:2229-2240. · 3.42 Impact Factor
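    The gist of SGD-QN is SGD with a diagonal rescaling estimated from successive gradients. The sketch below conveys that gist under our own simplifications (fixed refresh period, crude clipping); the corrected algorithm's schedule and safeguards are in the paper.

        import numpy as np

        def sgdqn_like(grad, w0, n_steps=1000, lr=0.1, eps=1e-2):
            # grad: stochastic gradient oracle, w0: initial parameters
            w, B = w0.copy(), np.ones_like(w0)  # B ~ diag. inverse curvature
            w_prev, g_prev = w.copy(), grad(w)
            for t in range(1, n_steps + 1):
                w -= lr * B * grad(w)           # diagonally rescaled SGD step
                if t % 10 == 0:                 # occasionally refresh B from a
                    g = grad(w)                 # secant pair (dw, dg)
                    dw, dg = w - w_prev, g - g_prev
                    B = np.clip(np.abs(dw) / (np.abs(dg) + eps), 1e-2, 1e2)
                    w_prev, g_prev = w.copy(), g
            return w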
  • Source
    Journal of Machine Learning Research - Proceedings Track. 01/2010; 9:884-891.
  • Source
    ABSTRACT: Shotgun proteomics coupled with database search software allows the identification of a large number of peptides in a single experiment. However, some existing search algorithms, such as SEQUEST, use score functions that are designed primarily to identify the best peptide for a given spectrum. Consequently, when comparing identifications across spectra, the SEQUEST score function Xcorr fails to discriminate accurately between correct and incorrect peptide identifications. Several machine learning methods have been proposed to address the resulting classification task of distinguishing between correct and incorrect peptide-spectrum matches (PSMs). A recent example is Percolator, which uses semisupervised learning and a decoy database search strategy to learn to distinguish between correct and incorrect PSMs identified by a database search algorithm. The current work describes three improvements to Percolator. (1) Percolator's heuristic optimization is replaced with a clear objective function, with intuitive reasons behind its choice. (2) Tractable nonlinear models are used instead of linear models, leading to improved accuracy over the original Percolator. (3) A method, Q-ranker, for directly optimizing the number of identified spectra at a specified q value is proposed, which achieves further gains.
    Journal of Proteome Research 05/2009; 8(7):3737-45. · 5.06 Impact Factor
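    Methods in this family build on the standard target-decoy estimate of q-values: rank PSMs by score, estimate the FDR at each score threshold from the decoy counts, and take a running minimum. The function below is that textbook procedure, not Percolator's implementation.

        import numpy as np

        def qvalues(scores, is_decoy):
            scores, is_decoy = np.asarray(scores), np.asarray(is_decoy, bool)
            order = np.argsort(scores)[::-1]        # best-scoring PSMs first
            decoys = np.cumsum(is_decoy[order])
            targets = np.cumsum(~is_decoy[order])
            fdr = decoys / np.maximum(targets, 1)   # FDR estimate per cutoff
            q = np.minimum.accumulate(fdr[::-1])[::-1]  # min achievable FDR
            out = np.empty(len(q))
            out[order] = q                          # back to the input order
            return out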
  • Source
    Article: LASVM
    01/2009;
  • [Show abstract] [Hide abstract]
    ABSTRACT: Bandpass filtering, orientation selectivity, and contrast gain control are prominent features of sensory coding at the level of V1 simple cells. While the effect of bandpass filtering and orientation selectivity can be assessed within a linear model, contrast gain control is an inherently nonlinear computation. Here we employ the class of $L_p$ elliptically contoured distributions to investigate the extent to which the two features---orientation selectivity and contrast gain control---are suited to model the statistics of natural images. Within this framework we find that contrast gain control can play a significant role for the removal of redundancies in natural images. Orientation selectivity, in contrast, has only a very limited potential for redundancy reduction.
    Advances in Neural Information Processing Systems 21: Proceedings of the 2008 Conference, 1521-1528 (2009). 01/2009;
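    Contrast gain control here means divisive normalization: each filter response is divided by a measure of the joint response magnitude, which is the nonlinear step the abstract contrasts with linear orientation selectivity. A generic sketch (the paper works with the full $L_p$ elliptically contoured family, not just this normalization):

        import numpy as np

        def gain_control(responses, sigma=0.1, p=2.0):
            # responses: filter outputs for one image patch; sigma prevents
            # division by zero and sets the gain at low contrast
            norm = (np.sum(np.abs(responses) ** p) + sigma ** p) ** (1.0 / p)
            return responses / norm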