L. Bottou

Microsoft, Washington, United States

Publications (124) · 67.22 Total impact

  • Source
    Alekh Agarwal, Leon Bottou
    ABSTRACT: This paper presents a lower bound for optimizing a finite sum of $n$ functions, where each function is $L$-smooth and the sum is $\mu$-strongly convex. We show that no algorithm can reach an error $\epsilon$ in minimizing all functions from this class in fewer than $\Omega(n + \sqrt{n(\kappa-1)}\log(1/\epsilon))$ iterations, where $\kappa=L/\mu$ is a surrogate condition number. We then compare this lower bound to upper bounds for recently developed methods specializing to this setting. When the functions involved in this sum are not arbitrary but based on i.i.d. random data, we further contrast these complexity results with those for optimal first-order methods that directly optimize the sum. The conclusion we draw is that a lot of caution is necessary for an accurate comparison. In the interest of completeness, we also provide a self-contained proof of the classical result on optimizing smooth and strongly convex functions under a first-order oracle.
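    For illustration, the setting and bound above can be restated compactly (an editorial paraphrase in LaTeX, not the paper's exact theorem statement):

      \min_{w}\; F(w) = \sum_{i=1}^{n} f_i(w), \qquad
      f_i \ L\text{-smooth}, \quad F \ \mu\text{-strongly convex}, \quad \kappa = L/\mu,

      \text{iterations needed:}\quad \Omega\!\left( n + \sqrt{n(\kappa-1)}\,\log(1/\epsilon) \right)
      \ \text{to guarantee}\ F(w) - F(w^\star) \le \epsilon .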
  • Source
    ABSTRACT: Quick interaction between a human teacher and a learning machine presents numerous benefits and challenges when working with web-scale data. The human teacher guides the machine towards accomplishing the task of interest. The learning machine leverages big data to find examples that maximize the training value of its interaction with the teacher. When the teacher is restricted to labeling examples selected by the machine, this problem is an instance of active learning. When the teacher can provide additional information to the machine (e.g., suggestions on what examples or predictive features should be used) as the learning task progresses, then the problem becomes one of interactive learning. To accommodate the two-way communication channel needed for efficient interactive learning, the teacher and the machine need an environment that supports an interaction language. The machine can access, process, and summarize more examples than the teacher can see in a lifetime. Based on the machine's output, the teacher can revise the definition of the task or make it more precise. Both the teacher and the machine continuously learn and benefit from the interaction. We have built a platform to (1) produce valuable and deployable models and (2) support research on both the machine learning and user interface challenges of the interactive learning problem. The platform relies on a dedicated, low-latency, distributed, in-memory architecture that allows us to construct web-scale learning machines with quick interaction speed. The purpose of this paper is to describe this architecture and demonstrate how it supports our research efforts. Preliminary results are presented as illustrations of the architecture but are not the primary focus of the paper.
  • Source
    ABSTRACT: A plausible definition of "reasoning" could be "algebraically manipulating previously acquired knowledge in order to answer a new question". This definition covers first-order logical inference or probabilistic inference. It also includes ...
    Machine Learning 02/2014; DOI:10.1007/s10994-013-5381-4 · 1.69 Impact Factor
  • Source
    ABSTRACT: This paper proposes a novel parallel stochastic gradient descent (SGD) method obtained by running parallel sets of SGD iterations (each set operating on one node, using the data residing on it) to find the search direction used in each iteration of a batch descent method. The method has strong convergence properties. Experiments on datasets with high-dimensional feature spaces show the value of this method.
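    A minimal sketch of the idea, assuming each node runs a short SGD pass over its local data shard and the resulting displacements are averaged into the search direction of an outer batch step (the function names and the plain averaging rule are illustrative assumptions, not the paper's exact algorithm):

      import numpy as np

      def local_sgd_direction(w, X, y, grad_fn, lr=0.1, epochs=1, seed=0):
          """Run SGD on one node's local data and return the net displacement."""
          rng = np.random.default_rng(seed)
          w_local = w.copy()
          for _ in range(epochs):
              for i in rng.permutation(len(X)):
                  w_local -= lr * grad_fn(w_local, X[i], y[i])
          return w_local - w          # displacement proposed by this node

      def parallel_sgd_step(w, shards, grad_fn, outer_lr=1.0):
          """One outer iteration: average the per-node SGD displacements and
          take a batch step along the averaged direction."""
          directions = [local_sgd_direction(w, X, y, grad_fn, seed=k)
                        for k, (X, y) in enumerate(shards)]   # runs in parallel in practice
          return w + outer_lr * np.mean(directions, axis=0)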
  • Source
    ABSTRACT: This paper gives a novel approach to the distributed training of linear classifiers. At each iteration, the nodes minimize approximate objective functions and combine the resulting minimizers to form a descent direction along which to move. The method is shown to have $O(\log(1/\epsilon))$ time convergence. It can be viewed as an iterative parameter mixing method. A special instantiation yields a parallel stochastic gradient descent method with strong convergence. When communication times between nodes are large, our method is much faster than the SQM method, which uses distributed computation only for function and gradient calls.
  • Source
    ABSTRACT: Training examples are not all equally informative. Active learning strategies leverage this observation in order to massively reduce the number of examples that need to be labeled. We leverage the same observation to build a generic strategy for parallelizing learning algorithms. This strategy is effective because the search for informative examples is highly parallelizable and because we show that its performance does not deteriorate when the sifting process relies on a slightly outdated model. Parallel active learning is particularly attractive for training nonlinear models with nonlinear representations because there are few practical parallel learning algorithms for such models. We report preliminary experiments using both kernel SVMs and SGD-trained neural networks.
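    A rough sketch of the sifting strategy, assuming a margin-based notion of "informative" (small absolute score under a linear model) and workers that filter candidate examples with a possibly stale copy of the model; the concrete selection criterion and scheduling in the paper may differ:

      import numpy as np
      from concurrent.futures import ThreadPoolExecutor

      def sift_chunk(w_stale, X_chunk, margin=1.0):
          """Return indices of examples the (possibly outdated) model finds informative."""
          scores = X_chunk @ w_stale
          return np.flatnonzero(np.abs(scores) < margin)

      def parallel_sift(w_stale, chunks, margin=1.0):
          """Workers scan data chunks concurrently; only the selected examples are
          forwarded to the (serial) learner for labeling and model updates."""
          with ThreadPoolExecutor() as pool:
              return list(pool.map(lambda X: sift_chunk(w_stale, X, margin), chunks))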
  • ABSTRACT: This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine.
    Journal of Machine Learning Research 01/2013; 14(1):3207-3260. · 2.85 Impact Factor
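    Work in this line relies on estimating, from logged randomized decisions, what a modified system would have done. A toy inverse-propensity estimator of that kind, with made-up record fields and no relation to the actual Bing infrastructure:

      import numpy as np

      def counterfactual_estimate(logs, new_policy_prob, clip=10.0):
          """Estimate the reward a modified policy would have collected by
          reweighting logged outcomes with new_prob / logging_prob
          (inverse-propensity weighting; weights are clipped to limit variance)."""
          weights, rewards = [], []
          for rec in logs:   # rec: {"context": ..., "action": ..., "reward": ..., "prob": ...}
              w = new_policy_prob(rec["context"], rec["action"]) / rec["prob"]
              weights.append(min(w, clip))
              rewards.append(rec["reward"])
          return float(np.mean(np.array(weights) * np.array(rewards)))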
  • Source
    ABSTRACT: This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select changes that improve both the short-term and long-term performance of such systems. This work is illustrated by experiments carried out on the ad placement system associated with the Bing search engine.
  • Source
    ABSTRACT: We describe and evaluate two algorithms for the Neyman-Pearson (NP) classification problem, which has recently been shown to be of particular importance for bipartite ranking problems. NP classification is a nonconvex problem involving a constraint on the false negative rate. We investigate a batch algorithm based on DC programming and a stochastic gradient method well suited to large-scale datasets. Empirical evidence illustrates the potential of the proposed methods.
    ACM Transactions on Intelligent Systems and Technology 04/2011; 2:28. DOI:10.1145/1961189.1961200 · 9.39 Impact Factor
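    For reference, a common formalization of the NP classification problem consistent with this abstract (the surrogate losses and constants used by the two algorithms are in the paper):

      \min_{f}\ \Pr\big(f(X) = +1 \mid Y = -1\big)
      \quad \text{subject to} \quad
      \Pr\big(f(X) = -1 \mid Y = +1\big) \le \alpha,

    i.e., minimize the false positive rate subject to keeping the false negative rate below a user-chosen level $\alpha$; the nonconvexity comes from these 0-1 quantities, which are attacked with DC programming in the batch case and with a stochastic gradient method in the online case.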
  • Source
    ABSTRACT: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
  • Source
    S. Ertekin, L. Bottou, C.L. Giles
    ABSTRACT: In this paper, we propose a nonconvex online Support Vector Machine (SVM) algorithm (LASVM-NC) based on the ramp loss, which strongly suppresses the influence of outliers. Then, again in the online learning setting, we propose an outlier filtering mechanism (LASVM-I) based on approximating nonconvex behavior in convex optimization. These two algorithms are built upon another novel SVM algorithm (LASVM-G) that is capable of generating accurate intermediate models in its iterative steps by leveraging the duality gap. We present experimental results that demonstrate the merit of our frameworks in achieving significant robustness to outliers in noisy data classification where mislabeled training instances are abundant. The experimental evaluation shows that the proposed approaches yield a more scalable online SVM algorithm with sparser models and lower computational cost in both the training and recognition phases, without sacrificing generalization performance. We also point out the relation between nonconvex optimization and min-margin active learning.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 03/2011; 33(2):368-381. DOI:10.1109/TPAMI.2010.109 · 5.69 Impact Factor
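    For context, the ramp loss referred to above is usually written as a clipped hinge, i.e., a difference of two convex hinge losses, which is what bounds the contribution of any single outlier (notation is editorial):

      H_s(z) = \max(0,\, s - z), \qquad
      R_s(z) = \min\big(1 - s,\; H_1(z)\big) = H_1(z) - H_s(z), \qquad s < 1,

    so the loss incurred by any example is capped at $1 - s$, unlike the unbounded hinge loss.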
  • Source
    Leon Bottou
    ABSTRACT: A plausible definition of "reasoning" could be "algebraically manipulating previously acquired knowledge in order to answer a new question". This definition covers first-order logical inference or probabilistic inference. It also includes much simpler manipulations commonly used to build large learning systems. For instance, we can build an optical character recognition system by first training a character segmenter, an isolated character recognizer, and a language model, using appropriate labeled training sets. Adequately concatenating these modules and fine-tuning the resulting system can be viewed as an algebraic operation in a space of models. The resulting model answers a new question, that is, converting the image of a text page into computer-readable text. This observation suggests a conceptual continuity between algebraically rich inference systems, such as logical or probabilistic inference, and simple manipulations, such as the mere concatenation of trainable learning systems. Therefore, instead of trying to bridge the gap between machine learning systems and sophisticated "all-purpose" inference mechanisms, we can algebraically enrich the set of manipulations applicable to training systems, and build reasoning capabilities from the ground up.
    Machine Learning 02/2011; 94(2). DOI:10.1007/s10994-013-5335-x · 1.69 Impact Factor
  • ABSTRACT: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
    Journal of Machine Learning Research 02/2011; 12:2493-2537. · 2.85 Impact Factor
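    A highly simplified sketch of the kind of window-based tagger the abstract alludes to: word embeddings are looked up, concatenated over a context window, and passed through a small nonlinear scorer (the dimensions, the plain tanh, and the omission of the sentence-level layer are simplifying assumptions):

      import numpy as np

      class WindowTagger:
          def __init__(self, vocab_size, n_tags, emb_dim=50, window=5, hidden=300, seed=0):
              rng = np.random.default_rng(seed)
              self.E  = rng.normal(0, 0.1, (vocab_size, emb_dim))       # word embeddings
              self.W1 = rng.normal(0, 0.1, (window * emb_dim, hidden))  # hidden layer
              self.W2 = rng.normal(0, 0.1, (hidden, n_tags))            # tag scoring layer

          def scores(self, window_word_ids):
              """Score every tag for the word at the center of the window."""
              x = self.E[window_word_ids].reshape(-1)   # concatenate the window's embeddings
              h = np.tanh(x @ self.W1)                  # nonlinearity (a stand-in for hard tanh)
              return h @ self.W2                        # unnormalized tag scores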
  • Source
    ABSTRACT: The SGD-QN algorithm described in (Bordes et al., 2009) contains a subtle flaw that prevents it from reaching its design goals. Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). This document clarifies the situation, proposes a corrected algorithm, and evaluates its performance.
    Journal of Machine Learning Research 01/2010; 11:2229-2240. · 2.85 Impact Factor
  • Source
    Seyda Ertekin, Leon Bottou, C. Lee Giles
    ABSTRACT: Technological advances in the past several decades have facilitated the generation and storage of digital information. As the amount of digital information increases, there arises the need for more effective tools to find, filter, and manage these resources. Developing fast and highly accurate algorithms for the automatic classification of digital data has therefore become an important part of machine learning and knowledge discovery research. This poster presents a fast online Support Vector Machine (SVM) classifier algorithm that offers a substantial speed improvement over classical (batch) SVMs and other online SVM algorithms, while preserving the classification accuracy of state-of-the-art SVM solvers. The speed improvement and the lower memory demands of the online learning setting make SVMs applicable to very large datasets.
  • Source
    Léon Bottou
    ABSTRACT: During the last decade, data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods are limited by the computing time rather than the sample size. A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in non-trivial ways. Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second-order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass on the training set.
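    A minimal sketch of the averaged stochastic gradient variant mentioned at the end of the abstract: plain SGD iterates are maintained alongside their running average, and the average is returned as the final estimate (the step-size schedule and the single-pass loop are illustrative choices):

      import numpy as np

      def averaged_sgd(grad_fn, data, w0, lr0=0.1, decay=1e-2):
          """Single pass of SGD with Polyak-Ruppert iterate averaging."""
          w, w_bar = w0.copy(), w0.copy()
          for t, (x, y) in enumerate(data, start=1):
              lr = lr0 / (1.0 + decay * t) ** 0.75    # slowly decaying step size
              w -= lr * grad_fn(w, x, y)              # ordinary SGD step
              w_bar += (w - w_bar) / t                # running average of the iterates
          return w_bar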
  • Source
    ABSTRACT: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of second-order information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires fewer iterations to achieve the same accuracy. This algorithm won the "Wild Track" of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008).
    Journal of Machine Learning Research 07/2009; 10:1737-1754. DOI:10.1145/1577069.1755842 · 2.85 Impact Factor
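    A rough sketch of the flavor of update SGD-QN performs: a scheduled first-order step rescaled coordinate-wise by a diagonal matrix that is re-estimated only every few iterations from observed parameter and gradient differences. The secant-style estimator below is a simplified stand-in, not the SGD-QN algorithm itself (nor the corrected version from the erratum above):

      import numpy as np

      def diagonally_scaled_sgd(grad_fn, data, w0, lr0=1.0, t0=1.0, skip=16, eps=1e-8):
          """SGD whose step is rescaled by a lazily re-estimated diagonal matrix B."""
          w = w0.copy()
          B = np.ones_like(w)                       # diagonal scaling, starts as identity
          w_prev, g_prev = w.copy(), None
          for t, (x, y) in enumerate(data, start=1):
              g = grad_fn(w, x, y)
              if g_prev is None:
                  g_prev = g.copy()
              elif t % skip == 0:
                  dw, dg = w - w_prev, g - g_prev              # secant information
                  curv = np.abs(dg) / (np.abs(dw) + eps)       # crude per-coordinate curvature
                  B = np.clip(1.0 / (curv + eps), 0.0, 100.0)  # larger steps where curvature is small
                  w_prev, g_prev = w.copy(), g.copy()
              w -= (lr0 / (t + t0)) * B * g                    # scheduled, rescaled step
          return w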
  • Source
    ABSTRACT: Shotgun proteomics coupled with database search software allows the identification of a large number of peptides in a single experiment. However, some existing search algorithms, such as SEQUEST, use score functions that are designed primarily to identify the best peptide for a given spectrum. Consequently, when comparing identifications across spectra, the SEQUEST score function Xcorr fails to discriminate accurately between correct and incorrect peptide identifications. Several machine learning methods have been proposed to address the resulting classification task of distinguishing between correct and incorrect peptide-spectrum matches (PSMs). A recent example is Percolator, which uses semisupervised learning and a decoy database search strategy to learn to distinguish between correct and incorrect PSMs identified by a database search algorithm. The current work describes three improvements to Percolator. (1) Percolator's heuristic optimization is replaced with a clear objective function, with intuitive reasons behind its choice. (2) Tractable nonlinear models are used instead of linear models, leading to improved accuracy over the original Percolator. (3) A method, Q-ranker, for directly optimizing the number of identified spectra at a specified q value is proposed, which achieves further gains.
    Journal of Proteome Research 05/2009; 8(7):3737-45. DOI:10.1021/pr801109k · 5.00 Impact Factor
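    For orientation, a bare-bones version of the target-decoy comparison underlying this line of work: PSM scores from target and decoy searches are pooled, each score threshold gets a false discovery rate estimated from the decoy counts, and q values are derived from those estimates (this is the standard construction, not Q-ranker itself):

      import numpy as np

      def qvalues(target_scores, decoy_scores):
          """Estimate a q value for each target PSM from a decoy score distribution."""
          t = np.sort(np.asarray(target_scores, dtype=float))[::-1]   # best scores first
          d = np.sort(np.asarray(decoy_scores, dtype=float))[::-1]
          q = np.empty_like(t)
          for i, s in enumerate(t):
              n_target = i + 1                                  # targets scoring >= s
              n_decoy = np.searchsorted(-d, -s, side="right")   # decoys scoring >= s
              q[i] = n_decoy / max(n_target, 1)                 # estimated FDR at this threshold
          return np.minimum.accumulate(q[::-1])[::-1]           # q value: min FDR over looser thresholds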
  • Source
    ABSTRACT: We propose a new method to quantify the solution stability of a large class of combinatorial optimization problems arising in machine learning. As a practical example, we apply the method to correlation clustering, clustering aggregation, modularity clustering, and relative performance significance clustering. Our method is motivated by the idea of linear programming relaxations. We prove that when a relaxation is used to solve the original clustering problem, the solution stability calculated by our method is conservative, that is, it never overestimates the solution stability of the true, unrelaxed problem. We also demonstrate how our method can be used to compute the entire path of optimal solutions as the optimization problem is increasingly perturbed. Experimentally, our method is shown to perform well on a number of benchmark problems.