L. Bottou’s research while affiliated with NEC Laboratories America and other places


Publications (48)


Figure 1. An example partitioning z. Bold edges have z_{i,j} = 1, while others have z_{k,l} = 0.
Figure 6. Critical edges in Zachary's karate club network with four groups. A removal of any red edge would change the current (best) partitioning. All other edges can be removed individually without changing the solution. (Figure best viewed in color.)
Figure 7. Clustering solution path for the leaves dataset. The red stems show the difference of adjacent clusterings.
Solution Stability in Linear Programming Relaxations: Graph Partitioning and Unsupervised Learning
  • Article
  • Full-text available

June 2009 · 144 Reads · 38 Citations · Stefanie Jegelka · A. Danyluk · [...]

We propose a new method to quantify the solution stability of a large class of combinatorial optimization problems arising in machine learning. As practical examples we apply the method to correlation clustering, clustering aggregation, modularity clustering, and relative performance significance clustering. Our method is motivated by the idea of linear programming relaxations. We prove that when a relaxation is used to solve the original clustering problem, the solution stability calculated by our method is conservative, that is, it never overestimates the solution stability of the true, unrelaxed problem. We also demonstrate how our method can be used to compute the entire path of optimal solutions as the optimization problem is increasingly perturbed. Experimentally, our method is shown to perform well on a number of benchmark problems.
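To see the kind of relaxation involved, the sketch below solves the standard linear programming relaxation of correlation clustering on a toy four-node instance with SciPy. This is a generic correlation-clustering LP, not the paper's exact formulation, and the weights and problem size are invented for illustration: variable d_p in [0,1] says whether pair p is separated, triangle inequalities keep d a pseudometric, and the objective coefficient of each d_p is simply the pair's weight.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# Pairwise similarity weights: positive = "same cluster", negative = "separate".
n = 4
w = {(0, 1): 1.0, (0, 2): -1.0, (0, 3): -1.0,
     (1, 2): -1.0, (1, 3): -1.0, (2, 3): 1.0}

pairs = list(itertools.combinations(range(n), 2))
idx = {p: i for i, p in enumerate(pairs)}

def pid(i, j):
    return idx[(min(i, j), max(i, j))]

# Minimising sum_p w_p * d_p is, up to an additive constant, the
# disagreement cost, where d_p = 1 means the pair is cut.
c = np.array([w[p] for p in pairs])

# Triangle inequalities d_ij <= d_ik + d_kj for every triple.
A, b = [], []
for i, j, k in itertools.combinations(range(n), 3):
    for lhs, rh1, rh2 in [((i, j), (i, k), (j, k)),
                          ((i, k), (i, j), (j, k)),
                          ((j, k), (i, j), (i, k))]:
        row = np.zeros(len(pairs))
        row[pid(*lhs)] = 1.0
        row[pid(*rh1)] -= 1.0
        row[pid(*rh2)] -= 1.0
        A.append(row)
        b.append(0.0)

res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
              bounds=[(0.0, 1.0)] * len(pairs))
d = res.x  # here the LP is integral: d ~ 0 inside clusters, ~ 1 across
```

On this instance the relaxation recovers the exact partitioning {0,1}, {2,3}; the paper's stability analysis asks how much the weights w can be perturbed before such a solution changes.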


The Conjoint Effect of Divisive Normalization and Orientation Selectivity on Redundancy Reduction

January 2009 · 21 Reads · 16 Citations

Bandpass filtering, orientation selectivity, and contrast gain control are prominent features of sensory coding at the level of V1 simple cells. While the effect of bandpass filtering and orientation selectivity can be assessed within a linear model, contrast gain control is an inherently nonlinear computation. Here we employ the class of L_p elliptically contoured distributions to investigate the extent to which the two features, orientation selectivity and contrast gain control, are suited to model the statistics of natural images. Within this framework we find that contrast gain control can play a significant role in the removal of redundancies in natural images. Orientation selectivity, in contrast, has only a very limited potential for redundancy reduction.
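The redundancy-reducing effect of contrast gain control can be seen in a small numpy simulation. This is a toy elliptical model of my own, not the paper's fitted L_p model: a shared contrast variable couples the energies of otherwise independent filter responses, and divisive normalization by the pooled activity largely removes that coupling.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 16  # samples, filter channels

# Elliptical toy model of filter responses: a common scalar contrast s
# multiplies independent Gaussian coefficients, coupling their magnitudes.
s = np.exp(0.5 * rng.standard_normal((n, 1)))   # fluctuating contrast
g = rng.standard_normal((n, d))
x = s * g                                        # "linear filter" outputs

# Divisive normalization: each response divided by the pooled activity.
c = 0.1
y = x / np.sqrt(c + np.mean(x ** 2, axis=1, keepdims=True))

def sq_corr(a, b):
    """Correlation between squared (energy) responses."""
    return np.corrcoef(a ** 2, b ** 2)[0, 1]

before = sq_corr(x[:, 0], x[:, 1])   # clear energy correlation
after = sq_corr(y[:, 0], y[:, 1])    # largely removed by normalization
```

The residual energy correlation after normalization is close to zero, mirroring the paper's point that the nonlinear gain-control step, rather than the linear filter orientation, does the bulk of the redundancy reduction.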


Figure 1: MR signal acquisition: r-space and k-space representation of the signal on a rectangular grid as well as the trajectory obtained by means of magnetic field gradients 
Figure 4: Spirals found by our algorithm. The ordering is color-coded: dark spirals selected first. 
Bayesian Experimental Design of Magnetic Resonance Imaging Sequences

January 2008 · 119 Reads · 15 Citations

We show how improved sequences for magnetic resonance imaging can be found through automated optimization of Bayesian design scores. Combining recent advances in approximate Bayesian inference and natural image statistics with high-performance numerical computation, we propose the first scalable Bayesian experimental design framework for this problem of high relevance to clinical and brain research. Our solution requires approximate inference for dense, non-Gaussian models on a scale seldom addressed before. We propose a novel scalable variational inference algorithm, and show how powerful methods of numerical mathematics can be modified to compute primitives in our framework. Our approach is evaluated on a realistic setup with raw data from a 3T MR scanner.
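The greedy design loop can be caricatured in a linear-Gaussian setting. This is a deliberately simplified stand-in for the paper's non-Gaussian, large-scale model: the cosine "measurements", sizes, and noise level are all invented. Each step scores every candidate measurement by its expected information gain and updates the posterior covariance with a rank-one Sherman-Morrison formula.

```python
import numpy as np

d, noise_var = 32, 0.1

# Hypothetical stand-in for k-space measurement candidates: rows of a
# real Fourier-like matrix over a length-d signal.
t = np.arange(d)
cands = np.array([np.cos(2 * np.pi * f * t / d) for f in range(d)])

Sigma = np.eye(d)        # prior covariance of the signal coefficients
chosen = []
for _ in range(5):
    # Design score of measuring u: 0.5 * log(1 + u^T Sigma u / noise_var),
    # the expected entropy reduction for a linear-Gaussian observation.
    scores = np.array([0.5 * np.log1p(u @ Sigma @ u / noise_var)
                       for u in cands])
    k = int(np.argmax(scores))
    chosen.append(k)
    u = cands[k]
    # Rank-one posterior covariance update (Sherman-Morrison).
    Su = Sigma @ u
    Sigma = Sigma - np.outer(Su, Su) / (noise_var + u @ Su)
```

The greedy scores shrink along directions already measured, so the loop spreads its budget over informative, mutually orthogonal measurements; the paper's contribution is making this kind of scoring tractable for dense non-Gaussian image models.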


Learning Taxonomies by Dependence Maximization

January 2008 · 35 Reads · 9 Citations

We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-of-the-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data.
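Dependence maximization of this kind is commonly implemented with the Hilbert-Schmidt Independence Criterion (HSIC); the abstract does not name the exact statistic, so the sketch below is only a plausible instance. It measures the dependence between a Gaussian kernel on the data and a delta kernel on cluster labels, which is highest when the labels match the data's structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs as toy data.
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y_true = np.array([0] * 20 + [1] * 20)

def hsic(K, L):
    """Biased empirical HSIC between two kernel matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Gaussian kernel on the data; bandwidth from the median heuristic.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * np.median(sq)))

def label_kernel(labels):
    # Delta kernel: 1 if two points carry the same cluster label.
    return (labels[:, None] == labels[None, :]).astype(float)

good = hsic(K, label_kernel(y_true))             # correct grouping
bad = hsic(K, label_kernel(rng.permutation(y_true)))  # shuffled labels
```

Maximizing such a score over labelings (and, in the paper, jointly over a taxonomy structure relating the clusters) is what drives the clustering.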


Influence of graph construction on graph-based clustering measures

January 2008 · 117 Reads · 134 Citations

Graph clustering methods such as spectral clustering are defined for general weighted graphs. In machine learning, however, data is often not given in the form of a graph, but in terms of similarity (or distance) values between points. In this case, first a neighborhood graph is constructed using the similarities between the points, and then a graph clustering algorithm is applied to this graph. In this paper we investigate the influence of the construction of the similarity graph on the clustering results. We first study the convergence of graph clustering criteria such as the normalized cut (Ncut) as the sample size tends to infinity. We find that the limit expressions are different for different types of graph, for example the r-neighborhood graph or the k-nearest neighbor graph. In plain words: Ncut on a kNN graph does something systematically different than Ncut on an r-neighborhood graph! This finding shows that graph clustering criteria cannot be studied independently of the kind of graph they are applied to. We also provide examples which show that these differences can be observed on toy and real data already for rather small sample sizes.
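The practical point, that the same point set yields structurally different graphs under the two constructions, is easy to reproduce. The toy two-density example below is my own, not the paper's: with one global radius r, the sparse cluster falls apart while the dense one is over-connected, whereas the kNN graph keeps degrees roughly uniform.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two clusters of very different density: one tight, one spread out.
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(3, 1.0, (30, 2))])
n = len(X)
D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))

# r-neighborhood graph: connect every pair closer than r.
r = 0.5
A_r = (D < r) & ~np.eye(n, dtype=bool)

# Symmetric kNN graph: connect i and j if either is among the
# other's k nearest neighbors.
k = 5
knn = np.zeros((n, n), dtype=bool)
order = np.argsort(D, axis=1)
for i in range(n):
    knn[i, order[i, 1:k + 1]] = True   # position 0 is the point itself
A_knn = knn | knn.T

deg_r = A_r.sum(1)      # wildly unbalanced across the two clusters
deg_knn = A_knn.sum(1)  # roughly uniform by construction
```

Since criteria like Ncut are computed from these adjacency structures, the two constructions feed the same algorithm systematically different inputs, which is exactly the paper's warning.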


Understanding Brain Connectivity Patterns during Motor Imagery for Brain-Computer Interfacing

January 2008 · 58 Reads · 41 Citations

EEG connectivity measures could provide a new type of feature space for inferring a subject's intention in Brain-Computer Interfaces (BCIs). However, very little is known about EEG connectivity patterns for BCIs. In this study, EEG connectivity during motor imagery (MI) of the left and right hand is investigated in a broad frequency range across the whole scalp by combining Beamforming with Transfer Entropy and taking into account possible volume conduction effects. Observed connectivity patterns indicate that modulation intentionally induced by MI is strongest in the gamma-band, i.e., above 35 Hz. Furthermore, modulation between MI and rest is found to be more pronounced than between MI of different hands. This is in contrast to results on MI obtained with bandpower features, and might provide an explanation for the so far only moderate success of connectivity features in BCIs. It is concluded that future studies on connectivity-based BCIs should focus on high frequency bands and consider experimental paradigms that maximally vary cognitive demands between conditions.
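For intuition about the connectivity measure itself, here is a minimal plug-in transfer entropy estimator for binary time series. The paper works with Beamforming-projected EEG, not binary toys; the coupled signals, lag, and noise level here are invented. Transfer entropy from X to Y is the conditional mutual information I(Y_{t+1}; X_t | Y_t), so it detects directed influence.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50000

# Toy coupled binary signals: y follows x with a one-step lag plus noise.
x = rng.integers(0, 2, T)
flip = rng.random(T) < 0.1
y = np.empty(T, dtype=int)
y[0] = 0
y[1:] = np.where(flip[1:], 1 - x[:-1], x[:-1])

def transfer_entropy(src, dst):
    """Plug-in estimate of TE src -> dst (bits) for binary series, lag 1."""
    # Encode each (y_next, y_prev, x_prev) triple as an integer 0..7.
    trip = dst[1:] * 4 + dst[:-1] * 2 + src[:-1]
    p = np.bincount(trip, minlength=8) / len(trip)
    te = 0.0
    for state in range(8):
        if p[state] == 0:
            continue
        yn, yp, xp = state >> 2, (state >> 1) & 1, state & 1
        p_yx = p[yp * 2 + xp] + p[4 + yp * 2 + xp]          # p(y_prev, x_prev)
        p_cond_full = p[state] / p_yx                        # p(y_next | y_prev, x_prev)
        p_y = p[yp * 2] + p[yp * 2 + 1] + p[4 + yp * 2] + p[4 + yp * 2 + 1]
        p_cond_self = (p[yn * 4 + yp * 2] + p[yn * 4 + yp * 2 + 1]) / p_y
        te += p[state] * np.log2(p_cond_full / p_cond_self)
    return te

te_xy = transfer_entropy(x, y)   # substantial: x drives y
te_yx = transfer_entropy(y, x)   # near zero: no feedback
```

The asymmetry te_xy >> te_yx is what makes transfer entropy attractive as a directed connectivity feature; applying it per frequency band and source location is the harder, EEG-specific part of the study.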


Figure 4: Schematic showing model-based robot control. The learned dynamics model can be updated online using LGP.
Local Gaussian Process Regression for Real Time Online Model Learning and Control

January 2008 · 660 Reads · 128 Citations

Advances in Neural Information Processing Systems

Learning in real-time applications, e.g., online approximation of the inverse dynamics model for model-based robot control, requires fast online regression techniques. Inspired by local learning, we propose a method to speed up standard Gaussian Process regression (GPR) with local GP models (LGP). The training data is partitioned into local regions, and an individual GP model is trained for each. The prediction for a query point is performed by weighted estimation using nearby local models. Unlike other GP approximations, such as mixtures of experts, we use a distance-based measure for partitioning of the data and for weighted prediction. The proposed method achieves online learning and prediction in real-time. Comparisons with other nonparametric regression methods show that LGP has higher accuracy than LWPR and close to the performance of standard GPR and nu-SVR.


Fitted Q-iteration by Advantage Weighted Regression

January 2008 · 188 Reads · 29 Citations

Recently, fitted Q-iteration (FQI) based methods have become more popular due to their increased sample efficiency, a more stable learning process, and the higher quality of the resulting policy. However, these methods remain hard to use for continuous action spaces, which frequently occur in real-world tasks, e.g., in robotics and other technical applications. The greedy action selection commonly used for the policy improvement step is particularly problematic as it is expensive for continuous actions, can cause an unstable learning process, introduces an optimization bias, and results in highly non-smooth policies unsuitable for real-world systems. In this paper, we show that by using a soft-greedy action selection the policy improvement step used in FQI can be simplified to an inexpensive advantage-weighted regression. With this result, we are able to derive a new, computationally efficient FQI algorithm which can even deal with high-dimensional action spaces.
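The core reduction, policy improvement as a regression weighted by exponentiated advantages, can be illustrated on a toy continuous-action batch. The linear policy class, the synthetic advantage function, and the temperature are my own choices for the example, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: 1-D state s, 1-D continuous action a, sampled broadly.
# Suppose the (unknown) best action is a* = 2*s; the advantage of a
# sampled action falls off with its squared distance to a*.
s = rng.uniform(-1, 1, 2000)
a = rng.uniform(-3, 3, 2000)
advantage = -(a - 2 * s) ** 2

# Advantage-weighted regression: soft-greedy policy improvement becomes
# a weighted least-squares fit of the policy a = theta * s, with weights
# exp(A / tau). No argmax over the continuous action space is needed.
tau = 0.5
w = np.exp(advantage / tau)
theta = np.sum(w * s * a) / np.sum(w * s * s)
```

The fitted slope lands close to the optimal value 2: high-advantage state-action pairs dominate the regression, so the policy moves toward the soft-greedy action without any explicit continuous-action maximization.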


Using Bayesian Dynamical Systems for Motion Template Libraries

January 2008 · 75 Reads · 21 Citations

Motor primitives or motion templates have become an important concept both for modeling human motor control and for generating robot behaviors using imitation learning. Recent impressive results range from humanoid robot movement generation to timing models of human motions. The automatic generation of skill libraries containing multiple motion templates is an important step in robot learning. Such a skill learning system needs to cluster similar movements together and represent each resulting motion template as a generative model which is subsequently used for the execution of the behavior by a robot system. In this paper, we show how human trajectories captured as multidimensional time-series can be clustered using Bayesian mixtures of linear Gaussian state-space models based on the similarity of their dynamics. The appropriate number of templates is automatically determined by enforcing a parsimonious parametrization. As the resulting model is intractable, we introduce a novel approximation method based on variational Bayes, which is especially designed to enable the use of efficient inference algorithms. On recorded human Balero movements, this method is not only capable of finding reasonable motion templates but also yields a generative model which works well in the execution of this complex task on a simulated anthropomorphic SARCOS arm.
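As a greatly simplified stand-in for the variational mixture of linear Gaussian state-space models, the sketch below fits per-trajectory linear dynamics by least squares and clusters the estimated dynamics matrices. Even this crude version groups movements by how they evolve rather than where they are, which is the key idea; the templates, noise levels, and two-means step are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(A, T=200):
    # Noisy linear dynamical system x_{t+1} = A x_t + noise.
    x = np.zeros((T, 2))
    x[0] = rng.standard_normal(2)
    for t in range(T - 1):
        x[t + 1] = A @ x[t] + 0.05 * rng.standard_normal(2)
    return x

def rot(theta, decay=0.99):
    # Two motion "templates": damped rotations at different speeds.
    c, s_ = np.cos(theta), np.sin(theta)
    return decay * np.array([[c, -s_], [s_, c]])

trajs = [simulate(rot(0.1)) for _ in range(5)] + \
        [simulate(rot(0.5)) for _ in range(5)]

# Per-trajectory dynamics estimate: least-squares fit of x_{t+1} = A x_t.
feats = np.array([np.linalg.lstsq(x[:-1], x[1:], rcond=None)[0].T.ravel()
                  for x in trajs])

# Simple 2-means on the dynamics estimates stands in for the Bayesian
# mixture: trajectories cluster by their dynamics, not their positions.
c0, c1 = feats[0], feats[-1]
for _ in range(10):
    d0 = np.linalg.norm(feats - c0, axis=1)
    d1 = np.linalg.norm(feats - c1, axis=1)
    lab = (d1 < d0).astype(int)
    c0, c1 = feats[lab == 0].mean(0), feats[lab == 1].mean(0)
```

The paper's contribution over such a sketch is substantial: a fully Bayesian mixture with hidden states, automatic selection of the number of templates, and generative models good enough to execute the movements.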


Figure 2: Offline (leave-one-letter-out) 36-class prediction performance as a function of codeword length (i.e. the number of consecutive epochs of the left-out letter that were used to make a prediction). Performance values (and standard-error bar heights) are averaged across the 6 subjects.
Effects of Stimulus Type and of Error-Correcting Code Design on BCI Speller Performance

January 2008 · 108 Reads · 33 Citations

From an information-theoretic perspective, a noisy transmission system such as a visual Brain-Computer Interface (BCI) speller could benefit from the use of error-correcting codes. However, optimizing the code solely according to the maximal minimum-Hamming-distance criterion tends to lead to an overall increase in the frequency of target stimuli, and hence a significantly reduced average target-to-target interval (TTI), leading to difficulties in classifying the individual event-related potentials (ERPs) due to overlap and refractory effects. Clearly any change to the stimulus setup must also respect the possible psychophysiological consequences. Here we report new EEG data from experiments in which we explore stimulus types and codebooks in a within-subject design, finding an interaction between the two factors. Our data demonstrate that the traditional row-column code has particular spatial properties that lead to better performance than one would expect from its TTIs and Hamming distances alone, but nonetheless error-correcting codes can improve performance provided the right stimulus type is used.
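The tension described above is easy to quantify. The sketch below uses the standard 6x6 speller layout; the random comparison codebook is my own, not one of the paper's optimized codes. The row-column code has a small minimum Hamming distance but also a low target rate (hence long TTIs), while a denser codebook raises both:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Row-column speller: 36 targets on a 6x6 grid, one codeword bit per
# row/column flash, so 12 flashes per stimulation cycle.
rc = np.zeros((36, 12), dtype=int)
for t in range(36):
    rc[t, t // 6] = 1          # this target's row flash
    rc[t, 6 + t % 6] = 1       # this target's column flash

def min_hamming(code):
    # Smallest pairwise Hamming distance between codewords.
    return min(int(np.sum(a != b))
               for a, b in itertools.combinations(code, 2))

# A random dense codebook of the same length, for comparison.
dense = rng.integers(0, 2, (36, 12))

d_rc = min_hamming(rc)   # 2: letters sharing a row or column
rate_rc = rc.mean()      # fraction of flashes that hit the target (1/6)
rate_dense = dense.mean()  # roughly 1/2: far more frequent targets
```

A dense code can buy Hamming distance only by flashing each target far more often, which compresses the TTIs and degrades the single-trial ERPs, exactly the trade-off the paper studies experimentally.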


Citations (46)


... The k-means grouping algorithm was invented in 1967 by Macqueen, later improved by Hartigan [14]. Bottou and Bengio explored the coexistence of k-means computations and algorithms [15]. While the core k-means algorithm operates within memory, it can be extended to handle datasets that exceed memory capacity [15]. ...

Reference:

DG-means: a superior greedy algorithm for clustering distributed data
Convergence properties of the k-means algorithm
  • Citing Article
  • January 1994

... In other words, an adversarial client in DFL can receive model gradients from its neighbors, and infer the sender's raw training data from these received gradients [38]. For instance, for a simple differentiable model, the gradients directly correspond to the multiples of raw data, thus enabling a straightforward inference of the raw data from the received gradients [15]. For more complex models, the adversarial client can simulate the fake data and gradients to reversely approximate the raw data by minimizing the discrepancy between the false gradients and truly received gradients (a.k.a. ...

Gradient-based learning applied to document recognition
  • Citing Article
  • January 1998

... Reducing updates at historically important synapses is one potential approach to determining which synapses should have their strengths adjusted and which should be stabilised. Adjusting learning rates based on synapse importance enables fast, stable learning (LeCun et al., 2002;Kingma and Ba, 2014;Khan et al., 2018;Aitchison, 2020;Martens, 2020;Jegminat et al., 2022). ...

In Neural Networks - Tricks of the Trade
  • Citing Article
  • January 1998

... The first TDNN model was proposed to recognize phonemes [62]. Then, TDNNs were utilized for recognizing the spoken word [63] and handwriting [64], enabling the acoustic model to learn the temporal dynamics of the speech signal using short-term acoustic feature vectors. Moreover, it uses sub-sampling for reducing computation in training. ...

Speaker-independent isolated digit recognition: Multilayer perceptrons vs. Dynamic time warping

Neural Networks

... Recent work has highlighted limitations in the original Invariant Risk Minimization (IRM) framework, particularly in nonlinear settings where deep models tend to overfit (Rosenfeld, Ravikumar, and Risteski 2021). To address this, we included Bayesian Invariant Risk Minimization (BIRM) as a baseline, which has been shown to alleviate overfitting issues by incorporating Bayesian inference and thereby improving generalization in nonlinear scenarios (Lin et al. 2022). ...

Learning Algorithms For Classification: A Comparison On Handwritten Digit Recognition
  • Citing Conference Paper
  • January 1995

... In the realm of handwritten digit recognition, traditional machine learning methods primarily focus on these two aspects [1]- [6]. Several classification methods have been explored, including hidden Markov models (HMM) [2]-[4], k-nearest neighbors (KNN) [3], [6], neural networks [1], [2], and support vector machines (SVM) [1], [6]- [9]. These methods have shown higher performance compared to their modern counterparts, but they may struggle with robustness and computational complexity. ...

Comparison of learning algorithms for handwritten digit recognition

... The corresponding theory is formalized in 1991, and Bottou [1991] (updated in Bottou [1998]) provides a proof for the convergence of SGD towards extremal points (see Graph Transformer networks and general architectures. In 1997, neural networks have successful industrial applications in handwritten character recognition, and they are used for automatically reading checks in the USA with a system described in LeCun et al. [1997]. The Graph Transformer Networks (GTN) approach consists of building a whole system with arbitrarily complicated parameterized modules, as long as they are differentiable. ...

Reading Checks with graph transformer networks
  • Citing Conference Paper
  • January 1997

... Digipaper [22] and DjVu [23,24] are two image compression techniques that are particularly geared towards compressing a colour document image. The basic idea behind Digipaper and DjVu is to separate the text from the background and to use different techniques to compress those components. ...

Browsing through High Quality Document Images with DjVu

... Various coding standards have been developed and are widely used in a variety of applications, such as MPEG-1/2/4 (Moving Picture Experts Group), H.261/2/3, and H.264/AVC (Advanced Video Coding) [11], as well as AVS (Audio and Video Coding Standard in China) [12][13][14][15], H.265/HEVC (High Efficiency Video Coding) [16], and H.266/VVC (Versatile Video Coding) [17]. In [11,[16][17][18][19][20][21][22][23][24][25][26], the traditional hybrid coding methods have been well reviewed from the historical pulse code modulation (PCM), DPCM coding to HEVC, three-dimensional video (3DV) coding, and VVC. ...

Image and Video Coding - Emerging Standards and Beyond
  • Citing Article
  • January 1998