Léon Bottou’s research while affiliated with Meta and other places


Publications (135)


Active Self-Supervised Learning: A Few Low-Cost Relationships Are All You Need
  • Conference Paper

October 2023 · 74 Reads · 12 Citations

Leon Bottou


Birth of a Transformer: A Memory Viewpoint

June 2023 · 43 Reads

Alberto Bietti · Diane Bouchacourt · [...] · Leon Bottou

Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We study how transformers balance these two types of knowledge by considering a synthetic setup where tokens are generated from either global or context-specific bigram distributions. By a careful empirical analysis of the training process on a simplified two-layer transformer, we illustrate the fast learning of global bigrams and the slower development of an "induction head" mechanism for the in-context bigrams. We highlight the role of weight matrices as associative memories, provide theoretical insights on how gradients enable their learning during training, and study the role of data-distributional properties.
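As an illustration of the synthetic setup sketched in this abstract, the toy generator below is one possible reading of it (not the authors' exact code): most transitions follow a fixed global bigram table, while a few "trigger" tokens are re-paired, within each sequence, with a successor that is specific to that sequence. The vocabulary size, sequence length, and number of triggers are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, K = 64, 256, 4  # vocabulary size, sequence length, number of trigger tokens

# Global bigram table shared by all sequences: row i holds p(next token | current token = i).
global_bigram = rng.dirichlet(np.ones(V), size=V)

def sample_sequence():
    triggers = rng.choice(V, size=K, replace=False)  # tokens with context-specific successors
    outputs = rng.choice(V, size=K)                  # their per-sequence successors
    lookup = dict(zip(triggers.tolist(), outputs.tolist()))
    seq = [int(rng.integers(V))]
    for _ in range(T - 1):
        cur = seq[-1]
        if cur in lookup:                            # in-context (sequence-specific) bigram
            seq.append(lookup[cur])
        else:                                        # global bigram
            seq.append(int(rng.choice(V, p=global_bigram[cur])))
    return np.array(seq)

sequence = sample_sequence()
```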


Figure 7. Comparison of the contrastive (Gij ∈ {−1, 0, 1}) and non-contrastive (Gij ∈ {0, 1}) variants of VICReg with N = 300. The setting is the same as in Fig. 3 with Algorithm 3. Distinguishing between negative pairs and unknown pairs proves useful, although the contrastive method shows some instability when few entries are known.
Active Self-Supervised Learning: A Few Low-Cost Relationships Are All You Need
  • Preprint
  • File available

March 2023 · 268 Reads · 1 Citation

Self-Supervised Learning (SSL) has emerged as the solution of choice to learn transferable representations from unlabeled data. However, SSL requires building samples that are known to be semantically akin, i.e., positive views. Requiring such knowledge is the main limitation of SSL and is often tackled by ad hoc strategies, e.g., applying known data augmentations to the same input. In this work, we generalize and formalize this principle through Positive Active Learning (PAL), where an oracle queries semantic relationships between samples. PAL achieves three main objectives. First, it unveils a theoretically grounded learning framework beyond SSL that can be extended to tackle supervised and semi-supervised learning, depending on the employed oracle. Second, it provides a consistent algorithm to embed a priori knowledge, e.g., some observed labels, into any SSL loss without any change in the training pipeline. Third, it provides a proper active learning framework yielding low-cost solutions for annotating datasets, arguably bridging the gap between the theory and practice of active learning by relying on queries about semantic relationships between inputs that are simple for non-experts to answer.
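A toy sketch of the oracle idea described above, under my own assumptions: the hypothetical `label_oracle` answers pairwise queries from hidden labels (in practice it could be a human annotator), and its answers populate a mostly unknown relation matrix G of the kind an SSL loss can consume, matching the {−1, 0, 1} convention of Figure 7.

```python
import numpy as np

def label_oracle(labels, i, j):
    """Answer +1 for a semantically positive pair, -1 for a negative pair."""
    return 1 if labels[i] == labels[j] else -1

def query_relations(labels, n_queries, rng):
    """Fill a sparse relation matrix with a few low-cost oracle answers (0 = unknown)."""
    N = len(labels)
    G = np.zeros((N, N))
    for _ in range(n_queries):
        i, j = rng.choice(N, size=2, replace=False)
        G[i, j] = G[j, i] = label_oracle(labels, i, j)
    return G

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=300)          # hidden ground-truth classes (illustration only)
G = query_relations(labels, n_queries=1000, rng=rng)
```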


Recycling diverse models for out-of-distribution generalization

December 2022 · 9 Reads

Foundation models are redefining how AI systems are built. Practitioners now follow a standard procedure to build their machine learning solutions: download a copy of a foundation model, and fine-tune it using some in-house data about the target task of interest. Consequently, the Internet is swarmed with fine-tunings of a handful of foundation models, each specialized to a different task. Yet, these individual fine-tunings often lack strong generalization and exist in isolation without benefiting from each other. In our opinion, this is a missed opportunity, as these specialized models contain diverse features. Based on this insight, we propose model recycling, a simple strategy that leverages multiple fine-tunings of the same foundation model on diverse auxiliary tasks, and repurposes them as rich and diverse initializations for the target task. Specifically, model recycling fine-tunes each specialized model in parallel on the target task, and then averages the weights of all target fine-tunings into a final model. Empirically, we show that model recycling maximizes model diversity by benefiting from diverse auxiliary tasks, and achieves a new state of the art on the reference DomainBed benchmark for out-of-distribution generalization. Looking forward, model recycling is a contribution to the emerging paradigm of updatable machine learning where, akin to open-source software development, the community collaborates to incrementally and reliably update machine learning models.
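A minimal sketch of the averaging step described above, assuming the fine-tunings on the target task have already been produced elsewhere and share the same architecture and state_dict layout; `average_weights` and the `models` argument are names introduced here for illustration, not the paper's code.

```python
import copy
import torch

def average_weights(models):
    """Return a new model whose parameters are the uniform average of `models`."""
    averaged = copy.deepcopy(models[0])
    state = averaged.state_dict()
    with torch.no_grad():
        for key, value in state.items():
            # Average the corresponding tensor across all fine-tunings,
            # then cast back to the original dtype (e.g. for integer buffers).
            stacked = torch.stack([m.state_dict()[key].float() for m in models])
            state[key] = stacked.mean(dim=0).to(value.dtype)
    averaged.load_state_dict(state)
    return averaged
```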


Test accuracy on the CAMELYON17 dataset with DENSENET121. We compare various initializations (ERM, CATn, DISTILLn, and RFC) for two algorithms, VREX and ERM, using either the IID or OOD hyperparameter tuning method. Standard deviations over 5 runs are reported.
ImageNet supervised transfer learning performance on a deep RESNET152 architecture.
Linear probing accuracy of SIMSIAM (Chen and He, 2020) CIFAR10 learned representations on CIFAR100, CIFAR10, CIFAR100 (1%), and CIFAR10 (10%) tasks. CATn concatenates n learned representations before linear probing. DISTILLn distills n learned representations into a RESNET18 before linear probing. RESNET18Wn contains around n² times as many parameters as RESNET18.
Learning useful representations for shifting tasks and distributions

December 2022 · 123 Reads

Does the dominant approach to learning representations (as a side effect of optimizing an expected cost for a single training distribution) remain a good approach when we are dealing with multiple distributions? Our thesis is that such scenarios are better served by representations that are "richer" than those obtained with a single optimization episode. This is supported by a collection of empirical results obtained with an apparently naïve ensembling technique: concatenating the representations obtained from multiple training episodes using the same data, model, algorithm, and hyper-parameters, but different random seeds. These independently trained networks perform similarly. Yet, in a number of scenarios involving new distributions, the concatenated representation performs substantially better than an equivalently sized network trained from scratch. This proves that the representations constructed by multiple training episodes are in fact different. Although their concatenation carries little additional information about the training task under the training distribution, it becomes substantially more informative when tasks or distributions change. Meanwhile, a single training episode is unlikely to yield such a redundant representation because the optimization process has no reason to accumulate features that do not incrementally improve the training performance.
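A sketch of the ensembling technique described above, under assumptions: several backbones trained with different seeds are frozen, their penultimate features concatenated, and a linear probe is trained on top for the new task. `ConcatProbe`, `feat_dim`, and the assumption that each backbone maps a batch to a (batch, feat_dim) feature tensor are illustrative choices, not the paper's code.

```python
import torch
import torch.nn as nn

class ConcatProbe(nn.Module):
    """Linear probe on the concatenation of several frozen representations."""

    def __init__(self, backbones, feat_dim, n_classes):
        super().__init__()
        self.backbones = nn.ModuleList(backbones)
        for p in self.backbones.parameters():
            p.requires_grad_(False)                          # keep the learned representations fixed
        self.head = nn.Linear(feat_dim * len(backbones), n_classes)

    def forward(self, x):
        with torch.no_grad():
            z = torch.cat([b(x) for b in self.backbones], dim=1)  # concatenated representation
        return self.head(z)
```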


Distributions of the ratio of theoretical to actual scaling for a strided LeNet network. The ratios are close to the ideal value of 1, indicating good agreement between theory and practice.
Average singular-value heat maps for the strided LeNet model, where each square represents a block of the Hessian, with blocking at the level of weight matrices (biases omitted). Geometric initialization maintains approximately constant block-diagonal weights. The scale runs from yellow (larger) through green to blue (smaller).
Training loss comparison across 26 datasets from the LIBSVM repository.
CIFAR-10 training loss for a strided AlexNet architecture. The median and the 25%-75% IQR over 40 seeds are shown for each initialization, using a sliding window of minibatch training loss over 400 steps for each seed.
A scaling calculus for the design and initialization of ReLU networks

September 2022 · 51 Reads · 1 Citation

Neural Computing and Applications

We propose a system for calculating a “scaling constant” for layers and weights of neural networks. We connect this scaling constant to two important quantities that govern the optimizability of neural networks, and argue that a network that is “preconditioned” via scaling, in the sense that all weights have the same scaling constant, will be easier to train. This scaling calculus has a number of consequences, among them the fact that the geometric mean of the fan-in and fan-out, rather than the fan-in, the fan-out, or their arithmetic mean, should be used to initialize the variance of weights in a neural network. Our system allows for the offline design and engineering of ReLU (Rectified Linear Unit) neural networks, potentially replacing blind experimentation. We verify the effectiveness of our approach on a set of benchmark problems.
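A sketch of the geometric-mean initialization suggested by the abstract, written for a fully connected ReLU layer. The sqrt(2) gain is my assumption, borrowed from the usual ReLU (He) initialization; the paper's own constants and scope may differ.

```python
import math
import torch
import torch.nn as nn

def geometric_init_(linear: nn.Linear, gain: float = math.sqrt(2.0)) -> None:
    """Initialize weights with Var(w) = gain^2 / sqrt(fan_in * fan_out)."""
    fan_out, fan_in = linear.weight.shape          # nn.Linear stores (out_features, in_features)
    std = gain / (fan_in * fan_out) ** 0.25        # standard deviation from the geometric mean
    with torch.no_grad():
        linear.weight.normal_(0.0, std)
        if linear.bias is not None:
            linear.bias.zero_()

layer = nn.Linear(256, 1024)
geometric_init_(layer)
```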


On the Relation between Distributionally Robust Optimization and Data Curation (Student Abstract)

June 2022 · 7 Reads

Proceedings of the AAAI Conference on Artificial Intelligence

Machine learning systems based on minimizing average error have been shown to perform inconsistently across notable subsets of the data, a problem that a low average error over the entire dataset does not expose. In consequential social and economic applications, where data represent people, this can lead to discrimination against underrepresented gender and ethnic groups. Distributionally Robust Optimization (DRO) seemingly addresses this problem by minimizing the worst expected risk across subpopulations. We establish theoretical results that clarify the relation between DRO and the optimization of the same loss averaged on an adequately weighted training dataset. A practical implication of our results is that neither DRO nor curating the training set should be construed as a complete solution for bias mitigation.


The Effects of Regularization and Data Augmentation are Class Dependent

April 2022 · 398 Reads · 1 Citation

Regularization is a fundamental technique to prevent over-fitting and to improve generalization performance by constraining a model's complexity. Current deep networks heavily rely on regularizers such as data augmentation (DA) or weight decay, and employ structural risk minimization, i.e., cross-validation, to select the optimal regularization hyper-parameters. In this study, we demonstrate that techniques such as DA or weight decay produce a model with a reduced complexity that is unfair across classes. The optimal amount of DA or weight decay found by cross-validation leads to disastrous model performance on some classes; e.g., on ImageNet with a ResNet50, the "barn spider" classification test accuracy falls from 68% to 46% simply by introducing random-crop DA during training. Even more surprising, such performance drops also appear when introducing uninformative regularization techniques such as weight decay. These results demonstrate that our search for ever-increasing generalization performance, averaged over all classes and samples, has left us with models and regularizers that silently sacrifice performance on some classes. This scenario can become dangerous when deploying a model on downstream tasks, e.g., an ImageNet pre-trained ResNet50 deployed on iNaturalist sees its performance on class #8889 fall from 70% to 30% when random-crop DA is introduced during the ImageNet pre-training phase. These results demonstrate that designing novel regularizers without class-dependent bias remains an open research question.
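A sketch of the per-class diagnostic this kind of study relies on: overall test accuracy can stay flat while individual classes collapse, so accuracy has to be broken down class by class. The `model`, `loader`, and `per_class_accuracy` names are assumptions made for this illustration; the loader is assumed to yield (input, label) batches.

```python
import torch

@torch.no_grad()
def per_class_accuracy(model, loader, n_classes, device="cpu"):
    """Return a tensor of test accuracies, one entry per class."""
    correct = torch.zeros(n_classes)
    total = torch.zeros(n_classes)
    model.eval()
    for x, y in loader:
        pred = model(x.to(device)).argmax(dim=1).cpu()
        for c in range(n_classes):
            mask = y == c
            total[c] += mask.sum()
            correct[c] += (pred[mask] == y[mask]).sum()
    return correct / total.clamp(min=1)   # avoid division by zero for absent classes
```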


Figure 1. Test performance of nine penalized OoD methods as a function of the number of epochs used to pre-train the neural network with ERM. The final OoD test performance depends strongly on choosing the right number of pretraining epochs, illustrating the challenges of these optimization problems.
Figure 2. Test performance of OoD methods as a function of training epochs. Top: six OoD methods are trained from a 'perfect' initialization where only the robust feature is well learned. The blue star indicates the initial test accuracy. Bottom: the OoD methods are trained from the proposed (frozen) RFC representation.
Test accuracy of OoD methods (IRMv1, vREx) and ERM. Three synthesis-phase L2 weight decays {1e-2, 1e-4, 1e-6} are tested. All other settings are the same as for the main results in Table 3.
Rich Feature Construction for the Optimization-Generalization Dilemma

March 2022 · 34 Reads

There is often a dilemma between ease of optimization and robust out-of-distribution (OoD) generalization. For instance, many OoD methods rely on penalty terms whose optimization is challenging: they are either too strong to optimize reliably or too weak to achieve their goals. In order to escape this dilemma, we propose to first construct a rich representation (RFC) containing a palette of potentially useful features, ready to be used by even simple models. On the one hand, a rich representation provides a good initialization for the optimizer. On the other hand, it also provides an inductive bias that helps OoD generalization. RFC is constructed in a succession of training episodes. During each step of the discovery phase, we craft a multi-objective optimization criterion and its associated datasets in a manner that prevents the network from using the features constructed in the previous iterations. During the synthesis phase, we use knowledge distillation to force the network to simultaneously develop all the features identified during the discovery phase. RFC consistently helps six OoD methods achieve top performance on the challenging invariant training benchmark ColoredMNIST (Arjovsky et al., 2020). Furthermore, on the realistic Camelyon17 task, our method helps both OoD and ERM methods outperform earlier comparable results by at least 5%, reduces standard deviation by at least 4.1%, and makes hyperparameter tuning and model selection more reliable.
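A minimal sketch of the synthesis phase as described above, under my own simplifying assumptions: several frozen discovery-phase networks expose feature extractors, and a single student (with one small projection head per teacher) is trained to reproduce all of their features at once. The names `synthesis_step`, `student`, `heads`, and `teachers` are introduced here for illustration and are not the authors' code.

```python
import torch
import torch.nn.functional as F

def synthesis_step(student, heads, teachers, x, optimizer):
    """One distillation step: the student must match every teacher's features."""
    with torch.no_grad():
        targets = [teacher(x) for teacher in teachers]    # frozen discovery-phase features
    z = student(x)                                        # shared student representation
    loss = sum(F.mse_loss(head(z), t) for head, t in zip(heads, targets))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```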


On Distributionally Robust Optimization and Data Rebalancing

February 2022 · 525 Reads · 5 Citations

Machine learning systems based on minimizing average error have been shown to perform inconsistently across notable subsets of the data, a problem that a low average error over the entire dataset does not expose. Distributionally Robust Optimization (DRO) seemingly addresses this problem by minimizing the worst expected risk across subpopulations. We establish theoretical results that clarify the relation between DRO and the optimization of the same loss averaged on an adequately weighted training dataset. The results cover finite and infinite numbers of training distributions, as well as convex and non-convex loss functions. An implication of our results is that for each DRO problem there exists a data distribution such that learning this distribution is equivalent to solving the DRO problem. Yet, important problems that DRO seeks to address (for instance, adversarial robustness and fighting bias) cannot be reduced to finding a single 'unbiased' dataset. Our discussion section addresses this important discrepancy.
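The relation discussed in this abstract, written out in symbols as my own paraphrase: the notation (training distributions P_1, ..., P_n, loss ℓ, probability simplex Δ_n, and optimal weights q*) is introduced here for illustration.

```latex
% Group DRO minimizes the worst expected risk over subpopulations, i.e.
\min_\theta \; \max_{q \in \Delta_n} \; \sum_{i=1}^{n} q_i \, \mathbb{E}_{P_i}\!\big[\ell(\theta)\big],
% and the abstract states that this coincides with ordinary risk minimization
% under a single adequately reweighted data distribution
% \sum_{i} q_i^{*} P_i for some weights q^{*} \in \Delta_n.
```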


Citations (76)


... In contrast, models like TA-VAAL (Kim et al., 2021) and SRAAL incorporate class-conditional relationships, surpassing task-agnostic methods like VAAL (Sinha et al., 2019) and LL4AL (Yoo & Kweon, 2019). Despite these advancements, current methods under-utilize available unlabeled data for task learning (Gao et al., 2020; Cabannes et al., 2023), missing the opportunity to enhance visual representation and model performance. Self-supervised learning (Gidaris et al., 2018; Chen & He, 2021; Doersch et al., 2015; Noroozi & Favaro, 2016) exploits unlabeled data for intermediate representation learning, producing rich and semantically meaningful representations without requiring knowledge of the annotation space (distribution). ...

Reference:

ADROIT: A Self-Supervised Framework for Learning Robust Representations for Active Learning
Active Self-Supervised Learning: A Few Low-Cost Relationships Are All You Need
  • Citing Conference Paper
  • October 2023

... MoBYv2AL [102] incorporates self-supervised learning algorithms (MoBY) into the DeepAL process, and jointly optimizes a task-aware objective function with a contrastive loss. Cabannes et al. [103] proposed a unified learning framework that combines semi-supervised, self-supervised, and supervised learning based on the concept of a similarity graph, introducing a strategy called positive active learning to select samples based on relative similarity comparisons rather than absolute sample labeling. ...

Active Self-Supervised Learning: A Few Low-Cost Relationships Are All You Need

... The goal of a good initialisation strategy is to produce similar statistics in every layer. This can be done in the forward pass (LeCun et al., 1998; Mishkin and Matas, 2016; Klambauer et al., 2017; Chang et al., 2020) or during back-propagation (Glorot and Bengio, 2010; Hoedt et al., 2018; Defazio and Bottou, 2021). It is also important to account for the effects due to non-linearities in the network (Saxe et al.). ...

A scaling calculus for the design and initialization of ReLU networks

Neural Computing and Applications

... Differences in CTC-based training losses due to length, speaker, and acoustics may lead to varying magnitudes and irreducible components of losses across different groups. As a result, some groups with disproportionately high losses may dominate training with group DRO, causing under-training of the other groups, and ultimately negatively impacting overall downstream performance (Słowik & Bottou, 2022). ...

On Distributionally Robust Optimization and Data Rebalancing

... In this study, our experiment was conducted on an Intel Core i7-10700 CPU with 16GB RAM. In addition, we used the umap-learn library to implement our method and partially reused code from the literature (Klimovskaia et al., 2020) to produce the visualization. Note that we did not vary λ and k; these parameters were fixed at λ = 0.1 and k = 20. ...

Poincaré maps for analyzing complex hierarchies in single-cell data

... As the study progresses, it delves into the application of deep learning techniques, such as CNNs and RNNs, in the realm of audio processing. The exploration of coupled deep learning frameworks, such as HR-LSTM [47], Dense-U-Net [55][56][57], Wave-U-Net [50][51][52][53][54], Conv-Tasnet [62][63][64][65], Res-U-Net [58,59], and LRCN [49], provides a holistic view of the cutting-edge approaches employed in this field. Moreover, the evaluation of combined architectures, such as DenseLSTM [60,61] and Audio Spectrogram Transformer [77], underscores the potential for increased efficiency and efficacy in separating instrumental acoustics. ...

Music Source Separation in the Waveform Domain
  • Citing Preprint
  • November 2019

... The invariant features are informative about the labels and shared across domains, which can enable model generalization beyond training domains [14]. A notable approach in learning invariant features is the invariant risk minimization (IRM) [15]. The key idea of IRM is penalizing a dummy classifier to be simultaneously optimal on multiple training domains. ...

Invariant Risk Minimization
  • Citing Preprint
  • July 2019

... 8 Instead, data typically have continuous relationships, more akin to a distance or similarity, than a binary connection. Even in studies that have worked with data in this form, 9,10 there is no clear prescription for determining the curvature or dimension of the underlying space. Both are important geometric parameters for interpreting continuous maps obtained from discrete data. ...

Poincare Maps for Analyzing Complex Hierarchies in Single-Cell Data