Samy Bengio’s research while affiliated with Mountain View College and other places


Publications (430)


Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions
  • Preprint

February 2025 · 5 Reads

Yizhe Zhang · Richard Bai · Zijin Gu · [...] · Navdeep Jaitly

Language models usually use left-to-right (L2R) autoregressive factorization, but L2R factorization is not necessarily the best inductive bias. We therefore investigate whether alternative factorizations of the text distribution can be beneficial for some tasks, studying right-to-left (R2L) training as a compelling alternative and focusing on multiple-choice questions (MCQs) as a test bed for knowledge extraction and reasoning. Through extensive experiments across various model sizes (2B-8B parameters) and training datasets, we find that R2L models can significantly outperform L2R models on several MCQ benchmarks, including logical reasoning, commonsense understanding, and truthfulness assessment tasks. Our analysis reveals that this performance difference may be fundamentally linked to multiple factors, including calibration, computability, and directional conditional entropy. We ablate the impact of these factors through controlled simulation studies on arithmetic tasks, where they can be better disentangled. Our work demonstrates that exploring alternative factorizations of the text distribution can improve LLM capabilities, and it provides theoretical insight into which factorization best approximates the human language distribution and when each reasoning order might be more advantageous.
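To make the two scoring directions concrete, here is a minimal sketch of how an MCQ could be scored under each factorization; the `lm.logprob` interface, the helper names, and the lack of length normalization are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical interface: lm.logprob(next_token, context) returns the model's
# log-probability of `next_token` given the preceding tokens in `context`.
# An R2L model is trained on reversed text, so it conditions on what follows.

def sequence_logprob(lm, tokens):
    """Sum of per-token log-probabilities under an autoregressive model."""
    return sum(lm.logprob(tok, tokens[:i]) for i, tok in enumerate(tokens))

def score_l2r(l2r_lm, question_tokens, choice_tokens):
    # Forward (L2R) scoring: log P(answer | question).
    return sequence_logprob(l2r_lm, question_tokens + choice_tokens) \
         - sequence_logprob(l2r_lm, question_tokens)

def score_r2l(r2l_lm, question_tokens, choice_tokens):
    # Backward (R2L) scoring: log P(question | answer), computed by an R2L
    # model on the reversed sequence (answer tokens first, then question).
    rev = list(reversed(question_tokens + choice_tokens))
    rev_choice = list(reversed(choice_tokens))
    return sequence_logprob(r2l_lm, rev) - sequence_logprob(r2l_lm, rev_choice)

def pick_answer(score_fn, lm, question_tokens, choices):
    # Select the choice with the highest conditional score.
    return max(range(len(choices)),
               key=lambda i: score_fn(lm, question_tokens, choices[i]))
```

Under L2R the choice is conditioned on the question; under R2L the question is conditioned on the choice, which is where the directional conditional entropy argument in the abstract enters.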


Visual Scratchpads: Enabling Global Reasoning in Vision

October 2024 · 1 Read

Modern vision models have achieved remarkable success in benchmarks where local features provide critical information about the target. There is now growing interest in solving tasks that require more global reasoning, where local features offer no significant information. These tasks are reminiscent of the connectivity tasks discussed by Minsky and Papert in 1969, which exposed the limitations of the perceptron model and contributed to the first AI winter. In this paper, we revisit such tasks by introducing four global visual benchmarks involving path finding and mazes. We show that: (1) although today's large vision models largely surpass the expressivity limitations of the early models, they still struggle with learning efficiency; we put forward the notion of "globality degree" to understand this limitation; (2) we then demonstrate that the picture changes and global reasoning becomes feasible with the introduction of "visual scratchpads"; similarly to the text scratchpads and chains of thought used in language models, visual scratchpads help break down global tasks into simpler ones; (3) we finally show that some scratchpads are better than others; in particular, "inductive scratchpads" that take steps relying on less information afford better out-of-distribution generalization and succeed at smaller model sizes.
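The scratchpad idea can be made concrete on a connectivity task. The sketch below is purely illustrative (it is ordinary frontier propagation, not the paper's models): a global reachability query becomes a sequence of local updates when a "reached" mask is maintained as a scratchpad and expanded one step per iteration, the visual analogue of a chain of thought.

```python
import numpy as np

def reachable(free, start, goal, max_steps=None):
    """free: 2D bool array of traversable cells; start/goal: (row, col) tuples."""
    h, w = free.shape
    reached = np.zeros_like(free, dtype=bool)
    reached[start] = True
    max_steps = max_steps or h * w
    for _ in range(max_steps):                     # each iteration is a purely local step
        up    = np.roll(reached,  1, axis=0); up[0, :]     = False
        down  = np.roll(reached, -1, axis=0); down[-1, :]  = False
        left  = np.roll(reached,  1, axis=1); left[:, 0]   = False
        right = np.roll(reached, -1, axis=1); right[:, -1] = False
        new = reached | ((up | down | left | right) & free)   # grow the frontier by one cell
        if new[goal]:
            return True
        if np.array_equal(new, reached):           # frontier stopped growing: unreachable
            return False
        reached = new
    return bool(reached[goal])
```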


Figure 6: The average of the maximum and average distance in directed random graphs with n = 128 nodes and a varying number of edges.
Figure 7: Performance of a model trained on a balanced distribution of random graphs with 24 nodes and edges, where with probability 0.5 the query nodes are not connected and with probability 0.5 they are connected, with their distance selected uniformly from {1, 2, 3, 4}. The validation set has the same distribution as the training set, and the model reaches around 80% accuracy on in-distribution samples: it has perfect accuracy on connected nodes (distance 1-4) and around 60% accuracy on nodes that are not connected. However, when tested on OOD samples (where some spurious correlations are not present), the model shows chance-level performance. Note that these samples would be of low complexity if the model were actually checking whether a path exists.
How Far Can Transformers Reason? The Locality Barrier and Inductive Scratchpad
  • Preprint
  • File available

June 2024 · 24 Reads

Can Transformers predict new syllogisms by composing established ones? More generally, what type of targets can be learned by such models from scratch? Recent works show that Transformers can be Turing-complete in terms of expressivity, but this does not address the learnability objective. This paper puts forward the notion of 'distribution locality' to capture when weak learning is efficiently achievable by regular Transformers, where the locality measures the least number of tokens required in addition to the tokens histogram to correlate nontrivially with the target. As shown experimentally and theoretically under additional assumptions, distributions with high locality cannot be learned efficiently. In particular, syllogisms cannot be composed on long chains. Furthermore, we show that (i) an agnostic scratchpad cannot help to break the locality barrier, (ii) an educated scratchpad can help if it breaks the locality at each step, (iii) a notion of 'inductive scratchpad' can both break the locality and improve the out-of-distribution generalization, e.g., generalizing to almost double input size for some arithmetic tasks.
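As an illustration of the inductive-scratchpad idea for an arithmetic task, the sketch below writes an addition as a sequence of scratchpad states where each state is computable from the previous one alone, so the model can learn a single re-applied step rather than one long global jump. The exact token format is an assumption here, not the paper's.

```python
def addition_scratchpad(a: str, b: str) -> list[str]:
    """Return the sequence of scratchpad states for a + b (decimal strings)."""
    a, b = a[::-1], b[::-1]                      # process least-significant digit first
    states, carry, result = [], 0, ""
    for i in range(max(len(a), len(b))):
        da = int(a[i]) if i < len(a) else 0
        db = int(b[i]) if i < len(b) else 0
        s = da + db + carry
        carry, result = s // 10, str(s % 10) + result
        # Each state exposes only what the next step needs: position, carry, partial sum.
        states.append(f"i={i+1} carry={carry} partial={result}")
    if carry:
        result = str(carry) + result
        states.append(f"done carry=0 partial={result}")
    return states

# Example: addition_scratchpad("57", "68") ->
# ['i=1 carry=1 partial=5', 'i=2 carry=1 partial=25', 'done carry=0 partial=125']
```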


Transformers learn through gradual rank increase

June 2023 · 4 Reads

We identify incremental learning dynamics in transformers, where the difference between trained and initial weights progressively increases in rank. We rigorously prove that this occurs under the simplifying assumptions of diagonal weight matrices and small initialization. Our experiments support the theory and also show that the phenomenon can occur in practice without these simplifying assumptions.
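A minimal sketch of how one might track this quantity during training, assuming PyTorch and a simple threshold-based notion of numerical rank; the paper's exact measurement protocol may differ.

```python
import torch

def effective_rank(delta: torch.Tensor, rel_tol: float = 1e-3) -> int:
    """Number of singular values above rel_tol times the largest singular value."""
    s = torch.linalg.svdvals(delta)
    return int((s > rel_tol * s[0]).sum()) if s.numel() > 0 else 0

def rank_profile(model, init_state):
    """Effective rank of (W_t - W_0) for every 2D weight matrix in the model."""
    profile = {}
    for name, p in model.named_parameters():
        if p.ndim == 2:
            profile[name] = effective_rank(p.detach() - init_state[name])
    return profile

# Usage: snapshot init_state = {k: v.clone() for k, v in model.state_dict().items()}
# before training, then log rank_profile(model, init_state) every few steps;
# the reported observation is that these ranks grow incrementally over training.
```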


Figure 3: Learning the full parity function in dimension d = 15 in the length generalization setting with inputs in $B_6$, $B_7$, $B_8$, $B_9$, $B_{10}$ and $B_{15}$ (full space) respectively, with an MLP (model details in Appendix B). X-axis: degree-profile component; Y-axis: degree-profile value, i.e., $\sum_{T:|T|=x} \hat{f}_{\mathrm{NN}}(T)^2$. As the length of training samples is decreased, the coefficient of the full parity gets smaller and the coefficients of low-degree monomials get larger.
Figure 6: Leakage of the interpolators learned by the MLP model trained with different learning rates. Larger learning rates weaken the min-degree bias and lead to higher leaks.
Generalization on the Unseen, Logic Reasoning and Degree Curriculum

January 2023 · 50 Reads

This paper considers the learning of logical (Boolean) functions with a focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an 'extrapolating' or 'reasoning' learner. We then study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that, for a class of network models including instances of Transformers, random features models, and diagonal linear networks, a min-degree interpolator (MDI) is learned on the unseen. We also provide evidence that other instances, with larger learning rates or mean-field networks, reach leaky MDIs. These findings lead to two implications: (1) we provide an explanation of the length generalization problem (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports.
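Below is a hedged sketch of a Degree-Curriculum-style sampler, under the assumption that the curriculum orders training inputs by support size (Hamming weight) and increments the allowed support stage by stage; the {0,1} encoding, the per-stage schedule, and the helper names are illustrative choices rather than the paper's recipe.

```python
import random

def curriculum_batches(d, max_support, batch_size, batches_per_stage, seed=0):
    """Yield batches of {0,1}^d inputs with Hamming weight growing over stages."""
    rng = random.Random(seed)
    for k in range(1, max_support + 1):            # stage k: Hamming weight at most k
        for _ in range(batches_per_stage):
            batch = []
            for _ in range(batch_size):
                weight = rng.randint(0, k)
                support = rng.sample(range(d), weight)
                batch.append([1 if i in support else 0 for i in range(d)])
            yield k, batch

# Usage (train_step and target are assumed to exist):
# for stage, xs in curriculum_batches(d=15, max_support=15,
#                                     batch_size=64, batches_per_stage=100):
#     train_step(model, xs, [target(x) for x in xs])
```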


Continuous Soft Pseudo-Labeling in ASR

November 2022 · 7 Reads

Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in an end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation, in that a teacher model generates targets to be mimicked by the student model being trained. Interestingly, however, PL strategies generally use hard labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation, we expect that specifying the whole distribution over sequences (soft labels) as the target for unlabeled data, instead of a single best-pass pseudo-labeled transcript (hard labels), should improve PL performance and convergence. Surprisingly, we find that soft-label targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that this does not happen with hard labels because the training loss on hard labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we present several experiments that support this hypothesis, and we experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft labels. These approaches bring the accuracy of soft labels closer to that of hard labels, and while they do not yet outperform them, they serve as a useful framework for further improvements.
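To make the hard- vs soft-label distinction concrete, here is a hedged PyTorch sketch for a CTC-style ASR model; the greedy decoding, tensor shapes, and loss choices are assumptions for illustration rather than the paper's training recipe.

```python
import torch
import torch.nn.functional as F

def hard_label_pl_loss(student_logits, teacher_logits, input_lengths, blank=0):
    # Hard labels: decode a single transcript from the teacher (greedy here),
    # then fit it with a sequence-level CTC loss. Logits are (T, B, V).
    with torch.no_grad():
        best = teacher_logits.argmax(dim=-1)                   # (T, B) best path
        targets, target_lengths = [], []
        for b in range(best.shape[1]):
            seq = torch.unique_consecutive(best[:, b])          # collapse repeats
            seq = seq[seq != blank]                             # drop CTC blanks
            targets.append(seq)
            target_lengths.append(len(seq))
        targets = torch.cat(targets)
    log_probs = F.log_softmax(student_logits, dim=-1)
    return F.ctc_loss(log_probs, targets, input_lengths,
                      torch.tensor(target_lengths), blank=blank)

def soft_label_pl_loss(student_logits, teacher_logits):
    # Soft labels: match the teacher's full per-frame distribution (KL per frame).
    # The paper reports that this can collapse without extra regularization.
    teacher = F.softmax(teacher_logits, dim=-1).detach()
    return F.kl_div(F.log_softmax(student_logits, dim=-1), teacher,
                    reduction="batchmean")
```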


Continuous Pseudo-Labeling from the Start

October 2022 · 4 Reads

Self-training (ST), or pseudo-labeling, has recently sparked significant interest in the automatic speech recognition (ASR) community because of its success in harnessing unlabeled data. Unlike prior semi-supervised learning approaches that relied on iteratively regenerating pseudo-labels (PLs) from a trained model and using them to train a new model, recent state-of-the-art methods perform 'continuous training', where PLs are generated using a very recent version of the model being trained. Nevertheless, these approaches still rely on bootstrapping the ST with an initial supervised learning phase in which the model is trained on labeled data alone. We believe this risks over-fitting to the labeled dataset in low-resource settings and that ST from the start of training should reduce over-fitting. In this paper we show how to do this by dynamically controlling the evolution of PLs during training in ASR. To the best of our knowledge, this is the first study to show the feasibility of generating PLs from the very start of training. We achieve this using two techniques that avoid the instabilities which lead to degenerate models that do not generalize. First, we control the evolution of PLs through a curriculum that uses the online changes in PLs to control the membership of the cache of PLs and improve generalization. Second, we find that sampling transcriptions from the predictive distribution, rather than only using the best transcription, further stabilizes training. With these techniques, our ST models match prior works without an external language model.
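A small sketch of the second technique, drawing a pseudo-label from the predictive distribution instead of taking only the best path; the frame-level sampling, temperature, and CTC-style collapse below are assumptions for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def sample_pseudo_label(logits, blank=0, temperature=1.0):
    """logits: (T, V) frame-level outputs for one utterance; returns token ids."""
    probs = F.softmax(logits / temperature, dim=-1)
    path = torch.multinomial(probs, num_samples=1).squeeze(-1)   # one token per frame
    collapsed = torch.unique_consecutive(path)                   # merge repeated frames
    return collapsed[collapsed != blank]                         # drop CTC blanks
```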


Figure 2: Comparison between the generalization loss in the canonical distribution shift setting and the Boolean influence for a PVR function with 3 pointer bits, window size 3, and majority-vote aggregation function.
Figure 6: Comparison between the Boolean influence and the generalization error for $f_1(x_1, \ldots, x_{11}) = x_1 x_2 + 2 x_2 x_3 + 3 x_3 x_4 + \cdots + 10 x_{10} x_{11}$. Frozen coordinates are represented on the x-axis, while the y-axis represents the value of the generalization error and the Boolean influence.
Figure 7: Comparison between the generalization loss in the canonical distribution shift setting and the Boolean influence for $f_2(x_1, x_2, \ldots, x_{14}) = x_1 + x_1 x_2 + x_1 x_2 x_3 + \cdots + x_1 x_2 x_3 \cdots x_{14}$.
Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures

May 2022 · 27 Reads

This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.
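For concreteness, the Boolean influence of a coordinate can be estimated by Monte Carlo as the probability that flipping that coordinate changes the function value; the sketch below is a plain estimator of that quantity, not the paper's experimental pipeline.

```python
import random

def boolean_influence(f, d, i, n_samples=100_000, seed=0):
    """Estimate Inf_i(f) = Pr_x[f(x) != f(x with bit i flipped)], x uniform in {-1,+1}^d."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(n_samples):
        x = [rng.choice((-1, 1)) for _ in range(d)]
        x_flipped = list(x)
        x_flipped[i] = -x_flipped[i]
        disagreements += f(x) != f(x_flipped)
    return disagreements / n_samples

# Example: for the majority function on 3 bits, every coordinate has influence 1/2.
maj3 = lambda x: 1 if sum(x) > 0 else -1
print(boolean_influence(maj3, d=3, i=0))   # ~0.5
```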


Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization

July 2021 · 48 Reads

The successes of deep learning critically rely on the ability of neural networks to output meaningful predictions on unseen data -- generalization. Yet despite its criticality, there remain fundamental open questions on how neural networks generalize. How much do neural networks rely on memorization -- seeing highly similar training examples -- and how much are they capable of human-intelligence-styled reasoning -- identifying abstract rules underlying the data? In this paper we introduce a novel benchmark, Pointer Value Retrieval (PVR) tasks, that explores the limits of neural network generalization. While PVR tasks can consist of visual as well as symbolic inputs, each with varying levels of difficulty, they all have a simple underlying rule. One part of the PVR task input acts as a pointer, giving the location of a different part of the input, which forms the value (and output). We demonstrate that this task structure provides a rich testbed for understanding generalization, with our empirical study showing large variations in neural network performance based on dataset size, task complexity and model architecture. The interaction of position, values and the pointer rule also allows the development of nuanced tests of generalization by introducing distribution shift and increasing functional complexity. These reveal both subtle failures and surprising successes, suggesting many promising directions of exploration on this benchmark.
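A minimal sketch of a symbolic PVR instance generator under the simplest setting (window size 1, no aggregation); the exact encoding used in the benchmark may differ.

```python
import random

def make_pvr_example(num_values=10, rng=random):
    """Return a (input_sequence, label) pair for a symbolic PVR task."""
    values = [rng.randrange(10) for _ in range(num_values)]
    pointer = rng.randrange(num_values)
    x = [pointer] + values        # input sequence: pointer digit, then the value digits
    y = values[pointer]           # label: the value the pointer selects
    return x, y

# Example: make_pvr_example() might return ([3, 7, 1, 4, 9, ...], 9),
# where the leading 3 points at the value in position 3 of the value digits.
```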


Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding

June 2021 · 91 Reads · 1 Citation

Attentional mechanisms are order-invariant. Positional encoding is a crucial component to allow attention-based deep model architectures such as Transformer to address sequences or images where the position of information matters. In this paper, we propose a novel positional encoding method based on learnable Fourier features. Instead of hard-coding each position as a token or a vector, we represent each position, which can be multi-dimensional, as a trainable encoding based on learnable Fourier feature mapping, modulated with a multi-layer perceptron. The representation is particularly advantageous for a spatial multi-dimensional position, e.g., pixel positions on an image, where $L_2$ distances or more complex positional relationships need to be captured. Our experiments based on several public benchmark tasks show that our learnable Fourier feature representation for multi-dimensional positional encoding outperforms existing methods by both improving the accuracy and allowing faster convergence.
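A hedged PyTorch sketch of the overall idea (a learnable linear map into Fourier features followed by an MLP); the layer sizes, scaling, and initialization here are assumptions, not the reference implementation.

```python
import math
import torch
import torch.nn as nn

class LearnableFourierPE(nn.Module):
    def __init__(self, pos_dim=2, fourier_dim=64, hidden_dim=128, out_dim=256):
        super().__init__()
        # Learnable frequency matrix: position (pos_dim) -> fourier_dim/2 frequencies.
        self.freqs = nn.Linear(pos_dim, fourier_dim // 2, bias=False)
        self.mlp = nn.Sequential(
            nn.Linear(fourier_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )
        self.scale = 1.0 / math.sqrt(fourier_dim)

    def forward(self, positions):             # positions: (..., pos_dim), e.g. pixel (row, col)
        proj = self.freqs(positions)           # (..., fourier_dim/2)
        feats = self.scale * torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)
        return self.mlp(feats)                 # (..., out_dim), added to token embeddings

# Usage: pe = LearnableFourierPE(); enc = pe(torch.tensor([[3.0, 7.0]]))  # one 2-D position
```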


Citations (63)


... Generalizability is an important topic in a broad range of studies [51][52][53]. In this proof-of-concept study, we investigated the generalizability of various StrB approaches in predicting peptide-MHC interactions, a pivotal aspect of immune surveillance and a major bottleneck in the design of cancer vaccines [54][55][56] and TCR therapies [57,58]. ...

Reference:

Geometric deep learning improves generalizability of MHC-bound peptide predictions
Understanding deep learning requires rethinking generalization
  • Citing Preprint
  • November 2016

... Recent advancements in NLP have enhanced the field of mental health disorder detection in general, especially using vectorized representations of language. Word embeddings and contextual analysis using deep learning models have substantially improved the accuracy and performance of depression detection models [34][35][36][37]. Word embeddings are created by transforming words into continuous vector representations, capturing semantic relationships and contextual meaning between words. ...

Word embeddings for speech recognition
  • Citing Conference Paper
  • September 2014

... To address the absence of positional information in time-series data, the CTAT Block incorporates learnable positional encoding [38]. By combining positional encoding with attention weight matrices, the Transformer captures the sequential characteristics of the time-series data more effectively. ...

Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding
  • Citing Preprint
  • June 2021

... This focus on self-organization was later influential for early variants of neural network modeling such as connectionism (Palmer 1995; Rock and Palmer 1990). Since then, neural networks have successfully modeled Gestalt closure effects (Kim et al. 2019, 2021; Zhang et al. 2024), Gestalt perceptual illusions (Herzog et al. 2003), and the Gestalt-based perception of biological motion (Sadeghi et al. 2021). Yet the concept of the Gestalt was also a foundational truth for Merleau-Ponty, as evident from the numerous Gestalt references in The Structure of Behavior and The Phenomenology of Perception (Embree 1980; Welsh 2006). ...

Neural Networks Trained on Natural Scenes Exhibit Gestalt Closure

Computational Brain & Behavior

... First, nearly all top-ranked attacks, including ς-zero, FMN, PDPGD, PGD-ε 0 (Croce and Hein 2019), APGD, APGD t , BIM (Kurakin, Goodfellow, and Bengio 2016), and DDN, use normalization or linear projections on the gradient. These methods decouple the gradient's original size from the step size used in the updates (Rony et al. 2019;Pintor et al. 2021). ...

Adversarial Examples in the Physical World
  • Citing Chapter
  • July 2018

... Previous works on conditional computation in deep learning have explored dynamic filtering and gating strategies to improve model performance and efficiency. For instance, GaterNet [30] introduces input-dependent dynamic filter selection in convolutional neural networks, leading to improved generalization and interpretability. Similarly, the authors in [31] propose fine-grained gating for individual convolutional maps, reducing computational cost while enhancing accuracy. ...

You Look Twice: GaterNet for Dynamic Filter Selection in CNNs
  • Citing Conference Paper
  • June 2019

... Most deep learning networks require large-scale datasets for training, and when faced with small amounts of data they often deliver poor performance [29]. To address the small-sample problem and improve the generalization performance of the neural network and its adaptability to new tasks, this paper introduces MAML [30], which is executed in two stages: meta-training and meta-testing [31,32]. The goal of the meta-training stage is to find an effective meta-initialization, which is used as the starting point for adapting the parameters in the meta-testing stage, so that the model can effectively solve new tasks with only a small number of annotated samples. ...

Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML
  • Citing Preprint
  • September 2019

... The main advantage of the causal convolutions in WaveNet as compared to Recurrent Neural Networks (RNNs) is that they are faster to compute because they do not have recurrent connections. With stacked dilated convolutions, WaveNet can effectively capture long-term dependencies in input data [57]. Apply dilated convolution: ...

Unsupervised Speech Representation Learning Using WaveNet Autoencoders
  • Citing Article
  • September 2019

IEEE/ACM Transactions on Audio Speech and Language Processing

... Improving message-passing efficiency is thus key to scalable graph learning. While importance sampling reduces costs [17]-[19], it is unsuitable for garment simulation, where each vertex needs full neighborhood context. Pre-computing message-passing [20], [21] is also suboptimal due to static features. ...

Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks
  • Citing Conference Paper
  • July 2019