Guillaume Lajoie’s research while affiliated with University of Quebec in Montreal and other places


Publications (102)


Robust prior-biased acquisition function for human-in-the-loop Bayesian optimization
  • Article

February 2025 · 11 Reads · Knowledge-Based Systems

Rose Guay-Hottin · Lison Kardassevitch · Hugo Pham · [...]
Figure 3: Two-stage training of Celo.
Figure 4: Comparing Celo with state-of-the-art hand-crafted optimizers. Celo is our proposed learned optimizer, meta-trained on a limited budget of 24 GPU hours. Optimizers are evaluated with final-loss (left) and speedup (right) criteria with respect to Adam on a diverse set of 17 tasks that are out-of-distribution for Celo and include image classification, language modeling, autoencoders, learned-optimizer training, etc. An IQM score above 1.0 indicates "Super-Adam" performance; read more about our evaluation methodology in §4 and the meta-training/evaluation tasks in §5.
Learning Versatile Optimizers on a Compute Diet
  • Preprint
  • File available

January 2025

Learned optimization has emerged as a promising alternative to hand-crafted optimizers, with the potential to discover stronger learned update rules that enable faster, hyperparameter-free training of neural networks. A critical element for practically useful learned optimizers that can be used off-the-shelf after meta-training is strong meta-generalization: the ability to apply the optimizer to new tasks. Recent state-of-the-art work in learned optimizers, VeLO (Metz et al., 2022), requires a large number of highly diverse meta-training tasks along with massive computational resources (4,000 TPU-months) to achieve meta-generalization, making further improvements to such learned optimizers impractical. In this work, we identify several key elements in learned optimizer architectures and meta-training procedures that lead to strong meta-generalization. We also propose evaluation metrics to reliably assess the quantitative performance of an optimizer at scale on a set of evaluation tasks. Our proposed approach, Celo, makes a significant leap in the meta-generalization performance of learned optimizers and also outperforms tuned state-of-the-art optimizers on a diverse set of out-of-distribution tasks, despite being meta-trained for just 24 GPU hours.
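
The abstract does not spell out Celo's architecture; the sketch below is only a rough, hypothetical illustration of the general shape of an MLP-based learned optimizer (per-parameter features fed to a small network that emits updates). All names and the feature set are assumptions, not Celo's actual design.

```python
import torch
import torch.nn as nn

class TinyLearnedOptimizer(nn.Module):
    """Minimal per-parameter learned optimizer sketch (illustrative, not Celo)."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        # Maps per-parameter features (gradient, momentum) to a scalar update.
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    @torch.no_grad()
    def step(self, params, grads, momenta, beta: float = 0.9, lr: float = 1e-3):
        for p, g, m in zip(params, grads, momenta):
            m.mul_(beta).add_(g, alpha=1 - beta)                  # running momentum state
            feats = torch.stack([g.flatten(), m.flatten()], dim=-1)
            update = self.net(feats).view_as(p)                   # learned update direction
            p.add_(update, alpha=-lr)
```

Meta-training (which Celo does within a 24 GPU-hour budget) would then consist of optimizing the weights of `self.net` so that the updates it produces minimize training loss across a distribution of tasks.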


Neural networks with optimized single-neuron adaptation uncover biologically plausible regularization

December 2024 · 51 Reads

Neurons in the brain have rich and adaptive input-output properties. Features such as heterogeneous f-I curves and spike-frequency adaptation are known to place single neurons in optimal coding regimes when facing changing stimuli. Yet it is still unclear how brain circuits exploit single-neuron flexibility, and how network-level requirements may have shaped such cellular function. To answer this question, a multi-scale approach is needed in which the computations of single neurons and neural circuits are considered as a complete system. In this work, we use artificial neural networks to systematically investigate single-neuron input-output adaptive mechanisms, optimized in an end-to-end fashion. Throughout the optimization process, each neuron can modify its nonlinear activation function, parametrized to mimic the f-I curves of biological neurons, either by learning an individual static function or via a learned, shared adaptation mechanism that modifies activation functions in real time during a task. We find that such adaptive networks show much-improved robustness to noise and changes in input statistics. Using tools from dynamical systems theory, we analyze the role of these emergent single-neuron properties and argue that neural diversity and adaptation play an active regularization role, enabling neural circuits to optimally propagate information across time. Finally, we outline similarities between these optimized solutions and known coding strategies found in biological neurons, such as gain scaling and fractional-order differentiation/integration.
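
As a hedged illustration of what a per-neuron parametrized activation function could look like (a deliberate simplification; the paper's exact f-I parametrization and its shared real-time adaptation mechanism are not reproduced here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveActivation(nn.Module):
    """Per-neuron parametrized nonlinearity (illustrative sketch, not the paper's exact model).

    Each unit i has its own gain g_i and threshold b_i, loosely mimicking
    heterogeneous f-I curves; these parameters are optimized end-to-end
    alongside the network weights.
    """

    def __init__(self, n_units: int):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(n_units))
        self.threshold = nn.Parameter(torch.zeros(n_units))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # softplus keeps "firing rates" non-negative, as for biological f-I curves
        return F.softplus(self.gain * (x - self.threshold))
```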


Figure 2: Exponentiated gradient (EG) finds weights that are easier to prune, and easier to re-train after pruning, than gradient descent (GD).
Figure 3: Exponentiated gradient (EG) outperforms gradient descent (GD) when relevant inputs are sparse.
Figure 4: Continuous control with noise is better with exponentiated gradient (EG). a | Schematic of an RNN controlling a two-joint planar arm through 8 muscles. The network receives delayed visual and proprioceptive feedback (see Methods). b | Three example arm reaches from random starting locations (x symbols) to target locations (filled circles) at the start and end of training. Different colours represent different reaches. Scale bars represent 20 cm. c | Example feedback to the RNN controller. Left: from the arm. Right: noise. d | Learning curves for RNNs trained with EG or GD (from n=9 models). e | Validation reaches for EG and GD networks after 3 thousand updates. Scale bars represent 20 cm. f | As e, but after 6 thousand updates.
Brain-like learning with exponentiated gradients

October 2024 · 55 Reads · 1 Citation

Computational neuroscience relies on gradient descent (GD) for training artificial neural network (ANN) models of the brain. The advantage of GD is that it is effective at learning difficult tasks. However, it produces ANNs that are a poor phenomenological fit to biology, making them less relevant as models of the brain. Specifically, it violates Dale's law by allowing synapses to change from excitatory to inhibitory, and it leads to synaptic weights that are not log-normally distributed, contradicting experimental data. Here, starting from first principles of optimisation theory, we present an alternative learning algorithm, exponentiated gradient (EG), that respects Dale's law and produces log-normal weights, without losing the power of learning with gradients. We also show that EG outperforms GD in biologically relevant settings, including learning from sparsely relevant signals and dealing with synaptic pruning. Altogether, our results show that EG is a superior learning algorithm for modelling the brain with ANNs.
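
For concreteness, here is a minimal sketch contrasting standard GD with a multiplicative, unnormalized EG-style update; the sign factor keeps each weight's sign fixed, consistent with Dale's law. This is a generic EG update, not necessarily the paper's exact formulation.

```python
import torch

def gd_step(w: torch.Tensor, grad: torch.Tensor, lr: float = 0.01) -> torch.Tensor:
    """Additive gradient-descent update: weights can freely change sign."""
    return w - lr * grad

def eg_step(w: torch.Tensor, grad: torch.Tensor, lr: float = 0.01) -> torch.Tensor:
    """Multiplicative exponentiated-gradient update (unnormalized EG sketch).

    Each weight is scaled by an exponential factor, so its sign never flips
    (consistent with Dale's law) and updates are proportional to the weight's
    magnitude, which tends to yield log-normal-like weight distributions.
    """
    return w * torch.exp(-lr * torch.sign(w) * grad)
```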


Multi-agent cooperation through learning-aware policy gradients

October 2024 · 37 Reads

Self-interested individuals often fail to cooperate, posing a fundamental challenge for multi-agent learning. How can we achieve cooperation among self-interested, independent learning agents? Promising recent work has shown that, in certain tasks, cooperation can be established between learning-aware agents that model the learning dynamics of each other. Here, we present the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware reinforcement learning, which takes into account that other agents are themselves learning through trial and error based on multiple noisy trials. We then leverage efficient sequence models to condition behavior on long observation histories that contain traces of the learning dynamics of other agents. Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally extended action coordination is required. Finally, we derive from the iterated prisoner's dilemma a novel explanation for how and when cooperation arises among self-interested, learning-aware agents.
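
The abstract does not spell out the estimator itself; the following is a generic, hedged sketch of the base ingredient only: a sequence-model policy conditioned on long observation histories, trained here with a plain score-function (REINFORCE) gradient rather than the paper's unbiased learning-aware estimator. All names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class HistoryPolicy(nn.Module):
    """Policy conditioned on a long observation history via a sequence model (illustrative)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_history: torch.Tensor) -> torch.distributions.Categorical:
        # obs_history: (batch, time, obs_dim), possibly spanning many past episodes
        h, _ = self.rnn(obs_history)
        return torch.distributions.Categorical(logits=self.head(h))

def reinforce_loss(policy, obs_history, actions, returns):
    """Plain score-function surrogate: -E[log pi(a | history) * return]."""
    dist = policy(obs_history)
    return -(dist.log_prob(actions) * returns).mean()
```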


Figure 1: Hypothesized form of the shortest program that outputs a compositional representation Z. a. Pseudocode showing the skeleton of the program. The program describes the representation using sentences W (sequences of discrete tokens) that are compressed using a prior p_w(w), and then maps these sentences to high-dimensional vectors in representation space using a function f(w) that outputs the sufficient statistics of a Normal distribution. Reconstruction errors are corrected using bit sequences whose length depends on the magnitudes of the errors. decode_algo() is a short function that decodes an object compressed using arithmetic coding (Witten et al., 1987). b. A visual depiction of the program. c. The total Kolmogorov complexity of the representation is estimated by the length of the shortest program that has this form.
Figure 4: Compositionality of natural language systems. We consider natural language systems in which W_L are sentences in some language and Z are sentence embedding vectors obtained from a pretrained multilingual model. a. We used prequential coding to measure K(Z|W_L) for these natural languages, where the area under the curve is the "prequential code length" estimating compression size. Languages have highly similar prequential code lengths, with Japanese having the lowest among them. b. Assuming all languages have equivalent expressivity K(Z), their relative compositionalities measured using our definition C_L(Z) are similar. c. Using topological similarity as a measure of compositionality gives counter-intuitive results, with most languages having near-zero topological similarity and Japanese being a strong outlier with a topological similarity of −0.2. All error bars show standard deviations across 3 random seeds.
Figure B.1: Estimating the complexity of a representation K(Z) by fitting a discrete auto-encoder with a learned latent prior. The encoder, prior, and decoder are jointly trained with a loss that maximizes the likelihood of Z using sentences that have high prior likelihood p_w(W). If p_w and f are also regularized to be simple functions, fitting this discrete auto-encoder is equivalent to finding a p_w, W, and f that jointly minimize K(Z).
A Complexity-Based Theory of Compositionality

October 2024 · 26 Reads

Compositionality is believed to be fundamental to intelligence. In humans, it underlies the structure of thought, language, and higher-level reasoning. In AI, compositional representations can enable a powerful form of out-of-distribution generalization, in which a model systematically adapts to novel combinations of known concepts. However, while we have strong intuitions about what compositionality is, there currently exists no formal definition of it that is measurable and mathematical. Here, we propose such a definition, which we call representational compositionality, that accounts for and extends our intuitions about compositionality. The definition is conceptually simple, quantitative, grounded in algorithmic information theory, and applicable to any representation. Intuitively, representational compositionality states that a compositional representation satisfies three properties. First, it must be expressive. Second, it must be possible to re-describe the representation as a function of discrete symbolic sequences with re-combinable parts, analogous to sentences in natural language. Third, the function that relates these symbolic sequences to the representation, analogous to semantics in natural language, must be simple. Through experiments on both synthetic and real-world data, we validate our definition of compositionality and show how it unifies disparate intuitions from across the literature in both AI and cognitive science. We also show that representational compositionality, while theoretically intractable, can be readily estimated using standard deep learning tools. Our definition has the potential to inspire the design of novel, theoretically-driven models that better capture the mechanisms of compositional thought.
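
One way to write the two-part description length suggested by Figure 1, in our own notation (a hedged reading of the caption, not necessarily the paper's exact definition): the representation Z is encoded via sentences w_i drawn under a prior p_w, a semantics function f, and residual error-correction bits,

$$ K(Z) \;\lesssim\; K(p_w) + K(f) \;+\; \sum_i -\log p_w(w_i) \;+\; \text{(error-correction bits)} . $$

Informally, the paper's definition then asks that Z be expressive while admitting such a re-description in which f, the analogue of semantics, is simple.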


Figure 2: Experimental results comparing different learners. Figures show average prequential coding curves for a meta-dataset, which is the mean prediction error on unseen data (generalization error, y-axis) given observed contexts of increasing length (datapoints seen, x-axis). The area underneath these curves corresponds to prequential code length. Error is measured using MSE for linear and sinusoid regression and cross-entropy for Mastermind. Error bars show standard error across seeds (5 for ICL, 15 for SGD). a. ICL from next-token prediction objectives (prequential ICL, blue) yields lower prequential code lengths than ICL from past-token prediction objectives (train-risk ICL, orange), with greater effects in low-data regimes. An SGD-based learner (green) fits more complex models than prequential ICL and performs poorly in low-data regimes, but can generalize better in large-data regimes on a difficult Mastermind task due to underfitting in ICL. b. The architecture used to parameterize T_ϕ has substantial influence on ICL's ability to minimize prequential code length.
Figure 3: Experimental results for LLM and data manipulation strategies. Figures show average prequential coding curves for a meta-dataset, which is the mean prediction error on unseen data (generalization error, y-axis) given observed contexts of increasing length (datapoints seen, x-axis). The area underneath these curves corresponds to prequential code length. Error bars show standard error across 5 seeds. a. An LLM (GPT-4, red) fails to meaningfully minimize prequential code length on a novel Mastermind task, performing far worse than small ICL models trained on a distribution of Mastermind tasks (blue) and a naive baseline that predicts the marginal class distribution over the context (purple). Error is measured using cross-entropy. b. On a synthetic HMM dataset designed to mimic natural language, preferentially training on shorter contexts (red) yields lower prequential code lengths than training uniformly over context lengths (purple). Error is measured using reverse KL divergence between model and oracle conditioned on seen context.
Figure E.1: Validation loss as a function of the number of tokens seen during training. The curve is averaged over 5 different datasets (seeds). Models trained on shorter sequences converge faster.
Figure E.2: Prequential code curves at different stages of training. Reproduction of Figure 3b, but also including the prequential curve at 610M tokens. At this point, the models trained with uniform context lengths have essentially the same performance as those trained with smaller context lengths.
In-context learning and Occam's razor

October 2024 · 27 Reads

The goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best: a principle called Occam's razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam's razor and in-context learning: an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved. We make our code available at https://github.com/3rdCore/PrequentialCode.
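
To make the stated equivalence concrete, here is a small hedged sketch of prequential coding: the cumulative next-point log-loss, with the learner conditioning on all previously revealed points, is the prequential code length. The `model.predict` interface (returning a predictive distribution over outcomes given the context) is hypothetical, not the paper's API.

```python
import math

def prequential_code_length(model, dataset):
    """Prequential (predict-then-observe) code length of a dataset under a learner.

    Points are "sent" one at a time: each point is encoded with the model's
    predictive distribution conditioned on all previously seen points, then revealed.
    The total, sum of -log2 p(y_t | context_<t), is exactly the accumulated
    next-token prediction loss of an in-context learner over the sequence.
    """
    total_bits = 0.0
    context = []
    for x, y in dataset:
        p = model.predict(x, context)     # predictive probs over outcomes, given the past
        total_bits += -math.log2(p[y])    # code length of the newly revealed point
        context.append((x, y))            # reveal the point; the model "learns" in context
    return total_bits
```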


Brain-like neural dynamics for behavioral control develop through reinforcement learning

October 2024 · 30 Reads

During development, neural circuits are shaped continuously as we learn to control our bodies. The ultimate goal of this process is to produce neural dynamics that enable the rich repertoire of behaviors we perform with our limbs. What begins as a series of "babbles" coalesces into skilled motor output as the brain rapidly learns to control the body. However, the nature of the teaching signal underlying this normative learning process remains elusive. Here, we test two well-established and biologically plausible theories, supervised learning (SL) and reinforcement learning (RL), that could explain how neural circuits develop the capacity for skilled movements. We trained recurrent neural networks to control a biomechanical model of a primate arm using either SL or RL and compared the resulting neural dynamics to populations of neurons recorded from the motor cortex of monkeys performing the same movements. Intriguingly, only RL-trained networks produced neural activity that matched their biological counterparts in terms of both the geometry and dynamics of population activity. We show that the similarity between RL-trained networks and biological brains depends critically on matching the biomechanical properties of the limb. We then demonstrate that monkeys and RL-trained networks, but not SL-trained networks, show a strikingly similar capacity for robust short-term behavioral adaptation to a movement perturbation, indicating a fundamental and general commonality in the neural control policy. Together, our results support the hypothesis that neural dynamics for behavioral control emerge through a process akin to reinforcement learning. The resulting neural circuits offer numerous advantages for adaptable behavioral control over simpler and more efficient learning rules and expand our understanding of how developmental processes shape neural dynamics.


Figure 2: Visualizing the effects of psychedelics in the model. We model the effects of classical psychedelics by progressively increasing α from 0 to 1 in our model, where α = 1 is equivalent to the Sleep phase. We visualize the effects of psychedelics on the network representation by inspecting the stimulus layer s. a) Example stimulus-layer activity (rows) in response to an MNIST digit presentation as psychedelic dose increases (columns, left to right). b) Same as (a) but for 'eyes-closed' conditions where an entirely black image is presented. c-d) Same as (a-b), but for the CIFAR10 dataset.
Figure 3: Effects of psychedelics on single model neurons. a) Correlations between the apical and basal dendritic compartments of either the same network neuron or between randomly selected neurons. b) Total plasticity for apical (left) and basal (right) synapses as α increases in the model when plasticity is either gated or not gated by α. Error bars indicate +/-1 s.e.m. c) Cosine similarity between plasticity induced under psychedelic conditions compared to baseline for apical (left) and basal (right) synapses.
Figure 5: Network-level effects of psychedelics. a) Pairwise correlation matrices computed for neurons in layer 2 across stimuli for α = 0 (left), α = 0.5 (center), and α = 1.0 (right). b) Correlation similarity metric between the pairwise correlation matrices of the network in the absence of hallucination (α = 0) as compared to hallucinating network states (α > 0). c) Proportion of explained variability as a function of principal component (PC) number for α ∈ {0, 0.5, 1}. d) Ratio of across-stimulus variance in individual stimulus layer neurons when the apical dendrites have been inactivated, versus baseline conditions across different α values. e) Ratio of across-stimulus variance in individual neurons in the stimulus layer when neurons at the deepest network layer have been inactivated, versus baseline conditions across different α values. Error bars indicate +/-1 s.e.m.
The oneirogen hypothesis: modeling the hallucinatory effects of classical psychedelics in terms of replay-dependent plasticity mechanisms

September 2024 · 67 Reads

Classical psychedelics induce complex visual hallucinations in humans, generating percepts that are coherent at a low level but have surreal, dream-like qualities at a high level. While there are many hypotheses as to how classical psychedelics could induce these effects, there are no concrete mechanistic models that capture the variety of observed effects in humans while remaining consistent with the known pharmacological effects of classical psychedelics on neural circuits. In this work, we propose the "oneirogen hypothesis," which posits that the perceptual effects of classical psychedelics result from their pharmacological actions inducing neural activity states that are genuinely more similar to dream-like states. We simulate the effects of classical psychedelics by manipulating neural network models trained on perceptual tasks with the Wake-Sleep algorithm. This established machine learning algorithm leverages two activity phases: a perceptual phase (wake), in which sensory inputs are encoded, and a generative phase (dream), in which the network internally generates activity consistent with stimulus-evoked responses. We simulate the action of psychedelics by partially shifting the model towards the 'Sleep' state, which entails a greater influence of top-down connections, in line with the impact of psychedelics on apical dendrites. The effects of this manipulation capture a number of experimentally observed phenomena, including the emergence of hallucinations, increases in stimulus-conditioned variability, and large increases in synaptic plasticity. We further provide a number of testable predictions that could be used to validate or invalidate our oneirogen hypothesis.
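
A minimal sketch of the core manipulation as described in the abstract and figure captions: interpolating between stimulus-driven (Wake) and internally generated (Sleep) drive with a dose parameter α. This is a deliberate simplification of the model, not the authors' code.

```python
import torch

def stimulus_layer_activity(bottom_up: torch.Tensor, top_down: torch.Tensor, alpha: float) -> torch.Tensor:
    """Blend feedforward (Wake) and generative top-down (Sleep) drive.

    alpha = 0 gives purely stimulus-driven activity, alpha = 1 the pure Sleep
    (dream) phase; intermediate values model the hypothesized psychedelic state.
    """
    return (1.0 - alpha) * bottom_up + alpha * top_down
```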


Latent Representation Learning for Multimodal Brain Activity Translation

September 2024 · 8 Reads

Neuroscience employs diverse neuroimaging techniques, each offering distinct insights into brain activity, from electrophysiological recordings such as EEG, which have high temporal resolution, to hemodynamic modalities such as fMRI, which have increased spatial precision. However, integrating these heterogeneous data sources remains a challenge, which limits a comprehensive understanding of brain function. We present the Spatiotemporal Alignment of Multimodal Brain Activity (SAMBA) framework, which bridges the spatial and temporal resolution gaps across modalities by learning a unified latent space free of modality-specific biases. SAMBA introduces a novel attention-based wavelet decomposition for spectral filtering of electrophysiological recordings, graph attention networks to model functional connectivity between functional brain units, and recurrent layers to capture temporal autocorrelations in brain signals. We show that training SAMBA, aside from achieving translation, also learns a rich representation of brain information processing. We showcase this by classifying the external stimuli driving brain activity from the representations learned in the hidden layers of SAMBA, paving the way for broad downstream applications in neuroscience research and clinical contexts.
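
As a schematic, hedged sketch of the pipeline the abstract names (spectral filtering of electrophysiology, connectivity modelling across brain regions, recurrent temporal dynamics, then translation to the target modality), the module choices below are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SAMBASketch(nn.Module):
    """Schematic translation pipeline loosely following the components named in the abstract."""

    def __init__(self, n_regions: int, d_model: int = 64):
        super().__init__()
        # Stand-in for the attention-based wavelet/spectral filtering stage.
        self.spectral = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Stand-in for a graph attention layer over functional brain units.
        self.graph = nn.Linear(d_model, d_model)
        # Recurrent layer for temporal autocorrelations.
        self.temporal = nn.GRU(d_model, d_model, batch_first=True)
        # Readout translating the shared latent space into the target modality.
        self.readout = nn.Linear(d_model, n_regions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model) source-modality features
        h, _ = self.spectral(x, x, x)
        h = torch.relu(self.graph(h))
        h, _ = self.temporal(h)
        return self.readout(h)
```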


Citations (44)


... Moreover, exponentiated gradient (EGU and EG) updates can converge faster than GD when the target weight vector is sparse [5], [8], [11]. Recent findings about synapses in biology indicate that EG algorithms are more biologically plausible than additive GD updates [54]. EG updates are typically viewed as appropriate for problems where the geometry of the optimization domain is described by the Kullback-Leibler divergence or relative entropy, as is often the case when optimizing over probability distributions. ...

Reference:

Generalized Exponentiated Gradient Algorithms and Their Application to On-Line Portfolio Selection
Brain-like learning with exponentiated gradients

... Additionally, this approach can help tackle the credit assignment problem (Y. H. Liu et al., 2021) and promotes multi-task learning by regulating different modes of neuronal dynamics (Munn et al., 2023; Williams et al., 2024), offering a more robust framework for continuous and adaptive learning in artificial neural systems. ...

Expressivity of Neural Networks with Random Weights and Learned Biases
  • Citing Article
  • July 2024

... In such a setting, intra-individual variability cannot be accounted for. This shortcoming can be potentially addressed by studies focusing on datasets featuring hours of scanning time per individual [69][70][71] . ...

A benchmark of individual auto-regressive models in a massive fMRI dataset

... For an intraoperative procedure, optimizing neural biomarkers is thus preferable over optimizing physiological ones. We investigate the relationship between eCAPs and physiological responses in detail in a companion study (Berthon et al., 2023). We have shown that OBOES is suitable for optimizing B-fibre activation, a prime example of indirectly optimizing heart rate using a neural biomarker. ...

Using neural biomarkers to personalize dosing of vagus nerve stimulation

Bioelectronic Medicine

... Future work is left to combine our model with biological constraints that induce additional effects during perturbations, e.g., through non-normal synaptic connectivity (O'Shea et al., 2022;Kim et al., 2023;Bondanelli and Ostojic, 2020;Logiaco et al., 2021). Third, our work connects to the setting of BCI, where the experimenter chooses the output weights at the beginning of learning (Sadtler et al., 2014;Golub et al., 2018;Willett et al., 2021;Rajeswaran et al., 2024). Typically, the output weights are set to lie 'within the manifold' of the leading PCs so that we expect aligned dynamics (Sadtler et al., 2014). ...

Assistive sensory-motor perturbations influence learned neural representations

... This advanced approach enhances therapeutic outcomes by specifically targeting distinct portions of the vagus nerve (Bowles et al., 2022), achieving greater efficacy and minimizing side effects compared to traditional whole-nerve stimulation methods. Precision VNS operates by selectively stimulating different fiber bundles within the vagus nerve (Mourdoukoutas et al., 2018), enabling exact control over physiological pathways responsible for functions like heart rate regulation (Wernisch et al., 2024), without impacting unrelated systems such as digestion or immune responses (Ahmed et al., 2022). ...

Online Bayesian optimization of vagus nerve stimulation

... Recent work (Ji et al., 2023) suggests a measurement approach to the ineffability incurred during the mental representation and ascription of thoughts, beliefs, and desires to others. Leveraging multi-modal sensor information (Radford et al., 2021; Ramesh et al., 2021; OpenAI, 2023) can improve the richness of module communication and obtain refined cross-modal representations that can potentially be reused for different downstream tasks. ...

Sources of richness and ineffability for phenomenally conscious states

Neuroscience of Consciousness

... This approach iteratively refreshes the assessment of the response surface by using a probabilistic model that effectively converges on ideal settings, while maintaining a balance between exploration and exploitation (Shahriari et al., 2016). Bayesian optimization has also been extensively used in neurostimulation (Choinière et al., 2024). Furthermore, Bayesian optimization may be especially effective for individualizing therapy, as it can adapt to each patient's unique physiological profile. ...

Gaussian-process-based Bayesian optimization for neurostimulation interventions in rats
  • Citing Article
  • February 2024

STAR Protocols

... This new computational scheme, which does not require updating the internal synaptic weights, is called reservoir computing. It is further assumed that the local cortical microcircuit as well as the global cortical network can be viewed as reservoir networks [16,21,31]. It is well documented that binocular rivalry is the result of the interaction of numerous brain regions, including the lateral geniculate nucleus, the thalamic reticular nucleus, and the visual areas V1, V2, V4, MT, and IT. ...

Connectome-based reservoir computing with the conn2res toolbox

... Behavior prediction metrics for BCI performance are also more intuitive for benchmarking progress than neural data prediction or the abstract goal of providing scientific insight (e.g. with latent variable models or in silico models) (Pei et al., 2021; Wang et al., 2023b). Finally, recent work has shown that deep networks are able to transfer learn across motor cortical datasets collected at different timepoints, subjects, or tasks (Azabou et al., 2024; Ye et al., 2023; Schneider et al., 2023). These ingredients provide the motivation and means for scaling neural data modeling. ...

A Unified, Scalable Framework for Neural Population Decoding
  • Citing Article
  • October 2023