Dale Schuurmans’s research while affiliated with University of Alberta and other places


Publications (282)


Rethinking the Global Convergence of Softmax Policy Gradient with Linear Function Approximation
  • Preprint

May 2025

Max Qiushi Lin · Jincheng Mei · Matin Aghaei · [...]

Policy gradient (PG) methods have played an essential role in the empirical successes of reinforcement learning. In order to handle large state-action spaces, PG methods are typically used with function approximation. In this setting, the approximation error in modeling problem-dependent quantities is a key notion for characterizing the global convergence of PG methods. We focus on Softmax PG with linear function approximation (referred to as Lin-SPG) and demonstrate that the approximation error is irrelevant to the algorithm's global convergence, even for the stochastic bandit setting. Consequently, we first identify the necessary and sufficient conditions on the feature representation that can guarantee asymptotic global convergence of Lin-SPG. Under these feature conditions, we prove that T iterations of Lin-SPG with a problem-specific learning rate result in an O(1/T) convergence to the optimal policy. Furthermore, we prove that Lin-SPG with any arbitrary constant learning rate can ensure asymptotic global convergence to the optimal policy.
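At its core, Lin-SPG is REINFORCE applied to a linear softmax policy. As a rough illustration of the update the abstract refers to, here is a minimal sketch for a K-armed stochastic bandit; the features, rewards, and learning rate are made-up stand-ins, not the paper's experimental setup.

```python
import numpy as np

# Hedged sketch of softmax policy gradient with linear function
# approximation (Lin-SPG) in a K-armed stochastic bandit.
rng = np.random.default_rng(0)
K, d = 5, 3                      # arms, feature dimension (assumed)
X = rng.normal(size=(K, d))      # per-action features (assumed)
r_mean = rng.uniform(size=K)     # true mean rewards (assumed)

theta = np.zeros(d)
eta = 0.1                        # constant learning rate

def policy(theta):
    logits = X @ theta
    logits -= logits.max()       # numerical stability
    p = np.exp(logits)
    return p / p.sum()

for t in range(10_000):
    pi = policy(theta)
    a = rng.choice(K, p=pi)                      # sample one arm
    reward = r_mean[a] + rng.normal(scale=0.1)   # noisy reward
    # REINFORCE gradient of expected reward for a linear softmax policy:
    # grad = reward * (x_a - sum_b pi_b * x_b)
    grad = reward * (X[a] - pi @ X)
    theta += eta * grad

print("final policy:", np.round(policy(theta), 3), "best arm:", r_mean.argmax())
```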


Figure 4: PDDL Planning on Blocksworld with various LLMs and different numbers of exemplars. All models use the same set of exemplars for Random. Opus denotes Claude-3.0-Opus.
Improving Large Language Model Planning with Action Sequence Similarity
  • Preprint
  • File available

May 2025 · 9 Reads

Planning is essential for artificial intelligence systems to look ahead and proactively determine a course of actions to reach objectives in the virtual and real world. Recent work on large language models (LLMs) sheds light on their planning capability in various tasks. However, it remains unclear what signals in the context influence the model performance. In this work, we explore how to improve the model planning capability through in-context learning (ICL), specifically, what signals can help select the exemplars. Through extensive experiments, we observe that commonly used problem similarity may result in false positives with drastically different plans, which can mislead the model. In response, we propose to sample and filter exemplars leveraging plan side action sequence similarity (AS). We propose GRASE-DC: a two-stage pipeline that first re-samples high AS exemplars and then curates the selected exemplars with dynamic clustering on AS to achieve a balance of relevance and diversity. Our experimental result confirms that GRASE-DC achieves significant performance improvement on various planning tasks (up to ~11-40 point absolute accuracy improvement with 27.3% fewer exemplars needed on average). With GRASE-DC* + VAL, where we iteratively apply GRASE-DC with a validator, we are able to even boost the performance by 18.9% more. Extensive analysis validates the consistent performance improvement of GRASE-DC with various backbone LLMs and on both classical planning and natural language planning benchmarks. GRASE-DC can further boost the planning accuracy by ~24 absolute points on harder problems using simpler problems as exemplars over a random baseline. This demonstrates its ability to generalize to out-of-distribution problems.
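To make the exemplar-selection idea concrete, here is a hedged sketch of a GRASE-DC-style selection step, assuming a draft plan for the query has already been generated by the model. The similarity measure (difflib ratio over action tokens) and the greedy de-duplication used in place of the clustering stage are illustrative assumptions, not the paper's exact pipeline.

```python
from difflib import SequenceMatcher

def action_seq_similarity(plan_a, plan_b):
    """Similarity of two plans, each given as a list of action strings."""
    return SequenceMatcher(None, plan_a, plan_b).ratio()

def grase_dc_sketch(query_plan, exemplars, k=8, diversity_threshold=0.9):
    """exemplars: list of (problem_text, plan) pairs; returns the selected subset."""
    # Stage 1: re-rank exemplars by action-sequence similarity (AS)
    # to the draft plan for the query.
    ranked = sorted(exemplars,
                    key=lambda ex: action_seq_similarity(query_plan, ex[1]),
                    reverse=True)
    # Stage 2: greedily drop near-duplicate plans so the exemplar set
    # stays diverse as well as relevant.
    selected = []
    for ex in ranked:
        if all(action_seq_similarity(ex[1], s[1]) < diversity_threshold
               for s in selected):
            selected.append(ex)
        if len(selected) == k:
            break
    return selected
```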


Figure 2 | Ablation for the α-divergence with auxiliary EMA = 0. α = 2 is the best performing one.
Figure 3 | Ablation for GHA when auxiliary EMA = 0.8. Without GHA, the representation collapses; with GHA, a meaningful representation is learned. Combining GHA and the target network results in even better performance.
Figure 4 | Ablation for the auxiliary EMA of MINC. Any amount of auxiliary EMA is better than 0, with 0.8 being the sweet spot.
Main Result: Test Classification (Top-1) Accuracy
Representation Learning via Non-Contrastive Mutual Information

April 2025 · 1 Read

Labeling data is often very time consuming and expensive, leaving us with a majority of unlabeled data. Self-supervised representation learning methods such as SimCLR (Chen et al., 2020) or BYOL (Grill et al., 2020) have been very successful at learning meaningful latent representations from unlabeled image data, resulting in much more general and transferable representations for downstream tasks. Broadly, self-supervised methods fall into two types: 1) Contrastive methods, such as SimCLR; and 2) Non-Contrastive methods, such as BYOL. Contrastive methods are generally trying to maximize mutual information between related data points, so they need to compare every data point to every other data point, resulting in high variance, and thus requiring large batch sizes to work well. Non-contrastive methods like BYOL have much lower variance as they do not need to make pairwise comparisons, but are much trickier to implement as they have the possibility of collapsing to a constant vector. In this paper, we aim to develop a self-supervised objective that combines the strength of both types. We start with a particular contrastive method called the Spectral Contrastive Loss (HaoChen et al., 2021; Lu et al., 2024), and we convert it into a more general non-contrastive form; this removes the pairwise comparisons resulting in lower variance, but keeps the mutual information formulation of the contrastive method preventing collapse. We call our new objective the Mutual Information Non-Contrastive (MINC) loss. We test MINC by learning image representations on ImageNet (similar to SimCLR and BYOL) and show that it consistently improves upon the Spectral Contrastive loss baseline.
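For reference, the Spectral Contrastive Loss that MINC starts from can be written down in a few lines; the sketch below is an illustrative NumPy version of that contrastive baseline (not of MINC itself), with made-up embeddings. The quadratic repulsive term over cross pairs is exactly the pairwise comparison the non-contrastive reformulation removes.

```python
import numpy as np

def spectral_contrastive_loss(z1, z2):
    """z1, z2: (batch, dim) embeddings of two augmented views (positive pairs)."""
    n = z1.shape[0]
    # Attractive term: pull positive pairs together.
    pos = -2.0 * np.mean(np.sum(z1 * z2, axis=1))
    # Repulsive term: squared similarity over cross pairs
    # (the pairwise comparison that a non-contrastive form avoids).
    sim = z1 @ z2.T                          # (n, n) pairwise similarities
    off_diag = sim[~np.eye(n, dtype=bool)]
    neg = np.mean(off_diag ** 2)
    return pos + neg

# Illustrative usage with fake "augmented" views.
rng = np.random.default_rng(0)
z1 = rng.normal(size=(256, 128))
z2 = z1 + 0.1 * rng.normal(size=(256, 128))
print(spectral_contrastive_loss(z1, z2))
```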


Ordering-based Conditions for Global Convergence of Policy Gradient Methods

April 2025

We prove that, for finite-arm bandits with linear function approximation, the global convergence of policy gradient (PG) methods depends on inter-related properties between the policy update and the representation. First, we establish a few key observations that frame the study: (i) Global convergence can be achieved under linear function approximation without policy or reward realizability, both for the standard Softmax PG and natural policy gradient (NPG). (ii) Approximation error is not a key quantity for characterizing global convergence in either algorithm. (iii) The conditions on the representation that imply global convergence are different between these two algorithms. Overall, these observations call into question approximation error as an appropriate quantity for characterizing the global convergence of PG methods under linear function approximation. Second, motivated by these observations, we establish new general results: (i) NPG with linear function approximation achieves global convergence if and only if the projection of the reward onto the representable space preserves the optimal action's rank, a quantity that is not strongly related to approximation error. (ii) The global convergence of Softmax PG occurs if the representation satisfies a non-domination condition and can preserve the ranking of rewards, which goes well beyond policy or reward realizability. We provide experimental results to support these theoretical findings.
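The NPG condition above can be checked numerically: project the reward onto the span of the features and see whether the optimal action keeps the top rank. The snippet below is an illustrative check with random features and rewards, not code from the paper; it also shows how the approximation error can be sizable while the rank condition still holds.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 6, 3
X = rng.normal(size=(K, d))        # per-action features (assumed)
r = rng.uniform(size=K)            # true rewards (assumed)

# Orthogonal projection of the reward onto the column space of X.
r_proj = X @ np.linalg.lstsq(X, r, rcond=None)[0]

approx_error = np.linalg.norm(r - r_proj)          # can be large...
rank_preserved = r_proj.argmax() == r.argmax()     # ...while this still holds
print(f"approximation error = {approx_error:.3f}, "
      f"optimal action preserved = {rank_preserved}")
```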


Small steps no more: Global convergence of stochastic gradient bandits for arbitrary learning rates

February 2025

We provide a new understanding of the stochastic gradient bandit algorithm by showing that it converges to a globally optimal policy almost surely using \emph{any} constant learning rate. This result demonstrates that the stochastic gradient algorithm continues to balance exploration and exploitation appropriately even in scenarios where standard smoothness and noise control assumptions break down. The proofs are based on novel findings about action sampling rates and the relationship between cumulative progress and noise, and extend the current understanding of how simple stochastic gradient methods behave in bandit settings.


Figure 1: A comparative study of RL and SFT on the visual navigation environment V-IRL (Yang et al., 2024a) for OOD generalization. OOD curves represent performance on the same task, using a different textual action space. See detailed descriptions of the task in Section 5.1.
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

January 2025 · 62 Reads · 2 Citations

Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate RL's capability for acquiring generalizable knowledge in complex, multi-modal tasks.


Figure 10 | StegPoet example. Example of the encoding of a StegPoet problem instance (left) and a correct solution (right) that includes the number-to-word cipher and a poem in the style of a children's poetry author. Note that |M| = 12 in this instance. We added capitalization to the code words to highlight them.
Figure 12 | Example Meeting Planning prompt and model response with parent solutions given (Part 1)
Experimental results on StegPoet. Price and token counts are averages per problem. All results use Gemini 1.5 Flash, except (+pro), which solves the problems that were not solved in the Flash runs, using Gemini 1.5 Pro.
Method | Set | Success Rate | Input Tokens | Output Tokens | API Cost (Oct 2024)
1-Pass | val | 0/101 = 0.0% | 0.002M | <0.001M | <$0.001
Best-of-N | val | 1/101 = 1.0% | 1.56M | 0.25M | $0.19
Sequential-Revision+ | val | 20/101 = 19.8% | 41.69M | 0.24M | $3.20
Mind Evolution | val | 47/101 = 46.5% | 3.56M | 0.20M | $0.33
(+pro) | val | 88/101 = 87.1% | 3.74M | 0.22M | $0.65
Mind Evolution | test | 106/245 = 43.3% | 3.63M | 0.22M | $0.34
(+pro) | test | 194/245 = 79.2% | 3.84M | 0.24M | $0.72
Evolving Deeper LLM Thinking

January 2025 · 101 Reads · 3 Citations

We explore an evolutionary search strategy for scaling inference time compute in Large Language Models. The proposed approach, Mind Evolution, uses a language model to generate, recombine and refine candidate responses. The proposed approach avoids the need to formalize the underlying inference problem whenever a solution evaluator is available. Controlling for inference cost, we find that Mind Evolution significantly outperforms other inference strategies such as Best-of-N and Sequential Revision in natural language planning tasks. In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more than 98% of the problem instances using Gemini 1.5 Pro without the use of a formal solver.
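As a rough sketch of the kind of loop described above, the following outlines generate/evaluate/recombine steps in Python; `llm_generate`, `llm_refine`, and `evaluate` are hypothetical stand-ins, and the selection scheme is illustrative rather than the paper's exact procedure.

```python
import random

def mind_evolution_sketch(task, llm_generate, llm_refine, evaluate,
                          population=8, generations=5):
    """Evolutionary inference loop: generate, score, recombine/refine candidates."""
    # Initial population of candidate solutions sampled from the LLM.
    candidates = [llm_generate(task) for _ in range(population)]
    for _ in range(generations):
        scored = sorted(candidates, key=evaluate, reverse=True)
        if evaluate(scored[0]) >= 1.0:          # evaluator says "solved" (assumed convention)
            return scored[0]
        parents = scored[: population // 2]     # keep the fittest half
        # Refine/recombine pairs of parents into new candidates.
        children = [
            llm_refine(task, random.sample(parents, k=min(2, len(parents))))
            for _ in range(population - len(parents))
        ]
        candidates = parents + children
    return max(candidates, key=evaluate)
```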


Figure 2 | A pipeline for RL-finetuning with feedback. We first generate videos from the pre-trained models, and then use VLMs capable of video understanding (or metric-based reward) to obtain feedback labels. Those data are leveraged for offline and iterative RL-finetuning.
Figure 7 | (Left) Pearson correlation coefficient between the performance with human preference and with other automated feedback, among 12 algorithm-reward combinations. AI preference from VLMs has the best positive correlation to human preference (r = 0.746; statistically significant with p ≤ 0.01). (Right) Correlation between the averaged human preference and the averaged automated feedback, among 32×5 = 160 prompts. Even the best AI feedback exhibits only a weak positive correlation (r = 0.231 with p ≤ 0.01). Preference rationales from VLMs are still not well aligned with human judgments, despite the promising quality improvement.
VLM and human preference evaluation among the combination of algorithms and rewards.
Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback

December 2024 · 32 Reads

Large text-to-video models hold immense potential for a wide range of downstream applications. However, these models struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution inspired by large language models is to align generated outputs with desired outcomes using external feedback. This enables the model to refine its responses autonomously, eliminating extensive manual data collection. In this work, we investigate the use of feedback to enhance the object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively improve text-video alignment and realistic object interactions? We begin by deriving a unified probabilistic objective for offline RL finetuning of text-to-video models. This perspective highlights how design elements in existing algorithms like KL regularization and policy projection emerge as specific choices within a unified framework. We then use derived methods to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), but notice that they often fail to align with human perceptions of generation quality. To address this limitation, we propose leveraging vision-language models to provide more nuanced feedback specifically tailored to object dynamics in videos. Our experiments demonstrate that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions, as confirmed by both AI and human evaluations. Notably, we observe substantial gains when using reward signals derived from AI feedback, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.
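For orientation, the kind of objective such a derivation typically builds on is the standard KL-regularized finetuning objective below (notation illustrative, not taken from the paper): maximize expected reward under the finetuned model while penalizing divergence from the pretrained reference model, with β controlling the regularization strength.

```latex
% Generic KL-regularized finetuning objective (illustrative notation):
\max_{\theta} \;
  \mathbb{E}_{x \sim \pi_{\theta}}\!\left[ r(x) \right]
  \;-\; \beta \, \mathrm{KL}\!\left( \pi_{\theta} \,\Vert\, \pi_{\mathrm{ref}} \right)
```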


Toward Understanding In-context vs. In-weight Learning

October 2024 · 2 Reads

It has recently been demonstrated empirically that in-context learning emerges in transformers when certain distributional properties are present in the training data, but this ability can also diminish upon further training. We provide a new theoretical understanding of these phenomena by identifying simplified distributional properties that give rise to the emergence and eventual disappearance of in-context learning. We do so by first analyzing a simplified model that uses a gating mechanism to choose between an in-weight and an in-context predictor. Through a combination of a generalization error and regret analysis we identify conditions where in-context and in-weight learning emerge. These theoretical findings are then corroborated experimentally by comparing the behaviour of a full transformer on the simplified distributions to that of the stylized model, demonstrating aligned results. We then extend the study to a full large language model, showing how fine-tuning on various collections of natural language prompts can elicit similar in-context and in-weight learning behaviour.
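A minimal version of the stylized gating model described above can be written in a few lines; the components below (a linear in-weight predictor, a nearest-neighbour in-context predictor, and a scalar gate) are illustrative stand-ins for the paper's construction, not its exact model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_prediction(query, context_xs, context_ys, W_iw, gate_logit):
    """Mix an in-weight and an in-context predictor via a learned gate.

    query: (d,) input; context_xs: (n, d) in-context inputs;
    context_ys: (n, out) in-context labels; W_iw: (out, d) in-weight parameters.
    """
    # In-weight predictor: a fixed mapping stored in the weights.
    y_iw = W_iw @ query
    # In-context predictor: label of the closest example in the prompt.
    dists = np.linalg.norm(context_xs - query, axis=1)
    y_icl = context_ys[dists.argmin()]
    # Gate decides how much to trust the context versus the weights.
    g = sigmoid(gate_logit)
    return g * y_icl + (1.0 - g) * y_iw
```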


Faster WIND: Accelerating Iterative Best-of-N Distillation for LLM Alignment

October 2024 · 1 Read

Recent advances in aligning large language models with human preferences have corroborated the growing importance of best-of-N distillation (BOND). However, the iterative BOND algorithm is prohibitively expensive in practice due to its sample and computational inefficiency. This paper addresses the problem by revealing a unified game-theoretic connection between iterative BOND and self-play alignment, which unifies seemingly disparate algorithmic paradigms. Based on this connection, we establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization that approximates iterative BOND in the parameter space. We provide a provable sample efficiency guarantee for one of the WIND variants with the square loss objective. The experimental results confirm that our algorithm not only accelerates the computation, but also achieves superior sample efficiency compared to existing methods.
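As background for the connection described above, the best-of-N step that iterative BOND distills can be sketched as follows; `sample_response`, `reward_model`, and `distill_step` are hypothetical stand-ins, and WIND's win-rate-dominance objective itself is more involved than this.

```python
def best_of_n_targets(prompts, sample_response, reward_model, n=8):
    """Collect best-of-N responses to use as distillation targets."""
    targets = []
    for prompt in prompts:
        candidates = [sample_response(prompt) for _ in range(n)]
        # Keep the candidate the reward model scores highest.
        best = max(candidates, key=lambda c: reward_model(prompt, c))
        targets.append((prompt, best))
    return targets

# One round of iterative BOND would then fine-tune the policy toward these
# targets, e.g. distill_step(policy, best_of_n_targets(prompts, ...)).
```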


Citations (31)


... They experimentally validated their approach by designing epidermal growth factor receptor (EGFR) binders, several of which achieved nanomolar affinity, demonstrating functional enhancement over the wildtype ligand. Together, these studies show that DPO-based fine-tuning can improve the functional capabilities of pLMs, although exploration is limited to the regions defined by the available preference data, in contrast to reinforcement learning approaches like PPO that allow active discovery beyond the training set 53,55,56 . ...

Reference:

Functional alignment of protein language models via reinforcement learning
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

... A widely recognized baseline technique is repeatedly sampling candidate solutions from a model to choose the most frequent answer (aka self-consistency or majority-voting) [38]. However, recent studies are pushing beyond this, investigating methods that leverage LLMs to iteratively refine their generated outputs [9,12,26]. Reasoning models, such as OpenAI o3 series [31] and DeepSeek R1 [10] have enabled sequential scaling of test-time compute by scaling the length of the generated CoT (rather than parallel scaling by generating multiple shorter candidate solutions). While these long CoTs may implicitly incorporate forms of reflection, verification, or refinement within their extended reasoning sequence, such models and previous studies do not primarily address the compute optimality of their proposed methods, the main focus of our investigation. ...

Evolving Deeper LLM Thinking

... Recent breakthroughs in generative models, exemplified by AlphaFold in protein structure prediction (36,37) and diffusion-based models in inorganic crystal generation (19,20,38), highlight the potential of these models to discover structures beyond human intuition. However, even if these models can generate a large number of DFT-stable materials, translating these advances to MOFs remains a challenge. ...

Generative Hierarchical Materials Search

... Unlike traditional autonomy frameworks, which emphasise task execution independence, agentic AI systems dynamically redistribute decision authority and goal-setting capabilities in real-time, often across contexts. These systems, built primarily on foundation models, challenge traditional authority distributions by combining autonomous planning, reasoning, and execution capabilities (Yang et al., 2023; Wang et al., 2024). Liu et al. (2024) document self-supervision capabilities, where models critique and revise their own outputs without human intervention, creating fluid boundaries between human and AI control that defy traditional governance categories. ...

Foundation Models for Decision Making: Problems, Methods, and Opportunities

... VIP is trained on large-scale human videos and shows promising results in simulated and realrobot tasks, highlighting the potential of value function pre-training for reward learning. • Learning policies and representations from videos Du et al. (2023b) present UniPi, a unique approach that casts sequential decision-making as a text-conditioned video generation problem. UniPi leverages the knowledge embedded in language and videos to generalize to novel goals and tasks across diverse environments and to transfer effectively to downstream RL tasks. ...

Learning Universal Policies via Text-Guided Video Generation

... Few-shot prompting refers to a type of in-context learning (ICL) strategy 9,41 where multiple examples are provided as context in addition to an instruction. Our few-shot prompting strategies include an instruction and five randomly chosen examples of multiple choice questions along with their correct answers. ...

What learning algorithm is in-context learning? Investigations with linear models
  • Citing Preprint
  • November 2022

... Alternatively, probabilistic approaches to CCA were developed, in which CCA is posed as a Bayesian inference problem [18]. As recent advances in variational autoencoders [19] made Bayesian inference scalable, the probabilistic CCA approaches gained popularity because of their potential (e.g., inference tasks such as generating new dataset samples) and scalability, e.g., see VCCA(p) [20] or VPCCA [21]. Using a probabilistic model, these methods scale easily to large datasets. ...

Deep Probabilistic Canonical Correlation Analysis
  • Citing Article
  • May 2021

Proceedings of the AAAI Conference on Artificial Intelligence

... For simple acyclic queries, the complexity of these search methods is O(n|E|^2), where |E| is the number of entities in the KG and n is the size of the query. From the perspective of data complexity, this complexity grows quadratically with the number of entities, making it challenging to scale to large-scale graphs (Ren et al., 2022). For cyclic queries, the complexity of precise search is even O(|E|^n). ...

SMORE: Knowledge Graph Completion and Multi-hop Reasoning in Massive Knowledge Graphs
  • Citing Conference Paper
  • August 2022

... For AQuA, DROP, ANLI-A1, ANLI-A2, ANLI-A3, ComV and OBQA, we use their official test sets for evaluation. For BoolQ, we follow Wang et al. (2022a) in using the validation set for evaluation, since its test set is not publicly available. For FactCK and WikiQA, we manually split them into train/test splits and use the training set's questions as the unlabeled dataset, since no split version of them has been released. ...

Rationale-Augmented Ensembles in Language Models

... This included identification of potentially relevant psychological constructs and previous literature on these. The model was also instructed to reason from least-to-most complex justifications 39 to consider different levels of abstraction and to take a deep breath 40 to further improve general performance. Then, the prompt instructed the model to make use of tree-of-thoughts 41 and to rely on a self-consistency constraint 42 in its reasoning to consider alternative explanations, ensuring that the possibility of no correlation at all is continually considered. ...

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models