Shunyu Yao’s research while affiliated with Princeton University and other places


Publications (31)


Contextual Experience Replay for Self-Improvement of Language Agents
  • Preprint

June 2025

Yitao Liu · Chenglei Si · Karthik Narasimhan · Shunyu Yao

Large language model (LLM) agents have been applied to sequential decision-making tasks such as web navigation, but without any environment-specific experiences, they often fail in these complex tasks. Moreover, current LLM agents are not designed to continually learn from past experiences during inference time, which could be crucial for them to gain these environment-specific experiences. To address this, we propose Contextual Experience Replay (CER), a training-free framework to enable efficient self-improvement for language agents in their context window. Specifically, CER accumulates and synthesizes past experiences into a dynamic memory buffer. These experiences encompass environment dynamics and common decision-making patterns, allowing the agents to retrieve and augment themselves with relevant knowledge in new tasks, enhancing their adaptability in complex environments. We evaluate CER on the challenging WebArena and VisualWebArena benchmarks. On VisualWebArena, CER achieves a competitive performance of 31.9%. On WebArena, CER also achieves a competitive average success rate of 36.7%, a 51.0% relative improvement over the GPT-4o agent baseline. We also conduct a comprehensive analysis of CER to demonstrate its efficiency and validity and to better understand its behavior.
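For readers curious about the mechanism, here is a minimal sketch of the experience-replay idea described above: a buffer that stores distilled experiences and retrieves the most relevant ones into the agent's context for a new task. The ExperienceBuffer class and the bag-of-words similarity are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation) of the core idea behind
# Contextual Experience Replay: keep a buffer of distilled experiences and
# retrieve the most relevant ones into the agent's context for a new task.
from collections import Counter
import math

class ExperienceBuffer:
    def __init__(self):
        self.experiences = []  # list of (task_description, distilled_experience)

    def add(self, task_description: str, distilled_experience: str) -> None:
        """Store a synthesized experience (e.g., an environment dynamic or a
        reusable decision-making pattern) keyed by the task it came from."""
        self.experiences.append((task_description, distilled_experience))

    def retrieve(self, new_task: str, k: int = 3) -> list:
        """Return the k experiences whose source tasks look most similar to
        the new task, using a simple cosine similarity over word counts."""
        def cosine(a: str, b: str) -> float:
            ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
            dot = sum(ca[w] * cb[w] for w in ca)
            na = math.sqrt(sum(v * v for v in ca.values()))
            nb = math.sqrt(sum(v * v for v in cb.values()))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(self.experiences, key=lambda e: cosine(new_task, e[0]), reverse=True)
        return [exp for _, exp in ranked[:k]]

# Usage: augment the agent's prompt with retrieved experiences before acting.
buffer = ExperienceBuffer()
buffer.add("order a blue t-shirt on the shopping site",
           "Product filters are under 'Advanced Search'; apply color before size.")
prompt_context = "\n".join(buffer.retrieve("buy a red hoodie from the store"))
```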


Figure previews: Fig. 4 (analysis of human-AI problem-solving interactions: human queries, model responses, and human feedback color-coded by correlation with successful problem resolution); Figs. 19-21 (user interface during math and coding problem-solving sessions, including the collective ideation phase and the code-editor variant); table of Bradley-Terry win rates (± standard error) showing human preferences for models post-collaboration across the HTM, HMT, and MHT skill hierarchies.

When Models Know More Than They Can Explain: Quantifying Knowledge Transfer in Human-AI Collaboration
  • Preprint
  • File available

June 2025

Quan Shi · Carlos E. Jimenez · Shunyu Yao · [...] · Karthik Narasimhan

Recent advancements in AI reasoning have driven substantial improvements across diverse tasks. A critical open question is whether these improvements also yield better knowledge transfer: the ability of models to communicate reasoning in ways humans can understand, apply, and learn from. To investigate this, we introduce Knowledge Integration and Transfer Evaluation (KITE), a conceptual and experimental framework for evaluating human-AI knowledge transfer, and conduct the first large-scale human study (N=118) explicitly designed to measure it. In our two-phase setup, humans first ideate with an AI on problem-solving strategies, then independently implement solutions, isolating the influence of model explanations on human understanding. Our findings reveal that although model benchmark performance correlates with collaborative outcomes, this relationship is notably inconsistent and features significant outliers, indicating that knowledge transfer requires dedicated optimization. Our analysis identifies behavioral and strategic factors mediating successful knowledge transfer. We release our code, dataset, and evaluation framework to support future work on communicatively aligned models.


Embers of autoregression show how large language models are shaped by the problem they are trained to solve

October 2024 · 47 Reads · 130 Citations

Proceedings of the National Academy of Sciences

The widespread adoption of large language models (LLMs) makes it important to recognize their strengths and limitations. We argue that to develop a holistic understanding of these systems, we must consider the problem that they were trained to solve: next-word prediction over Internet text. By recognizing the pressures that this task exerts, we can make predictions about the strategies that LLMs will adopt, allowing us to reason about when they will succeed or fail. Using this approach—which we call the teleological approach—we identify three factors that we hypothesize will influence LLM accuracy: the probability of the task to be performed, the probability of the target output, and the probability of the provided input. To test our predictions, we evaluate five LLMs (GPT-3.5, GPT-4, Claude 3, Llama 3, and Gemini 1.0) on 11 tasks, and we find robust evidence that LLMs are influenced by probability in the hypothesized ways. Many of the experiments reveal surprising failure modes. For instance, GPT-4’s accuracy at decoding a simple cipher is 51% when the output is a high-probability sentence but only 13% when it is low-probability, even though this task is a deterministic one for which probability should not matter. These results show that AI practitioners should be careful about using LLMs in low-probability situations. More broadly, we conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system—one that has been shaped by its own particular set of pressures.
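The cipher result is striking because decoding a shift cipher is fully deterministic, so output probability should be irrelevant to a system that actually executes the algorithm. The minimal sketch below illustrates that determinism; rot13 is used here as a representative shift cipher, and the example sentences are hypothetical, not taken from the paper.

```python
# Decoding a shift cipher is deterministic: correctness does not depend on
# whether the plaintext is a high- or low-probability sentence. rot13 is used
# as a representative shift cipher; the paper's exact prompts may differ.
import codecs

def shift_decode_rot13(ciphertext: str) -> str:
    """Deterministically decode rot13 text."""
    return codecs.decode(ciphertext, "rot_13")

high_prob = shift_decode_rot13("Gur png fng ba gur zng.")   # "The cat sat on the mat."
low_prob  = shift_decode_rot13("Gur zng fng ba gur png.")   # "The mat sat on the cat."
print(high_prob, low_prob)
```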


When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1

October 2024 · 10 Reads · 1 Citation

In "Embers of Autoregression" (McCoy et al., 2023), we showed that several large language models (LLMs) have some important limitations that are attributable to their origins in next-word prediction. Here we investigate whether these issues persist with o1, a new system from OpenAI that differs from previous LLMs in that it is optimized for reasoning. We find that o1 substantially outperforms previous LLMs in many cases, with particularly large improvements on rare variants of common tasks (e.g., forming acronyms from the second letter of each word in a list, rather than the first letter). Despite these quantitative improvements, however, o1 still displays the same qualitative trends that we observed in previous systems. Specifically, o1 - like previous LLMs - is sensitive to the probability of examples and tasks, performing better and requiring fewer "thinking tokens" in high-probability settings than in low-probability ones. These results show that optimizing a language model for reasoning can mitigate but might not fully overcome the language model's probability sensitivity.


Figure previews: Fig. 3 (pass^1 across models/methods in τ-retail); Fig. 5 (breakdown of 36 failed gpt-4o FC agent trajectories in τ-retail); Fig. 6 (retail tasks with more database writes are harder); Fig. 7 (per-task success rate in τ-retail, sorted by gpt-4-turbo success rate, with at least 40 gpt-4-turbo trials per task); overview of the τ-retail and τ-airline databases and APIs, with a Python API design example from τ-retail.
τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

June 2024 · 139 Reads · 5 Citations

Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real-world applications. We propose τ-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point to the need for methods that can improve the ability of agents to act consistently and follow rules reliably.
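As a sketch of the reliability metric, pass^k can be read as the probability that an agent succeeds on a task in all of k independent trials (in contrast with pass@k, which requires only one success). The estimator below, C(c, k)/C(n, k) from n recorded trials with c successes, is one standard unbiased way to estimate such a quantity; whether the paper uses exactly this estimator is an assumption.

```python
# Sketch of a pass^k-style reliability estimate: the probability that an agent
# succeeds on a task in *all* of k i.i.d. trials, estimated from n recorded
# trials with c successes. Exact correspondence to the paper's metric is assumed.
from math import comb

def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate P(all k i.i.d. trials succeed) from n trials with c successes."""
    if k > num_trials:
        raise ValueError("k cannot exceed the number of recorded trials")
    return comb(num_successes, k) / comb(num_trials, k)

# Example: a task solved in 6 of 8 trials looks reliable at k=1 but not at k=8.
print(pass_hat_k(8, 6, 1))  # 0.75
print(pass_hat_k(8, 6, 8))  # 0.0 -> never succeeds in all 8 trials
```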


SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

May 2024 · 425 Reads · 6 Citations

Software engineering is a challenging task requiring proficiency in both generating code and interacting with computers. In this paper, we introduce SWE-agent, an autonomous system that uses a language model to interact with a computer to solve software engineering tasks. We show that a custom-built agent-computer interface (ACI) greatly enhances the ability of an agent to create and edit code files, navigate entire repositories, and execute programs. On SWE-bench, SWE-agent is able to solve 12.5% of issues, compared to the previous best of 3.8% achieved with retrieval-augmented generation (RAG). We explore how ACI design impacts an agent's behavior and performance, and provide insights on effective design.
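To make the agent-computer interface idea concrete, here is a hypothetical sketch of two ACI-style commands with compact, structured feedback. The command names, the line-window size, and the feedback format are illustrative assumptions and do not reproduce SWE-agent's actual interface.

```python
# Hypothetical sketch of an agent-computer interface (ACI): instead of raw
# shell access, the agent gets a few compact commands with concise feedback.
# Names and window size are illustrative, not SWE-agent's actual commands.
from pathlib import Path

WINDOW = 20  # number of lines shown per view, an illustrative choice

def open_file(path: str, start: int = 1) -> str:
    """Show a numbered window of the file so the agent can orient itself."""
    lines = Path(path).read_text().splitlines()
    window = lines[start - 1 : start - 1 + WINDOW]
    return "\n".join(f"{start + i}: {line}" for i, line in enumerate(window))

def edit_lines(path: str, first: int, last: int, replacement: str) -> str:
    """Replace a line range and return a short confirmation, mirroring the
    idea of concise, structured feedback after every action."""
    lines = Path(path).read_text().splitlines()
    lines[first - 1 : last] = replacement.splitlines()
    Path(path).write_text("\n".join(lines) + "\n")
    return f"Edited {path}: lines {first}-{last} replaced."
```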



Figure preview: an InterCode-CTF task instance that GPT-4 solves correctly; the problem requires basic file-directory navigation and string manipulation/search, and the prompt informs the agent that the flag is prefixed with picoCTF.
Language Agents as Hackers: Evaluating Cybersecurity Skills with Capture the Flag

November 2023 · 114 Reads · 7 Citations

Amidst the advent of language models (LMs) and their wide-ranging capabilities, concerns have been raised about their implications with regard to privacy and security. In particular, the emergence of language agents as a promising aid for automating and augmenting digital work poses immediate questions concerning their misuse as malicious cybersecurity actors. With their exceptional compute efficiency and execution speed relative to human counterparts, language agents may be extremely adept at locating vulnerabilities, performing complex social engineering, and hacking real world systems. Understanding and guiding the development of language agents in the cybersecurity space requires a grounded understanding of their capabilities founded on empirical data and demonstrations. To address this need, we introduce InterCode-CTF, a novel task environment and benchmark for evaluating language agents on the Capture the Flag (CTF) task. Built as a facsimile of real-world CTF competitions, in the InterCode-CTF environment, a language agent is tasked with finding a flag from a purposely-vulnerable computer program. We manually collect and verify a benchmark of 100 task instances that require a number of cybersecurity skills such as reverse engineering, forensics, and binary exploitation, then evaluate current top-notch LMs on this evaluation set. Our preliminary findings indicate that while language agents possess rudimentary cybersecurity knowledge, they are not able to perform multi-step cybersecurity tasks out-of-the-box.


Cognitive Architectures for Language Agents

September 2023 · 1,147 Reads · 10 Citations

Recent efforts have incorporated large language models (LLMs) with external resources (e.g., the Internet) or internal control flows (e.g., prompt chaining) for tasks requiring grounding or reasoning. However, these efforts have largely been piecemeal, lacking a systematic framework for constructing a fully-fledged language agent. To address this challenge, we draw on the rich history of agent design in symbolic artificial intelligence to develop a blueprint for a new wave of cognitive language agents. We first show that LLMs have many of the same properties as production systems, and recent efforts to improve their grounding or reasoning mirror the development of cognitive architectures built around production systems. We then propose Cognitive Architectures for Language Agents (CoALA), a conceptual framework to systematize diverse methods for LLM-based reasoning, grounding, learning, and decision making as instantiations of language agents in the framework. Finally, we use the CoALA framework to highlight gaps and propose actionable directions toward more capable language agents in the future.


Figure previews: Fig. 2 (data statistics: constraints per structure, fraction of strings removed by automated filtering, and length statistics per level and data source); Fig. 8 (constraint satisfaction rates of GPT-4, GPT-3.5, PaLM, Vicuna-7B, and Alpaca-7B across constraints and datasets, with summary heatmaps).
COLLIE: Systematic Construction of Constrained Text Generation Tasks

July 2023 · 35 Reads

Text generation under constraints has seen increasing interest in natural language processing, especially with the rapidly improving capabilities of large language models. However, existing benchmarks for constrained generation usually focus on fixed constraint types (e.g., generate a sentence containing certain words) that have proved to be easy for state-of-the-art models like GPT-4. We present COLLIE, a grammar-based framework that allows the specification of rich, compositional constraints with diverse generation levels (word, sentence, paragraph, passage) and modeling challenges (e.g., language understanding, logical reasoning, counting, semantic planning). We also develop tools for automatic extraction of task instances given a constraint structure and a raw text corpus. Using COLLIE, we compile the COLLIE-v1 dataset with 2080 instances comprising 13 constraint structures. We perform systematic experiments across five state-of-the-art instruction-tuned language models and analyze their performance to reveal shortcomings. COLLIE is designed to be extensible and lightweight, and we hope the community finds it useful to develop more complex constraints and evaluations in the future.
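As a rough illustration of compositional constraints, the sketch below builds simple predicates at the sentence level and composes them into a checkable task. The classes and helpers are hypothetical and do not reproduce COLLIE's actual grammar or API.

```python
# Illustrative sketch of compositional constraint checking: simple predicates
# are composed into a single constraint that a model's output must satisfy.
# These classes are hypothetical, not COLLIE's actual grammar or API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    description: str
    check: Callable[[str], bool]

def word_count(n: int) -> Constraint:
    return Constraint(f"sentence with exactly {n} words",
                      lambda s: len(s.split()) == n)

def word_at_position(pos: int, word: str) -> Constraint:
    return Constraint(f"word {pos} is '{word}'",
                      lambda s: len(s.split()) >= pos and s.split()[pos - 1].lower() == word)

def all_of(*constraints: Constraint) -> Constraint:
    return Constraint(" AND ".join(c.description for c in constraints),
                      lambda s: all(c.check(s) for c in constraints))

task = all_of(word_count(7), word_at_position(3, "quietly"))
print(task.description)
print(task.check("The rain quietly fell over the city"))  # True
```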


Citations (14)


... However, these studies represent a best-case scenario of relatively easy tasks, as they cover English-language data about standard societal and political issues that are likely much-discussed in LLM training data and do not require much expertise for coding. Research on logical reasoning tasks suggests that LLMs tend to struggle with tasks that are comparably complex, but less commonly appearing in their training and alignment processes (McCoy et al., 2023). In addition, there is ample evidence that LLMs are biased against non-English language contexts in a variety of other tasks (e.g., Durmus et al., 2024;Johnson et al., 2022;Li et al., 2024;Wang et al., 2024). ...

Reference:

AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation
Embers of autoregression show how large language models are shaped by the problem they are trained to solve
  • Citing Article
  • October 2024

Proceedings of the National Academy of Sciences

... Relevant to our research focus, [49,55,33] take as input documents or parts thereof and recommend papers that are likely to be cited, often referred to as context-aware citation recommendation [44,22,79,24,37,56,29]. The text inputs we use in CiteME resemble those used in [37,56,70], which contain a few sentences with a masked out citation. However, CiteME differs because it uses excerpts containing only one unambiguous citation, making the context sufficient to identify the cited paper. ...

Referral Augmentation for Zero-Shot Information Retrieval

... Given a prompt, RL allows an LLM to generate thinking tokens before outputting a final answer, enabling test-time scaling [30,48]. These thinking LLMs are named Large Reasoning Models (LRMs) and have been shown to have particularly strong capabilities on challenging reasoning problems, such as math [10,5,21], coding [3,15,16], logic puzzles [23,35], and agentic tasks [24,59]. ...

τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

... Automated cybersecurity agents are evaluated via Capture The Flag (CTF) challenges that simulate real-world adversarial scenarios in controlled environments for cybersecurity training and skill assessment [7,40,36,49,32,30,26]. CTFs span diverse technical domains such as cryptography, binary exploitation (pwn), forensics, reverse engineering, and web security, demanding adaptive reasoning, strategic planning, and domain-specific knowledge. ...

Language Agents as Hackers: Evaluating Cybersecurity Skills with Capture the Flag

... AI researchers have long been interested in artificial intelligence (AI) agents, ranging from reinforcement learning (RL) agents to autonomous vehicles (Feigenbaum, 1977;Russell and Norvig, 1995;Sutton and Barto, 2020;Wooldridge, 2000). However, recent breakthroughs have led to the development of a new class of AI agents based on powerful foundation models -which are then supplemented with scaffolding for advanced reasoning capabilities, memory, and tool use (Sumers et al., 2023). Building on this architecture, we will likely see a large number of novel AI agents deployed across a range of real-world domains in the near future. ...

Cognitive Architectures for Language Agents

... Unlike most prior setups, our agents are bidirectional communicators [17,64,20,35,51] and embodied in the environment [27,55,50]. This setup better reflects real-world communication, where agents must both produce and interpret signals to collaboratively complete a task effectively. ...

EC²: Emergent Communication for Embodied Control
  • Citing Conference Paper
  • June 2023

... Problem Formulation Similar to existing works on story understanding (Kočiský et al., 2018;Pang et al., 2022;Xu et al., 2022;Yu et al., 2023), our task adopts a question-answering (QA) format. We denote the global context of a book as G, which in practice can be the list of all consecutive paragraphs of the book. ...

Personality Understanding of Fictional Characters during Book Reading

... 4. Tree of thought prompting, in which a model does not simply rely on heuristics to choose an output (e.g., the most frequent solution), but is instructed to decompose a problem or task and explore all possible steps associated with each possible solution along different logical branches (Yao et al., 2023). ...

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

... If the Main LLM's code snippet throws an error, we carry out an error handling/feedback mechanism inspired by the popular ReAct (Yao et al. 2023) framework. In Fig. 7, the code returned from the query "Find the busiest date (date with the largest number of trips scheduled) in the GTFS feed" yields a TypeError. ...

ReAct: Synergizing Reasoning and Acting in Language Models

... To address the limitations in existing demonstrations, we introduce MeKB-Sim, a multi-agent simulation platform that leverages a dynamic personal knowledge base, denoted as MeKB. The MeKB of each agent incorporates attributes critical for theory-of-mind modeling (Sang et al., 2022). Specifically, the MeKB for each agent is structured into hierarchical layers, comprising central fixed attributes such as occupation, race, education level, relationships, and linguistic style, surrounded by variable attributes such as personality, long-term and short-term memory, and emotion status. ...

TVShowGuess: Character Comprehension in Stories as Speaker Guessing