Shuyan Zhou’s research while affiliated with Carnegie Mellon University and other places


Publications (34)


Figure 3: Correlation between Human Step Count and End-to-End Task Accuracy.
Figure 4: Screenshot of the CowPilot evaluation result page. After each task is completed, the evaluation metric values are shown as a summary.
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
  • Preprint
  • File available

January 2025 · 38 Reads

Faria Huq · Zora Zhiruo Wang · Frank F. Xu · [...]
While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and in modeling user preferences. This presents an opportunity for humans to collaborate with the agent and leverage the agent's capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, with evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research into how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html
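
As a rough illustration of the collaboration pattern the abstract describes (the agent proposes each step; the human accepts, overrides, or stops), here is a minimal Python sketch. The function names, control flow, and step counting are assumptions for illustration only, not the CowPilot implementation.

```python
# Minimal sketch of a human-agent collaborative loop in the spirit of CowPilot.
# All callables (agent_propose, human_review, execute, is_done) are hypothetical
# placeholders supplied by the caller, not CowPilot's actual API.

from dataclasses import dataclass, field

@dataclass
class Trace:
    agent_steps: int = 0
    human_steps: int = 0
    actions: list = field(default_factory=list)

def collaborative_episode(agent_propose, human_review, execute, is_done, max_steps=30):
    """Run one task episode, interleaving agent proposals with human control."""
    trace = Trace()
    for _ in range(max_steps):
        proposal = agent_propose()                 # agent suggests the next action
        decision, action = human_review(proposal)  # 'accept', 'override', or 'stop'
        if decision == "stop":
            break
        if decision == "accept":
            trace.agent_steps += 1
            execute(proposal)
            trace.actions.append(("agent", proposal))
        else:  # human overrides with their own action
            trace.human_steps += 1
            execute(action)
            trace.actions.append(("human", action))
        if is_done():
            break
    total = trace.agent_steps + trace.human_steps
    human_share = trace.human_steps / total if total else 0.0
    return trace, human_share  # human_share mirrors the "fraction of human steps" metric
```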



Figure 1: An overview of TheAgentCompany benchmark. It features a reproducible and self-hosted environment, simulated colleagues to test agent communication capabilities, checkpoint- and execution-based evaluation, and a set of 175 diverse, realistic, and professional tasks in a software engineering company setting.
Figure 3: Overview of OpenHands' default CodeAct + Browsing agent architecture, the baseline agent used throughout the experiments.
Figure 5: Simulated Colleague Communication Example 1: The agent is tasked with collecting required equipment while adhering to the department's budget. After calculating that the requested items exceed the budget, the agent negotiates with the simulated colleague to reduce the request, showcasing its ability to communicate effectively.
Performance comparison of various foundation models on TheAgentCompany.
Performance of various models on tasks of different natures in TheAgentCompany. All numbers are percentages (%).
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

December 2024 · 179 Reads · 1 Citation

We interact with computers every day, whether in daily life or at work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into its workflows, and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of LLM agents on performing real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in ways similar to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture of task automation with LM agents: in a setting simulating a real workplace, a good portion of simpler tasks can be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.
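
The figure caption above mentions checkpoint- and execution-based evaluation. The snippet below is a hedged sketch of what checkpoint scoring with partial credit could look like; the scheme, field names, and example checks are illustrative assumptions, not the benchmark's code.

```python
# Checkpoint-based scoring sketch: each task defines several verifiable
# sub-goals ("checkpoints"), and a run earns partial credit for the
# checkpoints it satisfies. Names and weighting are assumptions.

from typing import Callable, List

def score_task(checkpoints: List[Callable[[], bool]]) -> dict:
    """Evaluate one task run against its checkpoint predicates."""
    passed = sum(1 for check in checkpoints if check())
    total = len(checkpoints)
    return {
        "full_completion": total > 0 and passed == total,  # counts toward the headline success rate
        "partial_score": passed / total if total else 0.0,
    }

# Toy checkpoints standing in for execution-based checks
# (e.g., "file committed to the repo", "message sent to the simulated colleague").
if __name__ == "__main__":
    print(score_task([lambda: True, lambda: True, lambda: False]))
    # {'full_completion': False, 'partial_score': 0.666...}
```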


Beyond Browsing: API-Based Web Agents

October 2024 · 15 Reads · 1 Citation

Web browsers are a portal to the internet, where much of human activity is undertaken. Thus, there has been significant research work on AI agents that interact with the internet through web browsing. However, there is also another interface designed specifically for machine interaction with online content: application programming interfaces (APIs). In this paper we ask: what if we were to take tasks traditionally tackled by browsing agents, and give AI agents access to APIs? To do so, we propose two varieties of agents: (1) an API-calling agent that attempts to perform online tasks through APIs only, similar to traditional coding agents, and (2) a Hybrid Agent that can interact with online data through both web browsing and APIs. In experiments on WebArena, a widely used and realistic benchmark for web navigation tasks, we find that API-based agents outperform web browsing agents. Hybrid Agents outperform both nearly uniformly across tasks, resulting in a more than 20.0% absolute improvement over web browsing alone and a success rate of 35.8%, the state-of-the-art performance among task-agnostic agents. These results strongly suggest that when APIs are available, they present an attractive alternative to relying on web browsing alone.
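
A minimal sketch of the hybrid idea, assuming a simple keyword-routed choice between a documented API endpoint and a browsing fallback; the routing rule and helper names (api_catalog, browse) are assumptions for illustration, not the paper's agent.

```python
# Hybrid step sketch: prefer a structured API call when one covers the task,
# otherwise fall back to page observations and browser actions.

import requests

def hybrid_step(task: str, api_catalog: dict, browse):
    """api_catalog maps a keyword to an endpoint URL; browse is a browsing fallback."""
    for keyword, endpoint in api_catalog.items():
        if keyword in task.lower():
            # API path: machine-readable access to the same underlying data
            resp = requests.get(endpoint, timeout=10)
            resp.raise_for_status()
            return {"mode": "api", "data": resp.json()}
    # Browsing path: no matching API, so act through the web UI instead
    return {"mode": "browse", "data": browse(task)}
```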


Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale

September 2024 · 2 Reads

LLMs can now act as autonomous agents that interact with digital environments and complete specific objectives (e.g., arranging an online meeting). However, accuracy is still far from satisfactory, partly due to a lack of large-scale, direct demonstrations for digital tasks. Obtaining supervised data from humans is costly, and automatic data collection through exploration or reinforcement learning relies on complex environmental and content setup, resulting in datasets that lack comprehensive coverage of various scenarios. On the other hand, there is abundant knowledge that may indirectly assist task completion, such as online tutorials that were created for human consumption. In this work, we present Synatra, an approach that effectively transforms this indirect knowledge into direct supervision at scale. We define different types of indirect knowledge, and carefully study the available sources from which to obtain it, methods to encode the structure of direct demonstrations, and finally methods to transform indirect knowledge into direct demonstrations. We use 100k such synthetically created demonstrations to finetune a 7B CodeLlama, and demonstrate that the resulting agent surpasses all comparably sized models on three web-based task benchmarks (Mind2Web, MiniWoB++, and WebArena), as well as surpassing GPT-3.5 on WebArena and Mind2Web. In addition, while a synthetic demonstration costs only 3% as much as a human demonstration (at $0.031 each), we show that the synthetic demonstrations can be more effective than an identical number of human demonstrations collected from limited domains.
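
As a sketch of the indirect-to-direct transformation described above, the snippet below asks a generic LLM callable to rewrite a human-oriented tutorial into (observation, action) steps. The prompt wording, output schema, and `llm` interface are placeholders, not Synatra's actual pipeline.

```python
# Turn indirect knowledge (a how-to tutorial) into a synthetic direct
# demonstration: an ordered list of observation/action steps suitable for
# finetuning a digital agent. Schema and prompt are illustrative assumptions.

import json

DEMO_SCHEMA_HINT = (
    "Return a JSON list of steps, each with 'observation' (a short description "
    "of the page state) and 'action' (e.g., click(...), type(...))."
)

def tutorial_to_demonstration(tutorial_text: str, llm) -> list:
    """Convert a human-oriented tutorial into a synthetic agent demonstration."""
    prompt = (
        "Rewrite the following tutorial as an agent demonstration.\n"
        f"{DEMO_SCHEMA_HINT}\n\nTutorial:\n{tutorial_text}"
    )
    raw = llm(prompt)       # any text-completion callable
    return json.loads(raw)  # synthetic (observation, action) steps
```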


WebCanvas: Benchmarking Web Agents in Online Environments

June 2024 · 42 Reads · 1 Citation

For web agents to be practically useful, they must adapt to a continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) a novel evaluation metric that reliably captures critical intermediate actions or states necessary for task completion while disregarding noise caused by insignificant events or changed web elements; (2) a benchmark dataset called Mind2Web-Live, a refined version of the original static Mind2Web dataset containing 542 tasks with 2,439 intermediate evaluation states; and (3) lightweight and generalizable annotation tools and testing pipelines that enable the community to collect and maintain a high-quality, up-to-date dataset. Building on WebCanvas, we open-source an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluation. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across various websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research.
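
A hedged sketch of the key-state evaluation idea the abstract describes: a trajectory is scored by which annotated intermediate states it reaches, ignoring unrelated page changes. The matching rule, field names, and toy predicates below are assumptions, not WebCanvas's implementation.

```python
# Score a trajectory against annotated key intermediate states/actions.
# "task_success" requires all key states; "completion_rate" gives partial credit.

def evaluate_trajectory(trajectory: list, key_states: list) -> dict:
    matched = [any(key(step) for step in trajectory) for key in key_states]
    completion_rate = sum(matched) / len(matched) if key_states else 0.0
    return {
        "task_success": bool(matched) and all(matched),
        "completion_rate": completion_rate,
    }

# Toy usage: predicates stand in for annotated checks
# (e.g., "a search was performed", "checkout page reached").
steps = ["open_site", "search_item", "add_to_cart"]
keys = [
    lambda step: step == "search_item",  # key state 1: reached
    lambda step: step == "checkout",     # key state 2: missed
]
print(evaluate_trajectory(steps, keys))  # task_success=False, completion_rate=0.5
```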



Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation

December 2023 · 56 Reads · 55 Citations · Transactions of the Association for Computational Linguistics

Natural language generation has witnessed significant advancements due to the training of large language models on vast internet-scale datasets. Despite these advancements, there exists a critical challenge: These models can inadvertently generate content that is toxic, inaccurate, and unhelpful, and existing automatic evaluation metrics often fall short of identifying these shortcomings. As models become more capable, human feedback is an invaluable signal for evaluating and improving models. This survey aims to provide an overview of recent research that has leveraged human feedback to improve natural language generation. First, we introduce a taxonomy distilled from existing research to categorize and organize the varied forms of feedback. Next, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using feedback or training feedback models. We also discuss existing datasets for human-feedback data collection, and concerns surrounding feedback collection. Finally, we provide an overview of the nascent field of AI feedback, which uses large language models to make judgments based on a set of principles and minimize the need for human intervention. We also release a website of this survey at feedback-gap-survey.info.


WebArena: A Realistic Web Environment for Building Autonomous Agents

July 2023 · 119 Reads · 3 Citations

With advances in generative AI, the exciting potential for autonomous agents to manage daily tasks via natural language commands has emerged. However, current agents are primarily created and tested in simplified synthetic environments, substantially limiting real-world scenario representation. In this paper, we build an environment for agent command and control that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on websites, and we create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We design and implement several autonomous agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 10.59%. These results highlight the need for further development of robust agents, show that current state-of-the-art LMs are far from perfect performance on these real-life tasks, and demonstrate that WebArena can be used to measure such progress. Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev/.
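
The abstract emphasizes evaluating functional correctness of task completions rather than matching action sequences. Below is a minimal sketch of that idea under assumed state/answer predicates; it is not WebArena's evaluator code.

```python
# Execution-based functional-correctness check sketch: a task succeeds only if
# every task-specific predicate over the final environment state (or the
# agent's answer) holds, regardless of how the agent got there.

def functional_success(final_state: dict, answer: str, checks: list) -> bool:
    """Return True only if every task-specific check passes."""
    return all(check(final_state, answer) for check in checks)

# Toy example: an "add a product to the wishlist" task (hypothetical item name).
checks = [
    lambda state, ans: "widget-3000" in state.get("wishlist", []),  # state check
]
print(functional_success({"wishlist": ["widget-3000"]}, "", checks))  # True
```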


Hierarchical Prompting Assists Large Language Model on Web Navigation

July 2023 · 7 Reads · 18 Citations

Large language models (LLMs) struggle to process complicated observations in interactive decision making. To alleviate this issue, we propose a simple hierarchical prompting approach. Diverging from previous prompting approaches that always put the full observation (e.g., a web page) into the prompt, we propose to first construct an action-aware observation, which is more condensed and relevant, with a dedicated SUMMARIZER prompt. The ACTOR prompt then predicts the next action based on the summarized history. While our method has broad applicability, we particularly demonstrate its efficacy in the complex domain of web navigation, where a full observation often contains redundant and irrelevant information. Our approach outperforms the previous state-of-the-art prompting mechanism with the same LLM by 6.2% in task success rate, demonstrating its potential for interactive decision-making tasks with long observation traces.
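
A minimal two-stage sketch of the SUMMARIZER/ACTOR split described above, assuming a generic `llm` text-completion callable; the prompt wording and action format are illustrative placeholders, not the paper's prompts.

```python
# Stage 1 (SUMMARIZER): condense the raw observation into an action-aware summary.
# Stage 2 (ACTOR): predict the next action from the summarized history.

def summarize_observation(observation: str, instruction: str, llm) -> str:
    """SUMMARIZER: keep only content relevant to the current instruction."""
    prompt = (
        f"Task: {instruction}\n"
        f"Observation (full page):\n{observation}\n"
        "Summarize only the elements relevant to choosing the next action."
    )
    return llm(prompt)

def choose_action(summary_history: list, instruction: str, llm) -> str:
    """ACTOR: predict the next action from the condensed history."""
    history = "\n".join(summary_history)
    prompt = (
        f"Task: {instruction}\n"
        f"Condensed history:\n{history}\n"
        "Output the next action (e.g., click(id), type(id, text), stop(answer))."
    )
    return llm(prompt)
```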


Citations (20)


... The evaluation standards for AI agents are still in flux (Kapoor et al., 2024; Højmark et al., 2024), and they are urgently needed in specialized domains and realistic scenarios where the outcomes have greater bearing on their adoption. Recent works demonstrated the feasibility of LLMs in predicting temporal events (Ye et al., 2024a) and carrying out time series forecasting (Tang et al., 2024), but their equivalents in agentic systems are not yet realized. Scientific knowledge and claims have a strong temporal dependence, but they have so far been less studied in the context of generative language models (Park et al., 2024). ...

Reference:

Measuring temporal effects of agent knowledge by date-controlled tool use
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

... Most existing benchmarks (WebShop, MiniWoB, ALFWorld, etc.) focus on task performance in ideal conditions, not adversarial robustness (Wu et al., 2025). An exception is the recent VisualWebArena-Adv suite, which introduces adversarial tasks for web-based multimodal agents (Koh et al., 2024). The authors had to craft these specifically to measure how easily vision-language agents can be misled, highlighting the general scarcity of safety evaluations. ...

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
  • Citing Conference Paper
  • January 2024

... [Data] Human Flexibility and Variability. Human feedback in LLM-HAS varies significantly in role, timing, and style [27]. Since humans are subjective and influenced by their personalities, different individuals can lead to diverse outcomes when interacting with the same LLM-HAS. ...

Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation

Transactions of the Association for Computational Linguistics

... Inspired by the success of NLP pretraining, specialized models such as C-BERT [25], CodeBERT [26] and GraphCodeBERT [27] have been developed for programming languages, excelling in tasks like defect detection and code completion. Moreover, CodeBERTScore [28] refines token similarity scoring, surpassing models like RoBERTa [29] and CodeBERT. Derived from GPT-3, Codex [30] demonstrates robust performance in code translation and refactoring. ...

CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
  • Citing Conference Paper
  • January 2023

... All models are trained on MBPP (Austin et al., 2021) and evaluated on MBPP, HumanEval (Chen et al., 2021), and ODEX (Wang et al., 2023) to assess generalization to unseen data. For MBPP, we follow the data split provided in Austin et al. (2021), which defines train, validation, test, and few-shot sets. ...

Execution-Based Evaluation for Open-Domain Code Generation
  • Citing Conference Paper
  • January 2023

... Prompting techniques for agent applications combine in-context learning (Brown et al., 2020) with step-by-step reasoning and self-reflection over previous outcomes (Wei et al., 2022b;Yao et al., 2022b;Yang et al., 2023;Zheng et al., 2024). Prompting is particularly effective when working with large proprietary models that support sequence lengths long enough to let these methods grow arbitrarily complicated by making multiple API calls to correct mistakes, retrieve relevant information, and plan for the future (Topsakal & Akinci, 2023;Lutz et al., 2024;Sridhar et al., 2023). ...

Hierarchical Prompting Assists Large Language Model on Web Navigation

... Recent research suggests that formulating prompts as code can enhance LLMs' reasoning abilities (Wang et al., 2023;Zhang et al., 2023). In our task, the Python code format effectively incorporates all necessary terminologies, enabling LLMs to understand them without confusion. ...

Causal Reasoning of Entities and Events in Procedural Texts
  • Citing Conference Paper
  • January 2023

... As LLMs grow in complexity and capability, a variety of benchmarks have been tailored in different ways, with different evaluation metrics, aiming to evaluate different LLMs on specific SE tasks. For example, some benchmarks are primarily constructed through human writing [33, 281], some are tailored by automated collection methods [91, 150], and others leverage a combination of both approaches [77, 100, 174, 231]. However, there is currently no systematic literature review (SLR) that provides a comprehensive overview of these benchmarks and their construction methods. ...

MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
  • Citing Conference Paper
  • January 2023

... LLMs can adapt their behavior based on in-context demonstrations [1, 69]. For example, LLM agents could infer how to tackle a particular type of equation from a few examples (e.g., the time-independent Schrödinger equation −(ℏ²/2m)ψ″ + V(x)ψ = Eψ for different potentials V(x), like the harmonic oscillator V(x) = mω²x²/2) and then apply a similar methodology to a new potential, such as the microwave shielding for cold molecules [70, 71], where experimental setups require analyzing a new long-range potential. ...

Language Models of Code are Few-Shot Commonsense Learners
  • Citing Conference Paper
  • January 2022

... CrystalBLEU (Eghbali and Pradel, 2022) focuses more on the inherent differences between source code and natural language, such as trivial shared n-gram syntax. CodeBERTScore (Zhou et al., 2023) uses pre-trained models to encode the translation output and reference translation, then calculates the dot product similarity between them, enabling comparisons of code pairs with distinct lexical forms. However, CodeBLEU, CrystalBLEU, and CodeBERTScore have limitations, as they only support a limited range of programming languages and cannot be used in general multilingual scenarios. ...

CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
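
The citing snippet above describes scoring by encoding candidate and reference code with a pretrained model and comparing token embeddings via dot products. Below is a rough BERTScore-style sketch of that idea, using the public microsoft/codebert-base checkpoint as an assumed stand-in; CodeBERTScore's actual implementation, token weighting, and checkpoints differ.

```python
# Token-level greedy-matching similarity sketch for code pairs.

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(code: str) -> torch.Tensor:
    """Encode code and L2-normalize token embeddings so dot product = cosine."""
    inputs = tok(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (tokens, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

def code_similarity_f1(candidate: str, reference: str) -> float:
    """Greedy token matching: precision over candidate tokens, recall over reference."""
    c, r = embed(candidate), embed(reference)
    sim = c @ r.T                                   # pairwise token similarities
    precision = sim.max(dim=1).values.mean().item()
    recall = sim.max(dim=0).values.mean().item()
    return 2 * precision * recall / (precision + recall)

print(code_similarity_f1("def add(a, b): return a + b", "def plus(x, y): return x + y"))
```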