Frank F. Xu’s research while affiliated with Carnegie Mellon University and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (43)


Figure 3: Correlation between Human Step Count and End-to-End Task Accuracy.
Figure 4: Screenshot of the CowPilot evaluation result page. After each task is completed, the evaluation metric values are shown as a summary.
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
  • Preprint
  • File available

January 2025

·

38 Reads

Faria Huq

·

Zora Zhiruo Wang

·

Frank F. Xu

·

[...]

·

While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and in modeling user preferences. This presents an opportunity for humans to collaborate with the agent and leverage its capabilities effectively. We propose CowPilot, a framework supporting both autonomous and human-agent collaborative web navigation, with evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research into how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html
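The collaboration loop described in the abstract (agent proposes, human accepts, overrides, or takes over) can be pictured with a short sketch. The `propose_action`, `review`, and environment interfaces below are hypothetical placeholders, not CowPilot's released API.

```python
# Hypothetical sketch of a human-agent collaborative navigation loop in the
# spirit of CowPilot: the agent proposes each step, and the user may accept,
# override, or stop. All interfaces here are placeholders, not a real API.
from dataclasses import dataclass, field

@dataclass
class Trace:
    agent_steps: int = 0   # steps taken on agent suggestions
    human_steps: int = 0   # steps where the human overrode the agent
    actions: list = field(default_factory=list)

def collaborative_episode(agent, user, env, max_steps=30):
    trace = Trace()
    obs = env.reset()
    for _ in range(max_steps):
        proposal = agent.propose_action(obs)   # agent suggests the next step
        decision = user.review(proposal, obs)  # user: "accept", "override", or "stop"
        if decision.kind == "stop":
            break
        if decision.kind == "accept":
            action = proposal
            trace.agent_steps += 1
        else:
            action = decision.action
            trace.human_steps += 1
        trace.actions.append(action)
        obs, done = env.step(action)           # assumed (observation, done) contract
        if done:
            break
    return trace
```

A human-step fraction like the one reported in the paper would then be computed over such a trace as `human_steps / (agent_steps + human_steps)`.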




Figure 1: An overview of TheAgentCompany benchmark. It features a reproducible and self-hosted environment, simulated colleagues to test agent communication capabilities, checkpoint- and execution-based evaluation, and a set of 175 diverse, realistic, and professional tasks in a software engineering company setting.
Figure 3: Overview of OpenHands' default CodeAct + Browsing agent architecture, the baseline agent used throughout the experiments.
Figure 5: Simulated Colleague Communication Example 1: The agent is tasked with collecting required equipment while adhering to the department's budget. After calculating that the requested items exceed the budget, the agent negotiates with the simulated colleague to reduce the request, showcasing its ability to communicate effectively.
Performance comparison of various foundation models on TheAgentCompany.
Performance of various models on tasks of different natures in TheAgentCompany. All numbers are percentages (%).
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

December 2024

·

179 Reads

·

1 Citation

We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of LLM agents on performing real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal websites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture of task automation with LM agents -- in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.
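The checkpoint- and execution-based evaluation mentioned above can be illustrated with a minimal scoring sketch: each checkpoint is a predicate over the final environment state, and the task score is the weighted fraction of checkpoints satisfied. The specific checks and weights below are invented for illustration and are not the benchmark's actual graders.

```python
# Illustrative sketch of checkpoint-based scoring for a long-horizon task.
# The checks and weights are made up; real graders inspect the actual
# environment (files, services, messages) after execution.
from typing import Callable

def score_task(state: dict, checkpoints: list[tuple[Callable[[dict], bool], float]]) -> float:
    total = sum(w for _, w in checkpoints)
    earned = sum(w for check, w in checkpoints if check(state))
    return earned / total if total else 0.0

# Example usage with hypothetical checks:
checkpoints = [
    (lambda s: "budget.xlsx" in s.get("files", []), 1.0),             # produced the spreadsheet
    (lambda s: s.get("total_cost", 1e9) <= s.get("budget", 0), 2.0),  # stayed within budget
]
print(score_task({"files": ["budget.xlsx"], "total_cost": 900, "budget": 1000}, checkpoints))
```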


The BrowserGym Ecosystem for Web Agent Research

December 2024

·

18 Reads

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.
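A minimal sketch of the gym-style loop that BrowserGym standardizes is shown below, assuming the package and its browser dependencies are installed. The import path, task id, and action string follow the Gymnasium convention but are assumptions for illustration; consult the BrowserGym documentation for the actual registered names.

```python
# Minimal sketch of the unified gym-style loop BrowserGym provides.
# The import path, task id, and action string below are assumptions for
# illustration; check the BrowserGym documentation for the registered names.
import gymnasium as gym
import browsergym.miniwob  # assumed: registers MiniWoB++ tasks with Gymnasium

env = gym.make("browsergym/miniwob.click-test")  # assumed task id
obs, info = env.reset()
done = False
while not done:
    # A real agent would map `obs` (DOM / accessibility tree, screenshot) to an
    # action string; this fixed placeholder only illustrates the interface.
    action = 'click("12")'
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```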


Beyond Browsing: API-Based Web Agents

October 2024

·

15 Reads

·

1 Citation

Web browsers are a portal to the internet, where much of human activity is undertaken. Thus, there has been significant research work in AI agents that interact with the internet through web browsing. However, there is also another interface designed specifically for machine interaction with online content: application programming interfaces (APIs). In this paper we ask -- what if we were to take tasks traditionally tackled by browsing agents, and give AI agents access to APIs? To do so, we propose two varieties of agents: (1) an API-calling agent that attempts to perform online tasks through APIs only, similar to traditional coding agents, and (2) a Hybrid Agent that can interact with online data through both web browsing and APIs. In experiments on WebArena, a widely-used and realistic benchmark for web navigation tasks, we find that API-based agents outperform web browsing agents. Hybrid Agents outperform both nearly uniformly across tasks, resulting in a more than 20.0% absolute improvement over web browsing alone and achieving a success rate of 35.8%, the state-of-the-art performance among task-agnostic agents. These results strongly suggest that when APIs are available, they present an attractive alternative to relying on web browsing alone.
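One way to read the Hybrid Agent design is as a dispatcher that prefers a documented API endpoint when one matches the current sub-goal and falls back to browsing otherwise. The sketch below is schematic; `find_endpoint`, the `api_docs` format, and the `browser.act` interface are hypothetical, not the paper's actual implementation.

```python
# Schematic sketch of a hybrid web agent in the spirit of "Beyond Browsing":
# try to satisfy a sub-goal through a documented API first, and fall back to
# browser actions when no endpoint fits. All helpers are hypothetical.
import requests

def find_endpoint(subgoal: str, api_docs: dict) -> dict | None:
    """Return an endpoint spec whose description mentions the sub-goal, if any."""
    for name, spec in api_docs.items():
        if subgoal.lower() in spec.get("description", "").lower():
            return spec
    return None

def hybrid_step(subgoal: str, api_docs: dict, browser):
    spec = find_endpoint(subgoal, api_docs)
    if spec is not None:
        # API path: a single HTTP call can replace many browsing steps.
        return requests.request(spec["method"], spec["url"]).json()
    # Browsing path: delegate to a browser-based agent (assumed interface).
    return browser.act(subgoal)
```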


Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale

September 2024

·

2 Reads

LLMs can now act as autonomous agents that interact with digital environments and complete specific objectives (e.g., arranging an online meeting). However, accuracy is still far from satisfactory, partly due to a lack of large-scale, direct demonstrations for digital tasks. Obtaining supervised data from humans is costly, and automatic data collection through exploration or reinforcement learning relies on complex environmental and content setup, resulting in datasets that lack comprehensive coverage of various scenarios. On the other hand, there is abundant knowledge that may indirectly assist task completion, such as online tutorials that were created for human consumption. In this work, we present Synatra, an approach that effectively transforms this indirect knowledge into direct supervision at scale. We define different types of indirect knowledge, and carefully study the available sources to obtain it, methods to encode the structure of direct demonstrations, and finally methods to transform indirect knowledge into direct demonstrations. We use 100k such synthetically-created demonstrations to finetune a 7B CodeLlama, and demonstrate that the resulting agent surpasses all comparably sized models on three web-based task benchmarks: Mind2Web, MiniWoB++, and WebArena, as well as surpassing GPT-3.5 on WebArena and Mind2Web. In addition, while synthetic demonstrations prove to be only 3% the cost of human demonstrations (at $0.031 each), we show that the synthetic demonstrations can be more effective than an identical number of human demonstrations collected from limited domains.
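The central transformation, turning a human-oriented tutorial step into a direct (observation, action) demonstration, can be sketched as a single prompting call. The prompt wording and the `llm` callable are assumptions for illustration, not Synatra's released pipeline.

```python
# Hypothetical sketch of converting indirect knowledge (a tutorial step) into
# a direct demonstration (observation + executable action), in the spirit of
# Synatra. The prompt template and `llm` interface are assumptions.
import json

PROMPT = """You are given one step of a how-to tutorial for a website.
Rewrite it as a demonstration with two fields:
"observation": a plausible accessibility-tree snippet the user would see,
"action": a single executable action string (e.g. click(...) or type(...)).
Tutorial step: {step}
Answer in JSON."""

def tutorial_step_to_demo(step: str, llm) -> dict:
    raw = llm(PROMPT.format(step=step))   # llm: str -> str, assumed interface
    return json.loads(raw)                # expects {"observation": ..., "action": ...}

# Example (hypothetical model):
# demo = tutorial_step_to_demo('Click the "New issue" button in the sidebar.', llm=my_model)
```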


Figure 3: OpenDevin consists of 3 main components: 1) an agent abstraction through which the community can contribute different implementations of agents (§2.1) into an Agent Hub (§3); 2) an event stream for tracking the history of actions and observations; 3) an agent runtime that executes agent actions and turns them into observations (§2.2).
OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

July 2024

·

384 Reads

·

1 Citation

Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. In this paper, we introduce OpenDevin, a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering (e.g., SWE-Bench) and web browsing (e.g., WebArena), among others. Released under the permissive MIT license, OpenDevin is a community project spanning academia and industry with more than 1.3K contributions from over 160 contributors and will improve going forward.
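Figure 3's event stream, which records the interleaved history of actions and observations, can be pictured with a small data structure like the one below; the class and field names are illustrative, not OpenDevin's actual types.

```python
# Illustrative sketch of an event stream that records interleaved agent
# actions and environment observations, as in the architecture description.
# Class and field names are assumptions for illustration.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Event:
    kind: str        # "action" or "observation"
    source: str      # e.g. "agent", "sandbox", "browser"
    content: str
    timestamp: datetime

class EventStream:
    def __init__(self):
        self._events: list[Event] = []

    def add(self, kind: str, source: str, content: str) -> None:
        self._events.append(Event(kind, source, content, datetime.now(timezone.utc)))

    def history(self) -> list[Event]:
        """Full trajectory, e.g. for re-prompting the agent or for evaluation."""
        return list(self._events)
```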


CodeRAG-Bench: Can Retrieval Augment Code Generation?

June 2024

·

98 Reads

While language models (LMs) have proven remarkably adept at generating code, many programs are challenging for LMs to generate using their parametric knowledge alone. Providing external contexts such as library documentation can facilitate generating accurate and functional code. Despite the success of retrieval-augmented generation (RAG) in various text-oriented tasks, its potential for improving code generation remains under-explored. In this work, we conduct a systematic, large-scale analysis by asking: in what scenarios can retrieval benefit code generation models? and what challenges remain? We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks: basic programming, open-domain, and repository-level problems. We aggregate documents from five sources for models to retrieve contexts: competition solutions, online tutorials, library documentation, StackOverflow posts, and GitHub repositories. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources. While notable gains are made in final code generation by retrieving high-quality contexts across various settings, our analysis reveals room for improvement -- current retrievers still struggle to fetch useful contexts especially with limited lexical overlap, and generators fail to improve given limited context lengths or limited abilities to integrate additional contexts. We hope CodeRAG-Bench serves as an effective testbed to encourage further development of advanced code-oriented RAG methods.
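The retrieve-then-generate setting that CodeRAG-Bench evaluates can be sketched generically as follows; the `embed` and `generate` callables stand in for any embedding model and any code LM, and are assumed interfaces rather than the benchmark's actual harness.

```python
# Generic retrieve-then-generate sketch for code generation. The `embed`
# (str -> vector) and `generate` (str -> str) callables are assumed interfaces.
import numpy as np

def retrieve(query: str, docs: list[str], embed, k: int = 3) -> list[str]:
    q = embed(query)                                  # shape (d,)
    d = np.stack([embed(doc) for doc in docs])        # shape (n, d)
    scores = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-8)
    top = np.argsort(-scores)[:k]                     # indices of the k best-matching docs
    return [docs[i] for i in top]

def rag_codegen(problem: str, docs: list[str], embed, generate) -> str:
    context = "\n\n".join(retrieve(problem, docs, embed))
    prompt = f"# Relevant documentation:\n{context}\n\n# Task:\n{problem}\n# Solution:\n"
    return generate(prompt)
```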


WebArena: A Realistic Web Environment for Building Autonomous Agents

July 2023

·

119 Reads

·

3 Citations

With generative AI advances, the exciting potential for autonomous agents to manage daily tasks via natural language commands has emerged. However, current agents are primarily created and tested in simplified synthetic environments, substantially limiting real-world scenario representation. In this paper, we build an environment for agent command and control that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on websites, and we create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and are designed to emulate tasks that humans routinely perform on the internet. We design and implement several autonomous agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 10.59%. These results highlight the need for further development of robust agents, show that current state-of-the-art LMs are far from perfect performance on these real-life tasks, and demonstrate that WebArena can be used to measure such progress. Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev/.
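Functional correctness in this setting means verifying a post-condition on the final site state rather than matching an action sequence. The toy check below illustrates the idea against a GitLab-style REST API; the base URL, token, and expected title are invented and are not part of the WebArena release.

```python
# Toy illustration of functional-correctness checking: instead of comparing
# action sequences, verify a post-condition on the live site after the agent
# finishes. All concrete values here are hypothetical.
import requests

def check_issue_created(base_url: str, token: str, expected_title: str) -> bool:
    resp = requests.get(
        f"{base_url}/api/v4/issues",          # GitLab-style REST endpoint
        headers={"PRIVATE-TOKEN": token},
        timeout=10,
    )
    resp.raise_for_status()
    return any(issue.get("title") == expected_title for issue in resp.json())

# Example (hypothetical values):
# check_issue_created("http://gitlab.example.com", "dummy-token", "Fix login bug")
```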


Citations (23)


... Retrieval augmented generation is broadly known to improve code generation (Wang et al., 2024). The specific idea of dynamically retrieving relevant in-context examples from a larger training set was first proposed in Poesia et al. (2022) and was later shown to be highly effective for program optimization (Shypula et al., 2024). ...

Reference:

LLM Program Optimization via Retrieval Augmented Search
CodeRAG-Bench: Can Retrieval Augment Code Generation?
  • Citing Conference Paper
  • January 2025

... The evaluation standards for AI agents are still in flux (Kapoor et al., 2024; Højmark et al., 2024), and they are urgently needed in specialized domains and realistic scenarios where the outcomes have greater bearing on their adoption. Recent works demonstrated the feasibility of LLMs in predicting temporal events (Ye et al., 2024a) and carrying out time series forecasting (Tang et al., 2024), but their equivalents in agentic systems are not yet realized. Scientific knowledge and claims have a strong temporal dependence but have so far been less studied in the context of generative language models (Park et al., 2024). ...

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

... 20 In several domains, for example software design, AI has already demonstrated the ability to autonomously perform research and generate high-quality written outputs. 22,23 A prominent example is the use of role-based self-collaboration frameworks, where multiple AI agents assume specialized roles (such as analyst, coder, and tester) and communicate via natural language to iteratively plan, generate, test, and refine software. Empirical studies show that such multi-agent frameworks consistently outperform single-agent models or zero-shot prompting approaches, resulting in more structured and reliable outputs. ...

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

... Retrieval Augmented Generation (RAG). RAG improves generative models by integrating retrieval to inject external knowledge [68,69,70,71,72,73]. While well-studied in text domains [74,75,76,77,78,79], recent efforts have extended RAG to vision-language tasks. ...

Active Retrieval Augmented Generation
  • Citing Conference Paper
  • January 2023

... Prompting techniques for agent applications combine in-context learning (Brown et al., 2020) with step-by-step reasoning and self-reflection over previous outcomes (Wei et al., 2022b;Yao et al., 2022b;Yang et al., 2023;Zheng et al., 2024). Prompting is particularly effective when working with large proprietary models that support sequence lengths long enough to let these methods grow arbitrarily complicated by making multiple API calls to correct mistakes, retrieve relevant information, and plan for the future (Topsakal & Akinci, 2023;Lutz et al., 2024;Sridhar et al., 2023). ...

Hierarchical Prompting Assists Large Language Model on Web Navigation

... As LLMs grow in complexity and capability, a variety of benchmarks have been tailored in different ways, with different evaluation metrics, aiming to evaluate different LLMs on specific SE tasks. For example, some benchmarks are primarily constructed through human writing [33,281], some are tailored by automated collection methods [91,150], and others leverage a combination of both approaches [77,100,174,231]. However, there is currently no systematic literature review (SLR) that provides a comprehensive overview of these benchmarks and their construction methods. ...

MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
  • Citing Conference Paper
  • January 2023

... This is due to the setup in which the LLM will always provide some response, as it predicts the most probable next token, but the likelihood of a correct or valuable answer becomes lower if the information is a detail that rarely occurs in the training dataset or was not included in the first place. 19 These weaknesses are precisely the areas in which ontologies and other highly structured knowledge resources have strengths and can complement LLMs. LLMs may serve as a translation layer. ...

Active Retrieval Augmented Generation
  • Citing Preprint
  • May 2023

... (3) where $T$ denotes the temperature that controls the sharpness of the softmax function and $\mathcal{N}(h) = \{(K_j^m, V_j^m)\}_{j=1}^{k}$ is the set of $k$ nearest neighbors retrieved from $\mathcal{D}$ using a pre-defined distance function $d(\cdot,\cdot)$. In practice, we can use either the dot-product function or the negative $\ell_2$ distance to implement $d(\cdot,\cdot)$. Xu et al. (2023) have demonstrated that the performance of these two functions is almost identical, so we adopt the dot-product function for theoretical analysis in this paper. Finally, kNN-MT interpolates the vanilla NMT prediction $p_{\mathrm{NMT}}$ with the kNN prediction $p_{\mathrm{kNN}}$ to obtain the final next-token probability: ...

Why do Nearest Neighbor Language Models Work?
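The excerpt above is truncated just before the interpolation equation; in the usual kNN-MT notation it references, the retrieval distribution and the final next-token probability read:

$$p_{\mathrm{kNN}}(y_t \mid x, y_{<t}) \propto \sum_{(K_j, V_j) \in \mathcal{N}(h)} \mathbb{1}\left[y_t = V_j\right] \exp\!\left(\frac{-d(K_j, h)}{T}\right)$$

$$p(y_t \mid x, y_{<t}) = \lambda\, p_{\mathrm{kNN}}(y_t \mid x, y_{<t}) + (1 - \lambda)\, p_{\mathrm{NMT}}(y_t \mid x, y_{<t})$$

where $\lambda$ is the interpolation weight between the retrieval-based and the vanilla NMT distributions.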

... Large language models (LLMs) have demonstrated impressive capabilities in the field of natural language processing (NLP), excelling in various tasks such as conversation [34], text generation [49]-[51], and reasoning [52]. Given their success in NLP, attention has naturally turned to code-related tasks, leading to the creation of numerous code generation models, such as AlphaCode [53], PolyCoder [54], Codex [36], Google's program synthesis model [55], and CodeGen [56]. ...

A systematic evaluation of large language models of code
  • Citing Conference Paper
  • June 2022

... Our task selection was inspired by the programming tasks used by Xu et al. [56] and Vaithilingam et al. [11]. We first categorized the TranX Developer Study tasks [56] and tasks from the DS-1000 benchmark [57] into different common programming task types. ...

In-IDE Code Generation from Natural Language: Promise and Challenges
  • Citing Article
  • April 2022

ACM Transactions on Software Engineering and Methodology