Yiqing Xie’s scientific contributions


Publications (8)


Table: Results with O3-mini as the strong LM and Qwen2.5-Coder-7B as the weak LM. Red denotes a drop; drops of at most 0.67% are not marked in red, to allow for variance from the non-determinism introduced by increasing temperature values.
Table: Results with O3-mini as the strong LM and Qwen2.5-Coder-32B as the weak LM. Red denotes a drop, with the same 0.67% variance allowance as above.
An Empirical Study on Strong-Weak Model Collaboration for Repo-level Code Generation
  • Preprint
  • File available

May 2025

Shubham Gandhi · Atharva Naik · Yiqing Xie · Carolyn Rose

We study cost-efficient collaboration between strong and weak language models for repository-level code generation, where the weak model handles simpler tasks at lower cost and the most challenging tasks are delegated to the strong model. While many works propose architectures for this task, few analyze performance relative to cost. We evaluate a broad spectrum of collaboration strategies on GitHub issue resolution: context-based, pipeline-based, and dynamic. Our most effective collaborative strategy achieves performance equivalent to the strong model while reducing cost by 40%. Based on our findings, we offer actionable guidelines for choosing collaboration strategies under varying budget and performance constraints. Our results show that strong-weak collaboration substantially boosts the weak model's performance at a fraction of the cost, with pipeline- and context-based methods being the most efficient. We release the code for our work at https://github.com/shubhamrgandhi/codegen-strong-weak-collab.
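
As an illustration of the pipeline-style collaboration the abstract describes, the sketch below routes an issue to the weak model first and escalates to the strong model only when a lightweight check fails. The function names and the escalation check are assumptions for illustration, not the paper's implementation.

# Hypothetical strong-weak cascade for issue resolution (illustrative sketch).
# The generator callables and passes_check stand in for a repo-level code
# generator and an execution-based check; they are not APIs from the paper.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Attempt:
    model: str    # which model produced the accepted patch
    patch: str    # generated patch text
    cost: float   # accumulated API cost (assumed accounting)

def cascade(issue: str,
            weak_generate: Callable[[str], tuple[str, float]],
            strong_generate: Callable[[str], tuple[str, float]],
            passes_check: Callable[[str], bool]) -> Attempt:
    """Try the weak model first; escalate to the strong model on failure."""
    patch, weak_cost = weak_generate(issue)
    if passes_check(patch):
        return Attempt(model="weak", patch=patch, cost=weak_cost)
    # Escalate: the strong model only handles issues the weak model failed.
    strong_patch, strong_cost = strong_generate(issue)
    return Attempt(model="strong", patch=strong_patch,
                   cost=weak_cost + strong_cost)

Under this kind of routing, the strong model is paid for only on the subset of issues the weak model cannot resolve, which is the intuition behind matching strong-model performance at a reduced overall cost.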


Figure 5: Case study 1. The original score_explicit_question function and its context extracted from the original GitHub repository. The function calls the text completion function from the OpenAI API.
RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing

March 2025 · 4 Reads

Yiqing Xie · Alex Xie · Divyanshu Sheth · [...] · Carolyn Rose

We present RepoST, a scalable method to construct environments that provide execution feedback for repository-level code generation, for both training and evaluation. Unlike existing works that aim to build entire repositories for execution, which is challenging for both humans and LLMs, we provide execution feedback with sandbox testing, which isolates a given target function and its dependencies into a separate script for testing. Sandbox testing reduces the complexity of external dependencies and enables constructing environments at a large scale. We use our method to construct RepoST-Train, a large-scale training set with 7,415 functions from 832 repositories. Training with the execution feedback provided by RepoST-Train leads to a performance gain of 5.5% Pass@1 on HumanEval and 3.5% Pass@1 on RepoEval. We also build an evaluation dataset, RepoST-Eval, and benchmark 12 code generation models.
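
To make the sandbox-testing idea concrete, here is a minimal sketch of isolating a target function: its source, the sources of its local dependencies, and a test are concatenated into one standalone script, which is then executed in a separate process for pass/fail feedback. The helper names below are hypothetical, not RepoST's actual interface.

# Illustrative sketch of sandbox testing (assumed, not RepoST's actual code).
# A target function and its local dependencies are copied into one standalone
# script, tests are appended, and the script runs in an isolated subprocess.

import subprocess
import tempfile
from pathlib import Path

def build_sandbox_script(target_src: str, dependency_srcs: list[str],
                         test_src: str) -> str:
    """Concatenate dependency sources, the target function, and its tests."""
    return "\n\n".join(dependency_srcs + [target_src, test_src])

def run_in_sandbox(script: str, timeout_s: int = 60) -> bool:
    """Execute the generated script in a separate process; report pass/fail."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sandbox_test.py"
        path.write_text(script)
        proc = subprocess.run(["python", str(path)],
                              capture_output=True, timeout=timeout_s)
        return proc.returncode == 0  # execution feedback for training/eval

Because each script carries only the dependencies the target function actually needs, environments of this kind can be built at scale without setting up whole repositories for execution.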



Figure 1: An overview of TheAgentCompany benchmark. It features a reproducible and self-hosted environment, simulated colleagues to test agent communication capabilities, checkpoint- and execution-based evaluation, and a set of 175 diverse, realistic, and professional tasks in a software engineering company setting.
Figure 3: Overview of OpenHands' default CodeAct + Browsing agent architecture, the baseline agent used throughout the experiments.
Figure 5: Simulated Colleague Communication Example 1: The agent is tasked with collecting required equipment while adhering to the department's budget. After calculating that the requested items exceed the budget, the agent negotiates with the simulated colleague to reduce the request, showcasing its ability to communicate effectively.
Table: Performance comparison of various foundation models on TheAgentCompany.
Table: Performance of various models on tasks of different natures in TheAgentCompany. All numbers are percentages (%).
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

December 2024 · 178 Reads · 1 Citation

We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the performance of these LLM agents on real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in ways similar to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that the most competitive agent can complete 24% of the tasks autonomously. This paints a nuanced picture of task automation with LM agents -- in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.
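
The figure captions above mention checkpoint- and execution-based evaluation. As a rough illustration of how partial progress on long-horizon tasks might be scored, the snippet below computes a full-completion rate and a mean partial score from per-task checkpoint results; the exact scoring formula used by TheAgentCompany may differ, so treat this purely as an assumed scheme.

# Assumed illustration of checkpoint-based scoring for agent tasks.
# Each task defines a list of checkpoints; a task counts as fully completed
# only if every checkpoint passes. The partial-credit formula here is an
# assumption, not necessarily the benchmark's official metric.

from typing import Sequence

def task_scores(checkpoint_results: Sequence[Sequence[bool]]) -> tuple[float, float]:
    """Return (full completion rate, mean partial score) over all tasks."""
    full, partial = 0, 0.0
    for checkpoints in checkpoint_results:
        passed = sum(checkpoints)
        if passed == len(checkpoints):
            full += 1
        partial += passed / len(checkpoints)
    n = len(checkpoint_results)
    return full / n, partial / n

# Example: three tasks, the first fully solved, the others only partially.
rate, partial = task_scores([[True, True], [True, False, False], [False]])
print(f"completion rate = {rate:.0%}, mean partial score = {partial:.2f}")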


Improving Model Factuality with Fine-grained Critique-based Evaluator

October 2024 · 2 Reads

Factuality evaluation aims to detect factual errors produced by language models (LMs) and hence guide the development of more factual models. Towards this goal, we train a factuality evaluator, FenCE, that provides LM generators with claim-level factuality feedback. We conduct data augmentation on a combination of public judgment datasets to train FenCE to (1) generate textual critiques along with scores and (2) make claim-level judgments based on diverse source documents obtained by various tools. We then present a framework that leverages FenCE to improve the factuality of LM generators by constructing training data. Specifically, we generate a set of candidate responses, leverage FenCE to revise and score each response without introducing lesser-known facts, and train the generator by preferring highly scored revised responses. Experiments show that our data augmentation methods improve the evaluator's accuracy by 2.9% on LLM-AggreFact. With FenCE, we improve Llama3-8B-chat's factuality rate by 14.45% on FActScore, outperforming state-of-the-art factuality finetuning methods by 6.96%.
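
To clarify the data-construction framework described above, here is a hedged sketch of the candidate-revise-score loop: generate several candidate responses, have the evaluator critique, revise, and score each one, then keep preference pairs that favor the highest-scored revision. The function names are placeholders for illustration, not the released FenCE interface.

# Illustrative sketch of building preference data with a critique-based
# evaluator. `generate` and `evaluate_and_revise` are hypothetical stand-ins
# for the LM generator and the FenCE evaluator described in the abstract.

from typing import Callable

def build_preference_pairs(prompt: str,
                           generate: Callable[[str], str],
                           evaluate_and_revise: Callable[[str, str], tuple[str, float]],
                           n_candidates: int = 4) -> list[tuple[str, str]]:
    """Return (preferred, rejected) response pairs for preference training."""
    scored = []
    for _ in range(n_candidates):
        response = generate(prompt)
        # The evaluator returns a revised response plus a claim-level score.
        revised, score = evaluate_and_revise(prompt, response)
        scored.append((score, revised, response))
    scored.sort(reverse=True, key=lambda item: item[0])
    best_revision = scored[0][1]
    # Prefer the highest-scored revision over each lower-scored original.
    return [(best_revision, original) for _, _, original in scored[1:]]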


CodeRAG-Bench: Can Retrieval Augment Code Generation?

June 2024 · 98 Reads

While language models (LMs) have proven remarkably adept at generating code, many programs are challenging for LMs to generate using their parametric knowledge alone. Providing external contexts such as library documentation can facilitate generating accurate and functional code. Despite the success of retrieval-augmented generation (RAG) in various text-oriented tasks, its potential for improving code generation remains under-explored. In this work, we conduct a systematic, large-scale analysis by asking: in what scenarios can retrieval benefit code generation models? and what challenges remain? We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks: basic programming, open-domain, and repository-level problems. We aggregate documents from five sources for models to retrieve contexts: competition solutions, online tutorials, library documentation, StackOverflow posts, and GitHub repositories. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources. While notable gains are made in final code generation by retrieving high-quality contexts across various settings, our analysis reveals room for improvement -- current retrievers still struggle to fetch useful contexts, especially when lexical overlap is limited, and generators fail to improve given limited context lengths or a limited ability to integrate additional contexts. We hope CodeRAG-Bench serves as an effective testbed to encourage further development of advanced code-oriented RAG methods.
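
As a concrete picture of the retrieval-augmented setup being benchmarked, the sketch below ranks candidate documents for a coding problem with a simple lexical scorer and prepends the top-k documents to the generation prompt. The toy overlap-based retriever and the `generate` callable are simplified placeholders, not CodeRAG-Bench's actual retrievers or models.

# Minimal retrieval-augmented code generation sketch (assumed setup).
# A toy lexical retriever ranks documents by token overlap with the problem,
# and the top-k documents are prepended to the prompt before generation.

from typing import Callable

def retrieve(problem: str, documents: list[str], k: int = 3) -> list[str]:
    """Rank documents by simple token overlap with the problem statement."""
    query = set(problem.lower().split())
    ranked = sorted(documents,
                    key=lambda doc: len(query & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def rag_generate(problem: str, documents: list[str],
                 generate: Callable[[str], str]) -> str:
    """Prepend retrieved context to the problem, then call the generator."""
    context = "\n\n".join(retrieve(problem, documents))
    prompt = f"# Retrieved context:\n{context}\n\n# Task:\n{problem}\n"
    return generate(prompt)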



Citations (4)


... Retrieval augmented generation is broadly known to improve code generation (Wang et al., 2024). The specific idea of dynamically retrieving relevant in-context examples from a larger training set was first proposed in Poesia et al. (2022) and was later shown to be highly effective for program optimization (Shypula et al., 2024). ...

Reference:

LLM Program Optimization via Retrieval Augmented Search
CodeRAG-Bench: Can Retrieval Augment Code Generation?
  • Citing Conference Paper
  • January 2025

... The evaluation standards for AI agents are still in flux (Kapoor et al., 2024; Højmark et al., 2024), and they are urgently needed in specialized domains and realistic scenarios where the outcomes have greater bearing on their adoption. Recent works have demonstrated the feasibility of LLMs in predicting temporal events (Ye et al., 2024a) and carrying out time series forecasting (Tang et al., 2024), but their equivalents in agentic systems are not yet realized. Scientific knowledge and claims have a strong temporal dependence, but they have so far been less studied in the context of generative language models (Park et al., 2024). ...

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

... More sophisticated approaches include entailment-based metrics (e.g., FactCC [7], SummaC [8], AlignScore [9]), question-answering-based metrics (e.g., FEQA [10], QuestEval [11], Q² [12]), and information theory-based metrics (e.g., InfoLM [13]). Recent developments include ensemble approaches like FENICE [14] and LLM-based metrics such as DocLens [15]. While DocLens represents an innovative approach to medical text evaluation, its methodology, which relies on claim generation and citation matching, does not directly address the need for hallucination detection in clinical summaries. ...

DocLens: Multi-aspect Fine-grained Medical Text Evaluation
  • Citing Conference Paper
  • January 2024

... Recent neural code translation research can be broadly categorized into two types: learning-based transpilers [44,45,56] and pre-trained language models [16,54,29,43,36,1]. The former mainly studies the scarcity of parallel corpora [58] and develops unsupervised learning methods to overcome it. The latter, drawing on large language models' vast pretrained knowledge, can also perform code translation well without training [60,25]. ...

Data Augmentation for Code Translation with Comparable Corpora and Multiple References
  • Citing Conference Paper
  • January 2023