Daniel Fried’s research while affiliated with Carnegie Mellon University and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (75)


Figure 4: A The mrCAD dataset contains three subsets: the coverage set of 2249 CADs with 1-2 successful rollouts, the dense set of 698 CADs with 3+ successful reconstructions, and the very-dense set of 27 CADs with 30+ successful reconstructions. B We implemented a dynamic threshold for submitting designs that became more lenient in later rounds. Participants took a variable number of rounds to reach the threshold. Visualizing distance to the target, broken down by the round of submission, reveals a trend of refinement over time. The red dashed line indicates the fixed threshold for inclusion in the analysis.
Figure 6: A Designers' instructions to generate CADs (round 1) involved lots of drawing and little text, whereas instructions to refine CADs (rounds 2+) used a balance of modalities. B The proportions of the types of root words in the dependency parse tree of instruction text. More verbs are used over rounds, and these verbs become more imperative. C Samples of 20 generation drawings and 20 refinement drawings highlight the rich detail in generation instructions, and the more targeted modifications in refinement.
Figure 7: A Comparison of human and model movement towards the target following instructions, normalized by distance at the start of the round. Only humans make reliably positive changes in response to refinement instructions. Models made positive steps in generation but largely destructive changes when refining. B Comparison of human and model responses.
mrCAD: Multimodal Refinement of Computer-aided Designs
  • Preprint
  • File available

April 2025

·

5 Reads

William P. McCarthy

·

Saujas Vaduguru

·

Karl D. D. Willis

·

[...]

·

Yewen Pu

A key feature of human collaboration is the ability to iteratively refine the concepts we have communicated. In contrast, while generative AI excels at the generation of content, it often struggles to make specific language-guided modifications of its prior outputs. To bridge the gap between how humans and machines perform edits, we present mrCAD, a dataset of multimodal instructions in a communication game. In each game, players created computer-aided designs (CADs) and refined them over several rounds to match specific target designs. Only one player, the Designer, could see the target, and they had to instruct the other player, the Maker, using text, drawing, or a combination of modalities. mrCAD consists of 6,082 communication games, comprising 15,163 instruction-execution rounds, played between 1,092 pairs of human players. We analyze the dataset and find that generation and refinement instructions differ in their composition of drawing and text. Using the mrCAD task as a benchmark, we find that state-of-the-art VLMs are better at following generation instructions than refinement instructions. These results lay a foundation for analyzing and modeling a multimodal language of refinement that is not represented in previous datasets.
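
The abstract's round structure suggests a natural record format. Below is a purely hypothetical sketch of how one instruction-execution round might be represented; the field names are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Round:
    """One instruction-execution round in an mrCAD game (illustrative schema)."""
    game_id: str
    round_index: int              # 1 = generation, 2+ = refinement
    text: Optional[str]           # Designer's textual instruction, if any
    drawing: Optional[str]        # Designer's drawing (e.g., stroke data), if any
    cad_before: dict              # Maker's CAD entering the round
    cad_after: dict               # Maker's CAD after executing the instruction
    distance_to_target: float     # distance from cad_after to the hidden target
```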


Figure 1: Online adaptive agent that induces and reuses programmatic skills as actions (bottom), as opposed to adding textual skills in memory (top).
Figure 4: ASI can generalize the search product skill but faces incompatibility when sorting items.
Tables 14 and 15 list example tasks used to test agent generalization abilities.
Cross-website results. ASI significantly surpasses baselines in success rate (SR) and number of steps (with |t| > 2 and p < 0.05) from our analysis in §B.3.
Inducing Programmatic Skills for Agentic Tasks

April 2025

·

3 Reads

To succeed in common digital tasks such as web navigation, agents must carry out a variety of specialized tasks such as searching for products or planning a travel route. To tackle these tasks, agents can bootstrap themselves by learning task-specific skills online through interaction with the web environment. In this work, we demonstrate that programs are an effective representation for skills. We propose agent skill induction (ASI), which allows agents to adapt themselves by inducing, verifying, and utilizing program-based skills on the fly. We start with an evaluation on the WebArena agent benchmark and show that ASI outperforms the static baseline agent and its text-skill counterpart by 23.5% and 11.3% in success rate, mainly thanks to the programmatic verification guarantee during the induction phase. ASI also improves efficiency, reducing the number of steps by 10.7-15.3% relative to baselines, by composing primitive actions (e.g., click) into higher-level skills (e.g., search product). We then highlight the efficacy of ASI in remaining efficient and accurate under scaled-up web activities. Finally, we examine the generalizability of induced skills when transferring between websites, and find that ASI can effectively reuse common skills while adapting incompatible skills in response to website changes.
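
To make the skill representation concrete, here is a minimal sketch of what an induced program-based skill might look like. The primitive actions (click, type) and the search product skill follow the examples named in the abstract, but the exact API is an assumption, not ASI's actual interface.

```python
# Hypothetical primitive actions; a real agent would bind these to browser calls.
def click(element_id: str) -> None:
    print(f"click({element_id})")

def type_text(element_id: str, text: str) -> None:
    print(f"type({element_id}, {text!r})")

# An induced higher-level skill composing primitives into one reusable action,
# in the spirit of ASI's "search product" example.
def search_product(query: str) -> None:
    """Search for a product by filling the search box and submitting."""
    click("search-box")
    type_text("search-box", query)
    click("search-button")

search_product("noise-cancelling headphones")
```

Representing the skill as a program is what makes the verification step possible: the agent can execute the induced function and check its effect before adding it to the action space.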


Figure 5: Case study 1. The original score_explicit_question function and its context extracted from the original GitHub repository. The function calls the text completion function from the OpenAI API.
RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing

March 2025

·

3 Reads

We present RepoST, a scalable method to construct environments that provide execution feedback for repository-level code generation, for both training and evaluation. Unlike existing works that aim to build entire repositories for execution, which is challenging for both humans and LLMs, we provide execution feedback with sandbox testing, which isolates a given target function and its dependencies into a separate script for testing. Sandbox testing reduces the complexity of external dependencies and enables constructing environments at a large scale. We use our method to construct RepoST-Train, a large-scale training set with 7,415 functions from 832 repositories. Training with the execution feedback provided by RepoST-Train leads to a performance gain of 5.5% Pass@1 on HumanEval and 3.5% Pass@1 on RepoEval. We also build an evaluation dataset, RepoST-Eval, and benchmark 12 code generation models.
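
To illustrate the sandboxing idea, here is a hedged sketch of what an isolated test script might look like: the target function and the minimal dependencies it needs are copied into one standalone file so it can run without building the full repository. The function and test below are invented for illustration and are not from RepoST itself.

```python
# Hypothetical sandbox script produced by isolating a target function
# and its dependencies from a larger repository.

def normalize(scores):            # copied dependency of the target function
    total = sum(scores)
    return [s / total for s in scores] if total else scores

def top_k(scores, k):             # target function under test
    """Return indices of the k highest normalized scores."""
    norm = normalize(scores)
    return sorted(range(len(norm)), key=lambda i: norm[i], reverse=True)[:k]

# Sandbox test providing execution feedback on generated implementations.
def test_top_k():
    assert top_k([1.0, 3.0, 2.0], 2) == [1, 2]
    assert top_k([], 0) == []

if __name__ == "__main__":
    test_top_k()
    print("sandbox tests passed")
```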


SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

February 2025

·

4 Reads

·

1 Citation

The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has developed generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
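
The abstract describes the reward as a lightweight similarity score between the ground-truth and generated solutions. A minimal sketch of such a rule-based reward, using Python's difflib as an assumed stand-in for the paper's actual similarity function:

```python
import difflib

def patch_similarity_reward(generated_patch: str, oracle_patch: str) -> float:
    """Rule-based reward in [0, 1]: textual similarity between the model's
    patch and the ground-truth patch (difflib is an illustrative choice)."""
    return difflib.SequenceMatcher(None, generated_patch, oracle_patch).ratio()

# Example: a near-miss edit earns more reward than an unrelated one.
oracle = "-    return x\n+    return x + 1\n"
close  = "-    return x\n+    return x + 2\n"
far    = "-    import os\n+    import sys\n"
print(patch_similarity_reward(close, oracle))  # close to 1.0
print(patch_similarity_reward(far, oracle))    # much lower
```

Because the reward is computed purely from text, it needs no execution environment, which is what lets it scale to massive software evolution data.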


Dynamic Coalition Structure Detection in Natural Language-based Interactions

February 2025

·

2 Reads

In strategic multi-agent sequential interactions, detecting dynamic coalition structures is crucial for understanding how self-interested agents coordinate to influence outcomes. However, natural-language-based interactions introduce unique challenges to coalition detection due to ambiguity over intents and difficulty in modeling players' subjective perspectives. We propose a new method that leverages recent advancements in large language models and game theory to predict dynamic multilateral coalition formation in Diplomacy, a strategic multi-agent game where agents negotiate coalitions using natural language. The method consists of two stages. The first stage extracts the set of agreements discussed by two agents in their private dialogue, by combining a parsing-based filtering function with a fine-tuned language model trained to predict player intents. In the second stage, we define a new metric using the concept of subjective rationalizability from hypergame theory to evaluate the expected value of an agreement for each player. We then compute this metric for each agreement identified in the first stage by assessing the strategic value of the agreement for both players and taking into account the subjective belief of one player that the second player would honor the agreement. We demonstrate that our method effectively detects potential coalition structures in online Diplomacy gameplay by assigning high values to agreements likely to be honored and low values to those likely to be violated. The proposed method provides foundational insights into coalition formation in multi-agent environments with language-based negotiation and offers key directions for future research on the analysis of complex natural language-based interactions between agents.
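
The second-stage metric can be read as an expected value weighted by one player's subjective belief that the other will honor the agreement. A schematic sketch follows; the function name, inputs, and linear form are assumptions for illustration, not the paper's hypergame-theoretic definition.

```python
def agreement_value(value_if_honored: float,
                    value_if_violated: float,
                    belief_honor: float) -> float:
    """Expected value of an agreement for a player, weighting the strategic
    value of each outcome by the player's subjective belief that the
    counterpart will honor it (schematic, not the paper's exact metric)."""
    return belief_honor * value_if_honored + (1 - belief_honor) * value_if_violated

# A player who strongly believes an ally will honor a DMZ agreement assigns
# it high value; distrust drives the value down.
print(agreement_value(value_if_honored=2.0, value_if_violated=-1.0, belief_honor=0.9))
print(agreement_value(value_if_honored=2.0, value_if_violated=-1.0, belief_honor=0.2))
```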


Figure 2. Illustration of SlidesBench. Each example in SlidesBench comes with three instruction types: Detailed Instructions with Images, Detailed Instructions Only, and High-Level Instructions. The model is tasked with generating a slide from the instruction, and the generated slide is evaluated on a metric suite containing both reference-free and reference-based metrics.
AutoPresent: Designing Structured Visuals from Scratch

January 2025

·

37 Reads

Designing structured visuals such as presentation slides is essential for communicative needs, necessitating both content creation and visual planning skills. In this work, we tackle the challenge of automated slide generation, where models produce slide presentations from natural language (NL) instructions. We first introduce the SlidesBench benchmark, the first benchmark for slide generation with 7k training and 585 testing examples derived from 310 slide decks across 10 domains. SlidesBench supports evaluations that are (i) reference-based, measuring similarity to a target slide, and (ii) reference-free, measuring the design quality of generated slides alone. We benchmark end-to-end image generation and program generation methods with a variety of models, and find that programmatic methods produce higher-quality slides in user-interactable formats. Building on the success of program generation, we create AutoPresent, an 8B Llama-based model trained on 7k instruction-code pairs for slide generation, and achieve results comparable to the closed-source model GPT-4o. We further explore iterative design refinement, where the model is tasked to self-refine its own output, and find that this process improves slide quality. We hope that our work will provide a basis for future work on generating structured visuals.
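
To make the programmatic route concrete, here is a minimal sketch of code-based slide generation in the style the abstract describes, using the python-pptx library as an assumed backend; AutoPresent's actual generated code and helper API may differ.

```python
from pptx import Presentation
from pptx.util import Inches, Pt

# Build a single slide from an instruction-like spec, programmatically.
prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[5])   # title-only layout
slide.shapes.title.text = "Q3 Results"

body = slide.shapes.add_textbox(Inches(1), Inches(2), Inches(8), Inches(4))
frame = body.text_frame
frame.text = "Revenue grew 12% year over year"
bullet = frame.add_paragraph()
bullet.text = "Churn fell by 3 points"
bullet.font.size = Pt(18)

prs.save("generated_slide.pptx")   # user-editable output, unlike a raw image
```

The appeal of this route, per the abstract, is that the output stays in a format users can open and edit, rather than a flat rendered image.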


Improving Model Factuality with Fine-grained Critique-based Evaluator

October 2024

·

2 Reads

Factuality evaluation aims to detect factual errors produced by language models (LMs) and hence guide the development of more factual models. Towards this goal, we train a factuality evaluator, FenCE, that provides LM generators with claim-level factuality feedback. We conduct data augmentation on a combination of public judgment datasets to train FenCE to (1) generate textual critiques along with scores and (2) make claim-level judgments based on diverse source documents obtained via various tools. We then present a framework that leverages FenCE to improve the factuality of LM generators by constructing training data. Specifically, we generate a set of candidate responses, leverage FenCE to revise and score each response without introducing lesser-known facts, and train the generator by preferring highly scored revised responses. Experiments show that our data augmentation methods improve the evaluator's accuracy by 2.9% on LLM-AggreFact. With FenCE, we improve Llama3-8B-chat's factuality rate by 14.45% on FActScore, outperforming state-of-the-art factuality finetuning methods by 6.96%.
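
A schematic of the data-construction loop described above; every helper here is a placeholder standing in for the paper's components, not FenCE's actual API.

```python
# Schematic of evaluator-driven preference-data construction (illustrative).
def build_preference_data(prompts, generator, fence, n_candidates=4):
    training_pairs = []
    for prompt in prompts:
        candidates = [generator.sample(prompt) for _ in range(n_candidates)]
        revised = [fence.revise(prompt, c) for c in candidates]   # claim-level fixes
        scored = [(fence.score(prompt, r), r) for r in revised]
        scored.sort(reverse=True, key=lambda pair: pair[0])
        best, worst = scored[0][1], scored[-1][1]
        training_pairs.append((prompt, best, worst))  # prefer high-scored revisions
    return training_pairs
```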


Human-aligned Chess with a Bit of Search

October 2024

·

12 Reads

Chess has long been a testbed for AI's quest to match human intelligence, and in recent years, chess AI systems have surpassed the strongest humans at the game. However, these systems are not human-aligned; they are unable to match the skill levels of all human partners or model human-like behaviors beyond piece movement. In this paper, we introduce Allie, a chess-playing AI designed to bridge the gap between artificial and human intelligence in this classic game. Allie is trained on log sequences of real chess games to model the behaviors of human chess players across the skill spectrum, including non-move behaviors such as pondering times and resignations. In offline evaluations, we find that Allie exhibits humanlike behavior: it outperforms the existing state-of-the-art in human chess move prediction and "ponders" at critical positions. The model learns to reliably assign reward at each game state, which can be used at inference time as a reward function in a novel time-adaptive Monte-Carlo tree search (MCTS) procedure, where the amount of search depends on how long humans would think in the same positions. Adaptive search enables remarkable skill calibration: in a large-scale online evaluation against players with ratings from 1000 to 2600 Elo, our adaptive search method leads to a skill gap of only 49 Elo on average, substantially outperforming search-free and standard MCTS baselines. Against grandmaster-level (2500 Elo) opponents, Allie with adaptive search exhibits the strength of a fellow grandmaster, all while learning exclusively from humans.
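
The time-adaptive search idea can be sketched simply: scale the MCTS simulation budget by the model's prediction of how long a human would ponder in the position. The mapping and constants below are illustrative assumptions, not Allie's actual procedure.

```python
def adaptive_simulation_budget(predicted_think_seconds: float,
                               sims_per_second: int = 50,
                               min_sims: int = 1,
                               max_sims: int = 2000) -> int:
    """Allocate MCTS simulations in proportion to predicted human pondering
    time, so the engine searches more where humans would think longer.
    (Schematic: the constants and linear mapping are assumptions.)"""
    budget = int(predicted_think_seconds * sims_per_second)
    return max(min_sims, min(max_sims, budget))

# Quiet recapture: little search. Critical middlegame decision: deep search.
print(adaptive_simulation_budget(0.5))   # -> 25
print(adaptive_simulation_budget(30.0))  # -> 1500
```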


Figure 1: Example diff with multiple valid reviews. The ground truth and model-generated reviews focus on different topics like the performance of the added check and how likely it is to be triggered. However, a reference-based metric like the BLEU score assigns this review a low score of 0.0458.
Figure 3: Supervised fine-tuning pipeline for training Magicoder-6.7B for claim generation. We generate synthetic data by using GPT-4 to generate claims for the code changes in CodeReviewer validation set.
Figure 5: Histogram of sentence similarity for 100K randomly sampled sentence pairs from the CodeReviewer test set, showing that the scores are roughly normally distributed. This justifies using the 5-sigma rule to derive the 0.85 high-similarity threshold used in metric computation.
Figure 6: Q-Q plot comparing quantiles of empirically observed sentence similarity scores computed over 100K sentence pairs from the CodeReviewer test set, showing that the empirical quantiles match a normal distribution except at very high values. The discrepancy is likely due to the random sample being a small subset of the 100M+ sentence pairs for which we compute similarities.
Table: The types of errors identified, with their descriptions, examples (pseudo-references before and after correction of the error), and relative frequencies as percentages. For this analysis, we annotated 46 erroneous pseudo-references.
CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells

September 2024

·

36 Reads

·

1 Citation

The task of automated code review has recently gained a lot of attention from the machine learning community. However, current review comment evaluation metrics rely on comparisons with a human-written reference for a given code change (also called a diff), even though code review is a one-to-many problem, like generation and summarization, with many valid reviews for a diff. To tackle these issues, we develop CRScore, a reference-free metric that measures dimensions of review quality like conciseness, comprehensiveness, and relevance. We design CRScore to evaluate reviews in a way that is grounded in claims and potential issues detected in the code by LLMs and static analyzers. We demonstrate that CRScore can produce valid, fine-grained scores of review quality that have the greatest alignment with human judgment (0.54 Spearman correlation) and are more sensitive than reference-based metrics. We also release a corpus of 2.6k human-annotated review quality scores for machine-generated and GitHub review comments to support the development of automated metrics.
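
A minimal sketch of the claim-grounding idea, assuming sentence-transformers embeddings and reusing the 0.85 high-similarity threshold mentioned in the figure notes above. This illustrates the mechanism, not CRScore's exact computation.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
THRESHOLD = 0.85  # high-similarity cutoff from the 5-sigma analysis above

def score_review(review_sentences, claims):
    """Grounded quality scores: a review sentence counts as relevant if it
    matches some detected claim; a claim counts as covered if some review
    sentence matches it. (Illustrative stand-in for CRScore.)"""
    rev_emb = model.encode(review_sentences, convert_to_tensor=True)
    claim_emb = model.encode(claims, convert_to_tensor=True)
    sim = util.cos_sim(rev_emb, claim_emb)          # |review| x |claims|
    relevance = (sim.max(dim=1).values >= THRESHOLD).float().mean().item()
    coverage = (sim.max(dim=0).values >= THRESHOLD).float().mean().item()
    return {"conciseness": relevance, "comprehensiveness": coverage}
```

Because the claims come from LLMs and static analyzers run on the diff itself, no human-written reference review is needed, which is what makes the metric reference-free.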


Agent Workflow Memory

September 2024

·

21 Reads

Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena, respectively, while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines by 8.9 to 14.0 absolute points as train-test task distribution gaps widen.
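
A schematic of workflow induction: mine action subsequences that recur across successful trajectories and keep them as reusable workflows. The flat n-gram representation and frequency rule here are simplifying assumptions, not AWM's actual induction module.

```python
from collections import Counter

def induce_workflows(trajectories, min_len=2, max_len=4, min_count=3):
    """Mine action n-grams that recur across successful trajectories and
    keep them as candidate workflows (simplified stand-in for AWM)."""
    counts = Counter()
    for actions in trajectories:              # each trajectory: list of actions
        for n in range(min_len, max_len + 1):
            for i in range(len(actions) - n + 1):
                counts[tuple(actions[i:i + n])] += 1
    return [list(seq) for seq, c in counts.items() if c >= min_count]

trajs = [
    ["click(search)", "type(query)", "click(submit)", "click(result)"],
    ["click(search)", "type(query)", "click(submit)", "scroll(down)"],
    ["goto(home)", "click(search)", "type(query)", "click(submit)"],
]
print(induce_workflows(trajs))  # the search routine recurs in all three
```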


Citations (40)


... We adopt GRPO, the reinforcement learning baseline algorithm proposed in DeepSeekMath (Shao et al., 2024), as our RL foundation. GRPO has already demonstrated strong potential in DeepSeek-R1 (Guo et al., 2025) and has recently been shown to be stable and reproducible across a series of follow-up works (Wei et al., 2025). To reduce the overhead of training an additional value function as in PPO (Schulman et al., 2017), GRPO instead uses the average reward of sampled responses as a baseline for computing advantages. ...

Reference:

ToM-RL: Reinforcement Learning Unlocks Theory of Mind in Small LLMs
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
  • Citing Preprint
  • February 2025
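
The baseline trick described in the excerpt above has a compact form. As a sketch, GRPO's group-relative advantage for response i in a group of G samples with rewards r_1, ..., r_G (following Shao et al., 2024) is:

```latex
A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
```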

... Iterative refinement with execution feedback. Existing LM-based code editing approaches often leverage iterative refinement with execution feedback (Huang et al., 2024; Peng et al., 2024; Xia & Zhang, 2024; Waghjale et al., 2024), which relies on the availability of test inputs. However, the code to be edited may not always be well-maintained. ...

ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?
  • Citing Conference Paper
  • January 2024

... address these limitations, however, Google's approach [40] only focuses on issue classification without generating specific review comments, while Tencent [50] primarily addresses code maintainability concerns. Our investigation reveals three fundamental challenges in current LLM-based solutions: i) insufficient precision in generating technically accurate comments, ii) low practicality of comments that are technically correct but fail to provide substantial value [26,29], and iii) lack of systematic mechanisms for targeted improvement, preventing data-driven evolution in both model precision and suggestion practicality. ...

CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells

... Taking cognitive language agents as building blocks, recent work constructs multi-agent interaction pipelines [57, 80, 141]. While the multi-agent framework claims to bring about useful applications in various fields, from software engineering [65], to general collaborative frameworks [88], to healthcare [82], other work has highlighted limitations of such systems, such as conformity and inconsistency of personas [9], as well as how persona simulation reveals implicit stereotypes about the simulated social groups [58, 85]. ...

Evaluating Large Language Model Biases in Persona-Steered Generation
  • Citing Conference Paper
  • January 2024

... Recently, the emergence of Large Language Models (LLMs) has intensified the need to evaluate their human-like communication abilities, particularly for nuanced use cases requiring sophisticated pragmatic reasoning (Hu et al., 2023;Ruis et al., 2023;Yerukola et al., 2024). As LLMs are increasingly deployed in real-world applications, validating their ability to understand and generate contextually appropriate and pragmatically accurate responses is crucial to ensure effective and trustworthy human-computer interactions. ...

Is the Pope Catholic? Yes, the Pope is Catholic. Generative Evaluation of Non-Literal Intent Resolution in LLMs
  • Citing Conference Paper
  • January 2024

... VisualWebArena (Koh et al., 2024) shifts the focus to visually grounded tasks, introducing 910 challenges across three domains: shopping, social forums (Reddit), and a new classifieds environment similar to Craigslist. Unlike WebArena, this dataset requires multimodal agents to combine visual understanding with textual inputs. ...

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
  • Citing Conference Paper
  • January 2024

... Direct communication offers a solution by enabling the exchange of relevant information between players. Recent research has explored the use of large language models to interpret intentions and predict actions through natural language, enhancing alignment and coordination within the human-AI team (Guan et al. 2023; Chen, Fried, and Topcu 2024; Liu et al. 2024). However, the effectiveness of such approaches is constrained by the limitations of the language module and the high latency associated with API calls during the inference process (Liang et al. 2023). ...

Human-Agent Cooperation in Games under Incomplete Information through Natural Language Communication
  • Citing Conference Paper
  • August 2024

... Prompting techniques for multiturn interaction. Prior work has explored prompting strategies to enhance LLM interactivity, particularly for clarification questions (Keh et al., 2024; Mu et al., 2023; Zhang & Choi, 2023; Chi et al., 2024; Kim et al., 2023; Deng et al., 2023b; Zhao & Dou, 2024) and mixed-initiative dialogues (Deng et al., 2023a; Chen et al., 2023; Liao et al., 2023). For instance, Mu et al. (2023) prompt LLMs to ask clarification questions when code generation requests are ambiguous. ...

Asking More Informative Questions for Grounded Retrieval
  • Citing Conference Paper
  • January 2024

... First, the agent's exploration diversity diminishes over time, as its policy becomes too specialized to familiar trajectories, failing to discover novel states or actions (He et al., 2024b). Second, despite the existence of inference-time exploration algorithms, such as variations of search algorithms (Koh et al., 2024b; Putta et al., 2024), that can provide diversified action choices, they require significantly more real-world interactions, which can be very costly, making the marginal gains in useful information prohibitively expensive. Although there is work using simulations (Gu et al., 2024; Qiao et al., 2024) to perform action searching, it typically focuses on one- or two-step look-ahead, lacking the foresight needed for coherent multi-step rollouts. ...

Tree Search for Language Model Agents
  • Citing Preprint
  • July 2024

... Referring expressions have attracted long-standing interest since the last century (Winograd, 1972). From a linguistic perspective, interpreting and producing referring expressions is a natural language grounding problem (Fried et al., 2023;Mollo & Millière, 2023;Shi, 2024), requiring both semantic grounding, linking language to visual entities, and communicative grounding, establishing mutual agreement on the referent (Chai et al., 2018). From a practical perspective, this capability is essential for building robots (Qi et al., 2020) or generative AI models (Brooks et al., 2023;Yu et al., 2025) that can follow human instructions and engage in dialogue (Kollar et al., 2013;Thomason et al., 2015) with humans. ...

Pragmatics in Language Grounding: Phenomena, Tasks, and Modeling Approaches
  • Citing Conference Paper
  • January 2023