Ashish Sabharwal’s research while affiliated with the Allen Institute for AI and other places


Publications (224)


Figure 1: Can we uncover the hidden logic of DPO? Here we show the distillation of the DPO loss into a symbolic expression that captures its high-level model behavior, along with a modified version we can compile into a novel DPO loss.
Figure 3: The Boolean semantics (top) of WMC and preference structures: ✓'s correspond to propositional models satisfying P, P_f; ✗'s to ¬P and ¬P_f; blank cells to conditioning constraints P_C; and cells with multiple marks to P_A. Losses (columns) are created by assigning/removing marks, then counting these marks/rows using the bottom equation (following from Eq. 5).
Figure 5: What other losses are there? Here we show the loss landscape for single model preference approaches using a loss lattice showing losses (nodes) structured according to strict entailment (<) and their core formulas P (boxes) with ✓ being the known losses. See Appendix C for details of the individual losses and Figure 7.
Figure 6: What are interesting DPO variants to explore? Extending the loss lattice in Figure 5 to a version of the single-model losses with reference models (i.e., their reference forms), showing different (largely unexplored) variants of DPO and the different semantic regions (gray boxes, corresponding to the core semantic formula P for each set of losses). See Appendix C for details.
Understanding the Logic of Direct Preference Alignment through Logic
  • Preprint
  • File available

December 2024 · 6 Reads

Kyle Richardson · Vivek Srikumar · Ashish Sabharwal

Recent direct preference alignment algorithms (DPA), such as DPO, have shown great promise in aligning large language models to human preferences. While this has motivated the development of many new variants of the original DPO loss, understanding the differences between these recent proposals, as well as developing new DPA loss functions, remains difficult given the lack of a technical and conceptual framework for reasoning about the underlying semantics of these algorithms. In this paper, we attempt to remedy this by formalizing DPA losses in terms of discrete reasoning problems. Specifically, we ask: Given an existing DPA loss, can we systematically derive a symbolic expression that characterizes its semantics? How do the semantics of two losses relate to each other? We propose a novel formalism for characterizing preference losses for single model and reference model based approaches, and identify symbolic forms for a number of commonly used DPA variants. Further, we show how this formal view of preference learning sheds new light on both the size and structure of the DPA loss landscape, making it possible to not only rigorously characterize the relationships between recent loss proposals but also to systematically explore the landscape and derive new loss functions from first principles. We hope our framework and findings will help provide useful guidance to those working on human AI alignment.
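
For reference, the original DPO loss that this line of work starts from is the well-known logistic loss over a policy-versus-reference log-ratio margin. Below is a minimal sketch in plain Python; the variable names are illustrative and not taken from the paper:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) response pair.

    logp_w, logp_l         : log-probabilities of the preferred / dispreferred
                             response under the policy being trained.
    ref_logp_w, ref_logp_l : the same log-probabilities under the frozen
                             reference model.
    beta                   : scaling of the implicit reward margin.
    """
    # Implicit reward margin: difference of the two policy-vs-reference log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: -log(sigmoid(margin)) = log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

The paper's contribution is to derive symbolic expressions characterizing the semantics of this loss and its variants, rather than to change the loss computation itself.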


SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories

September 2024 · 3 Reads

Kejuan Yang · Shashank Gupta · [...]

Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for larger-scale development. We introduce various evaluation measures to assess both task success and progress, utilizing gold solutions when available or approximations otherwise. We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios. This illustrates the challenge of this task, and suggests that SUPER can serve as a valuable resource for the community to make and measure progress.


AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

July 2024 · 19 Reads

Autonomous agents that address day-to-day digital tasks (e.g., ordering groceries for a household) must not only operate multiple apps (e.g., notes, messaging, shopping apps) via APIs, but also generate rich code with complex control flow in an iterative manner based on their interaction with the environment. However, existing benchmarks for tool use are inadequate, as they only cover tasks that require a simple sequence of API calls. To remedy this gap, we built AppWorld Engine, a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs and populated with realistic digital activities simulating the lives of ~100 fictitious users. We then created AppWorld Benchmark (40K lines of code), a suite of 750 natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It supports robust programmatic evaluation with state-based unit tests, allowing for different ways of completing a task while also checking for unexpected changes, i.e., collateral damage. The state-of-the-art LLM, GPT-4o, solves only ~49% of our 'normal' tasks and ~30% of 'challenge' tasks, while other models solve at least 16% fewer. This highlights the benchmark's difficulty and AppWorld's potential to push the frontiers of interactive coding agents. The project website is available at https://appworld.dev/.
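
To illustrate the idea of state-based evaluation described above (checking the final app state rather than the exact action sequence, and guarding against collateral damage), here is a generic sketch; the function and data structures are hypothetical and do not reflect AppWorld's actual test API:

```python
def evaluate_task(db_before, db_after, expected_changes):
    """Generic state-based check: did the required changes happen,
    and did nothing else change (no collateral damage)?

    db_before, db_after : app-state snapshots as {table: {row_id: row}} dicts.
    expected_changes    : {table: {row_id: expected_row}} the task must produce.
    (Hypothetical structures, for illustration only.)
    """
    # 1) Every required change is present in the final state.
    for table, rows in expected_changes.items():
        for row_id, expected in rows.items():
            if db_after.get(table, {}).get(row_id) != expected:
                return False
    # 2) Nothing outside the expected changes was modified (collateral damage check).
    for table in set(db_before) | set(db_after):
        for row_id in set(db_before.get(table, {})) | set(db_after.get(table, {})):
            if table in expected_changes and row_id in expected_changes[table]:
                continue
            if db_before.get(table, {}).get(row_id) != db_after.get(table, {}).get(row_id):
                return False
    return True
```

Comparing state snapshots in this way is what allows multiple distinct action sequences to count as correct, as long as they produce the required end state without side effects.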


Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions

July 2024 · 7 Reads · 1 Citation

Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have quite a range of performance, particularly when the task format is diversified slightly (such as by shuffling answer choice order). In this work we ask: how do successful models perform formatted MCQA? We employ vocabulary projection and activation patching methods to localize key hidden states that encode relevant information for predicting the correct answer. We find that prediction of a specific answer symbol is causally attributed to a single middle layer, and specifically its multi-head self-attention mechanism. We show that subsequent layers increase the probability of the predicted answer symbol in vocabulary space, and that this probability increase is associated with a sparse set of attention heads with unique roles. We additionally uncover differences in how different models adjust to alternative symbols. Finally, we demonstrate that a synthetic task can disentangle sources of model error to pinpoint when a model has learned formatted MCQA, and show that an inability to separate answer symbol tokens in vocabulary space is a property of models unable to perform formatted MCQA tasks.
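
Vocabulary projection, one of the methods mentioned above, generally means decoding an intermediate hidden state directly through the model's unembedding matrix. A minimal sketch of that general technique (not the paper's code; it assumes a standard transformer unembedding matrix):

```python
import numpy as np

def project_to_vocab(hidden_state, unembedding, top_k=5):
    """Decode an intermediate hidden state directly into vocabulary space.

    hidden_state : (d_model,) residual-stream vector at some layer and position.
    unembedding  : (vocab_size, d_model) output (unembedding) matrix.
    Returns indices of the top_k tokens the hidden state most strongly predicts.
    (In practice the state is typically passed through the final layer norm first.)
    """
    logits = unembedding @ hidden_state           # (vocab_size,)
    return np.argsort(logits)[::-1][:top_k]       # highest-scoring token ids
```

Applying such a projection at each layer and position is what lets one track where the probability of the answer symbol emerges and grows through the network.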


DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

July 2024

·

23 Reads

Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. The benchmark is designed to systematically assess current model capabilities in discovery tasks and provide a useful resource for improving them. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from published papers to approximate the real-world challenges faced by researchers, where each task is defined by a dataset, its metadata, and a discovery goal in natural language. We additionally provide 903 synthetic tasks to conduct controlled evaluations across task complexity. Furthermore, our structured formalism of data-driven discovery enables a facet-based evaluation that provides useful insights into different failure modes. We evaluate several popular LLM-based reasoning frameworks using both open and closed LLMs as baselines on DiscoveryBench and find that even the best system scores only 25%. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
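
As described above, each task pairs a dataset and its metadata with a discovery goal in natural language. A hypothetical illustration of what such a task record might look like (the field names and values are invented for illustration and are not DiscoveryBench's actual schema or data):

```python
# Hypothetical task record; fields and values are illustrative only.
task = {
    "domain": "sociology",
    "dataset": "survey_responses.csv",
    "metadata": {
        "columns": ["age", "education_years", "weekly_screen_time_hours"],
        "n_rows": 4821,
    },
    "goal": "Is there a statistically significant relationship between "
            "education level and weekly screen time in this population?",
}
```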







Citations (51)


... Of particular relevance for our work is the usage of LLMs for code generation in the domain of machine learning, either as an LLM agent or, more commonly, in a fixed scaffold that does not allow the LLM to choose what tools to use. Focusing in on code generation, LLMs have been used to do autonomous machine learning research, neural architecture search, data science problems, paper reproduction, writing research papers, and reward function design, including preference optimization for LLM fine-tuning [37,38,39,40,41,42,43,44]. ...

Reference:

RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts
SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
  • Citing Conference Paper
  • January 2024

... Integrating external tools to solve diverse multi-modal tasks is a promising research direction towards multi-modal agents (Surís et al., 2023;Gupta & Kembhavi, 2023;Yuan et al., 2024;Zhong et al., 2023). Existing agents usually use a large language model (LLM) as the controller that generates plans via prompt engineering to call tools, achieving impressive performance in multiple domains, such as image editing (Wu et al., 2023), robotic manipulation (ichter et al., 2023), question answering (Shen et al., 2024), video understanding, and desktop APPs (Trivedi et al., 2024). Despite their success, prompt engineering faces limited reasoning abilities for tool usage in tackling practical tasks, as shown in Fig. 1. (1) The in-context examples in prompts only involve textual information, degrading the efficiency of tool usage in the multi-modal world. ...

AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
  • Citing Conference Paper
  • January 2024

... These features make code formats not only useful for the reasoning of Code-LLMs, but also benefit general-purpose LLMs. Bogin et al. (2023) leverage the advantage of code in describing structures. They use programming languages to describe domain-specific information such as types of entities, attributes, and methods. ...

Leveraging Code to Improve In-Context Learning for Semantic Parsing
  • Citing Conference Paper
  • January 2024

... [Song et al., 2024] developed a multidimensional framework considering factors like faithfulness and coherence. Murahari et al. [2024] introduced QualEval, a framework that improves traditional metrics with qualitative insights and more fine-grained evaluation. However, they focus on evaluation to improve the model, while we seek to generate faithful and interpretable reports for humans. ...

QualEval: Qualitative Evaluation for Model Improvement
  • Citing Conference Paper
  • January 2024

... Specifically, they read information about the context or reasoning results from the residual stream, then enhance the information that needs to be expressed as output, and write it back into the stream. Amplification Head [Lieberum et al., 2023] and Correct Head [Wiegreffe et al., 2024] amplify the signal of the correct choice letter in MCQA problems near the [END] position. This amplification ensures that after passing through the Unembedding layer and softmax calculation, the correct choice letter has the highest probability. ...

Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions
  • Citing Preprint
  • July 2024

... Experiment 2: Evaluating Zero-shot Reading Comprehension using Belebele dataset Perplexity is an intrinsic measure of how well a language model performs on the task it is trained to do. But it does not necessarily predict how well a model does in tasks that require text comprehension (Holtzman et al., 2021;Wiegreffe et al., 2023). To account for this, in the second experiment we evaluate the language models' performance on multiple-choice reading comprehension (MRC) using the Belebele benchmark dataset (Bandarkar et al., 2023). ...

Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy
  • Citing Conference Paper
  • January 2023

... The key distinction between deductive reasoning and both inductive/abductive reasoning is that deductive reasoning results in clear conclusions, while inductive/abductive reasoning may not necessarily achieve this. For more nuanced and low-level details on the distinction between the three types of logical reasoning in the context of LLMs, we refer the reader to other publications [15,19,20]. ...

IfQA: A Dataset for Open-domain Question Answering under Counterfactual Presuppositions
  • Citing Conference Paper
  • January 2023

... Sclar et al. (2023) proposed an explicit graphical representation for nested belief states, allowing the model to answer questions from the perspective of each character. Kassner et al. (2023) developed a belief graph that includes explicit system beliefs and their inferential relationships, providing an interpretable view of the system's beliefs. Li et al. (2023a) employed prompt engineering to represent explicit belief states, augmenting the agents' information retention and enhancing multi-agent collaboration. ...

Language Models with Rationality
  • Citing Conference Paper
  • January 2023

... Instruction Following. Training models to follow instructions is crucial for improving LLM performance and ensuring safe deployment, with various methods developed to enhance instruction adherence (Ouyang et al., 2022;Sanh et al., 2022;Wei et al., 2022;Bai et al., 2022;Chung et al., 2024), and datasets designed to train and evaluate instruction-following behavior (Ye et al., 2021;Gupta et al., 2022;Finlayson et al., 2022;Longpre et al., 2023;Köpf et al., 2023). Natural language instructions have demonstrated significant promise in providing fine-grained control over model outputs (Zhou et al., 2023b). ...

What Makes Instruction Learning Hard? An Investigation and a New Challenge in a Synthetic Environment
  • Citing Conference Paper
  • January 2022