Hanna Wallach’s research while affiliated with Syracuse University and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (111)


Position: Evaluating Generative AI Systems is a Social Science Measurement Challenge
  • Preprint
  • File available

February 2025 · 4 Reads

Hanna Wallach · Meera Desai · [...] · Abigail Z. Jacobs

The measurement tasks involved in evaluating generative AI (GenAI) systems are especially difficult, leading to what has been described as "a tangle of sloppy tests [and] apples-to-oranges comparisons" (Roose, 2024). In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating GenAI systems. Specifically, our position is that evaluating GenAI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of GenAI. This framework has two important implications for designing and evaluating evaluations: First, it can broaden the expertise involved in evaluating GenAI systems by enabling stakeholders with different perspectives to participate in conceptual debates. Second, it brings rigor to both conceptual and operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.


Figure 1: Both the (a) back-end and (b) front-end involve processes that have their own inputs and produce their own outputs (simplified here). This is why we use this additional terminology for clarifying which inputs and outputs are under discussion. There is nothing complicated here; it is just shorthand to signal different aspects of the trained model at different points in time.
Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice

December 2024 · 32 Reads

We articulate fundamental mismatches between technical methods for machine unlearning in Generative AI, and documented aspirations for broader impact that these methods could have for law and policy. These aspirations are both numerous and varied, motivated by issues that pertain to privacy, copyright, safety, and more. For example, unlearning is often invoked as a solution for removing the effects of targeted information from a generative-AI model's parameters, e.g., a particular individual's personal data or in-copyright expression of Spiderman that was included in the model's training data. Unlearning is also proposed as a way to prevent a model from generating targeted types of information in its outputs, e.g., generations that closely resemble a particular individual's data or reflect the concept of "Spiderman." Both of these goals--the targeted removal of information from a model and the targeted suppression of information from a model's outputs--present various technical and substantive challenges. We provide a framework for thinking rigorously about these challenges, which enables us to be clear about why unlearning is not a general-purpose solution for circumscribing generative-AI model behavior in service of broader positive impact. We aim for conceptual clarity and to encourage more thoughtful communication among machine learning (ML), law, and policy experts who seek to develop and apply technical methods for compliance with policy objectives.
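
Below is a minimal, purely illustrative Python sketch of the distinction the abstract draws between the two goals; it is not the paper's method, and the topic list, filter, and stand-in model are invented. A post-hoc filter only suppresses what a trained model emits, whereas removing targeted information would require changing the model's parameters themselves (e.g., retraining without the targeted data).

BLOCKED_TOPICS = ["Spiderman"]  # hypothetical targeted concept to suppress in outputs

def generate_with_suppression(model, prompt):
    # The model's parameters are untouched; only its outputs are filtered.
    text = model(prompt)
    if any(topic.lower() in text.lower() for topic in BLOCKED_TOPICS):
        return "[generation withheld]"
    return text

# Usage with a stand-in "model" (a plain function):
print(generate_with_suppression(lambda p: "A story about Spiderman...", "Tell me a story"))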


Figure 1: Our proposed framework for measurement tasks of the form: measure the [amount] of a [concept] in [instances] from a [population]. The figure shows how the four elements that make up such tasks (amounts, concepts, instances, and populations) are formalized through the sequential processes of systematization, operationalization, and application. Elements in earlier levels (rows) can be revised and refined based on findings, including validity concerns, that arise in later levels.
A Shared Standard for Valid Measurement of Generative AI Systems' Capabilities, Risks, and Impacts

December 2024 · 6 Reads

The valid measurement of generative AI (GenAI) systems' capabilities, risks, and impacts forms the bedrock of our ability to evaluate these systems. We introduce a shared standard for valid measurement that helps place many of the disparate-seeming evaluation practices in use today on a common footing. Our framework, grounded in measurement theory from the social sciences, extends the work of Adcock & Collier (2001) in which the authors formalized valid measurement of concepts in political science via three processes: systematizing background concepts, operationalizing systematized concepts via annotation procedures, and applying those procedures to instances. We argue that valid measurement of GenAI systems' capabilities, risks, and impacts, further requires systematizing, operationalizing, and applying not only the entailed concepts, but also the contexts of interest and the metrics used. This involves both descriptive reasoning about particular instances and inferential reasoning about underlying populations, which is the purview of statistics. By placing many disparate-seeming GenAI evaluation practices on a common footing, our framework enables individual evaluations to be better understood, interrogated for reliability and validity, and meaningfully compared. This is an important step in advancing GenAI evaluation practices toward more formalized and theoretically grounded processes -- i.e., toward a science of GenAI evaluations.
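
As a purely illustrative sketch of the descriptive/inferential distinction the abstract draws (the scores below are invented, and this is not the paper's framework itself), instance-level measurements can be summarized descriptively and then used to make an inferential statement about an underlying population:

import math

scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical per-instance measurements
n = len(scores)
mean = sum(scores) / n                    # descriptive: these particular instances
se = math.sqrt(mean * (1 - mean) / n)     # standard error for a proportion
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"descriptive: mean over {n} instances = {mean:.2f}")
print(f"inferential: approximate 95% CI for the population = ({low:.2f}, {high:.2f})")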


Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems

November 2024 · 6 Reads

To facilitate the measurement of representational harms caused by large language model (LLM)-based systems, the NLP research community has produced and made publicly available numerous measurement instruments, including tools, datasets, metrics, benchmarks, annotation instructions, and other techniques. However, the research community lacks clarity about whether and to what extent these instruments meet the needs of practitioners tasked with developing and deploying LLM-based systems in the real world, and how these instruments could be improved. Via a series of semi-structured interviews with practitioners in a variety of roles in different organizations, we identify four types of challenges that prevent practitioners from effectively using publicly available instruments for measuring representational harms caused by LLM-based systems: (1) challenges related to using publicly available measurement instruments; (2) challenges related to doing measurement in practice; (3) challenges arising from measurement tasks involving LLM-based systems; and (4) challenges specific to measuring representational harms. Our goal is to advance the development of instruments for measuring representational harms that are well-suited to practitioner needs, thus better facilitating the responsible development and deployment of LLM-based systems.


A Framework for Evaluating LLMs Under Task Indeterminacy

November 2024 · 1 Read

Large language model (LLM) evaluations often assume there is a single correct response -- a gold label -- for each item in the evaluation corpus. However, some tasks can be ambiguous -- i.e., they provide insufficient information to identify a unique interpretation -- or vague -- i.e., they do not clearly indicate where to draw the line when making a determination. Both ambiguity and vagueness can cause task indeterminacy -- the condition where some items in the evaluation corpus have more than one correct response. In this paper, we develop a framework for evaluating LLMs under task indeterminacy. Our framework disentangles the relationships between task specification, human ratings, and LLM responses in the LLM evaluation pipeline. Using our framework, we conduct a synthetic experiment showing that evaluations that use the "gold label" assumption underestimate the true performance. We also provide a method for estimating an error-adjusted performance interval given partial knowledge about indeterminate items in the evaluation corpus. We conclude by outlining implications of our work for the research community.
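
The following is a hypothetical illustration of the bounding idea only (not the paper's estimator; the records, flags, and function are invented): when some items may admit more than one correct response, gold-label accuracy gives a lower bound, and crediting non-matching responses on items that are, or might be, indeterminate gives a coarse upper bound.

def accuracy_interval(records):
    # Each record: (matches_gold, indeterminate), where indeterminate is
    # True, False, or None (None = unknown whether multiple answers exist).
    n = len(records)
    lower = sum(m for m, _ in records) / n           # only gold-label matches count
    upper = sum(m or ind in (True, None) for m, ind in records) / n
    return lower, upper

records = [
    (True, False),   # matched gold; item has a unique answer
    (False, True),   # missed gold, but the item is known to be indeterminate
    (False, None),   # missed gold; indeterminacy unknown
    (False, False),  # missed gold; unique answer, so a genuine error
]
print(accuracy_interval(records))  # (0.25, 0.75)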


Dimensions of Generative AI Evaluation Design

November 2024 · 2 Reads

There are few principles or guidelines to ensure evaluations of generative AI (GenAI) models and systems are effective. To help address this gap, we propose a set of general dimensions that capture critical choices involved in GenAI evaluation design. These dimensions include the evaluation setting, the task type, the input source, the interaction style, the duration, the metric type, and the scoring method. By situating GenAI evaluations within these dimensions, we aim to guide decision-making during GenAI evaluation design and provide a structure for comparing different evaluations. We illustrate the utility of the proposed set of general dimensions using two examples: a hypothetical evaluation of the fairness of a GenAI system and three real-world GenAI evaluations of biological threats.
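
As a minimal sketch of how the proposed dimensions might be recorded for a single evaluation (the field values below are invented, hypothetical choices, not taken from the paper):

from dataclasses import dataclass

@dataclass
class EvaluationDesign:
    evaluation_setting: str
    task_type: str
    input_source: str
    interaction_style: str
    duration: str
    metric_type: str
    scoring_method: str

# Hypothetical fairness evaluation of a GenAI system:
fairness_eval = EvaluationDesign(
    evaluation_setting="deployed system",
    task_type="open-ended text generation",
    input_source="user-submitted prompts",
    interaction_style="single turn",
    duration="one-off",
    metric_type="rate of flagged generations",
    scoring_method="human annotation",
)
print(fairness_eval)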


Evaluating Generative AI Systems is a Social Science Measurement Challenge

November 2024 · 15 Reads

Across academia, industry, and government, there is an increasing awareness that the measurement tasks involved in evaluating generative AI (GenAI) systems are especially difficult. We argue that these measurement tasks are highly reminiscent of measurement tasks found throughout the social sciences. With this in mind, we present a framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems. The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves. This four-level approach differs from the way measurement is typically done in ML, where researchers and practitioners appear to jump straight from background concepts to measurement instruments, with little to no explicit systematization in between. As well as surfacing assumptions, thereby making it easier to understand exactly what the resulting measurements do and do not mean, this framework has two important implications for evaluating evaluations: First, it can enable stakeholders from different worlds to participate in conceptual debates, broadening the expertise involved in evaluating GenAI systems. Second, it brings rigor to operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.
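
To make the four levels concrete, here is a minimal sketch for one hypothetical concept; the concept, instrument descriptions, and labels are invented for illustration and are not drawn from the paper.

measurement_task = {
    "background_concept": "toxicity of system outputs",
    "systematized_concept": (
        "an output is toxic if it contains insults, threats, or demeaning "
        "language directed at a person or group"
    ),
    "measurement_instruments": [
        "annotation guideline applied by trained raters",
        "classifier score thresholded at 0.8",
    ],
    "instance_level_measurements": [0, 1, 0, 0, 1],  # per-output labels
}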


Tinker, Tailor, Configure, Customize: The Articulation Work of Contextualizing an AI Fairness Checklist

April 2024 · 36 Reads · 2 Citations

Proceedings of the ACM on Human-Computer Interaction

Many responsible AI resources, such as toolkits, playbooks, and checklists, have been developed to support AI practitioners in identifying, measuring, and mitigating potential fairness-related harms. These resources are often designed to be general purpose in order to be applicable to a variety of use cases, domains, and deployment contexts. However, this can lead to decontextualization, where such resources lack the level of relevance or specificity needed to use them. To understand how AI practitioners might contextualize one such resource, an AI fairness checklist, for their particular use cases, domains, and deployment contexts, we conducted a retrospective contextual inquiry with 13 AI practitioners from seven organizations. We identify how contextualizing this checklist introduces new forms of work for AI practitioners and other stakeholders and opens up new sites for negotiation and contestation of values in AI. We also identify how the contextualization process may help AI practitioners develop a shared language around AI fairness, and we identify tensions related to ownership over this process that suggest larger issues of accountability in responsible AI work.



Citations (65)


... These initiatives seek to establish standardized means for researchers and developers to quickly access critical information about datasets intended for training medical devices or algorithms, or for conducting epidemiological research. Notable among these efforts are Data Cards [19], Data Statements [20], Datasheets for Datasets [21], Model Cards [22], AI-Usage Cards [23], and the Dataset Nutritional Label [18]. Each of these proposals contributes valuable frameworks for documenting various aspects of datasets and models, facilitating a more responsible and informed use of data in AI development. ...

Reference:

The Data Artifacts Glossary: a community-based repository for bias on health datasets
Datasheets for Datasets
  • Citing Preprint
  • March 2018

... These findings reinforce past work which argues that allocation harm is manifested in the additional effort required by non-White users to adapt their prompts for successful use (Cunningham et al., 2024) and in the extra cost of additional tokens required to model linguistic variety (Ahia et al., 2023). While removing identity features from prompts altogether eliminates stereotypes from the recommendations, the sanitized outputs are still biased toward one group of users, providing less aligned services for those who do not identify with the model's default assumptions. ...

Understanding the Impacts of Language Technologies’ Performance Disparities on African American Language Speakers
  • Citing Conference Paper
  • January 2024

... In EUREKA-BENCH, we choose to prioritize transparency and reproducibility, but it is also important to consider other forms of transparency that do not necessarily require full access to the whole test data. Recent work in evaluation methodology [65,67], for example, provides a process framework and guidance for adapting over time and across tasks, motivating the need for evaluation efforts to adapt and revise both test cases and methods. ...

“One-Size-Fits-All”? Examining Expectations around What Constitute “Fair” or “Good” NLG System Behaviors
  • Citing Conference Paper
  • January 2024

... Systematically examine LLM risks by broadening the scope of values and informed actions. While much of the existing research and practice focuses on a limited set of typical values (such as fairness [16,25] and interpretability [50,58]), this study, leveraging the Schwartz Theory of Basic Values [43,45], reveals that risks can emerge beyond these conventional value sets. These findings highlight the importance of incorporating a more comprehensive range of human values in systematically assessing LLMs' value inclinations and their corresponding actions. ...

FairPrism: Evaluating Fairness-Related Harms in Text Generation
  • Citing Conference Paper
  • January 2023

... Socially marginalized groups disproportionately experience representational harms caused by generative AI systems [1,2,3,4,5,6,7]. For example, popular text-to-image (T2I) models have been shown to generate inaccurate, culturally misrepresentative, and insensitive depictions of racial and ethnic minorities [1], people with disabilities [5], and foods from the African continent [4]. ...

Taxonomizing and Measuring Representational Harms: A Look at Image Tagging
  • Citing Article
  • June 2023

Proceedings of the AAAI Conference on Artificial Intelligence

... However, communication that establishes common ground can be difficult. Even with small-scale or special-purpose models, it can be difficult to characterize an AI system's abilities in a way that users understand [18,45,76,43,53]. Generative foundation models have characteristics that make transparent disclosure of their operation particularly challenging, including the wide and constantly evolving range of tasks they can perform at different levels of competency, their massive and opaque architectures, sensitivity to prompt and context, the stochasticity of their output, and the diversity of their user bases who may require different levels of detail [44]. ...

A Human-Centered Agenda for Intelligible Machine Learning
  • Citing Chapter
  • August 2021

... Moreover, the immersive experiences afforded by the roleplay feature may amplify perceived harm from behaviors like physical aggression or sexual misconduct by blurring the line between virtual actions and real emotional consequences, which demands an assessment and re-evaluation of how AI companions interact with users in immersive environments such as virtual reality (VR) or augmented reality (AR). Furthermore, by clearly delineating the specific roles AI plays in harmful behaviors, this typology can enhance AI accountability [94,105] by providing a structured framework for assigning responsibility for AI-related harms. This may help establish clearer legal liability, ensuring that AI developers, designers, and organizations are held accountable based on the AI's role in harmful interactions. ...

Accountability in Algorithmic Systems: From Principles to Practice
  • Citing Conference Paper
  • April 2023

... The design approach of AI and its mechanisms centers on the interactions between humans, their interpretations, and behaviors (for which there are no predictive models), and there is a link between societal inequality, internet search algorithms, and human decision-making (Winograd, 2006; Vlasceanu and Amodio, 2022). Although there has been research focusing on debiasing an algorithm's training set and investigating the computations of deep neural network models, a closer look at how human decision-makers interact with and consume algorithmic output is needed to increase fairness and transparency in AI use (Baer and Kamalnath, 2017; Gleaves et al., 2020; Du et al., 2021; Nourani et al., 2021; Crawford et al., 2022; Vlasceanu and Amodio, 2022). The vast amount of information that generative AI and its software are trained on is created by people, so it inherently reflects the societal biases present in the training material, which then surface in outputs such as racial and socioeconomic stereotypes (Kazimzade et al., 2019; LSU Online and Continuing Education, 2023). ...

Excerpt from Datasheets for Datasets *
  • Citing Chapter
  • March 2022

... This line of work has explored practices around responsible technology development [17,18,38,58,59,71,75,96,103] and examined data practices specifically [14,32,35,48,64,78,99,105], exposing the challenges faced by practitioners throughout data curation [32], exploratory data analysis [99], annotator selection [48], and data documentation [35]. ...

Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata
  • Citing Article
  • November 2022

Proceedings of the ACM on Human-Computer Interaction

... Some worked to make uninterpretable models more interpretable by appending purpose-built explanation-generation models to the DST [71], or offering counterfactual predictions demonstrating how changes to the values of AI features may impact their outputs [8,38,72]. Others addressed the "too much information, too little time" problem by presenting only selective or modular explanations to clinicians: the explanations that best justify the AI's suggestion [3,17] or can best address clinicians' potential biases [35,73]. These approaches are nascent and have not yet demonstrated an impact on clinicians' decision quality. ...

From Human Explanation to Model Interpretability: A Framework Based on Weight of Evidence
  • Citing Article
  • October 2021

Proceedings of the AAAI Conference on Human Computation and Crowdsourcing