Kenneth Holstein’s research while affiliated with Carnegie Mellon University and other places


Publications (92)


Validating LLM-as-a-Judge Systems in the Absence of Gold Labels
  • Preprint
  • File available

March 2025 · 2 Reads

Luke Guerdan · Solon Barocas · Kenneth Holstein · [...] · Alexandra Chouldechova

The LLM-as-a-judge paradigm, in which a judge LLM system replaces human raters in rating the outputs of other generative AI (GenAI) systems, has come to play a critical role in scaling and standardizing GenAI evaluations. To validate judge systems, evaluators collect multiple human ratings for each item in a validation corpus, and then aggregate the ratings into a single, per-item gold label rating. High agreement rates between these gold labels and judge system ratings are then taken as a sign of good judge system performance. In many cases, however, items or rating criteria may be ambiguous, or there may be principled disagreement among human raters. In such settings, gold labels may not exist for many of the items. In this paper, we introduce a framework for LLM-as-a-judge validation in the absence of gold labels. We present a theoretical analysis drawing connections between different measures of judge system performance under different rating elicitation and aggregation schemes. We also demonstrate empirically that existing validation approaches can select judge systems that are highly suboptimal, performing as much as 34% worse than the systems selected by alternative approaches that we describe. Based on our findings, we provide concrete recommendations for developing more reliable approaches to LLM-as-a-judge validation.
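For context, the sketch below illustrates the standard gold-label validation recipe that this abstract describes and critiques: aggregate each item's human ratings into a majority-vote "gold" label, then compute the agreement rate between the judge system's ratings and those gold labels. It is a minimal illustration, not the validation framework the paper proposes; the function names and toy data are invented.

```python
# Minimal sketch of the gold-label validation recipe critiqued in the abstract.
from collections import Counter

def majority_gold_label(human_ratings):
    """Aggregate one item's human ratings into a single gold label by
    majority vote (ties broken arbitrarily by Counter ordering)."""
    return Counter(human_ratings).most_common(1)[0][0]

def agreement_rate(judge_ratings, per_item_human_ratings):
    """Fraction of items on which the judge system's rating matches the
    majority-vote gold label."""
    gold = [majority_gold_label(r) for r in per_item_human_ratings]
    return sum(j == g for j, g in zip(judge_ratings, gold)) / len(gold)

# Toy validation corpus: 4 items, 3 human raters each, a binary pass/fail criterion.
human_ratings = [
    ["pass", "pass", "fail"],  # raters disagree
    ["fail", "fail", "fail"],
    ["pass", "pass", "pass"],
    ["pass", "fail", "fail"],  # raters disagree
]
judge_ratings = ["pass", "fail", "pass", "pass"]

print(agreement_rate(judge_ratings, human_ratings))  # 0.75
```

When items are genuinely ambiguous or raters disagree on principle (as in the split-rating items above), the majority "gold" label may not reflect any meaningful ground truth, which is exactly the setting the paper's framework targets.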


Figure captions: Figure 1: Motivations and Interconnections of the Seven AI Mismatch Matrices · Figure 2: The Required Performance matrix · Figure 3: The Disparate Performance matrix · Figure 4: The Cost of Errors matrix · Figure 5: The Data Quality matrix · Figure 6: The Model Unobservables matrix · Figure 7: The Expectation of Errors matrix

AI Mismatches: Identifying Potential Algorithmic Harms Before AI Development

February 2025 · 8 Reads

AI systems are often introduced with high expectations, yet many fail to deliver, resulting in unintended harm and missed opportunities for benefit. We frequently observe significant "AI Mismatches", where the system's actual performance falls short of what is needed to ensure safety and co-create value. These mismatches are particularly difficult to address once development is underway, highlighting the need for early-stage intervention. Navigating complex, multi-dimensional risk factors that contribute to AI Mismatches is a persistent challenge. To address it, we propose an AI Mismatch approach to anticipate and mitigate risks early on, focusing on the gap between realistic model performance and required task performance. Through an analysis of 774 AI cases, we extracted a set of critical factors, which informed the development of seven matrices that map the relationships between these factors and highlight high-risk areas. Through case studies, we demonstrate how our approach can help reduce risks in AI development.
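The abstract frames each matrix as a mapping from combinations of risk factors to higher- or lower-risk areas. As a loose, hypothetical illustration of the "Required Performance" idea (the gap between required task performance and realistic model performance), the sketch below encodes such a mapping in a few lines; the levels and risk labels are invented for illustration and are not the paper's actual matrix cells.

```python
# Hypothetical illustration of the "Required Performance" idea: risk rises as the
# performance a task requires outstrips what a model can realistically deliver.
# Levels and risk labels are invented, not the paper's actual matrix cells.
LEVELS = ["low", "medium", "high"]

def required_performance_risk(required: str, realistic: str) -> str:
    """Map (required task performance, realistic model performance) to a coarse risk label."""
    gap = LEVELS.index(required) - LEVELS.index(realistic)
    if gap <= 0:
        return "lower risk"    # the model meets or exceeds what the task needs
    if gap == 1:
        return "moderate risk"
    return "high risk"         # the task demands far more than the model can deliver

print(required_performance_risk("high", "low"))     # high risk
print(required_performance_risk("medium", "high"))  # lower risk
```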


Intent Tagging: Exploring Micro-Prompting Interactions for Supporting Granular Human-GenAI Co-Creation Workflows

February 2025 · 26 Reads

Despite Generative AI (GenAI) systems' potential for enhancing content creation, users often struggle to effectively integrate GenAI into their creative workflows. Core challenges include misalignment of AI-generated content with user intentions (intent elicitation and alignment), user uncertainty around how to best communicate their intents to the AI system (prompt formulation), and insufficient flexibility of AI systems to support diverse creative workflows (workflow flexibility). Motivated by these challenges, we created IntentTagger: a system for slide creation based on the notion of Intent Tags - small, atomic conceptual units that encapsulate user intent - for exploring granular and non-linear micro-prompting interactions for Human-GenAI co-creation workflows. Our user study with 12 participants provides insights into the value of flexibly expressing intent across varying levels of ambiguity, meta-intent elicitation, and the benefits and challenges of intent tag-driven workflows. We conclude by discussing the broader implications of our findings and design considerations for GenAI-supported content creation workflows.
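As a rough illustration of the intent-tag idea described above, the hypothetical sketch below represents a tag as a small (target, intent) unit and composes a few tags into a micro-prompt. The IntentTag dataclass and prompt format are invented for illustration and are not IntentTagger's actual implementation.

```python
# Hypothetical sketch of the intent-tag notion: small, atomic units of user intent
# attached to parts of a slide and composed into a compact "micro-prompt".
# The IntentTag dataclass and prompt format are invented, not IntentTagger's real API.
from dataclasses import dataclass
from typing import List

@dataclass
class IntentTag:
    target: str  # which part of the slide the intent applies to, e.g. "title"
    intent: str  # the user's (possibly ambiguous) intent, e.g. "more playful tone"

def compose_micro_prompt(tags: List[IntentTag]) -> str:
    """Turn a handful of intent tags into a prompt fragment, leaving everything
    not covered by a tag open for the GenAI system to resolve."""
    lines = [f"- {tag.target}: {tag.intent}" for tag in tags]
    return "Revise the slide according to these intents:\n" + "\n".join(lines)

tags = [
    IntentTag(target="title", intent="more playful tone"),
    IntentTag(target="image", intent="something abstract, in calm colors"),
]
print(compose_micro_prompt(tags))
```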



WeAudit: Scaffolding User Auditors and AI Practitioners in Auditing Generative AI

January 2025 · 42 Reads

There has been growing interest from both practitioners and researchers in engaging end users in AI auditing, to draw upon users' unique knowledge and lived experiences. However, we know little about how to effectively scaffold end users in auditing in ways that can generate actionable insights for AI practitioners. Through formative studies with both users and AI practitioners, we first identified a set of design goals to support user-engaged AI auditing. We then developed WeAudit, a workflow and system that supports end users in auditing AI both individually and collectively. We evaluated WeAudit through a three-week user study with user auditors and interviews with industry Generative AI practitioners. Our findings offer insights into how WeAudit supports users in noticing and reflecting upon potential AI harms and in articulating their findings in ways that industry practitioners can act upon. Based on our observations and feedback from both users and practitioners, we identify several opportunities to better support user engagement in AI auditing processes. We discuss implications for future research to support effective and responsible user engagement in AI auditing and red-teaming.




Studying Up Public Sector AI: How Networks of Power Relations Shape Agency Decisions Around AI Design and Use

November 2024 · 8 Reads · 9 Citations

Proceedings of the ACM on Human-Computer Interaction

As public sector agencies rapidly introduce new AI tools in high-stakes domains like social services, it becomes critical to understand how decisions to adopt these tools are made in practice. We borrow from the anthropological practice to "study up" those in positions of power, and reorient our study of public sector AI around those who have the power and responsibility to make decisions about the role that AI tools will play in their agency. Through semi-structured interviews and design activities with 16 agency decision-makers, we examine how decisions about AI design and adoption are influenced by their interactions with and assumptions about other actors within these agencies (e.g., frontline workers and agency leaders), as well as those above (legal systems and contracted companies), and below (impacted communities). By centering these networks of power relations, our findings shed light on how infrastructural, legal, and social factors create barriers and disincentives to the involvement of a broader range of stakeholders in decisions about AI design and adoption. Agency decision-makers desired more practical support for stakeholder involvement around public sector AI to help overcome the knowledge and power differentials they perceived between them and other stakeholders (e.g., frontline workers and impacted community members). Building on these findings, we discuss implications for future research and policy around actualizing participatory AI approaches in public sector contexts.


Responsible Crowdsourcing for Responsible Generative AI: Engaging Crowds in AI Auditing and Evaluation

October 2024 · 26 Reads

Proceedings of the AAAI Conference on Human Computation and Crowdsourcing

With the rise of generative AI (GenAI), there has been an increased need for participation by large and diverse user bases in AI evaluation and auditing. GenAI developers are increasingly adopting crowdsourcing approaches to test and audit their AI products and services. However, it remains an open question how to design and deploy responsible and effective crowdsourcing pipelines for AI auditing and evaluation. This workshop aims to take a step towards bridging this gap. Our interdisciplinary team of organizers will work with workshop participants to explore several key questions, such as how to improve output quality and workers' productivity for GenAI evaluation crowdsourcing tasks compared to those for discriminative AI systems, how to guide crowds in auditing problematic AI-generated content while managing the psychological impact on workers, how to ensure marginalized voices are heard, and how to set up responsible and effective crowdsourcing pipelines for real-world GenAI evaluation. We hope this workshop will produce a research agenda and best practices for designing responsible crowd-based approaches to AI auditing and evaluation.


Investigating What Factors Influence Users’ Rating of Harmful Algorithmic Bias and Discrimination

October 2024 · 10 Reads

Proceedings of the AAAI Conference on Human Computation and Crowdsourcing

There has been growing recognition of the crucial role users, especially those from marginalized groups, play in uncovering harmful algorithmic biases. However, it remains unclear how users' identities and experiences might impact their rating of harmful biases. We present an online experiment (N=2,197) examining these factors: demographics, discrimination experiences, and social and technical knowledge. Participants were shown examples of image search results, including ones that previous literature has identified as biased against marginalized racial, gender, or sexual orientation groups. We found that participants from marginalized gender or sexual orientation groups were more likely to rate the examples as more severely harmful; belonging to a marginalized racial group did not show a similar pattern. Additional factors affecting users' ratings included discrimination experiences and having friends or family belonging to marginalized demographics. A qualitative analysis offers insights into users' bias recognition and why they see biases the way they do. We provide guidance for designing future methods to support effective user-driven auditing.


Citations (60)


... HCAI advocates for designing systems that align with human cognitive strengths, workflows, and societal values, empowering workers rather than marginalizing them [51]. Discussions at recent ACM CHI and CSCW conferences during special interest groups (SIGs) [15,16,38,52] have reinforced the critical need for aligning AI with human cognitive abilities, workflows, and societal needs. To see how, consider examples of existing systems such as adaptive learning platforms that personalize education for diverse learners [17]; or AI-augmented radiologists who achieve greater diagnostic accuracy by collaborating with machine learning systems [3]. ...

Reference:

The Impact of AI on Jobs: HCI is Missing
Collaboratively Designing and Evaluating Responsible AI Interventions
  • Citing Conference Paper
  • November 2024

... In our work, we found the City's approach to assessing clients - as a continuous and rapport-focused exercise between workers and clients - had strong buy-in from frontline staff because the City had developed their data practices through extensive collaborations with community organizations and impacted stakeholders. We thus encourage HCI researchers who strive to design technologies for homelessness to consider the fundamental tenet that technical interventions are developed within the context of systemic constraints, organizational processes, and asymmetrical power dynamics between workers and clients [7,43,50,59,86,88,90]. ...

Studying Up Public Sector AI: How Networks of Power Relations Shape Agency Decisions Around AI Design and Use
  • Citing Article
  • November 2024

Proceedings of the ACM on Human-Computer Interaction

... At the same time, we encourage researchers to approach the study of fat people as a marginalized group with care. There is ongoing debate among those researching marginalized groups over the extent to which HCI researchers should center harms, deficits, or damage in their work relative to joy or everyday experiences [25,126]. Certainly, the anti-fatness associated with online harassment and representational harms are important to address and warrant further research attention in HCI (Section 5.2). ...

Carefully Unmaking the “Marginalized User:” A Diffractive Analysis of a Gay Online Community
  • Citing Article
  • June 2024

ACM Transactions on Computer-Human Interaction

... Users not only detect biases but also foster dialogue, build consensus, and mobilize collective action, highlighting how bottom-up initiatives can complement top-down regulatory efforts to promote decentralized, accountable AI governance. Achieving this requires strengthening public accountability mechanisms [36], fostering deliberative processes for conflict resolution [37,38], and enhancing public AI literacy [39,40]. ...

AI Failure Cards: Understanding and Supporting Grassroots Efforts to Mitigate AI Failures in Homeless Services
  • Citing Conference Paper
  • June 2024

... Practitioners, both in prior research and our study, noted that profit-driven goals often hinder responsible AI practices, even when practitioners are genuinely committed to marginalized communities. In addition, practitioners' motivation for addressing the harms that elicited the most "surprise" from other users (see Section 6.4.2) appears to stem from a desire to "avoid bad public relations," centering the interests of the company rather than a genuine commitment to the affected marginalized community members [30,37,72,76,122]. Furthermore, as Deng et al. highlighted, while systems like WeAudit can scaffold potentially meaningful collaboration between AI users and developers, they also "firmly place the choice to take action with practitioners, potentially leaving users with less room for leverage via other means." ...

Building, Shifting, & Employing Power: A Taxonomy of Responses From Below to Algorithmic Harm
  • Citing Conference Paper
  • June 2024

... How can HCI support shared automation ownership and help bridge the challenges in this regard? In prior work, Kim et al. [45] highlight the need for a stakeholder-centered taxonomy for automated vehicles that incorporates stakeholders' perspectives and requirements; similarly, Kawakami et al. [41] created a guidebook to support effective multi-stakeholder decision-making in early-stage public AI projects. Thus, all affected stakeholders should be included early enough, and efforts to automate a process should not be one-sided but rather collaborative. ...

The Situate AI Guidebook: Co-Designing a Toolkit to Support Multi-Stakeholder, Early-stage Deliberations Around Public Sector AI Proposals
  • Citing Conference Paper
  • May 2024

... Designing AI systems that account for personal, emotional, and psychosocial factors could help align with the personalized approach of clinical medicine. The discussion around designing trustworthy and reliable intelligent agents has recently gained importance with the increasing capabilities of LLMs, as reflected, e.g., in CHI workshops [3]. Tools like FarSight [83] help AI designers better assess potential risks during the design phase. ...

Trust and Reliance in Evolving Human-AI Workflows (TREW)
  • Citing Conference Paper
  • May 2024

... While AI interventions, such as shared gaze visualization [131] and "co-orchestration" [61], in which AI identifies struggling students in real time and notifies teachers to direct assistance [69], can be beneficial, they may also create tensions between intervening to help a student and disrupting their learning autonomy [143]. Our previous work [56] reveals that students' perceptions of intervention can vary based on their past achievement: confident students often view intervention negatively, as disrupting their learning independence, whereas students with less confidence typically see it as beneficial, timely support. ...

Teacher Noticing and Student Learning in Human-AI Partnered Classrooms: A Multimodal Analysis
  • Citing Conference Paper
  • October 2023

... Different tools have also been created to facilitate critical reflection about responsible AI among the practitioners developing or engineering these systems [8,25,45]. Related to helping workers learn to use AI responsibly, one line of work has studied how to support workers' proper interpretations of AI recommendations [15,32,49]. Researchers focus on how interfaces, training, and feedback can help workers calibrate their mental models of AI decision-making to improve outcomes of human-AI collaboration. ...

Training Towards Critical Use: Learning to Situate AI Predictions Relative to Human Knowledge
  • Citing Conference Paper
  • November 2023