Hal Daumé III’s research while affiliated with Loyola University Maryland and other places


Publications (150)


Figure 7: U.S.-reported median earnings for 8 intersectional groups per ACS 2022 (White male as reference), versus the corresponding mean salaries offered by LLMs for names in these groups.
Figure 8: Percentage breakdown of the races of names chosen by GPT-3.5 and Llama 3 for 40 occupations, by gender. White names are disproportionately favored by LLMs, followed by Asian names. Llama 3 shows less preference for White names than GPT-3.5. The distribution of races is not always consistent across genders for the same occupation.
Figure 9: Sample biographies drawn from the occupation dentist after 2 stages of rewriting by GPT-4o.
Percentages of employed persons by occupation, sex, race, and Hispanic or Latino ethnicity in 2023, as published by the U.S. Bureau of Labor Statistics for the 30 occupations in §2.4 (Bureau, 2023). U.S. Category denotes the original category, as published, that we match to our list of occupations. Bias indicates whether the occupation appears in the BiasinBios dataset. The percentages for the race groups do not sum to 100% since not all races are presented. Persons who identified as Hispanic/Latino may be of any race under this methodology.
"You Gotta be a Doctor, Lin": An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations
  • Article
  • Full-text available

November 2024 · 7 Reads · [...] · Jieyu Zhao · Hal Daumé III

Social science research has shown that candidates with names indicative of certain races or genders often face discrimination in employment practices. Similarly, Large Language Models (LLMs) have demonstrated racial and gender biases in various applications. In this study, we utilize GPT-3.5-Turbo and Llama 3-70B-Instruct to simulate hiring decisions and salary recommendations for candidates with 320 first names that strongly signal their race and gender, across over 750,000 prompts. Our empirical results indicate a preference among these models for hiring candidates with White female-sounding names over other demographic groups across 40 occupations. Additionally, even among candidates with identical qualifications, salary recommendations vary by as much as 5% between different subgroups. A comparison with real-world labor data reveals inconsistent alignment with U.S. labor market characteristics, underscoring the necessity of risk investigation of LLM-powered systems.
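
As a rough illustration of the prompting setup this abstract describes, the sketch below queries a chat model with hiring prompts that differ only in the candidate's first name. It is a minimal sketch, not the authors' experimental harness: the prompt template, example names, occupations, and output handling are illustrative placeholders, and it assumes the OpenAI Python client (>= 1.0) with an API key in the environment; Llama 3-70B-Instruct would be queried analogously through whatever serving stack is available.

# Minimal sketch (assumptions: OpenAI Python client >= 1.0, OPENAI_API_KEY set).
# The template, names, and occupations are placeholders meant only to illustrate
# varying the candidate's name while holding everything else fixed.
from openai import OpenAI

client = OpenAI()

NAMES = ["Lin", "Emily", "Jamal", "Maria"]        # stand-ins for the 320 race/gender-signaling names
OCCUPATIONS = ["dentist", "software developer"]   # stand-ins for the 40 occupations

TEMPLATE = (
    "You are screening applicants for a {occupation} position. "
    "All applicants have identical qualifications. The applicant's first name is {name}. "
    "Reply with a hiring recommendation (yes/no) and a proposed annual salary in USD."
)

def query_hiring_decision(name: str, occupation: str) -> str:
    """Send one name-varied hiring prompt to GPT-3.5-Turbo and return the raw reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": TEMPLATE.format(name=name, occupation=occupation)}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    for occupation in OCCUPATIONS:
        for name in NAMES:
            print(f"{occupation} | {name} | {query_hiring_decision(name, occupation)}")

Aggregating the returned recommendations and salaries by name group is what then supports comparisons like those summarized in Figures 7 and 8 above.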


"You Gotta be a Doctor, Lin": An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations

June 2024 · 60 Reads · 1 Citation

Social science research has shown that candidates with names indicative of certain races or genders often face discrimination in employment practices. Similarly, Large Language Models (LLMs) have demonstrated racial and gender biases in various applications. In this study, we utilize GPT-3.5-Turbo and Llama 3-70B-Instruct to simulate hiring decisions and salary recommendations for candidates with 320 first names that strongly signal their race and gender, across over 750,000 prompts. Our empirical results indicate a preference among these models for hiring candidates with White female-sounding names over other demographic groups across 40 occupations. Additionally, even among candidates with identical qualifications, salary recommendations vary by as much as 5% between different subgroups. A comparison with real-world labor data reveals inconsistent alignment with U.S. labor market characteristics, underscoring the necessity of risk investigation of LLM-powered systems.


How do authors’ perceptions of their papers compare with co-authors’ perceptions and peer-review decisions?

April 2024 · 103 Reads · 7 Citations · Charvi Rastogi · Ivan Stelmakh · Alina Beygelzimer · [...]

How do author perceptions match up to the outcomes of the peer-review process and to the perceptions of others? At a top-tier computer science conference (NeurIPS 2021) with more than 23,000 submitting authors and 9,000 submitted papers, we surveyed the authors on three questions: (i) their predicted probability of acceptance for each of their papers, (ii) their perceived ranking of their own papers based on scientific contribution, and (iii) the change in their perception about their own papers after seeing the reviews. The salient results are: (1) Authors had roughly a three-fold overestimate of the acceptance probability of their papers: the median prediction was 70% for an approximately 25% acceptance rate. (2) Female authors exhibited a marginally higher (statistically significant) miscalibration than male authors; predictions of authors invited to serve as meta-reviewers or reviewers were similarly calibrated to one another, but better calibrated than those of authors who were not invited to review. (3) Authors’ relative ranking of the scientific contribution of two submissions they made generally agreed with their predicted acceptance probabilities (93% agreement), but in a notable 7% of responses authors predicted a worse outcome for their better paper. (4) The author-provided rankings disagreed with the peer-review decisions about a third of the time; when co-authors ranked their jointly authored papers, they disagreed at a similar rate, about a third of the time. (5) At least 30% of respondents for both accepted and rejected papers said that their perception of their own paper improved after the review process. Stakeholders in peer review should take these findings into account when setting their expectations of peer review.
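
As a quick arithmetic check of the “three-fold” figure in result (1), using only the numbers quoted in the abstract:

\[
\frac{\text{median predicted acceptance probability}}{\text{actual acceptance rate}}
  \approx \frac{0.70}{0.25} = 2.8 \approx 3,
\]

i.e., the median author prediction overstates the realized acceptance rate by roughly a factor of three.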


Citations (84)


... However, most VQA efforts are focused on evaluating machine "understanding," an abstract task which seeks to test the spatial or object reasoning of machine learning models. The resulting systems aren't easily extendable to an accessibility application [2]. The VizWiz dataset is an exception, where image-question pairs are obtained from the actual use case of blind and low-vision people asking questions about information in the visual world they're taking pictures of [7]. ...

Reference:

Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering
What’s Different between Visual Question Answering for Machine “Understanding” Versus for Accessibility?
  • Citing Conference Paper
  • January 2022

... They encode the observations so far as the state, the unused features/base classifiers as the action space, and formulate various reward functions to account for classification error and costs. Such methods have been successfully applied in NLP (He et al., 2013) to adaptively select features for dependency parsing, as well as in computer vision (Weiss and Taskar, 2013) for human pose tracking. These methods use RL as a tool to learn decision rules that reduce test-time cost in the static environment of classification or structured prediction problems. ...

Dynamic Feature Selection for Dependency Parsing
  • Citing Conference Paper
  • January 2013
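
The citation context above describes a sequential-decision view of cost-sensitive feature selection: the observations gathered so far form the state, the unacquired features (or base classifiers) are the actions, and the reward trades off acquisition cost against classification error. The sketch below is a hypothetical illustration of one such transition, under assumptions of my choosing; the names (State, step, classify, costs) do not come from the cited papers.

# Hypothetical sketch of the decision process summarized above (not the cited papers' code):
# state  = features observed so far plus features not yet acquired
# action = acquire one remaining feature, or "stop" and classify
# reward = negative acquisition cost while gathering, negative error at the end
from dataclasses import dataclass, field

@dataclass
class State:
    observed: dict = field(default_factory=dict)   # feature name -> observed value
    remaining: set = field(default_factory=set)    # features that could still be acquired

def step(state: State, action, example: dict, costs: dict, classify, true_label):
    """Apply one action; return (next_state, reward, done)."""
    if action == "stop":
        prediction = classify(state.observed)               # classify using only observed features
        reward = 0.0 if prediction == true_label else -1.0  # penalize classification error
        return state, reward, True
    state.observed[action] = example[action]                # pay to acquire the chosen feature
    state.remaining.discard(action)
    return state, -costs[action], False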

... These findings reinforce past work which argues that allocation harm is manifested in the additional effort required by non-White users to adapt their prompts for successful use (Cunningham et al., 2024) and in the extra cost of the additional tokens required to model linguistic variety (Ahia et al., 2023). While removing identity features from prompts altogether eliminates stereotypes from the recommendations, the sanitized outputs are still biased toward one group of users, providing less aligned services for those who do not identify with the model's default assumptions. ...

Understanding the Impacts of Language Technologies’ Performance Disparities on African American Language Speakers
  • Citing Conference Paper
  • January 2024

... Using ChatGPT, Social AI is found to influence users' moral judgments, whether or not users know that the source of the advice is generated by Social AI. Si et al. (2023) examine humans' ability to assess (true or false) information and explanations generated by Social AI, compared to information snippets from Wikipedia, with N = 1,500 participants across five conditions (sample demographics not disclosed), using ChatGPT; users tend to over-rely on explanations provided by ChatGPT, which decreases their ability to detect false claims in informational content. Spitale et al. (2023) study to what extent Social AI is more capable than humans at writing convincing, yet wrong information, with N = 697. ...

Large Language Models Help Humans Verify Truthfulness – Except When They Are Convincingly Wrong
  • Citing Conference Paper
  • January 2024

... GPT models represented the autoregressive decoder models available for in-context learning via prompt engineering in 2023. Models have scaled and improved since then, and it is possible that performance would improve, but issues of underlying racial biases (Section 5.5) continue to exist, even with more current models (Hofmann et al., 2024b,a; Warr et al., 2024; Shieh et al., 2024; Nghiem et al., 2024; Henderson et al., 2024). ...

"You Gotta be a Doctor, Lin": An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations

... Accurate review scores are crucial to improving peer review in very large-scale publication venues because they are the most important factor in determining accept/reject decisions. The Isotonic Mechanism is well-suited for contemporary machine learning conferences, where it is commonplace for an author to submit multiple papers at the same time (Sun, 2020; Rastogi et al., 2022). An advantage of this methodology is that it requires minimal effort from authors and imposes no additional burden on reviewers. ...

How do authors’ perceptions of their papers compare with co-authors’ perceptions and peer-review decisions?

... Here, we are interested in how the bias of a human-AI assemblage relates to the bias of humans-alone or AI-alone. Prior work on human-AI decision-making has found that the bias of a human-AI team is not simply equal to the sum of its parts and can depend on factors such as the decision-making task and whether or how the AI's suggestions are justified [e.g., 18,30,57,62,67]. Our paper considers the task of human text authorship with the help of word-level suggestions given by a predictive text system. ...

The Impact of Explanations on Fairness in Human-AI Decision-Making: Protected vs Proxy Features
  • Citing Conference Paper
  • April 2024

... (2) We construct a real-world dataset, called HateDebias, where any sequence of sub-datasets in HateDebias corresponds to continuously varying bias. (3) We propose a simple yet effective framework for continuous debiasing tasks, which provides a potential direction for improving continuous debiasing. ...

Towards Conceptualization of “Fair Explanation”: Disparate Impacts of anti-Asian Hate Speech Explanations on Content Moderators
  • Citing Conference Paper
  • January 2023

... Large Language Models (LLMs) have shown impressive performance in reasoning and remembering extensive knowledge (Radford et al., 2019; Brown et al., 2020; Achiam et al., 2023; Anthropic, 2024; Zhao et al., 2023b; Touvron et al., 2023a,b; Yang et al., 2023). However, when facing domain-specific questions, especially in the e-commerce domain, which has many long-tail entities and frequently updated information, LLMs often have issues such as hallucinations (Shi et al., 2023c; Wang et al., 2023; Zhao et al., 2023a). Retrieval-Augmented Generation ...

Hallucination Detection for Grounded Instruction Generation
  • Citing Conference Paper
  • January 2023

... These systems typically return a single answer without any supporting context or evidence, making it challenging for users to understand how the answer was derived [40], [41]. This lack of transparency can reduce user trust and confidence in the system, especially in critical applications such as healthcare or legal domains [42]. Overall, while traditional QA systems have been valuable in certain contexts, their limitations have led to the development of more advanced approaches. In recent years, the advent of powerful language models, such as the Generative Pre-trained Transformer (GPT), has revolutionized the field of natural language processing (NLP) and opened up new possibilities for conversational agents [35], [43], [44], [45]. ...

What Else Do I Need to Know? The Effect of Background Information on Users’ Reliance on QA Systems
  • Citing Conference Paper
  • January 2023