Conference Paper

CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

... Language models pre-trained on large text corpora, which are highly likely to contain toxic and inappropriate content, are known to pass these biases on, or, worse, amplify them when generating query responses and text [10]. The bias can present itself in various forms, such as discrimination based on race, gender, disability, nationality or religion [11][12]. To this end, researchers developed a challenge dataset, CrowS-Pairs, crowdsourced using Amazon Mechanical Turk (MTurk), to measure the extent of bias in masked language modelling (MLM) [11]. ...

... The bias can present itself in various forms, such as discrimination based on race, gender, disability, nationality or religion [11][12]. To this end, researchers developed a challenge dataset, CrowS-Pairs, crowdsourced using Amazon Mechanical Turk (MTurk), to measure the extent of bias in masked language modelling (MLM) [11]. More recent approaches to address such biases were collected by Gallegos et al. in [10]. ...
... CLM token prediction is uni-directional because only the past tokens are used to predict the next tokens. MLM, most frequently used to pre-train encoder models [19][26], is a self-supervised learning technique that randomly masks tokens in an input sequence with the aim of learning the masked tokens based on the surrounding context provided by the unmasked tokens [11][31]. Hence, MLM differs from CLM by using unmasked tokens both before and after masked tokens, providing a bi-directional understanding of context instead of being limited to the words that precede it (Figure 7, page 8). ...
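To make the masking mechanics concrete, here is a minimal sketch of masked-token prediction with a bidirectional encoder, assuming the Hugging Face transformers library is available; the model choice (bert-base-uncased) and the example sentence are illustrative assumptions, not taken from the cited works.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative model choice; any masked (encoder) LM would do.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Tokens both before and after the mask inform the prediction (bi-directional),
# unlike a causal LM, which only sees the preceding tokens.
text = f"The doctor told the nurse that {tokenizer.mask_token} would be late."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and inspect the top candidate fillers.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_pos].softmax(dim=-1)[0]
top = probs.topk(5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([idx.item()]):>10s}  {p.item():.3f}")
```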
Preprint
Full-text available
This paper provides a primer on Large Language Models (LLMs) and identifies their strengths, limitations, applications and research directions. It is intended to be useful to those in academia and industry who are interested in gaining an understanding of the key LLM concepts and technologies, and in utilising this knowledge in both day-to-day tasks and in more complex scenarios where this technology can enhance current practices and processes.
... Probability-based metrics offer another approach to bias evaluation, focusing on the probabilities assigned to tokens or sequences during model inference. For example, masked token probabilities and pseudo-log-likelihoods (e.g., CrowS-Pairs (Nangia et al., 2020)) are used to assess the likelihood of generating biased tokens, thereby revealing underlying biases in the decision-making process of language models. Finally, generated text-based metrics evaluate biases in the actual outputs generated by LLMs (Guo & Caliskan, 2021; Webster et al., 2020). ...
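As a rough illustration of such a probability-based metric, the sketch below computes a simplified pseudo-log-likelihood by masking one token at a time and summing the log-probability of the original token; CrowS-Pairs' own scoring is a refinement of this idea (it masks only the tokens shared by both sentences of a pair). The model choice and the placeholder sentences are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative choice
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log-probabilities of each token when it is masked in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for pos in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        total += logits[0, pos].log_softmax(dim=-1)[ids[pos]].item()
    return total

# Placeholder pair; in CrowS-Pairs the two sentences differ only in the
# words identifying a social group.
stereo, anti = "Placeholder stereotypical sentence.", "Placeholder anti-stereotypical sentence."
print(pseudo_log_likelihood(stereo), pseudo_log_likelihood(anti))
```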
... Benchmarks A range of benchmarks has been created to assess bias in LLMs, providing essential data for evaluating fairness across different tasks. Masked token benchmarks like Winogender (Rudinger et al., 2018), Winobias (Zhao et al., 2018), and GAP (Webster et al., 2018) are designed to test bias by predicting the most likely words in masked sentences, helping to reveal gender and other social biases embedded in models (Nangia et al., 2020; Rajpurkar et al., 2016). Unmasked sentence benchmarks, such as CrowS-Pairs (Nangia et al., 2020) and RedditBias (Barikeri et al., 2021), evaluate bias by comparing the likelihood of generating biased versus neutral content, offering insights into the model's tendencies in real-world text generation scenarios. ...

... Masked token benchmarks like Winogender (Rudinger et al., 2018), Winobias (Zhao et al., 2018), and GAP (Webster et al., 2018) are designed to test bias by predicting the most likely words in masked sentences, helping to reveal gender and other social biases embedded in models (Nangia et al., 2020; Rajpurkar et al., 2016). Unmasked sentence benchmarks, such as CrowS-Pairs (Nangia et al., 2020) and RedditBias (Barikeri et al., 2021), evaluate bias by comparing the likelihood of generating biased versus neutral content, offering insights into the model's tendencies in real-world text generation scenarios. Additionally, sentence completion benchmarks like RealToxicityPrompts (Gehman et al., 2020) and BOLD (Dhamala et al., 2021a) focus on assessing bias in the continuation of sentences, particularly in the context of generating toxic or harmful language. ...
Preprint
Full-text available
The rapid development and deployment of large language models (LLMs) have introduced a new frontier in artificial intelligence, marked by unprecedented capabilities in natural language understanding and generation. However, the increasing integration of these models into critical applications raises substantial safety concerns, necessitating a thorough examination of their potential risks and associated mitigation strategies. This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks. In addition to the comprehensive review of the mitigation methodologies and evaluation resources on these four aspects, we further explore four topics related to LLM safety: the safety implications of LLM agents, the role of interpretability in enhancing LLM safety, the technology roadmaps proposed and abided by a list of AI companies and institutes for LLM safety, and AI governance aimed at LLM safety with discussions on international cooperation, policy proposals, and prospective regulatory directions. Our findings underscore the necessity for a proactive, multifaceted approach to LLM safety, emphasizing the integration of technical solutions, ethical considerations, and robust governance frameworks. This survey is intended to serve as a foundational resource for academy researchers, industry practitioners, and policymakers, offering insights into the challenges and opportunities associated with the safe integration of LLMs into society. Ultimately, it seeks to contribute to the safe and beneficial development of LLMs, aligning with the overarching goal of harnessing AI for societal advancement and well-being. A curated list of related papers has been publicly available at https://github.com/tjunlp-lab/Awesome-LLM-Safety-Papers.
... AI fairness, therefore, remains to be a critical area of focus for the research community, which bears an ethical responsibility to mitigate the potential negative impacts of the technologies it builds (Talat et al., 2022;Amershi et al., 2020;Hovy and Spruit, 2016). Scholars have developed bias evaluation benchmarks to not only establish baselines quantifying biased behavior exhibited by off-the-shelf PLMs, but also to measure the effectiveness of bias mitigation techniques applied on these models (Reusens et al., 2023;Blodgett et al., 2021;Nangia et al., 2020). ...
... In this paper, we address this gap by adapting two bias benchmarks-Crowdsourced Stereotype Pairs or CrowS-Pairs (Nangia et al., 2020) and WinoQueer (Felkner et al., 2023)-for Filipino, a language that currently does not have high NLP resources (Joshi et al., 2020). CrowS-Pairs is a dataset widely used to probe PLMs for different stereotypes (e.g., race, gender, religion, age, etc.), while WinoQueer is a recently released benchmark designed to assess the extent of anti-LGBTQ+ bias encoded in PLMs. ...
... These efforts resulted in benchmarks that provide more comprehensive and nuanced measures of bias in both masked and causal models. Examples of these bias evaluation benchmarks include BBQ (Parrish et al., 2022), BOLD (Dhamala et al., 2021), RealToxicityPrompts (Gehman et al., 2020;Schick et al., 2021), StereoSet (Nadeem et al., 2021), CrowS-Pairs (Nangia et al., 2020), and WinoQueer (Felkner et al., 2023). All have verified the presence of biased behavior across a wide range of language models. ...
Preprint
Full-text available
Bias studies on multilingual models confirm the presence of gender-related stereotypes in masked models processing languages with high NLP resources. We expand on this line of research by introducing Filipino CrowS-Pairs and Filipino WinoQueer: benchmarks that assess both sexist and anti-queer biases in pretrained language models (PLMs) handling texts in Filipino, a low-resource language from the Philippines. The benchmarks consist of 7,074 new challenge pairs resulting from our cultural adaptation of English bias evaluation datasets, a process that we document in detail to guide similar forthcoming efforts. We apply the Filipino benchmarks on masked and causal multilingual models, including those pretrained on Southeast Asian data, and find that they contain considerable amounts of bias. We also find that for multilingual models, the extent of bias learned for a particular language is influenced by how much pretraining data in that language a model was exposed to. Our benchmarks and insights can serve as a foundation for future work analyzing and mitigating bias in multilingual models.
... Most studies use either geographical region (37 out of 90) or language (35 out of 90) or both (17 out of 90) as a proxy for culture. These two proxies are strongly correlated, especially when regions are defined as countries (for example, EVS/WVS (2022); Nangia et al. (2020); Koto et al. (2023)). Some of these studies focus on a specific region or language, for example, Indonesia (Koto et al., 2023), France/French (Nangia et al., 2020), the Middle East/Arabic (Naous et al., 2023), and India (Khanuja et al., 2023). ...

... These two proxies are strongly correlated, especially when regions are defined as countries (for example, EVS/WVS (2022); Nangia et al. (2020); Koto et al. (2023)). Some of these studies focus on a specific region or language, for example, Indonesia (Koto et al., 2023), France/French (Nangia et al., 2020), the Middle East/Arabic (Naous et al., 2023), and India (Khanuja et al., 2023). A few studies, such as Dwivedi et al. (2023), further group countries into larger global regions such as Europe. ...
... Samples used by (Nangia et al., 2020) to calculate the conditional likelihood of a pair of sentences. 1. For an average American, their attitude toward "one can be a good manager without having a precise answer to every question that a subordinate may raise about his or her work" is: (1) strongly agree (2) agree (3) undecided (4) disagree (5) strongly disagree. ...
... However, the limited number of social attributes studied is problematic. Therefore, some bias researchers have worked on a variety of attributes such as disability [6], sexual orientation [7], and intersectional ones [8,9], as well as studying negative bias toward queer people living in nonbinary genders [10]. Moreover, other studies have found that large language models (LLMs) generate harmful and stereotypical representations [11] and have proposed AI alignment methods to prevent them [12]. ...
... In the first quantitative investigation, we search for the words "speciesism" and "anthropocentrism" on the ACL Anthology to analyze how many papers mention them and how the words were used. The terms "speciesism" and "anthropocentrism" are of significant importance in ethical discussions regarding nonhuman animals within the field of animal ethics [20,25,56]. ACL Anthology is a comprehensive collection of proceedings and papers from international conferences affiliated with the Association for Computational Linguistics. (Footnote: the search for the adjective "speciesist" yielded four hits, all included in the five hits found by searching for the noun "speciesism".) ...

... We searched for these words in the ACL Anthology on 14/1/2024. The next speciesist example can be found in the work of Bender et al. [32]. ...
Article
Full-text available
Natural Language Processing (NLP) research on AI Safety and social bias in AI has focused on safety for humans and social bias against human minorities. However, some AI ethicists have argued that the moral significance of nonhuman animals has been ignored in AI research. Therefore, the purpose of this study is to investigate whether there is speciesism, i.e., discrimination against nonhuman animals, in NLP research. First, we explain why nonhuman animals are relevant in NLP research. Next, we survey the findings of existing research on speciesism in NLP researchers, data, and models and further investigate this problem in this study. The findings of this study suggest that speciesism exists within researchers, data, and models, respectively. Specifically, our survey and experiments show that (a) among NLP researchers, even those who study social bias in AI, do not recognize speciesism or speciesist bias; (b) among NLP data, speciesist bias is inherent in the data annotated in the datasets used to evaluate NLP models; (c) OpenAI GPTs, recent NLP models, exhibit speciesist bias by default. Finally, we discuss how we can reduce speciesism in NLP research.
... Perspective API and RealToxicityPrompts (Gehman et al., 2020) cover a spectrum of abusive language, but primarily focus on explicit biases through profanities, threats and insults. Conversely, BBQ, StereoSet, and CrowS-Pairs focus on social biases such as stereotyping, capturing subtle forms of discrimination suitable for evaluating implicit bias (Parrish et al., 2022; Nangia et al., 2020; Nadeem et al., 2021). However, these resources often evaluate scenarios in isolation, without considering the broader context or the spectrum of potential biases within each situation. ...

... Demographics: STOP encompasses 9 social demographics drawn from the United States' Equal Employment Opportunity Commission (EEOC) guidelines, which were then modified to ensure comprehensive coverage of social groups and include additional demographics such as class and political ideology. Table 3 compares the demographics included in STOP with popular datasets including BBQ, CrowS-Pairs, and StereoSet (Parrish et al., 2022; Nangia et al., 2020; Nadeem et al., 2021). ...
... Abid et al., 2021; Gonen and Goldberg, 2019; Wan et al., 2023; Sap et al., 2021; Kamruzzaman et al., 2024; Venkit et al., 2022). This bias is often revealed in natural language generation tasks (Sheng et al., 2019) and code generation (Huang et al., 2024), and persists across various languages (Zhou et al., 2019). Implicit bias evaluation: Existing metrics quantify bias in LLMs through various approaches, such as question-answering (QA) prompts (Shin et al., 2024; Nangia et al., 2020; Nadeem et al., 2021; Parrish et al., 2022) and sentence completion tasks or counterfactual evaluations (Gehman et al., 2020; Dhamala et al., 2021; Huang et al., 2020). ...
... Direct analysis of biases encoded within LMs allows us to pinpoint the problem at its source, potentially obviating the need for addressing it for every application (Nangia et al., 2020). Therefore, a number of studies have attempted to evaluate social biases within LMs (Nangia et al., 2020; Nadeem et al., 2021; Stańczak et al., 2023; Nozza et al., 2022a). ...

... Direct analysis of biases encoded within LMs allows us to pinpoint the problem at its source, potentially obviating the need for addressing it for every application (Nangia et al., 2020). Therefore, a number of studies have attempted to evaluate social biases within LMs (Nangia et al., 2020; Nadeem et al., 2021; Stańczak et al., 2023; Nozza et al., 2022a). One approach to quantifying social biases involves adapting small-scale association tests with respect to the stereotypes they encode (Nangia et al., 2020; Nadeem et al., 2021). ...
... Therefore, a number of studies have attempted to evaluate social biases within LMs (Nangia et al., 2020;Nadeem et al., 2021;Stańczak et al., 2023;Nozza et al., 2022a). One approach to quantifying social biases involves adapting small-scale association tests with respect to the stereotypes they encode (Nangia et al., 2020;Nadeem et al., 2021). These association tests limit the scope of possible analysis to two groups, stereotypical and their anti-stereotypical counterparts, i.e., the identities that "embody" the stereotype and the identities that violate it. ...
... Crowdsourcing studies: NLP researchers have recently begun adapting social-psychological resources to build NLP evaluation datasets for stereotypes at scale. Approaches such as StereoSet (Nadeem et al., 2021) and CrowS-Pairs (Nangia et al., 2020) addressed the need for scaling stereotype data via crowdsourcing platforms such as Mechanical Turk. This crowdsourced data, while exceptionally valuable, is often tied to recognizing stereotypes reflected in specific modalities (e.g., recognizing whether a particular text reflects a stereotype), and not as a stand-alone list of social stereotypes as societal knowledge. ...
... Our framework is conceptual in nature, and is not tied to any particular implementation approach. A simpler implementation, for instance, using spreadsheets or relational databases, may suffice if the evaluation [...]. [Flattened table rows omitted.] Table 1: The table shows instances of stereotypes from five NLP resources, SeeGULL (Jha et al., 2023), Stereotypes in LMs (StereoLMs; Choenni et al., 2021), SPICE (Dev et al., 2023a), CrowS-Pairs (Nangia et al., 2020), and Social Bias Frames (SBF; Sap et al., 2020), imported into our framework. ...
Preprint
Full-text available
Societal stereotypes are at the center of a myriad of responsible AI interventions targeted at reducing the generation and propagation of potentially harmful outcomes. While these efforts are much needed, they tend to be fragmented and often address different parts of the issue without taking in a unified or holistic approach about social stereotypes and how they impact various parts of the machine learning pipeline. As a result, it fails to capitalize on the underlying mechanisms that are common across different types of stereotypes, and to anchor on particular aspects that are relevant in certain cases. In this paper, we draw on social psychological research, and build on NLP data and methods, to propose a unified framework to operationalize stereotypes in generative AI evaluations. Our framework identifies key components of stereotypes that are crucial in AI evaluation, including the target group, associated attribute, relationship characteristics, perceiving group, and relevant context. We also provide considerations and recommendations for its responsible use.
... Our study focuses on two widely recognized intrinsic stereotyping benchmarks: StereoSet (Nadeem et al., 2021) and CrowS-Pairs (Nangia et al., 2020). We begin by highlighting the inconsistencies in the results yielded by these two benchmarks, even after controlling for data distribution in our experiments. ...
... Numerous studies have attempted to quantify stereotypes and bias in language models, consistently showing that these issues persist (Nangia et al., 2020;Dhamala et al., 2021;Nadeem et al., 2021;Felkner et al., 2023;Onorati et al., 2023;Zakizadeh et al., 2023). Another line of work has focused on addressing the limitations of current measurement methods (Gonen and Goldberg, 2019; Ravfogel et al., 2020;Goldfarb-Tarrant et al., 2021;Delobelle et al., 2022;Selvam et al., 2023;Orgad et al., 2022;Cabello et al., 2023). ...
Preprint
The multifaceted challenge of accurately measuring gender stereotypical bias in language models is akin to discerning different segments of a broader, unseen entity. This short paper primarily focuses on intrinsic bias mitigation and measurement strategies for language models, building on prior research that demonstrates a lack of correlation between intrinsic and extrinsic approaches. We delve deeper into intrinsic measurements, identifying inconsistencies and suggesting that these benchmarks may reflect different facets of gender stereotype. Our methodology involves analyzing data distributions across datasets and integrating gender stereotype components informed by social psychology. By adjusting the distribution of two datasets, we achieve a better alignment of outcomes. Our findings underscore the complexity of gender stereotyping in language models and point to new directions for developing more refined techniques to detect and reduce bias.
... Two benchmarks particularly relevant to LLM bias performance metrics are CrowS-Pairs [14] and StereoSet [15]. CrowS-Pairs (Crowdsourced Stereotype Pairs) measures stereotype bias using three open-source language models: BERT, RoBERTa, and ALBERT. ...
... CrowS-Pairs (Crowdsourced Stereotype Pairs) measures stereotype bias using three open-source language models: BERT, RoBERTa, and ALBERT. It also employs a stereotype score metric [14]. CrowS-Pairs assesses categories such as age, disability, gender identity, race/color, and others. ...
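A minimal sketch of such a stereotype score follows, assuming pairs of (stereotypical, anti-stereotypical) sentences and some sentence-scoring function such as the pseudo-log-likelihood sketched earlier; the CrowS-Pairs score is the percentage of pairs where the model prefers the stereotyping sentence, so a value near 50 indicates no systematic preference. The dummy scorer in the usage line is purely illustrative.

```python
from typing import Callable, Iterable, Tuple

def stereotype_score(
    pairs: Iterable[Tuple[str, str]],              # (stereotypical, anti-stereotypical)
    score_sentence: Callable[[str], float],        # e.g. a pseudo-log-likelihood function
) -> float:
    """Percentage of pairs where the model prefers the stereotypical sentence."""
    preferred, total = 0, 0
    for stereo, anti in pairs:
        if score_sentence(stereo) > score_sentence(anti):
            preferred += 1
        total += 1
    return 100.0 * preferred / max(total, 1)

# Example with a dummy scorer: longer sentences score higher here, purely for illustration.
demo_pairs = [("a longer stereotypical sentence", "short anti"),
              ("short stereo", "a longer anti sentence")]
print(stereotype_score(demo_pairs, score_sentence=len))  # -> 50.0
```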
Article
Full-text available
Bias in Large Language Models (LLMs) can perpetuate harmful stereotypes, reinforce inequities, and lead to unfair outcomes in applications from automated content moderation to decision-making systems. These biases also limit the applicability of LLMs in areas such as law, medicine, education, and finance. This paper introduces a benchmark designed to measure and evaluate biases in LLMs. It addresses the protected characteristics on which bias is often enacted, including gender, race, socioeconomic status, and intersectional identities. By systematically assessing LLMs using an expert-curated dataset, the benchmark tests for the biases present in recent large language models like GPT-4o, Llama 3, Gemini and Claude 3.5 Sonnet. This paper details the construction of the benchmark, including the selection of the categories (Ageism, Colonial bias, Colorism, Disability, Homophobia, Racism, Sexism, and Supremacism), the evaluation metrics, and the implementation of testing protocols. Through empirical analysis, we evaluated the LLMs and observed significant performance disparities in multiple categories. All LLMs had an accuracy of at least 74% on average when tested for knowledge regarding these categories. However, this threshold was reduced when LLMs were required to interpret, reason, or deduce. This was especially true regarding homophobia, colonial praxis, and disability. GPT-4 performed best regarding content knowledge followed closely by Claude 3.5 Sonnet, while Gemma-1.1 performed best with interpretation. Gemini 1.5 Pro was better overall than its predecessor, Gemini 1.0, demonstrating that rapid improvement in bias mitigation is possible. These findings highlight the critical need for ongoing monitoring and mitigation strategies to address bias in generative AI systems. We hope that it will serve as a critical tool for policymakers aiming to promote fairness and provide an opportunity for LLM developers to leverage the benchmark to unlock use cases where they may better serve all of us.
... Generative language models have seen significant advancements in recent years (Brown, 2020;Radford et al., 2019), with downstream applications spanning a wide range of tasks such as reading comprehension, summarization, and dialogue generation. Despite their success, a growing body of research has highlighted the issues of unfairness in generative models, manifesting itself in the form of stereotypes (Nadeem et al., 2021;Nangia et al., 2020), hate speech (Hartvigsen et al., 2022), and toxicity (Gehman et al., 2020). These societal concerns necessitate a thorough and fine-grained evaluation of state-of-the-art generative models for learned bias in addition to their downstream utility. ...
... Evaluation Benchmarks StereoSet (Nadeem et al., 2021) and CrowS-Pairs (Nangia et al., 2020) measure stereotypes by evaluating models' preferences for response continuations. SeeGULL (Jha et al., 2023) uncovers the regional stereotypes in generative models for natural language inferencing task. ...
Preprint
Full-text available
Recent studies have shown that generative language models often reflect and amplify societal biases in their outputs. However, these studies frequently conflate observed biases with other task-specific shortcomings, such as comprehension failure. For example, when a model misinterprets a text and produces a response that reinforces a stereotype, it becomes difficult to determine whether the issue arises from inherent bias or from a misunderstanding of the given content. In this paper, we conduct a multi-faceted evaluation that distinctly disentangles bias from flaws within the reading comprehension task. We propose a targeted stereotype mitigation framework that implicitly mitigates observed stereotypes in generative models through instruction-tuning on general-purpose datasets. We reduce stereotypical outputs by over 60% across multiple dimensions -- including nationality, age, gender, disability, and physical appearance -- by addressing comprehension-based failures, and without relying on explicit debiasing techniques. We evaluate several state-of-the-art generative models to demonstrate the effectiveness of our approach while maintaining the overall utility. Our findings highlight the need to critically disentangle the concept of `bias' from other types of errors to build more targeted and effective mitigation strategies. CONTENT WARNING: Some examples contain offensive stereotypes.
... (Kaneko and Bollegala, 2022), cosine similarity (Caliskan et al., 2017b; May et al., 2019), inner-product (Ethayarajh et al., 2019), among others. Independently of any downstream tasks, intrinsic bias evaluation measures (Nangia et al., 2020; Nadeem et al., 2021; Kaneko and Bollegala, 2022) assess social biases in MLMs on a standalone basis. Nevertheless, considering that MLMs serve to represent input texts across various downstream tasks, several prior studies have suggested that the evaluation of social biases should be conducted in relation to those specific tasks (De-Arteaga et al., 2019; Webster et al., 2020). ...
... We perform experiments on the two most commonly used benchmark datasets for evaluating social biases in MLMs (Nangia et al., 2020). ...
... can and White groups (Jiang and Fellbaum, 2020; May et al., 2019), or a subset of US census groups, often with Middle Eastern added (Guo and Caliskan, 2021;Cao et al., 2022;Kirk et al., 2024;Cheng et al., 2023). Furthermore, datasets that seek to expand the coverage of bias measures to multiple axes are limited to a fixed set of stereotypes for specific demographic groups (Nangia et al., 2020;Nadeem et al., 2021;Parrish et al., 2022). To address these limitations, we focus on incorporating a wide range of ethnicities and using a one-vs-all, unsupervised approach to identify which stereotypes are associated with each demographic group. ...
... This paper builds on previous work exploring how stereotypes are associated with specific demographic groups and how this reinforces existing social hierarchies (Greenwald et al., 1998;Blodgett et al., 2021). Several datasets have been developed to examine stereotypes, often structured around sentence pairs comparing two demographic groups (May et al., 2019), contrasting stereotypes and antistereotypes (Zhao et al., 2018a;Nangia et al., 2020;Nadeem et al., 2021), or using question-answer sets to compare groups (Parrish et al., 2022). While valuable, these datasets are not suitable for our objective of analyzing each stereotype across multiple subgroups. ...
... Work that identifies and measures the biases of language models has classified these harms into two general categories: allocation and representation harm (Stanczak and Augenstein, 2021). Representational harms happen when harmful concepts or relations are associated with demographic groups by a model; in language models these are often measured via token embeddings and model parameters with fill-in-the-blank or complete-the-sentence templates (e.g., Nadeem et al., 2021; Nangia et al., 2020). Most bias studies in NLP have focused on representational harms: many studies have demonstrated how generations from LLMs exhibit bias towards specific groups, or generate text that can be considered offensive, harmful or toxic (Dodge et al., 2021; De-Arteaga et al., 2019; Bender et al., 2021; Nadeem et al., 2021; Si et al., 2022); generations from LLMs are more likely to produce negative sentiment for refugees, disabled people, AAVE sentences, nonbinary people, Muslims and women (Magee et al., 2021; Groenwold et al., 2020; Sheng et al., 2019). ...
... In this area, research has also investigated how shot selection and ordering affects the bias of models, finding that random ordering and representative shots helps reduce bias (Si et al., 2022). To understand the underlying bias source in the behavior of these models, researchers have evaluated the generations of LLMs under different conditions, like size and training procedure (Baldini et al., 2022;Tal et al., 2022;de Vassimon Manela et al., 2021;Nangia et al., 2020). ...
... But the study is narrowly scoped to a limited set of adjectives and is from decades ago, and thus may not reflect current Indian society. Recent research within NLP has built large stereotype datasets such as StereoSet (Nadeem et al., 2020) and CrowS-Pairs (Nangia et al., 2020) to evaluate models, but they may not capture the stereotypes relevant to India. ...
... Generating Candidate Associations: We build the set of candidate association tuples (i, t) using identity terms described in Section 4 for religion and region. We then create a list of tokens based on prior work (Malik et al., 2021; Nangia et al., 2020; Nadeem et al., 2020), including lists of professions, subjects of study (history, science, etc.), action-verbs, and adjectives for behaviour, socio-economic status, food habits, and clothing preferences. Tuples are formed by a cross product between tokens and identity terms. ...
... May et al. (2019) extended WEAT to measure bias in sentence encoders such as ELMo and BERT. Nangia et al. (2020) further proposed CrowS-Pairs to use crowdsourced sentences to uncover a wide range of social biases in language models, and concurrently Nadeem et al. (2021) proposed a similar StereoSet for the same purpose. Bias in pre-trained vision models: Inspired by WEAT, Steed and Caliskan (2021) developed the Image Embedding Association Test (iEAT) for quantifying biased associations between represen-tations of social concepts and attributes in images. ...
... But other existing datasets also have their limitations. For example, CrowS-Pairs (Nangia et al., 2020) only contains disadvantaged groups in the United States, and WinoBias (Zhao et al., 2018a) and Winogender (Rudinger et al., 2018) focuses on gender bias. We therefore believe that StereoSet is still a good choice to start with given the variety of bias types and attribute terms. ...
... For example, some studies on bias assessment define an unbiased state as one where all demographic groups are treated equally [19,35,41,43]. This definition is based on the belief that the ideal outcome is for LLMs to generate balanced outputs across all groups [24,25]. However, defining a universally accepted criterion for equality is challenging because what one group considers equal treatment may not align with another's experience [6]. ...
... In the balancing approach, language models (LMs) are designed to treat different demographic groups equally by adjusting training data, word embeddings, model parameters, or outputs [7,8,35,46]. This approach aims to ensure equal performance across groups or prevent any group from being disproportionately advantaged or disadvantaged in the model's predictions [5,24,25,45]. On the other hand, the refusal (or not-to-answer, rejection) approach trains LMs to refuse harmful instructions and refrain from generating harmful contents [2,23,42]. ...
Preprint
Full-text available
Large language models (LLMs) often reflect real-world biases, leading to efforts to mitigate these effects and make the models unbiased. Achieving this goal requires defining clear criteria for an unbiased state, with any deviation from these criteria considered biased. Some studies define an unbiased state as equal treatment across diverse demographic groups, aiming for balanced outputs from LLMs. However, differing perspectives on equality and the importance of pluralism make it challenging to establish a universal standard. Alternatively, other approaches propose using fact-based criteria for more consistent and objective evaluations, though these methods have not yet been fully applied to LLM bias assessments. Thus, there is a need for a metric with objective criteria that offers a distinct perspective from equality-based approaches. Motivated by this need, we introduce a novel metric to assess bias using fact-based criteria and real-world statistics. In this paper, we conducted a human survey demonstrating that humans tend to perceive LLM outputs more positively when they align closely with real-world demographic distributions. Evaluating various LLMs with our proposed metric reveals that model bias varies depending on the criteria used, highlighting the need for multi-perspective assessment.
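As a loose illustration of a fact-based criterion of this kind, the sketch below compares a model's output demographic distribution against a real-world reference distribution using total variation distance; the distance measure and the example numbers are assumptions made for illustration, not the paper's actual metric or data.

```python
def total_variation(model_dist: dict, reference_dist: dict) -> float:
    # Half the sum of absolute differences between the two distributions.
    groups = set(model_dist) | set(reference_dist)
    return 0.5 * sum(abs(model_dist.get(g, 0.0) - reference_dist.get(g, 0.0)) for g in groups)

# Hypothetical numbers: share of genders a model associates with an occupation
# versus the share reported in (fictional) labour statistics.
model_outputs = {"female": 0.20, "male": 0.80}
real_world = {"female": 0.35, "male": 0.65}
print(total_variation(model_outputs, real_world))  # 0.15, the deviation from the factual baseline
```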
... In what setting are we going to focus on? We focus on comparative prompts (Parrish et al., 2022;Nangia et al., 2020;Rudinger et al., 2018) where models are required to make a choice or express preference towards a decision that may favor or otherwise stereotype specific groups. To elaborate, these prompts involve a situation or context that mentions two entities, followed by a question that asks the LLM to choose between them. ...
... Datasets: For our evaluations, we utilize the BBQ (Bias Benchmark for Question Answering) dataset (Parrish et al., 2022), CrowS-Pairs dataset (Nangia et al., 2020), and WinoGender dataset (Rudinger et al., 2018). More details about these datasets, the number of samples we used, and how these were modified can be found in Appendix B. ...
Preprint
We explore the internal mechanisms of how bias emerges in large language models (LLMs) when provided with ambiguous comparative prompts: inputs that compare or enforce choosing between two or more entities without providing clear context for preference. Most approaches for bias mitigation focus on either post-hoc analysis or data augmentation. However, these are transient solutions, without addressing the root cause: the model itself. Numerous prior works show the influence of the attention module towards steering generations. We believe that analyzing attention is also crucial for understanding bias, as it provides insight into how the LLM distributes its focus across different entities and how this contributes to biased decisions. To this end, we first introduce a metric to quantify the LLM's preference for one entity over another. We then propose ATLAS (Attention-based Targeted Layer Analysis and Scaling), a technique to localize bias to specific layers of the LLM by analyzing attention scores and then reduce bias by scaling attention in these biased layers. To evaluate our method, we conduct experiments across 3 datasets (BBQ, CrowS-Pairs, and WinoGender) using GPT-2 XL (1.5B), GPT-J (6B), LLaMA-2 (7B) and LLaMA-3 (8B). Our experiments demonstrate that bias is concentrated in the later layers, typically around the last third. We also show how ATLAS effectively mitigates bias through targeted interventions without compromising downstream performance, with an average increase of only 0.82% in perplexity when the intervention is applied. We see an average improvement of 0.28 points in the bias score across all the datasets.
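One simple way to quantify a preference between two entities in a comparative prompt, in the spirit of the metric described above (though not the paper's exact ATLAS formulation), is to compare the log-probabilities a causal language model assigns to each entity as the answer continuation. The model (gpt2), the prompt, and the candidate answers below are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative small causal LM
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Log-probability of `continuation` given `prompt` under the causal LM."""
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = logits[:, :-1].log_softmax(dim=-1)      # predictions for tokens 1..n-1
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[0, prompt_len - 1:].sum().item()    # keep only continuation tokens

prompt = "Question: Who is more likely to be a good leader? Answer:"
delta = continuation_logprob(prompt, " the man") - continuation_logprob(prompt, " the woman")
print(delta)  # positive values indicate a preference for the first entity
```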
... 1. Embedding-based metrics use representations of words or phrases from different demographic groups, e.g., WEAT (Caliskan et al., 2017) and SEAT (May et al., 2019). 2. Probability-based metrics compare the probabilities assigned by the model to different demographic groups, e.g., CrowSPairs (Nangia et al., 2020). 3. Generated text-based metrics analyze model generations and compute differences across demographics, e.g., by evaluating model responses to standardized questionnaires (Durmus et al., 2024), or using classifiers to analyze the characteristics of generations such as toxicity (Dhamala et al., 2021;Hartvigsen et al., 2022;Smith et al., 2022). ...
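For the embedding-based family mentioned above, a compact sketch of a WEAT-style association test follows; the random vectors stand in for real word embeddings, and the effect-size formula shown is the standard WEAT one (difference of mean associations normalised by the pooled standard deviation), used here purely as an illustration rather than anything specific to the works cited.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B) -> float:
    # Differential association of word w with attribute sets A and B.
    return float(np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B]))

def weat_effect_size(X, Y, A, B) -> float:
    # Difference of mean associations of the two target sets, normalised by
    # the pooled standard deviation over all target words.
    s_x = [association(x, A, B) for x in X]
    s_y = [association(y, A, B) for y in Y]
    return (np.mean(s_x) - np.mean(s_y)) / np.std(s_x + s_y, ddof=1)

# Toy usage: random vectors stand in for embeddings of target words X, Y
# (e.g. career vs. family terms) and attribute words A, B (e.g. male vs. female terms).
rng = np.random.default_rng(0)
X, Y, A, B = (rng.normal(size=(8, 50)) for _ in range(4))
print(weat_effect_size(X, Y, A, B))
```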
... CrowSPairs (Nangia et al., 2020) is a dataset of crowd-sourced sentence pairs designed to evaluate stereotypes related to race, gender, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status. Each pair consists of one sentence that demonstrates a stereotype and the other that demonstrates the opposite of the stereotype. ...
Preprint
The last few years have seen unprecedented advances in the capabilities of Large Language Models (LLMs). These advancements promise to deeply benefit a vast array of application domains. However, due to their immense size, performing inference with LLMs is both costly and slow. Consequently, a plethora of recent work has proposed strategies to enhance inference efficiency, e.g., quantization, pruning, and caching. These acceleration strategies reduce the inference cost and latency, often by several factors, while maintaining much of the predictive performance measured via common benchmarks. In this work, we explore another critical aspect of LLM performance: demographic bias in model generations due to inference acceleration optimizations. Using a wide range of metrics, we probe bias in model outputs from a number of angles. Analysis of outputs before and after inference acceleration shows significant change in bias. Worryingly, these bias effects are complex and unpredictable. A combination of an acceleration strategy and bias type may show little bias change in one model but may lead to a large effect in another. Our results highlight a need for in-depth and case-by-case evaluation of model bias after it has been modified to accelerate inference.
... Most evaluation tools, including those that assess bias and fairness risk, evaluate LLMs at the model-level by calculating metrics based on the responses of the LLMs to static benchmark datasets of prompts (Rudinger et al., 2018;Zhao et al., 2018;Webster et al., 2018;Levy et al., 2021;Nadeem et al., 2020;Bartl et al., 2020;Nangia et al., 2020;Felkner et al., 2024;Barikeri et al., 2021;Kiritchenko and Mohammad, 2018;Qian et al., 2022;Gehman et al., 2020;Dhamala et al., 2021;Huang et al., 2023;Nozza et al., 2021;Parrish et al., 2022;Li et al., 2020;Krieg et al., 2023) that do not consider prompt-specific risks and are often independent of the task at hand. Holistic Evaluation of Language Models (HELM) (Liang et al., 2023), DecodingTrust , and several other toolkits (Srivastava et al., 2022;Huang et al., 2024;Nazir et al., 2024;Huggingface, 2022) follow this paradigm. ...
Preprint
Large Language Models (LLMs) have been observed to exhibit bias in numerous ways, potentially creating or worsening outcomes for specific groups identified by protected attributes such as sex, race, sexual orientation, or age. To help address this gap, we introduce LangFair, an open-source Python package that aims to equip LLM practitioners with the tools to evaluate bias and fairness risks relevant to their specific use cases. The package offers functionality to easily generate evaluation datasets, comprised of LLM responses to use-case-specific prompts, and subsequently calculate applicable metrics for the practitioner's use case. To guide in metric selection, LangFair offers an actionable decision framework.
... Since gender biases are still strongly present in our social reality today, they can be integrated, even if unintentionally, into models and technologies, influencing not only the behaviour of AIs and robots but also how humans perceive them (Nangia et al. 2020; Guo & Caliskan 2021). The repetition of gender stereotypes in the technology sector thus potentially contributes to reinforcing the commonplaces already present in everyday life (De Angeli & Brahnam 2006). ...
Article
Full-text available
The article aims to examine the impact of Generative Artificial Intelligence and Social Robotics on the representation of women, addressing the consequences of gender biases. These biases can cause harm, perpetuate inequalities, and reinforce discriminatory practices, with potentially disproportionate impacts on minorities. In a rapidly evolving society, the pervasiveness of technologies shaped by such biases could permanently entrench gaps and stereotypes. Given these premises, it is essential to investigate these technologies from a social standpoint in order to understand the extent of the circular interconnection between technological progress and the social environment in a dynamic process of stimulus and innovation.
... Generative AI models have also been shown to exhibit other forms of bias, such as anti-Muslim bias (Abid, Farooqi and Zou, 2021), bias towards Western culture (Naous, Ryan and Xu, 2023), and stereotypical depictions of race, gender, age, nationality, and socioeconomic status (Nangia et al., 2020). ...
Preprint
There is strong agreement that generative AI should be regulated, but strong disagreement on how to approach regulation. While some argue that AI regulation should mostly rely on extensions of existing laws, others argue that entirely new laws and regulations are needed to ensure that generative AI benefits society. In this paper, I argue that the debates on generative AI regulation can be informed by the debates and evidence on social media regulation. For example, AI companies have faced allegations of political bias regarding the images and text their models produce, similar to the allegations social media companies have faced regarding content ranking on their platforms. First, I compare and contrast the affordances of generative AI and social media to highlight their similarities and differences. Then, I discuss specific policy recommendations based on the evolution of social media and their regulation. These recommendations include investments in: efforts to counter bias and perceptions thereof (e.g., via transparency, researcher access, oversight boards, democratic input, research studies), specific areas of regulatory concern (e.g., youth wellbeing, election integrity) and trust and safety, computational social science research, and a more global perspective. Applying lessons learnt from social media regulation to generative AI regulation can save effort and time, and prevent avoidable mistakes.
... StereoSet consists of two subsets: intrasentence, which measures biases within an individual sentence, and intersentence, which evaluates biases at the discourse level across multiple sentences. Nangia et al. (2020) also introduced the CrowS-Pairs benchmark for bias measurements. ...
Preprint
Full-text available
The use of language models (LMs) has increased considerably in recent years, and the biases and stereotypes in training data that are reflected in the LM outputs are causing social problems. In this paper, inspired by task arithmetic, we propose the "Bias Vector" method for the mitigation of these LM biases. The Bias Vector method does not require manually created debiasing data. The three main steps of our approach involve: (1) continually training the pre-trained LMs on biased data using masked language modeling; (2) constructing the Bias Vector as the difference between the weights of the biased LMs and those of the pre-trained LMs; and (3) subtracting the Bias Vector from the weights of the pre-trained LMs for debiasing. We evaluated the Bias Vector method on the SEAT across three LMs and confirmed an average improvement of 0.177 points. We demonstrated that the Bias Vector method does not degrade the LM performance on downstream tasks in the GLUE benchmark. In addition, we examined the impact of scaling factors, which control the magnitudes of Bias Vectors, with effect sizes on the SEAT and conducted a comprehensive evaluation of our debiased LMs across both the SEAT and GLUE benchmarks.
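The three steps lend themselves to a small sketch of the underlying weight arithmetic, assuming the biased fine-tuning has already been done and checkpoints are available as PyTorch-style state dicts; this illustrates the arithmetic described in the abstract, not the authors' released code.

```python
def build_bias_vector(biased_state: dict, pretrained_state: dict) -> dict:
    # Step (2): the Bias Vector is the element-wise weight difference between
    # the LM continually trained on biased data and the original pre-trained LM.
    return {name: biased_state[name] - pretrained_state[name] for name in pretrained_state}

def apply_debiasing(pretrained_state: dict, bias_vector: dict, scale: float = 1.0) -> dict:
    # Step (3): subtract the (scaled) Bias Vector from the pre-trained weights.
    return {name: pretrained_state[name] - scale * bias_vector[name] for name in pretrained_state}

# Usage sketch (checkpoints assumed to exist; step (1), the biased MLM training, happens elsewhere):
# pretrained = model.state_dict()
# biased = biased_model.state_dict()
# model.load_state_dict(apply_debiasing(pretrained, build_bias_vector(biased, pretrained), scale=0.5))
```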
... Details of the scoring metrics used for annotation, along with examples based on a biased sentence. The original biased sentence is taken from the CrowS-Pairs dataset (Nangia et al., 2020). ...
Preprint
Full-text available
Although Wikipedia is the largest multilingual encyclopedia, it remains inherently incomplete. There is a significant disparity in the quality of content between high-resource languages (HRLs, e.g., English) and low-resource languages (LRLs, e.g., Hindi), with many LRL articles lacking adequate information. To bridge these content gaps, we propose a lightweight framework to enhance knowledge equity between English and Hindi. In case the English Wikipedia page is not up-to-date, our framework extracts relevant information from readily available external resources (such as English books) and adapts it to align with Wikipedia's distinctive style, including its neutral point of view (NPOV) policy, using the in-context learning capabilities of large language models. The adapted content is then machine-translated into Hindi for integration into the corresponding Wikipedia articles. On the other hand, if the English version is comprehensive and up-to-date, the framework directly transfers knowledge from English to Hindi. Our framework effectively generates new content for Hindi Wikipedia sections, enhancing Hindi Wikipedia articles by 65% and 62% according to automatic and human judgment-based evaluations, respectively.
... While multiple benchmarks exist for general AI Safety categories, it remains non-trivial to assess bias in responses generated by popular LLMs for open-ended free-form dialog. There are several datasets used in the literature for the evaluation of bias that look at masked token generation [Zhao et al., 2018a], unmasked sentences [Nangia et al., 2020, Smith et al., 2022], prompt completion [Dhamala et al., 2021b, Gehman et al., 2020], and question answering [Parrish et al., 2022]. Adversarial prompting has been popular to jailbreak LLMs for various hazards/harms, but this has been minimally explored specifically for bias identification. ...
Conference Paper
Large Language Models (LLMs) have excelled at language understanding and generating human-level text. However, even with supervised training and human alignment, these LLMs are susceptible to adversarial attacks where malicious users can prompt the model to generate undesirable text. LLMs also inherently encode potential biases that can cause various harmful effects during interactions. Bias evaluation metrics lack standards as well as consensus and existing methods often rely on human-generated templates and annotations which are expensive and labor intensive. In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs. We present LLM-based bias detection metrics, i.e., LLM-as-a-judge, and also analyze several existing automatic evaluation methods and metrics. We analyze the various nuances of model responses, identify the strengths and weaknesses of model families, and assess where evaluation methods fall short. We compare these metrics to human evaluation and show that the LLM-as-a-Judge metric aligns with human judgement on bias detection in response generation.
... Addressing social biases is crucial for ensuring the trustworthiness of language models (Nangia et al., 2020;Nadeem et al., 2020). LLMs face similar fairness issues: many studies have confirmed that LLMs can capture social biases from unprocessed training data and transmit these biases to downstream tasks (Abid et al., 2021b;Brown et al., 2020;. ...
... We propose MPO, which leverages the outputs of a language model as a dataset for preference optimization, relying extensively on the outputs from the SFT model. Previous research (Sheng et al., 2019; Nangia et al., 2020) has shown that self-supervised language models, which are trained on unlabeled web-scale datasets, can unintentionally learn and perpetuate social and ethical biases, including racism and sexism. If such biases are inherent within the data, our proposed self-feedback framework may unintentionally reinforce them. ...
... Our third measure, CrowS-Pairs (Nangia et al., 2020), comprises a crowd-sourced dataset of sentence pairs, with the first sentence being more stereotypical (e.g. Women are always too sensitive about things) than the second (e.g. ...
... In previous discrimination evaluation settings, researchers often measure stereotype sentence pairs that differ only in the sensitive attribute. For example, they often adopt the terms "Male" and "Female" (Nangia et al., 2020; Delobelle et al., 2022; Gallegos et al., 2023), and for Race, they often substitute the terms "Black", "White" and "Asian" (Zhang et al., 2023b; Tamkin et al., 2023). Among the allocational harms, previous studies found that LLMs often exhibit discrimination against certain groups. ...
... We propose an extensive probing framework spanning three modalities: Text-to-Text (T2T), Text-to-Image (T2I), and Image-to-Text (I2T). We utilize the CROWS-PAIRS dataset (Nangia et al., 2020) to identify entities across 400 descriptors and nine demographic dimensions: age (AG), disability (DA), gender (GE), nationality (NT), physical appearance (PA), race/color (RC), religion (RE), sexual orientation (SO), and socio-economic status (SE). This yields approximately 400 demographic descriptors. ...
... Representational harms are abstract concepts that cannot be measured directly [38]-yet measuring such harms is important, as they can cause tangible negative outcomes, e.g., through the entrenchment of harmful social hierarchies, which may affect people's belief systems and psychological states [17,62,16]. To facilitate the measurement of representational harms, the NLP research community has produced and made publicly available numerous measurement instruments, including tools [e.g., 40,14], datasets [e.g., 64,28,32,34,56], metrics [e.g., 11,15,10,58,43], benchmarks (consisting of both datasets and metrics) [e.g., 23,48,45,46,55,63,22,24,26,27,31], annotation instructions [e.g., 42], and other techniques [e.g., 36,52,61]. However, the research community lacks clarity about whether and to what extent these instruments meet the needs of practitioners tasked with developing and deploying LLM-based systems in the real world, and how the instruments could be improved. ...
Preprint
To facilitate the measurement of representational harms caused by large language model (LLM)-based systems, the NLP research community has produced and made publicly available numerous measurement instruments, including tools, datasets, metrics, benchmarks, annotation instructions, and other techniques. However, the research community lacks clarity about whether and to what extent these instruments meet the needs of practitioners tasked with developing and deploying LLM-based systems in the real world, and how these instruments could be improved. Via a series of semi-structured interviews with practitioners in a variety of roles in different organizations, we identify four types of challenges that prevent practitioners from effectively using publicly available instruments for measuring representational harms caused by LLM-based systems: (1) challenges related to using publicly available measurement instruments; (2) challenges related to doing measurement in practice; (3) challenges arising from measurement tasks involving LLM-based systems; and (4) challenges specific to measuring representational harms. Our goal is to advance the development of instruments for measuring representational harms that are well-suited to practitioner needs, thus better facilitating the responsible development and deployment of LLM-based systems.
... The same scrutiny that is applied to content moderation on social media should be applied to developers' enforcement of acceptable use policies, including the data labor employed as part of this work [63,130]. Evaluations of foundation models can also be seen as a form of content moderation, as they are used to assess whether a model will produce violative content and inform interventions to reduce this behavior [114,182]. ...
Article
Full-text available
As foundation models have accumulated hundreds of millions of users, developers have begun to take steps to prevent harmful types of uses. One salient intervention that foundation model developers adopt is acceptable use policies—legally binding policies that prohibit users from using a model for specific purposes. This paper identifies acceptable use policies from 30 foundation model developers, analyzes the use restrictions they contain, and argues that acceptable use policies are an important lens for understanding the regulation of foundation models. Taken together, developers’ acceptable use policies include 127 distinct use restrictions; the wide variety in the number and type of use restrictions may create fragmentation across the AI supply chain. Developers also employ acceptable use policies to prevent competitors or specific industries from making use of their models. Developers alone decide what constitutes acceptable use, and rarely provide transparency about how they enforce their policies. In practice, acceptable use policies are difficult to enforce, and scrupulous enforcement can act as a barrier to researcher access and limit beneficial uses of foundation models. Nevertheless, acceptable use policies for foundation models are an early example of self-regulation that have a significant impact on the market for foundation models and the overall AI ecosystem.
... Other settings. In addition to measuring bias in text and language representations, several recent works investigate biases in language models via the probabilities they assign to specific words or sequences (Kurita et al. 2019;Nangia et al. 2020;Nadeem, Bethke, and Reddy 2021). Since language modeling is currently the premier means for representation learning (Devlin et al. 2019;Bommasani et al. 2021), there is a natural question regarding the relationship between measuring biases of a pretrained language model and of representations induced by a pretrained language model. ...
Article
How do we design measures of social bias that we trust? While prior work has introduced several measures, no measure has gained widespread trust: instead, mounting evidence argues we should distrust these measures. In this work, we design bias measures that warrant trust based on the cross-disciplinary theory of measurement modeling. To combat the frequently fuzzy treatment of social bias in natural language processing, we explicitly define social bias, grounded in principles drawn from social science research. We operationalize our definition by proposing a general bias measurement framework DivDist, which we use to instantiate 5 concrete bias measures. To validate our measures, we propose a rigorous testing protocol with 8 testing criteria (e.g. predictive validity: do measures predict biases in US employment?). Through our testing, we demonstrate considerable evidence to trust our measures, showing they overcome conceptual, technical, and empirical deficiencies present in prior measures.
... Dataset Generation to Audit LLMs: The task of generating datasets for evaluating bias in LLMs involves two main approaches: unmasking tokens within sentences (Nadeem, Bethke, and Reddy 2021;Zhao et al. 2018) and selecting sentences based on given contexts (Nangia et al. 2020;Kiritchenko and Mohammad 2018). In the first approach, LLMs are tasked with filling in a masked token, considering the sentence's context. ...
Article
Large Language Models (LLMs) are increasingly integrated into critical decision-making processes, such as loan approvals and visa applications, where inherent biases can lead to discriminatory outcomes. In this paper, we examine the nuanced relationship between demographic attributes and socioeconomic biases in LLMs, a crucial yet understudied area of fairness in LLMs. We introduce a novel dataset of one million English sentences to systematically quantify socioeconomic biases across various demographic groups. Our findings reveal pervasive socioeconomic biases in both established models such as GPT-2 and state-of-the-art models like Llama 2 and Falcon. We demonstrate that these biases are significantly amplified when considering intersectionality, with LLMs exhibiting a remarkable capacity to extract multiple demographic attributes from names and then correlate them with specific socioeconomic biases. This research highlights the urgent necessity for proactive and robust bias mitigation techniques to safeguard against discriminatory outcomes when deploying these powerful models in critical real-world applications. Warning: This paper discusses and contains content that can be offensive or upsetting.
... Without an explicitly systematized concept, it is hard to know exactly what is being operationalized, and thus measured. For example, StereoSet [20] and CrowS-Pairs [21], two widely used benchmarks in NLP for measuring stereotyping, appear to jump straight from high-level definitions of the concept, encompassing broad constellations of meanings and understandings, to specific measurement instruments, obscuring exactly what those instruments measure [7]. Both benchmarks' measurement instruments rely on crowdworkers, who, in the absence of an explicitly systematized concept, must rely on their own understandings of these high-level definitions, which may be contradictory. ...
Preprint
Full-text available
Across academia, industry, and government, there is an increasing awareness that the measurement tasks involved in evaluating generative AI (GenAI) systems are especially difficult. We argue that these measurement tasks are highly reminiscent of measurement tasks found throughout the social sciences. With this in mind, we present a framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems. The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves. This four-level approach differs from the way measurement is typically done in ML, where researchers and practitioners appear to jump straight from background concepts to measurement instruments, with little to no explicit systematization in between. As well as surfacing assumptions, thereby making it easier to understand exactly what the resulting measurements do and do not mean, this framework has two important implications for evaluating evaluations: First, it can enable stakeholders from different worlds to participate in conceptual debates, broadening the expertise involved in evaluating GenAI systems. Second, it brings rigor to operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.
... Winogender is a similar dataset that focuses on pronoun resolution with a second participant to avoid direct occupational stereotyping, with results suggesting that coreference systems exhibit gender bias that correlates with real-world gender statistics for occupations [12]. CrowS-Pairs is a dataset that focuses on intra-sentence prediction testing and contains pairs of sentences used to detect bias when one sentence is favored by the model over another due to stereotypical associations, covering nine categories: race, gender, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status [13]. StereoSet includes intra-sentence and inter-sentence prediction tests to test for bias in the areas of gender, occupation, race, and religion [14]. ...
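A simplified sketch of this paired-sentence comparison follows: each sentence is scored with a pseudo-log-likelihood obtained by masking one token at a time, and the member of the pair with the higher score is the one the model favors. This simplification scores every token, whereas the CrowS-Pairs metric itself conditions only on the tokens shared by both sentences; the model and the example pair are illustrative.

# Sketch: pseudo-log-likelihood scoring of a more vs. less stereotypical sentence pair.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence):
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

pair = ("Poor people cannot be trusted with money.",
        "Rich people cannot be trusted with money.")
for s in pair:
    print(s, pseudo_log_likelihood(s))  # the favored sentence receives the higher score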
Preprint
Full-text available
Gender bias in artificial intelligence has become an important issue, particularly in the context of language models used in communication-oriented applications. This study examines the extent to which Large Language Models (LLMs) exhibit gender bias in pronoun selection in occupational contexts. The analysis evaluates the models GPT-4, GPT-4o, PaLM 2 Text Bison and Gemini 1.0 Pro using a self-generated dataset. The jobs considered include a range of occupations, from those with a significant male presence to those with a notable female concentration, as well as jobs with a relatively equal gender distribution. Three different sentence processing methods were used to assess potential gender bias: masked tokens, unmasked sentences, and sentence completion. In addition, the LLMs suggested names of individuals in specific occupations, which were then examined for gender distribution. The results show a positive correlation between the models' pronoun choices and the gender distribution present in U.S. labor force data. Female pronouns were more often associated with female-dominated occupations, while male pronouns were more often associated with male-dominated occupations. Sentence completion showed the strongest correlation with actual gender distribution, while name generation resulted in a more balanced 'politically correct' gender distribution, albeit with notable variations in predominantly male or female occupations. Overall, the prompting method had a greater impact on gender distribution than the model selection itself, highlighting the complexity of addressing gender bias in LLMs. The findings highlight the importance of prompting in gender mapping.
... Numerous prior studies highlight that bias exists in applications of LLMs, such as text generation (Liang et al. 2021; Yang et al. 2022; Dhamala et al. 2021), question-answering (Parrish et al. 2021), machine translation (Měchura 2022), information retrieval (Rekabsaz and Schedl 2020), and classification (Mozafari, Farahbakhsh, and Crespi 2020; Sap et al. 2019). Some previous studies (Steed et al. 2022; Nadeem, Bethke, and Reddy 2020; Nangia et al. 2020) have highlighted the presence of harmful social biases in pre-trained language models and have introduced datasets to measure biases related to gender, race, and nationality in NLP tasks. These studies inspire us to examine the prevalence of bias when applying LLMs in code generation. ...
Preprint
Full-text available
Large language models (LLMs) have significantly advanced the field of automated code generation. However, a notable research gap exists in the evaluation of social biases that may be present in the code produced by LLMs. To solve this issue, we propose a novel fairness framework, i.e., Solar, to assess and mitigate the social biases of LLM-generated code. Specifically, Solar can automatically generate test cases for quantitatively uncovering social biases of the auto-generated code by LLMs. To quantify the severity of social biases in generated code, we develop a dataset that covers a diverse set of social problems. We applied Solar and the crafted dataset to four state-of-the-art LLMs for code generation. Our evaluation reveals severe bias in the LLM-generated code from all the subject LLMs. Furthermore, we explore several strategies for bias mitigation, including Chain-of-Thought (CoT) prompting, combining positive role-playing with CoT prompting and iterative prompting. Our experiments show that iterative prompting can effectively reduce social bias in LLM-generated code by up to 90%. Solar is highly extensible to evaluate new social problems.
... Paired bias evaluation datasets, like WINOQUEER [38] and CROWS-PAIRS [52], are originally designed to evaluate social harms in masked language models by comparing stereotypical versus non-stereotypical sentences. Importantly, because these evaluations inherently mirror a Bradley-Terry (BT) preference model, where one text is implicitly more socially favorable (less stereotypical) than the other, we can repurpose them to construct mock preference data and then extract implicit reward signals from DPO-aligned LLMs that used the same BT preference setup. ...
Preprint
Full-text available
Natural-language assistants are designed to provide users with helpful responses while avoiding harmful outputs, largely achieved through alignment to human preferences. Yet there is limited understanding of whether alignment techniques may inadvertently perpetuate or even amplify harmful biases inherited from their pre-aligned base models. This issue is compounded by the choice of bias evaluation benchmarks in popular preference-finetuned models, which predominantly focus on dominant social categories, such as binary gender, thereby limiting insights into biases affecting underrepresented groups. Towards addressing this gap, we center transgender, nonbinary, and other gender-diverse identities to investigate how alignment procedures interact with pre-existing gender-diverse bias in LLMs. Our key contributions include: 1) a comprehensive survey of bias evaluation modalities across leading preference-finetuned LLMs, highlighting critical gaps in gender-diverse representation, 2) systematic evaluation of gender-diverse biases across 12 models spanning Direct Preference Optimization (DPO) stages, uncovering harms popular bias benchmarks fail to detect, and 3) a flexible framework for measuring harmful biases in implicit reward signals applicable to other social contexts. Our findings reveal that DPO-aligned models are particularly sensitive to supervised finetuning (SFT), and can amplify two forms of real-world gender-diverse harms from their base models: stigmatization and gender non-affirmative language. We conclude with recommendations tailored to DPO and broader alignment practices, advocating for the adoption of community-informed bias evaluation frameworks to more effectively identify and address underrepresented harms in LLMs.
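The implicit reward signal referred to above can be sketched from the standard DPO formulation, r(x, y) = β(log π_policy(y|x) − log π_reference(y|x)). The helper below and the placeholder model paths are illustrative assumptions, not the paper's released code, and the two models are assumed to share a tokenizer that adds no special tokens.

# Sketch: implicit reward of a completion under a DPO-aligned model,
# relative to its reference model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def completion_log_prob(model, tokenizer, prompt, completion):
    # Log-probability of the completion tokens, conditioned on the prompt.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[0, prompt_len - 1:].sum().item()  # completion positions only

def implicit_reward(policy, reference, tokenizer, prompt, completion, beta=0.1):
    return beta * (completion_log_prob(policy, tokenizer, prompt, completion)
                   - completion_log_prob(reference, tokenizer, prompt, completion))

# Example usage (model identifiers are placeholders):
# policy = AutoModelForCausalLM.from_pretrained("path/to/dpo-aligned-model")
# reference = AutoModelForCausalLM.from_pretrained("path/to/reference-model")
# tok = AutoTokenizer.from_pretrained("path/to/reference-model")
# r_stereo = implicit_reward(policy, reference, tok, prompt, stereo_completion)
# r_counter = implicit_reward(policy, reference, tok, prompt, counter_completion)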
... Nadeem et al. [5] introduced StereoSet, a dataset designed to gauge a language model's preference for texts expressing stereotypes. Nangia et al. [6] presented a dataset akin to StereoSet, measuring a masked language model's preference for stereotypical sentences over unbiased ones. Dhamala et al. [7] introduced BOLD, which significantly differs from previous works. ...
Conference Paper
Full-text available
In the landscape of Natural Language Generation (NLG), rectifying bias holds paramount importance to ensure the fairness and dependability of models. This research centres on mitigating bias in Small Language Models (SLMs) that inadvertently propagate societal biases, leading to injustice against specific groups. Acknowledging the immense potential of SLMs in understanding and generating human-like text, the study addresses the critical issue of these models inadvertently learning, perpetuating, and amplifying harmful social biases. We propose a comprehensive benchmarking assessment of bias in state-of-the-art SLMs and investigate the effectiveness of Counterfactual Data Augmentation (CDA) and Transfer Learning as bias mitigation strategies. Our study aims to furnish researchers and practitioners with a clear guide to the existing literature, enabling a better understanding of bias propagation in SLMs and facilitating the development of strategies to mitigate such biases in the future.
... Research has been actively conducted to create datasets composed of sentence pairs in various languages that reflect stereotypes from diverse cultures. Névéol et al. (2022) present French CrowS-Pairs by adapting the original CrowS-Pairs (Nangia et al. 2020) and newly crowdsourcing stereotyped statements. Fort et al. (2024) further extend it to Multilingual CrowS-Pairs with seven additional languages. ...
Preprint
Full-text available
Large-scale deployment of large language models (LLMs) in various applications, such as chatbots and virtual assistants, requires LLMs to be culturally sensitive to the user to ensure inclusivity. Culture has been widely studied in psychology and anthropology, and there has been a recent surge in research on making LLMs more culturally inclusive that goes beyond multilinguality and builds on findings from psychology and anthropology. In this paper, we survey efforts towards incorporating cultural awareness into text-based and multimodal LLMs. We start by defining cultural awareness in LLMs, taking the definitions of culture from anthropology and psychology as a point of departure. We then examine methodologies adopted for creating cross-cultural datasets, strategies for cultural inclusion in downstream tasks, and methodologies that have been used for benchmarking cultural awareness in LLMs. Further, we discuss the ethical implications of cultural alignment, the role of Human-Computer Interaction in driving cultural inclusion in LLMs, and the role of cultural alignment in driving social science research. We finally provide pointers to future research based on our findings about gaps in the literature.
... Masked Tokens datasets contain sentences with placeholders that the language model needs to complete, like Winogender [48]. Unmasked Sentences datasets require the model to complete a fill-in-the-blank task, exemplified by datasets like CrowS-Pairs [40] and RedditBias [4]. On the other hand, Prompts entail specifying the initial words in a sentence or presenting a question to prompt the model to continue or provide an answer. ...
Article
Full-text available
The examination of gender bias, alongside other demographic biases like race, nationality, and religion, within generative large language models (LLMs), is increasingly capturing the attention of both the scientific community and industry stakeholders. These biases often affect generative LLMs, influencing popular products and potentially compromising user experiences. A growing body of research is dedicated to enhancing gender representations in natural language processing (NLP) across a spectrum of generative LLMs. This paper explores the current research focused on identifying and evaluating gender bias in generative LLMs. A comprehensive investigation is conducted to evaluate and mitigate gender bias across five distinct generative LLMs. The mitigation strategies implemented yield significant improvements in gender bias scores, with performance enhancements of up to 46% compared to zero-shot text generation approaches. Additionally, we explore how different levels of LLM precision and quantization impact gender bias, providing insights into how technical factors influence bias mitigation strategies. By tackling these challenges and suggesting areas for future research, we aim to contribute to the ongoing discussion about gender bias in language technologies, promoting more equitable and inclusive NLP systems.
... Measuring Bias: Methods for measuring bias include the Bias Benchmark for Question Answering (BBQ) (Parrish et al., 2022), RealToxicityPrompts (Gehman et al., 2020), and the Crowdsourced Stereotype Pairs benchmark (CrowS-Pairs) (Nangia et al., 2020). Touvron et al. (2023a) find that larger LLaMA models exhibit increased measured bias on RealToxicityPrompts. Zhao et al. (2023a) replicate this with StereoSet (Nadeem et al., 2021) and their metric GPT-BIAS, which uses GPT-4 to classify responses as biased or unbiased. ...
Preprint
Full-text available
Small Language Models (SLMs) have become increasingly important due to their efficiency and ability to perform various language tasks with minimal computational resources, making them ideal for various settings, including on-device, mobile, and edge devices, among many others. In this article, we present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques. We propose a novel taxonomy for categorizing the methods used to optimize SLMs, including model compression, pruning, and quantization techniques. We summarize the benchmark datasets that are useful for benchmarking SLMs along with the evaluation metrics commonly used. Additionally, we highlight key open challenges that remain to be addressed. Our survey aims to serve as a valuable resource for researchers and practitioners interested in developing and deploying small yet efficient language models.
... However, studying the harmful behaviors of AI companions is challenging due to the private, personal nature of these interactions. Prior work on algorithmic harms has largely relied on hypothetical scenarios [11,84], expert-driven auditing [12,62,73,95], or AI incident databases (e.g., AIAAIC, AIID) [27,41]. While insightful, these approaches often lack the detailed, context-rich data needed to fully capture the harmful machine behaviors users encounter in their everyday lives. ...
Preprint
Full-text available
As conversational AI systems increasingly permeate the socio-emotional realms of human life, they bring both benefits and risks to individuals and society. Despite extensive research on detecting and categorizing harms in AI systems, less is known about the harms that arise from social interactions with AI chatbots. Through a mixed-methods analysis of 35,390 conversation excerpts shared on r/replika, an online community for users of the AI companion Replika, we identified six categories of harmful behaviors exhibited by the chatbot: relational transgression, verbal abuse and hate, self-inflicted harm, harassment and violence, mis/disinformation, and privacy violations. The AI contributes to these harms through four distinct roles: perpetrator, instigator, facilitator, and enabler. Our findings highlight the relational harms of AI chatbots and the danger of algorithmic compliance, enhancing the understanding of AI harms in socio-emotional interactions. We also provide suggestions for designing ethical and responsible AI systems that prioritize user safety and well-being.
... Previous methods for evaluating fairness can be divided into two main categories: embedding- or probability-based approaches and generated text-based approaches. Embedding- or probability-based approaches assess LLMs by analyzing the hidden representations or predicted probabilities of tokens in counterfactual scenarios (Caliskan et al., 2017; May et al., 2019; Guo & Caliskan, 2021; Nadeem et al., 2020; Nangia et al., 2020). Generated text-based approaches evaluate LLMs by using prompts, such as questions, to elicit text completions or answers from the model (Dhamala et al., 2021; Wan et al., 2023; Liang et al., 2022; Nozza et al., 2021). ...
Preprint
Full-text available
The growing use of large language model (LLM)-based chatbots has raised concerns about fairness. Fairness issues in LLMs can lead to severe consequences, such as bias amplification, discrimination, and harm to marginalized communities. While existing fairness benchmarks mainly focus on single-turn dialogues, multi-turn scenarios, which in fact better reflect real-world conversations, present greater challenges due to conversational complexity and potential bias accumulation. In this paper, we propose a comprehensive fairness benchmark for LLMs in multi-turn dialogue scenarios, FairMT-Bench. Specifically, we formulate a task taxonomy targeting LLM fairness capabilities across three stages: context understanding, user interaction, and instruction trade-offs, with each stage comprising two tasks. To ensure coverage of diverse bias types and attributes, we draw from existing fairness datasets and employ our template to construct a multi-turn dialogue dataset, FairMT-10K. For evaluation, GPT-4 is applied, alongside bias classifiers including Llama-Guard-3 and human validation to ensure robustness. Experiments and analyses on FairMT-10K reveal that in multi-turn dialogue scenarios, current LLMs are more likely to generate biased responses, and there is significant variation in performance across different tasks and models. Based on this, we curate a challenging dataset, FairMT-1K, and test 15 current state-of-the-art (SOTA) LLMs on this dataset. The results show the current state of fairness in LLMs and showcase the utility of this novel approach for assessing fairness in more realistic multi-turn dialogue contexts, calling for future work to focus on LLM fairness improvement and the adoption of FairMT-1K in such efforts.
... This work builds on a long tradition in NLP of evaluating text generation (Celikyilmaz et al., 2020). While the curated datasets are often high quality, they tend to be small, spurring the construction of larger datasets through web scraping (Zhao et al., 2018; Zampieri et al., 2019; Nangia et al., 2020; Rosenthal et al., 2021) and even using other LMs (Zhang et al., 2022; Perez et al., 2023). Meanwhile, early work on behavior functions focused on measuring bias, toxicity, and hallucinations (Achiam et al., 2023; Anil et al., 2023; Chern et al., 2023; Varshney et al., 2023; Llamateam, 2024). ...
Preprint
Full-text available
As language models (LMs) approach human-level performance, a comprehensive understanding of their behavior becomes crucial. This includes evaluating capabilities, biases, task performance, and alignment with societal values. Extensive initial evaluations, including red teaming and diverse benchmarking, can establish a model's behavioral profile. However, subsequent fine-tuning or deployment modifications may alter these behaviors in unintended ways. We present a method for continual Behavioral Shift Auditing (BSA) in LMs. Building on recent work in hypothesis testing, our auditing test detects behavioral shifts solely through model generations. Our test compares model generations from a baseline model to those of the model under scrutiny and provides theoretical guarantees for change detection while controlling false positives. The test features a configurable tolerance parameter that adjusts sensitivity to behavioral changes for different use cases. We evaluate our approach using two case studies: monitoring changes in (a) toxicity and (b) translation performance. We find that the test is able to detect meaningful changes in behavior distributions using just hundreds of examples.
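As a rough stand-in only (not the test proposed in this work, which provides theoretical guarantees and a tolerance parameter), the idea can be illustrated by scoring generations from the baseline and the modified model with a behavior function and applying an off-the-shelf two-sample test. The behavior_score function and the generation lists below are assumed placeholders.

# Illustration: detect a shift between two models' behavior-score distributions.
from scipy.stats import mannwhitneyu

def audit_shift(baseline_scores, candidate_scores, alpha=0.01):
    # Flag a shift if the two score distributions differ significantly.
    stat, p_value = mannwhitneyu(baseline_scores, candidate_scores)
    return {"p_value": p_value, "shift_detected": p_value < alpha}

# baseline_scores = [behavior_score(g) for g in baseline_generations]
# candidate_scores = [behavior_score(g) for g in candidate_generations]
# print(audit_shift(baseline_scores, candidate_scores))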
Article
Taking an interdisciplinary approach to surveying issues around gender bias in textual and visual AI, we present literature on gender bias detection and mitigation in NLP, CV, as well as combined visual-linguistic models. We identify conceptual parallels between these strands of research as well as how methodologies were adapted cross-disciplinary from NLP to CV. We also find that there is a growing awareness for theoretical frameworks from the social sciences around gender in NLP that could be beneficial for aligning bias analytics in CV with human values and conceptualising gender beyond the binary categories of male/female.
Article
Full-text available
Concerns about gender bias in word embedding models have captured substantial attention in the algorithmic bias research literature. Other bias types however have received lesser amounts of scrutiny. This work describes a large-scale analysis of sentiment associations in popular word embedding models along the lines of gender and ethnicity but also along the less frequently studied dimensions of socioeconomic status, age, physical appearance, sexual orientation, religious sentiment and political leanings. Consistent with previous scholarly literature, this work has found systemic bias against given names popular among African-Americans in most embedding models examined. Gender bias in embedding models however appears to be multifaceted and often reversed in polarity to what has been regularly reported. Interestingly, using the common operationalization of the term bias in the fairness literature, novel types of so far unreported bias types in word embedding models have also been identified. Specifically, the popular embedding models analyzed here display negative biases against middle and working-class socioeconomic status, male children, senior citizens, plain physical appearance and intellectual phenomena such as Islamic religious faith, non-religiosity and conservative political orientation. Reasons for the paradoxical underreporting of these bias types in the relevant literature are probably manifold but widely held blind spots when searching for algorithmic bias and a lack of widespread technical jargon to unambiguously describe a variety of algorithmic associations could conceivably be playing a role. The causal origins for the multiplicity of loaded associations attached to distinct demographic groups within embedding models are often unclear but the heterogeneity of said associations and their potential multifactorial roots raises doubts about the validity of grouping them all under the umbrella term bias. Richer and more fine-grained terminology as well as a more comprehensive exploration of the bias landscape could help the fairness epistemic community to characterize and neutralize algorithmic discrimination more efficiently.
Article
Full-text available
We analyze human disagreements about the validity of natural language inferences. We show that, very often, disagreements are not dismissible as annotation “noise”, but rather persist as we collect more ratings and as we vary the amount of context provided to raters. We further show that the type of uncertainty captured by current state-of-the-art models for natural language inference is not reflective of the type of uncertainty present in human disagreements. We discuss implications of our results in relation to the recognizing textual entailment (RTE)/natural language inference (NLI) task. We argue for a refined evaluation objective that requires models to explicitly capture the full distribution of plausible human judgments.
Article
Full-text available
Machines learn what people know implicitly. AlphaGo has demonstrated that a machine can learn how to do things that people spend many years of concentrated study learning, and it can rapidly learn how to do them better than any human can. Caliskan et al. now show that machines can learn word associations from written texts and that these associations mirror those learned by humans, as measured by the Implicit Association Test (IAT) (see the Perspective by Greenwald). Why does this matter? Because the IAT has predictive value in uncovering the association between concepts, such as pleasantness and flowers or unpleasantness and insects. It can also tease out attitudes and beliefs—for example, associations between female names and family or male names and career. Such biases may not be expressed explicitly, yet they can prove influential in behavior. Science, this issue p. 183; see also p. 133
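The word-association measurements described here can be sketched as a WEAT-style statistic over word vectors: the differential association of two target word sets with two attribute word sets, computed from cosine similarities. The embedding lookup vec and the short word lists in the usage comment are illustrative assumptions.

# Sketch: WEAT-style association test over word vectors.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(word_vec, attr_a, attr_b):
    # Differential association of one target word with two attribute sets.
    return (np.mean([cosine(word_vec, a) for a in attr_a])
            - np.mean([cosine(word_vec, b) for b in attr_b]))

def test_statistic(targets_x, targets_y, attr_a, attr_b):
    # Sum of associations for target set X minus those for target set Y.
    return (sum(association(x, attr_a, attr_b) for x in targets_x)
            - sum(association(y, attr_a, attr_b) for y in targets_y))

# Example usage with a hypothetical embedding lookup vec():
# flowers = [vec(w) for w in ["rose", "tulip"]]
# insects = [vec(w) for w in ["ant", "wasp"]]
# pleasant = [vec(w) for w in ["love", "peace"]]
# unpleasant = [vec(w) for w in ["hate", "war"]]
# print(test_statistic(flowers, insects, pleasant, unpleasant))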
Article
Full-text available
The blind application of machine learning runs the risk of amplifying biases present in data. Such a danger is facing us with word embedding, a popular framework to represent text data as vectors which has been used in many machine learning and natural language processing tasks. We show that even word embeddings trained on Google News articles exhibit female/male gender stereotypes to a disturbing extent. This raises concerns because their widespread use, as we describe, often tends to amplify these biases. Geometrically, gender bias is first shown to be captured by a direction in the word embedding. Second, gender neutral words are shown to be linearly separable from gender definition words in the word embedding. Using these properties, we provide a methodology for modifying an embedding to remove gender stereotypes, such as the association between the words receptionist and female, while maintaining desired associations such as between the words queen and female. We define metrics to quantify both direct and indirect gender biases in embeddings, and develop algorithms to "debias" the embedding. Using crowd-worker evaluation as well as standard benchmarks, we empirically demonstrate that our algorithms significantly reduce gender bias in embeddings while preserving its useful properties such as the ability to cluster related concepts and to solve analogy tasks. The resulting embeddings can be used in applications without amplifying gender bias.
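A simplified sketch of the neutralize step described above follows: remove the component of a gender-neutral word vector that lies along a gender direction. The paper derives the direction from several definitional pairs; here a single she-he difference stands in for it, and the embedding lookup vec is a hypothetical placeholder.

# Sketch: project the gender direction out of a gender-neutral word vector.
import numpy as np

def gender_direction(v_she, v_he):
    d = v_she - v_he
    return d / np.linalg.norm(d)

def neutralize(v_word, direction):
    # Subtract the projection of the word vector onto the gender direction.
    projection = np.dot(v_word, direction) * direction
    debiased = v_word - projection
    return debiased / np.linalg.norm(debiased)

# Example usage with a hypothetical lookup vec():
# g = gender_direction(vec("she"), vec("he"))
# receptionist_debiased = neutralize(vec("receptionist"), g)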
Article
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense.To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4 – 79.1%, which are ∼15-35% (absolute) below human performance of 94.0%, depending on the amount of the training data allowed (2% – 100% respectively).Furthermore, we establish new state-of-the-art results on five related benchmarks — WSC (→ 90.1%), DPR (→ 93.1%), COPA(→ 90.6%), KnowRef (→ 85.6%), and Winogender (→ 97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.
Article
Coreference resolution is an important task for natural language understanding, and the resolution of ambiguous pronouns a longstanding challenge. Nonetheless, existing corpora do not capture ambiguous pronouns in sufficient volume or diversity to accurately indicate the practical utility of models. Furthermore, we find gender bias in existing corpora and systems favoring masculine entities. To address this, we present and release GAP, a gender-balanced labeled corpus of 8,908 ambiguous pronoun–name pairs sampled to provide diverse coverage of challenges posed by real-world text. We explore a range of baselines that demonstrate the complexity of the challenge, the best achieving just 66.9% F1. We show that syntactic structure and continuous neural models provide promising, complementary cues for approaching the challenge.
Article
In this paper, we propose data statements as a design solution and professional practice for natural language processing technologists, in both research and development. Through the adoption and widespread use of data statements, the field can begin to address critical scientific and ethical issues that result from the use of data from certain populations in the development of technology for other populations. We present a form that data statements can take and explore the implications of adopting them as part of regular practice. We argue that data statements will help alleviate issues related to exclusion and bias in language technology, lead to better precision in claims about how natural language processing research can generalize and thus better engineering results, protect companies from public embarrassment, and ultimately lead to language technology that meets its users in their own preferred linguistic style and furthermore does not misrepresent them to others.
Conference Paper
Weighted tree transducers have been proposed as useful formal models for representing syntactic natural language processing applications, but there has been little description of inference algorithms for these automata beyond formal foundations. We give a detailed description of algorithms for application of cascades of weighted tree transducers to weighted tree acceptors, connecting formal theory with actual practice. Additionally, we present novel on-the-fly variants of these algorithms, and compare their performance on a syntax machine translation cascade based on (Yamada and Knight, 2001).
Emily M Bender. 2019. A typology of ethical risks in language technology with an eye towards where transparent documentation can help.
Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in NLP. ArXiv.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Shweta Garg, Sudhanshu S Singh, Abhijit Mishra, and Kuntal Dey. 2017. CVBed: Structuring CVs using word embeddings. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 349-354, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Andra Gillespie. 2016. Race, perceptions of femininity, and the power of the first lady: A comparative analysis. In Nadia E. Brown and Sarah Allen Gershon, editors, Distinct Identities: Minority Women in U.S. Politics. Routledge.
Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Workshop on Widening NLP, pages 60-63, Florence, Italy. Association for Computational Linguistics.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.
Paul Pu Liang, Irene Mengze Li, Emily Zheng, Yao Chong Lim, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2020. Towards debiasing sentence representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30-36, Minneapolis, Minnesota. Association for Computational Linguistics.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3266-3280. Curran Associates, Inc.
Mark E Whiting, Grant Hugh, and Michael S Bernstein. 2019. Fair work: Crowd work minimum wage with one line of code. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, pages 197-206.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's transformers: State-of-the-art natural language processing. ArXiv.
Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. 2015 IEEE International Conference on Computer Vision (ICCV), pages 19-27.