Article

Communicating and combating algorithmic bias: effects of data diversity, labeler diversity, performance bias, and user feedback on AI trust

Abstract

Inspired by the emerging documentation paradigm emphasizing data and model transparency, this study explores whether displaying racial diversity cues in training data and labelers' backgrounds enhances users' expectations of algorithmic fairness and trust in AI systems, even to the point of making them overlook racially biased performance. It also explores how their trust is affected when the system invites their feedback. We conducted a factorial experiment (N = 597) to test hypotheses derived from a model of Human-AI Interaction based on the Theory of Interactive Media Effects (HAII-TIME). We found that racial diversity cues in either training data or labelers' backgrounds trigger the representativeness heuristic, which is associated with higher algorithmic fairness expectations and increased trust. Inviting feedback enhances users' sense of agency and is positively related to behavioral trust, but it reduces usability for Whites when the AI shows unbiased performance. Implications for designing socially responsible AI interfaces are discussed, considering both users' cognitive limitations and usability.
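As a reading aid, here is a minimal sketch of how a between-subjects factorial design like the one above could be analyzed, assuming a continuous trust measure; the variable names, the CSV file, and the use of statsmodels are illustrative assumptions, not the authors' analysis.

```python
# Hypothetical sketch: full-factorial ANOVA for a between-subjects design.
# Column names (data_diversity, labeler_diversity, performance_bias,
# feedback, trust) are illustrative, not taken from the study's materials.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("experiment_data.csv")  # hypothetical data file

# OLS model with all main effects and interactions among the four factors.
model = smf.ols(
    "trust ~ C(data_diversity) * C(labeler_diversity)"
    " * C(performance_bias) * C(feedback)",
    data=df,
).fit()

# Type-II ANOVA table summarizing main effects and interactions.
print(anova_lm(model, typ=2))
```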

... Therefore, variability in training datasets is crucial for mitigating biases in AI systems. When the training data lacks diversity or disproportionately represents certain categories, AI models can inadvertently learn and perpetuate such biases, thus leading to inaccurate outcomes [102]. Addressing AI bias in medical applications requires a multifaceted approach. ...
Article
Full-text available
Given the scale of user-generated content online, the use of artificial intelligence (AI) to flag problematic posts is inevitable, but users do not trust such automated moderation of content. We explore if (a) involving human moderators in the curation process and (b) affording “interactive transparency,” wherein users participate in curation, can promote appropriate reliance on AI. We test this through a 3 (Source: AI, Human, Both) × 3 (Transparency: No Transparency, Transparency-Only, Interactive Transparency) × 2 (Classification Decision: Flagged, Not Flagged) between-subjects online experiment (N = 676) involving classification of hate speech and suicidal ideation. We discovered that users trust AI for the moderation of content just as much as humans, but it depends on the heuristic that is triggered when they are told AI is the source of moderation. We also found that allowing users to provide feedback to the algorithm enhances trust by increasing user agency.
Article
Full-text available
When evaluating automated systems, some users apply the “positive machine heuristic” (i.e. machines are more accurate and precise than humans), whereas others apply the “negative machine heuristic” (i.e. machines lack the ability to make nuanced subjective judgments), but we do not know much about the characteristics that predict whether a user would apply the positive or negative machine heuristic. We conducted a study in the context of content moderation and discovered that individual differences relating to trust in humans, fear of artificial intelligence (AI), power usage, and political ideology can predict whether a user will invoke the positive or negative machine heuristic. For example, users who distrust other humans tend to be more positive toward machines. Our findings advance theoretical understanding of user responses to AI systems for content moderation and hold practical implications for the design of interfaces to appeal to users who are differentially predisposed toward trusting machines over humans.
Conference Paper
Full-text available
We introduce the psychometric concepts of bias and fairness in a multimodal machine learning context assessing individuals’ hireability from prerecorded video interviews. We collected interviews from 733 participants and hireability ratings from a panel of trained annotators in a simulated hiring study, and then trained interpretable machine learning models on verbal, paraverbal, and visual features extracted from the videos to investigate unimodal versus multimodal bias and fairness. Our results demonstrate that, in the absence of any bias mitigation strategy, combining multiple modalities only marginally improves prediction accuracy at the cost of increasing bias and reducing fairness compared to the least biased and most fair unimodal predictor set (verbal). We further show that gender-norming predictors only reduces gender predictability for paraverbal and visual modalities, while removing gender-biased features can achieve gender blindness, minimal bias, and fairness (for all modalities except for visual) at the cost of some prediction accuracy. Overall, the reduced-feature approach using predictors from all modalities achieved the best balance between accuracy, bias, and fairness, with the verbal modality alone performing almost as well. Our analysis highlights how optimizing model prediction accuracy in isolation and in a multimodal context may cause bias, disparate impact, and potential social harm, while a more holistic optimization approach based on accuracy, bias, and fairness can avoid these pitfalls.
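As a rough illustration of the "removing gender-biased features" idea described above, the sketch below drops features that are strongly correlated with gender before modeling; it is a crude, hypothetical stand-in (names and threshold included), not the paper's actual procedure.

```python
# Hypothetical sketch: drop features that are strongly predictive of gender.
import numpy as np
import pandas as pd

def drop_gender_predictive_features(X: pd.DataFrame, gender, threshold=0.2):
    """Remove columns whose absolute correlation with gender exceeds the
    threshold; a crude stand-in for a reduced-feature, gender-blind approach."""
    codes = pd.Series(gender).astype("category").cat.codes
    corrs = X.apply(lambda col: abs(np.corrcoef(col, codes)[0, 1]))
    return X.loc[:, corrs <= threshold], corrs.sort_values(ascending=False)

# Toy data: one feature deliberately tied to gender, one roughly neutral.
rng = np.random.default_rng(0)
gender = rng.integers(0, 2, size=200)
X = pd.DataFrame({
    "pitch": gender * 1.5 + rng.normal(size=200),   # strongly gendered
    "word_count": rng.normal(size=200),             # roughly neutral
})
X_reduced, corrs = drop_gender_predictive_features(X, gender)
print(corrs)                        # per-feature correlation with gender
print(X_reduced.columns.tolist())   # the gendered feature should be dropped
```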
Preprint
Full-text available
Risk assessment instrument (RAI) datasets, particularly ProPublica's COMPAS dataset, are commonly used in algorithmic fairness papers due to benchmarking practices of comparing algorithms on datasets used in prior work. In many cases, this data is used as a benchmark to demonstrate good performance without accounting for the complexities of criminal justice (CJ) processes. We show that pretrial RAI datasets contain numerous measurement biases and errors inherent to CJ pretrial evidence and due to disparities in discretion and deployment, are limited in making claims about real-world outcomes, making the datasets a poor fit for benchmarking under assumptions of ground truth and real-world impact. Conventional practices of simply replicating previous data experiments may implicitly inherit or edify normative positions without explicitly interrogating assumptions. With context of how interdisciplinary fields have engaged in CJ research, algorithmic fairness practices are misaligned for meaningful contribution in the context of CJ, and would benefit from transparent engagement with normative considerations and values related to fairness, justice, and equality. These factors prompt questions about whether benchmarks for intrinsically socio-technical systems like the CJ system can exist in a beneficial and ethical way.
Article
Full-text available
This study employs an experiment to test assessments of music composed by artificial intelligence. We examined the influence of (a) met or unmet expectations about artificial intelligence (AI)-composed music, (b) whether the music is better or worse than expected, and (c) the genre of the music on the evaluation of the music, using a 2 (expectancy violation vs. confirmation) × 2 (positive vs. negative evaluation) × 2 (electronic dance music vs. classical) design. The relationship between beliefs about creative AI and the music evaluation was also analyzed. Participants (n = 299) in an online survey listened to a randomly assigned music piece. Acceptance of creative AI was found to have a positive relationship with the assessment of AI-composed music. A two-way interaction between the expectancy violation and its valence, and a three-way interaction between the expectancy violation, its valence, and the genre of music were found. Implications for Expectancy Violation Theory and AI applications are discussed.
Article
Full-text available
Do politicians use the representativeness heuristic when making judgements, that is, when they appraise the likelihood or frequency of an outcome that is unknown or unknowable? Heuristics are cognitive shortcuts that facilitate judgements and decision making. Oftentimes, heuristics are useful, but they may also lead to systematic biases that can be detrimental for decision making in a representative democracy. Thus far, we lack experimental evidence on whether politicians use the representativeness heuristic. To contribute to and extend the existing literature, we develop and conduct a survey experiment whose main participants are Dutch elected local politicians from the larger municipalities (n = 211). This survey experiment examines whether politician participants display two decision-making biases related to the representativeness heuristic: the conjunction error and scope neglect. We also run the experiment with a student sample (n = 260), mainly to validate the experimental design. Our findings show that politician participants neglect scope in one scenario and display the conjunction error in two of three scenarios. These results suggest that politician participants use the representativeness heuristic. Our third conjunction error scenario, however, does not find evidence of this bias, which, as we discuss in the article, may be an artifact of our experimental design. Overall, our findings contribute fundamentally to our understanding of how politicians process information and how this influences their judgements and decision making.
Article
Full-text available
Fueled by ever-growing amounts of (digital) data and advances in artificial intelligence, decision-making in contemporary societies is increasingly delegated to automated processes. Drawing from social science theories and from the emerging body of research about algorithmic appreciation and algorithmic perceptions, the current study explores the extent to which personal characteristics can be linked to perceptions of automated decision-making by AI, and the boundary conditions of these perceptions, namely the extent to which such perceptions differ across media, (public) health, and judicial contexts. Data from a scenario-based survey experiment with a national sample (N = 958) show that people are by and large concerned about risks and have mixed opinions about fairness and usefulness of automated decision-making at a societal level, with general attitudes influenced by individual characteristics. Interestingly, decisions taken automatically by AI were often evaluated on par or even better than human experts for specific decisions. Theoretical and societal implications about these findings are discussed.
Article
Full-text available
In the last few years, Artificial Intelligence (AI) has achieved a notable momentum that, if harnessed appropriately, may deliver the best of expectations over many application sectors across the field. For this to occur anytime soon in Machine Learning, the entire community stands in front of the barrier of explainability, an inherent problem of the latest techniques brought by sub-symbolism (e.g. ensembles or Deep Neural Networks) that were not present in the previous hype of AI (namely, expert systems and rule-based models). Paradigms underlying this problem fall within the so-called eXplainable AI (XAI) field, which is widely acknowledged as a crucial feature for the practical deployment of AI models. The overview presented in this article examines the existing literature and contributions already made in the field of XAI, including a prospect toward what is yet to be reached. For this purpose, we summarize previous efforts to define explainability in Machine Learning, establishing a novel definition of explainable Machine Learning that covers such prior conceptual propositions with a major focus on the audience for which explainability is sought. Departing from this definition, we propose and discuss a taxonomy of recent contributions related to the explainability of different Machine Learning models, including those aimed at explaining Deep Learning methods, for which a second dedicated taxonomy is built and examined in detail. This critical literature analysis serves as the motivating background for a series of challenges faced by XAI, such as the interesting crossroads of data fusion and explainability. Our prospects lead toward the concept of Responsible Artificial Intelligence, namely, a methodology for the large-scale implementation of AI methods in real organizations with fairness, model explainability, and accountability at its core. Our ultimate goal is to provide newcomers to the field of XAI with a thorough taxonomy that can serve as reference material to stimulate future research advances, but also to encourage experts and professionals from other disciplines to embrace the benefits of AI in their activity sectors, without any prior bias for its lack of interpretability.
Article
Full-text available
Technological advancements in Artificial Intelligence allow the automation of every part of job interviews (information acquisition, information analysis, action selection, action implementation) resulting in highly automated interviews. Efficiency advantages exist, but it is unclear how people react to such interviews (and whether reactions depend on the stakes involved). Participants (N = 123) in a 2 (highly automated, videoconference) × 2 (high‐stakes, low‐stakes situation) experiment watched and assessed videos depicting a highly automated interview for high‐stakes (selection) and low‐stakes (training) situations or an equivalent videoconference interview. Automated high‐stakes interviews led to ambiguity and less perceived controllability. Additionally, highly automated interviews diminished overall acceptance through lower social presence and fairness. To conclude, people seem to react negatively to highly automated interviews and acceptance seems to vary based on the stakes. OPEN PRACTICES This study was pre‐registered on the Open Science Framework (osf.io/hgd5r) and on AsPredicted (https://AsPredicted.org/i52c6.pdf).
Article
Full-text available
Algorithms increasingly make managerial decisions that people used to make. Perceptions of algorithms, regardless of the algorithms' actual performance, can significantly influence their adoption, yet we do not fully understand how people perceive decisions made by algorithms as compared with decisions made by humans. To explore perceptions of algorithmic management, we conducted an online experiment using four managerial decisions that required either mechanical or human skills. We manipulated the decision-maker (algorithmic or human), and measured perceived fairness, trust, and emotional response. With the mechanical tasks, algorithmic and human-made decisions were perceived as equally fair and trustworthy and evoked similar emotions; however, human managers' fairness and trustworthiness were attributed to the manager's authority, whereas algorithms' fairness and trustworthiness were attributed to their perceived efficiency and objectivity. Human decisions evoked some positive emotion due to the possibility of social recognition, whereas algorithmic decisions generated a more mixed response – algorithms were seen as helpful tools but also possible tracking mechanisms. With the human tasks, algorithmic decisions were perceived as less fair and trustworthy and evoked more negative emotion than human decisions. Algorithms' perceived lack of intuition and subjective judgment capabilities contributed to the lower fairness and trustworthiness judgments. Positive emotion from human decisions was attributed to social recognition, while negative emotion from algorithmic decisions was attributed to the dehumanizing experience of being evaluated by machines. This work reveals people's lay concepts of algorithmic versus human decisions in a management context and suggests that task characteristics matter in understanding people's experiences with algorithmic technologies.
Conference Paper
Full-text available
Our goal is to enable robots to express their incapability, and to do so in a way that communicates both what they are trying to accomplish and why they are unable to accomplish it. We frame this as a trajectory optimization problem: maximize the similarity between the motion expressing incapability and what would amount to successful task execution, while obeying the physical limits of the robot. We introduce and evaluate candidate similarity measures, and show that one in particular generalizes to a range of tasks, while producing expressive motions that are tailored to each task. Our user study supports that our approach automatically generates motions expressing incapability that communicate both what and why to end-users, and improve their overall perception of the robot and willingness to collaborate with it in the future.
Conference Paper
Full-text available
Successful teams are characterized by high levels of trust between team members, allowing the team to learn from mistakes, take risks, and entertain diverse ideas. We investigated a robot's potential to shape trust within a team through the robot's expressions of vulnerability. We conducted a between-subjects experiment (N = 35 teams, 105 participants) comparing the behavior of three human teammates collaborating with either a social robot making vulnerable statements or with a social robot making neutral statements. We found that, in a group with a robot making vulnerable statements, participants responded more to the robot's comments and directed more of their gaze to the robot, displaying a higher level of engagement with the robot. Additionally, we discovered that during times of tension, human teammates in a group with a robot making vulnerable statements were more likely to explain their failure to the group, console team members who had made mistakes, and laugh together, all actions that reduce the amount of tension experienced by the team. These results suggest that a robot's vulnerable behavior can have "ripple effects" on their human team members' expressions of trust-related behavior.
Chapter
Full-text available
Expectancy violations theory predicts and explains the effects of nonverbal behavior violations on interpersonal communication outcomes such as attraction, credibility, persuasion, and smooth interactions. Human interactions are strongly governed by expectations which, if violated, are arousing and trigger an appraisal process that may be moderated by the rewardingness of the violator. Violation interpretations and evaluations determine whether they are positive or negative violations. Positive violations are predicted to produce more favorable outcomes, and negative violations less favorable outcomes, than positive and negative confirmations respectively. Many of the theory's propositions have been supported empirically. Some contrary findings have led to revision of the theory. The theory has also been expanded to several kinds of nonverbal violations, including personal space, eye contact, posture, touch, involvement, and immediacy violations. The theory also spawned the investigation of the meanings associated with violations and the kinds of arousal that violations provoke. Keywords: communication theory; expectations; interpersonal communication; interpersonal theory; interviewing; nonverbal communication; relational communication; social norms; violations
Conference Paper
Full-text available
How do individuals perceive algorithmic vs. group-made decisions? We investigated people's perceptions of mathematically-proven fair division algorithms making social division decisions. In our first qualitative study, about one third of the participants perceived algorithmic decisions as less than fair (30% for self, 36% for group), often because algorithmic assumptions about users did not account for multiple concepts of fairness or social behaviors, and the process of quantifying preferences through interfaces was prone to error. In our second experiment, algorithmic decisions were perceived to be less fair than discussion-based decisions, dependent on participants' interpersonal power and computer programming knowledge. Our work suggests that for algorithmic mediation to be fair, algorithms and their interfaces should account for social and altruistic behaviors that may be difficult to define in mathematical terms.
Article
Human assumption of superior performance by machines has a long history, resulting in the concept of “machine heuristic” (MH), which is a mental shortcut that individuals apply to automated systems. This article provides a formal explication of this concept and develops a new scale based on three studies (Combined N = 1129). Measurement items were derived from the explication and an open-ended survey (Study 1, N = 270). These were then administered in a closed-ended survey (Study 2, N = 448) to identify their dimensionality through exploratory factor analysis (EFA). Lastly, we conducted another survey (Study 3, N = 411) to verify the factor structure obtained in Study 2 by employing confirmatory factor analysis (CFA). Analyses resulted in a validated scale of seven items that reflect the level of MH in individuals and identified six sets of descriptive labels for machines (expert, efficient, rigid, superfluous, fair, and complex) that serve as formative indicators of MH. Theoretical and practical implications are discussed.
Conference Paper
Explanations are believed to aid understanding of AI models, but do they affect users' perceptions and trust in AI, especially in the presence of algorithmic bias? If so, when should explanations be provided to optimally balance explainability and usability? To answer these questions, we conducted a user study (N = 303) exploring how explanation timing influences users' perception of trust calibration, understanding of the AI system, and user experience and user interface satisfaction under both biased and unbiased AI performance conditions. We found that pre-explanations seem most valuable when the AI shows bias in its performance, whereas post-explanations appear more favorable when the system is bias-free. Showing both pre- and post-explanations tends to result in higher perceived trust calibration regardless of bias, despite concerns about content redundancy. Implications for designing socially responsible, explainable, and trustworthy AI interfaces are discussed.
Conference Paper
To promote data transparency, frameworks such as CrowdWorkSheets encourage documentation of annotation practices on the interfaces of AI systems, but we do not know how they affect user experience. Will the quality of labeling affect perceived credibility of training data? Does the source of annotation matter? Will a credible dataset persuade users to trust a system even if it shows racial biases in its predictions? To find out, we conducted a user study (N = 430) with a prototype of a classification system, using a 2 (labeling quality: high vs. low) × 4 (source: others-as-source vs. self-as-source cue vs. self-as-source voluntary action vs. self-as-source forced action) × 3 (AI performance: none vs. biased vs. unbiased) experiment. We found that high-quality labeling leads to higher perceived training data credibility, which in turn enhances users' trust in AI, but not when the system shows bias. Practical implications for explainable and ethical AI interfaces are discussed.
Article
Awareness of bias in algorithms is growing among scholars and users of algorithmic systems. But what can we observe about how users discover and behave around such biases? We used a cross-platform audit technique that analyzed online ratings of 803 hotels across three hotel rating platforms and found that one site's algorithmic rating system biased ratings, particularly of low-to-medium quality hotels, significantly higher than the other sites (up to 37%). Analyzing reviews by 162 users who independently discovered this bias, we seek to understand if, how, and in what ways users perceive and manage this bias. Users changed the typical ways they used a review on a hotel rating platform to instead discuss the rating system itself and raise other users' awareness of the rating bias. This raising of awareness included practices such as efforts to reverse-engineer the rating algorithm, efforts to correct the bias, and demonstrations of broken trust. We conclude with a discussion of how such behavior patterns might inform design approaches that anticipate unexpected bias and provide reliable means for meaningful bias discovery and response.
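A minimal pandas sketch of the kind of cross-platform comparison such an audit involves, with hypothetical platform names, column names, and toy ratings rather than the paper's data or code:

```python
# Hypothetical sketch: compare the same hotels' ratings across platforms.
import pandas as pd

# Long-format toy data: one row per (hotel, platform) rating.
ratings = pd.DataFrame({
    "hotel":    ["A", "A", "A", "B", "B", "B"],
    "platform": ["site1", "site2", "site3"] * 2,
    "rating":   [4.5, 3.9, 4.0, 3.8, 2.9, 3.0],
})

# Pivot so each platform is a column, then measure how much one platform's
# rating exceeds the mean of the other platforms for the same hotel.
wide = ratings.pivot(index="hotel", columns="platform", values="rating")
others_mean = wide[["site2", "site3"]].mean(axis=1)
inflation_pct = 100 * (wide["site1"] - others_mean) / others_mean
print(inflation_pct)  # percent by which site1 rates each hotel higher
```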
Article
Machine learning and artificial intelligence algorithms can assist human decision making and analysis tasks. While such technology shows promise, willingness to use and rely on intelligent systems may depend on whether people can trust and understand them. To address this issue, researchers have explored the use of explainable interfaces that attempt to help explain why or how a system produced the output for a given input. However, the effects of meaningful and meaningless explanations (determined by their alignment with human logic) are not properly understood, especially with users who are non-experts in data science. Additionally, we wanted to explore how explanation inclusion and level of meaningfulness would affect the user’s perception of accuracy. We designed a controlled experiment using an image classification scenario with local explanations to evaluate and better understand these issues. Our results show that whether explanations are human-meaningful can significantly affect perception of a system’s accuracy independent of the actual accuracy observed from system usage. Participants significantly underestimated the system’s accuracy when it provided weak, less human-meaningful explanations. Therefore, for intelligent systems with explainable interfaces, this research demonstrates that users are less likely to accurately judge the accuracy of algorithms that do not operate based on human-understandable rationale.
Article
Media systems that personalize their offerings keep track of users’ tastes by constantly learning from their activities. Some systems use this characteristic of machine learning to encourage users with statements like “the more you use the system, the better it can serve you in the future.” However, it is not clear whether users indeed feel encouraged and consider the system to be helpful and beneficial, or become discouraged at the prospect of jeopardizing their privacy in the process. A between-subjects experiment (N = 269) was conducted to find out. Guided by the HAII-TIME model (Sundar, 2020), the current study examined the effects of both explicit and implicit cues on the interface conveying machine learning. Data indicate that users consider the system to be a helper and tend to trust it more when the system is transparent about its learning, regardless of the quality of its performance and the degree of explicitness in conveying the fact that it is learning from their activities. The study found no evidence to suggest privacy concerns arising from the machine disclosing that it is learning from its users. The present paper discusses theoretical and practical implications of deploying machine learning cues to enhance user experience of AI-embedded systems.
Article
Advances in personalization algorithms and other applications of machine learning have vastly enhanced the ease and convenience of our media and communication experiences, but they have also raised significant concerns about privacy, transparency of technologies and human control over their operations. Going forth, reconciling such tensions between machine agency and human agency will be important in the era of artificial intelligence (AI), as machines get more agentic and media experiences become increasingly determined by algorithms. Theory and research should be geared toward a deeper understanding of the human experience of algorithms in general and the psychology of Human–AI interaction (HAII) in particular. This article proposes some directions by applying the dual-process framework of the Theory of Interactive Media Effects (TIME) for studying the symbolic and enabling effects of the affordances of AI-driven media on user perceptions and experiences.
Article
A group of industry, academic, and government experts convene in Philadelphia to explore the roots of algorithmic bias.
Article
Introduction: Clinical decision-making is a daily practice conducted by medical practitioners, yet the processes surrounding it are poorly understood. The influence of ‘short cuts’ in clinical decision-making, known as heuristics, remains unknown. This paper explores heuristics and the valuable role they play in medical practice, as well as offering potential solutions to minimise the risk of incorrect decision-making. Method: The quasi-systematic review was conducted according to modified PRISMA guidelines utilising the electronic databases Medline, Embase and Cinahl. All English-language papers covering bias and the medical profession were included. Papers with evidence from other healthcare professions were included if medical practitioners were in the study sample. Discussion: The most common decisional short cuts used in medicine are the Availability, Anchoring and Confirmatory heuristics. The Representativeness, Overconfidence and Bandwagon effects are also prevalent in medical practice. Heuristics are mostly positive but can also result in negative consequences if not utilised appropriately. Factors such as personality and level of experience may influence a doctor's use of heuristics. Heuristics are influenced by the context and conditions in which they are performed. Mitigating strategies such as reflective practice and technology may reduce the likelihood of inappropriate use. Conclusion: It remains unknown whether heuristics are primarily positive or negative for clinical decision-making. Future efforts should assess heuristics in real time, and controlled trials should be applied to assess the potential impact of mitigating factors in reducing the negative impact of heuristics and optimising their efficiency for positive outcomes.
Conference Paper
Fairness for Machine Learning has received considerable attention recently. Various mathematical formulations of fairness have been proposed, and it has been shown that it is impossible to satisfy all of them simultaneously. The literature so far has dealt with these impossibility results by quantifying the tradeoffs between different formulations of fairness. Our work takes a different perspective on this issue. Rather than requiring all notions of fairness to (partially) hold at the same time, we ask which one of them is the most appropriate given the societal domain in which the decision-making model is to be deployed. We take a descriptive approach and set out to identify the notion of fairness that best captures lay people's perception of fairness. We run adaptive experiments designed to pinpoint, through a small number of tests, the notion of fairness most compatible with each participant's choices. Perhaps surprisingly, we find that the most simplistic mathematical definition of fairness, namely demographic parity, most closely matches people's idea of fairness in two distinct application scenarios. This conclusion remains intact even when we explicitly tell the participants about the alternative, more complicated definitions of fairness, and we reduce the cognitive burden of evaluating those notions for them. Our findings have important implications for the Fair ML literature and the discourse on formalizing algorithmic fairness.
Article
Although artificial intelligence is a growing area of research, several problems remain. One such problem of particular importance is the low accuracy of predictions. This paper suggests that users' help is a practical approach to improving accuracy, and it considers four factors that trigger users' willingness to help an imperfect AI system. The two factors covered in Study 1 are utilitarian benefit based on egoistic motivation, and empathy based on altruistic motivation. In Study 2, utilitarian benefit is divided into explainable AI and monetary reward. The results indicate that two variables, namely empathy and monetary reward, have significant positive effects on willingness to help, and monetary reward is the strongest stimulus. In addition, explainable AI is shown to be positively associated with trust in AI. This study applies social studies of help motivation to the HCI field in order to induce users' willingness to help an imperfect AI. The triggers of help motivation, empathy and monetary reward, can be utilized to induce users' voluntary engagement in the loop with an imperfect AI.
Conference Paper
In this day and age of identity theft, are we likely to trust machines more than humans for handling our personal information? We answer this question by invoking the concept of "machine heuristic," which is a rule of thumb that machines are more secure and trustworthy than humans. In an experiment (N = 160) that involved making airline reservations, users were more likely to reveal their credit card information to a machine agent than a human agent. We demonstrate that cues on the interface trigger the machine heuristic by showing that those with higher cognitive accessibility of the heuristic (i.e., stronger prior belief in the rule of thumb) were more likely than those with lower accessibility to disclose to a machine, but they did not differ in their disclosure to a human. These findings have implications for design of interface cues conveying machine vs. human sources of our online interactions.
Conference Paper
Increasingly, algorithms are used to make important decisions across society. However, these algorithms are usually poorly understood, which can reduce transparency and evoke negative emotions. In this research, we seek to learn design principles for explanation interfaces that communicate how decision-making algorithms work, in order to help organizations explain their decisions to stakeholders, or to support users' "right to explanation". We conducted an online experiment where 199 participants used different explanation interfaces to understand an algorithm for making university admissions decisions. We measured users' objective and self-reported understanding of the algorithm. Our results show that both interactive explanations and "white-box" explanations (i.e. that show the inner workings of an algorithm) can improve users' comprehension. Although the interactive approach is more effective at improving comprehension, it comes with a trade-off of taking more time. Surprisingly, we also find that users' trust in algorithmic decisions is not affected by the explanation interface or their level of comprehension of the algorithm.
Conference Paper
We address a relatively under-explored aspect of human-computer interaction: people's abilities to understand the relationship between a machine learning model's stated performance on held-out data and its expected performance post deployment. We conduct large-scale, randomized human-subject experiments to examine whether laypeople's trust in a model, measured in terms of both the frequency with which they revise their predictions to match those of the model and their self-reported levels of trust in the model, varies depending on the model's stated accuracy on held-out data and on its observed accuracy in practice. We find that people's trust in a model is affected by both its stated accuracy and its observed accuracy, and that the effect of stated accuracy can change depending on the observed accuracy. Our work relates to recent research on interpretable machine learning, but moves beyond the typical focus on model internals, exploring a different component of the machine learning pipeline.
Preprint
Ensuring fairness of machine learning systems is a human-in-the-loop process. It relies on developers, users, and the general public to identify fairness problems and make improvements. To facilitate the process we need effective, unbiased, and user-friendly explanations that people can confidently rely on. Towards that end, we conducted an empirical study with four types of programmatically generated explanations to understand how they impact people's fairness judgments of ML systems. With an experiment involving more than 160 Mechanical Turk workers, we show that: 1) Certain explanations are considered inherently less fair, while others can enhance people's confidence in the fairness of the algorithm; 2) Different fairness problems--such as model-wide fairness issues versus case-specific fairness discrepancies--may be more effectively exposed through different styles of explanation; 3) Individual differences, including prior positions and judgment criteria of algorithmic fairness, impact how people react to different styles of explanation. We conclude with a discussion on providing personalized and adaptive explanations to support fairness judgments of ML systems.
Conference Paper
Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, we recommend that released models be accompanied by documentation detailing their performance characteristics. In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information. While we focus primarily on human-centered machine learning models in the application fields of computer vision and natural language processing, this framework can be used to document any trained machine learning model. To solidify the concept, we provide cards for two supervised models: One trained to detect smiling faces in images, and one trained to detect toxic comments in text. We propose model cards as a step towards the responsible democratization of machine learning and related artificial intelligence technology, increasing transparency into how well artificial intelligence technology works. We hope this work encourages those releasing trained machine learning models to accompany model releases with similar detailed evaluation numbers and other relevant documentation.
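A minimal, illustrative sketch of what a model card's contents could look like as a data structure; the field names paraphrase the framework described above and the metric values are toy placeholders, not an official schema or real results.

```python
# Hypothetical sketch of a model card as a small data structure.
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    model_name: str
    intended_use: str
    out_of_scope_uses: list
    evaluation_procedure: str
    # Benchmarked metrics disaggregated by demographic or phenotypic group
    # and by intersectional group, as the framework recommends.
    disaggregated_metrics: dict = field(default_factory=dict)

card = ModelCard(
    model_name="smiling-face-detector (illustrative)",
    intended_use="Detect smiling faces in consumer photos.",
    out_of_scope_uses=["emotion inference", "employment screening"],
    evaluation_procedure="Accuracy on a held-out test set, reported per group.",
    disaggregated_metrics={                      # toy numbers only
        ("sex=female", "skin_type=V-VI"): {"accuracy": 0.88},
        ("sex=male", "skin_type=I-II"): {"accuracy": 0.93},
    },
)
print(card.disaggregated_metrics)
```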
Article
The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses: the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.
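The sequential procedure described here is the Benjamini–Hochberg step-up procedure; a minimal sketch, assuming independent test statistics:

```python
# Minimal sketch of the Benjamini-Hochberg step-up procedure.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Reject the k smallest p-values, where k is the largest i such that
    p_(i) <= (i/m) * q. Returns a boolean mask in the original order."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = (np.arange(1, m + 1) / m) * q
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest index meeting its threshold
        reject[order[: k + 1]] = True
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74], q=0.05))
# -> [ True  True False False False False]
```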
Article
Computer scientists must identify sources of bias, de-bias training data and develop artificial-intelligence algorithms that are robust to skews in the data, argue James Zou and Londa Schiebinger.
Conference Paper
Algorithm fairness has started to attract the attention of researchers in AI, Software Engineering and Law communities, with more than twenty different notions of fairness proposed in the last few years. Yet, there is no clear agreement on which definition to apply in each situation. Moreover, the detailed differences between multiple definitions are difficult to grasp. To address this issue, this paper collects the most prominent definitions of fairness for the algorithmic classification problem, explains the rationale behind these definitions, and demonstrates each of them on a single unifying case-study. Our analysis intuitively explains why the same case can be considered fair according to some definitions and unfair according to others.
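To make the "fair under one definition, unfair under another" point concrete, the following sketch computes three group-conditional quantities on toy confusion-matrix counts; the helper and the numbers are illustrative, not the paper's unifying case study.

```python
# Hypothetical sketch: one case judged differently by different definitions.
def group_metrics(tp, fp, fn, tn):
    """Quantities that several fairness definitions compare across groups."""
    n = tp + fp + fn + tn
    return {
        "positive_rate": (tp + fp) / n,   # compared by demographic parity
        "TPR": tp / (tp + fn),            # compared by equal opportunity
        "PPV": tp / (tp + fp),            # compared by predictive parity
    }

group_a = group_metrics(tp=40, fp=10, fn=10, tn=40)
group_b = group_metrics(tp=16, fp=16, fn=4, tn=64)
print(group_a)  # {'positive_rate': 0.5,  'TPR': 0.8, 'PPV': 0.8}
print(group_b)  # {'positive_rate': 0.32, 'TPR': 0.8, 'PPV': 0.5}
# Equal TPR: fair under equal opportunity. Unequal positive rates and PPV:
# unfair under demographic parity and predictive parity.
```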
Conference Paper
Algorithmic systems increasingly shape information people are exposed to as well as influence decisions about employment, finances, and other opportunities. In some cases, algorithmic systems may be more or less favorable to certain groups or individuals, sparking substantial discussion of algorithmic fairness in public policy circles, academia, and the press. We broaden this discussion by exploring how members of potentially affected communities feel about algorithmic fairness. We conducted workshops and interviews with 44 participants from several populations traditionally marginalized by categories of race or class in the United States. While the concept of algorithmic fairness was largely unfamiliar, learning about algorithmic (un)fairness elicited negative feelings that connect to current national discussions about racial injustice and economic inequality. In addition to their concerns about potential harms to themselves and society, participants also indicated that algorithmic fairness (or lack thereof) could substantially affect their trust in a company or product.
Article
Currently there is no standard way to identify how a dataset was created, and what characteristics, motivations, and potential skews it represents. To begin to address this issue, we propose the concept of a datasheet for datasets, a short document to accompany public datasets, commercial APIs, and pretrained models. The goal of this proposal is to enable better communication between dataset creators and users, and help the AI community move toward greater transparency and accountability. By analogy, in computer hardware, it has become industry standard to accompany everything from the simplest components (e.g., resistors), to the most complex microprocessor chips, with datasheets detailing standard operating characteristics, test results, recommended usage, and other information. We outline some of the questions a datasheet for datasets should answer. These questions focus on when, where, and how the training data was gathered, its recommended use cases, and, in the case of human-centric datasets, information regarding the subjects' demographics and consent as applicable. We develop prototypes of datasheets for two well-known datasets: Labeled Faces in the Wild and the Pang & Lee Polarity Dataset.
Article
As more news articles are written via collaboration between journalists and algorithms, questions have arisen regarding how automation influences the way that news is processed and evaluated by audiences. Informed by expectancy violation theory and the MAIN model, two experimental studies were conducted that examined the effect of purported machine authorship on perceptions of news credibility. Study One (N = 129) revealed that news attributed to a machine is perceived as less credible than news attributed to a human journalist. Study Two (N = 182) also observed negative effects of machine authorship through the indirect pathway of source anthropomorphism and negative expectancy violations, with evidence of moderation by prior recall of robotics also observed. The theoretical and practical implications of these findings are discussed.
Article
The statistical tests used in the analysis of structural equation models with unobservable variables and measurement error are examined. A drawback of the commonly applied chi-square test, in addition to the known problems related to sample size and power, is that it may indicate an increasing correspondence between the hypothesized model and the observed data as both the measurement properties and the relationship between constructs decline. Further, and contrary to common assertion, the risk of making a Type II error can be substantial even when the sample size is large. Moreover, the present testing methods are unable to assess a model's explanatory power. To overcome these problems, the authors develop and apply a testing system based on measures of shared variance within the structural model, measurement model, and overall model.
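One widely used quantity from this shared-variance testing system is the average variance extracted (AVE) of a construct; a standard formulation is sketched below, with λ_i the standardized loadings and Var(ε_i) the error variances of the construct's k indicators.

```latex
\[
\mathrm{AVE}
  = \frac{\sum_{i=1}^{k} \lambda_i^{2}}
         {\sum_{i=1}^{k} \lambda_i^{2} + \sum_{i=1}^{k} \operatorname{Var}(\varepsilon_i)}
\]
```

Discriminant validity is then typically assessed by checking that each construct's AVE exceeds its squared correlation with every other construct.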