Conference Paper

Large Language Models Help Humans Verify Truthfulness – Except When They Are Convincingly Wrong

... While most prior work on AI reliance has been in the context of classical AI models (e.g., specialized classification models), there is a growing body of work looking at reliance on systems based on LLMs or other modern generative AI models [60,71,109,110,113,135]. For example, several recent studies explored the effect of communicating (un)certainty in LLMs by highlighting uncertain parts of LLM responses [110,113] or inserting natural language expressions of uncertainty [60,135], finding that some but not all types of (un)certainty information help foster appropriate reliance. ...
... Contributing to this line of work, we first take a bottom-up approach to identify the features of LLM responses that impact user reliance in the context of answering objective questions with the assistance of a popular LLM-infused application ChatGPT (Section 3). In line with findings from prior work [109], we see that reliance is shaped by the content of explanations provided by the system, particularly whether or not these explanations contain inconsistencies. We also observe that participants seek out sources to verify the information provided in responses. ...
... We then conduct a larger-scale, pre-registered, controlled experiment to quantitatively examine the effects of these features. While previous work by Si et al. [109] studied the effects of LLM-generated explanations and inconsistencies on people's fact-checking performance in a small-scale study (16 participants per condition), our work provides a more holistic picture by studying what (else) might contribute to reliance and how the identified features affect a wider range of variables, including people's evaluation of the LLM response's justification quality and actionability and their likelihood of asking follow-up questions. As for the findings, first, consistent with Si et al. [109], we find that explanations increase people's reliance, including overreliance on incorrect answers, and that inconsistencies in explanations can reduce overreliance. ...
Preprint
Full-text available
Large language models (LLMs) can produce erroneous responses that sound fluent and convincing, raising the risk that users will rely on these responses as if they were correct. Mitigating such overreliance is a key challenge. Through a think-aloud study in which participants use an LLM-infused application to answer objective questions, we identify several features of LLM responses that shape users' reliance: explanations (supporting details for answers), inconsistencies in explanations, and sources. Through a large-scale, pre-registered, controlled experiment (N=308), we isolate and study the effects of these features on users' reliance, accuracy, and other measures. We find that the presence of explanations increases reliance on both correct and incorrect responses. However, we observe less reliance on incorrect responses when sources are provided or when explanations exhibit inconsistencies. We discuss the implications of these findings for fostering appropriate reliance on LLMs.
... Because they can generate coherent, knowledgeable, and high-quality responses to diverse human input [100], large language models have attracted broad attention from the human-computer interaction research community [54]. Researchers have actively explored how LLMs can assist users in various tasks such as data annotation [30,97], programming [72], scientific writing [84], and fact verification [85]. All of the above functions can be achieved with elaborate prompt engineering using a single LLM. ...
... We also received some open-text feedback such as "The plans look nice, I do not find any space for improvement" and "the planning stage of the LLM assistant was helpful and trustworthy." Findings from recent work on LLM-assisted fact checking corroborate this, wherein authors found that convincing explanations provided by LLMs can cause over-reliance when LLMs are wrong [85]. ...
... Users following a wrong plan can lead to negative outcomes. Combined with existing work on algorithm aversion [11] and the impact of negative first impressions on user trust [88], we can infer that such convincingly wrong content [85] can bias user trust and reliance towards the extremes. Before users take notice, they may develop an uncalibrated trust in the AI system, as observed through our findings in high-risk tasks (i.e., tasks 1,2,3) and corroborating work by Si et al. [85]. ...
Preprint
Full-text available
Since the explosion in popularity of ChatGPT, large language models (LLMs) have continued to impact our everyday lives. Equipped with external tools that are designed for a specific purpose (e.g., for flight booking or an alarm clock), LLM agents exercise an increasing capability to assist humans in their daily work. Although LLM agents have shown a promising blueprint as daily assistants, there is a limited understanding of how they can provide daily assistance based on planning and sequential decision making capabilities. We draw inspiration from recent work that has highlighted the value of 'LLM-modulo' setups in conjunction with humans-in-the-loop for planning tasks. We conducted an empirical study (N = 248) of LLM agents as daily assistants in six commonly occurring tasks with different levels of risk typically associated with them (e.g., flight ticket booking and credit card payments). To ensure user agency and control over the LLM agent, we adopted LLM agents in a plan-then-execute manner, wherein the agents conducted step-wise planning and step-by-step execution in a simulation environment. We analyzed how user involvement at each stage affects their trust and collaborative team performance. Our findings demonstrate that LLM agents can be a double-edged sword -- (1) they can work well when a high-quality plan and necessary user involvement in execution are available, and (2) users can easily mistrust the LLM agents with plans that seem plausible. We synthesized key insights for using LLM agents as daily assistants to calibrate user trust and achieve better overall task outcomes. Our work has important implications for the future design of daily assistants and human-AI collaboration with LLM agents.
... Lu [64]. In addition, Si and colleagues investigated the use of Large Language Models to assess the truthfulness of claims, providing textual explanations to help users distinguish between true and false information [94]. At the same time, studies have shown drawbacks for user trust and performance when the presentation of the model's explanation does not align with the user's mental model of the news [78]. ...
... Building on this work, our study explored presenting concise, understandable explanations of the AI system's verdict process. Unlike previous studies that presented true/false results with debunking narratives [41,81,94], our study demonstrated that even without specific debunking details, users could still gain valuable insights into information credibility from short indicators and the types of sources that contributed to the indicator. In light of the balance between communicating effective veracity information and avoiding information overload, these results are promising. ...
Preprint
Full-text available
Reducing the spread of misinformation is challenging. AI-based fact verification systems offer a promising solution by addressing the high costs and slow pace of traditional fact-checking. However, the problem of how to effectively communicate the results to users remains unsolved. Warning labels may seem an easy solution, but they fail to account for fuzzy misinformation that is not entirely fake. Additionally, users' limited attention spans and social media information should be taken into account while designing the presentation. The online experiment (n = 537) investigates the impact of sources and granularity on users' perception of information veracity and the system's usefulness and trustworthiness. Findings show that fine-grained indicators enhance nuanced opinions, information awareness, and the intention to use fact-checking systems. Source differences had minimal impact on opinions and perceptions, except for informativeness. Qualitative findings suggest the proposed indicators promote critical thinking. We discuss implications for designing concise, user-friendly AI fact-checking feedback.
... AI explanations have been widely studied as a decision aid (Bussone et al., 2015;Wang and Yin, 2021;Poursabzi-Sangdeh et al., 2021). Prior work has shown that natural language explanations supporting the AI prediction cause over-reliance (Si et al., 2024;Sieker et al., 2024;Hashemi Chaleshtori et al., 2024). We hypothesize that providing natural language supporting explanations when user trust is low (< 5) can mitigate under-reliance. ...
... Similar to how providing supporting explanations is an effective intervention when user trust is low, we investigate whether providing reasons for why the model prediction might be incorrect can counter-balance the cognitive effect of high trust (> 8). Such counter-explanations have been shown to reduce over-reliance compared to regular supporting explanations (Si et al., 2024). Similar to the supporting explanations, we generate natural language counter-explanations for each option by prompting GPT-4o to list 1-2 reasons why that option might be incorrect, while not completely rejecting that option (e.g. ...
Preprint
Full-text available
Trust biases how users rely on AI recommendations in AI-assisted decision-making tasks, with low and high levels of trust resulting in increased under- and over-reliance, respectively. We propose that AI assistants should adapt their behavior through trust-adaptive interventions to mitigate such inappropriate reliance. For instance, when user trust is low, providing an explanation can elicit more careful consideration of the assistant's advice by the user. In two decision-making scenarios -- laypeople answering science questions and doctors making medical diagnoses -- we find that providing supporting and counter-explanations during moments of low and high trust, respectively, yields up to 38% reduction in inappropriate reliance and 20% improvement in decision accuracy. We are similarly able to reduce over-reliance by adaptively inserting forced pauses to promote deliberation. Our results highlight how AI adaptation to user trust facilitates appropriate reliance, presenting exciting avenues for improving human-AI collaboration.
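The excerpts above describe generating counter-explanations by prompting GPT-4o to list 1-2 reasons why an answer option might be incorrect without rejecting it outright. Below is a minimal sketch of that step, assuming the OpenAI Python SDK; the prompt wording and the helper function are illustrative, not the authors' exact implementation.

# Illustrative sketch: generate a hedged counter-explanation for one answer option.
# Assumes the OpenAI Python SDK (pip install openai) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def counter_explanation(question: str, option: str, model: str = "gpt-4o") -> str:
    """Ask the model for 1-2 reasons the option might be incorrect,
    without completely rejecting it (as described in the excerpt)."""
    prompt = (
        f"Question: {question}\n"
        f"Candidate answer: {option}\n"
        "List 1-2 reasons why this candidate answer might be incorrect. "
        "Do not completely reject the answer; keep the tone hedged."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Example usage with a hypothetical item:
# print(counter_explanation("Which planet has the most moons?", "Jupiter"))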
... Second, the lack of explainability in the reasoning and inference processes of LLM-based and RAG-based systems hinders fact-checking and exacerbates over-reliance on AI recommendations (Si et al., 2023). RAG typically provides retrieved information only to the generator rather than to users. ...
... Transparent LLM-based systems that provide explanations and retrieve materials from existing research enhance human verification of AI-generated content (Vasconcelos et al., 2023), assist in decision-making (Fok and Weld, 2023), calibrate user trust (Bussone et al., 2015), and reduce over-reliance on AI (Das et al., 2022; Zhang et al., 2020; Bansal et al., 2021). Thus, providing reliable background information is critical to mitigating over-reliance on AI and enabling experts to calibrate trust through explanations and retrieved context (Si et al., 2023). ...
Preprint
Full-text available
Drug discovery (DD) has tremendously contributed to maintaining and improving public health. Hypothesizing that inhibiting protein misfolding can slow disease progression, researchers focus on target identification (Target ID) to find protein structures for drug binding. While Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have accelerated drug discovery, integrating models into cohesive workflows remains challenging. We conducted a user study with drug discovery researchers to identify the applicability of LLMs and RAGs in Target ID. We identified two main findings: 1) an LLM should provide multiple Protein-Protein Interactions (PPIs) based on an initial protein and protein candidates that have a therapeutic impact; 2) the model must provide the PPI and relevant explanations for better understanding. Based on these observations, we identified three limitations in previous approaches for Target ID: 1) semantic ambiguity, 2) lack of explainability, and 3) short retrieval units. To address these issues, we propose GraPPI, a large-scale knowledge graph (KG)-based retrieve-divide-solve agent pipeline RAG framework to support large-scale PPI signaling pathway exploration in understanding therapeutic impacts by decomposing the analysis of entire PPI pathways into sub-tasks focused on the analysis of PPI edges.
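The abstract above describes decomposing the analysis of an entire PPI signaling pathway into sub-tasks focused on individual PPI edges. The following is only an illustrative sketch of that divide step; the pathway, protein names, and query wording are hypothetical, and the actual GraPPI pipeline retrieves edges from a large-scale knowledge graph rather than from a hand-written list.

# Illustrative sketch of the "divide" idea: turn a pathway (ordered list of proteins)
# into per-edge analysis sub-tasks. Protein names and query wording are hypothetical.
from typing import List, Tuple

def pathway_to_edge_tasks(pathway: List[str]) -> List[Tuple[str, str, str]]:
    """Return (source, target, sub-task query) triples, one per PPI edge."""
    tasks = []
    for src, dst in zip(pathway, pathway[1:]):
        query = (f"Explain the interaction between {src} and {dst} "
                 f"and its possible therapeutic relevance.")
        tasks.append((src, dst, query))
    return tasks

# Example usage with a hypothetical pathway:
for task in pathway_to_edge_tasks(["APP", "BACE1", "PSEN1"]):
    print(task)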
... From Table 1 (reviewed studies): Social AI influences users' moral judgments, whether or not users know that the source of the advice is generated by Social AI (system studied: ChatGPT). Si et al. (2023): humans' ability to assess (true or false) information and explanations generated by Social AI, compared with information snippets from Wikipedia; N = 1500 participants across five conditions (sample demographics not disclosed); system studied: ChatGPT; finding: users tend to over-rely on explanations provided by ChatGPT, which decreases their ability to detect false claims in informational content. Spitale et al. (2023): to what extent Social AI is more capable than humans at writing convincing, yet wrong information; N = 697. ...
... The recent studies reviewed in Table 1 reveal that Social AI, particularly ChatGPT powered by GPT-3 and GPT-4, may significantly impact human information processing and decision-making across different scenarios. A clear finding across the diverse studies is users' tendency to over-rely on content provided by Social AI (Krugel et al. 2023; Si et al. 2023; Zhou et al. 2024). This model power is particularly concerning when Social AI influences users' decision-making, their understanding of the world, and political or ideological stances (Bai et al. 2023; Goldstein et al. 2024; Hackenburg and Margetts 2023; Palmer et al. ...
Article
Full-text available
Given the widespread integration of Social AI like ChatGPT, Gemini, Copilot, and MyAI in personal and professional contexts, it is crucial to understand their effects on information and knowledge processing and on individual autonomy. This paper builds on Bråten’s concept of model power, applying it to Social AI to offer a new perspective on the interaction dynamics between humans and AI. By reviewing recent user studies, we examine whether and how models of the world reflected in Social AI may disproportionately impact human-AI interactions, potentially leading to model monopolies where Social AI impacts human beliefs and behaviour and homogenizes the worldviews of its users. The concept of model power provides a framework for critically evaluating the impact and influence that Social AI has on communication and meaning-making, thereby informing the development of future systems to support more balanced and meaningful human-AI interactions.
... ADVQA (ORAL) had the highest ADVSCORE with the highest κ, and ADVQA (WRITTEN) had the highest margin µ among the four datasets, indicating that ADVQA questions were the most adversarial. The positive µ values of both versions of ADVQA (we use human responses from a user study by Si et al. [52]) validate that our creation pipeline can be generalized to other data collection methods. ...
Preprint
Full-text available
Adversarial benchmarks validate model abilities by providing samples that fool models but not humans. However, despite the proliferation of datasets that claim to be adversarial, there does not exist an established metric to evaluate how adversarial these datasets are. To address this lacuna, we introduce ADVSCORE, a metric which quantifies how adversarial and discriminative an adversarial dataset is and exposes the features that make data adversarial. We then use ADVSCORE to underpin a dataset creation pipeline that incentivizes writing a high-quality adversarial dataset. As a proof of concept, we use ADVSCORE to collect an adversarial question answering (QA) dataset, ADVQA, from our pipeline. The high-quality questions in ADVQA surpass three adversarial benchmarks across domains at fooling several models but not humans. We validate our results based on difficulty estimates from 9,347 human responses on four datasets and predictions from three models. Moreover, ADVSCORE uncovers which adversarial tactics used by human writers fool models (e.g., GPT-4) but not humans. Through ADVSCORE and its analyses, we offer guidance on revealing language model vulnerabilities and producing reliable adversarial examples.
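ADVSCORE itself is built on IRT-based difficulty estimates; the sketch below is only a simple illustrative proxy for the human-model margin µ mentioned in the excerpt (per-item gap between human and model accuracy, averaged over items), not the paper's formulation. The toy data is made up.

# Illustrative proxy only: per-item margin between human and model accuracy,
# averaged over a dataset. ADVSCORE itself uses an IRT-based formulation.
from statistics import mean

def average_margin(human_correct: list[list[int]], model_correct: list[list[int]]) -> float:
    """human_correct[i] / model_correct[i]: 0/1 outcomes for item i across
    human respondents / models. A positive margin means the items are harder
    for models than for humans."""
    margins = [mean(h) - mean(m) for h, m in zip(human_correct, model_correct)]
    return mean(margins)

# Hypothetical toy data: 3 items, 4 humans, 3 models each.
humans = [[1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
models = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(average_margin(humans, models))  # > 0 suggests adversarial items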
... As noted in Section 5.3, we caution against equating improved citation quality with improved factuality since even correctly cited passages can be factually incorrect or misleading without additional context about the passage and its source. Citations can therefore invoke a false sense of trust and amplify observations made in Si et al. (2023). It remains important for users to stay critical and reflect on generated responses. ...
Preprint
Large Language Models (LLMs) frequently hallucinate, impeding their reliability in mission-critical situations. One approach to address this issue is to provide citations to relevant sources alongside generated content, enhancing the verifiability of generations. However, citing passages accurately in answers remains a substantial challenge. This paper proposes a weakly-supervised fine-tuning method leveraging factual consistency models (FCMs). Our approach alternates between generating texts with citations and supervised fine-tuning with FCM-filtered citation data. Focused learning is integrated into the objective, directing the fine-tuning process to emphasise the factual unit tokens, as measured by an FCM. Results on the ALCE few-shot citation benchmark with various instruction-tuned LLMs demonstrate superior performance compared to in-context learning, vanilla supervised fine-tuning, and state-of-the-art methods, with an average improvement of 34.1, 15.5, and 10.5 citation F1 points, respectively. Moreover, in a domain transfer setting we show that the obtained citation generation ability robustly transfers to unseen datasets. Notably, our citation improvements contribute to the lowest factual error rate across baselines.
... These models' ability to dynamically expand their knowledge base through activities like browsing the internet, reading books, or watching videos is an exciting potential advancement. Key concerns for LLMs include security [94,64,90,98], privacy [31,38], the propagation of truthfulness [30,77,45], and the prevention of misinformation [80,72,103]. VLMs face unique safety challenges: 1) incorrect alignment of multimodal data can lead to harmful outputs, 2) images may contain sensitive information, necessitating careful handling, and 3) VLMs are vulnerable to attacks manipulating both text and images. ...
Preprint
Full-text available
Recent breakthroughs in vision-language models (VLMs) emphasize the necessity of benchmarking human preferences in real-world multimodal interactions. To address this gap, we launched WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate VLMs. We curated WV-Bench by selecting 500 high-quality samples from 8,000 user submissions in WV-Arena. WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-Arena Elo. This significantly outperforms other benchmarks like MMVet, MMMU, and MMStar. Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs. For example, we find that although GPT-4V surpasses many other models like Reka-Flash, Opus, and Yi-VL-Plus in simple visual recognition and reasoning tasks, it still faces challenges with subtle contextual cues, spatial reasoning, visual imagination, and expert domain knowledge. Additionally, current VLMs exhibit issues with hallucinations and safety when intentionally provoked. We are releasing our chat and feedback data to further advance research in the field of VLMs.
... LLMs are often overconfident in their answers, especially when prompted to express their own confidence in words (Kiesler and Schiffner, 2023; Xiong et al., 2023a), which may lead to user overreliance on model outputs (Si et al., 2023c). Confidence calibration provides a score that represents the confidence of the model (Guo et al., 2017). ...
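The excerpt above cites Guo et al. (2017) for confidence calibration; the method proposed in that paper is temperature scaling, which divides the logits by a single scalar T fitted on a validation set. Below is a minimal numpy sketch of that idea, using a grid search in place of the original LBFGS optimization; the toy logits and labels are made up.

# Minimal temperature-scaling sketch (after Guo et al., 2017): fit a single scalar T
# on validation logits by minimizing negative log-likelihood, then rescale confidences.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    probs = softmax(logits / T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Simple grid search over T; the original paper uses LBFGS, the idea is the same."""
    return min(grid, key=lambda T: nll(logits, labels, T))

# Toy validation data (made up): 4 examples, 3 classes.
val_logits = np.array([[4.0, 1.0, 0.5], [3.5, 3.0, 0.2], [0.1, 2.5, 2.4], [5.0, 0.0, 0.0]])
val_labels = np.array([0, 1, 2, 0])
T = fit_temperature(val_logits, val_labels)
calibrated = softmax(val_logits / T)
print(T, calibrated.max(axis=1))  # calibrated confidence scores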
Preprint
Full-text available
Generative Artificial Intelligence (GenAI) systems are being increasingly deployed across all parts of industry and research settings. Developers and end users interact with these systems through the use of prompting or prompt engineering. While prompting is a widespread and highly researched concept, there exists conflicting terminology and a poor ontological understanding of what constitutes a prompt due to the area's nascency. This paper establishes a structured understanding of prompts, by assembling a taxonomy of prompting techniques and analyzing their use. We present a comprehensive vocabulary of 33 vocabulary terms, a taxonomy of 58 text-only prompting techniques, and 40 techniques for other modalities. We further present a meta-analysis of the entire literature on natural language prefix-prompting.
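As a concrete illustration of the prefix-prompting techniques the abstract above surveys, a few-shot (in-context learning) prompt is simply a prefix assembled from labeled examples followed by the query. The sketch below shows one such assembly; the sentiment task and examples are hypothetical.

# Minimal few-shot prefix prompt (in-context learning). Task and examples are hypothetical.
examples = [
    ("The movie was a waste of time.", "negative"),
    ("A delightful, sharply written film.", "positive"),
]
query = "The plot dragged, but the acting was superb."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # this prefix can be sent to any text-completion model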
... The explainer tools are closely relevant to the users, which makes some properties particularly desirable, for example, being helpful for answering unseen instances (Joshi et al., 2023) and for fact-checking (Si et al., 2024). In this paper, we are particularly interested in situatedness: the explanations of the same phenomena can and should be tailored to the audience. ...
Preprint
Large language models (LLMs) can be used to generate natural language explanations (NLE) that are adapted to different users' situations. However, there is yet to be a quantitative evaluation of the extent of such adaptation. To bridge this gap, we collect a benchmarking dataset, Situation-Based Explanation. This dataset contains 100 explanandums. Each explanandum is paired with explanations targeted at three distinct audience types (such as educators, students, and professionals), enabling us to assess how well the explanations meet the specific informational needs and contexts of these diverse groups, e.g., students, teachers, and parents. For each "explanandum paired with an audience" situation, we include a human-written explanation. These allow us to compute scores that quantify how the LLMs adapt the explanations to the situations. On an array of pretrained language models with varying sizes, we examine three categories of prompting methods: rule-based prompting, meta-prompting, and in-context learning prompting. We find that 1) language models can generate prompts that result in explanations more precisely aligned with the target situations, 2) explicitly modeling an "assistant" persona by prompting "You are a helpful assistant..." is not a necessary prompt technique for situated NLE tasks, and 3) the in-context learning prompts can only help LLMs learn the demonstration template but cannot improve their inference performance. SBE and our analysis facilitate future research towards generating situated natural language explanations.
... Several studies have also focused on hallucination in LLMs. A comparative study on how LLMs assist humans in fact-checking tasks is presented in [40]. The paper explores how LLMs fare in comparison to traditional search engines in facilitating fact-checking tasks. ...
Preprint
Full-text available
As LLMs become more pervasive across various users and scenarios, identifying potential issues when using these models becomes essential. Examples include bias, inconsistencies, and hallucination. Although auditing the LLM for these problems is desirable, it is far from being easy or solved. An effective method is to probe the LLM using different versions of the same question. This could expose inconsistencies in its knowledge or operation, indicating potential for bias or hallucination. However, to operationalize this auditing method at scale, we need an approach to create those probes reliably and automatically. In this paper we propose an automatic and scalable solution, where one uses a different LLM along with a human in the loop. This approach offers verifiability and transparency, while avoiding circular reliance on the same LLMs, and increasing scientific rigor and generalizability. Specifically, we present a novel methodology with two phases of verification using humans: standardized evaluation criteria to verify responses, and a structured prompt template to generate desired probes. Experiments on a set of questions from the TruthfulQA dataset show that we can generate a reliable set of probes from one LLM that can be used to audit inconsistencies in a different LLM. The criteria for generating and applying auditing probes are generalizable to various LLMs regardless of the underlying structure or training mechanism.
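The core probing idea in the abstract above (use one LLM to generate rephrasings of a question and compare a different LLM's answers across them) can be sketched as follows. This is a minimal sketch assuming the OpenAI Python SDK; the model names, prompt wording, and helper functions are assumptions, and the human verification phases described in the abstract are omitted.

# Illustrative probe-based auditing sketch: one model paraphrases a question into probes,
# a different model answers each probe, and divergent answers flag potential inconsistency.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return resp.choices[0].message.content.strip()

def audit(question: str, probe_model="gpt-4o", audited_model="gpt-4o-mini", n=3):
    probes = [
        ask(probe_model, f"Rephrase this question, keeping its meaning (variant {i + 1}): {question}")
        for i in range(n)
    ]
    # A human (or a separate scoring step) would then check these answers for inconsistencies.
    return {p: ask(audited_model, p) for p in [question] + probes}

# Example usage:
# for probe, answer in audit("Who wrote 'The Master and Margarita'?").items():
#     print(probe, "->", answer)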
Preprint
Full-text available
Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing-where LLMs construct and explain answers-better capturing user needs and knowledge while remaining easy to score. We then show even when MCQA is a useful format, its datasets suffer from: leakage; unanswerability; shortcuts; and saturation. In each issue, we give fixes from education, like rubrics to guide MCQ writing; scoring methods to bridle guessing; and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA-robustness, biases, and unfaithful explanations-showing how our prior solutions better measure or address these issues. While we do not need to desert MCQA, we encourage more efforts in refining the task based on educational testing, advancing evaluations.
Article
Full-text available
The current literature on AI‐advised decision making—involving explainable AI systems advising human decision makers—presents a series of inconclusive and confounding results. To synthesize these findings, we propose a simple theory that elucidates the frequent failure of AI explanations to engender appropriate reliance and complementary decision making performance. In contrast to other common desiderata, for example, interpretability or spelling out the AI's reasoning process, we argue that explanations are only useful to the extent that they allow a human decision maker to verify the correctness of the AI's prediction. Prior studies find in many decision making contexts that AI explanations do not facilitate such verification. Moreover, most tasks fundamentally do not allow easy verification, regardless of explanation method, limiting the potential benefit of any type of explanation. We also compare the objective of complementary performance with that of appropriate reliance, decomposing the latter into the notions of outcome‐graded and strategy‐graded reliance.
Article
Full-text available
Explainable AI provides insights to users into the why behind model predictions, offering potential for users to better understand and trust a model, and to recognize and correct AI predictions that are incorrect. Prior research on human and explainable AI interactions has focused on measures such as interpretability, trust, and usability of the explanation. There are mixed findings on whether explainable AI can improve actual human decision-making and the ability to identify problems with the underlying model. Using real datasets, we compare objective human decision accuracy without AI (control), with an AI prediction (no explanation), and with an AI prediction plus explanation. We find that providing any kind of AI prediction tends to improve user decision accuracy, but no conclusive evidence that explainable AI has a meaningful impact. Moreover, we observed that the strongest predictor of human decision accuracy was AI accuracy, and that users were somewhat able to detect when the AI was correct vs. incorrect, but this was not significantly affected by including an explanation. Our results indicate that, at least in some situations, the why information provided in explainable AI may not enhance user decision-making, and further research may be needed to understand how to integrate explainable AI into real systems.
Article
Full-text available
A question answering system that in addition to providing an answer provides an explanation of the reasoning that leads to that answer has potential advantages in terms of debuggability, extensibility, and trust. To this end, we propose QED, a linguistically informed, extensible framework for explanations in question answering. A QED explanation specifies the relationship between a question and answer according to formal semantic notions such as referential equality, sentencehood, and entailment. We describe and publicly release an expert-annotated dataset of QED explanations built upon a subset of the Google Natural Questions dataset, and report baseline models on two tasks—post-hoc explanation generation given an answer, and joint question answering and explanation generation. In the joint setting, a promising result suggests that training on a relatively small amount of QED data can improve question answering. In addition to describing the formal, language-theoretic motivations for the QED approach, we describe a large user study showing that the presence of QED explanations significantly improves the ability of untrained raters to spot errors made by a strong neural QA baseline.
Conference Paper
Full-text available
The rapid increase of misinformation online has emerged as one of the biggest challenges in this post-truth era. This has given rise to many fact-checking websites that manually assess doubtful claims. However, the speed and scale at which misinformation spreads in online media inherently limits manual verification. Hence, the problem of automatic credibility assessment has attracted great attention. In this work, we present CredEye, a system for automatic credibility assessment. It takes a natural language claim as input from the user and automatically analyzes its credibility by considering relevant articles from the Web. Our system captures joint interaction between the language style of articles, their stance towards a claim, and the trustworthiness of the sources. In addition, extraction of supporting evidence in the form of enriched snippets makes the verdicts of CredEye transparent and interpretable.
Article
Full-text available
There has been a recent resurgence in the area of explainable artificial intelligence as researchers and practitioners seek to provide more transparency to their algorithms. Much of this research is focused on explicitly explaining decisions or actions to a human observer, and it should not be controversial to say that, if these techniques are to succeed, the explanations they generate should have a structure that humans accept. However, it is fair to say that most work in explainable artificial intelligence uses only the researchers' intuition of what constitutes a 'good' explanation. There exist vast and valuable bodies of research in philosophy, psychology, and cognitive science of how people define, generate, select, evaluate, and present explanations. This paper argues that the field of explainable artificial intelligence should build on this existing research, and reviews relevant papers from philosophy, cognitive psychology/science, and social psychology, which study these topics. It draws out some important findings, and discusses ways that these can be infused with work on explainable artificial intelligence.
Conference Paper
Full-text available
We introduce AI rationalization, an approach for generating explanations of autonomous system behavior as if a human had performed the behavior. We describe a rationalization technique that uses neural machine translation to translate internal state-action representations of an autonomous agent into natural language. We evaluate our technique in the Frogger game environment, training an autonomous game playing agent to rationalize its action choices using natural language. A natural language training corpus is collected from human players thinking out loud as they play the game. We motivate the use of rationalization as an approach to explanation generation and show the results of two experiments evaluating the effectiveness of rationalization. Results of these evaluations show that neural machine translation is able to accurately generate rationalizations that describe agent behavior, and that rationalizations are more satisfying to humans than other alternative methods of explanation.
Conference Paper
Full-text available
Clinical decision support systems (CDSS) are increasingly used by healthcare professionals for evidence-based diagnosis and treatment support. However, research has suggested that users often over-rely on system suggestions – even if the suggestions are wrong. Providing explanations could potentially mitigate misplaced trust in the system and over-reliance. In this paper, we explore how explanations are related to user trust and reliance, as well as what information users would find helpful to better understand the reliability of a system's decision-making. We investigated these questions through an exploratory user study in which healthcare professionals were observed using a CDSS prototype to diagnose hypothetical cases involving fictional patients suffering from a balance-related disorder. Our results show that the amount of system confidence had only a slight effect on trust and reliance. More importantly, giving a fuller explanation of the facts used in making a diagnosis had a positive effect on trust but also led to over-reliance issues, whereas less detailed explanations made participants question the system's reliability and led to self-reliance problems. To help them in their assessment of the reliability of the system's decisions, study participants wanted better explanations to help them interpret the system's confidence, to verify that the disorder fit the suggestion, to better understand the reasoning chain of the decision model, and to make differential diagnoses. Our work is a first step toward improved CDSS design that better supports clinicians in making correct diagnoses.
Article
The limited information (data voids) on political topics relevant to underrepresented communities has facilitated the spread of disinformation. Independent journalists who combat disinformation in underrepresented communities have reported feeling overwhelmed because they lack the tools necessary to make sense of the information they monitor and address the data voids. In this paper, we present a system to identify and address political data voids within underrepresented communities. Informed by an interview study indicating that the independent news media has the potential to address these voids, we designed an intelligent collaborative system called Datavoidant. Datavoidant uses state-of-the-art machine learning models and introduces a novel design space to provide independent journalists with a collective understanding of data voids to facilitate generating content to cover the voids. We performed a user interface evaluation with independent news media journalists (N=22). These journalists reported that Datavoidant's features allowed them to more rapidly and easily get a sense of what was taking place in the information ecosystem and address the data voids. They also reported feeling more confident about the content they created and the unique perspectives they had proposed to cover the voids. We conclude by discussing how Datavoidant enables a new design space wherein individuals can collaborate to make sense of their information ecosystem and actively devise strategies to prevent disinformation.
Conference Paper
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) -- models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
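The abstract above describes combining a pre-trained seq2seq generator with a retriever over a dense Wikipedia index. Below is a minimal retrieve-then-generate sketch under simplified, explicitly substituted assumptions: TF-IDF over a toy in-memory corpus stands in for the dense index and neural retriever, and google/flan-t5-small (via the Hugging Face transformers pipeline) stands in for the paper's BART generator. It illustrates only the retrieve-then-condition idea, not the RAG-Sequence/RAG-Token formulations.

# Minimal retrieve-then-generate sketch. Simplifications vs. the paper:
# TF-IDF over a toy corpus replaces the dense Wikipedia index and neural retriever,
# and flan-t5-small replaces the pre-trained BART generator.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

corpus = [  # toy, made-up passages
    "Marie Curie won Nobel Prizes in both Physics and Chemistry.",
    "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    "Photosynthesis converts light energy into chemical energy in plants.",
]

vectorizer = TfidfVectorizer().fit(corpus)
doc_vectors = vectorizer.transform(corpus)
generator = pipeline("text2text-generation", model="google/flan-t5-small")

def answer(question: str, k: int = 2) -> str:
    scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    top = scores.argsort()[::-1][:k]                      # retrieve top-k passages
    context = " ".join(corpus[i] for i in top)
    prompt = f"Answer the question using the context.\nContext: {context}\nQuestion: {question}"
    return generator(prompt, max_new_tokens=32)[0]["generated_text"]

print(answer("Which scientist won Nobel Prizes in two different fields?"))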
Conference Paper
The reporting and the analysis of current events around the globe has expanded from professional, editor-lead journalism all the way to citizen journalism. Nowadays, politicians and other key players enjoy direct access to their audiences through social media, bypassing the filters of official cables or traditional media. However, the multiple advantages of free speech and direct communication are dimmed by the misuse of media to spread inaccurate or misleading claims. These phenomena have led to the modern incarnation of the fact-checker --- a professional whose main aim is to examine claims using available evidence and to assess their veracity. Here, we survey the available intelligent technologies that can support the human expert in the different steps of her fact-checking endeavor. These include identifying claims worth fact-checking, detecting relevant previously fact-checked claims, retrieving relevant evidence to fact-check a claim, and actually verifying a claim. In each case, we pay attention to the challenges and the potential impact on real-world fact-checking.
Article
People supported by AI-powered decision support tools frequently overrely on the AI: they accept an AI's suggestion even when that suggestion is wrong. Adding explanations to the AI decisions does not appear to reduce the overreliance and some studies suggest that it might even increase it. Informed by the dual-process theory of cognition, we posit that people rarely engage analytically with each individual AI recommendation and explanation, and instead develop general heuristics about whether and when to follow the AI suggestions. Building on prior research on medical decision-making, we designed three cognitive forcing interventions to compel people to engage more thoughtfully with the AI-generated explanations. We conducted an experiment (N=199), in which we compared our three cognitive forcing designs to two simple explainable AI approaches and to a no-AI baseline. The results demonstrate that cognitive forcing significantly reduced overreliance compared to the simple explainable AI approaches. However, there was a trade-off: people assigned the least favorable subjective ratings to the designs that reduced the overreliance the most. To audit our work for intervention-generated inequalities, we investigated whether our interventions benefited equally people with different levels of Need for Cognition (i.e., motivation to engage in effortful mental activities). Our results show that, on average, cognitive forcing interventions benefited participants higher in Need for Cognition more. Our research suggests that human cognitive motivation moderates the effectiveness of explainable AI solutions.
Conference Paper
Machine learning is an important tool for decision making, but its ethical and responsible application requires rigorous vetting of its interpretability and utility: an understudied problem, particularly for natural language processing models. We propose an evaluation of interpretation on a real task with real human users, where the effectiveness of interpretation is measured by how much it improves human performance. We design a grounded, realistic human-computer cooperative setting using a question answering task, Quizbowl. We recruit both trivia experts and novices to play this game with a computer as their teammate, which communicates its prediction via three different interpretations. We also provide design guidance for natural language processing human-in-the-loop settings.
Article
Automation is often problematic because people fail to rely upon it appropriately. Because people respond to technology socially, trust influences reliance on automation. In particular, trust guides reliance when complexity and unanticipated situations make a complete understanding of the automation impractical. This review considers trust from the organizational, sociological, interpersonal, psychological, and neurological perspectives. It considers how the context, automation characteristics, and cognitive processes affect the appropriateness of trust. The context in which the automation is used influences automation performance and provides a goal-oriented perspective to assess automation characteristics along a dimension of attributional abstraction. These characteristics can influence trust through analytic, analogical, and affective processes. The challenges of extrapolating the concept of trust in people to trust in automation are discussed. A conceptual model integrates research regarding trust in automation and describes the dynamics of trust, the role of context, and the influence of display characteristics. Actual or potential applications of this research include improved designs of systems that require people to manage imperfect automation. Copyright © 2004, Human Factors and Ergonomics Society. All rights reserved.
Conference Paper
Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model, which can be used to transform an untrustworthy model or prediction into a trustworthy one. In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We also propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). We show the utility of explanations via novel experiments, both simulated and with human subjects, on various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and identifying why a classifier should not be trusted.
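A minimal sketch of LIME applied to a text classifier, assuming the open-source lime package together with scikit-learn; the tiny training set is made up, and a real use case would start from an already trained model.

# Minimal LIME sketch for a text classifier. Assumes: pip install lime scikit-learn.
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great film, loved it", "wonderful acting", "terrible plot", "boring and slow"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative (toy data)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["negative", "positive"])
explanation = explainer.explain_instance(
    "loved the acting but the plot was boring",
    clf.predict_proba,          # LIME perturbs the text and fits a local linear surrogate
    num_features=4,
)
print(explanation.as_list())    # word-level weights of the local surrogate model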
Andrew Beers. 2019. Examining the digital toolsets of journalists reporting on disinformation. In Computation + Journalism Conference.

Jifan Chen, Grace Kim, Aniruddh Sriram, Greg Durrett, and Eunsol Choi. 2023. Complex claim verification with evidence retrieved in the wild. arXiv, abs/2305.11859.

Robert Faris, Hal Roberts, Bruce Etling, Nikki Bourassa, Ethan Zuckerman, and Yochai Benkler. 2017. Partisanship, propaganda, and disinformation: Online media and the 2016 U.S. presidential election. Social Science Research Network.

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. 2023b. Retrieval-augmented generation for large language models: A survey. arXiv, abs/2312.10997.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. International Conference on Machine Learning.

Vivian Lai, Chacha Chen, Qingzi Vera Liao, Alison Smith-Renner, and Chenhao Tan. 2021. Towards a science of human-AI decision making: A survey of empirical studies. arXiv, abs/2112.11471.

Ricardo Mendes. 2017. Troops, trolls and troublemakers: A global inventory of organized social media manipulation. Computational Propaganda Research Project.

Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, and Samuel R. Bowman. 2023. Debate helps supervise unreliable experts. arXiv, abs/2311.08702.

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.

Sina Mohseni, Fan Yang, Shiva K. Pentyala, Mengnan Du, Yi Liu, Nic Lupfer, Xia Hu, Shuiwang Ji, and Eric D. Ragan. 2021. Machine learning explanations to prevent overtrust in fake news detection. Proceedings of the International AAAI Conference on Web and Social Media.

James Thorne and Andreas Vlachos. 2018. Automated fact checking: Task formulations, methods and future directions. Proceedings of the 27th International Conference on Computational Linguistics.
James Thorne and Andreas Vlachos. 2018. Automated fact checking: Task formulations, methods and future directions. Proceedings of the 27th International Conference on Computational Linguistics.