Conference Paper

Will You Accept an Imperfect AI?: Exploring Designs for Adjusting End-user Expectations of AI Systems

Authors: Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett

Abstract

AI technologies have been incorporated into many end-user applications. However, expectations of the capabilities of such systems vary among people. Furthermore, bloated expectations have been identified as negatively affecting perception and acceptance of such systems. Although the intelligibility of ML algorithms has been well studied, there has been little work on methods for setting appropriate expectations before the initial use of an AI-based system. In this work, we use a Scheduling Assistant - an AI system for automated meeting request detection in free-text email - to study the impact of several methods of expectation setting. We explore two versions of this system with the same 50% level of accuracy of the AI component but each designed with a different focus on the types of errors to avoid (avoiding False Positives vs. False Negatives). We show that such different focus can lead to vastly different subjective perceptions of accuracy and acceptance. Further, we design expectation adjustment techniques that prepare users for AI imperfections and result in a significant increase in acceptance.
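
The paper contrasts two versions of the Scheduling Assistant that share the same overall accuracy but differ in which error type they avoid (False Positives vs. False Negatives). The paper's own implementation is not reproduced here; the sketch below only illustrates the standard mechanism behind such a design, namely shifting a classifier's decision threshold so that one error type becomes rarer at the expense of the other. All scores, labels, and threshold values are synthetic assumptions for illustration, not the authors' code.

    # Minimal sketch (not the authors' code): one meeting-request classifier,
    # two operating points that trade false positives against false negatives.
    import numpy as np

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=1000)  # 1 = email really contains a meeting request
    # Synthetic confidence scores, loosely correlated with the labels.
    scores = np.clip(0.35 * labels + rng.normal(0.4, 0.25, size=1000), 0.0, 1.0)

    def error_profile(threshold):
        pred = scores >= threshold
        fp = int(np.sum(pred & (labels == 0)))   # detected a meeting that is not there
        fn = int(np.sum(~pred & (labels == 1)))  # missed a real meeting
        acc = float(np.mean(pred == labels))
        return fp, fn, acc

    for name, t in [("avoid False Positives (high precision)", 0.70),
                    ("avoid False Negatives (high recall)", 0.30)]:
        fp, fn, acc = error_profile(t)
        print(f"{name}: threshold={t:.2f}  FP={fp}  FN={fn}  accuracy={acc:.2f}")

Raising the threshold suppresses false detections at the cost of more missed meetings, and vice versa, which is how two systems with comparable accuracy can feel very different to end users.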


... Outside of the education domain, research has shown how algorithms' imperfections and differences from human judgement may result in trust and acceptance issues [17,33,36], but we know less about how students interact with and perceive imperfect AI autograders. Forming guidelines on how to manage students' and teachers' perceived accuracy, fairness, educational value, and other attitudes towards NLP autograders will be key to further adoption of these systems. ...
... Dzindolet et al. found that giving an explanation for why errors might happen prior to use of a system increased trust, but potentially to an unwarranted level [19]. Kocielnik et al. found that when users interacted with a very inaccurate AI (that found appointments in emails), transparency improved system acceptance, but only when the system was optimized for high precision [33]. In an educational setting, Kizilcec varied transparency levels in a system that automatically compensated for peer grading bias in a high-stakes college class assignment. ...
... Our first explanation states that students were more sensitive to false negatives or remembered them better, i.e., the well-documented phenomenon of algorithm aversion [17], where people quickly lose trust in algorithmic predictions after seeing mistakes, even when the systems outperform humans. Setting proper expectations can help counter algorithm aversion, and providing transparency around algorithmic processes before use has shown promise in the literature [19,33]. One avenue for transparency for our case is to inform students of the training process which used past students' answers, as all but one interviewee failed to mention this in their folk theories. ...
... Job promotion [15], meeting scheduling assistance [74], email topic classification [125], cybersecurity monitoring [42], profession prediction [94], military planning (monitor and direct unmanned vehicles) [130]. ...
... Liu et al. [94] define a task to predict a person's occupation given a short biography. Other tasks not related to jobs include classifying emails' topic [125], AI-assisted meeting scheduling [74], military planning via monitoring and unmanned vehicles [130], and cybersecurity monitoring [42]. ...
... Based on this conceptualization, we categorize the AI assistance elements studied in the survey paper into these four groups, as listed in Table 3. It is interesting to note that many of the studies reviewed focused on providing information [80] and prototypes [23,43,74]. The referenced table groups assistance elements into: model documentation (overview of the model or algorithm [74,76,77,88,112]; model prediction distribution [139]); information about training data (input features or information the model considers [61,77,111,155]; aggregate statistics, e.g., demographic [15,39]; full training "data explanation" [7]); and other AI system elements affecting user agency or experience. ...
Preprint
As AI systems demonstrate increasingly strong predictive performance, their adoption has grown in numerous domains. However, in high-stakes domains such as criminal justice and healthcare, full automation is often not desirable due to safety, ethical, and legal concerns, yet fully manual approaches can be inaccurate and time consuming. As a result, there is growing interest in the research community to augment human decision making with AI assistance. Besides developing AI technologies for this purpose, the emerging field of human-AI decision making must embrace empirical approaches to form a foundational understanding of how humans interact and work with AI to make decisions. To invite and help structure research efforts towards a science of understanding and improving human-AI decision making, we survey recent literature of empirical human-subject studies on this topic. We summarize the study design choices made in over 100 papers in three important aspects: (1) decision tasks, (2) AI models and AI assistance elements, and (3) evaluation metrics. For each aspect, we summarize current trends, discuss gaps in current practices of the field, and make a list of recommendations for future research. Our survey highlights the need to develop common frameworks to account for the design and research spaces of human-AI decision making, so that researchers can make rigorous choices in study design, and the research community can build on each other's work and produce generalizable scientific knowledge. We also hope this survey will serve as a bridge for HCI and AI communities to work together to mutually shape the empirical science and computational technologies for human-AI decision making.
... Most physicians do not expect their clinical systems to behave inconsistently and imperfectly [69,86], which can lead to mistrust and potential abandonment of these technologies within real-world clinical setups [19,25,26]. While considerable work has focused on improving the accuracy of AI algorithms, comparatively less has focused on improving the trust and usability of interactive assistance techniques. ...
... Previous works, outside the clinical scope [53,86], note that the impact of False-Positive (FP) vs. False-Negative (FN) errors on UX is generally unexplored. However, this is of high relevance when considering AI for the clinical domain [22,50], as will be shown experimentally. ...
... Indeed, prediction quality is typically quantified as precision in contrast with recall [128]. We therefore explore the following Research Questions and associate each with a set of Hypotheses, following the guidelines described in the literature [8,86]. ...
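
For reference, the precision/recall contrast invoked in the excerpt above follows the standard definitions, with TP, FP, and FN denoting true positives, false positives, and false negatives:

    \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
    \mathrm{Recall} = \frac{TP}{TP + FN}

A system tuned for high precision keeps FP low; one tuned for high recall keeps FN low. This is the same trade-off the Scheduling Assistant study manipulates while holding overall accuracy fixed.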
Article
In this paper, we developed BreastScreening-AI within two scenarios for the classification of multimodal breast images: (1) Clinician-Only; and (2) Clinician-AI. The novelty lies in the introduction of a deep learning method into a real clinical workflow for medical imaging diagnosis. We attempt to address three high-level goals in the two above scenarios. Concretely, how clinicians: i) accept and interact with these systems, revealing whether explanations and functionalities are required; ii) are receptive to the introduction of AI-assisted systems, given the benefit of mitigating clinical error; and iii) are affected by the AI assistance. We conduct an extensive evaluation embracing the following experimental stages: (a) patient selection with different severities, (b) qualitative and quantitative analysis for the chosen patients under the two different scenarios. We address the high-level goals through a real-world case study of 45 clinicians from nine institutions. We compare the diagnostics and observe the superiority of the Clinician-AI scenario, as we obtained a decrease of 27% for False-Positives and 4% for False-Negatives. Through an extensive experimental study, we conclude that the proposed design techniques positively impact the expectations and perceived satisfaction of 91% of clinicians, while decreasing the time-to-diagnose by 3 min per patient.
... Human-computer interaction studies relate user acceptance of such errors to the cost associated with the error, which is task-specific [39]. For example, it is more acceptable for users that a scheduling assistant wrongly detects appointments in emails than that it overlooks appointments that take place. ...
... For example, it is more acceptable for users that a scheduling assistant wrongly detects appointments in emails than that it overlooks appointments that take place. The reason is, ignoring the false appointment is easy for users, whereas missing an appointment comes with high costs [39]. In such cases, the system should favour recall over precision. ...
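
One common way to operationalize this cost asymmetry (an illustration, not a method taken from the cited work) is to pick the decision threshold that minimizes the expected cost of errors, with a missed appointment weighted much more heavily than a false detection. The cost values and data below are assumptions made for the sketch.

    # Illustrative cost-sensitive threshold selection for an appointment detector.
    import numpy as np

    rng = np.random.default_rng(1)
    labels = rng.integers(0, 2, size=1000)   # 1 = email contains an appointment
    scores = np.clip(0.35 * labels + rng.normal(0.4, 0.25, size=1000), 0.0, 1.0)

    COST_FN = 10.0  # assumed: overlooking a real appointment is expensive for the user
    COST_FP = 1.0   # assumed: dismissing a false detection is cheap

    def expected_cost(threshold):
        pred = scores >= threshold
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        return COST_FP * fp + COST_FN * fn

    candidates = np.linspace(0.05, 0.95, 19)
    best = min(candidates, key=expected_cost)
    print(f"cost-minimizing threshold: {best:.2f}")  # a low threshold, i.e. favouring recall

With FN weighted ten times more heavily than FP, the selected threshold ends up low, which corresponds to the "favour recall over precision" recommendation in the excerpt.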
Preprint
Full-text available
Most learners fail to develop deep text comprehension when reading textbooks passively. Posing questions about what learners have read is a well-established way of fostering their text comprehension. However, many textbooks lack self-assessment questions because authoring them is time-consuming and expensive. Automatic question generators may alleviate this scarcity by generating sound pedagogical questions. However, generating questions automatically poses linguistic and pedagogical challenges. What should we ask? And, how do we phrase the question automatically? We address those challenges with an automatic question generator grounded in learning theory. The paper introduces a novel pedagogically meaningful content selection mechanism to find question-worthy sentences and answers in arbitrary textbook contents. We conducted an empirical evaluation study with educational experts, annotating 150 generated questions in six different domains. Results indicate a high linguistic quality of the generated questions. Furthermore, the evaluation results imply that the majority of the generated questions ask about central information related to the given text and may foster text comprehension in specific learning scenarios.
... An AI-based system may be a helpful tool as an adjunct for clinicians or, in the common case that radiologic expertise is not available, for other medical teams [75,83]. However, most clinicians do not expect their clinical systems to behave inconsistently and imperfectly [42,50], which can lead to mistrust and potential abandonment of these technologies within real-world clinical setups [9,20,21]. ...
... Specifically, we aim to understand better the factors affecting the adoption and use of intelligent agents [47,71,80]. We were particularly interested in investigating the role of risk, privacy, and trust in the context of the adoption of AI systems that support clinicians during patient diagnosis via medical imaging [20,50]. We also want to understand the effect of moderator variables (e.g., gender, age, medical experience, training levels and areas of expertise) in the adoption of AI across this clinical workflow [33]. ...
... In light of this, recent work in HCI has evaluated the efficacy of these tools in helping people understand ML models. These findings suggest that ML practitioners [52] and end-users [10,54] are not always able to make accurate judgments about the model, even with the help of explanations. In fact, having access to these tools often leads to over-trust in the ML models. ...
... For end-users, completeness (rather than soundness) of explanations helps people form accurate mental models [57]. Accuracy and example-based explanations can similarly shape people's mental models and expectations, albeit in different ways [54]. ...
Preprint
Full-text available
Understanding how ML models work is a prerequisite for responsibly designing, deploying, and using ML-based systems. With interpretability approaches, ML can now offer explanations for its outputs to aid human understanding. Though these approaches rely on guidelines for how humans explain things to each other, they ultimately solve for improving the artifact -- an explanation. In this paper, we propose an alternate framework for interpretability grounded in Weick's sensemaking theory, which focuses on who the explanation is intended for. Recent work has advocated for the importance of understanding stakeholders' needs -- we build on this by providing concrete properties (e.g., identity, social context, environmental cues, etc.) that shape human understanding. We use an application of sensemaking in organizations as a template for discussing design guidelines for Sensible AI, AI that factors in the nuances of human cognition when trying to explain itself.
... In parallel to developing different explanation approaches, numerous studies have investigated the impact of explanations on user perceptions of and interactions with machine learning systems [5,14,20,33,58,59,85,101,104]. ...
... Prior work has found that increased transparency through explanations can increase user acceptance of the systems [28,49,59,100]. However, increased transparency does not always lead to increased trust. ...
... Explanations. Consistent with prior work [12,17,47], we found that providing explanations could robustly improve users' perceived understanding of AI. More importantly, participants appreciated the additional support enabled by explanations, to better direct their attention to the parts of the video likely associated with UX problems, and to remind them of the indicators of UX problems that they might have otherwise missed. ...
... Recent research has suggested that the precision and recall of an AI might affect its users' perceptions [47]. Therefore, ...
Preprint
Full-text available
Analyzing usability test videos is arduous. Although recent research showed the promise of AI in assisting with such tasks, it remains largely unknown how AI should be designed to facilitate effective collaboration between user experience (UX) evaluators and AI. Inspired by the concepts of agency and work context in human and AI collaboration literature, we studied two corresponding design factors for AI-assisted UX evaluation: explanations and synchronization. Explanations allow AI to further inform humans how it identifies UX problems from a usability test session; synchronization refers to the two ways humans and AI collaborate: synchronously and asynchronously. We iteratively designed a tool, AI Assistant, with four versions of UIs corresponding to the two levels of explanations (with/without) and synchronization (sync/async). By adopting a hybrid wizard-of-oz approach to simulating an AI with reasonable performance, we conducted a mixed-method study with 24 UX evaluators identifying UX problems from usability test videos using AI Assistant. Our quantitative and qualitative results show that AI with explanations, regardless of being presented synchronously or asynchronously, provided better support for UX evaluators' analysis and was perceived more positively; when without explanations, synchronous AI better improved UX evaluators' performance and engagement compared to the asynchronous AI. Lastly, we present the design implications for AI-assisted UX evaluation and facilitating more effective human-AI collaboration.
... We believe that this precision-oriented approach is vital for AQG systems. In Human-Computer Interaction research, it has been shown that failures causing the most harm to the user should be minimized (Kocielnik et al., 2019), and we assume that irrelevant questions are significantly more harmful to the user than possibly missing questions. Hence, a focus on high precision generation is critical. ...
... Additionally, a task-specific adaptation of the classifier is applied, setting the decision threshold to 0.70. The idea is to favor classification errors that affect the users the least (Kocielnik et al., 2019). Hence, it will generate fewer questions, but the questions will more likely originate from an actual definition. ...
Article
Full-text available
Background Asking learners manually authored questions about their readings improves their text comprehension. Yet, not all reading materials comprise sufficiently many questions and many informal reading materials do not contain any. Therefore, automatic question generation has great potential in education as it may alleviate the lack of questions. However, currently, there is insufficient evidence on whether or not those automatically generated questions are beneficial for learners' understanding in reading comprehension scenarios. Objectives We investigate the positive and negative effects of automatically generated short-answer questions on learning outcomes in a reading comprehension scenario. Methods A learner-centric, between-groups, quasi-experimental reading comprehension case study with 48 college students is conducted. We test two hypotheses concerning positive and negative effects on learning outcomes during the text comprehension of science texts and descriptively explore how the generated questions influenced learners. Results The results show a positive effect of the generated questions on the participants' learning outcomes. However, we cannot entirely exclude question-induced adverse side effects on learning of non-questioned information. Interestingly, questions identified as computer-generated by learners nevertheless seemed to benefit their understanding. Take Away Automatic question generation positively impacts reading comprehension in the given scenario. In the reported case study, even questions recognized as computer-generated supported reading comprehension.
... If the human mistakenly trusts the AI system in regions where it is likely to err, catastrophic failures may occur. This is a strong argument in favour of Bayesian approaches to probabilistic reasoning: research in the intersection of AI and HCI has found that interaction improves when setting expectations right about what the system can do and how well it performs (Kocielnik et al., 2019; Bansal et al., 2019a). Guidelines have been produced, and they recommend to "Make clear what the system can do" (G1) and "Make clear how well the system can do what it can do" (G2). ...
Article
Full-text available
When collaborating with an AI system, we need to assess when to trust its recommendations. If we mistakenly trust it in regions where it is likely to err, catastrophic failures may occur, hence the need for Bayesian approaches for probabilistic reasoning in order to determine the confidence (or epistemic uncertainty) in the probabilities in light of the training data. We propose an approach to Bayesian inference of posterior distributions that overcomes the independence assumption behind most of the approaches dealing with a large class of probabilistic reasoning that includes Bayesian networks as well as several instances of probabilistic logic. We provide an algorithm for Bayesian inference of posterior distributions from sparse, albeit complete, observations, and for deriving inferences and their confidences keeping track of the dependencies between variables when they are manipulated within the unifying computational formalism provided by probabilistic circuits. Each leaf of such circuits is labelled with a beta-distributed random variable that provides us with an elegant framework for representing uncertain probabilities. We achieve better estimation of epistemic uncertainty than state-of-the-art approaches, including highly engineered ones, while being able to handle general circuits and with just a modest increase in the computational effort compared to using point probabilities.
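
The abstract's device of labelling each leaf with a beta-distributed random variable can be illustrated in isolation (a generic sketch of Beta posteriors, not the paper's circuit algorithm): a Beta distribution over a probability captures both a point estimate and the epistemic uncertainty about it, which shrinks as more evidence is observed. The counts below are arbitrary illustrative values.

    # Generic illustration: Beta posteriors over the same 0.6 success rate,
    # estimated from little vs. much data; epistemic uncertainty differs sharply.
    from scipy.stats import beta

    for successes, trials in [(3, 5), (300, 500)]:
        a, b = 1 + successes, 1 + (trials - successes)  # uniform Beta(1,1) prior + counts
        posterior = beta(a, b)
        lo, hi = posterior.interval(0.95)
        print(f"n={trials:>3}: mean={posterior.mean():.2f}  "
              f"95% credible interval=[{lo:.2f}, {hi:.2f}]  sd={posterior.std():.3f}")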
... This ambiguity is especially challenging to handle in a dialogue setting, where a system is limited by returning only one answer in response to each request, unlike in a web search setup where diversification of results is possible and acceptable (Vallet and Castells, 2012). Previous research has shown that users are much more forgiving about system mistakes if they can act on them with minimal effort spent (Kocielnik et al., 2019; Kiseleva et al., 2016b). Therefore it is more appropriate to ask a clarifying question in the case of user request ambiguity rather than generating incorrect answers. ...
Preprint
Full-text available
Enabling open-domain dialogue systems to ask clarifying questions when appropriate is an important direction for improving the quality of the system response. Namely, for cases when a user request is not specific enough for a conversation system to provide an answer right away, it is desirable to ask a clarifying question to increase the chances of retrieving a satisfying answer. To address the problem of 'asking clarifying questions in open-domain dialogues': (1) we collect and release a new dataset focused on open-domain single- and multi-turn conversations, (2) we benchmark several state-of-the-art neural baselines, and (3) we propose a pipeline consisting of offline and online steps for evaluating the quality of clarifying questions in various dialogues. These contributions are suitable as a foundation for further research.
... As demonstrated by Wang and colleagues (Wang et al. 2018), users develop implicit assumptions about how technological systems behave. In the context of AI systems, users assume that the system is neutral (Kocielnik et al. 2019) and fair (Martin 2019). Detecting that the AI system grants more loans to male applicants due to gender bias violates those initial assumptions about the AI system. ...
Conference Paper
Biases in Artificial Intelligence (AI) can reinforce social inequality. Increasing transparency of AI systems through explanations can help to avoid the negative consequences of those biases. However, little is known about how users evaluate explanations of biased AI systems. Thus, we apply the Psychological Contract Violation Theory to investigate the implications of a gender-biased AI system on user trust. We allocated 339 participants into three experimental groups, each with a different loan forecasting AI system version: explainable gender-biased, explainable neutral, and non-explainable AI system. We demonstrate that only users with moderate to high general awareness of gender stereotypes in society, i.e., stigma consciousness, perceive the gender-biased AI system as not trustworthy. However, users with low stigma consciousness perceive the gender-biased AI system as trustworthy as it is more transparent than a system without explanations. Our findings show that AI biases can reinforce social inequality if they match with human stereotypes.
... The controlled launch and promotion of the agents also helped to set expectations around the systems [77]. All editorials stressed the recent nature of the agents, the fact that they were machines (and therefore should not be expected to behave as their doctors) and outlined the contents the CAs were able to deliver. ...
Article
Full-text available
One of the key aspects in the process of caring for people with diabetes is Therapeutic Education (TE). TE is a teaching process for training patients so that they can self-manage their care plan. Alongside traditional methods of providing educational content, there are now alternative forms of delivery thanks to the implementation of advanced Information Technologies systems such as conversational agents (CAs). In this context, we present the AIDA project: an ensemble of two different CAs intended to provide a TE tool for people with diabetes. The Artificial Intelligence Diabetes Assistant (AIDA) consists of a text-based chatbot and a speech-based dialog system. Their content has been created and validated by a scientific board. AIDA Chatbot—the text-based agent—provides a broad spectrum of information about diabetes, while AIDA Cookbot—the voice-based agent—presents recipes compliant with a diabetic patient’s diet. We provide a thorough description of the development process for both agents, the technology employed and their usage by the general public. AIDA Chatbot and AIDA Cookbot are freely available and they represent the first example of conversational agents in Italian to support diabetes patients, clinicians and caregivers.
... Overall, a best practice is to conduct a manual 'sanity check' for the questions before conducting the interviews. This is because an "AI-generated" approach that uses algorithms might miss specific contextual and cultural nuances important for a given use case [29,30]. ...
Chapter
Full-text available
Personas are often created based on user interviews. Yet, researchers rarely make their interview questions publicly available or justify how they were chosen. We manually extract 276 interview questions and categorize them into 10 themes, making this list publicly available for researchers and practitioners. We also demonstrate an approach of using natural language processing to assist in selecting persona interview questions for a given use case.
... Highly relevant to our research, prior work has studied human perceptions of AI, including among data scientists [119], young children [124], and the public at large [27,64]. For example, Kocielnik et al. studied how people's expectations regarding AI impacted their perceptions of accuracy and acceptance [65], and Jakesch et al. examined how the perception that text was written by AI affects trustworthiness [61]. Kelley et. ...
... In a similar vein, Liao et al. [48] called for more design research in which "design practitioners perform an indispensable role." Hence, one stream of CHI research mainly focused on why it is uniquely difficult to design AI and how to solve emerging problems [4,20,38,48,53,81,89]. ...
Conference Paper
Recently, Artificial Intelligence (AI) has been used to enable efficient decision-making in managerial and organizational contexts, ranging from employment to dismissal. However, to avoid employees’ antipathy toward AI, it is important to understand what aspects of AI employees like and/or dislike. In this paper, we aim to identify how employees perceive current human resource (HR) teams and future algorithmic management. Specifically, we explored what factors negatively influence employees’ perceptions of AI making work performance evaluations. Through in-depth interviews with 21 workers, we found that 1) employees feel six types of burdens (i.e., emotional, mental, bias, manipulation, privacy, and social) toward AI's introduction to human resource management (HRM), and that 2) these burdens could be mitigated by incorporating transparency, interpretability, and human intervention to algorithmic decision-making. Based on our findings, we present design efforts to alleviate employees’ burdens. To leverage AI for HRM in fair and trustworthy ways, we call for the HCI community to design human-AI collaboration systems with various HR stakeholders.
... In other cases, users might configure ADM systems by specifying what kind of prediction errors carry greater weight for them. This issue has been illustrated with regard to scheduling assistants which detect meeting requests from e-mails (Kocielnik et al. 2019). Even with the same level of overall accuracy, ADM system configurations can vary with regard to different error rates: It can create false positives (mails wrongly classified as positive), and false negatives (mails falsely classified as negative, i.e. overlooked meetings), and the ratio of these errors can be altered to some degree. ...
Article
Full-text available
Algorithmic systems that provide services to people by supporting or replacing human decision-making promise greater convenience in various areas. The opacity of these applications, however, means that it is not clear how much they truly serve their users. A promising way to address the issue of possible undesired biases consists in giving users control by letting them configure a system and aligning its performance with users' own preferences. However, as the present paper argues, this form of control over an algorithmic system demands an algorithmic literacy that also entails a certain way of making oneself knowable: users must interrogate their own dispositions and see how these can be formalized such that they can be translated into the algorithmic system. This may, however, extend already existing practices through which people are monitored and probed and means that exerting such control requires users to direct a computational mode of thinking at themselves.
... Research in XAI, or more generally in AI-systems acceptance, rarely explores users' expectations of system outputs [94]. Our results demonstrated that they are indeed a relevant factor in producing satisfying explanations since our users explicitly reported higher satisfaction if the output of the system was what they expected, even if the AI-system output and the users' expectation were both wrong. ...
Article
Research in the social sciences has shown that expectations are an important factor in explanations as used between humans: rather than explaining the cause of an event per se, the explainer will often address another event that did not occur but that the explainee might have expected. For AI-powered systems, this finding suggests that explanation-generating systems may need to identify such end user expectations. In general, this is a challenging task, not least because users often keep them implicit; there is thus a need to investigate the importance of such an ability. In this paper, we report an empirical study with 181 participants who were shown outputs from a text classifier system along with an explanation of why the system chose a particular class for each text. Explanations were either factual, explaining why the system produced a certain output, or counterfactual, explaining why the system produced one output instead of another. Our main hypothesis was that explanations should align with end user expectations; that is, a factual explanation should be given when the system's output is in line with end user expectations, and a counterfactual explanation when it is not. We find that factual explanations are indeed appropriate when expectations and output match. When they do not, neither factual nor counterfactual explanations appear appropriate, although we do find indications that our counterfactual explanations contained at least some necessary elements. Overall, this suggests that it is important for systems that create explanations of AI systems to infer what outputs the end user expected so that factual explanations can be generated at the appropriate moments. At the same time, this information is, by itself, not sufficient to also create appropriate explanations when the output and user expectations do not match. This is somewhat surprising given investigations of explanations in the social sciences, and will need more scrutiny in future studies.
... Compared to Study 1's confidence alerts, Study 2's confidence alerts were presented in a more concise manner in order to reduce cognitive load. At the end of each treatment, participants rated their agreement with the 7-point Likert-type statement "I would use AI Assistant 1 [or 2] if it were available to me, " adapted from Kocielnik et al. [42]. ...
Preprint
An important goal in the field of human-AI interaction is to help users more appropriately trust AI systems' decisions. A situation in which the user may particularly benefit from more appropriate trust is when the AI receives anomalous input or provides anomalous output. To the best of our knowledge, this is the first work towards understanding how anomaly alerts may contribute to appropriate trust of AI. In a formative mixed-methods study with 4 radiologists and 4 other physicians, we explore how AI alerts for anomalous input, very high and low confidence, and anomalous saliency-map explanations affect users' experience with mockups of an AI clinical decision support system (CDSS) for evaluating chest x-rays for pneumonia. We find evidence suggesting that the four anomaly alerts are desired by non-radiologists, and the high-confidence alerts are desired by both radiologists and non-radiologists. In a follow-up user study, we investigate how high- and low-confidence alerts affect the accuracy and thus appropriate trust of 33 radiologists working with AI CDSS mockups. We observe that these alerts do not improve users' accuracy or experience and discuss potential reasons why.
... Machine Learning (ML) has become an important technique to leverage the potential of data and allows businesses to be more innovative [1], efficient [13], and sustainable [22]. However, the success of many productive ML applications in real-world settings falls short of expectations [21]. A large number of ML projects fail, with many ML proofs of concept never progressing as far as production [30]. ...
Preprint
Full-text available
The final goal of all industrial machine learning (ML) projects is to develop ML products and rapidly bring them into production. However, it is highly challenging to automate and operationalize ML products and thus many ML endeavors fail to deliver on their expectations. The paradigm of Machine Learning Operations (MLOps) addresses this issue. MLOps includes several aspects, such as best practices, sets of concepts, and development culture. However, MLOps is still a vague term and its consequences for researchers and professionals are ambiguous. To address this gap, we conduct mixed-method research, including a literature review, a tool review, and expert interviews. As a result of these investigations, we provide an aggregated overview of the necessary principles, components, and roles, as well as the associated architecture and workflows. Furthermore, we furnish a definition of MLOps and highlight open challenges in the field. Finally, this work provides guidance for ML researchers and practitioners who want to automate and operate their ML products with a designated set of technologies.
... Several studies reported challenges in prototyping the user experience of AI systems [21,93]. In response, researchers have developed practitioner-facing AI tools, methods, guidelines, and design patterns to aid designers in accounting for AI systems' UX breakdowns [2,3,31,51,61,65], such as planning for AI inference errors [38] or setting user expectations [45]. ...
Conference Paper
Full-text available
HCI research has explored AI as a design material, suggesting that designers can envision AI's design opportunities to improve UX. Recent research claimed that enterprise applications offer an opportunity for AI innovation at the user experience level. We conducted design workshops to explore the practices of experienced designers who work on cross-functional AI teams in the enterprise. We discussed how designers successfully work with and struggle with AI. Our findings revealed that designers can innovate at the system and service levels. We also discovered that making a case for an AI feature's return on investment is a barrier for designers when they propose AI concepts and ideas. Our discussions produced novel insights on designers' role on AI teams, and the boundary objects they used for collaborating with data scientists. We discuss the implications of these findings as opportunities for future research aiming to empower designers in working with data and AI.
... Future deployments of AI technologies, including those used during clinical examinations and surgeries, will inevitably present users with incorrect classifications and recommendations. This raises questions regarding the end-user acceptance and trust towards these systems [21,37], (legal) accountability in case of errors [30,38], and the impact of AI support systems on user interaction. This work presents colonoscopy as a case study for understanding the effects of imperfect AI support on end users during continuous interactions. ...
Article
Full-text available
The use of artificial intelligence (AI) in clinical support systems is increasing. In this article, we focus on AI support for continuous interaction scenarios. A thorough understanding of end-user behaviour during these continuous human-AI interactions, in which user input is sustained over time and during which AI suggestions can appear at any time, is still missing. We present a controlled lab study involving 21 endoscopists and an AI colonoscopy support system. Using a custom-developed application and an off-the-shelf videogame controller, we record participants’ navigation behaviour and clinical assessment across 14 endoscopic videos. Each video is manually annotated to mimic an AI recommendation, being either true positive or false positive in nature. We find that time between AI recommendation and clinical assessment is significantly longer for incorrect assessments. Further, the type of medical content displayed significantly affects decision time. Finally, we discover that the participant’s clinical role plays a large part in the perception of clinical AI support systems. Our study presents a realistic assessment of the effects of imperfect and continuous AI support in a clinical scenario.
... HCI perspectives about the user interface can improve AI through better quality feedback on performance [81]. For example, AI output presentation can impact end-users' subjective perception of errors and how they adjust their expectations about AI [54]. ...
... In the context of HCI, an often-used definition of trust is "the attitude that an agent will help achieve an individual's goals in a situation characterized by uncertainty and vulnerability" [77, p. 51]. Trust in the system influences whether and how AI is used in practice, which in turn is highly relevant when designing AI. For example, how the AI and its capabilities are framed by the designer highly impacts acceptance and accuracy perceptions of users [76]. While trust has been established as an influential factor in technology use, it has only been fairly recently that methods have emerged to systematically include trust insights in system design [120]. ...
Conference Paper
Full-text available
While artificial intelligence (AI) is increasingly applied for decision-making processes, ethical decisions pose challenges for AI applications. Given that humans cannot always agree on the right thing to do, how would ethical decision-making by AI systems be perceived and how would responsibility be ascribed in human-AI collaboration? In this study, we investigate how the expert type (human vs. AI) and level of expert autonomy (adviser vs. decider) influence trust, perceived responsibility, and reliance. We find that participants consider humans to be more morally trustworthy but less capable than their AI equivalent. This shows in participants’ reliance on AI: AI recommendations and decisions are accepted more often than the human expert’s. However, AI team experts are perceived to be less responsible than humans, while programmers and sellers of AI systems are deemed partially responsible instead.
... Secondly, testing different ML errors in advance would allow designers to create user interfaces that respond or adapt to ML errors, making the overall UX more pleasant or understandable. Previous studies have shown that ML errors can greatly impact the UX [1,10,11], the mental models users form to interact with the system, and their trust in the ML model [17]. Different error types could differently influence the UX, e.g. ...
Article
Emerging input techniques that rely on sensing and recognition can misinterpret a user's intention, resulting in errors and, potentially, a negative user experience. To enhance the development of such input techniques, it is valuable to understand the implications of these errors, but they can be very costly to simulate. Through two controlled experiments, this work explores various low-cost methods for evaluating error acceptability of freehand mid-air gestural input in virtual reality. Using a gesture-driven game and a drawing application, the first experiment elicited error characteristics through text descriptions, video demonstrations, and a touchscreen-based interactive simulation. The results revealed that video effectively conveyed the dynamics of errors, whereas the interactive modalities effectively reproduced the user experience of effort and frustration. The second experiment contrasts the interactive touchscreen simulation with the target modality - a full VR simulation - and highlights the relative costs and benefits for assessment in an alternative, but still interactive, modality. These findings introduce a spectrum of low-cost methods for evaluating recognition-based errors in VR and a series of characteristics that can be understood in each.
Article
Enterprises have recently adopted AI to human resource management (HRM) to evaluate employees’ work performance evaluation. However, in such an HRM context where multiple stakeholders are complexly intertwined with different incentives, it is problematic to design AI reflecting one stakeholder group's needs (e.g., enterprises, HR managers). Our research aims to investigate what tensions surrounding AI in HRM exist among stakeholders and explore design solutions to balance the tensions. By conducting stakeholder-centered participatory workshops with diverse stakeholders (including employees, employers/HR teams, and AI/business experts), we identified five major tensions: 1) divergent perspectives on fairness, 2) the accuracy of AI, 3) the transparency of the algorithm and its decision process, 4) the interpretability of algorithmic decisions, and 5) the trade-off between productivity and inhumanity. We present stakeholder-centered design ideas for solutions to mitigate these tensions and further discuss how to promote harmony among various stakeholders at the workplace.
Article
Background Artificial intelligence (AI), such as machine learning (ML), shows great promise for improving clinical decision-making in cardiac diseases by outperforming statistical-based models. However, few AI-based tools have been implemented in cardiology clinics because of the sociotechnical challenges during transitioning from algorithm development to real-world implementation. Objective This study explored how an ML-based tool for predicting ventricular tachycardia and ventricular fibrillation (VT/VF) could support clinical decision-making in the remote monitoring of patients with an implantable cardioverter defibrillator (ICD). Methods Seven experienced electrophysiologists participated in a near-live feasibility and qualitative study, which included walkthroughs of 5 blinded retrospective patient cases, use of the prediction tool, and questionnaires and interview questions. All sessions were video recorded, and sessions evaluating the prediction tool were transcribed verbatim. Data were analyzed through an inductive qualitative approach based on grounded theory. Results The prediction tool was found to have potential for supporting decision-making in ICD remote monitoring by providing reassurance, increasing confidence, acting as a second opinion, reducing information search time, and enabling delegation of decisions to nurses and technicians. However, the prediction tool did not lead to changes in clinical action and was found less useful in cases where the quality of data was poor or when VT/VF predictions were found to be irrelevant for evaluating the patient. Conclusions When transitioning from AI development to testing its feasibility for clinical implementation, we need to consider the following: expectations must be aligned with the intended use of AI; trust in the prediction tool is likely to emerge from real-world use; and AI accuracy is relational and dependent on available information and local workflows. Addressing the sociotechnical gap between the development and implementation of clinical decision-support tools based on ML in cardiac care is essential for succeeding with adoption. It is suggested to include clinical end-users, clinical contexts, and workflows throughout the overall iterative approach to design, development, and implementation.
Chapter
Explaining artificial intelligence (AI) to people is crucial since the large number of AI-generated results can greatly affect people’s decision-making process in our daily life. Chatbots have great potential to serve as an effective tool to explain AI. Chatbots have the advantage of conducting proactive interactions and collecting customer requests with high availability and scalability. We make the first-step exploration of using chatbots to explain AI. We propose a chatbot explanation framework which includes proactive interactions on the explanation of the AI model and the explanation of the confidence level of AI-generated results. In addition, to understand what users would like to know about AI for further improvement on the chatbot design, our framework also collects users’ requests about AI-generated results. Our preliminary evaluation shows the effectiveness of our chatbot to explain AI and gives us important design implications for further improvements.
Article
This work contributes a research protocol for evaluating human-AI interaction in the context of specific AI products. The research protocol enables UX and HCI researchers to assess different human-AI interaction solutions and validate design decisions before investing in engineering. We present a detailed account of the research protocol and demonstrate its use by employing it to study an existing set of human-AI interaction guidelines. We used factorial surveys with a 2x2 mixed design to compare user perceptions when a guideline is applied versus violated, under conditions of optimal versus sub-optimal AI performance. The results provided both qualitative and quantitative insights into the UX impact of each guideline. These insights can support creators of user-facing AI systems in their nuanced prioritization and application of the guidelines.
Article
Analyzing usability test videos is arduous. Although recent research showed the promise of AI in assisting with such tasks, it remains largely unknown how AI should be designed to facilitate effective collaboration between user experience (UX) evaluators and AI. Inspired by the concepts of agency and work context in human and AI collaboration literature, we studied two corresponding design factors for AI-assisted UX evaluation: explanations and synchronization. Explanations allow AI to further inform humans how it identifies UX problems from a usability test session; synchronization refers to the two ways humans and AI collaborate: synchronously and asynchronously. We iteratively designed a tool-AI Assistant-with four versions of UIs corresponding to the two levels of explanations (with/without) and synchronization (sync/async). By adopting a hybrid wizard-of-oz approach to simulating an AI with reasonable performance, we conducted a mixed-method study with 24 UX evaluators identifying UX problems from usability test videos using AI Assistant. Our quantitative and qualitative results show that AI with explanations, regardless of being presented synchronously or asynchronously, provided better support for UX evaluators' analysis and was perceived more positively; when without explanations, synchronous AI better improved UX evaluators' performance and engagement compared to the asynchronous AI. Lastly, we present the design implications for AI-assisted UX evaluation and facilitating more effective human-AI collaboration.
Article
Full-text available
Donation-based support for open, peer production projects such as Wikipedia is an important mechanism for preserving their integrity and independence. For this reason understanding donation behavior and incentives is crucial in this context. In this work, using a dataset of aggregated donation information from Wikimedia's 2015 fund-raising campaign, representing nearly 1 million pages from English and French language versions of Wikipedia, we explore the relationship between the properties of contents of a page and the number of donations on this page. Our results suggest the existence of a reciprocity mechanism, meaning that articles that provide more utility value attract a higher rate of donation. We discuss these and other findings focusing on the impact they may have on the design of banner-based fundraising campaigns. Our findings shed more light on the mechanisms that lead people to donate to Wikipedia and the relation between properties of contents and donations.
Article
Full-text available
Mobile, wearable and other connected devices allow people to collect and explore large amounts of data about their own activities, behavior, and well-being. Yet, learning from, and acting upon, such data remain a challenge. The process of reflection has been identified as a key component of such learning. However, most tools do not explicitly design for reflection, carrying an implicit assumption that providing access to self-tracking data is sufficient. In this paper, we present Reflection Companion, a mobile conversational system that supports engaging reflection on personal sensed data, specifically physical activity data collected with fitness trackers. Reflection Companion delivers daily adaptive mini-dialogues and graphs to users' mobile phones to promote reflection. To generate our system's mini dialogues, we conducted a set of workshops with fitness tracker users, producing a diverse corpus of 275 reflection questions synthesized into a set of 25 reflection mini dialogues. In a 2-week field deployment with 33 active Fitbit users, we examined our system's ability to engage users in reflection through dialog. Results suggest that the mini-dialogues were successful in triggering reflection and that this reflection led to increased motivation, empowerment, and adoption of new behaviors. As a strong indicator of our system's value, 16 of the 33 participants elected to continue using the system for two additional weeks without compensation. We present our findings and describe implications for the design of technology-supported dialog systems for reflection on data.
Article
Full-text available
Machine learning (ML) has become increasingly influential to human society, yet the primary advancements and applications of ML are driven by research in only a few computational disciplines. Even applications that affect or analyze human behaviors and social structures are often developed with limited input from experts outside of computational fields. Social scientists—experts trained to examine and explain the complexity of human behavior and interactions in the world—have considerable expertise to contribute to the development of ML applications for human-generated data, and their analytic practices could benefit from more human-centered ML methods. Although a few researchers have highlighted some gaps between ML and social sciences [51, 57, 70], most discussions only focus on quantitative methods. Yet many social science disciplines rely heavily on qualitative methods to distill patterns that are challenging to discover through quantitative data. One common analysis method for qualitative data is qualitative coding. In this article, we highlight three challenges of applying ML to qualitative coding. Additionally, we utilize our experience of designing a visual analytics tool for collaborative qualitative coding to demonstrate the potential in using ML to support qualitative coding by shifting the focus to identifying ambiguity. We illustrate dimensions of ambiguity and discuss the relationship between disagreement and ambiguity. Finally, we propose three research directions to ground ML applications for social science as part of the progression toward human-centered machine learning.
Conference Paper
Full-text available
Everyday predictive systems typically present point predictions, making it hard for people to account for uncertainty when making decisions. Evaluations of uncertainty displays for transit prediction have assessed people's ability to extract probabilities, but not the quality of their decisions. In a controlled, incentivized experiment, we had subjects decide when to catch a bus using displays with textual uncertainty, uncertainty visualizations, or no-uncertainty (control). Frequency-based visualizations previously shown to allow people to better extract probabilities (quantile dotplots) yielded better decisions. Decisions with quantile dotplots with 50 outcomes were (1) better on average, having expected payoffs 97% of optimal (95% CI: [95%,98%]), 5 percentage points more than control (95% CI: [2,8]); and (2) more consistent, having within-subject standard deviation of 3 percentage points (95% CI: [2,4]), 4 percentage points less than control (95% CI: [2,6]). Cumulative distribution function plots performed nearly as well, and both outperformed textual uncertainty, which was sensitive to the probability interval communicated. We discuss implications for realtime transit predictions and possible generalization to other domains.
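
The quantile dotplot idea from the abstract above is straightforward to reproduce (a generic sketch; the predictive distribution and the 20-dot count are assumptions for illustration, not the study's transit data): take k evenly spaced quantiles of the predictive distribution and show each as one dot, so a rider can estimate risk by counting dots.

    # Sketch of a 20-outcome quantile dotplot for a predicted bus arrival time.
    import numpy as np
    from scipy.stats import lognorm

    predictive = lognorm(s=0.3, scale=12.0)        # assumed: arrival in roughly 12 minutes
    k = 20
    probs = (np.arange(k) + 0.5) / k               # evenly spaced quantile levels
    dots = predictive.ppf(probs)                   # one representative arrival time per dot
    print(np.round(dots, 1))

    # Counting dots below a deadline approximates the chance of catching the bus:
    print("P(arrival <= 10 min) ~", np.mean(dots <= 10))

In the visualizations themselves these outcomes are rendered as a stacked dotplot; the dot-counting logic shown here is what makes the probability readable at a glance.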
Article
Full-text available
Our ultimate goal is to efficiently enable end-users to correctly anticipate a robot's behavior in novel situations. This behavior is often a direct result of the robot's underlying objective function. Our insight is that end-users need to have an accurate mental model of this objective function in order to understand and predict what the robot will do. While people naturally develop such a mental model over time through observing the robot act, this familiarization process may be lengthy. Our approach reduces this time by having the robot model how people infer objectives from observed behavior, and then selecting those behaviors that are maximally informative. The problem of computing a posterior over objectives from observed behavior is known as Inverse Reinforcement Learning (IRL), and has been applied to robots learning human objectives. We consider the problem where the roles of human and robot are swapped. Our main contribution is to recognize that unlike robots, humans will not be exact in their IRL inference. We thus introduce two factors to define candidate approximate-inference models for human learning in this setting, and analyze them in a user study in the autonomous driving domain. We show that certain approximate-inference models lead to the robot generating example behaviors that better enable users to anticipate what the robot will do in test situations. Our results also suggest, however, that additional research is needed in modeling how humans extrapolate from examples of robot behavior.
Article
Full-text available
We develop and test a model that suggests that expectations influence subjective usability and emotional experiences and, thereby, behavioral intentions to continue use and to recommend the service to others. A longitudinal study of 165 real-life users examined the proposed model in a proximity mobile payment domain at three time points: before use, after three weeks of use, and after six weeks of use. The results confirm the short-term influence of expectations on users' evaluations of both usability and enjoyment of the service after three weeks of real-life use. Users' evaluations of their experiences mediated the influence of expectations on behavioral intentions. However, after six weeks, users' cumulative experiences of the mobile payment service had the strongest impact on their evaluations and the effect of pre-use expectations decreased. The research clarifies the role of expectations and highlights the importance of viewing expectations through a temporal perspective when evaluating user experience.
Conference Paper
Full-text available
Robots have the potential to save lives in emergency scenarios, but could have an equally disastrous effect if participants overtrust them. To explore this concept, we performed an experiment where a participant interacts with a robot in a non-emergency task to experience its behavior and then chooses whether to follow the robot's instructions in an emergency or not. Artificial smoke and fire alarms were used to add a sense of urgency. To our surprise, all 26 participants followed the robot in the emergency, despite half observing the same robot perform poorly in a navigation guidance task just minutes before. We performed additional exploratory studies investigating different failure modes. Even when the robot pointed to a dark room with no discernible exit the majority of people did not choose to safely exit the way they entered.
Article
Full-text available
Before using an interactive product, people form expectations about what the experience of use will be like. These expectations may affect both the use of the product and users’ attitudes toward it. This article briefly reviews existing theories of expectations to design and perform two crowdsourced experiments that investigate how expectations affect user experience measures. In the experiments, participants saw a primed or neutral review of a simple online game, played it, and rated it on various user experience measures. Results suggest that when expectations are confirmed, users tend to assimilate their ratings with their expectations; conversely, if the product quality is inconsistent with expectations, users tend to contrast their ratings with expectations and give ratings correlated with the level of disconfirmation. Results also suggest that expectation disconfirmation can be used more widely in analyses of user experience, even when the analyses are not specifically concerned with expectation disconfirmation.
Article
Full-text available
Quick interaction between a human teacher and a learning machine presents numerous benefits and challenges when working with web-scale data. The human teacher guides the machine towards accomplishing the task of interest. The learning machine leverages big data to find examples that maximize the training value of its interaction with the teacher. When the teacher is restricted to labeling examples selected by the machine, this problem is an instance of active learning. When the teacher can provide additional information to the machine (e.g., suggestions on what examples or predictive features should be used) as the learning task progresses, then the problem becomes one of interactive learning. To accommodate the two-way communication channel needed for efficient interactive learning, the teacher and the machine need an environment that supports an interaction language. The machine can access, process, and summarize more examples than the teacher can see in a lifetime. Based on the machine's output, the teacher can revise the definition of the task or make it more precise. Both the teacher and the machine continuously learn and benefit from the interaction. We have built a platform to (1) produce valuable and deployable models and (2) support research on both the machine learning and user interface challenges of the interactive learning problem. The platform relies on a dedicated, low-latency, distributed, in-memory architecture that allows us to construct web-scale learning machines with quick interaction speed. The purpose of this paper is to describe this architecture and demonstrate how it supports our research efforts. Preliminary results are presented as illustrations of the architecture but are not the primary focus of the paper.
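When the teacher only labels examples the machine selects, the loop described above reduces to active learning. The sketch below illustrates that loop with simple uncertainty sampling on synthetic data; it is a toy illustration, not the authors' web-scale platform.

```python
# Toy active-learning loop: the model picks its least certain example,
# the "teacher" (here, ground-truth labels) provides the label.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Seed the labeled pool with a few examples from each class.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
unlabeled = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for round_ in range(5):
    model.fit(X[labeled], y[labeled])
    # Select the example closest to the decision boundary (most uncertain)...
    probs = model.predict_proba(X[unlabeled])[:, 1]
    pick = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]
    # ...and ask the teacher for its label.
    labeled.append(pick)
    unlabeled.remove(pick)
    print(f"round {round_}: accuracy {model.score(X, y):.2f}")
```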
Article
Full-text available
Intelligent interactive systems (IIS) have great potential to improve users' experience with technology by tailoring their behaviour and appearance to users' individual needs; however, these systems, with their complex algorithms and dynamic behaviour, can also suffer from a lack of comprehensibility and transparency. We present the results of two studies examining the comprehensibility of, and desire for explanations with, deployed low-cost IIS. The first study, a set of interviews with 21 participants, reveals that i) comprehensibility is not always dependent on explanations, and ii) the perceived cost of viewing explanations tends to outweigh the anticipated benefits. Our second study, a two-week diary study with 14 participants, confirms these findings in the context of daily use, with participants indicating a desire for an explanation in only 7% of diary entries. We discuss the implications of our findings for the design of explanation facilities.
Article
Full-text available
Context-aware intelligent systems employ implicit inputs and make decisions based on complex rules and machine learning models that are rarely clear to users. Such lack of system intelligibility can lead to loss of user trust, satisfaction and acceptance of these systems. However, automatically providing explanations about a system's decision process can help mitigate this problem. In this paper we present results from a controlled study with over 200 participants in which the effectiveness of different types of explanations was examined. Participants were shown examples of a system's operation along with various automatically generated explanations, and then tested on their understanding of the system. We show, for example, that explanations describing why the system behaved a certain way resulted in better understanding and stronger feelings of trust. Explanations describing why the system did not behave a certain way resulted in lower understanding yet adequate performance. We discuss implications for the use of our findings in real-world context-aware applications.
Article
Full-text available
The present research develops and tests a theoretical extension of the Technology Acceptance Model (TAM) that explains perceived usefulness and usage intentions in terms of social influence and cognitive instrumental processes. The extended model, referred to as TAM2, was tested using longitudinal data collected regarding four different systems at four organizations (N = 156), two involving voluntary usage and two involving mandatory usage. Model constructs were measured at three points in time at each organization: preimplementation, one month postimplementation, and three months postimplementation. The extended model was strongly supported for all four organizations at all three points of measurement, accounting for 40%-60% of the variance in usefulness perceptions and 34%-52% of the variance in usage intentions. Both social influence processes (subjective norm, voluntariness, and image) and cognitive instrumental processes (job relevance, output quality, result demonstrability, and perceived ease of use) significantly influenced user acceptance. These findings advance theory and contribute to the foundation for future research aimed at improving our understanding of user adoption behavior.
Conference Paper
Full-text available
Context-aware applications should be intelligible so users can better understand how they work and improve their trust in them. However, providing intelligibility is non-trivial and requires the developer to understand how to generate explanations from application decision models. Furthermore, users need different types of explanations and this complicates the implementation of intelligibility. We have developed the Intelligibility Toolkit that makes it easy for application developers to obtain eight types of explanations from the most popular decision models of context-aware applications. We describe its extensible architecture, and the explanation generation algorithms we developed. We validate the usefulness of the toolkit with three canonical applications that use the toolkit to generate explanations for end-users.
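To make the idea of multiple explanation types concrete, the sketch below generates "why" and "why not" explanations from a toy rule-based decision model for a hypothetical context-aware phone-ringer application; the rules and interface are illustrative and not the Intelligibility Toolkit's API.

```python
# Toy "why" / "why not" explanation generation from a rule-based model.
from typing import Dict, List, Tuple

# Each rule: (output, list of (context attribute, required value)).
rules: List[Tuple[str, List[Tuple[str, str]]]] = [
    ("silent", [("location", "meeting room"), ("calendar", "busy")]),
    ("ring",   [("location", "home")]),
]

def explain(context: Dict[str, str], question: str) -> str:
    for output, conditions in rules:
        satisfied = [(a, v) for a, v in conditions if context.get(a) == v]
        failed = [(a, v) for a, v in conditions if context.get(a) != v]
        if question == f"why {output}" and not failed:
            # All conditions hold: explain the decision by its satisfied conditions.
            return f"'{output}' because " + " and ".join(f"{a} is {v}" for a, v in satisfied)
        if question == f"why not {output}" and failed:
            # Some condition failed: explain the alternative that did not fire.
            return f"not '{output}' because " + " and ".join(
                f"{a} is {context.get(a)!r}, not {v!r}" for a, v in failed)
    return "no applicable rule"

context = {"location": "meeting room", "calendar": "busy"}
print(explain(context, "why silent"))
print(explain(context, "why not ring"))
```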
Conference Paper
Full-text available
Intelligibility can help expose the inner workings and inputs of context-aware applications that tend to be opaque to users due to their implicit sensing and actions. However, users may not be interested in all the information that the applications can produce. Using scenarios of four real-world applications that span the design space of context-aware computing, we conducted two experiments to discover what information users are interested in. In the first experiment, we elicit the types of information demands that users have and under what moderating circumstances they have them. In the second experiment, we verify the findings by asking users which types of information they would want to know and establishing whether receiving such information would satisfy them. We discuss why users demand certain types of information, and provide design implications on how to provide different intelligibility types to make context-aware applications intelligible and acceptable to users.
Conference Paper
Full-text available
Understanding the complexities of users' judgements and user experience is a prerequisite for informing HCI design. Current user experience (UX) research emphasises that, beyond usability, non-instrumental aspects of system quality contribute to overall judgement and that the user experience is subjective and variable. Based on judgement and decision-making theory, we have previously demonstrated that judgement of websites can be influenced by contextual factors. This paper explores the strength of such contextual influence by investigating framing effects on user judgement of website quality. Two experimental studies investigate how the presentation of information about a website influences the user experience and the relative importance of individual quality attributes for overall judgement. Theoretical implications for the emerging field of UX research and practical implications for design are discussed.
Conference Paper
Full-text available
For automatic or context-aware systems a major issue is user trust, which is to a large extent determined by system reliability. For systems based on sensor input, which is inherently uncertain or even incomplete, there is little hope that they will ever be perfectly reliable. In this paper we test the hypothesis that explicitly displaying the current confidence of the system increases the usability of such systems. For the example of a context-aware mobile phone, the experiments show that displaying confidence information increases the user's trust in the system.
Conference Paper
Full-text available
Context-aware mobile applications and systems have been extensively explored in the last decade, and in the last few years we have already seen promising products on the market. Most of these applications assume that context data is highly accurate. In practice, however, this information is often unreliable, especially when gathered from sensors or external sources. Previous research has argued that system usability can be improved by displaying the uncertainty to the user. The research presented in this paper shows that it is not always an advantage to show the confidence of the context-aware application to the user. We developed a system for automatic form filling on mobile devices which fills in any web form with user data stored on the mobile device. The underlying algorithm generates rules indicating the probability with which each input field of a form should be filled with a given value. Based on this, we developed two versions of our system: one shows the uncertainty of the system and one does not. We then conducted a user study which shows that users need slightly more time and produce slightly more errors when the confidence of the system is visualized.
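A rough sketch of the kind of probabilistic form-filling rules described above, with hypothetical field names, values, and confidences, and a flag for showing or hiding the confidence as in the study's two system versions:

```python
# Illustrative probabilistic form-filling rules (not the authors' system).
from typing import Dict, Tuple

# rule base: field -> (suggested value, confidence in [0, 1]); all hypothetical.
rules: Dict[str, Tuple[str, float]] = {
    "name":    ("Jane Doe",         0.97),
    "email":   ("jane@example.org", 0.92),
    "phone":   ("+1 555 0100",      0.61),
    "company": ("Example GmbH",     0.35),
}

def fill_form(show_confidence: bool, threshold: float = 0.5) -> None:
    """Pre-fill fields whose confidence exceeds the threshold.

    One variant displays the confidence next to the value (the study's
    'uncertainty shown' condition); the other hides it.
    """
    for field, (value, p) in rules.items():
        if p < threshold:
            print(f"{field}: <left blank, confidence {p:.2f} too low>")
        elif show_confidence:
            print(f"{field}: {value}  (confidence {p:.0%})")
        else:
            print(f"{field}: {value}")

fill_form(show_confidence=True)
```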
Conference Paper
Full-text available
The aim of this project is the detection, analysis and recognition of facial features. The system operates on grayscale images. For the analysis, a Haar-like face detector was used along with an anthropometric face model and a hybrid feature detection approach. The system localizes 17 characteristic points of the analyzed face and, based on their displacements, certain emotions can be automatically recognized. The system was tested on the publicly available Japanese Female Facial Expression (JAFFE) database with ca. 77% accuracy for 7 basic emotions using various classifiers. Thanks to its open structure, the system can cooperate well with any HCI system.
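For orientation, the snippet below runs a Haar-cascade face detector on a grayscale image with OpenCV, the same detector family the abstract mentions; the 17-point landmark model and emotion classifier are not reproduced here, and 'photo.jpg' is a placeholder path.

```python
# Haar-cascade face detection with OpenCV (rough analogue, not the authors' system).
import cv2

img = cv2.imread("photo.jpg")                     # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)      # the system works on grayscale

# Built-in Haar cascade for frontal faces shipped with opencv-python.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    # In the described system, characteristic points would be localized inside
    # this region and their displacements fed to an emotion classifier.
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces.jpg", img)
print(f"Detected {len(faces)} face(s)")
```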
Article
Full-text available
Individual-level information systems adoption research has recently seen the introduction of expectation-disconfirmation theory (EDT) to explain how and why user reactions change over time. This prior research has produced valuable insights into the phenomenon of technology adoption beyond traditional models, such as the technology acceptance model. First, we identify gaps in EDT research that present potential opportunities for advances; specifically, we discuss methodological and analytical limitations in EDT research in information systems and present polynomial modeling and response surface methodology as solutions. Second, we draw from research on cognitive dissonance, realistic job preview, and prospect theory to present a polynomial model of expectation-disconfirmation in information systems. Finally, we test our model using data gathered over a period of 6 months among 1,143 employees being introduced to a new technology. The results confirmed our hypotheses that disconfirmation in general was bad, as evidenced by low behavioral intention to continue using a system for both positive and negative disconfirmation, thus supporting the need for a polynomial model to understand expectation disconfirmation in information systems.
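The polynomial (response-surface) modeling mentioned above typically regresses an outcome on expectations E, perceived performance P, and their second-order terms. The sketch below fits such a model on synthetic data purely to illustrate the form of the analysis; the variables, scales, and coefficients are not from the study.

```python
# Response-surface style polynomial regression on synthetic expectation data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
E = rng.uniform(1, 7, n)   # pre-use expectations (1-7 scale, synthetic)
P = rng.uniform(1, 7, n)   # perceived performance after use (synthetic)
# Synthetic continuance intention: highest when performance meets expectations.
intention = 4 + 0.6 * P - 0.4 * np.maximum(E - P, 0) + rng.normal(0, 0.5, n)

# Second-order polynomial terms used in response-surface analysis.
X = np.column_stack([E, P, E**2, E * P, P**2])
model = LinearRegression().fit(X, intention)

for name, b in zip(["E", "P", "E^2", "E*P", "P^2"], model.coef_):
    print(f"{name:>4s}: {b:+.3f}")
print(f"intercept: {model.intercept_:+.3f}")
```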
Article
Full-text available
Consumers rated several qualitative attributes of ground beef that framed the beef as either "75% lean" or "25% fat." The consumers' evaluations were more favorable toward the beef labeled "75% lean" than that labeled "25% fat." More importantly, the magnitude of this information framing effect lessened when consumers actually tasted the meat. We discuss these results in terms of an averaging model, which suggests that a diagnostic product experience dilutes the impact of information framing.
Article
The authors investigate whether it is necessary to include disconfirmation as an intervening variable affecting satisfaction as is commonly argued, or whether the effect of disconfirmation is adequately captured by expectation and perceived performance. Further, they model the process for two types of products, a durable and a nondurable good, using experimental procedures in which three levels of expectations and three levels of performance are manipulated for each product in a factorial design. Each subject's perceived expectations, performance evaluations, disconfirmation, and satisfaction are subsequently measured by using multiple measures for each construct. The results suggest the effects are different for the two products. For the nondurable good, the relationships are as typically hypothesized. The results for the durable good are different in important respects. First, neither the disconfirmation experience nor subjects’ initial expectations affected subjects’ satisfaction with it. Rather, their satisfaction was determined solely by the performance of the durable good. Expectations did combine with performance to affect disconfirmation, though the magnitude of the disconfirmation experience did not translate into an impact on satisfaction. Finally, the direct performance-satisfaction link accounts for most of the variation in satisfaction.
Article
This study experimentally investigated the effects on product ratings of both overstatement and understatement of product quality. Results support common marketing practice in that overstatement resulted in more favorable ratings and understatement resulted in less favorable ratings.
Article
Results of a laboratory experiment indicate that customer satisfaction with a product is influenced by the effort expended to acquire the product, and the expectations concerning the product. Specifically, the experiment suggests that satisfaction with the product may be higher when customers expend considerable effort to obtain the product than when they use only modest effort. This finding is opposed to usual notions of marketing efficiency and customer convenience. The research also suggests that customer satisfaction is lower when the product does not come up to expectations than when the product meets expectations.
Article
Four psychological theories are considered in determining the effects of disconfirmed expectations on perceived product performance and consumer satisfaction. Results reveal that too great a gap between high consumer expectations and actual product performance may cause a less favorable evaluation of a product than a somewhat lower level of disparity.
Conference Paper
Algorithmic prioritization is a growing focus for social media users. Control settings are one way for users to adjust the prioritization of their news feeds, but they prioritize feed content in a way that can be difficult to judge objectively. In this work, we study how users engage with difficult-to-validate controls. Via two paired studies using an experimental system -- one interview and one online study -- we found that control settings functioned as placebos. Viewers felt more satisfied with their feed when controls were present, whether they worked or not. We also examine how people engage in sensemaking around control settings, finding that users often take responsibility for violated expectations -- for both real and randomly functioning controls. Finally, we studied how users controlled their social media feeds in the wild. The use of existing social media controls had little impact on users' satisfaction with the feed; instead, users often turned to improvised solutions, like scrolling quickly, to see what they wanted.
Conference Paper
Advances in artificial intelligence, sensors and big data management have far-reaching societal impacts. As these systems augment our everyday lives, it becomes increasingly important for people to understand them and remain in control. We investigate how HCI researchers can help to develop accountable systems by performing a literature analysis of 289 core papers on explanations and explainable systems, as well as 12,412 citing papers. Using topic modeling, co-occurrence and network analysis, we mapped the research space from diverse domains, such as algorithmic accountability, interpretable machine learning, context-awareness, cognitive psychology, and software learnability. We reveal fading and burgeoning trends in explainable systems, and identify domains that are closely connected or mostly isolated. The time is ripe for the HCI community to ensure that the powerful new autonomous systems have intelligible interfaces built-in. From our results, we propose several implications and directions for future research towards this goal.
Conference Paper
Machine learning (ML) is now a fairly established technology, and user experience (UX) designers appear to regularly integrate ML services into new apps, devices, and systems. Interestingly, this technology has not experienced the wealth of design innovation that other technologies have, and this might be because it is a new and difficult design material. To better understand why we have witnessed little design innovation, we conducted a survey of current UX practitioners with regards to how new ML services are envisioned and developed in UX practice. Our survey probed how ML may or may not have been a part of their UX design education, how they work to create new things with developers, and the challenges they have faced working with this material. We use the findings from this survey and our review of related literature to present a series of challenges for UX and interaction design research and education. Finally, we discuss areas where new research and new curriculum might help our community unlock the power of design thinking to re-imagine what ML might be and might do.
Article
Although information workers may complain about meetings, they are an essential part of their work life. Consequently, busy people spend a significant amount of time scheduling meetings. We present Calendar.help, a system that provides fast, efficient scheduling through structured workflows. Users interact with the system via email, delegating their scheduling needs to the system as if it were a human personal assistant. Common scheduling scenarios are broken down using well-defined workflows and completed as a series of microtasks that are automated when possible and executed by a human otherwise. Unusual scenarios fall back to a trained human assistant who executes them as unstructured macrotasks. We describe the iterative approach we used to develop Calendar.help, and share the lessons learned from scheduling thousands of meetings during a year of real-world deployments. Our findings provide insight into how complex information tasks can be broken down into repeatable components that can be executed efficiently to improve productivity.
Conference Paper
Shared expectations and mutual understanding are critical facets of teamwork. Achieving these in human-robot collaborative contexts can be especially challenging, as humans and robots are unlikely to share a common language to convey intentions, plans, or justifications. Even in cases where human co-workers can inspect a robot's control code, and particularly when statistical methods are used to encode control policies, there is no guarantee that meaningful insights into a robot's behavior can be derived or that a human will be able to efficiently isolate the behaviors relevant to the interaction. We present a series of algorithms and an accompanying system that enables robots to autonomously synthesize policy descriptions and respond to both general and targeted queries by human collaborators. We demonstrate applicability to a variety of robot controller types including those that utilize conditional logic, tabular reinforcement learning, and deep reinforcement learning, synthesizing informative policy descriptions for collaborators and facilitating fault diagnosis by non-experts.
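One simple flavor of policy description is to verbalize the greedy action per state of a tabular policy. The sketch below does this for a hypothetical Q-table; it is an illustration of the general idea, not the paper's synthesis algorithms.

```python
# Turning a tabular RL policy into short "when <state>, I <action>" summaries.
import numpy as np

states = ["at door", "holding part", "near human"]
actions = ["wait", "hand over", "move to bench"]

# Hypothetical learned Q-table: Q[s, a] = value of action a in state s.
Q = np.array([
    [0.2, 0.1, 0.9],
    [0.1, 0.8, 0.3],
    [0.7, 0.6, 0.2],
])

for s, state in enumerate(states):
    best = int(np.argmax(Q[s]))
    # Flag states where the best action barely beats the runner-up.
    margin = Q[s, best] - np.partition(Q[s], -2)[-2]
    hedge = "" if margin > 0.2 else " (close call)"
    print(f"When {state}, I {actions[best]}{hedge}.")
```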
Conference Paper
Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model, which can be used to transform an untrustworthy model or prediction into a trustworthy one. In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We also propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). We show the utility of explanations via novel experiments, both simulated and with human subjects, on various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and identifying why a classifier should not be trusted.
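The core of LIME is fitting an interpretable model locally around one prediction of a black box. The sketch below implements a minimal LIME-style local surrogate from scratch on a toy text classifier; it is illustrative only and not the released LIME library, and the texts, labels, and weighting scheme are made up.

```python
# Minimal LIME-style local surrogate for one prediction of a black-box text classifier.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.pipeline import make_pipeline

# A stand-in black box trained on a toy corpus.
texts = ["great movie", "terrible movie", "great plot", "terrible acting"]
labels = [1, 0, 1, 0]
black_box = make_pipeline(TfidfVectorizer(), LogisticRegression())
black_box.fit(texts, labels)

instance = "great movie terrible acting"
tokens = instance.split()
rng = np.random.default_rng(0)

# Perturb: randomly drop words; record binary "word present" features.
n_samples = 500
masks = rng.integers(0, 2, size=(n_samples, len(tokens)))
masks[0] = 1  # keep the original instance in the sample
perturbed = [" ".join(t for t, keep in zip(tokens, m) if keep) or tokens[0]
             for m in masks]
probs = black_box.predict_proba(perturbed)[:, 1]

# Weight samples by similarity to the original (fraction of words kept) and
# fit a sparse local linear surrogate; its coefficients act as the explanation.
weights = masks.mean(axis=1)
surrogate = Ridge(alpha=1.0)
surrogate.fit(masks, probs, sample_weight=weights)

for token, coef in sorted(zip(tokens, surrogate.coef_), key=lambda x: -abs(x[1])):
    print(f"{token:>10s}: {coef:+.3f}")
```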
Conference Paper
Autonomous systems are designed to take actions on behalf of users, acting autonomously upon data from sensors or online sources. As such, the design of interaction mechanisms that enable users to understand the operation of autonomous systems and flexibly delegate or regain control is an open challenge for HCI. Against this background, in this paper we report on a lab study designed to investigate whether displaying the confidence of an autonomous system about the quality of its work, which we call its confidence information, can improve user acceptance and interaction with autonomous systems. The results demonstrate that confidence information encourages the usage of the autonomous system we tested, compared to a situation where such information is not available. Furthermore, an additional contribution of our work is the methodology we employ to study users' incentives to do work in collaboration with the autonomous system. In experiments comparing different incentive strategies, our results indicate that our translation of behavioural economics research methods to HCI can support the study of interactions with autonomous systems in the lab.
Conference Paper
Users often rely on realtime predictions in everyday contexts like riding the bus, but may not grasp that such predictions are subject to uncertainty. Existing uncertainty visualizations may not align with user needs or how they naturally reason about probability. We present a novel mobile interface design and visualization of uncertainty for transit predictions on mobile phones based on discrete outcomes. To develop it, we identified domain specific design requirements for visualizing uncertainty in transit prediction through: 1) a literature review, 2) a large survey of users of a popular realtime transit application, and 3) an iterative design process. We present several candidate visualizations of uncertainty for realtime transit predictions in a mobile context, and we propose a novel discrete representation of continuous outcomes designed for small screens, quantile dotplots. In a controlled experiment we find that quantile dotplots reduce the variance of probabilistic estimates by ~1.15 times compared to density plots and facilitate more confident estimation by end-users in the context of realtime transit prediction scenarios.
Article
The aim was to perform a survey-based assessment of patients' knowledge of radiologic imaging examinations, including patients' perspectives regarding communication of such information. Adult patients were given a voluntary survey before undergoing an outpatient imaging examination at our institution. Survey questions addressed knowledge of various aspects of the examination, as well as experiences, satisfaction, and preferences regarding communication of such knowledge. A total of 176 surveys were completed by patients awaiting CT (n = 45), MRI (n = 41), ultrasound (n = 46), and nuclear medicine (n = 44) examinations. A total of 97.1% and 97.8% of patients correctly identified the examination modality and the body part being imaged, respectively. A total of 45.8% correctly identified whether the examination entailed radiation; 51.1% and 71.4% of patients receiving intravenous or oral contrast, respectively, correctly indicated its administration. A total of 78.6% indicated that the ordering physician explained the examination in advance; among these, 72.1% indicated satisfaction with the explanation. A total of 21.8% and 20.5% indicated consulting the Internet, or friends and family, respectively, to learn about the examination. An overall understanding of the examination was reported by 70.8%. A total of 18.8% had unanswered questions about the examination, most commonly regarding examination logistics, contrast-agent usage, and when results would be available. A total of 52.9% were interested in discussing the examination with a radiologist in advance. Level of understanding was greatest for CT and least for nuclear medicine examinations, and lower when patients had not previously undergone the given examination. Patients' knowledge of their imaging examinations is frequently incomplete. The findings may motivate initiatives to improve patients' understanding of their imaging examinations, enhancing patient empowerment and contributing to patient-centered care.
Conference Paper
Nowadays, people are overwhelmed with multiple tasks and responsibilities, resulting in increasing stress levels. At the same time, it becomes harder to find time for self-reflection and diagnosis of the problems that can be a source of stress. In this paper, we propose a tool that supports a person in self-reflection by providing views on life events in relation to the person's well-being in a concise and intuitive form. The tool, called LifelogExplorer, takes sensor data (such as skin conductance and accelerometer measurements) and data obtained from digital sources (such as personal calendars) as input and generates views on this data that are comprehensible and meaningful for the user, thanks to filtering and aggregation options that help cope with the data explosion. We evaluate our approach on data collected from two case studies focused on addressing stress at work: 1) with academic staff of a university, and 2) with teachers from a vocational school.
Article
Reviews interpretations of the effect of expectation and disconfirmation on perceived product performance. At issue is the relative effect of the initial expectation level and the degree of positive or negative disconfirmation on affective judgments following product exposure. Although the results of prior studies suggest a dominant expectation effect, it is argued that detection of the disconfirmation phenomenon may have been clouded by a conceptual and methodological overdetermination problem. To test this notion, 243 college students responded to expectation and disconfirmation measures in a 3-stage field study of reactions to a recently introduced automobile model. These measures were later related to postexposure affect and intention variables in a hierarchical analysis of variance design. Although the results support earlier conclusions that level of expectation is related to postexposure judgments, it is also shown that the disconfirmation experience may have an independent and equally significant impact. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
Nine pictorial displays for communicating quantitative information about the value of an uncertain quantity, x, were evaluated for their ability to communicate x̄, p(x > a) and p(b > x > a) to well-educated semi- and nontechnical subjects. Different displays performed best in different applications. Cumulative distribution functions alone can severely mislead some subjects in estimating the mean. A “rusty” knowledge of statistics did not improve performance, and even people with a good basic knowledge of statistics did not perform as well as one would like. Until further experiments are performed, the authors recommend the use of a cumulative distribution function plotted directly above a probability density function with the same horizontal scale, and with the location of the mean clearly marked on both curves.
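The recommended display, a CDF plotted directly above a density function on the same horizontal scale with the mean marked on both, can be sketched as follows (assuming an illustrative normal distribution for the uncertain quantity):

```python
# CDF-above-PDF display with the mean marked on both curves.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

dist = stats.norm(loc=50, scale=10)   # illustrative uncertain quantity x
x = np.linspace(10, 90, 400)

fig, (ax_cdf, ax_pdf) = plt.subplots(2, 1, sharex=True, figsize=(5, 5))
ax_cdf.plot(x, dist.cdf(x))
ax_cdf.set_ylabel("P(X <= x)")
ax_pdf.plot(x, dist.pdf(x))
ax_pdf.set_ylabel("density")
ax_pdf.set_xlabel("x")

for ax in (ax_cdf, ax_pdf):
    ax.axvline(dist.mean(), linestyle="--", color="gray")  # mark the mean

fig.suptitle("CDF above PDF, mean marked on both")
plt.tight_layout()
plt.show()
```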
Article
A recent and dramatic increase in the use of automation has not yielded comparable improvements in performance. Researchers have found human operators often underutilize (disuse) and overly rely on (misuse) automated aids (Parasuraman and Riley, 1997). Three studies were performed with Cameron University students to explore the relationship among automation reliability, trust, and reliance. With the assistance of an automated decision aid, participants viewed slides of Fort Sill terrain and indicated the presence or absence of a camouflaged soldier. Results from the three studies indicate that trust is an important factor in understanding automation reliance decisions. Participants initially considered the automated decision aid trustworthy and reliable. After observing the automated aid make errors, participants distrusted even reliable aids, unless an explanation was provided regarding why the aid might err. Knowing why the aid might err increased trust in the decision aid and increased automation reliance, even when the trust was unwarranted. Our studies suggest a need for future research focused on understanding automation use, examining individual differences in automation reliance, and developing valid and reliable self-report measures of trust in automation.
Article
Although machine learning is becoming commonly used in today's software, there has been little research into how end users might interact with machine learning systems, beyond communicating simple “right/wrong” judgments. If the users themselves could work hand-in-hand with machine learning systems, the users’ understanding and trust of the system could improve and the accuracy of learning systems could be improved as well. We conducted three experiments to understand the potential for rich interactions between users and machine learning systems. The first experiment was a think-aloud study that investigated users’ willingness to interact with machine learning reasoning, and what kinds of feedback users might give to machine learning systems. We then investigated the viability of introducing such feedback into machine learning systems, specifically, how to incorporate some of these types of user feedback into machine learning systems, and what their impact was on the accuracy of the system. Taken together, the results of our experiments show that supporting rich interactions between users and machine learning systems is feasible for both user and machine. This shows the potential of rich human–computer collaboration via on-the-spot interactions as a promising direction for machine learning systems and users to collaboratively share intelligence.
Article
Information systems with an "intelligent" or "knowledge" component are now prevalent and include knowledge-based systems, decision support systems, intelligent agents and knowledge management systems. These systems are in principle capable of explaining their reasoning or justifying their behavior. There appears to be a lack of understanding, however, of the benefits that can flow from explanation use, and how an explanation function should be constructed. Work with newer types of intelligent systems and help functions for everyday systems, such as word-processors, appears in many cases to neglect lessons learned in the past. This paper attempts to rectify this situation by drawing together the considerable body of work on the nature and use of explanations. Empirical studies, mainly with knowledge-based systems, are reviewed and linked to a sound theoretical base. The theoretical base combines a cognitive effort perspective, cognitive learning theory, and Toulmin's model of argumentation. Conclusions drawn from the review have both practical and theoretical significance. Explanations are important to users in a number of circumstances - when the user perceives an anomaly, when they want to learn, or when they need a specific piece of knowledge to participate properly in problem solving. Explanations, when suitably designed, have been shown to improve performance and learning and result in more positive user perceptions of a system. The design is important, however, because it appears that explanations will not be used if the user has to exert "too much" effort to get them. Explanations should be provided automatically if this can be done relatively unobtrusively, or by hypertext links, and should be context-specific rather than generic. Explanations that conform to Toulmin's model of argumentation, in that they provide adequate justification for the knowledge offered, should be more persuasive and lead to greater trust, agreement, satisfaction, and acceptance - of the explanation and possibly also of the system as a whole.
Article
This paper examines cognitive beliefs and affect influencing one's intention to continue using (continuance) information systems (IS). Expectation-confirmation theory is adapted from the consumer behavior literature and integrated with theoretical and empirical findings from prior IS usage research to theorize a model of IS continuance. Five research hypotheses derived from this model are empirically validated using a field survey of online banking users. The results suggest that users' continuance intention is determined by their satisfaction with IS use and perceived usefulness of continued IS use. User satisfaction, in turn, is influenced by their confirmation of expectation from prior IS use and perceived usefulness. Post-acceptance perceived usefulness is influenced by users' confirmation level. This study draws attention to the substantive differences between acceptance and continuance behaviors, theorizes and validates one of the earliest theoretical models of IS continuance, integrates confirmation and user satisfaction constructs within our current understanding of IS use, conceptualizes and creates an initial scale for measuring IS continuance, and offers an initial explanation for the acceptance-discontinuance anomaly.