Available via license: CC BY 4.0
Content may be subject to copyright.
Advancing Human-AI Complementarity: The Impact of User Expertise and
Algorithmic Tuning on Joint Decision Making
KORI INKPEN, Microsoft Research, USA
SHREYA CHAPPIDI, University of Virginia, USA
KERI MALLARI, University of Washington, USA
BESMIRA NUSHI, Microsoft Research, USA
DIVYA RAMESH, University of Michigan, USA
PIETRO MICHELUCCI, Human Computation Institute, USA
VANI MANDAVA, Microsoft Research, USA
LIBUŠE HANNAH VEPŘEK, LMU Munich, Germany
GABRIELLE QUINN, Western Washington University, USA
1 ABSTRACT
Human-AI collaboration for decision-making strives to achieve team performance that exceeds the performance of humans or AI
alone. However, many factors can impact success of Human-AI teams, including a user’s domain expertise, mental models of an AI
system, trust in recommendations, and more. This paper reports on a study that examines users’ interactions with three simulated
algorithmic models, all with equivalent accuracy rates but each tuned dierently in terms of true positive and true negative rates. Our
study examined user performance in a non-trivial blood vessel labeling task where participants indicated whether a given blood vessel
was owing or stalled. Users completed 150 trials across multiple stages, rst without an AI and then with recommendations from an
AI-Assistant. Although all users had prior experience with the task, their levels of prociency varied widely.
Our results demonstrated that while recommendations from an AI-Assistant can aid in users’ decision making, several underlying
factors, including user base expertise and complementary human-AI tuning, signicantly impact the overall team performance. First,
users’ base performance matters, particularly in comparison to the performance level of the AI. Novice users improved, but not to
the accuracy level of the AI. Highly procient users were generally able to discern when they should follow the AI recommendation
and typically maintained or improved their performance. Mid-performers, who had a similar level of accuracy to the AI, were most
variable in terms of whether the AI recommendations helped or hurt their performance. Second, tuning an AI algorithm to complement
users’ strengths and weaknesses also signicantly impacted users’ performance. For example, users in our study were better at
detecting owing blood vessels, so when the AI was tuned to reduce false negatives (at the expense of increasing false positives), users
were able to reject those recommendations more easily and improve in accuracy. Finally, users’ perception of the AI’s performance
relative to their own performance had an impact on whether users’ accuracy improved when given recommendations from the AI.
Authors’ addresses: Kori Inkpen, kori@microsoft.com, Microsoft Research, Redmond, WA, USA; Shreya Chappidi, University of Virginia, USA; Keri
Mallari, University of Washington, USA; Besmira Nushi, Microsoft Research, USA; Divya Ramesh, University of Michigan, USA; Pietro Michelucci,
Human Computation Institute, USA; Vani Mandava, Microsoft Research, USA; Libuše Hannah Vepřek, LMU Munich, Germany; Gabrielle Quinn, Western
Washington University, USA.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for prot or commercial advantage and that copies bear this notice and the full citation on the rst page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specic permission and/or a fee. Request permissions from permissions@acm.org.
©2022 Association for Computing Machinery.
Manuscript submitted to ACM
Manuscript submitted to ACM 1
arXiv:2208.07960v1 [cs.HC] 16 Aug 2022
2 Inkpen, Chappidi, et al.
Overall, this work reveals important insights on the complex interplay of factors inuencing Human-AI collaboration and provides
recommendations on how to design and tune AI algorithms to complement users in decision-making tasks.
CCS Concepts:
•Human-centered computing →Empirical studies in HCI
;
HCI theory, concepts and models
;
HCI design
and evaluation methods.
Additional Key Words and Phrases: human-AI collaboration, human-AI performance, human-centered AI, citizen science
ACM Reference Format:
Kori Inkpen, Shreya Chappidi, Keri Mallari, Besmira Nushi, Divya Ramesh, Pietro Michelucci, Vani Mandava, Libuše Hannah Vepřek,
and Gabrielle Quinn. 2022. Advancing Human-AI Complementarity: The Impact of User Expertise and Algorithmic Tuning on Joint
Decision Making. ACM Trans. Comput.-Hum. Interact. 1, 1 (August 2022), 30 pages. https://doi.org/10.1145/3534561
2 INTRODUCTION
Recent advances in articial intelligence (AI) have inspired promising applications of AI as assistants to people
during complex decision-making tasks, including legal, medical, and nancial scenarios. In such scenarios, there is an
expectation that AI recommendations will help in improving human decisions [
29
,
31
] and furthermore in forming
complementary Human-AI teams. In complementary Human-AI teams, it is expected that team performance is better
than if the human and the AI each were to operate alone. However, when AI systems and models are trained from
historical data, their performance and behavior are rarely optimized or even planned with such collaborative goals in
mind and therefore, team performance may be inferior to the performance of the AI alone [
6
,
8
,
10
]. For example, if a
model is used to give diagnostic recommendations to a medical professional, would it be best if the model is tuned to
ag a disease more often than not (even when likelihood is lower) to draw the attention of the human? Or is it best if it
only ags a disease when the likelihood is signicantly high? Would these behavioral dynamics change if the medical
professional had more or less experience in diagnosing the disease?
Answering these questions prior to deployment or potentially even training of an AI system could help stakeholders
choose the right model for deployment—the one that is the best t for the human collaborator. While for many
autonomous applications, the most accurate AI is also the best t for deployment, this is not always the case for Human-
AI collaborative applications. In fact, several studies on Human-AI decision making have presented counterexamples
where a less accurate AI may be a better collaborator for the human or when an equally accurate AI may have a variable
impact on team performance depending on the predictability of AI mistakes, changes in the AI behavior over time,
or the availability and form of algorithmic explanations (e.g., [
4
,
11
,
25
]). Despite the fact that most models today are
optimized with accuracy in mind, these studies indicate that other factors and algorithmic properties may be equally
important for Human-AI teams.
In this paper, we set out to expand the understanding of how AI recommendations can maximize Human-AI team
performance by exploring the impact of (1) algorithmic tuning and (2) human expertise. We consider algorithmic tuning
(i.e., true positive vs. true negative rates) as this is one of the most common control points that AI practitioners have to
tune the bias of a trained model. Such adjustments can be implemented in practice either by oversampling positive or
negative examples [
26
], assigning dierent costs to dierent examples during learning [
20
], or simply by assigning
dierent condence thresholds for detection depending on whether the goal is to reduce the number of false positives
or false negatives (i.e., generally, the higher the condence threshold, the lower the number of false positives when
condence is reliable). We intersect the study of algorithmic tuning impact with parallel dimensions of human expertise
and perceptions of AI to further develop our ndings. When the goal is to enable complementary teams, the extent of
Manuscript submitted to ACM
Advancing Human-AI Complementarity 3
human expertise and potential tuning that might have happened over time due to application requirements or incentive
structures may also impact how human decision-makers make use of an AI assistant. For example, would a human
decision maker that has a high (but imperfect) true positive rate (TPR) work best with a high-precision algorithm? Or
would they benet more from a partner with a high (but imperfect) true negative rate (TNR)?
In summary, this paper investigates the following research questions:
RQ1:
What is the impact of tuning the true positive and true negative rates of an AI system on overall Human-AI team
performance?
RQ2: What is the role of human expertise and predictive tuning in enabling complementary Human-AI teams?
RQ3:
What strategies do users employ when working with an AI-Assistant in a decision making task, and how do
users’ perceptions of an AI-Assistant impact Human-AI team performance?
Our study utilized Stall Catchers
1
, a citizen-science platform, to explore a complex decision-making task. The goal of
Stall Catchers is to produce high-quality label data to accelerate research on Alzheimer’s disease. More specically, on
the platform, citizen science participants are presented with a video clip of blood ow derived from in vivo 2-photon
excitation microscopy in a mouse brain. The analytic task is to decide whether an indicated blood vessel segment is
owing or stalled [42,45], and if stalled, indicate the location of the stall in the vessel.
The Stall Catchers platform was selected for three main reasons. First, the Stall Catchers task is complex for both
humans and learned models. This makes the expectation of complementary Human-AI performance more realistic
and useful in practice as a future AI system could be deployed to further improve team performance. Second, the task
reects a domain where people have varying degrees of expertise, allowing us to study the inuence of varying human
expertise. Finally, operating in a citizen-science domain enables us to approximate a real-world decision-making task
since the main participant incentive for the project is acceleration of Alzheimer’s disease research and participants are
aware that their decisions will be used for this purpose.
Our contributions are as follows:
•
We demonstrate how users’ baseline expertise signicantly impacts Human-AI team performance, and that users
who demonstrate performance levels similar to that of AI are particularly sensitive to algorithmic tuning.
•
We illustrate how users’ perceptions of the AI (relative performance, trust, and usefulness) signicantly impact
the utility of an AI-Assistant.
•
We show that tuning an algorithm’s properties (such as adjusting the true positive and true negative rate) must
take into account characteristics of the users in order to positively impact Human-AI team performance.
•
We highlight opportunities to develop human-centered algorithms that complement the limitations of human
expertise and user perceptions of AI.
This paper is organized as follows. Section 3situates the work in the context of prior studies and provides background
on the Stall Catchers citizen-science project. Section 4details the experimental setup along with the dierent conditions
and input data leveraged for the study. Next, Section 5presents the study results across the dimensions of dierent
algorithmic tuning and human expertise levels. Parallel to the quantitative analysis, this section also provides a rich
set of qualitative insights on how participants perceived and used the AI assistance. Section 6discusses the presented
results, their signicance, and known limitations. Finally, we conclude with a set of recommendations and implications
for future work in Section 7.
1https://stallcatchers.com
Manuscript submitted to ACM
4 Inkpen, Chappidi, et al.
3 RELATED WORK
3.1 Human AI Collaboration for decision making
A growing research area in AI involves developing systems that can partner with people to accomplish tasks that exceed
the capabilities of the AI alone or the human alone [
24
,
30
,
52
,
56
,
57
]. For example, Steiner et al. [
52
] explored the
impact of computer assistance in the eld of pathology to improve interpretation of images and clinical care [
52
]. They
found that algorithm-assisted pathologists demonstrated higher accuracy than either the algorithm or the pathologist
alone and that pathologists considered image review to be signicantly easier when interpreted with AI assistance.
Wang et al. also demonstrated how deep learning led to a signicant improvement in accuracy of pathological diagnoses
[
56
]. In a later work, Wang et al. built an AI system that detects pneumonia and COVID-19 severity from chest X-rays,
and compared system performance to that of radiologists in routine clinical practice [
57
]. They found that the system
helped junior radiologists perform close to the level of a senior, and that average weighted error decreased when the AI
system acted as a second reader as opposed to an ‘arbitrator’ between two radiologists.
On the other hand, there is also a set of work where the AI assistance did not lead to any team performance
improvements. Lehman et al. measured the diagnostic accuracy of screening mammography with and without a
computer-assisted diagnostic (CAD) tool [
36
]. In their work, they reported sensitivity (true positive) and specicity
(true negative) rates before and after the implementation of CAD. They found that CAD was associated with decreased
sensitivity and no changes in overall performance or specicity.
In the following sections, we will highlight prior research exploring the impact of tuning dierent properties of the
AI and human expertise on overall performance of Human-AI teams.
3.1.1 Findings on the impact of AI and AI properties on team performance. Prior work has demonstrated that overall AI
accuracy (both stated and observed) impacts human trust. In this work, Yin et al. found that the eect of stated accuracy,
or the model accuracy chosen by researchers, can change depending on the observed model accuracy by participants [
61
].
Stated accuracy, however, may not always correctly represent an algorithm’s performance, especially in real-world
deployments with domain shifts. Mismatches between factual and stated algorithmic scores were investigated by De
Arteaga et al., who showed that humans are less likely to follow the AI recommendation when the stated score is an
incorrect estimate of risk [13].
AI tuning was later explored by Kocielnik et al. whose work explored two versions of an AI-based scheduling assistant
with the same level of 50% accuracy but with a dierent emphasis on the types of errors made (avoiding false positives
vs. false negatives) [
33
]. Interestingly, the study nds that these dierent modications can impact humans’ perception
of accuracy and acceptance of the AI system by exploring techniques for setting expectations including an accuracy
indicator, example based explanations, and performance control. However, prior work has not yet investigated the
impact of such tuning on the overall team performance or the ability to support complementary teams, which is one of
the central research questions we address in this study.
On a related topic, Bansal et al. [
4
,
5
] show that other AI properties beyond AI accuracy also impact team performance,
including the predictability of errors and whether the model errors remain backward compatible over updates. These
ndings showcase that outside of tuning for true positive or true negative rates, there may exist other types of algorithmic
tuning that can better support humans.
3.1.2 Findings on the impact of human perception on team performance. A parallel line of work has explored the impact
of human perception, such as the role of trust and mental models of the AI, on the overall team performance.
Manuscript submitted to ACM
Advancing Human-AI Complementarity 5
Dzinolet et al. conducted a series of studies to determine whether people rely on automated and human group
members dierently [
18
]. Their self-reported data indicated a bias towards automated aids over human aids, but
performance data revealed that humans were more likely to disuse automated aids than to disuse human aids. Schaer
et al. [
50
] studied the eect of explanations on users’ trust in a modied version of the prisoner’s dilemma game. They
found that explanations not only swayed people who reported very low task familiarity, but showing explanations to
people who reported more task familiarity led to automation bias. This nding is further supported by a series of work
at the intersection of human-computer interaction and machine learning interpretability, which highlights concerns
around the phenomenon of increased, but not appropriate, reliance on automation [6,9,34,46,58].
Moreover, Zhang et al. [
62
] discusses trust calibration in AI in detail, as well as the importance of users having
a correct mental model of the AI’s error boundaries. By denition, the error boundary of a decision-maker (either
human or AI) separates the cases for which the decision-maker is correct from the incorrect ones [
4
]. Experimental
design in [
62
] includes exposure to an AI that is comparable in performance to humans. Post-experimental observations
highlighted that the error boundaries of the AI and the human were largely aligned (i.e., the humans and the AI were
making the same types of mistakes), which limited the ability of AI condence information to improve Human-AI
decision-making outcomes. Earlier work [
54
] investigating similar questions in a recidivism prediction domain found
that dierences between algorithmic and human error boundaries were not sucient to be leveraged in a way that
could signicantly improve hybrid decision-making. Recent work accounts for the dierences in error types between
people and AI and targets training ML models that are more complementary to human skills [
59
]. Our work instead
looks at the dierences in human and AI error boundaries from the simple but widely used lens of true positive and
negative rates and then draws conclusions for achieving complementarity in these dimensions.
3.1.3 Algorithmic Aversion and Appreciation. Gaube et al. ran a study where physicians and radiologists were given
chest X-rays and diagnostic advice [
23
]. All advice was generated by human experts, but for some participants, this
advice was labeled as generated from an AI system. In this work, they found that radiologists rated the advice as lower
quality when it came from an AI system, but physicians, who had less task expertise, did not. This work highlights the
phenomena of algorithmic aversion and algorithmic appreciation for users of dierent expertise.
Algorithmic aversion is the phenomenon where people tend to rely more on human advice than algorithmic advice,
even when the algorithm proves to be more accurate than the humans [
14
]. Prior research has found that factors such
as algorithmic error and domain of judgment may cause algorithmic aversion. Dietvorst et al. nds that people are
averse to algorithmic forecasts after seeing them make an error, even when they notice that the algorithm model
outperforms human forecast [
14
]. This is potentially due to human beliefs that people are more likely to be better than
algorithms and that humans are able to be perfect, which leads to a desire for perfect forecasts [
12
,
19
,
27
]. Another
aspect is the error rate, as humans are inclined to overestimate a machine’s perceived error rate even if it is wrong only
occasionally [
18
,
28
]. Consequently, people have less tolerance for errors caused by algorithmic systems than error
caused by humans [14].
Another factor that potentially causes algorithmic aversion is the domain of judgment. People want to understand
where a recommendation is coming from, and nd that they have more in common with human-based recommendations
as opposed to algorithms [
12
,
47
,
60
]. This has led to development of algorithms that are programmed to act more like
humans [41].
There is also increasing work on potential strategies to overcome algorithmic aversion, such as the human-in-
the-loop strategy. This strategy suggests that individuals are more likely to use and rely on algorithms when they
Manuscript submitted to ACM
6 Inkpen, Chappidi, et al.
have the opportunity to possibly correct and minimize errors in their decision-making judgement, even when the
algorithmic-assistance is imperfect [
15
]. In a parallel line of work, recent research has shown that individuals are not
always averse to algorithms [
38
]. Algorithm appreciation, also known as automation bias, refers to when people rely
on equivalent forecasts made by an algorithm more heavily than one made by a human [
38
]. Various factors that cause
algorithmic appreciation outlined below.
Djikstra et al. found that individuals believe algorithmic systems represent a more objective and rational perspective
than a human being [
16
,
17
]. Additionally, environmental inuences, such as time-critical situations, can inuence an
individual’s reliance on algorithmic advice over humans [
49
]. Yeomans et al. found that individuals are more likely to
rely on algorithmic decision aides when the algorithm is transparent [
60
]. Moreover, additional information from other
users, such as their prior experience with the model, also has a positive eect on an individual’s adoption of algorithms,
helping users reduce their hesitancy towards algorithms and better able to assess the reliability of a decision aid [
2
,
15
].
Misuse of algorithms is described as the failure that results when individuals incorrectly rely on algorithms. This
behavior is salient when an individual’s trust surpasses the algorithm’s real capabilities [
5
,
9
,
11
,
35
]. While algorithmic
trust can lead to positive outcomes when the algorithmic aid provides the right advice, this misuse of algorithms can
result in negative outcomes when the AI is incorrect.
3.2 Stall Catchers
Citizen science is a collaboration between scientists and volunteers to support advancement of scientic research. A
citizen science project can involve one person or millions collaborating towards a common goal, and public involvement
is typically in data collection, analysis, or reporting. Some long-term citizen science platforms and projects include
eBird [53], Zooniverse [51], and iNaturalist [44].
Stall Catchers is an online citizen science game created by the Human Computation Institute that crowdsources
the analysis of Alzheimer’s disease research data generated by Schaer-Nishimura Laboratory at Cornell University’s
Biomedical Engineering department. Volunteers are tasked to watch and analyze video clips generated from in vivo
images of mouse brains in order to classify vessels as owing or stalled (clogged blood vessels) as seen in Figure 1.
In these videos, a single blood vessel segment is designated for analysis with a colorful boundary generated by a
preprocessing algorithm. As users search for and "catch" stalls, they accumulate points. As users build up their scores,
they advance in levels, compete for leaderboard spots, and may receive digital badges for their achievements in the
game. The crowd-analyzed dataset reduces the number of blood vessel segments that lab-experts need to analyze for
advancing Alzheimer’s disease research by a factor of 20. Stall Catchers employs a so-called "wisdom of the crowd"
algorithm to sensibly aggregate several judgements about the same vessel segment to produce a single, expert-like
answer, which has been validated to be at least as accurate as that of a trained laboratory technician. Today, Stall
Catchers has over 42,000 registered users who have collectively contributed over 12 million individual annotations,
resulting in 1.4 million crowd-based labels. Several Alzheimer’s research results enabled by Stall Catchers analysis have
been published in top tier journals [
21
,
42
] and where possible, have listed "Stall Catchers players" as a co-author with
a hyperlink to the list of usernames corresponding to Stall Catchers volunteers who specically contributed to the
results being published, in a rank-ordered list based on volume of relevant annotations (e.g., [
7
,
21
]). Stall Catchers
data has also been used to support a machine learning (ML) competition created in partnership with DrivenData and
MathWorks which engaged over nine hundred participants and resulted in more than fty machine learning models
that can analyze Stall Catchers data [
37
]. As with human Stall Catchers players, none of the resultant ML models
Manuscript submitted to ACM
Advancing Human-AI Complementarity 7
Fig. 1. Side-by-side view of "flowing" vs "stalled" blood vessels from the Stall Catchers game.
achieved expert-like performance. These models did, however, exhibit a range of true positive and negative rates, which
raises the possibility of Human-AI collaboration in the manner explored by research questions in the present study.
In this work, we utilize a codeless experimentation toolkit developed by the Human Computation Institute, which
executes an experimental design utilizing a sandbox version of Stall Catchers. This experimentation platform provides
a proxy for problems relating to Human-AI interaction in medical decision-making. Several modications were made
to the experimental version of the Stall Catchers interface: 1) all gamication accoutrements (e.g., leaderboard, score,
in-game chat, etc.) were stripped from the interface to reduce bias that might originate from point incentives, 2) a
progress bar was added to the top of the interface to indicate the current stage and their progression through that
stage (Figure 2), 3) the requirement and interface for indicating the location of an identied stall was removed, and 4)
a small icon of a robot, with a hand pointing to either the "Flowing" or "Stalled" button, indicated the AI-Assistant’s
recommendation (Figure 3). We describe the details of these changes and setup in the following section.
4 STUDY
4.1 Participants
Recruitment of participants was done in compliance with the Human Computation Institute’s IRB policies. We recruited
58 remote participants located in the USA, aged 18 years or older from the existing pool of players on the Stall Catchers
citizen science platform [
1
]. This recruiting method ensured that all players were familiar with the task. Participants
had varying levels of expertise from novice to expert users and ranged in age from 18 to 82 years of age (mean = 49, sd
= 17.2). No other user demographic data was collected on the Stall Catchers platform. Each participant completed the
task in four stages, details of which are explained in Section 4.3. In each stage, participants watched videos of blood
owing through vessels of mice brains to identify if the vessels were owing or clogged ("stalled"). Each participant
was only allowed to participate in the task once to avoid any learning bias across dierent conditions. As an incentive
Manuscript submitted to ACM
8 Inkpen, Chappidi, et al.
for their voluntary participation, all participants received 5X Stall Catchers points for every correct answer, which were
then added to their overall Stall Catchers player points.
4.2 Videos
140 videos were sampled from a database with a validation set of 7,693 videos containing ground truth labels veried
by expert players (i.e., players whose sensitivity was 1) not involved in the study. Each video contained cross-sectional
views of blood owing through vessels in mice brains (Figure 2). Each video also had an associated diculty metric
(0,1) which was dened based on average response classication accuracy from historical Stall Catchers data.
4.3 Task Stages
The 140 videos were split across the four stages (20, 40, 40, 40) of the task, balanced by diculty metric. Each stage
contained equal proportions of stalled and owing vessels. Details of each of the four stages are discussed below.
4.3.1
Stage 1: No AI, With feedback
.In this stage, participants were shown 20 videos (10 owing, 10 stalled) and
asked to make a decision on each. Participants did not receive any form of assistance. After providing a response to a
video, participants were shown the correct answer (Figure 2). This stage served as a practice stage. Additionally, in the
original Stall Catchers game, players are 4 times more likely to be shown a owing video than a stalled video. Hence,
this stage also aimed to override such prior biases that users may have had from playing the original game.
Fig. 2. Stall Catcher interface showing feedback aer a decision was made during Stage 1 and 3.
4.3.2
Stage 2: No AI, No feedback
.In this stage, participants were shown 40 videos (20 owing, 20 stalled) in
randomized order and asked to make a decision alone. Participants were not shown the correct answers after their
selections. Responses from this stage were used to study the performance and skill level of participants when no AI
assistance is present.
Manuscript submitted to ACM
Advancing Human-AI Complementarity 9
Fig. 3. Stall Catcher interface with AI-Assistant suggestions, prior to the user making a selection, during Stage 3 and 4.
4.3.3
Stage 3: With AI, With feedback
.In this stage, participants were shown 40 videos (20 owing, 20 stalled)
with randomized ordering alongside a recommendation from an AI assistant (Figure 3). Participants were required to
make a nal decision and could choose whether to incorporate the AI assistant’s recommendation or ignore it. After
participants indicated their decision, they were immediately given feedback on the correct answer. Stage 3 allowed for
users to practice decision-making with the AI-Assistant and gauge the performance of the AI-Assistant.
4.3.4
Stage 4: With AI, No feedback
.Stage 4 was the test stage where participants were shown 40 videos (20
owing, 20 stalled) with randomized ordering alongside the AI-Assistant’s recommendation and asked to make a
decision. Participants were not shown the correct answers after their selections. Responses from this stage were used to
assess the Human-AI team performance.
4.4 Experimental Design
A mixed factor design was used, with one between subjects variable (each participants was randomly assigned to
dierent AI tuning conditions) and one within subjects variable (each participant was required to complete the task
with and without AI assistance). The 58 unique participants were randomly assigned to one of three experimental
conditions—Balanced condition, High TNR (True Negative Rate) condition, or High TPR (True Positive Rate) condi-
tion—while ensuring that each condition had similar numbers of participants. However, participants were not balanced
across their prior expertise levels on the main Stall Catchers site.
4.4.1 Experimental Conditions. The AI-Assistant in this study was simulated to enable specic control over the AI-
Assistant’s performance. Participants were randomly assigned to one of three experimental conditions varying how the
AI assistant was simulated: Balanced, High TNR, High TPR. In all three conditions, the AI-Assistant’s accuracy was
kept constant at 75% while the true positive and negative rates of the AI-Assistant changed in each condition. Table 1
Manuscript submitted to ACM
10 Inkpen, Chappidi, et al.
summarizes the number of true negatives and positives for each condition along with accuracy, true positive rates, and
true negative rates. We simulated the AI performance rather than tuning it for dierent true positive and negative rates
so that we could exactly control these rates for experimentation across stages. While precise control in the AI tuning
may not always be realistic or feasible in small samples, in this study, we aim at understanding the broad inuence of
dierent tuning factors.
Table 1. AI-Tuning in Experimental Conditions.
AI Recommendation = Flowing AI Recommendation = Stalled Accuracy TNR TPR
TN FN FP TP TN+TP
TN+FP+TP+FN
TN
TN+FP
TP
TP+FN
Balanced 15 5 5 15 75% 75% 75%
High TNR 19 9 1 11 75% 95% 55%
High TPR 11 1 9 19 75% 55% 95%
In the Balanced condition, users interacted with an AI-Assistant that made an equal number of errors on each type of
video (stalled or owing). The AI-Assistant made ve false positive errors, suggesting that a owing video was stalled,
and ve false negative errors, suggesting that a stalled video was owing, in each stage it appeared in. This resulted in
the AI-Assistant having a TPR of 75% and TNR of 75%.
In the High True Negative Rate (TNR) condition, the AI-Assistant was more likely to miss stalled videos. The AI-
Assistant suggested that videos were owing more frequently, resulting in a true negative rate of 95% and true positive
rate of 55% for the AI-Assistant.
In the High True Positive Rate (TPR) condition, the AI-Assistant was more likely to miss owing videos. The AI
assistant suggested that videos were stalled more frequently, resulting in the AI-Assistant having a true negative rate of
55% and true positive rate of 95%.
As described in 4.3, each participant completed the task in four stages, where responses from Stage 2 were used to
determine participants’ skill levels, and responses from Stage 4 were used to assess the Human-AI team performance.
Thus, the set of videos in Stage 2 and Stage 4 were counterbalanced to help minimize any eects from task diculty,
i.e., half the participants received one set of videos in Stage 2, while the other half of the participants experienced that
same set of videos in Stage 4.
4.5 Data Collection and Procedures
At the start of the experiment, participants were shown an instruction page and a participation consent form. After
providing consent, each participant completed the video labeling task in each of the four stages in Section 4.3. Participants
were allowed to take multiple 2-min breaks, but were expected to complete the experiment within one sitting. Data
collected included the participant’s decision for each video shown, along with the AI-Assistant’s recommendation (in
Stages 3 and 4), and the ground truth for that video. Additional data recorded included the instant at which the user
submitted their labels for each video (with one-second precision) and the response time (in seconds), i.e., the time from
when video loaded until user clicked a response button.
After each stage, participants completed a short survey to assess their self-rated performance in that stage and to
assess the perceived utility of the AI-Assistant in that stage (if present) on a scale from 1-5. These questions were
adapted from the study by Kocielnik et al. [
33
]. After Stage 3 and 4, participants were also asked to indicate what types
of errors they thought the AI was making (more false positives, more false negatives, or equal). At the end of Stage 4,
Manuscript submitted to ACM
Advancing Human-AI Complementarity 11
Table 2. Users’ performance in terms of accuracy for Stage 1 (without AI + feedback), Stage 2 (without AI + no feedback), Stage 3 (with
AI + feedback), and Stage 4 (with AI + no feedback). Users’ accuracy increased significantly with the addition of AI recommendations
(Stage 2 to Stage 4,
𝑝<.001
), but there was no significant main eect of experimental condition (p=.562). There did not seem to be
learning eects over time as there were no significant dierences between Stage 1 and Stage 2 (
𝑝=.979
), or between Stage 3 and
Stage 4 (p=.054).
Accuracy
Stage 1 Stage 2 Stage 3 Stage 4
Balanced 70% 66% 78% 72%
High TNR 61% 66% 72% 72%
High TPR 69% 68% 76% 77%
TOTAL 67% 66% 75% ∗73%
users answered eight questions related to their use, perceptions, and trust of the AI-Assistant. Users indicated their
agreement with these questions on a 5-point Likert scale.
A modication was made to the survey questions early in the study. As a result, some survey data is missing for the
rst 10 participants.
5 RESULTS
5.1 Overall Impact of the AI-Assistant on Performance
Result 1: AI-Assistant recommendations signicantly improved users’ accuracy regardless of the AI-
tuning.
Users’ accuracy in each condition is shown in Table 2. Accuracy was measured as the percentage of correct answers
(
TP +TN
TN +FP +TP +FN
). Results from Stage 2 and Stage 4 (the two non-feedback conditions) were used to assess the
impact of the AI-Assistant on users’ accuracy for the Stall Catchers task. A mixed-model, repeated measures ANOVA
was used with users’ performance in Stage 2 and Stage 4 as the within subjects factor and tuning of the AI model
(experimental condition) as the between subjects variable. A signicant dierence was found between Stage 2 and Stage
4, with users achieving higher accuracy alongside the AI-Assistant’s recommendations in Stage 4 (
𝐹1,55 =24.61, 𝑝 <
.001, 𝜂2=.31,1−𝛽=1.00
). Additionally, all three tunings of the AI had a positive impact on users’ accuracy and there
were no signicant dierences between the experimental conditions (𝐹2,55 =.61, 𝑝 =.55, 𝜂2=.02,1−𝛽=.15).
We also examined users’ performance in Stage 1 and Stage 3 to determine if the improvements could have been
attributed to a learning eect. Paired samples t-tests revealed no signicant dierences between Stage 1 and Stage 2
(
𝑡1,57 =.001, 𝑝 =.87
) or Stage 3 and Stage 4 (
𝑡1,57 =1.86, 𝑝 =.068
), which suggests that users performance did not
signicantly change between stages, except when the AI-Assistant recommendations were added.
5.1.1 Baseline Performance Clusters. A Pearson correlation coecient was calculated to examine the performance
gains made between Stage 2 (without AI) and Stage 4 (with AI). Across all users, there was a signicant negative
correlation (
𝑟(57)=−.59, 𝑝 <.001
) between Stage 2 baseline accuracy and gains made in Stage 4. Users with lower
levels of accuracy made higher gains in performance when working with the AI-assistant compared to users with
higher levels of accuracy. Figure 4illustrates the accuracy of participants in both stages. The diagonal line indicates the
boundary at which participants would be equally accurate in both stages. Visually, improvements in Stage 4 result in
Manuscript submitted to ACM
12 Inkpen, Chappidi, et al.
data points above the diagonal line. Improvements above the accuracy of the AI-Assistant result in data points above
the 75% horizontal line.
Due to the interaction eect between baseline accuracy and improvement with AI, a two-step cluster analysis was
used to group users based on their baseline accuracy in Stage 2. This resulted in three distinct performance clusters with
a silhouette measure of cohesion of 0.7 (good), and cluster centers (means): Cluster 1 (
¯
𝑥=51%
), Cluster 2 (
¯
𝑥=68%
)
and Cluster 3 (¯
𝑥=88%).
One-sample t-tests were used to compare mean accuracy of each cluster in Stage 2 to the constant 75 percent accuracy
of the AI-Assistant. The rst cluster (
𝑛=16
), termed the low-performer group, had signicantly lower accuracy than
the AI-Assistant (
𝑡15 =−19.62, 𝑝 <.001
). The second cluster (
𝑛=33
), termed the mid-performer group, was closer
to, but still signicantly below the accuracy of the AI (
𝑡32 =−7.96, 𝑝 <.001
). The third cluster (
𝑛=9
), termed the
high-performer group, had a signicantly higher baseline accuracy than the AI assistant (
𝑡8=5.22, 𝑝 <.001
). These
resulting clusters were used for all subsequent analyses.
Manuscript submitted to ACM
Advancing Human-AI Complementarity 13
Fig. 4. Accuracy of users in Stage 2 (without AI) and Stage 4 (with AI), clustered by users’ accuracy in Stage 2 as a measure of human
performance. Points above the diagonal line represent users who improved in accuracy, while points below the diagonal line represent
users whose performance decreased. The horizontal line at 75% accuracy represents the accuracy of the AI-Assistant. Data points
above the horizontal line represent users who performed beer than the AI-Assistant in Stage 4 and data points below the horizontal
line represent users who performed worse than the AI-Assistant in Stage 4.
5.2 Impact of AI-Assistant by Performance Cluster
Result 2a: Regardless of AI-Assistant tuning, the accuracy of low-performers signicantly increased,
though not to the level of the AI-Assistant.
Result 2b: Recommendations from the AI-Assistant had a mixed impact on mid-performers, with some
users improving signicantly and others getting worse. The High TPR condition was signicantly bet-
ter for mid-performers and resulted in more users achieving higher accuracy than the level of the AI-
Assistant.
Result 2c: High-performers maintained their high levels of accuracy, regardless of AI-Assistant presence
or subsequent tuning.
The eects of the AI-tuning were examined for each performance cluster (see Table 3). Mixed repeated measures
ANOVAs were conducted for each cluster. Users’ performance in Stage 2 and Stage 4 was the within subjects factor, and
tuning of the AI model (experimental condition) was the between subjects variable. In addition, single-sample t-tests
with a reference value equal to the accuracy of the AI-Assistant (75 percent) were used to analyze each cluster’s Stage 4
accuracy relative to the accuracy of the AI-Assistant. Figure 5shows users’ accuracy results for Stage 2 and Stage 4, by
Manuscript submitted to ACM
14 Inkpen, Chappidi, et al.
Table 3. Users’ performance in terms of accuracy for Stage 1 (without AI + feedback), Stage 2 (without AI + no feedback), Stage 3
(with AI + feedback), and Stage 4 (with AI + no feedback) by performance cluster. Low-performers significantly improved in accuracy
from Stage 2 to Stage 4 (𝑝<.001). Mid-performers also significantly improved in accuracy from Stage 2 to Stage 4 (𝑝<.01).
Accuracy
Stage 1 Stage 2 Stage 3 Stage 4
Low-performers 53% 51% 70% ∗67%
Mid-performers 68% 68% 74% ∗74%
High-performers 83% 88% 89% 88%
performance cluster and experimental condition.
5.2.1 Low-Performers. Recommendations from the AI-Assistant signicantly improved low-performers accuracy
(
𝐹1,13 =50.08, 𝑝 <.001, 𝜂 2=.79,1−𝛽=1.00
). Of the 16 low-performers, 15 improved their accuracy, 1 stayed the
same, and there were no low-performers whose accuracy degraded. Low-performers still performed signicantly worse
in Stage 4 than the AI-Assistant (
𝑡15 =−5.15, 𝑝 <.001
). No signicant eect of experimental condition (AI-tuning)
was found for the accuracy of low-performers (𝐹2,13 =0.41, 𝑝 <.674, 𝜂2=.06,1−𝛽=0.10).
Fig. 5. Accuracy of low-, mid-, and high-performers in Stage 2 and Stage 4 for each experimental condition.
5.2.2 Mid-Performers. Recommendations from the AI-Assistant signicantly improved mid-performers’ accuracy
(
𝐹1,30 =8.77, 𝑝 =.006, 𝜂 2=.23,1−𝛽=.82
), raising them to a comparable level of accuracy to the AI-Assistant in
Stage 4 (𝑡32 =−1.14, 𝑝 =.261).
Manuscript submitted to ACM
Advancing Human-AI Complementarity 15
Although there was a signicant overall eect, improvements in accuracy were highly variable for mid-performers
with some users improving and others falling below their baseline Stage 2 accuracy level. In Stage 4, 20 out of the 33
mid-performers improved their accuracy, two stayed the same, and 11 degraded in accuracy compared to their Stage 2
accuracy. Of the 20 users that improved in accuracy, 13 improved above the level of the AI-Assistant, ve improved to
the level of the AI-Assistant, and two improved but were still below the accuracy of the AI-Assistant.
A signicant main eect was also found for experimental condition for mid-performers (
𝐹2,30 =3.52, 𝑝 =.043, 𝜂2=
.19,1−𝛽=0.61
). We examined each AI tuning condition separately using paired t-tests. In both the Balanced condition
and the High TNR condition, overall mean accuracy rates increased when the users were given recommendations from
the AI-Assistant, but this dierence was not statistically signicant given the high variance ((
𝑡11 =−1.087, 𝑝 =.30
,
and
𝑡12 =−1.175, 𝑝 =.26
, respectively). In the Balanced condition, mean accuracy rates rose from
66%
in Stage 2, to
70%
in Stage 4. In the High-TNR condition, mean accuracy rose from
68%
in Stage 2 to
72%
in Stage 4. However, the
AI-Assistant recommendations did signicantly improve mid-performers’ accuracy in the High TPR condition (
70%
to
80%,𝑡7=−3.188, 𝑝 =.015)
Examining mid-performer performance in Stage 4, we also see interesting trends across experimental conditions.
Performance was mixed in the Balanced condition, where ve users improved their accuracy from Stage 2 to Stage 4,
two performed similarly, and ve fell below their Stage 2 score. The High TNR condition was similar, where eight users
improved in Stage 4 and ve whose performance degraded. However, in the High TPR condition, all of the users (n=7)
improved in accuracy, except one whose performance degraded.
To understand what conditions maximized Human-AI team performance, we examined which mid-performers were
able to increase their accuracy above the level of the AI-Assistant. Of all mid-performers,
17%
(n=2) improved their
accuracy above the level of the AI-Assistant in the Balanced condition,
38%
(n=5) improved above the level of the
AI-Assistant in the High TNR condition, and
75%
(n=6) increased their accuracy above the level of the AI-Assistant in
the High TPR condition. These results suggest that mid-performers saw the biggest gains in accuracy when working with
an AI-Assistant that was tuned for High TPR (i.e., an AI-Assistant that erroneously suggests more stalls). Given these
results, we conduct further analyses in Section 5.3 to understand whether human baseline performance interacted with
the AI tuning experimental condition in a complementary manner to inuence the accuracy rates of mid-performers.
5.2.3 High-Performers. The AI-Assistant did not signicantly change high-performers’ accuracy between Stage 2 and
Stage 4 (
𝐹1,6=0, 𝑝 =1, 𝜂2=.0,1−𝛽=.05
). High-performers continued to perform at levels above the AI-Assistant’s
accuracy in Stage 4 (
¯
𝑥=88%, 𝑡8=5.22, 𝑝 <.001
). Experimental condition also did not signicantly aect the accuracy
of high performers (
𝐹2,6=.29, 𝑝 =.76, 𝜂2=.09,1−𝛽=0.08
). We would like to note that access to experts was limited
so the number of high-performers in our study was low. This may have limited the statistical power to detect signicant
dierences for this group of users.
Manuscript submitted to ACM
16 Inkpen, Chappidi, et al.
Table 4. Users’ performance in terms of true positive and true negative rates for Stage 2 (without AI) and Stage 4 (with AI). In
the Balanced and High TPR conditions, users’ TPR increased significantly from Stage 2 to Stage 4 (Balanced:
𝑝<.001
, High TPR:
𝑝<.004). In the High TNR condition, users’ TNR increased significantly from Stage 2 to Stage 4 (𝑝<.001).
True Negative Rate True Positive Rate
Stage 2 Stage 4 Stage 2 Stage 4
Balanced 75% 79% 57% ∗66%
High TNR 75% ∗84% 56% 60%
High TPR 77%81% 59% ∗72%
TOTAL 75% ∗81% 57% ∗66%
5.3 Complementarity: The Impact of Algorithmic Tuning
Result 3: Users beneted most from the AI assistance when the AI was tuned to be more complemen-
tary to human expertise. Users only increased signicantly in a performance measure when their initial
baseline was below the AI-Assistant’s level for that performance measure.
As mid-performers beneted most from a High TPR assistant in Section 5.2.2, we examined users’ baseline true
positive and true negative rates to better understand the inherent strengths of users and the types of errors they make,
as well as the impact of tuning an AI-Assistant for one of these measures. The true positive rate is the percentage
of stalled videos detected correctly (
TP
TP +FN
). The true negative rate is the percentage of owing videos detected
correctly ( TN
TN +FP ). Mean TPR and TNR scores within experimental conditions for Stage 2, Stage 3, and Stage 4 are
shown in Table 4.
5.3.1 Human Tuning. We examined users’ baseline true positive and negative rates from Stage 2 to determine if the
users themselves were balanced in the errors they made, or if they were biased towards one of the metrics. Paired t-tests
comparing users’ scores in Stage 2 indicated that mid- and high-performers were better at detecting when vessels were
owing, leading them to have higher TNR compared to TPR (
𝑚𝑖𝑑 :𝑡32 =−7.33, 𝑝 <.001; ℎ𝑖𝑔ℎ :𝑡8=−2.92, 𝑝 =.019
).
This can be seen in Figure 6where more of the data points from low and high-performers fall above the diagonal.
Low-performers did not display a statistically signicant bias towards either of the performance metrics (
𝑙𝑜𝑤 :𝑡15 =
−0.72, 𝑝 =.485).
5.3.2 A Balanced AI-Assistant. In the balanced condition, the AI-Assistant’s true positive and negative rates were
equal (75% each) and therefore, made equal numbers of errors on stalled and owing videos. Users in this condition
had a baseline TNR of 75% (not signicantly dierent that the AI-Assistant,
𝑡19 =−.069, 𝑝 =.946
) and a baseline
TPR of 57% (signicantly below the level of the AI,
𝑡19 =−4.346, 𝑝 <.001
). Paired t-tests were used to examine
users’ improvement in both scores from Stage 2 to Stage 4. Users’ true negative rates did not change signicantly
(79%
𝑡19 =−1.182, 𝑝 =.252
); however, users’ true positive rates signicantly increased (66%,
𝑡19 =−2.275, 𝑝 =.035
),
though still remained below the level of the AI-Assistant (𝑡19 =−3.383, 𝑝 =.003).
5.3.3 A High TNR AI-Assistant. In the High TNR condition, the AI-Assistant was tuned to have a TNR of
95%
and a TPR
of
55%
. The AI-Assistant was highly accurate when it predicted a stall, but was also more likely to suggest that a video
was owing and therefore, miss more stalls. The users in this condition had a baseline TNR of 75% (signicantly below
the AI-Assistant,
𝑡19 =−4.251, 𝑝 <.001
) and a baseline TPR of 56% (not signicantly dierent from the AI-Assistant,
Manuscript submitted to ACM
Advancing Human-AI Complementarity 17
Fig. 6. True positive and negative rates of all users in Stage 2. Points above the diagonal line represent users who were biased towards
true negatives (beer at detecting flowing) and points below the diagonal line represent users who were biased towards true positives
(beer at detecting stalled).
𝑡19 =0.383𝑝=.706
). Paired t-tests used to examine user improvement in Stage 4 revealed that users’ TNR signicantly
increased (
84%, 𝑡19 =−2.132, 𝑝 =.047
), albeit still below the level of the AI-Assistant (
𝑡19 =−3.222, 𝑝 =.004
). Users’
TPR in Stage 4 did not change signicantly (60%, 𝑡19 =−0.892, 𝑝 =.384).
5.3.4 A High TPR AI-Assistant. In the High TPR condition, the AI-Assistant had a
95%
TPR and
55%
TNR. The
AI-Assistant was highly accurate when it predicted a vessel was owing, but was more likely to suggest that a video is
stalled and therefore missed more owing vessels. The users in this condition had a baseline TNR of
77%
(which was
signicantly better than the AI-Assistant,
𝑡18 =4.813, 𝑝 <.001
) and a baseline TPR of
59%
(which was signicantly
below the level of the AI-Assistant,
𝑡18 =−4.073, 𝑝 <.001
). Paired t-tests used to examine user changes in TNR and
TPR in Stage 4 revealed that users’ TNR did not increase signicantly (
¯
𝑥=81%
,
𝑡18 =−1.222, 𝑝 =.238
). Users’ TPR in
Stage 4 did signicantly increase (
72%
,
𝑡18 =−3.222, 𝑝 =.004
) although was still below the level of the AI-Assistant
(𝑡18 =−6.724, 𝑝 <.001).
5.4 Agreement with AI-Assistant
Result 4: Users were more likely to disagree with an incorrect recommendation (that owing videos were
stalled) in the High TPR condition. Additionally, users with higher baseline accuracy were more likely
to agree with correct recommendations and disagree with incorrect recommendations.
We were also interested in studying user agreement with the AI-Assistant both when the AI is correct (i.e., both
the human and the AI are correct) and when the AI is incorrect (i.e., both the human and the AI are incorrect). In the
Manuscript submitted to ACM
18 Inkpen, Chappidi, et al.
former case, agreement is benecial to the Human-AI team but in the latter case, inappropriate agreement would lead
to lower team performance. Non-parametric statistics were used to assess dierences in agreement since the normality
assumption was violated. Kruskal-Wallis Tests were used to examine dierences in agreement between the experimental
conditions with Mann-Whitney U tests for post-hoc pairwise comparisons of signicant main eects.
5.4.1 Agreement by Condition. Table 5shows the percentage of time users were in agreement with the AI-Assistant’s
recommendation for each experimental condition. No signicant dierence was found in terms of the overall level of
user-AI agreement across experimental conditions (𝜒2=3.213, 𝑝 =.201).
In each experimental condition, the AI-Assistant was correct 75% of the time and made incorrect recommendations
25% of the time. Optimal Human-AI team performance occurs when users agree with the AI-Assistant when it is correct
(
75%
of the time) and disagree with the AI-Assistant when it is incorrect (
25%
of the time). When the AI-Assistant
was correct, the experimental condition had no signicant impact on the level of agreement users had with those
recommendations (𝜒2=1.605, 𝑝 =.448). However, when the AI-Assistant was incorrect, experimental condition had
a signicant impact on user agreement with incorrect recommendations (
𝜒2=7.498, 𝑝 =.024
). Post-hoc pairwise
analyses revealed that users were less likely to agree with an incorrect recommendation (that "owing" videos were
"stalled") in the High TPR condition compared to the High TNR condition (𝑍=−2.605, 𝑝 =.009).
Table 5. User agreement with the AI-Assistant by experimental condition. The shaded columns indicate a beneficial action (user
agreement when the AI was correct and user disagreement when the AI was incorrect). Significantly more users disagreed with an
incorrect AI recommendation in the High TPR condition compared to the High TNR condition (𝑝<.017).
Overall AI Correct AI Incorrect
Agreement Agreed Disagreed Agreed Disagreed
(ideal = 75%) (ideal = 0%) (ideal = 0%) (ideal = 25%)
Balanced 69% 58% 17% 11% 14%
High TNR 73% 60% 15% 13% 12%
High TPR 68% 60% 15% ∗8% ∗17%
5.4.2 Agreement by Performance Cluster. Table 6shows the percentage of time that users were in agreement with the
AI-Assistant’s recommendation by users’ base-level performance. No signicant dierence was found in terms of the
overall level of agreement across dierent levels of performance (
𝜒2=5.820, 𝑝 =.054
). However, whether or not users
accepted the correct AI recommendations or rejected the incorrect AI recommendations varied based on the user’s
performance cluster.
When the AI-Assistant was both correct and incorrect, the base-level performance of the user had a signicant
impact on the level of agreement users had with each of these recommendations (
𝜒2=9.992, 𝑝 =.007
, and
𝜒2=
19.729, 𝑝 <=.001
). Post-hoc pairwise analyses revealed that high-performers were signicantly more likely to agree
with correct recommendations from the AI-Assistant than either mid- or low-performers (
𝑍=−3.155, 𝑝 =.002
and
𝑍=−2.484, 𝑝 =.013
, respectively). Additionally, high-performers were signicantly more likely to disagree with
incorrect recommendations than either mid- or low-performers (
𝑍=−2.392, 𝑝 =.016
and
𝑍=−3.624, 𝑝 <.001
,
respectively). Finally, mid-performers were signicantly more likely to disagree with incorrect recommendations than
low-performers (𝑍=−3.452, 𝑝 =.001).
Manuscript submitted to ACM
Advancing Human-AI Complementarity 19
5.5 Users’ Perception of the AI-Assistant
Table 6. User agreement with the AI-Assistant by baseline performance. The shaded columns indicate a beneficial action (agreement
when the AI was correct and disagreement when the AI was incorrect). Significantly more high-performers agreed with the AI
when it was correct than either mid- or low-performers (𝑝<.017). Significantly more high-performers disagreed with an incorrect
AI recommendation than low- and mid-performers (
𝑝<.017
). Significantly more mid-performers disagreed with an incorrect AI
recommendation than low-performers (𝑝<.017).
Overall AI Correct AI Incorrect
Agreement Agreed Disagreed Agreed Disagreed
(ideal = 75%) (ideal = 0%) (ideal = 0%) (ideal = 25%)
Low-Performers 74% 58% 17% 16% 9%
Mid-Performers 67% 58% 17% ∗10% ∗15%
High-Performers 73% ∗68% ∗7% ∗5% ∗20%
Result 5a: Users who perceived that the AI-Assistant’s performance was better than their own were more
likely to improve, while users who rated themselves higher than the AI-Assistant tended to degrade in
their performance when provided with AI recommendations.
Result 5b: Users who improved with the AI-Assistant’s recommendations rated the helpfulness and sat-
isfaction of the AI-Assistant higher than those whose performance degraded. They also rated the AI-
Assistant higher in terms of improving their ability to detect stalls.
Users were asked to rate their own performance as well as the AI-Assistant’s performance on a ve-point scale
after Stage 2 (without AI), Stage 3 (with AI + feedback), and Stage 4 (with AI + no feedback) (see Table 7).
2
Users’
self-ratings of their own performance did not change signicantly across the stages (Friedman test:
𝜒2=4.59, 𝑝 =.101
),
however, their rating of the AI-Assistant’s performance increased signicantly from Stage 3 to Stage 4 (Wilcoxon:
𝑍=−5.26, 𝑝 <.001). These results were consistent across all of the experimental conditions.
Fig. 7. User’ ratings of the AI-Assistant’s performance compared to their own performance by stage.
We also examined users’ relative ranking of their own performance in comparison to the AI-Assistant’s performance
in Stage 3 and Stage 4 (see Figure 7). After Stage 3, when users were rst introduced to the AI-Assistant and were also
2Note: Eleven users did not ll out the survey so we report on data for the remaining 47 users.
Manuscript submitted to ACM
20 Inkpen, Chappidi, et al.
Table 7. Users’ ratings of their own performance, and performance of the AI-Assistant for Stage 2 (no AI), Stage 3 (with AI + feedback),
and Stage 4 (with AI no feedback) by experimental condition. Rating were on a 5-point scale with 1 being low and 5 being high. User
ratings of the AI-Assistant increased significantly from Stage 3 to Stage 4 (
𝑝<.001
), while ratings of their own performance did not
change significantly (p=.101).
User Self-Ratings AI Ratings
Stage 2 Stage 3 Stage 4 Stage 3 Stage 4
Balanced 2.90 3.53 3.12 2.47 ∗3.35
High TNR 2.80 2.81 2.88 2.38 ∗3.31
High TPR 2.94 3.21 3.21 2.50 ∗3.29
TOTAL 2.88 3.19 3.06 2.45 ∗3.32
given feedback after every decision, users rated the AI-Assistant’s performance signicantly lower than their own
performance (Wilcoxon:
𝑍=−4.325, 𝑝 <.001
). However, after further interaction with the AI-Assistant in Stage 4,
users’ rating of the AI performance increased signicantly such that there was no signicant dierence between users’
rating of themselves and the AI-Assistant at the end of Stage 4 (𝑍=−1.766, 𝑝 =.077).
The perceived performance level of the AI-Assistant may have impacted users’ trust or reliance on the AI-Assistant’s
recommendations during Stage 4. Figure 8shows user ratings of the AI-Assistant for users whose performance improved
from Stage 2 to Stage 4 versus users whose performance degraded from Stage 2 to Stage 4. Of the eleven users whose
performance degraded, ve rated the AI-Assistant performance similarly to themselves, ve rated the AI-Assistant
lower, and one rated the AI-Assistant higher. This resulted in a mean rating of their own performance (
¯
𝑥=3.55
)
that was not signicantly dierent than the mean rating of the AI-Assistant’s performance (
¯
𝑥=3.09
) (Wilcoxon:
𝑍=−1.186, 𝑝 =.236
). In contrast, for the thirty users who improved their accuracy from Stage 2 to Stage 4, eleven
rated the AI-Assistant similarly to themselves, three rated the AI-Assistant lower, and sixteen rated the AI-Assistant
higher. This resulted in a mean rating of their own performance (
¯
𝑥=2.87
) which was signicantly lower than the
mean rating of the AI-Assistant’s performance (¯
𝑥=3.43) (Wilcoxon: 𝑍=−3.038, 𝑝 =.002).
After Stage 4, users also rated the AI-Assistant’s performance on a 5-point Likert agreement scale in terms of: 1) how
helpful it was; 2) how satised they were with it; 3) how much it improved their ability to catch stalls; and 4) whether
they would recommend the AI-Assistant to others. Responses to these questions were favorable, as
54%
reported that the
AI-Assistant was helpful,
47%
were satised with it,
34%
felt that it improved their ability to nd stalls, and
47%
would
recommend it to others. A smaller percentage (
29%
) indicated they they would use the AI-Assistant if it was available.
Examining user ratings of the AI in terms of helpfulness, satisfaction, and improved ability, we again see a signicant
dierence between users who improved in accuracy and those whose accuracy decreased. Users who improved in
accuracy in Stage 4 rated the AI-Assistant higher in terms of being helpful (Mann-Whitney U:
𝑍=−2.27, 𝑝 =.023
),
being satised with it (Mann-Whitney U:
𝑍=−2.43, 𝑝 =.015
), and feeling like it improved their ability to catch stalls
(Mann-Whitney U: 𝑍=−2.917, 𝑝 =.004).
5.6 User Mental Models of AI-Assistant’s Errors
After Stage 4, when users were asked if they had a good understanding of how the Stall Catchers assistant decided
whether a video contained stalled vessels or not, most users (
72%
) said they disagreed or were unsure. Similarly, when
asked if they knew what types of errors the AI-Assistant was making, most users (
72%
) disagreed or were unsure. When
asked whether the AI-Assistant was making more errors on "stalled" or "owing" vessels after Stage 4, the majority
Manuscript submitted to ACM
Advancing Human-AI Complementarity 21
Fig. 8. User comparisons of personal and AI performance based on accuracy improvements between Stage 2 and Stage 4
Table 8. Users’ Mental Model of AI-Assistant Error Types Aer Stage 4.
Condition
more errors on ’stalled’
vessels
equal errors on owing
and stalled vessels
more errors on ’ow-
ing’ vessels
makes no mistakes at
all
Balanced 3 12 4 0
High TNR 4 11 2 0
High TPR 1 10 5 0
TOTAL 8 33 11 0
of users (
62%
) felt that the AI-Assistant made equal number of errors on both owing and stalled vessels (Table 8),
regardless of the experimental condition. These results all suggest that users struggled with forming a mental model of
the AI-Assistant and seemed unable to discern what types of errors the AI-Assistant was making.
5.7 Self-Reported Use of AI
After Stage 4, users were asked "How did you use the Stall Catchers assistant’s recommendations (if at all)?" Although
the question was optional, all participants responded. Two researchers coded all of the comments, using a bottom-up
open-coding approach (
85%
inter-rater reliability) followed by Axial coding to cluster related comments. Four key
categories were identied: 1) follow recommendations if unsure, 2) take a second look / verify, 3) ignore, and 4) check
stalled videos. The remaining comments were labeled "other."
5.7.1 Follow recommendations if unsure. The most common way people used the AI-Assistant was when they were
unsure with their decision. Nineteen users indicated that they consulted or followed the AI-Assistant in instances where
they were unsure of what they viewed in the blood vessel video by themselves. For example, User 120 (high-performer,
High TPR) responded “I used the assistant to help me make a choice when I wasn’t able to decide which appropriate answer
to choose," and User 158 (mid-performer, High TPR) said “I only used the assistant if I couldn’t decide if there was a stall or
not, or if the image was too grainy to see one way or another." Users incorporating this strategy would likely benet from
a complementary AI where the AI is tuned to have higher accuracy in areas where users are weaker. There is support
for this hypothesis in our data through examination of the eleven mid-performers who used this strategy. Five were in
the non-complimentary Balanced condition, and four of these users performed worse in Stage 4. Two users were in the
Manuscript submitted to ACM
22 Inkpen, Chappidi, et al.
High TNR condition and both improved (one of whom improved above the level of the AI-Assistant). The nal four
users were in the High TPR condition, which was the most complimentary condition based on initial human tuning,
and all four users signicantly increased their accuracy above the level of the AI-Assistant.
5.7.2 Take a second look / verify. Ten users indicated that they would use the AI-recommendation to verify their
decisions or re-visit the video if the AI-Assistant recommendation disagreed with their initial assessment. User 166
(low-performer, High TNR) responded “I tried to ignore them and make my own choices and then look again if I disagreed."
User 182 (low-performer, High TPR) wrote “If it disagreed with my ndings I would re-evaluate the movie. Sometimes it
changed my mind and sometimes it didn’t." This approach would also benet from a complementary AI. If users and the
AI both tend to make mistakes in the same areas, teams would not improve in performance after users took a second
look in areas where they were weak. A complementary AI could be exploited by tuning the AI to prompt users to
scrutinize or put more cognitive eort into their areas of weakness.
5.7.3 Ignore. Six users reported that they ignored the AI-Assistant’s recommendation, all of which were mid-performers,
and four of these users’ accuracy went down in Stage 4. User 169 (mid-performer, High TNR, accuracy dropped from
60% to 55%) shared “After several movies I stopped paying attention to assistant’s opinion.". User 170 (mid-performer, High
TPR, accuracy dropped from 75% to 70%) also disregarded the AI-Assistant and said “I didn’t look at the recommendation
until after I viewed the movie and decided for myself. I didn’t really rely on the assistant." These results indicate that
mid-performers struggled with whether or not to use the AI-Assistant, and those that chose to ignore it often degraded
in performance.
5.7.4 Check stalled videos. Five users indicated that they mostly acted on information from the AI-Assistant when it
recommended a stall. User 110 (mid-performer, High TNR) wrote “If the assistant recommended a stall, I usually would
go back over the movie a few more times to try and catch it, just to gure out what it saw to ag it as a stall. Often, this
helped me catch a stall I missed, simply by suggesting it was a stall after all." User 107 (mid-performer, High TNR) wrote
“Only if I felt I could conrm them with my own eyes. Each time the assistant caught a stall that I didn’t, I went back to nd
it". Feedback from these users reinforces the utility of High TPR tuning (which is complementary to the users), as it
suggests more stalls.
6 DISCUSSION
Consistent with ndings from prior work [
24
,
30
,
52
,
56
,
57
], providing users with recommendations from an AI-
Assistant signicantly improved team performance on the Stall Catchers task. However, various experimental factors
in our study inuenced the extent of these gains such that some users had marginal improvements, some saw more
signicant improvements, and others degraded in performance. Although any improvements in performance are
benecial, performance gains that fall short of the level of the AI are common in prior work [
6
,
22
,
34
,
40
]. Our goal
was to understand how we can maximize Human-AI team performance such that it is signicantly better than the level
of the AI or the human alone. In this section, we reect on the study results to provide insights and guidance on ways
to maximize Human-AI team performance.
6.1 Baseline Expertise
The largest determining factor for superior Human-AI team performance was the base level expertise of the users.
Benets obtained from use of the AI-Assistant systematically varied with users’ baseline expertise on the task. Individuals
Manuscript submitted to ACM
Advancing Human-AI Complementarity 23
with low expertise signicantly improved their performance by working with an AI-Assistant, though never to accuracy
levels at or above the AI. In these cases, AI recommendations helped, but users lacked the knowledge or understanding of
when they should accept or reject an AI recommendation. Studies that primarily recruit novice users (such as Mechanical
Turk studies) likely experience similar issues. Conversely, the performance of individuals with high expertise was not
impacted by the AI-Assistant, and these users continued to perform at accuracy levels well above the AI-Assistant. For
individuals with mid-level performance the results were mixed, some improved above the level of the AI, and others
performed worse. Understanding the performance gains and losses for this class of users provides important insights
on how to eectively support Human-AI collaboration.
6.2 Tuning for Complementarity
A second determining factor of success was related to tuning the AI algorithm for complementarity. Too often, the
focus for an AI model is overall algorithmic performance or accuracy. As our work shows, even in situations when the
overall accuracy level remains constant, dierent tunings of an algorithm can produce signicantly dierent results.
Dierential tuning for false positives or false negatives reinforced or complemented users’ strengths and weak spots.
For example, in our study, mid-performing users beneted signicantly when the algorithm was tuned for a high TPR as
it complemented users’ natural bias towards a high TNR. As a result, the High TPR condition produced more consistent
improvements and resulted in more users achieving accuracy results above the level of the AI.
The relative performance of the AI compared to the users also inuenced gains obtained from working with the AI.
Across all our experimental conditions, signicant increases in a specic performance measure only occurred when the
users’ score started below the AI-assistant’s tuning for that measure. For example, if users had a lower TPR, then an AI
tuned for a high TPR would improve users’ TPR as well as their overall accuracy. This nding suggests that tuning
an algorithm to complement a user’s weakness can improve user performance on that measure and increase overall
Human-AI team performance.
At the same time, tuning for complementarity must not sacrice trust from the user in an AI tool. Lu & Yin nd that
human reliance on a model is signicantly inuenced by their agreement with an AI during task instances when they
have high condence in their own decision [
39
]. Thus, if an AI tuned to complement a user’s weaknesses makes too
many incorrect recommendations in the area of a user’s strengths, human trust and reliance on the AI may decrease.
As a result, there is likely a balance to be struck between complementary tuning and trust, which is discussed in the
next section.
Another important consideration related to algorithmic tuning includes users’ strengths in identifying or overriding
positive or negative cases. Optimal Human-AI team performance comes from users being able to accept instances when
the AI is correct, and override instances when the AI is incorrect. By examining when users aligned with and ignored
AI recommendations, we found that users in each experimental tuning condition were equally likely to agree with the
AI when it was correct. However, when the AI was incorrect, users in the High TPR condition were more likely to
override AI recommendations, likely because of an inherent strength in detecting owing vessels.
The Selective Accessibility (SA) model—a social cognition concept that attempts to explain the inuence of anchor
values and comparative judgements during human-decision making—might explain why our study found that the
High TPR condition increased user performance [
43
]. The SA model posits that when users are given an anchor value
during a task (in this case, the AI recommendation), users engage in hypothesis-consistent testing, actively retrieving
knowledge and their personal framework related to the judgement they are being asked to make. Given incorrect stall
recommendations in the High TPR condition, users might have activated hypotheses related to whether the video
Manuscript submitted to ACM
24 Inkpen, Chappidi, et al.
was stalled, rather than using their standard decision-making framework to decide if a video is stalled or owing.
Since users were better at detecting owing videos, it is possible that users had more robust evidence for factors that
characterized a certain video. This prior may give them more success in rejecting stalled recommendations at the
correct instances compared to owing recommendations. If users had a stronger hypothesis or evidence requirements
for one type of video, the SA model could explain why participants struggled to disagree with incorrect owing AI
suggestions in the High TPR condition. The SA model and the idea of hypothesis-consistent testing support our ndings
that complementary AI systems can enhance performance by nudging characteristics of a video to be more salient,
accessible, or relevant to the user during decision-making.
Klayman & Ha also nd that the most critical way to test hypotheses is a "positive test strategy," where participants
examine cases with the target characteristic and retrieve knowledge from their memory related to the assumption
provided by an anchor [
32
]. In our study, this strategy would involve users examining videos where the AI suggested a
stall, and evaluating the video with their past memories and expectations. This nding is consistent with users who
chose to focus their attention on positive AI suggestions and qualitatively reported their AI-usage workow as "checked
stalled videos."
6.3 Perceived Performance, Trust, Helpfulness, Satisfaction
The third determining factor of superior performance for users in our study was the perceived performance of the
AI-Assistant relative to their own performance. As shown in prior work [
9
,
35
,
62
], trust is a critical factor related to
Human-AI team performance; however, its impacts are not well understood and can be dicult to measure reliably. For
example, when a human agrees with the AI, it is dicult to distinguish whether they agreed because the human trusted
the AI or because they independently solved the task without AI help and happened to agree. Many studies rely either
on proxy measurements of trust (e.g., agreement, time of completion, usage, stickiness) or self-reported scores (e.g.,
satisfaction, perceived accuracy, recommending the assistant to others). We examined users’ self-ratings of their own
performance and performance of the AI to gauge users’ perception of, and possibly trust in the AI-Assistant.
Regardless of the absolute ranking, users who rated the AI performance above their own were more likely to
benet signicantly from the AI recommendations, while users who rated the AI-Assistant and their own performance
similarly tended to degrade in performance. It is possible that users who felt that the AI was performing better than
themselves inherently trusted the AI-Assistant more, resulting in more adherence to the recommendations and superior
performance. In addition, users’ ratings of the helpfulness of the AI-Assistant and satisfaction with the AI-Assistant
were also found to be positive indicators of success.
This result is also supported by the Technology Acceptance Model (TAM), which suggests that when users are
presented with new technology, factors including perceived usefulness (PU) and perceived ease-of-use (PEOU) inuence
their decision about when and how they use the new technology [
55
]. Ranking the AI-Assistant high in terms of
performance, helpfulness, and satisfaction all suggest that users perceived the AI-Assistant to be useful, which would
contribute to their acceptance of the AI-Assistant and its recommendations.
6.4 Mental Model Formation
Mental models can help users form trust and accept recommendations from an AI [
5
,
62
]; however, our ndings
suggest that users were unable to create an accurate mental model of the AI-Assistant’s recommendations for the Stall
Catchers task. Most users believed that the AI-Assistant made equal errors on "stalled" and "owing" videos regardless of
Manuscript submitted to ACM
Advancing Human-AI Complementarity 25
experimental condition. This suggests that the complementary tuning of the AI-Assistant may have been less eective
in supporting mental model creation.
Our study intentionally did not present any information to users about how the algorithm was tuned or its overall
accuracy so that we would be able to observe the presented eects without biasing their perception. However, for
real-world deployments, when reliable tuning scores are available, presenting them to users may support creation of a
mental model about the AI. Previous work has emphasized the importance of providing such information in order to
set expectations on performance [
3
,
33
,
61
]. However, further innovation is needed to detail these summaries beyond
aggregate scores of performance (e.g., aggregate accuracy, TPR, TNR) potentially with interactive and on-demand
visualizations that explain to users when and how the AI fails and what it is able to do for them.
An accurate mental model alone will not always lead to better Human-AI team performance. Even when a human
understands when the AI is not reliable, they still need to solve the task accurately. This problem is even more relevant
for multi-class predictions or content generation where the domain space of the solution is of a higher dimension. Again,
other forms of assistance may be needed in such cases that can guide the team towards more detailed and informed
reasoning (e.g., similar historical examples, outlier analysis, data exploration capabilities etc.).
6.5 Role of the AI-Assistant
While the TAM suggests that perceived usefulness and ease-of-use can predict acceptance/usage of technology, it is
also important to examine how users are working with and perceiving the AI system. Our data indicated that users
incorporated a variety of strategies when working with an AI-Assistant. These self-reported strategies may have
implications for user performance with AI, as prior work has shown performance dierences when users view the AI
system as a second reader versus an arbitrator [
57
]. In our study, some users provided feedback that they followed the
AI if they were unsure, hinting at its role as an arbitrator. Meanwhile, other users indicated that they would double
check the video if the AI disagreed with them, hinting at the possibility that they used the AI as a second reader in the
task. Two other users indicated that condence in their decisions increased when the AI agreed with them.
We also observed users who blindly followed the AI recommendations, suggesting overreliance on the AI. Algorithmic
appreciation is generally correlated with overreliance, but it is also important to understand why users are displaying a
certain behavior. For example, one user indicated that they followed the AI "out of laziness," which has more to do with
their own eort as opposed to appreciation for the AI. We also had many users who indicated that they ignored or did
not rely on the AI-Assistance, which could be a product of general algorithmic aversion or a result of the AI’s individual
performance. In either of these cases, methods to foster trust or increase reliance on the AI in complementary instances
might be dependent on how or why the user is using the AI in a certain way.
Since users in our study were not instructed to use the AI in any specic manner, these results indicate yet another
factor that can inuence Human-AI team performance. Users can have the same expertise level and mental models and
end up with completely dierent results based on how they work with the AI, such as a colleague, an assistant, a tool,
a distractor, etc. More work is needed to better understand the impact of these dierent strategies, which may also
change as users become more comfortable working with AI systems.
6.6 Tuning for Tasks/Scenarios or Users
Often, the ideal tuning for an AI model will depend on the task or scenario. For example, in a criminal justice situation,
the model may be tuned to minimize false positives so innocent people don’t get put in jail, while in a medical context,
a model may be tuned to minimize false negatives so a cancer diagnosis is not missed. However, as we have seen from
Manuscript submitted to ACM
26 Inkpen, Chappidi, et al.
this study, if both the users and the AI have the same tuning, performance gains from the Human-AI team may be
stied. In our study, an AI tuning that complemented users resulted in the largest performance gains. Additionally,
sacricing accuracy on a measure that users had strengths in, did not cause a degradation in that measure.
There are several potential directions for improvement inspired by these ndings dierentiating AI impact based on
the algorithmic tuning. First, these ndings can inform model selection prior to deployment. Furthermore, since dierent
users may have dierent levels of expertise and inherent biases, further benets can be achieved by personalizing the
AI tuning based on the human preference and expertise to deploy algorithms that are most complementary to the user.
One potential challenge in tuning for user complementarity is that AI models are often trained via human annotations,
and therefore, approximate user behaviors in aggregate. While recent work has aimed at training more complementary
models [
59
], often there may not exist enough complementarity in the available labels themselves to train models with
these properties. Therefore, other forms of assistance to human experts may be needed such as providing additional
information instead of oering a recommendation. Examples of additional information may include digested summaries
of data distributions or presenting historical examples.
Tuning AI systems for complementarity with human users may prove to be more complicated for tasks that extend
beyond multi-choice classication, including text generation or information retrieval. As tasks become more complex,
factors such as eciency, productivity, and cognitive eort may inuence optimal human-AI team performance. For
example, in the context of text generation, there may be trade-os in satisfaction and productivity between providing
a long output or translation that the user can edit and a shorter, more precise output that requires more user input.
Given the various ways in which users can incorporate AI recommendations into their decision making, the success of
an AI tool for a more complex task may rest on its ability to complement an individual’s decision making framework
or workow. Future studies should explore the role of human expertise and algorithmic tuning as it relates to more
complex tasks, such as retrieval, ranking, translation, and image/text generation.
6.7 Limitations
While the results from this study provide important insights related to Human-AI decision making, there are several
limitations which may have inuenced our ndings. First, given that we wanted to examine a complex, real-world
task that users had experience with, the available population of users was limited. This resulted in smaller sample
sizes, particularly for high performing users, and an unequal distribution of participant expertise across experimental
conditions. This potentially constrains some of our conclusions since the overall statistical dierences observed may
have been inuenced by the larger sample size in the mid- and low-performing clusters. We are also missing some
survey data from the rst eleven participants due to minor changes introduced in the survey after deployment.
Since recommendations from the AI-Assistant were shown at the beginning of each task, it was dicult to fully
understand how much the AI-Assistant’s recommendations inuenced users. It is possible that in many instances where
the user "agreed" or aligned with the AI, the user was going to select the response regardless of the AI’s recommendation.
At the same time, fully decoupling anchoring bias eects from intentional agreement [
48
] would require experimentation
with other types of workows where the human takes a temporary decision rst and then has a chance to revise it after
getting assistance from the AI. The challenge still remains valid even for such workows, however, given that humans
could still anchor to second opinions, which calls for further analytical methods encouraging self-reection and explicit
argumentation on why the decision maker may or may not agree with the AI.
The presentation of the task in the citizen science Stall Catchers platform frames the task as "nding stalled vessels."
This framing may have biased users towards focusing on stalled vessels, and also focusing on stalled recommendations
Manuscript submitted to ACM
Advancing Human-AI Complementarity 27
from the AI-Assistant. There is some evidence for this in the qualitative survey data, as some users indicated that a
stalled recommendation from the AI-Assistant made them take a second look to see if a stall was actually there. We did
not receive reciprocal comments regarding users trying to verify owing recommendations from the AI-Assistant.
7 CONCLUSION AND FUTURE WORK
This work examined the impact of algorithmic tuning and human expertise in Human-AI team performance, as well as
the inuence of user perceptions of AI. Our study centered around a real-world citizen science platform, Stall Catchers,
to understand the impact of assisting users in a complex decision-making task. Our results highlight how degrees of
human expertise can signicantly impact the potential value of AI assistance, as benets from AI assistance are highly
variable for users whose individual performance is similar to that of the AI. Furthermore, the work showed that there
are opportunities to boost team performance by deploying models that are complementary to human capabilities and
fostering Human-AI interactions that help users develop appropriate mental models, as well as trust and condence in
AI systems.
Informed by these ndings, we envision several directions for future work. The dierential impact of human expertise
and algorithmic tuning suggests that dierent types of AI could be benecial for dierent users. Personalization
techniques have been extensively studied in the context of content or item recommendation but not suciently for
customizing AI partners to users. Such techniques will require learning and updating user models over time and
adjusting AI deployments to those changes. While historical data for users may be initially sparse, possible avenues
can be explored to apply these approaches to groups of individuals based on their expertise, similar to the clusters
of users that we explored in this study. Considering the challenges of assisting human experts posed in our ndings,
personalizations of AI assistants may even need to go beyond measures of true positive and negative rates—instead
considering more ne-grained denitions of error boundaries to build models whose main goal is to provide assistance
for problems and examples that are most dicult to experts based on input characteristics. For such an audience,
productivity is another aspect of improvement. Even if the human remains equally accurate with AI assistance, assistance
that helps users make accurate decisions faster can still be benecial for particular use cases. Overall, we hope that
these ndings inspire eorts to improve AI systems with human-centered considerations in ways that best augment
and complement users.
ACKNOWLEDGMENTS
This study would not have been possible without the contributions of Stall Catchers users. The authors would like to
thank Ece Kamar for her assistance with this project and the anonymous reviewers for their useful feedback.
REFERENCES
[1] [n.d.]. Join a global game that’s trying to cure Alzheimer’s. https://stallcatchers.com/
[2]
Veronika Alexander, Collin Blinder, and Paul J Zak. 2018. Why trust an algorithm? Performance, cognition, and neurophysiology. Computers in
Human Behavior 89 (2018), 279–288.
[3]
Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N Bennett, Kori
Inkpen, et al. 2019. Guidelines for human-AI interaction. In Proceedings of the 2019 chi conference on human factors in computing systems. 1–13.
[4]
Gagan Bansal, Besmira Nushi, Ece Kamar, Walter S Lasecki, Daniel S Weld, and Eric Horvitz. 2019. Beyond accuracy: The role of mental models in
human-ai team performance. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, Vol. 7. 2–11.
[5]
Gagan Bansal, Besmira Nushi, Ece Kamar, Daniel S Weld, Walter S Lasecki, and Eric Horvitz. 2019. Updates in human-ai teams: Understanding and
addressing the performance/compatibility tradeo. In Proceedings of the AAAI Conference on Articial Intelligence, Vol. 33. 2429–2437.
Manuscript submitted to ACM
28 Inkpen, Chappidi, et al.
[6]
Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel Weld. 2021. Does the whole
exceed its parts? the eect of ai explanations on complementary team performance. In Proceedings of the 2021 CHI Conference on Human Factors in
Computing Systems. 1–16.
[7]
Oliver Bracko, Lindsay K. Vinarcsik, Jean C. Cruz Hernández, Nancy E. Ruiz-Uribe, Mohammad Haft-Javaherian, Kaja Falkenhain, Egle M.
Ramanauskaite, Muhammad Ali, Aditi Mohapatra, Madisen A. Swallow, Brendah N. Njiru, Victorine Muse, Pietro E. Michelucci, Nozomi Nishimura,
and Chris B. Schaer. 2020. High fat diet worsens Alzheimer’s disease-related behavioral abnormalities and neuropathology in APP/PS1 mice, but
not by synergistically decreasing cerebral blood ow. Scientic Reports 10, 1 (June 2020), 1–16. https://doi.org/10.1038/s41598-020- 65908-y Number:
1 Publisher: Nature Publishing Group.
[8]
Zana Buçinca, Phoebe Lin, Krzysztof Z. Gajos, and Elena L. Glassman. 2020. Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating
Explainable AI Systems. In Proceedings of the 25th International Conference on Intelligent User Interfaces (Cagliari, Italy) (IUI ’20). Association for
Computing Machinery, New York, NY, USA, 454–464. https://doi.org/10.1145/3377325.3377498
[9]
Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z Gajos. 2021. To trust or to think: cognitive forcing functions can reduce overreliance on AI in
AI-assisted decision-making. Proceedings of the ACM on Human-Computer Interaction 5, CSCW1 (2021), 1–21.
[10]
Adrian Bussone, S. Stumpf, and D. O’Sullivan. 2015. The Role of Explanations on Trust and Reliance in Clinical Decision Support Systems. 2015
International Conference on Healthcare Informatics (2015), 160–169.
[11]
Chun-Wei Chiang and Ming Yin. 2021. You’d Better Stop! Understanding Human Reliance on Machine Learning Models under Covariate Shift.
(2021).
[12] Robyn M Dawes. 1979. The robust beauty of improper linear models in decision making. American psychologist 34, 7 (1979), 571.
[13] Maria De-Arteaga, Riccardo Fogliato, and Alexandra Chouldechova. 2020. A case for humans-in-the-loop: Decisions in the presence of erroneous
algorithmic scores. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–12.
[14]
Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. 2015. Algorithm aversion: People erroneously avoid algorithms after seeing them err.
Journal of Experimental Psychology: General 144, 1 (2015), 114.
[15]
Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. 2018. Overcoming algorithm aversion: People will use imperfect algorithms if they can
(even slightly) modify them. Management Science 64, 3 (2018), 1155–1170.
[16] Jaap J Dijkstra. 1999. User agreement with incorrect expert system advice. Behaviour & Information Technology 18, 6 (1999), 399–411.
[17]
Jaap J Dijkstra, Wim BG Liebrand, and Ellen Timminga. 1998. Persuasiveness of expert systems. Behaviour & Information Technology 17, 3 (1998),
155–163.
[18] Mary T Dzindolet, Linda G Pierce, Hall P Beck, and Lloyd A Dawe. 2002. The perceived utility of human and automated aids in a visual detection
task. Human Factors 44, 1 (2002), 79–94.
[19] Hillel J Einhorn. 1986. Accepting error to make less error. Journal of personality assessment 50, 3 (1986), 387–395.
[20]
Charles Elkan. 2001. The foundations of cost-sensitive learning. In International joint conference on articial intelligence, Vol. 17. Lawrence Erlbaum
Associates Ltd, 973–978.
[21]
Kaja Falkenhain, Nancy E. Ruiz-Uribe, Mohammad Haft-Javaherian, Muhammad Ali, Stall Catchers, Pietro E. Michelucci, Chris B. Schaer, and
Oliver Bracko. 2020. A pilot study investigating the eects of voluntary exercise on capillary stalling and cerebral blood ow in the APP/PS1 mouse
model of Alzheimer’s disease. PLOS ONE 15, 8 (Aug. 2020), e0235691. https://doi.org/10.1371/journal.pone.0235691 Publisher: Public Library of
Science.
[22]
Shi Feng and Jordan Boyd-Graber. 2019. What can ai do for me? evaluating machine learning interpretations in cooperative play. In Proceedings of
the 24th International Conference on Intelligent User Interfaces. 229–239.
[23]
Susanne Gaube, Harini Suresh, Martina Raue, Alexander Merritt, Seth J Berkowitz, Eva Lermer, Joseph F Coughlin, John V Guttag, Errol Colak, and
Marzyeh Ghassemi. 2021. Do as AI say: susceptibility in deployment of clinical decision-aids. NPJ digital medicine 4, 1 (2021), 1–8.
[24]
Yashesh Gaur, Walter S Lasecki, Florian Metze, and Jerey P Bigham. 2016. The eects of automatic speech recognition quality on human
transcription latency. In Proceedings of the 13th Web for All Conference. 1–8.
[25]
Ana Valeria Gonzalez, Gagan Bansal, Angela Fan, Robin Jia, Yashar Mehdad, and Srinivasan Iyer. 2021. Human Evaluation of Spoken vs. Visual
Explanations for Open-Domain QA. ACL (2021).
[26]
Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. 2005. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In
International conference on intelligent computing. Springer, 878–887.
[27]
Scott Highhouse. 2008. Stubborn reliance on intuition and subjectivity in employee selection. Industrial and Organizational Psychology 1, 3 (2008),
333–342.
[28]
Kevin Anthony Ho and Masooda Bashir. 2015. Trust in automation: Integrating empirical evidence on factors that inuence trust. Human factors
57, 3 (2015), 407–434.
[29]
Ece Kamar. 2016. Directions in Hybrid Intelligence: Complementing AI Systems with Human Intelligence. In Proceedings of the Twenty-Fifth
International Joint Conference on Articial Intelligence (New York, New York, USA) (IJCAI’16). AAAI Press, 4070–4073.
[30] Ece Kamar. 2016. Directions in Hybrid Intelligence: Complementing AI Systems with Human Intelligence.. In IJCAI. 4070–4073.
[31]
Ece Kamar, Severin Hacker, and Eric Horvitz. 2012. Combining Human and Machine Intelligence in Large-Scale Crowdsourcing. In Proceedings of
the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1 (Valencia, Spain) (AAMAS ’12). International Foundation
for Autonomous Agents and Multiagent Systems, Richland, SC, 467–474.
Manuscript submitted to ACM
Advancing Human-AI Complementarity 29
[32]
Joshua Klayman and Young-Won Ha. 1987. Conrmation, disconrmation, and information in hypothesis testing. Psychological review 94, 2 (1987),
211.
[33]
Rafal Kocielnik, Saleema Amershi, and Paul N Bennett. 2019. Will you accept an imperfect ai? exploring designs for adjusting end-user expectations
of ai systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–14.
[34]
Vivian Lai and Chenhao Tan. 2019. On human predictions with explanations and predictions of machine learning models: A case study on deception
detection. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 29–38.
[35] John D Lee and Katrina A See. 2004. Trust in automation: Designing for appropriate reliance. Human factors 46, 1 (2004), 50–80.
[36]
Constance D Lehman, Robert D Wellman, Diana SM Buist, Karla Kerlikowske, Anna NA Tosteson, Diana L Miglioretti, Breast Cancer Surveillance
Consortium, et al
.
2015. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA internal
medicine 175, 11 (2015), 1828–1837.
[37]
Greg Lipstein. 2020. Meet the winners of the Clog Loss Challenge for Alzheimer’s Research. https://www.drivendata.co/blog/clog-loss-alzheimers-
winners Section: blog.
[38]
Jennifer M Logg, Julia A Minson, and Don A Moore. 2019. Algorithm appreciation: People prefer algorithmic to human judgment. Organizational
Behavior and Human Decision Processes 151 (2019), 90–103.
[39]
Zhuoran Lu and Ming Yin. 2021. Human Reliance on Machine Learning Models When Performance Feedback is Limited: Heuristics and Risks. In
Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–16.
[40]
Scott M Lundberg, Bala Nair, Monica S Vavilala, Mayumi Horibe, Michael J Eisses, Trevor Adams, David E Liston, Daniel King-Wai Low, Shu-Fang
Newman, Jerry Kim, et al
.
2018. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nature biomedical
engineering 2, 10 (2018), 749–760.
[41]
Poornima Madhavan and Douglas A Wiegmann. 2007. Similarities and dierences between human–human and human–automation trust: an
integrative review. Theoretical Issues in Ergonomics Science 8, 4 (2007), 277–301.
[42] Pietro Michelucci. 2019. The People and Serendipity of the EyesOnALZ project. Narrative inquiry in bioethics 9, 1 (2019), 29–33.
[43]
Thomas Mussweiler and Fritz Strack. 1999. Comparing is believing: A selective accessibility model of judgmental anchoring. European review of
social psychology 10, 1 (1999), 135–167.
[44] Jill Nugent. 2018. iNaturalist: citizen science for 21st-century naturalists. Science Scope 41, 7 (2018), 12.
[45] Jill Nugent. 2021. Accelerating Alzheimer’s Research With Stall Catchers Breadcrumb. The Science Teacher 88, 4 (2021).
[46]
Forough Poursabzi-Sangdeh, Daniel G Goldstein, Jake M Hofman, Jennifer Wortman Wortman Vaughan, and Hanna Wallach. 2021. Manipulating
and measuring model interpretability. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–52.
[47]
Andrew Prahl and Lyn Van Swol. 2017. Understanding algorithm aversion: When is advice from automation discounted? Journal of Forecasting 36, 6
(2017), 691–702.
[48]
Charvi Rastogi, Yunfeng Zhang, Dennis Wei, Kush R Varshney, Amit Dhurandhar, and Richard Tomsett. 2020. Deciding Fast and Slow: The Role of
Cognitive Biases in AI-assisted Decision-making. arXiv preprint arXiv:2010.07938 (2020).
[49]
Paul Robinette, Ayanna M Howard, and Alan R Wagner. 2017. Eect of robot performance on human–robot trust in time-critical situations. IEEE
Transactions on Human-Machine Systems 47, 4 (2017), 425–436.
[50]
James Schaer, John O’Donovan, James Michaelis, Adrienne Raglin, and Tobias Höllerer. 2019. I can do better than your AI: expertise and explanations.
In Proceedings of the 24th International Conference on Intelligent User Interfaces. 240–251.
[51]
Robert Simpson, Kevin R Page, and David De Roure. 2014. Zooniverse: observing the world’s largest citizen science platform. In Proceedings of the
23rd international conference on world wide web. 1049–1054.
[52]
David F Steiner, Robert MacDonald, Yun Liu, Peter Truszkowski, Jason D Hipp, Christopher Gammage, Florence Thng, Lily Peng, and Martin C
Stumpe. 2018. Impact of deep learning assistance on the histopathologic review of lymph nodes for metastatic breast cancer. The American journal
of surgical pathology 42, 12 (2018), 1636.
[53]
Brian L Sullivan, Jocelyn L Aycrigg, Jessie H Barry, Rick E Bonney, Nicholas Bruns, Caren B Cooper, Theo Damoulas, André A Dhondt, Tom
Dietterich, Andrew Farnsworth, et al
.
2014. The eBird enterprise: an integrated approach to development and application of citizen science. Biological
Conservation 169 (2014), 31–40.
[54]
Sarah Tan, Julius Adebayo, Kori Inkpen, and Ece Kamar. 2018. Investigating human+ machine complementarity for recidivism predictions. arXiv
preprint arXiv:1808.09123 (2018).
[55]
Viswanath Venkatesh and Fred D Davis. 2000. A theoretical extension of the technology acceptance model: Four longitudinal eld studies.
Management science 46, 2 (2000), 186–204.
[56]
Dayong Wang, Aditya Khosla, Rishab Gargeya, Humayun Irshad, and Andrew H Beck. 2016. Deep learning for identifying metastatic breast cancer.
arXiv preprint arXiv:1606.05718 (2016).
[57]
Guangyu Wang, Xiaohong Liu, Jun Shen, Chengdi Wang, Zhihuan Li, Linsen Ye, Xingwang Wu, Ting Chen, Kai Wang, Xuan Zhang, et al
.
2021.
A deep-learning pipeline for the diagnosis and discrimination of viral, non-viral and COVID-19 pneumonia from chest X-ray images. Nature
Biomedical Engineering (2021), 1–13.
[58]
Xinru Wang and Ming Yin. 2021. Are Explanations Helpful? A Comparative Study of the Eects of Explanations in AI-Assisted Decision-Making. In
26th International Conference on Intelligent User Interfaces. 318–328.
[59] Bryan Wilder, Eric Horvitz, and Ece Kamar. 2020. Learning to complement humans. In IJCAI.
Manuscript submitted to ACM
30 Inkpen, Chappidi, et al.
[60]
Michael Yeomans, Anuj Shah, Sendhil Mullainathan, and Jon Kleinberg. 2019. Making sense of recommendations. Journal of Behavioral Decision
Making 32, 4 (2019), 403–414.
[61]
Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. 2019. Understanding the eect of accuracy on trust in machine learning models. In
Proceedings of the 2019 chi conference on human factors in computing systems. 1–12.
[62]
Yunfeng Zhang, Q. Vera Liao, and Rachel K. E. Bellamy. 2020. Eect of condence and explanation on accuracy and trust calibration in AI-assisted
decision making. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (Jan 2020). https://doi.org/10.1145/3351095.3372852
Manuscript submitted to ACM