Human-AI Complementarity in Hybrid Intelligence
Systems: A Structured Literature Review
Completed Research Paper
Patrick Hemmer1
Karlsruhe Institute of Technology
Karlsruhe, Germany
patrick.hemmer@kit.edu
Max Schemmer
Karlsruhe Institute of Technology
Karlsruhe, Germany
max.schemmer@kit.edu
Michael Vössing
Karlsruhe Institute of Technology
Karlsruhe, Germany
michael.voessing@kit.edu
Niklas Kühl
Karlsruhe Institute of Technology
Karlsruhe, Germany
niklas.kuehl@kit.edu

Patrick Hemmer and Max Schemmer contributed equally in a shared first authorship.
Abstract
Hybrid Intelligence is an emerging concept that emphasizes the complementary nature of
human intelligence and artificial intelligence (AI). One key requirement for collaboration
between humans and AI is the interpretability of the decisions provided by the AI to enable
humans to assess whether to comply with the presented decisions. Due to the black-box
nature of state-of-the-art AI, the explainable AI (XAI) research community has developed
various means to increase interpretability. However, many studies show that increased
interpretability through XAI does not necessarily result in complementary team
performance (CTP). Through a structured literature review, we identify relevant factors
that influence collaboration between humans and AI. Additionally, as we collect relevant
research articles and synthesize their findings, we develop a research agenda with relevant
hypotheses to lay the foundation for future research on human-AI complementarity in
Hybrid Intelligence systems.
Keywords: Human-AI Complementarity, Complementary Team Performance,
Explainable Artificial Intelligence, Hybrid Intelligence
Introduction
Over the last years, an unprecedented development in the field of artificial intelligence (AI) has
contributed to an improvement in the prediction accuracy of modern AI systems, even exceeding the
capabilities of domain experts in an increasing number of fields (He et al., 2015). These advancements have fueled
the ongoing discussion of whether AI will replace domain experts in the foreseeable future (Schuetz
and Venkatesh, 2020). However, in many application domains, reducing human autonomy might not
be desirable. For example, in situations in which perfect algorithmic accuracy is not attainable, the cost
of errors might not be acceptable. Moreover, legal regulations and ethical considerations might make
full algorithmic automation undesirable from a societal perspective. Additionally, the capabilities of AI
are often limited to narrowly defined application contexts as the utilized algorithms often struggle to
handle instances that differ from the patterns learned during training (D'Amour et al., 2020). In these
cases, humans can leverage capabilities not possessed even by state-of-the-art AI, for example,
intuition, creativity, and also common sense.
This line of thought gives rise to the vision of so-called Hybrid Intelligence (HI) systems (Dellermann et
al., 2019). The HI concept proposes to combine the complementary capabilities of humans and AI by
facilitating collaboration to achieve superior results in comparison to the isolated entities operating
independently (Dellermann et al., 2019; Liu et al., 2021). In this context, humans and AI are regarded as
equal team members that solve tasks in cooperation (Siemon et al., 2021). In this context, we understand
complementary team performance (CTP) as the desired outcome of HI systems, i.e., the team
performance exceeds the maximum performance of both individual entities. One of the key
requirements for the success of HI systems is that they enable humans to understand the
decisions provided by the AI and allow them to draw conclusions about when and to what extent they
can rely on the AI’s prediction. This concept can be traced back to the early research on expert systems
(Nunes and Jannach, 2017; Swartout and Moore, 1993).
Nowadays, with the rise of AI, algorithms emerging from the field of explainable AI (XAI) offer a rich
repertoire of explainability techniques to be applied within HI systems (Bansal et al., 2020; Chu et al.,
2020; Liu et al., 2021). Since XAI techniques are a means to explain the decision-making process of
black-box models, we focus on HI systems that leverage XAI as a collaboration mechanism (Zschech
et al. 2021). In this context, Doshi-Velez and Kim (2017) propose a taxonomy of evaluation approaches
for interpretability. Their results emphasize the need for the rigorous empirical evaluation of XAI
algorithms. Although the body of literature dedicated to the evaluation of these XAI approaches is
steadily increasing, their utility in terms of CTP remains largely unexplored, as the research community
has initially focused on the study of constructs such as system trust (Davis et al., 2020). To date, there
exists no structured literature review on relevant factors impacting CTP of HI systems. Therefore, in
this paper, we conduct a structured literature review (SLR) on user studies analyzing the task
performance of humans and AI separately as well as in the form of an HI system to answer the following
research question:
RQ1: What factors have been analyzed in user studies regarding the design of HI systems that impact
CTP?
As the identified factors depend on the human, the AI, and the task, we cluster them from a socio-
technical perspective (Maedche et al., 2019). A subsequent in-depth analysis of each perspective reveals
further factors that have not been taken into consideration by existing user studies. Thus, we formulate
the following second research question:
RQ2: What factors have not been analyzed in user studies regarding the design of HI systems that
impact CTP?
To answer the second research question, we derive and discuss, based on the SLR, neglected but
relevant factors of CTP. Additionally, we propose testable hypotheses future research needs to address
to realize the full potential of HI systems.
The contributions of this paper are twofold: First, we collect the existing body of knowledge for HI
systems and describe possible relevant factors of CTP. Second, we discuss yet neglected but relevant
factors of CTP and formulate respective hypotheses for future work.
The remaining work is structured as follows: In the next section, we explain the conceptual foundations
with regard to XAI and HI systems. Subsequently, we outline the methodology applied to conduct the
SLR and present our findings in the results section. Afterwards, we derive and discuss yet neglected
factors that might be taken into consideration to achieve CTP. Lastly, the conclusion summarizes our
work by stressing its importance for the IS research discipline.
Conceptual Foundations
In the following, we provide a short overview of the two key concepts addressed in our work: Hybrid
Intelligence (HI) and explainable artificial intelligence (XAI).
Hybrid Intelligence
Dellermann et al. (2019, p. 640) define HI as “the ability to achieve complex goals by combining human
and AI, thereby reaching superior results to those each of them could have accomplished separately,
and continuously improve by learning from each other.” Human-AI complementarity focuses on the
first part of the definition. On the one hand, humans can rely on their senses, perceptions, emotional
intelligence, and social skills (Braga and Logan, 2017). On the other hand, AI excels at detecting
patterns or calculating probabilities (Dellermann et al., 2019). These complementary skill sets allow for
superior performance in specific tasks through collaboration. For example, managers can use emotional
intelligence to build relationships and motivate employees to work for the company (Davenport and
Kirby, 2016). In contrast, repetitive and monotonous work can be conducted by AI.
The goals of HI systems are manifold. Among them are increasing the effectiveness and efficiency of
the outcome of a specific task (Dellermann et al., 2019). In this work, we focus on task effectiveness
with regard to CTP. We follow Liu et al. (2021) and define CTP as the performance of teams consisting
of humans and AI that exceeds what either the AI or the human could have accomplished alone.
Performance can be measured by different metrics, depending on the particular task, e.g., accuracy,
recall, or the F1-score.
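To make the CTP criterion concrete, the following minimal Python sketch (our own illustration with hypothetical decision data, not taken from any reviewed study) computes the accuracy of the human alone, the AI alone, and the assisted team on the same task instances and checks whether the team exceeds the better individual performance:

```python
from sklearn.metrics import accuracy_score

# Hypothetical ground truth and decisions on the same task instances (illustrative values only).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_human = [1, 0, 0, 1, 0, 1, 1, 0]   # human deciding alone
y_ai = [1, 1, 1, 1, 0, 0, 0, 0]      # AI deciding alone
y_team = [1, 0, 1, 1, 0, 1, 0, 0]    # human deciding with (X)AI assistance

human_perf = accuracy_score(y_true, y_human)
ai_perf = accuracy_score(y_true, y_ai)
team_perf = accuracy_score(y_true, y_team)

# CTP: the team exceeds the maximum performance of both individual entities.
ctp_reached = team_perf > max(human_perf, ai_perf)
print(f"human={human_perf:.2f}, AI={ai_perf:.2f}, team={team_perf:.2f}, CTP={ctp_reached}")
```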
To enable CTP, humans need insights into AI decision-making. An emerging research stream that
enables interpretability of AI decisions is the field of XAI.
Explainable Artificial Intelligence
Explainability is a concept with a long tradition in the Information Systems (IS) research community.
With the rise of knowledge-based systems, expert systems, and intelligent agents in the 1980s and
1990s, the IS community laid the foundations for research on explainability (Meske et al., 2020). In this
context, Gregor and Benbasat (1999) provide a comprehensive overview of explanations in IS research.
The term XAI was first coined by Van Lent et al. (2004) to describe the ability of their system to explain
the behavior of agents in simulation games. The current rise of XAI originates from the need to increase
the interpretability of complex models (Wanner et al., 2020). In contrast to interpretable linear models,
more complex models can achieve higher performance. However, their inner workings are hard to grasp
for humans.
XAI encompasses a wide spectrum of algorithms. A comprehensive survey on many existing
explanation techniques can be found in Burkart and Huber (2021). In general, they can be differentiated
by their complexity, their scope, and their level of dependency (Adadi and Berrada, 2018).
Interpretability of a model directly depends on the complexity of the model. Wanner et al. (2020) cluster
different types of complexity into white-, grey-, and black-box models. They define white-box models as
models with perfect transparency, such as linear regressions. These models do not need additional
explainability techniques but are intrinsically explainable. Black-box models, on the other hand, tend
to achieve higher performance but lack interpretability. Lastly, grey-box models are not intrinsically
interpretable but are made interpretable with the help of additional explanation techniques. These
techniques can be differentiated in terms of their scope, i.e., being global or local explanations. Global
XAI techniques address holistic explanations of the models as a whole. In contrast, local explanations
function on an individual instance basis. Besides the scope, XAI techniques can also be differentiated
by whether they are model-agnostic, i.e., can be used with all kinds of models, or model-specific.
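As an illustration of these distinctions (our own sketch using standard scikit-learn functionality, not an approach taken from the reviewed studies), the snippet below applies a global, model-agnostic technique, permutation feature importance, to a black-box classifier; a local technique such as LIME would instead explain individual predictions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Black-box" model: accurate, but its inner workings are hard to grasp.
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Global, model-agnostic explanation: how much does shuffling each feature
# degrade performance on held-out data?
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
top_features = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])[:5]
for name, score in top_features:
    print(f"{name}: {score:.3f}")
```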
Methodology
To answer our research questions, we conducted a structured literature review (SLR) based on the
methodology outlined by vom Brocke et al. (2009). We developed a search string consisting of two
main parts. The first was XAI, including relevant synonyms, such as “explainable AI” or
“interpretability” combined with “Artificial Intelligence”. The second part comprised synonyms of
behavioral experiments, e.g., “user study” or “user evaluation”. To find the synonyms, we initiated our
SLR with an explorative search. The search string was iteratively extended, resulting in the following
final search string:
TITLE-ABS-KEY("explainable artificial intelligence" OR XAI OR "explainable AI" OR ( (
interpretability OR explanation ) AND ( "artificial intelligence" OR ai OR "machine
learning" ) ) ) AND ( "human performance" OR "human accuracy" OR "user study" OR
"empirical study" OR "online experiment" OR "human experiment" OR "behavioral
experiment" OR "human evaluation" OR "user evaluation")
Next, we selected an appropriate database. Our exploratory search revealed that relevant work is
dispersed across multiple publishers, conferences, and journals. Thus, we chose the SCOPUS database
to ensure comprehensive coverage.
Following that, we defined our inclusion criteria, i.e., articles that were in the scope of this SLR. We
included every article that (a) conducted empirical research, (b) reported performance measures, and
(c) focused on an application context where humans and AI perform the same task.
With our search string defined, we conducted the SLR from January to March 2021. We identified 256
articles through the keyword-based search. As a next step, we analyzed the abstract of each article and
filtered based on our inclusion criteria, leading to 61 articles. Afterwards, two independent researchers
read all articles in detail and applied the inclusion criteria again. This led to a total of 14 remaining
studies. Based on these, we conducted forward and backward search. With the forward and backward
search, we identified 15 additional articles, leading to a final set of 29 articles that were subsequently
analyzed in depth to collect data about each experiment. The increase in articles can be attributed to a
large number of relevant papers that had not yet been formally published.
The data collection process was conducted by two independent researchers. Differences were discussed
and corrected. The main focus of the SLR was to extract the treatments and outcomes of each
experiment reported in the studies. For example, if two XAI techniques were used and compared as
separate experimental treatments, we added two entries to our database.
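To illustrate this coding step, one possible structure for such a database entry is sketched below; the field names and values are our own assumptions, not the exact scheme used in this review:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExperimentRecord:
    """One experimental treatment extracted from a reviewed article (hypothetical schema)."""
    article: str                        # e.g., "Example study (2020)"
    task: str                           # e.g., "deception detection"
    data_type: str                      # "text", "image", "tabular", or "video"
    xai_technique: Optional[str]        # e.g., "feature importance"; None if AI prediction only
    human_perf: Optional[float]         # performance of the human alone, if reported
    ai_perf: Optional[float]            # performance of the AI alone, if reported
    ai_assisted_perf: Optional[float]   # human supported by the AI prediction only
    xai_assisted_perf: Optional[float]  # human supported by prediction plus explanation

# Two XAI techniques compared within one study yield two separate entries (illustrative values).
records = [
    ExperimentRecord("Example study (2020)", "deception detection", "text",
                     "feature importance", 0.60, 0.74, 0.68, 0.71),
    ExperimentRecord("Example study (2020)", "deception detection", "text",
                     "examples", 0.60, 0.74, 0.68, 0.66),
]
```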
We clustered the extracted treatments from a socio-technical view, which is in line with other research
(Buçinca et al., 2020). From a socio-technical view, HI systems can be divided into the following three
key elements that are connected through a collaboration mechanism (Goodhue and Thompson, 1995):
human(s) with a specific goal, the task that needs to be accomplished, and the technology, in our case
the AI (Maedche et al., 2019). The collaboration mechanism enables teamwork between humans and
AI regarding the task to be done. Figure 1 depicts the relationship between all relevant elements of an
HI system.
Figure 1. Key elements of an HI system.
Results
In this section, we present the results of our SLR. The final data set consists of 29 articles in which 93
XAI-related experimental conditions are described. We start by providing an overview of the subset of
articles that report CTP. Subsequently, we analyze the experimental conditions and cluster them
according to our introduced socio-technical approach.
Overview
We extract, whenever possible, four different performance metrics from the articles: human, AI, AI-
assisted, and XAI-assisted performance. Human and AI performance refer to the performance achieved
by humans or the AI when conducting the task individually. AI-assisted performance refers to the
human performance when provided with the AI prediction. Finally, XAI-assisted refers to the
performance of humans when provided with the AI’s recommendation as well as a supplementary
explanation. Both AI-assisted and XAI-assisted performance are measures of team performance. CTP is reached if
this team performance exceeds both human and AI performance.
Figure 2 displays the number of studies in which AI or XAI has a positive impact on HI-system
performance considering different constraints. As not all studies report all four performance metrics,
the following observations apply to different subsets of experiments. In general, 72 experiments
measure the AI-assisted performance and XAI-assisted performance, but not necessarily human or AI
performance. Of these, in 46 experiments XAI-assisted performance exceeds AI-assisted performance.
Moreover, 59 out of 63 experiments report that providing either the AI’s prediction or a supplementary
explanation (i.e., XAI) has a positive effect on human performance. In 53 experiments, all necessary
information to analyze CTP is given: human and AI performance and either AI-assisted or XAI-assisted
performance. Just 16 out of 53 experiments achieve CTP. In five of these 16 experiments, the
XAI-assisted performance exceeds the AI-assisted performance. The 16 experiments are reported
in two articles conducted by Bansal et al. (2020) and Chu et al. (2020).
Figure 2. Overview of the number of studies in which a) XAI-assisted performance
exceeds AI-assisted performance, b) AI- or XAI-assisted performance exceeds human
performance, c) CTP is achieved, and d) CTP is reached by using XAI.
Bansal et al. (2020) report 11 different experiments in which CTP is achieved, meaning that the team
performance exceeds both the individual AI and human performance. All experiments focus on textual
data. It is important to highlight that while CTP is reached, XAI does not yield a significant
improvement over pure AI-assisted recommendations.
Further, in 5 experiments reported by Chu et al. (2020), CTP is achieved. The experiments focus on the
task of predicting the age of a human based on facial images. In contrast to Bansal et al. (2020), XAI
has a significant effect on performance.
[Figure 2 data: of the 93 experimental conditions relevant for the study, a) 72 report both AI-assisted and
XAI-assisted performance, of which 46 show XAI-assisted exceeding AI-assisted performance; b) 63 report
human and AI- or XAI-assisted performance, of which 59 show AI- or XAI-assisted performance exceeding
human performance; c) 53 report human, AI, and AI- or XAI-assisted performance, of which 16 reach CTP;
d) 44 report human, AI, AI-assisted, and XAI-assisted performance, of which 5 reach CTP with XAI-assisted
exceeding AI-assisted performance.]
In addition to identifying existing studies that demonstrate CTP of HI systems, in the following, we
provide an overview of all experimental conditions examined across the 93 experiments. While most of
these conditions have currently not led to CTP, they still have shown some effect on team performance
or on relevant behavioral constructs, such as trust or cognitive load. We group the experimental conditions
into four groups: collaboration characteristics, task characteristics, AI characteristics, and human
characteristics (see Figure 1).
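Under the same hypothetical record structure introduced in the methodology section, the categories summarized in Figure 2 can be obtained as simple filters over the extracted records; the snippet below is a self-contained sketch with illustrative values, not the analysis code used for this review:

```python
# Illustrative records with the four performance metrics (None = not reported).
records = [
    {"human": 0.60, "ai": 0.74, "ai_assisted": 0.68, "xai_assisted": 0.71},
    {"human": 0.80, "ai": 0.75, "ai_assisted": 0.82, "xai_assisted": None},
    {"human": None, "ai": 0.70, "ai_assisted": 0.72, "xai_assisted": 0.73},
]

def reaches_ctp(record):
    """CTP: the best reported assisted performance exceeds both individual performances."""
    assisted = [p for p in (record["ai_assisted"], record["xai_assisted"]) if p is not None]
    return max(assisted) > max(record["human"], record["ai"])

# Category c) of Figure 2: records with all information needed to evaluate CTP.
evaluable = [r for r in records
             if r["human"] is not None and r["ai"] is not None
             and (r["ai_assisted"] is not None or r["xai_assisted"] is not None)]

print(f"{sum(reaches_ctp(r) for r in evaluable)} of {len(evaluable)} evaluable records reach CTP")
```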
Collaboration Characteristics
One important factor regarding the collaboration is the order in which the AI's predictions and
explanations are made available to the human. Green and Chen (2019) find empirical evidence that
asking participants to make a prediction before providing them with the AI prediction or explanation
leads to better XAI-assisted performance. One possible reason might be that users are encouraged to
invest cognitive effort in an active way instead of passively accepting the AI’s suggestion (Green and
Chen, 2019).
Another important factor is the interactivity of the HI system. Liu et al. (2021) test this experimental
condition and find an improvement in terms of human perception of AI-assistance. Bansal et al. (2020)
implement adaptive explanations based on the confidence of the AI decision. If the AI prediction has a
low confidence, more explanations are given. Their reasoning behind this approach is that a low
confidence indicates a higher probability of a wrong AI prediction. By displaying more explanations,
the human is encouraged to reflect more about the prediction.
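A minimal sketch of this adaptive idea is shown below; it is our simplification of the concept, not Bansal et al.'s (2020) implementation, and the threshold is an arbitrary illustrative value:

```python
def assistance_to_display(prediction, confidence, explanation, threshold=0.8):
    """Show more explanatory detail when the AI is less certain (illustrative heuristic)."""
    if confidence >= threshold:
        # High confidence: the bare recommendation may suffice.
        return {"prediction": prediction, "confidence": confidence}
    # Low confidence: add the explanation to encourage the human to reflect on the prediction.
    return {"prediction": prediction, "confidence": confidence, "explanation": explanation,
            "note": "Low AI confidence - please verify before accepting."}

print(assistance_to_display("spam", 0.62, "Flagged words: 'free', 'winner'"))
```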
Lastly, some researchers modify the degree of automation. One factor is whether the AI’s prediction
should be revealed at all (Lai et al., 2019). Lai et al. (2020) test various XAI techniques without
displaying the actual AI prediction. For example, they highlight all words that are relevant for the AI
decision. Another condition is to colorize these highlights differently depending on the influence of the
words. Their results show that differently colorized highlights result in a significant increase in XAI-
assisted performance (70.7% accuracy for colorized highlights compared to 60.4% accuracy for human
performance).
Task Characteristics
The tasks conducted in the studies play a decisive role in terms of individual and team performance. In
general, in order to allow for generalizability, many studies utilize multiple tasks and data sets. In this
context, Liu et al. (2021, p. 23) state that “[...] it is important to explore the diverse space and
understand how the choice of tasks may induce different results in the emerging area of human-AI
interaction”. For example, Alufaisan et al. (2020) study the influence of explanations on an income
prediction and a recidivism task. In terms of concrete experimental conditions, our SLR shows that
existing research focuses on skill sets and data types.
As discussed in the conceptual foundations, the skill sets of AI and humans are complementary. Liu et
al. (2021) find initial empirical evidence consistent with this line of thought by demonstrating enhanced
team performance in correctly classifying out-of-distribution examples as the AI struggles to deal with
instances that are beyond the patterns learned during training (D'Amour et al., 2020). However, it
remains unclear how strong the distribution shift between in- and out-of-distribution data must be to
discern a positive influence on CTP.
Concerning the data type, Lai et al. (2020, p. 744) state that “there may also exist significant variation
between understanding text and interpreting images, because the former depends on culture and life
experience, while the latter relies on basic visual cognition.” The experimental studies analyzed in this
SLR deal with human-AI tasks that are performed on either image (n = 24), tabular (n = 28), text (n =
39), or video (n = 2) data. In total, the analysis reveals only one study achieving CTP on textual data
(Bansal et al., 2020) and only one on image data (Chu et al., 2020). Regarding
tabular data, we are not aware of any study demonstrating comparable results. Hase and Bansal (2020)
explicitly compare tabular and textual data. In this context, they observe that users rate explanations
from tabular data higher than from textual data. Moreover, they find that displaying feature importance
improves team performance on tabular data and that displaying examples helps for both data types.
Artificial Intelligence Characteristics
In addition to the first two components, the AI plays a central role in the HI system. It can be divided
into two components: the backend, i.e., the AI and XAI techniques, and the frontend, i.e., how
explanations and predictions are displayed to humans.
Generally, most studies do not vary their AI techniques. An exception is the work of Lai et al. (2020)
who test the influence of AI complexity on the XAI results. Their results show that explanations from
simple models lead to better XAI-assisted performance.
In general, the XAI techniques range from simple white-box models (Poursabzi-Sangdeh et al., 2018)
through confidence scores (Zhang et al., 2020) to complex counterfactual explanations (Liu et al., 2021).
The most frequently used specific experimental conditions are confidence scores, feature importance,
examples, and rules. AI confidence refers to a probabilistic value that quantifies the certainty of the model
with respect to a specific prediction (Zhang et al., 2020). Feature importance quantifies the contribution
of each input feature to the prediction, and example-based explanations select and display the most
similar instances from a knowledge base (Adadi and Berrada, 2018). Lastly, rules refer to explanations based on
if-then statements that are either extracted from more complex models or directly generated by the
model (van der Waa et al., 2021).
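To make these explanation types tangible, the sketch below (our own illustration using standard scikit-learn functionality, not any specific study's setup) derives a confidence score, a feature-importance vector, and an example-based explanation for a single prediction:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

X, y = load_wine(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)
instance = X[0:1]

# Confidence: the predicted probability of the recommended class for this instance.
confidence = model.predict_proba(instance).max()

# Feature importance: here the model's global impurity-based importances
# (local attribution methods such as LIME or SHAP would explain this instance specifically).
importances = model.feature_importances_

# Example-based explanation: the most similar instances from the training data.
_, neighbor_idx = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(instance)

print(f"confidence={confidence:.2f}, most important feature index={int(np.argmax(importances))}, "
      f"similar instances={neighbor_idx[0].tolist()}")
```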
Our SLR highlights that the included studies report contradictory results with regard to the effect of
these XAI techniques. For example, while Bansal et al. (2020), Lai and Tan (2019), and Zhang et al.
(2020) find that communicating the confidence of an AI’s prediction is more effective than providing
the importance of individual features, Chandrasekaran et al. (2017) report exactly the opposite. This
indicates that focusing purely on XAI conditions might not be sufficient to extract generalizable and
valuable insights. Table 1 highlights all XAI-technique-related experimental conditions identified in the
SLR. We want to emphasize that these results should not be interpreted quantitatively, since many
control variables and even the measurements differ between the experiments.
Table 1. Experimental conditions of XAI and their comparisons in existing studies.
- Feature importance (FI) outperforms confidence: Chandrasekaran et al. (2017)
- FI outperforms examples: Lai et al. (2019); Hase and Bansal (2020); Yeung et al. (2020)
- FI outperforms rules: Hase and Bansal (2020)
- Confidence outperforms FI: Lai and Tan (2019); Zhang et al. (2020)
- Confidence outperforms examples: Lai and Tan (2019)
- Examples outperform FI: Adhikari et al. (2019); Hase and Bansal (2020)
- Examples outperform rules: Hase and Bansal (2020); van der Waa et al. (2021)
- Rules outperform FI: Hase and Bansal (2020); Ribeiro and Guestrin (2018)
- Rules outperform examples: Hase and Bansal (2020)
- No studies comparing confidence with rules were identified.
In addition to the XAI technique category, there are also differences in terms of the concrete
implementations of the XAI algorithm. For example, Schmidt and Biessmann (2019) as well as
Chandrasekaran et al. (2017) compare different technical implementations of feature importance
algorithms. In particular, Schmidt and Biessmann (2019) compare LIME (Ribeiro et al., 2016) and
covariance-based explanations. They find significant XAI-assisted performance differences (81.72%
for LIME and 84.52% for covariance) indicating that the size of the improvement also depends on the
selection of the specific XAI algorithm.
Besides these rather algorithmic experimental conditions, our SLR also reveals more frontend-specific
conditions. Lai et al. (2019) show that the way the performance of the AI is displayed has an impact on
team performance. Their results illustrate that, independent of the AI’s actual performance, sharing
information about its performance increases trust. However, displaying lower performance relatively
decreased XAI-assisted performance and trust.
communicated through the explanation. They evaluate the performance of detecting misclassification
of online toxicity with full and sparse explanations. Full explanations highlight all words that have an
influence on the prediction. Sparse explanations highlight just the most important words. This condition
analyzes whether explanations need sufficient or comprehensive information. Their results indicate that
there is no significant difference between both conditions (52.4% for full and 52.6% accuracy for sparse
explanations).
Human Characteristics
Lastly, human characteristics influence the effectiveness of HI systems. One human-specific factor
considered in the experiments is the human’s knowledge of the XAI techniques. Lai et al. (2020) examine
the usage of human-centered tutorials to build up essential knowledge and report a positive effect.
Another experimental condition tested is the self-assessment capability. Green and Chen (2019)
explicitly test this experimental condition by asking participants how confident they are in their own
decisions on a 5-point Likert scale. In their study, they find that participants cannot determine their own
or the model’s accuracy and fail to calibrate their use of AI.
Research Implications for Hybrid Intelligence Systems
The SLR revealed relevant factors impacting the collaboration of humans and AI. However, due to the
small number of studies achieving CTP, it becomes evident that further research is needed on the design
of HI systems. Therefore, based on the findings of the previous section, we derive and discuss possible
factors beyond those tested in existing studies, which should be taken into consideration when studying
CTP. Again, we structure this discussion from a socio-technical perspective. Finally, for each identified
characteristic with a possible effect on CTP, we formulate multiple testable hypotheses with CTP as a
dependent variable that future research should address.
Collaboration Characteristics
In the collaboration scenarios analyzed in existing studies, AI typically assists the human in the form of
recommendations. However, this should not lead to the human no longer questioning the AI’s predictions.
Against this background, it has been demonstrated that it is beneficial to actively involve the human in
the decision-making process, either through encouragement by asking the human to make an informed
decision before receiving the AI’s prediction (Green and Chen, 2019), or by having the possibility to
dynamically interact with the system (Liu et al., 2021). Following Lai et al. (2020), who find a positive
effect of human training for the interpretation of static explanations, we hypothesize a similar effect
when humans are being trained to dynamically interact with the AI.
H1: Training humans to dynamically interact with the AI and interpret its recommendations
has a positive effect on CTP.
In addition to training for dynamic interaction and interpretation, it could also be beneficial to visualize
the AI’s error boundary, which highlights for each input whether the model output is the correct action
for that input feature combination, potentially enabling the human to predict when the AI will err and decide
when to override the prediction. An improved understanding of the AI’s error boundary in turn might
positively contribute to CTP (Bansal et al., 2019).
H2: Visualizing the AI’s error boundary depending on provided input features has a positive
effect on CTP.
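One possible operationalization of H2 (our own sketch, not a validated design from the literature) is to train a simple, human-readable "error model" that predicts from the input features whether the AI's recommendation will be wrong, and to surface its rules to the human:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X_train, X_hold, y_train, y_hold = train_test_split(data.data, data.target, random_state=0)

ai = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Label held-out instances by whether the AI errs on them ...
ai_errs = (ai.predict(X_hold) != y_hold).astype(int)

# ... and fit a shallow, interpretable model of the AI's error boundary.
error_model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_hold, ai_errs)

# The resulting rules describe input regions in which the AI is likely to err.
print(export_text(error_model, feature_names=list(data.feature_names)))
```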
Moreover, to balance the cognitive load invested by the human, we suggest that it
could also be viable to let the human decide case by case whether the recommendation or explanation
should be revealed.
H3: Allowing the human to decide whether the AI’s prediction should be revealed has a positive
effect on CTP.
One aspect all considered studies have in common is that the responsibility for the final team
decision lies with the human. A new perspective on collaboration could be that the AI decides a priori
who will have the final responsibility for the team decision, depending on who has the higher expected
probability of correctly executing the task, taking the individual strengths of both team members into
consideration (Mozannar and Sontag, 2020; Wilder et al., 2020).
H4: Assigning the final decision dynamically to either the human or the AI has a positive effect
on CTP.
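As a simplified illustration of H4 (a heuristic sketch, not the learning-to-defer approach of Mozannar and Sontag, 2020), the final decision could be routed to whichever team member has the higher estimated probability of being correct for the instance at hand:

```python
def assign_final_decision(ai_prediction, ai_confidence, human_prediction, estimated_human_accuracy):
    """Route the final decision to the party with the higher estimated chance of being correct.

    ai_confidence: the AI's predicted probability for its own recommendation.
    estimated_human_accuracy: e.g., the human's accuracy on similar past cases (hypothetical input).
    """
    if ai_confidence >= estimated_human_accuracy:
        return "AI", ai_prediction
    return "human", human_prediction

decider, decision = assign_final_decision("malignant", 0.92, "benign", 0.85)
print(f"Final decision by {decider}: {decision}")
```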
Task Characteristics
Even though existing literature discusses various facets of task characteristics, we see multiple
directions for future research in terms of task complexity as well as performance differences between
humans and AI.
In terms of task difficulty, previous studies found that humans tend to rely more on heuristics the harder a
task becomes (Goddard et al., 2014). The presence of AI support within HI systems might create the
risk that people will rely more strongly on the proposed AI recommendation as the difficulty increases
(Xu et al., 2007). In this context, we hypothesize that increasing task difficulty for the human might be
counterproductive in terms of CTP.
H5: Increasing task difficulty has a negative effect on CTP.
When working together on the same task, the absolute performance difference of humans and AI might
play a significant role. There are situations in which the AI outperforms humans (Alufaisan et al., 2020)
and vice versa (Chu et al., 2020). In this context, a very low performance of the AI, independent of the
human performance, could result in algorithmic aversion (Manzey et al., 2012). Conversely, a very high-
performing AI could induce human over-reliance (Skitka et al., 2000). Following this line of
reasoning, one might assume that the potential for CTP might be leveraged when humans and AI have
comparable performance. In this context, Bansal et al. (2020) hypothesize that a similar performance
level of human and AI may contribute to achieving a significant effect on CTP. However, a small
performance gap alone might not be sufficient for achieving CTP. We want to highlight that comparable
performance does not inherently increase the probability of reaching CTP. The important point is not
comparable performance per se but the absence of a positive correlation between human and AI errors.
We hypothesize that, even within the same task, the absence of a positive correlation between human
and AI errors contributes to reaching CTP.
H6: The absence of a positive correlation between human and AI errors has a positive effect on CTP.
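H6 can be examined empirically by correlating binary error indicators of the human and the AI on the same instances; the sketch below (our own illustration with hypothetical decisions) uses the Pearson correlation of the two error vectors, which for binary variables corresponds to the phi coefficient:

```python
import numpy as np

# Hypothetical ground truth and individual decisions (illustrative values only).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_human = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])
y_ai = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1])

human_errors = (y_human != y_true).astype(int)
ai_errors = (y_ai != y_true).astype(int)

# Pearson correlation of the binary error indicators (phi coefficient).
phi = np.corrcoef(human_errors, ai_errors)[0, 1]
print(f"Error correlation: {phi:.2f}")  # values near or below zero leave room for complementarity
```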
Artificial Intelligence Characteristics
An important component to ensure the collaboration between humans and AI is the communication
capability of the AI. In this context, the frontend serves as a means for communicating the AI’s
prediction including explanations to the human. Its design has a significant influence on how well the
user will interpret and use the recommendations derived by the AI. In general, it is crucial to balance
the amount of information in order to prevent information overload (Shneiderman, 2003). For instance,
Klapp (1986) states that high volumes of information can have the same effect as noise, distraction, or
stress, resulting in erroneous judgement. Besides the amount of information, the visualization quality
also plays a significant role. For this reason, the question emerges whether a more user-centered
presentation of the information produced by existing explanation algorithms could result in better
human understanding. In this context, Suresh et al. (2021) propose a framework for characterizing
the stakeholders of interpretable AI, including their needs. According to them, not only the knowledge
of humans but also the context in which human-AI interaction occurs plays a decisive role. For this
reason, in line with Kühl et al. (2019), we hypothesize that tailoring the information presentation
to the individual user, considering their knowledge and application context, yields potential for
improved CTP.
H7: Personalized information presentation considering the humans’ knowledge and the
application context has a positive effect on CTP.
Besides the information presented to the user, the accuracy of the inferred explanations plays a decisive
role. Researchers have proposed evaluations to assess the performance of explanations, which is also
known as fidelity of explanations (Shen and Huang, 2020). It can be interpreted as the capability of the
explanation to reflect the AI’s behavior (Alvarez-Melis and Jaakkola, 2018). For example, Papenmeier et al.
(2019) find that humans could lose trust in the AI when exposed to low fidelity explanations. In this
context, Shen and Huang (2020, p. 172) mention “the representational power––including the
correctness, sensitivity, etc., of the interpretation model––might not be sufficient to augment human
reasoning about errors.” Therefore, we hypothesize a strong positive correlation between the fidelity
of explanations and the ability of humans to detect when the AI errs.
H8: Increasing explanation fidelity has a positive effect on CTP.
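Fidelity is commonly operationalized as the agreement between an interpretable surrogate and the black-box model it explains; the sketch below (a generic illustration, not a metric taken from the cited works) computes this agreement on held-out data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

black_box = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Global surrogate: an interpretable model trained to mimic the black box's outputs.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, black_box.predict(X_train))

# Fidelity: how closely the surrogate's predictions track the black box's predictions on new data.
fidelity = accuracy_score(black_box.predict(X_test), surrogate.predict(X_test))
print(f"Surrogate fidelity: {fidelity:.2f}")
```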
Even if established explanation techniques satisfy the fidelity criterion, a further notable aspect should
be their robustness. In this context, a large body of research found that explanations can vary
significantly even for instances that are nearly identical and have the same classification (Alvarez-Melis
and Jaakkola, 2018; Ghorbani et al., 2019). For example, Tomsett et al. (2020) test the consistency of
saliency maps and report their statistical unreliability. For this reason, we formulate the following
hypothesis:
H9: Increasing explanation robustness has a positive effect on CTP.
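Explanation robustness can be probed by comparing the explanations of an instance and of a slightly perturbed copy of it. The sketch below (our own illustration) uses a crude occlusion-style local attribution and measures the cosine similarity of the two attribution vectors; the attribution method and perturbation size are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

def occlusion_attribution(x, baseline):
    """Crude local attribution: change in P(class 1) when a feature is replaced by its training mean."""
    p_ref = model.predict_proba(x.reshape(1, -1))[0, 1]
    scores = np.zeros(len(x))
    for j in range(len(x)):
        x_masked = x.copy()
        x_masked[j] = baseline[j]
        scores[j] = p_ref - model.predict_proba(x_masked.reshape(1, -1))[0, 1]
    return scores

baseline = X_train.mean(axis=0)
x = X_test[0]
x_perturbed = x + 0.01 * X_train.std(axis=0) * np.random.default_rng(0).normal(size=x.shape)

a1, a2 = occlusion_attribution(x, baseline), occlusion_attribution(x_perturbed, baseline)

# Robustness proxy: cosine similarity of attributions for two nearly identical inputs (1.0 = identical).
cosine = float(np.dot(a1, a2) / (np.linalg.norm(a1) * np.linalg.norm(a2) + 1e-12))
print(f"Attribution similarity under a small perturbation: {cosine:.2f}")
```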
Human Characteristics
Intensive research has been conducted across multiple disciplines over the past decades on
characteristics that allow individuals to succeed in team settings (Zhao and Feng, 2019). For example,
Morgeson et al. (2005) emphasize the importance of social skills, personality characteristics as well as
team knowledge in the context of team member selection. Similar to human teams, we suggest that
individual human characteristics also play a crucial role in HI systems. We focus on a small subset of
characteristics that have been examined in the context of human teams, which we believe will also have
a major impact on HI systems. In this context, multiple meta-analytic studies found that personality
characteristics such as conscientiousness, agreeableness, or emotional stability influence performance
(Hogan and Holland, 2003; Hurtz and Donovan, 2000). Even though human-AI teams differ significantly from
human teams, we assume these characteristics to impact CTP for the following reasons (Riefle and
Benz, 2021). Conscientious people tend to demonstrate willingness to contribute to team performance
regardless of their designated role (Barrick et al., 1998; Neuman and Wright, 1999). Moreover, they
stand out by being especially concerned with performing their required behaviors towards achieving
defined team goals (LePine et al., 1997). Furthermore, the characteristic of agreeableness encompasses
traits such as cooperativeness and flexibility (Digman, 1990). We hypothesize that these traits might be
beneficial in the human-AI setting as well, since those individuals might be more willing to contemplate
the AI’s opinion. In addition, emotional stability might play a decisive role in this context as people
with this trait tend to be more stress-resistant, allowing them to more confidently manage demanding
and ambiguous situations (Mount et al., 1998). For the reasons mentioned above, we formulate the
following hypothesis:
H10: Personality characteristics (e.g., conscientiousness, agreeableness, emotional stability)
have an effect on CTP.
A further interesting direction of future research might be the influence of human cognitive capacity on
CTP. In addition to various studies that find empirical evidence for cognitive ability being a strong
predictor of individual performance (Hunter and Hunter, 1984; Wagner, 1997), a similar relationship
can be shown at the team level (Devine and Philips, 2001). Thus, we suspect a similar relationship with regard
to human-AI teams in the context of HI systems.
H11: Cognitive ability has a positive effect on CTP.
Lastly, human decision-making is heavily influenced by human biases (Kahneman, 2011). Human-AI
collaboration is not spared from these biases. A particularly serious bias is automation bias, i.e., “the
tendency to use automated cues as a heuristic replacement for vigilant information seeking and
processing” (Mosier and Skitka, 1999, p. 344), that may lead to an overreliance on AI
recommendations. Therefore, we formulate the following hypothesis:
H12: Automation bias has a negative effect on CTP.
Conclusion
The main goal of this study was to determine the current state of HI systems with regard to CTP.
Therefore, we conducted an SLR. Subsequently, we provided an overview of the proportion of articles
that reached CTP and presented experimental conditions that were tested in the articles. Based on the
SLR and supplementary work, we derived and discussed further experimental conditions and formulated
testable hypotheses.
Unleashing the potential of HI systems that leverage the complementary capabilities of humans and AI
to achieve CTP requires a multidimensional design process. For this reason, we see IS as the research
discipline best positioned to advance this field. We hope to motivate IS researchers and
practitioners to actively participate in the exploration and understanding of the factors contributing to the
design of HI systems for CTP.
Considerably more work needs to be done to determine the underlying patterns of human-AI
complementarity. Therefore, rigorous research models based on behavioral constructs need to be
developed. Constructs such as mental model, cognitive load, and trust need to be measured to
understand and enable CTP. Future work needs to address the testable hypotheses outlined in this work
in behavioral experiments. We invite researchers to support and join us on the path to CTP.
References
Adadi, A., and Berrada, M. (2018). Peeking inside the black-box: a survey on explainable artificial
intelligence (XAI). IEEE Access, 6, 52138–52160.
Adhikari, A., Tax, D. M. J., Satta, R., and Faeth, M. (2019). LEAFAGE: Example-based and
Feature importance-based Explanations for Black-box ML models. 2019 IEEE International
Conference on Fuzzy Systems (FUZZ-IEEE), 1–7.
Alufaisan, Y., Marusich, L. R., Bakdash, J. Z., Zhou, Y., and Kantarcioglu, M. (2020). Does
Explainable Artificial Intelligence Improve Human Decision-Making? ArXiv.
Alvarez-Melis, D., and Jaakkola, T. S. (2018). Towards robust interpretability with self-explaining
neural networks. ArXiv Preprint ArXiv:1806.07538.
Bansal, G., Nushi, B., Kamar, E., Lasecki, W. S., Weld, D. S., and Horvitz, E. (2019). Beyond
accuracy: The role of mental models in human-ai team performance. Proceedings of the
AAAI Conference on Human Computation and Crowdsourcing, 7(1), 2–11.
Bansal, G., Wu, T., Zhou, J., Fok, R., Nushi, B., Kamar, E., Ribeiro, M. T., and Weld, D. S. (2020).
Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team
Performance. ArXiv, 1(1), 1–26.
Barrick, M. R., Stewart, G. L., Neubert, M. J., and Mount, M. K. (1998). Relating member ability
and personality to work-team processes and team effectiveness. Journal of Applied
Psychology, 83(3), 377.
Braga, A., and Logan, R. K. (2017). The emperor of strong AI has no clothes: Limits to artificial
intelligence. Information, 8(4), 156.
Brocke, J. vom, Simons, A., Niehaves, B., Niehaves, B., Reimer, K., Plattfaut, R., and Cleven, A.
(2009). Reconstructing the giant: On the importance of rigour in documenting the literature
search process.
Buçinca, Z., Lin, P., Gajos, K. Z., and Glassman, E. L. (2020). Proxy tasks and subjective
measures can be misleading in evaluating explainable AI systems. International Conference
on Intelligent User Interfaces, Proceedings IUI, 454–464.
Burkart, N., and Huber, M. F. (2021). A Survey on the Explainability of Supervised Machine
Learning. Journal of Artificial Intelligence Research, 70, 245–317.
Carton, S., Mei, Q., and Resnick, P. (2020). Feature-Based Explanations Don’t Help People Detect
Misclassifications of Online Toxicity. Aaai.Org, 2020(Icwsm). www.aaai.org
Chandrasekaran, A., Yadav, D., Chattopadhyay, P., Prabhu, V., and Parikh, D. (2017). It takes
two to Tango: Towards theory of AI’s mind. ArXiv.
Chu, E., Roy, D., and Andreas, J. (2020). Are Visual Explanations Useful? A Case Study in
Model-in-the-Loop Prediction. 1–18.
Confalonieri, R., Weyde, T., Besold, T. R., and Martín, F. M. del P. (2019). Trepan Reloaded: A
Knowledge-driven Approach to Explaining Artificial Neural Networks.
D’Amour, A., Heller, K., Moldovan, D., Adlam, B., Alipanahi, B., Beutel, A., Chen, C., Deaton,
J., Eisenstein, J., and Hoffman, M. D. (2020). Underspecification presents challenges for
credibility in modern machine learning. ArXiv Preprint ArXiv:2011.03395.
Davenport, T. H., and Kirby, J. (2016). Only humans need apply: Winners and losers in the age
of smart machines. Harper Business New York, NY.
Davis, B., Glenski, M., Sealy, W., and Arendt, D. (2020). Measure Utility, Gain Trust: Practical
Advice for XAI Researchers. 2020 IEEE Workshop on TRust and EXpertise in Visual
Analytics (TREX), 1–8.
Dellermann, D., Ebel, P., Söllner, M., & Leimeister, J. M. (2019). Hybrid Intelligence. Business
and Information Systems Engineering, 61(5), 637–643.
Devine, D. J., and Philips, J. L. (2001). Do smarter teams do better: A meta-analysis of cognitive
ability and team performance. Small Group Research, 32(5), 507–532.
Digman, J. M. (1990). Personality structure: Emergence of the five-factor model. Annual Review
of Psychology, 41(1), 417–440.
Doshi-Velez, F., and Kim, B. (2017). Towards a rigorous science of interpretable machine
learning. ArXiv Preprint ArXiv:1702.08608.
Gerber, A., Derckx, P., Döppner, D. A., and Schoder, D. (2020). Conceptualization of the Human-
Machine Symbiosis A Literature Review. Proceedings of the 53rd Hawaii International
Conference on System Sciences, 3, 289–298.
Ghorbani, A., Abid, A., and Zou, J. (2019). Interpretation of neural networks is fragile.
Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 3681–3688.
Goddard, K., Roudsari, A., and Wyatt, J. C. (2014). Automation bias: empirical results assessing
influencing factors. International Journal of Medical Informatics, 83(5), 368–375.
Goodhue, D. L., and Thompson, R. L. (1995). Task-technology fit and individual performance.
MIS Quarterly, 213–236.
Green, B., and Chen, Y. (2019). The principles and limits of algorithm-in-the-loop decision
making. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW).
Gregor, S., and Benbasat, I. (1999). Explanations from intelligent systems: Theoretical
foundations and implications for practice. MIS Quarterly, 497–530.
Hase, P., and Bansal, M. (2020). Evaluating Explainable AI: Which Algorithmic Explanations
Help Users Predict Model Behavior? ArXiv, 3.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-
level performance on imagenet classification. Proceedings of the IEEE International
Conference on Computer Vision, 1026–1034.
Hogan, J., and Holland, B. (2003). Using theory to evaluate personality and job-performance
relations: A socioanalytic perspective. Journal of Applied Psychology, 88(1), 100.
Hunter, J. E., and Hunter, R. F. (1984). Validity and utility of alternative predictors of job
performance. Psychological Bulletin, 96(1), 72.
Hurtz, G. M., and Donovan, J. J. (2000). Personality and job performance: The Big Five revisited.
Journal of Applied Psychology, 85(6), 869.
Kahneman, D. (2011). Thinking, fast and slow. Macmillan.
Klapp, O. E. (1986). Overload and boredom: Essays on the quality of life in the information
society. Greenwood Publishing Group Inc.
Kühl, N., Lobana, J., & Meske, C. (2019). Do you comply with AI? - Personalized explanations
of learning algorithms and their impact on employees’ compliance behavior.
Lai, V., Liu, H., & Tan, C. (2020). Why is Chicago deceptive? Towards Building Model-Driven
Tutorials for Humans. In Proceedings of the 2020 CHI Conference on Human Factors in
Computing Systems (pp. 1-13).
Lai, V., and Tan, C. (2019). On human predictions with explanations and predictions of machine
learning models: A case study on deception detection. FAT* 2019 - Proceedings of the 2019
Conference on Fairness, Accountability, and Transparency, 29–38.
LePine, J. A., Hollenbeck, J. R., Ilgen, D. R., and Hedlund, J. (1997). Effects of individual
differences on the performance of hierarchical decision-making teams: Much more than g.
Journal of Applied Psychology, 82(5), 803.
Liu, H., Lai, V., and Tan, C. (2021). Understanding the Effect of Out-of-distribution Examples
and Interactive Explanations on Human-AI Decision Making. 1–42.
Maedche, A., Legner, C., Benlian, A., Berger, B., Gimpel, H., Hess, T., Hinz, O., Morana, S., and
Söllner, M. (2019). AI-based digital assistants. Business and Information Systems
Engineering, 61(4), 535–544.
Manzey, D., Reichenbach, J., and Onnasch, L. (2012). Human performance consequences of
automated decision aids: The impact of degree of automation and system experience. Journal
of Cognitive Engineering and Decision Making, 6(1), 57–87.
Meske, C., Bunde, E., Schneider, J., and Gersch, M. (2020). Explainable Artificial Intelligence:
Objectives, Stakeholders, and Future Research Opportunities. Information Systems
Management, 1–11.
Morgeson, F. P., Delaney-Klinger, K., and Hemingway, M. A. (2005). The importance of job
autonomy, cognitive ability, and job-related skill for predicting role breadth and job
performance. Journal of Applied Psychology, 90(2), 399.
Mosier, K. L., and Skitka, L. J. (1999). Automation use and automation bias. Proceedings of the
Human Factors and Ergonomics Society Annual Meeting, 43(3), 344–348.
Mozannar, H., and Sontag, D. (2020). Consistent estimators for learning to defer to an expert.
International Conference on Machine Learning, 7076–7087.
Neuman, G. A., and Wright, J. (1999). Team effectiveness: beyond skills and cognitive ability.
Journal of Applied Psychology, 84(3), 376.
Nunes, I., and Jannach, D. (2017). A systematic review and taxonomy of explanations in decision
support and recommender systems. User Modeling and User-Adapted Interaction, 27(3),
393–444.
Papenmeier, A., Englebienne, G., and Seifert, C. (2019). How model accuracy and explanation
fidelity influence user trust. ArXiv Preprint ArXiv:1907.12652.
Poursabzi-Sangdeh, F., Goldstein, D. G., Hofman, J. M., Vaughan, J. W., and Wallach, H. (2018).
Manipulating and measuring model interpretability. ArXiv.
Ribeiro, M. T., and Guestrin, C. (2018). Anchors: High-Precision Model-Agnostic Explanations.
1527–1535.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why should I trust you?” Explaining the
predictions of any classifier. Proceedings of the ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 1135–1144.
Riefle, L., & Benz, C. (2021). User-specific Determinants of Conversational Agent Usage: A
Review and Potential for Future Research [in press]. Proceedings of the 16th International
Conference on Wirtschaftsinformatik (WI).
Schmidt, P., and Biessmann, F. (2019). Quantifying interpretability and trust in machine learning
systems. ArXiv.
Schuetz, S., and Venkatesh, V. (2020). The rise of human machines: How cognitive computing
systems challenge assumptions of user-system interaction. Journal of the Association for
Information Systems, 21(2), 460–482.
Shen, H., and Huang, T.-H. (2020). How Useful Are the Machine-Generated Interpretations to
General Users? A Human Evaluation on Guessing the Incorrectly Predicted Labels.
Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, 8(1),
168–172.
Shneiderman, B. (2003). The eyes have it: A task by data type taxonomy for information
visualizations. In The craft of information visualization (pp. 364–371). Elsevier.
Siemon, D., Li, R., & Robra-Bissantz, S. (2021). Towards a model of team roles in human-
machine collaboration. International Conference on Information Systems, ICIS 2020 -
Making Digital Inclusive: Blending the Local and the Global, 09.
Skitka, L. J., Mosier, K., and Burdick, M. D. (2000). Accountability and automation bias.
International Journal of Human Computer Studies, 52(4), 701–717.
Smith-Renner, A., Fan, R., Birchfield, M., Wu, T., Boyd-Graber, J., Weld, D. S., and Findlater,
L. (2020). No Explainability without Accountability: An Empirical Study of Explanations
and Feedback in Interactive ML. Conference on Human Factors in Computing Systems -
Proceedings, 1–13.
Suresh, H., Gomez, S. R., Nam, K. K., and Satyanarayan, A. (2021). Beyond Expertise and Roles:
A Framework to Characterize the Stakeholders of Interpretable Machine Learning and their
Needs. ArXiv Preprint ArXiv:2101.09824.
Swartout, W. R., and Moore, J. D. (1993). Explanation in second generation expert systems. In
Second generation expert systems (pp. 543–585). Springer.
Tomsett, R., Harborne, D., Chakraborty, S., Gurram, P., and Preece, A. (2020). Sanity checks for
saliency metrics. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04),
6021–6029.
van der Waa, J., Nieuwburg, E., Cremers, A., and Neerincx, M. (2021). Evaluating XAI: A
comparison of rule-based and example-based explanations. Artificial Intelligence, 291,
103404.
Van Lent, M., Fisher, W., and Mancuso, M. (2004). An explainable artificial intelligence system
for small-unit tactical behavior. Proceedings of the National Conference on Artificial
Intelligence, 900–907.
Wagner, R. K. (1997). Intelligence, training, and employment. American Psychologist, 52(10),
1059.
Wanner, J., Herm, L.-V., Heinrich, K., Janiesch, C., and Zschech, P. (2020). White, Grey, Black:
Effects of XAI Augmentation on the Confidence in AI-based Decision Support Systems.
Wilder, B., Horvitz, E., and Kamar, E. (2020). Learning to complement humans. ArXiv Preprint
ArXiv:2005.00582.
Xu, X., Wickens, C. D., and Rantanen, E. M. (2007). Effects of conflict alerting system reliability
and task difficulty on pilots’ conflict detection with cockpit display of traffic information.
Ergonomics, 50(1), 112–130.
Yeung, A. Y., Joshi, S., Williams, J. J., and Rudzicz, F. (2020). Sequential explanations with
mental model-based policies. ArXiv, 2017.
Zhang, Y., Vera Liao, Q., and Bellamy, R. K. E. (2020). Effect of confidence and explanation on
accuracy and trust calibration in AI-assisted decision making. FAT* 2020 - Proceedings of
the 2020 Conference on Fairness, Accountability, and Transparency, 295–305.
Zhao, Y., and Feng, Y.-Y. (2019). A Study on the Impact of Team Heterogeneity on
Organizational Performance in Start-ups. DEStech Transactions on Economics, Business
and Management, icem.
Zschech, P., Walk, J., Heinrich, K., Vössing, M., Kühl, N. (2021). A Picture is Worth a
Collaboration: Accumulating Design Knowledge for Computer-Vision-based Hybrid
Intelligence Systems. ECIS 2021 Research Papers. 127.
... In line with the adoption factor of complementarity, Dellermann et al. (2019) argue that humans and AI generally possess different potentially complementary skills. However, its realization in terms of superior collaborative task performance often remains an unsolved problem in practice (Hemmer et al., 2021). A possible reason might be that humans do not always appropriately rely on AI advice and might even be convinced of explanations in cases of incorrect recommendations (Bansal et al., 2021). ...
Conference Paper
Recent developments in Artificial Intelligence (AI) have fueled the emergence of human-AI collaboration, a setting where AI is a coequal partner. Especially in clinical decision-making, it has the potential to improve treatment quality by assisting overworked medical professionals. Even though research has started to investigate the utilization of AI for clinical decision-making, its potential benefits do not imply its adoption by medical professionals. While several studies have started to analyze adoption criteria from a technical perspective, research providing a human-centered perspective with a focus on AI’s potential for becoming a coequal team member in the decision-making process remains limited. Therefore, in this work, we identify factors for the adoption of human-AI collaboration by conducting a series of semi-structured interviews with experts in the healthcare domain. We identify six relevant adoption factors and highlight existing tensions between them and effective human-AI collaboration.
... Explainable AI verfolgt das Ziel, durch eine künstliche Intelligenz generierte Handlungsempfehlungen nachvollziehbar zu machen. Das Verständnis für von der KI vorgeschlagene Entscheidungen ist eine Grundvoraussetzung für eine Akzeptanz der KI und ein erfolgreiches, sich ergänzendes Zusammenspiel zwischen menschlicher und künstlicher Intelligenz [2,6]. Um dies im Zusammenhang mit Sales-und-Demand-Forecasting zu erreichen, bedarf es der Transparenz verwendeter KI-Methoden und deren Möglichkeiten, aber auch Grenzen: ...
... While a variety of factors are known to increase the trust of humans (Wang and Benbasat, 2008), one particular approach that has received significant attention in recent years emphasizes that the reasoning of the systems should be made transparent through explanations (Gilpin et al., 2018;Meske et al., 2020). We contribute to this discussion by examining how agent transparency (Stowers et al., 2016) affects the collaboration between humans and AI systems (Kiousis, 2002;Benyon, 2014;Patrick et al., 2021). In this work, we study how agent transparency influences the trust of domain experts as well as the collaborative task outcome (i.e., forecasting accuracy). ...
Article
Full-text available
The field of artificial intelligence (AI) is advancing quickly, and systems can increasingly perform a multitude of tasks that previously required human intelligence. Information systems can facilitate collaboration between humans and AI systems such that their individual capabilities complement each other. However, there is a lack of consolidated design guidelines for information systems facilitating the collaboration between humans and AI systems. This work examines how agent transparency affects trust and task outcomes in the context of human-AI collaboration. Drawing on the 3-Gap framework, we study agent transparency as a means to reduce the information asymmetry between humans and the AI. Following the Design Science Research paradigm, we formulate testable propositions, derive design requirements, and synthesize design principles. We instantiate two design principles as design features of an information system utilized in the hospitality industry. Further, we conduct two case studies to evaluate the effects of agent transparency: We find that trust increases when the AI system provides information on its reasoning, while trust decreases when the AI system provides information on sources of uncertainty. Additionally, we observe that agent transparency improves task outcomes as it enhances the accuracy of judgemental forecast adjustments.
... In line with the adoption factor of complementarity, Dellermann et al. (2019) argue that humans and AI generally possess different potentially complementary skills. However, its realization in terms of superior collaborative task performance often remains an unsolved problem in practice (Hemmer et al., 2021). A possible reason might be that humans do not always appropriately rely on AI advice and might even be convinced of explanations in cases of incorrect recommendations (Bansal et al., 2021). ...
Preprint
Full-text available
Recent developments in Artificial Intelligence (AI) have fueled the emergence of human-AI collaboration, a setting where AI is a coequal partner. Especially in clinical decision-making, it has the potential to improve treatment quality by assisting overworked medical professionals. Even though research has started to investigate the utilization of AI for clinical decision-making, its potential benefits do not imply its adoption by medical professionals. While several studies have started to analyze adoption criteria from a technical perspective, research providing a human-centered perspective with a focus on AI's potential for becoming a coequal team member in the decision-making process remains limited. Therefore, in this work, we identify factors for the adoption of human-AI collaboration by conducting a series of semi-structured interviews with experts in the healthcare domain. We identify six relevant adoption factors and highlight existing tensions between them and effective human-AI collaboration.
... While humans can rely on their senses, perceptions, emotional intelligence, and social skills (Braga and Logan 2017), AI excels at detecting patterns or calculating probabilities (Dellermann et al. 2019). These different skill set should in theory allow for superior complementary team performance in specific tasks (Hemmer et al. 2021). Hybrid intelligence is dampened if AB prevents an effective human-AI collaboration. ...
Preprint
Full-text available
Artificial intelligence (AI) is gaining momentum, and its importance for the future of work in many areas, such as medicine and banking, is continuously rising. However, insights on the effective collaboration of humans and AI are still rare. Typically, AI supports humans in decision-making by addressing human limitations. However, it may also evoke human bias, especially in the form of automation bias as an over-reliance on AI advice. We aim to shed light on the potential to influence automation bias by explainable AI (XAI). In this pre-test, we derive a research model and describe our study design. Subsequentially, we conduct an online experiment with regard to hotel review classifications and discuss first results. We expect our research to contribute to the design and development of safe hybrid intelligence systems.
... As well as a solution to these growing challenges, xAI is also identified as a key component of hybrid intelligence: Collaborative systems of human and AI agents working together [39]. In the near future, such systems may allow diagnosticians to dynamically interact with an AI counterpart through natural language queries and counterfactuals [36,40], or for AI systems to improve their performance by identifying limitations and actively bringing the human into the machine learning loop [27]. ...
Article
Full-text available
The increasing prevalence of digitized workflows in diagnostic pathology opens the door to life-saving applications of artificial intelligence (AI). Explainability is identified as a critical component for the safety, approval and acceptance of AI systems for clinical use. Despite the cross-disciplinary challenge of building explainable AI (xAI), very few application- and user-centric studies in this domain have been carried out. We conducted the first mixed-methods study of user interaction with samples of state-of-the-art AI explainability techniques for digital pathology. This study reveals challenging dilemmas faced by developers of xAI solutions for medicine and proposes empirically-backed principles for their safer and more effective design.
... To ensure that AI systems serve human interests and not the needs of AI itself, it is necessary to create a strict digital ethics framework [119,120]. Various authors see the possibility of a future transition to a human-AI symbiosis, in which hybrid intelligence systems will outperform individual parts [121]. In any case, AI should be an essential part of future human evolution. ...
Preprint
Full-text available
Since its inception, the universe has evolved towards increasing complexity due to entropy and natural forces. Once the universe was cooled sufficiently by its expansion, "free-living" particles united to form hydrogen atoms. Gravity united matter into compact, high-energy formations (stars). Conditions within the first generation of stars drove the formation of more complex elements. The existence of these elements allowed the creation of planetary systems. Favourable conditions on Earth and the diversity of elements enabled the creation of further complexity in the form of life. This complexity has not evolved in a linear manner but in leaps towards higher levels. Emergence of each new level, or "universal state", depends on a universal qualitative change: a "universal transition". On Earth, life inherited this tendency of evolution towards complexity by uniting matter in increasing levels of individuality, from organic molecules to molecular complexes, cells, multicellularity, and complex populations. The emergence of a new level is allowed by "scale constraints", which prevent eternal accumulation of complexity at lower levels, forcing the systems towards a new higher-level, i.e., the occurrence of an evolutionary transition in individuality (ETI). In humans, sociocultural and technological evolution may be guiding our species to a transition towards a global superorganism.
Article
The profession of architecture mainly involves the construction of building information; this is achieved within the knowledge paradigm of our society. Therefore, a shift in the knowledge paradigm can lead to the advent of new architectural professions; the Renaissance and the current era are periods of such shifts. However, during the process of such a shift, it is difficult to notice the essential nature of the change, such as the emergence of a new architectural profession utilizing building information modeling (BIM). This study is unique in that it uses the correlation of building information with knowledge to uncover the essential nature of the architectural profession. Furthermore, Nonaka’s knowledge creation process (SECI) model is used to reveal the knowledge paradigm of our society. The analysis results show that knowledge creation in society has increased explosively since it has become possible—through the printing revolution of the Renaissance period—to share knowledge that previously remained at the experience level. Accordingly, this has led to the advent of an architectural profession involving today’s intellectuals, who can construct and express building information by using conceptualized knowledge rather than experience. Nowadays, machines have emerged as new agents of knowledge processing through the digital revolution, innovatively strengthening connections between information silos. As a result, a new architectural profession has emerged, focused on improving the performance of buildings by using simulations with machine-readable building information as well as with one’s own knowledge. Further, this new architectural profession, by harnessing the newly developed hybrid intelligence of machines and humans, is expected to overcome the former limits of the profession.
Conference Paper
Full-text available
Computer vision (CV) techniques try to mimic human capabilities of visual perception to support labor-intensive and time-consuming tasks like the recognition and localization of critical objects. Nowadays, CV increasingly relies on artificial intelligence (AI) to automatically extract useful information from images that can be utilized for decision support and business process automation. However, the focus of extant research is often exclusively on technical aspects when designing AI-based CV systems while neglecting socio-technical facets, such as trust, control, and autonomy. For this purpose, we consider the design of such systems from a hybrid intelligence (HI) perspective and aim to derive prescriptive design knowledge for CV-based HI systems. We apply a reflective, practice-inspired design science approach and accumulate design knowledge from six comprehensive CV projects. As a result, we identify four design-related mechanisms (i.e., automation, signaling, modification, and collaboration) that inform our derived meta-requirements and design principles. This can serve as a basis for further socio-technical research on CV-based HI systems.
Conference Paper
Full-text available
Conversational agents (CAs) have become integral parts of providers' service offerings, yet their potential is not fully exploited as users' acceptance and usage of CAs are often limited. Whereas previous research is rather technology-oriented, our study takes a user-centric perspective on the phenomenon. We conduct a systematic literature review to summarize the determinants of individuals' acceptance, adoption, and usage of CAs that have been examined in extant research, followed by an interview study to identify potential for further research. In particular, five concepts are proposed for further research: personality, risk aversion, cognitive style, self-efficacy, and desire for control. Empirical studies are encouraged to assess the impact of these user-specific concepts on individuals' decision to use CAs to eventually inform the design of CAs that facilitate users' acceptance, adoption, and use. This paper intends to contribute to the body of knowledge about the determinants of CA usage.
Article
Full-text available
Current developments in Artificial Intelligence (AI) led to a resurgence of Explainable AI (XAI). New methods are being researched to obtain information from AI systems in order to generate explanations for their output. However, there is an overall lack of valid and reliable evaluations of the effects on user's experience and behaviour of explanations. New XAI methods are often based on an intuitive notion what an effective explanation should be. Contrasting rule- and example-based explanations are two exemplary explanation styles. In this study we evaluated the effects of these two explanation styles on system understanding, persuasive power and task performance in the context of decision support in diabetes self-management. Furthermore, we provide three sets of recommendations based on our experience designing this evaluation to help improve future evaluations. Our results show that rule-based explanations have a small positive effect on system understanding, whereas both rule- and example-based explanations seem to persuade users in following the advice even when incorrect. Neither explanation improves task performance compared to no explanation. This can be explained by the fact that both explanation styles only provide details relevant for a single decision, not the underlying rational or causality. These results show the importance of user evaluations in assessing the current assumptions and intuitions on effective explanations.
Article
Full-text available
Artificial Intelligence (AI) has diffused into many areas of our private and professional life. In this research note, we describe exemplary risks of black-box AI, the consequent need for explainability, and previous research on Explainable AI (XAI) in information systems research. Moreover, we discuss the origin of the term XAI, generalized XAI objectives and stakeholder groups, as well as quality criteria of personalized explanations. We conclude with an outlook to future research on XAI.
Conference Paper
Full-text available
The increasing importance of artificial intelligence in everyday work means that new insights into team collaboration must be gained. It is important to research how changes in team composition affect joint work, as previous theories and insights on teams are based on the knowledge of pure human teams. Especially, when an AI-based system acts as the coequal partner in a collaboration scenario, its role within the team needs to be defined. We examine existing concepts of team roles and construct possible roles that AI-based teammates should fulfill in teams with the help of a conducted study (n=1.358). The preliminary results of a conducted exploratory factor analysis show four possible new roles, which we will validate in the future and compare with existing team role concepts in order to construct a model for team roles in human-machine collaboration.
Conference Paper
Full-text available
AI-based decision support systems (DSS) have become increasingly popular for solving a variety of tasks in both, low-stake, and high-stake situations. However, due to their complexity, they often lack transparency into their decision process. Therefore, the field of explainable AI (XAI) has emerged to provide explanations for these black-box systems. While XAI research assumes an increase in confidence when using their augmented grey-box systems, test designs for this proposition are scarce. Therefore, we propose an empirical study to test the effect of black-box, grey-box, and white-box explanations on a domain expert's confidence in the system, and subsequently on the effectiveness of the overall decision process. For this purpose, we derive hypotheses from theory and implement AI-based DSS with XAI augmentations for low-stake and high-stake situations. Further, we provide detailed information on a future survey-based study, which we will conduct to complete this research-in-progress.
Article
We introduce a novel model-agnostic system that explains the behavior of complex models with high-precision rules called anchors, representing local, "sufficient" conditions for predictions. We propose an algorithm to efficiently compute these explanations for any black-box model with high-probability guarantees. We demonstrate the flexibility of anchors by explaining a myriad of different models for different domains and tasks. In a user study, we show that anchors enable users to predict how a model would behave on unseen instances with less effort and higher precision, as compared to existing linear explanations or no explanations.