User Decision Guidance with Selective Explanation Presentation
from Explainable-AI
Yosuke Fukuchi1 and Seiji Yamada2,3
Abstract— This paper addresses the challenge of selecting
explanations for XAI (Explainable AI)-based Intelligent Deci-
sion Support Systems (IDSSs). IDSSs have shown promise in
improving user decisions through XAI-generated explanations
along with AI predictions, and the development of XAI made
it possible to generate a variety of such explanations. However,
how IDSSs should select explanations to enhance user decision-
making remains an open question. This paper proposes X-
Selector, a method for selectively presenting XAI explanations.
It enables IDSSs to strategically guide users to an AI-suggested
decision by predicting the impact of different combinations of
explanations on a user’s decision and selecting the combination
that is expected to minimize the discrepancy between an AI
suggestion and a user decision. We compared the efficacy of
X-Selector with two naive strategies (all possible explanations
and explanations only for the most likely prediction) and two
baselines (no explanation and no AI support) in virtual stock-
trading support scenarios. The results suggest the potential of
X-Selector to guide users to AI-suggested decisions and improve
task performance when AI accuracy is high.
I. INTRODUCTION
Intelligent Decision Support Systems (IDSSs) [1], empow-
ered by Artificial Intelligence (AI), have the potential to help
users make better decisions by introducing explainability into
their support. An increasing number of methods have been
proposed for achieving explainable AIs (XAIs) [2], and the
development of large language models (LLMs) has also made
it possible to generate various post-hoc explanations that
justify AI predictions. Previous studies have integrated such
XAI methods into IDSSs and shown their effectiveness in
presenting explanations along with AI predictions in diverse
applications [3], [4].
Now that IDSSs can have a variety of explanation can-
didates, a new question arises as to which explanations an
IDSS should provide in dynamic interaction. Explanation is
a complex cognitive process [5]. Although XAI explanations
can potentially guide users to make better decisions, there is
also a risk of having negative effects on explainees’ deci-
sions. Various causes including explanation uninterpretabil-
ity [6], [7], information overload [8], [9], and contextual
inaccuracy [10] can affect users and thus the performance
of decision-making. A subtle difference in the nuance of
a linguistic explanation can also have a different impact
and sometimes mislead user decisions when influenced by
the context, the status of the task, and the cognitive and
1 Faculty of Systems Design, Tokyo Metropolitan University, Tokyo, Japan. fukuchi@tmu.ac.jp. 2 Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo, Japan. 3 The Graduate University for Advanced Studies, SOKENDAI, Tokyo, Japan. This work was supported in part by JST CREST Grant Number JPMJCR21D4 and JSPS KAKENHI Grant Number JP24K20846.
psychological status of the users. Conversely, we can expect
that IDSSs can be greatly enhanced if they can strategically
select explanations that are likely to lead users to better
decisions while taking the situation into account.
To address the question of how IDSSs can select ex-
planations to improve user decisions, this paper proposes
X-Selector, a method for dynamically selecting which ex-
planations to provide along with an AI prediction. The
main characteristic of X-Selector is that it predicts how
explanations affect user decision-making for each trial and
attempts to guide users to an AI-suggested decision referring
to the prediction results. The design of guiding users with ex-
planation selection is inspired by libertarian paternalism [11],
the idea of influencing one’s choices to better ones while
embracing the autonomy of their decision-making.
This paper also reports user experiments that simulated
stock trading with an XAI-based IDSS. In a preliminary ex-
periment, we compared two naive but common strategies—
ALL (providing all possible explanations) and ARGMAX
(providing only explanations for the AI’s most probable
prediction)—against baseline scenarios providing no expla-
nations or no decision support. The results suggest that the
ARGMAX strategy works better with high AI accuracy,
and ALL is more effective when AI accuracy is lower,
indicating that the strategy for selecting explanations affects
user performance. In the second experiment, we compared
the results of explanations selected by X-Selector with
ARGMAX and ALL. The results indicate the potential of X-
Selector’s selective explanations to more strongly lead users
to suggested decisions and to outperform ARGMAX when
AI accuracy is high.
II. BACKGROUND
A. XAIs for deep learning models
While various methods such as Fuzzy Logic and Evo-
lutionary Computing have been introduced to IDSSs, this
paper targets IDSSs with Deep Learning (DL) models. IDSSs
driven by DL models are capable of dealing with high-
dimensional data such as visual images and are actively
studied in diverse fields [12]–[14]. Due to their blackbox
nature, explainability for DL models is also an area of active
research, and this can potentially offer benefits for IDSSs.
There are various forms of explanations depending on the
nature of the target AI. Common explanations for visual
information processing AIs include presenting saliency maps.
The class activation map (CAM) is a widely used method for
visualizing a saliency map of convolutional neural network
(CNN) layers [15]. It identifies the regions of an input image that contribute most to the model's classification of the image into a particular class.
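For reference (the standard CAM formulation from [15], restated here rather than taken from this paper's own equations), the class-c saliency map is a weighted sum of the last convolutional feature maps f_k:

  M_c(u, v) = \sum_k w_k^c f_k(u, v),

where w_k^c is the classifier weight connecting channel k to class c. Score-CAM, adopted later in this paper, instead derives the channel weights from the change in the class score observed when the input is masked by each upsampled activation map.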
Language is also a common modality of XAIs, and free-text explanations are rapidly becoming available thanks to advances in LLMs [16]. LLMs can generate post-hoc explanations for AI predictions. Here, post-hoc means that the explanations are generated after the AI's decision-making process has occurred, as opposed to intrinsic methods that generate explanations as an integral part of that process [2].
B. Human-XAI interaction
The theme of this study involves how to facilitate users’
appropriate use of AI. Avoiding human over/under-reliance
on an AI is a fundamental problem of human-AI interac-
tion [17]. Here, over-reliance is a state in which a human
overestimates the capability of an AI and blindly follows its
decision, whereas under-reliance is a state in which a human
misuses an AI even though it can perform well.
Although explanation is believed to generally help people
appropriately use AI by providing transparency in AI predic-
tions, previous studies suggest that XAI explanations do not
always work positively [18]. Maehigashi et al. demonstrated
that presenting AI saliency maps has different effects on
user trust in an AI depending on the task difficulty and
the interpretability of the saliency map [6]. Herm revealed
that the type of XAI explanation strongly influences users’
cognitive load, task performance, and task time [9]. Panigutti
et al. conducted a user study with an ontology-based clinical
IDSS, and they found that the users were influenced by
the explanations despite the low perceived explanation qual-
ity [19]. These results suggest potential risks of triggering
under-reliance with explanations or, conversely, leading users to blindly follow explanations from an IDSS even if the conclusion drawn from the explanations is incorrect.
This study aims to computationally predict how explana-
tions affect user decisions in order to avoid misleading users
and encourage them to make better decisions by selecting
explanations. Work by Wiegreffe et al. [16] shares a similar
concept with this study. They propose a method of evaluat-
ing explanations generated by LLMs by predicting human
ratings of their acceptability. This approach is pivotal in
understanding how users perceive AI-generated explanations.
However, our study diverges by focusing on the behavioral
impacts of these explanations on human decision-making.
We are particularly interested in how these explanations can alter the decisions made by users receiving an IDSS's support, rather than just their perceptions of the explanations.
relevant study, Pred-RC [20], [21], aims to predict the effect
of explanations of AI performance so that users can avoid
over/under-reliance. It dynamically predicts a user's binary judgment of whether they will assign a task to the AI and selects explanations that guide them to a better assignment. X-Selector takes a further step: it predicts concrete decisions while taking the effects of explanations into account and proactively influences them to improve the performance of each decision.
III. X-SELECTOR
A. Overview
This paper proposes X-Selector, a method for dynamically
selecting explanations for AI predictions. The idea of X-
Selector is that it predicts user decisions under possible
combinations of explanations and chooses the best one that is
predicted to minimize the discrepancy between the decision
that the user is likely to make and the AI-suggested one.
B. Algorithm
The main components of X-Selector are UserModel and
π. UserModel is a model of a user who makes decision d_u. X-Selector uses it for user decision prediction:

  UserModel(c, x, d) = P(d_u = d | c, x).   (1)

The output of UserModel is a probability distribution of d_u conditioned on c and x, where x ∈ X is a combination of explanations to be presented to the user, and c represents all the other contextual information, including AI predictions, task status, and user status. In this paper, we developed a dataset of (c, x, d_u) and prepared a machine learning model that was trained with the dataset for implementing this.
In addition, X-Selector has a policy π, which considers a decision d_AI based on c. This inference is done in parallel with user decision-making:

  π(c, d) = P(d_AI = d | c).   (2)
X-Selector aims to minimize the discrepancy between d_u and d_AI by comparing the effect of each x on d_u. The selected combination x̂ is calculated as

  x̂ = argmin_x | E_{d ∼ UserModel(c, x, ·)}[d] − E_{d ∼ π(c, ·)}[d] |.   (3)

To calculate this equation, X-Selector simulates how x will change d_u using UserModel and chooses the combination that guides d_u to d_AI the most.
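To make Equation 3 concrete, the following minimal Python sketch enumerates candidate combinations and keeps the one with the smallest expected discrepancy; user_model_expectation, policy_expectation, and candidate_combinations are hypothetical stand-ins for the components defined above, not code from the paper.

    # Minimal sketch of X-Selector's selection step (Equation 3).
    def select_explanations(context, candidate_combinations,
                            user_model_expectation, policy_expectation):
        d_ai = policy_expectation(context)               # E_{d ~ pi(c, .)}[d]
        best_x, best_gap = None, float("inf")
        for x in candidate_combinations:                 # e.g., 512 combinations in this paper
            d_user = user_model_expectation(context, x)  # E_{d ~ UserModel(c, x, .)}[d]
            gap = abs(d_user - d_ai)
            if gap < best_gap:
                best_x, best_gap = x, gap
        return best_x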
C. Implementation
1) Task: We implemented X-Selector in a stock trading
simulator in which users get support from an XAI-based
IDSS. Figure 1 shows screenshots of the simulator. The
simulation was conducted on a website. Participants were
virtually given three million JPY and traded stocks for 60
days with a stock price chart, AI prediction of the future
stock price, and explanations for the prediction.
In the simulation, participants checked the opening price
and a price chart for each day and decided whether to buy
stocks with the funds they had, sell stocks they had, or
hold their position. In accordance with Japan’s general stock
trading system, participants could trade stocks in units of
100 shares. Participants were asked to show their decision
twice a day to clarify the influence of the explanations. They
were first asked to decide an initial order d, that is, the
amount of trade only with chart information and without the
support of the IDSS. Then, the IDSS showed a bar graph that
indicated the output of a stock price prediction model and
Fig. 1: Screenshots of the trading simulator. (a) Chart. (b) Examples of StockAI's prediction and its explanations.
its explanations. We did not explicitly show d_AI, in order to enhance the autonomy of users' decision-making, which is inspired by libertarian paternalism [11], the idea of affecting behavior while respecting freedom of choice. However, X-Selector can easily be extended to a setting in which d_AI is given to users, by including it in c (when d_AI is always shown) or in x (when d_AI is to be shown selectively). Finally, they input their final order d′. After this, the simulator immediately transitioned to the next day. The positions carried over from the final day were converted into cash on the basis of the average stock price over the next five days to calculate the participants' total performance.
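As a small illustration of this settlement rule (a sketch under the assumption that leftover shares are valued at the mean of the next five days' prices; the function and variable names are hypothetical):

    # Hypothetical sketch of the end-of-simulation settlement described above:
    # remaining shares are converted to cash at the average of the next five days' prices.
    def final_assets(cash, shares, next_five_day_prices):
        settlement_price = sum(next_five_day_prices) / len(next_five_day_prices)
        return cash + shares * settlement_price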
2) StockAI: In the task, an IDSS provides a prediction of
a stock price prediction model (StockAI) as user support.
StockAI is a machine-learning model that is designed to
predict the average stock price in the next five business days,
and we used its prediction as the target of the explanation
provided to users. StockAI predicts future stock prices on
the basis of an image of a candlestick chart. Although using
a candlestick chart as an input does not necessarily lead
to better performance than modern approaches proposed in
the latest studies [22], we chose this because of the better
understandability of saliency maps generated with the model
as an explanation of AI predictions. Note that the aim of this research is not to build a high-performance prediction model but to investigate the interaction between a human and an AI whose performance is not necessarily perfect.
Fig. 2: Structure of UserModel
For the implementation of StockAI, we used ResNet-18 [23], a deep-learning visual information processing model, implemented with the PyTorch library1. StockAI is trained in
a supervised manner; it classifies the ratio of the future stock
price to the opening price of the day into three classes: BULL
(over +2%), NEUTRAL (from -2 to +2%), and BEAR (under
-2%). The prediction results are presented as a bar graph of
the probability distribution for each class, which is hereafter denoted as p. For the training, we collected the historical
stock data (from 2018/5/18 to 2023/5/16) of companies that
are included in the Japanese stock index Nikkei225. We split
the data by stock code, with three-quarters of the data as the
training dataset and the remainder as the test set. The three-class accuracy of the model was 0.474, and the binary accuracy, i.e., the rate at which the sign of the expected value of the model's prediction matched the sign of the actual fluctuation, was 0.63 on the test dataset.
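A minimal sketch of the setup described above, assuming torchvision's ResNet-18 and the ±2% labeling rule; preprocessing and training details are omitted:

    import torch
    from torchvision.models import resnet18

    # Label the ratio of the future 5-day average price to the day's opening price
    # into the three classes described above.
    def label_class(future_avg_price, opening_price):
        ratio = (future_avg_price - opening_price) / opening_price
        if ratio > 0.02:
            return 0   # BULL  (over +2%)
        if ratio < -0.02:
            return 2   # BEAR  (under -2%)
        return 1       # NEUTRAL

    # ResNet-18 with a 3-way head; softmax over the logits gives p (the bar graph).
    model = resnet18(num_classes=3)
    chart_image = torch.randn(1, 3, 224, 224)  # placeholder candlestick-chart tensor
    p = torch.softmax(model(chart_image), dim=1)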
3) Explanations: We prepared two types of explanations: saliency maps and free texts. We applied the CAM-based methods available in the pytorch-grad-cam package2 to StockAI and adopted Score-CAM [24] because it most clearly visualizes the saliency maps of StockAI. Because CAM-based methods can generate a saliency map for each prediction class, three maps were acquired for each prediction. Let x_map be the set of the acquired maps.
In addition, we created a set of free-text explanations based on the GPT-4V model in the OpenAI API [25], which allows images as input. We input a chart with a prompt that asked GPT-4V to generate two explanation sentences that justify each prediction class (BULL, NEUTRAL, and BEAR). Therefore, we acquired six sentences in total for each chart. Let us denote the set of them by x_text.
As a result, three saliency maps and six free-text explanations were available for each trading day, and X-Selector considered 2^9 = 512 combinations of the selected explanations (x̂ ⊆ x_map ∪ x_text).
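For concreteness, the 512 candidate combinations can be enumerated as show/hide flags over the nine explanations (a sketch; the flag representation anticipates the flag variable used for UserModel below):

    from itertools import product

    # Nine explanations per day: 3 saliency maps + 6 free-text sentences.
    # Each candidate combination is a tuple of show/hide flags (1 = shown, 0 = hidden).
    candidate_combinations = list(product([0, 1], repeat=9))
    assert len(candidate_combinations) == 512  # 2^9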
4) Models: We implemented UserModel with a deep learning model (Figure 2). The input of UserModel is a tuple (c, x). c includes four variables: date i, StockAI's prediction p, total rate δ, and initial order d. i is a categorical variable that embeds the context of the day such as the stock price. p is a three-dimensional vector that corresponds to the values in the bar graph (Fig. 1b). δ is the percentage increase or decrease of the user's total assets from the initial amount. i and the other variables are encoded into 2048-dimensional vectors (h_i, h_p, h_r, h_d) with the Embedding and Linear modules implemented in PyTorch, respectively.
1 https://pytorch.org/
2 https://github.com/frgfm/torch-cam
Let us denote x_map and x_text as sets of tuples {(x, cls, flag)}, where x is the raw data of an explanation, cls ∈ {BULL, NEUTRAL, BEAR}, and flag = 1 if x is to be presented and 0 if hidden. x_map and x_text are also encoded into 2048-dimensional vectors (h_map, h_text):

  h_map = Σ_{(x, cls, flag) ∈ x_map} flag · (CNN(x) ⊙ ClsEmbedding(cls)),
  h_text = Σ_{(x, cls, flag) ∈ x_text} flag · (TextEncoder(x) ⊙ ClsEmbedding(cls)),

where ⊙ denotes an element-wise product. CNN is a three-layer CNN model. For TextEncoder, we used the E5 (embeddings from bidirectional encoder representations) model [26] with pretrained parameters3.
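The masked-sum encoding above can be sketched as follows; encoder and cls_embedding stand in for the CNN or TextEncoder and ClsEmbedding modules and are assumed to return 2048-dimensional vectors:

    import torch

    # Sketch of the h_map / h_text encoding: each explanation's feature vector is
    # gated by its class embedding (element-wise product) and its show/hide flag,
    # then summed over the set.
    def encode_explanations(items, encoder, cls_embedding):
        h = torch.zeros(2048)
        for x, cls, flag in items:   # flag = 1 if shown, 0 if hidden
            h = h + flag * (encoder(x) * cls_embedding(cls))
        return h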
All embedding vectors (h_i, h_p, h_r, h_d, h_map, h_text) are concatenated and input to a three-layer linear model. To extract the influence of explanations, the model was trained to predict not d′ itself but the difference d′ − d. In our initial trial, UserModel always predicted d′ − d to be nearly 0 due to the distributional bias, so we added an auxiliary task of predicting whether d′ = d and trained the model to predict d′ − d only when d′ ≠ d. The expected value of d_u in Equation 3 is P(d′ ≠ d) · (d′ − d) + d.
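In other words, the expected final order combines the predicted probability that the user changes the initial order with the predicted size of that change (a sketch; change_prob and predicted_diff denote hypothetical outputs of the two heads of UserModel):

    # Sketch of the expected value used in Equation 3:
    # E[d_u] = P(d' != d) * predicted_diff + d, where predicted_diff approximates (d' - d).
    def expected_final_order(initial_order_d, change_prob, predicted_diff):
        return change_prob * predicted_diff + initial_order_d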
π was acquired with the deep deterministic policy gradient, a deep reinforcement learning method [27]. We simply trained π to decide d to maximize assets on the basis of p for the training dataset. The reward for the policy is calculated as the difference in total assets between the current day and the previous day.
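The reward signal is simply the day-over-day change in total assets (a sketch with hypothetical variable names):

    # Reinforcement-learning reward described above: the change in total assets
    # from the previous day to the current day.
    def daily_reward(total_assets_today, total_assets_yesterday):
        return total_assets_today - total_assets_yesterday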
IV. EXPERIMENTS
A. Preliminary experiment
1) Procedure: We conducted a preliminary experiment
to investigate the performance of users who were provided
explanations with two naive strategies (ALL and ARGMAX).
ALL shows all nine explanations available for each day, and ARGMAX selects explanations for StockAI's most probable prediction. To isolate the contribution of the explanations themselves, we also prepared two baselines: ONLY_PRED shows p but does not provide any explanations. In PLAIN, participants
received no support from the IDSS and acted on their own.
For simulation, we chose a Japanese general trading com-
pany (code: 2768) from the test dataset on the basis of the
common stock price range (1,000 - 3,000 JPY) and its high
volatility compared with the other Nikkei225 companies.
Because we had anticipated that the accuracy of StockAI
would affect the result, we prepared two scenarios: high-
accuracy and low-accuracy. We calculated the moving aver-
age of the accuracy of StockAI with a window size of 60
and chose two sections for them. The accuracy of StockAI
3 https://huggingface.co/intfloat/multilingual-e5-large
TABLE I: Sample sizes of preliminary experiment

                ALL   ARGMAX   ONLY_PRED   PLAIN
High-accuracy    39       40          34      41
Low-accuracy     31       34          34      38
for high-accuracy was 0.750, which was the highest, and that
for low-accuracy was 0.333, the chance level of three-class
classification.
We recruited participants to join the simulation with com-
pensation of 220 JPY through Lancers4, a Japanese crowd-
sourcing platform, and got 336 participants. The participants were first provided with pertinent information, and 325 consented to participate. We gave them instructions on the task
and gave basic explanations about stock charts and the price
prediction AI. We instructed the participants to increase the
given three million JPY as much as possible by trading with
the IDSS’s support. To motivate them, we told them that
additional rewards would be given to the top performers.
We did not notify them of the amount of the additional
rewards and the number of participants who got them. We
asked six questions to check their comprehension of the task.
34 participants who failed to answer correctly were excluded
from the task. After familiarization with the trading simula-
tor, the participants traded for 60 virtual days successively.
242 participants completed the task (152 males, 88 females, and 2 did not answer; aged 14-77, M = 42.8, SD = 10.1).
Table I gives details on the sample sizes.
2) Result: Figure 3 shows the changes in the partici-
pants’ performance. A conspicuous result is the underperfor-
mance of PLAIN, particularly in the high-accuracy scenario.
ONLY_PRED performed well for high-accuracy but could not outperform PLAIN for low-accuracy. This suggests that presenting p alone contributes to improving performance only when it is sufficiently accurate.
ALL and ARGMAX showed different results between the
scenarios. For high-accuracy, ARGMAX outperformed ALL.
ALL slightly underperformed the ONLY_PRED baseline as
well. This suggests that ARGMAX explanations successfully
guided users to follow the prediction of StockAI while
ALL toned down the guidance, which worked negatively
in this scenario. On the other hand, ALL outperformed
ARGMAX and the baselines for low-accuracy. Interestingly,
ARGMAX also outperformed the baselines, which suggests
that explanations successfully provide users with insights
into situations and AI accuracy and can contribute to better
decision-making. ALL worked positively for low-accuracy by providing multiple perspectives.
B. Experiment with X-Selector
1) Procedure: To evaluate X-Selector, we conducted a
simulation with its selected explanations. To train User-
Model, we used the data of the preliminary experiment and
additional data acquired in another experiment in which
explanations were randomly selected. We added the data
to broaden the variety of explanation combinations in the
4 https://lancers.jp/
Fig. 3: Comparisons of baseline total assets over the 60 trading days (x-axis: day; y-axis: asset) for the ALL, ARGMAX, ONLY_PRED, and PLAIN conditions. Error bands represent standard errors. (a) High-accuracy. (b) Low-accuracy.
Fig. 4: Results for high-accuracy scenario. (a) Distribution of the correlation coefficient between d_u and d_AI for each user (x-axis: corrcoef; y-axis: density). (b) User performance (x-axis: day; y-axis: asset). Conditions: X-Selector, ALL, ARGMAX.
dataset. The numbers of the additional participants were
54 and 45 for high- and low-accuracy, respectively. We
conducted a 4-fold cross validation for UserModel, and the
correlation coefficient between the model’s predictions and
the ground truths was 0.429 on average (SD = 0.056).
We obtained 97 participants. Finally, 39 and 35 participants completed the task for high-accuracy and low-accuracy, respectively (46 males, 26 females, and 2 did not answer; aged 23-64, M = 39.6, SD = 10.0).
To analyze the results, we computed the correlation coefficient between d_AI and d_u for each participant as a measure of whether X-Selector successfully guided users to d_AI, in addition to the comparison of user performance that we did in the preliminary experiment.
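This per-participant measure can be computed directly from each participant's daily orders (a sketch; the array names are hypothetical):

    import numpy as np

    # Correlation between a participant's final orders and the AI-suggested
    # orders across the 60 trading days.
    def guidance_score(d_user_by_day, d_ai_by_day):
        return np.corrcoef(d_user_by_day, d_ai_by_day)[0, 1]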
2) Result: Figure 4a shows the distribution of the correlation coefficient between d_AI and d_u in the high-accuracy condition. The results for ALL and ARGMAX are also shown for comparison. Notably, while the peaks for ALL and ARGMAX are centered around zero, X-Selector shifted this peak to the right, indicating a stronger correlation between d_u and d_AI for a greater number of participants. This means that X-Selector effectively guided users to d_AI not by coercion but by presenting explanations selectively.
Figure 4b illustrates the user performance. X-Selector
generally outperformed ALL and ARGMAX, meaning that
X-Selector enabled users to trade better with selective ex-
planations. In more detail, X-Selector first underperformed
ARGMAX, but the score reversed on day 16. The gap once
narrowed near day 39, but it broadened again until the end.
A possible reason for X-Selector’s better performance is
that it can predict which combination of explanations guides
participants to sell or buy shares more. For example, the
stock price around day 16 dropped steeply, so the IDSS
needed to guide participants to reverse their position. Here,
whereas ARGMAX showed explanations for BEAR, X-
Selector showed explanations for NEUTRAL as well as
BEAR, which may have helped users sell their shares more.
Similarly, X-Selector also attempted to guide users to buy
a moderate amount when d_AI was positive but not high by,
for example, showing only a saliency map for BULL and
no text explanations. Another reason is that X-Selector can overcome the ambiguity in the interpretation of p. p reflects the momentum of the stock price in the high-accuracy scenario and must provide some insight for trading, but how to use it to actually decide an order was up to the participants. π sometimes suggested buying shares even though NEUTRAL or BEAR was the most likely class in p. Thus, we can say that p was poorly calibrated, but by referring to π, X-Selector could avoid misleading participants and instead lead them to more promising decisions.
On the other hand, X-Selector underperformed ARGMAX
until day 16. The stock price was in an uptrend until day
14, and ARGMAX continuously presented explanations for
BULL for 12 days in a row, which may have strongly guided
participants to buy stocks and led to large gains. In our implementation, UserModel considers only the explanations of the current day and does not take into account the history of previously provided explanations, which is a target for future work.
X-Selector could not improve user performance in the
low-accuracy scenario (Figure 5b). Overall, the result was similar to that of ARGMAX and underperformed ALL. We further focused on the correlation coefficient between d_u and d_AI.
Figure 5a shows that, contrary to the high-accuracy scenario,
X-Selector did not increase the score. The different results
between the high- and low-accuracy scenarios indicate the
possibility that participants actively assessed the reliability
of the AI and autonomously decided whether to follow X-
Fig. 5: Results for low-accuracy scenario. (a) Distribution of the correlation coefficient between d_u and d_AI (x-axis: corrcoef; y-axis: density). (b) User performance (x-axis: day; y-axis: asset). Conditions: X-Selector, ALL, ARGMAX.
Selector’s guidance. This itself highlights a positive aspect of
introducing libertarian paternalism to human-AI interaction
in that users can potentially avoid AI failure depending on
its reliability. However, this did not result in improving their
performance in this scenario. The lack of correlation between the score and the final asset amounts in the X-Selector condition (r = 0.048) suggests that merely disregarding the AI's guidance does not guarantee a performance improvement. A future direction for this problem is to develop a mechanism that controls the strength of AI guidance and provides explanations in a more neutral way depending on AI accuracy.
V. CONCLUSION
This paper investigated the question of how IDSSs can
select explanations, and we proposed X-Selector, which is
a method for dynamically selecting which explanations to
provide along with an AI prediction. In X-Selector, User-
Model predicts the effect of presenting explanations on a
user decision for each possible combination to show. Then,
it selects the best combination that minimizes the difference
between the predicted user decision and the AI’s suggestion.
We applied X-Selector to a stock trading simulation with the
support of an XAI-based IDSS. The results indicated that X-Selector can select explanations that effectively guide users to suggested decisions and improve performance when the accuracy of the AI is high; in addition, they revealed a new challenge for X-Selector in low-accuracy cases.
REFERENCES
[1] G. Phillips-Wren, Intelligent Decision Support Systems, 02 2013, pp.
25–44.
[2] A. Adadi and M. Berrada, “Peeking inside the black-box: A survey
on explainable artificial intelligence (xai),” IEEE Access, vol. 6, pp.
52138–52160, 2018.
[3] M. H. Lee and C. J. Chew, “Understanding the effect of counterfactual
explanations on trust and reliance on ai for human-ai collaborative
clinical decision making,” Proc. ACM Hum.-Comput. Interact., vol. 7,
no. CSCW2, oct 2023.
[4] D. P. Panagoulias, E. Sarmas, V. Marinakis, M. Virvou, G. A.
Tsihrintzis, and H. Doukas, “Intelligent decision support for energy
management: A methodology for tailored explainability of artificial
intelligence analytics,” Electronics, vol. 12, no. 21, 2023.
[5] T. Miller, “Explanation in artificial intelligence: Insights from the
social sciences,” Artificial Intelligence, vol. 267, pp. 1–38, 2019.
[6] A. Maehigashi, Y. Fukuchi, and S. Yamada, “Modeling reliance on xai
indicating its purpose and attention,” in Proc. Annu. Meet. of CogSci,
vol. 45, 2023, pp. 1929–1936.
[7] ——, “Empirical investigation of how robot’s pointing gesture influ-
ences trust in and acceptance of heatmap-based xai,” in 2023 32nd
IEEE Intl. Conf. RO-MAN, 2023, pp. 2134–2139.
[8] A. N. Ferguson, M. Franklin, and D. Lagnado, “Explanations that backfire: Explainable artificial intelligence can cause information overload,” in Proc. Annu. Meet. of CogSci, vol. 44, no. 44, 2022.
[9] L.-V. Herm, “Impact of explainable ai on cognitive load: Insights from an empirical study,” in 31st Euro. Conf. Info. Syst., 2023, 269.
[10] U. Ehsan, P. Tambwekar, L. Chan, B. Harrison, and M. O. Riedl, “Automated rationale generation: A technique for explainable ai and its effects on human perceptions,” in Proc. 24th Int. Conf. IUI, 2019, pp. 263–274.
[11] C. R. Sunstein, Why Nudge?: The Politics of Libertarian Paternalism.
Yale University Press, 2014.
[12] M. Kraus and S. Feuerriegel, “Decision support from financial dis-
closures with deep neural networks and transfer learning,” Decision
Support Systems, vol. 104, pp. 38–48, 2017.
[13] A. Chernov, M. Butakova, and A. Kostyukov, “Intelligent decision support for power grids using deep learning on small datasets,” in 2020 2nd Intl. Conf. SUMMA, 2020, pp. 958–962.
[14] C.-Y. Hung, C.-H. Lin, T.-H. Lan, G.-S. Peng, and C.-C. Lee,
“Development of an intelligent decision support system for ischemic
stroke risk assessment in a population-based electronic health record
database,” PLOS ONE, vol. 14, no. 3, pp. 1–16, 03 2019.
[15] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proc. IEEE Conf. CVPR, June 2016.
[16] S. Wiegreffe, J. Hessel, S. Swayamdipta, M. Riedl, and Y. Choi, “Reframing human-AI collaboration for generating free-text explanations,” in Proc. of the 2022 Conf. of NAACL. ACL, Jul. 2022, pp. 632–658.
[17] R. Parasuraman and V. Riley, “Humans and automation: Use, misuse,
disuse, abuse,” Human Factors, vol. 39, no. 2, pp. 230–253, 1997.
[18] H. Vasconcelos, M. Jörke, M. Grunde-McLaughlin, T. Gerstenberg, M. S. Bernstein, and R. Krishna, “Explanations can reduce overreliance on ai systems during decision-making,” Proc. ACM Hum.-Comput. Interact., vol. 7, no. CSCW1, apr 2023. [Online]. Available: https://doi.org/10.1145/3579605
[19] C. Panigutti, A. Beretta, F. Giannotti, and D. Pedreschi, “Understand-
ing the impact of explanations on advice-taking: A user study for
ai-based clinical decision support systems,” in Proc. of 2022 CHI.
ACM, 2022.
[20] Y. Fukuchi and S. Yamada, “Selectively providing reliance calibration
cues with reliance prediction,” in Proc. Annu. Meet. of CogSci, vol. 45,
2023, pp. 1579–1586.
[21] ——, “Dynamic selection of reliance calibration cues with ai reliance
model,” IEEE Access, vol. 11, pp. 138870–138881, 2023.
[22] J.-F. Chen, W.-L. Chen, C.-P. Huang, S.-H. Huang, and A.-P. Chen,
“Financial time-series data analysis using deep convolutional neural
networks,” in 2016 7th Intl. Conf. on CCBD, 2016, pp. 87–92.
[23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proc. IEEE Conf. CVPR, 2016, pp. 770–778.
[24] H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel,
and X. Hu, “Score-cam: Score-weighted visual explanations for convo-
lutional neural networks,” in Proc. IEEE/CVF Conf. CVPR Workshops,
June 2020.
[25] OpenAI, “Gpt-4 technical report,” 2023.
[26] L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder,
and F. Wei, “Text embeddings by weakly-supervised contrastive pre-
training,” 2022.
[27] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep q-
learning with model-based acceleration,” in Proc. 33rd Intl. Conf. on
Machine Learning, vol. 48, 20–22 Jun 2016, pp. 2829–2838.