Artificial Intelligence 291 (2021) 103404
Evaluating XAI: A comparison of rule-based and
example-based explanations
Jasper van der Waa a,b,∗, Elisabeth Nieuwburg a,c, Anita Cremers a,
Mark Neerincx a,b
a TNO, Perceptual & Cognitive Systems, Soesterberg, Netherlands
b Technical University of Delft, Interactive Intelligence, Delft, Netherlands
c University of Amsterdam, Institute of Interdisciplinary Studies, Amsterdam, Netherlands
Article history:
Received 21 February 2020
Received in revised form 20 August 2020
Accepted 26 October 2020
Available online 28 October 2020
Keywords:
Explainable Artificial Intelligence (XAI)
User evaluations
Contrastive explanations
Artificial Intelligence (AI)
Machine learning
Decision support systems
Current developments in Artificial Intelligence (AI) have led to a resurgence of Explainable AI (XAI). New methods are being researched to obtain information from AI systems in order to generate explanations for their output. However, there is an overall lack of valid and reliable evaluations of the effects of explanations on users' experience and behavior. New XAI methods are often based on an intuitive notion of what an effective explanation should be. Rule-based and example-based contrastive explanations are two exemplary explanation styles. In this study we evaluate the effects of these two explanation styles on system understanding, persuasive power and task performance in the context of decision support in diabetes self-management. Furthermore, we provide three sets of recommendations based on our experience designing this evaluation to help improve future evaluations. Our results show that rule-based explanations have a small positive effect on system understanding, whereas both rule-based and example-based explanations seem to persuade users into following the advice even when it is incorrect. Neither explanation style improves task performance compared to no explanation. This can be explained by the fact that both explanation styles only provide details relevant to a single decision, not the underlying rationale or causality. These results show the importance of user evaluations in assessing current assumptions and intuitions about effective explanations.
©2020 Elsevier B.V. All rights reserved.
1. Introduction
Humans expect others to comprehensibly explain decisions that have an impact on them [1]. The same holds for humans
interacting with decision support systems (DSS). To help them understand and trust a system’s reasoning, such systems
need to explain their advice to human users [1,2]. Currently, several approaches have been proposed in the field of Explainable Artificial Intelligence (XAI) that allow a DSS to generate explanations [3]. Aside from the numerous computational evaluations of implemented methods, literature reviews show that there is an overall lack of high-quality user evaluations that add a user-centered focus to the field of XAI [4,5]. As explanations fulfill a user need, explanations generated by a DSS need to be evaluated with these users. This can provide valuable insights into user requirements and effects. In addition, evaluations
can be used to benchmark XAI methods to measure the research field’s progress.
*Corresponding author at: TNO, Perceptual & Cognitive Systems, Soesterberg, Netherlands.
E-mail address: jasper.vanderwaa@tno.nl (J.S. van der Waa).
https://doi.org/10.1016/j.artint.2020.103404
0004-3702/©2020 Elsevier B.V. All rights reserved.
The contribution of this article is twofold. First, we propose a set of recommendations on designing user evaluations
in the field of XAI. Second, we performed an extensive user evaluation on the effects of rule-based and example-based
contrastive explanations. The recommendations regard 1) how to construct a theory of the effects that explanations are
expected to have, 2) how to select a use case and participants to evaluate that theory, and 3) which types of measurements
to use for the theorized effects. These recommendations are intended as a reference for XAI researchers unfamiliar with user evaluations. They are based on our experience designing a user evaluation and revisit knowledge that is more common in fields such as cognitive psychology and Human-Computer Interaction.
The present user study focused on two styles of contrastive explanations and their evaluation. Contrastive explanations
in the context of a DSS are those that answer questions such as "Why this advice instead of that advice?" [6]. These explanations
help users to understand and pinpoint information that caused the system to give one advice over the other. In two separate
experiments, we evaluated two contrastive explanation styles. An explanation style defines the way information is structured
and is often defined by the algorithmic approach to generate explanations. Note that this is different from explanation form,
which defines how it is presented (e.g. textually or visually). The two evaluated styles were rule-based and example-based
explanations, with no explanation as a control. These two styles of explanations are often referred to as means to convey a
system’s internal workings to a user. However, such claims have not yet been formalized into a theory, nor have the two styles been compared in detail.
Hence, our second contribution is the evaluation of the effects that rule-based and example-based explanations have on
system understanding (Experiment I), persuasive power and task performance (Experiment II). We define system understanding
as the user’s ability to know how the system behaves in a novel situation and why. The persuasive power of an explanation
is defined as its capacity to convince the user to follow the given advice independent of whether it is correct or not. Task
performance is defined as the decision accuracy of the combination of the system, explanation and user. Together, these
concepts relate to the broader concept of trust, an important topic in XAI research. System understanding is believed to help
users achieve an appropriate level of trust in a DSS, and both system understanding and appropriate trust are assumed to
improve task performance [7]. Explanations might also persuade the user to various extents, resulting in either appropriate,
over- or under-trust, which could affect task performance [8]. Instead of measuring trust directly, we opted for measuring
the intermediate variables of understanding and persuasion to better understand how these concepts affect the task.
The way of structuring explanatory information differs between the two explanation styles examined in this study. Rule-
based explanations are “if... then...” statements, whereas example-based explanations provide historical situations similar to
the current situation. In our experiments, both explanation styles were contrastive, comparing a given advice to an alter-
native advice that was not given. The rule-based contrastive explanations explicitly conveyed the DSS’s decision boundary
between the given advice and the alternative advice. The example-based contrastive explanations provided two examples, one
on either side of this decision boundary, both as similar as possible to the current situation. The first example illustrated
a situation where the given advice proved to be correct, and the second example showed a different situation where an
alternative advice was correct.
Rule-based explanations explicitly state the DSS’s decision boundary between the given and the contrasting advice. Given
this fact, we hypothesized that these explanations improve a participant’s understanding of system behavior, causing an
improved task performance compared to example-based explanations. Specifically, we expected participants to be able to
identify the most important feature used by the DSS in a given situation, replicate this feature’s relevant decision thresholds
and use this knowledge to predict the DSS’s behavior in novel situations. When the user is then confronted with feedback on how correct these
decisions were, this knowledge would result in a better estimate of when a DSS’s advice is correct or not. However, rule-
based explanations are very factual and provide little information to convince the participant of the correctness of a given
advice. As such, we expected rule-based explanations to have little persuasive power. For the example-based explanations
we hypothesized opposite effects. As examples of correct past behavior would incite confidence in a given advice, we
hypothesized them to hold more persuasive power. However, the amount of understanding a participant would gain would
be limited, as it would rely on participants inferring the separating decision boundary between the examples rather than
having it presented to them. Whether persuasive power is desirable in an explanation depends on the use case as well as
the performance of the DSS. A low-performance DSS combined with a highly persuasive explanation, for example, would likely result in low task performance.
The use case of the user evaluation was based on a diabetes mellitus type 1 (DMT1) self-management context, where
patients are assisted by a personalized DSS to decide on the correct dosage of insulin. Insulin is a hormone that DMT1
patients have to administer to prevent the negative effects of the disturbed blood glucose regulation associated with this
condition. The dose is highly personal and context dependent, and an incorrect dose can cause the patient short- or long-
term harm. The purpose of the DSS’s advice is to minimize these adverse effects. This use case was selected for two
reasons. Firstly, AI is increasingly used in DMT1 self-management [9–11]. Therefore, the results are relevant for research on DSS-aided DMT1 self-management. Secondly, this use case was both understandable and motivating for healthy
participants without any experience with DMT1. Because DMT1 patients would have potentially confounding experience
with insulin administration or certain biases, we recruited healthy participants who imagined themselves in the situation of a DMT1 patient. Empathizing with a patient motivated them to make correct decisions, even if this meant ignoring the DSS’s advice in favor of their own choice, or vice versa. This required an understanding of when the DSS’s advice would be
correct and incorrect and how it would behave in novel situations.
The paper is structured as follows. First we discuss the background and shortcomings of current XAI user evaluations.
Furthermore, we provide examples of how rule-based and example-based explanations are currently used in XAI. The subsequent section describes three sets of recommendations for user evaluations in XAI, based on our experience designing
the evaluation as well as on relevant literature. Next, we illustrate our own recommendations by explaining the use case
in more detail and offering the theory behind our hypotheses. This is followed by a detailed description of our methods,
analysis and results. We conclude with a discussion on the validity and reliability of the results and a brief discussion of
future work.
2. Background
The following two sections discuss the current state of user evaluations in XAI and rule-based and example-based con-
trastive explanations. The former illustrates the shortcomings of current user evaluations, caused by either a lack of validity and reliability or the entire omission of an evaluation. The latter discusses the two explanation styles used in our
evaluation in more detail, and illustrates their prevalence in the field of XAI.
2.1. User evaluations in XAI
A major goal of Explainable Artificial Intelligence (XAI) is to have AI-systems construct explanations for their own output.
Common purposes of these explanations are to increase system understanding [12], improve behavior predictability [13]
and calibrate system trust [14,15,8]. Other purposes include support in system debugging [16,12], verification [13] and
justification [17]. Currently, the exact purpose of explanation methods is often not defined or formalized, even though these
different purposes may result in profoundly different requirements for explanations [18]. This makes it difficult for the field
of XAI to progress and to evaluate developed methods.
The difficulties in XAI user evaluations are reflected in recent surveys from Anjomshoae et al. [5], Adadi et al. [19], and
Doshi-Velez and Kim [4] that summarize current efforts of user evaluations in the field. The systematic literature review
by [5] shows that 97% of the 62 reviewed articles underline that explanations serve a user need, but 41% did not evaluate their explanations with such users. In addition, of those papers that performed a user evaluation, relatively few provided a good discussion of the context (27%), results (19%) and limitations (14%) of their experiment. The second survey, from [19], evaluated 381 papers and found that only 5% had an explicit focus on the evaluation of the XAI methods. These two
surveys show that, although user evaluations are being conducted, many of them provide limited conclusions for other XAI
researchers to build on.
A third survey, by [4], discusses an explicit issue with user evaluations in XAI. The authors argue for systematically evaluating different explanation styles and forms in various domains, a rigor that is currently lacking in XAI user evalua-
tions. To do so in a valid way, several recommendations are given. First, the application level of the study context should be
made clear; either a real, simplified or generic application. Second, any (expected) task-specific explanation requirements
should be mentioned. Examples include the average human level of expertise targeted, and whether the explanation should
address the entire system or a single output. Finally, the explanations and their effects should be clearly stated together
with a discussion of the study’s limitations. Together, these three surveys illustrate the shortcomings of current XAI user
evaluations.
From several studies that do focus on evaluating user effects, we note that the majority focuses on subjective measure-
ment. Surveys and interviews are used to measure user satisfaction [20,21], the goodness of an explanation [22], acceptance
of the system’s advice [23,24] and trust in the system [25–28]. Such subjective measurements can provide a valuable in-
sight into the user’s perspective on the explanation. However, these results do not necessarily relate to the behavioral effects
an explanation could cause. Therefore, these subjective measurements require further investigation to see if they correlate
with a behavioral effect [7]. Without such an investigation, these subjective results only provide information on the user’s
beliefs and opinions, but not on actual gained understanding, trust or task performance. Some studies, however, do perform
objective measurements. The work from [29], for example, measured both the subjective ease-of-use of an explanation and a
participant’s capacity to correctly make inferences based on the explanations. This allowed the authors to differentiate be-
tween behavioral and self-perceived effects of an explanation, underlining the value of performing objective measurements.
The above described critical view on XAI user evaluations is related to the concepts of construct validity and reliability.
These two concepts provide clear standards to scientifically sound user evaluations [30–32]. The construct validity of an
evaluation is its accuracy in measuring the intended constructs (e.g. understanding or trust). Examples of how validity may
be harmed are a poor design, ill-defined constructs or arbitrarily selected measurements. Reliability, on the other hand, refers
to the evaluation’s internal consistency and reproducibility, and may be harmed by a lack of documentation, an unsuitable
use case or noisy measurements. In the social sciences, a common condition for results to be generalized to other cases
and to infer causal relations is that a user evaluation is both valid and reliable [30]. This can be (partially) obtained by
developing different types of measurements for common constructs. For example, self-reported subjective measurements
such as ratings and surveys can be supplemented by behavioral measurements to gather data on the performance in a
specific task.
2.2. Rule-based and example-based explanations
Human explanations tend to be contrastive: they compare a certain phenomenon (fact) with a hypothetical one (foil)
[33,34]. In the case of a decision support system (DSS), a natural question to ask is “Why this advice?”. This question
Fig. 1. An overview of three sets of practical recommendations to improve user evaluations for XAI.
implies a contrast, as the person asking this question often has an explicit contrasting foil in mind. In other words, the
implicit question is “Why this advice and not that advice?”. The specific contrast allows the explanation to be limited to the
differences between fact and foil. Humans use contrastive explanations to explain events in a concise and specific manner
[2]. This advantage also applies to systems: contrastive explanations narrow down the available information to a concrete
difference between two outputs.
Contrastive explanations can vary depending on the way the advice is contrasted with a different advice, for example
using rules or examples. Within the context of a DSS advising an insulin dose for DMT1 self-management, a contrastive
rule-based explanation could be: “Currently the temperature is below 10 degrees and a lower insulin dose is advised. If the
temperature was above 30 degrees, a normal insulin dose would have been advised.” This explanation contains two rules
that explicitly state the differentiating decision boundaries between the fact and foil. Several XAI methods aim to generate
this type of “if... then...” rule, such as the methods described in [35–38].
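For illustration only (the explanations in this study were hand-designed rather than generated by an XAI method), a minimal sketch of how a contrastive rule-based explanation could be assembled once the relevant decision boundary is known; the function, feature names and threshold values below are hypothetical:

```python
# Minimal sketch: composing a contrastive "if... then..." explanation from a
# known decision threshold. Feature names and thresholds are illustrative only.

def contrastive_rule_explanation(situation, fact, foil, thresholds):
    """Explain why `fact` was advised instead of `foil` by stating the
    decision boundary that separates the two advices."""
    feature, boundary, comparator = thresholds[(fact, foil)]
    value = situation[feature]
    return (
        f"Currently {feature} is {value}, so a {fact} insulin dose is advised. "
        f"If {feature} were {comparator} {boundary}, a {foil} dose would have been advised."
    )

# Hypothetical boundary: below 10 degrees -> lower dose, above 30 -> normal dose.
thresholds = {("lower", "normal"): ("the temperature", "30 degrees", "above")}
situation = {"the temperature": "8 degrees"}

print(contrastive_rule_explanation(situation, "lower", "normal", thresholds))
```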
An example-based explanation refers to historical situations in which the advice was found to be true or false: “The
temperature is currently 8 degrees, and a lower insulin dose is advised. Yesterday was similar: it was 7 degrees and the
same advice proved to be correct. Two months ago, when it was 31 degrees, a normal dose was advised instead, which
proved to be correct for that situation”. Such example- or instance-based explanations are often used between humans, as
they illustrate past behavior and allow for generalization to new situations [39–42]. Several XAI methods try to identify
examples to generate such explanations, for example those from [43–47].
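Analogously, a contrastive example-based explanation can be sketched as retrieving, for each of the two advices, the most similar historical case. This generic nearest-neighbour sketch is ours, not one of the cited methods, and the history records, feature encoding and distance metric are made up:

```python
# Minimal sketch: retrieving two contrastive examples from a history of past
# situations, one per advice, each as similar as possible to the current one.

import math

history = [
    # (features, advice_that_proved_correct) -- hypothetical records
    ({"temperature": 7.0, "hours_slept": 8.0}, "lower"),
    ({"temperature": 31.0, "hours_slept": 7.5}, "normal"),
    ({"temperature": 12.0, "hours_slept": 5.0}, "higher"),
]

def distance(a, b):
    """Euclidean distance over the shared numeric features."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))

def contrastive_examples(current, fact, foil):
    """Return the most similar past case for the given advice (fact) and for
    the alternative advice (foil)."""
    fact_case = min((h for h in history if h[1] == fact), key=lambda h: distance(current, h[0]))
    foil_case = min((h for h in history if h[1] == foil), key=lambda h: distance(current, h[0]))
    return fact_case, foil_case

current = {"temperature": 8.0, "hours_slept": 8.0}
fact_case, foil_case = contrastive_examples(current, fact="lower", foil="normal")
print("Similar case where the given advice was correct:", fact_case[0])
print("Similar case where the alternative was correct: ", foil_case[0])
```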
Research on system explanations using rules and examples is not new. Most of the existing research focused on exploring how users prefer a system to reason: by rules or through examples. For example, users prefer an example-based spam filter over a rule-based one [48], while they prefer spam-filter explanations to be rule-based [49]. Another evaluation
showed that the number of rule factors in an explanation had an effect on task performance by either promoting system
over-reliance (too many factors) or self-reliance (too few factors) [50]. Work by Lim et al. [51] shows that rule-based explanations help users understand system behavior, especially if those rules explain why the system behaves in a
certain way as opposed to why it does not behave in a different (expected) way. Studies such as these tend to evaluate
either rules or examples, depending on the research field (e.g. recommender system explanations tend to be example-based)
but few compare rules with examples.
3. Recommendations for XAI user evaluations
As discussed in Section 2.1, user evaluations play an invaluable role in XAI but are often omitted or of insufficient quality.
Our main contribution is a thorough evaluation of rule-based and example-based contrastive explanations. In addition, we
believe that the experience and lessons learned in designing this evaluation can be valuable for other researchers. Especially
researchers in the field of XAI who are less familiar with user evaluations can benefit from guidance in the design of user
studies incorporating knowledge from different disciplines. To that end, we propose three sets of recommendations with
practical methods to help improve user evaluations. An overview is provided in Fig. 1.
3.1. R1: Constructs and relations
As stated in Section 2.1, the field of XAI often deals with ambiguously defined concepts such as ‘understanding’. We
believe that this hinders the creation and replication of XAI user evaluations and their results. Through clear definitions and
motivation, the contribution of the evaluation becomes more apparent. This also aids other researchers to extend on the
results. We provide three practical recommendations to clarify the evaluated constructs and their relations.
Our first recommendation is to clearly define the intended purposes of an explanation in the form of a construct. A
construct is either the intended purpose, an intermediate requirement for the purpose or a potential confound to your
purpose. Constructs form the basis of the scientific theory underlying XAI methods and user evaluations. By defining a
construct, it becomes easier to develop measurements. Second, we recommend clearly defining the relations expected between the constructs. A concrete and visual way to do so is through a Causal Diagram, which presents the expected causal relations between constructs [52]. These relations form your hypotheses and ensure they are formulated in terms of
your constructs. Clearly stating hypotheses allows other researchers to critically reflect on the underlying theory assumed,
proved or falsified with the evaluation. It offers insight in how constructs are assumed to be related and how the results
support or contradict these relations.
Our final recommendation regarding constructs is to adopt existing theories, such as from philosophy, (cognitive) psychol-
ogy and from human-computer interaction (see [2,6] for an overview). The former provides construct definitions, whereas the latter two provide theories of human-human and human-computer explanations. These three recommendations, to define constructs and their relations and to ground them in other research disciplines, can contribute to more valid and reliable
user evaluations. In addition, this practice allows results to be meaningful even if hypotheses are rejected, as they falsify a
scientific theory that may have been accepted as true.
3.2. R2: Use case and experimental context
The second set of recommendations regards the experimental context, including the use case. The use case determines
the task, the participants that can and should be used, the mode of the interaction, the communication that takes place and
the information available to the user [53]. As [4] already stated, the selected use case has a large effect on the conclusions
that can be drawn and the extent to which they can be generalized. Also, the use case does not necessarily need to be of
high fidelity, as a low fidelity allows for more experimental control and a potentially more valid and reliable evaluation [54].
We recommend taking these aspects into account when determining the use case and reflecting on the choices made when interpreting the results of the user evaluation. This improves both the validity and reliability of the evaluation. A concrete way to structure the choice of a use case is to follow the taxonomy provided by [4] (see Section 2.1) or a similar one.
The second recommendation concerns the sample of participants selected, as this choice determines the initial knowl-
edge, experience, beliefs, opinions and biases the users have. Whether participants are university students, domain experts
or recruited online through platforms such as Mechanical Turk, the characteristics of the group will have an effect on the
results. The choice of population should be governed by the purpose of the evaluation. For example, our evaluation was per-
formed with healthy participants rather than diabetes patients, as the latter tend to vary in their diabetes knowledge and
suffer from misconceptions [55]. These factors can interfere in an exploratory study such as ours, in which the findings
are not domain specific. Hence, we recommend investing in both understanding the use case domain and reflecting on the
intended purpose of the evaluation. These considerations should be consolidated in inclusion criteria to ensure that the
results are meaningful with respect to the study’s aim.
Our final recommendation related to the context considers the experimental setting and surroundings, as these may
affect the quality and generalizability of the results. An online setting may provide a large quantity of readily available
participants, but the results are often of ambiguous quality (see [56] for a review). If circumstances allow, we recommend using a controlled setting (e.g. a room with no distractions, or a use-case-specific environment). This allows for valuable
interaction with participants while reducing potential confounds that threaten the evaluation’s reliability and validity.
3.3. R3: Measurements
Numerous measurements exist for computational experiments on suggested XAI methods (for example, fidelity [57],
sensitivity [58] and consistency [59]). However, there is a lack of validated measurements for user evaluations [7]. Hence,
our third group of recommendations regards the type of measurement to use for the operationalization of the constructs.
We identify two main measurement types useful for XAI user evaluations: self-reported measures and behavioral measures.
Self-reported measures are subjective and are often used in XAI user evaluations. They provide insights into users’ conscious
thoughts, opinions and perceptions. We recommend the use of self-reported measures for subjective constructs (e.g. per-
ceived understanding), but also recommend a critical perspective on whether the measures indeed address the intended
constructs. Behavioral measures have a more observational nature and are used to measure actual behavioral effects. We
recommend their usage for objectively measuring constructs such as understanding and task performance. Importantly how-
ever, such measures often only measure one aspect of behavior. Ideally, a combination of both measurement types should be
used to assess effects on both the user’s perception and behavior. In this way, a complete perspective on a construct can be
obtained. In practice, some constructs lend themselves better to self-reported measurements, for example a user’s perception of trust or understanding. Other constructs are more suitable for behavioral measurements, such as task performance,
simulatability, predictability, and persuasive power.
Furthermore, we recommend measuring explanation effects implicitly rather than explicitly. When participants are
not aware of the evaluation’s purpose, their responses may be more genuine. Also, when measuring understanding or
similar constructs, the participant’s explicit focus on the explanations may cause skewed results not present in a real world
application. This leads to our third recommendation to measure potential biases. Biases can regard the participant’s overall
perspective on AI, the use case, decision-making or similar. However, biases can also be introduced by the researchers
themselves. For example, one XAI method can be presented more attractively or reliably than another. It can be difficult
to prevent such biases. One way to mitigate them is to design how the explanations are presented, i.e. the explanation form, in an iterative manner with expert reviews and pilots. In addition, one can measure these biases nonetheless if
possible and reasonable. For example, a usability questionnaire can be used to measure potential differences between the
way explanations are presented in the different conditions. For our study we designed the explanations iteratively and
verified that the chosen form for each explanation type did not differ significantly in the perception of the participants.
4. The use case: diabetes self-management
In this study, we focused on personalized healthcare, an area in which machine learning is promising and explanations
are essential for realistic applications [60]. Our use case is that of assisting patients with diabetes mellitus type 1 (DMT1)
with personalized insulin advice. DMT1 is a chronic autoimmune disorder in which glucose homeostasis is disturbed and
intake of the hormone insulin is required to balance glucose levels. Since blood glucose levels are influenced by both
environmental and personal factors, it is often difficult to find the adequate dose of insulin that stabilizes blood glucose
levels [61]. Therefore, personalized advice systems can be a promising tool in DMT1 management to improve quality of life
and mitigate long-term health risks.
In our context, a DMT1 patient finds it difficult to find the optimal insulin dose for a meal in a given situation. On the
patient’s request, a fictitious intelligent DSS provides assistance with the insulin intake before a meal. Based on different
internal and external factors (e.g. hours of sleep, temperature, past activity, etc.), the system may advise to take a normal
insulin dose, or a higher or lower dose than usual. For example, the system could advise a lower insulin dose based on
the current temperature. The factors that were used in the evaluation are realistic, and were based on Bosch [62] and an
interview with a DMT1 patient.
In this use case, both the advice and the explanations are simplified. This study therefore falls under the human grounded
evaluation category of Doshi-Velez and Kim [4]: a simplified task of a real-world application. The advice is binary (higher or
lower), whereas in reality one would expect either a specific dose or a range of suggested doses. This simplification allowed
us to evaluate with novice users (see Section 6.3), as we could limit our explanation to the effects of a too low or too high
dosage without going into detail about effects of specific doses. Furthermore, this prevented the unnecessary complication
of having multiple potential foils for our contrastive explanations. Although the selection of the foil, either by system or
user, is an interesting topic regarding contrastive explanations, it was deemed out of scope for this evaluation. The second
simplification was that the explanations were not generated using a specific XAI method, but designed by the researchers
instead. Several design iterations were conducted based on feedback from XAI researchers and interaction designers to
remove potential design choices in the explanation form that could cause one explanation to be favored over another. Since
the explanations were not generated by a specific XAI method, we were able to explore the effects of more prototypical rule-
and example-based explanations inspired by multiple XAI methods that generate similar explanations (see Section 2.2).
There are several limitations caused by these two simplifications. First, we imply that the system can automatically select
the appropriate foil for contrastive explanations. Second, we assume that the XAI method is able to identify only the most
relevant factors to explain a decision. Although this assumes a potentially complex requirement for the XAI method, it is a
reasonable assumption as humans prefer a selective explanation over a complete one [2].
5. Constructs, expected relations and measurements
The user evaluation focused on three constructs: system understanding, persuasive power, and task performance. Al-
though an important goal of offering explanations is to allow users to arrive at the appropriate level of trust in the system
[63,7], the construct of trust is difficult to define and measure [18]. As such, our focus was on constructs influencing trust
that were more suitable to translate into measurable constructs; the intermediate construct of system understanding and
the final construct of task performance of the entire user-system combination. The persuasive power of an explanation was
also measured, as an explanation might cause over-trust in a user: believing that the system is correct while it is not, without having a proper system understanding. As such, the persuasive power of an explanation confounds the effect of understanding on task performance.
Both contrastive rule- and example-based explanations were compared to each other with no explanation as a control. Our
hypotheses are visualized in a Causal Diagram depicted in Fig. 2 [52]. From rule-based explanations we expected participants
to gain a better understanding of when and how the system arrives at a specific advice. Contrastive rule-based explanations
Fig. 2. Our theory, depicted as a Causal Diagram. It describes the expected effects of contrastive rule- and example-based explanations on the constructs of system understanding, persuasive power and task performance. The solid green arrows depict expected positive effects and the dashed red arrow depicts a negative effect. The arrow thickness depicts the size of the expected effect. The opaque grey boxes are the measurements that were performed for that construct, divided into behavioral and self-reported measurements.
explicate the system’s decision boundary between fact and foil and we expected the participants to recall and apply this in-
formation. Second, we expected that contrastive example-based explanations persuade participants to follow the advice more
often. We believe that examples raise confidence in the correctness of an advice as they illustrate past good performance
of the system. Third, we hypothesized that both system understanding and persuasive power have an effect on task perfor-
mance. Whereas this effect was expected to be positive for system understanding, persuasive power was expected to affect
task performance negatively in case a system’s advice is not always correct. This follows the argumentation that persuasive
explanations can cause harm as they may convince users to over-trust a system [64]. Note that we conducted two separate
experiments to measure the effects of an explanation type on understanding and persuasion. This allowed us to measure the
effect of each construct separately on task performance, but not their combined effect (e.g. whether sufficient understanding
can counteract the persuasiveness of an explanation).
The construct of understanding was measured with two behavioral measurements and one self-reported measurement.
The first behavioral measurement assessed the participant’s capacity to correctly identify the factor in a situation that was decisive for the system’s advice. This measured to what extent the participant recalled what factor the system believed to be im-
portant for a specific advice and situation. Second, we measured the participant’s ability to accurately predict the advice in
novel situations. This tested whether the participant obtained a mental model of the system that was sufficiently accurate to predict its behavior in novel situations. The self-reported measurement tested the participant’s perceived system
understanding. This provided insight into whether participants over- or underestimated their understanding of the system
compared to what their behavior told us.
Persuasive power of the system’s advice was measured with one behavioral measurement, namely the number of times
participants copied the advice, independent of its correctness. If participants who received an explanation followed the advice more often than participants without an explanation, we attributed this to the persuasiveness of the explanation.
Task performance was measured as the number of correct decisions, a behavioral measurement, and perception of predicting
advice correctness, a self-reported measurement. We assumed a system that was not 100% accurate,
meaning that it also made incorrect decisions. Therefore, the number of correct decisions made by the participant while
aided by the system could be used to measure task performance. The self-reported measure allowed us to measure how
well participants believed they could predict the correctness of the system advice.
Finally, two self-reported measurements were added to check for potential confounds. The first was a brief usability
questionnaire addressing issues such as readability and the organization of information. This could reveal whether one ex-
planation style was designed and visualized better than the other, which would be a confounding variable. The second,
perceived system accuracy, measured how accurate the participant thought the system was. This could help identify a poten-
tial over- or underestimation of the usefulness of the system, which could have affected to what extent participants attended
to the system’s advice and explanation.
The combination of self-reported and behavioral measurements enabled us to draw relations between our observations
and a participant’s own perception. Finally, by measuring a single construct with different measurements (known as trian-
gulation [65]) we could identify and potentially overcome biases and other weaknesses in our measurements.
Fig. 3. The contrastive rule-based (above) and example-based (below) explanation styles. Participants could view the situation, advice and explanation indefinitely.
6. Methods
In this section we describe the operationalization of our user evaluation in two separate experiments in the context
of DSS advice in DMT1 self-management (see Section 4). Experiment I focused on the construct of system understanding.
Experiment II focused on the constructs of persuasive power and task performance. The explanation style (contrastive rule-
based, contrastive example-based or no explanation) was the independent variable in both experiments and was tested
between-subjects. See Fig. 3 for an example of each explanation style.
The experimental procedure was similar in both experiments:
1. Introduction. Participants were informed about the study, use-case and task, as well as presented with a brief narrative
about a DMT1 patient for immersive purposes.
2. Demographics questionnaire. Age and education level were recorded to identify whether the population sample was
sufficiently broad.
3. Pre-questionnaire. Participants were questioned on DMT1 knowledge to assess if DMT1 was sufficiently introduced and
to check our assumption that participants had no additional domain knowledge.
Fig. 4. A schematic overview of the learning (left) and testing (right) block in Experiment I.
Table 1
An overview of the nine factors that played a role in the experiment. For each factor, its influence on the correct insulin dose is shown, as well as the system threshold for that influence. The thresholds differed between the two experiments and the set of rules of the first experiment was defined as the ground truth. Three factors served as fillers and had no influence.

| Factor | Insulin dose | Exp. I rules | Exp. II rules |
|---|---|---|---|
| Planned alcohol intake | Lower dose | >1 unit | >1 unit |
| Planned physical exercise | Lower dose | >17 minutes | >20 minutes |
| Physical health | Lower dose | Diarrhoea & Nausea | Diarrhoea & Nausea |
| Hours slept | Higher dose | <6 hours | <6 hours |
| Environmental temperature | Higher dose | >26 °C | >31 °C |
| Anticipated tension level | Higher dose | >3 (a little tense) | >4 (quite tense) |
| Water intake so far | - | - | - |
| Planned caffeine intake | - | - | - |
| Mood | - | - | - |
4. Learning block. Multiple stimuli were presented, accompanied by either the example- or rule-based explanations, or
no explanations (control group).
5. Testing block. Several trials followed to conduct the behavioral measurements (advice prediction and decisive factor iden-
tification in Experiment I, the number of times advice copied and number of correct decisions in Experiment II).
6. Post-questionnaire. A questionnaire was completed to obtain self-reported measurements (perceived system understanding
in Experiment I and perceived prediction of advice correctness in Experiment II).
7. Usability questionnaire. Participants filled out a usability questionnaire to identify potential interface related confounds.
8. Control questionnaire. The experimental procedure concluded with several questions to assess whether the purpose of
the study was suspected and to measure perceived system accuracy to identify over- or under-trust in the system.
6.1. Experiment I: System understanding
The purpose of Experiment I was to measure the effects of rule-based and example-based explanations on system understanding, compared to each other and to the control group with no explanations. See Fig. 4 for an overview of both the
learning and testing blocks. The learning block consisted of 18 randomly ordered trials, each trial describing a single sit-
uation with three factors and values from Table 1. The situation description was followed by the system’s advice, in turn
followed by an explanation (in the experimental groups). Finally, the participant was asked to make a decision on admin-
istering a higher or lower insulin dose than usual. This block served only to familiarize the participant with the system’s
advice and its explanation and to learn when and why a certain advice was given. Participants were not instructed to focus
on the explanations in the learning block, nor were they informed of the purpose of the two blocks.
In the testing block, two behavioral measures were used to test the construct of understanding: advice prediction and
decisive factor identification. The testing block consisted of 30 randomized trials, each with a novel situation description.
Each description was followed by the question of what advice the participant thought the system would give. This formed the
Fig. 5. A schematic overview of the learning (left) and testing (right) block in Experiment II.
measurement of advice prediction. The measurement of decisive factor identification was formed by the subsequent question asking participants to select the single factor from the situation description that they believed was decisive for the predicted system advice.
A third, self-reported measurement was conducted in the post-questionnaire, which contained an eight-item question-
naire based on a 7-point Likert scale. These items formed the measurement of perceived system understanding. The questions
were asked without mentioning the term explanation and simply addressed ‘system output’. Eight items were deemed necessary to obtain a measurement less dependent on the formulation of any single item.
6.2. Experiment II: Persuasive power and task performance
The purpose of Experiment II was to measure the effects of rule-based and example-based explanations on persuasive
power and task performance, and to compare these to each other and to the control group with no explanation. Fig. 5
provides an overview of the learning and testing blocks of this experiment. The learning block was similar to that of the
first experiment: a situation was shown, containing three factors from Table 1. In the experimental groups, the situation was
followed by an advice and explanation. Next, the participant was asked to make a decision on the insulin dose. After this
point, the learning block differed from the learning block in the first experiment: the participant’s decision was followed
with feedback on its correctness. In 12 of the 18 randomly ordered trials of this learning block (66%), the system’s advice
was correct. In the six other trials, the advice was incorrect. Through this feedback, participants learned that the system’s
advice could be incorrect and in which situations. Instead of following the ground truth rule set (from Experiment I), this
system followed a second, partially correct set of rules, as shown in Table 1.
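To make the two rule sets concrete, the following sketch encodes the per-factor thresholds of Table 1 as an advice function. How the system resolves situations in which several factors cross their threshold is not described in the text, so the first-matching-rule behaviour and the factor encodings are our own assumptions:

```python
# Sketch of the advice rules in Table 1. The "first matching rule" behaviour
# and the factor names/encodings are assumptions made for illustration.

EXP_I_RULES = [  # ground-truth rule set (Experiment I)
    ("alcohol_units",    lambda v: v > 1,  "lower"),
    ("exercise_minutes", lambda v: v > 17, "lower"),
    ("physical_health",  lambda v: v == "diarrhoea & nausea", "lower"),
    ("hours_slept",      lambda v: v < 6,  "higher"),
    ("temperature_c",    lambda v: v > 26, "higher"),
    ("tension_level",    lambda v: v > 3,  "higher"),
]

EXP_II_RULES = [  # partially correct rule set followed by the system in Experiment II
    ("alcohol_units",    lambda v: v > 1,  "lower"),
    ("exercise_minutes", lambda v: v > 20, "lower"),
    ("physical_health",  lambda v: v == "diarrhoea & nausea", "lower"),
    ("hours_slept",      lambda v: v < 6,  "higher"),
    ("temperature_c",    lambda v: v > 31, "higher"),
    ("tension_level",    lambda v: v > 4,  "higher"),
]

def advise(situation, rules):
    """Return 'higher' or 'lower' for the first factor whose threshold is met."""
    for factor, condition, dose in rules:
        if factor in situation and condition(situation[factor]):
            return dose
    return None  # no decisive factor among the presented ones

# A situation contains three factors, as in the experiments.
situation = {"temperature_c": 28, "hours_slept": 7, "mood": "neutral"}
print(advise(situation, EXP_I_RULES))   # 'higher' (28 exceeds the 26 degree threshold)
print(advise(situation, EXP_II_RULES))  # None (28 does not exceed the 31 degree threshold)
```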
The testing block contained 30 trials, also presented in random order, in which a presented situation was followed by
the system’s advice and explanation. Next, participants had to choose which insulin dose was correct based on the system’s
advice, explanation and gained knowledge of when the system is incorrect. Persuasive power was operationalized as the
number of times a participant followed the advice, independent of whether it was correct or not. Task performance was
represented by the number of times a correct decision was made. The former reflected how persuasive the advice and explanation were, even when participants had experienced system errors. The latter reflected how well participants were able to understand when the system makes errors and to compensate accordingly in their decisions.
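As a small illustration of these two operationalizations, the sketch below computes both measures from a log of test trials; the trial record format is hypothetical:

```python
# Sketch: persuasive power = fraction of trials in which the participant copied
# the system's advice (regardless of correctness); task performance = fraction
# of trials in which the participant's decision was actually correct.
# The trial tuples (system_advice, correct_dose, participant_decision) are made up.

trials = [
    ("higher", "higher", "higher"),
    ("lower",  "higher", "lower"),   # system wrong, participant followed it anyway
    ("lower",  "lower",  "lower"),
]

followed = sum(decision == advice for advice, _, decision in trials)
correct  = sum(decision == truth  for _, truth, decision in trials)

persuasive_power = followed / len(trials)   # here 3/3 = 1.0
task_performance = correct / len(trials)    # here 2/3, approx. 0.67
print(persuasive_power, task_performance)
```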
Also in this experiment, a self-reported measurement with eight 7-point Likert scale questions was performed. It mea-
sured the participant’s subjective sense of their ability to estimate when the system was correct.
6.3. Participants
In Experiment I, 45 participants took part (21 female, 24 male), aged between 18 and 64 years old (M = 44.2 ± 16.8). Their education levels varied from lower vocational to university education. In Experiment II, 45 different participants took part (31 female, 14 male), aged between 18 and 61 years old (M = 36.5 ± 14.5). Their
education levels varied from secondary vocational to university education. Participants were recruited from a participant
database at TNO Soesterberg (NL) as well as via advertisements in Utrecht University (NL) buildings and on social media.
Participants received a compensation of 20 euros, and their travel costs were reimbursed. Both samples aimed to represent
the entire Dutch population and as such the entire range of potential DMT1 patients, hence the wide age and educational
ranges.
The inclusion criteria were as follows: not diabetic, no close relatives or friends with diabetes, and no extensive knowl-
edge of diabetes through work or education. General criteria were being a native Dutch speaker, having good or corrected eyesight, and
basic experience using computers. These inclusion criteria were verified in the pre-questionnaire. A total of 16 participants
reported a close relative or friend with diabetes and one participant had experience with diabetes through work, despite
clear inclusion instructions beforehand. After careful inspection, none of them were excluded, because their answers to the diabetes questions in the pre-questionnaire were not more accurate or elaborate than those of other participants. From this we concluded
that their knowledge of diabetes was unlikely to influence the results.
7. Data analysis
Statistical tests were conducted using SPSS Statistics 22. An alpha level of 0.05 was used for all statistical tests.
The data from the behavioral measures in Experiment I were analyzed using a one-way Multivariate Analysis of Variance
(MANOVA) with explanation style (rule-based, example-based or no explanation) as the independent between-subjects variable
and advice prediction and decisive factor identification as dependent variables. The reason for a one-way MANOVA was the
multivariate operationalization of a single construct, understanding [66]. Cronbach’s Alpha was used to assess the internal
consistency of the self-reported measurement for perceived system understanding from the post-questionnaire. Subsequently,
a one-way Analysis of Variance (ANOVA) was conducted with the mean rating on this questionnaire as dependent variable
and the explanation style as independent variable. Finally, the relation between the two behavioral and the self-reported
measurements was examined with Pearson’s product-moment correlations.
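For reference, the internal consistency reported here is the standard Cronbach's alpha over the k questionnaire items (k = 8 in this study), where the item variances are compared with the variance of the participants' total scores:

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{i}}{\sigma^{2}_{t}}\right)
```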
For Experiment II, two one-way ANOVAs were performed. The first ANOVA had the explanation style (rule-based, example-
based or no explanation) as independent variable and the number of times the advice was copied as dependent variable.
The second ANOVA also had explanation style as independent variable, but the number of correct decisions as dependent
variable. The internal consistency of the self-reported measurement of perceived prediction of advice correctness from the
post-questionnaire was assessed with Cronbach’s Alpha and analyzed with a one-way ANOVA. Explanation style was the
independent and the mean rating on the questionnaire the dependent variable. The presence of correlations between the
behavioral and the self-reported measurements was assessed with Pearson’s product-moment correlations. Detected outliers
were excluded from the analysis.
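Although the analyses were run in SPSS, an equivalent pipeline can be sketched in Python. The synthetic data, column names and layout below are placeholders and not the study's actual data:

```python
# Sketch of the Experiment II analyses (one-way ANOVA, Cronbach's alpha, Pearson r)
# in Python rather than SPSS. All data here are randomly generated placeholders.

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
n = 45  # participants, 15 per explanation style
df = pd.DataFrame({
    "style": np.repeat(["rule", "example", "none"], 15),
    "advice_copied": rng.integers(15, 31, n),      # out of 30 test trials
    "correct_decisions": rng.integers(15, 31, n),  # out of 30 test trials
})
for i in range(1, 9):  # eight 7-point Likert items of the post-questionnaire
    df[f"item_{i}"] = rng.integers(1, 8, n)

# One-way ANOVAs with explanation style as the between-subjects factor.
for dv in ["advice_copied", "correct_decisions"]:
    model = ols(f"{dv} ~ C(style)", data=df).fit()
    print(dv)
    print(sm.stats.anova_lm(model, typ=2))

# Cronbach's alpha over the eight Likert items.
items = df[[f"item_{i}" for i in range(1, 9)]]
k = items.shape[1]
alpha = (k / (k - 1)) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))
print("Cronbach's alpha:", round(alpha, 3))

# Pearson correlation between a behavioral and the self-reported measure.
r, p = stats.pearsonr(df["correct_decisions"], items.mean(axis=1))
print(f"r = {r:.3f}, p = {p:.3f}")
```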
8. Results
8.1. Experiment I: System understanding
The purpose of Experiment I was to measure gained system understanding when a system provides a rule- or example-
based explanation, compared to no explanation. This was measured with two behavioral measures and one self-reported
measure.
Fig. 6 shows the results on the two behavioral measures: correct advice prediction in novel situations and correct identification of the system’s decisive factor. A one-way MANOVA with Wilks’ lambda indicated a significant main effect of explanation style on both measurements (F(4, 82) = 6.675, p < 0.001, Λ = .450, ηp² = .246). Further analysis revealed a significant effect for explanation style on factor identification (F(2, 42) = 14.816, p < 0.001, ηp² = .414), but not for advice prediction (F(2, 42) = 14.816, p = .264, ηp² = .414). One assumption of a one-way MANOVA was violated, as the linear
relationships between the two dependent variables within each explanation style were weak. This was indicated by Pearson’s product-moment correlations for the rule-based (r = .487, p = .066), example-based (r = −.179, p = .522) and no explanation (r = .134, p = .636) groups. Some caution is needed in interpreting these results, as this lack of significant correlations
shows a potential lack of statistical power. Further post-hoc analysis showed a significant difference in factor identification
in favor of rule-based explanations compared to example-based explanations and no explanations (p < 0.001). No significant difference between example-based explanations and no explanation was found (p = .796).
Fig. 7 shows the results on the self-reported measure of system understanding. The consistency between the different items in the measure was very high, as reflected by Cronbach’s alpha (α = .904). The mean rating over all eight items was used as the participant’s subjective rating of system understanding. A one-way ANOVA showed a significant main effect of explanation style on this rating (F(2, 41) = 7.222, p = .002, ηp² = .261). Two assumptions of a one-way ANOVA were
violated. First, the rule-based explanations group had one outlier, whose exclusion did not affect the analysis in any way. The results after removal of this outlier are reported. Second, Levene’s test was significant (p = .017), signaling inequality between group variances. However, ANOVA is robust against this violation of variance homogeneity with equal group
sizes [67,68]. Further post-hoc tests revealed that only rule-based explanations caused a significantly higher self-reported
understanding compared to no explanations (p = .001). No significant difference was found for example-based explanations compared with no explanations (p = .283) or with rule-based explanations (p = .072).
Fig. 6. Bar plot of the mean percentages of correct prediction of the system’s advice and correct identification of the decisive factor for that advice. Values
are relative to the total of 30 randomized trials in Experiment I. The error bars represent a 95% confidence interval. Note: *** p < 0.001.
Fig. 7. Bar plot of the mean self-reported system understanding. All values are on a 7-point Likert scale and error bars represent a 95% confidence interval.
Note: ** p < 0.01.
Finally, Fig. 8 shows a scatter plot between both behavioral measures and the self-reported measure. Pearson’s product-moment analysis revealed no significant correlations between self-reported understanding and advice prediction (r = −.007, p = .965), neither within the rule-based explanation group (r = −.462, p = .129), the example-based explanation group (r = −.098, p = .729), nor the no explanation group (r = .001, p = .996). Similar results were found for the correlation between self-reported understanding and factor identification (r = .192, p = .211) and for the separate groups of rule-based explanations (r = −.124, p = .673), example-based explanations (r = .057, p = .840) and no explanations (r = −.394, p = .146).
8.2. Experiment II: Persuasive power and task performance
The purpose of Experiment II was to measure a participant’s ability to use a decision support system appropriately when
it provides a rule- or example-based explanation, compared with no explanation. This was measured with one behavioral
and one self-reported measurement. In addition, we measured the persuasiveness of the system for each explanation style,
compared to no explanations. This was assessed with one behavioral measure.
Fig. 8. Scatter plots displaying the relation between advice prediction (left) and decisive factor identification (right) with self-reported understanding.
Outliers are circled.
Fig. 9. Bar plot displaying task performance (the mean percentage of correct decisions) and persuasive power (the mean percentage of decisions following
the system’s advice independent of correctness). Error bars represent a 95% confidence interval. Note: * p < 0.05, *** p < 0.001.
Fig. 9 shows the results of the behavioral measure for task performance, as reflected by the user’s decision accuracy. A one-way ANOVA showed no significant differences (F(2, 41) = 1.716, p = .192, ηp² = .077). Two violations of the ANOVA assumptions were discovered. There was one outlier in the example-based explanations group, with 93.3% accuracy (1 error). Removal of the outlier did not affect the analysis. Levene’s test showed there was no homogeneity of variances (p = .007); however, ANOVA is believed to be robust against this under equal group sizes [67,68].
Fig. 9 shows the results of the behavioral measure for persuasiveness, i.e. the number of times the system’s advice was followed. Note that in Experiment II the system’s accuracy was 66.7%. Thus, following the advice in a higher percentage of cases denotes an adverse amount of persuasion. A one-way ANOVA showed that explanation style had a significant effect on following the system’s advice (F(2, 41) = 11.593, p < .001, ηp² = .361). Further analysis revealed that participants with no explanation followed the system’s advice significantly less often than those with rule-based (p = .049) and example-based explanations (p < .001). However, there was no significant difference between the two explanation styles (p = .068). One outlier violated the assumptions of the ANOVA: one participant in the rule-based explanation group followed the system’s advice only 33.3% of the time. This participant’s exclusion affected the outcomes of the ANOVA, and the results after exclusion are reported.
Fig. 10 displays the self-reported capacity to predict correctness, operationalized by a rating of how well participants thought they were able to predict when the system’s advice was correct or not. The consistency of the eight 7-point Likert scale questions was high according to Cronbach’s Alpha (α = .820). Therefore, we took the mean rating of all questions as an estimate of participants’ performance estimation. A one-way ANOVA was performed, revealing no significant differences (F(2, 41) = 2.848, p = .069, ηp² = .122). One outlier from the rule-based explanation group was found, its