International Journal of Human-Computer Studies
Interpretable confidence measures for decision support systems
Jasper van der Waa, Tjeerd Schoonderwoerd, Jurriaan van Diggelen, Mark Neerincx
TNO, Soesterberg, Kampweg 55, the Netherlands
Technical University of Delft, Delft, Mekelweg 5, the Netherlands
Keywords: Machine learning; Decision support systems; Explainable AI; Artificial intelligence; User study; Interpretable machine learning; Trust calibration
Decision support systems (DSS) have improved significantly but have become more complex due to recent advances in Artificial Intelligence. Current XAI methods generate explanations of model behaviour to facilitate a user's understanding, which incites trust in the DSS. However, little focus has been placed on developing methods that establish and convey a system's confidence in the advice that it provides. This paper presents a framework for Interpretable Confidence Measures (ICMs). We investigate which properties of a confidence measure are desirable and why, and how an ICM is interpreted by users. We evaluate these ideas on several data sets and in user experiments. The presented framework defines four properties: 1) accuracy or soundness, 2) transparency, 3) explainability and 4) predictability. These characteristics are realized by a case-based reasoning approach to confidence estimation. Example ICMs are proposed for, and evaluated on, multiple data sets. In addition, the ICM was evaluated in two user experiments. The results show that an ICM can be as accurate as other confidence measures while behaving in a more predictable manner. Also, ICM's underlying idea of case-based reasoning enables generating explanations about the computation of the confidence value, and facilitates users' understanding of the algorithm.
1. Introduction
The successes in Artificial Intelligence (AI), and Machine Learning (ML) in particular, have caused a boost in the accuracy and application of intelligent decision support systems (DSS). They are used in lifestyle management (Wu et al., 2017), management decisions (Bose and Mahapatra, 2001), genetics (Libbrecht and Noble, 2015), national security (Pita et al., 2011), and in the prevention of environmental disasters in the maritime domain (van Diggelen et al., 2017). In these high-risk domains, a DSS can be beneficial as it can reduce the workload of a user and increase task performance. However, the complexity of current DSS (e.g. based on Deep Learning) impedes users' understanding of a given advice, often resulting in too much or too little trust in the system, which can have catastrophic consequences (Burrell, 2016; Cabitza et al., 2017).
The field of Explainable AI (XAI) researches how a DSS can improve
a user’s understanding of the system by generating explanations about
its behaviour (Guidotti et al., 2018; Kim et al., 2015; Miller, 2018b;
Miller et al., 2017; Ridgeway et al., 1998). More specifically, the goal of
these explanations is to increase understanding of the system’s rationale
and certainty of an advice that it provides (Holzinger et al., 2019a;
2019b; Miller, 2018a). It is hypothesized that the understanding that a
user gains from these explanations facilitates adequate use of the DSS
(Hoffman et al., 2018), and calibrates the user’s trust in the system
(Cohen et al., 1998; Fitzhugh et al., 2011; Hoffman et al., 2013).
Although understanding of the system can help a user decide when to follow the advice of a DSS, it is often overlooked that a confidence measure can achieve the same effect (Papadopoulos et al., 2001). In this paper, we define a confidence measure as a measure that provides an expectation that an advice will prove to be correct (or incorrect). To help develop such measures, we introduce the Interpretable Confidence Measure (ICM) framework. The ICM framework assumes that a confidence measure should 1) be accurate, 2) be able to explain a single confidence value, 3) use a transparent algorithm, and 4) provide confidence values that are predictable for humans (see Fig. 1).
To illustrate the ICM framework, we define an example ICM. We evaluated its accuracy, robustness and genericity on several classification tasks with different machine learning models. In addition, we applied the concept of an ICM to the use case of Dynamic Positioning (DP) within the maritime domain (van Diggelen et al., 2017). Here, a human operator supervises a ship's auto-pilot while receiving assistance from a DSS that provides a warning when human intervention is
deemed necessary (e.g. based on weather conditions). It can be catastrophic if the operator fails to intervene in time. For example, an oil tanker might spill large amounts of oil into the ocean because the operator failed to intervene to prevent the ship from rupturing its connection to an oil rig. This use case provided a realistic dataset to evaluate our example ICM, as well as a context for a qualitative usability experiment with these operators. In this experiment, we evaluated the transparency and explainability properties of the ICM framework. To further substantiate these results, we performed a quantitative online user experiment in the context of self-driving cars.

Received 20 September 2019; Received in revised form 26 May 2020; Accepted 6 June 2020
Corresponding author at: TNO, Soesterberg, Kampweg 55, the Netherlands. E-mail address: (J.v.d. Waa).
International Journal of Human-Computer Studies 144 (2020) 102493. Available online 09 June 2020.
1071-5819/ © 2020 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license.
We provide the ICM framework in Section 3, describe our example ICMs in Section 4, our evaluations on the data sets in Section 4.1, and the two user experiments in Sections 5 and 6. The next section presents related work in the field of XAI and confidence measures in Machine Learning, which underlies many current DSS.
2. Related work
Explainable AI (XAI) researches how we can improve the user's understanding of a DSS to reach an appropriate level of trust in its advice (Herman, 2017; Kim et al., 2015; Miller, 2018b; Miller et al., 2017; Ridgeway et al., 1998), for example by allowing users to detect biases (Doshi-Velez and Kim, 2017; Gilpin et al., 2018; Goodman and Flaxman, 2016; Zhou and Chen, 2018). Some XAI research focuses on these aspects from a societal perspective, trying to identify how intelligent systems should be implemented, when they should be used, and who should regulate them (Doshi-Velez and Kim, 2017; Lipton, 2016; Zhou and Chen, 2018; Zliobaite, 2015). Other researchers approach the field from a methodological perspective, and aim to develop methods that solve the potential issues of applying intelligent systems in society; see for example the overview of methods from Guidotti et al. (2018).
To generate explanations, many XAI methods use a meta-model that describes the actual system's behaviour in a limited input space surrounding the data point to be explained (Ribeiro et al., 2016). The meta-model only has to be accurate in this local space and can thus be less complex and more explainable than the actual system. A disadvantage of these approaches is that the meaningfulness of the explanation depends on the size of the local space and the brittleness of the used meta-model. When the space is too small, the explanation cannot be generalized; when it is too large, the explanation may lack fidelity. The advantage is that these methods can be applied to most systems (i.e. they are system- or model-agnostic). A second advantage is that the fidelity of explanations can be measured, since the meta-model's ground truth is the output of the system, which is readily available. This can be exploited to measure a meta-model's accuracy through data perturbation. In our proposed ICM framework, we apply the idea of system-agnostic local meta-models to obtain an interpretable confidence measure, not a post-hoc explanation of an output.
Confidence measures allow a DSS to convey when an advice is trustworthy (Papadopoulos et al., 2001). However, a user's commitment to follow a DSS' advice is linked to his or her own confidence and that conveyed by the DSS (Landsbergen et al., 1997). A confident user confronted with a low system confidence loses confidence in him- or herself, and vice versa. The work from Ye and Johnson (1995) and Waterman (1986) shows that this can be mitigated by explaining the DSS' confidence value using a transparent algorithm. The work from Walley (1996) shows that users tend to change their confidence when evidence for a correct or incorrect decision is gained or lost. Users expect the same predictable behaviour from a DSS' confidence measure. Hence, it should not only be transparent with explainable values, but also behave predictably for humans.
Current DSS are often based on Machine Learning (ML). Different categories of confidence measures can be identified from this field; see Table 1 for an overview. The first, confusion metrics such as accuracy and the F1-score, are based on the confusion matrix. These tend to be transparent and predictable, but lack accuracy and explainability for conveying the confidence of a single advice (Foody, 2005; Labatut and Cherifi, 2011). An ML model's prediction scores, such as the SoftMax output of a Neural Network, are also common as confidence measures. They represent the model's estimated likelihood for a certain prediction (Zaragoza and d'Alché Buc, 1998). They are highly accurate, but their transparency and explainability are often low (Samek et al., 2017; Sturm et al., 2016). Furthermore, these measures tend to behave unpredictably, as small changes in a data point can cause non-monotonic increases or decreases in the confidence value (Goodfellow et al., 2014; Nguyen et al., 2015). In rescaling, such as Platt Scaling (Platt and others, 1999) or Isotonic Regression (Zadrozny and Elkan, 2001; 2002), the prediction scores are translated into more predictable and accurate values (Hao et al., 2003; Liu et al., 2004). However, these are used to enable post-processing and are not intended to be explainable or transparent (Niculescu-Mizil and Caruana, 2005). Some ML models are inherently probabilistic and output conditional probability distributions over their predictions. Examples are Naive Bayes (Rish et al., 2001), the Relevance Vector Machine (Tipping, 2000), and Neural Networks with neuron dropout (Gal and Ghahramani, 2016) or Bayesian inference (Fortunato et al., 2017; Graves, 2011; Paisley et al., 2012). Although they are accurate, they are also opaque and difficult to predict, as conditional probabilities are difficult to comprehend by humans (Evans et al., 2003; Pollatsek et al., 1987). There are efforts to make such values more explainable for specific model types; see for example Qin (2006) and Ridgeway et al. (1998). Finally, ML models are known to use voting to arrive at a confidence value (Polikar, 2006; Tóth and Pataki, 2008; Van Erp et al., 2002). Known examples are Random Forests, Decision Trees and ensembles of Decision Stumps (Stone and Veloso, 1997). These confidence values can be explained through examples (Florez-Lopez and Ramon-Jeronimo, 2015). However, their algorithmic transparency depends on the model, and their values tend to change step-wise given continuous changes to the input, making them hard to predict for humans.

Fig. 1. The four properties of an Interpretable Confidence Measure to perform effective trust calibration.

Table 1
Categories and examples of commonly used confidence measures in Machine Learning and their adherence to the four properties of an ICM.

Property     Confusion metrics       Prediction scores             Rescaling                               Probability     Voting
Accurate     -                       +                             +                                       +               +
Predictable  +                       -                             +                                       -               -
Transparent  +                       -                             -                                       -               -
Explainable  -                       -                             -                                       +               +
Example      F1-score (Foody, 2005)  Papernot and McDaniel (2018)  Platt Scaling (Platt and others, 1999)  Tipping (2000)  Random Forest (Bhattacharyya, 2013)

As can be seen in Table 1, no category is accurate, predictable, explainable and transparent in a DSS context. A likely reason is that the purpose of these measures is to convey the performance of an ML model to a developer, not the confidence of a DSS in an advice to a user. As a consequence, many of these measures are tailored to work for a specific model type or subset of model types. Only the confusion metrics are system-agnostic. In the next section we propose a system-agnostic approach to confidence measures, based on case-based reasoning, that is not only as accurate as the measures described above, but also transparent, explainable and predictable.
3. A framework for interpretable confidence measures
In this section we propose a framework to create Interpretable
Confidence Measures (ICM) that are not only accurate in their con-
fidence assessment, but whose values are predictable as well as ex-
plainable based on a transparent algorithm. The ICM framework relies
on a system-agnostic approach and performs a regression analysis with
the correctness of an advice as the regressor. It does so based on case-
based reasoning.
Case-based reasoning or learning provides a prediction by extrapolating labels of past cases to the current queried case (Atkeson et al., 1997). The basis of many case-based reasoning methods is the k-Nearest Neighbours (kNN) algorithm (Fix and Hodges Jr, 1951). This method follows a purely lazy approach (Wettschereck et al., 1997). When queried with a novel case, it selects the k most similar cases from a stored data set and assigns to the case a weighted aggregation of the neighbours' labels. The advantage of case-based learning methods is that their principal idea is closely related to that of human decision-making (Harteis and Billett, 2013; Hodgkinson et al., 2008; Schank et al., 2014). This makes such algorithms easier to understand and interpret (Freitas, 2014). In addition, they allow for example-based explanations of a single prediction (Doyle et al., 2003). These properties are exploited in the ICM framework to define a confidence measure as performing a regression analysis with case-based reasoning.
3.1. The ICM framework
In this section we formally describe the ICM framework. We assume the DSS to be a function f that assigns an advice y = f(x) to a data point x of l dimensions. It does this with a certain accuracy relative to the ground truth or label y*. An ICM goes through four steps to define the confidence value C(x): 1) an update step, 2) a selection step, 3) a separation step, and 4) a computation step. Below we discuss these steps; an overview is shown in Fig. 2.
In the first step, the update, a memory D is updated. This D forms the set of cases from which the confidence is computed. Given an update procedure u and new data-label pairs (x, y*), an ICM continuously updates this memory, D = u((x, y*), D), such that |D| = n. This ensures that D adapts to changes in the DSS over time. The initial D is initialized with a training set, but is expanded and replaced with novel pairs during DSS usage. The size of D is fixed to n and maintained by u. Examples of u can be as simple as a queue (newest in, oldest out) or based on more complex sampling methods (e.g. those that take the label and data distributions into account).
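As an illustration, the queue-style update procedure u can be sketched with a fixed-size buffer. This is our own minimal sketch; the function names are illustrative and not taken from the authors' implementation:

```python
from collections import deque

def init_memory(training_pairs, n):
    """Initialize the memory D from a training set; |D| is capped at n."""
    return deque(training_pairs, maxlen=n)

def update(memory, pair):
    """Update procedure u: append the newest (x, y*) pair; the deque's
    maxlen automatically drops the oldest pair, keeping |D| = n."""
    memory.append(pair)
    return memory
```

A deque with `maxlen` implements the "newest in, oldest out" behaviour directly; a more elaborate u (e.g. distribution-aware sampling) would replace `update`.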
In the selection step, a set S is sampled from D such that S = s((x, y), D), where s is some selection procedure. The purpose of s is to select all data-label pairs relevant to the ICM's confidence value for the current (x, y). For example, following kNN, the k closest neighbours to x can be selected based on a similarity or distance function.
In the separation step, S is split into S^+ and S^- based on the current (x, y). The S^+ contains all (x_i, y_i*) in S with y = y_i*. In other words, S^+ contains all data points whose advice was similar to the current advice and correct. The S^- contains all data points with a different correct advice.
In the computation step, S^+ and S^- are used to calculate the confidence value C(x | S^+, S^-) with a weighting scheme w (often abbreviated as C(x)):

C(x \mid S^+, S^-) = Z(x, S) \left[ \sum_{x_i \in S^+} w(x_i, x) - \sum_{x_j \in S^-} w(x_j, x) \right]   (1)

The weights w represent how much a data point in S^+ or S^- influences the confidence of the advice for x. Again, taking kNN as an example, w can simply be a delta-function to 'count' the number of points in S^+ and S^-, although more complex weighting schemes are possible and advised. The Z(x, S) is a normalization factor:

Z(x, S) = \left( \sum_{x_i \in S} w(x_i, x) \right)^{-1}   (2)
This ensures that the confidence value is bounded: C(x) lies in [-1, 1], denoting the confidence that some y would prove to be incorrect or correct respectively. Intermediate values represent the surplus of available evidence for a correct or incorrect advice relative to all available evidence. For example, when C(x) = -0.5 there is a 50% surplus of evidence that the advice y will be incorrect, relative to all available evidence. What constitutes 'evidence' is determined by s, which selects relevant past data-label pairs, and by the weighting scheme w, which assigns their relevance. An ICM allows w and s to be any weighting scheme or selection procedure. Following other case-based reasoning methods, w and s often use a similarity or distance measure (e.g. Euclidean distance).
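The selection, separation and computation steps above can be sketched in a few lines, assuming a plain kNN selection procedure and a caller-supplied weighting scheme w. All names are illustrative, not from the authors' code:

```python
import math

def euclidean(a, b):
    """Distance function d used by the selection procedure s."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def icm_confidence(x, y, D, k, w=lambda xi, x: 1.0):
    """Compute C(x | S+, S-) following Eq. (1) and (2).

    D is a list of (x_i, y_i*) data-label pairs; the advice y for x is
    supported by neighbours with the same correct label (S+) and
    contradicted by the others (S-)."""
    # Selection s: the k nearest neighbours of x in D.
    S = sorted(D, key=lambda pair: euclidean(pair[0], x))[:k]
    # Separation: split S into S+ and S- based on the advice y.
    S_plus = [xi for xi, yi in S if yi == y]
    S_minus = [xi for xi, yi in S if yi != y]
    # Computation: weighted evidence difference, normalized by Z (Eq. 2).
    Z = sum(w(xi, x) for xi, _ in S)
    if Z == 0.0:
        return 0.0
    return (sum(w(xi, x) for xi in S_plus)
            - sum(w(xi, x) for xi in S_minus)) / Z
```

With the default unit weights this already behaves like a kNN vote; swapping in another `w` yields the weighted variants discussed in Section 4.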
3.2. The four properties of ICM
In this section we explain why the above proposed ICM framework
results in confidence measures that are not only accurate, but also
predictable, transparent and explainable.
Accurate. We define the accuracy of a confidence measure as its ability to convey a high confidence for either a correct or incorrect advice, when the advice is indeed correct or incorrect. For an ICM, this can be defined as:

a(D) = \frac{1}{|D|} \sum_{(x, y^*) \in D} \delta(x, y^*, C(x))   (3)

where δ is the Kronecker delta, equal to 1 when f(x) = y* and C(x) ≥ 0, or when f(x) ≠ y* and C(x) < 0, and 0 otherwise. Overall, case-based reasoning methods are often accurate enough for realistic data sets (McLean, 2016). However, the accuracy depends on the choice of the selection procedure s and weighting scheme w. If one chooses a simple kNN paradigm, one may expect a lower accuracy than when using a more sophisticated s and w. More complex options could include learning a complex similarity measure (Papernot and McDaniel, 2018). This potentially increases the accuracy, but at the cost of ICM's transparency and predictability.
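Eq. (3) can be computed directly once each case carries a correctness flag for the advice and a confidence value; the helper below is our own sketch:

```python
def confidence_accuracy(cases):
    """a(D) from Eq. (3): the fraction of cases where the sign of C(x)
    matches the correctness of the advice f(x).

    `cases` is a list of (correct, confidence) tuples, where `correct`
    is True iff f(x) == y*."""
    hits = sum(1 for correct, c in cases
               if (correct and c >= 0) or (not correct and c < 0))
    return hits / len(cases)
```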
Predictable. A confidence measure should behave predictably; it should monotonically increase or decrease when more evidence or data becomes available for an advice being correct or incorrect respectively. For an ICM to be predictable, it must use a monotonic similarity function. Any step-wise or non-monotonic similarity function creates confidence values that suffer from changes that are unexpected for humans. In addition, with the update procedure u an ICM adjusts its confidence according to any changes in the data distribution or the DSS itself.
Transparent. An ICM is transparent; its algorithm can be understood relatively easily by its users. Case-based reasoning is often applied by humans themselves (Schank et al., 2014). This makes the idea of an ICM (recall past data-label pairs and extrapolate them to the current data point into a confidence value) relatively easy to comprehend. A deeper understanding of the algorithm may be possible, but depends on the complexity of the similarity measure, the selection procedure s and the weighting scheme w.
Explainable. The confidence of an ICM can be easily explained using examples as selected from S^+ and S^-. It allows for a template-based explanation paradigm, for example:

"I am C(x) confident that y will be correct, based on |S| past cases deemed similar to x. Of these cases, in |S^+| cases the advice y was correct. In |S^-| cases the advice y would be incorrect."
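Filling the template amounts to string formatting from C(x), |S|, |S^+| and |S^-|. A sketch with the wording of the template above (the function name and signature are our own):

```python
def explain(c, y, n_s, n_plus, n_minus):
    """Render the template-based explanation for advice y with
    confidence c = C(x), given |S| = n_s, |S+| = n_plus, |S-| = n_minus."""
    outcome = "correct" if c >= 0 else "incorrect"
    return (f"I am {abs(c):.0%} confident that '{y}' will be {outcome}, "
            f"based on {n_s} past cases deemed similar. Of these cases, "
            f"in {n_plus} cases the advice was correct; in {n_minus} "
            f"cases the advice would be incorrect.")
```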
These cases can then be further visualized through a user interface, for example with a parallel-coordinates plot (Artero et al., 2004). Such plots provide a means to visualize high-dimensional data and convey the ICM's weighting scheme. They allow users to identify whether the selected past data points and their weights make sense, and to evaluate whether what the ICM treats as evidence should indeed be treated as such. It may even enable the user to interact with the ICM by tweaking its hyperparameters (e.g. parameters of the selection procedure and weighting scheme).
Existing research such as that by Mandelbaum and Weinshall (2017), Subramanya et al. (2017) and Papernot and McDaniel (2018) can be framed as an ICM. All are based on case-based learning and can be described by the four steps of the framework. However, their transparency and predictability tend to be limited due to their choice of a Neural Network to define the similarity measure. This hinders the ICM's transparency and predictability, but still allows the generation of explanations.
4. ICM Examples
In this section we propose three examples of implementing an ICM using relatively simple techniques from the field of case-based reasoning. To define our ICM, we need to define the update procedure u, the selection procedure s and the weighting scheme w. The u remains unchanged: a queue mechanism that stores the latest (x, y*) pair and removes the oldest from D.
The first example, ICM-1, is based on kNN and uses it to define both s and w. The selection procedure is S = s(x | D, k, d), which selects the k closest neighbours in D to x, with d being a distance function. The weighting scheme becomes w(x_i, x) = 1 for all x_i in S. When applied to Eq. (1), the resulting ICM counts the relative number of points in S^+ and S^- to arrive at a confidence value:

C(x \mid S^+, S^-) = \frac{1}{k} \left( |S^+| - |S^-| \right)   (4)

This reflects the idea that confidence is ≥ 0 when the majority of the k nearest neighbours are in favor of the given advice, and < 0 otherwise.
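With ICM-1's delta weighting, Eq. (4) reduces to a count difference; a one-line sketch (names our own):

```python
def icm1_confidence(n_plus, n_minus, k):
    """ICM-1, Eq. (4): unweighted difference between the number of
    supporting (|S+|) and contradicting (|S-|) neighbours among the k."""
    return (n_plus - n_minus) / k
```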
For our second example, ICM-2, we extend ICM-1 with the idea of Weighted kNN (Dudani, 1976; Hechenbichler and Schliep, 2004). It weights each neighbour with a kernel based on its similarity to x according to a distance function d. Given a Radial Basis Function (RBF) as kernel, the weighting scheme becomes:

w(x_i, x) = \exp\!\left( -\frac{d(x_i, x)^2}{2\sigma^2} \right)

The σ is the standard deviation of the RBF, and we set it to d(x, x_k), where x_k is the most distant of the k selected neighbours of x. If we choose to use the Euclidean distance d(x_i, x) = \lVert x_i - x \rVert, the confidence value becomes:

C(x \mid S^+, S^-, d) = Z(x, S) \left[ \sum_{x_i \in S^+} \exp\!\left( -\frac{\lVert x_i - x \rVert^2}{2\sigma^2} \right) - \sum_{x_j \in S^-} \exp\!\left( -\frac{\lVert x_j - x \rVert^2}{2\sigma^2} \right) \right]   (5)

These values depend not only on the number of points in S^+ and S^-, but also on their similarity to x. With this RBF kernel, neighbours are weighted exponentially less important as they become dissimilar to x.
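ICM-2's weighting can be sketched as follows, assuming the standard RBF form exp(-d²/(2σ²)) and setting σ to the distance of the most distant selected neighbour as described above; the function names and the distance-list interface are our own:

```python
import math

def rbf_weight(d, sigma):
    """RBF kernel weight for a neighbour at distance d from x."""
    return math.exp(-d ** 2 / (2 * sigma ** 2))

def icm2_confidence(d_plus, d_minus):
    """ICM-2 (Eq. 5): weighted evidence difference, normalized by the
    summed weights. d_plus/d_minus hold the Euclidean distances of the
    S+ and S- neighbours to x; sigma is the largest of all distances."""
    sigma = max(d_plus + d_minus)
    if sigma == 0.0:
        sigma = 1.0  # all neighbours coincide with x; any sigma works
    w_plus = sum(rbf_weight(d, sigma) for d in d_plus)
    w_minus = sum(rbf_weight(d, sigma) for d in d_minus)
    return (w_plus - w_minus) / (w_plus + w_minus)
```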
Fig. 2. A visual depiction of the ICM framework and its four steps for computing the confidence value C for a data point x and advice y. Given a continuously updated data set D, a set S is selected containing relevant data-label pairs. This S is separated into S^+ and S^-, containing the points that indicate y is correct or incorrect respectively. The confidence value C(x) is computed using Eq. (1).

In our third example, ICM-3, we build further on ICM-2. Here, we estimate σ for each confidence value as the average distance between x and its k neighbours: \sigma = \frac{1}{k} \sum_{x_i \in S} d(x_i, x). Hence, ICM-3 equals Eq. (5), but with this adaptive σ. With it, ICM-3 provides confidence values that take the number of data points in S^+ and S^- into account, but also weighs their similarity to x according to how similar the k neighbours are to each other. This means that the neighbour most similar to x contributes the most to the confidence estimation relative to the other neighbours.
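The only change in ICM-3 is the per-query σ; a sketch of that estimate, with a name of our own choosing:

```python
def icm3_sigma(distances):
    """ICM-3: sigma for one query is the mean distance of the k
    selected neighbours to x, so the kernel width adapts to the
    local density around the queried data point."""
    return sum(distances) / len(distances)
```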
4.1. Comparison of exemplar ICM behaviour
In this section we evaluate ICM-1, ICM-2 and ICM-3 and assess their behaviour, accuracy and predictability over changes in the data.

See Table 2 for the confidence values of all three example ICMs on a synthetic 2D binary classification task solved by a standard SVM. This data set was generated using Python's SciKit Learn package (Pedregosa et al., 2011). The table contains six plots of ICM-1, ICM-2 and ICM-3, each for two values of k. ICM-1 shows a high confidence when we would expect it: as points with a certain prediction approach memorized points (in Euclidean space) with that prediction as their label, the confidence for a correct prediction increases. Conversely, the confidence for an incorrect prediction increases when such points approach memorized data points whose label differs from the prediction. ICM-1 does show abrupt confidence changes for small k, which diminish as k increases. Similar behaviour can be seen for ICM-2 and ICM-3. The difference is that both show even smaller abrupt changes due to their RBF kernel, with ICM-3 being the smoothest as its kernel adapts to the local density. For high k we see that ICM-2 and ICM-3 result in an overall lack of confidence. With higher k values, S starts to contain nearly all data points from D. The summed weights for S^+ and S^- begin to represent the label ratio and the confidence goes to zero. This sensitivity is likely unique to our ICM examples; state-of-the-art case-based reasoning algorithms are less likely to be as sensitive to k, or use a different mechanism than kNN.
Next, we evaluate the accuracy of ICM-3 on two benchmark classification tasks, each solved by a Support Vector Machine (SVM), a Random Forest and a Multi-layer Perceptron (MLP). We chose ICM-3 as the most sophisticated ICM example. The confidence accuracy of ICM-3 was computed using Eq. (3). The confidence values of the SVM were computed using Platt scaling (Platt and others, 1999), of the Random Forest using its voting mechanism, and of the MLP by setting SoftMax as its output layer's activation function. Since none of these confidence values can express a high confidence for an incorrect classification, the accuracy from Eq. (3) was adjusted to treat zero confidence as correct for an incorrect classification. The two classification tasks were a handwritten digit recognition task (Alimoglu et al., 1996) and the diagnosis of heart failure in patients (Detrano et al., 1989). The data set properties, trained models and their hyperparameters are summarized in Tables 3 and 4 respectively.

Fig. 3 shows the results from ten different runs per test set and model combination. The ICM performs equally well in confidence estimation as the models on both data sets. This shows that an ICM can be applied to a variety of models and performs equally well in terms of estimating when a classification would be correct. In addition, an ICM also conveys its confidence in a classification being incorrect, and tends to be more transparent, predictable and explainable.
Table 2
These figures show the confidence values of the three example ICM implementations on a 2D synthetic binary classification task for two values of k. The background of each figure represents the confidence value at that point, the classification model's decision boundaries are shown by the dashed lines, and D is plotted as points coloured by their true class label.

Table 3
The properties of the two benchmark data sets used to evaluate the three ICM examples, as well as the synthetic data used to evaluate the robustness to changes in data distributions.

Name                            Classes  Type     Features  Train/test
Heart (Detrano et al., 1989)    3        Tabular  4         227/76
Digits (Alimoglu et al., 1996)  10       Images   64        1347/450
Synthetic                       6        Tabular  2         100/300

Fig. 4 shows the accuracy of the example ICMs over different values for k. The n was set to encapsulate the entire training data set. This figure shows that ICM-3 is most robust against different values for k. More state-of-the-art algorithms based on kNN can be applied to increase this robustness further, or algorithms based on an entirely different paradigm can be used to define the selection procedure s.
Fig. 5 shows how the example ICMs behave with different numbers of memorized data points. The k was fixed to its optimal value of 10 neighbours for both data sets. These results show that even these simple ICMs are accurate at memory sizes around 10% of the data the models needed for training.
In a separate study (van Diggelen et al., 2017) we applied ICM-3 to a real-world DSS in the Dynamic Positioning case described in the introduction. This case was also used in one of our user experiments, described in detail in Section 5. Here, a Deep Neural Network predicted when an ocean ship was likely to drift off course and notified a human operator to intervene. ICM-3 was used to give the operator more information on whether a prediction could be trusted, to prevent under- or over-trust. In this study, we showed that the ICM was able to compute a confidence value for the Deep Neural Network's predictions with 87% accuracy (van Diggelen et al., 2017).
Finally, we evaluated how well ICM-3 was able to adjust its confidence values when faced with a shift in the data and label distribution. As stated in Section 4, the update procedure u used a simple queuing method to update D. To test the effect this u has on the confidence accuracy, we constructed a synthetic non-linear classification task and artificially shifted its data distribution after having computed the confidence of the first 100 data points. We compared this confidence accuracy over time with the performance of the Random Forest model with and without continuously updating that model.

The results are shown in Fig. 6, repeated ten times with different random seeds to obtain the confidence bounds that are shown. The plot shows that ICM-3 is capable of adjusting its confidence estimation to abrupt changes in the data distribution. It performs nearly the same as continuously retraining the model on each new data point, yet the ICM requires no explicit retraining.
The above results illustrate that even simple ICMs can perform surprisingly accurately on two benchmark data sets and on different models. In addition, even a simple update of the memorized data points results in a confidence estimation that adapts to changes in the data. This shows that ICM can provide a common framework to devise system-agnostic confidence measures.
5. A qualitative user experiment: Interviews with domain experts
This section summarizes the first of two user experiments. This
experiment is explained in more detail in our previous work
(van der Waa et al., 2018). In the experiment, several domain experts
Table 4
Shows the hyperparameters and the accuracy on the test set (train set in parentheses) for each model and data set combination used to compare our example ICMs with. We used the SciKit-Learn package for Python as the implementation of each model (Pedregosa et al., 2011).

Data      | Model         | Accuracy (train) | Parameters
----------|---------------|------------------|-----------
Heart     | SVM           | 77.63% (94.27%)  | RBF kernel, …
Heart     | MLP           | 76.32% (91.85%)  | Adam optimizer (…), 100 epochs, batch size 16, Softmax output layer, 2 hidden layers ([16, 3], ReLU, 20% dropout)
Heart     | Random Forest | 73.68% (100%)    | Gini, Bootstrapping, 50 estimators
Digits    | SVM           | 98.22% (100%)    | RBF kernel, …
Digits    | MLP           | 99.33% (99.73%)  | Adam optimizer (…), 250 epochs, batch size 16, Softmax output layer, 3 hidden layers ([64, 32, 6], ReLU, 20% dropout)
Digits    | Random Forest | 97.33% (100%)    | Gini, Bootstrapping, 100 estimators
Synthetic | Random Forest | 97.33% (100%)    | Gini, Bootstrapping, 100 estimators
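As a rough illustration of the configurations in Table 4, the Digits/Random Forest row can be approximated with scikit-learn along these lines. The data split, the random seed, and the assumption that the Digits task corresponds to scikit-learn's bundled digits set are ours; the paper does not report them.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative sketch of the "Digits / Random Forest" configuration in
# Table 4 (Gini, bootstrapping, 100 estimators). Split and seed are
# assumptions, not values reported in the paper.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(
    n_estimators=100, criterion="gini", bootstrap=True, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # test accuracy in the high nineties, in line with Table 4
```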
Fig. 3. The accuracy of ICM-3 on two data sets compared with the accuracy of
the confidence estimates from various models. It shows that ICM implementations are
applicable to different models and can be equally accurate as the model itself.
The error bars represent the 95% confidence intervals.
Fig. 4. The accuracy of ICM on two data sets for different numbers of nearest
neighbours used. It shows our proposed Robust Weighted kNN algorithm (ICM-
3) compared to ICMs with weighted kNN (ICM-2) and kNN (ICM-1). It illus-
trates the robustness of ICM-3 against different k.
Fig. 5. The accuracy of the example ICM on the two benchmark data sets for
different values of n, the number of memorized data points. It illustrates the
robustness of each ICM with different n.
J.v.d. Waa, et al. International Journal of Human-Computer Studies 144 (2020) 102493
were interviewed to evaluate the transparency of the case-based reasoning
approach underlying an ICM compared to other confidence measures.
Dynamic Positioning (DP) formed the use case of the experiment.
Here, a ship’s bridge operator is responsible for maintaining the ship’s
position aided by an auto-pilot and a DSS (van Diggelen et al., 2017).
The DSS warns the user when the ship’s position is expected to deviate
from course and human intervention is required. Structured interviews
with DP operators were conducted in which we elicited their under-
standing of, and needs for, a confidence value accompanying the DSS'
prediction. Three confidence measure categories were evaluated: 1)
ICM, 2) Platt Scaling and 3) SoftMax activation functions.
The interview was structured in three phases. In the first phase we
provided a layman’s - but complete - description of each confidence
measure. Participants were asked to select their preferred method,
followed by explaining each measure in their own words. This enabled us
to discover which algorithm they preferred, but also which they could
reproduce accurately (signifying a better understanding). We found that
they understood ICM best, but preferred the SoftMax measure. When
asked, participants mentioned that estimating confidence in their line
of work is difficult and as such they expected a confidence measure to
be very complex. This result suggests that what users might prefer in a
confidence measure (complexity) may not necessarily be what they
need (transparency).
The second phase provided examples of realistic situations, the DSS’
prediction and a confidence value. Each example was accompanied by
three explanations, one from each measure. Participants were asked
which explanation they preferred for each example. On average, they
preferred the explanations from ICM as it specifically addressed past
examples and explained their contribution to the confidence value.
Afterwards, participants were asked to explain in their own words how
each confidence measure would compute their values for unseen si-
tuations. The results showed that the operators could replicate ICM's
explanations more easily than those of the other two measures.
The third and final phase allowed the participants to describe their
ideal confidence measure for the DSS. Several participants described a
case-based reasoning approach as used by ICM. Others preferred a
combination of both an ICM and SoftMax. When asked why, they re-
plied that they preferred the case-based reasoning approach but they
believed it to be too simplistic on its own to be accurate in their line of
work. They tried to add their interpretation of a SoftMax activation
function to ICM to satisfy their need for added complexity.
These results may indicate that domain experts are able to under-
stand a case-based reasoning approach for a confidence measure more
easily than the DSS’ prediction scores defined by a SoftMax output
layer, or the scaled prediction scores with Platt Scaling.
6. A quantitative user experiment: an online survey on user preferences
The second experiment was performed using a quantitative online
survey. We evaluated the users’ interests and preferences concerning
explanations about the confidence of an advice as provided by a DSS.
Moreover, we investigated if the proposed ICM, based on case-based
reasoning, is in line with what humans desire from a confidence mea-
sure and explanations.
Below we describe the use case, participant group, stimuli, design
and analyses in more detail, followed by the results.
6.1. Use case: Autonomous driving
In the survey, participants were provided a written scenario de-
scribing an autonomous car. This scenario stated that the car could
provide an advice to turn its self-driving mode on or off, given the
current and predicted road, weather and traffic conditions. The advice
would be accompanied by a confidence value as calculated by the car.
Participants were instructed to assume several years of experience with
the car and that the car showed to be capable of driving autonomously
on frequently used roads. At some point on such a familiar road, the car
would provide the advice to turn on automatic driving mode. The ex-
periment followed with a questionnaire revolving around this advice
and the given confidence value.
6.2. Participants
Recruitment was done via Amazon’s Mechanical Turk, and each parti-
cipant received $0.45 for participating in the survey based on the estimated
time for completion and average wages. Only participants of 21 years or
older were included. A total of 26 men and 14 women aged between 24 and
64 years (M = 35.6, SD = 9.4) were recruited, who were all (self-rated)
fluent English speakers. On average, participants indicated on a 5-point
Likert scale that they had some prior knowledge of self-driving cars
(M = 3.00, SD = 0.68). Hence, participants could be biased towards
answering questions based upon knowledge about self-driving cars,
instead of using the description in the questions. However, the scores on
the dependent variables of the participants who indicated they were
knowledgeable or very knowledgeable did not significantly differ from
those of the other participants, so all were included.
6.3. Stimuli
We composed a survey in which we asked participants about their
interests and preferences concerning explanations about the confidence
Fig. 6. The moving average accuracy of
querying confidence values from ICM-3, a
static Random Forest and a continuously up-
dated Random Forest. It shows a shift in the
label distribution after 100 data points in the
synthetic data. The plot shows that ICM-3 is
capable of adjusting its confidence values
nearly as well as the confidence from the con-
tinuously updating model.
J.v.d. Waa, et al. International Journal of Human-Computer Studies 144 (2020) 102493
of an advice as provided by a self-driving car. The system was presented
as being able to drive perfectly without assistance from a user within
most situations, but unable to drive fully autonomously in some other
undefined situations. We asked participants to indicate how much im-
portance they would attach to: 1) understanding the confidence mea-
sure’s underlying algorithm, 2) their past experience with other advice
from the car, and 3) predictions about future conditions (e.g. weather).
The importance was indicated on a 7-point Likert scale with 1 meaning
‘not at all important’ and 7 meaning ‘very much important’.
Moreover, we asked participants to rank five methods of presenting
the advice that the car could provide (with 1 being most preferred, and
5 being least preferred):
a) No additional information;
b) A general summary of prior experiences;
c) General prior experience accompanied by an illustrative specific
past experience;
d) Current situational aspects that played a role;
e) Predicted future situational characteristics that could affect the de-
cision’s outcome.
Fig. 7 shows a screenshot of the question that asked users to rank
different types of explanations according to their preference.
Advice 2 and 3 provide illustrative examples of the type of
information that an ICM can provide to a user (corresponding to b) and
c) in the above enumeration). That is, the confidence of the DSS is
explained in terms of similar stored past experiences with its own
performance. The difference between advice 2 and 3 is that the latter
includes a specific example of a situation in which the advice appeared
not to be correct, while the former does not.
6.4. Experimental design
We investigated two variables. 1) The importance of different in-
formation in determining when to follow an advice: information about
the confidence measure’s algorithm, information about prior experi-
ence, or information about the predicted future situation. 2) The in-
formation preference in an accompanying explanation: no additional
explanation, general prior experience, specific prior experience, current
situation, or predicted future situation. Both dependent variables were
investigated within-subjects, meaning that all participants indicated
their importance rating and preference rankings for all types of in-
formation and explanations respectively.
6.5. Analyses
We performed two non-parametric Friedman tests with post-hoc
Wilcoxon signed rank tests on the ordinal Likert scale data to
investigate two topics: 1) the relative importance of the information
taken into account when deciding whether or not to follow the advice,
and 2) the difference between preference rankings of the types of
explanation.
Fig. 7. Screenshot of the section of the survey in which participants were asked to rank different kinds of explanations based on their preference. Advice 2 and 3
provide illustrative examples of the type of information that an ICM can provide to a user.
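The non-parametric Friedman test used in these analyses can be sketched directly; ties are ignored for simplicity, and the toy ratings below are illustrative, not the study's data.

```python
import numpy as np

# Hedged sketch of the Friedman test statistic (no tie correction):
# rank each subject's ratings across the k conditions, then compare the
# per-condition rank sums with what is expected under no preference.
def friedman_chi2(ratings):
    n, k = ratings.shape  # n subjects, k conditions
    ranks = ratings.argsort(axis=1).argsort(axis=1) + 1  # within-subject ranks 1..k
    rank_sums = ranks.sum(axis=0)
    return 12.0 / (n * k * (k + 1)) * (rank_sums ** 2).sum() - 3 * n * (k + 1)

# Four subjects who all rate the third condition highest and the first lowest:
ratings = np.array([[1, 2, 3]] * 4, dtype=float)
print(friedman_chi2(ratings))  # maximal statistic n * (k - 1) = 8.0
```

The statistic is referred to a χ² distribution with k − 1 degrees of freedom, as in the χ²(2) and χ²(4) values reported in the results; scipy.stats.friedmanchisquare offers a tie-corrected version.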
6.6. Results
Fig. 8 shows the distribution of Likert scale ratings concerning the
importance of information in the advice. Ratings are high in general, as
indicated by the high medians and the minor deviations from these
median scores.
There is a statistically significant difference in importance ratings of
the considered information when evaluating an advice, χ²(2) = 16.77,
p < 0.001. Wilcoxon signed-rank tests showed that participants rated
prior experience with the system as more important for deciding about
following an advice than understanding the advice system (p < 0.001),
but not significantly more important than predictions about future
situational circumstances. Predictions about future circumstances were,
however, rated as more important than understanding the advice system.
Fig. 9 shows the means and 95% confidence intervals of the rank-
ings concerning the preferences of participants for different types of
additional information given in an advice.
There is a statistically significant difference in rankings of the five
types of advice, χ²(4) = 39.38, p < 0.001. Table 5 shows the results of
the post-hoc tests. Importantly, participants preferred the explanation
that contained general prior experiences over the one that presented a
specific experience of a case in which the advice turned out to be incorrect. They
also preferred general prior information over information concerning
the future situation, and over no additional information. However,
preference ratings for using general prior experience as explanation
about an advice were, on average, not higher than using information
about present situational circumstances.
In this user experiment, we investigated how participants judged
different types of information a confidence measure may use and in-
clude in an explanation. Overall, the use of relevant prior experiences
was judged as important in both defining confidence values and ex-
plaining them. Equally important was the information contained in the
current situation. This indicates that ICM and its explanations match
people's expectations and preferences for a confidence estimate. It also
underlines the importance of confidence measures providing an ex-
planation about its values, something ICM readily supports. However,
confidence measures may also need to explain how the current situation
relates to those past experiences. For ICM, that entails explaining
the similarity function and why it selected those past experiences given
the current situation. To do so, the similarity function needs to be easily
understood or explained.
7. Discussion
Although the proposed ICM framework relies on a case-based rea-
soning approach, it is also closely related to the field of conformal
prediction (Shafer and Vovk, 2008). Methods from this field define a set
of predictions that is guaranteed to contain the true prediction with a
certain probability (e.g. 95%). Conformal prediction methods share
many similarities to ICM, such as their model-agnostic approach and
use of (dis)similarity functions. Current research focuses on making
these methods more explainable and transparent (Johansson et al.,
2018). Our experimental work on these topics may provide valuable
insights for future conformal prediction methods. In addition, future
work may aim to explore how conformal prediction methods can be
used in the ICM framework.
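To make the connection concrete, split conformal prediction (Shafer and Vovk, 2008) can be sketched in a few lines for regression. The toy model, the calibration data and the choice of α are illustrative assumptions, not part of the ICM framework itself.

```python
import numpy as np

# Hedged sketch of split conformal prediction: residuals on a held-out
# calibration set yield an interval around a new prediction that contains
# the true value with probability at least 1 - alpha (marginally).
def conformal_interval(predict, X_cal, y_cal, x_new, alpha=0.2):
    scores = np.abs(y_cal - predict(X_cal))               # nonconformity scores
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    q = np.quantile(scores, level)
    center = predict(x_new)
    return center - q, center + q

predict = lambda X: 2.0 * np.asarray(X, dtype=float)      # toy "model": y = 2x
X_cal = np.arange(10, dtype=float)
y_cal = predict(X_cal) + np.array([0.1, -0.2, 0.05, 0.3, -0.1,
                                   0.2, -0.05, 0.15, -0.3, 0.25])
lo, hi = conformal_interval(predict, X_cal, y_cal, np.array([3.0]))
print(lo[0], hi[0])
```

In an ICM setting, the nonconformity score could instead be derived from the (dis)similarity function over memorized cases, which is one way the two frameworks could be combined.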
An important trade-off in an ICM is between its accuracy and
transparency, as an increase in accuracy implies an increase in com-
plexity. A concrete example is the similarity measure: it can be as
straightforward as Euclidean distance or as complex as a trained Deep
Neural Network (as in Mandelbaum and Weinshall, 2017). For some
domains, a relatively simple similarity measure may not suffice due to
their high-dimensional nature or less-than-apparent relations between fea-
tures (e.g. the many pixels in an image recognition task). A more
complex or even learned similarity measure may solve such issues.
However, it may prevent users from adopting the system in their
work due to a lack of understanding (Ye and Johnson, 1995). This is
sometimes referred to as the accuracy and transparency trade-off in
current AI. To solve this, simplified model-agnostic methods generating
explanations may be a solution. However, it also requires exploring
where users allow for system complexity and where transparency is required.
Besides such technical issues, an interesting finding from the online
survey was that participants did not find it important for an ex-
planation to refer to past situations in which the provided advice
proved to be incorrect. This could indicate the tendency of people to
favor information that confirms their preexisting beliefs and to
disregard falsifying evidence, a phenomenon known as the confirmation
bias (Gilovich et al., 2002). Importantly, such a preference does not
necessarily mean that it is best to omit this kind of information. That is,
the main goal of the transparency and explainability properties of an
ICM is to enable users to better understand where the confidence value
originates from in order to more accurately predict the extent to which
an advice of the system can be trusted. In order to enable people to
make an accurate assessment, it is essential to provide both confirming
and contradictory information, precisely because we know that people
are prone to ignore information that does not confirm their beliefs.
Future work on confidence measures should not only conduct user
Fig. 8. Boxplot of the Likert scale ratings indicating the importance of different
types of information used to determine whether to follow the advice to turn on
automatic driving mode. The higher the ratings, the more important the
information was considered.
Fig. 9. Means and 95% confidence intervals of the preference rankings con-
cerning the different types of advice that are provided by the autonomous car.
The rankings are inverted: the higher the rank, the more preferred.
J.v.d. Waa, et al. International Journal of Human-Computer Studies 144 (2020) 102493
experiments revolving around preferences, but also investigate how
these measures affect system adoption, usage and task performance.
Moreover, findings from our user experiments implied that people
prefer to know about the current situational circumstances. This
preference held even when a given confidence value was high and
participants said they trusted this estimate. This could indicate that people still
want to be able to form their own judgement about the DSS’ advice
based on their own observations, in order to maintain a sense of control
and autonomy (Legault, 2016). Hence, a confidence measure is not a
substitute for a user’s own judgement process and should be designed to
facilitate this process. ICM’s property of explainability may offer a vital
contribution to this process. Further investigation is required to identify
what should be explained in addition to an ICM and how this should be done.
8. Conclusion
In this work we proposed the concept of Interpretable Confidence
Measures (ICM). We used the idea of case-based reasoning to formalise
such measures. In addition, we motivated the need for confidence
measures to be not only accurate, but also explainable, transparent and
predictable. An ICM aims to provide a user of a decision support system
(DSS) with information on whether the DSS' advice should be trusted. It
does so by conveying how likely it is that the given advice turns out to
be correct based on past experiences.
Three straightforward ICM implementations were proposed and
evaluated, to serve as concrete examples of the proposed ICM frame-
work. Two user experiments were conducted that showed that partici-
pants were able to understand the idea of case-based reasoning and that
this was in line with their own reasoning about confidence. In addition,
participants especially preferred their confidence values to be explained
by referring to past experiences and by highlighting specific experi-
ences in the process.
Future work may focus on further expanding the ICM framework by
incorporating more state of the art methods for confidence estimation.
Especially methods from the field of conformal prediction may prove
valuable. Additional user experiments could provide more insight into
user requirements for confidence measures. Other user experiments
could investigate the effects of confidence measures on actual task performance.
CRediT authorship contribution statement
Jasper van der Waa: Conceptualization, Methodology, Software,
Formal analysis, Investigation, Writing - original draft, Writing - review
& editing. Tjeerd Schoonderwoerd: Formal analysis, Investigation,
Writing - original draft, Writing - review & editing. Jurriaan van
Diggelen: Writing - review & editing, Supervision, Funding acquisition.
Mark Neerincx: Writing - review & editing, Supervision, Funding acquisition.
Declaration of Competing Interest
The authors declare no competing interests.
Acknowledgements

This research was funded by TNO's early research programs (ERP)
and risk-bearing exploratory research programs (RVO). We would like
to thank our colleagues for insightful discussions. Special gratitude goes
to Alexander Boudewijn, Stephan Raaijmakers and Catholijn Jonker.
References

Alimoglu, F., Alpaydin, E., Denizhan, Y., 1996. Combining Multiple Classifiers for Pen-
Based Handwritten Digit Recognition. Institute of Graduate Studies in Science and
Engineering, Bogazici University Master’s thesis.
Artero, A.O., de Oliveira, M.C.F., Levkowitz, H., 2004. Uncovering clusters in crowded
parallel coordinates visualizations. IEEE Symposium on Information Visualization.
IEEE, pp. 81–88.
Atkeson, C.G., Moore, A.W., Schaal, S., 1997. Locally weighted learning. Artif. Intell. Rev.
11 (June), 11–73.
Bhattacharyya, S., 2013. Confidence in predictions from random tree ensembles. Knowl.
Inf. Syst. 35 (2), 391–410.
Bose, I., Mahapatra, R.K., 2001. Business data mining: a machine learning perspective.
Inf. Manage. 39 (3), 211–225.
Burrell, J., 2016. How the machine thinks: understanding opacity in machine learning
algorithms. Big Data Soc. 3 (1).
Cabitza, F., Rasoini, R., Gensini, G.F., 2017. Unintended consequences of machine
learning in medicine. JAMA 318 (6), 517–518.
Cohen, M.S., Parasuraman, R., Freeman, J.T., 1998. Trust in decision aids: a model and its
training implications. in Proc. Command and Control Research and Technology
Symp. Citeseer.
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J.-J., Sandhu, S., Guppy,
K.H., Lee, S., Froelicher, V., 1989. International application of a new probability
algorithm for the diagnosis of coronary artery disease. Am. J. Cardiol. 64 (5),
van Diggelen, J., van den Broek, H., Schraagen, J.M., van der Waa, J., 2017. An intelligent
operator support system for dynamic positioning. International Conference on
Applied Human Factors and Ergonomics. Springer, pp. 48–59.
Doshi-Velez, F., Kim, B., 2017. Towards a rigorous science of interpretable machine
learning. arXiv:1702.08608.
Doyle, D., Tsymbal, A., Cunningham, P., 2003. A Review of Explanation and Explanation
in Case-Based Reasoning. Technical Report. Trinity College Dublin, Department of
Computer Science.
Dudani, S.A., 1976. The distance-weighted k-nearest-neighbor rule. IEEE Trans. Syst. Man
Cybern. 325–327.
Evans, J., Handley, S., Over, D., 2003. Conditionals and conditional probability. Exp.
Psychol. 29 (2), 321.
Fitzhugh, E.W., Hoffman, R.R., Miller, J.E., 2011. Active Trust Management. Ashgate.
Fix, E., Hodges Jr, J.L., 1951. Discriminatory Analysis-Nonparametric Discrimination:
Consistency Properties. Technical Report. California Univ Berkeley.
Florez-Lopez, R., Ramon-Jeronimo, J.M., 2015. Enhancing accuracy and interpretability
of ensemble strategies in credit risk assessment. A correlated-adjusted decision forest
proposal. Expert Syst. Appl. 42 (13), 5737–5753.
Foody, G.M., 2005. Local characterization of thematic classification accuracy through
spatially constrained confusion matrices. Int. J. Remote Sens. 26 (6), 1217–1228.
Fortunato, M., Blundell, C., Vinyals, O., 2017. Bayesian recurrent neural networks.
Freitas, A.A., 2014. Comprehensible classification models: a position paper. ACM SIGKDD
Explor. Newsl. 15 (1), 1–10.
Gal, Y., Ghahramani, Z., 2016. A theoretically grounded application of dropout in re-
current neural networks. Advances in Neural Information Processing Systems. pp.
Gilovich, T., Griffin, D., Kahneman, D., 2002. Heuristics and Biases: The Psychology of
Intuitive Judgment. Cambridge university press.
Gilpin, L. H., Bau, D., Yuan, B. Z., Bajwa, A., Specter, M., Kagal, L., 2018. Explaining
explanations: an approach to evaluating interpretability of machine learning.
Goodfellow, I. J., Shlens, J., Szegedy, C., 2014. Explaining and harnessing adversarial
examples. arXiv:1412.6572.
Goodman, B., Flaxman, S., 2016. European union regulations on algorithmic decision-
making and a “right to explanation”. arXiv:1606.08813.
Graves, A., 2011. Practical variational inference for neural networks. Advances in Neural
Information Processing Systems. pp. 2348–2356.
Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D., 2018. A
Table 5
Results of the Wilcoxon signed rank post-hoc tests on the preference rankings of information that is included in an explanation about the advice.
Present Situation General past experience Future situation Specific past experience
Present situation
General past experience n.s.
Future situation
Specific past experience
No information
p< .001
p< .001
survey of methods for explaining black box models. ACM Comput. Surv. (CSUR) 51
(5), 93.
Hao, H., Liu, C.-L., Sako, H., et al., 2003. Confidence evaluation for combining diverse
classifiers. ICDAR. vol. 3. pp. 760–765.
Harteis, C., Billett, S., 2013. Intuitive expertise: theories and empirical evidence. Educ.
Res. Rev. 9, 145–157.
Hechenbichler, K., Schliep, K., 2004. Weighted k-nearest-neighbor techniques and ordinal
classification. Discussion Paper 399, SFB386
Herman, B., 2017. The promise and peril of human evaluation for model interpretability.
Hodgkinson, G.P., Langan-Fox, J., Sadler-Smith, E., 2008. Intuition: a fundamental
bridging construct in the behavioural sciences. Br. J. Psychol. 99 (1), 1–27.
Hoffman, R.R., Johnson, M., Bradshaw, J.M., Underbrink, A., 2013. Trust in automation.
IEEE Intell. Syst. 28 (1), 84–88.
Hoffman, R. R., Mueller, S. T., Klein, G., Litman, J., 2018. Metrics for explainable ai:
challenges and prospects. arXiv:1812.04608.
Holzinger, A., Carrington, A., Müller, H., 2019a. Measuring the quality of explanations:
the system causability scale (SCS). Comparing human and machine explanations.
Holzinger, A., Langs, G., Denk, H., Zatloukal, K., Müller, H., 2019. Causability and ex-
plainability of artificial intelligence in medicine. WIREs Data Min. Knowl. Discov. 9
(4), e1312.
Johansson, U., Linusson, H., Löfström, T., Boström, H., 2018. Interpretable regression
trees using conformal prediction. Expert Syst. Appl. 97, 394–404.
Kim, B., Glassman, E., Johnson, B., Shah, J., 2015. iBCM: Interactive Bayesian Case Model
Empowering Humans via Intuitive Interaction. Technical Report. MIT-CSAIL-TR-
Labatut, V., Cherifi, H., 2011. Evaluation of performance measures for classifiers com-
parison. arXiv:1112.4133.
Landsbergen, D., Coursey, D.H., Loveless, S., Shangraw Jr, R., 1997. Decision quality,
confidence, and commitment with expert systems: an experimental study. J. Public
Adm. Res.Theory 7 (1), 131–158.
Legault, L., 2016. The need for autonomy. Encyclopedia of Personality and Individual
Differences. Springer, New York, NY, pp. 1120–1122.
Libbrecht, M.W., Noble, W.S., 2015. Machine learning applications in genetics and
genomics. Nat. Rev. Genet. 16 (6), 321.
Lipton, Z. C., 2016. The mythos of model interpretability. arXiv:1606.03490.
Liu, C.-L., Hao, H., Sako, H., 2004. Confidence transformation for combining classifiers.
Pattern Anal. Appl. 7 (1), 2–17.
Mandelbaum, A., Weinshall, D., 2017. Distance-based confidence score for neural net-
work classifiers. arXiv:1709.09844.
McLean, S.F., 2016. Case-based learning and its application in medical and health-care
fields: a review of worldwide literature. J. Med. Educ. Curric.Dev. 3, S20377.
Miller, T., 2018a. Contrastive explanation: a structural-model approach. arXiv:1811.
Miller, T., 2018. Explanation in artificial intelligence: insights from the social sciences.
Artif. Intell.
Miller, T., Howe, P., Sonenberg, L., 2017. Explainable AI: beware of inmates running the
asylum or: how i learnt to stop worrying and love the social and behavioural sciences.
Nguyen, A., Yosinski, J., Clune, J., 2015. Deep neural networks are easily fooled: high
confidence predictions for unrecognizable images. Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. pp. 427–436.
Niculescu-Mizil, A., Caruana, R., 2005. Predicting good probabilities with supervised
learning. Proceedings of the 22nd International Conference on Machine Learning.
ACM, pp. 625–632.
Paisley, J., Blei, D., Jordan, M., 2012. Variational bayesian inference with stochastic
search. arXiv:1206.6430.
Papernot, N., McDaniel, P., 2018. Deep k-nearest neighbors: towards confident, inter-
pretable and robust deep learning. arXiv:1803.04765.
Papadopoulos, G., Edwards, P.J., Murray, A.F., 2001. Confidence estimation methods for
neural networks: a practical comparison. IEEE Trans. Neural Netw. 12 (6),
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M.,
Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D.,
Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: machine learning in
python. J. Mach. Learn. Res. 12, 2825–2830.
Pita, J., Tambe, M., Kiekintveld, C., Cullen, S., Steigerwald, E., 2011. Guards: game
theoretic security allocation on a national scale. The 10th International Conference
on Autonomous Agents and Multiagent Systems. International Foundation for
Autonomous Agents and Multiagent Systems, pp. 37–44.
Platt, J., others, 1999. Probabilistic outputs for support vector machines and comparisons
to regularized likelihood methods. Adv. Large Margin Classif. 10 (3), 61–74.
Polikar, R., 2006. Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6
(3), 21–45.
Pollatsek, A., Well, A.D., Konold, C., Hardiman, P., Cobb, G., 1987. Understanding con-
ditional probabilities. Organ. Behav. Hum. Decis. Process. 40 (2), 255–269.
Qin, Z., 2006. Naive bayes classification given probability estimation trees. 2006 5th
International Conference on Machine Learning and Applications (ICMLA’06). IEEE,
pp. 34–42.
Ribeiro, M. T., Singh, S., Guestrin, C., 2016. Model-agnostic interpretability of machine
learning. arXiv:1606.05386.
Ridgeway, G., Madigan, D., Richardson, T., O’Kane, J., 1998. Interpretable boosted Naïve
bayes classification. KDD. pp. 101–104.
Rish, I., et al., 2001. An empirical study of the naive bayes classifier. IJCAI 2001
Workshop on Empirical Methods in Artificial Intelligence. pp. 41–46.
Samek, W., Wiegand, T., Müller, K.-R., 2017. Explainable artificial intelligence: under-
standing, visualizing and interpreting deep learning models. arXiv:1708.08296.
Schank, R.C., Kass, A., Riesbeck, C.K., 2014. Inside Case-Based Explanation. Psychology
Shafer, G., Vovk, V., 2008. A tutorial on conformal prediction. J. Mach. Learn. Res. 9
(Mar), 371–421.
Stone, P., Veloso, M., 1997. Using decision tree confidence factors for multiagent control.
Robot Soccer World Cup. Springer, pp. 99–111.
Sturm, I., Lapuschkin, S., Samek, W., Müller, K.-R., 2016. Interpretable deep neural
networks for single-trial eeg classification. J. Neurosci. Methods 274, 141–145.
Subramanya, A., Srinivas, S., Babu, R. V., 2017. Confidence estimation in deep neural
networks via density modelling. arXiv:1707.07013.
Tipping, M.E., 2000. The relevance vector machine. Advances in Neural Information
Processing Systems (NIPS’ 2000). pp. 652–658.
Tóth, N., Pataki, B., 2008. Classification confidence weighted majority voting using de-
cision tree classifiers. Int. J. Intell. Comput.Cybern. 1 (2), 169–192.
Van Erp, M., Vuurpijl, L., Schomaker, L., 2002. An overview and comparison of voting
methods for pattern recognition. Proceedings Eighth International Workshop on
Frontiers in Handwriting Recognition. IEEE, pp. 195–200.
van der Waa, J., van Diggelen, J., Neerincx, M.A., Raaijmakers, S., 2018. ICM: An in-
tuitive model independent and accurate certainty measure for machine learning.
ICAART. 2. pp. 314–321.
Walley, P., 1996. Measures of uncertainty in expert systems. Artif. Intell. 83 (1), 1–58.
Waterman, D., 1986. A Guide to Expert Systems. Addison-Wesley Pub. Co., Reading, MA.
Wettschereck, D., Aha, D.W., Mohri, T., 1997. A review and empirical evaluation of
feature weighting methods for a class of lazy learning algorithms. Artif. Intell. Rev. 11
(1–5), 273–314.
Wu, Y., Yao, X., Vespasiani, G., Nicolucci, A., Dong, Y., Kwong, J., Li, L., Sun, X., Tian, H.,
Li, S., 2017. Mobile app-based interventions to support diabetes self-management: a
systematic review of randomized controlled trials to identify functions associated
with glycemic efficacy. JMIR mHealth and uHealth 5 (3), e35.
Ye, L.R., Johnson, P.E., 1995. The impact of explanation facilities on user acceptance of
expert systems advice. Mis Q. 157–172.
Zadrozny, B., Elkan, C., 2001. Obtaining calibrated probability estimates from decision
trees and naive Bayesian classifiers. Proceedings of the Eighteenth International
Conference on Machine Learning (ICML). pp. 609–616.
Zadrozny, B., Elkan, C., 2002. Transforming classifier scores into accurate multiclass
probability estimates. Proceedings of the Eighth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. ACM, pp. 694–699.
Zaragoza, H., d'Alché-Buc, F., 1998. Confidence measures for neural network classifiers.
Proceedings of the Seventh International Conference on Information Processing and
Management of Uncertainty in Knowledge-Based Systems (IPMU).
Zhou, J., Chen, F., 2018. 2D transparency space; bring domain users and machine
learning experts together. Human and Machine Learning. Springer, pp. 3–19.
Zliobaite, I., 2015. A survey on measuring indirect discrimination in machine learning.
J.v.d. Waa, et al. International Journal of Human-Computer Studies 144 (2020) 102493