
Contrastive explanations for reinforcement learning in terms of expected consequences

  • TNO, Soesterberg, Netherlands

Abstract

Machine learning models are becoming increasingly proficient in complex tasks. However, even for experts in the field, it can be difficult to understand what a model has learned. This hampers trust and acceptance, and it obstructs the possibility to correct the model. There is therefore a need for transparency of machine learning models. The development of transparent classification models has received much attention, but there are few developments for achieving transparent Reinforcement Learning (RL) models. In this study we propose a method that enables an RL agent to explain its behavior in terms of the expected consequences of state transitions and outcomes. First, we define a translation of states and actions to a description that is easier to understand for human users. Second, we develop a procedure that enables the agent to obtain the consequences of a single action, as well as of its entire policy. The method calculates contrasts between the consequences of a policy derived from the user's query and those of the agent's learned policy. Third, a format for generating explanations was constructed. A pilot survey study was conducted to explore preferences of users for different explanation properties. Results indicate that human users tend to favor explanations about policy rather than about single actions.
J. van der Waa¹, J. van Diggelen¹, K. van den Bosch¹, M. Neerincx¹
¹Netherlands Organisation for Applied Scientific Research
1 Introduction
Complex machine learning (ML) models such as Deep Neural Networks (DNNs) and Support Vector Machines (SVMs) perform very well on a wide range of tasks [Lundberg and Lee, 2016], but their outcomes are often difficult for humans to understand [Weller, 2017]. Moreover, machine learning models cannot explain how they achieved their results. Even for experts in the field, it can be very difficult to understand what the model actually learned [Samek et al., 2016]. To remedy this issue, the field of eXplainable Artificial Intelligence (XAI) studies how such complex but useful models can be made more understandable [?].
Achieving transparency of ML models has multiple advantages [Weller, 2017]. For example, if a model designer knows why a model performs badly on some data, he or she can start a more informed process of resolving the performance issues [Kulesza et al., 2015; Papernot and McDaniel, 2018]. However, even if a model has high performance, its users (typically non-experts in ML) would still like to know why it came to a certain output [Miller, 2017]. Especially in high-risk domains such as defense and health care, inappropriate trust in the output may cause substantial risks and problems [Lipton, 2016; Ribeiro et al., 2016]. If an ML model fails to provide transparency, the user cannot safely rely on its outcomes, which hampers the model's applicability [Lipton, 2016]. If, however, an ML model is able to explain its workings and outcomes satisfactorily to the user, this would not only improve the user's trust but also provide new insights to the user.
For the problem of classification, recent research has developed a number of promising methods that enable classification models to explain their output [Guidotti et al., 2018]. Several of these methods are model-independent in some way, allowing them to be applied to any existing ML classification model. However, for Reinforcement Learning (RL) models, relatively few methods are available [Verma et al., 2018; Shu et al., 2017; Hein et al., 2017]. The scarcity of methods that enable RL agents to explain their actions and policies to humans severely hampers the practical application of RL models. It also diminishes the often highly rated value of RL to Artificial Intelligence [Hein et al., 2017; Gosavi, 2009]. Take for example a simple agent in a grid world that needs to reach a goal position while evading another actor who could cause termination, as well as other static terminal states. The RL agent cannot easily explain why it takes the route it has learned, as it only knows numerical rewards and its coordinates in the grid. The agent has no grounded knowledge about the 'evil actor' that tries to prevent it from reaching its goal, nor does it know how certain actions will affect such grounded concepts. These state features and rewards drive the agent, but they do not lend themselves well to an explanation, as they may not be grounded concepts, nor do they offer a reason why the agent behaves in a certain way.
arXiv:1807.08706v1 [cs.LG] 23 Jul 2018
Important pioneering work has been done by Hayes and Shah [Hayes and Shah, 2017]. They developed a method for eXplainable Reinforcement Learning (XRL) that can generate explanations about a learned policy in a way that is understandable to humans. Their method converts feature vectors to a list of predicates by using a set of binary classification models. This list of predicates is searched to find subsets that tend to co-occur with specific actions. The method provides information about which actions are performed when which state predicates are true. A method that uses co-occurrence to generate explanations may be useful for small problems, but it becomes less comprehensible in larger planning and control problems, because the overview of predicate and action combinations becomes too large. Also, the method addresses only what the agent does, not why it acts as it does. In other words, the method presents the user with the correlations between states and the policy, but it does not provide a motivation for why that policy is used in terms of rewards or state transitions.
This study proposes an approach to XRL that allows an agent to answer questions about its actions and policy in terms of their consequences. Other questions unique to RL are also possible, for example those that ask about the time it takes to obtain some goal, or those about RL-specific problems (loop behavior, lack of exploration or exploitation, etc.). However, we believe that a non-expert in RL is mostly interested in the expected consequences of the agent's learned behavior and whether the agent finds these consequences good or bad. This information can be used as an argument for why the agent behaves in some way. It would allow human users to gain insight into what information the agent can perceive from a state and which outcomes it expects from an action or state visit.
Furthermore, to limit the amount of information about all consequences, our proposed method aims to support contrastive explanations [Miller, 2017]. Contrastive explanations are a way of answering causal 'why'-questions. In such questions, two potential items, the fact and the foil, are compared to each other in terms of their causal effects on the world. Contrastive questions come naturally between humans and offer an intuitive way of gathering motivations about why one performs a certain action instead of another [Miller, 2017]. In our case, we allow the user to formulate a question of why the learned policy $\pi_t$ (the 'fact') is used instead of some other policy $\pi_f$ (the 'foil') that is of interest to the user. Furthermore, our proposed method translates the set of states and actions into a set of more descriptive state classes $C$ and action outcomes $O$, similar to [Hayes and Shah, 2017]. This allows the user to query the agent in a more natural way, as well as to receive more informative explanations, as both refer to the same concepts instead of plain features. The translation of state features to more high-level concepts, and of actions in specific states to outcomes, is also done in the algorithm proposed by [Sherstov and Stone, 2005]. There, the translation was used to facilitate transfer learning within a single action over multiple tasks and domains. In our method, we use it to create a user-interpretable variant of the underlying Markov Decision Problem (MDP).
For the purpose of implementation and evaluation of our proposed method, we performed a pilot study. In this study, a number of explanation examples were presented to participants to see which of their varying properties are preferred most. One of the properties was whether the participants prefer explanations about the expected consequences of a single action or of the entire policy.
2 Approach for consequence-based explanations
The underlying Markov Decision Problem (MDP) of an RL agent consists of the tuple $\langle S, A, R, T, \lambda \rangle$. Here, $S$ and $A$ are the sets of states (each described by a feature vector) and actions respectively, $R: S \times A \to \mathbb{R}$ is the reward function, and $T: S \times A \to Pr(S)$ is the transition function that provides a probability distribution over states. Also, $\lambda$ is the discount factor that governs how much of future rewards is taken into account by the agent. This tuple provides the information required to derive the consequences of the learned policy $\pi_t$ or of the foil policy $\pi_f$ from the user's question, as one can use the transition function $T$ to sample the effects of both $\pi_t$ and $\pi_f$. In case $T$ is not explicit, one may use a separate ML model to learn it in addition to the actual agent. Through this simulation, one constructs a Markov chain of state visits under each of the policies $\pi_t$ and $\pi_f$ and can present the difference to the user.
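The simulation step above can be illustrated with a minimal Python sketch. It assumes a small tabular transition model stored as a dictionary, and the names `sample_next`, `visit_frequencies` and `visit_difference` are hypothetical. Each simulated Markov chain is summarised by how often every state is visited under $\pi_t$ and under $\pi_f$; this frequency summary is one possible way to present the difference, not necessarily the one intended here.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

State = int
Action = str
# Hypothetical tabular model: (s, a) -> [(s', Pr(s' | s, a)), ...]
TransitionModel = Dict[Tuple[State, Action], List[Tuple[State, float]]]


def sample_next(T: TransitionModel, s: State, a: Action, rng: random.Random) -> State:
    """Draw one successor state s' ~ T(s, a)."""
    next_states, probs = zip(*T[(s, a)])
    return rng.choices(next_states, weights=probs, k=1)[0]


def visit_frequencies(T: TransitionModel, policy: Callable[[State], Action],
                      s0: State, n: int, runs: int, seed: int = 0) -> Dict[State, float]:
    """Relative visit frequency of each state over `runs` simulated rollouts of n steps."""
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(runs):
        s = s0
        for _ in range(n):
            s = sample_next(T, s, policy(s), rng)
            counts[s] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


def visit_difference(freq_fact: Dict[State, float],
                     freq_foil: Dict[State, float]) -> Dict[State, float]:
    """Visit frequency under the fact policy minus that under the foil policy."""
    states = set(freq_fact) | set(freq_foil)
    return {s: freq_fact.get(s, 0.0) - freq_foil.get(s, 0.0) for s in states}


# Toy corridor 0..4; moves are deterministic here only to keep the numbers readable.
T: TransitionModel = {}
for s in range(5):
    T[(s, "R")] = [(min(s + 1, 4), 1.0)]
    T[(s, "L")] = [(max(s - 1, 0), 1.0)]
fact = visit_frequencies(T, lambda s: "R", s0=2, n=2, runs=10)  # pi_t: always right
foil = visit_frequencies(T, lambda s: "L", s0=2, n=2, runs=10)  # pi_f: always left
diff = visit_difference(fact, foil)
```

Positive entries in `diff` mark states that are visited more often under the fact than under the foil; with a stochastic $T$ the estimates become noisier and more rollouts are needed.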
Through the simulation of future states with $T$, information can be gathered about state consequences. In turn, the state or state-action values for simulated state visits can be obtained from the agent itself to develop an explanation in terms of rewards. However, the issue with this approach is that the state features and rewards may not be easy for a user to understand, as they would consist of possibly low-level concepts and numerical reward values or expected returns. To mitigate this issue, we can apply a translation of the states and actions to a set of predefined state concepts and outcomes. These concepts can be designed to be more descriptive and informative for the potential user. One way to do this translation is by training a set of binary classifiers to recognize each outcome or state concept from the state features and the taken action, an approach similar to that of [Hayes and Shah, 2017]. These classifiers can be trained during the exploratory learning process of the agent. This translation allows us to use the simulation method described above and to transform the state features and results of actions into more user-interpretable descriptions.
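A sketch of the classifier-based translation, under stated assumptions: states are plain numeric feature vectors, each concept has Boolean labels collected during exploration, and each concept gets its own perceptron. The text does not specify a classifier type; the perceptron is used here only because it needs no external libraries. The concept names `east_half` and `north_half` and the helper names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Set

Features = Sequence[float]


def train_perceptron(X: List[Features], y: List[int], max_epochs: int = 5000) -> List[float]:
    """Perceptron for one binary concept; the last weight is the bias."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            xa = list(xi) + [1.0]  # append constant input for the bias
            active = sum(wj * xj for wj, xj in zip(w, xa)) > 0
            if active != bool(yi):  # misclassified: move the boundary
                step = 1.0 if yi else -1.0
                w = [wj + step * xj for wj, xj in zip(w, xa)]
                errors += 1
        if errors == 0:  # every training example is classified correctly
            break
    return w


def concept_active(w: List[float], x: Features) -> bool:
    return sum(wj * xj for wj, xj in zip(w, list(x) + [1.0])) > 0


def train_state_translator(X: List[Features],
                           labels: Dict[str, List[int]]) -> Callable[[Features], Set[str]]:
    """Train one classifier per concept and return k(s): the set of active concepts."""
    models = {name: train_perceptron(X, y) for name, y in labels.items()}
    return lambda s: {name for name, w in models.items() if concept_active(w, s)}


X = [(x, y) for x in range(5) for y in range(5)]  # logged (x, y) state features
labels = {
    "east_half": [int(x >= 3) for x, _ in X],   # hypothetical concept labels
    "north_half": [int(y >= 3) for _, y in X],
}
k = train_state_translator(X, labels)
```

A perceptron only reaches zero training error when a concept is linearly separable in the chosen features; for concepts such as "next to a forest" a richer feature set or a different classifier would likely be needed.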
2.1 A user-interpretable MDP
The original set of states can be transformed into a more descriptive set $C$ according to the function $k: S \to C$. This is similar to the approach of [Hayes and Shah, 2017], where $k$ consists of a number of classifiers. Also, rewards can be explained in terms of a set of action outcomes $O$ according to $t: C \times A \to Pr(O)$. This provides the results of an action in some state in terms of the concepts $O$, for example the outcomes that the developer had in mind when designing the reward function $R$. The transformation of states and actions into state classes and outcomes is adopted from the work of [Sherstov and Stone, 2005], where the transformations are used to allow for transfer learning in RL. Here, however, we use them as a translation towards a more user-interpretable representation of the actual MDP.
Figure 1: An overview of the proposed method; a dotted line represents a feedback loop. We assume a general reinforcement learning agent that acts upon a state $s$ through some action $a$ and receives a reward $r$. We train a transition model $T$ that can be used to simulate the effect of actions on states. By repeatedly simulating a state $s_i$ we can obtain the expected consequences $\gamma$ of an entire policy. Also, the consequences of a contrastive policy consisting of alternative courses of action $a_f$ can be simulated with the same transition model $T$. Finally, in constructing the explanation we transform states and actions into user-interpretable concepts and construct an explanation that is contrastive.
The result is the new MDP tuple $\langle S, A, R, T, \lambda, C, O, t, k \rangle$. An RL agent is still trained on $S$, $A$, $R$ and $T$ with $\lambda$, independently of the descriptive sets $C$ and $O$ and the functions $k$ and $t$. This makes the transformation independent of the RL algorithm used to train the agent. See Figure 1 for an overview of this approach.
As an example, take the grid world illustrated in Figure 2, which shows an agent in a simple myopic navigation task. The states $S$ are the $(x, y)$ coordinates and the presence of a forest, monster or trap in adjacent tiles, with $A = \{Up, Down, Left, Right\}$. $R$ consists of a small transient penalty, a slightly larger penalty for tiles with a forest, a large penalty shared over all terminal states (traps or tiles adjacent to the monster) and a large positive reward for the finishing state. $T$ is skewed towards the intended result, with small probabilities for the other results where possible.
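The skewed transition function $T$ of this example could look as follows. This is a sketch under assumptions not stated in the text: a 5×5 grid with $y$ increasing upwards, a probability of 0.8 for the intended tile, the remaining mass spread evenly over the other tiles reachable with a different action, and no modelling of terminal states or of the moving monster. The function names are hypothetical.

```python
from typing import Dict, Tuple

Pos = Tuple[int, int]
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}


def step(pos: Pos, action: str, width: int, height: int) -> Pos:
    """Deterministic effect of an action; bumping into a wall leaves the agent in place."""
    dx, dy = MOVES[action]
    x, y = pos[0] + dx, pos[1] + dy
    return (x, y) if 0 <= x < width and 0 <= y < height else pos


def transition(pos: Pos, action: str, width: int = 5, height: int = 5,
               p_intended: float = 0.8) -> Dict[Pos, float]:
    """T(s, a) skewed towards the intended tile; the rest goes to other reachable tiles."""
    intended = step(pos, action, width, height)
    others = {step(pos, a, width, height) for a in MOVES if a != action} - {intended}
    if not others:  # no alternative result is possible
        return {intended: 1.0}
    dist = {intended: p_intended}
    for tile in others:
        dist[tile] = (1.0 - p_intended) / len(others)
    return dist


dist = transition((0, 0), "Right")  # start tile in the bottom-left corner
```

In the corner tile, moving down or left bumps into a wall, so both collapse onto staying in place and share one small probability with moving up.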
The state transformation $k$ can consist of a set of classifiers for the predicates of whether the agent is next to a forest, a wall, a trap or the monster, or in the forest. Applying $k$ to some state $s \in S$ results in a Boolean vector $c \in C$ whose information can be used to construct an explanation in terms of the stated predicates. The similar outcome transformation $t$ may predict the probability of the outcomes $O$ given a state and action. In our example, $O$ consists of whether the agent will be at the goal, in a trap, next to the monster or in the forest. Each outcome $o$ can be flagged as being positive ($o^+$) or negative ($o^-$), purely so that they can be presented differently in the eventual explanation.
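The transformations $k$ and $t$ for this example can be sketched with hand-coded rules standing in for trained classifiers. The map below (goal, forest, trap and monster positions) is hypothetical and only loosely follows Figure 2. As in Equation 5 further on, $t$ is evaluated on the raw state; it pushes a (here deterministic) transition model through simple outcome rules to obtain outcome probabilities. The polarity flags in `OUTCOME_SIGN` mark $o^+$ and $o^-$.

```python
from typing import Callable, Dict, Set, Tuple

Pos = Tuple[int, int]
# Hypothetical layout loosely inspired by Figure 2 (not the paper's actual map).
WIDTH, HEIGHT = 5, 5
GOAL: Pos = (4, 4)
FOREST = {(1, 2)}
TRAPS = {(3, 2)}
MONSTER: Pos = (2, 4)
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
# Polarity flags o+ / o-, used only to phrase the explanation.
OUTCOME_SIGN = {"at_goal": "+", "in_trap": "-", "next_to_monster": "-", "in_forest": "-"}


def adjacent(p: Pos, q: Pos) -> bool:
    return abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1


def k(s: Pos) -> Dict[str, bool]:
    """State concepts C as a Boolean vector (hand-coded rules instead of classifiers)."""
    x, y = s
    return {
        "in_forest": s in FOREST,
        "next_to_forest": any(adjacent(s, f) for f in FOREST),
        "next_to_trap": any(adjacent(s, tr) for tr in TRAPS),
        "next_to_monster": adjacent(s, MONSTER),
        "next_to_wall": x in (0, WIDTH - 1) or y in (0, HEIGHT - 1),
    }


def deterministic_T(s: Pos, a: str) -> Dict[Pos, float]:
    dx, dy = MOVES[a]
    x, y = s[0] + dx, s[1] + dy
    nxt = (x, y) if 0 <= x < WIDTH and 0 <= y < HEIGHT else s
    return {nxt: 1.0}


def t(s: Pos, a: str,
      T: Callable[[Pos, str], Dict[Pos, float]] = deterministic_T) -> Dict[str, float]:
    """Outcome probabilities Pr(o | s, a), accumulated over the successors in T(s, a)."""
    probs = {o: 0.0 for o in OUTCOME_SIGN}
    for nxt, p in T(s, a).items():
        if nxt == GOAL:
            probs["at_goal"] += p
        if nxt in TRAPS:
            probs["in_trap"] += p
        if adjacent(nxt, MONSTER):
            probs["next_to_monster"] += p
        if nxt in FOREST:
            probs["in_forest"] += p
    return probs


def flagged(outcome_probs: Dict[str, float], sign: str) -> Set[str]:
    """Outcomes with non-zero probability that carry the given polarity flag."""
    return {o for o, p in outcome_probs.items() if p > 0 and OUTCOME_SIGN[o] == sign}
```

For instance, `t((1, 1), "Up")` puts all probability on `in_forest`, which `flagged(..., "-")` reports as a negative outcome.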
Given the above transformations, we can simulate the next state of a single action $a$ with $T$, or even the entire chain of actions and visited states given some policy $\pi$. These can then be transformed into state descriptions $C$ and action outcomes $O$ to form the basis of an explanation. As mentioned, humans usually ask contrastive questions, especially regarding actions [Miller, 2017]. In the next section we propose a method of translating the foil in a contrastive question into a new policy.
Figure 2: A simple RL problem where the agent has to navigate from the bottom left to the top right (goal) while evading traps, a monster and a forest. The agent terminates when in a tile with a trap or adjacent to the monster. The traps and the monster only occur in the red-shaded area, and as soon as the agent enters this area the monster moves towards the agent.
2.2 Contrastive questions translated into value
A contrastive question consists of a fact and a foil, and its answer describes the contrast between the two from the fact's perspective [Miller, 2017]. In our case, the fact consists of the entire learned policy $\pi_t$, a single action from it, $a_t = \pi_t(s_t)$, or any number of consecutive actions from $\pi_t$. We propose a method for obtaining a foil policy $\pi_f$ based on the foil in the user's question. An example of such a question (framed within the case of Figure 2) could be:
“Why do you move up and then right (fact) instead of moving to the right until you hit a wall and then moving up (foil)?”
The foil policy $\pi_f$ is ultimately obtained by combining a state-action value function $Q_I$, which represents the user's preference for some actions according to his/her question, with the learned $Q_t$ to obtain $Q_f$:
$$Q_f(s, a) = Q_t(s, a) + Q_I(s, a), \quad \forall s \in S, a \in A \qquad (1)$$
Each state-action value function is of the form $Q: S \times A \to \mathbb{R}$. $Q_I$ only values the state-action pairs queried by the user.
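Equation 1 and the subsequent action selection can be illustrated on a tiny tabular example. The sketch assumes dictionary-based Q-tables and uses greedy selection as a stand-in for the agent's own action-selection mechanism; the state name `s0`, the values and the function names are made up.

```python
from typing import Dict, Hashable, Iterable, Tuple

QTable = Dict[Tuple[Hashable, str], float]


def combine(Q_t: QTable, Q_I: QTable) -> QTable:
    """Eq. 1: Q_f(s, a) = Q_t(s, a) + Q_I(s, a); pairs missing from Q_I add 0."""
    return {sa: q + Q_I.get(sa, 0.0) for sa, q in Q_t.items()}


def greedy(Q: QTable, s: Hashable, actions: Iterable[str]) -> str:
    """Greedy selection, standing in for the agent's own action-selection mechanism."""
    return max(actions, key=lambda a: Q[(s, a)])


actions = ["Up", "Right"]
Q_t = {("s0", "Up"): 1.0, ("s0", "Right"): 0.5}
Q_I = {("s0", "Right"): 0.75}  # the user's query only values ('s0', 'Right')
Q_f = combine(Q_t, Q_I)
```

Here the learned policy picks `Up` under $Q_t$, while the same greedy rule picks `Right` under $Q_f$, because $Q_I$ lifts that pair above the learned action.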
For instance, the $Q_I$ for the question above can be based on the following reward scheme for all potentially simulated $s \in S$:
The action $a^1_f$ = 'Right' receives a reward such that $Q_f(s, Right) > Q_t(s, \pi_t(s))$.
If 'RightWall' $\in k(s)$, then the action $a^2_f$ = 'Up' receives a reward such that $Q_f(s, Up) > Q_t(s, \pi_t(s))$.
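The queried foil above can be expressed as a rule over the concepts returned by $k(s)$, together with the value that $Q_I(s, a_f)$ must exceed for Equation 1 to give $Q_f(s, a_f) > Q_t(s, \pi_t(s))$, namely $Q_t(s, \pi_t(s)) - Q_t(s, a_f)$. The function names are hypothetical, and representing $k(s)$ as a set of active concept names is an assumption.

```python
from typing import Dict, Set, Tuple


def conveyed_foil_action(concepts: Set[str]) -> str:
    """The query 'move right until you hit a wall, then move up' as a rule on k(s).

    'RightWall' is the concept name used in the text; `concepts` is assumed to be
    the set of active predicates returned by k(s).
    """
    return "Up" if "RightWall" in concepts else "Right"


def minimal_bonus(Q_t: Dict[Tuple[str, str], float], s: str, a_f: str, a_t: str) -> float:
    """Value that Q_I(s, a_f) must exceed so that Q_f(s, a_f) > Q_t(s, a_t).

    Follows from Eq. 1: Q_f(s, a_f) = Q_t(s, a_f) + Q_I(s, a_f).
    """
    return Q_t[(s, a_t)] - Q_t[(s, a_f)]


Q_t = {("s", "Up"): 1.0, ("s", "Right"): 0.25}  # made-up learned values, pi_t(s) = 'Up'
a_f = conveyed_foil_action(set())              # no wall on the right yet -> 'Right'
bonus = minimal_bonus(Q_t, "s", a_f, "Up")
```

Any $Q_I(s, a_f)$ strictly above `bonus` makes the foil action preferred; the margin $\epsilon$ discussed below controls how far above it the reward is pushed.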
Given this reward scheme, we can train $Q_I$ and obtain $Q_f$ according to Equation 1. The state-action values $Q_f$ can then be used to obtain the policy $\pi_f$ using the original action selection mechanism of the agent. This results in a policy that tries to follow the queried policy as well as it can. The advantage of constructing $\pi_f$ from $Q_f$ is that the agent is allowed to learn a different action than those in the user's question, as long as the reward is higher in the long run (i.e. more user-defined actions can be performed). Also, it allows for the simulation of the actual expected behavior of the agent, as it is still based on the agent's action selection mechanism. Neither would be the case if we simply forced the agent to do exactly what the user stated.
The construction of $Q_I$ is done through simulation with the help of the transition model $T$. The rewards given during the simulation are selected with Equation 1 in mind, as they eventually need to compensate for the originally learned action based on $Q_t$. Hence, the reward for each state and queried action is as follows:
$$R_I(s_i, a_f) = \frac{\lambda_f}{\lambda}\, w(s_i, s_t) \left[ R(s_i, a_f) - R(s_i, a_t) \right] (1 + \epsilon) \qquad (2)$$
with $a_t = \pi_t(s_t)$ the originally learned action and $w$ a distance-based weight:
$$w(s_i, s_t) = \exp\left( -\frac{d(s_i, s_t)^2}{2\sigma^2} \right) \qquad (3)$$
First, $s_i$ with $i \in \{t, t+1, \ldots, t+n\}$ is the $i$'th state in the simulation starting with $s_t$, and $a_f$ is the current foil action governed by the policy conveyed by the user. Taking $a_f$ as the only rewarding action each time greatly reduces the time needed to construct $Q_I$. Next, $w(s_i, s_t)$ is obtained from a Radial Basis Function (RBF) with a Gaussian kernel and distance function $d$. This RBF turns the distance between our actual state $s_t$ and the simulated state $s_i$ into a weight that decays exponentially. The Gaussian kernel is governed by the standard deviation $\sigma$ and allows us to reduce the effects of $Q_I$ as we get further from our actual state $s_t$. The ratio of discount factors $\frac{\lambda_f}{\lambda}$ compensates for the difference between the discount factor $\lambda$ of the original agent and the potentially different factor $\lambda_f$ for $Q_I$, if we wish $Q_I$ to be more shortsighted. Finally, $[R(s_i, a_f) - R(s_i, a_t)](1 + \epsilon)$ is the amount of reward that $a_f$ needs such that $Q_I(s_i, a_f) > Q(s_i, a_t)$, where $\epsilon > 0$ determines how much more $Q_I$ will prefer $a_f$ over $a_t$.
The parameter $n$ defines how many future state transitions we simulate and use to retrieve $Q_I$. As a general rule, $n \geq 3\sigma$, since at this point the Gaussian kernel reduces the contribution of $Q_I$ to near zero, such that $Q_f$ resembles $Q_t$. Hence, by setting $\sigma$ one can vary the number of states, starting from $s_t$, over which the foil policy is followed. Also, by setting $\epsilon$, the strength with which each $a_f$ is preferred over $a_t$ can be regulated. Finally, $\lambda_f$ defines how shortsighted $Q_I$ should be. If set to $\lambda_f = 0$, $\pi_f$ will force the agent to perform $a_f$ as long as $s_i$ is not too distant from $s_t$. If set to values near one, $\pi_f$ is allowed to take different actions as long as they result in more possibilities of performing $a_f$.
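The reward $R_I$, the weight $w$ and the rule of thumb $n \geq 3\sigma$ defined above can be checked numerically with the sketch below. It assumes that the distance $d(s_i, s_t)$ is already available as a number (for instance a Manhattan distance in the grid, which the text does not prescribe), uses a Gaussian kernel with standard deviation $\sigma$, and leaves the bracketed reward difference to the caller. The function names are hypothetical.

```python
import math


def rbf_weight(d: float, sigma: float) -> float:
    """Gaussian RBF weight w(s_i, s_t) of a simulated state at distance d from s_t."""
    return math.exp(-(d ** 2) / (2.0 * sigma ** 2))


def foil_reward(reward_gap: float, d: float, sigma: float,
                lam: float, lam_f: float, eps: float) -> float:
    """R_I = (lam_f / lam) * w(s_i, s_t) * reward_gap * (1 + eps).

    `reward_gap` is the bracketed reward difference, computed by the caller.
    """
    return (lam_f / lam) * rbf_weight(d, sigma) * reward_gap * (1.0 + eps)


def horizon(sigma: float) -> int:
    """Smallest integer n that satisfies the rule of thumb n >= 3 * sigma."""
    return math.ceil(3.0 * sigma)


sigma = 2.0
n = horizon(sigma)  # 6 simulated transitions
# Assuming the distance grows by one per simulated step:
weights = [rbf_weight(d, sigma) for d in range(n + 1)]
```

With $\sigma = 2$ the weight drops from 1 at $s_t$ to $e^{-4.5}$, roughly 0.011, at the last simulated step, which illustrates why $Q_f$ then closely resembles $Q_t$.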
2.3 Generating explanations
At this point we have the user-interpretable MDP consisting of state concepts $C$ and action outcomes $O$, provided by their respective transformation functions $k$ and $t$. Also, we have a definition of $R_I$ that values the actions and/or states that are of interest to the user, which can be used to train $Q_I$ through simulation and obtain $Q_f$ according to Equation 1. This provides us with the basis for obtaining the information needed to construct an explanation.
As mentioned before, the explanations are based on simulating the effects of $\pi_t$ and of $\pi_f$ (if defined by the user) with $T$. We can call $T$ on the previous state $s_{i-1}$ for the action $\pi(s_{i-1})$ to obtain $s_i$, and repeat this until $i = n$. The result is a single sequence or trajectory of visited states and performed actions for any policy $\pi$ starting from $s_t$:
$$\gamma(s_t, \pi) = \{(s_0, a_0), \ldots, (s_n, a_n) \mid T, \pi\} \qquad (4)$$
If $T$ is probabilistic, multiple simulations with the same policy and starting state may result in different trajectories. To obtain the most probable trajectory $\gamma(s_t, \pi)$, we can take the transition from $T$ with the highest probability. Otherwise, a Markov chain could be constructed instead of a single trajectory.
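A minimal sketch of the trajectory simulation in Equation 4 follows. Taking the most likely successor at each step, as described above, is a greedy choice and is not guaranteed to yield the overall most probable trajectory in general. The transition model `corridor_T` and the function name are hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = str


def most_probable_trajectory(T: Callable[[State, Action], Dict[State, float]],
                             policy: Callable[[State], Action],
                             s_t: State, n: int) -> List[Tuple[State, Action]]:
    """Eq. 4 with a per-step argmax of T: [(s_0, a_0), ..., (s_n, a_n)], s_0 = s_t."""
    trajectory: List[Tuple[State, Action]] = []
    s = s_t
    for i in range(n + 1):
        a = policy(s)
        trajectory.append((s, a))
        if i < n:
            dist = T(s, a)
            s = max(dist, key=dist.get)  # most likely successor; ties go to the first key
    return trajectory


def corridor_T(s: int, a: str) -> Dict[int, float]:
    # Hypothetical stochastic corridor: the intended move succeeds with probability 0.7.
    return {s + 1: 0.7, s: 0.3} if a == "R" else {s - 1: 0.7, s: 0.3}


gamma = most_probable_trajectory(corridor_T, lambda s: "R", s_t=0, n=3)
```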
The next step is to transform each state-action pair in $\gamma(s_t, \pi)$ into the user-interpretable description with the functions $k$ and $t$:
$$Path(s_t, \pi) = \{(c_0, o_0), \ldots, (c_n, o_n)\}, \quad c_i = k(s_i),\ o_i = t(s_i, a_i),\ \forall (s_i, a_i) \in \gamma(s_t, \pi) \qquad (5)$$
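Equation 5 can be sketched as a simple mapping over the simulated trajectory. Here $k(s)$ is assumed to return a dictionary of Boolean concepts and $t(s, a)$ a dictionary of outcome probabilities; keeping only outcomes with a probability of at least 0.5 is an illustrative assumption, as are the toy functions `toy_k` and `toy_t`.

```python
from typing import Callable, Dict, FrozenSet, Hashable, List, Tuple

Path = List[Tuple[FrozenSet[str], FrozenSet[str]]]


def to_path(trajectory: List[Tuple[Hashable, str]],
            k: Callable[[Hashable], Dict[str, bool]],
            t: Callable[[Hashable, str], Dict[str, float]],
            threshold: float = 0.5) -> Path:
    """Eq. 5: (c_i, o_i) = (k(s_i), t(s_i, a_i)) for every (s_i, a_i) in gamma(s_t, pi).

    Active concepts and outcomes with probability >= threshold are kept as
    frozensets so that paths can later be compared with set operations.
    """
    path: Path = []
    for s, a in trajectory:
        c = frozenset(name for name, active in k(s).items() if active)
        o = frozenset(name for name, p in t(s, a).items() if p >= threshold)
        path.append((c, o))
    return path


# Toy stand-ins for k and t (hypothetical, for illustration only).
def toy_k(s):
    return {"next_to_wall": s == 0, "in_forest": s == 1}


def toy_t(s, a):
    return {"at_goal": 1.0 if s == 2 else 0.0, "in_forest": 0.9 if s == 0 else 0.0}


path = to_path([(0, "R"), (1, "R"), (2, "R")], toy_k, toy_t)
```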
From $Path(s_t, \pi_t)$ an explanation can be constructed about the states the agent will most likely visit and the action outcomes it will obtain, for example with the following template:
“For the next $n$ actions I will mostly perform $a$. During these actions, I will come across situations with $c \in Path(s_t, \pi_t)$. This will cause me $o^+ \in Path(s_t, \pi_t)$ but also $o^- \in Path(s_t, \pi_t)$.”
Here, $a$ is the most common action in $\gamma(s_t, \pi_t)$, and $o^+$ and $o^-$ are the positive and negative action outcomes respectively. Since we have access to the entire simulation of a policy, a wide variety of explanations is possible. For instance, we could also focus on the less common actions:
“For the next $n$ actions I will perform $a_1$ when in situations with $c \in Path(s_t, \pi_t \mid \pi_t = a_1)$ and $a_2$ when in situations with $c \in Path(s_t, \pi_t \mid \pi_t = a_2)$. These actions prevent me from $o^+ \in Path(s_t, \pi_t)$ but also $o^- \in Path(s_t, \pi_t)$.”
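The first template can be filled in automatically from a simulated path, as in the sketch below. It assumes the path format of Equation 5 with concepts and outcomes stored as sets of names, plus a dictionary of polarity flags; treating the number of simulated actions as $n$ and listing names alphabetically are presentational choices of this sketch, and `explain_policy` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple

PathT = List[Tuple[FrozenSet[str], FrozenSet[str]]]


def explain_policy(actions: List[str], path: PathT, sign: Dict[str, str]) -> str:
    """Fill the first template for the fact policy from its actions and Path(s_t, pi_t)."""
    a = Counter(actions).most_common(1)[0][0]  # most common action in gamma(s_t, pi_t)
    concepts = sorted(set().union(*(c for c, _ in path)))
    outcomes = set().union(*(o for _, o in path))
    pos = sorted(o for o in outcomes if sign.get(o) == "+")
    neg = sorted(o for o in outcomes if sign.get(o) == "-")
    return (f"For the next {len(actions)} actions I will mostly perform {a}. "
            f"During these actions, I will come across situations with {', '.join(concepts)}. "
            f"This will cause me {', '.join(pos)} but also {', '.join(neg)}.")


actions = ["Up", "Up", "Right"]
path = [(frozenset({"next_to_wall"}), frozenset()),
        (frozenset({"next_to_forest"}), frozenset({"in_forest"})),
        (frozenset(), frozenset({"at_goal"}))]
sign = {"at_goal": "+", "in_forest": "-"}
sentence = explain_policy(actions, path, sign)
```

A practical version would also need phrasing for the case where no positive or no negative outcome occurs, which this sketch leaves out.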
A contrastive explanation, given a question from the user that describes the foil policy $\pi_f$, can be constructed in a similar manner by taking the contrast. Given a foil, we can focus on the differences between $Path(s_t, \pi_t)$ and $Path(s_t, \pi_f)$. These can be obtained by taking the relative complement $Path(s_t, \pi_t) \setminus Path(s_t, \pi_f)$, the set of expected consequences unique to behaving according to $\pi_t$ rather than $\pi_f$. A more extensive explanation can be given by taking the symmetric difference $Path(s_t, \pi_t) \,\triangle\, Path(s_t, \pi_f)$ to explain the unique differences between both policies.
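The relative complement and the symmetric difference map directly onto Python's set operators `-` and `^`. The sketch below flattens each path into the set of concepts and outcomes it contains before comparing; this is one possible reading that ignores the order of steps and the pairing of $c_i$ with $o_i$, and the example paths and function names are invented.

```python
from typing import FrozenSet, List, Set, Tuple

Path = List[Tuple[FrozenSet[str], FrozenSet[str]]]  # [(c_i, o_i), ...] as in Eq. 5


def consequences(path: Path) -> Set[str]:
    """Flatten a path into the set of concepts and outcomes it contains."""
    return {item for c, o in path for item in c | o}


def contrast(path_fact: Path, path_foil: Path, symmetric: bool = False) -> Set[str]:
    """Relative complement (fact minus foil) or, if symmetric, the symmetric difference."""
    fact, foil = consequences(path_fact), consequences(path_foil)
    return fact ^ foil if symmetric else fact - foil


path_fact = [(frozenset({"next_to_wall"}), frozenset()),
             (frozenset({"next_to_forest"}), frozenset({"at_goal"}))]
path_foil = [(frozenset({"next_to_wall"}), frozenset({"in_forest"})),
             (frozenset({"next_to_trap"}), frozenset({"at_goal"}))]
unique_to_fact = contrast(path_fact, path_foil)
all_differences = contrast(path_fact, path_foil, symmetric=True)
```

Consequences shared by both policies, such as reaching the goal here, drop out of both contrasts, which is what keeps the resulting explanation short.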
3 User study
The method proposed above allows an RL agent to explain and motivate its behavior in terms of expected states and outcomes. It also enables the construction of contrastive explanations, in which any policy can be compared to the learned policy. Such a contrastive explanation is based on the differences in expected outcomes between the compared policies.
We performed a small user study in which 82 participants were shown a number of exemplar explanations about the case shown in Figure 2. These explanations addressed either the single next action or the policy. Both kinds of explanation can be generated by the above method by adjusting the Radial Basis Function weighting scheme and/or the foil's discount factor. Also, some example explanations were contrastive with only the second-best action or policy, while others provided all consequences. Contrasts were determined using the relative complement between fact and foil. Whether the learned action or policy was treated as the fact or the foil was also systematically manipulated in this study.
We presented the developed exemplar explanations in pairs to the participants and asked them to select the explanation that helped them most to understand the agent's behavior. Afterwards, we asked which of the following properties they used to assess their preference: long versus short explanations; explanations with ample information versus little information; explanations addressing actions versus those addressing strategies (policies); and explanations addressing short-term consequences of actions versus those addressing distant consequences.
The results for the preferred factors are shown in Figure 3. They show that the participants prefer explanations that address strategy and policy, and that provide ample information. We note here that, given the simple case from Figure 2, participants may have considered an explanation addressing only a single action as trivial, because the optimal action was, in most cases, already evident to the user.
4 Conclusion
We proposed a method for a reinforcement learning (RL) agent to generate explanations for its actions and strategies. The explanations are based on the expected consequences of its policy. These consequences were obtained through simulation according to a (learned) state transition model. Since state features and numerical rewards do not easily lend themselves to an explanation that is informative to humans, we developed a framework that translates states and actions into user-interpretable concepts and outcomes.
Figure 3: A plot depicting the percentage of participants (y-axis) for each explanation property (x-axis) that caused them to prefer some explanations over others, based on the answers of a total of 82 participants.
We also proposed a method for converting the foil (the policy of interest to the user) of a contrastive 'why'-question about actions into a policy. This policy locally follows the user's query but gradually reverts to the original learned policy. It favors the actions that are of interest to the user, such that the agent tries to perform them as well as possible. How much these actions are favored compared to the originally learned action can be set with a single parameter.
By running simulations for a given number of steps of both the policy derived from the user's question and the actually learned policy, we were able to obtain the expected consequences of each. From these, we were able to construct contrastive explanations: explanations addressing the consequences of the learned policy and what would be different if the derived policy had been followed.
An online pilot survey study was conducted to explore which of several explanations are most preferred by human users. Results indicate that users prefer explanations about policies rather than about single actions.
Future work will focus on implementing the method on complex RL benchmarks to explore the scalability of this approach in realistic cases. This is important given the computational costs of simultaneously simulating the consequences of different policies in large state spaces. We will also explore further methods to construct our translation functions from states and actions to concepts and outcomes. A more extensive user study will be carried out to evaluate the instructional value of generated explanations in more detail, and to explore the relationship between explanations and users' trust in the agent's performance.
We would like to thank the reviewers for their time and effort in improving this paper. We are also grateful for the funding from the RVO Man Machine Teaming research project that made this research possible.
[Gosavi, 2009] Abhijit Gosavi. Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing, 21(2):178–192, 2009.
[Guidotti et al., 2018] Riccardo Guidotti, Anna Monreale, Franco Turini, Dino Pedreschi, and Fosca Giannotti. A survey of methods for explaining black box models. arXiv preprint arXiv:1802.01933, 2018.
[Hayes and Shah, 2017] Bradley Hayes and Julie A Shah. Improving robot controller transparency through autonomous policy explanation. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, pages 303–312. ACM, 2017.
[Hein et al., 2017] Daniel Hein, Steffen Udluft, and Thomas A Runkler. Interpretable policies for reinforcement learning by genetic programming. arXiv preprint arXiv:1712.04170, 2017.
[Kulesza et al., 2015] Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces, pages 126–137. ACM, 2015.
[Lipton, 2016] Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
[Lundberg and Lee, 2016] Scott Lundberg and Su-In Lee. An unexpected unity among methods for interpreting model predictions. arXiv preprint arXiv:1611.07478, 2016.
[Miller, 2017] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. arXiv preprint arXiv:1706.07269, 2017.
[Papernot and McDaniel, 2018] Nicolas Papernot and Patrick McDaniel. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765, 2018.
[Ribeiro et al., 2016] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
[Samek et al., 2016] Wojciech Samek, Grégoire Montavon, Alexander Binder, Sebastian Lapuschkin, and Klaus-Robert Müller. Interpreting the predictions of complex ML models by layer-wise relevance propagation. arXiv preprint arXiv:1611.08191, 2016.
[Sherstov and Stone, 2005] Alexander A Sherstov and Peter Stone. Improving action selection in MDP's via knowledge transfer. In AAAI, volume 5, pages 1024–1029, 2005.
[Shu et al., 2017] Tianmin Shu, Caiming Xiong, and Richard Socher. Hierarchical and interpretable skill acquisition in multi-task reinforcement learning. arXiv preprint arXiv:1712.07294, 2017.
[Verma et al., 2018] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. arXiv preprint arXiv:1804.02477, 2018.
[Weller, 2017] Adrian Weller. Challenges for transparency. arXiv preprint arXiv:1708.01870, 2017.
... Our agent's policy (Section 4.2) resembles those derived by Reinforcement Learning (RL) algorithms [60]. Explanations for the actions of RL agents have been forward looking, presenting goals or situations to be reached or avoided [19,30,41,62]; or have combined forward-with backward-looking information (about previous interactions with the environment) [17]. ...
... However, they can readily be incorporated into the formalism described in Section 4.2 for other configurations.8 An interesting avenue of investigation is the automatic derivation of rationales, e.g., by using binary classifiers to recognize high-level concepts from low-level state features and actions taken[25,62], or learning a mapping from an agent's actions to signals given by human experts[20]. ...
Full-text available
We present insights obtained from two user studies performed in the context of a web-based game set in a care-taking scenario — a retirement village, where elderly residents live in smart homes equipped with monitoring systems. These systems should raise alerts when adverse events happen, but they do not function perfectly (they may issue false alerts or miss true events). Players, who “work” in the village, perform a primary task whereby they must ensure the welfare of the residents by attending to adverse events in a timely manner, and a secondary routine task that demands their attention. In the first user study, we investigate the relationship between the performance of different monitoring systems, in terms of error type, and user behaviour and trust in these systems. In the second study, we examine the effect of the advice offered by an advisor agent on users’ behaviour. Our contributions are (1) the game itself, which supports experimentation with various trust-related factors, and a version of the game augmented with an advisor agent; (2) a methodology for calibrating the parameters of the game; (3) insights regarding the relationship between device accuracy and user behaviour and trust in automation; (4) findings about the training effect of an advisor agent; and (5) insights from predictive models about factors that influence trust and behaviour.
... Hayes et al. proposed a natural-language question-answering system that provides an explanation of what an IA does in a particular situation [1]. Waa et al. proposed a method for explaining not a one-shot action but a sequence of actions [3]. ...
... Strictly speaking, this is a global explanation based on a Markov decision-process model. Waa et al. proposed a method for explaining not a one-shot action but a sequence of actions [3]. However, they focused on an IA in a grid world, and challenges remain for applying it to another domain. ...
Intelligent agents (IAs) that use machine learning for decision-making often lack explainability regarding what they are going to do, which makes human-IA collaboration challenging. However, previous methods of explaining IA behavior require IA developers to predefine vocabulary that expresses motion, which is problematic as IA decision-making becomes complex. This paper proposes Manifestor, a method for explaining an IA’s future motion with autonomous vocabulary learning. With Manifestor, an IA can learn vocabulary from a person’s instructions about how the IA should act. A notable contribution of this paper is that we formalized the communication gap between a person and IA in the vocabulary-learning phase, that is, the IA’s goal may be different from what the person wants the IA to achieve, and the IA needs to infer the latter to judge whether a motion matches that person’s instruction. We evaluated Manifestor by investigating whether people can accurately predict an IA’s future motion with explanations generated with Manifestor. We compared Manifestor’s vocabulary with that of "optimal", acquired in a situation in which the communication-gap problem did not exist, and that of "ablation", which was learned with the false assumption that an IA and person shared a goal. The experimental results revealed that vocabulary learned with Manifestor improved people’s prediction accuracy as much as "optimal" did, while "ablation" failed, suggesting that Manifestor can enable an IA to properly learn vocabulary from people’s instructions even if a communication gap exists.
... Transparency can be achieved in the context of interactive behaviors by explaining the robot's decision making process to the human and by exposing the robot's internal state. Several solutions have been proposed to improve the explainability of robot reinforcement learning [52,55,27,45,75,6,5]. For example, Likmeta et al. [45] introduced an interpretable rule-based controller for the transparency of RL in transportation applications. ...
Deployment of reinforcement learning algorithms for robotics applications in the real world requires ensuring the safety of the robot and its environment. Safe robot reinforcement learning (SRRL) is a crucial step towards achieving human-robot coexistence. In this paper, we envision a human-centered SRRL framework consisting of three stages: safe exploration, safety value alignment, and safe collaboration. We examine the research gaps in these areas and propose to leverage interactive behaviors for SRRL. Interactive behaviors enable bi-directional information transfer between humans and robots, such as conversational robot ChatGPT. We argue that interactive behaviors need further attention from the SRRL community. We discuss four open challenges related to the robustness, efficiency, transparency, and adaptability of SRRL with interactive behaviors.
... Artificial Intelligence (XAI) that could be integrated in this framework, specifically: explainable RL [64,65]. This would help with explainability of the agents and interpreting their decisions. ...
Climate Change is an incredibly complicated problem that humanity faces. When many variables interact with each other, it can be difficult for humans to grasp the causes and effects of the very large-scale problem of climate change. The climate is a dynamical system, where small changes can have considerable and unpredictable repercussions in the long term. Understanding how to nudge this system in the right ways could help us find creative solutions to climate change. In this research, we combine Deep Reinforcement Learning and a World-Earth system model to find, and explain, creative strategies to a sustainable future. This work extends that of Strnad et al., expanding both the method and the analysis in multiple directions. We use four different Reinforcement Learning agents varying in complexity to probe the environment in different ways and to find various strategies. The environment is a low-complexity World Earth system model where the goal is to reach a future where all the energy for the economy is produced by renewables by enacting different policies. We use a reward function based on planetary boundaries that we modify to force the agents to find a wider range of strategies. To favour applicability, we slightly modify the environment, by injecting noise and making it fully observable, to understand the impacts of these factors on the learning of the agents.
... Based on the type of data they work with, post-hoc methods can also be split into two areas. The first class of methods only work with structured data, such as genetic programming [20], VIPER [7] and other rule-based methods [21,22,23], expected consequences [24], causal lens [25], complementary RL [26], etc. These methods often adopt existing forms of interpretable models from supervised learning, such as rules [7,26]. ...
While deep reinforcement learning has proven to be successful in solving control tasks, the "black-box" nature of an agent has received increasing concerns. We propose a prototype-based post-hoc policy explainer, ProtoX, that explains a blackbox agent by prototyping the agent's behaviors into scenarios, each represented by a prototypical state. When learning prototypes, ProtoX considers both visual similarity and scenario similarity. The latter is unique to the reinforcement learning context, since it explains why the same action is taken in visually different states. To teach ProtoX about visual similarity, we pre-train an encoder using contrastive learning via self-supervised learning to recognize states as similar if they occur close together in time and receive the same action from the black-box agent. We then add an isometry layer to allow ProtoX to adapt scenario similarity to the downstream task. ProtoX is trained via imitation learning using behavior cloning, and thus requires no access to the environment or agent. In addition to explanation fidelity, we design different prototype shaping terms in the objective function to encourage better interpretability. We conduct various experiments to test ProtoX. Results show that ProtoX achieved high fidelity to the original black-box agent while providing meaningful and understandable explanations.
This paper presents a model of contrastive explanation using structural causal models. The topic of causal explanation in artificial intelligence has gathered interest in recent years as researchers and practitioners aim to increase trust and understanding of intelligent decision-making. While different sub-fields of artificial intelligence have looked into this problem with a sub-field-specific view, there are few models that aim to capture explanation more generally. One general model is based on structural causal models. It defines an explanation as a fact that, if found to be true, would constitute an actual cause of a specific event. However, research in philosophy and social sciences shows that explanations are contrastive: that is, when people ask for an explanation of an event (the fact), they are (sometimes implicitly) asking for an explanation relative to some contrast case; that is, ‘Why P rather than Q?’. In this paper, we extend the structural causal model approach to define two complementary notions of contrastive explanation, and demonstrate them on two classical problems in artificial intelligence: classification and planning. We believe that this model can help researchers in subfields of artificial intelligence to better understand contrastive explanation.
Automated decision-making systems become increasingly powerful due to higher model complexity. While powerful in prediction accuracy, Deep Learning models are black boxes by nature, preventing users from making informed judgments about the correctness and fairness of such an automated system. Explanations have been proposed as a general remedy to the black box problem. However, it remains unclear if effects of explanations on user trust generalise over varying accuracy levels. In an online user study with 959 participants, we examined the practical consequences of adding explanations for user trust: We evaluated trust for three explanation types on three classifiers of varying accuracy. We find that the influence of our explanations on trust differs depending on the classifier’s accuracy. Thus, the interplay between trust and explanations is more complex than previously reported. Our findings also reveal discrepancies between self-reported and behavioural trust, showing that the choice of trust measure impacts the results.
We develop a Reinforcement Learning (RL) framework for improving an existing behavior policy via sparse, user-interpretable changes. Our goal is to make minimal changes while gaining as much benefit as possible. We define a minimal change as having a sparse, global contrastive explanation between the original and proposed policy. We improve the current policy with the constraint of keeping that global contrastive explanation short. We demonstrate our framework with a discrete MDP and a continuous 2D navigation domain.
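The idea of a sparse, global contrastive explanation between an original and a proposed policy can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' formulation: policies are hypothetical tabular dictionaries, the "explanation" is simply the list of states where the two policies act differently, and the proposed policy is built by a greedy rule that switches at most `budget` states to the action with the highest value in an assumed Q-table for the original policy.

```python
from typing import Dict, List, Tuple

Policy = Dict[str, str]              # hypothetical tabular policy: state -> action
QTable = Dict[str, Dict[str, float]]  # assumed Q-values of the original policy


def contrastive_explanation(original: Policy,
                            proposed: Policy) -> List[Tuple[str, str, str]]:
    """Global contrast: (state, original action, proposed action) wherever they differ."""
    return [(s, a, proposed[s]) for s, a in original.items() if proposed[s] != a]


def sparse_improvement(original: Policy, q: QTable, budget: int) -> Policy:
    """Greedy sketch: switch at most `budget` states to the action that is
    greedy w.r.t. q, choosing the states with the largest value gain."""
    gains = []
    for s, a in original.items():
        best = max(q[s], key=q[s].get)
        gain = q[s][best] - q[s][a]
        if gain > 0:
            gains.append((gain, s, best))
    gains.sort(reverse=True)  # largest gains first
    proposed = dict(original)
    for _, s, best in gains[:budget]:
        proposed[s] = best
    return proposed


# Illustrative data only.
original = {"s0": "left", "s1": "left", "s2": "stay"}
q = {"s0": {"left": 1.0, "right": 0.5, "stay": 0.8},
     "s1": {"left": 0.2, "right": 0.9, "stay": 0.3},
     "s2": {"left": 0.1, "right": 0.4, "stay": 0.35}}

if __name__ == "__main__":
    proposed = sparse_improvement(original, q, budget=1)
    for s, old, new in contrastive_explanation(original, proposed):
        print(f"In {s}: {new} instead of {old}")
```

With `budget=1` only `s1` changes, so the explanation has a single entry ("In s1: right instead of left"). If `q` holds the exact Q-values of the original policy, every switched state takes an action at least as valuable as the original one, so the policy improvement theorem ensures the proposed policy is no worse than the original; the budget is what keeps the contrastive explanation short.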
This article considers the ways that explainable AI can be used to help secure human-interactive robots. To do so, we acknowledge that robots interact with a variety of people. For example, some people may operate robots that perform tasks in their homes or offices, while other people may be tasked with defending robots from potential attackers. We describe how explainable AI can be used to help the human operators of robots appropriately calibrate the trust they have in their systems, and we demonstrate this through an implementation. We also describe a novel generalizable human-in-the-loop framework based on control loops to characterize and explain attacks on robots to a robot defender. We explore the utility of such a framework through an analysis of its application in the incident management process, applied to robots. This framework allows formal definition of explainability, and the necessary condition for explainability in robots. The overarching goal of this article is to introduce the application of explainability for security of robotics as a novel area of research, therefore, we also discuss several open research problems we uncovered while applying explainable AI to security of robots.