Contrastive explanations for reinforcement learning in terms of expected
consequences
J. van der Waa¹, J. van Diggelen¹, K. van den Bosch¹, M. Neerincx¹
¹Netherlands Organisation for Applied Scientific Research
jasper.vanderwaa@tno.nl
Abstract
Machine Learning models become increasingly
proficient in complex tasks. However, even for
experts in the field, it can be difficult to under-
stand what the model learned. This hampers trust
and acceptance, and it obstructs the possibility to
correct the model. There is therefore a need for
transparency of machine learning models. The de-
velopment of transparent classification models has
received much attention, but there are few devel-
opments for achieving transparent Reinforcement
Learning (RL) models. In this study we propose a
method that enables an RL agent to explain its behavior
in terms of the expected consequences of state
transitions and outcomes. First, we define a
translation of states and actions into a description that
is easier for human users to understand. Second,
we develop a procedure that enables the agent
to obtain the consequences of a single action, as
well as of its entire policy. The method calculates
contrasts between the consequences of the policy
derived from the user's query and those of the agent's
learned policy. Third, we construct a format for
generating explanations. A pilot survey study was
conducted to explore user preferences for different
explanation properties. Results indicate that
human users tend to favor explanations about policy
rather than about single actions.
1 Introduction
Complex machine learning (ML) models such as Deep Neu-
ral Networks (DNNs) and Support Vector Machines (SVMs)
perform very well on a wide range of tasks [Lundberg and
Lee, 2016], but their outcomes are often difficult for
humans to understand [Weller, 2017]. Moreover, machine
learning models cannot explain how they achieved their re-
sults. Even for experts in the field, it can be very difficult
to understand what the model actually learned [Samek et al.,
2016]. To remedy this issue, the field of eXplainable Artifi-
cial Intelligence (XAI) studies how such complex but useful
models can be made more understandable [?].
Achieving transparency of ML models has multiple ad-
vantages [Weller, 2017]. For example, if a model designer
knows why a model performs badly on some data, he or she
can start a more informed process of resolving the perfor-
mance issues [Kulesza et al., 2015; Papernot and McDaniel,
2018]. However, even if a model has high performance, the
users (typically non-experts in ML) would still like to know
why it came to a certain output [Miller, 2017]. Especially
in high-risk domains such as defense and health care, inap-
propriate trust in the output may cause substantial risks and
problems [Lipton, 2016; Ribeiro et al., 2016]. If an ML model
fails to provide transparency, the user cannot safely rely on its
outcomes, which hampers the model's applicability [Lipton,
2016]. If, however, an ML model is able to explain its workings
and outcomes satisfactorily to the user, this would
not only improve the user's trust; it could also provide new
insights to the user.
For the problem of classification, recent research has de-
veloped a number of promising methods that enable classifi-
cation models to explain their output [Guidotti et al., 2018].
Several of these methods prove to be model-independent in
some way, allowing them to be applied on any existing ML
classification model. However, for Reinforcement Learn-
ing (RL) models, there are relatively few methods available
[Verma et al., 2018; Shu et al., 2017; Hein et al., 2017]. The
scarcity of methods that enable RL agents to explain their
actions and policies towards humans severely hampers the
practical application of RL models. It also diminishes the
often highly rated value of RL to Artificial Intelligence
[Hein et al., 2017; Gosavi, 2009]. Take for example a simple
agent in a grid world that needs to reach a goal position while
evading another actor who could cause termination, as well as
other static terminal states. The RL agent cannot easily explain
why it takes the route it has learned, as it only knows numerical
rewards and its coordinates in the grid. The agent has no
grounded knowledge about the 'evil actor' that tries to prevent
it from reaching its goal, nor does it know how certain actions
will affect such grounded concepts. These state features and
rewards are what drive the agent, but they do not lend themselves
well to an explanation, as they may not be grounded concepts,
nor do they offer a reason why the agent behaves in a certain way.
Important pioneering work has been done by Hayes and
Shah [Hayes and Shah, 2017]. They developed a method for
eXplainable Reinforcement Learning (XRL) that can gener-
ate explanations about a learned policy in a way that is under-
standable to humans. Their method converts feature vectors
to a list of predicates by using a set of binary classification
models. This list of predicates is searched to find sub-sets
that tend to co-occur with specific actions. The method pro-
vides information about which actions are performed when
which state predicates are true. A method that uses the co-
occurrence to generate explanations may be useful for small
problems, but becomes less comprehensible in larger plan-
ning and control problems, because the overview of predicate
and action combinations becomes too large. Also, the method
addresses only what the agent does, and not why it acts as it
does. In other words, the method presents the user with the
correlations between states and the policy, but it does not
provide a motivation, in terms of rewards or state transitions,
for why that policy is used.
This study proposes an approach to XRL that allows an
agent to answer questions about its actions and policy in terms
of their consequences. Other questions unique to RL are also
possible, for example those about the time it takes to reach
some goal or those about RL-specific problems (loop behavior,
lack of exploration or exploitation, etc.). However, we believe
that a non-expert in RL is mostly interested in the expected
consequences of the agent's learned behavior and whether the
agent finds these consequences good or bad. This information
can be used as an argument for why the agent behaves the way
it does. It allows human users to gain insight into what
information the agent can perceive from a state and which
outcomes it expects from an action or state visit.
Furthermore, to limit the amount of information about all
consequences, our proposed method aims to support contrastive
explanations [Miller, 2017]. Contrastive explanations are a
way of answering causal 'why'-questions. In such questions,
two potential items, the fact and the foil, are compared to each
other in terms of their causal effects on the world. Contrastive
questions come naturally between humans and offer an intuitive
way of gathering motivations for why one performs a certain
action instead of another [Miller, 2017]. In our case, we
allow the user to ask why the learned policy π_t (the 'fact')
is used instead of some other policy π_f (the 'foil') that is of
interest to the user. Furthermore, our proposed method translates
the set of states and actions into a set of more descriptive
state classes C and action outcomes O, similar to
[Hayes and Shah, 2017]. This allows the user to query the agent
in a more natural way and to receive more informative
explanations, as both refer to the same concepts instead of
plain features. The translation of state features to more
high-level concepts, and of actions in specific states to
outcomes, is also used in the algorithm of
[Sherstov and Stone, 2005], where it facilitates transfer
learning of a single action over multiple tasks and domains. In
our method we use it to create a user-interpretable variant of
the underlying Markov Decision Problem (MDP).
For the purpose of implementation and evaluation of our
proposed method, we performed a pilot study. In this study,
a number of example explanations were presented to participants
to see which of their varying properties were preferred most.
One of these properties concerned whether participants prefer
explanations about the expected consequences of a single
action or of the entire policy.
2 Approach for consequence-based
explanations
The underlying Markov Decision Problem (MDP) of an RL
agent consists of the tuple ⟨S, A, R, T, λ⟩. Here, S and A are
the set of states (described by a feature vector) and the set of
actions respectively, R : S × A → ℝ is the reward function, and
T : S × A → Pr(S) is the transition function that provides a
probability distribution over states. Finally, λ is the discount
factor that governs how much future rewards are taken into
account by the agent. This tuple provides the information
required to derive the consequences of the learned policy π_t
or of the foil policy π_f from the user's question, as one can
use the transition function T to sample the effects of both π_t
and π_f. In case T is not explicitly available, a separate ML
model can be trained to learn it alongside the actual agent.
Through this simulation, one constructs a Markov chain of state
visits under each of the policies π_t and π_f and can present
the difference to the user.
Through the simulation of future states with T, information
can be gathered about state consequences. In turn, the state or
state-action values for simulated state visits can be obtained
from the agent itself to develop an explanation in terms of
rewards. However, the issue with this approach is that the
state features and rewards may not be easy for a user to
understand, as they may consist of low-level concepts and
numerical reward values or expected returns. To mitigate this
issue we can apply a translation of the states and actions to
a set of predefined state concepts and outcomes. These concepts
can be designed to be more descriptive and informative for the
potential user. One way to perform this translation is to train
a set of binary classifiers that recognize each outcome or
state concept from the state features and the taken action, an
approach similar to that of [Hayes and Shah, 2017]. These
classifiers can be trained during the exploratory learning
process of the agent. This translation allows us to use the
simulation of consequences described above and transform the
state features and action results into more user-interpretable
concepts.
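A minimal sketch of such a translation, assuming scikit-learn logistic regressions as the binary concept classifiers; the concept names, the ConceptTranslator interface and the source of the labels are our own assumptions, not taken from the paper:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical concept names for the grid-world example; labels are assumed to
# come from a labeling function available during training.
CONCEPTS = ["next_to_forest", "next_to_wall", "next_to_trap", "next_to_monster", "in_forest"]

class ConceptTranslator:
    """Learns k: S -> C as a set of independent binary classifiers."""

    def __init__(self, concepts=CONCEPTS):
        self.concepts = concepts
        self.models = {c: LogisticRegression() for c in concepts}

    def fit(self, states, concept_labels):
        # states: array of shape (n_samples, n_features) collected during exploration.
        # concept_labels: dict mapping each concept to a binary label vector.
        for c in self.concepts:
            self.models[c].fit(states, concept_labels[c])

    def translate(self, state):
        # Returns the Boolean concept vector c = k(s) for a single state s.
        s = np.asarray(state).reshape(1, -1)
        return {c: bool(self.models[c].predict(s)[0]) for c in self.concepts}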
2.1 A user-interpretable MDP
The original set of states can be transformed into a more
descriptive set C according to a function k : S → C. This is
similar to the approach of [Hayes and Shah, 2017], where k
consists of a number of classifiers. Likewise, rewards can be
explained in terms of a set of action outcomes O according to
t : C × A → Pr(O). This provides the results of an action in
some state in terms of the concepts in O, for example the
outcomes that the developer had in mind when designing the
reward function R. The transformation of states and actions
into state classes and outcomes is adopted from the work of
[Sherstov and Stone, 2005], where the transformations are used
to allow for transfer learning in RL. Here, however, we use
them as a translation towards a more user-interpretable
representation of the actual MDP.
The result is the new MDP tuple ⟨S, A, R, T, λ, C, O, t, k⟩.
An RL agent is still trained on S, A, R and T with λ,
independent of the descriptive sets C and O and the functions
k and t. This makes the transformation independent of the RL
algorithm used to train the agent. See Figure 1 for an overview
of this approach.

Figure 1: An overview of the proposed method; a dotted line
represents a feedback loop. We assume a general reinforcement
learning agent that acts upon a state s through some action a
and receives a reward r. We train a transition model T that can
be used to simulate the effect of actions on states. By
repeatedly simulating a state s_i we can obtain the expected
consequences γ of an entire policy. The consequences of a
contrastive policy, consisting of alternative courses of action
a_f, can be simulated with the same transition model T. Finally,
in constructing the explanation we transform states and actions
into user-interpretable concepts and construct an explanation
that is contrastive.
As an example, take the grid world illustrated in Figure 2,
which shows an agent in a simple myopic navigation task. The
states S consist of the (x, y) coordinates and the presence of
a forest, monster or trap in adjacent tiles, with
A = {Up, Down, Left, Right}. R consists of a small transient
penalty, a slightly larger penalty for tiles with a forest, a
large penalty shared over all terminal states (traps or tiles
adjacent to the monster), and a large positive reward for the
finishing state. T is skewed towards the intended result, with
small probabilities for the other results where possible.

The state transformation k can consist of a set of classifiers
for the predicates of whether the agent is next to a forest, a
wall, a trap or the monster, or inside the forest. Applying k
to some state s ∈ S results in a Boolean vector c ∈ C whose
information can be used to construct an explanation in terms of
the stated predicates. The similar outcome transformation t may
predict the probability of the outcomes O given a state and
action. In our example, O consists of whether the agent will be
at the goal, in a trap, next to the monster or in the forest.
Each outcome o can be flagged as being positive (o+) or
negative (o−), purely so that they can be presented differently
in the eventual explanation.
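As a purely illustrative sketch of this grid-world translation, the following code derives the concept vector from hand-written predicates and flags outcomes with a valence; all names, probabilities and the (ignored) action argument are our own assumptions, not taken from the paper.

from dataclasses import dataclass

# Hypothetical grid-world state: agent position plus adjacency/containment flags.
@dataclass
class GridState:
    x: int
    y: int
    forest_adjacent: bool
    monster_adjacent: bool
    trap_adjacent: bool
    in_forest: bool

# Outcomes O with a valence flag: +1 marks o+ (positive), -1 marks o- (negative).
OUTCOMES = {"at_goal": +1, "in_trap": -1, "next_to_monster": -1, "in_forest": -1}

def k(state: GridState) -> dict:
    # Rule-based stand-in for the concept transformation k: S -> C.
    return {
        "next_to_forest": state.forest_adjacent,
        "next_to_monster": state.monster_adjacent,
        "next_to_trap": state.trap_adjacent,
        "in_forest": state.in_forest,
    }

def t(concepts: dict, action: str) -> dict:
    # Toy outcome transformation t: C x A -> Pr(O); the action is ignored here
    # and the probabilities are illustrative only.
    probs = {o: 0.0 for o in OUTCOMES}
    if concepts["next_to_trap"]:
        probs["in_trap"] = 0.25
    if concepts["next_to_monster"]:
        probs["next_to_monster"] = 0.8
    if concepts["in_forest"] or concepts["next_to_forest"]:
        probs["in_forest"] = 0.5
    return probs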
Given the above transformations, we can simulate the next
state of a single action a with T, or even the entire chain of
actions and visited states given some policy π. These can then
be transformed into state descriptions C and action outcomes O
to form the basis of an explanation. As mentioned, humans
usually ask contrastive questions, especially regarding actions
[Miller, 2017]. In the next section we propose a method for
translating the foil in a contrastive question into a new
policy.

Figure 2: A simple RL problem where the agent has to navigate
from the bottom left to the top right (goal) while evading
traps, a monster and a forest. The agent terminates when in a
tile with a trap or adjacent to the monster. The traps and the
monster only occur in the red-shaded area, and as soon as the
agent enters this area the monster moves towards the agent.
2.2 Contrastive questions translated into value
functions
A contrastive question consists of a fact and a foil, and its
answer describes the contrast between the two from the fact's
perspective [Miller, 2017]. In our case, the fact consists of
the entire learned policy π_t, a single action from it,
a_t = π_t(s_t), or any number of consecutive actions from π_t.
We propose a method for obtaining a foil policy π_f based on
the foil in the user's question. An example of such a question,
framed within the case of Figure 2, could be:

"Why do you move up and then right (fact) instead of moving
to the right until you hit a wall and then moving up (foil)?"
The foil policy π_f is ultimately obtained by combining a
state-action value function Q_I, which represents the user's
preference for some actions according to his/her question, with
the learned Q_t to obtain Q_f:

Q_f(s, a) = Q_t(s, a) + Q_I(s, a),  ∀(s, a) ∈ S × A    (1)

Each state-action value function is of the form Q : S × A → ℝ.
Q_I only values the state-action pairs queried by the user. For
instance, the Q_I of the user question given above can be based
on the following reward scheme for all potentially simulated
s ∈ S:

• The action a_f^1 = 'Right' receives a reward such that
Q_f(s, Right) > Q_t(s, π_t(s)).
• If 'RightWall' ∈ k(s), then the action a_f^2 = 'Up' receives
a reward such that Q_f(·, Up) > Q_t(·, π_t(s)).
Given this reward scheme we can train Q_I and obtain Q_f
according to Equation 1. The state-action values Q_f can then
be used to obtain the policy π_f using the original action
selection mechanism of the agent. This results in a policy that
tries to follow the queried policy as closely as it can. The
advantage of constructing π_f from Q_f is that the agent is
allowed to learn a different action than those in the user's
question, as long as the reward is higher in the long run (more
user-defined actions can be performed). It also allows for
simulation of the actual expected behavior of the agent, as it
is still based on the agent's action selection mechanism.
Neither would be the case if we simply forced the agent to do
exactly what the user stated.
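A minimal sketch of this combination step, assuming tabular Q-functions stored as NumPy arrays and a greedy action-selection mechanism; both are our assumptions, not prescribed by the paper.

import numpy as np

def combine_q(q_t: np.ndarray, q_i: np.ndarray) -> np.ndarray:
    # Equation 1: Q_f(s, a) = Q_t(s, a) + Q_I(s, a), for tabular Q of shape (n_states, n_actions).
    return q_t + q_i

def foil_policy(q_f: np.ndarray):
    # Derive pi_f with the same (here: greedy) action selection the agent uses for Q_t.
    return lambda s: int(np.argmax(q_f[s]))

# Toy usage: 10 states, 4 actions; the user's question favors action index 3.
q_t = np.random.rand(10, 4)      # learned state-action values
q_i = np.zeros((10, 4))          # user-preference values, trained from R_I below
q_i[:, 3] += 1.0
pi_f = foil_policy(combine_q(q_t, q_i))
print(pi_f(0))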
The construction of Q_I is done through simulation with the
help of the transition model T. The rewards that are given
during the simulation are selected with Equation 1 in mind, as
they need to eventually compensate for the originally learned
action based on Q_t. Hence, the reward for each state and
queried action is as follows:

R_I(s_i, a_f) = (λ_f / λ) · w(s_i, s_t) · [R(s_i, a_f) − R(s_i, a_t)] · (1 + ε)    (2)

with a_t = π_t(s_t) the originally learned action and w a
distance-based weight:

w(s_i, s_t) = exp(−d(s_i, s_t) / σ²)    (3)

First, s_i with i ∈ {t, t + 1, ..., t + n} is the i-th state in
the simulation starting from s_t, and a_f is the current foil
action governed by the policy conveyed by the user. The fact
that a_f is taken as the only rewarding action each time greatly
reduces the time needed to construct Q_I. Next, w(s_i, s_t) is
obtained from a Radial Basis Function (RBF) with a Gaussian
kernel and distance function d. This RBF represents the
exponential distance between our actual state s_t and the
simulated state s_i. The Gaussian kernel is governed by the
standard deviation σ and allows us to reduce the effect of Q_I
as we get further from our actual state s_t. The ratio of
discount factors λ_f / λ allows for compensation between the
discount factor λ of the original agent and the potentially
different factor λ_f for Q_I, should we wish Q_I to be more
shortsighted. Finally, [R(s_i, a_f) − R(s_i, a_t)] (1 + ε) is
the amount of reward that a_f needs such that
Q_I(s_i, a_f) > Q(s_i, a_t), with ε > 0 determining how much
more Q_I will prefer a_f over a_t.

The parameter n defines how many future state transitions we
simulate and use to retrieve Q_I. As a general rule n ≥ 3σ,
since at that point the Gaussian kernel reduces the contribution
of Q_I to near zero, such that Q_f resembles Q_t. Hence, by
setting σ one can vary the number of states over which the foil
policy is followed starting from s_t. Also, by setting ε, the
strength with which each a_f is preferred over a_t can be
regulated. Finally, λ_f defines how shortsighted Q_I should be.
If set to λ_f = 0, π_f will force the agent to perform a_f as
long as s_i is not too distant from s_t. If set to values near
one, π_f is allowed to take different actions as long as this
results in more opportunities to perform a_f.
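To make Equations 2 and 3 concrete, here is a small sketch of the reward used to train Q_I during simulation; the Euclidean distance for d and all parameter defaults are illustrative choices of ours.

import numpy as np

def rbf_weight(s_i: np.ndarray, s_t: np.ndarray, sigma: float) -> float:
    # Equation 3: Gaussian-kernel weight that decays with the distance d(s_i, s_t).
    d = np.linalg.norm(s_i - s_t)   # Euclidean distance is our assumption for d
    return float(np.exp(-d / sigma ** 2))

def foil_reward(R, s_i, a_f, a_t, s_t, lam, lam_f, sigma, eps=0.1) -> float:
    # Equation 2: reward for the queried foil action a_f, sized to compensate
    # for the originally learned action a_t.
    w = rbf_weight(s_i, s_t, sigma)
    return (lam_f / lam) * w * (R(s_i, a_f) - R(s_i, a_t)) * (1.0 + eps)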
2.3 Generating explanations
At this point we have the user-interpretable MDP consisting of
state concepts C and action outcomes O provided by their
respective transformation functions k and t. We also have a
definition of R_I that values the actions and/or states that
are of interest to the user, which can be used to train Q_I
through simulation and obtain Q_f according to Equation 1. This
provides us with the basis for obtaining the information needed
to construct an explanation.

As mentioned before, the explanations are based on simulating
with T the effects of π_t and of π_f (if defined by the user).
We can call T on the previous state s_{i−1} for some action
π(s_{i−1}) to obtain s_i, and repeat this until i = n. The
result is a single sequence, or trajectory, of visited states
and performed actions for any policy π starting from s_t:

γ(s_t, π) = {(s_0, a_0), ..., (s_n, a_n) | T, π}    (4)

If T is probabilistic, multiple simulations with the same
policy and starting state may result in different trajectories.
To obtain the most probable trajectory γ(s_t, π)* we can take
the transition from T with the highest probability each time.
Otherwise, a Markov chain could be constructed instead of a
single trajectory.
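A sketch of this roll-out under the learned transition model, always picking T's most probable next state; the dictionary interface of the transition model is our own assumption.

def most_probable_trajectory(s_t, policy, transition, n):
    # Roll out gamma(s_t, pi)*: follow the policy and take the highest-probability
    # transition at every step. transition(s, a) is assumed to return a dict
    # mapping next states to probabilities; states must be hashable.
    trajectory = []
    s = s_t
    for _ in range(n):
        a = policy(s)
        trajectory.append((s, a))
        next_states = transition(s, a)
        s = max(next_states, key=next_states.get)
    return trajectory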
The next step is to transform each state and action pair in
γ(s_t, π)* into the user-interpretable description with the
functions k and t:

Path(s_t, π) = {(c_0, o_0), ..., (c_n, o_n)},
with c_i = k(s_i), o_i = t(s_i, a_i), (s_i, a_i) ∈ γ(s_t, π)*    (5)

From Path(s_t, π_t) an explanation can be constructed about the
states the agent will most likely visit and the action outcomes
it will obtain, for example with the use of the following
template:

"For the next n actions I will mostly perform a. During these
actions, I will come across situations with ∀c ∈ Path(s_t, π_t).
This will cause me ∀o+ ∈ Path(s_t, π_t) but also
∀o− ∈ Path(s_t, π_t)."

Here, a is the action most common in γ(s_t, π_t), and o+ and o−
are the positive and negative action outcomes respectively.
Since we have access to the entire simulation of π_t, a wide
variety of explanations is possible. For instance, we could
also focus on the less common actions:

"For the next n actions I will perform a_1 when in situations
with ∀c ∈ Path(s_t, π_t | π_t = a_1) and a_2 when in situations
with ∀c ∈ Path(s_t, π_t | π_t = a_2). These actions prevent me
from ∀o+ ∈ Path(s_t, π_t) but also ∀o− ∈ Path(s_t, π_t)."
A contrastive explanation, given some question from the user
that describes the foil policy π_f, can be constructed in a
similar manner by taking the contrast. Given a foil, we can
focus on the differences between Path(s_t, π_t) and
Path(s_t, π_f). This can be obtained by taking the relative
complement Path(s_t, π_t) \ Path(s_t, π_f): the set of expected
consequences that are unique to behaving according to π_t and
not π_f. A more extensive explanation can be given by taking
the symmetric difference Path(s_t, π_t) △ Path(s_t, π_f) to
explain the unique differences between both policies.
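As a sketch of how these set operations could select the content of a contrastive explanation, assuming each Path is represented as a set of hashable (concept, outcome) pairs; the labels in the usage example are hypothetical.

def contrastive_content(path_fact: set, path_foil: set, extensive: bool = False) -> set:
    # path_fact and path_foil stand for Path(s_t, pi_t) and Path(s_t, pi_f).
    # The relative complement keeps consequences unique to the learned policy;
    # the symmetric difference covers the unique consequences of both policies.
    if extensive:
        return path_fact ^ path_foil   # symmetric difference
    return path_fact - path_foil       # relative complement

# Toy usage with hypothetical concept/outcome labels:
fact = {("next_to_forest", "in_forest"), ("next_to_trap", "at_goal")}
foil = {("next_to_trap", "in_trap"), ("next_to_trap", "at_goal")}
print(contrastive_content(fact, foil))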
3 User study
The above proposed method allows an RL agent to explain
and motivate its behavior in terms of expected states and out-
comes. It also enables the construction of contrastive expla-
nations where any policy can be compared to the learned pol-
icy. This contrastive explanation is based on differences in
expected outcomes between the compared policies.
We performed a small user study in which 82 participants were
shown a number of exemplar explanations about the case shown in
Figure 2. These explanations addressed either the single next
action or the entire policy. Both kinds of explanations can be
generated by the above method by adjusting the Radial Basis
Function weighting scheme and/or the foil's discount factor.
Also, some example explanations were contrastive with only the
second-best action or policy, while others provided all
consequences. Contrasts were determined using the relative
complement between fact and foil. Whether the learned action or
policy was treated as the fact or the foil was also
systematically manipulated in this study.
We presented the developed exemplar explanations in pairs
to the participants and asked them to select the explanation
that helped them most to understand the agent’s behavior.
Afterwards we asked which of the following properties they
used to assess their preference: long versus short explana-
tions; explanations with ample information versus little infor-
mation; explanations addressing actions versus those that ad-
dress strategies (policies); and explanations addressing short-
term consequences of actions versus explanations that ad-
dress distant consequences of actions.
The results for the preferred properties are shown in Figure 3.
They show that participants prefer explanations that address
strategy and policy, and that provide ample information. We
note that, given the simple case from Figure 2, participants
may have considered an explanation addressing only a single
action as trivial, because the optimal action was, in most
cases, already evident to the user.
4 Conclusion
We proposed a method for a reinforcement learning (RL)
agent to generate explanations for its actions and strategies.
The explanations are based on the expected consequences of
its policy. These consequences were obtained through simu-
lation according to a (learned) state transition model. Since
state features and numerical rewards do not lend themselves
easily for an explanation that is informative to humans, we
developed a framework that translates states and actions into
user-interpretable concepts and outcomes.
Figure 3: A plot depicting, for each explanation property
(x-axis), the percentage of participants (y-axis) who reported
that property as a reason to prefer some explanations over
others. Answers from a total of 82 participants were gathered.
We also proposed a method for converting the foil (the policy
of interest to the user) of a contrastive 'why'-question about
actions into a policy. This policy locally follows the user's
query but gradually transitions back towards the originally
learned policy. It favors the actions that are of interest to
the user, such that the agent tries to perform them as well as
possible. How strongly these actions are favored compared to
the originally learned action can be set with a single
parameter.
By running simulations of both the policy derived from the
user's question and the actually learned policy for a given
number of steps, we were able to obtain the expected
consequences of each. From these, we were able to construct
contrastive explanations: explanations addressing the
consequences of the learned policy and what would be different
had the derived policy been followed.
An online survey pilot study was conducted to explore
which of several explanations are most preferred by human
users. Results indicate that users prefer explanations about
policies rather than about single actions.
Future work will focus on implementing the method on
complex RL benchmarks to explore the scalability of this ap-
proach in realistic cases. This is important given the compu-
tational costs of simultaneously simulating the consequences
of different policies in large state spaces. Also, we will ex-
plore more methods to construct our translation functions
from states and actions to concepts and outcomes. A more
extensive user study will be carried out to evaluate the instruc-
tional value of generated explanations in more detail, and to
explore the relationship between explanations and users’ trust
in the agent’s performance.
Acknowledgments
We would like to thank the reviewers for their time and effort
in improving this paper. Also, we are grateful for the funding
from the RVO Man Machine Teaming research project that
made this research possible.
References
[Gosavi, 2009] Abhijit Gosavi. Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing, 21(2):178–192, 2009.
[Guidotti et al., 2018] Riccardo Guidotti, Anna Monreale, Franco Turini, Dino Pedreschi, and Fosca Giannotti. A survey of methods for explaining black box models. arXiv preprint arXiv:1802.01933, 2018.
[Hayes and Shah, 2017] Bradley Hayes and Julie A. Shah. Improving robot controller transparency through autonomous policy explanation. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, pages 303–312. ACM, 2017.
[Hein et al., 2017] Daniel Hein, Steffen Udluft, and Thomas A. Runkler. Interpretable policies for reinforcement learning by genetic programming. arXiv preprint arXiv:1712.04170, 2017.
[Kulesza et al., 2015] Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces, pages 126–137. ACM, 2015.
[Lipton, 2016] Zachary C. Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
[Lundberg and Lee, 2016] Scott Lundberg and Su-In Lee. An unexpected unity among methods for interpreting model predictions. arXiv preprint arXiv:1611.07478, 2016.
[Miller, 2017] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. arXiv preprint arXiv:1706.07269, 2017.
[Papernot and McDaniel, 2018] Nicolas Papernot and Patrick McDaniel. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765, 2018.
[Ribeiro et al., 2016] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
[Samek et al., 2016] Wojciech Samek, Grégoire Montavon, Alexander Binder, Sebastian Lapuschkin, and Klaus-Robert Müller. Interpreting the predictions of complex ML models by layer-wise relevance propagation. arXiv preprint arXiv:1611.08191, 2016.
[Sherstov and Stone, 2005] Alexander A. Sherstov and Peter Stone. Improving action selection in MDPs via knowledge transfer. In AAAI, volume 5, pages 1024–1029, 2005.
[Shu et al., 2017] Tianmin Shu, Caiming Xiong, and Richard Socher. Hierarchical and interpretable skill acquisition in multi-task reinforcement learning. arXiv preprint arXiv:1712.07294, 2017.
[Verma et al., 2018] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. arXiv preprint arXiv:1804.02477, 2018.
[Weller, 2017] Adrian Weller. Challenges for transparency. arXiv preprint arXiv:1708.01870, 2017.