Let's Evaluate Explanations!
Miruna-Adriana Clinciu and Helen Hastie
Heriot-Watt University
Robotarium, Earl Mountbatten Building
Edinburgh, UK EH14 4AS
Abstract: Transparency is an important factor for robots,
autonomous systems and AI, if they are to be adopted into our
lives and society at large. Explanations are one way to provide
such transparency and natural language explanations are a
clear and intuitive way to do this, helping users to understand
what a robot or AI is doing and why. In this abstract, we
highlight the importance of defining what makes a good
explanation. Furthermore, we discuss evaluation methods for
explanations by leveraging existing natural language
generation evaluation metrics.
Human-Robot Interaction (HRI) is a field of study dedicated to
understanding, designing, and evaluating robotic systems, with
the aim of creating a meaningful interaction between robots and
humans. In recent years, robotic systems have increased in
complexity and this has led to the need to explain their behaviour
and reasoning, in order to better understand their capabilities and
prevent errors. This aligns with the EPSRC Principles of
Robotics, “Robots are manufactured artefacts. They should not be
designed in a deceptive way to exploit vulnerable users; instead,
their machine nature should be transparent” (EPSRC, 2020).
Presently, robot behaviour can be perceived as providing too little
information about the robot's intent and internal workings. This
prevents users from forming clear mental models of what the robot can
and cannot do, and this lack of transparency can also inhibit progress
from the developer's perspective (Wortham et al., 2017). This, in turn,
raises ethical and safety concerns. With regards to AI, the EU GDPR
introduced the "right to explanation" in Article 22, "Automated
individual decision-making, including profiling" (GDPR, 2018).
According to the above-mentioned regulations and principles,
there is no doubt that we need a level of transparency and that this
transparency may need to be communicated to the user. The
importance of explanations for building trust and transparency in
intelligent systems has been investigated by several researchers
(Kulesza et al., 2012; Lim et al., 2009; Bussone et al., 2015) and
previous work has shown that explanations can increase user
understanding (Garcia et al., 2018) and trust in an intelligent
system (Lim et al., 2009).
The question, then, is how do we define what a good explanation is?
Effective questioning (Wilen and Clegg, 1986), a method of
explanation in pedagogy, could help us to define a strategy for
creating different types of explanations, but this is not sufficient.
It is necessary to extract the main properties or attributes of an
explanation in order to decide what makes a good or bad
explanation. Zemla et al. (2017) consider that an explanation can
have the following attributes: alternatives, articulation,
complexity, desired complexity, evidence credibility, evidence
relevance, expert, external coherence, generality, incompleteness,
internal coherence, novelty, perceived expertise, perceived truth,
possible explanation, principle consensus, prior knowledge,
quality, requires explanation, scope and visualisation. According
to Yuan et al. (2011), an explanation should be precise and
concise. How do we know which of these factors are important and
contribute the most to an effective explanation, and how do they
vary depending on the user and the context?
We consider that an intuitive medium to provide explanations is
through natural language. There has been much work on natural
language generation (NLG) evaluation (Hastie and Belz, 2014;
Novikova et al., 2017) and we can potentially use these NLG
measures to gauge the quality of an automatically generated
explanation and even similarity to a ‘gold standard’ explanation
using automatic measures from machine translation such as
BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004). In this
abstract, we focus on four important properties of explanations
that intersect between NLG and XAI, namely: informativeness,
readability, clarity and effectiveness.
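To illustrate how such overlap-based measures work, the following is a minimal sketch of simplified BLEU-style and ROUGE-style scores comparing a generated explanation against a 'gold standard'. The example sentences are invented, and a real evaluation would use full library implementations (e.g. NLTK or rouge-score) rather than this simplified version.

```python
# Simplified surface-overlap metrics in the spirit of BLEU (Papineni et
# al., 2002) and ROUGE (Lin, 2004). The explanation strings below are
# invented for illustration only.
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions with a brevity penalty."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(clipped / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def rouge_1_recall(candidate, reference):
    """Fraction of reference unigrams recovered by the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    return overlap / max(sum(ref.values()), 1)

gold = "the robot turned back because its battery level was low"
generated = "the robot turned back because the battery was low"
print(round(bleu(generated, gold), 3))
print(round(rouge_1_recall(generated, gold), 3))
```

A generated explanation that paraphrases the gold standard scores high but below 1.0, reflecting the well-known limitation of surface-overlap metrics: legitimate rewordings are penalised.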
Firstly, informativeness is linked with accuracy and adequacy and
"targets the relevance and correctness of the output relative to the
input specification" (Novikova et al., 2018). Secondly, readability
can be measured using automatic objective measures as well as human
subjective evaluation. Automatic evaluation could apply traditional
readability indices to explanations, e.g. Flesch-Kincaid (Ease, 2009)
or FOG (Gunning, 1969). Human evaluation of readability could be
achieved by asking a target group to rate explanations for reading
ease and comprehensibility (TAUS, 2014). Thirdly, according to
Manishina (2016), important properties of semantic formalisms are
clarity and intuitiveness. Natural language explanations represent
"support sentences", which are sentences that provide further
information about the topic sentence through examples, reasons, or
descriptions (McWhorter, 2016). Evaluating explanations in terms of
clarity could focus on linguistic phenomena such as misplaced or
dangling modifiers, wordiness, redundancy and tense. Correct syntax is
not the only factor affecting clarity; others include how concepts and
ideas new to the user are introduced in a way that is appropriate to
their knowledge and previous experience. For this, we can turn to the
fields of education and intelligent tutoring systems for inspiration
(Graesser, 2016). Fourthly, with regard to effectiveness, as mentioned
by Tintarev and Masthoff (2007), effective explanations should help
humans make good decisions. Effectiveness could be evaluated by
calculating the difference in a user's understanding of a model before
and after the explanation is provided, together with the validity of
any resulting decision. This could be achieved through gamifying a
task to reward good decisions (Gkatzia et al., 2017) or comparing a
user's understanding before and after an explanation (Garcia et al.,
2018).
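The readability indices mentioned above are simple functions of sentence length and syllable counts, and can be sketched directly. The syllable counter here is a crude vowel-group heuristic and the example sentences are invented, so the scores are approximate; production use would rely on an established implementation such as the textstat library.

```python
# Rough implementations of the Flesch Reading Ease score and the
# Gunning FOG index, two traditional readability measures. Syllables
# are approximated as runs of vowels, a common but imperfect heuristic.
import re

def count_syllables(word):
    """Approximate syllable count as the number of vowel groups."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def flesch_reading_ease(text):
    """Higher scores indicate easier text (roughly 0-100)."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    n = max(len(words), 1)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

def gunning_fog(text):
    """Estimated years of schooling needed; 'complex' = 3+ syllables."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    n = max(len(words), 1)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (n / sentences + 100 * len(complex_words) / n)

simple = "The robot stopped. Its battery was low."
dense = ("The autonomous platform terminated locomotion owing to "
         "insufficient residual battery capacity.")
print(flesch_reading_ease(simple), gunning_fog(simple))
print(flesch_reading_ease(dense), gunning_fog(dense))
```

As expected, a short, plain explanation scores as much easier to read than a jargon-heavy paraphrase of the same content.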
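The before-and-after measure of effectiveness described above can be sketched as a simple comprehension quiz administered twice. The quiz items, answer key and user responses below are invented purely for illustration; a real study would use validated questions and appropriate statistics over many participants.

```python
# A sketch of the before/after measure of effectiveness: score a user's
# answers to comprehension questions about the system before and after
# they see an explanation, and report the gain in understanding.
def quiz_score(answers, answer_key):
    """Fraction of comprehension questions answered correctly."""
    correct = sum(1 for q, a in answer_key.items() if answers.get(q) == a)
    return correct / len(answer_key)

def effectiveness_gain(before, after, answer_key):
    """Positive values mean the explanation improved understanding."""
    return quiz_score(after, answer_key) - quiz_score(before, answer_key)

# Hypothetical quiz about a robot's behaviour (identifiers invented).
answer_key = {"why_stopped": "low_battery", "next_action": "return_to_dock"}
before = {"why_stopped": "obstacle", "next_action": "return_to_dock"}
after = {"why_stopped": "low_battery", "next_action": "return_to_dock"}
print(effectiveness_gain(before, after, answer_key))  # 0.5
```

Here the user misattributed the robot's behaviour before the explanation and corrected it afterwards, giving a positive gain; averaging such gains across users and explanation styles supports the kind of comparison reported by Garcia et al. (2018).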
In conclusion, there is a clear need to define evaluation metrics
for natural language explanations, in order to decide what makes
a good or bad explanation and thus, in turn, increase transparency
and avoid confusion and misunderstanding. Our current work is
concerned with explaining causal Bayesian Networks where
participants evaluate human explanations for graphical models, in
terms of informativeness, clarity and effectiveness, taking
inspiration from existing natural language generation evaluation
metrics. Other properties of explanations, such as scrutability,
satisfaction, persuasiveness, efficiency, soundness, coherence and
understandability, will be taken into consideration for future
research. It’s clear that this is a multidisciplinary endeavour and
factors from fields such as linguistics, NLP/NLG, cognitive
science, psychology, pedagogy, as well as robotics and
engineering will need to be considered.
The authors gratefully acknowledge the support of Dr. Inês
Cecílio and Prof. Mike Chantler. This research was funded by the
Schlumberger Cambridge Research Centre Doctoral programme.
[1] Adrian Bussone, Simone Stumpf, and Dympna O'Sullivan (2015). The role of explanations on trust and reliance in clinical decision support systems. In Proceedings of the 2015 IEEE International Conference on Healthcare Informatics (ICHI 2015). IEEE, Piscataway, New Jersey, USA, 160–169.
[2] Martin Caminada et al. (2014). Scrutable plan enactment via argumentation and natural language generation. In Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2014), Vol. 2. IFAAMAS, 1625–1626.
[3] GDPR (2018). Article 22 EU GDPR "Automated individual decision-making, including profiling." http://www.privacy-making-including-profiling-GDPR.htm. Online; accessed 23 January 2020.
[4] Todd Kulesza et al. (2012). Tell me more? The effects of mental model soundness on personalizing an intelligent agent. In Proceedings of the Conference on Human Factors in Computing Systems (CHI '12). ACM, New York, NY, USA, 1–10.
[5] Brian Y. Lim, Anind K. Dey, and Daniel Avrahami (2009). Why and Why Not Explanations Improve the Intelligibility of Context-Aware Intelligent Systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '09). ACM, New York, NY, USA.
[6] Elena Manishina (2016). Data-driven natural language generation using statistical machine translation and discriminative learning. Thesis. Université d'Avignon.
[7] Jekaterina Novikova, Ondřej Dušek, and Verena Rieser (2018). RankME: Reliable Human Ratings for Natural Language Generation. In Proceedings of NAACL-HLT 2018, Volume 2 (Short Papers). ACL, New Orleans, Louisiana, 72–78.
[8] The Engineering and Physical Sciences Research Council (EPSRC) (2020). Principles of robotics. activities/principlesofrobotics/. Online; accessed 23 January.
[9] Nava Tintarev (2007). Explaining recommendations. In Lecture Notes in Computer Science, Vol. 4511 (LNCS). Springer, 470–474.
[10] Nava Tintarev and Judith Masthoff (2007). Effective Explanations of Recommendations: User-centered Design. In Proceedings of the 2007 ACM Conference on Recommender Systems (RecSys '07). ACM, New York, NY, USA, 153–156.
[11] William W. Wilen and Ambrose A. Clegg Jr. (1986). Effective Questions and Questioning: A Research Review. Theory & Research in Social Education 14(2), 153–161.
[12] Robert H. Wortham, Andreas Theodorou, and Joanna J. Bryson (2017). Robot transparency: Improving understanding of intelligent behaviour for designers and users. In Lecture Notes in Computer Science, Vol. 10454 (LNAI). Springer, Berlin, Germany, 274–289.
[13] Xiu Yu (2017). A Brief Study on the Qualities of an Effective Sentence. Journal of Language Teaching and Research, 801.
[14] Changhe Yuan, Heejin Lim, and Tsai-Ching Lu (2011). Most Relevant Explanation in Bayesian Networks. Journal of Artificial Intelligence Research 42(1), 309–352.
[15] J. C. Zemla et al. (2017). Evaluating everyday explanations. Psychonomic Bulletin and Review 24(5), 1488–1500.
[16] Francisco J. Chiyah Garcia et al. (2018). Explainable Autonomy: A Study of Explanation Styles for Building Clear Mental Models through a Multimodal Interface. In Proceedings of the 11th International Conference on Natural Language Generation (INLG), Tilburg, The Netherlands.
[17] Helen Hastie and Anja Belz (2014). A Comparative Evaluation Framework for NLG in Interactive Systems. In Proceedings of the Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland.
[18] Kishore Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02). ACL, USA, 311–318.
[19] Chin-Yew Lin (2004). ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), 25–26.
[20] Kathleen T. McWhorter (2016). Pathways: Scenarios for Sentence and Paragraph Writing, Books a la Carte Edition, 4th Edition.
[21] Jekaterina Novikova et al. (2017). Why We Need New Evaluation Metrics for NLG. In Proceedings of EMNLP 2017. ACL, 2241–2252.
[22] F. R. Ease (2009). Flesch–Kincaid readability test. Reading, Vol. 70, 8–10.
[23] Robert Gunning (1969). The Fog Index after twenty years. Journal of Business Communication 6(2), 3–13.
[24] Arthur C. Graesser (2016). Conversations with AutoTutor Help Students Learn. International Journal of Artificial Intelligence in Education 26(1), 124–132.
[25] Dimitra Gkatzia, Oliver Lemon, and Verena Rieser (2017). Data-to-Text Generation Improves Decision-Making Under Uncertainty. IEEE Computational Intelligence Magazine 12(3), 10–17.
[26] TAUS (2014). Best Practices on Readability Evaluation. Online; last accessed 12 March 2020.