Explainable Artificial Intelligence (XAI): Concepts, Taxonomies,
Opportunities and Challenges toward Responsible AI
Alejandro Barredo Arrieta (a), Natalia Díaz-Rodríguez (b), Javier Del Ser (a,c,d), Adrien Bennetot (b,e,f),
Siham Tabik (g), Alberto Barbado (h), Salvador Garcia (g), Sergio Gil-Lopez (a), Daniel Molina (g),
Richard Benjamins (h), Raja Chatila (f), and Francisco Herrera (g)
(a) TECNALIA, 48160 Derio, Spain
(b) ENSTA, Institute Polytechnique Paris and INRIA Flowers Team, Palaiseau, France
(c) University of the Basque Country (UPV/EHU), 48013 Bilbao, Spain
(d) Basque Center for Applied Mathematics (BCAM), 48009 Bilbao, Bizkaia, Spain
(e) Segula Technologies, Parc d'activité de Pissaloup, Trappes, France
(f) Institut des Systèmes Intelligents et de Robotique, Sorbonne Université, France
(g) DaSCI Andalusian Institute of Data Science and Computational Intelligence, University of Granada, 18071 Granada, Spain
(h) Telefonica, 28050 Madrid, Spain
Abstract
In the last few years, Artificial Intelligence (AI) has achieved notable momentum that, if harnessed
appropriately, may deliver great benefits across many application sectors. For this to happen in the short
term with Machine Learning, the entire community must overcome the barrier of explainability, an inherent
problem of the latest techniques brought by sub-symbolism (e.g. ensembles or Deep Neural Networks)
that was not present in the previous wave of AI (namely, expert systems and rule-based models).
Paradigms addressing this problem fall within the so-called eXplainable AI (XAI) field, which is widely
acknowledged as a crucial feature for the practical deployment of AI models. The overview presented in
this article examines the existing literature and contributions already made in the field of XAI, including a
prospect toward what is yet to be reached. For this purpose we summarize previous efforts to define
explainability in Machine Learning, establishing a novel definition of explainable Machine Learning that
covers such prior conceptual propositions with a major focus on the audience for which explainability
is sought. Departing from this definition, we propose and discuss a taxonomy of recent contributions
related to the explainability of different Machine Learning models, including those aimed at explaining
Deep Learning methods, for which a second dedicated taxonomy is built and examined in detail. This
critical literature analysis serves as the motivating background for a series of challenges faced by XAI,
such as the interesting crossroads of data fusion and explainability. Our prospects lead toward the concept
of Responsible Artificial Intelligence, namely, a methodology for the large-scale implementation of AI
methods in real organizations with fairness, model explainability and accountability at its core. Our
ultimate goal is to provide newcomers to the field of XAI with a thorough taxonomy that can serve
as reference material to stimulate future research advances, but also to encourage experts and
professionals from other disciplines to embrace the benefits of AI in their activity sectors, without any
prior bias stemming from its lack of interpretability.
Keywords: Explainable Artificial Intelligence, Machine Learning, Deep Learning, Data Fusion,
Interpretability, Comprehensibility, Transparency, Privacy, Fairness, Accountability, Responsible
Artificial Intelligence.
* Corresponding author. TECNALIA. P. Tecnologico, Ed. 700. 48170 Derio (Bizkaia), Spain. E-mail: javier.delser@tecnalia.com
1. Introduction
Artificial Intelligence (AI) lies at the core of many activity sectors that have embraced new information
technologies [1]. While the roots of AI trace back to several decades ago, there is a clear consensus on the
paramount importance featured nowadays by intelligent machines endowed with learning, reasoning and
adaptation capabilities. It is by virtue of these capabilities that AI methods are achieving unprecedented
levels of performance when learning to solve increasingly complex computational tasks, making them
pivotal for the future development of the human society [2]. The sophistication of AI-powered systems
has lately increased to such an extent that almost no human intervention is required for their design
and deployment. When decisions derived from such systems ultimately affect humans’ lives (as in e.g.
medicine, law or defense), there is an emerging need for understanding how such decisions are furnished
by AI methods [3].
While the very first AI systems were easily interpretable, the last years have witnessed the rise of
opaque decision systems such as Deep Neural Networks (DNNs). The empirical success of Deep Learning
(DL) models such as DNNs stems from a combination of efficient learning algorithms and their huge
parametric space. The latter comprises hundreds of layers and millions of parameters, which leads
DNNs to be considered complex black-box models [4]. The opposite of black-box-ness is transparency,
i.e., the search for a direct understanding of the mechanism by which a model works [5].
As black-box Machine Learning (ML) models are increasingly being employed to make important
predictions in critical contexts, the demand for transparency is increasing from the various stakeholders in
AI [6]. The danger lies in creating and using decisions that are not justifiable, legitimate, or that simply do
not allow obtaining detailed explanations of their behaviour [7]. Explanations supporting the output of a
model are crucial, e.g., in precision medicine, where experts require far more information from the model
than a simple binary prediction to support their diagnosis [8]. Other examples include autonomous
vehicles in transportation, security, and finance, among others.
In general, humans are reluctant to adopt techniques that are not directly interpretable, tractable and
trustworthy [9], given the increasing demand for ethical AI [3]. It is customary to think that, by focusing
solely on performance, systems will become increasingly opaque. This is true in the sense that there is a
trade-off between the performance of a model and its transparency [10]. However, an improvement in the
understanding of a system can lead to the correction of its deficiencies. When developing a ML model,
the consideration of interpretability as an additional design driver can improve its implementability for
three reasons:
• Interpretability helps ensure impartiality in decision-making, i.e. to detect, and consequently correct,
bias in the training dataset.
• Interpretability facilitates the provision of robustness by highlighting potential adversarial perturbations
that could change the prediction.
• Interpretability can act as an insurance that only meaningful variables influence the output, i.e., guaranteeing
that an underlying truthful causality exists in the model reasoning.
All this means that, to be considered practical, the interpretation of the system should provide either
an understanding of the model mechanisms and predictions, a visualization of the model's discrimination
rules, or hints on what could perturb the model [11].
In order to avoid limiting the effectiveness of the current generation of AI systems, eXplainable AI
(XAI) [7] proposes creating a suite of ML techniques that 1) produce more explainable models while
maintaining a high level of learning performance (e.g., prediction accuracy), and 2) enable humans to
understand, appropriately trust, and effectively manage the emerging generation of artificially intelligent
partners. XAI also draws insights from the Social Sciences [12] and considers the psychology of
explanation.
[Figure 1: plot of the number of contributed works per year (2012 to December 10th, 2019) for the search terms "Interpretable Artificial Intelligence", "XAI" and "Explainable Artificial Intelligence".]
Figure 1: Evolution of the number of total publications whose title, abstract and/or keywords refer to the field of XAI during
the last years. Data retrieved from Scopus (December 10th, 2019) by using the search terms indicated in the legend when
querying this database. It is interesting to note the latent need for interpretable AI models over time (which conforms to intuition, as
interpretability is a requirement in many scenarios), yet it has not been until 2017 when the interest in techniques to explain AI
models has permeated throughout the research community.
Figure 1 displays the rising trend of contributions on XAI and related concepts. This literature
outbreak shares its rationale with the research agendas of national governments and agencies. Although
some recent surveys [8, 13, 10, 14, 15, 16, 17] summarize the upsurge of activity in XAI across sectors
and disciplines, this overview aims to cover the creation of a complete unified framework of categories
and concepts that allow for scrutiny and understanding of the field of XAI methods. Furthermore, we pose
intriguing thoughts around the explainability of AI models in data fusion contexts with regards to data
privacy and model confidentiality. This, along with other research opportunities and challenges identified
throughout our study, serves as the pull factor toward Responsible Artificial Intelligence, the term by which
we refer to a series of AI principles that must be met when deploying AI in real applications. As we
will later show in detail, model explainability is among the most crucial aspects to be ensured within this
methodological framework. All in all, the novel contributions of this overview can be summarized as
follows:
1. Grounded on a first elaboration of concepts and terms used in XAI-related research, we propose a
novel definition of explainability that places audience (Figure 2) as a key aspect to be considered when
explaining a ML model. We also elaborate on the diverse purposes sought when using XAI techniques,
from trustworthiness to privacy awareness, which round up the claimed importance of purpose and
targeted audience in model explainability.
2. We define and examine the different levels of transparency that a ML model can feature by itself, as
well as the diverse approaches to post-hoc explainability, namely, the explanation of ML models that
are not transparent by design.
3. We thoroughly analyze the literature on XAI and related concepts published to date, covering ap-
proximately 400 contributions arranged into two different taxonomies. The first taxonomy addresses
the explainability of ML models using the previously made distinction between transparency and
post-hoc explainability, including models that are transparent by themselves, and Deep and non-Deep (i.e.,
shallow) learning models. The second taxonomy deals with XAI methods suited for the explanation of
Deep Learning models, using classification criteria closely linked to this family of ML methods (e.g.
layerwise explanations, representation vectors, attention).
4. We enumerate a series of challenges of XAI that still remain insufficiently addressed to date. Specifi-
cally, we identify research needs around the concepts and metrics to evaluate the explainability of ML
models, and outline research directions toward making Deep Learning models more understandable.
We further augment the scope of our prospects toward the implications of XAI techniques in regards
to confidentiality, robustness in adversarial settings, data diversity, and other areas intersecting with
explainability.
5. After the previous prospective discussion, we arrive at the concept of Responsible Artificial Intelligence,
a manifold concept that imposes the systematic adoption of several AI principles for AI models to
be of practical use. In addition to explainability, the guidelines behind Responsible AI establish that
fairness, accountability and privacy should also be considered when implementing AI models in real
environments.
6. Since Responsible AI blends together model explainability and privacy/security by design, we call
for a profound reflection around the benefits and risks of XAI techniques in scenarios dealing with
sensitive information and/or confidential ML models. As we will later show, the regulatory push
toward data privacy, quality, integrity and governance demands more efforts to assess the role of XAI
in this arena. In this regard, we provide an insight on the implications of XAI in terms of privacy and
security under different data fusion paradigms.
The remainder of this overview is structured as follows: first, Section 2 and subsections therein open a
discussion on the terminology and concepts revolving around explainability and interpretability in AI,
ending up with the aforementioned novel definition of interpretability (Subsections 2.1 and 2.2), and a
general criterion to categorize and analyze ML models from the XAI perspective. Sections 3 and 4 proceed
by reviewing recent findings on XAI for ML models (on transparent models and post-hoc techniques
respectively) that comprise the main division in the aforementioned taxonomy. We also include a review
on hybrid approaches among the two, to attain XAI. Benefits and caveats of the synergies among the
families of methods are discussed in Section 5, where we present a prospect of general challenges and
some consequences to be cautious about. Finally, Section 6 elaborates on the concept of Responsible
Artificial Intelligence. Section 7 concludes the survey with an outlook aimed at engaging the community
around this vibrant research area, which has the potential to impact society, in particular those sectors that
have progressively embraced ML as a core technology of their activity.
2. Explainability: What, Why, What For and How?
Before proceeding with our literature study, it is convenient to first establish a common point of
understanding on what the term explainability stands for in the context of AI and, more specifically,
ML. This is indeed the purpose of this section, namely, to pause at the numerous definitions that have
been done in regards to this concept (what?), to argue why explainability is an important issue in AI and
ML (why? what for?) and to introduce the general classification of XAI approaches that will drive the
literature study thereafter (how?).
2.1. Terminology Clarification
One of the issues that hinders the establishment of common grounds is the interchangeable misuse of
interpretability and explainability in the literature. There are notable differences among these concepts.
To begin with, interpretability refers to a passive characteristic of a model referring to the level at which
a given model makes sense for a human observer. This feature is also expressed as transparency. By
contrast, explainability can be viewed as an active characteristic of a model, denoting any action or
procedure taken by a model with the intent of clarifying or detailing its internal functions.
To summarize the most commonly used nomenclature, in this section we clarify the distinction and
similarities among terms often used in the ethical AI and XAI communities.
• Understandability (or equivalently, intelligibility) denotes the characteristic of a model to make a
human understand its function – how the model works – without any need for explaining its internal
structure or the algorithmic means by which the model processes data internally [18].
• Comprehensibility: when conceived for ML models, comprehensibility refers to the ability of a
learning algorithm to represent its learned knowledge in a human understandable fashion [19, 20, 21].
This notion of model comprehensibility stems from the postulates of Michalski [22], which stated that
"the results of computer induction should be symbolic descriptions of given entities, semantically and
structurally similar to those a human expert might produce observing the same entities. Components of
these descriptions should be comprehensible as single 'chunks' of information, directly interpretable in
natural language, and should relate quantitative and qualitative concepts in an integrated fashion".
Given its difficult quantification, comprehensibility is normally tied to the evaluation of the model
complexity [17].
• Interpretability: it is defined as the ability to explain or to provide the meaning in understandable
terms to a human.
• Explainability: explainability is associated with the notion of explanation as an interface between
humans and a decision maker that is, at the same time, both an accurate proxy of the decision maker
and comprehensible to humans [17].
• Transparency: a model is considered to be transparent if by itself it is understandable. Since a model
can feature different degrees of understandability, transparent models in Section 3 are divided into three
categories: simulatable models, decomposable models and algorithmically transparent models [5].
In all the above definitions, understandability emerges as the most essential concept in XAI. Both
transparency and interpretability are strongly tied to this concept: while transparency refers to the
characteristic of a model to be, on its own, understandable for a human, understandability measures the
degree to which a human can understand a decision made by a model. Comprehensibility is also connected
to understandability in that it relies on the capability of the audience to understand the knowledge contained
in the model. All in all, understandability is a two-sided matter: model understandability and human
understandability. This is the reason why the definition of XAI given in Section 2.2 refers to the concept
of audience, as the cognitive skills and pursued goal of the users of the model have to be taken into
account jointly with the intelligibility and comprehensibility of the model in use. This prominent role
taken by understandability makes the concept of audience the cornerstone of XAI, as we next elaborate in
further detail.
2.2. What?
Although it might be considered to be beyond the scope of this paper, it is worth noting the discussion
held around general theories of explanation in the realm of philosophy [23]. Many proposals have been
made in this regard, suggesting the need for a general, unified theory that approximates the structure and
intent of an explanation. However, no such general theory has yet withstood criticism.
For the time being, the most agreed-upon thought blends together different approaches to explanation
drawn from diverse knowledge disciplines. A similar problem is found when addressing interpretability
in AI. It appears from the literature that there is not yet a common point of understanding on what
interpretability or explainability are. However, many contributions claim the achievement of interpretable
models and techniques that empower explainability.
To shed some light on this lack of consensus, it might be interesting to place the reference starting
point at the definition of the term Explainable Artificial Intelligence (XAI) given by D. Gunning in [7]:
“XAI will create a suite of machine learning techniques that enables human users to understand,
appropriately trust, and effectively manage the emerging generation of artificially intelligent partners”
This definition brings together two concepts (understanding and trust) that need to be addressed in
advance. However, it fails to consider other purposes motivating the need for interpretable AI models,
such as causality, transferability, informativeness, fairness and confidence [5, 24, 25, 26]. We will later
delve into these topics, mentioning them here as a supporting example of the incompleteness of the above
definition.
As exemplified by the definition above, a thorough, complete definition of explainability in AI
still eludes us. A broader reformulation of this definition (e.g. "An explainable Artificial
Intelligence is one that produces explanations about its functioning") would fail to fully characterize the
term in question, leaving aside important aspects such as its purpose. To work toward a complete
definition, a definition of explanation is first required.
As extracted from the Cambridge Dictionary of English Language, an explanation is "the details or
reasons that someone gives to make something clear or easy to understand" [27]. In the context of an
ML model, this can be rephrased as: "the details or reasons a model gives to make its functioning clear
or easy to understand". It is at this point where opinions start to diverge. Inherently stemming from the
previous definitions, two ambiguities can be pointed out. First, the details or reasons used to explain
are completely dependent on the audience to which they are presented. Second, whether the explanation
has left the concept clear or easy to understand also depends completely on the audience. Therefore, the
definition must be rephrased to reflect explicitly the dependence of the explainability of the model on the
audience. To this end, a reworked definition could read as:
Given a certain audience, explainability refers to the details and reasons a model gives to make its
functioning clear or easy to understand.
Since explaining, like arguing, may involve weighing, comparing or convincing an audience with
logic-based formalizations of (counter) arguments [28], explainability might take us into the realm of
cognitive psychology and the psychology of explanations [7], since measuring whether something has
been understood or put clearly is a hard task to gauge objectively. However, measuring to what
extent the internals of a model can be explained could be tackled objectively. Any means to reduce the
complexity of the model or to simplify its outputs should be considered as an XAI approach. How big
this leap is in terms of complexity or simplicity will correspond to how explainable the resulting model
is. An underlying problem that remains unsolved is that the interpretability gain provided by such XAI
approaches may not be straightforward to quantify: for instance, a model simplification can be evaluated
based on the reduction of the number of architectural elements or number of parameters of the model
itself (as often made, for instance, for DNNs). On the contrary, the use of visualization methods or natural
language for the same purpose does not favor a clear quantification of the improvements gained in terms
of interpretability. The derivation of general metrics to assess the quality of XAI approaches remains
an open challenge that should be under the spotlight of the field in forthcoming years. We will further
discuss this research direction in Section 5.
Explainability is linked to post-hoc explainability, since it covers the techniques used to convert a
non-interpretable model into an explainable one. In the remainder of this manuscript, explainability will be
considered as the main design objective, since it represents a broader concept. A model can be explained,
but its interpretability is something that comes from the design of the model itself. Bearing
these observations in mind, explainable AI can be defined as follows:
Given an audience, an explainable Artificial Intelligence is one that produces details or reasons to
make its functioning clear or easy to understand.
This definition, posed here as a first contribution of the present overview, implicitly assumes that the
ease of understanding and clarity targeted by XAI techniques for the model at hand translate into different
application purposes, such as a better trust in the model's output by the audience.
2.3. Why?
As stated in the introduction, explainability is one of the main barriers AI is facing nowadays in
regards to its practical implementation. The inability to explain or to fully understand the reasons by
which state-of-the-art ML algorithms perform as well as they do is a problem that finds its roots in two
different causes, which are conceptually illustrated in Figure 2.
Without a doubt, the first cause is the gap between the research community and business sectors,
impeding the full penetration of the newest ML models in sectors that have traditionally lagged behind
in the digital transformation of their processes, such as banking, finance, security and health, among
many others. In general this issue occurs in strictly regulated sectors that show some reluctance to
implement techniques that may put their assets at risk.
The second axis is that of knowledge. AI has helped research across the world with the task of
inferring relations that were far beyond the human cognitive reach. Every field dealing with huge amounts
of reliable data has largely benefited from the adoption of AI and ML techniques. However, we are
entering an era in which results and performance metrics are the only interest shown in research
studies. Although for certain disciplines this might be justified, science and society are far from being
concerned only with performance. The search for understanding is what opens the door for further model
improvement and its practical utility.
[Figure 2 content: target audiences in XAI and their motivations. Who? Domain experts/users of the model (e.g. medical doctors, insurance agents); why? Trust the model itself, gain scientific knowledge. Who? Regulatory entities/agencies; why? Certify model compliance with the legislation in force, audits, ... Who? Users affected by model decisions; why? Understand their situation, verify fair decisions, ... Who? Managers and executive board members; why? Assess regulatory compliance, understand corporate AI applications, ... Who? Data scientists, developers, product owners, ...; why? Ensure/improve product efficiency, research, new functionalities, ...]
Figure 2: Diagram showing the different purposes of explainability in ML models sought by different audience profiles. Two goals
appear to prevail across them: the need for model understanding and regulatory compliance. Image partly inspired by the one presented
in [29], used with permission from IBM.
The following section develops these ideas further by analyzing the goals motivating the search for
explainable AI models.
2.4. What for?
The research activity around XAI has so far exposed different goals to be drawn from the achievement
of an explainable model. Almost none of the reviewed papers completely agree on the goals that an
explainable model should fulfill. However, all these different goals might help discriminate the purpose
for which a given exercise of ML explainability is performed. Unfortunately,
scarce contributions have attempted to define such goals from a conceptual perspective [5, 13, 24, 30].
We now synthesize and enumerate definitions for these XAI goals, so as to settle a first classification
criterion for the full suite of papers covered in this review:
Table 1: Goals pursued in the reviewed literature toward reaching explainability, and their main target audience.
XAI Goal | Main target audience (Fig. 2) | References
Trustworthiness | Domain experts, users of the model affected by decisions | [5, 10, 24, 32, 33, 34, 35, 36, 37]
Causality | Domain experts, managers and executive board members, regulatory entities/agencies | [35, 38, 39, 40, 41, 42, 43]
Transferability | Domain experts, data scientists | [5, 44, 21, 26, 45, 30, 32, 37, 38, 39, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85]
Informativeness | All | [5, 44, 21, 25, 26, 45, 30, 32, 34, 35, 37, 38, 41, 46, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 63, 64, 65, 66, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 86, 87, 88, 89, 59, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154]
Confidence | Domain experts, developers, managers, regulatory entities/agencies | [5, 45, 35, 46, 48, 54, 61, 72, 88, 89, 96, 108, 117, 119, 155]
Fairness | Users affected by model decisions, regulatory entities/agencies | [5, 24, 45, 35, 47, 99, 100, 101, 120, 121, 128, 156, 157, 158]
Accessibility | Product owners, managers, users affected by model decisions | [21, 26, 30, 32, 37, 50, 53, 55, 62, 67, 68, 69, 70, 71, 74, 75, 76, 86, 93, 94, 103, 105, 107, 108, 111, 112, 113, 114, 115, 124, 129]
Interactivity | Domain experts, users affected by model decisions | [37, 50, 59, 65, 67, 74, 86, 124]
Privacy awareness | Users affected by model decisions, regulatory entities/agencies | [89]
• Trustworthiness: several authors agree upon the search for trustworthiness as the primary aim of an
explainable AI model [31, 32]. However, declaring a model explainable as per its capability of inducing
trust might not be fully compliant with the requirement of model explainability. Trustworthiness might
be considered as the confidence of whether a model will act as intended when facing a given problem.
Although it should most certainly be a property of any explainable model, this does not imply that every
trustworthy model can be considered explainable on its own, nor is trustworthiness a property that is easy
to quantify. Trust might be far from being the only purpose of an explainable model, since the relation
between the two, if agreed upon, is not reciprocal. Part of the reviewed papers mention the concept of
trust when stating their purpose for achieving explainability. However, as seen in Table 1, they do not
amount to a large share of the recent contributions related to XAI.
• Causality: another common goal for explainability is that of finding causality among data variables.
Several authors argue that explainable models might ease the task of finding relationships that, should
they occur, could be tested further for a stronger causal link between the involved variables [159, 160].
The inference of causal relationships from observational data is a field that has been broadly studied
over time [161]. As widely acknowledged by the community working on this topic, causality requires a
wide frame of prior knowledge to prove that observed effects are causal. A ML model only discovers
correlations among the data it learns from, and therefore might not suffice for unveiling a cause-effect
relationship. However, since causation implies correlation, an explainable ML model could validate
the results provided by causality inference techniques, or provide a first intuition of possible causal
relationships within the available data. Again, Table 1 reveals that causality is not among the most
important goals if we attend to the amount of papers that state it explicitly as their goal.
• Transferability: models are always bounded by constraints that should allow for their seamless
transferability. This is the main reason why a training-testing approach is used when dealing with
ML problems [162, 163]. Explainability is also an advocate for transferability, since it may ease the
task of elucidating the boundaries that might affect a model, allowing for a better understanding and
implementation. Similarly, the mere understanding of the inner relations taking place within a model
facilitates the ability of a user to reuse this knowledge in another problem. There are cases in which the
lack of a proper understanding of the model might drive the user toward incorrect assumptions and
fatal consequences [44, 164]. Transferability should also be among the resulting properties of an
explainable model, but again, not every transferable model should be considered explainable. As
observed in Table 1, rendering a model explainable in order to better understand the concepts needed to
reuse it or to improve its performance is the second most common reason for pursuing model
explainability.
• Informativeness: ML models are used with the ultimate intention of supporting decision making [92].
However, it should not be forgotten that the problem being solved by the model is not equal to that
being faced by its human counterpart. Hence, a great deal of information is needed in order to be able
to relate the user's decision to the solution given by the model, and to avoid falling into misconception
pitfalls. For this purpose, explainable ML models should give information about the problem being
tackled. The most frequent reason found among the reviewed papers is that of extracting information
about the inner relations of a model. Almost all rule extraction techniques substantiate their approach on
the search for a simpler understanding of what the model internally does, stating that this knowledge
(information) can be expressed in simpler proxies that explain the antecedent model. This is the most
used argument found among the reviewed papers to back up what they expect from reaching explainable
models.
• Confidence: as a generalization of robustness and stability, confidence should always be assessed
on a model in which reliability is expected. The methods to maintain confidence under control are
different depending on the model. As stated in [165, 166, 167], stability is a must-have when drawing
interpretations from a certain model. Trustworthy interpretations should not be produced by models
that are not stable. Hence, an explainable model should contain information about the confidence of its
working regime.
• Fairness: from a social standpoint, explainability can be considered as the capacity to reach and
guarantee fairness in ML models. In a certain literature strand, an explainable ML model suggests a
clear visualization of the relations affecting a result, allowing for a fairness or ethical analysis of the
model at hand [3, 100]. Likewise, a related objective of XAI is highlighting bias in the data a model
was exposed to [168, 169]. The support of algorithms and models is growing fast in fields that involve
human lives, hence explainability should be considered as a bridge to avoid the unfair or unethical use
of algorithms' outputs.
• Accessibility: a minor subset of the reviewed contributions argues for explainability as the property
that allows end users to get more involved in the process of improving and developing a certain ML
model [37, 86]. It seems clear that explainable models will ease the burden felt by non-technical or
non-expert users when having to deal with algorithms that seem incomprehensible at first sight. This
concept is the third most considered goal among the surveyed literature.
• Interactivity: some contributions [50, 59] include the ability of a model to be interactive with the user
as one of the goals targeted by an explainable ML model. Once again, this goal is related to fields in
which the end users are of great importance, and their ability to tweak and interact with the models is
what ensures success.
• Privacy awareness: almost forgotten in the reviewed literature, one of the byproducts enabled by
explainability in ML models is the ability to assess privacy. ML models may have complex representations
of their learned patterns. Not being able to understand what has been captured by the model [4] and
stored in its internal representation may entail a privacy breach. Conversely, the ability of non-authorized
third parties to explain the inner relations of a trained model may also compromise the differential
privacy of the data origin. Due to its criticality in sectors where XAI is foreseen to play a crucial role,
confidentiality and privacy issues will be covered further in Subsections 5.4 and 6.3, respectively.
This subsection has reviewed the goals encountered across the broad scope of the reviewed papers.
All these goals clearly underlie the concept of explainability introduced earlier in this
section. To round up this prior analysis on the concept of explainability, the last subsection deals with
different strategies followed by the community to address explainability in ML models.
2.5. How?
The literature makes a clear distinction between models that are interpretable by design and those
that can be explained by means of external XAI techniques. This duality could also be regarded as the
difference between interpretable models and model interpretability techniques; a more widely accepted
classification is that of transparent models and post-hoc explainability. This same duality also appears
in [17], in which its authors distinguish between methods addressing the transparent box design problem
and methods addressing the problem of explaining black-box models. The present work further extends
the distinction made among transparent models by including the different levels of transparency
considered.
Within transparency, three levels are contemplated: algorithmic transparency, decomposability and
simulatability¹. Among post-hoc techniques we may distinguish among text explanations, visualizations,
local explanations, explanations by example, explanations by simplification and feature relevance. In this
context, there is a broader distinction proposed by [24] discerning between 1) opaque systems, where
the mappings from input to output are invisible to the user; 2) interpretable systems, in which users can
mathematically analyze the mappings; and 3) comprehensible systems, in which the models should output
symbols or rules along with their specific output to aid in the understanding of the rationale behind the
mappings being made. This last classification criterion can be considered to be included within the one
proposed earlier, hence this paper will follow the more specific one.
¹ The alternative term simulability is also used in the literature to refer to the capacity of a system or process to be simulated.
However, we note that this term does not appear in current English dictionaries.
2.5.1. Levels of Transparency in Machine Learning Models
Transparent models convey some degree of interpretability by themselves. Models belonging to
this category can be also approached in terms of the domain in which they are interpretable, namely,
algorithmic transparency, decomposability and simulatability. As we elaborate next in connection to
Figure 3, each of these classes contains its predecessors, e.g. a simulatable model is at the same time a
model that is decomposable and algorithmically transparent:
• Simulatability denotes the ability of a model to be simulated or thought about strictly by a human,
hence complexity takes a dominant place in this class. This being said, simple but extensive (i.e., with a
too large amount of rules) rule-based systems fall out of this characteristic, whereas a single-perceptron
neural network falls within it. This aspect aligns with the claim that sparse linear models are more
interpretable than dense ones [170], and that an interpretable model is one that can be easily presented
to a human by means of text and visualizations [32]. Again, endowing a decomposable model with
simulatability requires that the model be self-contained enough for a human to think and reason
about it as a whole.
• Decomposability stands for the ability to explain each of the parts of a model (input, parameter and
calculation). It can be considered as intelligibility as stated in [171]. This characteristic might empower
the ability to understand, interpret or explain the behavior of a model. However, as occurs with
algorithmic transparency, not every model can fulfill this property. Decomposability requires every
input to be readily interpretable (e.g. cumbersome features will not fit the premise). The added
constraint for an algorithmically transparent model to become decomposable is that every part of the
model must be understandable by a human without the need for additional tools.
• Algorithmic Transparency can be seen in different ways. It deals with the ability of the user to
understand the process followed by the model to produce any given output from its input data. Put
differently, a linear model is deemed transparent because its error surface can be understood and
reasoned about, allowing the user to understand how the model will act in every situation it may
face [163]. Contrarily, this is not possible with deep architectures, since their loss landscape might be
opaque [172, 173]: it cannot be fully observed and the solution has to be approximated through heuristic
optimization (e.g. stochastic gradient descent). The main constraint for algorithmically transparent
models is that the model has to be fully explorable by means of mathematical analysis and methods. A
minimal numerical sketch of this idea for a linear model is given right after this list.
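As a purely illustrative sketch (not taken from the paper), the snippet below fits a linear model in closed form through the ordinary least-squares normal equations; the synthetic data, feature names and coefficient values are arbitrary assumptions made for the example. The point is that both the fitting procedure and the resulting coefficients can be inspected analytically, which is what algorithmic transparency demands, in contrast to deep architectures whose solutions are approximated by heuristic optimization.

```python
# Illustrative sketch: a linear model is algorithmically transparent in that its
# fit can be obtained and reasoned about analytically, with no heuristic search.
# The data and feature names below are made up for the example.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on two readable features plus small noise.
X = rng.normal(size=(200, 2))                    # e.g. columns: weight, height
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Closed-form least-squares solution with an intercept term:
# b = argmin ||A b - y||^2 with A = [1, X], solved exactly by np.linalg.lstsq.
A = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Each coefficient is directly interpretable: the expected change in y per unit
# increase of the corresponding feature, holding the other one fixed.
print({"intercept": coef[0], "weight": coef[1], "height": coef[2]})
```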
[Figure 3 content: three toy models Mϕ over the features x1 (weight), x2 (height) and x3 (age), ranging from simple threshold rules (e.g. "If x2 > 180 then y = 1; else if x1 + x3 > 150 then y = 1; else y = 0") to rules built on transformed features (fA(x1), fB(x2) combined through g), and rule statistics over the positive training samples.]
Figure 3: Conceptual diagram exemplifying the different levels of transparency characterizing a ML model Mϕ, with ϕ denoting
the parameter set of the model at hand: (a) simulatability; (b) decomposability; (c) algorithmic transparency. Without loss of
generality, the example focuses on the ML model as the explanation target. However, other targets for explainability may include a
given example, the output classes or the dataset itself.
2.5.2. Post-hoc Explainability Techniques for Machine Learning Models
Post-hoc explainability targets models that are not readily interpretable by design by resorting to
diverse means to enhance their interpretability, such as text explanations, visual explanations, local
explanations, explanations by example, explanations by simplification and feature relevance explanation
techniques. Each of these techniques covers one of the most common ways in which humans explain
systems and processes by themselves.
Further along, actual techniques, or rather groups of techniques, are specified to ease the future work
of any researcher who intends to look up a specific technique that suits their knowledge. The classification
also includes the type of data to which the techniques have been applied. Note that many techniques might
be suitable for many different types of data, although the categorization only considers the type used by
the authors who proposed each technique. Overall, post-hoc explainability techniques are divided first by
the intention of the author (explanation technique, e.g. explanation by simplification), then by the method
utilized (actual technique, e.g. sensitivity analysis), and finally by the type of data to which it was applied
(e.g. images).
• Text explanations deal with the problem of bringing explainability to a model by means of learning to
generate text explanations that help explain the results of the model [169]. Text explanations also
include every method generating symbols that represent the functioning of the model. These symbols
may portray the rationale of the algorithm by means of a semantic mapping from model to symbols.
• Visual explanation techniques for post-hoc explainability aim at visualizing the model's behavior.
Many of the visualization methods existing in the literature come along with dimensionality reduction
techniques that allow for a simple, human-interpretable visualization. Visualizations may be coupled
with other techniques to improve their understanding, and are considered the most suitable way to
introduce complex interactions among the variables involved in the model to users not acquainted with
ML modeling.
• Local explanations tackle explainability by segmenting the solution space and giving explanations
for less complex solution subspaces that are relevant for the whole model. These explanations can be
formed by means of techniques with the differentiating property that they only explain part of the
whole system's functioning.
• Explanations by example consider the extraction of data examples that relate to the result generated by
a certain model, enabling a better understanding of the model itself. Similarly to how humans behave
when attempting to explain a given process, explanations by example are mainly centered on extracting
representative examples that grasp the inner relationships and correlations found by the model being
analyzed.
• Explanations by simplification collectively denote those techniques in which a whole new system is
rebuilt based on the trained model to be explained. This new, simplified model usually attempts to
optimize its resemblance to the functioning of its antecedent, while reducing its complexity and keeping
a similar performance score. An interesting byproduct of this family of post-hoc techniques is that the
simplified model is, in general, easier to implement due to its reduced complexity with respect to the
model it represents.
• Finally, feature relevance explanation methods for post-hoc explainability clarify the inner functioning
of a model by computing a relevance score for the variables it manages. These scores quantify the
sensitivity (influence) a feature has on the output of the model. A comparison of the scores among
different variables unveils the importance granted by the model to each of such variables when producing
its output. Feature relevance methods can be thought of as an indirect way to explain a model; a minimal
sketch of this idea is given right after this list.
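To make the feature relevance category concrete, the following sketch (not part of the reviewed literature) computes permutation importances: each feature is randomly shuffled and the resulting drop in a performance metric is used as its relevance score. The dataset and the random forest playing the role of the black-box model are arbitrary choices for illustration.

```python
# Illustrative sketch of a feature relevance explanation via permutation
# importance: the accuracy drop observed when a feature is shuffled serves as
# its relevance score. Dataset and black-box model are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An opaque model whose inner functioning we do not inspect directly.
black_box = RandomForestClassifier(n_estimators=200, random_state=0)
black_box.fit(X_train, y_train)

# Relevance score per feature: mean accuracy drop over repeated shuffles.
result = permutation_importance(black_box, X_test, y_test,
                                n_repeats=20, random_state=0)
ranking = sorted(zip(X.columns, result.importances_mean),
                 key=lambda item: item[1], reverse=True)
for name, score in ranking[:5]:
    print(f"{name}: {score:.3f}")
```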
The above classification (portrayed graphically in Figure 4) will be used when reviewing spe-
cific/agnostic XAI techniques for ML models in the following sections (Table 2). For each ML model, a
distinction of the proposals belonging to each of these categories is presented in order to give an overall
picture of the field's trends.
3. Transparent Machine Learning Models
The previous section introduced the concept of transparent models. A model is considered to be
transparent if by itself it is understandable. The models surveyed in this section are a suite of transparent
models that can fall in one or all of the levels of model transparency described previously (namely,
simulatability, decomposability and algorithmic transparency). In what follows we provide reasons for
this statement, with graphical support given in Figure 5.
[Figure 4 content: for a black-box model Mϕ with input x and output y, panels illustrate feature relevance ("Feature x2 has a 90% importance in y"), local explanations ("What happens with the prediction yi if we change slightly the features of xi?"), visualization, model simplification, text explanations ("The output for xi is yi because x3 > γ") and explanations by example (representative pairs xA → yA, xB → yB, xC → yC).]
Figure 4: Conceptual diagram showing the different post-hoc explainability approaches available for a ML model Mϕ.
3.1. Linear/Logistic Regression
Logistic Regression (LR) is a classification model used to predict a dependent variable (category) that is
dichotomous (binary). When the dependent variable is continuous, linear regression is its counterpart.
This model relies on the assumption of a linear dependence between the predictors and the predicted
variable, which impedes a flexible fit to the data. This specific reason (the stiffness of the model) is what
keeps the model under the umbrella of transparent methods. However, as stated in Section 2,
explainability is linked to a certain audience, which makes a model fall under either category depending
on who is to interpret it. This way, logistic and linear regression, although clearly meeting the characteristics
of transparent models (algorithmic transparency, decomposability and simulatability), may also demand
post-hoc explainability techniques (mainly visualization), particularly when the model is to be explained
to non-expert audiences.
This model has been widely used within the Social Sciences for quite a long time,
which has pushed researchers to create ways of explaining the results of the models to non-expert
users. Most authors agree on the different techniques used to analyze and express the soundness of LR
[174, 175, 176, 177], including the overall model evaluation, statistical tests of individual predictors,
goodness-of-fit statistics and validation of the predicted probabilities. The overall model evaluation
shows the improvement of the applied model over a baseline, showing whether it in fact improves over
the model without predictors. The statistical significance of single predictors is shown by calculating the
Wald chi-square statistic. The goodness-of-fit statistics show the quality of the fit of the model to the data
and how significant this is, which can be assessed by resorting to different techniques, e.g. the so-called
Hosmer-Lemeshow (H-L) statistic. The validation of predicted probabilities involves testing whether the
output of the model corresponds to what is shown by the data. These techniques provide mathematical
ways of representing the fitness of the model and its behavior.
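As a hedged illustration of these diagnostics (not part of the original text), the sketch below fits a logistic regression and reports coefficient-level Wald tests, odds ratios and a pseudo goodness-of-fit measure. The dataset and the two predictors are arbitrary choices, and the Hosmer-Lemeshow statistic mentioned above would have to be computed separately.

```python
# Illustrative sketch of classical logistic regression diagnostics: Wald tests
# from the model summary, odds ratios from exponentiated coefficients, and a
# pseudo R-squared as a rough goodness-of-fit indicator.
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X = X[["mean radius", "mean texture"]]     # two readable predictors (arbitrary)

model = sm.Logit(y, sm.add_constant(X)).fit(disp=False)
print(model.summary())                     # Wald z-statistics and p-values
print(np.exp(model.params))                # odds ratios per predictor
print(model.prsquared)                     # McFadden's pseudo R-squared
```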
Table 2: Overall picture of the classification of ML models attending to their level of explainability.
• Linear/Logistic Regression. Simulatability: predictors are human readable and interactions among them are kept to a minimum. Decomposability: variables are still readable, but the number of interactions and predictors involved in them has grown so much as to force decomposition. Algorithmic transparency: variables and interactions are too complex to be analyzed without mathematical tools. Post-hoc analysis: not needed.
• Decision Trees. Simulatability: a human can simulate and obtain the prediction of a decision tree on his/her own, without requiring any mathematical background. Decomposability: the model comprises rules that do not alter data whatsoever, and preserves their readability. Algorithmic transparency: human-readable rules that explain the knowledge learned from data and allow for a direct understanding of the prediction process. Post-hoc analysis: not needed.
• K-Nearest Neighbors. Simulatability: the complexity of the model (number of variables, their understandability and the similarity measure under use) matches human naive capabilities for simulation. Decomposability: the amount of variables is too high and/or the similarity measure is too complex to be able to simulate the model completely, but the similarity measure and the set of variables can be decomposed and analyzed separately. Algorithmic transparency: the similarity measure cannot be decomposed and/or the number of variables is so high that the user has to rely on mathematical and statistical tools to analyze the model. Post-hoc analysis: not needed.
• Rule Based Learners. Simulatability: variables included in rules are readable, and the size of the rule set is manageable by a human user without external help. Decomposability: the size of the rule set becomes too large to be analyzed without decomposing it into small rule chunks. Algorithmic transparency: rules have become so complicated (and the rule set size has grown so much) that mathematical tools are needed for inspecting the model behaviour. Post-hoc analysis: not needed.
• General Additive Models. Simulatability: variables and the interactions among them as per the smooth functions involved in the model must be constrained within human capabilities for understanding. Decomposability: interactions become too complex to be simulated, so decomposition techniques are required for analyzing the model. Algorithmic transparency: due to their complexity, variables and interactions cannot be analyzed without the application of mathematical and statistical tools. Post-hoc analysis: not needed.
• Bayesian Models. Simulatability: statistical relationships modeled among variables and the variables themselves should be directly understandable by the target audience. Decomposability: statistical relationships involve so many variables that they must be decomposed in marginals so as to ease their analysis. Algorithmic transparency: statistical relationships cannot be interpreted even if already decomposed, and predictors are so complex that the model can only be analyzed with mathematical tools. Post-hoc analysis: not needed.
• Tree Ensembles. Not transparent at any level. Post-hoc analysis needed: usually model simplification or feature relevance techniques.
• Support Vector Machines. Not transparent at any level. Post-hoc analysis needed: usually model simplification or local explanations techniques.
• Multi-layer Neural Network. Not transparent at any level. Post-hoc analysis needed: usually model simplification, feature relevance or visualization techniques.
• Convolutional Neural Network. Not transparent at any level. Post-hoc analysis needed: usually feature relevance or visualization techniques.
• Recurrent Neural Network. Not transparent at any level. Post-hoc analysis needed: usually feature relevance techniques.

Other techniques from other disciplines besides Statistics can be adopted for explaining these regression
models. Visualization techniques are very powerful when presenting statistical conclusions to users not
well versed in statistics. For instance, the work in [178] shows that when probabilities were used to
communicate the results, users were able to estimate the outcomes correctly in only 10% of the cases, as
opposed to 46% of the cases when natural frequencies were used. Although logistic regression is among
the simplest classification models in supervised learning, there are concepts that must be taken care of.
In this line of reasoning, the authors of [179] unveil some concerns with the interpretations derived
from LR. They first mention how dangerous it might be to interpret log odds ratios and odds ratios as
substantive effects, since they also represent unobserved heterogeneity. Linked to this first concern,
[179] also states that a comparison between these ratios across models with different variables might be
problematic, since the unobserved heterogeneity is likely to vary, thereby invalidating the comparison.
Finally, they also mention that the comparison of these odds across different samples, groups and time is
also risky, since the variation of the heterogeneity is not known across samples, groups and time points.
This last paper serves the purpose of illustrating the problems a model's interpretation might entail, even
when its construction is as simple as that of LR.
It is also interesting to note that, for a model such as logistic or linear regression to maintain
decomposability and simulatability, its size must be limited, and the variables used must be understandable by their
users. As stated in Section 2, if inputs to the model are highly engineered features that are complex or
difficult to understand, the model at hand will be far from being decomposable. Similarly, if the model is
so large that a human cannot think of the model as a whole, its simulatability will be put to question.
3.2. Decision Trees
Decision trees are another example of a model that can easily fulfill every constraint for transparency.
Decision trees are hierarchical structures for decision making used to support regression and classification
problems [132, 180]. In the simplest of their flavors, decision trees are simulatable models. However,
their properties can render them decomposable or algorithmically transparent.
[Figure 5 content: per-model annotations, e.g. decision trees (simple univariate thresholds, straightforward what-if testing, direct support and impurity measures, simulatable and decomposable), linear regression (y = w1·x1 + w2·x2 + w0, with each wi the increase in y if xi increases by one unit and w0 the intercept), K-Nearest Neighbors (prediction by majority voting over the K most similar training instances, lazy training, algorithmic transparency), rule-based learners (linguistic rules easy to interpret, simulatable if rule set coverage and specificity are kept constrained, fuzziness improves interpretability), generalized additive models (g(E(y)) = w1·f1(x1) + w2·f2(x2), interpretability depending on the link function g(z), the selected fi(xi) and the sparseness of [w1, ..., wN]) and Bayesian models (p(y|x1, x2) ∝ p(y|x1)·p(y|x2), where the independence assumption permits assessing the contribution of each variable, with algorithmic transparency via distribution fitting).]
Figure 5: Graphical illustration of the levels of transparency of different ML models considered in this overview: (a) Linear
regression; (b) Decision trees; (c) K-Nearest Neighbors; (d) Rule-based Learners; (e) Generalized Additive Models; (f) Bayesian
Models.
Decision trees have always lingered in between the different categories of transparent models. Their
utilization has been closely linked to decision making contexts, which is the reason why their complexity
and understandability have always been considered a paramount matter. A proof of this relevance can
be found in the upsurge of contributions to the literature dealing with decision tree simplification and
generation [132, 180, 181, 182]. As noted above, although capable of fitting every category of
transparent models, the individual characteristics of decision trees can push them toward the category of
algorithmically transparent models. A simulatable decision tree is one that is manageable by a human
user. This means its size is somewhat small and the amount of features and their meaning are easily
understandable. An increment in size transforms the model into a decomposable one, since its size impedes
its full evaluation (simulation) by a human. Finally, further increasing its size and using complex feature
relations will render the model merely algorithmically transparent, losing the previous characteristics.
Decision trees have long been used in decision support contexts due to their off-the-shelf transparency.
Many applications of these models fall outside the fields of computation and AI (even information
technologies), meaning that experts from other fields usually feel comfortable interpreting the outputs of
these models [183, 184, 185]. However, their poor generalization properties in comparison with other
models make this model family less interesting for application to scenarios where predictive performance
is a design driver of utmost importance. Tree ensembles aim at overcoming such poor performance by
aggregating the predictions performed by trees learned on different subsets of training data. Unfortunately,
the combination of decision trees loses every transparent property, calling for the adoption of post-hoc
explainability techniques such as the ones reviewed later in the manuscript.
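As a minimal sketch of why a small decision tree remains simulatable (assuming scikit-learn; the dataset and depth limit are illustrative choices), the entire model can be printed as a set of thresholds that a user can walk through by hand:

```python
# Minimal sketch (illustrative, not tied to any cited work): a shallow
# decision tree remains simulatable because its decision path can be
# printed and followed manually.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The textual rules below are the whole model: a user can run a "what-if"
# test on any instance by walking the thresholds by hand.
print(export_text(tree, feature_names=["sepal len", "sepal wid",
                                       "petal len", "petal wid"]))
```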
3.3. K-Nearest Neighbors
Another method that falls within transparent models is that of K-Nearest Neighbors (KNN), which
deals with classification problems in a methodologically simple way: it predicts the class of a test sample
by a majority vote over the classes of its K nearest neighbors (where the neighborhood relation is induced
by a measure of distance between samples). When used in the context of regression problems, the voting is
replaced by an aggregation (e.g. average) of the target values associated with the nearest neighbors.
In terms of model explainability, it is important to observe that predictions generated by KNN models
rely on the notion of distance and similarity between examples, which can be tailored depending on the
specific problem being tackled. Interestingly, this prediction approach resembles experience-based
human decision making, which draws on the outcomes of past similar cases. Therein lies the rationale
for why KNN has also been widely adopted in contexts in which model interpretability is a requirement
[186, 187, 188, 189]. Furthermore, aside from being simple to explain, the ability to inspect the reasons
by which a new sample has been classified inside a group, and to examine how these predictions evolve
when the number of neighbors K is increased or decreased, empowers the interaction between the users
and the model.
One must keep in mind that, as mentioned before, KNN's class of transparency depends on the features,
the number of neighbors and the distance function used to measure the similarity between data instances.
A very high K impedes a full simulation of the model's performance by a human user. Similarly, the usage
of complex features and/or distance functions would hinder the decomposability of the model, restricting
its interpretability solely to the transparency of its algorithmic operations.
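A minimal sketch of this kind of inspection (assuming scikit-learn; the dataset, query instance and K are illustrative) retrieves the neighbors behind a single prediction so that the user can see the evidence for the vote:

```python
# Minimal sketch (illustrative): a KNN prediction can be explained by
# retrieving the K training neighbors that cast the votes, mirroring
# experience-based reasoning over past similar cases.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

query = X[60:61]                               # one test instance
dist, idx = knn.kneighbors(query)              # the evidence behind the vote
print("prediction:", knn.predict(query)[0])
print("neighbor labels:", y[idx[0]], "at distances", dist[0].round(2))
```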
3.4. Rule-based Learning
Rule-based learning refers to every model that generates rules to characterize the data it is intended to
learn from. Rules can take the form of simple conditional if-then rules or more complex combinations of
simple rules. Also connected to this general family of models, fuzzy rule-based systems are designed for a
broader scope of action, allowing for the definition of verbally formulated rules over imprecise domains.
Fuzzy systems improve along two main axes relevant to this paper. First, they empower more understandable
models since they operate in linguistic terms. Second, they perform better than classic rule systems in
contexts with certain degrees of uncertainty. Rule-based learners are clearly transparent models that have
often been used to explain complex models by generating rules that explain their predictions [126, 127, 190, 191].
Rule learning approaches have been extensively used for knowledge representation in expert systems
[192]. However, a central problem with rule generation approaches is the coverage (amount) and the
specificity (length) of the rules generated. This problem relates directly to the intention for their use in
the first place. When building a rule database, a typical design goal sought by the user is to be able to
analyze and understand the model. Increasing the number of rules in a model will clearly improve its
performance at the expense of compromising its interpretability. Similarly, the specificity of the rules also
plays against interpretability, since a rule with a high number of antecedents and/or consequents might
become difficult to interpret. In this same line of reasoning, these two features of a rule-based learner
play along with the classes of transparent models presented in Section 2. The greater the coverage or
the specificity, the closer the model will be to being just algorithmically transparent. Sometimes, the
reason to transition from classical rules to fuzzy rules is to relax the constraints on rule sizes, since a
greater range can be covered with less stress on interpretability.
Rule-based learners are great models in terms of interpretability across fields. Their natural and
seamless relation to human behaviour makes them very suitable to understand and explain other models.
If a certain threshold of coverage is reached, a rule wrapper can be thought to contain enough information
about a model to explain its behavior to a non-expert user, without forfeiting the possibility of using the
generated rules as a standalone prediction model.
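A minimal sketch of such a standalone rule set (with illustrative thresholds and labels, not taken from any cited work) shows how few, short rules keep the model simulatable, whereas adding antecedents or rules trades interpretability for coverage:

```python
# Minimal sketch (illustrative): a hand-readable rule set as a standalone
# transparent predictor. Thresholds and feature names are placeholders;
# keeping few rules with short antecedents preserves simulatability.
def predict(x1, x2):
    # Rule 1: short antecedent, wide coverage
    if x1 >= 0.8:
        return "high"
    # Rule 2: one extra antecedent increases specificity
    if x1 < 0.8 and x2 >= 0.5:
        return "medium"
    # Default rule keeps the rule base complete
    return "low"

print(predict(0.9, 0.2))   # fires Rule 1 -> "high"
print(predict(0.3, 0.7))   # fires Rule 2 -> "medium"
```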
3.5. Generalized Additive Models
In statistics, a Generalized Additive Model (GAM) is a linear model in which the value of the
variable to be predicted is given by the aggregation of a number of unknown smooth functions defined
for the predictor variables. The purpose of such a model is to infer the smooth functions whose aggregate
composition approximates the predicted variable. This structure is easily interpretable, since it allows the
user to verify the importance of each variable, namely, how it affects (through its corresponding function)
the predicted output.
Similarly to every other transparent model, the literature is replete with case studies where GAMs
are in use, especially in fields related to risk assessment. When compared to other models, these are
understandable enough to make users feel confident in using them for practical applications in finance
[193, 194, 195], environmental studies [196], geology [197], healthcare [44], biology [198, 199] and
energy [200]. Most of these contributions use visualization methods to further ease the interpretation of
the model. GAMs might also be considered simulatable and decomposable models if the properties
mentioned in their definitions are fulfilled, but to an extent that depends largely on eventual modifications
to the baseline GAM model, such as the introduction of link functions to relate the aggregation with the
predicted output, or the consideration of interactions between predictors.
All in all, applications of GAMs like the ones exemplified above share one common factor: understandability.
The main driver for conducting these studies with GAMs is to understand the underlying
relationships that build up the cases under scrutiny. In those cases the research goal is not accuracy for its
own sake, but rather the need to understand the problem at hand and the relationships underlying the
variables involved in the data. This is why GAMs have been accepted in certain communities as their de facto
modeling choice, despite their acknowledged performance gap with respect to more complex
counterparts.
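A minimal GAM-like sketch is given below; it is an assumption-laden illustration that uses per-feature spline bases in scikit-learn (version 1.0 or later) rather than a dedicated GAM package, and shows how the fitted model remains additive, so the effect of each variable can be examined in isolation.

```python
# Minimal GAM-like sketch (illustrative, assuming scikit-learn >= 1.0):
# per-feature spline bases plus a linear model yield an additive fit
# y ~ f1(x1) + f2(x2), whose shape functions can be inspected per variable.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, 400)

gam = make_pipeline(SplineTransformer(n_knots=8, degree=3),
                    LinearRegression()).fit(X, y)

# Partial (shape) function of feature 0, holding feature 1 at its mean:
grid = np.column_stack([np.linspace(-3, 3, 5), np.full(5, X[:, 1].mean())])
print(gam.predict(grid).round(2))   # how the prediction moves with x1 alone
```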
3.6. Bayesian Models
A Bayesian model usually takes the form of a probabilistic directed acyclic graphical model whose
links represent the conditional dependencies between a set of variables. For example, a Bayesian network
could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the
network can be used to compute the probabilities of the presence of various diseases. Similar to GAMs,
these models also convey a clear representation of the relationships between features and the target, which
in this case are given explicitly by the connections linking variables to each other.
Once again, Bayesian models fall below the ceiling of transparent models. Their categorization places
them under simulatable, decomposable and algorithmically transparent models. However, it is worth noting that
under certain circumstances (overly complex or cumbersome variables), a model may lose these first
two properties. Bayesian models have been shown to lead to great insights in assorted applications such
as cognitive modeling [201, 202], fisheries [196, 203], gaming [204], climate [205], econometrics [206] or
robotics [207]. Furthermore, they have also been utilized to explain other models, such as averaging tree
ensembles [208].
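As a minimal sketch (with purely illustrative probabilities), the disease/symptom example above reduces to an explicit application of Bayes' rule, which is exactly the kind of reasoning a Bayesian network makes inspectable along its edges:

```python
# Minimal sketch (illustrative numbers): the disease/symptom example as a
# direct application of Bayes' rule, the kind of explicit probabilistic
# reasoning a Bayesian network encodes along its edges.
p_disease = 0.01                 # prior P(disease)
p_sym_given_d = 0.90             # likelihood P(symptom | disease)
p_sym_given_not_d = 0.05         # P(symptom | no disease)

p_symptom = p_sym_given_d * p_disease + p_sym_given_not_d * (1 - p_disease)
p_d_given_sym = p_sym_given_d * p_disease / p_symptom
print(f"P(disease | symptom) = {p_d_given_sym:.3f}")   # ~0.154
```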
4. Post-hoc Explainability Techniques for Machine Learning Models: Taxonomy, Shallow Models
and Deep Learning
When ML models do not meet any of the criteria imposed to declare them transparent, a separate
method must be devised and applied to the model to explain its decisions. This is the purpose of post-hoc
explainability techniques (also referred to as post-modeling explainability), which aim at communicating
understandable information about how an already developed model produces its predictions for any given
input. In this section we categorize and review different algorithmic approaches for post-hoc explainability,
discriminating among 1) those that are designed for their application to ML models of any kind; and 2)
those that are designed for a specific ML model and thus cannot be directly extrapolated to any other
learner. We now elaborate on the trends identified around post-hoc explainability for different ML models,
which are illustrated in Figure 6 in the form of hierarchical bibliographic categories and summarized next:
• Model-agnostic techniques for post-hoc explainability (Subsection 4.1), which can be applied seamlessly
to any ML model disregarding its inner processing or internal representations.
• Post-hoc explainability techniques that are tailored or specifically designed to explain certain ML models. We
divide our literature analysis into two main branches: contributions dealing with post-hoc explainability
of shallow ML models, which collectively refers to all ML models that do not hinge on layered
structures of neural processing units (Subsection 4.2); and techniques devised for deep learning models,
which correspondingly denote the family of neural networks and related variants, such as convolutional
neural networks and recurrent neural networks (Subsection 4.3), as well as hybrid schemes encompassing deep
neural networks and transparent models. For each model we perform a thorough review of the latest
post-hoc methods proposed by the research community, along with an identification of the trends followed
by such contributions.
• We end our literature analysis with Subsection 4.4, where we present a second taxonomy that complements
the more general one in Figure 6 by classifying contributions dealing with the post-hoc
explanation of Deep Learning models. To this end we focus on particular aspects related to this family
of black-box ML methods, and expose how they link to the classification criteria used in the first
taxonomy.
4.1. Model-agnostic Techniques for Post-hoc Explainability
Model-agnostic techniques for post-hoc explainability are designed to be plugged into any model
with the intent of extracting some information from its prediction procedure. Sometimes, simplification
techniques are used to generate proxies that mimic their antecedents with the purpose of having something
tractable and of reduced complexity. Other times, the intent focuses on extracting knowledge directly
from the models or simply visualizing them to ease the interpretation of their behavior. Following the
taxonomy introduced in Section 2, model-agnostic techniques may rely on model simplification, feature
relevance estimation and visualization techniques:
• Explanation by simplification. This is arguably the broadest technique under the category of model-agnostic
post-hoc methods. Local explanations are also present within this category, since sometimes
simplified models are only representative of certain sections of a model. Almost all techniques taking
this path for model simplification are based on rule extraction. Among the best known
contributions to this approach we encounter the technique of Local Interpretable Model-Agnostic
Explanations (LIME) [32] and all its variations [214, 216]. LIME builds locally linear models around
the predictions of an opaque model to explain it. These contributions fall under explanations by
simplification as well as under local explanations. Besides LIME and related flavors, another approach
to rule extraction is G-REX [212]. Although it was not originally intended for extracting rules from
opaque models, the generic proposition of G-REX has been extended to also account for model
explainability purposes [190, 211]. In line with rule extraction methods, the work in [215] presents a
novel approach to learn rules in CNF (Conjunctive Normal Form) or DNF (Disjunctive Normal Form)
to bridge from a complex model to a human-interpretable model. Another contribution along the
same branch is that in [218], where the authors formulate model simplification as a model extraction
process by approximating a transparent model to the complex one. Simplification is approached from a
different perspective in [120], where an approach to distill and audit black box models is presented. In
it, two main ideas are exposed: a method for model distillation and comparison to audit black-box risk
scoring models; and a statistical test to check whether the auditing data is missing key features the model
was trained with. The popularity of model simplification is evident, given that it temporally coincides
with the most recent literature on XAI, including techniques such as LIME or G-REX. This symptomatically
reveals that this post-hoc explainability approach is envisaged to continue playing a central role in XAI.
A minimal surrogate-model sketch illustrating the simplification idea is provided after this list.
XAI in ML
  Transparent Models: Logistic / Linear Regression; Decision Trees; K-Nearest Neighbors; Rule-based Learners; Generalized Additive Models [44]; Bayesian Models [31, 49, 209, 210]
  Post-hoc Explainability
    Model-Agnostic
      Explanation by simplification: Rule-based learner [32, 51, 120, 190, 211, 212, 213, 214, 215, 216]; Decision Tree [21, 119, 133, 135, 149, 217, 218]; Others [56, 219]
      Feature relevance explanation: Influence functions [173, 220, 221]; Sensitivity [222, 223]; Game theory inspired [224, 225, 226]; Saliency [85, 227]; Interaction based [123, 228]; Others [140, 141, 229, 230, 231]
      Local explanations: Rule-based learner [32, 216]; Decision Tree [232, 233]; Others [67, 224, 230, 234, 235, 236, 237]
      Visual explanation: Conditional / Dependence / Shapley plots [56, 224, 238, 239]; Sensitivity / Saliency [85, 222, 223, 227]; Others [117, 123, 140, 178, 234]
    Model-Specific
      Ensembles and Multiple Classifier Systems
        Explanation by simplification: Decision Tree / Prototype [84, 118, 122]
        Feature relevance explanation: Feature importance / contribution [103, 104, 240, 241]
        Visual explanation: Variable importance / attribution [104, 241, 242, 243]
      Support Vector Machines
        Explanation by simplification: Rule-based learner [57, 93, 94, 98, 106, 134, 244, 245, 246]; Probabilistic [247, 248]; Others [102]
        Feature relevance explanation: Feature contribution / statistics [116, 249]
        Visual explanation: Internal visualization [68, 77, 250]
      Multi-Layer Neural Networks
        Explanation by simplification: Rule-based learner [82, 83, 147, 148, 251, 252, 253, 254, 255, 256]; Decision Tree [21, 56, 79, 81, 97, 135, 257, 258, 259]; Others [80]
        Feature relevance explanation: Importance / contribution [60, 61, 110, 260, 261]; Sensitivity / Saliency [260, 262]
        Local explanation: Decision Tree / Sensitivity [233, 263]
        Explanation by example: Activation clusters [144, 264]
        Text explanation: Caption generation [111, 150]
        Visual explanation: Saliency / Weights [265]
        Architecture modification: Others [264, 266, 267]
      Convolutional Neural Networks
        Explanation by simplification: Decision Tree [78]
        Feature relevance explanation: Activations [46, 72, 268]; Feature extraction [72, 268]
        Visual explanation: Filter / Activation [63, 136, 137, 142, 152, 269, 270, 271]; Sensitivity / Saliency [46, 131, 272]; Others [273]
        Architecture modification: Layer modification [143, 274, 275]; Model combination [91, 274, 276]; Attention networks [91, 107, 114, 277, 278]; Loss modification [113, 276]; Others [279]
      Recurrent Neural Networks
        Explanation by simplification: Rule-based learner [146]
        Feature relevance explanation: Activation propagation [280]
        Visual explanation: Activations [281]
        Architecture modification: Loss / Layer modification [274, 276, 282]; Others [151, 283, 284, 285]
Figure 6: Taxonomy of the reviewed literature and trends identified for explainability techniques related to different ML models.
References boxed in blue, green and red correspond to XAI techniques using image, text or tabular data, respectively. In order
to build this taxonomy, the literature has been analyzed in depth to discriminate whether a post-hoc technique can be seamlessly
applied to any ML model, even if it, e.g., explicitly mentions Deep Learning in its title and/or abstract.
• Feature relevance explanation techniques aim to describe the functioning of an opaque model by
ranking or measuring the influence, relevance or importance that each feature has on the prediction output by
the model to be explained. An amalgam of propositions is found within this category, each resorting
to a different algorithmic approach with the same targeted goal. One fruitful contribution to this path
is that of [224], called SHAP (SHapley Additive exPlanations). Its authors presented a method to
calculate an additive feature importance score for each particular prediction with a set of desirable
properties (local accuracy, missingness and consistency) that its antecedents lacked. Other approaches
to tackle the contribution of each feature to predictions have been coalitional Game Theory [225] and
local gradients [234]. Similarly, by means of local gradients, the authors of [230] test the changes needed in each
feature to produce a change in the output of the model. In [228] the authors analyze the relations and
dependencies found in the model by grouping features that, combined, bring insights about the data.
The work in [173] presents a broad variety of measures to tackle the quantification of the degree of
influence of inputs on the outputs of systems. Their QII (Quantitative Input Influence) measures account
for correlated inputs while measuring influence. In contrast, in [222] the authors build upon the existing
SA (Sensitivity Analysis) to construct a Global SA which extends the applicability of the existing
methods. In [227] a real-time image saliency method is proposed, which is applicable to differentiable
image classifiers. The study in [123] presents the so-called Automatic STRucture IDentification method
(ASTRID) to inspect which attributes are exploited by a classifier to generate a prediction. This method
finds the largest subset of features such that a classifier trained with this subset cannot be distinguished,
in terms of accuracy, from a classifier built on the original feature set.
In [221] the authors use influence functions to trace a model's prediction back to the training data, by
only requiring an oracle version of the model with access to gradients and Hessian-vector products.
Heuristics for creating counterfactual examples by modifying the input of the model have also been
found to contribute to its explainability [236, 237]. Compared to those attempting explanations by
simplification, a similar number of publications was found tackling explainability by means of feature
relevance techniques. Many of the contributions date from 2017 and some from 2018, implying that, as
with model simplification techniques, feature relevance has also become a vibrant subject of study in the
current XAI landscape.
• Visual explanation techniques are a vehicle to achieve model-agnostic explanations. Representative
works in this area can be found in [222], which presents a portfolio of visualization techniques to help in
the explanation of a black-box ML model built upon the set of extended techniques mentioned earlier
(Global SA). Another set of visualization techniques is presented in [223]. The authors present three
novel SA methods (data based SA, Monte-Carlo SA, cluster-based SA) and one novel input importance
measure (Average Absolute Deviation). Finally, [238] presents ICE (Individual Conditional Expectation)
plots as a tool for visualizing the model estimated by any supervised learning algorithm. Visual
explanations are less common in the field of model-agnostic techniques for post-hoc explainability.
Since the design of these methods must ensure that they can be seamlessly applied to any ML model
disregarding its inner structure, creating visualizations from just the inputs and outputs of an opaque
model is a complex task. This is why almost all visualization methods falling in this category work
along with feature relevance techniques, which provide the information that is eventually displayed to
the end user.
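The surrogate sketch announced in the first item above could look as follows; it is a minimal illustration assuming scikit-learn, and it is not LIME or G-REX themselves, but it conveys the common idea of fitting a transparent model to the predictions of an opaque one and reporting its fidelity.

```python
# Minimal sketch of the model-simplification idea (illustrative): a
# transparent decision tree is fitted to mimic the predictions of an opaque
# model, and its fidelity to the black box is reported alongside the rules.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Train the surrogate on the black box's outputs, not on the true labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

fidelity = accuracy_score(black_box.predict(X), surrogate.predict(X))
print(f"surrogate fidelity to the black box: {fidelity:.2f}")
print(export_text(surrogate))
```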
Several trends emerge from our literature analysis. To begin with, rule extraction techniques prevail
in model-agnostic contributions under the umbrella of post-hoc explainability. This could have been
intuitively expected if we bear in mind the wide use of rule based learning as explainability wrappers
anticipated in Section 3.4, and the complexity imposed by not being able to get into the model itself.
Similarly, another large group of contributions deals with feature relevance. Lately these techniques have been
gathering much attention from the community when dealing with DL models, with hybrid approaches that
utilize particular aspects of this class of models and therefore compromise the independence of the feature
relevance method from the model being explained. Finally, visualization techniques propose interesting
ways of visualizing the output of feature relevance techniques to ease the task of model interpretation.
By contrast, visualization techniques for other aspects of the trained model (e.g. its structure, operations,
etc.) are tightly linked to the specific model to be explained.
4.2. Post-hoc Explainability in Shallow ML Models
Shallow ML covers a diversity of supervised learning models. Within these models, there are strictly
interpretable (transparent) approaches (e.g. KNN and Decision Trees, already discussed in Section 3).
However, other shallow ML models rely on more sophisticated learning algorithms that require additional
layers of explanation. Given their prominence and notable performance in predictive tasks, this section
concentrates on two popular shallow ML models (tree ensembles and Support Vector Machines, SVMs)
that require the adoption of post-hoc explainability techniques for explaining their decisions.
4.2.1. Tree Ensembles, Random Forests and Multiple Classifier Systems
Tree ensembles are arguably among the most accurate ML models in use nowadays. Their advent
came as an efficient means to improve the generalization capability of single decision trees, which are
usually prone to overfitting. To circumvent this issue, tree ensembles combine different trees to obtain an
aggregated prediction/regression. While effective against overfitting, the combination of
models makes the interpretation of the overall ensemble more complex than that of each of its compounding tree
learners, forcing the user to draw from post-hoc explainability techniques. For tree ensembles, the techniques
found in the literature are explanation by simplification and feature relevance; we next examine
recent advances in both.
To begin with, many contributions have been presented to simplify tree ensembles while maintaining
part of the accuracy gained through the added complexity. The author of [119] poses the idea of training
a single, albeit less complex, model from a set of random samples of the data (ideally following the real
data distribution) labeled by the ensemble model. Another approach to simplification is that of [118], in
which the authors create a Simplified Tree Ensemble Learner (STEL). Likewise, [122] presents the usage
of two models (simple and complex), the former being in charge of interpretation and the latter
of prediction, by means of Expectation-Maximization and the Kullback-Leibler divergence. As opposed to
what was seen in model-agnostic techniques, not that many techniques address explainability in tree
ensembles by means of model simplification. This suggests that either the proposed techniques are
good enough, or that model-agnostic techniques already cover the scope of simplification.
Beyond simplification procedures, feature relevance techniques are also used in the field of tree
ensembles. Breiman [286] was the first to analyze variable importance within Random Forests. His
method is based on measuring the MDA (Mean Decrease Accuracy) or MIE (Mean Increase Error) of the
forest when a certain variable is randomly permuted in the out-of-bag samples. Following this contribution,
[241] shows, in a real setting, how the usage of variable importance reflects the underlying relationships
of a complex system modeled by a Random Forest. Finally, as a technique cutting across post-hoc
explainability categories, [240] proposes a framework that poses recommendations that, if followed, would convert
an example from one class to another. This idea attempts to disentangle the importance of the variables in a
more descriptive way. In the article, the authors show how these methods can be used to produce
recommendations for improving malicious online ads so that they rank higher in paying rates.
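A minimal sketch of this permutation-based notion of importance is given below (assuming scikit-learn; note that its permutation_importance utility permutes features on a held-out set rather than on out-of-bag samples, so it only approximates Breiman's MDA).

```python
# Minimal sketch of permutation-based variable importance for a Random
# Forest (illustrative; held-out permutation, an approximation of MDA).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=6,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Mean decrease in accuracy when each feature is randomly permuted.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: mean accuracy drop = {imp:.3f}")
```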
Similar to the trend shown by model-agnostic techniques, for tree ensembles simplification and
feature relevance techniques again seem to be the most used schemes. However, contrary to what was observed
before, most papers date from 2017 and place their focus mostly on bagging ensembles. When
shifting the focus towards other ensemble strategies, scarce activity has recently been noted around the
explainability of boosting and stacking classifiers. Among the latter, it is worth highlighting the connection
between the reason why a compounding learner of the ensemble produces a specific prediction on given
data, and its contribution to the output of the ensemble. The so-called Stacking With Auxiliary Features
(SWAF) approach proposed in [242] points in this direction by harnessing and integrating explanations in
stacking ensembles to improve their generalization. This strategy allows relying not only on the output
of the compounding learners, but also on the origin of that output and its consensus across the entire
ensemble. Other interesting studies on the explainability of ensemble techniques include model-agnostic
schemes such as DeepSHAP [226], put into practice with stacking ensembles and multiple classifier
systems in addition to Deep Learning models; the combination of explanation maps of multiple classifiers
to produce improved explanations of the ensemble to which they belong [243]; and recent insights dealing
with traditional and gradient boosting ensembles [287, 288].
4.2.2. Support Vector Machines
Another shallow ML model with historical presence in the literature is the SVM. SVM models are
more complex than tree ensembles, with a much more opaque structure. Many implementations of post-hoc
explainability techniques have been proposed to relate what is mathematically described inside
these models to what different authors consider explanations about the problem at hand. Technically,
an SVM constructs a hyper-plane or set of hyper-planes in a high- or infinite-dimensional space, which
can be used for classification, regression, or other tasks such as outlier detection. Intuitively, a good
separation is achieved by the hyper-plane that has the largest distance (the so-called functional margin) to the
nearest training-data point of any class, since in general, the larger the margin, the lower the generalization
error of the classifier. SVMs are among the most used ML models due to their excellent prediction
and generalization capabilities. From the techniques stated in Section 2, post-hoc explainability applied
to SVMs covers explanation by simplification, local explanations, visualizations and explanations by
example.
Among explanations by simplification, four classes of simplifications are made, each of which differs
from the others in how deeply it goes into the algorithm's inner structure. First, some authors
propose techniques to build rule-based models only from the support vectors of a trained model. This is
the approach of [93], which proposes a method that extracts rules directly from the support vectors of a
trained SVM using a modified sequential covering algorithm. In [57] the same authors propose eclectic
rule extraction, still considering only the support vectors of a trained model. The work in [94] generates
fuzzy rules instead of classical propositional rules. Here, the authors argue that long antecedents reduce
comprehensibility, hence a fuzzy approach allows for a more linguistically understandable result. The
second class of simplifications can be exemplified by [98], which proposed adding the SVM's
hyper-plane, along with the support vectors, to the components in charge of creating the rules. This method
relies on the creation of hyper-rectangles from the intersections between the support vectors and the
hyper-plane. In a third approach to model simplification, another group of authors considered adding
the actual training data as a component for building the rules. In [126, 244, 246] the authors proposed a
clustering method to group prototype vectors for each class. Combining them with the support vectors
allows defining ellipsoids and hyper-rectangles in the input space. Similarly, in [106], the authors
proposed the so-called Hyper-rectangle Rule Extraction, an algorithm based on SVC (Support Vector
Clustering) to find prototype vectors for each class and then define small hyper-rectangles around them. In
[105], the authors formulate the rule extraction problem as a multi-constrained optimization to create a
set of non-overlapping rules. Each rule conveys a non-empty hyper-cube sharing an edge with the
hyper-plane. In a similar study conducted in [245], extracting rules for gene expression data, the authors
presented a novel technique as a component of a multi-kernel SVM. This multi-kernel method consists
of feature selection, prediction modeling and rule extraction. Finally, the study in [134] makes use of a
growing SVC to give an interpretation to SVM decisions in terms of linear rules that partition the input space into
Voronoi sections derived from the extracted prototypes.
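A minimal sketch in the spirit of the first class of simplifications is shown below; it is an illustration only, replacing the sequential covering algorithms of the cited works with a shallow decision tree, and all data are synthetic. It derives a compact rule-like summary from the support vectors of a trained SVM.

```python
# Minimal sketch (illustrative): rules are derived only from the support
# vectors of a trained SVM and the labels the SVM itself assigns to them,
# using a shallow tree as a stand-in rule extractor.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=600, n_features=4, random_state=0)
svm = SVC(kernel="rbf", gamma="scale").fit(X, y)

sv = svm.support_vectors_                 # the points that define the boundary
sv_labels = svm.predict(sv)               # labels assigned by the SVM itself

rules = DecisionTreeClassifier(max_depth=2, random_state=0).fit(sv, sv_labels)
print(export_text(rules))                 # compact rule-like summary of the SVM
```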
Leaving rule extraction aside, the literature has also contemplated other techniques to contribute
to the interpretation of SVMs. Three of them (visualization techniques) are clearly used toward explaining
SVM models in concrete applications. For instance, [77] presents an innovative approach to
visualize trained SVMs so as to extract the information content from the kernel matrix, centering the study
on Support Vector Regression models. The authors show the ability of the algorithm to visualize which of the
input variables are actually related to the associated output data. In [68] a visualization technique combines the
output of the SVM with heatmaps to guide the modification of compounds in late stages of drug discovery.
Colors are assigned to atoms based on the weights of a trained linear SVM, which allows for a much more
comprehensive way of debugging the process. In [116] the authors argue that many of the published
studies for interpreting SVMs only account for the weight vectors, leaving the margin aside. In their study
they show why this margin is important, and they create a statistic that explicitly accounts for the SVM
margin. The authors show how this statistic is specific enough to explain the multivariate patterns found
in neuroimaging.
Also noteworthy is the intersection between SVMs and Bayesian systems, the latter being adopted
as a post-hoc technique to explain decisions made by the SVM model. This is the case of [248] and
[247], which are studies where SVMs are interpreted as MAP (Maximum A Posteriori) solutions to
inference problems with Gaussian Process priors. This framework makes the tuning of hyper-parameters
comprehensible and provides the capability of predicting class probabilities instead of the classical binary
classification of SVMs. Interpretability of SVM models becomes even more involved when dealing
with non-CPD (Conditionally Positive Definite) kernels, which are usually harder to interpret due to missing
geometrical and theoretical understanding. The work in [102] revolves around this issue with a geometrical
interpretation of indefinite kernel SVMs, showing that these do not classify by hyper-plane margin
optimization; instead, they minimize the distance between convex hulls in pseudo-Euclidean spaces.
A difference might be appreciated between the post-hoc techniques applied to other models and those
noted for SVMs. For the previous models, model simplification in a broad sense was the prominent method
for post-hoc explainability. For SVMs, local explanations have started to gain some weight among the
propositions. However, simplification-based methods are, on average, much older than local explanations.
As a final remark, none of the reviewed methods treating SVM explainability are dated beyond 2017,
which might be due to the progressive proliferation of DL models in almost all disciplines. Another
plausible reason is that these models are already understood, so it is hard to improve upon what has
already been done.
4.3. Explainability in Deep Learning
Post-hoc local explanations and feature relevance techniques are increasingly the most adopted
methods for explaining DNNs. This section reviews explainability studies proposed for the most used
DL models, namely multi-layer neural networks, Convolutional Neural Networks (CNN) and Recurrent
Neural Networks (RNN).
4.3.1. Multi-layer Neural Networks
From their inception, multi-layer neural networks (also known as multi-layer perceptrons) have been
warmly welcomed by the academic community due to their huge ability to infer complex relations among
variables. However, as stated in the introduction, developers and engineers in charge of deploying these
models in real-life production find in their questionable explainability a common reason for reluctance.
That is why neural networks have always been considered black-box models. The fact that explainability
is often a must for the model to be of practical value has forced the community to generate multiple
explainability techniques for multi-layer neural networks, including model simplification approaches,
feature relevance estimators, text explanations, local explanations and model visualizations.
Several model simplification techniques have been proposed for neural networks with a single
hidden layer; however, very few works have been presented for neural networks with multiple hidden
layers. One of these few works is the DeepRED algorithm [257], which extends the decompositional approach
to rule extraction (splitting at the neuron level) presented in [259] to multi-layer neural networks by adding
more decision trees and rules.
Some other works use model simplification as a post-hoc explainability approach. For instance, [56]
presents a simple distillation method called Interpretable Mimic Learning to extract an interpretable model
by means of gradient boosting trees. In the same direction, the authors in [135] propose a hierarchical
partitioning of the feature space that reveals the iterative rejection of unlikely class labels until the final
association is predicted. In addition, several works have addressed the distillation of knowledge from an
ensemble of models into a single model [80, 289, 290].
Given that the simplification of multi-layer neural networks becomes more complex as the number of
layers increases, explaining these models by feature relevance methods has become progressively more
popular. One of the representative works in this area is [60], which presents a method to decompose the
network classification decision into contributions of its input elements. They consider each neuron as an
object that can be decomposed and expanded, and then aggregate and back-propagate these decompositions
through the network, resulting in a deep Taylor decomposition. In the same direction, the authors in [110]
proposed DeepLIFT, an approach for computing importance scores in a multi-layer neural network. Their
method compares the activation of a neuron to a reference activation and assigns scores according to
the difference.
On the other hand, some works try to verify the theoretical soundness of current explainability methods.
For example, the authors in [262] bring up a fundamental problem of most feature relevance techniques
designed for multi-layer networks. They showed that two axioms that such techniques ought to fulfill,
namely sensitivity and implementation invariance, are violated in practice by most approaches. Following
these axioms, the authors of [262] created integrated gradients, a new feature relevance method proven
to meet them. Similarly, the authors in [61] analyzed the correctness of current
feature relevance explanation approaches designed for Deep Neural Networks, e.g., DeConvNet, Guided
BackProp and LRP, on simple linear neural networks. Their analysis showed that these methods do not
produce the theoretically correct explanation, and they presented two new explanation methods, PatternNet and
PatternAttribution, which are more theoretically sound for both simple and deep neural networks.
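For intuition, a minimal sketch of integrated gradients is given below; it is an illustration only, since the cited work targets deep networks, whereas here a tiny differentiable toy scorer and a numerical gradient keep the example self-contained. The attributions are the input-baseline differences scaled by the average gradient along the straight path from baseline to input, and they approximately satisfy the completeness property.

```python
# Minimal sketch of integrated gradients on a toy differentiable scorer
# (illustrative; the model and baseline are placeholders).
import numpy as np

def model(x):
    # Toy scoring function standing in for a network output.
    w = np.array([1.5, -2.0, 0.5])
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def grad(f, x, eps=1e-5):
    # Central-difference gradient of f at x.
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def integrated_gradients(f, x, baseline, steps=50):
    # IG_i = (x_i - x'_i) * average over alpha of dF/dx_i at x' + alpha*(x - x')
    alphas = np.linspace(0, 1, steps)
    grads = np.array([grad(f, baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, -0.5, 2.0])
baseline = np.zeros_like(x)
attributions = integrated_gradients(model, x, baseline)
print(attributions.round(3), "sum =", attributions.sum().round(3))
# Completeness check: the attributions roughly sum to F(x) - F(baseline).
print((model(x) - model(baseline)).round(3))
```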
4.3.2. Convolutional Neural Networks
Currently, CNNs constitute the state-of-the-art models in all fundamental computer vision tasks, from
image classification and object detection to instance segmentation. Typically, these models are built as
a sequence of convolutional and pooling layers that automatically learn increasingly higher-level
features. At the end of the sequence, one or multiple fully connected layers are used to map the output
feature maps into scores. This structure entails extremely complex internal relations that are very difficult
to explain. Fortunately, the road to explainability for CNNs is easier than for other types of models, as
human cognitive skills favor the understanding of visual data.
Existing works that aim at understanding what CNNs learn can be divided into two broad categories:
1) those that try to understand the decision process by mapping the output back onto the input space to
see which parts of the input were discriminative for the output; and 2) those that try to delve inside the
network and interpret how the intermediate layers see the external world, not necessarily related to any
specific input, but in general.
One of the seminal works in the first category was [291]. When an input image is fed forward
through a CNN, each layer outputs a number of feature maps with strong and soft activations. The authors
in [291] used Deconvnet, a network designed previously by the same authors [142] that, when fed with a
feature map from a selected layer, reconstructs the maximum activations. These reconstructions give
an idea of the parts of the image that produced that effect. To visualize these strongest activations in
the input image, the same authors used the occlusion sensitivity method to generate a saliency map [136],
which consists of iteratively forwarding the same image through the network while occluding a different region
at a time.
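A minimal sketch of this occlusion sensitivity idea is shown below; it is an illustration only, where predict_fn is a placeholder for any classifier returning the score of the class of interest, and the patch size, stride and toy image are arbitrary choices.

```python
# Minimal sketch of occlusion sensitivity (illustrative): slide an occluding
# patch over the image and record how much the class score drops.
import numpy as np

def occlusion_map(image, predict_fn, patch=8, stride=8, fill=0.0):
    h, w = image.shape[:2]
    base_score = predict_fn(image)
    heat = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, top in enumerate(range(0, h - patch + 1, stride)):
        for j, left in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            # Large score drops indicate regions the prediction depends on.
            heat[i, j] = base_score - predict_fn(occluded)
    return heat

# Toy usage with a stand-in "classifier" that scores the image's mean brightness.
toy_image = np.random.rand(32, 32)
print(occlusion_map(toy_image, predict_fn=lambda img: img.mean()).shape)
```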
To improve the quality of the mapping onto the input space, several subsequent papers proposed
simplifying both the CNN architecture and the visualization method. In particular, [96] included a global
average pooling layer between the last convolutional layer of the CNN and the fully-connected layer that
predicts the object class. With this simple architectural modification of the CNN, the authors built a class
activation map that helps identify the image regions that were particularly important for a specific object
class by projecting the weights of the output layer back onto the convolutional feature maps. Later, in [143],
the authors showed that convolutional layers with a larger stride can replace max-pooling layers
without loss of accuracy on several image recognition benchmarks. They also obtained a cleaner visualization
than Deconvnet by using a guided backpropagation method.
To increase the interpretability of classical CNNs, the authors in [113] used a loss for each filter in
high-level convolutional layers to force each filter to learn very specific object components. The obtained
activation patterns are much more interpretable due to their exclusiveness with respect to the different labels
to be predicted. The authors in [72] proposed visualizing the contribution of each single
pixel of the input image to the prediction in the form of a heatmap. They used the Layer-wise Relevance Propagation (LRP)
technique, which relies on a Taylor series close to the prediction point rather than partial derivatives at
the prediction point itself. To further improve the quality of the visualization, attribution methods such
as heatmaps, saliency maps or class activation methods (Grad-CAM [292]) are used (see Figure 7). In
particular, the authors in [292] proposed Gradient-weighted Class Activation Mapping (Grad-CAM),
which uses the gradients of any target concept flowing into the final convolutional layer to produce a
coarse localization map highlighting the important regions in the image for predicting the concept.
Figure 7: Examples of rendering for different XAI visualization techniques on images: (a) Heatmap [168]; (b) Attribution [293]; (c) Grad-CAM [292].
In addition to the aforementioned feature relevance and visual explanation methods, some works
have proposed generating text explanations of the visual content of an image. For example, the authors in [91]
combined a CNN feature extractor with an RNN attention model to automatically learn to describe the
content of images. In the same line, [278] presented a three-level attention model to perform a fine-grained
classification task. The overall model is a pipeline that integrates three types of attention: the object-level
attention model proposes candidate image regions or patches from the input image, the part-level
attention model filters out patches that are not relevant to a certain object, and the last attention model localizes
discriminative patches. In the task of video captioning, the authors in [111] use a CNN model combined
with a bi-directional LSTM model as an encoder to extract video features, and then feed these features to an
LSTM decoder to generate textual descriptions.
One of the seminal works in the second category is [137]. In order to analyze the visual information
contained inside the CNN, the authors proposed a general framework that reconstructs an image from the