More Than Words: Towards Better Quality
Interpretations of Text Classifiers
Muhammad Bilal Zafar,1 Philipp Schmidt,2 Michele Donini,1
Cédric Archambeau,1 Felix Biessmann,2 Sanjiv Ranjan Das,1,3 Krishnaram Kenthapadi1
1Amazon Web Services, 2Amazon Search, 3Santa Clara University
[zafamuh,phschmid,donini,cedrica,biessman,sanjivda,kenthk]@amazon.com
Abstract
The large size and complex decision mechanisms of state-of-the-art text classifiers
make it difficult for humans to understand their predictions, leading to a potential
lack of trust by the users. These issues have led to the adoption of methods like
SHAP and Integrated Gradients to explain classification decisions by assigning
importance scores to input tokens. However, prior work, using different random-
ization tests, has shown that interpretations generated by these methods may not
be robust. For instance, models making the same predictions on the test set may
still lead to different feature importance rankings. In order to address the lack
of robustness of token-based interpretability, we explore explanations at higher
semantic levels like sentences. We use computational metrics and human subject
studies to compare the quality of sentence-based interpretations against token-based
ones. Our experiments show that higher-level feature attributions offer several
advantages: 1) they are more robust as measured by the randomization tests, 2)
they lead to lower variability when using approximation-based methods like SHAP,
and 3) they are more intelligible to humans in situations where the linguistic coher-
ence resides at a higher granularity level. Based on these findings, we show that
token-based interpretability, while being a convenient first choice given the input
interfaces of the ML models, is not the most effective one in all situations.
1 Introduction
Recent advances in natural language processing, especially those aided by large language models like BERT [15], RoBERTa [34], GPT-3 [12], and Switch Transformer [19], have helped set new benchmarks for text classification. While these performance gains have been attributed to the vast amounts of training data and the large number of parameters (i.e., model complexity), they have also made model predictions more difficult to interpret. For text classification tasks, approaches to interpret model predictions are often borrowed from classifiers applied to tabular and image data [60].
As a whole, interpretability for ML models over text data (e.g., Transformers, LSTMs) is an under-explored domain and poses a unique set of challenges. Well-adopted interpretability techniques in the tabular and image domains (e.g., LIME [46], SHAP [38], Integrated Gradients [57], and feature permutation [11]) do not provide satisfactory performance on text data as they are often designed to provide interpretability based on individual tokens. For instance, recent work has pointed out that interpretations provided by these methods may suffer from a lack of robustness [64], where two models with identical architectures and predictions can lead to large differences in interpretations.
To overcome these issues, we investigate moving away from token-based interpretation and explore
interpretations at higher levels of granularity like sentences.
*Equal Contribution.
There are also additional factors that motivate the need to explore higher levels of interpretation granularity. First, feature importance at the level of tokens is hard to interpret because it removes context surrounding the tokens, leading to a sparse ‘bag of words’ interpretation when the model (e.g., BERT [15]) itself is contextually rich. This begs the question as to whether coarser-than-token-level interpretations better accommodate the context. Second, good textual interpretations should be pithy and perspicuous [40]. Token-based interpretations highlight (too) many tokens, many of which may be duplicates and/or synonymous with each other, rendering the semantic information being represented possibly confusing. Third, token-based interpretations are often not contiguous, resulting in a higher cognitive load for humans. Such increased complexity has been demonstrated to negatively impact the usefulness of interpretations [30] as the lack of sequencing may prevent us from providing meaningful interpretations. Fourth, using sentences instead of tokens as features reduces the size of the feature set, thereby decreasing the computational cost of stochastic explainers such as SHAP [38], and can also help reduce the variability in the explainer output.
Contributions. We take a first step towards investigating the differences in the quality of token- vs. sentence-based interpretations using metrics accounting for the statistical robustness, the computational cost, and the cognitive load for human subjects. Specifically, (1) we adopt the parameter randomization tests of [64] to assess the robustness of sentence-based interpretations (§4.2); (2) we propose metrics to measure the variability in the output of stochastic explanation methods like SHAP for text classification, and compare the outcomes for token- and sentence-based interpretations (§4.2); (3) we design a formal experimental setting to determine which interpretations are better, i.e., to examine whether removal of context matters and which level of granularity lets humans be most effective (in terms of accuracy and response time) in carrying out an annotation task (§4.3).
Our main findings for texts consisting of several sentences suggest the following. (1) Sentence-based interpretations exhibit greater robustness as determined by parameter randomization tests [64] and hence are likely to be more trustworthy to users than token-based ones (§5.1). (2) An analysis of SHAP shows that a coarser level of granularity (sentences) leads to lower variability across multiple runs of the interpretability method on the same input. This low variability can help mitigate the potential erosion of user trust that can stem from observing different interpretations on the same input (§5.2). (3) Humans achieve a higher annotation accuracy and a similar or lower response time (indicating lower cognitive load) in predicting the ground truth labels of the text for sentence-based interpretations than for token-based interpretations. This improvement in annotation performance suggests that sentence-based interpretations are significantly more effective than their token-based counterparts (§6).
2 Related work
As mentioned in §1, most methods for NLP interpretability are borrowed from the vision and tabular data domains; see, for instance, the usage of Layerwise Relevance Propagation [50] in NLP [43]. Also borrowed from vision/tabular are methods like DeepLIFT [52] and LIMSSE (for substring explanations) [46]. Our work extends the literature on the assessment of NLP interpretations [23]. For instance, [7] finds that gradient-based methods perform the best. [31] suggests model-agnostic interpretability with accuracy tradeoffs. Domain-specific interpretability has also been assessed for medicine [59] and finance [63]. Current commercial implementations are largely token-based, though they offer excellent visualizations [58, 60]. Recent concurrent work by [49] also advocates for sentence-based NLP interpretability. This analysis is, however, based on a custom-built dataset and does not include human surveys or robustness analysis, though an informal assessment of the quality of interpretations is undertaken by the authors. [22] draws a distinction between interpretability (feature importance based on system internals) and explanations (human-ingestible answers to a question) and suggests that sentence-based explanations may offer more than mere interpretability. This is related to arguments in [10], suggesting that current interpretability approaches serve internal users more than external ones, as they lack transparency, an aspect that sentence-based explanations may improve. [48] is a recent attempt at human-in-the-loop (HITL) development of NLP models; they introduce a task-agnostic checklist approach which also generates diverse test cases to detect potential bugs. Here the human involvement is in generating test cases, not in evaluating the quality of interpretations. Some of our questions are similar to those assessed by [37] for tree models, who demonstrate improvements in run time, clustering performance, and identification of important features, assessed via a user study.
The recently proposed XRAI approach [27] also advocates presenting interpretations by grouping multiple features, though unlike us, they focus on the image domain.
How the interpretation is presented to the human subject matters. [30] demonstrates that the length and complexity of interpretations have a significant impact on their effectiveness as measured by HITL metrics. Highlighting important tokens is a popular way of presenting explanations. An alternative is the removal of non-important tokens; however, the comprehensibility of the interpretation drops considerably with this approach [20]. Borrowing visualization methods from computer vision (e.g., [2, 29]), [32] explores how negation, intensification, etc. may be built up into sentence-based salience. Text interpretations may be improved using anchors [47], where rules are encoded as super-features that override other features, for example negations (e.g., good vs. not good). Anchors may be a way to get highly parsimonious interpretations, but may not be sufficiently explanatory for long texts. Phrase-based interpretations such as topical n-grams may be more fruitful than using tokens for unsupervised tasks [61]. [6] explores the relevance of token-based feature importance to build up sentence-based interpretations. There are related questions as to whether domain-specific pre-trained language models (e.g., [8]) lead to better interpretations than generic ones, such as BERT [15] or RoBERTa [34]. Using sentence-based embeddings from BERT, such as SBERT [45], may also generate better interpretations. Our work complements all these varied experiments in an attempt to better understand what form of interpretations would be most comprehensible to human subjects, which, after all, is the goal of machine learning interpretability.
Finally, a lack of interpretability is not the only risk associated with complex text models. Recent work has highlighted several other issues like environmental impact and bias [1, 9, 17, 54]. Reliable interpretability methods can be used to detect bias in model behavior [18, 33].
3 Background on token-based interpretability
In this section, we briefly describe the functionality of two popular token-based interpretability methods, SHAP [38] and Integrated Gradients (IG) [57]. We picked these methods as (i) they have been shown to provide superior empirical performance compared to counterparts like LIME [46] (see, for instance, [38]), (ii) they provide desirable axiomatic properties [38, 56, 57], and (iii) they lend themselves readily to interpretability at the meta-feature level (e.g., phrases, sentences), as we will discuss later in §4.
We assume a tokenized input text, $t = [t_1, \ldots, t_T]$, where $T$ is the number of tokens. The task is to classify the input into one of $K$ classes. In this work, we focus on Transformer-based neural text classifiers and hence assume that the tokenization is done using the tokenizer accompanying the model (e.g., WordPiece for BERT, BPE for RoBERTa/GPT-2). For neural text models, the classifier produces a score for each class, $F(t) \in \mathbb{R}^K$. The input is then assigned the label of the class with the highest score. We refer to the score corresponding to this class as $f(t) \in \mathbb{R}$.
Given the model score for the predicted class $f(t)$, the task of a token-based interpretability method is to obtain a vector $\Phi(t) = [\phi(t_1), \ldots, \phi(t_T)]$ of token attributions that assign an importance score to each of the tokens. The score $\phi(t_i)$ (for brevity, also denoted as $\phi_i$) indicates the importance of token $t_i$ in predicting the score $f(t)$.
SHAP, which is a model-agnostic method, estimates the importance of a token by simulating its absence from different subsets $c$ of tokens, called coalitions, with $c \subseteq t$, and computing the average marginal gain of adding the token in question to these. Concretely, the token importances are obtained by solving the following optimization problem:
$$\Phi = \arg\min_{\Phi} \sum_{c \subseteq t} \Big[ f_c(c) - \big( \phi_0 + \sum_{w \in c} \phi(w) \big) \Big]^2 \times k_{\mathrm{SHAP}}(t, c) \qquad (1)$$
where $t$ is the original input, $c$ is a sub-part of the input $t$ corresponding to the tokens that are not dropped, $w$ ranges over the tokens contained within $c$, $k_{\mathrm{SHAP}}(t, c)$ represents the SHAP kernel, $\phi_0$ refers to the model output on an empty input (all tokens dropped), and $f_c$ refers to the model output on the remaining (non-dropped) tokens. We simulate token dropping by replacing the corresponding tokens with the unknown-vocabulary token.
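As a concrete illustration, the following is a minimal sketch of a KernelSHAP-style estimator for token attributions. It is not the optimized SHAP library implementation: `model_fn` is a hypothetical callable returning the predicted-class score for a (possibly masked) token list, the `[UNK]` placeholder simulates dropped tokens as described above, and the hard constraint that attributions sum to $f(t) - \phi_0$ is only approximated by heavily weighting the empty and full coalitions.

```python
# Minimal sketch of a KernelSHAP-style estimator (cf. Eq. 1), for illustration only.
# `model_fn` is a hypothetical callable: list of tokens -> predicted-class score.
import itertools
from math import comb

import numpy as np

def kernel_shap_tokens(model_fn, tokens, unk="[UNK]", max_coalitions=2048, seed=0):
    T = len(tokens)
    rng = np.random.default_rng(seed)
    if 2 ** T <= max_coalitions:                      # enumerate all coalitions
        masks = [np.array(m) for m in itertools.product([0, 1], repeat=T)]
    else:                                             # otherwise sample coalitions
        masks = [rng.integers(0, 2, size=T) for _ in range(max_coalitions)]

    rows, targets, weights = [], [], []
    for m in masks:
        masked = [tok if keep else unk for tok, keep in zip(tokens, m)]
        s = int(m.sum())
        if s == 0 or s == T:
            w = 1e6                                   # crude stand-in for the exact constraint
        else:
            w = (T - 1) / (comb(T, s) * s * (T - s))  # SHAP kernel weight
        rows.append(m.astype(float))
        targets.append(model_fn(masked))
        weights.append(w)

    A = np.hstack([np.ones((len(rows), 1)), np.array(rows)])   # intercept column = phi_0
    sw = np.sqrt(np.array(weights))[:, None]
    sol, *_ = np.linalg.lstsq(A * sw, np.array(targets) * sw.ravel(), rcond=None)
    return sol[0], sol[1:]                            # baseline output, token attributions
```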
The exact computation requires solving Eq. (1) over all the $2^T$ subsets of tokens and becomes infeasible for even a modestly large number of input features (e.g., 30). In practice, a much smaller number of coalitions is used. For instance, the SHAP library suggests a heuristic of using $2|t| + 2048$ coalitions [36]. Some recent studies aim at using the output uncertainty [53] or variance [14] to define a tradeoff between accuracy and number of coalitions.
Integrated Gradients (IG), which is a model-specific method (in that it requires access to model gradients), operates by estimating the token attributions as:
$$\Phi = (x - \bar{x}) \odot \int_0^1 \frac{\partial f(\bar{x} + \alpha(x - \bar{x}))}{\partial x} \, d\alpha, \qquad (2)$$
where $x_i$ represents the embedding of input token $t_i$, $\odot$ is the Hadamard product, and $\bar{x}_i$ is the embedding of the baseline token (e.g., the unknown-vocabulary token). In a manner similar to the SHAP computation, the integral is empirically estimated using a summation over the line connecting $x$ and $\bar{x}$; a sum over a larger number of terms leads to a better approximation [57].
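For reference, a minimal PyTorch sketch of the summation-based approximation of Eq. (2) is shown below. The function `score_fn` is an assumed stand-in for the part of the classifier that maps token embeddings to the predicted-class score $f$; in practice one would typically rely on a library such as Captum rather than this hand-rolled version.

```python
# Minimal sketch: Riemann-sum approximation of Integrated Gradients (cf. Eq. 2).
# `score_fn` is a hypothetical callable: (T, d) embedding tensor -> scalar score f.
import torch

def integrated_gradients(score_fn, x, x_baseline, n_steps=300):
    total_grad = torch.zeros_like(x)
    for k in range(1, n_steps + 1):
        alpha = k / n_steps
        point = (x_baseline + alpha * (x - x_baseline)).detach().requires_grad_(True)
        score = score_fn(point)
        grad, = torch.autograd.grad(score, point)
        total_grad += grad
    avg_grad = total_grad / n_steps
    # Hadamard product with (x - x_baseline); summing over the embedding
    # dimension yields one attribution per token.
    return ((x - x_baseline) * avg_grad).sum(dim=-1)
```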
By construction, both SHAP and IG have the desirable property that the sum of token attributions equals the predicted score of the class, that is, $\sum_{i=0}^{T} \phi_i = f(t)$. Here, $\phi_0$ is the model output on the baseline input (i.e., no features present), which corresponds to computing $f(\emptyset)$ for SHAP and $f(\bar{x})$ for IG.
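A quick numerical sanity check of this property, assuming `phi_0` and the token attributions come from one of the estimators sketched above, could look like the following (names are illustrative):

```python
# Sanity check of the summation (completeness) property: phi_0 plus the token
# attributions should approximately reconstruct the predicted-class score f(t).
import numpy as np

def completeness_gap(phi_0, token_attrs, f_t, tol=1e-3):
    gap = abs(phi_0 + float(np.sum(token_attrs)) - f_t)
    return gap, gap <= tol   # absolute gap and whether it is within tolerance
```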
4 Interpretability based on meta-tokens
We describe how to generate interpretations based on meta-tokens. While meta-tokens can be
constructed at various granularities (e.g., phrases, paragraphs), in this paper, we limit ourselves to
sentences. We also describe the setup for comparing the quality of token-based and sentence-based
interpretations.
4.1 Generating meta-token interpretations
We assume that the tokenized text $t = [t_1, \ldots, t_T]$ can be partitioned into non-overlapping meta-tokens $m = [m_1, \ldots, m_M]$, where each meta-token consists of one or more contiguous tokens. When meta-tokens are sentences, the partitioning can be done using off-the-shelf sentencizers like spaCy [25].
Given token-based attributions $\Phi(t) = [\phi(t_1), \ldots, \phi(t_T)]$, one can form SHAP and IG meta-token attributions by summing the attributions of all the tokens corresponding to a meta-token $m$. This follows from the summation property of these methods mentioned in §3 and the commonly taken feature-independence [38, 56] / marginal-expectation [26] assumption. We refer to this procedure for computing meta-token attributions as the indirect method. For SHAP, however, one can also directly compute the feature attributions based on meta-tokens by solving Eq. (1) where the feature coalitions are formed by dropping full meta-tokens instead of individual tokens. This variant, where feature coalitions are formed based on meta-tokens, reduces the number of all possible coalitions, since $M \ll T$ implies $2^M \ll 2^T$, and as a result can lead to more accurate estimates for the same computational budget.
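A minimal sketch of the indirect method is given below. It assumes token-level character spans are available (e.g., from a HuggingFace fast tokenizer's offset mapping) and uses spaCy sentence boundaries; the helper names are illustrative rather than taken from the paper's code. The direct method, in contrast, simply runs a SHAP estimator with sentences as the features, dropping whole sentences instead of tokens.

```python
# Minimal sketch of the "indirect" method: sum token attributions per sentence.
# `token_spans[i]` is the (start_char, end_char) span of token i in `text`;
# `token_attrs[i]` is its attribution (e.g., from SHAP or IG).
import spacy

nlp = spacy.load("en_core_web_sm")   # off-the-shelf sentencizer

def sentence_attributions(text, token_spans, token_attrs):
    sents = [(s.start_char, s.end_char, s.text) for s in nlp(text).sents]
    sent_attrs = [0.0] * len(sents)
    for (tok_start, tok_end), attr in zip(token_spans, token_attrs):
        for i, (s_start, s_end, _) in enumerate(sents):
            if s_start <= tok_start and tok_end <= s_end:
                sent_attrs[i] += attr
                break
    return [s[2] for s in sents], sent_attrs   # sentence texts and their attributions
```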
4.2 Computational evaluation metrics
To compare token- vs. sentence-based attributions, we first extend the randomization tests of [64]. Then, we propose a metric to compare the variability of interpretations for stochastic methods like SHAP.
Metrics based on parameter randomization [64]. Given the training data and an interpretability method, [64] start by training three different models: (i) a fully trained model referred to as Init#1, (ii) another fully trained model Init#2 that is identical to Init#1 in all aspects (e.g., training data, training procedure like batching, and learning rate) except for the initial randomized parameters, and (iii) an untrained model Untrained which is obtained by taking Init#1 and randomizing the weights of all layers except the encoder ones. Given these models, the following tests are carried out:
Different Initializations Test (DIT). This test measures the overlap in feature attributions of the same input from two functionally equivalent models Init#1 and Init#2 (i.e., models that agree on all of the test inputs [57, 64]). The overlap in the attributions is measured using the Jaccard@K% metric [64] (also called the intersection-over-union metric in [16]) between the two ranked attribution lists. Given two sets of attributions $\Phi_i, \Phi_j$ for the same input, Jaccard@K% is defined as
$$J(\Phi_i, \Phi_j) = \frac{|r_i \cap r_j|}{|r_i \cup r_j|},$$
where $r_i$ is the set of top-K% tokens as ranked according to the attribution $\Phi_i$.
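A minimal sketch of this metric (here with K = 25 by default) is:

```python
# Jaccard@K%: overlap between the top-K% features of two attribution vectors.
import numpy as np

def jaccard_at_k(phi_i, phi_j, k_percent=25):
    n = len(phi_i)
    top_k = max(1, int(round(n * k_percent / 100.0)))
    r_i = set(np.argsort(phi_i)[::-1][:top_k])   # indices of top-K% features
    r_j = set(np.argsort(phi_j)[::-1][:top_k])
    return len(r_i & r_j) / len(r_i | r_j)
```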
Untrained Model Test (UMT). This test measures the overlap in feature attributions of the same input between the trained model Init#1 and the untrained model Untrained.
The authors in [64] argue that a low overlap according to DIT or a high overlap according to UMT suggests that an interpretability method is not robust. To compare the quality of token- and sentence-based interpretations, we conduct the tests separately for both token- and sentence-based interpretations. For each test, let $J_t$ denote Jaccard@K% under token-based interpretations and $J_s$ denote Jaccard@K% under sentence-based interpretations. We analyze the quantity $J_s - J_t$ to compare the robustness: $J_s - J_t > 0$ for DIT implies greater robustness for sentences, and similarly, $J_s - J_t < 0$ for UMT implies greater robustness for sentences.
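Assuming attribution pairs for the same inputs are available from two models (Init#1 vs. Init#2 for DIT, or Init#1 vs. Untrained for UMT), the comparison reported later in the tables can be sketched as follows, reusing `jaccard_at_k` from the sketch above:

```python
# Median of per-input differences J_s - J_t.
# sent_pairs / tok_pairs: lists of (attributions from model A, attributions from
# model B) for the same inputs, under sentence- and token-based interpretations.
import numpy as np

def median_jaccard_difference(sent_pairs, tok_pairs, k_percent=25):
    diffs = [jaccard_at_k(sa, sb, k_percent) - jaccard_at_k(ta, tb, k_percent)
             for (sa, sb), (ta, tb) in zip(sent_pairs, tok_pairs)]
    return float(np.median(diffs))
```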
Overlap. We also propose a second measure to test the variability of the interpretations. Specifically, inspired by the variance analysis of [14], we measure the extent to which the top-ranked interpretations vary across different runs of the interpretability method with the same model on the same input. Given a model and an input $t$, we compute the feature attributions $L$ times. For $1 \le j \le L$, let $\Phi_j$ denote the outcome of the $j$-th run. Then, we compute the median pairwise overlap (see §5.1 for the reasoning behind the choice of median over mean) between attributions obtained from different runs of the attribution method as:
$$\mathrm{overlap} = \mathrm{median}\big(\{ J(\Phi_i, \Phi_j) \mid 1 \le i < j \le L \}\big). \qquad (3)$$
A high overlap means higher commonality between the top-K% attributions across different runs of the interpretability method.
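A minimal sketch of Eq. (3), again reusing `jaccard_at_k` from above:

```python
# Median pairwise Jaccard@K% across L repeated runs of a stochastic explainer
# on the same model and input (cf. Eq. 3).
from itertools import combinations

import numpy as np

def overlap(attribution_runs, k_percent=25):
    scores = [jaccard_at_k(a, b, k_percent)
              for a, b in combinations(attribution_runs, 2)]
    return float(np.median(scores))
```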
4.3 Human intelligibility measures
A crucial goal of explainable AI (XAI) methods is to render ML predictions more intelligible for human observers. The extent to which robustness measures (or, more generally, XAI quality metrics without humans in the loop) correlate with human perception is the subject of active research [24, 30, 44]. While studies with human subjects are a prerequisite for effective evaluation of XAI methods, there is no single human-in-the-loop metric that would account for all relevant aspects of XAI quality. Two metrics have become popular in the context of XAI quality assessment: (a) simulatability [33], that is, how well humans can replicate ML predictions using the features identified as important by the XAI method, and (b) annotation quality, meaning how accurately human annotators can replicate the ground truth label using the important features. Both metrics have been used to assess the impact of transparency in ML models on human-ML interaction [24, 30].
For the human studies in this paper, we chose the annotation quality metric. This choice is motivated by two reasons. First, the models obtain close to perfect test accuracy on the datasets used in the surveys (Appendix C), hence simulatability is almost equivalent to annotating the ground truth. Second, and more importantly, the impact of transparency on ground truth annotations can be considered more relevant for real-world scenarios: in many XAI applications, ML predictions are used to assist humans in annotating ground truth, and explanations are used to validate ML predictions [30]. To measure the usefulness of interpretability in such assistive settings, we ask subjects to annotate the ground truth. In contrast to simulatability, this metric allows us to assess cases when human annotations, ML predictions, and ground truth are not the same, for instance when explanations bias human annotations to blindly follow wrong predictions, or in scenarios where ML models need to be debugged.
In our experiments, we randomly assign subjects to one out of three experimental conditions: (1) control, where no XAI-based highlights were shown to users, (2) token, where individual words were highlighted as part of the task, and (3) sentence, where entire sentences were highlighted. In both the token and sentence conditions, we highlight the top-10% features; that is, we highlight tokens/sentences, starting from the ones with the highest attribution, until 10% of the tokens in the text have been highlighted.
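A minimal sketch of this selection rule, with the 10% budget and per-feature token counts as the only inputs (names are illustrative):

```python
# Select features to highlight: take features in decreasing attribution order
# until roughly 10% of the tokens in the text are covered.
# `feature_token_counts[i]` is the number of tokens spanned by feature i
# (1 for tokens, the sentence length for sentences).
import numpy as np

def select_highlights(attributions, feature_token_counts, budget=0.10):
    total_tokens = float(sum(feature_token_counts))
    selected, covered = [], 0
    for i in np.argsort(attributions)[::-1]:
        if covered >= budget * total_tokens:
            break
        selected.append(int(i))
        covered += feature_token_counts[i]
    return selected   # indices of the highlighted tokens or sentences
```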
As part of our evaluation, we report on (1) human accuracy for ground truth annotation and (2) the time taken for annotations.
Figure 1: Histograms of Jaccard@25% with a BERT model on IMDB data. Higher values for sentences with DIT (than tokens) and much lower values with UMT suggest that sentences provide more robust interpretations. (a) Sentence-based interpretations are more robust w.r.t. DIT: with sentence-based interpretations, over 50% of the cases result in a Jaccard@25% value of 1, meaning that the top-25% interpretations for two functionally equivalent models (Init#1 vs. Init#2) are identical. (b) Sentence-based interpretations are more robust w.r.t. UMT: with sentence-based interpretations, almost 50% of the cases result in a Jaccard@25% value of 0, meaning that the top-25% interpretations for trained and untrained models (Init#1 vs. Untrained) have no overlap.
Following [51], we combine (1) and (2) using the Information Transfer Rate metric $ITR = I(y_h, y)/t$, where $y$ denotes the true label, $y_h$ is the label annotated by the human, $t$ is the average response time, and $I$ is the mutual information. An interpretability method that is more helpful for humans will have a higher annotation throughput, measured by ITR in bits/s.
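A minimal sketch of the ITR computation, assuming scikit-learn's `mutual_info_score` (which returns nats and is converted to bits here):

```python
# Information Transfer Rate in bits per second: mutual information between
# human labels and ground truth, divided by the average response time.
import numpy as np
from sklearn.metrics import mutual_info_score

def information_transfer_rate(y_true, y_human, response_times_s):
    mi_bits = mutual_info_score(y_true, y_human) / np.log(2)  # nats -> bits
    return mi_bits / float(np.mean(response_times_s))
```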
5 Experiments with computational measures
We now describe the experiments comparing token-based and sentence-based interpretability.
Datasets. With the goal of covering different application domains and data characteristics, we consider the following three datasets. IMDB: The movie review data where the task is to assess the sentiment (positive or negative) of a movie review. Medical: Medical text data, where the task is to classify the condition of a patient into one of five classes from a medical abstract. Wiki: The Wikipedia article data where the task is to predict whether an article is written with a promotional or neutral tone. Appendix A provides more details on the datasets, their sources, and licenses.
Models, training, and hyperparameters. Following [64], we focus on pretrained Transformers, specifically BERT (BERT), RoBERTa (RoB), and their distilled versions, namely DistilBERT (dBERT) and DistilRoBERTa (dRoB). All experiments were performed on AWS g4dn.xlarge instances. The architectures, training details, and the information on the software used are included in Appendix B. For both tokens and sentences, following the author implementation [36], we set the number of coalitions for the SHAP computation to $2M + 2^{11}$, where $M$ is the number of features (e.g., tokens or sentences). For IG, following the original paper [57], we use 300 as the number of iterations. Due to the high computational cost of obtaining SHAP attributions for large models, we limit the analysis of attribution comparison to 1,000 randomly selected test inputs. For the computation of Jaccard@K%, following [64], we set the value of K to 25%, as [64] obtained similar results for other values. The model accuracy and overlap statistics between different initializations are shown in Appendix C.
We now describe the results comparing the token- vs. sentence-based interpretations using the metrics described in §4. For expositional simplicity, the results reported in this section are with the SHAP algorithm, and sentence-based attributions are computed using the direct method in §4. The results for the IG method and for SHAP with indirect computation are briefly summarized at the end of this section, and the detailed results can be found in Appendix D.
5.1 Robustness under the randomization tests
Figure 1 shows the result of the two randomization tests with SHAP attributions on the IMDB dataset when trained with the BERT model. Specifically, Figure 1a shows the histogram of similarity between interpretations generated using two functionally equivalent models, whereas Figure 1b shows the similarity between trained and untrained models. In Figure 1a, we notice that in over half of the cases (53%), Jaccard@25% is 1, that is, the sets of top-25% sentences (ranked according to their attributions) are identical between the two functionally equivalent models (Init#1 and Init#2).
Table 1: Positive (median) values of $J_s - J_t$ (§4.2) for DIT in 8/12 cases, and negative values for UMT in 10/12 cases, mean that sentence-based interpretations are more robust.

(a) DIT: Median difference in Jaccard@25% of sentences and tokens when comparing Init#1 and Init#2. Positive values mean sentences are more robust.

         BERT  RoB  dBERT  dRoB
IMDB       20    5     14    22
Medical    15   11      9    17
Wiki      -25  -19    -25    -4

(b) UMT: Median difference in Jaccard@25% of sentences and tokens when comparing Init#1 and Untrained. Negative values mean sentences are more robust.

         BERT  RoB  dBERT  dRoB
IMDB      -17   14    -16   -11
Medical    -5    2    -14   -17
Wiki      -23  -16    -44    -6
Table 2: Median difference between the overlap (§4.2) of sentences and tokens when computing the feature attributions repeatedly 10 times. A positive value means that sentence-based attributions are more stable.

(a) Trained model (Init#1)

         BERT  RoB  dBERT  dRoB
IMDB       62   67     65    64
Medical    65   25     63    63
Wiki       12   35     36    36

(b) Untrained model (Untrained)

         BERT  RoB  dBERT  dRoB
IMDB       19   66     29    66
Medical    64   64     63    62
Wiki       36   34     19    33
On the other hand, Jaccard@25% is never 1.0 for tokens. So, as expected, sentences lead to higher robustness w.r.t. the different initializations test. In Figure 1b, we notice that in almost half of the cases (49%), Jaccard@25% is 0 when comparing sentence-based interpretations between Init#1 and Untrained. In other words, the sets of top-25% sentences have no overlap between the trained model (Init#1) and the untrained model (Untrained). On the other hand, Jaccard@25% is 0 only 8% of the time for tokens. Again, sentence-based interpretations are more robust.
We present the results over all the models and datasets in Table 1. The results show the median difference in Jaccard@25% between sentences and tokens for the two tests. We compare the medians instead of the means due to the asymmetric and long-tailed nature of the distributions of Jaccard@25% for sentences in Figure 1. The table shows that sentences have a higher median Jaccard@25% than tokens for the IMDB and Medical datasets when comparing Init#1 and Init#2 (Table 1a). The trend is, however, reversed for the Wiki dataset, where the tokens are more robust. When comparing Init#1 and Untrained, sentences have a lower median Jaccard@25% in 10 out of 12 cases (Table 1b). The only cases where tokens are more robust are the RoB model on the IMDB and Medical datasets. Appendix D shows the results for indirect computation of sentence attributions using SHAP (§4). The results show a very similar trend as in Table 1. The corresponding results for IG (Appendix D), however, show that while sentence-based interpretations are more robust w.r.t. DIT in most cases, they are less robust w.r.t. UMT.
To summarize, the results show that sentences lead to higher robustness w.r.t. both randomization tests when using SHAP, whereas for IG, sentences are more robust only w.r.t. DIT.
5.2 Interpretation variability
For each input, we generate the interpretations 10 times (that is, $L = 10$ from §4.2). Following the description in §4.2, we present the median value of the overlap metric for both token- and sentence-based interpretations in Table 2. Table 2a shows that in all configurations, the median difference in overlap is positive, implying that sentences lead to higher overlap across different runs of the interpretability method. The same holds for the untrained model as well (Table 2b). However, we notice that with indirect computation of sentence-based SHAP attributions (Table 9 in Appendix D), sentences often lead to higher variability than tokens.
Table 3: Survey results on the IMDB and Wiki datasets, showing the human accuracy in predicting the ground truth under the control (C), token (T), and sentence (S) conditions. Also shown are the aggregated time taken by the participants and the Information Transfer Rate (ITR). In both cases, sentences (S) provide a higher ITR than tokens (T). Best condition in boldface per column.

(a) IMDB: Both token- and sentence-based interpretations decrease human annotation accuracy as compared to control, with similar cognitive load for annotators, as indicated by the task response times. Combining accuracy and time via ITR shows that both tokens and sentences lead to a lower ITR than control, but sentences perform better than tokens.

    Accuracy     Time [s]            ITR [bits/s]
C   0.87 ± 0.33  2111.11 ± 527.99    0.009 ± 0.003
T   0.80 ± 0.40  2084.30 ± 749.91    0.005 ± 0.002
S   0.84 ± 0.36  2104.13 ± 1022.43   0.009 ± 0.005

(b) Wiki: Token-based interpretations decrease human annotation accuracy and increase cognitive load, as indicated by the increased response times. In contrast, sentence-based interpretations significantly improve annotation accuracy and lower the annotators' cognitive load. Sentences lead to the best ITR, a more than 2-fold improvement over tokens.

    Accuracy     Time [s]            ITR [bits/s]
C   0.62 ± 0.49  2291.83 ± 1287.84   0.0014 ± 0.0018
T   0.61 ± 0.49  3133.75 ± 1645.26   0.0007 ± 0.0012
S   0.65 ± 0.48  2046.72 ± 1145.81   0.0018 ± 0.0017
6 Impact of interpretation granularity on human annotation performance
We use the BERT model, and the IMDB and Wiki datasets. We omit the Medical dataset as the highly technical medical terminology would require access to human subjects who are experts in the domain. For each dataset, we randomly assigned human judges to one out of three conditions: control, token, and sentence (detailed description in §4.3). We chose sample sizes according to a power pre-registration: we determined the minimum sample count for our studies by specifying Type I ($\alpha = 0.05$) and Type II ($\beta = 0.2$) error rates, with estimated effect sizes coming from a pilot study. To execute the user surveys, we carefully compiled annotation instructions that explained the task as well as the payment details to the annotators. We ran all user studies on Toloka, a paid site for crowdsourced data labeling.¹ As described in §4.3, for both the token and sentence conditions, we highlighted 10% of the text. Further details on the human surveys, including the data pre-processing, the screenshots of the survey, payment information, and quality control, can be found in Appendix E.
For IMDB, we collected a total of N = 2150 samples across the three conditions. In Table 3a, we show human performance in terms of the ability to annotate the ground truth correctly and the task times. We note that both the accuracy and the annotation time deteriorate with interpretations (token and sentence); however, sentences still perform better than tokens. The relation between treatments and human performance was significant, $\chi^2(2, N = 2150) = 14.15$, $p = 0.001$. A pairwise comparison between the token and sentence conditions with Bonferroni-adjusted significance level ($\alpha = 0.017$) was above the significance threshold, $\chi^2(1, N = 1400) = 4.65$, $p = 0.03$.
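The significance tests above are standard chi-square tests of independence; a minimal sketch with SciPy is given below (the contingency-table construction is illustrative, not the exact analysis script used for the surveys).

```python
# Chi-square test of independence between experimental condition and whether
# the annotation matched the ground truth.
import numpy as np
from scipy.stats import chi2_contingency

def condition_effect(conditions, correct):
    labels = sorted(set(conditions))
    # Rows: conditions (e.g., control/token/sentence); columns: correct vs. wrong.
    table = np.array([[sum(1 for c, ok in zip(conditions, correct) if c == lab and ok),
                       sum(1 for c, ok in zip(conditions, correct) if c == lab and not ok)]
                      for lab in labels])
    chi2, p, dof, _ = chi2_contingency(table)
    return chi2, p, dof
```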
For the Wiki dataset, we collected a total of N = 1950 samples. In Table 3b, the control and token conditions have comparable human annotation accuracy; however, for token, the annotation time increases significantly, indicating increased cognitive load. Sentence-level explanations, in contrast, lead to substantially higher annotation accuracy and less time spent on tasks. A comparison of ITR between tokens and sentences shows that sentences increase the annotation throughput (measured in bits/s) by over 150%. We conducted a chi-square test of independence to examine the relation between treatments and human performance. The relation between conditions and human performance was not significant, $\chi^2(2, N = 1950) = 3.37$, $p = 0.19$. A pairwise comparison with Bonferroni-adjusted significance level ($\alpha = 0.017$) between token and sentence was above the significance threshold, $\chi^2(1, N = 1350) = 2.68$, $p = 0.10$.
While sentences perform better than tokens in both datasets, the inclusion of sentence highlights deteriorates the annotation performance on the IMDB data when compared to the control condition. We hypothesize that this discrepancy may be due to the difference in the difficulty level of the tasks: while sentences do not improve the annotation performance on the easy IMDB data (accuracy of 0.87 with control), they increase it significantly on the more difficult Wiki data (accuracy of 0.62 with control). To summarize, we notice that sentence-based interpretations lead to significantly better human intelligibility (measured via response time and accuracy) as compared to token-based ones.
¹ toloka.yandex.com
Finally, we compare the human annotation performance with a closely related computational metric
of infidelity in Appendix F. Our analysis shows a possible misalignment between the two metrics.
7 Conclusions, limitations & future work
It has been widely recognized that the evaluation of interpretability methods is one of the key research challenges in ML [3]. We complement recent advances in the domain of computer vision [4, 13, 27] and take a first step towards addressing the non-robustness of deep text classifiers. We argue that the large number of tokens means that token-based interpretations are prone to a lack of robustness. Similarly, a large number of tokens may cause high variability (§5.2), leading to a degradation of trust by the users. In line with previous results on the impact of interpretation complexity [30], our results demonstrate that the non-contiguous nature of token-based interpretations may make them more difficult for users to digest, as shown by the human surveys. Our results demonstrate that sentence-based interpretations lead to improved annotation accuracy accompanied by a lower cognitive load to process the interpretations. We find that in difficult tasks, this reduced cognitive load can lead to a substantial increase in annotation throughput. These findings suggest that for certain datasets, sentence-based interpretations have the potential to support AI-assisted decision making better than traditional token-based interpretations and thus contribute effectively to more responsible usage of ML technology.
Finally, we note limitations and avenues for improvement. Owing to their widespread usage, our analysis has been limited to Transformer-based classifiers only. Extension to other sequence-based (e.g., LSTMs/GRUs) and non-sequence-based models (e.g., n-gram models) is an interesting future research direction. An extension to sentence-based interpretability is possible for any (model-agnostic) perturbation-based method (e.g., SHAP [38], LIME [46]); however, sentences support only a limited number of gradient-based methods. Specifically, as discussed in §4, only methods for which the feature attributions sum to the output class score (e.g., Integrated Gradients [57] and Layerwise Relevance Propagation [41]) are supported. Extending gradient-based methods to accommodate sentence-based interpretability is an important future direction. Results in §6 showed that interpretations (token- or sentence-based) do not always improve the annotation performance of humans. We hypothesized that this may be related to the "difficulty" level of the task. Further investigations to map dataset characteristics to the kind of interpretations that are more suitable for humans are also worth pursuing. In this paper, we only focused on sentences as units of meta-token interpretations. We plan to explore extensions to other units such as phrases and paragraphs. Finally, our work also highlighted the tension between automatic (infidelity) and human-subject-based evaluation metrics of interpretability. Resolving these tensions and developing semi-automatic and scalable evaluation metrics is another potential direction for research.
References
[1]
Abubakar Abid, Maheen Farooqi, and James Zou. Persistent Anti-Muslim Bias in Large
Language Models. arXiv:2101.05783 [cs], January 2021. arXiv: 2101.05783.
[2]
Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim.
Sanity Checks for Saliency Maps. In Proceedings of the 32nd International Conference on
Neural Information Processing Systems, NIPS’18, pages 9525–9536, Red Hook, NY, USA,
December 2018. Curran Associates Inc.
[3]
Julius Adebayo, Michael Muelly, Ilaria Liccardi, and Been Kim. Debugging Tests for Model
Explanations. In Proceedings of the 34th International Conference on Neural Information
Processing Systems, 2020.
[4]
David Alvarez-Melis and Tommi S. Jaakkola. Towards Robust Interpretability with Self-
explaining Neural Networks. In Proceedings of the 32nd International Conference on Neural
Information Processing Systems, NIPS’18, pages 7786–7795, Montréal, Canada, December
2018. Curran Associates Inc.
[5]
Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek.
Explaining Predictions of Non-Linear Classifiers in NLP. In Proceedings of the 1st Workshop
on Representation Learning for NLP, pages 1–7, June 2016.
[6]
Leila Arras, Ahmed Osman, Klaus-Robert Müller, and Wojciech Samek. Evaluating Recurrent
Neural Network Explanations. arXiv:1904.11829 [cs, stat], June 2019.
[7]
Pepa Atanasova, Jakob Grue Simonsen, Christina Lioma, and Isabelle Augenstein. A Diagnostic
Study of Explainability Techniques for Text Classification. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Language Processing (EMNLP), pages 3256–3274,
Online, November 2020. Association for Computational Linguistics.
[8]
Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A Pretrained Language Model for Scientific
Text. arXiv:1903.10676 [cs], September 2019.
[9]
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the
Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. In Proceedings
of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pages
610–623, Virtual Event, Canada, March 2021. Association for Computing Machinery.
[10]
Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep
Ghosh, Ruchir Puri, José M. F. Moura, and Peter Eckersley. Explainable Machine Learning
in Deployment. In Proceedings of the 2020 Conference on Fairness, Accountability, and
Transparency, FAT* ’20, pages 648–657, Barcelona, Spain, January 2020. Association for
Computing Machinery.
[11] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, October 2001.
[12]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler,
Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott
Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
Sutskever, and Dario Amodei. Language Models are Few-Shot Learners. In H. Larochelle,
M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information
Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
[13]
Jiefeng Chen, Xi Wu, Vaibhav Rastogi, Yingyu Liang, and Somesh Jha. Robust Attribution
Regularization. In Advances in Neural Information Processing Systems, volume 32, 2019.
[14]
Ian Covert and Su-In Lee. Improving KernelSHAP: Practical Shapley Value Estimation via
Linear Regression. arXiv:2012.01536 [cs, stat], April 2021. arXiv: 2012.01536.
[15]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of
Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers), May 2019.
[16]
Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard
Socher, and Byron C. Wallace. ERASER: A Benchmark to Evaluate Rationalized NLP Models.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics,
pages 4443–4458, Online, July 2020. Association for Computational Linguistics.
[17]
Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei
Chang, and Rahul Gupta. BOLD: Dataset and Metrics for Measuring Biases in Open-Ended
Language Generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability,
and Transparency, FAccT ’21, pages 862–872, Virtual Event, Canada, March 2021. Association
for Computing Machinery.
[18]
Finale Doshi-Velez and Been Kim. Towards A Rigorous Science of Interpretable Machine
Learning. arXiv:1702.08608 [cs, stat], March 2017. arXiv: 1702.08608.
[19]
William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to Trillion
Parameter Models with Simple and Efficient Sparsity. arXiv:2101.03961 [cs], January 2021.
arXiv: 2101.03961.
[20]
Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-
Graber. Pathologies of Neural Models Make Interpretations Difficult. In Proceedings of the
2018 Conference on Empirical Methods in Natural Language Processing, August 2018.
[21]
Ruth Fong, Mandela Patrick, and Andrea Vedaldi. Understanding Deep Networks via Extremal
Perturbations and Smooth Masks. In Proceedings of the IEEE/CVF International Conference
on Computer Vision (ICCV), pages 2950–2958, 2019.
[22]
L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal. Explaining Explana-
tions: An Overview of Interpretability of Machine Learning. In 2018 IEEE 5th International
Conference on Data Science and Advanced Analytics (DSAA), pages 80–89, October 2018.
[23]
Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and
Dino Pedreschi. A Survey of Methods for Explaining Black Box Models. ACM Computing
Surveys, 51(5):93:1–93:42, August 2018.
[24]
Peter Hase and Mohit Bansal. Evaluating Explainable AI: Which Algorithmic Explanations
Help Users Predict Model Behavior? In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, pages 5540–5552, Online, July 2020. Association
for Computational Linguistics.
[25]
Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-
strength Natural Language Processing in Python, 2020. https://spacy.io/.
[26]
Dominik Janzing, Kailash Budhathoki, Lenon Minorics, and Patrick Blöbaum. Causal structure
based root cause analysis of outliers. arXiv:1912.02724 [cs, math, stat], December 2019.
[27]
Andrei Kapishnikov, Tolga Bolukbasi, Fernanda Viegas, and Michael Terry. XRAI: Better
Attributions Through Regions. In 2019 IEEE/CVF International Conference on Computer
Vision (ICCV), pages 4947–4956, Seoul, Korea (South), October 2019. IEEE.
[28]
Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan
Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, and Orion Reblitz-
Richardson. Captum: A Unified and Generic Model Interpretability Library for PyTorch.
arXiv:2009.07896 [cs, stat], September 2020. arXiv: 2009.07896.
[29]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep
Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural
Information Processing Systems - Volume 1, NIPS’12, pages 1097–1105, Red Hook, NY, USA,
December 2012. Curran Associates Inc.
[30]
Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Sam Gershman, and Finale
Doshi-Velez. An Evaluation of the Human-Interpretability of Explanation. In Workshop on
Correcting and Critiquing Trends in Machine Learning, 2018.
[31]
Himabindu Lakkaraju, Ece Kamar, Rich Caruana, and Jure Leskovec. Faithful and Customizable
Explanations of Black Box Models. In Proceedings of the 2019 AAAI/ACM Conference on AI,
Ethics, and Society, AIES ’19, pages 131–138, New York, NY, USA, January 2019. Association
for Computing Machinery.
[32]
Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and Understanding Neural
Models in NLP. In Proceedings of the 2016 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, January 2016.
[33]
Zachary C. Lipton. The Mythos of Model Interpretability. Communications of the ACM,
61(10):36–43, September 2018.
[34]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT
Pretraining Approach. arXiv:1907.11692 [cs], July 2019.
[35]
Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International
Conference on Learning Representations, 2019.
[36] Scott Lundberg. SHAP, 2018. https://github.com/slundberg/shap.
[37]
Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. Consistent Individualized Feature
Attribution for Tree Ensembles. arXiv:1802.03888 [cs, stat], March 2019. arXiv: 1802.03888.
[38]
Scott M. Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. In
Proceedings of the 31st International Conference on Neural Information Processing Systems,
NIPS’17, pages 4768–4777, Red Hook, NY, USA, December 2017. Curran Associates Inc.
[39]
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher
Potts. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual
Meeting of the Association for Computational Linguistics: Human Language Technologies -
Volume 1, HLT ’11, pages 142–150, USA, June 2011. Association for Computational Linguistics.
[40]
Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial
Intelligence, 267:1–38, February 2019.
[41]
Grégoire Montavon, Alexander Binder, Sebastian Lapuschkin, Wojciech Samek, and Klaus-
Robert Müller. Layer-Wise Relevance Propagation: An Overview. In Wojciech Samek, Grégoire
Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert Müller, editors, Explainable AI:
Interpreting, Explaining and Visualizing Deep Learning, Lecture Notes in Computer Science,
pages 193–209. Springer International Publishing, Cham, 2019.
[42]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas
Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy,
Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style,
High-Performance Deep Learning Library. Advances in Neural Information Processing Systems,
32:8026–8037, 2019.
[43]
Nina Poerner, Hinrich Schütze, and Benjamin Roth. Evaluating Neural Network Explanation
Methods using Hybrid Documents and Morphosyntactic Agreement. In Proceedings of the 56th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pages 340–350, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[44]
Forough Poursabzi-Sangdeh, Daniel G. Goldstein, Jake M. Hofman, Jennifer Wortman Vaughan,
and Hanna Wallach. Manipulating and Measuring Model Interpretability. In Proceedings of
the 2021 CHI Conference on Human Factors in Computing Systems, January 2021. arXiv:
1802.07810.
[45]
Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese
BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Lan-
guage Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), August 2019.
[46]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?": Explain-
ing the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD international
conference on knowledge discovery and data mining, August 2016.
[47]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-Precision Model-
Agnostic Explanations. In Thirty-Second AAAI Conference on Artificial Intelligence, April
2018.
[48]
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond Accuracy:
Behavioral Testing of NLP Models with CheckList. arXiv:2005.04118 [cs], May 2020.
[49]
Yves Rychener, Xavier Renard, Djamé Seddah, Pascal Frossard, and Marcin Detyniecki.
Sentence-Based Model Agnostic NLP Interpretability. arXiv:2012.13189 [cs, stat], December
2020. arXiv: 2012.13189.
[50]
W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. Müller. Evaluating the Visualization
of What a Deep Neural Network Has Learned. IEEE Transactions on Neural Networks and
Learning Systems, 28(11):2660–2673, November 2017.
[51]
Philipp Schmidt and Felix Biessmann. Quantifying Interpretability and Trust in Machine
Learning Systems. In AAAI-19 Workshop on Network Interpretability for Deep Learning,
January 2019.
[52]
Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning Important Features
Through Propagating Activation Differences. In International Conference on Machine Learning,
pages 3145–3153. PMLR, July 2017.
[53]
Dylan Slack, Sophie Hilgard, Sameer Singh, and Himabindu Lakkaraju. Reliable Post hoc
Explanations: Modeling Uncertainty in Explainability. In Neural Information Processing
Systems, 2021.
[54]
Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and Policy Considerations
for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics, pages 3645–3650, Florence, Italy, July 2019. Association for
Computational Linguistics.
[55]
Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to Fine-Tune BERT for Text
Classification? In Maosong Sun, Xuanjing Huang, Heng Ji, Zhiyuan Liu, and Yang Liu, editors,
Chinese Computational Linguistics, Lecture Notes in Computer Science, pages 194–206, Cham,
2019. Springer International Publishing.
[56]
Mukund Sundararajan and Amir Najmi. The Many Shapley Values for Model Explanation. In
International Conference on Machine Learning, pages 9269–9278. PMLR, November 2020.
[57]
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic Attribution for Deep Networks.
In International Conference on Machine Learning, June 2017.
[58]
Ian Tenney, James Wexler, Jasmijn Bastings, Tolga Bolukbasi, Andy Coenen, Sebastian
Gehrmann, Ellen Jiang, Mahima Pushkarna, Carey Radebaugh, Emily Reif, and Ann Yuan. The
Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP
Models. arXiv:2008.05122 [cs], August 2020.
[59]
Erico Tjoa and Cuntai Guan. A Survey on Explainable Artificial Intelligence (XAI): Towards
Medical XAI. IEEE Transactions on Neural Networks and Learning Systems, pages 1–21, 2020.
arXiv: 1907.07374.
[60]
Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, and Sameer Singh.
AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models. In Proceedings
of the 2019 EMNLP and the 9th IJCNLP (System Demonstrations), pages 7–12, September
2019.
[61]
X. Wang, A. McCallum, and X. Wei. Topical N-Grams: Phrase and Topic Discovery, with
an Application to Information Retrieval. In Seventh IEEE International Conference on Data
Mining (ICDM 2007), pages 697–702, October 2007.
[62]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain
Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-Art
Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in
Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020.
Association for Computational Linguistics.
[63]
Yi Yang, Mark Christopher Siy UY, and Allen Huang. FinBERT: A Pretrained Language Model
for Financial Communications. arXiv:2006.08097 [cs], July 2020.
[64]
Muhammad Bilal Zafar, Michele Donini, Dylan Slack, Cedric Archambeau, Sanjiv Das, and
Krishnaram Kenthapadi. On the Lack of Robust Interpretability of Neural Text Classifiers. In
ACL Findings, 2021.
A Datasets
We describe each of the datasets in detail here:
IMDB: The movie review data made publicly available by [39].² The prediction task is to assess whether a given movie review reflects a positive or negative sentiment. The binary sentiment labels were obtained by [39] by binning the IMDB review scores, which range from 1 to 10. Scores in the range 1-4 are labeled as negative, whereas scores from 7-10 are labeled as positive.
Medical: Medical text data from Kaggle (licensed under CC0: Public Domain).³ The task is to predict the condition of a patient from a medical abstract. The conditions are described by 5 classes, e.g., digestive system diseases, cardiovascular diseases, etc.
Wiki: The Wikipedia article data from Kaggle (licensed under CC BY-SA 3.0).⁴ The task is to predict whether a Wikipedia article is written with a promotional or neutral tone. Examples of ‘promotional’ articles include advertisements, resume-like articles, etc. For more details, we point the reader to the corresponding page on Wikipedia.⁵ We note that the articles marked promotional tend to be significantly shorter than the neutral articles. In order to ensure that the prediction models do not simply use the text length as an indicator of the label, we remove all inputs that have fewer than 500 words.
Table 4 shows the statistics for the datasets used in the experiments.

Table 4: Detailed statistics of the datasets used in the experiments. The Words and Sentences columns show the average ± standard deviation across the data. Prev. Most (Prev. Least) shows the prevalence (in percentage) of the most (least) prevalent class in the dataset.

Dataset   Samples  Classes  Prev. Most  Prev. Least  Words        Sentences
IMDB      50,000   2        50          50           232±171      10±7
Medical   14,438   5        33          10           184±80       8±3
Wiki      41,460   2        71          29           2,603±1,618  117±74
B Reproducibility
B.1 Software
We use the HuggingFace [62] and PyTorch [42] libraries for model training. Feature attributions are generated using the Captum library at version 0.4.0 [28]. Input texts were split into sentences using the spaCy library [25]. All the libraries are available under open-source licenses on GitHub.
B.2 Training details
For training the models, we follow a similar strategy as [64]. Specifically, we start from pretrained encoders and add a fully-connected (FC) layer of 512 units with ReLU activations, followed by a final classification layer of C units, where C is the number of classes. The encoder embeddings are pooled using average pooling before being fed into the FC layer. For training, following [64], we use the AdamW optimizer proposed by [35]. Training is run for a maximum of 25 epochs with the following patience-based early stopping strategy: if the accuracy on the validation set does not increase for 5 consecutive epochs, the training is stopped.
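For concreteness, the following is a minimal PyTorch sketch of this classifier architecture, assuming a HuggingFace encoder. The encoder name, class name, and the masked average pooling are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch of the encoder + classification head described above.
import torch.nn as nn
from transformers import AutoModel

class EncoderClassifier(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", num_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # FC layer of 512 units with ReLU, followed by the final C-way classifier.
        self.head = nn.Sequential(
            nn.Linear(hidden, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        token_embeddings = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                  # (batch, seq_len, hidden)
        # Average-pool the token embeddings before the FC layer.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.head(pooled)                             # logits of shape (batch, C)
```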
We divide each dataset into an 80%–20% train-test split. Furthermore, 10% of the train set is set aside as a validation set and is used for hyperparameter tuning. The training pipeline consists of two hyperparameters: the learning rate, selected from $\{10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$, and the number of last layers of the encoder that are fine-tuned [55], selected from $\{0, 2\}$.
2 https://ai.stanford.edu/~amaas/data/sentiment/
3 https://www.kaggle.com/chaitanyakck/medical-text
4 https://www.kaggle.com/urbanbricks/wikipedia-promotional-articles
5 https://en.wikipedia.org/w/index.php?title=Category:Articles_with_a_promotional_tone
Table 5: Test accuracy with best model (Init#1) and the untrained model (Untrained).
(a) Init#1
BERT RoB dBERT dRoB
IMDB 0.92 0.95 0.93 0.94
Medical 0.60 0.64 0.63 0.65
Wiki 0.94 0.95 0.95 0.96
(b) Untrained
BERT RoB dBERT dRoB
IMDB 0.27 0.51 0.20 0.46
Medical 0.21 0.28 0.04 0.07
Wiki 0.65 0.05 0.78 0.94
Table 6: Percentage of common predictions between different initializations.
(a) Init#1 vs. Init#2
BERT RoB dBERT dRoB
IMDB 98 97 97 98
Medical 88 80 82 91
Wiki 98 97 98 96
(b) Init#1 vs. Untrained
BERT RoB dBERT dRoB
IMDB 26 50 18 45
Medical 25 35 2 5
Wiki 68 5 81 97
C Model training results
Table 5 shows the test set accuracy of the trained (Init#1) and untrained (Untrained) models. With the trained model Init#1, for a given dataset, all the Transformer encoders lead to similar classification accuracy. For both binary classification datasets (IMDB and Wiki), the accuracy is more than 90% in all cases. For the five-class Medical data, the accuracy is around 60%. With the untrained model, as expected, the accuracy is quite low for all dataset/encoder combinations. One exception to this trend is the Wiki data with the dRoB model, where the test accuracy with Untrained is almost as high as with the trained model Init#1. However, initializing the Untrained model with other random seeds results in much lower accuracy values: the accuracy values with 5 different random seeds are 0.06, 0.28, 0.49, 0.29 and 0.71 (as shown in Table 4, the majority/minority class split in the data is 71%/29%).
Table 6 shows the fraction of common predictions between different initializations. For both the IMDB and Wiki data, the fraction of common predictions between the two trained models is almost 100%, meaning that the two models are indeed almost functionally equivalent [57, 64]. For the Medical data, the fraction of common predictions is somewhat lower.
D Robustness and variability with indirect computation of sentence attributions
Here, we report the analogue of the results from Section 5.1 when the sentence-level attribution scores are derived via the indirect method of Section 4, i.e., by summing the contributions of the individual tokens.
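For illustration, a minimal sketch of this indirect aggregation (with hypothetical function and variable names) could look as follows:

```python
# Indirect sentence attributions: sum the attributions of the tokens belonging
# to each sentence. Names are illustrative, not taken from the paper's code.
def indirect_sentence_attributions(token_attributions, token_to_sentence):
    # token_attributions: one attribution score per token.
    # token_to_sentence: sentence index of each token (same length).
    scores = [0.0] * (max(token_to_sentence) + 1)
    for attribution, sentence_idx in zip(token_attributions, token_to_sentence):
        scores[sentence_idx] += attribution
    return scores

# e.g. indirect_sentence_attributions([1.0, 2.0, -1.0, 3.0], [0, 0, 1, 1]) -> [3.0, 2.0]
```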
Table 7 shows the results of the two randomization tests for SHAP. In a manner similar to Section 5.1, the sentence-based interpretations perform better in both randomization tests.
Table 8 shows the results of the two randomization tests for IG. While the results for the DIT are similar to those for SHAP (sentence-based interpretations are more robust), the results for the UMT show the opposite trend: the token-based interpretations are more robust.
Table 9 shows the median difference in the overlap metric between sentences and tokens. As opposed to the direct computation in §5.2, we notice a higher variability with sentences as compared to tokens.
Table 7: [SHAP indirect computation for sentence-based interpretations] Positive (median) values of $J_s - J_t$ (§4.2) for DIT in 8/12 cases, and negative values for UMT in 10/12 cases, mean that sentence-based interpretations are more robust w.r.t. both tests.
(a) DIT: Median difference in Jaccard@25% of sentences and tokens when comparing over Init#1 and Init#2. Positive values mean sentences are more robust.
BERT RoB dBERT dRoB
IMDB 17 -3 10 17
Medical 14 -5 9 15
Wiki 5 -12 5 -3
(b) UMT: Median difference in Jaccard@25% of sentences and tokens when comparing over Init#1 and Untrained. Negative values mean sentences are more robust.
BERT RoB dBERT dRoB
IMDB -8 13 -5 -12
Medical -5 -2 4 -10
Wiki -9 -12 -10 -3
Table 8: [IG indirect computation for sentence-based interpretations] Positive (median) values of $J_s - J_t$ (§4.2) for DIT in 8/12 cases mean that sentence-based interpretations are more robust. However, as opposed to SHAP, UMT results in positive values in 10/12 cases, showing that the token-based interpretations are more robust.
(a) DIT: Median difference in Jaccard@25% of sentences and tokens when comparing over Init#1 and Init#2. Positive values mean sentences are more robust.
BERT RoB dBERT dRoB
IMDB 16 -2 13 23
Medical 23 3 23 21
Wiki -2 -9 -7 0
(b) UMT: Median difference in Jaccard@25% of sentences and tokens when comparing over Init#1 and Untrained. Negative values mean sentences are more robust.
BERT RoB dBERT dRoB
IMDB 8 16 9 10
Medical 0 21 39 16
Wiki 13 -2 -4 4
E Human intelligibility surveys
E.1 Data pre-processing
We applied two pre-processing steps to the texts before the annotation tasks:
1. For token-level highlighting, we noticed that the individual features for the Transformer models (that is, the tokens) can be at an even smaller granularity than words. This happens due to the sub-word tokenization employed by these models to reduce the size of the embedding matrix. This sub-word tokenization results in cases where words may get split in the following manner: ‘ailments’ → ‘ai’ + ‘##lm’ + ‘##ents’ [15]. In order to avoid any additional complexity arising from this sub-word tokenization, we merge sub-word tokens into whole words and use the average of the sub-word token attributions as the attribution of the merged word (see the sketch after this list).
2. We notice that highlighting sentences may result in cases where more than 10% of the text gets highlighted (due to a sentence being longer than 10% of the text). For a fair comparison, we truncate such overflowing sentences so that no more than 10% of the text is highlighted.
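The sub-word merging of step 1 can be sketched as follows, assuming WordPiece-style ‘##’ continuation markers; the function and variable names are illustrative.

```python
# Merge WordPiece sub-word tokens into whole words and average their attributions.
def merge_subwords(tokens, attributions):
    words, grouped = [], []
    for token, attribution in zip(tokens, attributions):
        if token.startswith("##") and words:
            words[-1] += token[2:]            # continuation of the previous word
            grouped[-1].append(attribution)
        else:
            words.append(token)
            grouped.append([attribution])
    return words, [sum(group) / len(group) for group in grouped]

# e.g. merge_subwords(["ai", "##lm", "##ents"], [0.25, 0.5, 0.75]) -> (["ailments"], [0.5])
```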
E.2 Annotator instructions
Figure 3 and Figure 4 show examples of the landing page and a highlighted text for the Wiki and IMDB datasets, respectively.
E.3 Payment details
We ran six user surveys in total: three conditions (control, token highlighted, sentence highlighted) for each of the IMDB and Wiki datasets. Users were randomly assigned to each condition.
Table 9: [SHAP indirect computation for sentence-based interpretations] Median difference between the overlap metric (§4.2) of sentences and tokens when computing the feature attributions repeatedly 10 times. A positive value means that sentence-based attributions are more stable.
(a) Trained model (Init#1)
BERT RoB dBERT dRoB
IMDB -5 0 -2 -3
Medical -2 -1 13 13
Wiki -17 -6 -11 8
(b) Untrained model (Untrained)
BERT RoB dBERT dRoB
IMDB 2 -1 2 -1
Medical -2 -3 -4 -4
Wiki -10 -7 -11 0
Table 10: Mean infidelity of different interpretability methods at token-level. Lower values are better.
IMDB Medical Wiki
BERT RoB dBERT dRoB BERT RoB dBERT dRoB BERT RoB dBERT dRoB
SHAP 15 28 22 18 25 23 22 17 38 39 37 31
IG 39 44 45 47 26 22 25 18 40 43 38 46
Each survey consisted of 50 individual questions (one for each input text). In the end, for the IMDB data, we gathered 750, 750 and 650 annotations for the control, sentence and token conditions, respectively. For Wiki, the numbers were 600, 750 and 600.
To be eligible for payment, annotators had to complete all 50 tasks. For both surveys, we calibrated payments for an hourly wage of $11. As payments were made for completed surveys of 50 tasks, we estimated single-task completion times of about 30 seconds and 60 seconds for the IMDB and Wiki datasets, respectively. Users received $6 for completing the IMDB survey and $10 for the Wikipedia survey. In hindsight, users completed the tasks faster than we estimated, so the effective hourly wage exceeded the initially targeted $11.
E.4 Implementation of Information Transfer Rate (ITR)
Following from §4.3, let $y \in \{0,1\}^N$ denote the vector of true labels of the texts that a human annotator was asked to annotate, and let $y_h \in \{0,1\}^N$ denote the vector of annotations provided by the human, where $N$ is the number of questions in the survey. Then the ITR is computed as [51]: $\text{ITR} = I(y_h, y)/t$, where $t$ is the average response time for each question and $I$ is the mutual information.
For implementing $I$, we use the mutual_info_score function from scikit-learn.6 Specifically,
$$I(y_h, y) = \sum_{i \in \{0,1\}} \sum_{j \in \{0,1\}} \frac{|y_h^{(i)} \cap y^{(j)}|}{N} \log \frac{N \, |y_h^{(i)} \cap y^{(j)}|}{|y_h^{(i)}| \, |y^{(j)}|},$$
where $y_h^{(i)}$ denotes the set of indices of $y_h$ with value $i$ (and analogously for $y^{(j)}$).
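A minimal sketch of this computation (with illustrative function and variable names, and response times in seconds) could look as follows; mutual_info_score returns the mutual information in nats.

```python
# ITR = I(y_h, y) / t, with I the mutual information and t the mean response time.
import numpy as np
from sklearn.metrics import mutual_info_score

def information_transfer_rate(y_true, y_human, response_times):
    mi = mutual_info_score(y_true, y_human)   # mutual information between label vectors
    t = float(np.mean(response_times))        # average response time per question
    return mi / t

# e.g. with dummy annotations and per-question response times in seconds
itr = information_transfer_rate([0, 1, 1, 0, 1], [0, 1, 0, 0, 1], [25.0, 31.5, 28.0, 40.2, 22.1])
```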
F (Mis)alignment between human and computational metrics.
Here, we compare the human annotation performance with a closely related measure, infidelity. The infidelity measure is often used to automatically (that is, without human assistance) measure how important the top-ranked features (according to an attribution method) are to the model output [5, 7, 21, 38, 50, 64]. The metric is computed as follows: given the feature attributions (e.g., token- or sentence-based), remove the features from the input iteratively in order of decreasing importance until the model prediction changes. Infidelity is then defined as the percentage of text that needs to be dropped for the prediction to change. Thus, while ITR describes how well the feature attribution ranking aligns with human perception, infidelity describes how well the attributions align with the model itself.
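As an illustration, a minimal sketch of this deletion-based computation for token-level attributions could look as follows; the predict function and the use of a mask token to ‘remove’ features are assumptions, not a prescribed implementation. For sentence-level attributions, the same loop would drop whole sentences while still counting the dropped text in tokens.

```python
# Deletion-based infidelity: percentage of tokens dropped, in decreasing
# attribution order, before the predicted class changes.
import numpy as np

def infidelity(tokens, attributions, predict, mask_token="[MASK]"):
    original_label = predict(tokens)
    order = np.argsort(attributions)[::-1]    # most important features first
    masked = list(tokens)
    for num_dropped, idx in enumerate(order, start=1):
        masked[idx] = mask_token              # "remove" the next most important token
        if predict(masked) != original_label:
            return 100.0 * num_dropped / len(tokens)
    return 100.0                              # prediction never changed
```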
The infidelity metric is used in a number of domains (tabular, image, text) and appears in many closely related variations, all aiming to measure the change in the model prediction upon dropping important features [5, 7, 21, 38, 50, 64].
6 https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html
Table 11: Mean infidelity of different interpretability methods at sentence-level. Suffix “-i” denotes the indirect computation whereas “-d” denotes the direct computation (§4.1). Lower values are better.
IMDB Medical Wiki
BERT RoB dBERT dRoB BERT RoB dBERT dRoB BERT RoB dBERT dRoB
SHAP-d 61 63 65 63 55 59 58 59 57 50 58 52
SHAP-i 68 72 70 71 62 64 63 63 69 67 72 65
IG-i 78 79 76 78 60 65 62 67 67 67 74 70
We use the text-classification variant used in [5, 64]. We count the percentage of text that needs to be dropped in terms of tokens in the input (for both token-based as well as sentence-based interpretations).
Figure 2: Highlighting 10% of the text with (a) token-based and (b) sentence-based interpretations.
Results. Table 10 shows the infidelity of the token-based interpretations for the trained model (Init#1). Table 11 shows the infidelity for the sentence-based interpretations.
We notice that for the IMDB data (e.g., SHAP with the BERT encoder), the infidelity of the token-based interpretations is 15% whereas that of the sentence-based interpretations is 61% (lower infidelity is better, since fewer important features need to be dropped for the prediction to switch). The corresponding token- and sentence-based infidelity values for the Wiki data are 38% and 57%, respectively. In other words, quite surprisingly, while the sentence-based interpretations lead to better human annotation performance, they do not necessarily perform better according to a related computational metric used by several studies. We see similar trends for the other datasets and models.
We show one possible reason for this via the example in Figure 2. The figure shows the same instance with the top-10% features highlighted using both token- and sentence-based interpretations. When highlighting sentence-based interpretations, we truncate a sentence if it overflows beyond 10% of the tokens in the text (Appendix E.1). While the sentence-based highlights are naturally quite compact, the token-based interpretations show that important tokens are spread all over the document. Such redundancy, with important tokens distributed across many sentences, could require nearly all of these sentences to be dropped for the model prediction to change.
Figure 3: Examples of (a) the landing page and (b) a highlighted text from the Wikipedia human surveys.
Figure 4: Examples of (a) the landing page and (b) a highlighted text from the IMDB human surveys.