Faithful to Whom? Questioning Interpretability Measures in NLP
Evan Crothers,1 Herna Viktor,1 Nathalie Japkowicz2
1University of Ottawa, Ottawa, Ontario, Canada
2American University, Washington D.C., U.S.A.
ecrot027@uottawa.ca, hviktor@uottawa.ca, japkowic@american.edu
Abstract
A common approach to quantifying model interpretability is
to calculate faithfulness metrics based on iteratively mask-
ing input tokens and measuring how much the predicted la-
bel changes as a result. However, we show that such metrics
are generally not suitable for comparing the interpretability
of different neural text classifiers as the response to masked
inputs is highly model-specific. We demonstrate that iterative
masking can produce large variation in faithfulness scores be-
tween comparable models, and show that masked samples
are frequently outside the distribution seen during training.
We further investigate the impact of adversarial attacks and
adversarial training on faithfulness scores, and demonstrate
the relevance of faithfulness measures for analyzing feature
salience in text adversarial attacks. Our findings provide new
insights into the limitations of current faithfulness metrics
and key considerations to utilize them appropriately.
1 Introduction
Transformer models have become ubiquitous in natural lan-
guage processing (NLP), achieving state-of-the-art perfor-
mance across a variety of domains (Vaswani et al. 2017). De-
spite this success, there remains widespread concern about
the lack of transparency and explainability of neural lan-
guage models (Lipton 2018), which hinders understanding
of what they learn and limits their use in high-stakes appli-
cations requiring human oversight or explanation.
Feature attributions are one method used to improve the
interpretability of neural text classifiers, producing scores
that indicate how much each token contributes to a given
prediction (Lundberg and Lee 2017; Sundararajan, Taly, and
Yan 2017). These scores allow a reviewer to identify impor-
tant input tokens and interpret model responses.
A challenge arises in quantifying the quality of these
interpretations. Previous work has defined the property of
“faithfulness” as how well an explanation reflects the ob-
served behaviour of a model, separate from human-centric
considerations such as an explanation’s plausibility (Jacovi
and Goldberg 2020). Measurement of faithfulness relies
upon automated measures designed to quantify to what ex-
tent an explanation accurately reflects how a model behaves.
For neural text classifiers, these measures are overwhelm-
ingly calculated by means of iteratively removing salient
features and measuring model responses (Nguyen 2018;
DeYoung et al. 2020; Atanasova et al. 2020; Zafar et al.
2021).
Iteratively masking features to measure faithfulness al-
lows for comparison of feature explanations based on
whether the explanation correctly ranks the tokens that cause
the largest impact on the output (Sundararajan, Taly, and Yan
2017). However, these measures have also been used as a
means of comparing different models. In such an approach,
the same feature explanation method is applied to two mod-
els, a masking-based faithfulness measure is calculated, and
the model that produces higher average scores is theorized
to be more interpretable (Yoo and Qi 2021).
Our results indicate that using iterative masking to com-
pare interpretability of neural text classifiers is not princi-
pled in its current form. We show that iterative masking
results in model-specific behaviours that undermine cross-
model comparison, and we demonstrate the out-of-domain
nature of iteratively masked samples by analyzing the im-
pact of masking on intermediate representations. Finally, we
analyze the faithfulness impact of training on adversarial
samples, and show that previously observed behaviours from
small-scale experiments are not consistently observed in our
larger scale study.
The remainder of this work is organized as follows. Sec-
tion 2 describes related work on evaluating faithfulness in
neural text classification. Section 3 describes the datasets,
models, and experiment settings used. Section 4 character-
izes the specific behaviours that undermine the robustness
of faithfulness measures for cross-model comparison. Sec-
tion 5 shows that removing words from samples produces
out-of-manifold inputs. Section 6 explores how robustness
to iterative masking relates to adversarial attacks. Section 7
summarizes the findings and presents our conclusions.
2 Related Work
Feature-based Interpretability Methods
Feature-based interpretability methods for deep learning
models, such as LIME (Ribeiro, Singh, and Guestrin 2016),
SHAP (Lundberg and Lee 2017), and integrated gradients
(Sundararajan, Taly, and Yan 2017) assign an importance
score to input features to determine their contribution to a
particular network result. Evaluation of these interpretability
methods has shown that gradient-based approaches demon-
strate the best agreement with human assessment for Trans-
former models, as well as best correlating with tokens which
cause the greatest drop in performance if they are removed
from the model (Atanasova et al. 2020).
Based on these findings, we use integrated gradients to
generate feature attributions in our work. Integrated gradi-
ents is a strong axiomatic method for calculating how input
features contribute to the output of a model (Sundararajan,
Taly, and Yan 2017). This method interpolates between a
baseline input representation $x'$ (in this case a zero embedding vector) and the embedding vectors $x$ of each token.
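As a rough illustration of this interpolation, the sketch below implements the Riemann-sum approximation of integrated gradients directly; the `grad_fn` callable and the array shapes are illustrative placeholders rather than the layer-integrated-gradients implementation used in our experiments.

```python
import numpy as np

def integrated_gradients(embeddings, grad_fn, baseline=None, steps=30):
    """Riemann-sum approximation of integrated gradients for one input.

    embeddings: (n_tokens, dim) array of input token embeddings x
    grad_fn:    callable returning d(class score)/d(embeddings) at a point,
                with the same shape as its input (a placeholder for autograd)
    baseline:   (n_tokens, dim) baseline x' (a zero embedding here)
    """
    if baseline is None:
        baseline = np.zeros_like(embeddings)
    total = np.zeros_like(embeddings)
    for i in range(1, steps + 1):
        # interpolate between the baseline and the actual embeddings
        point = baseline + (i / steps) * (embeddings - baseline)
        total += grad_fn(point)
    avg_grad = total / steps
    attributions = (embeddings - baseline) * avg_grad  # (x - x') * average gradient
    return attributions.sum(axis=-1)                   # one attribution score per token
```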
The robustness of feature-based interpretability methods
for neural text classifiers has been questioned (Zafar et al.
2021). Specifically, it has been demonstrated that 1) two
functionally near-equivalent models with differing weight
initializations may produce different explanations, and 2)
feature attributions of a model with random parameters may
be the same as for a model with learned parameters.
In our work, we demonstrate the mechanisms by which
breakdowns in faithfulness measures occur, and demonstrate
that while measures based on iterative masking may be use-
ful for characterizing model responses in certain situations,
such as under adversarial attacks, they are not generally ap-
propriate for cross-model comparison of interpretability.
Faithfulness Measures
Recall that faithfulness measures are a key method for evaluating the quality of interpretable explanations. Faithful-
ness measures are commonly calculated by iteratively hiding
features in descending order of feature importance and de-
termining a score based on changes in model output. These
scores may be calculated by removing features until the out-
put classification changes (Nguyen 2018; Zafar et al. 2021)
or by removing a preset number of salient tokens and com-
paring the change in class probabilities (DeYoung et al.
2020; Atanasova et al. 2020).
Fidelity Calculation Our approach to quantifying faith-
fulness closely aligns with (Arras et al. 2016) and (Zafar
et al. 2021), in which tokens are masked from the input in
descending order of feature attribution scores until the result
of the model changes. We then record the % of tokens that
were removed at the point that the model’s output changes.
Formally, given an input text split into $n$ tokens, $T = [t_1, \ldots, t_n]$, a vector of feature explanations $\Phi(T) = [\phi(t_1), \ldots, \phi(t_n)]$, a model $m$, and the model's unknown vocabulary token [UNK], we first calculate $m(T) = y_0$. We then define an iterative scoring function $f(T)$ that at each step performs the replacement $T[\arg\max \Phi(t)] \leftarrow \text{[UNK]}$, yielding $T'$, and calculates $m(T') = y'$. We iterate $C$ times until $y' \neq y_0$, and return the ratio $f(T) = C/n$. $f(T)$ is calculated for all $K$ input texts, and we calculate the fidelity score for the model as:

$$\mathrm{Fidelity}(m) = 1 - \frac{1}{K} \sum_{k=1}^{K} f(T_k) \qquad (1)$$
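A minimal sketch of this procedure is given below; the `predict` callable (mapping a token list to a predicted class label) and the precomputed per-token attribution scores are placeholders for whichever model and attribution method is in use.

```python
import numpy as np

def fidelity_per_sample(tokens, attributions, predict, unk_token="[UNK]"):
    """f(T): mask tokens in descending attribution order until the predicted
    class changes, and return C/n, the fraction of tokens masked."""
    masked = list(tokens)
    y0 = predict(masked)                    # original predicted class
    order = np.argsort(attributions)[::-1]  # most important tokens first
    for c, idx in enumerate(order, start=1):
        masked[idx] = unk_token
        if predict(masked) != y0:
            return c / len(tokens)
    return 1.0  # prediction never changed: maximum penalty

def fidelity(samples, predict):
    """Fidelity(m) = 1 - (1/K) * sum_k f(T_k) over all K samples (Equation 1)."""
    return 1.0 - float(np.mean([fidelity_per_sample(t, a, predict)
                                for t, a in samples]))
```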
We rely on fidelity due to its simplicity and the lack of
a priori parameters. For completeness, we now briefly de-
scribe the calculation of measures that rely on masking set
numbers of tokens so that we can refer to them in explaining
theoretical pitfalls in cross-model comparison.
Area over the perturbation curve A common faithful-
ness measure that relies on masking a preset number of
tokens is the “area over the perturbation curve” (AOPC)
(Samek et al. 2017; Nguyen 2018). AOPC involves first
creating an ordered ranking of input tokens by importance
$x_1, x_2, \ldots, x_n$ in each $n$-token input sequence. AOPC is then calculated by:

$$\mathrm{AOPC} = \frac{1}{L+1} \left\langle \sum_{k=1}^{L} f(x) - f(x_{1..k}) \right\rangle_{p(x)} \qquad (2)$$

where $L$ is the a priori number of tokens to mask, $f(x_{1..k})$ is the output probability for the original predicted class when tokens $1..k$ are removed, and $\langle\cdot\rangle_{p(x)}$ denotes the average over all sequences in the dataset. We refer to this calculation when discussing theoretical weaknesses of cross-model faithfulness comparison based on iterative masking.
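Expressed as a short sketch (with hypothetical helper names, and the class probabilities on masked sequences assumed to be precomputed):

```python
import numpy as np

def aopc_single(p_original, p_after_masking):
    """AOPC for one sequence (Equation 2).

    p_original:       f(x), probability of the predicted class on the full input
    p_after_masking:  [f(x_{1..1}), ..., f(x_{1..L})], the same probability after
                      masking the top-1, top-2, ..., top-L ranked tokens
    """
    L = len(p_after_masking)
    return sum(p_original - p for p in p_after_masking) / (L + 1)

def aopc(per_sequence_scores):
    # <.>_{p(x)}: average the per-sequence values over all sequences in the dataset
    return float(np.mean(per_sequence_scores))
```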
Limitations of Faithfulness Measures While faithfulness
measures can be used to compare different methods of gen-
erating explanations (Nguyen 2018; Atanasova et al. 2020),
recent research suggests that such measures may not be suit-
able for cross-model comparison of interpretability.
Specifically, recent work demonstrates that untrained
models produce fidelity scores well above random masking,
and the fidelity of the same interpretability method applied
across different encoders can vary substantially (Zafar et al.
2021). This previous analysis left investigation of the root
cause of these phenomena as future work. We shed light on
this root cause by demonstrating that the iterative masking
process produces samples outside the manifold of the train-
ing data, resulting in differing model-specific behaviour.
Producing reliable explanations in the text domain has
been highlighted as difficult due to aberrant behaviour on
iteratively masked samples (Feng et al. 2018). By remov-
ing the least important word from a sequence iteratively,
the resulting condensed explanation is no longer meaning-
ful to human observers, impacting human-assessed plausi-
bility. Our work complements this, highlighting how itera-
tive masking affects representations within neural text clas-
sifiers and impacts automated assessment of faithfulness.
Adversarial Training and Faithfulness
Adversarial attacks, inputs perturbed to cause a model to produce an erroneous classification, can be applied in
the text domain (Jin et al. 2020). A common word-based ap-
proach is to replace words with synonyms, typically by us-
ing a synonym dictionary, or by leveraging another language
model to find nearby words with nearby embeddings (Alzan-
tot et al. 2018; Shi et al. 2019). The attacks DeepWordBug
(Gao et al. 2018) and HotFlip (Ebrahimi et al. 2018) in-
troduce targeted character-level perturbations to cause erro-
neous classifications.
It has been shown that input feature attributions can be
used as an effective feature for detection of adversarial at-
tacks in the text domain (Huber et al. 2022). Previous work
has investigated the impact of adversarial training (training on adversarial samples) on interpretability measures,
and found that adversarial training appears to increase the
faithfulness measures of models (Yoo and Qi 2021) and
alignment with ground truth explanations (Sadria, Layton,
and Bader 2023). In the process of comparing faithfulness
of neural text classifiers after adversarial training, one previ-
ous work found that RoBERTa models scored significantly
lower than BERT models, and theorized that RoBERTa may
be less interpretable than BERT (Yoo and Qi 2021).
We demonstrate in our research that increased faithfulness
measures are not universal in the presence of adversarial at-
tacks. Further, our overall findings suggest that previously
observed differences between BERT and RoBERTa faithful-
ness scores are not a meaningful reflection of each model's
inherent interpretability, but rather a result of model-specific
behaviours in the presence of iterative masking.
3 Datasets and Experimental Setup
To ensure a reproducible set of task-specific models and
associated data samples, we use the text classification at-
tack benchmark (TCAB) dataset and models (Asthana et al.
2022). This benchmark includes: 1) a number of established
NLP task datasets; 2) BERT and RoBERTa models trained
for each task; and 3) a variety of successful adversarial at-
tacks against the included models.
To obtain a range of data domains and sequence lengths,
we perform our experiments on the following task datasets
in TCAB: 1) the Stanford Sentiment Treebank (SST-2)
dataset of movie reviews for sentiment classification, a com-
mon NLP benchmark task (Socher et al. 2013); 2) the Twit-
ter climate change sentiment dataset, a multi-class dataset
of social media data (Qian 2019); 3) Wikipedia Toxic Com-
ments, a dataset with a longer sequence length widely-used
for studying adversarial attacks (Dixon et al. 2018); and 4)
the Civil Comments dataset, a dataset of online comments
for evaluating unintended bias in toxicity detection, an area
where interpretability may be important (Jigsaw 2019).
The Transformer language models included in this bench-
mark reflect common architectures for contemporary neural
text classifiers. We consider both a baseline bi-directional
Transformer architecture BERT (Devlin et al. 2019), as well
as the commonly used robustly-optimized variant of this ar-
chitecture, RoBERTa (Liu et al. 2019).
In Section 6, we select a variety of adversarial attacks
to analyze, including well-known word-level and character-
level attacks to compare any differences in the impacts of
adversarial training on fidelity. Specifically, we use Deep-
WordBug (Gao et al. 2018), TextFooler (Jin et al. 2020),
Genetic (Alzantot et al. 2018), and HotFlip (Ebrahimi et al.
2018) attacks.
As calculating input attributions is computationally ex-
pensive, fidelity calculations in Table 4 are based on a sam-
ple of 1,600 total records evenly split across dataset-model
combinations. Initial experiments indicated this was suffi-
cient to observe consistent patterns in fidelity scores, while
freeing computational resources to explore multiple com-
binations. We use N = 30 for the number of steps for
layer integrated gradient calculation, as this approximates
the largest value that fits within the memory constraints of
the system, and higher step counts typically produce more
accurate explanations (Sundararajan, Taly, and Yan 2017).
To evaluate models in the presence of class imbalance, we
use the macro F1 score.
All experiments were performed on a Windows worksta-
tion with an Intel i7-6800K 12-core CPU, 64GB RAM, and
a 24GB VRAM NVIDIA GPU. Seeds starting at 0 are used
throughout the experiments for reproducibility. Complete
experiment code, along with references to utilized models
and data are provided to ensure reproducible results.
4 Pitfalls in Cross-Model Comparison
Faithfulness measures based on iterative masking produce
scores that are sensitive to model initialization, with identi-
cal models of the same architecture and comparable dataset
performance producing dramatically differing scores (Zafar
et al. 2021). We explore the mechanisms that result in these
pitfalls by first explaining the theoretical deficiencies in the
iterative masking approach, illustrated with an example, and
then demonstrate the variation in behaviour from 8 mod-
els trained across 4 task datasets. In doing so, we explore
the mechanisms that enable faithfulness metrics to exhibit
substantial variation between comparable models, and their
high sensitivity to characteristics of the training dataset.
Figure 1 demonstrates an excerpt from a positive movie
review, which, when iteratively masked, at no point causes
the sentiment classifier to determine that the movie review
is negative. Integrated gradients feature attributions indi-
cate which input tokens were most important to the out-
put classification, ranking the tokens “beautiful”, “images”,
and “solemn” as most important to the model’s output. After
masking these three words, the model’s class confidence is
at a minimum, but the output class is unchanged.
Neg Pos Sample under iterative masking; iterative deletion
0.09 99.1 The beautiful images and solemn words
0.24 97.6 The [UNK] images and solemn words
3.6 96.4 The [UNK] [UNK] and solemn words
28.3 71.7 The [UNK] [UNK] and [UNK] words
14.0 86.0 [UNK] [UNK] [UNK] and [UNK] words
5.2 94.8 [UNK] [UNK] [UNK] [UNK] [UNK] words
1.4 98.6 The images and solemn words
3.6 96.4 The and solemn words
12.0 88.0 The and words
5.8 94.2 and words
5.5 94.6 words
Figure 1: Iterative token removal in descending order of
feature importance on a sample from SST-2, a dataset of
phrases from movie reviews paired with review sentiment.
Despite the explanation identifying the most important to-
kens, the classification is unchanged during either iterative
masking or iterative deletion.
The point at which a classifier changes its predicted class
(if at all) may vary substantially between models. Similarly,
while removing the top Ktokens from the input, samples be-
come increasingly perturbed, and output probabilities may
skew back towards the original class, as is the case in Fig-
ure 1. Area-based faithfulness measures such as AOPC are
then similarly impacted as output probabilities on iteratively
masked samples are not consistent across models, leading to
variation in the $f(x) - f(x_{1..k})$ term (see Equation 2).
Under faithfulness measures that mask tokens until a
change in predicted class, samples that do not cause a change
in predicted class during masking incur maximum penalty to
the faithfulness score, regardless of the quality of the ex-
planation. For methods that mask an a priori set number
of tokens, model-specific output calibration similarly under-
mines comparison between models (Guo et al. 2017). Out-
put logits on out-of-domain samples that include small num-
bers of tokens or empty strings are difficult to predict, and
even if evaluating using a simple balanced binary classifica-
tion dataset, there is no guarantee that this behavior will be
consistent or symmetric when masking explanations for one
predicted class versus the other.
Our illustrative example, in which the classification of a
sample does not change during iterative masking, reflects a
common situation. Table 1 shows the performance of 8 clas-
sifiers on clean (without adversarial perturbation) samples
from the TCAB dataset. These models largely have compa-
rable performance, with an edge to RoBERTa on the climate-
change dataset. However, substantial variation is observed
across fidelity measures, highlighting the impact of both the
model and the task dataset.
To understand why discrepancies arise, we consider the
failure case that we demonstrated in Figure 1: situations where the predicted class never changes. Table 2 shows the fre-
quency of samples where iterative masking did not lead to a
change in classification result at any point.
                    SST-2   WikiToxic  Civil Com.  Clim. Cha.
F1 (BERT)           0.910   0.903      0.836       0.650
F1 (RoBERTa)        0.920   0.907      0.830       0.746
Fidelity (BERT)     0.491   0.304      0.103       0.678
Fidelity (RoBERTa)  0.455   0.118      0.050       0.722
Table 1: F1 scores (macro) of task-specific TCAB BERT and
RoBERTa models on unperturbed validation set, and corre-
sponding fidelity scores.
SST-2 WikiToxic Civil Com. Clim. Cha.
BERT 0.350 0.090 0.880 0.055
RoBERTa 0.355 0.860 0.940 0.140
Table 2: Frequency of samples which did not result in
change of predicted class at any point during masking of
tokens based on feature importance.
The largest gap between models in both fidelity and fre-
quency of samples where the predicted class was unchanged
was observed on the Wikipedia Toxic Comments dataset.
The TCAB BERT and RoBERTa models trained on differ-
ent datasets have comparable performance on the original
classification problem, shown in Table 1, yet the fidelity
scores differ substantially. Table 2 provides deeper insight:
the RoBERTa-wikipedia model did not change its result at
any point during iterative masking on 86% of samples.
Fundamentally, the assumption that removing salient to-
kens should cause the output of a model to change is not
intuitive for all datasets. The Wikipedia Toxic Comments
dataset is an illustrative example of this. The class distri-
bution of TCAB validation set for the two-class wikipedia
dataset is 90.84% negative samples, i.e., mostly comments
that are not considered toxic. Removing salient tokens one
by one from an inoffensive comment is highly unlikely to
create a true toxic sample, no matter how many tokens are
removed.
As a result, a low fidelity score on this dataset arguably
demonstrates a beneficial property of the RoBERTa model: resilience against adversarial attack. The attack in this case is the removal of targeted salient tokens with the goal of changing the predicted class with the minimum number of removed tokens (even though the removal does not affect the true class). Whether removing salient tokens is expected to affect the true class is entirely dependent on the task and dataset, undermining the idea that iterative masking should be framed consistently across models.
We conclude that it is not theoretically sound to assess
the interpretability of a model using a measure that pe-
nalizes robustness against perturbations, without consider-
ing the dataset. When taken together with previous research
that shows substantial variation based on model initializa-
tion (Zafar et al. 2021), it becomes apparent that while faith-
fulness measures are meaningful for evaluating the quality
of explanations on the same model, they are not appropri-
ate for cross-model comparisons of interpretability in neural
text classifiers.
5 Embeddings of Partially-Masked Samples
With the underlying mechanisms that undermine iterative
masking for model comparison established, we now show
that iteratively masked samples frequently fall outside the
training distribution of the dataset, particularly at the levels
of masking required for changes in predicted class. As sam-
ples from this distribution haven’t been seen during train-
ing, this likely leads to the inconsistent behaviour on masked
samples observed in the previous section.
Distributional Characteristics
Masking tokens causes the perturbed input samples to have
representations that are very different from ordinary real
samples in their training datasets. To demonstrate this, we
generate embeddings for each input sequence across all sam-
ples taken from each dataset by mean-pooling the output of
the second-last layer of a Transformer encoder, a common
approach that generally outperforms using the final-layer
[CLS] token embeddings (Xiao 2018; Devlin et al. 2019).
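A minimal sketch of this pooling step is shown below; the checkpoint name is a placeholder, and the attention mask is used so that padding tokens do not contribute to the mean.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint; in our experiments the encoders come from the
# fine-tuned TCAB BERT/RoBERTa task models.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pool the second-last encoder layer into one vector per input text."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden_states = encoder(**inputs, output_hidden_states=True).hidden_states
    second_last = hidden_states[-2]                  # (batch, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)    # exclude padding tokens
    return (second_last * mask).sum(dim=1) / mask.sum(dim=1)
```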
Using this embedding method, we create embedding vectors $V_d = \{v_1, v_2, \ldots, v_k\}$ using the BERT and RoBERTa models for every sample across each dataset $d$ within our sample. Each vector can be represented as $v_i = [x_{i1}, x_{i2}, \ldots, x_{in}]$. We then calculate centroids $\mu_d$ for each dataset:

$$\mu_d = \left[ \frac{\sum x_{i1}}{k}, \frac{\sum x_{i2}}{k}, \ldots, \frac{\sum x_{in}}{k} \right] \qquad (3)$$

where $\sum x_{ij}$ represents the sum of the $j$-th component of all vectors in the set $V_d$, and the model's encoder layer dimension $n = 768$ is the length of the embedding vector.

             SST-2         WikiToxic     Civ. Com.     Clim. Cha.
             µ      σ      µ      σ      µ      σ      µ      σ
BERT         0.15  -0.13   0.40  -0.04   0.30  -0.05   0.46  -0.06
RoBERTa      0.12  -0.16   0.15  -0.05   0.15  -0.08   0.34  -0.12

Table 3: Change in centroid µ and mean feature standard deviation σ at 50% of mean sequence length tokens masked.
Iteratively masking tokens from each sample, we obtain an
embedding vector $\omega$ at each step, and calculate cosine similarity as a scale-invariant vector similarity measure (Reimers and Gurevych 2019):

$$\mathrm{cos\_sim} = \frac{\sum_{i=1}^{n} \omega[i] \cdot \mu[i]}{\sqrt{\sum_{i=1}^{n} (\omega[i])^2} \cdot \sqrt{\sum_{i=1}^{n} (\mu[i])^2}} \qquad (4)$$
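In code, the centroid and per-step similarity calculations reduce to the following sketch (embedding arrays assumed to be precomputed, e.g., with the pooling function above):

```python
import numpy as np

def centroid(vectors):
    """Dataset centroid mu_d (Equation 3): component-wise mean of the embeddings."""
    return np.asarray(vectors).mean(axis=0)

def cosine_sim(omega, mu):
    """Cosine similarity between a (partially masked) sample embedding and mu (Equation 4)."""
    return float(np.dot(omega, mu) / (np.linalg.norm(omega) * np.linalg.norm(mu)))

def masking_trajectory(step_embeddings, mu):
    # similarity to the dataset centroid at each masking step, as tracked in Figure 2
    return [cosine_sim(omega, mu) for omega in step_embeddings]
```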
We then analyze the cosine similarity of centroids and the
mean standard deviation of embedding vectors as the num-
ber of masked tokens increases, presenting the results in Fig-
ure 2, and comparing changes between models in Table 3.
From Table 3 and Figure 2, we can see that in all cases
there is an increasing cosine distance between masked samples and the original dataset centroid as tokens are masked. As more tokens are replaced with [UNK], we see a smaller mean standard deviation across embedding features σ, implying that the increased presence of a single repeated token leads to more homogeneous representations overall. Importantly, we note sub-
stantial model-specific differences in the centroids of em-
beddings between unmasked and masked samples for BERT
and RoBERTa. On average, the internal representations of
Wikipedia Toxic Comments samples changed less during
masking for RoBERTa than for BERT, which may explain
the high frequency of samples that did not change predicted
class previously observed in Table 2.
Local and Global Structure
In addition to distributional characteristics, we can demon-
strate the difference in local and global structure between
unmasked and partially-masked sample embeddings us-
ing UMAP dimensionality reduction (McInnes and Healy
2018). We use the BERT and RoBERTa embeddings before
and after masking some number of tokens as the base for
these visualizations. We choose two illustrative datasets to
visualize against: SST-2 and Wikipedia Toxic Comments.
We select these two to show how the length of the input
sample impacts per-token sensitivity to masking. Our sam-
ple from the former dataset has an average sequence length of 12.25 tokens, while the latter has an average sequence length of 97.22 tokens. Results from our dimensionality reduction can be seen in Figure 3.

Figure 2: Comparison of centroid cosine similarity and mean standard deviation of embedding vectors between BERT and RoBERTa across various datasets. The left plot shows centroid cosine similarity, demonstrating the shift of data representations as tokens are masked. The right plot shows the mean standard deviation of the embeddings, showing representations of partially-masked inputs are less varied.
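The projection step itself can be sketched as follows; the umap-learn package with default neighbourhood parameters is assumed, which may differ from the exact settings used to produce Figure 3.

```python
import numpy as np
import umap  # umap-learn

def project_embeddings(clean_embeddings, masked_embeddings, seed=0):
    """Fit a single 2-D UMAP projection over clean and partially-masked sample
    embeddings so both sets share one coordinate space (as in Figure 3)."""
    combined = np.vstack([clean_embeddings, masked_embeddings])
    projected = umap.UMAP(n_components=2, random_state=seed).fit_transform(combined)
    n_clean = len(clean_embeddings)
    return projected[:n_clean], projected[n_clean:]
```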
We show that on datasets with short input samples such as
SST-2, even just masking two salient tokens creates repre-
sentations far outside the data manifold on which the model
was trained, implying undefined behaviour. The per-token
impact of masking is lesser on samples with longer average
sequence lengths, as shown by the Wikipedia Toxic Com-
ments visualizations, but these longer samples also require a
larger number of tokens to be masked to change the model
classification. For example, our results in Table 1 give a fi-
delity score of 0.304 for BERT on the Wikipedia dataset.
This indicates that on average it is required to mask 69.6%
of tokens from a Wikipedia sample before the output classi-
fication of the model changes. For RoBERTa, this proportion
is even larger a fidelity score of 0.118 on the dataset im-
plies that 88.2% of tokens must be masked on average to
perturb the classification result.
Based on the magnitude of the deviation, we infer that the
departure of masked samples from the data manifold of the
training set may be responsible for any prediction “crossover
point”, particularly for datasets where the true class is likely
to be unchanged by masking (such as when masking non-
toxic samples in Wikipedia Toxic Comments). Calculating
faithfulness metrics using this approach thus relies heavily
on undefined model-specific behaviour on degenerate sam-
ples from an unseen manifold, behaviour which is difficult to predict and may vary dramatically between different
models.
6 Fidelity Under Adversarial Attack
Adversarial training, the process of training a model on ad-
versarial inputs to improve robustness against attacks, is a
major area of research in NLP (Bai et al. 2021). Recall that
previous work has shown some small-scale experiments that
suggest that adversarial training appears to improve faithful-
ness measures in neural text classifiers (Yoo and Qi 2021).
Figure 3: UMAP projections of sample embeddings at varying levels of masking. Masking more tokens moves the resulting
embeddings further out of the domain of the original dataset. Masking a couple of tokens within a dataset with a longer average sequence length has a relatively minor effect (e.g., see the Wikipedia Toxic Comments examples), but longer samples still generally require a significant portion of tokens to be masked to change the classification (see Table 4).
We perform a large-scale analysis that suggests that differ-
ences in fidelity after adversarial training or between differ-
ent models do not follow easily discernible patterns, raising
questions as to whether this pattern holds generally for neu-
ral text classifiers, or only under certain conditions.
For adversarial training, we perform hyperparameter tun-
ing to determine appropriate combinations of learning rate
$lr \in [10^{-6}, 10^{-3}]$, weight decay $wd \in [10^{-5}, 10^{-2}]$, and training epochs $ep \in [0, 5]$. Tuning of hyperparameters is
required due to differences across task datasets and model
architectures. Evaluation for hyperparameter tuning was per-
formed by performing a 90-10 train-validation split of the
TCAB adversarial attack train dataset, with each training set
composed of half original task samples and half adversar-
ial samples. A batch size of 16 was set for efficient train-
ing within GPU memory limits. We present findings based
on the hyperparameter combination resulting in the best F1
score on the validation dataset.
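The tuning loop amounts to a small grid search, sketched below; the specific grid values and the `train_and_evaluate` callable (returning validation macro F1) are illustrative placeholders rather than the exact procedure used.

```python
from itertools import product

def tune(train_and_evaluate,
         learning_rates=(1e-6, 1e-5, 1e-4, 1e-3),   # illustrative grid within lr in [1e-6, 1e-3]
         weight_decays=(1e-5, 1e-4, 1e-3, 1e-2),    # illustrative grid within wd in [1e-5, 1e-2]
         epoch_options=(0, 1, 2, 3, 4, 5)):         # ep in [0, 5]
    """Select the configuration with the best validation macro F1."""
    best_f1, best_config = float("-inf"), None
    for lr, wd, ep in product(learning_rates, weight_decays, epoch_options):
        f1 = train_and_evaluate(lr=lr, weight_decay=wd, epochs=ep, batch_size=16)
        if f1 > best_f1:
            best_f1, best_config = f1, {"lr": lr, "weight_decay": wd, "epochs": ep}
    return best_config, best_f1
```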
In Table 4, we show fidelity calculations on 1) success-
ful adversarial attacks prior to adversarial training; 2) adver-
sarial samples after adversarial training; and 3) clean (non-
adversarial) samples after adversarial training. We also in-
clude fidelity scores on clean samples prior to adversarial
training, previously reported in Table 1, at the top of Table 4
for ease of comparison.
From the results in Table 4.1, we can see that fidelity
of explanations is generally higher on successful adversar-
ial samples compared to clean samples. Adversarial attacks
are typically optimized to minimize the number of pertur-
bations, while still altering model output. These constraints
lend themselves towards perturbed sequences where a small
portion of tokens have a significant influence on predic-
tions. Without an attack approach that also attacks model
interpretability methods (Ivankay et al. 2022), these highly
salient perturbed tokens are identified by the feature attribu-
tions, and masked early during fidelity calculation. As such,
the model output often returns to the original value more
quickly, leading to higher fidelity scores overall.
The only model for which adversarial attack samples did not universally result in increased fidelity scores appears to
be the RoBERTa climate-change model. In this case, the fi-
delity score on clean samples was already the highest of all
included models, indicating that this model already relied
on a relatively small number of salient tokens to make cor-
rect predictions. In this case, adversarial attacks may pro-
duce perturbations that interfere with these salient tokens,
resulting in attributions spread more evenly across the re-
maining tokens.
We further note from Table 4.1 that fidelity scores of
BERT models under successful adversarial attacks prior to
                 BERT                                      RoBERTa
                 SST-2   WikiToxic  Civ. Com.  Clim. Cha.  SST-2   WikiToxic  Civ. Com.  Clim. Cha.
Clean            0.491   0.304      0.103      0.678       0.455   0.118      0.050      0.722
4.1) Adv. Samples, Pre Adv. Training
DeepWordBug      0.690   0.703      0.188      0.818       0.680   0.423      0.135      0.552
TextFooler       0.879   0.741      0.341      0.806       0.774   0.533      0.192      0.731
Genetic          0.798   0.789      0.318      0.871       0.567   0.511      0.201      0.519
HotFlip          0.771   0.793      0.473      0.866       0.576   0.535      0.205      0.479
4.2) Adv. Samples, Post Adv. Training
DeepWordBug      0.597   0.362      0.089      0.797       0.652   0.483      0.253      0.794
TextFooler       0.775   0.475      0.807      0.708       0.742   0.520      0.201      0.772
Genetic          0.662   0.860      0.752      0.834       0.545   0.579      0.400      0.865
HotFlip          0.695   0.608      0.599      0.796       0.399   0.241      0.010      0.798
4.3) Clean Samples, Post Adv. Training
DeepWordBug      0.499   0.643      0.805      0.658       0.492   0.514      0.846      0.724
TextFooler       0.454   0.628      0.476      0.694       0.476   0.474      0.770      0.688
Genetic          0.464   0.412      0.417      0.663       0.420   0.280      0.603      0.723
HotFlip          0.479   0.624      0.321      0.667       0.477   0.613      0.869      0.697
Table 4: Fidelity scores of task-specific BERT and RoBERTa classifiers under varying adversarial attacks, and fidelity of 32
adversarially trained models for each dataset-model-attack combination on adversarial and non-adversarial (clean) samples.
adversarial training are generally higher than those calcu-
lated on RoBERTa. We view this result not as a proxy mea-
sure of interpretability, but instead as a measure of sensitiv-
ity to iterative masking, the behaviour which the metric
directly measures. As we are working with a dataset of suc-
cessful adversarial attack samples, the difference in fidelity
score indicates that BERT models return to the original class
after masking fewer salient tokens than RoBERTa models.
That is, successful attacks on the TCAB BERT models de-
pend on a comparatively smaller number of salient tokens. In
framing fidelity this way, we demonstrate how faithfulness
measures might be used to better understand how adversarial
attacks impact salient tokens in neural text classifiers, rather
than used as a proxy measure for explainability.
In many senses, iterative masking itself resembles a sim-
ple token-level adversarial attack. Instead of replacing a
salient token with an equivalent that causes a change in pre-
dicted class, salient tokens are removed or masked until a
change in predicted class occurs. As long as the dataset is
such that token removal does not influence the true class,
this meets the definition of an adversarial attack (Jin et al.
2020). Such an attack would noticeably degrade the origi-
nal sentence, though reduction in text quality has been ob-
served for other word-level and character-level attacks as
well (Crothers et al. 2022).
Taken together, Tables 4.2 and 4.3, which show fidelity
after adversarial training on adversarial samples and clean
samples respectively, demonstrate at scale that adversarial
training of neural text classifiers does not appear to have a
consistent impact on fidelity scores across different datasets.
The observed fidelity gaps can be very large, even for the
same encoder, such as the BERT results for the Civil Com-
ments dataset. From this, we conclude that training neu-
ral text classifiers on adversarial samples does not have
a straightforward relationship with sensitivity to iterative
masking, despite previous indications to the contrary.
7 Discussion and Conclusion
Overall, our findings emphasize a key consideration: com-
paring neural text classifiers using masking-based faithful-
ness measures as a proxy for interpretability is not generally
a principled approach, even if models have the same or sim-
ilar architectures. Measurements of faithfulness based on it-
erative masking are dependent on model-specific behaviour,
and partially masked samples are often well outside the data
manifold of the original training data. Model comparisons
based on responses to iterative masking should be thought
of in nuanced terms and performed carefully, taking into ac-
count the datasets used and interpreted through the lens of
robustness to iterative masking.
Based on our research, we have noted that successful text
adversarial attacks result in salient features that are highly
ranked by layer integrated gradients, and often result in
increased fidelity scores on successful adversarial attacks.
This aligns with previous research that has used similar fea-
tures for detecting adversarial attacks (Huber et al. 2022).
Beyond this observation, our findings indicate that adversar-
ial training of neural text classifiers does not have a consis-
tent impact on fidelity scores, suggesting that the relation-
ship between adversarial training and robustness to iterative
masking is less direct than previously thought.
Significant future work exists in the area of comparing
neural text classifiers based on interpretability. Within this
area, there clearly remains an open problem in comparing
faithfulness without relying on variable model-specific be-
haviour on out-of-domain samples. Perturbing input features
individually and measuring correlation between input fea-
ture rankings and changes in class confidence may be one
such approach, though this resembles the calculation of in-
tegrated gradients itself (Sundararajan, Taly, and Yan 2017).
Overall, it continues to remain difficult to disentangle data,
model, and explanation method. When working with mea-
sures based on iterative masking, it is distinctly important to
keep this in mind.
References
Alzantot, M.; Sharma, Y.; Elgohary, A.; Ho, B.-J.; Srivas-
tava, M.; and Chang, K.-W. 2018. Generating Natural Lan-
guage Adversarial Examples. In EMNLP 2018, 2890–2896.
Brussels, Belgium: ACL.
Arras, L.; Horn, F.; Montavon, G.; Müller, K.-R.; and
Samek, W. 2016. Explaining Predictions of Non-Linear
Classifiers in NLP. In Proceedings of the 1st Workshop
on Representation Learning for NLP, 1–7. Berlin, Germany:
Association for Computational Linguistics.
Asthana, K.; Xie, Z.; You, W.; Noack, A.; Brophy, J.;
Singh, S.; and Lowd, D. 2022. TCAB: A Large-Scale
Text Classification Attack Benchmark. arXiv preprint
arXiv:2210.12233.
Atanasova, P.; Simonsen, J. G.; Lioma, C.; and Augenstein,
I. 2020. A Diagnostic Study of Explainability Techniques
for Text Classification. In Proceedings of the 2020 Confer-
ence on Empirical Methods in Natural Language Process-
ing (EMNLP), 3256–3274. Online: Association for Compu-
tational Linguistics.
Bai, T.; Luo, J.; Zhao, J.; Wen, B.; and Wang, Q. 2021. Re-
cent advances in adversarial training for adversarial robust-
ness. arXiv preprint arXiv:2102.01356.
Crothers, E.; Japkowicz, N.; Viktor, H.; and Branco, P. 2022.
Adversarial robustness of neural-statistical features in detec-
tion of generative transformers. In 2022 International Joint
Conference on Neural Networks (IJCNN), 1–8. IEEE.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019.
BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technolo-
gies, Volume 1 (Long and Short Papers), 4171–4186. Min-
neapolis, Minnesota: Association for Computational Lin-
guistics.
DeYoung, J.; Jain, S.; Rajani, N. F.; Lehman, E.; Xiong, C.;
Socher, R.; and Wallace, B. C. 2020. ERASER: A Bench-
mark to Evaluate Rationalized NLP Models. In Proceed-
ings of the 58th Annual Meeting of the Association for Com-
putational Linguistics, 4443–4458. Online: Association for
Computational Linguistics.
Dixon, L.; Li, J.; Sorensen, J.; Thain, N.; and Vasserman,
L. 2018. Measuring and mitigating unintended bias in text
classification. In Proceedings of the 2018 AAAI/ACM Con-
ference on AI, Ethics, and Society, 67–73.
Ebrahimi, J.; Rao, A.; Lowd, D.; and Dou, D. 2018. Hot-
Flip: White-Box Adversarial Examples for Text Classifica-
tion. In ACL 2018 (Volume 2: Short Papers), 31–36. Mel-
bourne, Australia: ACL.
Feng, S.; Wallace, E.; Grissom II, A.; Iyyer, M.; Rodriguez,
P.; and Boyd-Graber, J. 2018. Pathologies of Neural Mod-
els Make Interpretations Difficult. In Proceedings of the
2018 Conference on Empirical Methods in Natural Lan-
guage Processing, 3719–3728. Brussels, Belgium: Associ-
ation for Computational Linguistics.
Gao, J.; Lanchantin, J.; Soffa, M. L.; and Qi, Y. 2018. Black-
box generation of adversarial text sequences to evade deep
learning classifiers. In 2018 IEEE Security and Privacy
Workshops (SPW), 50–56. IEEE.
Guo, C.; Pleiss, G.; Sun, Y.; and Weinberger, K. Q. 2017.
On calibration of modern neural networks. In International
conference on machine learning, 1321–1330. PMLR.
Huber, L.; Kühn, M. A.; Mosca, E.; and Groh, G. 2022. De-
tecting Word-Level Adversarial Text Attacks via SHapley
Additive exPlanations. In Proceedings of the 7th Workshop
on Representation Learning for NLP, 156–166.
Ivankay, A.; Girardi, I.; Marchiori, C.; and Frossard, P. 2022.
Fooling Explanations in Text Classifiers. arXiv preprint
arXiv:2206.03178.
Jacovi, A.; and Goldberg, Y. 2020. Towards faithfully inter-
pretable NLP systems: How should we define and evaluate
faithfulness? arXiv preprint arXiv:2004.03685.
Jigsaw. 2019. Jigsaw unintended bias in toxicity classifica-
tion.
Jin, D.; Jin, Z.; Zhou, J. T.; and Szolovits, P. 2020. Is BERT
Really Robust? A Strong Baseline for Natural Language At-
tack on Text Classification and Entailment. In Proceedings
of the AAAI conference on artificial intelligence, volume 34,
8018–8025.
Lipton, Z. C. 2018. The Mythos of Model Interpretability:
In Machine Learning, the Concept of Interpretability is Both
Important and Slippery. Queue, 16(3): 31–57.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.;
Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V.
2019. RoBERTa: A Robustly Optimized BERT Pretraining
Approach. ArXiv, abs/1907.11692.
Lundberg, S. M.; and Lee, S.-I. 2017. A Unified Approach
to Interpreting Model Predictions. In Proceedings of the 31st
International Conference on Neural Information Processing
Systems, NIPS’17, 4768–4777. Red Hook, NY, USA: Cur-
ran Associates Inc. ISBN 9781510860964.
McInnes, L.; and Healy, J. 2018. UMAP: Uniform Man-
ifold Approximation and Projection for Dimension Reduc-
tion. ArXiv, abs/1802.03426.
Nguyen, D. 2018. Comparing automatic and human eval-
uation of local explanations for text classification. In Pro-
ceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long Papers),
1069–1078.
Qian, E. 2019. Twitter climate change sentiment dataset.
Reimers, N.; and Gurevych, I. 2019. Sentence-BERT:
Sentence Embeddings using Siamese BERT-Networks. In
Proceedings of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the 9th Interna-
tional Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), 3982–3992.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. “Why should I trust you?” Explaining the predictions of any clas-
sifier. In Proceedings of the 22nd ACM SIGKDD interna-
tional conference on knowledge discovery and data mining,
1135–1144.
Sadria, M.; Layton, A.; and Bader, G. 2023. Adversarial
training improves model interpretability in single-cell RNA-
seq analysis. bioRxiv, 2023–05.
Samek, W.; Binder, A.; Montavon, G.; Lapuschkin, S.; and
Müller, K.-R. 2017. Evaluating the Visualization of What a
Deep Neural Network Has Learned. IEEE Transactions on
Neural Networks and Learning Systems, 28: 2660–2673.
Shi, Z.; Yao, T.; Xu, J.; and Huang, M. 2019. Robustness
to Modification with Shared Words in Paraphrase Identifica-
tion. arXiv preprint arXiv:1909.02560.
Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning,
C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive deep models
for semantic compositionality over a sentiment treebank. In
Proceedings of the 2013 conference on empirical methods in
natural language processing, 1631–1642.
Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic
attribution for deep networks. In International conference
on machine learning, 3319–3328. PMLR.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones,
L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. At-
tention is all you need. In NeurIPS, 5998–6008.
Xiao, H. 2018. bert-as-service. https://github.com/hanxiao/
bert-as-service.
Yoo, J. Y.; and Qi, Y. 2021. Towards Improving Adversarial
Training of NLP Models. In Findings of the Association for
Computational Linguistics: EMNLP 2021, 945–956.
Zafar, M. B.; Donini, M.; Slack, D.; Archambeau, C.; Das,
S.; and Kenthapadi, K. 2021. On the Lack of Robust In-
terpretability of Neural Text Classifiers. In Findings of the
Association for Computational Linguistics: ACL-IJCNLP
2021, 3730–3740. Online: Association for Computational
Linguistics.