Evaluation and Comparison of CNN Visual Explanations for Histopathology
Mara Graziani1,3,*, Thomas Lompech2,*, Henning Müller1,3, Vincent Andrearczyk1
1University of Applied Sciences Western Switzerland (HES-SO), Sierre, Switzerland
2INP-ENSEEIHT, 31000 Toulouse, France
3University of Geneva, Geneva, Switzerland
*Equal contribution
Abstract
Visualization methods for Convolutional Neural Net-
works (CNNs) are spreading within the medical com-
munity to obtain explainable AI (XAI). The sole quali-
tative assessment of the explanations is subject to a risk
of confirmation bias. This paper proposes a methodol-
ogy for the quantitative evaluation of common visual-
ization approaches for histopathology images, i.e. Class
Activation Mapping and Local Interpretable Model-
Agnostic Explanations. In our evaluation, we propose
to assess four main points, namely the alignment with
clinical factors, the agreement between XAI methods,
the consistency and repeatability of the explanations. To
do so, we compare the intersection over union of multi-
ple visualizations of the CNN attention with the seman-
tic annotation of functionally different nuclei types. The
experimental results do not show stronger attributions to
the multiple nuclei types than those of a randomly ini-
tialized CNN. The visualizations hardly agree on salient
areas and LIME outputs have particularly unstable re-
peatability and consistency. The qualitative evaluation
alone is thus not sufficient to establish the appropriate-
ness and reliability of the visualization tools. The code
is available on GitHub at bit.ly/2K48HKz.
In many medical imaging tasks, such as segmentation and
classification in magnetic resonance, computed tomography
or ultrasound images, the input comprises an image with a
clear region of interest (e.g. organ or tumor) that often comes
with ground-truth segmentation. In histopathology, Whole
Slide Images (WSIs) can reach gigapixel sizes, with some-
times only minuscule regions being decisive for the task.
Isolated tumor cells, for example, can determine crucial de-
cisions despite being single cells or small clusters of tumor
cells that occupy less than 0.008% of the entire image.
Understanding the decision-making process of Convolu-
tional Neural Networks (CNNs) is a key point in medi-
cal imaging, to ensure that clinically correct decisions are
taken. Among the explainability (XAI) methods proposed
in the literature, the post-training attribution to either high-
level concepts (Graziani, Andrearczyk, and Müller 2018;
Graziani et al. 2020; Kim et al. 2018) or input features
was proposed for medical applications (Palatnik de Sousa,
Maria Bernardes Rebuzzi Vellasco, and Costa da Silva 2019;
Ribeiro, Singh, and Guestrin 2016). Feature attribution, in
particular, highlights the most influential set of features in
the input space by generating saliency maps, also called
heatmaps (Selvaraju et al. 2017; Palatnik de Sousa, Maria
Bernardes Rebuzzi Vellasco, and Costa da Silva 2019; Chat-
topadhay et al. 2018; Ribeiro, Singh, and Guestrin 2016;
Zhou et al. 2016). One of the risks of accepting the plau-
sibility of the heatmaps only by visual assessment is that
of incurring the so-called confirmation bias. As the re-
search in cognitive psychology explains, we tend to attribute
greater confidence to a hypothesis, even if false, when expla-
nations are generated for it (Lombrozo 2006). For this rea-
son, the reliability and trustworthiness of visual explanations
should be thoroughly addressed before their incorporation
into healthcare pipelines, clarifying the advantages, limita-
tions and similarities of the methods. As Tonekaboni et al.
argue in (Tonekaboni et al. 2019), the specific needs of clinical
practice require the evaluation of the appropriateness of the
explanations, their alignment with clinical factors, their po-
tential of being translated into action and, finally, their con-
sistency over parameter shifts. Remarkable evaluations of
the consistency of saliency maps proposed adding constant
shifts into the data (Kindermans et al. 2019), comparing vi-
sualizations after cascading randomizations of the network
weights (Adebayo et al. 2018) and quantifying the similarity
of explanations under multiple conditions (Arun et al. 2020).
The instability of XAI visualization methods applied to nat-
ural images and chest X-rays emerged from these studies. If
the lesion contours are available, the appropriateness of the
explanations can be evaluated by localization metrics (Arun
et al. 2020). Evaluating visualization methods on the basis of
their localization performance as in (Arun et al. 2020), how-
ever, may easily fail in the context of histopathology images.
WSIs do not have a clear central subject in the foreground
but rather a structural disposition of many instances (e.g.
connective, adipose, or epithelial cells) at several scales,
as illustrated in Fig. 1.
In this work, we propose quantitative metrics that can
specifically evaluate visual explanations in the context of
histopathology images (at 40X magnification). To establish
whether the visualizations are appropriate for the domain,
we evaluate the Intersection over Union (IoU) between the
heatmaps and functionally different nuclei types, i.e. neo-
plastic, inflammatory, epithelial and connective nuclei. We
then compare the Structural SIMilarity index (SSIM) be-
tween multiple XAI methods in order to help users choose
the best visualization method and to reduce the number of
look-alike visualizations. The consistency of the
explanations is evaluated against small shifts in the hyper-
parameters of the explanation techniques, and against the
cascading randomization of the model parameters (i.e. CNN
weights). We finally assess, where needed, the repeatabil-
ity of the explanations for multiple initialization seeds. Our
analysis evaluates the commonly used activation maps and
linear surrogate models XAI methods, namely Class Activa-
tion Mapping (CAM), its gradient-based evolutions Grad-
CAM and Grad-CAM++, and Local Interpretable Model-
Agnostic Explanations (LIME). Results on the Camelyon
and PanNuke data collections show that the XAI visualiza-
tions in this paper do not explain the CNN decisions in terms
of the attention paid to tumorous nuclei. The visualizations,
moreover, disagree on the salient input regions, and appear
inconsistent for small shifts in the XAI hyper-parameters.
Methods
Breast Tissue Classification
Datasets We use a combination of three publicly available
datasets, namely Camelyon 16, Camelyon 17 (Litjens et al.
2018) and the breast subset of the PanNuke dataset (Gamper
et al. 2019)1.
The Camelyon collection includes more than a thou-
sand training WSIs of lymph node sections (899 for Came-
lyon 17 and 270 for Camelyon 16), with slide-level annota-
tions of metastasis type (negative, macro-metastases, micro-
metastases, isolated tumor cells) and a few manual segmen-
tations of tumor regions (only for 320 WSIs). The data were
collected at five medical centers with three scanner types and
present high heterogeneity and staining variability (Khan et
al. 2020).
The PanNuke dataset is a collection of 481 WSIs from
19 different tissue types with semi-automatic instance seg-
mentations of five different nuclei types, namely neoplas-
tic, inflammatory, connective, epithelial and dead nuclei. No
dead nuclei were segmented for the breast data used in our
work, as shown by the nuclei type statistics in (Gamper et
al. 2020). The benefit of combining images from several
datasets is twofold. It improves the model generalization to
unseen data and, at the same time, it provides semantic nu-
clei segmentations to evaluate the overlap of the visual ex-
planations with various nuclei types.
Image patches (of 224 ×224 pixels, i.e. the network input
size) are extracted from all WSIs at 40X magnification, the
highest level of magnification, to show the qualitative fea-
tures of the nuclei that are prognostic of cancer (Rakha et al.
2008). Since the inputs from PanNuke are already released
in image patches of size 256 ×256 pixels, and since they are
under-represented in number with respect to the Camelyon
dataset, we oversample the original image patches to a five
times larger dataset. The oversampling is obtained by cropping
the images at several locations (center, upper left, upper right,
bottom left and bottom right corners).

1 https://camelyon17.grand-challenge.org/
and https://jgamper.github.io/PanNukeDataset/

Table 1: Summary of the training, validation and testing
splits.

                      Cam 16   Cam 17   PanNuke
Train      Negative   12,954  107,951     2,915
           Positive    6,036   17,475     4,965
Validation Negative        -      820         -
           Positive        -    1,000         -
Test       Negative        -    1,215     1,475
           Positive        -    1,499     2,400

The training,
validation and testing splits are summarized in Table 1. For
the training split, the PanNuke patches were extracted from
the first two original splits provided for benchmarking the
data. The third split was used as testing data. Thus, the train-
ing set comprises 152,296 images (of which 123,820 do not
contain tumor, i.e. negative and 28,476 contain tumor, i.e.
positive). 1,820 images from two centers of the Camelyon17
data were used as validation. Finally, the test set comprises
a total of 6,589 images with 2,690 negative and 3,899 pos-
itive samples. Reinhard normalization is applied to all the
patches to reduce the stain variability, as suggested in (Khan
et al. 2020).
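As a minimal illustration of the oversampling step described above, the sketch below extracts the five 224 × 224 crops (center and four corners) from a 256 × 256 PanNuke patch. The function name and the exact cropping code are our own assumptions, not the authors' released implementation.

```python
import numpy as np

def five_crop(patch, size=224):
    # Center and four corner crops of a 256 x 256 patch (illustrative sketch).
    h, w = patch.shape[:2]
    offsets = [
        ((h - size) // 2, (w - size) // 2),  # center
        (0, 0),                              # upper left
        (0, w - size),                       # upper right
        (h - size, 0),                       # bottom left
        (h - size, w - size),                # bottom right
    ]
    return [patch[y:y + size, x:x + size] for y, x in offsets]

# Example: a 256 x 256 x 3 patch yields five 224 x 224 x 3 crops.
dummy = np.zeros((256, 256, 3), dtype=np.uint8)
assert all(c.shape == (224, 224, 3) for c in five_crop(dummy))
```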
Network Architecture and Training The CNN archi-
tecture is an Inception V3 (Szegedy et al. 2016) with Im-
ageNet pre-trained weights that is entirely finetuned on the
histopathology training images to solve the binary classifica-
tion task of distinguishing positive samples against negative
ones. This solution outperforms other architectures, and is
therefore used for the analyses. Three fully-connected layers
(2048, 512 and 256 neurons respectively) with dropout prob-
ability of 0.8 and a prediction layer are added on top of the
pre-trained features. The weighted binary cross-entropy loss
is used to address the strong class imbalance in the training
data. L2 regularization is used with a coefficient of 0.01 on
the fully-connected layers. The optimization is solved with
stochastic gradient descent and standard parameters (Nes-
terov momentum at 0.9 and decay at 1e-6). Early stopping is
performed on the validation loss to stop the training process,
with 5 epochs of patience. The network obtains a test accu-
racy of 0.80 with an Area Under the ROC Curve (AUC) of
0.85.
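A minimal Keras sketch of this classifier, under the hyper-parameters stated above, is given below. The learning rate and the class weights are placeholders (they are not reported here), and the paper's learning-rate decay of 1e-6 is only noted in a comment since its exact implementation is not specified.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# InceptionV3 backbone pre-trained on ImageNet, spatially averaged features.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(224, 224, 3))

x = base.output
for units in (2048, 512, 256):  # three fully-connected layers
    x = layers.Dense(units, activation="relu",
                     kernel_regularizer=regularizers.l2(0.01))(x)
    x = layers.Dropout(0.8)(x)
output = layers.Dense(1, activation="sigmoid")(x)  # tumor vs. non-tumor
model = tf.keras.Model(base.input, output)

# SGD with Nesterov momentum 0.9; the paper also uses a decay of 1e-6
# (the `decay` argument of older Keras optimizers). Learning rate is a placeholder.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01,
                                                momentum=0.9, nesterov=True),
              loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])

early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)
# The class imbalance is handled by weighting the loss, e.g.:
# model.fit(train_ds, validation_data=val_ds, callbacks=[early_stopping],
#           class_weight={0: 1.0, 1: n_negative / n_positive})
```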
Visualization Methods
Class Activation Maps Three types of activation maps
are used for the analysis, namely the original CAM imple-
mentation (Zhou et al. 2016), the gradient-weighted CAM
known as Grad-CAM (Selvaraju et al. 2017) and its gen-
eralized version Grad-CAM++ (Chattopadhay et al. 2018).
CAM produces a localization map by visualizing the con-
tribution of each feature map before these are spatially av-
eraged and linearly combined to produce the network pre-
diction. Grad-CAM generalizes CAM as it generates visual
explanations by directly taking into account the cascade of
gradients, thus applying to a wider variety of models and
applications, including image captioning and query answer-
ing. These two methods were shown to be equivalent up to a
normalization constant that is proportional to the number of
pixels in the feature maps (Selvaraju et al. 2017). As a fur-
ther development, Grad-CAM++ considers the gradients at
the pixel level rather than those of the entire feature maps2.
As a result, Grad-CAM++ explanations partially address the
shortcomings of considering the entire feature maps, such as
the difficulty of handling multiple occurrences of instances
of the same class in a single image.
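The sketch below shows one common way to compute a Grad-CAM heatmap for a Keras model such as the one trained above. It is a sketch only: the layer name "mixed10" and the input preprocessing are assumptions that must match the actual architecture.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name="mixed10", class_index=0):
    # Grad-CAM (Selvaraju et al. 2017): gradients of the class score with
    # respect to the last convolutional feature maps, globally averaged
    # into channel weights. "mixed10" is an assumed layer name for the
    # last mixed block of the Keras InceptionV3.
    grad_model = tf.keras.Model(
        model.input,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image[np.newaxis].astype("float32"))
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))            # GAP of the gradients
    cam = tf.reduce_sum(weights[:, None, None, :] * conv_maps, axis=-1)
    cam = tf.nn.relu(cam)[0].numpy()
    return cam / (cam.max() + 1e-8)                         # rescale to [0, 1]
```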
LIME Interpretable surrogates are used as linear classi-
fiers by LIME to locally simulate the decisions of the global
classifier. Using LIME to explain CNNs is similar to us-
ing a sparse linear model to approximate the complex de-
cision function of the CNN. The first step of the applica-
tion of LIME to images consists of clustering pixels into
superpixels (that will be used as features) using color, tex-
ture and other types of local similarities. Randomly hiding
some of the superpixels generates perturbations (called sam-
ples) of the original images which can be used to compute
the relevance of each superpixel. Following the indications
in (Palatnik de Sousa, Maria Bernardes Rebuzzi Vellasco,
and Costa da Silva 2019), we test two superpixel algorithms
namely Simple Linear Iterative Clustering (SLIC) (Achanta
et al. 2012) and Felzenszwalb’s graph based image segmen-
tation (FHA) (Felzenszwalb and Huttenlocher 2004).
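The sketch below shows how such explanations could be produced with the lime package, using the SLIC and Felzenszwalb segmenters. The segmentation parameters, the classifier wrapper and the way superpixel weights are spread into a heatmap are our assumptions, not the authors' exact configuration.

```python
import numpy as np
from lime import lime_image
from lime.wrappers.scikit_image import SegmentationAlgorithm

# Two superpixel generators; parameter values are illustrative only.
slic_fn = SegmentationAlgorithm("slic", n_segments=100, compactness=10)
fha_fn = SegmentationAlgorithm("felzenszwalb", scale=100, sigma=0.5, min_size=50)

def lime_heatmap(image, classifier_fn, segmentation_fn, num_samples=10000, seed=0):
    # classifier_fn maps a batch of images to per-class probabilities,
    # e.g. lambda b: np.hstack([1 - model.predict(b), model.predict(b)]).
    explainer = lime_image.LimeImageExplainer(random_state=seed)
    explanation = explainer.explain_instance(
        image, classifier_fn, top_labels=1,
        num_samples=num_samples, segmentation_fn=segmentation_fn)
    label = explanation.top_labels[0]
    # Spread each superpixel weight over its pixels to form a heatmap
    # (all superpixels are kept, as in the paper's "maximum number of features").
    heatmap = np.zeros(explanation.segments.shape, dtype=float)
    for superpixel, weight in explanation.local_exp[label]:
        heatmap[explanation.segments == superpixel] = weight
    return heatmap
```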
Evaluation Methods
Visual Similarity and Alignment with Clinical Factors
For the methods considered in this analysis, namely CAM,
Grad-CAM, Grad-CAM++, and LIME with SLIC and FHA
superpixels, we propose the qualitative evaluation of some
visualizations and two quantitative analyses. The quantita-
tive analyses measure respectively the accordance of meth-
ods (in terms of their visual similarity) and their align-
ment with clinical factors. In the first quantitative analy-
sis, the SSIM is used to establish whether different XAI
methods point to the same input regions to explain a given
prediction. The SSIM, ranging from 0 (no structural sim-
ilarity) to 1 (identical structural similarity), is computed
for pairs of XAI methods to evaluate their agreement. In
the second analysis, we follow the experiments in (Zhou
et al. 2018) to compute the overlap (IoU) of the explana-
tions with specific image regions, i.e. each of the four nu-
clei types. Particular attention is given to neoplastic nuclei
as they are indicators of tumor tissue (Gamper et al. 2019;
2020).
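A minimal sketch of the two quantitative measures is given below (function names are ours): SSIM between two heatmaps via scikit-image, and IoU between a thresholded heatmap and the binary mask of one nuclei type. The heatmaps are assumed to be normalized to [0, 1].

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def heatmap_agreement(heatmap_a, heatmap_b):
    # SSIM between two heatmaps normalized to [0, 1].
    return ssim(heatmap_a, heatmap_b, data_range=1.0)

def iou_with_nuclei(heatmap, nuclei_mask, threshold):
    # IoU between the thresholded CNN attention and a binary nuclei mask.
    attention = heatmap >= threshold
    intersection = np.logical_and(attention, nuclei_mask).sum()
    union = np.logical_or(attention, nuclei_mask).sum()
    return float(intersection) / union if union > 0 else 0.0
```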
Consistency and Repeatability The consistency of the
visualizations is evaluated over variations in the method
hyper-parameters. The number of superpixels and the size
of the neighborhood considered by the linear surrogate (the
2The Rectified Linear Unit (ReLU) activation function that is
applied to the gradients is neglected in the original formulation
(Chattopadhay et al. 2018). The backpropagation of the gradients
is therefore not thresholded by the activation function.
number of samples used to solve the local linear classifi-
cation task in LIME) are hyper-parameters that need to be
tuned to generate meaningful LIME explanations. By using
SSIM, we evaluate the similarity of the visualizations ob-
tained for slightly increasing values of the hyper-parameters,
until a visible plateau is reached (meaning that no further
change happens with an additional increase in the hyper-
parameter value). Based on the observations in (Madhyastha
and Jain 2019), we further evaluate the impact of the inher-
ent randomness within the LIME explanation. We compute
the repeatability as the SSIM between visualizations gener-
ated for multiple initialization seeds of the surrogate model.
The initialization seed controls both the random choice of
the local neighborhood samples and the starting point of the
optimization algorithm for the training of the local surro-
gate. CAM visualizations are excluded from this analysis as
they do not depend on the hyper-parameter choice.
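The repeatability measurement can be sketched as below, reusing the lime_heatmap and heatmap_agreement helpers from the previous sketches. Whether the SSIM is computed pairwise over all seed combinations or against one reference run is an assumption made here.

```python
import itertools
import numpy as np

def lime_repeatability(image, classifier_fn, segmentation_fn,
                       num_samples=1000, n_seeds=25):
    # LIME heatmaps for several initialization seeds, rescaled to [0, 1],
    # then pairwise SSIM as the repeatability score.
    def rescale(h):
        span = h.max() - h.min()
        return (h - h.min()) / span if span > 0 else np.zeros_like(h)
    heatmaps = [rescale(lime_heatmap(image, classifier_fn, segmentation_fn,
                                     num_samples=num_samples, seed=seed))
                for seed in range(n_seeds)]
    scores = [heatmap_agreement(a, b)
              for a, b in itertools.combinations(heatmaps, 2)]
    return float(np.mean(scores)), float(np.std(scores))
```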
Randomization Test The randomization test verifies the
dependence of the explanation on the model parameters.
The output of the explainability obtained from a trained
network is compared to that of a network with some ran-
domly initialized parameters. If no clear change is present
between the explanation of the trained CNN and that with
randomly initialized weights, then no clear link can be es-
tablished between the network weights and the explanation.
This can lead to a misleading interpretation of the visual-
izations. With the cascading randomization test (Adebayo
et al. 2018), the CNN weights are randomized in progres-
sion from the top layer to the bottom one. For each layer
we evaluate the similarity between the original explanation
and the one obtained with random weights up to that layer.
As previous work has shown, the SSIM of explanations for
ImageNet inputs does not necessarily decrease with the pro-
gressive randomization of the InceptionV3 layers (Adebayo
et al. 2018). From the cascading randomization, we can only
assess if the heatmaps depend on the network weights, as in
the Pneumothorax examples in (Arun et al. 2020) but we
cannot establish whether, after training, the heatmaps point
more towards neoplastic nuclei than before. Therefore, we
compute as an additional analysis the IoUs with each nuclei
type for a fully randomized untrained CNN. This analysis
is conducted to highlight whether the explanations become
more aligned with clinically relevant factors after training.
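A sketch of the cascading randomization test for a Keras model is shown below. It reuses heatmap_agreement and an explanation function such as the grad_cam sketch above; the re-initialization scheme (Gaussian noise of fixed scale) is an assumption.

```python
import numpy as np
import tensorflow as tf

def cascading_randomization(model, image, heatmap_fn, layer_names):
    # Cascading randomization (Adebayo et al. 2018): randomize layers from
    # the top (output) towards the bottom (input) and compare explanations.
    # heatmap_fn(model, image) returns one heatmap, e.g.
    # lambda m, x: grad_cam(m, x, "mixed10").
    reference = heatmap_fn(model, image)
    randomized = tf.keras.models.clone_model(model)
    randomized.set_weights(model.get_weights())
    similarity = {}
    for name in layer_names:
        layer = randomized.get_layer(name)
        # Replace this layer's weights with random values of the same shape.
        layer.set_weights([np.random.normal(0.0, 0.05, w.shape)
                           for w in layer.get_weights()])
        similarity[name] = heatmap_agreement(reference,
                                             heatmap_fn(randomized, image))
    return similarity
```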
Results
Visual Similarity of XAI Methods
We compare the visualizations obtained for test PanNuke
images in Fig. 1. In this visual inspection, we compare the
heatmaps of four images with different classification out-
comes, namely True Positives (TP), True Negatives (TN),
False Positives (FP) and False Negatives (FN)3. The seman-
tic segmentation of the nuclei is overlayed on the origi-
nal images. To enable a fair comparison across the differ-
ent methods and the different inputs, the heatmaps are nor-
malized between zero and one according to the maximum
3More images are available in the Github repository for further
validation: https://bit.ly/2K48HKz.
Figure 1: Qualitative comparison of CAM, Grad-CAM,
Grad-CAM++ and LIME on testing images. We report
LIME computed with SLIC and FHA superpixel extraction
as well as their average. The nuclei segmentations are over-
layed on the original images in the left column.
and minimum values of the heatmaps for all testing inputs.
As suggested in (Palatnik de Sousa, Maria Bernardes Re-
buzzi Vellasco, and Costa da Silva 2019), the average of the
LIME visualizations obtained with the two superpixel ex-
traction algorithms is also shown for qualitative assessment
(LIME AVG). To obtain LIME heatmaps, we use the max-
imum number of features for the explanations, correspond-
ing to using all the superpixels in the images. We set the
neighborhood size to 10,000 samples to obtain more robust
visualizations, despite the high computational cost of the op-
erations (Ribeiro, Singh, and Guestrin 2016). Heatmaps of
negative predictions (both TNs and FNs) have lower values
than those for TPs. The mean CAM value, for
example, is 1.53 for TNs, 1.83 for FNs and 4.03 for TPs.
A question that may arise is whether the heatmaps agree
(i.e. high SSIM) when the network is confident about the
predicted class. We evaluate the correlation between the
CNN predictions and the SSIM values to see whether the
similarity increases for increasing values of the prediction.
The SSIM is computed for 200 randomly drawn test inputs
with class stratification (100 for the positive class and 100
for the negative class). Pearson’s correlation of the SSIM
values is shown in Fig. 2a. Grad-CAM behaves as a general-
ization of CAM (Selvaraju et al. 2017), and their agreement
is positively correlated with the network confidence (Pearson's
correlation coefficient ρ = 0.44, p << 0.001).
The correlations between the prediction and the SSIM of the
other pairs of XAI methods in Fig. 2a are mostly negative,
showing that heatmaps are more likely to disagree if the pre-
diction is positive.
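The correlation itself can be computed with scipy, as sketched below. Here ssim_values and predictions are placeholder arrays standing for the per-image SSIM of one pair of XAI methods and the corresponding CNN outputs on the 200 stratified test images.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder arrays: per-image SSIM of one pair of XAI methods and the
# CNN prediction for the same 200 stratified test images.
ssim_values = np.random.rand(200)
predictions = np.random.rand(200)

rho, p_value = pearsonr(ssim_values, predictions)
print(f"Pearson rho = {rho:.2f}, p = {p_value:.2e}")
```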
In Fig. 2b, we compare the average SSIM values for pairs
of XAI methods. The 200 testing inputs are divided accord-
ing to their classification outcomes, i.e. TP, TN, FP, FN.
The XAI methods agree more on negative predictions than
on positive ones, with an SSIM above 0.6 for all methods
mostly due to consistently low activations of the heatmaps.
Alignment with Clinical Factors
We evaluate whether the explanations reflect the attention of
the CNN towards clinically relevant factors. This alignment
with clinical factors is quantified as the IoU of the heatmaps
Figure 2: a) Pearson’s correlation between the SSIM of pairs
of XAI methods on testing input images and the relative
CNN predictions. For all cells, except C/LS, p << 0.005.
b) Average SSIM between pairs of XAI methods for the dif-
ferent network outcomes, i.e. TN: True Negative, TP: True
Positive, FN: False Negative, FP: False Positive. The error
bars represent the standard deviation of the SSIM values.
with the segmentation masks of functionally different nuclei.
The IoU is computed for 100 testing images of the PanNuke
dataset containing at least one neoplastic nucleus (indicative
of the presence of tumor). The heatmaps are thresholded,
as in (Zhou et al. 2018), so that they activate on average for
60% of the pixels of the positive class images. We obtain one
IoU score per image and per annotation type. Because some
nuclei types are not present on some subsets of images, the
IoU for a given annotation type is computed only on the sub-
set of images that contains at least one instance of this type.
Results are presented in Fig. 3. The IoU of the heatmaps gen-
erated for a CNN with fully randomized weights is added as
a baseline for comparison4.
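One simple way to obtain such a threshold, assumed here rather than taken from the authors' code, is to pool the heatmap values of the positive-class images and take the percentile that leaves 60% of the pixels above it.

```python
import numpy as np

def global_threshold(positive_heatmaps, active_fraction=0.6):
    # Threshold chosen so that, over the pooled positive-class heatmaps,
    # `active_fraction` of the pixels fall above it (60% as in the paper).
    pooled = np.concatenate([h.ravel() for h in positive_heatmaps])
    return np.percentile(pooled, 100.0 * (1.0 - active_fraction))
```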
Consistency, Repeatability and Randomization Test
We evaluate the consistency of the visualizations by assess-
ing three points: (i) their dependency on the XAI hyper-
parameters, (ii) their repeatability, and (iii) their response
to the randomization test. Since activation maps do not re-
quire the tuning of hyper-parameters, the consistency (i) and
repeatability analysis (ii) are only reported for LIME.
To assess (i), we monitor the changes in the SSIM over
small shifts in two hyper-parameters, namely the number of
samples (that corresponds to the neighborhood size), and the
number of features, i.e. the number of superpixels retained
for the analysis. In (i), differently from the analysis in (ii),
we do not reset the initialization seed used by the LIME sur-
rogate model across repetitions. This gives more stability to
the comparison, reducing the level of stochasticity within
the generation of LIME explanations. The plot in Fig. 4a
shows the SSIM for changing values of the neighborhood
size, with the number of superpixels fixed at 100. The shift
in the hyper-parameters was performed within the range of
zero to 3000 with a step of 50. For a given value N on the
x-axis, the plot represents the SSIM between the heatmap
obtained with N samples (image perturbations) and the one
obtained with N − 50 samples. Similarly, Fig. 4b reports
the SSIMs between heatmaps obtained by fixing the number
of samples to 1000 and shifting the number of superpixels
retained from zero to 3000 by intervals of 50.
4This is not the same as the cascading randomization test
proposed in (Adebayo et al. 2018), which is reported in Fig. 7.
Figure 3: Quantification of CNN attention on the nuclei types in PanNuke expressed as the IoU in the testing set. The IoU of
a network with randomly initialized weights (RANDOM-TP and RANDOM-FN) is added as a baseline for comparison. Best
seen on screen.
(a) Neighborhood size (b) Number of superpixels
Figure 4: SSIM between heatmaps obtained from LIME
when a parameter differs by a shift of 50. The studied pa-
rameter is the number of samples in (a) and the number of
superpixels in (b). E.g. the SSIM for the x-axis point 1000 is
the SSIM between the LIME method with 1000 samples and
the LIME method with 950 samples. The number of super-
pixels is set to 100 in (a) and the neighborhood size to 1000
in (b).
(a) Neighborhood size of 100 samples. (b) Neighborhood size of 1000 samples.
Figure 5: SSIM evaluating LIME repeatability over 25 repe-
titions for True Positive (TP) and False Negative (FN) inputs
for both SLIC and FHA superpixels. The random seed ini-
tialization for LIME is reset at every repetition. Heatmaps
obtained with 10, 100 and 1000 superpixels are compared.
Error bars report the standard deviation.
We assess (ii), namely the repeatability of LIME visual-
izations, by evaluating the SSIM of the heatmaps obtained
with 25 different initialization seeds. As the hyper-parameter
values for the number of superpixels and the neighborhood
size may also affect LIME heatmaps (see Fig. 4), we com-
pare in Fig. 5 the repeatability of the visualizations for 10,
100 and 1000 superpixels with neighborhoods of 100 and
1000 samples. High repeatability (SSIM around 0.8) is ob-
tained only with 10 superpixels, and a larger neighborhood
(Fig. 5b) results in slightly more stable explanations.
The randomization test (iii) is finally performed for the
XAI visualization methods considered in this analysis to
assess the dependency of the explanations on the CNN
weights. The weights are randomized in a cascading fashion
(from the top to the bottom layers) up to the full randomiza-
tion of all the CNN. In Fig. 6, we show visual examples of
these cascaded randomizations. The SSIM between the orig-
inal heatmap (from the trained CNN) and the one after the
randomization at each layer is shown in Fig. 7.
Figure 6: Visual assessment of the cascading randomiza-
tion (Adebayo et al. 2018) test for the positive input shown
in the top left corner. The progression from left to right
shows the XAI heatmaps for a randomization of the network
weights up to the layer represented by each column.
Figure 7: SSIM for cascading randomization (Adebayo et
al. 2018). We show the SSIM between the original heatmaps
(from the trained network) and the heatmaps generated as
the CNN weights are randomized in the cascading way.
Discussion
The experiments propose an in-depth evaluation of popu-
lar XAI methods. XAI explanations were generated and vi-
sually compared for a total of 200 testing inputs, although
only a subset is presented in Fig. 1. The full set of results
can be inspected in the GitHub repository. The CAM-based
visualizations in Fig. 1 seem similar to each other. They
all activate more frequently and with larger absolute val-
ues for positive predictions, with stronger intensity on the
neoplastic nuclei areas, as also seen in (Graziani, Andrearczyk,
and Müller 2019). The results obtained with LIME,
on the other hand, are difficult to interpret in the reported
examples, being very dependent on the generation of su-
perpixels (Palatnik de Sousa, Maria Bernardes Rebuzzi Vel-
lasco, and Costa da Silva 2019). Apart from these consider-
ations, however, the simple qualitative analysis is not suffi-
cient to evaluate the appropriateness and reliability of these
XAI methods for histopathology images.
The quantitative metrics in this paper evaluate XAI meth-
ods from a global perspective. The correlation analysis in
Fig. 2a shows that the explanations mostly agree when the
prediction is low (i.e. low probability of tumor). The agree-
ment is given by the very low values everywhere on the
heatmap for these inputs (as also seen in the qualitative as-
sessment in Fig. 1). The SSIM values in Fig. 2 further show
that XAI visualizations are dissimilar for positive predic-
tions (TPs and FPs). The similarity between CAM and Grad-
CAM, shown by the high SSIM values for all prediction out-
comes (0.8 on average), confirms the equivalence between
the two methods explained in (Selvaraju et al. 2017).
The IoU analysis in Fig. 3 suggests that all XAI methods
attribute the largest attention to the neoplastic nuclei. At first
sight, this may indicate that neoplastic nuclei are responsi-
ble for the identification of tumorous patches. There is, how-
ever, an important bias in the distribution of neoplastic nu-
clei, that are in larger numbers than all other types and are
present only in positive images by construction (the ground
truth of all patches is computed by looking at the neoplas-
tic nuclei). We cannot thus clearly establish the correlation
between the attention attributed to neoplastic nuclei and the
network outcome. Because of this, we compare the IoUs of a
trained CNN to that of a randomly initialized CNN, also re-
ported in Fig. 3. Only a slight (not statistically significant)
reduction in the IoU with the neoplastic nuclei is noticed.
This suggests that the likelihood of high IoU with neoplas-
tic nuclei is already high for a random heatmap and that the
alignment of the XAI visualization with clinically relevant
features is only apparent.
The consistency of the explanations upon small shifts in
the hyper-parameter setting is another point of our analy-
sis. Fig. 4 shows that LIME results depend strongly on the
hyper-parameter set-up. The SSIM evaluating shifts in the
parameters only plateaus after passing high values for both
the neighborhood size (above 1000 samples) and the number
of features (i.e. number of superpixels, above 100). The
latter should be much larger than general recommendations
for LIME on non-visual inputs (e.g. around 10 features for
text applications). Besides, Fig. 5 shows that LIME explana-
tions are very unstable and not repeatable unless the number
of features used for the analysis is small (around 10, from
Fig. 5). The two analyses in Fig. 4b and 5 show that it is not
possible to obtain both repeatable and consistent LIME vi-
sualizations. Heatmaps with only 10 superpixels, moreover,
are hard to visually interpret (not illustrated in this paper, but
available in the GitHub repository for comparison).
We observe in Fig. 6 that XAI visualizations change when
the network weights are randomized at cascading depths. Es-
pecially the results for CAM-based methods align with those
in (Adebayo et al. 2018), showing diffused heatmaps around
the image center. The cascading randomization test is passed
by CAM-based methods in Fig. 7. This does not mean, how-
ever, that the CAM-based visualizations of the trained CNN
point more towards clinically relevant regions than an un-
trained CNN. As previously mentioned, the comparison of
the IoUs of trained and untrained CNNs in Fig. 3 shows that
XAI visualizations for the trained CNN do not overlap more
with neoplastic nuclei than those for an untrained CNN. We
conclude that the XAI visualizations analyzed in this paper
only apparently reflect the clinical variability in the images.
Conclusions
This study proposes qualitative and quantitative analyses of
XAI visualization methods in the context of CNNs for dig-
ital pathology. The qualitative inspection may lead to mis-
leading conclusions, e.g. that the network’s attention points
towards neoplastic nuclei. As neoplasticity is a main indi-
cator of tumor, these explanations increase the confirmation
bias and the tendency to accept the CNN decisions as
true. Our evaluation, however, shows no significant differ-
ence in the attention paid to neoplastic nuclei between a trained
CNN and a randomly initialized one. The explana-
tions, therefore, do not seem to increase their alignment with
clinical evidence after network training. In addition, LIME
visualizations are neither consistent nor repeatable, generat-
ing different explanations for different settings of the hyper-
parameters or initialization seeds. Driven by these results,
we feel that XAI visualizations should not be used as a way
of proving the correctness of the CNN decision-making on
this task.
We remark a limitation of this study in the annotations of
nuclei types, which are not exhaustive of the clinical factors
that could be involved in the decisions. Mitoses, for exam-
ple, are not considered by our analyses. We focus on the most
popular XAI methods, without proposing an exhaustive cover-
age. Our framework for the evaluation, however, can be eas-
ily extended to new annotations, architectures, datasets, and
XAI methods. This paper shows that more applied research
is indeed crucial to test and confirm the utility of XAI meth-
ods in the medical imaging field.
Acknowledgments
This work is supported by the European Union’s projects
PROCESS (agreement n. 777533), ExaMode (n. 825292)
and AI4Media (n. 951911). NVIDIA Corporation supported
this work with the donation of the Titan X GPU.
References
Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; and
Süsstrunk, S. 2012. SLIC superpixels compared to state-of-
the-art superpixel methods. IEEE transactions on pattern
analysis and machine intelligence 34(11):2274–2282.
Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt,
M.; and Kim, B. 2018. Sanity checks for saliency maps. In
Advances in Neural Information Processing Systems, 9505–
9515.
Arun, N.; Gaw, N.; Singh, P.; Chang, K.; Aggarwal, M.;
Chen, B.; Hoebel, K.; Gupta, S.; Patel, J.; Gidwani, M.;
et al. 2020. Assessing the (un)trustworthiness of saliency
maps for localizing abnormalities in medical imaging. arXiv
preprint arXiv:2008.02766.
Chattopadhay, A.; Sarkar, A.; Howlader, P.; and Balasubra-
manian, V. N. 2018. Grad-CAM++: Generalized gradient-
based visual explanations for deep convolutional networks.
In 2018 IEEE Winter Conference on Applications of Com-
puter Vision (WACV), 839–847. IEEE.
Felzenszwalb, P. F., and Huttenlocher, D. P. 2004. Efficient
graph-based image segmentation. International journal of
computer vision 59(2):167–181.
Gamper, J.; Koohbanani, N. A.; Benet, K.; Khuram, A.; and
Rajpoot, N. 2019. PanNuke: an open pan-cancer histol-
ogy dataset for nuclei instance segmentation and classifica-
tion. In European Congress on Digital Pathology, 11–19.
Springer.
Gamper, J.; Koohbanani, N. A.; Graham, S.; Jahanifar, M.;
Khurram, S. A.; Azam, A.; Hewitt, K.; and Rajpoot, N.
2020. PanNuke dataset extension, insights and baselines.
arXiv preprint arXiv:2003.10778.
Graziani, M.; Andrearczyk, V.; and Müller, H. 2018. Re-
gression concept vectors for bidirectional explanations in
histopathology. In Understanding and Interpreting Ma-
chine Learning in Medical Image Computing Applications.
Springer. 124–132.
Graziani, M.; Andrearczyk, V.; and Müller, H. 2019. Vi-
sualizing and interpreting feature reuse of pretrained CNNs
for histopathology. In IMVIP 2019: Irish Machine Vision
and Image Processing Conference Proceedings. Irish Pat-
tern Recognition and Classification Society.
Graziani, M.; Andrearczyk, V.; Marchand-Maillet, S.; and
Müller, H. 2020. Concept attribution: Explaining CNN de-
cisions to physicians. Computers in Biology and Medicine
123:103865.
Khan, A.; Atzori, M.; Otálora, S.; Andrearczyk, V.; and
Müller, H. 2020. Generalizing convolution neural networks
on stain color heterogeneous data for computational pathol-
ogy. In Medical Imaging 2020: Digital Pathology, volume
11320, 113200R. International Society for Optics and Pho-
tonics.
Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C. J.; Wexler, J.;
Viégas, F.; and Sayres, R. 2018. Interpretability beyond fea-
ture attribution: Quantitative testing with concept activation
vectors (tcav). In ICML.
Kindermans, P.-J.; Hooker, S.; Adebayo, J.; Alber, M.;
Schütt, K. T.; Dähne, S.; Erhan, D.; and Kim, B. 2019.
The (un)reliability of saliency methods. Explainable AI:
Interpreting, Explaining and Visualizing Deep Learning,
Springer International Publishing.
Litjens, G.; Bandi, P.; Ehteshami Bejnordi, B.; Geessink, O.;
Balkenhol, M.; Bult, P.; Halilovic, A.; Hermsen, M.; van de
Loo, R.; Vogels, R.; et al. 2018. 1399 H&E-stained sentinel
lymph node sections of breast cancer patients: the CAME-
LYON dataset. GigaScience 7(6):giy065.
Lombrozo, T. 2006. The structure and function of explana-
tions. Trends in cognitive sciences 10(10):464–470.
Madhyastha, P., and Jain, R. 2019. On model stability as a
function of random seed. arXiv preprint arXiv:1909.10447.
Palatnik de Sousa, I.; Maria Bernardes Rebuzzi Vellasco,
M.; and Costa da Silva, E. 2019. Local interpretable
model-agnostic explanations for classification of lymph
node metastases. Sensors 19(13):2969.
Rakha, E. A.; El-Sayed, M. E.; Lee, A. H.; Elston, C. W.;
Grainge, M. J.; Hodi, Z.; Blamey, R. W.; and Ellis, I. O.
2008. Prognostic significance of nottingham histologic
grade in invasive breast carcinoma. Journal of clinical on-
cology 26(19):3153–3158.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why
should I trust you?": Explaining the predictions of any classi-
fier. In Proceedings of the 22nd ACM SIGKDD international
conference on knowledge discovery and data mining, 1135–
1144.
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.;
Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explana-
tions from deep networks via gradient-based localization. In
Proceedings of the IEEE international conference on com-
puter vision, 618–626.
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna,
Z. 2016. Rethinking the inception architecture for computer
vision. In Proceedings of the IEEE conference on computer
vision and pattern recognition, 2818–2826.
Tonekaboni, S.; Joshi, S.; McCradden, M. D.; and Gold-
enberg, A. 2019. What clinicians want: contextualizing
explainable machine learning for clinical end use. arXiv
preprint arXiv:1905.05134.
Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; and Torralba,
A. 2016. Learning deep features for discriminative localiza-
tion. In Proceedings of the IEEE conference on computer
vision and pattern recognition, 2921–2929.
Zhou, B.; Bau, D.; Oliva, A.; and Torralba, A. 2018. Inter-
preting deep visual representations via network dissection.
IEEE transactions on pattern analysis and machine intelli-
gence 41(9):2131–2145.