Chapter

Comparison of CNN Visualization Methods to Aid Model Interpretability for Detecting Alzheimer’s Disease

Abstract and Figures

Advances in medical imaging and convolutional neural networks (CNNs) have made it possible to achieve excellent diagnostic accuracy, comparable to that of human raters. However, CNNs are still not implemented in medical trials because they appear as black-box systems whose inner workings cannot be properly explained. Therefore, it is essential to assess CNN relevance maps, which highlight the regions that contribute most to the prediction. This study compares algorithms for generating heatmaps that visually explain the learned patterns of Alzheimer’s disease (AD) classification. T1-weighted volumetric MRI data were entered into a 3D CNN. Heatmaps were then generated for different visualization methods using the iNNvestigate and keras-vis libraries. The model reached an area under the curve of 0.93 for separating AD dementia patients from controls and 0.75 for separating patients with amnestic mild cognitive impairment from controls. Visualizations for deep Taylor decomposition and layer-wise relevance propagation (LRP) showed the most reasonable results for individual patients, matching expected brain regions. Other methods, such as Grad-CAM and guided backpropagation, showed more scattered activations or random areas. For clinical research, deep Taylor decomposition and LRP showed the most valuable network activation patterns.
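The area under the curve reported in the abstract can be read as the probability that a randomly chosen patient receives a higher model score than a randomly chosen control. The following is a minimal sketch of that rank-based reading, not the evaluation code of the study; the function name `auc_from_scores` and the example scores are hypothetical.

```python
def auc_from_scores(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical model outputs, not data from the study.
patients = [0.9, 0.8, 0.4]
controls = [0.7, 0.3, 0.2, 0.1]
auc = auc_from_scores(patients, controls)  # 11 of 12 pairs are ranked correctly
```

The pairwise loop is quadratic in the number of cases; it is written this way only to make the definition visible.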

... Back-propagation methods are the most popular methods to interpret models, and a wide range of these algorithms have been used to study brain disorders: standard and DeconvNet (Dyrba et al., 2020) and deep Taylor Decomposition (Dyrba et al., 2020). ...
... guided back-propagation (Böhle et al., 2019; Eitel and Ritter, 2019; Hu et al., 2021; Oh et al., 2019; Rieke et al., 2018), gradient input (Dyrba et al., 2020; Eitel and Ritter, 2019; Eitel et al., 2019), Grad-CAM (Burduja et al., 2020; Dyrba et al., 2020), guided Grad-CAM (Tang et al., 2019), LRP (Böhle et al., 2019; Dyrba et al., 2020; Eitel and Ritter, 2019; Eitel et al., 2019), ...
Thesis
Full-text available
The goal of this PhD was the validation of the existence and the discovery of new subtypes of Alzheimer’s disease, the first cause of dementia worldwide. Indeed, despite its discovery more than a century ago, this disease is still not well defined and existing treatments are only weakly effective, possibly because several phenotypes exist within the disease. In order to explore its heterogeneity, we employed deep learning methods applied to a neuroimaging modality, structural magnetic resonance imaging. However, the discovery of important methodological biases in many studies in our field, as well as the lack of consensus regarding deep learning interpretability, partly changed the main objective of the PhD to focus more on issues of validation, robustness and interpretability of deep learning. Then, to correctly assess the ability of deep learning to detect Alzheimer’s disease, three experimental studies were conducted. The first one is a study of deep learning methods for Alzheimer’s classification and allowed a fair comparison of the methods. The second study found a lack of robustness of classification with deep learning in terms of atrophy patterns discovered using interpretability methods. Finally, the last study proposed a subtype discovery method aided by data augmentation. Although it works on synthetic data, it does not generalize to real data. Experimental results of this PhD were obtained thanks to ClinicaDL, one major contribution of this PhD. It is an open source Python library that was used to improve the reproducibility of deep learning experiments.
... In general, relevance or saliency maps indicate the amount of information or contribution of a single input feature on the probability of a particular output class. Previous methodological approaches like gradient-weighted class activation mapping (Grad-CAM) [6], occlusion sensitivity analyses [7,8], and local interpretable model-agnostic explanations (LIME) [9] had the limitation that deriving the relevance or saliency maps provided only group-average estimates, required long runtime [10], or provided only low spatial resolution [11,12]. In contrast, more recent methods such as guided backpropagation [13] or layer-wise relevance propagation (LRP) [4,5] use back-tracing of neural activation through the network paths to obtain high-resolution relevance maps. ...
... Recently, three studies compared LRP with other CNN visualization methods for the detection of Alzheimer's disease in T1-weighted MRI scans [11,12,14]. The derived relevance maps showed the strongest contribution of medial and lateral temporal lobe atrophy, which matched the a priori expected brain regions of high diagnostic relevance [15,16]. ...
... Notably, studies using approaches i and ii showed visualizations characterizing the whole sample or group averages. In contrast, studies applying iii also presented relevance maps for single participants [11,14]. ...
Article
Full-text available
Background Although convolutional neural networks (CNNs) achieve high diagnostic accuracy for detecting Alzheimer’s disease (AD) dementia based on magnetic resonance imaging (MRI) scans, they are not yet applied in clinical routine. One important reason for this is a lack of model comprehensibility. Recently developed visualization methods for deriving CNN relevance maps may help to fill this gap as they allow the visualization of key input image features that drive the decision of the model. We investigated whether models with higher accuracy also rely more on discriminative brain regions predefined by prior knowledge. Methods We trained a CNN for the detection of AD in N = 663 T1-weighted MRI scans of patients with dementia and amnestic mild cognitive impairment (MCI) and verified the accuracy of the models via cross-validation and in three independent samples including in total N = 1655 cases. We evaluated the association of relevance scores and hippocampus volume to validate the clinical utility of this approach. To improve model comprehensibility, we implemented an interactive visualization of 3D CNN relevance maps, thereby allowing intuitive model inspection. Results Across the three independent datasets, group separation showed high accuracy for AD dementia versus controls ( AUC ≥ 0.91) and moderate accuracy for amnestic MCI versus controls ( AUC ≈ 0.74). Relevance maps indicated that hippocampal atrophy was considered the most informative factor for AD detection, with additional contributions from atrophy in other cortical and subcortical regions. Relevance scores within the hippocampus were highly correlated with hippocampal volumes (Pearson’s r ≈ −0.86, p < 0.001). Conclusion The relevance maps highlighted atrophy in regions that we had hypothesized a priori. This strengthens the comprehensibility of the CNN models, which were trained in a purely data-driven manner based on the scans and diagnosis labels. 
The high hippocampus relevance scores as well as the high performance achieved in independent samples support the validity of the CNN models in the detection of AD-related MRI abnormalities. The presented data-driven and hypothesis-free CNN modeling approach might provide a useful tool to automatically derive discriminative features for complex diagnostic tasks where clear clinical criteria are still missing, for instance for the differential diagnosis between various types of dementia.
... In general, relevance or saliency maps indicate the amount of information or contribution of a single input feature on the probability of a particular output class. Previous methodological approaches like gradient-weighted class activation mapping (Grad-CAM) [7], occlusion sensitivity analyses [8,9], and local interpretable model-agnostic explanations (LIME) [10] had the limitation that deriving the relevance or saliency maps provided only group-average estimates, required long runtime [11] or provided only low spatial resolution [12,13]. In contrast, more recent methods such as guided backpropagation [14] or layer-wise relevance propagation (LRP) [5,6] use back-tracing of neural activation through the network paths to obtain high-resolution relevance maps. ...
... Recently, three studies compared LRP with other CNN visualization methods for the detection of Alzheimer's disease in T1-weighted MRI scans [12,13,15]. The derived relevance maps showed strongest contribution of medial and lateral temporal lobe atrophy, which matched the a priori expected brain regions of high diagnostic relevance [16,17]. ...
... Notably, studies using the approaches (i) and (ii) showed visualizations characterizing the whole sample or group averages. In contrast, studies applying (iii) also presented relevance maps for single participants [12,15]. ...
Article
Although machine learning approaches achieve high diagnostic accuracy for detecting Alzheimer’s disease (AD) based on MRI scans, they are not applied in clinical routine due to a lack of suitable methods for model comprehensibility and interpretability. Recently developed visualization methods for convolutional neural networks (CNN) may fill this gap. We trained a CNN model to detect AD based on T1‐weighted MRI and implemented a web application to provide an intuitive visualization of relevance maps. The aim of this study was to evaluate the association of relevance scores and hippocampus volume to validate the clinical utility of this approach. MRI scans for 254 cognitively normal controls (CN), 219 patients with amnestic mild cognitive impairment (MCI), and 189 patients with AD were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). MRI scans were tissue‐segmented and normalized using VBM8. Hippocampal volume was extracted from the gray matter (GM) maps. Subsequently, GM maps and hippocampal volume were corrected for effects of total intracranial volume, age, gender and scanner magnetic field strength using a linear model. Thirty‐two coronal slices covering the hippocampus were selected and the corresponding GM residual maps entered into the CNN. The model was evaluated using ten‐fold cross‐validation. Finally, we derived activation maps using the layer‐wise relevance propagation (LRP) algorithm and calculated the sum of hippocampus relevance scores. We obtained a high AUC of 93±6% for AD vs. CN. For MCI vs. CN, group separation was lower, with an AUC of 74±9%. The relevance maps for four exemplary patients are shown in Fig. 1. Relevance maps confirmed that the hippocampus is the most informative region for AD detection, with minor contributions from other cortical and subcortical regions. Relevance scores within the hippocampus were correlated with hippocampal volume, with a Pearson correlation of r=‐0.79 (Fig. 2). 
Our CNN model yielded high accuracy when detecting AD and a medium level of accuracy for mild cognitive impairment. In general, the relevance maps showed the expected regions. The high association of hippocampus relevance scores and volume indicates a high validity of the CNN model as implemented in our intuitively usable web application.
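The covariate correction described in this abstract (removing effects of total intracranial volume, age, gender and field strength with a linear model) can be sketched as ordinary least-squares residualization. This is an illustrative sketch only: the function name `residualize` and the synthetic data are hypothetical, and the sketch fits the model on all subjects, which the abstract does not specify.

```python
import numpy as np


def residualize(values, covariates):
    """Remove linear covariate effects from each column of `values` via least squares."""
    values = np.asarray(values, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    # Design matrix: intercept plus one column per covariate.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    beta, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ beta


# Synthetic stand-in data (assumption): 50 subjects, 4 covariates, 3 voxels.
rng = np.random.default_rng(0)
n_subjects = 50
covariates = rng.normal(size=(n_subjects, 4))
true_effects = rng.normal(size=(4, 3))
values = covariates @ true_effects + 0.1 * rng.normal(size=(n_subjects, 3))

residuals = residualize(values, covariates)
```

By construction, the residual maps are uncorrelated with the covariates in the fitting sample, so a classifier trained on them cannot exploit those linear effects.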
... One benefit of feature importance is that it is model agnostic and so can be used in combination with many ML algorithms. Model-specific approaches to interpretability are more often used alongside DL algorithms [128,129]. For example, Qiu et al. [130] constructed 'disease probability maps' to visualize how a fully convolutional network determined AD status from MRI images. ...
Article
Introduction: Machine learning (ML) has been extremely successful in identifying key features from high-dimensional datasets and executing complicated tasks with human expert levels of accuracy or greater. Methods: We summarize and critically evaluate current applications of ML in dementia research and highlight directions for future research. Results: We present an overview of ML algorithms most frequently used in dementia research and highlight future opportunities for the use of ML in clinical practice, experimental medicine, and clinical trials. We discuss issues of reproducibility, replicability and interpretability and how these impact the clinical applicability of dementia research. Finally, we give examples of how state-of-the-art methods, such as transfer learning, multi-task learning, and reinforcement learning, may be applied to overcome these issues and aid the translation of research to clinical practice in the future. Discussion: ML-based models hold great promise to advance our understanding of the underlying causes and pathological mechanisms of dementia.
... One benefit of feature importance is that it is model agnostic and so can be used in combination with many ML algorithms. Model-specific approaches to interpretability are more often used alongside DL algorithms [128,129]. For example, Qiu et al. [130] constructed 'disease probability maps' to visualize how a fully convolutional network determined AD status from MRI images. ...
Preprint
Full-text available
Introduction: Machine learning (ML) has been extremely successful in identifying key features from high-dimensional datasets and executing complicated tasks with human expert levels of accuracy or greater. Methods: We summarize and critically evaluate current applications of ML in dementia research and highlight directions for future research. Results: We present an overview of ML algorithms most frequently used in dementia research and highlight future opportunities for the use of ML in clinical practice, experimental medicine, and clinical trials. We discuss issues of reproducibility, replicability and interpretability and how these impact the clinical applicability of dementia research. Finally, we give examples of how state-of-the-art methods, such as transfer learning, multi-task learning, and reinforcement learning, may be applied to overcome these issues and aid the translation of research to clinical practice in the future. Discussion: ML-based models hold great promise to advance our understanding of the underlying causes and pathological mechanisms of dementia.
... One approach is to extract the learned representation of a CNN into a comprehensible format, such as a voxel-wise relevance heatmap (see section 3.2.3), in order to detect patterns across subjects. For instance, multiple studies proposed different ways to compute relevance heatmaps in Alzheimer's disease for explaining individual network decisions [104,30,172,173]. In accordance with neurobiological evidence, all studies found the hippocampus and other temporal regions as primarily important. ...
Preprint
Full-text available
By promising more accurate diagnostics and individual treatment recommendations, deep neural networks and in particular convolutional neural networks have advanced to a powerful tool in medical imaging. Here, we first give an introduction into methodological key concepts and resulting methodological promises including representation and transfer learning, as well as modelling domain-specific priors. After reviewing recent applications within neuroimaging-based psychiatric research, such as the diagnosis of psychiatric diseases, delineation of disease subtypes, normative modeling, and the development of neuroimaging biomarkers, we discuss current challenges. This includes for example the difficulty of training models on small, heterogeneous and biased data sets, the lack of validity of clinical labels, algorithmic bias, and the influence of confounding variables.
... Deep Taylor decomposition is a powerful technique for explaining CNN decisions by identifying the features in an input vector that have the greatest impact on a neural network's output based on redistributed relevance [47]. In a recent Alzheimer's disease detection study, deep Taylor decomposition produced more reasonable and accurate results than Grad-CAM [48]. Layer-wise relevance propagation (LRP) is the foundation for deep Taylor decomposition [49] that seeks to create a relevance metric R over the input vector, such that we can represent the network output as the sum of the values of R, $f(x) = \sum_{d=1}^{V} R_d$, where f is the neural network forward pass function. ...
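The conservation property quoted above, that the input relevances sum to the network output, can be checked on a toy network. The sketch below uses the LRP epsilon rule on a two-layer ReLU network without biases; it is not deep Taylor decomposition itself and not the implementation of any cited study. The helper `lrp_dense`, the weights and the input are hypothetical, and exact conservation holds only because there are no bias terms and the stabilizer `eps` is tiny.

```python
import numpy as np


def lrp_dense(activations, weights, relevance_out, eps=1e-9):
    """Redistribute relevance from a dense layer's outputs to its inputs (epsilon rule, no bias)."""
    z = activations @ weights                  # pre-activations, shape (n_out,)
    z = z + eps * np.where(z >= 0, 1.0, -1.0)  # stabilizer avoids division by zero
    s = relevance_out / z
    return activations * (weights @ s)


# Toy network (assumption): 3 inputs -> 2 ReLU units -> 1 output, no biases.
x = np.array([1.0, 2.0, -1.0])
W1 = np.array([[1.0, -1.0],
               [0.5, 0.5],
               [0.5, 1.0]])
W2 = np.array([[1.0],
               [2.0]])

hidden = np.maximum(0.0, x @ W1)  # forward pass
output = hidden @ W2              # f(x), shape (1,)

# Backward relevance pass: start from the output and go layer by layer.
relevance_hidden = lrp_dense(hidden, W2, output)
relevance_input = lrp_dense(x, W1, relevance_hidden)
```

Summing `relevance_input` recovers `output[0]` up to the stabilizer, which is the numerical counterpart of $f(x) = \sum_{d=1}^{V} R_d$.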
Article
Full-text available
The most mysterious question humans have ever attempted to answer for centuries is, “What is beauty, and how does the brain decide what beauty is?”. The main problem is that beauty is subjective, and the concept changes across cultures and generations; thus, subjective observation is necessary to derive a general conclusion. In this research, we propose a novel approach utilizing deep learning and image processing to investigate how humans perceive beauty and make decisions in a quantifiable manner. We propose a novel approach using uncertainty-based ensemble voting to determine the specific features that the brain most likely depends on to make beauty-related decisions. Furthermore, we propose a novel approach to prove the relation between golden ratio and facial beauty. The results show that beauty is more correlated with the right side of the face and specifically with the right eye. Our study and findings push boundaries between different scientific fields in addition to enabling numerous industrial applications in variant fields such as medicine and plastic surgery, cosmetics, social applications, personalized treatment, and entertainment.
... The regions highlighted by the CNN saliency maps could possibly be related to AD using prior knowledge, but we will refrain from over-interpretation here. It is however unexpected that the medial temporal lobe is not covered as previously shown with CNN saliency maps on ADNI data (Dyrba et al., 2020; Rieke et al., 2018). Differences between the SVM and CNN classifiers in involved brain regions could be attributed both to the differences in the classification approaches and to the differences in the visualization techniques used. ...
Article
Full-text available
This work validates the generalizability of MRI-based classification of Alzheimer’s disease (AD) patients and controls (CN) to an external data set and to the task of prediction of conversion to AD in individuals with mild cognitive impairment (MCI). We used a conventional support vector machine (SVM) and a deep convolutional neural network (CNN) approach based on structural MRI scans that underwent either minimal pre-processing or more extensive pre-processing into modulated gray matter (GM) maps. Classifiers were optimized and evaluated using cross-validation in the Alzheimer’s Disease Neuroimaging Initiative (ADNI; 334 AD, 520 CN). Trained classifiers were subsequently applied to predict conversion to AD in ADNI MCI patients (231 converters, 628 non-converters) and in the independent Health-RI Parelsnoer Neurodegenerative Diseases Biobank data set. From this multi-center study representing a tertiary memory clinic population, we included 199 AD patients, 139 participants with subjective cognitive decline, 48 MCI patients converting to dementia, and 91 MCI patients who did not convert to dementia. AD-CN classification based on modulated GM maps resulted in a similar area-under-the-curve (AUC) for SVM (0.940; 95%CI: 0.924–0.955) and CNN (0.933; 95%CI: 0.918–0.948). Application to conversion prediction in MCI yielded significantly higher performance for SVM (AUC = 0.756; 95%CI: 0.720–0.788) than for CNN (AUC = 0.742; 95%CI: 0.709–0.776) (p
... One approach is to extract the learned representation of a CNN into a comprehensible format, such as a voxel-wise relevance heatmap (see Section 3.2.3), in order to detect patterns across subjects. For instance, multiple studies proposed different ways to compute relevance heatmaps in Alzheimer's disease for explaining individual network decisions (Böhle et al., 2019;Rieke et al., 2018;Dyrba et al., 2020;Jo et al., 2020). In accordance with neurobiological evidence, all studies found the hippocampus and other temporal regions as primarily important. ...
Article
Full-text available
By promising more accurate diagnostics and individual treatment recommendations, deep neural networks and in particular convolutional neural networks have advanced to a powerful tool in medical imaging. Here, we first give an introduction into methodological key concepts and resulting methodological promises including representation and transfer learning, as well as modelling domain-specific priors. After reviewing recent applications within neuroimaging-based psychiatric research, such as the diagnosis of psychiatric diseases, delineation of disease subtypes, normative modeling, and the development of neuroimaging biomarkers, we discuss current challenges. This includes for example the difficulty of training models on small, heterogeneous and biased data sets, the lack of validity of clinical labels, algorithmic bias, and the influence of confounding variables.
Chapter
Relevance maps derived from convolutional neural networks (CNN) indicate the influence of a particular image region on the decision of the CNN model. Individual maps are obtained for each single input 3D MRI image and various visualization options need to be adjusted to improve information content. In the use case of model prototyping and comparison, the common approach of saving the 3D relevance maps to disk is impractical given the large number of combinations. Therefore, we developed a web application to aid interactive inspection of CNN relevance maps. For the requirements analysis, we interviewed several people from different stakeholder groups (model/visualization developers, radiology/neurology staff) following a participatory design approach. The visualization software was conceptually designed in a Model–View–Controller paradigm and implemented using the Python visualization library Bokeh. This framework allowed a Python server back-end directly executing the CNN model and related code, and an HTML/JavaScript front-end running in any web browser. Slice-based 2D views were realized for each axis, accompanied by several visual guides to improve usability and quick navigation to image areas with high relevance. The interactive visualization tool greatly improved model inspection and comparison for developers. Owing to the well-structured implementation, it can be easily adapted to other CNN models and types of input data.
Preprint
Full-text available
Convolutional neural networks (CNN) have become a powerful tool for detecting patterns in image data. Recent papers report promising results in the domain of disease detection using brain MRI data. Despite the high accuracy obtained from CNN models for MRI data so far, almost no papers provided information on the features or image regions driving this accuracy as adequate methods were missing or challenging to apply. Recently, the toolbox iNNvestigate has become available, implementing various state of the art methods for deep learning visualizations. Currently, there is a great demand for a comparison of visualization algorithms to provide an overview of the practical usefulness and capability of these algorithms. Therefore, this thesis has two goals: 1. To systematically evaluate the influence of CNN hyper-parameters on model accuracy. 2. To compare various visualization methods with respect to the quality (i.e. randomness/focus, soundness).
Article
Full-text available
Deep neural networks have led to state-of-the-art results in many medical imaging tasks including Alzheimer's disease (AD) detection based on structural magnetic resonance imaging (MRI) data. However, the network decisions are often perceived as being highly non-transparent, making it difficult to apply these algorithms in clinical routine. In this study, we propose using layer-wise relevance propagation (LRP) to visualize convolutional neural network decisions for AD based on MRI data. Similarly to other visualization methods, LRP produces a heatmap in the input space indicating the importance/relevance of each voxel contributing to the final classification outcome. In contrast to susceptibility maps produced by guided backpropagation (“Which change in voxels would change the outcome most?”), the LRP method is able to directly highlight positive contributions to the network classification in the input space. In particular, we show that (1) the LRP method is very specific for individuals (“Why does this person have AD?”) with high inter-patient variability, (2) there is very little relevance for AD in healthy controls and (3) areas that exhibit a lot of relevance correlate well with what is known from literature. To quantify the latter, we compute size-corrected metrics of the summed relevance per brain area, e.g., relevance density or relevance gain. Although these metrics produce very individual “fingerprints” of relevance patterns for AD patients, a lot of importance is put on areas in the temporal lobe including the hippocampus. After discussing several limitations such as sensitivity toward the underlying model and computation parameters, we conclude that LRP might have a high potential to assist clinicians in explaining neural network decisions for diagnosing AD (and potentially other diseases) based on structural MRI data.
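A size-corrected metric such as the relevance density mentioned above can be sketched as the summed relevance inside a brain region divided by the number of voxels in that region. This is one plausible reading, stated as an assumption; the exact definitions used in the study may differ. The function name, the toy atlas and the relevance values are hypothetical.

```python
import numpy as np


def relevance_density(relevance_map, atlas_labels):
    """Summed relevance per atlas region divided by the region's voxel count."""
    densities = {}
    for label in np.unique(atlas_labels):
        if label == 0:  # label 0 is assumed to mark background
            continue
        mask = atlas_labels == label
        densities[int(label)] = float(relevance_map[mask].sum() / mask.sum())
    return densities


# Toy 2x2x2 volume (assumption): region 1 has four voxels, region 2 has two.
atlas = np.array([[[1, 1], [1, 1]],
                  [[2, 2], [0, 0]]])
relevance = np.array([[[0.2, 0.2], [0.2, 0.2]],
                      [[0.3, 0.3], [0.0, 0.0]]])

density = relevance_density(relevance, atlas)
```

In this toy case region 1 has the larger summed relevance (0.8 versus 0.6) while region 2 has the larger density (0.3 versus 0.2), which illustrates why a size correction can change the ranking of regions.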
Chapter
Full-text available
Visualizing and interpreting convolutional neural networks (CNNs) is an important task to increase trust in automatic medical decision making systems. In this study, we train a 3D CNN to detect Alzheimer’s disease based on structural MRI scans of the brain. Then, we apply four different gradient-based and occlusion-based visualization methods that explain the network’s classification decisions by highlighting relevant areas in the input image. We compare the methods qualitatively and quantitatively. We find that all four methods focus on brain regions known to be involved in Alzheimer’s disease, such as inferior and middle temporal gyrus. While the occlusion-based methods focus more on specific regions, the gradient-based methods pick up distributed relevance patterns. Additionally, we find that the distribution of relevance varies across patients, with some having a stronger focus on the temporal lobe, whereas for others more cortical areas are relevant. In summary, we show that applying different visualization methods is important to understand the decisions of a CNN, a step that is crucial to increase clinical impact and trust in computer-based decision support systems.
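The occlusion-based family of methods mentioned above can be sketched as follows: cover one patch of the input at a time and record how much the model score drops. This is a minimal illustration under stated assumptions, not the procedure of the study; `occlusion_map`, the patch size, the fill value and the stand-in model `toy_predict` are all hypothetical.

```python
import numpy as np


def occlusion_map(volume, predict, patch=2, fill=0.0):
    """Score drop caused by occluding each non-overlapping cubic patch of a 3D volume."""
    baseline = predict(volume)
    heatmap = np.zeros_like(volume, dtype=float)
    for i in range(0, volume.shape[0], patch):
        for j in range(0, volume.shape[1], patch):
            for k in range(0, volume.shape[2], patch):
                occluded = volume.copy()
                occluded[i:i + patch, j:j + patch, k:k + patch] = fill
                # A large drop means the model relied on this patch.
                heatmap[i:i + patch, j:j + patch, k:k + patch] = baseline - predict(occluded)
    return heatmap


def toy_predict(vol):
    """Stand-in model (assumption): the score depends only on one corner block."""
    return float(vol[:2, :2, :2].sum())


volume = np.ones((4, 4, 4))
heatmap = occlusion_map(volume, toy_predict)
```

Only the corner block that the stand-in model actually uses receives non-zero values. Each patch requires one forward pass, so the cost grows with the number of patches, which is one reason occlusion analyses can have long runtimes on full 3D scans.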
Conference Paper
Full-text available
This article presents the prediction difference analysis method for visualizing the response of a deep neural network to a specific input. When classifying images, the method highlights areas in a given input image that provide evidence for or against a certain class. It overcomes several shortcomings of previous methods and provides great additional insight into the decision making process of classifiers. Making neural network decisions interpretable through visualization is important both to improve models and to accelerate the adoption of black-box classifiers in application areas such as medicine. We illustrate the method in experiments on natural images (ImageNet data), as well as medical images (MRI brain scans).
Article
This article presents the prediction difference analysis method for visualizing the response of a deep neural network to a specific input. When classifying images, the method highlights areas in a given input image that provide evidence for or against a certain class. It overcomes several shortcomings of previous methods and provides great additional insight into the decision making process of classifiers. Making neural network decisions interpretable through visualization is important both to improve models and to accelerate the adoption of black-box classifiers in application areas such as medicine. We illustrate the method in experiments on natural images (ImageNet data), as well as medical images (MRI brain scans).
Article
BACKGROUND: Alzheimer’s disease (AD) patients show early changes in white matter (WM) structural integrity. We studied the use of diffusion tensor imaging (DTI) in assessing WM alterations in the predementia stage of mild cognitive impairment (MCI). METHODS: We applied a Support Vector Machine (SVM) classifier to DTI and volumetric magnetic resonance imaging data from 35 amyloid-β42 negative MCI subjects (MCI-Aβ42−), 35 positive MCI subjects (MCI-Aβ42+), and 25 healthy controls (HC) retrieved from the European DTI Study on Dementia. The SVM was applied to DTI-derived fractional anisotropy, mean diffusivity (MD), and mode of anisotropy (MO) maps. For comparison, we studied classification based on gray matter (GM) and WM volume. RESULTS: We obtained accuracies of up to 68% for MO and 63% for GM volume when it came to distinguishing between MCI-Aβ42− and MCI-Aβ42+. When it came to separating MCI-Aβ42+ from HC we achieved an accuracy of up to 77% for MD and a significantly lower accuracy of 68% for GM volume. The accuracy of multimodal classification was not higher than the accuracy of the best single modality. CONCLUSIONS: Our results suggest that DTI data provide better prediction accuracy than GM volume in predementia AD.
Article
Recent evidence from cross-sectional in vivo imaging studies suggests that atrophy of the cholinergic basal forebrain (BF) in Alzheimer's disease (AD) can be distinguished from normal age-related degeneration even at predementia stages of the disease. Longitudinal study designs are needed to specify the dynamics of BF degeneration in the transition from normal aging to AD. We applied recently developed techniques for in vivo volumetry of the BF to serial magnetic resonance imaging scans of 82 initially healthy elderly individuals (60-93 years) and 50 patients with very mild AD (Clinical Dementia Rating score = 0.5) that were clinically followed over an average of 3 ± 1.5 years. BF atrophy rates were found to be significantly higher than rates of global brain shrinkage even in cognitively stable healthy elderly individuals. Compared with healthy control subjects, very mild AD patients showed reduced BF volumes at baseline and increased volume loss over time. Atrophy of the BF was more pronounced in progressive patients compared with those that remained stable. The cholinergic BF undergoes disproportionate degeneration in the aging process, which is further increased by the presence of AD.
Rieke J, Eitel F, Weygandt M, et al. Visualizing convolutional networks for MRI-based diagnosis of Alzheimer's disease. In: Understanding and Interpreting Machine Learning in Medical Image Computing Applications. Springer; 2018. p. 24-31.
Alber M, Lapuschkin S, Seegerer P, et al. iNNvestigate neural networks! J Mach Learn Res. 2019;20:1-8.