Opening the Black-Box: A Systematic Review on
Explainable AI in Remote Sensing
Adrian Höhl∗, Ivica Obadic∗, Miguel-Ángel Fernández-Torres, Hiba Najjar,
Dario Oliveira, Zeynep Akata, Andreas Dengel and Xiao Xiang Zhu, Fellow, IEEE.
Abstract—In recent years, black-box machine learning ap-
proaches have become a dominant modeling paradigm for knowl-
edge extraction in remote sensing. Despite the potential benefits of
uncovering the inner workings of these models with explainable
AI, a comprehensive overview summarizing the explainable AI
methods used and their objectives, findings, and challenges in
remote sensing applications is still missing. In this paper, we
address this gap by performing a systematic review to identify
the key trends in the field and shed light on novel explainable
AI approaches and emerging directions that tackle specific
remote sensing challenges. We also reveal the common patterns
of explanation interpretation, discuss the extracted scientific
insights, and reflect on the approaches used for the evaluation of
explainable AI methods. As such, our review provides a complete
summary of the state-of-the-art of explainable AI in remote
sensing. Further, we give a detailed outlook on the challenges
and promising research directions, representing a basis for novel
methodological development and a useful starting point for new
researchers in the field.
Index Terms—Earth observation, explainable AI (xAI), ex-
plainability, interpretable ML (IML), interpretability, remote
sensing
I. INTRODUCTION
Machine Learning (ML) methods have shown outstanding performance in numerous Earth Observation (EO) tasks [1,2], but they are mostly complex and lack interpretability and explanations of their decisions. In EO applications, understanding the model’s
∗Adrian Höhl and Ivica Obadic share the first authorship. Corresponding authors: Adrian Höhl, Ivica Obadic, and Xiao Xiang Zhu.
Adrian Höhl is with the Chair of Data Science in Earth Observation, Technical University of Munich (TUM), 80333 Munich, Germany (email: adrian.hoehl@tum.de)
Ivica Obadic is with the Chair of Data Science in Earth Observation, Technical University of Munich (TUM) and the Munich Center for Machine Learning, 80333 Munich, Germany (email: ivica.obadic@tum.de)
Miguel-Ángel Fernández-Torres is with the Image Processing Laboratory (IPL), Universitat de València (UV), 46980 Paterna (València), Spain (email: miguel.a.fernandez@uv.es)
Hiba Najjar is with the University of Kaiserslautern-Landau, Germany; German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany (email: hiba.najjar@dfki.de)
Dario Oliveira is with the School of Applied Mathematics, Getulio Vargas Foundation, Brazil (email: darioaugusto@gmail.com)
Zeynep Akata is with the Institute for Explainable Machine Learning at Helmholtz Munich and with the Chair of Interpretable and Reliable Machine Learning, Technical University of Munich, 80333 Munich, Germany (email: zeynep.akata@helmholtz-munich.de)
Andreas Dengel is with the University of Kaiserslautern-Landau, Germany; German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany (email: andreas.dengel@dfki.de)
Xiao Xiang Zhu is with the Chair of Data Science in Earth Observation, Technical University of Munich (TUM) and with the Munich Center for Machine Learning, 80333 Munich, Germany (email: xiaoxiang.zhu@tum.com)
Fig. 1. The number of publications of ML in RS (blue curve) and xAI in RS
(green curve), obtained by using the search query described in the Appendix
A1, differ by a factor of ≈70. (*The 2023 value is extrapolated linearly from the first ten months of the year.)
functioning and visualizing the interpretations for analysis
is crucial [3], as it allows practitioners to gain scientific
insights, discover biases, assess trustworthiness and fairness
for policy decisions, and to debug and improve a model. The
European Union adopted an Artificial Intelligence (AI) Act to
ensure that the methods developed and used in Europe align
with fundamental rights and values such as safety, privacy,
transparency, explicability, as well as social and environmental
wellbeing [4]. It is anticipated that other governments world-
wide will implement similar regulations [5]. Many applications
in EO could potentially violate these values when data and AI
are employed for analysis and decision-making. Nevertheless,
explainable AI (xAI) can contribute to aligning these practices
with rights and laws. Hence, xAI emerges as a promising
research direction to tackle the above-mentioned scientific and
regulatory challenges with observational data [6].
Despite these potential benefits, currently, there is a gap
between the usage of ML methods in RS and the works
that aim to reveal the workings of these models. This gap
is illustrated in Figure 1, where the blue curve shows that the number of ML papers in RS has drastically increased in the past few years. Although the works dealing with xAI in RS, shown by the green curve, have also increased rapidly, they still lag behind the number of ML papers in RS by a factor of ≈70. This increasing number of xAI in RS papers motivates us to summarize the existing work in the
field and provide an overview to RS practitioners about the
recent developments, which might lead to narrowing this gap
and making xAI approaches more common in the field of RS.
xAI methods are typically designed to work on natural
images. However, RS images have different properties than
natural images [7]. First, images are captured from above. This perspective comes with unique scales, resolutions, and shadows. For instance, an RS image can cover whole landscapes spanning thousands of square kilometers, while natural images cover only a tiny fraction of such an area [8]. Second, RS captures images in electromagnetic spectra beyond the usual Red-Green-Blue (RGB) channels: from hyperspectral over Synthetic Aperture Radar (SAR) to Light Detection and Ranging (LiDAR), RS covers a wide range of reflectance data. Third, usual RGB cameras are passive, while RS sensors can be active, which introduces properties such as radar shadows, foreshortening, layover, elevation displacement, and speckle effects [9]. Besides the different image properties, the tasks differ as well. Computer Vision (CV) primarily addresses dynamic scenarios where objects move and occlude each other. In contrast, the observed processes
in RS happen on different spectral and spatiotemporal scales,
long-term and short-term, and the systems modeled are very
complex and diverse. Often, the observed systems are not
fully understood and can only be observed indirectly and not
completely, for example, weather events, hydrology, hazards,
ecosystems, and urban dynamics. Although there are numerous
reviews for xAI in the literature [10–12], they typically do not
reflect on the works specific for RS nor reveal how the existing
xAI approaches tackle the above challenges related to remote
sensing data. Therefore, a review of xAI tailored to the field of
RS is necessary to reveal the key trends, common objectives,
challenges, and latest developments.
While current reviews of xAI in EO are focused on social,
regulatory, and stakeholder perspectives [13,14], specific
subtopics in RS [15], or do not provide a broad literature
database [15–17], this paper targets the applications and ap-
proaches from xAI in RS and follows a systematic approach
to gather a comprehensive literature database. To this end,
we conduct a systematic literature search for xAI in RS in
three commonly used literature databases in RS, namely IEEE,
Scopus, and Springer (the research method is thoroughly
outlined in Appendix A). In parallel, we propose a catego-
rization of xAI methods, which provides a detailed overview
and understanding of the xAI taxonomy and techniques. We
identify the relevant papers from the literature database and
rely on the proposed categorization of the xAI methods to
summarize the typical usages, objectives, and new approaches
in the field. Furthermore, we discuss the alignment of the usage
of xAI in RS with standard practices in xAI and the evaluation
of xAI in RS. As such, this review aims to assist users in the field of EO in applying xAI. Finally, we identify the
challenges and limitations of xAI for addressing the unique
properties of RS data, the interpretability of Deep Learning
(DL) models along with the lack of labels in RS, as well as the
combination of xAI with related fields such as uncertainty or
physics. These challenges extend the initial insights presented
in [18] and are more extensively explored here.
In particular, we attempt to answer the following research
questions: (RQ1) Which explainable AI approaches have been
used and which methods have been developed in the literature
for EO tasks? (RQ2) How are xAI explanations analyzed,
interpreted, and evaluated? (RQ3) What are the objectives and
findings of using xAI in RS? (RQ4) How do the utilized xAI
approaches in RS align with the recommended practices in the
field of xAI? (RQ5) What are the limitations, challenges, and
new developments of xAI in RS?
Therefore, this review provides the following main contri-
butions:
•First, we present an overview and categorization of cur-
rent xAI methods (Section III).
•Second, we summarize the state-of-the-art (SOTA) xAI
approaches in RS through the analysis of a comprehensive
literature database (Section IV - RQ1 and RQ2).
•Third, we identify objectives and practices for evaluating
xAI methods in RS (Section IV - RQ3).
•Finally, we discuss challenges, limitations, and future
directions for xAI in RS (Section V - RQ4 and RQ5).
II. RELATED WORK
Numerous resources for either the field of xAI [10,19,20] or
ML for EO applications [1,21] are available in the literature.
In contrast, to the best of our knowledge, there are only two
reviews [13,17] in the overlapping area of these two fields.
Gevaert aims to summarize the existing works of xAI in EO
and addresses the xAI usage from a regulatory and societal
perspective, discussing the requirements and type of xAI that
is needed in EO from policy, regulation, and politics [13]. The
work of Roscher et al. categorizes the identified works in xAI
according to the general challenges in the bio- and geosciences
[17]. Their categorization of xAI properties and their emphasis
on considering expert knowledge constitute two highlights of
this review. Furthermore, the presented challenges are still
faced by researchers today. However, neither of these reviews
uses a broad literature database necessary to provide a compre-
hensive overview of current xAI approaches in EO. There also
exist xAI reviews for specific EO tasks or perspectives [14–16,
22]. In detail, Leluschko and Tholen conduct a review on the
stakeholders and goals within human-centered xAI applied in
RS [14]. Their findings indicate an underrepresentation of non-
developer stakeholders in this area. Hall et al. [15] review
DL methods and investigate to which degree the methods
can explain human wealth or poverty from satellite imagery.
Next, Xing and Sieber [16] focus on xAI in conjunction with
Deep Neural Networks (DNNs) that incorporate geographic
structures and knowledge. They discuss three challenges when
applying xAI to geo-referenced data: challenges from xAI,
geospatial AI, and geosocial applications. Based on a short
use case on land use classification and relying on SHapley
Additive exPlanations (SHAP) explanations, they show that
the geometry, topology, scale, and localization are of great
importance. Finally, a group at Colorado State University
published a survey of their work using xAI for climate and
weather forecasting [22]. While all of these studies shed light
on the xAI practices on specific EO applications, they do
Fig. 2. Categorization of xAI methods based on Ras et al. [10]. The taxonomy comprises four primary categories: Feature Attribution (Backpropagation methods such as Activation Maximization, Gradient, Integrated Gradients, Deconvolution, CAM and variants like Grad-CAM, and LRP; Perturbation methods such as Occlusion Sensitivity, PDP, and ALE), Distillation (Local Approximation with LIME and SHAP; Model Translation into rules, trees, or graphs), Intrinsic (Interpretable-by-Design models such as decision rules and trees, Linear Regression, GLM, GAM, and LDA; Embedding Space approaches such as the attention mechanism, activation assessment, and concept discovery, e.g., TCAV and ACE; Joint Training via explanation association, prototype learning, and model association, e.g., text explanations), and Contrastive Examples (counterfactuals and example-based explanations). Each method is additionally labeled as post-hoc or ante-hoc, model-agnostic or model-specific, and local or global.
not comprehensively summarize the work done in the broad
research field of xAI in EO.
To overcome these shortcomings, we approach xAI in EO
systematically and provide an extensive literature database,
resulting in a comprehensive summary of the current literature.
This work is characterized by our interest in the application and usage of xAI techniques on RS data, while other reviews focus on applications in the geosciences and natural sciences or on social implications. Not only do we present an overview of the
current challenges in the field, but we also highlight the state-
of-the-art methods for tackling these challenges. We believe
this could provide valuable insight into current limitations
faced in the field. Compared to the existing literature, we look
at the topic from a technical perspective without reflecting
on regulatory or ethical implications originating from inte-
grating xAI into the field of EO. Because of the orthogonal
approaches in physics-aware ML, uncertainty quantification, or
causal inference, this review excludes works from related and
overlapping domains. Instead, we refer the reader to overviews
in these fields: [23], [24], and [25,26], respectively.
III. EXPLAINABLE AI METHODS IN MACHINE LEARNING
This chapter provides a general overview of xAI. We first
present the taxonomy used to describe common distinctions
between explanation methods. Subsequently, we introduce our
categorization of xAI methods. Furthermore, we describe in
detail the commonly used methods in the field of RS in
Appendix B. Finally, we give an overview of the different
metrics proposed in the literature for evaluating these methods
and present the main objectives of using xAI. There exist
several terms for explainable AI, such as interpretable Machine
Learning (IML) and interpretable AI. These terms often refer
to the same concept: the explanation or interpretation of AI
models [27], and we will use them interchangeably.
In the literature, three common distinctions exist when cat-
egorizing xAI methods: ante-hoc vs. post-hoc, model-agnostic
vs. model-specific, and local vs. global [12,19,28,29]. The
ante-hoc and post-hoc taxonomy refer to the stage where the
explanation is generated. A xAI method that provides inter-
pretations within or simultaneously with the training process
is called ante-hoc. In contrast, a post-hoc method explains
the model after the training phase using a separate algorithm.
Further, model-agnostic methods have the ability to generate
explanations for any model. The method does not access the
model’s internal state or parameters, and the explanations
are created by analyzing the changes in the model’s output
when modifying its inputs. Opposed to them, model-specific
methods are exclusively designed for specific architectures
and typically have access to the model’s inner workings.
Finally, local methods explain individual instances and the
model’s behavior at a particular sample. In contrast, global
methods explain the model’s behavior on the entire dataset. In
practice, local explanations can be leveraged to achieve global
explanations. Through aggregation over a set of input instances
chosen to represent the dataset, local explanations can provide
insights into the general behavior of the inspected model.
The aggregation mechanism should be carefully defined since
straightforward aggregation on some xAI methods might lead
to erroneous results. Some researchers further investigated
the question of finding meaningful aggregation rules of local
explanations [30,31].
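To make this aggregation step concrete, the following minimal Python sketch (using placeholder attribution values rather than the output of any specific xAI method) derives a global feature ranking from local attributions by averaging their absolute values per feature:

```python
import numpy as np

# Hypothetical local attributions: one row per instance, one column per feature,
# e.g. as returned by a local xAI method.
local_attributions = np.random.randn(500, 10)   # placeholder values

# A common (but not universally valid) aggregation: mean absolute attribution
# per feature. Taking absolute values avoids positive and negative
# contributions cancelling each other out across instances.
global_importance = np.abs(local_attributions).mean(axis=0)

# Rank features by their aggregated importance.
ranking = np.argsort(global_importance)[::-1]
print("Most important feature indices:", ranking[:3])
```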
A. Categorization of Explainable AI Methods
In this section, we introduce a categorization of the most im-
portant xAI methods in the literature that is presented in Figure
2. We build on the categorical foundation of [10] and further
adjust these categories to capture a larger set of xAI methods.
The categories are structured in a tree-like design. The tree
has two internal layers, which describe a hierarchy of primary
and secondary categories, and a leaf layer, which indicates an
individual or group of specific methods. Additionally, these
methods are labeled according to the three universal categories
described above. In the following, a high-level overview of
the categories of xAI methods is explained following the
structure of Figure 2. Our four primary categories are feature
attribution, distillation, intrinsic explanations, and contrastive examples, and are visualized in Figure 3. Feature attribution
methods highlight the input features that significantly influence
the output. Alternatively, distillation builds a new interpretable
model from the behavior of the complex model. Intrinsic
methods focus on making the model itself or its components
inherently interpretable. Lastly, contrastive examples concen-
trate on showing simulated or real examples and allow an
explanation by comparing them.
For the sake of completeness, it should be noted that feature
selection methods are not considered in our categorization and
are listed separately in the results. Even though they are related
to feature interpretation, they serve different purposes. Feature
selection can be defined as a strategy to reduce the dimension-
ality of input space to improve the model performance and
reduce its computational cost [32]. While feature selection can
constitute the first step in a ML pipeline, feature interpretation
is usually the last step and typically involves more advanced
techniques than only looking at the predictive performance.
Therefore, feature selection and xAI can be complementary.
On the one hand, feature selection reduces the input space that
needs to be interpreted. On the other hand, xAI can provide
more qualitative insights for selection, such as uncovering a
bias introduced by a feature.
1) Feature Attribution: Feature attribution methods rely on
the trained ML model to estimate the importance of the input
features. Depending on whether the explanations are generated
by inspecting the model internals or by analyzing the changes
in the model’s output after modifying the input features, the
methods in this category are further split into backpropagation
and perturbation methods. The output of these methods is
commonly a saliency plot which determines the contribution
of the input features to the model prediction. In the case of
imagery inputs, the output of these methods can be visualized
as a heatmap, also called a saliency map, which highlights
regions relevant to the model prediction.
a) Backpropagation: These methods leverage the inher-
ent structure of DNNs to estimate the relevant features by
propagating the output of the network layers to the input
Fig. 3. Visualization of representative methods for the four main xAI categories, given models that receive images as input. (1) Feature attribution: The
graph on the top exemplifies perturbation strategies, representing the well-known Occlusion method. This method computes the sensitivity of the outputs
y to perturbations when sliding a mask over the input image, which generates an attribution map. The second graph illustrates backpropagation methods,
which compute gradient-based attribution maps. Attribution maps are normalized to [0,1] for visualization purposes. (2) Distillation: Within this category, we
differentiate between local approximation methods (top) and model translation methods (bottom). Local approximation methods provide explanations for each
sample by training a local surrogate model (a linear model is shown as an example on the top plot), given perturbed samples. Usually, these methods mask
superpixel regions for image-type inputs. Model translation methods rely on global surrogate models that attempt to reproduce the black-box decisions, given
all samples. (3) Intrinsic: A joint training method is illustrated in this example. The embedding space of the model is organized into different groups, and the
centroids of these clusters constitute prototypes, which can be visually compared with other instances (prototype members). (4) Contrastive examples: Here,
we visualize the decision boundary of a model. The graph identifies a counterfactual, which is a sample generated by changing the model decision, and an
example-based instance, typically a similar data point from the training set. (Image contains modified Copernicus Sentinel 2 data (2020), processed by ESA.)
features. The majority of these methods compute gradients
for this purpose. The Deconvolution method, also known as
Deconvolutional Neural Network (NN) [33], is designed to
reverse the convolutional operations from Convolutional Neu-
ral Networks (CNNs). Reconstructing the input space from the
feature maps of the CNN allows visualizing which information
was learned and how the input is transformed across different
network layers. Similarly, Layer-wise Relevance Propagation
(LRP) [34] calculates relevance scores for individual input
features through layerwise backpropagating the neuron’s acti-
vations from the output, utilizing specialized propagation rules.
The scores indicate the significance of the connection between
input and output. Various adaptations have been proposed
which apply different propagation rules based on the design
of the networks [35–37]. In contrast, the Gradient or Saliency
method [38] uses the partial derivative with respect to the
input to create the attribution maps. Rather than computing
the gradient once, Integrated Gradients [39] calculates the
integral of gradients with respect to the input features along
an interpolation or path defined between a baseline input and
the instance to be explained. Class Activation Mapping (CAM)
[40] visually explains CNNs through attribution heatmaps by
introducing a global pooling layer right before the top fully
connected one. Using the weights of the latter layer for a
particular class, the heatmap is generated by computing the
weighted average of the activation maps in the last convo-
lutional layer before being upsampled to match the size of
the input tensor for explainability purposes. One extension of
CAM is Gradient-weighted Class Activation Mapping (Grad-
CAM) [41], which replaces these weights by the gradient of
the output with respect to the last convolutional layer, thus
removing the original requirement of a final global pooling
layer.
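To illustrate the gradient-weighting idea behind Grad-CAM, the following simplified PyTorch sketch (assuming a recent torchvision, and using an untrained ResNet-18 with its last convolutional block as a stand-in target layer) collects activations and gradients via hooks and combines them into a normalized heatmap. It is a didactic sketch, not a reference implementation:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Minimal Grad-CAM sketch: any CNN with a known target layer works alike.
model = models.resnet18(weights=None).eval()
activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["value"] = out.detach()            # feature maps of the target layer

def bwd_hook(module, grad_in, grad_out):
    gradients["value"] = grad_out[0].detach()      # gradients w.r.t. those feature maps

layer = model.layer4                               # assumed target convolutional block
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)                    # placeholder input image
scores = model(x)
scores[0, scores.argmax()].backward()              # gradient of the top predicted class

# Weight each activation map by its spatially averaged gradient, sum, and apply ReLU.
weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
```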
b) Perturbation: The perturbation methods assess feature
importance by measuring the sensitivity of model predictions
to changes in the input features. These methods are distin-
guished by how the features are perturbed. Among others,
perturbations include blurring, averaging, shuffling, or adding
noise. For example, the Occlusion method [42] removes a feature by occluding it with a neutral value. Per-
mutation Feature Importance (PFI) [43] permutes the features
along their dimension, destroying the original relationship
between input and output values. The Partial Dependence Plot
(PDP) [44] method is designed to show the average influence
of a single input feature on the decision while marginalizing over the remaining features. Therefore, it assumes feature independence. A similar approach called Accumulated
Local Effects (ALE) [45] can handle correlated features by
averaging over the conditional distribution.
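The short sketch below illustrates the perturbation principle with scikit-learn's permutation importance on synthetic data that stands in for, e.g., spectral features; it is only meant to show the mechanics of PFI:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Illustrative sketch on synthetic data; in an RS setting X would hold e.g.
# spectral bands or indices and y the target class.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Permute each feature several times and record the mean drop in accuracy.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: importance = {result.importances_mean[idx]:.3f}")
```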
2) Model Distillation: Model distillation methods approx-
imate the predictive behavior of a complex model by training
a simpler surrogate model that is usually interpretable-by-
design. By replicating the predictions of the complex model,
the surrogate model offers hypotheses about the relevant
features and the correlations learned by the complex model
without providing further insights into its internal decision
mechanism. Distillation approaches are categorized into a)
local approximation methods, which train the surrogate model
in a small neighborhood around an individual local example,
and b) model translation methods, which replicate the behavior
of the complex model over the entire dataset.
a) Local Approximation: Approaches in this category fo-
cus on explaining individual predictions of the complex model
by inspecting a small neighborhood around the instances to be
explained. In contrast to the backpropagation and perturbation
methods, which operate on the raw input features, the local
approximation approaches transform the input features into
a simplified representation space, such as superpixels for
imagery inputs. A prominent approach in this category is
Local Interpretable Model-agnostic Explanation (LIME) [46].
It creates a new dataset in the neighborhood of the target
instance by perturbing its simplified representation. Next,
an interpretable surrogate model is trained to approximate
the predictions of the complex model on this newly created
dataset. Hence, the explanation for the complex model is
distilled to the interpretation of the surrogate model. A similar
strategy is employed by the SHAP framework [47]. Concretely,
Lundberg and Lee [47] introduce Kernel SHAP, which utilizes
the LIME’s framework under specific constraints to obtain the
feature importance by approximating their Shapley values, a
method grounded in game theory for estimating the player’s
contribution in cooperative games [48]. While Kernel SHAP
is a model-agnostic approach, Shapley values can also be
approximated with model-specific approaches such as Deep
SHAP [47] for neural networks and Tree SHAP [49] which
enables fast approximation of these values for tree-based
models.
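As a usage sketch of the local approximation idea, the snippet below applies Tree SHAP to a tree ensemble trained on synthetic data (assuming the shap package is available); the resulting per-sample attributions can also be averaged in absolute value for a rough global summary:

```python
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Sketch of Tree SHAP on a tree ensemble; data and model are placeholders.
X, y = make_regression(n_samples=200, n_features=6, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)        # model-specific, fast for tree models
shap_values = explainer.shap_values(X)       # local attribution per sample and feature

# Local explanation of the first sample and a simple global summary.
print("Attributions for sample 0:", shap_values[0])
print("Mean |SHAP| per feature:", abs(shap_values).mean(axis=0))
```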
b) Model Translation: These methods approximate the
model’s decisions on the entire dataset with a simple global
surrogate model. Typically, the interpretable-by-design meth-
ods summarized in the next section are used as surrogate
models, such as rule-based [50,51], tree-based [52,53] or
graph-based [54,55].
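A minimal model translation sketch is given below: a shallow decision tree is fitted to the predictions of a black-box classifier, and its fidelity to the black box is reported (all models and data are illustrative placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a stand-in black-box model on synthetic data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
black_box = MLPClassifier(max_iter=500, random_state=0).fit(X, y)

# Global surrogate: mimic the black-box predictions with a shallow tree.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))       # train on black-box labels, not y

fidelity = np.mean(surrogate.predict(X) == black_box.predict(X))
print(f"Surrogate fidelity: {fidelity:.2f}")
print(export_text(surrogate))                # human-readable decision rules
```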
3) Intrinsic: Intrinsically interpretable ML models pro-
vide an explanation by themselves based on their structure,
components, parameters, or outputs. Alternatively, a human-
interpretable explanation can be obtained by visualizing them.
a) Interpretable-by-Design: These methods are inter-
pretable by humans because of their simplicity in design,
architecture, and decision process. Decision Rules are hier-
archical IF-THEN statements, assessing conditions and de-
termining a decision. Fuzzy rules [56] are designed to ad-
dress uncertainty and imprecision, frequently encountered in
nature. While classical precise rules struggle to represent this
uncertainty, fuzzy rules incorporate it, e.g., by using partial
membership of classes (fuzzy sets) [57]. Due to their prox-
imity to natural language, they are interpretable by humans.
The Decision Tree [58] greedily learns decision rules. Its internal structure is a binary tree, where each internal node is
a condition and each leaf is a decision.
Generalized Additive Models (GAMs) [59] are statistical
modeling techniques that approximate the response variable
as a sum of smooth, non-linear functions transforming the
input features. On the other hand, Generalized Linear Models
(GLMs) [60] consider a linear relationship defined through a specific distribution: the response variable is computed as a weighted sum of the input features, with one coefficient per feature, which allows representing the mean of various exponential-family distributions. A simple example of this model type is Linear Regression (LR), which assumes a Gaussian data distribution and an identity link function. In this context, the model explanation can be obtained
by examining the coefficients.
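As a simple illustration of interpretability by design, the sketch below fits a linear regression on synthetic data and reads off its coefficients as the explanation:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Interpretable-by-design sketch: the coefficients of a linear model directly
# quantify the (linear) effect of each input feature on the response.
X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
model = LinearRegression().fit(X, y)

for i, coef in enumerate(model.coef_):
    print(f"feature {i}: coefficient = {coef:.2f}")
print("intercept:", round(model.intercept_, 2))
```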
A well-known approach for generative probabilistic topic
modeling is Latent Dirichlet Allocation (LDA) [61]. A dataset
is assumed to be organized in corpora or collections, and
each collection contains discrete units, such as documents
comprised of words. A distribution of these units character-
izes both the collections and the topics. By analyzing the
proportions of the units in the collections, LDA estimates
the underlying topics. The decisions of the Latent Dirichlet
Allocation (LDA) model and the identified topics can be
interpreted by examining the predicted proportions for each
collection.
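The following sketch illustrates the LDA workflow with scikit-learn on synthetic count data; in the RS works discussed later, the "words" correspond, for example, to superpixel-derived visual words and the "documents" to image patches:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# LDA sketch on synthetic count data (documents x vocabulary).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=1.0, size=(100, 50))     # 100 documents, 50-word vocabulary

lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(counts)
doc_topics = lda.transform(counts)                # per-document topic proportions
top_words = lda.components_.argsort(axis=1)[:, ::-1][:, :5]

print("Topic proportions of document 0:", doc_topics[0].round(2))
print("Top word indices per topic:", top_words)
```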
b) Embedding Space: These approaches process the ac-
tivations in the latent space of a DNN to interpret its workings.
The attention mechanism [62,63] creates high-level feature
representations by using the attention weights to model the
dependencies between the different elements in the input.
Hence, visualizing the attention weights is a common pro-
cedure to assess the relevant features for the model decisions.
Activation assessment analyzes the activations in the latent
space of a NN based on projection techniques. Commonly
used are dimensionality reduction approaches [64] (e.g., t-
distributed Stochastic Neighbor Embedding (t-SNE) [65] and
Uniform Manifold Approximation and Projection (UMAP)
[66]) or neuron receptive fields in CNNs [67]. Concept-based
explanations summarize the activations in the latent space
in terms of interpretable, high-level concepts. The concepts
represent visual patterns in computer vision, and they are
either user-defined in Testing with Concept Activation Vectors
(TCAV) [68] or learned in an unsupervised manner [69]. These
approaches provide global explanations by quantifying the
concept relevance per class.
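To illustrate activation assessment, the sketch below projects (randomly generated placeholder) latent activations to two dimensions with t-SNE and plots them colored by class label:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder latent activations and class labels; in practice these would be
# extracted from an intermediate layer of a trained network.
activations = np.random.randn(300, 128)
labels = np.random.randint(0, 4, size=300)

embedded = TSNE(n_components=2, random_state=0).fit_transform(activations)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=10)
plt.title("t-SNE projection of latent activations")
plt.show()
```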
c) Joint Training: This category provides ante-hoc ex-
planations by introducing an additional learning task to the
model. This task is jointly optimized with the original learning
objective and is used as an explanation. The methods in this
family typically differ based on the explanation representation
and how the additional task is integrated with the original
model.
Explanation association methods impose the inference of a
black-box model to rely upon human-interpretable concepts. A
prominent approach in this category are the concept bottleneck
models that represent the concepts as neurons in the latent
space of the model. They first introduce an additional task of
predicting the interpretable concepts in an intermediate layer
of a DL model. Then, the model predictions are derived based
on these concepts. During training, a regularization term is
added to the loss function that enforces alignment of the latent
space according to the interpretable concepts [70]. To estimate
the concept importance directly, Marcos et al. predict the final
output by linearly combining the concept activation maps [71].
Prototype learning approaches aim to identify a set of
representative examples (prototypes) from the dataset and
provide an interpretable decision mechanism by decomposing
the model predictions based on the instance’s similarity with
the learned prototypes [72]. Thus, visualizing the prototypes
enables global model interpretability, while the similarity with
an input instance offers local model explanations. One popular
approach for prototype learning on image classification tasks
is the ProtoPNet architecture introduced by Chen et al. [73].
The prototypes represent image parts and are encoded as
convolutional filters in a prototype layer of the proposed
network. Their weights are optimized with the supervised
learning loss of the network and additional constraints that
ensure both the clustering of the prototypes according to
their class and the separability from the other classes. One
extension of this approach is the Neural Prototype Trees [74],
which organizes the prototypes as nodes in a binary decision
tree. Each node computes the similarity of the corresponding
prototype with an instance. These similarities are used to route
the instance towards the leaves of the tree containing the class
predictions.
In contrast to the previous approaches, which encode the
explanation within the model that performs inference for
the original learning task, the model association methods
introduce an external model that generates explanations. These
approaches are often utilized to provide textual explanations
for CV tasks. An example of such an approach is presented
by Kim et al., who derive text explanations for self-driving
cars based on jointly training a vehicle controller and a
textual explanation generator [75]. The vehicle controller is a
CNN model that recognizes the car’s movements with spatial
attention maps. Next, the explanation generator, which is a
Long Short-Term Memory neural network (LSTM) model,
processes the context vectors and the spatial attention maps
from the controller to produce the text explanation.
4) Contrastive Examples: Methods within this category
provide alternative examples to an input instance and allow
obtaining an explanation by comparing them. Usually, exam-
ples that are close to each other in the input space yet lead to
a different outcome than the original input instance are shown.
a) Counterfactuals: This explanation type aims to dis-
cover the smallest change required for an instance to achieve
a predefined prediction. Essentially, they answer the question,
”Why does it yield output X rather than output Y?” which is
very close to human reasoning [76]. They are closely related to adversarial examples, although their objectives differ significantly. Adversarial examples usually aim to achieve a confident prediction with a minimally perturbed instance, whose change should remain imperceptible to humans. Conversely,
counterfactuals aim to provide a diverse set of examples and
should allow representing the decision boundary of the model.
Wachter et al. [77] introduce an optimization problem whose
primary objective is to find a counterfactual that is as close as
possible to the original input. As such, the distance function
between the counterfactual and the original input should be
minimized.
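A simplified sketch of this optimization view is given below: starting from the original input, gradient descent minimizes a weighted sum of a prediction loss towards the desired target class and an L1 distance to the original instance. The model, data, and weighting are placeholders, and the formulation is a simplified variant of the original objective:

```python
import torch

# Counterfactual search sketch for a differentiable classifier.
model = torch.nn.Sequential(torch.nn.Linear(4, 2))   # stand-in black box
x_orig = torch.randn(1, 4)                            # original instance
target = torch.tensor([1])                            # desired class
lam = 0.1                                             # weight of the distance term

x_cf = x_orig.clone().requires_grad_(True)
opt = torch.optim.Adam([x_cf], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    pred_loss = torch.nn.functional.cross_entropy(model(x_cf), target)
    dist_loss = (x_cf - x_orig).abs().sum()           # stay close to the original
    (pred_loss + lam * dist_loss).backward()
    opt.step()

print("Counterfactual:", x_cf.detach())
print("Change:", (x_cf - x_orig).detach())
```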
b) Example-based Explanations: Unlike counterfactuals,
which can generate artificial instances, example-based expla-
nations usually present existing ”historical” training instances
and showcase similar instances to the input under consid-
eration [64]. The user can connect, correlate, and reason
based on the analogies. The explanatory approach aligns with
case-based interpretable-by-design model explanations, e.g., k-
Nearest-Neighbors (kNNs). For example, Mikolov et al. [78]
train a skip-gram model and evaluate the model using the
nearest neighbors determined by the distances in the embedded
space. Furthermore, they illustrate that the acquired word
representations have a linear relationship, allowing for the
computation of analogies through vector addition.
B. Evaluation of Explainable AI Methods
Evaluating explanation quality and its trustworthiness is an
essential methodological challenge in xAI, which has received
considerable attention in recent years. The existing evaluation
strategies can be categorized into (1) functional approaches
based on quantitative metrics and (2) user studies [79].
1) Functional Evaluation Metrics: Quantitatively evaluat-
ing explanations poses a challenge because of the lack of
ground truth explanations against which the generated expla-
nations can be compared. The numerous functional evaluation
metrics typically assess different aspects of the explanation
quality by quantitatively describing to which extent it satisfies
a certain set of desired properties. Nauta et al. propose
to categorize the 12 essential explanation quality properties
into three main groups and suggest to evaluate as many as
possible of these properties for a comprehensive quantitative
explanation assessment [80]. The first group of properties
evaluates the explanation content, describing how correct,
complete, consistent, or discriminative the explanations are,
among other properties. Conversely, properties in the second
group, such as complexity, which refers to the sparsity of
an explanation, or composition that can measure whether the
explanation can localize the ground-truth region of interest,
assess the presentation of the explanations. Last but not least,
the third group of properties focuses on the relevance of
explanations for the user and their alignment with the domain
knowledge.
Faithfulness and robustness, included in the first group of
properties, are two of the most common evaluation metrics in
the literature. The faithfulness property (also called correct-
ness) asserts how close an explanation method approximates
the actual model workings. Various functional metrics are
proposed to evaluate explanation faithfulness. For example,
metrics based on randomization tests are introduced in [81]
to measure the explanation sensitivity to randomization in
model weights and label permutation. The results reveal that
most of the evaluated backpropagation methods do not pass
these tests. Another common approach to evaluate explanation
faithfulness is based on the perturbation of the input features.
For instance, [82] measures the changes in the model out-
put after perturbation of the supposedly important features,
as estimated by the explanation method. Perturbation-based
metrics are also used to evaluate robustness (also referred to
as explanation sensitivity), which inspect the impact of small
changes in the input features on the resulting explanation [83,
84]. Explanations with low sensitivity are preferred as this
indicates that the explanation is robust to minor variations in
the input. However, it is worth noting that perturbation-based
metrics might result in examples with different distribution
than the instances used for model training, which questions
whether the drop in model performance can be attributed to
the distribution shift or to the perturbation of the important
features. To address this issue, Hooker et al. [85] propose
model retraining on a modified dataset where a fraction of
the most important features identified by the xAI method is
perturbed. Further, Rong et al. [86] present an improved evaluation strategy based on information theory that avoids the need for model retraining.
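The following sketch illustrates a simplified deletion-style faithfulness check: features are removed in the order given by an attribution (here a crude coefficient-times-input stand-in for any xAI method), and the predicted class probability is tracked; a faithful explanation should produce a steeply dropping curve. The sketch deliberately ignores the distribution-shift caveat discussed above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = LogisticRegression(max_iter=500).fit(X, y)

x = X[0].copy()
target = model.predict([x])[0]
attribution = np.abs(model.coef_[0] * x)           # stand-in attribution scores

scores = [model.predict_proba([x])[0, target]]
for idx in np.argsort(attribution)[::-1]:          # most important features first
    x[idx] = 0.0                                   # crude "removal" by zeroing
    scores.append(model.predict_proba([x])[0, target])
# A faithful explanation should make this curve drop steeply at the beginning.
print(np.round(scores, 3))
```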
2) User Studies: Conversely, experiments in which humans
evaluate the quality of explanations can also be conducted.
User studies in [79] are further categorized into an application-
grounded evaluation and human-grounded evaluation, depend-
ing on the evaluation task, the type of participants, and the
considered explanation quality criteria. On the one hand,
application-grounded evaluation studies typically involve do-
main experts who evaluate the explanation in the context of
the learning task. For instance, Suresh et al. [87] measure the
physicians’ agreement with various explanation methods for
the problem of classifying electrocardiogram heartbeats. On
the other hand, human-grounded evaluations are usually con-
ducted by participants who are typically not domain experts and who evaluate more general notions of explanation quality. For
example, Alqaraawi et al. [88] evaluate whether explanations
help lay users recruited via an online crowdsourcing platform
to understand the decisions of a CNN model for image clas-
sification. For a detailed survey on the user studies conducted
for xAI methods evaluation, we refer the reader to [89].
C. Explainable AI Objectives
In this study, we group the objectives of utilizing xAI in RS
according to the following four reasons, as defined in [27]:
(1) explain to justify, (2) explain to control, (3) explain to
discover, and (4) explain to improve. Explain to justify is
motivated by the need to explain individual outcomes, which
ensures that the ML systems comply with legislations, such
as enabling users the ”right to explanations”. Furthermore,
the explanations can enable a detailed understanding of the
workings of the ML model. Hence, the explain to control
objective is relevant for assessing model trustworthiness and
can help to identify potential errors, biases, and flaws of the
ML model. These insights can be used to discover scientific
knowledge and new insights about the underlying process that
is modeled with the ML system or to further improve the
existing model. The improvement techniques based on the
xAI insights are classified by Weber et al. into augmenting
the (1) input data, (2) intermediate features, (3) loss function,
(4) gradient, or (5) ML model [90]. For the sake of complete-
ness, we consider adapting existing xAI methods as another
improvement strategy.
Fig. 4. Number of publications per different xAI methods, grouped according to our categorization in Figure 2. Most RS studies rely on local approximation
approaches, particularly the SHAP method. Backpropagation approaches are used second most often, among which CAM methods receive the highest focus.
Perturbation and embedding space techniques are also prominently used, while the other xAI categories occur less frequently.
IV. EXPLAINABLE AI IN REMOTE SENSING
Here we summarize the research on xAI in RS and an-
swer the research questions RQ1, RQ2, and RQ3. First, we
bring attention to the common practices and highlight new
approaches (RQ1). Next, we summarize the methodologies
to understand and evaluate the model explanations (RQ2).
Further, we outline the common research questions across the
different RS tasks that practitioners aim to answer with xAI
(RQ3).
A. RQ1: Usage/Applications of Explainable AI in Remote
Sensing
Table I contains the full list of publications included in our
study. The papers are grouped by the different EO tasks (see Appendix D) and follow the xAI
categorization shown in Figure 2. The table lists the used xAI
methods, objectives, and evaluation types for each paper. Also,
Appendix C describes in detail the methods employed in the
three most common EO tasks: landcover mapping, agricultural
monitoring, and natural hazard monitoring.
1) Application of Explainable AI Methods: Figure 4 shows
the number of papers using a xAI method grouped by
the categories introduced, while Figure 5 illustrates all the
combinations of model and methods in the literature. Local
approximation methods are the most frequently used, appearing in over 94 publications. Owing to their popularity, they are applied to the most diverse set of EO tasks and models. As shown
in Figure 5, in most cases, they interpret tree-based models
(i.e., RF and tree ensembles), followed by CNNs and Mul-
tilayer Perceptrons (MLPs). Also, with over 65%, most of
these publications rely solely on local approximation methods
without evaluating other methods. Backpropagation methods
follow closely with 72 papers, mainly leveraging CAM variants. Besides many CNN architectures, most time series models, like the LSTM, are also interpreted with these methods. In contrast, most papers using transformers are among
the 35 publications that leverage embedding space interpre-
tation techniques, which was to be expected since attention
is already the centerpiece of the architecture. Trailing the SHAP and CAM variant methods by a large gap, 56 publications use various perturbation methods. The number of publications is fairly evenly distributed among these methods, but PFI is the most widely used, followed by Occlusion
and PDP. Further, 15 publications leverage a diverse set of
interpretable-by-design models. Although Generalized Linear
Models (GLMs), particularly LR, are the most common, newer
models like the Explainable Boosting Machines (EBM) are
gaining recognition. Only a few papers employ joint training,
like prototypes or explanation associations. Even less popular
are model translations, counterfactuals, and example-based ex-
planations. Furthermore, as shown in [18], local approximation
and perturbation methods follow an increasing trend, and it
can be expected that their proportion of publications will
further increase while other categories, like backpropagation,
Fig. 5. The number of times xAI categories, methods, and models are
mentioned in the identified literature. Typically, local approximation and
perturbation methods are applied to non-DL models such as RF and GB.
However, they also find usage in explaining CNNs. While CNN and LSTM
models mostly rely on backpropagation methods, the transformer models are
almost exclusively explained with embedding space techniques.
stagnate. Next, the Mean Decrease in Impurity (MDI) or
Gini importance is often used in feature selection for global
importance measurements and can be easily obtained for tree-
based methods [91–107]. This shows that xAI and feature
selection have fluid boundaries, as the Gini index is used for feature selection, i.e., when deciding on tree splits based on their purity, but also allows for the interpretation of the decision process of the model. Finally, Figures 4 and
5 show that interpretable-by-design models are leveraged in
diverse contexts. For instance, linear regression supports more
complex methods and evaluates the linear trends [108,109].
The application of GLM is observed for small sample sizes
[110] and Generalized Additive Models (GAMs) are used for
larger datasets [111–114]. Zhou et al. reduce the feature space
with Principal Component Analysis (PCA) in [111], while
advanced versions based on EBMs are introduced in [112,
113]. EBMs use pairwise feature interactions within the GAM
and consider gradient boosting to train each feature function
consecutively. Fang et al. follow this idea and use Feed Forward Neural Networks (FNNs) as function approximators
[114]. Further, a decision tree incorporating linear regression
models at the terminal leaves is employed in [115] to gain
scientific insights into the partitioning of precipitation into
evapotranspiration and runoff. In another example, Karmakar
et al. apply LDA to SAR images. Their bag-of-words approach
uses superpixels as words to do landcover mapping [116,117].
Last but not least, Martínez-Ferrer et al. analyze the weights of
the Gaussian Process (GP) to find anomalous samples [118].
Besides the above-described common practices of applying
xAI in RS, we also identified distinct modeling approaches,
such as concept bottleneck models, fuzzy logic-based models,
and integration of xAI into the training pipeline that are
applied to specific EO tasks.
In EO studies, concept bottleneck models are used to
associate model predictions of socioeconomic indicators with
human-understandable concepts. In addition to the work of
Levering et al. [119] (also pointed in [13]) that uses land-
cover classes as interpretable concepts to explain landscape
aesthetics, a similar approach is also used in two recent studies
[120,121]. Concretely, in [120], the same authors propose a
semantic bottleneck model for estimating the living quality
from aerial images with the interpretable concepts capturing
population statistics, building quality, physical environment,
safety, and access to amenities. The other work of Scepanovic
et al. estimates the vitality of Italian cities by relying on
vitality proxies such as land use, building characteristics, and
activity density [121].
Fuzzy logic-based models are another approach that is
mainly used to evaluate the trustworthiness of ML models.
An Ordered Weighted Averaging (OWA) fusion function is
presented in [122] for burned area mapping, which allows
controlling if the fusion results are affected by more false
positives than false negatives, and vice versa. At the same
time, it foresees if there are only a few highly or many
low relevant factors when providing a particular output. The
outputs of Fuzzy Logic Systems (FLSs) for tree monitoring
[123] can also be easily validated. Last but not least, measure-,
integral- and data-centric indices based on the Choquet integral
(an aggregation function defined with respect to the fuzzy
measure) are introduced in [124–127] intending to develop
more understandable ensembles for landcover mapping.
Lastly, some works explore integrating xAI into the training
pipeline [128,129]. Xiong et al. [129] leverage Grad-CAM
to create masks that occlude features the network has empha-
sized, encouraging the network to exploit other features [129].
The output of three NNs is merged through attention in another
study [130]. There, the outputs and the attributions of the Deep
Learning Important FeaTures (DeepLIFT) backpropagation
method are the key values of the attention layer. In contrast,
Li et al. [128] directly classify objects on top of the CAM
attribution map. Hence, their weakly supervised method does
not need bounding boxes, making the image labels sufficient.
2) Adapted Explainable AI Approaches: As described in
Section I, the popular xAI methods are initially designed to
work on natural images, which significantly differ from remote
sensing acquisitions. This raises the question of whether
the utilized xAI methods fit remote sensing data well. In
this respect, we identified several works that propose new
approaches considering remote sensing data properties to
produce better explainability insights. Figure 6 illustrates that
recently, there has been an increase in such approaches, with
50% of the publications being published in the last year, and
no novel method was identified before 2021. These approaches
typically adapt the existing xAI methods with a particular
focus on the CAM and Grad-CAM methods or propose new
Fig. 6. Number of papers adapting existing xAI methods, grouped per
year. The existing xAI methods are increasingly adapted to address the RS
challenges with the highest attention given to CAM techniques.
DL architectures. For example, Feng et al. exploit that the
target objects in SAR images occupy only a small portion of
the image to propose a new CAM method which, instead of
upsampling the feature map of the convolutional layer to the
input image, downsamples the input image to the feature map
of the last convolutional layer [131]. This operation results
in saliency maps that localize precisely the targets in SAR
images compared to the Grad-CAM method. Additionally, a
CAM method able to produce much more fine-grained saliency
maps than the prior CAM methods is introduced by Guo et
al. in [132]. Similar to Layer-CAM [133], they use shallow
layers to get more fine-grained results but also rely on scores,
following the idea of Score-CAM [134], which are not as
noisy as gradients. Another attempt to improve current CAM
methods was proposed by Marvasti-Zadeh et al., who utilize
the attribution maps from all network layers and decrease
their number by retaining only those maps which minimize the information loss according to the Kullback-Leibler (KL) divergence [135]. Additionally, local attribution maps are gen-
erated by masking the image and weighting the maps by the
corresponding bounding box and their prediction confidence.
These local maps must be smoothed with a Gaussian kernel
to avoid sharp boundaries in the resulting CAM. This novel
approach, called Crown-CAM, is evaluated on a localization
metric and outperforms (augmented) Score-CAM and Eigen-
CAM on a tree crown localization task. CAM variants for
hyperspectral images are developed in [136]. The saliency map
is now a 3D volume instead of a 2D image, and each voxel
attributes the different channels in depth, which provides pixel-
wise and spectral-cumulative attributions. Other Grad-CAM
adaptions proposed are a median pooling [137] and a pixel-
wise [138] variant.
When it comes to new DL approaches, a model prototype
approach for RS is proposed in [139], where the ProtoPNet
architecture [73] is adapted to also consider the location of
the features. The network is iteratively trained in 3 stages.
Firstly, the encoder and prototype layers are trained to produce
the prototypes. Secondly, the prototypes are replaced by the
nearest prototype of the corresponding class. Lastly, only the
output layer weights are trained to produce the final prediction.
In contrast to ProtoPNet, the prototype similarity is scaled with
a location value learned by the network. This acknowledges
the location of the prototypes in the image and makes them
location-aware. Another approach is presented in [140], where
a reconstruction objective is added to the loss function to
enable the Grad-CAM++ method to more accurately localize
multiple target objects within an aerial image scene.
Finally, we identified one approach that addressed model
agnostic methods. Specifically, Brenning adapts the model-
agnostic PFI method to incorporate spatial distances [141].
Analogous to the PFI method, the features are permuted, and
the mean decrease in predictive accuracy is assessed. Notably,
features are permuted across various predefined distances,
revealing the spatial importance or sensitivity of the model.
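The sketch below conveys the gist of such a spatially aware permutation importance under strong simplifications (synthetic features, random coordinates, and swaps restricted to samples at least a given distance apart); it is not Brenning's implementation, only an illustration of the idea:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic features, target, and placeholder point coordinates.
rng = np.random.default_rng(0)
X, y = make_regression(n_samples=300, n_features=5, random_state=0)
coords = rng.uniform(0, 100, size=(300, 2))
model = RandomForestRegressor(random_state=0).fit(X, y)
base_score = model.score(X, y)

def spatial_permutation(values, coords, min_dist, rng):
    """Swap each value with one from a sample at least min_dist away."""
    permuted = values.copy()
    for i in range(len(values)):
        far = np.where(np.linalg.norm(coords - coords[i], axis=1) >= min_dist)[0]
        if len(far) > 0:
            permuted[i] = values[rng.choice(far)]
    return permuted

for min_dist in (10, 50):
    X_pert = X.copy()
    X_pert[:, 0] = spatial_permutation(X[:, 0], coords, min_dist, rng)
    drop = base_score - model.score(X_pert, y)
    print(f"d >= {min_dist}: importance of feature 0 = {drop:.3f}")
```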
B. RQ2: Interpretation and Evaluation of Explanations in
Remote Sensing
1) Understanding and Validating Explanations: As men-
tioned in Section I, remotely sensed data usually depicts com-
plex relationships, which can hinder the intuitive understand-
ing of the semantics of the explanations. Therefore, an obstacle
when applying xAI in RS is explanation interpretation, as
the relevant features often do not have a straightforward
interpretation. We identify that this challenge is frequently
tackled by transforming the raw features into interpretable
features used for model training [142,143] or by associating
domain knowledge with the explanation at the post-hoc stage
[103,144].
a) Creation of Interpretable Features: Ensuring human-
understandable features is essential for comprehending input-
output relationships or gaining knowledge of the ML model.
Our study identified that these features are typically derived
with spectral indices or dimensionality reduction techniques.
For optical imagery, many works utilize standard spectral indices such as the Normalized Difference Vegetation Index (NDVI)
to create interpretable features. Further, the problem of cre-
ating interpretable features from SAR images is tackled by
Ge et al. in [138] by transforming the SAR pixels into
human-interpretable factors using the U-Net architecture. In
detail, they derive three interpretable variables from the VH
polarization backscatter coefficients and the VV polarization
interferometric coherence of the Sentinel-1 images, providing
insights into the temporal variance and minimum. While the
temporal variance changes between different crops and land-
forms over time, the temporal minimum is specific to flooded
rice fields due to their proximity to water. This facilitates the
understanding of the attribution of the applied Grad-CAM
method. Dimensionality reduction techniques can also help
derive an interpretable representation from a complex and
correlated feature space. This is demonstrated by Brenning, who employs structured PCA for feature space reduction and RF for classification [145]. His xAI analysis reveals that the behavior of the original features can be traced in the principal components, which allows extracting the relationships between the main feature groups.
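For illustration, the sketch below combines the two strategies mentioned above: NDVI as an interpretable spectral feature (computed from the NIR and red bands, e.g., Sentinel-2 B8 and B4) and a PCA compression of a correlated feature space; the plain PCA is only a stand-in for the structured PCA of [145], and the data are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: a standard interpretable feature."""
    return (nir - red) / (nir + red + 1e-8)

# Placeholder reflectance arrays standing in for Sentinel-2 B8 (NIR) and B4 (red)
nir_band = np.random.rand(256, 256)
red_band = np.random.rand(256, 256)
vegetation_index = ndvi(nir_band, red_band)

# Reduce a correlated feature space to a few components that can be inspected
# via their loadings and then passed to a classifier such as RF.
features = np.random.rand(1000, 20)          # placeholder tabular features
pca = PCA(n_components=5).fit(features)
compressed = pca.transform(features)         # model inputs
loadings = pca.components_                   # relate components to raw features
```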
Fig. 7. The number of times the evaluation types are considered in the different EO tasks. While anecdotal evaluation is prominent across the EO tasks, quantitative evaluation is rare, and most often, it is conducted for landcover mapping problems. A single work relies on user studies for agriculture, and three studies apply their methods to toy tasks.
b) Interpreting Explanations with Domain Knowledge: Domain knowledge is often utilized to reveal the semantics of
the relevant features from raw inputs. It is usually derived from
already established indicators or based on expert knowledge.
For instance, indicators such as NDVI describing vegetation
phenology are commonly used in agriculture monitoring. They
are typically related to the relevant time steps identified with an xAI approach to reveal the critical phenological events for crop
disambiguation [100,103,144] or yield prediction [146]. Fur-
ther, for landslide susceptibility, Zhang et al. relate the spatial
heterogeneity of the SHAP values to the natural characteristics
and human activities for the following factors: lithology, slope,
elevation, rainfall, and NDVI [147]. They argue that the
differences in factor contributions can be attributed to local
and regional characteristics such as topography, geology, or
vegetation. Other indicators are the landcover classes used by Abitbol and Karsai to identify the relationship between urban topology and average household income [148]. They relate the Grad-CAM attributions of the image pixels to their landcover classes and find that commercial/residential units are associated with lower income, while natural areas are associated with higher income.
In certain works, expert knowledge is used to validate
or interpret the findings. For example, the most important
drivers of landslides noted by the field investigation reports
are compared to the features sorted by explanation magnitudes
in [130]. LDA is leveraged in [116] for unsupervised sea ice
classification, and closely related classes are identified by KL
divergence. The LDA-derived topics and probabilities, together
with the interclass distances and segmented images, enable
experts to assess the physical relationship between these
classes. For example, water bodies, melted snow, and water
currents have a similar topic distribution and a substantial
physical similarity: liquid water. In [149], expert knowledge
is employed to interpret and guide the process of finding
and validating dwelling styles and their evolution within
ethnic communities. The building footprints are classified with
eXtreme Gradient Boosting (XGBoost), and xAI is applied
by leveraging SHAP to determine the importance. Then, the
experts infer the semantic meaning. The results reveal the emergence of mixed ethnic styles inheriting from the three traditional styles, which can be correlated with migration records.
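A minimal sketch of the SHAP-on-XGBoost workflow underlying such analyses follows; the feature matrix and labels are random placeholders, not the building-footprint data of [149].

```python
import numpy as np
import shap
import xgboost as xgb

# Placeholder data standing in for building-footprint descriptors and style labels
X = np.random.rand(300, 6)
y = np.random.randint(0, 3, size=300)

model = xgb.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # per-sample, per-feature attributions
# Domain experts then attach semantic meaning to the highest-ranked features,
# e.g., after inspecting a global summary such as shap.summary_plot(shap_values, X).
```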
2) Evaluation of Explainable AI Methods: As indicated in
Section III-B, xAI evaluation poses an open challenge. In
remote sensing, most of the literature relies on anecdotal evi-
dence, often involving the visualization of arbitrarily selected
or cherry-picked examples. Figure 7 illustrates that quantitative
evaluation is mostly conducted in well-established EO tasks,
such as landcover mapping and vegetation monitoring. Further,
a single user study is conducted for agriculture monitoring.
In addition to these types of evaluation, some authors [139,
150,151] evaluate their methods on straightforward toy tasks,
where humans can easily identify the sought relationships.
Paudel et al. exclusively conduct a user survey in which experts (1) assess the importance of the DL features and (2) judge the importance estimated by the xAI method [152]. Five crop modeling experts assigned importance scores to the features. Subsequently, these scores were compared to the feature importance estimated with SHAP. Afterward, the experts categorized the model explanations into four categories, from (strongly) agree to (strongly) disagree, and had to justify their decision.
Overall, it is demonstrated that experts can understand the
model explanations, and the explanations enable the experts
to get insights into the models. However, the task remains
challenging and has the potential for misconceptions about
the model behavior.
When it comes to the quantitative evaluation, we identified
15 studies testing the xAI methods on RS problems with
functional metrics. These metrics mainly assess the explanation quality properties described in Section III-B. Particularly, the backpropagation methods are evaluated most frequently, with a specific focus on the CAM approaches. In detail, Kakogeorgiou and Karantzalos [153] evaluate eight backpropagation
methods, Occlusion, and LIME on landcover mapping tasks.
Utilized metrics are max-sensitivity, file size, computation
time, and Most Relevant First (MoRef). Max-sensitivity mea-
sures the reliability (maximum change in explanation) when
the input is slightly perturbed, the file size is used as a
proxy for explanation sparsity, and MoRef measures how fast
the classification accuracy declines when removing the most
relevant explanations. The results indicate no obvious choice
for this task. While Occlusion, Grad-CAM, and LIME were
the most reliable according to the max-sensitivity metric, they
lack high-resolution explanations. Further, the studies [131,
136,137] evaluate various CAM methods on class sensitivity
and by measuring drop/increase in confidence when occluding
parts of the image. In a different study, [154] demonstrates
the low faithfulness of the Grad-CAM explanations based on
similar metrics. The authors also conduct model and data ran-
domization tests to find that Grad-CAM is sensitive to changes
in network weights and label randomization. Other studies also
evaluate the localization ability of CAM methods by turning
the attributions into segmentation masks and comparing the
IoU or classification accuracy [132,135,155]. Additionally,
[156] compares attention networks and CAM variants on the
metrics max-sensitivity and average % drop/increase in confi-
dence. Regarding other xAI approaches, the attention weights
are evaluated in [144] by inspecting drops in the accuracy
for crop mapping when the transformer model is trained on a
subset of dates with the highest attention values.
Fig. 8. The frequency of the objectives for using xAI in RS identified in the analyzed works, categorized according to the scheme introduced in Section III-C. Papers can appear in multiple categories due to ambiguous meanings or multiple motivations and objectives. The objectives to control and improve are widely used among the traditional EO tasks such as landcover or target mapping. The objective to discover is frequently found among the more recent EO tasks like monitoring the atmosphere and human environment interaction. Finally, the xAI approaches in RS rarely have the objective to justify.
The results
verify that attention weights select the key dates for crop
discrimination, as training the model with only the top 15
attended dates is sufficient to approximate the accuracy of
the model trained on the complete dataset. Further, Dantas
et al. [157] employ distinct metrics for their counterfactual
generation. These metrics aim to ensure a certain quality of
the counterfactuals. They used proximity (evaluating closeness to the original input instance, measured by the l2 distance), com-
pactness (ensuring a small number of perturbations across time
steps), stability (measuring the consistency for comparable
input samples) and plausibility (measuring adherence to the
same data distribution). Finally, despite the high usage of the
local approximation methods in RS, we did not find any works
that quantitatively evaluate their explanations.
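To make some of the functional metrics above concrete, the sketch below implements max-sensitivity as well as the proximity and compactness scores for counterfactuals in plain NumPy; the exact definitions in the cited works may differ, e.g., in the perturbation scheme or normalization.

```python
import numpy as np

def max_sensitivity(explain_fn, x, radius=0.05, n_samples=20, rng=None):
    """Largest change of an explanation under small input perturbations (lower is better)."""
    rng = rng or np.random.default_rng(0)
    reference = explain_fn(x)
    worst = 0.0
    for _ in range(n_samples):
        delta = rng.uniform(-radius, radius, size=x.shape)
        worst = max(worst, np.linalg.norm(explain_fn(x + delta) - reference))
    return worst

def proximity(x, x_cf):
    """Closeness of a counterfactual to the original instance (l2 distance)."""
    return np.linalg.norm(x_cf - x)

def compactness(x, x_cf, tol=1e-6):
    """Fraction of time steps/features that were actually modified (lower is better)."""
    return float(np.mean(np.abs(x_cf - x) > tol))
```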
C. RQ3: Explainable AI Objectives and Findings in Remote
Sensing
In this section, we analyze the motivation for applying xAI
in RS according to the common objectives specified in Section
III-C. The frequency of the objectives across the EO tasks is
visualized in Figure 8 and reflects a similar trend as in [13].
Namely, the objective to control is the most common and is followed, with a large gap, by the objectives to discover insights and to improve. Finally, the objective to justify is rare and
found only in a few works. Notably, this figure also shows that
in contrast to the objective to control, which is mostly identified
in the three most common EO tasks (landcover mapping,
agricultural monitoring, and natural hazard monitoring), the
objective to discover has a unique distribution across the
EO tasks as it frequently occurs in studies monitoring the
atmosphere, vegetation, and human environment interaction.
The studies with the objective to control mainly compare the inference mechanisms of various established ML mod-
els used for EO. Consequently, they are mostly conducted in
landcover mapping. Next, they are often found in agricultural
monitoring, where they mainly aim to assess the reliability
of the proposed models by quantifying the relevance of the
multitemporal information. For instance, Rußwurm and Körner
use the gradient method to measure the temporal importance
assigned by various DL models for crop classification [158].
Their analysis reveals that the transformer and the LSTM
approaches ignore the observations obscured by clouds and
focus on a relatively small number of observations when
compared to the CNN models. Following a similar approach,
Xu et al. evaluate the generalization capabilities of these
models when inference is performed for different years than
the ones used for model training [103]. Their experiments
indicate that the LSTM model adapts better to changes in
crop phenology induced by late plantation compared to the
transformer model. The objective to control also encompasses studies that anticipate the model decisions in scenarios that can occur in practical applications. For instance, Gawlikowski et
al. investigate the impact of exposing a DNN, initially trained
on cloud-free images, to cloudy images [159]. They apply
Grad-CAM to identify crucial regions for the classifier in both
image types. Their findings reveal several factors explaining why the network misclassifies the Out-Of-Distribution (OOD) examples, including the coverage of structures by clouds and shadows, as well as the homogeneity or heterogeneity induced by different cloud types.
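For reference, a minimal PyTorch sketch of the Grad-CAM computation applied in such analyses is given below; the model, the chosen convolutional layer, and the preprocessed input are assumed to be available.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, image, target_class):
    """Grad-CAM sketch: weight a conv layer's activations by the spatially
    averaged gradients of the target-class score and keep positive evidence."""
    store = {}
    h1 = conv_layer.register_forward_hook(lambda m, i, o: store.update(act=o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))
    score = model(image)[0, target_class]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = store["grad"].mean(dim=(2, 3), keepdim=True)       # (1, K, 1, 1)
    cam = torch.relu((weights * store["act"]).sum(dim=1))        # (1, h, w)
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                        mode="bilinear", align_corners=False)[0, 0]
    return cam / (cam.max() + 1e-8)
```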
The observed distribution of the EO tasks with the objective to discover in Figure 8 indicates that using xAI for gaining
new knowledge is usually applied to the recently explored
EO tasks. For instance, the problem of finding the key drivers
for wildfire susceptibility is tackled in [160]. Applying SHAP
and PDP on the trained ML model reveals that soil moisture,
humidity, temperature variables, wind speed, and NDVI are
among the most important factors associated with wildfires.
Regarding monitoring natural hazards, Biass et al. use SHAP
to identify that volcanic deposits, terrain properties, and veg-
etation types are strongly linked to vegetation vulnerability
after volcanic eruptions [161]. In detail, increased vegetation
vulnerability is associated with higher lapilli accumulations,
with crops and forests being the most and the least susceptible
vegetation types, respectively. In another study, Stomberg et al. aim to discover new knowledge about the ambiguously
defined concept of wilderness [162]. By analyzing occlusion
sensitivity maps from their proposed deep learning mod-
els for predicting wilderness characteristics, they reveal that
wilderness is characterized by large areas containing natural
undisrupted soils. This is in contrast to anthropogenic areas
that have specific edge shapes and lie close to impervious
structures. Besides discovering new knowledge, scientific in-
sights can also be used to identify the reasons for inaccuracies
in Earth system models. Silva et al. provide insights into a
lightning model’s structural deficits by predicting its error
with a gradient boosting algorithm and interpreting it with
SHAP [163]. The error is computed as the difference between
the output of the model and satellite observational data.
Their analysis reveals potential deficits in high convective
precipitation and landcover heterogeneities.
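The occlusion sensitivity maps mentioned above can be sketched as follows, assuming a predict_fn that maps an (H, W, C) image to class probabilities; the patch size, stride, and fill value are illustrative choices.

```python
import numpy as np

def occlusion_sensitivity(predict_fn, image, target_class, patch=16, stride=16, fill=0.0):
    """Slide a patch over the image, mask it, and record the drop in the
    target-class probability; large drops mark regions the model relies on."""
    H, W, _ = image.shape
    base = predict_fn(image)[target_class]
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    heatmap = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            occluded[i * stride:i * stride + patch, j * stride:j * stride + patch, :] = fill
            heatmap[i, j] = base - predict_fn(occluded)[target_class]
    return heatmap
```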
A large part of the works with the objective to improve
focus on adapting the existing xAI methods to RS tasks and
are already described in Section IV-A2. Concerning the other
techniques for model improvement based on xAI insights, we found that such insights are often used for data augmentation. For instance, Beker et al. simulate synthetic data for training a
CNN model for the detection of volcanic deformations [164].
By conducting an explainability analysis with Grad-CAM on
real data, the authors identify that the model wrongly pre-
dicts volcanic deformations on some patterns not considered
in the simulated data. These insights are used to improve
the prediction performance by fine-tuning the model on a
hybrid synthetic-real dataset that accounts for these patterns.
Further, Kim et al. propose an iterative classification model
improvement for satellite onboard processing through a weakly
supervised human-in-the-loop process [156]. An inconsistency metric, which measures the similarity of the attribution maps by emphasizing commonly highlighted regions, is introduced to identify uncertain explanations across the attention blocks. Experts refine poorly performing samples by labeling the incorrect pixels in the attention map. In the last step, the onboard model weights are updated by retraining with the refined data.
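As a rough illustration only, the hypothetical sketch below scores the inconsistency between attribution maps from different attention blocks as one minus their mean pairwise IoU after thresholding; the metric in [156] may be defined differently.

```python
import numpy as np

def inconsistency(attribution_maps, threshold=0.5):
    """1 - mean pairwise IoU of the regions highlighted by each attribution map."""
    masks = [m >= threshold * m.max() for m in attribution_maps]
    ious = []
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            inter = np.logical_and(masks[i], masks[j]).sum()
            union = np.logical_or(masks[i], masks[j]).sum()
            ious.append(inter / union if union else 1.0)
    return 1.0 - float(np.mean(ious))
```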
Lastly, the works with the objective to justify usually discuss the relevance of the explainability insights from the stakeholders' perspective. For instance, Campos-Taberner et al. argue that the
interpretability of the DL models in agricultural applications
is critical to ensure fair payouts to the farmers according to the
EU Common Agricultural Policy (CAP) [165]. By applying a
perturbation approach, they find that the summer acquisitions
and the red and near-infrared Sentinel-2 spectral bands carry
essential information for land use classification. Further, the
human footprint index, which represents the human pressure
on the landscape and can be a valuable metric for environmen-
tal assessments, is predicted from Landsat imagery in [166]. In
order to build the policymakers’ trust, Layer-wise Relevance
Propagation (LRP) is leveraged to visually highlight the rel-
evant features in the images. Lastly, the approach presented
in [167] can assist individual users during the production phase of an ML model, as it justifies the model's validity
for inference by providing example-based explanations. If the
provided example does not fit the input instance, the model
can be considered unreliable for the RS image classification
task.
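A generic sketch of such example-based justification is given below (not the exact method of [167]): the training instances closest to the query in an embedding space are retrieved, and a prediction can be flagged as unreliable when even the nearest examples are far away.

```python
import numpy as np

def nearest_training_examples(embed_fn, query, train_samples, k=3, max_distance=1.0):
    """Return the k closest training examples and whether the prediction looks reliable."""
    q = embed_fn(query)
    embeddings = np.stack([embed_fn(t) for t in train_samples])
    dists = np.linalg.norm(embeddings - q, axis=1)
    order = np.argsort(dists)[:k]
    reliable = bool(dists[order[0]] <= max_distance)  # nearest example must be close enough
    return order, dists[order], reliable
```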
TABLE I: A complete list of all relevant papers in this review, aggregated by EO Task and xAI
Category. (†xy indicates a new method which was derived from xy; a full list of acronyms can be
found in Appendix D.)
Earth Observation
Task
Explainable AI Category Paper, Explainable AI Methods Model Evaluation Objective
Toy Task
Anecdotal
Quantitative
Control
Improve
Discover
Justify
Agricultural Moni-
toring
Backpropagation [138] PWGrad-CAM†Grad-CAM CNN ✓ ✓
[168] RAM CNN ✓ ✓
Backpropagation, Embedding Space [158] Gradient, Activation Assessment, At-
tention
CNN, LSTM,
Transformer,
ConvLSTM
✓ ✓
Backpropagation, Feature Selection, Em-
bedding Space
[103] Gradient, Activation Assessment,
MDI, Attention
aLSTM,
Transformer, RF
✓ ✓
Backpropagation, Joint Training [169] Explanation Association, LRP GAN ✓ ✓ ✓
Backpropagation, Local Approximation [143] SHAP, IG LSTM ✓ ✓
[152] SHAP, IG LSTM ✓ ✓
Embedding Space [170] Attention Transformer ✓ ✓
Feature Selection [100] MDI, GFFS RF ✓ ✓
Feature Selection, Model Translation [99] MDI, Rule Extraction DT, RF ✓ ✓
Interpretable by Design [112] EBM EBM ✓ ✓
Joint Training [171] Prototype CNN ✓ ✓
Local Approximation [172] SHAP GB ✓ ✓
[173] SHAP GB ✓ ✓
[146] SHAP GB ✓ ✓
[174] SHAP RF ✓ ✓
[175] SHAP GB ✓ ✓ ✓
[176] SHAP GLM, kNN, GB,
SVM, RF, FNN
✓ ✓
Perturbation [141] Spatial Variable Importance
Profiles†PFI kNN LDA, LDA,
RF
✓ ✓
[177] PFI RF ✓ ✓
[178] GP ✓ ✓
Perturbation, Embedding Space [165] Activation Assessment, Occlusion LSTM ✓ ✓
[144] Attention, Occlusion Transformer ✓ ✓
Perturbation, Interpretable by Design [118] Occlusion, GP GP ✓ ✓
[108] ALE, PDP, PFI, GLM GLM, RF ✓ ✓
Atmosphere Moni-
toring
Backpropagation [179] IG NN Ensemble ✓ ✓
Backpropagation, Perturbation, Local Ap-
proximation
[180] XRAI, SHAP, Occlusion CNN ✓ ✓
Embedding Space [181] Activation Assessment FNN ✓ ✓
[182] Attention FNN ✓ ✓
Feature Selection [105] MDI Tree Ensemble ✓ ✓
[106] MDI Tree Ensemble ✓ ✓
[107] MDI Tree Ensemble ✓ ✓
Feature Selection, Joint Training, Embed-
ding Space
[104] Model Association, Activation As-
sessment, MDI
FNN, NN+ML En-
semble, RF
✓ ✓
Feature Selection, Local Approximation [97] MDI, SHAP RF ✓ ✓
Interpretable by Design [183] GLM†GLM GLM ✓ ✓
Local Approximation [184] SHAP FNN ✓ ✓
[185] SHAP GB, GLM ✓ ✓
[186] SHAP GB ✓ ✓ ✓
[187] SHAP GB ✓ ✓
Perturbation [188] PFI aCNN ✓ ✓
[189] ALE, PFI RF ✓ ✓ ✓
Perturbation, Feature Selection [102] MDI, PDP RF ✓ ✓
Perturbation, Local Approximation [190] Ceteris Paribus Profiles, PFI, LIME,
SHAP
RF ✓ ✓
[191] PDP, SHAP GB ✓ ✓
Building Mapping Backpropagation [192] Grad-CAM CNN ✓ ✓
[193] DeepLift, A*G CNN ✓ ✓
[194] Grad-CAM aCNN ✓ ✓
[155] Grad-CAM++, CAM, Grad-CAM,
Score-CAM, SmoothGrad-CAM++
CNN ✓ ✓
Backpropagation, Embedding Space [195] Attention, Grad-CAM CNN ✓ ✓
Backpropagation, Perturbation, Embedding
Space
[196] Activation Assessment, Occlusion, IG CNN, Transformer ✓ ✓ ✓
Feature Selection, Local Approximation [96] MDI, SHAP FNN, RF ✓ ✓
Local Approximation [149] SHAP GB ✓ ✓
Ecosystem Interac-
tions
Interpretable by Design [111] GAM GAM ✓ ✓
Interpretable by Design, Local Approxima-
tion
[109] GLM, SHAP GB, GLM ✓ ✓
Local Approximation [197] SHAP GB ✓ ✓
[198] SHAP GB ✓ ✓
Perturbation, Local Approximation [199] PFI, SHAP GB ✓ ✓
Human
Environment
Interaction
Backpropagation [148] GuidedGrad-CAM CNN ✓ ✓
[166] LRP CNN ✓ ✓ ✓
Backpropagation, Perturbation, Embedding
Space
[154] Grad-CAM, Activation Assessment,
Occlusion
NN Ensemble,
GLM
✓ ✓
Interpretable by Design [121] GLM GLM ✓ ✓
Joint Training [120] Explanation Association CNN ✓ ✓
[119] Explanation Association CNN ✓ ✓
Local Approximation [200] SHAP RF ✓ ✓
[201] SHAP GB ✓ ✓
[202] SHAP GB ✓ ✓
[203] SHAP GB ✓ ✓
[204] LIME, SHAP RF ✓ ✓
Perturbation [205] ALE RF ✓ ✓
[206] PFI CNN ✓ ✓
Perturbation, Embedding Space [162] Activation Assessment, Occlusion CNN ✓ ✓
Perturbation, Local Approximation [207] Occlusion, LIME, SHAP FNN ✓ ✓
Hydrology
Monitoring
Backpropagation [150] Gradient GP ✓ ✓ ✓
[208] Gradient GP ✓ ✓
[209] Grad-CAM CNN ✓ ✓ ✓ ✓
[210] Grad-CAM CNN ✓ ✓
[211] Gradient CNN ✓ ✓
[212] Grad-CAM CNN ✓ ✓
Backpropagation, Local Approximation [213] SHAP, IG CNN ✓ ✓
Backpropagation, Perturbation [214] DeepLift, PDP, Expected Gradients,
IG
LSTM ✓ ✓ ✓
Embedding Space [215] Attention CNN ✓ ✓
[151] Attention FNN ✓ ✓ ✓
Local Approximation [216] SHAP FNN ✓ ✓
[217] SHAP RF, GB, GB, DT ✓ ✓
Perturbation, Interpretable by Design, Local
Approximation
[115] ALE, PFI, Cubist, LIME GB, Cubist ✓ ✓
Landcover Mapping Backpropagation [136] 3DGrad-CAM†Grad-CAM CNN ✓ ✓ ✓ ✓
[159] Grad-CAM CNN ✓ ✓
[132] CSG-CAM†Grad-CAM, Grad-CAM++,
Grad-CAM, Score-CAM
CNN ✓ ✓ ✓
[140] ERC-CAM†CAM CNN ✓ ✓
[218] XRAI CNN ✓ ✓
[219] Grad-CAM CNN ✓ ✓ ✓
[220] Grad-CAM CNN ✓ ✓
[137] Grad-CAM++, SmoothGrad-CAM++,
Grad-CAM, MPGrad-CAM†Grad-CAM CNN ✓ ✓ ✓
[221] CAM CNN ✓ ✓ ✓
Backpropagation, Embedding Space [156] Attention, Layer-CAM, Attention,
CAM, Attention, Grad-CAM
CNN ✓ ✓
Backpropagation, Perturbation, Local Ap-
proximation
[153] DeepLift, I*G, Gradient, Grad-CAM,
Occlusion, IG, GuidedBackprop, LIME,
GuidedGrad-CAM
CNN ✓ ✓ ✓
Counterfactuals [157] GAN CNN ✓ ✓
Embedding Space [222] Activation Assessment CapsuleNet, CNN ✓ ✓
[223] Activation Assessment CNN ✓ ✓
[224] Attention CNN ✓ ✓
[225] Activation Assessment, Attention CNN ✓ ✓
Example-based [167] WIK CNN ✓ ✓ ✓ ✓
Feature Selection, Local Approximation [93] MDI, SHAP CNN, RF ✓ ✓
[94] MDI, LIME RF ✓ ✓
Interpretable by Design [226] Rules NN+Tree Ensemble ✓ ✓
[117] LDA LDA ✓ ✓
Interpretable by Design, Embedding Space [227] Activation Assessment, LDA NN Ensemble ✓ ✓
Joint Training [228] Prototype CNN ✓ ✓
Local Approximation [229] SHAP RF ✓ ✓
[230] LIME, SHAP CNN ✓ ✓
[231] SHAP CNN, GB, SVM,
RF
✓ ✓
[232] SHAP CNN ✓ ✓
[233] LIME CNN ✓ ✓
Local Approximation, Embedding Space [127] Activation Assessment, SHAP NN Ensemble ✓ ✓
[234] Activation Assessment, LIME CNN ✓ ✓
[125] Activation Assessment, SHAP NN Ensemble ✓ ✓ ✓
[126] Activation Assessment, SHAP NN Ensemble ✓ ✓
Perturbation, Feature Selection, Local Ap-
proximation
[92] MDI, PFI, SHAP RF ✓ ✓
Perturbation, Local Approximation [145] ALE, PDP, PFI, SHAP RF ✓ ✓
Natural Hazard
Monitoring
Backpropagation, Embedding Space [130] DeepLift, Attention NN Ensemble ✓ ✓ ✓
Backpropagation, Perturbation, Embedding
Space
[164] Activation Assessment, Occlusion,
Grad-CAM
CNN ✓ ✓ ✓