Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.DOI
Federated Onboard-Ground Station
Computing with Weakly Supervised
Cascading Pyramid Attention Network
for Satellite Image Analysis
TAEWOO KIM*, MINSU JEON*, CHANGHA LEE, JUNSOO KIM, GEONWOO KO (Member, IEEE),
JOO-YOUNG KIM, AND CHAN-HYUN YOUN (Senior Member, IEEE)
School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea
Corresponding author: Chan-Hyun Youn (e-mail: chyoun@kaist.ac.kr).
This work was supported by Samsung Electronics Co., Ltd (IO201210-07976-01) and by an Electronics and Telecommunications Research
Institute (ETRI) grant funded by the Korean government [22ZS1100, Core Technology Research for Self-Improving Integrated Artificial
Intelligence System]. (*Taewoo Kim and Minsu Jeon contributed equally to this work.)
ABSTRACT With advances in NanoSats (CubeSats) and high-resolution sensors, the amount of raw data to
be analyzed by human supervisors has been increasing explosively in satellite image analysis. To reduce
the raw data, satellite onboard AI processing with low-power COTS (Commercial, Off-The-Shelf) HW has
emerged from a real satellite mission. It filters useless data (e.g., cloudy images) that is worthless to
supervisors, achieving efficient satellite-ground station communication. In applications with complex object
recognition, however, an additional explanation is required for the reliability of the AI prediction due to
its low performance. Although various eXplainable AI (XAI) methods for providing human-interpretable
explanations have been studied, the pyramid architecture in a deep network leads to the background bias
problem, in which the visual explanation focuses only on the background context around the object. Missing
small objects in a tiny region leads to poor explainability even when the AI model correctly predicts the
object class. To resolve these problems, we propose a novel federated onboard-ground station (FOGS)
computing with a Cascading Pyramid Attention Network (CPANet) for reliable onboard XAI in object
recognition. We present an XAI architecture with a cascading attention mechanism that mitigates the
background bias for onboard processing. By exploiting the localization ability of pyramid feature blocks, we
can extract high-quality visual explanations covering both the semantic and the small contexts of an object.
To enhance the visual explainability of complex satellite images, we also describe a novel computing
federation with the ground station and supervisors. In the ground station, active learning-based sample
selection and an attention refinement scheme with a simple feedback method are conducted to achieve robust
explanations and low supervisor annotation cost simultaneously. Experiments on various datasets show that
the proposed system improves accuracy in object recognition and produces accurate visual explanations
that detect the small contexts of objects even in peripheral regions. Furthermore, our attention refinement
mechanism demonstrates that inconsistent explanations can be efficiently resolved with only very simple
selection-based feedback.
INDEX TERMS XAI, Visual Explanation, Satellite Image Analysis, Human-In-The-Loop
I. INTRODUCTION
Modern small satellites, including CubeSats, are becoming
technologies of growing interest in the space industry. As the
revisit period for acquiring raw images becomes shorter,
complex object recognition applications (e.g., object tracking,
detection, etc.) have become an active research field. However,
the massive amount of raw data makes it difficult for the
limited number of supervisors to inspect all image
patches over a broad area (covering >km in a raw image).
To assist the supervisors, DL-based image analysis has
emerged [1]–[6]. By providing prediction results for object
recognition, the DL model enables efficient analysis of
FIGURE 1: The characteristics of satellite images make it
difficult for a DL model to infer accurate predictions and
visual explanations. Each image may contain scale-variant
and rotation-equivariant objects of the same category (airplane).
In addition, the training dataset includes multiple image
resolutions due to the optical sensor, altitude, etc.
the large-scale data with high accuracy. Furthermore, with
the advent of low-power AI computing HW such as the vision
processing unit (VPU) and field-programmable gate array
(FPGA), DL-based satellite onboard systems have recently
been introduced into satellite image analysis. Onboard
computing is important for efficient satellite-ground station
communication because it filters unnecessary images in advance.
For example, the onboard DL system CloudScout [7] uses simple
binary classification to filter cloudy images that contain no
information to analyze. In complex applications, however, the
prediction results of a DL model still remain ambiguous,
especially in object recognition. The false negative error is
a challenging issue for the reliability of the onboard AI
system. Explainable AI (XAI) techniques hold the key to a
reliable AI-based system by providing a visual explanation
of the prediction of a black-box DL model. The explanation takes
the form of a saliency map highlighting the pixels that are
important for the DL model to predict the class of a target
object. These human-interpretable results enable humans to
anticipate the model behavior for other samples and, when necessary,
retrain (refine) the model to improve performance.
Satellite images have low resolution (e.g., >0.5 m² per
pixel) compared with other object recognition images
due to the long distance between the optical sensors and the
target regions. Fig. 1 shows the characteristics of top-view
images: rotation-equivariant and scale-variant objects. The data also
contain images of different quality due to sensor resolution
and environmental factors (e.g., cloudiness, sunlight, etc.).
This makes it difficult for DL models to accurately predict the class of
a target object as well as its visual explanation. In
satellite images, the background occupies a large portion of the
entire image. Accordingly, if a certain class is highly correlated
with the background (e.g., ship – ocean), the model could
be trained to recognize objects of that class based only on
the background rather than on the object's own characteristics,
which is called the background bias problem [8], [9]. In particular,
this problem is critical due to small object sizes and the
high correlation between background and object. A wrong
visual explanation that highlights only the background context
reduces the reliability of mission-critical applications in
the satellite system. In addition, due to the high computation
overhead of the XAI model and the lack of supervision for the output
visual explanation, the onboard computing system cannot
refine the model for explainability by itself, so supervisors and
the rich computing resources of the ground station are required.
To resolve these problems, we propose a novel federated
onboard-ground station (FOGS) computing system for satellite
image analysis. Our system deals with the reliability of
complex object recognition applications in satellite images,
which has not been addressed in the conventional onboard AI
system [7]. We attempt to handle this problem through
the development of a novel satellite image analysis system
in which the onboard system, the ground station, and supervisors cooperate.
Different from the conventional onboard AI system [7], we
build a sustainable onboard XAI-based analysis system by
iteratively updating the refined analysis model
between the onboard system and the ground station. We describe a novel
XAI model, the cascading pyramid attention network (CPANet),
for visual explanation in satellite image analysis. We adopt
an attention mechanism over multiple layers of the
CNN to detect small objects in the visual explanation. To propagate
useful context to the following layers, we connect the
attention branch of each layer in a cascading manner. Then, to
correct inconsistent explanations from the onboard prediction,
we describe an attention refinement scheme with supervisors
in the ground station. The ground system first conducts data
sampling to select samples whose explanations are inconsistent
across the intermediate layers with different representations. Then,
with the selection-based feedback mechanism, we describe
the refinement scheme of our XAI model.
II. RELATED WORK AND PROBLEM DESCRIPTION
A. ONBOARD COMPUTING SYSTEM FOR SATELLITE
IMAGE ANALYSIS
Modern satellite systems contain low-power commercial off-
the-shelf (COTS) accelerators (e.g., FPGA, embedded GPU,
and VPU) for onboard AI processing. In [10], the authors
introduce the low-power NVIDIA TX1 for target recognition
and segmentation using CNNs. In [11], the authors claim
that the DL model can be a promising solution in terms of
communication cost in the satellite and for facilitating navigation.
They conduct a detailed analysis of deploying DL
models on the COTS HW used in satellites, and a case study of
space-related applications such as cloud detection and object
tracking.
As an on-orbit AI processing example, CloudScout [7] performs
onboard filtering of useless data, as shown in
Fig. 2. From the hyperspectral images, the onboard AI model
performs cloud segmentation via a convolutional encoder-
decoder network. The result is a binary classification (cloudy
or not) for each pixel. If the captured image contains
more than 70% cloud pixels, the system drops the image
since it has no information to analyze.
FIGURE 2: The onboard AI processing in CloudScout [7].
With low-power Intel Myriad 2 HW, it filters unnecessary
cloudy images to reduce the communication bottleneck
and the supervisor's annotation cost.
By filtering the useless images on the onboard side, the system can
reduce the communication and supervisor annotation costs between
the satellite and the ground station. Equipped with the HyperScout-2
flight model, they evaluated the AI-based system in a real satellite
environment, the Φ-Sat-1 mission. In addition, NASA and Qualcomm
[12], [13] cooperate in the development of onboard AI HW
with the Qualcomm Snapdragon and Intel Movidius Myriad X
processors for NASA Jet Propulsion Laboratory (JPL) applications.
They evaluate SAR image processing with the U-Net
model [14] and Mars image analysis with the AlexNet [15] and
DeepLabv3 architectures [16]. They showed promising task
performance, especially in the percentage of missed pixels (less than 10%).
However, in the case of complex applications such as multi-
label classification and object detection, an AI model still
remains questionable due to its low performance compared
to binary classification, resulting in critical false
negative errors (about 4% FN error for binary classification in
CloudScout).
B. XAI TECHNIQUES FOR PROVIDING VISUAL
EXPLANATION
To interpret a black-box DL model in image processing, there
are several methods for extracting a saliency map that explains
the basis of the model prediction. Among them, we do not
handle perturbation-based methods [17]–[19], which need
to process randomly perturbed images repeatedly and therefore take
several times longer than the prediction process itself.
Hence, they are not suitable for an onboard environment with
limited computing power.
As a widely used class activation map method, CAM
[20] was developed as the ancestor of the class activation
mapping family; it produces visual explanation results by
weighting the feature maps (after the top convolution layer)
with the weights of the global average pooling (GAP) [21] and fully-
connected layers for a target class. To avoid the architectural
restriction of GAP, gradient-based approaches have been
introduced [22], [23]. These approaches can interpret a
single-layer representation (normally the top convolution
layer). LayerCAM [24] fuses a global explanation from
multiple local explanations of the intermediate layers of a
CNN, which takes advantage of the localization ability of the low-
level layers. However, this approach still has limitations due
to its simple fusion method, which just performs an element-
wise maximization of local explanations. In this case,
redundant information may be included, producing ambiguous
results.
These post-hoc explanation methods, except CAM, still
require additional operations (back-propagation for gradient
computation) to generate the visual explanation. Moreover, they
merely interpret the output feature maps of the prediction
result and bring no improvement in task performance (i.e., accuracy
or precision). Meanwhile, instead of considering post-hoc
explanation, some very recent studies such as ABN
[25] and LFI-CAM [26] take response-based approaches
similar to CAM [20]. They extend CAM by introducing
an attention mechanism as a sub-branch of the CNN
backbone, which improves visual explainability and allows
end-to-end training (i.e., no need for any network architecture
modification or fine-tuning). By doing so, these attention
branch methods not only generate visual explanations
within feed-forward passes but also overcome
the drawbacks of CAM mentioned above. On the other hand,
they are limited to the top convolution layer for generating visual
explanations, while gradient-based methods can use any
layer in the backbone CNN. In satellite images, interpreting only
the top layer leads to ambiguous explanations due
to the background bias problem in object recognition. This
problem makes the visual explanation focus on the background
pixels around a target object. We describe the
details of this phenomenon in Section II-D.
On the other hand, several studies on pyramidal
attention networks [27]–[29] have been conducted to utilize
the rich context of multiscale feature maps, but they only
handle feature attention to improve task performance.
In [30], the authors consider various episodes from
multi-layer attention modules to generate reliable visual
explanations in satellite images.
C. EXPLAINABILITY ENHANCEMENT WITH
HUMAN-IN-THE-LOOP
The concept of training a model by incorporating human
knowledge and experience has received attention as a way to
overcome the lack of training data and the high cost of
annotation. This is called human-in-the-loop (HITL), in which human
experts provide feedback to achieve better performance.
FIGURE 3: Scenario of the proposed federated XAI computing of onboard-ground station in satellite image analysis.
Our XAI-based satellite image analysis, on the other hand,
considers human cognition to improve model explainability
rather than task performance. Some studies
utilize HITL techniques to improve the feature explanation
ability in computer vision [31], [32]. Their goal is to obtain
not only accurate predictions but also proper explanations
for those predictions. They collect human annotations
that highlight "important regions" for decision-making.
By doing so, the model encodes human knowledge
in its parameters. For example, human importance-aware
network tuning (HINT) [33] proposes a ranking loss between
human-based importance scores [8] and gradient-based sensitivities.
In self-critical reasoning (SCR) [34], the model
penalizes itself for wrong answers on the important region
that most influences the prediction of the correct answer.
However, these approaches require a huge amount of time to generate the
human attention map, scoring the importance of all pixels. In
the case of satellite image analysis, it is difficult to annotate
an attention map over all patches in a raw image (>km² per
image). To adapt HITL to satellite image analysis efficiently,
our approach uses weak supervision (i.e., simple
feedback on an attention map) rather than full supervision
(i.e., manually creating a ground-truth attention map for every
pixel) for visual explainability.
D. PROBLEMS ON ADAPTING XAI METHODS TO
ONBOARD
In this section, we discuss the technical issues of applying existing XAI
methods to satellite images, especially in terms of visual
explainability, as well as the sustainable system issues of exploiting
supervisors and computing resources in the ground station.
In practice, the layered architecture of a CNN consists of
pyramid feature blocks, i.e., groups of convolutional layers.
Passing through a pyramid feature block, the spatial dimension (i.e.,
width and height) of the output feature map is reduced and
the number of channels is increased, for computational
efficiency and for extracting semantic contexts along the
channel dimension. As mentioned in Section II-B, existing attention
branch and post-hoc explanation methods only consider the
feature map of the top convolution layer to generate the visual
explanation due to its rich semantic context. However, they
may miss the contexts in the low-level features that are lost while passing
through pooling layers. Due to this spatial information loss,
a visual explanation derived only from the high-level feature map often
fails to detect the small context or boundary of an object.
This is critical in terms of explainability, especially in satellite
images. As a result, the generated visual explanation focuses on
the background pixels, not on the target object. We refer to
this phenomenon as background bias, which has already been
mentioned in object recognition research [8], [9].
FIGURE 4: The background bias problem of conventional visual
explanation based on the top convolution layer. The trained
DL model infers the ground-truth label by focusing on the
background context around the target objects.
To verify the background bias in satellite images, we
conducted an experiment training a CNN with pyramid
feature blocks [35] on a satellite image dataset [36]. We
resize the RGB input image to 224×224×3 and observe the
visual explanation obtained by a post-hoc explanation method [22]
widely used in the XAI field. Fig. 4 shows the background
bias problem of the visual explanation method using the top
convolution layer. We let an input image be x and its ground-truth
label (i.e., category) be y_gt.
FIGURE 5: The spatial information loss caused by spatial
pooling operations in a pyramid network. The highlighted
region shows valuable context that is missed in the visual
explanation of the top convolution layer.
The trained model correctly predicts
the category of all input images with high confidence. From
its visual explanation of y_gt, however, the model determines
the category by focusing not on the semantic context (i.e., airplane,
ship, and car) but on the background information (i.e., airstrip,
ocean, and asphalt). The result implies that the trained model
could fail to correctly predict the object's category and visual
explanation if the background of an input image does not commonly
appear with the target objects (e.g., an airplane passing over
the ocean).
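As a point of reference, the following minimal sketch illustrates the kind of top-layer, gradient-based explanation used in this experiment; the torchvision ResNet-18 backbone, the choice of layer4 as the top convolution layer, and the 21-class head are our assumptions for illustration, not the authors' exact code.

```python
# Minimal Grad-CAM-style sketch used to reproduce the background-bias observation.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(num_classes=21).eval()    # e.g., 21 land-use classes (assumption)
feats, grads = {}, {}

def fwd_hook(module, inputs, output):
    feats["A"] = output                    # top-layer feature map A(x)

def bwd_hook(module, grad_input, grad_output):
    grads["dA"] = grad_output[0]           # gradient of the class score w.r.t. A(x)

model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

def grad_cam(x, target_class=None):
    """Return a [0,1] saliency map of shape (224, 224) for one input x of shape (1,3,224,224)."""
    logits = model(x)
    cls = int(logits.argmax(1)) if target_class is None else int(target_class)
    model.zero_grad()
    logits[0, cls].backward()
    weights = grads["dA"].mean(dim=(2, 3), keepdim=True)  # GAP over the gradients
    cam = F.relu((weights * feats["A"]).sum(dim=1))       # weighted sum of channels
    cam = F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear")[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Usage: saliency = grad_cam(torch.randn(1, 3, 224, 224))
```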
To identify why this background bias occurs, we
extract the visual explanations of the intermediate convolution layers
in the pyramid feature blocks. The result is shown in Fig. 5. In
the pyramid feature blocks of a CNN, there are spatial pooling
operations between the blocks, which reduce the width and
height of the feature map. The lower convolution layers seem
to focus only on small contexts over a local region (see
residual blocks 1 and 2). As mentioned in LayerCAM [24],
these layers can highlight accurate boundary information
although they cannot explain the entire context of the input
image. In contrast, the visual explanation of residual block
3 seems to highlight the semantic contexts over the whole
region of the input image while containing some redundant
background areas. In the top convolution layer, however, the
model fails to explain the valuable contexts and is highly biased
toward the background. This comes from the spatial pooling
operations missing features about the "boat". In this example,
selecting the visual explanation of residual block 3 seems to be
the proper choice for the highest explainability. In
summary, a visual explanation using only the top convolution
layer cannot guarantee explainability to human supervisors.
Our approach is to strengthen the visual explanation of the top
convolution layer by combining useful context for accurate
boundaries and small objects from the lower convolution layers.
To this end, we present a novel attention branch method with
multiple attention blocks connected in a cascading manner.
Furthermore, to ensure the reliability of complex object
recognition tasks, continuous updates of the model (i.e., its
trainable weights) according to newly captured satellite images
are required. The refinement (retraining) of the
trained model is not suitable for onboard computing because
there is no supervision for the explanation and the computing
capability for retraining is limited. The onboard system
cannot acquire the ground-truth visual explanation of the
captured images. In addition, retraining an XAI model needs
powerful computing resources (e.g., GPUs) and a long time to
complete, which is not feasible in an onboard computing
environment with limited resources. Furthermore, the manual
correction of each pixel of a visual explanation [33], [34] also
incurs a huge labeling cost. To handle these issues, we propose
the concept of a federated XAI computing framework for the
onboard-ground station. Fig. 3 shows the overall scenario of the
proposed concept. On board, an XAI-based object recognition
model processes the images captured by the satellite. Note
that the filtering criterion on board is determined by
users, so we do not handle it here. The prediction results,
including the object's class and its visual explanation, are
transmitted to the ground station. In the ground station, samples with
ambiguous visual explanations are automatically identified by
active learning-based sampling. Then, the XAI model
is retrained based on the supervisor's feedback to enhance
visual explainability. The updated weights are transmitted to the
onboard HW to process newly incoming images, and the entire
procedure is repeated.
III. THE PROPOSED METHOD
In this section, we propose the FOGS computing system for
XAI-based satellite image analysis. Through this system,
we aim to improve the visual explainability of the
XAI model to provide reliable object recognition on the
satellite onboard. Our approach is to design the computing
interaction mechanism between the satellite onboard HW and
the ground station, while ensuring that the XAI model is trained
with the supervisor's knowledge using a simple feedback
mechanism.
A. OVERALL PROCEDURE OF FEDERATED
ONBOARD-GROUND STATION COMPUTING
Fig. 6 shows the overall architecture of the proposed
onboard-ground station federated computing framework. In
the proposed framework, the onboard system directly executes
the inference of the XAI model on the raw images captured
by the satellite. It outputs the prediction of the target
objects (e.g., "airplane", "ship", etc.) and its
visual reason (i.e., explanation) in the form of a saliency
map. Given the input image x and the trained black-box CNN
model, we denote the visual explanation of the predicted class as
VE_ij(x), the feature importance of pixel (i, j) in the input
image. A pixel (i, j) with a high VE_ij(x) value contributes
significantly to the prediction result. Based on the prediction
result of the XAI model, the onboard system determines the raw
images to be dropped (i.e., images not informative for the mission).
FIGURE 6: The overall architecture of the proposed onboard-ground station federated XAI computing with cascading pyramid
attention and weakly supervised refinement.
Different from CloudScout
[7], our system can handle more complex image analysis
tasks (e.g., object recognition, scene classification, etc.) via
the XAI model. Once the images and their analysis results are
transmitted to the ground station, the supervisors analyze
the correctness of the visual explanations. Since the onboard XAI
processing should be reliable, especially in preventing critical
errors (i.e., FN errors in image selection), the ground
station refines the trained XAI model by injecting the supervisor's
knowledge about the target prediction. We consider
the supervisor's annotation cost when correcting visual
explanations while enhancing visual explainability in terms of
consistency. We describe the following methods based on the
onboard-ground station federation with the supervisors.
Onboard Processing with CPANet. First, we present a
novel attention branch method, the cascading pyramid attention
network (CPANet), to mitigate the quality degradation of visual
explanation caused by the background bias problem [8], [9].
As presented in Section II-D, we identified a critical failure
in the visual explanations of conventional attention branch
methods [25], [26], which exploit only the top convolution layer.
Our approach is to exploit multiscale feature maps from
various layers in the pyramid feature blocks of the CNN. We denote
the visual explanation aggregating valuable context over multiple
feature maps as the global explanation. In contrast, a local
explanation represents the visual explanation of a single feature
map. These feature maps contain not only the semantic information of
objects but also the localization ability to detect small contexts (e.g., a
ship's head, the wings of an airplane) or small objects (e.g., car, boat). To
extract an elaborate global explanation from feature maps with
different spatial resolutions, we design a cascading attention
branch, a subpath of the pyramid feature blocks that propagates the
valuable context from the bottom to the top convolution layer.
In this cascading manner, the local explanation (i.e., local attention
map) of the previous pyramid feature block becomes a guide
for extracting the local explanation of the following block,
amplifying the feature values of the region that the previous
block highlights. The global explanation (i.e., global attention
map) of the top block is used both to refine the output feature map
of the feature pyramid network and to provide the visual
explanation to a supervisor.
Weakly Supervised Attention Refinement in the Ground
Station. Next, we discuss an attention refinement method
that incorporates the supervisor's knowledge using only a simple
feedback mechanism to improve visual explainability. In this
step, the parameters of the onboard XAI model are refined by
supervisors in the ground station. In classical approaches
[33], [34] using full supervision, supervisors correct all
(i, j) pixel values in the visual explanation, which is very time-
consuming and highly dependent on the supervisor's ability.
In our method, we split the refinement of the
attention branch into two steps: choosing the set of images showing
inconsistent explanations, and weakly supervised attention
refinement with selection-based feedback in the feature pyramid
network. The refined attention branch containing the supervisor's
knowledge is uploaded to the onboard system for
more reliable prediction and explanation. By iteratively updating
the policy of this attention branch, the proposed
framework can achieve a sustainable and reliable system for
satellite image analysis.
In the following section, we describe the XAI model and the
training strategies of the proposed system in detail.
FIGURE 7: The proposed CPANet for generating the global context to explain satellite images. It consists of the perception
branch and the cascading attention branch, which transmits useful context from the local explanations to the global explanation
via the bottom-up pathway. Detailed formulations are given in Section III-B.
B. CASCADING PYRAMID ATTENTION NETWORK FOR
ONBOARD IMAGE PROCESSING
In this section, we describe the CPANet architecture for the
onboard processing of the images captured by the satellite, as
shown in Fig. 7. Let D be a training dataset with N pairs of
(x, y), where x ∈ R^{C×W×H} (channel, width, height) is an image
and y ∈ {1, 2, ..., K} is its ground-truth label, with K the number
of classes in D. The proposed CPANet Θ = {u, v} consists of the
perception branch u and the cascading attention branch v.
Building upon a CNN with pyramid feature blocks, we first define a
pyramid feature block as a group of consecutive layers whose output
feature maps have the same spatial width and height. At the end of each
pyramid block, there is a spatial pooling (e.g., average pooling) layer
that compresses the spatial dimension for computational efficiency. We
consider L pyramid feature blocks from different convolution layers.
Given the input image x, we denote the L feature maps of the top
convolution layer in each pyramid feature block as

A(Θ; x) = {A^i(Θ; x)}_{i=1}^{L}.  (1)

Each feature map A^i(Θ; x) ∈ R^{C_i×W_i×H_i} has its own dimension.
Because the CPANet architecture is fixed in the following, we simply
write A^i(x) instead of A^i(Θ; x) for mathematical simplicity, except
in the theoretical analysis.
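For illustration, the sketch below shows one way to collect the L = 4 pyramid feature maps {A^i(x)} from a ResNet-18 backbone using forward hooks; the hook-based extraction and the class count are our assumptions, not the authors' implementation.

```python
# Collect the output of the top convolution layer of each pyramid feature block.
import torch
from torchvision.models import resnet18

backbone = resnet18(num_classes=21)
pyramid_blocks = [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4]
feature_maps = {}

for i, block in enumerate(pyramid_blocks, start=1):
    # store A^i(x) for block i when the forward pass runs
    block.register_forward_hook(lambda m, inp, out, i=i: feature_maps.__setitem__(i, out))

x = torch.randn(8, 3, 224, 224)             # a mini-batch of image patches
_ = backbone(x)
for i in range(1, 5):
    print(i, tuple(feature_maps[i].shape))  # (8, 64, 56, 56) ... (8, 512, 7, 7)
```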
We present the cascading attention branch as a subpath
from multiple pyramid feature blocks to extract the global
attention map over the multiscale feature maps {A^i(x)}_{i=1}^{L}.
To propagate the valuable contexts of a local feature map
to the following feature map, the previous attention map
becomes a guide when generating the following attention
map. As shown in Fig. 7, the feature map A^1(x) is passed
through the following formulas:

W = AvgPool(x̃) ⊙ (A^1(x) − min A^1(x)) / max A^1(x),  (2)

W' = AvgPool(AttB_1(W)),  (3)

W^1_i(x) = exp(W'_i) / Σ_{j=1}^{C_1} exp(W'_j),  ∀i ∈ {1, ..., C_1}.  (4)

In the first term of Eq. (2), the input image is transformed to
grey scale x̃ and down-sampled via average pooling (AvgPool) to match
the width and height of A^1; it is then multiplied by the normalized
feature map of the bottom pyramid feature block. We denote element-wise
multiplication by ⊙. In Eq. (3), the output W passes through the local
attention block and an average pooling that reduces the output to a
C_1-dimensional vector. We design the local attention block with stacked
convolutional layers, and the input W in each convolution block is
forwarded as σ(BN(ω ∗ W + β)), where σ is the ReLU activation function,
BN is batch normalization, and (ω, β) is the pair of weight and bias of
the convolution operation ∗. After passing the local attention block,
the output W' ∈ R^{C_1} of Eq. (3) becomes the confidence for the channel
importance of the feature map A^1; a particular channel with a high value
contains valuable context for explaining the objects. To transform it
into a relative importance among the C_1 channels, the softmax function
is applied in Eq. (4). The output vector W^1(x) = {W^1_i(x)}_{i=1}^{C_1}
weighs the feature map A^1(x) to generate the local attention map, as follows.
M(x) = σ( Σ_{i=1}^{C_1} W^1_i(x) A^1_i(x) ),
M^1(x) = (M(x) − min M(x)) / max M(x),  (5)

where we denote by A^1_i(x) the i-th channel of the feature
map. The local attention map M^1(x) ∈ R^{W_1×H_1} is a heatmap
in which each value M^1_ij ∈ [0, 1] is the importance of pixel (i, j) for
visual explanation. Through the weighted sum of the channels
in the feature map A^1(x), the informative channels are
emphasized in the local attention map M^1(x).
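The following sketch instantiates Eqs. (2)-(5) for the first block under our own assumptions (two stacked conv-BN-ReLU layers in AttB_1 and a channel-mean grey-scale conversion); the paper fixes neither of these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalAttentionBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.attb = nn.Sequential(                        # AttB_1: stacked conv + BN + ReLU
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )

    def forward(self, x_gray, a1):
        # Eq. (2): grey input pooled to A^1's spatial size, times the normalized A^1
        g = F.adaptive_avg_pool2d(x_gray, a1.shape[-2:])
        a1_norm = (a1 - a1.amin(dim=(2, 3), keepdim=True)) / \
                  (a1.amax(dim=(2, 3), keepdim=True) + 1e-8)
        w = g * a1_norm
        # Eq. (3): local attention block + average pooling to a C_1-dimensional vector
        w = self.attb(w).mean(dim=(2, 3))
        # Eq. (4): softmax channel importance
        w = torch.softmax(w, dim=1)
        # Eq. (5): weighted channel sum, ReLU, then min-max normalization -> M^1(x)
        m = F.relu((w[:, :, None, None] * a1).sum(dim=1))
        m = (m - m.amin(dim=(1, 2), keepdim=True)) / \
            (m.amax(dim=(1, 2), keepdim=True) + 1e-8)
        return m                                          # shape (B, H_1, W_1)

# Usage (grey scale approximated by the channel mean, an assumption):
# m1 = LocalAttentionBlock(64)(x.mean(1, keepdim=True), feature_maps[1])
```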
Once the local attention map M^i(x) is generated, the
purpose of the cascading attention is to enforce the following
feature map A^{i+1} to focus on the region highlighted by the
attention map M^i(x). To this end, we pre-process the feature
map A^i(x) guided by the previous attention map M^{i−1}(x),
which is obtained by

W = (1 + M^{i−1}(x)) ⊙ (A^i(x) − min A^i(x)) / max A^i(x),  ∀i ∈ {2, 3, ..., L},  (6)

and the following calculation is the same as in Eq. (3) and Eq. (4).
After passing through the L−1 local attention blocks, the global
attention map M^L(x) is created, containing the global context
over the pyramid feature blocks. We regard it as the global visual
explanation VE(x) (i.e., VE(x) = M^L(x)). Moreover, the
global attention map M^L(x) is applied to refine the feature
map A^L(x) to improve the task performance. The refined
feature map Ã^L(x) is derived as

Ã^L(x) = (1 + M^L(x)) ⊙ A^L(x).  (7)

From Eq. (7), the regions highlighted by the global attention map
M^L(x) are emphasized in the refined feature map Ã^L(x),
whereas the other regions are maintained. Finally, the refined
feature map is passed to the predictor, which outputs the class
confidence p_c(Θ; x) of class c from the refined feature map. To
train the model, let the mini-batch of the training dataset be
ζ = (X, Y), with image set X and corresponding labels Y; the loss
term is derived as

L(Θ; ζ) = −(1/|ζ|) Σ_{(x,y)∈ζ} Σ_{j=1}^{K} 1[y = j] log p_j(Θ; x),  (8)

where (x, y) is a pair of image and label in the mini-batch and the
indicator function 1[y = j] is 1 only if y = j. Based on Eq. (8),
we update the trainable parameters Θ_t at the t-th iteration using
gradient descent, as described in Eq. (9):

Θ_{t+1} = Θ_t − η ∇_{Θ_t} E_{ζ∼D}[L(Θ_t; ζ_t)],  (9)

where η is the learning rate and ζ_t is the mini-batch of the t-th
training iteration.
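A small sketch of the cascading guide in Eq. (6), the top-feature refinement in Eq. (7), and one training step of Eqs. (8)-(9) is given below; the upsampling of the previous attention map to the next block's spatial size and the placeholder model are our assumptions.

```python
import torch
import torch.nn.functional as F

def cascaded_input(a_i, m_prev):
    """Eq. (6): pre-process feature map A^i guided by the previous attention map M^{i-1}."""
    m = F.interpolate(m_prev[:, None], size=a_i.shape[-2:], mode="bilinear")  # match W_i x H_i
    a_norm = (a_i - a_i.amin(dim=(2, 3), keepdim=True)) / \
             (a_i.amax(dim=(2, 3), keepdim=True) + 1e-8)
    return (1.0 + m) * a_norm

def refine_top_feature(a_top, m_global):
    """Eq. (7): emphasize the regions of A^L highlighted by the global attention map M^L."""
    return (1.0 + m_global[:, None]) * a_top

def train_step(model, optimizer, x, y):
    """Eqs. (8)-(9): cross-entropy over the mini-batch and one gradient-descent update."""
    logits = model(x)                        # class confidences p_j(Theta; x) before softmax
    loss = F.cross_entropy(logits, y)        # Eq. (8)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                         # Eq. (9)
    return loss.item()
```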
We now provide a detailed analysis of the relationship between
local attention maps in the proposed CPANet. First, we
define the pyramid feature blocks and the local attention
blocks as follows:
Definition 1. (Relationship between the pyramid feature and
local attention blocks.) In the perception branch u, we denote
the trainable parameters of the i-th pyramid feature block
and the i-th local attention block as u_i and v_i, respectively. In
addition, the trainable parameters from the i-th block to the j-th
block are denoted as u_i^j and v_i^j.
Training with the parameter update rule denoted in Eq. (9),
we suppose the following assumption:
Assumption 1. For the training dataset D and the parameter
update rule of Eq. (9) with a proper hyperparameter setting, we
assume that the gradient of the loss term E_{ζ∼D}[L(Θ_t; ζ)]
converges. Let the local attention map produced by the cascading
attention branch be M^i(Θ; x) = M^i(u_1^i, v_1^i; ζ). Then we assume
that the local gradients ‖∇_{u_i} M^i(u_1^i, v_1^i; ζ)‖ and
‖∇_{v_i} M^i(u_1^i, v_1^i; ζ)‖, ∀i ∈ {1, 2, ..., L}, have an upper
bound Z.
Note that the i-th local attention map is derived from the
corresponding i-th pyramid feature block and local attention block
(i.e., the local attention map is calculated from the path through the
pyramid feature block and local attention block; see Fig. 7).
From Assumption 1, we can establish the effectiveness of the
cascaded attention in the following Lemma 1.
Lemma 1. For the i-th and j-th pyramid feature blocks with
∀i, j ∈ {1, 2, ..., L}, i ≥ j, the gradients of the attention maps
M^i and M^j satisfy

‖∇_Θ M^i(Θ; ζ) − ∇_Θ M^j(Θ; ζ)‖ ≤ (2Z)^j (1 + (2Z)^{i−j}).  (10)
Proof. From Assumption 1 and the gradient chain rule, we have

‖∇_Θ M^i(Θ; ζ) − ∇_Θ M^j(Θ; ζ)‖
= ‖∇_{(u_1^i, v_1^i)} M^i(u_1^i, v_1^i; ζ) − ∇_{(u_1^j, v_1^j)} M^j(u_1^j, v_1^j; ζ)‖
≤ ‖∇_{(u_1^i, v_1^i)} M^i(u_1^i, v_1^i; ζ)‖ + ‖∇_{(u_1^j, v_1^j)} M^j(u_1^j, v_1^j; ζ)‖
= ‖∇_{M^{i−1}} M^i(u_1^i, v_1^i; ζ) · ∇_{(u_1^{i−1}, v_1^{i−1})} M^{i−1}(u_1^{i−1}, v_1^{i−1}; ζ)‖
  + ‖∇_{M^{j−1}} M^j(u_1^j, v_1^j; ζ) · ∇_{(u_1^{j−1}, v_1^{j−1})} M^{j−1}(u_1^{j−1}, v_1^{j−1}; ζ)‖
= ‖(∇_{u_i} M^i(u_1^i, v_1^i; ζ) + ∇_{v_i} M^i(u_1^i, v_1^i; ζ)) · ∇_{(u_1^{i−1}, v_1^{i−1})} M^{i−1}(u_1^{i−1}, v_1^{i−1}; ζ)‖
  + ‖(∇_{u_j} M^j(u_1^j, v_1^j; ζ) + ∇_{v_j} M^j(u_1^j, v_1^j; ζ)) · ∇_{(u_1^{j−1}, v_1^{j−1})} M^{j−1}(u_1^{j−1}, v_1^{j−1}; ζ)‖
≤ 2Z · ( ‖∇_{(u_1^{i−1}, v_1^{i−1})} M^{i−1}(u_1^{i−1}, v_1^{i−1}; ζ)‖ + ‖∇_{(u_1^{j−1}, v_1^{j−1})} M^{j−1}(u_1^{j−1}, v_1^{j−1}; ζ)‖ )
≤ (2Z)^j (1 + (2Z)^{i−j}) = A,  (11)

where the first inequality follows from ‖z_1 − z_2‖ ≤ ‖z_1‖ + ‖z_2‖,
∀z_1, z_2 ∈ R^d, and the second inequality follows from
‖z_1 z_2‖ ≤ ‖z_1‖‖z_2‖, ∀z_1, z_2 ∈ R^d. Note that the partial
gradients of M^1 are also bounded by Z because the perception branch
and the attention branch are independent in the first block (i.e.,
M^1(u_1, v_1; ζ) = Per(u_1; ζ) + Att(v_1; ζ), where Per and Att are the
functions of the perception branch and the attention branch,
respectively).
We denote (2Z)^j(1 + (2Z)^{i−j}) by A, which is used in
Section III-D. From Lemma 1, we can assert that
the object-related region of the previous attention map
propagates to the next attention map. Through the L cascading
attention blocks, the visual explanation of the top convolution
layer can recover the regions highlighted in the previous attention
maps, compensating for the spatial information loss.
C. EXPLANATION INCONSISTENCY-BASED DATA
SAMPLING
CPANet deployed on the onboard system can provide elaborate
visual explanations that catch the small contexts in a
satellite image while mitigating the background bias in the
feature pyramid network. However, ambiguous visual explanations
may still occur due to environmental changes and newly captured
images that lie outside the distribution of the training dataset.
Therefore, we consider a refinement mechanism for the proposed
CPANet between the onboard system and the ground station. In this
section, we describe the active learning-based data sampling for
finding valuable samples to improve the visual explainability of
CPANet. To this end, we need to automatically filter the samples
showing ambiguous explanations. As shown in Fig. 5,
visual explanations that are inconsistent across different blocks
may indicate information loss about the target object.
Inspired by this phenomenon, we introduce a criterion for
how inconsistent visual explanations are with respect to the
pyramid feature blocks.
First, we provide a method to compare the inconsistency of
different visual explanations. It is measured by
comparing the similarity of two visual explanations. Previous
similarity metrics [37], [38] for visual maps are based on pixel-
wise comparison over the whole image x. In the case
of visual explanation, however, this measurement includes the
redundant similarity of regions on which the model does not focus.
In this paper, we define a similarity metric of two visual
explanations in Definition 2.
Definition 2. (Similarity of Two Visual Explanations.) Given
the input x, the similarity of two visual explanations generated
by the cascading attention branch, VE^1(x) and VE^2(x), is defined
as the similarity of the spatial region that the two explanations
commonly highlight.
From Definition 2, we ignore the common area on which both
explanations do not focus (i.e., the blue area in Fig. 4).
To quantify the similarity, we use a simple method: pixels with
low values in the visual explanation are eliminated (i.e., set to 0).
To retain informative pixels, we use a threshold of 15% of the maximum
value in the visual explanation, similar to Grad-CAM [22]. We denote
the transformed visual explanations as VE^1(x) and VE^2(x),
respectively. Note that each pixel value in an explanation lies in
[0, 1]. The similarity SIM(VE^1(x), VE^2(x)) is then derived as

SIM(VE^1(x), VE^2(x)) = Σ_{(i,j)∈S} {1 − |VE^1_ij(x) − VE^2_ij(x)|} / area(S),  (12)

where S is the set of non-zero pixels in the explanations,

S = {(i, j) | VE^1_ij(x) ≠ 0 ∨ VE^2_ij(x) ≠ 0}.  (13)

Note that area(S) is the total number of pixels in S and VE^1_ij(x) is
the (i, j) pixel value. Based on the similarity in Eq. (12), we can
define the inconsistency of VE^1(x) and VE^2(x) as

U(VE^1(x), VE^2(x)) = 1 − SIM(VE^1(x), VE^2(x)).  (14)
Using the inconsistency measure in Eq. (14), the system
conducts the data sampling. Recall that the proposed CPANet
passes the valuable local contexts to the top convolution
layer. Therefore, our data selection is based on the inconsistency
between the local attention maps {M^i(x)}_{i=1}^{L}. Over the
training dataset D = {(x, y)}_{i=1}^{N}, we evaluate the inconsistency
between VE^i(x) and VE^j(x) of the i-th and j-th local attention
maps with the trained model Θ. Based on the maximum
inconsistency max_{(i,j)} U(VE^i(x), VE^j(x)), we filter the
samples x to be used for attention refinement using a threshold
γ. We let D_U = {(x, y)}_{i=1}^{N̂} be the resulting retraining dataset.
D. ATTENTION REFINEMENT USING WEAK
SUPERVISION IN GROUND STATION
In this section, we address improving the explanation fidelity
of the saliency map for supervisors by fine-tuning
the cascading attention branch. To retrain CPANet for consistent
explanations, we propose a novel weakly supervised
learning mechanism with simple selection-based feedback
from supervisors. In the conventional fully supervised approach
[33], a supervisor should manually create the ground-truth
attention map. In our approach, we concentrate on the inconsistent
local attention maps of ambiguous samples, which
are filtered by the active learning. Using this characteristic of
varying explainability over the pyramid feature blocks, we
only ask for a selection of the local attention maps with
high interpretability. Based on the selected attention map
as weak supervision, CPANet is retrained with an attention
regularization loss term. Fig. 8 shows the overall procedure
of the proposed refinement method.
To provide useful explanations to the supervisor, we consider
training the self-attention weights with supervisor intervention
on the visual explanations. From the retraining
dataset D_U, we define the supervisor's feedback over the L
pyramid feature blocks, denoted G, as a feedback interface
for the L local attention maps.
Definition 3. (Selection-based Feedback.) Supervisor feedback
G over the L local attention maps is an indicator for selecting
the visual explanations with human interpretability:

G(Θ; x_i ∼ D_U) : {M^j(x_i)}_{j=1}^{L} → R^{1×L},  (15)

where each G^j(Θ; x_i ∼ D_U) ∈ {0, 1} of the j-th local attention
map takes one of the states ST = {"wrong", "correct"}.
FIGURE 8: Update of the onboard attention policy using weak supervision by selection-based annotation in the pyramid feature
blocks.
We simply denote G(Θ; x_i ∼ D_U) as G(x_i). For a "wrong"
attention map, we add a penalty term. In the ground
station, there are several supervisors analyzing satellite
images. We assume that each supervisor has their own private
knowledge about analyzing the satellite images (e.g., domain,
class, etc.). In this situation, it is possible to prevent the
refined attention branch from being biased toward a particular
supervisor by aggregating the feedback labels from
multiple supervisors. Assume that there are T supervisors
with different domain knowledge in the ground station. We
denote by G_k(x_i) ∈ R^{1×L} the feedback (i.e., binary vector) of
supervisor k for the input image x_i, with G^j_k(x_i) its entry for
the j-th pyramid block. Through majority voting to maximize the
consensus among supervisors, we derive the collective supervisor
feedback as Σ_{k=1}^{T} G^j_k(x_i). If this value meets the consensus
criterion δ, we set G^j(x_i) to 1 (correct); otherwise, to 0 (wrong).
In this paper, we assume that there is only one supervisor for
simplicity (i.e., G = G_1). We set the "correct" attention map as the
weak ground truth for retraining.
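The following sketch shows the per-block majority voting of Definition 3 under our own naming; votes and delta are hypothetical names used only for illustration.

```python
import torch

def aggregate_feedback(votes, delta):
    """votes: (T, L) binary tensor, votes[k, j] = 1 if supervisor k marks block j "correct".
    Returns a length-L binary vector G as in Eq. (15)."""
    counts = votes.sum(dim=0)              # sum_k G^j_k(x_i) for each block j
    return (counts >= delta).long()        # 1 = "correct", 0 = "wrong"

# Usage with T = 3 supervisors, L = 4 blocks, consensus criterion delta = 2:
# G = aggregate_feedback(torch.tensor([[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 1]]), delta=2)
```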
Given the image x and the weak supervision {G^j(x)}_{j=1}^{L},
we obtain a regularization term w.r.t. the attention maps as

(1/Q) Σ_{j=1}^{L} Σ_{i=1}^{L} ‖ 1[G^i(x) = 0] M^i(x) − 1[G^j(x) = 1] M^j(x) ‖_2^2,  (16)

where Q is the number of pyramid feature blocks labeled
"wrong". Through Eq. (16), the attention block and the connected
pyramid feature block that produce a wrong attention map can refine
their weights guided by the supervisor's knowledge. The loss function
is derived from this regularization term according to the explanation
distance to the weakly supervised attention map. As a result, the loss
function L_ref for refining CPANet is derived as

L_ref(D_U) = (1/N̂) Σ_{i=1}^{N̂} L(x_i, y_i)
  + α { (1/Q) Σ_{j=1}^{L} Σ_{i=1}^{L} ‖ 1[G^i(x_i) = 0] M^i(x_i) − 1[G^j(x_i) = 1] M^j(x_i) ‖_2^2 },  (17)

where α is a control variable for refinement. The value of α
determines how strongly the refined model reflects the supervisor's
knowledge for explainability.
Algorithm 1 Procedure of Attention Refinement for Visual
Explainability in the Ground Station.
Input: Trained CPANet Θ, training dataset D, inconsistency
threshold of visual explanation γ, consensus criterion δ
1: D_U ← {}  ▷ sampling data for refinement
2: for all x in D do
3:   U_max ← 0
4:   for all i in {1, 2, ..., L} do
5:     for all j in {1, 2, ..., L} do
6:       VE^i ← M^i(Θ; x) and VE^j ← M^j(Θ; x)
7:       VE^i, VE^j ← remove non-informative pixels
8:       S ← {(i', j') | VE^i_{i'j'}(x) ≠ 0 ∨ VE^j_{i'j'}(x) ≠ 0}
9:       AD ← |VE^i − VE^j|
10:      SIM(VE^i, VE^j) ← Σ_{(i',j')∈S} (1 − AD_{i'j'}) / area(S)
11:      if 1 − SIM(VE^i, VE^j) ≥ U_max then
12:        U_max ← 1 − SIM(VE^i, VE^j)
13:      end if
14:    end for
15:  end for
16:  if U_max ≥ γ then
17:    D_U ← D_U ∪ {x}
18:  end if
19: end for
20: for all x in D_U do  ▷ attention refinement
21:   Calculate {M^j(x)}_{j=1}^{L}
22:   Collect feedback G^j_k(x) ∈ {0, 1}, ∀j, from the T supervisors
23:   if Σ_{k=1}^{T} G^j_k(x) ≥ δ then
24:     G^j(x) ← 1
25:   else
26:     G^j(x) ← 0
27:   end if
28: end for
29: Θ ← Retrain CPANet by Eq. (17)
30: Update policy Θ to the onboard system
The update function at the t-th refinement iteration is given in Eq. (18):

Θ_{t+1} = Θ_t − η ∇_{Θ_t} E_{ζ∼D_U}[L_ref(Θ_t; ζ_t)].  (18)

After training all the weights in the attention branch, the
inconsistency of the L local attention maps can be reduced. Our
approach provides an intuitive interface for supervisor intervention,
reducing the annotation cost (correction of the attention
map) for large satellite images.
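A sketch of the refinement objective in Eqs. (16)-(17) is given below; it pairs the "wrong" maps against the "correct" ones as described in the text, and it assumes the attention maps have already been resized to a common resolution.

```python
import torch
import torch.nn.functional as F

def attention_regularizer(maps, feedback):
    """maps: list of L attention maps (B, H, W); feedback: length-L 0/1 tensor (Definition 3)."""
    wrong = [m for m, g in zip(maps, feedback) if g == 0]
    correct = [m for m, g in zip(maps, feedback) if g == 1]
    if not wrong or not correct:
        return maps[0].new_zeros(())
    reg = sum(((mw - mc) ** 2).sum(dim=(1, 2)).mean()     # squared L2 distance per pair
              for mw in wrong for mc in correct)
    return reg / len(wrong)                               # Q = number of "wrong" blocks, Eq. (16)

def refinement_loss(logits, labels, maps, feedback, alpha):
    """Eq. (17): task loss plus alpha-weighted attention regularization."""
    return F.cross_entropy(logits, labels) + alpha * attention_regularizer(maps, feedback)
```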
We now provide a theoretical analysis of the explainability
boundary of the local attention maps.
Theorem 1. (Explainability Consistency of Attention Maps.)
Let Lemma 1 be satisfied. We assume that there is only a
single correct explanation M^GT(Θ; x) among the L local
attention maps. Starting from the initially trained model Θ, under the
loss function of Eq. (17), the difference between the outputs of the two
pyramid feature blocks, M^GT(x) and M^j(x), is bounded as

‖M^GT(Θ_t; ζ_t) − M^j(Θ_t; ζ_t)‖ ≤ A^t (1 − 2αη)^t ‖M^GT(Θ_0; ζ_0) − M^j(Θ_0; ζ_0)‖,  (19)

where Θ_0 denotes the initially trained parameters Θ and ζ_t is the
mini-batch of the t-th refinement iteration.
Proof. From the loss function and update rule given in
Eq. (17) and Eq. (18) and from Lemma 1, we obtain

‖M^GT(Θ_t; ζ_t) − M^j(Θ_t; ζ_t)‖
= ‖M^GT(Θ_{t−1}; ζ_{t−1}) − η∇_{Θ_{t−1}}L_ref(Θ_{t−1}; ζ_{t−1}) − M^j(Θ_{t−1}; ζ_{t−1}) + η∇_{Θ_{t−1}}L_ref(Θ_{t−1}; ζ_{t−1})
  − 2αη (M^GT(Θ_{t−1}; ζ_{t−1}) − M^j(Θ_{t−1}; ζ_{t−1})) · (∇_{Θ_{t−1}}M^GT(Θ_{t−1}; ζ_{t−1}) − ∇_{Θ_{t−1}}M^j(Θ_{t−1}; ζ_{t−1}))‖
= ‖(M^GT(Θ_{t−1}; ζ_{t−1}) − M^j(Θ_{t−1}; ζ_{t−1}))
  − 2αη (M^GT(Θ_{t−1}; ζ_{t−1}) − M^j(Θ_{t−1}; ζ_{t−1})) · (∇_{Θ_{t−1}}M^GT(Θ_{t−1}; ζ_{t−1}) − ∇_{Θ_{t−1}}M^j(Θ_{t−1}; ζ_{t−1}))‖
= ‖(M^GT(Θ_{t−1}; ζ_{t−1}) − M^j(Θ_{t−1}; ζ_{t−1})) · (1 − 2αη {∇_{Θ_{t−1}}M^GT(Θ_{t−1}; ζ_{t−1}) − ∇_{Θ_{t−1}}M^j(Θ_{t−1}; ζ_{t−1})})‖
≤ ‖M^GT(Θ_{t−1}; ζ_{t−1}) − M^j(Θ_{t−1}; ζ_{t−1})‖ · ‖(1 − 2αη){∇_{Θ_{t−1}}M^GT(Θ_{t−1}; ζ_{t−1}) − ∇_{Θ_{t−1}}M^j(Θ_{t−1}; ζ_{t−1})}‖
≤ A(1 − 2αη) ‖M^GT(Θ_{t−1}; ζ_{t−1}) − M^j(Θ_{t−1}; ζ_{t−1})‖.  (20)

Therefore, the explanation consistency for the pyramid feature blocks
is bounded as in Eq. (19).
Theorem 1 shows that a "wrong" visual explanation can be
corrected by Eq. (17). The overall procedure of the attention
refinement in the ground station is described in Algorithm 1.
E. ONBOARD XAI-CHIP EMBEDDED SYSTEM
IMPLEMENTATION
To provide explainability in onboard XAI computing, CPANet
requires a large parameter space and considerable computation.
The conventional onboard AI system [7] equipped with
low-power HW (i.e., a VPU) is not suitable for deploying a
visual explainability model in the satellite computing
federation. Therefore, we prototyped a newly designed
explainable AI processor, as shown in Fig. 9. In this section,
we describe our prototype of the application-specific integrated
circuit (ASIC)-based onboard XAI system for low power
and high computation ability. We implement the proposed
CPANet on a low-power ASIC with COTS components
and an embedded system. As shown in Fig. 9, the system consists of a
programmable logic controller, external/shared memory, and a
convolutional/vector processor. In the ASIC, the intrinsic resources
are limited; therefore, we adapt the CPANet weights and operators
for onboard processing to utilize the low-power ASIC.
FIGURE 9: Onboard XAI-chip embedded system with the
CPANet prototype in an AI processor connected to a COTS
Xilinx FPGA. In particular, we prototyped the explainable AI
processing unit (EPU) in a Samsung Foundry 28-nm CMOS
process, with 200 mW power consumption for the EPU and 7.5 W for
the entire onboard system.
To accelerate XAI methods, we design the explainable
AI processing unit (EPU) chip, which is used in the onboard
HW prototype.
Programmable Logic Controller for the EPU. On the low-power
peripheral, task partitioning control of CPANet
is necessary because the input data x and the CPANet
information (M^L, Θ) that can be processed at once are limited.
Since the input data size may change, a partitioned task
T_k = (x_k, M^L_k, Θ_k), with input data {x_k | x = ∪_k x_k},
parameters Θ_k = (u_k, v_k), and CPANet graph operators with parameters
{(M^L_k, Θ_k) | (M^L, Θ) = ∪_k (M^L_k, Θ_k)}, is generated in the
programmable logic controller. To guarantee that each task can
be executed in the ASIC, the task size Mem(T_k) should
satisfy Mem(T_k) + Mem(M^L_k(x)) ≤ S, where S MB is the shared
memory size, Mem(T_k) equals Mem(x_k) + Mem(Θ_k) because the
operators do not occupy storage, and Mem(M^L_k) is the k-th partial
output. Finally, the partial tasks to run are stored in external
memory. After processing a partitioned task of CPANet, the
programmable logic controller loads the results from the shared
memory in the ASIC.
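As a toy illustration of the feasibility condition above, the following check mirrors Mem(T_k) + Mem(M^L_k) ≤ S; all names and numbers are illustrative rather than taken from the prototype.

```python
def fits_in_shared_memory(mem_xk_mb, mem_theta_k_mb, mem_output_k_mb, shared_mb):
    """Mem(T_k) = Mem(x_k) + Mem(Theta_k); the partial output M^L_k must also fit in S MB."""
    mem_task = mem_xk_mb + mem_theta_k_mb
    return mem_task + mem_output_k_mb <= shared_mb

# Usage: fits_in_shared_memory(0.6, 1.2, 0.2, shared_mb=2.0) -> True
```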
Convolutional/Vector Processor in the EPU. All operators
required to run CPANet consist of convolutional and vector
computations. The convolutional processor contains multiple
array processing units, which process kernel sizes of
1×1, 3×3, and 7×7 with various zero-padding and stride sizes
in parallel. The vector processor is designed to handle MaxPool,
AvgPool, batch normalization, ReLU activation, and GEMM. From the
vector processor, the partial output M^L_k(x_k) is stored in the
shared memory. The power consumption of our onboard processing is
typically 7.25 W, and the designed EPU consumes about 200 mW in a
die area of 10.24 mm².
IV. EXPERIMENTS AND DISCUSSION
In this section, we show the performance comparison of the
proposed methods with other approaches and discuss the
results.
A. EXPERIMENT SETTING AND BASELINE METHODS
We conducted experiments on the UC Merced land use
(UCMerced) [36] and NWPU-RESISC45 [39] datasets.
The UCMerced dataset contains 2,100 images of 256×256
resolution over 21 land use classes. 90% of the images are
randomly split into training data, and the remaining 10% are used for
validation. NWPU-RESISC45 contains 31,500 images of
256×256 resolution over 45 classes, with pixel resolutions
ranging from 0.2 to 30 m. There are 700 images per class, captured from
Google Earth. We compare our method with other attention
branch methods that use only the top-level feature map, such as
ABN [25] and LFI-CAM [26]. In addition, we compare the
visual explainability of the proposed CPANet with class activation
methods such as CAM [20], Grad-CAM [22], Grad-CAM++ [23], and
LayerCAM [24]. Note that these post-hoc explanation methods cannot
influence task performance and, except for CAM, require additional
backpropagation operations. Note also that training CPANet and its
refinement are conducted on a server platform with NVIDIA GPUs; these
stages are irrelevant to the onboard HW, which is used only for
inference after training. We use 4 NVIDIA RTX 3080 GPUs
for training, with Python 3.6.13, CUDA 11.3, and
PyTorch 1.10.2 as the DL framework.
B. MODEL ARCHITECTURE AND EVALUATION
METRICS
We use the ResNet-18 [35] architecture as the common perception
branch in the baselines and the proposed CPANet. In ResNet,
we set each residual block as a pyramid feature block (i.e.,
L = 4). The image augmentation and optimizer settings are
similar to ABN [25] and LFI-CAM [26]. Training images
are cropped with a random ratio, resized to
224×224, and randomly flipped horizontally. We use
stochastic gradient descent (SGD) with momentum as the
optimizer for all models. The initial learning rate is set to 0.1
with a momentum of 0.9. In our experiments, the total number of
training epochs is 200, and the learning rate decays to 0.1 and
0.01 at 100 and 150 epochs, respectively. We use a weight decay
of 1e-4 in all cases.
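A sketch of this training setup is shown below; the use of torchvision transforms, MultiStepLR, and the interpretation of the decay schedule as multiplicative factors of 0.1 at epochs 100 and 150 are our assumptions.

```python
import torch
from torchvision import transforms
from torchvision.models import resnet18

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),        # random-ratio crop resized to 224x224
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

model = resnet18(num_classes=21)              # stand-in for CPANet with a ResNet-18 backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    # ... one pass over the training loader with the cross-entropy loss of Eq. (8) ...
    scheduler.step()
```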
TABLE 1: Performance comparison of visual explanation methods on UCMerced and NWPU-RESISC45: The proposed
CPANet vs. baseline CNN with CAM [20] and post-hoc explanation methods, and attention branch methods (ABN [25] and
LFI-CAM [26])
UCMerced NWPU-RESISC45
Params.(M) Top-1 Err. Smax Params.(M) Top-1 Err. Smax
r= 0.2r= 0.2
Base+CAM
11.19 6.67
0.048
11.23 9.3
0.49
+GC* 0.05 0.5
+GCpp* 0.05 0.57
+LC* 0.04 0.49
ABN 19.59 6.19 0.024 19.62 9.07 0.48
LFI-CAM 20.63 7.62 0.139 20.64 8.41 0.41
Proposed CPANet 20.59 3.81 0.11 20.61 8.08 0.4
*The post-hoc explanation methods. GC: GradCAM [22], GCpp: GradCAM++ [23], and LC: LayerCAM [24].
FIGURE 10: Comparison of visual explanation results on UCMerced images with the proposed CPANet and baselines.
0.01 in 100 and 150 epochs, respectively. We use a weight
decaying set to 1e-4 for all cases.
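For reference, the training configuration described above corresponds to the following PyTorch-style sketch; the backbone constructor stands in for the full CPANet (which adds attention branches on the pyramid feature blocks), and the schedule reflects our reading of the decay settings, so it is an assumption rather than the exact training script.

# Sketch of the training configuration described above (SGD + step decay).
# The backbone stands in for CPANet; the augmentation pipeline is approximate.
import torch
import torchvision.transforms as T
from torchvision.models import resnet18

transform = T.Compose([
    T.RandomResizedCrop(224),      # random-ratio crop resized to 224x224
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

model = resnet18(num_classes=21)   # perception-branch backbone (UCMerced: 21 classes)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# learning rate decayed by 0.1 at epochs 100 and 150 over 200 total epochs (assumed reading)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

for epoch in range(200):
    # ... one training pass over the UCMerced training split ...
    scheduler.step()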
Basically, we adopt the top-1 error (%) and the number of trainable parameters as the metrics for comparing task performance. In addition, we qualitatively analyze the visual explainability of the proposed model and the baselines. As a quantitative metric of explainability, we measure the maximum sensitivity under input perturbation. Following [40], the maximum sensitivity of an explanation is defined as the maximum norm of the difference between the explanations VE of a black-box model f for the input x and an r-perturbed input y:

\[ S_{\max}(VE, \Theta, x, y) = \max_{\|y - x\| \leq r} \left\| VE(\Theta; y) - VE(\Theta; x) \right\| . \tag{21} \]

In every experiment, we use Gaussian noise within (−r, +r) to generate a total of 100 perturbed inputs y and measure the maximum difference by Eq. (21). Satellite images are subject to many environmental variations, such as sunlight and noise, which can change both the prediction and the visual explanation. Therefore, the maximum sensitivity indicates how robust an XAI model is for satellite image analysis.

FIGURE 11: Comparison of visual explanation results on NWPU-RESISC45 images with the proposed CPANet and baselines.
For validating ambiguous explanations, we use the inconsistency of visual explanation between the pyramid feature blocks, described in Eq. (14). Note that the inconsistency measures how much the trained XAI method focuses on exclusive regions rather than the common region shared by the feature maps over the pyramid feature blocks. As discussed in Section II-D, the background bias phenomenon results in a rapid change of the region that each pyramid feature block focuses on, especially in satellite images containing many small objects. The inconsistency metric can therefore be used to quantify this change in the visual explanation. In addition, we evaluate the explainability metrics of average % drop and % increase in confidence, which are widely used in XAI methods [23], [41], [42]. Both metrics are based on comparing the original image x with ground-truth label c and the explanation map x_exp = x ⊙ VE(Θ; x). The explanation map is a synthetic image that keeps the regions highlighted by the visual explanation and removes the others (recall that each pixel of the visual explanation is a floating-point value in [0, 1]). The average % drop (denoted "average drop") is the average drop ratio of the c-class confidence p_c(Θ; x) (see Eq. (8)) over the validation dataset when the explanation map is fed into the XAI model; if the visual explanation misses the contexts representing the objects or highlights uncorrelated regions, the average drop becomes high. On the other hand, the % increase in confidence is the ratio of validation images whose c-class confidence p_c(Θ; x) increases when the explanation map is fed, meaning that the explanation map strongly highlights the most discriminative regions of the objects. Similar to [23], we remove the lower 50% of the visual explanation pixels when generating the explanation map.
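For clarity, the sketch below shows one way to compute the explainability measures used in this section; the `model` and `explain` callables are placeholders for the actual CPANet interfaces, and details such as the perturbation sampling and thresholding are assumptions that may differ from our exact evaluation code.

# Sketch of the explainability metrics used in this section.
# `model(x)` returns class probabilities and `explain(x)` a [0,1] saliency map;
# both are placeholders for the actual CPANet interfaces.
import torch

def max_sensitivity(explain, x, r=0.2, n_samples=100):
    """Eq. (21): largest change of the explanation under (-r, +r) Gaussian noise."""
    base = explain(x)
    worst = 0.0
    for _ in range(n_samples):
        noise = torch.randn_like(x).clamp(-r, r)
        worst = max(worst, torch.norm(explain(x + noise) - base).item())
    return worst

def avg_drop_and_inc_conf(model, explain, images, labels):
    """Average % drop and % increase in confidence over a validation set."""
    drops, increases = [], []
    for x, c in zip(images, labels):
        ve = explain(x)
        ve = ve * (ve >= ve.flatten().quantile(0.5))   # keep the upper 50% of pixels
        x_exp = x * ve                                  # explanation map x ⊙ VE
        p_orig = model(x)[c].item()
        p_exp = model(x_exp)[c].item()
        drops.append(max(0.0, p_orig - p_exp) / max(p_orig, 1e-8))
        increases.append(float(p_exp > p_orig))
    return 100 * sum(drops) / len(drops), 100 * sum(increases) / len(increases)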
C. EVALUATION ON SATELLITE IMAGE DATASET
Table 1 shows the results of top-1 error and explainability on the UCMerced and NWPU-RESISC45 datasets. Compared with the conventional ResNet without any attention (see Base), the proposed model achieves 2.86% and 1.22% lower top-1 error on UCMerced and NWPU-RESISC45, respectively. Compared to the existing attention branch methods, the proposed CPANet reduces the top-1 error by 2.38% (vs. ABN) and 3.81% (vs. LFI-CAM) on UCMerced. Likewise, on the NWPU-RESISC45 dataset, the proposed CPANet shows better task performance than the conventional methods, achieving about 1% lower top-1 error.
In terms of trainable parameters, all attention branch methods have a larger number of parameters due to the sub-path built on the pyramid feature blocks (i.e., the ResNet backbone). As mentioned above, ABN and LFI-CAM only generate a visual attention map from the top-level convolution layer. In the proposed model, however, we provide visual explanation based on the 4 pyramid feature blocks (i.e., the last layer of each residual block) for better explainability. Although the attention branch of the proposed model covers more feature maps than ABN and LFI-CAM, it requires only ≤1% additional parameters, which shows the structural efficiency of our method. In terms of the explainability measurement, CPANet is slightly worse on the UCMerced dataset and better on the NWPU-RESISC45 dataset. Although ABN achieves a better (lower) maximum sensitivity, we confirmed that its attention was highly concentrated in the peripheral regions, even in the background. In addition, we show the enhancement of the maximum sensitivity through the attention refinement in the following section.

TABLE 2: Evaluation of quantitative explainability in terms of average drop ("Avg. Drop") and increase in confidence ("Inc. in Conf."), introduced in [23], for the proposed CPANet and the attention branch baselines (ABN, LFI-CAM). We used the UCMerced dataset in this experiment.

                     Avg. Drop   Inc. in Conf.
ABN                  79.28       4.76
LFI-CAM              74.29       7.62
Proposed CPANet      63.11       4.76
The qualitative results of the visual explanation on the UCMerced dataset are depicted in Fig. 10. Note that we extract the Grad-CAM results from the top convolution layer, as described in [22]. For small objects (see the "harbor" images), the visual explanations derived from the conventional methods (CAM, ABN, LFI-CAM), which only consider the top-level convolution layer, cannot accurately capture the object (ship); the trained model rather focuses on the background context (ocean) when predicting the category. In particular, ABN severely fails to capture the objects and only highlights the background. Even the multi-layer fusion-based explanation method (LayerCAM) cannot filter out the background context, because it merely aggregates the visual explanations from multiple layers by pixel-wise maximization. This aggregation may include unnecessary information from the feature maps in the visual explanation. In contrast, the proposed method can efficiently suppress the feature maps biased toward the background by assigning a different importance weight to each layer when generating the visual explanation. In particular, for the "parkinglot" results (fourth column), the proposed CPANet completely excludes the background and achieves a high-quality visual explanation, whereas the other methods highlight the background and miss some objects. Moreover, such background bias is also mitigated for larger constructions such as the freeway (third column) and buildings (fifth column) by allocating small weights to the background-biased feature maps with our cascading attention method. Likewise, in the "airplane" images (last two columns), a critical background bias occurs in the other methods, which also focus on the airstrip. In summary, in terms of capturing the exact boundary of the object, the proposed model shows better quality than the top-layer explanation methods (CAM, ABN, LFI-CAM).

FIGURE 12: Comparison of visual explanation results in the feature pyramid blocks on a UCMerced image. The proposed CPANet can propagate the useful context of the low layer (the boundary of the "airplane") in a cascading manner.
The spatial information loss in the pyramid feature blocks leads to unclear object locations in the visual explanation, resulting in a focus on the background. In addition, the multiscale feature-map method (LayerCAM) cannot distinguish the explainability of local explanations, highlighting both the object and the background. On the other hand, CPANet can accurately distinguish between objects and backgrounds in complex images. Fig. 11 also shows visual explanations on the NWPU-RESISC45 dataset. In these results, CPANet can also identify the target objects from the background and redundant objects (see the first column), while the other methods suffer from background bias. In the case of multiple objects in a single image (see the first to third columns), CPANet concentrates on all objects as fairly as possible while separating the background and unnecessary context (e.g., the "car" in the first column). By efficiently mitigating such background bias in the visual explanation, our method provides higher explainability.
Table 2 reports the explainability of the attention branch methods on the UCMerced dataset in terms of the average drop and increase in confidence metrics. Note that a lower average drop and a higher increase in confidence indicate better explainability. ABN shows low performance compared with the other methods. CPANet achieves an average drop that is at least 11.18% lower, which implies that the proposed CPANet generates visual explanations covering the entire objects, including the small contexts, while ABN and LFI-CAM can miss the discriminative parts of the objects and highlight unnecessary regions that disturb the model prediction. In contrast, the increase in confidence of CPANet is worse than that of LFI-CAM (by about 2.86%), which implies that the number of explanations highlighting the most discriminative region is larger for LFI-CAM. Summarizing the results, LFI-CAM can highlight the most distinctive parts of the objects, but it can also miss small contexts and highlight unnecessary backgrounds. On the other hand, the proposed CPANet stably captures the overall contexts of the objects while showing better task accuracy and maximum sensitivity than LFI-CAM.
TABLE 3: Evaluation of the average drop and increase in confidence of visual explanation for Grad-CAM and CPANet over the pyramid feature blocks (Block) in ResNet-18. We also use the UCMerced dataset in this experiment.

              Grad-CAM                      Proposed CPANet
              Avg. Drop   Inc. in Conf.     Avg. Drop   Inc. in Conf.
Block 1       -           3.81              68.88       5.71
Block 2       83.16       1.43              65.5        5.71
Block 3       69.88       2.86              71.46       3.81
Block 4       62.8        4.76              63.11       4.76
D. ABLATION STUDIES OF THE PROPOSED CPANET
In this section, we provide a detailed analysis of the advantages of the proposed CPANet. For the ablation studies, we examine the visual explanations of the intermediate pyramid feature blocks and measure the inconsistency of the visual explanations across the different blocks. We compare the results with Grad-CAM, which can extract visual explanation from the intermediate feature maps. We compute the inconsistency over all possible combinations of the pyramid feature blocks. On the UCMerced validation dataset, the mean and variance for the proposed CPANet are 0.212 and 0.004, while Grad-CAM shows 0.31 and 0.019, respectively. This result can be attributed to the feature map being guided by the previous attention map, as described in Eq. (6). Through the cascading attention blocks, the spatial information loss of the higher convolution layers can be compensated, relaxing the background bias. Fig. 12 illustrates the visual explanations of CPANet and Grad-CAM. We can observe that the proposed model propagates the boundary of the airplane in pyramid feature block 1 to the following blocks, while Grad-CAM only focuses on the regions of each local pyramid feature block.
Table 3 presents the explainability results in terms of the average drop and increase in confidence. In terms of increase in confidence, CPANet shows higher visual explainability over all the pyramid feature blocks, which implies that its visual explanation captures the most discriminative regions better than Grad-CAM. For the average drop, however, Grad-CAM shows slightly higher performance (i.e., a lower average drop) in Block 3 and Block 4. We discuss two perspectives on this result. First, Grad-CAM shows inconsistent explanations across the pyramid feature blocks. Its explanation in Block 1 yields a NaN value because the model fails to predict the c-class probability for the explanation map, indicating unstable explainability among the pyramid feature blocks. Second, in terms of the average drop, one can argue that the proposed explanation mechanism is slightly worse than Grad-CAM (by about 0.31% in Block 4). Grad-CAM is a class activation map method, meaning that its visual explanation is tied to a specific class via the gradient of the c-class confidence, whereas the explanation of the attention branch methods (CPANet, ABN, LFI-CAM) is about the salient objects independent of the target classes.
Therefore, if the visual explanation of CPANet highlights only an object unrelated to the ground-truth class c, then the c-class confidence of the explanation map may be near 0 because all regions of the target objects are removed. In Grad-CAM, however, even if the model predicts a different target class, the visual explanation for class c may still contain tiny regions related to that class, because it is generated solely from the gradient of the c-class confidence. For this reason, the class activation map-based Grad-CAM outperforms ABN and LFI-CAM in terms of the average drop (see Table 2).

FIGURE 13: An example of a sample (ground-truth label "river") with inconsistent explanations across the pyramid feature blocks. The supervisor can select the proper visual explanation (block 2) as weak supervision.

TABLE 4: Evaluation of the attention refinement based on weak supervision on the validation dataset: "Init. Train" is the CPANet trained on the training dataset, and "Att. Ref." is the CPANet retrained with the supervisor's feedback for the images with inconsistent explanations. Inconsistency values are reported as mean (variance).

                              Init. Train     Att. Ref.
Top-1 error                   3.81            3.81
Inconsistency U               0.212 (4e-3)    0.19 (2.7e-3)
Max. Sensitivity S_max        0.11            0.06
E. EVALUATION OF ATTENTION REFINEMENT WITH
WEAK SUPERVISION
In this section, we evaluate the attention refinement scheme, which comprises the active learning-based sampling of inconsistent explanations and weak supervision at the ground station. From the ResNet model initially trained on the UCMerced training dataset, we measure the maximum inconsistency among the local attention maps (i.e., the visual explanations of the local pyramid feature blocks) and then sample the images whose value exceeds the threshold. In this experiment, we set the threshold γ to 0.3 and the weighting α to 1.0. Through this data sampling based on the explanation inconsistency, we sample almost 700 images out of a total of 2,100 training samples. An example of the sampled data is illustrated in Fig. 13. We can notice that the background bias occurs in pyramid feature block 4, which results in inconsistent explanations across the pyramid feature blocks. In this case, the supervisor can resolve the background bias by providing weak supervision (i.e., selecting block 2 as the weak ground truth) and retraining the model with the loss function in Eq. (17).
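As a concrete illustration of the sampling step described above, the sketch below selects images whose maximum pairwise inconsistency across the local attention maps exceeds γ; the inconsistency function of Eq. (14) and the per-block attention extractor are assumed to be available and are passed in as placeholders.

# Sketch of the active sampling used for attention refinement.
# `inconsistency(a, b)` follows Eq. (14) and is assumed to be available;
# `local_maps(x)` returns the per-block attention maps of CPANet (placeholders).
from itertools import combinations

GAMMA = 0.3   # inconsistency threshold used in this experiment

def max_inconsistency(maps, inconsistency):
    """Maximum pairwise inconsistency among the local attention maps."""
    return max(inconsistency(maps[i], maps[j])
               for i, j in combinations(range(len(maps)), 2))

def select_for_feedback(dataset, local_maps, inconsistency, gamma=GAMMA):
    """Return indices of images whose explanation is too inconsistent for onboard use."""
    selected = []
    for idx, (x, _) in enumerate(dataset):
        if max_inconsistency(local_maps(x), inconsistency) > gamma:
            selected.append(idx)
    return selected

# The supervisor then picks the most plausible block per selected image
# (e.g., block 2 in Fig. 13); the chosen map serves as the weak ground truth
# when retraining with the refinement loss of Eq. (17) (weighting alpha = 1.0).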
For the refinement, we set the learning rate to 0.01 and the weight decay to 1e-4, as in the initial training stage. The results are presented in Table 4. The top-1 error remains the same after the attention refinement, which means that the DL model already predicts the correct target class for the sampled data, i.e., the cross-entropy loss in Eq. (8) is saturated before the refinement. From the perspective of explainability, however, the proposed attention refinement reduces both the inconsistency U and the maximum sensitivity S_max. These results show that the refinement scheme with simple supervisor feedback enables CPANet to generate robust explanations onboard.
F. COMPARISON OF COMPUTATIONAL COST FOR THE
PROPOSED FEDERATED COMPUTING WITH EPU
EMBEDDED SYSTEM
In this section, we discuss the computational costs of the proposed FOGS computing compared to the conventional approach (i.e., all captured images are transmitted via the satellite downlink and processed by the ground station). In the proposed method, only the interesting objects are selectively transmitted to the ground station, and the area occupied by an object may be small relative to the entire satellite image. Due to the limitation of the satellite electrical power system (EPS), we define the computational cost metric C as the expected energy consumption for processing all captured images D from a satellite. We denote by |D| and data(D) the number of image patches and the data volume (GB) to be transmitted, respectively. The computational cost C_GS of the conventional, ground station (GS) only, approach is given by Eq. (22),

\[ C_{GS} = P_{comm} \cdot \frac{\mathrm{data}(\mathcal{D})}{R_{comm}} + P_{GS} \cdot |\mathcal{D}| \cdot t_{GS}, \tag{22} \]

where P_GS is the active power consumption of the ground station HW, t_GS is the processing time per image, and P_comm and R_comm are the transmission power and transmission rate from the satellite to the ground station, respectively. Similarly, the computational cost C_FOGS of FOGS is given by Eq. (23),

\[ C_{FOGS} = P_{OBD} \cdot |\mathcal{D}| \cdot t_{OBD} + P_{comm} \cdot \frac{\mathrm{data}(\rho \cdot \mathcal{D})}{R_{comm}}, \tag{23} \]

where P_OBD and t_OBD are the active computation power and processing time of the onboard (OBD) HW, and ρ is the ratio of the selected image patches (i.e., interesting targets for the ground station).
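The two cost models can be evaluated directly, as in the sketch below; the communication parameters follow the values used in our simulation, while the per-patch operation count, compressed patch size, and hardware throughput/power figures are hypothetical placeholders rather than the exact inputs behind Table 5.

# Sketch of the energy-cost models in Eqs. (22) and (23).
# Workload and hardware numbers below are illustrative assumptions only.

P_COMM = 1.0                # satellite transmission power [W] (from [43])
R_COMM = 200e6 / 8          # 200 Mbps downlink, converted to bytes/s

def cost_ground_station(n_images, bytes_total, p_gs, t_gs):
    """Eq. (22): transmit all patches, then process them at the ground station."""
    return P_COMM * bytes_total / R_COMM + p_gs * n_images * t_gs

def cost_fogs(n_images, bytes_total, p_obd, t_obd, rho):
    """Eq. (23): process all patches onboard, transmit only the selected ratio rho."""
    return p_obd * n_images * t_obd + P_COMM * (rho * bytes_total) / R_COMM

# Hypothetical example: 100,000 patches, 30 GOPs per patch, ~50 KB per compressed patch
n_images, ops_per_image = 100_000, 30e9
bytes_total = n_images * 50_000
t_gs = ops_per_image / 100e12   # assumed peak throughput of the ground-station GPU
t_obd = ops_per_image / 1e12    # assumed peak throughput of the low-power onboard HW

print(cost_ground_station(n_images, bytes_total, p_gs=320.0, t_gs=t_gs) / 1e3, "kJ")
print(cost_fogs(n_images, bytes_total, p_obd=0.2, t_obd=t_obd, rho=0.3) / 1e3, "kJ")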
We then evaluate the computational cost for processing |D| = 100,000 captured image patches (each with a 224×224 spatial dimension). Following the SANSA parameters referenced in [43], we set P_comm = 1 W and R_comm = 200 Mbps. For the simulation, we assume that the processing time is the total number of operations divided by the maximum TOPS (Trillion Operations per Second) of the HW accelerator. We also apply the image compression ratio of the Consultative Committee for Space Data Systems (CCSDS) 123.0-B-2 standard [44] to the captured image patches (in the lossless compression case). Table 5 shows the resulting computational costs in terms of energy consumption (Kilojoules; KJ) for various settings. The results indicate that filtering patches in FOGS computing is effective in terms of the computational cost C, providing lower energy consumption for processing the given images. The Xilinx XCZU7EV-based onboard system, the COTS HW used to evaluate the onboard CloudScout model in [45], shows a higher cost (>9.9×) than the EPU embedded system (Section III-E) due to the much lower power consumption of the EPU. With low-power HW, the effect of the proposed onboard-ground station computing is more pronounced. The result implies that low-power engineering of the onboard HW is mandatory for onboard XAI computing.
TABLE 5: Comparison of the computational costs of the conventional approach (ground station only) C_GS and federated onboard-ground station (FOGS) computing C_FOGS with various onboard HW, based on Eqs. (22) and (23). C_GS uses an NVIDIA RTX 3080 as the ground station HW and does not depend on ρ.

            C_GS (KJ)            C_FOGS (KJ)
            NVIDIA RTX 3080      Xilinx XCZU7EV               EPU
                                 embedded system [45]         embedded system
ρ = 1.0     18.06                14.26                        1.43
ρ = 0.5     18.06                13.7                         1.07
ρ = 0.3     18.06                13.42                        0.59
ρ = 0.1     18.06                13.2                         0.35
V. CONCLUSION
In this paper, we proposed a federated onboard-ground station computing framework for satellite image analysis. For reliable analysis in complex space-related applications, especially object recognition, we introduced a novel XAI method, CPANet, for onboard processing. By utilizing the rich information for explainability in the multiple pyramid feature blocks, the proposed model improves not only the visual explainability, in terms of robustness to data perturbation, but also the task performance. In addition, we proposed an onboard refinement scheme with the supervisor's feedback. Using weak supervision, the proposed refinement mechanism can reduce the supervisor's annotation cost and improve visual explainability. In future work, we will extend the architecture to the object detection task and develop a prototype system with a low-power AI accelerator. Due to the limited electrical power system (EPS) of a satellite, an onboard AI model should be lightweight. Although the proposed CPANet improves accuracy and visual explainability effectively, it requires additional computation. Therefore, we will additionally consider co-designing network compression (pruning and weight quantization) for the implementation. We will then validate the feasibility of our onboard system in terms of processing time and power consumption.
REFERENCES
[1] Xiaofei Zhou, Kunye Shen, Zhi Liu, Chen Gong, Jiyong Zhang, and
Chenggang Yan. Edge-aware multiscale feature integration network for
salient object detection in optical remote sensing images. IEEE Transac-
tions on Geoscience and Remote Sensing, 60:1–15, 2021.
[2] Qi Ming, Zhiqiang Zhou, Lingjuan Miao, Hongwei Zhang, and Linhao Li.
Dynamic anchor learning for arbitrary-oriented object detection. In Pro-
ceedings of the AAAI Conference on Artificial Intelligence, volume 35,
pages 2355–2363, 2021.
[3] Zhiyong Lv, Tongfei Liu, and Jón Atli Benediktsson. Object-oriented key
point vector distance for binary land cover change detection using vhr
remote sensing images. IEEE Transactions on Geoscience and Remote
Sensing, 58(9):6524–6533, 2020.
[4] Heejae Kim, Kyungchae Lee, Changha Lee, Sanghyun Hwang, and Chan-
Hyun Youn. An alternating training method of attention-based adapters
for visual explanation of multi-domain satellite images. IEEE Access,
9:62332–62346, 2021.
[5] Woojoong Kim and Chan-Hyun Youn. Cooperative scheduling schemes
for explainable dnn acceleration in satellite image analysis and retraining.
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYS-
TEMS, 33(7):1605–1618, 2022.
[6] Qi Wang, Wei Huang, Zhitong Xiong, and Xuelong Li. Looking closer
at the scene: Multiscale representation learning for remote sensing image
scene classification. IEEE Transactions on Neural Networks and Learning
Systems, 2020.
[7] Gianluca Giuffrida, Luca Fanucci, Gabriele Meoni, Matej Batič, Léonie
Buckley, Aubrey Dunne, Chris van Dijk, Marco Esposito, John Hefele,
Nathan Vercruyssen, et al. The ϕ-sat-1 mission: the first on-board
deep neural network demonstrator for satellite earth observation. IEEE
Transactions on Geoscience and Remote Sensing, 60:1–14, 2021.
[8] Kai Xiao, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. Noise
or signal: The role of image backgrounds in object recognition. arXiv
preprint arXiv:2006.09994, 2020.
[9] Sangwoo Mo, Hyunwoo Kang, Kihyuk Sohn, Chun-Liang Li, and Jinwoo
Shin. Object-aware contrastive learning for debiased scene representation.
Advances in Neural Information Processing Systems, 34:12251–12264,
2021.
[10] Nick Buonaiuto, Mark Louie, Jim Aarestad, Rohit Mital, Dennis Mateik,
Robert Sivilli, Apoorva Bhopale, Craig Kief, and Brian Zufelt. Satellite
identification imaging for small satellites using NVIDIA. 2017.
[11] Vivek Kothari, Edgar Liberis, and Nicholas D Lane. The final frontier:
Deep learning in space. In Proceedings of the 21st international workshop
on mobile computing systems and applications, pages 45–49, 2020.
[12] Emily Dunkel, Jason Swope, Zaid Towfic, Steve Chien, Damon Russell,
Joseph Sauvageau, Douglas Sheldon, Juan Romero-Cañas, Jose Espinosa-
Aranda, Léonie Buckley, et al. Benchmarking deep learning inference of
remote sensing imagery on the qualcomm snapdragon and intel movidius
myriad x processors onboard the international space station. In 2022 IEEE
International Geoscience and Remote Sensing Symposium, 2022.
[13] Zaid Towfic, Dennis Ogbe, Joe Sauvageau, Douglas Sheldon, Andre
Jongeling, Steve Chien, Faiz Mirza, Emily Dunkel, Jason Swope, Mehmet
Ogut, et al. Benchmarking and testing of qualcomm snapdragon system-
on-chip for jpl space applications and missions. In 2022 IEEE Aerospace
Conference (AERO), pages 1–12. IEEE, 2022.
[14] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolu-
tional networks for biomedical image segmentation. In International Con-
ference on Medical image computing and computer-assisted intervention,
pages 234–241. Springer, 2015.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet
classification with deep convolutional neural networks. In F. Pereira, C.J.
Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural
Information Processing Systems, volume 25. Curran Associates, Inc.,
2012.
[16] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig
Adam. Rethinking atrous convolution for semantic image segmentation.
arXiv preprint arXiv:1706.05587, 2017.
[17] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I
trust you?": Explaining the predictions of any classifier. In KDD, 2016.
[18] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting
model predictions. In NeurIPS, 2017.
[19] Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized in-
put sampling for explanation of black-box models. arXiv preprint
arXiv:1806.07421, 2018.
[20] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio
Torralba. Learning deep features for discriminative localization. In CVPR,
2016.
[21] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In ICLR,
2014.
[22] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakr-
ishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual
explanations from deep networks via gradient-based localization. ICCV,
128, 2017.
[23] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N.
Balasubramanian. Grad-cam++: Generalized gradient-based visual expla-
nations for deep convolutional networks. WACV, 128, 2018.
[24] Peng-Tao Jiang, Chang-Bin Zhang, Qibin Hou, Ming-Ming Cheng, and
Yunchao Wei. LayerCAM: Exploring hierarchical class activation maps
for localization. IEEE TIP, 30, 2021.
[25] Hiroshi Fukui, Tsubasa Hirakawa, Takayoshi Yamashita, and Hironobu
Fujiyoshi. Attention branch network: Learning of attention mechanism
for visual explanation. In CVPR, 2019.
[26] Kwang Hee Lee, Chaewon Park, Junghyun Oh, and Nojun Kwak. LFI-
CAM: Learning feature importance for better visual explanation. In ICCV,
2021.
[27] Hanchao Li, Pengfei Xiong, Jie An, and Lingxue Wang. Pyramid attention
network for semantic segmentation. In BMVC, 2018.
[28] Wenguan Wang, Shuyang Zhao, Jianbing Shen, Steven C. H. Hoi, and Ali
Borji. Salient object detection with pyramid attention and salient edges. In
CVPR, 2019.
[29] Ting Zhao and Xiangqian Wu. Pyramid feature attention network for
saliency detection. In CVPR, 2019.
[30] J Mentor. Improving the explainability based on mediating visual ex-
planations from various attention episodes with multi-disciplinary debate.
Manuscript submitted for publication, 2022.
[31] Ming Jiang, Juan Xu, and Qi Zhao. Saliency in crowd. In European
conference on computer vision, pages 17–32. Springer, 2014.
[32] Abhishek Das, Harsh Agrawal, Larry Zitnick, Devi Parikh, and Dhruv
Batra. Human attention in visual question answering: Do humans and
deep networks look at the same regions? Computer Vision and Image
Understanding, 163:90–100, 2017.
[33] Ramprasaath R Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini
Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh. Taking a hint: Lever-
aging explanations to make vision and language models more grounded.
In Proceedings of the IEEE/CVF international conference on computer
vision, pages 2591–2600, 2019.
[34] Jialin Wu and Raymond Mooney. Self-critical reasoning for robust visual
question answering. Advances in Neural Information Processing Systems,
32, 2019.
[35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In CVPR, 2016.
[36] Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions
for land-use classification. In Proceedings of the 18th SIGSPATIAL
international conference on advances in geographic information systems,
pages 270–279, 2010.
[37] Nikos Komodakis and Sergey Zagoruyko. Paying more attention to
attention: improving the performance of convolutional neural networks via
attention transfer. In ICLR, 2017.
[38] Zeyi Huang, Yang Zou, BVK Kumar, and Dong Huang. Comprehensive at-
tention self-distillation for weakly-supervised object detection. Advances
in neural information processing systems, 33:16797–16807, 2020.
[39] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene
classification: Benchmark and state of the art. Proceedings of the IEEE,
105(10):1865–1883, 2017.
[40] Chih-Kuan Yeh, Cheng-Yu Hsieh, and Arun Sai Suggala. On the
(in)fidelity and sensitivity of explanations. In NeurIPS, 2019.
[41] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui
Ding, Piotr Mardziel, and Xia Hu. Score-CAM: Score-weighted visual
explanations for convolutional neural networks. In CVPR Workshop,
2020.
[42] Harish Guruprasad Ramaswamy et al. Ablation-cam: Visual explanations
for deep convolutional network via gradient-free localization. In Proceed-
ings of the IEEE/CVF Winter Conference on Applications of Computer
Vision, pages 983–991, 2020.
[43] Yuanjun Wang, Jiaxin Zhang, Xing Zhang, Peng Wang, and Liangjingrong
Liu. A computation offloading strategy in satellite terrestrial networks
with double edge computing. In 2018 IEEE international conference on
communication systems (ICCS), pages 450–455. IEEE, 2018.
[44] Miguel Hernández-Cabronero, Aaron B Kiely, Matthew Klimesh, Ian
Blanes, Jonathan Ligo, Enrico Magli, and Joan Serra-Sagristà. The ccsds
123.0-b-2 “low-complexity lossless and near-lossless multispectral and
hyperspectral image compression” standard: A comprehensive review.
IEEE Geoscience and Remote Sensing Magazine, 9(4):102–119, 2021.
[45] Emilio Rapuano, Gabriele Meoni, Tommaso Pacini, Gianmarco Dinelli,
Gianluca Furano, Gianluca Giuffrida, and Luca Fanucci. An fpga-based
hardware accelerator for cnns inference on board satellites: benchmarking
with myriad 2-based solution for the cloudscout case study. Remote
Sensing, 13(8):1518, 2021.
TAEWOO KIM received the B.S. degree in electri-
cal engineering from Kyungpook National Univer-
sity, Daegu, South Korea in 2015, and the M.S. degree
in electrical engineering from Korea Advanced
Institute of Science and Technology (KAIST),
Daejeon, South Korea in 2017. He is currently
pursuing the Ph.D. degree in electrical engineering at Korea Advanced Institute of Science and Technology (KAIST). His research interests in-
clude Deep Learning (DL) framework, GPU com-
puting for DL, eXplainable AI (XAI).
MINSU JEON received the B.S. degree in elec-
tronic engineering from Sogang University in
2016. He received the M.S. degree in electri-
cal engineering from Korea Advanced Institute
of Science and Technology (KAIST), Daejeon,
South Korea in 2017. He is currently pursuing
the Ph.D. degree in electrical engineering at Korea Advanced Institute of Science and Technology
(KAIST). His research interests include deep
learning (DL) application/model, DL serving and
high performance computing system.
CHANGHA LEE received the B.S. degree in elec-
tronic engineering from Hanyang Univ., Seoul,
Korea (2018), and M.S. degree in electronic engi-
neering from Korea Advanced Institute of Science
and Technology (KAIST), Daejeon, Korea (2020).
He is currently a Ph.D. candidate at KAIST. Since 2018, he has been a member of the Network and Computing Laboratory at KAIST, and his current research
interests include deep learning acceleration plat-
form, continual learning, integrated system for
explainable AI.
JUNSOO KIM received the B.S. degree from
the School of Electrical & Computer Engineering,
Ulsan National Institute of Science and Technol-
ogy (UNIST), Ulsan, South Korea, in 2021. He
is currently pursuing the M.S. degree in electrical
engineering with the Korea Advanced Institute of
Science and Technology (KAIST), Daejeon, South
Korea. His research interests include low-power
system on-chip design and energy-efficient deep-
neural network inference/training accelerators.
GEONWOO KO received the B.S. degree, double
major in biomedical engineering and electrical
engineering from Korea University, Seoul, South
Korea, in 2022. He is currently pursuing the M.S.
degree in electrical engineering with the Korea
Advanced Institute of Science and Technology
(KAIST), Daejeon, South Korea. His research
interests include near-data processing, domain-
specific accelerators, and low-power system-on-
chip design.
JOO-YOUNG KIM received the B.S., M.S., and
Ph.D. degrees in electrical engineering from the
Korea Advanced Institute of Science and Tech-
nology (KAIST), Daejeon, South Korea, in 2005,
2007, and 2010, respectively. He is currently an
Assistant Professor with the School of Electrical
Engineering, KAIST. He is also the Director of
the AI Semiconductor Systems Research Center.
His research interests include various aspects of
hardware design, including VLSI design, com-
puter architecture, FPGA, domain-specific accelerators, hardware/software
co-design, and agile hardware development. Before joining KAIST, he was
a Senior Hardware Engineering Lead at Microsoft Azure, Redmond, WA,
USA, working on hardware acceleration for its hyper-scale big data analytics
platform named Azure Data Lake. He was also one of the initial members
of Catapult project at Microsoft Research, Redmond, WA, USA, where he
deployed a fabric of field-programmable gate arrays (FPGAs) in datacenters
to accelerate critical cloud services, such as machine learning, data storage,
and networking. He was a recipient of the 2016 IEEE Micro Top Picks
Award, the 2014 IEEE Micro Top Picks Award, the 2010 DAC/ISSCC
Student Design Contest Award, the 2008 DAC/ISSCC Student Design
Contest Award, and the 2006 A-SSCC Student Design Contest Award. He
currently serves as an Associate Editor for the IEEE TRANSACTIONS ON
CIRCUITS AND SYSTEMS—I: REGULAR PAPERS.
CHAN-HYUN YOUN (S’84–M’87–SM’2019)
received the B.Sc and M.Sc degrees in Electronics
Engineering from Kyungpook National Univer-
sity, Daegu, Korea, in 1981 and 1985, respectively,
and the Ph.D. degree in Electrical and Commu-
nications Engineering from Tohoku University,
Japan, in 1994. Before joining the University, from
1986 to 1997, he was a Head of High-Speed Net-
working Team at KT Telecommunications Net-
work Research Laboratories, where he had been
involved in the research and developments of centralized switching main-
tenance system, high-speed networking, and ATM network. Since 1997,
he has been a Professor at the School of Electrical Engineering in Korea
Advanced Institute of Science and Technology (KAIST), Daejeon, Korea.
He was an Associate Vice-President of the Office of Planning and Budgets at KAIST from 2013 to 2017. He is also the Director of the Grid Middleware Research Center and the XAI Acceleration Technology Research Center at KAIST, where he is developing core technologies in the areas
of high performance computing, explainable AI system, satellite imagery
analysis, AI acceleration system and others. He was a general chair for
the 6th EAI International Conference on Cloud Computing (Cloud Comp
2015), KAIST, in 2015. He wrote a book on Cloud Broker and Cloudlet for
Workflow Scheduling (Springer) in 2017. Dr. Youn was also a Guest Editor of IEEE Wireless Communications in 2016 and has served as a TPC member for many international conferences.