Recursive Visual Explanations Mediation
Scheme based on DropAttention Model
with Multiple Episodes Pool
MINSU JEON*, TAEWOO KIM*, SEONGHWAN KIM, CHANGHA LEE, (Member, IEEE), AND
CHAN-HYUN YOUN, (Senior Member, IEEE)
School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea
Corresponding author: Chan-Hyun Youn (e-mail: chyoun@kaist.ac.kr).
This work was supported in part by the Challengeable Future Defense Technology Research and Development Program (No.915027201)
of Agency for Defense Development, and in part by Electronics and Telecommunications Research Institute (ETRI) grant funded by the
Korean government [22ZS1100, Core Technology Research for Self-Improving Integrated Artificial Intelligence System]. (*Minsu Jeon
and Taewoo Kim contributed equally to this work.)
ABSTRACT In some DL applications such as remote sensing, it is hard to obtain high task performance (e.g., accuracy) from a DL model for image analysis due to the low-resolution characteristics of the imagery. Accordingly, several studies have attempted to provide visual explanations or apply the attention mechanism to enhance the reliability of the image analysis. However, obtaining a sophisticated visual explanation with such existing methods still involves structural complexity: 1) from which layer the visual explanation should be extracted, and 2) to which layers the attention modules should be applied. 3) Subsequently, in order to observe the aspects of visual explanations over such diverse episodes of applying attention modules individually, training cost inefficiency inevitably arises, as the conventional methods require training multiple models one by one. To solve these problems, we propose a new scheme that mediates visual explanations recursively at the pixel level. Specifically, we propose DropAtt, which generates a pool of multiple episodes by training only a single network once as an amortized model and shows stable task performance regardless of the layer-wise attention policy. From the multiple episodes pool generated by DropAtt, by quantitatively evaluating the explainability of each visual explanation and recursively expanding the parts of explanations with high explainability, our visual explanation mediation scheme adjusts how much each episodic layer-wise explanation is reflected so as to enforce the dominant explainability of each candidate. In the empirical evaluation, our methods show their feasibility for enhancing visual explainability by reducing the average drop by about 17% and increasing the rate of increase in confidence by about 3%.
INDEX TERMS Explainable AI (XAI), attention, class activation map (CAM), amortized model.
I. INTRODUCTION
Recently, with the development of deep learning (DL) models, several studies [1]–[3] have attempted to apply them to image analysis fields such as remote sensing or medical analysis. However, it is hard to clearly distinguish object classes in such satellite imagery due to its relatively low resolution; therefore, an explanation of the prediction is further required to provide reliability to the user via explainable AI (XAI) methods [4], [5].
In this paper, as one of the areas of XAI, we deal with visual explanation, which visually represents the estimated importance of the predicted result at the pixel level. Specifically, Grad-CAM [6] is one of the representative methods for deriving a visual explanation; it derives a class activation map at the pixel level from the input image. However, such visual explanations show different aspects depending on the position of the target layer from which the explanation is derived, and accordingly, there remains the problem of selecting which explanation to trust for making a final decision.
Moreover, several studies [7], [8] have attempted to improve task performance by reflecting the attention map in the task path, and their qualitative results show that a more sophisticated visual explanation can also be derived by applying the attention modules. However, such attention maps also show different
aspects according to the layer-wise policy of applying attention modules (apply or not), and even the visual explanations derived from the layers where the attention modules are not applied are modified by the other attention-applied layers in the process of feed-forward and back-propagation, which makes it more complex to choose a reliable explanation from such various episodes for making a final decision. In this paper, we denote an episode as one of the possible cases of applying attention modules to the layers of the model.
In order to resolve the difficulty of choosing a reliable explanation among conflicting explanations that vary with target layers and layer-wise attention policies, we propose a series of methods to integrate such complex explanations into a single explanation at a human-manageable level. The key concept of the proposed methods is to integrate the explanations from multiple episodes by reflecting the partial region of each episodic explanation, where the regional reflecting ratio is adjusted based on two quantitative explainability indicators. However, this requires training several models one by one from scratch to generate multiple episodes of applying attention modules, which results in an enormous time-consuming overhead.
To realize this practically, we address three main technical issues.
• First, we propose DropAtt to enable handling of the various attention episodes that vary with the layer-wise policy of applying attention modules (apply or not) in a training-cost-effective way, allowing several attention-variant episodes to be generated by training a single amortized network once, while maintaining stable consistency of task performance over multiple episodes.
• In order to ensure consistency between the task path and the explanation integration path, we also construct an adversarial game to search initial settings for mediating the explanations, where a generator that tries to integrate explanations so as to fake the task path and a discriminator that distinguishes the task path from the explanation integration path compete with each other.
• From such initial settings for mediating the explanations, the episodic layer-wise explanations that show higher explainability on either of two indicators (multi-disciplinary) are selected as debating candidates, and the regional reflecting ratio of each debating candidate is adjusted incrementally to mediate the conflicts among complementary explanations, which induces improvement in both multi-disciplinary explainability indicators.
To evaluate the feasibility of the proposed methods, we conducted an experimental evaluation on a satellite imagery dataset, and it is shown that the proposed explanation mediating scheme enhances the visual explainability.
In the following sections, the backgrounds of visual explanation and the attention mechanism are presented in Section II. The problem descriptions are delivered in Section III, and the details of the proposed method for solving the problem are presented in Section IV. Finally, the empirical evaluation of the proposed method is addressed in Section V.
II. RELATED WORK
A. VISUAL EXPLANATION
As a method for deriving a visual explanation of the predictions from a task model, quantifying and representing the importance of each pixel in the input image that contributed to the prediction process is a common approach [9]. One of the prevalent methods is observing changes in the predictions of the task model by invoking perturbation or occlusion on the input [10], [11].
As a more convenient way of deriving the visual explanation, it has been widely shown that the activation map of a convolutional layer can localize objects even though the network is only trained on image-level classification, with the help of characteristics of weakly supervised learning [12], [13]. Using such characteristics, the class activation map (CAM) [14] attempts to represent the pixel-wise distinct regions for identifying each class. However, such a structure has the restrictions that global average pooling (GAP) should be applied to the last convolutional layer and that CAM is basically derived only at the last layer.
In order to make up for such limitations, gradient-weighted class activation mapping (Grad-CAM) [6] produces a localization map emphasizing the pixels in the image that are important for the task prediction by weighting the activation map with its pixel-wise gradients. Extending Grad-CAM, Grad-CAM++ [15] uses a weighted combination of the positive partial derivatives of the last convolutional layer's feature maps, and Ablation-CAM [16] applies ablation analysis to estimate pixel-wise class importance. Moreover, several studies attempt to deliver interpretable explanations in various application domains [17], [18].
However, the visual explanations extracted by such CAM-based methods show different aspects from each other depending on which layer the explanation is extracted from. In practice [19], the explanations extracted from the rear layers of the task model can largely represent the information of the adjacent final prediction layer, but the precision of localization is degraded due to pooling through the several previous layers. On the contrary, the explanations extracted from the front layers show relatively high precision of localization, but redundant information is also included as the information of the final prediction layer fades out while passing through the rear layers. In an attempt to consider such different aspects of visual explanations according to the target layers, LayerCAM [19] proposed a method of aggregating the visual explanations (local explanations) extracted from multiple layers into one global explanation, but there still remains the limitation of containing redundant information due to its simple fusion method, which just conducts an element-wise maximization among local explanations.
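For reference, the element-wise maximization fusion attributed to LayerCAM above can be summarized by the following minimal sketch; it assumes the layer-wise maps have already been normalized and upsampled to a common resolution, and the function name is ours for illustration.

```python
import numpy as np

def fuse_by_elementwise_max(local_explanations):
    """Fuse layer-wise CAMs into one global map by element-wise maximization
    (the simple fusion strategy described above for LayerCAM).

    local_explanations: list of HxW arrays, already normalized to [0, 1]
    and upsampled to a common spatial resolution.
    """
    stacked = np.stack(local_explanations, axis=0)   # (L, H, W)
    return stacked.max(axis=0)                       # (H, W) global explanation
```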
B. ATTENTION MODULE
Likewise to CAM, attention plays an important role in human perception [20]–[22]; the human visual system does not process the whole scene at once, but focuses on salient parts selected from a series of partial glimpses.
Based on this mechanism, several studies commonly attempt to improve task performance by adding a path that generates an attention map in parallel to a layer of the task network and reflecting it back into the task path.
As a representative method, the residual attention network [23] introduces an attention module in the form of an encoder-decoder. CBAM [8] does not compute a 3D attention map directly, but separates it into channel attention and spatial attention, and reflects channel-wise and spatial-wise attentive analysis by applying each attention sequentially. Moreover, the attention branch network [7] constructs an attention branch that consists of an attention generating path and an attention-based task inferencing path, and trains the whole network with a loss function in which both the attention branch and the perception branch are bound together.
Such attention modules can also extract attention maps that replace the visual explanations of CAM-based methods, but since class-wise distinguishing features like CAM cannot be extracted except at the last layer, only the attention maps from the last layer are extracted, or CAM-based methods are orthogonally applied to the other layers to extract visual explanations of each predicted label [7], [8], [24]. However, looking only at the explanations of the last layer lacks precise localization due to the pooling over the several layers, while the visual explanations extracted from the front layers show relatively severe noise but highly precise localization characteristics as a trade-off [19].
Moreover, different aspects of the visual explanation are also acquired depending on how the attention module is applied to each layer as well as on the target layers, and computational inefficiency arises when observing each aspect individually, as it requires training multiple models one by one as shown in Fig. 1. We address this issue in detail in Section III.
C. QUANTITATIVE METRICS OF VISUAL EXPLAINABILITY
The aforementioned studies [15], [16], [25] on CAM-based visual explanation not only observe the qualitative result but also attempt to evaluate its feasibility quantitatively. Average drop in percentage is introduced to evaluate how much the highlighted regions of the visual explanation contribute to the confidence of the model's prediction. If the visual explanation correctly emphasizes the regions that are important for making the decision, bypassing only the highlighted regions (removing the others) should mostly lead to a lower drop in the confidence of the prediction, and a lower drop in confidence represents better visual explainability. On this basis, average drop is calculated as the ratio of the drop in prediction confidence when occluding (removing redundant regions of) the original input image with the derived visual explanation, averaged over all target input instances.
Moreover, in order to evaluate such interpretability more
accurately, remove and retrain (ROAR) [26] also attempts to
measure how much the accuracy is decreased when retraining
the model with the input data where a certain ratio of unim-
FIGURE 1. With the conventional attention mechanisms [7], [8], [24], developing an attention-based DL model requires a manual decision on how to adapt the attention modules. The attention modules are generally applied to all layers or only to a certain layer in a fixed form, so a trained model can only reflect a single attention episode. In order to observe multiple attention episodes, the heavy overhead of training a separate model for each episode is inevitably required.
portant pixels are removed gradually, which requires heavy
retraining overhead instead.
On the contrary, good visual explanations can also encourage the model to concentrate only on the discriminative important regions, and this can rather result in an increase in the prediction confidence when only the highlighted regions of the image are given as input [15]. Accordingly, rate of increase in confidence measures the frequency of the event in which the prediction confidence increases when bypassing only the highlighted regions of the input images, and a higher rate represents better explainability.
Furthermore, various other metrics have been introduced; for example, a sanity check on such saliency maps is introduced to measure how sensitive they are to the model parameters [27]. One study [28] attempts to evaluate how consistently the explanation is derived across data in a model-agnostic way. Another study [29] introduces the fidelity and sensitivity of explanations, which measure how similar the explanation is to the impact of various pixel-wise perturbations and how sensitive it is to such perturbations. Among these various metrics, we adopt average drop and rate of increase in confidence as the main criteria for evaluating the quality of the derived visual explanations.
III. PROBLEM DESCRIPTION
Complexity of visual explanations over multiple layers. As described above, CAM-based methods [6], [14], [15] show their feasibility in providing a class activation map as a visual explanation for distinguishing pixel-by-pixel object classes
[Figure 2 panels: Grad-CAM explanations at residual blocks 1–4 of ResNet-18, comparing the case where no attention modules are applied against the case where attention modules are applied to residual blocks 1 and 2; the panel annotations note that attention modules also affect the explanations of subsequent layers and that explanations are inconsistent across layers.]
FIGURE 2. Comparison results of visual explanation on each residual block in ResNet-18 for the satellite imagery.
TABLE 1. Explainability comparison of layer-wise explanations among different attention policies.

Explainability Metric                  Network                     Residual Block 1   Residual Block 2   Residual Block 3   Residual Block 4
Average % drop in confidence ↓         ResNet-18                   75.82              75.98              75.51              72.04
                                       ResNet-18 + ABN@Block1,2    55.57              45.52              57.05              72.41
Rate % of increase in confidence ↑     ResNet-18                   1.00               1.14               1.04               1.33
                                       ResNet-18 + ABN@Block1,2    3.81               1.90               0.47               2.85
on input images. However, such CAM-based methods are mainly applied to the last layer of the task model, which shows poor precision of localization as the signal propagates through pooling over the several layers. Moreover, some methods like Grad-CAM can also be applied to various layers, but different aspects of the visual explanations are extracted for different target layers, which rather increases the difficulty of making a final decision by forcing the human user to select a reliable explanation among the several candidates [27], [30].
In practice, to identify this problem, we extract and observe the results of Grad-CAM on the 4 residual blocks of a ResNet-18 trained on the land use classification dataset of satellite imagery [31], and the top row of Fig. 2 shows the corresponding results. As shown in the results, the visual explanations from the first and second blocks are inconsistent with the visual explanations from the third and fourth blocks. The explanations extracted from the rear layers show relatively poor localization performance due to the spatial pooling over the several layers; on the contrary, the explanations extracted from the front layers show high localization precision, but a relatively high variance of redundant information is also included.
Moreover, in order to quantitatively identify such characteristics, we also examined the average drop in confidence and the rate of increase in confidence, which are widely used as quantitative metrics [15], [16], [25] for evaluating the explainability of a derived visual explanation, on the same dataset and network, and the results are shown in Table 1. As shown in the table, ResNet-18 (without any attention modules) shows a relatively large drop in confidence across all target layers, and in particular, the average drops in confidence at the first and second residual blocks are relatively inferior to those at the subsequent blocks. In other words, as the visual explanations show different aspects according to the extracted target layers, in order to provide a reliable explanation to human users, it is required to distinguish the trade-off between each explanation, to remove the redundant information, and to integrate them into a single global explanation by selecting only the information that practically contributed to the prediction.
Various visual explanations extracted over multiple episodes. In addition, as mentioned in Section II-B, visual explanations show different patterns according to the layer-wise policy (i.e., episode) of applying attention modules (apply or not) as well as according to the target layers; therefore, it is also necessary to consider the diversity of such attention episodes in the process of extracting a single integrated explanation. In order to observe this problem, in addition to the previous experimental results, we also observe the aspects of the visual explanations when the attention modules are applied to specific layers. We add ABN [7] as the attention module to the first and second residual blocks of the previous ResNet-18, train it with the same dataset, and extract Grad-CAM results
TABLE 2. Task performance comparison between training the fixed super network and training with DropAtt across various attention episodes.

Attention module applied                 | Train by the fixed super network | Train by DropAtt (Proposed)
@Block1  @Block2  @Block3  @Block4       | Top-1 accuracy (%)   Δacc.       | Top-1 accuracy (%)   Δacc.
O        O        O        O             | 93.33                -           | 93.81                -
O        X        X        X             | 52.38                -40.95      | 92.85                -0.95
X        O        X        X             | 80.47                -12.85      | 93.33                -0.47
X        X        O        X             | 85.23                -8.09       | 93.81                -0.00
X        X        X        O             | 88.09                -5.23       | 93.81                -0.00
FIGURE 3. Visual explanation methods based on CAM [6], [15] and attention modules [7], [8] involve a choice of which layers to apply them to, and this results in difficulty in making a final decision, as conflicts among the various explanations may occur. Accordingly, we propose a new scheme that mediates the various explanations by selecting and aggregating their complementary parts into a single integrated form.
on each layer. The corresponding qualitative and quantitative results are shown in Fig. 2 (bottom row) and Table 1, respectively.
As shown in Fig. 2, by adding the attention modules, the visual explanations (bottom row) on each layer show different aspects overall compared to the previous results (top row). In particular, even the visual explanation extracted from a layer where the attention module is not applied is also affected by the other attention-applied layers. The quantitative results in Table 1 also show the corresponding result that the explainability of the attention-unapplied layers (third and fourth blocks) is affected (changed). Moreover, as shown in the results, the explainability can be improved on some layers by applying the attention module, but no distinct tendency between the improvement of explainability and the layer-wise policy (i.e., episode) of applying attention modules is observed. For example, in the results, the average drop is improved at the third block but the rate of increase in confidence is degraded, while the fourth block shows the opposite result.
Accordingly, in order to provide higher explainability, it is required to distinguish the trade-off among the various explanations extracted from the various attention episodes as well as target layers through quantitative criteria, to remove the redundant information, and to integrate them into a single global explanation by selecting only the information that practically contributed to the prediction.
Training cost inefficiency for generating various attention episodes. However, as the attention modules [7], [8] are generally applied to the model in a fixed form as shown in Fig. 1, acquiring multiple episodes of attention modules inevitably requires the heavy computational overhead of training multiple episodic networks one by one. Alternatively, one can consider just training the super network (i.e., the model with attention modules applied to all layers) and then adaptively applying layer-wise attention modules from the super network in the inference phase, but this results in a critical deviation in task performance when generating multiple episodes.
To identify the problem, we train the super network by adding ABN attention modules to all residual blocks of ResNet-18 with the same dataset, and adaptively remove attention modules from the trained super network according to the episode in the inference phase. Table 2 shows the task performance (top-1 accuracy) with regard to the various episodes. As shown in the results, the task performance falls by up to 40.95% when generating multiple episodes adaptively from the super network, which
FIGURE 4. Diagram of how the proposed DropAtt operates on each layer of the amortized network in the training phase. (ABN is adopted as an example in this description.)
results in the problem that no meaningful visual explanation can be acquired unless normal prediction performance is maintained.
Therefore, prior to considering the various attention episodes for extracting a sophisticated explanation, in order to acquire the multiple episodes in a training-cost-effective way, a new training method or model structure is first required that can generate various attention episodes while showing stable task performance by training only a single model.
IV. A PROPOSED MODEL
To solve these problems, we propose a scheme of explanation mediation that selects only the complementary parts of each visual explanation from various target layers and attention episodes and integrates them into a single sophisticated explanation, as shown in Fig. 3. Specifically, we propose a new layer architecture (DropAtt) that generates multiple episodes while maintaining stable task performance by training a single amortized model, and we also propose a recursive mediation method that identifies and aggregates the parts of each explanation that enhance the visual explainability from the multiple attention episodes pool produced by DropAtt. The details of each component of our method are presented as follows.
A. DROPATT: GENERATING THE MULTIPLE EPISODES
OF ATTENTION MODULES FROM AN AMORTIZED
MODEL
First, to overcome the computational inefficiency of training multiple models one by one to generate multiple episodes, as is caused by the existing attention module methods [7], [8], we propose a new layer architecture that generates multiple attention episodes while maintaining stable task performance by training only a single amortized model.
As mentioned in Section III, when training the super network (i.e., the model with attention modules applied to all layers) and then adaptively applying layer-wise attention modules from the super network in the inference phase, the gradient is divided into two backward paths (the original task path on the network and the path for the attention module) during training, so the task information is partly reflected in the attention module. Therefore, the task performance is vulnerable to degradation when some attention module is adaptively removed from a super network that was trained with attention modules applied to all layers in a fixed form.
To overcome this vulnerability, similar to the dropout mechanism, we propose DropAttention (DropAtt) as a method to ensure consistent task performance over multiple episodes from a single amortized network: DropAtt reflects all possible episodes of applying attention modules in the training phase, so that multiple episodes can be generated while maintaining stable task performance in the inference phase.
Fig. 4 shows how the proposed DropAtt works on each layer of the convolutional neural network. As shown in Fig. 4, the proposed DropAtt is applied to each layer of the task network, and the attention module on each layer is randomly applied in the training phase by DropAtt. With the layer-wise attention gating (0/1) random variable $z_l \sim \mathrm{Bernoulli}(p)$, which follows a Bernoulli distribution and determines whether or not to apply the attention module $AB_l(\cdot)$ on the $l$-th layer, the feed-forward computation of DropAtt is represented as follows:
$$F'_l = (AB_l(F_l) \otimes F_l)\, z_l + F_l, \qquad (1)$$
where $p$ denotes the probability of not applying the attention module (for every data sample), and $F_l$, $F'_l$ denote the input and output of the attention module at the $l$-th layer. In the training phase, the gradient at the input path of the $l$-th layer is calculated as follows:
$$\frac{\partial F'_l}{\partial \theta_{AB_l}} = \left(\frac{\partial AB_l(F_l)}{\partial \theta_{AB_l}} \otimes F_l + AB_l(F_l) \otimes \frac{\partial F_l}{\partial \theta_{AB_l}}\right) z_l + \frac{\partial F_l}{\partial \theta_{AB_l}}, \qquad (2)$$
therefore, the layers of the task network can be mainly trained by $\frac{\partial F_l}{\partial \theta_{AB_l}}$, while being robust to the randomly passed gradients from the attention paths. In other words, different from the existing attention module methods [7], [8] that train the model in a fixed form, the proposed DropAtt regularizes
FIGURE 5. Overall procedure of mediating explanations of multiple episodes based on DropAtt with Multi-disciplinary debate.
the overfitting to a specific layer-wise attention configuration by invoking randomness in applying the layer-wise attention modules.
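The gating in Eq. (1) can be sketched as a thin wrapper module as follows. This is a minimal illustration, not the authors' exact implementation: `attention_module` stands for any ABN-style branch producing an attention map of the same spatial shape as the feature map, and the handling of the inference-time gate is an assumption.

```python
import torch
import torch.nn as nn

class DropAtt(nn.Module):
    """Minimal sketch of DropAtt (Eq. (1)): the attention output is gated by a
    Bernoulli variable z_l per forward pass, so every layer-wise attention
    episode is visited while training a single amortized network."""

    def __init__(self, attention_module, p_drop=0.5):
        super().__init__()
        self.attention_module = attention_module
        self.p_drop = p_drop          # probability of NOT applying the attention module
        self.gate = None              # fixed 0/1 gate for a chosen inference episode

    def forward(self, feat):
        if self.training:
            # sample z_l: apply the attention module with probability 1 - p_drop
            z = float(torch.rand(1).item() >= self.p_drop)
        else:
            # at inference, the gate is fixed per episode (apply or not)
            z = 0.0 if self.gate is None else float(self.gate)
        if z == 0.0:
            return feat               # F'_l = F_l (attention dropped)
        att = self.attention_module(feat)
        return att * feat * z + feat  # F'_l = (AB_l(F_l) ⊗ F_l) z_l + F_l
```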
In practice, in order to verify the feasibility of DropAtt, we conducted an empirical evaluation of whether the proposed DropAtt maintains task performance regardless of the layer-wise policy of applying attention modules. In the evaluation, ResNet-18 and the satellite imagery land use classification dataset [31] are targeted, and ABN [7] is adopted as the attention module. As shown in Table 2, the task performance (accuracy) falls by up to 40.95% among the various attention episodes from the super network trained by applying attention modules to all layers in a fixed form, but when we train the amortized network using the proposed DropAtt, the accuracy over the various attention episodes maintains stable consistency (smaller than 1% change at maximum). In addition, the task performance of the amortized network itself is also slightly improved by using DropAtt.
Based on the amortized network that can adaptively apply
the attention module on each layer and generate multiple
episodes of attention through DropAtt, the following sub-
sections deal with the problem of how to integrate visual
explanations that show different characteristics/levels over
various attention episodes and target layers.
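As a usage illustration under the same assumptions as the DropAtt sketch above, the episode pool can be enumerated at inference simply by fixing the layer-wise gates of the trained amortized network; `model.dropatt_layers` is a hypothetical attribute listing the DropAtt wrappers, introduced only for this example.

```python
import itertools

def enumerate_attention_episodes(model, num_blocks=4):
    """Yield every layer-wise attention policy (episode) in {0,1}^num_blocks
    by fixing the DropAtt gates of the trained amortized network."""
    for gates in itertools.product([0, 1], repeat=num_blocks):
        for layer, gate in zip(model.dropatt_layers, gates):
            layer.gate = gate            # apply (1) or skip (0) the attention module
        yield gates, model               # the same weights serve every episode
```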
B. EXPLANATION CONSISTENCY VIA CLASS-WISE
FEATURE DISCRIMINATOR
Visual explanations with various characteristics can be obtained from the amortized model with the help of the proposed DropAtt, and visual explanations with different characteristics can also be acquired from different target layers. However, as mentioned above, since the existing CAM-based methods [6] extract different aspects of visual explanations according to the target layers and attention episodes, it is required to distinguish the trade-off between each explanation, to remove the redundant information, and to integrate them into a single global explanation by selecting only the information that practically contributed to the prediction.
Accordingly, in order to turn these various episode-variant and layer-variant visual explanations into one integrated explanation, we propose a scheme for explanation mediation in which the regional reflecting ratio for each episodic layer-wise explanation is adjusted differently according to the degree of multi-disciplinary explainability of each explanation, and the several episodic layer-wise explanations are synthesized into a single integrated explanation by reflecting each explanation's allocated regional ratio, as shown in Fig. 5.
For the practicality of the explanation, in order for the integrated explanation to imply information consistent with the predictions from the task path, we induce this by constructing a mutual competition between a generator that derives the regional reflection ratios in the explanation path and a discriminator that tries to distinguish the fake predictions in the explanation path from the real predictions in the task path.
Specifically, in the explanation path, each regional reflecting ratio ($0 \leq \rho_{(e,l)} \leq 1$) for the explanation obtained at the $e$-th attention episode and $l$-th layer is derived by feeding the layer-wise feature map ($z \sim G_z(x)$) as input to the parameterized generator function ($G_\rho(\cdot)$). As one of the methods for selecting the pixel-wise principal region from the episodic layer-wise explanation ($LE_{(e,l)}$), we propose, as an example, to filter out only the top $\rho_{(e,l)} \cdot 100\%$ value pixels in a masking format (0/1) (we denote this procedure as the function $R(LE_{(e,l)}, \rho_{(e,l)})$). Accordingly, the integration of the several episodic layer-wise explanations is conducted by summing only the allocated regions of each explanation:
$$\hat{LE} = \sum_{\forall e} \sum_{\forall l} R(LE_{(e,l)}, \rho_{(e,l)}) \cdot LE_{(e,l)}. \qquad (3)$$
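A minimal sketch of the masking function $R(\cdot,\cdot)$ and the integration in Eq. (3) is given below; it assumes all episodic layer-wise maps have been upsampled to a common resolution, and the helper names are ours for illustration.

```python
import torch

def top_rho_mask(le, rho):
    """R(LE, rho): 0/1 mask keeping the top rho*100% pixels of an explanation map."""
    if rho <= 0.0:
        return torch.zeros_like(le)
    k = max(1, int(rho * le.numel()))
    threshold = le.flatten().topk(k).values.min()
    return (le >= threshold).float()

def integrate_explanations(episodic_maps, rhos):
    """Eq. (3): sum the masked regions of each episodic layer-wise explanation.

    episodic_maps: dict {(e, l): HxW tensor LE_(e,l)}
    rhos:          dict {(e, l): regional reflecting ratio rho_(e,l) in [0, 1]}
    """
    integrated = torch.zeros_like(next(iter(episodic_maps.values())))
    for key, le in episodic_maps.items():
        integrated = integrated + top_rho_mask(le, rhos[key]) * le
    return integrated
```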
Since the existing CAM-based methods [6] only extract the explanation of a specific episode at a certain layer, the extracted visual explanations are vulnerable to containing the redundant information of that episode. On the other hand, as the proposed mechanism of integrating explanations can block the inflow of redundant information from each episodic explanation by adjusting the regional reflection ratio of each explanation individually, by searching for the proper value of the reflection ratio recursively, the proposed method can extract a more sophisticated single explanation that aggregates only the parts of explanations that practically contributed to the prediction.
Based on the integrated explanation, the feature score ($\sum_i \sum_j \hat{LE}^c_{i,j}$) for estimating each object class can be quantified by summing up the pixel-wise activated values in the explanation ($\hat{LE}^c$) for each class ($c$), and a relative prediction probability for a particular class can be calculated by applying a softmax to the feature score of each class (denoted as $S_e(x, \rho) = \frac{\exp(\sum_i \sum_j \hat{LE}^c_{i,j}(x))}{\sum_{\forall c'} \exp(\sum_i \sum_j \hat{LE}^{c'}_{i,j}(x))}$). Likewise, the object class prediction probability can also be quantified in a softmax form by summing up the class-variant output ($f(Y=c|x,e)$) over the different network episodes in the task path (denoted as $S_t(x) = \frac{\exp(\sum_e f(Y=c|x,e))}{\sum_{c'} \exp(\sum_e f(Y=c'|x,e))}$).
On this basis, by configuring a parameterized discriminator ($D(\cdot)$) that distinguishes the real class prediction ($S_t(x)$) in the task path from the fake prediction ($S_e(x, G_\rho(z))$) in the explanation path, we can construct the zero-sum two-player ($G_\rho$, $D$) game of the following Lemma 1.
Lemma 1. (Adversarial loss for explanation consistency) The integrated explanation that is consistent with the estimated output in the task path can be searched for by solving the zero-sum two-player ($G_\rho$, $D$) minimax game, where the adversarial loss is constructed as:
$$\min_{G_\rho} \max_{D} \; \mathbb{E}_{t \sim S_t(x)}[\log D(t)] + \mathbb{E}_{z \sim G_z(x)}[\log(1 - D(S_e(x, G_\rho(z))))]. \qquad (4)$$
Proof. Equation (4), in which the discriminator $D$ and the generator $G_\rho$ compete with each other, constructs an adversarial training objective, and the solution of such a zero-sum two-player game is obtained at the Nash equilibrium, where the discriminator cannot distinguish the generations of the generator network from the real distribution.
To solve the corresponding two-player minimax game, as
shown in Algorithm 1, the discriminator is updated from the
TABLE 3. Comparison of task performance when predicting using only the derived explanation itself (using $S_e(\cdot)$).

Method                                  | Top-1 accuracy (%)
Grad-CAM                                | 29.99
ABN                                     | 49.52
DropAtt                                 | 52.38
DropAtt + Discrim.                      | 65.24
Whole Proposed Scheme (Algorithm 1)     | 74.28
sampled minibatch of predictions in the task path and the sampled minibatch of inputs for the generator in advance, and then the generator is updated from another sampled minibatch to fool the discriminator. By conducting this update step iteratively, we can derive the initial regional reflection ratio for each episodic layer-wise explanation that induces an initial integrated explanation implying information as consistent with the task predictions as possible.
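A minimal sketch of one such update iteration is given below. It assumes a discriminator that outputs the probability that its input comes from the task path, and the helper functions `task_score_fn`, `expl_score_fn`, and `feature_fn` stand in for the paper's $S_t$, $S_e$, and $G_z$; it is an illustration of Eq. (4), not the authors' exact training code.

```python
import torch

def consistency_update(x, generator, discriminator, opt_g, opt_d,
                       task_score_fn, expl_score_fn, feature_fn, eps=1e-8):
    """One iteration of the minimax game in Eq. (4) for a minibatch x."""
    # --- discriminator step (inner loop) ---
    t = task_score_fn(x).detach()                       # real samples from the task path
    z = feature_fn(x).detach()
    fake = expl_score_fn(x, generator(z)).detach()      # fake samples from the explanation path
    d_loss = -(torch.log(discriminator(t) + eps)
               + torch.log(1.0 - discriminator(fake) + eps)).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- generator step: fool the discriminator ---
    z = feature_fn(x)
    fake = expl_score_fn(x, generator(z))
    g_loss = torch.log(1.0 - discriminator(fake) + eps).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```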
Table 3 shows the task performance (top-1 accuracy) when predicting the task from the derived explanation itself via $S_e(\cdot)$, which indicates how much consistent information for the task prediction the derived explanation contains. The hyper-parameter settings for training are the same as in the previous evaluation in Table 2, and the class-wise prediction score is calculated by averaging over all layer-wise explanations for the 5 methods: Grad-CAM, ABN, DropAtt, DropAtt + Discrim., and the Whole Proposed Scheme (Algorithm 1) (the detailed settings of each method are described in Section V-C). As shown in the results, the proposed DropAtt by itself derives a more task-consistent explanation than applying Grad-CAM [6] to the network without attention (Grad-CAM) or deriving the visual explanation from the network with ABN [7] applied in a fixed form (ABN). Moreover, applying the proposed class-wise feature discriminator (and generator) with DropAtt further improves the task consistency of the integrated explanation, and our final Algorithm 1 shows the highest task consistency of the derived explanation.
C. RECURSIVE MEDIATION OF EXPLANATIONS FROM
DEBATING CANDIDATES
As a method of combining the various explanations, we proposed to filter out only a certain ratio of pixels from each layer-wise episodic explanation to drop the redundant features. However, such a filtering process carries the risk of losing meaningful features as well. Accordingly, in order to minimize this risk, we also construct an additional step that attempts to search for the appropriate regional reflection ratio for each explanation by identifying whether the pixel-wise features can enhance the explainability, through recursively incrementing the regional reflection ratio and evaluating the explainability. Therefore, different from conventional methods like Grad-CAM that only extract a specific explanation biased toward a certain target layer and attention episode, our new framework of mediating explanations generates a more sophisticated explanation by integrating only the positive aspects of the explanations from the various target
layers and attention episodes.
The term "debate" used in this paper indicates complemen-
tary conflicts among various episodic layer-wise explanation
in terms of multi-disciplinary explainability. Specifically, the
debating candidates for deriving the integrated explanation
is selected based on two multi-disciplinary explainability
indicators (average drop and increase in confidence), which
are widely used as quantitative metrics to measure the ex-
planability of various visual explanations [15], [16], [25].
The first explainability indicator considered in this work is the remaining confidence percentage (RCP), which is defined as the average ratio of remaining confidence for a label ($c$) when inferencing ($f(\cdot)$) with the augmented input, where the original image ($x$) in the dataset ($D$) is masked by the visual explanation ($LE$) derived at a certain $e$-th episode (the same concept as the average drop in confidence, but a higher RCP value means higher explainability):
$$RCP(LE) = \frac{1}{|D|} \sum_{x \in D} \frac{\min\left(f(c|x \cdot R(LE, \rho_D), e),\, f(c|x, e)\right)}{f(c|x, e)}, \qquad (5)$$
where $\rho_D$ denotes the regional ratio for passing through the principal values in this discipline. The second explainability indicator is the confidence increase ratio (CIR), which is defined as the ratio of events in which the confidence derived from the masked input is higher than the confidence inferred from the original input:
$$CIR(LE) = \frac{1}{|D|} \sum_{x \in D} \mathrm{sign}\left(f(c|x, e) < f(c|x \cdot R(LE, \rho_D), e)\right). \qquad (6)$$
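The two indicators can be sketched as follows, reusing the `top_rho_mask` helper from the earlier integration sketch; it assumes the explanation has been upsampled to the input resolution and that `model(x, episode)` returns class probabilities $f(\cdot|x, e)$, which are assumptions for illustration.

```python
import torch

def rcp_and_cir(model, dataset, le, rho_d, episode, label):
    """Sketch of Eq. (5)-(6): remaining confidence percentage (RCP) and
    confidence increase ratio (CIR) of one explanation LE over a dataset."""
    rcp_sum, cir_sum, n = 0.0, 0.0, 0
    for x in dataset:
        with torch.no_grad():
            conf_full = model(x, episode)[label]
            x_masked = x * top_rho_mask(le, rho_d)            # keep only principal pixels
            conf_masked = model(x_masked, episode)[label]
        rcp_sum += (torch.min(conf_masked, conf_full) / conf_full).item()
        cir_sum += float(conf_full < conf_masked)             # event: confidence increased
        n += 1
    return rcp_sum / n, cir_sum / n
```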
Based on these two disciplinary indicators, if an explanation that is dominant in both indicators is found, the regional reflecting ratio for that explanation could be set to 100%; in most cases, however, only conflicting explanations appear in terms of the explainability indicators, such as high RCP but low CIR, or high CIR but low RCP.
In this paper, such selected explanations are synthesized into an explanation that is improved as much as possible in terms of the multi-disciplinary indicators by reflecting only the main partial area that carries the advantages of the dominant explainability of each explanation. Accordingly, prior to integrating the explanations, the process of selecting debating candidates is conducted by selecting only the episodic layer-wise explanations that show a dominant explainability on either of the two disciplines (RCP, CIR) over the current integrated explanation ($\hat{LE}$), as shown in Algorithm 1, and then the regional reflecting ratios are mediated over the debating candidate set to improve all explainability metrics together.
The procedure of mediating the various explanations to derive the integrated explanation ($\hat{LE}$) based on the selected debating candidates consists of incrementally adjusting the regional reflecting ratio of each debating candidate, and adopting such an adjustment only when the newly integrated explanation from mediation shows an improvement in either of the two disciplinary explainability indicators without any degradation. By conducting this process iteratively over the various debate candidates and attempting to mediate among them, the intent is that candidate explanations dominant in a specific discipline (RCP or CIR) can complement the integrated explanation within bounds that do not exacerbate the weaker discipline (explainability).
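A minimal sketch of this grant/reject mediation loop, reusing the `integrate_explanations` helper from the earlier sketch, is given below; `evaluate` is an assumed helper returning the (RCP, CIR) pair of an explanation, and the fixed number of outer iterations is an illustrative choice.

```python
def mediate_explanations(episodic_maps, rhos, evaluate, epsilon=0.05):
    """Sketch of the debate/mediation stage: candidates dominant in RCP or CIR
    are granted a small increment of their reflecting ratio only if the
    re-integrated explanation improves at least one metric and degrades none."""
    current = integrate_explanations(episodic_maps, rhos)
    rcp_cur, cir_cur = evaluate(current)
    for _ in range(10):                                        # outer mediation iterations
        candidates = []
        for key, le in episodic_maps.items():
            rcp_le, cir_le = evaluate(le)
            if rcp_le > rcp_cur or cir_le > cir_cur:           # dominant in at least one discipline
                candidates.append(key)
        for key in candidates:
            trial_rhos = dict(rhos)
            trial_rhos[key] = min(1.0, rhos[key] + epsilon)    # rho' <- rho + epsilon
            trial = integrate_explanations(episodic_maps, trial_rhos)
            rcp_new, cir_new = evaluate(trial)
            if (rcp_new >= rcp_cur and cir_new >= cir_cur
                    and (rcp_new > rcp_cur or cir_new > cir_cur)):
                rhos, current = trial_rhos, trial              # grant the adjustment
                rcp_cur, cir_cur = rcp_new, cir_new
    return current, rhos
```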
In an ideal case, as stated in Lemma 2, the confidence derived from the input masked by the visual explanation can be enhanced by further reflecting the partial regions of another explanation in which the dominant explainability is emphasized, and the RCP or CIR of the updated explanation can be improved according to the dominant explainability of the candidate.
Lemma 2. (Confidence enhancement via adding a discriminative explainability-dominant region)
If $f(x \cdot R(LE_B, \rho_D)) > f(x \cdot R(LE_A, \rho_D))$, $f(x \cdot R(LE_B, \rho_B)) > f(x \cdot R(LE_A, \rho_A))$, $\sum_i \sum_j R(LE_A, \rho_A) \cup R(LE_B, \rho_B) \leq \rho_D H W$, and $\rho_A, \rho_B \leq \rho_D$, then the confidence can be improved by adding the explainability-dominant region, compared to the raw explanation:
$$f(x \cdot R(R(LE_A, \rho_A) LE_A + R(LE_B, \rho_B) LE_B, \rho_D)) > f(x \cdot R(LE_A, \rho_D)). \qquad (7)$$
Proof. As the area of $R(LE_A, \rho_A) \cup R(LE_B, \rho_B)$ is larger than the area of $R(LE_A, \rho_A)$, it is supposed that $f(x \cdot (R(LE_A, \rho_A) \cup R(LE_B, \rho_B))) > f(x \cdot R(LE_A, \rho_A))$. From the assumptions, we obtain:
$$f(x \cdot R(R(LE_A, \rho_A) LE_A + R(LE_B, \rho_B) LE_B, \rho_D)) \qquad (8)$$
$$= f(x \cdot (R(LE_A, \rho_A) \cup R(LE_B, \rho_B))) \qquad (9)$$
$$> f(x \cdot R(LE_A, \rho_A)), \qquad (10)$$
where Equation (9) can be derived from $\sum_i \sum_j R(LE_A, \rho_A) \cup R(LE_B, \rho_B) \leq \rho_D H W$, and Equation (10) is equal to $f(x \cdot R(LE_A, \rho_D))$ as $\rho_A < \rho_D$.
Based on this theoretical basis, we attempt to synthesize a new integrated explanation in the form of extending the partial region of dominant explainability over the complementary explanation candidates by debating the trade-off between each pair ($\hat{LE}$, $LE_{(e,l)}$). In a real-world environment, as the dominant explainability of the previously integrated explanation is likely to be diluted in the mediation procedure if the reflection ratio of the debate candidate is extended excessively, the mediation is conducted by adjusting the regional reflecting ratio of each debate candidate by a small increment ($\epsilon$), but iteratively.
In an ideal case, by conducting these searching and explanation-mediating procedures iteratively, as stated in Theorem 1, an integrated explanation can be derived that converges to the maximum values of RCP and CIR within the tangible range ($\forall LE_{(e,l)}$).
Algorithm 1: The proposed scheme for mediating explanations of multiple episodes from DropAtt with multi-disciplinary debate.

1:  Train the amortized network by applying DropAtt
2:  for iterations do
3:    for steps do
4:      Sample minibatch of task path scores $\{t_1, ..., t_M\}$ from $S_t(x)$, $x \sim D$
5:      Sample minibatch of generator inputs $\{(z_1, x_1), ..., (z_M, x_M)\}$ from $G_z(x)$, $x \sim D$
6:      Update the discriminator by:
        $\nabla_{\theta_D} \frac{1}{M} \sum_{\forall m} [\log D(t_m) + \log(1 - D(S_e(x_m, G_\rho(z_m))))]$
7:    end for
8:    Sample minibatch of generator inputs $\{(z_1, x_1), ..., (z_M, x_M)\}$ from $G_z(x)$, $x \sim D$
9:    Update the generator by:
        $\nabla_{\theta_{G_\rho}} \frac{1}{M} \sum_{\forall m} [\log(1 - D(S_e(x_m, G_\rho(z_m))))]$
10: end for
11: $\rho_{(e,l)} \leftarrow G_\rho(G_z(x)), \; \forall e, l, x$
12: $\hat{LE} \leftarrow \sum_{\forall e} \sum_{\forall l} R(LE_{(e,l)}, \rho_{(e,l)}) \cdot LE_{(e,l)}$
13: for iterations do
14:   $DC \leftarrow \{LE_{(e,l)} \,|\, RCP(LE_{(e,l)}) > RCP(\hat{LE}) \text{ or } CIR(LE_{(e,l)}) > CIR(\hat{LE}), \; \forall e, l\}$
15:   for each $LE_{(e,l)}$ in $DC$ do
16:     $\rho'_{(e,l)} \leftarrow \rho_{(e,l)} + \epsilon$; calculate $\hat{LE}'$ by adopting $\rho'_{(e,l)}$
17:     if $\hat{LE}'$ improves $RCP$, $CIR$ then
18:       $\rho_{(e,l)} \leftarrow \rho'_{(e,l)}$, $\hat{LE} \leftarrow \hat{LE}'$
19:     end if
20:   end for
21: end for
Theorem 1. (Convergence of mediating from debate candidates) In each $\hat{LE}'$ updating step, if $RCP(\hat{LE}') > \min(RCP(\hat{LE}), RCP(LE_{(e,l)}))$ and $CIR(\hat{LE}') \geq \max(CIR(\hat{LE}), CIR(LE_{(e,l)}))$ for an RCP-enforcing update, and $RCP(\hat{LE}') \geq \max(RCP(\hat{LE}), RCP(LE_{(e,l)}))$ and $CIR(\hat{LE}') > \min(CIR(\hat{LE}), CIR(LE_{(e,l)}))$ for a CIR-enforcing update, then the searched mediated explanation $\hat{LE}'$ converges to show both the maximum RCP and the maximum CIR values among all episodic layer-wise explanations:
$$RCP(\hat{LE}') \xrightarrow{\infty} \max_{\forall(e,l)} RCP(LE_{(e,l)}), \qquad (11)$$
$$\text{and} \quad CIR(\hat{LE}') \xrightarrow{\infty} \max_{\forall(e,l)} CIR(LE_{(e,l)}). \qquad (12)$$
Proof. For a certain debate candidate $LE_{(e,l)}$, from the assumptions, the updated explanation $\hat{LE}$, after integration over several iterations, at least converges to a value close to the dominant indicator (RCP or CIR) of the debate candidate $LE_{(e,l)}$, as follows:
If $RCP(LE_{(e,l)}) > RCP(\hat{LE})$, then
$$RCP(\hat{LE}') \xrightarrow{\infty} RCP(LE_{(e,l)}), \quad CIR(\hat{LE}') \geq CIR(\hat{LE}), \qquad (13)$$
If $CIR(LE_{(e,l)}) > CIR(\hat{LE})$, then
$$CIR(\hat{LE}') \xrightarrow{\infty} CIR(LE_{(e,l)}), \quad RCP(\hat{LE}') \geq RCP(\hat{LE}). \qquad (14)$$
Therefore, when we update the explanation with the several debating candidates that show at least one dominant explainability indicator (RCP or CIR) over several iterations, the updating step inevitably passes through the two debate candidates that show the maximum value of each explainability indicator among all candidates ($\max_{\forall(e,l)} RCP(LE_{(e,l)})$, $\max_{\forall(e,l)} CIR(LE_{(e,l)})$). Accordingly, the updated explanation at least converges to $RCP(\hat{LE}') \xrightarrow{\infty} \max_{\forall(e,l)} RCP(LE_{(e,l)})$ and $CIR(\hat{LE}') \xrightarrow{\infty} \max_{\forall(e,l)} CIR(LE_{(e,l)})$.
However, the precondition for Theorem 1 that is induced from Lemma 2 is not always satisfied in real-world environments. Therefore, in order to alleviate this gap, as shown in Algorithm 1, we add a procedure that examines whether RCP or CIR improves without any degradation before adopting the mediation at each update step, and iteratively attempt to mediate among the debating explanations while exploring the various episodic layer-wise explanations so that this precondition is more likely to be satisfied. Moreover, as our methods can be orthogonally applied to various CAM-based visual explanation methods and attention modules, they do not impose any other constraints on the backbone task network except the requirements of deriving CAM-based visual explanations and applying attention modules.
V. EVALUATION AND DISCUSSION
A. EXPERIMENTAL SETTINGS
We evaluated the feasibility of our methods empirically. We conducted the evaluation on a land use classification task with the UC Merced satellite imagery dataset [31], which requires explanations of predictions due to the low-resolution characteristics of satellite imagery in this practical domain. In the dataset, 90% of the total dataset is split into the training set and the remaining 10% is used as the test set.
The task network is trained using SGD with momentum of 0.9, a batch size of 48, and a weight decay of 0.0001. We trained for 160 epochs in total, where the initial learning rate is set to 0.1 and decayed by 1/10 every 60 epochs. We used the ResNet-18 [32] architecture, where only the single fully connected layer at the end is modified to 21 output channels for the UC Merced satellite imagery dataset, and the other settings follow ResNet-18 [32]. ABN [7] is adopted as the attention module on each residual block of the task network in parallel, and Grad-CAM [6] is used to extract the layer-wise visual explanations. For each attention module of a residual block, 2 convolutional layers are
TABLE 4. The configurations of each sampled episode for evaluating the effect of mediating explanations from multiple episodes. The @Block columns give the layer-wise policy of applying attention modules (O: applied, X: not applied); the target layer is the residual block from which Grad-CAM is derived.

Sample     Episode     @Block 1   @Block 2   @Block 3   @Block 4   Target layer
Sample 1   Episode 1   O          X          X          X          Residual block 3
Sample 1   Episode 2   O          X          O          X          Residual block 3
Sample 1   Episode 3   O          X          X          X          Residual block 4
Sample 2   Episode 1   X          X          O          O          Residual block 3
Sample 2   Episode 2   O          O          X          X          Residual block 1
Sample 2   Episode 3   O          O          X          O          Residual block 4
Sample 3   Episode 1   X          X          O          X          Residual block 4
Sample 3   Episode 2   X          O          X          X          Residual block 2
Sample 3   Episode 3   X          O          X          O          Residual block 1
TABLE 5. Quantitative comparison of the mediated explanation obtained using the proposed scheme. We observed the results on three sets of samples, where in each case the mediation is conducted among the explanations derived from three different sampled episodes; the mediated explanation achieves improvement on explainability over the explanations of the three episodes in all three samples.

          Metric for Explainability             Episode 1   Episode 2   Episode 3   Mediated Explanation   Mediation policy ρ (Ep. 1 / Ep. 2 / Ep. 3)
Sample 1  Average % drop ↓                      71.08       70.18       64.83       59.37                  0.1 / 0.4 / 0.3
          Rate % of increase in confidence ↑    3.33        3.33        1.90        4.76
Sample 2  Average % drop ↓                      68.54       66.25       65.76       61.99                  0.05 / 0.3 / 0.3
          Rate % of increase in confidence ↑    3.81        3.81        1.90        5.71
Sample 3  Average % drop ↓                      66.63       54.13       62.98       54.13                  0 / 0.7 / 0.1
          Rate % of increase in confidence ↑    2.38        4.76        5.24        5.24
Average   Average % drop ↓                      68.75       63.52       64.53       58.50                  -
          Rate % of increase in confidence ↑    3.17        3.97        3.02        5.24
applied to generate the attention map, where kernel sizes of 1 and 3 are applied to each convolutional layer respectively, and padding is only applied to the latter one to maintain the output size. Batch normalization layers are applied to each convolutional layer of the attention module, and ReLU is subsequently applied to the former one, while a sigmoid is applied to the latter one. Moreover, a single 1x1 convolutional layer is additionally applied to the former one to predict the task in parallel, and global average pooling and softmax are applied subsequently to produce the probability score of each label. In the training phase, the sum of the losses from the 4 attention branches and the perception branch is applied as the loss function to train the whole task network including the attention modules, likewise to [7]. For DropAttention, p = 0.5 is applied in the training phase. The task network with attention modules is trained using SGD with momentum of 0.9, a batch size of 48, and a weight decay of 0.0001. We trained for 160 epochs in total, where the initial learning rate is set to 0.1 and decayed by 1/10 every 60 epochs. No data augmentation is applied in the training phase.
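Under the configuration described above, the attention module on each residual block can be sketched roughly as follows; the channel sizes and the single-channel attention output are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AttentionBranch(nn.Module):
    """Sketch of the ABN-style attention module described in Section V-A:
    a 1x1 conv (BN, ReLU) followed by a 3x3 conv (BN, sigmoid) producing the
    spatial attention map, plus a parallel 1x1 conv head with global average
    pooling and softmax for the branch's own class prediction."""

    def __init__(self, in_channels, num_classes=21):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
        )
        self.att = nn.Sequential(
            nn.Conv2d(in_channels, 1, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(1),
            nn.Sigmoid(),
        )
        self.cls_head = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, feat):
        h = self.reduce(feat)
        attention_map = self.att(h)                  # AB_l(F_l), values in [0, 1]
        logits = self.cls_head(h).mean(dim=(2, 3))   # GAP over spatial dims
        branch_probs = torch.softmax(logits, dim=1)  # branch prediction per label
        return attention_map, branch_probs
```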
To obtain the initial settings for mediating the explanations, the discriminator and generator are constructed with a single fully connected layer, where a softmax is subsequently applied to the discriminator and a sigmoid is subsequently applied to the generator. The discriminator and the generator are trained using the same hyper-parameter settings as the task network, and a single step of the inner loop is applied. In the procedure of recursive mediation, ϵ = 5% is applied as the unit for searching the appropriate regional ratio ρ.
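A minimal sketch of the mediation heads described above is given below; the input and output dimensionalities are assumptions for illustration.

```python
import torch.nn as nn

# Both heads are single fully connected layers: the generator maps layer-wise
# features z to reflecting ratios rho in [0, 1] via a sigmoid, and the
# discriminator classifies a prediction vector as task path vs. explanation
# path via a softmax over two outputs.
def build_mediation_heads(feature_dim, num_ratios, num_classes=21):
    generator = nn.Sequential(
        nn.Linear(feature_dim, num_ratios),   # one rho_(e,l) per episodic layer-wise explanation
        nn.Sigmoid(),                         # keeps 0 <= rho <= 1
    )
    discriminator = nn.Sequential(
        nn.Linear(num_classes, 2),            # real (task path) vs. fake (explanation path)
        nn.Softmax(dim=1),
    )
    return generator, discriminator
```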
B. FEASIBILITY OF MEDIATING EXPLANATIONS
In order to check the practical feasibility of the proposed scheme for mediating explanations from various attention episodes, we sampled three episodes from the amortized network trained with DropAtt and conducted the proposed scheme of mediating explanations among only the sampled episodic layer-wise explanations. The layer-wise policy of applying attention modules and the target layer (for deriving Grad-CAM) of each sampled episode in Table 5 are presented in Table 4.
As shown in Table 5, the results show that the proposed scheme of mediating explanations can derive an integrated explanation in which both explainability indicators (average percentage drop in confidence and rate percentage of increase in confidence) are improved over the layer-wise explanations in all three different sampled episodes, and it is expected that the proposed scheme of mediating explanations can improve explainability whenever multiple episodes are acquired.
FIGURE 6. Qualitative comparison on visual explanation with regard to the way attention is used (rows: Airplane, Harbor, Buildings, and Parking lots; columns: input image, Grad-CAM, ABN, DropAtt (proposed), DropAtt + Discrim. (proposed), and Whole proposed scheme (Algorithm 1)).

When we look closely at the results of the first sample as an example, the first and second episodes have strength in the rate of increase in confidence, but they show lower performance on average drop than the third episode.
As the explainability performance of the first and second episodes is similar, our mediation method allocated the reflection ratio mainly to the second episode (0.4) rather than the first episode (0.1). Moreover, by allocating some portion to the complementary episode (i.e., the third episode), the explanation finally derived through mediation achieves improvement on both explainability metrics over each episodic layer-wise explanation, with the help of complementing each other's strengths and weaknesses. Such complementarity is realized mainly by our mediation mechanism, which reflects only partial regions of each derived explanation (i.e., takes only the strong points) and excludes the remaining parts (i.e., abandons weak or noisy points and fills them with the strong points of other explanations instead). The second sample shows similar aspects to the first sample.
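To make the partial-region idea concrete, the following is a minimal NumPy sketch of the mediation step, under our own assumptions that the "strong points" of each episodic explanation are selected by a top-ρ quantile threshold and that the selected partial regions are combined by a pixel-wise maximum; the exact selection and combination rules of Algorithm 1 may differ.

```python
import numpy as np

def mediate_explanations(expl_maps, rhos):
    """Combine partial regions of episodic explanation maps (illustrative sketch only).

    expl_maps : list of H x W explanation maps (e.g., episodic Grad-CAMs) scaled to [0, 1]
    rhos      : regional reflection ratio of each episode, e.g., [0.1, 0.4, 0.3]
    """
    merged = np.zeros_like(expl_maps[0])
    for cam, rho in zip(expl_maps, rhos):
        if rho <= 0:                                # rho = 0 blocks an episode entirely
            continue
        thr = np.quantile(cam, 1.0 - rho)           # keep only the strongest rho-fraction of pixels
        partial = np.where(cam >= thr, cam, 0.0)
        merged = np.maximum(merged, partial)        # assumed pixel-wise max combination
    if merged.max() > 0:
        merged = merged / merged.max()              # renormalize the integrated explanation
    return merged
```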
On the third sample, the first episode shows lower performance on both explainability metrics than the others. Accordingly, our mediation method does not allocate any reflection ratio to the first episode, and only the complementary pair (the second and third episodes) is utilized for mediation.
Overall, our mediation method can mediate complementary explanations derived from different target layers and episodes, and therefore achieves improvements over even the strongest explainability among the applied explanations.
C. QUANTITATIVE AND QUALITATIVE COMPARISON
OVER VISUAL EXPLANATION METHODS
We also observed the resulting visual explanations and compared them with existing methods. We compared our methods with two main baselines: Grad-CAM and ABN. For Grad-CAM, the visual explanation (Grad-CAM [6]) is derived from the task network without applying any attention module, and the visual explanations derived at each residual block are averaged into a single image. ABN denotes the case where visual explanations are derived from the fixed model trained with ABN [7] applied to all residual blocks, as originally suggested, and the explanations derived from all residual blocks are averaged into a single explanation.
We also observed the effect of our methods one by one: DropAtt, DropAtt + Discrim., and Whole Proposed Scheme (Algorithm 1). For DropAtt, the visual explanation (Grad-CAM [6]) is derived from the task network trained with our DropAtt, and all the layer-wise explanations are likewise averaged into a single explanation. In the case of DropAtt + Discrim., three layer-wise episodic explanations are derived from the task model trained with our DropAtt, and they are integrated into a single explanation by mediating the three sampled explanations with the initial reflection ratios inferred from the generator trained with the proposed class-wise feature discriminator. Whole Proposed Scheme applies our final Algorithm 1, which mediates explanations from multiple episodes and target layers; the results presented in Table 6 and Fig. 6 are derived with the settings of "Sample 3" in Table 4 as an example.
Among the comparing methods, we evaluated the visual explainability of each method based on the two aforementioned main metrics: average drop (lower is better) and rate of increase in confidence (higher is better). As shown in Table 6, Grad-CAM and ABN by themselves show relatively lower explainability than our methods.
TABLE 6. Performance comparison of visual explainability with regard to the applying methods (average drop: lower is better, rate of increase in confidence: higher is better).

Method                              | Average % drop ↓ | Rate % increase in confidence ↑
Grad-CAM                            | 75.63            | 1.90
ABN                                 | 71.41            | 2.38
DropAtt                             | 78.73            | 4.28
DropAtt + Discrim.                  | 66.37            | 4.76
Whole Proposed Scheme (Algorithm 1) | 54.13            | 5.24
Moreover, applying DropAtt alone cannot achieve significant improvements over the two baselines, showing a rather higher value on average drop. However, further applying the proposed way of integrating various explanations with DropAtt + Discrim. shows improvements on both explainability metrics. Finally, by applying the proposed mediation scheme, Whole Proposed Scheme (Algorithm 1) achieves the highest explainability, reducing average drop by nearly 17% and raising the rate of increase in confidence by nearly 3% compared to the baseline (ABN). These results imply that the way of integrating the various explanations obtainable from the diversity of deriving visual explanations and applying attention also plays an important role in providing a reliable explanation.
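For reference, the two indicators can be computed as in the common formulation of [15]; the sketch below assumes that Y_i is the model confidence on an original image and O_i the confidence when only the explanation-highlighted region is kept, while the exact masking protocol used in our evaluation is not restated here.

```python
import numpy as np

def explainability_metrics(orig_conf, masked_conf):
    """Average % drop and rate of % increase in confidence (definitions following [15])."""
    Y = np.asarray(orig_conf, dtype=float)      # confidence on the original images
    O = np.asarray(masked_conf, dtype=float)    # confidence on the explanation-masked images
    avg_drop = 100.0 * np.mean(np.maximum(Y - O, 0.0) / Y)   # lower is better
    rate_increase = 100.0 * np.mean(O > Y)                   # higher is better
    return avg_drop, rate_increase
```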
We also observed the visual explanations of each comparing method qualitatively, and Fig. 6 shows the corresponding results. In the results, the leftmost column corresponds to the original input images, and the remaining columns are the visual explanations derived from each comparing method: Grad-CAM, ABN, DropAtt, DropAtt + Discrim., and Whole Proposed Scheme (Algorithm 1).
The results show that applying our DropAtt alone cannot produce significant improvements in the visual explanation over the two baselines (results in the second and third columns), although it shows stable task performance over multiple episodes and even an improvement in task performance over the baseline model of [7]. This implies that not only generating multiple episodes of applying various attention modules, but also properly integrating such various explanations from different target layers and episodes so that they complement each other, is important for enhancing the quality of the visual explanation.
Subsequently, the results of applying our DropAtt and the proposed class-wise feature discriminator together already show a better quality of visual explanation, where the explanation highlights mainly the target objects in the image in a more sophisticated manner. Furthermore, when we additionally apply our explanation mediation scheme with the multi-disciplinary debate, the final results in the rightmost column show further enhancement in the quality of the visual explanation in some instances. In particular, the first row of the results shows this improvement distinctly.
FIGURE 7. Visual explanations and their mediated partial regions on each episode. For the Airplane and Harbor samples, the entire visual explanation and the mediated partial visual explanation of Episodes 1–3 are shown together with the integrated explanation.
D. ABLATION STUDY: MEDIATED PARTIAL
EXPLANATIONS
In order to check how our explanation mediation scheme works, we observe the visual explanation derived from each episode and its partial regions selected by the explanation mediation; Fig. 7 shows the corresponding results for the settings of "Sample 3" in Table 4. As shown in the results, the visual explanations from Episode 1 convey redundant information over several input instances: background bias is revealed, and even the highlighted regions cannot distinguish the target object from the background. Accordingly, the mediation policy (regional reflection ratio ρ) for Episode 1 is set to zero, blocking the meaningless explanations of Episode 1 from the explanation integration.
In the case of Episode 2, the visual explanations mainly highlight the regions of the target objects over several input images, and Episode 2 shows the lowest average drop among the three episodes, as shown in Table 5. As Episode 2 also shows lower background bias than the other episodes, the highest value of the mediation policy (0.7) is allocated to Episode 2.
Similar to Episode 2, the visual explanations of Episode 3 also show good explainability, where the rate of increase in confidence of Episode 3 is the highest among the three episodes. However, the visual explanations of some input images show background bias or incorrect focus (e.g., the visual explanation of Episode 3 on the airplane example mainly highlights the vehicle rather than the airplane, together with some background bias). As correct explanations that successfully capture the regions of the target objects are already largely reflected by Episode 2, only a small portion of the mediation policy (0.1) is allocated to Episode 3, within the bounds of emphasizing its advantage (a higher rate of increase in confidence) while minimizing its weakness (a higher average drop than Episode 2).
E. COMPARISON OF COMPUTATIONAL COST FOR
SERVING ON THE REAL DOMAIN ENVIRONMENT
In this section, we provide an additional analysis of the computational cost of the proposed model to evaluate the feasibility of serving the model in a real domain environment. To evaluate the feasibility of serving the analysis model in a space-related application, we consider an onboard AI computing environment as described in [1]. In such an environment, utilizing low-power computing resources such as a vision processing unit or an embedded GPU, the onboard computing system performs inference of the DL network to analyze the images collected by the satellite system and sends the annotated results to the ground station system. In the ground station system, further analysis with the XAI model can be conducted to provide reliable visual explanations to the human experts for assisting final decision making. In this scenario, the computation of inferencing the task DL network is conducted on the satellite onboard system, and the computation of deriving visual explanations is conducted at the ground station. Accordingly, we observe the total cost $Cost_{comp}$ of processing this series of analyses for each satellite image in terms of the total energy consumption, which is derived as follows:
$Cost_{comp} = P^{comp}_{sat} \cdot t_{inf} + PE^{comm}_{sat} \cdot D + P^{comp}_{gs} \cdot t_{expl}$,   (15)
where $P^{comp}_{sat}$ and $P^{comp}_{gs}$ represent the power consumption of the computing hardware resources in the satellite system and the ground station system, and $t_{inf}$ and $t_{expl}$ represent the processing time of inferencing the task network and the processing time of deriving the visual explanation, respectively. $PE^{comm}_{sat}$ is the power efficiency of the satellite-ground station communication, represented as power per data rate (W/bps), and $D$ is the data volume of the captured images. We assume $PE^{comm}_{sat} = 1/200$ W/Mbps, the same communication parameter as in [33], and assume that an NVIDIA Jetson TX1 is utilized for the satellite onboard system and an NVIDIA RTX 3080 for the ground station system, where $P^{comp}_{sat} = 10$ W and $P^{comp}_{gs} = 320$ W, respectively.
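As a worked illustration of Eq. (15) with the parameters above, the short sketch below plugs in the stated power figures; the inference time, explanation time, and image data volume are hypothetical placeholders, not values measured in this work.

```python
P_SAT_COMP = 10.0          # W, NVIDIA Jetson TX1 onboard computing power
P_GS_COMP = 320.0          # W, NVIDIA RTX 3080 ground-station computing power
PE_SAT_COMM = 1.0 / 200.0  # W per Mbps, satellite-ground link power efficiency [33]

def cost_comp(t_inf_s, t_expl_s, data_mbit):
    """Total energy (J) of Eq. (15): onboard inference + downlink + ground-side explanation."""
    return P_SAT_COMP * t_inf_s + PE_SAT_COMM * data_mbit + P_GS_COMP * t_expl_s

# Example with made-up values: 50 ms onboard inference, 25 ms explanation, an 8 Mbit image.
print(cost_comp(t_inf_s=0.050, t_expl_s=0.025, data_mbit=8.0))   # about 8.54 J
```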
As shown in Table 7, Grad-CAM shows the smallest cost, as it does not apply any attention modules. DropAtt and DropAtt + Discrim. show lower cost than ABN thanks to the advantage of DropAtt, which does not need all the attention modules for deriving the explanations. Our final scheme (Algorithm 1) shows a nearly 3% higher cost than ABN, but it enhances explainability as a trade-off.
TABLE 7. Comparison of computational cost required to process the satellite imagery analysis with regard to the applying methods.

Method                              | Cost_comp (J)
Grad-CAM                            | 8.01
ABN                                 | 8.81
DropAtt                             | 8.41
DropAtt + Discrim.                  | 8.44
Whole Proposed Scheme (Algorithm 1) | 9.09
F. FUTURE WORK
Compared to Grad-CAM, the proposed method has a limitation in that additional computational overhead is required due to the recursive search step of finding the proper regional reflection ratios over multiple episodes. However, excluding this preparation step of recursive search, in the serving phase the proposed method requires only 13% additional computational cost compared to Grad-CAM, while improving the explainability.
Moreover, in the proposed method of mediating explanations, by taking only the partial regions of each explanation, background bias can be filtered out in some cases (see the results of the harbor example), but it is practically hard to filter out all wrong regions of the explanations (see the results of the airplane example). As our method adjusts the mediation policy by observing explainability over the whole set of data instances, such a limitation on sophisticated mediation inevitably remains. This limitation could be overcome by adjusting the mediation policy while observing changes in explainability for each individual data instance, but this would incur a huge computational overhead instead. Nevertheless, we verified that our explanation mediation scheme can improve explainability by integrating various explanations, and we leave such elaborate mediation for future work.
VI. CONCLUSION
In this paper, we identified two main problems in deriving visual explanations with attention modules: 1) heavy computational overhead is required to train a new model from scratch for each episode of applying attention modules, and 2) the visual explanations show diversity (complexity) over the various layer-wise policies of applying attention modules and the various target layers, which makes it hard for a human user to choose a reliable explanation for making a final decision.
In order to overcome these problems, we proposed a new method of mediating various attention-variant layer-wise explanations by generating several episodes from the amortized model with the help of DropAtt. Through the multi-disciplinary debate in the mediation process, a single integrated explanation can be derived, which improves explainability by complementing the strengths and weaknesses of the explanations from multiple episodes and target layers.
In the empirical evaluation, our DropAtt shows stable task performance when applying multiple episodes of attention modules from the amortized model, and even a slight improvement in task performance over the baseline. Moreover, by applying our explanation mediation scheme with the multi-disciplinary debate on the multiple episodes generated from the amortized model with the help of DropAtt, it further achieves improvements on both explainability metrics and derives more sophisticated explanations.
REFERENCES
[1] Gianluca Giuffrida, Lorenzo Diana, Francesco de Gioia, Gionata Benelli,
Gabriele Meoni, Massimiliano Donati, and Luca Fanucci. Cloudscout: a
deep neural network for on-board cloud detection on hyperspectral images.
Remote Sensing, 12(14):2205, 2020.
[2] Yogesh H Bhosale and K Sridhar Patnaik. Application of deep learning
techniques in diagnosis of covid-19 (coronavirus): A systematic review.
Neural Processing Letters, pages 1–53, 2022.
[3] Yogesh H Bhosale and K Sridhar Patnaik. Iot deployable lightweight
deep learning application for covid-19 detection with lung diseases using
raspberrypi. In 2022 International Conference on IoT and Blockchain
Technology (ICIBT), pages 1–6. IEEE, 2022.
[4] Heejae Kim, Kyungchae Lee, Changha Lee, Sanghyun Hwang, and Chan-
Hyun Youn. An alternating training method of attention-based adapters
for visual explanation of multi-domain satellite images. IEEE Access,
9:62332–62346, 2021.
[5] Woo-Joong Kim and Chan-Hyun Youn. Cooperative scheduling schemes
for explainable dnn acceleration in satellite image analysis and retraining.
IEEE Transactions on Parallel and Distributed Systems, 33(7):1605–1618,
2021.
[6] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna
Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations
from deep networks via gradient-based localization. In Proceedings of the
IEEE international conference on computer vision, pages 618–626, 2017.
[7] Hiroshi Fukui, Tsubasa Hirakawa, Takayoshi Yamashita, and Hironobu
Fujiyoshi. Attention branch network: Learning of attention mechanism
for visual explanation. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 10705–10714, 2019.
[8] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon.
Cbam: Convolutional block attention module. In Proceedings of the
European conference on computer vision (ECCV), pages 3–19, 2018.
[9] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model
predictions. Advances in neural information processing systems, 30, 2017.
[10] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016.
[11] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick
Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise
explanations for non-linear classifier decisions by layer-wise relevance
propagation. PloS one, 10(7):e0130140, 2015.
[12] Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Weakly
supervised object localization with multi-fold multiple instance learn-
ing. IEEE transactions on pattern analysis and machine intelligence,
39(1):189–203, 2016.
[13] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object lo-
calization for free?-weakly-supervised learning with convolutional neural
networks. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 685–694, 2015.
[14] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio
Torralba. Learning deep features for discriminative localization. In
Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 2921–2929, 2016.
[15] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N
Balasubramanian. Grad-cam++: Generalized gradient-based visual expla-
nations for deep convolutional networks. In 2018 IEEE winter conference
on applications of computer vision (WACV), pages 839–847. IEEE, 2018.
[16] Harish Guruprasad Ramaswamy et al. Ablation-cam: Visual explanations
for deep convolutional network via gradient-free localization. In Proceed-
ings of the IEEE/CVF Winter Conference on Applications of Computer
Vision, pages 983–991, 2020.
[17] Xiang Li, Yuchen Jiang, Yiliu Liu, Jiusi Zhang, Shen Yin, and Hao Luo.
Ragcn: Region aggregation graph convolutional network for bone age
assessment from x-ray images. IEEE Transactions on Instrumentation and
Measurement, 71:1–12, 2022.
[18] Xiang Li, Yuchen Jiang, Jiusi Zhang, Minglei Li, Hao Luo, and Shen
Yin. Lesion-attention pyramid network for diabetic retinopathy grading.
Artificial Intelligence in Medicine, 126:102259, 2022.
[19] Peng-Tao Jiang, Chang-Bin Zhang, Qibin Hou, Ming-Ming Cheng, and
Yunchao Wei. Layercam: Exploring hierarchical class activation maps
for localization. IEEE Transactions on Image Processing, 30:5875–5888,
2021.
[20] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based
visual attention for rapid scene analysis. IEEE Transactions on pattern
analysis and machine intelligence, 20(11):1254–1259, 1998.
[21] Ronald A Rensink. The dynamic representation of scenes. Visual
cognition, 7(1-3):17–42, 2000.
[22] Maurizio Corbetta and Gordon L Shulman. Control of goal-directed
and stimulus-driven attention in the brain. Nature reviews neuroscience,
3(3):201–215, 2002.
[23] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang
Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for
image classification. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 3156–3164, 2017.
[24] Lingxiao Yang, Ru-Yuan Zhang, Lida Li, and Xiaohua Xie. Simam:
A simple, parameter-free attention module for convolutional neural net-
works. In International conference on machine learning, pages 11863–
11874. PMLR, 2021.
[25] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui
Ding, Piotr Mardziel, and Xia Hu. Score-cam: Score-weighted visual
explanations for convolutional neural networks. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition work-
shops, pages 24–25, 2020.
[26] Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A
benchmark for interpretability methods in deep neural networks. Advances
in neural information processing systems, 32, 2019.
[27] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz
Hardt, and Been Kim. Sanity checks for saliency maps. Advances in
neural information processing systems, 31, 2018.
[28] Szymon Bobek, Paweł Bałaga, and Grzegorz J Nalepa. Towards model-
agnostic ensemble explanations. In International Conference on Compu-
tational Science, pages 39–51. Springer, 2021.
[29] Chih-Kuan Yeh, Cheng-Yu Hsieh, Arun Suggala, David I Inouye, and
Pradeep K Ravikumar. On the (in) fidelity and sensitivity of explanations.
Advances in Neural Information Processing Systems, 32, 2019.
[30] Yang Zhang, Ashkan Khakzar, Yawei Li, Azade Farshad, Seong Tae
Kim, and Nassir Navab. Fine-grained neural network explanation by
identifying input features with predictive information. Advances in Neural
Information Processing Systems, 34, 2021.
[31] Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions
for land-use classification. In Proceedings of the 18th SIGSPATIAL
international conference on advances in geographic information systems,
pages 270–279, 2010.
[32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016.
[33] Yuanjun Wang, Jiaxin Zhang, Xing Zhang, Peng Wang, and Liangjingrong
Liu. A computation offloading strategy in satellite terrestrial networks
with double edge computing. In 2018 IEEE international conference on
communication systems (ICCS), pages 450–455. IEEE, 2018.
MINSU JEON received the B.S. degree in electronic engineering from Sogang University in 2016 and the M.S. degree in electrical engineering from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2017. He is currently pursuing the Ph.D. degree in electrical engineering at KAIST. His research interests include deep learning (DL) applications and models, DL model compression, DL serving, and high performance computing systems.
TAEWOO KIM received the B.S. degree in electrical engineering from Kyungpook National University, Daegu, South Korea, in 2015, and the M.S. degree in electrical engineering from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2017. He is currently pursuing the Ph.D. degree in electrical engineering at KAIST. His research interests include deep learning (DL) frameworks, GPU computing, and interactive learning.
SEONGHWAN KIM received the B.S. degree
in media and communications engineering from
Hanyang University, Seoul, South Korea, in 2012,
and the integrated master’s and Ph.D. degrees in
electrical engineering from the Korea Advanced
Institute of Science and Technology (KAIST),
Daejeon, South Korea, in 2020. He is currently
a Postdoctoral Researcher with the Network and
Computing Laboratory, KAIST. His research interests include cloud brokering systems, deep learning accelerators, and high performance computing.
CHANGHA LEE received the B.S. degree in elec-
tronic engineering from Hanyang Univ., Seoul,
Korea (2018), and M.S. degree in electronic engi-
neering from Korea Advanced Institute of Science
and Technology (KAIST), Daejeon, Korea (2020).
He is currently a Ph.D. student at KAIST. Since 2018, he has been a member of the Network and Computing Laboratory at KAIST, and his current research interests include deep learning acceleration platforms and high performance edge-cloud computing systems.
CHAN-HYUN YOUN (S’84–M’87–SM’2019)
received the B.Sc and M.Sc degrees in Electronics
Engineering from Kyungpook National Univer-
sity, Daegu, Korea, in 1981 and 1985, respectively,
and the Ph.D. degree in Electrical and Commu-
nications Engineering from Tohoku University,
Japan, in 1994. Before joining the University, from
1986 to 1997, he was a Head of High-Speed Net-
working Team at KT Telecommunications Net-
work Research Laboratories, where he had been
involved in the research and developments of centralized switching main-
tenance system, high-speed networking, and ATM network. Since 1997,
he has been a Professor at the School of Electrical Engineering in Korea
Advanced Institute of Science and Technology (KAIST), Daejeon, Korea.
He was an Associate Vice-President of the office of planning and budgets at KAIST from 2013 to 2017. He is also the Director of the Grid Middleware Research Center and the XAI Acceleration Technology Research Center at KAIST, where he is developing core technologies in the areas of high performance computing, explainable AI systems, satellite imagery analysis, AI acceleration systems, and others. He was the general chair of the 6th EAI International Conference on Cloud Computing (Cloud Comp 2015), held at KAIST in 2015. He wrote a book on Cloud Broker and Cloudlet for Workflow Scheduling (Springer, 2017). Dr. Youn was also a Guest Editor of IEEE Wireless Communications in 2016, and has served as a TPC member for many international conferences.