Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2023.0322000
Bridging the Knowledge Gap via
Transformer-based Multi-Layer Correlation
Learning
HUN-BEOM BAK1 AND SEUNG-HWAN BAE1 (Member, IEEE)
1Vision and Learning Laboratory, Department of Electrical and Computer Engineering, Inha University, Incheon 22212, South Korea
Corresponding author: Seung-Hwan Bae (e-mail: shbae@inha.ac.kr).
This work was supported by INHA UNIVERSITY Research Grant.
ABSTRACT We tackle a multi-layer knowledge distillation problem between deep models with heterogeneous architectures. The main challenges are the mismatches of feature maps in terms of resolution and semantic level. To resolve this, we propose a novel transformer-based multi-layer correlation knowledge distillation (TMC-KD) method to bridge the knowledge gap between a pair of networks. Our method narrows the relational knowledge gaps between teacher and student models by minimizing their local and global feature correlation discrepancies. Comparisons with recent KD methods on CIFAR-100 and ImageNet show that our TMC-KD method achieves consistent accuracy improvements. Furthermore, we evaluate TMC-KD on the rip current dataset and show that it outperforms other KD methods on the detection task.
INDEX TERMS Correlation Learning, Image Classification, Knowledge Distillation, Model Compression,
Object Detection, Transformer-based Learning
I. INTRODUCTION
Over the past decades, deep neural networks have shown promising performance on many downstream vision tasks such as image classification and object detection. Recently, there have been many efforts to deploy powerful deep models on small or embedded devices with limited hardware resources. To achieve this, a common approach is to reduce the model size while preserving as much of its learned knowledge as possible via pruning [1], quantization [2], or knowledge distillation (KD) [3]. In particular, KD methods suffer less from accuracy degradation and complex training than the others.
The vanilla KD method developed by [3] lets a smaller student model mimic the representation of a larger teacher model. This is achieved by aligning the output responses (e.g. logits and predictions) of both models. However, transferring only the output knowledge of the teacher often achieves marginal improvement due to the limited distillation of representations in the mid-level layers. Therefore, there have been attempts to align the intermediate knowledge between teacher and student networks to transfer more knowledge. For instance, FitNets [4] adds a hint training procedure for distilling selected intermediate layers of a teacher, and SemCKD [5] distills correlation features using attention learning. Inspired by these works, our work is also based on the distillation of mid-level features as well as output features.
In this work, we assume that the relational knowledge within the feature maps of the teacher network is beneficial and should be transferred to the student network. Unfortunately, most recent KD works on intermediate-layer knowledge [5] pay little attention to this point. In works using relational structure knowledge, this relational knowledge is considered essential for improving the robustness or accuracy of the student model. To achieve this, feature distance-wise and angle-wise losses are exploited in [6], and the inter-channel correlation [7] of features is learned to capture the intrinsic feature distributions of a teacher model. However, the cross-correlations between multiple features of the teacher and student models are not leveraged as well as in the multi-layer KD methods. Therefore, our work aims to transfer the relational knowledge of a teacher model and align the multi-layer feature maps of different models.
To achieve that, a powerful KD model which can capture both correlations is required, and we present a novel transformer-based multi-layer correlation learning for knowledge distillation (TMC-KD). One of the main issues in applying the transformer to KD is how to encode multi-level feature tensors with different dimensionality. To resolve this, we present a multi-layer feature converter (MLC) that transforms the different-level features into a series of encoded features. Based on the multi-head attention learning of the transformer, we can then produce decoded features by feeding the serially encoded features to the transformer. To align the knowledge level between the teacher and student networks across mid-level layers, we learn layer-wise matched local correlations from the similarity between the teacher and student decoded features. We then minimize the local semantic gap between the internal layers with the learned local correlation. Moreover, we reduce their global knowledge gap by minimizing the self-correlation discrepancy of the whole decoded features.
To prove its effectiveness, we compare our TMC-KD with recent KD methods on the CIFAR-100 and ImageNet datasets. Our TMC-KD method offers greater accuracy improvements than other KD methods on both sets for most student models, regardless of their architecture. In addition, we provide an ablation study to show the usefulness of each component.
To sum up, our contributions are:
• We propose a novel transformer-based multi-layer correlation learning for relational knowledge distillation across intermediate layers.
• We design a multi-layer feature converter to transform multi-level features into sequentially encoded ones and use them as the inputs of the transformer.
• We present global and local correlation learning for bridging the local and global knowledge gaps between teacher and student models.
In Section II, we present works related to knowledge distillation. In Section III, we present our TMC-KD method, composed of a multi-layer feature converter, local semantic learning, and global relational learning. Section IV provides the experimental setup and results, and Section V concludes.
II. RELATED WORKS
A. MULTI-LAYER KNOWLEDGE DISTILLATION
As one of the pioneering works, Hinton et al. [3] propose a simple knowledge distillation that minimizes the distance between teacher and student outputs. Distilling only the model outputs is effective, but it shows insufficient results for many tasks. Using multiple teachers [8] or teacher assistants [9], [10] improves the generalization, robustness, and accuracy of the student model; applying curriculum learning [11] and generating virtual distributions [12] also improve student models. To distill more teacher knowledge from other layers, some works present KD methods using intermediate-layer feature maps. The existing works can be categorized into local correlation learning [13]–[17] between teacher and student features, and relational correlation learning [6], [7], [18] within the model itself. The former works use the local correlation between matched layer features of the teacher and student models. FitNets [4] minimizes the L2 distance between teacher and student intermediate features. Attention-guided KD methods [16], [19], [20] for object classes are introduced to transfer more knowledge of the crucial regions. VID [21] maximizes the mutual information between teacher and student intermediate features, and SimKD [15] reuses a pre-trained teacher classifier.
A common limitation of these methods is that they require prior knowledge of the target layers to be distilled within the teacher and student models. To overcome this, some KD methods [22], [23] solve the layer assignment problem using the attention mechanism. Specifically, SemCKD [5] and ASM [13] calculate the correlations across intermediate layers.
On the other hand, in the latter approach, a student model tries to learn the relational representation of a teacher. [24] defines the flow of solution procedure (FSP) matrix to distill the flow knowledge between sequential layers. RKD [6] learns the relational knowledge of data samples in terms of angles and distances. In [18], a student is trained to mimic the teacher's activation patterns for similar training samples. ICKD [7] computes the inter-channel correlation to capture feature diversity and homology. LSL [25] defines inter-class and inter-layer Gram matrices to evaluate the diversity and discrimination of feature maps.
Since both approaches show promising results and can be complementary, we present a transformer-based KD method that combines them. This allows us to transfer important knowledge about both the internal feature relationships within a model and the cross-correlations between models, resulting in improved performance for the target student.
B. TRANSFORMER-BASED KNOWLEDGE DISTILLATION
Although the transformer [26] was originally proposed for natural language processing, its variants have achieved remarkable performance on vision tasks [27], [28]. In addition, KD methods [29], [30] which distill the internal knowledge of a teacher transformer have been developed. A target-aware transformer [14] transfers the spatial semantic knowledge of a teacher via one-to-all spatial matching. In this work, we exploit a transformer for learning global and local correlations over many-to-many layer matchings between different models, rather than using it as the teacher itself [29], [30].
III. METHOD
We first discuss the preliminaries of KD and multi-layer KD methods. We then explain our TMC-KD method, which reduces the local semantic and global relational gaps between models using our multi-layer feature converter.
A. PRELIMINARY
1) Knowledge distillation
We denote f_t and f_s as the teacher and student models, respectively. In general, the pre-trained f_t with more parameters is superior to f_s with fewer. The goal of knowledge distillation is then to improve f_s by transferring the core knowledge of f_t on a dataset D = {(x^(i), y^(i))}_{i=1}^{N} with N samples, where, for the image classification task, each sample consists of an image x^(i) and its label y^(i).

FIGURE 1. The overall architecture of transformer-based multi-layer knowledge distillation (TMC-KD), mainly consisting of (a) local semantic learning and (b) global relational learning parts, is described in Sec. III-B. Local semantic learning minimizes the discrepancy between intermediate-layer features using the learned local correlation, whereas global relational learning reduces the gaps of decoded features from a transformer using self-attention.

In Hinton et al.'s KD [3], the knowledge gap between f_t and f_s is reduced by minimizing the cross entropy (CE) between the student's predicted label σ(z_s^(i)) from the softmax layer σ(·), with the logits z_s^(i) = f_s(x^(i)) as input, and the one-hot encoded target label y^(i) for the image x^(i). In addition, the Kullback-Leibler (KL) divergence between the teacher and student predicted probabilities σ(z_t^(i)) and σ(z_s^(i)) is added, giving the total KD loss:

L_KD = Σ_{i=1}^{N} [ CE(σ(z_s^(i)), y^(i)) + τ² KL(σ(z_s^(i)/τ), σ(z_t^(i)/τ)) ]   (1)
where τ is a temperature factor that controls the softness of the outputs. Since the KD loss only reduces the gap between the output predictions of both models, multi-layer KD methods introduce additional losses to reduce the discrepancy of the mid-level features.
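As a reference, the loss of Eq. (1) can be sketched compactly in PyTorch; this is a minimal sketch assuming batched logits z_s, z_t and integer labels y, following the common KD convention of using the teacher distribution as the KL target:

import torch
import torch.nn.functional as F

# Minimal sketch of Eq. (1): cross entropy on the student logits plus
# the temperature-scaled KL term; tau = 4 follows Sec. IV-B.
def kd_loss(z_s, z_t, y, tau=4.0):
    ce = F.cross_entropy(z_s, y)
    kl = F.kl_div(F.log_softmax(z_s / tau, dim=1),
                  F.softmax(z_t / tau, dim=1),
                  reduction='batchmean')
    return ce + (tau ** 2) * kl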
2) Multi-Layer knowledge distillation
Since the dimension of a feature map at each layer usually differs across models, we denote F_t^m ∈ R^(B×C_t^m×H_t^m×W_t^m) and F_s^j ∈ R^(B×C_s^j×H_s^j×W_s^j) as the m-th and j-th feature maps of the teacher t and student s, where H and W are the height and width of the feature map, and B and C are the cardinalities of the batches and channels. M and J denote the numbers of teacher and student layers. Multi-layer KD methods [4], [5] then reduce the feature gaps across layers between the two models by minimizing the total mean square error:

L_Local = Σ_{j=1}^{J} Σ_{m=1}^{M} Λ^{j,m}_Local ‖φ_s(F_s^j) − φ_t(F_t^m)‖²₂   (2)
where φ_s(·) and φ_t(·) are transformation functions that match the channel number and feature resolution between F_s^j and F_t^m. Unlike [4], the multi-layer KD methods [5] avoid requiring prior knowledge of which layers form an associated pair by learning the local correlation Λ^{j,m}_Local. For instance, SemCKD [5] computes the correlation Λ^{j,m}_Local between the m-th teacher and j-th student features with query-key attention learning.
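For reference, Eq. (2) reduces to a Λ-weighted mean square error over all layer pairs; the sketch below assumes lists of already-transformed features φ_s(F_s^j) and φ_t(F_t^m) with matching shapes per pair and a given correlation matrix Λ (the full batched version appears in Appendix B):

import torch

# Minimal sketch of Eq. (2): Lambda-weighted MSE over all (j, m) pairs.
# f_s and f_t are lists of transformed features whose shapes match per pair.
def local_loss(f_s, f_t, Lambda):
    loss = 0.0
    for j in range(len(f_s)):
        for m in range(len(f_t)):
            loss = loss + Lambda[j, m] * torch.mean((f_s[j] - f_t[m]) ** 2)
    return loss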
B. TRANSFORMER-BASED MULTI-LAYER CORRELATION
LEARNING
The limitation of the multi-layer KD methods discussed above is that they do not leverage the relational knowledge between the feature maps within a teacher network. However, in most cases there is some relationship between multi-layer features, since the output feature of a preceding layer is used as the input of the succeeding layer. We conjecture that this feature relational knowledge within a teacher model should be transferred to the student. In particular, we strive to transfer the strong global relations among the teacher's features to the student, and present a transformer-based multi-layer KD to achieve this. More concretely, we encode the multi-layer features of the teacher and student models into sequential features to compensate for feature dimension mismatches. Then, we generate the decoded features of each model by exploiting the encoded features as keys or queries alternately. Subsequently, we learn the local correlation between the decoded features and use it for local and global relational learning. In the local semantic learning, we use the local correlation as Λ^{j,m}_Local in Eq. (2) and minimize the decoded feature discrepancy. On the other hand, the global relational learning allows the student to mimic the global representation across all feature maps within the teacher model.
1) Multi-layer feature converter
FIGURE 2. The structure of the multi-layer feature converter (MLC). The parameter sizes of the Conv and FC layers can be tuned by the dimensionality of an input feature. Thus, it can produce the output encoded features V_tm and V_sj with the same dimensionality.

Due to the mismatches of intermediate features between teacher and student models, we design a multi-layer feature converter (MLC) that is applied before feeding each feature to the transformer. As shown in Fig. 2, our MLC consists of two 1×1 Conv layers, a ReLU, a normalization layer, and a fully-connected layer. By applying an MLC ψ(·) to each F_sj and F_tm, we can produce the encoded features V_tm ∈ R^(B×E) and V_sj ∈ R^(B×E) with the same dimensionality E: V_tm = ψ_tm(F_tm) and V_sj = ψ_sj(F_sj). We then concatenate the features of all the layers {V_tm}_{m=1}^{M} and {V_sj}_{j=1}^{J} to obtain the sequentially encoded features V_t and V_s:

V_t = Concat(V_t1, ..., V_tm, ..., V_tM),   V_s = Concat(V_s1, ..., V_sj, ..., V_sJ)   (3)

Then, we use V_t and V_s as the inputs of the transformer, as described in the next section.
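As an illustration, the conversion and concatenation of Eq. (3) could look as follows; this is a sketch assuming the MLC class of Appendix B (Algorithm A1) and example teacher feature shapes:

import torch
# MLC is the converter class given in Appendix B (Algorithm A1).

B, E = 64, 16
feats_t = [torch.randn(B, 16, 32, 32), torch.randn(B, 32, 16, 16),
           torch.randn(B, 64, 8, 8)]                      # F_t1, ..., F_tM
mlcs_t = [MLC(in_chn=f.shape[1], in_w=f.shape[3], dim_out=E) for f in feats_t]

# psi_tm(F_tm) produces V_tm of shape (B, E); stacking the per-layer
# embeddings yields the length-M token sequence V_t of Eq. (3).
V_t = torch.stack([psi(f) for psi, f in zip(mlcs_t, feats_t)], dim=1)  # (B, M, E)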
2) Local semantic learning
Because a transformer [26] is a powerful way to learn global feature correlations, we use it for our multi-layer KD. Following the implementation of [26], we design an encoder Enc and a decoder Dec as stacks of N_E = 6 identical layers. By feeding the sequentially encoded features V_t and V_s of Eq. (3) to the transformer, we produce the decoded features P_t ∈ R^(B×M×E) and P_s ∈ R^(B×J×E) as:

P_t = Dec(Enc(V_s), V_t),   P_s = Dec(Enc(V_t), V_s)   (4)

To describe the encoding process in Enc, we denote W_Q ∈ R^(E×E) and W_K ∈ R^(E×E) as the query and key weight matrices. Then, we learn the global correlation C_Global within V_t or V_s by the self-attention mechanism with matrix multiplication (∗) as

C_Global(V_q) = (W_Q ∗ V_q) ∗ (W_K ∗ V_q)^T / √E   (5)

where V_q can be V_t or V_s. We then learn the h-th head attention features H^h_Enc of the N_H = 8 attention heads by applying the global correlation C_Global:

H^h_Enc(C_Global, V_q, W_V) = C_Global ∗ (W_V ∗ V_q)   (6)

where W_V ∈ R^(E×E) is the learned value weight matrix for V_q. The multi-head attention consists of concatenating the heads, an additional weight, a residual connection, and layer normalization. We then obtain the e-th encoder output F^e_Enc by applying two feed-forward networks with a single activation function, followed by layer normalization. Each encoder layer output F^e_Enc is fed into the next encoder layer, and we denote the output of the last encoder layer as F^{N_E}_Enc.

In the decoder Dec, V_p, which can be V_t or V_s, is fed into the self-attention, and its output F^{V_p}_Self and the encoder output F^{N_E}_Enc are fed into the cross-attention. Using multi-head attention with Eqs. (5) and (6), we produce the enhanced feature F^{V_p}_Self for V_p. Then, we compute the cross-attention C_Cross from the encoder output F^{N_E}_Enc and F^{V_p}_Self:

C_Cross(F^{V_p}_Self, F^{N_E}_Enc) = (W_Q ∗ F^{V_p}_Self) ∗ (W_K ∗ F^{N_E}_Enc)^T / √E   (7)

The cross-attention-applied feature map H^h_Cross is given as:

H^h_Cross(C_Cross, W_V, F^{V_p}_Self) = C_Cross ∗ (W_V ∗ F^{V_p}_Self)   (8)

Similar to the encoder, we feed each H^h_Cross to the next decoder layer for N_E − 1 steps, and denote P_t and P_s as the outputs of the last decoder for the teacher and student.

In principle, we could minimize the local semantic gap between models directly by Eq. (2). However, in our TMC-KD, we use the decoded features P_t and P_s to evaluate Λ^{j,m}_Local. To this end, we first slice P_t ∈ R^(B×M×E) and P_s ∈ R^(B×J×E) into {p_tm ∈ R^(B×E)}_{m=1}^{M} and {p_sj ∈ R^(B×E)}_{j=1}^{J}. We then learn the local correlation Λ^{j,m}_Local in the form of a softmax over the dot-product attention as

Λ^{j,m}_Local = exp(p_tm · p_sj^T) / Σ_{m'=1}^{M} Σ_{j'=1}^{J} exp(p_tm' · p_sj'^T).   (9)
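A minimal sketch of Eqs. (4) and (9), assuming PyTorch's nn.Transformer in place of the hand-written encoder/decoder (keeping the paper's N_E = 6 layers and N_H = 8 heads), could look as follows:

import torch
import torch.nn as nn

B, M, J, E = 64, 4, 4, 16
V_t = torch.randn(B, M, E)   # teacher features encoded by the MLCs
V_s = torch.randn(B, J, E)   # student features encoded by the MLCs

transformer = nn.Transformer(d_model=E, nhead=8, num_encoder_layers=6,
                             num_decoder_layers=6, batch_first=True)

# Eq. (4): P_t = Dec(Enc(V_s), V_t) and P_s = Dec(Enc(V_t), V_s).
P_t = transformer(src=V_s, tgt=V_t)   # (B, M, E)
P_s = transformer(src=V_t, tgt=V_s)   # (B, J, E)

# Eq. (9): local correlation normalized over all (m, j) pairs per sample.
sim = torch.einsum('bme,bje->bmj', P_t, P_s)
Lambda = torch.softmax(sim.reshape(B, -1), dim=-1).reshape(B, M, J)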
3) Global relational learning
We make the student learn the global feature dependency within the teacher model as well as the global feature correlation between the two models. Since the decoded features P_t and P_s are attention-refined features from the global and local correlation learning of Eq. (4), they include global relational information, which we use for this purpose. To transfer this knowledge, we define L_Global with the self-correlation of each decoded feature as

L_Global = ‖P_t · P_t^T − P_s · P_s^T‖²₂   (10)
4VOLUME 11, 2023
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3387859
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS
FIGURE 3. Examples from the rip current dataset.
Algorithm 1 Transformer-based multi-layer correlation knowledge distillation (TMC-KD)
Input: Training dataset D, a pre-trained teacher f_t with parameters θ_t, a student f_s with parameters θ_s
Output: A trained f_s
1: while θ_s is not converged do
2:   Extract feature maps F_tm and F_sj by feeding a mini-batch sampled from D to f_t and f_s.
3:   Compute L_KD by Eq. (1).
4:   // Multi-layer feature converter
5:   Convert F_tm and F_sj to V_tm and V_sj using ψ(·).
6:   Compute V_t and V_s by Eq. (3).
7:   // Local semantic learning
8:   Compute P_t and P_s by Eq. (4).
9:   Slice P_t and P_s to obtain {p_tm}_{m=1}^{M} and {p_sj}_{j=1}^{J}.
10:  Compute Λ^{j,m}_Local by Eq. (9).
11:  Compute L_Local by Eq. (2).
12:  // Global relational learning
13:  Compute the self-correlations of P_t and P_s.
14:  Compute L_Global by Eq. (10).
15:  Compute L_Total by Eq. (11).
16:  Update θ_s by minimizing Eq. (11).
17: end while
By minimizing Eq. (10), we make the student mimic the global representation of the teacher. Finally, we define the total TMC-KD loss, including the local and global losses from Eq. (2) and Eq. (10):

L_Total = L_KD + ζ L_Global + β L_Local   (11)

where β and ζ are balancing parameters tuned experimentally; Figure 5 studies the sensitivity to the hyper-parameters ζ and β. Our training process is described in Algorithm 1.
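For clarity, the combination in Eq. (11) is simply a weighted sum; a minimal sketch with the values of Sec. IV-B (β = 50, ζ = 0.1) is:

# Minimal sketch of Eq. (11); loss_kd, loss_local, and loss_global are
# scalar tensors computed per Eqs. (1), (2), and (10).
def tmc_kd_loss(loss_kd, loss_local, loss_global, beta=50.0, zeta=0.1):
    return loss_kd + zeta * loss_global + beta * loss_local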
IV. EXPERIMENTS
In this section, we evaluate our TMC-KD by comparing it with recent KD methods. We also provide an ablation study and a sensitivity analysis to validate our method.
A. DATASET
We exploit the CIFAR-100 [31] and ImageNet [32] datasets for classification and report the accuracy of methods in terms of Top-1 accuracy, so a higher score indicates better results.
CIFAR-100 has 100 classes, and each class consists of 500 training samples and 100 test samples. All samples of CIFAR-100 have 32×32 resolution. ImageNet has 1.2M images and 1,000 object classes. All images are resized to 224×224 during training and testing.
We also evaluate our TMC-KD on the more challenging object detection problem and compare the KD performance with other KD methods. For this comparison, we use the rip current detection dataset [34]. The dataset contains 1,600 annotated images with rip currents and 700 images without rip currents. For training, we use 1,200 images with rip currents and the 700 images without; for evaluation, we use the remaining 400 images with rip currents. Since the appearances of rip currents vary widely, it is very challenging for the target student detector to learn generalized features of rip currents. Therefore, we use a teacher detector with a stronger backbone (R-101) for this KD comparison.
B. IMPLEMENTATION DETAILS
The architecture of our TMC-KD mainly consists of the multi-layer converter and the transformer. The details of the MLC structure are described in Sec. III-B. We set the batch size B to 64 for CIFAR-100 and 256 for ImageNet. The embedding size E is tuned to 16. When implementing the transformer, we follow the implementation of the original version [26]; therefore, the transformer is composed of a stack of 6 encoders and 6 decoders, and we use 8 parallel attention heads. For the teacher and student models, we use various CNN models such as ResNet [35], VGG [36], ShuffleNet [37], WRN [38], and MobileNet [39] in different combinations.

| Teacher | ResNet-32x4 | ResNet-32x4 | VGG-13 | ResNet-32x4 | WRN-40-2 | VGG-13 | ResNet-32x4 |
| (Top-1) | 79.42 | 79.42 | 74.64 | 79.42 | 75.61 | 74.64 | 79.42 |
| Student | VGG-8 | VGG-13 | ShuffleNetV2 | ShuffleNetV2 | MobileNetV2 | VGG-8 | ResNet-8x4 |
| (Top-1) | 70.46±0.29 | 74.82±0.22 | 72.60±0.12 | 72.60±0.12 | 65.43±0.29 | 70.46±0.29 | 73.09±0.30 |
| KD (NIPS-14) [3]* | 72.73±0.15 | 77.17±0.11 | 75.60±0.21 | 75.49±0.24 | 68.70±0.22 | 73.38±0.05 | 74.42±0.05 |
| FitNet (ICLR-15) [4]* | 72.91±0.18 | 77.06±0.14 | 75.44±0.11 | 75.82±0.22 | 68.64±0.12 | 73.63±0.11 | 74.32±0.08 |
| AT (ICLR-17) [19]* | 71.90±0.13 | 77.23±0.19 | 75.41±0.10 | 75.91±0.14 | 68.79±0.13 | 73.51±0.08 | 75.07±0.03 |
| SP (CVPR-19) [18]* | 73.12±0.10 | 77.72±0.33 | 75.54±0.18 | 75.77±0.08 | 68.48±0.36 | 73.53±0.23 | 74.29±0.07 |
| VID (CVPR-19) [21]* | 73.19±0.23 | 77.45±0.13 | 75.22±0.07 | 75.55±0.18 | 68.37±0.24 | 73.63±0.07 | 74.55±0.10 |
| HKD (CVPR-20) [33]* | 72.63±0.12 | 76.76±0.13 | 76.24±0.09 | 76.64±0.05 | 69.23±0.16 | 73.06±0.24 | 74.86±0.21 |
| SemCKD (AAAI-21) [5]* | 75.27±0.13 | 79.43±0.02 | 76.39±0.12 | 77.62±0.32 | 69.61±0.05 | 74.43±0.25 | 76.23±0.04 |
| TaT (CVPR-22) [14]◇ | N/A | N/A | N/A | N/A | N/A | 74.35 | 75.54 |
| TMC-KD (ours) | 76.23 | 79.83 | 76.91 | 77.88 | 70.03 | 75.10 | 76.63 |

TABLE 1. Comparison results with the recent multi-layer KD methods on CIFAR-100. Results marked * and ◇ are from [5] and [14], respectively. The best results are marked in bold.

In the KD loss Eq. (1), we set the temperature factor τ to 4. To find the optimal β and ζ used in the total KD loss Eq. (11), we perform the sensitivity analysis of Sec. IV-F and set β and ζ to 50 and 0.1. For KD training, we use the SGD optimizer with Nesterov momentum. For CIFAR-100, the initial learning rate is 0.01 for the variants of MobileNet and ShuffleNet and 0.05 otherwise; we train models for 240 epochs and decay the learning rate by a factor of 0.1 at epochs 150, 180, and 210. For ImageNet, we train models for 100 epochs with an initial learning rate of 0.1, decayed by a factor of 0.1 at epochs 30, 60, and 90. We implement all the KD methods with the same HW/SW: an Intel Xeon @ 2.40 GHz, a Titan V, and PyTorch (v1.10).
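This schedule can be sketched as follows; `student` is a placeholder for the student plus the attached MLC/transformer modules, and the momentum value 0.9 is our assumption since the paper only states that Nesterov momentum is used:

import torch
import torch.nn as nn

student = nn.Linear(8, 8)  # placeholder for the student and attached modules
optimizer = torch.optim.SGD(student.parameters(), lr=0.05,
                            momentum=0.9, nesterov=True)  # 0.9 is assumed
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 180, 210], gamma=0.1)  # CIFAR-100 schedule

for epoch in range(240):
    # ... one epoch of TMC-KD training (Algorithm 1), then:
    scheduler.step()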
For implementing our TMC-KD knowledge distillation on detection, we use MMRazor. For the KD between teacher and student detectors, we compare the outputs of the feature pyramid networks (FPNs), since our method focuses on multi-layer KD. More specifically, we use the FPN outputs as inputs of the multi-layer converter (MLC) and the MLC outputs as inputs of the transformer for global and local correlation learning.
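As a sketch of this step, a torchvision FPN detector exposes its multi-level features through the backbone; the ResNet-50 FPN model below is an assumption for illustration (the paper uses MMRazor with R-101/R-18 detectors, whose feature-hooking API differs):

import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
detector.eval()
images = torch.randn(2, 3, 512, 512)

with torch.no_grad():
    # The backbone returns an OrderedDict of FPN levels ('0'..'3', 'pool');
    # each level would be fed to its own MLC and concatenated as in Eq. (3).
    fpn_feats = detector.backbone(images)

for name, feat in fpn_feats.items():
    print(name, tuple(feat.shape))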
C. COMPARISON ON CIFAR-100 AND IMAGENET
We compare our TMC-KD with KD [3] and the multi-layer KD methods FitNets [4], AT [19], SP [18], VID [21], HKD [33], SemCKD [5], ICKD [7], and TaT [14]. We also compare with the relational KD methods PKT [43], RKD [6], IRG [44], CC [45], CRD [46], and ICKD [7].
In Table 1, we provide the comparison results with multi-layer KD methods on CIFAR-100. As shown, our TMC-KD achieves the best results for all teacher-student combinations. Compared to the scores of the student baselines, our TMC-KD improves Top-1 accuracy by 4.74 points on average. In particular, for similar teacher and student architectures (e.g. VGG-13 and VGG-8), TMC-KD provides a 4.09-point accuracy gain on average, but a 4.99-point gain for the heterogeneous setup (e.g. ResNet-32x4 and VGG-13). TMC-KD thus provides larger gains between heterogeneous models, where exact correlation learning between multiple layers is necessary.
Table 2 shows the results of the relational KD methods. In this comparison, our TMC-KD is superior to the other methods. These results show that exploiting both inter-layer and intra-layer knowledge is very effective for KD.
| Teacher | ResNet-32x4 | WRN-40-2 | ResNet-32x4 |
| (Top-1) | 79.42 | 75.61 | 79.42 |
| Student | ResNet-8x4 | MobileNetV2 | VGG-8 |
| (Top-1) | 73.09±0.30 | 65.43±0.29 | 70.46±0.29 |
| PKT [43]* | 73.11±0.21 | 68.68±0.29 | 74.61±0.25 |
| RKD [6]* | 72.49±0.08 | 68.71±0.20 | 74.36±0.23 |
| IRG [44]* | 72.57±0.20 | 68.83±0.18 | 74.67±0.15 |
| CC [45]* | 72.63±0.30 | 68.68±0.14 | 74.50±0.13 |
| CRD [46]* | 73.54±0.19 | 69.98±0.27 | 75.59±0.07 |
| ICKD [7]† | 75.48 | N/A | N/A |
| TMC-KD | 76.63 | 70.03 | 76.23 |

TABLE 2. Comparison with the relation-based KD methods on CIFAR-100. The PKT (ECCV-18), RKD (CVPR-19), IRG (CVPR-19), CC (ICCV-19), and CRD (ICLR-20) results marked * are from [5]. The ICKD (ICCV-21) result marked † is from [7].
For further comparison, we evaluate our method on ImageNet in Table 4. To reproduce results, we use the officially released code of [5] for the KD through SemCKD implementations, and the code of [14] for the ICKD and TaT implementations.
| Method | AP@0.5:0.95 | AP@0.5 | AP@0.75 | AR@0.5:0.95 |
| Teacher (R-101)‡ | 42.9 | 93.0 | 29.6 | 50.2 |
| Student (R-18)‡ | 39.9 | 88.1 | 25.6 | 49.0 |
| CWD (ICCV-21) [40]‡ | 43.3 | 91.0 | 32.5 | 51.0 |
| FBKD (ICLR-21) [41]‡ | 43.6 | 91.7 | 33.7 | 51.9 |
| PKD (NeurIPS-22) [42]‡ | 43.7 | 92.3 | 35.5 | 51.0 |
| TMC-KD (ours)‡ | 44.2 | 91.9 | 34.4 | 51.7 |

TABLE 3. Results of knowledge distillation on rip current detection. Bold highlights the best results and underline indicates the second-best results. ‡ represents our implementation.
For ResNet-18, our TMC-KD achieves better accuracy than all methods except TaT. For ShuffleNetV2x0.5, our TMC-KD achieves the best performance in this heterogeneous setting¹. These results show that our method works well on the large-scale classification task.
Moreover, we provide a qualitative comparison in Fig. A1. We visualize the salient regions for classification using Grad-CAM [47] and compare ours with other KD methods. We visualize feature maps from the convolution layer before the last fully-connected layer. Even for regions that are not discriminative in other methods, our TMC-KD provides clearer saliency. Compared to the results of the teacher models, our TMC-KD produces almost identical saliency responses. More results can be found in Appendix A. These qualitative results support that our TMC-KD achieves the better accuracy in the quantitative comparison.
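As a reference for this visualization, a bare-bones Grad-CAM in the spirit of [47] can be sketched with forward/backward hooks; the model and layer choice below are assumptions for illustration:

import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18()  # stand-in for the student model
model.eval()
acts, grads = {}, {}

# Hook the last convolutional stage (the layer before the final FC).
model.layer4.register_forward_hook(lambda m, i, o: acts.update(v=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)
model(x)[0].max().backward()                    # gradient of the top-class logit

w = grads['v'].mean(dim=(2, 3), keepdim=True)   # channel weights
cam = F.relu((w * acts['v']).sum(dim=1))        # (1, H, W) saliency map
cam = cam / (cam.max() + 1e-8)                  # normalize to [0, 1]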
| Teacher | ResNet-34 (73.31) | |
| Student | ResNet-18 (69.67) | ShuffleV2x0.5 (54.73) |
| KD (NIPS-14) [3] | 70.62* | 50.42‡ |
| FitNet (ICLR-15) [4] | 70.31* | 53.36‡ |
| AT (ICLR-17) [19] | 70.30* | 54.49‡ |
| SP (CVPR-19) [18] | 69.99* | 54.42‡ |
| VID (CVPR-19) [21] | 70.30* | 54.49‡ |
| SemCKD (AAAI-21) [5] | 70.87* | 54.59‡ |
| ICKD (ICCV-21) [7] | 68.35‡ | 48.70‡ |
| TaT (CVPR-22) [14] | 71.74‡ | N/A |
| TMC-KD | 71.43 | 54.72 |

TABLE 4. Comparison with other KD methods on ImageNet. Bold indicates the best Top-1 accuracy. Results marked * are from [5]; results marked ‡ are from our re-implementation.
To compare complexity with other knowledge distillation methods, we measure the average training and inference time per epoch on ImageNet, using ResNet-34 and ResNet-18 as the teacher and student networks. The training speed of TMC-KD is slower than that of other methods due to the multi-layer KD with the attached MLC and transformer. However, the inference speed of our TMC-KD is similar to the others, since the attached modules are not used during inference. This implies that our TMC-KD does not impose extra costs on the target student at the inference stage.
¹TaT does not provide code for training ShuffleNetV2x0.5.
| Method | Avg. training time per epoch (hours) | Avg. inference time per epoch (seconds) |
| KD [3] | 0.417 | 90.198 |
| FitNet [4] | 0.428 | 90.245 |
| AT [19] | 0.426 | 94.340 |
| SP [18] | 0.419 | 90.708 |
| VID [21] | 0.462 | 90.612 |
| SemCKD [5] | 1.027 | 92.251 |
| ICKD [7] | 0.435 | 90.359 |
| TMC-KD | 1.261 | 90.886 |

TABLE 5. Average training and inference time of different KD methods per epoch on ImageNet. All results are from our re-implementation, measured in the same H/W environment.
D. COMPARISON ON RIP CURRENT DETECTION
For further comparison, we apply KD to object detection. Despite the development of high-performance detectors [48]–[51] in recent years, we use a simple Faster R-CNN [52] detector for implementation and comparison. We perform the KD between teacher and student detectors using CWD [40], FBKD [41], and PKD [42]. In the CWD-based implementation, we minimize the Kullback–Leibler divergence between teacher and student activation maps from the FPNs. In FBKD, we extract both spatial and channel attentions from the teacher and student detectors and then minimize the mean square error between the teacher and student attention maps. In PKD, we normalize the FPN outputs and minimize the mean square error between the teacher and student normalized features. While the other methods perform KD between same-level layers, our TMC-KD compares features among all FPN layers using the local correlation of Eq. (9). For this comparison, we evaluate detectors using the COCO-style metrics: average precision (AP) and average recall (AR). We evaluate AP at IoU ∈ [0.5:0.05:0.95] (AP@0.5:0.95), at IoU 0.5 (AP@0.5), and at IoU 0.75 (AP@0.75), and we evaluate AR at IoU ∈ [0.5:0.05:0.95] (AR@0.5:0.95).
In Table 3, we provide the comparison results for the KD methods on rip current detection. Compared to other KD methods, our TMC-KD achieves the best mean AP of 44.2%. By applying our TMC-KD, we greatly improve the student's mean AP by 4.3 points. On average recall, TMC-KD achieves the second-highest score of 51.7. While other KD methods ignore the correlation between different-level feature maps, TMC-KD considers both local and global correlations; therefore, it achieves high scores on both the AP and AR metrics.
Moreover, we provide a qualitative comparison in Fig. A3. We visualize the detection results and compare ours with other KD methods. On the ‘‘Rip-1705’’, ‘‘Rip-904’’, and ‘‘Rip-863’’ samples, TMC-KD detects rip currents successfully, whereas other KD methods produce missed or inaccurate detections of the rip currents.
E. ABLATION STUDY
We evaluate the effect of each component of TMC-KD by measuring Top-1 and Top-5 accuracy. For this study, we use ResNet-32x4 and VGG-8 as the teacher and student models, respectively. We use (a) a baseline with the KD method of [3], and then add our components one by one: (b) with the multi-layer local loss of Eq. (2), and (c) with the global relational loss of Eq. (10). To show the effect of our MLC, we also implement (d), which uses the SemCKD converter instead of our MLC; in this case, we use the feature pooling and 1×1 convolution layers described in SemCKD for matching the spatial resolution and channel number between different layers. Compared to the baseline, (c) using all our components provides gains of 3.64 Top-1 and 1.63 Top-5 accuracy. By adding the local and global losses, Top-1 accuracy improves to 75.96 and 76.23, and Top-5 accuracy to 93.40 and 93.47, respectively. When comparing (b) and (d), replacing our MLC with the SemCKD-based feature converter degrades Top-1 and Top-5 accuracy to 74.95 and 92.98. This is because our MLC generates stronger sequential features as inputs of the transformer. These results indicate that all our components are beneficial for improving multi-layer KD training.
|     | L_KD | L_Local | L_Global | Top-1 Acc. | Top-5 Acc. |
| (a) | ✓ | ✗ | ✗ | 72.59 | 91.84 |
| (b) | ✓ | ✓ | ✗ | 75.96 | 93.40 |
| (c) | ✓ | ✓ | ✓ | 76.23 | 93.47 |
| (d) | ✓ | ✓ (SemCKD converter) | ✗ | 74.95 | 92.98 |

TABLE 6. Results with different components of TMC-KD.
F. SENSITIVITY ANALYSIS
We investigate the sensitivity of our TMC-KD by varying the values of its important hyper-parameters. We use ResNet-32x4 and VGG-8 as the teacher and student models and evaluate them on CIFAR-100.
1) The size of the transformer
We change the number of stacked layers used in the encoder and decoder from 1 to 12 and report the results in Figure 4. We achieve the best score when using 6 layers. Using too many layers degrades accuracy, possibly due to a large discrepancy between decoded features; we expect that transformers with many layers can over-fit due to the small sample size of CIFAR-100. On the other hand, a transformer with few layers is likely insufficient to extract the exact correlation features.
FIGURE 4. Sensitivity analysis by changing the number of encoder and decoder layers.
FIGURE 5. Sensitivity analysis of our TMC-KD for ζ (bottom axis) and β (top axis).
2) Hyper-parameters ζ and β
We vary the values of ζ and β, which balance the losses in Eq. (11). For ζ, we change the value from 0.001 to 1000 by factors of 10. To investigate the effect of architecture difference, we fix the teacher model to ResNet-32x4, but use ResNet-8x4 and VGG-8 as student models for the homogeneous and heterogeneous setups. Figure 5 shows the results. We achieve the best scores of 76.23% and 76.63% for VGG-8 and ResNet-8x4 when using ζ = 0.1; however, the accuracy difference between 0.01 and 1 is rather marginal. For β, we change the value from 1 to 1000 by factors of 10 (including the value 400 tuned in SemCKD [5]). We also achieve the best scores of 76.23% and 76.63% for VGG-8 and ResNet-8x4 when using β = 400. The accuracy of both student models tends to improve as β increases up to β = 400.
V. CONCLUSION
For multi-layer KD, we propose a novel transformer-based multi-layer correlation KD (TMC-KD) method. Our TMC-KD can bridge the knowledge gap between different models via global and local correlation learning. To learn both correlations between intermediate layers of different architectures, we design a multi-layer feature converter and exploit it to transform multi-layer features into serially-connected encoded features. By using the decoded features and attention tensors from a transformer, we can minimize the discrepancy between models in terms of local and global semantic relations. The comparison results with recent KD methods prove the effectiveness of our method. In image classification, our TMC-KD provides about a 5% accuracy gain on average and outperforms other methods for knowledge distillation between heterogeneous architectures on CIFAR-100. On ImageNet, TMC-KD provides the best accuracy of 54.72 for ShuffleNetV2x0.5. For further comparison, we also evaluate our TMC-KD and other KD methods on the rip current detection set, where our TMC-KD achieves the best mAP of 44.2%, surpassing the other methods. The extensive ablation study shows the effects of the multi-layer feature converter and the local and global correlation learning.
The training speed of TMC-KD is slower than that of other KD methods due to the additional complexity of the transformer. To reduce this complexity, our future work could adopt deformable attention techniques [48], [53]. We believe that our work can serve as a solid guideline for multi-layer KD.
ACKNOWLEDGMENT
This work was supported by INHA UNIVERSITY Research
Grant.
APPENDIX A
QUALITATIVE COMPARISON
We visualize discriminative regions using Grad-CAM [47] for qualitative comparisons, using images from ImageNet [32]. As shown in Figures A1 and A2, our TMC-KD produces discriminative regions almost identical to those of the teacher model. Compared to other recent KD methods, our method clearly shows more distinctive salient regions. In particular, on some sample images (e.g. a Mergus serrator, a beagle, and a European fire salamander), our TMC-KD shows better results than even the oracle teacher model.
APPENDIX B
THE MAIN CODES FOR OUR METHOD IMPLEMENTATION
We provide our code for implementing the proposed methods described in Sec. III-B of our manuscript: the multi-layer feature converter (MLC), local semantic learning, and global relational learning. We implement our code using PyTorch [54]. Algorithm A1 describes the implementation of the MLC. The MLC class constructor requires the number of input channels, the input width, and an embedding size E for the outputs. During training, we set the dropout rate to 0.
In Algorithms A2 and A3, we provide the code for implementing local semantic learning. Algorithm A2 shows the C_Global and self-attention implementation described in Eqs. (5) and (6) of the paper. Given the weights for the query, key, and value, and the series of MLC-converted features V_P, we compute the global correlation C_Global and the self-attention H_Global using the multiple attention heads. Algorithm A3 shows how to compute the local correlation and the local KD loss L_Local between multiple layers. To evaluate Λ, we first perform the matrix multiplication between the teacher and student decoded features P_t and P_s and normalize it with the softmax. We then compute the local loss L_Local by evaluating the feature L2 distance weighted by the correlation Λ.
Algorithm A4 shows the implementation of the global loss in Eq. (10). We evaluate the global loss as the L2 distance between the self-correlations of the decoded features P_t and P_s.
FIGURE A1. Qualitative comparison of different KD methods on ImageNet: we visualize feature maps by using Grad-CAM. We use ResNet-34 and
ShuffleNetV2x0.5 as the teacher and student models. We represent the strength of the saliency with different colors (red indicates the stronger response).
FIGURE A2. Qualitative comparison of different KD methods on ImageNet: we visualize feature maps by using Grad-CAM. We use ResNet-34 and
ShuffleNetV2x0.5 as the teacher and student models. We represent the strength of the saliency with different colors (red indicates the stronger response).
FIGURE A3. Qualitative comparison of different KD methods on the Rip current dataset. We visualize detection results with red box. For this comparison,
we use Faster R-CNNs with ResNet-101 and ResNet-18 as the teacher and student detectors.
Algorithm A1 Our PyTorch code for implementing the multi-layer feature converter (MLC).

import torch.nn as nn

class MLC(nn.Module):
    def __init__(self, in_chn, in_w, dim_out=16, drop=0):
        super().__init__()
        ## in_chn: the number of input feature map channels
        ## in_w: the width of the input feature map
        ## dim_out: dimension of the output
        ## drop: rate for Dropout
        hidden_chn = in_chn * 2
        hidden_dim = in_chn * in_w * in_w
        self.conv1 = nn.Conv2d(in_chn, hidden_chn, 1)
        self.bn1 = nn.BatchNorm2d(hidden_chn)
        self.act = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(hidden_chn, in_chn, 1)
        self.Linear = nn.Linear(hidden_dim, dim_out)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        B = x.shape[0]  # B: batch size
        x = self.conv1(x)
        x = self.act(x)
        x = self.bn1(x)
        x = self.drop(x)
        x = self.conv2(x)
        x = self.Linear(x.reshape(B, -1))  ## flatten and fully-connected layer
        x = self.drop(x)
        return x
Algorithm A2 Our code for implementing the global correlation C_Global and the self-attention H_Global.

import math
import torch
import torch.nn.functional as F

def Self_Attention(W_Q, W_K, W_V, V_P):
    ## W_Q: weight for the query
    ## W_K: weight for the key
    ## W_V: weight for the value
    ## V_P: input feature
    E = V_P.shape[2]
    ## Matrix multiplication
    Q = F.linear(V_P, W_Q)
    K = F.linear(V_P, W_K)
    V = F.linear(V_P, W_V)
    ## Eq. (5) in the main text
    Q = Q / math.sqrt(E)
    C_Global = torch.bmm(Q, K.transpose(-2, -1))
    C_Global = F.softmax(C_Global, dim=-1)
    ## Eq. (6) in the main text
    H_Global = torch.bmm(C_Global, V)
    return H_Global
Algorithm A3 Our code for implementing the local correlation Λ and the local semantic loss L_Local.

import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange

class Local_Loss(nn.Module):
    def __init__(self):
        super().__init__()
        self.crit = nn.MSELoss(reduction='none')

    def forward(self, f_s, f_t, P_t, P_s):
        ## f_t: list of the feature maps for the projected teacher
        ## f_s: list of the feature maps for the projected student
        ## P_t: decoder output for the teacher
        ## P_s: decoder output for the student

        # Compute the local correlation (Lambda)
        ## Eq. (9) in the main text
        ## Rearrange dimensions: from [B, M, E] to [B, E, M]
        temp_P_t = rearrange(P_t, 'B M E -> B E M')
        Lambda = F.softmax(torch.bmm(P_s, temp_P_t), dim=-1)

        # Compute Eq. (2) in the main text
        B, J, M = Lambda.shape
        ## B: batch size
        ## J: the number of student feature maps
        ## M: the number of teacher feature maps
        loss_i = torch.zeros(B, J, M)
        for j in range(J):
            for m in range(M):
                loss_i[:, j, m] = self.crit(f_s[j][m], f_t[j][m]).reshape(B, -1).mean(-1)
        local_loss = (Lambda * loss_i).sum() / (B * J)
        return local_loss
Algorithm A4 Our code for implementing the global relational loss L_Global.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Global_Loss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, P_s, P_t):
        B = P_s.shape[0]  # batch size
        temp_t = P_t.reshape(B, -1)
        temp_s = P_s.reshape(B, -1)
        self_cor_t = torch.matmul(temp_t, temp_t.t())
        self_cor_s = torch.matmul(temp_s, temp_s.t())
        global_loss = F.mse_loss(self_cor_s, self_cor_t, reduction='mean')
        return global_loss
REFERENCES
[1] S. Hanson and L. Pratt, ‘‘Comparing biases for minimal network construc-
tion with back-propagation,’’ Proc. Adv. Neural Inf. Process. Syst., 1988.
[2] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, ‘‘Quantized convolutional
neural networks for mobile devices,’’ in Proc. IEEE/CVF Conf. Comput.
Vis. Pattern Recognit. (CVPR), 2016, pp. 4820–4828.
[3] G. Hinton, O. Vinyals, and J. Dean, ‘‘Distilling the knowledge in a neural
network,’’ in Proc. Adv. Neural Inf. Process. Syst., 2015.
[4] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio,
‘‘Fitnets: Hints for thin deep nets,’’ in Proc. Int. Conf. Learn. Repr. (ICLR),
2015.
[5] D. Chen, J.-P. Mei, Y. Zhang, C. Wang, Z. Wang, Y. Feng, and C. Chen,
‘‘Cross-layer distillation with semantic calibration,’’ in Proc. AAAI Conf.
Artif. Intell., vol. 35, no. 8, 2021, pp. 7028–7036.
[6] W. Park, D. Kim, Y. Lu, and M. Cho, ‘‘Relational knowledge distillation,’’
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019,
pp. 3967–3976.
[7] L. Liu, Q. Huang, S. Lin, H. Xie, B. Wang, X. Chang, and X. Liang,
‘‘Exploring inter-channel correlation for diversity-preserved knowledge
distillation,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp.
8271–8280.
[8] A. Amirkhani, A. Khosravian, M. Masih-Tehrani, and H. Kashiani, ‘‘Ro-
bust semantic segmentation with multi-teacher knowledge distillation,’’
IEEE Access, vol. 9, pp. 119 049–119 066, 2021.
[9] S. I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and
H. Ghasemzadeh, ‘‘Improved knowledge distillation via teacher assistant,’’
in Proc. AAAI Conf. Artif. Intell., vol. 34, no. 04, 2020, pp. 5191–5198.
[10] W. Son, J. Na, J. Choi, and W. Hwang, ‘‘Densely guided knowledge
distillation using multiple teacher assistants,’’ in Proc. IEEE/CVF Int. Conf.
Comput. Vis. (ICCV), 2021, pp. 9395–9404.
[11] A. Banitalebi-Dehkordi, A. Amirkhani, and A. Mohammadinasab,
‘‘Ebcdet: Energy-based curriculum for robust domain adaptive object de-
tection,’’ IEEE Access, 2023.
[12] S. Kim, ‘‘A virtual knowledge distillation via conditional gan,’’ IEEE
Access, vol. 10, pp. 34 766–34 778, 2022.
[13] D. Chen, H. Tan, L. Lan, X. Zhang, T. Liang, and Z. Luo, ‘‘Frustratingly
easy knowledge distillation via attentive similarity matching,’’ in Proc. Int.
Conf. Pattern Recognit. (ICPR). IEEE, 2022, pp. 2357–2363.
[14] S. Lin, H. Xie, B. Wang, K. Yu, X. Chang, X. Liang, and G. Wang, ‘‘Knowl-
edge distillation via the target-aware transformer,’’ in Proc. IEEE/CVF
Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 10 915–10 924.
[15] D. Chen, J.-P. Mei, H. Zhang, C. Wang, Y. Feng, and C. Chen, ‘‘Knowledge
distillation with the reused teacher classifier,’’ in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 11 933–11 942.
[16] Z. Guo, H. Yan, H. Li, and X. Lin, ‘‘Class attention transfer based knowl-
edge distillation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.
(CVPR), 2023, pp. 11 868–11877.
[17] S.-G. Park and D.-J. Kang, ‘‘Knowledge distillation with feature self
attention,’’ IEEE Access, vol. 11, pp. 34 554–34 562, 2023.
[18] F. Tung and G. Mori, ‘‘Similarity-preserving knowledge distillation,’’ in
Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 1365–1374.
[19] S. Zagoruyko and N. Komodakis, ‘‘Paying more attention to attention:
Improving the performance of convolutional neural networks via attention
transfer,’’ in Proc. Int. Conf. Learn. Repr. (ICLR), 2016.
[20] Y. Lee, N. Ahn, J. H. Heo, S. Y. Jo, and S.-J. Kang, ‘‘Teaching where to
see: Knowledge distillation-based attentive information transfer in vehicle
maker classification,’’ IEEE Access, vol. 7, pp. 86 412–86 420, 2019.
[21] S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai, ‘‘Variational
information distillation for knowledge transfer,’’ in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 9163–9171.
[22] M. Ji, B. Heo, and S. Park, ‘‘Show, attend and distill: Knowledge distillation
via attention-based feature matching,’’ in Proc. AAAI Conf. Artif. Intell.,
2021, pp. 7945–7952.
[23] Q. Tang, Y. Zhang, X. Xu, J. Wang, and Y. Guo, ‘‘Input-dependent dynamical channel association for knowledge distillation,’’ in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2023, pp. 1–5.
[24] J. Yim, D. Joo, J. Bae, and J. Kim, ‘‘A gift from knowledge distillation:
Fast optimization, network minimization and transfer learning,’’ in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 4133–
4141.
[25] H.-T. Li, S.-C. Lin, C.-Y. Chen, and C.-K. Chiang, ‘‘Layer-level knowledge distillation for deep neural network learning,’’ Applied Sciences, vol. 9, no. 10, 2019. [Online]. Available: https://www.mdpi.com/2076-3417/9/10/1966
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017.
[27] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
S. Zagoruyko, ‘‘End-to-end object detection with transformers,’’ in Proc.
Eur. Conf. Comput. Vis. (ECCV). Springer, 2020, pp. 213–229.
[28] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Un-
terthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., ‘‘An
image is worth 16x16 words: Transformers for image recognition at scale,’’
in Proc. Int. Conf. Learn. Repr. (ICLR), 2020.
[29] G. Aguilar, Y. Ling, Y. Zhang, B. Yao, X. Fan, and C. Guo, ‘‘Knowledge
distillation from internal representations,’’ in Proc. AAAI Conf. Artif. Intell.,
2020, pp. 7350–7357.
[30] V. Sampath, I. Maurtua, J. J. A. Martín, A. Iriondo, I. Lluvia, and A. Rivera, ‘‘Vision transformer based knowledge distillation for fasteners defect detection,’’ in Proc. Int. Conf. Electr., Comput. Energy Technol. (ICECET). IEEE, 2022, pp. 1–6.
[31] A. Krizhevsky, G. Hinton et al., ‘‘Learning multiple layers of features from
tiny images,’’ 2009.
[32] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ‘‘ImageNet: A large-scale hierarchical image database,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR). IEEE, 2009, pp. 248–255.
[33] N. Passalis, M. Tzelepi, and A. Tefas, ‘‘Heterogeneous knowledge distilla-
tion using information flow modeling,’’ in Proc. IEEE/CVF Conf. Comput.
Vis. Pattern Recognit. (CVPR), 2020, pp. 2339–2348.
[34] A. de Silva, I. Mori, G. Dusek, J. Davis, and A. Pang, ‘‘Automated rip cur-
rent detection with region based convolutional neural networks,’’ Coastal
Engineering, vol. 166, p. 103859, 2021.
[35] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image
recognition,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.
(CVPR), 2016, pp. 770–778.
[36] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for
large-scale image recognition,’’ in Proc. Int. Conf. Learn. Repr. (ICLR),
2015. [Online]. Available: http://arxiv.org/abs/1409.1556
[37] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, ‘‘ShuffleNet V2: Practical guidelines for efficient CNN architecture design,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018.
[38] S. Zagoruyko and N. Komodakis, ‘‘Wide residual networks,’’ in Proc.
Brit. Mach. Vis. Conf. (BMVC). BMVA Press, 2016, pp. 87.1–87.12.
[Online]. Available: https://dx.doi.org/10.5244/C.30.87
[39] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, ‘‘MobileNetV2: Inverted residuals and linear bottlenecks,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 4510–4520.
[40] C. Shu, Y. Liu, J. Gao, Z. Yan, and C. Shen, ‘‘Channel-wise knowledge distillation for dense prediction,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 5311–5320.
[41] L. Zhang and K. Ma, ‘‘Improve object detection with feature-based knowl-
edge distillation: Towards accurate and efficient detectors,’’ in Proc. Int.
Conf. Learn. Repr. (ICLR), 2020.
[42] W. Cao, Y. Zhang, J. Gao, A. Cheng, K. Cheng, and J. Cheng, ‘‘PKD: General distillation framework for object detectors via Pearson correlation coefficient,’’ Proc. Adv. Neural Inf. Process. Syst., vol. 35, pp. 15394–15406, 2022.
[43] N. Passalis and A. Tefas, ‘‘Learning deep representations with probabilistic
knowledge transfer,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp.
268–284.
[44] Y. Liu, J. Cao, B. Li, C. Yuan, W. Hu, Y. Li, and Y. Duan, ‘‘Knowledge
distillation via instance relationship graph,’’ in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 7096–7104.
[45] B. Peng, X. Jin, J. Liu, D. Li, Y. Wu, Y. Liu, S. Zhou, and Z. Zhang,
‘‘Correlation congruence for knowledge distillation,’’ in Proc. IEEE/CVF
Int. Conf. Comput. Vis. (ICCV), 2019, pp. 5007–5016.
[46] Y. Tian, D. Krishnan, and P. Isola, ‘‘Contrastive representation distillation,’’
in Proc. Int. Conf. Learn. Repr. (ICLR), 2019.
[47] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, ‘‘Grad-CAM: Visual explanations from deep networks via gradient-based localization,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2017, pp. 618–626.
[48] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, ‘‘Deformable DETR: Deformable transformers for end-to-end object detection,’’ in Proc. Int. Conf. Learn. Repr. (ICLR), 2020.
[49] S.-H. Bae, ‘‘Deformable part region learning for object detection,’’ in Proc.
AAAI Conf. Artif. Intell., vol. 36, no. 1, 2022, pp. 95–103.
[50] S.-H. Lee and S.-H. Bae, ‘‘AFI-GAN: Improving feature interpolation of feature pyramid networks via adversarial training for object detection,’’ Pattern Recognition, vol. 138, p. 109365, 2023.
[51] S.-H. Bae, ‘‘Deformable part region learning and feature aggregation tree
representation for object detection,’’ IEEE Transactions on Pattern Analy-
sis and Machine Intelligence, 2023.
[52] S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster R-CNN: Towards real-time object detection with region proposal networks,’’ IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[53] Z. Xia, X. Pan, S. Song, L. E. Li, and G. Huang, ‘‘Vision transformer
with deformable attention,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Recognit. (CVPR), 2022, pp. 4794–4803.
[54] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen,
Z. Lin, N. Gimelshein, L. Antiga et al., ‘‘Pytorch: An imperative style,
high-performance deep learning library,’’ Proc. Adv. Neural Inf. Process.
Syst., 2019.
HUN-BEOM BAK received the BS degree in
Physics from Incheon National University in 2020,
and is currently pursuing the MS degree with the
Department of Electrical and Computer Engineer-
ing at Inha University, Korea. His current research interests include model compression, knowledge distillation, and data-free knowledge distillation.
SEUNG-HWAN BAE (Member, IEEE) received
the BS degree in information and communication
engineering from Chungbuk National University,
in 2009 and the MS and PhD degrees in informa-
tion and communications from the Gwangju Insti-
tute of Science and Technology (GIST), in 2010
and 2015, respectively. He was a senior researcher at the Electronics and Telecommunications Research Institute (ETRI), Korea, from 2015 to 2017. He
was an assistant professor in the Department of
Computer Science and Engineering at Incheon National University, Korea
from 2017 to 2020. He is currently an Associate Professor with the Depart-
ment of Electrical and Computer Engineering at Inha University. His research
interests include object tracking, object detection, generative model learning, continual learning, and on-device machine learning.