Compressing medical deep neural network models for edge devices using knowledge distillation
F. MohiEldeen Alabbasy a,*, A.S. Abohamama a,b, Mohammed F. Alrahmawy a
a Department of Computer Science, Faculty of Computers and Information, Mansoura University, Egypt
b Department of Computer Science, Arab East Colleges, Saudi Arabia
article info
Article history:
Received 29 December 2022
Revised 9 May 2023
Accepted 9 June 2023
Available online 14 June 2023
Keywords:
Knowledge distillation
Deep models
Edge devices
Deep models’ compressing techniques
abstract
Recently, deep neural networks (DNNs) have been used successfully in many fields, particularly in medical diagnosis. However, deep learning (DL) models are expensive in terms of memory and computing resources, which hinders their implementation on limited-resource devices or in delay-sensitive systems. Therefore, these deep models need to be accelerated and compressed to smaller sizes to be deployed on edge devices without noticeably affecting their performance. In this paper, recent acceleration and compression approaches for DNNs are analyzed and compared regarding their performance, applications, benefits, and limitations, with more focus on the knowledge distillation approach as a successful emergent approach in this field. In addition, a framework is proposed to develop knowledge-distilled DNN models that can be deployed on fog/edge devices for automatic disease diagnosis. To evaluate the proposed framework, two compressed medical diagnosis systems are proposed based on knowledge-distilled deep neural models for COVID-19 and Malaria. The experimental results show that these knowledge-distilled models have been compressed to 18.4% and 15% of the original model size and their responses accelerated by 6.14x and 5.86x, respectively, while there was no significant drop in their performance (which dropped by only 0.9% and 1.2%, respectively). Furthermore, the distilled models are compared with other pruned and quantized models. The obtained results reveal the superiority of the distilled models in terms of compression rates and response time.
© 2023 The Authors. Published by Elsevier B.V. on behalf of King Saud University. This is an open access
article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction
The Internet of Things (IoT) and sensing have led to huge transformations in healthcare delivery. Cloud resources can be used to improve the performance of IoT in processing the large amounts of collected data. However, cloud and IoT computing models cannot support the delay-sensitive, lightweight services required by many healthcare systems. The fog computing model is a technology that resides between cloud computing and the IoT. It can achieve cloud-IoT convergence and support delay-sensitive services (Wang et al., 2018). Using fog computing can improve the performance of healthcare systems by deploying AI, machine learning, and deep learning models at the network edge (Deng et al., 2020).
Recently, DNNs have achieved great performance breakthroughs in many fields, including medical diagnosis systems. However, DNNs have many layers, a huge number of parameters, and high computational complexity, which in turn demands the availability of high computing resources such as GPUs. Hence, complex DNNs cannot be used to build delay-sensitive, lightweight applications on fog/edge devices, as these devices usually have limited resources in terms of memory, CPU, bandwidth, and energy (Cheng et al., 2020).
Research on deep neural network compression and acceleration
has received a lot of attention in recent years. Based on their charac-
teristics, the approaches used in recent research are classified into
four categories: parameter pruning and sharing, low-rank factoriza-
tion, transferred/compact convolutional filters, and knowledge dis-
tillation. These techniques allow the deployment of DNNs on
resource-limited devices (e.g., mobile phones) (Cheng et al., 2020).
Knowledge distillation transfers knowledge from a large model to a smaller model without noticeably reducing the model accuracy. In this paper, a smart IoT system for disease diagnosis, which uses distilled DNN models deployed in fog computing environments, is proposed. The proposed system uses algorithms and tools that consider the limited resources of the edge devices. The proposed system provides healthcare as an edge service and
efficiently processes the patient’s data gathered from various IoT
devices. It employs a DL compression and acceleration technique
using knowledge distillation, to reduce the computation and stor-
age cost without significantly affecting the model performance.
Two simple medical models are provided as case studies. The first
one is for diagnosing COVID-19 and the other is for diagnosing
Malaria. The experimental results reveal that the knowledge distillation technique efficiently produces fast, lightweight deep models without significantly decreasing their performance, and that their compression ratios and response times are much better than those of both the pruning and quantization compression approaches.
The remaining sections of the paper are organized as follows.
Section 2 covers several related works. Section 3 overviews the
concepts and terminologies relevant to the proposed work. Sec-
tion 4 presents the proposed system architecture and methodol-
ogy. Then, Section 5 presents the two proposed case studies and
their architectures followed by the conducted experiments and
results analysis. Finally, Section 6 concludes the paper and covers
the challenges and future research directions.
2. Literature review
Deploying AI-based models on edge/fog devices faces many
challenges, particularly for healthcare applications that employ
ML and/or DL models. Edge devices usually have limited resources, which reduce their ability to deploy these models and to transfer and/or store large amounts of data. Therefore, many researchers
work on developing both machine learning and deep learning
models that can work efficiently on the edge. In (Laccetti et al.,
2022), a parallel, adaptive, high-performance approach for improv-
ing the performance of dynamic data clustering based on the K-
means algorithm has been proposed. The proposed work attempts
to use the clustering to enable decision making on the network
edge for delay-sensitive applications. In (Kammoun et al., 2022),
another trust management and edge computing-based clustering
technique, known as CATMEC, has been proposed. The proposed
algorithm attempts to consider the security challenges, the power
consumption, and the limited capacities of nodes. A security model
has been used by developing a one-hop clustering technique based
on energy, density, and trust management. In (Lapegna et al.,
2021), the researchers investigated how to implement a clustering algorithm with low energy consumption.
On the other hand, deep learning model compression is a possible solution for deploying these models on resource-limited edge devices. Deep learning model compression and acceleration techniques include parameter pruning and quantization, transferred/compact convolutional filters, low-rank factorization, and knowledge distillation (Deng et al., 2020).
Several parameter pruning and quantization techniques have
been proposed (Zhang et al., 2019b; Zhang et al., 2018b;
Vanhoucke et al., 2011; Gupta et al., 2015; Courbariaux et al.,
2015; Courbariaux and Bengio, 2016; Rastegari et al., 2016;
Hinton et al., 2015; Han et al., 2015; Srinivas and Babu, 2015;
Ullrich et al., 2017; Chun et al., 1991; Rakhuba and Oseledets,
2015; Moczulski et al., 2016). Vanhoucke et al. used 8-bit quantization to increase the speed of the model with little loss of accuracy (Zhang et al., 2019b). (Zhang et al., 2018b) revealed that stochastic rounding-based CNN training with a 16-bit fixed-point representation significantly decreases the memory usage and floating-point operations with little impact on the classification accuracy. In addition, convolutional neural networks can be
directly trained using binary weights such as BinaryConnect
(Vanhoucke et al., 2011), BinaryNet (Gupta et al., 2015), and XNOR
(Courbariaux et al., 2015). In addition, early research showed that
network pruning is a reliable method for reducing network com-
plexity and over-fitting. Modern CNN models were recently pruned
by Han et al. without losing accuracy (Courbariaux and Bengio,
2016). Redundant neurons can be removed using a method called
data-free pruning which was developed by Srinivas and Babu
(Rastegari et al., 2016). Han et al. proposed a pruning method that
decreases the overall number of network operations and parame-
ters (Hinton et al., 2015). Another deep compression approach
has been proposed in which the unnecessary connections are
removed, the weights are quantized, and the quantized weights
are encoded using Huffman coding (Han et al., 2015). In (Srinivas and Babu, 2015), a simple regularization technique was proposed, which combines quantization and pruning into a single simple (re)training method. In addition, (Ullrich et al., 2017) emphasized the importance of the structured matrices concept. Their proposed approach has been used in many classes of structured matrices
such as block and multi-level Toeplitz-like (Chun et al., 1991)
matrices and matrices connected to multi-dimensional convolu-
tion (Rakhuba and Oseledets, 2015). Based on this concept, a gen-
eral effective linear layer for CNNs was introduced in (Moczulski
et al., 2016).
Similarly, many low-rank approximation and sparsity methods
have been proposed (Rigamonti et al., June 2013; Denton et al.,
2014; Jaderberg et al., 2014). (Rigamonti et al., June 2013)
described how to train separable 1D filters by employing a dic-
tionary learning strategy. In (Denton et al., 2014), several low-
rank clustering and approximation approaches for the convolu-
tional kernels have been proposed for some DNN models. With
only a 1% loss in the accuracy, they can double the speed of a single
convolutional layer. Also, different tensor decomposition strategies
have been introduced in (Jaderberg et al., 2014), which reported a 4.5x speedup with a 1% reduction in accuracy.
In addition, many studies have been conducted in the scope of
transferred/compact convolutional filters (Lebedev et al., 2014;
Zhai et al., 2016; Wu et al., 2016; Shang et al., 1603). (Lebedev
et al., 2014) established the equivariant group theory, which served as the motivation for employing transferable convolutional filters
to compress CNN models. As proposed in (Zhai et al., 2016), a
1x1 convolutional layer can be used to minimize the number of
input channels in the following layer at a reasonably low computa-
tional cost. This reduces the number of filter channels in the layer
below. In (Wu et al., 2016), the proposed work achieved a signifi-
cant acceleration by adopting two 1 1 convolutions rather than
a33 convolution. The SqueezeNet produced a compact NN with
around 50 fewer parameters. According to (Shang et al., 1603), the
network’s pattern representations varied widely in terms of the
magnitudes of the convolutional kernel responses, making it
unsuitable to ignore weaker signals based on a single threshold.
Finally, many techniques have been proposed based on the knowledge distillation approach (Hinton et al., 2015; Romero et al., 2014; Korattikara Balan et al., 2015). A knowledge distillation compression architecture that facilitates the DNN training process is proposed in (Hinton et al., 2015). (Romero et al., 2014) used FitNets to solve the network compression problem. FitNets is a technique for training thin but deep networks (students) from wider and shallower networks (teachers). FitNets forces the student model to replicate the teacher's full feature maps in order to learn from the teacher model's intermediate representations. The research in (Korattikara Balan et al., 2015) developed a student model that approximates a Monte Carlo teacher, where the student model uses deep neural networks with online training.
3. Background
IoT data are generated close to the end user. Processing these
data at the network edge results in fast response times which is
crucial for health applications (Tmamna, 2020). However, deploy-
ing machine and deep learning models on the edge/fog nodes is a
challenge because they have limited resources and computing
power compared to cloud servers. To tackle this problem, several
methods have been used by researchers. This section overviews
the DNN compression and acceleration techniques. These methods
broadly belong to four categories: parameter pruning and quantization, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation.
3.1. Parameter pruning and quantization
The pruning and quantization method discovers the redundant parameters, which have little impact on the model's performance, and removes them. This method can be used effectively in different settings and is capable of achieving good performance. It can handle pre-trained models and be used to train models from scratch. This
method is further broken down into three subcategories: network
pruning,model binarization and quantization and structural matrices
(Shahshahani et al., 2018; Molchanov et al., 2017; Cheng and
Wang, 2020; Szegedy et al., 2016).
A. Network pruning
It is a technique for reducing the storage and computation of NNs by pruning unimportant connections, so that a fully connected network is converted into a sparse one. Weights of smaller magnitude are removed, and the model is then fine-tuned to restore accuracy. Network pruning is an effective method for reducing the network complexity and solving the overfitting issue. Two types of pruning are widely used: weight pruning and neuron pruning. In weight pruning, individual weights in the weight matrix are set to zero. The other way is unit/neuron pruning, in which entire columns of the weight matrix are set to zero, thus removing the relevant output neuron (Shahshahani et al., 2018).
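To make the idea concrete, the following is a minimal NumPy sketch of magnitude-based weight pruning for a single dense weight matrix. The function name, sparsity level, and matrix shape are illustrative assumptions rather than values from the cited works, and in practice the pruned model would then be fine-tuned.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction (sparsity) of the weights."""
    threshold = np.quantile(np.abs(weights), sparsity)  # magnitude cut-off
    mask = np.abs(weights) >= threshold                 # keep only the larger weights
    return weights * mask

# Illustrative example: prune 90% of a dense layer's weights.
w = np.random.randn(512, 256).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.9)
print("non-zero ratio:", np.count_nonzero(w_pruned) / w_pruned.size)  # ~0.1
```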
B. Binarization and quantization
Quantization is an effective method for compressing and accel-
erating the models. It uses fewer bits for storing the weights. A network's size can be reduced by about four times when the weights are stored as 8-bit integer (INT8) values rather than 32-bit floating-point (FP32) values (Molchanov et al., 2017).
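As an illustration, the sketch below applies post-training dynamic-range quantization with the TensorFlow Lite converter, which stores weights in 8 bits. The placeholder model is an assumption; any trained Keras model could be converted instead.

```python
import tensorflow as tf

# Placeholder trained model (assumption); any tf.keras model can be converted.
model = tf.keras.applications.MobileNetV2(weights=None)

# Post-training dynamic-range quantization: weights are stored as INT8,
# shrinking the serialized model roughly 4x compared to FP32.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```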
C. Designing structural matrix
A network layer with input vector x would have an m x n parameter matrix M. The time complexity of storing the parameters and performing the matrix-vector product is O(mn) when M is a general large dense matrix. Therefore, assuming M is a parameterized structural matrix is a logical technique for parameter pruning. A structured matrix is an m x n matrix that can be described by significantly fewer than m x n parameters. Typically, the structural matrix speeds up the inference and training stages through fast matrix-vector multiplication and gradient computations. In addition, it reduces the memory cost. However, the structural constraint typically reduces performance because it may introduce bias into the model (Cheng and Wang, 1710).
3.2. Low-rank factorization and sparsity
Matrix and tensor decomposition are used in low-rank factor-
ization to find redundant DNN parameters. The weight matrix
can be decomposed into smaller matrices, which reduces the storage requirements. Also, convolutional layer factorization accelerates the inference process. It is possible to use low-rank factor-
ization during or after training. Correct factorization and rank
selection are essential for model performance and accuracy. How-
ever, the decomposition process is computationally expensive
(Cheng and Wang, 2020).
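For illustration, a minimal sketch of truncated-SVD factorization of a dense weight matrix is shown below. The matrix shape and rank are arbitrary assumptions chosen only to show the parameter reduction.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Approximate an m x n weight matrix by two factors A (m x r) and B (r x n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb the singular values into the left factor
    B = Vt[:rank, :]
    return A, B

W = np.random.randn(1024, 512).astype(np.float32)
A, B = low_rank_factorize(W, rank=64)
print("parameters before:", W.size, "after:", A.size + B.size)          # ~5.3x fewer
print("relative error:", np.linalg.norm(W - A @ B) / np.linalg.norm(W))  # approximation cost
```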
3.3. Transferred/Compact convolutional filters
By enhancing the network architecture, the total number of
weights and computations can be decreased. A large convolutional
layer is increasingly being replaced with a number of smaller con-
volutional layers that have fewer weights but the same effective receptive field (Szegedy et al., 2016).
3.4. Knowledge distillation (KD)
In our work, we used the knowledge distillation technique.
Knowledge distillation refers to the process of learning a smaller model from a larger one. In knowledge distillation, a teacher model (the larger one) typically supervises the student model (the smaller one). The main principle is that the student should emulate the teacher to achieve a superior performance. As seen in Fig. 1, a knowledge distillation system is made up of three components: knowledge, the teacher-student architecture, and the knowledge transfer method (Gou et al., 2021).
Knowledge distillation complements the other neural network compression techniques. It means transferring the knowledge from the teacher network to the student network through a loss function. The goal of optimization is to align the student network's class-wise probability distribution with the teacher's probability output. The essential concept is to supervise a smaller "student network" using the soft probabilities ("logits") of the "teacher network" in addition to the class labels. These soft probabilities provide more knowledge than the class labels alone and can improve the student network's learning process; the teacher model transfers generalizations rather than raw weights to the student model (Hinton et al., 2015).
According to Hinton et al., the KD formula is stated as follows:
q_i = \frac{\exp(Z_i / T)}{\sum_j \exp(Z_j / T)}    (1)
where q_i is the soft probability value, Z_i is the logit of the teacher model, and T is a temperature factor provided to control the importance of each soft target (T = 1 gives the "hard" output). A higher T gives a softer probability distribution over the classes. The total loss function is the weighted sum of the hard-target and soft-target losses. When T = 1, the cross entropy is minimized with hard targets (student loss). When T > 1, the cross entropy is minimized with soft targets from the teacher (distillation loss). Usually, less weight is given to the student loss than to the distillation loss.
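The following is a small NumPy sketch of the temperature-scaled softmax in Eq. (1); the logit values are made-up numbers used only to show how a higher T softens the distribution.

```python
import numpy as np

def softened_probabilities(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Eq. (1): softmax of the logits divided by the temperature T."""
    z = logits / T
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([8.0, 2.0, 0.5])
print(softened_probabilities(teacher_logits, T=1.0))  # nearly one-hot ("hard")
print(softened_probabilities(teacher_logits, T=5.0))  # much softer distribution
```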
The probability distribution is essentially smoothed down by
employing a ‘‘temperature” scaling function in the Softmax that
softens the logits and reveals the teacher’s inter-class correlations.
When trained directly on the data and labels, a large model gener-
alizes better than a small model. However, a larger model can be
used to train a smaller model to generalize more effectively.
Because it assigns non-zero probabilities to wrong classes as well,
a trained model’s output probabilities provide more information
than what the labels provide. Distillation uses this knowledge to
improve a small model’s training. Usually, the overall training loss
for the student model is described as follows:
L = (1 - \alpha) \, L_{CE}(y, \sigma(Z_s)) + \alpha \, T^2 \, L_{CE}(\sigma(Z_s / T), \sigma(Z_T / T))    (2)
where L_{CE} is the cross-entropy loss, \alpha is a balancing hyper-parameter, y is the one-hot vector of ground-truth labels, \sigma is the softmax function, Z_s and Z_T are the output logits of the student and teacher models, respectively, and T is the temperature hyper-parameter (Hinton et al., 2015).
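Below is a hedged TensorFlow/Keras sketch of the total loss in Eq. (2). It uses the Keras KLDivergence loss for the soft-target term, which differs from the teacher-side cross-entropy only by a constant, and the default values of alpha and T are arbitrary assumptions.

```python
import tensorflow as tf

def kd_total_loss(y_true, student_logits, teacher_logits, alpha=0.9, T=5.0):
    """Weighted sum of the hard-target (student) loss and the soft-target (distillation) loss."""
    ce = tf.keras.losses.CategoricalCrossentropy()
    kld = tf.keras.losses.KLDivergence()

    # Hard-target term: cross entropy with the ground-truth labels at T = 1.
    student_loss = ce(y_true, tf.nn.softmax(student_logits))

    # Soft-target term: computed at temperature T and scaled by T^2 so its
    # gradient magnitude stays comparable to the hard-target term.
    distillation_loss = kld(tf.nn.softmax(teacher_logits / T),
                            tf.nn.softmax(student_logits / T)) * (T ** 2)

    return (1.0 - alpha) * student_loss + alpha * distillation_loss
```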
Fig. 2 shows the knowledge distillation process steps. The stu-
dent model is employed to mimic the generalization ability of a
teacher model. The teacher’s network’s class prediction probabili-
ties are used as ‘‘soft targets” for training the student model. Gen-
erally, the same training set is used for transferring knowledge,
although a different ‘‘transfer set” of data can be used for achieving
the same purpose. Soft targets which have high entropy offer much
more information per training case than hard targets and much
lower gradient variance. As a result, the student model may be
trained using a much higher learning rate than the original teacher
model and much less data (Hinton et al., 2015).
A class prediction probability is produced by the pre-trained
teacher model. The student model produces a class probability dis-
tribution of its own when it is trained on the same data. The probability distribution of the student model is pushed towards the distribution of the teacher model using a distillation loss. Additionally, just like in usual deep model training, the cross-entropy loss is computed using both the true class labels and the hard labels (the samples' predicted classes). The student model is trained using these two losses (Jaiswal and Gajjar, 2017).
3.4.1. Elements of knowledge distillation
Knowledge types, knowledge transfer methods and teacher-
student architectures are the key elements of the student learning process. In the following, we give an overview of each of them.
A. Knowledge Types
Depending on how the knowledge is obtained from the teacher
model, there are three categories of knowledge: Response-based
knowledge, feature-based knowledge, and relation-based knowl-
edge. Fig. 3 provides an illustration of various knowledge types
inside a teacher model (Gou et al., 2021).
1. Response-Based Knowledge
Response-based knowledge refers to the neural response of the
teacher model’s final output layer. Fig. 4 shows the response-based
KD model. The primary goal is to exactly mimic the teacher model's final prediction. It is the most popular type of knowledge.
One of the most important advantages of this type is that the pre-
trained models are a good alternative for creating a deep model
from scratch. The response-based knowledge distillation loss can
be described as:
L_{ResD}(z_t, z_s) = L_R(z_t, z_s)    (3)

Given a vector of logits z as the output of a deep model's final fully connected layer, L_R(\cdot) is the Kullback-Leibler divergence loss of the logits, and z_t and z_s are the teacher and student logits, respectively (Gou et al., 2021).
2. Feature-Based Knowledge
DNNs are efficient for learning several levels of feature repre-
sentation. As a result, feature maps, which are the output of both
the final and intermediate layers, can be utilized as the information
to supervise the student model training. A trained teacher model also captures knowledge in its intermediate layers, which is particularly true for deep networks. Fig. 5 displays a general feature-based knowledge
distillation model. The objective is to teach the student model
how to activate features similarly to the teacher model. The distil-
lation loss function performs this by reducing the difference
Fig. 1. The teacher–student architecture in the KD technique (Gou et al., 2021).
Fig. 2. Knowledge Distillation process.
between the feature activation of the teacher and student models.
The feature-based knowledge distillation loss can be formulated as
follows:
L_{FeaD}(f_t(x), f_s(x)) = L_F(\Phi_t(f_t(x)), \Phi_s(f_s(x)))    (4)

where f_t(x) and f_s(x) are the intermediate-layer feature maps of the teacher and student models, respectively. When the feature maps of the teacher and student models are not of the same shape, the transformation functions \Phi_t(f_t(x)) and \Phi_s(f_s(x)) are usually applied. L_F(\cdot) denotes the similarity function used to match the feature maps of the student and teacher models (Gou et al., 2021).
3. Relation-Based Knowledge
The relationships between several layers or data samples are
further explored through relationship-based knowledge. This
knowledge can be used for training the student model. Fig. 6 shows
the relation-based knowledge in distillation model. The outputs of
certain layers in the teacher model are used by response-based
knowledge as well as feature-based knowledge. A correlation
between graphs, feature maps, probabilistic distributions based
on feature representations or similarity matrix can be used to
model this relationship (Gou et al., 2021). The relation-based
knowledge distillation loss based on the instance relations is stated as:
L_{RelD}(F_t, F_s) = L_R(\psi_t(t_i, t_j), \psi_s(s_i, s_j))    (5)

where F_t and F_s are the sets of feature representations of the teacher and student models, respectively, (t_i, t_j) \in F_t, (s_i, s_j) \in F_s, and \psi_t(\cdot) and \psi_s(\cdot) are the similarity functions of (t_i, t_j) and (s_i, s_j). The function L_R(\cdot) correlates the feature representations of the teacher and the student.
B. Teacher–Student Architecture
In knowledge distillation, the teacher model supervises the stu-
dent model. The complicated teacher model and the simpler stu-
dent model typically have different model capacities. By
optimizing transfer of knowledge using effective student–teacher
structures, this structural gap can be removed. The teacher and
student model structures can relate to each other as shown in
Fig. 7. Typically, the student network can be (1) a simplified version of the original teacher network with fewer channels in each layer and fewer layers overall; (2) a smaller network with effective basic
operations; (3) a smaller network with improved global network
structure; (4) the same network as teacher; or (5) a teacher net-
work that has been quantized while preserving the network’s
structure (Gou et al., 2021).
C. Knowledge transfer methods
As shown in Fig. 8, the knowledge distillation methods are clas-
sified into three different categories: offline distillation,online distil-
lation, and self-distillation. This classification depends on whether
the teacher and student models are updated simultaneously or not.
1. Offline Distillation
It is the most popular knowledge distillation transfer method in
which the teacher network is pre-trained and then frozen. Then the
knowledge from the teacher model is distilled to train the student
model. The teacher model is not updated while the student net-
work is being trained. The most important advantage of this trans-
fer method is that it is straightforward and simple to use (Jaiswal
and Gajjar, 2017).
2. Online Distillation
In many cases, a pre-trained teacher model may not be accessi-
ble for offline distillation. To overcome this issue, the online distil-
lation is used to further improve the student model’s performance.
Both the teacher and student models are updated simultaneously
in online distillation, and the whole knowledge distillation frame-
works are trained end-to-end (Mirzadeh et al., 2020).
3. Self-Distillation
In the self-distillation scheme, the teacher and student models use
the same network (Zhao et al., 2020b). This can be considered as a
special case of online distillation. A DNN’s shallow layers can be
trained using knowledge from its deeper layers. The teacher mod-
el’s later epochs can use knowledge from its earlier epochs
for training the student model.
As each distillation technique has its own advantages, they can
be joined to complement each other like human learning. For
instance, the multiple knowledge transfer architecture effectively
integrates both self- and online distillation.
3.4.2. Distillation algorithms
This section reviews the various algorithms that can be used to
train the student models for acquiring knowledge from the teacher
models.
Fig. 3. The illustrations of the three types of the knowledge (Gou et al., 2021).
Fig. 4. Response-based knowledge distillation.
Fig. 5. Feature-based knowledge distillation.
A. Adversarial Distillation
The concept of adversarial learning is borrowed from generative
adversarial network to enable the student and teacher models to
learn a better representation of the true data distribution. The
adversarial learning technique can be used to train a generator
model to obtain synthetic training data or to supplement the initial
training dataset in order to achieve the goal of learning the true
data distribution. A discriminator model is used in a second adver-
sarial learning-based distillation technique to distinguish the samples of the student and teacher models using either feature maps or logits. This technique helps the student mimic the teacher accurately.
Online distillation is the center of the third adversarial learning-
based distillation technique, where both the teacher and the stu-
dent models are optimized together (Gou et al., 2021; Mirzadeh
et al., 2020).
B. Multi-Teacher Distillation.
As presented in Fig. 9, a student model learns from many tea-
cher models in the multi-teacher distillation method. Different
types of knowledge can be provided by different teacher models,
which helps in building a student model of better prediction capa-
bilities. Feature representations and logits are the basis for the
knowledge that is typically transferred from teachers (Gou et al.,
2021).
C. Cross-Model Distillation
In some cases, data is available in multiple modalities. However,
sometimes the labels or data from various modalities could be
incorrect, damaged, or useless. Thus, knowledge transfer between
modalities is essential. Applications like image captioning and
visual question answering benefit from cross-modal distillation.
Fig. 10 shows the cross-model distillation training scheme (Gou
et al., 2021).
D. Graph-based distillation
Sometimes a student’s understanding of intra-data relation-
ships cannot be improved by simply transferring individual knowl-
edge from the teacher’s network to them. For exploring this
paradigm, recent techniques propose using graphs as teacher
knowledge carriers or to control the message passing of the tea-
cher’s knowledge. In this distillation algorithm, a self-supervised
teacher is represented by each vertex of the graph, and they may
be based on feature-based or response-based knowledge, such
as feature maps and logits, respectively (Chen et al., 2021).
E. Attention-based Distillation
Human visual experience includes attention, which is impor-
tant and strongly related to perception. NN architectures for NLP
and computer vision have used attention methods. In order to
improve student model learning, such attention techniques have
Fig. 6. The relation-based knowledge distillation.
Fig. 7. Models of the teacher-student (Gou et al., 2021).
Fig. 8. Knowledge transfer methods.
Fig. 9. Multi-teacher distillation (Gou et al., 2021).
been applied in knowledge distillation. It is based on knowledge
transferring from feature embeddings using attention maps
(Srinivas and Fleuret, 2018).
F. Data-Free Distillation
A distilled student model performs at its best when it has access to the training data that was used by the pre-trained teacher network. However, due to privacy issues or the volume of the training data, this might not always be possible. This is particularly true in medical applications, where it is not possible to share the patient data used to train the teacher model for training the student model. So, this distillation algorithm relies on synthetic data when the training dataset is unavailable for confidentiality, security, or privacy reasons. The pre-trained teacher model's fea-
ture representations are typically used to produce the synthetic
data (Do et al., 2019).
G. Quantized Distillation
As stated earlier, making DNNs models compatible with edge
devices requires quantizing the weights and activations before
they are deployed using a specific bit-width. Quantized KD algo-
rithm is used to transfer knowledge from a high-precision teacher
model (e.g., 32-bit FP) to a low-precision distilled student network
(e.g., 8-bit) (Kim et al., 2019a).
H. Lifelong Distillation
After viewing a few images from new categories, human vision can recognize them. It can detect both explicit visual data about new things and external visual data derived from prior
experience. A network is trained using similar concepts through
lifelong learning algorithm. It is based on meta-learning, lifelong
learning, and continuous learning mechanisms, where previously
acquired knowledge is gathered and transferred to future learning
(Chen and Liu, 2018).
I. Neural Architecture Search-based Distillation.
It is used for identifying suitable student model architectures
which enhance learning from the teacher models (Kang et al.,
2020).
3.4.3. Distillation applications
Knowledge distillation has been successfully applied in various
artificial intelligence fields such as speech recognition, visual
recognition, recommendation systems and natural language pro-
cessing (NLP).
There are several applications using knowledge distillation
technique in computer vision. Recent computer vision models are
increasingly based on DNNs, which can be deployed with the help
of model compression. In fields of face recognition (Wu et al.,
2020), image classification (Zhu et al., 2019), action identification
(Cui et al., 2020), image/video segmentation (Hou et al., 2020),
and object detection (Chawla et al., 2021); person re-
identification (Wu et al., 2019a), shadow detection (Chen et al.,
2020c) and pedestrian detection (Shen et al., 2016); knowledge
distillation has been successfully used. Recent knowledge
distillation-based face recognition techniques focus on effective
deployment as well as competitive recognition accuracy (Zhang
et al., 2020b).
Conventional language models, like BERT, have sophisticated,
time- and resource-consuming frameworks. Hence, KD approach
is well researched in NLP field to obtain lightweight, efficient,
and effective language models. In addition, KD is used in many
NLP applications such as text generation (Chen et al., 2020b), ques-
tion answering system (Yang et al., 2020c), event detection (Liu
et al., 2019b), document retrieval (Shakeri et al., 2019), text recog-
nition (Wang and Du, 2021), neural machine translation (NMT)
(Zhou et al., 2019a), etc.
Additionally, there are several KD applications for developing
lightweight speech recognition deep models such as spoken lan-
guage identification (Kwon et al., 2020; Shen et al., 2019c), audio
classification (Perez et al., 2020), text-independent speaker recog-
nition (Ng et al., 2018), speech enhancement (Watanabe et al.,
2017), etc. Also, it is noticed that the teacher–student architectures
are widely used for improving the efficiency and accuracy of acous-
tic models (Tuli et al., 2020).
3.4.4. Pros and cons of knowledge distillation
The knowledge distillation technique has many advantages, including:
1. Training lightweight and efficient models instead of complex
models supports the deployment of deep networks on mobile
devices or embedded sensor nodes while conserving computa-
tional resources.
2. The student model can have a competitive performance com-
pared to the teacher model without the need to be trained using
the same training dataset.
3. If the teacher model has already been trained, knowledge distil-
lation enables student model’s training with less training data.
On the other hand, the knowledge distillation technique suffers from some problems, which are listed below:
1. Modelling various knowledge types in a unified and complementary framework is challenging. For instance, the training
of the student model may be influenced differently by the
knowledge from various layers.
2. In knowledge distillation, development of an effective teacher
model or building an efficient student model are still
challenging tasks.
3. There is still a lack of knowledge about knowledge distillation,
including theoretical justifications and empirical assessments
(Gou et al., 2021).
4. Proposed methodology
This section presents the proposed system architecture which
analyses and identifies the diseases. It delivers various healthcare
services to fulfill user requirements for managing the data of
patients efficiently. It integrates edge computing devices with
embedded deep learning models to diagnose diseases automati-
cally. Fog computing is a promising computing paradigm because
it is located near end users where the data are generated. Many
researchers addressed the DL implementation on edge devices
Fig. 10. Cross-model distillation (Gou et al., 2021).
rather than cloud nodes to optimize bandwidth consumption,
reduce latency, and accelerate real-time decision making. How-
ever, there are many challenges in deploying deep learning models on edge devices, which have limited resources compared to cloud
nodes (Cheng et al., 2020). To tackle these problems, the proposed
work employs the knowledge distillation technique to reduce the
computation and storage requirements of the used deep learning
models. Hence, complex deep learning networks could be embed-
ded on the network edge to provide fast decisions which is crucial
for delay-sensitive healthcare applications. The general structure
of the proposed framework is presented in Fig. 11. This framework
is adapted from the Aneka framework (Hassan et al., 2022) which
uses the FogBus as a software platform for developing integrated
Fog-Cloud environments. It connects different IoT devices with
gateway devices to send tasks and data to fog worker nodes and
facilitates the development of distributed applications over clouds.
It offers APIs to developers so they can use virtual resources in the
cloud (Hassan et al., 2022).
The inputs of the proposed framework are medical images such
as X-ray, CT scan images, and cell images. These medical images
are remotely received by smart edge devices, such as smart
phones, which act as communication link between healthcare pro-
fessionals and patients. Collected information is saved in a data-
base in a cloud server. The components of the proposed
architecture and the framework of the compressed DL model
deployed within it are described below.
4.1. Components of proposed architecture
The proposed framework integrates various components
including:
1. Gateway Devices: These devices include tablets, laptops, or
mobile phones, which serve as fog devices. They receive the
patients’ images and transfer them to worker nodes or brokers
for more analysis.
2. FogBus Module: This module consists of FogBusBroker and
FogBusWorker.
-FogBusBroker: The basic component in this broker is Resource
Manager. It receives the input job from gateway devices. The
Resource Manager is divided into modules: workload man-
ager and arbitration module. Workload manager analyzes
amounts of data on each worker node. The Arbitration mod-
ule distribute tasks between various devices to balance load
and achieve the best performance.
-FogBusWorker: Each worker node consists of Single Board
Computers (SBC) such as Raspberry Pies and embedded
devices. It contains Resource monitor and deep leaning model.
Worker nodes execute the tasks assigned by the resource
manager of the broker node. The Deep Learning Module
includes the proposed compressed DL models which process
the received data and make fast decisions.
3. Cloud Data Center: In addition to saving the collected data, the
cloud data center processes the received data when the
received amounts of data are beyond the fog nodes capacity.
The proposed work focuses on developing compressed deep learning models which can be deployed on worker nodes without significantly affecting the models' performance. Two compressed deep learning models are introduced in this study. The first model is a DL model
based on a COVID-19 dataset which detects the diseases from chest
X-ray (CXR) images in a fast and reliable manner. The second
model is another deep learning model based on a Malaria dataset.
Both models use the knowledge distillation approach which trans-
fers knowledge from the original (teacher) model to the student
model to improve the performance of the student network. The
proposed compressed deep learning model development frame-
work and steps of knowledge distillation technique are described
below.
4.2. Proposed compressed deep learning model development
framework
A dataset, which is divided into infected and normal patients, is
provided as the input to the proposed framework as shown in
Fig. 12. The proposed framework has a preprocessing module in
Fig. 11. The general structure of the proposed architecture (Hassan et al., 2022).
which several operations are performed such as normalization,
segmentation, and data augmentation. By applying the KD com-
pression technique, the resulting deep model becomes more suit-
able to be deployed in the resource-limited edge devices such as
smart mobiles, laptops, smart watches, etc. The model is evaluated
by measuring the loss, accuracy, and confusion matrix.
4.3. Steps of building the compressed learning model
The steps of knowledge distillation technique are shown in
Fig. 13. First, the original (teacher) model is built and trained on
the training set in the usual way. Then, its performance is evalu-
ated. Then, the student model is built, and the distiller is initialized
to distill the teacher knowledge to the student. Then, the distilled
student model is trained, and its performance is evaluated. If the
student model has a good performance, it is deployed on the
worker node, otherwise, the student model architecture is chan-
ged, and the process of distillation is repeated until we reach an
acceptable student model.
Response-based knowledge distillation has been used in the
proposed model because of its efficiency and simplicity as men-
tioned in Section 3. The distillation loss for response-based knowl-
edge is presented using Eq. (3).
The proposed models are classification models with mutually exclusive classes, where each input belongs to only one class. The last layer of the model produces a logit value for each class. Logits are the raw predictions of the model. Logits are converted to class probabilities using the softmax activation function, as shown in Fig. 14. The softmax is chosen as the activation function in the proposed models to provide a probability distribution as the output. The exponential makes negative reals positive, and the normalization gives the required distribution. The softmax function computes a high value for the maximum logit and pushes the other probabilities towards zero. To inspect the relative magnitudes, it is better to loosen this effect and work with a softer probability distribution. With a softer distribution, each training input provides more knowledge, and the gradient does not fluctuate across different inputs (Hinton et al., 2015).
Temperature parameter T is introduced in softmax computation
to adjust the smoothness of output distribution (Jaiswal and Gajjar,
2017) as shown in Eq. (1). Raising temperature makes logits smal-
ler and smoother (i.e., close to each other). High temperature is
only used during distillation. T is set to 1 for inference. In student
training, the loss function L of the student deep model is the weighted average of two terms, L_1 and L_2, as defined in Table 1:

L = w_1 L_1 + w_2 L_2, \quad \text{where } w_1 + w_2 = 1    (6)
The student model has to generalize in the same way as the pre-
trained teacher model using soft targets. The detailed steps of dis-
tilling the original teacher model into the student model are
presented in Fig. 15. The same training set is used for training the pre-trained teacher and the student models. The teacher and the student produce logits from the dataset inputs, and the logits of both models are fed to softmax functions with the same high temperature T > 1. Raising the temperature makes the logits smaller and smoother (called soft targets). The teacher's knowledge, including the generalization learned from translated images, is transferred to the student using these soft targets. The soft targets and the student's soft predictions at the same high T are used to calculate the first loss term (L1). The correct labels (hard targets) and the student's predictions (with T = 1) are used to calculate the second loss term (L2).
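As an illustration of these steps, the following is a minimal TensorFlow/Keras sketch of an offline distiller implementing Eq. (6); the temperature and the weights w1 and w2 are assumed values, and the teacher is a pre-trained, frozen Keras model. It is a sketch under these assumptions, not the exact implementation used in the experiments.

```python
import tensorflow as tf

class Distiller(tf.keras.Model):
    """Offline distillation: a frozen teacher supervises a trainable student."""

    def __init__(self, teacher, student, T=5.0, w1=0.9, w2=0.1):
        super().__init__()
        self.teacher, self.student = teacher, student
        self.T, self.w1, self.w2 = T, w1, w2
        self.kld = tf.keras.losses.KLDivergence()
        self.ce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

    def train_step(self, data):
        x, y = data
        teacher_logits = self.teacher(x, training=False)      # teacher stays frozen
        with tf.GradientTape() as tape:
            student_logits = self.student(x, training=True)
            # L1: soft targets vs. soft predictions, both at the same high T.
            l1 = self.kld(tf.nn.softmax(teacher_logits / self.T),
                          tf.nn.softmax(student_logits / self.T)) * (self.T ** 2)
            # L2: hard targets vs. student predictions at T = 1.
            l2 = self.ce(y, student_logits)
            loss = self.w1 * l1 + self.w2 * l2                 # Eq. (6)
        grads = tape.gradient(loss, self.student.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.student.trainable_variables))
        return {"loss": loss}
```

In practice, such a distiller would be compiled with one of the optimizers listed in Table 4 and fit on the same training set used for the teacher; if the evaluated student falls short, the architecture or temperature is adjusted and distillation repeated, as described above.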
Fig. 12. The proposed compressed deep learning model development framework.
Fig. 13. Steps of learning models compression.
The distillation loss (L_{ResD}) for soft logits, which was formulated earlier in Eq. (3), can be rewritten as:

L_{ResD}(p(z_t, T), p(z_s, T)) = L_R(p(z_t, T), p(z_s, T))    (7)
It is obvious that optimizing Eq. (3) or Eq. (7) can make the logits z_s of the student match the logits z_t of the teacher. Fig. 15 shows the knowledge distillation process, which describes both the distillation and student losses. Note that the cross-entropy loss L_{CE}(y, p(z_s, T = 1)) between the soft logits of the student model and the ground-truth labels is usually used to define the student loss. From another perspective, the effectiveness of soft targets can be compared to that of label regularizers or label smoothing (Gou et al., 2021).
5. Proposed deep learning models and results analysis
5.1. Experimental deep learning models
Two medical models are developed, both using the knowledge distillation technique. The first model is a deep model based on a
COVID-19 dataset while the second model is a deep model based
on a Malaria dataset. It is expected that the knowledge distillation
technique can be efficiently used to accelerate and compress the
deep model without significantly decreasing its performance. Dif-
ferent optimization algorithms such as Adam, SGD and RMSprop
can be used to optimize the teacher and student models.
Python has been used to implement the two models because of its broad library support, which makes it highly effective for deep learning tasks. Anaconda Navigator, Jupyter Notebook, and Google Colab were used to manage the large datasets and to train the models online while taking advantage of a personal GPU for dataset preprocessing. In addition, they were used to store all information so that it could be obtained from any GPU using GitHub. Python 3.6 is used for the implementation with the TensorFlow-Keras environment. Experiments are conducted with a Core i7 processor and 8 GB of RAM.
5.1.1. Deep model based on COVID-19 dataset
This DL model is developed for detecting COVID-19. A CXR image dataset is used to train the teacher model. The steps of
Fig. 14. Inputs and outputs of the softmax activation function.
Table 1
The loss function terms.
 | The First Loss Term (L1) | The Second Loss Term (L2)
Definition | Cross entropy between the softmax outputs of the teacher and student models | Cross entropy between the correct labels and the softmax output of the student model
Model | Teacher and student models are used to compute the first term | Only the student model is used to compute the second term
T value | The same high T > 1 is used for the softmax computation of both the teacher and student models | For the second term, T is set to 1 for the softmax computation of the student model
Fig. 15. The knowledge distillation process.
developing the model for detecting COVID-19 are shown in Fig. 16 and explained next.
A. Dataset Description
The used dataset is obtained from the Kaggle website. It includes CXR images which are classified into two categories: COVID-19
patients and normal patients. It consists of 2623 images which
are divided into training, validation, and testing sets with a ratio
of 60:20:20, respectively. CXR samples for infected and normal
patients are shown in Fig. 17.
The dataset images initially have various heights and widths.
Therefore, all the images are normalized to have the same height
and width. In addition, data augmentation technique is used to
enlarge the size of the training data. Several operations were used for data augmentation, such as flips, random rotations, shears, and shifts.
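A hedged Keras sketch of such a preprocessing and augmentation pipeline is shown below; the directory path, batch size, and augmentation ranges are illustrative assumptions rather than the exact settings used in the experiments.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Normalization plus the augmentation operations mentioned above.
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,        # scale pixel values to [0, 1]
    rotation_range=15,        # random rotation
    shear_range=0.1,          # shear
    width_shift_range=0.1,    # horizontal shift
    height_shift_range=0.1,   # vertical shift
    horizontal_flip=True,     # flips
)

# Images are resized to a common 224 x 224 input size while loading.
train_data = train_gen.flow_from_directory(
    "covid_cxr/train",        # hypothetical directory layout: one subfolder per class
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
)
```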
B. Model Architecture
A DL teacher model has been used for feature extraction. An
overview of the model architecture is shown in Fig. 18. It has
two Conv2D layers, two MaxPooling2D layers, one flatten layer, three fully connected layers, and the rectified linear unit activation function. The softmax activation function is applied to the outputs of the last layer (logits). The size of the CXR images is fixed to 224 x 224. The model is trained to differentiate between COVID-19-infected and normal images. The values of the model's parameters are shown in Table 2.
First, a teacher model is built and evaluated. Then, an initial stu-
dent model is built, and the knowledge distillation process is ini-
tialized to distill the teacher knowledge to the student model.
The model is initialized with assumed values for a selected T value.
Finally, the distilled student model is trained and evaluated. If the
evaluation shows a good performance of the student model, the
model is deployed on the worker node, otherwise, the student
model architecture is modified by changing the T value. Then,
the process of distillation is performed on the modified student
model. The whole process is repeated until an acceptable student
model is obtained. Fig. 19 shows the student model architecture
of the DL model based on a COVID-19 dataset.
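For illustration, the following Keras sketch builds a teacher CNN consistent with the layer sequence described above and in Table 2; the dropout rates and the ReLU activations on the hidden dense layers are assumptions where the table is ambiguous, so this is a sketch rather than the exact model used in the experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_teacher(input_shape=(224, 224, 3), num_classes=2):
    """Teacher CNN following the layer sequence of Table 2 (details assumed)."""
    return tf.keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), strides=1, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), strides=2),
        layers.Dropout(0.25),                      # rate assumed
        layers.Conv2D(64, (3, 3), strides=1, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), strides=2),
        layers.Dropout(0.25),                      # rate assumed
        layers.Flatten(),
        layers.Dense(120, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes),                 # logits; softmax applied separately
    ])

teacher = build_teacher()
teacher.summary()
```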
5.1.2. Deep model based on Malaria dataset
This deep learning model is developed for identifying the visual
features of malaria lesions. As shown in Fig. 20, the microscopic blood images are used as the DL model input. Several operations
are performed in the data preprocessing stage including image size
normalization, data augmentation, and dataset partitioning. The
proposed model is compressed using the knowledge distillation
technique to be deployable on the resource-limited edge devices.
The model is trained to differentiate between parasitized and nor-
mal patients.
A. Dataset Description
The used dataset is obtained from the Kaggle website. It consists of 27,558 microscopic blood images which are divided into para-
sitized and uninfected patients. The dataset is divided into training
and testing with percentages of 75% and 25%, respectively. Fig. 21
shows random samples of Malaria images.
In the preprocessing stage, the size of the images is fixed to 224 x 224. In addition, data augmentation techniques are used to increase the size of the dataset and overcome the overfitting problem. The used data augmentation techniques include rotations, shifts, shears, and flips.
B. Model Architecture
A DL teacher model has been used for feature extraction. The
model has nine Conv2D layers, three MaxPooling 2D layers, three
fully connected layers, two dense layers, one flattened layer and
a rectified linear unit activation function. The softmax activation
function is applied on the outputs of the last layer (logits). The
knowledge distillation technique is used to compress the model
Fig. 16. The steps of the proposed DL model for diagnosing Covid-19.
Fig. 17. (a) X-ray image of an infected patient (b) X-ray image of a normal patient
(Divyansh et al., 2020).
by distilling knowledge from the teacher model into the student
network. It enables the network to learn different solutions to
the target problem.
The framework is presented in Fig. 22. The objective of the proposed model is to correctly classify the microscopic blood images as parasitized or normal patients. The values of the model's param-
eters are shown in Table 3.
After building and evaluating the teacher model, an initial stu-
dent is built, and the knowledge distillation process is initialized to
distill the teacher knowledge to the student model. Finally, the stu-
dent is trained and evaluated. If the student model has a good per-
formance, the model is deployed on the worker node, otherwise,
the student model architecture is modified, and the process of dis-
tillation is repeated until an acceptable student model is obtained.
Fig. 18. The architecture of deep learning model based on a COVID-19 dataset (the teacher model).
Table 2
The main parameter values of the proposed DL model based on a COVID-19 dataset.
The Layer Name | Filter size and Strides | Output
Input Layer | - | (224, 224, 3)
ConvLayer | Filter size = (3,3), Strides = 1, Padding = same | (224, 224, 32)
MaxPooling | Pool size = (2,2), Padding = valid, Strides = 2 | (112, 112, 32)
Dropout Layer | Units = 64, Activation = Linear | (112, 112, 64)
ConvLayer | Filter size = (3,3), Strides = 1, Padding = same | (112, 112, 64)
MaxPooling | Pool size = (2,2), Padding = valid, Strides = 2 | (56, 56, 64)
Dropout Layer | Units = 64, Activation = Linear | (56, 56, 64)
Flatten | - | (100352)
Fully Connected Layer | Activation = Linear, Units = 64 | (120)
Fully Connected Layer | Activation = Linear, Units = 64 | (64)
Fully Connected Layer | Activation = Linear, Units = 2 | (2)
Output | - | (2)
Fig. 19. The architecture of distilled DL model based on a COVID-19 dataset (the
student model).
Fig. 20. The steps of the proposed deep CNN model for diagnosing Malaria.
Fig. 23 shows the student model architecture of the DL model
based on a Malaria dataset.
5.2. Evaluations metrics
The evaluation process in the proposed study is divided into
three main tasks:
Evaluating the training process of the proposed model.
Evaluating the performance of both the teacher and student
model in disease detection.
Evaluating the efficiency of the compression processes in terms
of size and speed.
The following subsections describe the used metrics in each
evaluation task.
5.2.1. KD models training evaluation metrics
Three different optimizers, namely Adam, SGD, and RMSprop, are used to evaluate the training process. These optimizers are used
to evaluate the training process of both the teacher and student
models and to measure the loss and the accuracy along the training
process (Huang and Ling, 2005).
5.2.2. Models performance evaluation metrics
It is important to evaluate the performance of the teacher
model so that it can be compared later with the performance of the compressed models that result from the compression processes. The following metrics, which depend on the confusion matrix parameters (True Positive (TP), True Negative (TN), False Negative (FN), and False Positive (FP)), are used in this evaluation task (Divyansh et al., 2020).
A. Accuracy:
It is the ratio of correct predictions out of all predictions made
by the DL model. It is defined by Eq. (8).
Accuracy = \frac{TN + TP}{TP + FP + TN + FN}    (8)
Accuracy can show how many input cases are correctly classi-
fied as infected, but it has a limited scope. It cannot evaluate the
efficiency of the model in identifying the non-infected cases.
Fig. 21. Random samples of Malaria (Tang, et al., 2021).
Fig. 22. The architecture of CNN model based on a Malaria dataset (the teacher model).
Table 3
Summary of the proposed CNN model based on Malaria dataset.
The Layer Name Filter size and Strides Output
Input Layer – (224, 224, 3)
ConvLayer Filter size=(2,2), Strides = 1, Padding = same (224, 224, 128)
ConvLayer Filter size=(2,2), Strides = 1, Padding = same (224, 224, 128)
MaxPooling Pool size=(2,2), Strides = 2, Padding = valid (112, 112, 128)
ConvLayer Filter size=(2,2), Strides = 1, Padding = same (112, 112, 64)
ConvLayer Filter size=(2,2), Strides = 1, Padding = same (112, 112, 64)
MaxPooling Pool size=(2,2), Strides = 2, Padding = valid (56, 56, 64)
ConvLayer Filter size=(2,2), Strides = 1, Padding = same (56, 56, 32)
ConvLayer Filter size=(2,2), Strides = 1, Padding = same (56, 56, 32)
ConvLayer Filter size=(2,2), Strides = 1, Padding = same (56, 56, 32)
ConvLayer Filter size=(2,2), Strides = 1, Padding = same (56, 56, 32)
MaxPooling Pool size=(2,2), Strides = 2, Padding = valid (56, 56, 32)
ConvLayer Filter size=(2,2), Strides = 1, Padding = same (56, 56, 40)
Flatten ___ (100352)
FC Layer Units = 2048, Activation = Linear (2048)
FC Layer Units = 1024, Activation = Linear (1024)
Dropout Units = 1024, Activation = Linear (1024)
FC Layer Units = 2, Activation = Linear (2)
Output ___ (2)
B. Recall:
It gives the percentage of the correctly identified infected cases
(TP) to all the infected cases. It is defined by Eq. (9).
Recall = TP / (TP + FN)    (9)
A high recall value means that the model is efficient in identifying the infected cases. However, it does not give an indication of the ability to identify the non-infected cases correctly.
C. Precision:
It is the ratio of correctly identified infected cases (TP) over all
the cases identified by the model as infected (TP + FP). It is also
known as positive predictive value (PPV). It is defined by Eq. (10).
Precision = TP / (TP + FP)    (10)
A high precision value means that the cases the model identifies as infected are mostly correct. However, it does not indicate how many of the actual infected cases the model misses.
D. F1-Score:
Neither recall nor precision can fully evaluate the model, as they are often in conflict: improving one of them can come at the cost of decreasing the other. F1-score is a metric that combines both recall and precision. It is defined by Eq. (11).
F1-score = 2TP / (2TP + FP + FN)    (11)
E. Specificity:
It measures how efficient the model is in identifying the non-infected cases (TN). It is defined by Eq. (12).
Specificity = TN / (FP + TN)    (12)
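For clarity, the five metrics of Eqs. (8)-(12) can be computed directly from the confusion matrix counts. The sketch below assumes binary labels (0 = non-infected, 1 = infected) and uses scikit-learn only to obtain TP, TN, FP, and FN; the helper name and library choice are illustrative and not part of the paper's implementation.

```python
# Sketch: computing Eqs. (8)-(12) from a binary confusion matrix.
# Labels are assumed to be 0 (non-infected) and 1 (infected); the helper name is illustrative.
from sklearn.metrics import confusion_matrix

def diagnosis_metrics(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy    = (tn + tp) / (tp + fp + tn + fn)      # Eq. (8)
    recall      = tp / (tp + fn)                       # Eq. (9)
    precision   = tp / (tp + fp)                       # Eq. (10)
    f1_score    = 2 * tp / (2 * tp + fp + fn)          # Eq. (11)
    specificity = tn / (fp + tn)                       # Eq. (12)
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "f1_score": f1_score, "specificity": specificity}
```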
5.2.3. Size and acceleration evaluation metrics
The purpose of the compression process is to generate a compressed model that has a smaller size and a faster response compared to the original teacher model. To evaluate these requirements, the following metrics are used:
Number of Parameters: These are the values learned during the training process; in the case of DL models, they are the weights and biases. The smaller this metric is for the compressed model compared to the original teacher, the better the compression process, as it directly affects the compression ratio.
Model Size: This metric directly reflects how efficient the compression process is in producing the compressed model.
Inference Time: This metric represents the response time needed by the model to classify the input image. The smaller this value is for the compressed model compared to the teacher, the better the compression process (a measurement sketch for these three metrics is given below).
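To make these three metrics concrete, the following is a minimal measurement sketch for a trained Keras model; the saved file name, the warm-up call, and the 100-run averaging loop are illustrative assumptions rather than the measurement protocol used in the experiments.

```python
# Sketch: collecting the three compression metrics for a trained Keras model.
# The file name, warm-up call, and 100-run averaging are illustrative assumptions.
import os
import time
import numpy as np

def compression_metrics(model, sample_image, path="model.h5", runs=100):
    num_params = model.count_params()                     # learned weights and biases
    model.save(path)
    size_kb = os.path.getsize(path) / 1024.0              # model size on disk (KB)
    batch = np.expand_dims(sample_image, axis=0).astype(np.float32)
    model.predict(batch, verbose=0)                       # warm-up before timing
    start = time.perf_counter()
    for _ in range(runs):
        model.predict(batch, verbose=0)
    inference_time = (time.perf_counter() - start) / runs  # seconds per image
    return num_params, size_kb, inference_time
```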
5.3. Result analysis
Several experiments have been conducted to evaluate the per-
formance of the proposed DL models using the aforementioned
metrics. This section shows the obtained results and their analysis.
Fig. 23. The architecture of Deep CNN model based on a Malaria dataset (the
student model).
Fig. 24. The temperature effect on the DL model based on the COVID-19 dataset.
Table 4
Settings of the different optimizers.
Optimizer Algorithm Settings
Adam learning_rate = 0.001, AMSGrad = false, epsilon = 1e-07, β1 = 0.9, β2 = 0.999
SGD learning_rate = 0.01, Decay = 0, Nesterov = false, Momentum = 0.0, Schedule decay = 0.004
RMSprop learning_rate = 0.001, epsilon = 1e-07, Rho = 0.9, centered = False
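For reference, the settings of Table 4 can be expressed with the standard tf.keras optimizer constructors as sketched below; the SGD schedule-decay entry has no direct argument in the standard tf.keras SGD constructor and is omitted, so the snippet is an approximation of the table rather than the exact training code.

```python
# Sketch: the Table 4 settings expressed with the standard tf.keras optimizer constructors.
# The SGD "schedule decay" entry has no direct tf.keras.optimizers.SGD argument and is omitted.
import tensorflow as tf

adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999,
                                epsilon=1e-07, amsgrad=False)
sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9,
                                      epsilon=1e-07, centered=False)
```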
5.3.1. Evaluating the deep model based on COVID-19 dataset
The steps of developing the architectures of both the teacher and student models are described in Section 5.1.1. Based on the COVID-19 dataset, several values have been tried for the temperature T, and as shown in Fig. 24, the best student model is obtained when T = 5. In the following subsections, the training process, teacher model, student model, and knowledge distillation process are evaluated with respect to the suitable metrics.
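To make the role of the temperature T concrete, the following is a minimal sketch of a response-based distillation loss in the spirit of Hinton et al. (2015); the weighting factor alpha between the soft (teacher) and hard (ground-truth) terms is an illustrative assumption, since only T is reported here.

```python
# Sketch: response-based knowledge distillation loss with temperature T.
# The alpha weighting between hard and soft terms is an illustrative assumption.
import tensorflow as tf

def distillation_loss(y_true, student_logits, teacher_logits, T=5.0, alpha=0.1):
    # Soften both distributions with the temperature (Hinton et al., 2015).
    soft_teacher = tf.nn.softmax(teacher_logits / T)
    soft_student = tf.nn.softmax(student_logits / T)
    # Scaling by T^2 keeps the soft-target gradients comparable across temperatures.
    soft_loss = tf.keras.losses.kl_divergence(soft_teacher, soft_student) * (T ** 2)
    # Standard cross-entropy against the ground-truth (hard) labels.
    hard_loss = tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True)
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```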
A. Evaluating the Training of the DL Model Based on COVID-19
Dataset
To train the teacher model, dropout regularization was applied,
and noise was added to the training images by up to two pixels in
each direction. On the other hand, in the student model training,
neither regularization nor noise addition were used.
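If the added noise is interpreted as small random spatial shifts of up to two pixels, such an augmentation could be expressed with a standard Keras generator as sketched below; this is an assumption about how the perturbation was applied, not the authors' exact preprocessing code.

```python
# Sketch: random shifts of up to two pixels in each direction for the teacher's training images.
# An integer shift range in ImageDataGenerator is interpreted as a number of pixels.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

teacher_augmenter = ImageDataGenerator(width_shift_range=2,   # up to +/- 2 pixels horizontally
                                       height_shift_range=2,  # up to +/- 2 pixels vertically
                                       fill_mode="nearest")
# The student model is trained on the unmodified images, i.e. without this generator.
```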
Different optimization algorithms have been used in training the proposed models, namely Adam, SGD, and RMSprop. The settings of the used optimizers are shown in Table 4. Figs. 25-27 show the average values of accuracy and loss for these optimizer algorithms before and after applying the knowledge distillation technique to the proposed model. As shown in the accuracy plots of the different models, the training accuracy increases gradually. Also, it can be observed from the loss plots that both the training loss and testing loss curves decrease gradually. The loss and accuracy of the distilled model are also presented.
Fig. 25. (a) Teacher (the original) model loss; (b) Teacher (the original) model average accuracy; (c) Small (student) model loss; (d) Small (student) model average accuracy (Adam).
Fig. 26. (a) Teacher (the original) model loss; (b) Teacher (the original) model average accuracy; (c) Small (student) model loss; (d) Small (student) model average accuracy (SGD).
Fig. 27. (a) Teacher (the original) model loss; (b) Teacher (the original) model average accuracy; (c) Small (student) model loss; (d) Small (student) model average accuracy (RMSprop).
Fig. 28. Teacher-Student accuracy based on the COVID-19 dataset.
For Adam optimizer, the best performance of the proposed model is achieved after 100 epochs with 99.63% accuracy and 2.01% loss, as presented in Fig. 25-a and Fig. 25-b. After applying the KD algorithm to the model, based on Fig. 25-c and Fig. 25-d, the accuracy and loss become 98.1% and 1.8%, respectively.
For SGD optimizer, the best performance of the proposed model
is also obtained after 100 epochs with 99.2% accuracy and 4.02%
loss, as presented in Fig. 26-a and Fig. 26-b. After applying the
KD algorithm on the model, based on Fig. 26-c and Fig. 26-d, the
values of accuracy and loss are 95.6% and 5.08%, respectively.
Finally, for RMSprop optimizer, the best performance of the
model is also obtained after 100 epochs with 99.8% accuracy and
5.06% loss, as presented in Fig. 27-a and Fig. 27-b. After applying
the KD algorithm on the model, based on Fig. 27-c and Fig. 27-d,
the values of accuracy and loss are 98.9% and 6.08%, respectively.
It is clear from these results that the RMSprop optimizer achieves the best accuracy and loss results with learning rate = 0.001 and Rho factor = 0.9.
Fig. 28 presents the teacher-student model accuracy with the different optimization algorithms (Adam, SGD, and RMSprop, in that order).
The obtained experimental results reveal that the knowledge dis-
tillation approach can be efficiently used to accelerate and com-
press the model without significantly decreasing the model’s
performance.
B. Evaluating the Performance of Teacher and Student Models
of COVID-19 dataset
In this section, three pairs of teacher and student models are evaluated using the different optimization algorithms (Adam, SGD, and RMSprop) in terms of recall, specificity, precision, accuracy, and F-score. The objective of the evaluation process is to assess the efficiency of the proposed model in correctly identifying COVID-19 infected and normal patients. Table 5 shows the metric values for both the teacher and student models. The obtained results confirm that the knowledge distillation technique can be efficiently used to accelerate and compress the model without significantly decreasing the model's performance.
Regarding the accuracy, as presented in Table 5, it is noticed
that the best accuracy is obtained using the RMSprop optimizer
with 99.84% accuracy for the teacher model and 98.93% accuracy
for the student model. This means that the KD process causes an
accuracy drop of 0.91%. On the other hand, the accuracy drops using the Adam and SGD optimizers are 1.1% and 3.64%, respectively. Hence, the RMSprop optimizer is the best choice with respect to the accuracy metric, as it has the best obtained accuracy and the least accuracy drop.
In addition, based on Table 5, it is noticed that the recall and
precision of the models that employ the SGD optimizer are better
than the others. However, F1-score is a more accurate metric, as it combines both the recall and precision metrics. Regarding F1-score, it is noticed that the models that employ the RMSprop optimizer are better than the others. Also, it is noticed that the F1-score of the RMSprop-based student model decreases by 2.4%. Similarly, the F1-scores of the student models based on the Adam and SGD optimizers decrease by 0.45% and 3.57%, respectively. It is clear that the RMSprop optimizer has the best F1-score.
Regarding specificity, it is noticed that the SGD optimizer is the best with 98.32% specificity. However, it has a specificity drop of 1.8%. On the other hand, the student model based on the RMSprop optimizer has a specificity gain of 0.14%. Based on the obtained results, it is clear that the RMSprop optimizer is the best one regarding the different metrics except specificity, where the RMSprop optimizer has a competitive but not the best value.
C. Evaluating the KD approach vs Quantization and Pruning
Using COVID-19 dataset
In this section, the efficiency of the knowledge distillation tech-
nique is evaluated against other compression techniques. To con-
duct the evaluation process, deep neural models have been
developed using pruning and quantization compression
approaches mentioned in Section 3. To develop these models, the
same original teacher model presented earlier has been used.
For the pruning approach, the weight pruning method is used, and for the quantization approach, the model precision is reduced from 32-bit floating point to 8-bit integer. Table 6 presents a summary of the four models' architectures: the original teacher model, the distilled student model, the quantized model, and the pruned model.
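For context, the two baselines can be produced with standard tooling as sketched below, using TensorFlow Model Optimization for weight (magnitude) pruning and TFLite post-training quantization for the Int 8-bit model; the placeholder teacher network, the calibration generator, and the sparsity schedule values are illustrative assumptions, not the exact configuration used in the experiments.

```python
# Sketch: producing the pruning and quantization baselines from a (placeholder) teacher model.
# The network below, the calibration generator, and the sparsity schedule are illustrative only.
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

teacher_model = tf.keras.Sequential([          # stand-in for the trained teacher CNN
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])

# Weight (magnitude) pruning: low-magnitude weights are progressively zeroed out,
# followed in practice by fine-tuning with the UpdatePruningStep callback.
schedule = tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.2, final_sparsity=0.8,
                                                begin_step=0, end_step=1000)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(teacher_model,
                                                        pruning_schedule=schedule)

# Post-training quantization: FP 32-bit weights/activations mapped to Int 8-bit.
def representative_data_gen():
    for _ in range(10):                        # calibration samples (training images in practice)
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(teacher_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
quantized_tflite_model = converter.convert()
```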
Based on Table 6, it is noticed that the distilled student model has the least number of parameters, which is less than the number of parameters in the teacher model by 96% and less than the pruned model by 3.6%, while quantization does not change the number of parameters, as it just uses fewer bits for the weights.
Table 5
Performance of teacher and student models using the different optimizers for COVID-19 dataset.
Optimizer Algorithm Architecture Type The Evaluation Metrics
Recall Specificity Precision ACC F-score
Adam Teacher 0.9517 0.9442 0.9447 0.9961 0.9743
Student 0.9508 0.9232 0.9232 0.9812 0.9699
SGD Teacher 0.9824 0.9832 0.9886 0.9924 0.9617
Student 0.9611 0.9655 0.9832 0.9562 0.9274
RMSprop Teacher 0.9733 0.9612 0.9612 0.9984 0.9971
Student 0.9726 0.9626 0.9622 0.9893 0.9732
Table 6
Summary of the architectures of different models using the COVID-19 dataset.
Factors Teacher Model Student Model Quantized model Pruned model
Model Architecture CNN consists of two Conv2D layers and three FC layers; DNN consists of two FC layers; the same original (teacher) model; CNN consists of an unpruned Conv1 layer and pruned Conv2, FC1, FC2 and FC3 layers.
Number of Parameters 44,426 1777 44,426 1843
Weight bits FP 32-bit FP 32-bit Int 8-bit FP 32-bit
In addition, the different models are compared in terms of accuracy, F1-score, model size (KB), and inference time (seconds) using the COVID-19 dataset, as shown in Table 7.
As shown in Table 7, the size of the distilled student model is the least among the models: it is 18.4% of the teacher size, and it is less than the quantized model size and the pruned model size by 26.1% and 7.7%, respectively. This indicates the significant storage saving offered by knowledge distillation, as it saves about 81.6% of the teacher storage. Also, the distilled student model is 6.14 times faster than the teacher model, and it is faster than both the quantized and pruned models by factors of 4.1x and 2.9x, respectively.
Regarding the accuracy of the compressed models, the accuracy of the distilled model dropped by only 0.9% from that of the teacher model, while there was a drop of 0.06% for the quantized model and a drop of 0.8% for the pruned model. Finally, with respect to the F1-score metric, the distilled student model scored the lowest F1-score, but with only a drop of 1.5% from the teacher model, while there was a drop of only 0.7% in the quantized model and 0.94% in the pruned model.
The above results clearly show the high efficiency of the distilled student model: its accuracy and performance are very close to those of the teacher model and of the other compressed models produced by quantization and pruning, while it is significantly lighter in size and has fewer layers and smaller hidden layer sizes. In addition, its response time is much faster than that of the other compressed models, which makes the distilled student model more suitable for deployment on fog nodes, which usually have limited resources, and capable of producing the real-time responses that are crucial in delay-sensitive applications like our proposed medical applications.
5.3.2. Evaluating the deep model based on Malaria dataset
Based on the Malaria dataset, several values have been tried for the temperature T, and as shown in Fig. 29, the best student model is obtained when T = 7. In the following subsections, the training process, teacher model, student model, and knowledge distillation process are evaluated with respect to the suitable metrics.
A. Evaluating the Training of the Deep Model Based on Malaria Dataset
To train the teacher model, dropout regularization was applied,
and noise was added to the training images by up to two pixels in
each direction. On the other hand, in the student model training,
neither regularization nor noise addition were used.
Different optimization algorithms have been used in training the proposed models, namely Adam, SGD, and RMSprop. The settings of the used optimizers are shown in Table 4. Figs. 30-32 show the average values of accuracy and loss for these optimizer algorithms before and after applying the knowledge distillation technique to the proposed model. As shown in the accuracy plots of the different models, the training accuracy increases gradually. Also, it can be observed from the loss plots that both the training loss and testing loss curves decrease gradually. The loss and accuracy of the distilled model are also presented.
For Adam optimizer, the best performance of the proposed model is achieved after 100 epochs with 99.88% accuracy and 3.2% loss, as presented in Fig. 30-a and Fig. 30-b. After applying the KD algorithm to the model, based on Fig. 30-c and Fig. 30-d, the accuracy and loss become 98.72% and 3.36%, respectively.
For SGD optimizer, the best performance of the proposed model
is also obtained after 100 epochs with 99.45% accuracy and 4.86%
loss, as presented in Fig. 31-a and Fig. 31-b. After applying the
KD algorithm on the model, based on Fig. 31-c and Fig. 31-d, the
values of accuracy and loss are 98.25% and 5.06% respectively.
Finally, for RMSprop optimizer, the best performance of the
model is also obtained after 100 epochs with 96.82% accuracy
and 15.7% loss, as presented in Fig. 32-a and Fig. 32-b. After apply-
ing the KD algorithm on the model, based on Fig. 32-c and Fig. 32-
d, the values of accuracy and loss are 96.29% and 19.05%,
respectively.
Table 7
Comparing different models using the COVID-19 Dataset.
The compression method Acc F1-Score Size (KB) Inference time (Sec.)
The Original model 0.9984 0.9982 1331.2 0.0010578
Quantization (Int 8-bit) 0.9978 0.9906 332.8 0.0007052
Pruning 0.9903 0.9888 266.24 0.0004942
Knowledge distillation 0.9893 0.9832 245.76 0.0001722
Fig. 29. The temperature effect on the DL model based on the Malaria dataset.
It is clear from these results that the Adam optimizer achieves
the best accuracy and loss results.
Fig. 33 presents the teacher-student model accuracy with the different optimization algorithms (Adam, SGD, and RMSprop, in that order).
The obtained experimental results reveal that the knowledge dis-
tillation technique can be efficiently used to accelerate and com-
press the model without significantly decreasing the model’s
performance.
B. Evaluating the performance of Teacher and Student Models
of Malaria dataset
In this section, pairs of the teacher and student models are evaluated using the different optimization algorithms (Adam, SGD, and RMSprop) in terms of recall, specificity, precision, accuracy, and F-score. The objective of the evaluation process is to assess the efficiency of the proposed model in correctly identifying infected and non-infected patients. Table 8 shows the metric values for both the teacher and student models. The obtained results confirm that
Fig. 30. (a) Teacher (the original) model loss; (b) Teacher (the original) model average accuracy; (c) Small (student) model loss; (d) Small (student) model average accuracy (Adam).
Fig. 31. (a) Teacher (the original) model loss; (b) Teacher (the original) model average accuracy; (c) Small (student) model loss; (d) Small (student) model average accuracy (SGD).
Fig. 32. (a) Teacher (the original) model loss; (b) Teacher (the original) model average accuracy; (c) Small (student) model loss; (d) Small (student) model average accuracy (RMSprop).
Fig. 33. Teacher-Student accuracy based on the Malaria dataset.
the knowledge distillation technique can be efficiently used to
accelerate and compress the model without significantly decreas-
ing the model’s performance.
Regarding the accuracy, as shown in Table 8, it is noticed that
the best accuracy is obtained using the Adam optimizer with
99.88% accuracy for the teacher model and 98.72% accuracy for
the student model. This means that the KD process causes an accuracy drop of 1.2%. On the other hand, the accuracy drops using the SGD and RMSprop optimizers are 1.2% and 4.04%, respectively. Hence, the Adam optimizer is the best choice with respect to the accuracy metric, as it has the best obtained accuracy and the least accuracy drop.
In addition, based on Table 8, it is noticed that the recall and
precision of the models that employ the Adam optimizer are better
than the others. However, F1-score is a more accurate metric, as it combines both the recall and precision metrics. Regarding F1-score, it is noticed that the models that employ the Adam optimizer are better than the others. Also, it is noticed that the F1-score of the Adam-based student model decreases by 0.5%. Similarly, the F1-scores of the student models based on the SGD and RMSprop optimizers decrease by 0.39% and 0.3%, respectively. Hence, it is clear that the Adam optimizer has the best F1-score.
Regarding specificity, it is noticed that the SGD optimizer is the best with 98.82% specificity. However, it has a specificity drop of 0.2%. Similarly, the student models based on the Adam and RMSprop optimizers have specificity drops of 0.1% and 0.06%, respectively. Based on the obtained results, it is clear that the Adam optimizer is the best one regarding the different metrics except specificity, where the Adam optimizer has a competitive but not the best value.
C. Evaluating the KD approach vs Quantization and Pruning
Using Malaria dataset
In this section, the proposed compressed models are evaluated against a pruned model, a quantized model, and the original teacher model using the Malaria dataset. Table 9 summarizes the architectures of the different models. It is noticed that the distilled student model has the least number of parameters, which is less than the number of parameters in the teacher model by 97.9% and less than the pruned model by 64%, while the quantized model has the same number of parameters as the original teacher model.
Also, Table 10 compares the different models in terms of accu-
racy, F1-score, size, and inference time (seconds) using the Malaria
dataset.
From Table 10, the size of the distilled student model is the least
among the other models as its size is 15% of the teacher size. Also,
the size of the distilled student model is less than the quantized
model size and pruned model size by 41.5% and 21.4%, respec-
tively. Hence, the saving in required storage that can be achieved
by the knowledge distillation technique is about 85.02%. This
makes the student model easy to deploy on the storage-limited
fog nodes. Regarding the inference time, the distilled student
model is 5.86 times faster than the teacher model and it is faster
than both the quantized and pruned models by factors of 3.76x and 4.69x, respectively. This helps to produce systems with faster
response which is a crucial requirement for delay-sensitive
systems.
In addition, based on Table 10, the accuracy of the distilled model dropped by 1.2% from that of the teacher model, while there was a drop of 0.39% for the quantized model and a drop of 0.5% for the pruned model. Regarding the F1-score measure, the distilled student model scored the lowest F1-score, with a drop of 0.9% from the teacher model, while there was a drop of only 0.11% in the quantized model and 0.22% in the pruned model.
Based on the conducted experiments and the obtained results, it
is clear that the knowledge distillation technique is an effective
Table 8
Performance of teacher and student models using the different optimizers for Malaria dataset.
Optimizer Algorithm Architecture Type The Evaluation Metrics
Recall Specificity Precision ACC F-score
Adam Teacher 0.9861 0.9842 0.9982 0.9988 0.9903
Student 0.9714 0.9792 0.9862 0.9872 0.9853
SGD Teacher 0.9812 0.9882 0.9789 0.9945 0.9804
Student 0.9706 0.9855 0.9744 0.9825 0.9765
RMSprop Teacher 0.9656 0.9618 0.9624 0.9682 0.9626
Student 0.9653 0.9612 0.9598 0.9629 0.9597
Table 9
Summary of the architectures of different models using the Malaria dataset.
Factors Teacher Model Student Model Quantized model Pruned model
Model Architecture CNN consists of nine Conv2D layers and three FC layers; DNN consists of two FC layers; the same original (teacher) model; CNN consists of an unpruned Conv1 layer and 8 pruned Conv layers, FC1, FC2 and FC3 layers.
Number of Parameters 1,606,144 33,223 1,606,144 94,553
Weight bits FP 32-bit FP 32-bit Int 8-bit FP 32-bit
Table 10
Comparing different models using the Malaria Dataset.
The compression method Acc F1-Score Size (KB) Inference time (Sec.)
The Original model 0.9988 0.9903 1043 0.0020739
Quantization (Int 8-bit) 0.9923 0.9892 266.75 0.0013286
Pruning 0.9912 0.9881 198.7 0.00165912
Knowledge distillation 0.9872 0.9853 156.15 0.00035355
technique to generate lighter deep learning models with much fas-
ter response compared to other compression techniques. This facil-
itates the deployment of the distilled student model on the
limited-resources fog nodes. Hence, the knowledge distillation
technique can be widely adopted in delay-sensitive applications
such as healthcare applications.
6. Conclusion and future work
IoT-based medical diagnosis services should be lightweight services with fast response times. Deep learning has achieved significant success in developing medical diagnosis systems. However, deep learning models require large storage and high computing power, which makes it difficult to deploy them on edge/fog devices. Several compression and acceleration approaches for DNNs have been proposed recently. These approaches are classified into four categories: parameter pruning and quantization, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation.
Knowledge distillation was adopted in the proposed work for
compressing the DNN-based medical diagnosis models. Knowledge
distillation refers to the process of transferring knowledge from a
teacher model to a student model. A knowledge distillation system
consists of knowledge, teacher-student architecture, and a knowl-
edge transfer method. There are three types of knowledge, namely response-based, feature-based, and relation-based knowledge. In the proposed work, the response-based knowledge is used as it is the most popular type. In a KD system, the original model is called the teacher model, while the compressed model to which the teacher knowledge is transferred is called the student model. The knowledge transfer can be offline, online, a self-distillation of the teacher model itself, or a mix of these. Offline distillation is commonly used for existing pretrained models, so it is suitable for existing large medical diagnosis DNN models to produce lighter models suitable for edge/fog devices.
Two case studies have been presented for building compressed DNN diagnosing models using KD. Experiments have clearly shown the high efficiency of the KD approach, as it results in compressed student models whose accuracy is very close to that of the teacher model, while they are much smaller and much faster than the original teacher model. In addition, these distilled DNN models are much smaller and faster than other compressed models produced by the pruning and quantization approaches, with very close performance to them.
In the future, hybrid compression methods can be applied to
generate lighter DNN models under different scenarios. One sce-
nario is to apply the pruning method on the student model pro-
duced by KD. Then, the quantization is applied on the resulting
model. Another scenario is to prune the teacher model first, then
distill its knowledge to a student model and finally this student
model is quantized. Also, the effect of KD can be analyzed with
other learning schemes such as reinforcement learning and adver-
sarial learning.
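As an illustration of the first hybrid scenario (pruning the distilled student and then quantizing the result), a rough sketch is given below; the stand-in student network, the constant 50% sparsity, and the calibration generator are assumptions used only to show how the two steps could be chained.

```python
# Sketch: the prune-then-quantize hybrid scenario applied to a (placeholder) distilled student.
# The stand-in network, the constant 50% sparsity, and the calibration generator are illustrative.
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

student_model = tf.keras.Sequential([          # stand-in for the KD student
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])

# Step 1: magnitude-prune the student; in practice the wrapped model is fine-tuned
# with the UpdatePruningStep callback before the pruning wrappers are stripped.
wrapped = tfmot.sparsity.keras.prune_low_magnitude(
    student_model,
    pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(0.5, begin_step=0))
pruned_student = tfmot.sparsity.keras.strip_pruning(wrapped)

# Step 2: post-training Int 8-bit quantization of the pruned student.
def calibration_gen():
    for _ in range(10):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(pruned_student)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = calibration_gen
hybrid_tflite_model = converter.convert()
```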
Declaration of Competing Interest
The authors declare that they have no known competing finan-
cial interests or personal relationships that could have appeared
to influence the work reported in this paper.
Acknowledgments
The authors would like to thank the Academy of Scientific
Research and Technology for its support.
References
Chawla, A., Yin, H., Molchanov, P., Alvarez, J., 2021. Data-Free Knowledge
Distillation for Object Detection. In: WACV.
Chen, Y. C., Gan, Z., Cheng, Y., Liu, J., Liu, J., 2020b. Distilling knowledge learned in
BERT for text generation. In: ACL.
Chen, Z., Zhu, L., Wan, L., Wang, S., Feng, W., Heng, P.A., 2020. A multi-task mean
teacher for semi-supervised shadow detection. In: CVPR.
Chen, Z., Liu, B., 2018. Lifelong machine learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning 12 (3), 1–207.
Chen, H., Wang, Y., Xu, C., Xu, C., Tao, D., 2021. Learning student networks via
feature embedding. IEEE TNNLS 32 (1), 25–35.
Cheng, Y., Wang, D., Zhou, P., 2020. A Survey of Model Compression and
Acceleration for Deep Neural Networks. In: IEEE.
Cheng, Y., Wang, D., Zhou, P., Zhang, T., 2020. A Survey of Model Compression and
Acceleration for Deep Neural Networks. arXiv:1710.09282v9.
Chun, J., Kailath, T., 1991. Generalized Displacement Structure for Block-Toeplitz, Toeplitz-block, and Toeplitz-derived Matrices. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 215–236.
Courbariaux, M., Bengio, Y., 2016. Binarynet: Training deep neural networks with
weights and activations constrained to +1 or -1. CoRR vol. abs/1602.02830.
Courbariaux, M., Bengio, Y., David, J.P., 2015. Binaryconnect: Training deep neural
networks with binary weights during propagations. In: Advances in Neural
Information Processing Systems 28: Annual Conference on Neural Information
Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp.
3123–3131.
Cui, Z., Song, T., Wang, Y., Ji, Q., 2020. Knowledge augmented deep neural networks
for joint facial expression and action unit recognition. In: NeurIPS.
Deng, S., Zhao, H., Fang, W., Yin, J., Dustdar, S., Zomaya, A.Y., 2020. Edge intelligence:
the confluence of edge computing and artificial intelligence. IEEE Internet of
Things J. 2-s2.0-85089947867.
Denton, E.L., Zaremba, W., Bruna, J., LeCun, Y., Fergus, R., 2014. Exploiting linear
structure within convolutional networks for efficient evaluation. In:
Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q.
(Eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 1269–
1277.
Do, T., Do, T.T., Tran, H., Tjiputra, E., Tran, Q.D., 2019. Compact trilinear interaction
for visual question answering. In: ICCV.
Gou, J., Yu, B., Maybank, S.J., Tao, D., 2021. Knowledge distillation: a survey. Int. J. Computer Vis.
Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P., 2015. Deep learning with
limited numerical precision. In: Proceedings of the 32Nd International
Conference on International Conference on Machine Learning - Volume 37,
ser. ICML’15, pp. 1737–1746.
Han, S., Pool, J., Tran, J., Dally, W., 2015. Learning both weights and connections for
efficient neural networks. In: Proceedings of the 28th International Conference
on Neural Information Processing Systems, ser. NIPS’15.
Hassan, E., Shams, M.Y., Hikal, N.A. and Elmougy, S., 2022. Detecting COVID-19 in
Chest CT Images Based on Several Pre-Trained Models. Research Square
Platform LLC.
Hinton, G., Vinyals, O. Dean, J., 2015. Distilling the knowledge in a neural network.
arXiv preprint arXiv:1503.02531.
Hinton, G.E., Vinyals, O., Dean, J., 2015. Distilling the knowledge in a neural
network. CoRR vol. abs/1503.02531.
Hou, Y., Ma, Z., Liu, C., Hui, T.W., Loy, C.C., 2020. Inter-Region Affinity Distillation for Road Marking Segmentation. In: CVPR.
Huang, J., Ling, C.X., 2005. Using AUC and accuracy in evaluating learning
algorithms. IEEE Trans. Knowledge Data Eng. 17 (3), 299–310.
Jaderberg, M., Vedaldi, A., Zisserman, A., 2014. Speeding up convolutional neural
networks with low rank expansions. In: Proceedings of the British Machine
Vision Conference. BMVA Press.
Jaiswal, B., Gajjar, N., 2017. Deep neural network compression via knowledge
distillation for embedded applications. In: Nirma University International
Conference on Engineering (NUiCONE).
Kammoun, N., Abassi, R., Guemara, S., 2022. Toward a high-performance clustering
algorithm for securing edge computing environments, 2022, Scopus eid: 2-s2.0-
85135762490.
Kang, M., Mun, J., Han, B., 2020. Towards oracle knowledge distillation with neural
architecture search. In: AAAI.
Kim, J., Bhalgat, Y., Lee, J., Patel, C., Kwak, N., 2019a. QKD: Quantization-aware
Knowledge Distillation. arXiv preprint arXiv:1911.12491.
Korattikara Balan, A., Rathod, V., Murphy, K.P., Welling, M., 2015. Bayesian dark
knowledge. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M.B., Garnett, R.
(Eds.) Advances in Neural Information Processing Systems 28, pp. 3420–3428.
Kwon, K., Na, H., Lee, H., Kim, N.S., 2020. Adaptive knowledge distillation based on
entropy. In: ICASSP.
Laccetti, G., Lapegna, M., Romano, D., 2022. A hybrid clustering algorithm for high-performance edge computing devices. In: IEEE, 2022. Scopus eid: 2-s2.0-85142819340.
Lapegna, M., Balzano, W., Meyer, N., Romano, D., 2021. Clustering algorithms on
low-power and high-performance devices for edge computing environments.
Sensors 21, 5395.
Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I.V., Lempitsky, V.S., 2014. Speeding-
up convolutional neural networks using fine-tuned cp-decomposition. CoRR
vol. abs/1412.6553.
Liu, J., Chen, Y., Liu, K., 2019b. Exploiting the ground-truth: An adversarial imitation
based knowledge distillation approach for event detection. In: AAAI.
Mirzadeh, S. I., Farajtabar, M., Li, A., Ghasemzadeh, H., 2020. Improved knowledge
distillation via teacher assistant. In: AAAI.
Moczulski, M., Denil, M., Appleyard, J., de Freitas, N., 2016. Acdc: A structured
efficient linear layer. In: International Conference on Learning Representations
(ICLR).
Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J., 2017. Pruning Convolutional Neural Networks for Resource Efficient Inference. arXiv:1611.06440v2 [cs.LG].
Ng, R.W., Liu, X., Swietojanski, P., 2018. Teacher-student training for text-
independent speaker recognition. In: SLTW.
Perez, A., Sanguineti, V., Morerio, P., Murino, V., 2020. Audio-visual model
distillation using acoustic images. In: WACV.
Rakhuba, M.V., Oseledets, I.V., 2015. Fast multidimensional convolution in low-rank tensor formats via cross approximation. SIAM J. Sci. Computing 37 (2).
Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A., 2016. Xnor-net: Imagenet
classification using binary convolutional neural networks. In: ECCV.
Rigamonti, R., Sironi, A., Lepetit, V., Fua, P., 2013. Learning separable filters. In: 2013
IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR,
USA, June 23-28, 2013, pp. 2754– 2761.
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y., 2014. Fitnets:
Hints for thin deep nets. CoRR vol. abs/1412.6550.
Shah, D., Kawale, K., Shah, M., Randive, S., Mapari, R., 2020. Malaria Parasite Detection Using Deep Learning. IEEE.
Shahshahani, M., Goswami, P., Bhatia, D., 2018. Memory Optimization Techniques
for FPGA based CNN Implementations. In: 2018 IEEE 13th Dallas Circuits and
Systems Conference (DCAS).
Shakeri, S., Sethy, A., Cheng, C., 2019. Knowledge distillation in document retrieval.
arXiv preprint arXiv:1911.11065.
Shang, W., Sohn, K., Almeida, D., Lee, H., 2016. Understanding and improving
convolutional neural networks via concatenated rectified linear units, arXiv
preprint arXiv:1603.05201.
Shen, J., Vesdapunt, N., Boddeti, V.N., Kitani, K.M., 2016. In teacher we trust:
Learning compressed models for pedestrian detection. arXiv preprint
arXiv:1612.00478.
Shen, P., Lu, X., Li, S., Kawai, H., 2019. Interactive learning of teacher-student model
for short utterance spoken language identification. In: ICASSP.
Srinivas, S., Babu, R.V., 2015. Data-free parameter pruning for deep neural networks. In: Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, pp. 31.1–31.12.
Srinivas, S., Fleuret, F., 2018. Knowledge transfer with Jacobian matching. In: ICML.
Szegedy, C., Ioffe, S., Vanhoucke, V., 2016. Inception-v4, inceptionresnet and the
impact of residual connections on learning. CoRR vol.abs/1602.07261.
Tang, H., et al., 2021. 1-bit Adam: Communication Efficient Large Scale Training with Adam's Convergence Speed.
Tmamna, J., Ayed, E.B., Ayed, M.B., 2020. Deep learning for internet of things in fog
computing: Survey and Open Issues. In: 2020 5th International Conference on
Advanced Technologies for Signal and Image Processing (ATSIP).
Tuli, S., Basumatary, N., Gill, S.S., Kahani, M., Arya, R.C., Wander, G.S., Buyya, R.,
2020. HealthFog: an ensemble deep learning based Smart Healthcare System for
Automatic Diagnosis of Heart Diseases in integrated IoT and fog computing
environments. Future Generation Computer Syst.
Ullrich, K., Meeds, E., Welling, M., 2017. Soft weight-sharing for neural network
compression. CoRR vol. abs/1702.04008.
Vanhoucke, V., Senior, A., Mao, M.Z., 2011. Improving the speed of neural networks
on cpus, In: Deep Learning and Unsupervised Feature Learning Workshop, NIPS
2011.
Wang, Z.R., Du, J., 2021. Joint architecture and knowledge distillation in CNN for
Chinese text recognition. Pattern Recogn. 111, 107722.
Wang, S., Tuor, T., Salonidis, T., Leung, K.K., Makaya, C., He, T., Chan, K., 2018, When
edge meets learning: adaptive control for resource-constrained distributed
machine learning. In: Proc. IEEE Int. Conf. Comput. Comnun. (INFOCOM).
Watanabe, S., Hori, T., Le Roux, J., Hershey, J.R., 2017. Student teacher network
learning with enhanced features. In: ICASSP.
Wu, A., Zheng, W.S., Guo, X., Lai, J.H., 2019a. Distilled person re-identification:
Towards a more scalable system. In: CVPR.
Wu, X., He, R., Hu, Y., Sun, Z., 2020. Learning an evolutionary embedding via massive
knowledge distillation. Int. J. Computer Vis., 1–18
Wu, B., Iandola, F.N., Jin, P.H., Keutzer, K., 2016. Squeezedet: Unified, small, low
power fully convolutional neural networks for real-time object detection for
autonomous driving. CoRR vol. abs/1612.01051.
Yang, Z., Shou, L., Gong, M., Lin, W., Jiang, D., 2020c. Model compression with two-
stage multi-teacher knowledge distillation for web question answering system.
In: WSDM.
Zhai, S., Cheng, Y., Zhang, Z.M., 2016. Doubly convolutional neural networks. In:
Advances in Neural Information Processing Systems, pp. 1082–1090.
Zhang, Y., Xiang, T., Hospedales, T.M., Lu, H., 2018b. Deep mutual learning. In: CVPR.
Zhang, L., Song, J., Gao, A., Chen, J., Bao, C. Ma, K., 2019b. Be your own teacher:
Improve the performance of convolutional neural networks via self distillation.
In: ICCV.
Zhang, M., Song, G., Zhou, H., Liu, Y., 2020. Discriminability distillation in group
representation learning. In: ECCV.
Zhao, L., Peng, X., Chen, Y., Kapadia, M., Metaxas, D.N., 2020. Knowledge as Priors:
Cross-Modal Knowledge Generalization for Datasets without Superior
Knowledge. In: CVPR.
Zhou, C., Neubig, G., Gu, J., 2019. Understanding knowledge distillation in
nonautoregressive machine translation. In: ICLR.
Zhu, M., Han, K., Zhang, C., Lin, J., Wang, Y., 2019. Low-resolution visual recognition
via deep feature distillation. In: ICASSP.