A Channel Pruning Optimization with Layer-wise Sensitivity in a Single-shot Manner under Computational Constraints

MINSU JEON*, TAEWOO KIM*, CHANGHA LEE (Member, IEEE), AND CHAN-HYUN YOUN (Senior Member, IEEE)
School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea
Corresponding author: Chan-Hyun Youn (e-mail: chyoun@kaist.ac.kr).
This research was supported by the Challengeable Future Defense Technology Research and Development Program (No. 915027201) of the Agency for Defense Development in 2022. (*Minsu Jeon and Taewoo Kim contributed equally to this work.)
ABSTRACT In constrained computing environments such as mobile devices or satellite on-board systems, various computational factors of the hardware resource can restrict the processing of deep learning (DL) services. Recent DL models, such as those for satellite image analysis, mainly require a larger resource memory occupation for the intermediate feature map footprint than the given memory specification of the hardware, and a larger computational overhead (in FLOP) than the hardware accelerator can process within the service-level objective. As one of the solutions, we propose a new method of controlling layer-wise channel pruning in a single-shot manner that can decide how many channels to prune in each layer by observing the dataset only once, without full pretraining. To improve robustness against performance degradation, we also propose a layer-wise sensitivity and formulate optimization problems for deciding the layer-wise pruning ratios under target computational constraints. In the paper, the optimal conditions are theoretically derived, and practical optimum searching schemes are proposed using the optimal conditions. In the empirical evaluation, the proposed methods show robustness against performance degradation, and present the feasibility of DL serving under constrained computing environments by reducing memory occupation and providing an acceleration effect and throughput improvement while keeping the accuracy performance.
INDEX TERMS Single-shot pruning, channel pruning, lottery ticket hypothesis, DL model compression.
I. INTRODUCTION
Recent advances in deep learning (DL) models have achieved remarkable analysis performance in many computer vision tasks [1], [2]. Since recent convolutional neural networks (CNNs) have grown in depth and complexity in pursuit of high analysis performance, it has become a challenge to deploy DL models on constrained computing environments such as mobile devices or satellite on-board systems.
As one of the solutions, pruning can reduce the computational overhead and resource occupation of DL models, which enables deployment on constrained computing resources. State-of-the-art pruning schemes [3], [4] attempt to decide which weight parameters to prune in a single-shot manner, avoiding the additional huge computational overhead of pretraining. However, these single-shot pruning schemes mainly target weight pruning, which cannot directly reduce the computational resource occupation of DL models without sparsity-aware hardware or software; even when such support is provided, the reduction effect is small [5].
Therefore, single-shot channel pruning, which directly reduces computational resource occupation and accelerates DL model processing by removing output channels of each layer, can be considered for constrained computing environments. There are three main issues in channel pruning a DL model in a single-shot manner for constrained computing environments: 1) how to adapt the single-shot weight pruning criterion to the channel pruning scheme, 2) the risk of pruning a whole layer in channel pruning with a global searching scheme, and 3) the varied aspects of layer-wise pruning sensitivity and layer-wise computational properties.
To solve these problems, we propose a new layer-wise channel pruning scheme in a single-shot manner that is robust to performance (accuracy) degradation while meeting the computational constraints of the target hardware resource. Our main contributions can be summarized as follows.
Firstly, theoretical analyses of the validity of two possible single-shot channel pruning criteria adapted from the single-shot weight pruning scheme are conducted, and the valid one is defined as the channel sensitivity.
We also observe that pruning channels with a global searching scheme under the lottery ticket hypothesis [6] is prone to removing a whole layer due to differences in the layer-wise channel sensitivity scale, and we propose a layer-wise sensitivity to regulate the critical performance degradation caused by excessive pruning of a certain layer.
Finally, we formulate the optimization problem that decides the layer-wise pruning ratios to minimize the sensitivity score of the target model while meeting the computational constraints of the target hardware resource. We derive the optimal conditions and propose practical methods to search for the optimal solutions that consider the diverse layer-wise aspects of pruning sensitivity and computational characteristics.
In the paper, empirical evaluation of the proposed methods is conducted on various datasets and network models. The proposed methods show improved robustness against performance degradation and feasibility for DL serving with their accelerating effect.
II. RELATED WORK
1) DL Processing on Constrained Computing Environments
Applications such as personalized services on mobile devices [7], [8], anomaly detection from IoT devices [9], and satellite imagery analysis [10] on on-board processing systems [11] require deploying DL models in constrained computing environments. As one use case, CloudScout [11] deploys a custom-designed CNN on a nanosatellite to select eligible data by detecting clouds as a binary mask. As the available hardware resources and power budget are limited in such environments, a light-weight DL model can be designed by constructing a short and small CNN, but such a light-weight model inevitably shows lower performance than deeper and wider CNNs in general [12].
The main computational constraints for deploying such deep CNNs under limited hardware resources are the resource memory occupation and the computation overhead (generally quantified in the number of floating point operations (FLOP)). Since the size of the target input image has grown in recent practical applications, the resource memory occupation required for the intermediate feature maps in a CNN often exceeds the given hardware memory size [13], [14]. For example, as shown in Fig. 1, deploying the Faster-RCNN [2] model with a 1k x 1k input requires about 9GB of resource memory for the footprint of intermediate feature maps and about 3TFLOP of computational overhead; however, hardware accelerators for constrained computing environments such as the NVIDIA TX-1 [15] or the Xilinx VC707 [16] can accommodate at most 4GB of memory and achieve only about 60-500GFLOPS for DL inference processing, which is far insufficient to compute within the seconds level.

FIGURE 1. Necessity of pruning for serving a DL model on constrained computing environments.
2) Single-shot Weight Pruning
According to when the model is pruned, research on pruning can be divided into two main branches: (1) pruning from a pretrained state, and (2) pruning from the initial state by observing the dataset only once. Schemes that prune from a pretrained state [17], [18] inevitably suffer performance degradation from the original pretrained model even when further fine-tuning is conducted. A recent study observes the lottery ticket hypothesis [6], i.e., that there exist sparse trainable subnetworks at initialization (called winning tickets). However, finding such a winning ticket from the pretrained state requires the additional computing overhead of pretraining the original model before retraining the pruned model. Therefore, recent works [3], [4] propose single-shot weight pruning criteria that can search for the weight parameters to prune at the initial state, without pretraining for full iterations.
3) Channel Pruning
However, weight pruning itself cannot directly reduce the resource memory occupation and computation overhead without sparsity-aware hardware and software [5], and its effect is relatively small on general GPU resources [19] compared to the degree of pruning.
FIGURE 2. Illustration of pruning an output channel in a convolutional layer.
Accordingly, channel pruning [20], [21] can be applied to directly reduce the resource memory occupation and computation overhead of a DL model by removing output channels in each layer. As channel pruning is riskier for performance degradation, removing a bunch of weights linked to an output channel at once, most studies [17], [21] mainly target how to efficiently recover the performance degradation when pruning from a pretrained state, rather than how to select efficient channels to prune.

On the other side, some studies [22]–[24] attempt to prune channels in the form of neural architecture search (NAS) by constructing a loss function that considers both the task performance and the model cost (e.g., memory or FLOP) together. However, such NAS-based methods inevitably require the additional heavy overhead of training the NAS model itself before training the pruned model. To overcome these limitations, in this paper we aim to propose an efficient channel pruning scheme for constrained computing environments in a single-shot manner.
III. SINGLE-SHOT BASED LAYER-WISE CHANNEL PRUNING WITH COMPUTATIONAL CONSTRAINTS
In this section, we first introduce how to adapt the single-shot weight pruning criterion to the channel pruning scheme, and check its validity theoretically. Then, the layer-wise sensitivity is introduced to regulate excessive pruning of a certain layer, and we formulate the optimization problem of minimizing the whole pruning sensitivity score of the model. We then derive the optimal condition and propose a practical optimum searching method for each constraint in the following subsections.
A. SINGLE-SHOT BASED CHANNEL PRUNING
Let $n_i$, $h_i$, and $w_i$ denote the number of output channels, and the height and width of the output feature map of the $i$-th layer, respectively. As shown in Fig. 2, from the input $\mathbf{x}_{i-1}$, a convolutional layer conducts $n_i \cdot n_{i-1}$ convolution operations, and pruning can be represented as masking the filters, denoted by $\mathcal{C}_i \odot \mathcal{F}_i$, where $\odot$ denotes the element-wise product and $\theta^i_{p,q}$ denotes the kernel linking the $p$-th output channel and the $q$-th input channel in the $i$-th layer. In the following, let $\mathcal{C}$ denote the set of all masking matrices in the network, and let $\mathcal{C} \odot \mathcal{F}$ denote masking each layer by $\mathcal{C}_i \odot \mathcal{F}_i$.
TABLE 1. Computation time required for pretraining and retraining over channel pruning methods

                    Proposed   Lottery   Pt + Ft   Ori.
Accuracy (%)        81.90      84.29     79.05     80.00
Pretrain time (s)   15.6       1736.8    1736.8    1736.8
Retrain time (s)    850.5      992.5     176.9     -
Total time (s)      866.1      2729.3    1913.7    1736.8
Table 1 shows the motivation of our work for single-shot channel pruning. In the table, the training time elapsed to achieve the best test accuracy among 160 epochs is measured for each pruning method. The test environment is the same as that of Tables 2 and 3, described in Section IV-A. The target model is ResNet-101 for the satellite imagery dataset, which is the most practical application.

As shown in the result, the conventional methods that prune from the pretrained state and then conduct fine-tuning (denoted as Pt + Ft, with the fine-tuning time presented as retraining time) [21], or that then conduct retraining with re-initialization (denoted as Lottery) [6], incur a much larger computational burden in total over the pretraining stage (the single-shot observing stage for the proposed method) and the retraining (or fine-tuning) stage. Such overheads even exceed the overhead of training the original full model. Accordingly, we target single-shot channel pruning that can achieve acceleration in both the pretraining and retraining (or fine-tuning) stages, not only in the inference phase.
1) Channel Sensitivity
Given the lack of study on single-shot channel pruning criteria, we analyze two possible ways of adapting the single-shot weight pruning criterion to channel pruning. The first possible criterion is summing up the single-shot weight masking effects over the weights linked to the target output channel. Let $L$ denote a loss function (e.g., the cross-entropy loss) and $\mathcal{D}$ denote the given dataset. Observing this criterion theoretically, from one of the single-shot weight pruning criteria [3], the weight-wise sensitivity is given as

$$s^i_{p,q} = \frac{|g^i_{p,q}(\mathcal{F};\mathcal{D})|}{\sum_i \sum_p \sum_q |g^i_{p,q}(\mathcal{F};\mathcal{D})|},$$

where $g^i_{p,q} = \frac{\partial L(\mathcal{C}\odot\mathcal{F};\mathcal{D})}{\partial c^i_{p,q}}\big|_{\mathcal{C}=\mathbf{1}} \approx L(\mathcal{F};\mathcal{D}) - L(\mathcal{F};\theta^i_{p,q}=0,\mathcal{D})$, and $i$, $p$, $q$ denote the indices of the layer, output channel, and input channel, respectively. However, pruning with this criterion cannot be guaranteed to be equal to solving the empirical risk minimization problem of the network in the finite hypothesis space of pruning, which is stated as the following property.

Property 1. Pruning a channel by $\sum_q s^i_{p,q}$ (where $s^i_{p,q}$ is the single-shot weight sensitivity in [3]) cannot be guaranteed to be equal to solving the empirical risk minimization (ERM) problem of the neural network in the finite hypothesis space of pruning.
Proof. Consider the weight-wise pruning criterion of SNIP [3]:

$$s^i_{p,q} = \frac{|g^i_{p,q}(\mathcal{F};\mathcal{D})|}{\sum_i \sum_p \sum_q |g^i_{p,q}(\mathcal{F};\mathcal{D})|}. \quad (1)$$

As mentioned in the main content, $i$ is the layer index, $p$ or $j$ denotes the output channel index, and $q$ denotes the input channel index. Choosing a channel to prune by $\sum_q s^i_{p,q}$, i.e., summing up the weight-wise scores linked to the target channel [21], is written as:

$$\arg\min_{i,j} \sum_q s^i_{p,q}. \quad (2)$$

Pruning a channel by conducting the element-wise product of the filters with a masking matrix ($\mathcal{C}_i \odot \mathcal{F}_i$) is equal to finding $\mathcal{F}_i$ with the chosen channel pruned ($\theta^i_{p,q}=0, \forall q$) in the finite hypothesis space $\mathcal{H}$ that consists of all possible channel pruning cases. Therefore, the problem of (2) can be written as:

$$= \arg\min_{\mathcal{F}_i|\theta^i_{j,q}=0,\forall q \,\in\, \mathcal{H}} \sum_q s^i_{p,q}. \quad (3)$$

Applying $\sum_q s^i_{p,q} = \frac{\sum_q |L(\mathcal{F};\mathcal{D}) - L(\mathcal{F};\theta^i_{p,q}=0,\mathcal{D})|}{\sum_i\sum_p\sum_q |g^i_{p,q}(\mathcal{F};\mathcal{D})|}$ to (3), the denominator term and the $L(\mathcal{F};\mathcal{D})$ term in the numerator are constant with regard to $j$ (wrapped as $\mathcal{F}_i|\theta^i_{j,q}=0,\forall q$). As the cross entropy is usually used as the loss function $L(\cdot)$, assume $L>0$. Then, the problem in (3) can be written as:

$$= \arg\min_{\mathcal{F}_i|\theta^i_{j,q}=0,\forall q \,\in\, \mathcal{H}} \frac{\sum_q |\alpha - L(\mathcal{F};\theta^i_{p,q}=0,\mathcal{D})|}{\beta} \quad (4)$$

$$\neq \arg\min_{\mathcal{F}_i|\theta^i_{j,q}=0,\forall q \,\in\, \mathcal{H}} L(\mathcal{F};\theta^i_{p,q}=0,\mathcal{D}), \quad (5)$$

where $\alpha, \beta \in \mathbb{R}$, $\alpha > 0$, $\beta \geq 0$ are constants. Therefore, this problem is not guaranteed to be equal to the empirical risk minimization problem (5).
Alternatively, we define the channel sensitivity by transforming the single-shot weight pruning criterion [3] into the form of a direct output-channel masking effect, as follows:

$$s^i_j = \frac{|g^i_j(\mathcal{C}\odot\mathcal{F};\mathcal{D})|}{\sum_i \sum_j |g^i_j(\mathcal{C}\odot\mathcal{F};\mathcal{D})|}, \quad (6)$$

where $g^i_j(\mathcal{C}\odot\mathcal{F};\mathcal{D}) = \frac{\partial L(\mathcal{C}\odot\mathcal{F};\mathcal{D})}{\partial c^i_j}\big|_{\mathcal{C}=\mathbf{1}} \approx L(\mathcal{C}\odot\mathcal{F};\mathcal{D}) - L(\mathcal{C}|_{c^i_j=0}\odot\mathcal{F};\mathcal{D})$.

Pruning a channel with the defined criterion ($s^i_j$) can be stated as an equivalent problem of minimizing the empirical loss of the network, as shown in the following Lemma 1.
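To make the computation of (6) concrete, a minimal PyTorch sketch is given below. The hook-based per-channel masking, the function name channel_sensitivity, and the single-batch usage are our own illustrative assumptions rather than the authors' released implementation; in practice $\mathcal{D}$ would be one observed batch of the dataset.

```python
import torch
import torch.nn as nn

def channel_sensitivity(model, loss_fn, data, target):
    """Sketch of the single-shot channel sensitivity of Eq. (6):
    attach a multiplicative per-channel mask c (all ones) to every Conv2d
    output, then normalize |dL/dc| over all channels of all layers."""
    masks, hooks = [], []

    def mask_hook(module, inputs, output):
        # c_j^i = 1 for every output channel j of this layer
        c = torch.ones(output.shape[1], device=output.device, requires_grad=True)
        masks.append(c)
        return output * c.view(1, -1, 1, 1)

    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            hooks.append(m.register_forward_hook(mask_hook))

    loss = loss_fn(model(data), target)        # L(C ⊙ F; D) evaluated at C = 1
    grads = torch.autograd.grad(loss, masks)   # g_j^i ≈ dL/dc_j^i
    for h in hooks:
        h.remove()

    total = sum(g.abs().sum() for g in grads)  # normalization denominator of (6)
    return [g.abs() / total for g in grads]    # s[i][j] per layer i, channel j
```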
Lemma 1. If $L(\mathcal{C}\odot\mathcal{F}; c^i_j=0, \mathcal{D}) - L(\mathcal{C}\odot\mathcal{F};\mathcal{D}) \geq 0, \ \forall (i,j) \in \{(i,j) \,|\, c^i_j=1, c^i_j \in \mathcal{C}\}$, pruning a channel by $s^i_j$ is equal to solving the empirical risk minimization (ERM) problem of the neural network in the finite hypothesis space.
Proof. In the score formulation transformed for channel pruning from the weight-wise score of [3], the impact of pruning the $j$-th output channel of the $i$-th layer ($\Delta L^i_j$) is approximated by $g^i_j(\mathcal{C}\odot\mathcal{F};\mathcal{D})$ for calculation efficiency in implementation, as follows:

$$\Delta L^i_j(\mathcal{F};\mathcal{D}) = L(\mathcal{C}\odot\mathcal{F};\mathcal{D}) - L(\mathcal{C}\odot\mathcal{F}; c^i_j=0, \mathcal{D}) \approx g^i_j(\mathcal{C}\odot\mathcal{F};\mathcal{D}) = \frac{\partial L(\mathcal{C}\odot\mathcal{F};\mathcal{D})}{\partial c^i_j}\Big|_{\mathcal{C}=\mathbf{1}}. \quad (7)$$

Applying this to $s^i_j$ of (6) obtains

$$s^i_j = \frac{|g^i_j(\mathcal{C}\odot\mathcal{F};\mathcal{D})|}{\sum_i \sum_j |g^i_j(\mathcal{C}\odot\mathcal{F};\mathcal{D})|} \quad (8)$$

$$\approx \frac{|L(\mathcal{C}\odot\mathcal{F};\mathcal{D}) - L(\mathcal{C}\odot\mathcal{F};c^i_j=0,\mathcal{D})|}{\sum_i\sum_j |L(\mathcal{C}\odot\mathcal{F};\mathcal{D}) - L(\mathcal{C}\odot\mathcal{F};c^i_j=0,\mathcal{D})|}. \quad (9)$$

Therefore, choosing a channel to prune by $s^i_j$ is written as:

$$\arg\min_{i,j} s^i_j \quad (10)$$

$$= \arg\min_{i,j} \frac{|L(\mathcal{C}\odot\mathcal{F};\mathcal{D}) - L(\mathcal{C}\odot\mathcal{F};c^i_j=0,\mathcal{D})|}{\sum_i\sum_j |L(\mathcal{C}\odot\mathcal{F};\mathcal{D}) - L(\mathcal{C}\odot\mathcal{F};c^i_j=0,\mathcal{D})|}, \quad (11)$$

where the denominator term and the $L(\mathcal{C}\odot\mathcal{F};\mathcal{D})$ term in the numerator are constant with regard to $i,j$.

As the cross entropy is usually used as the loss function $L(\cdot)$, assume $L>0$. Then, denoting the $L(\mathcal{C}\odot\mathcal{F};\mathcal{D})$ term in the numerator and the denominator term as constants $\alpha, \beta \in \mathbb{R}$, $\alpha,\beta \geq 0$, respectively, (11) becomes:

$$= \arg\min_{i,j} \frac{|\alpha - L(\mathcal{C}\odot\mathcal{F};c^i_j=0,\mathcal{D})|}{\beta}. \quad (12)$$

This implies that, when $L(\mathcal{C}\odot\mathcal{F};c^i_j=0,\mathcal{D}) - \alpha \geq 0$, the problem becomes equal to the empirical risk minimization problem of the neural network in the finite hypothesis space $\mathcal{H}$ that consists of all possible channel pruning cases. Therefore, under such a condition, (12) becomes:

$$= \arg\min_{i,j} L(\mathcal{C}\odot\mathcal{F};c^i_j=0,\mathcal{D}) \quad (13)$$

$$= \arg\min_{\mathcal{F}\in\mathcal{H}} L(\mathcal{F};\mathcal{D}). \quad (14)$$
The exception condition $L(\mathcal{C}\odot\mathcal{F};c^i_j=0,\mathcal{D}) < L(\mathcal{C}\odot\mathcal{F};\mathcal{D})$ corresponds to the case where pruning reduces the loss from the non-pruned state. Moreover, as solving ERM guarantees a probably approximately correct (PAC) bound [25], under the same condition as in Lemma 1, pruning a channel by $s^i_j$ also guarantees a PAC bound, and its estimation error is upper bounded. As minimizing the empirical loss tends to minimize the test error except under overfitting, pruning with the criterion of (6) can be expected to contribute to minimizing the test error.

FIGURE 3. Cumulative density of the number of channels with regard to the channel sensitivity value $s^i_j$ in each layer of VGG-16 with CIFAR-10, which shows the risk of early layer removal.
2) Layer-wise Sensitivity
However, the defined single-shot channel sensitivity reveals distributional differences over layers, as shown in Fig. 3. Since the channel sensitivity scores of each layer are distributed within different bounds, and the bounds differ over layers, a global searching scheme is prone to pruning a certain layer entirely as the degree of pruning increases. For example, as shown in Fig. 3, trying to prune the model with a normalized channel sensitivity criterion value of 0.2 immediately removes all channels of some layers (Layer 8, Layer 5, etc.). Such a risk of pruning a whole layer results in critical performance degradation, and early layer removal worsens the robustness of performance degradation with regard to the compression rate. Detailed evaluation results of this risk are also presented in Section IV.

To overcome this risk, in methods that prune from a pretrained state, a layer-wise pruning sensitivity curve that reveals the layer-wise performance degradation with respect to the degree of pruning can easily be obtained by profiling, and this curve can be used to regulate pruning a certain layer excessively [21]. However, in the single-shot pruning scheme, it is practically impossible to obtain such layer-wise pruning sensitivity curves, as the performance information cannot be obtained directly unless each profiling point is trained for full iterations.
Instead, in order to replace the profiled task-performance degradation curve that cannot be acquired in the single-shot pruning scheme, we define a layer-wise sensitivity $LS_i$ for the single-shot pruning scheme as the inverse of the sum of the channel sensitivity scores of the channels remaining after pruning (with $\mathcal{C}$) at the $i$-th layer, as follows:

$$LS_i(\mathcal{C}) = \frac{1}{\sum_{j=0}^{n_i-1} s^i_j - \sum_{j\in\{j|c^i_j=0,\,c^i_j\in\mathcal{C}_i\}} s^i_j}. \quad (15)$$

The proposed layer-wise sensitivity ($LS_i$) basically follows the theoretical foundation (Lemma 1) of the global channel sensitivity but additionally regulates excessive pruning of a certain layer, by constructing $LS_i$ as the multiplicative inverse of the sum of the remaining global channel sensitivities in each layer. In other words, within each layer, minimizing $LS_i$ prunes the channels with small global channel sensitivity first, keeping the theoretical foundation. Across layers, excessive pruning of a certain layer invokes an excessive increase of $LS_i$, and therefore the proposed layer-wise sensitivity regulates the risk of early layer removal when $LS_i$ is minimized over the layers.

Accordingly, in order to apply this mechanism to all layers of the target network, minimizing the product over all $LS_i$ is set as the optimization objective (minimize $\prod_i LS_i$). Consequently, the channels with small global channel sensitivity $s^i_j$ are pruned in each layer under the theoretical foundation (Lemma 1), while the objective also prevents $LS_i = \infty$ (i.e., pruning a whole layer) for any layer $i$, which results in robustness against task performance degradation.

Therefore, based on this property, we propose a global pruning scheme (s-ls-global) with layer-wise sensitivity, as described in Algorithm 1. The proposed scheme searches channels to prune over the whole network globally by selecting the channel with the minimum $s^i_j \cdot LS_i(\mathcal{C}; c^i_j=0)$ score channel-by-channel, iteratively updating the layer-wise sensitivity, where the layer-wise sensitivity term in the score discourages selecting a channel in an excessively pruned layer in the current search space.

Algorithm 1 Global channel pruning scheme with layer-wise sensitivity: s-ls-global()
Input: Target pruning ratio $pr$
1: $c^i_j \leftarrow 1$ for $\forall i, j$
2: calculate $s^i_j$ for $\forall i, j$
3: update $LS_i(\mathcal{C})$ for $\forall i$
4: while $\sum_i \sum_j c^i_j > \sum_i n_i \cdot (1 - pr)$ do
5:   $(i, j) \leftarrow \arg\min_{(i,j)\in\{(i,j)|c^i_j=1\}} s^i_j \cdot LS_i(\mathcal{C}; c^i_j = 0)$
6:   $\mathcal{C} \leftarrow \mathcal{C}|_{c^i_j=0}$
7:   update $LS_i(\mathcal{C})$
8: end while
9: return $\mathcal{C}$
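A minimal NumPy sketch of Algorithm 1 follows. The greedy rescoring loop and the boolean mask representation are assumptions about implementation detail; the precomputed channel sensitivities s are passed in (e.g., from the sketch above).

```python
import numpy as np

def s_ls_global(s, target_pr):
    """Sketch of Algorithm 1: greedy global channel pruning, where each
    candidate is scored by s_j^i * LS_i(C; c_j^i = 0).

    s: list of 1-D arrays, s[i][j] = channel sensitivity of channel j in layer i.
    Returns boolean keep-masks per layer."""
    masks = [np.ones(len(si), dtype=bool) for si in s]
    keep_budget = int(sum(len(si) for si in s) * (1.0 - target_pr))

    while sum(int(m.sum()) for m in masks) > keep_budget:
        best, best_score = None, np.inf
        for i, (si, m) in enumerate(zip(s, masks)):
            remaining = si[m].sum()
            for j in np.flatnonzero(m):
                rem_after = remaining - si[j]  # LS_i denominator after pruning j
                if rem_after <= 0:
                    continue                   # LS_i = inf: would empty layer i
                score = si[j] / rem_after      # s_j^i * LS_i(C; c_j^i = 0)
                if score < best_score:
                    best, best_score = (i, j), score
        if best is None:
            break                              # every further prune empties a layer
        masks[best[0]][best[1]] = False
    return masks
```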
B. LAYER-WISE ADAPTIVE PRUNING UNDER
COMPUTATIONAL CONSTRAINTS
FIGURE 4. Layer-wise aspects of resource memory occupation size and computation overhead on VGG-16 with CIFAR-10.

As mentioned above, the resource memory occupation size and the computation overhead (FLOP) of the DL model are mainly limited by the hardware resources.
To meet such computational constraints with the proposed s-ls-global of Algorithm 1, an appropriate pruned model can be searched for by incrementally increasing the target pruning ratio until the computational constraints are met. However, this scheme does not consider the layer-wise aspects of the computational characteristics, where the resource memory occupation and the computation overhead (FLOP) differ from each other and over layers, as shown in Fig. 4. Accordingly, an inefficient pruning case can occur that selects links with low impact on the loss but also a low reduction effect on the computational constraints, which can disrupt the robustness of performance degradation with regard to each computational constraint.
Therefore, building on the idea of the aforementioned layer-wise sensitivity, we additionally propose a layer-wise adaptive pruning scheme that determines the pruning ratio of each layer to minimize the network sensitivity $\prod_i LS_i(pr_i)$ while meeting the computational constraints, as shown in Fig. 5. Let $pr_i$ denote the pruning ratio of the $i$-th layer, equivalent to $pr_i = 1 - \sum_j c^i_j / n_i$; let $LS_i(pr_i)$ denote the layer-wise sensitivity score of the $i$-th layer pruned to degree $pr_i$ by the channel sensitivity $s^i_j$; and let $\mathbf{pr} = [pr_1, ..., pr_M]$ denote the pruning policy for a whole network with $M$ layers. In turn, the resource memory occupation constraint (mainly occupied by the memory footprint of the intermediate feature maps for large input images) and the computation overhead constraint can be written as the following equations, respectively:

$$\sum_i |\mathbf{x}_i|_0\, n_i (1 - pr_i) \leq r_{mem} \sum_i |\mathbf{x}_i|_0\, n_i \quad (16)$$

$$\sum_i (1 - pr_{i-1})(1 - pr_i)\, CO_i \leq r_F \sum_i CO_i \quad (17)$$

where $r_{mem}$ and $r_F$ denote the target constraint levels to meet the hardware resource requirements (i.e., the remaining ratio compared to the requirement of the original full model), and $CO_i$ denotes the computation overhead in FLOP of the $i$-th layer. In the following, as $LS_i(pr_i)$ is convex, we construct the optimization problem for each computational constraint respectively, and address how to solve them.
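For reference, the left-hand sides of (16) and (17) can be evaluated directly from the layer shapes; the sketch below, with assumed argument names, returns the remaining memory and FLOP fractions of a pruning policy pr.

```python
import numpy as np

def constraint_ratios(act_sizes, n_out, flops, pr):
    """Remaining fractions of the full model under policy pr.

    act_sizes[i]: per-channel feature-map size |x_i|_0 of layer i
    n_out[i]:     number of output channels n_i
    flops[i]:     computation overhead CO_i
    pr[i]:        pruning ratio pr_i (pr_0 of the input is fixed to 0)"""
    act_sizes, n_out, flops, pr = map(np.asarray, (act_sizes, n_out, flops, pr))
    mem = act_sizes * n_out
    mem_ratio = np.sum(mem * (1 - pr)) / np.sum(mem)                    # Eq. (16)
    prev = np.concatenate(([0.0], pr[:-1]))                             # pr_{i-1}
    flop_ratio = np.sum((1 - prev) * (1 - pr) * flops) / np.sum(flops)  # Eq. (17)
    return mem_ratio, flop_ratio  # feasible iff <= r_mem, r_F respectively
```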
FIGURE 5. Illustration of layer-wise adaptive pruning; a different pruning ratio is allocated to each layer considering its layer-wise sensitivity and memory/FLOP characteristics.
1) Optimal Layer-wise Pruning Ratio for Resource Memory Occupation Constraint
First, it is targeted to find the optimal layer-wise pruning ratios $\mathbf{pr}$ that minimize $\prod_i LS_i$ subject to the resource memory occupation constraint of (16). The optimal condition for this optimization problem can be derived as the following Theorem 1.

Theorem 1. The optimal layer-wise pruning ratios $\mathbf{pr}$ that minimize $\prod_i LS_i$, while meeting the memory occupation constraint of (16), satisfy the following condition:

$$\frac{LS'_1(pr_1)}{LS_1(pr_1)} \cdot \frac{1}{|\mathbf{x}_1|_0\, n_1} = \cdots = \frac{LS'_M(pr_M)}{LS_M(pr_M)} \cdot \frac{1}{|\mathbf{x}_M|_0\, n_M}, \quad (18)$$

where $LS'_i$ denotes the derivative.
Proof. The targeted optimization problem is to find the optimal layer-wise pruning ratios $\mathbf{pr}$ that minimize $\prod_i LS_i(pr_i)$ subject to (16). The corresponding Lagrangian is written as:

$$\mathcal{L}(\mathbf{pr}, \lambda) = \prod_{i=1}^{M} LS_i(pr_i) - \lambda \sum_{i=1}^{M} |\mathbf{x}_i|_0\, n_i (1 - pr_i - r_{mem}), \quad (19)$$

where $M$ denotes the total number of layers in the network. Taking the partial derivative with regard to the pruning ratio of a certain layer $pr_k$ gives:

$$\frac{\partial \mathcal{L}}{\partial pr_k} = \prod_{i=1}^{M} LS_i(pr_i) \cdot \frac{LS'_k(pr_k)}{LS_k(pr_k)} + \lambda\, |\mathbf{x}_k|_0\, n_k = 0, \quad \forall k \in \{1, ..., M\}. \quad (20)$$

From (20), it can be expanded as:

$$-\lambda = \prod_{i=1}^{M} LS_i(pr_i) \cdot \frac{LS'_k(pr_k)}{LS_k(pr_k)} \cdot \frac{1}{|\mathbf{x}_k|_0\, n_k}, \quad \forall k \in \{1, ..., M\}. \quad (21)$$

Therefore, as the $\prod_{i=1}^{M} LS_i(pr_i)$ term in (21) has the same value for all $k$, the following equation holds:

$$\frac{LS'_1(pr_1)}{LS_1(pr_1)} \cdot \frac{1}{|\mathbf{x}_1|_0\, n_1} = \cdots = \frac{LS'_M(pr_M)}{LS_M(pr_M)} \cdot \frac{1}{|\mathbf{x}_M|_0\, n_M}. \quad (22)$$
Algorithm 2 Optimal layer-wise pruning ratio searching scheme for memory constraint: mem-opt()
Input: Target memory constraint level $r_{mem}$
1: $c^i_j \leftarrow 1$ for $\forall i, j$
2: calculate $s^i_j$ for $\forall i, j$
3: $\rho \leftarrow 0$
4: $\mathbf{pr} \leftarrow [0, ..., 0]$
5: while $\sum_i |\mathbf{x}_i|_0\, n_i (1 - pr_i) > r_{mem} \sum_i |\mathbf{x}_i|_0\, n_i$ do
6:   $\forall i,\ pr_i \leftarrow f_{mem,i}^{-1}(\rho)$
7:   $\rho \leftarrow \rho + \epsilon$
8: end while
9: return $\mathbf{pr}$
Denoting $f_{mem,i}(pr_i) = \frac{LS'_i(pr_i)}{LS_i(pr_i)} \cdot \frac{1}{|\mathbf{x}_i|_0\, n_i}$, the optimal pruning ratios can be practically searched by incrementally raising the threshold variable $\rho$ from zero until the computational constraint is first met, as described in Algorithm 2, where $\epsilon$ denotes the incrementing unit. After the layer-wise pruning ratios are determined, pruning by the channel sensitivity $s^i_j$ is conducted on each layer with its determined pruning ratio $pr_i$.
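As an illustration, a hypothetical grid-based realization of Algorithm 2 is sketched below; the numerical derivative and the grid inversion of $f_{mem,i}$ are our assumptions, since any monotone inversion of $f_{mem,i}$ would serve.

```python
import numpy as np

def mem_opt(ls_curves, mem_sizes, r_mem, eps=1e-4):
    """Sketch of Algorithm 2: search layer-wise ratios under a memory budget.

    ls_curves[i]: callable LS_i(pr) on (0, 1), convex and increasing.
    mem_sizes[i]: |x_i|_0 * n_i, activation memory of layer i."""
    grid = np.linspace(0.01, 0.99, 99)   # candidate pruning ratios per layer

    def f_mem(i, pr, h=1e-3):
        ls = ls_curves[i]
        dls = (ls(pr + h) - ls(pr - h)) / (2 * h)   # LS_i'(pr), numerically
        return dls / ls(pr) / mem_sizes[i]

    def f_mem_inv(i, rho):
        # largest grid ratio whose f_mem stays below the threshold rho
        vals = np.array([f_mem(i, p) for p in grid])
        below = np.flatnonzero(vals <= rho)
        return grid[below[-1]] if len(below) else 0.0

    total = float(np.sum(mem_sizes))
    rho, pr = 0.0, np.zeros(len(ls_curves))
    while np.sum(np.asarray(mem_sizes) * (1 - pr)) > r_mem * total:
        pr = np.array([f_mem_inv(i, rho) for i in range(len(ls_curves))])
        rho += eps                        # raise the shared threshold
    return pr
```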
Actually, the proposed pruning method optimized for activation memory occupation can also be modified to support optimizing the weight memory occupation. Denoting the size (memory occupation) of the weight parameters in the $i$-th layer as $w_i$, the weight parameter size is reduced in proportion to the pruning ratios ($pr_i$, $pr_{i-1}$) of both the current and the previous layer, as $(1-pr_{i-1})(1-pr_i)\, w_i$. Accordingly, a weight memory occupation constraint in a form similar to (17) can be derived, and the optimization follows likewise. However, for the conciseness of the paper, only the memory occupation of the intermediate feature maps is considered here, and the optimization of the weight memory occupation remains future work.
2) Optimal Layer-wise Pruning Ratio for FLOP Constraint
For the optimization under the computation overhead (FLOP) constraint, the goal is to find the optimal layer-wise pruning ratios $\mathbf{pr}$ that minimize $\prod_i LS_i$ while satisfying the FLOP constraint of (17). Likewise, the optimal condition for this optimization problem can be derived as the following Theorem 2. In the theorem, $pr_0$ denotes the pruning ratio of the input image channels, which is fixed to zero.

Theorem 2. The optimal layer-wise pruning ratios $\mathbf{pr}$ that minimize $\prod_i LS_i$, while meeting the computation overhead constraint of (17), satisfy the following condition:

$$\frac{LS'_1(pr_1)}{LS_1(pr_1)} \cdot \frac{1}{CO_1 (1 - pr_0)} = \cdots = \frac{LS'_M(pr_M)}{LS_M(pr_M)} \cdot \frac{1}{CO_M (1 - pr_{M-1})}. \quad (23)$$
Proof. The targeted optimization problem is to find the optimal layer-wise pruning ratios $\mathbf{pr}$ that minimize $\prod_i LS_i(pr_i)$ subject to (17). The corresponding Lagrangian is written as:

$$\mathcal{L}(\mathbf{pr}, \lambda) = \prod_{i=1}^{M} LS_i(pr_i) - \lambda \sum_{i=1}^{M} CO_i \big( (1-pr_i)(1-pr_{i-1}) - r_F \big), \quad (24)$$

where $M$ denotes the total number of layers in the network. Taking partial derivatives gives:

$$\frac{\partial \mathcal{L}}{\partial pr_k} = \prod_{i=1}^{M} LS_i(pr_i) \cdot \frac{LS'_k(pr_k)}{LS_k(pr_k)} + \lambda\, CO_k (1 - pr_{k-1}) = 0, \quad \forall k \in \{1, ..., M\}. \quad (25)$$

From (25),

$$-\lambda = \prod_{i=1}^{M} LS_i(pr_i) \cdot \frac{LS'_k(pr_k)}{LS_k(pr_k)} \cdot \frac{1}{CO_k (1 - pr_{k-1})}, \quad \forall k \in \{1, ..., M\}. \quad (26)$$

Therefore, as the $\prod_{i=1}^{M} LS_i(pr_i)$ term has the same value for all $k$, the following equation holds, where $pr_0$ denotes the pruning ratio of the input image channels, fixed to zero ($pr_0 = 0$):

$$\frac{LS'_1(pr_1)}{LS_1(pr_1)} \cdot \frac{1}{CO_1 (1 - pr_0)} = \cdots = \frac{LS'_M(pr_M)}{LS_M(pr_M)} \cdot \frac{1}{CO_M (1 - pr_{M-1})}. \quad (27)$$

From the optimal condition, as $pr_0 = 0$, once the value of $pr_1$ is given, the others ($pr_2, ..., pr_M$) are decided deterministically in sequence; denote $f_{FLOP,i}(pr_i) = \frac{LS'_i(pr_i)}{LS_i(pr_i)} \cdot \frac{1}{CO_i (1 - pr_{i-1})}$. The optimal pruning ratios can be practically searched by incrementally raising the pruning ratio of the first layer $pr_1$ from zero to one until the computational constraint is first met, as described in Algorithm 3. Likewise, after the layer-wise pruning ratios are determined, pruning by the channel sensitivity $s^i_j$ is conducted on each layer with its determined pruning ratio.

Algorithm 3 Optimal layer-wise pruning ratio searching scheme for FLOP constraint: flop-opt()
Input: Target FLOP constraint level $r_F$
1: $c^i_j \leftarrow 1$ for $\forall i, j$
2: calculate $s^i_j$ for $\forall i, j$
3: $\mathbf{pr} \leftarrow [0, ..., 0]$
4: while $\sum_i (1 - pr_{i-1})(1 - pr_i)\, CO_i > r_F \sum_i CO_i$ do
5:   $\rho \leftarrow f_{FLOP,1}(pr_1)$
6:   for $i$ in $\{2, ..., M\}$ do
7:     $pr_i \leftarrow f_{FLOP,i}^{-1}(\rho)$
8:   end for
9:   $pr_1 \leftarrow pr_1 + \epsilon$
10: end while
11: return $\mathbf{pr}$
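Likewise, a hypothetical sketch of Algorithm 3 is shown below; since $f_{FLOP,i}$ depends on $pr_{i-1}$, the remaining ratios are recovered sequentially from $pr_1$, and the same assumed grid inversion is reused.

```python
import numpy as np

def flop_opt(ls_curves, flops, r_f, eps=1e-3):
    """Sketch of Algorithm 3: search layer-wise ratios under a FLOP budget.
    pr_0 (the input channels) is fixed to zero."""
    grid = np.linspace(0.01, 0.99, 99)
    M = len(ls_curves)

    def f_flop(i, pr, pr_prev, h=1e-3):
        ls = ls_curves[i]
        dls = (ls(pr + h) - ls(pr - h)) / (2 * h)   # LS_i'(pr), numerically
        return dls / ls(pr) / (flops[i] * (1.0 - pr_prev))

    def cost(pr):
        prev = np.concatenate(([0.0], pr[:-1]))     # pr_{i-1}, with pr_0 = 0
        return float(np.sum((1 - prev) * (1 - pr) * np.asarray(flops)))

    pr, total = np.zeros(M), float(np.sum(flops))
    while cost(pr) > r_f * total:
        pr[0] += eps                                # sweep the first layer's ratio
        rho = f_flop(0, pr[0], 0.0)                 # threshold implied by pr_1
        for i in range(1, M):                       # recover pr_2..pr_M in sequence
            vals = np.array([f_flop(i, p, pr[i - 1]) for p in grid])
            below = np.flatnonzero(vals <= rho)
            pr[i] = grid[below[-1]] if len(below) else 0.0
    return pr
```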
IV. EVALUATION
In this section, the proposed methods are evaluated in two main aspects: robustness against performance degradation under pruning, and feasibility for DL serving. Through the evaluation, the robustness of performance degradation is observed on various test cases, and the acceleration effect on computing resources is measured.

A. EXPERIMENTAL SETTINGS
The proposed methods are compared with conventional channel pruning methods in three cases: VGG-16 [26] with the CIFAR-10 dataset [27], wide ResNet (WRN)-18 [1] with the Caltech101 dataset [28], and ResNet-101 [29] with the UC Merced satellite imagery dataset [30]. The detailed settings for training each case are described as follows.
1) VGG-16 with CIFAR-10 dataset
In the first case, the evaluation is conducted on a modified VGG-16 architecture in which an average pooling layer is attached after the last convolutional layer, and only a single fully connected layer with 512 input channels is connected at the end for CIFAR-10, modified from the original VGG-16 architecture [26]. The model is trained using SGD with a momentum of 0.9, a batch size of 128, and a weight decay rate of 0.0001, for 160 epochs in total. The initial learning rate is set to 0.1 and decayed by 0.1 every 60 epochs. The standard data augmentation (i.e., translation up to 4 pixels, random horizontal flip, and normalization) is applied.
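For reproducibility, the configuration above corresponds to a standard PyTorch recipe like the following sketch; the torchvision VGG-16 here is only a stand-in for the modified architecture described above, and the CIFAR-10 normalization statistics are the commonly used values, not taken from the paper.

```python
import torch
import torchvision
import torchvision.transforms as T

# Stand-in for the modified VGG-16 (single 512-input FC head for CIFAR-10)
model = torchvision.models.vgg16(num_classes=10)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# decay the learning rate by 0.1 every 60 of the 160 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),       # translation up to 4 pixels
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
```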
2) WRN-18 with Caltech101 dataset
For WRN-18 with the Caltech101 dataset, the WRN-18 architecture [1] is applied, where only the single fully connected layer at the end is modified to 101 output channels for Caltech101. The model is trained using SGD with a momentum of 0.9, a batch size of 32, and a weight decay rate of 0.0001, for 80 epochs in total. The initial learning rate is set to 0.1 and decayed by 0.1 at epoch 60. For data augmentation, only resizing of the input data to 224x224 is applied. In the dataset [28], 90% of the total dataset is split into training data and the remaining 10% is used as test data.
3) ResNet-101 with UC Merced satellite imagery dataset
Likewise, the ResNet-101 architecture [29], modified only in the last fully connected layer to 21 output channels for the UC Merced satellite imagery dataset, is used as the third case. The model is trained using SGD with a momentum of 0.9, a batch size of 128, and a weight decay rate of 0.0001, for 160 epochs in total. The initial learning rate is set to 0.1 and decayed by 0.1 every 60 epochs. Only resizing of the input data to 256x256 is applied for data augmentation. In the dataset [30], 90% of the total dataset is split into training data and the remaining 10% is used as test data.
B. ROBUSTNESS OF PERFORMANCE DEGRADATION
The robustness of task performance degradation of the proposed methods is evaluated over the three aforementioned cases. The task performance is evaluated at 20 sparsity levels, and the best top-1 test accuracy is measured as the performance of the model trained from the re-initialized state of the pruned model, following the evaluation of the lottery ticket hypothesis [6]. The three proposed pruning methods (s-ls-global, mem-opt, flop-opt) are evaluated against two conventional methods:

snip-sum: Adapting the single-shot weight pruning method of SNIP [3] to the channel pruning scheme by summing up the weight-wise scores linked to the target channel [21].
LTH-ch: A channel pruning form of the evaluation in the lottery ticket hypothesis [6] that prunes channels according to the summed magnitude of the weight parameters linked to each output channel. To compare under as similar a training overhead as possible, the single-step version of the lottery ticket hypothesis [6] is applied, where pruning is conducted only once.

For LTH-ch, the iterative pruning version of the lottery ticket hypothesis [6] was also conducted, which requires much more training computation overhead than the single-step version. However, as the iterative version shows even earlier layer removal than the single-step version in the channel pruning scheme, only the results of the single-step version (LTH-ch) are presented as representative in the paper.

To evaluate the robustness of performance degradation under pruning, the shape of the performance (accuracy) curve is observed with regard to three aspects: 1) the number of total remaining channels, 2) the remaining resource memory occupation size, and 3) the remaining FLOP of the pruned model.
The evaluation of the VGG-16 model with the CIFAR-10 dataset is shown in Fig. 6. The results show that the proposed methods improve the robustness of performance degradation compared to the conventional methods. In particular, s-ls-global performs best at reducing the number of output channels, mem-opt performs best at reducing the resource memory occupation size, and flop-opt performs best at reducing the computation overhead (FLOP), which corresponds to the intent of each proposed method.

As an ablation study, in order to see the effect of the proposed layer-wise sensitivity, further evaluations of other variant methods (s-local, s-global) are also conducted:

s-local: Pruning each layer with the same pruning ratio by the channel sensitivity $s^i_j$ within each layer.
s-global: Searching channels to prune by the channel sensitivity $s^i_j$ globally over the whole network.
FIGURE 6. Test top-1 accuracy results of pruning methods on VGG-16 for CIFAR-10 with respect to (a) the number of remaining channels, (b) remaining resource memory occupation size, and (c) remaining FLOP of the pruned model.
As shown in the results (Fig. 6), s-ls-global is more robust to performance degradation than s-local and s-global with the help of the layer-wise sensitivity that regulates excessive pruning of a certain layer. This effect can be explicitly clarified by observing the layer-wise pattern of remaining fractions shown in Fig. 7. The layer-wise sensitivity smoothens the pruning pattern under a high global pruning ratio to regulate pruning a certain layer excessively, as seen by comparing the patterns of s-ls-global and s-global. The proposed flop-opt shows a tendency to prune the 9th and 10th layers more heavily, which contain higher FLOP than the other layers as observed in Fig. 4. Likewise, mem-opt prunes the front layers aggressively, which contain larger resource memory occupation for intermediate feature maps. In addition, in the accuracy results, pruning with the channel sensitivity itself also shows higher performance than snip-sum and LTH-ch with regard to the number of remaining channels, as stated in the theoretical validity analysis including Lemma 1.

FIGURE 7. Fraction of remaining channels in each layer of VGG-16 for CIFAR-10 pruned by each comparing method ((a) mem-opt, (b) flop-opt, (c) s-ls-global, (d) s-global) at several overall degrees of pruning.
Moreover, comparing mem-opt and s-ls-global with regard to the number of remaining channels and the resource memory occupation size, mem-opt shows lower robustness of performance degradation when reducing the number of channels, but higher robustness when reducing the resource memory occupation size. This implies that the proposed mem-opt can select efficient channels with a high layer-wise memory reduction effect even if their channel sensitivity is high. A similar effect appears in the comparison of flop-opt and s-ls-global.
FIGURE 8. Test top-1 accuracy results of pruning methods on WRN-18 for Caltech101 with respect to (a) the number of remaining channels, (b) remaining resource memory occupation size, and (c) remaining FLOP of the pruned model.

Fig. 8 shows the results on the WRN-18 network with the Caltech101 dataset. Likewise, in these results, the proposed methods show higher robustness of performance degradation in reducing the number of channels, the resource memory occupation, and the computation overhead (FLOP). However, for reducing FLOP, mem-opt shows more robustness of performance degradation than flop-opt. This can be caused by a difference in calculating the FLOP reduction effect (23) for convolutional layers at residual links in the network, as we apply a shared output-channel masking on residual links and calculate their FLOP using only the pruning ratio of the foremost previous layer among the several candidate layers.

Evaluation on a deeper residual network (ResNet-101) with the satellite imagery dataset is also conducted. As shown in Fig. 9, the proposed methods show higher robustness of performance degradation in all reducing aspects. However,
likewise to the case of WRN-18, mem-opt and flop-opt cannot show their optimal decisions at high reduction levels of their respective computational properties, as the accumulated mis-calculation of each reduction effect becomes larger on this network with many more residual links. Even so, the proposed s-ls-global can highly endure performance degradation and can be applied instead in such cases.

FIGURE 9. Test top-1 accuracy results of pruning methods on ResNet-101 for the UC Merced satellite imagery dataset with respect to (a) the number of remaining channels, (b) remaining resource memory occupation size, and (c) remaining FLOP of the pruned model.

In Figs. 8 and 9, when pruning with only the criterion of global channel sensitivity $s^i_j$ (s-global), the layer removal occurs at an earlier sparsity than in the CIFAR-10 result, which shows a more distinct risk of early layer removal with the global channel sensitivity $s^i_j$. In the results, s-global shows robustness of task performance degradation comparable to the other proposed methods (mem-opt, flop-opt, s-ls-global) until the layer removal occurs, but the layer removal on s-global occurs even earlier than on LTH-ch or snip-sum. In contrast, by additionally considering the proposed layer-wise sensitivity, the proposed mem-opt, flop-opt, and s-ls-global show much more robustness to task performance degradation by regulating early layer removal. Moreover, in the case of s-local, early layer removal did not occur, but the task performance is mostly degraded more than with the proposed
mem-opt, flop-opt, and s-ls-global at each sparsity level, which implies the importance of considering the layer-wise computational characteristics together.

FIGURE 10. Developed prototype of the on-board system [31] with the proposed layer-wise channel pruning optimization model.
C. FEASIBILITY OF DL SERVING WITH PRUNING
1) Practical Environment of an On-board Embedded System
In order to evaluate the feasibility of the proposed methods in restricted computing environments such as a satellite on-board computing system [11], the embedded system board [31], on which inference of deep learning models can be served under low-power management, is used as the test environment. Fig. 10 shows the hardware prototype of our embedded system board [31]; it consists of an NVIDIA Jetson Nano chipset for managing the host/GPU and an ASIC chip designed to process the inference of deep learning models. In the system, 4GB of DDR4 memory is available, and the ASIC chip is prototyped under a Samsung foundry 28-nm CMOS process with 200mW power consumption and a minimum of 7.5W for the entire on-board system.
The custom ASIC chip [31] mainly consists of shared memory and a convolutional/vector processor, and it interworks with the programmable logic controller and the external memory to process the DL inference task. In the programmable logic controller, the DL inference task is partitioned into subtasks of a size that the intrinsic resources can accommodate, and the subtasks are processed sequentially. The partitioned subtasks are prepared in external memory and then transferred to the shared memory of the ASIC chip under the management of the programmable logic controller. After that, each subtask is processed by the convolutional/vector processor, where the convolutional processor supports various settings of convolutional operations in parallel using multiple array processing units, and the vector processor supports MaxPool, AvgPool, batch normalization, ReLU, and GEMM operations. After processing each subtask, the programmable logic controller loads the results from the shared memory of the ASIC chip to external memory. Further details of the prototype on-board system are provided in [31].

TABLE 2. Effect of acceleration on the inference processing time under the same batch size

                    mem     flop    s-ls    LTH-ch   Ori.
Accuracy (%)        83.33   81.90   81.43   84.29    80.00
Mem. occ. (MiB)     2,213   2,813   2,911   5,701    5,767
Latency (ms)        184     63      81      359      523
Speed up            ×2.84   ×8.27   ×6.43   ×1.46    ×1.00

TABLE 3. Effect of throughput improvement for DL serving

                    mem     flop    s-ls    LTH-ch   Ori.
Max. batch size     160     180     87      32       31
Throughput (Req/s)  85.3    258.8   198.2   44.6     30.7
Improvement         ×2.78   ×8.42   ×6.45   ×1.45    ×1.00
2) Acceleration Effect of the Proposed Methods on the Test Environment
The feasibility of the proposed methods for DL serving is also evaluated. Some of the methods evaluated in the previous subsection are applied in this evaluation, with their abbreviations denoted as in Tables 2 and 3. The test is conducted on ResNet-101 for the satellite imagery dataset, which is the most practical application, and the latency for processing inference of the pruned model is measured as the average over 100 trials on the developed on-board system (described in Section IV-C1) with a 1024x1024 RGB input image.

First, the effect of acceleration on inference processing is evaluated by setting the same batch size (=16) over the comparing methods, and the resource memory occupation size for each case is also measured. For each comparing method, the most highly pruned model that shows equal or greater performance than the original model is selected as the model to deploy on the target hardware resource. As shown in Table 2, the proposed mem-opt can achieve the
lowest resource memory occupation size while maintaining its accuracy, and the proposed flop-opt achieves the highest speed-up in inference latency under the same batch size setting by largely reducing the computation overhead. Although mem-opt shows a relatively lower speed-up in inference latency than flop-opt and s-ls-global, as it mainly targets reducing the memory occupation size, it achieves a higher speed-up than the conventional method (LTH-ch), which prunes from the pretrained state as in the lottery ticket hypothesis study [6], and it can largely reduce the resource memory occupation size while keeping the accuracy, which is more relevant for enabling deployment on restricted computing environments such as an on-board system with a severe memory occupation constraint.
The effect of the proposed methods on improving the throughput of DL serving, with the same deployed models used in Table 2, is also evaluated. For each method, the maximum batch size that can be deployed on the target hardware resource for DL inference serving is searched for by manual trials, a practice also applied in the study of DL serving [32]. Along with the maximum batch size, the throughput (the number of processed request images per second) is measured for each test case. As shown in Table 3, the proposed methods show higher improvement than the conventional pruning method of the lottery ticket hypothesis framework, and flop-opt achieves the highest throughput improvement by largely reducing the inference latency.
The acceleration effect in the training phase was already presented with Table 1 in the previous section. According to the result, the proposed method (flop-opt) also shows the smallest computing overhead in the total training phase, but it does not show as much acceleration as in inference latency, because training with the configured batch size does not fully utilize the resources of our test environment; further acceleration can be expected by enabling a larger batch size with the help of the reduced memory occupation.
V. CONCLUSION
As channel pruning generally shows more severe performance degradation than weight pruning, conventional methods have mainly addressed how to efficiently recover the performance degradation of pruning from a pretrained state, which requires heavy training overhead twice (pretraining and fine-tuning). In this paper, we propose a new layer-wise channel pruning scheme that can reflect each layer's characteristics in detail and can be applied in a single-shot manner, alleviating the computational overhead of pretraining by observing the dataset only once. Moreover, in order to improve the robustness against performance degradation, we also propose a layer-wise sensitivity for the single-shot pruning scheme, and extend it to optimization problems for the two main computational constraints in the layer-wise pruning decision, with practical methods to find the optima. In the empirical evaluation, the proposed methods show robustness against performance degradation in terms of reducing the number of channels, the resource memory occupation, and the computation overhead (FLOP) when deploying and serving a DL model. The proposed methods also show feasibility for DL serving under constrained computing environments by reducing the memory occupation, providing higher acceleration in inference latency and throughput improvement while maintaining the accuracy performance, and reducing the computing overhead of training.
REFERENCES
[1] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In
BMVC, 2016.
[2] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence, 39(6):1137–1149, 2016.
[3] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. Snip:
Single-shot network pruning based on connection sensitivity. ICLR, 2019.
[4] Hidenori Tanaka, Daniel Kunin, Daniel LK Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. NeurIPS, 2020.
[5] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. Eie: Efficient inference engine on compressed deep neural network. ACM SIGARCH Computer Architecture News, 44(3):243–254, 2016.
[6] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis:
Finding sparse, trainable neural networks. ICLR, 2019.
[7] Ruixuan Liu, Yang Cao, Hong Chen, Ruoyang Guo, and Masatoshi
Yoshikawa. Flame: Differentially private federated learning in the shuffle
model. In AAAI, 2020.
[8] Woo-Joong Kim and Chan-Hyun Youn. Lightweight online profiling-
based configuration adaptation for video analytics system in edge com-
puting. IEEE Access, 8:116881–116899, 2020.
[9] Dixian Zhu, Dongjin Song, Yuncong Chen, Cristian Lumezanu, Wei
Cheng, Bo Zong, Jingchao Ni, Takehiko Mizoguchi, Tianbao Yang, and
Haifeng Chen. Deep unsupervised binary coding networks for multivariate
time series retrieval. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 34, pages 1403–1411, 2020.
[10] Heejae Kim, Kyungchae Lee, Changha Lee, Sanghyun Hwang, and Chan-
Hyun Youn. An alternating training method of attention-based adapters
for visual explanation of multi-domain satellite images. IEEE Access,
9:62332–62346, 2021.
[11] Gianluca Giuffrida, Lorenzo Diana, Francesco de Gioia, Gionata Benelli,
Gabriele Meoni, Massimiliano Donati, and Luca Fanucci. Cloudscout: a
deep neural network for on-board cloud detection on hyperspectral images.
Remote Sensing, 12(14):2205, 2020.
[12] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and
Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks.
In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 4510–4520, 2018.
[13] Zhuoran Zhao, Kamyar Mirzazad Barijough, and Andreas Gerstlauer.
Deepthings: Distributed adaptive deep learning inference on resource-
constrained iot edge clusters. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 37(11):2348–2359, 2018.
[14] Berkin Akin, Zeshan A Chishti, and Alaa R Alameldeen. Zcomp:
Reducing dnn cross-layer memory footprint using vector extensions. In
Proceedings of the 52nd Annual IEEE/ACM International Symposium on
Microarchitecture, pages 126–138, 2019.
[15] Nathan Otterness, Ming Yang, Sarah Rust, Eunbyung Park, James H
Anderson, F Donelson Smith, Alex Berg, and Shige Wang. An evaluation
of the nvidia tx1 for supporting real-time computer-vision workloads.
In 2017 IEEE Real-Time and Embedded Technology and Applications
Symposium (RTAS), pages 353–364. IEEE, 2017.
[16] Ahmad Shawahna, Sadiq M Sait, and Aiman El-Maleh. Fpga-based
accelerators of deep learning networks for learning and classification: A
review. IEEE Access, 7:7823–7859, 2018.
[17] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz.
Pruning convolutional neural networks for resource efficient inference.
ICLR, 2017.
[18] Song Han, Huizi Mao, and William J Dally. Deep compression: Compress-
ing deep neural networks with pruning, trained quantization and huffman
coding. ICLR, 2016.
[19] Masuma Akter Rumi, Xiaolong Ma, Yanzhi Wang, and Peng Jiang. Ac-
celerating sparse cnn inference on gpus with performance-aware weight
pruning. In Proceedings of the ACM International Conference on Parallel
Architectures and Compilation Techniques, pages 267–278, 2020.
[20] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating
very deep neural networks. In Proceedings of the IEEE International
Conference on Computer Vision, pages 1389–1397, 2017.
[21] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf.
Pruning filters for efficient convnets. ICLR, 2017.
[22] Matteo Risso, Alessio Burrello, Francesco Conti, Lorenzo Lamberti, Yukai
Chen, Luca Benini, Enrico Macii, Massimo Poncino, and Daniele Jahier
Pagliari. Lightweight neural architecture search for temporal convolutional
networks at the edge. IEEE Transactions on Computers, 2022.
[23] Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian,
Saining Xie, Bichen Wu, Matthew Yu, Tao Xu, Kan Chen, et al. Fbnetv2:
Differentiable neural architecture search for spatial and channel dimen-
sions. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 12965–12974, 2020.
[24] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang,
and Edward Choi. Morphnet: Fast & simple resource-constrained structure
learning of deep networks. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1586–1595, 2018.
[25] Leslie G Valiant. A theory of the learnable. Communications of the ACM,
27(11):1134–1142, 1984.
[26] Karen Simonyan and Andrew Zisserman. Very deep convolutional net-
works for large-scale image recognition. ICLR, 2015.
[27] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of
features from tiny images. 2009.
[28] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object
categories. IEEE transactions on pattern analysis and machine intelligence,
28(4):594–611, 2006.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016.
[30] Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions
for land-use classification. In Proceedings of the 18th SIGSPATIAL
international conference on advances in geographic information systems,
pages 270–279, 2010.
[31] Taewoo Kim, Minsu Jeon, Changha Lee, Junsoo Kim, Geonwoo Ko, Joo-
Young Kim, and Chan-Hyun Youn. Federated onboard-ground station
computing with weakly supervised cascading pyramid attention network
for satellite image analysis. IEEE Access, 2022.
[32] Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong,
Matthai Philipose, Arvind Krishnamurthy, and Ravi Sundaram. Nexus:
a gpu cluster engine for accelerating dnn-based video analysis. In Pro-
ceedings of the 27th ACM Symposium on Operating Systems Principles,
pages 322–337, 2019.
MINSU JEON received the B.S. degree in elec-
tronic engineering from Sogang University in
2016. He received the M.S. degree in electri-
cal engineering from Korea Advanced Institute
of Science and Technology (KAIST), Daejeon,
South Korea in 2017. He is currently pursuing
the Ph.D degree in electrical engineering at Korea
Advanced Institute of Science and Technology
(KAIST). His research interests include deep
learning (DL) application/model, DL model com-
pression, DL serving and high performance computing system.
TAEWOO KIM received the B.S. degree in electri-
cal engineering from Kyungpook National Univer-
sity, Daegu, South Korea in 2015, and M.S. degree
in electrical engineering from Korea Advanced
Institute of Science and Technology (KAIST),
Daejeon, South Korea in 2017. He is currently
pursuing the Ph.D degree in electrical engineer-
ing at Korea Advanced Institute of Science and
Technology (KAIST). His research interests in-
clude the Deep Learning (DL) framework, GPU
computing, and interactive learning.
CHANGHA LEE received the B.S. degree in elec-
tronic engineering from Hanyang Univ., Seoul,
Korea (2018), and M.S. degree in electronic engi-
neering from Korea Advanced Institute of Science
and Technology (KAIST), Daejeon, Korea (2020).
He is currently a Ph.D. student in KAIST. Since
2018, he has been a member of Network and Computing
Laboratory in KAIST and his current research
interests include deep learning acceleration plat-
form and high performance edge-cloud computing
system.
CHAN-HYUN YOUN (S’84–M’87–SM’2019)
received the B.Sc and M.Sc degrees in Electronics
Engineering from Kyungpook National Univer-
sity, Daegu, Korea, in 1981 and 1985, respectively,
and the Ph.D. degree in Electrical and Commu-
nications Engineering from Tohoku University,
Japan, in 1994. Before joining the University, from
1986 to 1997, he was Head of the High-Speed Net-
working Team at KT Telecommunications Net-
work Research Laboratories, where he was
involved in the research and development of centralized switching main-
tenance systems, high-speed networking, and ATM networks. Since 1997,
he has been a Professor at the School of Electrical Engineering in Korea
Advanced Institute of Science and Technology (KAIST), Daejeon, Korea.
He was an Associate Vice-President of office of planning and budgets in
KAIST from 2013 to 2017. He is also a Director of Grid Middleware
Research Center and XAI Acceleration Technology Research Center at
KAIST, where he is developing core technologies that are in the areas
of high performance computing, explainable AI system, satellite imagery
analysis, AI acceleration system and others. He was a general chair for
the 6th EAI International Conference on Cloud Computing (Cloud Comp
2015), KAIST, in 2015. He wrote a book on Cloud Broker and Cloudlet for
Workflow Scheduling, Springer, in 2017. Dr. Youn also was a Guest Editor
of IEEE Wireless Communications in 2016, and has served as a TPC
member for many international conferences.