
# Importance-Aware Filter Selection for Convolutional Neural Network Acceleration


## Abstract and Figures

Convolutional Neural Networks (CNNs) are widely used in many fields, including artificial intelligence, computer vision and video coding. However, CNNs are typically over-parameterized and contain significant redundancy. Traditional model acceleration methods mainly rely on specific manual rules, which usually leads to sub-optimal results with a relatively limited compression ratio. Recent works have deployed a self-learning agent for layer-level acceleration, but still combine it with human-designed criteria. In this paper, we propose a filter-based model acceleration method that directly and automatically decides which filters should be pruned, using the reinforcement learning method DDPG. We design a novel reward function with the reward shaping technique for the training process. Our method is applied to models trained on the MNIST and CIFAR-10 datasets and simultaneously achieves a higher acceleration ratio and less accuracy loss than the conventional methods.
Zikun Liu, Zhen Chen†∗, Weiping Li
Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China
Department of Electrical Engineering, City University of Hong Kong, Hong Kong SAR, China
zikunliu6@gmail.com, zchen.ee@my.cityu.edu.hk, wpli@ustc.edu.cn
Index Terms: Deep learning, Model acceleration, CNNs acceleration, Reinforcement learning
I. INTRODUCTION
Recently, the general trend in CNNs is that models have become deeper, with an overall increase in parameters and computation costs. These models typically consume significant training time and overhead. Large CNN models cannot fit easily in on-chip storage and demand costly memory accesses. As a result, the original cumbersome networks have difficulty meeting the resource restrictions of mobile devices and embedded systems [1]. For example, the VGG-16 parameters occupy more than 500 MB of storage space and require 30.9 billion floating-point operations (FLOPs) to process a single image with resolution 224×224 [2].
Many works focus on model acceleration and compression in order to reduce computation and storage costs. Recent years have witnessed progress in hardware acceleration methods based on FPGAs [3] and custom accelerators specialized for different neural networks [4]. On the other hand, Tai et al. used low-rank tensor decomposition to remove the redundancy in the kernels [5]. Li et al. performed compression at the channel level based on the magnitude of the kernel weights [6], and Chen et al. combined channel pruning with low-rank decomposition [7]. These conventional acceleration methods require domain experts to design adequate heuristics with high-level mathematical approaches to trade off accuracy against the acceleration rate.
Meanwhile, recent advances in deep reinforcement learning have made it possible to extract high-level features from raw data and accomplish a series of difficult tasks. For example, the Deep Q-Network (DQN) can play Atari games using only raw pixels as input [8]. However, such deep reinforcement learning agents have difficulty dealing with continuous action spaces. Deep Deterministic Policy Gradient (DDPG) achieves continuous control by applying deep neural networks to the actor-critic method, where both the actor and the critic are represented by neural networks [9].
By modeling network pruning as a Markov decision process, the aforementioned reinforcement learning methods can be leveraged to prune neural networks and improve model acceleration efficiency. Conventional pruning methods mainly focus on various characteristics of the filters and layers, and most strategies ignore issues such as the fact that each layer should be treated separately with its own pruning method. He et al. used a reinforcement learning agent to produce the pruning ratio for each layer, but their approach still needs conventional heuristics to execute the pruning process [10].
In this paper, we propose a fully automatic method without any manual rules. We leverage the DDPG agent to determine the pruning strategy automatically at the filter level by directly assigning an importance score to each filter. This importance score is derived from the current filter information, and the actor network of DDPG decides whether the filter should be pruned. The critic network acts as an auxiliary that helps rectify wrong actions. After giving the current filter an importance value and receiving a reward, the agent moves on to the next filter, and the whole process is iterative. To limit time consumption, we update the agent every few steps and evaluate the current state without fine-tuning. This simple approach reduces the exploration time and yields the current model condition promptly. After training, we take the best agent and test it on the pruning task.
We evaluate our proposed method on two datasets and demonstrate its competitive performance compared with the conventional methods. Our method achieves 21.80× and 37.09× acceleration with a slight accuracy drop on MNIST and CIFAR-10, respectively.
II. PROPOSED TECHNIQUE
Since man-made heuristics are hard to design, we use the DDPG agent to automatically decide which filters should be pruned in the original model. The decision depends on the importance of each filter as estimated by the actor network $\mu(s \mid \theta^{\mu})$, where $s$ denotes the observation and $\theta^{\mu}$ denotes the parameters of the actor network. We let $Q(s, a \mid \theta^{Q})$ be the critic network, which guides $\mu$. During training, if the actor network gives the pruning action, we immediately discard the filter and compute the test error without fine-tuning. The reward is computed from the current FLOPs ratio and error. After a few steps, we update the agent networks to encourage a higher acceleration ratio and a lower test error [9]. The procedure is shown in Fig. 1.
Fig. 1. Demonstration of a pruning step. The two blue blocks are the environment and the two red blocks are the DDPG agent. First, we obtain the observation of the current filter, then feed it into the agent and get an action. Pruning yields a reward and a new observation, and the reward is then used to update the agent.
A. Observation space
The observation, i.e., the input of the DDPG agent, is a vector with 17 dimensions. At each step, the observation is updated according to the current filter and network condition. The observation at step $t$ can be represented as $s_t$:
$$s_t = (l_{id}, f_{id}, o, i, k_1, k_2, \mathit{stride}, \mathit{padding}, \mathit{size}_1, \mathit{size}_2, l_1, l_2, \mathit{Mean}, \mathit{Std}, \mathit{APoZ}, \mathit{Param}, \mathit{Flops}) \tag{1}$$
where $l_{id}$ refers to the layer of the current filter and $f_{id}$ is the filter index in the current layer. $o$, $i$, $k_1$ and $k_2$ correspond to the output channels, the input channels, and the height and width of the kernel. $\mathit{size}_1$ and $\mathit{size}_2$ refer to the height and width of the input matrices. $l_1$ and $l_2$ represent the L1-norm and L2-norm of the filter weights, respectively [6]. $\mathit{Mean}$ and $\mathit{Std}$ are the average and the standard deviation of the filter weights. $\mathit{APoZ}$ [11] is the average percentage of zeros, i.e., the percentage of zero activations of the neurons after the ReLU function. $\mathit{Param}$ is the ratio of the current parameters to the original network parameters. $\mathit{Flops}$ is the ratio of the remaining floating-point operations, as shown in (3). At the beginning of training, $\mathit{Param}$ and $\mathit{Flops}$ are both initialized to 1 because nothing has been pruned at the first step; they then decay slowly but never fall below 0.
The 17-dimensional observation contains information not only from the filter itself but also from its layer, which provides comprehensive guidance to the actor. Moreover, the observation vector includes several dimensions taken from existing handmade heuristics, giving well-rounded information.
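As an illustration of how the observation in (1) could be assembled for one filter, the following is a minimal sketch in plain Python. The function name `filter_observation`, its argument layout, and the use of the population standard deviation are assumptions made for this example; the APoZ value and the two ratios are taken as precomputed inputs.

```python
import math
from statistics import mean, pstdev


def filter_observation(layer_id, filter_id, weights, in_size, stride, padding,
                       out_channels, apoz, param_ratio, flops_ratio):
    """Build a 17-dimensional observation for one filter, in the order of Eq. (1).

    weights is a nested list of shape [i][k1][k2] holding the filter weights.
    in_size is the (height, width) of the layer input.
    """
    in_channels = len(weights)
    k1 = len(weights[0])
    k2 = len(weights[0][0])
    flat = [w for channel in weights for row in channel for w in row]
    l1 = sum(abs(w) for w in flat)              # L1-norm of the filter weights
    l2 = math.sqrt(sum(w * w for w in flat))    # L2-norm of the filter weights
    return [
        layer_id, filter_id, out_channels, in_channels, k1, k2,
        stride, padding, in_size[0], in_size[1],
        l1, l2, mean(flat), pstdev(flat),       # pstdev is an assumed choice
        apoz, param_ratio, flops_ratio,
    ]


# Example: a single-channel 2x2 filter in the first layer, before any pruning.
obs = filter_observation(1, 1, [[[1.0, -1.0], [2.0, -2.0]]], (28, 28), 1, 1,
                         out_channels=64, apoz=0.3,
                         param_ratio=1.0, flops_ratio=1.0)
```

In practice the raw entries have very different scales, so some normalization before feeding them to the actor would probably be needed; the text does not specify one.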
B. Action
The action represents an importance score, which is a scalar scaled within [-1, 1]. In the training process, we first get the current observation $s_t$ and feed it into the DDPG agent, which in turn gives us an action $a_t$. Then the environment is updated, giving the new observation $s_{t+1}$, the reward and some other information. Note that pruning happens in the environment update and the actor only provides the score. The pruning strategy is to discard the filters with scores above zero and keep the others unchanged. We do not use a discrete action space with DQN because, with a continuous action space, the fine-tune process can prune the model based on the rank of the filter actions, and the DDPG agent achieves good results, as shown in the experiments. There are also methods based on structured probabilistic pruning [12]; in that case, the reward should be scaled into [0, 1]. Considering its convergence speed and unstable performance, we do not adopt it in practice; it could be explored further in the future.
To encourage exploration, we add Ornstein-Uhlenbeck (O-U) noise with a decaying variance to the actions during training [9].
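The action step described in this subsection can be sketched as follows. This is only an illustrative sketch: the class `OUNoise`, the mean-reversion rate `theta=0.15`, the unit time step, and the clipping of the noisy score to [-1, 1] are assumptions for the example, while the initial standard deviation of 0.5 and the rule of pruning when the score is above zero come from the text.

```python
import random


class OUNoise:
    """Discrete-time Ornstein-Uhlenbeck noise with an adjustable sigma."""

    def __init__(self, theta=0.15, sigma=0.5, mu=0.0, seed=None):
        self.theta, self.sigma, self.mu = theta, sigma, mu
        self.state = mu
        self.rng = random.Random(seed)

    def sample(self):
        # Pull the state back toward mu and add a Gaussian perturbation.
        drift = self.theta * (self.mu - self.state)
        self.state += drift + self.sigma * self.rng.gauss(0.0, 1.0)
        return self.state


def noisy_action(actor_score, noise):
    """Add exploration noise and clip the result to the action range [-1, 1]."""
    return max(-1.0, min(1.0, actor_score + noise.sample()))


def should_prune(action):
    """Pruning rule from the text: discard the filter when its score is above zero."""
    return action > 0.0


noise = OUNoise(sigma=0.5, seed=0)
action = noisy_action(0.3, noise)
decision = should_prune(action)
```

A decaying variance can be obtained by lowering `noise.sigma` between episodes, for example by multiplying it by a constant factor below 1.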
C. Reward Function
In practice, we noticed that naive reward functions, such as linear or log functions, easily trigger two undesirable outcomes: (1) The agent falls into a stationary state, which means it tends to preserve the filters in order to avoid the reward penalty. (2) The agent cannot achieve a trade-off between accuracy and acceleration, as it lacks the ability to identify the vital filters in the network.
Notice that the distribution of the absolute values of the filters is extremely uneven, with great sparsity. We were inspired by the loss functions of object detection, which also faces a data imbalance issue. Focal loss can focus learning on hard examples and address the imbalanced distribution of the observations [13]. On this basis, the shrinkage loss reconstructs the focal loss with a sigmoid function and converges relatively faster [14]. We design the new reward function $r$ by reshaping the shrinkage loss with the $l_2$ norm and reversing its monotonicity, as shown in (2):
$$r = \frac{2}{1 + \exp\left(a\sqrt{(\mathit{Flops} - b)^2 + \mathit{Error}^2}\right)}, \tag{2}$$
where $\mathit{Flops}$ is the FLOPs ratio and $\mathit{Error}$ is the current test error ratio, as shown in (3) and (4), and $a$ and $b$ are hyper-parameters adjusting the convergence speed and localization. We design this goal-driven reward function and let $b$ be the approximate final acceleration goal. Fig. 2 shows some examples of the reward function.
$$\mathit{Flops} = \frac{\mathit{FLOPs}_{current}}{\mathit{FLOPs}_{orig}} \tag{3}$$
Fig. 2. Reward function with different parameters $a$ and $b$. Panels: (a) $a=2$, $b=0$; (b) $a=3.5$, $b=0$; (c) $a=3.5$, $b=0.3$; (d) $a=6$, $b=0.3$. From (a) and (b), we can see that a bigger $a$ leads to a steeper shape. From (a) and (c), we can see that $b$ controls the peak position of the reward value.
$$\mathit{Error} = 1 - \frac{\mathit{Acc}_{current}}{\mathit{Acc}_{orig}} \tag{4}$$
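To make the shape of the reward concrete, the sketch below evaluates (2) to (4) in plain Python, reading (2) as r = 2 / (1 + exp(a * sqrt((Flops - b)^2 + Error^2))). The function names are chosen for this example only. Under this reading the reward peaks at 1 when the FLOPs ratio equals the goal b and the error is zero, and it decays toward 0 as the model moves away from that point.

```python
import math


def flops_ratio(flops_current, flops_orig):
    """Eq. (3): fraction of the original FLOPs that remains."""
    return flops_current / flops_orig


def error_ratio(acc_current, acc_orig):
    """Eq. (4): relative accuracy loss with respect to the original model."""
    return 1.0 - acc_current / acc_orig


def base_reward(flops, error, a, b):
    """Eq. (2): sigmoid-shaped reward centred on the acceleration goal b."""
    distance = math.sqrt((flops - b) ** 2 + error ** 2)
    return 2.0 / (1.0 + math.exp(a * distance))


# With the MNIST hyper-parameters from the text (a = 1, b = 0.1):
at_goal = base_reward(flops=0.1, error=0.0, a=1.0, b=0.1)
unpruned = base_reward(flops=1.0, error=0.0, a=1.0, b=0.1)
```

Here `at_goal` is the maximum reward and `unpruned` is the lower reward of a model that has not been pruned at all, which is what pushes the agent away from preserving every filter.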
Recent works have shown that reward shaping is an effective technique for applying reinforcement learning in a complicated environment [15]. We further add the reward shaping function $R_s$ to (2) to stimulate compression while preserving accuracy. Table I shows our strategy. The final reward is then rewritten as (5), where $\gamma$ is the proportion of the reward shaping.
$$\mathit{reward} = r + \gamma R_s \tag{5}$$
TABLE I
REWARD SHAPING FUNCTION UNDER DIFFERENT ENVIRONMENTS

| Environment | Description | $R_s$ |
|---|---|---|
| pruning | reward when pruning a filter | +0.05 |
| preserving | penalize when protecting the filter | -0.02 |
| fluctuating | penalize when the error increases | -(error change) × 10 |
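One possible reading of Table I and (5) is sketched below. The text does not say whether the three cases are exclusive, so this example assumes that the pruning or preserving term is always applied and that the fluctuation penalty is added on top whenever the error increases. The function names are hypothetical.

```python
def shaping_term(pruned, error_before, error_after):
    """Reward shaping R_s following Table I (assumed to be additive)."""
    rs = 0.05 if pruned else -0.02          # pruning bonus or preserving penalty
    error_change = error_after - error_before
    if error_change > 0:                    # penalize only when the error increases
        rs -= 10.0 * error_change
    return rs


def total_reward(r, rs, gamma=1.0):
    """Eq. (5): final reward with shaping proportion gamma."""
    return r + gamma * rs


# Pruning a filter that raises the error ratio from 0.01 to 0.02:
rs = shaping_term(pruned=True, error_before=0.01, error_after=0.02)
reward = total_reward(r=0.8, rs=rs, gamma=1.0)
```

In this example the pruning bonus of 0.05 is outweighed by the fluctuation penalty of about 0.1, so the shaping term is negative.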
D. Fine-tune Process
After the training process, we let the DDPG agent prune the model and put the scores of all filters into a list. We then sort the list and choose a proper ratio, which denotes the intended proportion of filters to be discarded. After retraining the pruned network to recover the accuracy, we obtain the compressed model.
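The ranking step can be sketched as follows. The helper `select_filters_to_prune` and the dictionary layout are assumptions for this example. Since the pruning rule discards filters with scores above zero, the sketch assumes that the highest-scoring filters are the ones to discard first.

```python
def select_filters_to_prune(scores, prune_ratio):
    """Return the set of (layer_id, filter_id) keys to discard.

    scores maps (layer_id, filter_id) to the score given by the agent;
    prune_ratio is the proportion of all filters to discard.
    """
    n_prune = int(len(scores) * prune_ratio)
    ranked = sorted(scores, key=scores.get, reverse=True)  # highest score first
    return set(ranked[:n_prune])


scores = {(1, 1): 0.9, (1, 2): -0.3, (2, 1): 0.2, (2, 2): -0.8}
to_prune = select_filters_to_prune(scores, prune_ratio=0.5)
```

Ranking by score rather than thresholding at zero lets the final compression ratio be chosen after training, which is the advantage of the continuous action space noted in the Action subsection.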
III. EXPERIMENT
This section contains three parts: the implementation details, the experiment on the MNIST dataset, and the experiment on the CIFAR-10 dataset to further validate the performance. We use the PyTorch framework to implement network pruning and evaluate on an NVIDIA GTX 1080 Ti GPU.
Algorithm 1 Model compression on the filter level using DDPG agent
Randomly initialize actor network $\mu(s \mid \theta^{\mu})$ and critic network $Q(s, a \mid \theta^{Q})$
Import the model $M$ to be pruned
Initialize the buffer $B$ and the training interval $\tau$
for $ep = 1 : \mathit{MaxEpisode}$ do
    Initialize a random process $N$ as noise for action exploration
    Initialize the observation of the first filter as $s_1$
    $l_{id} = 1$, $f_{id} = 1$
    $\mathit{total\_reward}_{ep} = 0$
    for $t = 1 : \mathit{MaxSteps}$ do
        Get the action $a_t = \mu(s_t \mid \theta^{\mu}) + N_t$
        Operate pruning on the current filter with index coordinates $(l_{id}, f_{id})$
        Observe the new $\mathit{reward}_t$ and new state $s_{t+1}$
        Update $l_{id}$, $f_{id}$
        Store transition $(s_t, a_t, \mathit{reward}_t, s_{t+1})$ in $B$
        if $t \bmod \tau = 0$ then
            Update $\mu(s \mid \theta^{\mu})$ and $Q(s, a \mid \theta^{Q})$ with samples from $B$
        end if
        $\mathit{total\_reward}_{ep} \mathrel{+}= \mathit{reward}_t$
    end for
    Reset $M$
end for
Select the agent with the highest $\mathit{total\_reward}_{ep}$ and fine-tune
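The control flow of Algorithm 1 can be sketched in Python as below. This is a skeleton only: the `agent` and `env` interfaces (`act`, `update`, `save_checkpoint`, `reset`, `step`) are hypothetical, the batch size is an assumed parameter, and the toy classes stand in for the real DDPG agent and pruning environment so that the loop can be run on its own.

```python
import random
from collections import deque


def train_agent(agent, env, max_episodes, max_steps, tau,
                buffer_size=50000, batch_size=64):
    """Run the episode loop and remember the episode with the highest total reward."""
    buffer = deque(maxlen=buffer_size)
    best_reward, best_episode = float("-inf"), None
    for episode in range(max_episodes):
        state = env.reset()                  # reset the model, go to the first filter
        total_reward = 0.0
        for t in range(1, max_steps + 1):
            action = agent.act(state)        # actor output plus exploration noise
            next_state, reward, done = env.step(action)   # pruning happens here
            buffer.append((state, action, reward, next_state))
            if t % tau == 0 and len(buffer) >= batch_size:
                agent.update(random.sample(list(buffer), batch_size))
            total_reward += reward
            state = next_state
            if done:                         # no filters left to visit
                break
        if total_reward > best_reward:
            best_reward, best_episode = total_reward, episode
            agent.save_checkpoint()          # keep the best agent for fine-tuning
    return best_episode, best_reward


class ToyEnv:
    """Stand-in environment: reward 1 for each pruned filter, no real model."""

    def __init__(self, n_filters):
        self.n_filters = n_filters
        self.index = 0

    def reset(self):
        self.index = 0
        return self.index

    def step(self, action):
        reward = 1.0 if action > 0 else 0.0
        self.index += 1
        return self.index, reward, self.index >= self.n_filters


class ConstantAgent:
    """Stand-in agent that always outputs the same score and counts its calls."""

    def __init__(self, score=0.5):
        self.score, self.updates, self.checkpoints = score, 0, 0

    def act(self, state):
        return self.score

    def update(self, batch):
        self.updates += 1

    def save_checkpoint(self):
        self.checkpoints += 1


toy_agent = ConstantAgent()
best_episode, best_reward = train_agent(toy_agent, ToyEnv(4), max_episodes=2,
                                        max_steps=10, tau=2, batch_size=2)
```

A real environment would prune the filter, evaluate the test error without fine-tuning, and compute the reward inside `step`, as described in Section II.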
A. Implementation Details
We first apply our method to the MNIST dataset. The original model to be compressed is a five-layer convolutional neural network. The numbers of filters in the layers are [64, 64, 128, 256, 256], and every layer is followed by a batch normalization layer and a ReLU layer. For the CIFAR-10 dataset, we use a VGG-16-like model. These original models are trained using Adam with batch size 128. The momentum is 0.9 and the weight decay is 0.0001. The test accuracy is 99.46% and 90.8% on MNIST and CIFAR-10, respectively. The actor and critic networks are each composed of three fully connected layers [9]. The learning rates of the actor and critic networks are 0.0001 and 0.001, respectively. The noise standard deviation is initialized to 0.5 for the first 100 epochs and then decays exponentially. The buffer size of the DDPG agent is 50000. In the fine-tune process, we use a filter-decayed pruning method in practice: it first cuts half of the filters with one of the four methods in the table and retrains the network about 10 times with learning rate 0.01, then cuts the next quarter of the filters and retrains 20 times with a lower learning rate of 0.001, and the rest proceeds in the same manner.
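The filter-decayed schedule can be illustrated with a short sketch. It assumes that "half" and "the next quarter" refer to fractions of the filters selected for removal, and that the last stage removes whatever is left; the function name and the fixed number of stages are choices made for this example.

```python
def decayed_prune_schedule(n_to_prune, n_stages):
    """Split n_to_prune filters into stages of about 1/2, 1/4, 1/8, ... of the total."""
    counts = []
    remaining = n_to_prune
    for stage in range(n_stages):
        if stage == n_stages - 1:
            count = remaining             # last stage removes whatever is left
        else:
            count = (remaining + 1) // 2  # half of what remains, rounded up
        counts.append(count)
        remaining -= count
    return counts


stages = decayed_prune_schedule(n_to_prune=100, n_stages=4)
```

Each stage would be followed by retraining, with the retraining length and learning rate following the values given in the text for the first two stages.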
B. Experiment on MNIST Dataset
We compared our method with several conventional compression methods based on handmade heuristics on the same original neural network. The hyper-parameters of the reward are chosen as follows: $b = 0.1$, $a = 1$, $\gamma = 1$. The results are shown in Table II, where the random method prunes the filters randomly. We can see that our method has both higher accuracy and a higher acceleration rate than the other handcrafted methods.

TABLE II
COMPARISON OF OUR METHOD WITH THE CONVENTIONAL METHODS ON THE MNIST DATASET

| Method | FLOPs | Acceleration | Accuracy | ΔAcc (%) |
|---|---|---|---|---|
| Random | 6.68×10^6 | 13.24× | 98.52% | -0.92 |
| L1-norm [6] | 4.49×10^6 | 19.69× | 99.03% | -0.43 |
| Low-rank [5] | 4.68×10^6 | 18.87× | 98.94% | -0.52 |
| Taylor [16] | 4.39×10^6 | 20.15× | 99.15% | -0.31 |
| Importance-Aware | 4.06×10^6 | 21.80× | 99.26% | -0.20 |

Fig. 3. Comparison of the L1-norm, Taylor, and our method in the remaining filter ratio of each layer. The acceleration rates of the three methods are all 20×.
To show the details of the pruned model, we compute the remaining filter ratio of each layer for three different methods. Fig. 3 shows that our method prunes more effectively than the other two methods, especially in the first few layers, which directly leads to a higher acceleration rate. The L1-norm and Taylor methods are conservative at the second layer, where our agent finds a better pruning strategy. This is largely due to the self-learning DDPG agent, which gives an accurate importance for each filter according to the current environment.
C. Experiment on CIFAR-10 Dataset
We further compared our method with three conventional methods on the CIFAR-10 dataset. The hyper-parameters of the reward are chosen as: $b = 0.2$, $a = 2$, $\gamma = 1$. The learning rates and the noise parameters are the same as in the MNIST training process. Table III shows that our method achieves a higher acceleration ratio and accuracy. Moreover, pruning filters with our method also gives a good compression result: we achieved 28.59× and 49.07× compression ratios with 88.14% and 87.97% accuracy, respectively.

TABLE III
COMPARISON OF OUR METHOD WITH THE CONVENTIONAL METHODS ON THE CIFAR-10 DATASET

| Method | FLOPs | Acceleration | Accuracy | ΔAcc (%) |
|---|---|---|---|---|
| Random | 6.1×10^7 | 9.16× | 86.25% | -4.59 |
| L1-norm | 3.7×10^7 | 15.45× | 87.69% | -3.15 |
| Low-rank | 4.3×10^7 | 12.53× | 85.85% | -4.99 |
| Taylor | 3.0×10^7 | 19.20× | 86.32% | -4.52 |
| Importance-Aware | 1.89×10^7 | 37.09× | 88.09% | -2.75 |

IV. CONCLUSION
In this paper, we introduce a fine-grained, fully automatic method to directly prune filters according to their importance scores. We used the DDPG agent and designed a novel reward function with the reward shaping technique. The experiments on two datasets demonstrate the competitive performance of our method. In the future, we will evaluate our method further on larger datasets and apply it to video processing.
REFERENCES
[1] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[3] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “Cnp: An fpga-based processor for convolutional networks,” in 2009 International Conference on Field Programmable Logic and Applications. IEEE, 2009, pp. 32–37.
[4] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ACM Sigplan Notices, vol. 49, no. 4. ACM, 2014, pp. 269–284.
[5] C. Tai, T. Xiao, Y. Zhang, X. Wang et al., “Convolutional neural networks with low-rank regularization,” arXiv preprint arXiv:1511.06067, 2015.
[6] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets.”
[7] Z. Chen, J. Lin, S. Liu, Z. Chen, W. Li, J. Zhao, and W. Yan, “Exploiting weight-level sparsity in channel pruning with low-rank approximation,” in 2019 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2019, pp. 1–5.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[9] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
[10] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 784–800.
[11] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, “Network trimming: A data-driven neuron pruning approach towards efficient deep architectures,” arXiv preprint arXiv:1607.03250, 2016.
[12] H. Wang, Q. Zhang, Y. Wang, and H. Hu, “Structured probabilistic pruning for convolutional neural network acceleration,” arXiv preprint arXiv:1709.06994, 2017.
[13] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[14] X. Lu, C. Ma, B. Ni, X. Yang, I. Reid, and M.-H. Yang, “Deep regression tracking with shrinkage loss,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 353–369.
[15] Y. Wu and Y. Tian, “Training agent for first-person shooter game with actor-critic curriculum learning,” 2016.
[16] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient inference,” arXiv preprint arXiv:1611.06440, 2016.