
Importance-Aware Filter Selection for Convolutional Neural Network Acceleration

Zikun Liu∗, Zhen Chen†∗, Weiping Li∗

∗Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, China

†Department of Electrical Engineering, City University of Hong Kong, Hong Kong SAR, China

zikunliu6@gmail.com, zchen.ee@my.cityu.edu.hk, wpli@ustc.edu.cn

Abstract—Convolutional Neural Networks (CNNs) are widely used in many fields, including artificial intelligence, computer vision, and video coding. However, CNNs are typically over-parameterized and contain significant redundancy. Traditional model acceleration methods mainly rely on specific manual rules, which usually leads to sub-optimal results with relatively limited compression ratios. Recent works have deployed self-learning agents for layer-level acceleration but still combine them with human-designed criteria. In this paper, we propose a filter-based model acceleration method that directly and automatically decides which filters should be pruned, using the reinforcement learning method DDPG. We design a novel reward function with the reward shaping technique for the training process. Our method is applied to models trained on the MNIST and CIFAR-10 datasets and simultaneously achieves a higher acceleration ratio and less accuracy loss than conventional methods.

Index Terms—Deep learning, Model acceleration, CNN acceleration, Reinforcement learning

I. INTRODUCTION

Recently, the general trend of CNNs is that models have become deeper, with an overall increase in parameters and computation cost. These models typically consume significant training time and resources. Large CNN models cannot easily fit in on-chip storage and demand costly memory accesses. As a result, the original burdensome networks struggle to meet the resource constraints of mobile devices and embedded systems [1]. For example, the VGG-16 parameters occupy more than 500 MB of storage, and the network requires 30.9 billion floating-point operations (FLOPs) to process a single image with resolution 224×224 [2].

Many works focus on model acceleration and compression in order to reduce computation and storage costs. Recent years have witnessed progress in hardware acceleration methods based on FPGAs [3] and custom accelerators specialized for different neural networks [4]. On the algorithm side, low-rank tensor decomposition has been used to remove redundancy in the kernels [5]. Li et al. compressed networks at the channel level based on the magnitude of the kernel weights [6], and Chen et al. combined channel pruning with low-rank decomposition [7]. These conventional acceleration methods require domain experts to design adequate heuristics with high-level mathematical approaches to trade off accuracy against acceleration rate.

In fact, recent advances in deep reinforcement learning have made it possible to extract high-level features from raw data and accomplish a series of difficult tasks. For example, a Deep Q-Network (DQN) can learn to play Atari games using only raw pixels as input [8]. However, such deep reinforcement learning agents have difficulty dealing with continuous action spaces. Deep Deterministic Policy Gradient (DDPG) achieves continuous control by building on the actor-critic method, where both the policy and the action-value function are represented by deep neural networks [9].

By modeling network pruning as a Markov decision process, the aforementioned reinforcement learning methods can be leveraged to prune neural networks and improve model acceleration efficiency. Conventional pruning methods mainly focus on various characteristics of the filters and layers, and most strategies ignore the fact that each layer should be treated individually with its own pruning strategy. He et al. used a reinforcement learning agent to produce a pruning ratio for each layer, but it still needs conventional heuristics to execute the pruning process [10].

In this paper, we propose a fully automatic method without any manual rules. We leverage the DDPG agent to determine the pruning strategy automatically at the filter level by directly estimating the importance of each filter. This importance score is computed from the current filter information, and the actor network of DDPG decides whether the filter should be pruned. The critic network acts as an auxiliary that helps rectify wrong actions. After assigning the current filter an importance value and receiving a reward, the agent moves to the next filter. The whole process is iterative. To limit time consumption, we update the agent every few steps and evaluate the current state without fine-tuning. This simple approach reduces exploration time and provides the instantaneous model condition promptly. After training, we take the best agent and apply it to the pruning task.

We evaluate the proposed method on two datasets and demonstrate its competitive performance compared with conventional methods. Our method achieves 21.80× and 37.09× acceleration with a slight accuracy drop on MNIST and CIFAR-10, respectively.

II. PROPOSED TECHNIQUE

Since man-made heuristics are hard to design, we use the DDPG agent to automatically decide which filters should be pruned in the original model. This decision depends on the importance of each filter estimated by the actor network µ(s|θ^µ), where s denotes the observation and θ^µ denotes the parameters of the actor network. We let Q(s, a|θ^Q) be the critic network, which gives instructions to µ. During training, if the actor network outputs a pruning action, we immediately discard the corresponding filter and compute the test error without fine-tuning. The reward is computed from the current FLOPs ratio and test error. Every few steps, we update the agent networks, encouraging a higher acceleration ratio and a lower test error [9]. The procedure is shown in Fig. 1.

Fig. 1. Demonstration of a pruning step. The two blue blocks form the environment and the two red blocks form the DDPG agent. First, the observation of the current filter is obtained and fed into the agent, which outputs an action. Pruning then yields a reward and a new observation, and the reward is used to update the agent.

A. Observation space

The observation, i.e., the input of the DDPG agent, is a vector with 17 dimensions. At each step, the observation is updated according to the current filter and the network condition. The observation at step t is represented as s_t:

s_t = (l_{id}, f_{id}, o, i, k_1, k_2, \mathit{stride}, \mathit{padding}, \mathit{size}_1, \mathit{size}_2, l_1, l_2, \mathit{Mean}, \mathit{Std}, \mathit{APoZ}, \mathit{Param}, \mathit{Flops})    (1)

where l_{id} refers to the layer index of the current filter and f_{id} is the filter index within that layer. o, i, k_1, and k_2 correspond to the number of output channels, the number of input channels, and the height and width of the kernel. size_1 and size_2 refer to the height and width of the input feature map. l_1 and l_2 represent the L1-norm and L2-norm of the filter weights, respectively [6]. Mean and Std are the average and the standard deviation of the filter weights. APoZ [11] is the average percentage of zeros, i.e., the percentage of zero activations of the neurons after the ReLU function. Param is the ratio of the current number of parameters to that of the original network. Flops is the ratio of the remaining floating-point operations, as defined in (3). At the beginning of training, Param and Flops are both initialized to 1 because nothing has been pruned yet; they then decay slowly while always remaining non-negative.

The 17-dimensional observation contains information not only from the filter itself but also from its layer, which provides comprehensive guidance to the actor. Moreover, the observation vector includes several dimensions taken from existing handcrafted heuristics, giving well-rounded information.
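To make the observation concrete, the following is a minimal PyTorch sketch of how such a 17-dimensional vector could be assembled for one filter of a Conv2d layer. The arguments apoz, param_ratio, and flops_ratio are assumed to be tracked by the pruning environment, and the exact ordering and scaling are illustrative rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

def filter_observation(conv: nn.Conv2d, layer_id, filter_id, in_h, in_w,
                       apoz, param_ratio, flops_ratio):
    """Assemble the 17-d observation s_t for one filter (a sketch of Eq. (1)).

    `apoz`, `param_ratio`, and `flops_ratio` are assumed to be maintained by the
    pruning environment and updated after every pruning step.
    """
    w = conv.weight[filter_id]                       # shape: (in_channels, k1, k2)
    k1, k2 = conv.kernel_size
    obs = torch.tensor([
        layer_id, filter_id,                         # l_id, f_id
        conv.out_channels, conv.in_channels,         # o, i
        k1, k2,                                      # kernel height and width
        conv.stride[0], conv.padding[0],             # stride, padding
        in_h, in_w,                                  # size_1, size_2 of the input
        w.abs().sum().item(),                        # l_1: L1-norm of the filter
        w.norm(p=2).item(),                          # l_2: L2-norm of the filter
        w.mean().item(), w.std().item(),             # Mean, Std of the weights
        apoz,                                        # average percentage of zeros
        param_ratio, flops_ratio,                    # remaining Param / FLOPs ratios
    ], dtype=torch.float32)
    return obs                                       # 17 dimensions
```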

B. Action

The action represents an importance score, which is a scalar within [-1, 1]. In the training process, we first get the current observation s_t and feed it into the DDPG agent, which in turn gives us an action a_t. The environment is then updated, yielding the new observation s_{t+1}, the reward, and other information. Note that the pruning happens during the environment update; the actor only provides the score. The pruning strategy is to discard the filters with scores above zero and keep the others unchanged. We do not use a discrete action space with DQN because, with a continuous action space, the fine-tuning process can prune the model based on the ranking of the filter actions, and the DDPG agent achieves good results, as shown in the experiments. There are also methods based on structured probabilistic pruning [12]; in that case the score should be scaled into [0, 1]. Considering the convergence speed and the unstable performance, we do not adopt it in practice, although it could be explored in the future.

To encourage exploration, we add Ornstein-Uhlenbeck (O-U) noise to the actions during training, with a decaying variance [9].
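As an illustration of the actor and the exploration noise, here is a minimal sketch assuming a three-layer fully connected actor (the hidden width of 64 is an assumption) whose tanh output gives the importance score in [-1, 1], together with a simple O-U noise process with decaying standard deviation; filters scoring above zero are marked for pruning.

```python
import numpy as np
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Three fully connected layers mapping the 17-d observation to a score in [-1, 1]."""
    def __init__(self, obs_dim=17, hidden=64):        # hidden width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh())

    def forward(self, obs):
        return self.net(obs)

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise with a decaying standard deviation."""
    def __init__(self, sigma=0.5, theta=0.15, decay=0.995):
        self.x, self.sigma, self.theta, self.decay = 0.0, sigma, theta, decay

    def sample(self):
        self.x += -self.theta * self.x + self.sigma * np.random.randn()
        return self.x

    def decay_sigma(self):
        self.sigma *= self.decay                       # shrink exploration over time

actor, noise = Actor(), OUNoise()
obs = torch.zeros(17)                                  # placeholder observation s_t
a_t = float(actor(obs)) + noise.sample()               # noisy action during training
prune = a_t > 0                                        # scores above zero are pruned
```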

C. Reward Function

In practice, we noticed that naive reward functions, such as linear or logarithmic functions, easily lead to two undesirable outcomes: (1) the agent falls into a stationary state, i.e., it tends to preserve filters in order to avoid the reward penalty; (2) the agent cannot achieve a trade-off between accuracy and acceleration because it lacks the ability to identify the vital filters in the network.

Notice that the distribution of the filter absolute values is extremely uneven and highly sparse. We were inspired by the loss functions used in object detection, which also face a data imbalance issue. Focal loss is able to focus learning on hard examples and handle the imbalanced distribution of the observations [13]. On this basis, the shrinkage loss reconstructs the focal loss with a sigmoid function and obtains a relatively faster convergence speed [14]. We design the new reward function r by reshaping the shrinkage loss with the l_2 norm and reversing its monotonicity, as shown in (2):

r = \frac{2}{1 + \exp\left(a\sqrt{(\mathit{Flops} - b)^2 + \mathit{Error}^2}\right)},    (2)

where Flops is the FLOPs ratio and Error is the current test error ratio, as defined in (3) and (4); a and b are hyper-parameters that adjust the convergence speed and the location of the peak. We design this goal-driven reward function and set b to the approximate final acceleration goal. Fig. 2 shows some examples of the reward function.

\mathit{Flops} = \frac{FLOPs_{current}}{FLOPs_{orig}}    (3)

Fig. 2. Reward function with different parameters a and b: (a) a=2, b=0; (b) a=3.5, b=0; (c) a=3.5, b=0.3; (d) a=6, b=0.3. Comparing (a) and (b), a larger a leads to a steeper shape; comparing (a) and (c), b controls the peak position of the reward value.

\mathit{Error} = 1 - \frac{Acc_{current}}{Acc_{orig}}    (4)

Recent works have shown that reward shaping is an effective technique for applying reinforcement learning in complicated environments [15]. We further add a reward shaping term R_s to the reward in (2) to stimulate compression while preserving accuracy. Table I shows our shaping strategy. The final reward is then written as (5), where γ is the weight of the reward shaping term.

reward = r + \gamma R_s    (5)

TABLE I
REWARD SHAPING FUNCTION UNDER DIFFERENT ENVIRONMENTS

Environment   Description                          R_s
pruning       reward when a filter is pruned       +0.05
preserving    penalty when a filter is preserved   -0.02
fluctuating   penalty when the error increases     -(error change) × 10
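A minimal sketch of the reward computation, following (2)-(5) and Table I, is given below; the default hyper-parameters mirror the MNIST setting (a = 1, b = 0.1, γ = 1). How the three shaping cases combine within a single step is not spelled out in the text, so the combination used here (a pruning/preserving term plus an error-change penalty) is an assumption.

```python
import math

def base_reward(flops_ratio, error, a=1.0, b=0.1):
    """Eq. (2): peaks at 1 when the FLOPs ratio hits the goal b with zero extra error."""
    d = math.sqrt((flops_ratio - b) ** 2 + error ** 2)
    return 2.0 / (1.0 + math.exp(a * d))

def shaping(pruned, error_delta):
    """Table I: bonus for pruning, penalty for preserving or for increasing the error."""
    r_s = 0.05 if pruned else -0.02
    if error_delta > 0:                               # test error went up at this step
        r_s -= 10.0 * error_delta
    return r_s

def reward(flops_ratio, error, pruned, error_delta, a=1.0, b=0.1, gamma=1.0):
    """Eq. (5): base reward plus the weighted shaping term."""
    return base_reward(flops_ratio, error, a, b) + gamma * shaping(pruned, error_delta)

# Example: pruning a filter that brings the FLOPs ratio to 0.3 with 1% extra error.
print(reward(flops_ratio=0.3, error=0.01, pruned=True, error_delta=0.0))
```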

D. Fine-tune Process

After the training process, we let the DDPG agent prune the model and put all the filter scores into a list. We then sort this list and choose a proper ratio, which denotes the prospective proportion of filters to be discarded. After retraining the pruned network to recover its accuracy, we obtain the compressed model.
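A minimal sketch of this ranking step is shown below, assuming that higher scores indicate filters the agent considers more prunable; the mapping from (layer, filter) indices to scores is illustrative.

```python
def select_filters_to_prune(scores, ratio):
    """Rank filters by agent score (highest = most prunable, an assumption of this
    sketch) and return the top `ratio` fraction as (layer_id, filter_id) indices."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    n_prune = int(len(ranked) * ratio)
    return [index for index, _ in ranked[:n_prune]]

# e.g. mark half of the filters for removal before the first retraining pass
scores = {(1, 0): 0.8, (1, 1): -0.3, (2, 0): 0.1, (2, 1): 0.5}   # toy scores
to_prune = select_filters_to_prune(scores, ratio=0.5)
print(to_prune)                                       # [(1, 0), (2, 1)]
```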

III. EXPERIMENT

This section contains three parts: the implementation details, the experiments on the MNIST dataset, and the experiments on the CIFAR-10 dataset, which further validate the performance. We implement network pruning with the PyTorch framework and evaluate on an NVIDIA GTX 1080Ti GPU.

Algorithm 1 Model compression at the filter level using a DDPG agent
1: Randomly initialize the actor network µ(s|θ^µ) and the critic network Q(s, a|θ^Q)
2: Import the model M to be pruned
3: Initialize the replay buffer B and the training interval τ
4: for ep = 1 : MaxEpisode do
5:   Initialize a random process N as noise for action exploration
6:   Initialize the observation of the first filter as s_1
7:   l_id = 1, f_id = 1
8:   total_reward_ep = 0
9:   for t = 1 : MaxSteps do
10:    Get the action a_t = µ(s_t|θ^µ) + N_t
11:    Operate pruning on the current filter at index (l_id, f_id) according to a_t
12:    Observe the new reward_t and the new state s_{t+1}
13:    Update l_id, f_id
14:    Store the transition (s_t, a_t, reward_t, s_{t+1}) in B
15:    if t mod τ == 0 then
16:      Update µ(s|θ^µ) and Q(s, a|θ^Q) with samples from B
17:    end if
18:    total_reward_ep += reward_t
19:  end for
20:  Reset M
21: end for
22: Select the agent with the highest total_reward_ep and fine-tune
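The pruning operation in line 11 of Algorithm 1 is not spelled out in the paper; the following sketch assumes a soft-pruning implementation that zeroes the selected filter and its batch-normalization parameters so that the corresponding output channel no longer contributes. Physically removing the filter (and the matching input channel of the next layer) would realize the actual FLOPs saving.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prune_filter(conv: nn.Conv2d, bn: nn.BatchNorm2d, filter_id: int):
    """Soft-prune one filter: zero its convolution weights, bias, and the matching
    batch-norm scale/shift so the corresponding output channel becomes all zeros."""
    conv.weight[filter_id].zero_()
    if conv.bias is not None:
        conv.bias[filter_id] = 0.0
    if bn is not None:
        bn.weight[filter_id] = 0.0
        bn.bias[filter_id] = 0.0
```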

A. Implementation Details

We first apply our method to the MNIST dataset. The original model to be compressed is a five-layer convolutional neural network. The numbers of filters in its layers are [64, 64, 128, 256, 256], and every convolutional layer is followed by a batch normalization layer and a ReLU layer. For the CIFAR-10 dataset, we use a VGG-16-like model. These original models are trained using Adam with batch size 128, momentum 0.9, and weight decay 0.0001. The test accuracies are 99.46% and 90.8% on MNIST and CIFAR-10, respectively. The actor and critic networks each consist of three fully connected layers [9]. The learning rates of the actor and critic networks are 0.0001 and 0.001, respectively. The noise standard deviation is initialized to 0.5 for the first 100 epochs and then decays exponentially. The replay buffer size of DDPG is 50000. In the fine-tune process, we use a filter-decayed pruning method in practice: we first cut half of the filters with one of the four methods in the table and retrain the network for about 10 epochs with learning rate 0.01, then cut the next quarter of the filters and retrain for 20 epochs with a lower learning rate of 0.001, and the rest is done in the same manner.
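The filter-decayed schedule above can be summarized as the following sketch; the helper functions are hypothetical placeholders, and the third stage and its learning rate are assumptions extrapolated from the text.

```python
def prune_fraction_of_targets(model, fraction):
    """Hypothetical helper: remove the next `fraction` of the agent-selected filters."""
    ...

def retrain(model, epochs, lr):
    """Hypothetical helper: fine-tune the pruned model for a few epochs."""
    ...

# Filter-decayed schedule: (fraction of targeted filters, retraining epochs, learning rate)
schedule = [
    (0.50, 10, 0.01),     # cut half of the selected filters, ~10 retraining epochs
    (0.25, 20, 0.001),    # then the next quarter with a lower learning rate
    (0.25, 20, 0.0001),   # remaining filters (stage and learning rate assumed)
]

model = None              # placeholder for the network being pruned
for fraction, epochs, lr in schedule:
    prune_fraction_of_targets(model, fraction)
    retrain(model, epochs=epochs, lr=lr)
```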

B. Experiment on MNIST Dataset

We compared our method with several conventional compression methods based on handcrafted heuristics on the same original neural network.

TABLE II
COMPARISON OF OUR METHOD WITH THE CONVENTIONAL METHODS ON THE MNIST DATASET

Method             FLOPs       Acceleration   Accuracy   ΔAcc%
Random             6.68×10^6   13.24×         98.52%     -0.92
L1-norm [6]        4.49×10^6   19.69×         99.03%     -0.43
Low-rank [5]       4.68×10^6   18.87×         98.94%     -0.52
Taylor [16]        4.39×10^6   20.15×         99.15%     -0.31
Importance-Aware   4.06×10^6   21.80×         99.26%     -0.20

Fig. 3. Comparison of the L1-norm, Taylor, and our method in terms of the remaining filter ratio of each layer. The acceleration rate of all three methods is 20×.

The hyper-parameters of the reward are chosen as follows: b = 0.1, a = 1, γ = 1. The results are shown in Table II, where the random method prunes filters randomly. Our method achieves both higher accuracy and a higher acceleration rate compared with the other handcrafted methods.

To show the details of the pruned model, we compute the remaining filter ratio of each layer under the three different methods. Fig. 3 shows that our method achieves a better pruning performance, especially in the first few layers, compared with the other two methods, which directly leads to a higher acceleration rate. The L1-norm and Taylor methods are conservative at the second layer, where our agent explores a better pruning strategy. This is largely credited to the self-learning DDPG agent, which assigns an accurate importance to each filter according to the current environment.

C. Experiment on CIFAR-10 Dataset

We further compared our method with three conventional methods on the CIFAR-10 dataset. The hyper-parameters of the reward are chosen as b = 0.2, a = 2, γ = 1. The learning rates and the noise parameters are the same as in the MNIST training process. Table III shows that our method achieves both a higher acceleration ratio and higher accuracy. Moreover, pruning filters with our method also yields good compression results: we achieved 28.59× and 49.07× compression ratios with 88.14% and 87.97% accuracy, respectively.

IV. CONCLUSION

In this paper, we introduced a fine-grained, fully automatic method that directly prunes filters according to their importance scores. We used a DDPG agent and designed a novel reward function with the reward shaping technique.

TABLE III
COMPARISON OF OUR METHOD WITH THE CONVENTIONAL METHODS ON THE CIFAR-10 DATASET

Method             FLOPs       Acceleration   Accuracy   ΔAcc%
Random             6.1×10^7    9.16×          86.25%     -4.59
L1-norm            3.7×10^7    15.45×         87.69%     -3.15
Low-rank           4.3×10^7    12.53×         85.85%     -4.99
Taylor             3.0×10^7    19.20×         86.32%     -4.52
Importance-Aware   1.89×10^7   37.09×         88.09%     -2.75

The experiments on two datasets demonstrate the competitive performance of our method. In the future, we plan to evaluate our model on larger datasets and apply it to video processing tasks.

REFERENCES

[1] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufﬂenet: An extremely efﬁ-

cient convolutional neural network for mobile devices,” in Proceedings

of the IEEE Conference on Computer Vision and Pattern Recognition,

2018, pp. 6848–6856.

[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for

large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[3] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “Cnp: An fpga-based

processor for convolutional networks,” in 2009 International Conference

on Field Programmable Logic and Applications. IEEE, 2009, pp. 32–

37.

[4] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam,

“Diannao: A small-footprint high-throughput accelerator for ubiquitous

machine-learning,” in ACM Sigplan Notices, vol. 49, no. 4. ACM,

2014, pp. 269–284.

[5] C. Tai, T. Xiao, Y. Zhang, X. Wang et al., “Convolutional neural net-

works with low-rank regularization,” arXiv preprint arXiv:1511.06067,

2015.

[6] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.

[7] Z. Chen, J. Lin, S. Liu, Z. Chen, W. Li, J. Zhao, and W. Yan,

“Exploiting weight-level sparsity in channel pruning with low-rank

approximation,” in 2019 IEEE International Symposium on Circuits and

Systems (ISCAS). IEEE, 2019, pp. 1–5.

[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wier-

stra, and M. Riedmiller, “Playing atari with deep reinforcement learn-

ing,” arXiv preprint arXiv:1312.5602, 2013.

[9] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,

D. Silver, and D. Wierstra, “Continuous control with deep reinforcement

learning,” arXiv preprint arXiv:1509.02971, 2015.

[10] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for

model compression and acceleration on mobile devices,” in Proceedings

of the European Conference on Computer Vision (ECCV), 2018, pp.

784–800.

[11] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, “Network trimming: A data-

driven neuron pruning approach towards efﬁcient deep architectures,”

arXiv preprint arXiv:1607.03250, 2016.

[12] H. Wang, Q. Zhang, Y. Wang, and H. Hu, “Structured probabilistic

pruning for convolutional neural network acceleration,” arXiv preprint

arXiv:1709.06994, 2017.

[13] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.

[14] X. Lu, C. Ma, B. Ni, X. Yang, I. Reid, and M.-H. Yang, “Deep

regression tracking with shrinkage loss,” in Proceedings of the European

Conference on Computer Vision (ECCV), 2018, pp. 353–369.

[15] Y. Wu and Y. Tian, “Training agent for ﬁrst-person shooter game with

actor-critic curriculum learning,” 2016.

[16] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning

convolutional neural networks for resource efﬁcient inference,” arXiv

preprint arXiv:1611.06440, 2016.