A Convolutional Neural Network-Based Approach
to Rate Control in HEVC Intra Coding
Ye Li, Bin Li, Dong Liu, Zhibo Chen
CAS Key Laboratory of Technology in Geo-spatial Information Processing and Application System
University of Science and Technology of China, Hefei 230027, China
chenzhibo@ustc.edu.cn
Abstract—Rate control is an essential element for the practical
use of video coding standards. A rate control scheme typically
builds a model that characterizes the relationship between rate
(R) and a coding parameter, e.g. quantization parameter or
Lagrange multiplier (λ). In such a scheme, the rate control
performance depends highly on the modeling accuracy. For inter
frames, the model parameters can be precisely updated to fit
the video content, based on the information of previously coded
frames. However, for intra frames, especially the first frame
of a video sequence, there is no prior information to rely on.
Therefore, intra frame rate control has remained a challenge. In
this paper, we adopt the R-λ model to characterize each coding
tree unit (CTU) in an intra frame, and we propose a convolutional
neural network (CNN) based approach to effectively predict the
model parameters for every CTU. Then we develop a new CTU
level bit allocation and bitrate control algorithm based on the R-λ
model for HEVC intra coding. The experimental results show that
our proposed CNN-based approach outperforms the currently
used rate control algorithm in the HEVC reference software, leading
to a 0.46 percent decrease in rate control error and a 0.7 percent
BD-rate reduction on average.
Index Terms—H.265/HEVC, Intra coding, Rate control, Convolutional neural network (CNN), R-λ model
I. INTRODUCTION
Rate control (RC) plays an important role in video ap-
plications, especially for real-time video transmission. The
objective of rate control is to achieve the optimal quality
of the coded video under the constraint of the given target
rate, which varies adaptively to the communication channels
in practical use. Typically, rate control involves two steps: bit
allocation and bitrate control [1]. The bit allocation step is to
allocate the total amount of bits for the video content to be
coded, and can be implemented at three levels, i.e. group-of-
picture (GOP) level, frame level, and basic unit (BU) level.
The bitrate control step has a target to achieve the allocated
bitrate as precisely as possible. Usually, a rate control scheme
introduces a model that characterizes the relationship between
rate (R) and a coding parameter, e.g. the R-Q model using the
quantization parameter (Q) [2], and the R-λ model using the
Lagrange multiplier (λ) [1]; the latter is adopted in the current
High Efficiency Video Coding (HEVC) [3] reference software.
Intra frame rate control is particularly important in a video
coding system for two reasons. First, intra frames usually
978-1-5090-5316-2/16/$31.00 © 2017 IEEE.
[Fig. 1 diagram: each CTU of the input frame is fed to two CNNs (CNN1 and CNN2) that predict the model parameters α and β, which determine the R-λ relationship used for bit allocation and bitrate control.]
Fig. 1. Our proposed CNN-based intra frame rate control method.
consume many more bits than inter frames. Second, the quality
of intra frames influences the coding efficiency of the
following inter frames due to inter prediction. Intra frame rate
control is often more difficult, too. While the R-λ model has
been shown to be accurate for inter frames, since the model
parameters can be updated from previously coded frames [1], the
same does not hold for intra frames. The interval between intra
frames is usually large, so previously coded intra frames are of
little help for the following ones. For the first frame of a video
sequence, or when a scene change occurs, there is no prior
information to rely on at all. Therefore, rate control in intra
coding remains a challenge.
In this paper, we propose a convolutional neural network
(CNN) based approach for rate control in HEVC intra coding,
as depicted in Fig. 1. We verify that the R-λ model is
accurate for most of the coding tree units (CTUs) in intra
frames, and that the major difficulty is estimating the model
parameters in practice. We then propose to adopt a CNN to
predict the model parameters directly from the content of the
CTUs, without any presumption or prior information. Then,
the bit allocation and bitrate control algorithms are developed
for intra frame rate control. The experimental results show that
our method significantly outperforms the state-of-the-art method
currently used in the HEVC reference software, leading to a
0.46 percent decrease in rate control error and a 0.7 percent
BD-rate reduction.
The rest of this paper is organized as follows. The related
works in HEVC rate control are summarized in Section II. In
Section III we introduce the details of our CNN-based intra
frame rate control method. The experimental results are shown
in Section IV, and Section V concludes the paper.
VCIP 2017, Dec. 10 - Dec. 13, St Petersburg, Florida, U.S.A.
II. RELATED WORK
The R-λ model was proposed by Li et al. [1] for HEVC; it
regards the Lagrange multiplier λ as the most critical factor in
determining the rate R. This model, together with an improved
bit allocation method [4], is adopted in the current HEVC
reference software. The basic R-λ model reads

λ = α · R^β    (1)
where α and β are the model parameters that depend on
the content. This basic model works well for HEVC inter
coding, because the model parameters of inter frames can be
accurately estimated from those of previously coded frames.
But as mentioned above, the model has difficulty dealing with
intra frames, because the parameters are not easy to estimate.
To address this problem, some works proposed to modify
the R-λ model by introducing more content-dependent parameters.
For example, in [5] the following model is proposed

λ = α · (C/R)^β    (2)

where C stands for content complexity and is estimated based
on the sum of absolute transformed differences (SATD) of
pixel values. Specifically, a Hadamard transform is performed
on the pixel values of an 8×8 block, and the absolute values of
the resulting coefficients are summed up. This model is currently
used in the HEVC reference software.
Moreover, some works proposed to introduce gradients
into the rate model to improve the intra frame rate
control performance [6], [7]. Other works proposed to
utilize the information of previously coded frames for further
improvement [8], [9].
Different from the previous works, in this paper we return to
the basic R-λ model and try to estimate its parameters
accurately and directly. Inspired by the great successes of
deep learning in recent years [10], we are motivated to adopt
a CNN for model parameter prediction.
III. THE PROPOSED METHOD
In this section, we first verify the accuracy of the R-λ
modeling at CTU level for HEVC intra coding. Then the CNN-
based prediction of model parameters is introduced in detail.
Finally we develop a CTU level bit allocation algorithm for
intra frame rate control.
A. R-λ Modeling for CTUs
The R-λ model in Equation (1) characterizes the relationship
between the Lagrange multiplier λ and the rate R. In [1], the
authors verified that, at the sequence level, R and λ accurately
fit the model. We perform curve fitting with R and λ
at the CTU level by conducting HEVC intra coding with fixed
quantization parameters. Some results are shown in Fig. 2.
From Fig. 2 we observe that the R-λ relationship at the CTU
level still matches the model in Equation (1) very well. But
[Fig. 2 data: BQMall, CTU index 4: λ = 2.677 · bpp^(−2.3), R² = 0.9995; PartyScene, CTU index 0: λ = 90.534 · bpp^(−2.219), R² = 0.9884.]
Fig. 2. R-λcurve fitting at CTU level.
we also observe that different CTUs within the same frame,
though adjacent, have quite different parameters (i.e. α and β).
This shows that the model parameters are highly content
dependent and that the CTUs have quite different characteristics.
Thus, accurately estimating the model parameters is the challenge
that hinders the use of the R-λ model in intra coding.
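The per-CTU curve fitting used above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual fitting code: it fits λ = α · bpp^β by ordinary least squares in log-log space, on synthetic (bpp, λ) samples generated from a known model.

```python
import math

def fit_r_lambda(bpp, lam):
    """Fit lambda = alpha * bpp**beta by least squares in log-log space.

    Taking logs gives log(lambda) = log(alpha) + beta * log(bpp),
    a straight line whose slope is beta and intercept is log(alpha).
    """
    xs = [math.log(r) for r in bpp]
    ys = [math.log(v) for v in lam]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)
    alpha = math.exp(my - beta * mx)
    return alpha, beta

# Synthetic (bpp, lambda) pairs mimicking the per-CTU measurements
# collected at several QPs (the generating parameters are illustrative,
# taken from the BQMall fit shown in Fig. 2).
true_alpha, true_beta = 2.677, -2.3
bpp = [0.1 * (i + 1) for i in range(11)]
lam = [true_alpha * r ** true_beta for r in bpp]

alpha, beta = fit_r_lambda(bpp, lam)
print(alpha, beta)  # recovers the generating parameters (about 2.677, -2.3)
```

On noiseless data the regression recovers the generating parameters exactly; on real CTU statistics the same procedure yields the fits (with the R² values) shown in Fig. 2.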
[Fig. 3 structure: 7×7 conv, 16 → ReLU → 7×7 conv, 32 → ReLU → max-pool → 7×7 conv, 16 → ReLU → 7×7 conv, 32 → ReLU → max-pool → fc_128 → ReLU → fc_64 → fc_1.]
Fig. 3. The CNN structure for predicting model parameters αand β.
B. CNN-Based Prediction of Model Parameters
From the above analyses, we conjecture that the model
parameters (α and β) may be predicted from the given video
content. Intuitively, we can leverage the powerful learning
ability of a CNN to perform the prediction, which brings two
advantages. First, there is no need to design features (such
as SATD or gradients), as the CNN automates feature learning.
Second, it is not necessary to refer to previously coded
content, which facilitates parallel encoding.
1) Network Structure: We design a CNN with four convolutional
layers, each followed by a rectified linear unit (ReLU),
two max-pooling layers, and three fully connected (fc) layers,
as shown in Fig. 3. The final fc layer outputs a predicted value
for a model parameter (α or β). The same structure is used for
both α and β, but a separate network is trained for each.
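The layer sequence of Fig. 3 can be sketched in PyTorch as follows. This is a sketch under assumptions: the "same" convolution padding, the 2×2 pooling windows, and the placement of the second pooling layer after the fourth convolution are our guesses, since the figure does not specify them.

```python
import torch
import torch.nn as nn

class ParamNet(nn.Module):
    """CNN mapping a 64x64 luma CTU to one model parameter (alpha or beta).

    Two copies of this network are trained, one per parameter.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 7, padding=3), nn.ReLU(),   # 7x7 conv, 16
            nn.Conv2d(16, 32, 7, padding=3), nn.ReLU(),  # 7x7 conv, 32
            nn.MaxPool2d(2),                             # 64x64 -> 32x32
            nn.Conv2d(32, 16, 7, padding=3), nn.ReLU(),  # 7x7 conv, 16
            nn.Conv2d(16, 32, 7, padding=3), nn.ReLU(),  # 7x7 conv, 32
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
        )
        self.regressor = nn.Sequential(
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),     # fc_128
            nn.Linear(128, 64),                          # fc_64
            nn.Linear(64, 1),                            # fc_1
        )

    def forward(self, x):
        x = self.features(x)
        return self.regressor(x.flatten(1))

net = ParamNet()
ctu = torch.zeros(1, 1, 64, 64)  # one luma CTU as a single-channel input
print(net(ctu).shape)  # torch.Size([1, 1]): one scalar parameter per CTU
```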
2) Training: We use many natural images to train the
CNN to obtain a universal model. The images come from the
UCID dataset [11] and the RAISE dataset [12]. None of them
appears in the HEVC common test sequences, and therefore,
our coding results can demonstrate the generalizability of the
trained CNN.
The natural images are converted into YUV420 format and
then compressed with the HEVC reference software under 11
different quantization parameters (QPs), ranging from 20 to 40
with an interval of 2. The coding rate and Lagrange multiplier
values under the different QPs are collected for each CTU. Then
curve fitting is performed for each CTU using the 11 (R, λ)
pairs to obtain α and β. Outlier CTUs are then removed, where
inliers are defined as α ∈ [0.05, 200] and β ∈ [−3, 0]. Finally,
we use the original pixel values of the luma component
of each CTU as the input to the CNN, and the corresponding α
or β as the label for training. There are 180,000 CTUs used
for training and another 16,000 CTUs used for validation.
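The outlier removal step amounts to a simple range check on the fitted parameters; a minimal sketch (the example (α, β) pairs are made up for illustration):

```python
def is_inlier(alpha, beta):
    """Keep only CTUs whose fitted parameters fall in the trusted ranges
    alpha in [0.05, 200] and beta in [-3, 0]."""
    return 0.05 <= alpha <= 200 and -3 <= beta <= 0

# Hypothetical fitted (alpha, beta) pairs for a few CTUs.
fitted = [(2.677, -2.3), (90.534, -2.219), (350.0, -2.5), (1.2, 0.4)]
inliers = [p for p in fitted if is_inlier(*p)]
print(inliers)  # the last two pairs are rejected (alpha too large; beta > 0)
```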
For an original CTU X_n, where n ∈ {1, . . . , N} indexes
the CTUs, let the corresponding label be denoted by y_n. The
training of the CNN then minimizes the following loss function

L(Θ) = (1/N) · Σ_{n=1}^{N} ||F(X_n|Θ) − y_n||²    (3)

where Θ is the set of parameters of the network in Fig. 3. The
training is performed by stochastic gradient descent with
standard back-propagation.
3) Network Usage: We integrate the two trained networks
(for α and β, respectively) into HEVC intra frame rate control.
Before encoding one frame, we use the original pixel values
of the luma component of each CTU to predict its model
parameters. It is worth noting that CTUs can be processed
in parallel in the prediction process.
Boundary CTUs, whose sizes are less than the normal size of
64×64, are first padded to the normal size (with the constant
padded pixel value 128) and then sent to the trained CNN.
Afterwards, we adjust the CNN-predicted parameters for
these boundary CTUs to take the padding effect into account.
Specifically, as shown in Fig. 4, the parameters of the original
CTU (a) and the padded one (b) are given by

λ^(a) = α^(a) · (R^(a) / N_pix^(a))^(β^(a)),   λ^(b) = α^(b) · (R^(b) / N_pix^(b))^(β^(b))    (4)
where R is the amount of coded bits and N_pix denotes the
number of pixels. As the CNN predicts α and β for (b), we
need the values for (a). The following rectification process is
proposed for this purpose.
Note that the padded pixels have a constant value and
should cost very few bits, so we assume the amount of coded
bits is the same for (a) and (b). We also suppose that the β
values of (a) and (b) are nearly the same, because we empirically
observe that β varies within a small range. Given the
same value of λ, we then have

λ^(a) = λ^(b),   R^(a) = R^(b),   β^(a) ≈ β^(b)    (5)
Combining (4) and (5), we obtain the rectified α value as

α^(a) = α^(b) · S_ab    (6)

where

S_ab = (N_pix^(a) / N_pix^(b))^(β^(b))    (7)

In our experiments, the rectification factor S_ab is further
clipped into the range [1, 4] empirically.
Fig. 4. Illustration of CTU padding: (a) original CTU, (b) CTU after padding.
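The rectification of Equations (6) and (7) can be sketched as follows. The clipping range [1, 4] is the one given above; the example sizes (a 64×40 boundary CTU padded to 64×64) and parameter values are illustrative assumptions.

```python
def rectify_alpha(alpha_b, beta_b, n_pix_a, n_pix_b):
    """Map the CNN-predicted alpha of the padded CTU (b) back to the
    original boundary CTU (a) via S_ab = (N_a / N_b) ** beta_b,
    clipped empirically into [1, 4]."""
    s_ab = (n_pix_a / n_pix_b) ** beta_b
    s_ab = min(max(s_ab, 1.0), 4.0)
    return alpha_b * s_ab

# A 64x40 boundary CTU padded to 64x64; beta is negative, so the
# pixel-count ratio below 1 raised to beta gives S_ab > 1.
alpha_a = rectify_alpha(alpha_b=10.0, beta_b=-2.0,
                        n_pix_a=64 * 40, n_pix_b=64 * 64)
print(alpha_a)  # 10 * (2560/4096)**-2 = 25.6
```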
C. Bit Allocation
Given a target rate R_f for an intra frame, we first solve for
a frame-level λ, i.e. λ_f, by numerically solving the following
equation via the bisection method

Σ_{i=1}^{N_f} α_{B_i} · λ_f^{β_{B_i}} = R_f    (8)

where N_f is the number of CTUs in the frame, and α_{B_i} and
β_{B_i} are the model parameters of each CTU. Then, we use the
basic unit (BU) level bit allocation method proposed in [4], which
is also used for inter frame rate control in the current HEVC
reference software.
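The frame-level λ search following the form of Equation (8) can be sketched as below. The per-CTU parameters and the search bracket are illustrative assumptions; the key observation is that with every β_{B_i} < 0 each term decreases monotonically in λ, so the total is monotone and bisection applies.

```python
def solve_frame_lambda(params, r_f, lo=1e-4, hi=1e4, iters=100):
    """Bisection on f(lam) = sum(a * lam**b for each CTU) - r_f,
    which is strictly decreasing in lam when every b < 0."""
    def total_rate(lam):
        return sum(a * lam ** b for a, b in params)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total_rate(mid) > r_f:
            lo = mid   # rate too high -> need a larger lambda
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical per-CTU (alpha_B, beta_B) parameters and a frame budget.
ctu_params = [(2.677, -2.3), (90.534, -2.219), (15.0, -1.8)]
lam_f = solve_frame_lambda(ctu_params, r_f=50.0)
total = sum(a * lam_f ** b for a, b in ctu_params)
print(lam_f, total)  # total matches the target rate 50.0
```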
IV. EXPERIMENTAL RESULTS
The proposed CNN-based intra frame rate control method
is implemented in the HEVC reference software (HM-16.9).
We follow the HEVC common test conditions in our experiments.
Five classes, i.e. A, B, C, D and E, including 20 sequences, are
tested. Class F is omitted as it contains screen content videos,
and our CNN, trained on natural images, may not be suitable
for screen content. For each sequence, only the first frame is
tested, for two reasons. First, it is well known that the
performance of intra coding methods can be well reflected by
the results of a small number of frames. Second, as mentioned
above, rate control for the first intra frame is very challenging,
as no prior information can be used.
TABLE I
RESULTS OF RATE CONTROL ERROR

              Average rate error
          RC in HM-16.9   Proposed RC
Class A       0.73%          0.39%
Class B       0.94%          0.62%
Class C       1.41%          1.24%
Class D       3.01%          2.02%
Class E       1.77%          1.25%
Average       1.53%          1.07%
TABLE II
RESULTS OF BD-RATE; THE ANCHOR IS TURNING OFF RATE CONTROL

                        Average BD-rate
            RC in HM-16.9          Proposed RC
            Y     U     V        Y     U     V
Class A   1.3%  5.9%  5.2%     0.3%  5.1%  4.4%
Class B   2.3%  6.9%  7.7%     1.3%  4.7%  4.4%
Class C   1.4%  9.0%  8.1%     1.2%  5.5%  6.4%
Class D   1.6%  5.6%  5.2%     1.2%  2.1%  3.2%
Class E   2.6%  4.7%  5.6%     1.6%  3.4%  3.9%
Average   1.8%  6.5%  6.5%     1.1%  4.3%  4.5%
TABLE III
EXAMPLE DETAILED RESULTS

                             RC in HM-16.9                       Proposed RC
Sequence     Target bitrate  Bitrate    Y-PSNR  Bitrate error    Bitrate    Y-PSNR  Bitrate error
ParkScene    48754.75        48884.352  41.53   0.27%            48854.59   41.58   0.20%
             26321.28        26462.98   38.67   0.54%            26433.22   38.71   0.43%
             13756.42        13846.08   35.79   0.65%            13826.69   35.84   0.51%
             6773.18         6805.44    33.03   0.48%            6802.37    33.10   0.43%
RaceHorseC   17542.80        17656.08   41.98   0.65%            17477.28   42.08   0.37%
             10761.84        10976.40   38.42   1.99%            10836.96   38.42   0.70%
             6191.28         6402.72    34.85   3.42%            6297.60    34.78   1.72%
             3222.48         3299.04    31.22   2.38%            3276.96    31.14   1.69%
We use the HM software with rate control turned off to
compress each sequence at different QPs, then use the resulting
bitrates as the target bitrates for our rate control method, as
well as for the SATD-based method [5], which is currently the
default rate control method in HM.
The results of rate control error and BD-rate are shown in
Tables I and II, respectively. Some detailed results are further
shown in Table III. In the tables, the rate control error is
measured by

E = |R_t − R_a| / R_t × 100%    (9)

where R_t is the target bitrate and R_a is the actual bitrate.
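As a worked example of Equation (9), the first ParkScene operating point of Table III:

```python
def rate_error(target, actual):
    """Rate control error E = |Rt - Ra| / Rt * 100%."""
    return abs(target - actual) / target * 100.0

# First ParkScene row of Table III, proposed RC column.
e = rate_error(48754.75, 48854.59)
print(e)  # about 0.205, i.e. the 0.20% entry after rounding
```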
From Table I we observe that our proposed method achieves
more accurate rate control than the SATD-based method, with
a 0.46 percent decrease in rate control error on average. We
attribute this to our CNN-based method, where the CNN can
accurately predict the parameters of the R-λ model.
To evaluate the compression efficiency, we use the BD-rate
between the scheme with rate control and the scheme without.
The comparative results are shown in Table II. Note that the
scheme with rate control is usually a little worse than the
scheme without in terms of rate-distortion performance. It can
be observed that our method also outperforms the SATD-based
method, leading to on average 0.7 percent BD-rate reduction
in Y component, and more than 2 percent BD-rate reduction
in U and V components.
Our current implementation is straightforward and not
optimized. In our experiments, we observe that the encoding time
of our method is about 2.2 times that of the SATD-based
method, due to the CNN-based prediction. Nonetheless, we
anticipate that the computational time of the CNN can be reduced
greatly by using a graphics processing unit or specialized
hardware.
V. CONCLUSIONS
In this paper, we have presented a convolutional neural
network-based approach for HEVC intra frame rate control.
The proposed rate control method outperforms the SATD-based
one, which is used in the current HEVC reference
software, in terms of both rate control accuracy and compression
efficiency. Since the model parameters are obtained by a
CNN trained on many natural images, our method is universal
and does not require any prior information, and thus it is
especially suitable for the first frame of a sequence or for
scene changes.
VI. ACKNOWLEDGMENT
This work was supported in part by the National Key
Research and Development Plan under Grant
No. 2016YFC0801001, the National 973 Program of China
under Grant 2015CB351803, NSFC (Natural Science
Foundation of China) under Grants 61571413 and 61390514,
and Intel ICRI MNC.
REFERENCES
[1] B. Li, H. Li, L. Li, and J. Zhang, “λ domain rate control algorithm for
high efficiency video coding,” IEEE Transactions on Image Processing,
vol. 23, no. 9, pp. 3841–3854, Sept 2014.
[2] H. Choi, J. Nam, J. Yoo, D. Sim, and I. Bajic, “Rate control based
on unified RQ model for HEVC,” ITU-T SG16 Contribution, JCTVC-
H0213, pp. 1–13, 2012.
[3] G. J. Sullivan, J. R. Ohm, W. J. Han, and T. Wiegand, “Overview of the
high efficiency video coding (HEVC) standard,” IEEE Transactions on
Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–
1668, 2012.
[4] L. Li, B. Li, H. Li, and C. W. Chen, “λ domain optimal bit allocation
algorithm for high efficiency video coding,” IEEE Transactions on
Circuits and Systems for Video Technology, vol. PP, no. 99, pp. 1–1,
2016.
[5] M. Karczewicz and X. Wang, “Intra frame rate control based on SATD,”
in JCTVC 13th Meeting, Incheon, KR, 2013.
[6] M. Wang, K. N. Ngan, and H. Li, “An efficient frame-content based
intra frame rate control for high efficiency video coding,” IEEE Signal
Processing Letters, vol. 22, no. 7, pp. 896–900, 2015.
[7] B. Hosking, D. Agrafiotis, D. Bull, and N. Eastern, “An adaptive resolu-
tion rate control method for intra coding in HEVC,” in Acoustics, Speech
and Signal Processing (ICASSP), 2016 IEEE International Conference
on. IEEE, 2016, pp. 1486–1490.
[8] C. Sheng, F. Chen, Z. Peng, and W. Chen, “An adaptive bit mismatch
rectification algorithm for intra frame rate control in HEVC,” in Image
and Signal Processing (CISP), 2015 8th International Congress on.
IEEE, 2015, pp. 80–84.
[9] M. Zhou, Y. Zhang, B. Li, and H.-M. Hu, “Complexity-based intra frame
rate control by jointing inter-frame correlation for high efficiency video
coding,” Journal of Visual Communication and Image Representation,
2016.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural infor-
mation processing systems, 2012, pp. 1097–1105.
[11] G. Schaefer and M. Stich, “UCID: An uncompressed color image
database,” in Electronic Imaging 2004. International Society for Optics
and Photonics, 2003, pp. 472–480.
[12] D.-T. Dang-Nguyen, C. Pasquini, V. Conotter, and G. Boato, “RAISE:
A raw images dataset for digital image forensics,” in Proceedings of the
6th ACM Multimedia Systems Conference. ACM, 2015, pp. 219–224.