HEVC INTRA FEATURES FOR HUMAN DETECTION
Hyomin Choi and Ivan V. Bajić
School of Engineering Science, Simon Fraser University
Burnaby, BC, Canada
ABSTRACT
Object detection relies on appropriate features extracted from images or video. Typically, features are extracted in the pixel domain, after full decoding. In this paper, we introduce a set of features derived from HEVC intra coding syntax elements (block size, intra prediction modes, and transform coefficient levels) that enable human detection without full bitstream decoding. When used with a properly trained linear Support Vector Machine (SVM), the proposed features achieve human detection accuracy competitive with the widely used Histogram of Oriented Gradients (HOG) features over a range of Quantization Parameter (QP) values.
Index Terms— Human detection, SVM, HEVC, intra coding
1. INTRODUCTION
Object detection, and in particular human detection, is a key component in a number of recent technologies that rely on computer vision, such as video surveillance, autonomous vehicles, independent living, and many others. Features represent certain relationships among pixels that are helpful in deciding whether or not a particular type of object is present in an image. Features are usually processed by an appropriately designed machine learning algorithm to make a decision regarding the presence of the corresponding object. Even simple Haar-like features are able to provide fairly accurate object detection when coupled with an advanced machine learning algorithm [1]. More sophisticated features such as Histogram of Oriented Gradients (HOG) [2] can be coupled with simpler machine learning models, such as a Support Vector Machine (SVM), and still offer excellent accuracy. In particular, HOG features coupled with a linear SVM are considered the benchmark handcrafted features for human detection. The latest trend in human detection is towards learned features generated by training deep neural networks [3, 4], which provide higher accuracy, but at higher computational cost.
Common to all these approaches is the need to fully decode an image or video frame and store it in memory prior to computing features. As the scale of data that needs to be analyzed increases, one must look at ways to save computation wherever possible. One possibility is to compute the features from the compressed bitstream without full decoding. Table 1 shows how much time can be saved by performing only entropy decoding and avoiding inverse quantization and transform, prediction, and in-loop filters. In the table, the various classes (A, B, ..., E) refer to video resolutions, while rows indicate different HEVC coding configurations [5].

Table 1. Average time saving with HEVC entropy decoding only vs. full decoding and reconstruction

                 ClassA  ClassB  ClassC  ClassD  ClassE
All Intra          65%     63%     64%     57%     70%
Low Delay          67%     67%     65%     58%     59%
Random Access      69%     69%     67%     61%     64%
Indeed, compressed-domain and transform-domain fea-
tures have been studied in the past, especially for earlier im-
age and video coding standards [6, 7, 8, 9]. However, the lat-
est standards such as H.264/AVC [10] and H.265/HEVC [11]
involve a recursive relationship between transform coeffi-
cients and reconstructed pixel values, making it more difficult
to extract meaningful features without fully reconstructing
the image. For this reason, there has been very limited work
on compressed-domain features, especially for HEVC. A rare
recent example is [12], which presents several motion-based
features for human-vehicle classification.
Our focus in this paper is on HEVC intra-coding features
for human detection. Humans detected in intra frames could
then be tracked through inter-frames using a method similar
to [13], for example. One of the features we develop is a his-
togram of intra prediction directions, which is meant to emu-
late HOG [2]. The relationship between prediction directions
and gradients has been recognized before [14, 15, 16, 17]
and used to speed up mode decision in HEVC intra coding.
Here, we use the relationship in reverse, to construct some-
thing analogous to the gradient histogram from prediction di-
rections. Another novel feature is based on transform coeffi-
cient levels in small blocks. These features are combined in a
vector of similar size to HOG and used to train a linear SVM.
The proposed features are described in Section 2, followed by
experiments and conclusions in Sections 3 and 4, respectively.
To be presented at IEEE GlobalSIP'17, Montreal, QC, Nov. 2017
2. PROPOSED FEATURES
Motivated by the success of HOG features [2] in human detection, we set out to create similar features that could be computed from HEVC compressed data without full image reconstruction. The result is a Histogram of Prediction Directions (HoPD), whose relationship with HOG is illustrated in Fig. 1.
We supplement HoPD with another feature called Coefficient
Binary Patterns (CBP), which is derived from transform co-
efficient levels. The combination of HoPD and CBP turns out
to be competitive with HOG over a wide range of QP values,
yet only requires the data available at the output of the HEVC
entropy decoder module. To facilitate comparison with HOG features, we explain how HoPD and CBP features are computed for 64×128 windows, the same window size for which HOG was originally developed [2]. It should be noted that the framework can be easily adapted to other window sizes.
2.1. Histogram of Prediction Directions
HEVC intra coding employs angular prediction with 33 directional modes (indexed 2-34) and two non-directional modes (indexed 0-1). Directional modes uniformly cover a 180° range, from 45° to 225°. Rate-Distortion Optimized (RDO) mode decision tends to select smaller blocks near edges and object boundaries, where prediction directions tend to follow edges, as illustrated in the left part of Fig. 2. Hence, for the purpose of HoPD computation, all Coding Units (CUs) larger than 8×8 are considered non-directionally predicted.
However, since the smallest block size is 4×4 and we get at most one prediction direction per block, the encoded set of prediction directions is much smaller than the HOG vector. Specifically, for a 64×128 window, the HOG vector dimensionality is 16740, while the number of prediction directions would be at most 512. Hence, to enrich the set of prediction directions, we compute 16 sub-directions per each 4×4 block (one direction per pixel), as a weighted average of neighbouring prediction directions.
Let C be the current 4×4 block, whose pixels are labeled a, b, c, ..., as shown in the right part of Fig. 2. Let the neighboring 4×4 blocks be denoted Top-Left (TL), Top (T), Top-Right (TR), ..., and let D_i ∈ {0, 1, ..., 34} be the prediction mode index of the corresponding block, with i ∈ {TL, T, TR, L, C, R, BL, B, BR}. For the top-left quadrant of block C (pixels a, b, c, d), the sub-direction index is computed as
$$
D_a = \begin{cases}
\frac{D_{TL} + 3D_C + 2D_T + 2D_L}{8}, & \text{if } D_{TL}, D_C, D_T, D_L > 1 \\
\frac{D_{TL} + 3D_C}{4}, & \text{if } D_{TL}, D_C > 1;\ D_T \text{ or } D_L \le 1 \\
\frac{2D_T + 2D_L}{4}, & \text{if } D_T, D_L > 1;\ D_{TL} \text{ or } D_C \le 1 \\
0, & \text{otherwise}
\end{cases}
\tag{1}
$$
Fig. 1. Comparison of HOG and HoPD on sample images: (a) Histogram of Oriented Gradients (HOG); (b) Histogram of Prediction Directions (HoPD).

Fig. 2. Left: prediction directions (red) in 4×4 blocks; Right: computing sub-directions.
$$
D_b = \begin{cases}
\frac{D_{TL} + 3D_C + 3D_T + D_L}{8}, & \text{if } D_{TL}, D_C, D_T, D_L > 1 \\
\frac{D_{TL} + 3D_C}{4}, & \text{if } D_{TL}, D_C > 1;\ D_T \text{ or } D_L \le 1 \\
\frac{3D_T + D_L}{4}, & \text{if } D_T, D_L > 1;\ D_{TL} \text{ or } D_C \le 1 \\
0, & \text{otherwise}
\end{cases}
\tag{2}
$$
$$
D_c = \begin{cases}
\frac{D_{TL} + 3D_C + D_T + 3D_L}{8}, & \text{if } D_{TL}, D_C, D_T, D_L > 1 \\
\frac{D_{TL} + 3D_C}{4}, & \text{if } D_{TL}, D_C > 1;\ D_T \text{ or } D_L \le 1 \\
\frac{D_T + 3D_L}{4}, & \text{if } D_T, D_L > 1;\ D_{TL} \text{ or } D_C \le 1 \\
0, & \text{otherwise}
\end{cases}
\tag{3}
$$
Fig. 3. (a) sample image, (b) residual image, (c) 4×4 CBP map, (d) 8×8 CBP map, (e) combined CBP map.
$$
D_d = \begin{cases}
D_C, & \text{if } D_C > 1 \\
\frac{D_T + D_L}{2}, & \text{if } D_T, D_L > 1;\ D_C \le 1 \\
0, & \text{otherwise}
\end{cases}
\tag{4}
$$
The sub-direction index D_j, j ∈ {a, b, c, d}, is rounded down if any of equations (1)-(4) results in a fractional value. Sub-directions in other quadrants of block C are computed analogously using neighboring block directions. Note that if C and/or a sufficient number of neighboring blocks are directionally predicted (D_i > 1 for i ∈ {TL, T, ..., BR}), the corresponding sub-direction D_j will also be directional (D_j > 1), because it is computed as a weighted average over numbers larger than 1. If not, the sub-direction gets assigned a value of 0, which indicates non-directional prediction.
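To make the quadrant rule concrete, the computation in Eqs. (1)-(4) can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the function names are ours, and pixels b and c follow the same pattern with the weights of Eqs. (2) and (3).

```python
# Sub-direction computation for the top-left quadrant of block C.
# Mode indices: 0-1 are non-directional, 2-34 are directional (>1).
# Fractional averages are floored, per the text.

def subdir_a(d_tl, d_c, d_t, d_l):
    """Sub-direction for pixel a, Eq. (1)."""
    if all(d > 1 for d in (d_tl, d_c, d_t, d_l)):
        return (d_tl + 3 * d_c + 2 * d_t + 2 * d_l) // 8
    if d_tl > 1 and d_c > 1:            # D_T or D_L is non-directional
        return (d_tl + 3 * d_c) // 4
    if d_t > 1 and d_l > 1:             # D_TL or D_C is non-directional
        return (2 * d_t + 2 * d_l) // 4
    return 0                            # non-directional sub-direction

def subdir_d(d_c, d_t, d_l):
    """Sub-direction for pixel d, Eq. (4)."""
    if d_c > 1:
        return d_c
    if d_t > 1 and d_l > 1:
        return (d_t + d_l) // 2
    return 0
```

Note that each branch only averages over directional modes, which is why a directional result is guaranteed whenever enough neighbors are directional.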
Once all 16 sub-directions are computed, the histogram is formed as follows. A 180° range is divided into 9 bins (same as in HOG). Each directional sub-direction (D_j > 1) increases the count in its corresponding bin. In addition, two neighboring bins are incremented: let α be the difference between the angle of D_j and the upper boundary of its bin in the histogram, divided by the bin width (so that 0 ≤ α ≤ 1); then the bin above is incremented by 1−α and the bin below is incremented by α. Meanwhile, a non-directional sub-direction (D_j = 0) increments all bins by 1. After such a histogram is formed for all 4×4 blocks in a 64×128 window, all the histograms are stacked together into a HoPD vector, whose dimension is (64/4)·(128/4)·9 = 4608.
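A minimal sketch of this soft-binning scheme follows. This is our own illustration, not the authors' code; in particular, the uniform mapping of directional mode indices 2-34 to angles spanning 45° to 225° is our assumption from the HEVC angular-mode layout.

```python
import numpy as np

N_BINS = 9
BIN_WIDTH = 180.0 / N_BINS  # 20 degrees per bin

def mode_to_angle(d):
    """Assumed mapping of directional mode index (2-34) to an angle."""
    return 45.0 + (d - 2) * 180.0 / 33.0

def hopd_histogram(subdirs):
    """One 9-bin histogram from the 16 sub-directions of a 4x4 block."""
    hist = np.zeros(N_BINS)
    for d in subdirs:
        if d <= 1:                       # non-directional: vote in all bins
            hist += 1.0
            continue
        angle = (mode_to_angle(d) - 45.0) % 180.0
        b = int(angle // BIN_WIDTH) % N_BINS
        # alpha: distance to the bin's upper boundary, in bin widths
        alpha = ((b + 1) * BIN_WIDTH - angle) / BIN_WIDTH
        hist[b] += 1.0                          # own bin
        hist[(b + 1) % N_BINS] += 1.0 - alpha   # bin above
        hist[(b - 1) % N_BINS] += alpha         # bin below
    return hist
```

With this scheme, each directional sub-direction contributes a total vote of 2 spread over three bins, while a non-directional one contributes 1 to every bin.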
2.2. Coefficient Binary Patterns
To supplement HoPD, we introduce Coefficient Binary Pat-
tern (CBP) features. Although intra prediction in HEVC
makes the relationship between pixel values and transform
coefficients recursive, in some cases residuals resemble orig-
inal pixel structures because the corresponding pixels cannot
be effectively predicted from neighboring regions. An exam-
ple is given in Fig. 3, where we see that the face and some
outlines of the body from Fig. 3(a) are still visible in the
residual shown in Fig. 3(b). Hence, transform coefficients
still contain useful information about local pixel values, even
Fig. 4. Coefficients used in CBP for (a) 4×4 DST and (b) 8×8 DCT.
though those pixel values cannot be fully recovered from
these transform coefficients alone.
In HEVC, 4×4 blocks are transformed using an integer approximation to the Discrete Sine Transform (DST), while larger blocks use integer approximations to the Discrete Cosine Transform (DCT). We only construct CBPs for 4×4 DST blocks and 8×8 DCT blocks. To construct 4×4 CBPs, we overlay a 4×4 cell grid over the 64×128 window. Then for each 4×4 cell, if the corresponding block has been transformed using the 4×4 DST, we generate a 16-bit map, with each bit indicating whether the corresponding coefficient's magnitude is greater than 0 (bit = 1) or not (bit = 0). For 4×4 cells that have been transformed using larger transforms, we generate an all-zero binary map. An example of a 4×4 CBP map is shown in Fig. 3(c). An analogous procedure, using an 8×8 grid, is used to construct the 8×8 CBP map (Fig. 3(d)), except that in this case we consider only the 27 coefficients at low and medium frequencies, indicated in Fig. 4(b). Since each block is transformed using only one transform, the non-zero entries in the 4×4 and 8×8 CBP maps do not overlap, and they can be combined as shown in Fig. 3(e).
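A minimal sketch of the 4×4 CBP construction described above (our own illustration; the function name and array layout are assumptions):

```python
import numpy as np

def cbp_4x4(coeffs, used_4x4_dst):
    """16-bit significance map for one 4x4 cell.

    coeffs: 4x4 array of quantized transform coefficient levels;
    used_4x4_dst: True if this cell was coded with the 4x4 DST.
    Cells covered by larger transforms get an all-zero map.
    """
    if not used_4x4_dst:
        return np.zeros(16, dtype=np.uint8)
    # bit = 1 wherever the coefficient magnitude is greater than 0
    return (np.abs(np.asarray(coeffs)).reshape(-1) > 0).astype(np.uint8)
```

The 8×8 case would be analogous, except that only the 27 low- and medium-frequency positions of Fig. 4(b) are kept.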
Each cell in each of the two CBP maps is vectorized as

$$
V_k = \left( V_k^1, V_k^2, V_k^3, V_k^4 \right)
\tag{5}
$$

where k ∈ {DCT, DST} and the superscript n ∈ {1, 2, 3, 4} indicates various regions of the transform block (Fig. 4) in vertical-scan order. Then all the vectors are stacked together and combined with HoPD. The total number of 4×4 CBP features is (64/4)·(128/4)·16 = 8192 and the total number of 8×8 CBP features is (64/8)·(128/8)·27 = 3456. Together with HoPD, the overall dimension of our feature vector is 4608 + 8192 + 3456 = 16256. This is slightly less than the dimension of HOG features for the same window size, which is 16740.
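The dimension bookkeeping above can be checked with a few lines:

```python
# Feature dimensions for a 64x128 detection window, as derived above.
W, H = 64, 128
hopd_dim = (W // 4) * (H // 4) * 9    # one 9-bin histogram per 4x4 block
cbp4_dim = (W // 4) * (H // 4) * 16   # 16-bit map per 4x4 cell
cbp8_dim = (W // 8) * (H // 8) * 27   # 27 low/mid-frequency bits per 8x8 cell
total = hopd_dim + cbp4_dim + cbp8_dim
print(hopd_dim, cbp4_dim, cbp8_dim, total)  # 4608 8192 3456 16256
```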
3. EXPERIMENTAL RESULTS
For our experiments, we used the INRIA person dataset (http://pascal.inrialpes.fr/data/human/), which provides a training set with 2416 images of humans and
Fig. 5. Test image size format and margin
Fig. 6. Accuracy at various QP values
1218 human-free images, as well as a test set with 1126 human images and 453 human-free images. In the test set, human images were scaled up from 70×134 to 80×144 (Fig. 5) using bicubic interpolation, in order to make the image dimensions a multiple of 8, as required for HEVC encoding. Human-free images were generally larger than images containing humans, which allows selecting multiple human-free windows from them. In the training set, we randomly selected 2 windows within each human-free image, which brought the number of human-free training samples up to 2436, close to the number of human training samples. Similarly, in the test set, we randomly selected 3 windows from each human-free image, which brought the number of human-free test samples to 1359, close to the number of human test samples.
Images were encoded using the HEVC reference encoder
(HM-16.2) with common test conditions for intra coding [5]
and QP ∈ {16,20,24,28,32,36,40}. A separate Support
Vector Machine (SVM) was trained on the proposed features
at each QP value. In parallel, a separate SVM was trained on
HOG features extracted from decoded images for each QP.
For evaluation, we used several metrics derived from True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN). These metrics were: Accuracy = (TP + TN) / (TP + FP + TN + FN), Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1-measure (the harmonic mean of Precision and Recall). The accuracy curves are shown in Fig. 6 for HOG, HoPD, CBP, and the combination of HoPD and CBP (denoted HoPD+CBP). As seen in the figure, HoPD features on their
Table 2. Precision, Recall and F1 for HoPD+CBP and HOG

                  HoPD+CBP                    HOG
          Precision Recall  F1      Precision Recall  F1
QP = 16     0.98     0.96   0.97      0.98     0.95   0.96
QP = 20     0.98     0.96   0.97      0.97     0.96   0.97
QP = 24     0.97     0.94   0.95      0.91     0.98   0.94
QP = 28     0.96     0.96   0.96      0.97     0.95   0.96
QP = 32     0.96     0.92   0.94      0.97     0.94   0.95
QP = 36     0.92     0.89   0.91      0.97     0.92   0.94
QP = 40     0.86     0.85   0.86      0.97     0.88   0.92
own provide over 90% accuracy at QP = 16, but the accuracy
deteriorates at higher compression ratios (higher QP). The
other three feature sets show less sensitivity to compression, although their performance also degrades towards higher QP values. This is because at higher QP values the encoder more frequently selects block sizes larger than 8×8, which are less indicative of object boundaries. CBP and HoPD+CBP
are competitive with HOG up to about QP = 32, after which
HOG shows an accuracy advantage of over 2%. Meanwhile,
HoPD+CBP is slightly better than CBP alone, by up to 1.5%,
depending on the case. Overall, HoPD+CBP shows compet-
itive accuracy with HOG up to QP = 32, which is the range
of QPs often employed in practice. Precision, Recall and F1
are shown in Table 2 for HoPD+CBP and HOG, and again we
see that HoPD+CBP shows competitive accuracy with HOG
up to QP = 32.
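For reference, the four metrics defined in the text can be computed directly from the confusion-matrix counts (a simple sketch):

```python
# Evaluation metrics from True/False Positive/Negative counts,
# as defined in the text; F1 is the harmonic mean of Precision and Recall.
def detection_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```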
While their accuracy is comparable (up to QP = 32), HoPD+CBP holds a significant advantage over HOG in terms of processing requirements. HoPD+CBP requires only entropy decoding within an HEVC decoder, which accounts for only 30-40% of the overall decoding time (Table 1). Meanwhile, HOG requires full decoding, as well as storing the resulting image in memory for computing HOG features. Once
features are computed, both HoPD+CBP and HOG will re-
quire approximately the same processing complexity by an
SVM, because their dimensions are about the same: 16256
for HoPD+CBP and 16740 for HOG.
4. CONCLUSION
We proposed a set of HEVC intra coding features for human detection. These features are computed at the output of the entropy decoding module, without access to the subsequent modules that reconstruct pixel values. Yet, they are able to detect the presence of humans with accuracy comparable to the widely used pixel-domain HOG features over a practically important range of QP values. For these reasons, the new features are useful for fast and effective video analysis on a large scale.
5. REFERENCES
[1] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. IEEE CVPR, Dec. 2001, vol. 1, pp. 511–518.
[2] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE CVPR, Jun. 2005, vol. 1, pp. 886–893.
[3] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, “Pedestrian detection with unsupervised multi-stage feature learning,” in Proc. IEEE CVPR, Jun. 2013, pp. 3626–3633.
[4] P. Luo, Y. Tian, X. Wang, and X. Tang, “Switchable
deep network for pedestrian detection,” in Proc. IEEE
CVPR, Jun. 2014, pp. 899–906.
[5] F. Bossen, “Common HM test conditions and soft-
ware reference configurations,” in ISO/IEC JTC1/SC29
WG11 m28412, JCTVC-L1100, Jan. 2013.
[6] B. Shen and I. K. Sethi, “Direct feature extraction from
compressed images,” in Proc. SPIE Storage and Re-
trieval for Image and Video Databases IV, 1996, vol.
2670.
[7] Z. Qian, W. Wang, and T. Qiao, “An edge detection
method in DCT domain,” in Int. Workshop Inform. Elec-
tron. Eng., 2012, pp. 344–348.
[8] T. Tsai, Y.-P. Huang, and T.-W. Chiang, “Image retrieval
based on dominant texture features,” in Proc. IEEE Int.
Symp. Industr. Electronics, Jul. 2006, vol. 1, pp. 441–
446.
[9] R. Fusek and E. Sojka, “Gradient-DCT (G-DCT) de-
scriptors,” in Proc. IEEE Int. Conf. Image Processing
Theory, Tools, Appl. (IPTA), Oct. 2014, pp. 1–6.
[10] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and
A. Luthra, “Overview of the H.264/AVC video coding
standard,” IEEE Trans. Circuits Syst. Video Technol.,
vol. 13, pp. 560–576, Jul. 2003.
[11] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand,
“Overview of the high efficiency video coding (HEVC)
standard,” IEEE Trans. Circuits Syst. Video Technol.,
vol. 22, pp. 1649–1668, Dec. 2012.
[12] L. Zhao, Z. He, W. Cao, and D. Zhao, “Real-time mov-
ing object segmentation and classification from HEVC
compressed surveillance video,” IEEE Trans. Circuits
Syst. Video Technol., to appear.
[13] S. H. Khatoonabadi and I. V. Bajić, “Video object tracking in the compressed domain using spatio-temporal Markov random fields,” IEEE Trans. Image Processing, vol. 22, no. 1, pp. 300–313, Jan. 2013.
[14] X. Liu, Y. Liu, P. Wang, C.-F. Lai, and H.-C. Chao, “An
adaptive mode decision algorithm based on video tex-
ture characteristics for HEVC intra prediction,” IEEE
Trans. Circuits Syst. Video Technol., to appear.
[15] X. Wang and Y. Xue, “Fast HEVC intra coding algorithm based on Otsu’s method and gradient,” in Proc. IEEE Int. Symp. Broadband Multimedia Systems and Broadcasting, Jul. 2016.
[16] F. Pan, X. Lin, S. Rahardja, K. P. Lim, and Z. G. Li,
“A directional field based fast intra mode decision algo-
rithm for H.264 video coding,” in Proc. IEEE ICME’04,
Jun. 2004.
[17] W. Jiang, H. Ma, and Y. Chen, “Gradient based fast
mode decision algorithm for intra prediction in HEVC,”
in 2nd Int. Conf. Consumer Electronics, Communica-
tions and Networks, May 2012.