HEVC INTRA FEATURES FOR HUMAN DETECTION
Hyomin Choi and Ivan V. Bajić
School of Engineering Science, Simon Fraser University
Burnaby, BC, Canada
ABSTRACT
Object detection relies on appropriate features extracted from
images or video. Typically, features are extracted in the pixel
domain, after full decoding. In this paper, we introduce a set
of features derived from HEVC Intra coding syntax elements
- block size, intra prediction modes, and transform coefficient
levels - which enable human detection without full bitstream
decoding. When used with a properly trained linear Support
Vector Machine (SVM), the proposed features achieve com-
petitive accuracy on human detection with widely used His-
togram of Oriented Gradients (HOG) features over a range of
Quantization Parameter (QP) values.
Index Terms— Human detection, SVM, HEVC, intra coding
1. INTRODUCTION
Object detection, and in particular human detection, is a key
component in a number of recent technologies that rely on
computer vision, such as video surveillance, autonomous ve-
hicles, independent living, and many others. Features rep-
resent certain relationships among pixels that are helpful in
deciding whether or not a particular type of object is present
in an image. Features are usually processed by an appropri-
ately designed machine learning algorithm to make a decision
regarding the presence of the corresponding object. Even sim-
ple Haar-like features are able to provide fairly accurate ob-
ject detection when coupled with an advanced machine learn-
ing algorithm [1]. More sophisticated features, such as Histograms of Oriented Gradients (HOG) [2], can be coupled with a simpler machine learning model, such as a Support Vector Machine (SVM), and still offer excellent accuracy. In particular, HOG features coupled with a linear SVM are considered the benchmark handcrafted features for human detection. The
latest trend in human detection is towards learned features
generated by training deep neural networks [3, 4], which pro-
vide higher accuracy, but with higher computational cost.
Common to all these approaches is the need to fully de-
code an image or video frame and store it in memory prior
to computing features. As the scale of data that needs to be
analyzed increases, one must look at ways to save computation wherever possible. One possibility is to compute the features from the compressed bitstream without full decoding. Table 1 shows how much time can be saved by performing only entropy decoding and avoiding inverse quantization and transform, prediction, and in-loop filters. In the table, the classes (A, B, ..., E) refer to video resolutions, while the rows indicate different HEVC coding configurations [5].

Table 1. Average time saving with HEVC entropy decoding only vs. full decoding and reconstruction

              ClassA  ClassB  ClassC  ClassD  ClassE
All Intra       65%     63%     64%     57%     70%
Low Delay       67%     67%     65%     58%     59%
RandomAccess    69%     69%     67%     61%     64%
Indeed, compressed-domain and transform-domain fea-
tures have been studied in the past, especially for earlier im-
age and video coding standards [6, 7, 8, 9]. However, the lat-
est standards such as H.264/AVC [10] and H.265/HEVC [11]
involve a recursive relationship between transform coeffi-
cients and reconstructed pixel values, making it more difficult
to extract meaningful features without fully reconstructing
the image. For this reason, there has been very limited work
on compressed-domain features, especially for HEVC. A rare
recent example is [12], which presents several motion-based
features for human-vehicle classification.
Our focus in this paper is on HEVC intra-coding features
for human detection. Humans detected in intra frames could
then be tracked through inter-frames using a method similar
to [13], for example. One of the features we develop is a his-
togram of intra prediction directions, which is meant to emu-
late HOG [2]. The relationship between prediction directions
and gradients has been recognized before [14, 15, 16, 17]
and used to speed up mode decision in HEVC intra coding.
Here, we use the relationship in reverse, to construct some-
thing analogous to the gradient histogram from prediction di-
rections. Another novel feature is based on transform coeffi-
cient levels in small blocks. These features are combined in a
vector of similar size to HOG and used to train a linear SVM.
The proposed features are described in Section 2, followed by
experiments and conclusions in Sections 3 and 4, respectively.
To be presented at IEEE GlobalSIP'17, Montreal, QC, Nov. 2017
2. PROPOSED FEATURES
Motivated by the success of HOG features [2] in human detec-
tion, we set out to create similar features that could be com-
puted from HEVC compressed data without full image recon-
struction. The result is a Histogram of Prediction Directions
(HoPD), whose relationship with HOG is illustrated in Fig. 1.
We supplement HoPD with another feature called Coefficient
Binary Patterns (CBP), which is derived from transform co-
efficient levels. The combination of HoPD and CBP turns out
to be competitive with HOG over a wide range of QP values,
yet only requires the data available at the output of the HEVC
entropy decoder module. To facilitate comparison with HOG
features, we explain how HoPD and CBP features are com-
puted for 64×128 windows, the same window size that HOG
was originally developed for [2]. It should be noted that the
framework can be easily adapted to other window sizes.
2.1. Histogram of Prediction Directions
HEVC intra coding employs angular prediction with 33 directional modes (indexed 2–34) and two non-directional modes (indexed 0–1). Directional modes uniformly cover a 180° range, from 45° to 225°. Rate-Distortion Optimized (RDO) mode decision tends to select smaller blocks near edges and object boundaries, where prediction directions tend to follow edges, as illustrated in the left part of Fig. 2. Hence, for the purpose of HoPD computation, all Coding Units (CUs) larger than 8×8 are considered non-directionally predicted.
However, since the smallest block size is 4×4 and we get at most one prediction direction per block, the encoded set of prediction directions is much smaller than the HOG vector. Specifically, for a 64×128 window, the HOG vector dimensionality is 16740, while the number of prediction directions would be at most 512. Hence, to enrich the set of prediction directions, we compute 16 sub-directions for each 4×4 block (one direction per pixel), as a weighted average of neighbouring prediction directions.
Let C be the current 4×4 block, whose pixels are labeled a, b, c, ..., as shown in the right part of Fig. 2. Let the neighboring 4×4 blocks be denoted Top-Left (TL), Top (T), Top-Right (TR), ..., and let $D_i \in \{0, 1, ..., 34\}$ be the prediction mode index of the corresponding block, with $i \in \{TL, T, TR, L, C, R, BL, B, BR\}$. For the top-left quadrant of block C (pixels a, b, c, d), the sub-direction index is computed as

$$D_a = \begin{cases} \frac{D_{TL} + 3D_C + 2D_T + 2D_L}{8}, & \text{if } D_{TL}, D_C, D_T, D_L > 1 \\ \frac{D_{TL} + 3D_C}{4}, & \text{if } D_{TL}, D_C > 1;\ D_T \text{ or } D_L \le 1 \\ \frac{2D_T + 2D_L}{4}, & \text{if } D_T, D_L > 1;\ D_{TL} \text{ or } D_C \le 1 \\ 0, & \text{otherwise} \end{cases} \quad (1)$$
(a) Histogram of Oriented Gradients (HOG)
(b) Histogram of Prediction Directions (HoPD)
Fig. 1. Comparison of HOG and HoPD on sample images
Fig. 2. Left: prediction directions (red) in 4×4 blocks; Right: computing sub-directions
$$D_b = \begin{cases} \frac{D_{TL} + 3D_C + 3D_T + D_L}{8}, & \text{if } D_{TL}, D_C, D_T, D_L > 1 \\ \frac{D_{TL} + 3D_C}{4}, & \text{if } D_{TL}, D_C > 1;\ D_T \text{ or } D_L \le 1 \\ \frac{3D_T + D_L}{4}, & \text{if } D_T, D_L > 1;\ D_{TL} \text{ or } D_C \le 1 \\ 0, & \text{otherwise} \end{cases} \quad (2)$$
$$D_c = \begin{cases} \frac{D_{TL} + 3D_C + D_T + 3D_L}{8}, & \text{if } D_{TL}, D_C, D_T, D_L > 1 \\ \frac{D_{TL} + 3D_C}{4}, & \text{if } D_{TL}, D_C > 1;\ D_T \text{ or } D_L \le 1 \\ \frac{D_T + 3D_L}{4}, & \text{if } D_T, D_L > 1;\ D_{TL} \text{ or } D_C \le 1 \\ 0, & \text{otherwise} \end{cases} \quad (3)$$
Fig. 3. (a) sample image, (b) residual image, (c) 4×4 CBP map, (d) 8×8 CBP map, (e) combined CBP map
$$D_d = \begin{cases} D_C, & \text{if } D_C > 1 \\ \frac{D_T + D_L}{2}, & \text{if } D_T, D_L > 1;\ D_C \le 1 \\ 0, & \text{otherwise} \end{cases} \quad (4)$$
The sub-direction index $D_j$, $j \in \{a, b, c, d\}$, is rounded down if any of equations (1)–(4) results in a fractional value. Sub-directions in the other quadrants of block C are computed analogously using neighboring block directions. Note that if C and/or a sufficient number of neighboring blocks are directionally predicted ($D_i > 1$ for $i \in \{TL, T, ..., BR\}$), the corresponding sub-direction $D_j$ will also be directional ($D_j > 1$), because it is computed as a weighted average over numbers larger than 1. If not, the sub-direction is assigned a value of 0, which indicates non-directional prediction.
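As an illustration, equations (1)–(4) can be implemented directly. The sketch below (in Python, with hypothetical names such as `subdirs_tl_quadrant`) assumes the prediction mode indices of block C and its neighbours have already been parsed from the bitstream.

```python
def subdirs_tl_quadrant(modes):
    """Sub-direction indices for pixels a, b, c, d in the top-left
    quadrant of the current 4x4 block C, per Eqs. (1)-(4).
    `modes` maps neighbour labels ('TL', 'T', 'L', 'C', ...) to HEVC
    intra mode indices in 0..34; directional modes have index > 1.
    Fractional results are rounded down (integer division), as stated
    in the text; 0 denotes non-directional prediction."""
    TL, T, L, C = modes['TL'], modes['T'], modes['L'], modes['C']

    def quad(wT, wL):
        # Shared case structure of Eqs. (1)-(3); wT and wL are the
        # weights on the Top and Left neighbour modes.
        if TL > 1 and C > 1 and T > 1 and L > 1:
            return (TL + 3 * C + wT * T + wL * L) // 8
        if TL > 1 and C > 1:             # T or L non-directional
            return (TL + 3 * C) // 4
        if T > 1 and L > 1:              # TL or C non-directional
            return (wT * T + wL * L) // 4
        return 0

    Da = quad(2, 2)                      # Eq. (1)
    Db = quad(3, 1)                      # Eq. (2)
    Dc = quad(1, 3)                      # Eq. (3)
    if C > 1:                            # Eq. (4)
        Dd = C
    elif T > 1 and L > 1:
        Dd = (T + L) // 2
    else:
        Dd = 0
    return Da, Db, Dc, Dd
```

The other three quadrants would reuse the same case structure with the corresponding neighbours (TR, R, BL, B) substituted.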
Once all 16 sub-directions are computed, the histogram is formed as follows. The 180° range is divided into 9 bins (same as in HOG). Each directional sub-direction ($D_j > 1$) increases the count in its corresponding bin. In addition, the two neighboring bins are incremented. Let $\alpha$ be the difference between the angle of $D_j$ and the upper boundary of its bin in the histogram, divided by the bin width (so that $0 \le \alpha \le 1$). Then the bin above is incremented by $1 - \alpha$ and the bin below is incremented by $\alpha$. Meanwhile, a non-directional sub-direction ($D_j = 0$) increments all bins by 1. After such a histogram is formed for all 4×4 blocks in a 64×128 window, the histograms are stacked together into a HoPD vector, whose dimension is $(64/4) \cdot (128/4) \cdot 9 = 4608$.
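The per-block histogram construction can be sketched as follows. The linear mode-to-angle mapping in `mode_angle` and the wrap-around handling at the 0°/180° boundary are assumptions, since the text does not specify them.

```python
def mode_angle(d):
    """Angle for a directional mode index d in [2, 34], assuming a
    linear mapping of the 33 angular modes over 45..225 degrees."""
    return 45.0 + (d - 2) * 180.0 / 32.0

def hopd_cell(subdirs, nbins=9):
    """nbins-bin histogram for the 16 sub-directions of one 4x4 block,
    with the soft voting described above: a directional sub-direction
    adds 1 to its own bin, (1 - alpha) to the bin above and alpha to
    the bin below; a non-directional one adds 1 to every bin."""
    bin_w = 180.0 / nbins
    h = [0.0] * nbins
    for d in subdirs:
        if d <= 1:                                # non-directional
            h = [v + 1.0 for v in h]
            continue
        theta = (mode_angle(d) - 45.0) % 180.0    # fold into [0, 180)
        b = min(int(theta // bin_w), nbins - 1)   # own bin
        alpha = ((b + 1) * bin_w - theta) / bin_w # distance to upper edge
        h[b] += 1.0
        h[(b + 1) % nbins] += 1.0 - alpha         # bin above (wraps)
        h[(b - 1) % nbins] += alpha               # bin below (wraps)
    return h
```

Each directional vote contributes a total mass of 2 per sub-direction, while a non-directional one contributes 9, so purely flat regions still produce a distinctive (uniform) cell histogram.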
2.2. Coefficient Binary Patterns
To supplement HoPD, we introduce Coefficient Binary Pat-
tern (CBP) features. Although intra prediction in HEVC
makes the relationship between pixel values and transform
coefficients recursive, in some cases residuals resemble orig-
inal pixel structures because the corresponding pixels cannot
be effectively predicted from neighboring regions. An exam-
ple is given in Fig. 3, where we see that the face and some
outlines of the body from Fig. 3(a) are still visible in the
residual shown in Fig. 3(b). Hence, transform coefficients still contain useful information about local pixel values, even though those pixel values cannot be fully recovered from these transform coefficients alone.

Fig. 4. Coefficients used in CBP for (a) the 4×4 DST and (b) the 8×8 DCT
In HEVC, 4×4 blocks are transformed using an integer approximation to the Discrete Sine Transform (DST), while larger blocks use integer approximations to the Discrete Cosine Transform (DCT). We construct CBPs only for 4×4 DST blocks and 8×8 DCT blocks. To construct 4×4 CBPs, we overlay a 4×4 cell grid over the 64×128 window. Then, for each 4×4 cell, if the corresponding block has been transformed using the 4×4 DST, we generate a 16-bit map, with each bit indicating whether the corresponding coefficient's magnitude is greater than 0 (bit = 1) or not (bit = 0). For 4×4 cells that have been transformed using larger transforms, we generate an all-zero binary map. An example of a 4×4 CBP map is shown in Fig. 3(c). An analogous procedure, using an 8×8 grid, is used to construct the 8×8 CBP map (Fig. 3(d)), except that in this case we consider only the 27 coefficients at low and medium frequencies, indicated in Fig. 4(b). Since each block is transformed using only one transform, the non-zero entries in the 4×4 and 8×8 CBP maps do not overlap, and they can be combined as shown in Fig. 3(e).
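A minimal sketch of the CBP construction, assuming the coefficient levels of each transform block are available from the entropy decoder. The function names and container layout are hypothetical, and the 27-position mask for 8×8 blocks would be supplied according to Fig. 4(b).

```python
def cbp_bits(levels, keep=None):
    """Binary significance map of one transform block: bit = 1 where
    the coefficient level magnitude is > 0. `keep` optionally restricts
    which positions may be set (e.g., the 27 retained low/medium
    frequency positions of an 8x8 DCT block); all others get 0."""
    return [1 if abs(c) > 0 and (keep is None or i in keep) else 0
            for i, c in enumerate(levels)]

def cbp_map(window_blocks, cell, window=(64, 128), nbits=16, keep=None):
    """CBP map over a detection window. `window_blocks` maps the
    (row, col) of each cell on a `cell`x`cell` grid to its coefficient
    levels; a missing key means the cell was covered by a different
    transform size, so an all-zero map is emitted there."""
    cols, rows = window[0] // cell, window[1] // cell
    out = {}
    for r in range(rows):
        for c in range(cols):
            levels = window_blocks.get((r, c))
            out[(r, c)] = (cbp_bits(levels, keep) if levels is not None
                           else [0] * nbits)
    return out
```

Because a given cell belongs to exactly one transform block, the 4×4 and 8×8 maps built this way never have overlapping non-zero entries and can be merged directly.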
Each cell in each of the two CBP maps is vectorized as

$$V_k = \left(V_k^1, V_k^2, V_k^3, V_k^4\right) \quad (5)$$

where $k \in \{DCT, DST\}$ and the superscript $n \in \{1, 2, 3, 4\}$ indicates the various regions of the transform block (Fig. 4) in vertical-scan order. All the vectors are then stacked together and combined with HoPD. The total number of 4×4 CBP features is $(64/4) \cdot (128/4) \cdot 16 = 8192$ and the total number of 8×8 CBP features is $(64/8) \cdot (128/8) \cdot 27 = 3456$. Together with HoPD, the overall dimension of our feature vector is $4608 + 8192 + 3456 = 16256$. This is slightly less than the dimension of HOG features for the same window size, which is 16740.
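The dimension bookkeeping above can be verified in a few lines:

```python
# Feature-vector dimensions for a 64x128 window, as derived above.
hopd  = (64 // 4) * (128 // 4) * 9     # one 9-bin histogram per 4x4 cell
cbp44 = (64 // 4) * (128 // 4) * 16    # 16 bits per 4x4 DST cell
cbp88 = (64 // 8) * (128 // 8) * 27    # 27 retained coefficients per 8x8 cell
total = hopd + cbp44 + cbp88
print(hopd, cbp44, cbp88, total)       # 4608 8192 3456 16256 (HOG: 16740)
```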
3. EXPERIMENTAL RESULTS
For our experiments, we used the INRIA person dataset¹, which provides a training set with 2416 images of humans and 1218 human-free images, as well as a test set with 1126 human and 453 human-free images. In the test set, human images were scaled up from 70×134 to 80×144 (Fig. 5) using bicubic interpolation, in order to make the image dimensions a multiple of 8, as required for HEVC encoding. Human-free images were generally larger than images containing humans, which allows selecting multiple human-free windows from them. In the training set, we randomly selected 2 windows within each human-free image, which brought the number of human-free training samples up to 2436, close to the number of human training samples. Similarly, in the test set, we randomly selected 3 windows from each human-free image, which brought the number of human-free test samples to 1359, close to the number of human test samples.

¹http://pascal.inrialpes.fr/data/human/

Fig. 5. Test image size format and margin

Fig. 6. Accuracy at various QP values
Images were encoded using the HEVC reference encoder
(HM-16.2) with common test conditions for intra coding [5]
and QP ∈ {16,20,24,28,32,36,40}. A separate Support
Vector Machine (SVM) was trained on the proposed features
at each QP value. In parallel, a separate SVM was trained on
HOG features extracted from decoded images for each QP.
For evaluation, we used several metrics derived from True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN). These metrics were: Accuracy = (TP + TN) / (TP + FP + TN + FN), Precision = TP / (TP + FP), Recall = TP / (TP + FN), and the F1-measure (the harmonic mean of Precision and Recall). The accuracy curves are shown in Fig. 6 for HOG, HoPD, CBP, and the combination of HoPD and CBP (denoted HoPD+CBP). As seen in the figure, HoPD features on their
Table 2. Precision, Recall and F1 for HoPD+CBP and HOG

                 HoPD+CBP                  HOG
          Precision Recall  F1    Precision Recall  F1
QP = 16     0.98     0.96  0.97     0.98     0.95  0.96
QP = 20     0.98     0.96  0.97     0.97     0.96  0.97
QP = 24     0.97     0.94  0.95     0.91     0.98  0.94
QP = 28     0.96     0.96  0.96     0.97     0.95  0.96
QP = 32     0.96     0.92  0.94     0.97     0.94  0.95
QP = 36     0.92     0.89  0.91     0.97     0.92  0.94
QP = 40     0.86     0.85  0.86     0.97     0.88  0.92
own provide over 90% accuracy at QP = 16, but the accuracy
deteriorates at higher compression ratios (higher QP). The
other three feature sets show less sensitivity to compression
although their performance also degrades towards higher QP
values. This is because, at higher QP values, the encoder tends to select block sizes larger than 8×8 more frequently, and such blocks are less indicative of object boundaries. CBP and HoPD+CBP
are competitive with HOG up to about QP = 32, after which
HOG shows an accuracy advantage of over 2%. Meanwhile,
HoPD+CBP is slightly better than CBP alone, by up to 1.5%,
depending on the case. Overall, HoPD+CBP shows compet-
itive accuracy with HOG up to QP = 32, which is the range
of QPs often employed in practice. Precision, Recall and F1
are shown in Table 2 for HoPD+CBP and HOG, and again we
see that HoPD+CBP shows competitive accuracy with HOG
up to QP = 32.
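The metrics reported in Fig. 6 and Table 2 follow directly from the confusion counts; a minimal sketch (the function name is ours):

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, Precision, Recall and F1 from confusion counts,
    using the definitions given in the text."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1
```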
While their accuracy is comparable (up to QP = 32), HoPD+CBP holds a significant advantage over HOG in terms of processing requirements. HoPD+CBP requires only entropy decoding within an HEVC decoder, which accounts for only 30–40% of the overall decoding time (Table 1). Mean-
while, HOG requires full decoding, as well as storing the re-
sulting image in memory for computing HOG features. Once
features are computed, both HoPD+CBP and HOG will re-
quire approximately the same processing complexity by an
SVM, because their dimensions are about the same: 16256
for HoPD+CBP and 16740 for HOG.
4. CONCLUSION
We proposed a set of HEVC intra coding features for human
detection. These features are computed at the output of the entropy decoding module, without access to the subsequent modules that reconstruct pixel values. Yet, they are able to detect
the presence of humans with accuracy comparable to widely
used pixel-domain HOG features over a practically important
range of QP values. For these reasons, the new features are well suited to fast and effective video analysis at a large scale.
5. REFERENCES
[1] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. IEEE CVPR, Dec. 2001, vol. 1, pp. 511–518.
[2] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE CVPR, Jun. 2005, vol. 1, pp. 886–893.
[3] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. Le-
cun, “Pedestrian detection with unsupervised multi-
stage feature learning,” in Proc. IEEE CVPR, Jun. 2013,
pp. 3626–3633.
[4] P. Luo, Y. Tian, X. Wang, and X. Tang, "Switchable deep network for pedestrian detection," in Proc. IEEE CVPR, Jun. 2014, pp. 899–906.
[5] F. Bossen, "Common HM test conditions and software reference configurations," ISO/IEC JTC1/SC29 WG11 m28412, JCTVC-L1100, Jan. 2013.
[6] B. Shen and I. K. Sethi, “Direct feature extraction from
compressed images,” in Proc. SPIE Storage and Re-
trieval for Image and Video Databases IV, 1996, vol.
2670.
[7] Z. Qian, W. Wang, and T. Qiao, “An edge detection
method in DCT domain,” in Int. Workshop Inform. Elec-
tron. Eng., 2012, pp. 344–348.
[8] T. Tsai, Y.-P. Huang, and T.-W. Chiang, "Image retrieval based on dominant texture features," in Proc. IEEE Int. Symp. Industr. Electronics, Jul. 2006, vol. 1, pp. 441–446.
[9] R. Fusek and E. Sojka, “Gradient-DCT (G-DCT) de-
scriptors,” in Proc. IEEE Int. Conf. Image Processing
Theory, Tools, Appl. (IPTA), Oct. 2014, pp. 1–6.
[10] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and
A. Luthra, “Overview of the H.264/AVC video coding
standard,” IEEE Trans. Circuits Syst. Video Technol.,
vol. 13, pp. 560–576, Jul. 2003.
[11] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand,
“Overview of the high efficiency video coding (HEVC)
standard,” IEEE Trans. Circuits Syst. Video Technol.,
vol. 22, pp. 1649–1668, Dec. 2012.
[12] L. Zhao, Z. He, W. Cao, and D. Zhao, "Real-time moving object segmentation and classification from HEVC compressed surveillance video," IEEE Trans. Circuits Syst. Video Technol., to appear.
[13] S. H. Khatoonabadi and I. V. Bajić, "Video object tracking in the compressed domain using spatio-temporal Markov random fields," IEEE Trans. Image Processing, vol. 22, no. 1, pp. 300–313, Jan. 2013.
[14] X. Liu, Y. Liu, P. Wang, C.-F. Lai, and H.-C. Chao, “An
adaptive mode decision algorithm based on video tex-
ture characteristics for HEVC intra prediction,” IEEE
Trans. Circuits Syst. Video Technol., to appear.
[15] X. Wang and Y. Xue, "Fast HEVC intra coding algorithm based on Otsu's method and gradient," in Proc. IEEE Int. Symp. Broadband Multimedia Systems and Broadcasting, Jul. 2016.
[16] F. Pan, X. Lin, S. Rahardja, K. P. Lim, and Z. G. Li, "A directional field based fast intra mode decision algorithm for H.264 video coding," in Proc. IEEE ICME'04, Jun. 2004.
[17] W. Jiang, H. Ma, and Y. Chen, “Gradient based fast
mode decision algorithm for intra prediction in HEVC,”
in 2nd Int. Conf. Consumer Electronics, Communica-
tions and Networks, May 2012.