An HOG-LBP Human Detector with Partial Occlusion Handling

Xiaoyu Wang, Tony X. Han
Electrical and Computer Engineering Dept.
University of Missouri
Columbia, MO 65211

Shuicheng Yan
Electrical and Computer Engineering Dept.
National University of Singapore
Singapore 117576

Abstract

By combining Histograms of Oriented Gradients (HOG)
and Local Binary Pattern (LBP) as the feature set, we pro-
pose a novel human detection approach capable of handling
partial occlusion. Two kinds of detectors, i.e., global de-
tector for whole scanning windows and part detectors for
local regions, are learned from the training data using lin-
ear SVM. For each ambiguous scanning window, we con-
struct an occlusion likelihood map by using the response
of each block of the HOG feature to the global detector.
The occlusion likelihood map is then segmented by Mean-
shift approach. The segmented portion of the window with
a majority of negative response is inferred as an occluded
region. If partial occlusion is indicated with high likelihood
in a certain scanning window, part detectors are applied
on the unoccluded regions to achieve the final classifica-
tion on the current scanning window. With the help of the
augmented HOG-LBP feature and the global-part occlu-
sion handling method, we achieve a detection rate of 91.3%
with FPPW = 10^-6, 94.7% with FPPW = 10^-5, and 97.9%
with FPPW = 10^-4 on the INRIA dataset, which, to our best
knowledge, is the best human detection performance on the
INRIA dataset. The global-part occlusion handling method
is further validated using synthesized occlusion data con-
structed from the INRIA and Pascal dataset.
1. Introduction
Human detection has very important applications in
video surveillance, content-based image/video retrieval,
video annotation, and assisted living. However, detecting
humans in images/videos is a challenging task owing to
their variable appearance and the wide range of poses that
they can adopt.
The results of The Pascal Challenge from 2005 to 2008
[12] and the recent research [8,13,15,28,18,21] indicate
that sliding window classifiers are presently the predomi-
nant method being used in object detection, or more specif-
ically, human detection, due to their good performance.
Figure 1. The first row shows ambiguous images in the scanning
windows. The second row shows the corresponding segmented
occlusion likelihood images. For each segmented region, the neg-
ative overall score, i.e. the sum of the HOG block responses to the
global detector, indicates possible partial occlusion. The first four
columns are from the INRIA testing data. The last two columns
are samples of our synthesized data with partial occlusion.
For the sliding window detection approach, each image is
densely scanned from the top left to the bottom right with
rectangular sliding windows (as shown in Figure 1) in dif-
ferent scales. For each sliding window, certain features such
as edges, image patches, and wavelet coefficients are ex-
tracted and fed to a classifier, which is trained offline using
labeled training data. The classifier will classify the sliding
windows, which bound a person, as positive samples, and
the others as negative samples. Currently, the Support Vector
Machine (SVM) and variants of boosted decision trees are two
leading classifiers, owing to their good performance and efficiency.

Although preferred for its performance in general, compared
to other detectors such as part-based detectors [1,
14,16,19,32], the sliding window approach handles par-
tial occlusions poorly. Because the features inside the scan-
ning window are densely selected, if a portion of the scan-
ning window is occluded, the features corresponding to the
occluded area are inherently noisy and will deteriorate the
classification result of the whole window. On the other hand,
part based detectors [16,19,32] can alleviate the occlusion
problem to some extent by relying on the unoccluded part
to determine the human position.
In order to integrate the advantage of part-based detec-
tors in occlusion handling to the sliding-window detectors,
we need to find the occluded regions inside the sliding win-
dow when partial occlusion appears. Therefore, we have
to answer two key questions: 1) How do we decide whether
partial occlusion occurs in a scanning window? 2) If there
is partial occlusion in the sliding window, how do we estimate
its location?
To infer the occluded regions when partial occlusions
happen, we propose an approach based on segmenting the
“locally distributed” scores of the global classification score
inside each sliding window.
Through the study of the classification scores of the lin-
ear SVM on the INRIA dataset [8,9], we found an interest-
ing phenomenon: If a portion of the pedestrian is occluded,
the densely extracted blocks of Histograms of Oriented Gra-
dients (HOG) feature [8] in that area uniformly respond to
the linear SVM classifier with negative inner products.
This interesting phenomenon leads us to study the cause
behind it. The HOG feature of each scanning window
is constituted by 105 gradient histograms extracted from
7 × 15 = 105 blocks (image patches of 16 × 16 pixels).
By noticing the linearity of the scalar product, the linear
SVM score of each scanning window is actually an inner
product between the HOG feature (i.e. the concatenation of
the 105 orientation histograms) and a vector w, which is
the weighted sum of all the support vectors learned. (The
procedure of distributing the constant bias β to each block
is discussed in Section 3.3.)
Therefore, the linear SVM score is a sum of 105 linear
products between the HOG blocks and the corresponding
w_i, i = 1, ..., 105. In our framework, these 105 linear
products are called responses of the HOG blocks. For an
ambiguous scanning window, we construct a binary occlu-
sion likelihood image with a resolution of 7×15. The in-
tensity of each pixel in the occlusion likelihood image is the
sign of the corresponding block response.
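As an illustrative sketch (not the authors' code), the per-block responses and the binary occlusion likelihood map can be computed as follows; the weight vector and window feature are random stand-ins for the learned linear SVM and an extracted 3780-D HOG descriptor:

```python
import numpy as np

rng = np.random.default_rng(0)
N_BLOCKS, BLOCK_DIM = 105, 36     # 7 x 15 blocks, 36-D sub-HOG per block

w = rng.standard_normal(N_BLOCKS * BLOCK_DIM)   # stand-in SVM weight vector
x = rng.standard_normal(N_BLOCKS * BLOCK_DIM)   # stand-in window HOG feature

# Response of block i: inner product of its 36-D slice of x with w_i.
responses = (w.reshape(N_BLOCKS, BLOCK_DIM)
             * x.reshape(N_BLOCKS, BLOCK_DIM)).sum(axis=1)

# By linearity, the window's linear SVM score (bias aside) is the sum
# of the 105 block responses.
assert np.isclose(responses.sum(), w @ x)

# Binary occlusion likelihood image: the sign of each block response,
# arranged on the 15 x 7 block grid of the scanning window.
likelihood = np.sign(responses).reshape(15, 7)
```

The 7×15 sign map is what the mean shift segmentation later operates on.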
For each sliding window with ambiguous classification
score, we can segment out the possible occlusion regions by
running image segmentation algorithms on the binary oc-
clusion likelihood image. The mean shift algorithm [4,5] is
applied to segment the binary image for each window. The
real-valued response of each block is used as the weight-
ing density of each pixel in the mean shift framework. The
segmented regions with a negative overall response are in-
ferred as occluded regions for the scanning window. Some
examples of the segmented occlusion likelihood image are
shown in Figure 1. The negative regions are possible oc-
cluded regions.
Once the occluded regions are detected, we minimize the
occlusion effects by resorting to a part-based detector on the
unoccluded area. (See details in Section 3.3).
The contribution of this paper is three-fold: 1) Through
occlusion inference on sliding window classification re-
sults, we propose an approach to integrate the advantage
of part-based detectors in occlusion handling to the sliding-
window detectors; 2) An augmented feature, HOG-LBP,
which combines HOG with cell-structured Local Binary
Pattern (LBP) [3], is proposed as the feature, based on
which the HOG-LBP human detector achieves better per-
formance than all of known state-of-the-art human detectors
[8,28,18,34,25,27,20] on the INRIA dataset (refer to
Sections 3.1 and 4 for details). 3) We simplify the trilinear
interpolation procedure as a 2D convolution so that it can
be integrated to the integral histogram approach, which is
essential to the efficiency of sliding window detectors.
2. Related Work
Wu and Nevatia [32,33] use Bayesian combination to
combine the part detectors to get a robust detection in the
situation of partial occlusion. They assume the humans
walk on a ground plane and the image is captured by a cam-
era looking down to the ground. Stein [26] takes advantage
of occlusion boundaries to help high-level reasoning and
improve object segmentation. Lin and Tang [6] present a
framework to automatically detect and recover the occluded
facial region. Fu et al. [23] propose a detection algorithm
based on occlusion reasoning and partial division block
template matching for tracking tasks.
Mu et al. [20] state that the traditional LBP operator in [2]
does not suit the human detection problem well. We pro-
pose a different cell-structured LBP. The scanning window
is divided into non-overlapping cells of size 16 × 16.
The LBPs extracted from the cells are concatenated into a
cell-structured LBP, similar to the cell-block structure in [8].
As shown in Figure 6(a) in the experiments section, the de-
tection results based on our cell-structured LBP are much
better than those of [20].
3. Approach
The human detection procedure based on the HOG-LBP
feature is shown in Figure 2. Our occlusion handling idea is
based on global and part detectors trained using the HOG-
LBP feature.
[Figure 2 flowchart: input image → gradient at each pixel and LBP at each pixel → trilinear interpolation → SVM classification for each scanning window.]
Figure 2. The framework of the HOG-LBP detector (without occlusion handling).
3.1. Human Detection using Integrated HOG-LBP
As a dense version of the dominant SIFT [17] fea-
ture, HOG [8] has shown great success in object detection
and recognition [8,9,13,25,34]. HOG has been widely
accepted as one of the best features to capture the edge or
local shape information.
The LBP operator [22], meanwhile, is an exceptional texture
descriptor. It has been widely used in various applications
and has achieved very good results in face recognition [3].
The LBP is highly discriminative and its key advantages,
namely its invariance to monotonic gray level changes and
computational efficiency, make it suitable for demanding
image analysis tasks such as human detection.
We propose an augmented feature vector, which com-
bines the HOG feature with the cell-structured LBP fea-
ture. HOG performs poorly when the background is clut-
tered with noisy edges. Local Binary Pattern is complemen-
tary in this aspect. It can filter out noises using the concept
of uniform pattern [22]. We believe that the appearance
of a human can be better captured if we combine both the
edge/local shape information and the texture information.
As shown in Figure 7 in the experiments section, our con-
jecture is verified by our experiments on the INRIA dataset.
We follow the procedure in [8] to extract the HOG fea-
ture. For the construction of the cell-structured LBP, we
directly build pattern histograms in cells. The histograms
of the LBP patterns from different cells are then concate-
nated to describe the texture of the current scanning win-
dow. We use the notation L BP u
n,r to denote LBP feature
that takes nsample points with radius r, and the number
of 0-1 transitions is no more than u. The pattern that satis-
fies this constraint is called uniform patterns in [22]. For
example, the pattern 0010010 is a nonuniform pattern for
LBP 2, and is a uniform pattern for LB P 4because LBP 4
allows four 0-1 transitions. In our approach, different uni-
form patterns are counted into different bins and all of the
nonuniform patterns are voted into one bin.
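A small sketch of the uniform-pattern test, using the example pattern from the text (`circular_transitions` and `is_uniform` are hypothetical helper names):

```python
def circular_transitions(pattern: str) -> int:
    # Count 0-1 transitions, comparing each bit with its circular successor.
    return sum(a != b for a, b in zip(pattern, pattern[1:] + pattern[0]))

def is_uniform(pattern: str, u: int) -> bool:
    # A pattern is uniform for LBP^u if it has at most u transitions.
    return circular_transitions(pattern) <= u

p = "0010010"                                # the example pattern from the text
print(circular_transitions(p))               # 4
print(is_uniform(p, 2), is_uniform(p, 4))    # False True
```

Patterns failing the test would all be voted into the single nonuniform bin.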
Using the l∞ distance to measure the distance to the
center pixel, i.e. d((x1, y1), (x2, y2)) = max(|x1 −
x2|, |y1 − y2|), we illustrate the LBP_{8,1} feature extraction
process in Figure 3.
[Figure 3 example: the 3×3 patch
 50  60 101
 30 100 122
200 220 156
with center pixel 100 yields the binary pattern 0 1 1 1 1 1 0 0.]
Figure 3. The LBP_{8,1} feature extraction using the l∞ distance.
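The Figure 3 example can be reproduced with a few lines; the clockwise scan starting at the top-middle neighbor is an assumption chosen to match the pattern shown in the figure:

```python
import numpy as np

patch = np.array([[ 50,  60, 101],
                  [ 30, 100, 122],
                  [200, 220, 156]])
center = patch[1, 1]

# The 8 neighbors at l-infinity distance 1, clockwise from top-middle.
order = [(0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0), (0, 0)]

# Threshold: neighbor >= center -> 1, else 0.
pattern = "".join("1" if patch[r, c] >= center else "0" for r, c in order)
print(pattern)   # 01111100
```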
In our implementation, we use Euclidean distance to
measure the distance to achieve better performance. Bilin-
ear interpolation is needed in order to extract the circular
local binary patterns from a rectangular lattice. The perfor-
mance comparison of the cell-structured LBP features with
different parameters is shown in Figure 6(a) in the experi-
ments section.
3.2. Integral Histogram Construction with Convo-
luted Trilinear Interpolation
In spite of its good performance, the approach of sliding
window classification is often criticized as being too
resource-demanding and computationally expensive. The integral im-
age/histogram [29,24,34], the efficient subwindow search
[15], and the increasingly powerful parallel computing
hardware (e.g. GPU and multicore CPU) help to alleviate
the speed problem. Within the framework of the integral
image/histogram [29,24,34], the extraction of the features
for scanning windows has a constant complexity O(c) (two
vector additions and two vector subtractions). Many state-of-
the-art detectors [28,15,34,30,25] based on sliding win-
dow classifiers use the integral image method to increase
the running speeds by several folds.
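As a sketch of the integral histogram idea (with synthetic per-pixel bin votes), one cumulative table per orientation bin makes any rectangular window histogram a constant-cost query:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, BINS = 64, 48, 9
bin_img = rng.random((H, W, BINS))        # synthetic per-pixel bin votes

# Zero-padded cumulative sums over rows and columns, one table per bin.
ii = np.zeros((H + 1, W + 1, BINS))
ii[1:, 1:] = bin_img.cumsum(axis=0).cumsum(axis=1)

def window_hist(top, left, bottom, right):
    """Histogram of [top, bottom) x [left, right): two vector additions
    and two vector subtractions, independent of the window size."""
    return (ii[bottom, right] - ii[top, right]
            - ii[bottom, left] + ii[top, left])

h = window_hist(8, 4, 24, 20)
assert np.allclose(h, bin_img[8:24, 4:20].sum(axis=(0, 1)))
```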
Trilinear interpolation and Gaussian weighting are two
important sub-procedures in HOG construction [8]. The
naive distribution scheme of the orientation magnitude
would cause aliasing effects, both in orientation bin and
spatial dimensions. Such aliasing effects can cause sudden
changes in the final features which make them not stable
enough. For example, if a strong edge pixel is at the bound-
ary of a cell in one image and, due to certain slight changes,
it falls into the neighboring cell in another image, the naive
voting scheme assigns the pixel’s weight to different his-
togram bins in the two cases. To avoid this problem, we
should distribute the effect of the gradient of each pixel to
its neighborhood. In our experiments on the INRIA dataset,
when FA = 10^-4, we found that the HOG-LBP detector with-
out the trilinear interpolation has a detection rate 3% lower.
The performance of our HOG-LBP detector is not affected
by the Gaussian weighting procedure.
It was believed that trilinear interpolation did not fit
well into the integral image approach [34]. While the inte-
grated HOG feature without trilinear interpolation is fast to
compute, it is inferior to the original HOG.
In order to take advantage of the integral image with-
out impairing the performance, we propose an approach,
named as Convoluted Trilinear Interpolation (CTI), to do
the trilinear interpolation [7]. For HOG, the direction of
the gradient at each pixel is discretized into 9 bins. So at
each pixel, the gradient is a 2D vector with a real-valued
magnitude and a discretized direction (9 possible directions
uniformly distributed in [0, π)). During the construction of
the integral image of HOG, if we treat the feature value at
each pixel as a 2D vector, we won’t be able to do the trilin-
ear interpolation between pixels. To conquer this difficulty,
we treat the feature value at each pixel as a 9D vector, of
which the value at each dimension is the interpolated mag-
nitude value at the corresponding direction. The trilinear
interpolation can be done by convolution before construct-
ing the integral image as shown in Figure 4.
[Figure 4 panels: original pixel gradient; voted into adjacent bins; convoluted bin image; integral bin image (over the whole image).]
Figure 4. The illustration of the trilinear interpolation in the framework of integral image.
We designed a 7 by 7 convolution kernel to implement
the fast trilinear interpolation. The weights are distributed
to the neighborhood linearly according to the distances.
(1/256) ×
| 1  2  3  4  3  2  1 |
| 2  4  6  8  6  4  2 |
| 3  6  9 12  9  6  3 |
| 4  8 12 16 12  8  4 |
| 3  6  9 12  9  6  3 |
| 2  4  6  8  6  4  2 |
| 1  2  3  4  3  2  1 |        (1)
First, we need to vote the gradient with a real-valued di-
rection between 0 and π into the 9 discrete bins according
to its direction and magnitude. Using bilinear interpolation,
we distribute the magnitude of the gradient into two adja-
cent bins (as shown in the top-right subplot of Figure 4).
Then, the kernel in Equation (1) is used to convolve over
the orientation bin image to achieve the trilinear interpola-
tion. The intermediate results are the trilinearly interpolated
gradient image (bottom-left subplot of Figure 4), ready for
integral image construction.
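A minimal sketch of the CTI pipeline on synthetic gradients: bilinear voting of each gradient magnitude into its two adjacent orientation bins, followed by the 7×7 linear-weight kernel (built as an outer product and normalized by 256) convolved over each bin image:

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, BINS = 32, 32, 9
mag = rng.random((H, W))                 # synthetic gradient magnitudes
theta = rng.random((H, W)) * np.pi       # synthetic directions in [0, pi)

# 1) Bilinear orientation voting into the 9 discrete bins.
pos = theta / (np.pi / BINS)             # continuous bin coordinate
lo = np.floor(pos).astype(int) % BINS
hi = (lo + 1) % BINS
frac = pos - np.floor(pos)
bin_img = np.zeros((H, W, BINS))
rows, cols = np.indices((H, W))
np.add.at(bin_img, (rows, cols, lo), mag * (1 - frac))
np.add.at(bin_img, (rows, cols, hi), mag * frac)

# 2) The 7x7 linear-weight kernel of Equation (1): an outer product of
#    [1 2 3 4 3 2 1] with itself, normalized by 256 so it sums to 1.
k1d = np.array([1, 2, 3, 4, 3, 2, 1], dtype=float)
kernel = np.outer(k1d, k1d) / 256.0
assert np.isclose(kernel.sum(), 1.0)

# 3) Convolve each bin image with the kernel; the kernel is separable,
#    so two 1-D passes (each normalized by 16) suffice.
def conv1d(a, k, axis):
    pad = [(0, 0)] * a.ndim
    pad[axis] = (3, 3)
    ap = np.pad(a, pad)
    return sum(k[i] * np.take(ap, np.arange(i, i + a.shape[axis]), axis=axis)
               for i in range(7))

smoothed = conv1d(conv1d(bin_img, k1d / 16, 0), k1d / 16, 1)
```

The smoothed bin images are then accumulated into the integral image exactly as before.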
We want to emphasize that the CTI approach doesn’t in-
crease the space complexity of the integral image approach.
The intermediate trilinear interpolated results can be stored
using the space allocated for the integral image. The tri-
linear interpolated gradient histogram image is of the same
size as the integral image. The extra computation time is
small. For each image, it is only a convolution with a 7×7
kernel, which can be further accelerated by Fast Fourier
Transform (FFT).
3.3. Combined Global/Part-based Detector for Oc-
clusion Handling
Through the study of the classification scores of the lin-
ear SVM classifiers, we found that if a portion of the pedes-
trian is occluded, the densely extracted blocks of features
in that area uniformly respond to the linear SVM classi-
fier with negative inner products. Taking advantage of this
phenomenon, we propose to use the classification score of
each block to infer whether the occlusion occurs and where
it occurs. When the occlusion occurs, the part-based detec-
tor is triggered to examine the unoccluded portion, as shown
in Figure 5. The HOG feature of each scanning window is
a 3780-dimensional feature, constituted by the sub-HOGs
of 105 blocks. The sub-HOG of each block is a 36-dimensional
vector denoted as B_i. The 3780-dimensional HOG feature
of each sliding window is x = [B_1^T, B_2^T, ..., B_105^T]^T.
With its canonical form, the decision function of the SVM
classifier is:

    f(x) = β + Σ_{k=1}^{l} α_k y_k ⟨x, x_k⟩,    (2)

where x_k : k ∈ {1, 2, ..., l} are the support vectors. If the
linear kernel SVM is used here, the inner product ⟨·, ·⟩ is
computed as the scalar product of two vectors in R^n. Tak-
ing into account the linearity of the scalar product, we can
rewrite the decision function as:

    f(x) = β + x^T · w,    (3)

where w is the weighting vector of the linear SVM, i.e., the
weighted sum of all the support vectors learned:

    w = Σ_{k=1}^{l} α_k y_k x_k.    (4)

We distribute the constant bias β to each block B_i. Then
the real contribution of a block is obtained by subtracting
the corresponding bias from the summation of the feature
inner products over this block. That is, we find a set of β_i
such that β = Σ_{i=1}^{105} β_i for the following equation:

    f(x) = β + w^T · x = Σ_{i=1}^{105} (w_i^T · B_i + β_i) = Σ_{i=1}^{105} f_i(B_i).    (5)
We learn the β_i, i.e. the constant bias of each block, from
the training part of the INRIA dataset by collecting the
relative ratio of the bias constant in each block to the total
bias constant. Denote the set of HOG features of the positive
training samples as {x_p^+} for p = 1, ..., N^+ (N^+ is the
number of positive samples). The set of HOG features of the
negative samples is {x_q^-} for q = 1, ..., N^- (N^- is the
number of negative samples). The i-th blocks of x_p^+ and
x_q^- are denoted as B^+_{p;i}
Figure 5. Occlusion reasoning/handling framework. A: block scores before distributing the bias (summation of SVM classification scores in
blocks); B: block scores after distributing the bias; C: segmented regions after the mean shift merging.
and B^-_{q;i}, respectively. By summing all the positive and
negative classification scores, we have:

    Σ_{p=1}^{N^+} f(x_p^+) = S^+ = N^+ β + Σ_{i=1}^{105} Σ_{p=1}^{N^+} w_i^T B^+_{p;i},    (6)

    Σ_{q=1}^{N^-} f(x_q^-) = S^- = N^- β + Σ_{i=1}^{105} Σ_{q=1}^{N^-} w_i^T B^-_{q;i}.    (7)

Denote A = S^+ + S^-. By adding the equations (6) and (7),
we have:

    0 = −A + (N^+ + N^-) β + Σ_{i=1}^{105} (Σ_{p=1}^{N^+} w_i^T B^+_{p;i} + Σ_{q=1}^{N^-} w_i^T B^-_{q;i}),    (8)

where we define

    B = 1 / (A − (N^+ + N^-) β).    (9)

We then have:

    β_i = β · B · (Σ_{p=1}^{N^+} w_i^T B^+_{p;i} + Σ_{q=1}^{N^-} w_i^T B^-_{q;i}).    (10)
By Equation (10), we distribute the constant bias β to
each block B_i, which translates the decision function of the
whole linear SVM into a summation of classification results
of each block. This distribution scheme keeps the relative
bias ratio across the whole training dataset.
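As an illustration (not the authors' exact procedure), the bias distribution can be sketched by making each β_i proportional to block i's accumulated response C_i over the training set, which keeps the relative ratios and sums back to β; all values below are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
N_BLOCKS = 105
beta = -0.37                       # stand-in for the learned SVM bias

# C_i: block i's summed response w_i^T B_i over all training samples.
C = rng.standard_normal(N_BLOCKS) * 5.0

beta_i = beta * C / C.sum()        # split beta proportionally to C_i
assert np.isclose(beta_i.sum(), beta)   # the beta_i sum back to beta
```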
Blocks with a negative response (f_i(B_i) < 0) are denoted
as B_i^-; similarly, we denote positive blocks as B_i^+. If the
geometric locations of some negative blocks B_i^- are close
to each other, while other high-confidence B_i^+ fall into
other neighboring areas of the scanning window, we tend
to conclude that this scanning window contains a human,
who is partially occluded in the location where the B_i^-
dominate.
We construct the binary occlusion likelihood image ac-
cording to the response of each block of the HOG feature
to the trained linear SVM. The intensity of the occlusion
likelihood image is the sign of fi(Bi).
For each sliding window with an ambiguous classification
score (i.e. the score falls in the SVM classification margin
[-1, 1]), we can segment out the possible occlusion regions
by running image segmentation algorithms on the binary
occlusion likelihood image. Each block is treated as a pixel
in the binary likelihood image. Positive blocks have the
intensity 1 and negative blocks have the intensity −1. The
mean shift algorithm [4,5] is applied to segment this bi-
nary image for each sliding window. The absolute value of
the real-valued response of each block (i.e. |f_i(B_i)|) is used
as the weight ω_i in [5]. The binary likelihood image can
then be segmented into different regions. A segmented region
of the window with an overall negative response is inferred
as an occluded region. But if all the segmented regions are
consistently negative, we tend to treat the image as a neg-
likelihood image are shown in Figure 1.
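A simplified sketch of this segmentation step: each block becomes a point (column, row, scaled sign) weighted by ω_i = |f_i(B_i)|, flat-kernel weighted mean shift moves every point to its local weighted mean, and blocks whose modes coincide share a segment. The bandwidth and the sign-axis scale are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
resp = rng.standard_normal((15, 7))     # stand-in block responses f_i(B_i)
rows, cols = np.indices(resp.shape)

# Each block becomes a point (col, row, scaled sign).
pts = np.stack([cols.ravel().astype(float),
                rows.ravel().astype(float),
                3.0 * np.sign(resp).ravel()], axis=1)
wts = np.abs(resp).ravel()              # omega_i = |f_i(B_i)|

def mean_shift(points, weights, bandwidth=2.5, iters=30):
    modes = points.copy()
    for _ in range(iters):
        for j in range(len(modes)):
            near = np.linalg.norm(points - modes[j], axis=1) < bandwidth
            if near.any():
                # Flat kernel: move the mode to the weighted mean of
                # the points inside the bandwidth.
                modes[j] = np.average(points[near], axis=0,
                                      weights=weights[near])
    return modes

modes = mean_shift(pts, wts)

# Blocks whose modes coincide (up to rounding) share one segment.
labels = np.unique(np.round(modes, 1), axis=0, return_inverse=True)[1]
seg_score = [resp.ravel()[labels == k].sum()
             for k in range(labels.max() + 1)]
# Segments with a negative overall score are candidate occluded regions.
```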
Our experiments on the INRIA dataset show that the ap-
proach can detect the occluded region accurately. Based on
the localization of the occluded portion, the part detector
running on the positive regions will be activated to make
a more confident decision. The whole framework is shown
in Figure 5. In our approach, we train the upper body and
lower body detectors as part detectors to handle occlusion,
combining them with the global detector.
4. Experimental Results
Three groups of experiments are carried out to validate our
assumptions. We first study the factors affecting the perfor-
mance of the cell-structured LBP. Compared to state-of-the-
art human detectors, the second group of experiments shows
the exceptional performance of the convolutional-trilinear-
interpolated HOG-LBP feature. Finally, we compare the
detection results between the algorithms with and without
occlusion handling on both the original INRIA data and the
synthesized occlusion data constructed from the INRIA and
Pascal dataset.
4.1. Cell-structured LBP detector
We study the effects of different choices of sample points
{4, 6, 7, 8, 9, 10} and radius {1, 2} on the cell-structured
LBP. Linear SVM is used to train and classify on the INRIA
human dataset. We also compared our cell-structured LBP
with the S-LBP in [20]. As shown in Figure 6(a), LBP^2_{8,1} per-
forms best. Using {4, 6} sample points or radius {2} would
decrease the performance considerably. We also tried LBP
features with cell sizes 8×8, 16×16, and 32×32, and find
that the 16×16 cell works best. This is because the LBP^2 pat-
terns of an 8×8 cell are too few to be discriminative and a
32×32 cell introduces too much smoothing over the his-
togram bins.
[Figure 6(a) plot: miss rate vs. false positives per window for LBP with different sample points (16×16 cell), including the vector S-LBP baseline.]
Figure 6. (a) The performance comparison of LBP features with
different parameters on the INRIA dataset. The LBP with proper
parameter setting outperforms vector S-LBP proposed in [20].
The performance of F-LBP in [20] is not available in the normal
INRIA training-testing setup. (b) The performance comparison
for different normalization schemes for the LBP feature using LBP_{8,1}
with a cell size of 16×16.
Choosing a good normalization method is essential for
the performance of cell-structured LBP. As shown in Fig-
ure 6(b), the L1-sqrt normalization gives the best perfor-
mance. The L2 normalization decreases the performance
by 4% while using the L1 normalization would decrease the
performance by 9.5% at a false alarm rate of 10^-4. Accord-
ing to Figure 6(b) and Figure 7, the cell-structured LBP
detector has outperformed the traditional HOG detector on
INRIA data.
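The three normalization schemes compared above can be sketched as follows (a small ε guards empty cells; the histogram is a toy example):

```python
import numpy as np

def l1(h, eps=1e-10):
    return h / (np.abs(h).sum() + eps)

def l1_sqrt(h, eps=1e-10):
    # L1-normalize, then take the element-wise square root.
    return np.sqrt(h / (np.abs(h).sum() + eps))

def l2(h, eps=1e-10):
    return h / np.sqrt((h ** 2).sum() + eps)

h = np.array([4.0, 0.0, 1.0, 3.0])     # toy cell histogram
assert np.isclose(l1(h).sum(), 1.0, atol=1e-6)
assert np.isclose((l1_sqrt(h) ** 2).sum(), 1.0, atol=1e-6)
assert np.isclose((l2(h) ** 2).sum(), 1.0, atol=1e-6)
```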
4.2. Detection Results with HOG-LBP Feature
We use augmented HOG-LBP as the feature vector and
linear SVM as the classifier for the human detection on the
INRIA dataset. We use two different criteria: 1) The detec-
tion rate vs. False Positive Per Window (FPPW); and 2) The
detection rate vs. False Positive Per Image (FPPI). Evaluated
using both criteria, our HOG-LBP detector (with/without
occlusion handling) outperforms all known state-of-the-art
detectors [8,13,28,10,31,25,34,18] on the INRIA
dataset. Results are shown in Figure 7^1 and Figure 8.
The detector with occlusion handling algorithm is slightly
better than the HOG-LBP detector without occlusion han-
dling. The performances of the other algorithms are com-
pared [11].
^1 It has been reported in [11] that the features extracted in [18] contain
the boundary of the cropped positive examples, which implicitly encodes
the label information.
[Figure 7 plot: miss rate vs. false positives per window for HOG-LBP with/without occlusion handling, Multi-level HOG, and Riemannian Manifolds.]
Figure 7. The performance comparison between the proposed hu-
man detectors and the state-of-the-art detectors on the INRIA
dataset using detection rate (= 1 − miss rate) vs. FPPW. HOG-LBP
with occlusion handling: the augmented HOG-LBP with Convo-
luted Trilinear Interpolation. Multi-Level HOG^1: the detector [18]
using Multilevel HOG and IKSVM. Riemannian Manifolds: the
detector [28] based on the covariance tensor feature. Multi-Level
HOG and Riemannian Manifolds are the best curves of years
2008 and 2007, respectively.
[Figure 8 plot: miss rate vs. false positives per image for MultiFtr, LatSVM, Multi-level HOG, Shapelet, FtrMine, and our HOG-LBP detectors.]
Figure 8. The performance comparison between the proposed
human detectors and the state-of-the-art detectors on the INRIA
dataset using detection rate (= 1 − miss rate) vs. FPPI. MultiFtr:
The detector [31] using Shape Context and Haar wavelet fea-
tures. LatSVM: The detector [13] using a deformable model.
Multi-Level HOG:The detector [18] using Multilevel HOG and
IKSVM. Shapelet: The detector [25] using shapelet features. Ftr-
Mine: The detector [10] using Haar features and feature mining
algorithm. HOG-LBP: Our HOG-LBP detector without occlu-
sion handling. HOG-LBP & Occ: Our HOG-LBP detector with
occlusion handling.
We achieve a detection rate of 91.3% at 10^-6 FPPW and
94.7% at 10^-5 FPPW. The result closest to ours is from
Maji et al. [18] using Multi-Level HOG and the Intersection
Kernel SVM (IKSVM). We improve the detection rate by
1.5% at FPPW = 10^-5 and by 8.0% at FPPW = 10^-6. It is
reported in [18] that the Multi-Level HOG can get only a
50% detection rate using linear SVM, but it is improved by
about 47% at 10^-4 FPPW [18] by using IKSVM. So it is
interesting to see what the detection performance would be
when applying IKSVM as the classifier for our feature.
Since we achieved the desired performance on the INRIA
data (only 25 positive samples are missed out of 1126 test-
ing positive images at FPPW = 10^-4), we test the HOG-
LBP detector on a very challenging upper body dataset
(with 6000 positive samples and 4000 negative images),
which is made available to the public for download^2. Our de-
tector gains more than 20% improvement at 10^-4 FPPW com-
pared to the HOG detector, as shown in Figure 9.
[Figure 9 plot: miss rate vs. false positives per window on the upper body dataset.]
Figure 9. The performance comparison of HOG-LBP and HOG on
the NUS upper body dataset.
4.3. Experiment on Combined Global/Part-based
Detection for Occlusion Handling
As shown in Figure 10(a), our occlusion handling ap-
proach improves the detection results. The improvement is
less than 1% in detection rate. This is because the INRIA
dataset contains very few occluded pedestrians. We save
all the missed detections at 10^-6 FPPW and find that only 28
positive images are misclassified because of partial occlu-
sion. Our detector picks up 10 of them. Figure 11 shows
these samples.
In order to evaluate the proposed occlusion handling ap-
proach, we create synthesized data with partial occlusion by
overlaying PASCAL segmented objects to the testing im-
ages in the INRIA dataset, as shown in Figure 1. First, we
just add the objects to the lower part of the human. Then
they are added to a random position of the human to simu-
late various occlusion cases. Objects are resized in order to
generate different ratios of occlusion. Three detectors based
on the INRIA training dataset are built: the global detector,
the upper body detector and the lower body detector.
Following the procedure discussed in section 3.3, we first
check the consistency of the segmented binary occlusion
likelihood image. A part detector is activated over the pos-
[Figure 10 plots: miss rate vs. false positives per window, with and without occlusion handling.]
Figure 10. (a) The performance comparison between with and
without occlusion handling on the original INRIA dataset. (b)
The occlusion handling comparison on the synthesized occlusion
dataset. 0.33: the occlusion ratio is 0.33; Org: Detection with-
out occlusion handling; OCC: Detection with occlusion handling;
Random: testing images are randomly occluded.
Figure 11. Samples of corrected missed detections.
itive region when inconsistency is detected. The final de-
cision is made based on the detector that has the higher
confidence. If neither detector is confident enough (i.e.
the classification score is smaller than a threshold, 1.5 for
example), we combine the global and part detectors by
weighting their classification scores. We give the score of
the global detector a weight of 0.7 and the part detector a
weight of 0.3 in our experiments. The reason that we give
the part detector a smaller weight is that the global detector
and the part detector have different classification margins.
In order to keep the consistency of the confidence score, we
make the weights proportional to the corresponding classi-
fication margins. As shown in Figure 10(b), our method
substantially improves the detection results on the synthe-
sized dataset.
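The decision rule described above can be sketched as follows; `fuse` is a hypothetical helper, the threshold 1.5 and the 0.7/0.3 weights mirror the text, and treating "higher confidence" as the larger score magnitude is our assumption:

```python
def fuse(global_score: float, part_score: float,
         thresh: float = 1.5) -> float:
    if abs(global_score) >= thresh or abs(part_score) >= thresh:
        # Take the detector with the higher confidence (larger magnitude).
        if abs(global_score) >= abs(part_score):
            return global_score
        return part_score
    # Neither detector is confident: blend the scores 0.7 / 0.3.
    return 0.7 * global_score + 0.3 * part_score

print(fuse(2.0, -0.2))   # 2.0 (confident global detector wins)
print(fuse(0.4, 1.0))    # 0.58 (weighted blend)
```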
5. Conclusion
We propose a human detection approach capable of han-
dling partial occlusion and a feature set that combines the
trilinear interpolated HOG with LBP in the framework of
integral image. It has been shown in our experiments that
the HOG-LBP feature outperforms other state-of-the-art de-
tectors on the INRIA dataset. However, our detector cannot
handle the articulated deformation of people, which is the
next problem to be tackled.
Acknowledgments

The research was sponsored by the Leonard Wood Institute in
cooperation with the U.S. Army Research Laboratory and was ac-
complished under Cooperative Agreement Number W911NF-07-
2-0062. The views and conclusions contained in this document are
those of the authors and should not be interpreted as representing
the official policies, either expressed or implied, of the Leonard
Figure 12. Sample detections on images densely scanned by the HOG-LBP detectors with/without occlusion handling. First row: detected
by both. Second Row: detected by the HOG-LBP with occlusion handling. Third row: Missed detection by the HOG-LBP without
occlusion handling.
Wood Institute, the Army Research Laboratory or the U.S. Gov-
ernment. The U.S. Government is authorized to reproduce and
distribute reprints for Government purposes notwithstanding any
copyright notation heron. Yan is partially supported by NRF/IDM
grant NRF2008IDM-IDM004-029.
References
[1] S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via a sparse, part-based representation. IEEE Trans. Pattern Anal. Mach. Intell., 26(11):1475–1490, 2004.
[2] T. Ahonen, A. Hadid, and M. Pietikäinen. Face recognition with local binary patterns. In ECCV, pages 469–481, 2004.
[3] T. Ahonen, A. Hadid, and M. Pietikäinen. Face description with local binary patterns: Application to face recognition. IEEE Trans. Pattern Anal. Mach. Intell., 28(12):2037–2041, 2006.
[4] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell., 17(8):790–799, 1995.
[5] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell., 24(5):603–619, 2002.
[6] D. Lin and X. Tang. Quality-driven face occlusion detection and recovery. In CVPR, pages 1–7, 2007.
[7] N. Dalal. Finding People in Images and Videos. PhD thesis, INRIA Rhône-Alpes, Grenoble, France, 2006.
[8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886–893, 2005.
[9] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV (2), pages 428–441, 2006.
[10] P. Dollár, Z. Tu, H. Tao, and S. Belongie. Feature mining for image classification. In CVPR, pages 1–8, 2007.
[11] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In CVPR, 2009.
[12] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results.
[13] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[14] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, 2003.
[15] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In CVPR, 2008.
[16] B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. In CVPR, pages 878–885, 2005.
[17] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
[18] S. Maji, A. Berg, and J. Malik. Classification using intersection kernel support vector machines is efficient. In CVPR, 2008.
[19] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. In ECCV, 2004.
[20] Y. Mu, S. Yan, Y. Liu, T. Huang, and B. Zhou. Discriminative local binary patterns for human detection in personal album. In CVPR, 2008.
[21] S. Munder and D. Gavrila. An experimental study on pedestrian classification. IEEE Trans. Pattern Anal. Mach. Intell., 28(11):1863–1868, 2006.
[22] T. Ojala, M. Pietikäinen, and D. Harwood. A comparative study of texture measures with classification based on feature distributions. Pattern Recognition, 29(1):51–59, 1996.
[23] F. Ping, L. Weijia, X. Dingyu, X. Xinhe, and G. Daoxiang. Research on occlusion in the multiple vehicle detecting and tracking system. In WCICA 2006, volume 2, pages 10430–10434, 2006.
[24] F. Porikli. Integral histogram: A fast way to extract histograms in cartesian spaces. In CVPR, 2005.
[25] P. Sabzmeydani and G. Mori. Detecting pedestrians by learning shapelet features. In CVPR, pages 1–8, 2007.
[26] A. N. Stein. Occlusion Boundaries: Low-Level Detection to High-Level Reasoning. PhD thesis, 2008.
[27] D. Tran and D. Forsyth. Configuration estimates improve pedestrian finding. In Advances in Neural Information Processing Systems 20, pages 1529–1536. MIT Press, 2008.
[28] O. Tuzel, F. Porikli, and P. Meer. Human detection via classification on riemannian manifolds. In CVPR, pages 1–8, 2007.
[29] P. Viola and M. Jones. Robust real-time object detection. IJCV, 2001.
[30] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In ICCV, pages 734–741, 2003.
[31] C. Wojek and B. Schiele. A performance evaluation of single and multi-feature people detection. In Proceedings of the 30th DAGM Symposium on Pattern Recognition, pages 82–91. Springer-Verlag, 2008.
[32] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In ICCV, volume 1, pages 90–97, 2005.
[33] B. Wu and R. Nevatia. Tracking of multiple, partially occluded humans based on static body part detection. In CVPR, pages 951–958, 2006.
[34] Q. Zhu, M.-C. Yeh, K.-T. Cheng, and S. Avidan. Fast human detection using a cascade of histograms of oriented gradients. In CVPR, pages 1491–1498, 2006.