
Fast Pedestrian Detection Using a Cascade of Boosted Covariance Features

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 18, NO. 8, AUGUST 2008
Sakrapee Paisitkriangkrai, Chunhua Shen, and Jian Zhang
Abstract—Efficiently and accurately detecting pedestrians plays
a very important role in many computer vision applications such as
video surveillance and smart cars. In order to find the right feature
for this task, we first present a comprehensive experimental study
on pedestrian detection using state-of-the-art locally extracted fea-
tures (e.g., local receptive fields, histogram of oriented gradients,
and region covariance). Building upon the findings of our exper-
iments, we propose a new, simpler pedestrian detector using the
covariance features. Unlike the work in [1], where the feature selec-
tion and weak classifier training are performed on the Riemannian
manifold, we select features and train weak classifiers in the Euclidean space for faster computation. To this end, we design AdaBoost with weak classifiers based on weighted Fisher linear discriminant analysis. A cascaded classifier structure is constructed for efficiency in the detection phase. Experiments on different datasets show that the new pedestrian detector is not only comparable to the state-of-the-art pedestrian detectors but also runs faster. To further accelerate the detection, we adopt a faster
strategy—multiple layer boosting with heterogeneous features—to
exploit the efficiency of the Haar feature and the discriminative
power of the covariance feature. Experiments show that, by com-
bining the Haar and covariance features, we speed up the original
covariance feature detector [1] by up to an order of magnitude in
detection time with a slight drop in detection performance.
Index Terms—AdaBoost, boosting with heterogeneous features,
local features, pedestrian detection/classification, support vector
machine.
I. INTRODUCTION
EFFICIENTLY and accurately detecting pedestrians is of
fundamental importance for many applications in com-
puter vision, e.g., smart vehicles, surveillance systems with in-
telligent query capabilities, and sports video content analysis.
In particular, there is growing effort in the development of in-
telligent video surveillance systems. An automated method for
finding humans in a scene serves as the first important preprocessing step in understanding human activity. Despite the multitude of approaches in the literature, the problem of automatic detection of objects is far from being solved (e.g., [2]–[8]). Pedestrian detection in still images is one of the most difficult examples of generic object detection. The challenges are due to the wide range of poses that humans can adopt, large variations in clothing, as well as cluttered backgrounds and environmental conditions.

Manuscript received November 14, 2007; revised March 7, 2008 and May 22, 2008. First published July 9, 2008; current version published August 29, 2008. NICTA is funded through the Australian Government's Backing Australia's Ability initiative, in part through the ARC. This paper was recommended by Associate Editor F. Pereira.
S. Paisitkriangkrai and J. Zhang are with NICTA, Neville Roach Laboratory, Kensington, NSW 2052, Australia, and also with the University of New South Wales, Sydney, NSW 2052, Australia (e-mail: paul.pais@nicta.com.au; jian.zhang@nicta.com.au).
C. Shen is with NICTA, Canberra Research Laboratory, Canberra, ACT 2601, Australia, and also with the Australian National University, Canberra, ACT 0200, Australia (e-mail: chunhua.shen@nicta.com.au).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2008.928213

Pattern classification approaches have been shown to achieve successful results in many areas of object detection. These approaches can be decomposed into two key components: feature extraction and classifier construction. In feature extraction, dominant features are extracted from a large number of training samples. These features are then used to train a classifier. During testing, the trained classifier scans the entire input image to look for particular object patterns. This general approach has been shown to work very well in the detection of many different objects, e.g., faces [2] and car number plates [9].
The literature on pedestrian detection is abundant. Mainly, two types of image features are used: motion and shape. Motion-based approaches require preprocessing techniques such as background subtraction or image segmentation; for example, [10] segments an image into so-called superpixels and then detects the human body and estimates its pose. Approaches based on shape information typically detect pedestrians directly without using preprocessing techniques [1], [3], [11], [12]. Features can be divided into global features and local features depending on how the features are measured. Global features operate on the entire image, whereas local features operate on subregions of the image. One of the well-known global feature extraction methods is principal component analysis (PCA). The drawback of global features is that they fail to capture meaningful information if there is a large variation in the object's appearance, pose, or illumination conditions. Local features, on the other hand, are much less sensitive to these problems since they are extracted from subregions of the image. Some examples of commonly used local features are wavelet coefficients [2], gradient orientation [11], and region covariance [1]. Local feature approaches can be further divided into whole-body detection and body-part detection [13]. In the part-based approach, individual results are combined by a second classifier to form whole-body detection. The advantage of a part-based approach is that it can deal with variation in human appearance due to body articulation. However, this approach adds more complexity to the pedestrian detection problem. As pointed out in [14], the classification performances reported in the literature are quite different. This may be due to the composition of the datasets with respect to negative samples. Datasets with negative samples containing large uniform image regions typically lead to much better classification performance.
1051-8215/$25.00 © 2008 IEEE
The performances of several pedestrian detection approaches have been evaluated in [14]. Multiple feature-classifier combinations have been examined with respect to their receiver operating characteristic (ROC) performances and efficiency. Different features, including PCA, local receptive field (LRF) features [12], and Haar wavelets [3], are used to train neural networks, support vector machines (SVMs) [15], and k-NN classifiers. The authors conclude that the combination of SVM with LRF features performs best. An observation is that detectors based on local features significantly outperform those using global features [14]. This may be due to the large variability of pedestrian shapes. Global features like PCA are more powerful for modeling objects with stable structures, such as frontal faces and rigid car images taken from a fixed view angle.
Although [14] provides some insights into pedestrian detection, it does not compare the state-of-the-art techniques, owing to the fast progress on this topic. Recently, histogram of oriented gradients (HOG) [11] and region covariance features [1] have been proposed for pedestrian detection, and they have been shown to outperform previous approaches. HOG is a gray-level image feature formed by a set of normalized gradient histograms, while region covariance is an appearance-based feature that combines pixel coordinates, intensity, and gradients into a covariance matrix. Hence, the types of features employed for detection range from purely silhouette-based (e.g., HOG) to appearance-based (e.g., region covariance). To the best of our knowledge, these approaches have not yet been compared. It remains unclear whether silhouette- or appearance-based features are better for pedestrian detection. The first part of this paper tries to answer this question. In order to find the right feature for human detection, we perform a systematic experimental study of the state-of-the-art pedestrian detection techniques: LRF, HOG, and region covariance. We select the SVM classifier for this study because 1) it is one of the most advanced classifiers and 2) it is easy to train and, unlike neural networks, the global optimum is guaranteed. Thus, the variance caused by suboptimal training is avoided, which allows a fair comparison.
Building upon the results of our experiments, we then propose a new, simpler pedestrian detector using the covariance features. The second contribution of our work is to show how multidimensional covariance features can be integrated with weighted linear discriminant analysis before being trained in the AdaBoost framework. In other words, the AdaBoost framework is adapted to vector-valued covariance features, and a weak classifier is designed according to the weighted linear discriminant analysis. This technique is not only faster but also accurate. To support our claim, we compare the performance of our proposed method with the state-of-the-art pedestrian detection techniques mentioned in [14].
The proposed boosted covariance detector achieves a detection speed that is about four times faster than the method in [1], but it is still not fast enough for real-time applications. On the one hand, the Haar feature can be computed rapidly due to its simplicity [2], but it is less powerful for classification [16]. On the other hand, although the covariance feature is a better candidate for representing pedestrians, it requires heavier computation than the Haar feature. To further accelerate our proposed detector, we adopt a faster strategy, two-layer boosting with heterogeneous features, to exploit the efficiency of the Haar feature and the discriminative power of the covariance feature in a single framework. This idea has also been implemented in face detection [17] for combining Haar features with Gaussian features. It is well known that the cascade classification structure decreases the detection time by rejecting, at the beginning of the cascade, most of the regions in the image that do not contain a target. Thanks to the flexibility of the cascaded classifier, we employ the Haar feature-based classifiers at the beginning of the cascade and use the covariance feature at later stages. Experiments show that, by combining the Haar and covariance features, we speed up the conventional covariance feature detector [1] by an order of magnitude in detection time without greatly compromising the detection performance.
II. FEATURE EXTRACTION
Feature extraction is the first step in most object detection and pattern recognition algorithms. The performance of most computer vision algorithms often relies on the extracted features. The ideal feature would be one that can differentiate objects in the same category from objects in different categories. Commonly used low-level features in computer vision are color, texture, and shape. Here, we evaluate three state-of-the-art local features, namely, LRF, HOG, and region covariance. LRF features are extracted using multilayer perceptrons by means of their hidden layer. The features are tuned to the data during training, at the price of heavier computation. HOG uses histograms to describe oriented gradient information, while region covariance computes the covariance of several low-level image features such as image intensities and gradients.
A. Local Receptive Fields
Multilayer perceptrons provide an adaptive approach for feature extraction by means of their hidden layer [12]. A neuron of a higher layer does not receive input from all neurons of the underlying layer but only from a limited region of it, which is called a local receptive field (LRF). The hidden layer is divided into a number of branches.
B. Histograms of Oriented Gradients
Since the development of the scale-invariant feature transform (SIFT) [18], which uses normalized local spatial histograms as a descriptor, many research groups have been studying the use of orientation histograms in other areas. The work in [11], which proposes the histogram of oriented gradients in the context of human detection, is one of the successful examples. The method uses a dense grid of histograms of oriented gradients, computed over blocks of various sizes. Each block consists of a number of cells. Blocks can overlap with each other. For each pixel $I(x, y)$, the gradient magnitude $m(x, y)$ and orientation $\theta(x, y)$ are computed from
$$d_x = I(x+1, y) - I(x-1, y) \tag{1}$$
$$d_y = I(x, y+1) - I(x, y-1) \tag{2}$$
$$m(x, y) = \sqrt{d_x^2 + d_y^2} \tag{3}$$
$$\theta(x, y) = \tan^{-1}\left(\frac{d_y}{d_x}\right) \tag{4}$$
A local 1-D orientation histogram of gradients is formed from the gradient orientations of sample points within a region. Each histogram divides the gradient angle range into a predefined number of bins. The gradient magnitudes vote into the orientation histogram. In [11], the orientation histogram of each cell has nine bins covering the orientation range of 0°–180° (unsigned gradients). Hence, each block is represented by a 36-D feature vector (9 bins/cell × 4 cells/block). The final step is to combine these normalized block descriptors to form a feature vector. The feature vector can then be used to train SVMs.
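As a minimal sketch of the gradient computation and magnitude-weighted voting described above (not the implementation of [11] or the one used in our experiments), the steps might look as follows in Python with NumPy. The function names, the 8-pixel cell size, the hard (non-interpolated) binning, and the L2 block normalization are assumptions made for illustration only.

```python
import numpy as np

def gradient_magnitude_orientation(img):
    """Centered-difference gradients; border pixels are left at zero."""
    img = img.astype(float)
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # I(x+1, y) - I(x-1, y)
    dy[1:-1, :] = img[2:, :] - img[:-2, :]   # I(x, y+1) - I(x, y-1)
    mag = np.hypot(dx, dy)
    # Unsigned orientation folded into [0, 180) degrees.
    ang = np.degrees(np.arctan2(dy, dx)) % 180.0
    return mag, ang

def cell_histogram(mag, ang, n_bins=9):
    """Magnitude-weighted orientation histogram for one cell (hard binning)."""
    bin_width = 180.0 / n_bins
    idx = np.minimum((ang // bin_width).astype(int), n_bins - 1)
    return np.bincount(idx.ravel(), weights=mag.ravel(), minlength=n_bins)

def block_descriptor(mag, ang, cell=8, eps=1e-6):
    """Concatenate 2 x 2 cell histograms and L2-normalize: 4 * 9 = 36 values."""
    hists = [cell_histogram(mag[r:r + cell, c:c + cell],
                            ang[r:r + cell, c:c + cell])
             for r in (0, cell) for c in (0, cell)]
    v = np.concatenate(hists)
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)
```

Applied to a 16 × 16 block, `block_descriptor` returns one 36-D vector; a full descriptor would concatenate such vectors over all (possibly overlapping) blocks of the detection window.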
C. Region Covariance
Tuzel et al. [1], [19] have proposed region covariance in the context of object detection. Instead of using joint histograms of the image statistics ($b^d$ dimensions, where $d$ is the number of image statistics and $b$ is the number of histogram bins used for each image statistic), the covariance is computed from several image statistics inside a region of interest ($d(d+1)/2$ dimensions). This results in a much smaller dimensionality. For each region, the correlation coefficient is calculated. The correlation coefficient of two random variables $X$ and $Y$ is given by
$$\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} \tag{5}$$
$$\mathrm{cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right] = \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \mu_X)(Y_k - \mu_Y) \tag{6}$$
where $\mathrm{cov}(\cdot, \cdot)$ is the covariance of two random variables, $\mu$ is the sample mean, $\mathrm{var}(\cdot)$ is the sample variance, and $n$ is the number of samples. The correlation coefficient is commonly used to describe the information we gain about one random variable by observing another random variable.
The image statistics used in this experiment are similar to those used in [1]. The 8-D feature image consists of the pixel location $x$, the pixel location $y$, the first-order partial derivative of the intensity in the horizontal direction $|I_x|$, the first-order partial derivative of the intensity in the vertical direction $|I_y|$, the magnitude $\sqrt{I_x^2 + I_y^2}$, the edge orientation $\tan^{-1}(|I_y| / |I_x|)$, the second-order partial derivative of the intensity in the horizontal direction $|I_{xx}|$, and the second-order partial derivative of the intensity in the vertical direction $|I_{yy}|$. The covariance descriptor of a region is an 8 × 8 matrix. Due to the symmetry, only the upper triangular part is stacked as a vector and used as the covariance descriptor. The descriptors encode information about the correlations of the defined features inside the region. Note that this treatment is different from that in [1] and [19], where the covariance matrix is directly used as the feature and the distance between features is calculated on the Riemannian manifold.1 However, calculating the distance on the Riemannian manifold involves eigen-decomposition, which is computationally expensive ($O(d^3)$ arithmetic operations for a $d \times d$ matrix). We instead vectorize the symmetric matrix and measure the distance in the Euclidean space, which is faster.
1 Covariance matrices are symmetric and positive semi-definite; hence, they reside on the Riemannian manifold.
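The construction of the vectorized descriptor can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the code used in our experiments: the function names are hypothetical, derivatives are approximated with `numpy.gradient`, and absolute values are taken of the derivative channels.

```python
import numpy as np

def feature_image(img):
    """Stack the eight per-pixel statistics into an (H, W, 8) array."""
    img = img.astype(float)
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    iy, ix = np.gradient(img)          # first-order derivatives (rows, cols)
    iyy = np.gradient(iy, axis=0)      # second-order, vertical
    ixx = np.gradient(ix, axis=1)      # second-order, horizontal
    mag = np.hypot(ix, iy)
    ori = np.arctan2(np.abs(iy), np.abs(ix))
    return np.dstack([xs, ys, np.abs(ix), np.abs(iy), mag, ori,
                      np.abs(ixx), np.abs(iyy)])

def covariance_descriptor(feats, top, left, height, width):
    """Upper-triangular part of the region covariance as a 36-D vector."""
    region = feats[top:top + height, left:left + width]
    region = region.reshape(-1, feats.shape[2])
    cov = np.cov(region, rowvar=False)       # 8 x 8, normalized by n - 1
    return cov[np.triu_indices(cov.shape[0])]
```

Because the matrix is symmetric, the 8 × 8 covariance has 8 · 9 / 2 = 36 distinct entries, which is the length of the returned vector; two such vectors can then be compared with an ordinary Euclidean distance.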
Preliminary experiments, similar to the experiment described in [19], have been conducted to compare the two distance measures: the distance between the correlation coefficients of two covariance matrices in the Euclidean space and the distance between two covariance matrices on the Riemannian manifold. The results indicate that their performance on pedestrian detection is quite similar.
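In order to improve the efficiency of calculating the covariance matrices, a technique which employs the integral image [2] can be applied [19]. By expanding the mean in the previous equation, the covariance can be written as
$$\mathrm{cov}(X, Y) = \frac{1}{n-1}\left[\sum_{k=1}^{n} X_k Y_k - \frac{1}{n} \sum_{k=1}^{n} X_k \sum_{k=1}^{n} Y_k\right] \tag{7}$$
where $n$ is the number of pixels in the region. Hence, to find the covariance in a given rectangular region quickly, the sum of each feature dimension, e.g., $\sum_{k} X_k$ and $\sum_{k} Y_k$, and the sum of the product of any two feature dimensions, e.g., $\sum_{k} X_k Y_k$, can be computed using the integral image.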
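The integral-image shortcut can be illustrated with the short sketch below. It is a simplified NumPy version under our own naming (the helpers `integral_tensors`, `box_sum`, and `fast_covariance` are hypothetical), intended only to show that the region covariance is recovered from per-feature sums and pairwise-product sums obtained by four corner look-ups each.

```python
import numpy as np

def integral_tensors(feats):
    """Zero-padded integral images of each feature (P) and each product (Q)."""
    h, w, d = feats.shape
    P = np.zeros((h + 1, w + 1, d))
    Q = np.zeros((h + 1, w + 1, d, d))
    P[1:, 1:] = feats.cumsum(axis=0).cumsum(axis=1)
    prods = feats[:, :, :, None] * feats[:, :, None, :]
    Q[1:, 1:] = prods.cumsum(axis=0).cumsum(axis=1)
    return P, Q

def box_sum(T, top, left, height, width):
    """Sum over a rectangle from four corner look-ups."""
    b, r = top + height, left + width
    return T[b, r] - T[top, r] - T[b, left] + T[top, left]

def fast_covariance(P, Q, top, left, height, width):
    """Region covariance: (sum XY - sum X * sum Y / n) / (n - 1)."""
    n = height * width
    s = box_sum(P, top, left, height, width)      # shape (d,)
    ss = box_sum(Q, top, left, height, width)     # shape (d, d)
    return (ss - np.outer(s, s) / n) / (n - 1)
```

The integral tensors are built once per image; afterwards the covariance of any rectangular region costs a constant number of look-ups regardless of the region size, which is what makes exhaustive scanning over many candidate regions affordable.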
The extracted covariance features assume that the image statistics follow a single Gaussian distribution. Although this assumption may look overly simple, experiments demonstrate the efficacy of the covariance features. Jin et al. [20] have used the same idea for network intrusion detection.
III. CLASSIFIERS
There exist many classification techniques that can be applied to object detection. Two commonly applied classification techniques are SVM [15] and AdaBoost [2], [21].
A. Support Vector Machines
SVM is one of the popular large-margin classifiers [15] and has very promising generalization capacity. Due to space limits, we omit the details of SVM; the reader is referred to [15]. In our experiments, SVM classifiers with three different kernel functions (linear, quadratic, and RBF kernels) are compared using the features described in the previous section.
B. AdaBoost
AdaBoost is the first practical and efficient algorithm for ensemble learning [21]. The training procedure of AdaBoost is a greedy algorithm, which constructs an additive combination of weak classifiers such that the exponential loss $L(y, F(x)) = \exp(-yF(x))$ is minimized. Here, $x$ is a labeled training example and $y$ is its label; $F(x)$ is the final decision function, which outputs the decided class label. AdaBoost iteratively combines a number of weak classifiers to form a strong classifier. A weak classifier is defined as a classifier whose accuracy on the training set is better than random guessing. The final strong classifier $H(\cdot)$ can be defined as $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$, where $\alpha_t$ is a weight coefficient, $h_t(\cdot)$ is a weak learner, and $T$ is the number of weak classifiers. At each new round, AdaBoost selects a new hypothesis $h_t(\cdot)$ that best classifies the training samples with minimal classification error. Each training sample receives a weight that determines its probability of being selected for a training set. If a training sample is correctly classified, then its probability of being used again in a subsequent component classifier is reduced. Conversely, if the pattern is misclassified, then its probability of being used again is increased. In this way, the algorithm focuses more on the misclassified samples after each round of boosting.
Fig. 1. Architecture of the proposed pedestrian detection system using boosted covariance features. We set the training objective as detection rate: 99.5%; false positive rate: 50%.
Since Viola et al. [2] introduced AdaBoost into computer vision for face detection, extensions have been proposed for better classification performance [22], fast training [23], or dealing with imbalanced training data [24]. These techniques can be applied to our problem. We leave this as a future research direction.
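To make the boosting rounds described in this subsection concrete, the following is a minimal sketch of discrete AdaBoost with threshold stumps on scalar features. It is a textbook-style illustration under our own assumptions, not the training code of this paper: it reweights samples rather than resampling them by weight, and the function names are hypothetical.

```python
import numpy as np

def train_stump(x, y, w):
    """Best threshold/polarity on a 1-D feature under sample weights w."""
    best = (np.inf, 0.0, 1)
    for theta in np.unique(x):
        for polarity in (1, -1):
            pred = np.where(polarity * (x - theta) >= 0, 1, -1)
            err = np.sum(w[pred != y])
            if err < best[0]:
                best = (err, theta, polarity)
    return best

def adaboost(X, y, n_rounds):
    """Discrete AdaBoost over the columns of X; labels y are in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_rounds):
        w = w / w.sum()
        # Pick the feature whose stump has the lowest weighted error.
        err, theta, pol, j = min(train_stump(X[:, j], y, w) + (j,)
                                 for j in range(d))
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(pol * (X[:, j] - theta) >= 0, 1, -1)
        w = w * np.exp(-alpha * y * pred)   # misclassified samples gain weight
        ensemble.append((alpha, j, theta, pol))
    return ensemble

def predict(ensemble, X):
    """Sign of the weighted vote of all selected stumps."""
    score = sum(a * np.where(p * (X[:, j] - t) >= 0, 1, -1)
                for a, j, t, p in ensemble)
    return np.where(score >= 0, 1, -1)
```

The weight update is the step that shifts attention to hard samples: a correctly classified sample is scaled by exp(-alpha) and a misclassified one by exp(+alpha) before renormalization.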
IV. BOOSTED COVARIANCE FEATURES
Here, we describe a new pedestrian detector system. Fig. 1
shows the structure of the new pedestrian detector. We use
the covariance feature originally invented in [19]. The reasons
why we choose this local feature will be explained in detail in
Section IV-A.
Given the training dataset, each training sample is assigned a weight which determines its probability of being selected for a training set. From a set of given rectangular windows, a covariance matrix is calculated from several low-level image statistics within each rectangular region. The upper triangular part of the computed covariance matrix is stacked as a vector and used as a covariance descriptor. The vector of covariance descriptors is projected onto a 1-D space using the algorithm described in Section IV-A. AdaBoost is then applied to select the rectangular region whose weak learner best classifies the training samples with minimal classification error. The best weak learner is added to the cascade. Weak learners are added until the predefined classification accuracy is met. The process is repeated for the next stage of the cascade.
This section begins with a short explanation of the concept of Fisher linear discriminant analysis (LDA). We then extend this method to weighted training samples. Finally, we describe in detail how to apply these techniques to train multidimensional covariance features in a cascade of AdaBoost classifiers.
A. Weighted Fisher Linear Discriminant Analysis
The objective of Fisher's criterion is to find a linear combination of the variables that can separate two classes as much as possible. However, the criterion proposed by Fisher assumes uniformly weighted training samples. In AdaBoost training, each data point is associated with a weight which measures how difficult it is to classify this data point correctly. Therefore, we need to apply a weighted version of the standard Fisher linear discriminant analysis (WLDA). Similar to LDA, WLDA finds a linear combination of the variables that can separate two classes as much as possible, with emphasis on the training samples with high weights.
Fig. 2. First and second covariance regions selected by AdaBoost. The first two covariance regions overlaid on human training samples are shown in the first column. The second column displays human body parts selected by AdaBoost. The first covariance feature represents human legs (two parallel vertical bars) while the second covariance feature captures the information of the head and the human body.
It is well known that the choice of weak classifiers is vital to the classification accuracy of boosting techniques. Although effective weak classifiers increase the performance of the final strong classifier, the large number of potential features makes the computation prohibitively heavy when complex classifiers such as SVMs are used. For scalar features such as the Haar features in [2], [4], a very efficient decision stump can be used. For vector-valued features such as HOG or covariance features, unfortunately, seeking an optimal linear discriminant would require a much longer time. As shown in [25], it is possible to use linear SVMs as weak learners, but the training procedure is very time-consuming. Here we adopt a more efficient approach. We project the multidimensional covariance features onto a 1-D line using WLDA, which finds a linear projection that guarantees optimal classification of normally distributed samples of two classes.
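Each weak learner can then be defined as
$$h(x) = \begin{cases} +1 & \text{if } w^{\top} x > \theta \\ -1 & \text{otherwise} \end{cases} \tag{8}$$
where $h(\cdot)$ defines a weak learner, $x$ is the vector of calculated covariance features, $w$ is the WLDA projection vector, and $\theta$ is an optimal threshold such that the minimum number of examples are misclassified.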
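A weak learner of this kind can be sketched as below: a weighted Fisher direction followed by a threshold search on the 1-D projections. This is an illustrative NumPy version under stated assumptions rather than the exact procedure used in our experiments; the ridge term `reg`, the midpoint threshold candidates, and all function names are our own choices, and the projection vector is called `a` in the code to avoid a clash with the sample weights `w`.

```python
import numpy as np

def wlda_direction(X, y, w, reg=1e-6):
    """Weighted Fisher direction: S_w^{-1} (mu_pos - mu_neg)."""
    w = w / w.sum()
    d = X.shape[1]
    means, scatter = {}, np.zeros((d, d))
    for label in (1, -1):
        m = y == label
        mu = np.average(X[m], axis=0, weights=w[m])
        diff = X[m] - mu
        scatter += (w[m][:, None] * diff).T @ diff   # weighted within-class scatter
        means[label] = mu
    # A small ridge term keeps the solve stable when the scatter is near-singular.
    return np.linalg.solve(scatter + reg * np.eye(d), means[1] - means[-1])

def weak_learner(X, y, w):
    """Project to 1-D with WLDA, then pick the threshold with least weighted error."""
    a = wlda_direction(X, y, w)
    z = X @ a
    order = np.sort(z)
    candidates = np.concatenate([[order[0] - 1.0],
                                 (order[:-1] + order[1:]) / 2,
                                 [order[-1] + 1.0]])
    errs = [np.sum(w[np.where(z > t, 1, -1) != y]) for t in candidates]
    theta = candidates[int(np.argmin(errs))]
    return a, theta

def classify(X, a, theta):
    """Label +1 when the projection exceeds the threshold, -1 otherwise."""
    return np.where(X @ a > theta, 1, -1)
```

Inside a boosting round, `weak_learner` would be called once per candidate region with the current AdaBoost weights, and the region whose learner has the lowest weighted error would be kept.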
B. Cascade of Covariance Descriptors
The covariance feature efficiently captures the relationship between different image statistics. Combined with WLDA, this information can be used to represent a distinct part of the human body. At each AdaBoost iteration, a simple classifier is trained from the collection of region covariance features. The experimental results show that the covariance regions selected by AdaBoost are physically meaningful and can be easily interpreted, as shown in Fig. 2. The first selected feature focuses on the bottom part of the human body while the second selected feature focuses on the top part of the body. It turns out that covariance features are well adapted to capture patterns that are invariant to illumination changes and human pose/appearance changes.
Our fast boosted covariance features-based detection framework is summarized in Algorithm 1.
Algorithm 1. The training algorithm for building the cascade of boosted covariance detector.
Input:
  A positive training set and a negative training set;
  $D_{\min}$: minimum acceptable detection rate per cascade level;
  $F_{\max}$: maximum acceptable false positive rate per cascade level;
  $F_{\text{target}}$: target overall false positive rate.
Initialize: $i = 0$; $D_0 = 1$; $F_0 = 1$;
while $F_{\text{target}} < F_i$ do
  $i = i + 1$; $f_i = 1$;
  while $f_i > F_{\max}$ do
    (1) Normalize AdaBoost weights;
    (2) Calculate the projection vector with WLDA, and project the covariance features to 1-D;
    (3) Train decision stumps by finding an optimal threshold $\theta$ using the training set;
    (4) Add the best decision stump classifier into the strong classifier;
    (5) Update sample weights in the AdaBoost manner;
    (6) Lower the threshold such that the detection rate $d_i$ of the current level satisfies $d_i \ge D_{\min}$;
    (7) Update the false positive rate $f_i$ of the current level using this threshold.
  end
  $D_i = D_{i-1} \times d_i$; $F_i = F_{i-1} \times f_i$; and remove correctly classified negative samples from the training set;
  if $F_{\text{target}} < F_i$ then
    Evaluate the current cascaded classifier on the negative images and add misclassified samples into the negative training set.
  end
end
Output:
  A cascade of boosted covariance classifiers for each cascade level $i = 1, 2, \ldots$;
  Final training accuracy: $F_i$ and $D_i$.
Fig. 3. Structure of our two-layer pedestrian detector.
In order to reduce computation time, a cascade of classifiers is built [2]. The key insight is that efficient boosted classifiers, which can reject many of the simple nonpedestrian samples while detecting almost all pedestrian samples, are constructed and placed at the early stages of the cascade. Time-consuming and complex boosted classifiers, which can remove more complex nonpedestrian samples, are placed in the later stages of the cascade. By constructing classifiers in this way, we are able to quickly discard simple background regions of the image, e.g., sky, building, or road, while spending more time on pedestrian-like regions. Only samples that pass through all stages of the cascade are classified as pedestrians.
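The early-rejection behavior of the cascade can be illustrated with the short sketch below. The representation of a stage as a (scoring function, threshold) pair and both function names are assumptions made for illustration; the second helper assumes, as a simplification, that every stage attains the same per-stage rates and that the stages act independently.

```python
def cascade_classify(window, stages):
    """Return True only if every stage accepts the window.

    `stages` is a list of (score_fn, threshold) pairs ordered from the
    cheapest stage to the most expensive one.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False          # rejected early; later stages never run
    return True

def overall_rates(d_stage, f_stage, n_stages):
    """Overall detection and false positive rates after n identical stages."""
    return d_stage ** n_stages, f_stage ** n_stages
```

With the per-stage training objective stated in the caption of Fig. 1 (detection rate 99.5%, false positive rate 50%), `overall_rates(0.995, 0.5, 10)` gives an overall detection rate of 0.995^10 and an overall false positive rate of 0.5^10 under these simplifying assumptions, which shows why many cheap stages can be chained without losing many pedestrians.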
V. TWO-LAYER BOOSTING WITH HETEROGENEOUS FEATURES
In order to further accelerate our proposed detector, an approach which consists of a two-layer cascade of classifiers is built [17]. The objective of designing a two-layer approach is to achieve high detection speed and accuracy. The idea is to place simple and fast-to-compute features in the first layer while putting more accurate but slower-to-compute features in the second layer of the cascade. The simple features filter out most simple nonpedestrian patterns in the early stages of the cascade. Haar wavelet features have proved to be extremely fast and highly powerful in the application of face detection [2]. However, the Haar feature performs poorly in the context of human detection, as reported in [4]. In order to improve the overall accuracy, we apply boosted covariance features in the second layer. In other words, Haar features are used in the first cascade while boosted covariance features are used in the second cascade. In this way, we utilize the efficiency of the Haar feature and the discriminative power of the covariance feature in a single framework. Fig. 3 shows the detector architecture of the two-layer approach.
We experimentally evaluate covariance features and Haar features by training two different classifiers on the same training set using AdaBoost. The positive training set is extracted from the INRIA dataset [11], which consists of 2416 human samples (mirrored). The negative training set comes from random patches extracted from negative images. The classifiers are evaluated on the INRIA test set. Fig. 4 gives a comparison of the performances of the different feature types. The following observations can be made from the figure. The test error decreases quickly with the number of AdaBoost iterations for all features. The test error of covariance features saturates after about 100 iterations, while the test error rate of Haar features continues to decrease slowly. The results can also be interpreted in terms of the number of selected features and the test error rate. For example, it is possible to achieve a 5% test error rate using either 25 covariance features or 100 Haar features. Table I shows the computation time for different feature types (including the computation overhead of integral images). The computation of Haar features is much faster than the computation of covariance features.
Fig. 4. Performance comparison of covariance and Haar features on INRIA test set [11].
TABLE I
AVERAGE TIME REQUIRED TO EVALUATE COVARIANCE AND HAAR FEATURES
Due to the flexibility of the cascaded structure, it is easy to integrate multiple heterogeneous features. Although we use Haar and covariance features here, other combinations of features may lead to better performance. How to find the best combination remains a topic for future study.
VI. EXPERIMENTS
We evaluate the performance of our techniques on two publicly available datasets: the dataset of [14] and the INRIA dataset [11]. The first dataset [14] contains a set of extracted pedestrian and nonpedestrian samples which are scaled to a size of 18 × 36 pixels. We conduct three experiments on the dataset of [14] using covariance features trained with SVM and AdaBoost. The second dataset [11] contains 1176 pedestrian samples from 288 images. We conduct two experiments using covariance features trained with AdaBoost. To our knowledge, [11] and [1] are the state of the art in human detection in the literature. Hence, we mainly compare our algorithm with these two techniques.
The experimental section is organized as follows. First, the datasets used in the experiments, including how the performance is analyzed, are described. Experiments and the parameters used to achieve optimal results are then discussed. Finally, experimental results and analyses of the different techniques are compared. In all experiments, associated parameters are optimized via cross-validation.
A. Experiments on DaimlerChrysler Dataset With SVM
We first use the dataset in [14]. This dataset consists of three
training sets and two test sets. Each training set contains 4800
pedestrian examples and 5000 nonpedestrian examples. The
pedestrian examples were obtained by manually labeling
and extracting pedestrians in video images at various times and
locations with no particular constraints on pedestrian pose or
clothing, except that pedestrians are standing in an upright
position. Pedestrian images are mirrored and the pedestrian
bounding boxes are shifted randomly by a few pixels in the
horizontal and vertical directions. A border of 2 pixels is added
to the sample in order to preserve contour information. All
samples are scaled to 18 × 36 pixels. Performance on
the test sets is analyzed similarly to the techniques described
in [14]. For each experiment, three different classifiers are
generated. Testing all three classifiers on two test sets yields six
different ROC curves. A 95% confidence interval of the true
mean detection rate is given by the t-distribution.
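The confidence interval described above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiments: the six detection rates are made-up numbers, and the function name is hypothetical. The only fixed quantity is the two-sided 95% critical value of Student's t with 5 degrees of freedom (six curves), taken from a standard table.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 5 degrees of freedom
# (six ROC curves -> n - 1 = 5); standard table value.
T_CRIT_95_DF5 = 2.571


def mean_confidence_interval(rates, t_crit=T_CRIT_95_DF5):
    """Return (mean, half_width) of the interval for the true mean rate."""
    n = len(rates)
    mean = statistics.mean(rates)
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    std_err = statistics.stdev(rates) / math.sqrt(n)
    return mean, t_crit * std_err


# Hypothetical detection rates of the six classifier/test-set pairs,
# all read off at the same false positive rate (illustrative values only).
rates = [0.90, 0.92, 0.91, 0.89, 0.93, 0.91]
mean, half_width = mean_confidence_interval(rates)
print(f"{mean:.3f} +/- {half_width:.3f}")
```

Repeating this at each false positive rate of interest gives the error bars around the mean ROC curve.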
1) Experiment Setup: We train covariance features with various
combinations of SVMs. For this method, we concatenate the
covariance descriptors for all regions into a combined feature
vector. An SVM classifier is trained using this feature vector.
Our preliminary experiments show that training a Gaussian kernel
SVM with regions of size 7 × 7 pixels, shifted at a step size of 2
pixels over the entire input image of size 18 × 36, gives optimal
results. Increasing the region width and step size decreases the
performance slightly. The reason is that increasing the region
width and step size decreases the feature length of covariance
descriptors to be trained by SVM.
In contrast, training a linear SVM with regions of size 7 × 7
pixels gives very poor performance (all positive samples are
misclassified). We suspect that the region size is too small. As
a result, the calculated covariance features of positive and negative
samples cannot be separated by a linear hyperplane. In our
experiments, the feature length of covariance descriptors per training
sample is between 1,000 and 2,000 features. The length is
proportional to the number of image statistics used and the total
number of regions used for calculating covariance.
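To make the feature-length relationship concrete, the sketch below counts the regions for the 7 × 7, step-2 configuration on an 18 × 36 sample. Two things are assumptions rather than statements from the text: the number of image statistics per pixel (`d = 5` here) and the convention that each region contributes the d(d+1)/2 unique entries of its symmetric covariance matrix. The function names are hypothetical.

```python
def num_positions(image_len, region_len, step):
    """Number of region placements along one axis."""
    return (image_len - region_len) // step + 1


def covariance_feature_length(width, height, region, step, d):
    """Return (regions, entries_per_region, total_length).

    d is the number of per-pixel image statistics (assumed value);
    a symmetric d x d covariance matrix has d * (d + 1) / 2 unique entries.
    """
    regions = num_positions(width, region, step) * num_positions(height, region, step)
    per_region = d * (d + 1) // 2
    return regions, per_region, regions * per_region


# 18 x 36 sample, 7 x 7 regions, step 2 (from the text); d = 5 is assumed.
print(covariance_feature_length(18, 36, 7, 2, 5))
```

Under these assumptions the total falls inside the 1,000 to 2,000 range quoted above; a larger region or step reduces the region count and hence the length.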
For the HOG features, the configurations reported in [11] are
tested on the benchmark datasets. However, our preliminary
results show poor performance. This is due to the fact that the
resolution of the benchmark datasets used (18 × 36 pixels) is much
smaller than the resolution of the original datasets (64 × 128
pixels). In order to achieve a better result, we experiment with
HOG descriptors using various spatial/orientation binning and
descriptor blocks (cell sizes ranging from 3 to 8 pixels and block
sizes of 2 × 2 to 4 × 4 cells). From our experimental results, we
have decided to use a cell size of 3 × 3 pixels with a block size
of 2 × 2 cells, a descriptor stride of 2 pixels, and 18 orientation
bins of signed gradients (total feature length is 8064) to train
SVM classifiers.
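The quoted feature length of 8064 can be checked with a short calculation. The sketch below assumes the usual dense HOG layout, in which square blocks of cells slide over the window at the given stride and each cell contributes one histogram; the function name is hypothetical.

```python
def hog_feature_length(width, height, cell, block_cells, stride, bins):
    """Length of a dense HOG descriptor for one detection window."""
    block_px = cell * block_cells                    # block side in pixels
    blocks_x = (width - block_px) // stride + 1      # block placements per row
    blocks_y = (height - block_px) // stride + 1     # block placements per column
    return blocks_x * blocks_y * block_cells * block_cells * bins


# Configuration chosen above: 18 x 36 window, 3 x 3 cells, 2 x 2 blocks,
# stride 2, 18 orientation bins.
print(hog_feature_length(18, 36, cell=3, block_cells=2, stride=2, bins=18))

# For comparison, the 64 x 128 configuration of [11]
# (8 x 8 cells, 2 x 2 blocks, stride 8, 9 bins).
print(hog_feature_length(64, 128, cell=8, block_cells=2, stride=8, bins=9))
```

The first call reproduces the 8064 figure: 7 × 16 block positions, each with 2 × 2 × 18 histogram entries.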
2) Results Based on SVM on the Dataset of [14]: LRF features
with a quadratic SVM are the best approach among the features
compared in [14]. For completeness, we compare it with
our results.
Fig. 5 shows detection results of covariance features trained
with different SVM classifiers. When trained with the RBF
SVM, a region of size 7 × 7 pixels turns out to perform best
compared with other region sizes. From the figure, region
covariance features perform better than LRF features when
trained with the same SVM kernel (quadratic SVM).
Fig. 6 shows detection results of HOG features trained with
different SVM classifiers. The figure clearly indicates
that a combination of HOG features with quadratic SVM
performs best. As expected, the nonlinear SVM outperforms
the linear SVM. It is also interesting to note that the linear
SVM trained using HOG features performs better than the
Fig. 5. Performance of different parameters on region covariance features.
Fig. 6. Performance of different classifiers on histogram of oriented gradients
features.
nonlinear SVM trained using LRF features. This means that
HOG features are much better at describing spatial information
in the context of human detection than LRF features.
From these two experiments, we know that although LRF is
considered the best local feature for human detection in [14], it
cannot compete with region covariance and HOG.
We have also compared the covariance and HOG features on
the MIT CBCL datasets.² Both HOG and covariance features
perform extremely well on this MIT dataset. This is not too
surprising, given that the MIT dataset contains only frontal
and rear views of humans. Less variation in human poses makes
the classification problem much easier for SVM classifiers. It is
also interesting to note that the performance of covariance
features (with Gaussian RBF SVM) is very similar to that of HOG
features trained using Gaussian RBF and quadratic SVMs. It even
outperforms HOG features at a low false positive rate. We may
conclude that, in terms of classification performance, covariance
features are the best among the three local features we have
compared.
B. Experiments on DaimlerChrysler Dataset With a Cascade
of Boosted Covariance Features
1) Experiment Setup: For a boosted cascade of covariance
features, we generate a set of overcomplete rectangular covariance
filters and subsample the overcomplete set in order to keep
² [Online]. Available: http://cbcl.mit.edu/software-datasets/PedestrianData.html
Fig. 7. Performance comparison of our cascade of boosted covariance features
with covariance features trained using SVM (left) and histogram of oriented
gradients (HOG) features trained using SVM (right).
a manageable set for the training phase. The set contains
approximately 1120 covariance filters. Each filter (weak classifier)
consists of four parameters, namely, the x-coordinate, y-coordinate, width,
and height. A strong classifier consisting of several weak
classifiers is built in each stage of the cascade. At each stage, weak
classifiers are added until the predefined objective is met. In this
experiment, we set the minimum detection rate to be 99.5% and
the maximum false positive rate to be 50% in each stage. The
negative samples used in each stage of the cascade are collected
from false positives of the previous stage of the cascade.
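The per-stage targets translate into overall cascade rates by multiplication, as in the standard cascade analysis of [2]. The sketch below assumes every stage exactly meets its target on data that reaches it, so the numbers are design bounds rather than measured results; the function name is hypothetical.

```python
def cascade_bounds(num_stages, min_det=0.995, max_fp=0.5):
    """Overall (detection rate, false positive rate) if every stage
    just meets its per-stage targets and stage rates multiply."""
    return min_det ** num_stages, max_fp ** num_stages


for stages in (10, 20, 29):
    det, fp = cascade_bounds(stages)
    print(f"{stages:2d} stages: detection >= {det:.4f}, false positives <= {fp:.2e}")
```

With a 50% per-stage false positive target, each extra stage halves the false positive bound while costing at most 0.5% of the detections that reach it.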
Since the resolution of the test samples is quite small, we
extend the border of each test sample by one pixel. The extra
margin helps shift the pedestrian in the test sample to the
center. Doing so increases the flexibility of our boosted
classifier. During classification, we count the number of positively
classified subwindows and use this number to test whether the
test sample is pedestrian or non-pedestrian.
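The subwindow counting step can be sketched as below. A one-pixel border turns an 18 × 36 sample into a 20 × 38 one, which admits 3 × 3 placements of the 18 × 36 window. The classifier callback and the vote threshold `min_votes` are hypothetical placeholders; the text does not state the threshold used.

```python
def classify_sample(sample, window_w, window_h, is_positive, min_votes):
    """Slide a window over a padded sample (list of pixel rows) and count
    positively classified subwindows. Returns (decision, votes)."""
    rows, cols = len(sample), len(sample[0])
    votes = 0
    for top in range(rows - window_h + 1):
        for left in range(cols - window_w + 1):
            window = [row[left:left + window_w]
                      for row in sample[top:top + window_h]]
            if is_positive(window):
                votes += 1
    return votes >= min_votes, votes


# 18 x 36 sample with a one-pixel border -> 20 columns x 38 rows.
padded = [[0] * 20 for _ in range(38)]
# Dummy classifier that accepts everything, just to count placements.
decision, votes = classify_sample(padded, 18, 36, lambda w: True, min_votes=5)
print(decision, votes)
```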
2) Results Based on Boosted Covariance Features on the
Dataset of [14]: Fig. 7 shows detection results of covariance
features trained with AdaBoost. The performance of our pro-
posed method is very similar to the best performance of covari-
ance features with Gaussian SVM. It also performs better than
HOG features with linear SVM. However, the performance is
slightly worse compared with the performance of HOG features
with quadratic SVM.
We have also applied the bootstrapping technique to HOG [11]
and covariance features. Bootstrapping is applied iteratively,
generating 10 000 new nonpedestrian samples at each iteration.
We observe that collecting the first 10 000 new nonpedestrian
samples did not take long, but the second iteration took a long
time. This is to be expected, since the new classifier is more
accurate than the previous one and therefore produces fewer
false positives to collect. We observe that
training HOG features with bootstrapping improves on the
initial classifier by up to 7% in detection rate at a 2.5% false
positive rate, while the improvement
is smaller for covariance features (about 3%
at a 2.5% false positive rate). However, this performance gain
comes at a higher computation cost for training.
Finally, a comparison of the best performing results for
different feature types is shown in Fig. 8. The following
observations can be made. Out of the three features, both HOG and
covariance features perform much better than LRF. HOG features
are slightly better than covariance features. [1] concludes that the
covariance descriptor outperforms the HOG descriptor (using
Fig. 8. Performance comparison of the best classifiers for different feature
types on the dataset of [14].
TABLE II
AVERAGE TIME REQUIRED TO EVALUATE 10 FRAMES OF A SEQUENCE OF 384 × 288 PIXEL IMAGES. EACH IMAGE CONSISTS OF 17 280 WINDOWS (SCALE FACTOR OF 0.8 AND STEP SIZE OF 4 PIXELS)
human datasets of size 64 × 128 pixels with LogitBoost
classification). We suspect the difference is due to the resolution of
the datasets and the classifiers used. Low-resolution datasets yield
fewer covariance features than high-resolution datasets.
To support our findings, we conduct experiments on the INRIA
dataset [11] with a resolution of 18 × 36 pixels and include the
results at the end of Section VI-E.
We can see that gradient information is very helpful in
human detection problems. In all experiments, nonlinear SVMs
(quadratic or Gaussian RBF SVM) improve performance
significantly over the linear one. However, this comes at the
cost of a much higher computation time (approximately 50
times slower in building SVM models).
Experiments show that most false negatives are due to the
subject's pose deformation, occlusions, or very difficult
illumination environments. False positives usually contain gradient
information which looks like human body boundaries.
The advantages of our proposed method over features trained
using SVM are ease of parameter tuning and much faster
detection speed. SVM has more parameters compared to the
boosted cascade, e.g., the tradeoff between training error and
margin or the parameters of the nonlinear kernel. These parameters
need to be manually optimized for the specific classification
task using cross validation. In the next experiment, we compare
the processing speed in windows per second of the two best
classifiers: HOG with quadratic SVM and 20 stages of boosted
covariance features. We apply the two classifiers to a sequence
of 10 images with a resolution of 384 × 288 pixels in width and
height. Table II shows the average detection speed for the two
classifiers. As expected, the detection speed of 20 stages of
boosted covariance features is much faster than the detection
speed of the nonlinear SVM classifier.
Fig. 9. Number of weak classifiers in different cascade levels on the dataset of
[14]. Note that adding Haar features as a preprocessing step does not vary the
number of covariance features in later stages of the cascade much.
TABLE III
AVERAGE EVALUATION TIME IN WINDOWS PER SECOND FOR DIFFERENT
PARAMETERS OF THE TWO-LAYER BOOSTING APPROACHES
C. Experiments on DaimlerChrysler Dataset With
Two-Layer Boosting
1) Experiment Setup: We generate a set of overcomplete
Haar wavelet filters and subsample the overcomplete set. The
set of Haar features that we use to train the cascade contains
20 547 filters: 5540 vertical two-rectangle features, 5395
horizontal two-rectangle features, 3592 vertical three-rectangle
features, 3396 horizontal three-rectangle features, and 2624
four-rectangle features. From the preliminary experiments on
signed and unsigned wavelets, we observe that signed wavelets
outperform unsigned wavelets. Hence, we
preserve the sign of intensity gradients in this experiment. For
covariance features, we use the set of rectangular covariance
features generated in the previous section. Fig. 9 gives some
details about our two-layer boosting cascade.
2) Results Based on Multilayer Boosting: Table III shows the
evaluation time in windows per second for different hybrid
configurations. Adding more stages of Haar wavelet features as a
preprocessing step increases the detection speed approximately
exponentially. Fig. 10 shows the performance of our two-layer
boosting. The curve of our method is generated by adding one
cascade level at a time. The boosted covariance features
outperform all other approaches. The performance of hybrid
classifiers is quite poor at high false positive rates due to Haar-like
features in the initial stages of the cascade. Nonetheless, the
performance improves as more covariance features are added
to the later stages of the cascade.
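The speed gain of the two-layer scheme comes from early rejection: a window reaches the expensive covariance stages only if every cheap Haar stage accepts it. The sketch below illustrates the control flow only. The stages are toy threshold functions on a single number, not trained strong classifiers, and all names are hypothetical.

```python
def evaluate_two_layer(window, haar_stages, cov_stages):
    """Run the Haar layer first, then the covariance layer.

    Returns (accepted, covariance_stages_evaluated) so the saving from
    early rejection is visible.
    """
    for stage in haar_stages:
        if not stage(window):
            return False, 0          # rejected before any covariance work
    evaluated = 0
    for stage in cov_stages:
        evaluated += 1
        if not stage(window):
            return False, evaluated
    return True, evaluated


# Toy stages: each "window" is a single score and each stage a threshold.
haar_stages = [lambda s: s > 0.2, lambda s: s > 0.3]
cov_stages = [lambda s: s > 0.5, lambda s: s > 0.7]

for score in (0.1, 0.6, 0.9):
    print(score, evaluate_two_layer(score, haar_stages, cov_stages))
```

Since most windows in an image are background, the bulk of them exit in the first loop, which is why adding Haar stages raises the windows-per-second figure so sharply.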
D. Experiments on INRIA Human Dataset With AdaBoost
The dataset consists of one training set and one test set. The
training set contains 1208 pedestrian samples (2416 mirrored
Fig. 10. Performance comparison of the two-layer boosting approach and a
cascade of the boosted covariance features on the dataset of [14]. The two-layer
boosting approach performs comparably to the cascade of boosted covariance
features at low false positive rates (< 0.01), which is the range of interest.
samples) and 1200 nonpedestrian images. The pedestrian
samples were obtained by manually labeling images taken with
a digital camera at various times of the day and various
locations. The pedestrian samples are mostly in a standing position. A
border of 8 pixels is added to the sample in order to preserve
contour information. All samples are scaled to 64 × 128
pixels. The test set contains 1176 pedestrian samples (mirrored)
extracted from 288 images.
We evaluate the performance of our classifiers on the given
test set using a classification approach and a detection approach.
For human classification, we use cropped human samples
taken from the test images. During classification, the number of
positively classified windows is used to determine whether the test
sample is human or nonhuman. For human detection, a fixed-size
window is used to scan the test images with a scale factor of
0.95 and a step size of 4 pixels. As in [1], mean shift clustering
[26] is used to cluster multiple overlapping detection windows.
Simple rules as in [2] are also applied to the clustering results
to merge close detection windows.
A criterion similar to the one used in the PASCAL VOC
Challenge [27] is adopted here. Detections are considered true or
false positives based on the area of overlap with ground truth
bounding boxes. To be considered a correct detection, the area
of overlap $a_o$ between the predicted bounding box $B_p$ and the
ground truth bounding box $B_{gt}$ must exceed 40% by
$$a_o = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}.$$
Multiple detections of the same object in an image are
considered false detections. For quantitative analysis, we plot miss
rate versus false positives per window tested (false positive rate)
curves on a log-log scale. The experiments are conducted using
a standard desktop with a 2.8-GHz Intel Pentium-D CPU and
2-GB RAM.
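The overlap test can be sketched as below, assuming the overlap is measured as in the PASCAL VOC Challenge [27], i.e., intersection area divided by union area. The 40% threshold is the one stated above; the box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for this illustration.

```python
def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0                       # boxes do not overlap
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def is_correct_detection(predicted, ground_truth, threshold=0.4):
    """A detection counts as correct if the overlap exceeds the threshold."""
    return overlap_ratio(predicted, ground_truth) > threshold


ground_truth = (0, 0, 10, 10)
print(is_correct_detection((2, 0, 12, 10), ground_truth))  # overlap 80 / 120
print(is_correct_detection((5, 0, 15, 10), ground_truth))  # overlap 50 / 150
```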
1) Experiment Setup: Similar to the previous experiments,
we generate a set of overcomplete rectangular covariance filters
and subsample the overcomplete set in order to keep a
manageable set for the training phase. The set contains
approximately 15 225 covariance filters. In each stage, weak classifiers
are added until the predefined objective is met. In this
experiment, we set the minimum detection rate to be 99.5% and the
Fig. 11. Performance comparison of our cascade of boosted covariance fea-
tures with HOG with linear SVM [11] and covariance features on Riemannian
manifold [1]. The curve of covariance on Riemannian manifold is reproduced
from [1].
maximum false positive rate to be 50% in each stage. Each stage
is trained with 2416 human samples and 5000 nonhuman
samples. The negative samples used in each stage of the cascade are
collected from false positives of the previous stages of the
cascade. The final cascade consists of 29 stages.
2) Results Based on Boosted Covariance Features: Fig. 11
shows a comparison of our experimental results with different
methods. The curve of our method is generated by adding
one cascade level at a time. From the figure, it can be seen
that our system's performance is much better than HOG with
linear SVM [11] while achieving a detection rate comparable
to the technique described in [1]. The method in [1] calculates the
distance between covariance matrices on the Riemannian manifold.
This requires an eigen-decomposition, which slows down the
computation [1]. In contrast, our approach avoids the
eigen-decomposition and is therefore much faster. It is also
easier to implement. The figure also shows the performance of
our system on the human detection problem. In order to achieve
the results at low false positive rate i.e., , we
manually adjust the minimum neighbor threshold (the number of
merged detections). From Fig. 11, our covariance technique
with the detection approach outperforms the same technique with
the classification approach. This is due to the clustering
and merging techniques we use. By clustering and merging
multiple overlapping detection windows, we are able to further
reduce the number of false detections. As a result, the curve
is slightly shifted to the left. As for the processing time, on
average our unoptimized implementation in C++ can search
about 12 000 detection windows per second. Due to the cascade
structure, the search is faster when the human is against a plain
background and slower when the human is against a more complex
background. Table IV shows the average detection speed for
three different classifiers. Our search time is faster than
both [11] and [1] (2.2 times faster than
[11] and 4 times faster than [1]). Note that the system in [1] is
implemented in C++ on a Pentium-D 2.8-GHz processor with
2-GB RAM, which is the same as ours.³
In the next experiment, we show how adding a cascade
of Haar wavelet features as a preprocessing step to a cascade of
³ Personal communication with the author of [1].
TABLE IV
AVERAGE TIME REQUIRED TO EVALUATE A 240 × 320 IMAGE (12 800 WINDOWS PER IMAGE) FOR DIFFERENT DETECTORS
Fig. 12. Number of weak classifiers in different cascade levels on the INRIA
dataset [11].
TABLE V
AVERAGE EVALUATION TIME IN WINDOWS PER SECOND FOR DIFFERENT
PARAMETERS OF THE TWO-LAYER BOOSTING APPROACHES
boosted covariance features could help improve the detection
speed while maintaining a high detection rate.
E. Experiments on the INRIA Human Dataset
With Two-Layer Boosting
1) Experiment Setup: Similar to the experiments on the
dataset of [14], we subsample the overcomplete set of Haar
features to 54 779 filters: 11 446 vertical two-rectangle
features, 14 094 horizontal two-rectangle features, 8088 vertical
three-rectangle features, 10 400 horizontal three-rectangle
features, and 10 751 four-rectangle features. Unlike the previous
experiment, unsigned wavelets seem to
outperform signed wavelets. We think that,
when the human resolution is large, clothing and background
details can be easily observed and the intensity gradient sign
becomes irrelevant. In other words, the wide range of clothing
and background colors makes the gradient sign uninformative,
e.g., a person with a black shirt in front of a white background
should carry the same information as a person with a white shirt
in front of a black background. Hence, we use the absolute
values of the wavelet responses in this experiment. For
covariance features, we use the set of rectangular covariance features
generated in the previous section. Fig. 12 gives some details
about our two-layer boosting cascade.
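The black-shirt/white-shirt argument can be illustrated with a toy two-rectangle response. The sketch below is a simplification made for illustration: it sums raw pixels over the left and right halves of a tiny patch instead of using integral images, and the function name and patch values are hypothetical.

```python
def two_rect_response(patch):
    """Signed two-rectangle Haar-like response: sum(left half) - sum(right half)."""
    half = len(patch[0]) // 2
    left = sum(sum(row[:half]) for row in patch)
    right = sum(sum(row[half:]) for row in patch)
    return left - right


# Dark region next to a bright region, and the same edge with contrast inverted.
dark_on_bright = [[0, 0, 255, 255],
                  [0, 0, 255, 255]]
bright_on_dark = [[255, 255, 0, 0],
                  [255, 255, 0, 0]]

signed = (two_rect_response(dark_on_bright), two_rect_response(bright_on_dark))
unsigned = tuple(abs(r) for r in signed)
print("signed:", signed)
print("unsigned:", unsigned)
```

The signed responses differ only in sign, so a classifier using them must learn both polarities separately, whereas the absolute values coincide and describe the edge regardless of clothing or background color.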
Fig. 13. Performance comparison between different configurations of the
two-layer boosting approach based on classification (left) and detection (right) on
the INRIA dataset. Overlap among the ROC curves of different configurations
of two-layer boosting techniques indicates the performance similarity.
Fig. 14. Performance comparison between the two-layer boosting approach
(Haar features plus covariance features) and HOG features on the INRIA dataset
with a resolution of 18 × 36.
2) Results Based on Multilayer Boosting: The evaluation
time in windows per second for different hybrid configurations
is shown in Table V. Similar to previous results, adding Haar
wavelet features as a preprocessing step increases the detection
speed significantly. Compared with the original covariance
detector in [1], the two-layer boosting approach is ten times faster.
Fig. 13 shows the performance of the two-layer boosting
approach using the classification and detection approaches. For
the classification approach, the overall performance of different
hybrid configurations is very similar to the performance of
a cascade of boosted covariance features. A hybrid classifier
with 15 levels of Haar features and 12 levels of covariance
features might seem to perform poorly at high false positive
rates. However, at a low false positive rate, i.e., 2 , its
performance is very similar to that of a cascade of
boosted covariance features. For the detection approach, the
two-layer boosting approach performs slightly worse than the
cascade of boosted covariance features. This is not surprising,
since the INRIA human dataset contains humans in various poses,
which Haar features are less capable of capturing. Nonetheless,
applying boosted covariance features in the second cascade
greatly improves the overall accuracy of a boosted cascade of
Haar features.
We have also compared the two-layer boosting approach and
HOG features on the INRIA dataset [11] with a resolution of
18 × 36. Note that the experiment setup used in this experiment
Fig. 15. Detection rate and speed tradeoff for different configurations of two-layer boosting. The first two figures show detection rate versus false positive rate on
the dataset [14] and the INRIA dataset [11]. The last two figures show computation time versus false positive rate on the dataset [14] and the INRIA dataset [11].
Clearly, covariance features have the highest detection rate across all false positive rates while Haar features have the lowest detection rate. On the other hand,
Haar features are the fastest to compute while covariance features are the slowest.
is similar to the one used in previous experiment (Sections VI-B
and C). Fig. 14 shows the experimental results of different ap-
proaches. The results look slightly different from experimental
results in Section VI-C due to the different datasets used. How-
ever, the overall results seem to be consistent with results shown
in Figs. 8 and 10.
F. Detection Performance and Speed Tradeoff for the
Two-Layer Boosting
From the previous experiments, the results show that the
Haar feature classifier is much faster than
the covariance feature classifier. Therefore, for speed it is best to
place as many stages of Haar features as possible in the first layer of the
classifier. However, having too many stages of Haar features
will degrade the overall performance. In this section, we try to
find the combination that gives the best overall results.
To study the tradeoff between the detection performance and
speed of our classifiers, we perform a test at different false
positive rates. For example, to achieve a 5 false positive rate
for a boosted covariance classifier on the INRIA dataset, we only
use the first 19 stages of covariance features (instead of the full
29 stages). We then calculate the average computation time by
evaluating the 19-stage classifier on a test sequence of images.
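Choosing how many stages to keep for a target false positive rate can be sketched as below. The cumulative false positive rates here are hypothetical: they simply assume each stage halves the rate, which is the per-stage training target and not a measured curve. The function name is also hypothetical.

```python
def stages_for_target(cumulative_fp_rates, target_fp):
    """Smallest number of leading stages whose cumulative false positive
    rate is at or below the target; None if the cascade never reaches it."""
    for n, fp in enumerate(cumulative_fp_rates, start=1):
        if fp <= target_fp:
            return n
    return None


# Hypothetical cumulative rates for a 29-stage cascade in which every
# stage exactly meets the 50% per-stage false positive target.
cumulative = [0.5 ** k for k in range(1, 30)]
print(stages_for_target(cumulative, target_fp=1e-3))
print(stages_for_target(cumulative, target_fp=1e-12))
```

In practice the cumulative rates would be measured on a validation set, and the truncated cascade would then be timed on a test sequence as described above.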
Fig. 15 shows the detection rate and computation time for
different configurations of multiple-layer boosting on the dataset
of [14] and the INRIA dataset [11]. From the figure, it can be
concluded that there is a tradeoff between the detection performance
and speed. In order to achieve a high detection rate, only a small
number of Haar stages should be placed in the first layer of the
classifier. For a small-resolution dataset (18 × 36 pixels), a
configuration of Haar (5 stages) + covariance (15 stages) seems to
perform best at a reasonable computation time. For a larger
resolution dataset (64 × 128 pixels), a configuration of Haar (7 stages)
+ covariance (22 stages) seems to perform best.
VII. CONCLUSION
This paper has presented a fast and robust pedestrian
detection technique. We use weighted Fisher linear discriminant
analysis as the weak classifier for AdaBoost training. In order to
speed up the computation, a cascaded classifier architecture
is adopted [2].
From the experimental results on the datasets used in [14], our
system has been shown to give high detection performance at a low
false positive rate. Compared with techniques using a linear
SVM classifier, the proposed system outperforms all the
systems evaluated. When compared with nonlinear SVM systems,
the system is shown to perform very similarly to the covariance
features with Gaussian SVM and slightly worse than
HOG with quadratic SVM. However, the computation time of
HOG with quadratic SVM is much higher than that of our proposed
technique.
The performance of the proposed approach is also evaluated
on the INRIA pedestrian dataset [11]. On this dataset, previously
reported methods have significantly higher miss rates at almost
all false positive rates per window. Our algorithm's
performance is comparable to the state of the art [1], while it is almost
four times faster for detection due to its new design.
To further accelerate the detection, we have also introduced
a faster strategy, two-layer boosting with heterogeneous
features, to exploit the efficiency of the Haar feature and the
discriminative power of the covariance feature. This way our
detector runs ten times faster than the original covariance feature
detector [1].
Ongoing work includes the search for new features for human
detection. How to optimally design a cascaded classifier may
also be a future topic.
REFERENCES
[1] O. Tuzel, F. Porikli, and P. Meer, "Human detection via classification
on Riemannian manifolds," in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., Minneapolis, MN, 2007, pp. 1–8.
[2] P. Viola and M. J. Jones, "Robust real-time face detection," Int. J.
Comput. Vis., vol. 57, no. 2, pp. 137–154, 2004.
[3] C. Papageorgiou and T. Poggio, "A trainable system for object
detection," Int. J. Comput. Vis., vol. 38, no. 1, pp. 15–33, 2000.
[4] P. Viola, M. J. Jones, and D. Snow, "Detecting pedestrians using
patterns of motion and appearance," in Proc. IEEE Int. Conf. Comput. Vis.,
2003, pp. 734–741.
[5] D. M. Gavrila and S. Munder, "Multi-cue pedestrian detection and
tracking from a moving vehicle," Int. J. Comput. Vis., vol. 73, no. 1,
pp. 41–59, 2007.
[6] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian detection in crowded
scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., San Diego,
CA, 2005, vol. 1, pp. 878–885.
[7] B. Wu and R. Nevatia, "Detection of multiple, partially occluded
humans in a single image by Bayesian combination of edgelet part
detectors," in Proc. IEEE Int. Conf. Comput. Vis., Beijing, China, 2005, vol.
1, pp. 90–97.
[8] V. Sharma and J. Davis, "Integrating appearance and motion cues
for simultaneous detection and segmentation of pedestrians," in Proc.
IEEE Int. Conf. Comput. Vis., Rio de Janeiro, Brazil, 2007, pp. 1–8.
[9] Y. Amit, D. Geman, and X. Fan, "A coarse-to-fine strategy for
multiclass shape detection," IEEE Trans. Pattern Anal. Mach. Intell., vol.
26, no. 12, pp. 1606–1621, Dec. 2004.
[10] G. Mori, X. Ren, A. Efros, and J. Malik, "Recovering human body
configurations: Combining segmentation and recognition," in Proc. IEEE
Conf. Comput. Vis. Pattern Recognit., Washington, DC, 2004, vol. 2, pp.
326–333.
[11] N. Dalal and B. Triggs, "Histograms of oriented gradients for human
detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., San Diego,
CA, 2005, vol. 1, pp. 886–893.
[12] C. Wöhler and J. Anlauf, "An adaptable time-delay neural-network
algorithm for image sequence analysis," IEEE Trans. Neural Netw., vol.
10, no. 6, pp. 1531–1536, Dec. 1999.
[13] K. Mikolajczyk, C. Schmid, and A. Zisserman, "Human detection
based on a probabilistic assembly of robust part detectors," in Proc.
Eur. Conf. Comput. Vis., Prague, Czech Republic, May 2004, vol. 1,
pp. 69–81.
[14] S. Munder and D. M. Gavrila, "An experimental study on pedestrian
classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11,
pp. 1863–1868, Nov. 2006.
[15] J. Shawe-Taylor and N. Cristianini, Support Vector Machines and
Other Kernel-Based Learning Methods. Cambridge, U.K.: Cambridge
Univ. Press, 2000.
[16] K. Levi and Y. Weiss, "Learning object detection from a small number
of examples: The importance of good features," in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit., Washington, DC, 2004, vol. 2, pp.
53–60.
[17] J. Meynet, V. Popovici, and J.-P. Thiran, "Face detection with boosted
Gaussian features," Pattern Recognit., vol. 40, no. 8, pp. 2283–2291,
2007.
[18] D. G. Lowe, "Distinctive image features from scale-invariant
keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[19] O. Tuzel, F. Porikli, and P. Meer, "Region covariance: A fast descriptor
for detection and classification," in Proc. Eur. Conf. Comput. Vis., Graz,
Austria, May 2006, vol. 2, pp. 589–600.
[20] S. Jin, D. S. Yeung, and X. Wang, "Network intrusion detection in
covariance feature space," Pattern Recognit., vol. 40, pp. 2185–2197,
2007.
[21] R. E. Schapire, "Theoretical views of boosting and applications," in
Proc. Int. Conf. Algorithmic Learn. Theory, London, U.K., 1999, pp.
13–25.
[22] S. Z. Li and Z. Zhang, "Floatboost learning and statistical face
detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp.
1112–1123, Sep. 2004.
[23] M. T. Pham and T. J. Cham, "Fast training and selection of Haar features
using statistics in boosting-based face detection," in Proc. IEEE Int.
Conf. Comput. Vis., Rio de Janeiro, Brazil, 2007, pp. 1–7.
[24] J. Wu, M. D. Mullin, and J. M. Rehg, "Linear asymmetric classifier for
cascade detectors," in Proc. Int. Conf. Mach. Learn., Bonn, Germany,
2005, pp. 988–995.
[25] Q. Zhu, S. Avidan, M. Yeh, and K.-T. Cheng, "Fast human detection
using a cascade of histograms of oriented gradients," in Proc. IEEE
Conf. Comput. Vis. Pattern Recognit., New York, 2006, vol. 2, pp.
1491–1498.
[26] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward
feature space analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol.
24, no. 5, pp. 603–619, May 2002.
[27] The PASCAL Visual Object Classes Challenge VOC (2007). [Online].
Available: http://www.pascal-network.org/challenges/VOC/voc2007/index.html
Sakrapee Paisitkriangkrai received the B.E. de-
gree in computer engineering and the M.E. degree
in biomedical engineering from the University of
New South Wales, Sydney, Australia, where he is
currently working toward the Ph.D. degree.
His research interests include pattern recognition,
image processing, and machine learning.
Chunhua Shen received the Ph.D. degree from the
University of Adelaide, Australia, in 2005.
He is currently a Researcher with the Computer
Vision Program, NICTA, Canberra, Australia. He is
also an Adjunct Research Fellow with the Australian
National University and an Adjunct Lecturer with the
University of Adelaide. His main research interests
include statistical pattern analysis and its application
in computer vision.
Jian Zhang (M'98-SM'04) received the Ph.D.
degree in electrical engineering from the University
College, University of New South Wales, Australian
Defence Force Academy, Australia, in 1997.
He is a Principal Researcher with NICTA, Sydney,
Australia. He is also a Conjoint Associate Professor
with the University of New South Wales, Sydney,
Australia. He is currently an Associate Editor of the
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS
FOR VIDEO TECHNOLOGY and the EURASIP Journal
on Image and Video Processing.