
Fast Pedestrian Detection Using a Cascade of Boosted Covariance Features

September 2008
IEEE Transactions on Circuits and Systems for Video Technology 18(8):1140 - 1151


DOI:10.1109/TCSVT.2008.928213

Source
IEEE Xplore

Authors:

Sakrapee Paisitkriangkrai

University of Adelaide

Chunhua Shen

University of Adelaide

Jian Zhang

University of Technology Sydney

Efficiently and accurately detecting pedestrians plays a very important role in many computer vision applications such as video surveillance and smart cars. In order to find the right feature for this task, we first present a comprehensive experimental study on pedestrian detection using state-of-the-art locally extracted features (e.g., local receptive fields, histogram of oriented gradients, and region covariance). Building upon the findings of our experiments, we propose a new, simpler pedestrian detector using the covariance features. Unlike the work in [1], where the feature selection and weak classifier training are performed on the Riemannian manifold, we select features and train weak classifiers in the Euclidean space for faster computation. To this end, AdaBoost with weak classifiers based on weighted Fisher linear discriminant analysis is designed. A cascaded classifier structure is constructed for efficiency in the detection phase. Experiments on different datasets show that the new pedestrian detector is not only comparable to the state-of-the-art pedestrian detectors but also faster. To further accelerate the detection, we adopt a faster strategy, multiple-layer boosting with heterogeneous features, to exploit the efficiency of the Haar feature and the discriminative power of the covariance feature. Experiments show that, by combining the Haar and covariance features, we speed up the original covariance feature detector [1] by up to an order of magnitude in detection time with a slight drop in detection performance.

Architecture of the proposed pedestrian detection system using boosted covariance feature. We set the training objective as detection rate: 99.5%; false positive rate: 50%.

…

First and second covariance region selected by AdaBoost. The first two covariance regions overlayed on human training samples are shown in the first column. The second column displays human body parts selected by AdaBoost. The first covariance feature represents human legs (two parallel vertical bars) while the second covariance feature captures the information of the head and the human body.

…

Performance comparison of covariance and Haar features on INRIA test set [11].

…

Performance of different parameters on region covariance features.

…

Performance of different classifiers on histogram of oriented gradients features.

…


1140 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 18, NO. 8, AUGUST 2008


Index Terms: AdaBoost, boosting with heterogeneous features, local features, pedestrian detection/classification, support vector machine.

I. INTRODUCTION

EFFICIENTLY and accurately detecting pedestrians is of fundamental importance for many applications in computer vision, e.g., smart vehicles, surveillance systems with intelligent query capabilities, and sports video content analysis. In particular, there is growing effort in the development of intelligent video surveillance systems. An automated method for finding humans in a scene serves as the first important preprocessing step in understanding human activity. Despite the multitude of approaches in the literature, the problem of automatic detection of objects is far from being solved (e.g., [2]–[8]). Pedestrian detection in still images is one of the most difficult examples of generic object detection. The challenges are due to a wide range of poses that humans can adopt, large variations in clothing, as well as cluttered backgrounds and environmental conditions.

Manuscript received November 14, 2007; revised March 7, 2008 and May 22, 2008. First published July 9, 2008; current version published August 29, 2008. NICTA is funded through the Australian Government's Backing Australia's Ability initiative, in part through the ARC. This paper was recommended by Associate Editor F. Pereira.

S. Paisitkriangkrai and J. Zhang are with NICTA, Neville Roach Laboratory, Kensington, NSW 2052, Australia, and also with the University of New South Wales, Sydney, NSW 2052, Australia (e-mail: paul.pais@nicta.com.au; jian.zhang@nicta.com.au).

C. Shen is with NICTA, Canberra Research Laboratory, Canberra, ACT 2601, Australia, and also with the Australian National University, Canberra, ACT 0200, Australia (e-mail: chunhua.shen@nicta.com.au).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2008.928213

Pattern classification approaches have been shown to achieve successful results in many areas of object detection. These approaches can be decomposed into two key components: feature extraction and classifier construction. In feature extraction, dominant features are extracted from a large number of training samples. These features are then used to train a classifier. During testing, the trained classifier scans the entire input image to look for particular object patterns. This general approach has been shown to work very well in the detection of many different objects, e.g., faces [2] and car number plates [9].

The literature on pedestrian detection is abundant. Mainly, two types of image features are used: motion and shape. Motion-based approaches require preprocessing techniques such as background subtraction or image segmentation; for example, [10] segments an image into so-called superpixels and then detects the human body and estimates its pose. Approaches based on shape information typically detect pedestrians directly without using preprocessing techniques [1], [3], [11], [12]. Features can be divided into global features and local features depending on how they are measured: global features operate on the entire image, whereas local features operate on subregions of the image. One well-known global feature extraction method is principal component analysis (PCA). The drawback of global features is that they fail to capture meaningful information when there is large variation in the object's appearance, pose, and illumination conditions. Local features, on the other hand, are much less sensitive to these problems since they are extracted from subregions of the image. Some examples of commonly used local features are wavelet coefficients [2], gradient orientation [11], and region covariance [1]. Local feature approaches can be further divided into whole-body detection and body-part detection [13]. In the part-based approach, individual part detections are combined by a second classifier to form a whole-body detection. The advantage of the part-based approach is that it can deal with variation in human appearance due to body articulation. However, it adds complexity to the pedestrian detection problem. As pointed out in [14], the classification performances reported in the literature differ considerably. This may be due to the composition of the datasets with respect to negative samples: datasets whose negative samples contain large uniform image regions typically lead to much better classification performance.

Authorized licensed use limited to: UNSW Library. Downloaded on June 12, 2009 at 01:19 from IEEE Xplore. Restrictions apply.

PAISITKRIANGKRAI et al.: FAST PEDESTRIAN DETECTION USING A CASCADE OF BOOSTED COVARIANCE FEATURES 1141

The performances of several pedestrian detection approaches have been evaluated in [14]. Multiple feature-classifier combinations have been examined with respect to their receiver operating characteristic (ROC) performance and efficiency. Different features, including PCA, local receptive fields (LRF) [12], and Haar wavelets [3], are used to train neural networks, support vector machines (SVM) [15], and $k$-NN classifiers. The authors conclude that the combination of SVM with LRF features performs best. One observation is that detectors based on local features significantly outperform those using global features [14]. This may be due to the large variability of pedestrian shapes. Global features like PCA are more powerful for modeling objects with stable structures, such as frontal faces or rigid cars imaged from a fixed view angle.

Although [14] provides some insights into pedestrian detection, it does not compare the current state-of-the-art techniques, owing to the fast progress on this topic. Recently, histograms of oriented gradients (HOG) [11] and region covariance features [1] have been proposed for pedestrian detection, and they have been shown to outperform previous approaches. HOG is a gray-level image feature formed by a set of normalized gradient histograms, while region covariance is an appearance-based feature that combines pixel coordinates, intensity, and gradients into a covariance matrix. Hence, the type of features employed for detection ranges from purely silhouette-based (e.g., HOG) to appearance-based (e.g., the region covariance feature). To the best of our knowledge, these approaches have not yet been compared, and it remains unclear whether silhouette- or appearance-based features are better for pedestrian detection. The first part of this paper tries to answer this question. To find the right feature for human detection, we perform a systematic experimental study on the state-of-the-art pedestrian detection techniques: LRF, HOG, and region covariance. We select the SVM classifier for two reasons: 1) it is one of the advanced classifiers and 2) it is easy to train and, unlike neural networks, the global optimum is guaranteed. Thus, the variance caused by suboptimal training is avoided, allowing a fair comparison.

Building upon the results of our experiments, we then propose a new, simpler pedestrian detector using the covariance features. The second contribution of our work is therefore to show how multidimensional covariance features can be integrated with weighted linear discriminant analysis before being trained by the AdaBoost framework. In other words, the AdaBoost framework is adapted to vector-valued covariance features, and a weak classifier is designed according to the weighted linear discriminant analysis. This technique is not only faster but also accurate. To support our claim, we compare the performance of our proposed method with the state-of-the-art pedestrian detection techniques mentioned in [14].

The proposed boosted covariance detector achieves a detection speed that is about four times faster than the method in [1], but it is still not fast enough for real-time applications. On the one hand, the Haar feature can be computed rapidly due to its simplicity [2], but it is less powerful for classification [16]. On the other hand, although the covariance feature is a better candidate for representing pedestrians, it requires heavier computation than the Haar feature. Here, to further accelerate our proposed detector, we adopt a faster strategy, two-layer boosting with heterogeneous features, to exploit the efficiency of the Haar feature and the discriminative power of the covariance feature in a single framework. This idea has also been implemented in face detection [17] for combining Haar features with Gaussian features. It is well known that the cascade classification structure decreases the detection time by rejecting, at the beginning of the cascade, most of the regions in the image that do not contain a target. Thanks to the flexibility of the cascaded classifier, we employ the Haar feature-based classifiers at the beginning of the cascade and use the covariance feature at later stages. Experiments show that, by combining the Haar and covariance features, we speed up the conventional covariance feature detector [1] by an order of magnitude in detection time without greatly compromising the detection performance.

II. FEATURE EXTRACTION

Feature extraction is the first step in most object detection and pattern recognition algorithms. The performance of most computer vision algorithms often relies on the extracted features. The ideal feature would be one that can differentiate objects in the same category from objects in different categories. Commonly used low-level features in computer vision are color, texture, and shape. Here, we evaluate three state-of-the-art local features, namely, LRF, HOG, and region covariance. LRF features are extracted using multilayer perceptrons by means of their hidden layer. The features are tuned to the data during training, at the price of heavier computation. HOG uses histograms to describe oriented gradient information, while region covariance computes the covariance of several low-level image features such as image intensities and gradients.

A. Local Receptive Fields

Multilayer perceptrons provide an adaptive approach for feature extraction by means of their hidden layer [12]. A neuron of a higher layer does not receive input from all neurons of the underlying layer but only from a limited region of it, which is called its local receptive field (LRF). The hidden layer is divided into a number of branches.

B. Histograms of Oriented Gradients

Since the development of the scale-invariant feature transform (SIFT) [18], which uses normalized local spatial histograms as a descriptor, many research groups have been studying the use of orientation histograms in other areas. The work in [11] is one of the successful examples: it proposes histograms of oriented gradients in the context of human detection. The method uses a dense grid of histograms of oriented gradients, computed over blocks of various sizes. Each block consists of a number of cells, and blocks can overlap with each other. For each pixel $I(x, y)$, the gradient magnitude $m(x, y)$ and orientation $\theta(x, y)$ are computed from

$$d_x = I(x+1, y) - I(x-1, y) \qquad (1)$$

$$d_y = I(x, y+1) - I(x, y-1) \qquad (2)$$

$$m(x, y) = \sqrt{d_x^2 + d_y^2} \qquad (3)$$

$$\theta(x, y) = \arctan\left(\frac{d_y}{d_x}\right) \qquad (4)$$


A local 1-D orientation histogram of gradients is formed from the gradient orientations of sample points within a region. Each histogram divides the gradient angle range into a predefined number of bins. The gradient magnitudes vote into the orientation histogram. In [11], the orientation histogram of each cell has nine bins covering the orientation range of 0°–180° (unsigned gradients). Hence, each block is represented by a 36-D feature vector (9 bins/cell × 4 cells/block). The final step is to combine these normalized block descriptors to form a feature vector. The feature vector can then be used to train SVMs.
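As an illustration of the cell and block construction described above, the following is a minimal NumPy sketch, not the implementation of [11]. The function names, the central-difference border handling (border gradients left at zero), hard bin assignment without interpolation, and the L2 block normalization with a small `eps` are all assumptions made for this example.

```python
import numpy as np

def cell_orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    cell = np.asarray(cell, dtype=float)
    dx = np.zeros_like(cell)
    dy = np.zeros_like(cell)
    # Central differences; border pixels keep a zero gradient (assumption).
    dx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]
    dy[1:-1, :] = cell[2:, :] - cell[:-2, :]
    magnitude = np.hypot(dx, dy)
    # Unsigned orientation folded into [0, 180) degrees.
    orientation = np.degrees(np.arctan2(dy, dx)) % 180.0
    bin_width = 180.0 / n_bins
    bins = np.minimum((orientation / bin_width).astype(int), n_bins - 1)
    # Each pixel votes into its bin with its gradient magnitude.
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)

def block_descriptor(cells, n_bins=9, eps=1e-6):
    """Concatenate the cell histograms of one block and L2-normalize them."""
    hist = np.concatenate([cell_orientation_histogram(c, n_bins) for c in cells])
    return hist / np.sqrt(np.sum(hist ** 2) + eps ** 2)
```

With four cells per block and nine bins per cell, `block_descriptor` returns the 36-D vector mentioned in the text.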

C. Region Covariance

Tuzel et al. [1], [19] have proposed region covariance in the context of object detection. Instead of using joint histograms of the image statistics ($b^d$ dimensions, where $d$ is the number of image statistics and $b$ is the number of histogram bins used for each image statistic), the covariance is computed from several image statistics inside a region of interest ($d \times d$ dimensions). This results in a much smaller dimensionality. For each region, the correlation coefficient is calculated. The correlation coefficient of two random variables $X$ and $Y$ is given by

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\,\operatorname{var}(Y)}} \qquad (5)$$

$$\operatorname{cov}(X, Y) = \frac{1}{n-1} \sum_{k=1}^{n} (X_k - \mu_X)(Y_k - \mu_Y) \qquad (6)$$

where $\operatorname{cov}(\cdot, \cdot)$ is the covariance of two random variables, $\mu$ is the sample mean, $\operatorname{var}(\cdot)$ is the sample variance, and $n$ is the number of samples. The correlation coefficient is commonly used to describe the information we gain about one random variable by observing another random variable.

The image statistics used in this experiment are similar to those used in [1]. The 8-D feature image consists of the pixel location $x$, the pixel location $y$, the first-order partial derivative of the intensity in the horizontal direction $|I_x|$, the first-order partial derivative of the intensity in the vertical direction $|I_y|$, the magnitude $\sqrt{I_x^2 + I_y^2}$, the edge orientation $\arctan(|I_y| / |I_x|)$, the second-order partial derivative of the intensity in the horizontal direction $|I_{xx}|$, and the second-order partial derivative of the intensity in the vertical direction $|I_{yy}|$. The covariance descriptor of a region is an $8 \times 8$ matrix. Due to the symmetry, only the upper triangular part is stacked as a vector and used as the covariance descriptor. The descriptors encode information about the correlations of the defined features inside the region. Note that this treatment is different from that in [1] and [19], where the covariance matrix is directly used as the feature and the distance between features is calculated on the Riemannian manifold.¹ However, calculating the distance on the Riemannian manifold involves eigen-decomposition, which is computationally very expensive ($O(d^3)$ arithmetic operations). We instead vectorize the symmetric matrix and measure the distance in the Euclidean space, which is faster.

¹ Covariance matrices are symmetric and positive semi-definite, hence they reside in the Riemannian manifold.
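The vectorized descriptor described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: derivatives are approximated with `np.gradient`, the edge orientation is taken as `arctan2(|I_y|, |I_x|)`, and the function names `feature_image` and `covariance_descriptor` are hypothetical. It stacks the raw covariance entries; normalizing them to correlation coefficients would be one further step.

```python
import numpy as np

def feature_image(gray):
    """Build an (H, W, 8) array of per-pixel image statistics."""
    gray = np.asarray(gray, dtype=float)
    h, w = gray.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    iy, ix = np.gradient(gray)       # first-order derivatives (rows, columns)
    iyy = np.gradient(iy, axis=0)    # second-order vertical derivative
    ixx = np.gradient(ix, axis=1)    # second-order horizontal derivative
    magnitude = np.hypot(ix, iy)
    orientation = np.arctan2(np.abs(iy), np.abs(ix))  # assumed form of edge orientation
    return np.stack(
        [xs, ys, np.abs(ix), np.abs(iy), magnitude, orientation,
         np.abs(ixx), np.abs(iyy)],
        axis=-1,
    )

def covariance_descriptor(features, top, left, height, width):
    """Upper-triangular part of the region covariance, stacked as a vector."""
    d = features.shape[-1]
    region = features[top:top + height, left:left + width].reshape(-1, d)
    cov = np.cov(region, rowvar=False)   # d x d sample covariance
    rows, cols = np.triu_indices(d)      # d (d + 1) / 2 unique entries
    return cov[rows, cols]
```

For the 8-D feature image this gives a 36-D vector per region, which can then be compared with an ordinary Euclidean distance.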

Preliminary experiments, similar to the experiment described in [19], have been conducted to compare the two different distance measures: the distance between the correlation coefficients of two covariance matrices in the Euclidean space and the distance between two covariance matrices on the Riemannian manifold. The results indicate that their performance on pedestrian detection is quite similar.

To compute the covariance matrices more efficiently, a technique that employs integral images [2] can be applied [19]. By expanding the mean in the previous equation, the covariance can be written as

$$\operatorname{cov}(X, Y) = \frac{1}{n-1}\left[\sum_{k=1}^{n} X_k Y_k - \frac{1}{n} \sum_{k=1}^{n} X_k \sum_{k=1}^{n} Y_k\right] \qquad (7)$$

Hence, to compute the covariance quickly in a given rectangular region, the sum of each feature dimension, e.g., $\sum_{k=1}^{n} X_k$ and $\sum_{k=1}^{n} Y_k$, and the sum of the product of any two feature dimensions, e.g., $\sum_{k=1}^{n} X_k Y_k$, can be computed using the integral image.
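The integral-image computation can be sketched as below. This is a minimal NumPy illustration, not the authors' code: the function names are hypothetical, and the layout (one zero-padded integral tensor of first-order sums and one of pairwise products) is an assumption chosen for clarity rather than memory efficiency.

```python
import numpy as np

def build_integral_tensors(features):
    """Integral images of each feature and of every pairwise feature product."""
    h, w, d = features.shape
    first = np.zeros((h + 1, w + 1, d))
    second = np.zeros((h + 1, w + 1, d, d))
    first[1:, 1:] = features.cumsum(axis=0).cumsum(axis=1)
    products = features[..., :, None] * features[..., None, :]
    second[1:, 1:] = products.cumsum(axis=0).cumsum(axis=1)
    return first, second

def box_sum(integral, top, left, height, width):
    """Sum over a rectangle from four corner lookups."""
    bottom, right = top + height, left + width
    return (integral[bottom, right] - integral[top, right]
            - integral[bottom, left] + integral[top, left])

def fast_covariance(first, second, top, left, height, width):
    """Region covariance: (sum of products - outer(sums) / n) / (n - 1)."""
    n = height * width
    sums = box_sum(first, top, left, height, width)
    prods = box_sum(second, top, left, height, width)
    return (prods - np.outer(sums, sums) / n) / (n - 1)
```

Once the two tensors are built, the covariance of any rectangle costs a fixed number of lookups, independent of the rectangle's area, which is what makes exhaustive region search during boosting affordable.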

The extracted covariance features assume that the image statistics follow a single Gaussian distribution. Although this assumption may look overly simple, experiments demonstrate the efficacy of the covariance features. Jin et al. [20] have used the same idea for network intrusion detection.

III. CLASSIFIERS

Many classification techniques can be applied to object detection. Two of the most commonly applied are SVM [15] and AdaBoost [2], [21].

A. Support Vector Machines

SVM is one of the popular large-margin classifiers [15] and has a very promising generalization capacity. Owing to space limits, we omit the details of SVM; the reader is referred to [15]. In our experiments, SVM classifiers with three different kernel functions (linear, quadratic, and RBF) are compared using the features described in the previous section.

B. AdaBoost

AdaBoost is the first practical and efficient algorithm for ensemble learning [21]. The training procedure of AdaBoost is a greedy algorithm, which constructs an additive combination of weak classifiers such that the exponential loss $L(y, F(\mathbf{x})) = \exp(-y F(\mathbf{x}))$ is minimized. Here, $\mathbf{x}$ is a labeled training example and $y$ is its label; $F(\mathbf{x})$ is the final decision function, which outputs the decided class label. AdaBoost iteratively combines a number of weak classifiers to form a strong classifier. A weak classifier is defined as a classifier whose accuracy on the training set is greater than that of random guessing. The final strong classifier can be defined as $F(\mathbf{x}) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right)$, where $\alpha_t$ is a weight coefficient, $h_t(\cdot)$ is a weak learner, and $T$ is the number of weak classifiers. At each new round, AdaBoost selects a new hypothesis $h_t(\cdot)$ that best classifies the training samples with minimal classification error. Each training sample receives a weight that determines its probability of being selected for a training set. If a training sample is correctly classified, then its probability of being used again in a subsequent component classifier is reduced. Conversely, if the sample is misclassified, then its probability of being used again is increased. In this way, the algorithm focuses more on the misclassified samples after each round of boosting.

Fig. 1. Architecture of the proposed pedestrian detection system using boosted covariance feature. We set the training objective as detection rate: 99.5%; false positive rate: 50%.
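To make the weight update concrete, here is a minimal sketch of one round of discrete AdaBoost in NumPy. It uses deterministic reweighting rather than resampling by weight, which is an assumption of this example; the function name and the clipping constant are hypothetical.

```python
import numpy as np

def adaboost_round(weights, labels, predictions):
    """One discrete-AdaBoost update for labels and predictions in {-1, +1}.

    Returns the weak learner's coefficient alpha and the renormalized weights.
    """
    weights = weights / weights.sum()
    error = np.sum(weights[predictions != labels])   # weighted training error
    error = np.clip(error, 1e-12, 1.0 - 1e-12)       # guard against log(0)
    alpha = 0.5 * np.log((1.0 - error) / error)
    # Correct samples (y * h = +1) shrink, misclassified ones (y * h = -1) grow.
    new_weights = weights * np.exp(-alpha * labels * predictions)
    return alpha, new_weights / new_weights.sum()
```

For example, with four equally weighted samples of which one is misclassified, the weighted error is 0.25 and, after the update, the misclassified sample carries half of the total weight.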

Since Viola et al. [2] introduced AdaBoost into computer vision for face detection, extensions have been proposed for better classification performance [22], fast training [23], or dealing with imbalanced training data [24]. These techniques can be applied to our problem. We leave this as a future research direction.

IV. BOOSTED COVARIANCE FEATURES

Here, we describe a new pedestrian detection system. Fig. 1 shows the structure of the new pedestrian detector. We use the covariance feature originally introduced in [19]. The reasons why we choose this local feature will be explained in detail in Section IV-A.

Given the training dataset, each training sample is assigned a weight which determines its probability of being selected for a training set. For each of a set of given rectangular windows, a covariance matrix is calculated from several low-level image statistics within the rectangular region. The upper triangular part of the computed covariance matrix is stacked as a vector and used as a covariance descriptor. The vector of covariance descriptors is projected onto a 1-D space using the algorithm described in Section IV-A. AdaBoost is then applied to select the best rectangular region, i.e., the one whose weak learner classifies the training samples with minimal classification error. The best weak learner is added to a cascade. Weak learners are added until the predefined classification accuracy is met. The process is repeated for the next stage of the cascade.

This section begins with a short explanation of the concept of Fisher linear discriminant analysis (LDA). We then extend the method to training samples with varying weights. Finally, we describe in detail how to apply these techniques to train multidimensional covariance features within a cascade of AdaBoost classifiers.

A. Weighted Fisher Linear Discriminant Analysis

The objective of Fisher's criterion is to find a linear combination of the variables that separates two classes as much as possible. However, the criterion proposed by Fisher assumes uniformly weighted training samples. In AdaBoost training, each data point is associated with a weight that measures how difficult it is to classify correctly. Therefore, we need to apply a weighted version of the standard Fisher linear discriminant analysis (WLDA). Similar to LDA, WLDA finds a linear combination of the variables that separates two classes as much as possible, with emphasis on the training samples with high weights.

Fig. 2. First and second covariance region selected by AdaBoost. The first two covariance regions overlaid on human training samples are shown in the first column. The second column displays human body parts selected by AdaBoost. The first covariance feature represents human legs (two parallel vertical bars) while the second covariance feature captures the information of the head and the human body.

It is well known that the choice of weak classifiers is vital to the classification accuracy of boosting techniques. Although effective weak classifiers increase the performance of the final strong classifier, the large number of potential features makes the computation prohibitively heavy when complex classifiers such as SVMs are used. For scalar features such as the Haar features in [2], [4], a very efficient decision stump can be used. For vector-valued features such as HOG or covariance features, unfortunately, seeking an optimal linear discriminant would require a much longer time. Although, as shown in [25], it is possible to use linear SVMs as weak learners, the training procedure is very time-consuming. Here we adopt a more efficient approach. We project the multidimensional covariance features onto a 1-D line using WLDA, which finds a linear projection function that guarantees optimal classification of normally distributed samples of two classes.

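Each weak learner can then be defined as

$$h(\mathbf{x}) = \begin{cases} +1 & \text{if } \mathbf{w}^{\top}\mathbf{x} > \theta \\ -1 & \text{otherwise} \end{cases} \qquad (8)$$

where $h(\cdot)$ defines a weak learner, $\mathbf{x}$ is the calculated covariance feature vector, $\mathbf{w}$ is the projection vector found by WLDA, and $\theta$ is an optimal threshold such that the minimum number of examples is misclassified.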
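A WLDA-based weak learner of this kind can be sketched as follows. This is a hedged NumPy illustration, not the authors' implementation: the sample weights enter both the class means and the within-class scatter, a small ridge term is added to keep the linear solve well-posed (an assumption), the threshold search is a simple exhaustive scan, and all function names are hypothetical.

```python
import numpy as np

def wlda_direction(X, y, sample_weights, ridge=1e-6):
    """Weighted Fisher direction: solve S_w a = mu_pos - mu_neg."""
    d = X.shape[1]
    scatter = ridge * np.eye(d)          # ridge regularization (assumption)
    means = {}
    for label in (1, -1):
        mask = y == label
        wts = sample_weights[mask]
        means[label] = np.average(X[mask], axis=0, weights=wts)
        diff = X[mask] - means[label]
        scatter += (wts[:, None] * diff).T @ diff   # weighted within-class scatter
    return np.linalg.solve(scatter, means[1] - means[-1])

def best_stump(z, y, sample_weights):
    """Threshold theta minimizing the weighted error of sign(z - theta)."""
    zs = np.sort(z)
    candidates = np.concatenate(([zs[0] - 1.0], (zs[:-1] + zs[1:]) / 2.0, [zs[-1] + 1.0]))
    errors = [np.sum(sample_weights[np.where(z > t, 1, -1) != y]) for t in candidates]
    k = int(np.argmin(errors))
    return candidates[k], errors[k]

def train_weak_learner(X, y, sample_weights):
    """Project with WLDA, then fit a decision stump on the 1-D projections."""
    sample_weights = sample_weights / sample_weights.sum()
    direction = wlda_direction(X, y, sample_weights)
    theta, error = best_stump(X @ direction, y, sample_weights)
    return direction, theta, error

def predict(X, direction, theta):
    return np.where(X @ direction > theta, 1, -1)
```

Because the direction points from the negative mean toward the positive mean, positives project to larger values, so the stump only needs to test one polarity. In a full detector, one such weak learner would be trained per candidate region and the one with the lowest weighted error kept.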

B. Cascade of Covariance Descriptors

The covariance feature efficiently captures the relationship between different image statistics. Combined with WLDA, this information can be used to represent a distinct part of the human body. At each AdaBoost iteration, a simple classifier is trained from the collection of region covariance features. The experimental results show that the covariance regions selected by AdaBoost are physically meaningful and can be easily interpreted, as shown in Fig. 2. The first selected feature focuses on the bottom part of the human body, while the second selected feature focuses on the top part of the body. It turns out that covariance features are well adapted to capture patterns that are invariant to illumination changes and changes in human pose and appearance.

Our fast detection framework based on boosted covariance features is summarized in Algorithm 1.

Algorithm 1. The training algorithm for building the cascade of boosted covariance detector.

Input:
    A positive training set and a negative training set;
    $D_{\min}$: minimum acceptable detection rate per cascade level;
    $F_{\max}$: maximum acceptable false positive rate per cascade level;
    $F_{\text{target}}$: target overall false positive rate.

Initialize: $i = 0$; $D_0 = 1$; $F_0 = 1$;

while $F_{\text{target}} < F_i$ do
    $i \leftarrow i + 1$; $f_i = 1$;
    while $f_i > F_{\max}$ do
        (1) Normalize AdaBoost weights;
        (2) Calculate the projection vector with WLDA, and project the covariance features to 1-D;
        (3) Train decision stumps by finding an optimal threshold $\theta$, using the training set;
        (4) Add the best decision stump classifier into the strong classifier;
        (5) Update sample weights in the AdaBoost manner;
        (6) Lower the threshold such that $D_{\min}$ holds;
        (7) Update $f_i$ using this threshold.
    end
    $D_i = D_{i-1} \times D_{\min}$; $F_i = F_{i-1} \times f_i$; and remove correctly classified negative samples from the training set;
    if $F_{\text{target}} < F_i$ then
        Evaluate the current cascaded classifier on the negative images and add misclassified samples into the negative training set.
    end
end

Output:
    A cascade of boosted covariance classifiers for each cascade level $i = 1, 2, \ldots$;
    Final training accuracy: $F_i$ and $D_i$.

Fig. 3. Structure of our two-layer pedestrian detector.

In order to reduce computation time, a cascade of classifiers is built [2]. The key insight is that efficient boosted classifiers, which can reject many of the simple nonpedestrian samples while detecting almost all pedestrian samples, are constructed and placed at the early stages of the cascade. Time-consuming and complex boosted classifiers, which can remove more complex nonpedestrian samples, are placed in the later stages of the cascade. By constructing classifiers in this way, we are able to quickly discard simple background regions of the image, e.g., sky, building, or road, while spending more time on pedestrian-like regions. Only samples that pass through all stages of the cascade are classified as pedestrians.
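The early-rejection behavior, and the way per-stage rates compound across the cascade, can be sketched as follows. This is an illustrative Python sketch: `cascade_accepts` and `overall_rates` are hypothetical names, each stage is modeled as an arbitrary callable returning a boolean, and the assumption that every stage meets exactly the per-stage targets (99.5% detection, 50% false positives, as in the Fig. 1 caption) is an idealization.

```python
from typing import Callable, Sequence, Tuple

Stage = Callable[[object], bool]

def cascade_accepts(window, stages: Sequence[Stage]) -> Tuple[bool, int]:
    """Run a window through the stages, stopping at the first rejection.

    Returns (accepted, number_of_stages_evaluated).
    """
    for evaluated, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, evaluated      # rejected early; later stages never run
    return True, len(stages)

def overall_rates(stage_detection: float, stage_false_positive: float,
                  n_stages: int) -> Tuple[float, float]:
    """Overall detection and false positive rates if every stage hits its target."""
    return stage_detection ** n_stages, stage_false_positive ** n_stages
```

Under this idealization, 20 stages at 99.5% detection and 50% false positives per stage would give an overall detection rate of about 90% and an overall false positive rate of about one in a million, which is why loose per-stage targets are acceptable.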

V. TWO-LAYER BOOSTING WITH HETEROGENEOUS FEATURES

In order to further accelerate our proposed detector, an approach which consists of a two-layer cascade of classifiers is built [17]. The objective of designing a two-layer approach is to achieve high detection speed and accuracy. The idea is to place simple and fast-to-compute features in the first layer while putting more accurate but slower-to-compute features in the second layer of the cascade. The simple features filter out most simple nonpedestrian patterns in the early stages of the cascade.

Haar wavelet features have proved to be extremely fast and highly powerful in the application of face detection [2]. However, the Haar feature performs poorly in the context of human detection, as reported in [4]. In order to improve the overall accuracy, we apply boosted covariance features in the second layer. In other words, Haar features are used in the first cascade while boosted covariance features are used in the second cascade. In this way, we utilize the efficiency of the Haar feature and the discriminative power of the covariance feature in a single framework. Fig. 3 shows the detector architecture of the two-layer approach.

We experimentally evaluate covariance features and Haar fea-

tures by training two different classiﬁers on the same training

set using AdaBoost. The positive training set is extracted from

INRIA dataset [11] which consists of 2416 human samples (mir-

rored). The negative training set comes from random patches

extracted from negative images. The classiﬁers are evaluated

on the INRIA test set. Fig. 4 gives a comparison of the per-

formances of different feature types. The following observation

can be made from the ﬁgure. The test error decreases quickly

with the number of AdaBoost iterations for all features. The test

error of covariance features run into saturation after about 100

iterations while the test error rate of Haar feature continues to

decrease slowly. The results can also be interpreted in terms of

the number of selected features and test error rate. For example,

it is possible to achieve a 5% test error rate using either 25 co-

variance features or 100 Haar features. Table I shows the com-

putation time for different feature types (including computation

Fig. 4. Performance comparison of covariance and Haar features on INRIA

test set [11].

TABLE I

AVERAGE TIME REQUIRED TO EVALUATE COVARIANCE AND HAAR FEATURES

overhead of integral images). The computation of Haar features

is much faster than the computation of covariance features.

Due to the ﬂexibility of the cascaded structure, it is easy to in-

tegrate multiple heterogeneous features. Although we use Haar

and covariance features here, other combinations of fea-

tures may lead to better performance. How to find the best

combination remains a topic for future study.

VI. EXPERIMENTS

We evaluate the performance of our techniques on two pub-

licly available datasets: the dataset of [14] and the INRIA dataset [11].

The first dataset [14] contains a set of extracted pedestrian and

non-pedestrian samples which are scaled to size 18 × 36 pixels.

We conduct three experiments on the dataset of [14] using co-

variance features trained with SVM and AdaBoost. The second

dataset [11] contains 1176 pedestrian samples from 288 im-

ages. We conduct two experiments using covariance features

trained with AdaBoost. To our knowledge, [11] and [1] represent the

state of the art in human detection in the literature. Hence, we

mainly compare our algorithm with these two techniques.

The experimental section is organized as follows. First, the

datasets used in this experiment, including how the performance

is analyzed, are described. Experiments and the parameters used

to achieve optimal results are then discussed. Finally, exper-

imental results and analysis of different techniques are com-

pared. In all experiments, associated parameters are optimized

via cross-validation.

A. Experiments on DaimlerChrysler Dataset With SVM

We ﬁrst use the dataset in [14]. This dataset consists of three

training sets and two test sets. Each training set contains 4800

pedestrian examples and 5000 nonpedestrian examples. The

pedestrian examples were obtained by manually labeling

and extracting pedestrians in video images at various times and

locations with no particular constraints on pedestrian pose or

clothing, except that pedestrians are standing in an upright

position. Pedestrian images are mirrored and the pedestrian

bounding boxes are shifted randomly by a few pixels in hor-

izontal and vertical directions. A border of 2 pixels is added

to the sample in order to preserve contour information. All

samples are scaled to size 18 × 36 pixels. Performance on

the test sets is analyzed similarly to the techniques described

in [14]. For each experiment, three different classiﬁers are

generated. Testing all three classiﬁers on two test sets yields six

different ROC curves. A 95% conﬁdence interval of the true

mean detection rate is given by the t-distribution.
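The confidence interval described above can be sketched as follows. This is a minimal example under stated assumptions: the six detection rates are hypothetical placeholders (one per classifier and test set combination, read off at a fixed false positive rate), and the critical value 2.571 is the two-sided 95% value of Student's t with 5 degrees of freedom, so the helper as written only applies to six samples.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 5 degrees of freedom (n = 6).
T_CRIT_95_DF5 = 2.571


def mean_confidence_interval(rates, t_crit=T_CRIT_95_DF5):
    """Return (low, high) bounds of the confidence interval of the mean."""
    n = len(rates)
    mean = statistics.mean(rates)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(rates) / math.sqrt(n)
    half_width = t_crit * sem
    return mean - half_width, mean + half_width


# Hypothetical detection rates of 3 classifiers x 2 test sets.
rates = [0.90, 0.92, 0.88, 0.91, 0.89, 0.90]
low, high = mean_confidence_interval(rates)
```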

1) Experiment Setup: We train covariance features with var-

ious combinations of SVMs. For this method, we concatenate the

covariance descriptors for all regions into a combined feature

vector. An SVM classiﬁer is trained using this feature vector.

Our preliminary experiments show that training a Gaussian kernel

SVM with regions of size 7 × 7 pixels, shifted at a step size of 2

pixels over the entire input image of size 18 × 36, gives the best

results. Increasing the region width and step size decreases the

performance slightly. The reason is that increasing the region

width and step size decreases the feature length of covariance

descriptors to be trained by SVM.

In contrast, training a linear SVM with regions of size 7 × 7

pixels gives very poor performance (all positive samples are

misclassified). We suspect that the region size is too small. As

a result, the calculated covariance features of positive and negative

samples cannot be separated by a linear hyperplane. In our exper-

iments, the feature length of covariance descriptors per training

sample is between 1000 and 2000. The length is pro-

portional to the number of image statistics used and the total

number of regions used for calculating covariance.
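The concatenated descriptor described above can be sketched with NumPy. This is an illustration under stated assumptions, not the authors' code: the five per-pixel statistics used here (x, y, |Ix|, |Iy|, gradient magnitude) are a hypothetical choice, and only the upper triangle of each symmetric covariance matrix is kept. With 7 × 7 regions at a step of 2 pixels on an 18 × 36 sample there are 6 × 15 = 90 regions, so five statistics give 90 × 15 = 1350 values.

```python
import numpy as np


def pixel_features(img):
    """Per-pixel statistics (assumed set): x, y, |Ix|, |Iy|, gradient magnitude."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    iy, ix = np.gradient(img.astype(float))  # derivatives along rows, columns
    return np.stack([xs, ys, np.abs(ix), np.abs(iy), np.hypot(ix, iy)], axis=-1)


def covariance_descriptor(img, region=7, step=2):
    """Concatenate the upper triangles of all region covariance matrices."""
    feats = pixel_features(img)
    h, w, d = feats.shape
    upper = np.triu_indices(d)  # d * (d + 1) / 2 unique entries per region
    parts = []
    for y in range(0, h - region + 1, step):
        for x in range(0, w - region + 1, step):
            patch = feats[y:y + region, x:x + region].reshape(-1, d)
            cov = np.cov(patch, rowvar=False)  # d x d covariance of the region
            parts.append(cov[upper])
    return np.concatenate(parts)


# Random stand-in for an 18 x 36 sample (36 rows, 18 columns).
img = np.random.default_rng(0).random((36, 18))
desc = covariance_descriptor(img)
```

The resulting vector is what would be passed to the SVM; its length grows with both the number of statistics and the number of regions, as stated in the text.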

For the HOG features, the conﬁgurations reported in [11] are

tested on the benchmark datasets. However, our preliminary re-

sults show poor performance. This is because the

resolution of the benchmark datasets used (18 × 36 pixels) is much

smaller than the resolution of the original datasets (64 × 128

pixels). In order to achieve a better result, we experiment with

HOG descriptors using various spatial/orientation binnings and de-

scriptor blocks (cell sizes ranging from 3 to 8 pixels and block

sizes of 2 × 2 to 4 × 4 cells). Based on our experimental results, we

use a cell size of 3 × 3 pixels with a block size

of 2 × 2 cells, a descriptor stride of 2 pixels, and 18 orientation

bins of signed gradients (total feature length is 8064) to train

SVM classifiers.
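The stated feature length can be checked with a short calculation. The helper below is a sketch that assumes the usual dense HOG layout, in which square blocks of cells are slid over the window at a fixed pixel stride and every cell in every block contributes one orientation histogram; the function name `hog_length` is introduced here for illustration.

```python
def hog_length(win_w, win_h, cell, block_cells, stride, bins):
    """Length of a dense HOG descriptor for one detection window."""
    block_px = cell * block_cells                 # block side in pixels
    blocks_x = (win_w - block_px) // stride + 1   # block positions per row
    blocks_y = (win_h - block_px) // stride + 1   # block positions per column
    return blocks_x * blocks_y * block_cells * block_cells * bins


# Configuration chosen in the text for 18 x 36 samples.
small = hog_length(18, 36, cell=3, block_cells=2, stride=2, bins=18)
print(small)  # 7 * 16 blocks * 4 cells * 18 bins = 8064
```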

2) Results Based on SVM on the Dataset of [14]: LRF fea-

tures with quadratic SVM is the best approach among the fea-

tures compared in [14]. For completeness, we compare it with

our results.

Fig. 5 shows detection results of covariance features trained

with different SVM classiﬁers. When trained with the RBF

SVM, a region of size 7 × 7 pixels turns out to perform best

compared with other region sizes. From the ﬁgure, region

covariance features perform better than LRF features when

trained with the same SVM kernel (quadratic SVM).

Fig. 6 shows detection results of HOG features trained with

different SVM classifiers. The figure clearly indicates

that a combination of HOG features with quadratic SVM

performs best. Obviously, the nonlinear SVM outperforms

the linear SVM. It is also interesting to note that the linear

SVM trained using HOG features performs better than the

Fig. 5. Performance of different parameters on region covariance features.

Fig. 6. Performance of different classiﬁers on histogram of oriented gradients

features.

nonlinear SVM trained using LRF features. This means that

HOG features are much better at describing spatial information

in the context of human detection than LRF features.

From these two experiments, we know that although LRF is

considered the best local feature for human detection in [14], it

cannot compete with region covariance and HOG.

We have also compared the covariance and HOG features on

the MIT CBCL datasets.² Both HOG and covariance features

perform extremely well on this MIT dataset. This is not too sur-

prising given that the MIT dataset contains only frontal

and rear views of humans. Less variation in human poses makes

the classiﬁcation problem much easier for SVM classiﬁers. It is

also interesting to note that the performance of covariance fea-

tures (with Gaussian RBF SVM) is very similar to HOG fea-

tures trained using Gaussian RBF and quadratic SVM. It even

outperforms HOG features at a low false positive rate. We may

conclude that in terms of classiﬁcation performance, covariance

features are the best among the three local features we have

compared.

B. Experiments on DaimlerChrysler Dataset With a Cascade

of Boosted Covariance Features

1) Experiment Setup: For a boosted cascade of covariance

features, we generate a set of overcomplete rectangular covari-

ance ﬁlters and subsample the overcomplete set in order to keep

² [Online]. Available: http://cbcl.mit.edu/software-datasets/PedestrianData.html

Fig. 7. Performance comparison of our cascade of boosted covariance features

with covariance features trained using SVM (left) and histogram of oriented

gradients (HOG) features trained using SVM (right).

a manageable set for the training phase. The set contains approx-

imately 1120 covariance filters. Each filter (weak classifier) con-

sists of four parameters, namely, x-coordinate, y-coordinate, width,

and height. A strong classiﬁer consisting of several weak clas-

siﬁers is built in each stage of the cascade. At each stage, weak

classiﬁers are added until the predeﬁned objective is met. In this

experiment, we set the minimum detection rate to be 99.5% and

the maximum false positive rate to be 50% in each stage. The

negative samples used in each stage of the cascade are collected

from false positives of the previous stage of the cascade.
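The effect of the per-stage objective on the whole cascade can be worked out directly. The sketch below assumes that every stage exactly meets its targets and that stages behave independently, which is the usual back-of-the-envelope reasoning for cascades and not a measured result; the function name is introduced here for illustration.

```python
def cascade_rates(n_stages, stage_detection=0.995, stage_false_positive=0.5):
    """Overall (detection rate, false positive rate) of an n-stage cascade,
    assuming each stage exactly meets its target and stages are independent."""
    return stage_detection ** n_stages, stage_false_positive ** n_stages


for n in (1, 10, 20):
    detection, false_positive = cascade_rates(n)
    print(f"{n:2d} stages: detection {detection:.4f}, "
          f"false positives {false_positive:.2e}")
```

Under these assumptions, a 20-stage cascade retains about 0.995^20 ≈ 0.905 of the pedestrians while passing about 0.5^20 ≈ 9.5 × 10^-7 of the nonpedestrian windows.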

Since the resolution of the test samples is quite small, we

extend the border of each test sample by one pixel. The extra

margin helps shift the pedestrian in the test sample to the

center. Doing so increases the flexibility of our boosted classi-

fier. During classification, we count the number of positively

classified subwindows and use this number to test whether the

test sample is a pedestrian or a non-pedestrian.
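The subwindow counting step can be sketched as follows. This is a hypothetical illustration: it assumes the padded sample is 20 × 38 pixels (an 18 × 36 sample with a one-pixel border on each side), that every 18 × 36 subwindow is tested, and that a vote threshold `min_votes` decides the label; the threshold and the `classify_window` callable are assumptions, since the text does not specify them.

```python
import numpy as np


def classify_by_votes(sample, classify_window, win_w=18, win_h=36, min_votes=1):
    """Slide a fixed-size window over the padded sample and count positives."""
    h, w = sample.shape
    votes = 0
    for y in range(h - win_h + 1):
        for x in range(w - win_w + 1):
            if classify_window(sample[y:y + win_h, x:x + win_w]):
                votes += 1
    # The sample is labelled pedestrian when enough subwindows are positive.
    return votes >= min_votes, votes


# Padded test sample (38 rows, 20 columns) and a stand-in classifier that
# accepts every subwindow, just to show how many subwindows are visited.
padded = np.zeros((38, 20))
is_pedestrian, votes = classify_by_votes(padded, lambda win: True)
```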

2) Results Based on Boosted Covariance Features on the

Dataset of [14]: Fig. 7 shows detection results of covariance

features trained with AdaBoost. The performance of our pro-

posed method is very similar to the best performance of covari-

ance features with Gaussian SVM. It also performs better than

HOG features with linear SVM. However, the performance is

slightly worse compared with the performance of HOG features

with quadratic SVM.

We have also applied a bootstrapping technique to HOG [11]

and covariance features. Bootstrapping is applied iteratively,

generating 10 000 new nonpedestrian samples at each iteration.

It is observed that collecting the ﬁrst 10 000 new nonpedestrian

samples did not take long, but the second iteration took a long

time. This is exactly to be expected since the new classiﬁer has

better accuracy than the previous classifier. We observe that

bootstrapping improves the HOG classifier over the initial

classifier by up to 7% in detec-

tion rate at a 2.5% false positive rate, while the improvement

is slightly lower for covariance features (about 3%

at a 2.5% false positive rate). However, this performance gain

comes at a higher computation cost for training.

Finally, a comparison of the best performing results for dif-

ferent feature types is shown in Fig. 8. The following observa-

tions can be made. Out of the three features, both HOG and co-

variance features perform much better than LRF. HOG features

are slightly better than covariance features. [1] concludes that the

covariance descriptor outperforms the HOG descriptor (using

Fig. 8. Performance comparison of the best classiﬁers for different feature

types on the dataset of [14].

TABLE II

AVERAGE TIME REQUIRED TO EVALUATE 10 FRAMES OF A SEQUENCE OF 384 × 288

PIXEL IMAGES. EACH IMAGE CONSISTS OF 17 280 WINDOWS (SCALE

FACTOR OF 0.8 AND STEP SIZE OF 4 PIXELS)

human datasets of size 64 × 128 pixels with LogitBoost classi-

fication). We suspect the difference lies in the resolution of the

datasets and the classifiers used. Low-resolution datasets yield

fewer covariance features than high-resolution data-

sets. To support our findings, we conduct experiments on the INRIA

dataset [11] with a resolution of 18 × 36 pixels and include the

results at the end of Section VI-E.

We can see that gradient information is very helpful in

human detection problems. In all experiments, nonlinear SVMs

(quadratic or Gaussian RBF SVM) improve performance

signiﬁcantly over the linear one. However, this comes at the

cost of a much higher computation time (approximately 50

times slower in building SVM models).

Experiments show that most false negatives are due to the

subject’s pose deformation, occlusions, or the very difﬁcult illu-

mination environments. False positives usually contain gradient

information which looks like human body boundaries.

The advantages of our proposed method over features trained

using SVM are ease of parameter tuning and much faster

detection speed. SVM has more parameters compared to the

boosted cascade, e.g., tradeoff between training error and

margin or parameters of the nonlinear kernel. These parameters

need to be manually optimized for the speciﬁc classiﬁcation

task using cross validation. In the next experiment, we compare

the processing speed in windows per second of the two best

classiﬁers: HOG with quadratic SVM and 20 stages of boosted

covariance features. We apply the two classiﬁers to a sequence

of 10 images with a resolution of 384 × 288 pixels (width ×

height). Table II shows the average detection speed for the two

classiﬁers. As expected, the detection speed of 20 stages of

boosted covariance features is much faster than the detection

speed of the nonlinear SVM classiﬁer.

Fig. 9. Number of weak classiﬁers in different cascade levels on the dataset of

[14]. Note that adding Haar features as a preprocessing step does not vary the

number of covariance features in later stages of cascade much.

TABLE III

AVERAGE EVALUATION TIME IN WINDOWS PER SECOND FOR DIFFERENT

PARAMETERS OF THE TWO-LAYER BOOSTING APPROACHES

C. Experiments on DaimlerChrysler Dataset With

Two-Layer Boosting

1) Experiment Setup: We generate a set of overcomplete

Haar wavelet ﬁlters and subsample the overcomplete set. The

set of Haar features that we use to train the cascade contains

20 547 ﬁlters: 5540 vertical two-rectangle features, 5395 hor-

izontal two-rectangle features, 3592 vertical three-rectangle

features, 3396 horizontal three-rectangle features, and 2624

four-rectangle features. From the preliminary experiments on

signed and unsigned wavelets, we observe that signed

wavelets outperform unsigned wavelets. Hence, we

preserve the sign of intensity gradients in this experiment. For

covariance features, we use a set of rectangular covariance

features generated in the previous section. Fig. 9 gives some

details about our two-layer boosting cascade.

2) Results Based on Multilayer Boosting: Table III shows the

evaluation time in windows per second for different hybrid con-

ﬁgurations. Adding more stages of Haar wavelet features as a

preprocessing step increases the detection speed approximately

exponentially. Fig. 10 shows the performance of our two-layer

boosting. The curve of our method is generated by adding one

cascade level at a time. The boosted covariance features outper-

form all other approaches. The performance of hybrid classi-

fiers is quite poor at high false positive rates due to the Haar-like fea-

tures in the initial stages of the cascade. Nonetheless, the perfor-

mance improves as more covariance features are added

to the later stages of the cascade.
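Why cheap early stages speed up the cascade can be seen from the expected cost per nonpedestrian window. The sketch below is an illustrative model only: the unit costs (1 for a Haar stage, 10 for a covariance stage) are hypothetical, and each stage is assumed to pass exactly 50% of the negative windows that reach it, matching the per-stage false positive target.

```python
def expected_cost(stage_costs, pass_rates):
    """Expected evaluation cost of one negative window in a cascade."""
    cost, reach = 0.0, 1.0
    for stage_cost, pass_rate in zip(stage_costs, pass_rates):
        cost += reach * stage_cost   # stage is paid only if the window reaches it
        reach *= pass_rate           # fraction of windows surviving this stage
    return cost


# Hypothetical unit costs: Haar stage = 1, covariance stage = 10.
cov_only = expected_cost([10] * 20, [0.5] * 20)
hybrid = expected_cost([1] * 5 + [10] * 15, [0.5] * 20)
```

In this toy model the covariance-only cascade costs about 20 units per negative window and the hybrid with five Haar stages about 2.56 units, because most windows are rejected before any covariance stage is evaluated.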

D. Experiments on INRIA Human Dataset With AdaBoost

The dataset consists of one training set and one test set. The

training set contains 1208 pedestrian samples (2416 mirrored

Fig. 10. Performance comparison of the two-layer boosting approach and a

cascade of the boosted covariance features on the dataset of [14]. The two-layer

boosting approach performs comparably to the cascade of boosted covariance

features at low false positive rate ( 01), which is the range of interest.

samples) and 1200 nonpedestrian images. The pedestrian sam-

ples were obtained by manually labeling images taken with

a digital camera at various times of the day and in various loca-

tions. The pedestrian samples are mostly in a standing position. A

border of 8 pixels is added to the sample in order to preserve

contour information. All samples are scaled to size 64 × 128

pixels. The test set contains 1176 pedestrian samples (mirrored)

extracted from 288 images.

We evaluate the performance of our classiﬁers on the given

test set using a classification approach and a detection approach.

For human classiﬁcation, we used cropped human samples

taken from the test images. During classiﬁcation, the number of

the positively classiﬁed windows is used to determine if the test

sample is human or nonhuman. For human detection, a ﬁxed

size window is used to scan the test images with a scale factor of

0.95 and a step size of 4 pixels. As in [1], mean shift clustering

[26] is used to cluster multiple overlapping detection windows.

Simple rules as in [2] are also applied on the clustering results

to merge those close detection windows.
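The scanning step can be sketched as a generator of window positions. This is a minimal sketch under stated assumptions: the fixed 64 × 128 window is applied to progressively downscaled copies of the image (each level scaled by 0.95 relative to the previous one), and windows are placed every 4 pixels. Whether the image or the window is rescaled is not stated in the text, and clustering of the resulting detections is not shown.

```python
def sliding_windows(img_w, img_h, win_w=64, win_h=128, scale=0.95, step=4):
    """Yield (x, y, s): top-left corner of a window in the image scaled by s."""
    s = 1.0
    while True:
        w, h = int(img_w * s), int(img_h * s)
        if w < win_w or h < win_h:
            break  # the scaled image no longer contains a full window
        for y in range(0, h - win_h + 1, step):
            for x in range(0, w - win_w + 1, step):
                yield x, y, s
        s *= scale


# Number of windows scanned in a 240 x 320 image under these assumptions.
n_windows = sum(1 for _ in sliding_windows(240, 320))
```

Each yielded position would be cropped from the scaled image and passed to the cascade; positive windows are then mapped back to the original image by dividing their coordinates by `s`.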

A criterion similar to the one used in the PASCAL VOC Chal-

lenge [27] is adopted here. Detections are considered true or

false positives based on the area of overlap with ground truth

bounding boxes. To be considered a correct detection, the area

of overlap between the predicted bounding box and ground

truth bounding box must exceed 40% by

Multiple detections of the same object in an image are con-

sidered false detections. For quantitative analysis, we plot miss

rate versus false positives per window tested (false positive rate)

curves on a log–log scale. The experiments are conducted using

a standard desktop with 2.8-GHz Intel Pentium-D CPU and

2-GB RAM.
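The overlap test described above can be sketched as follows. The sketch assumes the overlap measure is the PASCAL VOC intersection-over-union ratio; this is an assumption, since the text only says that a similar criterion is used. Boxes are given as hypothetical (x1, y1, x2, y2) tuples with x2 > x1 and y2 > y1.

```python
def overlap_ratio(a, b):
    """Intersection area divided by union area of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def is_correct_detection(pred, truth, threshold=0.4):
    """A detection counts as correct when the overlap exceeds the threshold."""
    return overlap_ratio(pred, truth) > threshold


truth = (0, 0, 10, 10)
good = is_correct_detection((2, 0, 12, 10), truth)   # overlap 80 / 120
bad = is_correct_detection((5, 0, 15, 10), truth)    # overlap 50 / 150
```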

1) Experiment Setup: Similar to the previous experiments,

we generate a set of overcomplete rectangular covariance ﬁlters

and subsample the overcomplete set in order to keep a man-

ageable set for the training phase. The set contains approxi-

mately 15 225 covariance ﬁlters. In each stage, weak classiﬁers

are added until the predeﬁned objective is met. In this experi-

ment, we set the minimum detection rate to be 99.5% and the

Fig. 11. Performance comparison of our cascade of boosted covariance fea-

tures with HOG with linear SVM [11] and covariance features on Riemannian

manifold [1]. The curve of covariance on Riemannian manifold is reproduced

from [1].

maximum false positive rate to be 50% in each stage. Each stage

is trained with 2416 human samples and 5000 nonhuman sam-

ples. The negative samples used in each stage of the cascade are

collected from false positives of the previous stages of the cas-

cade. The ﬁnal cascade consists of 29 stages.

2) Results Based on Boosted Covariance Features: Fig. 11

shows a comparison of our experimental results with different

methods. The curve of our method is generated by adding

one cascade level at a time. From the ﬁgure, it can be seen

that our system’s performance is much better than HOG with

linear SVM [11] while achieving a comparable detection

rate to the technique described in [1]. The method in [1] calculates the distance

between covariance matrices on the Riemannian manifold.

An eigen-decomposition is required which slows down the

computation speed [1]. In contrast, our approach avoids the

eigen-decomposition and therefore it is much faster. It is also

easier to implement. The ﬁgure also shows the performance of

our system on the human detection problem. In order to achieve

the results at low false positive rate i.e., , we man-

ually adjust the minimum neighbor threshold (a number of

merged detections). As Fig. 11 shows, our covariance technique

with the detection approach outperforms the same technique with the

classification approach. This is due to the clustering

and merging techniques we used. By clustering and merging

multiple overlapping detection windows, we are able to further

reduce the number of false detections. As a result, the curve

is slightly shifted to the left. As for the processing time, on

average our unoptimized implementation in C++ can search

about 12 000 detection windows per second. Due to the cascade

structure, the search is faster when humans appear against plain

backgrounds and slower when they appear against more complex

backgrounds. Table IV shows the average detection speed for

three different classiﬁers. Compared with [11] and [1], our

search time is faster than both techniques (2.2 times faster than

[11] and 4 times faster than [1]). Note that the system in [1] is

implemented in C++ on a Pentium-D 2.8-GHz processor with

2-GB RAM, which is the same as ours.³

In the next experiment, we show how adding a cascade

of Haar wavelet features as a preprocessing step to a cascade of

³ Personal communication with the author of [1].

TABLE IV

AVERAGE TIME REQUIRED TO EVALUATE A 240 × 320 IMAGE (12 800

WINDOWS PER IMAGE) FOR DIFFERENT DETECTORS

Fig. 12. Number of weak classiﬁers in different cascade levels on the INRIA

dataset [11].

TABLE V

AVERAGE EVALUATION TIME IN WINDOWS PER SECOND FOR DIFFERENT

PARAMETERS OF THE TWO-LAYER BOOSTING APPROACHES

boosted covariance features could help improve the detection

speed while maintaining a high detection rate.

E. Experiments on the INRIA Human Dataset

With Two-Layer Boosting

1) Experiment Setup: Similar to the experiments on the

dataset of [14], we subsample the overcomplete set of Haar

features to 54 779 ﬁlters: 11 446 vertical two-rectangle fea-

tures, 14 094 horizontal two-rectangle features, 8088 vertical

three-rectangle features, 10 400 horizontal three-rectangle fea-

tures, and 10 751 four-rectangle features. Unlike the previous

experiment, unsigned wavelets seem to

outperform signed wavelets here. We think that,

when the human resolution is large, clothing and background

details can be easily observed and intensity gradient sign

becomes irrelevant. In other words, the wide range of clothing

and background colors makes the gradient sign uninformative,

e.g., a person with a black shirt in front of a white background

should have the same information as a person with a white shirt

in front of a black background. Hence, we used the absolute

values of the wavelet responses in this experiment. For covari-

ance features, we use a set of rectangular covariance features

generated in the previous section. Fig. 12 gives some details

about our two-layer boosting cascade.

Fig. 13. Performance comparison between different conﬁgurations of the two-

layer boosting approach based on classiﬁcation (left) and detection (right) on

INRIA dataset. Overlapping amongst the ROC curves of different conﬁgurations

of two-layer boosting techniques indicates the performance similarity.

Fig. 14. Performance comparison between the two-layer boosting approach

(Haar features plus covariance features) and HOG features on INRIA dataset

with resolution of 18 × 36.

2) Results Based on Multilayer Boosting: The evaluation

time in windows per second for different hybrid conﬁgurations

is shown in Table V. Similar to previous results, adding Haar

wavelet features as a preprocessing step increases the detection

speed signiﬁcantly. Compared with the original covariance de-

tector in [1], the two-layer boosting approach is ten times faster.

Fig. 13 shows the performance of two-layer boosting ap-

proach using the classiﬁcation and detection approaches. For

the classiﬁcation approach, the overall performance of different

hybrid conﬁgurations is very similar to the performance of

a cascade of boosted covariance features. A hybrid classiﬁer

with 15 levels of Haar features and 12 levels of covariance

features might seem to perform poorly at high false positive

rate. However, at a low false positive rate, i.e., 2 , its

performance is very similar to performance of a cascade of

boosted covariance features. For the detection approach, the

two-layer boosting approach performs slightly inferior to the

cascade of boosted covariance features. This is not surprising

since the INRIA human dataset contains humans in various poses,

which Haar features are less capable of capturing. Nonetheless,

applying boosted covariance features in the second cascade

greatly improves the overall accuracy of a boosted cascade of

Haar features.

We have also compared the two-layer boosting approach and

HOG features on the INRIA dataset [11] with a resolution of

18 × 36. Note that the experimental setup used in this experiment

Fig. 15. Detection rate and speed tradeoff for different conﬁgurations of two-layer boosting. The ﬁrst two ﬁgures show detection rate versus false positive rate on

the dataset [14] and the INRIA dataset [11]. The last two ﬁgures show computation time versus false positive rate on the dataset [14] and the INRIA dataset [11].

Clearly, covariance features have the highest detection rate across all false positive rates while Haar features have the lowest detection rate. On the other hand,

Haar features are the fastest to compute while covariance features are the slowest.

is similar to the one used in the previous experiments (Sections VI-B

and C). Fig. 14 shows the experimental results of different ap-

proaches. The results look slightly different from experimental

results in Section VI-C due to the different datasets used. How-

ever, the overall results seem to be consistent with results shown

in Figs. 8 and 10.

F. Detection Performance and Speed Tradeoff for the

Two-Layer Boosting

From the previous experiments, the results show that the

speed of Haar features classiﬁer is much faster than the speed

of the covariance features classiﬁer. Therefore, it is best to

place as many stages of Haar features as possible in the first layer of the

classiﬁer. However, having too many stages of Haar features

will degrade the overall performance. In this section, we try to

ﬁnd the best combination that will give the best overall results.

To study the tradeoff between the detection performance and

speed of our classiﬁers, we perform a test on different false posi-

tive rates. For example, to achieve a 5 false positive rate

for a boosted covariance classiﬁer on INRIA dataset, we only

use the ﬁrst 19 stages of covariance features (instead of the full

29 stages). We then calculate the average computation time by

evaluating the 19-stage classifier on a test sequence of images.

Fig. 15 shows the detection rate and computation time for dif-

ferent conﬁgurations of multiple-layer boosting on the dataset

of [14] and INRIA dataset [11]. From the ﬁgure, it can be con-

cluded that there is a tradeoff between the detection performance

and speed. In order to achieve a high detection rate, only a small

number of Haar stages should be placed in the ﬁrst layer of the

classifier. For a small-resolution dataset (18 × 36 pixels), a con-

figuration of Haar (5 stages) + covariance (15 stages) seems to per-

form best at a reasonable computation time. For a larger resolu-

tion dataset (64 × 128 pixels), a configuration of Haar (7 stages) +

covariance (22 stages) seems to perform best.

VII. CONCLUSION

This paper has presented a fast and robust pedestrian detec-

tion technique. We use weighted Fisher linear discriminant anal-

ysis as the weak classiﬁer for AdaBoost training. In order to

speed up the computation time, a cascaded classiﬁer architec-

ture is adopted [2].

In the experimental results on the dataset used in [14], our

system has been shown to give high detection performance at a low

false positive rate. Compared with techniques using a linear

SVM classiﬁer, the proposed system outperforms all the sys-

tems evaluated. When compared with nonlinear SVM systems,

the system is shown to perform very similarly to the covariance

features with Gaussian SVM and slightly inferior compared to

HOG with quadratic SVM. However, the computation time of

HOG with quadratic SVM is much higher than our proposed

technique.

The performance of the proposed approach is also evaluated

on the INRIA pedestrian dataset [11]. On this dataset, previous

methods reported have signiﬁcantly higher miss rates at almost

all the false positive rates per window. Our algorithm's perfor-

mance is comparable to the state of the art [1] while being almost

four times faster for detection due to its new design.

To further accelerate the detection, we have also introduced

a faster strategy—two-layer boosting with heterogeneous fea-

tures—to exploit the efﬁciency of the Haar feature and the dis-

criminative power of the covariance feature. This way our de-

tector runs ten times faster than the original covariance feature

detector [1].

Ongoing work includes the search for new features for human

detection. How to optimally design a cascaded classiﬁer may

also be a future topic.

REFERENCES

[1] O. Tuzel, F. Porikli, and P. Meer, “Human detection via classification on Riemannian manifolds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Minneapolis, MN, 2007, pp. 1–8.

[2] P. Viola and M. J. Jones, “Robust real-time face detection,” Int. J. Comput. Vis., vol. 57, no. 2, pp. 137–154, 2004.

[3] C. Papageorgiou and T. Poggio, “A trainable system for object detection,” Int. J. Comput. Vis., vol. 38, no. 1, pp. 15–33, 2000.

[4] P. Viola, M. J. Jones, and D. Snow, “Detecting pedestrians using patterns of motion and appearance,” in Proc. IEEE Int. Conf. Comput. Vis., 2003, pp. 734–741.

[5] D. M. Gavrila and S. Munder, “Multi-cue pedestrian detection and tracking from a moving vehicle,” Int. J. Comput. Vis., vol. 73, no. 1, pp. 41–59, 2007.

[6] B. Leibe, E. Seemann, and B. Schiele, “Pedestrian detection in crowded scenes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., San Diego, CA, 2005, vol. 1, pp. 878–885.

[7] B. Wu and R. Nevatia, “Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors,” in Proc. IEEE Int. Conf. Comput. Vis., Beijing, China, 2005, vol. 1, pp. 90–97.

[8] V. Sharma and J. Davis, “Integrating appearance and motion cues for simultaneous detection and segmentation of pedestrians,” in Proc. IEEE Int. Conf. Comput. Vis., Rio de Janeiro, Brazil, 2007, pp. 1–8.

[9] Y. Amit, D. Geman, and X. Fan, “A coarse-to-fine strategy for multiclass shape detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 12, pp. 1606–1621, Dec. 2004.

[10] G. Mori, X. Ren, A. Efros, and J. Malik, “Recovering human body configurations: combining segmentation and recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Washington, DC, 2004, vol. 2, pp. 326–333.

[11] N. Dalal and B. Triggs, “Histograms of oriented gradients for human

detection,”in Proc. IEEE Conf. Comput. Vis. Patt. Recogn., San Diego,

CA, 2005, vol. 1, pp. 886–893.

[12] C. Wöhler and J. Anlauf, “An adaptable time-delay neural-network al-

gorithm for image sequence analysis,”IEEE Trans. Neural Netw., vol.

10, no. 6, pp. 1531–1536, Dec. 1999.

[13] K. Mikolajczyk, C. Schmid, and A. Zisserman, “Human detection

based on a probabilistic assembly of robust part detectors,”in Proc.

Eur. Conf. Comput. Vis., Prague, Czech Republic, May 2004, vol. 1,

pp. 69–81.

[14] S. Munder and D. M. Gavrila, “An experimental study on pedestrian classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1863–1868, Nov. 2006.

[15] J. Shawe-Taylor and N. Cristianini, Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, U.K.: Cambridge Univ. Press, 2000.

[16] K. Levi and Y. Weiss, “Learning object detection from a small number of examples: The importance of good features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Washington, DC, 2004, vol. 2, pp. 53–60.

[17] J. Meynet, V. Popovici, and J.-P. Thiran, “Face detection with boosted Gaussian features,” Pattern Recognit., vol. 40, no. 8, pp. 2283–2291, 2007.

[18] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.

[19] O. Tuzel, F. Porikli, and P. Meer, “Region covariance: A fast descriptor for detection and classification,” in Proc. Eur. Conf. Comput. Vis., Graz, Austria, May 2006, vol. 2, pp. 589–600.

[20] S. Jin, D. S. Yeung, and X. Wang, “Network intrusion detection in covariance feature space,” Pattern Recognit., vol. 40, pp. 2185–2197, 2007.

[21] R. E. Schapire, “Theoretical views of boosting and applications,” in Proc. Int. Conf. Algorithmic Learn. Theory, London, U.K., 1999, pp. 13–25.

[22] S. Z. Li and Z. Zhang, “Floatboost learning and statistical face detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp. 1112–1123, Sep. 2004.

[23] M. T. Pham and T. J. Cham, “Fast training and selection of Haar features using statistics in boosting-based face detection,” in Proc. IEEE Int. Conf. Comput. Vis., Rio de Janeiro, Brazil, 2007, pp. 1–7.

[24] J. Wu, M. D. Mullin, and J. M. Rehg, “Linear asymmetric classifier for cascade detectors,” in Proc. Int. Conf. Mach. Learn., Bonn, Germany, 2005, pp. 988–995.

[25] Q. Zhu, S. Avidan, M. Yeh, and K.-T. Cheng, “Fast human detection using a cascade of histograms of oriented gradients,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., New York, 2006, vol. 2, pp. 1491–1498.

[26] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 603–619, May 2002.

[27] The PASCAL Visual Object Classes Challenge VOC (2007). [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2007/index.html

Sakrapee Paisitkriangkrai received the B.E. degree in computer engineering and the M.E. degree in biomedical engineering from the University of New South Wales, Sydney, Australia, where he is currently working toward the Ph.D. degree.

His research interests include pattern recognition, image processing, and machine learning.

Chunhua Shen received the Ph.D. degree from the University of Adelaide, Australia, in 2005.

He is currently a Researcher with the Computer Vision Program, NICTA, Canberra, Australia. He is also an Adjunct Research Fellow with the Australian National University and an Adjunct Lecturer with the University of Adelaide. His main research interests include statistical pattern analysis and its application in computer vision.

Jian Zhang (M’98–SM’04) received the Ph.D. degree in electrical engineering from the University College, University of New South Wales, Australian Defence Force Academy, Australia, in 1997.

He is a Principal Researcher with NICTA, Sydney, Australia. He is also a Conjoint Associate Professor with University of New South Wales, Sydney, Australia. He is currently an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and the EURASIP Journal on Image and Video Processing.


