A Novel Shape Feature for Fast Region-based Pedestrian Recognition∗
Ali Shahrokni†Darren Gawley‡James Ferryman†
†Computational Vision Group, University of Reading, UK
‡School of Computer Science, The University of Adelaide, Australia
Abstract
A new class of shape features for region classifica-
tion and high-level recognition is introduced. The novel
Randomised Region Ray (RRR) features can be used to
train binary decision trees for object category classi-
fication using an abstract representation of the scene.
In particular we address the problem of human detec-
tion using an oversegmented input image. We therefore
do not rely on pixel values for training; instead, we design and train specialised classifiers on the sparse set
of semantic regions which compose the image. Thanks
to the abstract nature of the input, the trained classi-
fier has the potential to be fast and applicable to ex-
treme imagery conditions. We demonstrate and evalu-
ate its performance in people detection using a pedes-
trian dataset.
1. Introduction and Related Work
This paper introduces a new class of shape features
for region classification and high-level recognition. In
particular we address the problem of human detection
using an abstract representation of the scene. Seg-
mented images provide semantically meaningful com-
ponents that form the basis of recognition for objects of
interest in the scene. Our proposed method is based on
the observation that humans can recognise and discern
objects from their crude silhouette in poor visibility.
Traditionally, recognition based on region detection
has been hampered by the sensitivity of feature ex-
traction to segmentation error. However, recent ad-
vances in reliable image subregion extraction [2] have
inspired region-based recognition methods. Notably,
Gu et al. [5] recently introduced a unified framework
for detection, segmentation and classification based on
detected regions.
∗This work was supported in part by BAE Systems (Operations) Limited.
Despite the promising results of the above methods, which exploit spatial semantics, this area remains largely unexplored and an active research domain. To that end, we explore classification trees [6], which are established as fast and reliable appearance
descriptors for object classification [4]. We investigate
their application to shape-based object recognition for
the specific task of object detection. This leads to the introduction of novel binary features, which we refer to as Randomised Region Ray (RRR)1 features and which can be used efficiently for region classification. This methodology is novel in its approach to recognition through specific shape characteristics.
Instead of processing individual pixels or patches
around geometric features for detection, we base our ap-
proach on the concept of semantically meaningful components of the image, such as superpixels [3]. The main
contribution of this paper is to design dedicated fea-
tures for recognition based on a crude representation
of the scene. We therefore do not rely on pixel values for training classifiers; instead, we design and train specialised classifiers on the set of superpixels, which is far sparser than the set of pixels in the image.
The advantage of such a system is twofold. First, it enables classification and object detection (such as the human body) based on a crude representation of the scene.
This is of essential importance in low visibility situa-
tions and when the camera is moving in a rapid and un-
predictable manner where traditional background mod-
els would fail. The second advantage of this approach
is that it would naturally lead to a dramatic reduction
of the computational costs of the algorithm due to the
smaller amount of input data to process.
In the remainder of this paper we introduce our novel
shape features based on segmented input images. We
then discuss training and inference using these fea-
tures and present experimental results and evaluations
by comparing our approach with a state-of-the-art peo-
ple detector.
1 Pronounced "Arrr".
2010 International Conference on Pattern Recognition
1051-4651/10 $26.00 © 2010 IEEE
DOI 10.1109/ICPR.2010.117
448
Figure 1: (a) RRR feature defined on superpixels using
two angles and rays. (b) Training database is used to
build classification trees for foreground objects.
2. Recognition Framework
Randomised trees [6] and ferns [7] have been suc-
cessfully used in detection and tracking of patches and
have been applied to real-time tracking and recognition
of textured objects with large motion and appearance
changes. Inspired by the basic idea of randomised clas-
sifiers employed in such methods, we propose a new
classifier based on shape features of uniform regions in
the scene defined by superpixels [3]. These superpixels
form the input to our classification system. We there-
fore do not rely on pixel values in the classification and
recognition stage. We then combine the classification
results in a joint probability framework to infer the pos-
sible locations of objects of interest in the scene. While
our approach is generic and can be applied to any object
category, for the purpose of this work we focus on hu-
man body detection. The details of our proposed model
are explained in the next section.
2.1. RRR Shape Features
We introduce Randomised Region Ray (RRR) fea-
tures which are in the form of binary questions that can
be used to collectively describe the geometrical form of
superpixels. Each RRR feature casts two random rays
from the centre of the superpixel to form a binary de-
cision that can be used to train decision trees on the
shape characteristics of the superpixels. One such fea-
ture is illustrated in Fig. 1: angles α and β are randomly determined from the axis of reference (the vertical line). Binary trees are then trained to classify superpixels into different classes based on a comparison of the lengths of the two rays at angles α and β. Here we denote a ray defined on superpixel Si with angle θ by r(Si, θ). Each RRR feature at superpixel Si can be expressed as:
RRR(Si, α, β) = 1 if r(Si, α) > r(Si, β), and 0 otherwise.  (1)
The RRR features defined above are very simple and fast to compute on superpixels. Furthermore, their definition based on the relative lengths of the rays makes them scale-invariant. We use Bresenham's algorithm to compute the rays, r, efficiently. In the next section we
show how these primitive features can be used to clas-
sify superpixels and enable object recognition.
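The feature evaluation can be sketched as follows (an illustrative Python implementation; for simplicity a fixed-step ray tracer stands in for Bresenham's algorithm, and the mask representation, centroid arguments and function names are our assumptions, not the paper's):

```python
import math

def ray_length(mask, cx, cy, theta):
    """Length (in steps) of a ray cast from the centroid (cx, cy) at
    angle theta, measured from the vertical axis, before it leaves the
    superpixel. `mask` is a 2-D list of 0/1; mask[row][col] == 1 inside."""
    dx, dy = math.sin(theta), -math.cos(theta)  # theta = 0 points up
    x, y, length = float(cx), float(cy), 0
    while True:
        x, y = x + dx, y + dy
        col, row = int(round(x)), int(round(y))
        if not (0 <= row < len(mask) and 0 <= col < len(mask[0])) or not mask[row][col]:
            return length
        length += 1

def rrr_feature(mask, cx, cy, alpha, beta):
    """Binary RRR feature of Eq. (1): 1 iff the ray at angle alpha is
    longer than the ray at angle beta."""
    return int(ray_length(mask, cx, cy, alpha) > ray_length(mask, cx, cy, beta))
```

On a tall rectangular superpixel, for example, an upward ray is longer than a sideways one, so the feature responds to the region's elongation regardless of its absolute size.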
2.2. Parts-based Human Body Recognition
We train randomised trees on different classes of
superpixels which represent essential parts of the ob-
ject class in question, i.e. human body. We subdivide
the body into three parts (upper body, middle body and
lower body) and train classifiers to distinguish between
different body parts as well as background superpixels.
The number of parts is a parameter and can vary depending on system requirements. For instance, if the body is small in the image and is typically covered by only a few superpixels, then the body can be modelled by one or two parts only. In general, if there are n parts in the object class model, the classifier will be trained on n + 1 classes (including background).
For the training and evaluation of the developed al-
gorithm, we use the Penn-Fudan Pedestrian database
which is a publicly available pedestrian image dataset2
with ground truth masks. This dataset consists of 170
images with 345 labeled pedestrians. The extracted su-
perpixels in the labelled training database provide dif-
ferent instances of each of these main parts. The la-
belled ground truth parts are further used to obtain a
geometrical distribution model for the body parts with
respect to each other. This model will encode the spa-
tial relationship between object parts. Specifically in
the case of the 3-part human body, it models the distri-
bution of the upper, middle and lower body parts as 2-D
Gaussian distributions with respects to the middle body
mean centroid. This model will be used for the global
inference of the body part positions given the individual
responses of super pixels.
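A minimal sketch of such a spatial model, assuming axis-aligned 2-D Gaussians fitted to part-centroid offsets relative to the middle-body centroid (the function names and the diagonal-covariance simplification are ours, not taken from the paper):

```python
import math

def fit_offset_gaussian(offsets):
    """Fit an axis-aligned 2-D Gaussian (per-axis mean and variance) to a
    list of (dx, dy) part-centroid offsets measured relative to the
    middle-body centroid."""
    n = len(offsets)
    mx = sum(dx for dx, _ in offsets) / n
    my = sum(dy for _, dy in offsets) / n
    vx = sum((dx - mx) ** 2 for dx, _ in offsets) / n
    vy = sum((dy - my) ** 2 for _, dy in offsets) / n
    return (mx, my, vx, vy)

def offset_log_likelihood(model, dx, dy):
    """Log-density of an observed offset under the fitted Gaussian."""
    mx, my, vx, vy = model
    return (-0.5 * ((dx - mx) ** 2 / vx + (dy - my) ** 2 / vy)
            - 0.5 * math.log(4 * math.pi ** 2 * vx * vy))
```

An offset close to the learnt mean (e.g. a head centroid directly above the torso) then scores a higher log-likelihood than an implausible one, which is what the global inference exploits.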
2.3. Training
We start by extracting a set of superpixels, S = {Si | i = 1, ..., N}, from the input image. The next step is to train a set of randomised trees, F = {Fk | k = 1, ..., K}, to classify superpixels into known object parts or background. Similar to [7], Fk = {fσ(k, 1), ..., fσ(k, D)} represents the kth tree in the "forest" F, and fσ(k, j) is a set of random RRR features, σ, at depth j of the kth tree.
2http://www.cis.upenn.edu/∼jshi/ped html/
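The training step can be sketched as a fern-style randomised classifier over RRR responses (an illustrative simplification: one random (α, β) pair per level, with the binary responses indexing a leaf histogram; the actual trees may branch on different features per node, and all names here are ours):

```python
import math
import random
from collections import Counter

def train_fern(samples, depth, eval_feature, seed=0):
    """Train one fern-style randomised classifier. Draw `depth` random
    (alpha, beta) angle pairs, index each training superpixel by the
    bit-string of its binary RRR responses, and accumulate a class
    histogram at each leaf. `samples` is a list of (superpixel, label)
    pairs; `eval_feature(sp, alpha, beta)` returns the 0/1 RRR response."""
    rng = random.Random(seed)
    features = [(rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi))
                for _ in range(depth)]
    leaves = [Counter() for _ in range(2 ** depth)]
    for sp, label in samples:
        idx = 0
        for a, b in features:
            idx = (idx << 1) | eval_feature(sp, a, b)
        leaves[idx][label] += 1
    return features, leaves

def classify(fern, sp, num_classes, eval_feature):
    """Laplace-smoothed class distribution for one superpixel."""
    features, leaves = fern
    idx = 0
    for a, b in features:
        idx = (idx << 1) | eval_feature(sp, a, b)
    hist = leaves[idx]
    total = sum(hist.values()) + num_classes
    return [(hist[c] + 1) / total for c in range(num_classes)]
```

Averaging the distributions from several such ferns or trees gives the per-superpixel part likelihoods used in the next stage.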
Figure 2: (a) The overall response of the part-based classifier and the detection result. A higher value of the red colour component corresponds to a higher probability of human body presence. (b) The detection results for the 3 body parts.
Examples of the forest classification for individual
superpixels are shown in Fig. 4-c. Likelihoods of su-
perpixels belonging to body parts are computed using
the trained forest of classification trees on superpixels
shown in Fig. 4-b. While the colours in Fig. 4-b are
randomly selected, the colours in Fig. 4-c encode the
probability of each superpixel belonging to upper body
(red), middle body (green) and lower body (blue) and
the most likely body part label is shown in Fig. 4-d. It can be noted that areas with human presence are highlighted by higher likelihood values, which are independently computed for each superpixel using the trained decision forest. This implies that the RRR features
are capable of learning the distinctive characteristics of
each body part. In the next section we show how these
individual responses for each superpixel can be aggre-
gated for higher level inference.
2.4. Foreground Inference
Target inference is done by applying the Generalised Distance Transform [2] to the classifier outputs for individual superpixels. For each detected superpixel Si we obtain a class distribution over the body parts and the background. Using the learnt Gaussian geometric distribution model of body parts from Section 2.2, the Generalised Distance Transform can then be used to efficiently compute the maximum a posteriori (MAP) probability of the object parts, P(Ol | S1, ..., SN) ∝ P(S1, ..., SN | Ol)P(Ol), where Ol is part l of the object category, i.e. the human body.
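The aggregation step can be illustrated with a brute-force 1-D generalised distance transform (a sketch of the quantity being computed, not the efficient linear-time 2-D algorithm of [2]; the function name and quadratic deformation cost are our assumptions):

```python
def gdt_1d(cost, w=1.0):
    """Brute-force 1-D generalised distance transform:
    out[p] = min_q ( cost[q] + w * (p - q)**2 ),
    i.e. each position inherits the cheapest nearby part evidence,
    penalised quadratically by the displacement from its expected
    location. Felzenszwalb & Huttenlocher's lower-envelope algorithm
    computes the same quantity in linear time."""
    n = len(cost)
    return [min(cost[q] + w * (p - q) ** 2 for q in range(n))
            for p in range(n)]
```

In the 2-D case, one such transform per body part (applied to the negative log-likelihoods, shifted by the learnt mean offsets) is summed to score candidate body positions.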
3. Experimental Results
We have tested the proposed algorithm and the de-
veloped RRR features using a leave-one-out test on
the Penn-Fudan Pedestrian Database. The classification
forest is composed of 40 trees with 9 levels of depth.
These trees are trained on K − 1 images and tested on
(Figure 3 plot: Pd vs. Pfa; curves for HOG, Our Method: Test, and Our Method: Training.)
Figure 3: ROC curves for training and test results of
detection using our method. Also shown is the ROC
curve obtained using HOG-SVM people detector.
one image at a time, where K is the number of images in the dataset (i.e. K = 170). This enables us to independently evaluate the algorithm 170 times. Fig. 2-
a shows the overall response of the part-based model
computed by Generalised Distance Transform on indi-
vidual superpixel classifications by the trained forest.
A higher value of the red colour component corresponds to a higher probability of human body presence. Fig. 2-b shows the detection result for the 3 body parts of the highest-ranked detection (upper body, middle body and lower body centroids are marked). We can use non-
maximum suppression or similar algorithms to detect
multiple people in the image as shown in Fig. 5.
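A greedy non-maximum suppression over detection centroids might look as follows (an illustrative sketch; the (score, x, y) representation and the distance threshold are our assumptions, not the paper's):

```python
def nms(detections, min_dist):
    """Greedy non-maximum suppression on (score, x, y) detections:
    repeatedly keep the highest-scoring remaining detection and drop
    any detection whose centre lies within `min_dist` of one already
    kept."""
    kept = []
    for score, x, y in sorted(detections, reverse=True):
        if all((x - kx) ** 2 + (y - ky) ** 2 >= min_dist ** 2
               for _, kx, ky in kept):
            kept.append((score, x, y))
    return kept
```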
The ROC curve of performance on the Penn-Fudan dataset was also computed, plotting the fraction of true positives vs. the fraction of false positives for the training dataset as well as for the leave-one-out test experiment. Both experiments comprise 170 results, and the classification score is used to draw the curve. For the purpose of comparison, we have also computed the performance of Histograms of Oriented Gradients (HOG) descriptors with a linear SVM classifier [1] on the same dataset. To that end we used the OpenCV implementation of HOG-SVM, setting the hit threshold to 0.5 and the group threshold to 0 to improve performance without grouping the detections. The results, shown in Fig. 3, indicate comparable performance and improvements using our RRR features and superpixels, without relying on pixel-level data for inference.
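The ROC points can be computed from the classification scores by sweeping a threshold, e.g. as below (an illustrative sketch, where Pd is the true-positive rate and Pfa the false-alarm rate; the function name and input format are our assumptions):

```python
def roc_curve(scores, labels):
    """Compute ROC points (Pfa, Pd) by sweeping a decision threshold
    over the classification scores; `labels` are 1 for true detections
    and 0 for false alarms."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:  # lower the threshold one score at a time
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points
```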
4. Conclusion and Future Work
This work is motivated by the importance and challenges of developing classification features suitable for loosely segmented input views. We introduce a new approach for object classification
Figure 4: Illustration of the likelihoods of superpixels. (a) Original image. (b) Superpixel image with random colour
labels used as input to the trained RRR classification forest. (c) Colour coded probability of foreground parts. Brighter
colour indicates higher probability of body parts. The superpixels composing the people in the image are clearly
highlighted. (d) Same as (c), but only the dominant colour channel for the foreground is shown (red corresponds to upper body, green to middle body and blue to lower body). Best viewed in colour.
Figure 5: Examples of detection results. Bounding boxes and the points corresponding to body parts are shown.
Colours are based on the area of overlap with the ground truth.
and detection which works on an abstract representation of the input image and uses novel Randomised Region Ray features and binary decision trees for object classification. The RRR features are very easy to compute
and are scale invariant. The input image can be a su-
perpixel segmentation or any crude representation of
the scene. These representations can either come di-
rectly from the sensing device through built-in proces-
sor/filters or can be efficiently computed prior to clas-
sification. The trained classifier has the potential to be
fast and applicable to extreme videography conditions where the camera is mounted on a mobile platform such as a UAV, or where visibility is poor. As a result, the computa-
tional costs of the RRR-based superpixel classification
are substantially lower due to the simplicity of the RRR
features themselves as well as the sparsity of the super-
pixels compared to pixel-level cues and classification
algorithms. The typical non-optimised processing time
of a superpixel image by the RRR classification forest
is around 500ms.
The results obtained on the Penn-Fudan Pedestrian
database suggest that the approach is promising and is
capable of detecting humans using only a sparse set of
superpixels as input. Furthermore, we can see that the
RRR-based classification has comparable performance
to existing algorithms that use pixel-level information
for classification.
As the results indicate, the classifier performs better on the lower and middle body parts. This might be due to the fact that the upper body (head area) is smaller in size relative to the other parts. Another issue is that some superpixels bleed into other parts of the image, which can have a negative impact on the learning process. Possible solutions might involve modifying the underlying superpixel-computation technique to obtain more well-defined superpixel inputs. These issues will be addressed in future work.
References
[1] N. Dalal and B. Triggs. Histograms of Oriented Gradients
for Human Detection. In Conference on Computer Vision
and Pattern Recognition, 2005.
[2] P. Felzenszwalb and D. Huttenlocher. Pictorial Structures for Object Recognition. International Journal of Computer Vision, 61(1), 2005.
[3] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient
graph-based image segmentation. Int. J. Comput. Vision,
59(2):167–181, 2004.
[4] J. Gall and V. Lempitsky. Class-specific hough forests
for object detection. In Conference on Computer Vision
and Pattern Recognition, pages 1–8, Miami, USA, 2009.
IEEE Computer Society.
[5] C. Gu, J. Lim, P. Arbelaez, and J. Malik. Recognition
using regions. In Conference on Computer Vision and
Pattern Recognition, Miami, USA, 2009. IEEE Computer
Society.
[6] V. Lepetit and P. Fua. Keypoint recognition using ran-
domized trees. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 2006.
[7] M. Ozuysal, P. Fua, and V. Lepetit. Fast keypoint recog-
nition in ten lines of code. In Conference on Computer
Vision and Pattern Recognition, pages 1–8, 2007.