A Novel Shape Feature for Fast Region-based Pedestrian Recognition
Ali Shahrokni, Darren Gawley, James Ferryman
Computational Vision Group, University of Reading, UK
School of Computer Science, The University of Adelaide, Australia
Abstract

A new class of shape features for region classification and high-level recognition is introduced. The novel Randomised Region Ray (RRR) features can be used to train binary decision trees for object category classification using an abstract representation of the scene. In particular we address the problem of human detection using an oversegmented input image. We therefore do not rely on pixel values for training; instead we design and train specialised classifiers on the sparse set of semantic regions which compose the image. Thanks to the abstract nature of the input, the trained classifier has the potential to be fast and applicable to extreme imagery conditions. We demonstrate and evaluate its performance in people detection using a pedestrian dataset.
1. Introduction and Related Work
This paper introduces a new class of shape features for region classification and high-level recognition. In particular we address the problem of human detection using an abstract representation of the scene. Segmented images provide semantically meaningful components that form the basis of recognition for objects of interest in the scene. Our proposed method is based on the observation that humans can recognise and discern objects from their crude silhouettes in poor visibility.

Traditionally, recognition based on region detection has been hampered by the sensitivity of feature extraction to segmentation error. However, recent advances in reliable image subregion extraction [2] have inspired region-based recognition methods. Notably, Gu et al. [5] recently introduced a unified framework for detection, segmentation and classification based on detected regions. In spite of the promising results of the above methods, which exploit spatial semantics, this area is vastly unexplored and remains an active research domain. To that end, we explore classification trees [6], which are established as fast and reliable appearance descriptors for object classification [4]. We investigate their application to shape-based object recognition for the specific task of object detection. This leads to the introduction of novel binary features, which we refer to as Randomised Region Ray (RRR)¹ features and which are used efficiently for region classification. This methodology is novel and unique in its approach to recognition through specific shape characteristics.

This work was supported in part by BAE Systems (Operations).
Instead of processing individual pixels or patches around geometric features for detection, we base our approach on the concept of semantically meaningful components of the image, such as superpixels [3]. The main contribution of this paper is to design dedicated features for recognition based on a crude representation of the scene. We therefore do not rely on pixel values for training classifiers; instead we design and train specialised classifiers on the set of superpixels, which is far more sparse than the set of pixels in the image.
The advantage of such a system is twofold. First, it enables classification and object detection (such as of the human body) based on a crude representation of the scene. This is of essential importance in low-visibility situations, and when the camera is moving in a rapid and unpredictable manner, where traditional background models would fail. Second, this approach naturally leads to a dramatic reduction in the computational cost of the algorithm due to the smaller amount of input data to process.
In the remainder of this paper we introduce our novel shape features based on segmented input images. We then discuss training and inference using these features, and present experimental results and evaluations comparing our approach with a state-of-the-art people detector.
¹ Pronounced 'Arrr'.
2010 International Conference on Pattern Recognition
1051-4651/10 $26.00 © 2010 IEEE
DOI 10.1109/ICPR.2010.117
Figure 1: (a) RRR feature defined on superpixels using two angles and rays. (b) The training database is used to build classification trees for foreground objects.
2. Recognition Framework
Randomised trees [6] and ferns [7] have been successfully used in detection and tracking of patches, and have been applied to real-time tracking and recognition of textured objects with large motion and appearance changes. Inspired by the basic idea of the randomised classifiers employed in such methods, we propose a new classifier based on shape features of uniform regions in the scene defined by superpixels [3]. These superpixels form the input to our classification system. We therefore do not rely on pixel values in the classification and recognition stage. We then combine the classification results in a joint probability framework to infer the possible locations of objects of interest in the scene. While our approach is generic and can be applied to any object category, for the purpose of this work we focus on human body detection. The details of our proposed model are explained in the next section.
2.1. RRR Shape Features
We introduce Randomised Region Ray (RRR) features, which take the form of binary questions that can be used to collectively describe the geometrical form of superpixels. Each RRR feature casts two random rays from the centre of the superpixel to form a binary decision that can be used to train decision trees on the shape characteristics of the superpixels. One such feature is illustrated in Fig. 1. Angles α and β are randomly determined from the axis of reference (the vertical line). Binary trees are then trained to classify superpixels into different classes based on the comparison of the lengths of the two rays at angles α and β. Here we denote a ray defined on superpixel S_i with angle θ by r(S_i, θ). Each RRR feature at superpixel S_i can be expressed as:

RRR(S_i, α, β) = 1 if r(S_i, α) > r(S_i, β), and 0 otherwise.

The RRR features defined above are very simple and fast to compute on superpixels. Furthermore, their definition based on the relative lengths of the rays makes them invariant to scale. We use Bresenham's algorithm to compute the rays, r, efficiently. In the next section we show how these primitive features can be used to classify superpixels and enable object recognition.
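To make the feature concrete, the ray-length comparison above can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: it steps along the ray incrementally instead of using Bresenham's algorithm, and it assumes the superpixel is given as a boolean mask with rays cast from its centroid.

```python
import numpy as np

def ray_length(mask, cy, cx, theta):
    """Length of the ray cast from (cy, cx) at angle theta (radians,
    measured from the vertical reference axis) until it leaves the region."""
    dy, dx = -np.cos(theta), np.sin(theta)  # theta = 0 points straight up
    h, w = mask.shape
    y, x, length = float(cy), float(cx), 0
    while True:
        y, x = y + dy, x + dx
        iy, ix = int(round(y)), int(round(x))
        if not (0 <= iy < h and 0 <= ix < w) or not mask[iy, ix]:
            return length
        length += 1

def rrr_feature(mask, alpha, beta):
    """Binary RRR feature: 1 iff r(S_i, alpha) > r(S_i, beta)."""
    ys, xs = np.nonzero(mask)
    cy, cx = int(ys.mean()), int(xs.mean())  # superpixel centroid
    return int(ray_length(mask, cy, cx, alpha) > ray_length(mask, cy, cx, beta))
```

On a tall, thin superpixel the upward ray (α = 0) is longer than the horizontal one (β = π/2), so the feature fires; because only the relative lengths matter, the response is unchanged when the mask is scaled.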
2.2. Parts-based Human Body Recognition
We train randomised trees on different classes of superpixels which represent essential parts of the object class in question, i.e. the human body. We subdivide the body into three parts (upper body, middle body and lower body) and train classifiers to distinguish between different body parts as well as background superpixels. The number of parts is a parameter and can vary depending on system requirements. For instance, if the body is small in the image and a few superpixels typically cover it, then the body can be modelled by only one or two parts. In general, if there are n parts in the object class model, the classifier is trained on n + 1 classes (including the background).
For the training and evaluation of the developed algorithm, we use the Penn-Fudan Pedestrian database, a publicly available pedestrian image dataset with ground truth masks. This dataset consists of 170 images with 345 labelled pedestrians. The extracted superpixels in the labelled training database provide different instances of each of these main parts. The labelled ground truth parts are further used to obtain a geometrical distribution model for the body parts with respect to each other. This model encodes the spatial relationship between object parts. Specifically, in the case of the 3-part human body, it models the distribution of the upper, middle and lower body parts as 2-D Gaussian distributions with respect to the middle-body mean centroid. This model is used for the global inference of the body part positions given the individual responses of superpixels.
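Fitting the geometric model reduces to estimating a mean offset and covariance per part. The sketch below assumes a hypothetical annotation layout (one dict of part-name → centroid per pedestrian); it is an illustration of the idea, not the paper's data format.

```python
import numpy as np

def fit_part_offset_model(annotations):
    """For each non-middle part, fit a 2-D Gaussian (mean, covariance)
    over its centroid offset relative to the middle-body centroid."""
    model = {}
    for part in ("upper", "lower"):
        offsets = np.array(
            [np.subtract(a[part], a["middle"]) for a in annotations]
        )
        model[part] = (offsets.mean(axis=0), np.cov(offsets, rowvar=False))
    return model
```

At inference time these Gaussians act as a spatial prior: a candidate upper-body superpixel is rewarded when its centroid sits near the learnt mean offset from the middle-body centroid.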
2.3. Training
We start by extracting a set of superpixels, S = {S_i | i = 1, ..., N}, from the input image. The next step is to train a set of randomised trees, F = {F_k | k = 1, ..., K}, to classify superpixels into known object parts or background. Similar to [7], F_k = {f_σ(k, 1), ..., f_σ(k, D)} represents the k-th tree in the "forest" F, and f_σ(k, j) is a set of random RRR features, σ, at depth j of the k-th tree.
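As a hedged sketch of this training stage, the code below uses the closely related fern formulation of [7] in place of full randomised trees: each "tree" F_k is a flat list of D random (α, β) pairs plus a Laplace-smoothed class histogram per leaf. Here `rrr_fn` stands for the binary RRR response of one superpixel (Sec. 2.1); all names and the fern simplification are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_fern(depth):
    """D random angle pairs, playing the role of f_sigma(k, 1..D)."""
    return [tuple(rng.uniform(0.0, 2.0 * np.pi, size=2)) for _ in range(depth)]

def leaf_index(rrr_fn, fern):
    """Pack the D binary RRR responses of one superpixel into a leaf id."""
    idx = 0
    for alpha, beta in fern:
        idx = (idx << 1) | rrr_fn(alpha, beta)
    return idx

def train_forest(samples, n_classes, n_ferns=40, depth=9):
    """samples: (rrr_fn, label) pairs covering the n + 1 classes."""
    ferns = [make_fern(depth) for _ in range(n_ferns)]
    # Laplace-smoothed leaf histograms: 2^D leaves x n_classes counts
    hists = [np.ones((2 ** depth, n_classes)) for _ in ferns]
    for rrr_fn, label in samples:
        for fern, hist in zip(ferns, hists):
            hist[leaf_index(rrr_fn, fern), label] += 1.0
    return ferns, hists

def classify(rrr_fn, ferns, hists):
    """Sum per-fern log-posteriors and return the most likely class."""
    logp = sum(np.log(h[leaf_index(rrr_fn, f)] / h[leaf_index(rrr_fn, f)].sum())
               for f, h in zip(ferns, hists))
    return int(np.argmax(logp))
```

The 40-tree, depth-9 configuration reported in Section 3 corresponds to the defaults above; swapping the ferns for proper trees changes only how the D features are arranged, not the leaf-histogram training.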
Figure 2: (a) The overall response of the part-based classifier and the detection result. A higher value of the red colour component corresponds to a higher probability of human body presence. (b) The detection results for the 3 body parts.
Examples of the forest classification for individual superpixels are shown in Fig. 4-c. Likelihoods of superpixels belonging to body parts are computed using the trained forest of classification trees on the superpixels shown in Fig. 4-b. While the colours in Fig. 4-b are randomly selected, the colours in Fig. 4-c encode the probability of each superpixel belonging to the upper body (red), middle body (green) and lower body (blue); the most likely body part label is shown in Fig. 4-d. It can be noted that areas with human presence are highlighted by the higher likelihood values, which are independently computed for each superpixel using the trained decision forest. This implies that the RRR features are capable of learning the distinctive characteristics of each body part. In the next section we show how these individual responses for each superpixel can be aggregated for higher-level inference.
2.4. Foreground Inference
Target inference is done by applying the Generalised Distance Transform [2] to the classifier outputs for individual superpixels. For each detected superpixel S_i we obtain a class distribution over the body parts and the background. Using the learnt Gaussian geometric distribution model of body parts of Section 2.2, the Generalised Distance Transform can then be used to efficiently compute the maximum a posteriori (MAP) probability of the object parts, given by P(O_l | S_1, ..., S_N) ∝ P(S_1, ..., S_N | O_l) P(O_l), where O_l is part l of the object category, i.e. the human body.
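The inference step can be sketched as a brute-force max-product search: each superpixel is scored as a candidate middle-body centroid by combining its own likelihood with the best-placed supporting parts under the Gaussian offset model. The paper uses the Generalised Distance Transform to compute the inner maximisation in linear time; this O(N²) loop makes no such attempt, and the argument layouts are assumptions for illustration.

```python
import numpy as np

def map_body_scores(centroids, part_probs, offset_model):
    """centroids: (N, 2) superpixel centroids; part_probs: part -> (N,)
    likelihoods from the forest; offset_model: part -> (mean, covariance).
    Returns a log-score per candidate middle-body superpixel."""
    scores = np.log(part_probs["middle"] + 1e-9)
    for part, (mu, cov) in offset_model.items():
        inv = np.linalg.inv(cov)
        part_logp = np.log(part_probs[part] + 1e-9)
        for i, c in enumerate(centroids):
            d = centroids - c - mu  # deviation from the expected part offset
            prior = -0.5 * np.einsum("nj,jk,nk->n", d, inv, d)
            # best supporting superpixel for this part; the Generalised
            # Distance Transform computes this max in O(N) instead
            scores[i] += np.max(part_logp + prior)
    return scores  # argmax gives the MAP middle-body location
```

Taking the argmax of the returned scores yields the highest-rank detection; lower maxima can be kept for multiple-person detection after non-maximum suppression.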
3. Experimental Results
We have tested the proposed algorithm and the developed RRR features using a leave-one-out test on the Penn-Fudan Pedestrian Database. The classification forest is composed of 40 trees with 9 levels of depth. These trees are trained on K − 1 images and tested on
Figure 3: ROC curves for the training and test results of detection using our method. Also shown is the ROC curve obtained using the HOG-SVM people detector.
one image at a time, where K is the number of images in the dataset (i.e. K = 170). This enables us to independently evaluate the algorithm 170 times. Fig. 2-a shows the overall response of the part-based model computed by the Generalised Distance Transform on the individual superpixel classifications by the trained forest. A higher value of the red colour component corresponds to a higher probability of human body presence. Fig. 2-b shows the detection result for the 3 body parts of the highest-rank detection (the upper body, middle body and lower body centroids are marked). We can use non-maximum suppression or similar algorithms to detect multiple people in the image, as shown in Fig. 5.
The ROC curve of performance on the Penn-Fudan dataset was also computed, plotting the ratio of true positives vs. the fraction of false positives for the training dataset as well as for the leave-one-out test experiment. Both of these experiments include 170 results, and the classification score is used to draw the curve. For the purpose of comparison, we have also computed the performance of Histograms of Oriented Gradients (HOG) descriptors with a linear SVM classifier [1] using the same dataset. To that end we used the OpenCV implementation of HOG-SVM and adjusted the hit threshold to 0.5 and the group threshold to 0 to improve the performance without grouping the detections. The results are shown in Fig. 3 and show comparable performance and improvements using our introduced RRR features and superpixels, without relying on pixel-level data for inference.
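A ROC curve of this kind can be traced from the per-image classification scores by sweeping a threshold, as in the minimal sketch below (the paper's exact scoring protocol is not specified, so this is only an illustration of the construction).

```python
import numpy as np

def roc_points(scores, labels):
    """TPR/FPR pairs obtained by sweeping a threshold down the sorted
    detection scores; labels use 1 for a true pedestrian detection."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    pos = max(int(labels.sum()), 1)          # guard against empty classes
    neg = max(int((1 - labels).sum()), 1)
    tpr = np.cumsum(labels) / pos
    fpr = np.cumsum(1 - labels) / neg
    return tpr, fpr
```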
4. Conclusion and Future Work
This work is motivated by the importance and challenges of developing specific classification features that are suitable for loosely segmented input views. We introduce a new approach for object classification
Figure 4: Illustration of the likelihoods of superpixels. (a) Original image. (b) Superpixel image with random colour labels, used as input to the trained RRR classification forest. (c) Colour-coded probability of foreground parts. Brighter colour indicates higher probability of body parts; the superpixels composing the people in the image are clearly highlighted. (d) Same as (c), but only the dominant colour channel for the foreground is shown (red corresponds to the upper body, green to the middle body and blue to the lower body). Best viewed in colour.
Figure 5: Examples of detection results. Bounding boxes and the points corresponding to body parts are shown.
Colours are based on the area of overlap with the ground truth.
and detection which works on an abstract representation of the input image and uses the novel Randomised Region Ray features and binary decision trees for object classification. The RRR features are very easy to compute and are scale invariant. The input image can be a superpixel segmentation or any crude representation of the scene. These representations can either come directly from the sensing device through built-in processors/filters or can be efficiently computed prior to classification. The trained classifier has the potential to be fast and applicable to extreme videography conditions, where the camera is mounted on a mobile platform such as a UAV or has poor visibility. As a result, the computational costs of the RRR-based superpixel classification are substantially lower, due to the simplicity of the RRR features themselves as well as the sparsity of the superpixels compared to pixel-level cues and classification algorithms. The typical non-optimised processing time of a superpixel image by the RRR classification forest is around 500 ms.
The results obtained on the Penn-Fudan Pedestrian database suggest that the approach is promising and is capable of detecting humans using only a sparse set of superpixels as input. Furthermore, we can see that the RRR-based classification has comparable performance to existing algorithms that use pixel-level information for classification.
As the results indicate, the classifier performs better on the lower and middle body parts. This might be due to the fact that the upper body (head area) is less significant in size relative to the other parts. Another issue is that some superpixels bleed into other parts of the image, which can have a negative impact on the learning process. Possible solutions might involve modifying the underlying superpixel-computation technique to obtain more well-defined superpixel inputs. These issues will be addressed in future work.
References

[1] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In Conference on Computer Vision and Pattern Recognition, 2005.
[2] P. Felzenszwalb and D. Huttenlocher. Pictorial Structures for Object Recognition. International Journal of Computer Vision, 61(1), 2005.
[3] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.
[4] J. Gall and V. Lempitsky. Class-specific Hough forests for object detection. In Conference on Computer Vision and Pattern Recognition, pages 1–8, Miami, USA, 2009. IEEE Computer Society.
[5] C. Gu, J. Lim, P. Arbeláez, and J. Malik. Recognition using regions. In Conference on Computer Vision and Pattern Recognition, Miami, USA, 2009. IEEE Computer Society.
[6] V. Lepetit and P. Fua. Keypoint recognition using randomized trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
[7] M. Ozuysal, P. Fua, and V. Lepetit. Fast keypoint recognition in ten lines of code. In Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.