Cross-Articulation Learning for Robust
Detection of Pedestrians
Edgar Seemann and Bernt Schiele
Technical University of Darmstadt
Abstract. Recognizing categories of articulated objects in real-world scenarios is a challenging problem for today's vision algorithms. Due to the large appearance changes and intra-class variability of these objects, it is hard to define a model that is both general and discriminative enough to capture the properties of the category. In this work, we propose an approach that aims for a suitable trade-off between these two requirements. On the one hand, the approach is made more discriminative by explicitly distinguishing typical object shapes. On the other hand, the method generalizes well and requires relatively few training samples thanks to cross-articulation learning. The effectiveness of the approach is shown and compared to previous approaches on two datasets containing pedestrians with different articulations.
1. Introduction
In recent years a large number of approaches have been proposed for the detection of object categories in still images. Categories of non-rigid objects such as pedestrians have proven to be particularly challenging. Their high intra-class variability, caused by global appearance changes and object articulations, requires recognition approaches that are both highly discriminative and generalize well.
In the literature, several approaches focus on the global structure of the object [5,11,17,2], while others detect individual parts [4,3,10,8,16]. Gavrila [5] uses a hierarchy of object silhouettes and applies Chamfer matching to obtain detection hypotheses. Papageorgiou & Poggio [11] train an SVM based on wavelet features. Zhao & Thorpe [17] perform detection with a neural network and exploit stereo information to pre-segment images. Dalal & Triggs [2] compute a global gradient-based descriptor, similar to SIFT, to train a linear SVM. Forsyth & Fleck [4] introduce the general methodology of body plans for finding people in images. Felzenszwalb and Huttenlocher [3] learn simplistic detectors for individual body parts. Ronfard et al. extend this work by using stronger classifiers such as SVMs. Mohan and Papageorgiou [10] apply the wavelet-based detectors from [11] to detect body parts and then use body geometry to infer a person's position and pose. Viola et al. [16] use simple local features and a boosting scheme to train a cascade of classifiers. Mikolajczyk et al. [8] train body part classifiers with boosting and combine them in a probabilistic framework.
In this work, instead of modeling individual object parts, we identify and model typical object articulations or shapes. These typical shapes are learnt automatically from motion segmentations, which can be computed from video sequences with a Grimson-Stauffer background model [14]. The advantage of this approach is that no manual labeling of object parts is needed.
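As an illustration of how such motion segmentations could be obtained, the following sketch implements a per-pixel running-Gaussian background model. It is a deliberate simplification of the Stauffer-Grimson method [14], which maintains a mixture of several Gaussians per pixel; the learning rate, threshold, and initial variance used here are assumptions for illustration only.

```python
import numpy as np

def foreground_mask(frames, alpha=0.05, k=2.5):
    """Return a foreground mask for the last frame of a grayscale sequence.

    Simplified single-Gaussian variant of the Stauffer-Grimson background
    model: each pixel keeps a running mean and variance, and pixels more
    than k standard deviations from the mean are labeled foreground.
    """
    mean = frames[0].astype(np.float64)
    var = np.full(mean.shape, 25.0)          # assumed initial variance
    mask = np.zeros(mean.shape, dtype=bool)
    for f in frames[1:]:
        f = f.astype(np.float64)
        d = f - mean
        mask = d * d > (k * k) * var         # pixels far from the model
        # adapt the model only where the pixel matches the background
        mean = np.where(mask, mean, mean + alpha * d)
        var = np.where(mask, var, var + alpha * (d * d - var))
    return mask
```

A connected-component analysis of such masks would then yield the per-frame object silhouettes from which typical shapes are learnt.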
The main contributions of this paper are the following. We introduce a novel scheme to learn the relationship between arbitrary object parts or, as we call them, local contexts, and the global object shape. As a result, we obtain an approach that captures large appearance variations in a single model and implements a suitable trade-off between generalization performance and discriminative power. The method is able to share features between typical object shapes and therefore requires relatively few training images. In a sense, the approach generalizes the idea of sharing features [15] to the sharing of local appearance across object instances and shapes. A thorough evaluation shows that the proposed model outperforms previously published methods on two challenging datasets for the task of pedestrian recognition.
2. Recognition Algorithm
The recognition approach proposed in this paper extends the Implicit Shape Model (ISM) developed by Leibe & Schiele [6]. This section introduces the basic algorithm and discusses our extensions to explicitly handle global appearance changes and object articulations.
2.1. Standard ISM
The ISM is a voting framework, which accumulates local image evidence to find the most promising object hypotheses. It is capable of multi-scale detection, and a pixel-wise segmentation mask can be inferred for each hypothesis. An additional reasoning step based on the Minimum Description Length (MDL) principle makes the method more robust in the presence of clutter and overlapping objects. The following gives a brief overview of the method.
Codebook Representation. For representing an object category with an ISM, a codebook or visual vocabulary of local appearances is built [6]. To this end, a scale-invariant interest point detector is applied to each training image and local descriptors are extracted. These descriptors are subsequently clustered with an agglomerative clustering scheme. The resulting set of local appearances represents typical structures of the object category.
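As a sketch of this clustering step, the following greedy centroid-based agglomerative procedure merges descriptor clusters until no pair is more similar than a threshold. The cosine similarity and the threshold value are assumptions for illustration; the actual descriptors and similarity measure of the method may differ.

```python
import numpy as np

def build_codebook(descriptors, sim_thresh=0.7):
    """Greedy centroid-based agglomerative clustering of local descriptors.

    Repeatedly merges the most similar pair of clusters until no pair
    exceeds sim_thresh; the cluster means form the codebook entries.
    """
    clusters = [[d] for d in descriptors]

    def center(c):
        return np.mean(c, axis=0)

    def sim(a, b):
        ca, cb = center(a), center(b)
        return float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb) + 1e-12))

    merged = True
    while merged and len(clusters) > 1:
        merged = False
        best, bi, bj = sim_thresh, -1, -1
        for i in range(len(clusters)):            # find most similar pair
            for j in range(i + 1, len(clusters)):
                s = sim(clusters[i], clusters[j])
                if s > best:
                    best, bi, bj = s, i, j
        if bi >= 0:                               # merge it, if above threshold
            clusters[bi] += clusters.pop(bj)
            merged = True
    return [center(c) for c in clusters]          # codebook = cluster means
```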
Spatial Occurrence Distribution. Once a codebook for an object category has been learnt, we model the spatial occurrence distribution of its elements. To do so, we record all locations (x- and y-position and scale) at which a codebook entry matches the training instances.
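This recording step can be sketched as follows; the feature tuple layout and the matching threshold are illustrative assumptions. Offsets to the object center are normalized by the feature scale so that occurrences transfer across scales.

```python
import numpy as np
from collections import defaultdict

def record_occurrences(codebook, training_features, match_thresh=0.8):
    """Store, per codebook entry, where it occurs relative to the object.

    Each training feature is a tuple (descriptor, x, y, scale, (cx, cy)),
    where (cx, cy) is the annotated object center (layout is illustrative).
    """
    occurrences = defaultdict(list)
    for desc, x, y, s, (cx, cy) in training_features:
        for idx, entry in enumerate(codebook):
            sim = desc @ entry / (np.linalg.norm(desc) * np.linalg.norm(entry) + 1e-12)
            if sim > match_thresh:
                # offset from feature to object center, scale-normalized
                occurrences[idx].append(((cx - x) / s, (cy - y) / s, s))
    return occurrences
```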
Hypotheses Voting. In the recognition process, we apply the same feature extraction procedure as during training. Thus, we obtain a set of local descriptors at various scales on the test image. Each extracted descriptor casts votes for object hypotheses in a probabilistic extension of the generalized Hough transform. The maxima of the 3D voting space (x, y, scale) are back-projected to the image to retrieve the supporting local features of each hypothesis. We present details of an improved version of this probabilistic formulation when we introduce the extensions for different object shapes in Section 2.2.
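The voting step can be sketched as follows. Each matched codebook entry casts votes, weighted so that the contributions of one descriptor sum to one, into a coarsely binned (x, y, scale) space; a simple argmax over bins stands in here for the scale-adaptive mean-shift maxima search of the actual method [6], and all parameter values are assumptions.

```python
import numpy as np
from collections import defaultdict

def cast_votes(features, codebook, occurrences, match_thresh=0.8, bin_size=4):
    """Generalized-Hough voting sketch: return the strongest (x, y, scale) bin.

    features: list of (descriptor, x, y, scale) tuples from the test image;
    occurrences: per-entry offsets as produced during training.
    """
    votes = defaultdict(float)
    for desc, x, y, s in features:
        matches = [i for i, e in enumerate(codebook)
                   if desc @ e / (np.linalg.norm(desc) * np.linalg.norm(e) + 1e-12)
                   > match_thresh]
        for i in matches:
            occs = occurrences.get(i, [])
            if not occs:
                continue
            # uniform p(entry | descriptor) * p(occurrence | entry)
            w = 1.0 / (len(matches) * len(occs))
            for dx, dy, so in occs:
                cx, cy = x + dx * s, y + dy * s   # predicted object center
                b = (round(cx / bin_size), round(cy / bin_size), round(s))
                votes[b] += w
    return max(votes, key=votes.get) if votes else None
```

In the actual framework the bin maxima are only starting points; the supporting votes are back-projected to the image to recover the features that generated each hypothesis.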
Fig. 7. Example detections on test set A (upper row) and test set B (lower row).

References
1. S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using
shape contexts. PAMI, 2002.
2. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
3. P. Felzenszwalb and D. Huttenlocher. Efficient matching of pictorial structures. In CVPR, 2000.
4. D. Forsyth and M. Fleck. Body plans. In CVPR, 1997.
5. D. Gavrila. Multi-feature hierarchical template matching using distance trans-
forms. In ICPR, volume 1, pages 439–444, 1998.
6. B. Leibe and B. Schiele. Scale invariant object categorization using a scale-adaptive
mean-shift search. In DAGM, pages 145–153, 2004.
7. B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. In CVPR, 2005.
8. K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. In ECCV, pages 69–82, 2004.
9. K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors.
PAMI, 27(10):1615–1630, 2005.
10. A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in
images by components. PAMI, 23(4):349–361, 2001.
11. C. Papageorgiou and T. Poggio. A trainable system for object detection. IJCV, 38(1):15–33, 2000.
12. E. Seemann, B. Leibe, K. Mikolajczyk, and B. Schiele. An evaluation of local
shape-based features for pedestrian detection. In BMVC, 2005.
13. E. Seemann, B. Leibe, and B. Schiele. Multi-aspect detection of articulated objects.
In CVPR, 2006.
14. C. Stauffer and W. Grimson. Adaptive background mixture models for realtime
tracking. In CVPR, 1999.
15. A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing visual features for multiclass and multiview object detection. Submitted to PAMI, 2005.
16. P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion
and appearance. In ICCV, pages 734–741, 2003.
17. L. Zhao and C. Thorpe. Stereo and neural network-based pedestrian detection. IEEE Transactions on Intelligent Transportation Systems, 1(3):148–154, 2000.