Cross-Articulation Learning for Robust
Detection of Pedestrians
Edgar Seemann and Bernt Schiele
Technical University of Darmstadt
Abstract. Recognizing categories of articulated objects in real-world
scenarios is a challenging problem for today’s vision algorithms. Due to
the large appearance changes and intra-class variability of these objects,
it is hard to define a model that is both general and discriminative
enough to capture the properties of the category. In this work, we propose
an approach that aims for a suitable trade-off for this problem.
On the one hand, the approach is made more discriminant by explic-
itly distinguishing typical object shapes. On the other hand, the method
generalizes well and requires relatively few training samples by cross-
articulation learning. The effectiveness of the approach is shown and
compared to previous approaches on two datasets containing pedestri-
ans with different articulations.
1. Introduction
In recent years a large number of approaches have been proposed for the
detection of object categories in still images. Categories of non-rigid objects such
as pedestrians have proven to be particularly challenging. The high intra-class
variability, which is caused by global appearance changes and object articula-
tions, requires recognition approaches that are both highly discriminative and
also generalize well.
In the literature, several approaches focus on the global structure of the
object [5,11,17,2], while others detect individual parts [4,3,10,8,16]. Gavrila
uses a hierarchy of object silhouettes and applies Chamfer matching to obtain
detection hypotheses. Papageorgiou & Poggio train an SVM based on wavelet
features. Zhao & Thorpe perform detection with a neural network and exploit
stereo information to pre-segment images. Dalal & Triggs compute a global
gradient-based descriptor, similar to SIFT, to train a linear SVM. Forsyth &
Fleck introduce the general methodology of body plans for finding people in
images. Felzenszwalb and Huttenlocher learn simplistic detectors for individual
body parts. Ronfard et al. extended this work by using stronger classifiers such
as SVMs. Mohan and Papageorgiou apply the wavelet-based detectors of
Papageorgiou & Poggio to detect body parts and then use body geometry to infer
a person's position and pose. Viola et al. use simple local features and a
boosting scheme to train a cascade of classifiers. Mikolajczyk et al. train body
part classifiers with boosting and combine them in a probabilistic framework.
In this work, instead of modeling individual object parts, we identify and model
typical object articulations or shapes. These typical shapes are learnt
automatically from motion segmentations, which can be computed from video
sequences with a Grimson-Stauffer background model. The advantage of this
approach is that we do not need manual labeling of object parts.
The main contributions of this paper are the following. We introduce a novel
scheme to learn the relationship between arbitrary object parts or, as we call it,
local contexts and the global object shape. As a result, we obtain an approach
that captures large appearance variations in a single model and implements a
suitable trade-off between generalization performance and discriminative power
of the model. The method is able to share features between typical object shapes
and therefore requires relatively few training images. In a sense, the approach
generalizes the idea of sharing features to the sharing of local appearance
across object instances and shapes. A thorough evaluation shows that the
proposed model outperforms previously published methods on two challenging data
sets for the task of pedestrian recognition.
2. Recognition Algorithm
The recognition approach proposed in this paper extends the Implicit Shape
Model (ISM) developed by Leibe & Schiele. This section introduces the basic
algorithm and discusses extensions to explicitly handle global appearance
changes and object articulations.
2.1. Standard ISM
The ISM is a voting framework that accumulates local image evidence to
find the most promising object hypotheses. It is capable of multi-scale
detection, and pixel-wise segmentation masks can be inferred for each hypothesis.
An additional reasoning step based on the Minimum Description Length (MDL)
principle makes the method more robust in the presence of clutter and
overlapping objects. The following gives a brief overview of the method.
Codebook Representation. For representing an object category with an
ISM, a codebook or visual vocabulary of local appearances is built. To this
end, a scale-invariant interest point detector is applied to each training image
and local descriptors are extracted. These descriptors are subsequently clustered
with an agglomerative clustering scheme. The resulting set of local appearances
represents typical structures of an object category.
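The codebook construction can be illustrated with a small sketch. It is not the authors' implementation: the toy 2D vectors stand in for real local descriptors, and the average-link criterion, the Euclidean distance and the `max_dist` stopping threshold are assumptions chosen for brevity.

```python
import math
from itertools import combinations

def average_link_distance(a, b):
    """Mean pairwise Euclidean distance between two clusters of vectors."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerative_codebook(descriptors, max_dist):
    """Merge the two closest clusters until no pair is closer than max_dist.

    Returns one codebook entry (the cluster mean) per remaining cluster.
    """
    clusters = [[d] for d in descriptors]
    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j always holds here).
        (i, j), dist = min(
            (((i, j), average_link_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > max_dist:
            break  # remaining clusters are too dissimilar to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            for cluster in clusters]

# Toy descriptors: two tight groups that should become two codebook entries.
descriptors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9)]
codebook = agglomerative_codebook(descriptors, max_dist=1.0)
```

The threshold controls the granularity of the vocabulary: a smaller `max_dist` yields more, tighter codebook entries.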
Spatial Occurrence Distribution. Once a codebook for an object category
has been learnt, we model the spatial occurrence distribution of its elements. To
do this, we record all locations (x-, y-position and scale) at which a
codebook entry matches the training instances.
Hypotheses Voting. In the recognition process, we apply the same feature
extraction procedure as during training. Thus, we obtain a set of local descriptors
at various scales on the test image. Each extracted descriptor casts votes for ob-
ject hypotheses in a probabilistic extension of the generalized Hough transform.
The maxima of the 3D voting space (x,y,scale) are back-projected to the image
to retrieve the supporting local features of each hypothesis. We present details
on an improved version of this probabilistic formulation when we introduce the
extensions to deal with different object shapes in Section 2.2.
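The voting step can be sketched as follows. This is a simplified illustration under stated assumptions: the voting space is discretized into bins rather than searched with a continuous kernel, each stored occurrence is a hypothetical triple `(dx, dy, scale_ratio)` giving the offset to the object centre in units of feature scale and the ratio of object scale to feature scale, and the votes of one codebook entry are weighted uniformly.

```python
from collections import defaultdict

def cast_votes(features, occurrences, xy_bin=10.0, scale_bin=0.5):
    """Accumulate weighted votes in a binned (x, y, scale) voting space.

    features:    list of (x, y, scale, matches), where matches is a list of
                 (entry_id, p_match) pairs for the matched codebook entries.
    occurrences: entry_id -> list of (dx, dy, scale_ratio) from training.
    """
    votes = defaultdict(float)
    for x, y, scale, matches in features:
        for entry_id, p_match in matches:
            occs = occurrences.get(entry_id, [])
            for dx, dy, scale_ratio in occs:
                # Predicted object centre and object scale for this occurrence.
                cx = x + dx * scale
                cy = y + dy * scale
                obj_scale = scale * scale_ratio
                key = (round(cx / xy_bin), round(cy / xy_bin),
                       round(obj_scale / scale_bin))
                # Match probability, spread uniformly over the occurrences.
                votes[key] += p_match / len(occs)
    return votes

# Two features: the first has one interpretation, the second has two.
features = [
    (100, 50, 2.0, [(0, 1.0)]),
    (100, 150, 2.0, [(1, 0.5), (2, 0.5)]),
]
occurrences = {0: [(0, 25, 1.0)], 1: [(0, -25, 1.0)], 2: [(30, 0, 1.0)]}

votes = cast_votes(features, occurrences)
best = max(votes, key=votes.get)  # bin with the highest accumulated weight
```

Here both features support an object centred at (100, 100), so that bin collects the largest weight, while the second interpretation of the second feature casts an isolated vote elsewhere.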
Fig.1. Schematic overview of the different object models. Both the standard ISM
and 4D-ISM models are special cases of the proposed approach. By learning the shape
distribution from local contexts, we combine the strength of the two other models.
2.2. Consistent Shape Voting
Figure 1 shows a schematic illustration of the standard ISM-model on the left.
While the ISM allows for cross-instance learning and therefore requires relatively
little training data, it has no notion of possible object articulations within the
category. Local appearances are learnt from all possible variations, which ensures
good generalization performance, but results in relatively weak discriminative
power, e.g. with respect to background structures. By adding a 4th dimension
for object articulations to the ISM voting space (Figure 1 center), the model is
able to distinguish between object shapes and is thus more discriminant.
This, however, requires an association of each training example to one of the
typical articulations or shapes.
Learning Object Shapes. Manual labelling of object shapes in the training
data is both time-consuming and difficult for more complex objects. We
therefore automatically learn the most prominent shapes from object silhouettes.
To this end, we apply agglomerative clustering with the global Chamfer distance
as similarity measure. The silhouettes are extracted from video sequences with a
motion-segmentation algorithm. For the object category of pedestrians, the
silhouette is often a good indication of the current body articulation. As an
example, Figure 3 shows the identified articulation clusters for side-view pedes-
trians generated by this method.
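To make the similarity measure concrete, the following sketch computes a Chamfer distance between two silhouettes given as lists of contour points. It is an illustration only: the symmetric averaging of the two directed distances and the brute-force nearest-neighbour search are assumptions, and practical implementations usually rely on a precomputed distance transform instead.

```python
import math

def directed_chamfer(a, b):
    """Mean distance from each point in a to its nearest point in b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: average of both directed distances."""
    return 0.5 * (directed_chamfer(a, b) + directed_chamfer(b, a))

# Toy contours: a vertical line, the same line shifted by one pixel,
# and a line whose upper half bends away (a different "articulation").
straight = [(0, y) for y in range(10)]
shifted = [(1, y) for y in range(10)]
bent = [(0, y) for y in range(5)] + [(y - 4, y) for y in range(5, 10)]

d_same = chamfer_distance(straight, straight)
d_shift = chamfer_distance(straight, shifted)
d_bent = chamfer_distance(straight, bent)
```

A pairwise distance matrix computed this way could then be fed to an agglomerative clustering scheme to obtain the shape clusters.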
4D Voting. In this paragraph we describe the probabilistic formulation of
the extended 4D voting procedure. Let $e$ be a local descriptor computed at
location $\ell$. Each descriptor is compared to the codebook and may be matched
to several codebook entries. One can think of these matches as multiple valid
interpretations $I_i$ for the descriptor, each of which holds with the probability
$p(I_i \mid e)$. Each interpretation then casts votes for different object instances
$o_n$, locations $\lambda_x, \lambda_y$, scales $\lambda_\sigma$ and shape clusters $s$ according to its
learned occurrence distribution $P(o_n, \lambda, s \mid I_i, \ell)$ with
$\lambda = (\lambda_x, \lambda_y, \lambda_\sigma)$. Thus, any single vote has the weight
$P(o_n, \lambda, s \mid I_i, \ell)\, p(I_i \mid e)$, and the descriptor's contribution to the
hypothesis can be expressed by the following marginalization:
\begin{align*}
P(o_n, \lambda, s \mid e, \ell) &= \sum_i P(o_n, \lambda, s \mid I_i, \ell)\, p(I_i \mid e) \\
&= \sum_i P(\lambda, s \mid o_n, I_i, \ell)\, P(o_n \mid I_i, \ell)\, p(I_i \mid e)
\end{align*}
There are, however, several issues with this formulation. First, it is difficult
to estimate the probability density $P(\lambda, s \mid o_n, I_i, \ell)$ reliably due to the
increased dimensionality, in particular from a relatively small set of data. Second, and
quite importantly, the shape dimension $s$ is neither continuous nor ordered. It
is therefore unclear how the maximum search can be formulated efficiently.
Applying a Mean-Shift search with a scale-adapted kernel, as in the standard
ISM approach, is no longer feasible. Therefore, the following factorization is used
to obtain a tractable solution:
\begin{align*}
P(o_n, \lambda, s \mid e, \ell) = \sum_i P(s \mid \lambda, o_n, I_i, \ell)\, P(\lambda \mid o_n, I_i, \ell)\, P(o_n \mid I_i, \ell)\, p(I_i \mid e)
\end{align*}
Please note that all but the first term, $P(s \mid \lambda, o_n, I_i, \ell)$, are the same as in
the standard ISM. We can therefore use the following simple yet effective strategy
to find the maxima of equation 2. By first searching the K maxima in the marginalized
3D voting space, we not only reduce the computational complexity but
also constrain our search to those areas of the probability density with enough
evidence and training data. Choosing K sufficiently large, we can find all
maxima with high probability. For those K maxima we then retrieve the
contributing votes and use the following calculation (for simplicity of notation we use
$P(s \mid H) = P(s \mid \lambda, o_n, I_i, \ell)$):
where $c_j$ corresponds to the individual silhouettes present in the training data
and $s$ is a shape cluster. $P(s \mid c_j)$ represents the probability that silhouette $c_j$ is
assigned to cluster $s$; it is 1 if silhouette $c_j$ is contained in shape cluster $s$
and 0 otherwise.
By following the above procedure, we can obtain the 4D maxima of $P(o_n, \lambda, s)$.
This means in particular that the votes corresponding to these maxima conform
to a common shape cluster. As a result, the voting scheme produces hypotheses
that have a consistent shape.
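The second stage of this procedure can be sketched for a single 3D maximum. The sketch rests on an assumption that is not spelled out in the text: the shape distribution is read as $P(s \mid H) = \sum_j P(s \mid c_j)\, P(c_j \mid H)$, where $P(c_j \mid H)$ is taken to be the normalized vote weight that silhouette $c_j$ contributes to hypothesis $H$. The silhouette identifiers and cluster names below are hypothetical.

```python
from collections import defaultdict

def shape_posterior(votes, cluster_of):
    """Estimate P(s | H) for one 3D maximum from its contributing votes.

    votes:      list of (weight, silhouette_id) pairs retrieved for H.
    cluster_of: silhouette_id -> shape cluster (hard assignment, so
                P(s | c_j) is 1 for the assigned cluster and 0 otherwise).
    """
    total = sum(weight for weight, _ in votes)
    posterior = defaultdict(float)
    for weight, silhouette in votes:
        # Assumed P(c_j | H): this silhouette's share of the vote mass.
        posterior[cluster_of[silhouette]] += weight / total
    return dict(posterior)

# Hypothetical votes contributing to one maximum of the 3D voting space.
votes = [(0.4, "c1"), (0.3, "c2"), (0.2, "c3"), (0.1, "c1")]
cluster_of = {"c1": "walking", "c2": "walking", "c3": "standing"}

posterior = shape_posterior(votes, cluster_of)
best_shape = max(posterior, key=posterior.get)
# Keep only the votes that agree with the dominant shape cluster.
consistent_votes = [(w, c) for w, c in votes if cluster_of[c] == best_shape]
```

Restricting the hypothesis to `consistent_votes` is what yields a shape-consistent 4D maximum; repeating this for each of the K candidate maxima gives the full procedure.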
2.3. Cross-Articulation Learning using Local Contexts
As will be seen in the experiments, the 4D voting procedure for individual object
shapes improves the recognition performance w.r.t. the original ISM approach.
In particular, the discriminative power of the learned object model is increased,
since it can distinguish typical object articulations. While this is a
desirable goal, it comes with a number of side effects.
Fig.2. (Left) The same local feature can occur on globally dissimilar object shapes.
(Right) The comparison of local contexts (red) around interest points (yellow star),
influences the choice of object shapes considered in the recognition process.
On the one hand, we reduce the statistical significance of the object
hypotheses, since the number of features contributing to each hypothesis has been
reduced. In essence, the votes are distributed over a range of articulation
clusters. This can be easily seen from the schematic views of the original ISM-model
(Fig. 1 left) and the 4D voting approach (Fig. 1 center). In the standard ISM
model feature occurrences from all training instances can be combined for a final
hypothesis. This is a desirable property, which we call cross-instance learning:
it uses the training images effectively and allows high recognition
performance to be obtained with a relatively small number of training images.
Even though codebook entries are shared in the 4D-ISM and some limited
cross-articulation learning is achieved, the feature occurrences and therefore the votes
are essentially limited to a certain shape cluster. The goal of the following is
therefore to introduce an algorithm that allows for more effective cross-articulation
learning and thereby increases the generalization power of the approach without
losing the gained discriminative power of the 4D-ISM.
To illustrate the underlying idea, consider the images shown in Figure 2 (left).
Assume that we are observing a head feature (shown as the yellow square in
the two left images). In that case, the observation of the head places very few
constraints on the particular position and shape of the legs. In terms of the
4D-ISM, this means that we should not restrict our votes to one particular
articulation but rather vote for a range of different but compatible articulations.
While, in principle, an increase in the number of training instances should
compensate for the limited cross-instance learning in the 4D-ISM, the
discussion above motivates another strategy. Our strategy re-enables
cross-instance learning without the need for more training data. The basic idea is
that object shapes, while being globally dissimilar, are often very similar in
a more local context. So instead of considering only the global shape for the
assignment of feature occurrences to articulation clusters, we propose to compare
the local context of an interest point. This is illustrated in Figure 2 (right).
There, we consider a local context (represented as local silhouette segment in
red) extracted around an interest point (yellow star) and depict locally similar
object silhouettes. As can be seen, an occurrence at the head (here the front of the
head) is compatible with many different articulations. An occurrence on the foot,
by contrast, constrains the range of compatible articulations considerably.
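The intuition can be illustrated with a toy comparison of local contexts. The sketch assumes that a local context is the set of silhouette points within a fixed radius of the interest point and that two contexts are compared with a symmetric Chamfer distance; both choices, as well as the radius and the toy silhouettes, are illustrative assumptions rather than the method's actual definition.

```python
import math

def local_context(silhouette, point, radius):
    """Silhouette points lying within `radius` of the interest point."""
    return [p for p in silhouette if math.dist(p, point) <= radius]

def local_chamfer(sil_a, sil_b, point, radius):
    """Symmetric Chamfer distance between the two local contexts at `point`."""
    a = local_context(sil_a, point, radius)
    b = local_context(sil_b, point, radius)
    if not a or not b:
        return math.inf  # one silhouette has no contour near this point
    d_ab = sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    d_ba = sum(min(math.dist(p, q) for q in a) for p in b) / len(b)
    return 0.5 * (d_ab + d_ba)

# Two toy silhouettes with an identical upper body but different legs.
upper_body = [(0, y) for y in range(10, 20)]
legs_together = upper_body + [(0, y) for y in range(10)]
legs_apart = upper_body + [((10 - y) / 2, y) for y in range(10)]

head_point, foot_point = (0, 18), (0, 1)
d_head = local_chamfer(legs_together, legs_apart, head_point, radius=5)
d_foot = local_chamfer(legs_together, legs_apart, foot_point, radius=5)
```

Around the head the two silhouettes are locally identical, so an occurrence there would be compatible with both articulations, whereas the large distance around the foot indicates that an occurrence there is not.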
Learning the Shape Distribution. In order to integrate this idea into our
probabilistic voting framework, we adapt equation 4 with information about the
similarity between local contexts on different object shapes.