
Cross-Articulation Learning for Robust

Detection of Pedestrians

Edgar Seemann and Bernt Schiele

{lastname}@mis.tu-darmstadt.de

Technical University of Darmstadt

http://www.mis.informatik.tu-darmstadt.de

Abstract. Recognizing categories of articulated objects in real-world

scenarios is a challenging problem for today’s vision algorithms. Due to

the large appearance changes and intra-class variability of these objects,

it is hard to define a model that is both general and discriminative
enough to capture the properties of the category. In this work, we propose an approach that aims for a suitable trade-off for this problem.
On the one hand, the approach is made more discriminative by explicitly distinguishing typical object shapes. On the other hand, the method

generalizes well and requires relatively few training samples by cross-

articulation learning. The effectiveness of the approach is shown and

compared to previous approaches on two datasets containing pedestri-

ans with different articulations.

1. Introduction

In recent years a large number of approaches have been proposed for the de-

tection of object categories in still images. Categories of non-rigid objects such

as pedestrians have proven to be particularly challenging. The high intra-class

variability, which is caused by global appearance changes and object articula-

tions, requires recognition approaches that are both highly discriminative and

also generalize well.

In the literature, several approaches focus on the global structure of the

object [5,11,17,2], while others detect individual parts [4,3,10,8,16]. Gavrila
[5] uses a hierarchy of object silhouettes and applies Chamfer matching to obtain detection hypotheses. Papageorgiou & Poggio [11] train an SVM based on
wavelet features. Zhao & Thorpe [17] perform detection with a neural network

and exploit stereo information to pre-segment images. Dalal & Triggs [2] com-

pute a global gradient-based descriptor, similar to SIFT, to train a linear SVM.

Forsyth & Fleck [4] introduce the general methodology of body plans for finding

people in images. Felzenszwalb and Huttenlocher [3] learn simplistic detectors for

individual body parts. Ronfard et al. [19] extended this work by using stronger

classifiers such as SVMs. Mohan and Papageorgiou [10] apply the wavelet-based

detectors from [11] to detect body parts and then use body geometry to infer

a person’s position and pose. Viola et al. [16] use simple local features and a

boosting scheme to train a cascade of classifiers. Mikolajczyk et al. train body

part classifiers with boosting and combine them in a probabilistic framework.

In this work, instead of modeling individual object parts, we identify and model

typical object articulations or shapes. These typical shapes are learnt automati-

cally from motion segmentations, which can be computed from video sequences


with a Grimson-Stauffer background model [14]. The advantage of this approach

is that we do not need manual labeling of object parts.

The main contributions of this paper are the following. We introduce a novel

scheme to learn the relationship between arbitrary object parts or, as we call it,

local contexts and the global object shape. As a result, we obtain an approach,

which captures large appearance variations in a single model and implements a

suitable trade-off between generalization performance and discriminative power

of the model. The method is able to share features between typical object shapes

and therefore requires relatively few training images. In a sense, the approach

generalizes the idea of sharing features [15] to the sharing of local appearance

across object instances and shapes. A thorough evaluation shows that the pro-

posed model outperforms previously published methods on two challenging data

sets for the task of pedestrian recognition.

2. Recognition Algorithm

The recognition approach proposed in this paper extends the Implicit Shape

Model (ISM) developed by Leibe & Schiele [6]. This section introduces the ba-

sic algorithm and discusses extensions to explicitly handle global appearance

changes and object articulations.

2.1. Standard ISM

The ISM is a voting framework which accumulates local image evidence to
find the most promising object hypotheses. It is capable of multi-scale detection, and pixel-wise segmentation masks can be inferred for each hypothesis. An
additional reasoning step based on the Minimum Description Length (MDL)
principle makes the method more robust in the presence of clutter and overlapping objects. The following gives a brief overview of the method.

Codebook Representation. For representing an object category with an

ISM, a codebook or visual vocabulary of local appearances is built [6]. To this
end, a scale-invariant interest point detector is applied to each training image
and local descriptors are extracted. These descriptors are subsequently clustered
with an agglomerative clustering scheme. The resulting set of local appearances
represents typical structures of the object category.
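The clustering step above can be sketched as follows; a plain average-link agglomeration with a Euclidean distance and an illustrative cut-off value stands in for the actual descriptor similarity used in the system:

```python
# Minimal sketch of codebook construction by average-link agglomerative
# clustering. Descriptors are fixed-length float vectors; the distance
# measure and the cut-off threshold are illustrative assumptions.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def average_link(c1, c2):
    # mean pairwise distance between two clusters of descriptors
    return sum(euclidean(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def build_codebook(descriptors, cut_off):
    # start with one singleton cluster per descriptor
    clusters = [[list(d)] for d in descriptors]
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda p: average_link(clusters[p[0]], clusters[p[1]]))
        if average_link(clusters[i], clusters[j]) > cut_off:
            break  # no remaining pair is similar enough -> stop merging
        clusters[i] += clusters[j]
        del clusters[j]
    # each codebook entry is the mean appearance of its cluster
    return [[sum(dim) / len(cl) for dim in zip(*cl)] for cl in clusters]
```

With four 1D toy descriptors and a cut-off of 1.0, the two nearby pairs merge into two codebook entries at their respective means.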

Spatial Occurrence Distribution. Once a codebook for an object category
has been learnt, we model the spatial occurrence distribution of its elements. To
do this, we record all locations (x-, y-position and scale) at which a codebook
entry matches the training instances.
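A minimal sketch of this bookkeeping, assuming each training match is reduced to a codebook entry index plus an offset to the annotated object centre (the names and tuple layout are illustrative):

```python
# Sketch of recording the spatial occurrence distribution: for every match
# of a codebook entry on a training instance, store where the entry occurred
# relative to the object centre and at which scale.
from collections import defaultdict

def record_occurrences(matches):
    """matches: iterable of (entry_id, dx, dy, scale) tuples, where (dx, dy)
    is the offset from the matched feature to the annotated object centre."""
    occurrences = defaultdict(list)
    for entry_id, dx, dy, scale in matches:
        occurrences[entry_id].append((dx, dy, scale))
    return dict(occurrences)
```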

Hypotheses Voting. In the recognition process, we apply the same feature

extraction procedure as during training. Thus, we obtain a set of local descriptors

at various scales on the test image. Each extracted descriptor casts votes for ob-

ject hypotheses in a probabilistic extension of the generalized Hough transform.

The maxima of the 3D voting space (x,y,scale) are back-projected to the image

to retrieve the supporting local features of each hypothesis. We present details

on an improved version of this probabilistic formulation when we introduce the

extensions to deal with different object shapes in section 2.2.
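The voting step can be sketched as follows; a coarse binned maximum search stands in for the mean-shift search of the actual system, and all names, probabilities, and bin sizes are illustrative assumptions:

```python
# Sketch of the probabilistic Hough voting: each feature matched to a
# codebook entry casts votes for object centres according to the entry's
# learned occurrences, with the weight spread over those occurrences.
from collections import defaultdict

def cast_votes(features, occurrences, match_prob):
    """features: list of (entry_id, x, y, scale) found on the test image.
    occurrences: entry_id -> list of (dx, dy, dscale) learned offsets.
    match_prob: stand-in for p(I|e), here a single uniform value."""
    votes = []
    for entry_id, x, y, s in features:
        occ = occurrences.get(entry_id, [])
        for dx, dy, ds in occ:
            # offsets are scaled by the feature's detection scale
            votes.append((x + dx * s, y + dy * s, s * ds,
                          match_prob / len(occ)))
    return votes

def best_hypothesis(votes, bin_size=10.0):
    # coarse binned maximum search over the (x, y, scale) voting space
    score = defaultdict(float)
    for x, y, s, w in votes:
        score[(round(x / bin_size), round(y / bin_size), round(s))] += w
    return max(score.items(), key=lambda kv: kv[1])
```

Two nearby features voting for the same centre outscore an isolated distractor, which is exactly the accumulation-of-evidence behaviour described above.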


Fig. 1. Schematic overview of the different object models. Both the standard ISM
and 4D-ISM models are special cases of the proposed approach. By learning the shape
distribution from local contexts, we combine the strengths of the two other models.

2.2. Consistent Shape Voting

Figure 1 shows a schematic illustration of the standard ISM-model on the left.

While the ISM allows for cross-instance learning and therefore requires relatively

little training data, it has no notion of possible object articulations within the

category. Local appearances are learnt from all possible variations, which ensures

good generalization performance, but results in relatively weak discriminative

power, e.g. with respect to background structures. By adding a 4th dimension
for object articulations to the ISM voting space (Figure 1 center), the model is
able to distinguish between object shapes and is thus more discriminative [13].

This, however, requires an association of each training example to one of the

typical articulations or shapes.

Learning Object Shapes. Manual labelling of object shapes in the training

data is both time-consuming and difficult for more complex objects. We therefore learn the most prominent shapes automatically from object silhouettes by
applying agglomerative clustering with the global Chamfer distance as similarity
measure. The silhouettes are extracted from video sequences with a

motion-segmentation algorithm [14]. For the object category of pedestrians the

silhouette is often a good indication of the current body articulation. As an

example, Figure 3 shows the identified articulation clusters for side-view pedes-

trians generated by this method.

4D Voting. In this paragraph we describe the probabilistic formulation of
the extended 4D voting procedure. Let e be a local descriptor computed at location ℓ. Each descriptor is compared to the codebook and may be matched
to several codebook entries. One can think of these matches as multiple valid
interpretations Ii for the descriptor, each of which holds with probability
p(Ii|e). Each interpretation then casts votes for different object instances on, locations λx, λy, scales λσ and shape clusters s according to its learned occurrence
distribution P(on,λ,s|Ii,ℓ) with λ = (λx,λy,λσ). Thus, any single vote has the
weight P(on,λ,s|Ii,ℓ) p(Ii|e) and the descriptor's contribution to the hypothesis
can be expressed by the following marginalization:

P(on,λ,s|e,ℓ) = Σi P(on,λ,s|Ii,ℓ) p(Ii|e)
              = Σi P(λ,s|on,Ii,ℓ) p(on|Ii,ℓ) p(Ii|e)    (1)

P(on,λ,s) ∼ Σk P(on,λ,s|ek,ℓk)    (2)
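Numerically, the marginalization of equation 1 and the accumulation of equation 2 amount to the following sketch; the probability values are illustrative placeholder inputs, not learned quantities:

```python
# Sketch of Eqs. (1)-(2): one descriptor's vote weight is marginalized over
# its codebook interpretations, and the hypothesis score sums the
# contributions of all descriptors supporting it.

def vote_weight(p_occ, p_on, p_match):
    # P(lambda, s | o_n, I_i, l) * p(o_n | I_i, l) * p(I_i | e)
    return p_occ * p_on * p_match

def hypothesis_score(descriptor_votes):
    """descriptor_votes: per-descriptor lists of (p_occ, p_on, p_match)
    triples, one per interpretation voting for this hypothesis."""
    return sum(vote_weight(*v)
               for votes in descriptor_votes for v in votes)
```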

There are, however, several issues with this formulation. First, it is difficult
to estimate the probability density P(λ,s|on,Ii,ℓ) reliably due to the increased
dimensionality, in particular from a relatively small set of data. Second, and
quite importantly, the shape dimension s is neither continuous nor ordered. It
is therefore unclear how the maximum search can be efficiently formulated.

Applying a Mean-Shift search with a scale-adapted kernel, as in the standard

ISM approach, is no longer feasible. Therefore, the following factorization is used

to obtain a tractable solution:

P(on,λ,s|e,ℓ) = Σi P(s|λ,on,Ii,ℓ) P(λ|on,Ii,ℓ) p(on|Ii,ℓ) p(Ii|e)    (3)

Please note that all but the first term P(s|λ,on,Ii,ℓ) are the same as in
[6]. Therefore we can use the following simple yet effective strategy to find the
maxima of equation 2. By first searching the K maxima in the marginalized
3D voting space, we not only reduce the computational complexity but
also constrain our search to those areas of the probability density with enough
evidence and training data. Choosing K sufficiently large, we can find all maxima with high probability. For those K maxima we then retrieve the contributing votes and use the following calculation (for simplicity of notation we write
P(s|H) = P(s|λ,on,Ii,ℓ)):

P(s|H) = Σj P(s|cj,H) p(cj|H) = Σj P(s|cj) p(cj|H)    (4)

where cj corresponds to the individual silhouettes present in the training data
and s is a shape cluster. P(s|cj) represents the probability that silhouette cj is
assigned to cluster s; it is 1 if cj is contained in shape cluster s and 0 otherwise.

By following the above procedure, we obtain the 4D maxima of P(on,λ,s).
This means in particular that the votes corresponding to these maxima conform
to a common shape cluster. As a result, the voting scheme produces hypotheses
with a consistent shape.
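The shape posterior of equation 4 under the hard silhouette-to-cluster assignment can be sketched as follows; silhouette identifiers, cluster names, and vote weights are illustrative:

```python
# Sketch of Eq. (4) for one 3D maximum H: each contributing vote traces
# back to a training silhouette c_j, whose share of the hypothesis score
# acts as p(c_j|H); the hard assignment P(s|c_j) routes it to one cluster.
from collections import defaultdict

def shape_posterior(contributing_votes, cluster_of):
    """contributing_votes: list of (silhouette_id, weight) pairs for H.
    cluster_of: silhouette_id -> shape cluster (hard assignment)."""
    total = sum(w for _, w in contributing_votes)
    posterior = defaultdict(float)
    for c_j, w in contributing_votes:
        # p(c_j|H) is the vote's share of the total hypothesis score
        posterior[cluster_of[c_j]] += w / total
    return dict(posterior)
```

The cluster with the highest posterior is the consistent shape associated with the hypothesis.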

2.3. Cross-Articulation Learning using Local Contexts

As will be seen in the experiments, the 4D voting procedure for individual object

shapes improves the recognition performance w.r.t. the original ISM approach.

In particular, the discriminative power of the learned object model is increased,
since it becomes possible to distinguish typical object articulations. While this
is a desirable goal, it involves a number of side effects.

Fig. 2. (Left) The same local feature can occur on globally dissimilar object shapes.
(Right) The comparison of local contexts (red) around interest points (yellow star)
influences the choice of object shapes considered in the recognition process.

On the one hand, we reduce the statistical significance of the object hypotheses, since the number of features contributing to each hypothesis has been
reduced. In essence, the votes are distributed over a range of articulation clusters. This can be easily seen from the schematic views of the original ISM model

(Fig. 1 left) and the 4D voting approach (Fig. 1 center). In the standard ISM

model feature occurrences from all training instances can be combined for a final

hypothesis. This is a desirable property, which we call cross-instance learning:
it uses the training images effectively and allows high recognition performance
to be obtained with a relatively small number of training images. Even though,
in the case of the 4D-ISM, codebook entries are shared and some limited cross-articulation learning is achieved, the feature occurrences and therefore the votes
are basically limited to a certain shape cluster. The goal of the following is
therefore to introduce an algorithm that allows for more effective cross-articulation
learning and thereby increases the generalization power of the approach without
losing the gained discriminative power of the 4D-ISM.

To illustrate the underlying idea, consider the images shown in figure 2 (left).

Assume that we are observing a head feature (shown as the yellow square in
the two left images). In that case, the observation of the head places very few
constraints on the particular position and shape of the legs. In terms of the

4D-ISM, this means that we should not restrict our votes to one particular

articulation but rather vote for a range of different but compatible articulations.

While, in principle, an increase in the number of training instances should
compensate for the limited cross-instance learning in the 4D-ISM, we propose,
motivated by the discussion above, another strategy that re-enables cross-instance
learning without the need for more training data. The principal idea is
that object shapes, while being globally dissimilar, are often very similar in
a more local context. So instead of considering only the global shape for the

assignment of feature occurrences to articulation clusters, we propose to compare

the local context of an interest point. This is illustrated in Figure 2 (right).

There, we consider a local context (represented as a local silhouette segment in
red) extracted around an interest point (yellow star) and depict locally similar

object silhouettes. As can be seen, an occurrence at the head (here the front of the

head) is compatible with many different articulations. An occurrence on the foot,

on the contrary, constrains the range of compatible articulations considerably.

Learning the Shape Distribution. In order to integrate this idea into our

probabilistic voting framework, we adapt equation 4 with information about the

similarity between local contexts on different object shapes.
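One plausible way to quantify such a local-context similarity, assuming a context is the set of silhouette points inside a square window around the interest point; the window size and the distance-to-similarity kernel are our illustrative assumptions, not the paper's actual choices:

```python
# Sketch of extracting a local context around an interest point and
# scoring how compatible two silhouettes are in that context. A local
# Chamfer-style distance is mapped to a similarity in (0, 1].

def local_context(silhouette, centre, radius):
    # keep only contour points inside a square window around the point
    cx, cy = centre
    return [(x, y) for x, y in silhouette
            if abs(x - cx) <= radius and abs(y - cy) <= radius]

def context_similarity(ctx_a, ctx_b):
    if not ctx_a or not ctx_b:
        return 0.0
    def directed(p, q):
        # mean nearest-neighbour (L1) distance from p to q
        return sum(min(abs(x - u) + abs(y - v) for u, v in q)
                   for x, y in p) / len(p)
    d = 0.5 * (directed(ctx_a, ctx_b) + directed(ctx_b, ctx_a))
    return 1.0 / (1.0 + d)  # distance -> similarity in (0, 1]
```

Identical contexts score 1.0, so globally dissimilar silhouettes that agree around the interest point are still treated as compatible, which is precisely the sharing behaviour motivated above.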