Cross-Articulation Learning for Robust Detection of Pedestrians
Edgar Seemann and Bernt Schiele
{lastname}@mis.tu-darmstadt.de
Technical University of Darmstadt
http://www.mis.informatik.tu-darmstadt.de
Abstract. Recognizing categories of articulated objects in real-world
scenarios is a challenging problem for today's vision algorithms. Due to
the large appearance changes and intra-class variability of these objects,
it is hard to define a model which is both general and discriminative
enough to capture the properties of the category. In this work, we propose
an approach which aims for a suitable trade-off for this problem. On the
one hand, the approach is made more discriminative by explicitly
distinguishing typical object shapes. On the other hand, the method
generalizes well and requires relatively few training samples through
cross-articulation learning. The effectiveness of the approach is shown
and compared to previous approaches on two datasets containing
pedestrians with different articulations.
1. Introduction
In recent years a large number of approaches have been proposed for the
detection of object categories in still images. Categories of non-rigid
objects such as pedestrians have proven to be particularly challenging. The
high intra-class variability, which is caused by global appearance changes
and object articulations, requires recognition approaches that are highly
discriminative and at the same time generalize well.
In the literature, several approaches focus on the global structure of the
object [5,11,17,2], while others detect individual parts [4,3,10,8,16].
Gavrila [5] uses a hierarchy of object silhouettes and applies Chamfer
matching to obtain detection hypotheses. Papageorgiou & Poggio [11] train an
SVM based on wavelet features. Zhao & Thorpe [17] perform detection with a
neural network and exploit stereo information to pre-segment images. Dalal &
Triggs [2] compute a global gradient-based descriptor, similar to SIFT, to
train a linear SVM. Forsyth & Fleck [4] introduce the general methodology of
body plans for finding people in images. Felzenszwalb and Huttenlocher [3]
learn simplistic detectors for individual body parts. Ronfard et al. [19]
extended this work by using stronger classifiers such as SVMs. Mohan and
Papageorgiou [10] apply the wavelet-based detectors from [11] to detect body
parts and then use body geometry to infer a person's position and pose.
Viola et al. [16] use simple local features and a boosting scheme to train a
cascade of classifiers. Mikolajczyk et al. train body part classifiers with
boosting and combine them in a probabilistic framework.
In this work, instead of modeling individual object parts, we identify and
model typical object articulations or shapes. These typical shapes are
learnt automatically from motion segmentations, which can be computed from
video sequences with a Grimson-Stauffer background model [14]. The advantage
of this approach is that we do not need manual labeling of object parts.
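The motion segmentation itself is not a contribution of this paper; as a rough
illustration of the idea behind background modeling, the following sketch
maintains a single running Gaussian per pixel, which is a simplification of
the Stauffer-Grimson mixture model used in [14]. All function names and
parameter values here are illustrative assumptions, not the authors' code.

```python
import numpy as np

def update_background(mean, var, frame, alpha=0.05, k=2.5):
    """One step of a simplified per-pixel Gaussian background model.

    A pixel is labelled foreground when it deviates from the running
    Gaussian by more than k standard deviations; the background
    statistics are updated only on background pixels.  This is a
    single-Gaussian simplification of a Stauffer-Grimson mixture.
    """
    diff = np.abs(frame - mean)
    foreground = diff > k * np.sqrt(var)
    bg = ~foreground
    mean[bg] = (1 - alpha) * mean[bg] + alpha * frame[bg]
    var[bg] = (1 - alpha) * var[bg] + alpha * (frame[bg] - mean[bg]) ** 2
    return foreground

# toy example: a static background with one bright moving blob
mean = np.full((8, 8), 10.0)
var = np.full((8, 8), 1.0)
frame = np.full((8, 8), 10.0)
frame[2:5, 2:5] = 200.0            # the "pedestrian"
mask = update_background(mean, var, frame)
```

The binary mask plays the role of the object silhouette from which the
typical shapes are later learnt.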
The main contributions of this paper are the following. We introduce a novel
scheme to learn the relationship between arbitrary object parts or, as we
call them, local contexts and the global object shape. As a result, we
obtain an approach which captures large appearance variations in a single
model and implements a suitable trade-off between generalization performance
and discriminative power of the model. The method is able to share features
between typical object shapes and therefore requires relatively few training
images. In a sense, the approach generalizes the idea of sharing features
[15] to the sharing of local appearance across object instances and shapes.
A thorough evaluation shows that the proposed model outperforms previously
published methods on two challenging datasets for the task of pedestrian
recognition.
2. Recognition Algorithm
The recognition approach proposed in this paper extends the Implicit Shape
Model (ISM) developed by Leibe & Schiele [6]. This section introduces the
basic algorithm and discusses extensions to explicitly handle global
appearance changes and object articulations.
2.1. Standard ISM
The ISM is a voting framework which accumulates local image evidence to find
the most promising object hypotheses. It is capable of multi-scale
detection, and pixel-wise segmentation masks can be inferred for each
hypothesis. An additional reasoning step based on the Minimum Description
Length (MDL) principle makes the method more robust in the presence of
clutter and overlapping objects. The following gives a brief overview of the
method.
Codebook Representation. To represent an object category with an ISM, a
codebook or visual vocabulary of local appearances is built [6]. To this
end, a scale-invariant interest point detector is applied to each training
image and local descriptors are extracted. These descriptors are
subsequently clustered with an agglomerative clustering scheme. The
resulting set of local appearances represents typical structures of an
object category.
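The clustering step can be sketched as follows: a greedy average-linkage
agglomerative scheme that merges descriptor clusters until no pair is closer
than a cut-off, with the cluster means serving as codebook entries. The
function name, the Euclidean metric, and the cut-off value are illustrative
assumptions; the paper does not specify these details in this section.

```python
import numpy as np

def agglomerative_codebook(descriptors, t=0.5):
    """Greedy average-linkage agglomerative clustering.

    Repeatedly merges the two clusters with the smallest mean pairwise
    Euclidean distance until no pair is closer than the cut-off t.
    The returned codebook holds one prototype (the cluster mean) per
    remaining cluster.
    """
    clusters = [[d] for d in descriptors]
    while len(clusters) > 1:
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.mean([np.linalg.norm(a - b)
                             for a in clusters[i] for b in clusters[j]])
                if best is None or d < best:
                    best, pair = d, (i, j)
        if best > t:
            break                      # all remaining clusters are distinct
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [np.mean(c, axis=0) for c in clusters]

# two tight groups of toy descriptors collapse into two codebook entries
descs = [np.array([0.0, 0.0]), np.array([0.1, 0.0]),
         np.array([5.0, 5.0]), np.array([5.1, 5.0])]
codebook = agglomerative_codebook(descs, t=1.0)
```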
Spatial Occurrence Distribution. Once a codebook for an object category has
been learnt, we model the spatial occurrence distribution of its elements.
To do so, we record all locations (x-, y-position and scale) at which a
codebook entry matches the training instances.
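A minimal data structure for such an occurrence table might look as follows.
Storing the feature-to-centre offset normalised by the feature scale is a
common choice for scale-invariant voting; the exact normalisation used in the
ISM is an assumption here.

```python
# Occurrence table: for every codebook entry, the offsets (relative to the
# object centre) and the scale at which it matched a training instance.
occurrences = {}

def record_occurrence(entry_id, feat_xy, feat_scale, center_xy):
    """Store the feature-to-centre offset, normalised by the feature
    scale, so that a match at test time can vote for the object centre."""
    dx = (center_xy[0] - feat_xy[0]) / feat_scale
    dy = (center_xy[1] - feat_xy[1]) / feat_scale
    occurrences.setdefault(entry_id, []).append((dx, dy, feat_scale))

# codebook entry 7 matched at (40, 20), scale 2, on an object centred at (50, 60)
record_occurrence(7, feat_xy=(40, 20), feat_scale=2.0, center_xy=(50, 60))
```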
Hypotheses Voting. In the recognition process, we apply the same feature
extraction procedure as during training. Thus, we obtain a set of local
descriptors at various scales on the test image. Each extracted descriptor
casts votes for object hypotheses in a probabilistic extension of the
generalized Hough transform. The maxima of the 3D voting space (x, y, scale)
are back-projected to the image to retrieve the supporting local features of
each hypothesis. We present details of an improved version of this
probabilistic formulation when we introduce the extensions to deal with
different object shapes in Section 2.2.
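The voting step above can be sketched with a discretised accumulator: each
matched descriptor looks up its learned occurrences and adds weighted votes
for candidate object centres, and the strongest bin is returned as a
hypothesis. The discretisation and weight-splitting scheme are simplified
assumptions (the ISM uses continuous votes and a mean-shift mode search).

```python
import numpy as np

def cast_votes(matches, occurrences, shape=(64, 64), n_scales=4):
    """Accumulate Hough votes in a discretised (x, y, scale) space and
    return the index of the strongest hypothesis."""
    acc = np.zeros(shape + (n_scales,))
    for (fx, fy, fscale, entry_id, p_match) in matches:
        occs = occurrences.get(entry_id, [])
        for (dx, dy, oscale) in occs:
            x = int(round(fx + dx * fscale))
            y = int(round(fy + dy * fscale))
            s = min(n_scales - 1, int(round(fscale / oscale)))
            if 0 <= x < shape[0] and 0 <= y < shape[1]:
                # a match spreads its weight over all its occurrences
                acc[x, y, s] += p_match / len(occs)
    return np.unravel_index(np.argmax(acc), acc.shape)

# two features matching codebook entry 1 vote for the same object centre
occurrences = {1: [(5.0, 20.0, 1.0)]}
matches = [(10, 10, 1.0, 1, 0.9), (12, 8, 1.0, 1, 0.8)]
hyp = cast_votes(matches, occurrences)
```

Back-projection then amounts to collecting the (feature, occurrence) pairs
that deposited weight into the winning bin.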
Fig. 1. Schematic overview of the different object models. Both the standard
ISM and the 4D-ISM are special cases of the proposed approach. By learning
the shape distribution from local contexts, we combine the strengths of the
two other models.
2.2. Consistent Shape Voting
Figure 1 shows a schematic illustration of the standard ISM model on the
left. While the ISM allows for cross-instance learning and therefore
requires relatively little training data, it has no notion of possible
object articulations within the category. Local appearances are learnt from
all possible variations, which ensures good generalization performance but
results in relatively weak discriminative power, e.g. with respect to
background structures. By adding a fourth dimension for object articulations
to the ISM voting space (Figure 1, center), the model is able to distinguish
between object shapes and is thus more discriminant [13]. This, however,
requires an association of each training example with one of the typical
articulations or shapes.
Learning Object Shapes. Manual labelling of object shapes in the training
data is both time-consuming and difficult for more complex objects. We
therefore automatically learn the most prominent shapes from object
silhouettes. To this end, we apply agglomerative clustering with the global
Chamfer distance as similarity measure. The silhouettes are extracted from
video sequences with a motion-segmentation algorithm [14]. For the object
category of pedestrians, the silhouette is often a good indication of the
current body articulation. As an example, Figure 3 shows the articulation
clusters identified by this method for side-view pedestrians.
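The Chamfer distance between two silhouettes can be sketched as below, with
silhouettes given as point sets. A brute-force nearest-neighbour computation
is used here for clarity; in practice a distance transform makes this much
faster. The symmetric form (sum of both directed means) is an assumption.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two silhouettes given as
    (N, 2) arrays of contour points: the mean nearest-neighbour distance
    from a to b plus the one from b to a."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# toy silhouettes: identical contours have distance 0, shifted ones do not
sil_walk = np.array([[0, 0], [1, 0], [2, 0]], dtype=float)
sil_same = np.array([[0, 0], [1, 0], [2, 0]], dtype=float)
sil_far = np.array([[0, 10], [1, 10], [2, 10]], dtype=float)
```

This distance then serves as the similarity measure inside the same
agglomerative clustering scheme that was used to build the codebook.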
4D Voting. In this paragraph we describe the probabilistic formulation of
the extended 4D voting procedure. Let e be a local descriptor computed at
location ℓ. Each descriptor is compared to the codebook and may be matched
to several codebook entries. One can think of these matches as multiple
valid interpretations I_i of the descriptor, each of which holds with
probability p(I_i|e). Each interpretation then casts votes for different
object instances o_n, locations λ_x, λ_y, scales λ_σ and shape clusters s
according to its learned occurrence distribution P(o_n, λ, s | I_i, ℓ) with
λ = (λ_x, λ_y, λ_σ). Thus, any single vote has the weight
P(o_n, λ, s | I_i, ℓ) p(I_i|e) and the descriptor's contribution to the
hypothesis can be expressed by the following marginalization:

    P(o_n, λ, s | e, ℓ) = Σ_i P(o_n, λ, s | I_i, ℓ) p(I_i | e, ℓ)        (1)
                        = Σ_i P(λ, s | o_n, I_i, ℓ) p(o_n | I_i, ℓ) p(I_i | e)

    P(o_n, λ, s) ∼ Σ_k P(o_n, λ, s | e_k, ℓ_k)                           (2)
There are, however, several issues with this formulation. First, it is
difficult to estimate the probability density P(λ, s | o_n, I_i, ℓ) reliably
due to the increased dimensionality, in particular from a relatively small
set of data. Second, and quite importantly, the shape dimension s is neither
continuous nor ordered. It is therefore unclear how the maximum search can
be formulated efficiently. Applying a mean-shift search with a scale-adapted
kernel, as in the standard ISM approach, is no longer feasible. Therefore,
the following factorization is used to obtain a tractable solution:
    P(o_n, λ, s | e, ℓ) = Σ_i P(s | λ, o_n, I_i, ℓ) P(λ | o_n, I_i, ℓ)
                              p(o_n | I_i, ℓ) p(I_i | e)                 (3)
Please note that all but the first term (P(s | λ, o_n, I_i, ℓ)) are the same
as in [6]. Therefore we can use the following simple yet effective strategy
to find the maxima of equation 2. By first searching for the K maxima in the
marginalized 3D voting space, we not only reduce the computational
complexity but also constrain our search to those areas of the probability
density with enough evidence and training data. Choosing K sufficiently
large, we can find all maxima with high probability. For those K maxima we
then retrieve the contributing votes and use the following calculation (for
simplicity of notation we write P(s|H) = P(s | λ, o_n, I_i, ℓ)):
    P(s|H) = Σ_j P(s | c_j, H) p(c_j | H) = Σ_j P(s | c_j) p(c_j | H)    (4)

where c_j corresponds to the individual silhouettes present in the training
data and s is a shape cluster. P(s | c_j) represents the probability that
silhouette c_j is assigned to cluster s; it is 1 if silhouette c_j is
contained in shape cluster s and 0 otherwise.
By following the above procedure, we can obtain the 4D maxima of
P(o_n, λ, s). This means in particular that the votes corresponding to these
maxima conform to a common shape cluster. As a result, the voting scheme
produces hypotheses which have a consistent shape.
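Equation 4 reduces to a weighted histogram over shape clusters. A minimal
sketch, assuming each contributing vote carries the training silhouette c_j
it was learnt from and a hard silhouette-to-cluster assignment:

```python
import numpy as np

def shape_posterior(vote_weights, vote_silhouettes, sil_to_cluster, n_clusters):
    """Evaluate P(s|H) as in equation (4): P(s|c_j) is the hard cluster
    assignment of silhouette c_j, and p(c_j|H) is approximated by the
    normalised weight of the votes stemming from c_j."""
    p = np.zeros(n_clusters)
    total = sum(vote_weights)
    for w, c in zip(vote_weights, vote_silhouettes):
        p[sil_to_cluster[c]] += w / total
    return p

# three votes learnt from silhouettes 0, 1, 2; silhouettes 0 and 1 fall
# into shape cluster 0, silhouette 2 into cluster 1
sil_to_cluster = {0: 0, 1: 0, 2: 1}
p = shape_posterior([0.5, 0.3, 0.2], [0, 1, 2], sil_to_cluster, n_clusters=2)
```

The hypothesis is then assigned the cluster with the highest posterior, and
only votes consistent with that cluster are kept.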
2.3. CrossArticulation Learning using Local Contexts
As will be seen in the experiments, the 4D voting procedure for individual
object shapes improves the recognition performance w.r.t. the original ISM
approach. In particular, the discriminative power of the learned object
model is increased, since it enables the model to distinguish typical object
articulations. While this is a desirable goal, it involves a number of side
effects.
Fig. 2. (Left) The same local feature can occur on globally dissimilar
object shapes. (Right) The comparison of local contexts (red) around
interest points (yellow star) influences the choice of object shapes
considered in the recognition process.

On the one hand, we reduce the statistical significance of the object
hypotheses, since the number of features contributing to each hypothesis has
been reduced. In essence, the votes are distributed over a range of
articulation clusters. This can be easily seen from the schematic views of
the original ISM model (Fig. 1, left) and the 4D voting approach (Fig. 1,
center). In the standard ISM model, feature occurrences from all training
instances can be combined for a final hypothesis. This is a desirable
property, which we call cross-instance learning: it uses the training images
effectively and allows high recognition performance to be obtained with a
relatively small number of training images. Even though, in the case of the
4D-ISM, codebook entries are shared and some limited cross-articulation
learning is achieved, the feature occurrences and therefore the votes are
basically limited to a certain shape cluster. The goal of the following is
therefore to introduce an algorithm that allows for more effective
cross-articulation learning and thereby increases the generalization power
of the approach without losing the gained discriminative power of the
4D-ISM.
To illustrate the underlying idea, consider the images shown in Figure 2
(left). Assume that we are observing a head feature (shown as the yellow
square in the two left images). In that case, the observation of the head
places very few constraints on the particular position and shape of the
legs. In terms of the 4D-ISM, this means that we should not restrict our
votes to one particular articulation but rather vote for a range of
different but compatible articulations.
While, in principle, an increase in the number of training instances should
compensate for the limited cross-instance learning in the 4D-ISM, we
propose another strategy, motivated by the discussion above. Our strategy
re-enables cross-instance learning without the need for more training data.
The principal idea is that object shapes, while being globally dissimilar,
are often very similar in a more local context. So instead of considering
only the global shape for the assignment of feature occurrences to
articulation clusters, we propose to compare the local context of an
interest point. This is illustrated in Figure 2 (right). There, we consider
a local context (represented as a local silhouette segment in red) extracted
around an interest point (yellow star) and depict locally similar object
silhouettes. As can be seen, an occurrence at the head (here the front of
the head) is compatible with many different articulations. An occurrence on
the foot, on the contrary, constrains the range of compatible articulations
considerably.
Learning the Shape Distribution. To integrate this idea into our
probabilistic voting framework, we adapt equation 4 with information about
the similarity between local contexts on different object shapes.
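One hypothetical way to realise such an adaptation is to soften the hard
P(s|c_j) assignment: instead of 0/1 cluster membership, each feature receives
compatibility weights for all shape clusters, derived from the Chamfer
distance between its local context and the corresponding silhouette segments.
The exponential weighting and its bandwidth below are assumptions for
illustration, not the paper's exact formulation.

```python
import numpy as np

def soft_shape_assignment(context_dists, sigma=1.0):
    """Hypothetical soft replacement for the hard P(s|c_j) term: turn
    local-context distances to each shape cluster into normalised
    compatibility weights via an exponential kernel."""
    w = np.exp(-np.asarray(context_dists, dtype=float) / sigma)
    return w / w.sum()

# a head-like context is almost equally compatible with all clusters,
# while a foot-like context strongly prefers a single one
p_head = soft_shape_assignment([0.2, 0.25, 0.3])
p_foot = soft_shape_assignment([0.1, 3.0, 4.0])
```

This reproduces the behaviour argued for above: head features spread their
votes across many articulations, foot features concentrate them.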