Recognizing Human Actions from Still Images with Latent Poses
Weilong Yang, Yang Wang, and Greg Mori
School of Computing Science
Simon Fraser University
Burnaby, BC, Canada
wya16@sfu.ca, ywang12@cs.sfu.ca, mori@cs.sfu.ca
Abstract
We consider the problem of recognizing human actions
from still images. We propose a novel approach that treats
the pose of the person in the image as latent variables
that will help with recognition. Different from other work
that learns separate systems for pose estimation and action
recognition, then combines them in an ad-hoc fashion, our
system is trained in an integrated fashion that jointly
considers poses and actions. Our learning objective is designed
to directly exploit the pose information for action recognition.
Our experimental results demonstrate that by inferring
the latent poses, we can improve the final action recognition
results.
1. Introduction
Consider the two images shown in Fig. 1(left). Even
though only still images are given, we as humans can still
perceive the actions (walking, playing golf) conveyed by
those images. The primary goal of this work is to recognize
actions from still images. In still images, the information
about the action label of an image mainly comes from the
pose, i.e. the configuration of body parts, of the person in
the image. However, not all body parts are equally impor-
tant for differentiating various actions. Consider the poses
shown in Fig. 1(middle). The configurations of torso, head
and legs are quite similar for both walking and playing golf.
The main difference for these two actions in terms of the
pose is the configuration of the arms. For example, “playing
golf” seems to have very distinctive V-shaped arms, while
“walking” seems to have two arms hanging on the side. A
standard pose estimator tries to find the correct locations of
all the body parts. The novelty of our work is that we do
not need to correctly infer complete pose configuration in
order to do action recognition. In the example of “walking”
versus “playing golf”, as long as we get correct locations of
the arms, we can correctly recognize the action, even if the
locations of other body parts are incorrect. The challenge is
how to learn a system that is aware of the importance of different
body parts, so it can focus on the arms when trying to
differentiate between “walking” and “playing golf”. We introduce
a novel model that jointly learns poses and actions
in a principled framework.
Figure 1. Illustration of our proposed approach. Our goal is to infer
the action label of a still image. We treat the pose of the person
in the image as “latent variables” in our system. The “pose” is
learned in a way that is directly tied to action classification.
Human action recognition is an extremely important
and active research area in computer vision, due to its
wide range of applications, e.g. surveillance, entertainment,
human-computer interaction, image and video search, etc.
Space constraints do not allow an extensive review of the
field, but a comprehensive survey is available in [9]. Most
of the work in this field focuses on recognizing actions from
videos [13, 15, 18] using motion cues, and a significant
amount of progress has been made in the past few years.
Action recognition from still images, on the other hand, has
not been widely studied. We believe analyzing actions from
still images is important. Progress made here can be directly
applied to videos. There are also applications that directly
require understanding still images of human actions, e.g.
news/sports image retrieval and analysis.
Not surprisingly, recognizing human actions from still
images is considerably more challenging than from video
sequences. In videos, the motion cue provides a rich source of
information for differentiating various actions. But in still
images, the only information we can rely on is the shape (or
the pose) of the person in an image. Previous work mainly
focuses on building good representations for shapes and
poses of people in images. Wang et al. [20] cluster different
human poses using distances calculated from deformable
shape matching. Thurau and Hlaváč [19] represent actions
using histograms of pose primitives computed by
non-negative matrix factorization. Ikizler et al. [10] recognize
actions using a descriptor based on histograms of oriented
rectangles. Ikizler-Cinbis et al. [11] learn actions from web
images using HOG descriptors [3]. A limitation of these
approaches is that they all assume an image representation
based on global templates, i.e. an image is represented by
a feature descriptor extracted from the whole image. This
representation became popular due to its success in
pedestrian detection, in particular the work on histograms of
oriented gradients (HOG) by Dalal and Triggs [3]. It may be
appropriate for pedestrian detection: since most pedestrians
are upright, a single global template can represent them
reasonably well. But when it comes to action recognition,
global templates are not flexible enough to represent the large
variations within an action. For example, consider the images of the
“playing golf” action in Fig. 4. It is hard to imagine that a
single global template can capture all the pose variations of
this action. Recently, Felzenszwalb et al. [6] showed that
part-based representations can better capture the pose variations
of an object, and hence outperform global template
representations. In this paper, we build on the same intuition
and demonstrate that part-based representations are useful
for action recognition in still images as well. A major dif-
ference of our work from [6] is that we have ground-truth
labeling of the pose on the training data, i.e. our “parts” are
semantically meaningful.
Another important goal of this paper is to bridge the gap
between human action recognition and human pose esti-
mation. Those are two closely related research problems.
If we can reliably estimate the pose of a person, we can
use this information to recognize the action. However, in
the literature, they are typically treated as two separate
research problems, and there has been very little work
on combining them. There is some work on trying
to combine these two problems in a cascaded way, e.g. by
building an action recognition system on top of the output
of a pose estimation system. For example, Ramanan and
Forsyth [17] annotate and synthesize human actions in 3D
by tracking people in 2D and matching the tracks to an annotated
motion capture dataset. Their work uses videos rather than
still images, but the general idea is similar. Ferrari et al. [8]
retrieve TV shots containing a particular 2D human pose by
first estimating the human pose, then searching shots based
on a feature vector extracted from the pose. But it has been
difficult to establish the value of pose estimation for action
recognition in this cascaded manner, mainly because pose
estimation is still a largely unsolved problem. It is
questionable whether the output of any pose estimation algorithm is
reliable enough to be directly used for action recognition.
Figure 2. Difference between previous work and ours. (Top) Previous
work typically approaches pose estimation and action recognition
as two separate problems, and uses the output of the former
as the input to the latter. (Bottom) We treat pose estimation and
action recognition as a single problem, and learn everything in
an integrated framework.
In this paper, we propose a novel way of combining ac-
tion recognition and pose estimation together to achieve
the end goal of action recognition. Our work differs
from previous work in two respects. First, instead of
representing the human pose as the configuration of kinematic
body parts [16], e.g. upper-limb, lower-limb, head,
etc., we choose to use an exemplar-based pose representation,
the “poselet”. The notion of a “poselet” was first proposed
in [2], where it denotes a set of patches with similar 3D
pose configuration. In this paper, for the purpose of action
recognition, we further restrict those patches not only
to have similar configuration, but also to come from the same action
class. Second, as illustrated by the diagram in Fig. 2 (top),
previous work typically treats pose estimation and action
recognition as two separate learning problems, and uses the
output of a pose estimation algorithm as the input of an ac-
tion recognition system [8, 10]. As pointed out earlier, the
problem with this approach is that the output of the pose es-
timation is typically not reliable. Instead, as illustrated by
the diagram in Fig. 2 (bottom), we treat pose estimation and
action recognition as two components of a single learning
problem, and jointly learn the whole system in an integrated
manner. But our learning objective is designed in a way that
allows pose information to help action classification.
The high-level idea of our proposed approach can be
seen from Fig. 1. Our goal is to infer the action label of a
still image. We treat the pose of the person as intermediate
information useful for recognizing the action. But instead
of trying to infer the pose correctly using a pose estimation
algorithm, we treat the pose as latent variables in the whole
system. Compared with previous work on exploiting pose
for recognition [17, 8], the “pose” in our system is learned
in a way that is directly tied to our end goal of action clas-
sification.
2. Pose Representation
In this paper, we treat human pose as latent information
and use it to assist the task of action recognition. Since
we do not aim to obtain good pose estimation results in the
end, the latent pose in our approach is not restricted to any
specific type of pose representation. Because our focus is
action recognition, we decide to choose a coarse exemplar-
based pose representation. It is an action-specific variant of
the “poselet” proposed in [2]. In this paper, we use the
term “poselet” to refer to a set of patches that not only have
similar pose configuration, but also come from the same action
class. Fig. 3 illustrates the four poselets of a walking image.
As we can see, the poselet normally covers more than one
semantically meaningful part in terms of limbs and thus it is
distinct from the background. So, the detection of poselets
is more reliable than limb detection, especially with clut-
tered backgrounds.
In [2], a dataset is built where the joint positions of each
human image are labeled in 3D space via a 2D-3D lifting
procedure. We instead simply annotate the joint positions of the
human body in the 2D image space, as shown in Fig. 4. From
the pose annotation, we can easily collect a set of patches
with similar pose configuration. Based on the intuition that
action-specific parts contain more discriminative information,
we select the poselets per action. For example,
we would like to select a number of poselets from
running-legs, or walking-arms. The procedure of poselet
selection for a particular action (e.g. running) is as follows:
1. We first divide the human pose annotation of the running
images into four parts: legs, left-arm, right-arm, and upper-body.
2. We cluster the joints on each part into several
clusters based on their normalized x and y coordinates.
3. We remove clusters with very few examples.
4. Based on the pose clusters, we crop the corresponding patches from
the images and form a set of poselets for the running action.
Representative poselets from the running action are
shown in Fig. 5. As we can see, within each poselet the
appearance of each patch looks different, but the patches have very
similar semantic meaning. As pointed out in [2], this is
one advantage of using poselets. We repeat this process for
the other actions and obtain 90 poselets in total.
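The selection steps above can be sketched in Python. This is only an illustration under stated assumptions: the paper does not say which clustering algorithm is used, so plain k-means with a deterministic farthest-point initialization is assumed here, and the names `kmeans`, `select_poselets`, and `min_size` are hypothetical.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain k-means (an assumed choice); returns a cluster index per point."""
    # Farthest-point initialization keeps the sketch deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        # Update step: each non-empty cluster's center becomes its mean.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assign

def select_poselets(part_joints, k, min_size):
    """part_joints: one tuple of normalized (x, y) joint coordinates per image,
    for a single part (e.g. legs) of a single action (e.g. running).
    Returns {cluster_id: [image indices]}, dropping clusters with few examples."""
    clusters = {}
    for idx, c in enumerate(kmeans(part_joints, k)):
        clusters.setdefault(c, []).append(idx)
    return {c: idxs for c, idxs in clusters.items() if len(idxs) >= min_size}
```

Each surviving cluster would then define one poselet, and the patches of its member images would be cropped to form the positive examples for that poselet.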
In order to detect the presence of each poselet, we train a
classifier for each poselet. We use a standard linear SVM
and the Histograms of Oriented Gradients (HOG) features proposed
by Dalal and Triggs [3]. The positive examples are the
patches from each poselet cluster. The negative examples
are randomly selected from images whose action label differs
from that of the positive examples. For example, when we
train the classifier for one of the “running-legs” poselets, we
select the negative examples from all action categories
except for the running action. The learned running poselet
templates are visualized in the last column in Fig. 5.
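The negative-example rule described above can be illustrated with a short Python sketch. The data layout (a list of `(action_label, patch_id)` pairs) and the function name `sample_negatives` are assumptions made for illustration; the paper does not specify how many negatives are drawn.

```python
import random

def sample_negatives(patches, positive_action, n, seed=0):
    """Randomly pick up to n patches whose action label differs from the
    action of the poselet being trained (hypothetical helper)."""
    pool = [p for p in patches if p[0] != positive_action]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(pool, min(n, len(pool)))
```

For a “running-legs” poselet, `positive_action` would be the running label, so every sampled patch comes from walking, playing golf, sitting, or dancing images.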
Figure 3. Visualization of the poselets for a walking image.
The ground-truth skeleton is overlaid on the image. Examples of
poselets for each part are shown.
Figure 4. Sample images of the still image action dataset [11], and
the ground truth pose annotation. The locations of 14 joints have
been annotated on each action image. (a) Running; (b) Walking;
(c) Playing Golf; (d) Sitting; (e) Dancing.
3. Model Formulation
Let I be an image containing a person. In this paper,
we consider a figure-centric representation where I only
contains one person centered in the middle of the image.
This representation can be obtained from a standard pedestrian
detection system. Let $Y$ be the action label of the
person, and $L$ be the pose of the person. We denote $L$ as
$L = (l_0, l_1, \ldots, l_{K-1})$, where $K$ is the number of parts. In
this paper, we choose $K = 4$, corresponding to upper-body,
legs, left-arm, and right-arm. The configuration of the $k$-th
part $l_k$ is represented as $l_k = (x_k, y_k, z_k)$, where $(x_k, y_k)$
indicates the $(x, y)$ location of the $k$-th part in the image,
and $z_k \in \mathcal{Z}_k$ is the index of the chosen poselet for the $k$-th
part. We use $\mathcal{Z}_k$ to denote the poselet set corresponding
to part $k$. In this paper, $|\mathcal{Z}_k|$ is 26, 20, 20, and 24
for the four parts legs, left-arm, right-arm, and upper-body, respectively,
based on our clustering results.
Similar to the standard pictorial structure models [7, 16]
in human pose estimation, we use an undirected graph
G = (V,E) to constrain the configuration of the pose L.
Figure 5. Examples of poselets for each part from the running ac-
tion. Each row corresponds to one poselet. The last column is the
visualization of the filters for each poselet learned from SVM +
HOG.
Figure 6. The four part star structured model. We divide the pose
into four parts: legs, left-arm, right-arm, and upper-body.
Usually the kinematic tree of the human body is used. A
vertex $j \in V$ corresponds to the configuration $l_j$ of the $j$-th
part, and an edge $(j, k) \in E$ indicates the dependency between
two connected parts $l_j$ and $l_k$. In this paper, we use
a simple four-part star-structured model, as shown in Fig. 6.
The upper-body part is the root node of G and other parts
are connected to the root node. We emphasize that our al-
gorithm is not limited to the four part star structure and can
be easily generalized to other types of tree structures.
Our training data consists of images with ground-truth
labels of their action classes and poses (i.e. (x,y) location
of each part and its chosen poselet). The ground-truth pose-
let of a part is obtained by tracing back the poselet cluster
membership of this part. Given a set of $N$ training examples
$\{(I^{(n)}, L^{(n)}, Y^{(n)})\}_{n=1}^{N}$, our goal is to learn a model
that can be used to assign the class label $Y$ to an unseen
test image $I$. Note that during testing, we do not know the
ground-truth pose $L$ of the test image $I$.
We are interested in learning a discriminative function
$H : \mathcal{I} \times \mathcal{Y} \to \mathbb{R}$ over an image $I$ and its class label $Y$,
where $H$ is parameterized by $\Theta$. During testing, we can
predict the class label $Y^*$ of an input image $I$ as:
$$Y^* = \arg\max_{Y \in \mathcal{Y}} H(I, Y; \Theta) \tag{1}$$
We assume $H(I, Y; \Theta)$ takes the following form:
$$H(I, Y; \Theta) = \max_{L} \Theta^\top \Psi(I, L, Y) \tag{2}$$
where $\Psi(I, L, Y)$ is a feature vector depending on the image
$I$, its pose configuration $L$ and its class label $Y$. We
define $\Theta^\top \Psi(I, L, Y)$ as follows:
$$\Theta^\top \Psi(I, L, Y) = \sum_{j \in V} \alpha_j^\top \phi(I, l_j, Y) + \sum_{(j,k) \in E} \beta_{jk}^\top \psi(l_j, l_k, Y) + \eta^\top \omega(l_0, Y) + \gamma^\top \varphi(I, Y) \tag{3}$$
The model parameters $\Theta$ are simply the concatenation
of the parameters in all the factors, i.e.
$\Theta = \{\alpha_j : j \in V\} \cup \{\beta_{jk} : (j,k) \in E\} \cup \{\eta\} \cup \{\gamma\}$.
The details of the potential
functions in Eqn. (3) are described below.
Part appearance potential $\alpha_j^\top \phi(I, l_j, Y)$: This potential
function models the compatibility between the action
class label $Y$, the configuration $l_j = (x_j, y_j, z_j)$ of the $j$-th
part, and the appearance of the image patch extracted from
the location $(x_j, y_j)$. It is parameterized as:
$$\alpha_j^\top \phi(I, l_j, Y) = \sum_{a \in \mathcal{Y}} \sum_{b \in \mathcal{Z}_j} \alpha_{jab}^\top \cdot \mathbf{1}_a(Y) \cdot \mathbf{1}_b(z_j) \cdot f(I(l_j)) \tag{4}$$
where $\mathbf{1}_a(X)$ is an indicator that takes the value 1 if $X = a$,
and 0 otherwise. We use $f(I(l_j))$ to denote the feature
vector extracted from the patch defined by $l_j = (x_j, y_j, z_j)$
in the image $I$. The poselet set for the $j$-th part is denoted as
$\mathcal{Z}_j$. The parameter $\alpha_{jab}$ represents a template for the $j$-th
part if the action label is $a$ and the chosen poselet for the
$j$-th part is $b$.
Instead of keeping $f(I(l_j))$ as a high-dimensional vector,
we simply use the output of an SVM classifier trained
on a particular poselet as the single feature. We append a
constant 1 to $f(I(l_j))$ to learn a model with a bias term. In
other words, let $f_{ab}(I(l_j))$ be the score of the SVM trained
with action $a$ and poselet $b$. Then the parameterization can
be rewritten as:
$$\alpha_j^\top \phi(I, l_j, Y) = \sum_{a \in \mathcal{Y}} \sum_{b \in \mathcal{Z}_j} \alpha_{jab}^\top \cdot \mathbf{1}_a(Y) \cdot \mathbf{1}_b(z_j) \cdot [f_{ab}(I(l_j)); 1] \tag{5}$$
This trick greatly speeds up our learning algorithm. Similar
tricks are used in [4].
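Because the two indicators in Eqn. (5) select exactly one action and one poselet, the double sum collapses to a single weighted SVM score plus a bias. The Python sketch below illustrates this; the dictionary layout and the function names are hypothetical, and each parameter is assumed to be stored as a `(weight, bias)` pair matching the feature $[f_{ab}(I(l_j)); 1]$.

```python
def appearance_full_sum(alpha_j, actions, poselets, y, z_j, svm_scores):
    """Literal evaluation of Eqn. (5) for part j: sum over all (a, b) pairs."""
    total = 0.0
    for a in actions:
        for b in poselets:
            indicator = (1.0 if a == y else 0.0) * (1.0 if b == z_j else 0.0)
            weight, bias = alpha_j[(a, b)]
            # alpha_jab^T [f_ab; 1] = weight * SVM score + bias * 1
            total += indicator * (weight * svm_scores[(a, b)] + bias)
    return total

def appearance_collapsed(alpha_j, y, z_j, svm_scores):
    """Same value, using only the term selected by the indicators."""
    weight, bias = alpha_j[(y, z_j)]
    return weight * svm_scores[(y, z_j)] + bias
```

The collapsed form touches two scalars per part instead of a high-dimensional template, which is one way to read the speed-up mentioned in the text.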
Pairwise potential $\beta_{jk}^\top \psi(l_j, l_k, Y)$: This potential function
represents the dependency between the $j$-th and the $k$-th
part, for a given class label $Y$. Similar to [16], we use discrete
binning to model the spatial relations between parts.
We define this potential function as
$$\beta_{jk}^\top \psi(l_j, l_k, Y) = \sum_{a \in \mathcal{Y}} \beta_{jka}^\top \cdot \mathrm{bin}(l_j - l_k) \cdot \mathbf{1}_a(Y) \tag{6}$$
where $\mathrm{bin}(l_j - l_k)$ is a feature vector that bins the relative
location of the $j$-th part with respect to the $k$-th part according
to the $(x, y)$ components of $l_j$ and $l_k$. Hence $\mathrm{bin}(l_j - l_k)$
is a sparse vector of all zeros with a single one for the occupied
bin. Here $\beta_{jka}$ is a model parameter that favors certain
relative bins for the $j$-th part with respect to the $k$-th part for
the action class label $a$.
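Since $\mathrm{bin}(l_j - l_k)$ has a single non-zero entry, the pairwise potential reduces to a table lookup. The Python sketch below shows one possible binning; the offset range (`max_dx`, `max_dy`), the clamping of out-of-range offsets, and the row-major bin ordering are assumptions for illustration and are not specified in the paper.

```python
def relative_bin_index(l_j, l_k, n_bins_x, n_bins_y, max_dx, max_dy):
    """Map the (x, y) offset of part j relative to part k to a flat bin index.
    l_j and l_k are (x, y, z) tuples; the poselet index z is ignored here."""
    dx, dy = l_j[0] - l_k[0], l_j[1] - l_k[1]

    def to_bin(d, max_d, n):
        # Scale [-max_d, max_d) onto [0, n) and clamp offsets outside the range.
        raw = int((d + max_d) * n / (2 * max_d))
        return min(n - 1, max(0, raw))

    return to_bin(dy, max_dy, n_bins_y) * n_bins_x + to_bin(dx, max_dx, n_bins_x)

def pairwise_score(beta_jka, bin_index):
    """beta_jka^T bin(l_j - l_k): the dot product with a one-hot vector
    is just the weight stored at the occupied bin."""
    return beta_jka[bin_index]
```

With this layout, `beta_jka` is a flat list with `n_bins_x * n_bins_y` entries for each edge and action label.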
Root location potential $\eta^\top \omega(l_0, Y)$: This potential
function models the compatibility between the action class
label $Y$ and the root location. Here $l_0$ denotes the configuration
of the “root” part, i.e. upper-body in our case. It is
parameterized as:
$$\eta^\top \omega(l_0, Y) = \sum_{a \in \mathcal{Y}} \eta_a^\top \cdot \mathrm{bin}(l_0) \cdot \mathbf{1}_a(Y) \tag{7}$$
We discretize the image grid into $h \times w$ spatial bins, and
$\mathrm{bin}(l_0)$ is a length $h \times w$ sparse vector of all zeros with a
single one for the spatial bin occupied by the root part. The
parameter $\eta_a$ favors certain bins (possibly those in the middle
of the image) for the location of the root part for the
action label $a$. For example, for the running and walking
actions, the root part may appear in the upper-middle part
of the image with high probability, while for the sitting or
playing golf action, the root part may appear in the center-middle
or lower-middle part of the image. This potential
function deals with different root locations for different actions.
It also allows us to handle the unreliability caused by
the human detection system.
Global action potential $\gamma^\top \varphi(I, Y)$: This potential
function represents a global template model for action
recognition from still images without considering the pose
configuration. It is parameterized as follows:
$$\gamma^\top \varphi(I, Y) = \sum_{a \in \mathcal{Y}} \gamma_a^\top \cdot \mathbf{1}_a(Y) \cdot f(I) \tag{8}$$
where $f(I)$ is a feature vector extracted from the whole image
$I$. The parameter $\gamma_a$ is a template for the action class
$a$. This potential function measures the compatibility between
the model parameter $\gamma$ and the combination of image
observation $f(I)$ and its class label $Y$. Similar to the part
appearance model, we represent $f(I)$ as a vector of outputs
of a multi-class SVM classifier.
4. Learning and Inference
We now describe how to infer the action label Y given
the model parameters Θ (Sec. 4.1), and how to learn the
model parameters from a set of training data (Sec. 4.2).
4.1. Inference
Given the model parameters and a test image $I$, we can
enumerate all the possible action labels $Y \in \mathcal{Y}$ and predict
the action label $Y^*$ of $I$ according to Eqn. (1). For a fixed
$Y$, we need to solve an inference problem of finding the best
pose $L_{\text{best}}$ as follows:
$$L_{\text{best}} = \arg\max_{L} \Theta^\top \Psi(I, L, Y) = \arg\max_{L} \Big[ \sum_{j \in V} \alpha_j^\top \phi(I, l_j, Y) + \sum_{(j,k) \in E} \beta_{jk}^\top \psi(l_j, l_k, Y) + \eta^\top \omega(l_0, Y) \Big] \tag{9}$$
Note for a fixed Y , the global action potential function is a
constant and has nothing to do with the pose L, so we omit
it from above equation. Since we assume a star model on L,
the inference problem in Eqn. (9) can be efficiently solved
via dynamic programming.
In this paper, we choose the size of the relative location binning
$\mathrm{bin}(l_j - l_k)$ as $32 \times 15$. With such a discrete binning
scheme, the inference can be solved efficiently by dynamic
programming, even without using the generalized
distance transform [7]. The inference for a fixed $Y$ on an
image only takes 0.015s with our MATLAB/MEX implementation.
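The inference in Eqn. (9) and the prediction rule in Eqn. (1) can be sketched in Python for a star model. This is an illustrative sketch, not the authors' MATLAB/MEX code: it assumes every potential has already been evaluated into lookup tables for a fixed $Y$, and that each "state" index enumerates one (location, poselet) choice for a part. The names `best_pose_score` and `predict` are hypothetical.

```python
def best_pose_score(unary, pairwise, root_bias):
    """Maximize the pose-dependent terms of Eqn. (9) for one fixed action label.
    unary[k][s]       : appearance score of part k in state s (part 0 is the root)
    pairwise[k][s][r] : score of child k in state s given the root in state r
                        (pairwise[0] is unused and may be None)
    root_bias[r]      : root location score for root state r
    Returns (best score, [best state of each part])."""
    best_score, best_pose = float("-inf"), None
    for r, root_unary in enumerate(unary[0]):
        score, pose = root_unary + root_bias[r], [r]
        # In a star model the children are independent once the root is fixed,
        # so each child can be maximized on its own.
        for k in range(1, len(unary)):
            s_best = max(range(len(unary[k])),
                         key=lambda s: unary[k][s] + pairwise[k][s][r])
            score += unary[k][s_best] + pairwise[k][s_best][r]
            pose.append(s_best)
        if score > best_score:
            best_score, best_pose = score, pose
    return best_score, best_pose

def predict(tables_per_label, global_score):
    """Eqn. (1): pick the label with the highest H(I, Y), where H adds the
    pose-independent global action score to the best pose score."""
    return max(tables_per_label,
               key=lambda y: best_pose_score(*tables_per_label[y])[0] + global_score[y])
```

The cost per label is proportional to the number of root states times the total number of child states, which is why enumerating all labels and all discrete bins remains cheap.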
4.2. Learning
Now we describe how to train the model parameters $\Theta$
from $N$ training examples $\{(I_n, L_n, Y_n)\}_{n=1,2,\ldots,N}$. If we
assume the pose $L$ is unobserved on the training data, we
can learn $\Theta$ using the latent SVM formulation [6, 21] as
follows:
$$\begin{aligned}
& \min_{\Theta,\ \xi_n \ge 0} \ \Theta^\top \Theta + C \sum_{n} \xi_n \\
& \text{s.t.} \quad \underbrace{\max_{L} \Theta^\top \Psi(I_n, L, Y_n)}_{H(I_n, Y_n; \Theta)} - \underbrace{\max_{L} \Theta^\top \Psi(I_n, L, Y)}_{H(I_n, Y; \Theta)} \ \ge\ \Delta(Y, Y_n) - \xi_n, \quad \forall n,\ \forall Y \in \mathcal{Y}
\end{aligned} \tag{10}$$
where $\Delta(Y, Y_n)$ is a function measuring the loss incurred
by classifying the example $I_n$ to be $Y$, while the true class
label is $Y_n$. We use the 0-1 loss defined as follows:
$$\Delta(Y, Y_n) = \begin{cases} 1 & \text{if } Y \ne Y_n \\ 0 & \text{otherwise} \end{cases} \tag{11}$$
The constraint in Eqn. (10) specifies the following intuition.
For the $n$-th training example, we want the score