Recognizing Human Actions from Still Images with Latent Poses
Weilong Yang, Yang Wang, and Greg Mori
School of Computing Science
Simon Fraser University
Burnaby, BC, Canada
firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
We consider the problem of recognizing human actions
from still images. We propose a novel approach that treats
the pose of the person in the image as latent variables
that will help with recognition. Different from other work
that learns separate systems for pose estimation and action
recognition, then combines them in an ad-hoc fashion, our
system is trained in an integrated fashion that jointly considers poses and actions. Our learning objective is designed to directly exploit the pose information for action recognition. Our experimental results demonstrate that by inferring the latent poses, we can improve the final action recognition accuracy.
Consider the two images shown in Fig. 1(left). Even
though only still images are given, we as humans can still
perceive the actions (walking, playing golf) conveyed by
those images. The primary goal of this work is to recognize
actions from still images. In still images, the information
about the action label of an image mainly comes from the
pose, i.e. the configuration of body parts, of the person in
the image. However, not all body parts are equally impor-
tant for differentiating various actions. Consider the poses
shown in Fig. 1(middle). The configurations of torso, head
and legs are quite similar for both walking and playing golf.
The main difference for these two actions in terms of the
pose is the configuration of the arms. For example, “playing
golf” seems to have very distinctive V-shaped arms, while
“walking” seems to have two arms hanging on the side. A
standard pose estimator tries to find the correct locations of
all the body parts. The novelty of our work is that we do
not need to correctly infer complete pose configuration in
order to do action recognition. In the example of “walking”
versus “playing golf”, as long as we get correct locations of
the arms, we can correctly recognize the action, even if the
locations of other body parts are incorrect. The challenge is how to learn a system that is aware of the importance of different body parts, so it can focus on the arms when trying to differentiate between “walking” and “playing golf”. We introduce a novel model that jointly learns poses and actions in a principled framework.
(Figure 1 caption: Our goal is to predict the action label of a still image. We treat the pose of the person in the image as “latent variables” in our system; the “pose” is learned in a way that is directly tied to action classification.)
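Concretely, a joint model of this kind can be read as a latent-variable scoring function: the action label is predicted by maximizing, over both labels and candidate latent poses, a linear score w · Φ(x, h, y). A minimal sketch with a toy indicator feature map (the names `phi`, `score`, and the weight layout are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def score(w, phi, x, y, poses):
    """Score action label y by maximizing over latent poses h:
    score(x, y) = max_h  w . phi(x, h, y)."""
    return max(float(np.dot(w, phi(x, h, y))) for h in poses)

def predict_action(w, phi, x, actions, poses):
    """Predict the action whose best latent pose scores highest."""
    return max(actions, key=lambda y: score(w, phi, x, y, poses))

# Toy setup: 2 actions, 3 candidate poses, hand-set weights.
# phi fills one indicator slot per (action, pose) pair.
def phi(x, h, y, n_poses=3, n_actions=2):
    v = np.zeros(n_poses * n_actions)
    v[y * n_poses + h] = x[h]   # evidence for pose h in image x
    return v

w = np.array([0.1, 0.9, 0.2,   # weights for action 0's poses
              0.8, 0.1, 0.1])  # weights for action 1's poses
x = np.array([0.5, 0.7, 0.4])  # toy per-pose image evidence
print(predict_action(w, phi, x, actions=[0, 1], poses=[0, 1, 2]))  # → 0
```

Because the max over poses is inside the score, the model can commit to whichever pose hypothesis best supports each action label, which is what lets it ignore unreliable body parts.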
Human action recognition is an extremely important
and active research area in computer vision, due to its
wide range of applications, e.g. surveillance, entertainment,
human-computer interaction, image and video search, etc.
Space constraints do not allow an extensive review of the
field, but a comprehensive survey is available in [9]. Most
of the work in this field focuses on recognizing actions from
videos [13, 15, 18] using motion cues, and a significant
amount of progress has been made in the past few years.
Action recognition from still images, on the other hand, has
not been widely studied. We believe analyzing actions from still images is important, since techniques developed for still images can also be applied to videos. There are also applications that directly
require understanding still images of human actions, e.g.
news/sports image retrieval and analysis.
Not surprisingly, recognizing human actions from still images is considerably more challenging than from video sequences.
Figure 9. Example visualizations of the latent poses on test images. For each action, we manually select some good and some bad estimation examples. The actions for the rows (from top) are running, walking, playing golf, sitting, and dancing, respectively.
location (x_k, y_k) in the image. The skeleton used for a par-
ticular poselet is obtained from the cluster center of the joint
locations of the corresponding poselet. In terms of pose es-
timation in the usual sense, those results are not accurate.
However, we can make several interesting observations. In
the “sitting” action, our model almost always correctly lo-
calizes the legs. In particular, it mostly chooses the poselet
that corresponds to the “A”-shaped legs (e.g. the first two images in the fourth row) or the triangle-shaped legs (e.g. the third image in the fourth row). It turns out the legs of a per-
son are extremely distinctive for the “sitting” action. So our
model “learns” to focus on localizing the legs for the sitting action; in particular, it learns that the “A”-shaped legs and the triangle-shaped legs are the most discriminative for this action. For the sitting action, the localized arms
are far from their correct locations. From the standard pose
estimation point of view, this is considered a failure case.
But for our application, this is fine since we are not aiming
to correctly localize all the parts. Our model will learn not
to use the localizations of the arms to recognize the sitting
action. Another example is the “walking” action (the im-
ages in the second row). For this action, our model almost
always correctly localizes the arms hanging on the two sides
of the torso, even on the bad examples. This is because
“hanging arms” is a very distinctive poselet for the walking
action. So our model learns to focus on this particular part
for walking, without getting distracted by other parts.
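As noted above, the skeleton visualized for a poselet is simply the cluster center of the joint locations of the training examples assigned to that poselet, translated to the detected location (x_k, y_k) in the image. A minimal sketch of that bookkeeping (array shapes and helper names are illustrative, not the authors' code):

```python
import numpy as np

def poselet_skeleton(joint_locations):
    """Given joint locations for all training examples assigned to one
    poselet cluster (shape: n_examples x n_joints x 2, in normalized
    patch coordinates), return the cluster-center skeleton: the mean
    (x, y) of each joint across the cluster members."""
    joints = np.asarray(joint_locations, dtype=float)
    return joints.mean(axis=0)          # shape: n_joints x 2

def place_skeleton(skeleton, anchor_xy):
    """Translate the cluster-center skeleton so it is drawn at the
    poselet's detected location (x_k, y_k) in the image."""
    return skeleton + np.asarray(anchor_xy, dtype=float)

# Toy cluster of 3 examples, 2 joints each.
cluster = [[[0.0, 0.0], [1.0, 2.0]],
           [[0.2, 0.0], [1.2, 2.2]],
           [[0.1, 0.3], [0.7, 1.8]]]
skel = poselet_skeleton(cluster)        # mean joint positions
print(place_skeleton(skel, (10.0, 20.0)))
```

Since the skeleton is a per-cluster average rather than a per-image estimate, it can only be as accurate as the poselet detection itself, which is consistent with the qualitative results discussed above.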
We have presented a model that integrates action recog-
nition and pose estimation. The main novelty of our model
is that although we consider these two problems together,
our end goal is action recognition, and we treat the pose
information as latent variables in the model. The pose is
directly learned in a way that is tied to action recognition.
This is very different from other work that learns a pose es-
timation system separately, then uses the output of the pose
estimation to train an action recognition system. Our exper-
imental results demonstrate that by inferring the latent pose,
we can improve the final action recognition results.
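The joint training referred to here can be viewed, in spirit, as a latent structural SVM: alternately complete the latent pose for each training image under the current weights, then take a subgradient step on the margin-rescaled hinge loss. A hedged sketch under a toy feature map (all names and the simple 0/1 action loss are illustrative assumptions, not the paper's exact learning algorithm):

```python
import numpy as np

def best_pose(w, phi, x, y, poses):
    """Complete the latent pose for image x under action label y."""
    return max(poses, key=lambda h: float(np.dot(w, phi(x, h, y))))

def latent_svm_step(w, phi, data, actions, poses, C=1.0, lr=0.1):
    """One alternating update: (1) fix the latent pose for the true
    label, (2) subgradient step on the margin-rescaled hinge loss
    with a 0/1 action loss."""
    grad = w.copy()                                  # from 0.5 * ||w||^2
    for x, y in data:
        h_true = best_pose(w, phi, x, y, poses)      # latent completion
        # most violated label/pose pair: argmax of loss + score
        y_hat, h_hat = max(((yy, hh) for yy in actions for hh in poses),
                           key=lambda p: (p[0] != y)
                           + float(np.dot(w, phi(x, p[1], p[0]))))
        violation = ((y_hat != y)
                     + np.dot(w, phi(x, h_hat, y_hat))
                     - np.dot(w, phi(x, h_true, y)))
        if violation > 0:                            # hinge is active
            grad += C * (phi(x, h_hat, y_hat) - phi(x, h_true, y))
    return w - lr * grad

# Toy problem: 2 actions, 2 candidate poses, indicator-style features.
def phi(x, h, y, n_poses=2, n_actions=2):
    v = np.zeros(n_poses * n_actions)
    v[y * n_poses + h] = x[h]     # evidence for pose h under action y
    return v

data = [(np.array([1.0, 0.2]), 0), (np.array([0.2, 1.0]), 1)]
w = np.zeros(4)
for _ in range(20):
    w = latent_svm_step(w, phi, data, actions=[0, 1], poses=[0, 1])
print(np.round(w, 2))  # weight mass concentrates on the discriminative (action, pose) pairs
```

The actual paper optimizes a similar non-convex objective with a specialized solver; this loop only illustrates the latent-completion/update structure that ties the learned pose directly to the action label.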
[1] Y. Altun, T. Hofmann, and I. Tsochantaridis. SVM learning for interdependent and structured output spaces. In Machine Learning with Structured Outputs. MIT Press, 2006.
[2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D human pose annotations. In ICCV, 2009.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[4] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In ICCV, 2009.
[5] T.-M.-T. Do and T. Artieres. Large margin training for hidden Markov models with partially observed states. In ICML, 2009.
[6] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[7] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55–79, January 2005.
[8] V. Ferrari, M. Marín-Jiménez, and A. Zisserman. Pose search: retrieving people using their pose. In CVPR, 2009.
[9] D. A. Forsyth, O. Arikan, L. Ikemoto, J. O'Brien, and D. Ramanan. Computational studies of human motion: Part 1, tracking and motion synthesis. Foundations and Trends in Computer Graphics and Vision, 1(2/3):77–254, July 2006.
[10] N. Ikizler, R. G. Cinbis, S. Pehlivan, and P. Duygulu. Recognizing actions from still images. In ICPR, 2008.
[11] N. Ikizler-Cinbis, R. G. Cinbis, and S. Sclaroff. Learning actions from the web. In ICCV, 2009.
[12] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 2008.
[13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[14] J. C. Niebles, B. Han, A. Ferencz, and L. Fei-Fei. Extracting moving people from internet videos. In ECCV, 2008.
[15] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. In BMVC, volume 3, pages 1249–1258, 2006.
[16] D. Ramanan. Learning to parse images of articulated bodies. In NIPS, volume 19, pages 1129–1136, 2007.
[17] D. Ramanan and D. A. Forsyth. Automatic annotation of everyday movements. In NIPS. MIT Press, 2003.
[18] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, volume 3, pages 32–36, 2004.
[19] C. Thurau and V. Hlaváč. Pose primitive based human action recognition in videos or still images. In CVPR, 2008.
[20] Y. Wang, H. Jiang, M. S. Drew, Z.-N. Li, and G. Mori. Unsupervised discovery of action classes. In CVPR, 2006.
[21] Y. Wang and G. Mori. Max-margin hidden conditional random fields for human action recognition. In CVPR, 2009.