Conference Paper

Recognizing Human Actions from Still Images with Latent Poses

DOI: 10.1109/CVPR.2010.5539879
Conference: The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010
Source: DBLP


We consider the problem of recognizing human actions from still images. We propose a novel approach that treats the pose of the person in the image as latent variables that will help with recognition. Different from other work that learns separate systems for pose estimation and action recognition, then combines them in an ad-hoc fashion, our system is trained in an integrated fashion that jointly considers poses and actions. Our learning objective is designed to directly exploit the pose information for action recognition. Our experimental results demonstrate that by inferring the latent poses, we can improve the final action recognition results.

    • "The pioneer pictorial structure work (Fischler & Elschlager, 1973) provided an inspirable framework for representing visual objects with a spring-like graph of parts. Following this direction, continuous efforts on the part-based models have been made for a wide range of computer vision tasks including object detection (Felzenszwalb et al., 2010; Chen et al., 2014), pose estimation (Yang & Ramanan, 2011; Chen & Yuille, 2014), semantic segmentation (Long et al., 2015; Chen et al., 2015) and action recognition (Yang et al., 2010; Zhu et al., 2013). Particularly, the DPMs (Felzenszwalb et al., 2010; Zhu et al., 2010), which are built on basis of the HOG features (Dalal & Triggs, 2005), have reached milestone of object detection in the past few years. "
    ABSTRACT: In this paper, we propose a deep part-based model (DeePM) for symbiotic object detection and semantic part localization. For this purpose, we annotate semantic parts for all 20 object categories on the PASCAL VOC 2010 dataset, which provides information on object pose, occlusion, viewpoint and functionality. DeePM is a latent graphical model based on the state-of-the-art R-CNN framework, which learns an explicit representation of the object-part configuration with flexible type sharing (e.g., a sideview horse head can be shared by a fully-visible sideview horse and a highly truncated sideview horse with head and neck only). For comparison, we also present an end-to-end Object-Part (OP) R-CNN which learns an implicit feature representation for jointly mapping an image ROI to the object and part bounding boxes. We evaluate the proposed methods for both the object and part detection performance on PASCAL VOC 2010, and show that DeePM consistently outperforms OP R-CNN in detecting objects and parts (by 0.7% and 5.9% in mAP, respectively). In addition, it obtains superior performance to Fast and Faster R-CNNs for object detection.
    Full-text · Article · Nov 2015
    • "To recognize action classes from still images accurately, researchers tend to integrate it with the task of pose estimation [7], [8]. In the integrated framework, these two tasks can help each other. "
    ABSTRACT: Recognizing actions from still images has been widely studied recently. In this paper, we model an action class as a flexible number of spatial configurations of body parts by proposing a new spatial SPN (Sum-Product Network). First, we discover a set of parts in image collections via unsupervised learning. Then, our new spatial SPN is applied to model the spatial relationships and the high-order correlations of parts. To learn robust networks, we further develop a hierarchical spatial SPN method, which models pairwise spatial relationships between parts inside sub-images and models the correlation of sub-images via extra layers of the SPN. Our method is shown to be effective on two benchmark datasets.
    Preview · Article · Nov 2015
    • "These approaches have reported very interesting results on challenging still image action datasets such as those described in [7] [8] [9]. At the opposite end of the spectrum are approaches based on the explicit recovery of body parts and the incorporation of structural information in the recognition process [10] [11]. The baseline model is a latent part-based model akin to Pictorial Structure which can be estimated as a joint, conditional or max-margin model [12] [13]. "
    ABSTRACT: Action recognition from still images is an important task for computer vision applications such as image annotation, robotic navigation, video surveillance and several others. Existing approaches mainly rely on either bag-of-feature representations or articulated body-part models. However, the relationship between the action and the image segments is still substantially unexplored. For this reason, in this paper we propose to approach action recognition by leveraging an intermediate layer of "superpixels" whose latent classes can act as attributes of the action. In the proposed approach, the action class is predicted by a structural model (learnt by Latent Structural SVM) based on measurements from the image superpixels and their latent classes. Experimental results on the challenging Stanford 40 Actions dataset report a significant average accuracy of 74.06% for the positive class and 88.50% for the negative class, giving evidence of the performance of the proposed approach.
    Full-text · Article · Jul 2015