Keep It Simple And Sparse: Real-Time Action Recognition

Journal of Machine Learning Research (Impact Factor: 2.47). 09/2013; 14:2617-2640.


Sparsity has been shown to be one of the most important properties for visual recognition. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. Starting from RGBD images, we combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from the data. We then propose simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective real-time system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques for representing 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired with a Kinect sensor that includes complex actions and gestures differing only in small details, and a data set created for human-robot interaction purposes. Finally, we demonstrate that our system is also effective in a human-robot interaction setting and propose a memory game, "All Gestures You Can", to be played against a humanoid robot.
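
As a rough illustration only, the pipeline described above (per-frame 3DHOF/GHOG descriptors encoded by adaptive sparse coding, then classified with linear SVMs) could be sketched as follows. All names, parameter values and the random stand-in data are ours, not taken from the paper.

    # Hypothetical sketch of the abstract's pipeline: sparse-code per-frame
    # descriptors (stand-ins for concatenated 3DHOF + GHOG histograms) over a
    # learned dictionary, then classify the codes with a linear SVM.
    import numpy as np
    from sklearn.decomposition import DictionaryLearning
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X_train = rng.random((500, 128))   # stand-in for 3DHOF+GHOG descriptors
    y_train = rng.integers(0, 5, 500)  # stand-in action labels

    # Learn an overcomplete dictionary and sparse codes (adaptive sparse coding).
    dico = DictionaryLearning(n_components=256, transform_algorithm="lasso_lars",
                              transform_alpha=0.1, max_iter=10, random_state=0)
    codes_train = dico.fit_transform(X_train)

    # Linear SVM on the sparse codes.
    clf = LinearSVC(C=1.0).fit(codes_train, y_train)

    X_test = rng.random((10, 128))
    pred = clf.predict(dico.transform(X_test))

In the paper, on-line segmentation is performed jointly with recognition; the sketch only shows the per-descriptor classification step.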



    • "In contrast, if a new action starts and a old action ends, there exists a transitional stage, so all the estimated distances are similar and the standard deviation is relatively low. In [6] and [4], a similar method is used on the SVM scores. Fig.1 shows a segment of minimum distances between online covariance matrix and training covariance matrices of each action class. "
    ABSTRACT: Online action recognition aims to recognize actions from unsegmented streams of data in a continuous manner. One of the challenges in online recognition is the accumulation of evidence for decision making. This paper presents a fast and efficient online method to recognize actions from a stream of noisy skeleton data. The method adopts a covariance descriptor calculated from skeleton data and is based on a novel method for incrementally learning the covariance descriptors, referred to as weighted covariance descriptors, so that past frames contribute less to the descriptor while current frames and informative frames, such as key frames, contribute more. The online recognition is achieved using an efficient nearest neighbour search against a set of trained actions. Experimental results on the MSRC-12 Kinect Gesture dataset and our newly collected online action recognition dataset demonstrate the efficacy of the proposed method. (A minimal sketch of such a weighted covariance update appears after this citation list.)
    • "We experimentally provide the evidence of the presence of actions classes inferred by estimating the pairwise similarities of action sequences from the same view and across different views. Works related to the computational model we consider can be found in fields as video surveillance, video retrieval and robotics, where tasks as gesture and action recognition or behavior modeling have been for many years very fertile disciplines, and still are (Fanello et al., 2013; Malgireddy et al., 2012; Mahbub et al., 2011; Noceti and Odone, 2012; Wang et al., 2009) We refer the interested reader to a recent survey (Aggarwal and Ryoo, 2011) for a complete account on the topic. From the view-invariance standpoint, the problem has been addressed considering two different settings, i.e. observing the same dynamic event simultaneously from two (or more) cameras (Zheng et al., 2012; Wu and Jia, 2012; Li and Zickler, 2012; Huang et al., 2012a; Zheng and Jiang, 2013) or considering different instances of a same concept of dynamic event (Lewandowski et al., 2010; Gong and Medioni, 2011; Junejo et al., 2011; Li et al., 2012; Huang et al., 2012b). "
    ABSTRACT: This paper deals with the problem of estimating the affinity between different types of human actions observed from different viewpoints. We analyse simple repetitive upper-body human actions with the goal of producing a view-invariant model from simple motion cues inspired by studies on human perception. We adopt a simple descriptor that summarizes the evolution of the spatio-temporal curvature of the trajectories, which we use to evaluate the similarity between pairs of actions in a multi-level matching. We experimentally verified the presence of semantic connections between actions across views, inferring a relation graph that shows such affinities. (A toy curvature-based trajectory descriptor is sketched after this citation list.)
    VISAPP; 03/2015
    • "Temporal segmentation is then paramount to cut video streams into single action instances, consistent with the models learnt from the training sequences. We show how our method can be easily combined with the robust temporal segmentation algorithm presented in [12]. "
    ABSTRACT: We introduce an online action recognition system that can be combined with any set of frame-by-frame feature descriptors. Our system covers the frame feature space with classifiers whose distribution adapts to the hardness of locally approximating the Bayes optimal classifier. An efficient nearest neighbour search is used to find and combine the local classifiers that are closest to the frames of a new video to be classified. The advantages of our approach are: incremental training, frame-by-frame real-time prediction, nonparametric predictive modelling, video segmentation for continuous action recognition, no need to trim videos to equal lengths, and only one tuning parameter (which, for large datasets, can be safely set to the diameter of the feature space). Experiments on standard benchmarks show that our system is competitive with state-of-the-art non-incremental and incremental baselines. Keywords: action recognition, incremental learning, continuous action recognition, nonparametric model, real time, multivariate time series classification, temporal classification. (A toy version of local classifiers combined by nearest neighbour search is sketched after this citation list.)
    British Machine Vision Conference (BMVC); 01/2014
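
For the weighted covariance descriptor of the first citing abstract above, a minimal sketch follows. The exponential update rule and the decay factor alpha are our assumptions (the paper's weighting, e.g. emphasis on key frames, may differ), and the Frobenius distance stands in for whatever matrix metric the authors actually use.

    # Exponentially weighted covariance descriptor over a skeleton stream:
    # recent frames weigh more than past ones. Parameters are illustrative.
    import numpy as np

    def update_weighted_cov(mean, cov, frame, alpha=0.05):
        """One incremental update of the weighted mean and covariance."""
        delta = frame - mean
        mean = mean + alpha * delta
        cov = (1.0 - alpha) * (cov + alpha * np.outer(delta, delta))
        return mean, cov

    d = 60  # e.g. 20 joints x 3 coordinates
    mean, cov = np.zeros(d), np.eye(d) * 1e-6
    for frame in np.random.default_rng(0).random((100, d)):  # fake skeleton stream
        mean, cov = update_weighted_cov(mean, cov, frame)

    # Online recognition: nearest neighbour against per-class training
    # covariances (placeholder models, Frobenius distance for brevity).
    train_covs = {c: np.eye(d) * (c + 1) for c in range(3)}
    pred = min(train_covs, key=lambda c: np.linalg.norm(cov - train_covs[c], "fro"))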
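
For the VISAPP abstract, a toy curvature-based trajectory descriptor is sketched below; the discrete planar-curvature formula and the histogram summary are our choices, not the authors'.

    # Summarise how the curvature of a 2D motion trajectory evolves, then
    # compare two trajectories via histograms of curvature values.
    import numpy as np

    def curvature_profile(traj):
        """traj: (T, 2) array of image-plane positions over time."""
        x, y = traj[:, 0], traj[:, 1]
        dx, dy = np.gradient(x), np.gradient(y)
        ddx, ddy = np.gradient(dx), np.gradient(dy)
        denom = (dx**2 + dy**2) ** 1.5 + 1e-8  # avoid division by zero
        return (dx * ddy - dy * ddx) / denom

    def descriptor(traj, bins=16):
        """Histogram of curvature values as a compact summary."""
        k = curvature_profile(traj)
        hist, _ = np.histogram(k, bins=bins, range=(-1, 1))
        return hist / max(hist.sum(), 1)

    t = np.linspace(0, 2 * np.pi, 100)
    a = descriptor(np.stack([np.cos(t), np.sin(t)], axis=1))  # circular motion
    b = descriptor(np.stack([t, 0.5 * t], axis=1))            # straight motion
    print(np.linalg.norm(a - b))  # larger distance = less similar actions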
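
For the BMVC abstract, a toy version of combining local classifiers via nearest neighbour search; the plain k-NN vote over prototypes is a simplification of the adaptive covering described there, and all names and data are ours.

    # Prototypes cover the frame-feature space, each carrying a label vote;
    # a query frame is classified by its closest prototypes.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(1)
    protos = rng.random((200, 32))          # prototype centres in feature space
    proto_labels = rng.integers(0, 4, 200)  # label carried by each local classifier

    index = NearestNeighbors(n_neighbors=5).fit(protos)

    def classify_frame(feat):
        _, idx = index.kneighbors(feat[None, :])
        votes = np.bincount(proto_labels[idx[0]], minlength=4)
        return int(votes.argmax())

    print(classify_frame(rng.random(32)))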