Keep It Simple And Sparse: Real-Time Action Recognition

Journal of Machine Learning Research (Impact Factor: 2.47). 09/2013; 14:2617-2640.

ABSTRACT Sparsity has been shown to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start from RGBD images, combine motion and appearance cues, and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from the data. We then propose simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective real-time system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques for representing 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired with a Kinect sensor that includes complex actions and gestures differing only in small details, and a data set created for human-robot interaction purposes. Finally, we demonstrate that our system is also effective in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot.
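
For readers who want the shape of the pipeline in code, below is a minimal Python sketch of the general pattern the abstract describes: per-frame descriptors are sparse-coded against a learned dictionary, pooled over the video, and classified with a linear SVM. The descriptor stub, dictionary size, sparsity penalty, and max-pooling rule are illustrative assumptions, not the authors' implementation (which also performs on-line temporal segmentation).

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode
from sklearn.svm import LinearSVC

def frame_descriptor(rgbd_frame):
    """Placeholder for the per-frame 3DHOF + GHOG descriptor."""
    raise NotImplementedError

def learn_dictionary(train_descriptors, n_atoms=256, alpha=1.0):
    # Learn a sparse-coding dictionary from training frame descriptors
    # (atom count and penalty are illustrative, not the paper's settings).
    dl = DictionaryLearning(n_components=n_atoms, alpha=alpha, max_iter=50)
    dl.fit(train_descriptors)
    return dl.components_

def encode_video(descriptors, dictionary, alpha=1.0):
    # Sparse-code each frame against the dictionary, then max-pool the
    # absolute codes so videos of any length map to one fixed-size vector.
    codes = sparse_encode(descriptors, dictionary, alpha=alpha)
    return np.abs(codes).max(axis=0)

# Training on segmented action instances, then classifying a new video:
# D = learn_dictionary(np.vstack(train_frame_descriptors))
# X = np.stack([encode_video(d, D) for d in train_video_descriptors])
# clf = LinearSVC().fit(X, y)
# label = clf.predict(encode_video(new_descriptors, D).reshape(1, -1))
```

Max-pooling the sparse codes is only one common way to obtain a fixed-length video representation; the paper's own aggregation and on-line segmentation scheme are described in the full text.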

Cited by:

  • "Temporal segmentation is then paramount to cut video streams into single action instances, consistent with the models learnt from the training sequences. We show how our method can be easily combined with the robust temporal segmentation algorithm presented in [12]."
    ABSTRACT: We introduce an online action recognition system that can be combined with any set of frame-by-frame feature descriptors. Our system covers the frame feature space with classifiers whose distribution adapts to the hardness of locally approximating the Bayes optimal classifier. An efficient nearest-neighbour search is used to find and combine the local classifiers that are closest to the frames of a new video to be classified (see the first sketch after this list). The advantages of our approach are: incremental training, frame-by-frame real-time prediction, nonparametric predictive modelling, video segmentation for continuous action recognition, no need to trim videos to equal lengths, and only one tuning parameter (which, for large datasets, can be safely set to the diameter of the feature space). Experiments on standard benchmarks show that our system is competitive with state-of-the-art non-incremental and incremental baselines.
    Keywords: action recognition, incremental learning, continuous action recognition, nonparametric model, real time, multivariate time series classification, temporal classification
    British Machine Vision Conference (BMVC); 01/2014
  • European Conference on Computer Vision; 01/2014
  • ABSTRACT: A novel technique to classify time series with imprecise hidden Markov models is presented. The learning of these models is achieved by coupling the EM algorithm with the imprecise Dirichlet model. In the stationarity limit, each model corresponds to an imprecise mixture of Gaussian densities, thus reducing the problem to the classification of static, imprecise-probabilistic information. Two classifiers, one based on the expected value of the mixture, the other on the Bhattacharyya distance between pairs of mixtures, are developed. The computation of the bounds of these descriptors with respect to the imprecise quantification of the parameters reduces to, respectively, linear and quadratic optimization tasks, and is hence efficiently solved. Classification is performed by extending the k-nearest neighbors approach to interval-valued data (see the second sketch after this list). The classifiers are credal, meaning that multiple class labels can be returned in the output. Experiments on benchmark datasets for computer vision show that these methods achieve the required robustness whilst outperforming other precise and imprecise methods.
    International Journal of Approximate Reasoning 07/2014; DOI:10.1016/j.ijar.2014.07.005 (Impact Factor: 2.45)
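
The nearest-neighbour combination of local classifiers in the BMVC entry above can be illustrated compactly. The Python fragment below is a hedged sketch only: the classifier bank, the choice of k, the reliance on `predict_proba`, and the score-summing rule are assumptions for illustration, not that paper's actual procedure.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class LocalClassifierBank:
    """Hypothetical bank of local classifiers placed at points of the
    frame-feature space (names and structure are assumptions)."""

    def __init__(self, centers, classifiers, k=5):
        # centers[i] is the location of classifiers[i] in feature space.
        self.nn = NearestNeighbors(n_neighbors=k).fit(centers)
        self.classifiers = classifiers

    def predict_frame(self, feat):
        # Find the k local classifiers nearest to this frame and average
        # their class-probability vectors (a simple stand-in for the
        # paper's combination rule).
        _, idx = self.nn.kneighbors(feat.reshape(1, -1))
        probs = [self.classifiers[i].predict_proba(feat.reshape(1, -1))[0]
                 for i in idx[0]]
        return np.mean(probs, axis=0)

    def predict_video(self, frame_feats):
        # Frame-by-frame, online: accumulate scores as frames arrive and
        # report the currently best class.
        total = sum(self.predict_frame(f) for f in frame_feats)
        return int(np.argmax(total))
```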
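
The credal k-nearest-neighbours over interval-valued data from the International Journal of Approximate Reasoning entry admits a similarly small sketch. This is a simplified illustration under stated assumptions: each training item is a box of feature intervals, distances to it are bounded per feature, and every label that could reach the k nearest under some admissible scenario is kept; the paper's exact linear/quadratic optimisation of the descriptor bounds is not reproduced here.

```python
import numpy as np

def interval_distance_bounds(x, lo, up):
    # Nearest point of the box [lo, up] to x is the per-feature clip;
    # the farthest is the corner maximising each coordinate gap.
    nearest = np.clip(x, lo, up)
    farthest = np.where(np.abs(x - lo) > np.abs(x - up), lo, up)
    return np.linalg.norm(x - nearest), np.linalg.norm(x - farthest)

def credal_knn(x, lows, ups, labels, k=3):
    # Keep every label whose item could rank among the k nearest for
    # some admissible point inside the intervals; the output is a
    # (possibly non-singleton) credal set of labels.
    labels = np.asarray(labels)
    bounds = [interval_distance_bounds(x, lo, up)
              for lo, up in zip(lows, ups)]
    lower = np.array([b[0] for b in bounds])
    upper = np.array([b[1] for b in bounds])
    kth_worst = np.sort(upper)[k - 1]  # k-th smallest worst-case distance
    possible = lower <= kth_worst      # could fall in the top k
    return set(labels[possible].tolist())
```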