-
ABSTRACT: Products of Hidden Markov Models (PoHMMs) are an interesting class of
generative models which have received little attention since their
introduction. This may be in part due to their more computationally expensive
gradient-based learning algorithm, and the intractability of computing the log
likelihood of sequences under the model. In this paper, we demonstrate how the
partition function can be estimated reliably via Annealed Importance Sampling.
We perform experiments using contrastive divergence learning on rainfall data
and data captured from pairs of people dancing. Our results suggest that
advances in learning and evaluation for undirected graphical models and recent
increases in available computing power make PoHMMs worth considering for
complex time-series modeling tasks.
05/2012;
-
IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011; 01/2011
-
Journal of Machine Learning Research. 01/2011; 12:1025-1068.
-
The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011; 01/2011
-
ABSTRACT: We address the problem of learning good features for understanding video data. We introduce a model that learns latent representations
of image sequences from pairs of successive images. The convolutional architecture of our model allows it to scale to realistic
image sizes whilst using a compact parametrization. In experiments on the NORB dataset, we show our model extracts latent
“flow fields” which correspond to the transformation between the pair of input frames. We also use our model to extract low-level
motion features in a multi-stage architecture for action recognition, demonstrating competitive performance on both the KTH
and Hollywood2 datasets.
09/2010: pages 140-153;
-
The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010; 01/2010
-
Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada; 01/2010
-
The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010; 01/2010
-
20th International Conference on Pattern Recognition, ICPR 2010, Istanbul, Turkey, 23-26 August 2010; 01/2010
-
ESANN 2009, 17th European Symposium on Artificial Neural Networks, Bruges, Belgium, April 22-24, 2009, Proceedings; 01/2009
-
Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009; 01/2009
-
Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008; 01/2008
-
Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006; 01/2006
-
ABSTRACT: Introduction: In the past, using motion for recognition has been demonstrated mainly for classifying certain actions or gait styles ([1, 2, 3, 4, 5]). The more ambiguous motion of a person that appears during talking and performing "body language" is harder to quantify. This is similar to what the acoustic speech community calls "speaker recognition": it is not important what the person says, but how it is said. We developed a new video-based feature extraction technique and methods to train statistical models that classify body motion signatures. The recognition architecture is inspired by recent progress in speaker recognition research. This abstract has 3 main contributions: 1) a visual feature estimation technique, based on sparse flow computations and motion angle histograms, that we call "Motion Orientation Signatures" (MOS); 2) integration of this feature into a 3-stage recognition system; 3) an integration method with a state-of-the-art face recognition architecture [6]. We demonstrate how this new technique can be used to identify people based on their motion alone, or to significantly improve "hard-biometrics" techniques. For example, face verification achieves a 6.45% Equal Error Rate (EER) on this domain, and combining motion features and face with an adaptive score-level integration method reduces the error to 4.96%. The more ambiguous motion-only performance is 17.1% EER. To the best of our knowledge, this is the first time a system has been demonstrated that can significantly improve a state-of-the-art face recognition system with signals as complex and ambiguous as a person's body language. Motion Orientation Signatures (MOS): The first step in our new visual extraction schema is flow computation at reliable feature locations. Given these robust flow estimates, we compute weighted flow angle histograms (see [7] for more details on the visual feature extraction method).
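As a rough illustration of the weighted flow angle histogram described above, the following sketch bins the direction of each flow vector and weights it by the vector's magnitude. The abstract does not say which weighting or how many orientation bins are used, so the magnitude weighting, the default of 8 bins, and the function name `flow_angle_histogram` are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def flow_angle_histogram(flow, n_bins=8):
    """Magnitude-weighted histogram of flow directions (illustrative only).

    flow: array-like of shape (n_points, 2) holding (dx, dy) per feature point.
    Returns a length-n_bins vector that sums to 1, or all zeros if nothing moves.
    """
    flow = np.asarray(flow, dtype=float).reshape(-1, 2)
    angles = np.arctan2(flow[:, 1], flow[:, 0])    # direction in [-pi, pi]
    weights = np.hypot(flow[:, 0], flow[:, 1])     # assumed weighting: flow magnitude
    hist, _ = np.histogram(
        angles, bins=n_bins, range=(-np.pi, np.pi), weights=weights
    )
    total = hist.sum()
    # Normalise so the signature does not depend on the number of tracked points.
    return hist / total if total > 0 else hist
```

Calling `flow_angle_histogram` once per frame would give one orientation signature per time step, which is the kind of sequence the rest of the abstract builds on.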
Inspired by acoustic speech features, we also compute "delta-features", the temporal derivative of each orientation bin value. Since the bin values are statistics of the visual velocity (flow), the delta-features cover acceleration and deceleration. Figure 1 shows a few examples that demonstrate what signatures are created with certain video input. We show several visualizations of these features at http://movement.nyu.edu/GreenDot. We further improved our MOS feature extraction scheme by incorporating a state-of-the-art face-tracking system from PittPatt.com [6] (winner of the 2008 NIST MBGC Challenge), and placing an N × M grid of local regions of interest (ROIs) around the face such that it covers the body. Inside each local ROI of the grid we compute the MOS feature and concatenate all local ROI histograms into one big feature vector. GMM-Super-Features: Similar to recent approaches in the speech community, we convert an arbitrary-length video into a fixed-dimensional feature vector with so-called Super-Features [8]: A Gaussian Mixture Model is first trained on the MOS features. In speaker recognition, this is called the Universal Background Model (UBM). Given that UBM, the statistics of each video are computed by MAP-adapting the GMM to the specific video features. A GMM-Super-Feature is the difference between the UBM mean vectors and the new MAP-adapted mean vectors. If the new video has some unique motion, then at least one mean vector has a large difference to the UBM.
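The delta-features and GMM-Super-Features described above can be sketched roughly as follows. This is a minimal illustration under several assumptions that the abstract does not confirm: the temporal derivative is approximated by finite differences, the UBM has diagonal covariances and is already trained, MAP adaptation uses the common relevance-factor form (with a hypothetical default of 16), and the super-feature is taken as adapted mean minus UBM mean. All function names are hypothetical.

```python
import numpy as np


def delta_features(hists):
    """Finite-difference temporal derivative of each orientation bin.

    hists: array of shape (n_frames, n_bins), with n_frames >= 2.
    """
    return np.gradient(np.asarray(hists, dtype=float), axis=0)


def responsibilities(X, weights, means, variances):
    """Posterior probability of each diagonal Gaussian for every row of X."""
    diff = X[:, None, :] - means[None, :, :]                    # (T, K, D)
    log_prob = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )
    log_post = np.log(weights) + log_prob                       # (T, K)
    log_post -= log_post.max(axis=1, keepdims=True)             # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


def gmm_super_feature(X, weights, means, variances, relevance=16.0):
    """Stacked shift of the MAP-adapted means away from the UBM means."""
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    gamma = responsibilities(X, weights, means, variances)
    n_k = gamma.sum(axis=0)[:, None]                            # soft counts (K, 1)
    first_moment = gamma.T @ X                                  # (K, D)
    # Relevance MAP: adapted = a * E_k + (1 - a) * mu, with a = n_k / (n_k + r),
    # so adapted - mu simplifies to (first_moment - n_k * mu) / (n_k + r).
    shift = (first_moment - n_k * means) / (n_k + relevance)
    return shift.ravel()
```

Under these assumptions, a video whose features sit on the UBM means yields a super-feature close to zero, while a video with unusual motion pulls at least one component mean away and produces a large entry.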
-
ABSTRACT: Introduction: Recognition of human activity from video data is a challenging problem that has received an increasing amount of attention from the computer vision community in recent years. The ability to parse high-level visual information has wide-ranging applications that include surveillance and security, the aid of people with special needs and the understanding of human non-verbal communication. Most of the methods proposed for human activity recognition have borrowed ideas from another task: object recognition, which has been dominated by methods that represent each image as a "bag of features" using hand-crafted descriptors applied to image patches. Therefore the majority of methods proposed follow a similar trajectory: detect local interest points, compute a representation of a window of pixels around each of these points using an engineered descriptor, quantize the local space-time features, represent each sequence as a spatio-temporal "bag of features", and then feed to a classifier. Most of the effort has been made in designing space-time interest-point detectors [1, 2, 3] and descriptors [4, 5, 3] and up to this point, learning has not played much of a role in advancing the field. [Figure 1: Our proposed architecture. A video is decomposed into spatio-temporal cubes. These are processed by two layers of learned feature detectors and a max-pooling layer before classification. Panel labels: Trans. codes; Input video; Normalized video; Spatio-temporal cubes of pixels; Max-pooling output; Classifier output; Sparse, overcomplete representation.] There is much evidence that learning feature detectors in a supervised setting [6], unsupervised setting [7, 8, 9], or semi-supervised setting [10] can improve performance in vision tasks, including object recognition. However, other than [11], we know of no methods that attempt to use learning at the level of feature detectors to improve human activity recognition.
This may be due to the prohibitive computational cost of learning descriptors on video. Standard datasets for activity recognition (e.g. [12, 13]) typically contain an order of magnitude more pixels than common datasets for object recognition. However, with the advent of general-purpose GPU computing, and its growing popularity in learning features from images [14], it is now worth considering large-scale feature learning on these datasets. The first contribution of this paper is to address the problem of learning feature detectors for use in human activity recognition. Specifically, we focus on a recently proposed type of conditional random field called the gated Restricted Boltzmann Machine (GRBM) [15] which learns distributed, domain-specific representations of image transformations. Fundamental to all of the leading activity recognition methods is a step where spatio-temporal descriptors are quantized to generate a "bag-of-words" codebook (typically using K-means). In an attempt to easily obtain compact video representations, much of the potentially useful discriminative power of the descriptors is lost. A second contribution of this paper is to address this issue. We argue and demonstrate that sparse, overcomplete distributed representations are more appropriate for video analysis. Method: Our proposed architecture is shown in Fig. 1. We first apply local contrast normalization to each of the pixels in the input video. Then we extract local space-time "cubes" from the normalized video. In the first phase of unsupervised learning we estimate so-called latent "transformation-codes" that model local appearance and motion of the space-time cubes (see Fig. 2). Given low-level codes, we infer longer-term/mid-level encodings with a sparse dictionary learning method [16]. Finally, max-pooling is used to find a sparse vector representation for the video, which is then fed to an SVM classifier.
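Two of the non-learned steps in the pipeline above, cutting the video into space-time cubes and max-pooling per-cube codes into one video-level vector, can be sketched as follows. The cube size, the stride, and the function names are assumptions chosen for illustration. The contrast normalization, the GRBM transformation codes, and the sparse dictionary stage are not reproduced here, so the `codes` passed to `max_pool` are simply whatever an encoder (not shown) would produce.

```python
import numpy as np


def extract_cubes(video, size=(4, 8, 8), stride=(4, 8, 8)):
    """Cut a (T, H, W) video into flattened space-time cubes, one per row.

    size and stride are (frames, rows, cols); the defaults are illustrative.
    """
    video = np.asarray(video, dtype=float)
    (ct, ch, cw), (st, sh, sw) = size, stride
    n_t, n_h, n_w = video.shape
    cubes = [
        video[t:t + ct, y:y + ch, x:x + cw].ravel()
        for t in range(0, n_t - ct + 1, st)
        for y in range(0, n_h - ch + 1, sh)
        for x in range(0, n_w - cw + 1, sw)
    ]
    return np.stack(cubes)


def max_pool(codes):
    """Pool per-cube codes of shape (n_cubes, n_features) into one vector.

    Each feature keeps its largest value across all cubes of the video.
    """
    return np.max(np.asarray(codes, dtype=float), axis=0)
```

For an 8 × 16 × 16 video and the default settings, `extract_cubes` returns 8 cubes of 256 values each; pooling the resulting codes gives a single fixed-length vector regardless of video length, which is what allows a standard classifier such as an SVM to be applied.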