ABSTRACT: In group activity recognition, the temporal dynamics of the whole activity
can be inferred from the dynamics of the individual people performing the
activity. We build a deep model to capture these dynamics based on LSTM
(long short-term memory) models. To make use of these observations, we
present a 2-stage deep temporal model for the group activity recognition
problem. In our model, one LSTM model is designed to represent the action
dynamics of individual people in a sequence, and another LSTM model is designed
to aggregate human-level information for whole-activity understanding. We
evaluate our model on two datasets: the Collective Activity Dataset and a new
volleyball dataset. Experimental results demonstrate that our proposed model
improves group activity recognition performance compared to baseline methods.
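The two-stage design described above can be sketched minimally: person-level LSTMs track each individual's action dynamics, their hidden states are pooled at each frame, and a group-level LSTM consumes the pooled representation. All sizes, the random weights, and the max-pooling choice below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    # One LSTM step; all four gates computed from the concatenated [input, state].
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
feat, hidden, people, frames = 8, 16, 4, 5        # toy sizes (assumed)
W_person = 0.1 * rng.standard_normal((4 * hidden, feat + hidden))
W_group = 0.1 * rng.standard_normal((4 * hidden, hidden + hidden))

# Stage-1 input: per-frame appearance features for each tracked person.
tracks = rng.standard_normal((people, frames, feat))

h_p = np.zeros((people, hidden)); c_p = np.zeros((people, hidden))
h_g = np.zeros(hidden); c_g = np.zeros(hidden)
for t in range(frames):
    # Stage 1: person-level LSTMs model each individual's action dynamics.
    for k in range(people):
        h_p[k], c_p[k] = lstm_step(tracks[k, t], h_p[k], c_p[k], W_person)
    # Pool person states, then stage 2: a group-level LSTM over the pooled
    # representation models the whole activity over time.
    pooled = h_p.max(axis=0)
    h_g, c_g = lstm_step(pooled, h_g, c_g, W_group)
# h_g now summarizes the group activity; a softmax classifier would sit on top.
```

A trained model would learn `W_person` and `W_group` by backpropagation through both stages; the sketch only shows the forward data flow.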
ABSTRACT: Images of scenes have various objects as well as abundant attributes, and
diverse levels of visual categorization are possible. A natural image can be
assigned fine-grained labels that describe major components, coarse-grained
labels that depict high-level abstraction, or a set of labels that reveal
attributes. Such categorization at different concept layers can be modeled with
label graphs encoding label information. In this paper, we exploit this rich
information with a state-of-the-art deep learning framework, and propose a
generic structured model that leverages diverse label relations to improve
image classification performance. Our approach employs a novel stacked label
prediction neural network, capturing both inter-level and intra-level label
semantics. We evaluate our method on benchmark image datasets, and empirical
results illustrate the efficacy of our model.
ABSTRACT: Rich semantic relations are important in a variety of visual recognition
problems. As a concrete example, group activity recognition involves the
interactions and relative spatial relations of a set of people in a scene.
State-of-the-art recognition methods center on deep learning approaches for
training highly effective, complex classifiers for interpreting images.
However, bridging the relatively low-level concepts output by these methods to
interpret higher-level compositional scenes remains a challenge. Graphical
models are a standard tool for this task. In this paper, we propose a method to
integrate graphical models and deep neural networks into a joint framework.
Instead of using a traditional inference method, we use a sequential prediction
approximation, modeled by a recurrent neural network. Beyond this, the
appropriate structure for inference can be learned by imposing gates on edges
between connections of nodes. Empirical results on group activity recognition
demonstrate the potential of this model to handle highly structured learning
tasks.
ABSTRACT: This paper presents a deep neural-network-based hierarchical graphical model
for individual and group activity recognition in surveillance scenes. Deep
networks are used to recognize the actions of individual people in a scene.
Next, a neural-network-based hierarchical graphical model refines the predicted
labels for each class by considering dependencies between the classes. This
refinement step mimics a message-passing step similar to inference in a
probabilistic graphical model. We show that this approach can be effective in
group activity recognition, with the deep graphical model improving recognition
rates over baseline methods.
The British Machine Vision Conference (BMVC); 09/2015
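The refinement step described above can be illustrated with a toy message-passing-style update: each person's action logits are adjusted by the dependency-weighted beliefs of a neighbouring node. The dependency matrix, the two-node graph, and the single update step are assumptions for illustration, not the paper's trained model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Initial per-person action logits from a deep network (toy values).
logits = np.array([[2.0, 0.1, -1.0],
                   [1.5, 0.3, -0.5]])

# Hypothetical learned dependency matrix: C[i, j] says how strongly evidence
# for class j at a neighbouring node supports class i at this node.
C = np.array([[ 0.5, -0.2, 0.0],
              [-0.2,  0.4, 0.1],
              [ 0.0,  0.1, 0.3]])

# One refinement step in the spirit of message passing: each node adds the
# dependency-weighted beliefs of the other node to its own logits.
beliefs = np.array([softmax(l) for l in logits])
messages = beliefs[::-1] @ C.T          # swap rows: each node hears the other
refined = logits + messages
```

Iterating this update (and learning `C` end-to-end) would approximate the inference behaviour the abstract describes.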
ABSTRACT: Every moment counts in action recognition. A comprehensive understanding of
human activity in video requires labeling every frame according to the actions
occurring, placing multiple labels densely over a video sequence. To study this
problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new
dataset of dense labels over unconstrained internet videos. Modeling multiple,
dense labels benefits from temporal relations within and across classes. We
define a novel variant of long short-term memory (LSTM) deep networks for
modeling these temporal relations via multiple input and output connections. We
show that this model improves action labeling accuracy and further enables
deeper understanding tasks ranging from structured retrieval to action
prediction.
ABSTRACT: We present a method for learning an embedding that places images of humans in
similar poses nearby. This embedding can be used as a direct method of
comparing images based on human pose, avoiding potential challenges of
estimating body joint positions. Pose embedding learning is formulated under a
triplet-based distance criterion. A deep architecture is used to allow learning
of a representation capable of making distinctions between different poses.
Experiments on human pose matching and retrieval from video data demonstrate
the potential of the method.
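The triplet-based distance criterion can be sketched as follows. The one-layer embedding and the margin value are illustrative stand-ins for the deep architecture in the paper: the loss pulls a same-pose pair closer together than a different-pose pair by at least the margin.

```python
import numpy as np

def embed(x, W):
    # Hypothetical one-layer embedding; the paper uses a deep network.
    v = np.tanh(W @ x)
    return v / np.linalg.norm(v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge on the squared-distance gap: the anchor should lie closer to a
    # same-pose image than to a different-pose image by at least `margin`.
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((32, 128))
a, p, n = (embed(rng.standard_normal(128), W) for _ in range(3))
loss = triplet_loss(a, p, n)
```

Once trained, pose comparison reduces to a distance in the embedding space, so no body-joint estimation is needed at match time.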
ABSTRACT: We present a novel approach for discovering human interactions in videos.
Activity understanding techniques usually require a large number of labeled
examples, which are not available in many practical cases. Here, we focus on
recovering semantically meaningful clusters of human-human and human-object
interaction in an unsupervised fashion. A new iterative solution is introduced
based on Maximum Margin Clustering (MMC), which also accepts user feedback to
refine clusters. This is achieved by formulating the whole process as a unified
constrained latent max-margin clustering problem. Extensive experiments have
been carried out over three challenging datasets, Collective Activity, VIRAT,
and UT-interaction. Empirical results demonstrate that the proposed algorithm
can efficiently discover perfect semantic clusters of human interactions with
only a small amount of labeling effort.
Workshop on Group and Crowd Behavior Analysis and Understanding (at CVPR); 06/2015
ABSTRACT: In this paper, we propose to learn temporal embeddings of video frames for
complex video analysis. Large quantities of unlabeled video data can be easily
obtained from the Internet. These videos possess the implicit weak label that
they are sequences of temporally and semantically coherent images. We leverage
this information to learn temporal embeddings for video frames by associating
frames with the temporal context that they appear in. To do this, we propose a
scheme for incorporating temporal context based on past and future frames in
videos, and compare this to other contextual representations. In addition, we
show how data augmentation using multi-resolution samples and hard negatives
helps to significantly improve the quality of the learned embeddings. We
evaluate various design decisions for learning temporal embeddings, and show
that our embeddings can improve performance for multiple video tasks such as
retrieval, classification, and temporal order recovery in unconstrained
internet videos.
ABSTRACT: Not all frames are equal – selecting a subset of discriminative frames from a video can improve performance at detecting and recognizing human interactions. In this paper we present models for categorizing a video into one of a number of predefined interactions or for detecting these interactions in a long video sequence. The models represent the interaction by a set of key temporal moments and the spatial structures they entail. For instance: two people approaching each other, then extending their hands before engaging in a “handshaking” interaction. Learning the model parameters requires only weak supervision in the form of an overall label for the interaction. Experimental results on the UT-Interaction and VIRAT datasets verify the efficacy of these structured models for human interactions.
ABSTRACT: Many visual recognition problems can be approached by counting instances. To
determine whether an event is present in a long internet video, one could count
how many frames seem to contain the activity. Classifying the activity of a
group of people can be done by counting the actions of individual people.
Encoding these cardinality relationships can reduce sensitivity to clutter, in
the form of irrelevant frames or individuals not involved in a group activity.
Learned parameters can encode how many instances tend to occur in a class of
interest. To this end, this paper develops a powerful and flexible framework to
infer any cardinality relation between latent labels in a multi-instance model.
Hard or soft cardinality relations can be encoded to tackle diverse levels of
ambiguity. Experiments on tasks such as human activity recognition, video event
detection, and video summarization demonstrate the effectiveness of using
cardinality relations for improving recognition results.
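For a simple family of cardinality potentials, the inference described above can be done exactly: sort the instance scores, then for every possible count of positives compare the sum of the best instances plus the potential on that count. The scores and the soft "at least one instance" potential below are toy values, not the paper's learned parameters.

```python
import numpy as np

def best_labeling(scores, card):
    """Exact MAP inference for one bag: choose how many instances to label
    positive so that the sum of the chosen instance scores plus a
    cardinality potential card[k] (k = number of positives) is maximal."""
    order = np.sort(scores)[::-1]                  # best instances first
    prefix = np.concatenate([[0.0], np.cumsum(order)])
    totals = prefix + card                         # card has len(scores)+1 entries
    k = int(np.argmax(totals))
    return k, float(totals[k])

# Soft "at least one instance" relation: zero positives is heavily penalized.
scores = np.array([2.0, -1.0, 3.0])
card = np.array([-10.0, 0.0, 0.0, 0.0])
k, value = best_labeling(scores, card)
```

Hard relations correspond to `-inf` entries in `card`; softer penalties express graded preferences over counts, which is how varying levels of ambiguity can be encoded.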
ABSTRACT: We present a hierarchical maximum-margin clustering method for unsupervised
data analysis. Our method extends beyond flat maximum-margin clustering, and
performs clustering recursively in a top-down manner. We propose an effective
greedy splitting criterion for selecting which cluster to split next, and employ
regularizers that enforce feature sharing/competition for capturing data
semantics. Experimental results obtained on four standard datasets show that
our method outperforms flat and hierarchical clustering baselines, while
forming clean and semantically meaningful cluster hierarchies.
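As a rough sketch of the top-down recursion, a plain 2-means split below stands in for the max-margin split, and the greedy cluster-selection criterion and the feature-sharing regularizers are omitted. Everything here is an illustrative simplification, not the paper's method.

```python
import numpy as np

def two_means(X, iters=10):
    # Simple 2-means split with deterministic initialization from the
    # extreme points along the first feature dimension.
    centers = X[[np.argmin(X[:, 0]), np.argmax(X[:, 0])]].astype(float)
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for k in range(2):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return assign

def split_recursively(X, idx, min_size=2, depth=0, max_depth=2):
    # Top-down hierarchy: keep splitting the current cluster until it is
    # too small or too deep. The paper instead greedily picks which
    # cluster to split next rather than splitting all of them.
    if len(idx) < 2 * min_size or depth == max_depth:
        return idx.tolist()
    assign = two_means(X[idx])
    return [split_recursively(X, idx[assign == k], min_size, depth + 1, max_depth)
            for k in (0, 1)]
```

Calling `split_recursively(X, np.arange(len(X)))` yields a nested list of index groups, i.e. a cluster hierarchy.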
ABSTRACT: Several studies have shown that cyclists can reduce the risk of severe head injuries by wearing a helmet. A system is proposed to collect cyclist helmet usage data automatically from video footage. Computer vision techniques are used to track the moving objects and then to analyze the object trajectories and speed profiles to identify cyclists. Image features are extracted from a region around the cyclist's head. Support vector machines determine whether the cyclist is wearing a helmet. The system can be approximately 90% accurate in cyclist classification when provided with accurate tracks of the cyclist's head. Even for situations in which obtaining video to track a cyclist is challenging, the proposed method provides an effective retrieval system, potentially reducing the number of video records that must be analyzed manually to find instances of cyclists not wearing helmets.
Transportation Research Record: Journal of the Transportation Research Board 12/2014; 2468:1-10. DOI:10.3141/2468-01 · 0.54 Impact Factor
ABSTRACT: In sustainable urban planning, non-motorised active modes of travel such as walking are identified as a leading driver for a healthy, liveable and resource-efficient environment. To encourage walking, there is a need for a solid understanding of pedestrian walking behaviour. This understanding is central to the evaluation of measures of walking conditions such as comfort and efficiency. The main purpose of this study is to gain an in-depth understanding of pedestrian walking behaviour through the investigation of the spatio-temporal gait parameters (step length and step frequency). This microscopic-level analysis provides insight into the pedestrian walking mechanisms and the effect of various attributes such as gender and age. This analysis relies on automated video-based data collection using computer vision techniques. The step frequency and step length are estimated based on oscillatory patterns in the walking speed profile. The study uses real-world video data collected in downtown Vancouver, BC. The results show that the gait parameters are influenced by factors such as crosswalk grade, pedestrian gender, age and group size. The step length was found to generally have more influence on walking speed than step frequency. It was also found that, compared to males, females increase their step frequency to increase their walking speed.
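The estimation idea, that walking speed oscillates at the step frequency so its dominant spectral peak yields step frequency, and mean speed divided by that frequency yields step length, can be sketched on synthetic data. The frame rate, speeds, and oscillation amplitude below are assumed values, not measurements from the study.

```python
import numpy as np

fps = 30.0                        # video frame rate (assumed)
n = 180                           # 6 seconds of track
t = np.arange(n) / fps

# Synthetic walking-speed profile: mean speed plus an oscillation at the
# step frequency, mimicking the gait-induced speed variation.
mean_speed, true_step_freq = 1.3, 2.0        # m/s, steps/s (assumed values)
speed = mean_speed + 0.15 * np.sin(2 * np.pi * true_step_freq * t)

# Step frequency = dominant frequency of the de-meaned speed profile.
spectrum = np.abs(np.fft.rfft(speed - speed.mean()))
freqs = np.fft.rfftfreq(n, d=1 / fps)
step_freq = freqs[np.argmax(spectrum)]

# Walking speed = step length x step frequency, so step length follows.
step_length = speed.mean() / step_freq
```

Real speed profiles from tracking are noisy, so in practice the profile would be smoothed and the peak search restricted to a plausible gait-frequency band.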
ABSTRACT: We present a system for multimedia event detection. The developed system characterizes complex multimedia events based on a large array of multimodal features, and classifies unseen videos by effectively fusing diverse responses. We present three major technical innovations. First, we explore novel visual and audio features across multiple semantic granularities, including building, often in an unsupervised manner, mid-level and high-level features upon low-level features to enable semantic understanding. Second, we present a novel Latent SVM model which learns and localizes discriminative high-level concepts in cluttered video sequences. In addition to improving detection accuracy beyond existing approaches, it enables a unique summary for every retrieval by its use of high-level concepts and temporal evidence localization. The resulting summary provides some transparency into why the system classified the video as it did. Finally, we present novel fusion learning algorithms and our methodology to improve fusion learning under limited training data conditions. Thorough evaluation on a large TRECVID MED 2011 dataset showcases the benefits of the presented system.
ABSTRACT: The appearance of an object changes profoundly with pose, camera view and interactions of the object with other objects in the scene. This makes it challenging to learn detectors based on an object-level label (e.g., "car"). We postulate that having a richer set of labelings (at different levels of granularity) for an object, including finer-grained subcategories, consistent in appearance and view, and higher order composites - contextual groupings of objects consistent in their spatial layout and appearance, can significantly alleviate these problems. However, obtaining such a rich set of annotations, including annotation of an exponentially growing set of object groupings, is simply not feasible. We propose a weakly-supervised framework for object detection where we discover subcategories and the composites automatically with only traditional object-level category labels as input. To this end, we first propose an exemplar-SVM-based clustering approach, with latent SVM refinement, that discovers a variable length set of discriminative subcategories for each object class. We then develop a structured model for object detection that captures interactions among object subcategories and automatically discovers semantically meaningful and discriminatively relevant visual composites. We show that this model produces state-of-the-art performance on the UIUC phrase object detection benchmark.
Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
ABSTRACT: We present a compositional model for video event detection. A video is modeled using a collection of both global and segment-level features and kernel functions are employed for similarity comparisons. The locations of salient, discriminative video segments are treated as a latent variable, allowing the model to explicitly ignore portions of the video that are unimportant for classification. A novel, multiple kernel learning (MKL) latent support vector machine (SVM) is defined, which is used to combine and re-weight multiple feature types in a principled fashion while simultaneously operating within the latent variable framework. The compositional nature of the proposed model allows it to respond directly to the challenges of temporal clutter and intra-class variation, which are prevalent in unconstrained internet videos. Experimental results on the TRECVID Multimedia Event Detection 2011 (MED11) dataset demonstrate the efficacy of the method.
Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
ABSTRACT: Gathering accurate training data for recognizing a set of attributes or tags on images or videos is a challenge. Obtaining labels via manual effort or from weakly-supervised data typically results in noisy training labels. We develop the FlipSVM, a novel algorithm for handling these noisy, structured labels. The FlipSVM models label noise by "flipping" labels on training examples. We show empirically that the FlipSVM is effective on images-and-attributes and video tagging datasets.
Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
ABSTRACT: In this work, we address the problem of complex event detection on unconstrained videos. We introduce a novel multi-way feature pooling approach which leverages segment-level information. The approach is simple and widely applicable to diverse audio-visual features. Our approach uses a set of clusters discovered via unsupervised clustering of segment-level features. Depending on feature characteristics, not only scene-based clusters but also motion/audio-based clusters can be incorporated. Then, every video is represented with multiple descriptors, where each descriptor is designed to relate to one of the pre-built clusters. For classification, intersection kernel SVMs are used where the kernel is obtained by combining multiple kernels computed from corresponding per-cluster descriptor pairs. Evaluation on the TRECVID'11 MED dataset shows that the proposed approach achieves a significant improvement over the state-of-the-art.
Proceedings of the 21st ACM international conference on Multimedia; 10/2013
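The per-cluster descriptors and combined intersection kernel described in the last abstract can be sketched as follows. The cluster count, descriptor size, and random histograms are illustrative assumptions; in the paper each descriptor would be pooled from the video segments assigned to one pre-built cluster.

```python
import numpy as np

def intersection_kernel(h1, h2):
    # Histogram intersection: similarity of two normalized descriptors.
    return float(np.minimum(h1, h2).sum())

def norm_hist(x):
    return x / x.sum()

rng = np.random.default_rng(1)
clusters, bins = 3, 8    # assumed: 3 pre-built segment clusters, 8-bin descriptors

# One descriptor per cluster for each video, stand-ins for pooled
# segment-level features.
video_a = [norm_hist(rng.random(bins)) for _ in range(clusters)]
video_b = [norm_hist(rng.random(bins)) for _ in range(clusters)]

# Combined kernel: sum of the per-cluster intersection kernels, which an
# SVM can then consume directly as a precomputed kernel.
K_ab = sum(intersection_kernel(a, b) for a, b in zip(video_a, video_b))
```

Since each normalized descriptor intersects fully with itself, the combined kernel of a video with itself equals the number of clusters, giving a natural scale for the similarity values.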