ABSTRACT: This paper presents a deep neural-network-based hierarchical graphical model
for individual and group activity recognition in surveillance scenes. Deep
networks are used to recognize the actions of individual people in a scene.
Next, a neural-network-based hierarchical graphical model refines the predicted
labels for each class by considering dependencies between the classes. This
refinement step mimics a message-passing step similar to inference in a
probabilistic graphical model. We show that this approach can be effective in
group activity recognition, with the deep graphical model improving recognition
rates over baseline methods.
The British Machine Vision Conference (BMVC); 09/2015
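To make the refinement step concrete, here is a minimal Python sketch of message passing over per-person action scores, assuming a single learned action-dependency matrix W shared across all pairs; the function names and shapes are illustrative, not the paper's architecture:

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def refine_action_scores(unary, W, steps=3):
        """unary: (n_people, n_actions) raw scores from the person-level network.
        W:     (n_actions, n_actions) stand-in for learned class dependencies.
        Each step mixes a person's beliefs with messages aggregated from the
        other people in the scene, mimicking graphical-model inference."""
        beliefs = softmax(unary)
        n = beliefs.shape[0]
        for _ in range(steps):
            # message to person i = average of the other people's beliefs
            total = beliefs.sum(axis=0, keepdims=True)
            messages = (total - beliefs) / max(n - 1, 1)
            beliefs = softmax(unary + messages @ W)
        return beliefs

    scores = np.random.randn(5, 8)         # 5 people, 8 action classes
    W = 0.1 * np.random.randn(8, 8)        # stand-in for learned weights
    print(refine_action_scores(scores, W).shape)   # (5, 8)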
ABSTRACT: Every moment counts in action recognition. A comprehensive understanding of
human activity in video requires labeling every frame according to the actions
occurring, placing multiple labels densely over a video sequence. To study this
problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new
dataset of dense labels over unconstrained internet videos. Modeling multiple,
dense labels benefits from temporal relations within and across classes. We
define a novel variant of long short-term memory (LSTM) deep networks for
modeling these temporal relations via multiple input and output connections. We
show that this model improves action labeling accuracy and further enables
deeper understanding tasks ranging from structured retrieval to action prediction.
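For illustration, the sketch below shows the base case this variant builds on: a plain numpy LSTM that emits independent per-label sigmoid scores at every frame, so multiple dense labels can be active at once. The multiple input and output connections that define the paper's variant are omitted, and all names and shapes are assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_label_frames(x_seq, params):
        """x_seq: (T, d_in) frame features. Returns (T, n_labels) sigmoid
        scores, one probability per action label per frame (labels are not
        mutually exclusive, hence sigmoids rather than a softmax)."""
        Wg, Wy, by = params["Wg"], params["Wy"], params["by"]
        d_h = Wy.shape[1]
        h = np.zeros(d_h); c = np.zeros(d_h)
        outputs = []
        for x in x_seq:
            z = Wg @ np.concatenate([h, x])        # all four gates at once
            i, f, o, g = np.split(z, 4)
            i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
            c = f * c + i * g                      # cell-state update
            h = o * np.tanh(c)
            outputs.append(sigmoid(Wy @ h + by))
        return np.stack(outputs)

    d_in, d_h, n_labels, T = 16, 32, 10, 50
    params = {"Wg": 0.1 * np.random.randn(4 * d_h, d_h + d_in),
              "Wy": 0.1 * np.random.randn(n_labels, d_h),
              "by": np.zeros(n_labels)}
    print(lstm_label_frames(np.random.randn(T, d_in), params).shape)  # (50, 10)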
ABSTRACT: We present a method for learning an embedding that places images of humans in
similar poses nearby. This embedding can be used as a direct method of
comparing images based on human pose, avoiding potential challenges of
estimating body joint positions. Pose embedding learning is formulated under a
triplet-based distance criterion. A deep architecture is used to allow learning
of a representation capable of making distinctions between different poses.
Experiments on human pose matching and retrieval from video data demonstrate
the potential of the method.
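A minimal sketch of the triplet-based criterion, assuming squared Euclidean distances in the embedding space and a hypothetical margin value:

    import numpy as np

    def triplet_loss(anchor, positive, negative, margin=0.2):
        """Hinge on the gap between anchor-positive and anchor-negative
        distances: images of similar poses are pulled together, images of
        different poses pushed apart."""
        d_pos = np.sum((anchor - positive) ** 2)
        d_neg = np.sum((anchor - negative) ** 2)
        return max(0.0, margin + d_pos - d_neg)

    # embeddings would come from the deep network; random stand-ins here
    a, p, n = (np.random.randn(128) for _ in range(3))
    print(triplet_loss(a, p, n))

Pose retrieval then reduces to nearest-neighbour search in the learned embedding space, with no body-joint estimation step.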
ABSTRACT: We present a novel approach for discovering human interactions in videos.
Activity understanding techniques usually require a large number of labeled
examples, which are not available in many practical cases. Here, we focus on
recovering semantically meaningful clusters of human-human and human-object
interactions in an unsupervised fashion. A new iterative solution is introduced
based on Maximum Margin Clustering (MMC), which also accepts user feedback to
refine clusters. This is achieved by formulating the whole process as a unified
constrained latent max-margin clustering problem. Extensive experiments have
been carried out over three challenging datasets, Collective Activity, VIRAT,
and UT-interaction. Empirical results demonstrate that the proposed algorithm
can efficiently discover perfect semantic clusters of human interactions with
only a small amount of labeling effort.
Workshop on Group and Crowd Behavior Analysis and Understanding (at CVPR); 06/2015
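A rough sketch of the alternating approximation behind this kind of iterative MMC, with must-link pairs as a crude stand-in for user feedback (the paper instead folds the feedback into a single constrained latent max-margin objective):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def iterative_mmc(X, k, must_link=(), iters=10):
        """Alternate between fitting a multi-class linear SVM to the current
        labels and relabeling points by the classifier's own predictions,
        a common approximation to max-margin clustering."""
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        for _ in range(iters):
            svm = LinearSVC().fit(X, labels)
            new = svm.predict(X)
            for i, j in must_link:          # apply feedback constraints
                new[j] = new[i]
            if len(np.unique(new)) < 2 or np.array_equal(new, labels):
                break                       # converged or collapsed
            labels = new
        return labels

    X = np.random.randn(200, 32)
    print(np.bincount(iterative_mmc(X, k=3)))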
ABSTRACT: In this paper, we propose to learn temporal embeddings of video frames for
complex video analysis. Large quantities of unlabeled video data can be easily
obtained from the Internet. These videos possess the implicit weak label that
they are sequences of temporally and semantically coherent images. We leverage
this information to learn temporal embeddings for video frames by associating
frames with the temporal context that they appear in. To do this, we propose a
scheme for incorporating temporal context based on past and future frames in
videos, and compare this to other contextual representations. In addition, we
show how data augmentation using multi-resolution samples and hard negatives
helps to significantly improve the quality of the learned embeddings. We
evaluate various design decisions for learning temporal embeddings, and show
that our embeddings can improve performance for multiple video tasks such as
retrieval, classification, and temporal order recovery in unconstrained Internet video.
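One plausible reading of the context-association objective, sketched as a ranking loss in numpy (in the paper hard negatives are mined rather than drawn at random, and the context representation itself is one of the design decisions compared):

    import numpy as np

    def context_ranking_loss(frames, t, window, negative, margin=1.0):
        """frames: (T, d) current frame embeddings. The embedding of frame t
        should be closer to the mean of its temporal context (past and
        future frames within `window`) than a negative frame from another
        video is."""
        lo, hi = max(0, t - window), min(len(frames), t + window + 1)
        context = np.concatenate([frames[lo:t], frames[t + 1:hi]]).mean(axis=0)
        d_pos = np.sum((frames[t] - context) ** 2)
        d_neg = np.sum((negative - context) ** 2)
        return max(0.0, margin + d_pos - d_neg)

    frames = np.random.randn(100, 64)
    print(context_ranking_loss(frames, t=50, window=2,
                               negative=np.random.randn(64)))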
ABSTRACT: Not all frames are equal – selecting a subset of discriminative frames from a video can improve performance at detecting and recognizing human interactions. In this paper we present models for categorizing a video into one of a number of predefined interactions or for detecting these interactions in a long video sequence. The models represent the interaction by a set of key temporal moments and the spatial structures they entail. For instance: two people approaching each other, then extending their hands before engaging in a “handshaking” interaction. Learning the model parameters requires only weak supervision in the form of an overall label for the interaction. Experimental results on the UT-Interaction and VIRAT datasets verify the efficacy of these structured models for human interactions.
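The latent inference in such a model, choosing where the key temporal moments fall, can be sketched with a small dynamic program; the per-moment frame scores and the omission of the spatial structures are simplifications:

    import numpy as np

    def best_key_moments(frame_scores, k):
        """frame_scores: (T, k) score of placing key moment j at frame t.
        Returns the best total score over ordered placements t_1 < ... < t_k
        (the latent-variable inference), via dynamic programming."""
        T, K = frame_scores.shape
        assert K == k
        dp = np.full((T + 1, k + 1), -np.inf)
        dp[:, 0] = 0.0
        for t in range(1, T + 1):
            for j in range(1, k + 1):
                skip = dp[t - 1, j]                           # j not at frame t
                take = dp[t - 1, j - 1] + frame_scores[t - 1, j - 1]
                dp[t, j] = max(skip, take)
        return dp[T, k]

    print(best_key_moments(np.random.randn(40, 3), k=3))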
ABSTRACT: Many visual recognition problems can be approached by counting instances. To
determine whether an event is present in a long internet video, one could count
how many frames seem to contain the activity. Classifying the activity of a
group of people can be done by counting the actions of individual people.
Encoding these cardinality relationships can reduce sensitivity to clutter, in
the form of irrelevant frames or individuals not involved in a group activity.
Learned parameters can encode how many instances tend to occur in a class of
interest. To this end, this paper develops a powerful and flexible framework to
infer any cardinality relation between latent labels in a multi-instance model.
Hard or soft cardinality relations can be encoded to tackle diverse levels of
ambiguity. Experiments on tasks such as human activity recognition, video event
detection, and video summarization demonstrate the effectiveness of using
cardinality relations for improving recognition results.
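A sketch of how a cardinality relation enters bag scoring: for any count m of positive instances, the best choice is the m top-scoring ones, so a single sort suffices; the soft potential below (preferring at least 20% of frames to show the activity) is a hypothetical example:

    import numpy as np

    def bag_score_with_cardinality(instance_scores, cardinality_potential):
        """instance_scores: per-instance evidence (e.g., per-frame classifier
        scores). For each count m the optimum takes the m largest scores;
        the cardinality potential then rewards or penalizes that count."""
        s = np.sort(instance_scores)[::-1]
        prefix = np.concatenate([[0.0], np.cumsum(s)])
        counts = np.arange(len(s) + 1)
        totals = prefix + cardinality_potential(counts, len(s))
        m = int(np.argmax(totals))
        return totals[m], m

    soft = lambda m, n: np.where(m >= 0.2 * n, 0.0, -5.0)
    print(bag_score_with_cardinality(np.random.randn(50), soft))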
ABSTRACT: We present a hierarchical maximum-margin clustering method for unsupervised
data analysis. Our method extends beyond flat maximum-margin clustering, and
performs clustering recursively in a top-down manner. We propose an effective
greedy splitting criterion for selecting which cluster to split next, and employ
regularizers that enforce feature sharing/competition for capturing data
semantics. Experimental results obtained on four standard datasets show that
our method outperforms flat and hierarchical clustering baselines, while
forming clean and semantically meaningful cluster hierarchies.
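A simplified sketch of the greedy top-down loop, using a variance-reduction splitting criterion and KMeans 2-way splits as stand-ins for the paper's max-margin machinery:

    import numpy as np
    from sklearn.cluster import KMeans

    def hierarchical_split(X, max_leaves=4):
        """Repeatedly split the leaf whose 2-way split reduces within-cluster
        variance the most; returns the index sets of the leaf clusters."""
        leaves = [np.arange(len(X))]
        while len(leaves) < max_leaves:
            best = None
            for i, idx in enumerate(leaves):
                if len(idx) < 4:
                    continue
                km = KMeans(n_clusters=2, n_init=10).fit(X[idx])
                before = np.var(X[idx], axis=0).sum() * len(idx)
                gain = before - km.inertia_   # variance explained by split
                if best is None or gain > best[0]:
                    best = (gain, i, km.labels_)
            if best is None:
                break
            _, i, lab = best
            idx = leaves.pop(i)
            leaves += [idx[lab == 0], idx[lab == 1]]
        return leaves

    X = np.random.randn(300, 16)
    print([len(l) for l in hierarchical_split(X)])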
ABSTRACT: In sustainable urban planning, non-motorised active modes of travel such as walking are identified as a leading driver for a healthy, liveable and resource-efficient environment. To encourage walking, there is a need for a solid understanding of pedestrian walking behaviour. This understanding is central to the evaluation of measures of walking conditions such as comfort and efficiency. The main purpose of this study is to gain an in-depth understanding of pedestrian walking behaviour through the investigation of the spatio-temporal gait parameters (step length and step frequency). This microscopic-level analysis provides insight into pedestrian walking mechanisms and the effect of various attributes such as gender and age. The analysis relies on automated video-based data collection using computer vision techniques. The step frequency and step length are estimated based on oscillatory patterns in the walking speed profile. The study uses real-world video data collected in downtown Vancouver, BC. The results show that the gait parameters are influenced by factors such as crosswalk grade, pedestrian gender, age and group size. The step length was found to generally have more influence on walking speed than step frequency. It was also found that, compared to males, females increase their step frequency to increase their walking speed.
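The core frequency-domain estimate can be sketched in a few lines, using the identity that walking speed is roughly step length times step frequency; a clean speed profile at a known frame rate is assumed (the paper's oscillation analysis of real trajectories is necessarily more involved):

    import numpy as np

    def gait_from_speed(speed, fps):
        """speed: walking-speed profile (m/s) sampled at fps Hz. The speed
        of a walking pedestrian oscillates once per step, so the dominant
        frequency of the detrended profile estimates step frequency, and
        mean speed / step frequency estimates step length."""
        detrended = speed - speed.mean()
        spectrum = np.abs(np.fft.rfft(detrended))
        freqs = np.fft.rfftfreq(len(speed), d=1.0 / fps)
        step_freq = freqs[1:][np.argmax(spectrum[1:])]   # skip the DC bin
        step_len = speed.mean() / step_freq
        return step_freq, step_len

    fps, t = 30, np.arange(0, 10, 1 / 30)
    speed = 1.3 + 0.1 * np.sin(2 * np.pi * 1.8 * t)      # ~1.8 steps/s
    print(gait_from_speed(speed, fps))                   # ≈ (1.8, 0.72)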
ABSTRACT: We present a system for multimedia event detection. The developed system characterizes complex multimedia events based on a large array of multimodal features, and classifies unseen videos by effectively fusing diverse responses. We present three major technical innovations. First, we explore novel visual and audio features across multiple semantic granularities, including building, often in an unsupervised manner, mid-level and high-level features upon low-level features to enable semantic understanding. Second, we show a novel Latent SVM model which learns and localizes discriminative high-level concepts in cluttered video sequences. In addition to improving detection accuracy beyond existing approaches, it enables a unique summary for every retrieval by its use of high-level concepts and temporal evidence localization. The resulting summary provides some transparency into why the system classified the video as it did. Finally, we present novel fusion learning algorithms and our methodology for improving fusion learning under limited training data conditions. Thorough evaluation on a large TRECVID MED 2011 dataset showcases the benefits of the presented system.
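As a generic illustration of learned late fusion (not the paper's fusion algorithms), one can weight per-modality detector scores with a logistic regression trained on held-out data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def learn_fusion(modal_scores, labels):
        """modal_scores: (n_videos, n_modalities) per-modality detector
        scores on held-out data. Logistic regression over the score vector
        weights each modality by how reliable its responses are."""
        return LogisticRegression().fit(modal_scores, labels)

    train = np.random.randn(200, 5)                  # 5 modality scores
    y = (train[:, 0] + 0.5 * train[:, 2] > 0).astype(int)
    fuser = learn_fusion(train, y)
    print(fuser.coef_.round(2))                      # learned modality weights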
ABSTRACT: We present a compositional model for video event detection. A video is modeled using a collection of both global and segment-level features, and kernel functions are employed for similarity comparisons. The locations of salient, discriminative video segments are treated as a latent variable, allowing the model to explicitly ignore portions of the video that are unimportant for classification. A novel multiple kernel learning (MKL) latent support vector machine (SVM) is defined and used to combine and re-weight multiple feature types in a principled fashion while simultaneously operating within the latent variable framework. The compositional nature of the proposed model allows it to respond directly to the challenges of temporal clutter and intra-class variation, which are prevalent in unconstrained internet videos. Experimental results on the TRECVID Multimedia Event Detection 2011 (MED11) dataset demonstrate the efficacy of the method.
Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
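The scoring rule of such a model can be sketched by simplifying the kernels to linear ones: per-feature-type weights, learned mixing coefficients, and a max over the latent segment; all names and shapes here are hypothetical:

    import numpy as np

    def mkl_latent_score(segment_feats, model):
        """Each feature type m has weights w_m and a mixing coefficient
        beta_m; the latent variable is which video segment the score is
        read from, and inference maximizes over it."""
        per_segment = []
        for feats in segment_feats:                  # one dict per segment
            s = sum(model["beta"][m] * model["w"][m] @ feats[m]
                    for m in model["w"])
            per_segment.append(s)
        best = int(np.argmax(per_segment))
        return per_segment[best], best               # score, latent segment

    model = {"w": {"hog": np.random.randn(16), "mfcc": np.random.randn(8)},
             "beta": {"hog": 0.7, "mfcc": 0.3}}
    segs = [{"hog": np.random.randn(16), "mfcc": np.random.randn(8)}
            for _ in range(12)]
    print(mkl_latent_score(segs, model))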
ABSTRACT: Gathering accurate training data for recognizing a set of attributes or tags on images or videos is a challenge. Obtaining labels via manual effort or from weakly-supervised data typically results in noisy training labels. We develop the FlipSVM, a novel algorithm for handling these noisy, structured labels. The FlipSVM models label noise by "flipping" labels on training examples. We show empirically that the FlipSVM is effective on images-and-attributes and video tagging datasets.
Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
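A heuristic sketch in the spirit of the FlipSVM (the actual model treats the flips as part of the learning problem, so this alternating loop is only an approximation):

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_with_label_flips(X, y, budget, iters=5):
        """Fit an SVM, flip up to `budget` training labels the model
        disagrees with most confidently, refit, and repeat."""
        y = y.copy()
        for _ in range(iters):
            svm = LinearSVC().fit(X, y)
            margins = svm.decision_function(X) * (2 * y - 1)  # signed margins
            worst = np.argsort(margins)[:budget]
            flip = worst[margins[worst] < 0]    # only truly violated labels
            if len(flip) == 0:
                break
            y[flip] = 1 - y[flip]
        return svm, y

    X = np.random.randn(300, 10)
    clean = (X[:, 0] > 0).astype(int)
    noisy = clean.copy(); noisy[:15] = 1 - noisy[:15]   # inject label noise
    svm, _ = train_with_label_flips(X, noisy, budget=15)
    print(svm.score(X, clean))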
ABSTRACT: The appearance of an object changes profoundly with pose, camera view and interactions of the object with other objects in the scene. This makes it challenging to learn detectors based on an object-level label (e.g., "car"). We postulate that having a richer set of labelings for an object, at different levels of granularity, can significantly alleviate these problems: finer-grained subcategories that are consistent in appearance and view, and higher-order composites, i.e., contextual groupings of objects consistent in their spatial layout and appearance. However, obtaining such a rich set of annotations, including annotation of an exponentially growing set of object groupings, is simply not feasible. We propose a weakly-supervised framework for object detection in which we discover subcategories and composites automatically, with only traditional object-level category labels as input. To this end, we first propose an exemplar-SVM-based clustering approach, with latent SVM refinement, that discovers a variable-length set of discriminative subcategories for each object class. We then develop a structured model for object detection that captures interactions among object subcategories and automatically discovers semantically meaningful and discriminatively relevant visual composites. We show that this model produces state-of-the-art performance on the UIUC phrase object detection benchmark.
Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
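As a much-simplified sketch of exemplar-driven subcategory discovery, the greedy loop below uses cosine similarity in feature space in place of trained exemplar-SVM scores and omits the latent refinement entirely:

    import numpy as np

    def greedy_subcategories(feats, sim_thresh=0.6):
        """Treat each unclaimed positive as an exemplar and claim all other
        positives whose cosine similarity to it exceeds a threshold,
        yielding a variable number of appearance-consistent subcategories."""
        X = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        unclaimed = list(range(len(X)))
        clusters = []
        while unclaimed:
            e = unclaimed.pop(0)
            sims = X[unclaimed] @ X[e] if unclaimed else np.array([])
            members = [e] + [unclaimed[i]
                             for i in np.where(sims > sim_thresh)[0]]
            unclaimed = [i for i in unclaimed if i not in members]
            clusters.append(members)
        return clusters

    feats = np.random.randn(60, 32)
    print([len(c) for c in greedy_subcategories(feats)])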
ABSTRACT: In this work, we address the problem of complex event detection in unconstrained videos. We introduce a novel multi-way feature pooling approach which leverages segment-level information. The approach is simple and widely applicable to diverse audio-visual features. Our approach uses a set of clusters discovered via unsupervised clustering of segment-level features. Depending on feature characteristics, not only scene-based clusters but also motion- and audio-based clusters can be incorporated. Every video is then represented with multiple descriptors, where each descriptor is designed to relate to one of the pre-built clusters. For classification, intersection kernel SVMs are used, where the kernel is obtained by combining multiple kernels computed from corresponding per-cluster descriptor pairs. Evaluation on the TRECVID'11 MED dataset shows a significant improvement by the proposed approach over the state of the art.
Proceedings of the 21st ACM international conference on Multimedia; 10/2013
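A sketch of the multi-way pooling and the intersection kernel, with KMeans standing in for the unsupervised clustering step; for brevity one kernel is computed on the concatenated descriptors rather than combining per-cluster kernels as the paper does:

    import numpy as np
    from sklearn.cluster import KMeans

    def per_cluster_descriptors(videos, k):
        """videos: list of (n_segments, d) non-negative segment features.
        Each video gets k descriptors, the m-th being the average of its
        segments assigned to cluster m (zeros if none)."""
        km = KMeans(n_clusters=k, n_init=10).fit(np.vstack(videos))
        descs = []
        for v in videos:
            a = km.predict(v)
            descs.append(np.stack(
                [v[a == m].mean(axis=0) if (a == m).any()
                 else np.zeros(v.shape[1]) for m in range(k)]))
        return descs

    def intersection_kernel(h1, h2):
        """Histogram intersection over non-negative descriptors."""
        return np.minimum(h1, h2).sum()

    videos = [np.abs(np.random.randn(np.random.randint(5, 15), 20))
              for _ in range(4)]
    d = per_cluster_descriptors(videos, k=3)
    print(intersection_kernel(d[0].ravel(), d[1].ravel()))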
ABSTRACT: We introduce a graphical framework for multiple instance learning (MIL) based
on Markov networks. This framework can be used to model the traditional MIL
definition as well as more general MIL definitions. Different levels of
ambiguity -- the portion of positive instances in a bag -- can be explored in
weakly supervised data. To train these models, we propose a discriminative
max-margin learning algorithm leveraging efficient inference for
cardinality-based cliques. The efficacy of the proposed framework is evaluated
on a variety of data sets. Experimental results verify that encoding or
learning the degree of ambiguity can improve classification performance.
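The efficient inference the learning algorithm relies on can be sketched for a single cardinality-based clique: since the clique potential depends only on the count of positive labels, exact MAP reduces to a sort; the "at least one positive" potential below encodes the traditional MIL definition:

    import numpy as np

    def map_cardinality_clique(unary, clique):
        """Exact MAP for a clique whose potential depends only on the number
        of positive instances: for each count m the best labeling takes the
        m largest unaries, so one O(n log n) sort suffices."""
        order = np.argsort(unary)[::-1]
        prefix = np.concatenate([[0.0], np.cumsum(unary[order])])
        totals = [prefix[m] + clique(m, len(unary))
                  for m in range(len(unary) + 1)]
        m = int(np.argmax(totals))
        labels = np.zeros(len(unary), dtype=int)
        labels[order[:m]] = 1
        return labels

    at_least_one = lambda m, n: 0.0 if m >= 1 else -np.inf
    print(map_cardinality_clique(np.random.randn(8), at_least_one))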
ABSTRACT: Falls are the number one cause of injury in older adults. Lack of objective evidence on the cause and circumstances of falls is often a barrier to effective prevention strategies. Previous studies have established the ability of wearable miniature inertial sensors (accelerometers and gyroscopes) to automatically detect falls, for the purpose of delivering medical assistance. In the current study, we extend the applications of this technology, by developing and evaluating the accuracy of wearable sensor systems for determining the cause of falls. Twelve young adults participated in experimental trials involving falls due to seven causes: slips, trips, fainting, and incorrect shifting/transfer of body weight while sitting down, standing up from sitting, reaching and turning. Features (means and variances) of acceleration data acquired from four tri-axial accelerometers during the falling trials were input to a linear discriminant analysis technique. Data from an array of three sensors (left ankle+right ankle+sternum) provided at least 83% sensitivity and 89% specificity in classifying falls due to slips, trips, and incorrect shift of body weight during sitting, reaching and turning. Classification of falls due to fainting and incorrect shift during rising was less successful across all sensor combinations. Furthermore, similar classification accuracy was observed with data from wearable sensors and a video-based motion analysis system. These results establish a basis for the development of sensor-based fall monitoring systems that provide information on the cause and circumstances of falls, to direct fall prevention strategies at a patient or population level.
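The feature-plus-LDA pipeline is straightforward to sketch with scikit-learn; the data below are random stand-ins, and the channel layout (4 tri-axial sensors, hence 12 channels) is an assumption:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def fall_features(acc_windows):
        """acc_windows: (n_trials, n_samples, n_channels) accelerometer
        data per falling trial. Features are the per-channel means and
        variances, as in the study."""
        return np.hstack([acc_windows.mean(axis=1), acc_windows.var(axis=1)])

    acc = np.random.randn(70, 200, 12)      # synthetic stand-in trials
    causes = np.repeat(np.arange(7), 10)    # 7 fall-cause classes
    clf = LinearDiscriminantAnalysis().fit(fall_features(acc), causes)
    print(clf.score(fall_features(acc), causes))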
ABSTRACT: This paper describes an automated approach to road-user classification. The main motivation behind road-user classification in the context of safety stems from the necessity to learn traffic scenarios and understand patterns within each road-user class. The end goal of the analysis is to identify and learn scenarios that may contribute to hazards in traffic conditions. The classification relies on video data (movement trajectories) collected at urban intersections. The approach is based on the discrimination of the shapes of the speed profiles of each road-user type, more precisely, the discrimination between the speed patterns of vehicles and the ambulatory characteristics of pedestrians. The collected movement-trajectory data are represented as time series. The classification is performed using singular value decomposition and reconstruction of the time series. Two complementary methods are proposed based on the quality evaluation (correlation score) of the reconstructed trajectories. In the first method, a threshold-based decision procedure is applied. This approach is complemented in the second method by a semisupervised classification procedure guided by movement prototypes. The approach is validated on real-world data collected in Oakland, California. A correct classification rate of around 90% was achieved using both methods.
Journal of Computing in Civil Engineering 07/2013; 27(4):395-406. DOI:10.1061/(ASCE)CP.1943-5487.0000237
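A simplified subspace variant of the SVD-plus-reconstruction idea (the paper's exact time-series construction differs): fit singular vectors to one class's speed profiles, reconstruct a test profile from them, and threshold the correlation score:

    import numpy as np

    def fit_speed_subspace(train_profiles, rank=3):
        """train_profiles: (n, T) speed profiles of one road-user class
        (e.g., vehicles), resampled to a common length T. The top singular
        vectors span that class's typical profile shapes."""
        _, _, vt = np.linalg.svd(train_profiles - train_profiles.mean(axis=0),
                                 full_matrices=False)
        return train_profiles.mean(axis=0), vt[:rank]

    def correlation_score(profile, mean, basis):
        """Reconstruct the profile from the class subspace and score it by
        correlation with the original; thresholding this score gives the
        first (threshold-based) classification method."""
        centered = profile - mean
        recon = basis.T @ (basis @ centered) + mean
        return np.corrcoef(profile, recon)[0, 1]

    t = np.linspace(0, 1, 100)
    vehicles = 8 + np.random.randn(40, 1) * t[None, :]    # smooth ramps
    walkers = (1.3 + 0.2 * np.sin(2 * np.pi * 9 * t)
               + 0.05 * np.random.randn(30, 100))         # oscillating gait
    mean, basis = fit_speed_subspace(vehicles)
    print(correlation_score(vehicles[0], mean, basis),    # high score
          correlation_score(walkers[0], mean, basis))     # low score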