Greg Mori

Simon Fraser University, Burnaby, British Columbia, Canada

Publications (110) · 48.29 Total impact

  • ABSTRACT: In this work we introduce a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions. Our intuition is that the process of detecting actions is naturally one of observation and refinement: observing moments in video, and refining hypotheses about when an action is occurring. Based on this insight, we formulate our model as a recurrent neural network-based agent that interacts with a video over time. The agent observes video frames and decides both where to look next and when to emit a prediction. Since backpropagation is not adequate in this non-differentiable setting, we use REINFORCE to learn the agent's decision policy. Our model achieves state-of-the-art results on the THUMOS'14 and ActivityNet datasets while observing only a fraction (2% or less) of the video frames.
    No preview · Article · Nov 2015
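    A minimal PyTorch sketch of the REINFORCE update this abstract names, assuming a two-way decision policy over precomputed frame features; the sizes and identifiers are illustrative assumptions, not the authors' code.

    ```python
    # Score-function (REINFORCE) gradient for a stochastic decision policy,
    # used because the observe/emit decisions are non-differentiable.
    # Hypothetical sizes: 128-d observation features, 2 possible decisions.
    import torch
    import torch.nn as nn

    policy = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    def reinforce_update(observations, actions, returns, baseline=0.0):
        """observations: (T, 128) features seen during a rollout;
        actions: (T,) decisions that were sampled; returns: (T,) rewards."""
        logits = policy(observations)
        dist = torch.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(actions)
        # REINFORCE: ascend E[(R - b) * grad log pi(a | s)]
        loss = -((returns - baseline) * log_probs).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
    ```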
  • Source
    ABSTRACT: In group activity recognition, the temporal dynamics of the whole activity can be inferred based on the dynamics of the individual people representing the activity. We build a deep model to capture these dynamics based on LSTM (long short-term memory) models. To make use of these observations, we present a two-stage deep temporal model for the group activity recognition problem. In our model, one LSTM model is designed to represent the action dynamics of individual people in a sequence, and another LSTM model is designed to aggregate person-level information for whole-activity understanding. We evaluate our model on two datasets: the Collective Activity dataset and a new volleyball dataset. Experimental results demonstrate that our proposed model improves group activity recognition performance compared to baseline methods.
    Preview · Article · Nov 2015
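    A minimal sketch of the two-stage temporal structure this abstract describes, assuming per-person feature tracks as input; the dimensions and the max-pooling step are assumptions, not the paper's exact design.

    ```python
    # Stage 1: one LSTM encodes each person's action dynamics.
    # Stage 2: a second LSTM aggregates pooled person-level states
    # into a group-level activity prediction.
    import torch
    import torch.nn as nn

    class TwoStageLSTM(nn.Module):
        def __init__(self, feat_dim=512, hid=256, n_activities=8):
            super().__init__()
            self.person_lstm = nn.LSTM(feat_dim, hid, batch_first=True)
            self.group_lstm = nn.LSTM(hid, hid, batch_first=True)
            self.classifier = nn.Linear(hid, n_activities)

        def forward(self, x):
            # x: (batch, n_people, T, feat_dim) per-person feature tracks
            b, p, t, d = x.shape
            person_out, _ = self.person_lstm(x.reshape(b * p, t, d))
            person_out = person_out.reshape(b, p, t, -1)
            pooled = person_out.max(dim=1).values   # pool over people: (b, t, hid)
            group_out, _ = self.group_lstm(pooled)
            return self.classifier(group_out[:, -1])  # activity logits
    ```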
  • Source
    ABSTRACT: Images of scenes contain various objects as well as abundant attributes, and diverse levels of visual categorization are possible. A natural image can be assigned fine-grained labels that describe major components, coarse-grained labels that depict high-level abstraction, or a set of labels that reveal attributes. Such categorization at different concept layers can be modeled with label graphs encoding label information. In this paper, we exploit this rich information with a state-of-the-art deep learning framework, and propose a generic structured model that leverages diverse label relations to improve image classification performance. Our approach employs a novel stacked label prediction neural network, capturing both inter-level and intra-level label semantics. We evaluate our method on benchmark image datasets, and empirical results illustrate the efficacy of our model.
    Preview · Article · Nov 2015
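    An illustrative sketch of stacking label predictions across concept layers: coarse-level scores are fed, together with image features, into a fine-level predictor so inter-level label semantics can be exploited. The two-level structure and all layer sizes are assumptions for clarity.

    ```python
    import torch
    import torch.nn as nn

    class StackedLabelNet(nn.Module):
        def __init__(self, feat_dim=2048, n_coarse=20, n_fine=200):
            super().__init__()
            self.coarse_head = nn.Linear(feat_dim, n_coarse)
            # The fine head sees features plus coarse scores (the "stack").
            self.fine_head = nn.Linear(feat_dim + n_coarse, n_fine)

        def forward(self, feats):
            coarse = self.coarse_head(feats)                      # (b, n_coarse)
            fine_in = torch.cat([feats, torch.sigmoid(coarse)], 1)
            fine = self.fine_head(fine_in)                        # (b, n_fine)
            return coarse, fine
    ```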
  • Zhiwei Deng · Arash Vahdat · Hexiang Hu · Greg Mori
    ABSTRACT: Rich semantic relations are important in a variety of visual recognition problems. As a concrete example, group activity recognition involves the interactions and relative spatial relations of a set of people in a scene. State-of-the-art recognition methods center on deep learning approaches for training highly effective, complex classifiers for interpreting images. However, bridging the relatively low-level concepts output by these methods to interpret higher-level compositional scenes remains a challenge. Graphical models are a standard tool for this task. In this paper, we propose a method to integrate graphical models and deep neural networks into a joint framework. Instead of using a traditional inference method, we use a sequential prediction approximation modeled by a recurrent neural network. Beyond this, the appropriate structure for inference can be learned by imposing gates on the edges between nodes. Empirical results on group activity recognition demonstrate the potential of this model to handle highly structured learning tasks.
    No preview · Article · Nov 2015
  • Source
    ABSTRACT: This paper presents a deep neural-network-based hierarchical graphical model for individual and group activity recognition in surveillance scenes. Deep networks are used to recognize the actions of individual people in a scene. Next, a neural-network-based hierarchical graphical model refines the predicted labels for each class by considering dependencies between the classes. This refinement step mimics a message-passing step similar to inference in a probabilistic graphical model. We show that this approach can be effective in group activity recognition, with the deep graphical model improving recognition rates over baseline methods.
    Full-text · Conference Paper · Sep 2015
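    A minimal sketch of one neural "message passing" refinement step in the spirit of this abstract: per-node label scores from a deep network are refined with a learned function of neighboring nodes' scores, mimicking inference in a graphical model. The single shared message function is an assumption.

    ```python
    import torch
    import torch.nn as nn

    class RefinementStep(nn.Module):
        def __init__(self, n_labels=10):
            super().__init__()
            self.msg = nn.Linear(n_labels, n_labels)   # learned pairwise term

        def forward(self, scores, adj):
            # scores: (n_nodes, n_labels) unary predictions from a deep net
            # adj:    (n_nodes, n_nodes) 0/1 graph structure
            messages = adj @ self.msg(torch.softmax(scores, dim=1))
            return scores + messages                   # refined label scores
    ```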
  • Source
    ABSTRACT: An approach for vehicle conflict analysis based on three-dimensional (3-D) vehicle detection is presented. Techniques for quantitative conflict measurement often use a point trajectory representation for vehicles; more accurate conflict measurement can be facilitated with a region-based vehicle representation instead. This paper describes a computer vision approach for extracting vehicle trajectories from video sequences. The method relies on a fusion of background subtraction and feature-based tracking to provide a 3-D cuboid representation of each vehicle. Standard conflict measures, including time to collision and postencroachment time, were computed with the use of the 3-D cuboid vehicle representations. The use of these conflict measures was demonstrated on a challenging data set of video footage. Results showed that the region-based representation provides more precise calculation of traffic conflict indicators than approaches based on a point representation.
    Full-text · Article · Sep 2015 · Transportation Research Record Journal of the Transportation Research Board
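    A small worked sketch (not the paper's code) of the two standard conflict indicators named in the abstract, written for simple point representations; the paper's contribution is computing such measures from 3-D cuboid regions instead.

    ```python
    import numpy as np

    def time_to_collision(gap_m, v_follower, v_leader):
        """Constant-velocity TTC for a following conflict, in seconds.
        Returns inf when the gap is not closing."""
        closing = v_follower - v_leader
        return gap_m / closing if closing > 0 else np.inf

    def post_encroachment_time(t_first_exits, t_second_enters):
        """PET: time between the first road user leaving the conflict
        area and the second one entering it."""
        return t_second_enters - t_first_exits

    print(time_to_collision(gap_m=20.0, v_follower=15.0, v_leader=10.0))      # 4.0 s
    print(post_encroachment_time(t_first_exits=12.3, t_second_enters=13.8))   # 1.5 s
    ```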
  • Source
    ABSTRACT: Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long short-term memory (LSTM) deep networks for modeling these temporal relations via multiple input and output connections. We show that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.
    Preview · Article · Jul 2015
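    A minimal sketch of dense multi-label action labeling with an LSTM: every frame gets independent sigmoid scores for all classes, so multiple actions can be active at once. The class count (65, as in MultiTHUMOS) and the single-LSTM layout are simplifications; the multiple input/output connectivity variant in the abstract is not reproduced here.

    ```python
    import torch
    import torch.nn as nn

    class DenseLabeler(nn.Module):
        def __init__(self, feat_dim=1024, hid=512, n_classes=65):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hid, batch_first=True)
            self.head = nn.Linear(hid, n_classes)

        def forward(self, frames):                 # frames: (b, T, feat_dim)
            h, _ = self.lstm(frames)
            return torch.sigmoid(self.head(h))     # (b, T, n_classes) per frame

    # Training would use per-frame binary cross-entropy against dense labels:
    # loss = nn.BCELoss()(model(frames), dense_targets)
    ```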
  • ABSTRACT: Detecting objects such as humans or vehicles is a central problem in video surveillance. Myriad standard approaches exist for this problem. At their core, approaches consider either the appearance of people, patterns of their motion, or differences from the background. In this paper we build on dense trajectories, a state-of-the-art approach for describing spatio-temporal patterns in video sequences. We demonstrate an application of dense trajectories to object detection in surveillance video, showing that they can be used both to regress estimates of object locations and to accurately classify objects.
    No preview · Article · Jul 2015
  • Source
    ABSTRACT: We present a method for learning an embedding that places images of humans in similar poses nearby. This embedding can be used as a direct method of comparing images based on human pose, avoiding potential challenges of estimating body joint positions. Pose embedding learning is formulated under a triplet-based distance criterion. A deep architecture is used to allow learning of a representation capable of making distinctions between different poses. Experiments on human pose matching and retrieval from video data demonstrate the potential of the method.
    Preview · Article · Jul 2015
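    A minimal sketch of the triplet-based distance criterion this abstract formulates: an anchor image should be closer to a same-pose positive than to a different-pose negative by a margin. The embedding net, feature size, and margin value are assumptions.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical embedding: 4096-d image features -> 128-d pose space.
    embed = nn.Sequential(nn.Linear(4096, 512), nn.ReLU(), nn.Linear(512, 128))

    def triplet_loss(anchor, positive, negative, margin=0.2):
        a, p, n = embed(anchor), embed(positive), embed(negative)
        d_pos = F.pairwise_distance(a, p)
        d_neg = F.pairwise_distance(a, n)
        # Hinge on the margin: zero loss once positives are closer by `margin`.
        return F.relu(d_pos - d_neg + margin).mean()
    ```

    PyTorch also ships this criterion directly as nn.TripletMarginLoss, which could replace the hand-rolled hinge above.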
  • Source
    ABSTRACT: We present a novel approach for discovering human interactions in videos. Activity understanding techniques usually require a large number of labeled examples, which are not available in many practical cases. Here, we focus on recovering semantically meaningful clusters of human-human and human-object interaction in an unsupervised fashion. A new iterative solution is introduced based on Maximum Margin Clustering (MMC), which also accepts user feedback to refine clusters. This is achieved by formulating the whole process as a unified constrained latent max-margin clustering problem. Extensive experiments have been carried out over three challenging datasets, Collective Activity, VIRAT, and UT-interaction. Empirical results demonstrate that the proposed algorithm can efficiently discover perfect semantic clusters of human interactions with only a small amount of labeling effort.
    Full-text · Conference Paper · Jun 2015
  • Source
    ABSTRACT: In this paper, we propose to learn temporal embeddings of video frames for complex video analysis. Large quantities of unlabeled video data can be easily obtained from the Internet. These videos possess the implicit weak label that they are sequences of temporally and semantically coherent images. We leverage this information to learn temporal embeddings for video frames by associating frames with the temporal context that they appear in. To do this, we propose a scheme for incorporating temporal context based on past and future frames in videos, and compare this to other contextual representations. In addition, we show how data augmentation using multi-resolution samples and hard negatives helps to significantly improve the quality of the learned embeddings. We evaluate various design decisions for learning temporal embeddings, and show that our embeddings can improve performance for multiple video tasks such as retrieval, classification, and temporal order recovery in unconstrained Internet video.
    Preview · Article · May 2015
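    An illustrative sketch of the weak label this abstract exploits: frames close in time are treated as positives, distant frames within the same video as hard negatives, and an embedding is trained with a margin loss. The window size, negative gap, and margin are assumptions, not the paper's settings.

    ```python
    import torch
    import torch.nn.functional as F

    def context_triples(num_frames, window=2, neg_gap=30):
        """Yield (anchor, positive, negative) frame indices for one video."""
        for t in range(num_frames - neg_gap):
            for dt in range(1, window + 1):
                yield t, t + dt, t + neg_gap

    def embedding_loss(embed, anchors, positives, negatives, margin=0.5):
        # Each input: (N, d) stacked frame features for the sampled triples.
        a, p, n = embed(anchors), embed(positives), embed(negatives)
        return F.relu(F.pairwise_distance(a, p)
                      - F.pairwise_distance(a, n) + margin).mean()
    ```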
  • ABSTRACT: Not all frames are equal – selecting a subset of discriminative frames from a video can improve performance at detecting and recognizing human interactions. In this paper we present models for categorizing a video into one of a number of predefined interactions or for detecting these interactions in a long video sequence. The models represent the interaction by a set of key temporal moments and the spatial structures they entail. For instance: two people approaching each other, then extending their hands before engaging in a “handshaking” interaction. Learning the model parameters requires only weak supervision in the form of an overall label for the interaction. Experimental results on the UT-Interaction and VIRAT datasets verify the efficacy of these structured models for human interactions.
    No preview · Article · Mar 2015 · Computer Vision and Image Understanding
  • Source
    ABSTRACT: Many visual recognition problems can be approached by counting instances. To determine whether an event is present in a long internet video, one could count how many frames seem to contain the activity. Classifying the activity of a group of people can be done by counting the actions of individual people. Encoding these cardinality relationships can reduce sensitivity to clutter, in the form of irrelevant frames or individuals not involved in a group activity. Learned parameters can encode how many instances tend to occur in a class of interest. To this end, this paper develops a powerful and flexible framework to infer any cardinality relation between latent labels in a multi-instance model. Hard or soft cardinality relations can be encoded to tackle diverse levels of ambiguity. Experiments on tasks such as human activity recognition, video event detection, and video summarization demonstrate the effectiveness of using cardinality relations for improving recognition results.
    Full-text · Article · Feb 2015
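    A small sketch of a cardinality relation in a multi-instance score: the bag score maximizes, over the count c of positive instances, the sum of the top-c instance scores plus a potential on c. Sorting makes this exact, and the potential can encode hard or soft preferences; the "at least 3" potential below is only an example.

    ```python
    import numpy as np

    def bag_score(instance_scores, card_potential):
        """instance_scores: (m,) per-instance confidences;
        card_potential: callable scoring the number of positive instances."""
        order = np.sort(instance_scores)[::-1]       # best instances first
        totals = [order[:c].sum() + card_potential(c)
                  for c in range(len(order) + 1)]
        return max(totals)

    # Soft cardinality relation: prefer labelings with >= 3 positive instances.
    score = bag_score(np.array([0.9, 0.8, 0.2, 0.1]),
                      lambda c: 0.0 if c >= 3 else float(c - 3))
    ```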
  • Source
    ABSTRACT: We present a hierarchical maximum-margin clustering method for unsupervised data analysis. Our method extends beyond flat maximum-margin clustering and performs clustering recursively in a top-down manner. We propose an effective greedy splitting criterion for selecting which cluster to split next, and employ regularizers that enforce feature sharing/competition for capturing data semantics. Experimental results obtained on four standard datasets show that our method outperforms flat and hierarchical clustering baselines, while forming clean and semantically meaningful cluster hierarchies.
    Full-text · Article · Feb 2015
  • Source
    ABSTRACT: Several studies have shown that cyclists can reduce the risk of severe head injuries by wearing a helmet. A system is proposed to collect cyclist helmet usage data automatically from video footage. Computer vision techniques are used to track the moving objects and then to analyze the object trajectories and speed profiles to identify cyclists. Image features are extracted from a region around the cyclist's head. Support vector machines determine whether the cyclist is wearing a helmet. The system can be approximately 90% accurate in cyclist classification when provided with accurate tracks of the cyclist's head. Even for situations in which obtaining video to track a cyclist is challenging, the proposed method provides an effective retrieval system, potentially reducing the number of video records that must be analyzed manually to find instances of cyclists not wearing helmets.
    Full-text · Article · Dec 2014 · Transportation Research Record Journal of the Transportation Research Board
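    A minimal scikit-learn sketch of the classification stage described above: descriptors from a head region are fed to a support vector machine that decides helmet vs. no helmet. The feature dimensionality and the random placeholder data are assumptions standing in for real extracted features.

    ```python
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train = rng.random((100, 324))     # placeholder head-region descriptors
    y_train = rng.integers(0, 2, 100)    # 1 = helmet, 0 = no helmet

    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X_train, y_train)
    predictions = clf.predict(rng.random((5, 324)))   # helmet decisions
    ```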
  • ABSTRACT: In sustainable urban planning, non-motorised active modes of travel such as walking are identified as a leading driver for a healthy, liveable and resource-efficient environment. To encourage walking, there is a need for a solid understanding of pedestrian walking behaviour. This understanding is central to the evaluation of measures of walking conditions such as comfort and efficiency. The main purpose of this study is to gain an in-depth understanding of pedestrian walking behaviour through the investigation of the spatio-temporal gait parameters (step length and step frequency). This microscopic-level analysis provides insight into pedestrian walking mechanisms and the effect of various attributes such as gender and age. The analysis relies on automated video-based data collection using computer vision techniques. Step frequency and step length are estimated based on oscillatory patterns in the walking speed profile. The study uses real-world video data collected in downtown Vancouver, BC. The results show that the gait parameters are influenced by factors such as crosswalk grade, pedestrian gender, age and group size. Step length was found to generally have more influence on walking speed than step frequency. It was also found that, compared to males, females increase their step frequency to increase their walking speed.
    No preview · Article · Mar 2014
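    An illustrative sketch of recovering gait parameters from the oscillatory walking-speed profile, as the abstract describes: the dominant frequency of the speed signal gives step frequency, and mean speed divided by step frequency gives step length. The sampling rate and the simple mean-removal detrending are assumptions.

    ```python
    import numpy as np

    def gait_parameters(speed, fps=30.0):
        """speed: (T,) walking speed in m/s sampled at fps Hz."""
        osc = speed - speed.mean()                 # keep the oscillatory part
        spectrum = np.abs(np.fft.rfft(osc))
        freqs = np.fft.rfftfreq(len(osc), d=1.0 / fps)
        step_freq = freqs[spectrum[1:].argmax() + 1]   # skip the DC bin
        step_length = speed.mean() / step_freq         # metres per step
        return step_freq, step_length
    ```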
  • Source
    Full-text · Dataset · Jan 2014
  • No preview · Conference Paper · Jan 2014
  • ABSTRACT: We present a system for multimedia event detection. The developed system characterizes complex multimedia events based on a large array of multimodal features, and classifies unseen videos by effectively fusing diverse responses. We present three major technical innovations. First, we explore novel visual and audio features across multiple semantic granularities, including building, often in an unsupervised manner, mid-level and high-level features upon low-level features to enable semantic understanding. Second, we present a novel latent SVM model which learns and localizes discriminative high-level concepts in cluttered video sequences. In addition to improving detection accuracy beyond existing approaches, it enables a unique summary for every retrieval through its use of high-level concepts and temporal evidence localization. The resulting summary provides some transparency into why the system classified the video as it did. Finally, we present novel fusion learning algorithms and our methodology to improve fusion learning under limited training data conditions. Thorough evaluation on a large TRECVID MED 2011 dataset showcases the benefits of the presented system.
    No preview · Article · Jan 2014 · Machine Vision and Applications
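    A minimal late-fusion sketch in the spirit of this abstract: per-modality classifier scores are combined with weights learned on held-out data. Using logistic regression as the fusion learner and the random placeholder scores are assumptions, not the system's actual fusion algorithm.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    # Placeholder per-modality detector outputs (e.g. visual, audio, concepts)
    scores_val = rng.random((200, 3))
    y_val = rng.integers(0, 2, 200)            # event present / absent

    fusion = LogisticRegression()
    fusion.fit(scores_val, y_val)              # learns per-modality weights
    fused = fusion.predict_proba(rng.random((10, 3)))[:, 1]
    ```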
  • Conference Paper: "You are green"
    No preview · Conference Paper · Jan 2014