Greg Mori

Simon Fraser University, Burnaby, British Columbia, Canada

Publications (96) · 34.25 Total impact

  • ABSTRACT: Not all frames are equal – selecting a subset of discriminative frames from a video can improve performance at detecting and recognizing human interactions. In this paper we present models for categorizing a video into one of a number of predefined interactions or for detecting these interactions in a long video sequence. The models represent the interaction by a set of key temporal moments and the spatial structures they entail. For instance: two people approaching each other, then extending their hands before engaging in a “handshaking” interaction. Learning the model parameters requires only weak supervision in the form of an overall label for the interaction. Experimental results on the UT-Interaction and VIRAT datasets verify the efficacy of these structured models for human interactions.
    Computer Vision and Image Understanding 03/2015; DOI:10.1016/j.cviu.2015.02.012 · 1.23 Impact Factor
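    As an illustration only (the features, weights, and class names below are made up, not the paper's model), the key-moment idea can be sketched as scoring a video by latently picking one frame per key temporal moment, in order, with dynamic programming:
    # Hypothetical sketch: classify a video by latently selecting K ordered "key moments".
    import numpy as np

    def best_keyframe_score(frames, moment_weights):
        """DP over frames: pick K ordered frames, one per key moment, maximizing total score.

        frames:         (T, D) array of per-frame features.
        moment_weights: (K, D) array, one linear scorer per key temporal moment.
        """
        T, _ = frames.shape
        K = moment_weights.shape[0]
        scores = frames @ moment_weights.T            # (T, K) frame-vs-moment scores
        dp = np.full((K + 1, T + 1), -np.inf)
        dp[0, :] = 0.0                                # assigning zero moments costs nothing
        for k in range(1, K + 1):
            for t in range(k, T + 1):
                skip = dp[k, t - 1]                   # frame t-1 is not used for moment k
                take = dp[k - 1, t - 1] + scores[t - 1, k - 1]
                dp[k, t] = max(skip, take)
        return dp[K, T]

    def classify(frames, per_class_weights):
        # pick the interaction class whose key-moment model best explains the video
        return max(per_class_weights, key=lambda c: best_keyframe_score(frames, per_class_weights[c]))

    rng = np.random.default_rng(0)
    video = rng.normal(size=(40, 16))
    models = {"handshake": rng.normal(size=(3, 16)), "push": rng.normal(size=(3, 16))}
    print(classify(video, models))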
  • ABSTRACT: We present a novel approach for discovering human interactions in videos. Activity understanding techniques usually require a large number of labeled examples, which are not available in many practical cases. Here, we focus on recovering semantically meaningful clusters of human-human and human-object interaction in an unsupervised fashion. A new iterative solution is introduced based on Maximum Margin Clustering (MMC), which also accepts user feedback to refine clusters. This is achieved by formulating the whole process as a unified constrained latent max-margin clustering problem. Extensive experiments have been carried out over three challenging datasets, Collective Activity, VIRAT, and UT-Interaction. Empirical results demonstrate that the proposed algorithm can efficiently discover perfect semantic clusters of human interactions with only a small amount of labeling effort.
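    A crude alternating approximation to max-margin clustering (not the constrained latent formulation in the abstract, and without the user-feedback constraints): initialize with k-means, then alternate between fitting one-vs-rest linear SVMs on the current labels and reassigning points to the cluster whose classifier scores them highest.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def approx_mmc(X, n_clusters=3, n_iters=10, seed=0):
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
        for _ in range(n_iters):
            svm = LinearSVC(C=1.0).fit(X, labels)                  # classifiers for current clusters
            new_labels = svm.decision_function(X).argmax(axis=1)   # reassign by highest score
            if np.array_equal(new_labels, labels) or len(np.unique(new_labels)) < n_clusters:
                break                                              # converged or a cluster collapsed
            labels = new_labels
        return labels

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (-3, 0, 3)])
    print(np.bincount(approx_mmc(X, n_clusters=3)))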
  • ABSTRACT: We present a hierarchical maximum-margin clustering method for unsupervised data analysis. Our method extends beyond flat maximum-margin clustering, and performs clustering recursively in a top-down manner. We propose an effective greedy splitting criterion for selecting which cluster to split next, and employ regularizers that enforce feature sharing/competition for capturing data semantics. Experimental results obtained on four standard datasets show that our method outperforms flat and hierarchical clustering baselines, while forming clean and semantically meaningful cluster hierarchies.
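    An illustrative top-down splitting loop in the same spirit: repeatedly split the leaf with the largest within-cluster scatter using a 2-way split. The k-means splitter and scatter criterion are stand-ins for the paper's max-margin splits and greedy criterion.
    import numpy as np
    from sklearn.cluster import KMeans

    def hierarchical_split(X, n_leaves=4, seed=0):
        leaves = [np.arange(len(X))]                               # each leaf is an index set
        while len(leaves) < n_leaves:
            # greedily pick the most spread-out leaf to split next
            worst = max(range(len(leaves)),
                        key=lambda i: X[leaves[i]].var(axis=0).sum() * len(leaves[i]))
            idx = leaves.pop(worst)
            split = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
            leaves += [idx[split == 0], idx[split == 1]]
        return leaves

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(loc=c, size=(25, 2)) for c in (-6, -2, 2, 6)])
    for leaf in hierarchical_split(X, n_leaves=4):
        print(len(leaf), X[leaf].mean(axis=0).round(1))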
  • ABSTRACT: Many visual recognition problems can be approached by counting instances. To determine whether an event is present in a long internet video, one could count how many frames seem to contain the activity. Classifying the activity of a group of people can be done by counting the actions of individual people. Encoding these cardinality relationships can reduce sensitivity to clutter, in the form of irrelevant frames or individuals not involved in a group activity. Learned parameters can encode how many instances tend to occur in a class of interest. To this end, this paper develops a powerful and flexible framework to infer any cardinality relation between latent labels in a multi-instance model. Hard or soft cardinality relations can be encoded to tackle diverse levels of ambiguity. Experiments on tasks such as human activity recognition, video event detection, and video summarization demonstrate the effectiveness of using cardinality relations for improving recognition results.
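    A minimal sketch of why cardinality relations admit efficient inference (the potential below is illustrative): when the joint term depends only on how many instances are labeled positive, the best joint labeling can be found by sorting the per-instance scores.
    import numpy as np

    def best_labeling_with_cardinality(instance_scores, cardinality_potential):
        """Maximize sum of selected instance scores + potential(#selected)."""
        order = np.argsort(instance_scores)[::-1]                  # highest scores first
        prefix = np.concatenate(([0.0], np.cumsum(instance_scores[order])))
        totals = [prefix[k] + cardinality_potential(k) for k in range(len(instance_scores) + 1)]
        k_best = int(np.argmax(totals))
        labels = np.zeros(len(instance_scores), dtype=int)
        labels[order[:k_best]] = 1
        return labels, totals[k_best]

    scores = np.array([2.0, -0.5, 1.2, -3.0, 0.1])
    # e.g. a soft preference for roughly half of the instances being positive
    potential = lambda k: -abs(k - len(scores) / 2.0)
    print(best_labeling_with_cardinality(scores, potential))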
  • ABSTRACT: In sustainable urban planning, non-motorised active modes of travel such as walking are identified as a leading driver for a healthy, liveable and resource-efficient environment. To encourage walking, there is a need for a solid understanding of pedestrian walking behaviour. This understanding is central to the evaluation of measures of walking conditions such as comfort and efficiency. The main purpose of this study is to gain an in-depth understanding of pedestrian walking behaviour through the investigation of the spatio-temporal gait parameters (step length and step frequency). This microscopic-level analysis provides insight into the pedestrian walking mechanisms and the effect of various attributes such as gender and age. This analysis relies on automated video-based data collection using computer vision techniques. The step frequency and step length are estimated based on oscillatory patterns in the walking speed profile. The study uses real-world video data collected in downtown Vancouver, BC. The results show that the gait parameters are influenced by factors such as crosswalk grade, pedestrian gender, age and group size. The step length was found to generally have more influence on walking speed than step frequency. It was also found that, compared to males, females increase their step frequency to increase their walking speed.
    03/2014; 10(3). DOI:10.1080/18128602.2012.727498
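    A simplified illustration of the gait-parameter estimate (windowing, smoothing and tracking details are omitted relative to the paper): the step frequency is taken as the dominant frequency of the detrended speed profile, and the step length as mean speed divided by step frequency.
    import numpy as np

    def gait_parameters(speed, fps):
        """speed: per-frame walking speed (m/s); fps: video frame rate (Hz)."""
        detrended = speed - speed.mean()
        spectrum = np.abs(np.fft.rfft(detrended))
        freqs = np.fft.rfftfreq(len(speed), d=1.0 / fps)
        step_freq = freqs[spectrum[1:].argmax() + 1]               # skip the DC bin
        step_len = speed.mean() / step_freq                        # metres per step
        return step_freq, step_len

    fps, t = 30.0, np.arange(0, 10, 1 / 30.0)
    speed = 1.3 + 0.15 * np.sin(2 * np.pi * 1.9 * t)               # ~1.9 steps/s around 1.3 m/s
    print(gait_parameters(speed, fps))                              # approx. (1.9, 0.68)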
  • ABSTRACT: We present a system for multimedia event detection. The developed system characterizes complex multimedia events based on a large array of multimodal features, and classifies unseen videos by effectively fusing diverse responses. We present three major technical innovations. First, we explore novel visual and audio features across multiple semantic granularities, including building, often in an unsupervised manner, mid-level and high-level features upon low-level features to enable semantic understanding. Second, we present a novel latent SVM model that learns and localizes discriminative high-level concepts in cluttered video sequences. In addition to improving detection accuracy beyond existing approaches, it enables a unique summary for every retrieval by its use of high-level concepts and temporal evidence localization. The resulting summary provides some transparency into why the system classified the video as it did. Finally, we present novel fusion learning algorithms and our methodology to improve fusion learning under limited training data conditions. Thorough evaluation on the large TRECVID MED 2011 dataset showcases the benefits of the presented system.
    Machine Vision and Applications 01/2014; DOI:10.1007/s00138-013-0525-x · 1.44 Impact Factor
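    A generic late-fusion ("stacking") sketch, not the paper's fusion algorithm: one classifier is trained per modality, and a logistic regression over their scores on a held-out split learns how much to trust each modality. The modalities and features are synthetic.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(3)
    n = 200
    y = rng.integers(0, 2, n)
    modalities = {                                                 # illustrative per-modality features
        "visual": y[:, None] * 1.0 + rng.normal(size=(n, 5)),
        "audio":  y[:, None] * 0.3 + rng.normal(size=(n, 5)),
    }
    train, val = np.arange(0, 120), np.arange(120, n)

    score_cols = []                                                # per-modality classifier scores
    for X in modalities.values():
        clf = LinearSVC(C=1.0).fit(X[train], y[train])
        score_cols.append(clf.decision_function(X))
    scores = np.column_stack(score_cols)

    fusion = LogisticRegression().fit(scores[val], y[val])         # learned fusion weights
    print("modality weights:", fusion.coef_.round(2))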
  • ABSTRACT: We present a compositional model for video event detection. A video is modeled using a collection of both global and segment-level features, and kernel functions are employed for similarity comparisons. The locations of salient, discriminative video segments are treated as a latent variable, allowing the model to explicitly ignore portions of the video that are unimportant for classification. A novel multiple kernel learning (MKL) latent support vector machine (SVM) is defined and used to combine and re-weight multiple feature types in a principled fashion while simultaneously operating within the latent variable framework. The compositional nature of the proposed model allows it to respond directly to the challenges of temporal clutter and intra-class variation, which are prevalent in unconstrained internet videos. Experimental results on the TRECVID Multimedia Event Detection 2011 (MED11) dataset demonstrate the efficacy of the method.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
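    A schematic of the scoring rule in such a multiple-kernel latent model (the kernel weights, dual coefficients, and feature types are invented for illustration): each candidate segment is scored by a weighted combination of per-feature kernels against training exemplars, and the latent segment is the arg-max.
    import numpy as np

    def histogram_intersection(A, B):
        # kernel matrix between rows of A and rows of B
        return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

    def latent_mkl_score(segment_feats, train_feats, dual_coef, kernel_weights):
        """segment_feats/train_feats: dict feature_name -> (n_segments/n_train, dim)."""
        total = 0.0
        for name, beta in kernel_weights.items():
            K = histogram_intersection(segment_feats[name], train_feats[name])
            total = total + beta * (K @ dual_coef)                 # per-kernel contribution
        best = int(np.argmax(total))                               # latent: best-scoring segment
        return total[best], best

    rng = np.random.default_rng(4)
    seg = {"motion": rng.random((8, 16)), "audio": rng.random((8, 10))}
    trn = {"motion": rng.random((5, 16)), "audio": rng.random((5, 10))}
    print(latent_mkl_score(seg, trn, dual_coef=rng.normal(size=5),
                           kernel_weights={"motion": 0.7, "audio": 0.3}))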
  • Arash Vahdat, Greg Mori
    ABSTRACT: Gathering accurate training data for recognizing a set of attributes or tags on images or videos is a challenge. Obtaining labels via manual effort or from weakly-supervised data typically results in noisy training labels. We develop the FlipSVM, a novel algorithm for handling these noisy, structured labels. The FlipSVM models label noise by "flipping" labels on training examples. We show empirically that the FlipSVM is effective on images-and-attributes and video tagging datasets.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
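    A simplified caricature of learning with label flips (not the FlipSVM objective itself): alternately train a linear SVM and flip the labels of the few training examples the current model most strongly disagrees with, under a budget.
    import numpy as np
    from sklearn.svm import LinearSVC

    def train_with_flips(X, y_noisy, flip_budget=5, n_rounds=3):
        y = y_noisy.copy()
        for _ in range(n_rounds):
            clf = LinearSVC(C=1.0).fit(X, y)
            margins = clf.decision_function(X) * (2 * y - 1)       # signed agreement with labels
            worst = np.argsort(margins)[:flip_budget]              # most violated examples
            flip = worst[margins[worst] < 0]                       # only flip actual disagreements
            if len(flip) == 0:
                break
            y[flip] = 1 - y[flip]
        return clf, y

    rng = np.random.default_rng(5)
    y_true = rng.integers(0, 2, 150)
    X = y_true[:, None] * 2.0 + rng.normal(size=(150, 3))
    y_noisy = y_true.copy()
    y_noisy[rng.choice(150, 10, replace=False)] ^= 1                # inject label noise
    clf, y_cleaned = train_with_flips(X, y_noisy)
    print("labels changed:", int((y_cleaned != y_noisy).sum()))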
  • ABSTRACT: The appearance of an object changes profoundly with pose, camera view and interactions of the object with other objects in the scene. This makes it challenging to learn detectors based on an object-level label (e.g., "car"). We postulate that having a richer set of labelings for an object (at different levels of granularity), including finer-grained subcategories that are consistent in appearance and view, and higher-order composites, i.e., contextual groupings of objects consistent in their spatial layout and appearance, can significantly alleviate these problems. However, obtaining such a rich set of annotations, including annotation of an exponentially growing set of object groupings, is simply not feasible. We propose a weakly-supervised framework for object detection where we discover subcategories and the composites automatically, with only traditional object-level category labels as input. To this end, we first propose an exemplar-SVM-based clustering approach, with latent SVM refinement, that discovers a variable-length set of discriminative subcategories for each object class. We then develop a structured model for object detection that captures interactions among object subcategories and automatically discovers semantically meaningful and discriminatively relevant visual composites. We show that this model produces state-of-the-art performance on the UIUC phrase object detection benchmark.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
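    A rough stand-in for the subcategory-discovery step: group the positives of one object class into appearance-consistent subcategories by clustering a pairwise similarity matrix. The paper uses exemplar-SVM scores with latent SVM refinement; the plain cosine similarity and spectral clustering below only illustrate the idea.
    import numpy as np
    from sklearn.cluster import SpectralClustering
    from sklearn.metrics.pairwise import cosine_similarity

    def discover_subcategories(pos_features, n_subcats=3, seed=0):
        affinity = cosine_similarity(pos_features) + 1e-3          # jitter keeps the graph connected
        sc = SpectralClustering(n_clusters=n_subcats, affinity="precomputed", random_state=seed)
        return sc.fit_predict(affinity)

    rng = np.random.default_rng(6)
    # three made-up "viewpoints", each active in a different block of feature dimensions
    feats = np.vstack([np.pad(rng.random((20, 10)) + 1.0, ((0, 0), (10 * b, 20 - 10 * b)))
                       for b in range(3)])
    print(np.bincount(discover_subcategories(feats, n_subcats=3)))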
  • ABSTRACT: In this work, we address the problem of complex event detection in unconstrained videos. We introduce a novel multi-way feature pooling approach which leverages segment-level information. The approach is simple and widely applicable to diverse audio-visual features. Our approach uses a set of clusters discovered via unsupervised clustering of segment-level features. Depending on feature characteristics, not only scene-based clusters but also motion/audio-based clusters can be incorporated. Then, every video is represented with multiple descriptors, where each descriptor is designed to relate to one of the pre-built clusters. For classification, intersection kernel SVMs are used, where the kernel is obtained by combining multiple kernels computed from corresponding per-cluster descriptor pairs. Evaluation on the TRECVID'11 MED dataset shows a significant improvement by the proposed approach over the state of the art.
    Proceedings of the 21st ACM international conference on Multimedia; 10/2013
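    A sketch of the multi-way pooling idea with invented parameters: segment-level features are clustered once over the corpus, each video gets one pooled descriptor per cluster, and the per-cluster histogram-intersection kernels are summed to form the final SVM kernel.
    import numpy as np
    from sklearn.cluster import KMeans

    def per_cluster_descriptors(video_segments, kmeans):
        """video_segments: (n_segments, dim) -> (n_clusters, dim) mean-pooled per cluster."""
        assign = kmeans.predict(video_segments)
        D = np.zeros((kmeans.n_clusters, video_segments.shape[1]))
        for c in range(kmeans.n_clusters):
            if np.any(assign == c):
                D[c] = video_segments[assign == c].mean(axis=0)
        return D

    def combined_intersection_kernel(desc_a, desc_b):
        # sum of per-cluster histogram-intersection kernels
        return sum(np.minimum(desc_a[c], desc_b[c]).sum() for c in range(len(desc_a)))

    rng = np.random.default_rng(7)
    videos = [rng.random((rng.integers(5, 15), 12)) for _ in range(4)]      # segment features
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(np.vstack(videos))
    descs = [per_cluster_descriptors(v, kmeans) for v in videos]
    print(round(combined_intersection_kernel(descs[0], descs[1]), 3))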
  • ABSTRACT: We introduce a graphical framework for multiple instance learning (MIL) based on Markov networks. This framework can be used to model the traditional MIL definition as well as more general MIL definitions. Different levels of ambiguity -- the portion of positive instances in a bag -- can be explored in weakly supervised data. To train these models, we propose a discriminative max-margin learning algorithm leveraging efficient inference for cardinality-based cliques. The efficacy of the proposed framework is evaluated on a variety of data sets. Experimental results verify that encoding or learning the degree of ambiguity can improve classification performance.
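    One concrete instance of a "level of ambiguity" (the ratio and scores are illustrative): under the bag-positive hypothesis at least a fraction rho of the instances must be positive, so the best such labeling keeps the top-scoring instances, and the bag takes whichever hypothesis scores higher.
    import numpy as np

    def mil_bag_score(instance_scores, rho=0.3):
        m = len(instance_scores)
        k_min = int(np.ceil(rho * m))                              # minimum positives if bag is positive
        top = np.sort(instance_scores)[::-1]
        # positive hypothesis: force the k_min best on, plus any other positive-scoring instances
        pos = top[:k_min].sum() + top[k_min:][top[k_min:] > 0].sum()
        neg = 0.0                                                  # all-negative hypothesis
        return ("positive", pos) if pos > neg else ("negative", neg)

    print(mil_bag_score(np.array([1.5, 0.2, -0.4, -2.0, -0.1]), rho=0.4))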
  • ABSTRACT: Falls are the number one cause of injury in older adults. Lack of objective evidence on the cause and circumstances of falls is often a barrier to effective prevention strategies. Previous studies have established the ability of wearable miniature inertial sensors (accelerometers and gyroscopes) to automatically detect falls, for the purpose of delivering medical assistance. In the current study, we extend the applications of this technology, by developing and evaluating the accuracy of wearable sensor systems for determining the cause of falls. Twelve young adults participated in experimental trials involving falls due to seven causes: slips, trips, fainting, and incorrect shifting/transfer of body weight while sitting down, standing up from sitting, reaching and turning. Features (means and variances) of acceleration data acquired from four tri-axial accelerometers during the falling trials were input to a linear discriminant analysis technique. Data from an array of three sensors (left ankle+right ankle+sternum) provided at least 83% sensitivity and 89% specificity in classifying falls due to slips, trips, and incorrect shift of body weight during sitting, reaching and turning. Classification of falls due to fainting and incorrect shift during rising was less successful across all sensor combinations. Furthermore, similar classification accuracy was observed with data from wearable sensors and a video-based motion analysis system. These results establish a basis for the development of sensor-based fall monitoring systems that provide information on the cause and circumstances of falls, to direct fall prevention strategies at a patient or population level.
    Gait & posture 09/2013; 39(1). DOI:10.1016/j.gaitpost.2013.08.034 · 2.58 Impact Factor
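    A minimal sketch of the classification step, with synthetic data standing in for the study's accelerometer recordings: per-axis means and variances of tri-axial acceleration are fed to linear discriminant analysis to predict the cause of the fall.
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    def window_features(acc_xyz):
        """acc_xyz: (n_samples, 3) acceleration -> per-axis means and variances."""
        return np.concatenate([acc_xyz.mean(axis=0), acc_xyz.var(axis=0)])

    rng = np.random.default_rng(8)
    causes = ["slip", "trip", "incorrect weight shift"]             # a subset, for illustration
    X, y = [], []
    for label, cause in enumerate(causes):                          # synthetic trials per cause
        for _ in range(20):
            trial = rng.normal(loc=label, scale=1.0 + 0.2 * label, size=(200, 3))
            X.append(window_features(trial))
            y.append(label)
    lda = LinearDiscriminantAnalysis()
    print(cross_val_score(lda, np.array(X), np.array(y), cv=5).mean())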
  • 6th annual British Columbia Alliance on Telehealth Policy and Research (BCATPR) workshop, Vancouver, British Columbia, Canada; 06/2013
  • ABSTRACT: This paper presents an application of vision-based monitoring of long-term care facility residents. We develop an algorithm to detect events of interest, particularly falls by elderly residents. The algorithm uses a max-margin latent variable approach with spatio-temporal locations of the person in the video as latent variables. The recently developed Action Bank descriptor is utilized as a rich feature representation for each frame. Empirical results demonstrate the effectiveness of this method.
    The 13th IAPR Conference on Machine Vision Applications, Kyoto, Japan; 05/2013
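    A toy version of the latent temporal-localization idea (the per-frame scores below are synthetic stand-ins for frame descriptors scored by a linear model): frame scores are summed over a sliding window, and the best window is the latent location of the candidate fall.
    import numpy as np

    def detect_event(frame_scores, window=15, threshold=5.0):
        sums = np.convolve(frame_scores, np.ones(window), mode="valid")   # sliding-window sums
        best = int(np.argmax(sums))                                       # latent temporal location
        return sums[best] > threshold, (best, best + window)

    rng = np.random.default_rng(9)
    scores = rng.normal(-0.2, 0.5, size=300)
    scores[120:135] += 1.5                                                # a simulated fall segment
    print(detect_event(scores))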
  • ABSTRACT: We present a multimodal system for creating, modifying and commanding groups of robots from a population. Extending our previous work on selecting an individual robot from a population by face engagement, we show that we can dynamically create groups of a desired number of robots by speaking the number we desire, e.g. “You three”, and looking at the robots we intend to form the group. We evaluate two different methods of detecting which robots are intended by the user, and show that an iterated election performs well in our setting. We also show that teams can be modified by adding and removing individual robots: “And you. Not you”. The success of the system is examined for different spatial configurations of robots with respect to each other and the user, to determine the effective workspace of the selection methods.
    Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on; 01/2013
  • ABSTRACT: Extending our previous work in real-time vision-based Human Robot Interaction (HRI) with multi-robot systems, we present the first example of creating, modifying and commanding teams of UAVs by an uninstrumented human. To create a team the user focuses attention on an individual robot by simply looking at it, then adds or removes it from the current team with a motion-based hand gesture. Another gesture commands the entire team to begin task execution. Robots communicate among themselves by wireless network to ensure that no more than one robot is focused, and so that the whole team agrees that it has been commanded. Since robots can be added and removed from the team, the system is robust to incorrect additions. A series of trials with two and three very low-cost UAVs and off-board processing demonstrates the practicality of our approach.
    Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on; 01/2013
  • ABSTRACT: We conduct image classification by learning a class-to-image distance function that matches objects. The set of objects in training images for an image class is treated as a collage. When presented with a test image, the best matching between this collage of training image objects and those in the test image is found. We validate the efficacy of the proposed model on the PASCAL 07 and SUN 09 datasets, showing that our model is effective for object classification and scene classification tasks. State-of-the-art image classification results are obtained, and qualitative results demonstrate that objects can be accurately matched.
    Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on; 01/2013
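    A toy class-to-image distance by object matching (the object features are synthetic): the class is a collage of object feature vectors pooled from training images, and the distance to a test image is the cost of the best one-to-one matching between collage objects and the image's objects.
    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def class_to_image_distance(class_objects, image_objects):
        cost = cdist(class_objects, image_objects)                 # pairwise object distances
        rows, cols = linear_sum_assignment(cost)                   # best one-to-one matching
        return cost[rows, cols].mean()

    rng = np.random.default_rng(10)
    car_class = rng.normal(loc=1.0, size=(6, 8))                   # pooled "car" object features
    test_image = rng.normal(loc=1.0, size=(4, 8))
    other_image = rng.normal(loc=-1.0, size=(4, 8))
    print(class_to_image_distance(car_class, test_image) <
          class_to_image_distance(car_class, other_image))          # True: closer to the car class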
  • ABSTRACT: We describe a system whereby multiple humans and mobile robots interact robustly using a combination of sensing and signalling modalities. Extending our previous work on selecting an individual robot from a population by face-engagement, we show that reaching toward a robot - a specialization of pointing - can be used to designate a particular robot for subsequent one-on-one interaction. To achieve robust operation despite frequent sensing problems, the robots use three phases of human detection and tracking, and emit audio cues to solicit interaction and guide the behaviour of the human. A series of real-world trials demonstrates the practicality of our approach.
    Robotics and Automation (ICRA), 2013 IEEE International Conference on; 01/2013
  • Tian Lan, Greg Mori
    ABSTRACT: We propose the Max-Margin Riffled Independence Model (MMRIM), a new method for image tag ranking that models the structured preferences among tags. The goal is to predict a ranked tag list for a given image, where tags are ordered by their importance or relevance to the image content. Our model integrates the max-margin formalism with riffled independence factorizations proposed in [10], which naturally allows for structured learning and efficient ranking. Experimental results on the SUN Attribute and LabelMe datasets demonstrate the superior performance of the proposed model compared with baseline tag ranking methods. We also apply the predicted rank list of tags to several higher-level computer vision applications in image understanding and retrieval, and demonstrate that MMRIM significantly improves the accuracy of these applications.
    Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on; 01/2013
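    A generic pairwise-ranking stand-in (not the riffled-independence model): each tag gets a linear scorer over image features, training nudges the scorers so that more important tags outrank less important ones, and prediction sorts tags by score. All names and data below are invented.
    import numpy as np

    def train_tag_ranker(images, ranked_tag_lists, tags, dim, lr=0.1, epochs=50):
        W = np.zeros((len(tags), dim))
        idx = {t: i for i, t in enumerate(tags)}
        for _ in range(epochs):
            for x, ranking in zip(images, ranked_tag_lists):
                for hi, lo in zip(ranking, ranking[1:]):           # adjacent pairs: hi should beat lo
                    if W[idx[hi]] @ x < W[idx[lo]] @ x + 1:        # hinge margin violated
                        W[idx[hi]] += lr * x
                        W[idx[lo]] -= lr * x
        return lambda x: sorted(tags, key=lambda t: -(W[idx[t]] @ x))

    rng = np.random.default_rng(11)
    tags = ["sky", "tree", "car"]
    images = [rng.random(5) for _ in range(30)]
    hidden = rng.normal(size=(3, 5))                               # toy "true" tag importance model
    truth = [sorted(tags, key=lambda t: -(hidden[tags.index(t)] @ x)) for x in images]
    predict = train_tag_ranker(images, truth, tags, dim=5)
    print(predict(images[0]), truth[0])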