Greg Mori

Simon Fraser University, Burnaby, British Columbia, Canada

Publications (91) · 30.06 Total impact

  • ABSTRACT: We present a system for multimedia event detection. The system characterizes complex multimedia events based on a large array of multimodal features, and classifies unseen videos by effectively fusing the diverse responses. We present three major technical innovations. First, we explore novel visual and audio features across multiple semantic granularities, including mid-level and high-level features built, often in an unsupervised manner, on top of low-level features to enable semantic understanding. Second, we present a novel latent SVM model that learns and localizes discriminative high-level concepts in cluttered video sequences. Beyond improving detection accuracy over existing approaches, its use of high-level concepts and temporal evidence localization yields a unique summary for every retrieval, providing some transparency into why the system classified the video as it did. Finally, we present novel fusion learning algorithms and a methodology for improving fusion learning under limited training data conditions. Thorough evaluation on the large TRECVID MED 2011 dataset showcases the benefits of the presented system.
    Machine Vision and Applications, 01/2014 · 1.10 Impact Factor
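    As a hedged illustration of the fusion step, the sketch below combines per-modality classifier scores with convex weights chosen by grid search on a validation split; the function names and the grid-search procedure are assumptions for illustration, not the paper's fusion learning algorithm.

      import numpy as np
      from itertools import product

      def fuse(scores, weights):
          """Weighted sum of per-modality score arrays (n_modalities x n_videos)."""
          return np.tensordot(weights, scores, axes=1)

      def average_precision(labels, scores):
          order = np.argsort(-scores)
          hits = labels[order] > 0
          precisions = np.cumsum(hits) / np.arange(1, len(hits) + 1)
          return precisions[hits].mean()

      def learn_fusion_weights(val_scores, val_labels, grid=np.linspace(0, 1, 11)):
          """Grid-search convex modality weights to maximize validation AP."""
          best_w, best_ap = None, -1.0
          for w in product(grid, repeat=val_scores.shape[0]):
              w = np.asarray(w)
              if w.sum() == 0:
                  continue
              w = w / w.sum()                      # convex combination
              ap = average_precision(val_labels, fuse(val_scores, w))
              if ap > best_ap:
                  best_w, best_ap = w, ap
          return best_w, best_ap

    With three modalities and an 11-point grid this evaluates 1331 candidate weightings, which is cheap next to feature extraction.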
  • ABSTRACT: In sustainable urban planning, non-motorised active modes of travel such as walking are identified as a leading driver for a healthy, liveable and resource-efficient environment. Encouraging walking requires a solid understanding of pedestrian walking behaviour, which is central to evaluating measures of walking conditions such as comfort and efficiency. The main purpose of this study is to gain an in-depth understanding of pedestrian walking behaviour through the investigation of the spatio-temporal gait parameters (step length and step frequency). This microscopic-level analysis provides insight into pedestrian walking mechanisms and the effect of attributes such as gender and age. The analysis relies on automated video-based data collection using computer vision techniques; step frequency and step length are estimated from oscillatory patterns in the walking speed profile. The study uses real-world video data collected in downtown Vancouver, BC. The results show that the gait parameters are influenced by factors such as crosswalk grade, pedestrian gender, age and group size. Step length was found to generally have more influence on walking speed than step frequency. It was also found that, compared to males, females increase their step frequency more to increase their walking speed.
    Transportmetrica A: Transport Science, 01/2014; 10(3).
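    The oscillation-based gait estimation lends itself to a short sketch: step frequency as the dominant spectral peak of the mean-removed speed profile, and step length from speed = step length x step frequency. The 30 Hz sampling rate and the 0.5-3.5 Hz gait band below are assumptions, not values from the paper.

      import numpy as np

      def step_frequency(speed, fs=30.0, band=(0.5, 3.5)):
          """Dominant oscillation frequency (Hz) of a walking-speed profile."""
          speed = np.asarray(speed) - np.mean(speed)   # remove mean walking speed
          spectrum = np.abs(np.fft.rfft(speed))
          freqs = np.fft.rfftfreq(len(speed), d=1.0 / fs)
          in_band = (freqs >= band[0]) & (freqs <= band[1])
          return freqs[in_band][np.argmax(spectrum[in_band])]

      def step_length(mean_speed, step_freq):
          """Step length implied by speed = step_length * step_frequency."""
          return mean_speed / step_freq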
  • ABSTRACT: The appearance of an object changes profoundly with pose, camera view and interactions of the object with other objects in the scene. This makes it challenging to learn detectors from an object-level label alone (e.g., "car"). We postulate that a richer set of labelings for an object, at different levels of granularity, can significantly alleviate these problems: finer-grained subcategories that are consistent in appearance and view, and higher-order composites, i.e., contextual groupings of objects consistent in their spatial layout and appearance. However, obtaining such a rich set of annotations, including annotation of an exponentially growing set of object groupings, is simply not feasible. We propose a weakly-supervised framework for object detection that discovers the subcategories and composites automatically, with only traditional object-level category labels as input. To this end, we first propose an exemplar-SVM-based clustering approach, with latent SVM refinement, that discovers a variable-length set of discriminative subcategories for each object class. We then develop a structured model for object detection that captures interactions among object subcategories and automatically discovers semantically meaningful and discriminatively relevant visual composites. We show that this model produces state-of-the-art performance on the UIUC phrase object detection benchmark.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
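    The subcategory discovery step can be sketched compactly. Below, one linear SVM is trained per positive exemplar (exemplar-SVM style) and exemplars whose detectors fire on each other are greedily merged; the merge rule and threshold are illustrative, and the paper's latent SVM refinement of the clusters is omitted.

      import numpy as np
      from sklearn.svm import LinearSVC

      def exemplar_svms(pos_feats, neg_feats, C=0.1):
          """Train one detector per positive exemplar against all negatives."""
          detectors = []
          for x in pos_feats:
              X = np.vstack([x[None, :], neg_feats])
              y = np.r_[1, np.zeros(len(neg_feats))]
              detectors.append(LinearSVC(C=C).fit(X, y))
          return detectors

      def greedy_subcategories(detectors, pos_feats, thresh=0.0):
          """Merge exemplars whose detectors score each other above thresh."""
          scores = np.array([d.decision_function(pos_feats) for d in detectors])
          mutual = (scores > thresh) & (scores.T > thresh)   # symmetric firing
          unassigned, clusters = set(range(len(pos_feats))), []
          while unassigned:
              seed = unassigned.pop()
              members = {seed} | {j for j in unassigned if mutual[seed, j]}
              unassigned -= members
              clusters.append(sorted(members))
          return clusters

    The variable number of subcategories falls out naturally: visually tight exemplars merge, isolated exemplars stay singletons.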
  • Arash Vahdat, Greg Mori
    ABSTRACT: Gathering accurate training data for recognizing a set of attributes or tags on images or videos is a challenge. Obtaining labels via manual effort or from weakly-supervised data typically results in noisy training labels. We develop the FlipSVM, a novel algorithm for handling these noisy, structured labels. The FlipSVM models label noise by "flipping" labels on training examples. We show empirically that the FlipSVM is effective on image-attribute and video tagging datasets.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
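    A heavily simplified sketch of the label-flipping idea: alternate between fitting a linear SVM and flipping any training label whose hinge loss exceeds the flip penalty plus the loss under the flipped label. The alternating heuristic and the penalty rho are illustrative; the paper's model and optimization differ.

      import numpy as np
      from sklearn.svm import LinearSVC

      def flip_svm(X, y, rho=1.0, iters=5, C=1.0):
          """X: (n, d) features; y: labels in {-1, +1}, possibly noisy."""
          y = np.asarray(y, dtype=float).copy()
          for _ in range(iters):
              clf = LinearSVC(C=C).fit(X, y)
              margin = y * clf.decision_function(X)
              keep_cost = np.maximum(0.0, 1.0 - margin)        # hinge as labeled
              flip_cost = rho + np.maximum(0.0, 1.0 + margin)  # hinge if flipped
              flip = keep_cost > flip_cost
              if not flip.any():
                  break
              y[flip] = -y[flip]                               # flip noisy labels
          return clf, y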
  • ABSTRACT: We present a compositional model for video event detection. A video is modeled using a collection of both global and segment-level features, and kernel functions are employed for similarity comparisons. The locations of salient, discriminative video segments are treated as a latent variable, allowing the model to explicitly ignore portions of the video that are unimportant for classification. A novel multiple kernel learning (MKL) latent support vector machine (SVM) is defined, which combines and re-weights multiple feature types in a principled fashion while operating within the latent variable framework. The compositional nature of the proposed model allows it to respond directly to the challenges of temporal clutter and intra-class variation, which are prevalent in unconstrained internet videos. Experimental results on the TRECVID Multimedia Event Detection 2011 (MED11) dataset demonstrate the efficacy of the method.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
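    A scoring-only sketch of the latent MKL combination: the salient segment is a latent variable, and a video's score maximizes, over segments, a learned convex combination of per-feature-type kernel similarities. The RBF kernel, prototypes, and weights below are placeholders rather than the learned quantities from the paper.

      import numpy as np

      def rbf(a, b, gamma=0.5):
          return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

      def latent_mkl_score(segments, prototypes, beta):
          """segments: list of {feature_type: vector} per video segment;
          prototypes: {feature_type: vector}; beta: kernel weights (sum to 1)."""
          def seg_score(seg):
              return sum(beta[t] * rbf(seg[t], prototypes[t]) for t in beta)
          return max(seg_score(s) for s in segments)   # max over latent segment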
  • ABSTRACT: In this work, we address the problem of complex event detection in unconstrained videos. We introduce a novel multi-way feature pooling approach that leverages segment-level information. The approach is simple and widely applicable to diverse audio-visual features. It uses a set of clusters discovered via unsupervised clustering of segment-level features; depending on feature characteristics, motion- and audio-based clusters can be incorporated alongside scene-based clusters. Every video is then represented with multiple descriptors, each designed to relate to one of the pre-built clusters. For classification, intersection-kernel SVMs are used, where the kernel is obtained by combining multiple kernels computed from the corresponding per-cluster descriptor pairs. Evaluation on the TRECVID'11 MED dataset shows a significant improvement by the proposed approach over the state of the art.
    Proceedings of the 21st ACM international conference on Multimedia; 10/2013
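    The kernel combination is simple to sketch: one descriptor per pre-built cluster, a histogram-intersection kernel per descriptor, and the summed kernel fed to a precomputed-kernel SVM. All shapes and names below are assumptions for illustration.

      import numpy as np
      from sklearn.svm import SVC

      def intersection_kernel(A, B):
          """Histogram intersection between rows of A (n x d) and B (m x d)."""
          return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

      def combined_kernel(descs_a, descs_b):
          """Sum per-cluster intersection kernels; descs_*: list of (n x d)."""
          return sum(intersection_kernel(Da, Db)
                     for Da, Db in zip(descs_a, descs_b))

      # Usage with scikit-learn's precomputed-kernel interface:
      #   K_train = combined_kernel(train_descs, train_descs)
      #   clf = SVC(kernel="precomputed").fit(K_train, labels)
      #   preds = clf.predict(combined_kernel(test_descs, train_descs))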
  • ABSTRACT: We introduce a graphical framework for multiple instance learning (MIL) based on Markov networks. This framework can model the traditional MIL definition as well as more general variants. Different levels of ambiguity (the proportion of positive instances in a bag) can be explored in weakly supervised data. To train these models, we propose a discriminative max-margin learning algorithm that leverages efficient inference for cardinality-based cliques. The efficacy of the proposed framework is evaluated on a variety of datasets, and experimental results verify that encoding or learning the degree of ambiguity can improve classification performance.
    09/2013;
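    The cardinality-based clique inference admits an exact O(n log n) routine, worth spelling out since it is what keeps these MIL models tractable: sort the instance scores, and for each possible count c of positives take the top c scores plus the cardinality term. The example potential (positive bags need at least 30% positive instances) is illustrative.

      import numpy as np

      def cardinality_map(unary, card_potential):
          """unary[i]: gain for labeling instance i positive;
          card_potential(c, n): clique score for exactly c of n positives."""
          unary = np.asarray(unary, dtype=float)
          n = len(unary)
          order = np.argsort(-unary)                 # best instances first
          prefix = np.concatenate([[0.0], np.cumsum(unary[order])])
          totals = [prefix[c] + card_potential(c, n) for c in range(n + 1)]
          best_c = int(np.argmax(totals))
          labels = np.zeros(n, dtype=int)
          labels[order[:best_c]] = 1
          return labels, totals[best_c]

      at_least_30pct = lambda c, n: 0.0 if c >= 0.3 * n else -1e9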
  • ABSTRACT: Falls are the number one cause of injury in older adults. Lack of objective evidence on the cause and circumstances of falls is often a barrier to effective prevention strategies. Previous studies have established the ability of wearable miniature inertial sensors (accelerometers and gyroscopes) to automatically detect falls for the purpose of delivering medical assistance. In the current study, we extend the applications of this technology by developing and evaluating the accuracy of wearable sensor systems for determining the cause of falls. Twelve young adults participated in experimental trials involving falls due to seven causes: slips, trips, fainting, and incorrect shifting/transfer of body weight while sitting down, standing up from sitting, reaching and turning. Features (means and variances) of acceleration data acquired from four tri-axial accelerometers during the falling trials were input to a linear discriminant analysis technique. Data from an array of three sensors (left ankle + right ankle + sternum) provided at least 83% sensitivity and 89% specificity in classifying falls due to slips, trips, and incorrect shift of body weight during sitting, reaching and turning. Classification of falls due to fainting and incorrect shift during rising was less successful across all sensor combinations. Furthermore, similar classification accuracy was observed with data from the wearable sensors and a video-based motion analysis system. These results establish a basis for the development of sensor-based fall monitoring systems that provide information on the cause and circumstances of falls, to direct fall prevention strategies at a patient or population level.
    Gait & Posture, 09/2013 · 2.58 Impact Factor
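    The feature-and-classifier pipeline described above condenses to a few lines; the window layout and cross-validation below are assumptions for illustration.

      import numpy as np
      from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
      from sklearn.model_selection import cross_val_score

      def accel_features(trials):
          """trials: (n_trials, n_samples, n_channels) acceleration windows.
          Returns per-channel means and variances, (n_trials, 2 * n_channels)."""
          return np.hstack([trials.mean(axis=1), trials.var(axis=1)])

      def fall_cause_accuracy(trials, causes, folds=5):
          """Mean cross-validated accuracy of LDA on mean/variance features."""
          X = accel_features(trials)
          lda = LinearDiscriminantAnalysis()
          return cross_val_score(lda, X, causes, cv=folds).mean()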
  • 6th Annual British Columbia Alliance on Telehealth Policy and Research (BCATPR) Workshop, Vancouver, British Columbia, Canada; 06/2013
  • ABSTRACT: This paper presents an application of vision-based monitoring to long-term care facility residents. We develop an algorithm to detect events of interest, particularly falls by elderly residents. The algorithm uses a max-margin latent variable approach with the spatio-temporal locations of the person in the video as latent variables. The recently developed Action Bank descriptor is utilized as a rich feature representation for each frame. Empirical results demonstrate the effectiveness of this method.
    The 13th IAPR Conference on Machine Vision Applications, Kyoto, Japan; 05/2013
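    The latent localization can be sketched in its simplest, temporal-only form: the event window is latent, and a video scores as its best window under a linear model on per-frame descriptors. Mean-pooling within the window and the 30-frame length are assumptions; the paper uses Action Bank frame descriptors and also localizes spatially.

      import numpy as np

      def best_window_score(frame_feats, w, win_len=30):
          """frame_feats: (n_frames, d); w: (d,) learned weights.
          Returns (score, start) for the best-scoring latent window."""
          n = len(frame_feats)
          scores = [(np.mean(frame_feats[s:s + win_len], axis=0) @ w, s)
                    for s in range(max(1, n - win_len + 1))]
          return max(scores)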
  • ABSTRACT: We describe a system whereby multiple humans and mobile robots interact robustly using a combination of sensing and signalling modalities. Extending our previous work on selecting an individual robot from a population by face engagement, we show that reaching toward a robot (a specialization of pointing) can be used to designate a particular robot for subsequent one-on-one interaction. To achieve robust operation despite frequent sensing problems, the robots use three phases of human detection and tracking, and emit audio cues to solicit interaction and guide the behaviour of the human. A series of real-world trials demonstrates the practicality of our approach.
    Robotics and Automation (ICRA), 2013 IEEE International Conference on; 01/2013
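    Purely as an illustration of the phased interaction, a toy state machine is sketched below; the phase names, transition tests, and audio cues are guesses at the structure, not the paper's design.

      from enum import Enum, auto

      class Phase(Enum):
          SEARCH = auto()    # scan for any person
          TRACK = auto()     # follow a detected person
          ENGAGE = auto()    # person is reaching toward this robot

      def step(phase, person_visible, reaching):
          """One update of the interaction loop; returns (phase, audio cue)."""
          if phase is Phase.SEARCH and person_visible:
              return Phase.TRACK, "chirp: person detected"
          if phase is Phase.TRACK and reaching:
              return Phase.ENGAGE, "chime: ready for commands"
          if not person_visible:
              return Phase.SEARCH, "tone: soliciting interaction"
          return phase, None   # no transition, stay quiet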
  • ABSTRACT: We conduct image classification by learning a class-to-image distance function that matches objects. The set of objects in the training images of an image class is treated as a collage. Given a test image, we find the best matching between this collage of training-image objects and the objects in the test image. We validate the efficacy of the proposed model on the PASCAL 07 and SUN 09 datasets, showing that it is effective for both object classification and scene classification tasks. State-of-the-art image classification results are obtained, and qualitative results demonstrate that objects can be accurately matched.
    Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on; 01/2013
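    The matching step reduces to linear assignment once each object has a descriptor, so a short sketch is possible; the Euclidean costs and Hungarian solver below stand in for whatever distance the model actually learns.

      import numpy as np
      from scipy.optimize import linear_sum_assignment
      from scipy.spatial.distance import cdist

      def class_to_image_distance(collage_objs, test_objs):
          """collage_objs: (n, d) descriptors pooled from a class's training
          images; test_objs: (m, d) descriptors from the test image."""
          cost = cdist(collage_objs, test_objs)      # pairwise match costs
          rows, cols = linear_sum_assignment(cost)   # optimal 1-to-1 matching
          return cost[rows, cols].sum(), list(zip(rows, cols))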
  • Tian Lan, Greg Mori
    ABSTRACT: We propose the Max-Margin Riffled Independence Model (MMRIM), a new method for image tag ranking that models the structured preferences among tags. The goal is to predict a ranked tag list for a given image, where tags are ordered by their importance or relevance to the image content. Our model integrates the max-margin formalism with the riffled independence factorizations proposed in [10], which naturally allows for structured learning and efficient ranking. Experimental results on the SUN Attribute and LabelMe datasets demonstrate the superior performance of the proposed model compared with baseline tag ranking methods. We also apply the predicted ranked tag lists to several higher-level computer vision applications in image understanding and retrieval, and demonstrate that MMRIM significantly improves the accuracy of these applications.
    Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on; 01/2013
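    The riffled-independence machinery is too involved for a short sketch, so the fragment below shows only the max-margin ranking ingredient: for an image, tags ranked higher should out-score tags ranked lower by a margin. The linear per-tag scorers are an assumption.

      import numpy as np

      def pairwise_ranking_loss(W, x, ranked_tags):
          """W: (n_tags, d) per-tag weights; x: (d,) image feature;
          ranked_tags: tag indices ordered from most to least relevant."""
          scores = W @ x
          loss = 0.0
          for i, hi in enumerate(ranked_tags):
              for lo in ranked_tags[i + 1:]:
                  loss += max(0.0, 1.0 - (scores[hi] - scores[lo]))
          return loss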
  • ABSTRACT: We present a multimodal system for creating, modifying and commanding groups of robots from a population. Extending our previous work on selecting an individual robot from a population by face engagement, we show that we can dynamically create groups of a desired number of robots by speaking the number we desire, e.g. "You three", and looking at the robots we intend to include in the group. We evaluate two different methods of detecting which robots are intended by the user, and show that an iterated election performs well in our setting. We also show that teams can be modified by adding and removing individual robots: "And you. Not you". The success of the system is examined for different spatial configurations of the robots with respect to each other and the user, to determine the effective workspace of the selection methods.
    Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on; 01/2013
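    One plausible reading of the iterated election, sketched under the assumption that each robot reports a scalar confidence that it is being looked at: the round winner joins the group and withdraws, and voting repeats until the spoken count is reached.

      def iterated_election(gaze_confidence, k):
          """gaze_confidence: {robot_id: score}; returns k elected robot ids."""
          remaining = dict(gaze_confidence)
          team = []
          for _ in range(min(k, len(remaining))):
              winner = max(remaining, key=remaining.get)   # round winner
              team.append(winner)
              del remaining[winner]                        # withdraws next round
          return team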
  • ABSTRACT: Extending our previous work on real-time vision-based Human Robot Interaction (HRI) with multi-robot systems, we present the first example of creating, modifying and commanding teams of UAVs by an uninstrumented human. To create a team, the user focuses attention on an individual robot by simply looking at it, then adds or removes it from the current team with a motion-based hand gesture. Another gesture commands the entire team to begin task execution. The robots communicate among themselves over a wireless network to ensure that no more than one robot is focused at a time, and so that the whole team agrees it has been commanded. Since robots can be added to and removed from the team, the system is robust to incorrect additions. A series of trials with two and three very low-cost UAVs and off-board processing demonstrates the practicality of our approach.
    Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on; 01/2013
  • ABSTRACT: We consider image retrieval with structured object queries: queries that specify the objects that should be present in the scene and their spatial relations. An example of such a query is "car on the road". Existing image retrieval systems typically consider queries consisting of object classes (i.e. keywords), training a separate classifier for each object class and combining the outputs heuristically. In contrast, we develop a learning framework that jointly considers object classes and their relations. Our method considers not only the objects in the query ("car" and "road" in the example above) but also related object categories that can be useful for retrieval. Since we do not have ground-truth labeling of object bounding boxes on the test images, we represent the boxes as latent variables in our model. Our learning method is an extension of the ranking SVM with latent variables, which we call the latent ranking SVM. We demonstrate image retrieval and ranking results on a dataset with more than a hundred object classes.
    Proceedings of the 12th European conference on Computer Vision - Volume Part VI; 10/2012
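    A sketch of the kind of scoring function this implies: each query object's bounding box is latent, an image scores the best box per query object, and a pairwise term checks the queried spatial relation between the chosen boxes. The greedy box selection and the crude "above" test are simplifications of the joint latent model.

      import numpy as np

      def query_score(boxes, box_feats, obj_weights, on_pairs):
          """boxes: (n, 4) as (x1, y1, x2, y2), y growing downward;
          box_feats: (n, d); obj_weights: {obj: (d,)};
          on_pairs: [(a, b)] meaning the query asks for 'a on b'."""
          chosen = {o: int(np.argmax(box_feats @ w))       # latent box per object
                    for o, w in obj_weights.items()}
          score = sum(box_feats[i] @ obj_weights[o] for o, i in chosen.items())
          for a, b in on_pairs:
              a_bottom = boxes[chosen[a]][3]
              b_top = boxes[chosen[b]][1]
              score += 1.0 if a_bottom <= b_top + 10 else -1.0
          return score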
  • ABSTRACT: We present a novel algorithm for weakly supervised action classification in videos. We assume we are given training videos annotated only with action class labels. We learn a model that can classify unseen test videos, as well as localize a region of interest in each video that captures the discriminative essence of the action class. A novel Similarity Constrained Latent Support Vector Machine model is developed to operationalize this goal. The model specifies that videos should be classified correctly and that the chosen latent regions of interest should be coherent across videos of an action class. The resulting learning problem is challenging, and we show how dual decomposition can be employed to render it tractable. Experimental results demonstrate the efficacy of the method.
    Proceedings of the 12th European conference on Computer Vision - Volume Part VII; 10/2012
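    The two requirements of the model (score well, stay coherent across videos of the class) can be sketched with a crude alternating heuristic: pick each video's region to balance classifier score against distance to a class centroid, then re-estimate the centroid. This stands in for, and is much weaker than, the paper's dual-decomposition optimization.

      import numpy as np

      def select_coherent_regions(region_feats, w, lam=1.0, iters=5):
          """region_feats: list over videos of (n_regions_i, d) candidates;
          w: (d,) classifier weights; lam: coherence trade-off."""
          centroid = np.mean([r.mean(axis=0) for r in region_feats], axis=0)
          for _ in range(iters):
              picks = []
              for regions in region_feats:
                  coherence = -np.linalg.norm(regions - centroid, axis=1)
                  picks.append(int(np.argmax(regions @ w + lam * coherence)))
              centroid = np.mean([r[i] for r, i in zip(region_feats, picks)],
                                 axis=0)
          return picks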
  • ABSTRACT: We develop an algorithm for structured prediction with non-decomposable performance measures. The algorithm learns the parameters of Markov random fields and can be applied to multivariate performance measures such as the F_β score (natural language processing), intersection-over-union (object category segmentation), Precision/Recall at k (search engines) and ROC area (binary classifiers). We attack this optimization problem by approximating the loss function with a piecewise linear function. The loss-augmented inference forms a quadratic program (QP), which we solve using LP relaxation. We apply this approach to two tasks, object class-specific segmentation and human action retrieval from videos, and show significant improvement over baseline approaches that use either simple loss functions or simple scoring functions, on the PASCAL VOC and H3D segmentation datasets and a nursing home action recognition dataset.
    IEEE Transactions on Software Engineering, 08/2012 · 2.59 Impact Factor
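    The losses named above are easy to state concretely, and writing them out makes clear why they do not decompose over individual labels:

      import numpy as np

      def iou_loss(pred, gt):
          """1 - intersection/union for binary masks or label vectors."""
          pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
          union = np.logical_or(pred, gt).sum()
          return 1.0 - (np.logical_and(pred, gt).sum() / union if union else 1.0)

      def f_beta_loss(pred, gt, beta=1.0):
          """1 - F_beta over a whole prediction vector."""
          pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
          tp = np.logical_and(pred, gt).sum()
          if tp == 0:
              return 1.0
          p, r = tp / pred.sum(), tp / gt.sum()
          b2 = beta ** 2
          return 1.0 - (1 + b2) * p * r / (b2 * p + r)

    Both depend on global counts over the whole labeling, so neither can be written as a sum of per-label terms; this is what the piecewise linear approximation works around.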
  • ABSTRACT: Falls are the number one cause of injury in older adults. An individual's risk of falling depends on the frequency of imbalance episodes and the ability to recover balance following these events. However, there is little direct evidence on the frequency and circumstances of imbalance episodes (near-falls) in older adults. There is currently rapid growth in the development of wearable fall monitoring systems based on inertial sensors, and the utility of these systems would be enhanced by the ability to detect near-falls. In the current study, we conducted laboratory experiments to determine how the number and location of wearable inertial sensors influence the accuracy of a machine learning algorithm in distinguishing near-falls from activities of daily living (ADLs).
    Conference Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 08/2012; 2012:5837-40.
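    The sensor number/placement question in the abstract maps onto a compact experiment: score every subset of sensor locations by the cross-validated accuracy of a classifier trained on that subset's features. The SVM stands in for the paper's learner, and the feature layout is assumed.

      import numpy as np
      from itertools import combinations
      from sklearn.model_selection import cross_val_score
      from sklearn.svm import SVC

      def rank_sensor_subsets(feats_by_sensor, labels, folds=5):
          """feats_by_sensor: {location: (n_trials, d)} feature blocks."""
          results = []
          for r in range(1, len(feats_by_sensor) + 1):
              for subset in combinations(sorted(feats_by_sensor), r):
                  X = np.hstack([feats_by_sensor[s] for s in subset])
                  acc = cross_val_score(SVC(), X, labels, cv=folds).mean()
                  results.append((acc, subset))
          return sorted(results, reverse=True)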