Martial Hebert

Carnegie Mellon University, Pittsburgh, Pennsylvania, United States

Publications (300) · 97.36 Total Impact

  • ABSTRACT: Structured prediction plays a central role in machine learning applications from computational biology to computer vision. These models require significantly more computation than unstructured models, and, in many applications, algorithms may need to make predictions within a computational budget or in an anytime fashion. In this work we propose an anytime technique for learning structured prediction that, at training time, incorporates both structural elements and feature computation trade-offs that affect test-time inference. We apply our technique to the challenging problem of scene understanding in computer vision and demonstrate efficient and anytime predictions that gradually improve towards state-of-the-art classification performance as the allotted time increases. (A toy sketch of the anytime idea follows the citation below.)
    12/2013;
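
The entry above describes predictions that improve as more computation is allowed. As a rough illustration of the anytime idea (not the paper's actual model), the sketch below orders feature modules by cost and refines a classifier's output until a test-time budget runs out; the module costs, feature functions, and linear scorers are all invented.

```python
# Minimal sketch of anytime prediction with feature-cost trade-offs.
# All module names, costs, and scorers are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Each "module" is (cost, feature_fn); cheaper, weaker features come first.
modules = [
    (1.0, lambda x: x[:1]),          # cheap: first feature only
    (3.0, lambda x: x[:3]),          # mid: first three features
    (9.0, lambda x: x),              # expensive: all features
]
# One hypothetical linear scorer per module, per class.
n_classes, dims = 3, [1, 3, 5]
weights = [rng.normal(size=(n_classes, d)) for d in dims]

def anytime_predict(x, budget):
    """Refine class scores until the cost budget is exhausted."""
    scores = np.zeros(n_classes)
    spent = 0.0
    for (cost, feat), w in zip(modules, weights):
        if spent + cost > budget:
            break
        scores = w @ feat(x)         # richer features replace coarse ones
        spent += cost
    return int(scores.argmax()), spent

x = rng.normal(size=5)
for budget in (0.5, 2.0, 5.0, 20.0):
    label, spent = anytime_predict(x, budget)
    print(f"budget={budget:5.1f}  spent={spent:4.1f}  label={label}")
```
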
  • ABSTRACT: In this paper, we consider the problem of Lifelong Robotic Object Discovery (LROD) as the long-term goal of discovering novel objects in the environment while the robot operates, for as long as the robot operates. As a first step towards LROD, we automatically process the raw video stream of an entire workday of a robotic agent to discover objects. We claim that the key to achieving this goal is to incorporate domain knowledge whenever available, in order to detect and adapt to changes in the environment. We propose a general graph-based formulation for LROD in which generic domain knowledge is encoded as constraints. Our formulation enables new sources of domain knowledge (metadata) to be added dynamically to the system, as they become available or as conditions change. By adding domain knowledge, we discover 2.7 times more objects and decrease processing time by a factor of 190. Our optimized implementation, HerbDisc, processes 6 h 20 min of RGB-D video of real human environments in 18 min 30 s, and discovers 121 correct novel objects with their 3D models. (See the sketch below.)
    IEEE International Conference on Robotics and Automation (ICRA); 05/2013
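
A rough reading of the graph-based formulation above: candidate object observations become nodes, and each piece of domain knowledge becomes a constraint that gates edges; connected components then serve as object hypotheses. The candidate fields and constraints below are invented for illustration and are not HerbDisc's actual metadata.

```python
# Toy sketch of the graph formulation: object candidates are nodes, and
# generic domain knowledge enters as edge constraints (all fields invented).
import networkx as nx

candidates = [
    {"id": 0, "time": 10.0, "place": "kitchen", "size": 0.20},
    {"id": 1, "time": 11.0, "place": "kitchen", "size": 0.21},
    {"id": 2, "time": 500.0, "place": "office", "size": 0.20},
]

# Each constraint maps a pair of candidates to True/False.
constraints = [
    lambda a, b: abs(a["time"] - b["time"]) < 60,     # seen close in time
    lambda a, b: a["place"] == b["place"],            # same location
    lambda a, b: abs(a["size"] - b["size"]) < 0.05,   # similar size
]

G = nx.Graph()
G.add_nodes_from(c["id"] for c in candidates)
for i, a in enumerate(candidates):
    for b in candidates[i + 1:]:
        if all(c(a, b) for c in constraints):   # metadata gates the edge
            G.add_edge(a["id"], b["id"])

# Connected components play the role of discovered object hypotheses.
print(list(nx.connected_components(G)))   # -> [{0, 1}, {2}]
```
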
  • ABSTRACT: Rich scene understanding from 3-D point clouds is a challenging task that requires contextual reasoning, which is typically computationally expensive. The task is further complicated when we expect the scene analysis algorithm to also efficiently handle data that is continuously streamed from a sensor on a mobile robot. Hence, we are typically forced to make a choice between 1) using a precise representation of the scene at the cost of speed, or 2) making fast, though inaccurate, approximations at the cost of increased misclassifications. In this work, we demonstrate that we can achieve the best of both worlds by using an efficient and simple representation of the scene in conjunction with recent developments in structured prediction in order to obtain both efficient and state-of-the-art classifications. Furthermore, this efficient scene representation naturally handles streaming data and provides a 300% to 500% speedup over more precise representations.
    Robotics and Automation (ICRA), 2013 IEEE International Conference on; 01/2013
  • ABSTRACT: We address the problem of image-based scene analysis from streaming video, as would be seen from a moving platform, in order to efficiently generate spatially and temporally consistent predictions of semantic categories over time. In contrast to previous techniques, which typically address this problem in batch and/or through graphical models, we demonstrate that by learning visual similarities between pixels across frames, a simple filtering algorithm is able to achieve high-performance predictions in an efficient and online/causal manner. Our technique is a meta-algorithm that can be efficiently wrapped around any scene analysis technique that produces a per-pixel semantic category distribution. We validate our approach over three different scene analysis techniques on three different datasets that contain different semantic object categories. Our experiments demonstrate that our approach is very efficient in practice and substantially improves the consistency of the predictions over time. (A minimal sketch of the filtering step follows below.)
    Robotics and Automation (ICRA), 2013 IEEE International Conference on; 01/2013
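
The filtering step above can be pictured as a causal blend of the previous frame's label beliefs into the current per-pixel prediction, weighted by cross-frame similarity. The sketch below uses a synthetic similarity map in place of the learned similarities from the paper.

```python
# Minimal sketch of causal, per-pixel temporal filtering of label
# distributions; the similarity map is a stand-in for the learned
# cross-frame similarity described in the abstract.
import numpy as np

def filter_frame(prev_belief, new_pred, similarity):
    """Blend the previous belief into the new per-pixel prediction.

    prev_belief, new_pred: (H, W, K) label distributions.
    similarity: (H, W) in [0, 1]; 1 = same surface as the previous frame.
    """
    alpha = similarity[..., None]            # trust history where similar
    blended = alpha * prev_belief + (1 - alpha) * new_pred
    return blended / blended.sum(axis=-1, keepdims=True)

H, W, K = 2, 2, 3
rng = np.random.default_rng(0)
belief = np.full((H, W, K), 1.0 / K)         # uniform prior
for _ in range(5):                           # simulated frame stream
    pred = rng.dirichlet(np.ones(K), size=(H, W))
    sim = rng.uniform(size=(H, W))
    belief = filter_frame(belief, pred, sim)
print(belief.argmax(axis=-1))                # temporally smoothed labels
```
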
  • ABSTRACT: We investigate the problem of selecting a state machine from a library to control a robot. We are particularly interested in this problem when evaluating such state machines on a particular robotics task is expensive. As a motivating example, we consider a problem where a simulated vacuuming robot must select a driving state machine well-suited for a particular (unknown) room layout. By borrowing concepts from collaborative filtering (recommender systems such as Netflix and Amazon.com), we present a multi-armed bandit formulation that incorporates recommendation techniques to efficiently select state machines for individual room layouts. We show that this formulation outperforms the individual approaches (recommendation, multi-armed bandits) as well as the baseline of selecting the 'average best' state machine across all rooms. (See the sketch below.)
    Robotics and Automation (ICRA), 2013 IEEE International Conference on; 01/2013
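
One plausible minimal version of the formulation above: run a UCB-style bandit over the library of state machines, but initialize each arm's estimate with a collaborative-filtering prediction rather than zero. The rewards and prior below are synthetic.

```python
# Toy sketch: UCB-style bandit over a library of controllers, warm-started
# by collaborative-filtering estimates from previously rated rooms.
# All numbers are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(1)
true_reward = np.array([0.3, 0.5, 0.8, 0.4])     # unknown to the agent
cf_prior = np.array([0.35, 0.45, 0.7, 0.5])      # recommender's guess

counts = np.zeros(4)
means = cf_prior.copy()                          # start from the prior

for t in range(1, 201):
    ucb = means + np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
    arm = int(np.argmax(ucb))
    reward = rng.normal(true_reward[arm], 0.1)   # one noisy evaluation
    counts[arm] += 1
    # Running average that gradually replaces the prior with evidence.
    means[arm] += (reward - means[arm]) / counts[arm]

print("pulls per controller:", counts.astype(int))
print("best controller found:", int(np.argmax(means)))
```
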
  • ABSTRACT: The detection and tracking of moving objects is an essential task in robotics. The CMU-RI Navlab group has developed such a system that uses a laser scanner as its primary sensor. We describe our algorithm and its use in several applications. Our system has worked successfully on indoor and outdoor platforms and with several different kinds and configurations of two-dimensional and three-dimensional laser scanners. The applications range from collision warning systems and people classification to observing human tracks and providing input to a dynamic planner. Several of these systems were evaluated in live field tests and shown to be robust and reliable.
Journal of Field Robotics 01/2013; 30(1):17–43. · 2.15 Impact Factor
  • ABSTRACT: Autonomous navigation for large Unmanned Aerial Vehicles (UAVs) is fairly straightforward, as expensive sensors and monitoring devices can be employed. In contrast, obstacle avoidance remains a challenging task for Micro Aerial Vehicles (MAVs), which operate at low altitude in cluttered environments. Unlike large vehicles, MAVs can only carry very light sensors, such as cameras, making autonomous navigation through obstacles much more challenging. In this paper, we describe a system that navigates a small quadrotor helicopter autonomously at low altitude through natural forest environments. Using only a single inexpensive camera to perceive the environment, we are able to maintain a constant velocity of up to 1.5 m/s. Given a small set of human pilot demonstrations, we use recent state-of-the-art imitation learning techniques to train a controller that can avoid trees by adapting the MAV's heading. We demonstrate the performance of our system in a more controlled environment indoors and in real natural forest environments outdoors. (A hedged sketch of the imitation-learning loop follows below.)
    Proceedings - IEEE International Conference on Robotics and Automation 11/2012;
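
The abstract cites recent imitation learning techniques; one standard pattern from this line of work is a DAgger-style loop that retrains the controller on expert labels gathered along the learner's own trajectories. The expert, visual features, and rollout below are stand-ins, not the paper's system.

```python
# Hedged sketch of a DAgger-style imitation loop for heading control.
# Everything here is a synthetic stand-in for illustration.
import numpy as np

rng = np.random.default_rng(0)

def expert_heading(obs):
    """Stand-in for the human pilot's heading command."""
    return -0.8 * obs[0] + 0.3 * obs[1]

def rollout(policy_w, n=50):
    """Stand-in for flying the MAV under the current policy.

    A real rollout would depend on policy_w; here we just sample the
    visual features the learner might encounter.
    """
    return rng.normal(size=(n, 2))

X = rng.normal(size=(20, 2))                     # initial demonstrations
y = np.array([expert_heading(o) for o in X])
w = np.linalg.lstsq(X, y, rcond=None)[0]         # linear heading policy

for _ in range(5):                               # DAgger-style iterations
    X_new = rollout(w)                           # states the learner visits
    y_new = np.array([expert_heading(o) for o in X_new])  # expert relabels
    X, y = np.vstack([X, X_new]), np.concatenate([y, y_new])
    w = np.linalg.lstsq(X, y, rcond=None)[0]     # retrain on the aggregate

print("learned policy weights:", w)              # approaches [-0.8, 0.3]
```
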
  • ABSTRACT: The main stated contribution of the Deformable Parts Model (DPM) detector of Felzenszwalb et al. (over the Histogram-of-Oriented-Gradients approach of Dalal and Triggs) is the use of deformable parts. A secondary contribution is the latent discriminative learning. Tertiary is the use of multiple components. A common belief in the vision community (including ours, before this study) is that this ordering of contributions reflects their importance to the detector's performance in practice. However, we have found experimentally that the ordering of importance might actually be the reverse. First, we show that by increasing the number of components, and by switching the initialization step from aspect-ratio and left-right-flipping heuristics to appearance-based clustering, considerable improvement in performance is obtained. More intriguingly, we show that with these new components, the part deformations can be completely switched off, with results that remain almost on par with the original DPM detector. Finally, we show initial results for using multiple components on a different problem, scene classification, suggesting that this idea may have wider applications beyond object detection. (See the clustering sketch below.)
    06/2012;
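
The appearance-based initialization argued for above can be approximated by clustering training-window descriptors and seeding one detector component per cluster. The descriptors below are synthetic stand-ins for HOG features.

```python
# Sketch of appearance-based component initialization: cluster training
# windows by appearance (here, synthetic HOG-like vectors) instead of by
# aspect ratio or left-right flips.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in descriptors: three visual "modes" of one object category.
modes = [rng.normal(m, 0.3, size=(40, 32)) for m in (0.0, 1.5, 3.0)]
descriptors = np.vstack(modes)

n_components = 3                       # number of mixture components
km = KMeans(n_clusters=n_components, n_init=10, random_state=0)
assignment = km.fit_predict(descriptors)

# Each cluster would seed one detector component, trained on its members.
for c in range(n_components):
    print(f"component {c}: {np.sum(assignment == c)} training windows")
```
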
  • ABSTRACT: Semantic perception involves naming objects and features in the scene, understanding the relations between them, and understanding the behaviors of agents, e.g., people, and their intent from sensor data. Semantic perception is a central component of future UGVs to provide representations which 1) can be used for higher-level reasoning and tactical behaviors, beyond the immediate needs of autonomous mobility, and 2) provide an intuitive description of the robot's environment in terms of semantic elements that can be shared effectively with a human operator. In this paper, we summarize the main approaches that we are investigating in the RCTA as initial steps toward the development of perception systems for UGVs.
    Proc SPIE 05/2012;
  • ABSTRACT: Sequence optimization, where the items in a list are ordered to maximize some reward, has many applications, such as web advertisement placement, search, and control libraries in robotics. Previous work in sequence optimization produces a static ordering that does not take any features of the item or context of the problem into account. In this work, we propose a general approach to order the items within the sequence based on the context (e.g., perceptual information, environment description, and goals). We take a simple, efficient, reduction-based approach where the choice and order of the items is established by repeatedly learning simple classifiers or regressors for each "slot" in the sequence. Our approach leverages recent work on submodular function maximization to provide a formal regret reduction from submodular sequence optimization to simple cost-sensitive prediction. We apply our contextual sequence prediction algorithm to optimize control libraries and demonstrate results on two robotics problems: manipulator trajectory prediction and mobile robot path planning. (A slot-wise sketch follows below.)
    02/2012;
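
The reduction described above trains a simple predictor per "slot" that chooses the next item from context features. The sketch below is a toy version with invented items and synthetic labels, using a greedy pass that skips already-chosen items.

```python
# Toy sketch of the slot-wise reduction: one classifier per "slot" picks
# the next library item from context features. Items, features, and the
# synthetic labels are all invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_items, n_slots, dim = 4, 3, 5
X = rng.normal(size=(200, dim))                    # context features
teachers = rng.normal(size=(n_slots, dim, n_items))
Y = np.stack([np.argmax(X @ t, axis=1) for t in teachers], axis=1)

slot_clfs = [LogisticRegression(max_iter=1000).fit(X, Y[:, s])
             for s in range(n_slots)]

def predict_sequence(context):
    """Greedy slot-by-slot ordering that skips items already chosen."""
    chosen = []
    for clf in slot_clfs:
        probs = clf.predict_proba(context[None, :])[0]
        ranked = clf.classes_[np.argsort(probs)[::-1]]
        chosen.append(int(next(i for i in ranked if i not in chosen)))
    return chosen

print(predict_sequence(rng.normal(size=dim)))      # e.g. [2, 0, 3]
```
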
  • ABSTRACT: We describe the software components of a robotics system designed to autonomously grasp objects and perform dexterous manipulation tasks with only high-level supervision. The system is centered on the tight integration of several core functionalities, including perception, planning, and control, with the logical structuring of tasks driven by a Behavior Tree architecture. The advantage of the implementation is to reduce the execution time while integrating advanced algorithms for autonomous manipulation. We describe our approach to 3-D perception, real-time planning, force-compliant motions, and audio processing. Performance results for object grasping and complex manipulation tasks, from in-house tests and from an independent evaluation team, are presented. (A minimal behavior-tree sketch follows below.)
    Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on; 01/2012
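
The Behavior Tree architecture mentioned above composes tasks with standard sequence (all children must succeed) and selector (first success wins) nodes. The sketch below implements just those two node types with invented leaf tasks; the real system's tree is of course far richer.

```python
# Minimal behavior-tree sketch in the spirit of the architecture the
# abstract describes; the node types are standard, the leaf tasks invented.
SUCCESS, FAILURE = "success", "failure"

class Sequence:
    def __init__(self, *children): self.children = children
    def tick(self):
        for c in self.children:
            if c.tick() == FAILURE:
                return FAILURE           # stop at the first failure
        return SUCCESS

class Selector:
    def __init__(self, *children): self.children = children
    def tick(self):
        for c in self.children:
            if c.tick() == SUCCESS:
                return SUCCESS           # stop at the first success
        return FAILURE

class Leaf:
    def __init__(self, name, ok=True): self.name, self.ok = name, ok
    def tick(self):
        print("tick:", self.name)
        return SUCCESS if self.ok else FAILURE

# Try grasping; if perception or planning fails, fall back to re-scanning.
tree = Selector(
    Sequence(Leaf("perceive object"), Leaf("plan grasp", ok=False),
             Leaf("execute grasp")),
    Leaf("re-scan scene"),
)
print("result:", tree.tick())
```
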
  • ABSTRACT: Simply choosing one model out of a large set of possibilities for a given vision task is a surprisingly difficult problem, especially if there is limited evaluation data with which to distinguish among models, such as when choosing the best “walk” action classifier from a large pool of classifiers tuned for different viewing angles, lighting conditions, and background clutter. In this paper we suggest that this problem of selecting a good model can be recast as a recommendation problem, where the goal is to recommend a good model for a particular task based on how well a limited probe set of models appears to perform. Through this conceptual remapping, we can bring to bear all the collaborative filtering techniques developed for consumer recommender systems (e.g., Netflix, Amazon.com). We test this hypothesis on action recognition, and find that even when every model has been directly rated on a training set, recommendation finds better selections for the corresponding test set than the best performers on the training set. (See the sketch below.)
    Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on; 01/2012
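
The conceptual remapping above amounts to completing a models-by-tasks performance matrix from a few probe evaluations. The sketch below does this with a rank-1 alternating-least-squares fit on synthetic data; real collaborative filtering systems use richer factorizations.

```python
# Toy sketch of model selection as recommendation: complete a models x
# tasks performance matrix from a few probe evaluations with a rank-1
# alternating-least-squares fit. All data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_tasks = 10, 6
u = rng.uniform(0.5, 1.0, size=n_models)
v = rng.uniform(0.5, 1.0, size=n_tasks)
perf = np.outer(u, v)                        # hidden true performance
mask = rng.uniform(size=perf.shape) < 0.4    # probe evaluations we ran
mask[:, 0] = True                            # every model probed on task 0
mask[0, :] = True                            # model 0 probed on every task

a, b = np.ones(n_models), np.ones(n_tasks)
for _ in range(50):                          # alternating least squares
    for i in range(n_models):
        o = mask[i]
        a[i] = perf[i, o] @ b[o] / (b[o] @ b[o])
    for j in range(n_tasks):
        o = mask[:, j]
        b[j] = perf[o, j] @ a[o] / (a[o] @ a[o])

task = 3                                     # pick a model for this task
print("recommended model:", int(np.argmax(a * b[task])))
print("true best model:  ", int(np.argmax(perf[:, task])))
```
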
  • T. Kanade, M. Hebert
    ABSTRACT: For understanding the behavior, intent, and environment of a person, the traditional approach follows the surveillance metaphor: install cameras and observe the subject and his/her interactions with other people and the environment. Instead, we argue that first-person vision (FPV), which senses the environment and the subject's activities from a wearable sensor, is more advantageous, with images of the subject's environment taken from his/her viewpoints and with readily available information about head motion and gaze through eye tracking. In this paper, we review key research challenges that need to be addressed to develop such FPV systems, and describe our ongoing work to address them using examples from our prototype systems.
    Proceedings of the IEEE 01/2012; 100(8):2442-2453. · 6.91 Impact Factor
  • ABSTRACT: Graph matching is an essential problem in computer vision that has been successfully applied to 2D and 3D feature matching and object recognition. Despite its importance, little has been published on learning the parameters that control graph matching, even though learning has been shown to be vital for improving the matching rate. In this paper we show how to perform parameter learning in an unsupervised fashion, that is, when no correct correspondences between graphs are given during training. Our experiments reveal that unsupervised learning compares favorably to the supervised case, both in terms of efficiency and quality, while avoiding the tedious manual labeling of ground-truth correspondences. We verify experimentally that our learning method can improve the performance of several state-of-the-art graph matching algorithms. We also show that a similar method can be successfully applied to parameter learning for graphical models and demonstrate its effectiveness empirically.
    International Journal of Computer Vision 01/2012; 96:28-45. · 3.62 Impact Factor
  • ABSTRACT: Vision-based prop-free pointing detection is challenging both from an algorithmic and a systems standpoint. From a computer vision perspective, accurately determining where multiple users are pointing is difficult in cluttered environments with dynamic scene content. Standard approaches relying on appearance models or background subtraction to segment users operate poorly in this domain. We propose a method that focuses on motion analysis to detect pointing gestures and robustly estimate the pointing direction. Our algorithm is self-initializing; as the user points, we analyze the observed motion from two cameras and infer the rotation centers that best explain the observed motion. From these, we group pixel-level flow into dominant pointing vectors that each originate from a rotation center, and merge across views to obtain 3D pointing vectors. However, our proposed algorithm is computationally expensive, posing systems challenges even with current computing infrastructure. We achieve interactive speeds by exploiting coarse-grained parallelization over a cluster of computers. In unconstrained environments, we obtain an average angular precision of 2.7°. (A sketch of the rotation-center estimate follows below.)
    Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on; 04/2011
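
The geometric core of the method above is that flow induced by rotation about a center is perpendicular to the radial direction, so each flow vector yields one linear equation in the unknown center. The sketch below solves those equations by least squares on synthetic flow.

```python
# Sketch of the geometric core: flow induced by rotation about a center c
# is perpendicular to (p - c), so each flow vector gives one linear
# equation (p - c) . v = 0 in the unknown center. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
true_c = np.array([3.0, -1.0])                 # e.g. a shoulder joint
pts = true_c + rng.normal(size=(30, 2)) * 2    # tracked pixel positions
radial = pts - true_c
flow = np.stack([-radial[:, 1], radial[:, 0]], axis=1)  # perpendicular
flow += rng.normal(size=flow.shape) * 0.02     # observation noise

# (p - c) . v = 0  =>  v . c = v . p : one row per flow vector.
A, b = flow, np.einsum("ij,ij->i", flow, pts)
c_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
print("estimated rotation center:", c_hat)     # ~ [3, -1]
```
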
  • Hongwen Kang, M. Hebert, T. Kanade
    ABSTRACT: In this paper we propose an image indexing and matching algorithm that relies on selecting distinctive high-dimensional features. In contrast with conventional techniques that treat all features equally, we claim that one can benefit significantly from focusing on distinctive features. We propose a bag-of-words algorithm that incorporates feature distinctiveness into visual vocabulary generation. Our approach compares favorably with the state of the art in image matching tasks on the University of Kentucky Recognition Benchmark dataset and on an indoor localization dataset. We also show that our approach scales up more gracefully on a large-scale Flickr dataset. (An idf-style sketch of the idea follows below.)
    Applications of Computer Vision (WACV), 2011 IEEE Workshop on; 02/2011
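
One simple proxy for the distinctiveness weighting described above is an idf-style reweighting that suppresses visual words common to many images. The counts below are toy data; the paper's actual criterion operates during vocabulary generation.

```python
# Toy sketch of favoring distinctive features in a bag-of-words model:
# down-weight visual words that appear in many images (an idf-style proxy
# for the distinctiveness criterion the abstract describes).
import numpy as np

# Rows = images, columns = visual-word counts (synthetic).
counts = np.array([
    [5, 1, 0, 0],
    [4, 0, 2, 0],
    [5, 0, 0, 3],
])
n_images = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log(n_images / doc_freq)        # word 0 is common -> weight 0

def bow(vec):
    v = vec * idf                         # distinctive words dominate
    n = np.linalg.norm(v)
    return v / n if n else v

query = np.array([4, 1, 0, 0])           # shares the rare word 1
sims = [bow(query) @ bow(c) for c in counts]
print("best match:", int(np.argmax(sims)))   # image 0, via word 1
```
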
  • ABSTRACT: The ability of a perception system to discern what is important in a scene and what is not is an invaluable asset, with multiple applications in object recognition, people detection, and SLAM, among others. In this paper, we aim to analyze all available sensory data to separate a scene into a few physically meaningful parts, which we term structure, while discarding background clutter. In particular, we consider the combination of image and range data, and base our decision on both appearance and 3D shape. Our main contribution is the development of a framework to perform scene segmentation that preserves physical objects using multi-modal data. We combine image and range data using a novel mid-level fusion technique based on the concept of regions that avoids any pixel-level correspondences between data sources. We associate groups of pixels with 3D points into multi-modal regions that we term regionlets, and measure the structure-ness of each regionlet using simple, bottom-up cues from image and range features. We show that the highest-ranked regionlets correspond to the most prominent objects in the scene. We verify the validity of our approach on 105 scenes of household environments.
    IEEE International Conference on Robotics and Automation, ICRA 2011, Shanghai, China, 9-13 May 2011; 01/2011
  • ABSTRACT: Nearly every structured prediction problem in computer vision requires approximate inference due to large and complex dependencies among output labels. While graphical models provide a clean separation between modeling and inference, learning these models with approximate inference is not well understood. Furthermore, even if a good model is learned, predictions are often inaccurate due to approximations. In this work, instead of performing inference over a graphical model, we consider the inference procedure as a composition of predictors. Specifically, we focus on message-passing algorithms, such as Belief Propagation, and show how they can be viewed as procedures that sequentially predict label distributions at each node of a graph. Given labeled graphs, we can then train the sequence of predictors to output the correct labelings. The result no longer corresponds to a graphical model but simply defines an inference procedure, with strong theoretical properties, that can be used to classify new graphs. We demonstrate the scalability and efficacy of our approach on 3D point cloud classification and 3D surface estimation from single images. (A sketch of the predictor composition follows below.)
    The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011; 01/2011
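
The "composition of predictors" view above can be sketched as a few rounds in which a classifier re-predicts each node's label distribution from its own features plus its neighbors' current beliefs. The chain graph, features, and labels below are synthetic.

```python
# Sketch of inference as a composition of predictors: each round, a
# classifier re-predicts every node's label distribution from its own
# features plus its neighbors' current beliefs (synthetic chain graph).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
y = (np.arange(n) // 25 % 2).astype(int)             # smooth label segments
X = y[:, None] + rng.normal(size=(n, 1)) * 1.5       # noisy node features

def neighbor_beliefs(B):
    """Average the two chain neighbors' current label distributions."""
    return (np.roll(B, 1, axis=0) + np.roll(B, -1, axis=0)) / 2

belief = np.full((n, 2), 0.5)                        # uniform start
predictors = []
for _ in range(3):                                   # "message" rounds
    inputs = np.hstack([X, neighbor_beliefs(belief)])
    clf = LogisticRegression(max_iter=1000).fit(inputs, y)
    belief = clf.predict_proba(inputs)               # updated beliefs
    predictors.append(clf)

print("train accuracy:", (belief.argmax(axis=1) == y).mean())
```
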
  • ABSTRACT: We address the problem of understanding scenes from 3-D laser scans via per-point assignment of semantic labels. In order to mitigate the difficulties of using a graphical model to capture the contextual relationships among the 3-D points, we instead propose a multi-stage inference procedure. More specifically, we train this procedure to use point cloud statistics and learn relational information (e.g., tree trunks are below vegetation) over fine (point-wise) and coarse (region-wise) scales. We evaluate our approach on three different datasets obtained from different sensors, and demonstrate improved performance.
    IEEE International Conference on Robotics and Automation, ICRA 2011, Shanghai, China, 9-13 May 2011; 01/2011
  • Hongwen Kang, Martial Hebert, Takeo Kanade
    ABSTRACT: We propose an approach to identify and segment objects from scenes that a person (or robot) encounters in Activities of Daily Living (ADL). Images collected in those cluttered scenes contain multiple objects, and each image provides only a partial, possibly very different, view of each object. An object instance discovery program must be able to link pieces of visual information from multiple images and extract the consistent patterns.
    IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011; 01/2011