Martial Hebert

Carnegie Mellon University, Pittsburgh, Pennsylvania, United States

Publications (340) · 116.66 Total Impact Points

  • Source
    ABSTRACT: Cameras provide a rich source of information while being passive, cheap, and lightweight for small and medium Unmanned Aerial Vehicles (UAVs). In this work we present the first implementation of receding horizon control, which is widely used in ground vehicles, with monocular vision as the only sensing mode for autonomous UAV flight in dense clutter. We make it feasible on UAVs via a number of contributions: a novel coupling of perception and control via relevant and diverse multiple interpretations of the scene around the robot, leveraging recent advances in machine learning for anytime budgeted cost-sensitive feature selection, and fast non-linear regression for monocular depth prediction. We empirically demonstrate the efficacy of our pipeline via real-world experiments of more than 2 km through dense trees with a quadrotor built from off-the-shelf parts. Moreover, our pipeline is designed to also combine information from other modalities, such as stereo and lidar, when available.
    11/2014;
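    The receding-horizon loop described above can be illustrated with a short, hypothetical sketch: at each control cycle a depth map is predicted from the current monocular image, a small library of candidate trajectories is scored against it, and the first action of the lowest-cost trajectory is executed before replanning. The depth predictor, trajectory representation, and cost function below are illustrative placeholders, not the authors' implementation.

      import numpy as np

      def predict_depth(image):
          # Placeholder for the learned monocular depth regressor: here, pixel
          # intensity is simply treated as inverse depth so the example is self-contained.
          return 1.0 / (image + 1e-3)

      def trajectory_cost(depth_map, trajectory, collision_margin=1.0):
          # Score a candidate trajectory by the smallest predicted depth it sweeps over.
          rows, cols = trajectory
          clearance = depth_map[rows, cols].min()
          return max(0.0, collision_margin - clearance)

      def receding_horizon_step(image, trajectory_library):
          # One cycle: perceive, score every candidate, commit to the cheapest one.
          depth = predict_depth(image)
          costs = [trajectory_cost(depth, traj) for traj in trajectory_library]
          best = int(np.argmin(costs))
          return best, costs[best]

      # Tiny usage example: a synthetic 8x8 "image" and two straight-line trajectories.
      rng = np.random.default_rng(0)
      image = rng.uniform(0.1, 1.0, size=(8, 8))
      left = (np.arange(8), np.full(8, 2))
      right = (np.arange(8), np.full(8, 5))
      print(receding_horizon_step(image, [left, right]))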
  • ABSTRACT: We propose a regularized linear learning algorithm to sequence groups of features, where each group incurs test-time cost or computation. Specifically, we develop a simple extension to Orthogonal Matching Pursuit (OMP) that respects the structure of groups of features with variable costs, and we prove that it achieves near-optimal anytime linear prediction at each budget threshold where a new group is selected. Our algorithm and analysis extend to generalized linear models with multi-dimensional responses. We demonstrate the scalability of the resulting approach on large real-world datasets with many feature groups associated with test-time computational costs. Our method improves over Group Lasso and Group OMP in the anytime performance of linear predictions, measured in timeliness, an anytime prediction performance metric, while providing rigorous performance guarantees.
    09/2014;
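    A minimal sketch of the cost-aware greedy selection idea, assuming a plain least-squares objective: at each step the group with the largest residual reduction per unit of test-time cost is added, and the weights are refit on all selected groups. This only illustrates the general scheme; the paper's actual algorithm and guarantees differ in detail.

      import numpy as np

      def cost_aware_group_selection(X, y, groups, costs, budget):
          # Greedy, cost-sensitive group selection in the spirit of a group OMP.
          # X: (n, d) features; y: (n,) targets; groups: list of column-index arrays;
          # costs: per-group test-time cost; budget: total cost allowed.
          n, d = X.shape
          selected, spent = [], 0.0
          w = np.zeros(d)
          residual = y.copy()
          remaining = set(range(len(groups)))
          while True:
              affordable = [g for g in remaining if spent + costs[g] <= budget]
              if not affordable:
                  break
              # Gain of a group ~ correlation of its columns with the residual, per unit cost.
              def gain(g):
                  return np.linalg.norm(X[:, groups[g]].T @ residual) ** 2 / costs[g]
              best = max(affordable, key=gain)
              selected.append(best)
              spent += costs[best]
              remaining.remove(best)
              # Refit least squares on all columns selected so far.
              cols = np.concatenate([groups[g] for g in selected])
              w_sel, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
              w = np.zeros(d)
              w[cols] = w_sel
              residual = y - X @ w
          return selected, w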
  • Source
    ABSTRACT: Our long-term goal is to develop a general solution to the Lifelong Robotic Object Discovery (LROD) problem: to discover new objects in the environment while the robot operates, for as long as the robot operates. In this paper, we consider the first step towards LROD: we automatically process the raw data stream of an entire workday of a robotic agent to discover objects. Our key contribution to achieve this goal is to incorporate domain knowledge—robotic metadata—in the discovery process, in addition to visual data. We propose a general graph-based formulation for LROD in which generic domain knowledge is encoded as constraints. To make long-term object discovery feasible, we encode into our formulation the natural constraints and non-visual sensory information in service robotics. A key advantage of our generic formulation is that we can add, modify, or remove sources of domain knowledge dynamically, as they become available or as conditions change. In our experiments, we show that by adding domain knowledge we discover 2.7x more objects and decrease processing time by a factor of 190. With our optimized implementation, HerbDisc, we show for the first time a system that processes a video stream of 6 h 20 min of continuous exploration in cluttered human environments (and over half a million images) in 18 min 34 s, to discover 206 new objects with their 3D models.
    The International Journal of Robotics Research 01/2014; · 2.86 Impact Factor
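    The graph-based formulation above can be sketched as follows: object candidates are nodes, visual similarity proposes edges, and each piece of domain knowledge acts as a constraint that prunes edges before connected components are read off as discovered objects. The similarity and constraint functions below are stand-ins for the visual measurements and metadata cues (e.g., time, location, object size) discussed in the paper.

      def discover_objects(candidates, similarity, constraints, sim_threshold=0.8):
          # Toy constraint-filtered graph clustering for object discovery.
          # candidates: list of candidate segments (with metadata attached)
          # similarity: function(a, b) -> visual similarity in [0, 1]
          # constraints: list of functions(a, b) -> bool; an edge survives only if all hold
          # Returns groups of candidate indices (connected components = discovered objects).
          n = len(candidates)
          parent = list(range(n))

          def find(i):
              while parent[i] != i:
                  parent[i] = parent[parent[i]]
                  i = parent[i]
              return i

          def union(i, j):
              parent[find(i)] = find(j)

          for i in range(n):
              for j in range(i + 1, n):
                  a, b = candidates[i], candidates[j]
                  # Domain knowledge prunes pairs cheaply before visual similarity is computed.
                  if all(c(a, b) for c in constraints) and similarity(a, b) > sim_threshold:
                      union(i, j)

          groups = {}
          for i in range(n):
              groups.setdefault(find(i), []).append(i)
          return list(groups.values())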
  • Source
    ABSTRACT: Structured prediction plays a central role in machine learning applications from computational biology to computer vision. These models require significantly more computation than unstructured models, and, in many applications, algorithms may need to make predictions within a computational budget or in an anytime fashion. In this work we propose an anytime technique for learning structured prediction that, at training time, incorporates both structural elements and feature computation trade-offs that affect test-time inference. We apply our technique to the challenging problem of scene understanding in computer vision and demonstrate efficient and anytime predictions that gradually improve towards state-of-the-art classification performance as the allotted time increases.
    12/2013;
  • Scott Satkin, Martial Hebert
    ABSTRACT: We present a new algorithm 3DNN (3D Nearest-Neighbor), which is capable of matching an image with 3D data, independently of the viewpoint from which the image was captured. By leveraging rich annotations associated with each image, our algorithm can automatically produce precise and detailed 3D models of a scene from a single image. Moreover, we can transfer information across images to accurately label and segment objects in a scene. The true benefit of 3DNN compared to a traditional 2D nearest-neighbor approach is that by generalizing across viewpoints, we free ourselves from the need to have training examples captured from all possible viewpoints. Thus, we are able to achieve comparable results using orders of magnitude less data, and recognize objects from never-before-seen viewpoints. In this work, we describe the 3DNN algorithm and rigorously evaluate its performance for the tasks of geometry estimation and object detection/segmentation. By decoupling the viewpoint and the geometry of an image, we develop a scene matching approach which is truly 100% viewpoint invariant, yielding state-of-the-art performance on challenging data.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
  • Yong Jae Lee, Alexei A. Efros, Martial Hebert
    ABSTRACT: We present a weakly-supervised visual data mining approach that discovers connections between recurring mid-level visual elements in historic (temporal) and geographic (spatial) image collections, and attempts to capture the underlying visual style. In contrast to existing discovery methods that mine for patterns that remain visually consistent throughout the dataset, our goal is to discover visual elements whose appearance changes due to change in time or location; i.e., exhibit consistent stylistic variations across the label space (date or geo-location). To discover these elements, we first identify groups of patches that are style-sensitive. We then incrementally build correspondences to find the same element across the entire dataset. Finally, we train style-aware regressors that model each element's range of stylistic differences. We apply our approach to date and geo-location prediction and show substantial improvement over several baselines that do not model visual style. We also demonstrate the method's effectiveness on the related task of fine-grained classification.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
  • David F. Fouhey, Abhinav Gupta, Martial Hebert
    ABSTRACT: What primitives should we use to infer the rich 3D world behind an image? We argue that these primitives should be both visually discriminative and geometrically informative and we present a technique for discovering such primitives. We demonstrate the utility of our primitives by using them to infer 3D surface normals given a single image. Our technique substantially outperforms the state-of-the-art and shows improved cross-dataset performance.
    Proceedings of the 2013 IEEE International Conference on Computer Vision; 12/2013
  • ABSTRACT: We describe an architecture to provide online semantic labeling capabilities to field robots operating in urban environments. At the core of our system is the stacked hierarchical classifier developed by Munoz et al., which classifies regions in monocular color images using models derived from hand-labeled training data. The classifier is trained to identify buildings, several kinds of hard surfaces, grass, trees, and sky. When taking this algorithm into the real world, practical concerns with difficult and varying lighting conditions require careful control of the imaging process. First, camera exposure is controlled by software, examining all of the image's pixels, to compensate for the poorly performing, simplistic exposure algorithm built into the camera. Second, by merging multiple images taken with different exposure times, we are able to synthesize images with higher dynamic range than the ones produced by the sensor itself. The sensor's limited dynamic range makes it difficult to simultaneously expose areas in shadow and high-albedo surfaces that are directly illuminated by the sun. Texture is a key feature used by the classifier, and under- or over-exposed regions lacking texture are a leading cause of misclassifications. The results of the classifier are shared with higher-level elements operating in the UGV in order to perform tasks such as building identification from a distance and finding traversable surfaces.
    SPIE Defense, Security, and Sensing; 05/2013
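    The exposure-bracketing step can be sketched as a simple weighted merge: each pixel is converted to relative radiance and averaged across exposures with weights that favor well-exposed values. This is a generic exposure-fusion sketch under an assumed linear sensor response, not the specific pipeline fielded on the UGV.

      import numpy as np

      def fuse_exposures(images, exposure_times):
          # Merge differently exposed frames into a higher-dynamic-range intensity image.
          # images: list of (H, W) arrays with values in [0, 1]
          # exposure_times: exposure time of each frame, in seconds
          acc = np.zeros_like(images[0], dtype=np.float64)
          wsum = np.zeros_like(acc)
          for img, t in zip(images, exposure_times):
              # Hat-shaped weight: near-saturated and near-black pixels contribute little.
              w = 1.0 - np.abs(2.0 * img - 1.0)
              acc += w * (img / t)          # value / exposure time ~ relative radiance
              wsum += w
          return acc / np.maximum(wsum, 1e-6)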
  • Source
    ABSTRACT: In this paper, we consider the problem of Lifelong Robotic Object Discovery (LROD) as the long-term goal of discovering novel objects in the environment while the robot operates, for as long as the robot operates. As a first step towards LROD, we automatically process the raw video stream of an entire workday of a robotic agent to discover objects. We claim that the key to achieve this goal is to incorporate domain knowledge whenever available, in order to detect and adapt to changes in the environment. We propose a general graph-based formulation for LROD in which generic domain knowledge is encoded as constraints. Our formulation enables new sources of domain knowledge—metadata—to be added dynamically to the system, as they become available or as conditions change. By adding domain knowledge, we discover 2.7x more objects and decrease processing time by a factor of 190. Our optimized implementation, HerbDisc, processes 6 h 20 min of RGBD video of real human environments in 18 min 30 s, and discovers 121 correct novel objects with their 3D models.
    IEEE International Conference on Robotics and Automation (ICRA); 05/2013
  • Jean Oh, Arne Suppe, Anthony Stentz, Martial Hebert
    ABSTRACT: In robotics research, perception is one of the most challenging tasks. In contrast to existing approaches that rely only on computer vision, we propose an alternative method for improving perception by learning from human teammates. To evaluate, we apply this idea to a door detection problem. A set of preliminary experiments has been completed using software agents with real vision data. Our results demonstrate that information inferred from teammate observations significantly improves the perception precision.
    Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems; 05/2013
  • Source
    ABSTRACT: The detection and tracking of moving objects is an essential task in robotics. The CMU-RI Navlab group has developed such a system that uses a laser scanner as its primary sensor. We describe our algorithm and its use in several applications. Our system worked successfully on indoor and outdoor platforms and with several different kinds and configurations of two-dimensional and three-dimensional laser scanners. The applications range from collision warning and people classification to observing human tracks and providing input to a dynamic planner. Several of these systems were evaluated in live field tests and shown to be robust and reliable. © 2012 Wiley Periodicals, Inc.
    Journal of Field Robotics 01/2013; 30(1):17–43. · 2.15 Impact Factor
  • Source
    ABSTRACT: We address the problem of image-based scene analysis from streaming video, as would be seen from a moving platform, in order to efficiently generate spatially and temporally consistent predictions of semantic categories over time. In contrast to previous techniques which typically address this problem in batch and/or through graphical models, we demonstrate that by learning visual similarities between pixels across frames, a simple filtering algorithm is able to achieve high performance predictions in an efficient and online/causal manner. Our technique is a meta-algorithm that can be efficiently wrapped around any scene analysis technique that produces a per-pixel semantic category distribution. We validate our approach over three different scene analysis techniques on three different datasets that contain different semantic object categories. Our experiments demonstrate that our approach is very efficient in practice and substantially improves the consistency of the predictions over time.
    Robotics and Automation (ICRA), 2013 IEEE International Conference on; 01/2013
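    The filtering idea can be sketched as a causal, per-pixel recursive update: the previous frame's filtered label distribution is warped into the current frame via a pixel correspondence map and blended with the new prediction. The fixed blending weight below stands in for the learned cross-frame visual similarities used in the paper.

      import numpy as np

      def filter_label_distributions(prev_filtered, current, correspondence, alpha=0.7):
          # One causal filtering step for per-pixel semantic label distributions.
          # prev_filtered:  (H, W, K) filtered distribution from the previous frame
          # current:        (H, W, K) raw per-pixel distribution from the scene analyzer
          # correspondence: (H, W, 2) integer map giving, for each current pixel, the
          #                 (row, col) of its corresponding pixel in the previous frame
          # alpha:          how much to trust the propagated past vs. the new prediction
          rows = correspondence[..., 0]
          cols = correspondence[..., 1]
          propagated = prev_filtered[rows, cols]            # warp the past into the current frame
          blended = alpha * propagated + (1.0 - alpha) * current
          return blended / blended.sum(axis=-1, keepdims=True)  # renormalize to distributions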
  • ABSTRACT: Rich scene understanding from 3-D point clouds is a challenging task that requires contextual reasoning, which is typically computationally expensive. The task is further complicated when we expect the scene analysis algorithm to also efficiently handle data that is continuously streamed from a sensor on a mobile robot. Hence, we are typically forced to make a choice between 1) using a precise representation of the scene at the cost of speed, or 2) making fast, though inaccurate, approximations at the cost of increased misclassifications. In this work, we demonstrate that we can achieve the best of both worlds by using an efficient and simple representation of the scene in conjunction with recent developments in structured prediction in order to obtain both efficient and state-of-the-art classifications. Furthermore, this efficient scene representation naturally handles streaming data and provides a 300% to 500% speedup over more precise representations.
    Robotics and Automation (ICRA), 2013 IEEE International Conference on; 01/2013
  • ABSTRACT: We investigate the problem of selecting a state machine from a library to control a robot. We are particularly interested in this problem when evaluating such state machines on a particular robotics task is expensive. As a motivating example, we consider a problem where a simulated vacuuming robot must select a driving state machine well-suited for a particular (unknown) room layout. By borrowing concepts from collaborative filtering (recommender systems such as Netflix and Amazon.com), we present a multi-armed bandit formulation that incorporates recommendation techniques to efficiently select state machines for individual room layouts. We show that this formulation outperforms the individual approaches (recommendation, multi-armed bandits) as well as the baseline of selecting the 'average best' state machine across all rooms.
    Robotics and Automation (ICRA), 2013 IEEE International Conference on; 01/2013
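    A small sketch of the combined formulation, under illustrative assumptions: a low-rank factorization of the ratings of state machines on previously seen rooms yields a prior score for every machine on a new room, and an upper-confidence-bound rule mixes that prior with the rewards observed so far. The factorization rank, exploration constant, and the assumption that the first few machines are the ones already probed are all illustrative choices, not the paper's exact method.

      import numpy as np

      def recommendation_prior(ratings, new_room_scores, rank=2):
          # Predict how well each state machine should do on a new room.
          # ratings:         (n_machines, n_rooms) performance on previously evaluated rooms
          # new_room_scores: (n_probed,) observed scores of the first few machines on the new room
          U, S, Vt = np.linalg.svd(ratings, full_matrices=False)
          machine_factors = U[:, :rank] * S[:rank]          # low-dimensional machine factors
          probed = machine_factors[: len(new_room_scores)]  # assume the first machines were probed
          room_factor, *_ = np.linalg.lstsq(probed, new_room_scores, rcond=None)
          return machine_factors @ room_factor              # predicted score for every machine

      def ucb_select(prior, counts, means, t, c=1.0):
          # UCB-style arm selection seeded with the recommendation prior.
          bonus = c * np.sqrt(np.log(t + 1.0) / np.maximum(counts, 1))
          estimates = np.where(counts > 0, means, prior)    # fall back on the prior for unpulled arms
          return int(np.argmax(estimates + bonus))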
  • Source
    ABSTRACT: Autonomous navigation for large Unmanned Aerial Vehicles (UAVs) is fairly straightforward, as expensive sensors and monitoring devices can be employed. In contrast, obstacle avoidance remains a challenging task for Micro Aerial Vehicles (MAVs) which operate at low altitude in cluttered environments. Unlike large vehicles, MAVs can only carry very light sensors, such as cameras, making autonomous navigation through obstacles much more challenging. In this paper, we describe a system that navigates a small quadrotor helicopter autonomously at low altitude through natural forest environments. Using only a single cheap camera to perceive the environment, we are able to maintain a constant velocity of up to 1.5 m/s. Given a small set of human pilot demonstrations, we use recent state-of-the-art imitation learning techniques to train a controller that can avoid trees by adapting the MAV's heading. We demonstrate the performance of our system in a more controlled environment indoors, and in real natural forest environments outdoors.
    Proceedings - IEEE International Conference on Robotics and Automation 11/2012;
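    The imitation-learning setup can be illustrated with a DAgger-style sketch; the paper only says it uses recent imitation learning techniques, and the rollout interface, feature extractor, and linear regressor here are placeholders: fly the current policy, have the expert label the headings it would have commanded in the visited states, aggregate those labels, and retrain the regressor from image features to heading.

      import numpy as np

      def train_heading_policy(env_rollout, expert_heading, features, n_iters=5):
          # DAgger-style training loop for an image-to-heading controller (illustrative).
          # env_rollout:    function(policy) -> list of images seen while flying that policy
          # expert_heading: function(image)  -> heading the human pilot would command
          # features:       function(image)  -> feature vector
          X, y = [], []
          w = None

          def policy(image):
              if w is None:
                  return expert_heading(image)              # first pass: fly with the expert
              return float(features(image) @ w)

          for _ in range(n_iters):
              for image in env_rollout(policy):             # visit states induced by current policy
                  X.append(features(image))
                  y.append(expert_heading(image))           # expert labels those states
              A, b = np.asarray(X), np.asarray(y)
              w, *_ = np.linalg.lstsq(A, b, rcond=None)     # retrain on the aggregated dataset
          return w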
  • ABSTRACT: The problem of training classifiers from limited data is one that particularly affects large-scale and social applications, and as a result, although carefully trained machine learning forms the backbone of many current techniques in research, it sees dramatically fewer applications for end-users. Recently we demonstrated a technique for selecting or recommending a single good classifier from a large library even with highly impoverished training data. We consider alternatives for extending our recommendation technique to sets of classifiers, including a modification to the AdaBoost algorithm that incorporates recommendation. Evaluating on an action recognition problem, we present two viable methods for extending model recommendation to sets.
    Proceedings of the 12th international conference on Computer Vision - Volume Part I; 10/2012
  • ABSTRACT: Generating meaningful digests of videos by extracting interesting frames remains a difficult task. In this paper, we define interesting events as unusual events that occur rarely in the entire video, and we propose a novel interesting-event summarization framework based on the technique of density ratio estimation recently introduced in machine learning. Our proposed framework is unsupervised and it can be applied to general video sources, including videos from moving cameras. We evaluated the proposed approach on a publicly available dataset in the context of anomalous crowd behavior and on a challenging personal video dataset, demonstrating competitive performance both in accuracy relative to human annotation and in computation time.
    Proceedings of the 12th international conference on Computer Vision - Volume Part III; 10/2012
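    The density-ratio idea can be sketched with a small kernel estimator in the spirit of uLSIF: fit the ratio between the feature density of the window being scored and that of reference (ordinary) frames, and flag frames whose ratio deviates strongly. The Gaussian kernel, bandwidth, and regularization below are illustrative choices, not necessarily those used in the paper.

      import numpy as np

      def gaussian_kernel(X, centers, sigma):
          # Pairwise Gaussian kernel values between rows of X and the kernel centers.
          d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
          return np.exp(-d2 / (2.0 * sigma ** 2))

      def ulsif_ratio(reference, test, centers, sigma=1.0, lam=1e-3):
          # Least-squares density-ratio estimation (uLSIF-style).
          # reference: (n_ref, d) features from ordinary frames
          # test:      (n_test, d) features from the window being scored
          # centers:   (m, d) kernel centers (e.g. a subsample of the test features)
          # Returns a function scoring the estimated ratio for new feature vectors.
          K_ref = gaussian_kernel(reference, centers, sigma)    # (n_ref, m)
          K_test = gaussian_kernel(test, centers, sigma)        # (n_test, m)
          H = K_ref.T @ K_ref / len(reference)
          h = K_test.mean(axis=0)
          alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
          return lambda x: gaussian_kernel(np.atleast_2d(x), centers, sigma) @ alpha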
  • ABSTRACT: Object discovery algorithms group together image regions that originate from the same object. This process is effective when the input collection of images contains a large number of densely sampled views of each object, thereby creating strong connections between nearby views. However, existing approaches are less effective when the input data only provide sparse coverage of object views. We propose an approach for object discovery that addresses this problem. We collect a database of about 5 million product images that capture 1.2 million objects from multiple views. We represent each region in the input image by a "bag" of database object regions. We group input regions together if they share similar "bags of regions." Our approach can correctly discover links between regions of the same object even if they are captured from dramatically different viewpoints. With the help of these added links, our proposed approach can robustly discover object instances even with sparse coverage of the viewpoints.
    Proceedings of the 12th European conference on Computer Vision - Volume Part VI; 10/2012
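    The linking step can be illustrated with a short sketch: each input region is described by a sparse "bag" vector over database object regions, and two input regions are linked when their bags are similar, here measured by cosine similarity above a threshold. The bag construction and the threshold are illustrative assumptions.

      import numpy as np

      def link_regions(bags, threshold=0.3):
          # Link input regions whose "bags of database regions" overlap strongly.
          # bags: (n_regions, n_database_regions) matrix; bags[i, j] is the match
          #       strength between input region i and database object region j.
          # Returns the list of linked pairs (i, j).
          norms = np.linalg.norm(bags, axis=1, keepdims=True)
          normalized = bags / np.maximum(norms, 1e-12)
          sims = normalized @ normalized.T                      # cosine similarity between bags
          links = []
          n = len(bags)
          for i in range(n):
              for j in range(i + 1, n):
                  if sims[i, j] > threshold:
                      links.append((i, j))
          return links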
  • Source
    ABSTRACT: We address the problem of understanding scenes from multiple sources of sensor data (e.g., a camera and a laser scanner) in the case where there is no one-to-one correspondence across modalities (e.g., pixels and 3-D points). This is an important scenario that frequently arises in practice not only when two different types of sensors are used, but also when the sensors are not co-located and have different sampling rates. Previous work has addressed this problem by restricting interpretation to a single representation in one of the domains, with augmented features that attempt to encode the information from the other modalities. Instead, we propose to analyze all modalities simultaneously while propagating information across domains during the inference procedure. In addition to the immediate benefit of generating a complete interpretation in all of the modalities, we demonstrate that this co-inference approach also improves performance over the canonical approach.
    Proceedings of the 12th European conference on Computer Vision - Volume Part VI; 10/2012
  • Source
    Conference Paper: Activity forecasting
    ABSTRACT: We address the task of inferring the future actions of people from noisy visual input. We denote this task activity forecasting. To achieve accurate activity forecasting, our approach models the effect of the physical environment on the choice of human actions. This is accomplished by the use of state-of-the-art semantic scene understanding combined with ideas from optimal control theory. Our unified model also integrates several other key elements of activity analysis, namely, destination forecasting, sequence smoothing and transfer learning. As proof-of-concept, we focus on the domain of trajectory-based activity analysis from visual input. Experimental results demonstrate that our model accurately predicts distributions over future actions of individuals. We show how the same techniques can improve the results of tracking algorithms by leveraging information about likely goals and trajectories.
    Proceedings of the 12th European conference on Computer Vision - Volume Part IV; 10/2012
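    The optimal-control component can be illustrated with a softened value iteration sketch on a 4-connected grid: a per-cell reward derived from semantic scene features induces a goal-directed soft value function, from which a stochastic policy and distributions over future trajectories can be computed. The grid world, reward map, and soft-maximum update below are an illustrative simplification, not the exact model in the paper.

      import numpy as np

      def soft_value_iteration(reward, goal, n_iters=100):
          # Softened value iteration on a 4-connected grid (maximum-entropy style).
          # reward: (H, W) per-cell reward (e.g. derived from semantic scene labels)
          # goal:   (row, col) absorbing goal cell
          # Returns V, the soft value of reaching the goal from every cell.
          H, W = reward.shape
          V = np.full((H, W), -1e6)
          V[goal] = 0.0
          moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
          for _ in range(n_iters):
              Q = np.full((len(moves), H, W), -1e6)
              for a, (dr, dc) in enumerate(moves):
                  # shifted[r, c] = V at the neighbor (r + dr, c + dc), where it exists.
                  shifted = np.full((H, W), -1e6)
                  rs = slice(max(dr, 0), H + min(dr, 0))
                  cs = slice(max(dc, 0), W + min(dc, 0))
                  rd = slice(max(-dr, 0), H + min(-dr, 0))
                  cd = slice(max(-dc, 0), W + min(-dc, 0))
                  shifted[rd, cd] = V[rs, cs]
                  Q[a] = reward + shifted
              V = np.logaddexp.reduce(Q, axis=0)                # soft maximum over actions
              V[goal] = 0.0                                     # keep the goal absorbing
          return V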

Publication Stats

9k Citations
116.66 Total Impact Points

Institutions

  • 1987–2013
    • Carnegie Mellon University
      • Robotics Institute
      • Computer Science Department
      Pittsburgh, Pennsylvania, United States
  • 2010
    • University of Illinois, Urbana-Champaign
      Urbana, Illinois, United States
  • 2009
    • NEC Corporation
      Tōkyō, Japan
  • 1998
    • California Institute of Technology
      • Jet Propulsion Laboratory
      Pasadena, California, United States