Conference Paper

Active Scene Recognition for Programming by Demonstration using Next-Best-View Estimates from Hierarchical Implicit Shape Models

Abstract

We present an approach that combines passive scene understanding with object search in order to recognize scenes in indoor environments that cannot be perceived from a single point of view. Passive scene recognition is based on spatial relations between objects and uses Implicit Shape Models (ISMs). ISMs, a variant of the Generalized Hough Transform, are extended to describe scenes as sets of objects with relations in between. Relations are expressed as six Degree-of-Freedom (DoF) relative object poses. They are extracted from sensor recordings of human demonstrations of actions that usually take place in the corresponding scene. Within a scene, an ISM solely represents relations of n objects towards a common reference, so violations of other relations are not detectable. To overcome this limitation, we extend our scene models to binary trees of ISMs using hierarchical agglomerative clustering. Active scene recognition aims to simultaneously detect present scenes and localize the objects these scenes consist of. For a pivoting stereo camera rig, we achieve this by performing recognition with hierarchical ISMs in an object search loop using Next-Best-View (NBV) estimation. One criterion by which we greedily choose the views the rig shall adopt next is the confidence of detecting objects in them. In each search step, the confidence in potential positions of objects not yet found is calculated based on the best available scene hypothesis. This is done by partly reversing the basic principle of ISMs and using spatial relations to predict potential object positions starting from objects already detected.
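The voting idea and its reversal for pose prediction can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes planar poses (x, y, theta) instead of 6 DoF poses, a single ISM instead of a tree, and a simple position grid for vote binning. All function names (`compose`, `invert`, `learn_relations`, `recognize`, `predict_missing`) and the `bin_size` parameter are hypothetical.

```python
import math
from collections import defaultdict


def compose(a, b):
    """Chain two planar poses (x, y, theta): apply b in the frame of a."""
    ax, ay, ath = a
    bx, by, bth = b
    return (ax + math.cos(ath) * bx - math.sin(ath) * by,
            ay + math.sin(ath) * bx + math.cos(ath) * by,
            ath + bth)


def invert(p):
    """Inverse of a planar pose, so compose(p, invert(p)) is the identity."""
    x, y, th = p
    c, s = math.cos(th), math.sin(th)
    return (-(c * x + s * y), s * x - c * y, -th)


def learn_relations(demo_poses, reference):
    """For every object, store the reference pose relative to that object."""
    ref_pose = demo_poses[reference]
    return {name: compose(invert(pose), ref_pose)
            for name, pose in demo_poses.items()}


def recognize(relations, detections, bin_size=0.1):
    """Hough-style voting: each detected object votes for a reference pose.

    Assumes at least one detection and that every detected name is in
    `relations`. Binning uses position only (a simplification).
    """
    bins = defaultdict(list)
    for name, pose in detections.items():
        vote = compose(pose, relations[name])
        key = (round(vote[0] / bin_size), round(vote[1] / bin_size))
        bins[key].append(vote)
    best = max(bins.values(), key=len)
    # Confidence: fraction of the scene's objects that agree on one bin.
    confidence = len(best) / len(relations)
    return best[0], confidence


def predict_missing(relations, detections, reference_pose):
    """Reverse the voting step: predict poses of objects not found yet."""
    return {name: compose(reference_pose, invert(rel))
            for name, rel in relations.items() if name not in detections}
```

In this sketch, an object search loop would call `recognize` on the objects detected so far, pass the resulting reference pose to `predict_missing`, and point the camera at the predicted poses before repeating the cycle.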

... In a realistic scenario, such an approach is infeasible due to combinatorial explosion. In [2], we presented a method that, given a partially recognized ... (Footnote: All authors are with the Institute of Anthropomatics, Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany. pascal.meissner@kit.edu) (Footnote 1: An approach for a mobile robot with a pivoting camera is to discretize the space of robot positions with a given resolution to a grid.) ...
... We call the concept of alternating scene recognition and object search Active Scene Recognition (ASR). In prior work [2], we demonstrated it with a simplified approach for a fixed robot head. Camera views were only optimized regarding their orientations for a single, given view position. ...
... Both position and orientation of a pivoting head of a mobile robot are optimized in terms of utility and costs. Beyond prior work [2], we introduce a new utility function and an algorithm for generating NBV candidates on a 4 DoF space of discrete combinations of robot positions and sensor orientations. Since uniform sampling of the robot position space is infeasible at the resolutions required in our scenario, we contribute a hierarchical strategy to reduce the number of searched positions: we start an iterative process by searching for a view at a coarse resolution. ...
Conference Paper
We present an approach for recognizing indoor scenes in object constellations that require object search by a mobile robot, as they cannot be captured from a single viewpoint. In our approach, which we call Active Scene Recognition (ASR), robots predict object poses from learnt spatial relations that they combine with their estimates about present scenes. Our models for estimating scenes and predicting poses are Implicit Shape Model (ISM) trees from prior work [1]. ISMs model scenes as sets of objects with spatial relations in between and are learnt from observations. In prior work [2], we presented a realization of ASR that was limited to choosing orientations for a fixed robot head and searched for objects using positions while ignoring types. In this paper, we introduce an integrated system that extends ASR to selecting positions and orientations of camera views for a mobile robot with a pivoting head. We contribute an approach for Next-Best-View estimation in object search on predicted object poses. It is defined on 6 DoF viewing frustums and optimizes the searched view, together with the objects to be searched in it, based on 6 DoF pose predictions. To prevent combinatorial explosion when searching camera pose space, we introduce a hierarchical approach to sample robot positions with increasing resolution.
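The idea of sampling robot positions with increasing resolution can be sketched as a coarse-to-fine grid search. This is only an illustration under stated assumptions: it covers 2D positions without sensor orientations, the function name `coarse_to_fine_search` and its parameters are hypothetical, and the `utility` callback stands in for whatever view rating the actual system uses.

```python
def coarse_to_fine_search(utility, center, half_width, levels=4, n=5):
    """Greedy coarse-to-fine search for a high-utility 2D position.

    At each level, an n x n grid is laid over the current window, the best
    grid point is kept, and the window shrinks to one grid step around it.
    """
    best = center
    for _ in range(levels):
        step = 2 * half_width / (n - 1)
        candidates = [
            (center[0] - half_width + i * step,
             center[1] - half_width + j * step)
            for i in range(n)
            for j in range(n)
        ]
        best = max(candidates, key=utility)
        center = best
        half_width = step  # next window spans one coarse step per side
    return best
```

With `half_width=4.0`, `levels=6` and `n=5`, this evaluates 6 x 25 = 150 candidates and ends at a grid spacing of 0.0625, whereas a uniform grid of that spacing over the same 8 x 8 region would contain 129 x 129 points. The greedy refinement can miss the global optimum when the utility has several separated peaks.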
... Their redesign required expert knowledge. We successfully used both representation and learning in [2] to explore scenes by object search with a pivoting robot head. ...
Conference Paper
We present an approach that uses combinatorial optimization to decide which spatial relations between objects are relevant to accurately describe an indoor scene, made up of objects. We extract scene models from object configurations that are acquired during demonstration of actions, characteristic for a certain scene. We model scenes as graphs with Implicit Shape Models (ISMs), a Generalized Hough Transform variant. ISMs are limited to represent scenes as star-shaped topologies of object relations, leading to false positives in recognizing scenes. To describe other relation topologies, we introduced a representation of trees of ISMs in prior work together with a method to learn such ISM trees from demonstrations. Limited to creating topologies, corresponding to spanning trees, that method omits certain relations so that false positives still occur. In this paper, we introduce a method to convert any relation topology, corresponding to a connected graph, into an ISM tree using a heuristic depth-first-search. It allows using complete graphs as scene models. Despite causing no false positives, complete graphs are intractable for scene recognition. To achieve efficiency, we contribute a method that searches for an optimal relation topology by traversing the space of connected scene graphs, for a given set of objects, using an optimization similar to hill climbing. Optimality is defined as minimizing computational costs during scene recognition, while producing a minimum of false positives. Experiments with up to 15 objects show that both are achievable by the presented method. Costs, growing exponentially with the number of objects, are transferred from online recognition to offline optimization.
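A hill-climbing search over connected relation topologies, as outlined above, might look like the following sketch. It is an assumption-laden illustration, not the paper's method: objects are integers, a topology is a set of undirected edges `(a, b)` with `a < b`, neighbouring topologies differ by exactly one relation, and the `cost` callback is a stand-in for the combination of false positives and recognition runtime. All names are hypothetical.

```python
from itertools import combinations


def is_connected(n, edges):
    """Check via depth-first search that all n objects are linked."""
    adj = {i: set() for i in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == n


def neighbours(n, edges):
    """Topologies that differ by one relation and remain connected."""
    for e in set(combinations(range(n), 2)) - edges:
        yield edges | {e}          # adding a relation keeps connectivity
    for e in edges:
        smaller = edges - {e}
        if is_connected(n, smaller):
            yield smaller


def hill_climb(n, start, cost):
    """Move to the cheapest neighbour until no neighbour improves the cost."""
    current = frozenset(start)
    while True:
        best = min(neighbours(n, current), key=cost, default=current)
        if cost(best) >= cost(current):
            return current
        current = best
```

A toy run could start from a star topology and use a cost such as `10 * missing_required_relations + number_of_relations`, which rewards covering the relations that matter while keeping the topology small. Like any hill climbing, the search may stop in a local optimum.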
Chapter
Detailed technical presentation of our contributions that are related to Active Scene Recognition. This includes our approaches to Object Pose Prediction and Next-Best-View estimation.
Conference Paper
We present an approach for recognizing scenes, consisting of spatial relations between objects, in unstructured indoor environments, which change over time. Object relations are represented by full six Degree-of-Freedom (DoF) coordinate transformations between objects. They are acquired from object poses that are visually perceived while people demonstrate actions that are typically performed in a given scene. We recognize scenes using an Implicit Shape Model (ISM) that is similar to the Generalized Hough Transform. We extend it to take orientations between objects into account. This includes a verification step that allows us to infer not only the existence of scenes, but also the objects they are composed of. ISMs are restricted to representing scenes as star topologies of relations, which insufficiently approximate object relations in complex dynamic settings, so false positive detections may occur. Our solution is a set of exchangeable heuristics for recognizing object relations that have to be represented explicitly in separate ISMs. Object relations are modeled by the ISMs themselves. We use hierarchical agglomerative clustering, employing the heuristics, to construct a tree of ISMs. Learning and recognition of scenes with a single ISM is naturally extended to multiple ISMs.
Article
We present a next-best-scan (NBS) planning approach for autonomous 3D modeling. The system successively completes a 3D model of complex-shaped objects by iteratively selecting an NBS based on previously acquired data. For this purpose, new range data is accumulated in-the-loop into a 3D surface (streaming reconstruction) and new continuous scan paths along the estimated surface trend are generated. Further, the space around the object is explored using a probabilistic exploration approach that considers sensor uncertainty. This allows for collision-free path planning in order to completely scan unknown objects. For each scan path, the expected information gain is determined and the best path is selected as NBS. The presented NBS approach is tested with a laser striper system attached to an industrial robot. The results are compared to state-of-the-art next-best-view methods. Our results show promising performance with respect to completeness, quality and scan time.
Conference Paper
Robots operating in domestic environments need to deal with a variety of different objects. Often, these objects are neither placed randomly, nor independently of each other. For example, objects on a breakfast table such as plates, knives, or bowls typically occur in recurrent configurations. In this paper, we propose a novel hierarchical generative model to reason about latent object constellations in a scene. The proposed model is a combination of Dirichlet processes and beta processes, which allow for a probabilistic treatment of the unknown dimensionality of the parameter space. We show how the model can be employed to address a set of different tasks in scene understanding ranging from unsupervised scene segmentation to completion of a partially specified scene. We describe how sampling in this model can be done using Markov chain Monte Carlo (MCMC) techniques and present an experimental evaluation with simulated as well as real-world data obtained with a Kinect camera.
Conference Paper
Robust vision-based grasping is still a hard problem for humanoid robot systems. When object localization is restricted to the camera system built into the robot's head, scenarios are often heavily simplified in order to allow the robot to grasp autonomously. Within the computer vision community, many object recognition and localization systems exist, but in general, they are not tailored to the application on a humanoid robot. In particular, accurate 6D object localization in the camera coordinate system with respect to a 3D rigid model is crucial for a general framework for grasping. While many approaches try to avoid the use of stereo calibration, we present a system that makes explicit use of the stereo camera system in order to achieve maximum depth accuracy. Our system can deal with textured objects as well as objects that can be segmented globally and are defined by their shape. Thus, it covers the cases of objects with complex texture and complex shape. Our work is directly linked to a grasping framework being implemented on the humanoid robot ARMAR and serves as its perception module for various grasping and manipulation experiments in a kitchen scenario.
Article
A scene analysis module for service robots is presented which uses SIFT in a stereo setting, a systematic handling of uncertainties and an active perception component. The system is integrated and evaluated on the DESIRE two-arm mobile robot. Complex everyday scenes composed of various items from a 100-object database are analyzed successfully and efficiently.
Conference Paper
This work demonstrates how 3D qualitative spatial relationships can be used to improve object detection by differentiating between true and false positive detections. Our method identifies the most likely subset of 3D detections using seven types of 3D relationships and adjusts detection confidence scores to improve the average precision. A model is learned using a structured support vector machine [1] from examples of 3D layouts of objects in offices and kitchens. We test our method on synthetic detections to determine how factors such as localization accuracy, number of detections and detection scores change the effectiveness of 3D spatial relationships for improving object detection rates. Finally, we describe a technique for generating 3D detections from 2D image-based object detections and demonstrate how our method improves the average precision of these 3D detections.
Article
In this article, we present an information gain-based variant of the next best view problem for occluded environments. Our proposed method utilizes a belief model of the unobserved space to estimate the expected information gain of each possible viewpoint. More precisely, this belief model allows a more accurate estimation of the visibility of occluded space and, with that, a more accurate prediction of the potential information gain of new viewing positions. We present an experimental evaluation on a robotic platform for active data acquisition; however, due to the generality of our approach, it also applies to a wide variety of 3D reconstruction problems. With the evaluation done in simulation and on a real robotic platform, exploring and acquiring data from different environments, we demonstrate the generality and usefulness of our approach for next best view estimation and autonomous data acquisition.
Article
In this paper, we provide a systematic study of the task of sensor planning for object search. The search agent's knowledge of object location is encoded as a discrete probability density which is updated whenever a sensing action occurs. Each sensing action of the agent is defined by a viewpoint, a viewing direction, a field-of-view, and the application of a recognition algorithm. The formulation casts sensor planning as an optimization problem: the goal is to maximize the probability of detecting the target with minimum cost. This problem is proved to be NP-Complete, thus a heuristic strategy is favored. To port the theoretical framework to a real working system, we propose a sensor planning strategy for a robot equipped with a camera that can pan, tilt, and zoom. In order to efficiently determine the sensing actions over time, the huge space of possible actions with fixed camera position is decomposed into a finite set of actions that must be considered. The next action is then selected from among these by comparing the likelihood of detection and the cost of each action. When detection is unlikely at the current position, the robot is moved to another position for which the probability of target detection is the highest. © 1999 Academic Press
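The interplay of a discrete location belief, greedy action selection and belief updates can be sketched as follows. This is a simplified illustration, not the formulation of the paper above: the belief is a dictionary over hypothetical cell names, each action is a dictionary with assumed keys `"detect"` (per-cell detection probability) and `"cost"`, and the update is a plain Bayes rule for the case that the target was not detected.

```python
def detection_probability(belief, action):
    """Probability that this action finds the target under the belief."""
    return sum(belief.get(cell, 0.0) * d
               for cell, d in action["detect"].items())


def choose_action(belief, actions):
    """Greedy heuristic: best ratio of detection probability to cost."""
    return max(actions,
               key=lambda a: detection_probability(belief, a) / a["cost"])


def update_after_miss(belief, action):
    """Bayes update of the belief after an unsuccessful sensing action.

    Assumes the miss has non-zero probability, so the normaliser is > 0.
    """
    unnormalised = {cell: p * (1.0 - action["detect"].get(cell, 0.0))
                    for cell, p in belief.items()}
    total = sum(unnormalised.values())
    return {cell: p / total for cell, p in unnormalised.items()}
```

After a miss, probability mass shifts away from the cells that were just inspected, so the next call to `choose_action` tends to favour views of regions not yet examined.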
Article
This paper presents a novel method for detecting and localizing objects of a visual category in cluttered real-world scenes. Our approach considers object categorization and figure-ground segmentation as two interleaved processes that closely collaborate towards a common goal. As shown in our work, the tight coupling between those two processes allows them to benefit from each other and improve the combined performance. The core part of our approach is a highly flexible learned representation for object shape that can combine the information observed on different training examples in a probabilistic extension of the Generalized Hough Transform. The resulting approach can detect categorical objects in novel images and automatically infer a probabilistic segmentation from the recognition result. This segmentation is then in turn used to again improve recognition by allowing the system to focus its efforts on object pixels and to discard misleading influences from the background. Moreover, the information from where in the image a hypothesis draws its support is employed in an MDL based hypothesis verification stage to resolve ambiguities between overlapping hypotheses and factor out the effects of partial occlusion. An extensive evaluation on several large data sets shows that the proposed system is applicable to a range of different object categories, including both rigid and articulated objects. In addition, its flexible representation allows it to achieve competitive object detection performance already from training sets that are between one and two orders of magnitude smaller than those used in comparable systems.