Preprint

Task-Focused Few-Shot Object Detection for Robot Manipulation


Abstract

This paper addresses the problem of mobile robot manipulation of novel objects via detection. Our approach uses vision and control as complementary functions that learn from real-world tasks. We develop a manipulation method based solely on detection, then introduce task-focused few-shot object detection to learn new objects and settings. The current paradigm for few-shot object detection uses existing annotated examples. In contrast, we extend this paradigm with active data collection and annotation selection that improves performance for specific downstream tasks (e.g., depth estimation and grasping). In experiments for our interactive approach to few-shot learning, we train a robot to manipulate objects directly from detection (ClickBot). ClickBot learns visual servo control from a single click of annotation, grasps novel objects in clutter and other settings, and achieves state-of-the-art results on an existing visual servo control and depth estimation benchmark. Finally, we establish a task-focused few-shot object detection benchmark to support future research: https://github.com/griffbr/TFOD.
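To make the detection-to-control idea concrete, here is a minimal sketch of a proportional servo step computed from a single detected bounding box. This is an illustrative assumption, not ClickBot's actual controller; the function name, gain, and interface are hypothetical.

```python
import numpy as np

def detection_servo_command(box, image_size, gain=0.5):
    """Proportional image-based servo step from one detection (hypothetical sketch).

    box: (x_min, y_min, x_max, y_max) detected bounding box in pixels.
    image_size: (width, height) of the camera image.
    Returns a normalized (dx, dy) command that drives the box center
    toward the image center.
    """
    cx = 0.5 * (box[0] + box[2])
    cy = 0.5 * (box[1] + box[3])
    # Pixel error between box center and image center, normalized to [-1, 1].
    ex = (cx - 0.5 * image_size[0]) / (0.5 * image_size[0])
    ey = (cy - 0.5 * image_size[1]) / (0.5 * image_size[1])
    return -gain * np.array([ex, ey])
```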


References
Conference Paper
Most 3D reconstruction methods can recover scene properties only up to a global scale ambiguity. We present a novel approach to single view metrology that can recover the absolute scale of a scene, represented by the 3D heights of objects or the camera height above the ground, as well as the camera's orientation and field of view, using just a monocular image acquired in unconstrained conditions. Our method relies on data-driven priors learned by a deep network specifically designed to imbibe weakly supervised constraints from the interplay of the unknown camera with 3D entities such as object heights, through estimation of bounding box projections. We leverage categorical priors for objects such as humans or cars that commonly occur in natural images as references for scale estimation. We demonstrate state-of-the-art qualitative and quantitative results on several datasets as well as applications including virtual object insertion. Furthermore, the perceptual quality of our outputs is validated by a user study.
Article
There has been increasing interest worldwide in mobile manipulators capable of performing physical work in living spaces, corresponding to aging populations with declining birth rates and the expectation of improving quality of life (QoL). We assume that overall research and development will accelerate if many researchers use a common robot platform, since that enables them to share their research results. We have therefore developed a compact and safe research platform, the Human Support Robot (HSR), which can be operated in an actual home environment, and we have provided it to various research institutes to establish a developer community. The number of HSR users has expanded to 44 sites in 12 countries worldwide (as of November 30th, 2018). To activate the community, we expect robot competitions to be effective; following an international public offering, HSR has been adopted as a standard platform for international robot competitions such as RoboCup@Home and World Robot Summit (WRS), and it is provided to participants of those competitions. In this paper, we describe HSR's development background since 2006 and the technical details of its hardware design and software architecture. Specifically, we describe its omnidirectional mobile base using a dual-wheel caster-drive mechanism, which is the basis of HSR's operational movement, and a novel whole-body motion control system. Finally, we describe the verification of autonomous task capability and the results of its use in RoboCup@Home, demonstrating the effect of introducing the platform.
Article
Recent approaches in robot perception follow the insight that perception is facilitated by interaction with the environment. These approaches are subsumed under the term Interactive Perception (IP). This view of perception provides the following benefits. First, interaction with the environment creates a rich sensory signal that would otherwise not be present. Second, knowledge of the regularity in the combined space of sensory data and action parameters facilitates the prediction and interpretation of the sensory signal. In this survey, we postulate this as a principle for robot perception and collect evidence in its support by analyzing and categorizing existing work in this area. We also provide an overview of the most important applications of IP. We close this survey by discussing remaining open questions. With this survey, we hope to help define the field of Interactive Perception and to provide a valuable resource for future research.
Conference Paper
This paper proposes a method for bin-picking of objects without assuming a precise geometrical model. We consider the case where the shapes of objects are not uniform but are each approximately cylindrical. Using the point cloud of a single object, we extract probabilistic properties of the difference between the object and a cylinder and apply these properties to the pick-and-place motion planner for an object stacked on a table. Using the probabilistic properties, we can also realize a contact state where a finger maintains contact with the target object while avoiding contact with other objects. We further approximate the region occupied by the fingers by a rectangular parallelepiped. The pick-and-place motion is planned using a set of such regions in combination with the probabilistic properties. Finally, the effectiveness of the proposed method is confirmed by numerical examples and experimental results.
Article
Recent advances in object detection are mainly driven by deep learning with large-scale detection benchmarks. However, the fully-annotated training set is often limited for a target detection task, which may deteriorate the performance of deep detectors. To address this challenge, we propose a novel low-shot transfer detector (LSTD) in this paper, where we leverage rich source-domain knowledge to construct an effective target-domain detector with very few training examples. The main contributions are described as follows. First, we design a flexible deep architecture of LSTD to alleviate transfer difficulties in low-shot detection. This architecture can integrate the advantages of both SSD and Faster RCNN in a unified deep framework. Second, we introduce a novel regularized transfer learning framework for low-shot detection, where the transfer knowledge (TK) and background depression (BD) regularizations are proposed to leverage object knowledge respectively from source and target domains, in order to further enhance fine-tuning with a few target images. Finally, we examine our LSTD on a number of challenging low-shot detection experiments, where LSTD outperforms other state-of-the-art approaches. The results demonstrate that LSTD is a preferable deep detector for low-shot scenarios.
Conference Paper
Current learning-based robot grasping approaches exploit human-labeled datasets for training the models. However, there are two problems with such a methodology: (a) since each object can be grasped in multiple ways, manually labeling grasp locations is not a trivial task; (b) human labeling is biased by semantics. While there have been attempts to train robots using trial-and-error experiments, the amount of data used in such experiments remains substantially low and hence makes the learner prone to over-fitting. In this paper, we take the leap of increasing the available training data to 40 times more than prior work, leading to a dataset size of 50K data points collected over 700 hours of robot grasping attempts. This allows us to train a Convolutional Neural Network (CNN) for the task of predicting grasp locations without severe overfitting. In our formulation, we recast the regression problem to an 18-way binary classification over image patches. We also present a multi-stage learning approach where a CNN trained in one stage is used to collect hard negatives in subsequent stages. Our experiments clearly show the benefit of using large-scale datasets (and multi-stage training) for the task of grasping. We also compare to several baselines and show state-of-the-art performance on generalization to unseen objects for grasping.
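As an illustration of the recast from regression to classification, here is a minimal sketch of an 18-way grasp-angle head, one binary decision per 10-degree bin. The class name and feature dimension are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GraspAngleHead(nn.Module):
    """Toy head scoring 18 discretized grasp angles (10-degree bins) for an
    image patch, mirroring the recast of grasp regression as an 18-way
    binary classification. Feature dimension is an assumption."""

    def __init__(self, feature_dim=256):
        super().__init__()
        self.fc = nn.Linear(feature_dim, 18)  # one logit per angle bin

    def forward(self, patch_features):
        # Independent sigmoid per bin: each angle is its own binary
        # "graspable or not" decision rather than a softmax over bins.
        return torch.sigmoid(self.fc(patch_features))
```

Training such a head with a per-bin binary cross-entropy loss matches the binary-classification framing; the convolutional backbone that produces `patch_features` is omitted here.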
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
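The attention mechanism the Transformer is built on reduces to a short computation. Below is a minimal NumPy sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, as described in the paper; multi-head projections and masking are omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core Transformer operation: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: (sequence_length, d_k) arrays.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V
```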
Article
This paper presents a new image-based visual servoing approach that simultaneously solves the feature correspondence and control problem. Using a finite-time optimal control framework, feature correspondence is implicitly solved for each new image during the control selection, alleviating the need for additional image processing and feature tracking. The proposed approach demonstrates mild robustness properties and leads to acceptable or improved image feature behaviour and robot trajectories compared to classical image-based visual servoing, particularly for under-actuated robots. As such, preliminary experimental results using a small unmanned quadrotor are also presented.
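For reference, the classical image-based visual servo law that this approach is compared against can be sketched in a few lines. This is the textbook baseline v = -lambda * L^+ * e, not the paper's finite-time optimal control formulation.

```python
import numpy as np

def classical_ibvs_velocity(L, error, gain=0.5):
    """Classical image-based visual servo law v = -lambda * L^+ * e.

    L: interaction (image Jacobian) matrix, shape (2N, 6) for N point features.
    error: stacked image-feature error, shape (2N,).
    Returns a 6-DOF camera velocity twist.
    """
    return -gain * np.linalg.pinv(L) @ error
```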
Conference Paper
This paper introduces a local motion planning method for robotic systems with manipulating limbs, moving bases (legged or wheeled), and stance stability constraints arising from the presence of gravity. We formulate the problem of selecting local motions as a linearly constrained quadratic program (QP) that can be solved efficiently. The solution to this QP is a tuple of locally optimal joint velocities. By using these velocities to step towards a goal, both a path and an inverse-kinematic solution to the goal are obtained. This formulation can be used directly for real-time control, or as a local motion planner to connect waypoints. This method is particularly useful for high-degree-of-freedom mobile robotic systems, as the QP solution scales well with the number of joints. We also show how a number of practically important geometric constraints (collision avoidance, mechanism self-collision avoidance, gaze direction, etc.) can be readily incorporated into either the constraint or objective parts of the formulation. Additionally, motion of the base, a particular joint, or a particular link can be encouraged/discouraged as desired. We summarize the important kinematic variables of the formulation, including the stance Jacobian, the reach Jacobian, and a center of mass Jacobian. The method is easily extended to provide sparse solutions, where the fewest number of joints are moved, by iteration using Tibshirani's method to accommodate an l1 regularizer. The approach is validated and demonstrated on SURROGATE, a mobile robot with a TALON base, a 7 DOF serial-revolute torso, and two 7 DOF modular arms developed at JPL/Caltech.
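A minimal sketch of the core idea, selecting joint velocities by solving a small QP, is shown below using cvxpy. The stance-stability, collision, and sparsity terms from the paper are omitted, and the function and variable names are hypothetical.

```python
import cvxpy as cp

def local_motion_qp(J, v_goal, qdot_max):
    """Solve for locally optimal joint velocities as a QP (simplified sketch).

    J: end-effector Jacobian, shape (6, n).
    v_goal: desired end-effector twist, shape (6,).
    qdot_max: per-joint velocity limits, shape (n,).
    """
    n = J.shape[1]
    qdot = cp.Variable(n)
    # Track the desired end-effector twist as closely as possible...
    objective = cp.Minimize(cp.sum_squares(J @ qdot - v_goal))
    # ...subject to simple box limits on joint velocities.
    constraints = [cp.abs(qdot) <= qdot_max]
    cp.Problem(objective, constraints).solve()
    return qdot.value
```

Stepping the joints by the returned velocities and re-solving at each control cycle yields both a path and an inverse-kinematic solution, as the abstract describes.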
Article
This paper presents a vision guidance approach using an image-based visual servo (IBVS) for an aerial manipulator combining a multirotor with a multi-degree-of-freedom robotic arm. To take into account the dynamic characteristics of the combined manipulation platform, the kinematic and dynamic models of the combined system are derived. Based on the combined model, a passivity-based adaptive controller which can be applied to both position and velocity control is designed. The position control is utilized for waypoint tracking such as taking off and landing, and the velocity control is engaged when the platform is guided by visual information. In addition, a guidance law utilizing IBVS is employed with modifications. To secure the view of an object with an eye-in-hand camera, IBVS is utilized with images taken from a fisheye camera. Also, to compensate for underactuation of the multirotor, an image adjustment method is developed. With the proposed control and guidance laws, autonomous flight experiments involving grabbing and transporting an object are carried out. Successful experimental results demonstrate that the proposed approaches can be applied in various types of manipulation missions.
Article
In this article, we present the Yale-Carnegie Mellon University (CMU)-Berkeley (YCB) object and model set, intended to be used to facilitate benchmarking in robotic manipulation research. The objects in the set are designed to cover a wide range of aspects of the manipulation problem. The set includes objects of daily life with different shapes, sizes, textures, weights, and rigidities as well as some widely used manipulation tests. The associated database provides high-resolution red, green, blue, plus depth (RGB-D) scans, physical properties, and geometric models of the objects for easy incorporation into manipulation and planning software platforms. In addition to describing the objects and models in the set along with how they were chosen and derived, we provide a framework and a number of example task protocols, laying out how the set can be used to quantitatively evaluate a range of manipulation approaches, including planning, learning, mechanical design, control, and many others. A comprehensive literature survey on the existing benchmarks and object data sets is also presented, and their scope and limitations are discussed. The YCB set will be freely distributed to research groups worldwide at a series of tutorials at robotics conferences. Subsequent sets will otherwise be available for purchase at a reasonable cost. It is our hope that the ready availability of this set along with the ground laid in terms of protocol templates will enable the community of manipulation researchers to more easily compare approaches as well as continually evolve standardized benchmarking tests and metrics as the field matures.
Article
The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008–2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community’s progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
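To make the evaluation quantity concrete, here is a minimal sketch of plain average precision over a ranked detection list. The paper's normalized average precision adds a correction for the proportion of positive instances per class, which is not reproduced here.

```python
import numpy as np

def average_precision(scores, labels):
    """Plain average precision for one class over a ranked list (sketch).

    scores: detector confidences; labels: 1 for a true positive, 0 otherwise.
    Assumes at least one positive appears in the list.
    """
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    # AP = mean of precision evaluated at each true-positive rank.
    return precision[labels == 1].mean()
```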
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Conference Paper
The mode of manual annotation used in an interactive segmentation algorithm affects both its accuracy and ease-of-use. For example, bounding boxes are fast to supply yet may be too coarse to yield good results on difficult images; freehand outlines are slower to supply and more specific, yet may be overkill for simple images. Whereas existing methods assume a fixed form of input no matter the image, we propose to predict the tradeoff between accuracy and effort. Our approach learns whether a graph cuts segmentation will succeed if initialized with a given annotation mode, based on the image's visual separability and foreground uncertainty. Using these predictions, we optimize the mode of input requested on new images a user wants segmented. Whether given a single image that should be segmented as quickly as possible, or a batch of images that must be segmented within a specified time budget, we show how to select the easiest modality that will be sufficiently strong to yield high quality segmentations. Extensive results with real users and three datasets demonstrate the impact.
Conference Paper
In this work we present a reinforcement learning system for autonomous reaching and grasping using visual servoing with a robotic arm. Control is realized in a visual feedback control loop, making it both reactive and robust to noise. The controller is learned from scratch by success or failure without adding information about the task's solution. All of the system's major components are implemented as neural networks. The system is applied to solving a combined reaching and grasping task involving uncertainty directly on a real robotic platform. Its main parts and the conditions for their successful interoperation are described. It will be shown that even with minimal prior knowledge, the system can learn in a short amount of time to reliably perform its task. Furthermore, we describe the control system's ability to react to changes and errors.