Conference Paper
PDF available

Online learning for automatic segmentation of 3D data

Abstract

We propose a method to perform automatic segmentation of 3D scenes based on a standard classifier, whose learning model is continuously improved by means of new samples, and a grouping stage that enforces local consistency among classified labels. The new samples are automatically delivered to the system by a feedback loop based on a feature selection approach that exploits the outcome of the grouping stage. Through experimental results on several datasets we demonstrate that the proposed online learning paradigm is effective in increasing the accuracy of the whole 3D segmentation, thanks to the improvement of the classifier's learning model by means of newly acquired, unsupervised data.
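The pipeline described in the abstract, a point classifier, a grouping stage that enforces local label consistency, and a feedback loop that turns confidently grouped points into new training samples, can be illustrated with a minimal sketch. The code below is an assumption about how such a loop could look, using scikit-learn and invented helper names (group_labels, online_segmentation, confidence_thr); it is not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code): a classifier labels 3D points,
# a grouping stage enforces local label consistency, and samples on which a
# confident prediction and the grouping agree are fed back as new training data.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC


def group_labels(points, labels, k=10):
    """Toy grouping stage: replace each label by the majority label of its k neighbours."""
    _, idx = NearestNeighbors(n_neighbors=k).fit(points).kneighbors(points)
    return np.array([np.bincount(labels[i]).argmax() for i in idx])


def online_segmentation(train_feats, train_labels, scans, confidence_thr=0.9):
    """Classify each incoming scan, smooth labels by grouping, then retrain the
    model on the automatically selected feedback samples."""
    X, y = train_feats.copy(), train_labels.copy()
    clf = SVC(probability=True).fit(X, y)
    for points, feats in scans:                    # feats: one descriptor per 3D point
        proba = clf.predict_proba(feats)
        pred = clf.classes_[proba.argmax(axis=1)]
        grouped = group_labels(points, pred)
        keep = (proba.max(axis=1) > confidence_thr) & (grouped == pred)
        X = np.vstack([X, feats[keep]])            # unsupervised feedback samples
        y = np.concatenate([y, grouped[keep]])
        clf = SVC(probability=True).fit(X, y)      # learning model improves over time
        yield grouped
```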
... OBJS: RGB-D Object Dataset [4]; BBIR: BigBIRD Dataset [5]; SCAN: A Large Dataset of Object Scans [6]; OSEG: Object Segmentation Dataset [7]; GLHY: Global Hypothesis Verification for 3D Object Recognition [8]; SSEG: RGB-D Semantic Segmentation Dataset [9]. A variety of different technologies can be used to capture RGB-D datasets. Before its discontinuation in October 2017, the Microsoft Kinect served as a popular tool to generate an extensive number of datasets [3]. ...
... Other datasets [9] [8] [7] use the Kinect v1 to capture a selection of small objects that were placed on a table. The setup of these datasets follows the same line of thought of the dataset being presented in this paper. ...
Article
Full-text available
Many modern computer vision systems include several modules that perform different processing operations packaged as a single pipeline architecture. This generally introduces a challenge in the evaluation process, since most datasets provide evaluation data for just one of the operations. In this paper, we present an RGB-D dataset that was designed from first principles to cater for applications that involve salient object detection, segmentation, inpainting and blending techniques. This addresses a gap in the evaluation of image inpainting and blending applications, which generally rely on subjective evaluation due to the lack of availability of comparative data. A set of experiments was carried out to demonstrate how the COTS dataset can be used to evaluate these different applications. This dataset includes a variety of scenes, where each scene is captured multiple times, each time adding a new object to the previous scene. This allows for a comparative analysis at pixel level in image inpainting and blending applications. Moreover, all objects were manually labelled in order to offer the possibility of salient object detection even in scenes that contain multiple objects. An online test with 1267 participants was also carried out, and this dataset also includes the click coordinates of users’ selections for every image, introducing a user-interaction dimension in the same RGB-D dataset. This dataset was also validated using state-of-the-art techniques, obtaining an Fβ of 0.957 in salient object detection and a mean Intersection over Union (IoU) of 0.942 in segmentation. Results demonstrate that the COTS dataset introduces novel possibilities for the evaluation of computer vision applications.
... In the first section of our results, we evaluate the proposed algorithm for edge detection against state-of-the-art edge detection algorithms for organized and unorganized point clouds. We demonstrate our results on the RGB-D semantic segmentation dataset [20] for comparison. In the next section we evaluate the repeatability and accuracy of the corner detector on 3D models of washers from the ShapeNet dataset [14]. ...
... The 3D edge detection algorithm was evaluated using the RGB-D semantic segmentation dataset [20], which was acquired using the Microsoft Kinect device. The dataset provides 3D meshes and YAML files for 16 different scenes that include 5 categories of common grocery products, such as packets of biscuits, juice bottles, coffee cans and boxes of salt. ...
Preprint
Full-text available
In this paper, we propose novel edge and corner detection algorithms for unorganized point clouds. Our edge detection method evaluates symmetry in a local neighborhood and uses an adaptive density-based threshold to differentiate 3D edge points. We extend this algorithm to propose a novel corner detector that clusters curvature vectors and uses their geometrical statistics to classify a point as a corner. We perform a rigorous evaluation of the algorithms on the RGB-D semantic segmentation dataset and on 3D washer models from the ShapeNet dataset, and report higher precision and recall scores. Finally, we also demonstrate how our edge and corner detectors can be used as a novel approach towards automatic weld seam detection for robotic welding. We propose to generate weld seams directly from a point cloud, as opposed to using 3D models for offline planning of welding paths. For this application, we show a comparison between Harris 3D and our proposed approach on a panel workpiece.
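As a rough illustration of the symmetry idea: the neighbourhood of an interior surface point is balanced around it, while the neighbourhood of an edge point is shifted to one side. The sketch below is an assumed simplification with an invented alpha parameter for the adaptive threshold, not the authors' implementation.

```python
# Minimal sketch of edge detection by neighbourhood symmetry (assumed simplification).
import numpy as np
from sklearn.neighbors import NearestNeighbors


def detect_edges(points, k=20, alpha=0.5):
    dists, idx = NearestNeighbors(n_neighbors=k).fit(points).kneighbors(points)
    centroids = points[idx].mean(axis=1)                   # centroid of each neighbourhood
    offset = np.linalg.norm(centroids - points, axis=1)    # asymmetry measure
    # adaptive, density-based threshold: scale by the mean neighbour distance so
    # that dense and sparse regions of the cloud are treated consistently
    return offset > alpha * dists.mean(axis=1)
```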
... The dataset is intended for evaluating various flavors of the 6D object pose estimation problem and other related problems such as 2D object detection [248,101] and segmentation [247,79]. Since images from three types of sensors are available, the dataset allows studying the importance of different input modalities for the aforementioned problems. ...
Preprint
Full-text available
In this thesis, we address the problem of estimating the 6D pose of rigid objects from a single RGB or RGB-D input image, assuming that 3D models of the objects are available. This problem is of great importance to many application fields such as robotic manipulation, augmented reality, and autonomous driving. First, we propose EPOS, a method for 6D object pose estimation from an RGB image. The key idea is to represent an object by compact surface fragments and predict the probability distribution of corresponding fragments at each pixel of the input image by a neural network. Each pixel is linked with a data-dependent number of fragments, which allows systematic handling of symmetries, and the 6D poses are estimated from the links by a RANSAC-based fitting method. EPOS outperformed all RGB and most RGB-D and D methods on several standard datasets. Second, we present HashMatch, an RGB-D method that slides a window over the input image and searches for a match against templates, which are pre-generated by rendering 3D object models in different orientations. The method applies a cascade of evaluation stages to each window location, which avoids exhaustive matching against all templates. Third, we propose ObjectSynth, an approach to synthesize photorealistic images of 3D object models for training methods based on neural networks. The images yield substantial improvements compared to commonly used images of objects rendered on top of random photographs. Fourth, we introduce T-LESS, the first dataset for 6D object pose estimation that includes 3D models and RGB-D images of industry-relevant objects. Fifth, we define BOP, a benchmark that captures the status quo in the field. BOP comprises eleven datasets in a unified format, an evaluation methodology, an online evaluation system, and public challenges held at international workshops organized at the ICCV and ECCV conferences.
... The Stanford dataset proposed by Tombari et al. [98] for segmentation: this dataset was built for machine learning techniques and consists of indoor object data from the 3D Scanning Repository, Aim@Shape Watertight, a self-collected indoor dataset, and an outdoor dataset named New York City (NYC) acquired with the Kinect. The categories of the indoor data include packets of biscuits, juice bottles, coffee cans, boxes of salt of different brands and colors, an armadillo, Asian dragon, Thai statue, bunny, happy Buddha, and dragon. ...
Article
Full-text available
Recently, researchers have realized a number of achievements involving deep-learning-based neural networks for the tasks of segmentation and detection based on 2D images, 3D point clouds, etc. Using 2D and 3D information fusion for the advantages of compensation and accuracy improvement has become a hot research topic. However, there are no critical reviews focusing on the fusion strategies of 2D and 3D information integration based on various data for segmentation and detection, which are the basic tasks of computer vision. To boost the development of this research domain, the existing representative fusion strategies are collected, introduced, categorized, and summarized in this paper. In addition, the general structures of different kinds of fusion strategies were firstly abstracted and categorized, which may inspire researchers. Moreover, according to the methods included in this paper, the 2D information and 3D information of different methods come from various kinds of data. Furthermore, suitable datasets are introduced and comparatively summarized to support the relative research. Last but not least, we put forward some open challenges and promising directions for future research.
... A large dataset of object scans: it includes more than 10,000 scanned and reconstructed objects in nine categories, acquired by PrimeSense Carmine cameras. - RGB-D Semantic Segmentation: the dataset was originally proposed in [58] and was acquired with the Kinect RGB-D sensor. It contains six categories, such as juice bottles, coffee cans, boxes of salt, etc. ...
Article
Full-text available
The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.
... There have been multiple datasets studying semantic segmentation and pose estimation on RGB-D images [13], [14], [15]. However, most of them either consist of synthetic scenarios [16], [17], or contain limited scenes [18], [19]. ...
Preprint
Grasping in cluttered scenes is challenging for robot vision systems, as detection accuracy can be hindered by partial occlusion of objects. We adopt a reinforcement learning (RL) framework and 3D vision architectures to search for feasible viewpoints for grasping by the use of hand-mounted RGB-D cameras. To overcome the disadvantages of photo-realistic environment simulation, we propose a large-scale dataset called Real Embodied Dataset (RED), which includes full-viewpoint real samples on the upper hemisphere with amodal annotation and enables a simulator that has real visual feedback. Based on this dataset, a practical 3-stage transferable active grasping pipeline is developed, that is adaptive to unseen clutter scenes. In our pipeline, we propose a novel mask-guided reward to overcome the sparse reward issue in grasping and ensure category-irrelevant behavior. The grasping pipeline and its possible variants are evaluated with extensive experiments both in simulation and on a real-world UR-5 robotic arm.
Article
RGB-D data is essential for solving many problems in computer vision. Hundreds of public RGB-D datasets containing various scenes, such as indoor, outdoor, aerial, driving, and medical, have been proposed. These datasets are useful for different applications and are fundamental for addressing classic computer vision tasks, such as monocular depth estimation. This paper reviewed and categorized image datasets that include depth information. We gathered 231 datasets that contain accessible data and grouped them into three categories: scene/objects, body, and medical. We also provided an overview of the different types of sensors, depth applications, and we examined trends and future directions of the usage and creation of datasets containing depth data, and how they can be applied to investigate the development of generalizable machine learning models in the monocular depth estimation field.
Preprint
Full-text available
RGB-D data is essential for solving many problems in computer vision. Hundreds of public RGB-D datasets containing various scenes, such as indoor, outdoor, aerial, driving, and medical, have been proposed. These datasets are useful for different applications and are fundamental for addressing classic computer vision tasks, such as monocular depth estimation. This paper reviewed and categorized image datasets that include depth information. We gathered 203 datasets that contain accessible data and grouped them into three categories: scene/objects, body, and medical. We also provided an overview of the different types of sensors, depth applications, and we examined trends and future directions of the usage and creation of datasets containing depth data, and how they can be applied to investigate the development of generalizable machine learning models in the monocular depth estimation field.
Article
Full-text available
Laser range scanners now have the ability to acquire millions of 3D points of highly detailed and geometrically complex urban sites, opening new avenues of exploration in modeling urban environments. In the traditional modeling pipeline, range scans are processed off-line after acquisition. The slow sequential acquisition, though, is a bottleneck. The goal of our work is to alleviate this bottleneck by exploiting the sequential nature of the data acquisition process. We have developed novel online algorithms, never before used in laser range scanning, that perform data classification on-the-fly as data is being acquired. These algorithms are extremely efficient and can potentially be integrated with the scanner's hardware, rendering a sensor that not only acquires but also intelligently processes and classifies the scene points. This sensor, armed with the proposed algorithms, can classify 3D points in real time as being in vegetation vs. non-vegetation regions, or in horizontal vs. vertical regions. The former classification is made possible by the implementation of sequential algorithms through a hidden Markov model (HMM) formulation, and the latter through the use of a combination of cleverly designed sequential detection algorithms. We envision an arsenal of algorithms of this type to be developed in the future.
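For the vegetation vs. non-vegetation case, the HMM formulation amounts to filtering a stream of per-point observations through a two-state chain as points arrive. The sketch below is a generic illustration of that idea with invented Gaussian emission parameters; it is not the paper's model.

```python
# Generic two-state HMM filtering sketch (emission parameters invented for illustration).
import numpy as np


def filter_states(obs_stream, trans, means, stds, prior):
    """Yield the most likely state (0 = non-vegetation, 1 = vegetation, by this
    example's convention) for each new scalar observation, on-the-fly."""
    belief = np.asarray(prior, dtype=float)
    for o in obs_stream:
        likelihood = np.exp(-0.5 * ((o - means) / stds) ** 2) / stds  # Gaussian emissions
        belief = likelihood * (trans.T @ belief)                      # predict, then update
        belief /= belief.sum()
        yield int(belief.argmax())
```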
Conference Paper
Full-text available
In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF’s strong performance.
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
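The nearest-neighbour matching step with the distance-ratio test can be written compactly; the Hough clustering and least-squares pose verification stages are omitted in this sketch, which uses scikit-learn purely for illustration.

```python
# Sketch of descriptor matching with the distance-ratio test (Hough clustering
# and least-squares pose verification are omitted).
import numpy as np
from sklearn.neighbors import NearestNeighbors


def match_descriptors(query, database, ratio=0.8):
    """Return (query_index, database_index) pairs that pass the ratio test."""
    dists, idx = NearestNeighbors(n_neighbors=2).fit(database).kneighbors(query)
    good = dists[:, 0] < ratio * dists[:, 1]   # best match clearly better than second best
    return np.stack([np.nonzero(good)[0], idx[good, 0]], axis=1)
```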
Article
The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
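A minimal usage example of a soft-margin support-vector classifier with a polynomial kernel, on synthetic data via scikit-learn (chosen here only for brevity), echoing the non-linear mapping described above:

```python
# Minimal support-vector classifier example with a polynomial input transformation
# (scikit-learn, synthetic data; for illustration only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="poly", degree=3, C=1.0).fit(X_tr, y_tr)  # soft margin handles non-separable data
print("test accuracy:", clf.score(X_te, y_te))
```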
Article
This article presents a novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features). SURF approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (specifically, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper encompasses a detailed description of the detector and descriptor and then explores the effects of the most important parameters. We conclude the article with SURF's application to two challenging, yet converse goals: camera calibration as a special case of image registration, and object recognition. Our experiments underline SURF's usefulness in a broad range of topics in computer vision.
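The integral-image trick that SURF relies on can be shown in a few lines: once the integral image is built, any box sum costs four array look-ups regardless of the box size, which is what makes the box-filter approximation of the Hessian responses cheap at every scale. This is a generic sketch, not SURF itself.

```python
# Integral image and constant-time box sums, the building block behind SURF's
# box-filter approximation of Hessian responses.
import numpy as np


def integral_image(img):
    return img.cumsum(axis=0).cumsum(axis=1)


def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from the integral image ii (exclusive upper bounds)."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total
```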
Article
In this paper we address the problem of automated three-dimensional point cloud interpretation. This problem is important for various tasks from environment modeling to obstacle avoidance for autonomous robot navigation. In addition to locally extracted features, classifiers need to utilize contextual information in order to perform well. A popular approach to account for context is to utilize the Markov Random Field framework. One recent variant that has successfully been used for the problem considered is the Associative Markov Network (AMN). We extend the AMN model to learn directionality in the clique potentials, resulting in a new anisotropic model that can be efficiently learned using the subgradient method. We validate the proposed approach using data collected from different range sensors and show better performance against standard AMN and Support Vector Machine algorithms.