Article

Supervoxel Convolution for Online 3D Semantic Segmentation


Abstract

Online 3D semantic segmentation, which aims to perform real-time 3D scene reconstruction along with semantic segmentation, is an important but challenging topic. A key challenge is to strike a balance between efficiency and segmentation accuracy. There are very few deep-learning-based solutions to this problem, since the commonly used deep representations based on volumetric grids or points do not provide an efficient 3D representation and organization structure for online segmentation. Observing that on-surface supervoxels, i.e., clusters of on-surface voxels, provide a compact representation of 3D surfaces and bring an efficient connectivity structure via supervoxel clustering, we explore a supervoxel-based deep learning solution for this task. To this end, we contribute a novel convolution operation (SVConv) that operates directly on supervoxels. SVConv can efficiently fuse the multi-view 2D features and 3D features projected onto supervoxels during online 3D reconstruction, and leads to an effective supervoxel-based convolutional neural network, termed Supervoxel-CNN, enabling 2D-3D joint learning for 3D semantic prediction. With the Supervoxel-CNN, we propose a clustering-then-prediction online 3D semantic segmentation approach. Extensive evaluations on public 3D indoor scene datasets show that our approach significantly outperforms existing online semantic segmentation systems in terms of either efficiency or accuracy.
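For readers unfamiliar with the supervoxel idea, the following minimal numpy sketch illustrates the two ingredients the abstract relies on: pooling per-voxel (or per-point) features onto supervoxels, and aggregating features over the supervoxel adjacency graph. It is not the authors' SVConv; all names (`voxel_feat`, `sv_id`, `adjacency`, the weight matrices) are hypothetical placeholders.

```python
import numpy as np

def pool_to_supervoxels(voxel_feat: np.ndarray, sv_id: np.ndarray, n_sv: int) -> np.ndarray:
    """Average the features of all voxels assigned to each supervoxel."""
    sv_feat = np.zeros((n_sv, voxel_feat.shape[1]))
    counts = np.zeros(n_sv)
    np.add.at(sv_feat, sv_id, voxel_feat)      # unbuffered scatter-add of voxel features
    np.add.at(counts, sv_id, 1.0)
    return sv_feat / np.maximum(counts, 1.0)[:, None]

def aggregate_neighbors(sv_feat: np.ndarray, adjacency: list[list[int]],
                        w_self: np.ndarray, w_nbr: np.ndarray) -> np.ndarray:
    """One convolution-like step on the supervoxel graph: combine each supervoxel's
    own feature with the mean feature of its adjacent supervoxels."""
    out = sv_feat @ w_self
    for i, nbrs in enumerate(adjacency):
        if nbrs:
            out[i] += sv_feat[nbrs].mean(axis=0) @ w_nbr
    return np.maximum(out, 0.0)                # ReLU
```

Because a scene contains far fewer supervoxels than voxels or points, aggregation at this level is what keeps the per-frame cost low enough for online use.
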


... Our 3D semantic segmentation results are shown in Table 1 and Table 2, where we compare our network with state-of-the-art pipelines. Similar to our method, SF [27], SR [34], SPV [37] and PF [28] are semantic reconstruction systems based on RGB-D images, while BPNet [3] deals with point clouds. To demonstrate the efficacy of 2D-assisted 3D semantic segmentation and the efficiency of 2D consistent semantic segmentation, in the subsequent experiments the label 'our' signifies outcomes from our 2D-3D fusion technique without 2D consistency correction. ...
... It is important to note that for both BPNet [3] and our proposed method, the voxel size utilized was set to 5 cm. In contrast, SPV [37] employed a voxel size of 1 cm. The choice of voxel size plays a crucial role in balancing the trade-off between accuracy and computational complexity. ...
... 'Y' means that a method has a dense reconstruction function. We use '-' to mark unsure situations, and '*' means the result comes from [37]. ...
Article
2D and 3D semantic segmentation play important roles in robotic scene understanding. However, current 3D semantic segmentation heavily relies on 3D point clouds, which are susceptible to factors such as point cloud noise, sparsity, estimation and reconstruction errors, and data imbalance. In this paper, a novel approach is proposed to enhance 3D semantic segmentation by incorporating 2D semantic segmentation from RGB-D sequences. Firstly, the RGB-D pairs are consistently segmented into 2D semantic maps using the tracking pipeline of Simultaneous Localization and Mapping (SLAM). This process effectively propagates object labels from full scans to corresponding labels in partial views with high probability. Subsequently, a novel Semantic Projection (SP) block is introduced, which integrates features extracted from localized 2D fragments across different camera viewpoints into their corresponding 3D semantic features. Lastly, the 3D semantic segmentation network utilizes a combination of 2D-3D fusion features to facilitate a merged semantic segmentation process for both 2D and 3D. Extensive experiments conducted on public datasets demonstrate the effective performance of the proposed 2D-assisted 3D semantic segmentation method.
... Real-time semantic mapping methods usually rely on 2D convolutional neural networks with optional 3D post-processing (2D-3D networks) to annotate incoming images with semantics, using back-projection to lift the semantic labels to the 3D map [6], [3], [7], [5], [8], [1], while the recent FP-Conv [7] or SVCNN [6] also rely on lightweight 3D post-processing. 2D-3D networks repetitively process images with similar visual content, solving 2D semantic segmentation from scratch for each image, which may be redundant [9]; lack multi-view consistency in 2D labels [10]; and suffer from occlusions or object scale uncertainty [11]. ...
... Compared to prior methods [9], [10] that re-project late features or segmentation labels, we use early features and rely on differentiable rendering. This leads to improved quality of image-level semantic labels compared to the state-of-the-art 2D-3D networks-based method SVCNN [6]. Secondly, we propose quasi-planar over-segmentation (QPOS) for lightweight 3D map post-processing. ...
Preprint
Full-text available
The availability of real-time semantics greatly improves the core geometric functionality of SLAM systems, enabling numerous robotic and AR/VR applications. We present a new methodology for real-time semantic mapping from RGB-D sequences that combines a 2D neural network and a 3D network based on a SLAM system with 3D occupancy mapping. When segmenting a new frame we perform latent feature re-projection from previous frames based on differentiable rendering. Fusing re-projected feature maps from previous frames with current-frame features greatly improves image segmentation quality, compared to a baseline that processes images independently. For 3D map processing, we propose a novel geometric quasi-planar over-segmentation method that groups 3D map elements likely to belong to the same semantic classes, relying on surface normals. We also describe a novel neural network design for lightweight semantic map post-processing. Our system achieves state-of-the-art semantic mapping quality within 2D-3D networks-based systems and matches the performance of 3D convolutional networks on three real indoor datasets, while working in real-time. Moreover, it shows better cross-sensor generalization abilities compared to 3D CNNs, enabling training and inference with different depth sensors. Code and data will be released on project page: http://jingwenwang95.github.io/SeMLaPS
... Subsequently, PointNet++ [5] was proposed and addressed the limitations of PointNet. Meanwhile, more and more methods were proposed, which can be roughly categorized into point-based methods [4][5][6][7][8][9], voxel-based methods [10][11][12], and Transformer-based methods [13,14]. Point-based methods usually use the K-nearest-neighbor algorithm or a spherical neighborhood search to aggregate the point cloud into a region, and then apply the convolution operation on this local region. ...
... In recent years, with the development of artificial intelligence and the widespread availability of hardware acquisition devices, research on 3D point cloud segmentation tasks has begun to emerge; such segmentation provides intelligent agents with a geometric perception that 2D images cannot offer. Current point cloud segmentation methods can be categorized into point-based methods [4][5][6][7][8][9][21][22][23], voxel-based methods [10][11][12][24], and transformer-based methods [13,14]. Point-based methods perform point cloud segmentation by learning the feature representation and semantic information of point cloud data. ...
Article
Full-text available
Large-scale point cloud segmentation is one of the important research directions in the field of computer vision, aiming at segmenting 3D point cloud data into parts with semantic meaning; it is widely used in the fields of robot perception, automated driving, and virtual reality. In practical applications, intelligent agents often face various uncertainties such as sensor noise, missing data, and uncertain model parameter estimation. However, many current research works do not consider the effects of these uncertainties, which can cause the model to overfit the noisy data and thus degrade model performance. In this paper, we propose a point cloud segmentation method with domain uncertainty that can greatly improve the robustness of the model to noise. Specifically, we first compute the neighborhood uncertainty, which reflects the semantics of a local region better than the prediction of a single point and therefore reduces the impact of noise. Next, we fuse the uncertainty into the objective function, which allows the model to focus more on relatively deterministic data. Finally, we validate on the large-scale datasets S3DIS and Toronto3D, and the segmentation performance is substantially improved in both cases.
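The abstract does not spell out its exact formulation, but one plausible instantiation of "fusing uncertainty into the objective function" is sketched below: neighborhood uncertainty is taken as the entropy of the class probabilities averaged over each point's k nearest neighbors, and it down-weights the per-point loss. This is an assumption for illustration, not the paper's method; `knn_idx` and the exponential weighting are hypothetical choices.

```python
import torch
import torch.nn.functional as F

def neighborhood_uncertainty(probs: torch.Tensor, knn_idx: torch.Tensor) -> torch.Tensor:
    """Entropy of the class distribution averaged over each point's k nearest neighbors.
    probs: (N, C) softmax probabilities; knn_idx: (N, k) neighbor indices."""
    nbr_probs = probs[knn_idx].mean(dim=1)                     # (N, C)
    return -(nbr_probs * torch.log(nbr_probs + 1e-8)).sum(-1)  # (N,)

def uncertainty_weighted_loss(logits: torch.Tensor, labels: torch.Tensor,
                              knn_idx: torch.Tensor) -> torch.Tensor:
    """Down-weight points whose local neighborhood is highly uncertain (likely noisy)."""
    probs = logits.softmax(dim=-1)
    u = neighborhood_uncertainty(probs.detach(), knn_idx)
    weights = torch.exp(-u)                                    # certain regions keep weight ~1
    per_point = F.cross_entropy(logits, labels, reduction="none")
    return (weights * per_point).sum() / weights.sum()
```
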
... Tatarchenko et al. [24] project features into predefined regular domains and apply 2D CNNs to the domain. Some works focus on online segmentation, which aims to perform real-time 3D scene reconstruction along with semantic segmentation [37,36]. We base the backbone of our segmentation network on the 3D U-Net architecture [3], which implements the efficient generalized sparse convolution. ...
... Although we believe that the annotations can be collected through crowdsourcing on the internet, and thus no extra user effort is needed since users do not have to wait in front of the computer, this would cause information delay and would become a problem when the collection is conducted in person. Moreover, we adopt the same supervoxel clustering method as in [17] for a fair comparison; however, it is worth exploring other, more advanced methods such as [37] and [38] to further boost the performance. ...
Preprint
Since the preparation of labeled data for training semantic segmentation networks of point clouds is a time-consuming process, weakly supervised approaches have been introduced to learn from only a small fraction of data. These methods are typically based on learning with contrastive losses while automatically deriving per-point pseudo-labels from a sparse set of user-annotated labels. In this paper, our key observation is that the selection of what samples to annotate is as important as how these samples are used for training. Thus, we introduce a method for weakly supervised segmentation of 3D scenes that combines self-training with active learning. The active learning selects points for annotation that likely result in performance improvements to the trained model, while the self-training makes efficient use of the user-provided labels for learning the model. We demonstrate that our approach leads to an effective method that provides improvements in scene segmentation over previous works and baselines, while requiring only a small number of user annotations.
... When there is strong homogeneity between sets, it indicates that the similarities, differences, and opposites between the two sets are dominated by the "same trend." In the above example, pattern A can be identified as the dominant one [11]. Quasi-homogeneity indicates that pattern A can be identified as clearly determined. ...
Article
Full-text available
To address the poor noise-suppression ability and high noise sensitivity of existing image edge detection algorithms, an improved edge detection algorithm based on SPA (set pair analysis) is presented in this paper, combining the trend relationship of the degree connection situation with a half-neighborhood adaptive edge pick-up algorithm. The core of the algorithm is to first detect the most likely edge direction using the biggest value of the degree connection situation c/a, then compute the absolute difference between the average of three points and the average of five points, along with the standard deviation of all point values in the 8-neighborhood, and finally decide whether the center point lies on an edge according to a threshold on that standard deviation. Simulation results indicate that, without any pre-processing of the image, the improved algorithm is both more accurate and clearer in object edge extraction than the algorithm in the literature, and noise is greatly suppressed.
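As a rough illustration of the half-neighborhood test described above (the SPA-based direction selection via the degree connection situation c/a is not reproduced; here every direction is simply tried), a hedged sketch might look like this:

```python
import numpy as np

# 8-neighborhood offsets ordered clockwise around the center pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def is_edge_candidate(img: np.ndarray, y: int, x: int, k: float = 1.0) -> bool:
    """Half-neighborhood test (simplified): a pixel is an edge candidate when, for some
    direction, the 3 contiguous neighbors on that side differ from the remaining 5 by
    more than k times the standard deviation of the 8-neighborhood. Assumes (y, x) is
    an interior pixel."""
    vals = np.array([img[y + dy, x + dx] for dy, dx in OFFSETS], dtype=float)
    std = vals.std()
    best = 0.0
    for d in range(8):
        half3 = [vals[(d - 1) % 8], vals[d], vals[(d + 1) % 8]]
        rest5 = np.delete(vals, [(d - 1) % 8, d, (d + 1) % 8])
        best = max(best, abs(np.mean(half3) - np.mean(rest5)))
    return best > k * std
```
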
... Recently, several approaches have used geometry-based segmentation as a pre-processing step and generated mid-level scene representations as the inputs of deep learning frameworks. For instance, Huang et al. [24] clustered on-surface voxels to provide a compact representation of 3D scenes. Landrieu and Simonovsky [25] partitioned the scan data into superpoints, which are geometrically homogeneous elements. ...
Article
Full-text available
Point cloud segmentation is an essential task in three-dimensional (3D) vision and intelligence. It is a critical step in understanding 3D scenes with a variety of applications. With the rapid development of 3D scanning devices, point cloud data have become increasingly available to researchers. Recent advances in deep learning are driving advances in point cloud segmentation research and applications. This paper presents a comprehensive review of recent progress in point cloud segmentation for understanding 3D indoor scenes. First, we present public point cloud datasets, which are the foundation for research in this area. Second, we briefly review previous segmentation methods based on geometry. Then, learning-based segmentation methods with multi-views and voxels are presented. Next, we provide an overview of learning-based point cloud segmentation, ranging from semantic segmentation to instance segmentation. Based on the annotation level, these methods are categorized into fully supervised and weakly supervised methods. Finally, we discuss open challenges and research directions in the future.
... However, the predictions on 2D images are neither geometry-aware nor temporally aware, which makes the fusion step difficult and inaccurate. Fusion-aware 3D-Conv [37] and SVCNN [11] construct data structures to maintain the information of previous frames and conduct point-based 3D aggregation to fuse the 3D features for semantic segmentation. INS-CONV [16] extends sparse convolution [9; 5] to an incremental CNN to efficiently extract global 3D features for semantic and instance segmentation. ...
Preprint
Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration, so an online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed. Since high-quality 3D data is limited, directly training such a model in 3D is almost infeasible. Meanwhile, vision foundation models (VFM) have revolutionized the field of 2D computer vision with superior performance, which makes the use of VFM to assist embodied 3D perception a promising direction. However, most existing VFM-assisted 3D perception methods are either offline or too slow to be applied in practical embodied tasks. In this paper, we aim to leverage the Segment Anything Model (SAM) for real-time 3D instance segmentation in an online setting. This is a challenging problem, since future frames are not available in the input streaming RGB-D video, and an instance may be observed in several frames, so object matching between frames is required. To address these challenges, we first propose a geometric-aware query lifting module to represent the 2D masks generated by SAM with 3D-aware queries, which are then iteratively refined by a dual-level query decoder. In this way, the 2D masks are transferred to fine-grained shapes on 3D point clouds. Benefiting from the query representation for 3D masks, we can compute the similarity matrix between the 3D masks from different views by an efficient matrix operation, which enables real-time inference. Experiments on ScanNet, ScanNet200, SceneNN and 3RScan show our method achieves leading performance even compared with offline methods. Our method also demonstrates great generalization ability in several zero-shot dataset transferring experiments and shows great potential in open-vocabulary and data-efficient settings. Code and demo are available at https://xuxw98.github.io/ESAM/, with only one RTX 3090 GPU required for training and evaluation.
... Owing to the dilated K-nearest neighbor (DKNN) operation, Guo et al. [29] presented a dilated multiscale fusion network for the analysis of point cloud data, especially for the tasks of point cloud classification and segmentation. Since on-surface supervoxels provide a compact representation of 3D surfaces and also bring an efficient connectivity structure via supervoxel clustering, Huang et al. [30] explored the convolution operation directly on supervoxels and thus fused the multi-view 2D features and 3D features projected on these supervoxels for 2D-3D joint learning during 3D semantic prediction. To alleviate the computational costs of network training, RandLA-Net [31] adopted a random sampling scheme to realize point cloud downsampling for point cloud data. ...
Article
Full-text available
Efficient semantic segmentation of large-scale point cloud scenes is a fundamental and essential task for perceiving or understanding the surrounding 3D environments. However, due to the vast amount of point cloud data, it is always challenging to train deep neural networks efficiently, and it is also difficult to establish a unified model to represent different shapes effectively owing to the variety and occlusions of scene objects. Taking scene super-patches as the data representation and guided by their contextual information, we propose a novel multiscale super-patch transformer network (MSSPTNet) for point cloud segmentation, which consists of a multiscale super-patch local aggregation (MSSPLA) module and a super-patch transformer (SPT) module. Given large-scale point cloud data as input, a dynamic region-growing algorithm is first adopted to extract scene super-patches with consistent geometric features from the sampled points. Then, the MSSPLA module aggregates local features and their contextual information over adjacent super-patches at different scales. Owing to the self-attention mechanism, the SPT module exploits the similarity among scene super-patches in high-level feature space. By combining these two modules, our MSSPTNet can effectively learn both local and global features from the input point clouds. Finally, interpolating upsampling and multi-layer perceptrons are exploited to generate semantic labels for the original point cloud data. Experimental results on the public S3DIS dataset demonstrate the efficiency of the proposed network for segmenting large-scale point cloud scenes, especially indoor scenes with a large number of repetitive structures, i.e., the network training of our MSSPTNet is faster than that of other segmentation networks by a factor of tens to hundreds.
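The SPT module is described as self-attention over super-patch features; a minimal PyTorch sketch of that general idea (not the authors' implementation, and omitting the multiscale local aggregation module) could be:

```python
import torch
import torch.nn as nn

class SuperPatchAttention(nn.Module):
    """Illustrative self-attention over super-patch features: each super-patch attends
    to all others in high-level feature space. Hypothetical dimensions and layout."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_feat: torch.Tensor) -> torch.Tensor:
        # patch_feat: (B, P, dim), with P super-patches per scene
        out, _ = self.attn(patch_feat, patch_feat, patch_feat)
        return self.norm(patch_feat + out)   # residual connection
```
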
... Supervoxel-CNN (SVNet) [52] is a recently proposed supervoxel-based network for the semantic segmentation of large-scale point clouds that fuses multi-view 3D and 2D features projected on supervoxels. Based on the efficient connectivity structure of supervoxels, SVNet strikes a balance between segmentation accuracy and efficiency. ...
Article
Full-text available
Semantic segmentation in the context of 3D point clouds for the railway environment holds significant economic value, but its development is severely hindered by the lack of suitable and specific datasets. Additionally, models trained on existing urban road point cloud datasets demonstrate poor generalisation on railway data due to a large domain gap caused by non-overlapping special/rare categories, for example, rail track, track bed, etc. To harness the potential of supervised learning methods in the domain of 3D railway semantic segmentation, we introduce RailPC, a new point cloud benchmark. RailPC provides a large-scale dataset with rich annotations for semantic segmentation in the railway environment. Notably, RailPC contains twice the number of annotated points compared to the largest available mobile laser scanning (MLS) point cloud dataset and is the first railway-specific 3D dataset for semantic segmentation. It covers a total of nearly 25 km of railway in two different scenes (urban and mountain), with 3 billion points finely labelled into the 16 most typical railway-related classes, and the data acquisition was completed in China with MLS systems. Through extensive experimentation, we evaluate the performance of advanced scene understanding methods on the annotated dataset and present a synthetic analysis of semantic segmentation results. Based on our findings, we establish some critical challenges towards railway-scale point cloud semantic segmentation. The dataset is available at https://github.com/NNU-GISA/GISA-RailPC, and we will continuously update it based on community feedback.
... Tatarchenko et al. [31] project features into predefined regular domains and apply 2D CNNs to the domain. Some works focus on online segmentation, which aims to perform real-time 3D scene reconstruction along with semantic segmentation [32,33]. We base the backbone of our segmentation network on the 3D U-Net architecture [13], which implements the efficient generalized sparse convolution. ...
Article
Full-text available
Since the preparation of labeled data for training semantic segmentation networks of point clouds is a time-consuming process, weakly supervised approaches have been introduced to learn from only a small fraction of data. These methods are typically based on learning with contrastive losses while automatically deriving per-point pseudo-labels from a sparse set of user-annotated labels. In this paper, our key observation is that the selection of which samples to annotate is as important as how these samples are used for training. Thus, we introduce a method for weakly supervised segmentation of 3D scenes that combines self-training with active learning. Active learning selects points for annotation that are likely to result in improvements to the trained model, while self-training makes efficient use of the user-provided labels for learning the model. We demonstrate that our approach leads to an effective method that provides improvements in scene segmentation over previous work and baselines, while requiring only a few user annotations.
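The abstracts above do not state the exact selection criterion, so the sketch below uses a generic entropy-based active-learning baseline to illustrate the selection step only; the paper's criterion ("points likely to result in performance improvements") may differ, and all names are hypothetical.

```python
import torch

def select_points_for_annotation(logits: torch.Tensor, labeled_mask: torch.Tensor,
                                 budget: int) -> torch.Tensor:
    """Pick the `budget` most uncertain unlabeled points (entropy criterion) for the
    next annotation round. A generic active-learning baseline, not the selection
    strategy proposed in the paper."""
    probs = logits.softmax(dim=-1)                              # (N, C)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1)    # (N,)
    entropy[labeled_mask] = -float("inf")                       # never re-select labeled points
    return torch.topk(entropy, budget).indices
```
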
... Existing replication techniques such as 360-degree videos [6], RGB-D cameras [7], or light fields [8], however, provide different quality-latency tradeoffs. Another important research topic concerns the semantic segmentation of these virtual replications [9], which is required to make individual components of the replication referenceable. While most XR devices provide out-of-the-box interaction techniques, plenty of research is being conducted to enhance their usability. ...
Chapter
Full-text available
Mixed and Virtual Reality technologies have been assigned considerable potential to support training and workflows in various domains. However, available solutions are subject to scalability limitations which evoke temporal and cognitive efforts that outweigh the technology’s intrinsic potential and prevent their application in profit-making, real-world settings. Addressing these issues, we developed a framework for Scalable Extended Reality (XRS) spaces following a human-centered design process. To this end, we derived abstract high-level use cases which exploit key benefits of Mixed and Virtual Reality technologies and can be combined with each other to describe specific low-level use cases in many domains. Based on the defined high-level use cases, i.e., design and development of physical items, training, teleoperation, co-located and distributed collaboration, we specified functional and non-functional requirements and developed a framework design solution that implements multidimensional scalability enhancements: Multiple on-site and off-site users can access the XRS space through customized Mixed or Virtual Reality interfaces and then reference or manipulate real or virtual scene components. Thereby, full scalability regarding options of interaction is provided through the integration of a robotic system that allows off-site users to manipulate real scene components on site. Eventually, the framework’s applicability to different use cases is demonstrated in theoretical walkthroughs.
... A recent trend in computer vision is to process 2D images and 3D volumes at a higher-level representation instead of at the pixel-level representation [2]. As an example, image over-segmentation can be used as a pre-processing step in image compression [3], [4], object tracking [5], object segmentation [6], [7], 3D semantic segmentation [8] and saliency detection [9]. Considering that image over-segmentation can be applied to 2D images and 3D volumes to facilitate subsequent applications, applying a similar approach to 4D LFs would also make sense. ...
Article
Full-text available
4D Light Field (LF) imaging, since it conveys both spatial and angular scene information, can facilitate computer vision tasks and generate immersive experiences for end-users. A key challenge in 4D LF imaging is to flexibly and adaptively represent the included spatio-angular information to facilitate subsequent computer vision applications. Recently, image over-segmentation into homogenous regions with perceptually meaningful information has been exploited to represent 4D LFs. However, existing methods assume densely sampled LFs and do not adequately deal with sparse LFs with large occlusions. Furthermore, the spatio-angular LF cues are not fully exploited in the existing methods. In this paper, the concept of hyperpixels is defined and a flexible, automatic, and adaptive representation for both dense and sparse 4D LFs is proposed. Initially, disparity maps are estimated for all views to enhance over-segmentation accuracy and consistency. Afterwards, a modified weighted K-means clustering using robust spatio-angular features is performed in 4D Euclidean space. Experimental results on several dense and sparse 4D LF datasets show competitive or superior performance in terms of over-segmentation accuracy, shape regularity and view consistency against state-of-the-art methods.
... Song et al. [26] conduct scene semantic segmentation with the help of the scene completion task. For efficiency, many works [27][28][29][30][31] use sparse voxels or supervoxels to reduce the computational and memory costs while achieving better segmentation results. Unlike these supervised methods that require a large amount of labeled data, we leverage a few labeled data items and a large amount of unlabeled data for effective segmentation. ...
Article
Full-text available
The lack of fine-grained 3D shape segmentation data is the main obstacle to developing learning-based 3D segmentation techniques. We propose an effective semi-supervised method for learning 3D segmentations from a few labeled 3D shapes and a large amount of unlabeled 3D data. For the unlabeled data, we present a novel multilevel consistency loss to enforce consistency of network predictions between perturbed copies of a 3D shape at multiple levels: point level, part level, and hierarchical level. For the labeled data, we develop a simple yet effective part substitution scheme to augment the labeled 3D shapes with more structural variations to enhance training. Our method has been extensively validated on the task of 3D object semantic segmentation on PartNet and ShapeNetPart, and indoor scene semantic segmentation on ScanNet. It exhibits superior performance to existing semi-supervised and unsupervised pre-training 3D approaches.
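As an illustration of the point-level term of such a multilevel consistency loss (the part- and hierarchy-level terms are omitted, and this is not the paper's exact loss), one might write:

```python
import torch
import torch.nn.functional as F

def point_consistency_loss(pred_a: torch.Tensor, pred_b: torch.Tensor) -> torch.Tensor:
    """Point-level consistency between two perturbed copies of the same shape:
    symmetric KL divergence between their per-point class distributions.
    pred_a, pred_b: (N, C) logits predicted for corresponding points."""
    log_a, log_b = pred_a.log_softmax(dim=-1), pred_b.log_softmax(dim=-1)
    kl_ab = F.kl_div(log_a, log_b.exp(), reduction="batchmean")
    kl_ba = F.kl_div(log_b, log_a.exp(), reduction="batchmean")
    return 0.5 * (kl_ab + kl_ba)
```
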
... Note that alternatives for fusing semantic predictions do exist, e.g. 3D convolution [18,23]. However, directly applying 3D convolution to such a floor-level 3D representation would inevitably lead to a huge rise in computational cost, especially in the context of learning-based policy. ...
Preprint
Object goal navigation (ObjectNav) in unseen environments is a fundamental task for Embodied AI. Agents in existing works learn ObjectNav policies based on 2D maps, scene graphs, or image sequences. Considering this task happens in 3D space, a 3D-aware agent can advance its ObjectNav capability via learning from fine-grained spatial information. However, leveraging 3D scene representation can be prohibitively impractical for policy learning in this floor-level task, due to low sample efficiency and expensive computational cost. In this work, we propose a framework for the challenging 3D-aware ObjectNav based on two straightforward sub-policies. The two sub-policies, namely the corner-guided exploration policy and the category-aware identification policy, perform simultaneously by utilizing online fused 3D points as observations. Through extensive experiments, we show that this framework can dramatically improve the performance in ObjectNav through learning from 3D scene representation. Our framework achieves the best performance among all modular-based methods on the Matterport3D and Gibson datasets, while requiring (up to 30x) less computational cost for training.
... Subsequent computationally intensive image processing algorithms work on supervoxels instead of individual points or pixels to save computational time. Supervoxels find applications in various fields, such as point cloud segmentation and classification [1], 3D semantic segmentation of point clouds [2], [3], medical imaging [4], [5], object detection [6] and saliency detection [7], to name a few. Despite so many applications, there is little literature that deals with clustering methods tailored for point clouds. ...
Article
Full-text available
Supervoxels find applications as a pre-processing step in many image processing problems due to their ability to provide a regional representation of points by grouping them into a set of clusters. Besides reducing the overall computational time for subsequent algorithms, the desirable properties of supervoxels are adherence to object boundaries and compactness. Existing supervoxel segmentation methods define the size of a supervoxel based on a user-input resolution value. A fixed resolution results in poor performance on point clouds with non-uniform density, whereas other methods, in their quest for better boundary adherence, produce supervoxels with irregular shapes and elongated boundaries. In this article, we propose a new supervoxel segmentation method, based on the k-means algorithm, with dynamic cluster seed initialization to ensure a uniform distribution of cluster seeds in point clouds with variable densities. We also propose a new cluster seed initialization strategy, based on histogram binning of surface normals, for better boundary adherence. Our algorithm is parameter-free and gives equal importance to the color, spatial location and orientation of the points, resulting in compact supervoxels with tight boundaries. We test the efficacy of our algorithm on a publicly available point cloud dataset consisting of 1449 pairs of indoor RGB-D images, i.e., color (RGB) images coupled with depth information (D) mapped per pixel. Results are compared against three state-of-the-art algorithms based on four quality metrics. Results show that our method provides significant improvement over other methods in the undersegmentation error and compactness metrics and performs equally well in the boundary recall and contour density metrics.
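A simplified sketch of the seed-initialization idea, binning surface normals into orientation cells and drawing seeds from every cell, is shown below; it is an illustration under assumptions, not the paper's parameter-free strategy, and the binning scheme and sampling proportions are hypothetical.

```python
import numpy as np

def normal_histogram_seeds(normals: np.ndarray, n_seeds: int,
                           bins: int = 8, rng_seed: int = 0) -> np.ndarray:
    """Choose k-means seed indices by binning points on surface-normal orientation and
    sampling from every bin, so that flat and curved regions are both represented."""
    rng = np.random.default_rng(rng_seed)
    az = np.arctan2(normals[:, 1], normals[:, 0])                       # azimuth in [-pi, pi]
    az_bin = np.minimum(((az + np.pi) / (2 * np.pi) * bins).astype(int), bins - 1)
    el_bin = np.minimum(((normals[:, 2] + 1.0) / 2.0 * bins).astype(int), bins - 1)
    cell = az_bin * bins + el_bin                                       # orientation cell id
    chosen = []
    for c in np.unique(cell):
        idx = np.flatnonzero(cell == c)
        take = max(1, round(n_seeds * len(idx) / len(normals)))         # proportional sampling
        chosen.extend(rng.choice(idx, size=min(take, len(idx)), replace=False))
    return np.asarray(chosen[:n_seeds])
```

The returned indices would then serve as initial cluster centers for a k-means pass over the concatenated color, position and normal features of the points.
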
... In recent years, the efficient supervoxel method has been introduced to 3D semantic segmentation [13][14][15]. Supervoxels were applied in a convolution operation (SVConv) by Huang, Ma et al. to effectively accomplish online 3D semantic segmentation [16]. In Sha, Chen et al.'s work, road contours were extracted efficiently and based completely on a supervoxel method without any trajectory data [17]. ...
Article
Full-text available
Supervoxels have widespread application in instance segmentation on account of their merit of providing a highly approximate representation with less data. However, low accuracy, mainly caused by point cloud adhesion, is a crucial issue in the localization of industrial robots. An improved bottom-up clustering method based on supervoxels was proposed for better accuracy. Firstly, point cloud data were preprocessed to eliminate the noise points and background. Then, improved supervoxel over-segmentation with moving least squares (MLS) surface fitting was employed to segment the point clouds of workpieces into supervoxel clusters. Every supervoxel cluster can be refined by MLS surface fitting, which reduces the chance that over-segmentation groups the point clouds of two objects into one patch. Additionally, an adaptive merging algorithm based on fusion features and convexity judgment was proposed to accomplish the clustering of individual workpieces. An experimental platform was set up to verify the proposed method. The experimental results showed that the recognition accuracy and the recognition rate for three different kinds of workpieces were all over 0.980 and 0.935, respectively. Combined with sample consensus initial alignment (SAC-IA) coarse registration and iterative closest point (ICP) fine registration, a coarse-to-fine strategy was adopted to obtain the location of the segmented workpieces in the experiments. The experimental results demonstrate that the proposed clustering algorithm can accomplish the localization of industrial robots with higher accuracy and lower registration time.
... Point-based methods [23,24,30,31,42,45,49] apply convolutional kernels to a local region of points for feature extraction, and the neighbors of a point are computed from k-NN or spherical search. In the case of voxel-based methods [4,7,14,46], the points in the 3D space are first transformed into voxel representations so that a standard CNN can be adopted to process the structured voxels. In either point-based or voxel-based methods, feature aggregation is performed in the Euclidean space, while there are some recent works [13,15,20,32] that consider geodesic information for better feature representation. ...
Preprint
Full-text available
Semantic segmentation of point clouds usually relies on dense annotation that is exhausting and costly, so solutions for the weakly supervised scheme with only sparse points annotated have attracted wide attention. Existing works start from the given labels and propagate them to highly related but unlabeled points under the guidance of data, e.g. intra-point relations. However, they suffer from (i) inefficient exploitation of the information in the data, and (ii) a strong reliance on labels, and are thus easily degraded when given far fewer annotations. Therefore, we propose a novel framework, PointMatch, that stands on both data and labels, applying consistency regularization to sufficiently probe information from the data itself while leveraging weak labels as assistance at the same time. By doing so, meaningful information can be learned from both data and labels for better representation learning, which also makes the model more robust to the degree of label sparsity. Simple yet effective, the proposed PointMatch achieves state-of-the-art performance under various weakly-supervised schemes on both the ScanNet-v2 and S3DIS datasets, especially on settings with extremely sparse labels, e.g. surpassing SQN by 21.2% and 17.2% on the 0.01% and 0.1% settings of ScanNet-v2, respectively.
... As such, they were able to collect a large data set of labeled 3D scenes. Huang et al. [38] made use of this data set and presented an approach that allows a 3D space described by supervoxels to be semantically segmented in real time. They noted that so far, applying deep learning techniques has been inefficient due to the large amounts of data describing the 3D scenes. ...
Article
Full-text available
Extensive research has outlined the potential of augmented, mixed, and virtual reality applications. However, little attention has been paid to scalability enhancements fostering practical adoption. In this paper, we introduce the concept of scalable extended reality (XRS), i.e., spaces scaling between different displays and degrees of virtuality that can be entered by multiple, possibly distributed users. The development of such XRS spaces concerns several research fields. To provide bidirectional interaction and maintain consistency with the real environment, virtual reconstructions of physical scenes need to be segmented semantically and adapted dynamically. Moreover, scalable interaction techniques for selection, manipulation, and navigation as well as a world-stabilized rendering of 2D annotations in 3D space are needed to let users intuitively switch between handheld and head-mounted displays. Collaborative settings should further integrate access control and awareness cues indicating the collaborators’ locations and actions. While many of these topics were investigated by previous research, very few have considered their integration to enhance scalability. Addressing this gap, we review related previous research, list current barriers to the development of XRS spaces, and highlight dependencies between them.
... Our 3D semantic segmentation results are shown in Tables II and III, where we compare our network with state-of-the-art pipelines. Similar to our method, SF [28], SR [40], SPV [43] and PF [29] are semantic reconstruction systems based on RGB-D images, while BPNet [3] deals with point clouds. ...
Preprint
Full-text available
In this paper, a method for dense semantic 3D scene reconstruction from an RGB-D sequence is proposed to solve high-level scene understanding tasks. First, each RGB-D pair is consistently segmented into 2D semantic maps based on a camera tracking backbone that propagates objects' labels with high probabilities from full scans to corresponding ones of partial views. Then a dense 3D mesh model of an unknown environment is incrementally generated from the input RGB-D sequence. Benefiting from 2D consistent semantic segments and the 3D model, a novel semantic projection block (SP-Block) is proposed to extract deep feature volumes from 2D segments of different views. Moreover, the semantic volumes are fused into deep volumes from a point cloud encoder to make the final semantic segmentation. Extensive experimental evaluations on public datasets show that our system achieves accurate 3D dense reconstruction and state-of-the-art semantic prediction performances simultaneously.
... Numerous designs of point-based convolutional kernels have been proposed [31,28,58,37,69]. In the case of voxel-based methods, the raw 3D data is first transformed into a voxel representation and then processed by standard CNNs [39,44,62,72,24]. To address the cubic memory and computation consumption problem of voxel-based operations, recent works have made efforts to propose efficient sparse voxel convolutions [17,7,56]. ...
Preprint
Full-text available
In recent years, sparse voxel-based methods have become the state-of-the-arts for 3D semantic segmentation of indoor scenes, thanks to the powerful 3D CNNs. Nevertheless, being oblivious to the underlying geometry, voxel-based methods suffer from ambiguous features on spatially close objects and struggle with handling complex and irregular geometries due to the lack of geodesic information. In view of this, we present Voxel-Mesh Network (VMNet), a novel 3D deep architecture that operates on the voxel and mesh representations leveraging both the Euclidean and geodesic information. Intuitively, the Euclidean information extracted from voxels can offer contextual cues representing interactions between nearby objects, while the geodesic information extracted from meshes can help separate objects that are spatially close but have disconnected surfaces. To incorporate such information from the two domains, we design an intra-domain attentive module for effective feature aggregation and an inter-domain attentive module for adaptive feature fusion. Experimental results validate the effectiveness of VMNet: specifically, on the challenging ScanNet dataset for large-scale segmentation of indoor scenes, it outperforms the state-of-the-art SparseConvNet and MinkowskiNet (74.6% vs 72.5% and 73.6% in mIoU) with a simpler network structure (17M vs 30M and 38M parameters). Code release: https://github.com/hzykent/VMNet
Chapter
Full-text available
Reconstructing real-world scenes with unparalleled levels of realism and detail has been a long-standing goal in the fields of computer vision and graphics. Achieving this goal necessitates coordinated efforts in both sensing techniques and plenoptic reconstruction algorithms.
Article
We present X-SLAM, a real-time dense differentiable SLAM system that leverages the complex-step finite difference (CSFD) method for efficient calculation of numerical derivatives, bypassing the need for a large-scale computational graph. The key to our approach is treating the SLAM process as a differentiable function, enabling the calculation of the derivatives of important SLAM parameters through Taylor series expansion within the complex domain. Our system allows for the real-time calculation of not just the gradient, but also higher-order differentiation. This facilitates the use of high-order optimizers to achieve better accuracy and faster convergence. Building on X-SLAM, we implemented end-to-end optimization frameworks for two important tasks: camera relocalization in wide outdoor scenes and active robotic scanning in complex indoor environments. Comprehensive evaluations on public benchmarks and intricate real scenes underscore the improvements in the accuracy of camera relocalization and the efficiency of robotic navigation achieved through our task-aware optimization. The code and data are available at https://gapszju.github.io/X-SLAM.
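The complex-step finite difference (CSFD) trick mentioned above is standard and easy to demonstrate: a first derivative is obtained from a single complex evaluation with no subtractive cancellation, so the step size can be made extremely small. The snippet below is a generic illustration, not X-SLAM's implementation.

```python
import numpy as np

def csfd_derivative(f, x: float, h: float = 1e-20) -> float:
    """First derivative of a real-analytic scalar function via the complex-step
    finite difference: f'(x) ~= Im(f(x + i*h)) / h. Unlike ordinary finite
    differences, there is no subtractive cancellation, so h can be tiny."""
    return np.imag(f(x + 1j * h)) / h

# Example: derivative of f(x) = exp(x) * sin(x) at x = 1.0
f = lambda z: np.exp(z) * np.sin(z)
print(csfd_derivative(f, 1.0))   # ~= exp(1) * (sin(1) + cos(1)) ~= 3.7560
```
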
Article
2D-3D joint learning is essential and effective for fundamental 3D vision tasks, such as 3D semantic segmentation, due to the complementary information these two visual modalities contain. Most current 3D scene semantic segmentation methods process 2D images “as they are”, i.e., only real captured 2D images are used. However, such captured 2D images may be redundant, with abundant occlusion and/or limited field of view (FoV), leading to poor performance for the current methods involving 2D inputs. In this paper, we propose a general learning framework for joint 2D-3D scene understanding by selecting informative virtual 2D views of the underlying 3D scene. We then feed both the 3D geometry and the generated virtual 2D views into any joint 2D-3D-input or pure 3D-input based deep neural models for improving 3D scene understanding. Specifically, we generate virtual 2D views based on an information score map learned from the current 3D scene semantic segmentation results. To achieve this, we formalize the learning of the information score map as a deep reinforcement learning process, which rewards good predictions using a deep neural network. To obtain a compact set of virtual 2D views that jointly cover informative surfaces of the 3D scene as much as possible, we further propose an efficient greedy virtual view coverage strategy in the normal-sensitive 6D space, including 3-dimensional point coordinates and 3-dimensional normal. We have validated our proposed framework for various joint 2D-3D-input or pure 3D-input based deep neural models on two real-world 3D scene datasets, i.e., ScanNet v2 [1] and S3DIS [2], and the results demonstrate that our method obtains a consistent gain over baseline models and achieves new top accuracy for joint 2D and 3D scene semantic segmentation. Code is available at https://github.com/smy-THU/VirtualViewSelection.
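The greedy coverage step can be pictured with a generic set-cover-style sketch: each candidate view is credited with the information of the surface points it newly covers, and views are picked greedily until nothing is gained. This is an illustration only; the paper's normal-sensitive 6D formulation is not reproduced, and `view_coverage`/`info_score` are hypothetical inputs.

```python
import numpy as np

def greedy_view_selection(view_coverage: np.ndarray, info_score: np.ndarray,
                          max_views: int) -> list[int]:
    """Greedily pick views that add the most not-yet-covered information.
    view_coverage: (V, N) boolean matrix, True if view v sees surface point n;
    info_score:    (N,) per-point informativeness (e.g. from a learned score map)."""
    covered = np.zeros(view_coverage.shape[1], dtype=bool)
    selected = []
    for _ in range(max_views):
        gains = (view_coverage & ~covered).astype(float) @ info_score   # new info per view
        best = int(np.argmax(gains))
        if gains[best] <= 0:
            break
        selected.append(best)
        covered |= view_coverage[best]
    return selected
```
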
Preprint
Full-text available
This review thoroughly examines the role of semantically-aware Neural Radiance Fields (NeRFs) in visual scene understanding, covering an analysis of over 250 scholarly papers spanning six main categories. It explores how NeRFs adeptly infer 3D representations for both stationary and dynamic objects in a scene. This capability is pivotal for generating high-quality new viewpoints, completing missing scene details (inpainting), conducting comprehensive scene segmentation (panoptic segmentation), predicting 3D bounding boxes, editing 3D scenes, and extracting object-centric 3D models. A significant aspect of this study is the application of semantic labels as viewpoint-invariant functions, which effectively map spatial coordinates to a spectrum of semantic labels, thus facilitating the recognition of distinct objects within the scene. Overall, this survey highlights the progression and diverse applications of semantically-aware neural radiance fields in the context of visual scene interpretation.
Article
The availability of real-time semantics greatly improves the core geometric functionality of SLAM systems, enabling numerous robotic and AR/VR applications. We present a new methodology for real-time semantic mapping from RGB-D sequences that combines a 2D neural network and a 3D network based on a SLAM system with 3D occupancy mapping. When segmenting a new frame we perform latent feature re-projection from previous frames based on differentiable rendering. Fusing re-projected feature maps from previous frames with current-frame features greatly improves image segmentation quality, compared to a baseline that processes images independently. For 3D map processing, we propose a novel geometric quasi-planar over-segmentation method that groups 3D map elements likely to belong to the same semantic classes, relying on surface normals. We also describe a novel neural network design for lightweight semantic map post-processing. Our system achieves state-of-the-art semantic mapping quality within 2D-3D networks-based systems and matches the performance of 3D convolutional networks on three real indoor datasets, while working in real-time. Moreover, it shows better cross-sensor generalization abilities compared to 3D CNNs, enabling training and inference with different depth sensors. Code and data can be found at https://github.com/slamcore/semlaps .
Article
Incomplete or outdated inventories of railway infrastructures may disrupt the railway sector’s administration and maintenance of transportation infrastructure, thus posing potential threats to the safety of traffic networks. Previous studies have adopted point clouds to accelerate inventory and inspection automation procedures. However, owing to the complexity of the railway scenes, previous studies reveal an imbalance between semantic richness, segmentation accuracy, and processing efficiency. This study aims to advance our understanding by providing a deep-learning framework for railway point cloud semantic segmentation. The proposed framework, named RailSeg, encompasses point cloud downsampling, integrated local-global feature extraction, spatial context aggregation, and semantic regularization. The proposed method, validated using point clouds collected in suburban and rural scenes, generates a point-level railway furniture inventory of 11 categories and achieves competitive performance in overall accuracy and mean intersection over union. In addition, RailSeg achieves better results than the baseline for additional types of point clouds (i.e., plateau railway mobile laser scanning (MLS) point clouds, street MLS point clouds, and urban-scale photogrammetric point clouds), demonstrating the superior generalization capabilities of RailSeg. This study may contribute to the development of 3D semantic segmentation, digital railway, and intelligent transportation.
Article
Hidden features in neural networks usually fail to learn informative representations for 3D segmentation, as supervision is only given on the output prediction; this can be addressed by omni-scale supervision on intermediate layers. In this paper, we bring the first omni-scale supervision method to 3D segmentation via the proposed gradual Receptive Field Component Reasoning (RFCR), where target Receptive Field Component Codes (RFCCs) are designed to record the categories within the receptive fields of hidden units in the encoder. The target RFCCs then supervise the decoder to gradually infer the RFCCs in a coarse-to-fine category reasoning manner, and finally obtain the semantic labels. To obtain more supervision, we also propose an RFCR-NL model with complementary negative codes (i.e., Negative RFCCs, NRFCCs) and negative learning. Because many hidden features are inactive, with tiny magnitudes, and make minor contributions to RFCC prediction, we propose Feature Densification with a centrifugal potential to obtain more unambiguous features, which is in effect equivalent to entropy regularization over the features. More active features can unleash the potential of the omni-supervision method. We embed our method into three prevailing backbones, which are significantly improved on all three datasets on both fully and weakly supervised segmentation tasks and achieve competitive performance.
Chapter
Full-text available
Data-driven machine learning (ML) models are attracting increasing interest in chemical engineering and already partly outperform traditional physical simulations. Previous work in this field has mainly focused on improving the models’ statistical performance while the thereby imparted knowledge has been taken for granted. However, also the structures learned by the model during the training are fascinating yet non-trivial to assess as they are usually high-dimensional. As such, the interpretable communication of the relationship between the learned model and domain knowledge is vital for its evaluation by applying engineers. Specifically, visual analytics enables the interactive exploration of data sets and can thus reveal structures in otherwise too large-scale or too complex data. This chapter focuses on the thermodynamic modeling of mixtures of substances using the so-called activity coefficients as exemplary measures. We present and apply two visualization techniques that enable analyzing high-dimensional learned substance descriptors compared to chemical domain knowledge. We found explanations regarding chemical classes for most of the learned descriptor structures and striking correlations with physicochemical properties.
Article
In recent years, point cloud registration has achieved great success by learning geometric features with deep learning techniques. However, existing approaches that rely on pure geometric context still suffer from sensor noise and geometric ambiguities (e.g., flat or symmetric structure), which limit their robustness to real-world scenes. When 3D point clouds are constructed by RGB-D cameras, we can enhance the learned features with complementary texture information from RGB images. To this end, we propose to learn a 3D hybrid feature that fully exploits the multi-view colored images and point clouds from indoor RGB-D scene scans. Specifically, to address the discrepancy of 2D–3D observations, we design to extract informative 2D features from image planes and take only these features for fusion. Then, we utilize a novel soft-fusion module to associate and fuse hybrid features in a unified space while alleviating the ambiguities of 2D–3D feature binding. Finally, we develop a self-supervised feature scoring module customized for our multi-modal hybrid features, which significantly improves the keypoint selection quality in noisy indoor scene scans. Our method shows competitive registration performance with previous methods on two real-world datasets.
Article
Full-text available
Deep learning has been successfully used for tasks in the 2D image domain. Research on 3D computer vision and deep geometry learning has also attracted attention. Considerable achievements have been made regarding feature extraction and discrimination of 3D shapes. Following recent advances in deep generative models such as generative adversarial networks, effective generation of 3D shapes has become an active research topic. Unlike 2D images with a regular grid structure, 3D shapes have various representations, such as voxels, point clouds, meshes, and implicit functions. For deep learning of 3D shapes, shape representation has to be taken into account as there is no unified representation that can cover all tasks well. Factors such as the representativeness of geometry and topology often largely affect the quality of the generated 3D shapes. In this survey, we comprehensively review works on deep-learning-based 3D shape generation by classifying and discussing them in terms of the underlying shape representation and the architecture of the shape generator. The advantages and disadvantages of each class are further analyzed. We also consider the 3D shape datasets commonly used for shape generation. Finally, we present several potential research directions that hopefully can inspire future works on this topic.
Article
Full-text available
Autonomous vehicles require in-depth knowledge of their surroundings, making path segmentation and object detection crucial for determining the feasible region for path planning. Uniform characteristics of a road portion can be denoted by segmentations. Currently, road segmentation techniques mostly depend on the quality of camera images under different lighting conditions. However, Light Detection and Ranging (LiDAR) sensors can provide extremely precise 3D geometry information about the surroundings, leading to increased accuracy at the cost of higher memory consumption and computational overhead. This paper introduces a novel methodology which combines LiDAR and camera data for road detection, bridging the gap between camera images and 3D LiDAR Point Clouds (PCs). The assignment of semantic labels to 3D points is essential in various fields, including remote sensing, autonomous vehicles, and computer vision. This research discusses how to select the most relevant geometric features for path planning and improve autonomous navigation. An automatic framework for Semantic Segmentation (SS) is introduced, consisting of four processes: neighborhood selection, classification feature extraction, feature selection, and classification. The aim is to make the various components usable for end users without specialized knowledge by considering simplicity, effectiveness, and reproducibility. Through an extensive evaluation of different neighborhoods, geometric features, feature selection methods, classifiers, and benchmark datasets, the outcomes show that selecting the appropriate neighborhoods significantly improves 3D path segmentation. Additionally, selecting the right feature subsets can reduce computation time and memory usage and enhance the quality of the results.
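Typical geometric features extracted from a local point neighborhood in such frameworks are covariance-eigenvalue descriptors (linearity, planarity, sphericity); the paper's exact feature set is not listed in the abstract, so the sketch below should be read as a common example rather than its implementation.

```python
import numpy as np

def eigen_features(neighborhood: np.ndarray) -> dict:
    """Covariance-eigenvalue features of one local point neighborhood (k x 3 array):
    linearity, planarity and sphericity, a standard geometric feature set often used
    for LiDAR semantic segmentation."""
    centered = neighborhood - neighborhood.mean(axis=0)
    cov = centered.T @ centered / len(neighborhood)
    l1, l2, l3 = np.sort(np.linalg.eigvalsh(cov))[::-1]   # l1 >= l2 >= l3 >= 0
    eps = 1e-12
    return {
        "linearity":  (l1 - l2) / (l1 + eps),
        "planarity":  (l2 - l3) / (l1 + eps),
        "sphericity": l3 / (l1 + eps),
    }
```
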
Article
Point cloud semantic segmentation in urban scenes plays a vital role in intelligent city modeling, autonomous driving, and urban planning. Point cloud semantic segmentation based on deep learning methods has achieved significant improvement. However, it is also challenging for accurate semantic segmentation in large scenes due to complex elements, variety of scene classes, occlusions, and noise. Besides, most methods need to split the original point cloud into multiple blocks before processing and cannot directly deal with the point clouds on a large scale. We propose a novel context-aware network (CAN) that can directly deal with large-scale point clouds. In the proposed network, a local feature aggregation module (LFAM) is designed to preserve rich geometric details in the raw point cloud and reduce the information loss during feature extraction. Then, in combination with a global context aggregation module (GCAM), capture long-range dependencies to enhance the network feature representation and suppress the noise. Finally, a context-aware upsampling module (CAUM) is embedded into the proposed network to capture the global perception from a broad perspective. The ensemble of low-level and high-level features facilitates the effectiveness and efficiency of 3-D point cloud feature refinement. Comprehensive experiments were carried out on three large-scale point cloud datasets in both outdoor and indoor environments to evaluate the performance of the proposed network. The results show that the proposed method outperformed the state-of-the-art representative semantic segmentation networks, and the overall accuracy (OA) of Tongji-3D, Semantic3D, and Stanford large-scale 3-D indoor spaces (S3DIS) is 96.01%, 95.0%, and 88.55%, respectively.
Article
Point cloud processing has received more attention in recent years. Due to the huge amount of data, using supervoxels to pre-segment the points can improve the performance of point cloud processing tasks. Some supervoxel algorithms generate high-quality results, but their low efficiency hinders wide application in point cloud processing tasks. In this paper, we try to strike a good balance between the quality and efficiency of point cloud over-segmentation. We propose an algorithm suitable for GPU acceleration, which can generate supervoxels with high efficiency. The algorithm is a seed-based segmentation method, and we carefully design two stages, a clustering stage and an optimization stage, each of which can be executed in parallel on the GPU. In the first stage, the algorithm generates an initial segmentation based on well-designed energy functions, and the second stage further improves the result by minimizing the segmentation energy. Our method generates good segmentation results and achieves the fastest processing speed compared with existing methods. We evaluate the supervoxels on three public datasets. Experiments show that our algorithm can generate high-quality segmentation for various point cloud data with high efficiency, which is important for advancing the application of point cloud supervoxels in subsequent processing.
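A simplified, CPU-only Python sketch of the seed-based two-stage idea is given below; it uses a plain spatial-plus-normal energy and Lloyd-style seed updates as stand-ins, whereas the actual method runs both stages in parallel on the GPU with more carefully designed energy functions.

import numpy as np

def supervoxel_clustering(points, normals, resolution=0.5, w_normal=1.0, iters=5):
    """Seed-based over-segmentation: assign each point to the seed that minimizes a
    joint spatial + normal-dissimilarity energy, then re-estimate the seeds."""
    # Seeds from a coarse voxel grid (one seed per occupied cell).
    _, first = np.unique((points / resolution).astype(np.int32), axis=0, return_index=True)
    seed_pts, seed_nrm = points[first].copy(), normals[first].copy()

    for _ in range(iters):
        # Clustering stage: energy = normalized squared spatial distance + normal dissimilarity.
        d_spatial = ((points[:, None, :] - seed_pts[None, :, :]) ** 2).sum(-1)
        d_normal = 1.0 - np.abs(normals @ seed_nrm.T)
        label = np.argmin(d_spatial / resolution ** 2 + w_normal * d_normal, axis=1)

        # Optimization stage: move each seed to the centroid of its cluster.
        for s in range(len(seed_pts)):
            mask = label == s
            if mask.any():
                seed_pts[s] = points[mask].mean(0)
                n = normals[mask].mean(0)
                seed_nrm[s] = n / (np.linalg.norm(n) + 1e-9)
    return label

pts = np.random.rand(5000, 3) * 4.0                 # toy point cloud
nrm = np.tile([0.0, 0.0, 1.0], (5000, 1))           # toy normals
labels = supervoxel_clustering(pts, nrm)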
Article
Full-text available
Researchers have achieved great success in dealing with 2D images using deep learning. In recent years, 3D computer vision and geometric deep learning have gained ever more attention. Many advanced techniques for 3D shapes have been proposed for different applications. Unlike 2D images, which can be uniformly represented by a regular grid of pixels, 3D shapes have various representations, such as depth images, multi-view images, voxels, point clouds, meshes, implicit surfaces, etc. The performance achieved in different applications largely depends on the representation used, and there is no single representation that works well for all applications. Therefore, in this survey, we review recent developments in deep learning for 3D geometry from a representation perspective, summarizing the advantages and disadvantages of different representations for different applications. We also present existing datasets in these representations and further discuss future research directions.
Article
Full-text available
We propose a novel approach to robot-operated active understanding of unknown indoor scenes, based on online RGBD reconstruction with semantic segmentation. In our method, the exploratory robot scanning is both driven by and targeted at the recognition and segmentation of semantic objects in the scene. Our algorithm is built on top of a volumetric depth fusion framework and performs real-time voxel-based semantic labeling over the online reconstructed volume. The robot is guided by an online estimated discrete viewing score field (VSF) parameterized over the 3D space of 2D location and azimuth rotation. The VSF stores for each grid cell the score of the corresponding view, which measures how much it reduces the uncertainty (entropy) of both geometric reconstruction and semantic labeling. Based on the VSF, we select the next best view (NBV) as the target for each time step. We then jointly optimize the traverse path and camera trajectory between two adjacent NBVs by maximizing the integral viewing score (information gain) along the path and trajectory. Through extensive evaluation, we show that our method achieves efficient and accurate online scene parsing during exploratory scanning.
Article
Full-text available
Learning to reliably perceive and understand the scene is an integral enabler for robots to operate in the real-world. This problem is inherently challenging due to the multitude of object types as well as appearance changes caused by varying illumination and weather conditions. Leveraging complementary modalities can enable learning of semantically richer representations that are resilient to such perturbations. Despite the tremendous progress in recent years, most multimodal convolutional neural network approaches directly concatenate feature maps from individual modality streams, rendering the model incapable of focusing only on the relevant complementary information for fusion. To address this limitation, we propose a multimodal semantic segmentation framework that dynamically adapts the fusion of modality-specific features while being sensitive to the object category, spatial location and scene context in a self-supervised manner. Specifically, we propose an architecture consisting of two modality-specific encoder streams that fuse intermediate encoder representations into a single decoder using our proposed self-supervised model adaptation fusion mechanism which optimally combines complementary features. As intermediate representations are not aligned across modalities, we introduce an attention scheme for better correlation. In addition, we propose a computationally efficient unimodal segmentation architecture termed AdapNet++ that incorporates a new encoder with multiscale residual units and an efficient atrous spatial pyramid pooling that has a larger effective receptive field with more than 10× fewer parameters, complemented with a strong decoder with a multi-resolution supervision scheme that recovers high-resolution details. Comprehensive empirical evaluations on Cityscapes, Synthia, SUN RGB-D, ScanNet and Freiburg Forest benchmarks demonstrate that both our unimodal and multimodal architectures achieve state-of-the-art performance while simultaneously being efficient in terms of parameters and inference time as well as demonstrating substantial robustness in adverse perceptual conditions.
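The following PyTorch sketch illustrates the general idea of adaptively re-weighting two modality streams before a shared decoder; the gating bottleneck, channel sizes, and layer choices are illustrative assumptions rather than the paper's SSMA fusion block.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse two modality feature maps with spatially varying, learned gates."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels // reduction, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 2 * channels, 3, padding=1),
            nn.Sigmoid(),                      # per-location, per-channel weights in [0, 1]
        )
        self.project = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb_feat, depth_feat):
        x = torch.cat([rgb_feat, depth_feat], dim=1)
        fused = x * self.gate(x)               # re-weight each modality's channels before decoding
        return self.project(fused)             # back to a single-stream representation

rgb = torch.randn(2, 256, 32, 64)              # toy encoder outputs
depth = torch.randn(2, 256, 32, 64)
out = GatedFusion(256)(rgb, depth)             # (2, 256, 32, 64)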
Conference Paper
Full-text available
We present a real-time dense mapping system which uses the predicted 2D semantic labels for optimizing the geometric quality of reconstruction. With a combination of Convolutional Neural Networks (CNNs) for 2D labeling and a Simultaneous Localization and Mapping (SLAM) system for camera trajectory estimation, recent approaches have succeeded in incrementally fusing and labeling 3D scenes. However, the geometric quality of the reconstruction can be further improved by incorporating such semantic prediction results, which is not sufficiently exploited by existing methods. In this paper, we propose to use semantic information to improve two crucial modules in the reconstruction pipeline, namely tracking and loop detection, to obtain mutual benefits in geometric reconstruction and semantic recognition. Specifically, for tracking, we use a novel probabilistic projective association approach to efficiently pick out candidate correspondences, where the confidence of these correspondences is quantified in terms of similarities over all available short-term invariant features. For loop detection, we incorporate these semantic labels into the original encoding through Randomized Ferns to generate a more comprehensive representation for retrieving candidate loop frames. Evaluations on a publicly available synthetic dataset have shown the effectiveness of our approach, which treats such semantic hints as a reliable feature for achieving higher geometric quality.
Article
Full-text available
We present an integrated approach for reconstructing high-fidelity three-dimensional (3D) models using consumer RGB-D cameras. RGB-D registration and reconstruction algorithms are prone to errors from scanning noise, making it hard to perform 3D reconstruction accurately. The key idea of our method is to assign a probabilistic uncertainty model to each depth measurement, which then guides the scan alignment and depth fusion. This allows us to effectively handle inherent noise and distortion in depth maps while keeping the overall scan registration procedure under the iterative closest point framework for simplicity and efficiency. We further introduce a local-to-global, submap-based, and uncertainty-aware global pose optimization scheme to improve scalability and guarantee global model consistency. Finally, we have implemented the proposed algorithm on the GPU, achieving real-time 3D scanning frame rates and updating the reconstructed model on-the-fly. Experimental results on simulated and real-world data demonstrate that the proposed method outperforms state-of-the-art systems in terms of the accuracy of both recovered camera trajectories and reconstructed models.
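A minimal sketch of the underlying idea of uncertainty-guided depth fusion: each voxel keeps a confidence-weighted running average, so low-variance measurements dominate. The data structure, variance model, and interface below are assumptions for illustration, not the paper's GPU implementation.

import numpy as np

class UncertaintyFusion:
    """Running, uncertainty-weighted average of per-voxel signed-distance values.
    Each measurement carries a variance; low-variance observations dominate the fusion."""
    def __init__(self, num_voxels):
        self.value = np.zeros(num_voxels)      # fused signed distance per voxel
        self.weight = np.zeros(num_voxels)     # accumulated confidence per voxel

    def integrate(self, voxel_idx, sdf, variance):
        w = 1.0 / (variance + 1e-6)            # inverse-variance weighting of each measurement
        total = self.weight[voxel_idx] + w
        self.value[voxel_idx] = (self.weight[voxel_idx] * self.value[voxel_idx] + w * sdf) / total
        self.weight[voxel_idx] = total

fusion = UncertaintyFusion(num_voxels=1000)    # toy volume
fusion.integrate(np.array([1, 2, 3]),
                 sdf=np.array([0.01, -0.02, 0.0]),
                 variance=np.array([0.001, 0.01, 0.1]))   # noisier depth gets less influence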
Article
Full-text available
Supervoxels provide a more natural and compact representation of three-dimensional point clouds, and enable operations to be performed on regions rather than on scattered points. Many state-of-the-art supervoxel segmentation methods adopt a fixed resolution for each supervoxel and rely on the initialization of seed points. As a result, they may not preserve well the boundaries of point clouds with non-uniform density. In this paper, we present a simple but effective supervoxel segmentation method for point clouds, which formalizes supervoxel segmentation as a subset selection problem. We develop a heuristic algorithm that utilizes local information to efficiently solve the subset selection problem. The proposed method can produce supervoxels with adaptive resolutions and does not rely on the selection of seed points. The method is fully tested on three publicly available point cloud segmentation benchmarks, which cover the major point cloud types. The experimental results show that, compared with the state-of-the-art supervoxel segmentation methods, the supervoxels extracted using our method preserve object boundaries and small structures more effectively, which is reflected in a higher boundary recall and a lower under-segmentation error.
Article
Full-text available
While object recognition on 2D images is getting more and more mature, 3D understanding is eagerly in demand yet largely underexplored. In this paper, we study the 3D object detection problem from RGB-D data captured by depth sensors in both indoor and outdoor environments. Different from previous deep learning methods that work on 2D RGB-D images or 3D voxels, which often obscure natural 3D patterns and invariances of 3D data, we directly operate on raw point clouds by popping up RGB-D scans. Although recent works such as PointNet perform well for segmentation in small-scale point clouds, one key challenge is how to efficiently detect objects in large-scale scenes. Leveraging the wisdom of dimension reduction and mature 2D object detectors, we develop a Frustum PointNet framework that addresses the challenge. Evaluated on the KITTI and SUN RGB-D 3D detection benchmarks, our method outperforms the state of the art by remarkable margins with high efficiency (running at 5 fps).
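The core frustum-lifting step, i.e., keeping only the 3D points whose image projection falls inside a 2D detection box before running a point network, can be sketched as follows; the intrinsics and box values are made-up placeholders.

import numpy as np

def frustum_points(points_cam, K, box2d):
    """Keep the 3D points (in camera coordinates) whose pinhole projection lies in a 2D box."""
    x1, y1, x2, y2 = box2d
    front = points_cam[points_cam[:, 2] > 1e-6]             # only points in front of the camera
    u = K[0, 0] * front[:, 0] / front[:, 2] + K[0, 2]
    v = K[1, 1] * front[:, 1] / front[:, 2] + K[1, 2]
    keep = (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
    return front[keep]

K = np.array([[525.0, 0, 320.0], [0, 525.0, 240.0], [0, 0, 1.0]])   # placeholder intrinsics
pts = np.random.randn(10000, 3) + np.array([0, 0, 3.0])             # toy points roughly in front
crop = frustum_points(pts, K, box2d=(200, 150, 440, 330))           # fed to the point network next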
Article
Full-text available
Few prior works study deep learning on point sets. PointNet by Qi et al. is a pioneer in this direction. However, by design PointNet does not capture local structures induced by the metric space points live in, limiting its ability to recognize fine-grained patterns and generalizability to complex scenes. In this work, we introduce a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, our network is able to learn local features with increasing contextual scales. With further observation that point sets are usually sampled with varying densities, which results in greatly decreased performance for networks trained on uniform densities, we propose novel set learning layers to adaptively combine features from multiple scales. Experiments show that our network called PointNet++ is able to learn deep point set features efficiently and robustly. In particular, results significantly better than state-of-the-art have been obtained on challenging benchmarks of 3D point clouds.
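A compact NumPy sketch of one set-abstraction level in this hierarchical scheme (farthest point sampling, ball-query grouping, a shared per-point transform, and max pooling within each group); the random matrix stands in for a learned shared MLP, and real implementations use CUDA kernels.

import numpy as np

def farthest_point_sampling(xyz, m):
    """Iteratively pick the point farthest from the already chosen centroids."""
    dist = np.full(xyz.shape[0], np.inf)
    chosen = [0]
    for _ in range(m - 1):
        dist = np.minimum(dist, ((xyz - xyz[chosen[-1]]) ** 2).sum(1))
        chosen.append(int(dist.argmax()))
    return np.array(chosen)

def set_abstraction(xyz, feats, m=128, radius=0.2, k=32):
    """One set-abstraction level: sample centroids, group neighbors in a ball,
    apply a shared per-point transform, and max-pool within each group."""
    centroids = farthest_point_sampling(xyz, m)
    d = ((xyz[None, :, :] - xyz[centroids][:, None, :]) ** 2).sum(-1)   # (m, n) squared distances
    rng = np.random.default_rng(0)
    W = rng.standard_normal((feats.shape[1] + 3, 64)) * 0.1             # stand-in for a learned shared MLP
    out = []
    for c, row in zip(centroids, d):
        nbr = np.argsort(row)[:k]
        nbr = nbr[row[nbr] < radius ** 2]                               # ball query around the centroid
        if nbr.size == 0:
            nbr = np.array([c])
        local = np.hstack([xyz[nbr] - xyz[c], feats[nbr]])              # relative coords + point features
        out.append(np.maximum(local @ W, 0).max(0))                     # shared transform + max pool
    return xyz[centroids], np.stack(out)

pts = np.random.rand(1024, 3)
f = np.random.rand(1024, 8)
new_xyz, new_f = set_abstraction(pts, f)                                # (128, 3), (128, 64)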
Article
Full-text available
Point cloud segmentation is a fundamental problem. Due to the complexity of real-world scenes and the limitations of 3D scanners, interactive segmentation is currently the only way to cope with all kinds of point clouds. However, interactively segmenting complex and large-scale scenes is very time-consuming. In this paper, we present a novel interactive system for segmenting point cloud scenes. Our system automatically suggests a series of camera views, in which users can conveniently specify segmentation guidance. In this way, users may focus on specifying segmentation hints instead of manually searching for desirable views of unsegmented objects, thus significantly reducing user effort. To achieve this, we introduce a novel view preference model, which is based on a set of dedicated view attributes, with weights learned from a user study. We also introduce support relations for both graph-cut-based segmentation and finding similar objects. Our experiments show that our segmentation technique helps users quickly segment various types of scenes, outperforming alternative methods.
Article
Full-text available
In this paper we present a novel feature-based RGB-D camera pose optimization algorithm for real-time 3D reconstruction systems. During camera pose estimation, current methods in online systems suffer from fast-scanned RGB-D data, or generate inaccurate relative transformations between consecutive frames. Our approach improves current methods by utilizing matched features across all frames and is robust for RGB-D data with large shifts in consecutive frames. We directly estimate camera pose for each frame by efficiently solving a quadratic minimization problem to maximize the consistency of 3D points in global space across frames corresponding to matched feature points. We have implemented our method within two state-of-the-art online 3D reconstruction platforms. Experimental results testify that our method is efficient and reliable in estimating camera poses for RGB-D data with large shifts.
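The paper solves a joint quadratic problem over matched features across all frames; as a simplified stand-in, the sketch below shows only the classic closed-form (SVD-based) least-squares rigid alignment of one frame's matched 3D feature points.

import numpy as np

def rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) aligning matched 3D points src -> dst (Kabsch)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)                 # 3x3 cross-covariance of centered matches
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # guard against reflections
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Synthetic matched feature points between two consecutive frames.
src = np.random.rand(200, 3)
angle = np.deg2rad(10)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
dst = src @ R_true.T + np.array([0.1, 0.0, -0.05])
R, t = rigid_transform(src, dst)                      # recovers the rotation and translation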
Conference Paper
Full-text available
Several RGB-D datasets have been publicized over the past few years for facilitating research in computer vision and robotics. However, the lack of comprehensive and fine-grained annotation in these RGB-D datasets has posed challenges to their widespread usage. In this paper, we introduce SceneNN, an RGB-D scene dataset consisting of 100 scenes. All scenes are reconstructed into triangle meshes and have per-vertex and per-pixel annotation. We further enriched the dataset with fine-grained information such as axis-aligned bounding boxes, oriented bounding boxes, and object poses. We used the dataset as a benchmark to evaluate the state-of-the-art methods on relevant research problems such as intrinsic decomposition and shape completion. Our dataset and annotation tools are available at http://www.scenenn.net.
Article
Full-text available
Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds and well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.
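A minimal PyTorch sketch of the core design: a shared per-point MLP followed by a symmetric max-pooling aggregation, which makes the global descriptor invariant to the ordering of the input points; the layer widths and classification head are illustrative assumptions.

import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP + max pooling: a permutation-invariant global descriptor."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.point_mlp = nn.Sequential(              # applied identically to every point
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 1024), nn.ReLU(),
        )
        self.head = nn.Linear(1024, num_classes)

    def forward(self, pts):                          # pts: (batch, num_points, 3)
        per_point = self.point_mlp(pts)              # (batch, num_points, 1024)
        global_feat = per_point.max(dim=1).values    # symmetric function over the point set
        return self.head(global_feat)

x = torch.rand(4, 2048, 3)
logits = TinyPointNet()(x)                           # identical for any permutation of the 2048 points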
Article
Full-text available
Recent advances of 3D acquisition devices have enabled large-scale acquisition of 3D scene data. Such data, if completely and well annotated, can serve as useful ingredients for a wide spectrum of computer vision and graphics works such as data-driven modeling and scene understanding, object detection and recognition. However, annotating a vast amount of 3D scene data remains challenging due to the lack of an effective tool and/or the complexity of 3D scenes (e.g. clutter, varying illumination conditions). This paper aims to build a robust annotation tool that effectively and conveniently enables the segmentation and annotation of massive 3D data. Our tool works by coupling 2D and 3D information via an interactive framework, through which users can provide high-level semantic annotation for objects. We have experimented our tool and found that a typical indoor scene could be well segmented and annotated in less than 30 minutes by using the tool, as opposed to a few hours if done manually. Along with the tool, we created a dataset of over a hundred 3D scenes associated with complete annotations using our tool. The tool and dataset are available at www.scenenn.net.
Chapter
Semantic segmentation of 3D meshes is an important problem for 3D scene understanding. In this paper we revisit the classic multiview representation of 3D meshes and study several techniques that make them effective for 3D semantic segmentation of meshes. Given a 3D mesh reconstructed from RGBD sensors, our method effectively chooses different virtual views of the 3D mesh and renders multiple 2D channels for training an effective 2D semantic segmentation model. Features from multiple per-view predictions are finally fused on 3D mesh vertices to predict mesh semantic segmentation labels. Using the large-scale indoor 3D semantic segmentation benchmark of ScanNet, we show that our virtual views enable more effective training of 2D semantic segmentation networks than previous multiview approaches. When the 2D per-pixel predictions are aggregated on 3D surfaces, our virtual multiview fusion method is able to achieve significantly better 3D semantic segmentation results compared to all prior multiview approaches and recent 3D convolution approaches.
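The final fusion step can be sketched as follows: per-view 2D class scores are sampled at the pixels where each vertex projects and averaged per vertex (visibility and occlusion handling are omitted here; the camera model and score maps are stand-ins).

import numpy as np

def fuse_view_scores(vertices, views, num_classes):
    """Average per-pixel class scores over all views into which a vertex projects."""
    acc = np.zeros((len(vertices), num_classes))
    cnt = np.zeros(len(vertices))
    for K, R, t, scores in views:                     # scores: (H, W, num_classes) from a 2D network
        cam = vertices @ R.T + t                      # world -> camera coordinates
        z = np.maximum(cam[:, 2], 1e-6)
        u = np.round(K[0, 0] * cam[:, 0] / z + K[0, 2]).astype(int)
        v = np.round(K[1, 1] * cam[:, 1] / z + K[1, 2]).astype(int)
        H, W, _ = scores.shape
        ok = (cam[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        acc[ok] += scores[v[ok], u[ok]]
        cnt[ok] += 1
    return acc / np.maximum(cnt, 1)[:, None]          # per-vertex averaged class scores

verts = np.random.rand(500, 3)                        # toy mesh vertices
K = np.array([[400.0, 0, 160.0], [0, 400.0, 120.0], [0, 0, 1.0]])
view = (K, np.eye(3), np.array([0, 0, 2.0]), np.random.rand(240, 320, 21))
labels = fuse_view_scores(verts, [view], num_classes=21).argmax(1)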
Chapter
We present a novel approach to reconstructing lightweight, CAD-based representations of scanned 3D environments from commodity RGB-D sensors. Our key idea is to jointly optimize for both CAD model alignments as well as layout estimations of the scanned scene, explicitly modeling inter-relationships between objects-to-objects and objects-to-layout. Since object arrangement and scene layout are intrinsically coupled, we show that treating the problem jointly significantly helps to produce globally-consistent representations of a scene. Object CAD models are aligned to the scene by establishing dense correspondences between geometry, and we introduce a hierarchical layout prediction approach to estimate layout planes from corners and edges of the scene. To this end, we propose a message-passing graph neural network to model the inter-relationships between objects and layout, guiding the generation of a globally consistent object alignment in a scene. By considering the global scene layout, we achieve significantly improved CAD alignments compared to state-of-the-art methods, improving from 41.83% to 58.41% alignment accuracy on SUNCG and from 50.05% to 61.24% on ScanNet, respectively. The resulting CAD-based representations make our method well-suited for applications in content creation such as augmented or virtual reality.
Chapter
We introduce the task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free-form description of a specified target object. To address this task, we propose ScanRefer, learning a fused descriptor from 3D object proposals and encoded sentence embeddings. This fused descriptor correlates language expressions with geometric features, enabling regression of the 3D bounding box of a target object. We also introduce the ScanRefer dataset, containing 51,583 descriptions of 11,046 objects from 800 ScanNet [8] scenes. ScanRefer is the first large-scale effort to perform object localization via natural language expression directly in 3D (Code: https://daveredrum.github.io/ScanRefer/).
Article
Polygonal meshes provide an efficient representation for 3D shapes. They explicitly capture both shape surface and topology, and leverage non-uniformity to represent large flat regions as well as sharp, intricate features. This non-uniformity and irregularity, however, inhibits mesh analysis efforts using neural networks that combine convolution and pooling operations. In this paper, we utilize the unique properties of the mesh for a direct analysis of 3D shapes using MeshCNN, a convolutional neural network designed specifically for triangular meshes. Analogous to classic CNNs, MeshCNN combines specialized convolution and pooling layers that operate on the mesh edges, by leveraging their intrinsic geodesic connections. Convolutions are applied on edges and the four edges of their incident triangles, and pooling is applied via an edge collapse operation that retains surface topology, thereby generating new mesh connectivity for the subsequent convolutions. MeshCNN learns which edges to collapse, thus forming a task-driven process where the network exposes and expands the important features while discarding the redundant ones. We demonstrate the effectiveness of MeshCNN on various learning tasks applied to 3D meshes.
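A small PyTorch sketch of the edge convolution idea: each edge is combined with order-invariant (symmetric) functions of the four edges of its two incident triangles before a learned transform. The toy connectivity and feature sizes are placeholders, and the learned edge-collapse pooling is not shown.

import torch
import torch.nn as nn

class EdgeConv(nn.Module):
    """Edge convolution over a fixed 4-edge neighborhood, using symmetric combinations
    so the result does not depend on the (ambiguous) ordering of those neighbors."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.lin = nn.Linear(5 * in_ch, out_ch)      # 5 inputs: the edge and 4 symmetric terms

    def forward(self, edge_feats, nbr_idx):
        # edge_feats: (E, C); nbr_idx: (E, 4) indices of the neighboring edges (a, b, c, d)
        a, b, c, d = (edge_feats[nbr_idx[:, i]] for i in range(4))
        e = edge_feats
        x = torch.cat([e, (a - c).abs(), a + c, (b - d).abs(), b + d], dim=1)
        return torch.relu(self.lin(x))

E = 750
feats = torch.randn(E, 5)                            # e.g., 5 geometric features per edge
nbrs = torch.randint(0, E, (E, 4))                   # toy connectivity; real indices come from the mesh
out = EdgeConv(5, 32)(feats, nbrs)                   # (750, 32)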
Article
We present an autonomous scanning approach which allows multiple robots to perform collaborative scanning for dense 3D reconstruction of unknown indoor scenes. Our method plans scanning paths for several robots, allowing them to efficiently coordinate with each other such that the collective scanning coverage and reconstruction quality is maximized while the overall scanning effort is minimized. To this end, we define the problem as a dynamic task assignment and introduce a novel formulation based on Optimal Mass Transport (OMT). Given the currently scanned scene, a set of task views are extracted to cover scene regions which are either unknown or uncertain. These task views are assigned to the robots based on the OMT optimization. We then compute for each robot a smooth path over its assigned tasks by solving an approximate traveling salesman problem. In order to showcase our algorithm, we implement a multi-robot auto-scanning system. Since our method is computationally efficient, we can easily run it in real time on commodity hardware, and combine it with online RGB-D reconstruction approaches. In our results, we show several real-world examples of large indoor environments; in addition, we build a benchmark with a series of carefully designed metrics for quantitatively evaluating multi-robot autoscanning. Overall, we are able to demonstrate high-quality scanning results with respect to reconstruction quality and scanning efficiency, which significantly outperforms existing multi-robot exploration systems.
Article
We present a novel algorithm for semantic segmentation and labeling of 3D point clouds of indoor scenes, where objects in point clouds can have significant variations and complex configurations. Effective segmentation methods decomposing point clouds into semantically meaningful pieces are highly desirable for object recognition, scene understanding, scene modeling, etc. However, existing segmentation methods based on low-level geometry tend to either under-segment or over-segment point clouds. Our method takes a fundamentally different approach, where semantic segmentation is achieved along with labeling. To cope with substantial shape variation for objects in the same category, we first segment point clouds into surface patches and use unsupervised clustering to group patches in the training set into clusters, providing an intermediate representation for effectively learning patch relationships. During testing, we propose a novel patch segmentation and classification framework with multiscale processing, where the local segmentation level is automatically determined by exploiting the learned cluster-based contextual information. Our method thus produces robust patch segmentation and semantic labeling results, avoiding parameter sensitivity. We further learn object-cluster relationships from the training set, and produce semantically meaningful object-level segmentation. Our method outperforms state-of-the-art methods on several representative point cloud datasets, including S3DIS, SceneNN, Cornell RGB-D and ETH.
Conference Paper
We introduce a learning-based method to reconstruct objects acquired in a casual handheld scanning setting with a depth camera. Our method is based on two core components. First, a deep network that provides a semantic segmentation and labeling of the frames of an input RGBD sequence. Second, an alignment and reconstruction method that employs the semantic labeling to reconstruct the acquired object from the frames. We demonstrate that the use of a semantic labeling improves the reconstructions of the objects, when compared to methods that use only the depth information of the frames. Moreover, since training a deep network requires a large amount of labeled data, a key contribution of our work is an active self-learning framework to simplify the creation of the training data. Specifically, we iteratively predict the labeling of frames with the neural network, reconstruct the object from the labeled frames, and evaluate the confidence of the labeling, to incrementally train the neural network while requiring only a small amount of user-provided annotations. We show that this method enables the creation of data for training a neural network with high accuracy, while requiring only little manual effort.
Article
Semantic segmentation partitions a given image or 3D model of a scene into semantically meaningful parts and assigns predetermined labels to the parts. With well-established datasets, deep networks have been successfully used for semantic segmentation of RGB and RGB-D images. On the other hand, due to the lack of annotated large-scale 3D datasets, semantic segmentation for 3D scenes has not yet been much addressed with deep learning. In this paper, we present a novel framework for generating semantically segmented triangular meshes of reconstructed 3D indoor scenes using volumetric semantic fusion in the reconstruction process. Our method integrates the results of CNN-based 2D semantic segmentation that is applied to the RGB-D stream used for dense surface reconstruction. To reduce the artifacts from noise and uncertainty of single-view semantic segmentation, we introduce adaptive integration for the volumetric semantic fusion and CRF-based semantic label regularization. With these methods, our framework can easily generate a high-quality triangular mesh of the reconstructed 3D scene with dense (i.e., per-vertex) semantic labels. Extensive experiments demonstrate that our semantic segmentation results of 3D scenes achieve state-of-the-art performance compared to the previous voxel-based and point cloud-based methods.
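The volumetric semantic fusion idea can be sketched as a per-voxel, confidence-weighted running average of class probabilities projected from 2D segmentations; the paper's adaptive weighting scheme and CRF regularization are omitted, and the interface below is an assumption.

import numpy as np

class SemanticVolume:
    """Running per-voxel class-probability estimate fused from projected 2D predictions."""
    def __init__(self, num_voxels, num_classes):
        self.prob = np.full((num_voxels, num_classes), 1.0 / num_classes)   # uniform prior
        self.weight = np.zeros(num_voxels)

    def fuse(self, voxel_idx, frame_prob, frame_weight=1.0):
        """Weighted running average; frame_weight can down-weight uncertain single views."""
        w_old = self.weight[voxel_idx][:, None]
        self.prob[voxel_idx] = (w_old * self.prob[voxel_idx] + frame_weight * frame_prob) / (w_old + frame_weight)
        self.weight[voxel_idx] += frame_weight

    def labels(self):
        return self.prob.argmax(1)               # per-voxel labels, later transferred to mesh vertices

vol = SemanticVolume(num_voxels=100000, num_classes=13)
idx = np.array([10, 11, 12])                     # voxels hit by the current frame
p = np.array([[0.7, *([0.3 / 12] * 12)]] * 3)    # per-voxel class probabilities from one 2D frame
vol.fuse(idx, p)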
Chapter
We present 3DMV, a novel method for 3D semantic scene segmentation of RGB-D scans in indoor environments using a joint 3D-multi-view prediction network. In contrast to existing methods that either use geometry or RGB data as input for this task, we combine both data modalities in a joint, end-to-end network architecture. Rather than simply projecting color data into a volumetric grid and operating solely in 3D – which would result in insufficient detail – we first extract feature maps from associated RGB images. These features are then mapped into the volumetric feature grid of a 3D network using a differentiable back-projection layer. Since our target is 3D scanning scenarios with possibly many frames, we use a multi-view pooling approach in order to handle a varying number of RGB input views. This learned combination of RGB and geometric features with our joint 2D-3D architecture achieves significantly better results than existing baselines. For instance, our final result on the ScanNet 3D segmentation benchmark increases from 52.8% to 75% accuracy compared to existing volumetric architectures.
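A forward-only NumPy sketch of the back-projection and multi-view pooling step: each voxel center is projected into every view, the 2D feature at that pixel is sampled, and features are max-pooled across views. The actual layer is differentiable and trained end-to-end; the camera and feature maps here are placeholders.

import numpy as np

def backproject_features(voxel_centers, views, feat_dim):
    """Map 2D CNN feature maps onto a voxel grid and max-pool across views."""
    vox_feat = np.full((len(voxel_centers), feat_dim), -np.inf)
    for K, R, t, fmap in views:                   # fmap: (H, W, feat_dim) from a 2D encoder
        cam = voxel_centers @ R.T + t             # world -> camera coordinates
        z = np.maximum(cam[:, 2], 1e-6)
        u = (K[0, 0] * cam[:, 0] / z + K[0, 2]).astype(int)
        v = (K[1, 1] * cam[:, 1] / z + K[1, 2]).astype(int)
        H, W, _ = fmap.shape
        ok = (cam[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        vox_feat[ok] = np.maximum(vox_feat[ok], fmap[v[ok], u[ok]])   # multi-view pooling
    vox_feat[np.isinf(vox_feat)] = 0.0            # voxels seen by no view fall back to zeros
    return vox_feat

centers = np.random.rand(4096, 3)                 # toy voxel centers
K = np.array([[300.0, 0, 160.0], [0, 300.0, 120.0], [0, 0, 1.0]])
views = [(K, np.eye(3), np.array([0, 0, 2.0]), np.random.rand(240, 320, 64))]
feats = backproject_features(centers, views, feat_dim=64)            # input to the 3D network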
Preprint
To carry out autonomous 3D scanning and online reconstruction of unknown indoor scenes, one has to find a balance between global exploration of the entire scene and local scanning of the objects within it. In this work, we propose a novel approach, which provides object-aware guidance for autoscanning, for exploring, reconstructing, and understanding an unknown scene within one navigation pass. Our approach interleaves between object analysis to identify the next best object (NBO) for global exploration, and object-aware information gain analysis to plan the next best view (NBV) for local scanning. First, an objectness-based segmentation method is introduced to extract semantic objects from the current scene surface via a multi-class graph cuts minimization. Then, an object of interest (OOI) is identified as the NBO which the robot aims to visit and scan. The robot then conducts fine scanning on the OOI with views determined by the NBV strategy. When the OOI is recognized as a full object, it can be replaced by its most similar 3D model in a shape database. The algorithm iterates until all of the objects are recognized and reconstructed in the scene. Various experiments and comparisons have shown the feasibility of our proposed approach.
Article
Real-time, high-quality, 3D scanning of large-scale scenes is key to mixed reality and robotic applications. However, scalability brings challenges of drift in pose estimation, introducing significant errors in the accumulated model. Approaches often require hours of offline processing to globally correct model errors. Recent online methods demonstrate compelling results but suffer from (1) needing minutes to perform online correction, preventing true real-time use; (2) brittle frame-to-frame (or frame-to-model) pose estimation, resulting in many tracking failures; or (3) supporting only unstructured point-based representations, which limit scan quality and applicability. We systematically address these issues with a novel, real-time, end-to-end reconstruction framework. At its core is a robust pose estimation strategy, optimizing per frame for a global set of camera poses by considering the complete history of RGB-D input with an efficient hierarchical approach. We remove the heavy reliance on temporal tracking and continually localize to the globally optimized frames instead. We contribute a parallelizable optimization framework, which employs correspondences based on sparse features and dense geometric and photometric matching. Our approach estimates globally optimized (i.e., bundle adjusted) poses in real time, supports robust tracking with recovery from gross tracking failures (i.e., relocalization), and re-estimates the 3D model in real time to ensure global consistency, all within a single framework. Our approach outperforms state-of-the-art online systems with quality on par to offline methods, but with unprecedented speed and scan completeness. Our framework leads to a comprehensive online scanning solution for large indoor environments, enabling ease of use and high-quality results.
Article
Convolutional networks are the de-facto standard for analyzing spatio-temporal data such as images, videos, and 3D shapes. Whilst some of this data is naturally dense (e.g., photos), many other data sources are inherently sparse. Examples include 3D point clouds that were obtained using a LiDAR scanner or RGB-D camera. Standard "dense" implementations of convolutional networks are very inefficient when applied on such sparse data. We introduce new sparse convolutional operations that are designed to process spatially-sparse data more efficiently, and use them to develop spatially-sparse convolutional networks. We demonstrate the strong performance of the resulting models, called submanifold sparse convolutional networks (SSCNs), on two tasks involving semantic segmentation of 3D point clouds. In particular, our models outperform all prior state-of-the-art on the test set of a recent semantic segmentation competition.
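A naive, dictionary-based Python sketch of a submanifold sparse convolution: outputs are produced only at the active input sites, and each output gathers contributions only from active neighbors, so the active set never dilates as layers are stacked. Production implementations use hash tables and GPU gather/scatter kernels.

import numpy as np

def submanifold_conv3d(coords, feats, weights):
    """3x3x3 sparse convolution evaluated only at active sites, reading only active neighbors."""
    # coords: (N, 3) int voxel coordinates; feats: (N, C_in); weights: (3, 3, 3, C_in, C_out)
    lut = {tuple(c): i for i, c in enumerate(coords)}          # hash map of active sites
    out = np.zeros((len(coords), weights.shape[-1]))
    for i, c in enumerate(coords):
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    j = lut.get((c[0] + dx, c[1] + dy, c[2] + dz))
                    if j is not None:                          # empty space is skipped entirely
                        out[i] += feats[j] @ weights[dx + 1, dy + 1, dz + 1]
    return out

rng = np.random.default_rng(0)
coords = np.unique(rng.integers(0, 20, size=(500, 3)), axis=0)  # sparse set of active voxels
feats = rng.standard_normal((len(coords), 8))
W = rng.standard_normal((3, 3, 3, 8, 16)) * 0.1
y = submanifold_conv3d(coords, feats, W)                        # (num_active, 16)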
Article
This paper focuses on semantic scene completion, a task for producing a complete 3D voxel representation of volumetric occupancy and semantic labels for a scene from a single-view depth map observation. Previous work has considered scene completion and semantic labeling of depth maps separately. However, we observe that these two problems are tightly intertwined. To leverage the coupled nature of these two tasks, we introduce the semantic scene completion network (SSCNet), an end-to-end 3D convolutional network that takes a single depth image as input and simultaneously outputs occupancy and semantic labels for all voxels in the camera view frustum. Our network uses a dilation-based 3D context module to efficiently expand the receptive field and enable 3D context learning. To train our network, we construct SUNCG - a manually created large-scale dataset of synthetic 3D scenes with dense volumetric annotations. Our experiments demonstrate that the joint model outperforms methods addressing each task in isolation and outperforms alternative approaches on the semantic scene completion task.
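The dilation-based context idea can be sketched in PyTorch as a stack of 3D convolutions with growing dilation, which enlarges the receptive field without reducing resolution; the channel counts and residual wiring below are illustrative, not SSCNet's exact architecture.

import torch
import torch.nn as nn

class DilatedContext3D(nn.Module):
    """Stack of dilated 3D convolutions: the receptive field grows quickly without
    adding resolution loss, which helps capture scene-level context."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv3d(channels, channels, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv3d(channels, channels, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
        )

    def forward(self, x):              # x: (batch, channels, D, H, W) voxel features
        return x + self.block(x)       # residual connection keeps local detail

vol = torch.randn(1, 16, 32, 32, 32)   # e.g., encoded occupancy/TSDF features
out = DilatedContext3D(16)(vol)        # same shape, much larger receptive field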
Article
We address the problem of autonomously exploring unknown objects in a scene by consecutive depth acquisitions. The goal is to reconstruct the scene while online identifying the objects from among a large collection of 3D shapes. Fine-grained shape identification demands a meticulous series of observations attending to varying views and parts of the object of interest. Inspired by the recent success of attention-based models for 2D recognition, we develop a 3D Attention Model that selects the best views to scan from, as well as the most informative regions in each view to focus on, to achieve efficient object recognition. The region-level attention leads to focus-driven features which are quite robust against object occlusion. The attention model, trained with the 3D shape collection, encodes the temporal dependencies among consecutive views with deep recurrent networks. This facilitates order-aware view planning accounting for robot movement cost. In achieving instance identification, the shape collection is organized into a hierarchy, associated with pre-trained hierarchical classifiers. The effectiveness of our method is demonstrated on an autonomous robot (PR) that explores a scene and identifies the objects to construct a 3D scene model.
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.
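A toy PyTorch sketch of the fully convolutional skip pattern: coarse, semantically strong scores are upsampled and summed with scores from a shallower, higher-resolution layer, then upsampled to full resolution; the backbone here is deliberately tiny and not the VGG-based networks used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Fully convolutional segmentation with one skip connection (the FCN-16s pattern,
    reduced to a toy two-stage backbone)."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.score_shallow = nn.Conv2d(32, num_classes, 1)   # fine but semantically weak scores
        self.score_deep = nn.Conv2d(64, num_classes, 1)      # coarse but semantically strong scores

    def forward(self, x):
        s1 = self.stage1(x)                                  # 1/2 resolution
        s2 = self.stage2(s1)                                 # 1/4 resolution
        deep = F.interpolate(self.score_deep(s2), scale_factor=2, mode='bilinear', align_corners=False)
        fused = deep + self.score_shallow(s1)                # skip connection combines both
        return F.interpolate(fused, size=x.shape[2:], mode='bilinear', align_corners=False)

img = torch.randn(1, 3, 128, 128)
logits = TinyFCN()(img)                                      # (1, 21, 128, 128) dense per-pixel scores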