Conference Paper

Retro-FPN: Retrospective Feature Pyramid Network for Point Cloud Semantic Segmentation

... Additionally, several methods [18,57,58] have projected point features into regular features to use more mature 2D transformer technologies. Other approaches [20,59,60,61] have extended the 2D transformer architecture to 3D. For instance, ST [20] adopted the Swin Transformer, facilitating cross-window communication for successive windows and proposing a stratified strategy to extend the receptive field. ...
... Figure 5 (panels: Ground Truth, Ours, Retro-FPN, ST). Visual comparison of our model with other methods [20,59] on ScanNetv2. Note that black indicates ignored labels, and differences in semantic segmentation results are highlighted with red boxes for clarity. ...
Article
Full-text available
Recently, significant advances have been made in 3D point cloud analysis by leveraging transformer architectures in 3D space. However, it remains challenging to effectively implement local and global learning within the irregular and sparse structure of 3D point clouds. This paper presents the Adaptive Interaction Transformer (AIFormer), a novel hierarchical transformer architecture designed to enhance 3D point cloud analysis by fusing local and global features through adaptive feature interaction. Specifically, AIFormer consists mainly of several stacked AIFormer Blocks. Each block employs a Local Relation Aggregation Module and a Global Context Aggregation Module to extract, respectively, local relational details around each reference point and long-range dependencies between reference points. The local and global features are then fused by the Adaptive Interaction Module to optimize the point representation. Additionally, the AIFormer Block designs geometric relation functions and contextual relative semantic encoding to enhance local and global feature extraction, respectively. Extensive experiments on three popular 3D point cloud datasets verify that AIFormer achieves state-of-the-art or comparable performance. Our comprehensive ablation study further validates the effectiveness and soundness of the AIFormer design.
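The adaptive interaction described above can be made concrete with a small gated-fusion sketch; this is only one plausible reading of the abstract, not the authors' actual Adaptive Interaction Module, and the tensor shapes and channel width are assumptions:

```python
import torch
import torch.nn as nn

class AdaptiveInteraction(nn.Module):
    """Illustrative gated fusion of local and global per-point features.

    Not the authors' exact Adaptive Interaction Module; it only sketches the
    idea of weighing local detail against global context per point and channel.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * channels, channels), nn.Sigmoid())

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # local_feat, global_feat: (B, N, C) per-point features
        g = self.gate(torch.cat([local_feat, global_feat], dim=-1))  # (B, N, C) in [0, 1]
        return g * local_feat + (1.0 - g) * global_feat

# Usage: fuse = AdaptiveInteraction(64); out = fuse(torch.rand(2, 1024, 64), torch.rand(2, 1024, 64))
```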
... Neural implicit representations have made huge progress in various tasks [22,23,25,29,38,67,70,91,92,94]; they can be learned under different forms of supervision, such as multi-view images [13,17,69,85,87] and point clouds [4, 5, 24, 34-37, 43, 44, 90]. In the following, we focus on reviewing works on learning implicit representations from multi-view images. ...
Preprint
Full-text available
Monocular depth priors have been widely adopted by neural rendering in multi-view based tasks such as 3D reconstruction and novel view synthesis. However, due to the inconsistent prediction on each view, how to more effectively leverage monocular cues in a multi-view context remains a challenge. Current methods treat the entire estimated depth map indiscriminately, and use it as ground truth supervision, while ignoring the inherent inaccuracy and cross-view inconsistency in monocular priors. To resolve these issues, we propose MonoInstance, a general approach that explores the uncertainty of monocular depths to provide enhanced geometric priors for neural rendering and reconstruction. Our key insight lies in aligning each segmented instance depths from multiple views within a common 3D space, thereby casting the uncertainty estimation of monocular depths into a density measure within noisy point clouds. For high-uncertainty areas where depth priors are unreliable, we further introduce a constraint term that encourages the projected instances to align with corresponding instance masks on nearby views. MonoInstance is a versatile strategy which can be seamlessly integrated into various multi-view neural rendering frameworks. Our experimental results demonstrate that MonoInstance significantly improves the performance in both reconstruction and novel view synthesis under various benchmarks.
... Gong et al. addressed the limitations of FPN in small target detection by proposing a fusion factor estimated through statistical methods [26], which controls the information transfer from deep to shallow layers to adapt FPN to small targets. Xiang et al. proposed a retrospective feature pyramid network, Retro-FPN [27], which introduces a retro-transformer [41] and effectively extracts semantic features for each point through an explicit, retrospective feature refinement process. ...
Preprint
Current state-of-the-art vision models often utilize feature pyramids to extract multi-scale information, with the Feature Pyramid Network (FPN) being one of the most widely used classic architectures. However, traditional FPNs and their variants (e.g., AUGFPN, PAFPN) fail to fully address spatial misalignment on a global scale, leading to suboptimal performance in high-precision localization of objects. In this paper, we propose a novel Bidirectional Alignment Feature Pyramid Network (BAFPN), which aligns misaligned features globally through a Spatial Feature Alignment Module (SPAM) during the bottom-up information propagation phase. Subsequently, it further mitigates aliasing effects caused by cross-scale feature fusion via a fine-grained Semantic Alignment Module (SEAM) in the top-down phase. On the DOTAv1.5 dataset, BAFPN improves the baseline model's AP75, AP50, and mAP by 1.68%, 1.45%, and 1.34%, respectively. Additionally, BAFPN demonstrates significant performance gains when applied to various other advanced detectors.
... The top-down pathway involves upsampling and fusing higher-level feature maps with lower-level feature maps, while the bottom-up pathway involves extracting features at different scales. FPN enables robust feature representation and the detection of objects or structures at different sizes, making it highly effective in object detection, semantic segmentation, and instance segmentation [31,32]. SCARNet introduces the FPN structure in its encoder part to enhance the model's perception of multi-scale features and improve its ability to represent different patterns and trends in time series. ...
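The top-down/bottom-up mechanism described above is straightforward to sketch. A minimal FPN-style top-down pass in PyTorch (channel counts and layer choices are illustrative only):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Minimal top-down feature pyramid: lateral 1x1 convs + upsample-and-add."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):
        # feats: bottom-up maps, highest resolution first, e.g. [C3, C4, C5]
        laterals = [l(f) for l, f in zip(self.laterals, feats)]
        # top-down pathway: upsample the coarser map and add it to the finer lateral
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]  # [P3, P4, P5]
```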
Article
Full-text available
Time series forecasting tasks are important in practical scenarios as they can be applied in various fields such as economics, meteorology, and transportation. However, there are still challenges when applying methods based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to time series forecasting. These challenges include limitations in gradient propagation, handling long-range dependencies, and ensuring stability in the prediction results. In this paper, we propose a deep learning forecasting model called SCARNet (Stacked Convolution Sequence Autoregressive Encoding Network) to address these challenges. The SCARNet adopts an encoder-decoder structure and utilizes one-dimensional convolution to achieve autoregressive-like computations. This model can extract deeper-level information from time series, including trend components, periodic components, and white noise. Specifically, SCARNet employs a pyramid-stacked convolutional structure as the encoder for feature extraction and utilizes fully connected layers as the decoder for prediction. We evaluate the proposed model on a private and two mainstream public datasets. Experimental results demonstrate that the SCARNet model outperforms existing models in single-step prediction (with RMSE values of 0.0632, 0.6901, and 0.5416 for the three datasets, respectively) and achieves performance close to the state-of-the-art in medium-term and short-term multi-step prediction (Friedman Test at a significance level of 5% confirms the superiority of SCARNet). Additionally, we conduct ablation studies to validate the effectiveness of each component and verify the efficiency and efficacy of the proposed method.
... With the rapid development of deep learning, neural networks have shown great potential in 3D applications [67,30,80,76,79,75,91,31,81,77,14,40]. We mainly focus on learning Neural Implicit Functions with networks for representing 3D shapes or scenes. ...
Preprint
Full-text available
The latest methods represent shapes with open surfaces using unsigned distance functions (UDFs). They train neural networks to learn UDFs and reconstruct surfaces with the gradients around the zero level set of the UDF. However, the differentiable networks struggle to learn the zero level set, where the UDF is not differentiable, which leads to large errors in unsigned distances and gradients around the zero level set, resulting in highly fragmented and discontinuous surfaces. To resolve this problem, we propose to learn a more continuous zero level set in UDFs with level set projections. Our insight is to guide the learning of the zero level set using the remaining non-zero level sets via a projection procedure. Our idea is inspired by the observation that the non-zero level sets are much smoother and more continuous than the zero level set. We pull the non-zero level sets onto the zero level set with gradient constraints that align gradients over different level sets and correct unsigned distance errors on the zero level set, leading to a smoother and more continuous unsigned distance field. We conduct comprehensive experiments on surface reconstruction for point clouds, real scans, and depth maps, and further explore the performance in unsupervised point cloud upsampling and unsupervised point normal estimation with the learned UDF, which demonstrate our non-trivial improvements over state-of-the-art methods. Code is available at https://github.com/junshengzhou/LevelSetUDF .
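The projection idea in this abstract (pulling queries on non-zero level sets toward the zero level set along the negative gradient) can be sketched as follows; `udf_net` stands for any differentiable UDF network, and the loss terms are illustrative rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def project_to_zero_level_set(udf_net, queries):
    """One projection step: q -> q - f(q) * grad f(q) / |grad f(q)|."""
    q = queries.clone().requires_grad_(True)
    d = udf_net(q)                                        # (N,) unsigned distances
    grad = torch.autograd.grad(d.sum(), q, create_graph=True)[0]
    n = F.normalize(grad, dim=-1)
    return q - d.unsqueeze(-1) * n, n

def level_set_alignment_loss(udf_net, queries, surface_points):
    """Illustrative constraint: projected queries should land on the surface, and
    gradients at a query and at its projection should agree."""
    proj, n_q = project_to_zero_level_set(udf_net, queries)
    # Pull projections toward observed surface points
    dist = torch.cdist(proj, surface_points).min(dim=-1).values.mean()
    # Gradient alignment between a non-zero level set and its projection
    p = proj.detach().clone().requires_grad_(True)
    d_p = udf_net(p)
    grad_p = torch.autograd.grad(d_p.sum(), p, create_graph=True)[0]
    align = (1.0 - F.cosine_similarity(n_q, grad_p, dim=-1)).mean()
    return dist + align
```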
Article
Full-text available
Travelable area boundaries not only constrain the movement of field robots but also indicate alternative guiding routes for dynamic objects. Publicly available road boundary datasets outline boundaries with binary segmentation labels. However, heavy post-processing is required to extract further semantics from the detected boundaries, including boundary shapes and guiding routes, which poses challenges for real-time visual navigation systems without detailed prior maps. In addition, boundary detectors suffer from insufficient data collected from complex roads with severe occlusion and varied shapes. In this paper, a travelable area boundary dataset is semi-automatically built. 82.05% of the data is collected from bends, crossroads, T-shaped roads, and other irregular roads. Novel guiding semantics labels, shape labels, and scene complexity labels are assigned to the boundaries. With the support of the new dataset, travelable area boundary detectors can be trained, evaluated, and fairly compared. The dataset can also be used to train, evaluate, or test detectors for the road boundary detection task.
Article
2D-3D joint learning is essential and effective for fundamental 3D vision tasks, such as 3D semantic segmentation, due to the complementary information these two visual modalities contain. Most current 3D scene semantic segmentation methods process 2D images “as they are”, i.e., only real captured 2D images are used. However, such captured 2D images may be redundant, with abundant occlusion and/or limited field of view (FoV), leading to poor performance for the current methods involving 2D inputs. In this paper, we propose a general learning framework for joint 2D-3D scene understanding by selecting informative virtual 2D views of the underlying 3D scene. We then feed both the 3D geometry and the generated virtual 2D views into any joint 2D-3D-input or pure 3D-input based deep neural models for improving 3D scene understanding. Specifically, we generate virtual 2D views based on an information score map learned from the current 3D scene semantic segmentation results. To achieve this, we formalize the learning of the information score map as a deep reinforcement learning process, which rewards good predictions using a deep neural network. To obtain a compact set of virtual 2D views that jointly cover informative surfaces of the 3D scene as much as possible, we further propose an efficient greedy virtual view coverage strategy in the normal-sensitive 6D space, including 3-dimensional point coordinates and 3-dimensional normal. We have validated our proposed framework for various joint 2D-3D-input or pure 3D-input based deep neural models on two real-world 3D scene datasets, i.e., ScanNet v2 [1] and S3DIS [2], and the results demonstrate that our method obtains a consistent gain over baseline models and achieves new top accuracy for joint 2D and 3D scene semantic segmentation. Code is available at https://github.com/smy-THU/VirtualViewSelection.
Article
Full-text available
Deep learning-based point cloud semantic segmentation has gained popularity over time, with sparse convolution being the most prominent example. Although sparse convolution is more efficient than regular convolution, it comes with the drawback of sacrificing global context information. To solve this problem, this paper proposes the OcspareNet network, which uses sparse convolution as the backbone and captures global contextual information using an offset attention module and a context aggregation module. The offset attention module improves the network’s capacity to obtain global contextual information about the point cloud. The context aggregation module utilizes contextual information in both the training and testing phases, which strengthens the network’s ability to discern the overall structure and improves accuracy on difficult segmentation categories. Compared to state-of-the-art (SOTA) models, our model has a smaller parameter count and achieves higher accuracy on challenging segmentation categories such as ‘pictures’, ‘counters’, and ‘desks’ in the ScanNetV2 dataset, with IoU scores of 41.1%, 70.3%, and 72.5%, respectively. Furthermore, ablation experiments confirmed the efficacy of our designed modules.
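Offset attention, as mentioned above, can be sketched in a simplified single-head form; this follows the general offset-attention idea rather than the exact OcspareNet module, and the channel sizes are assumptions:

```python
import torch
import torch.nn as nn

class OffsetAttention(nn.Module):
    """Simplified single-head offset attention over per-point features (B, N, C)."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Linear(channels, channels // 4, bias=False)
        self.k = nn.Linear(channels, channels // 4, bias=False)
        self.v = nn.Linear(channels, channels, bias=False)
        self.post = nn.Sequential(nn.Linear(channels, channels),
                                  nn.BatchNorm1d(channels), nn.ReLU())

    def forward(self, x):
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(1, 2) / (k.shape[-1] ** 0.5), dim=-1)
        attended = attn @ v                       # (B, N, C)
        offset = x - attended                     # "offset" between input and attention output
        B, N, C = offset.shape
        out = self.post(offset.reshape(B * N, C)).reshape(B, N, C)
        return x + out                            # residual connection
```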
Conference Paper
Full-text available
In recent years, huge progress has been made on learning neural implicit representations from multi-view images for 3D reconstruction. As an additional input complementing coordinates, using sinusoidal functions as positional encodings plays a key role in revealing high-frequency details with coordinate-based neural networks. However, high-frequency positional encodings make the optimization unstable, which results in noisy reconstructions and artifacts in empty space. To resolve this issue in a general sense, we propose to learn neural implicit representations with quantized coordinates, which reduces the uncertainty and ambiguity in the field during optimization. Instead of continuous coordinates, we discretize continuous coordinates into discrete coordinates using nearest interpolation among quantized coordinates, which are obtained by discretizing the field at an extremely high resolution. We use the discrete coordinates and their positional encodings to learn implicit functions through volume rendering. This significantly reduces the variations in the sample space and triggers more multi-view consistency constraints on intersections of rays from different views, which enables inferring the implicit function more effectively. Our quantized coordinates do not bring any computational burden and can seamlessly work with the latest methods. Our evaluations under widely used benchmarks show our superiority over the state-of-the-art. Our code is available at https://github.com/MachinePerceptionLab/CQ-NIR.
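The coordinate quantization described above amounts to snapping sample coordinates onto a very fine grid before positional encoding. A hedged sketch (the grid resolution and frequency count are arbitrary illustrative choices, not the paper's settings):

```python
import torch

def quantize_coords(x: torch.Tensor, resolution: int = 1 << 14) -> torch.Tensor:
    """Snap continuous coordinates in [-1, 1] to the nearest node of a dense grid.

    Nearest interpolation among quantized coordinates reduces to rounding onto
    the grid; the resolution here is an illustrative choice only.
    """
    return torch.round((x * 0.5 + 0.5) * (resolution - 1)) / (resolution - 1) * 2.0 - 1.0

def positional_encoding(x: torch.Tensor, num_freqs: int = 6) -> torch.Tensor:
    """Standard sinusoidal positional encoding applied to (quantized) coordinates."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * torch.pi
    angles = x.unsqueeze(-1) * freqs               # (..., 3, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return torch.cat([x, enc.flatten(start_dim=-2)], dim=-1)

# Usage: pe = positional_encoding(quantize_coords(samples_along_rays))
```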
Article
Full-text available
Normal estimation for unstructured point clouds is an important task in 3D computer vision. Current methods achieve encouraging results by mapping local patches to normal vectors or learning local surface fitting using neural networks. However, these methods do not generalize well to unseen scenarios and are sensitive to parameter settings. To resolve these issues, we propose an implicit function that learns an angle field around the normal of each point in the spherical coordinate system, which we dub Neural Angle Fields (NeAF). Instead of directly predicting the normal of an input point, we predict the angle offset between the ground-truth normal and a randomly sampled query normal. This strategy pushes the network to observe more diverse samples, which leads to higher prediction accuracy in a more robust manner. To predict normals from the learned angle fields at inference time, we randomly sample query vectors in a unit spherical space and take the vectors with minimal angle values as the predicted normals. To further leverage the prior learned by NeAF, we propose to refine the predicted normal vectors by minimizing the angle offsets. The experimental results on synthetic data and real scans show significant improvements over the state-of-the-art under widely used benchmarks. Project page: https://lisj575.github.io/NeAF/.
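The inference procedure described above (sample random query normals, keep the one with the smallest predicted angle offset) can be sketched as follows; `angle_net` is a hypothetical stand-in for the trained NeAF network:

```python
import torch
import torch.nn.functional as F

def predict_normal(angle_net, patch_feature, num_queries: int = 512):
    """Sketch of NeAF-style inference: sample random unit query vectors, let a
    network predict the angle offset to the true normal for each query, and
    keep the query with the smallest predicted offset.

    `angle_net(patch_feature, queries)` is a hypothetical callable returning a
    (num_queries,) tensor of predicted angle offsets.
    """
    queries = F.normalize(torch.randn(num_queries, 3), dim=-1)   # candidates on the unit sphere
    offsets = angle_net(patch_feature, queries)                  # (num_queries,)
    return queries[offsets.argmin()]                             # best candidate normal
```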
Article
Full-text available
The irregular domain and lack of ordering make it challenging to design deep neural networks for point cloud processing. This paper presents a novel framework named Point Cloud Transformer (PCT) for point cloud learning. PCT is based on Transformer, which achieves huge success in natural language processing and displays great potential in image processing. It is inherently permutation invariant for processing a sequence of points, making it well-suited for point cloud learning. To better capture local context within the point cloud, we enhance input embedding with the support of farthest point sampling and nearest neighbor search. Extensive experiments demonstrate that the PCT achieves the state-of-the-art performance on shape classification, part segmentation, semantic segmentation, and normal estimation tasks.
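The input embedding described above relies on two standard building blocks, farthest point sampling and k-nearest-neighbor grouping, which can be sketched directly (sizes in the usage comment are illustrative):

```python
import torch

def farthest_point_sampling(xyz: torch.Tensor, m: int) -> torch.Tensor:
    """Greedy FPS over a single point cloud xyz of shape (N, 3); returns m indices."""
    n = xyz.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    farthest = torch.randint(0, n, (1,)).item()
    for i in range(m):
        idx[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(dim=-1)
        dist = torch.minimum(dist, d)          # distance to the closest selected point
        farthest = int(dist.argmax())          # next center: farthest from all selected
    return idx

def knn_group(xyz: torch.Tensor, centers: torch.Tensor, k: int) -> torch.Tensor:
    """Indices of the k nearest neighbors of each center point."""
    d = torch.cdist(centers, xyz)              # (M, N)
    return d.topk(k, dim=-1, largest=False).indices

# Usage (illustrative sizes): centers = xyz[farthest_point_sampling(xyz, 512)]
# groups = xyz[knn_group(xyz, centers, k=32)]  # (512, 32, 3) local neighborhoods
```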
Article
Learning signed distance functions (SDFs) from point clouds is an important task in 3D computer vision. However, without ground-truth signed distances, point normals, or clean point clouds, current methods still struggle to learn SDFs from noisy point clouds. To overcome this challenge, we propose to learn SDFs via a noise-to-noise mapping, which does not require any clean point cloud or ground-truth supervision. Our novelty lies in the noise-to-noise mapping, which can infer a highly accurate SDF of a single object or scene from multiple or even a single noisy observation. We achieve this with a novel loss that enables statistical reasoning on point clouds and maintains geometric consistency even though point clouds are irregular, unordered, and have no point correspondence among noisy observations. To accelerate training, we use multi-resolution hash encodings implemented in CUDA in our framework, which reduces our training time by a factor of ten, achieving convergence within one minute. We further introduce a novel scheme to improve multi-view reconstruction by estimating SDFs as a prior. Our evaluations under widely used benchmarks demonstrate our superiority over state-of-the-art methods in surface reconstruction from point clouds or multi-view images, point cloud denoising, and upsampling.
Article
Surface reconstruction for point clouds is an important task in 3D computer vision. Most of the latest methods resolve this problem by learning signed distance functions from point clouds, which are limited to reconstructing closed surfaces. Some other methods tried to represent open surfaces using unsigned distance functions (UDFs) learned from ground-truth distances. However, the learned UDF struggles to provide smooth distance fields due to the discontinuous character of point clouds. In this paper, we propose CAP-UDF, a novel method to learn consistency-aware UDFs from raw point clouds. We achieve this by learning to move queries onto the surface with a field consistency constraint, while also progressively estimating a more accurate surface. Specifically, we train a neural network to gradually infer the relationship between queries and the approximated surface by searching for the moving target of queries in a dynamic way. Meanwhile, we introduce a polygonization algorithm to extract surfaces using the gradients of the learned UDF. We conduct comprehensive experiments on surface reconstruction for point clouds, real scans, and depth maps, and further explore our performance in unsupervised point normal estimation, which demonstrate non-trivial improvements of CAP-UDF over the state-of-the-art methods.
Article
Learning radiance fields has shown remarkable results for novel view synthesis. The learning procedure usually takes a long time, which motivates the latest methods to speed it up by learning without neural networks or by using more efficient data structures. However, these specially designed approaches do not work for most radiance-field-based methods. To resolve this issue, we introduce a general strategy to speed up the learning procedure for almost all radiance-field-based methods. Our key idea is to reduce redundancy by shooting far fewer rays in the multi-view volume rendering procedure, which is the base of almost all radiance-field-based methods. We find that shooting rays at pixels with dramatic color change not only significantly reduces the training burden but also barely affects the accuracy of the learned radiance fields. In addition, we adaptively subdivide each view into a quadtree according to the average rendering error in each node of the tree, which lets us dynamically shoot more rays in more complex regions with larger rendering errors. We evaluate our method with different radiance-field-based methods under widely used benchmarks. Experimental results show that our method achieves comparable accuracy to the state-of-the-art with much faster training.
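The "shoot rays where color changes" idea can be sketched as importance sampling over the image gradient magnitude; this omits the quadtree subdivision and is not the paper's exact sampler:

```python
import numpy as np

def sample_ray_pixels(image: np.ndarray, num_rays: int, eps: float = 1e-3) -> np.ndarray:
    """Sample pixel indices with probability proportional to local color change.

    `image` is an (H, W, 3) float array; returns (num_rays, 2) integer (row, col)
    coordinates. Only a sketch of gradient-weighted ray selection.
    """
    gray = image.mean(axis=-1)
    gy, gx = np.gradient(gray)
    mag = np.sqrt(gx ** 2 + gy ** 2) + eps          # eps avoids zero-probability pixels
    prob = (mag / mag.sum()).ravel()
    flat = np.random.choice(prob.size, size=num_rays, replace=False, p=prob)
    return np.stack(np.unravel_index(flat, gray.shape), axis=-1)
```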
Chapter
MLP-Mixer has recently emerged as a new challenger to CNNs and Transformers. Despite its simplicity compared to the Transformer, the concept of channel-mixing MLPs and token-mixing MLPs achieves noticeable performance in image recognition tasks. Unlike images, point clouds are inherently sparse, unordered, and irregular, which limits the direct use of MLP-Mixer for point cloud understanding. To overcome these limitations, we propose PointMixer, a universal point set operator that facilitates information sharing among unstructured 3D point clouds. By simply replacing token-mixing MLPs with a Softmax function, PointMixer can “mix” features within/between point sets. By doing so, PointMixer can be broadly used for intra-set, inter-set, and hierarchical-set mixing. We demonstrate that channel-wise feature aggregation across numerous point sets outperforms self-attention layers or dense token-wise interaction in terms of parameter efficiency and accuracy. Extensive experiments show the competitive or superior performance of PointMixer in semantic segmentation, classification, and reconstruction against Transformer-based methods.
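The softmax-based mixing described above can be sketched as follows; this is an illustrative intra-set mixing block, not the full PointMixer operator, and the layer widths are assumptions:

```python
import torch
import torch.nn as nn

class SoftmaxPointMixing(nn.Module):
    """Sketch of softmax-based intra-set mixing: instead of a token-mixing MLP,
    each neighbor gets a learned score, scores are normalized with softmax over
    the neighborhood, and features are aggregated as a weighted sum."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Linear(channels, 1)
        self.value = nn.Linear(channels, channels)

    def forward(self, neighbor_feat: torch.Tensor) -> torch.Tensor:
        # neighbor_feat: (B, N, K, C) features of K neighbors per query point
        w = torch.softmax(self.score(neighbor_feat), dim=2)      # (B, N, K, 1)
        return (w * self.value(neighbor_feat)).sum(dim=2)        # (B, N, C)
```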
Article
Most existing point cloud completion methods suffer from the discrete nature of point clouds and the unstructured prediction of points in local regions, which makes it difficult to reveal fine local geometric details. To resolve this issue, we propose SnowflakeNet with snowflake point deconvolution (SPD) to generate complete point clouds. SPD models the generation of point clouds as the snowflake-like growth of points, where child points are generated progressively by splitting their parent points after each SPD. Our insight into the detailed geometry is to introduce a skip-transformer in the SPD to learn the point splitting patterns that can best fit the local regions. The skip-transformer leverages attention mechanism to summarize the splitting patterns used in the previous SPD layer to produce the splitting in the current layer. The locally compact and structured point clouds generated by SPD precisely reveal the structural characteristics of the 3D shape in local patches, which enables us to predict highly detailed geometries. Moreover, since SPD is a general operation that is not limited to completion, we explore its applications in other generative tasks, including point cloud auto-encoding, generation, single image reconstruction, and upsampling. Our experimental results outperform state-of-the-art methods under widely used benchmarks.
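The snowflake point deconvolution step can be sketched as duplicating parent points and predicting per-child offsets; the skip-transformer conditioning is omitted here, and the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class SnowflakeSplit(nn.Module):
    """Sketch of snowflake-style point splitting: each parent point is duplicated
    `up_factor` times and a small network predicts a displacement for every child."""
    def __init__(self, feat_dim: int, up_factor: int = 4):
        super().__init__()
        self.up_factor = up_factor
        self.offset_mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 128), nn.ReLU(), nn.Linear(128, 3))

    def forward(self, xyz: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        # xyz: (B, N, 3) parent points, feat: (B, N, C) per-parent features
        parents = xyz.repeat_interleave(self.up_factor, dim=1)        # (B, N*r, 3)
        feats = feat.repeat_interleave(self.up_factor, dim=1)         # (B, N*r, C)
        offsets = self.offset_mlp(torch.cat([feats, parents], dim=-1))
        return parents + offsets                                      # child points
```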
Chapter
As camera and LiDAR sensors capture complementary information in autonomous driving, great efforts have been made to conduct semantic segmentation through multi-modality data fusion. However, fusion-based approaches require paired data, i.e., LiDAR point clouds and camera images with strict point-to-pixel mappings, as the inputs in both the training and inference stages, which seriously hinders their application in practical scenarios. Thus, in this work, we propose the 2D Priors Assisted Semantic Segmentation (2DPASS) method, a general training scheme, to boost representation learning on point clouds. The proposed 2DPASS method takes full advantage of 2D images with rich appearance during training and then conducts semantic segmentation without strict paired-data constraints. In practice, by leveraging auxiliary modal fusion and multi-scale fusion-to-single knowledge distillation (MSFSKD), 2DPASS acquires richer semantic and structural information from the multi-modal data, which is then distilled into the pure 3D network. As a result, our baseline model shows significant improvement with only point cloud inputs once equipped with 2DPASS. Specifically, it achieves state-of-the-art results on two large-scale recognized benchmarks (SemanticKITTI and NuScenes), ranking first in both the single-scan and multi-scan competitions of SemanticKITTI.
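The fusion-to-single distillation idea can be sketched with a standard soft-label distillation loss; this is a simplification of MSFSKD, and the temperature and weighting are assumptions:

```python
import torch
import torch.nn.functional as F

def distill_2d_to_3d(logits_3d: torch.Tensor,
                     logits_fused: torch.Tensor,
                     labels: torch.Tensor,
                     temperature: float = 2.0,
                     alpha: float = 0.5) -> torch.Tensor:
    """Simplified distillation objective in the spirit of 2DPASS: the pure-3D
    branch is supervised by labels and softly by the 2D-assisted (fused) branch.
    Not the paper's exact MSFSKD formulation."""
    ce = F.cross_entropy(logits_3d, labels)                       # hard-label supervision
    kd = F.kl_div(
        F.log_softmax(logits_3d / temperature, dim=-1),
        F.softmax(logits_fused.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                                          # soft-label distillation
    return (1.0 - alpha) * ce + alpha * kd
```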
Article
LiDAR point cloud analysis is a core task in 3D computer vision, especially for autonomous driving. However, due to the severe sparsity and noise interference in a single-sweep LiDAR point cloud, accurate semantic segmentation is non-trivial to achieve. In this paper, we propose a novel sparse LiDAR point cloud semantic segmentation framework assisted by learned contextual shape priors. In practice, an initial semantic segmentation (SS) of a single-sweep point cloud can be produced by any appealing network and then flows into the semantic scene completion (SSC) module as the input. By merging multiple frames of the LiDAR sequence as supervision, the optimized SSC module learns contextual shape priors from sequential LiDAR data, completing the sparse single-sweep point cloud into a dense one. Thus, it inherently improves SS optimization through fully end-to-end training. Besides, a Point-Voxel Interaction (PVI) module is proposed to further enhance the knowledge fusion between the SS and SSC tasks, i.e., promoting the interaction between the incomplete local geometry of the point cloud and the complete voxel-wise global structure. Furthermore, the auxiliary SSC and PVI modules can be discarded during inference without extra burden for SS. Extensive experiments confirm that our JS3C-Net achieves superior performance on both the SemanticKITTI and SemanticPOSS benchmarks, with 4% and 3% improvements, respectively.
Article
The task of point cloud upsampling aims to acquire dense and uniform point sets from sparse and irregular point sets. Although significant progress has been made with deep learning models, state-of-the-art methods require ground-truth dense point sets as supervision, which limits them to training on synthetic paired data and makes them unsuitable for real-scanned sparse data. Moreover, it is expensive and tedious to obtain large numbers of paired sparse-dense point sets as supervision from real-scanned sparse data. To address this problem, we propose a self-supervised point cloud upsampling network, named SPU-Net, to capture the inherent upsampling patterns of points lying on the underlying object surface. Specifically, we propose a coarse-to-fine reconstruction framework that contains two main components: point feature extraction and point feature expansion. In the point feature extraction, we integrate a self-attention module with a graph convolution network (GCN) to capture context information inside and among local regions simultaneously. In the point feature expansion, we introduce a hierarchically learnable folding strategy to generate upsampled point sets with learnable 2D grids. Moreover, to further optimize the noisy points in the generated point sets, we propose a novel self-projection optimization associated with uniformity and reconstruction terms as a joint loss to facilitate self-supervised point cloud upsampling. We conduct various experiments on both synthetic and real-scanned datasets, and the results demonstrate that we achieve performance comparable to state-of-the-art supervised methods.
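The learnable folding strategy for point feature expansion can be sketched as tiling each feature, tagging copies with 2D grid coordinates, and folding them into 3D points; the grid range and MLP sizes are assumptions, and the sketch assumes a square upsampling ratio:

```python
import torch
import torch.nn as nn

class FoldingExpansion(nn.Module):
    """Sketch of folding-based feature expansion: every point feature is tiled
    r times, each copy is tagged with a 2D grid coordinate, and an MLP folds
    the (feature, grid) pairs into r upsampled point coordinates."""
    def __init__(self, feat_dim: int, up_ratio: int = 4):
        super().__init__()
        side = int(up_ratio ** 0.5)            # assumes up_ratio is a perfect square
        u, v = torch.meshgrid(torch.linspace(-0.2, 0.2, side),
                              torch.linspace(-0.2, 0.2, side), indexing="ij")
        self.register_buffer("grid", torch.stack([u, v], dim=-1).reshape(-1, 2))  # (r, 2)
        self.fold = nn.Sequential(nn.Linear(feat_dim + 2, 128), nn.ReLU(),
                                  nn.Linear(128, 3))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, N, C) sparse point features -> (B, N * r, 3) dense points
        B, N, C = feat.shape
        r = self.grid.shape[0]
        tiled = feat.unsqueeze(2).expand(B, N, r, C)
        grid = self.grid.view(1, 1, r, 2).expand(B, N, r, 2)
        return self.fold(torch.cat([tiled, grid], dim=-1)).reshape(B, N * r, 3)
```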
Article
Point cloud completion aims to predict the missing parts of incomplete 3D shapes. A common strategy is to generate the complete shape from the incomplete input. However, the unordered nature of point clouds degrades the generation of high-quality 3D shapes, as the detailed topology and structure of unordered points are hard to capture from an extracted latent code during the generative process. We address this problem by formulating completion as a point cloud deformation process. Specifically, we design a novel neural network, named PMP-Net++, to mimic the behavior of an earth mover. It moves each point of the incomplete input to obtain a complete point cloud, where the total distance of the point moving paths (PMPs) should be the shortest. Therefore, PMP-Net++ predicts a unique PMP for each point according to the constraint on point moving distances. The network learns a strict and unique correspondence at the point level and thus improves the quality of the predicted complete shape. Moreover, since moving points relies heavily on per-point features learned by the network, we further introduce a transformer-enhanced representation learning network, which significantly improves the completion performance of PMP-Net++. We conduct comprehensive experiments on shape completion and further explore the application to point cloud upsampling, which demonstrate the non-trivial improvement of PMP-Net++ over state-of-the-art point cloud completion/upsampling methods.
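The point-moving-path formulation can be sketched as predicting a displacement per input point and penalizing the total moving distance; `deform_net` is a hypothetical displacement predictor, and only a single moving step is shown, whereas PMP-Net++ moves points over several steps:

```python
import torch

def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def pmp_style_loss(deform_net, partial: torch.Tensor, gt: torch.Tensor,
                   path_weight: float = 0.1) -> torch.Tensor:
    """Single-step sketch of the point-moving-path idea: a hypothetical network
    predicts a displacement for every input point, and the total moving distance
    is penalized so that points travel along short paths."""
    displacement = deform_net(partial)                   # (N, 3)
    completed = partial + displacement
    path_length = displacement.norm(dim=-1).mean()       # encourage short moving paths
    return chamfer_distance(completed, gt) + path_weight * path_length
```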
Article
With the help of the deep learning paradigm, many point cloud networks have been invented for visual analysis. However, there is still great potential for improving these networks, since the information available in point cloud data has not been fully exploited. To improve the effectiveness of existing networks in analyzing point cloud data, we propose a plug-and-play module, PnP-3D, aiming to refine the fundamental point cloud feature representations by involving more local context and global bilinear response from explicit 3D space and implicit feature space. To thoroughly evaluate our approach, we conduct experiments on three standard point cloud analysis tasks, including classification, semantic segmentation, and object detection, where we select three state-of-the-art networks from each task for evaluation. Serving as a plug-and-play module, PnP-3D can significantly boost the performances of established networks. In addition to achieving state-of-the-art results on four widely used point cloud benchmarks, we present comprehensive ablation studies and visualizations to demonstrate our approach's advantages. The code will be available at https://github.com/ShiQiu0419/pnp-3d.
Article
Fine-grained 3D shape classification is important for shape understanding and analysis, and it poses a challenging research problem. However, fine-grained 3D shape classification has rarely been explored, due to the lack of fine-grained 3D shape benchmarks. To address this issue, we first introduce a new 3D shape dataset (named FG3D dataset) with fine-grained class labels, which consists of three categories including airplane, car and chair. Each category consists of several subcategories at a fine-grained level. According to our experiments under this fine-grained dataset, we find that state-of-the-art methods are significantly limited by the small variance among subcategories in the same category. To resolve this problem, we further propose a novel fine-grained 3D shape classification method named FG3D-Net to capture the fine-grained local details of 3D shapes from multiple rendered views. Specifically, we first train a Region Proposal Network (RPN) to detect the generally semantic parts inside multiple views under the benchmark of generally semantic part detection. Then, we design a hierarchical part-view attention aggregation module to learn a global shape representation by aggregating generally semantic part features, which preserves the local details of 3D shapes. The part-view attention module hierarchically leverages part-level and view-level attention to increase the discriminability of our features. The part-level attention highlights the important parts in each view while the view-level attention highlights the discriminative views among all the views of the same object. In addition, we integrate a Recurrent Neural Network (RNN) to capture the spatial relationships among sequential views from different viewpoints. Our results under the fine-grained 3D shape dataset show that our method outperforms other state-of-the-art methods. The FG3D dataset is available at https://github.com/liuxinhai/FG3D-Net.
Chapter
In this paper, we introduce SalsaNext for the uncertainty-aware semantic segmentation of a full 3D LiDAR point cloud in real-time. SalsaNext is the next version of SalsaNet [1] which has an encoder-decoder architecture where the encoder unit has a set of ResNet blocks and the decoder part combines upsampled features from the residual blocks. In contrast to SalsaNet, we introduce a new context module, replace the ResNet encoder blocks with a new residual dilated convolution stack with gradually increasing receptive fields and add the pixel-shuffle layer in the decoder. Additionally, we switch from stride convolution to average pooling and also apply central dropout treatment. To directly optimize the Jaccard index, we further combine the weighted cross entropy loss with Lovász-Softmax loss [4]. We finally inject a Bayesian treatment to compute the epistemic and aleatoric uncertainties for each point in the cloud. We provide a thorough quantitative evaluation on the Semantic-KITTI dataset [3], which demonstrates that the proposed SalsaNext outperforms other published semantic segmentation networks and achieves 3.6% more accuracy over the previous state-of-the-art method. We also release our source code (https://github.com/TiagoCortinhal/SalsaNext).
Chapter
Many point cloud segmentation methods rely on transferring irregular points into a voxel-based regular representation. Although voxel-based convolutions are useful for feature aggregation, they produce ambiguous or wrong predictions if a voxel contains points from different classes. Other approaches (such as PointNets and point-wise convolutions) can take irregular points for feature learning. But their high memory and computational costs (such as for neighborhood search and ball-querying) limit their ability and accuracy for large-scale point cloud processing. To address these issues, we propose a deep fusion network architecture (FusionNet) with a unique voxel-based “mini-PointNet” point cloud representation and a new feature aggregation module (fusion module) for large-scale 3D semantic segmentation. Our FusionNet can learn more accurate point-wise predictions when compared to voxel-based convolutional networks. It can realize more effective feature aggregations with lower memory and computational complexity for large-scale point cloud segmentation when compared to the popular point-wise convolutions. Our experimental results show that FusionNet can take more than one million points on one GPU for training to achieve state-of-the-art accuracy on large-scale Semantic KITTI benchmark. The code will be available at https://github.com/feihuzhang/LiDARSeg.
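The voxel-based "mini-PointNet" idea can be sketched as embedding points with a shared MLP, average-pooling the embeddings per voxel, and concatenating the pooled context back to each point; the voxel size and feature widths are assumptions, and this is not FusionNet's actual fusion module:

```python
import torch
import torch.nn as nn

class VoxelMiniPointNet(nn.Module):
    """Sketch of a voxel-based mini-PointNet: points are hashed into voxels, a
    shared MLP embeds each point, features are average-pooled per voxel, and
    the pooled context is concatenated back to every point."""
    def __init__(self, in_dim: int = 3, feat_dim: int = 32, voxel_size: float = 0.05):
        super().__init__()
        self.voxel_size = voxel_size
        self.mlp = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) points of one scene
        coords = torch.floor(xyz / self.voxel_size).long()
        coords = coords - coords.min(dim=0).values
        dims = coords.max(dim=0).values + 1
        voxel_id = (coords[:, 0] * dims[1] + coords[:, 1]) * dims[2] + coords[:, 2]
        uniq, inv = torch.unique(voxel_id, return_inverse=True)   # inv: point -> voxel slot

        feat = self.mlp(xyz)                                      # (N, C) per-point features
        pooled = torch.zeros(uniq.shape[0], feat.shape[1], device=feat.device)
        count = torch.zeros(uniq.shape[0], 1, device=feat.device)
        pooled.index_add_(0, inv, feat)
        count.index_add_(0, inv, torch.ones(feat.shape[0], 1, device=feat.device))
        pooled = pooled / count.clamp(min=1.0)                    # per-voxel mean pooling
        return torch.cat([feat, pooled[inv]], dim=-1)             # (N, 2C)
```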
Chapter
Semantic segmentation and semantic edge detection can be seen as two dual problems with close relationships in computer vision. Despite the fast evolution of learning-based 3D semantic segmentation methods, little attention has been drawn to the learning of 3D semantic edge detectors, even less to a joint learning method for the two tasks. In this paper, we tackle the 3D semantic edge detection task for the first time and present a new two-stream fully-convolutional network that jointly performs the two tasks. In particular, we design a joint refinement module that explicitly wires region information and edge information to improve the performances of both tasks. Further, we propose a novel loss function that encourages the network to produce semantic segmentation results with better boundaries. Extensive evaluations on S3DIS and ScanNet datasets show that our method achieves on par or better performance than the state-of-the-art methods for semantic segmentation and outperforms the baseline methods for semantic edge detection. Code release: https://github.com/hzykent/JSENet .
Chapter
LiDAR point-cloud segmentation is an important problem for many applications. For large-scale point cloud segmentation, the de facto method is to project a 3D point cloud to get a 2D LiDAR image and use convolutions to process it. Despite the similarity between regular RGB and LiDAR images, we are the first to discover that the feature distribution of LiDAR images changes drastically at different image locations. Using standard convolutions to process such LiDAR images is problematic, as convolution filters pick up local features that are only active in specific regions in the image. As a result, the capacity of the network is under-utilized and the segmentation performance decreases. To fix this, we propose Spatially-Adaptive Convolution (SAC) to adopt different filters for different locations according to the input image. SAC can be computed efficiently since it can be implemented as a series of element-wise multiplications, im2col, and standard convolution. It is a general framework such that several previous methods can be seen as special cases of SAC. Using SAC, we build SqueezeSegV3 for LiDAR point-cloud segmentation and outperform all previous published methods by at least 2.0% mIoU on the SemanticKITTI benchmark. Code and pretrained model are available at https://github.com/chenfengxu714/SqueezeSegV3.
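The spatially-adaptive idea can be sketched in a simplified form where a small network predicts a per-location modulation map from the raw LiDAR image before a standard convolution; the full SAC additionally realizes per-location kernels via element-wise multiplications and im2col, which is omitted here, and the channel counts are assumptions:

```python
import torch
import torch.nn as nn

class SpatiallyAdaptiveConv(nn.Module):
    """Simplified sketch of spatially-adaptive convolution: a small network
    predicts a per-location weight map from the raw LiDAR image, the feature
    map is modulated by element-wise multiplication, and a standard convolution
    follows."""
    def __init__(self, in_ch: int, out_ch: int, raw_ch: int = 5):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(raw_ch, in_ch, kernel_size=7, padding=3), nn.Sigmoid())
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, raw_lidar_image: torch.Tensor) -> torch.Tensor:
        # feat: (B, in_ch, H, W) features; raw_lidar_image: (B, raw_ch, H, W),
        # e.g. x, y, z, depth, remission channels of the projected scan
        w = self.weight_net(raw_lidar_image)          # location-dependent modulation in [0, 1]
        return self.conv(feat * w)                    # element-wise modulation + standard conv
```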