Article

Snowflake Point Deconvolution for Point Cloud Completion and Generation With Skip-Transformer


Abstract

Most existing point cloud completion methods suffer from the discrete nature of point clouds and the unstructured prediction of points in local regions, which makes it difficult to reveal fine local geometric details. To resolve this issue, we propose SnowflakeNet with snowflake point deconvolution (SPD) to generate complete point clouds. SPD models the generation of point clouds as the snowflake-like growth of points, where child points are generated progressively by splitting their parent points after each SPD. Our insight into the detailed geometry is to introduce a skip-transformer in the SPD to learn the point splitting patterns that can best fit the local regions. The skip-transformer leverages attention mechanism to summarize the splitting patterns used in the previous SPD layer to produce the splitting in the current layer. The locally compact and structured point clouds generated by SPD precisely reveal the structural characteristics of the 3D shape in local patches, which enables us to predict highly detailed geometries. Moreover, since SPD is a general operation that is not limited to completion, we explore its applications in other generative tasks, including point cloud auto-encoding, generation, single image reconstruction, and upsampling. Our experimental results outperform state-of-the-art methods under widely used benchmarks.
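
To make the splitting mechanism concrete, the following is a minimal PyTorch-style sketch of one SPD step with a skip-transformer stand-in. It is an illustration only: the class names, feature sizes, and the single-head attention used in place of the skip-transformer are assumptions of this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SkipTransformerLite(nn.Module):
    """Stand-in for the skip-transformer: current point features (queries) attend
    to the displacement features carried over from the previous SPD step
    (keys/values) so that earlier splitting patterns can be reused."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, feat_cur, feat_prev):
        q, k, v = self.q(feat_cur), self.k(feat_prev), self.v(feat_prev)   # (B, N, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return feat_cur + attn @ v                                         # residual mixing

class SnowflakePointDeconvSketch(nn.Module):
    """One SPD-style step: every parent point is split into `up_factor` child
    points by predicting small per-child offsets from refined features."""
    def __init__(self, dim=128, up_factor=2):
        super().__init__()
        self.up_factor = up_factor
        self.skip_attn = SkipTransformerLite(dim)
        self.offset_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                        nn.Linear(dim, 3 * up_factor))

    def forward(self, points, feat, feat_prev):
        # points: (B, N, 3); feat / feat_prev: (B, N, C)
        feat = self.skip_attn(feat, feat_prev)
        offsets = self.offset_mlp(feat).view(*points.shape[:2], self.up_factor, 3)
        children = points.unsqueeze(2) + offsets            # snowflake-like point splitting
        return children.reshape(points.shape[0], -1, 3), feat

# toy usage: grow 512 coarse points into 1024
spd = SnowflakePointDeconvSketch(dim=128, up_factor=2)
pts, feat = torch.rand(1, 512, 3), torch.rand(1, 512, 128)
children, feat_out = spd(pts, feat, feat_prev=feat)
print(children.shape)  # torch.Size([1, 1024, 3])
```

In the actual method several such steps are stacked, so a coarse shape is progressively grown into a dense, locally structured point cloud.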


... Before they can be used in downstream applications (e.g., digital twin), they need to be faithfully completed, a process known as point cloud completion. Recent years have witnessed significant progress in this field (Yuan et al., 2018;Huang et al., 2020;Zhang et al., 2020;Yu et al., 2021;Xiang et al., 2023;Yan et al., 2022; Tang et al., 2022;Zhou et al., 2022;Zhang et al., 2023d;Yu et al., 2023a;Wang et al., 2022a). However, the sparsity and large structural incompleteness of point clouds still limit their ability to produce satisfactory results. ...
... The first challenge leads to a vast solution space for pointbased networks (Yuan et al., 2018;Xiang et al., 2023;Zhou et al., 2022) to robustly locate missing regions and create a partial-to-complete mapping. Some alternative methods attempt to address this issue by incorporating additional color images (Zhang et al., 2021b;Aiello et al., 2022;Zhu et al., 2024) or viewpoints (Zhang et al., 2022a;Gong et al., 2021;Fu et al., 2023). ...
... One pioneering point-based work is PCN (Yuan et al., 2018), which uses a shared multi-layer perceptron (MLP) to extract features and generates additional points using a folding operation (Yang et al., 2018) in a coarse-to-fine manner. Inspired by it, a lot of point-based methods (Wang et al., 2020;Liu et al., 2020;Wen et al., 2020;Xiang et al., 2023;Zhou et al., 2022;Yu et al., 2021;Wang et al., 2022a;Pan, 2020;Wei et al., 2023a;Hu et al., 2022) have been proposed. Later, to address the issue of limited information available in partial shapes, several works (Zhang et al., 2021b;Aiello et al., 2022;Zhu et al., 2024;Huang et al., 2022;Zhang et al., 2022a;Gong et al., 2021;Fu et al., 2023) have explored the use of auxiliary data to enhance performance. ...
Article
Full-text available
Point cloud completion is a fundamental yet not well-solved problem in 3D vision. Current approaches often rely on 3D coordinate information and/or additional data (e.g., images and scanning viewpoints) to fill in missing parts. Unlike these methods, we explore self-structure augmentation and propose PointSea for global-to-local point cloud completion. In the global stage, consider how we inspect a defective region of a physical object, we may observe it from various perspectives for a better understanding. Inspired by this, PointSea augments data representation by leveraging self-projected depth images from multiple views. To reconstruct a compact global shape from the cross-modal input, we incorporate a feature fusion module to fuse features at both intra-view and inter-view levels. In the local stage, to reveal highly detailed structures, we introduce a point generator called the self-structure dual-generator. This generator integrates both learned shape priors and geometric self-similarities for shape refinement. Unlike existing efforts that apply a unified strategy for all points, our dual-path design adapts refinement strategies conditioned on the structural type of each point, addressing the specific incompleteness of each point. Comprehensive experiments on widely-used benchmarks demonstrate that PointSea effectively understands global shapes and generates local details from incomplete input, showing clear improvements over existing methods. Our code is available at https://github.com/czvvd/SVDFormer_PointSea.
... We first fit the SMPL model from input sparse views, and then feed the SMPL depth into a depth refiner to get refined depth, from which we obtain voxel-level features. These features are then aggregated with pixel-level features extracted from source images, followed by the SPD network [74,75] to generate dense image-aligned prior points for coarse Gaussian rasterization. To help model finer details, the image-aligned depth maps from coarse Gaussians are unprojected to yield finer pixel-wise points. ...
... To get reliable yet dense prior points for Gaussian regression, we develop image-aligned human points prediction. Considering that SMPL model contains only sparse points, we employ the Snowflake Point Deconvolution (SPD) network [74,75] that is suitable for refining and densifying the estimated SMPL points P due to the ability to learn local geometric characteristics. Akin to [74,75], we employ two SPD steps. ...
... Considering that SMPL model contains only sparse points, we employ the Snowflake Point Deconvolution (SPD) network [74,75] that is suitable for refining and densifying the estimated SMPL points P due to the ability to learn local geometric characteristics. Akin to [74,75], we employ two SPD steps. To enhance its effectiveness and robustness, we integrate it with both pixel-level and voxel-level features to get the output human points by: ...
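
A small, self-contained sketch of the two-step densification described above is given below: a generic refine-and-split step is applied twice to sparse prior points conditioned on concatenated per-point features. The feature dimensions, the tiny MLP, and the names `pix_feats`/`vox_feats` are illustrative placeholders, not the cited pipeline's actual modules.

```python
import torch
import torch.nn as nn

def upsample_step(points, feats, mlp, up=2):
    """One generic refine-and-split step in the spirit of SPD: predict per-child
    offsets from per-point features and split every point into `up` children."""
    offsets = mlp(feats).view(*points.shape[:2], up, 3)
    children = (points.unsqueeze(2) + offsets).reshape(points.shape[0], -1, 3)
    feats = feats.repeat_interleave(up, dim=1)       # carry features to the children
    return children, feats

# hypothetical conditioning: concatenate per-point pixel-level and voxel-level features
B, N, Cp, Cv = 1, 1024, 32, 32
smpl_pts  = torch.rand(B, N, 3)                      # estimated (sparse) SMPL points
pix_feats = torch.rand(B, N, Cp)                     # assumed sampled from image feature maps
vox_feats = torch.rand(B, N, Cv)                     # assumed sampled from a feature volume
feats = torch.cat([pix_feats, vox_feats], dim=-1)

mlp = nn.Sequential(nn.Linear(Cp + Cv, 64), nn.ReLU(), nn.Linear(64, 3 * 2))
pts = smpl_pts
for _ in range(2):                                   # two SPD-style steps, as in the citing text
    pts, feats = upsample_step(pts, feats, mlp, up=2)
print(pts.shape)                                     # torch.Size([1, 4096, 3])
```
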
Preprint
This paper presents RoGSplat, a novel approach for synthesizing high-fidelity novel views of unseen human from sparse multi-view images, while requiring no cumbersome per-subject optimization. Unlike previous methods that typically struggle with sparse views with few overlappings and are less effective in reconstructing complex human geometry, the proposed method enables robust reconstruction in such challenging conditions. Our key idea is to lift SMPL vertices to dense and reliable 3D prior points representing accurate human body geometry, and then regress human Gaussian parameters based on the points. To account for possible misalignment between SMPL model and images, we propose to predict image-aligned 3D prior points by leveraging both pixel-level features and voxel-level features, from which we regress the coarse Gaussians. To enhance the ability to capture high-frequency details, we further render depth maps from the coarse 3D Gaussians to help regress fine-grained pixel-wise Gaussians. Experiments on several benchmark datasets demonstrate that our method outperforms state-of-the-art methods in novel view synthesis and cross-dataset generalization. Our code is available at https://github.com/iSEE-Laboratory/RoGSplat.
... We identify two key challenges in this task: First, mmWave point clouds exhibit inter-frame heterogeneity, in contrast to the more consistent data from LiDAR or RGB-D sensors. Second, conventional point cloud completion methods are primarily designed for static objects or autonomous driving scenes [15], [21], whereas our focus is on dynamic human bodies, which necessitates accounting for temporal changes and motion. To address these issues, we propose a multi-stage mmWave point cloud enhancement method that leverages 2D human mask information from single-view images as supervision during the training phase. ...
... However, these approaches primarily target large-scale scenarios like autonomous driving and are not designed to capture the detailed pose and shape information essential for human body reconstruction. LiDAR point cloud completion methods [21], [28] aim to recover accurate shapes from partial point clouds, similar to our goal, but they focus on static objects. Therefore, incorporating temporal information is crucial for enhancing dynamic human poses. ...
... The sparse mmWave point cloud struggles to fully capture the motion of every body part. Inspired by point cloud completion techniques [15], [21], we leverage human shape information from single-view images as a supervision signal for enhancing mmWave point clouds, which provides more detailed and intuitive shape and structural information. Notably, image data is used only during the training phase, while during inference, only the raw point cloud is used as input, ensuring better privacy protection. ...
Preprint
Millimeter-wave (mmWave) radar offers robust sensing capabilities in diverse environments, making it a highly promising solution for human body reconstruction due to its privacy-friendly and non-intrusive nature. However, the significant sparsity of mmWave point clouds limits the estimation accuracy. To overcome this challenge, we propose a two-stage deep learning framework that enhances mmWave point clouds and improves human body reconstruction accuracy. Our method includes a mmWave point cloud enhancement module that densifies the raw data by leveraging temporal features and a multi-stage completion network, followed by a 2D-3D fusion module that extracts both 2D and 3D motion features to refine SMPL parameters. The mmWave point cloud enhancement module learns the detailed shape and posture information from 2D human masks in single-view images. However, image-based supervision is involved only during the training phase, and the inference relies solely on sparse point clouds to maintain privacy. Experiments on multiple datasets demonstrate that our approach outperforms state-of-the-art methods, with the enhanced point clouds further improving performance when integrated into existing models.
... In recent years, numerous deep learning-based point cloud completion methods [18,40,45,47-49,55] have shown remarkable success. These approaches utilize carefully designed neural networks to extract shape patterns from input point clouds, enabling them to generate detailed geometric structures to complete missing portions of the point cloud. ...
... PoinTr [47] treats the point cloud as a token sequence, using transformer encoder-decoder to predict the missing parts. SnowflakeNet [40] designs a transformer decoder with skip connections to refine the point cloud. Another line of works [53,56] enhances completion performance using 2D information. ...
... With or without the SDS Refining step, GenPC consistently achieves state-of-the-art performance across the entire dataset. These results indicate that existing learning-based methods [40,47,48] struggle to complete such real-world scans (Figure 6 shows visual comparisons with recent methods [40,48] on the ScanNet dataset). ...
Preprint
Existing point cloud completion methods, which typically depend on predefined synthetic training datasets, encounter significant challenges when applied to out-of-distribution, real-world scans. To overcome this limitation, we introduce a zero-shot completion framework, termed GenPC, designed to reconstruct high-quality real-world scans by leveraging explicit 3D generative priors. Our key insight is that recent feed-forward 3D generative models, trained on extensive internet-scale data, have demonstrated the ability to perform 3D generation from single-view images in a zero-shot setting. To harness this for completion, we first develop a Depth Prompting module that links partial point clouds with image-to-3D generative models by leveraging depth images as a stepping stone. To retain the original partial structure in the final results, we design the Geometric Preserving Fusion module that aligns the generated shape with input by adaptively adjusting its pose and scale. Extensive experiments on widely used benchmarks validate the superiority and generalizability of our approach, bringing us a step closer to robust real-world scan completion.
... In these tasks, it is critical to have a good understanding of the spatial structure of an object. On the other hand, point cloud completion aims to estimate the complete shape of objects from partial observations [28,34,36,38], which pays more attention to the geometric details. Manipulation Tasks. ...
... Representative approaches for each task [5,19,20,36,39] including SOTA are adopted as baselines to evaluate the improvement achieved after being assisted by our synthesized data and annotations. The training is conducted on randomly synthesized new objects and stops when the loss converges. ...
... Point Cloud Completion. Following [36,38], we uniformly sample 16384 points from each object in both training and test sets as the complete point clouds and then acquire partial point clouds by back projecting the complete shapes into 8 different partial views. 2048 points are sampled from each partial point cloud as input. ...
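
A rough sketch of that data preparation may clarify the sampling counts. The farthest point sampling below is standard, while `cull_backfacing` is only a crude stand-in for the benchmark's actual procedure of back-projecting rendered depth maps into 8 partial views.

```python
import numpy as np

def farthest_point_sample(pts, n):
    """Greedy farthest point sampling; O(n*N), slow in pure NumPy but fine
    as an offline, illustrative stand-in for the benchmark's sampling."""
    sel = [np.random.randint(len(pts))]
    dist = np.full(len(pts), np.inf)
    for _ in range(n - 1):
        dist = np.minimum(dist, np.linalg.norm(pts - pts[sel[-1]], axis=1))
        sel.append(int(dist.argmax()))
    return pts[np.array(sel)]

def cull_backfacing(pts, cam_pos):
    """Deliberately crude stand-in for view-dependent partiality: keep the half
    of the shape facing the camera (the benchmark instead back-projects rendered
    depth maps from 8 viewpoints)."""
    center = pts.mean(axis=0)
    view_dir = (cam_pos - center) / np.linalg.norm(cam_pos - center)
    return pts[(pts - center) @ view_dir > 0]

surface = np.random.randn(50000, 3)                  # stand-in for dense surface samples
complete = farthest_point_sample(surface, 16384)     # 16384-point complete ground truth
cams = [np.array([np.cos(t), np.sin(t), 0.3]) * 2.0
        for t in np.linspace(0, 2 * np.pi, 8, endpoint=False)]
partials = [farthest_point_sample(cull_backfacing(complete, c), 2048) for c in cams]
print(complete.shape, partials[0].shape)             # (16384, 3) (2048, 3)
```
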
Preprint
The acquisition of substantial volumes of 3D articulated object data is expensive and time-consuming, and consequently the scarcity of 3D articulated object data becomes an obstacle for deep learning methods to achieve remarkable performance in various articulated object understanding tasks. Meanwhile, pairing these object data with detailed annotations to enable training for various tasks is also difficult and labor-intensive to achieve. In order to expeditiously gather a significant number of 3D articulated objects with comprehensive and detailed annotations for training, we propose Articulated Object Procedural Generation toolbox, a.k.a. Arti-PG toolbox. Arti-PG toolbox consists of i) descriptions of articulated objects by means of a generalized structure program along with their analytic correspondence to the objects' point cloud, ii) procedural rules about manipulations on the structure program to synthesize large-scale and diverse new articulated objects, and iii) mathematical descriptions of knowledge (e.g. affordance, semantics, etc.) to provide annotations to the synthesized object. Arti-PG has two appealing properties for providing training data for articulated object understanding tasks: i) objects are created with unlimited variations in shape through program-oriented structure manipulation, ii) Arti-PG is widely applicable to diverse tasks by easily providing comprehensive and detailed annotations. Arti-PG now supports the procedural generation of 26 categories of articulate objects and provides annotations across a wide range of both vision and manipulation tasks, and we provide exhaustive experiments which fully demonstrate its advantages. We will make Arti-PG toolbox publicly available for the community to use.
... Wang et al. proposed the Cascaded Refinement Network [16], which refines the predicted point positions by calculating displacement offsets. Xiang et al. proposed SnowflakeNet [17], which increases the number of points by splitting them through a special deconvolution strategy, generating complete point clouds from coarse to fine through multiple stages of point splitting. The GSSnowflakeNet proposed in this paper has stronger global feature-capturing capabilities and parameter utilization efficiency, achieving better performance. ...
... The Grouped Vector Skip-Transformer (GVST) learns and refines the spatial context between parent points and child points. "Skip" represents the connection between the displacement feature from the previous layer and the point feature from the current layer [17]. Figure 4 illustrates the detailed structure of the Grouped Vector Skip-Transformer, which takes Q, K, and "pos" as inputs, where Q represents the point features containing contextual information outputted by GVSPA; K represents the offset features outputted by the previous stage; and "pos" represents the coordinates of the points outputted by the previous stage. ...
... The Grouped Vector Skip-Transformer (GVST) learns and refines the spatial context between parent points and child points. "Skip" represents the connection between the displacement feature from the previous layer and the point feature from the current layer [17]. Figure 4 illustrates the detailed structure of the Grouped Vector Skip-Transformer. ...
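
The role of Q, K, and "pos" described above can be illustrated with a small vector-attention block in the spirit of Point Transformer. Channel grouping and the exact GVST design are omitted, and every layer size here is an assumption of this sketch rather than the cited architecture.

```python
import torch
import torch.nn as nn

def knn_gather(x, pos, k):
    """Gather each point's k nearest neighbours: x (B, N, C), pos (B, N, 3) -> (B, N, k, C)."""
    idx = torch.cdist(pos, pos).topk(k, dim=-1, largest=False).indices      # (B, N, k)
    B, N, _ = x.shape
    batch = torch.arange(B, device=x.device).view(B, 1, 1).expand(B, N, k)
    return x[batch, idx]

class VectorSkipAttention(nn.Module):
    """Illustrative vector-attention block in the spirit of the described GVST:
    Q comes from the current point features, K/V from the previous stage's
    offset features, and "pos" supplies a learned relative positional encoding."""
    def __init__(self, dim=64, num_neighbors=8):
        super().__init__()
        self.num_neighbors = num_neighbors
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.pos_enc = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.weight_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, q_feat, k_feat, pos):
        kk = self.num_neighbors
        q = self.to_q(q_feat)                                       # (B, N, C)
        k = knn_gather(self.to_k(k_feat), pos, kk)                  # (B, N, k, C)
        v = knn_gather(self.to_v(k_feat), pos, kk)
        rel = knn_gather(pos, pos, kk) - pos.unsqueeze(2)           # relative coordinates
        p = self.pos_enc(rel)                                       # (B, N, k, C)
        w = torch.softmax(self.weight_mlp(q.unsqueeze(2) - k + p), dim=2)
        return q_feat + (w * (v + p)).sum(dim=2)                    # per-channel (vector) attention

attn = VectorSkipAttention(dim=64)
out = attn(torch.rand(2, 256, 64), torch.rand(2, 256, 64), torch.rand(2, 256, 3))
print(out.shape)  # torch.Size([2, 256, 64])
```
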
Article
Full-text available
Point clouds are essential 3D data representations utilized across various disciplines, often requiring point cloud completion methods to address inherent incompleteness. Existing completion methods like SnowflakeNet only consider local attention, lacking global information of the complete shape, and tend to suffer from overfitting as the model depth increases. To address these issues, we introduced self-positioning point-based attention to better capture complete global contextual features and designed a Channel Attention module for adaptive feature adjustment within the global vector. Additionally, we implemented a vector attention grouping strategy in both the skip-transformer and self-positioning point-based attention to mitigate overfitting, improving parameter efficiency and generalization. We evaluated our method on the PCN dataset as well as the ShapeNet55/34 datasets. The experimental results show that our method achieved an average CD-L1 of 7.09 and average CD-L2 scores of 8.0, 7.8, and 14.4 on the PCN, ShapeNet55, ShapeNet34, and ShapeNet-unseen21 benchmarks, respectively. Compared to SnowflakeNet, we improved the average CD by 1.6%, 3.6%, 3.7%, and 4.6% on the corresponding benchmarks, while also reducing complexity and computational costs and accelerating training and inference speeds. Compared to other existing point cloud completion networks, our method also achieves competitive results.
... Additionally, PoinTr [33] considers point cloud completion as a set-to-set translation problem and first proposes a fully transformer-based completion network. SnowflakeNet [34] utilizes self-attentions on points to extract discriminative features, and adopts the skip-transformer to split parent points and gradually generate child points. AGFA-Net [35] utilizes spatial attention blocks to replace KNN operations and aggregate global features adaptively by calculating per-point attention values for point cloud generation. ...
... It significantly reduces mean CD by 0.296 (↓ 20.5%) and increases mean FS by 0.055 (↑ 6.9%), compared with XMFNet [21]. It is worth noting that the performance of XMFNet [21] is not even as good as that of single-modal method SnowflakeNet [34], which only uses partial point clouds for the shape completion. However, our DuInNet achieves better completion performance, even though it has a simpler point cloud generation network than SnowflakeNet [34]. ...
... It is worth noting that the performance of XMFNet [21] is not even as good as that of single-modal method SnowflakeNet [34], which only uses partial point clouds for the shape completion. However, our DuInNet achieves better completion performance, even though it has a simpler point cloud generation network than SnowflakeNet [34]. This is because the unidirectional fusion structure in XMFNet ignores the shape prior contained in the image modality, while our DuInNet makes deep interaction between dual modalities and directly restores the point clouds from image modality. ...
Preprint
Full-text available
To further promote the development of multimodal point cloud completion, we contribute a large-scale multimodal point cloud completion benchmark ModelNet-MPC with richer shape categories and more diverse test data, which contains nearly 400,000 pairs of high-quality point clouds and rendered images of 40 categories. Besides the fully supervised point cloud completion task, two additional tasks including denoising completion and zero-shot learning completion are proposed in ModelNet-MPC, to simulate real-world scenarios and verify the robustness to noise and the transfer ability across categories of current methods. Meanwhile, considering that existing multimodal completion pipelines usually adopt a unidirectional fusion mechanism and ignore the shape prior contained in the image modality, we propose a Dual-Modality Feature Interaction Network (DuInNet) in this paper. DuInNet iteratively interacts features between point clouds and images to learn both geometric and texture characteristics of shapes with the dual feature interactor. To adapt to specific tasks such as fully supervised, denoising, and zero-shot learning point cloud completions, an adaptive point generator is proposed to generate complete point clouds in blocks with different weights for these two modalities. Extensive experiments on the ShapeNet-ViPC and ModelNet-MPC benchmarks demonstrate that DuInNet exhibits superiority, robustness and transfer ability in all completion tasks over state-of-the-art methods. The code and dataset will be available soon.
... Voxelization during training can result in high memory costs and loss of geometric information, but point clouds provide a more efficient and memory-friendly representation. Existing point cloud completion networks [10-16] generally adopt a coarse-to-fine structure, and the entire completion pipeline is divided into three stages: feature extraction, skeleton generation, and detail refinement. Figure 1 displays some airplane category visualization results. ...
... (Fig. 1 caption: both the partial input and the ground truth consist of 2048 points each; compared with other completion methods like PMPNet [27], PMPNet++ [13], and SnowflakeNet [11], our network generates the complete shape (2048 points) with fine-grained geometric details so that the missing parts can be effectively restored.) Although the methods [13,23,24] based on PointNet++ take into account the local structure of the point cloud, the structure of the constructed graph is fixed, and the processing of edge features is still the same as in PointNet [17]. ...
... However, these methods can only generate coarse point clouds which often suffer from some loss of details. Inspired by SnowflakeNet [11], we propose the deconvolution attention skeleton generation module. This module employs deconvolution operations to recover coarse point cloud features from the global feature. ...
Article
Full-text available
Point clouds acquired through 3D scanning devices often suffer from sparsity and incompleteness due to reflection, device resolution, and viewing angle limitations. Therefore, the recovery of the complete shape from partial observations plays a vital role in assisting downstream tasks. Existing point cloud completion networks mostly ignore the encoding of the local region structure in the point cloud. In this work, we propose an edge-guided generative network with attention for point cloud completion. Specifically, the network has three consecutive stages. In the feature extraction stage, we propose the edge attention(EA) block, which can be stacked and applied to effectively capture local geometric details and structural information. The local neighborhood information is dynamically calculated, and the attention mechanism further deepens the relationship between the acquired edge features and position coordinates. We design the deconvolution attention skeleton generation module in the skeleton generation stage to generate a shape skeleton. For the detail refinement stage, we design a layered encoder based on PointNet++ module, which can better fuse the local geometry from the coarse point cloud and the global feature from the input point cloud to facilitate fine-grained point cloud generation. Comprehensive evaluations of several benchmarks indicate the effectiveness of our network and its ability to generate fine-grained point clouds.
... The advent of deep learning, combined with the widespread popularity of 3D scanning devices has led to significant work on point cloud learning tasks. These include point cloud completion [30,33,37,39], denoising [17,29,40], up-sampling [15,41] and generation [10,25,35]. All of these tasks are heavily reliant on the ability of a network to learn an accurate representation of a point cloud. ...
... The second use case is simply as an evaluation metric for gauging the effectiveness of the above tasks [10,15,33,34,39]. An evaluation metric has notably different requirements in comparison to an objective function. ...
... The former typically incorporates various global [6,35,68] or local priors [27,33,34], along with additional constraints [5,71] or gradients [26,40,41]. However, the optimization relies on ground truth point clouds [25,30,52,54,66,67], which are often difficult to acquire. Recently, NeRF [38] has achieved impressive results in novel view synthesis. ...
Preprint
Full-text available
Recently, it has shown that priors are vital for neural implicit functions to reconstruct high-quality surfaces from multi-view RGB images. However, current priors require large-scale pre-training, and merely provide geometric clues without considering the importance of color. In this paper, we present NeRFPrior, which adopts a neural radiance field as a prior to learn signed distance fields using volume rendering for surface reconstruction. Our NeRF prior can provide both geometric and color clues, and also get trained fast under the same scene without additional data. Based on the NeRF prior, we are enabled to learn a signed distance function (SDF) by explicitly imposing a multi-view consistency constraint on each ray intersection for surface inference. Specifically, at each ray intersection, we use the density in the prior as a coarse geometry estimation, while using the color near the surface as a clue to check its visibility from another view angle. For the textureless areas where the multi-view consistency constraint does not work well, we further introduce a depth consistency loss with confidence weights to infer the SDF. Our experimental results outperform the state-of-the-art methods under the widely used benchmarks.
... (4) SnowflakeNet [64] presents snowflake point deconvolution layers, which progressively refine the point cloud. Each layer generates child points by splitting parent points. ...
Article
Full-text available
With the widespread adoption of 3D scanning technology, depth view-driven 3D reconstruction has become crucial for applications such as SLAM, virtual reality, and autonomous vehicles. However, due to the effects of self-occlusion and environmental occlusion, obtaining complete and error-free 3D shapes directly from 3D scans remains challenging, as previous reconstruction methods tend to lose details. To this end, we propose Dynamic Quality Refinement Network (DQRNet) for reconstructing complete and accurate 3D shape from a single depth view. DQRNet introduces a dynamic encoder–decoder and a detail quality refiner to generate high-resolution 3D shapes, where the former designs a dynamic latent extractor to adaptively select important parts of an object and the latter designs global and local point refiners to enhance the reconstruction quality. Experimental results show that DQRNet is able to focus on capturing the details at boundaries and key areas on ShapeNet dataset, thereby achieving better accuracy and robustness than SOTA methods.
... In order to improve the detection of small target elements in the diagram, this study introduces a convolution building block known for detecting low-resolution and small targets into the detection head: Snowflake Point Deconvolution (SPD) [21]. SPD utilizes frame-cycle super-resolution transformation technology to resize the original image, and its structure is shown in Figure 4. ...
Article
Full-text available
Secondary systems in electrical engineering often rely on traditional CAD software (AutoCAD v2024.1.6) or non-structured, paper-based diagrams for fieldwork, posing challenges for digital transformation. Electrical diagram recognition technology bridges this gap by converting traditional diagram operations into a “digital” model, playing a critical role in power system scheduling, operation, and maintenance. However, conventional recognition methods, which primarily rely on partition detection, face significant limitations such as poor adaptability to diverse diagram styles, interference among recognition objects, and reduced accuracy in handling complex and varied electrical diagrams. This paper introduces a novel layered framework for electrical diagram recognition that sequentially extracts the element layer, text layer, and connection relationship layer to address these challenges. First, an improved YOLOv7 model, combined with a multi-scale sliding window strategy, is employed to accurately segment large and small diagram objects. Next, PaddleOCR, trained with electrical-specific terminology, and PaddleClas, using multi-angle classification, are utilized for robust text recognition, effectively mitigating interference from diagram elements. Finally, clustering and adaptive FcF-inpainting algorithms are applied to repair the connection relationship layer, resolving local occlusion issues and enhancing the overall coupling of the diagram. Experimental results demonstrate that the proposed method outperforms existing approaches in robustness and universality, particularly for complex diagrams, providing technical support for intelligent power grid construction and operation.
... Neural networks have increasingly shown transformative potential in 3D applications, contributing to advancements in areas such as shape modeling, scene reconstruction, and virtual simulation [23,29,40,41,50,52,82,83,85,94,97,99-101]. Building on this momentum, our work focuses on generating implicit functions, which are known to be efficient 3D data representations. ...
Preprint
Full-text available
Autoregressive (AR) models have achieved remarkable success in natural language and image generation, but their application to 3D shape modeling remains largely unexplored. Unlike diffusion models, AR models enable more efficient and controllable generation with faster inference times, making them especially suitable for data-intensive domains. Traditional 3D generative models using AR approaches often rely on ``next-token" predictions at the voxel or point level. While effective for certain applications, these methods can be restrictive and computationally expensive when dealing with large-scale 3D data. To tackle these challenges, we introduce 3D-WAG, an AR model for 3D implicit distance fields that can perform unconditional shape generation, class-conditioned and also text-conditioned shape generation. Our key idea is to encode shapes as multi-scale wavelet token maps and use a Transformer to predict the ``next higher-resolution token map" in an autoregressive manner. By redefining 3D AR generation task as ``next-scale" prediction, we reduce the computational cost of generation compared to traditional ``next-token" prediction models, while preserving essential geometric details of 3D shapes in a more structured and hierarchical manner. We evaluate 3D-WAG to showcase its benefit by quantitative and qualitative comparisons with state-of-the-art methods on widely used benchmarks. Our results show 3D-WAG achieves superior performance in key metrics like Coverage and MMD, generating high-fidelity 3D shapes that closely match the real data distribution.
... Datasets. In our experiments we utilize PCN [36], ShapeNet-55/34 [11], ShapeNet-Part [37], KITTI [38], and SVR ShapeNet [39]. Please refer to HyperCD for the dataset details of PCN, ShapeNet-55/34, and ShapeNet-Part. ...
Preprint
Full-text available
3D point clouds enhance the robot's ability to perceive the geometrical information of the environments, making it possible for many downstream tasks such as grasp pose detection and scene understanding. The performance of these tasks, though, heavily relies on the quality of data input, as incomplete data can lead to poor results and failure cases. Recent training loss functions designed for deep learning-based point cloud completion, such as Chamfer distance (CD) and its variants (e.g., HyperCD), imply that a good gradient weighting scheme can significantly boost performance. However, these CD-based loss functions usually require data-related parameter tuning, which can be time-consuming for data-extensive tasks. To address this issue, we aim to find a family of weighted training losses (weighted CD) that requires no parameter tuning. To this end, we propose a search scheme, Loss Distillation via Gradient Matching, to find good candidate loss functions by mimicking the learning behavior in backpropagation between HyperCD and weighted CD. Once this is done, we propose a novel bilevel optimization formula to train the backbone network based on the weighted CD loss. We observe that: (1) with proper weighting functions, the weighted CD can always achieve similar performance to HyperCD, and (2) the Landau weighted CD, namely Landau CD, can outperform HyperCD for point cloud completion and lead to new state-of-the-art results on several benchmark datasets. Our demo code is available at https://github.com/Zhang-VISLab/IROS2024-LossDistillationWeightedCD.
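
The gradient-weighting idea summarized in this abstract can be written out in one step. The following is a plain reading of that idea, using HyperCD's arcosh form as the worked example; the notation and the specific Landau weighting are not reproduced from the paper.

```latex
% For a matched squared distance d_i(\theta) and a monotone wrapper f:
\frac{\partial}{\partial \theta} f\big(d_i(\theta)\big) = f'(d_i)\,\frac{\partial d_i}{\partial \theta}
% so f'(\cdot) acts as a per-match gradient weight. HyperCD corresponds (up to a scale) to
f(d) = \operatorname{arcosh}(1 + d) \quad\Longrightarrow\quad f'(d) = \frac{1}{\sqrt{d^{2} + 2d}}
% which emphasizes well-matched (small-d) pairs and softly suppresses outlier matches;
% a "weighted CD" swaps in a different weighting f'.
```
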
... Neural surface reconstruction from multi-view images [12], [13], [16], [26], [61], [96], [120], [121], [122], [123], [124], [125], [126], [127] has been shown to be powerful for recovering dense 3D surfaces via image-based neural rendering. We demonstrate that our method can also improve the performance of multi-view reconstruction by introducing an SDF through noise to noise mapping on point clouds from SfM [15] as a geometry prior. ...
Preprint
Learning signed distance functions (SDFs) from point clouds is an important task in 3D computer vision. However, without ground truth signed distances, point normals or clean point clouds, current methods still struggle to learn SDFs from noisy point clouds. To overcome this challenge, we propose to learn SDFs via a noise to noise mapping, which does not require any clean point cloud or ground truth supervision. Our novelty lies in the noise to noise mapping which can infer a highly accurate SDF of a single object or scene from its multiple or even single noisy observations. We achieve this by a novel loss which enables statistical reasoning on point clouds and maintains geometric consistency although point clouds are irregular, unordered and have no point correspondence among noisy observations. To accelerate training, we use multi-resolution hash encodings implemented in CUDA in our framework, which reduces our training time by a factor of ten, achieving convergence within one minute. We further introduce a novel schema to improve multi-view reconstruction by estimating SDFs as a prior. Our evaluations under widely-used benchmarks demonstrate our superiority over the state-of-the-art methods in surface reconstruction from point clouds or multi-view images, point cloud denoising and upsampling.
... The usual approach involves using diffusion models to generate 2D images, building an intermediate bridge from text-to-3D models, and then using novel view synthesis methods to generate 3D models from images. Some studies [7-11] attempt to express 3D models in the form of implicit neural radiance fields, while other methods [12-21] focus on more traditional forms, such as voxels, point clouds, and meshes. One important thing is that, benefiting from the excellent 3D scene synthesis quality provided by recent 3D Gaussian splatting, researchers have proposed methods [22-24] for generating 3D models with higher quality or performance. ...
Article
Full-text available
The current research on text-guided 3D synthesis predominantly utilizes complex diffusion models, posing significant challenges in tasks like terrain generation. This study ventures into the direct synthesis of text-to-3D terrain in a zero-shot fashion, circumventing the need for diffusion models. By exploiting the large language model’s inherent spatial awareness, we innovatively formulate a method to update existing 3D models through text, thereby enhancing their accuracy. Specifically, we introduce a Gaussian–Voronoi map data structure that converts simplistic map summaries into detailed terrain heightmaps. Employing a chain-of-thought behavior tree approach, which combines action chains and thought trees, the model is guided to analyze a variety of textual inputs and extract relevant terrain data, effectively bridging the gap between textual descriptions and 3D models. Furthermore, we develop a text–terrain re-editing technique utilizing multiagent reasoning, allowing for the dynamic update of the terrain’s representational structure. Our experimental results indicate that this method proficiently interprets the spatial information embedded in the text and generates controllable 3D terrains with superior visual quality.
... The global representation extracted by an encoder was utilized to generate new point clouds. Snowflake point deconvolution (SPD) was proposed in SnowflakeNet [32]. SPD employed a global vector to guide the process of generating high-quality point clouds. ...
Article
Full-text available
Point clouds in the real world are often sparse and incomplete. Point cloud completion aims to restore incomplete point clouds into meaningful shapes. In recent years, point cloud completion has attracted the interest of many researchers. In most existing methods, the efficiency of completion is generally ignored in favor of producing meaningful shapes with adequate details. This paper proposes a fast point completion network (FPCN). FPCN is mainly composed of a multi-scale attention encoder (MSAE) and a structural refinement (SR) module. MSAE first obtains multi-scale geometric information by extracting incomplete inputs at different resolutions. After that, MSAE fuses multi-scale geometric information through cross-attention mechanisms. Compared with existing mainstream encoders, MSAE can extract rich geometric information from input point clouds with low complexity. The SR module aims to extract local information from the coarse point clouds to guide the process of extending points. Furthermore, the process of extending points is achieved by a replication strategy. Compared to existing folding-based decoders, the SR module can produce fine point clouds with more local details. Compared to existing transformer-based decoders, the SR module has a lower calculation price by employing a replication strategy to generate high-resolution point clouds. In conclusion, FPCN can restore partial point clouds into meaningful shapes efficiently, and the outcomes of completion contain sufficient details. Extensive experiments on various datasets demonstrate the performance and efficiency of the FPCN in point cloud completion. Source code is available at https://github.com/doldolOuO/FPCN.
... We use this property to solve the shape completion problem using implicit surface reconstruction. There exist various works (Yuan et al. 2018; Yan et al. 2022b; Yu et al. 2021; Xiang et al. 2022; Mittal et al. 2022; Yew and Lee 2022; Yan et al. 2022a) for the shape completion task, but no existing work solves these two problems simultaneously. The algorithm proposed in (Chibane, Alldieck, and Pons-Moll 2020) addresses this problem but mostly for human 3D shapes, and it also uses an occupancy network to first learn the implicit representation and subsequently solve the completion problem. ...
Article
Full-text available
Implicit 3D surface reconstruction of an object from its partial and noisy 3D point cloud scan is the classical geometry processing and 3D computer vision problem. In the literature, various 3D shape representations have been developed, differing in memory efficiency and shape retrieval effectiveness, such as volumetric, parametric, and implicit surfaces. Radial basis functions provide memory-efficient parameterization of the implicit surface. However, we show that training a neural network using the mean squared error between the ground-truth implicit surface and the linear basis-based implicit surfaces does not converge to the global solution. In this work, we propose locally supported compact radial basis functions for a linear representation of the implicit surface. This representation enables us to generate 3D shapes with arbitrary topologies at any resolution due to their continuous nature. We then propose a neural network architecture for learning the linear implicit shape representation of the 3D surface of an object. We learn linear implicit shapes within a supervised learning framework using ground truth Signed-Distance Field (SDF) data for guidance. The classical strategies face difficulties in finding linear implicit shapes from a given 3D point cloud due to numerical issues (requires solving inverse of a large matrix) in basis and query point selection. The proposed approach achieves better Chamfer distance and comparable F-score than the state-of-the-art approach on the benchmark dataset. We also show the effectiveness of the proposed approach by using it for the 3D shape completion task.
Preprint
Full-text available
Chamfer Distance (CD) is widely used as a metric to quantify the difference between two point clouds. In point cloud completion, Chamfer Distance (CD) is typically used as a loss function in deep learning frameworks. However, it is generally acknowledged within the field that Chamfer Distance (CD) is vulnerable to the presence of outliers, which can consequently lead to the convergence on suboptimal models. In divergence from the existing literature, which largely concentrates on resolving such concerns in the realm of Euclidean space, we put forth a notably uncomplicated yet potent metric specifically designed for point cloud completion tasks: Hyperbolic Chamfer Distance (HyperCD). This metric conducts Chamfer Distance computations within the parameters of hyperbolic space. During the backpropagation process, HyperCD systematically allocates greater weight to matched point pairs exhibiting reduced Euclidean distances. This mechanism facilitates the preservation of accurate point pair matches while permitting the incremental adjustment of suboptimal matches, thereby contributing to enhanced point cloud completion outcomes. Moreover, measuring shape dissimilarity is not only useful for the point cloud completion task; we further explore its applications in other generation-related tasks, including single image reconstruction from point clouds and upsampling. We demonstrate state-of-the-art performance on the point cloud completion benchmark datasets, PCN, ShapeNet-55, and ShapeNet-34, and show from visualization that HyperCD can significantly improve the surface smoothness; we also provide experimental results beyond the completion task.
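
Below is a short sketch of the metric as described above: the standard Chamfer matching, with each matched squared distance mapped through arcosh(1 + d). Tensor shapes, the reduction convention, and the absence of a scale hyperparameter are assumptions of this sketch.

```python
import torch

def hyper_chamfer(pred, gt):
    """Hyperbolic Chamfer Distance sketch: nearest-neighbour matching as in plain
    CD, but each matched squared Euclidean distance d is mapped through
    arcosh(1 + d), which grows slowly for large d and therefore softly
    down-weights outlier matches during backpropagation."""
    d = torch.cdist(pred, gt) ** 2                  # (B, N, M) squared distances
    d_pg = torch.acosh(1.0 + d.min(dim=2).values)   # every predicted point -> gt
    d_gp = torch.acosh(1.0 + d.min(dim=1).values)   # every gt point -> prediction
    return d_pg.mean(dim=1) + d_gp.mean(dim=1)      # per-sample loss

pred = torch.rand(2, 1024, 3, requires_grad=True)   # e.g. a completion network output
gt = torch.rand(2, 2048, 3)
loss = hyper_chamfer(pred, gt).mean()
loss.backward()                                     # differentiable, so usable as a training loss
print(float(loss))
```
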
Article
The increasing deployment of UAVs as mobile communication relays in urban environments necessitates accurate 3D modeling of complex urban areas for optimal communication. Current practices involve LiDAR-based 3D scanning to generate point cloud data; however, sensor limitations and adverse weather conditions may compromise data quality. This study proposes a new multi-modal point cloud and image fusion completion network (PIFC-Net) based on a generative adversarial network (GAN), specifically tailored for large-scale urban environments. The experimental study tested five different building shapes and various objects, and the results commend the network for its effectiveness in enhancing the quality and efficiency of point cloud completion.
Article
The measurement of oil tank volume holds significant safety and economic implications. A common method of measurement is the use of 3D scanning point clouds. However, point cloud data obtained through 3D scanning may be incomplete and contain certain noise, affecting the accuracy of volume measurement. To address these issues, this paper proposes an oil tank volume measurement method based on 3D point clouds. There are two key innovations: one is the introduction of a stratified truncated cone Inner Diameter Fitting (IDF) method to overcome point cloud measurement noise. The other is the development of a point cloud completion network (BPoinTr) through a Bias Learning Model (BLM). The incomplete bottom point cloud data of the oil tank is completed by BPoinTr and used for subsequent volume calculation. Extensive experiments on actual collected oil tank point cloud data demonstrate that the method proposed in this paper can reduce the calculation error of the tank bottom volume to 0.004 m³, merely 0.57‰. Furthermore, the mean absolute percentage error of the calculated tank volume by this method is less than 0.1%.
Article
Surface reconstruction for point clouds is one of the important tasks in 3D computer vision. The latest methods rely on generalizing the priors learned from large-scale supervision. However, the learned priors usually do not generalize well to various geometric variations that are unseen during training, especially for extremely sparse point clouds. To resolve this issue, we present a neural network to directly infer SDFs from single sparse point clouds without using signed distance supervision, learned priors or even normals. Our insight here is to learn surface parameterization and SDF inference in an end-to-end manner. To make up for the sparsity, we leverage parameterized surfaces as a coarse surface sampler to provide many coarse surface estimations in training iterations, according to which we mine supervision for our thin plate splines (TPS) based network to infer smooth SDFs in a statistical way. Our method significantly improves the generalization ability and accuracy on unseen point clouds. Our experimental results show our advantages over the state-of-the-art methods in surface reconstruction for sparse point clouds under synthetic datasets and real scans.
Article
This work presents a new completion method specifically designed for low-overlapping partial point cloud registration. Based on the assumption that the candidate partial point clouds to be registered belong to the same target, the proposed mutual prior based completion (MPC) method uses these candidate partial point clouds as completion references for each other to extend their overlapping regions. Without relying on shape prior knowledge, MPC can work for different types of point clouds, such as object, room scene, and street view. The main challenge of this mutual reference approach is that partial clouds without spatial alignment cannot provide a reliable completion reference. Based on mutual information maximization, a progressive completion structure is developed to achieve pose, feature representation and completion alignment between input point clouds. Experiments on public datasets show encouraging results. Especially for the low-overlapping cases, compared with the state-of-the-art (SOTA) models, the size of overlapping regions can be increased by about 15.0%, and the rotation and translation error can be reduced by 30.8% and 57.7% respectively. (Code is available at: https://*.*).
Preprint
Full-text available
The complex traffic environment and various weather conditions make the collection of LiDAR data expensive and challenging. Achieving high-quality and controllable LiDAR data generation is urgently needed, controlling with text is a common practice, but there is little research in this field. To this end, we propose Text2LiDAR, the first efficient, diverse, and text-controllable LiDAR data generation model. Specifically, we design an equirectangular transformer architecture, utilizing the designed equirectangular attention to capture LiDAR features in a manner with data characteristics. Then, we design a control-signal embedding injector to efficiently integrate control signals through the global-to-focused attention mechanism. Additionally, we devise a frequency modulator to assist the model in recovering high-frequency details, ensuring the clarity of the generated point cloud. To foster development in the field and optimize text-controlled generation performance, we construct nuLiDARtext which offers diverse text descriptors for 34,149 LiDAR point clouds from 850 scenes. Experiments on uncontrolled and text-controlled generation in various forms on KITTI-360 and nuScenes datasets demonstrate the superiority of our approach.
Article
Outstanding effectiveness of transformers in visual tasks has resulted in their fast growth and adoption in three-dimensional (3D) vision tasks. Vision transformers have shown numerous advantages over earlier convolutional neural network (CNN) architectures including broad modelling abilities, more substantial modelling capabilities, convolution complementarity, scalability to model data size, and better connection for enhancing the performance records of many visual tasks. We present a thorough review that classifies and summarizes the popular transformer-based approaches based on key features for transformer integration such as the input data, scalability element that enables transformer processing, architectural design, and context level through which the transformer functions as well as a highlight of the primary contributions of each transformer approach. Furthermore, we compare the results of these techniques with commonly employed non-transformer techniques in 3D object classification, segmentation, and object detection using standard 3D datasets including ModelNet, SUN RGB-D, ScanNet, nuScenes, Waymo, ShapeNet, S3DIS, and KITTI. This study also includes the discussion of numerous potential future options and limitations for 3D vision transformers.
Article
Surface reconstruction for point clouds is an important task in 3D computer vision. Most of the latest methods resolve this problem by learning signed distance functions from point clouds, which are limited to reconstructing closed surfaces. Some other methods tried to represent open surfaces using unsigned distance functions (UDF) which are learned from ground truth distances. However, the learned UDF is hard to provide smooth distance fields due to the discontinuous character of point clouds. In this paper, we propose CAP-UDF, a novel method to learn consistency-aware UDF from raw point clouds. We achieve this by learning to move queries onto the surface with a field consistency constraint, where we also enable to progressively estimate a more accurate surface. Specifically, we train a neural network to gradually infer the relationship between queries and the approximated surface by searching for the moving target of queries in a dynamic way. Meanwhile, we introduce a polygonization algorithm to extract surfaces using the gradients of the learned UDF. We conduct comprehensive experiments in surface reconstruction for point clouds, real scans or depth maps, and further explore our performance in unsupervised point normal estimation, which demonstrate non-trivial improvements of CAP-UDF over the state-of-the-art methods.
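
The core "move queries onto the surface" step described above lends itself to a compact sketch: a query is shifted against the gradient of the learned unsigned distance field by the predicted distance. The tiny MLP and the training comment are illustrative assumptions, not the CAP-UDF architecture.

```python
import torch
import torch.nn as nn

# a tiny stand-in UDF network; the real CAP-UDF architecture is more elaborate
udf = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                    nn.Linear(128, 128), nn.ReLU(),
                    nn.Linear(128, 1), nn.Softplus())   # non-negative predicted distances

def move_to_surface(queries, udf):
    """Core step described in the abstract: shift each query against the gradient
    of the learned unsigned distance field by the predicted distance, which
    places it (approximately) on the surface."""
    queries = queries.clone().requires_grad_(True)
    d = udf(queries)                                                 # (N, 1)
    grad = torch.autograd.grad(d.sum(), queries, create_graph=True)[0]
    direction = grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
    return queries - d * direction                                   # projected queries

queries = torch.rand(1024, 3)
projected = move_to_surface(queries, udf)
# during training, a Chamfer-style term pulls `projected` toward the raw point
# cloud, which supervises the UDF without any ground-truth distances
print(projected.shape)  # torch.Size([1024, 3])
```
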
Article
By formulating the data generation as a sequence procedure of denoising autoencoding, diffusion models have achieved superior in-painting performance on image data and beyond. Nevertheless, it is not trivial when capitalizing on diffusion models to generate missing 3D points. The difficulty originates from the intrinsic structure where 3D point cloud is a set of unordered and irregular coordinates. That motivates us to delve into the 3D structural information for designing point cloud encoder-decoder and shape latent generator, to precisely formulate the latent distribution of the complete point cloud and partial observation. In this paper, we propose Point cloud completion with Latent Diffusion Models (PointLDM), a new approach that leverages the conditional denoising diffusion probabilistic modeling (DDPM) in the 3D latent space for shape reconstruction. The architecture of PointLDM consists of a transformer-based variational auto-encoder (VAE) to model the complete shape latent, and a diffusion network for shape latent prediction. The encoder of VAE exploits both of global shape latent and local point features in shape distribution learning. With the learnt shape latent, the decoder first decodes the shape latent into coarse points, and then recovers the fine-grained details around each coarse point by deforming a 2D grid. To reconstruct the shape latent from partial observation, the diffusion network treats the partial observation as the conditional input and generates the shape latent via DDPM. Extensive experiments conducted on MVP, Completion3D, and KITTI quantitatively and qualitatively demonstrate the efficacy of PointLDM over the state-of-the-art shape completion approaches.
Article
Full-text available
Estimating the complete 3D point cloud from an incomplete one lies at the core of many vision and robotics applications. Existing methods typically predict the complete point cloud based on the global shape representation extracted from the incomplete input. Although they could predict the overall shape of 3D objects, they are incapable of generating structure details of objects. Moreover, the partial input point sets obtained from range scans are often sparse, noisy and non-uniform, which largely hinder shape completion. In this paper, we propose an adaptive sampling and hierarchical folding network (ASHF-Net) for robust 3D point cloud completion. Our main contributions are two-fold. First, we propose a denoising auto-encoder with an adaptive sampling module, aiming at learning robust local region features that are insensitive to noise. Second, we propose a hierarchical folding decoder with the gated skip-attention and multi-resolution completion goal to effectively exploit the local structure details of partial inputs. We also design a KL regularization term to evenly distribute the generated points. Extensive experiments demonstrate that our method outperforms existing state-of-the-art methods on multiple 3D point cloud completion benchmarks.
Conference Paper
Full-text available
Point completion refers to completing the missing geometries of an object from incomplete observations. Mainstream methods predict the missing shapes by decoding a global feature learned from the input point cloud, which often leads to deficient results in preserving topology consistency and surface details. In this work, we present ME-PCN, a point completion network that leverages emptiness in 3D shape space. Given a single depth scan, previous methods often encode the occupied partial shapes while ignoring the empty regions (e.g. holes) in depth maps. In contrast, we argue that these 'emptiness' clues indicate shape boundaries that can be used to improve topology representation and detail granularity on surfaces. Specifically, our ME-PCN encodes both the occupied point cloud and the neighboring 'empty points'. It estimates coarse-grained but complete and reasonable surface points in the first stage, followed by a refinement stage to produce fine-grained surface details. Comprehensive experiments verify that our ME-PCN presents better qualitative and quantitative performance against the state-of-the-art. Besides, we further prove that our 'emptiness' design is lightweight and easy to embed in existing methods, which shows consistent effectiveness in improving the CD and EMD scores.
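The 'empty points' exploited here can be derived directly from a depth scan: any location along a camera ray that lies strictly in front of the observed surface is known to be free space. A minimal sketch of such a sampler, assuming a pinhole camera model (the intrinsics, sampling range, and function name are illustrative):

```python
import numpy as np

def sample_empty_points(depth, fx, fy, cx, cy, n_samples=4, margin=0.05):
    """Illustrative sketch: for every valid depth pixel, sample points along
    the camera ray strictly in front of the observed surface; those locations
    are known to be empty space."""
    v, u = np.nonzero(depth > 0)                     # valid pixel coordinates
    z_surf = depth[v, u]                             # observed depth per pixel
    # fractions in (0.1, 1 - margin) place samples between camera and surface
    t = np.random.uniform(0.1, 1.0 - margin, size=(len(u), n_samples))
    z = z_surf[:, None] * t                          # depths of the empty samples
    x = (u[:, None] - cx) / fx * z                   # back-project with pinhole model
    y = (v[:, None] - cy) / fy * z
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```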
Article
Full-text available
The irregular domain and lack of ordering make it challenging to design deep neural networks for point cloud processing. This paper presents a novel framework named Point Cloud Transformer (PCT) for point cloud learning. PCT is based on Transformer, which achieves huge success in natural language processing and displays great potential in image processing. It is inherently permutation invariant for processing a sequence of points, making it well-suited for point cloud learning. To better capture local context within the point cloud, we enhance input embedding with the support of farthest point sampling and nearest neighbor search. Extensive experiments demonstrate that the PCT achieves the state-of-the-art performance on shape classification, part segmentation, semantic segmentation, and normal estimation tasks.
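The local-context input embedding of PCT relies on farthest point sampling followed by nearest-neighbor grouping. A minimal NumPy sketch of those two standard steps (brute-force, for illustration only):

```python
import numpy as np

def farthest_point_sampling(points, n_centroids):
    """Iteratively pick the point farthest from the already selected set."""
    n = points.shape[0]
    selected = np.zeros(n_centroids, dtype=np.int64)
    dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)
    for i in range(1, n_centroids):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))  # squared distances
        selected[i] = int(np.argmax(dist))
    return selected

def knn_group(points, centroid_idx, k=16):
    """Group the k nearest neighbours of each sampled centroid (brute force)."""
    centroids = points[centroid_idx]                                  # (m, 3)
    d2 = ((centroids[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (m, n)
    return np.argsort(d2, axis=1)[:, :k]                              # neighbour indices
```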
Chapter
Full-text available
Point clouds are often the default choice for many applications as they exhibit more flexibility and efficiency than volumetric data. Nevertheless, their unorganized nature – points are stored in an unordered way – makes them less suited to be processed by deep learning pipelines. In this paper, we propose a method for 3D object completion and classification based on point clouds. We introduce a new way of organizing the extracted features based on their activations, which we name soft pooling. For the decoder stage, we propose regional convolutions, a novel operator aimed at maximizing the global activation entropy. Furthermore, inspired by the local refining procedure in Point Completion Network (PCN), we also propose a patch-deforming operation to simulate deconvolutional operations for point clouds. This paper proves that our regional activation can be incorporated in many point cloud architectures like AtlasNet and PCN, leading to better performance for geometric completion. We evaluate our approach on different 3D tasks such as object completion and classification, achieving state-of-the-art accuracy.
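Soft pooling, as described above, organizes per-point features by their activations rather than collapsing them with a single max. A minimal sketch of that idea, assuming a (batch, points, channels) feature tensor; the exact layout and ranking used in the paper may differ:

```python
import torch

def soft_pool(features, k=32):
    """Illustrative sketch of activation-based 'soft pooling': for each channel,
    keep the k point features with the highest activation in that channel,
    producing an organized feature map instead of a single max-pooled vector."""
    # features: (B, N, C) per-point features
    B, N, C = features.shape
    _, idx = features.topk(k, dim=1)                           # (B, k, C) point indices per channel
    # gather the full feature vector of each selected point, per channel
    idx_full = idx.permute(0, 2, 1).reshape(B, C * k, 1).expand(-1, -1, C)
    organized = torch.gather(features, 1, idx_full)            # (B, C*k, C)
    return organized.reshape(B, C, k, C)                       # channel-organized map
```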
Article
Full-text available
Recovering the 3D shape of an object from single or multiple images with deep neural networks has been attracting increasing attention in the past few years. Mainstream works (e.g. 3D-R2N2) use recurrent neural networks (RNNs) to sequentially fuse feature maps of input images. However, RNN-based approaches are unable to produce consistent reconstruction results when given the same input images with different orders. Moreover, RNNs may forget important features from early input images due to long-term memory loss. To address these issues, we propose a novel framework for single-view and multi-view 3D object reconstruction, named Pix2Vox++. By using a well-designed encoder-decoder, it generates a coarse 3D volume from each input image. A multi-scale context-aware fusion module is then introduced to adaptively select high-quality reconstructions for different parts from all coarse 3D volumes to obtain a fused 3D volume. To further correct the wrongly recovered parts in the fused 3D volume, a refiner is adopted to generate the final output. Experimental results on the ShapeNet, Pix3D, and Things3D benchmarks show that Pix2Vox++ performs favorably against state-of-the-art methods in terms of both accuracy and efficiency.
Conference Paper
Full-text available
Recovering the 3D representation of an object from single-view or multi-view RGB images by deep neural networks has attracted increasing attention in the past few years. Several mainstream works (e.g., 3D-R2N2) use recurrent neural networks (RNNs) to fuse multiple feature maps extracted from input images sequentially. However, when given the same set of input images with different orders, RNN-based approaches are unable to produce consistent reconstruction results. Moreover, due to long-term memory loss, RNNs cannot fully exploit input images to refine reconstruction results. To solve these problems, we propose a novel framework for single-view and multi-view 3D reconstruction, named Pix2Vox. By using a well-designed encoder-decoder, it generates a coarse 3D volume from each input image. Then, a context-aware fusion module is introduced to adaptively select high-quality reconstructions for each part (e.g., table legs) from different coarse 3D volumes to obtain a fused 3D volume. Finally, a refiner further refines the fused 3D volume to generate the final output. Experimental results on the ShapeNet and Pix3D benchmarks indicate that the proposed Pix2Vox outperforms state-of-the-art methods by a large margin. Furthermore, the proposed method is 24 times faster than 3D-R2N2 in terms of backward inference time. The experiments on ShapeNet unseen 3D categories have shown the superior generalization abilities of our method.
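The context-aware fusion in Pix2Vox/Pix2Vox++ can be thought of as a per-voxel weighted sum over the coarse volumes of all views, with weights predicted per view. A minimal sketch under that simplification (the score maps would come from a small scoring network, which is omitted here):

```python
import torch

def context_aware_fusion(coarse_volumes, score_maps):
    """Illustrative sketch: softmax the per-voxel scores across views and use
    them as weights for a per-voxel weighted sum of the coarse volumes."""
    # coarse_volumes, score_maps: (B, V, D, D, D) for V input views
    weights = torch.softmax(score_maps, dim=1)       # normalize scores across views
    return (weights * coarse_volumes).sum(dim=1)     # fused (B, D, D, D) volume
```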
Article
Point cloud completion concerns predicting the missing parts of incomplete 3D shapes. A common strategy is to generate the complete shape from the incomplete input. However, the unordered nature of point clouds degrades the generation of high-quality 3D shapes, as the detailed topology and structure of unordered points are hard to capture during the generative process using an extracted latent code. We address this problem by formulating completion as a point cloud deformation process. Specifically, we design a novel neural network, named PMP-Net++, to mimic the behavior of an earth mover. It moves each point of the incomplete input to obtain a complete point cloud, where the total distance of the point moving paths (PMPs) should be the shortest. Therefore, PMP-Net++ predicts a unique PMP for each point according to the constraint of point moving distances. The network learns a strict and unique correspondence at the point level, and thus improves the quality of the predicted complete shape. Moreover, since moving points relies heavily on the per-point features learned by the network, we further introduce a transformer-enhanced representation learning network, which significantly improves the completion performance of PMP-Net++. We conduct comprehensive experiments on shape completion, and further explore the application to point cloud up-sampling, which demonstrates the non-trivial improvement of PMP-Net++ over state-of-the-art point cloud completion/up-sampling methods.
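The 'shortest total moving path' constraint can be expressed as a simple regularizer on the per-step displacement vectors predicted for each point. A minimal sketch of such a term (the tensor layout is an assumption, and the full training objective would combine it with a reconstruction loss such as Chamfer distance):

```python
import torch

def pmp_regularizer(displacements):
    """Illustrative sketch of a shortest-moving-path regularizer: penalize the
    summed length of the per-step, per-point displacement vectors so that each
    point moves as little as possible."""
    # displacements: (B, S, N, 3) = per-step offsets applied to each of N points
    step_lengths = displacements.norm(dim=-1)   # (B, S, N) length of each step
    return step_lengths.sum(dim=1).mean()       # mean total path length over points and batch
```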
Article
Point clouds are often sparse and incomplete, which imposes difficulties for real-world applications. Existing shape completion methods tend to generate rough shapes without fine-grained details. Considering this, we introduce a two-branch network for shape completion. The first branch is a cascaded shape completion sub-network to synthesize complete objects, where we propose to use the partial input together with the coarse output to preserve the object details during the dense point reconstruction. The second branch is an auto-encoder to reconstruct the original partial input. The two branches share the same feature extractor to learn an accurate global feature for shape completion. Furthermore, we propose two strategies to enable the training of our network when ground truth data are not available. This is to mitigate the dependence of existing approaches on large amounts of ground truth training data that are often difficult to obtain in real-world applications. Additionally, our proposed strategies are also able to improve the reconstruction quality for fully supervised learning. We verify our approach in self-supervised, semi-supervised and fully supervised settings with superior performances. Quantitative and qualitative results on different datasets demonstrate that our method achieves more realistic outputs than state-of-the-art approaches on the point cloud completion task.
Article
Fine-grained 3D shape classification is important for shape understanding and analysis, which poses a challenging research problem. However, fine-grained 3D shape classification has rarely been explored, due to the lack of fine-grained 3D shape benchmarks. To address this issue, we first introduce a new 3D shape dataset (named FG3D dataset) with fine-grained class labels, which consists of three categories including airplane, car and chair. Each category consists of several subcategories at a fine-grained level. According to our experiments under this fine-grained dataset, we find that state-of-the-art methods are significantly limited by the small variance among subcategories in the same category. To resolve this problem, we further propose a novel fine-grained 3D shape classification method named FG3D-Net to capture the fine-grained local details of 3D shapes from multiple rendered views. Specifically, we first train a Region Proposal Network (RPN) to detect the generally semantic parts inside multiple views under the benchmark of generally semantic part detection. Then, we design a hierarchical part-view attention aggregation module to learn a global shape representation by aggregating generally semantic part features, which preserves the local details of 3D shapes. The part-view attention module hierarchically leverages part-level and view-level attention to increase the discriminability of our features. The part-level attention highlights the important parts in each view while the view-level attention highlights the discriminative views among all the views of the same object. In addition, we integrate a Recurrent Neural Network (RNN) to capture the spatial relationships among sequential views from different viewpoints. Our results under the fine-grained 3D shape dataset show that our method outperforms other state-of-the-art methods. The FG3D dataset is available at https://github.com/liuxinhai/FG3D-Net.
Article
The continual improvement of 3D sensors has driven the development of algorithms to perform point cloud analysis. In fact, techniques for point cloud classification and segmentation have in recent years achieved incredible performance driven in part by leveraging large synthetic datasets. Unfortunately, these same state-of-the-art approaches perform poorly when applied to incomplete point clouds. This limitation of existing algorithms is particularly concerning since point clouds generated by 3D sensors in the real world are usually incomplete due to perspective view or occlusion by other objects. This paper proposes a general model for partial point cloud analysis wherein the latent feature encoding a complete point cloud is inferred by applying a point set voting strategy. In particular, each local point set constructs a vote that corresponds to a distribution in the latent space, and the optimal latent feature is the one with the highest probability. This approach ensures that any subsequent point cloud analysis is robust to partial observation while simultaneously guaranteeing that the proposed model is able to output multiple possible results. This paper illustrates that the proposed method achieves state-of-the-art performance on shape classification, part segmentation and point cloud completion.
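If each local point set votes a Gaussian over the latent code, the highest-probability latent can be obtained in closed form as the mode of the product of those Gaussians. A minimal sketch assuming diagonal covariances (a simplification for illustration, not the paper's exact aggregation):

```python
import numpy as np

def fuse_gaussian_votes(means, log_vars):
    """Illustrative sketch of point-set-voting style fusion: combine per-region
    Gaussian votes over the latent into a single precision-weighted estimate."""
    # means, log_vars: (V, D) for V votes over a D-dimensional latent
    precisions = np.exp(-log_vars)                         # 1 / sigma^2 per dimension
    fused_precision = precisions.sum(axis=0)               # (D,)
    fused_mean = (precisions * means).sum(axis=0) / fused_precision
    return fused_mean, 1.0 / fused_precision               # fused mean and variance
```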
Chapter
In this work, we propose a novel technique to generate shapes from point cloud data. A point cloud can be viewed as samples from a distribution of 3D points whose density is concentrated near the surface of the shape. Point cloud generation thus amounts to moving randomly sampled points to high-density areas. We generate point clouds by performing stochastic gradient ascent on an unnormalized probability density, thereby moving sampled points toward the high-likelihood regions. Our model directly predicts the gradient of the log density field and can be trained with a simple objective adapted from score-based generative models. We show that our method can reach state-of-the-art performance for point cloud auto-encoding and generation, while also allowing for extraction of a high-quality implicit surface. Code is available at https://github.com/RuojinCai/ShapeGF.
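Generation with a learned gradient of the log density amounts to moving randomly initialized points uphill with noisy gradient steps. A minimal sketch of such a sampler, assuming a `score_net` that maps points and a shape latent to per-point gradients (the schedule and step sizes are illustrative):

```python
import torch

@torch.no_grad()
def sample_points(score_net, latent, n_points=2048, n_steps=100, step=1e-3, noise=1e-2):
    """Illustrative sketch of score-based point generation: start from random
    points and move them toward high-density regions with Langevin-style updates."""
    x = torch.randn(1, n_points, 3)                        # random initial points
    for _ in range(n_steps):
        grad = score_net(x, latent)                        # estimated grad of log p(x | latent)
        x = x + step * grad + noise * torch.randn_like(x)  # gradient ascent plus noise
        noise *= 0.99                                      # simple annealing schedule
    return x
```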
Chapter
Structure learning for 3D shapes is vital for 3D computer vision. State-of-the-art methods show promising results by representing shapes using implicit functions in 3D that are learned using discriminative neural networks. However, learning implicit functions requires dense and irregular sampling in 3D space, which also makes the sampling methods affect the accuracy of shape reconstruction during test. To avoid dense and irregular sampling in 3D, we propose to represent shapes using 2D functions, where the output of the function at each 2D location is a sequence of line segments inside the shape. Our approach leverages the power of functional representations, but without the disadvantage of 3D sampling. Specifically, we use a voxel tubelization to represent a voxel grid as a set of tubes along any one of the X, Y, or Z axes. Each tube can be indexed by its 2D coordinates on the plane spanned by the other two axes. We further simplify each tube into a sequence of occupancy segments. Each occupancy segment consists of successive voxels occupied by the shape, which leads to a simple representation of its 1D start and end location. Given the 2D coordinates of the tube and a shape feature as condition, this representation enables us to learn 3D shape structures by sequentially predicting the start and end locations of each occupancy segment in the tube. We implement this approach using a Seq2Seq model with attention, called SeqXY2SeqZ, which learns the mapping from a sequence of 2D coordinates along two arbitrary axes to a sequence of 1D locations along the third axis. SeqXY2SeqZ not only benefits from the regularity of voxel grids in training and testing, but also achieves high memory efficiency. Our experiments show that SeqXY2SeqZ outperforms the state-of-the-art methods under the widely used benchmarks.
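The occupancy-segment representation used by SeqXY2SeqZ is easy to picture on a single tube: a 1D run of voxels becomes a list of (start, end) indices of its occupied runs. A minimal sketch of that conversion:

```python
import numpy as np

def tube_to_segments(tube):
    """Illustrative sketch of voxel tubelization: turn one tube of a voxel grid
    (occupancy along, e.g., the Z axis at fixed X, Y) into a list of inclusive
    (start, end) indices of its occupied runs."""
    occ = np.asarray(tube, dtype=bool).astype(np.int8)
    diff = np.diff(np.concatenate([[0], occ, [0]]))   # +1 at run starts, -1 after run ends
    starts = np.nonzero(diff == 1)[0]
    ends = np.nonzero(diff == -1)[0] - 1
    return list(zip(starts.tolist(), ends.tolist()))

# e.g. tube_to_segments([0, 1, 1, 0, 1]) -> [(1, 2), (4, 4)]
```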
Chapter
Point cloud shape completion is a challenging problem in 3D vision and robotics. Existing learning-based frameworks leverage encoder-decoder architectures to recover the complete shape from a highly encoded global feature vector. Though the global feature can approximately represent the overall shape of 3D objects, it would lead to the loss of shape details during the completion process. In this work, instead of using a global feature to recover the whole complete surface, we explore the functionality of multi-level features and aggregate different features to represent the known part and the missing part separately. We propose two different feature aggregation strategies, named global & local feature aggregation (GLFA) and residual feature aggregation (RFA), to express the two kinds of features and reconstruct coordinates from their combination. In addition, we also design a refinement component to prevent the generated point cloud from non-uniform distribution and outliers. Extensive experiments have been conducted on the ShapeNet and KITTI dataset. Qualitative and quantitative evaluations demonstrate that our proposed network outperforms current state-of-the-art methods, especially in detail preservation.
Chapter
Estimating the complete 3D point cloud from an incomplete one is a key problem in many vision and robotics applications. Mainstream methods (e.g., PCN and TopNet) use Multi-layer Perceptrons (MLPs) to directly process point clouds, which may cause the loss of details because the structural and context of point clouds are not fully considered. To solve this problem, we introduce 3D grids as intermediate representations to regularize unordered point clouds and propose a novel Gridding Residual Network (GRNet) for point cloud completion. In particular, we devise two novel differentiable layers, named Gridding and Gridding Reverse, to convert between point clouds and 3D grids without losing structural information. We also present the differentiable Cubic Feature Sampling layer to extract features of neighboring points, which preserves context information. In addition, we design a new loss function, namely Gridding Loss, to calculate the L1 distance between the 3D grids of the predicted and ground truth point clouds, which is helpful to recover details. Experimental results indicate that the proposed GRNet performs favorably against state-of-the-art methods on the ShapeNet, Completion3D, and KITTI benchmarks.
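The Gridding Loss compares predicted and ground-truth point clouds in a regular grid rather than point-to-point. A minimal sketch of that idea, splatting each point onto its eight surrounding grid vertices with trilinear weights and taking an L1 distance between the resulting grids (a simplification of GRNet's actual differentiable layers; the normalization of point coordinates to [-0.5, 0.5]^3 is an assumption):

```python
import torch

def soft_grid(points, resolution=64):
    """Illustrative sketch: splat each point onto its 8 neighbouring grid
    vertices with trilinear weights to obtain a dense occupancy grid."""
    B, N, _ = points.shape
    coords = (points + 0.5) * (resolution - 1)              # map to [0, R-1]
    base = coords.floor().clamp(0, resolution - 2).long()   # lower corner per point
    frac = coords - base.float()                             # fractional offsets
    grid = points.new_zeros(B, resolution ** 3)
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                corner = base + torch.tensor([dx, dy, dz], device=points.device)
                # trilinear weight: (1 - f) for the lower corner, f for the upper one
                w = (((1 - dx) + (2 * dx - 1) * frac[..., 0]) *
                     ((1 - dy) + (2 * dy - 1) * frac[..., 1]) *
                     ((1 - dz) + (2 * dz - 1) * frac[..., 2]))
                flat = (corner[..., 0] * resolution + corner[..., 1]) * resolution + corner[..., 2]
                grid.scatter_add_(1, flat, w)
    return grid.view(B, resolution, resolution, resolution)

def gridding_loss(pred, gt, resolution=64):
    """L1 distance between the gridded predicted and ground-truth clouds."""
    return torch.abs(soft_grid(pred, resolution) - soft_grid(gt, resolution)).mean()
```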
Chapter
3D shape completion for real data is important but challenging, since partial point clouds acquired by real-world sensors are usually sparse, noisy and unaligned. Different from previous methods, we address the problem of learning 3D complete shape from unaligned and real-world partial point clouds. To this end, we propose a weakly-supervised method to estimate both 3D canonical shape and 6-DoF pose for alignment, given multiple partial observations associated with the same instance. The network jointly optimizes canonical shapes and poses with multi-view geometry constraints during training, and can infer the complete shape given a single partial point cloud. Moreover, learned pose estimation can facilitate partial point cloud registration. Experiments on both synthetic and real data show that it is feasible and promising to learn 3D shape completion through large-scale data without shape and pose supervision.
Article
Learning discriminative shape representation directly on point clouds is still challenging in 3D shape analysis and understanding. Recent studies usually involve three steps: first splitting a point cloud into some local regions, then extracting the corresponding feature of each local region, and finally aggregating all individual local region features into a global feature as shape representation using simple max-pooling. However, such pooling-based feature aggregation methods do not adequately take the spatial relationships (e.g. the relative locations to other regions) between local regions into account, which greatly limits the ability to learn discriminative shape representation. To address this issue, we propose a novel deep learning network, named Point2SpatialCapsule, for aggregating features and spatial relationships of local regions on point clouds, which aims to learn more discriminative shape representation. Compared with the traditional max-pooling based feature aggregation networks, Point2SpatialCapsule can explicitly learn not only geometric features of local regions but also the spatial relationships among them. Point2SpatialCapsule consists of two main modules. To resolve the disorder problem of local regions, the first module, named geometric feature aggregation, is designed to aggregate the local region features into the learnable cluster centers, which explicitly encodes the spatial locations from the original 3D space. The second module, named spatial relationship aggregation, is proposed for further aggregating the clustered features and the spatial relationships among them in the feature space using the spatial-aware capsules developed in this paper. Compared to the previous capsule network based methods, the feature routing on the spatial-aware capsules can learn more discriminative spatial relationships among local regions for point clouds, which establishes a direct mapping between log priors and the spatial locations through feature clusters. Experimental results demonstrate that Point2SpatialCapsule outperforms the state-of-the-art methods in the 3D shape classification, retrieval and segmentation tasks under the well-known ModelNet and ShapeNet datasets.
Article
3D shape reconstruction from multiple hand-drawn sketches is an intriguing way to 3D shape modeling. Currently, state-of-the-art methods employ neural networks to learn a mapping from multiple sketches from arbitrary view angles to a 3D voxel grid. Because of the cubic complexity of 3D voxel grids, however, neural networks are hard to train and limited to low resolution reconstructions, which leads to a lack of geometric detail and low accuracy. To resolve this issue, we propose to reconstruct 3D shapes from multiple sketches using direct shape optimization (DSO), which does not involve deep learning models for direct voxel-based 3D shape generation. Specifically, we first leverage a conditional generative adversarial network (CGAN) to translate each sketch into an attenuance image that captures the predicted geometry from a given viewpoint. Then, DSO minimizes a project-and-compare loss to reconstruct the 3D shape such that it matches the predicted attenuance images from the view angles of all input sketches. Based on this, we further propose a progressive update approach to handle inconsistencies among a few hand-drawn sketches for the same 3D shape. Our experimental results show that our method significantly outperforms the state-of-the-art methods under widely used benchmarks and produces intuitive results in an interactive application.
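The project-and-compare loss can be illustrated for the simplest case of an axis-aligned view: accumulate voxel occupancy along one axis into an attenuance-style image and compare it with the target. This is only a stand-in for the paper's projection from arbitrary sketch viewpoints:

```python
import torch

def project_and_compare(volume, target_image, axis=0):
    """Illustrative sketch of a project-and-compare loss for an axis-aligned
    view: thicker parts of the shape attenuate more, and the resulting image
    is compared with the predicted attenuance image."""
    # volume: (D, D, D) occupancy in [0, 1]; target_image: (D, D)
    transmittance = torch.exp(-volume.sum(dim=axis))     # Beer-Lambert style accumulation
    attenuance = 1.0 - transmittance                      # higher where the shape is thicker
    return torch.abs(attenuance - target_image).mean()    # L1 comparison
```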
Article
Cross-modal retrieval using deep neural networks aims to retrieve relevant data across two different modalities. The performance of cross-modal retrieval remains unsatisfactory due to two problems. First, most previous methods failed to incorporate the common knowledge among modalities when predicting the item representations. Second, the semantic relationships indicated by class labels are still insufficiently utilized, although they are an important clue for inferring similarities between cross-modal items. To address the above issues, we propose a novel cross memory network with pair discrimination (CMPD) for image-text cross-modal retrieval, whose main contributions are two-fold. First, we propose the cross memory as a set of latent concepts to capture the common knowledge among different modalities. It is learnable and can be fused into each modality through an attention mechanism, which aims to discriminatively predict representations. Second, we propose the pair discrimination loss to discriminate the modality labels and class labels of item pairs, which can efficiently capture the semantic relationships among these modality labels and class labels. Comprehensive experimental results show that our method outperforms state-of-the-art approaches in image-text retrieval.
Article
3D shape completion is important to enable machines to perceive the complete geometry of objects from partial observations. To address this problem, view-based methods have been presented. These methods represent shapes as multiple depth images, which can be back-projected to yield corresponding 3D point clouds, and they perform shape completion by learning to complete each depth image using neural networks. While view-based methods lead to state-of-the-art results, they currently do not enforce geometric consistency among the completed views during the inference stage. To resolve this issue, we propose a multi-view consistent inference technique for 3D shape completion, which we express as an energy minimization problem including a data term and a regularization term. We formulate the regularization term as a consistency loss that encourages geometric consistency among multiple views, while the data term guarantees that the optimized views do not drift away too much from a learned shape descriptor. Experimental results demonstrate that our method completes shapes more accurately than previous techniques.
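The inference-time energy described above combines a data term, keeping each optimized depth view near the network's prediction, with a consistency term that encourages the back-projected views to agree. A minimal sketch under those assumptions (`back_project` and the nearest-neighbour consistency measure are illustrative choices, not the paper's exact formulation):

```python
import torch

def consistency_energy(depth_views, init_views, back_project, lam=0.1):
    """Illustrative sketch of the inference-time energy: data term plus a
    cross-view geometric consistency regularizer."""
    # data term: stay close to the learned completion of each depth view
    data = sum(((d - d0) ** 2).mean() for d, d0 in zip(depth_views, init_views))
    # consistency term: back-projected point clouds of all views should agree
    clouds = [back_project(d) for d in depth_views]        # list of (Ni, 3) tensors
    consist = 0.0
    for i in range(len(clouds)):
        for j in range(len(clouds)):
            if i == j:
                continue
            d2 = torch.cdist(clouds[i], clouds[j])          # pairwise distances
            consist = consist + d2.min(dim=1).values.mean() # nearest-neighbour agreement
    return data + lam * consist
```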
Article
3D point cloud completion, the task of inferring the complete geometric shape from a partial point cloud, has been attracting attention in the community. To acquire high-fidelity dense point clouds and avoid the uneven distribution, blurred details, or structural loss in the results of existing methods, we propose a novel approach that completes the partial point cloud in two stages. Specifically, in the first stage, the approach predicts a complete but coarse-grained point cloud with a collection of parametric surface elements. Then, in the second stage, it merges the coarse-grained prediction with the input point cloud by a novel sampling algorithm. Our method utilizes a joint loss function to guide the distribution of the points. Extensive experiments verify the effectiveness of our method and demonstrate that it outperforms existing methods in both the Earth Mover's Distance (EMD) and the Chamfer Distance (CD).
Article
Learning discriminative features directly on point clouds is still challenging in the understanding of 3D shapes. Recent methods usually partition point clouds into local region sets, extract the local region features with fixed-size CNNs or MLPs, and finally aggregate all individual local features into a global feature using simple max pooling. However, due to the irregularity and sparsity of sampled point clouds, it is hard to encode the fine-grained geometry of local regions and their spatial relationships when only using fixed-size filters and individual local feature integration, which limits the ability to learn discriminative features. To address this issue, we present a novel Local-Region-Context Network (LRC-Net), to learn discriminative features on point clouds by encoding the fine-grained contexts inside and among local regions simultaneously. LRC-Net consists of two main modules. The first module, named intra-region context encoding, is designed for capturing the geometric correlation inside each local region with a novel variable-size convolution filter. The second module, named inter-region context encoding, is proposed for integrating the spatial relationships among local regions based on spatial similarity measures. Experimental results show that LRC-Net is competitive with state-of-the-art methods in shape classification and shape segmentation applications.