Article

PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-step Point Moving Paths

Abstract

Point cloud completion concerns the prediction of missing parts for incomplete 3D shapes. A common strategy is to generate the complete shape from the incomplete input. However, the unordered nature of point clouds degrades the generation of high-quality 3D shapes, as the detailed topology and structure of unordered points are hard to capture during the generative process using an extracted latent code. We address this problem by formulating completion as a point cloud deformation process. Specifically, we design a novel neural network, named PMP-Net++, to mimic the behavior of an earth mover. It moves each point of the incomplete input to obtain a complete point cloud, where the total distance of the point moving paths (PMPs) should be the shortest. Therefore, PMP-Net++ predicts a unique PMP for each point according to the constraint of point moving distances. The network learns a strict and unique point-level correspondence, which improves the quality of the predicted complete shape. Moreover, since moving points relies heavily on the per-point features learned by the network, we further introduce a transformer-enhanced representation learning network, which significantly improves the completion performance of PMP-Net++. We conduct comprehensive experiments on shape completion and further explore the application to point cloud up-sampling, which demonstrate the non-trivial improvement of PMP-Net++ over state-of-the-art point cloud completion/up-sampling methods.
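
The deformation process the abstract describes lends itself to a compact sketch: move the points in several steps while penalizing the total path length. The PyTorch sketch below is a minimal illustration of that idea under assumed layer sizes and a naive Chamfer helper; it is not the authors' implementation, which relies on transformer-enhanced per-point features.

```python
import torch
import torch.nn as nn

class PointMover(nn.Module):
    """One deformation step: predicts a displacement for every point.

    A hypothetical stand-in for PMP-Net++'s per-step networks; the real
    model uses transformer-enhanced per-point features."""
    def __init__(self, dim=3, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim))

    def forward(self, pts):              # pts: (B, N, 3)
        return self.mlp(pts)             # displacement: (B, N, 3)

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a: (B,N,3), b: (B,M,3)."""
    d = torch.cdist(a, b)                            # (B, N, M)
    return d.min(2).values.mean() + d.min(1).values.mean()

steps = [PointMover() for _ in range(3)]             # multi-step point moving
partial = torch.rand(4, 2048, 3)                     # incomplete input
gt = torch.rand(4, 2048, 3)                          # complete ground truth

pts, path_len = partial, 0.0
for step in steps:
    delta = step(pts)                                # per-point moving vector
    path_len = path_len + delta.norm(dim=-1).mean()  # total PMP length penalty
    pts = pts + delta                                # move the points

loss = chamfer(pts, gt) + 0.1 * path_len             # shortest-total-path constraint
loss.backward()
```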

... Recently, the development of point cloud completion methods based on deep learning has been rapid [5][6][7][8][9][10][11][12]. Initially, researchers voxelized point clouds and used 3D CNNs to generate complete shapes [13][14][15]. ...
... However, decoding directly from global features could not capture fine-grained information. Consequently, some methods incorporated local structural information into shape completion [9,10,16]. Nevertheless, detail loss still occurred during the decoding process. To generate higher quality complete point clouds, the most popular current approach ...
... From Table 7, it can be seen that our method achieves competitive performance with the smallest FLOPs, indicating that our model requires the least computational resources and has faster training and inference speeds. Although PMP-Net [9] and PMP-Net++ [10] have very small parameter counts, their performance is lower than our method, and they have higher FLOPs. Compared to the original SnowflakeNet, our improved GSSnowflakeNet reduces the number of parameters, preventing overfitting, while also reducing FLOPs and enhancing performance. ...
Article
Full-text available
Point clouds are essential 3D data representations utilized across various disciplines, often requiring point cloud completion methods to address inherent incompleteness. Existing completion methods like SnowflakeNet only consider local attention, lacking global information of the complete shape, and tend to suffer from overfitting as the model depth increases. To address these issues, we introduced self-positioning point-based attention to better capture complete global contextual features and designed a Channel Attention module for adaptive feature adjustment within the global vector. Additionally, we implemented a vector attention grouping strategy in both the skip-transformer and self-positioning point-based attention to mitigate overfitting, improving parameter efficiency and generalization. We evaluated our method on the PCN dataset as well as the ShapeNet55/34 datasets. The experimental results show that our method achieved an average CD-L1 of 7.09 and average CD-L2 scores of 8.0, 7.8, and 14.4 on the PCN, ShapeNet55, ShapeNet34, and ShapeNet-unseen21 benchmarks, respectively. Compared to SnowflakeNet, we improved the average CD by 1.6%, 3.6%, 3.7%, and 4.6% on the corresponding benchmarks, while also reducing complexity and computational costs and accelerating training and inference speeds. Compared to other existing point cloud completion networks, our method also achieves competitive results.
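
A generic analogue of the Channel Attention module described above is squeeze-and-excitation-style gating over the global feature vector. The sketch below is an assumption-labeled stand-in (reduction ratio and layer layout are illustrative), not the paper's exact module.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style gating over a global feature vector.

    A generic analogue of the paper's Channel Attention module; the
    reduction ratio and layer layout here are assumptions."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, g):            # g: (B, C) global shape feature
        return g * self.gate(g)      # per-channel adaptive re-weighting

g = torch.rand(8, 512)
print(ChannelAttention(512)(g).shape)   # torch.Size([8, 512])
```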
... In the field of natural language processing, the Transformer architecture (Vaswani et al. 2017) has revolutionized sequence tasks with its self-attention and cross-attention mechanisms. In computer vision, Vision Transformer (Dosovitskiy et al. 2020; Han et al. 2022) applies the Transformer architecture to 2D image processing, successfully managing image classification by deconstructing images into token sequences. Further research, like DeiT (Touvron et al. 2021), has extended the Transformer's efficient training strategies to visual tasks. ...
... The unstructured, high-dimensional nature of such data presents certain challenges. Preliminary studies show that the Transformer can effectively capture local and global features in point clouds, providing a new approach to 3D point cloud recognition (Mao et al. 2021; Pan et al. 2021) and completion (Wen et al. 2022; Zhang et al. 2022; Fei et al. 2023). This involves using the Transformer's self-attention mechanism to handle point cloud data's local structure and the cross-attention mechanism to predict missing parts using existing point cloud information. ...
Preprint
Full-text available
Point cloud completion aims to reconstruct the complete 3D shape from incomplete point clouds, and it is crucial for tasks such as 3D object detection and segmentation. Despite the continuous advances in point cloud analysis techniques, feature extraction methods still face apparent limitations. The sparse sampling of point clouds, used as input in most methods, often results in a certain loss of global structure information, while traditional local feature extraction methods usually struggle to capture intricate geometric details. To overcome these drawbacks, we introduce PointCFormer, a transformer framework optimized for robust global retention and precise local detail capture in point cloud completion. This framework offers several key advantages. First, we propose a relation-based local feature extraction method to perceive delicate local geometric characteristics. This approach establishes a fine-grained relationship metric between the target point and its k-nearest neighbors, quantifying each neighboring point's contribution to the target point's local features. Second, we introduce a progressive feature extractor that integrates our local feature perception method with self-attention. Starting with a denser sampling of points as input, it iteratively queries long-distance global dependencies and local neighborhood relationships. This extractor maintains enhanced global structure and refined local details without incurring substantial computational overhead. Additionally, we develop a correction module after generating point proxies in the latent space to reintroduce denser information from the input points, enhancing the representation capability of the point proxies. PointCFormer demonstrates state-of-the-art performance on several widely used benchmarks.
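
The relation-based local feature extraction can be pictured as weighting each neighbor's feature by a relation score between the target point and its k-nearest neighbors. The sketch below uses a softmax over negative distances as a stand-in relation metric; PointCFormer learns this metric, so treat the function name and widths here as hypothetical.

```python
import torch

def relation_weighted_local_features(pts, feats, k=16):
    """Weights each neighbor's feature by a relation score.

    pts: (B, N, 3), feats: (B, N, C). The softmax-over-distance relation
    is an illustrative stand-in for PointCFormer's fine-grained metric,
    which is learned rather than fixed."""
    d = torch.cdist(pts, pts)                        # (B, N, N) pairwise dists
    knn = d.topk(k, largest=False)                   # k nearest neighbors
    rel = torch.softmax(-knn.values, dim=-1)         # closer -> larger weight
    idx = knn.indices                                # (B, N, k)
    B, N, C = feats.shape
    gathered = feats.gather(                         # (B, N, k, C) neighbor feats
        1, idx.reshape(B, -1, 1).expand(-1, -1, C)).reshape(B, N, k, C)
    return (rel.unsqueeze(-1) * gathered).sum(2)     # (B, N, C) local feature

pts, feats = torch.rand(2, 1024, 3), torch.rand(2, 1024, 64)
print(relation_weighted_local_features(pts, feats).shape)  # (2, 1024, 64)
```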
... Point-based networks like PointNet [26,27] emerged to process 3D coordinates directly, with PCN [49] being the first to use an end-to-end point-based approach, generating points through coarse-to-fine folding. Subsequent encoder-decoder methods [22,37,40,43] improved performance. ...
... To demonstrate the effectiveness of our method, we compare it with SoTA methods, including PCN [49], GRNet [44], PoinTr [48], PMP-Net++ [40], SnowflakeNet [43], SeedFormer [55], AnchorFormer [6], HyperCD [21], SVDFormer [56], and CRA-PCN [30]. We retrain their networks from scratch on our datasets with their default configurations. ...
Preprint
This paper presents PCDreamer, a novel method for point cloud completion. Traditional methods typically extract features from partial point clouds to predict missing regions, but the large solution space often leads to unsatisfactory results. More recent approaches have started to use images as extra guidance, effectively improving performance, but obtaining paired data of images and partial point clouds is challenging in practice. To overcome these limitations, we harness the relatively view-consistent multi-view diffusion priors within large models to generate novel views of the desired shape. The resulting image set encodes both global and local shape cues, which is especially beneficial for shape completion. To fully exploit the priors, we have designed a shape fusion module for producing an initial complete shape from multi-modality input (i.e., images and point clouds), and a follow-up shape consolidation module to obtain the final complete shape by discarding unreliable points introduced by the inconsistency from diffusion priors. Extensive experimental results demonstrate our superior performance, especially in recovering fine details.
... We compare PointSea with 17 competitors (Yuan et al., 2018; Xie et al., 2020; Wang et al., 2020; Zhang et al., 2020; Yu et al., 2021; Xiang et al., 2023; Wen et al., 2023; Yan et al., 2022; Zhou et al., 2022; Zhang et al., 2023d; Tang et al., 2022; Fu et al., 2023; Zhang et al., 2023c; Xu et al., 2023b; Chen et al., 2023b; Yu et al., 2023a; Zhu et al., 2023b) in Table 3. The results demonstrate that PointSea achieves the best performance across all metrics. ...
Article
Full-text available
Point cloud completion is a fundamental yet not well-solved problem in 3D vision. Current approaches often rely on 3D coordinate information and/or additional data (e.g., images and scanning viewpoints) to fill in missing parts. Unlike these methods, we explore self-structure augmentation and propose PointSea for global-to-local point cloud completion. In the global stage, consider how we inspect a defective region of a physical object: we may observe it from various perspectives for a better understanding. Inspired by this, PointSea augments data representation by leveraging self-projected depth images from multiple views. To reconstruct a compact global shape from the cross-modal input, we incorporate a feature fusion module to fuse features at both intra-view and inter-view levels. In the local stage, to reveal highly detailed structures, we introduce a point generator called the self-structure dual-generator. This generator integrates both learned shape priors and geometric self-similarities for shape refinement. Unlike existing efforts that apply a unified strategy for all points, our dual-path design adapts refinement strategies conditioned on the structural type of each point, addressing the specific incompleteness of each point. Comprehensive experiments on widely-used benchmarks demonstrate that PointSea effectively understands global shapes and generates local details from incomplete input, showing clear improvements over existing methods. Our code is available at https://github.com/czvvd/SVDFormer_PointSea.
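
PointSea's self-projected depth images can be approximated in a few lines: project the normalized point cloud orthographically along an axis and keep the nearest depth per pixel. This is a simplified stand-in for the paper's multi-view rendering; the resolution and axis-aligned views below are assumptions.

```python
import numpy as np

def project_depth(pts, view_axis=2, res=64):
    """Orthographic self-projected depth image of a point cloud.

    pts: (N, 3) normalized to [0, 1]. A minimal stand-in for PointSea's
    multi-view depth rendering; real systems rasterize several calibrated
    views, while here each view is a plain axis-aligned projection."""
    axes = [a for a in range(3) if a != view_axis]
    uv = np.clip((pts[:, axes] * (res - 1)).astype(int), 0, res - 1)
    depth = np.full((res, res), 1.0)                 # far plane = 1.0
    for (u, v), z in zip(uv, pts[:, view_axis]):
        depth[u, v] = min(depth[u, v], z)            # keep nearest surface
    return depth

pts = np.random.rand(2048, 3)
views = [project_depth(pts, a) for a in range(3)]    # three orthogonal views
print(views[0].shape)                                # (64, 64)
```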
... Transformer in point cloud processing captures global relationships through its self-attention mechanism, thereby enhancing the richness and accuracy of feature representation. The transformer architecture [36] is as follows (Fig. 3): first, the point cloud data with input dimensions (N, K, d + C), where N is the number of points, K is the number of neighbors per point, and d + C is the feature dimension, passes through a Multi-Layer Perceptron (MLP) to extract initial features, denoted as ...
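
Concretely, the (N, K, d + C) layout described in this snippet can be fed through a shared MLP with pooling over the K neighbors to obtain per-point initial features. A minimal sketch with assumed widths:

```python
import torch
import torch.nn as nn

# Shared MLP over grouped point features, matching the (N, K, d + C)
# layout the snippet describes; layer widths are illustrative assumptions.
N, K, d, C = 1024, 16, 3, 32
grouped = torch.rand(N, K, d + C)          # per-point neighborhoods

mlp = nn.Sequential(
    nn.Linear(d + C, 64), nn.ReLU(),
    nn.Linear(64, 128), nn.ReLU())

feats = mlp(grouped).max(dim=1).values     # pool over the K neighbors
print(feats.shape)                         # torch.Size([1024, 128]) initial features
```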
Article
Full-text available
Cotton phenomics plays a crucial role in understanding and managing the growth and development of cotton plants. The segmentation of point clouds, a process that underpins the measurement of plant organ structures through 3D point clouds, is necessary for obtaining precise phenotypic parameters. This study proposes a cotton point cloud organ semantic segmentation method named TPointNetPlus, which combines PointNet++ and Transformer algorithms. Firstly, a dedicated point cloud dataset for cotton plants is constructed using multi-view images. Secondly, the attention module Transformer is introduced into the PointNet++ model to increase the accuracy of feature extraction. Finally, organ-level cotton plant point cloud segmentation is performed using the HDBSCAN algorithm, successfully segmenting cotton leaves, bolls, and branches from the entire plant, and obtaining their phenotypic feature parameters. The research results indicate that the TPointNetPlus model achieved a high accuracy of 98.39% in leaf semantic segmentation. The correlation coefficients between the measured values of the phenotypic parameters (plant height, leaf area, and boll volume) ranged from 0.95 to 0.97, demonstrating the accurate predictive capability of the model for these key traits. The proposed method, which enables automated data analysis from a plant's 3D point cloud to phenotypic parameters, provides a reliable reference for in-depth studies of plant phenotypes.
... The main limitation of using manual features is that such features cannot cover complex point cloud characteristics. Most deep learning based methods [18][19][20] for point cloud completion are grid- or voxel-based methods [21,22] or point-based methods [23,24], owing to the disordered nature of point clouds. Both grid- and voxel-based methods have common shortcomings such as a lack of detailed features, inability to produce high-resolution output, high memory requirements, complex models that are difficult to train, and difficulty in fine-tuning the model [25,26]. ...
Article
Full-text available
Point cloud completion aims to infer complete point clouds from partial 3D point cloud inputs. Various previous methods apply coarse-to-fine networks to generate complete point clouds. However, such methods are not only relatively time-consuming but also cannot provide representative complete-shape features based on partial inputs. In this paper, a novel feature alignment fast point cloud completion network (FACNet) is proposed to directly and efficiently generate the detailed shapes of objects. FACNet aligns high-dimensional feature distributions of both partial and complete point clouds to maintain global information about the complete shape. During its decoding process, the local features from the partial point cloud are incorporated along with the maintained global information to ensure complete and time-saving generation of the complete point cloud. Experimental results show that FACNet outperforms the state-of-the-art on PCN, Completion3D, and MVP datasets, and achieves competitive performance on ShapeNet-55 and KITTI datasets. Moreover, FACNet and a simplified version, FACNet-slight, achieve a significant speedup of 3–10 times over other state-of-the-art methods.
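
FACNet's central idea, aligning the features of partial and complete clouds, can be sketched as pulling the two global features together during training. The encoder and the plain L2 objective below are assumptions; the paper aligns full high-dimensional feature distributions rather than single vectors.

```python
import torch
import torch.nn as nn

class GlobalEncoder(nn.Module):
    """PointNet-style global encoder (illustrative widths)."""
    def __init__(self, out=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, out))

    def forward(self, pts):                 # (B, N, 3)
        return self.mlp(pts).max(1).values  # (B, out) global feature

# Feature alignment in the spirit of FACNet: pull the partial cloud's
# global feature toward the complete cloud's during training. The L2
# objective is an assumed stand-in for distribution-level alignment.
enc = GlobalEncoder()
partial, complete = torch.rand(4, 1024, 3), torch.rand(4, 2048, 3)
align_loss = (enc(partial) - enc(complete)).pow(2).mean()
align_loss.backward()
```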
... Yu et al. [42] developed PoinTr, utilizing a transformer framework that reinterprets point cloud completion as a set-to-set translation. Wen et al. [43] introduced a transformer-enhanced representation learning network based on PMP-Net [27] and proposed PMP-Net++, significantly improving the performance of point cloud completion. FBNet [44] proposed a feedback network to refine the present features and designed a point cross transformer to build feedback connections. ...
Article
Full-text available
The attention mechanism has significantly progressed in various point cloud tasks. Benefiting from its significant competence in capturing long-range dependencies, research in point cloud completion has achieved promising results. However, the typically disordered point cloud data features complicated non-Euclidean geometric structures and exhibits unpredictable behavior. Most current attention modules are based on Euclidean or local geometry, which fails to accurately represent the intrinsic non-Euclidean characteristics of point cloud data. Thus, we propose a novel geodesic attention-based multi-stage refinement transformer network, which enables the alignment of feature dimensions among query, key, and value, and long-range geometric dependencies are captured on the manifold. Then, a novel Position Feature Extractor is designed to enhance geometric features and explicitly capture graph-based non-Euclidean properties of point cloud objects. A Recurrent Information Aggregation Unit is further applied to aggregate historical information from the previous stages and current geometric features to guide the network in the current stage. The proposed method exhibits strong competitiveness when compared to current state-of-the-art methods.
... Among all the 3D descriptors [6][7][8][9], the point cloud stands out because of its remarkable ability to render spatial structure at a lower computational cost. However, because of blockages, viewing angles, and sensor resolution constraints, raw point clouds are usually sparse and defective [10][11][12]. Consequently, point cloud completion becomes essential. Taking advantage of large-scale point cloud datasets [13][14][15], many effective learning-based point cloud completion approaches have arisen. ...
Article
Full-text available
The goal of point cloud completion is to reconstruct raw scanned point clouds acquired from incomplete observations due to occlusion and restricted viewpoints. Numerous methods use a partial-to-complete framework, directly predicting missing components via global characteristics extracted from incomplete inputs. However, this makes detail recovery challenging, as global characteristics fail to provide complete missing component specifics. A new point cloud completion method named Point-PC is proposed. A memory network and a causal inference model are separately designed to introduce shape priors and select absent shape information as supplementary geometric factors for aiding completion. Concretely, a memory mechanism is proposed to store complete shape features and their associated shapes in a key-value format. The authors design a pre-training strategy that uses contrastive learning to map incomplete shape features into the complete shape feature domain, enabling retrieval of analogous shapes from incomplete inputs. In addition, the authors employ backdoor adjustment to eliminate confounders, which are shape prior components sharing identical semantic structures with incomplete inputs. Experiments conducted on three datasets show that our method achieves superior performance compared to state-of-the-art approaches. The code for Point-PC can be accessed at https://github.com/bizbard/Point-PC.git.
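
The key-value shape memory can be sketched as a similarity lookup: an incomplete-shape feature queries stored complete-shape features (keys) and retrieves their point clouds (values). The cosine retrieval below is an assumption standing in for Point-PC's contrastively trained mapping.

```python
import torch
import torch.nn.functional as F

def retrieve_shape_priors(query, keys, values, top_k=3):
    """Key-value shape memory lookup in the spirit of Point-PC.

    query: (B, C) incomplete-shape feature; keys: (M, C) stored complete-
    shape features; values: (M, P, 3) their point clouds. Cosine retrieval
    is an assumed stand-in for the paper's learned mapping."""
    sim = F.normalize(query, dim=-1) @ F.normalize(keys, dim=-1).T  # (B, M)
    idx = sim.topk(top_k, dim=-1).indices                           # (B, top_k)
    return values[idx]                                              # (B, top_k, P, 3)

query = torch.rand(2, 256)
keys, values = torch.rand(100, 256), torch.rand(100, 2048, 3)
priors = retrieve_shape_priors(query, keys, values)
print(priors.shape)   # torch.Size([2, 3, 2048, 3])
```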
... It adopts a coarse-to-fine architecture to generate a rough approximation of the missing parts first and then refine the details to achieve a more accurate completion. Based on the encoder-decoder architecture, many works (Cai et al. 2024; Wen et al. 2021, 2022; Xiang et al. 2021) obtain plausible performance. For example, SnowflakeNet interprets point cloud completion as an explicit and structured generation of local patterns and introduces a novel type of skip transformer to learn the split patterns in the Snowflake Point Decomposition (SPD). ...
Preprint
Point cloud completion aims to recover partial geometric and topological shapes caused by equipment defects or limited viewpoints. Current methods either solely rely on the 3D coordinates of the point cloud to complete it or incorporate additional images with well-calibrated intrinsic parameters to guide the geometric estimation of the missing parts. Although these methods have achieved excellent performance by directly predicting the location of complete points, the extracted features lack fine-grained information regarding the location of the missing area. To address this issue, we propose a rapid and efficient method to expand an unimodal framework into a multimodal framework. This approach incorporates a position-aware module designed to enhance the spatial information of the missing parts through a weighted map learning mechanism. In addition, we establish a Point-Text-Image triplet corpus PCI-TI and MVP-TI based on the existing unimodal point cloud completion dataset and use the pre-trained vision-language model CLIP to provide richer detail information for 3D shapes, thereby enhancing performance. Extensive quantitative and qualitative experiments demonstrate that our method outperforms state-of-the-art point cloud completion methods.
... have gradually become mainstream in recent years, such as the PF-Net and PMP-Net++ networks [7][8], ... to perform global viewpoint planning and generate the optimal viewpoint set and acquisition path. Explore Then Exploit is one of the most common approaches [11]; methods based on this idea usually consist of two stages: ...
Article
Full-text available
A 3D model improvement method based on supplementary capture by an unmanned ground vehicle was proposed to address damage and holes in 3D reconstruction models generated from images captured solely by the unmanned ground vehicle. This method combines model resolution, triangular mesh structure, and manual point selection to extract the areas needing improvement, generates 3D bounding boxes and normal vector information, and uses heuristic methods to generate supplementary viewpoints. The results show that, with this method's optimization, the low-quality areas of the rough 3D model are significantly improved, with an average reduction of 66% in model projection pixel size. This method therefore effectively enhances the quality of 3D model reconstruction, providing a reliable solution for large-scale, detailed outdoor 3D reconstruction.
... Neural networks have increasingly shown transformative potential in 3D applications, contributing to advancements in areas such as shape modeling, scene reconstruction, and virtual simulation [23,29,40,41,50,52,82,83,85,94,97,99-101]. Building on this momentum, our work focuses on generating implicit functions, which are known to be efficient 3D data representations. ...
Preprint
Full-text available
Autoregressive (AR) models have achieved remarkable success in natural language and image generation, but their application to 3D shape modeling remains largely unexplored. Unlike diffusion models, AR models enable more efficient and controllable generation with faster inference times, making them especially suitable for data-intensive domains. Traditional 3D generative models using AR approaches often rely on "next-token" predictions at the voxel or point level. While effective for certain applications, these methods can be restrictive and computationally expensive when dealing with large-scale 3D data. To tackle these challenges, we introduce 3D-WAG, an AR model for 3D implicit distance fields that can perform unconditional, class-conditioned, and text-conditioned shape generation. Our key idea is to encode shapes as multi-scale wavelet token maps and use a Transformer to predict the "next higher-resolution token map" in an autoregressive manner. By redefining the 3D AR generation task as "next-scale" prediction, we reduce the computational cost of generation compared to traditional "next-token" prediction models, while preserving essential geometric details of 3D shapes in a more structured and hierarchical manner. We evaluate 3D-WAG through quantitative and qualitative comparisons with state-of-the-art methods on widely used benchmarks. Our results show that 3D-WAG achieves superior performance on key metrics such as Coverage and MMD, generating high-fidelity 3D shapes that closely match the real data distribution.
... Improvements in the completion performance of recent PCCNs can primarily be attributed to innovations in network architectures [6,7,8], point generation strategies [9,10,11], and representations [12]. In contrast, the training strategy employed by existing PCCNs has remained relatively unchanged, that is, to minimize the dissimilarities between the predicted complete point clouds and the ground truths [13], often measured using the computationally efficient Chamfer Distance (CD) metric [14]. ...
Preprint
Full-text available
Point cloud completion networks are conventionally trained to minimize the disparities between the completed point cloud and its ground-truth counterpart. However, an incomplete object-level point cloud can have multiple valid completion solutions when examined in isolation. This one-to-many mapping issue can send contradictory supervision signals to the network because the loss function may produce different values for identical input-output pairs. In many cases, this issue adversely affects the network optimization process. In this work, we propose to enhance the conventional learning objective with a novel completion consistency loss that mitigates the one-to-many mapping problem. Specifically, the proposed consistency loss ensures that a point cloud completion network generates a coherent completion solution for incomplete objects originating from the same source point cloud. Experimental results across multiple well-established datasets and benchmarks demonstrate that the proposed completion consistency loss has an excellent capability to enhance the completion performance of various existing networks without any modification to their design. The proposed consistency loss improves the performance of a point completion network without affecting its inference speed, thereby increasing the accuracy of point cloud completion. Notably, a state-of-the-art point completion network trained with the proposed consistency loss can achieve state-of-the-art accuracy on the challenging new MVP dataset. The code and results of experiments on various point completion models using the proposed consistency loss will be available at: https://github.com/kaist-avelab/ConsistencyLoss.
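
One plausible reading of the proposed consistency loss is to penalize Chamfer disagreement among completions produced from different partial views of the same source shape. The sketch below implements that reading; it is illustrative, not the paper's exact formulation.

```python
import torch

def chamfer(a, b):
    """Symmetric Chamfer distance, a: (B, N, 3), b: (B, M, 3)."""
    d = torch.cdist(a, b)
    return d.min(2).values.mean() + d.min(1).values.mean()

def consistency_loss(completions):
    """Penalizes disagreement among completions of partial views that
    originate from the same source shape.

    completions: list of (B, N, 3) network outputs, one per partial view.
    Pairwise Chamfer agreement is an assumed reading of the loss."""
    loss, pairs = 0.0, 0
    for i in range(len(completions)):
        for j in range(i + 1, len(completions)):
            loss = loss + chamfer(completions[i], completions[j])
            pairs += 1
    return loss / max(pairs, 1)

outs = [torch.rand(4, 2048, 3) for _ in range(3)]   # three views, one source
print(consistency_loss(outs))
```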
... Deep learning based 3D shape reconstruction has made great progress with different 3D representations including voxel grids [13,53,61], triangle meshes [11,21,22,34], point clouds [3,18-20,24,27,35,36,47,58], and implicit functions [4,7,9,10,12,16,23,30,41,42,46,63,64]. The widely used strategy aims to leverage deep learning models to learn a global prior for shape reconstruction from 2D images. ...
Preprint
Full-text available
It is challenging to reconstruct 3D point clouds of unseen classes from single 2D images. Instead of an object-centered coordinate system, current methods generalize global priors learned on seen classes to reconstruct 3D shapes from unseen classes in a viewer-centered coordinate system. However, both reconstruction accuracy and interpretability still leave much room for improvement. To resolve this issue, we introduce learning local pattern modularization for reconstructing 3D shapes in unseen classes, which achieves both good generalization ability and high reconstruction accuracy. Our insight is to learn a local prior that is class-agnostic and easy to generalize in an object-centered coordinate system. Specifically, the local prior is learned via a process of learning and customizing local pattern modularization in seen classes. During this process, we first learn a set of patterns in local regions, which form the basis, in the object-centered coordinate system, for representing an arbitrary region on shapes across different classes. Then, we modularize each region on an initially reconstructed shape using the learned local patterns. Based on that, we customize the local pattern modularization using the input image by refining the reconstruction with more details. Our method can reconstruct high-fidelity point clouds from unseen classes in an object-centered coordinate system without requiring a large number of patterns or any additional information, such as segmentation supervision or camera poses. Our experimental results under widely used benchmarks show that our method achieves state-of-the-art reconstruction accuracy for shapes from unseen classes. The code is available at https://github.com/chenchao15/Unseen.
... We compare our GeoFormer with many classical methods [28,35,44,46,50,52] and several recent state-of-the-art techniques [4,16,38,41,42,45,48,49,55,57]. ...
Preprint
Full-text available
Point cloud completion aims to recover accurate global geometry and preserve fine-grained local details from partial point clouds. Conventional methods typically predict unseen points directly from 3D point cloud coordinates or use self-projected multi-view depth maps to ease this task. However, these gray-scale depth maps cannot reach multi-view consistency, consequently restricting the performance. In this paper, we introduce a GeoFormer that simultaneously enhances the global geometric structure of the points and improves the local details. Specifically, we design a CCM Feature Enhanced Point Generator to integrate image features from multi-view consistent canonical coordinate maps (CCMs) and align them with pure point features, thereby enhancing the global geometry feature. Additionally, we employ the Multi-scale Geometry-aware Upsampler module to progressively enhance local details. This is achieved through cross attention between the multi-scale features extracted from the partial input and the features derived from previously estimated points. Extensive experiments on the PCN, ShapeNet-55/34, and KITTI benchmarks demonstrate that our GeoFormer outperforms recent methods, achieving the state-of-the-art performance. Our code is available at https://github.com/Jinpeng-Yu/GeoFormer.
... Neural surface reconstruction from multi-view images [12], [13], [16], [26], [61], [96], [120], [121], [122], [123], [124], [125], [126], [127] has been shown to be powerful for recovering dense 3D surfaces via image-based neural rendering. We demonstrate that our method can also improve the performance of multi-view reconstruction by introducing an SDF through noise to noise mapping on point clouds from SfM [15] as a geometry prior. ...
Preprint
Learning signed distance functions (SDFs) from point clouds is an important task in 3D computer vision. However, without ground truth signed distances, point normals or clean point clouds, current methods still struggle to learn SDFs from noisy point clouds. To overcome this challenge, we propose to learn SDFs via a noise to noise mapping, which does not require any clean point cloud or ground truth supervision. Our novelty lies in the noise to noise mapping which can infer a highly accurate SDF of a single object or scene from its multiple or even single noisy observations. We achieve this by a novel loss which enables statistical reasoning on point clouds and maintains geometric consistency although point clouds are irregular, unordered and have no point correspondence among noisy observations. To accelerate training, we use multi-resolution hash encodings implemented in CUDA in our framework, which reduces our training time by a factor of ten, achieving convergence within one minute. We further introduce a novel schema to improve multi-view reconstruction by estimating SDFs as a prior. Our evaluations under widely-used benchmarks demonstrate our superiority over the state-of-the-art methods in surface reconstruction from point clouds or multi-view images, point cloud denoising and upsampling.
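
The pulling operation underlying this style of SDF learning moves a query point onto the zero level set along the SDF gradient, q' = q - f(q) * grad f(q) / |grad f(q)|. The sketch below pairs it with a Chamfer stand-in for the paper's EMD-style statistical loss between two noisy observations; everything beyond the pulling formula is an assumption.

```python
import torch
import torch.nn as nn

def pull_to_surface(f, q):
    """Moves query points onto the zero level set of SDF f along its
    gradient: q' = q - f(q) * grad f(q) / |grad f(q)|."""
    q = q.detach().requires_grad_(True)
    d = f(q)                                               # (N, 1) signed dists
    g = torch.autograd.grad(d.sum(), q, create_graph=True)[0]
    return q - d * g / (g.norm(dim=-1, keepdim=True) + 1e-8)

def chamfer(a, b):
    d = torch.cdist(a, b)
    return d.min(1).values.mean() + d.min(0).values.mean()

# Noise-to-noise style objective: points pulled from one noisy observation
# should match a *different* noisy observation of the same surface. Chamfer
# is a convenience stand-in for the EMD-based loss the paper describes.
sdf = nn.Sequential(nn.Linear(3, 64), nn.Softplus(), nn.Linear(64, 1))
obs1, obs2 = torch.rand(2048, 3), torch.rand(2048, 3)
loss = chamfer(pull_to_surface(sdf, obs1), obs2)
loss.backward()
```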
Article
The bigger picture Many natural objects have intrinsic flexibility, for example, through articulated joints in living beings such as humans. In applications like autonomous vehicles, it is important that a class of object captured through imaging devices in either 2D or 3D is safely identified. Differences caused by motion and flexibility are often confounded by intrinsic differences, as seen, for example, in different plants of the same type of tree. Thus, reliably recognizing such objects is a challenging problem. Our study creates a bridge to this problem scope from molecular science by offering datasets for benchmarking methods trying to solve this recognition task. Molecules are flexible and offer many unequivocal classes that can “look” very similar. We were interested in how well modern machine learning methods perform in this task when they have to rely on spatial information alone. Taking a dataset from molecular science cures some technical issues seen with imaging data, such as differences in scale, resolution, and ambiguous labels. Our research shows that the exact way in which the spatial information is encoded continues to be important, and this holds for both accuracy and transferability. The latter can be thought of as a proxy for the appropriateness and generalizability of the strategy a given model learns. Transferability is the biggest concern in fields where there are limited and often non-extensible amounts of data, such as drug discovery, digital humanities, or financial modeling, and we touch upon the implications of our results for applications of machine learning in such a setting.
Article
In partial-to-complete point cloud completion, it is imperative that every patch in the output point cloud faithfully represent the corresponding patch in the partial input, ensuring similarity in geometric content. To achieve this objective, we propose a straightforward method dubbed PPCL that aims to maximize the mutual information between two point patches from the encoder and decoder by leveraging a contrastive learning framework. Contrastive learning facilitates the mapping of two similar point patches to corresponding points in a learned feature space. Notably, we explore multi-layer point patches contrastive learning (MPPCL) instead of operating on the whole point cloud. The negatives are drawn from within the input point cloud itself rather than from the rest of the dataset. To fully leverage the local geometries present in the partial inputs and enhance the quality of point patches in the encoder, we introduce Multi-level Feature Learning (MFL) and Hierarchical Feature Fusion (HFF) modules. These modules also facilitate the learning of features at various levels. Moreover, Spatial-Channel Transformer Point Up-sampling (SCT) is devised to guide the decoder to construct a complete and fine-grained point cloud by leveraging the enhanced point patches from our point patch contrastive learning. Extensive experiments demonstrate that our PPCL achieves better quantitative and qualitative performance than off-the-shelf methods across various datasets.
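
The patch-level contrastive objective can be sketched as InfoNCE between corresponding encoder and decoder patch features, with the other patches of the same cloud serving as negatives, as the abstract describes. The temperature and cosine similarity below are assumptions.

```python
import torch
import torch.nn.functional as F

def patch_info_nce(enc_patches, dec_patches, tau=0.07):
    """InfoNCE over corresponding encoder/decoder point-patch features.

    enc_patches, dec_patches: (P, C), row i of each describing the same
    patch. Negatives come from the other patches of the same cloud, as in
    PPCL; temperature and similarity choices are assumptions."""
    e = F.normalize(enc_patches, dim=-1)
    d = F.normalize(dec_patches, dim=-1)
    logits = e @ d.T / tau                     # (P, P) patch similarities
    target = torch.arange(len(e))              # matching patch is the positive
    return F.cross_entropy(logits, target)

enc, dec = torch.rand(64, 128), torch.rand(64, 128)
print(patch_info_nce(enc, dec))
```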
Article
Full-text available
Point clouds obtained from laser scanners or other devices often exhibit incompleteness, which poses a challenge for subsequent point cloud processing. Therefore, accurately predicting the complete shape from partial observations is of paramount significance. In this paper, we introduce PCCDiff, a probabilistic model inspired by Denoising Diffusion Probabilistic Models (DDPMs), designed for point cloud completion tasks. Our model aims to predict missing parts in incomplete 3D shapes by learning the reverse diffusion process, transforming a 3D Gaussian noise distribution into the desired shape distribution without any structural assumption (e.g., geometric symmetry). Firstly, we design a conditional point cloud completion network that integrates Missing-Transformer and TreeGCN, facilitating the prediction of complete point cloud features. Subsequently, at each step of the diffusion process, the obtained point cloud features serve as condition inputs for the symmetric Diffusion ResUNet. By incorporating these condition features and incomplete point clouds into the diffusion process, PCCDiff demonstrates superior generation performance compared to other methods. Finally, extensive experiments are conducted to demonstrate the effectiveness of our proposed generative model for completing point clouds.
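
The reverse diffusion that PCCDiff learns follows the standard DDPM update, with the noise estimate conditioned on features of the partial input. A minimal sketch of one reverse step under an assumed linear noise schedule (the conditional network is replaced by a placeholder):

```python
import torch

def ddpm_reverse_step(x_t, t, eps_pred, betas):
    """One conditional DDPM reverse step for point coordinates.

    x_t: (B, N, 3) noisy points; eps_pred: the network's noise estimate,
    conditioned (as in PCCDiff) on features of the partial input. Standard
    DDPM update; the schedule below is an illustrative assumption."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bar[t])
    mean = (x_t - coef * eps_pred) / torch.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)

betas = torch.linspace(1e-4, 0.02, 1000)         # linear schedule (assumed)
x = torch.randn(2, 2048, 3)                      # start from Gaussian noise
for t in reversed(range(1000)):
    eps = torch.zeros_like(x)                    # placeholder for the network
    x = ddpm_reverse_step(x, t, eps, betas)
```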
Preprint
The common occurrence of occlusion-induced incompleteness in point clouds has made point cloud completion (PCC) a task of wide concern in the field of geometric processing. Existing PCC methods typically produce complete point clouds from partial point clouds in a coarse-to-fine paradigm, with the coarse stage generating entire shapes and the fine stage improving texture details. Though diffusion models have demonstrated effectiveness in the coarse stage, the fine stage still faces challenges in producing high-fidelity results due to the ill-posed nature of PCC. The intrinsic contextual information for texture details in partial point clouds is the key to solving the challenge. In this paper, we propose a high-fidelity PCC method that digs into both short and long-range contextual information from the partial point cloud in the fine stage. Specifically, after generating the coarse point cloud via a diffusion-based coarse generator, a mixed sampling module introduces short-range contextual information from partial point clouds into the fine stage. A surface freezing module safeguards points from noise-free partial point clouds against disruption. As for the long-range contextual information, we design a similarity modeling module to derive similarity with rigid transformation invariance between points, conducting effective matching of geometric manifold features globally. In this way, the high-quality components present in the partial point cloud serve as valuable references for refining the coarse point cloud with high fidelity. Extensive experiments have demonstrated the superiority of the proposed method over SOTA competitors. Our code is available at https://github.com/JS-CHU/ContextualCompletion.
Article
Full-text available
Understanding and predicting viewers’ emotional responses to videos has emerged as a pivotal challenge due to its multifaceted applications in video indexing, summarization, personalized content recommendation, and effective advertisement design. A major roadblock in this domain has been the lack of expansive datasets with videos paired with viewer-reported emotional annotations. We address this challenge by employing a deep learning methodology trained on a dataset derived from the application of System1’s proprietary methodologies on over 30,000 real video advertisements, each annotated by an average of 75 viewers. This equates to over 2.3 million emotional annotations across eight distinct categories: anger, contempt, disgust, fear, happiness, sadness, surprise, and neutral, coupled with the temporal onset of these emotions. Leveraging 5-second video clips, our approach aims to capture pronounced emotional responses. Our convolutional neural network, which integrates both video and audio data, predicts salient 5-second emotional clips with an average balanced accuracy of 43.6%, and shows particularly high performance for detecting happiness (55.8%) and sadness (60.2%). When applied to full advertisements, our model achieves a strong average AUC of 75% in determining emotional undertones. To facilitate further research, our trained networks are freely available upon request for research purposes. This work not only overcomes previous data limitations but also provides an accurate deep learning solution for video emotion understanding.
Article
Full-text available
To improve the integrity of vegetation point clouds, missing vegetation points can be compensated for through vegetation point cloud completion technology. Further, this can enhance the accuracy of these point clouds' applications, particularly in terms of quantitative calculations, such as for the urban living vegetation volume (LVV). However, owing to factors such as mutual occlusion between ground objects and limitations of sensor perspective and penetration ability, parts of single-tree point cloud structures are missing, and existing completion techniques cannot be directly applied to single-tree point cloud completion. This study combines cutting-edge deep learning techniques, such as self-supervision and a multiscale encoder-decoder, to propose a tree completion network (TC-Net) model suitable for single-tree structure completion. Motivated by the attenuation of electromagnetic waves through a uniform medium, this study proposes an uneven density loss pattern. This study uses the local similarity visualization method, which is different from ordinary Chamfer distance (CD) values and can better assist in visually assessing the effects of point cloud completion. Experimental results indicate that the TC-Net model, based on the uneven density loss pattern, effectively identifies and compensates for the missing structures of single-tree point clouds in real scenarios, thus reducing the average CD value by more than 2.0, with the best result dropping from 23.89 to 13.08. Meanwhile, experiments on a large-scale tree dataset show that TC-Net has the lowest average CD value of 13.28. In the urban LVV estimates, the completed point clouds have reduced the average MAE, RMSE, and MAPE from 9.57, 7.78, and 14.11% to 1.86, 2.84, and 5.23%, respectively, demonstrating the effectiveness of TC-Net.
Article
Surface reconstruction for point clouds is one of the important tasks in 3D computer vision. The latest methods rely on generalizing priors learned from large-scale supervision. However, the learned priors usually do not generalize well to various geometric variations that are unseen during training, especially for extremely sparse point clouds. To resolve this issue, we present a neural network to directly infer SDFs from single sparse point clouds without using signed distance supervision, learned priors or even normals. Our insight here is to learn surface parameterization and SDF inference in an end-to-end manner. To make up for the sparsity, we leverage parameterized surfaces as a coarse surface sampler to provide many coarse surface estimations in training iterations, according to which we mine supervision for our thin plate splines (TPS) based network to infer smooth SDFs in a statistical way. Our method significantly improves the generalization ability and accuracy on unseen point clouds. Our experimental results show our advantages over the state-of-the-art methods in surface reconstruction for sparse point clouds under synthetic datasets and real scans.
Article
Full-text available
In this paper, we investigate the effectiveness of shape completion neural networks as clinical aids in maxillofacial surgery planning. We present a pipeline to apply shape completion networks to automatically reconstruct complete eumorphic 3D meshes starting from a partial input mesh, easily obtained from CT data routinely acquired for surgery planning. Most of the existing works introduced solutions to aid the design of implants for cranioplasty, i.e. all the defects are located in the neurocranium. In this work, we focus on reconstructing defects localized on both neurocranium and splanchnocranium. To this end, we introduce a new dataset, specifically designed for this task, derived from publicly available CT scans and subjected to a comprehensive pre-processing procedure. All the scans in the dataset have been manually cleaned and aligned to a common reference system. In addition, we devised a pre-processing stage to automatically extract point clouds from the scans and enrich them with virtual defects. We experimentally compare several state-of-the-art point cloud completion networks and identify the two most promising models. Finally, expert surgeons evaluated the best-performing network on a clinical case. Our results show how casting the creation of personalized implants as a problem of shape completion is a promising approach for automatizing this complex task.
Article
This work presents a new completion method that is specifically designed for low-overlapping partial point cloud registration. Based on the assumption that the candidate partial point clouds to be registered belong to the same target, the proposed mutual prior based completion (MPC) method uses these candidate partial point clouds as completion references for each other to extend their overlapping regions. Without relying on shape prior knowledge, MPC can work for different types of point clouds, such as object, room scene, and street view. The main challenge of this mutual reference approach is that partial clouds without spatial alignment cannot provide a reliable completion reference. Based on mutual information maximization, a progressive completion structure is developed to achieve pose, feature representation and completion alignment between input point clouds. Experiments on public datasets show encouraging results. Especially for the low-overlapping cases, compared with the state-of-the-art (SOTA) models, the size of overlapping regions can be increased by about 15.0%, and the rotation and translation errors can be reduced by 30.8% and 57.7% respectively. (Code is available at: https://*.*)
Chapter
Full-text available
Point clouds are often the default choice for many applications as they exhibit more flexibility and efficiency than volumetric data. Nevertheless, their unorganized nature – points are stored in an unordered way – makes them less suited to be processed by deep learning pipelines. In this paper, we propose a method for 3D object completion and classification based on point clouds. We introduce a new way of organizing the extracted features based on their activations, which we name soft pooling. For the decoder stage, we propose regional convolutions, a novel operator aimed at maximizing the global activation entropy. Furthermore, inspired by the local refining procedure in Point Completion Network (PCN), we also propose a patch-deforming operation to simulate deconvolutional operations for point clouds. This paper proves that our regional activation can be incorporated in many point cloud architectures like AtlasNet and PCN, leading to better performance for geometric completion. We evaluate our approach on different 3D tasks such as object completion and classification, achieving state-of-the-art accuracy.
Article
Full-text available
Visual analytics for machine learning has recently evolved as one of the most exciting areas in the field of visualization. To better identify which research topics are promising and to learn how to apply relevant techniques in visual analytics, we systematically review 259 papers published in the last ten years together with representative works before 2010. We build a taxonomy, which includes three first-level categories: techniques before model building, techniques during modeling building, and techniques after model building. Each category is further characterized by representative analysis tasks, and each task is exemplified by a set of recent influential works. We also discuss and highlight research challenges and promising potential future research opportunities useful for visual analytics researchers.
Article
Full-text available
Exploring contextual information in the local region is important for shape understanding and analysis. Existing studies often employ hand-crafted or explicit ways to encode contextual information of local regions. However, it is hard to capture fine-grained contextual information in hand-crafted or explicit manners, such as the correlation between different areas in a local region, which limits the discriminative ability of learned features. To resolve this issue, we propose a novel deep learning model for 3D point clouds, named Point2Sequence, to learn 3D shape features by capturing fine-grained contextual information in a novel implicit way. Point2Sequence employs a novel sequence learning model for point clouds to capture the correlations by aggregating multi-scale areas of each local region with attention. Specifically, Point2Sequence first learns the feature of each area scale in a local region. Then, it captures the correlation between area scales in the process of aggregating all area scales using a recurrent neural network (RNN) based encoder-decoder structure, where an attention mechanism is proposed to highlight the importance of different area scales. Experimental results show that Point2Sequence achieves state-of-the-art performance in shape classification and segmentation tasks.
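
The attention over area scales at the core of Point2Sequence can be sketched as a learned softmax weighting of multi-scale area features before aggregation; the RNN encoder-decoder around it is omitted here and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class AreaScaleAttention(nn.Module):
    """Aggregates multi-scale area features of a local region with learned
    attention weights, in the spirit of Point2Sequence (the RNN-based
    encoder-decoder is omitted; sizes are illustrative assumptions)."""
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, scales):                       # (B, S, C): S area scales
        w = torch.softmax(self.score(scales), dim=1) # importance per scale
        return (w * scales).sum(dim=1)               # (B, C) region feature

scales = torch.rand(16, 4, 128)                      # 4 area scales per region
print(AreaScaleAttention()(scales).shape)            # torch.Size([16, 128])
```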
Conference Paper
Full-text available
Learning global features by aggregating information over multiple views has been shown to be effective for 3D shape analysis. For view aggregation in deep learning models, pooling has been applied extensively. However, pooling leads to a loss of the content within views, and the spatial relationship among views, which limits the discriminability of learned features. We propose 3DViewGraph to resolve this issue, which learns 3D global features by more effectively aggregating unordered views with attention. Specifically, unordered views taken around a shape are regarded as view nodes on a view graph. 3DViewGraph first learns a novel latent semantic mapping to project low-level view features into meaningful latent semantic embeddings in a lower dimensional space, which is spanned by latent semantic patterns. Then, the content and spatial information of each pair of view nodes are encoded by a novel spatial pattern correlation, where the correlation is computed among latent semantic patterns. Finally, all spatial pattern correlations are integrated with attention weights learned by a novel attention mechanism. This further increases the discriminability of learned features by highlighting the unordered view nodes with distinctive characteristics and depressing the ones with appearance ambiguity. We show that 3DViewGraph outperforms state-of-the-art methods under three large-scale benchmarks.
Conference Paper
Full-text available
Deep learning has achieved remarkable results in 3D shape analysis by learning global shape features from the pixel-level over multiple views. Previous methods, however, compute low-level features for entire views without considering part-level information. In contrast, we propose a deep neural network, called Parts4Feature, to learn 3D global features from part-level information in multiple views. We introduce a novel definition of generally semantic parts, which Parts4Feature learns to detect in multiple views from different 3D shape segmentation benchmarks. A key idea of our architecture is that it transfers the ability to detect semantically meaningful parts in multiple views to learn 3D global features. Parts4Feature achieves this by combining a local part detection branch and a global feature learning branch with a shared region proposal module. The global feature learning branch aggregates the detected parts in terms of learned part patterns with a novel multi-attention mechanism, while the region proposal module enables locally and globally discriminative information to be promoted by each other. We demonstrate that Parts4Feature outperforms the state-of-the-art under three large-scale 3D shape benchmarks.
Conference Paper
Full-text available
3D shape completion from partial point clouds is a fundamental problem in computer vision and computer graphics. Recent approaches can be characterized as either data-driven or learning-based. Data-driven approaches rely on a shape model whose parameters are optimized to fit the observations. Learning-based approaches, in contrast, avoid the expensive optimization step and instead directly predict the complete shape from the incomplete observations using deep neural networks. However, full supervision is required which is often not available in practice. In this work, we propose a weakly-supervised learning-based approach to 3D shape completion which neither requires slow optimization nor direct supervision. While we also learn a shape prior on synthetic data, we amortize, i.e., learn, maximum likelihood fitting using deep neural networks resulting in efficient shape completion without sacrificing accuracy. Tackling 3D shape completion of cars on ShapeNet and KITTI, we demonstrate that the proposed amortized maximum likelihood approach is able to compete with a fully supervised baseline and a state-of-the-art data-driven approach while being significantly faster. On ModelNet, we additionally show that the approach is able to generalize to other object categories as well.
Conference Paper
Full-text available
We propose a data-driven method for recovering missing parts of 3D shapes. Our method is based on a new deep learning architecture consisting of two sub-networks: a global structure inference network and a local geometry refinement network. The global structure inference network incorporates a long short-term memorized context fusion module (LSTM-CF) that infers the global structure of the shape based on multi-view depth information provided as part of the input. It also includes a 3D fully convolutional (3DFCN) module that further enriches the global structure representation according to volumetric information in the input. Under the guidance of the global structure network, the local geometry refinement network takes as input local 3D patches around missing regions, and progressively produces a high-resolution, complete surface through a volumetric encoder-decoder architecture. Our method jointly trains the global structure inference and local geometry refinement networks in an end-to-end manner. We perform qualitative and quantitative evaluations on six object categories, demonstrating that our method outperforms existing state-of-the-art work on shape completion.
Article
Full-text available
Few prior works study deep learning on point sets. PointNet by Qi et al. is a pioneer in this direction. However, by design PointNet does not capture local structures induced by the metric space points live in, limiting its ability to recognize fine-grained patterns and generalizability to complex scenes. In this work, we introduce a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, our network is able to learn local features with increasing contextual scales. With further observation that point sets are usually sampled with varying densities, which results in greatly decreased performance for networks trained on uniform densities, we propose novel set learning layers to adaptively combine features from multiple scales. Experiments show that our network called PointNet++ is able to learn deep point set features efficiently and robustly. In particular, results significantly better than state-of-the-art have been obtained on challenging benchmarks of 3D point clouds.
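
The set abstraction at the heart of PointNet++ combines farthest point sampling for centroids with ball-query grouping of their neighborhoods. A NumPy sketch of those two primitives (radius and group size are illustrative choices):

```python
import numpy as np

def farthest_point_sample(pts, m):
    """Iterative farthest point sampling, the centroid selection used by
    PointNet++'s set abstraction. pts: (N, 3); returns m indices."""
    idx = [np.random.randint(len(pts))]
    dist = np.full(len(pts), np.inf)
    for _ in range(m - 1):
        dist = np.minimum(dist, np.linalg.norm(pts - pts[idx[-1]], axis=1))
        idx.append(int(dist.argmax()))               # farthest from chosen set
    return np.array(idx)

def ball_query(pts, centroids, radius=0.2, k=32):
    """Groups up to k points within `radius` of each centroid."""
    groups = []
    for c in centroids:
        near = np.where(np.linalg.norm(pts - c, axis=1) < radius)[0][:k]
        pad = np.resize(near, k) if len(near) else np.zeros(k, dtype=int)
        groups.append(pad)                           # repeat indices to pad
    return np.stack(groups)                          # (m, k) neighbor indices

pts = np.random.rand(1024, 3)
centers = farthest_point_sample(pts, 128)
print(ball_query(pts, pts[centers]).shape)           # (128, 32)
```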
Article
Fine-grained 3D shape classification is important for shape understanding and analysis, which poses a challenging research problem. However, the studies on the fine-grained 3D shape classification have rarely been explored, due to the lack of fine-grained 3D shape benchmarks. To address this issue, we first introduce a new 3D shape dataset (named FG3D dataset) with fine-grained class labels, which consists of three categories including airplane, car and chair. Each category consists of several subcategories at a fine-grained level. According to our experiments under this fine-grained dataset, we find that state-of-the-art methods are significantly limited by the small variance among subcategories in the same category. To resolve this problem, we further propose a novel fine-grained 3D shape classification method named FG3D-Net to capture the fine-grained local details of 3D shapes from multiple rendered views. Specifically, we first train a Region Proposal Network (RPN) to detect the generally semantic parts inside multiple views under the benchmark of generally semantic part detection. Then, we design a hierarchical part-view attention aggregation module to learn a global shape representation by aggregating generally semantic part features, which preserves the local details of 3D shapes. The part-view attention module hierarchically leverages part-level and view-level attention to increase the discriminability of our features. The part-level attention highlights the important parts in each view while the view-level attention highlights the discriminative views among all the views of the same object. In addition, we integrate a Recurrent Neural Network (RNN) to capture the spatial relationships among sequential views from different viewpoints. Our results under the fine-grained 3D shape dataset show that our method outperforms other state-of-the-art methods. The FG3D dataset is available at https://github.com/liuxinhai/FG3D-Net.
Article
The continual improvement of 3D sensors has driven the development of algorithms to perform point cloud analysis. In fact, techniques for point cloud classification and segmentation have in recent years achieved incredible performance, driven in part by leveraging large synthetic datasets. Unfortunately, these same state-of-the-art approaches perform poorly when applied to incomplete point clouds. This limitation of existing algorithms is particularly concerning since point clouds generated by 3D sensors in the real world are usually incomplete due to perspective view or occlusion by other objects. This paper proposes a general model for partial point cloud analysis wherein the latent feature encoding a complete point cloud is inferred by applying a point set voting strategy. In particular, each local point set constructs a vote that corresponds to a distribution in the latent space, and the optimal latent feature is the one with the highest probability. This approach ensures that any subsequent point cloud analysis is robust to partial observation while simultaneously guaranteeing that the proposed model is able to output multiple possible results. This paper illustrates that this proposed method achieves state-of-the-art performance on shape classification, part segmentation and point cloud completion.
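
One way to read the voting strategy is that each local point set emits a latent distribution and the votes are fused into the most probable latent. The sketch below fuses diagonal-Gaussian votes via a precision-weighted product; this is an illustrative interpretation, not the paper's exact estimator.

```python
import torch

def fuse_latent_votes(mu, logvar):
    """Fuses per-local-patch latent votes, each a diagonal Gaussian, into a
    single latent by taking the product of the Gaussians (precision-weighted
    mean). An assumed reading of the voting strategy.

    mu, logvar: (V, C) for V votes over a C-dimensional latent."""
    prec = torch.exp(-logvar)                   # per-vote precision
    fused_var = 1.0 / prec.sum(dim=0)           # combined variance
    fused_mu = fused_var * (prec * mu).sum(dim=0)
    return fused_mu, fused_var

mu, logvar = torch.randn(32, 256), torch.randn(32, 256)
z, var = fuse_latent_votes(mu, logvar)
print(z.shape)                                  # torch.Size([256])
```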
Chapter
Structure learning for 3D shapes is vital for 3D computer vision. State-of-the-art methods show promising results by representing shapes with implicit functions in 3D that are learned using discriminative neural networks. However, learning implicit functions requires dense and irregular sampling in 3D space, which also makes the choice of sampling method affect the accuracy of shape reconstruction at test time. To avoid dense and irregular sampling in 3D, we propose to represent shapes using 2D functions, where the output of the function at each 2D location is a sequence of line segments inside the shape. Our approach leverages the power of functional representations, but without the disadvantage of 3D sampling. Specifically, we use a voxel tubelization to represent a voxel grid as a set of tubes along any one of the X, Y, or Z axes. Each tube can be indexed by its 2D coordinates on the plane spanned by the other two axes. We further simplify each tube into a sequence of occupancy segments. Each occupancy segment consists of successive voxels occupied by the shape, which leads to a simple representation of its 1D start and end locations. Given the 2D coordinates of the tube and a shape feature as condition, this representation enables us to learn 3D shape structures by sequentially predicting the start and end locations of each occupancy segment in the tube. We implement this approach using a Seq2Seq model with attention, called SeqXY2SeqZ, which learns the mapping from a sequence of 2D coordinates along two arbitrary axes to a sequence of 1D locations along the third axis. SeqXY2SeqZ not only benefits from the regularity of voxel grids in training and testing, but also achieves high memory efficiency. Our experiments show that SeqXY2SeqZ outperforms state-of-the-art methods on widely used benchmarks.
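The tubelization itself amounts to a run-length encoding of the voxel grid. A minimal sketch, assuming tubes along the Z axis:

```python
import numpy as np

def tubelize(voxels):
    """Convert a (X, Y, Z) boolean voxel grid into per-(x, y) tubes, each a
    list of (start, end) occupancy segments along the Z axis (end inclusive)."""
    tubes = {}
    for x in range(voxels.shape[0]):
        for y in range(voxels.shape[1]):
            col, segs, start = voxels[x, y], [], None
            for z, occ in enumerate(col):
                if occ and start is None:
                    start = z
                elif not occ and start is not None:
                    segs.append((start, z - 1))
                    start = None
            if start is not None:                 # segment reaches the grid boundary
                segs.append((start, len(col) - 1))
            if segs:
                tubes[(x, y)] = segs
    return tubes

grid = np.zeros((4, 4, 8), dtype=bool)
grid[1, 2, 2:5] = True
grid[1, 2, 6:8] = True
print(tubelize(grid))  # {(1, 2): [(2, 4), (6, 7)]}
```

The Seq2Seq model then predicts exactly these (start, end) pairs, conditioned on the tube's 2D coordinates and the shape feature.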
Chapter
Point cloud shape completion is a challenging problem in 3D vision and robotics. Existing learning-based frameworks leverage encoder-decoder architectures to recover the complete shape from a highly encoded global feature vector. Though the global feature can approximately represent the overall shape of a 3D object, it leads to the loss of shape details during the completion process. In this work, instead of using a global feature to recover the whole complete surface, we explore the functionality of multi-level features and aggregate different features to represent the known part and the missing part separately. We propose two different feature aggregation strategies, named global & local feature aggregation (GLFA) and residual feature aggregation (RFA), to express the two kinds of features and to reconstruct coordinates from their combination. In addition, we design a refinement component that prevents the generated point cloud from exhibiting a non-uniform distribution and outliers. Extensive experiments have been conducted on the ShapeNet and KITTI datasets. Qualitative and quantitative evaluations demonstrate that our proposed network outperforms current state-of-the-art methods, especially in detail preservation.
Chapter
Estimating the complete 3D point cloud from an incomplete one is a key problem in many vision and robotics applications. Mainstream methods (e.g., PCN and TopNet) use Multi-layer Perceptrons (MLPs) to directly process point clouds, which may cause the loss of details because the structure and context of point clouds are not fully considered. To solve this problem, we introduce 3D grids as intermediate representations to regularize unordered point clouds and propose a novel Gridding Residual Network (GRNet) for point cloud completion. In particular, we devise two novel differentiable layers, named Gridding and Gridding Reverse, to convert between point clouds and 3D grids without losing structural information. We also present the differentiable Cubic Feature Sampling layer to extract features of neighboring points, which preserves context information. In addition, we design a new loss function, namely Gridding Loss, to calculate the L1 distance between the 3D grids of the predicted and ground-truth point clouds, which helps recover details. Experimental results indicate that the proposed GRNet performs favorably against state-of-the-art methods on the ShapeNet, Completion3D, and KITTI benchmarks.
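A simplified, non-differentiable sketch of the gridding idea and the L1 Gridding Loss follows; GRNet's actual Gridding layers are differentiable custom ops, so treat this only as an illustration of the representation.

```python
import numpy as np

def gridding(points, res=32):
    """Scatter points in [0, 1)^3 onto a (res+1)^3 vertex grid: each point
    contributes trilinear weights to the 8 vertices of its enclosing cell."""
    grid = np.zeros((res + 1, res + 1, res + 1))
    ijk = np.floor(points * res).astype(int)
    frac = points * res - ijk
    for (i, j, k), (fx, fy, fz) in zip(ijk, frac):
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    w = ((fx if dx else 1 - fx) * (fy if dy else 1 - fy)
                         * (fz if dz else 1 - fz))
                    grid[i + dx, j + dy, k + dz] += w
    return grid

def gridding_loss(pred_pts, gt_pts, res=32):
    """L1 distance between the gridded predicted and ground-truth clouds."""
    return np.abs(gridding(pred_pts, res) - gridding(gt_pts, res)).mean()

pred = np.random.rand(256, 3)
gt = np.random.rand(256, 3)
print(gridding_loss(pred, gt))
```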
Article
Learning discriminative shape representations directly on point clouds is still challenging in 3D shape analysis and understanding. Recent studies usually involve three steps: first splitting a point cloud into local regions, then extracting the corresponding feature of each local region, and finally aggregating all individual local region features into a global feature as the shape representation using simple max-pooling. However, such pooling-based feature aggregation methods do not adequately take the spatial relationships (e.g., the relative locations to other regions) between local regions into account, which greatly limits the ability to learn discriminative shape representations. To address this issue, we propose a novel deep learning network, named Point2SpatialCapsule, for aggregating features and spatial relationships of local regions on point clouds, which aims to learn more discriminative shape representations. Compared with traditional max-pooling based feature aggregation networks, Point2SpatialCapsule can explicitly learn not only the geometric features of local regions but also the spatial relationships among them. Point2SpatialCapsule consists of two main modules. To resolve the disorder problem of local regions, the first module, named geometric feature aggregation, is designed to aggregate the local region features into learnable cluster centers, which explicitly encodes the spatial locations from the original 3D space. The second module, named spatial relationship aggregation, is proposed for further aggregating the clustered features and the spatial relationships among them in the feature space using the spatial-aware capsules developed in this paper. Compared to previous capsule network based methods, the feature routing on the spatial-aware capsules can learn more discriminative spatial relationships among local regions for point clouds, which establishes a direct mapping between log priors and the spatial locations through feature clusters. Experimental results demonstrate that Point2SpatialCapsule outperforms the state-of-the-art methods in 3D shape classification, retrieval, and segmentation tasks on the well-known ModelNet and ShapeNet datasets.
Article
3D shape reconstruction from multiple hand-drawn sketches is an intriguing approach to 3D shape modeling. Currently, state-of-the-art methods employ neural networks to learn a mapping from multiple sketches at arbitrary view angles to a 3D voxel grid. Because of the cubic complexity of 3D voxel grids, however, these neural networks are hard to train and limited to low-resolution reconstructions, which leads to a lack of geometric detail and low accuracy. To resolve this issue, we propose to reconstruct 3D shapes from multiple sketches using direct shape optimization (DSO), which does not involve deep learning models for direct voxel-based 3D shape generation. Specifically, we first leverage a conditional generative adversarial network (CGAN) to translate each sketch into an attenuance image that captures the predicted geometry from a given viewpoint. Then, DSO minimizes a project-and-compare loss to reconstruct the 3D shape such that it matches the predicted attenuance images from the view angles of all input sketches. Based on this, we further propose a progressive update approach to handle inconsistencies among a few hand-drawn sketches of the same 3D shape. Our experimental results show that our method significantly outperforms state-of-the-art methods on widely used benchmarks and produces intuitive results in an interactive application.
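A toy version of the project-and-compare idea, assuming orthographic projections along the grid axes and a simple exponential attenuance model (the paper's camera model and optimizer are more general, so this is only a sketch of the loss structure):

```python
import numpy as np

def attenuance_projection(voxels, axis=2):
    """Orthographically project a non-negative density grid along one axis
    with a simple attenuance model: 1 - exp(-integrated density)."""
    return 1.0 - np.exp(-voxels.sum(axis=axis))

def project_and_compare_loss(voxels, target_images, axes=(0, 1, 2)):
    """Sum of squared differences between projections of the current shape
    and the predicted attenuance images from each sketch viewpoint."""
    return sum(((attenuance_projection(voxels, a) - img) ** 2).sum()
               for a, img in zip(axes, target_images))

vox = np.random.rand(32, 32, 32) * 0.05          # current shape estimate
targets = [attenuance_projection(vox, a) for a in (0, 1, 2)]
print(project_and_compare_loss(vox, targets))     # 0.0 at the optimum
```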
Article
3D shape completion is important to enable machines to perceive the complete geometry of objects from partial observations. To address this problem, view-based methods have been presented. These methods represent shapes as multiple depth images, which can be back-projected to yield corresponding 3D point clouds, and they perform shape completion by learning to complete each depth image using neural networks. While view-based methods lead to state-of-the-art results, they currently do not enforce geometric consistency among the completed views during the inference stage. To resolve this issue, we propose a multi-view consistent inference technique for 3D shape completion, which we express as an energy minimization problem including a data term and a regularization term. We formulate the regularization term as a consistency loss that encourages geometric consistency among multiple views, while the data term guarantees that the optimized views do not drift away too much from a learned shape descriptor. Experimental results demonstrate that our method completes shapes more accurately than previous techniques.
Article
Learning discriminative features directly on point clouds is still challenging for the understanding of 3D shapes. Recent methods usually partition point clouds into local region sets, extract the feature of each local region with a fixed-size CNN or MLP, and finally aggregate all individual local features into a global feature using simple max-pooling. However, due to the irregularity and sparsity of sampled point clouds, it is hard to encode the fine-grained geometry of local regions and their spatial relationships when using only fixed-size filters and individual local feature integration, which limits the ability to learn discriminative features. To address this issue, we present a novel Local-Region-Context Network (LRC-Net) that learns discriminative features on point clouds by simultaneously encoding the fine-grained contexts inside and among local regions. LRC-Net consists of two main modules. The first module, named intra-region context encoding, is designed to capture the geometric correlation inside each local region using a novel variable-size convolution filter. The second module, named inter-region context encoding, is proposed to integrate the spatial relationships among local regions based on spatial similarity measures. Experimental results show that LRC-Net is competitive with state-of-the-art methods in shape classification and shape segmentation applications.
Conference Paper
The auto-encoder is an important architecture for understanding point clouds through an encoding-and-decoding procedure of self-reconstruction. Current auto-encoders mainly focus on learning the global structure through global shape reconstruction, while ignoring the learning of local structures. To resolve this issue, we propose a Local-to-Global auto-encoder (L2G-AE) that simultaneously learns the local and global structure of point clouds through local-to-global reconstruction. Specifically, L2G-AE employs an encoder to encode the geometry information of multiple scales in a local region at the same time. In addition, we introduce a novel hierarchical self-attention mechanism to highlight the important points, scales, and regions at different levels in the information aggregation of the encoder. Simultaneously, L2G-AE employs a recurrent neural network (RNN) as the decoder to reconstruct a sequence of scales in a local region, based on which the global point cloud is incrementally reconstructed. Our outperforming results in shape classification, retrieval, and upsampling show that L2G-AE can understand point clouds better than state-of-the-art methods.
Article
As 3D scanning devices and depth sensors mature, point clouds have attracted increasing attention as a format for 3D object representation, with applications in various fields such as tele-presence, navigation, and heritage reconstruction. However, point clouds usually exhibit holes of missing data, mainly due to the limitations of acquisition techniques and complicated structures. Further, point clouds are defined on irregular non-Euclidean domains, which are challenging to address, especially with conventional signal processing tools. Hence, leveraging recent advances in graph signal processing, we propose an efficient point cloud inpainting method that exploits both the local smoothness and the non-local self-similarity of point clouds. Specifically, we first propose a frequency interpretation in the graph nodal domain, based on which we derive the smoothing and denoising properties of a graph-signal smoothness prior in order to describe the local smoothness of point clouds. Secondly, we explore the characteristics of non-local self-similarity by globally searching for the area most similar to the missing region. The similarity metric between two areas is defined based on the direct component and the anisotropic graph total variation of the normals in each area. Finally, we formulate the hole-filling step as an optimization problem based on the selected most similar area, regularized by the graph-signal smoothness prior. In addition, we propose voxelization and automatic hole detection methods applied to the point cloud prior to inpainting. Experimental results show that the proposed approach significantly outperforms four competing methods in both objective and subjective quality.
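With a quadratic data term, regularization by a graph-signal smoothness prior $x^\top L x$ admits a closed-form solution. The toy chain-graph example below sketches only this hole-filling step, not the paper's full pipeline (which also incorporates the non-local similarity term):

```python
import numpy as np

def smooth_inpaint(y, mask, L, lam=5.0):
    """Minimize ||M(x - y)||^2 + lam * x^T L x, where M keeps only observed
    entries; the stationarity condition gives (M + lam*L) x = M y."""
    M = np.diag(mask.astype(float))
    return np.linalg.solve(M + lam * L, M @ y)

n = 10
# Laplacian of a chain graph: L = D - A
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(1)) - A
y = np.linspace(0.0, 1.0, n)
mask = np.ones(n, dtype=bool)
mask[4:7] = False                                  # a "hole" of missing samples
print(smooth_inpaint(y, mask, L).round(3))         # hole filled smoothly
```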
Article
Learning 3D global features by aggregating multiple views is important. Pooling is widely used to aggregate views in deep learning models. However, pooling disregards a lot of content information within views and the spatial relationships among views, which limits the discriminability of the learned features. To resolve this issue, 3D to Sequential Views (3D2SeqViews) is proposed to more effectively aggregate sequential views using convolutional neural networks with a novel hierarchical attention aggregation. Specifically, the content information within each view is first encoded. Then, the encoded view content information and the sequential spatiality among the views are simultaneously aggregated by hierarchical attention aggregation, where view-level attention and class-level attention are proposed to hierarchically weight sequential views and shape classes. View-level attention is learned to indicate how much attention is paid to each view by each shape class, which subsequently weights sequential views through a novel recursive view integration. Recursive view integration learns the semantic meaning of the view sequence, which is robust to the first view position. Furthermore, class-level attention is introduced to describe how much attention is paid to each shape class, which innovatively employs the discriminative ability of the fine-tuned network. 3D2SeqViews learns more discriminative features than the state of the art, which leads to outperforming results in shape classification and retrieval under three large-scale benchmarks.
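The basic mechanism such hierarchical schemes build on is attention-weighted view aggregation. A generic sketch, assuming per-view features have already been extracted; the scoring network and shapes here are illustrative, not the exact 3D2SeqViews layers:

```python
import torch
import torch.nn as nn

class ViewAttention(nn.Module):
    """Weight per-view features by learned attention scores and aggregate.
    A generic sketch of view-level attention, not a specific published layer."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 2), nn.Tanh(),
                                   nn.Linear(dim // 2, 1))

    def forward(self, views):                              # views: (B, V, dim)
        alpha = torch.softmax(self.score(views), dim=1)    # (B, V, 1) view weights
        return (alpha * views).sum(dim=1)                  # (B, dim) aggregated feature

feats = torch.randn(2, 12, 256)   # 2 shapes, 12 rendered views, 256-D features
print(ViewAttention(256)(feats).shape)  # torch.Size([2, 256])
```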
Chapter
Existing deep learning based image inpainting methods use a standard convolutional network over the corrupted image, with convolutional filter responses conditioned on both valid pixels and the substitute values in the masked holes (typically the mean value). This often leads to artifacts such as color discrepancy and blurriness. Post-processing is usually used to reduce such artifacts, but it is expensive and may fail. We propose the use of partial convolutions, where the convolution is masked and renormalized to be conditioned on only valid pixels. We further include a mechanism to automatically generate an updated mask for the next layer as part of the forward pass. Our model outperforms other methods for irregular masks. We show qualitative and quantitative comparisons with other methods to validate our approach.
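A simplified sketch of one partial convolution layer with mask renormalization and mask update, following the rule described above; bias handling and edge cases are abbreviated relative to the paper.

```python
import torch
import torch.nn.functional as F

def partial_conv2d(x, mask, weight, bias=None):
    """Partial convolution: condition only on valid pixels, renormalize by the
    fraction of valid weights, and update the mask for the next layer.
    x: (B, C, H, W); mask: (B, 1, H, W) with 1 = valid pixel."""
    pad = weight.shape[-1] // 2
    out = F.conv2d(x * mask, weight, padding=pad)       # zero out hole pixels first
    ones = torch.ones(1, 1, weight.shape[2], weight.shape[3])
    valid = F.conv2d(mask, ones, padding=pad)           # count of valid pixels per window
    out = out * (ones.sum() / valid.clamp(min=1e-8))    # renormalization
    new_mask = (valid > 0).float()                      # mask update rule
    out = out * new_mask                                # zero where no valid input exists
    if bias is not None:
        out = out + bias.view(1, -1, 1, 1)
    return out, new_mask

x = torch.randn(1, 3, 16, 16)
mask = torch.ones(1, 1, 16, 16)
mask[:, :, 4:12, 4:12] = 0                              # a square hole
w = torch.randn(8, 3, 3, 3) * 0.1
y, m = partial_conv2d(x, mask, w)
print(y.shape, m.mean().item())  # the hole shrinks by one pixel per layer
```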
Article
Learning 3D global features by aggregating multiple views has been introduced as a successful strategy for 3D shape analysis. In recent deep learning models with end-to-end training, pooling is a widely adopted procedure for view aggregation. However, pooling merely retains the max or mean value over all views, which disregards the content information of almost all views and also the spatial information among the views. To resolve these issues, we propose Sequential Views To Sequential Labels (SeqViews2SeqLabels) as a novel deep learning model with an encoder-decoder structure based on Recurrent Neural Networks (RNNs) with attention. SeqViews2SeqLabels consists of two connected parts, an encoder-RNN followed by a decoder-RNN, that aim to learn the global features by aggregating sequential views and then performing shape classification from the learned global features, respectively. Specifically, the encoder-RNN learns the global features by simultaneously encoding the spatial and content information of sequential views, which captures the semantics of the view sequence. With the proposed prediction of sequential labels, the decoder-RNN performs more accurate classification using the learned global features by predicting sequential labels step-by-step. Learning to predict sequential labels provides more and finer discriminative information among shape classes to learn, which alleviates the overfitting problem inherent in training using a limited number of 3D shapes. Moreover, we introduce an attention mechanism to further improve the discriminative ability of SeqViews2SeqLabels. This mechanism increases the weight of views that are distinctive to each shape class, and it dramatically reduces the effect of selecting the first view position. Shape classification and retrieval results under three large-scale benchmarks verify that SeqViews2SeqLabels learns more discriminative global features by more effectively aggregating sequential views than state-of-the-art methods.
Article
We introduce P2P-NET, a general-purpose deep neural network which learns geometric transformations between point-based shape representations from two domains, e.g., meso-skeletons and surfaces, partial and complete scans, etc. The architecture of the P2P-NET is that of a bi-directional point displacement network, which transforms a source point set to a target point set with the same cardinality, and vice versa, by applying point-wise displacement vectors learned from data. P2P-NET is trained on paired shapes from the source and target domains, but without relying on point-to-point correspondences between the source and target point sets. The training loss combines two uni-directional geometric losses, each enforcing a shape-wise similarity between the predicted and the target point sets, and a cross-regularization term to encourage consistency between displacement vectors going in opposite directions. We develop and present several different applications enabled by our general-purpose bidirectional P2P-NET to highlight the effectiveness, versatility, and potential of our network in solving a variety of point-based shape transformation problems.
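A sketch of a P2P-NET-style training objective, assuming the displacement predictions are given: Chamfer distance serves as the shape-wise similarity, and the cross term shown is one plausible form of the consistency regularizer, not necessarily the paper's exact formulation.

```python
import torch

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a: (N, 3) and b: (M, 3)."""
    d = torch.cdist(a, b)                     # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def p2p_style_loss(src, tgt, disp_fwd, disp_bwd, lam=0.1):
    """Two uni-directional geometric losses plus a cross term encouraging the
    forward and backward displacements to cancel (assumes paired rows)."""
    pred_tgt = src + disp_fwd                 # source -> target direction
    pred_src = tgt + disp_bwd                 # target -> source direction
    geo = chamfer(pred_tgt, tgt) + chamfer(pred_src, src)
    cross = (disp_fwd + disp_bwd).norm(dim=1).mean()
    return geo + lam * cross

src, tgt = torch.rand(128, 3), torch.rand(128, 3)
disp_fwd = torch.zeros_like(src).requires_grad_()
disp_bwd = torch.zeros_like(tgt).requires_grad_()
loss = p2p_style_loss(src, tgt, disp_fwd, disp_bwd)
loss.backward()
print(loss.item(), disp_fwd.grad.shape)
```

This per-point displacement formulation is the same basic idea PMP-Net++ builds on, extended to multi-step paths with a shortest-total-distance constraint.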
Article
Recent deep learning based approaches have shown promising results on image inpainting for the challenging task of filling in large missing regions in an image. These methods can generate visually plausible image structures and textures, but often create distorted structures or blurry textures inconsistent with surrounding areas. This is mainly due to the ineffectiveness of convolutional neural networks at explicitly borrowing or copying information from distant spatial locations. On the other hand, traditional texture and patch synthesis approaches are particularly suitable when textures need to be borrowed from the surrounding regions. Motivated by these observations, we propose a new deep generative model-based approach which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions. The model is a feed-forward, fully convolutional neural network that can process images with multiple holes at arbitrary locations and with variable sizes at test time. Experiments on multiple datasets, including faces, textures, and natural images, demonstrate that the proposed approach generates higher-quality inpainting results than existing ones. Code and trained models will be released.
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing with both large and limited training data.
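The core operation, and the one PMP-Net++ borrows for its transformer-enhanced feature learning, is scaled dot-product attention, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$, which can be written in a few lines:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, the core op of the
    Transformer; q: (batch, seq_q, d_k), k, v: (batch, seq_k, d_k)."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_q, seq_k)
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(2, 10, 64)
k = torch.randn(2, 12, 64)
v = torch.randn(2, 12, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 10, 64])
```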