International Journal of Computer Vision
https://doi.org/10.1007/s11263-025-02400-y
PointSea: Point Cloud Completion via Self-structure Augmentation
Zhe Zhu1 · Honghua Chen1 · Xing He1 · Mingqiang Wei1
Received: 8 July 2024 / Accepted: 17 February 2025
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025

Communicated by Wanli Ouyang.

Corresponding author: Zhe Zhu (zhuzhe0619@nuaa.edu.cn)
Honghua Chen (chenhonghuacn@gmail.com) · Xing He (hexing@nuaa.edu.cn) · Mingqiang Wei (mqwei@nuaa.edu.cn)
1 Nanjing University of Aeronautics and Astronautics, Nanjing, China
Abstract

Point cloud completion is a fundamental yet not well-solved problem in 3D vision. Current approaches often rely on 3D coordinate information and/or additional data (e.g., images and scanning viewpoints) to fill in missing parts. Unlike these methods, we explore self-structure augmentation and propose PointSea for global-to-local point cloud completion. In the global stage, consider how we inspect a defective region of a physical object: we observe it from various perspectives for a better understanding. Inspired by this, PointSea augments the data representation by leveraging self-projected depth images from multiple views. To reconstruct a compact global shape from the cross-modal input, we incorporate a feature fusion module that fuses features at both intra-view and inter-view levels. In the local stage, to reveal highly detailed structures, we introduce a point generator called the self-structure dual-generator. This generator integrates both learned shape priors and geometric self-similarities for shape refinement. Unlike existing efforts that apply a unified strategy to all points, our dual-path design adapts the refinement strategy to the structural type of each point, addressing the specific incompleteness of each point. Comprehensive experiments on widely used benchmarks demonstrate that PointSea effectively understands global shapes and generates local details from incomplete input, showing clear improvements over existing methods. Our code is available at https://github.com/czvvd/SVDFormer_PointSea.
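To make the self-projection in the global stage concrete, the following is a minimal NumPy sketch of rendering a partial point cloud into depth images from several viewpoints with an orthographic z-buffer. It illustrates the idea only: the function names, image resolution, and viewpoint set are assumptions and do not reflect the projection used in the released PointSea code.

```python
# Minimal, hypothetical sketch of self-projection: render a partial point
# cloud into depth images from several viewpoints using an orthographic
# z-buffer. Names and parameters are illustrative, not from PointSea.
import numpy as np

def view_basis(view_dir):
    """Orthonormal camera basis whose third row (z axis) points along view_dir."""
    z = view_dir / np.linalg.norm(view_dir)
    up = np.array([0.0, 1.0, 0.0])
    if abs(np.dot(up, z)) > 0.99:        # avoid a degenerate up vector
        up = np.array([1.0, 0.0, 0.0])
    x = np.cross(up, z)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    return np.stack([x, y, z], axis=0)   # rows are the camera axes

def project_depth(points, view_dir, res=224):
    """Orthographic depth image (res x res) of an (N, 3) cloud seen from view_dir."""
    R = view_basis(np.asarray(view_dir, dtype=np.float64))
    cam = points @ R.T                   # coordinates in the camera frame
    xy = cam[:, :2]
    xy = (xy - xy.min(axis=0)) / (np.ptp(xy, axis=0) + 1e-8)  # normalize to [0, 1]
    px = np.clip((xy * (res - 1)).astype(int), 0, res - 1)
    depth = np.full((res, res), np.inf)
    # z-buffer: keep the smallest depth among points that land in each pixel
    np.minimum.at(depth, (px[:, 1], px[:, 0]), cam[:, 2])
    depth[np.isinf(depth)] = 0.0         # sentinel value for empty pixels
    return depth

# Example: three self-projected views of a synthetic partial cloud
partial = np.random.rand(2048, 3)        # stand-in for a real partial scan
views = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
depth_maps = [project_depth(partial, v) for v in views]   # three (224, 224) arrays
```

Each such depth map can then be encoded by a 2D backbone, and the resulting per-view features fused with the point features at intra-view and inter-view levels, as described above.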
Keywords PointSea · Point cloud completion · Self-structure augmentation · Cross-modal fusion
1 Introduction
Raw-captured point clouds are often incomplete due to factors like occlusion, surface reflectivity, and limited scanning range (see Fig. 2). Before they can be used in downstream applications (e.g., digital twins), they need to be faithfully completed, a process known as point cloud completion. Recent years have witnessed significant progress in this field (Yuan et al., 2018; Huang et al., 2020; Zhang et al., 2020; Yu et al., 2021; Xiang et al., 2023; Yan et al., 2022; Zhang et al., 2022a; Tang et al., 2022; Zhou et al., 2022; Zhang et al., 2023d; Yu et al., 2023a; Wang et al., 2022a). However, the sparsity and large structural incompleteness of point clouds still limit the ability of these methods to produce satisfactory results. There are two primary challenges in point cloud completion:

• Crucial semantic parts are often absent in the partial observations.
• Detailed structures cannot be effectively recovered.
The first challenge creates a vast solution space, making it difficult for point-based networks (Yuan et al., 2018; Xiang et al., 2023; Yu et al., 2021; Zhou et al., 2022) to robustly locate missing regions and establish a partial-to-complete mapping. Some alternative methods attempt to address this issue by incorporating additional color images (Zhang et al., 2021b; Aiello et al., 2022; Zhu et al., 2024) or scanning viewpoints (Zhang et al., 2022a; Gong et al., 2021; Fu et al., 2023). However, paired images with well-calibrated camera parameters are hard to obtain, and so are the scanning viewpoints. To resolve the second challenge, some recent approaches (Xiang et al., 2023; Yan et al., 2022) utilize skip-connections between multiple refinement