Multimedia Tools and Applications (2024) 83:78577–78592
https://doi.org/10.1007/s11042-024-18653-7
Multimodal contrastive learning using point clouds andtheir
rendered images
WonyongLee1· HyungkiKim1
Received: 13 June 2023 / Revised: 13 February 2024 / Accepted: 14 February 2024 / Published online: 27 February 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024
Abstract
In this paper, we propose a novel unsupervised pre-training method for point cloud deep learning models using multimodal contrastive learning. Point clouds, which consist of sets of three-dimensional coordinate points acquired from 3D scanners, lidars, depth cameras, and similar sensors, play an important role in representing 3D scenes, and understanding them is crucial for applications such as autonomous driving and navigation. Deep learning models based on supervised learning for point cloud understanding require a label for each point cloud that serves as the correct answer during training. However, generating these labels is expensive, which makes it difficult to build the large datasets that are essential for good model performance. Our proposed unsupervised pre-training method, on the other hand, does not require labels and provides an initialization for the model that can alleviate the need for such large datasets. The proposed method is a multimodal approach that uses two modalities of a point cloud: the point cloud itself and an image rendered from it. Because the images are rendered directly from the point clouds, shape information from various viewpoints can be obtained without additional data such as meshes. We pre-trained a model with the proposed method and compared its performance on the ModelNet40 and ScanObjectNN datasets. The linear classification accuracy of the point cloud feature vectors extracted by the pre-trained model was 91.5% and 83.9%, respectively, and after fine-tuning on each dataset, the classification accuracy was 93.3% and 86.9%.
Keywords Deep learning · Point cloud · Contrastive learning · Multimodal
1 Introduction
Point clouds, which comprise sets of 3D coordinates procured from RGB-D cameras, LiDAR, or other 3D scanning devices, represent one of the most prevalent forms of 3D geometry representation. Comprehending point clouds is imperative for applications such as autonomous driving and navigation.
* Hyungki Kim
hk.kim@jbnu.ac.kr
¹ Department of Computer Science and Artificial Intelligence/CAIIT, Jeonbuk National University, Jeonju, Republic of Korea
... The contrastive learning framework [37] is widely adopted to form a latent feature space shared across different modalities. A notable example is bimodal contrastive learning, which combines 3D point clouds and 2D images [38][39][40][41][42][43][44][45]. In this method, each positive pair is created between a point cloud and an image derived from the same object/scene, while a negative pair is formed by using different objects/scenes. ...
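To make this pairing scheme concrete, here is a minimal sketch of a symmetric InfoNCE-style objective between point-cloud embeddings and embeddings of their rendered images, assuming placeholder encoders and an arbitrary temperature; it illustrates the general bimodal recipe rather than reproducing any cited paper's exact loss.

```python
# Hypothetical sketch of bimodal (point cloud <-> rendered image) contrastive
# pre-training. Encoder architectures and hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def cross_modal_info_nce(point_feats, image_feats, temperature=0.07):
    """Symmetric InfoNCE loss.

    point_feats: (B, D) embeddings of B point clouds.
    image_feats: (B, D) embeddings of the images rendered from the same B objects.
    Row i of each tensor comes from the same object, so the B diagonal pairs
    are positives and all off-diagonal pairs are negatives.
    """
    p = F.normalize(point_feats, dim=-1)
    im = F.normalize(image_feats, dim=-1)
    logits = p @ im.t() / temperature          # (B, B) cosine-similarity logits
    targets = torch.arange(p.size(0), device=p.device)
    # Contrast in both directions: points -> images and images -> points.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with placeholder encoders (e.g., a point-cloud backbone and an image CNN):
# loss = cross_modal_info_nce(point_encoder(points), image_encoder(renders))
```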
Preprint
Parameter-efficient fine-tuning (PEFT) of pre-trained 3D point cloud Transformers has emerged as a promising technique for 3D point cloud analysis. While existing PEFT methods attempt to minimize the number of tunable parameters, they still suffer from high temporal and spatial computational costs during fine-tuning. This paper proposes a novel PEFT algorithm for 3D point cloud Transformers, called Side Token Adaptation on a neighborhood Graph (STAG), to achieve superior temporal and spatial efficiency. STAG employs a graph convolutional side network that operates in parallel with a frozen backbone Transformer to adapt tokens to downstream tasks. STAG's side network realizes high efficiency through three key components: connection with the backbone that enables reduced gradient computation, a parameter sharing framework, and efficient graph convolution. Furthermore, we present Point Cloud Classification 13 (PCC13), a new benchmark comprising diverse publicly available 3D point cloud datasets, enabling comprehensive evaluation of PEFT methods. Extensive experiments using multiple pre-trained models and PCC13 demonstrate the effectiveness of STAG. Specifically, STAG maintains classification accuracy comparable to existing methods while reducing tunable parameters to only 0.43M and achieving significant reductions in both computational time and memory consumption for fine-tuning. Code and benchmark will be available at: https://github.com/takahikof/STAG
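STAG's actual side network is graph convolutional; as a loose, hedged illustration of only the underlying idea of a frozen backbone with a small trainable side path, the sketch below uses a generic MLP side branch. The layer sizes, fusion by addition, and module names are assumptions, not the STAG architecture.

```python
# Loose sketch of parameter-efficient side tuning: a frozen pre-trained backbone
# runs unchanged while a small trainable side branch adapts its token features.
# The side-branch design, sizes, and additive fusion are assumptions, not STAG.
import torch
import torch.nn as nn

class SideTunedEncoder(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int, side_dim: int = 64):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # freeze the pre-trained weights
            p.requires_grad = False
        # Lightweight trainable side path (down-project, nonlinearity, up-project).
        self.side = nn.Sequential(
            nn.Linear(dim, side_dim), nn.GELU(), nn.Linear(side_dim, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                  # no gradient computation in the backbone
            frozen_out = self.backbone(tokens)
        return frozen_out + self.side(tokens)  # fuse frozen and adapted features
```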
Article
The irregular domain and lack of ordering make it challenging to design deep neural networks for point cloud processing. This paper presents a novel framework named Point Cloud Transformer (PCT) for point cloud learning. PCT is based on Transformer, which achieves huge success in natural language processing and displays great potential in image processing. It is inherently permutation invariant for processing a sequence of points, making it well-suited for point cloud learning. To better capture local context within the point cloud, we enhance input embedding with the support of farthest point sampling and nearest neighbor search. Extensive experiments demonstrate that the PCT achieves the state-of-the-art performance on shape classification, part segmentation, semantic segmentation, and normal estimation tasks.
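As a small illustration of the two primitives that PCT's input embedding builds on, the following sketch implements farthest point sampling and k-nearest-neighbor search in plain PyTorch; it is not the full PCT embedding or attention module.

```python
# Minimal sketch of the sampling/grouping primitives mentioned above
# (farthest point sampling and k-nearest-neighbor search); not the full
# PCT input-embedding module.
import torch

def farthest_point_sampling(xyz: torch.Tensor, m: int) -> torch.Tensor:
    """xyz: (N, 3) points. Returns indices of m points spread over the shape."""
    n = xyz.size(0)
    selected = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    farthest = torch.randint(n, (1,)).item()      # random seed point
    for i in range(m):
        selected[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(dim=-1)
        dist = torch.minimum(dist, d)             # distance to the nearest selected point
        farthest = int(torch.argmax(dist))        # pick the point farthest from the set
    return selected

def knn_indices(query: torch.Tensor, xyz: torch.Tensor, k: int) -> torch.Tensor:
    """For each query point, return indices of its k nearest neighbors in xyz."""
    d = torch.cdist(query, xyz)                   # (M, N) pairwise distances
    return d.topk(k, dim=-1, largest=False).indices
```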
Conference Paper
Manual annotation of large-scale point cloud datasets for varying tasks such as 3D object classification, segmentation and detection is often laborious owing to the irregular structure of point clouds. Self-supervised learning, which operates without any human labeling, is a promising approach to address this issue. We observe in the real world that humans are capable of mapping the visual concepts learnt from 2D images to understand the 3D world. Encouraged by this insight, we propose CrossPoint, a simple cross-modal contrastive learning approach to learn transferable 3D point cloud representations. It enables a 3D-2D correspondence of objects by maximizing agreement between point clouds and the corresponding rendered 2D image in the invariant space, while encouraging invariance to transformations in the point cloud modality. Our joint training objective combines the feature correspondences within and across modalities, thus ensembles a rich learning signal from both 3D point cloud and 2D image modalities in a self-supervised fashion. Experimental results show that our approach outperforms the previous unsupervised learning methods on a diverse range of downstream tasks including 3D object classification and segmentation. Further, the ablation studies validate the potency of our approach for a better point cloud understanding. Code and pretrained models are available at https://github.com/MohamedAfham/CrossPoint.
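A hedged sketch of how such a joint objective can be assembled, reusing the cross_modal_info_nce helper from the earlier sketch: one term enforces invariance between two augmented views of the same point cloud, the other aligns the point cloud with its rendered image. The augmentation function, encoders, averaging of the two views, and equal weighting are assumptions rather than CrossPoint's exact formulation.

```python
# Hedged sketch of a joint intra-modal + cross-modal contrastive objective in
# the spirit of CrossPoint; reuses cross_modal_info_nce from the earlier sketch.
# Encoders, augmentations, and the equal weighting are assumptions.
def joint_contrastive_loss(points, renders, point_encoder, image_encoder, augment):
    z1 = point_encoder(augment(points))     # first augmented view of the point cloud
    z2 = point_encoder(augment(points))     # second, independently augmented view
    zi = image_encoder(renders)             # embedding of the rendered 2D image
    intra = cross_modal_info_nce(z1, z2)    # invariance within the point-cloud modality
    cross = cross_modal_info_nce(0.5 * (z1 + z2), zi)  # 3D-2D correspondence term
    return intra + cross
```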
Conference Paper
We describe a simple pre-training approach for point clouds. It works in three steps: 1. Mask all points occluded in a camera view; 2. Learn an encoder-decoder model to reconstruct the occluded points; 3. Use the encoder weights as initialisation for downstream point cloud tasks. We find that even when we pre-train on a single dataset (ModelNet40), this method improves accuracy across different datasets and encoders, on a wide range of downstream tasks. Specifically, we show that our method outperforms previous pre-training methods in object classification, and both part-based and semantic segmentation tasks. We study the pre-trained features and find that they lead to wide downstream minima, have high transformation invariance, and have activations that are highly correlated with part labels. Code and data are available at: https://github.com/hansen7/OcCo.
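The three steps above map naturally onto a reconstruction training loop. The sketch below uses a symmetric Chamfer distance as the reconstruction loss; occlude_from_view, encoder, decoder, and optimizer are hypothetical placeholders, and the actual occlusion procedure and architectures are those of the cited paper.

```python
# Sketch of the occlusion-completion pre-training loop described above.
# occlude_from_view, encoder, and decoder are hypothetical placeholders; the
# Chamfer distance below is a standard symmetric formulation.
import torch

def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """a: (N, 3), b: (M, 3). Mean nearest-neighbor distance in both directions."""
    d = torch.cdist(a, b)                       # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def occo_step(full_cloud, camera_pose, encoder, decoder, optimizer):
    visible = occlude_from_view(full_cloud, camera_pose)   # step 1: drop points occluded in the view
    recon = decoder(encoder(visible))                      # step 2: reconstruct the full shape
    loss = chamfer_distance(recon, full_cloud)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()                                     # step 3: reuse encoder weights downstream
```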
Chapter
Arguably one of the top success stories of deep learning is transfer learning. The finding that pre-training a network on a rich source set (e.g., ImageNet) can help boost performance once fine-tuned on a usually much smaller target set has been instrumental to many applications in language and vision. Yet, very little is known about its usefulness in 3D point cloud understanding. We see this as an opportunity considering the effort required for annotating data in 3D. In this work, we aim at facilitating research on 3D representation learning. Different from previous works, we focus on high-level scene understanding tasks. To this end, we select a suite of diverse datasets and tasks to measure the effect of unsupervised pre-training on a large source set of 3D scenes. Our findings are extremely encouraging: using a unified triplet of architecture, source dataset, and contrastive loss for pre-training, we achieve improvement over recent best results in segmentation and detection across 6 different benchmarks for indoor and outdoor, real and synthetic datasets – demonstrating that the learned representation can generalize across domains. Furthermore, the improvement was similar to supervised pre-training, suggesting that future efforts should favor scaling data collection over more detailed annotation. We hope these findings will encourage more research on unsupervised pretext task design for 3D deep learning.