Multimedia Tools and Applications (2024) 83:78577–78592
https://doi.org/10.1007/s11042-024-18653-7
Multimodal contrastive learning using point clouds and their rendered images
Wonyong Lee¹ · Hyungki Kim¹
Received: 13 June 2023 / Revised: 13 February 2024 / Accepted: 14 February 2024 /
Published online: 27 February 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024
Abstract
In this paper, we propose a novel unsupervised pre-training method for point cloud deep learning models based on multimodal contrastive learning. Point clouds, which consist of sets of three-dimensional coordinate points acquired from 3D scanners, LiDAR, depth cameras, and similar devices, play an important role in representing 3D scenes, and understanding them is crucial for tasks such as autonomous driving and navigation. Deep learning models based on supervised learning for point cloud understanding require a ground-truth label for each point cloud during training. However, generating these labels is expensive, making it difficult to build the large datasets that are essential for good model performance. Our proposed unsupervised pre-training method, in contrast, does not require labels and provides a model initialization that alleviates the need for such large datasets. The proposed method is a multimodal approach that exploits two modalities of a point cloud: the point cloud itself and an image rendered from it. Because the images are rendered directly from the point clouds, shape information from various viewpoints can be obtained without additional data such as meshes. We pre-trained a model with the proposed method and conducted performance comparisons on the ModelNet40 and ScanObjectNN datasets. The linear classification accuracy of the point cloud feature vectors extracted by the pre-trained model was 91.5% and 83.9%, respectively, and after fine-tuning on each dataset, the classification accuracy reached 93.3% and 86.9%.
Keywords Deep learning · Point cloud · Contrastive learning · Multimodal
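As a rough illustration of the pre-training objective summarized in the abstract, the sketch below implements a symmetric cross-modal InfoNCE contrastive loss between point-cloud embeddings and embeddings of their rendered images. The encoder outputs, feature dimension, and temperature value here are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of a cross-modal contrastive objective: each
# point-cloud embedding is pulled toward the embedding of its own
# rendered image and pushed away from the other samples in the batch.
import torch
import torch.nn.functional as F


def cross_modal_info_nce(point_feats: torch.Tensor,
                         image_feats: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two (B, D) batches of embeddings.

    Row i of `point_feats` and row i of `image_feats` come from the same
    object (a point cloud and one of its rendered views); all other rows
    in the batch act as negatives.
    """
    p = F.normalize(point_feats, dim=-1)   # unit-norm point features
    v = F.normalize(image_feats, dim=-1)   # unit-norm image features
    logits = p @ v.t() / temperature       # (B, B) cosine similarities
    targets = torch.arange(p.size(0), device=p.device)
    # Contrast in both directions: point -> image and image -> point.
    loss_p2i = F.cross_entropy(logits, targets)
    loss_i2p = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_p2i + loss_i2p)


if __name__ == "__main__":
    # Placeholder features standing in for the outputs of a point-cloud
    # backbone and an image encoder applied to the rendered views.
    B, D = 8, 256
    point_feats = torch.randn(B, D)
    image_feats = torch.randn(B, D)
    print(f"contrastive loss: {cross_modal_info_nce(point_feats, image_feats).item():.4f}")
```

In practice, the two feature batches would come from a point-cloud encoder and an image encoder with projection heads mapping both modalities into the same embedding space; the batch-wise negatives make each rendered view a supervisory signal for its source point cloud without any labels.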
1 Introduction
Point clouds, which comprise sets of 3D coordinates procured from RGB-D cameras, LiDAR, or other 3D scanning devices, represent one of the most prevalent forms of 3D geometry representation. Comprehending point clouds is imperative
* Hyungki Kim
hk.kim@jbnu.ac.kr
¹ Department of Computer Science and Artificial Intelligence/CAIIT, Jeonbuk National University, Jeonju, Republic of Korea