LEARNING CONVOLUTIONAL TRANSFORMS FOR LOSSY POINT CLOUD GEOMETRY COMPRESSION
Maurice Quach, Giuseppe Valenzise, Frédéric Dufaux
L2S, CNRS, CentraleSupélec, Université Paris-Saclay
ABSTRACT
Efficient point cloud compression is fundamental to enable
the deployment of virtual and mixed reality applications,
since the number of points to code can range in the order of
millions. In this paper, we present a novel data-driven ge-
ometry compression method for static point clouds based on
learned convolutional transforms and uniform quantization.
We perform joint optimization of both rate and distortion
using a trade-off parameter. In addition, we cast the decoding
process as a binary classification of the point cloud occupancy
map. Our method outperforms the MPEG reference solution
in terms of rate-distortion on the Microsoft Voxelized Upper
Bodies dataset with 51.5% BDBR savings on average. More-
over, while octree-based methods face exponential diminu-
tion of the number of points at low bitrates, our method
still produces high resolution outputs even at low bitrates.
Code and supplementary material are available at
https://github.com/mauriceqch/pcc_geo_cnn.
Index Terms— point cloud geometry compression, convolutional neural network, rate-distortion optimization
1. INTRODUCTION
Point clouds are an essential data structure for Virtual Re-
ality (VR) and Mixed Reality (MR) applications. A point
cloud is a set of points in the 3D space represented by coor-
dinates x, y, z and optional attributes (for example color, nor-
mals, etc.). Point cloud data is often very large as point clouds
easily range in the millions of points and can have complex
sets of attributes. Therefore, efficient point cloud compres-
sion (PCC) is particularly important to enable practical usage
in VR and MR applications.
The Moving Picture Experts Group (MPEG) is currently
working on PCC. In 2017, MPEG issued a call for proposals
(CfP) and in order to provide a baseline, a point cloud codec
for tele-immersive video [1] was chosen as the MPEG anchor.
To compare the proposed compression solutions, quality eval-
uation metrics were developed leading to the selection of the
point-to-point (D1) and point-to-plane (D2) as baseline met-
rics [2]. The point-to-point metric, also called D1 metric, is
computed using the Mean Squared Error (MSE) between the
reconstructed points and the nearest neighbors in the refer-
ence point cloud. The point-to-plane metric, also called D2
metric, uses the surface plane instead of the nearest neighbor.
Research on PCC can be categorized along two dimen-
sions. On one hand, one can either compress point cloud ge-
ometry, i.e., the spatial position of the points, or their associ-
ated attributes. On the other hand, we can also separate works
focusing on compression of dynamic point clouds, which con-
tain temporal information, and static point clouds.
In this work, we focus on the lossy compression of static
point cloud geometry. In PCC, a precise reconstruction of
geometric information is of paramount importance to en-
able high-quality rendering and interactive applications. For
this reason, lossless geometry coding has been investigated
recently in MPEG, but even state-of-the-art techniques struggle
to compress beyond about 2 bits per occupied voxel
(bpov) [3]. This results in large storage and transmission
costs for rich point clouds. Lossy compression schemes proposed in
the literature, on the other hand, are based on octrees, which
achieve variable-rate geometry compression by changing the
octree depth. Unfortunately, lowering the depth reduces the
number of points exponentially. As a result, octree-based
lossy compression tends to produce “blocky” results at the
rendering stage with medium to low bitrates. In order to
partially attenuate this issue, [4] proposes to use wavelet
transforms and volumetric functions to compact the energy
of the point cloud signal. However, since they still employ an
octree representation, their method exhibits rapid geometry
degradation at lower bitrates. While previous approaches
use hand-crafted transforms, we propose here a data-driven
approach based on learned convolutional transforms that
operates directly on voxels.
Specifically, we present a method for learning analysis
and synthesis transforms suitable for point cloud geometry
compression. In addition, by interpreting the point cloud ge-
ometry as a binary signal defined over the voxel grid, we cast
decoding as the problem of classifying whether a given voxel
is occupied or not. We train our model on the ModelNet40
mesh dataset [5, 6], test its performance on the Microsoft Vox-
elized Upper Bodies (MVUB) dataset [7] and compare it with
the MPEG anchor [1]. We find that our method outperforms
the anchor on all sequences at all bitrates. Additionally, in
contrast to octree-based methods, ours does not exhibit ex-
ponential diminution in the number of points when lowering
the bitrate. We also show that our model generalizes well by
using completely different datasets for training and testing.
After reviewing related work in Section 2, we describe
the proposed method in Section 3 and evaluate it on different
datasets in Section 4. Conclusions are drawn in Section 5.
2. RELATED WORK
Our work is mainly related to point cloud geometry compres-
sion, deep learning based image and video compression and
applications of deep learning to 3D objects.
Point cloud geometry compression research has mainly
focused on tree based methods [3, 4, 1] and dynamic point
clouds [8, 9]. Our work takes a different approach by com-
pressing point cloud geometry using a 3D auto-encoder.
While classical compression approaches use hand-crafted
transforms, we directly learn the filters from data.
Recent research has also applied deep learning to image
and video compression. In particular, auto-encoders, recur-
rent neural networks and context-based learning have been
used for image and video compression [10, 11, 12]. [13] pro-
poses to replace quantization with additive uniform noise dur-
ing training while performing actual quantization during eval-
uation. Our work takes inspiration from this approach in the
formulation of quantization, but significantly expands it with
new tools, a different loss function and several practical adap-
tations to the case of point cloud geometry compression.
Generative models [14] and auto-encoders [15] have also
been employed to learn a latent space of 3D objects. In the
context of point cloud compression, our work differs from the
above-mentioned approaches in two aspects. First, we con-
sider quantization in the training in order to jointly optimize
for rate-distortion (RD) performance; second, we propose a
lightweight architecture which allows us to process voxel
grids with resolutions that are an order of magnitude higher
than previous art.
3. PROPOSED METHOD
In this section, we describe the proposed method in more detail.
3.1. Definitions
First, we define the set of possible points at resolution r as
Ω_r = [0..r]³. Then, we define a point cloud as a set of points
S ⊆ Ω_r and its corresponding voxel grid v_S as follows:

    v_S : Ω_r → {0, 1},  z ↦ 1 if z ∈ S, 0 otherwise.

For notational convenience, we use s³ instead of s × s × s
for filter sizes and strides.
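For illustration, the occupancy map v_S can be built directly from an integer point set. The following NumPy sketch is ours (function and variable names are illustrative, not taken from the released code):

```python
import numpy as np

def occupancy_map(points, r):
    """Build the binary occupancy map v_S of a point set S on the r^3 voxel grid.

    `points` is an (N, 3) integer array with coordinates in [0, r).
    Returns an (r, r, r) array with v_S(z) = 1 if z is in S and 0 otherwise.
    """
    v = np.zeros((r, r, r), dtype=np.float32)
    idx = points.astype(np.int64)
    v[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return v

# Example: three occupied voxels on a 64^3 grid.
S = np.array([[0, 0, 0], [10, 20, 30], [63, 63, 63]])
v_S = occupancy_map(S, r=64)
assert v_S.sum() == 3
```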
Fig. 1: Neural network architecture. Layers are specified using the following format: number of feature maps, filter size, strides, activation and bias. Analysis transform f_a: (N, 9³, 2³, ReLU, bias), (N, 5³, 2³, ReLU, bias), (N, 5³, 2³, none, none); quantization Q; synthesis transform f_s: (N, 5³, 2³, ReLU, bias), (N, 5³, 2³, ReLU, bias), (1, 9³, 2³, ReLU, bias). The pipeline maps x to y = f_a(x), ŷ = Q(y) and x̂ = f_s(ŷ).
3.2. Model
We use a 3D convolutional auto-encoder composed of an
analysis transform fa, followed by a uniform quantizer and a
synthesis transform fs.
Let x = v_S be the original point cloud. The corresponding
latent representation is y = f_a(x). To quantize y, we
introduce a quantization function Q so that ŷ = Q(y). This
allows us to express the decompressed point cloud as x̂ =
v̂_S = f_s(ŷ). Finally, we obtain the binary decompressed point cloud
x̃ = ṽ_S = round(min(1, max(0, x̂))) using element-wise
minimum, maximum and rounding functions.
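A minimal sketch of the two transforms, following the layer specification of Figure 1, is given below using tf.keras; it is illustrative only, and the exact framework version, initializers and training loop are those of the released code rather than this sketch.

```python
import tensorflow as tf

N = 32  # number of feature maps used in our experiments

def analysis_transform():
    """f_a: maps the occupancy map x to the latent representation y (layers as in Fig. 1)."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv3D(N, 9, strides=2, padding='same', activation='relu', use_bias=True),
        tf.keras.layers.Conv3D(N, 5, strides=2, padding='same', activation='relu', use_bias=True),
        tf.keras.layers.Conv3D(N, 5, strides=2, padding='same', activation=None, use_bias=False),
    ])

def synthesis_transform():
    """f_s: maps the quantized latents y_hat back to voxel occupancy values x_hat."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv3DTranspose(N, 5, strides=2, padding='same', activation='relu', use_bias=True),
        tf.keras.layers.Conv3DTranspose(N, 5, strides=2, padding='same', activation='relu', use_bias=True),
        tf.keras.layers.Conv3DTranspose(1, 9, strides=2, padding='same', activation='relu', use_bias=True),
    ])

# With three stride-2 layers on each side, a 64^3 input maps to an 8^3 x N latent and back.
x = tf.zeros((1, 64, 64, 64, 1))      # (batch, r, r, r, 1)
y = analysis_transform()(x)           # (batch, r/8, r/8, r/8, N)
x_hat = synthesis_transform()(y)      # (batch, r, r, r, 1)
```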
In our model, we use convolutions and transpose convo-
lutions with same padding and strides. They are illustrated in
Figure 2 and defined as follows:
• Same (half) padding pads the input with zeros so that the output size is equal to the input size.
• Convolution performed with unit stride means that the convolution filter is computed for each element of the input array. When iterating over the input array, strides specify the step for each axis.
• Convolution can be seen as a matrix multiplication, and transpose convolution can be derived from it. In particular, we can build a sparse matrix C whose non-zero elements correspond to the weights. The transpose convolution, also called deconvolution, is obtained by using the matrix C^T as a layout for the weights (see the sketch below).
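The matrix view of (transpose) convolution can be made concrete with a small 1-D example; the sketch below is purely illustrative (a hypothetical length-4 input, stride 1, no padding), but the same construction extends to the 3-D, strided, same-padded case used in our transforms.

```python
import numpy as np

# 1-D toy example: a length-4 input, a length-3 filter, unit stride, no padding.
w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0, 7.0])

# The convolution (cross-correlation, as used in CNNs) written as a sparse matrix C
# whose non-zero elements are the filter weights.
C = np.array([[w[0], w[1], w[2], 0.0],
              [0.0,  w[0], w[1], w[2]]])
y = C @ x        # analysis direction, shape (2,)

# The transpose convolution reuses the same weights laid out as C^T, mapping the
# length-2 signal back into the length-4 input space.
x_up = C.T @ y   # synthesis direction, shape (4,)

print(y)         # [32. 38.]
print(x_up)      # [ 32. 102. 172. 114.]
```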
Using these convolutional operations as a basis, we learn
analysis and synthesis transforms structured as in Figure 1
using the Adam optimizer [17] which is based on adaptive
estimates of first and second moments of the gradient.
Fig. 2: Strided convolution and strided transpose convolution operations (illustrations from [16]). (a) Strided convolution on a 5² input with a 3² filter, 2² strides and same padding; the output has shape 3². (b) Strided transpose convolution on a 3² input with a 3² filter, 2² strides and same padding; the output has shape 5².

We handle quantization similarly to [11]. Q represents
element-wise integer rounding during evaluation, and Q adds
uniform noise between −0.5 and 0.5 to each element during
training, which allows for differentiability. To compress Q(y),
we perform range coding and use the Deflate algorithm, a
combination of LZ77 and Huffman coding [18], with shape
information on x and y added before compression. Note, however,
that our method does not assume any specific entropy
coding algorithm.
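The sketch below illustrates this quantization and entropy coding step with NumPy and zlib (a Deflate implementation); it is a simplified stand-in for the actual coder: the range coding stage and the exact bitstream layout are omitted, and the occupied-voxel count used to normalize the rate is hypothetical.

```python
import numpy as np
import zlib

def quantize(y, training):
    """Q: additive uniform noise in [-0.5, 0.5] during training, integer rounding at test time."""
    if training:
        return y + np.random.uniform(-0.5, 0.5, size=y.shape)
    return np.round(y)

def compress_latents(y_hat):
    """Deflate-compress the quantized latents; the shape is kept alongside the bitstream."""
    data = y_hat.astype(np.int32).tobytes()
    return zlib.compress(data, 9), y_hat.shape

def decompress_latents(bitstream, shape):
    data = zlib.decompress(bitstream)
    return np.frombuffer(data, dtype=np.int32).reshape(shape).astype(np.float32)

y = np.random.randn(8, 8, 8, 32).astype(np.float32)
y_hat = quantize(y, training=False)
bits, shape = compress_latents(y_hat)
num_occupied = 5000                          # hypothetical number of occupied input voxels
rate_bpov = 8 * len(bits) / num_occupied     # bits per occupied voxel
assert np.array_equal(decompress_latents(bits, shape), y_hat)
```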
Our decoding process can also be interpreted as a binary
classification problem where each point z ∈ Ω_r of the voxel
grid is either present or not. This allows us to decompose x̂ =
v̂_S into its individual voxels z, each with an associated value p_z.
However, as point clouds are usually very sparse, most v_S(z)
values are equal to zero. To compensate for the imbalance
between empty and occupied voxels, we use the α-balanced
focal loss as defined in [19]:
    FL(p_z^t) = −α_z (1 − p_z^t)^γ log(p_z^t)    (1)

with p_z^t defined as p_z if v_S(z) = 1 and 1 − p_z otherwise.
Analogously, α_z is defined as α when v_S(z) = 1 and 1 − α
otherwise. The focal loss for the decompressed point cloud
can then be computed as follows:

    FL(x̂) = Σ_{z ∈ Ω_r} FL(p_z^t).    (2)
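Equations (1) and (2) translate directly into code; the following NumPy sketch (ours, for illustration only) evaluates the α-balanced focal loss over the full voxel grid given the predicted occupancy values x̂ and the ground-truth occupancy map x = v_S.

```python
import numpy as np

def focal_loss(x_hat, x, alpha=0.9, gamma=2.0, eps=1e-7):
    """alpha-balanced focal loss over the voxel grid, Eqs. (1)-(2).

    `x` is the binary occupancy map v_S and `x_hat` holds the predicted
    occupancy values p_z, clipped here to (0, 1) for numerical stability.
    """
    p = np.clip(x_hat, eps, 1.0 - eps)
    p_t = np.where(x == 1, p, 1.0 - p)              # p_z if v_S(z) = 1, 1 - p_z otherwise
    alpha_t = np.where(x == 1, alpha, 1.0 - alpha)  # alpha if v_S(z) = 1, 1 - alpha otherwise
    return np.sum(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))

# Toy check on a 4^3 grid with a single occupied voxel.
x = np.zeros((4, 4, 4)); x[1, 2, 3] = 1.0
x_hat = np.full((4, 4, 4), 0.1); x_hat[1, 2, 3] = 0.8
D = focal_loss(x_hat, x)
```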
Our final loss is L = λD + R, where D is the distortion
calculated using the focal loss and R is the rate in number
of bits per input occupied voxel (bpov). The rate is computed
differently during training and during evaluation. On
one hand, during evaluation, as the data is quantized, we compute
the rate using the number of bits of the final compressed
representation. On the other hand, during training, we add
uniform noise in place of discretization to allow for differentiation.
It follows that the probability distribution of the latent
space Q(y) during training is a continuous relaxation of the
probability distribution of Q(y) during evaluation, which is
discrete. As a result, entropies computed during training are
actually differential entropies, or continuous entropies, while
entropies computed during evaluation are discrete entropies.
During training, we use the differential entropy as an approximation
of the discrete entropy. This makes the loss differentiable,
which is essential for training neural networks.
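The sketch below shows how such a training-time rate proxy can be assembled. It is only illustrative: the density shown is a fixed standard-normal placeholder, whereas in practice the latent density is learned jointly with the transforms (as in [11, 13]), and the occupied-voxel count is hypothetical.

```python
import numpy as np

# Placeholder factorized density for the latents, purely for illustration; the actual
# density model is learned jointly with the transforms.
def log2_density(v):
    return (-0.5 * v ** 2 - 0.5 * np.log(2 * np.pi)) / np.log(2)  # log2 of a standard normal

def rate_bpov_training(y_noisy, num_occupied_voxels):
    """Training-time rate proxy: negative log-likelihood (in bits) of the noise-perturbed
    latents under the density model, normalized per occupied input voxel."""
    return -np.sum(log2_density(y_noisy)) / num_occupied_voxels

y = np.random.randn(8, 8, 8, 32)
y_noisy = y + np.random.uniform(-0.5, 0.5, y.shape)          # continuous relaxation of rounding
R = rate_bpov_training(y_noisy, num_occupied_voxels=5000)    # hypothetical occupancy count
D = 1.0        # focal-loss distortion term (see the previous sketch)
lam = 1e-4
loss = lam * D + R   # L = lambda * D + R
```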
4. EXPERIMENTAL RESULTS
We use a train, evaluation and test split across two datasets.
We train and evaluate our network on the ModelNet40 aligned
dataset [5, 6]. Then, we perform tests on the MVUB dataset
and we compare our method with the MPEG anchor [1].
We perform our experiments using Python 3.6 and TensorFlow 1.12.
We use N = 32 filters, a batch size of 64 and Adam with
lr = 10⁻⁴, β1 = 0.9 and β2 = 0.999. For the focal loss, we use
α = 0.9 and γ = 2.0.
To compute distortion, we use the point-to-plane symmetric
PSNR computed with the pc_error MPEG tool [20].
4.1. Datasets
The ModelNet40 dataset contains 12,311 mesh models from
40 categories. This dataset provides us with both variety and
quantity to ensure good generalization when training our net-
work. To convert this dataset to a point cloud dataset, we
first perform sampling on the surface of each mesh. Then, we
translate and scale it into a voxel grid of resolution r. We use
this dataset for training with a resolution r = 64.
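The mesh-to-voxel-grid conversion can be sketched as follows, assuming the surface samples are already available as an (N, 3) array; the sampling step itself and the exact normalization used in the released preprocessing scripts are not reproduced here.

```python
import numpy as np

def points_to_voxel_grid(points, r=64):
    """Translate and scale mesh-surface samples into an r^3 occupancy grid.

    `points` is an (N, 3) float array of points sampled on the mesh surface.
    """
    pts = points - points.min(axis=0)            # translate to the positive octant
    pts = pts / pts.max() * (r - 1)              # uniform scale so the model fits the grid
    vox = np.unique(np.round(pts).astype(np.int64), axis=0)  # deduplicate occupied voxels
    grid = np.zeros((r, r, r), dtype=np.float32)
    grid[vox[:, 0], vox[:, 1], vox[:, 2]] = 1.0
    return grid

# Example with random samples standing in for a sampled ModelNet40 mesh.
samples = np.random.rand(50000, 3) * [1.0, 2.0, 0.5]
grid = points_to_voxel_grid(samples, r=64)
```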
The MVUB dataset [7] contains 5 sequences captured at
30 fps for 7 to 10 seconds each, with a total of 1202 frames.
We test our method on each individual frame with a resolution
r = 512. In other words, we evaluate performance for intra-frame
compression on each sequence.
We compute RD curves for each sequence of the test
dataset. For our method, we use the following λ values to
compute RD points: 10⁻⁴, 5 × 10⁻⁵, 10⁻⁵, 5 × 10⁻⁶ and
10⁻⁶. For each sequence, we average distortions and bitrates
over time for each λ to obtain RD points. For the MPEG
anchor, we use the same process with different octree depths.
To evaluate distortion, we use the point-to-plane symmetric
PSNR [20], e_symm(A, B) = min(e(A, B), e(B, A)),
where e(A, B) provides the point-to-plane PSNR between
points in A and their nearest neighbors in B. This choice is
due to the fact that the original and reconstructed point clouds
may have a very different number of points; e.g., in octree-based
methods the compressed point cloud has significantly
fewer points than the original, while in our method it is the
opposite. In the rest of this section, we refer to the point-to-plane
symmetric PSNR as simply PSNR.
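For reference, the symmetric combination can be sketched as below; the point-to-point error is used here as a simplified stand-in for the point-to-plane error (which additionally requires surface normals), and in our experiments the metric is computed with the pc_error tool rather than this sketch.

```python
import numpy as np
from scipy.spatial import cKDTree

def directional_mse(A, B):
    """Mean squared distance from each point of A to its nearest neighbor in B."""
    d, _ = cKDTree(B).query(A)
    return np.mean(d ** 2)

def symmetric_psnr(A, B, peak):
    """min(e(A, B), e(B, A)) with a point-to-point error; `peak` is the geometry peak value."""
    psnr = lambda mse: 10.0 * np.log10(peak ** 2 / mse)
    return min(psnr(directional_mse(A, B)), psnr(directional_mse(B, A)))

A = np.random.rand(1000, 3) * 511   # original point cloud on a 512^3 grid
B = A + np.random.randn(1000, 3)    # reconstructed point cloud (perturbed copy)
print(symmetric_psnr(A, B, peak=511))
```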
Our method outperforms the MPEG anchor on all se-
quences at all bitrates. The latter has a mean bitrate of 0.719
bpov and a mean PSNR of 16.68 dB while our method has
a mean bitrate of 0.691 bpov and a mean PSNR of 24.11 dB. RD
curves and the Bjøntegaard delta bitrates (BDBR) for each
sequence are reported in Figure 3. Our method achieves
51.5% BDBR savings on average compared to the anchor.
Fig. 3: RD curves (PSNR in dB versus bits per occupied voxel) for each sequence of the MVUB dataset, comparing our method ("Proposed") to the MPEG anchor. (a) Andrew sequence (47.8% BDBR), (b) David sequence (55.7% BDBR), (c) Phil sequence (49.0% BDBR), (d) Ricardo sequence (52.7% BDBR), (e) Sarah sequence (52.4% BDBR).
Fig. 4: Original point cloud (left), the compressed point cloud using the proposed method (middle) and the MPEG anchor (right). Colors are mapped using nearest-neighbor matching. Our compressed point cloud was compressed using λ = 10⁻⁶ with a PSNR of 29.22 dB and 0.071 bpov. The anchor compressed point cloud was compressed using a depth-6 octree with a PSNR of 23.98 dB and 0.058 bpov. They respectively contain 370,798; 1,302,027; and 5,963 points.
In Figure 4, we show examples on the first frame of the
Phil sequence. Our method achieves lower distortion at sim-
ilar bitrates and produces more points than the anchor which
increases quality at low bitrates while avoiding “blocking” ef-
fects. This particular example shows that our method pro-
duces 218 times more points than the anchor at similar bi-
trates. In other words, both methods introduce different types
of distortions. Indeed, the number of points produced by oc-
tree structures diminishes exponentially when reducing the
octree depth. Conversely, our method produces more points at
lower bitrates as the focal loss penalizes false negatives more
heavily.
In this work, we use a fixed threshold of 0.5 during de-
compression. Changing this threshold can further optimize
rate-distortion performance or optimize other aspects such as
rendering performance (number of points).
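Concretely, decoding reduces to thresholding the predicted occupancy values; a small sketch follows, where raising the threshold above 0.5 yields fewer output points and lowering it yields more.

```python
import numpy as np

def decode_occupancy(x_hat, threshold=0.5):
    """Classify each voxel as occupied or empty; higher thresholds yield fewer output points."""
    return (x_hat >= threshold).astype(np.float32)

x_hat = np.random.rand(64, 64, 64)
sparser = decode_occupancy(x_hat, threshold=0.9)   # fewer points
denser = decode_occupancy(x_hat, threshold=0.1)    # more points
```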
5. CONCLUSION
We present a novel data-driven point cloud geometry com-
pression method using learned convolutional transforms and
a uniform quantizer. Our method outperforms the MPEG An-
chor on the MVUB dataset in terms of rate-distortion with
51.5% BDBR savings on average. Additionally, in contrast
to octree-based methods, our model does not exhibit expo-
nential diminution in the number of output points at lower
bitrates. This work can be extended to the compression of
attributes and dynamic point clouds.
6. ACKNOWLEDGMENTS
This work was funded by the ANR ReVeRy national fund
(REVERY ANR-17-CE23-0020).
7. REFERENCES
[1] Rufael Mekuria, Kees Blom, and Pablo Cesar, “Design,
Implementation, and Evaluation of a Point Cloud Codec
for Tele-Immersive Video,” IEEE Transactions on Cir-
cuits and Systems for Video Technology, vol. 27, no. 4,
pp. 828–842, Apr. 2017.
[2] Sebastian Schwarz, Gaëlle Martin-Cocher, David Flynn,
and Madhukar Budagavi, “Common test condi-
tions for point cloud compression,” in ISO/IEC
JTC1/SC29/WG11 MPEG output document N17766.
July 2018.
[3] Diogo C. Garcia and Ricardo L. de Queiroz, “Intra-
Frame Context-Based Octree Coding for Point-Cloud
Geometry,” in 2018 25th IEEE International Confer-
ence on Image Processing (ICIP), Oct. 2018, pp. 1807–
1811.
[4] Maja Krivokuća, Maxim Koroteev, and Philip A. Chou,
“A Volumetric Approach to Point Cloud Compres-
sion,” arXiv:1810.00484 [eess], Sept. 2018, arXiv:
1810.00484.
[5] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu,
Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao,
“3d ShapeNets: A deep representation for volumetric
shapes,” in 2015 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), June 2015, pp. 1912–
1920.
[6] Nima Sedaghat, Mohammadreza Zolfaghari, Ehsan
Amiri, and Thomas Brox, “Orientation-boosted Voxel
Nets for 3d Object Recognition,” arXiv:1604.03351
[cs], Apr. 2016, arXiv: 1604.03351.
[7] Charles Loop, Qin Cai, Sergio O. Escolano, and
Philip A. Chou, “Microsoft voxelized upper bod-
ies - a voxelized point cloud dataset,” in ISO/IEC
JTC1/SC29 Joint WG11/WG1 (MPEG/JPEG) input
document m38673/M72012. May 2016.
[8] Ricardo L. de Queiroz, Diogo C. Garcia, Philip A. Chou,
and Dinei. A. Florencio, “Distance-Based Probability
Model for Octree Coding,” IEEE Signal Processing Let-
ters, vol. 25, no. 6, pp. 739–742, June 2018.
[9] Dorina Thanou, Philip A. Chou, and Pascal Frossard,
“Graph-Based Compression of Dynamic 3d Point Cloud
Sequences,” IEEE Transactions on Image Processing,
vol. 25, no. 4, pp. 1765–1778, Apr. 2016.
[10] Giuseppe Valenzise, Andrei Purica, Vedad Hulusic,
and Marco Cagnazzo, “Quality Assessment of Deep-
Learning-Based Image Compression,” in 2018 IEEE
20th International Workshop on Multimedia Signal Pro-
cessing (MMSP), Vancouver, BC, Aug. 2018, pp. 1–6,
IEEE.
[11] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin
Hwang, and Nick Johnston, “Variational image com-
pression with a scale hyperprior,” arXiv:1802.01436
[cs, eess, math], Jan. 2018, arXiv: 1802.01436.
[12] Li Wang, Attilio Fiandrotti, Andrei Purica, Giuseppe
Valenzise, and Marco Cagnazzo, “Enhancing HEVC
spatial prediction by context-based learning,” Brighton,
UK, May 2019.
[13] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli,
“End-to-end Optimized Image Compression,” in 2017
International Conference on Learning Representations,
2017, arXiv: 1611.01704.
[14] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas,
and Leonidas Guibas, “Learning Representations and
Generative Models for 3d Point Clouds,” in 2018 Inter-
national Conference on Learning Representations, Feb.
2018.
[15] Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, and
Abhinav Gupta, “Learning a Predictable and Genera-
tive Vector Representation for Objects,” in Computer
Vision – ECCV 2016, Bastian Leibe, Jiri Matas, Nicu
Sebe, and Max Welling, Eds. 2016, Lecture Notes in
Computer Science, pp. 484–499, Springer International
Publishing.
[16] Vincent Dumoulin and Francesco Visin, “A
guide to convolution arithmetic for deep learning,”
arXiv:1603.07285 [cs, stat], Mar. 2016, arXiv:
1603.07285.
[17] Diederik P. Kingma and Jimmy Ba, “Adam: A Method
for Stochastic Optimization,” arXiv:1412.6980 [cs],
Dec. 2014, arXiv: 1412.6980.
[18] David A. Huffman, “A Method for the Construction of
Minimum-Redundancy Codes,” Proceedings of the IRE,
vol. 40, no. 9, pp. 1098–1101, Sept. 1952.
[19] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He,
and Piotr Dollár, “Focal Loss for Dense Object De-
tection,” arXiv:1708.02002 [cs], Aug. 2017, arXiv:
1708.02002.
[20] Dong Tian, Hideaki Ochimizu, Chen Feng, Robert Co-
hen, and Anthony Vetro, “Geometric distortion metrics
for point cloud compression,” in 2017 IEEE Interna-
tional Conference on Image Processing (ICIP), Beijing,
Sept. 2017, pp. 3460–3464, IEEE.
We describe an end-to-end trainable model for image compression based on variational autoencoders. The model incorporates a hyperprior to effectively capture spatial dependencies in the latent representation. This hyperprior relates to side information, a concept universal to virtually all modern image codecs, but largely unexplored in image compression using artificial neural networks (ANNs). Unlike existing autoencoder compression methods, our model trains a complex prior jointly with the underlying autoencoder. We demonstrate that this model leads to state-of-the-art image compression when measuring visual quality using the popular MS-SSIM index, and yields rate-distortion performance surpassing published ANN-based methods when evaluated using a more traditional metric based on squared error (PSNR). Furthermore, we provide a qualitative comparison of models trained for different distortion metrics.