Investigate Indistinguishable Points in Semantic Segmentation of 3D Point Cloud
Mingye Xu1,2*, Zhipeng Zhou1,4*, Junhao Zhang1, Yu Qiao1,3†
1ShenZhen Key Lab of Computer Vision and Pattern Recognition,
SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
2University of Chinese Academy of Sciences, China
3Shanghai AI Lab, Shanghai, China
4SIAT Branch, Shenzhen Institute of Artificial Intelligence and Robotics for Society
{my.xu, zp.zhou, zhangjh, yu.qiao}@siat.ac.cn
*M. Xu and Z. Zhou contributed equally. †Corresponding author.
Abstract
This paper investigates the indistinguishable points (points whose labels are difficult to predict) in semantic segmentation of large-scale 3D point clouds. The indistinguishable points consist of points located on complex boundaries, points with similar local textures but different categories, and points in isolate small hard areas, which largely harm the performance of 3D semantic segmentation. To address this challenge, we propose a novel Indistinguishable Area Focalization Network (IAF-Net), which selects indistinguishable points adaptively by utilizing hierarchical semantic features and enhances fine-grained features for all points, especially the indistinguishable ones. We also introduce a multi-stage loss to improve the feature representation in a progressive way. Moreover, in order to analyze the segmentation performance on indistinguishable areas, we propose a new evaluation metric called the Indistinguishable Points Based Metric (IPBM). Our IAF-Net achieves results comparable with state-of-the-art performance on several popular 3D point cloud datasets, e.g. S3DIS and ScanNet, and clearly outperforms other methods on the IPBM. Our code will be available at https://github.com/MingyeXu/IAF-Net.
1 Introduction
Deep learning on point cloud analysis has attracted increasing attention recently. Among the tasks of point cloud analysis, efficient semantic segmentation of large-scale 3D point clouds is a challenging task with broad applications (Rusu et al. 2008; Chen et al. 2017; Chen et al. 2020). A key challenge is that 3D point cloud semantic segmentation relies on unstructured data which is typically irregularly sampled and unordered. Due to the complexity of large-scale 3D point clouds, this task also requires an understanding of the fine-grained details of each point.
For point cloud semantic segmentation, there exist some areas which are hard to segment, and we name these "indistinguishable" areas. In order to analyze image semantic segmentation results in detail, (Li et al. 2017) divide pixels into different difficulty levels. Inspired by (Li et al. 2017), we categorize these "indistinguishable" areas into three types (Figure 1). The first type is complex boundary areas (orange areas in Figure 1), which consist of boundary points (object boundaries and prediction boundaries). In most cases, it is difficult to identify the boundaries between different objects accurately. Because the features of each point are characterized by the information of its local region, the predictions for boundary points tend to be over-smoothed between objects of different categories that are close in Euclidean space. The second type is confusing interior areas (cyan areas in Figure 1), which contain interior points from objects of different categories with similar textures and geometric structures. For example, a door and a wall have similar appearance: both are almost flat and share similar colors. Even for a human, it is hard to label parts of a door and a wall accurately in these cases. The last type is isolate small areas (yellow areas in Figure 1), which are scattered and hard to predict. In addition, objects in the scenes may not be fully captured by the devices because of occlusion. All of the challenges mentioned above hinder the accuracy of semantic segmentation of 3D point clouds. As far as we know, these "indistinguishable" points are not deeply explored in most previous methods (Jiang et al. 2018; Yang et al. 2019) on the point cloud semantic segmentation task.

Figure 1: Three types of the indistinguishable areas.
To improve the segmentation performance on the indistinguishable points mentioned above, we design an efficient neural network which enhances the features of all points, especially the indistinguishable ones. However, two challenges must be addressed: 1) how to discover indistinguishable points adaptively during training; 2) how to enhance the features of these points. To this end, we propose a new module called Indistinguishable Areas Focalization (IAF), which adaptively selects indistinguishable points by considering hierarchical semantic features. To enhance the features of indistinguishable points, the IAF module first acquires the fine-grained features and high-level semantic features of the indistinguishable points, then enhances the features through a nonlocal operation between these points and the corresponding whole point set. Furthermore, we introduce a multi-stage loss function $L_{ms}$ to strengthen the feature descriptions of the corresponding points in each layer. In this way, we capture the features in a progressive manner, which guarantees the accuracy of the features at each layer.

Figure 2: (a) is the ground truth of a room. (b) and (c) are two different predictions which result in similar m-IoU.
Mean IoU (m-IoU) and Overall Accuracy (OA) are two widely used evaluation metrics for 3D semantic segmentation. OA describes the average accuracy but ignores the varying distribution of different object categories. m-IoU reflects the accuracy of a model on each category independently. Under certain circumstances (as Figure 2 shows), the visualizations of two predictions with similar m-IoU can differ greatly in their details. In order to cooperate with the partition of indistinguishable points and to provide a supplementary metric to OA and m-IoU, we propose a novel evaluation metric named the Indistinguishable Points Based Metric (IPBM). This metric focuses on the different types of indistinguishable areas. With it, we can evaluate the effectiveness of segmentation methods more objectively and at a finer granularity, which we believe will benefit the evaluation of segmentation methods in the future.
The main contributions are summarized as follows:
• We propose the Indistinguishable Areas Focalization (IAF) module, which selects indistinguishable points adaptively and enhances the features of each point.
• We utilize a multi-stage loss to strengthen the feature descriptions at each layer, which guarantees that the features represent points more accurately in a progressive way.
• Our method achieves performance comparable with state-of-the-art methods on several popular datasets for 3D point cloud semantic segmentation.
• We introduce the Indistinguishable Points Based Metric (IPBM), which focuses on the performance of segmentation methods on the different types of indistinguishable areas.
2 Related Work
Point-Based Networks. Point-based networks operate on irregular point clouds directly. Inspired by PointNet (Charles et al. 2017) and PointNet++ (Qi et al. 2017b), many recent works (Hu et al. 2020; Han et al. 2020; Zhang, Hua, and Yeung 2019; Xu, Zhou, and Qiao 2020; Wu, Qi, and Fuxin 2019) propose different kinds of modules based on point-wise MLPs. ShellNet (Zhang, Hua, and Yeung 2019) introduces ShellConv, which allows efficient neighbor point queries and resolves point order ambiguity by defining a convolution order from the inner to the outer shells on a concentric spherical domain. RandLA-Net (Hu et al. 2020) utilizes a local feature aggregation module to automatically preserve complex local structures by progressively increasing the receptive field of each point. Some works construct novel and efficient point convolutions. A-CNN (Komarichev, Zhong, and Hua 2019) proposes a multi-level hierarchical annular convolution which can set arbitrary kernel sizes on each local ring-shaped domain to better capture shape details. KPConv (Thomas et al. 2019) applies a new convolution operator which uses a set of kernel points to define the area where each kernel weight is applied. However, most methods do not specifically consider the indistinguishable points in the point cloud semantic segmentation task. By contrast, we propose a novel IAF module which enhances the features of points, especially points in the indistinguishable areas. The IAF module in each layer uses a specially designed point selection operation to mine the indistinguishable points adaptively and applies a nonlocal operation to fuse the features between the indistinguishable points and the corresponding whole point set of that layer. A multi-stage loss further helps to abstract representative features in a progressive way.
Local-Nonlocal Mechanism. The nonlocal mechanism has been applied to various computer vision tasks (Yan et al. 2020; Cao et al. 2019). The pioneering non-local network (Wang et al. 2018b) for video classification presents non-local operations as an efficient, simple and generic component for capturing long-range dependencies with deep neural networks: it computes the response at a position as a weighted sum of the features at all positions in the input feature maps. Point2Node (Han et al. 2020) utilizes both local and non-local operations to dynamically explore the correlations among all graph nodes at different levels and adaptively aggregate the learned features; local and non-local correlations are used in a serial way, which largely enhances node characteristics through correlation learning at different scales. To apply the local and non-local mechanisms in a targeted way, we use the non-local operation to fuse the features of different layers, which helps to enhance the features of indistinguishable points. Moreover, the local features in our network are enhanced progressively by the multi-stage loss.
3 Method
We denote the point cloud as $P = \{p_i \in \mathbb{R}^{3+d}, i = 1, 2, \ldots, N\}$, where $N$ is the number of points and $3 + d$ denotes the xyz coordinates plus additional properties such as colors and normal vectors. Like previous works, our IAF-Net mainly consists of encoder and decoder modules, as shown in Figure 3. We describe the details of each part in the following subsections.

Figure 3: The detailed architecture of our IAF-Net. $(N, D)$ represents the number of points and the feature dimension, respectively. FPS: Farthest Point Sampling.
3.1 Encoder Module: Geometry Based Attentive Aggregation
This subsection describes the encoder module GBAA (Geometry Based Attentive Aggregation) of our network architecture. As Figure 4 shows, the GBAA module consists of local feature aggregation and attentive pooling operations.
Local Feature Aggregation
In order to obtain a better description of each point, we enhance the local features with eigenvalues at each point. We utilize the KNN algorithm to get the $K$-nearest neighbors of each point in Euclidean space. As introduced in (Xu, Zhou, and Qiao 2020), we use the coordinates of the neighbors of point $p_i$ to compute an eigenvalue-tuple denoted by $(\lambda_i^1, \lambda_i^2, \lambda_i^3)$. The original input features of each point are denoted as $x_i^0 = (\lambda_i^1, \lambda_i^2, \lambda_i^3)$. The GBAA module in the $l$-th layer takes the original points $P$ and the output of the previous layer $X^{l-1}$ as input. We choose the $K$-nearest neighbors in Euclidean space and in eigenvalue space respectively for each point $p_i$. Let $\{p_{i_1}, \ldots, p_{i_K}\}$ be the $K$-nearest neighbors of $p_i$ in Euclidean space, with corresponding features $\{x_{i_1}^{l-1}, \ldots, x_{i_K}^{l-1}\}$. The features of the $K$-nearest neighbors of point $p_i$ in eigenvalue space are $\{x_{\tilde{i}_1}^{l-1}, \ldots, x_{\tilde{i}_K}^{l-1}\}$. We define the local feature aggregation operation as $g_{\Theta_l}^1 : \mathbb{R}^{2 \times (3+d)} \times \mathbb{R}^{2 \times D_{l-1}} \rightarrow \mathbb{R}^{D_l}$, where $g_{\Theta_l}^1$ is a nonlinear function with a set of learnable parameters $\Theta_l$, and $D_{l-1}$, $D_l$ are the output feature dimensions of the $(l-1)$-th and $l$-th layers respectively. In our module, $g_{\Theta_l}^1$ is a two-layer 2D convolution. The local feature aggregation for each point is

$$x_i^{local,l} = g_{\Theta_l}^1\big(\Vert_{k=1}^{K} \, (p_{i_k} - p_i) \oplus p_i \oplus x_{i_k}^{l-1} \oplus x_{\tilde{i}_k}^{l-1}\big), \quad (1)$$

where $1 \le i \le N$, $\oplus$ denotes concatenation, $\Vert_k$ denotes concatenation along the $K$ dimension, and $x_i^{local,l} \in \mathbb{R}^{K \times D_l}$.
Attentive Pooling
For each point $p_i$, its local feature aggregation is $x_i^{local,l} \in \mathbb{R}^{K \times D_l}$. Instead of max pooling or average pooling, we apply an attentive pooling to $x_i^{local,l}$:

$$x_i^l = \sum_{k=1}^{K} g_{\Theta_l}^2\big(x_i^{local,l}[k]\big) \cdot x_i^{local,l}[k], \quad (2)$$

where $g_{\Theta_l}^2$ is a one-layer 2D convolution.
Moreover, as Figure 3 shows, we apply the local feature aggregation and attentive pooling with two different receptive fields in order to enrich the representation of each point. The receptive field depends on $K$; we choose $K_1$ and $K_2$ nearest neighbors for feature aggregation and attentive pooling. Finally, the output of the $l$-th layer is $X^l = X_{K_1}^l + X_{K_2}^l$.

Figure 4: Encoder Module: Geometry Based Attentive Aggregation. The inputs are point-wise coordinates, colors and features. In GBAA, we aggregate features in both eigenvalue space and Euclidean space, then use attentive pooling to generate the output feature of each point.
3.2 Decoder Module: FP & IAF
This subsection elucidates the decoder module FP & IAF, which is shown in Figure 5. For convenience, we use $Y^l$ to represent the features of the decoder module in the $l$-th layer. The decoder module contains two parts: feature propagation and indistinguishable areas focalization (IAF).
Feature Propagation
In the encoder part, the original point set is sub-sampled. We adopt a hierarchical propagation strategy with distance-based interpolation and across-level skip links as in (Qi et al. 2017b). In a feature propagation step, we propagate point features $Y^l \in \mathbb{R}^{N_l \times D'_l}$ and label predictions $Z^l \in \mathbb{R}^{N_l \times C}$ to $Y_{up}^{l-1} \in \mathbb{R}^{N_{l-1} \times D'_{l-1}}$ and $Z_{up}^{l-1} \in \mathbb{R}^{N_{l-1} \times C}$, where $N_l$ and $N_{l-1}$ ($N_l \le N_{l-1}$) are the point set sizes of the input and output of the $l$-th layer, and $C$ is the number of categories for semantic segmentation:

$$y_{i,fp}^{l-1} = g_{\Psi_{l-1}}\big(y_{i,up}^{l-1} \oplus x_i^{l-1}\big), \quad (3)$$

where $g_{\Psi_{l-1}}$ is a convolution operation with batch normalization and an activation function, and $x_i^{l-1}$ is the feature from the encoder module in the $(l-1)$-th layer.
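As an illustration, the feature propagation step of Eq. (3) can be sketched with the usual inverse-distance interpolation of PointNet++; the use of three neighbors and the tensor shapes below are assumptions rather than details taken from the paper.

```python
# Sketch of feature propagation (Eq. 3): interpolate sparse features, concatenate
# the encoder skip features, and apply a shared conv + BN + ReLU.
import torch
import torch.nn as nn

def interpolate(xyz_dense, xyz_sparse, feat_sparse, k=3, eps=1e-8):
    # Inverse-distance weighted interpolation from layer l to layer l-1 (y_up).
    dist = torch.cdist(xyz_dense, xyz_sparse)                 # (B, N_{l-1}, N_l)
    d, idx = dist.topk(k, dim=-1, largest=False)
    w = 1.0 / (d + eps)
    w = w / w.sum(dim=-1, keepdim=True)
    nbr = torch.gather(feat_sparse.unsqueeze(1).expand(-1, xyz_dense.shape[1], -1, -1),
                       2, idx.unsqueeze(-1).expand(-1, -1, -1, feat_sparse.shape[-1]))
    return (w.unsqueeze(-1) * nbr).sum(dim=2)                 # (B, N_{l-1}, D'_l)

class FeaturePropagation(nn.Module):
    def __init__(self, d_up, d_skip, d_out):
        super().__init__()
        self.mlp = nn.Sequential(                             # g_Psi: conv + BN + ReLU
            nn.Conv1d(d_up + d_skip, d_out, 1),
            nn.BatchNorm1d(d_out), nn.ReLU())

    def forward(self, xyz_dense, xyz_sparse, feat_sparse, feat_skip):
        y_up = interpolate(xyz_dense, xyz_sparse, feat_sparse)
        y = torch.cat([y_up, feat_skip], dim=-1)              # across-level skip link
        return self.mlp(y.transpose(1, 2)).transpose(1, 2)    # Eq. (3): (B, N_{l-1}, d_out)
```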
Indistinguishable Areas Focalization
Indistinguishable points mining. In order to discover indistinguishable points adaptively during training, both low-level geometric and high-level semantic information can be used to mine these points. The local difference is the difference between each point and its neighbors; to a certain extent, it reflects the distinctiveness of each point in terms of low-level geometry, the latent feature space and high-level semantic predictions. We therefore use local differences as the criterion for mining indistinguishable points. For each point $p_i$, we get the $K$-nearest neighbors in Euclidean space and compute the following local differences in each layer:

$$LD_1^l(p_i) = \sum_{k=1}^{K} \|p_i - p_{i_k}\|_2, \quad (4)$$

$$LD_2^l(p_i) = \sum_{k=1}^{K} \|z_{i,up}^{l-1} - z_{i_k,up}^{l-1}\|_2, \quad (5)$$

$$LD_3^l(p_i) = \sum_{k=1}^{K} \|y_{i,fp}^{l-1} - y_{i_k,fp}^{l-1}\|_2. \quad (6)$$

Then we accumulate these local differences:

$$LD^l(p_i) = \sum_{j=1}^{3} \mu_j \times \frac{LD_j^l(p_i) - \min(LD_j^l(p))}{\max(LD_j^l(p)) - \min(LD_j^l(p))}, \quad (7)$$

where $0 \le \mu_j \le 1$. $LD^l$ accumulates the local difference of fine-grained features $LD_3^l$ within each point's local region, the local difference of high-level semantic predictions $LD_2^l$, and the local difference of low-level properties $LD_1^l$, where $\{\mu_j\}$ adjusts the weights of these three local differences. We sort the points in descending order according to $LD^l$ and choose the top $M_{l-1} = N_{l-1}/\tau$ points as the indistinguishable points.
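The mining step of Eqs. (4)-(7) can be summarized in a short sketch; the neighborhood size, the weights mu and the ratio tau below are placeholder values, not the authors' settings.

```python
# Sketch of indistinguishable points mining (Eqs. 4-7); hyperparameters are placeholders.
import torch

def mine_indistinguishable(xyz, probs, feats, k=16, mu=(1.0, 1.0, 1.0), tau=4):
    """xyz: (B, N, 3), probs: (B, N, C) upsampled predictions, feats: (B, N, D)."""
    B, N, _ = xyz.shape
    idx = torch.cdist(xyz, xyz).topk(k, dim=-1, largest=False).indices  # (B, N, k)

    def local_diff(x):
        # Sum of L2 distances between each point and its k Euclidean neighbors.
        nbr = torch.gather(x.unsqueeze(1).expand(-1, N, -1, -1), 2,
                           idx.unsqueeze(-1).expand(-1, -1, -1, x.shape[-1]))
        return (x.unsqueeze(2) - nbr).norm(dim=-1).sum(dim=-1)          # (B, N)

    lds = [local_diff(xyz), local_diff(probs), local_diff(feats)]       # LD_1, LD_2, LD_3
    ld = torch.zeros(B, N, device=xyz.device)
    for m, d in zip(mu, lds):                                           # Eq. (7)
        d_min = d.min(dim=1, keepdim=True).values
        d_max = d.max(dim=1, keepdim=True).values
        ld = ld + m * (d - d_min) / (d_max - d_min + 1e-8)
    return ld.topk(N // tau, dim=1).indices                             # top N/tau points
```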
The mined indistinguishable points correspond to the three types of points mentioned in the Introduction, as Figures 1 and 5 show. These points change dynamically as the network updates iteratively (Figure 6). At the beginning of training, the indistinguishable points are distributed over areas where the original properties (coordinates and colors) change rapidly; as training goes on, they move to the indistinguishable areas described in the Introduction.
Indistinguishable points set focalization. We aggregate the intermediate features and label predictions of the indistinguishable points, then use an MLP (Hornik 1991) to extract features for the indistinguishable points separately:

$$x_j^{l-1} = g_l^1\big(y_{j,fp}^{l-1} \oplus z_{j,up}^{l-1}\big) \in \mathbb{R}^{D_{l-1}}, \quad j \in M_{l-1}, \quad (8)$$

where $j \in M_{l-1}$ means that point $j$ belongs to the indistinguishable points set and $g_l^1$ is an MLP.
Figure 5: Decoder Module: it contains two stages, feature propagation and indistinguishable areas focalization.

Figure 6: The adaptive change process of indistinguishable points during training. The background is colored white. Best viewed in color with 300% zoom.
Update nodes. To enhance the features of points, especially the indistinguishable points, we utilize the non-local mechanism to update the features of all points with the following equations, which enhances the features of indistinguishable points implicitly:

$$y_i^{l-1} = g_l^2\Big(\sum_{j \in M_{l-1}} \big(g_l^3(x_j^{l-1}) \oplus g_l^4(y_{i,fp}^{l-1})\big) \cdot g_l^5(y_{i,fp}^{l-1})\Big), \quad (9)$$

where $g_l^2, g_l^3, g_l^4, g_l^5$ are MLPs. We also obtain the label prediction probability $z_i^{l-1}$ of point $p_i$ in the $(l-1)$-th layer:

$$z_i^{l-1} = \mathrm{Softmax}\big(g_{l-1}^6(y_i^{l-1})\big) \in \mathbb{R}^{C}. \quad (10)$$
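The focalization update can be sketched as follows under one plausible reading of Eqs. (8)-(10), in which every point attends to the mined indistinguishable set with a standard dot-product non-local attention; the exact pairwise function and the layer widths here are our assumptions, not the authors' implementation.

```python
# Sketch of the IAF update, read as a standard non-local attention over the
# indistinguishable set; layer sizes and the attention form are assumptions.
import torch
import torch.nn as nn

class IAFUpdate(nn.Module):
    def __init__(self, d_feat, n_cls, d_hidden=64):
        super().__init__()
        self.g1 = nn.Linear(d_feat + n_cls, d_feat)   # Eq. (8): fuse feature and prediction
        self.g3 = nn.Linear(d_feat, d_hidden)          # embeds indistinguishable points
        self.g4 = nn.Linear(d_feat, d_hidden)          # embeds every point
        self.g5 = nn.Linear(d_feat, d_feat)            # value transform
        self.g2 = nn.Linear(d_feat, d_feat)            # output transform
        self.g6 = nn.Linear(d_feat, n_cls)             # Eq. (10): per-point prediction

    def forward(self, y_fp, z_up, idx_ind):
        """y_fp: (B, N, D) propagated features, z_up: (B, N, C) upsampled predictions,
        idx_ind: (B, M) indices of the mined indistinguishable points."""
        fused = self.g1(torch.cat([y_fp, z_up], dim=-1))                    # (B, N, D)
        gather = lambda x: torch.gather(
            x, 1, idx_ind.unsqueeze(-1).expand(-1, -1, x.shape[-1]))        # (B, M, D)
        x_ind = gather(fused)
        attn = torch.softmax(self.g4(y_fp) @ self.g3(x_ind).transpose(1, 2), dim=-1)
        agg = attn @ self.g5(gather(y_fp))                                  # (B, N, D)
        y = self.g2(agg)                                                    # Eq. (9), our reading
        z = torch.softmax(self.g6(y), dim=-1)                               # Eq. (10)
        return y, z
```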
3.3 Loss for Segmentation
In order to progressively refine the features of indistinguishable areas, we apply a multi-stage loss:

$$L_{ms}^l = \mathrm{CrossEntropy}\big(Z_{gt}^l, Z^l\big), \quad (11)$$

where $Z_{gt}^l \in \mathbb{R}^{N_l \times 1}$ contains the ground truth labels of the points in the $l$-th layer.

As the output of the last layer is $y_i^1$, inspired by (Han et al. 2020), we use the self correlation, local correlation and non-local correlation operations to augment the features of each point $p_i$; the final feature of each point is the accumulation of the three correlations' outputs. The final training loss is

$$L_f = \sum_{l=1}^{5} L_{ms}^l + L_p, \quad (12)$$

where $L_p$ is the loss between the final label predictions $Z$ and the ground truth labels $Z_{gt}$.
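A compact sketch of the training objective of Eqs. (11)-(12) is given below; how per-layer ground-truth labels are obtained for the sub-sampled points (here simply carried along with the points) is an assumption.

```python
# Sketch of the multi-stage loss (Eqs. 11-12): per-stage cross entropy plus the final loss.
import torch
import torch.nn.functional as F

def iaf_loss(stage_logits, stage_labels, final_logits, final_labels):
    """stage_logits: list of (B, N_l, C) predictions from the decoder stages,
    stage_labels: list of (B, N_l) labels of the sub-sampled points of each stage."""
    loss = F.cross_entropy(final_logits.permute(0, 2, 1), final_labels)   # L_p
    for logits, labels in zip(stage_logits, stage_labels):                # sum of L_ms^l
        loss = loss + F.cross_entropy(logits.permute(0, 2, 1), labels)
    return loss
```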
Figure 7: The evaluation process of the indistinguishable points based metric (IPBM). The black line is the boundary of the ground truth, the purple line is the prediction boundary, and the red areas are misclassified areas. $\zeta_1, \zeta_2$ are the parameters for partitioning the three types of indistinguishable points.
3.4 Indistinguishable Points Based Metric
To better distinguish the effect of different methods in 3D semantic segmentation, we propose a novel evaluation metric named the "Indistinguishable Points Based Metric" (IPBM). This metric focuses on the effectiveness of segmentation methods on the indistinguishable areas.

For the whole point set $P = \{p_1, p_2, \ldots, p_N\}$, we have the predictions $Pred = \{z_i, 1 \le i \le N\}$ and the ground truth labels $Label = \{z_{i,gt}, 1 \le i \le N\}$. Figure 7 shows the processing details of the IPBM. First, for each point $p_i$ satisfying $z_i \neq z_{i,gt}$, we take its $K$ neighbors in Euclidean space and denote by $m_i$ the number of neighbors that satisfy $z_{i_k} \neq z_{i_k,gt}$. Next, we divide the interval $[0, 1]$ (the domain of $m_i/K$) into three partitions with endpoints $0, \zeta_1, \zeta_2, 1$.

We determine $\zeta_1 = 0.33$, $\zeta_2 = 0.66$ empirically by considering the curve in Figure 8. To be more specific, Figure 8 shows the growth trend of the number of points with respect to the value $m_i/K$ on the S3DIS dataset. The curve can be divided into three partitions, and from a large number of visualizations we find that these partitions roughly reflect the value distribution of the three types of indistinguishable areas. Examples of visualizations under different choices of $\zeta_1, \zeta_2$ are shown in Figure 9.
Finally, we use $S_1/N$, $S_2/N$, $S_3/N$ as our new evaluation metric, where $S_1, S_2, S_3$ are the numbers of points in the three types of indistinguishable areas. As Figure 9 shows, $S_1/N$ evaluates a method's performance on isolate small areas (colored yellow), $S_2/N$ on complex boundary areas (colored orange), and $S_3/N$ on confusing interior areas (colored cyan).
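The IPBM procedure above can be sketched as follows; the neighborhood size K and the assignment of the three intervals of $m_i/K$ to isolate small, complex boundary and confusing interior areas follow our reading of the text and Figure 8 and should be treated as assumptions.

```python
# Sketch of the IPBM computation (lower values indicate fewer errors of each type).
import numpy as np
from scipy.spatial import cKDTree

def ipbm(points, pred, gt, k=16, zeta1=0.33, zeta2=0.66):
    """points: (N, 3), pred/gt: (N,) integer labels. Returns (S1/N, S2/N, S3/N)."""
    n = points.shape[0]
    wrong = pred != gt
    idx_wrong = np.where(wrong)[0]
    tree = cKDTree(points)
    _, nbrs = tree.query(points[idx_wrong], k=k + 1)   # k neighbors plus the point itself
    nbrs = nbrs[:, 1:]                                 # drop the point itself
    m = wrong[nbrs].sum(axis=1) / k                    # m_i / K for each misclassified point
    s1 = np.sum(m <= zeta1)                            # isolate small areas (assumed interval)
    s2 = np.sum((m > zeta1) & (m <= zeta2))            # complex boundary areas
    s3 = np.sum(m > zeta2)                             # confusing interior areas
    return s1 / n, s2 / n, s3 / n
```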
For a more comprehensive evaluation, three subsets of the point cloud are sampled for the above evaluation. As Figure 10 shows, they are the original point cloud, the category boundary point cloud and the geometry boundary point cloud. The specific methods of acquiring these subsets are explained in the supplementary materials.

Figure 8: The curve of the number of points as $m_i/K$ changes on the S3DIS dataset. $\zeta_1 = 0.33$, $\zeta_2 = 0.66$. $S_1, S_2, S_3$ are the numbers of points in the three types of indistinguishable areas.

Figure 9: Visual examples of $\zeta_1, \zeta_2$ on the S3DIS dataset. Yellow areas are isolate small areas; orange areas are complex boundary areas; cyan areas are confusing interior areas. Best viewed in color with 300% zoom.
4 Experiments
4.1 Experimental Evaluations on Benchmarks
S3DIS Semantic Segmentation
Dataset. The S3DIS (Armeni et al. 2016) dataset contains six sets of point cloud data from three different buildings (including 271 rooms). We follow (Boulch 2020) to prepare the dataset.
For training, we randomly select points in the considered point cloud and extract all points in an infinite vertical column centered on each selected point, where the column section is 2 meters. For each column, we randomly select 8192 points as the input points. During testing, for a more systematic sampling of the space, we compute a 2D occupancy pixel map with a pixel size of 0.1 meters. We then consider each occupied cell as the center of a column (with the same section as in training). Finally, the output scores are aggregated (summed) at the point level, and points not seen by the network receive the label of their nearest neighbors. Following (Boulch 2020), we report results under two settings: testing on Area 5 and 6-fold cross validation.
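For concreteness, the column sampling described above might look like the following sketch; the padding policy when a column contains fewer than 8192 points is an assumption.

```python
# Sketch of 2 m column cropping and 8192-point sampling; unbounded in z.
import numpy as np

def sample_column(points, labels, center_xy, section=2.0, n_points=8192, rng=None):
    """points: (N, 6) xyz + rgb, labels: (N,), center_xy: (2,) column center."""
    rng = rng or np.random.default_rng()
    half = section / 2.0
    mask = np.all(np.abs(points[:, :2] - center_xy) <= half, axis=1)   # infinite column in z
    idx = np.where(mask)[0]
    choice = rng.choice(idx, n_points, replace=len(idx) < n_points)    # pad by resampling
    return points[choice], labels[choice]
```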
Performance Comparison. Table 1 and Table 3 show the quantitative results of different methods under the two settings mentioned above. Our method achieves on-par performance with the state-of-the-art methods. It is noted that some methods (Qi et al. 2017a; Zhao et al. 2019) use a small column section (1 meter), which may not contain enough holistic information: it is common that such a column section cannot contain a whole object.
Methods (published time order) OA (%) mAcc (%) mIoU (%) ceiling flooring wall beam column window door table chair sofa bookcase board clutter
PointNet (Qi et al. 2017a) - 49.0 41.1 88.8 97.3 69.8 0.1 3.9 46.3 10.8 58.9 52.6 5.9 40.3 26.4 33.2
SegCloud (Tchapmi et al. 2017) - 57.4 48.9 90.1 96.1 69.9 0.0 18.4 38.4 23.1 70.4 75.9 40.9 58.4 13.0 41.6
TangentConv (Tatarchenko et al. 2018) - 62.2 52.6 90.5 97.7 74.0 0.0 20.7 39.0 31.3 77.5 69.4 57.3 38.5 48.8 39.8
SPGraph (Landrieu and Simonovsky 2018) 86.4 66.5 58.0 89.4 96.9 78.1 0.0 42.8 48.9 61.6 84.7 75.4 69.8 52.6 2.1 52.2
PCNN (Wang et al. 2018a) - 67.1 58.3 92.3 96.2 75.9 0.3 6.0 69.5 63.5 65.6 66.9 68.9 47.3 59.1 46.2
RNNFusion (Ye et al. 2018) - 63.9 57.3 92.3 98.2 79.4 0.0 17.6 22.8 62.1 80.6 74.4 66.7 31.7 62.1 56.7
Eff 3D Conv (Zhang, Luo, and Urtasun 2018) - 68.3 51.8 79.8 93.9 69.0 0.2 28.3 38.5 48.3 73.6 71.1 59.2 48.7 29.3 33.1
PointCNN (Li et al. 2018) 85.9 63.9 57.3 92.3 98.2 79.4 0.0 17.6 22.8 62.1 74.4 80.6 31.7 66.7 62.1 56.7
PointWeb (Zhao et al. 2019) 87.0 66.6 60.3 92.0 98.5 79.4 0.0 21.1 59.7 34.8 76.3 88.3 46.9 69.3 64.9 52.5
GACNet (Wang et al. 2019) 87.8 - 62.9 92.3 98.3 81.9 0.0 20.4 59.1 40.9 85.8 78.5 70.8 61.7 74.7 52.8
KPConv (Thomas et al. 2019) - 72.8 67.1 92.8 97.3 82.4 0.0 23.9 58.0 69.0 81.5 91.0 75.4 75.3 66.7 58.9
Point2Node (Han et al. 2020) 88.8 70.0 63.0 93.9 98.3 83.3 0.0 35.7 55.3 58.8 79.5 84.7 44.1 71.1 58.7 55.2
FPConv (Lin et al. 2020) - - 62.8 94.6 98.5 80.9 0.0 19.1 60.1 48.9 80.6 88.0 53.2 68.4 68.2 54.9
Ours(IAF-Net) 88.4 70.4 64.6 91.4 98.6 81.8 0.0 34.9 62.0 54.7 79.7 86.9 49.9 72.4 74.8 52.1
Table 1: Semantic segmentation results on S3DIS dataset evaluated on Area 5.
Figure 10: Three subsets of the point cloud used for the eval-
uation on indistinguishable points based metric. The back-
ground is colored white.
By contrast, our method uses the IAF module to deal with the indistinguishable points specifically and uses a big column section (2 meters) to obtain more geometric information as input. For the Area 5 evaluation, our method achieves the best performance except for KPConv (Thomas et al. 2019), and obtains a 2.01% higher result than Point2Node (Han et al. 2020). For the 6-fold evaluation, our method achieves performance (70.3%) comparable with the state-of-the-art method (Thomas et al. 2019). KPConv has 14.9M parameters, while our IAF-Net uses fewer (10.98M). Besides, we do not use voting at test time: due to the large number of points in S3DIS, it takes a huge amount of computing resources and time.
ScanNet Semantic Voxel Labeling
The ScanNet (Dai et al. 2017) dataset contains 1,513 scanned and reconstructed indoor scenes, split into 1,201/312 for training/testing. For the semantic voxel labeling task, 20 categories are used for evaluation plus 1 class for free space. We follow the same data pre-processing strategy as (Zhao et al. 2019), where points are uniformly sampled from scenes and divided into blocks, each of size 1.5m×1.5m. During training, 8,192 point samples are chosen, where no less than 2% of the voxels are occupied and at least 70% of the surface voxels have valid annotation. Points are sampled on the fly. All points in the testing set are used for evaluation, and a smaller sampling stride of 0.5 between each pair of adjacent blocks is adopted during testing. For evaluation, overall semantic voxel labeling accuracy is adopted. For fair comparison with previous approaches, we do not use the RGB color information for training and testing. Table 4 shows the semantic voxel labeling results. Our method achieves performance (85.8%) comparable with the state-of-the-art methods on the ScanNet dataset.
Subsets | Methods | ISA (%) | CBA (%) | CIA (%)
original point cloud | PointWeb | 1.48 | 2.83 | 9.38
original point cloud | KPConv | 1.33 | 2.46 | 9.28
original point cloud | RandLA-Net | 1.23 | 2.58 | 9.07
original point cloud | Ours (IAF-Net) | 1.08 | 2.03 | 8.46
category boundary | PointWeb | 3.98 | 6.94 | 14.31
category boundary | KPConv | 2.73 | 4.71 | 14.21
category boundary | RandLA-Net | 2.57 | 5.19 | 15.02
category boundary | Ours (IAF-Net) | 2.40 | 4.33 | 13.76
geometry boundary | PointWeb | 2.75 | 4.47 | 12.94
geometry boundary | KPConv | 4.60 | 6.13 | 10.89
geometry boundary | RandLA-Net | 2.32 | 4.02 | 13.23
geometry boundary | Ours (IAF-Net) | 2.06 | 3.37 | 12.42
Table 2: Results on the Indistinguishable Points Based Metric (IPBM), lower is better. 'ISA': isolate small areas; 'CBA': complex boundary areas; 'CIA': confusing interior areas.
4.2 The Evaluation Results of IPBM
As described in Sec. 3.4, we propose a novel evaluation metric (IPBM) for distinguishing the effect of different methods. We compare our method with state-of-the-art methods on the S3DIS dataset (Area 5 evaluation), using the predictions of each method to generate the results under the IPBM. The results are summarized in Table 2 with three settings: original point cloud, category boundary and geometry boundary, which correspond to the three subsets described in Sec. 3.4 (shown in Figure 10) respectively. All methods in Table 2 are reproduced by ourselves. Our method achieves the best performance under the settings of original point cloud and geometry boundary, and obtains a result comparable with KPConv (Thomas et al. 2019) under the setting of category boundary.
5 Analysis
5.1 Analysis on Indistinguishable Points Mining
In this section, we conduct a series of experiments on the hyperparameters of the indistinguishable points mining process. As Section 3.2 describes, $LD^l$ is the weighted sum of three local differences with weight factors $\{\mu_1, \mu_2, \mu_3\}$, and we choose the top $N_{l-1}/\tau$ points according to $LD^l$ as the indistinguishable points. The following experiments are conducted on the S3DIS dataset (Area 5 evaluation). The ablation on the accumulation of local differences can be found in the supplementary materials, and the proportion of indistinguishable points among the original points is discussed below.
Methods (published time order) OA (%) mAcc (%) mIoU (%) ceiling flooring wall beam column window door table chair sofa bookcase board clutter
PointNet (Qi et al. 2017a) 78.5 66.2 47.8 88.0 88.7 69.3 42.4 23.1 47.5 51.6 54.1 42.0 9.6 38.2 29.4 35.2
DGCNN(Wang et al. 2018c) 84.1 - 56.1 - - - - - - - - - - - - -
RSNet (Huang, Wang, and Neumann 2018) - 66.5 56.5 92.5 92.8 78.6 32.8 34.4 51.6 68.1 59.7 60.1 16.4 50.2 44.9 52.0
PCNN (Wang et al. 2018a) - 67.0 58.3 92.3 96.2 75.9 0.27 6.0 69.5 63.5 66.9 65.6 47.3 68.9 59.1 46.2
SPGraph (Landrieu and Simonovsky 2018) 85.5 73.0 62.1 89.9 95.1 76.4 62.8 47.1 55.3 68.4 69.2 73.5 45.9 63.2 8.7 52.9
PointCNN (Li et al. 2018) 88.1 75.6 65.4 94.8 97.3 75.8 63.3 51.7 58.4 57.2 69.1 71.6 61.2 39.1 52.2 58.6
A-CNN (Komarichev, Zhong, and Hua 2019) 87.3 - 62.9 92.4 96.4 79.2 59.5 34.2 56.3 65.0 66.5 78.0 28.5 56.9 48.0 56.8
PointWeb (Zhao et al. 2019) 87.3 76.2 66.7 93.5 94.2 80.8 52.4 41.3 64.9 68.1 71.4 67.1 50.3 62.7 6.2 58.5
KPConv (Thomas et al. 2019) - 79.1 70.6 93.6 92.4 83.1 63.9 54.3 66.1 76.6 64.0 57.8 74.9 69.3 61.3 60.3
ShellNet (Zhang, Hua, and Yeung 2019) 87.1 - 66.8 90.2 93.6 79.9 60.4 44.1 64.9 52.9 71.6 84.7 53.8 64.6 48.6 59.4
Point2Node (Han et al. 2020) 89.0 79.1 70.0 94.1 97.3 83.4 62.7 52.3 72.3 64.3 75.8 70.8 65.7 49.8 60.3 60.9
RandLA-Net (Hu et al. 2020) 87.1 81.5 68.5 92.7 95.6 79.2 61.7 47.0 63.1 67.7 68.9 74.2 55.3 63.4 63.0 58.7
FPConv (Lin et al. 2020) - - 68.7 94.8 97.5 82.6 42.8 41.8 58.6 73.4 71.0 81.0 59.8 61.9 64.2 64.2
Ours(IAF-Net) 88.8 77.8 70.3 93.3 97.9 81.9 55.2 42.7 64.9 74.7 74.2 71.8 63.3 66.2 66.5 60.5
Table 3: Semantic segmentation results on S3DIS dataset with 6-folds cross validation.
Methods OA (%)
3DCNN (Bruna et al. 2013) 73.0
PointNet (Charles et al. 2017) 73.9
TCDP (Tatarchenko et al. 2018) 80.9
PointNet++ (Qi et al. 2017b) 84.5
PointCNN (Li et al. 2018) 85.1
A-CNN (Komarichev, Zhong, and Hua 2019) 85.4
PointWeb (Zhao et al. 2019) 85.9
Ours(IAF-Net) 85.8
Table 4: Results on ScanNet dataset.
Figure 11: Visual comparison of semantic segmentation re-
sults on S3DIS dataset. Best viewed in color and zoom in.
Proportion of the indistinguishable points in the original points. In order to balance the indistinguishable points and the original points, $\tau$ is used to control the proportion of indistinguishable points among the input points of each layer. As Figure 12 shows, we obtain the best performance when we set the proportion to 1:4. When the proportion is too large, it increases the training difficulty of the non-local mechanism and degrades the performance. By contrast, when the proportion is too small, the indistinguishable points set may not cover all categories, because the indistinguishable points of different categories may differ in degree.
5.2 Ablation Study
In this section, we conduct the following ablation studies on our network architecture. All ablated networks are tested on Area 5 of the S3DIS dataset. Table 5 shows the results.
Figure 12: Evaluation of the m-IoU when reducing the num-
ber of indistinguishable points.
Ablation setting | m-IoU (%)
(a) without IAF | 62.6
(b) without IAF & replace attentive pooling | 60.6
(c) without IAF & replace attentive pooling & without correlations | 59.9
(d) without multi scale strategy of encoder | 59.2
(e) The full framework | 64.6
Table 5: Ablation studies on S3DIS Area 5 validation based on our full network.
(a) Removing the IAF module. This module is used to deal with the indistinguishable points specifically; after removing it, we directly feed the output features of feature propagation to the next module. (b) Removing the IAF module and replacing the attentive pooling with max pooling. The attentive pooling unit learns to combine all local point features automatically in a soft way; by comparison, max pooling tends to select or combine features in a hard way, and the performance degrades. (c) Based on (b), removing the three correlations. (d) Removing the multi-scale strategy of the encoder. To enhance the point representations of the encoder, we use a multi-scale strategy to obtain features from two different receptive fields; using only one receptive field instead reduces the performance as expected.
6 Conclusion
Our paper revolves around the indistinguishable points in semantic segmentation. First, we make a qualitative analysis of the indistinguishable points. Then we present a novel framework, IAF-Net, which is based on the IAF module and a multi-stage loss. Besides, we propose a new evaluation metric (IPBM) to evaluate the three types of indistinguishable points respectively. Experimental results demonstrate the effectiveness and generalization ability of our method.
7 Acknowledgments
This work was supported in part by the Shanghai Committee of Science and Technology, China (Grant No. 20DZ1100800), in part by the National Natural Science Foundation of China under Grants 61876176 and U1713208, and in part by the Shenzhen Basic Research Program (CXB201104220032A) and the Guangzhou Research Program (201803010066).
References
Armeni, I.; Sener, O.; Zamir, A. R.; Jiang, H.; Brilakis, I.;
Fischer, M.; and Savarese, S. 2016. 3d semantic parsing of
large-scale indoor spaces. In CVPR.
Boulch, A. 2020. ConvPoint: Continuous convolutions for
point cloud processing. Computers & Graphics .
Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2013.
Spectral networks and locally connected networks on
graphs. arXiv preprint arXiv:1312.6203 .
Cao, Y.; Xu, J.; Lin, S.; Wei, F.; and Hu, H. 2019. Gcnet:
Non-local networks meet squeeze-excitation networks and
beyond. In ICCV Workshops.
Charles, R. Q.; Su, H.; Kaichun, M.; and Guibas, L. J. 2017.
PointNet: Deep Learning on Point Sets for 3D Classification
and Segmentation. CVPR .
Chen, X.; Ma, H.; Wan, J.; Li, B.; and Xia, T. 2017. Multi-
view 3d object detection network for autonomous driving.
In CVPR.
Chen, Z.; Zeng, W.; Yang, Z.; Yu, L.; Fu, C. W.; and Qu, H.
2020. LassoNet: Deep Lasso-Selection of 3D Point Clouds.
IEEE Transactions on Visualization and Computer Graphics
26(1): 195–204. doi:10.1109/TVCG.2019.2934332.
Dai, A.; Chang, A. X.; Savva, M.; Halber, M.; Funkhouser,
T.; and Nießner, M. 2017. Scannet: Richly-annotated 3d re-
constructions of indoor scenes. In CVPR.
Han, W.; Wen, C.; Wang, C.; Li, X.; and Li, Q. 2020.
Point2Node: Correlation Learning of Dynamic-Node for
Point Cloud Feature Modeling. AAAI .
Hornik, K. 1991. Approximation capabilities of multilayer
feedforward networks. Neural networks .
Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.;
Trigoni, N.; and Markham, A. 2020. RandLA-Net: Efficient
Semantic Segmentation of Large-Scale Point Clouds. CVPR
.
Huang, Q.; Wang, W.; and Neumann, U. 2018. Recur-
rent slice networks for 3d segmentation of point clouds. In
CVPR.
Jiang, M.; Wu, Y.; Zhao, T.; Zhao, Z.; and Lu, C. 2018.
Pointsift: A sift-like network module for 3d point cloud se-
mantic segmentation. arXiv preprint arXiv:1807.00652 .
Komarichev, A.; Zhong, Z.; and Hua, J. 2019. A-CNN:
Annularly Convolutional Neural Networks on Point Clouds.
CVPR .
Landrieu, L.; and Simonovsky, M. 2018. Large-scale point
cloud semantic segmentation with superpoint graphs. In
CVPR.
Li, X.; Liu, Z.; Luo, P.; Change Loy, C.; and Tang, X. 2017.
Not all pixels are equal: Difficulty-aware semantic segmen-
tation via deep layer cascade. In CVPR.
Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; and Chen, B. 2018.
Pointcnn: Convolution on x-transformed points. In NIPS.
Lin, Y.; Yan, Z.; Huang, H.; Du, D.; Liu, L.; Cui, S.; and
Han, X. 2020. FPConv: Learning Local Flattening for Point
Convolution. In CVPR.
Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017a. Pointnet:
Deep learning on point sets for 3d classification and segmen-
tation. In CVPR.
Qi, C. R.; Yi, L.; Su, H.; and Guibas, L. J. 2017b. Point-
net++: Deep hierarchical feature learning on point sets in a
metric space. In NIPS.
Rusu, R. B.; Marton, Z. C.; Blodow, N.; Dolha, M.; and
Beetz, M. 2008. Towards 3D point cloud based object maps
for household environments. Robotics and Autonomous Sys-
tems .
Tatarchenko, M.; Park, J.; Koltun, V.; and Zhou, Q.-Y. 2018.
Tangent convolutions for dense prediction in 3d. In CVPR.
Tchapmi, L.; Choy, C.; Armeni, I.; Gwak, J.; and Savarese,
S. 2017. Segcloud: Semantic segmentation of 3d point
clouds. In 3DV.
Thomas, H.; Qi, C. R.; Deschaud, J.-E.; Marcotegui, B.;
Goulette, F.; and Guibas, L. J. 2019. KPConv: Flexible and
Deformable Convolution for Point Clouds. arXiv preprint
arXiv:1904.08889 .
Wang, L.; Huang, Y.; Hou, Y.; Zhang, S.; and Shan, J. 2019.
Graph Attention Convolution for Point Cloud Semantic Seg-
mentation. In CVPR.
Wang, S.; Suo, S.; Ma, W.-C.; Pokrovsky, A.; and Urtasun,
R. 2018a. Deep parametric continuous convolutional neural
networks. In CVPR.
Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018b. Non-
local Neural Networks. CVPR .
Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S. E.; Bronstein, M. M.;
and Solomon, J. M. 2018c. Dynamic graph cnn for learning
on point clouds. arXiv preprint arXiv:1801.07829 .
Wu, W.; Qi, Z.; and Fuxin, L. 2019. Pointconv: Deep con-
volutional networks on 3d point clouds. In CVPR.
Xu, M.; Zhou, Z.; and Qiao, Y. 2020. Geometry Sharing
Network for 3D Point Cloud Classification and Segmenta-
tion. AAAI .
Yan, X.; Zheng, C.; Li, Z.; Wang, S.; and Cui, S. 2020.
PointASNL: Robust Point Clouds Processing using Nonlo-
cal Neural Networks with Adaptive Sampling. In CVPR.
Yang, J.; Zhang, Q.; Ni, B.; Li, L.; Liu, J.; Zhou, M.; and
Tian, Q. 2019. Modeling point clouds with self-attention
and gumbel subset sampling. In CVPR.
Ye, X.; Li, J.; Huang, H.; Du, L.; and Zhang, X. 2018. 3d
recurrent neural networks with context fusion for point cloud
semantic segmentation. In ECCV.
Zhang, C.; Luo, W.; and Urtasun, R. 2018. Efficient con-
volutions for real-time semantic segmentation of 3d point
clouds. In 3DV.
Zhang, Z.; Hua, B.-S.; and Yeung, S.-K. 2019. ShellNet:
Efficient Point Cloud Convolutional Neural Networks using
Concentric Shells Statistics. In ICCV.
Zhao, H.; Jiang, L.; Fu, C.-W.; and Jia, J. 2019. PointWeb:
Enhancing local neighborhood features for point cloud pro-
cessing. In CVPR.