
Large-Scale Point Cloud Semantic Segmentation with Superpoint Graphs


Loic Landrieu1*, Martin Simonovsky2*
Université Paris-Est, Ecole des Ponts ParisTech

Abstract
We propose a novel deep learning-based framework to
tackle the challenge of semantic segmentation of large-
scale point clouds of millions of points. We argue that the
organization of 3D point clouds can be efficiently captured
by a structure called superpoint graph (SPG), derived
from a partition of the scanned scene into geometrically
homogeneous elements. SPGs offer a compact yet rich
representation of contextual relationships between object
parts, which is then exploited by a graph convolutional
network. Our framework sets a new state of the art for
segmenting outdoor LiDAR scans (+11.9 and +8.8 mIoU
points for both Semantic3D test sets), as well as indoor
scans (+12.4 mIoU points for the S3DIS dataset).
1. Introduction
Semantic segmentation of large 3D point clouds presents
numerous challenges, the most obvious one being the scale
of the data. Another hurdle is the lack of clear structure
akin to the regular grid arrangement in images. These obsta-
cles have likely prevented Convolutional Neural Networks
(CNNs) from achieving on irregular data the impressive per-
formances attained for speech processing or images.
Previous attempts at using deep learning for large 3D
data were trying to replicate successful CNN architectures
used for image segmentation. For example, SnapNet [5]
converts a 3D point cloud into a set of virtual 2D RGBD
snapshots, the semantic segmentation of which can then be
projected on the original data. SegCloud [46] uses 3D con-
volutions on a regular voxel grid. However, we argue that
such methods do not capture the inherent structure of 3D
point clouds, which results in limited discrimination performance.
Indeed, converting point clouds to a 2D format loses information
and requires surface reconstruction, a problem arguably as hard
as semantic segmentation. Volumetric representation of point
clouds is inefficient and tends to discard small details.
*Both authors contributed equally to this work.
Deep learning architectures specifically designed for 3D
point clouds [39,45,42,40,10] display good results, but
are limited by the size of inputs they can handle at once.
We propose a representation of large 3D point clouds as
a collection of interconnected simple shapes coined super-
points, in spirit similar to superpixel methods for image seg-
mentation [1]. As illustrated in Figure 1, this structure can
be captured by an attributed directed graph called the super-
point graph (SPG). Its nodes represent simple shapes while
edges describe their adjacency relationship characterized by
rich edge features.
The SPG representation has several compelling advantages.
First, instead of classifying individual points or voxels, it
considers entire object parts as a whole, which are easier to
identify. Second, it is able to describe in detail the
relationship between adjacent objects, which is crucial for
contextual classification: cars are generally above roads,
ceilings are surrounded by walls, etc. Third, the size of
the SPG is defined by the number of simple structures in a
scene rather than the total number of points, which is
typically several orders of magnitude smaller. This allows us
to model long-range interactions that would be intractable
otherwise without strong assumptions on the nature of the
pairwise connections. Our contributions are as follows:
• We introduce superpoint graphs, a novel point cloud
  representation with rich edge features encoding the
  contextual relationship between object parts in 3D
  point clouds.
• Based on this representation, we are able to apply deep
  learning on large-scale point clouds without major sacrifice
  in fine details. Our architecture consists of PointNets [39]
  for superpoint embedding and graph convolutions for
  contextual segmentation. For the latter, we introduce a
  novel, more efficient version of Edge-Conditioned
  Convolutions [45] as well as a new form of input gating in
  Gated Recurrent Units [8].
• We set a new state of the art on two publicly available
  datasets: Semantic3D [14] and S3DIS [3]. In particular, we
  improve mean per-class intersection over union (mIoU) by
  11.9 points for the Semantic3D reduced test set, by 8.8
  points for the Semantic3D full test set, and by up to 12.4
  points for the S3DIS dataset.

arXiv:1711.09869v2 [cs.CV] 28 Mar 2018

(a) RGB point cloud (b) Geometric partition (c) Superpoint graph (d) Semantic segmentation
Figure 1: Visualization of individual steps in our pipeline. An input point cloud (a) is partitioned into geometrically simple
shapes, called superpoints (b). Based on this preprocessing, a superpoint graph (SPG) is constructed by linking nearby
superpoints by superedges with rich attributes (c). Finally, superpoints are transformed into compact embeddings, processed
with graph convolutions to make use of contextual information, and classified into semantic labels (d).
2. Related Work
The classic approach to large-scale point cloud segmen-
tation is to classify each point or voxel independently using
handcrafted features derived from their local neighborhood
[48]. The solution is then spatially regularized using graph-
ical models [37,24,34,44,21,2,38,35,49] or structured
optimization [27]. Clustering as preprocessing [16,13] or
postprocessing [47] has been used by several frameworks
to improve the accuracy of the classification.
Deep Learning on Point Clouds. Several different
approaches going beyond naive volumetric processing of
point clouds have been proposed recently, notably set-
based [39,40], tree-based [42,23], and graph-based [45].
However, very few methods with deep learning components
have been demonstrated to be able to segment large-scale
point clouds. PointNet [39] can segment large clouds with
a sliding window approach, therefore constraining contex-
tual information within a small area only. Engelmann et
al. [10] improves on this by increasing the context scope
with multi-scale windows or by considering directly neigh-
boring window positions on a voxel grid. SEGCloud [46]
handles large clouds by voxelizing followed by interpola-
tion back to the original resolution and post-processing with
a conditional random field (CRF). None of these approaches
is able to consider fine details and long-range contextual
information simultaneously. In contrast, our pipeline partitions
point clouds in an adaptive way according to their geometric
complexity and allows deep learning architectures to
use both fine detail and interactions over long distances.
Graph Convolutions. A key step of our approach is
using graph convolutions to spread contextual information.
Formulations that are able to deal with graphs of variable
sizes can be seen as a form of message passing over graph
edges [12]. Of particular interest are models supporting
continuous edge attributes [45,36], which we use to rep-
resent interactions. In image segmentation, convolutions
on graphs built over superpixels have been used for post-
processing: Liang et al. [32,31] traverses such graphs in
a sequential node order based on unary confidences to im-
prove the final labels. We update graph nodes in parallel
and exploit edge attributes for informative context model-
ing. Xu et al. [50] convolves information over graphs of
object detections to infer their contextual relationships. Our
work infers relationships implicitly to improve segmenta-
tion results. Qi et al. [41] also relies on graph convolutions
on 3D point clouds. However, we process large point clouds
instead of small RGBD images with nodes embedded in 3D
instead of 2D in a novel, rich-attributed graph. Finally, we
note that graph convolutions also bear functional similar-
ity to deep learning formulations of CRFs [51], which we
discuss more in Section 3.4.
3. Method
The main obstacle that our framework tries to overcome
is the size of LiDAR scans. Indeed, they can reach hun-
dreds of millions of points, making direct deep learning
approaches intractable. The proposed SPG representation
allows us to split the semantic segmentation problem into
three distinct problems of different scales, shown in Fig-
ure 2, which can in turn be solved by methods of corre-
sponding complexity:
1. Geometrically homogeneous partition: The first step
   of our algorithm is to partition the point cloud into
   geometrically simple yet meaningful shapes, called
   superpoints. This unsupervised step takes the whole point
   cloud as input, and therefore must be computationally
   very efficient. The SPG can be easily computed from
   this partition.
2. Superpoint embedding: Each node of the SPG corresponds
   to a small part of the point cloud corresponding
   to a geometrically simple primitive, which we assume
   to be semantically homogeneous. Such primitives
   can be reliably represented by downsampling
   small point clouds to at most hundreds of points. This
   small size allows us to utilize recent point cloud
   embedding methods such as PointNet [39].
3. Contextual segmentation: The graph of superpoints
   is by orders of magnitude smaller than any graph built
   on the original point cloud. Deep learning algorithms
   based on graph convolutions can then be used to classify
   its nodes using rich edge features facilitating
   long-range interactions.

(a) Input point cloud (legend: point, edge of E_vor) (b) Superpoint graph (legend: superpoint, superedge) (c) Network architecture
Figure 2: Illustration of our framework on a toy scan of a table and a chair. We perform geometric partitioning on the point
cloud (a), which allows us to build the superpoint graph (b). Each superpoint is embedded by a PointNet network. The
embeddings are then refined in GRUs by message passing along superedges to produce the final labeling (c).
The SPG representation allows us to perform end-to-end
learning of the last two steps, which are the trainable ones.
We describe each step of our pipeline in the following subsections.
3.1. Geometric Partition with a Global Energy
In this subsection, we describe our method for partition-
ing the input point cloud into parts of simple shape. Our
objective is not to retrieve individual objects such as cars
or chairs, but rather to break down the objects into simple
parts, as seen in Figure 3. However, the clusters being ge-
ometrically simple, one can expect them to be semantically
homogeneous as well, i.e. not to cover objects of different
classes. Note that this step of the pipeline is purely unsuper-
vised and makes no use of class labels beyond validation.
We follow the global energy model described by [13] for
its computational efficiency. Another advantage is that the
segmentation is adaptive to the local geometric complexity.
In other words, the segments obtained can be large simple
shapes such as roads or walls, as well as much smaller com-
ponents such as parts of a car or a chair.
Let us consider the input point cloud $C$ as a set of $n$
3D points. Each point $i \in C$ is defined by its 3D position
$p_i$, and, if available, other observations $o_i$ such as color or
intensity. For each point, we compute a set of $d_g$ geometric
features $f_i \in \mathbb{R}^{d_g}$ characterizing the shape of its local
neighborhood. In this paper, we use three dimensionality
values proposed by [9]: linearity, planarity and scattering,
as well as the verticality feature introduced by [13]. We
also compute the elevation of each point, defined as the $z$
coordinate of $p_i$ normalized over the whole input cloud.
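As an illustration of the per-point features, the sketch below computes linearity, planarity and scattering from the eigenvalues of the covariance of one local neighborhood, plus a normalized elevation. The formulas used here are one common eigenvalue-based variant and are an assumption: the text does not spell them out, and [9] may normalize differently (for example with square roots of the eigenvalues). The min-max normalization of the elevation is likewise assumed, and the verticality feature of [13] is not covered. The function names are hypothetical.

```python
import numpy as np

def dimensionality_features(neighborhood):
    """Linearity, planarity, scattering of one (k, 3) neighborhood (assumed formulas)."""
    centered = neighborhood - neighborhood.mean(axis=0)
    cov = centered.T @ centered / len(neighborhood)
    # eigvalsh returns ascending eigenvalues; reverse so that l1 >= l2 >= l3
    l1, l2, l3 = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)
    l1 = max(l1, 1e-12)  # guard against a degenerate single-point neighborhood
    return (l1 - l2) / l1, (l2 - l3) / l1, l3 / l1

def normalized_elevation(z):
    """z coordinates rescaled to [0, 1] over the whole cloud (assumed min-max scaling)."""
    return (z - z.min()) / max(z.max() - z.min(), 1e-12)
```

With these definitions the three dimensionality values sum to one, so a neighborhood sampled along a line has linearity close to 1 and a flat patch has planarity close to 1.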
The global energy proposed by [13] is defined with respect
to the 10-nearest neighbor adjacency graph $G_{nn} = (C, E_{nn})$
of the point cloud (note that this is not the SPG).
The geometrically homogeneous partition is defined as the
constant connected components of the solution of the following
optimization problem:

$$\arg\min_{g \in \mathbb{R}^{d_g}} \sum_{i \in C} \|g_i - f_i\|^2 + \mu \sum_{(i,j) \in E_{nn}} w_{i,j} \, [g_i - g_j \neq 0], \qquad (1)$$

where $[\cdot]$ is the Iverson bracket. The edge weight $w \in \mathbb{R}^{|E|}$
is chosen to be linearly decreasing with respect to the edge
length. The factor $\mu$ is the regularization strength and
determines the coarseness of the resulting partition.
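To make the objective concrete, the sketch below evaluates an energy of this form (a squared fidelity term plus a weighted count of cut edges) for a candidate solution $g$, and extracts the constant connected components that define the partition. It is only an evaluation helper under assumed array layouts; it does not implement the $\ell_0$-cut pursuit solver, and the function names are hypothetical.

```python
import numpy as np

def partition_energy(f, g, edges, w, mu):
    """Fidelity plus mu-weighted cut cost for a candidate solution g.

    f, g: (n, d_g) arrays; edges: (m, 2) int array of E_nn; w: (m,) edge weights.
    """
    fidelity = np.sum((g - f) ** 2)
    # Iverson bracket: 1 where the two endpoints carry different values of g
    cut = np.any(g[edges[:, 0]] != g[edges[:, 1]], axis=1)
    return fidelity + mu * np.sum(w[cut])

def constant_components(g, edges):
    """Label connected components of the graph on which g is constant (union-find)."""
    parent = np.arange(len(g))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in edges:
        if np.array_equal(g[i], g[j]):
            parent[find(i)] = find(j)
    roots = np.array([find(i) for i in range(len(g))])
    # relabel component roots to consecutive ids 0..k-1
    return np.unique(roots, return_inverse=True)[1]
```

A larger `mu` makes every cut edge more expensive, which favors solutions with fewer, larger components.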
The problem defined in Equation 1 is known as the generalized
minimal partition problem, and can be seen as
a continuous-space version of the Potts energy model, or
an $\ell_0$ variant of the graph total variation. The minimized
functional being nonconvex and noncontinuous implies that
the problem cannot realistically be solved exactly for large
point clouds. However, the $\ell_0$-cut pursuit algorithm introduced
by [26] is able to quickly find an approximate solution
with a few graph-cut iterations. In contrast to other
optimization methods such as α-expansion [6], the $\ell_0$-cut
pursuit algorithm does not require selecting the size of the
partition in advance. The constant connected components
$\mathcal{S} = \{S_1, \cdots, S_k\}$ of the solution of Equation 1 define our
geometrically simple elements, and are referred to as superpoints
(i.e. set of points) in the rest of this paper.

Feature name       Size  Description
mean offset        3     $\mathrm{mean}_{m \in \delta(S,T)} \, \delta_m$
offset deviation   3     $\mathrm{std}_{m \in \delta(S,T)} \, \delta_m$
centroid offset    3     $\mathrm{mean}_{i \in S} \, p_i - \mathrm{mean}_{j \in T} \, p_j$
length ratio       1     $\log \mathrm{length}(S) / \mathrm{length}(T)$
surface ratio      1     $\log \mathrm{surface}(S) / \mathrm{surface}(T)$
volume ratio       1     $\log \mathrm{volume}(S) / \mathrm{volume}(T)$
point count ratio  1     $\log |S| / |T|$
Table 1: List of $d_f = 13$ superedge features characterizing
the adjacency between two superpoints $S$ and $T$.
3.2. Superpoint Graph Construction
In this subsection, we describe how we compute the SPG
as well as its key features. The SPG is a structured representation
of the point cloud, defined as an oriented attributed
graph $G = (\mathcal{S}, \mathcal{E}, F)$ whose nodes are the set of superpoints
$\mathcal{S}$ and edges $\mathcal{E}$ (referred to as superedges) represent the
adjacency between superpoints. The superedges are annotated
by a set of $d_f$ features: $F \in \mathbb{R}^{\mathcal{E} \times d_f}$ characterizing the
adjacency relationship between superpoints.
We define $G_{vor} = (C, E_{vor})$ as the symmetric Voronoi
adjacency graph of the complete input point cloud as defined
by [20]. Two superpoints $S$ and $T$ are adjacent if there
is at least one edge in $E_{vor}$ with one end in $S$ and one end in $T$:

$$\mathcal{E} = \{(S, T) \in \mathcal{S}^2 \mid \exists \, (i, j) \in E_{vor} \cap (S \times T)\}. \qquad (2)$$

Important spatial features associated with a superedge
$(S, T)$ are obtained from the set of offsets $\delta(S, T)$ for edges
in $E_{vor}$ linking both superpoints:

$$\delta(S, T) = \{(p_i - p_j) \mid (i, j) \in E_{vor} \cap (S \times T)\}. \qquad (3)$$

Superedge features can also be derived by comparing the
shape and size of the adjacent superpoints. To this end,
we compute $|S|$ as the number of points comprised in a
superpoint $S$, as well as shape features $\mathrm{length}(S) = \lambda_1$,
$\mathrm{surface}(S) = \lambda_1 \lambda_2$, $\mathrm{volume}(S) = \lambda_1 \lambda_2 \lambda_3$ derived from
the eigenvalues $\lambda_1, \lambda_2, \lambda_3$ of the covariance of the positions
of the points comprised in each superpoint, sorted by decreasing
value. In Table 1, we describe a list of the different
superedge features used in this paper. Note that the break
of symmetry in the edge features makes the SPG a directed graph.
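The 13 superedge features of Table 1 can be sketched as follows for a single pair of adjacent superpoints. This is a minimal illustration under assumptions: the Voronoi edges are taken as given (listed in both directions, since the graph is symmetric), the covariance uses NumPy's default unbiased estimate, a small floor on the eigenvalues is added to avoid taking the logarithm of zero, and the function names are hypothetical.

```python
import numpy as np

def shape_features(points):
    """length, surface, volume from covariance eigenvalues sorted by decreasing value."""
    lam = np.linalg.eigvalsh(np.cov(points.T))[::-1]
    lam = np.clip(lam, 1e-12, None)  # assumption: guard against degenerate shapes
    return lam[0], lam[0] * lam[1], lam[0] * lam[1] * lam[2]

def superedge_features(p, S, T, evor):
    """13 features of the superedge (S, T); S and T must be adjacent.

    p: (n, 3) positions; S, T: index arrays; evor: (m, 2) directed edges of E_vor.
    """
    in_S, in_T = np.isin(evor[:, 0], S), np.isin(evor[:, 1], T)
    link = evor[in_S & in_T]                    # edges of E_vor in S x T
    delta = p[link[:, 0]] - p[link[:, 1]]       # offsets p_i - p_j
    centroid = p[S].mean(axis=0) - p[T].mean(axis=0)
    ratios = [np.log(a / b)
              for a, b in zip(shape_features(p[S]), shape_features(p[T]))]
    ratios.append(np.log(len(S) / len(T)))
    return np.concatenate([delta.mean(axis=0), delta.std(axis=0), centroid, ratios])
```

Swapping `S` and `T` flips the sign of the mean offset, the centroid offset and all log ratios while leaving the offset deviation unchanged, which is the asymmetry that makes the graph directed.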
3.3. Superpoint Embedding
The goal of this stage is to compute a descriptor for every
superpoint $S_i$ by embedding it into a vector $z_i$ of fixed-size
dimensionality $d_z$. Note that each superpoint is embedded
in isolation; contextual information required for its reliable
classification is provided only in the following stage by the
means of graph convolutions.
Several deep learning-based methods have been pro-
posed for this purpose recently. We choose PointNet [39]
for its remarkable simplicity, efficiency, and robustness. In
PointNet, input points are first aligned by a Spatial Trans-
former Network [19], independently processed by multi-
layer perceptrons (MLPs), and finally max-pooled to sum-
marize the shape.
In our case, input shapes are geometrically simple objects,
which can be reliably represented by a small number
of points and embedded by a rather compact PointNet. This
is important to limit the memory needed when evaluating
many superpoints on current GPUs. In particular, we subsample
superpoints on-the-fly down to $n_p = 128$ points to
maintain efficient computation in batches and facilitate data
augmentation. Superpoints of fewer than $n_p$ points are sampled
with replacement, which in principle does not affect
the evaluation of PointNet due to its max-pooling. However,
we observed that including very small superpoints of
fewer than $n_{minp} = 40$ points in training harms the overall
performance. Thus, the embedding of such superpoints is set
to zero so that their classification relies solely on contextual
information.
In order for PointNet to learn the spatial distribution of
different shapes, each superpoint is rescaled to the unit sphere
before embedding. Points are represented by their normalized
position $p'_i$, observations $o_i$, and geometric features $f_i$
(since these are already available precomputed from the
partitioning step). Furthermore, the original metric diameter of
the superpoint is concatenated as an additional feature after
PointNet max-pooling in order to stay covariant with shape sizes.
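The subsampling and rescaling steps described in this subsection can be sketched as below. Several details are assumptions not fixed by the text: the centroid is used as the center of the unit sphere, the diameter is taken as twice the largest distance to the centroid, and the normalization is computed on the full superpoint before subsampling. The function name is hypothetical; the constants 128 and 40 are the values quoted above.

```python
import numpy as np

N_P, N_MINP = 128, 40  # n_p and n_minp as quoted in the text

def prepare_superpoint(points, rng):
    """Return (unit-sphere sample of N_P points, metric diameter), or None.

    points: (k, 3) positions of one superpoint; rng: numpy random Generator.
    None means the superpoint is too small and its embedding is set to zero.
    """
    if len(points) < N_MINP:
        return None
    center = points.mean(axis=0)
    radius = max(np.linalg.norm(points - center, axis=1).max(), 1e-12)
    # sample with replacement only when the superpoint has fewer than N_P points
    idx = rng.choice(len(points), size=N_P, replace=len(points) < N_P)
    return (points[idx] - center) / radius, 2.0 * radius
```

Drawing a fresh sample on every call is what makes this step double as data augmentation during training; the returned diameter is the scalar that would be appended after max-pooling.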
3.4. Contextual Segmentation
The final stage of the pipeline is to classify each superpoint
$S_i$ based on its embedding $z_i$ and its local surroundings
within the SPG. Graph convolutions are naturally
suited to this task. In this section, we explain the propagation
model of our system.
Our approach builds on the ideas from Gated Graph
Neural Networks [30] and Edge-Conditioned Convolutions
(ECC) [45]. The general idea is that superpoints refine their
embedding according to pieces of information passed along
superedges. Concretely, each superpoint $S_i$ maintains its
state hidden in a Gated Recurrent Unit (GRU) [8]. The hidden
state is initialized with embedding $z_i$ and is then processed
over several iterations (time steps) $t = 1 \ldots T$. At
each iteration $t$, a GRU takes its hidden state $h_i^{(t)}$ and an
incoming message $m_i^{(t)}$ as input, and computes its new hidden
state $h_i^{(t+1)}$. The incoming message $m_i^{(t)}$ to superpoint
$i$ is computed as a weighted sum of hidden states $h_j^{(t)}$ of
neighboring superpoints $j$. The actual weighting for a superedge
$(j, i)$ depends on its attributes $F_{ji,\cdot}$, listed in Table 1.
In particular, it is computed from the attributes by a
multi-layer perceptron $\Theta$, the so-called Filter Generating
Network. Formally:

$$h_i^{(t+1)} = (1 - u_i^{(t)}) \odot q_i^{(t)} + u_i^{(t)} \odot h_i^{(t)}$$
$$q_i^{(t)} = \tanh(x_{1,i}^{(t)} + r_i^{(t)} \odot h_{1,i}^{(t)})$$
$$u_i^{(t)} = \sigma(x_{2,i}^{(t)} + h_{2,i}^{(t)}), \quad r_i^{(t)} = \sigma(x_{3,i}^{(t)} + h_{3,i}^{(t)}) \qquad (4)$$

$$(h_{1,i}^{(t)}, h_{2,i}^{(t)}, h_{3,i}^{(t)})^T = \rho(W_h h_i^{(t)} + b_h)$$
$$(x_{1,i}^{(t)}, x_{2,i}^{(t)}, x_{3,i}^{(t)})^T = \rho(W_x x_i^{(t)} + b_x) \qquad (5)$$

$$x_i^{(t)} = \sigma(W_g h_i^{(t)} + b_g) \odot m_i^{(t)} \qquad (6)$$

$$m_i^{(t)} = \mathrm{mean}_{j \mid (j,i) \in \mathcal{E}} \; \Theta(F_{ji,\cdot}; W_e) \odot h_j^{(t)} \qquad (7)$$

$$h_i^{(1)} = z_i, \quad y_i = W_o (h_i^{(1)}, \ldots, h_i^{(T+1)})^T \qquad (8)$$

where $\odot$ is element-wise multiplication, $\sigma(\cdot)$ the sigmoid
function, and $W_\cdot$ and $b_\cdot$ are trainable parameters shared among
all GRUs. Equation 4 lists the standard GRU rules [8] with
its update gate $u_i^{(t)}$ and reset gate $r_i^{(t)}$. To improve stability
during training, in Equation 5 we apply Layer Normalization
[4] defined as $\rho(a) := (a - \mathrm{mean}(a)) / (\mathrm{std}(a) + \epsilon)$ separately
to linearly transformed input $x_i^{(t)}$ and transformed
hidden state $h_i^{(t)}$, with $\epsilon$ being a small constant. Finally, the
model includes three interesting extensions in Equations 6 to 8,
which we detail below.
Input Gating. We argue that a GRU should possess the
ability to down-weight (parts of) an input vector based on
its hidden state. For example, a GRU might learn to ignore its
context if its class state is highly certain or to direct its
attention to only specific feature channels. Equation 6 achieves
this by gating message $m_i^{(t)}$ by the hidden state before using
it as input $x_i^{(t)}$.
Edge-Conditioned Convolution. ECC plays a crucial
role in our model as it can dynamically generate filtering
weights for any value of continuous attributes $F_{ji,\cdot}$
by processing them with a multi-layer perceptron $\Theta$. In
the original formulation [45] (ECC-MV), $\Theta$ regresses
a weight matrix to perform matrix-vector multiplication
$\Theta(F_{ji,\cdot}; W_e) \, h_j^{(t)}$ for each edge. In this work, we propose
a lightweight variant with lower memory requirements and
fewer parameters, which is beneficial for datasets with few
but large point clouds. Specifically, we regress only an
edge-specific weight vector and perform element-wise
multiplication as in Equation 7 (ECC-VV). Channel mixing,
albeit in an edge-unspecific fashion, is postponed to
Equation 5. Finally, let us remark that $\Theta$ is shared over time
iterations and that self-loops as proposed in [45] are not
necessary due to the existence of hidden states in GRUs.
State Concatenation. Inspired by DenseNet [17], we
concatenate hidden states over all time steps and linearly
transform them to produce segmentation logits $y_i$ in
Equation 8. This allows the final classification to exploit the
dynamics of hidden states due to the increasing receptive field.
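The propagation model with its three extensions (input gating, ECC-VV messages, state concatenation) can be sketched in NumPy as a forward pass. This is an illustrative sketch under assumptions rather than the authors' implementation: the Filter Generating Network $\Theta$ is reduced to a single linear layer, weights act on row vectors (`h @ W`), superpoints without incoming superedges receive a zero message, and the `params` dictionary layout and function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def layer_norm(a, eps=1e-5):
    """rho(a) = (a - mean(a)) / (std(a) + eps), applied to each row."""
    return (a - a.mean(axis=1, keepdims=True)) / (a.std(axis=1, keepdims=True) + eps)

def propagate(z, edges, F, params, T):
    """Run T GRU iterations over the SPG and return per-superpoint logits.

    z: (n, d) embeddings; edges: (m, 2) superedges stored as [j, i] (source, target);
    F: (m, d_f) superedge features; params: dict of weight arrays (assumed layout).
    """
    n, d = z.shape
    src, dst = edges[:, 0], edges[:, 1]
    theta = F @ params["We"]  # assumption: Theta is one linear layer, output (m, d)
    indeg = np.maximum(np.bincount(dst, minlength=n), 1)[:, None]
    h, states = z, [z]        # initial hidden state h^(1) = z
    for _ in range(T):
        # ECC-VV message: mean over incoming edges of Theta(F_ji) * h_j
        m = np.zeros((n, d))
        np.add.at(m, dst, theta * h[src])
        m /= indeg
        # input gating of the message by the hidden state
        x = sigmoid(h @ params["Wg"] + params["bg"]) * m
        # layer-normalized linear maps, each split into three d-sized chunks
        x1, x2, x3 = np.split(layer_norm(x @ params["Wx"] + params["bx"]), 3, axis=1)
        h1, h2, h3 = np.split(layer_norm(h @ params["Wh"] + params["bh"]), 3, axis=1)
        # GRU update with update gate u and reset gate r
        u, r = sigmoid(x2 + h2), sigmoid(x3 + h3)
        q = np.tanh(x1 + r * h1)
        h = (1.0 - u) * q + u * h
        states.append(h)
    # state concatenation: all T + 1 hidden states mapped linearly to logits
    return np.concatenate(states, axis=1) @ params["Wo"]
```

In this layout `We` has shape `(d_f, d)`, `Wg` is `(d, d)`, `Wx` and `Wh` are `(d, 3d)`, and `Wo` is `((T + 1) d, n_classes)`. Because each edge only produces a `d`-sized vector instead of a `d × d` matrix, the per-edge memory grows linearly rather than quadratically in `d`, which is the motivation given for the ECC-VV variant.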
Relation to CRFs. In image segmentation, post-processing
of convolutional outputs using Conditional
Random Fields (CRFs) is widely popular. Several inference
algorithms can be formulated as (recurrent) network
layers amenable to end-to-end learning [51,43], possibly
with general pairwise potentials [33,7,28]. While our
method of information propagation shares both these
characteristics, our GRUs operate on a $d_z$-dimensional
intermediate feature space, which is richer and less constrained
than low-dimensional vectors representing beliefs
over classes, as also discussed in [11]. Such enhanced
access to information is motivated by the desire to learn
a powerful representation of context, which goes beyond
belief compatibilities, as well as the desire to be able to
discriminate our often relatively weak unaries (superpoint
embeddings). We empirically evaluate these claims in
Section 4.3.
3.5. Further Details
Adjacency Graphs. In this paper, we use two different
adjacency graphs between points of the input clouds: $G_{nn}$ in
Section 3.1 and $G_{vor}$ in Section 3.2. Indeed, different definitions
of adjacency have different advantages. Voronoi adjacency
is more suited to capture long-range relationships between
superpoints, which is beneficial for the SPG. Nearest
neighbors adjacency tends not to connect objects separated
by a small gap. This is desirable for the global energy but
tends to produce an SPG with many small connected components,
decreasing embedding quality. Fixed radius adjacency
should be avoided in general as it handles the variable
density of LiDAR scans poorly.
Training. While the geometric partitioning step is unsu-
pervised, superpoint embedding and contextual segmenta-
tion are trained jointly in a supervised way with cross en-
tropy loss. Superpoints are assumed to be semantically ho-
mogeneous and, consequently, assigned a hard ground truth
label corresponding to the majority label among their con-
tained points. We also considered using soft labels cor-
responding to normalized histograms of point labels and
training with Kullback-Leibler [25] divergence loss. It per-
formed slightly worse in our initial experiments, though.
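The two labeling schemes mentioned above can be sketched as follows: a hard label taken as the majority class of the contained points, and a soft label given by the normalized histogram of point labels, together with a Kullback-Leibler divergence against predicted logits. The function names are hypothetical, and ties in the majority vote are broken toward the lowest class index, which is an assumption.

```python
import numpy as np

def superpoint_labels(point_labels, n_classes):
    """Hard (majority) label and soft (normalized histogram) label of one superpoint."""
    hist = np.bincount(point_labels, minlength=n_classes).astype(float)
    return int(hist.argmax()), hist / hist.sum()

def kl_divergence(target, logits):
    """KL(target || softmax(logits)), skipping classes with zero target probability."""
    log_q = logits - logits.max()                 # shift for numerical stability
    log_q = log_q - np.log(np.exp(log_q).sum())   # log-softmax
    mask = target > 0
    return float(np.sum(target[mask] * (np.log(target[mask]) - log_q[mask])))
```

For a superpoint that is truly homogeneous the soft label is one-hot and both schemes coincide; they only differ on superpoints that straddle several classes.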
Naive training on large SPGs may approach memory
limits of current GPUs. We circumvent this issue by ran-
domly subsampling the sets of superpoints at each itera-
tion and training on induced subgraphs, i.e. graphs com-
posed of subsets of nodes and the original edges connecting
them. Specifically, graph neighborhoods of order 3 are sampled
to select at most 512 superpoints per SPG with more
than $n_{minp}$ points, as smaller superpoints are not embedded.
Note that as the induced graph is a union of small
neighborhoods, relationships over many hops may still be
formed and learned. This strategy also doubles as data aug-
mentation and a strong regularization, together with ran-
domized sampling of point clouds described in Section 3.3.
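One way to realize this subsampling is sketched below: random seed superpoints are drawn, their neighborhoods of order 3 are collected by breadth-first expansion, and the union is capped at 512 embeddable superpoints before taking the induced subgraph. The exact sampling procedure is not specified in the text, so the seed selection, the handling of the cap, and the fact that expansion may pass through small superpoints are all assumptions; the function name is hypothetical.

```python
import numpy as np

def sample_induced_subgraph(n, edges, valid, rng, order=3, max_nodes=512):
    """Union of random order-`order` neighborhoods, capped at max_nodes valid nodes.

    n: number of superpoints; edges: (m, 2) directed superedges;
    valid: (n,) bool mask of superpoints large enough to be embedded.
    Returns (sorted kept node ids, edges of the induced subgraph).
    """
    neighbors = [[] for _ in range(n)]
    for a, b in edges:
        neighbors[a].append(int(b))
    kept = set()
    for seed in rng.permutation(np.flatnonzero(valid)):
        if len(kept) >= max_nodes:
            break
        frontier, ball = {int(seed)}, {int(seed)}
        for _ in range(order):  # breadth-first expansion, one hop per round
            frontier = {v for u in frontier for v in neighbors[u]} - ball
            ball |= frontier
        for v in sorted(ball):
            if valid[v] and len(kept) < max_nodes:
                kept.add(v)
    kept = np.array(sorted(kept), dtype=int)
    # induced subgraph: keep original edges whose two endpoints were both selected
    mask = np.isin(edges[:, 0], kept) & np.isin(edges[:, 1], kept)
    return kept, edges[mask]
```

Since neighborhoods of different seeds can overlap or touch, the resulting subgraph may contain paths longer than 3 hops, in line with the remark that relationships over many hops can still be learned.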
Additional data augmentation is performed by randomly
rotating superpoints around the vertical axis and jittering
point features by Gaussian noise N(0,0.01) truncated to
Testing. In modern deep learning frameworks, testing can
be made very memory-efficient by discarding layer activa-
tions as soon as the follow-up layers have been computed.
In practice, we were able to label full SPGs at once. To com-
pensate for randomness due to subsampling of point clouds
in PointNets, we average logits obtained over 10 runs with
different seeds.
4. Experiments
We evaluate our pipeline on the two currently largest
point cloud segmentation benchmarks, Semantic3D [14]
and Stanford Large-Scale 3D Indoor Spaces (S3DIS) [3], on
both of which we set the new state of the art. Furthermore,
we perform an ablation study of our pipeline in Section 4.3.
Even though the two data sets are quite different in nature
(large outdoor scenes for Semantic3D, smaller indoor scan-
ning for S3DIS), we use nearly the same model for both.
The deep model is rather compact and 6 GB of GPU memory
is enough for both testing and training. We refer to
Appendix A for precise details on hyperparameter selection,
architecture configuration, and training procedure.
Performance is evaluated using three metrics: per-class
intersection over union (IoU), per-class accuracy (Acc), and
overall accuracy (OA), defined as the proportion of cor-
rectly classified points. We stress that the metrics are com-
puted on the original point clouds, not on superpoints.
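The three metrics can be computed from a point-level confusion matrix, as in the sketch below. Per-class accuracy is taken here to be the fraction of points of a class that are correctly labeled (recall), which is an assumption about the exact definition; classes absent from both ground truth and prediction yield NaN and would need `np.nanmean` when averaging. The function name is hypothetical.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred, n_classes):
    """Per-class IoU, per-class accuracy and overall accuracy from point labels."""
    cm = np.bincount(y_true * n_classes + y_pred, minlength=n_classes ** 2)
    cm = cm.reshape(n_classes, n_classes).astype(float)  # rows: truth, cols: prediction
    tp = np.diag(cm)
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / (cm.sum(axis=0) + cm.sum(axis=1) - tp)
        acc = tp / cm.sum(axis=1)
    return iou, acc, tp.sum() / cm.sum()
```

To evaluate on the original cloud rather than on superpoints, each point would first inherit the predicted label of its superpoint, for example `y_pred = superpoint_pred[point_to_superpoint]` with a hypothetical index array mapping points to superpoints.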
4.1. Semantic3D
Semantic3D [14] is the largest available LiDAR dataset
with over 3 billion points from a variety of urban and rural
scenes. Each point has RGB and intensity values (the latter
of which we do not use). The dataset consists of 15 training
scans and 15 test scans with withheld labels. We also eval-
uate on the reduced set of 4 subsampled scans, as common
in past work.
In Table 2, we provide the results of our algorithm compared
to other recent state-of-the-art algorithms, and in Figure 3
we provide qualitative results of our framework. Our
framework improves significantly on the state of the art of
semantic segmentation for this data set, i.e. by nearly 12
mIoU points on the reduced set and by nearly 9 mIoU points
on the full set. In particular, we observe a steep gain on the
"artefact" class. This can be explained by the ability of the
partitioning algorithm to detect artifacts due to their singular
shape, while they are hard to capture using snapshots, as
suggested by [5]. Furthermore, these small objects are often
merged with the road when performing spatial regularization.
4.2. Stanford Large-Scale 3D Indoor Spaces
The S3DIS dataset [3] consists of 3D RGB point clouds
of six floors from three different buildings split into indi-
vidual rooms. We evaluate our framework following two
dominant strategies found in previous works. As advocated
by [39,10], we perform 6-fold cross validation with micro-
averaging, i.e. computing metrics once over the merged pre-
dictions of all test folds. Following [46], we also report
the performance on the fifth fold only (Area 5), correspond-
ing to a building not present in the other folds. Since some
classes in this data set cannot be partitioned purely using
geometric features (such as boards or paintings on walls), we
concatenate the color information $o$ to the geometric features
$f$ for the partitioning step.
The quantitative results are displayed in Table 3, with
qualitative results in Figure 3 and in Appendix D. S3DIS is
a difficult dataset with hard-to-retrieve classes such as white
boards on white walls and columns within walls. From the
quantitative results we can see that our framework performs
better than other methods on average. Notably, doors are
correctly classified at a higher rate than with other approaches,
as long as they are open, as illustrated in Figure 3.
Indeed, doors are geometrically similar to walls, but their
position with respect to the door frame allows our network
to retrieve them correctly. On the other hand, the partition
merges white boards with walls, depriving the network of
the opportunity to even learn to classify them: the IoU of
boards for theoretical perfect classification of superpoints
(as in Section 4.3) is only 51.3.
Computation Time. In Table 4, we report computation
time over the different steps of our pipeline for the infer-
ence on Area 5 measured on a 4 GHz CPU and GTX 1080
Ti GPU. While the bulk of time is spent on the CPU for
partitioning and SPG computation, we show that voxeliza-
tion as pre-processing, detailed in Appendix A, leads to a
significant speed-up as well as improved accuracy.
Method          OA    mIoU  man-made  natural  high        low         buildings  hard-  scanning  cars
                            terrain   terrain  vegetation  vegetation             scape  artefact
reduced test set: 78 699 329 points
TMLC-MSR [15]   86.2  54.2  89.8      74.5     53.7        26.8        88.8       18.9   36.4      44.7
DeePr3SS [29]   88.9  58.5  85.6      83.2     74.2        32.4        89.7       18.5   25.1      59.2
SnapNet [5]     88.6  59.1  82.0      77.3     79.7        22.9        91.1       18.4   37.3      64.4
SegCloud [46]   88.1  61.3  83.9      66.0     86.0        40.5        91.1       30.9   27.5      64.3
SPG (Ours)      94.0  73.2  97.4      92.6     87.9        44.0        93.2       31.0   63.5      76.2
full test set: 2 091 952 018 points
TMLC-MS [15]    85.0  49.4  91.1      69.5     32.8        21.6        87.6       25.9   11.3      55.3
SnapNet [5]     91.0  67.4  89.6      79.5     74.8        56.1        90.9       36.5   34.3      77.2
SPG (Ours)      92.9  76.2  91.5      75.6     78.3        71.7        94.4       56.8   52.9      88.4
Table 2: Intersection over union metric for the different classes of the Semantic3D dataset. OA is the global accuracy, while
mIoU refers to the unweighted average of IoU of each class.
(a) RGB point cloud (b) Geometric partitioning (c) Prediction (d) Ground truth
Figure 3: Example visualizations on both datasets. The colors in (b) are chosen randomly for each element of the partition.
4.3. Ablation Studies
To better understand the influence of various design
choices made in our framework, we compare it to several
baselines and perform an ablation study. Due to the lack of
public ground truth for test sets of Semantic3D, we evaluate
on S3DIS with 6-fold cross validation and show comparison
of different models to our Best model in Table 5.
Performance Limits. The contribution of contextual
segmentation can be bounded both from below and above.
The lower bound (Unary) is estimated by training PointNet
with dz= 13 but otherwise the same architecture, denoted
as PointNet13, to directly predict class logits, without SPG
and GRUs. The upper bound (Perfect) corresponds to as-
signing each superpoint its ground truth label, and thus sets
the limit of performance due to the geometric partition. We
can see that contextual segmentation is able to win roughly
22 mIoU points over unaries, confirming its importance.
Nevertheless, the learned model still has room of up to 26
Method OA mAcc mIoU ceiling floor wall beam column window door chair table bookcase sofa board clutter
A5 PointNet [39] 48.98 41.09 88.80 97.33 69.80 0.05 3.92 46.26 10.76 52.61 58.93 40.28 5.85 26.38 33.22
A5 SEGCloud [46] 57.35 48.92 90.06 96.05 69.86 0.00 18.37 38.35 23.12 75.89 70.40 58.42 40.88 12.96 41.60
A5 SPG (Ours) 86.38 66.50 58.04 89.35 96.87 78.12 0.0 42.81 48.93 61.58 84.66 75.41 69.84 52.60 2.10 52.22
PointNet [39] in [10] 78.5 66.2 47.6 88.0 88.7 69.3 42.4 23.1 47.5 51.6 42.0 54.1 38.2 9.6 29.4 35.2
Engelmann et al. [10] 81.1 66.4 49.7 90.3 92.1 67.9 44.7 24.2 52.3 51.2 47.4 58.1 39.0 6.9 30.0 41.9
SPG (Ours) 85.5 73.0 62.1 89.9 95.1 76.4 62.8 47.1 55.3 68.4 73.5 69.2 63.2 45.9 8.7 52.9
Table 3: Results on the S3DIS dataset on fold “Area 5” (top) and micro-averaged over all 6 folds (bottom). Intersection over
union is shown split per class.
Step                  Full cloud   2 cm      3 cm      4 cm
Voxelization          0            40        24        16
Feature computation   439          194       88        43
Geometric partition   3428         1013      447       238
SPG computation       3800         958       436       252
Inference             10 × 24      10 × 11   10 × 6    10 × 5
Total                 7907         2315      1055      599
mIoU 6-fold           54.1         60.2      62.1      57.1
Table 4: Computation time in seconds for the inference on
S3DIS Area 5 (68 rooms, 78 649 682 points) for different
voxel sizes.
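As a quick check of the arithmetic in Table 4, the per-step timings can be summed per column. The snippet reads each "10 × t" inference entry as a product that enters the total, an interpretation that the reported totals bear out. The variable names are illustrative.

```python
# Timings in seconds from Table 4, per column: full cloud, 2 cm, 3 cm, 4 cm.
columns = ["full", "2cm", "3cm", "4cm"]
steps = {
    "voxelization":        [0, 40, 24, 16],
    "feature computation": [439, 194, 88, 43],
    "geometric partition": [3428, 1013, 447, 238],
    "SPG computation":     [3800, 958, 436, 252],
    # Inference is listed as "10 x t" in the table; the product enters the total.
    "inference":           [10 * 24, 10 * 11, 10 * 6, 10 * 5],
}
totals = {col: sum(times[i] for times in steps.values())
          for i, col in enumerate(columns)}

# Speed-up of each voxel size relative to processing the full cloud.
speedup = {col: totals["full"] / totals[col] for col in columns}
```

With these numbers the 3 cm setting, which also gives the best 6-fold mIoU, is roughly 7.5 times faster than working on the full cloud.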
mIoU points for improvement, while about 12 mIoU points
are forfeited to the semantic inhomogeneity of superpoints.
CRFs. We compare the effect of our GRU+ECC-based
network to CRF-based regularization. As a baseline
(iCRF), we post-process Unary outputs by CRF inference
over SPG connectivity with a scalar transition matrix, as
described by [13]. Next (CRF-ECC), we adapt the CRF-RNN
framework of Zheng et al. [51] to general graphs with
edge-conditioned convolutions (see Appendix B for details) and
train it with PointNet13 end-to-end. Finally (GRU13), we
modify Best to use PointNet13. We observe that iCRF
barely improves accuracy (+1 mIoU over Unary), which is to be
expected, since the partitioning step already encourages
spatial regularity. CRF-ECC does better (+15 mIoU) due
to end-to-end learning and use of edge attributes, though it
is still below GRU13 (+18 mIoU), which performs more
complex operations and does not enforce normalization of
the embedding. Nevertheless, the 32 channels used in Best
instead of the 13 used in GRU13 provide even more
freedom for feature representation (+22 mIoU).
Ablation. We explore the advantages of several design
choices by individually removing them from Best in order
to compare the framework’s performance with and without
them. In NoInputGate we remove input gating in GRU; in
NoConcat we only consider the last hidden state in GRU
for output as $y_i = W_o h_i^{(T+1)}$ instead of concatenation of
all steps; in NoEdgeFeat we perform homogeneous
regularization by setting all superedge features to scalar 1; and
Model          mAcc    mIoU
Best           73.0    62.1
Perfect        92.7    88.2
Unary          50.8    40.0
iCRF           51.5    40.7
CRF-ECC        65.6    55.3
GRU13          69.1    58.5
NoInputGate    68.6    57.5
NoConcat       69.3    57.7
NoEdgeFeat     50.1    39.9
ECC-VV         70.2    59.4
Table 5: Ablation study and comparison to various baselines
on S3DIS (6-fold cross validation).
in ECC-VV we use the proposed lightweight formulation
of ECC. We can see that each of the first two choices
accounts for about 5 mIoU points. Next, without edge
features our method falls back even below iCRF to the level
of Unary, which validates their design and overall motivation
for SPG. ECC-VV decreases the performance on the
S3DIS dataset by 3 mIoU points, whereas it has improved
the performance on Semantic3D by 2 mIoU. Finally, we
refer the reader to Appendix C for further ablations.
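The rounded gains and losses quoted in the three paragraphs above follow directly from the mIoU column of Table 5. The short script below recomputes them. The dictionary keys are just labels for the table rows, and the grouping into "gain over Unary" and "drop from Best" is how this sketch organizes the comparison.

```python
# mIoU values copied from Table 5 (S3DIS, 6-fold cross validation).
miou = {
    "Best": 62.1, "Perfect": 88.2, "Unary": 40.0, "iCRF": 40.7,
    "CRF-ECC": 55.3, "GRU13": 58.5, "NoInputGate": 57.5,
    "NoConcat": 57.7, "NoEdgeFeat": 39.9, "ECC-VV": 59.4,
}

# Gain of each contextual model over the Unary lower bound.
gain_over_unary = {m: round(miou[m] - miou["Unary"], 1)
                   for m in ("iCRF", "CRF-ECC", "GRU13", "Best")}

# Loss incurred by removing one design choice from Best.
drop_from_best = {m: round(miou["Best"] - miou[m], 1)
                  for m in ("NoInputGate", "NoConcat", "NoEdgeFeat", "ECC-VV")}

# Headroom left to the Perfect upper bound, and points lost to the partition.
headroom = round(miou["Perfect"] - miou["Best"], 1)
partition_loss = round(100.0 - miou["Perfect"], 1)
```

The values come out as 0.7, 15.3, 18.5 and 22.1 for the gains, 4.6, 4.4, 22.2 and 2.7 for the drops, and 26.1 and 11.8 for the two bounds, matching the rounded figures in the text.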
5. Conclusion
We presented a deep learning framework for perform-
ing semantic segmentation of large point clouds based on
a partition into simple shapes. We showed that SPGs al-
low us to use effective deep learning tools, which would not
be able to handle the data volume otherwise. Our method
significantly improves on the state of the art on two publicly
available datasets. Our experimental analysis suggested that
future improvements can be made in both partitioning and
learning deep contextual classifiers.
The source code in PyTorch as well as the trained models
are available at
References
[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and
S. Süsstrunk. SLIC superpixels compared to state-of-the-art
superpixel methods. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 34(11):2274–2282, 2012. 1
[2] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena.
Contextually guided semantic labeling and search for three-
dimensional point clouds. The International Journal of
Robotics Research, 32(1):19–34, 2013. 2
[3] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis,
M. Fischer, and S. Savarese. 3D semantic parsing of large-
scale indoor spaces. In CVPR, 2016. 1,6
[4] L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization.
CoRR, abs/1607.06450, 2016. 5
[5] A. Boulch, B. L. Saux, and N. Audebert. Unstructured
point cloud semantic labeling using deep segmentation net-
works. In Eurographics Workshop on 3D Object Retrieval,
volume 2, 2017. 1,6,7
[6] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate en-
ergy minimization via graph cuts. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, 23(11):1222–1239,
2001. 3
[7] S. Chandra and I. Kokkinos. Fast, exact and multi-scale in-
ference for semantic image segmentation with deep gaussian
crfs. In ECCV, 2016. 5
[8] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau,
F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase
representations using RNN encoder–decoder for statistical
machine translation. In EMNLP, 2014. 1,4,5
[9] J. Demantké, C. Mallet, N. David, and B. Vallet.
Dimensionality based scale selection in 3D lidar point clouds.
International Archives of the Photogrammetry, Remote Sensing
and Spatial Information Sciences, XXXVIII-5/W12:97–102,
2011. 3
[10] F. Engelmann, T. Kontogianni, A. Hermans, and B. Leibe.
Exploring spatial context for 3d semantic segmentation of
point clouds. In ICCV, 3DRMS Workshop, 2017. 1,2,6,8
[11] R. Gadde, V. Jampani, M. Kiefel, D. Kappler, and P. Gehler.
Superpixel convolutional networks using bilateral incep-
tions. In ECCV, 2016. 5
[12] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E.
Dahl. Neural message passing for quantum chemistry. In
ICML, pages 1263–1272, 2017. 2
[13] S. Guinard and L. Landrieu. Weakly supervised
segmentation-aided classification of urban scenes from 3d
LiDAR point clouds. In ISPRS 2017, 2017. 2,3,8
[14] T. Hackel, N. Savinov, L. Ladicky, J. D. Wegner,
K. Schindler, and M. Pollefeys. Semantic3D.net: A new
large-scale point cloud classification benchmark. arXiv
preprint arXiv:1704.03847, 2017. 1,6
[15] T. Hackel, J. D. Wegner, and K. Schindler. Fast semantic
segmentation of 3D point clouds with strongly varying den-
sity. ISPRS Annals of Photogrammetry, Remote Sensing &
Spatial Information Sciences, 3(3), 2016. 7
[16] H. Hu, D. Munoz, J. A. Bagnell, and M. Hebert. Efficient
3-d scene analysis from streaming data. In ICRA, 2013. 2
[17] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected
convolutional networks. In CVPR, 2017. 5
[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift. In
ICML, 2015. 10
[19] M. Jaderberg, K. Simonyan, A. Zisserman, and
K. Kavukcuoglu. Spatial transformer networks. In
NIPS, pages 2017–2025. 2015. 4
[20] J. W. Jaromczyk and G. T. Toussaint. Relative neighbor-
hood graphs and their relatives. Proceedings of the IEEE,
80(9):1502–1517, 1992. 4
[21] B.-S. Kim, P. Kohli, and S. Savarese. 3D scene understand-
ing by voxel-CRF. In ICCV, 2013. 2
[22] D. P. Kingma and J. Ba. Adam: A method for stochastic
optimization. In ICLR, 2015. 11
[23] R. Klokov and V. S. Lempitsky. Escape from cells: Deep
Kd-networks for the recognition of 3D point cloud models.
CoRR, abs/1704.01222, 2017. 2
[24] H. S. Koppula, A. Anand, T. Joachims, and A. Saxena. Se-
mantic labeling of 3D point clouds for indoor scenes. In
NIPS, pages 244–252, 2011. 2
[25] S. Kullback and R. A. Leibler. On information and suffi-
ciency. The Annals of Mathematical Statistics, 22 (1):79–86,
1951. 6
[26] L. Landrieu and G. Obozinski. Cut pursuit: Fast algorithms
to learn piecewise constant functions on general weighted
graphs. SIAM Journal on Imaging Sciences, 10(4):1724–
1766, 2017. 3
[27] L. Landrieu, H. Raguet, B. Vallet, C. Mallet, and M. Wein-
mann. A structured regularization framework for spatially
smoothing semantic labelings of 3D point clouds. ISPRS
Journal of Photogrammetry and Remote Sensing, 132:102 –
118, 2017. 2
[28] M. Larsson, F. Kahl, S. Zheng, A. Arnab, P. H. S. Torr, and
R. I. Hartley. Learning arbitrary potentials in CRFs with gra-
dient descent. CoRR, abs/1701.06805, 2017. 5
[29] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan,
and M. Felsberg. Deep projective 3D semantic segmentation.
arXiv preprint arXiv:1705.03428, 2017. 7
[30] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel. Gated
graph sequence neural networks. In ICLR, 2016. 4
[31] X. Liang, L. Lin, X. Shen, J. Feng, S. Yan, and E. P. Xing. In-
terpretable structure-evolving LSTM. In CVPR, pages 2175–
2184, 2017. 2
[32] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan. Semantic
object parsing with graph LSTM. In ECCV, pages 125–143,
2016. 2
[33] G. Lin, C. Shen, A. van den Hengel, and I. D. Reid. Efficient
piecewise training of deep structured models for semantic
segmentation. In CVPR, 2016. 5
[34] Y. Lu and C. Rasmussen. Simplified Markov random fields
for efficient semantic labeling of 3d point clouds. In IROS,
pages 2690–2697, 2012. 2
[35] A. Martinovic, J. Knopp, H. Riemenschneider, and
L. Van Gool. 3D all the way: Semantic segmentation of
urban scenes from start to end in 3D. In CVPR, 2015. 2
[36] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda,
and M. M. Bronstein. Geometric deep learning on graphs
and manifolds using mixture model CNNs. In CVPR, pages
5425–5434, 2017. 2
[37] D. Munoz, J. A. Bagnell, N. Vandapel, and M. Hebert. Con-
textual classification with functional max-margin Markov
networks. In CVPR, 2009. 2
[38] J. Niemeyer, F. Rottensteiner, and U. Soergel. Contextual
classification of lidar data and building object detection in
urban areas. ISPRS Journal of Photogrammetry and Remote
Sensing, 87:152–165, 2014. 2
[39] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep
learning on point sets for 3D classification and segmentation.
In CVPR, 2017. 1,2,3,4,6,8,11
[40] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep
hierarchical feature learning on point sets in a metric space.
In NIPS, 2017. 1,2
[41] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3D graph
neural networks for RGBD semantic segmentation. In ICCV,
pages 5209–5218, 2017. 2
[42] G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning
deep 3D representations at high resolutions. In CVPR, 2017.
[43] A. G. Schwing and R. Urtasun. Fully connected deep struc-
tured networks. CoRR, abs/1503.02351, 2015. 5
[44] R. Shapovalov, D. Vetrov, and P. Kohli. Spatial inference
machines. In CVPR, 2013. 2
[45] M. Simonovsky and N. Komodakis. Dynamic edge-
conditioned filters in convolutional neural networks on
graphs. In CVPR, 2017. 1,2,4,5,11
[46] L. P. Tchapmi, C. B. Choy, I. Armeni, J. Gwak, and
S. Savarese. SEGCloud: Semantic segmentation of 3D point
clouds. arXiv preprint arXiv:1710.07563, 2017. 1,2,6,7,8,
[47] M. Weinmann, S. Hinz, and M. Weinmann. A hybrid seman-
tic point cloud classification-segmentation framework based
on geometric features and semantic rules. PFG–Journal of
Photogrammetry, Remote Sensing and Geoinformation Sci-
ence, 85(3):183–194, 2017. 2
[48] M. Weinmann, A. Schmidt, C. Mallet, S. Hinz, F. Rotten-
steiner, and B. Jutzi. Contextual classification of point cloud
data by exploiting individual 3D neighborhoods. ISPRS An-
nals of the Photogrammetry, Remote Sensing and Spatial In-
formation Sciences, II-3/W4:271–278, 2015. 2
[49] D. Wolf, J. Prankl, and M. Vincze. Fast semantic segmen-
tation of 3D point clouds using a dense CRF with learned
parameters. In ICRA, 2015. 2
[50] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph
generation by iterative message passing. In CVPR, pages
3097–3106, 2017. 2
[51] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet,
Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional ran-
dom fields as recurrent neural networks. In ICCV, 2015. 2,
A. Model Details
Voxelization. We pre-process input point clouds with
voxelization subsampling by computing per-voxel mean
positions and observations over a regular 3D grid (5 cm bins
for Semantic3D and 3 cm bins for the S3DIS dataset). The
resulting semantic segmentation is interpolated back to the
original point cloud in a nearest neighbor fashion.
Voxelization helps decrease the computation time and memory
requirement, and also improves the accuracy of the semantic
segmentation by acting as a form of geometric and radiometric
denoising (Table 4 in the main paper). The quality
of further steps is practically not affected, as superpoints are
usually strongly subsampled for embedding during learning
and inference anyway (Section 3.3 in the main paper).
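The voxelization and back-interpolation steps can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, voxels are indexed by flooring coordinates divided by the bin size, and the nearest-neighbor search is brute force where a k-d tree would be used at scale.

```python
import math


def voxelize(points, values, voxel_size):
    """Average positions and per-point observations inside each occupied voxel.

    points: list of (x, y, z); values: list of equally long observation tuples
    (e.g. color). Returns (voxel_points, voxel_values, voxel_of_point).
    """
    index_of_key = {}
    sums = []       # running sums of position + observation per voxel
    counts = []
    voxel_of_point = []
    for p, v in zip(points, values):
        key = tuple(math.floor(c / voxel_size) for c in p)
        if key not in index_of_key:
            index_of_key[key] = len(sums)
            sums.append([0.0] * (len(p) + len(v)))
            counts.append(0)
        k = index_of_key[key]
        for d, x in enumerate(list(p) + list(v)):
            sums[k][d] += x
        counts[k] += 1
        voxel_of_point.append(k)
    means = [[s / n for s in row] for row, n in zip(sums, counts)]
    voxel_points = [tuple(m[:3]) for m in means]
    voxel_values = [tuple(m[3:]) for m in means]
    return voxel_points, voxel_values, voxel_of_point


def interpolate_labels(points, voxel_points, voxel_labels):
    """Give each original point the label of its nearest voxel centroid."""
    labels = []
    for p in points:
        nearest = min(range(len(voxel_points)),
                      key=lambda k: math.dist(p, voxel_points[k]))
        labels.append(voxel_labels[nearest])
    return labels


# Toy cloud in meters with a 3 cm grid (the S3DIS setting quoted above).
points = [(0.001, 0.002, 0.0), (0.010, 0.020, 0.010), (0.500, 0.500, 0.500)]
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
vox_pts, vox_cols, assignment = voxelize(points, colors, voxel_size=0.03)
labels = interpolate_labels(points, vox_pts, ["floor", "chair"])
```

In the toy cloud the first two points fall into the same 3 cm bin and are averaged, which is the denoising effect the paragraph refers to. The class names passed to `interpolate_labels` are placeholders for per-voxel predictions.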
Geometric Partition. We set regularization strength $\mu = 0.8$
for Semantic3D and $\mu = 0.03$ for S3DIS, which strikes
a balance between semantic homogeneity of superpoints
and the potential for their successful discrimination (S3DIS
is composed of smaller semantic parts than Semantic3D).
In addition to five geometric features $f$ (linearity, planarity,
scattering, verticality, elevation), we use color information
$o$ for clustering in S3DIS due to some classes being
geometrically indistinguishable, such as boards or doors.
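To give a feel for the first three geometric features, the sketch below computes them from the eigenvalues of a neighborhood's covariance matrix. The definitions used here (linearity $(\lambda_1 - \lambda_2)/\lambda_1$, planarity $(\lambda_2 - \lambda_3)/\lambda_1$, scattering $\lambda_3/\lambda_1$ with $\lambda_1 \ge \lambda_2 \ge \lambda_3$) are a common eigenvalue-based formulation and are an assumption: this section does not spell out the exact formulas, and verticality and elevation are left out.

```python
import numpy as np


def dimensionality_features(neighborhood):
    """Linearity, planarity and scattering of an (n, 3) array of neighbor positions.

    Assumed definitions, with l1 >= l2 >= l3 the covariance eigenvalues:
    linearity = (l1 - l2) / l1, planarity = (l2 - l3) / l1, scattering = l3 / l1.
    """
    centered = neighborhood - neighborhood.mean(axis=0)
    cov = centered.T @ centered / len(neighborhood)
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    l3, l2, l1 = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    if l1 == 0.0:  # all neighbors coincide; no dominant direction
        return 0.0, 0.0, 0.0
    return (l1 - l2) / l1, (l2 - l3) / l1, l3 / l1


rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=(200, 1))
line = np.hstack([t, np.zeros((200, 2))])                                   # points on the x axis
plane = np.hstack([rng.uniform(-1, 1, size=(200, 2)), np.zeros((200, 1))])  # points in z = 0

lin_line, pla_line, sca_line = dimensionality_features(line)
lin_plane, pla_plane, sca_plane = dimensionality_features(plane)
```

Under these definitions the three values sum to one, so a neighborhood sampled from a line scores high on linearity and one sampled from a flat patch scores high on planarity.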
Figure 4: The PointNet embedding $n_p$ $d_p$-dimensional samples
of a superpoint to a $d_z$-dimensional vector.
PointNet. We use a simplified shallow and narrow PointNet
architecture with just a single Spatial Transformer Network
(STN), see Figure 4. We set $n_p = 128$ and $n_{\mathrm{minp}} = 40$.
Input points are processed by a sequence of MLPs
(widths 64, 64, 128, 128, 256) and max pooled to a single
vector of 256 features. The scalar metric diameter is
appended and the result further processed by a sequence
of MLPs (widths 256, 64, $d_z = 32$). A residual matrix
$\Phi \in \mathbb{R}^{2 \times 2}$ is regressed by STN and $(I + \Phi)$ is used to transform
XY coordinates of input points as the first step. The
architecture of STN is a "small PointNet" with 3 MLPs (widths
64, 64, 128) before max pooling and 3 MLPs after (widths
128, 64, 4). Batch Normalization [18] and ReLUs are used
everywhere. Input points have $d_p = 11$ dimensional features
for Semantic3D (position $p_i$, color $o_i$, geometric features
$f_i$), with 3 additional ones for S3DIS (room-normalized
spatial coordinates, as in past work [39]).
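The layer widths above can be traced with a small NumPy forward pass. This is only a shape-level sketch with untrained random weights. It omits Batch Normalization, assumes the first two feature columns are the XY position, feeds the full feature vector to the STN, and leaves the last STN layer linear so that $\Phi$ can take either sign. None of these choices is confirmed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)


def mlp(x, widths):
    """Apply a stack of randomly initialized linear + ReLU layers row-wise.

    Batch Normalization is omitted and the weights are untrained: this only
    traces how the feature dimension changes through the network.
    """
    for w_out in widths:
        weight = rng.normal(scale=x.shape[-1] ** -0.5, size=(x.shape[-1], w_out))
        x = np.maximum(x @ weight, 0.0)
    return x


def embed_superpoint(points, diameter, d_z=32):
    """Map (n_p, d_p) point features of one superpoint to a d_z-vector."""
    # STN: a small PointNet regressing a 2x2 residual Phi for the XY coordinates.
    stn = mlp(points, [64, 64, 128]).max(axis=0)
    phi = mlp(stn[None, :], [128, 64])[0] @ rng.normal(scale=0.01, size=(64, 4))
    transformed = points.copy()
    transformed[:, :2] = points[:, :2] @ (np.eye(2) + phi.reshape(2, 2))

    # Shared per-point MLPs followed by max pooling over the points.
    pooled = mlp(transformed, [64, 64, 128, 128, 256]).max(axis=0)   # (256,)
    with_diameter = np.append(pooled, diameter)                      # (257,)
    return mlp(with_diameter[None, :], [256, 64, d_z])[0]            # (d_z,)


n_p, d_p = 128, 11   # subsampled points per superpoint, Semantic3D feature size
z = embed_superpoint(rng.normal(size=(n_p, d_p)), diameter=1.5)
```

The intermediate vector of length 257 is the 256 pooled features plus the appended diameter. Because of the max pooling, the embedding does not depend on the order of the input points.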
Segmentation Network. We use embedding dimensionality
$d_z = 32$ and $T = 10$ iterations. ECC-VV is used
for Semantic3D (there are only 15 point clouds even though
the amount of points is large), while ECC-MV is used for
S3DIS (large number of point clouds). Filter-generating
network $\Theta$ is an MLP with 4 layers (widths 32, 128, 64, and
32 or $32^2$ for ECC-VV or ECC-MV, respectively) with ReLUs. Batch
Normalization is used only after the third parametric layer.
No bias is used in the last layer. Superedges have $d_f = 13$
dimensional features, normalized by mean subtraction and
scaling to unit variance based on the whole training set.
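The superedge feature normalization amounts to a standard per-dimension standardization fitted on the training set. The sketch below shows one way to do it. The helper names are hypothetical, the population standard deviation is an assumption (the text does not say which estimator is used), and the guard for constant dimensions is added here for robustness.

```python
from statistics import fmean, pstdev


def fit_standardizer(features):
    """Per-dimension mean and standard deviation over all training superedges."""
    columns = list(zip(*features))
    means = [fmean(col) for col in columns]
    # Guard against constant dimensions to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(features, means, stds):
    """Subtract the training mean and scale to unit variance."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in features]


# Toy training set with 3-dimensional edge features (the paper uses d_f = 13).
train = [[1.0, 10.0, 5.0], [2.0, 20.0, 5.0], [3.0, 30.0, 5.0]]
means, stds = fit_standardizer(train)
normalized = standardize(train, means, stds)
```

The same `means` and `stds` would be reused unchanged at test time, so that the filter-generating network sees features on the scale it was trained with.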
Training. We train using Adam [22] with initial learning
rate 0.01 and batch size 2, i.e. effectively up to 1024
superpoints per batch. For Semantic3D, we train for 500 epochs
with stepwise learning rate decay of 0.7 at epochs 350, 400,
and 450. For S3DIS, we train for 250 epochs with steps at
200 and 230. We clip gradients within $[-1, 1]$.
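The stepwise schedule and the gradient clipping can be written out explicitly. In this sketch the decay is applied from the milestone epoch onward with zero-based epoch counting, the S3DIS steps reuse the 0.7 factor, the clipping interval is taken to be symmetric around zero with bound 1, and clipping is element-wise rather than norm-based. All four points are assumptions, and the function names are illustrative.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.7, milestones=(350, 400, 450)):
    """Stepwise schedule: multiply the rate by `decay` at every milestone reached."""
    steps = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** steps


def clip_gradient(grad, bound=1.0):
    """Clamp each gradient component to [-bound, bound]."""
    return [max(-bound, min(bound, g)) for g in grad]


semantic3d_lr = [learning_rate(e) for e in (0, 349, 350, 400, 450, 499)]
s3dis_lr = [learning_rate(e, milestones=(200, 230)) for e in (0, 200, 230, 249)]
clipped = clip_gradient([-3.2, 0.4, 1.7])
```

Under these assumptions the Semantic3D run ends at a learning rate of about 0.0034 and the S3DIS run at about 0.0049.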
B. CRF-ECC
In this section, we describe our adaptation of CRF-RNN
mean field inference by Zheng et al. [51] for
post-processing PointNet embeddings in SPG, denoted as unary
potentials $U_i$ here.
The original work proposed a dense CRF with pairwise
potentials $\Psi$ defined to be a mixture of $m$ Gaussian kernels
as $\Psi_{ij} = \mu \sum_m w_m K_m(F_{ij})$, where $\mu$ is the label
compatibility matrix, $w$ are parameters, and $K$ are fixed Gaussian
kernels applied on edge features.
We replace this definition of the pairwise term with a
Filter generating network $\Theta$ [45] parameterized with weights
$W_e$, which generalizes the message passing and compatibility
transform steps of Zheng et al. Furthermore, we use
superedge connectivity $E$ instead of assuming a complete
graph. The pseudo-code is listed in Algorithm 1. Its outputs
are marginal probability distributions $Q$. In practice we run
the inference for $T = 10$ iterations.
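Algorithm 1 CRF-ECC
  $Q_i \leftarrow \mathrm{softmax}(U_i)$
  while not converged do
    $\hat{Q}_i \leftarrow \sum_{j | (j,i) \in E} \Theta(F_{ji,\cdot}; W_e)\, Q_j$
    $\breve{Q}_i \leftarrow U_i - \hat{Q}_i$
    $Q_i \leftarrow \mathrm{softmax}(\breve{Q}_i)$
  end while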
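A compact NumPy sketch of this inference loop may help make the update concrete. It follows the mean-field pattern of Zheng et al. (aggregate messages from neighbors, subtract them from the unary potentials, renormalize with a softmax) and runs for a fixed number of iterations instead of testing convergence. The `theta` callable, the Potts-like matrix and the toy graph are hypothetical stand-ins for the learned filter-generating network and real superpoint graphs.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def crf_ecc(unary, edges, edge_feats, theta, n_iter=10):
    """Mean-field style inference over superedges.

    unary:      (n, c) unary potentials U_i
    edges:      list of directed pairs (j, i), message flows from j to i
    edge_feats: (len(edges), d_f) superedge features F_ji
    theta:      callable mapping one feature vector to a (c, c) matrix;
                stands in for the filter-generating network
    """
    q = softmax(unary)
    # Edge features are fixed, so the generated matrices can be computed once.
    filters = [theta(f) for f in edge_feats]
    for _ in range(n_iter):
        pairwise = np.zeros_like(unary)
        for (j, i), w in zip(edges, filters):
            pairwise[i] += w @ q[j]          # message from j to i
        q = softmax(unary - pairwise)        # combine with unaries, renormalize
    return q


# Toy problem: 3 superpoints on a chain, 2 classes, 1-dimensional edge features.
unary = np.array([[2.0, 0.0], [0.1, 0.0], [2.0, 0.0]])
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
edge_feats = np.ones((len(edges), 1))

# Hypothetical linear "network": a Potts-like penalty on disagreeing labels.
potts = np.array([[0.0, 1.0], [1.0, 0.0]])
q = crf_ecc(unary, edges, edge_feats, theta=lambda f: f[0] * potts)
```

In the toy chain the middle superpoint has an almost uninformative unary potential, and the messages from its two confident neighbors pull its marginal toward their class.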
C. Extended Ablation Studies
In this section, we present an additional set of experiments
to validate our design choices and present their results in
Table 6.
a) Spatial Transformer Network. While STN makes
superpoint embedding orientation invariant, the relationships
with surrounding objects are still captured by superedges,
which are orientation variant. In practice, STN helps by 4
mIoU points.
b) Geometric Features. Geometric features $f_i$ are
computed in the geometric partition step and can therefore be
used in the following learning step for free. While PointNets
could be expected to learn similar features from the
data, this is hampered by superpoint subsampling, and
therefore their explicit use helps (+4 mIoU).
c) Sampling Superpoints. The main effect of subsampling
SPG is regularization by data augmentation. Too
small a sample size leads to disregarding contextual
information (-4 mIoU) while too large a size leads to overfitting
(-2 mIoU). Lower memory requirements at training are an
extra benefit. There is no subsampling at test time.
d) Long-range Context. We observe that limiting the
range of context information in SPG harms the performance.
Specifically, capping distances in $G_{vor}$ to 1 m (as
used in PointNet [39]) or 5 m (as used in SegCloud [46])
worsens the performance of our method (even more on our
Semantic3D validation set).
e) Input Gate. We evaluate the effect of input gating
(IG) for GRUs as well as LSTM units. While an LSTM unit
achieves a higher score than a GRU (the GRU is 3 mIoU lower),
the proposed IG reverses this situation in favor of GRU (+1 mIoU).
Unlike the standard input gate of LSTM, which controls the
information flow from the hidden state and input to the cell,
our IG controls the input even before it is used to compute
all other gates.
f) Regularization Strength $\mu$. We investigate the balance
between superpoints' discriminative potential and their
homogeneity controlled by parameter $\mu$. We observe that
the system is able to perform reasonably over a range of
SPG sizes.
g) Superpoint Sizes. We include a breakdown of
superpoint sizes for $\mu = 0.03$ in relation to hyperparameters
$n_{\mathrm{minp}} = 40$ and $n_p = 128$, showing that 93% of points are
in embedded superpoints, and 79% in superpoints that are
subsampled.
Superedge Features. Finally, in Table 7 we evaluate
the empirical importance of individual superedge features by
removing them from Best. Although no single feature is
crucial, the most important being offset deviation (+3 mIoU), we
remind the reader that without any superedge features the
network performs distinctly worse (NoEdgeFeat, -22 mIoU).

Furthermore, SegCloud divides the inference into cubes without overlap,
possibly causing inconsistencies across boundaries.

a) Spatial transf.        no        yes*
   mIoU                   58.1      62.1
b) Geometric features     no        yes*
   mIoU                   58.4      62.1
c) Max superpoints        256       512*      1024
   mIoU                   57.9      62.1      60.4
d) Superedge limit        1 m       5 m       ∞*
   mIoU                   61.0      61.3      62.1
e) Input gate             LSTM      LSTM+IG   GRU       GRU+IG*
   mIoU                   61.0      61.0      57.5      62.1
f) Regularization µ       0.01      0.02      0.03*     0.04
   # superpoints          785 010   385 091   251 266   186 108
   perfect mIoU           90.6      88.2      86.6      85.2
   mIoU                   59.1      59.2      62.1      58.8
g) Superpoint size        1-40      40-128    128-1000  ≥ 1000
   proportion of points   7%        14%       27%       52%
Table 6: Ablation study of design decisions on S3DIS (6-fold
cross validation). Our choices are marked with *.

Model                     mAcc    mIoU
Best                      73.0    62.1
no mean offset            72.5    61.8
no offset deviation       71.7    59.3
no centroid offset        74.5    61.2
no len/surf/vol ratios    71.2    60.7
no point count ratio      72.7    61.7
Table 7: Ablation study of superedge features on S3DIS (6-fold
cross validation).

[Figure 5 plot: x-axis "size of superpoints" (ticks at 40, 128, 1,000, 10,000), y-axis "number of points"]
Figure 5: Histogram of points contained in superpoints of
different size (in log scale) on the full S3DIS dataset. The
embedding threshold $n_{\mathrm{minp}}$ and subsampling threshold $n_p$
are marked in red.
D. Video Illustration
We provide a video illustrating our method and qualitative
results on the S3DIS dataset, which can be viewed at
... (3) Graph-based deep learning networks for point clouds: recent work by researchers has begun to experiment with directly processing large-scale point clouds. For example, SPG [13,14] For the large-scale field point cloud data collected by the telemetry system, it faces the problems of scene complexity, oversized computation, and uneven sparsity in addition to the inherent characteristics of disorder, rotational invariance, sparsity, severe occlusion, and unstructured point clouds. e above methods suffer from insufficient large scale processing, excessive parameter scale, and low segmentation accuracy in practice. ...
... Finally, PoingLAE, a neural network for the semantic segmentation of a point cloud with high performance, is introduced. [10], and Landure and Simonovsky [13] use the voxelization method for preprocessing in Superpoint graph. ese methods are computationally large compared with either FPS method. ...
Full-text available
The fast semantic segmentation algorithm of 3D laser point clouds for large scenes is of great significance for mobile information measurement systems, but the point cloud data is complex and generates problems such as disorder, rotational invariance, sparsity, severe occlusion, and unstructured data. We address the above problems by proposing the random sampling feature aggregation module ATSE module, which solves the problem of effective aggregation of features at different scales, and a new semantic segmentation framework PointLAE, which effectively presegments point clouds and obtains good semantic segmentation results by neural network training based on the features aggregated by the above module. We validate the accuracy of the algorithm by training on Semantic3D, a public dataset of large outdoor scenes, with an accuracy of 90.3, while verifying the robustness of the algorithm on Mvf CNN datasets with different sparsity levels, with an accuracy of 86.2, and on Bjfumap data aggregated by our own mobile environmental information collection platform, with an accuracy of 77.4, demonstrating that the algorithm is good for mobile information complex scale data in mobile information collection with great recognition effect.
... PointNet [37] is a pioneering work that raised a geometric deep learning network consuming point cloud directly and achieves good performance for both point cloud classification and segmentation tasks. Afterward, various point cloud-based deep learning networks are proposed [21,26,39]. To deal with the irregularity and disorder property of point cloud, these methods usually design transform nets and max-pooling layers in the deep architecture. ...
Full-text available
Point cloud is currently the most typical representation in describing the 3D world. However, recognizing objects as well as the poses from point clouds is still a great challenge due to the property of disordered 3D data arrangement. In this paper, a unified deep learning framework for 3D scene segmentation and 6D object pose estimation is proposed. In order to accurately segment foreground objects, a novel shape pattern aggregation module called PointDoN is proposed, which could learn meaningful deep geometric representations from both Difference of Normals (DoN) and the initial spatial coordinates of point cloud. Our PointDoN is flexible to be applied to any convolutional networks and shows improvements in the popular tasks of point cloud classification and semantic segmentation. Once the objects are segmented, the range of point clouds for each object in the entire scene could be specified, which enables us to further estimate the 6D pose for each object within local region of interest. To acquire good estimate, we propose a new 6D pose estimation approach that incorporates both 2D and 3D features generated from RGB images and point clouds, respectively. Specifically, 3D features are extracted via a CNN-based architecture where the input is XYZ map converted from the initial point cloud. Experiments showed that our method could achieve satisfactory results on the publicly available point cloud datasets in both tasks of segmentation and 6D pose estimation.
... Each point is annotated with one semantic label from 13 categories. For a common evaluation protocol [11], [66], [76], we choose Area 5 as the test set which is not in the same building as other areas. ...
Convolution on 3D point clouds is widely researched yet far from perfect in geometric deep learning. The traditional wisdom of convolution characterises feature correspondences indistinguishably among 3D points, arising an intrinsic limitation of poor distinctive feature learning. In this paper, we propose Adaptive Graph Convolution (AGConv) for wide applications of point cloud analysis. AGConv generates adaptive kernels for points according to their dynamically learned features. Compared with the solution of using fixed/isotropic kernels, AGConv improves the flexibility of point cloud convolutions, effectively and precisely capturing the diverse relations between points from different semantic parts. Unlike the popular attentional weight schemes, AGConv implements the adaptiveness inside the convolution operation instead of simply assigning different weights to the neighboring points. Extensive evaluations clearly show that our method outperforms state-of-the-arts of point cloud classification and segmentation on various benchmark datasets.Meanwhile, AGConv can flexibly serve more point cloud analysis approaches to boost their performance. To validate its flexibility and effectiveness, we explore AGConv-based paradigms of completion, denoising, upsampling, registration and circle extraction, which are comparable or even superior to their competitors. Our code is available at
... 3D segmentation is a fundamental problem within computer graphics and vision. Many applications in these fields require methods capable of decomposing 3D shapes and scenes into meaningful regions: establishing correspondences, skeleton extraction, guiding semantic labeling, and allowing autonomous agents to interact with objects, to name a few [Abbatematteo et al. 2019;Landrieu and Simonovsky 2018;Liu and Zhang 2004]. ...
We present SHRED, a method for 3D SHape REgion Decomposition. SHRED takes a 3D point cloud as input and uses learned local operations to produce a segmentation that approximates fine-grained part instances. We endow SHRED with three decomposition operations: splitting regions, fixing the boundaries between regions, and merging regions together. Modules are trained independently and locally, allowing SHRED to generate high-quality segmentations for categories not seen during training. We train and evaluate SHRED with fine-grained segmentations from PartNet; using its merge-threshold hyperparameter, we show that SHRED produces segmentations that better respect ground-truth annotations compared with baseline methods, at any desired decomposition granularity. Finally, we demonstrate that SHRED is useful for downstream applications, out-performing all baselines on zero-shot fine-grained part instance segmentation and few-shot fine-grained semantic segmentation when combined with methods that learn to label shape regions.
... In addition, they suffer from scalability issues for large point clouds as mentioned in Section I. RandLaNet [22] attempts to solve this issue by random sampling points, but it does not solve the issue of information loss. SPGraph [23] is the highest performing Graph Neural Network model that has previously attempted semantic segmentation. Although previously mentioned point based models use ideas similar to GNNs, they focus on encoding and attentive pooling modules rather than neighborhood feature aggregation mechanisms. ...
Inspired by recent improvements in point cloud processing for autonomous navigation, we focus on using hierarchical graph neural networks for processing and feature learning over large-scale outdoor LiDAR point clouds. We observe that existing GNN based methods fail to overcome challenges of scale and irregularity of points in outdoor datasets. Addressing the need to preserve structural details while learning over a larger volume efficiently, we propose Hierarchical Point Graph Neural Network (HPGNN). It learns node features at various levels of graph coarseness to extract information. This enables to learn over a large point cloud while retaining fine details that existing point-level graph networks struggle to achieve. Connections between multiple levels enable a point to learn features in multiple scales, in a few iterations. We design HPGNN as a purely GNN-based approach, so that it offers modular expandability as seen with other point-based and Graph network baselines. To illustrate the improved processing capability, we compare previous point based and GNN models for semantic segmentation with our HPGNN, achieving a significant improvement for GNNs (+36.7 mIoU) on the SemanticKITTI dataset.
... The usual method of target detection and depth calculation do not address obstacle avoidance problems in their approach plans for retrieving target fruit. With the development and application of deep networks of 3D point clouds in agriculture, the picking environment may be reconstructed using semantic segmentation method of 3D point clouds to obtain global information and retain local details (Landrieu and Simonovsky, 2018). Information on the location and growth direction of apples and branches may then be used for robot picking path decisions and planning. ...
A robotic apple harvester consisting of a mobile platform, a manipulator, an end-effector, a stereo camera, and a host computer was constructed and evaluated using two picking motions. The field tests showed all apple picking with success rates of 80.17% and 82.93% when using anthropomorphic and “horizontal pull with bending” motions, respectively. The main reasons for picking failure were depth misalignment, detachment failure, and blocked grasp. The “horizontal pull with bending” and anthropomorphic motions took 1.14 s and 3.13 s, respectively. The full picking cycle process using “horizontal pull with bending” motion was 12.53 ± 0.53 s, 4.64 s less than the average picking time when using anthropomorphic picking motion (17.17 ± 0.36 s). The picking process using anthropomorphic motion experienced a lower dynamic payload, meaning less effort would be required by the manipulator joints; however, fruit slipping decreased the overall success rate. The “horizontal pull with bending” picking motion had a superior picking cycle time and success rate. Notably, there were no stem-pulled or bruised apples during picking process using either motion. Based on this study, both picking motions have the potential to be applied in harvesting robots.
... The current recognition method of the target classification for 3D datasets research mainly uses the coordinate information of point clouds [10][11][12]. The geometric dimension of the target is calculated by the coordinate information of the point cloud; thus the target classification and recognition can be realized. ...
Full-text available
At present, the mainstream laser point cloud classification algorithms are mainly based on the geometric information of the target. Nevertheless, if there is occlusion between the targets, the classification effect will be negatively affected. Compared with the above methods, a new method of ski tracks extraction using laser intensity information based on target reflection is presented in this paper. The method can complete the downsampling of the point cloud datasets of ski tracks under the condition that the information of the target edge is complete. Then, the clustering and extraction of ski tracks are effectively accomplished based on the smoothing threshold and curvature between adjacent point clouds. The experimental results show that, different from the traditional methods, the composite classification method based on the intensity information proposed in this paper can effectively extract ski tracks from the complex background. By comparing the proposed method to the Euclidean distance method, the clustering segmentation method, and the RANSAC method, the average extraction accuracy is increased by 16.9%, while the over extraction rate is reduced by 8.4% and the under extraction rate is reduced by 8.6%, allowing us to accurately extract the ski track point cloud of a ski resort.
... A point cloud is a set of 3D points in space which is usually collected by LiDAR scans. GCN has been explored for classifying and segmenting points clouds [26,55]. Scene graph generation aims to parse the input image intro a graph with the objects and their relationship, which is usually solved by combining object detector and GCN [60,63]. ...
Network architecture plays a key role in deep learning-based computer vision systems. The widely used convolutional neural network and transformer treat the image as a grid or sequence structure, which is not flexible enough to capture irregular and complex objects. In this paper, we propose to represent the image as a graph structure and introduce a new Vision GNN (ViG) architecture to extract graph-level features for visual tasks. We first split the image into a number of patches, which are viewed as nodes, and construct a graph by connecting the nearest neighbors. Based on this graph representation of images, we build our ViG model to transform and exchange information among all the nodes. ViG consists of two basic modules: a Grapher module with graph convolution for aggregating and updating graph information, and an FFN module with two linear layers for node feature transformation. Both isotropic and pyramid architectures of ViG are built with different model sizes. Extensive experiments on image recognition and object detection tasks demonstrate the superiority of our ViG architecture. We hope this pioneering study of GNNs on general visual tasks will provide useful inspiration and experience for future research. The PyTorch and MindSpore code will be made available.
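The graph construction described in this abstract (patches as nodes, edges to the nearest neighbors) can be illustrated with a minimal sketch. It is not the ViG implementation: the function names `knn_graph` and `aggregate_mean`, the use of Euclidean distance on raw patch feature vectors, and the mean aggregation are all assumptions made for illustration, and the abstract does not specify the graph convolution used.

```python
import math


def knn_graph(features, k):
    """For each node, return the indices of its k nearest neighbours.

    `features` is a list of equal-length feature vectors, one per patch.
    Distance is Euclidean (an assumption; the abstract names no metric).
    """
    n = len(features)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of nodes")
    neighbours = []
    for i in range(n):
        # Distance from node i to every other node, excluding itself.
        dists = [(math.dist(features[i], features[j]), j)
                 for j in range(n) if j != i]
        dists.sort()  # ties are broken by the smaller node index
        neighbours.append([j for _, j in dists[:k]])
    return neighbours


def aggregate_mean(features, neighbours):
    """Replace each node feature by the mean of its neighbours' features.

    A stand-in for a graph convolution step, chosen for simplicity.
    """
    out = []
    for nbrs in neighbours:
        dim = len(features[nbrs[0]])
        out.append([sum(features[j][d] for j in nbrs) / len(nbrs)
                    for d in range(dim)])
    return out


# Four hypothetical patch features forming two well-separated pairs.
patches = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
graph = knn_graph(patches, k=1)
print(graph)  # [[1], [0], [3], [2]]
print(aggregate_mean(patches, graph)[0])  # [0.1, 0.0]
```

The brute-force search is quadratic in the number of nodes, which is acceptable for a few hundred patches; a spatial index would be the usual choice for larger graphs.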
An accurate coarse-to-fine out-of-core outlier removal method is proposed for large-scale indoor point clouds by mining geometric shape constraints. In the coarse processing stage, a low-resolution point cloud (LPC) is obtained using random downsampling. The LPC has the same density distribution as the raw point cloud (RPC), which is important information for outlier removal. The correspondences from the LPC to the RPC are also recorded. The outliers in the LPC are removed via a global threshold, and the outliers in the RPC are then roughly removed, guided by the cleaned LPC. In the refinement processing stage, the cleaned LPC is segmented into planar and non-planar segments, and the LPC segmentation is transferred to the RPC. Finally, the outliers in each RPC segment are carefully removed via a local threshold by exploiting the shape information. The experiments show that the proposed method improves the quality of outlier removal results.
This paper addresses the challenge of enriching geometric digital twins of buildings, with a particular emphasis on capturing small but important entities from the electrical and the fire-safety domain, such as signs, sockets, switches, smoke alarms, etc. Unlike most previous research that focussed on structural elements and processed laser point clouds and images separately, we propose a novel method that fuses laser scanning and photogrammetry methods to capture the relevant objects, recognise them in 2D images and then map these to a 3D space. The considered object classes include electrical elements (light switch, light, speaker, socket, elevator button), safety elements (emergency switch, smoke alarm, fire extinguisher, escape sign), plumbing system elements (pipes), and other objects with useful information (door sign, board). Semantic information like class labels is extracted by applying AI-based image segmentation and then mapped to the 3D point cloud, segmenting the point cloud into point clusters. We subsequently fit geometric primitives to the point clusters and extract text information by AI-based text detection and recognition. The final output of our proposed method is an information-rich digital twin of buildings that contains geometric information, semantic information such as object categories and useful text information which is valuable in many aspects, like condition monitoring, facility maintenance and management. In summary, the paper presents a nearly fully-automated pipeline to enrich a geometric digital twin of buildings with details and provides a comprehensive case study.
In this paper, we introduce a mathematical framework for obtaining spatially smooth semantic labelings of 3D point clouds from a pointwise classification. We argue that structured regularization offers a more versatile alternative to the standard graphical model approach. Indeed, our framework allows us to choose between a wide range of fidelity functions and regularizers, influencing the properties of the solution. In particular, we investigate the conditions under which the smoothed labeling remains probabilistic in nature, allowing us to measure the uncertainty associated with each label. Finally, we present efficient algorithms to solve the corresponding optimization problems. To demonstrate the performance of our approach, we present classification results derived for standard benchmark datasets. We demonstrate that the structured regularization framework offers higher accuracy at a lighter computational cost in comparison to the classic graphical model approach.
Conference Paper
Deep learning approaches have made tremendous progress in the field of semantic segmentation over the past few years. However, most current approaches operate in the 2D image space. Direct semantic segmentation of unstructured 3D point clouds is still an open research problem. The recently proposed PointNet architecture presents an interesting step ahead in that it can operate on unstructured point clouds, achieving encouraging segmentation results. However, it subdivides the input points into a grid of blocks and processes each such block individually. In this paper, we investigate the question of how such an architecture can be extended to incorporate larger-scale spatial context. We build upon PointNet and propose two extensions that enlarge the receptive field over the 3D scene. We evaluate the proposed strategies on challenging indoor and outdoor datasets and show improved results in both scenarios.
3D semantic scene labeling is fundamental to agents operating in the real world. In particular, labeling raw 3D point sets from sensors provides fine-grained semantics. Recent works leverage the capabilities of Neural Networks (NNs), but are limited to coarse voxel predictions and do not explicitly enforce global consistency. We present SEGCloud, an end-to-end framework to obtain 3D point-level segmentation that combines the advantages of NNs, trilinear interpolation (TI) and fully connected Conditional Random Fields (FC-CRF). Coarse voxel predictions from a 3D Fully Convolutional NN are transferred back to the raw 3D points via trilinear interpolation. Then the FC-CRF enforces global consistency and provides fine-grained semantics on the points. We implement the latter as a differentiable Recurrent NN to allow joint optimization. We evaluate the framework on two indoor and two outdoor 3D datasets (including NYU V2, S3DIS, and KITTI), and show performance comparable or superior to the state-of-the-art on all datasets.
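The transfer of coarse voxel predictions back to raw points via trilinear interpolation can be sketched as follows. This is a generic illustration of trilinear interpolation, not the SEGCloud code: the function name `trilinear_interpolate`, the nested-list grid layout, the assumption that voxel centre (i, j, k) sits at coordinate (i, j, k) times `voxel_size`, and the clamping of points outside the grid are all hypothetical choices.

```python
import math


def trilinear_interpolate(grid, point, voxel_size=1.0):
    """Interpolate per-class scores at a 3D point from its 8 nearest voxels.

    grid[i][j][k] is a list of class scores at voxel centre
    (i, j, k) * voxel_size. The grid is assumed to have at least
    two voxels along each axis; outside points are clamped to the grid.
    """
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    base, frac = [], []
    for coord, n in zip(point, dims):
        g = min(max(coord / voxel_size, 0.0), n - 1.0)  # clamp to grid
        i0 = min(int(math.floor(g)), n - 2)  # lower corner index
        base.append(i0)
        frac.append(g - i0)  # fractional offset in [0, 1]

    num_classes = len(grid[0][0][0])
    out = [0.0] * num_classes
    for dx in (0, 1):
        wx = frac[0] if dx else 1.0 - frac[0]
        for dy in (0, 1):
            wy = frac[1] if dy else 1.0 - frac[1]
            for dz in (0, 1):
                wz = frac[2] if dz else 1.0 - frac[2]
                weight = wx * wy * wz  # the 8 weights sum to 1
                scores = grid[base[0] + dx][base[1] + dy][base[2] + dz]
                for c in range(num_classes):
                    out[c] += weight * scores[c]
    return out


# Hypothetical 2x2x2 grid with two classes: class 0 score grows along x.
grid = [[[[float(i), 1.0 - i] for _ in range(2)] for _ in range(2)]
        for i in range(2)]
print(trilinear_interpolate(grid, (0.25, 0.5, 0.5)))  # [0.25, 0.75]
```

Because the eight weights are non-negative and sum to one, interpolated scores stay within the range of the surrounding voxel scores, so per-voxel class probabilities remain valid probabilities at the point level.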