ISPRS Journal of Photogrammetry and Remote Sensing 189 (2022) 63–77
Available online 12 May 2022
0924-2716/© 2022 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.
Street-view imagery guided street furniture inventory from mobile laser
scanning point clouds
Yuzhou Zhou, Xu Han, Mingjun Peng, Haiting Li, Bo Yang, Zhen Dong, Bisheng Yang
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China
Wuhan Geomatics Institute, Wuhan, China
Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong
Keywords: Street-view imagery; Mobile laser scanning; Point clouds; Street furniture; Instance segmentation; Neural network
Outdated or sketchy inventory of street furniture may misguide the planners on the renovation and upgrade of transportation infrastructures, thus posing potential threats to traffic safety. Previous studies have taken their steps using point clouds or street-view imagery (SVI) for street furniture inventory, but there remains a gap to balance semantic richness, localization accuracy and working efficiency. Therefore, this paper proposes an effective pipeline that combines SVI and point clouds for the inventory of street furniture. The proposed pipeline encompasses three steps: (1) Off-the-shelf street furniture detection models are applied on SVI for generating two-dimensional (2D) proposals and then three-dimensional (3D) point cloud frustums are accordingly cropped; (2) The instance mask and the instance 3D bounding box are predicted for each frustum using a multi-task neural network; (3) Frustums from adjacent perspectives are associated and fused via multi-object tracking, after which the object-centric instance segmentation outputs the final street furniture with 3D locations and semantic labels. This pipeline was validated on datasets collected in Shanghai and Wuhan, producing component-level street furniture inventory of nine classes. The instance-level mean recall and precision reach 86.4%, 80.9% and 83.2%, 87.8% respectively in Shanghai and Wuhan, and the point-level mean recall, precision, weighted coverage all exceed 73.7%.
1. Introduction
Recently, substantial and increasing investments have been made in updating road infrastructure by countries worldwide. The U.S. government, as an example, announced that about 110 billion dollars would be spent on improving roads and bridges (White House, 2021). As an important part of this task, the maintenance and upgrade of street furniture ought to take advantage of the existing inventory. Therefore, an obsolete or inaccurate inventory of street furniture may mislead transportation planners and constructors, posing potential hazards to traffic safety. However, current automated inventory solutions using only street-view imagery or mobile laser scanning (MLS) point clouds have respectively shown their drawbacks in meeting the comprehensive demand for working efficiency, localization accuracy and semantic richness, which motivates this study.
Street furniture collectively refers to the objects and equipment installed along roads for municipal functions, including street lamps, trash bins, traffic lights, traffic signs, etc. (Wang et al., 2017; Guan et al., 2014). Not only are the design and distribution of street furniture closely interrelated with traffic safety and comfort (Ma et al., 2022; Gargoum et al., 2018), but they are also considered fundamental infrastructure for various cutting-edge transportation applications, such as high-definition (HD) maps (Zhou et al., 2021), vehicle-to-everything (V2X) and autonomous driving (Cui et al., 2021). In the context of intelligent transportation, street furniture inventory is of great significance for city administration and should be oriented toward potential future applications. For example, the visibility of traffic signs and occlusions of street lamps need periodic inspections to reduce latent threats to transportation safety (Jensen et al., 2016). In addition, semantic and three-dimensional (3D) geometric features of street furniture are widely adopted by autonomous vehicles (AV) as references for localization and planning (Chen et al., 2020; Wang et al., 2021). Therefore, beyond just recording the amount, these applications demand both detailed semantic labels and the corresponding 3D geometric information, including locations, shapes and sizes, from the street furniture inventory, as shown in Fig. 1.
* Corresponding authors.
E-mail addresses: (Y. Zhou), (Z. Dong), (B. Yang).
Received 15 December 2021; Received in revised form 24 April 2022; Accepted 25 April 2022
To keep a regularly updated street furniture inventory, street-view imagery analysis has attracted extensive attention due to its visual and complete presentation of road scenes (Laumer et al., 2020). With the prevalence of street-view imagery (SVI) thanks to Google Street View and the upsurge of image understanding algorithms, the automation and efficiency of inventory collection have drastically increased. However, the localization accuracy of SVI-based methods is confined to meter level and the output lacks 3D geometric information (Biljecki and Ito, 2021), so it does not fully satisfy the needs of the aforementioned applications, especially the development of autonomous driving. Meanwhile, previous practice has demonstrated that it is promising to inventory street assets by segmenting MLS point clouds (Yang et al., 2015), which feature high localization accuracy but fall short in semantic richness (Che et al., 2019). Specifically, when dealing with small objects, point cloud based methods are more prone to confusion because of the relatively small number of points and the lack of sufficient semantic and texture information.
Accordingly, SVI-based methods and MLS-based methods have complementary properties in terms of localization accuracy and semantic richness. Combining SVI and MLS point clouds may help them supplement each other toward better inventory performance. For example, with SVI providing semantic labels and point clouds providing 3D geometric features, the inventory of traffic signs can be enriched to satisfy the need for localization and planning in autonomous driving. Moreover, although SVI-based methods suffer from relatively lower localization accuracy, SVI may guide the 3D survey of small street furniture like fire hydrants or benches in point clouds by indicating a potential search space. In this regard, some pioneering frameworks have been proposed to fuse images and point clouds, among which the frustum-based methods (Qi et al., 2018) offer a heuristic inspiration for our study.
In this study, we first leverage off-the-shelf SVI object detection models for two-dimensional (2D) proposal generation and accordingly segment point cloud frustums by projecting the 2D bounding boxes into the point cloud space. Then, the instance mask and the instance bounding box are predicted for each frustum using a multi-task neural network. Lastly, the frustums from different images are associated via object tracking based on 3D bounding boxes, and the final object-centric street furniture instance masks are predicted by fusing the associated frustums. The main contributions of this study are as follows:
• A novel framework for surveying street furniture inventory combining SVI and MLS data is proposed, which outputs component-level street furniture semantic labels, 3D locations, and corresponding instance point clouds.
• An effective split-and-merge pattern for processing MLS data is designed, which first segments point clouds into frustums and then associates them to be object-centric, hence reducing the search space for street furniture instance segmentation.
• A multi-task neural network perceiving the instance-aware context and considering point cloud semantic supervision is designed to enhance per-frustum instance mask prediction.
2. Related work
According to the type of data used for surveying road infrastructure, we roughly categorize the related studies into three groups: images only, point clouds only, and combined images and point clouds.
2.1. Inventorying street furniture from street-view imagery
High-level image understanding in street scenes has been greatly propelled by several outstanding public datasets. Cityscapes, for example, provides 5000 densely annotated images for urban street panoptic segmentation (Cordts et al., 2016). Objects365, containing 365 common object categories, covers small objects like fire hydrants and traffic cones (Shao et al., 2019). In the field of traffic sign detection, GTSDB (Houben et al., 2013) and TT100K (Zhu et al., 2016) are prominent for their diversified classes and background scenes. Among the extensive urban street scene understanding methods, the work proposed and implemented by researchers from NVIDIA not only achieves state-of-the-art performance but also shows strong portability and generalizability (Tao et al., 2020). It presents an encouraging performance in multi-scale semantic segmentation with a hierarchical attention mechanism that enables the network to predict weights between scales.
The above-mentioned contributions are 2D only, and to map road objects, an estimation of their geographical locations has to be performed. Google Street View (GSV) is commonly used in relevant studies (Anguelov et al., 2010). Peng et al. (2017) match images to GSV to estimate the picturing position and then locate the POI (Point of Interest) according to the intersection between the picturing direction and buildings in digital maps. Laumer et al. (2020) match detected tree instances to a previous database for the update of the tree inventory. Photogrammetric calculation is used by Campbell et al. (2019) for locating traffic signs from GSV observations. Triangulation is another
Fig. 1. Component-level street furniture inventory based on frustums. The first row shows the correspondences between the MLS system, street-view imagery, frustum point clouds, and instance points. The second row shows the point-level inventory results of four classes in the Wuhan dataset. Original point clouds that are not street furniture are colored in grayscale according to intensity, with darker gray representing lower intensity.
Table 1
Three-dimensional localization accuracy of some representative image-based street asset inventory methods.

Dataset   Method                  Class             Localization accuracy
Images    Peng et al. (2017)      Roadside stores   10 m
          Krylov et al. (2018)    Traffic lights,   2 m
          Campbell et al. (2019)  Traffic signs     25 m
          Laumer et al. (2020)    Trees             meter-level
effective tool for estimating object locations. Hebbalaguppe et al. (2017) first detect telecom infrastructure from GSV images and then adopt triangulation to locate the instances. Also using GSV and triangulation, Krylov et al. (2018) leverage monocular depth estimation for an initial relative position and then refine it with a fusion and clustering module. These studies follow a similar pattern: (1) detect objects of interest in images; (2) approximately locate the objects in geographical space; (3) refine the locations or match them to an existing inventory database. However, these methods only report meter-level localization accuracy (Table 1), which impedes their further applications. Moreover, the overlapping of similar objects poses great challenges to these methods due to the lack of depth or 3D information.
2.2. Street object extraction from point clouds
To overcome the problems of precisely locating objects, point clouds are a promising data source because they provide dense 3D coordinates (Li et al., 2020). Since the popularization of mobile laser scanning, extracting pole-like objects along streets has been a main focus (Ma et al., 2018; Chen et al., 2019), and machine learning methods are quite commonly used. Yu et al. (2016) segment point clouds into separated clusters for feature description and then accordingly construct a contextual vocabulary for object recognition. Chen et al. (2021) incorporate voxel-based and point-based features for urban tree inventory. Yang et al. (2017) achieve effective road facility inventory using multi-level geometric and contextual information aggregation followed by a support vector machine (SVM). Li et al. (2019) further separate the poles and their attachments using machine learning classifiers, a meaningful step toward component-level road furniture inventory. But these methods cannot be easily generalized due to the choice of feature calculation units and hand-crafted feature descriptors (Yang and Dong, 2013). Another limitation is the overreliance on the performance of pole extraction.
The advancements of deep neural networks in the fields of point cloud object detection and semantic instance segmentation offer an impetus for street furniture surveying. Effectively and elegantly encoding point cloud features is a fundamental task. PointPillars encodes the point cloud feature map for object detection by converting original point clouds to pillar voxels (Lang et al., 2019). PointRCNN designs a two-stage solution that first estimates 3D proposals and then refines them by aggregating spatial and semantic features (Shi et al., 2019). PV-RCNN leverages both multi-scale voxel features and aggregated keypoint features for effective object detection (Shi et al., 2020). On the other hand, for point cloud segmentation, RandLA-Net can efficiently produce semantic labels for up to 10^6 points with a simple random sampling strategy and a local feature aggregator (Hu et al., 2020).
Jointly predicting semantic and instance labels demands spatially discriminative feature learning. Wang et al. (2019) associate the semantic decoder and instance decoder to make them benefit each other and thus achieve instance segmentation in outdoor point clouds. Lahoud et al. (2019) and Han et al. (2020) both involve spatial vectors from points to instance centers to learn instance-specific information. Yang et al. (2019) predict a group of bounding boxes from the global feature for potential instances and meanwhile predict the per-point mask for all instances, achieving efficient instance segmentation without demanding post-processing procedures. These methods using point clouds feature high localization quality, but face serious challenges in capturing accurate semantic information and dealing with small road objects.
2.3. Road scene parsing combining images and point clouds
Images and point clouds have complementary characteristics in presenting geometric and semantic information, so some researchers attempt to combine them in their frameworks. In this part, the related
Fig. 2. Pipeline of the proposed method. P_f denotes the frustum point cloud introduced in Section 3.1 and Section 3.2, and P_o denotes the object-centric point cloud introduced in Section 3.3.
studies are roughly categorized into two groups according to the 2D-3D correspondence levels: pixel-level and object-level.
The pixel-level integration of point clouds and images has been exploited in some pioneering works. PointPainting projects pixel-level semantic predictions to 3D spaces to filter background points (Vora et al., 2020). Hou et al. (2019) introduce a novel method that summarizes image features and projects the learned 2D features to 3D voxels to supplement the 3D geometric feature. Barcon and Picard (2021) infer the MLS point cloud clusters of street lamps according to the panoramic image instance segmentation results. Sanchez Castillo et al. (2021) establish the correspondence between terrestrial laser scanning data and the panoramic image semantic segmentation result to add a semantic mask to the point clouds. Besides, similarly requiring fine 2D-3D alignment, 3D-CVF and EPNet deeply fuse multi-modal information at the feature scale (Yoo et al., 2020; Huang et al., 2020). Zhu et al. (2021) consider the resolution mismatch between images and points, and therefore design a framework based on virtual points with moderate density to bridge the resolution gap. By contrast, object-level fusion of points and images is also widely explored in autonomous driving for dynamic object detection. MV3D is an influential study that projects object-level 3D proposals generated from bird-view point cloud images to both front-view images and point clouds to learn a fused feature (Chen et al., 2017). Frustum-PointNet, on the contrary, projects 2D proposals to the point cloud space to search for a 3D object (Qi et al., 2018). Gong et al. (2020) also base their object detection framework on frustums, and develop a novel probabilistic localization method. CLOCs associates 2D and 3D detections, feeds the joint detection candidates to a sparse tensor and then learns the fusion parameters (Pang et al., 2020).
Pixel-level fusion of RGB and 3D data is valid for studies using relatively sparse point clouds collected with multi-beam laser sensors or RGB-D data, as the correspondences are easy to build. However, for MLS systems, there is a temporal mismatch between image frames and scanning samples and the point clouds are much denser, so pixel-level correspondences are not as reliable. Besides, noise and occlusions also pose challenges. Therefore, we base our study on object-level associations between panoramic images and MLS point clouds. Concretely, we use the 2D object detection proposals from panoramic images as guidance by projecting them into frustums, reducing the search space for locating road furniture.
3. Methods
The proposed method takes combined street-view imagery and MLS
point clouds as input, leveraging both semantic and 3D geometric in-
formation. It outputs component-level street furniture locations with
detailed semantic labels and point-level instance masks. Generally, as
shown in Fig. 2, this pipeline constitutes three major steps. Firstly, based
on the pre-established point-image alignment and the detected street
furniture objects, point cloud frustums of interest (FoI) are cropped.
Then, a multi-task neural network is used to predict the instance
bounding box and the point mask for each FoI. Thirdly, the predicted
bounding boxes are used to associate the instances in FoI from adjacent
image frames, and then the nal object-centric instance masks are pre-
dicted by fusing the instance mask prediction results of the associated
3.1. Frustum-of-interest cropping
The alignment of street-view imagery and point clouds, both collected with the MLS system, is established using camera coordinates and Euler angles at the picturing moments, together with pre-calibrated parameters. As shown in Fig. 3, this alignment maps a point (x, y, z) from the local projection coordinate system (x_p, y_p, z_p) to the camera-centric coordinate system (x_c, y_c, z_c), and then to the panoramic pixel coordinate system via spherical projection.
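The spherical projection step can be sketched as follows. This is a minimal illustration assuming an equirectangular panorama and a z-forward, y-down camera axis convention; the function name and the convention are assumptions, not taken from the paper:

```python
import numpy as np

def project_to_panorama(p_cam, width, height):
    """Map a point in camera-centric coordinates to equirectangular
    panoramic pixel coordinates via spherical projection.
    Assumed axis convention: z forward, x right, y down."""
    x, y, z = p_cam
    r = np.sqrt(x * x + y * y + z * z)
    lon = np.arctan2(x, z)                     # azimuth in [-pi, pi]
    lat = np.arcsin(y / r)                     # elevation in [-pi/2, pi/2]
    u = (lon / (2.0 * np.pi) + 0.5) * width    # pixel column
    v = (lat / np.pi + 0.5) * height           # pixel row
    return u, v
```

Under this convention, a point straight ahead of the camera (on the +Z axis) maps to the center of the panorama, which is how a 2D detection box near the image center corresponds to a frustum in front of the vehicle.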
We adopt off-the-shelf image object detection models to generate 2D proposals and lift them into the 3D point cloud space for frustum cropping. The model that won the championship of the Objects365 benchmark is suitable and used for this work due to its outstanding performance in detecting a wide range of street furniture objects in panoramic images (Gao, 2021; Shao et al., 2019). Besides, based on the basic types of traffic signs included in Objects365, we further enrich the traffic sign categories using the TT100K dataset by fine-tuning a lighter detection model with it (Zhu et al., 2016). In total, the surveyed street furniture in this setup comprises common transportation infrastructure, including traffic signs, traffic lights, street lamps, temporary cone barriers, and other public assets like fire hydrants, trash bins, potted plants and roadside benches. Moreover, traffic signs are further classified into dozens of categories including text signs, stop signs and various warning or speed limit signs.
Fig. 3. Alignment of street-view imagery and point clouds collected using the MLS system. Point clouds are first transformed from the local projection coordinate system to the camera-centric coordinate system through rotation and translation, and then to image plane coordinates via spherical projection. Accordingly, a 2D proposal is mapped to a 3D frustum.
Based on the established alignment, each detection bounding box is lifted to a frustum of interest in the point cloud space, and the points within the frustum constitute a frustum point cloud P_f ∈ R^{N×(3+c)}, where N is the point number and c is the number of additional feature channels (e.g., intensity, RGB). Then, a rotation of coordinates about the Y axis, making the Z axis point toward the frustum center, is performed for each frustum to avoid excessive variation in point cloud placement (Qi et al., 2018). The rotation angle depends on the horizontal location of the detected object in the image. The points in the frustum are thereby transformed to the frustum coordinate system (x_f, y_f, z_f), as shown in Fig. 4.
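The frustum normalization above can be sketched as a rotation about the Y axis by the azimuth of the frustum center. The helper below is a hypothetical illustration of this step, not the paper's implementation:

```python
import numpy as np

def rotate_to_frustum_frame(points, frustum_center):
    """Rotate camera-frame points about the Y axis so the Z axis points
    toward the frustum center (hypothetical helper).
    points: (N, 3) array; frustum_center: a direction toward the 2D box center."""
    # Azimuth of the frustum center in the X-Z plane
    theta = np.arctan2(frustum_center[0], frustum_center[2])
    c, s = np.cos(theta), np.sin(theta)
    # Rotating by -theta about Y maps the center direction onto +Z
    R = np.array([[c, 0.0, -s],
                  [0.0, 1.0, 0.0],
                  [s, 0.0, c]])
    return points @ R.T
```

After this rotation, every frustum "looks down" its own +Z axis, so the network sees point clouds in a canonical pose regardless of where the object appeared in the panorama.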
3.2. Per-frustum instance mask and bounding box prediction
Primarily, this part predicts instance masks measuring the probability of being street furniture points for each frustum point cloud P_f. It is assumed that a single frustum of interest (FoI) contains only one street furniture instance (i.e., the detected object) and the other points are clutter. A baseline idea is to directly predict the instance mask from point features, but this method is prone to errors, especially false positives, because it focuses overwhelmingly on point-level local features and lacks awareness of the instance.
Instead, we simultaneously estimate a 3D bounding box of the road furniture and use it as an input of the instance mask prediction, to increase the awareness of the detected instance and the context near the bounding box. Another merit of the bounding box prediction is that it supports the subsequent FoI fusion procedure by providing concrete information about the location and size of the object. Meanwhile, a point-level semantic prediction branch further promotes point feature learning. The above tasks should be mutually reinforcing in the network, and the experiments prove this hypothesis. Specifically, we adopt a neural network containing three branches to process each frustum point cloud P_f: (a) a bounding box branch; (b) an instance mask branch; (c) a semantic label branch. They share a point cloud processing backbone, as shown in Fig. 5. While this method is not restricted to any specific point cloud processing network, we use PointNet++ as the backbone (Qi et al., 2017). Refer to the Appendix for more detailed neural network settings.
Bounding Box Branch. This branch takes the global feature F_g as input and outputs the min-max coordinates B = [x_min, y_min, z_min, x_max, y_max, z_max] of the estimated bounding box. F_g concatenates the point cloud global feature F_gpc and the one-hot semantic vector F_img from image object detection.
We use ℒ_box in Eq. (1) to supervise this branch:

ℒ_box = ℒ_dis + ℒ_siou.  (1)

In Eq. (1), ℒ_dis measures the spatial coordinate differences and ℒ_siou encourages the predicted box to include more valid instance points. ℒ_dis in Eq. (2) measures the Euclidean distance between the predicted bounding box B and the ground-truth bounding box B̄:

ℒ_dis = (1/6) Σ_{i=1}^{6} (B_i − B̄_i)².  (2)

And ℒ_siou in Eq. (3) is a soft intersection-over-union (sIoU) loss introduced by Yang et al. (2019):

ℒ_siou = − (Σ_{n=1}^{N} q_n q̄_n) / (Σ_{n=1}^{N} q_n + Σ_{n=1}^{N} q̄_n − Σ_{n=1}^{N} q_n q̄_n).  (3)

In Eq. (3), q ranges from 0 to 1 and measures the degree to which a point is inside a 3D box (Yang et al., 2019); a point closer to the box center has a higher q. q_n and q̄_n respectively represent the q value of the nth point in P_f with respect to B and B̄.
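Given per-point inclusion scores q for the predicted and ground-truth boxes, the sIoU term reduces to a ratio of soft intersection over soft union. A minimal sketch, assuming precomputed q arrays:

```python
import numpy as np

def soft_iou_loss(q_pred, q_gt):
    """Soft IoU loss over per-point inclusion scores q in [0, 1],
    following the sIoU formulation of Yang et al. (2019).
    Returns -1 for a perfect overlap and 0 for disjoint boxes."""
    inter = np.sum(q_pred * q_gt)
    union = np.sum(q_pred) + np.sum(q_gt) - inter
    return inter and -inter / union  # 0 when there is no soft intersection
```

Because q is differentiable in the box coordinates, minimizing this loss pushes the predicted box to cover the same points as the ground truth, which is exactly the "include more valid instance points" behavior described above.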
Fig. 4. Point clouds are transformed from the camera-view coordinate system to the frustum-view coordinate system through a rotation about the Y axis.
Fig. 5. Framework of the multi-task neural network. This network takes the semantic class from image object detection and the frustum point cloud as input, and
contains three prediction branches.
Instance Mask Branch. The instance mask prediction branch takes the point features F_p, the aggregated feature F_a and the predicted bounding box coordinates B as input. Concretely, F_a and B are respectively tiled by replication to be concatenated with the point-level features. Compared with predicting the instance mask solely from point features, the supplementary information from the global feature and the instance bounding box offers instance-aware context. The output is an instance mask M ∈ R^{N×1} indicating the probability of points belonging to the detected object. A detailed network illustration is shown in the Appendix. In our experiment, we adopt the focal loss to tackle the imbalance between instance points and clutter, as shown in Eq. (4), where α and γ are hyper-parameters (Lin et al., 2017):

ℒ_mask = − (1/N) Σ_{n=1}^{N} α_n (1 − p_n)^γ log(p_n),  (4)

where p_n is the predicted probability for the ground-truth label of the nth point and α_n is its class-balancing weight (α for instance points, 1 − α for clutter).
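The binary focal loss used here follows the standard formulation of Lin et al. (2017); the sketch below uses the original paper's default hyper-parameters (α = 0.25, γ = 2), which this paper does not specify:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss averaged over N points (Lin et al., 2017).
    p: predicted instance probability per point; y: 0/1 ground-truth mask.
    The modulating factor (1 - p_t)^gamma down-weights easy points."""
    p_t = np.where(y == 1, p, 1.0 - p)           # probability of the true class
    a_t = np.where(y == 1, alpha, 1.0 - alpha)   # class-balancing weight
    return float(-np.mean(a_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

The modulating factor is what makes this suitable for frustums where clutter points vastly outnumber instance points: confidently classified clutter contributes almost nothing to the gradient.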
Semantic Label Branch. Additionally, most point cloud segmentation datasets have point-level semantic annotations, so a straightforward point-level semantic label prediction branch is adopted to push the backbone to learn useful information from point clouds. This branch passes F_p through fully connected layers and a final softmax layer. ℒ_sem is a weighted cross-entropy loss, as shown in Eq. (5):

ℒ_sem = − (1/N) Σ_{n=1}^{N} Σ_{i=1}^{n_s} w_i y_n^i log(p_n^i),  w_i = median(r) / r_i.  (5)

In Eq. (5), y_n^i is a binary indicator denoting whether the nth point belongs to class i (n_s classes in total), and correspondingly, p_n^i is the predicted probability. The class weight w_i is involved to deal with the imbalance between classes, where r_i denotes the ratio of points belonging to the ith class and median(r) is the median value of these r_i.
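The class weights in Eq. (5) can be computed in one pass over the training labels. This is a sketch of median-frequency balancing as described above; the exact form used in the paper is an assumption reconstructed from its definition of r_i and median(r):

```python
import numpy as np

def median_frequency_weights(labels, n_classes):
    """Class weights w_i = median(r) / r_i, where r_i is the fraction of
    training points labeled with class i. Rare classes (small r_i)
    receive weights above 1; frequent classes receive weights below 1."""
    r = np.bincount(labels, minlength=n_classes) / float(len(labels))
    return np.median(r) / r  # assumes every class occurs at least once
```

For example, a class holding half the points gets weight 0.5 while classes at the median frequency get weight 1, counteracting the dominance of large background classes in the cross-entropy sum.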
A joint multi-task loss function ℒ_joint is used to train this network, as shown in Eq. (6):

ℒ_joint = ℒ_box + ℒ_mask + ℒ_sem.  (6)
3.3. Object-centric instance segmentation
The ultimate pipeline output is supposed to be an object-centric road furniture survey containing semantic and point-level information. Usually, the same street furniture object may occur in adjacent street-view images with overlapping fields of view. Therefore, simply accumulating every instance mask M from all images would also accumulate the instance mask prediction errors. For this reason, an effective fusion strategy is used to fuse M from different image frames, which first associates the frustums containing the same instance. Before we associate the frustums, the predicted bounding boxes B and the point clouds are all transformed to the original projection coordinate system. Given the predicted bounding boxes and the orientation parameters of image frames, the problem is how to assign an instance ID to each frustum (Qi et al., 2021). Concretely, we use the multi-object tracking (MOT) method proposed by Weng et al. (2020), where the 3D bounding box IoU (Intersection over Union) and the semantic class are used as association criteria, as shown in Fig. 6. The association strategy is based on the Hungarian algorithm (Kuhn, 1955), and we eliminate the Kalman filter module since no moving objects are involved. Then, the point clouds of each associated frustum group are translated to the object-centric coordinate system (x_o, y_o, z_o), whose origin is the center of the bounding boxes.
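The association step can be sketched as Hungarian matching on 3D box IoU. This is a simplified illustration using axis-aligned boxes; the semantic-class gating mentioned above is omitted for brevity, and the IoU threshold is an assumption:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_3d(a, b):
    """Axis-aligned 3D IoU for boxes [xmin, ymin, zmin, xmax, ymax, zmax]."""
    lo = np.maximum(a[:3], b[:3])
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol = lambda box: np.prod(box[3:] - box[:3])
    return inter / (vol(a) + vol(b) - inter)

def associate_frustums(tracks, detections, iou_thresh=0.25):
    """Hungarian matching on 3D box IoU. The Kalman filter of Weng et al.
    (2020) is omitted, as in the paper, since the objects are static."""
    cost = np.array([[-iou_3d(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if -cost[r, c] >= iou_thresh]
```

Each matched pair inherits the track's instance ID; unmatched detections start new tracks, so one ID ends up covering all frustums that observe the same piece of street furniture.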
Firstly, a threshold T_fm is applied to filter out the clutter points, i.e., those whose predicted score m_f ∈ M is smaller than T_fm. The kept points from different frustums are gathered as P_o ∈ R^{N×(3+1)}, with the predicted m_f as the feature channel. Finally, after fusing the instance mask predictions from the associated frustums, a straightforward PointNet++ (Qi et al., 2017) is applied to predict a binary instance segmentation mask, as shown in Fig. 7. In the instance segmentation network, the overlapping points from different frustums are treated as separate points, and they are regarded as one instance point if any of them is predicted to be positive by the network.
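The construction of P_o from an associated frustum group can be sketched as a filter-and-stack operation; the helper name is hypothetical and T_fm = 0.3 follows the setting reported in Section 4.1:

```python
import numpy as np

def gather_object_points(frustum_points, frustum_scores, t_fm=0.3):
    """Drop points whose predicted instance score m_f <= T_fm, then stack
    the kept points from all associated frustums into P_o, with m_f
    appended as a fourth feature channel (N x 3 coordinates -> N x 4)."""
    kept = [np.column_stack([pts[m > t_fm], m[m > t_fm]])
            for pts, m in zip(frustum_points, frustum_scores)]
    return np.vstack(kept)
```

The score channel lets the final segmentation network weigh how confident each per-frustum prediction was, instead of treating all surviving points equally.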
4. Experiments and discussions
4.1. Dataset description and experiment settings
Street-view images and MLS point clouds from three regions were used in the experiments. For training the neural network, we use the manually labeled point clouds introduced by Han et al. (2021), which cover 6.0 km of urban roads in Shanghai. Datasets collected from Shanghai and Wuhan are used to validate the proposed method, as
Fig. 6. Multi-view frustum association. The orange, blue and green points are three frustum point clouds from different perspectives that target the same instance. They are associated according to the predicted bounding boxes B.
Fig. 7. Multi-view frustum fusion based object-centric instance segmentation. The predicted clutter points are filtered first, and then the kept points in the associated frustums are gathered into P_o for the final instance segmentation.
shown in Fig. 8. The Shanghai dataset covers 6.5 km of urban roads, 1.3 km of which are manually labeled for quantitative evaluation. The Wuhan dataset covers 3.2 km of urban roads, with 1.3 km of street furniture annotations. Table 2 shows the classes and instance numbers in the experiment dataset. The inventory of nine classes of street furniture is evaluated, covering typical municipal and transportation assets. Our method outputs the concrete semantic class of traffic signs (127 classes in total in the training set), and they are categorized into two groups in the point cloud annotations: text signs and warning signs.
After cropping frustums, the point intensity and 3D coordinates are
used as the feature channels of P_f. The multi-scale grouping PointNet++
(Qi et al., 2017) is used as the point cloud processing backbone for predicting
per-frustum bounding boxes and instance masks, and is also used
to predict the final object-centric instance masks, with hyper-parameters
set according to Qi et al. (2018). After frustum association, T_fm is
empirically set to 0.3 to filter out the points predicted with a very
low probability of being instance points. The neural networks
in the proposed pipeline are trained on an NVIDIA 2080Ti GPU for 50
epochs. Network details are shown in the Appendix.
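As a minimal sketch of the T_fm filtering step, assuming per-point instance probabilities are available from the frustum network (function name and array layout are illustrative, not from the paper):

```python
import numpy as np

T_FM = 0.3  # empirical threshold reported in the paper

def filter_low_confidence(points, probs, t_fm=T_FM):
    """Drop frustum points whose predicted instance probability is below t_fm.

    Hypothetical helper: `points` is an (N, C) feature array (e.g. XYZ plus
    intensity), `probs` an (N,) array of predicted instance probabilities.
    Returns the kept points and the boolean keep mask.
    """
    keep = np.asarray(probs) >= t_fm
    return np.asarray(points)[keep], keep

pts = np.random.rand(6, 4)  # 6 points with XYZ + intensity channels
probs = np.array([0.05, 0.31, 0.9, 0.1, 0.5, 0.29])
kept, keep_mask = filter_low_confidence(pts, probs)
# Three points survive the threshold.
```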
4.2. Qualitative results
Fig. 9 illustrates the qualitative experiment results from the global
perspective, where the extracted street lamp, traffic light, traffic sign
and trash bin points are overlaid on the raw MLS point clouds, and
Fig. 10 presents close-up results at crossroads. As the experiment results
show, the proposed pipeline locates the street furniture and outputs
their point clouds in typical urban streets and crossroads, presenting
their inventories and geometric features, including shapes and sizes, for
transportation administration. In general, this method processes all
types of detected street furniture using a universal framework and
common parameter settings, and it is valid for all the mentioned classes.
Moreover, the class list can readily be extended without the excessive labor
of designing extra hand-crafted feature descriptors.
Fig. 11 shows the results concerning small street furniture, which
illustrates that the proposed method is capable of segmenting traffic
cones and temporary traffic barriers, supporting the update of HD maps.
Besides, it is significant for residents that fire hydrants are effectively
inventoried by the proposed method, considering that they are usually
ignored by methods depending solely on point clouds.
Solely depending on images or point clouds can hardly balance semantic
richness against localization accuracy, and hence we exploit images and
point clouds jointly. As introduced in Section 2.1, although images present
abundant semantic information, 3D localization of the objects is mainly
achieved using multi-view photogrammetry methods, which only reach
meter-level accuracy. The proposed pipeline instead uses the semantic cues
from images as guidance and segments the instances of interest from MLS
point clouds, which guarantees much higher absolute localization accuracy
and supports further point-level geometric feature analysis. On the other side, most
methods using only point clouds as source data start from extracting
pole-like objects based on pre-defined feature descriptors. Li et al.
(2019) use point clouds for roadside asset surveying, producing
component-level street furniture point segments and their class labels.
However, their approach relies on rule-based pole extraction and feature
design, which limits its flexibility in complex scenes, especially densely
vegetated urban streets.
In previous studies using street-view imagery or point clouds for
street furniture recognition, the dense vegetation in urban streets often
leads to serious occlusions, especially for lamps. By contrast, since the
proposed pipeline merges the semantic and geometric
cues from multiple perspectives by successively predicting instance
masks in the frustum-centric and the object-centric manners, the
occlusions do not greatly interfere with the results.
Another characteristic of our method is that it produces component-
level results under the semantic guidance from street-view imagery. For
example, trafc signs and trafc lights mounted on the same pole are
respectively surveyed with detailed trafc sign semantic labels, even the
weight or speed limits. This benets the trafc management authorities
by providing detailed distribution and semantic information about road
assets for analyzing their rationality to trafc planning and consistency
with driving safety.
Fig. 8. Experiment datasets in Shanghai (6.5 km) and Wuhan (3.2 km). MLS
point clouds (blue points) are overlaid on the satellite images. Red boxes show
the annotated areas for quantitative evaluation (1.3 km each).
Table 2
Instance numbers (Ins. Num.) and corresponding frustum numbers (Fru. Num.) in the
training set collected in Shanghai (6.0 km), and the instance numbers of the
annotated parts (1.3 km each) of the Shanghai and Wuhan test sets.
Class | Training Ins. Num. | Training Fru. Num. | Shanghai Test Ins. Num. | Wuhan Test Ins. Num.
Text Signs | 179 | 959 | 33 | 28
Traffic Lights | 96 | 953 | 68 | 90
Street Lamps | 287 | 1888 | 99 | 97
Trash Bins | 169 | 870 | 39 | 21
Traffic Cones | 49 | 204 | 18 | 163
Fire Hydrants | 23 | 48 | 10 | 7
Benches | 4 | 16 | 0 | 4
Warning Signs | 63 | 466 | 32 | 62
Potted Plants | 163 | 959 | 3 | 0
4.3. Quantitative evaluation
Quantitative evaluation is performed on our annotated street segments,
which cover 1.3 km of roads in each of the Shanghai and Wuhan datasets.
To precisely reflect the performance of the proposed method, we adopt two
levels of quantitative evaluation indicators in this study.
Instance-level Metrics. Instance-level recall and precision are listed
to show the ratio of correctly inventoried instances at a given point-IoU
threshold (0.5 in this study). If the point IoU between a predicted and a
ground truth (GT) instance is greater than the threshold and the assigned
semantic label is correct, the prediction is correctly inventoried (i.e., a
true positive (TP)). The other predicted instances are regarded as false
positives (FP), and the missed GT instances as false negatives (FN). Refer
to Eq. (7) and Eq. (8) for the calculation of instance-level recall and precision.
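The instance-level TP/FP/FN counting described above can be sketched as a greedy matching over point IoU. The paper does not spell out its exact matching procedure, so the following greedy scheme is an assumption for illustration:

```python
import numpy as np

def point_iou(pred, gt):
    """IoU between two boolean point masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 0.0

def match_instances(preds, gts, iou_thr=0.5):
    """Greedy instance-level TP/FP/FN counting at a point-IoU threshold.

    preds/gts are lists of (boolean point mask, class label). A prediction is
    a TP if it overlaps an unmatched GT of the same class with IoU > iou_thr;
    remaining predictions are FPs and unmatched GTs are FNs.
    """
    matched = [False] * len(gts)
    tp = 0
    for pm, pc in preds:
        for j, (gm, gc) in enumerate(gts):
            if not matched[j] and pc == gc and point_iou(pm, gm) > iou_thr:
                matched[j] = True
                tp += 1
                break
    return tp, len(preds) - tp, len(gts) - tp

# One GT lamp; a matching lamp prediction (IoU = 2/3) and a spurious sign.
gts = [(np.array([1, 1, 1, 0, 0], bool), "lamp")]
preds = [(np.array([1, 1, 0, 0, 0], bool), "lamp"),
         (np.array([0, 0, 0, 1, 1], bool), "sign")]
tp, fp, fn = match_instances(preds, gts)
```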
Point-level Metrics. Our pipeline integrates two point mask prediction
neural networks and produces street furniture points, so point-level
indicators (precision, recall, weighted coverage) are adopted. The
equations of point recall, precision and IoU are shown in Eqs. (7)-(9).
Point-level TP denotes the correctly predicted instance points of
the successfully inventoried instances, FP denotes the wrongly predicted
instance points, and FN denotes the missed instance points.
Recall = TP / (TP + FN). (7)
Precision = TP / (TP + FP). (8)
IoU = TP / (TP + FP + FN). (9)
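Given boolean point masks, the point-level metrics of Eqs. (7)-(9) can be computed directly, e.g.:

```python
import numpy as np

def point_level_metrics(pred_mask, gt_mask):
    """Point-level recall, precision and IoU, following Eqs. (7)-(9)."""
    pred, gt = np.asarray(pred_mask, bool), np.asarray(gt_mask, bool)
    tp = np.logical_and(pred, gt).sum()   # correctly predicted instance points
    fp = np.logical_and(pred, ~gt).sum()  # wrongly predicted instance points
    fn = np.logical_and(~pred, gt).sum()  # missed instance points
    return tp / (tp + fn), tp / (tp + fp), tp / (tp + fp + fn)

# 4 points: 2 true positives, 1 false positive, 1 false negative.
recall, precision, iou = point_level_metrics([1, 1, 1, 0], [1, 1, 0, 1])
# recall = precision = 2/3, IoU = 1/2
```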
Fig. 9. Global views of qualitative results in Shanghai and Wuhan datasets. The segmented street furniture point clouds are overlaid on the raw MLS point clouds
rendered with height. For better visualization, upper parts of some high vegetation and buildings are ltered from raw point clouds.
Weighted coverage (wCov) is widely used to evaluate the performance
of instance segmentation. It calculates the weighted average
instance-wise IoU for ground truth instances r_i, r_k ∈ 𝒢 and their associated
predictions r_j ∈ ℘, as shown in Eq. (10) and Eq. (11).
wCov(𝒢, ℘) = Σ_i w_i · max_j IoU(r_i, r_j), (10)
w_i = |r_i| / Σ_k |r_k|. (11)
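Over boolean instance masks, the wCov computation amounts to a size-weighted mean of each ground truth instance's best IoU against any prediction. A compact sketch with hypothetical helper names:

```python
import numpy as np

def weighted_coverage(gt_masks, pred_masks):
    """wCov per Eqs. (10)-(11): size-weighted mean of each GT instance's
    best IoU against any predicted instance (masks are boolean arrays)."""
    def iou(a, b):
        union = np.logical_or(a, b).sum()
        return np.logical_and(a, b).sum() / union if union else 0.0

    sizes = np.array([m.sum() for m in gt_masks], dtype=float)
    weights = sizes / sizes.sum()  # w_i = |r_i| / sum_k |r_k|
    best = [max(iou(g, p) for p in pred_masks) for g in gt_masks]
    return float(np.dot(weights, best))

# GT: a 4-point and a 2-point instance; predictions cover them partially/fully.
gt_masks = [np.array([1, 1, 1, 1, 0, 0], bool), np.array([0, 0, 0, 0, 1, 1], bool)]
pred_masks = [np.array([1, 1, 0, 0, 0, 0], bool), np.array([0, 0, 0, 0, 1, 1], bool)]
wcov = weighted_coverage(gt_masks, pred_masks)  # (4/6)*0.5 + (2/6)*1.0 = 2/3
```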
Table 3 and Table 4 show the quantitative results in Shanghai and
Wuhan Dataset. From the global perspective, the mean instance recall
and precision of both Shanghai and Wuhan exceed 80.9%, which verifies the
effectiveness of the proposed method. Meanwhile, this universal
framework is valid for all nine mentioned classes. Moreover, the mean
metrics of the Shanghai and Wuhan datasets show no evident performance
differences. The instance-level recall and precision of street lamps, one
of the most common types of street furniture, surpass 92.4% in both Shanghai
(98.0%, 92.4%) and Wuhan (99.0%, 96.0%), achieving the best results
among all categories.
Note that the neural networks in our pipeline are trained on a different
MLS dataset collected in Shanghai. Hence, the method not only applies to
other areas of Shanghai but is also robust to the scene change to Wuhan,
with different city landscapes and data collection systems, which
demonstrates its generalizability.
For trafc signs, indicators in Wuhan are higher than in Shanghai,
and this is because the trafc sign point clouds in Shanghai suffer a more
severe loss due to the scanning angle and the reective properties. On
the contrary, the proposed pipeline performs better on trash bins in
Shanghai than in Wuhan, since the image object detection misses more
trash bin instances. Although the proposed method correctly recalls or
locates most trafc cones and re hydrants, the point-level metrics are
evidently lower than other larger trafc assets. The reason is that the
point number of each instance is limited due to the size, so a few false
positive instances may result in a very considerable decrease in the in-
dicators. Several factors that inuence the performance of our method
will be further discussed in Section 4.6.
4.4. Architecture design analysis
In this part, we present quantitative results using different pipeline
settings to demonstrate the effects of specific modules. The settings are
shown in Table 5. To facilitate the object-centric evaluation, the frustum
association is kept, using the distances between the centroids of predicted
instance points, even if the bounding box branch is removed. The
performance comparison is shown in Fig. 12. Moreover, the effects of the
semantic labels from images and the point cloud intensity are analyzed.
4.4.1. Effects of multi-view fusion
In setting A, the final object-centric binary mask prediction network
after frustum association is eliminated. We use the distance between the
center of the predicted bounding box B and the center of the instance points
to measure their consistency. The frustum having the smallest center
distance is kept for each object, considered as having the most confident
instance segmentation result. Fig. 12 shows an obvious decrease in
point-level mRec and mwCov, but the point-level mPrec is not influenced.
That is, a large ratio of the instance points predicted by the kept
frustum are correct, considering that it produces a very confident
prediction from the frustum point cloud. However, this leads to the
omission of instance points that are not significant enough in
certain frustums, and results in lower point-level coverage and lower
instance-level recall and precision at the IoU threshold. Especially for
tough scenes, such as lamps partly occluded by trees or fire hydrants partly hidden by
Fig. 10. Close-up views of the qualitative results at crossroads from the Shanghai and Wuhan datasets. The raw point clouds are rendered with intensity. Blue: traffic
signs. Yellow: trash bins. Green: street lamps. Red: traffic lights.
Fig. 11. Close-up views of the qualitative results on certain classes of small street furniture. The classes from top to bottom: traffic cones, fire hydrants, potted plants,
roadside benches. Each row shows only one class, and different colors are used to distinguish instances. The raw point clouds are rendered with intensity.
Table 3
Instance-level and point-level quantitative evaluation results on Shanghai Dataset.
Class Instance-level Statistics (IoU@0.5) Point-level Statistics (IoU@0.5)
Recall/% Precision/% TP FN FP Recall/% Precision/% wCov/%
Text Signs 78.8 78.8 26 7 7 77.7 81.0 76.4
Trafc Lights 86.8 88.1 59 9 8 86.1 71.5 76.3
Street Lamps 98.0 92.4 97 2 8 95.4 93.6 94.5
Trash Bins 89.7 85.4 35 4 6 89.5 88.3 87.8
Trafc Cones 83.3 75.0 15 3 5 76.0 39.5 69.4
Fire Hydrants 70.0 77.8 7 3 2 53.2 69.2 58.0
Warning Signs 84.4 75.0 27 5 9 62.8 55.6 52.9
Potted Plants 100.0 75.0 3 0 1 91.2 91.1 91.1
Mean 86.4 80.9 79.0 73.7 75.8
Table 4
Instance-level and point-level quantitative evaluation results on Wuhan Dataset.
Class Instance-level Statistics (IoU@0.5) Point-level Statistics (IoU@0.5)
Recall/% Precision/% TP FN FP Recall/% Precision/% wCov/%
Text Signs 85.7 96.0 24 4 1 73.1 84.8 70.4
Trafc Lights 78.9 86.6 71 19 11 86.8 73.6 78.3
Street Lamps 99.0 96.0 96 1 4 90.6 90.4 89.8
Trash Bins 76.2 72.7 16 5 6 63.2 71.6 66.5
Trafc Cones 79.8 89.7 130 33 15 83.4 70.2 76.3
Fire Hydrants 57.1 66.7 4 3 2 52.0 77.3 51.5
Warning Signs 88.7 94.8 55 7 3 89.2 67.7 70.8
Benches 100.0 100.0 4 0 0 100.0 89.1 89.7
Mean 83.2 87.8 79.8 78.1 74.2
vegetation, the fusion-based final mask prediction gives the pipeline
a second chance to compensate for biases in a single frustum perspective.
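The setting-A frustum selection discussed in this subsection — keeping, per object, the frustum whose predicted box center lies closest to the centroid of its predicted instance points — can be sketched as follows (an illustrative helper, not the authors' code):

```python
import numpy as np

def select_best_frustum(box_centers, instance_points_per_frustum):
    """Return the index of the frustum whose predicted bounding-box center is
    closest to the centroid of its own predicted instance points, interpreted
    as the most self-consistent (most confident) frustum prediction."""
    dists = [
        np.linalg.norm(np.asarray(center) - np.asarray(pts).mean(axis=0))
        for center, pts in zip(box_centers, instance_points_per_frustum)
    ]
    return int(np.argmin(dists))

# Two frustums for one object: the second box agrees better with its points.
best = select_best_frustum(
    box_centers=[[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]],
    instance_points_per_frustum=[
        [[0, 0, 0], [2, 0, 0]],          # centroid [1, 0, 0] -> distance 1.0
        [[5, 0.5, 0], [5, 0.5, 0]],      # centroid [5, 0.5, 0] -> distance 0.5
    ],
)
```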
4.4.2. Ablation studies of the frustum-centric neural network
Different combinations of neural network branches are tested to
demonstrate the effects of the bounding box branch and the semantic
label branch. Generally, compared with solely predicting instance
masks, both branches promote the inventory performance, and they are
mutually reinforcing in this multi-task setting. The bounding box branch,
especially, provides an evident boost. This branch takes the global point
feature as input and estimates a unique instance bounding box, and thus
encourages the backbone to attend to global information within frustums.
The estimated bounding box is then provided as additional information to
the instance mask branch, and the resulting performance increase validates
our intention of adding instance-aware context to strengthen instance mask
prediction. Moreover, the point-level semantic branch further enhances the
feature learning and benefits the final output.
4.4.3. Effects of image class labels and point intensity
Fig. 13 shows the performance comparison after eliminating image
semantic labels or point cloud intensity. As shown, image semantic labels
are of great importance, especially when the frustum contains an
instance of another class, which often occurs at crossroads. The effect of
point intensity is less prominent, which is good for portability to other
MLS systems. Note, however, that in the Shanghai dataset, the recall and
precision of text signs drop from 78.8%, 78.8% to 69.7%, 57.6% without
point intensity, because text sign frustums sometimes contain building
façades that show similar planar geometry features but usually very
different intensity features.
4.5. Comparative studies
In this part, Frustum-PointNet (Qi et al., 2018) is discussed as a
baseline framework to further illustrate our contributions. Frustum-
PointNet is implemented and trained on our experiment dataset; the
quantitative comparison is shown in Table 6 and the visual comparison
in Fig. 14. To facilitate the instance-level comparison, the same
frustum association strategy as in Section 4.4.1 is used. Frustum-
PointNet is an inspiring method that integrates images and point clouds
for 3D object detection, producing instance masks and amodal bounding
boxes. It first predicts the point-level instance mask and then uses it for
bounding box regression. Therefore, the bounding box regression is
heavily dependent on the results of instance segmentation, while the
instance segmentation lacks instance-aware information. Comparatively,
the proposed method enhances the instance context information
in per-frustum instance mask prediction by involving the global feature
and the bounding box supervision as assistance. The visual comparison
shows that our method evidently outperforms F-PointNet by producing
fewer errors in densely vegetated areas and less confusion between object and
Table 5
Settings of the ablation study.
Setting Group B. Box Branch Sem. Label Branch Multi-view Fusion
Fig. 12. Ablation studies on Shanghai and Wuhan Dataset.
Fig. 13. Analysis of the effects of image semantic cues and point intensity.
background. Meanwhile, the instance-level mean recall and precision
increase from 70.8%, 61.7% to 86.4%, 80.9% in Shanghai, and from 76.4%,
77.3% to 83.2%, 87.8% in Wuhan. This increase in instance segmentation
performance verifies the effectiveness of our design.
4.6. Error analysis
Although the proposed method inventories most urban street furniture
in the experiments, some aspects need attention in future
applications and improvements. We statistically analyze three typical
types of errors in our experiments, as shown in Fig. 15. These errors
collectively cover the FP and FN cases in both the Shanghai and Wuhan
datasets.
Object Detection Error. The proposed method is based on the
guidance of image object detection, and hence is influenced by the
quality of the 2D proposals. Typical object detection errors mainly
comprise semantic label errors (Fig. 16(a)) and object omissions (Fig. 16(b)),
and they account for most errors for street lamps and trash bins.
Instance Mask Error. Based on the 2D proposals, our method segments
the street furniture from point clouds. Instance mask errors refer to those
caused by incorrect mask prediction from the neural networks. For
traffic signs and traffic lights, the instance mask prediction errors are
comparatively more considerable. Also, for fire hydrants and traffic
cones, instance mask prediction is more challenging owing to the great
imbalance between instance and background point numbers.
Object Missing in Point Clouds. Although some objects are detected
in the images, the corresponding point clouds are not successfully
collected, which accounts for a minor part of the errors. For example, as
shown in Fig. 16(c), due to the scanning range and the reflective property
of the traffic
Table 6
Quantitative comparison with Frustum-PointNet.
Dataset Method Instance-level Statistics (IoU@0.5) Point-level Statistics (IoU@0.5)
mRecall/% mPrecision/% mRecall/% mPrecision/% mwCov/%
Shanghai F-PointNet 70.8 61.7 59.1 69.3 60.5
Ours 86.4 80.9 79.0 73.7 75.8
Wuhan F-PointNet 76.4 77.3 66.3 71.2 62.6
Ours 83.2 87.8 79.8 78.1 74.2
Fig. 14. Visual comparison of our method and F-PointNet. Street lights: green.
Traffic lights: red. Trash bins: yellow. Traffic signs: blue.
Fig. 15. Statistical error analysis of different classes and types of error.
Fig. 16. Typical errors in image object detection and the missing object in
point clouds. (a) The utility pole is falsely recognized as a street lamp. (b) The
trash bin is omitted due to its inconspicuous appearance. (c) Most points of the
traffic sign are not collected due to the scanning angle and the reflective property.
sign, the trafc sign shows very few point numbers.
5. Conclusion
This study presents a framework for inventorying various street
furniture from MLS point clouds under the guidance of street-view
imagery. The proposed pipeline is thoroughly introduced and contains
three major steps: (1) frustum cropping; (2) per-frustum instance mask
and 3D bounding box prediction; (3) object-centric instance segmentation.
The pipeline is validated by experiments on the Shanghai and Wuhan
datasets, producing component-level street furniture inventory with
concrete semantic labels and instance point clouds. The mean instance
recall and precision respectively reach 86.4%, 80.9% in Shanghai and
83.2%, 87.8% in Wuhan, and the point-level mean recall, precision and
weighted coverage all exceed 73.7%, meeting the demands of urban
administration and outperforming previous studies.
Declaration of Competing Interest
The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to influence
the work reported in this paper.
Acknowledgements
This work is jointly supported by the National Natural Science
Foundation of China Projects (No. 41725005, No. 42172431).
Appendix A. Implementation details
Point Cloud Processing Backbone. Fig. 17 shows the architecture of the point cloud processing backbone. The input P_f contains 4 feature
channels: the point intensity and the XYZ coordinates. Using the multi-scale grouping (MSG) set abstraction layers, it outputs the 1024-
dimensional point cloud global feature F_gpc. Then F_gpc is concatenated with the image one-hot class vector F_img into the global feature F_g. Afterward,
F_g is passed to the feature propagation layers and the network outputs the point features F_p. The subsample radii and the channels of the multi-layer
perceptrons (MLP) are presented in the figure.
The Multi-task Neural Network. The multi-task neural network for processing frustum point clouds contains three branches. The bounding box
branch passes F_g through an MLP with channels [512, 256, 256, 6] and outputs the box coordinates B. The structure of the instance mask branch is
shown in Fig. 18, where FC denotes fully connected layers and LReLU denotes the Leaky ReLU activation layers. It takes F_p, F_a, B as input and outputs
the instance mask M; the last layer is a sigmoid activation layer. The semantic label branch passes F_p through fully connected layers with channels [128,
64, 64, n_s] and a final softmax layer. Fig. 19 shows the training curves of the loss functions. The joint multi-task loss function ℒ_joint =
ℒ_box + ℒ_mask + ℒ_sem consistently converges.
The Object-centric Instance Segmentation Network. This network uses the described point cloud processing backbone followed by FC layers
with channels [128, 64, 64, 1]. The last layer is a sigmoid layer.
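As an illustration of this head's shape (not the trained network), a numpy forward pass through FC layers with output channels [128, 64, 64, 1] and a final sigmoid might look like this; the weights are random placeholders and the 128-d input feature width is an assumption:

```python
import numpy as np

def mask_head(feats, seed=0):
    """Sketch of the object-centric mask head: FC layers with output channels
    [128, 64, 64, 1] on top of per-point backbone features, ending in a
    sigmoid that yields per-point instance probabilities. Weights are random
    placeholders, not trained parameters; biases are omitted for brevity."""
    rng = np.random.default_rng(seed)
    x = np.asarray(feats, dtype=float)
    for i, d_out in enumerate([128, 64, 64, 1]):
        w = 0.05 * rng.standard_normal((x.shape[1], d_out))
        x = x @ w                       # fully connected layer
        if i < 3:
            x = np.maximum(x, 0.0)      # ReLU on the hidden layers
    return 1.0 / (1.0 + np.exp(-x[:, 0]))  # final sigmoid layer

# 10 points with assumed 128-d backbone features -> 10 probabilities in (0, 1).
probs = mask_head(np.random.default_rng(1).standard_normal((10, 128)))
```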
Fig. 17. Structure of the point cloud processing backbone.
Fig. 18. Structure of the instance mask branch in the multi-task neural network.
Anguelov, D., Dulong, C., Filip, D., Frueh, C., Lafon, S., Lyon, R., Ogale, A., Vincent, L.,
Weaver, J., 2010. Google street view: Capturing the world at street level. Computer
43 (6), 32-38.
Barcon, E., Picard, A., 2021. Automatic detection and vectorization of linear and point
objects in 3d point cloud and panoramic images from mobile mapping system. Int.
Arch. Photogramm. Remote Sens. Spatial Inform. Sci. XLIII-B2-2021.
Biljecki, F., Ito, K., 2021. Street view imagery in urban analytics and gis: A review.
Landscape Urban Plan. 215, 104217.
Campbell, A., Both, A., Sun, Q.C., 2019. Detecting and mapping traffic signs from google
street view images using deep learning and gis. Comput. Environ. Urban Syst. 77,
Che, E., Jung, J., Olsen, M.J., 2019. Object recognition, segmentation, and classification
of mobile laser scanning point clouds: A state of the art review. Sensors 19 (4), 810.
Chen, S., Liu, B., Feng, C., Vallespi-Gonzalez, C., Wellington, C., 2020. 3d point cloud
processing and learning for autonomous driving: Impacting map creation,
localization, and perception. IEEE Signal Process. Mag. 38 (1), 68-86.
Chen, X., Ma, H., Wan, J., Li, B., Xia, T., 2017. Multi-view 3d object detection network
for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 1907-1915.
Chen, Y., Wang, S., Li, J., Ma, L., Wu, R., Luo, Z., Wang, C., 2019. Rapid urban roadside
tree inventory using a mobile laser scanning system. IEEE J. Sel. Top. Appl. Earth
Observ. Remote Sens. 12 (9), 3690-3700.
Chen, Y., Wu, R., Yang, C., Lin, Y., 2021. Urban vegetation segmentation using terrestrial
lidar point clouds based on point non-local means network. Int. J. Appl. Earth Obs.
Geoinf. 105, 102580.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U.,
Roth, S., Schiele, B., 2016. The cityscapes dataset for semantic urban scene
understanding. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 3213-3223.
Cui, Y., Chen, R., Chu, W., Chen, L., Cao, D., 2021. Deep learning for image and point
cloud fusion in autonomous driving: A review. IEEE Trans. Intell. Transp. Syst. PP
(99), 1-18.
Gao, Y., 2021. Cascade rcnn. URL:
Gargoum, S.A., Karsten, L., El-Basyouny, K., Koch, J.C., 2018. Automated assessment of
vertical clearance on highways scanned using mobile lidar technology. Autom.
Constr. 95, 260-274.
Gong, Z., Lin, H., Zhang, D., Luo, Z., Zelek, J., Chen, Y., Nurunnabi, A., Wang, C., Li, J.,
2020. A frustum-based probabilistic framework for 3d object detection by fusion of
lidar and camera data. ISPRS J. Photogramm. Remote Sens. 159, 90-100.
Guan, H., Li, J., Yu, Y., Chapman, M., Wang, C., 2014. Automated road information
extraction from mobile laser scanning data. IEEE Trans. Intell. Transp. Syst. 16 (1),
Han, L., Zheng, T., Xu, L., Fang, L., 2020. Occuseg: Occupancy-aware 3d instance
segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 2940-2949.
Han, X., Dong, Z., Yang, B., 2021. A point-based deep learning network for semantic
segmentation of mls point clouds. ISPRS J. Photogramm. Remote Sens. 175,
Hebbalaguppe, R., Garg, G., Hassan, E., Ghosh, H., Verma, A., 2017. Telecom inventory
management via object recognition and localisation on google street view images. In:
2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, p.
Hou, J., Dai, A., Nießner, M., 2019. 3d-sis: 3d semantic instance segmentation of rgb-
d scans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 4421-4430.
Houben, S., Stallkamp, J., Salmen, J., Schlipsing, M., Igel, C., 2013. Detection of traffic
signs in real-world images: The german traffic sign detection benchmark. In: The
2013 International Joint Conference on Neural Networks (IJCNN). IEEE, pp. 1-8.
Hu, Q., Yang, B., Xie, L., Rosa, S., Guo, Y., Wang, Z., Trigoni, N., Markham, A., 2020.
Randla-net: Efficient semantic segmentation of large-scale point clouds. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 11108-11117.
Huang, T., Liu, Z., Chen, X., Bai, X., 2020. Epnet: Enhancing point features with image
semantics for 3d object detection. In: European Conference on Computer Vision.
Springer, pp. 35-52.
Jensen, M.B., Philipsen, M.P., Møgelmose, A., Moeslund, T.B., Trivedi, M.M., 2016.
Vision for looking at traffic lights: Issues, survey, and perspectives. IEEE Trans. Intell.
Transp. Syst. 17 (7), 1800-1815.
Krylov, V.A., Kenny, E., Dahyot, R., 2018. Automatic discovery and geotagging of objects
from street view imagery. Remote Sens. 10 (5), 661.
Kuhn, H.W., 1955. The hungarian method for the assignment problem. Naval Res. Logist.
Quart. 2 (1-2), 83-97.
Lahoud, J., Ghanem, B., Pollefeys, M., Oswald, M.R., 2019. 3d instance segmentation via
multi-task metric learning. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision, pp. 9256-9266.
Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O., 2019. Pointpillars: Fast
encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 12697-12705.
Laumer, D., Lang, N., van Doorn, N., Mac Aodha, O., Perona, P., Wegner, J.D., 2020.
Geocoding of trees from street addresses and street-level images. ISPRS J.
Photogramm. Remote Sens. 162, 125-136.
Li, F., Lehtomäki, M., Elberink, S.O., Vosselman, G., Kukko, A., Puttonen, E., Chen, Y.,
Hyyppä, J., 2019. Semantic segmentation of road furniture in mobile laser scanning
data. ISPRS J. Photogramm. Remote Sens. 154, 98-113.
Li, Y., Ma, L., Zhong, Z., Liu, F., Chapman, M.A., Cao, D., Li, J., 2020. Deep learning for
lidar point clouds in autonomous driving: a review. IEEE Trans. Neural Netw. Learn.
Syst. 32 (8), 3412-3432.
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P., 2017. Focal loss for dense object
detection. In: Proceedings of the IEEE International Conference on Computer Vision,
pp. 2980-2988.
Ma, L., Li, Y., Li, J., Wang, C., Wang, R., Chapman, M.A., 2018. Mobile laser scanned
point-clouds for road object detection and extraction: A review. Remote Sens. 10
(10), 1531.
Ma, Y., Zheng, Y., Easa, S., Wong, Y.D., El-Basyouny, K., 2022. Virtual analysis of urban
road visibility using mobile laser scanning data and deep learning. Autom. Constr.
133, 104014.
Pang, S., Morris, D., Radha, H., 2020. Clocs: Camera-lidar object candidates fusion for 3d
object detection. In: 2020 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS). IEEE, pp. 10386-10393.
Peng, Z., Gao, S., Xiao, B., Guo, S., Yang, Y., 2017. Crowdgis: Updating digital maps via
mobile crowdsensing. IEEE Trans. Autom. Sci. Eng. 15 (1), 369-380.
Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J., 2018. Frustum pointnets for 3d object
detection from rgb-d data. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 918-927.
Qi, C.R., Yi, L., Su, H., Guibas, L.J., 2017. Pointnet++: Deep hierarchical feature learning
on point sets in a metric space. In: Advances in Neural Information Processing
Systems, pp. 5105-5114.
Qi, C.R., Zhou, Y., Najibi, M., Sun, P., Vo, K., Deng, B., Anguelov, D., 2021. Offboard 3d
object detection from point cloud sequences. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 6134-6144.
Sanchez Castillo, E., Griffiths, D., Boehm, J., 2021. Semantic segmentation of terrestrial
lidar data using co-registered rgb data. Int. Arch. Photogramm. Remote Sens. Spatial
Inform. Sci. 43, 223-229.
Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J., 2019. Objects365: A
large-scale, high-quality dataset for object detection. In: Proceedings of the IEEE/
CVF International Conference on Computer Vision, pp. 8430-8439.
Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., Li, H., 2020. Pv-rcnn: Point-voxel
feature set abstraction for 3d object detection. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 10529-10538.
Shi, S., Wang, X., Li, H., 2019. Pointrcnn: 3d object proposal generation and detection
from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 770-779.
Tao, A., Sapra, K., Catanzaro, B., 2020. Hierarchical multi-scale attention for semantic
segmentation. arXiv preprint arXiv:2005.10821.
Vora, S., Lang, A.H., Helou, B., Beijbom, O., 2020. Pointpainting: Sequential fusion for 3d
object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 4604-4612.
Wang, H., Xue, C., Zhou, Y., Wen, F., Zhang, H., 2021. Visual semantic localization based
on hd map for autonomous vehicles in urban scenarios. In: 2021 IEEE International
Conference on Robotics and Automation (ICRA). IEEE, pp. 11255-11261.
Wang, J., Lindenbergh, R., Menenti, M., 2017. Sigvox - a 3d feature matching algorithm
for automatic street object recognition in mobile laser scanning point clouds. ISPRS
J. Photogramm. Remote Sens. 128, 111-129.
Wang, X., Liu, S., Shen, X., Shen, C., Jia, J., 2019. Associatively segmenting instances and
semantics in point clouds. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 4096-4105.
Weng, X., Wang, J., Held, D., Kitani, K., 2020. 3d multi-object tracking: A baseline and
new evaluation metrics. In: 2020 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS). IEEE, pp. 10359-10366.
White House, 2021. Fact sheet: The bipartisan infrastructure deal. URL: https://www.whitehouse.gov/briefing-room/statements-releases/2021/11/06/fact-sheet-the-bipartisan-infrastructure-deal/.
Fig. 19. Training curves of the loss functions in the multi-task neural network.
Yang, B., Dong, Z., 2013. A shape-based segmentation method for mobile laser scanning
point clouds. ISPRS J. Photogramm. Remote Sens. 81, 1930.
Yang, B., Dong, Z., Liu, Y., Liang, F., Wang, Y., 2017. Computing multiple aggregation
levels and contextual features for road facilities recognition using mobile laser
scanning data. ISPRS J. Photogramm. Remote Sens. 126, 180194.
Yang, B., Dong, Z., Zhao, G., Dai, W., 2015. Hierarchical extraction of urban objects from
mobile laser scanning data. ISPRS J. Photogramm. Remote Sens. 99, 4557.
Yang, B., Wang, J., Clark, R., Hu, Q., Wang, S., Markham, A., Trigoni, N., 2019. Learning
object bounding boxes for 3d instance segmentation on point clouds. In: Advances in
Neural Information Processing Systems, pp. 67376746.
Yoo, J.H., Kim, Y., Kim, J., Choi, J.W., 2020. 3d-cvf: Generating joint camera and lidar
features using cross-view spatial feature fusion for 3d object detection. In: European
Conference on Computer Vision. Springer, pp. 720736.
Yu, Y., Li, J., Guan, H., Wang, C., Wen, C., 2016. Bag of contextual-visual words for road
scene object detection from mobile laser scanning data. IEEE Trans. Intell. Transp.
Syst. 17 (12), 3391–3406.
Zhou, Y., Huang, R., Jiang, T., Dong, Z., Yang, B., 2021. Highway alignments extraction
and 3d modeling from airborne laser scanning point clouds. Int. J. Appl. Earth Obs.
Geoinf. 102, 102429.
Zhu, H., Deng, J., Zhang, Y., Ji, J., Mao, Q., Li, H., Zhang, Y., 2021. Vpfnet: Improving 3d
object detection with virtual point based lidar and stereo data fusion. arXiv preprint
Zhu, Z., Liang, D., Zhang, S., Huang, X., Li, B., Hu, S., 2016. Traffic-sign detection and
classification in the wild. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 2110–2118.
... Traffic lights have been considered in the context of pole detection, i.e., as detected traffic light poles, without being represented as individual instances [33], [34], [37]. Further, traffic lights occur as a class in various proposed semantic segmentation algorithms for point clouds relying on conventional algorithms [41], [42], [43] or DNNs [44]. ...
In this work we present a novel deep learning-based approach to detect and specify map deviations in erroneous or outdated high-definition (HD) maps using both sensor and map data as input to a deep neural network (DNN). We first present our proposed reference method for map deviation detection (MDD) utilizing a sensor-only DNN detecting traffic signs, traffic lights, and pole-like objects in LiDAR data, with deviations obtained by subsequently comparing detected objects and examined map. Second, we facilitate the object detection task by using the examined map as additional input to the network. Third, we employ a specialized MDD network to directly infer the correctness of the map input. Finally, we demonstrate the robustness of our approach for challenging scenes featuring occlusions and a reduced point density, e.g., due to heavy rain. Our code is available at
... Other widespread applications of LiDAR technology are found in geology, atmospheric physics or seismology [12]. More specific ones concerning our work are building inspections [85], monitoring of natural environments (landform dynamics [86], ecological resilience [87], etc.), autonomous driving [88], preservation of cultural heritage [89], land mapping or urban planning [90]. ...
The objective of this thesis is to develop a framework capable of handling multiple data sources by correcting and fusing them to monitor, predict, and optimize real-world processes. The scope is not limited to images but also covers the reconstruction of 3D point clouds integrating visible, multispectral, thermal and hyperspectral data. However, working with real-world data is also tedious as it involves multiple steps that must be performed manually, such as collecting data, marking control points or annotating points. Instead, an alternative is to generate synthetic data from realistic scenarios, hence avoiding the acquisition of prohibitive technology and efficiently constructing large datasets. In addition, models in virtual scenarios can be attached to semantic annotations and materials, among other properties. Unlike manual annotations, synthetic datasets do not introduce spurious information that could mislead the algorithms that will use them. Remotely sensed images, albeit showing notable radiometric changes, can be fused by optimizing the correlation among them. This thesis exploits the Enhanced Correlation Co-efficient image-matching algorithm to overlap visible, multispectral and thermal data. Then, multispectral and thermal data are projected into a dense RGB point cloud reconstructed with photogrammetry. By projecting and not directly reconstructing, the aim is to achieve geometrically accurate and dense point clouds from low-resolution imagery. In addition, this methodology is notably more efficient than GPU-based photogrammetry in commercial software. Radiometric data is ensured to be correct by identifying the occlusion of points as well as by minimizing the dissimilarity of aggregated data from the starting samples. Hyperspectral data is, on the other hand, projected over 2.5D point clouds with a pipeline adapted to push-broom scanning. The hyperspectral swaths are geometrically corrected and overlapped to compose an orthomosaic. 
Then, it is projected over a voxelized point cloud. Due to the large volume of the resulting hypercube, it is compressed following a stack-based representation in the radiometric dimension. The real-time rendering of the compressed hypercube is enabled by iteratively constructing an image in a few frames, thus reducing the overhead of single frames. In contrast, the generation of synthetic data is focused on LiDAR technology. The baseline of this simulation is the indexing of scenarios with a high level of detail in state-of-the-art ray-tracing data structures that help to rapidly solve ray-triangle intersections. From here, random and systematic errors are introduced, such as outliers, jittering of rays and return losses, among others. In addition, the construction of large LiDAR datasets is supported by the procedural generation of scenes that can be enriched with semantic annotations and materials. Airborne and terrestrial scans are parameterized to be fed with datasheets from commercial sensors. The airborne scans integrate several scan geometries, whereas the intensity of returns is estimated with BRDF databases that collect samples from a gonio-photometer. In addition, the simulated LiDAR can operate at different wavelengths, including bathymetry, and emulates several returns. This thesis is concluded by showing the benefits of fused data and synthetic datasets with three case studies. The LiDAR simulation is employed to optimize scanning plans in buildings by using local searches to determine optimal scan locations while minimizing the number of required scans with the help of genetic algorithms. These metaheuristics are guided by four objective functions that evaluate the accuracy, coverage, detail, and overlapping of the LiDAR scans. 
Then, thermal infrared point clouds and orthorectified maps are used to locate buried remains and reconstruct the structure of a poorly conserved archaeological site, highlighting the potential of remotely sensed data to support the preservation of cultural heritage. Finally, hyperspectral data is corrected and transformed to train a convolutional neural network in pursuit of classifying different grapevine varieties.
Street View Imagery (SVI) is crucial in estimating indicators such as Sky View Factor (SVF) and Green View Index (GVI), but (1) approaches and terminology differ across fields such as planning, transportation and climate, potentially causing inconsistencies; (2) it is unknown whether the regularly used panoramic imagery is actually essential for such tasks, or we can use only a portion of the imagery, simplifying the process; and (3) we do not know if non-panoramic (single-frame) photos typical in crowdsourced platforms can serve the same purposes as panoramic ones from services such as Google Street View and Baidu Maps for their limited perspectives. This study is the first to examine comprehensively the built form metrics, the influence of different practices on computing them across multiple fields, and the usability of normal photos (from consumer cameras). We overview approaches and run experiments on 70 million images in 5 cities to analyse the impact of a multitude of variants of SVI on characterising the physical environment and mapping street canyons: a few panoramic approaches (e.g. fisheye) and 96 scenarios of perspective imagery with variable directions, fields of view, and aspect ratios mirroring diverse photos from smartphones and dashcams. We demonstrate that (1) disparate panoramic approaches give different but mostly comparable results in computing the same metric (e.g. from R=0.82 for Green View to R=0.98 for Sky View metrics); and (2) often (e.g. when using a front-facing ultrawide camera), single-frame images can derive results comparable to commercial panoramic counterparts. This finding may simplify typical processes of using panoramic data and also unlock the value of billions of crowdsourced images, which are often overlooked, and can benefit scores of locations worldwide not yet covered by commercial services. Further, when aggregated for city-scale analyses, the results correspond closely.
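Both the Green View Index and the Sky View Factor discussed above ultimately reduce to the fraction of a segmentation mask occupied by a given class. A minimal sketch of this unweighted pixel-fraction variant (the class IDs and the 4×4 toy mask are hypothetical; a proper fisheye SVF would additionally weight pixels by solid angle):

```python
import numpy as np

# Hypothetical class IDs, loosely following the CityScapes convention
SKY_ID, VEGETATION_ID = 10, 8

def view_index(label_mask: np.ndarray, class_id: int) -> float:
    """Fraction of mask pixels belonging to `class_id` (GVI/SVF proxy)."""
    return float(np.mean(label_mask == class_id))

# Toy 4x4 "segmentation": top half sky, bottom half vegetation
mask = np.array([[SKY_ID] * 4] * 2 + [[VEGETATION_ID] * 4] * 2)
gvi = view_index(mask, VEGETATION_ID)  # 0.5
svf = view_index(mask, SKY_ID)         # 0.5
```

Aggregating such per-image fractions over many sampled viewpoints is what allows the panoramic and single-frame variants compared in the study to be benchmarked against each other.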
With the rapid development of reality capture methods, such as laser scanning and oblique photogrammetry, point cloud data have become the third most important data source, after vector maps and imagery. Point cloud data also play an increasingly important role in scientific research and engineering in the fields of Earth science, spatial cognition, and smart cities. However, how to acquire high-quality three-dimensional (3D) geospatial information from point clouds has become a scientific frontier, for which there is an urgent demand in the fields of surveying and mapping, as well as geoscience applications. To address the challenges mentioned above, point cloud intelligence came into being. This paper summarizes the state-of-the-art of point cloud intelligence, with regard to acquisition equipment, intelligent processing, scientific research, and engineering applications. For this purpose, we refer to a recent project on the hybrid georeferencing of images and LiDAR data for high-quality point cloud collection, as well as a current benchmark for the semantic segmentation of high-resolution 3D point clouds. These projects were conducted at the Institute for Photogrammetry, the University of Stuttgart, which was initially headed by the late Prof. Ackermann. Finally, the development prospects of point cloud intelligence are summarized.
Digital Twin (DT) offers a novel framework to track, model, analyze, and anticipate complex urban processes and support data-driven decision-making. However, a prerequisite for developing DT applications is a digital inventory of the physical urban built environment, which is often lacking for small- and medium-sized cities due to limited resources. In particular, few digital inventories have been built for urban curb environments, which have been increasingly challenged by new vehicle technologies and emerging mobility services. We propose a data-driven framework to inventory curb facilities across types and locations using computer vision (CV) and Google Street View (GSV) imagery. Specifically, we used a state-of-the-art semantic segmentation model, i.e., DeepLab V3, pre-trained on the CityScapes dataset, to detect curb facilities of interest from GSV images. We then used Inverse Perspective Mapping (IPM) to estimate the spatial location of each detected facility and applied spatial processing to aggregate and filter the estimation results. We demonstrated the framework by inventorying curbs in the Innovation District in the City of Gainesville, FL. This preliminary research contributes to a Smart Curb Digital Twin for safer, more accessible, and more productive curb environments.
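The Inverse Perspective Mapping step mentioned above amounts to intersecting a pixel's viewing ray with a flat ground plane. A hedged sketch under strong assumptions (known intrinsics `K`, level forward-facing camera, flat road; the intrinsics and camera height below are illustrative, not values from the paper):

```python
import numpy as np

def ipm_ground_point(u, v, K, cam_height):
    """Project pixel (u, v) onto the ground plane, assuming a level,
    forward-facing camera mounted `cam_height` metres above a flat road.
    Camera frame: x right, y down, z forward; ground plane is y = cam_height."""
    d = np.linalg.inv(K) @ np.array([u, v, 1.0])  # viewing-ray direction
    if d[1] <= 0:  # ray at or above the horizon never hits the ground
        return None
    t = cam_height / d[1]
    p = t * d
    return p[0], p[2]  # lateral offset, forward distance (metres)

# Illustrative intrinsics: 800 px focal length, 640x480 image
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
# Pixel 80 px below the principal point, camera 2 m above the road:
x, z = ipm_ground_point(320, 320, K, cam_height=2.0)  # (0.0, 20.0)
```

Note how quickly depth grows as pixels approach the horizon, which is why IPM-based localization is usually followed by the spatial aggregation and filtering the abstract describes.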
This study proposes a new computer-aided framework for virtually analyzing urban road visibility using mobile laser scanning (MLS) data. The proposed framework comprises three main parts: 1) building on a data reorganization procedure, a 3D U-net is introduced to tackle the non-stationary noise that significantly hinders accurate detection of stationary sight obstructions; 2) a multi-step procedure is developed to extract the road areas to be estimated and fill data gaps caused by occlusions; and 3) a virtual scanning method (VSM) is proposed to achieve a fast and accurate visibility assessment of the extracted road areas. The proposed VSM also facilitates the application of deep neural networks in the automated driving domain to classify sight obstacles. By enabling multiple outputs, the proposed virtual framework provides a comprehensive understanding of urban road visibility, which can help road administrators detect and understand poor-visibility locations on urban streets.
Street view imagery has rapidly ascended as an important data source for geospatial data collection and urban analytics, deriving insights and supporting informed decisions. Such surge has been mainly catalysed by the proliferation of large-scale imagery platforms, advances in computer vision and machine learning, and availability of computing resources. We screened more than 600 recent papers to provide a comprehensive systematic review of the state of the art of how street-level imagery is currently used in studies pertaining to the built environment. The main findings are that: (i) street view imagery is now clearly an entrenched component of urban analytics and GIScience; (ii) most of the research relies on data from Google Street View; and (iii) it is used across myriads of domains with numerous applications – ranging from analysing vegetation and transportation to health and socio-economic studies. A notable trend is crowdsourced street view imagery, facilitated by services such as Mapillary and KartaView, in some cases furthering geographical coverage and temporal granularity, at a permissive licence.
Accurate highway alignments and three-dimensional (3D) models are essential for various intelligent transportation applications. Airborne laser scanning (ALS) provides a desirable means of data collection, which increases data quality and collection efficiency. However, automatic alignments extraction and 3D modeling remain open problems. Therefore, this paper proposes an automatic framework to extract highway alignments by minimizing an elaborate energy function and reconstruct highway 3D models with the restrictions of alignments. Specifically, the proposed method contains the following steps: 1) Adopt an adaptive method based on spatially smooth and interconnected grid cells to recognize highway pavement points from ALS data. 2) Extract pavement boundaries and lane markings from the pavement areas using the α-shape algorithm and a marking tracking strategy. 3) Extract highway alignments by minimizing an energy function and reconstruct highway 3D models with the restrictions of alignments. The method was validated in scenes of various highways, where the point density is 10-25 pts/m^2. The extracted alignments respectively achieved the correctness of 90.67% and 99.25% and the completeness of 87.60% and 99.55% within 10 cm and 15 cm errors. The root mean square error (RMSE) of the generated 3D model is 2.4 cm on pavement and 5.8 cm on hills and slopes.
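The correctness and completeness figures quoted above are distance-tolerance matching rates between extracted and reference geometry: correctness is the fraction of extracted points lying within the tolerance of some reference point, and completeness the converse. A minimal brute-force sketch (the toy coordinates and the 0.1 m tolerance are illustrative):

```python
import numpy as np

def matched_fraction(a, b, tol):
    """Fraction of points in `a` within `tol` of some point in `b`
    (brute-force nearest neighbour; fine for small evaluation sets)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).min(axis=1)
    return float(np.mean(d <= tol))

extracted = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
reference = np.array([[0.0, 0.0], [1.0, 0.05]])
correctness = matched_fraction(extracted, reference, tol=0.10)   # 2 of 3 matched
completeness = matched_fraction(reference, extracted, tol=0.10)  # 2 of 2 matched
```

For the long alignment polylines in the paper one would densify the curves before matching; a k-d tree would replace the O(nm) distance matrix at scale.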
This paper proposes a semantic segmentation pipeline for terrestrial laser scanning data. We achieve this by combining co-registered RGB and 3D point cloud information. Semantic segmentation is performed by applying a pre-trained off-the-shelf 2D convolutional neural network over a set of projected images extracted from a panoramic photograph. This allows the network to exploit the visual image features learnt by state-of-the-art segmentation models trained on very large datasets. The study focuses on the adoption of the spherical information from the laser capture and assesses the results using image classification metrics. The obtained results demonstrate that the approach is a promising alternative for asset identification in laser scanning data. We demonstrate performance comparable with spherical machine learning frameworks while avoiding the labelling and training efforts required by such approaches.
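The core of the pipeline above is extracting perspective views from a panorama so that an off-the-shelf 2D CNN can be applied. A minimal equirectangular-to-pinhole resampling sketch (nearest-neighbour, yaw rotation only; a full implementation would also handle pitch, roll, and interpolation):

```python
import numpy as np

def pano_to_perspective(pano, fov_deg=90.0, yaw_deg=0.0, out_size=255):
    """Nearest-neighbour resampling of a pinhole view from an
    equirectangular panorama (yaw only; no pitch/roll, no interpolation)."""
    H, W = pano.shape[:2]
    f = (out_size / 2) / np.tan(np.radians(fov_deg) / 2)  # focal length in px
    j, i = np.meshgrid(np.arange(out_size), np.arange(out_size))
    # Ray directions in the camera frame: x right, y down, z forward
    x = (j - out_size / 2 + 0.5) / f
    y = (i - out_size / 2 + 0.5) / f
    z = np.ones_like(x)
    lon = np.arctan2(x, z) + np.radians(yaw_deg)  # azimuth
    lat = np.arctan2(y, np.hypot(x, z))           # elevation (down positive)
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return pano[v, u]

pano = np.arange(100 * 200).reshape(100, 200)  # toy 100x200 panorama
view = pano_to_perspective(pano, fov_deg=90.0, out_size=255)
```

Rendering several such views at different yaws covers the full panorama; the per-view 2D predictions can then be mapped back onto the co-registered points via the same ray geometry.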
Surveys of roadways with Mobile Laser Scanning (MLS) are nowadays a faster and more secure way to collect topographic data than conventional techniques. To deliver topographic plans, the voluminous data collected by the MLS device need to be processed. While the acquisition step is quite fast, the subsequent interpretation and vectorization of the LiDAR data and the panoramic images is laborious and time-consuming. This paper proposes two approaches developed to reduce the time required to process roadway MLS data: the first concerns the automatic detection of pole-like objects, and the second the detection of linear objects. The presented workflow tries to automatically extract a 3D position for each object from MLS data.
It has been well recognized that fusing the complementary information from depth-aware LiDAR point clouds and semantic-rich stereo images would benefit 3D object detection. Nevertheless, it is non-trivial to explore the inherently unnatural interaction between sparse 3D points and dense 2D pixels. To ease this difficulty, recent approaches generally project the 3D points onto the 2D image plane to sample the image data and then aggregate the data at the points. However, these approaches often suffer from the mismatch between the resolution of point clouds and RGB images, leading to sub-optimal performance. Specifically, taking the sparse points as the multi-modal data aggregation locations causes severe information loss for high-resolution images, which in turn undermines the effectiveness of multi-sensor fusion. In this paper, we present VPFNet, a new architecture that cleverly aligns and aggregates the point cloud and image data at “virtual” points. Particularly, with their density lying between that of the 3D points and 2D pixels, the virtual points can nicely bridge the resolution gap between the two sensors and thus preserve more information for processing. Moreover, we also investigate data augmentation techniques that can be applied to both point clouds and RGB images, as data augmentation has made a non-negligible contribution towards 3D object detectors to date. We have conducted extensive experiments on the KITTI dataset and have observed good performance compared to the state-of-the-art methods. Remarkably, our VPFNet achieves 83.21% moderate $AP_{3D}$ and 91.86% moderate $AP_{BEV}$ on the KITTI test set. The network design also takes computational efficiency into consideration: we can achieve 15 FPS on a single NVIDIA RTX 2080Ti GPU. The source code is available at
Urban vegetation inventory at city scale using terrestrial light detection and ranging (LiDAR) point clouds is very challenging due to the large quantity of points, varying local density, and occlusion effects, leading to missing features and incomplete data. This paper proposes a novel method, named Point Non-Local Means (PointNLM) network, which combines supervoxel-based and point-wise processing for automatic semantic segmentation of vegetation from large-scale, complex-scene point clouds. PointNLM captures the long-range relationship between groups of points via a non-local branch cascaded three times to describe sharp geometric features. Simultaneously, a local branch processes the positions of scattered feature points and captures low- and high-level features. Finally, we propose a fusion layer based on a neighborhood max-pooling method to concatenate the long-range, low-level, and high-level features for segmenting trees. The proposed architecture was evaluated on three datasets, including two open-access datasets, Semantic3D and Paris-Lille-3D, and an in-house dataset acquired by a commercial mobile LiDAR system. Experimental results indicate that the proposed method provides an efficient and robust result for vegetation segmentation, achieving an Intersection over Union (IoU) of 94.4%, an F1-score of 92.7%, and an overall accuracy of 96.3%.
Semantic segmentation of point clouds is critical to 3D scene understanding and remains a challenging problem in point cloud processing. Although an increasing number of deep-learning-based methods have been proposed in recent years for semantic segmentation of point clouds, few deep learning networks can be directly used for large-scale outdoor point cloud segmentation, which is essential for urban scene understanding. Given both the challenges of outdoor large-scale scenes and the properties of 3D point clouds, this paper proposes an end-to-end network for semantic segmentation of urban scenes. Three key components are encompassed in the proposed network: (1) an efficient and effective sampling strategy for point cloud spatial downsampling; (2) a point-based feature abstraction module for effectively encoding local features through spatial aggregation; (3) a loss function that addresses the imbalance of different categories, resulting in an overall performance improvement. To validate the proposed network, two datasets were used to check its effectiveness, showing state-of-the-art performance on most of the testing data, achieving mean IoU of 70.8% and 73.9% on Toronto-3D and the Shanghai MLS dataset, respectively.
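The abstract above mentions a loss function addressing category imbalance without giving its exact form; a common recipe, shown here purely as an assumption rather than the paper's formulation, is to weight the cross-entropy by inverse class frequency:

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Per-class loss weights proportional to inverse class frequency,
    normalised so the weights average to 1 (a common re-balancing recipe)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    w = counts.sum() / np.maximum(counts, 1.0)  # guard against empty classes
    return w / w.mean()

labels = np.array([0, 0, 0, 1])  # class 0 is three times as frequent
w = inverse_frequency_weights(labels, num_classes=2)  # -> [0.5, 1.5]
```

In practice the resulting vector is passed as the `weight` argument of a weighted cross-entropy loss, so rare street-furniture classes contribute proportionally more to the gradient.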