ISPRS Journal of Photogrammetry and Remote Sensing 189 (2022) 63–77
Available online 12 May 2022
0924-2716/© 2022 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.
Street-view imagery guided street furniture inventory from mobile laser scanning point clouds

Yuzhou Zhou a, Xu Han a, Mingjun Peng b, Haiting Li b, Bo Yang c, Zhen Dong a,*, Bisheng Yang a,*

a State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China
b Wuhan Geomatics Institute, Wuhan, China
c Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong
ARTICLE INFO
Keywords:
Street-view imagery
Mobile laser scanning
Point clouds
Street furniture
Instance segmentation
Neural network
ABSTRACT
Outdated or sketchy inventory of street furniture may misguide planners on the renovation and upgrade of transportation infrastructures, thus posing potential threats to traffic safety. Previous studies have taken their steps using point clouds or street-view imagery (SVI) for street furniture inventory, but there remains a gap in balancing semantic richness, localization accuracy and working efficiency. Therefore, this paper proposes an effective pipeline that combines SVI and point clouds for the inventory of street furniture. The proposed pipeline encompasses three steps: (1) Off-the-shelf street furniture detection models are applied on SVI to generate two-dimensional (2D) proposals, and three-dimensional (3D) point cloud frustums are accordingly cropped; (2) The instance mask and the instance 3D bounding box are predicted for each frustum using a multi-task neural network; (3) Frustums from adjacent perspectives are associated and fused via multi-object tracking, after which the object-centric instance segmentation outputs the final street furniture with 3D locations and semantic labels. This pipeline was validated on datasets collected in Shanghai and Wuhan, producing a component-level street furniture inventory of nine classes. The instance-level mean recall and precision reach 86.4%, 80.9% and 83.2%, 87.8% respectively in Shanghai and Wuhan, and the point-level mean recall, precision and weighted coverage all exceed 73.7%.
1. Introduction
Recently, substantial and increasing investments have been made in updating road infrastructures by countries worldwide. The U.S. government, as an example, announced that about 110 billion dollars would be spent on improving roads and bridges (White House, 2021). As an important part of this task, the maintenance and upgrade of street furniture ought to take advantage of the existing inventory. Therefore, obsolete or inaccurate inventory of street furniture may mislead transportation planners and constructors, and this will pose potential hazards to traffic safety. However, current automated inventory solutions using only street-view imagery or mobile laser scanning (MLS) point clouds have respectively shown their drawbacks in meeting the comprehensive demand involving working efficiency, localization accuracy and semantic richness, which motivates this study.
Street furniture collectively represents objects and equipment installed along roads for municipal functions, including street lamps, trash bins, traffic lights, traffic signs, etc. (Wang et al., 2017; Guan et al., 2014). Not only are the design and distribution of street furniture closely interrelated with traffic safety and comfort (Ma et al., 2022; Gargoum et al., 2018), but they are also considered as fundamental infrastructures for various cutting-edge transportation applications, such as high-definition (HD) maps (Zhou et al., 2021), vehicle-to-everything (V2X) and autonomous driving (Cui et al., 2021). In the context of intelligent transportation, street furniture inventory is of great significance for city administration and should be oriented toward potential future applications. For example, the visibility of traffic signs and occlusions of street lamps need periodical inspections to reduce latent threats to transportation safety (Jensen et al., 2016). In addition, semantic and three-dimensional (3D) geometric features of street furniture are widely adopted by autonomous vehicles (AV) as localization and planning reference (Chen et al., 2020; Wang et al., 2021). Therefore, beyond just recording the amount, these applications demand both detailed semantic labels and the corresponding 3D geometric information, including locations, shapes and sizes, from the street furniture inventory, as shown in Fig. 1.
* Corresponding authors.
E-mail addresses: zhouyuzhou@whu.edu.cn (Y. Zhou), dongzhenwhu@whu.edu.cn (Z. Dong), bshyang@whu.edu.cn (B. Yang).
https://doi.org/10.1016/j.isprsjprs.2022.04.023
Received 15 December 2021; Received in revised form 24 April 2022; Accepted 25 April 2022
To keep a regularly updated street furniture inventory, street-view imagery analysis has attracted extensive attention due to its visual and complete presentation of road scenes (Laumer et al., 2020). With the prevalence of street-view imagery (SVI) thanks to Google Street View and the upsurge of image understanding algorithms, the automation and efficiency of inventory collection have drastically increased. However, the localization accuracy of SVI based methods is confined to the meter level and the output lacks 3D geometric information (Biljecki and Ito, 2021), so this does not fully satisfy the need of the aforementioned applications, especially the development of autonomous driving. Meanwhile, previous practice has demonstrated that it is promising to inventory street assets by segmenting MLS point clouds (Yang et al., 2015), which features high localization accuracy but falls short in semantic richness (Che et al., 2019). Specifically, when dealing with small objects, point cloud based methods are more prone to confusion because of the relatively small number of points and the lack of sufficient semantic and texture information.
Accordingly, SVI based methods and MLS based methods have complementary properties in terms of localization accuracy and semantic richness. Combining SVI and MLS point clouds may help them supplement each other toward better inventory performance. For example, with SVI providing semantic labels and point clouds providing 3D geometric features, the inventory of traffic signs can be enriched to satisfy the need for localization and planning in autonomous driving. Moreover, although SVI based methods suffer from relatively lower localization accuracy, SVI may guide the 3D survey of small street furniture like fire hydrants or benches in point clouds by indicating a potential search space. In this regard, some pioneering frameworks have been proposed to fuse images and point clouds, among which the methods based on frustums (Qi et al., 2018) offer a heuristic inspiration for our study.
In this study, we firstly leverage off-the-shelf SVI object detection models for two-dimensional (2D) proposal generation and accordingly segment point cloud frustums by projecting the 2D bounding boxes into the point cloud space. Then, the instance mask and the instance bounding box are predicted for each frustum using a multi-task neural network. Lastly, the frustums from different images are associated via object tracking based on 3D bounding boxes, and the final object-centric street furniture instance masks are predicted by fusing the associated frustums. The main contributions of this study are as follows:
•A novel framework for surveying street furniture inventory
combining SVI and MLS data is proposed, which outputs component-
level street furniture semantic labels, 3D locations, and corre-
sponding instance point clouds.
•An effective split-and-merge pattern for processing MLS data is designed, which firstly segments point clouds into frustums and then associates them to be object-centric, and hence reduces the search spaces for street furniture instance segmentation.
•A multi-task neural network perceiving the instance-aware context
and considering the point cloud semantic supervision is designed to
enhance the per-frustum instance mask prediction.
2. Related work
According to the type of data used for surveying road infrastructure, we roughly categorize the related studies into three groups: images only, point clouds only, and combined images and point clouds.
2.1. Inventorying street furniture from street-view imagery
High-level image understanding in street scenes has been greatly propelled by several outstanding public datasets. Cityscapes, for example, provides 5000 densely annotated images for urban street panoptic segmentation (Cordts et al., 2016). Objects365, containing 365 common object categories, covers small objects like fire hydrants and traffic cones (Shao et al., 2019). In the field of traffic sign detection, GTSDB (Houben et al., 2013) and TT100K (Zhu et al., 2016) are prominent for their diversified classes and background scenes. Among the extensive urban street scene understanding methods, the work proposed and implemented by researchers from NVIDIA not only achieves state-of-the-art performance but also shows strong portability and generalizability (Tao et al., 2020). It presents an encouraging performance in multi-scale semantic segmentation with a hierarchical attention mechanism that enables the network to predict weights between scale pairs.
The above-mentioned contributions are 2D only, and to map the road objects, an estimation of their geographical locations should be performed. Google Street View (GSV) is commonly used in relevant studies (Anguelov et al., 2010). Peng et al. (2017) match images to GSV to estimate the picturing position and then locate the POI (Point of Interest) according to the intersection between the picturing direction and buildings in digital maps. Laumer et al. (2020) match detected tree instances to a previous database for the update of the tree inventory. Photogrammetric calculation is used by Campbell et al. (2019) for locating traffic signs from GSV observations. Triangulation is another
Fig. 1. Component-level street furniture inventory based on frustums. The first row shows the correspondences between the MLS system, street-view imagery, frustum point clouds, and instance points. The second row shows the point-level inventory results of four classes in the Wuhan Dataset. Original point clouds that are not street furniture are colored in grayscale according to intensity, with darker gray representing lower intensity.
Table 1
Three-dimensional localization accuracy properties of some representative image based street asset inventory methods.

Dataset   Method                  Class                   Localization Uncertainty
Images    Peng et al. (2017)      Roadside Stores         10 m
          Krylov et al. (2018)    Traffic Lights, Poles   2 m
          Campbell et al. (2019)  Traffic Signs           2–5 m
          Laumer et al. (2020)    Trees                   meter-level
effective tool for estimating object locations. Hebbalaguppe et al. (2017) firstly detect telecom infrastructure from GSV images and then adopt triangulation to locate the instances. Also using GSV and triangulation, Krylov et al. (2018) leverage monocular depth estimation for an initial relative position and then refine it with a fusion and clustering module. These studies follow a similar pattern: (1) detect objects of interest in images; (2) approximately locate the object in the geographical space; (3) refine the locations or match them to the existing inventory database. However, these methods only report meter-level localization accuracy (Table 1), which impedes their further applications. Moreover, the overlapping of similar objects poses great challenges to these methods due to the lack of depth or 3D information.
2.2. Street object extraction from point clouds
To overcome the problems of precisely locating objects, point clouds are a promising type of data source because they present dense 3D coordinates (Li et al., 2020). Since the popularization of mobile laser scanning, extracting pole-like objects along streets has been a main focus (Ma et al., 2018; Chen et al., 2019), and machine learning methods are quite commonly used. Yu et al. (2016) segment point clouds into separated clusters for feature description and then accordingly construct a contextual vocabulary for object recognition. Chen et al. (2021) incorporate voxel-based and point-based features for urban tree inventory. Yang et al. (2017) achieve effective road facility inventory using multi-level geometric and contextual information aggregation followed by a support vector machine (SVM). Li et al. (2019) further separate the poles and their attachments using machine learning classifiers, which is a meaningful step toward component-level road furniture inventory. However, these methods cannot be easily generalized due to the choice of feature calculation units and hand-crafted feature descriptors (Yang and Dong, 2013). Another threat is the overreliance on the performance of pole extraction.
The advancements of deep neural networks in the fields of point cloud object detection and semantic instance segmentation offer an impetus for street furniture surveying. Effectively and elegantly encoding point cloud features is a fundamental task. PointPillars encodes the point cloud feature map for object detection by converting original point clouds to pillar voxels (Lang et al., 2019). PointRCNN designs a two-stage solution that firstly estimates 3D proposals and then refines them by aggregating spatial and semantic features (Shi et al., 2019). PV-RCNN leverages both multi-scale voxel features and aggregated keypoint features for effective object detection (Shi et al., 2020). On the other hand, for point cloud segmentation, RandLA-Net can efficiently produce semantic labels for up to 10^6 points with a simple random sampling strategy and a local feature aggregator (Hu et al., 2020). Jointly predicting semantic and instance labels demands spatially discriminative feature learning. Wang et al. (2019) associate the semantic decoder and instance decoder to make them benefit each other and thus achieve instance segmentation in outdoor point clouds. Lahoud et al. (2019) and Han et al. (2020) both involve spatial vectors from points to instance centers to learn instance-specific information. Yang et al. (2019) predict a group of bounding boxes from the global feature for potential instances and meanwhile predict the per-point mask for all instances, achieving efficient instance segmentation without demanding post-processing procedures. These methods using point clouds feature high localization quality, but face serious challenges in capturing accurate semantic information and dealing with small road objects.
2.3. Road scene parsing combining images and point clouds
Images and point clouds have complementary characteristics in
presenting geometric and semantic information so some researchers
attempt to combine them in their frameworks. In this part, the related
Fig. 2. Pipeline of the proposed method. P_f denotes the frustum point cloud introduced in Section 3.1 and Section 3.2, and P_o denotes the object-centric point cloud introduced in Section 3.3.
studies are roughly categorized into two groups according to the 2D-3D
correspondence levels: pixel-level and object-level.
The pixel-level integration of point clouds and images has been
exploited in some pioneering works. PointPainting projects pixel-level semantic predictions to 3D spaces to filter background points (Vora et al., 2020). Hou et al. (2019) introduce a novel method that summarizes image features and projects the learned 2D features to 3D voxels to supplement the 3D geometric feature. Barcon and Picard (2021) infer the MLS point cloud clusters of street lamps according to the panoramic image instance segmentation results. Sanchez Castillo et al. (2021) establish the correspondence between terrestrial laser scanning data and the panoramic image semantic segmentation result to add a semantic mask to the point clouds. Besides, similarly requiring fine 2D-3D alignment, 3D-CVF and EPNet deeply fuse multi-modal information at the feature scale (Yoo et al., 2020; Huang et al., 2020). Zhu et al. (2021) consider the resolution mismatch between images and points, and therefore design a framework based on virtual points with moderate density to bridge the resolution gap. By contrast, object-level fusion of points and images is also widely explored in autonomous driving for dynamic object detection. MV3D is an influential study that projects object-level 3D proposals generated from bird-view point cloud images to both front-view images and point clouds to learn a fused feature (Chen et al., 2017). Frustum-PointNet, on the contrary, projects 2D proposals to the point cloud space to search for a 3D object (Qi et al., 2018). Gong et al. (2020) also base their object detection framework on frustums, and develop a novel probabilistic localization method. CLOCs associates 2D and 3D detections, feeds the joint detection candidates to a sparse tensor, and then learns the fusion parameters (Pang et al., 2020).
Pixel-level fusion of RGB and 3D data is valid for studies using
relatively sparse point clouds collected with multi-beam laser sensors or
RGB-D data as the correspondences are easy to build. However, for MLS
systems, there is a temporal mismatch between image frames and
scanning samples and the point clouds are much denser, so the pixel-
level correspondences are not so reliable. Besides, noise and occlu-
sions also pose challenges. Therefore, we base our study on object-level
associations between panoramic images and MLS point clouds.
Concretely, we use the 2D object detection proposals from panoramic
images as guidance by projecting them into frustums for reducing search
spaces of locating road furniture.
3. Methods
The proposed method takes combined street-view imagery and MLS
point clouds as input, leveraging both semantic and 3D geometric in-
formation. It outputs component-level street furniture locations with
detailed semantic labels and point-level instance masks. Generally, as shown in Fig. 2, this pipeline comprises three major steps. Firstly, based on the pre-established point-image alignment and the detected street furniture objects, point cloud frustums of interest (FoI) are cropped. Then, a multi-task neural network is used to predict the instance bounding box and the point mask for each FoI. Thirdly, the predicted bounding boxes are used to associate the instances in FoIs from adjacent image frames, and then the final object-centric instance masks are predicted by fusing the instance mask prediction results of the associated frustums.
3.1. Frustum-of-interest cropping
The alignment of street-view imagery and point clouds, both collected with the MLS system, is established using the camera coordinates and Euler angles at the picturing moments, together with pre-calibrated parameters. As shown in Fig. 3, this alignment maps a point (x, y, z) from the local projection coordinate system (x_p, y_p, z_p) to the camera-centric coordinate system (x_c, y_c, z_c), and then to the panoramic pixel coordinate system via spherical projection.
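As an illustration of this 2D-3D mapping, the minimal sketch below assumes an equirectangular panorama of size width × height and a rigid transform (R, t) from the local projection frame to the camera frame; the axis conventions and function names are illustrative assumptions, not the authors' exact calibration.

```python
import numpy as np

def project_to_panorama(p_local, R, t, width, height):
    """Map a 3D point from the local projection frame to panoramic pixel
    coordinates via a rigid transform followed by spherical projection.
    R (3x3) and t (3,) are assumed extrinsics of the panoramic camera."""
    # Local projection frame -> camera-centric frame.
    p_cam = R @ (np.asarray(p_local, dtype=float) - t)
    x, y, z = p_cam
    # Spherical (equirectangular) projection: azimuth and elevation angles.
    azimuth = np.arctan2(x, z)                         # [-pi, pi]
    elevation = np.arcsin(y / np.linalg.norm(p_cam))   # [-pi/2, pi/2]
    # Convert angles to pixel coordinates on the panorama.
    u = (azimuth / (2 * np.pi) + 0.5) * width
    v = (elevation / np.pi + 0.5) * height
    return u, v
```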
We adopt off-the-shelf image object detection models to generate 2D proposals and lift them to the 3D point cloud space for frustum cropping. The model that won the championship of the Objects365 benchmark is suitable and used for this work due to its outstanding performance in detecting a wide range of street furniture objects from panoramic images (Gao, 2021; Shao et al., 2019). Besides, based on the basic types of traffic signs included in Objects365, we further enrich the traffic sign categories using the TT100K dataset by fine-tuning a lighter detection model with it (Zhu et al., 2016). In total, the surveyed street furniture in this setup comprises common transportation infrastructures including traffic signs, traffic lights, street lamps, temporary cone barriers, and other public assets such as fire hydrants, trash bins, potted plants and roadside benches. Moreover, traffic signs are further classified into dozens of categories including text signs, stop signs and various warning or speed limit signs.
Fig. 3. Alignment of street-view imagery and point clouds collected using the MLS system. Point clouds are firstly transformed from the local projection coordinate system to the camera-centric coordinate system through rotation and translation, and then to image plane coordinates via spherical projection. Accordingly, a 2D proposal is mapped to a 3D frustum.
Based on the established alignment, each detection bounding box is lifted to a frustum-of-interest in the point cloud space, and the points within the frustum constitute a frustum point cloud $P_f \in \mathbb{R}^{N \times (3+c)}$, where N is the point number and c is the number of additional feature channels (e.g., intensity, RGB). Then, a rotation of coordinates about the Y axis, making the Z axis point toward the frustum center, is performed for each frustum to avoid excessive variation of point cloud placement (Qi et al., 2018). The rotation angle depends on the horizontal location of the detected object in the image. The points in the frustum are thereby transformed to the frustum coordinate system (x_f, y_f, z_f), as shown in Fig. 4.
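Building on the previous sketch, the illustrative snippet below crops the points whose projections fall inside a detected 2D box and rotates them into the frustum coordinate system; the box format and the reuse of the hypothetical project_to_panorama helper are assumptions made for illustration.

```python
def crop_and_normalize_frustum(points, R, t, box2d, width, height):
    """points: (N, 3+c) array in the local projection frame;
    box2d = (u_min, v_min, u_max, v_max) in panoramic pixel coordinates.
    Returns the frustum point cloud expressed in the frustum coordinate system."""
    u_min, v_min, u_max, v_max = box2d
    keep = []
    for p in points:
        u, v = project_to_panorama(p[:3], R, t, width, height)
        if u_min <= u <= u_max and v_min <= v <= v_max:
            keep.append(p)
    if not keep:
        return np.empty((0, points.shape[1]))
    frustum = np.array(keep, dtype=float)
    # Rotate about the Y (down) axis so that Z points toward the 2D box center.
    u_center = 0.5 * (u_min + u_max)
    heading = (u_center / width - 0.5) * 2 * np.pi     # azimuth of the box center
    c, s = np.cos(-heading), np.sin(-heading)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    xyz_cam = (R @ (frustum[:, :3] - t).T).T           # camera-centric coordinates
    frustum[:, :3] = xyz_cam @ rot_y.T                 # frustum coordinate system
    return frustum
```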
3.2. Per-frustum instance mask and bounding box prediction
Primarily, this part predicts instance masks measuring the probability of being street furniture points for each frustum point cloud $P_f$. It is assumed that a single frustum of interest (FoI) contains only one street furniture instance (i.e., the detected object) and that the other points are clutter. A baseline idea is to directly predict the instance mask from point features, but this method is prone to errors, especially false positives, because it focuses overwhelmingly on point-level local features and lacks awareness of the instance.
Instead, we simultaneously estimate a 3D bounding box of the road furniture and use it as an input of the instance mask prediction, to increase the awareness of the detected instance and the context near the bounding box. Another merit of the bounding box prediction is that it supports the subsequent procedure of FoI fusion by providing concrete information about the location and size of the object. Meanwhile, a point-level semantic prediction branch further promotes the point feature learning. These tasks should be mutually reinforcing in the network, and the experiments confirm this hypothesis. Specifically, we adopt a neural network containing three branches to process each frustum point cloud $P_f$: (a) a bounding box branch; (b) an instance mask branch; (c) a semantic label branch. They share a point cloud processing backbone, as shown in Fig. 5. While this method is not restricted to any specific point cloud processing network, we use PointNet++ as the backbone (Qi et al., 2017). Refer to the Appendix for more detailed neural network settings.
Bounding Box Branch. This branch takes the global feature $F_g$ as input and outputs the min-max coordinates $B = [x_{min}, y_{min}, z_{min}, x_{max}, y_{max}, z_{max}]$ of the estimated bounding box. $F_g$ concatenates the point cloud global feature $F_{g-pc}$ and the one-hot semantic vector $F_{img}$ from image object detection.

We use $\mathcal{L}_{box}$ in Eq. (1) to supervise this branch.

$$\mathcal{L}_{box} = \mathcal{L}_{dis} + \mathcal{L}_{siou}. \quad (1)$$

In Eq. (1), $\mathcal{L}_{dis}$ measures the spatial coordinate differences and $\mathcal{L}_{siou}$ encourages the predicted box to include more valid instance points. $\mathcal{L}_{dis}$ in Eq. (2) measures the Euclidean distance between the predicted bounding box $B$ and the ground truth bounding box $\bar{B}$.

$$\mathcal{L}_{dis} = \frac{1}{6} \sum_{i=1}^{6} \left( B_i - \bar{B}_i \right)^2. \quad (2)$$

And $\mathcal{L}_{siou}$ in Eq. (3) is a soft intersection-over-union (sIoU) loss introduced by Yang et al. (2019).

$$\mathcal{L}_{siou} = - \frac{\sum_{n=1}^{N} q_n \bar{q}_n}{\sum_{n=1}^{N} q_n + \sum_{n=1}^{N} \bar{q}_n - \sum_{n=1}^{N} q_n \bar{q}_n}. \quad (3)$$

In Eq. (3), $q$ ranges from 0 to 1 and measures the degree to which a point lies inside a 3D box (Yang et al., 2019). A point closer to the box center has a higher $q$. $q_n$ and $\bar{q}_n$ respectively denote the $q$ value of the $n$th point in $P_f$ with respect to $B$ and $\bar{B}$.
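A minimal PyTorch-style sketch of this box supervision is given below, under the assumption that the per-point inside-box scores q are computed elsewhere (e.g., following the soft-box formulation of Yang et al., 2019); the function and variable names are illustrative.

```python
import torch

def box_losses(pred_box, gt_box, q_pred, q_gt):
    """pred_box, gt_box: (6,) tensors [xmin, ymin, zmin, xmax, ymax, zmax].
    q_pred, q_gt: (N,) per-point soft inside-box scores in [0, 1] for the
    predicted and ground-truth boxes (used in Eq. 3)."""
    # Eq. (2): mean squared difference over the six box coordinates.
    l_dis = torch.mean((pred_box - gt_box) ** 2)
    # Eq. (3): soft IoU between the two per-point score vectors, negated so
    # that minimizing the loss maximizes the soft IoU.
    intersection = torch.sum(q_pred * q_gt)
    union = torch.sum(q_pred) + torch.sum(q_gt) - intersection
    l_siou = -intersection / (union + 1e-8)
    # Eq. (1): total bounding box loss.
    return l_dis + l_siou
```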
Fig. 4. Point clouds are transformed from the camera-view coordinate system to the frustum-view coordinate system through a rotation about the Y (down) axis.
Fig. 5. Framework of the multi-task neural network. This network takes the semantic class from image object detection and the frustum point cloud as input, and
contains three prediction branches.
Instance Mask Branch. The instance mask prediction branch takes the point features $F_p$, the aggregated feature $F_a$ and the predicted bounding box coordinates $B$ as input. Concretely, $F_a$ and $B$ are respectively tiled by replication to be concatenated with the point-level features. Compared with predicting the instance mask solely depending on point features, the supplementary information from the global feature and the instance bounding box offers instance-aware context. The output is an instance mask $M \in \mathbb{R}^{N \times 1}$ indicating the probability of points belonging to the detected object. A detailed network illustration is shown in the Appendix. In our experiment, we adopt the Focal Loss to tackle the imbalance between instance points and clutter, as shown in Eq. (4), where $\alpha$ and $\gamma$ are hyper-parameters (Lin et al., 2017).

$$\mathcal{L}_{mask} = - \frac{1}{N} \sum_{m_i \in M} \alpha (1 - m_i)^{\gamma} \log(m_i). \quad (4)$$
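The sketch below illustrates the α-balanced focal loss of Lin et al. (2017) applied to the predicted per-point mask probabilities; Eq. (4) shows the positive-class term, and the handling of the clutter (negative) points here is an assumption made for completeness.

```python
import torch

def focal_mask_loss(mask_prob, target, alpha=0.25, gamma=2.0, eps=1e-8):
    """mask_prob: (N,) predicted probability of being an instance point.
    target: (N,) binary labels (1 = instance point, 0 = clutter)."""
    # Positive (instance) term, matching Eq. (4).
    pos = -alpha * (1.0 - mask_prob) ** gamma * torch.log(mask_prob + eps)
    # Negative (clutter) term of the standard binary focal loss.
    neg = -(1.0 - alpha) * mask_prob ** gamma * torch.log(1.0 - mask_prob + eps)
    loss = torch.where(target > 0.5, pos, neg)
    return loss.mean()
```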
Semantic Label Branch. Additionally, most point cloud segmentation datasets have point-level semantic annotations, so a straightforward point-level semantic label prediction branch is adopted to encourage the backbone to learn useful information from point clouds. This branch passes $F_p$ through fully connected layers and a final softmax layer. $\mathcal{L}_{sem}$ is a weighted cross-entropy loss as shown in Eq. (5). In Eq. (5), $y_i^n$ is a binary indicator denoting whether the $n$th point belongs to class $i$ (totally $n_s$ classes), and correspondingly, $p_i^n$ is the predicted probability. The weight $w_i = \mathrm{median}(r) / r_i$ is involved to deal with the imbalance between classes, where $r_i$ denotes the ratio of points belonging to the $i$th class and $\mathrm{median}(r)$ is the median value of these $r_i$.

$$\mathcal{L}_{sem} = - \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{n_s} w_i \, y_i^n \log p_i^n. \quad (5)$$

A joint multi-task loss function $\mathcal{L}_{joint}$ is used to train this network, as shown in Eq. (6).

$$\mathcal{L}_{joint} = \mathcal{L}_{box} + \mathcal{L}_{mask} + \mathcal{L}_{sem}. \quad (6)$$
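A short sketch of the class-ratio weighting and the joint objective follows, assuming the per-class point ratios are computed over the training set and reusing the hypothetical loss helpers from the previous sketches; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def semantic_weights(class_ratios):
    """class_ratios: (n_s,) tensor with the fraction of points per class.
    Returns w_i = median(r) / r_i as in Eq. (5)."""
    return torch.median(class_ratios) / class_ratios

def joint_loss(pred_box, gt_box, q_pred, q_gt, mask_prob, mask_gt,
               sem_logits, sem_labels, class_ratios):
    """Combine the three branch losses as in Eq. (6)."""
    l_box = box_losses(pred_box, gt_box, q_pred, q_gt)
    l_mask = focal_mask_loss(mask_prob, mask_gt)
    # Weighted cross-entropy over per-point class logits (Eq. 5).
    l_sem = F.cross_entropy(sem_logits, sem_labels,
                            weight=semantic_weights(class_ratios))
    return l_box + l_mask + l_sem
```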
3.3. Object-centric instance segmentation
The ultimate pipeline output is supposed to be an object-centric road furniture survey containing semantic and point-level information. Usually, the same street furniture object may occur in adjacent street-view images with overlapping fields of view. Therefore, simply accumulating every instance mask $M$ from all images will lead to the accumulation of instance mask prediction errors. In that regard, an effective fusion strategy is used to fuse $M$ from different image frames, which firstly associates the frustums containing the same instance. Before we associate the frustums, the predicted bounding boxes $B$ and the point clouds are all transformed to the original projection coordinate system. Given the predicted bounding boxes and the orientation parameters of the image frames, the problem is how to assign an instance ID to each frustum (Qi et al., 2021). Concretely, we use the multi-object tracking (MOT) method proposed by Weng et al. (2020), where the 3D bounding box IoU (Intersection over Union) and the semantic class are used as association criteria, as shown in Fig. 6. The association strategy is based on the Hungarian algorithm (Kuhn, 1955), and we eliminate the Kalman filter module since no moving objects are involved. Then, the point clouds of each associated frustum group are translated to the object-centric coordinate system (x_o, y_o, z_o), whose origin is the center of the bounding boxes.
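A simplified sketch of this association step is given below, applying the Hungarian algorithm to a pairwise 3D-box IoU cost between already-associated instance groups and frustums of a new frame; the iou_3d helper and the acceptance threshold are assumptions, and the actual tracker follows Weng et al. (2020).

```python
from scipy.optimize import linear_sum_assignment

def associate_frustums(track_boxes, new_boxes, iou_3d, min_iou=0.25):
    """track_boxes: list of boxes of existing instance groups (same semantic class);
    new_boxes: list of predicted boxes of frustums from the next image frame.
    Returns (matches, unmatched_new) where matches are (track_idx, new_idx) pairs."""
    if len(track_boxes) == 0 or len(new_boxes) == 0:
        return [], list(range(len(new_boxes)))
    # Cost matrix: negative IoU, so the assignment maximizes the total IoU.
    cost = [[-iou_3d(t, n) for n in new_boxes] for t in track_boxes]
    rows, cols = linear_sum_assignment(cost)
    matches, unmatched_new = [], set(range(len(new_boxes)))
    for r, c in zip(rows, cols):
        if -cost[r][c] >= min_iou:          # accept only sufficiently overlapping pairs
            matches.append((r, c))
            unmatched_new.discard(c)
    return matches, sorted(unmatched_new)   # unmatched frustums start new instances
```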
Firstly, a threshold $T_{fm}$ is applied to filter out the clutter points whose $m_f \in M_{frustum}$ is smaller than $T_{fm}$. The kept points from the different frustums are gathered as $P_o \in \mathbb{R}^{N' \times (3+1)}$ with the predicted $m_f$ as the feature channel. Finally, after fusing the instance mask predictions from the associated frustums, a straightforward PointNet++ (Qi et al., 2017) is applied to predict a binary instance segmentation mask, as shown in Fig. 7. In the instance segmentation network, the overlapping points from different frustums are treated as separate points, and they are regarded as one instance point if any of them is predicted to be positive by the network.
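The sketch below summarizes this fusion step under the stated assumptions (T_fm = 0.3 as reported in Section 4.1); how the object-centric origin is chosen and the variable names are illustrative simplifications.

```python
import numpy as np

def fuse_frustums(frustum_points, frustum_masks, box_centers, t_fm=0.3):
    """frustum_points: list of (N_i, 3) arrays in the projection frame;
    frustum_masks: list of (N_i,) per-point instance probabilities m_f;
    box_centers: list of (3,) predicted box centers of the associated frustums."""
    origin = np.mean(np.stack(box_centers), axis=0)    # object-centric origin
    fused = []
    for pts, mask in zip(frustum_points, frustum_masks):
        keep = mask >= t_fm                            # drop confident clutter points
        xyz = pts[keep] - origin                       # translate to (x_o, y_o, z_o)
        fused.append(np.concatenate([xyz, mask[keep, None]], axis=1))
    p_o = np.concatenate(fused, axis=0)                # (N', 3 + 1) input P_o
    return p_o                                         # fed to the object-centric network
```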
4. Experiments and discussions
4.1. Dataset description and experiment settings
Street-view images and MLS point clouds from three regions were used in the experiments. For training the neural networks, we use the manually labeled point clouds introduced by Han et al. (2021), which cover 6.0 km of urban roads in Shanghai. Datasets collected from Shanghai and Wuhan are used to validate the proposed method, as
Fig. 6. Multi-view frustum association. Orange, blue and green points are three frustum point clouds from different perspectives targeting the same instance. They are associated according to the predicted bounding boxes B_1, B_2, B_3.
Fig. 7. Multi-view frustum fusion based object-centric instance segmentation. The predicted clutter points are filtered first, and then the kept points in the associated frustums are gathered into P_o for the final instance segmentation.
shown in Fig. 8. The Shanghai dataset covers 6.5 km of urban roads, 1.3 km of which are manually labeled for quantitative evaluation. The Wuhan dataset covers 3.2 km of urban roads, with 1.3 km of street furniture annotations. Table 2 shows the classes and instance numbers in the experiment dataset. The inventory of nine classes of street furniture is evaluated, covering typical municipal and transportation assets. Our method outputs the concrete semantic class of traffic signs (127 classes in total in the training set), and they are categorized into two groups in the point cloud annotations: text signs and warning signs.
After cropping the frustums, the point intensity and 3D coordinates are used as the feature channels of $P_f$. The multi-scale grouping PointNet++ (Qi et al., 2017) is used as the point cloud processing backbone for predicting per-frustum bounding boxes and instance masks, and is also used to predict the final object-centric instance masks, with hyper-parameters set according to Qi et al. (2018). After frustum association, $T_{fm}$ is empirically set to 0.3 to filter out the points that are predicted with a very low probability of being instance points. The neural networks in the proposed pipeline are trained on an NVIDIA 2080Ti GPU for 50 epochs. Network details are given in the Appendix.
4.2. Qualitative results
Fig. 9 illustrates the qualitative experiment results from a global perspective, where the extracted street lamp, traffic light, traffic sign and trash bin points are overlaid on the raw MLS point clouds, and Fig. 10 presents close-up results at crossroads. As the experiment results show, the proposed pipeline locates the street furniture and outputs their point clouds in typical urban streets and crossroads, presenting their inventories and geometric features, including shapes and sizes, for transportation administration. In general, this method processes all types of detected street furniture using a universal framework and common parameter settings, and it is valid for all the mentioned classes. This class list can be readily extended without excessive labor for designing extra hand-crafted feature descriptors.

Fig. 11 shows the results concerning small street furniture, which illustrates that the proposed method is capable of segmenting traffic cones and temporary traffic barriers, supporting the update of HD maps. Besides, it is significant for residents that fire hydrants are effectively inventoried by the proposed method, considering that they are usually ignored by methods depending solely on point clouds.
Depending solely on images or point clouds can hardly handle the imbalance between semantic richness and localization accuracy, and hence we collectively exploit images and point clouds. As introduced in Section 2.1, although images present abundant semantic information, 3D localization of the objects is mainly achieved using multi-view photogrammetry methods, achieving only meter-level accuracy. The proposed pipeline uses the semantic cues from images as guidance and segments the instances of interest from MLS point clouds, which guarantees much higher absolute localization accuracy and supports further point-level geometric feature analysis. On the other side, most methods using only point clouds as the source data start from extracting pole-like objects based on pre-defined feature descriptors. Li et al. (2019) use point clouds for roadside asset surveying, producing component-level street furniture point segments and their class labels. However, their method relies on rule-based pole extraction and feature design, which limits its flexibility in complex scenes, especially densely vegetated urban streets.
In previous studies using street-view imagery or point clouds for street furniture recognition, the dense vegetation in urban streets often leads to serious occlusions, especially for lamps. By contrast, since the proposed pipeline merges and integrates the semantic and geometric cues from multiple perspectives by subsequently predicting instance masks in the frustum-centric and the object-centric manners, the occlusions do not cause great interference to the results.

Another characteristic of our method is that it produces component-level results under the semantic guidance from street-view imagery. For example, traffic signs and traffic lights mounted on the same pole are respectively surveyed with detailed traffic sign semantic labels, even the weight or speed limits. This benefits the traffic management authorities by providing detailed distribution and semantic information about road assets for analyzing their rationality for traffic planning and consistency with driving safety.
Fig. 8. Experiment dataset in Shanghai (6.5 km) and Wuhan (3.2 km). MLS
point clouds (blue points) are overlaid on the satellite images. Red boxes show
the annotated areas for quantitative evaluation (1.3 km respectively).
Table 2
Instance numbers and corresponding frustum numbers in the training set collected in Shanghai (6.0 km), and the instance number (Ins. Num.) of the annotated part (1.3 km respectively) of the Shanghai and Wuhan test sets.

Class           Training Set                      Test Set Ins. Num.
                Instance Number   Frustum Number  Shanghai Dataset   Wuhan Dataset
Text Signs      179               959             33                 28
Traffic Lights  96                953             68                 90
Street Lamps    287               1888            99                 97
Trash Bins      169               870             39                 21
Traffic Cones   49                204             18                 163
Fire Hydrants   23                48              10                 7
Benches         4                 16              0                  4
Warning Signs   63                466             32                 62
Potted Plants   163               959             3                  0
4.3. Quantitative evaluation
Quantitative evaluation is performed on our annotated street segments, which respectively cover 1.3 km of roads in the Shanghai and Wuhan datasets. To precisely reflect the performance of our proposed method, we adopt two levels of quantitative evaluation indicators in this study.

Instance-level Metrics. Instance-level recall and precision are listed to show the ratio of correctly inventoried instances at a given point IoU threshold (0.5 in this study). If the point IoU of a predicted and a ground truth (GT) instance is greater than the threshold and the assigned semantic label is correct, it is correctly inventoried (i.e., a true positive (TP)). The other predicted instances are regarded as false positives (FP), and the missed GT instances are false negatives (FN). Refer to Eq. (7) and Eq. (8) for the calculation of instance-level recall and precision.
Point-level Metrics. Our pipeline integrates two point mask prediction neural networks and produces street furniture points, so the point-level indicators (precision, recall, weighted coverage) are adopted. The equations of point recall, precision and IoU are shown in Eq. (7), Eq. (8) and Eq. (9). Point-level TP denotes the correctly predicted instance points of the successfully inventoried instances. FP (False Positive) denotes the wrongly predicted instance points and FN (False Negative) denotes the missed instance points.

$$Recall = \frac{TP}{TP + FN}. \quad (7)$$

$$Precision = \frac{TP}{TP + FP}. \quad (8)$$

$$IoU = \frac{TP}{TP + FP + FN}. \quad (9)$$
Fig. 9. Global views of qualitative results in the Shanghai and Wuhan datasets. The segmented street furniture point clouds are overlaid on the raw MLS point clouds rendered by height. For better visualization, the upper parts of some high vegetation and buildings are filtered from the raw point clouds.
Weighted coverage (wCov) is widely used to evaluate the performance of instance segmentation; it calculates the weighted average instance-wise IoU over the ground truth instances $r_i^{\mathcal{G}}, r_k^{\mathcal{G}} \in \mathcal{G}$ and their associated predictions $r_j^{\mathcal{P}} \in \mathcal{P}$, as shown in Eq. (10) and Eq. (11).

$$wCov(\mathcal{G}, \mathcal{P}) = \sum_i w_i \max_j IoU\left( r_i^{\mathcal{G}}, r_j^{\mathcal{P}} \right). \quad (10)$$

$$w_i = \frac{|r_i^{\mathcal{G}}|}{\sum_k |r_k^{\mathcal{G}}|}. \quad (11)$$
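The sketch below illustrates how these point-level metrics can be computed from sets of ground-truth and predicted instance point indices; it is a simplified illustration, not the authors' evaluation code.

```python
def point_iou(gt_idx, pred_idx):
    """IoU between two instances given as sets of point indices (Eq. 9)."""
    inter = len(gt_idx & pred_idx)
    union = len(gt_idx | pred_idx)
    return inter / union if union else 0.0

def weighted_coverage(gt_instances, pred_instances):
    """Eq. (10)-(11): size-weighted average of the best IoU per GT instance."""
    total_points = sum(len(g) for g in gt_instances)
    wcov = 0.0
    for g in gt_instances:
        best_iou = max((point_iou(g, p) for p in pred_instances), default=0.0)
        wcov += len(g) / total_points * best_iou
    return wcov
```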
Table 3 and Table 4 show the quantitative results on the Shanghai and Wuhan datasets. From the global perspective, the mean instance recall and precision of Wuhan and Shanghai exceed 80.9%, which verifies the effectiveness of our proposed method. Meanwhile, this universal framework is valid for all nine mentioned classes. Moreover, the mean metrics of the Shanghai and Wuhan datasets show no evident performance differences. The instance-level recall and precision of street lamps, one of the most basic types of street furniture, surpass 92.4% in both Shanghai (98.0%, 92.4%) and Wuhan (99.0%, 96.0%), achieving the best results among all categories.
Note that the neural networks in our pipeline are trained on a different MLS dataset collected in Shanghai. The method can be applied not only to another area in Shanghai, but is also robust to the scene change to Wuhan with different city landscapes and data collection systems, which supports its generalizability.

For traffic signs, the indicators in Wuhan are higher than in Shanghai, because the traffic sign point clouds in Shanghai suffer a more severe loss due to the scanning angle and the reflective properties. On the contrary, the proposed pipeline performs better on trash bins in Shanghai than in Wuhan, since the image object detection misses more trash bin instances. Although the proposed method correctly recalls and locates most traffic cones and fire hydrants, the point-level metrics are evidently lower than those of other, larger traffic assets. The reason is that the point number of each instance is limited due to the object size, so a few false positive instances may result in a very considerable decrease in the indicators. Several factors that influence the performance of our method are further discussed in Section 4.6.
4.4. Architecture design analysis
In this part, we present quantitative results using different pipeline settings to demonstrate the effects of specific modules. The settings are shown in Table 5. To facilitate the object-centric evaluation, the frustum association is kept using the distances between the centroids of predicted instance points, even if the bounding box branch is removed. The performance comparison is shown in Fig. 12. Moreover, the effects of the semantic labels from images and the point cloud intensity are analyzed.
4.4.1. Effects of multi-view fusion
In setting A, after frustum association, the final object-centric binary mask prediction network is eliminated. We use the distance between the center of the predicted bounding box B and the center of the instance points to measure their consistency. The frustum with the smallest center distance is kept for each object, considered as having the most confident instance segmentation result. Fig. 12 shows an obvious decrease in the point-level mRec and mwCov, but the point-level mPrec is not influenced. That is, a large ratio of the instance points predicted by the kept frustum are correct, considering that they come from a very confident prediction on the frustum point cloud. However, this leads to the omission of some instance points that are not significant enough in certain frustums, and results in lower point-level coverage and lower instance-level recall and precision at the IoU threshold. Especially for tough scenes, like lamps partly occluded by trees or fire hydrants partly hidden by
Fig. 10. Close-up views of the qualitative results at crossroads from the Shanghai and Wuhan datasets. The raw point clouds are rendered by intensity. Blue: traffic signs. Yellow: trash bins. Green: street lamps. Red: traffic lights.
Fig. 11. Close-up views of the qualitative results on certain classes of small street furniture. The classes from top to bottom: traffic cones, fire hydrants, potted plants, roadside benches. Each row shows only one class and different colors are used to distinguish instances. The raw point clouds are rendered by intensity.
Table 3
Instance-level and point-level quantitative evaluation results on the Shanghai Dataset.

Class           Instance-level Statistics (IoU@0.5)           Point-level Statistics (IoU@0.5)
                Recall/%  Precision/%  TP   FN  FP            Recall/%  Precision/%  wCov/%
Text Signs      78.8      78.8         26   7   7             77.7      81.0         76.4
Traffic Lights  86.8      88.1         59   9   8             86.1      71.5         76.3
Street Lamps    98.0      92.4         97   2   8             95.4      93.6         94.5
Trash Bins      89.7      85.4         35   4   6             89.5      88.3         87.8
Traffic Cones   83.3      75.0         15   3   5             76.0      39.5         69.4
Fire Hydrants   70.0      77.8         7    3   2             53.2      69.2         58.0
Warning Signs   84.4      75.0         27   5   9             62.8      55.6         52.9
Potted Plants   100.0     75.0         3    0   1             91.2      91.1         91.1
Mean            86.4      80.9                                79.0      73.7         75.8
Table 4
Instance-level and point-level quantitative evaluation results on the Wuhan Dataset.

Class           Instance-level Statistics (IoU@0.5)           Point-level Statistics (IoU@0.5)
                Recall/%  Precision/%  TP   FN  FP            Recall/%  Precision/%  wCov/%
Text Signs      85.7      96.0         24   4   1             73.1      84.8         70.4
Traffic Lights  78.9      86.6         71   19  11            86.8      73.6         78.3
Street Lamps    99.0      96.0         96   1   4             90.6      90.4         89.8
Trash Bins      76.2      72.7         16   5   6             63.2      71.6         66.5
Traffic Cones   79.8      89.7         130  33  15            83.4      70.2         76.3
Fire Hydrants   57.1      66.7         4    3   2             52.0      77.3         51.5
Warning Signs   88.7      94.8         55   7   3             89.2      67.7         70.8
Benches         100.0     100.0        4    0   0             100.0     89.1         89.7
Mean            83.2      87.8                                79.8      78.1         74.2
vegetation, the fusion based final mask prediction equips the pipeline with a second chance to compensate for biases in the frustum perspective.
4.4.2. Ablation studies of the frustum-centric neural network
Different combinations of the neural network branches are tested to demonstrate the effects of the bounding box branch and the semantic label branch. Generally, compared with solely predicting instance masks, both of these branches show their effects in promoting the inventory performance. They are mutually reinforcing in this multi-task setting. The bounding box branch, especially, provides an evident boost. This branch takes the global point feature as input and estimates a unique instance bounding box, and thus encourages the backbone to notice global information within the frustums. The estimated bounding box is then provided as additional information for the instance mask branch, and the performance increase validates our intention to add instance-aware context to strengthen the instance mask prediction. Moreover, the point-level semantic branch further enhances the feature learning and benefits the final output.
4.4.3. Effects of image class labels and point intensity
Fig. 13 shows the performance comparison after eliminating the image semantic labels or the point cloud intensity. As shown, the image semantic labels are of great importance, especially when the frustum contains an instance of another class, which often occurs at crossroads. The effect of point intensity is not very prominent, which is good for portability to other MLS systems. Note, however, that in the Shanghai Dataset the recall and precision of text signs drop from 78.8%, 78.8% to 69.7%, 57.6% without the point intensity. This is because text sign frustums sometimes contain building façades, which show similarly planar geometric features but usually very different intensity features.
4.5. Comparative studies
In this part, Frustum-PointNet (Qi et al., 2018) is discussed as a baseline framework to further illustrate our contributions. Frustum-PointNet is implemented and trained on our experiment dataset, and the quantitative comparison is shown in Table 6. The visual comparison is shown in Fig. 14. To facilitate the instance-level comparison, the same frustum association strategy as in Section 4.4.1 is used. Frustum-PointNet is an inspiring method integrating images and point clouds for 3D object detection, producing instance masks and amodal bounding boxes. It firstly predicts the point-level instance mask and then uses it for bounding box regression. Therefore, the bounding box regression is heavily dependent on the results of instance segmentation, and the instance segmentation lacks instance-aware information. Comparatively, the proposed method enhances the instance context information in the per-frustum instance mask prediction by involving the global feature and the bounding box supervision as assistance. The visual comparison shows that our method evidently outperforms F-PointNet by producing fewer errors in densely vegetated areas and less confusion between object and
Table 5
Settings of the ablation study.
Setting Group B. Box Branch Sem. Label Branch Multi-view Fusion
A
B
C
D
Full
Fig. 12. Ablation studies on Shanghai and Wuhan Dataset.
Fig. 13. Analysis of the effects of image semantic cues and point intensity.
Y. Zhou et al.
ISPRS Journal of Photogrammetry and Remote Sensing 189 (2022) 63–77
74
background. Meanwhile, the instance-level mean recall and precision increase from 70.8%, 61.7% to 86.4%, 80.9% in Shanghai, and from 76.4%, 77.3% to 83.2%, 87.8% in Wuhan. This increase in the instance segmentation performance verifies the effectiveness of our design.
4.6. Error analysis
Although the proposed method inventories most urban street furniture in the experiments, some aspects need attention in future applications and improvements. We statistically analyze three typical types of errors in our experiments, as shown in Fig. 15. These errors collectively consider the FP and FN in both the Shanghai and Wuhan Datasets.
Object Detection Error. The proposed method is based on the guidance of image object detection, and hence is influenced by the quality of the 2D proposals. Typical object detection errors mainly consist of semantic label errors (Fig. 16(a)) and object omissions (Fig. 16(b)), and they account for most errors for street lamps and trash bins.

Instance Mask Error. Based on the 2D proposals, our method segments the street furniture from the point clouds. Instance mask errors refer to those caused by incorrect mask prediction from the neural networks. For traffic signs and traffic lights, the instance mask prediction errors are comparatively more considerable. Also, for fire hydrants and traffic cones, instance mask prediction is more challenging owing to the great imbalance in point numbers between instance points and background.

Object Missing in Point Clouds. Although some objects are detected in the images, the corresponding point clouds are not successfully collected, which accounts for a minor part of the errors. For example, as shown in Fig. 16(c), due to the scanning range and the reflective property of the traffic
Table 6
Quantitative comparison with Frustum-PointNet.

Dataset    Method       Instance-level Statistics (IoU@0.5)   Point-level Statistics (IoU@0.5)
                        mRecall/%   mPrecision/%              mRecall/%   mPrecision/%   mwCov/%
Shanghai   F-PointNet   70.8        61.7                      59.1        69.3           60.5
           Ours         86.4        80.9                      79.0        73.7           75.8
Wuhan      F-PointNet   76.4        77.3                      66.3        71.2           62.6
           Ours         83.2        87.8                      79.8        78.1           74.2
Fig. 14. Visual comparison of our method and F-PointNet. Street lights: green. Traffic lights: red. Trash bins: yellow. Traffic signs: blue.
Fig. 15. Statistical error analysis of different classes and types of error.
Fig. 16. Typical errors in image object detection and missing objects in point clouds. (a) The utility pole is falsely recognized as a street lamp. (b) The trash bin is omitted due to its inconspicuous appearance. (c) Most points of the traffic sign are not collected due to the scanning angle and the reflective property.
sign, the traffic sign retains very few points.
5. Conclusion
This study presents a framework for inventorying various street
furniture from MLS point clouds under the guidance of street-view im-
agery. The proposed pipeline is thoroughly introduced, which contains
three major steps: (1) frustum cropping; (2) per-frustum instance mask
and 3D bounding box prediction; (3) object-centric instance segmenta-
tion. This pipeline is validated by experiments on Wuhan and Shanghai
datasets, producing component-level street furniture inventory with
concrete semantic labels and instance point clouds. Experiments report
the mean instance recall and precision respectively reach 86.4%, 80.9%
and 83.2%, 87.8% in Shanghai and Wuhan, and the point-level mean
recall, precision, weighted coverage all exceed 73.7%, meeting the de-
mand of urban administration and outperforming previous studies.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is jointly supported by the National Natural Science
Foundation of China Projects (No. 41725005, No. 42172431).
Appendix A. Implementation details
Point Cloud Processing Backbone. Fig. 17 shows the architecture of the point cloud processing backbone. The input $P_f$ contains 4 feature channels, which are the point intensity and the XYZ coordinates. Using the multi-scale grouping (MSG) set abstraction layers, it outputs the 1024-dimensional point cloud global feature $F_{g-pc}$. Then $F_{g-pc}$ is concatenated with the image one-hot class vector $F_{img}$ into the global feature $F_g$. Afterward, $F_g$ is passed to the feature propagation layers and the network outputs the point features $F_p$. The subsampling radius and the channels of the multi-layer perceptron (MLP) layers are presented in the figure.
The Multi-task Neural Network. The multi-task neural network for processing frustum point clouds contains three branches. The bounding box branch passes $F_g$ through an MLP with channels [512, 256, 256, 6] and outputs the box coordinates $B$. The structure of the instance mask branch is shown in Fig. 18, where FC denotes fully connected layers and LReLU denotes the Leaky ReLU activation layers. It takes $F_p$, $F_a$, $B$ as input and outputs the instance mask $M$. The last layer is a sigmoid activation layer. The semantic label branch passes $F_p$ through fully connected layers with channels [128, 64, 64, $n_s$] and a final softmax layer. Fig. 19 shows the training curves of the loss functions. The joint multi-task loss function $\mathcal{L}_{joint} = \mathcal{L}_{box} + \mathcal{L}_{mask} + \mathcal{L}_{sem}$ consistently converges.
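A minimal PyTorch sketch of the three prediction heads described above follows, with the backbone outputs treated as given; the dimension of the aggregated feature F_a, the hidden-layer widths of the mask head, and the class names are assumptions based on Figs. 17-18 rather than the exact published configuration.

```python
import torch
import torch.nn as nn

class FrustumHeads(nn.Module):
    """Sketch of the three branch heads on top of a shared point cloud backbone."""
    def __init__(self, global_dim=1024, agg_dim=1024, point_dim=128, num_sem_classes=10):
        super().__init__()
        # Bounding box branch: MLP [512, 256, 256, 6] on the global feature F_g.
        self.box_head = nn.Sequential(
            nn.Linear(global_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 6))
        # Instance mask branch: per-point FC layers on [F_p | tiled F_a | tiled B].
        self.mask_head = nn.Sequential(
            nn.Linear(point_dim + agg_dim + 6, 128), nn.LeakyReLU(),
            nn.Linear(128, 64), nn.LeakyReLU(),
            nn.Linear(64, 1), nn.Sigmoid())
        # Semantic label branch: per-point FC layers [128, 64, 64, n_s];
        # softmax is applied externally (e.g., inside the cross-entropy loss).
        self.sem_head = nn.Sequential(
            nn.Linear(point_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, num_sem_classes))

    def forward(self, f_g, f_p, f_a):
        box = self.box_head(f_g)                               # (B, 6) box coordinates
        n = f_p.shape[1]
        tiled = torch.cat([f_p,
                           f_a.unsqueeze(1).expand(-1, n, -1),
                           box.unsqueeze(1).expand(-1, n, -1)], dim=-1)
        mask = self.mask_head(tiled).squeeze(-1)               # (B, N) instance mask M
        sem_logits = self.sem_head(f_p)                        # (B, N, n_s) class logits
        return box, mask, sem_logits
```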
The Object-centric Instance Segmentation Network. This network uses the described point cloud processing backbone followed by FC layers
with channels [128, 64, 64, 1]. The last layer is a sigmoid layer.
Fig. 17. Structure of the point cloud processing backbone.
Fig. 18. Structure of the instance mask branch in the multi-task neural network.
References
Anguelov, D., Dulong, C., Filip, D., Frueh, C., Lafon, S., Lyon, R., Ogale, A., Vincent, L.,
Weaver, J., 2010. Google street view: Capturing the world at street level. Computer
43 (6), 32–38.
Barcon, E., Picard, A., 2021. Automatic detection and vectorization of linear and point
objects in 3d point cloud and panoramic images from mobile mapping system. Int.
Arch. Photogramm. Remote Sens. Spatial Inform. Sci. 43. B2–2021.
Biljecki, F., Ito, K., 2021. Street view imagery in urban analytics and gis: A review.
Landscape Urban Plan. 215, 104217.
Campbell, A., Both, A., Sun, Q.C., 2019. Detecting and mapping traffic signs from google street view images using deep learning and gis. Comput. Environ. Urban Syst. 77, 101350.
Che, E., Jung, J., Olsen, M.J., 2019. Object recognition, segmentation, and classification of mobile laser scanning point clouds: A state of the art review. Sensors 19 (4), 810.
Chen, S., Liu, B., Feng, C., Vallespi-Gonzalez, C., Wellington, C., 2020. 3d point cloud
processing and learning for autonomous driving: Impacting map creation,
localization, and perception. IEEE Signal Process. Mag. 38 (1), 68–86.
Chen, X., Ma, H., Wan, J., Li, B., Xia, T., 2017. Multi-view 3d object detection network
for autonomous driving. In: Proceedings of the IEEE conference on Computer Vision
and Pattern Recognition, pp. 1907–1915.
Chen, Y., Wang, S., Li, J., Ma, L., Wu, R., Luo, Z., Wang, C., 2019. Rapid urban roadside
tree inventory using a mobile laser scanning system. IEEE J. Sel. Top. Appl. Earth
Observ. Remote Sens. 12 (9), 3690–3700.
Chen, Y., Wu, R., Yang, C., Lin, Y., 2021. Urban vegetation segmentation using terrestrial
lidar point clouds based on point non-local means network. Int. J. Appl. Earth Obs.
Geoinf. 105, 102580.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U.,
Roth, S., Schiele, B., 2016. The cityscapes dataset for semantic urban scene
understanding. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 3213–3223.
Cui, Y., Chen, R., Chu, W., Chen, L., Cao, D., 2021. Deep learning for image and point
cloud fusion in autonomous driving: A review. IEEE Trans. Intell. Transp. Syst. PP
(99), 1–18.
Gao, Y., 2021. CACascade RCNN. URL: https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.3/static/docs/featured_model/champion_model/CACascadeRCNN.md.
Gargoum, S.A., Karsten, L., El-Basyouny, K., Koch, J.C., 2018. Automated assessment of
vertical clearance on highways scanned using mobile lidar technology. Autom.
Constr. 95, 260–274.
Gong, Z., Lin, H., Zhang, D., Luo, Z., Zelek, J., Chen, Y., Nurunnabi, A., Wang, C., Li, J.,
2020. A frustum-based probabilistic framework for 3d object detection by fusion of
lidar and camera data. ISPRS J. Photogramm. Remote Sens. 159, 90–100.
Guan, H., Li, J., Yu, Y., Chapman, M., Wang, C., 2014. Automated road information
extraction from mobile laser scanning data. IEEE Trans. Intell. Transp. Syst. 16 (1),
194–205.
Han, L., Zheng, T., Xu, L., Fang, L., 2020. Occuseg: Occupancy-aware 3d instance
segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pp. 2940–2949.
Han, X., Dong, Z., Yang, B., 2021. A point-based deep learning network for semantic
segmentation of mls point clouds. ISPRS J. Photogramm. Remote Sens. 175,
199–214.
Hebbalaguppe, R., Garg, G., Hassan, E., Ghosh, H., Verma, A., 2017. Telecom inventory
management via object recognition and localisation on google street view images. In:
2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, p.
725–733.
Hou, J., Dai, A., Nießner, M., 2019. 3d-sis: 3d semantic instance segmentation of rgb-
d scans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 4421–4430.
Houben, S., Stallkamp, J., Salmen, J., Schlipsing, M., Igel, C., 2013. Detection of traffic signs in real-world images: The German traffic sign detection benchmark. In: The 2013 International Joint Conference on Neural Networks (IJCNN). IEEE, pp. 1–8.
Hu, Q., Yang, B., Xie, L., Rosa, S., Guo, Y., Wang, Z., Trigoni, N., Markham, A., 2020. Randla-net: Efficient semantic segmentation of large-scale point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11108–11117.
Huang, T., Liu, Z., Chen, X., Bai, X., 2020. Epnet: Enhancing point features with image
semantics for 3d object detection. In: European Conference on Computer Vision.
Springer, pp. 35–52.
Jensen, M.B., Philipsen, M.P., Møgelmose, A., Moeslund, T.B., Trivedi, M.M., 2016. Vision for looking at traffic lights: Issues, survey, and perspectives. IEEE Trans. Intell. Transp. Syst. 17 (7), 1800–1815.
Krylov, V.A., Kenny, E., Dahyot, R., 2018. Automatic discovery and geotagging of objects
from street view imagery. Remote Sens. 10 (5), 661.
Kuhn, H.W., 1955. The hungarian method for the assignment problem. Naval Res. Logist.
Quart. 2 (1–2), 83–97.
Lahoud, J., Ghanem, B., Pollefeys, M., Oswald, M.R., 2019. 3d instance segmentation via
multi-task metric learning. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision, pp. 9256–9266.
Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O., 2019. Pointpillars: Fast
encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 12697–12705.
Laumer, D., Lang, N., van Doorn, N., Mac Aodha, O., Perona, P., Wegner, J.D., 2020.
Geocoding of trees from street addresses and street-level images. ISPRS J.
Photogramm. Remote Sens. 162, 125–136.
Li, F., Lehtomäki, M., Elberink, S.O., Vosselman, G., Kukko, A., Puttonen, E., Chen, Y.,
Hyyppä, J., 2019. Semantic segmentation of road furniture in mobile laser scanning
data. ISPRS J. Photogramm. Remote Sens. 154, 98–113.
Li, Y., Ma, L., Zhong, Z., Liu, F., Chapman, M.A., Cao, D., Li, J., 2020. Deep learning for
lidar point clouds in autonomous driving: a review. IEEE Trans. Neural Netw. Learn.
Syst. 32 (8), 3412–3432.
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P., 2017. Focal loss for dense object
detection. In: Proceedings of the IEEE International Conference on Computer Vision,
pp. 2980–2988.
Ma, L., Li, Y., Li, J., Wang, C., Wang, R., Chapman, M.A., 2018. Mobile laser scanned
point-clouds for road object detection and extraction: A review. Remote Sens. 10
(10), 1531.
Ma, Y., Zheng, Y., Easa, S., Wong, Y.D., El-Basyouny, K., 2022. Virtual analysis of urban
road visibility using mobile laser scanning data and deep learning. Autom. Constr.
133, 104014.
Pang, S., Morris, D., Radha, H., 2020. Clocs: Camera-lidar object candidates fusion for 3d
object detection. In: 2020 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS). IEEE, pp. 10386–10393.
Peng, Z., Gao, S., Xiao, B., Guo, S., Yang, Y., 2017. Crowdgis: Updating digital maps via
mobile crowdsensing. IEEE Trans. Autom. Sci. Eng. 15 (1), 369–380.
Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J., 2018. Frustum pointnets for 3d object
detection from rgb-d data. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 918–927.
Qi, C.R., Yi, L., Su, H., Guibas, L.J., 2017. Pointnet++: Deep hierarchical feature learning
on point sets in a metric space. In: Advances in Neural Information Processing
Systems, pp. 5105–5114.
Qi, C.R., Zhou, Y., Najibi, M., Sun, P., Vo, K., Deng, B., Anguelov, D., 2021. Offboard 3d
object detection from point cloud sequences. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 6134–6144.
Sanchez Castillo, E., Griffiths, D., Boehm, J., 2021. Semantic segmentation of terrestrial
lidar data using co-registered rgb data. Int. Arch. Photogramm. Remote Sens. Spatial
Inform. Sci. 43, 223–229.
Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J., 2019. Objects365: A
large-scale, high-quality dataset for object detection. In: Proceedings of the IEEE/
CVF International Conference on Computer Vision, pp. 8430–8439.
Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., Li, H., 2020. Pv-rcnn: Point-voxel
feature set abstraction for 3d object detection. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 10529–10538.
Shi, S., Wang, X., Li, H., 2019. Pointrcnn: 3d object proposal generation and detection
from point cloud. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 770–779.
Tao, A., Sapra, K., Catanzaro, B., 2020. Hierarchical multi-scale attention for semantic
segmentation. arXiv preprint arXiv:2005.10821.
Vora, S., Lang, A.H., Helou, B., Beijbom, O., 2020. Pointpainting: Sequential fusion for 3d
object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 4604–4612.
Wang, H., Xue, C., Zhou, Y., Wen, F., Zhang, H., 2021. Visual semantic localization based
on hd map for autonomous vehicles in urban scenarios. In: 2021 IEEE International
Conference on Robotics and Automation (ICRA). IEEE, pp. 11255–11261.
Wang, J., Lindenbergh, R., Menenti, M., 2017. Sigvox–a 3d feature matching algorithm
for automatic street object recognition in mobile laser scanning point clouds. ISPRS
J. Photogramm. Remote Sens. 128, 111–129.
Wang, X., Liu, S., Shen, X., Shen, C., Jia, J., 2019. Associatively segmenting instances and
semantics in point clouds. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 4096–4105.
Weng, X., Wang, J., Held, D., Kitani, K., 2020. 3d multi-object tracking: A baseline and
new evaluation metrics. In: 2020 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS). IEEE, pp. 10359–10366.
White House, 2021. Fact sheet: The bipartisan infrastructure deal. URL: https://www.whitehouse.gov/briefing-room/statements-releases/2021/11/06/fact-sheet-the-bipartisan-infrastructure-deal/.
Yang, B., Dong, Z., 2013. A shape-based segmentation method for mobile laser scanning
point clouds. ISPRS J. Photogramm. Remote Sens. 81, 19–30.
Yang, B., Dong, Z., Liu, Y., Liang, F., Wang, Y., 2017. Computing multiple aggregation
levels and contextual features for road facilities recognition using mobile laser
scanning data. ISPRS J. Photogramm. Remote Sens. 126, 180–194.
Yang, B., Dong, Z., Zhao, G., Dai, W., 2015. Hierarchical extraction of urban objects from
mobile laser scanning data. ISPRS J. Photogramm. Remote Sens. 99, 45–57.
Yang, B., Wang, J., Clark, R., Hu, Q., Wang, S., Markham, A., Trigoni, N., 2019. Learning
object bounding boxes for 3d instance segmentation on point clouds. In: Advances in
Neural Information Processing Systems, pp. 6737–6746.
Yoo, J.H., Kim, Y., Kim, J., Choi, J.W., 2020. 3d-cvf: Generating joint camera and lidar
features using cross-view spatial feature fusion for 3d object detection. In: European
Conference on Computer Vision. Springer, pp. 720–736.
Yu, Y., Li, J., Guan, H., Wang, C., Wen, C., 2016. Bag of contextual-visual words for road
scene object detection from mobile laser scanning data. IEEE Trans. Intell. Transp.
Syst. 17 (12), 3391–3406.
Zhou, Y., Huang, R., Jiang, T., Dong, Z., Yang, B., 2021. Highway alignments extraction
and 3d modeling from airborne laser scanning point clouds. Int. J. Appl. Earth Obs.
Geoinf. 102, 102429.
Zhu, H., Deng, J., Zhang, Y., Ji, J., Mao, Q., Li, H., Zhang, Y., 2021. Vpfnet: Improving 3d
object detection with virtual point based lidar and stereo data fusion. arXiv preprint
arXiv:2111.14382.
Zhu, Z., Liang, D., Zhang, S., Huang, X., Li, B., Hu, S., 2016. Traffic-sign detection and
classification in the wild. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 2110–2118.