Plan-View Wall Detection for Indoor Point Clouds using Weak Supervision
Chialing Wei1 and Thomas Czerniawski2
1Department of Civil, Environmental and Sustainable Engineering, Arizona State University.
ORCID: https://orcid.org/0000-0001-8191-9091. Email: cwei32@asu.edu
2Assistant Professor, Department of Civil, Environmental and Sustainable Engineering, Arizona
State University. Email: thomas.czerniawski@asu.edu
ABSTRACT
We present an automated scan-to-BIM pipeline that simplifies the 3D building object
recognition problem into a 2D recognition problem. We used the Habitat-Matterport 3D Dataset
(HM3D) to train a wall detection model. Training is weakly supervised because the annotations
are noisy labels derived from depth projection. We isolated individual building levels and projected
the points to 2D along the Z-axis (up/down). The architectural component recognition system
detects walls within the plan-view projection of the indoor point cloud. We compare validation
performance metrics on the noisy annotations against human-labeled annotations and analyze
the wall inference results through visualization. Treating the human-labeled annotations as ground
truth and the noisy annotations as predictions, we calculate the average precision of the noisy
labels and compare it to the neural network's performance. We anticipate that this experiment
offers a feasible weak supervision method for simplifying 3D digital model creation from scan data.
INTRODUCTION
With the advancement of IoT, machine learning, and big data technology, the digital twin is highly
valued within the architecture, engineering and construction (AEC) industry. A building's digital
twin is a digital representation of the actual real-world building. It supports predictive simulation
and optimization across the building lifecycle, integration with IoT and real-time system data, and
testing, monitoring, and maintenance for facility management. At the beginning of the roadmap for
digital twin creation, building information modeling (BIM) is an essential input (buildingSMART 2023).
To create BIMs of existing buildings, construction companies use laser scanners to capture
data on site. The data consist of a huge number of points returned from object surfaces during
scanning. These point clouds represent objects such as walls, doors, and windows. After capture,
the raw point cloud data must be processed and manipulated into a 3D digital model, a workflow
known as the scan-to-BIM process. However, this process is mostly manual. To make it more
efficient and less labor-intensive, researchers have focused on automating it with 3D deep learning
algorithms, in which a neural network segments the raw point cloud and recognizes objects from
geometric information in 3D space.
This paper provides two contributions. First, we simplify the task of wall detection in 3D
point clouds by reducing laser scans to 2D plan-view projections. Second, we show that the
wall detector can be trained successfully despite using weak supervision. The wall annotations
sourced from the Habitat-Matterport 3D dataset (Ramakrishnan et al. 2021) are noisy because of
the imperfect underlying mesh segmentation. We analyze the performance of our wall detection
model and demonstrate that the neural network's predictions are of much higher quality than its
training data.
RELATED WORK
Point cloud segmentation. We focus on reviewing point cloud segmentation work that involves
projection-based methods. Gankhuyag and Han (2021) detected wall objects in point cloud data.
The floor and ceiling points are first extracted with the RANSAC algorithm and removed before
wall detection. The remaining points, belonging to walls and furniture, are projected onto the x-y
plane, and a line-detection algorithm is applied to the 2D plane to find wall objects.
Many researchers have applied deep learning to point cloud segmentation and combined other
methods with projection-based approaches. Chen et al. (2017) combined the LiDAR bird's-eye view
(BV), LiDAR front view (FV), and RGB images of an outdoor scene dataset. 3D object proposals
were generated from the BV and projected onto the BV, FV, and image views. A region-based
fusion network combined the three views through ROI pooling and finally output a multiclass
classifier and a 3D box regressor. Kellner et al. (2022) explored different projection-based
approaches on outdoor scene point clouds, including spherical, bird's-eye, and cylindrical views.
Their results show that the bird's-eye projection is flexible, fast, and accurate, but it depends
heavily on the sensor and loses 3D geometric information.
Beyond outdoor point clouds, Ahn et al. (2022) proposed PPConv to segment indoor point
clouds. The point cloud is projected onto a 2D plane and fed into 2D convolutions in a projection
branch, while point features are transformed by multi-layer perceptrons (MLPs) in a point branch.
The two branches are finally fused to output point-wise features. In this study, we simplify the 3D
segmentation problem to a plan-view problem and train 2D neural networks to detect wall
components.
Weak supervision. We also reviewed existing studies of point cloud segmentation using weak
supervision. Most of these studies used fully supervised learning methods as a baseline to evaluate
the quality of the weak supervision. Xu and Lee (2020) state that 3D point cloud segmentation
labels are too costly to obtain. They experimented with weak supervision on the ShapeNet dataset
using two weak annotation schemes: 10% of points labeled, and only one point labeled per category
of an object. They report competitive performance and propose additional training losses covering
inexact supervision, Siamese self-supervision, and spatial and color smoothness. Hu et al. (2022)
proposed the Semantic Query Network (SQN) to segment indoor point clouds using 10%, 1%, and
0.1% of the annotations. They observed that points share semantic similarities with the points in
their local neighborhood and designed a query network to collect relevant features from sparse
signals. The results are promising even when only 0.1% of the annotations are used. In this paper,
we perform weakly supervised learning on projected annotations from open-source data and
compare the results against our human-labeled annotations.
METHODOLOGY
Image preparation. We downloaded 75 indoor scans from the HM3D dataset. Each scan
consists of a 3D mesh GLB file and a semantic text file. The overall process of preparing training
images from the 3D mesh GLB file is shown in Figure 1. We performed midpoint mesh surface
subdivision and Poisson disk sampling to generate a uniformly dense point cloud from each mesh,
as shown in Figure 2. We then projected all points along the vertical direction and created a 2D
histogram. Bins containing higher point densities are rendered black. We assume the black areas
indicate wall instances; these are the objects of interest to detect, as shown in Figure 3.
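The sketch below illustrates this mesh-to-histogram step, assuming Open3D is used for the mesh processing; the file name, point count, bin counts, and density threshold are illustrative assumptions rather than the exact values used in our pipeline.

```python
# Minimal sketch: mesh -> uniformly dense point cloud -> plan-view 2D histogram.
import numpy as np
import open3d as o3d
from PIL import Image

mesh = o3d.io.read_triangle_mesh("scene.glb")            # hypothetical file name
mesh = mesh.subdivide_midpoint(number_of_iterations=1)    # midpoint surface subdivision
pcd = mesh.sample_points_poisson_disk(number_of_points=500_000)  # uniform density

pts = np.asarray(pcd.points)
# Project along the Z-axis (up/down) and bin the x-y coordinates.
hist, xedges, yedges = np.histogram2d(pts[:, 0], pts[:, 1], bins=(2200, 3400))

# High-density bins become black pixels (0), low-density bins white (255).
density_threshold = 20                                     # assumed cutoff
img = np.where(hist.T > density_threshold, 0, 255).astype(np.uint8)
Image.fromarray(img).save("plan_view_histogram.png")
```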
Figure 1. The process of generating a training image file from a 3D mesh GLB file.
Figure 2. Example of a uniformly dense point cloud.
Figure 3. 2D histogram image in which high-density bins are shown in black.
Annotation preparation. To analyze the weakly supervised learning task, noisy "weak" labels
are used for training and high-quality human annotator labels are used for validation and testing.
The noisy labels are created by converting the 3D mesh semantic segmentation labels from the
HM3D dataset into 2D plan-view bounding boxes. The overall noisy annotation generation pipeline
is shown in Figure 4. We extract the hex color of every wall-class object from the semantic file.
Each hex color is converted to an RGB value and matched to vertices in the PLY file; these vertices
are the points belonging to wall objects. To convert these vertices into 2D bounding-box
annotations, we use only their x and y coordinates. The length and width of each bounding box are
the distances between the two farthest points of the wall object along the x and y directions. The
bounding boxes are cleaned up and merged in cases of duplicate overlapping annotations. Finally,
the annotations are converted to pixel coordinates. To ensure accurate performance evaluation, we
manually labeled the validation and test sets on the 2D histogram images as our ground truth
annotation, as shown in Figure 5. To assess the quality of the noisy annotations, we calculated their
average precision against the human annotator labels; the noisy annotations achieve AP50 and
AP20 of 4.4 and 32.8, respectively.
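A minimal sketch of this conversion is shown below, assuming the plyfile library and per-vertex colors; the file name, wall color value, and pixel scaling are hypothetical placeholders, not values from the dataset.

```python
# Minimal sketch: semantic hex color -> wall vertices -> plan-view bounding box.
import numpy as np
from plyfile import PlyData

def hex_to_rgb(hex_color):
    """Convert e.g. '#aa10ff' to an (r, g, b) tuple of 0-255 integers."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

ply = PlyData.read("scene.ply")                  # hypothetical file name
v = ply["vertex"]
colors = np.stack([v["red"], v["green"], v["blue"]], axis=1)
xy = np.stack([v["x"], v["y"]], axis=1)

wall_rgb = hex_to_rgb("#aa10ff")                 # color from the semantic text file (value hypothetical)
mask = np.all(colors == wall_rgb, axis=1)        # vertices belonging to this wall object
wall_xy = xy[mask]

# Axis-aligned bounding box from the farthest points along x and y.
x_min, y_min = wall_xy.min(axis=0)
x_max, y_max = wall_xy.max(axis=0)

# Convert world coordinates to pixel coordinates of the 2D histogram image.
meters_per_pixel = 0.01                          # assumed resolution
box_px = [c / meters_per_pixel for c in (x_min, y_min, x_max, y_max)]
```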
Figure 4. The process of generating noisy annotations from the semantic and PLY files.
Figure 5. Left: 2D histogram; middle: noisy annotation; right: human annotator label.
Wall detection model training. We divided the 75 indoor scans into a training set of 60, a
validation set of 8, and a test set of 7. Following our previous work (Wei et al. 2022), each projected
image is rescaled to 2200 x 3400 pixels and 100 random crops are taken from each one. Each crop
is 800 x 800 pixels and is fed into the neural network model, as shown in Figure 6. We trained a
Faster R-CNN wall detection model using the Detectron2 library (Wu et al. 2019). The
hyperparameters include a learning rate of 0.00025, a batch size of 8, and 53 epochs. After training,
we evaluated the model on the five sets listed in Table 1: the training set, the validation set with
noisy annotations, the validation set with human-labeled annotations, the test set with human-labeled
annotations, and the combined human-labeled validation and test sets.
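The following sketch shows a Detectron2 training configuration with the hyperparameters listed above; the dataset registration names, the backbone config file, and the iteration count derived from the epoch budget are assumptions rather than the exact setup.

```python
# Minimal sketch of Faster R-CNN training with Detectron2 (Wu et al. 2019).
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("walls_train",)          # hypothetical registered dataset names
cfg.DATASETS.TEST = ("walls_val_noisy",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1            # single "wall" class
cfg.SOLVER.BASE_LR = 0.00025                   # learning rate from the paper
cfg.SOLVER.IMS_PER_BATCH = 8                   # batch size from the paper
cfg.SOLVER.MAX_ITER = (6000 // 8) * 53         # approx. 53 epochs over 6,000 training crops

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```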
Figure 6. 800 x 800 crop from Figure 4.
Table 1. Experiment setting.

Set                  No. of images    Type of label
training             6,000            noisy
validation           800              noisy
validation           800              human-labeled
test                 700              human-labeled
validation & test    1,500            human-labeled
Wall detection model inference. We apply the same processing to new 3D data at inference
time to obtain a 2D histogram image. The image is cropped into 800 x 800 pixel tiles in a
sliding-window sequence and fed into the wall detection model, which returns bounding boxes and
confidence scores for the inferred walls in each crop. The final step reassembles the crops to the
original image size and merges adjacent bounding boxes.
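A minimal sliding-window inference sketch follows; the checkpoint path, image file name, crop stride, and coordinate bookkeeping are illustrative assumptions, and `cfg` is the config object built in the training sketch above.

```python
# Minimal sketch: sliding-window inference over the full-size 2D histogram image.
import cv2
from detectron2.engine import DefaultPredictor

cfg.MODEL.WEIGHTS = "output/model_final.pth"   # trained checkpoint (assumed path)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5    # confidence cutoff used in Table 2
predictor = DefaultPredictor(cfg)

image = cv2.imread("new_scan_histogram.png")   # hypothetical full-size histogram image
crop, stride = 800, 800
all_boxes, all_scores = [], []
for y0 in range(0, image.shape[0] - crop + 1, stride):
    for x0 in range(0, image.shape[1] - crop + 1, stride):
        tile = image[y0:y0 + crop, x0:x0 + crop]
        out = predictor(tile)["instances"].to("cpu")
        boxes = out.pred_boxes.tensor.numpy()
        boxes[:, [0, 2]] += x0                 # shift crop coordinates back to
        boxes[:, [1, 3]] += y0                 # full-image coordinates
        all_boxes.append(boxes)
        all_scores.append(out.scores.numpy())
```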
RESULTS AND DISCUSSION
The training and validation loss curves are shown in Figure 7. Training ran for a total of
106 epochs; we selected the model at 53 epochs because the network begins to overfit beyond that point.
Figure 7. Training and validation loss of the neural network versus the number of epochs.
Performance metrics are shown in Table 2. We computed average precision at IoU thresholds
of 0.5 (AP50) and 0.2 (AP20) for different prediction-score and non-maximum suppression (NMS)
threshold settings. When we remove predicted bounding boxes with confidence lower than 0.5,
AP20 is about 50%. If we lower the acceptable confidence score to 0.05 and add a non-maximum
suppression threshold to remove duplicate predictions, AP20 increases to a range of 51% to 66%.
These results are surprisingly high considering that the noisy labels the network was trained on
achieve an AP20 of only 32.8 when evaluated against the human labels. Both the AP50 and AP20
of the trained model's predictions are higher than the quality metrics of the noisy annotations.
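The confidence filter and NMS step behind these settings can be sketched as follows; the evaluation code itself is not reproduced in the paper, so the function below is only an assumed illustration using torchvision's NMS operator.

```python
# Minimal sketch: filter predictions by confidence score, then suppress duplicates with NMS.
import torch
from torchvision.ops import nms

def filter_predictions(boxes, scores, score_thresh=0.05, nms_thresh=0.1):
    """boxes: (N, 4) tensor in (x1, y1, x2, y2); scores: (N,) tensor."""
    keep = scores >= score_thresh                        # drop low-confidence boxes
    boxes, scores = boxes[keep], scores[keep]
    keep = nms(boxes, scores, iou_threshold=nms_thresh)  # remove duplicate detections
    return boxes[keep], scores[keep]

# Toy usage with made-up boxes: the near-duplicate low-score box is suppressed.
boxes = torch.tensor([[0., 0., 100., 10.], [2., 0., 98., 12.]])
scores = torch.tensor([0.9, 0.4])
kept_boxes, kept_scores = filter_predictions(boxes, scores, nms_thresh=0.1)
```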
As the visualization in Figure 5 shows, the thickness of the noisy annotations is
inconsistent: the annotated wall is sometimes wider and sometimes narrower than the actual wall
boundary, whereas the human-labeled annotations always place the bounding box along the
boundary of the wall object. Moreover, the noisy annotations sometimes contain duplicate boxes
on the same wall object, which does not happen in the human-labeled annotations. These
characteristics affect the inference results, as shown in Figure 8. After running inference on new
scan data, we reassembled the small crops in sequence. Each recognized wall object is surrounded
by a bounding box annotated with the wall class and the prediction confidence score. We then
post-processed the prediction output by merging adjacent bounding boxes, as shown in Figure 9
and sketched below.
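A simple greedy merge of adjacent or overlapping wall boxes could look like the sketch below; the adjacency tolerance and the merge strategy are assumptions about this post-processing step, not the exact implementation.

```python
# Minimal sketch: repeatedly merge boxes that overlap or lie within a small pixel tolerance.
def boxes_adjacent(a, b, tol=10):
    """True if two (x1, y1, x2, y2) boxes overlap or lie within tol pixels of each other."""
    return not (a[2] + tol < b[0] or b[2] + tol < a[0] or
                a[3] + tol < b[1] or b[3] + tol < a[1])

def merge_adjacent_boxes(boxes, tol=10):
    boxes = [list(b) for b in boxes]
    merged = True
    while merged:                                # repeat until no pair can be merged
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if boxes_adjacent(boxes[i], boxes[j], tol):
                    a, b = boxes[i], boxes[j]
                    boxes[i] = [min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3])]
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes
```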
Table 2. Performance metrics with various score and non-maximum suppression (NMS) threshold settings.

                                        Score = 0.5      Score = 0.05, NMS = 0.1   Score = 0.05, NMS = 0.2
Set                  Label type         AP50    AP20     AP50    AP20              AP50    AP20
training             noisy              19.2    43.5     15.7    51.6              16.8    52.2
validation           noisy              18.1    45.4     14.8    51.9              15.9    52.6
validation           human-labeled      27.4    48.7     27.7    65.6              28.6    63.8
test                 human-labeled      29.3    51.8     26.2    63.1              27.4    62.7
validation & test    human-labeled      28.0    49.7     26.9    64.1              27.8    63.0
We found that when the training set contains noisy annotations, the neural network
captures only the stable pattern and filters out the noise (e.g., duplicate overlapping bounding
boxes). This experiment shows that our neural network model can overcome noise in the training
data and learn the stable pattern needed for prediction. Future work will include, first, applying
more iterations of midpoint surface subdivision to the mesh so that the sampled point cloud is
denser; the black areas in the 2D histogram images become more solid as the point cloud density
increases. Second, we will increase the size of the small crops or the resolution of the whole 2D
histogram images. Lastly, we will explore additional pre-processing of the noisy annotations,
which could lead to better training performance.
Figure 8. Inference visualization with class label and confidence prediction.
Figure 9. Inference visualization after merging adjacent bounding boxes.
CONCLUSION
This study presents wall detection in projected point clouds using weak supervision. We
used the open-source HM3D dataset to train our neural network model. HM3D is a collection of
high-resolution 3D scans of indoor environments, and we focus mainly on detecting wall
components. We created 2D histogram images from the projected point clouds as input to neural
network training and trained the model under weak supervision using annotations projected from
the HM3D dataset. To understand the performance of our model, we also validated against
human-labeled annotations. The results show that validation on the human-labeled set scores
higher than on the noisy labels, since the former is more consistent and contains no overlapping
annotations. We conclude that our neural network model can learn the stable pattern in the noisy
training set. We further suggest increasing the point cloud density, raising the training image
resolution, and applying additional rule-based processing to the noisy annotations before training.
We consider this projected 2D recognition approach with noisy annotations to be feasible, and its
performance could be improved by the suggestions above.
REFERENCES
Ahn, P., Yang, J., Yi, E., Lee, C., and Kim, J. (2022) “Projection-based Point Convolution for
Efficient Point Cloud Segmentation.” IEEE Access.
buildingSMART. (2023). "Take BIM Processes to the next level with Digital Twins."
<https://www.buildingsmart.org/take-bim-processes-to-the-next-level-with-digital-twins/>
(Feb. 17, 2023).
Chen, X., Ma, H., Wan, J., Li, B., and Xia, T. (2017). “Multi-view 3d object detection network
for autonomous driving.” Proc., IEEE Conference on Computer Vision and Pattern
Recognition, 6526–6534.
Gankhuyag, U., and Han, J.-H. (2021). "Automatic BIM Indoor Modelling from Unstructured
Point Clouds Using a Convolutional Neural Network." Intell. Autom. & Soft Computing,
28, 133-152.
Hu, Q., Yang, B., Fang, G., Guo, Y., Leonardis, A., Trigoni, N., and Markham, A. (2022). "SQN:
Weakly-supervised semantic segmentation of large-scale 3D point clouds." European
Conference on Computer Vision, 600-619.
Kellner, M., Stahl, B. and Reiterer, A. (2022). “Fused Projection-Based Point Cloud
Segmentation.” Sensors, 22, 1139.
Ramakrishnan, S. T., Gokaslan, A., Wijmans, E., Maksymets, O., Clegg, A., Turner, J. M.,
Undersander, E., Galuba, W., Westbury, A., Chang, A. X., Savva, M., Zhao, Y., and
Batra, D. (2021). "Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D
Environments for Embodied AI." Thirty-fifth Conference on Neural Information
Processing Systems Datasets and Benchmarks Track.
Wei, C., Gupta, M., and Czerniawski, T. (2022). “Automated Wall Detection in 2D CAD
Drawings to Create Digital 3D Models.” Proc., International Symposium on Automation
and Robotics in Construction, 152-158.
Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., and Girshick, R. (2019). "Detectron2."
<https://github.com/facebookresearch/detectron2>.
Xu, X. and Lee, G. H. (2020). “Weakly supervised semantic point cloud segmentation: Towards
10x fewer labels.” Proc., IEEE Conference on Computer Vision and Pattern Recognition,
13,706–13,715.