Plan-View Wall Detection for Indoor Point Clouds using Weak Supervision
Chialing Wei1 and Thomas Czerniawski2
1Department of Civil, Environmental and Sustainable Engineering, Arizona State University. ORCID: https://orcid.org/0000-0001-8191-9091. Email: cwei32@asu.edu
2Assistant Professor, Department of Civil, Environmental and Sustainable Engineering, Arizona State University. Email: thomas.czerniawski@asu.edu
ABSTRACT
We present an automated scan-to-BIM pipeline that simplifies the 3D building object recognition problem into a 2D recognition problem. We used the Habitat-Matterport 3D Dataset (HM3D) to train a wall detection model. Because the training annotations are noisy depth-projected labels, the learning is weakly supervised. We isolated individual building levels and projected the points to 2D along the Z-axis (up/down). The architectural component recognition system then detects walls within the plan-view projection of the indoor point cloud. We compare validation performance on the noisy annotations against human-labeled annotations and analyze the wall inference results through visualization. To quantify annotation quality, we treat the human-labeled annotations as ground truth and the noisy annotations as predictions when calculating average precision, and we compare these values to the neural network's performance. We anticipate this experiment provides a feasible weak supervision method for simplifying 3D digital model creation from scan data.
INTRODUCTION
With the advancement of IoT, machine learning, and big data technologies, the digital twin is highly valued within the architecture, engineering, and construction (AEC) industry. The digital twin of a building is a digital representation of an actual real-world building. It supports predictive simulation and optimization across the building lifecycle, integration with IoT and real-time system data, and testing, monitoring, and maintenance for facility management. At the beginning of the roadmap for digital twin creation, building information modeling (BIM) is an essential input (buildingSMART 2023).
To create BIMs of existing buildings, construction companies use laser scanners to capture data on site. The data consist of a large number of points returned from scanned surfaces, and these point clouds represent objects such as walls, doors, and windows. After capture, the raw point cloud data must be processed and manipulated into a 3D digital model, a workflow referred to as the scan-to-BIM process. However, this process is still mostly manual. To make it more efficient and less labor-intensive, some researchers focus on automating it with 3D deep learning algorithms, in which a neural network segments the raw point cloud and recognizes objects from geometric information in 3D space.
This paper makes two contributions. First, we simplify the task of wall detection in 3D point clouds by reducing laser scans to 2D plan-view projections. Second, we show that the wall detector can be trained successfully despite weak supervision. The wall annotations sourced from the Habitat-Matterport 3D dataset (Ramakrishnan et al. 2021) are noisy because of the imperfect underlying mesh segmentation. We analyze the performance of our wall detection model and demonstrate neural network predictions of much higher quality than the training data.
RELATED WORK
Point cloud segmentation. We focus on reviewing point cloud segmentation work that uses projection-based methods. Gankhuyag and Han (2021) detected wall objects from point cloud data. The floor and ceiling points are first extracted with the RANSAC algorithm and removed before the wall detection task. The remaining point cloud, containing walls and furniture, is projected onto the x-y plane, and a line-detection algorithm is applied in 2D to find the walls.
Many researchers have applied deep learning to point cloud segmentation and combined it with projection-based methods. Chen et al. (2017) combined the LiDAR bird's-eye view (BV), LiDAR front view (FV), and RGB image of an outdoor scene dataset. 3D object proposals were generated from the BV and projected to the BV, FV, and image views. A region-based fusion network combined the three views through ROI pooling and finally output a multiclass classifier and a 3D box regressor. Kellner et al. (2022) explored different projection-based approaches on outdoor scene point clouds, including spherical, bird's-eye, and cylindrical views. Their results show that the bird's-eye projection is flexible, fast, and accurate, but it depends strongly on the sensor and loses 3D geometric information.
Beyond outdoor point clouds, Ahn et al. (2022) proposed PPConv to segment indoor point clouds. The point cloud is projected onto a 2D plane and fed into 2D convolutions in a projection branch, while point features are transformed by multi-layer perceptrons (MLPs) in a point branch. Ultimately, the two branches are fused to output point-wise features. In this study, we simplify the 3D segmentation problem to a plan-view problem and train a 2D neural network to detect wall components.
Weak supervision. We reviewed existing studies on point cloud segmentation using weak supervision. Most of these studies use fully supervised learning as a baseline to evaluate the quality of the weak supervision. Xu and Lee (2020) state that 3D point cloud segmentation labels are too costly to obtain. They experimented with weak supervision on the ShapeNet dataset and created two weak annotation schemes: 10% of points labeled, and only one labeled point per object part category. They claim competitive performance and propose additional training losses covering inexact supervision, Siamese self-supervision, and spatial and color smoothness. Hu et al. (2022) proposed the Semantic Query Network (SQN) to segment indoor point clouds using 10%, 1%, and 0.1% annotations. Observing that points share semantic similarities with points in their local neighborhood, they designed a query network that collects relevant features from sparse supervision signals. The results are promising even when only 0.1% of the annotations are used. In this paper, we perform weakly supervised learning on projected annotations from open-source data and compare the results with our human-labeled annotations.
METHODOLOGY
Image preparation. We downloaded 75 indoor scans from the HM3D dataset. Each scan consists of a 3D mesh GLB file and a semantic text file. The overall process of preparing training images from the 3D mesh GLB file is shown in Figure 1. We performed midpoint mesh surface subdivision and Poisson disk sampling to generate a uniformly dense point cloud from each mesh, as shown in Figure 2. We then projected all points along the vertical direction and created a 2D histogram. Bins containing a higher density of points are rendered as black. We assume the black areas indicate wall instances, and they are our objects of interest to detect, as shown in Figure 3.
Figure 1. The process of generating a training image from a 3D mesh GLB file.
Figure 2. Uniformly dense point cloud example.
Figure 3. 2D histogram image; high-density bins rendered in black.
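For concreteness, the projection step can be sketched with standard geometry-processing tools. The following is a minimal sketch using Open3D and NumPy; the file name, number of sampled points, histogram bin size, and density threshold are illustrative assumptions rather than the exact values used in our pipeline.

```python
import numpy as np
import open3d as o3d

# Load one HM3D mesh (file name is a placeholder; GLB support depends
# on the Open3D build).
mesh = o3d.io.read_triangle_mesh("scene.glb")

# Midpoint subdivision densifies the triangles before sampling.
mesh = mesh.subdivide_midpoint(number_of_iterations=1)

# Poisson disk sampling yields a uniformly dense point cloud
# (2,000,000 points is an assumed count).
pcd = mesh.sample_points_poisson_disk(number_of_points=2_000_000)
points = np.asarray(pcd.points)

# Project along the vertical (Z) axis and build a 2D histogram;
# the 5 cm bin size is an assumed value.
bin_size = 0.05
x, y = points[:, 0], points[:, 1]
x_bins = np.arange(x.min(), x.max() + bin_size, bin_size)
y_bins = np.arange(y.min(), y.max() + bin_size, bin_size)
hist, _, _ = np.histogram2d(x, y, bins=[x_bins, y_bins])

# Bins with high point density (walls) become black, all others white.
density_threshold = 50  # assumed threshold
image = np.where(hist.T > density_threshold, 0, 255).astype(np.uint8)
```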
Annotation preparation. To analyze the weakly supervised learning task, noisy "weak" labels are used for training, and high-quality human annotator labels are used for validation and testing. The noisy labels are created by converting 3D mesh semantic segmentation labels from the HM3D dataset into 2D plan-view bounding boxes. The overall noisy annotation generation pipeline is shown in Figure 4. We extracted the hex color of every wall-class object from the semantic file. Each hex color is converted to an RGB value and matched to the vertices in the PLY file; these vertices are the points belonging to wall objects. To convert these vertices into 2D bounding box annotations, we use only the x and y coordinates. The length and width of each bounding box are the distances between the two farthest points of the wall object along the x and y directions. The bounding boxes are cleaned up and merged in cases of duplicate overlapping annotations. Finally, the annotations are converted to pixel coordinates. To ensure accurate performance evaluation, we manually labeled the validation and test sets on the 2D histogram images as our ground truth annotations, as shown in Figure 5. To assess the quality of the noisy annotations, we calculated their average precision against the human annotator labels; the noisy annotations achieve AP50 and AP20 of 4.4 and 32.8, respectively.
Figure 4. The process of generating noisy annotations from the semantic and PLY files.
Figure 5. Left: 2D histogram, middle: noisy annotation, right: human annotator label.
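A simplified version of this conversion is sketched below. It assumes the wall hex colors have already been parsed from the HM3D semantic text file and that the vertex colors in the PLY file match them exactly; the helper function, placeholder color list, and pixel scale are illustrative assumptions, not the exact implementation.

```python
import numpy as np
import open3d as o3d

def hex_to_rgb(hex_color):
    """Convert a hex string such as 'AA10FF' into an (R, G, B) tuple."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

# Vertex positions and colors from the semantic PLY file.
mesh = o3d.io.read_triangle_mesh("scene_semantic.ply")
vertices = np.asarray(mesh.vertices)
colors = (np.asarray(mesh.vertex_colors) * 255).round().astype(int)

# Wall hex colors would be parsed from the semantic text file;
# this list is only a placeholder.
wall_hex_colors = ["AA10FF", "3C8D2F"]

pixels_per_meter = 50  # assumed plan-view scale
boxes = []
for hex_color in wall_hex_colors:
    rgb = np.array(hex_to_rgb(hex_color))
    mask = np.all(colors == rgb, axis=1)
    if not mask.any():
        continue
    wall_xy = vertices[mask][:, :2]
    # Axis-aligned box spanning the two farthest points along x and y.
    x_min, y_min = wall_xy.min(axis=0)
    x_max, y_max = wall_xy.max(axis=0)
    boxes.append([x_min * pixels_per_meter, y_min * pixels_per_meter,
                  x_max * pixels_per_meter, y_max * pixels_per_meter])
# Duplicate overlapping boxes would then be merged before export.
```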
Wall detection model training. We divided the 75 indoor scans into a training set of 60, a validation set of 8, and a test set of 7. As in our previous work (Wei et al. 2022), each projected image is rescaled to 2200 x 3400 pixels, and we performed 100 random crops on each one. The crops are 800 x 800 pixels, the size fed into the neural network, as shown in Figure 6. We trained a Faster R-CNN wall detection model using the Detectron2 library (Wu et al. 2019). The hyperparameters include a learning rate of 0.00025, a batch size of 8, and 53 epochs. After training, we evaluated the model on the 5 sets listed in Table 1: the training set, the validation set with noisy annotations, the validation set with human-labeled annotations, the test set with human-labeled annotations, and the combined human-labeled validation and test sets.
Figure 6. 800 x 800 crop from Figure 4.
Table 1. Experiment setting.

Set              training   validation   validation      test            validation & test
No. of images    6,000      800          800             700             1,500
Type of label    noisy      noisy        human-labeled   human-labeled   human-labeled
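To make the training configuration concrete, a minimal Detectron2 sketch is given below. It reflects the reported hyperparameters (learning rate 0.00025, batch size 8, 53 epochs over 6,000 crops); the backbone variant and the dataset registration names are assumptions rather than details reported above.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
# Faster R-CNN baseline; the exact backbone variant is an assumption.
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")

# Datasets of 800 x 800 crops registered beforehand (names assumed).
cfg.DATASETS.TRAIN = ("hm3d_walls_train",)
cfg.DATASETS.TEST = ("hm3d_walls_val",)

cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1   # single "wall" class
cfg.SOLVER.BASE_LR = 0.00025          # reported learning rate
cfg.SOLVER.IMS_PER_BATCH = 8          # reported batch size
# 53 epochs over 6,000 crops at batch size 8 is roughly 39,750 iterations.
cfg.SOLVER.MAX_ITER = 39750

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```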
Wall detection model inference. We apply the same processing to new 3D scan data to obtain a 2D histogram image. The image is cropped into 800 x 800 pixel patches in a sliding-window sequence and input to our wall detection model, which outputs the bounding boxes and confidence scores of the inferred walls in each crop. The last step is to reassemble the crops into the original image size and merge adjacent bounding boxes.
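A simplified version of the sliding-window inference and box-merging step is sketched below. The stride, coordinate bookkeeping, and adjacency rule are illustrative assumptions; the merge criterion shown is only one possible rule.

```python
import numpy as np
from detectron2.engine import DefaultPredictor

def sliding_window_inference(image, cfg, crop=800, stride=800):
    """Run the trained detector over 800 x 800 crops and return wall
    boxes and scores in full-image pixel coordinates."""
    predictor = DefaultPredictor(cfg)
    h, w = image.shape[:2]
    boxes, scores = [], []
    for y0 in range(0, h, stride):
        for x0 in range(0, w, stride):
            patch = image[y0:y0 + crop, x0:x0 + crop]
            out = predictor(patch)["instances"].to("cpu")
            for box, score in zip(out.pred_boxes.tensor.numpy(),
                                  out.scores.numpy()):
                # Shift crop-local coordinates back to the full image.
                boxes.append(box + [x0, y0, x0, y0])
                scores.append(float(score))
    return np.array(boxes), np.array(scores)

def merge_adjacent(boxes, gap=10):
    """Greedily merge boxes whose extents, expanded by `gap` pixels,
    overlap (an assumed adjacency rule)."""
    merged = []
    for box in boxes:
        for m in merged:
            if (box[0] - gap <= m[2] and box[2] + gap >= m[0] and
                    box[1] - gap <= m[3] and box[3] + gap >= m[1]):
                m[:2] = np.minimum(m[:2], box[:2])
                m[2:] = np.maximum(m[2:], box[2:])
                break
        else:
            merged.append(box.copy())
    return np.array(merged)
```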
RESULTS AND DISCUSSION
The training and validation loss curves are shown in Figure 7. Training ran for a total of 106 epochs, and we selected the model checkpoint at 53 epochs, beyond which the model begins to overfit.
Figure 7. Training and validation loss of the neural network versus the number of epochs.
Performance metrics are shown in Table 2. We computed average precision at IoU thresholds of 0.5 (AP50) and 0.2 (AP20) for different prediction score and non-maximum suppression (NMS) threshold settings. When we remove predicted bounding boxes with confidence lower than 0.5, AP20 is about 50%. If we lower the acceptable confidence score threshold to 0.05 and add a non-maximum suppression threshold to remove duplicate predictions, AP20 increases to a range of 51% to 66%. These results are surprisingly high considering that the noisy labels the network was trained on achieve an AP20 of only 32.8 when evaluated against human labels. Both the AP50 and AP20 of the trained model's predictions are higher than the noisy annotations' quality metrics.
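The score and NMS settings in Table 2 amount to a simple post-filtering of the raw detector output. A minimal sketch using torchvision is shown below; the tensors are placeholders standing in for actual model predictions.

```python
import torch
from torchvision.ops import nms

def filter_predictions(boxes, scores, score_thresh=0.05, nms_thresh=0.1):
    """Drop low-confidence boxes, then suppress duplicates with NMS.
    boxes: (N, 4) tensor in (x1, y1, x2, y2) format; scores: (N,) tensor."""
    keep = scores >= score_thresh
    boxes, scores = boxes[keep], scores[keep]
    keep_idx = nms(boxes, scores, iou_threshold=nms_thresh)
    return boxes[keep_idx], scores[keep_idx]

# Example: the Table 2 "Score = 0.05, NMS = 0.1" setting with two
# overlapping placeholder detections of the same wall.
boxes = torch.tensor([[10., 10., 200., 40.], [12., 11., 205., 42.]])
scores = torch.tensor([0.90, 0.30])
filtered_boxes, filtered_scores = filter_predictions(boxes, scores)
```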
As the visualization in Figure 5 shows, the thickness of the noisy annotations is not consistent: the annotated wall is sometimes wider than the wall boundary and sometimes thinner. In contrast, the human-labeled annotations always draw the bounding box along the boundary of the wall object. Moreover, the noisy annotations sometimes contain duplicate boxes on the same wall object, which does not happen in the human-labeled annotations. These characteristics affect the inference results shown in Figure 8. We reassembled the small crops in sequence after running inference on new scan data. Each recognized wall object is surrounded by a bounding box labeled with the wall class and the prediction confidence score. We then post-processed the prediction output by merging adjacent bounding boxes, as shown in Figure 9.
Table 2. Performance metrics with various score and non-maximum suppression (NMS) threshold settings.

                          validation   validation        test              validation & test
                          (noisy)      (human-labeled)   (human-labeled)   (human-labeled)
Score = 0.5        AP50   18.1         27.4              29.3              28.0
                   AP20   45.4         48.7              51.8              49.7
Score = 0.05,      AP50   14.8         27.7              26.2              26.9
NMS = 0.1          AP20   51.9         65.6              63.1              64.1
Score = 0.05,      AP50   15.9         28.6              27.4              27.8
NMS = 0.2          AP20   52.6         63.8              62.7              63.0
We found that when the training set contains noisy annotations, the neural network captures only the stable patterns and filters out the noise (e.g., duplicate overlapping bounding boxes). This experiment demonstrates that our neural network model can overcome noise in the training data and learn the stable patterns needed for prediction. Future work will include, first, applying more iterations of midpoint mesh subdivision so that the sampled point cloud is denser; the black areas in the 2D histogram images would then be more solid. Second, we will increase the size of the small crops or the resolution of the whole 2D histogram images. Lastly, we will explore additional pre-processing of the noisy annotations, which may lead to better training performance.
Figure 8. Inference visualization with class labels and prediction confidence scores.
Figure 9. Inference visualization after merging adjacent bounding boxes.
CONCLUSION
This study presents a wall detection task from projected point cloud using weak supervision. We
used the open source HM3D dataset to train our neural network model. HM3D is a high-resolution
3D scan of indoor environments, and we mainly focus on detecting the wall components. We
created 2D histogram images from projected point cloud as input to our neural networks training.
We trained the weakly supervised learning task using projected HM3D dataset annotation. To
comprehend the performance of our model, we also validated the human-labeled annotation. The
results show that the validation on human-labeled set is better than noisy label since the former
one is more consistent and has no overlap annotation. We conclude that our neural network model
can learn the steady patten in our noisy training set. We further suggest that increasing density of
point cloud, training image resolution and adding more rule-based method on noisy annotation
before training. We consider this projected 2D recognition task using noisy annotation is a feasible
method and the performance could be improved by our suggestion.
REFERENCES
Ahn, P., Yang, J., Yi, E., Lee, C., and Kim, J. (2022) “Projection-based Point Convolution for
Efficient Point Cloud Segmentation.” IEEE Access.
buildingSMART. (2023). "Take BIM Processes to the next level with Digital Twins." <https://www.buildingsmart.org/take-bim-processes-to-the-next-level-with-digital-twins/> (Feb. 17, 2023).
Chen, X., Ma, H., Wan, J., Li, B., and Xia, T. (2017). "Multi-view 3D object detection network for autonomous driving." Proc., IEEE Conference on Computer Vision and Pattern Recognition, 6526-6534.
Gankhuyag, U., and Han, J.-H. (2021). "Automatic BIM Indoor Modelling from Unstructured Point Clouds Using a Convolutional Neural Network." Intelligent Automation & Soft Computing, 28, 133-152.
Hu, Q., Yang, B., Fang, G., Guo, Y., Leonardis, A., Trigoni, N., and Markham, A. (2022). "SQN: Weakly-Supervised Semantic Segmentation of Large-Scale 3D Point Clouds." European Conference on Computer Vision, 600-619.
Kellner, M., Stahl, B. and Reiterer, A. (2022). “Fused Projection-Based Point Cloud
Segmentation.” Sensors, 22, 1139.
Ramakrishnan, S. K., Gokaslan, A., Wijmans, E., Maksymets, O., Clegg, A., Turner, J. M., Undersander, E., Galuba, W., Westbury, A., Chang, A. X., Savva, M., Zhao, Y., and Batra, D. (2021). "Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI." Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Wei, C., Gupta, M., and Czerniawski, T. (2022). "Automated Wall Detection in 2D CAD Drawings to Create Digital 3D Models." Proc., International Symposium on Automation and Robotics in Construction, 152-158.
Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., and Girshick, R. (2019). "Detectron2." <https://github.com/facebookresearch/detectron2>.
Xu, X., and Lee, G. H. (2020). "Weakly supervised semantic point cloud segmentation: Towards 10x fewer labels." Proc., IEEE Conference on Computer Vision and Pattern Recognition, 13,706-13,715.