Learning to Reconstruct Textureless and Transparent
Surfaces in 3D
Muhammad Saif Ullah Khan
19 September 2022
Version: 1.0
Master's Thesis
Learning to Reconstruct Textureless and
Transparent Surfaces in 3D
Muhammad Saif Ullah Khan
1. Supervisor Prof. Dr. Didier Stricker
Department of Augmented Vision
TECHNISCHE UNIVERSITÄT KAISERSLAUTERN (TUK)
2. Supervisor Dr. Muhammad Zeshan Afzal
Department of Augmented Vision
TECHNISCHE UNIVERSITÄT KAISERSLAUTERN (TUK)
19 September 2022
Muhammad Saif Ullah Khan
Learning to Reconstruct Textureless and Transparent Surfaces in 3D
Master's Thesis, 19 September 2022
Supervisors: Prof. Dr. Didier Stricker and Dr. Muhammad Zeshan Afzal
TECHNISCHE UNIVERSITÄT KAISERSLAUTERN (TUK)
Department of Augmented Vision
Kaiserslautern
Contents

1 Introduction
2 Background
  2.1 Problem Definition
    2.1.1 Multi-View and Single-View
    2.1.2 Object and Scene Reconstruction
    2.1.3 Surface Type and Reconstruction
  2.2 Shape in Three Dimensions
    2.2.1 Depth Map
    2.2.2 Normal Map
    2.2.3 Point Cloud
    2.2.4 3D Mesh
    2.2.5 Voxel
  2.3 Scope of the Thesis
3 Related Work
  3.1 Deep Learning for 3D Reconstruction
  3.2 Existing Datasets
    3.2.1 Textureless Datasets
    3.2.2 More Opaque Datasets
    3.2.3 Transparent Datasets
  3.3 Reconstruction of Opaque Objects
    3.3.1 Bednarik et al.
    3.3.2 Patch-Net
    3.3.3 HDM-Net
    3.3.4 IsMo-GAN
    3.3.5 Pixel2Mesh
    3.3.6 Salvi et al.
    3.3.7 VANet
    3.3.8 3D-VRVT
  3.4 Evaluation Metrics
4 Our Datasets
  4.1 Textureless Datasets
    4.1.1 Synthetic Textureless Dataset
    4.1.2 Real Textureless Dataset
  4.2 Transparent Dataset
    4.2.1 Motivation
    4.2.2 Generation Process
    4.2.3 Data Description
    4.2.4 Limitations
  4.3 Data Sources and Licenses
5 Methodology
  5.1 Textureless Surfaces
    5.1.1 Network Architecture
    5.1.2 Loss Functions
  5.2 Transparent Surfaces
    5.2.1 Network Architecture
    5.2.2 Loss Functions
6 Experiments and Results
  6.1 Evaluation Metrics
  6.2 Textureless Object Reconstruction
    6.2.1 Experimental Setup
    6.2.2 Results
    6.2.3 Ablation Studies
  6.3 Transparent Object Reconstruction
    6.3.1 Experimental Setup
    6.3.2 Results
    6.3.3 Ablation Studies
7 Conclusion
  7.1 Summary of Contributions
    7.1.1 Reconstruction of Textureless Surfaces
    7.1.2 Reconstruction of Transparent Surfaces
  7.2 Limitations and Future Work
Bibliography
List of Figures

1.1 Two examples illustrating loss of 3D information in 2D images.
2.1 An object viewed from multiple cameras.
3.1 Interest in using deep learning-based methods for 3D reconstruction is reflected in the number of publications on ScienceDirect matching the keywords "3d reconstruction" AND "deep learning", which has been growing exponentially since 2015.
3.2 Examples of images in the datasets of Bednarik et al. and Golyanik et al., which are used to evaluate some of the networks in this paper. (a) The textureless surfaces dataset [BFS18] contains RGB images and corresponding normal and depth maps for 5 different real objects. (b) The synthetic point cloud dataset of Golyanik et al. has a deforming thin plate rendered with 4 different textures under 5 different illuminations (figures adapted from [BFS18; Gol+18]).
3.3 The textureless surface reconstruction network [BFS18] (left) consists of an encoder that takes a masked image $I^n_m$ as input and outputs a latent representation $\Lambda$. This is followed by three parallel decoders $\Phi_N$, $\Phi_D$, and $\Phi_C$ that use $\Lambda$ for reconstructing the normal map, depth map, and a 3D mesh, respectively. The indices of all maxpool operations in the encoder are saved when downsampling (right). These indices are later used for non-linear upsampling in the corresponding decoder layers.
3.4 Patch-Net uses Bednarik et al.'s network with only the depth and normal decoders. The input image is divided into overlapping patches, and predictions for each patch are obtained separately. Patch predictions are stitched to form the complete depth and normal maps.
3.5 Overview of the HDM-Net [Gol+18] architecture. It has an encoder that takes an RGB image of size 224 × 224 × 3 and encodes it into a latent representation of size 28 × 28 × 128. This is then used by the decoder to reconstruct a 3D point cloud of the surface with 732 points.
3.6 Overview of IsMo-GAN [Shi+19]. The generator network accepts a masked RGB image, segmented by the object detection network (OD-Net), and returns a 3D point cloud. The output and ground truth are fed to the discriminator, which serves as a surface regularizer.
3.7 The Pixel2Mesh [Wan+18] network consists of two parallel networks that take an RGB image and a coarse ellipsoid 3D mesh, and learn to regress the 3D shape of the object in the image. The key contribution is the graph-based convolutions and unpooling operators in the bottom half of the network.
3.8 The attentioned ResNet-18 [He+15] network with four self-attention blocks [Vas+17] added to it. This encoder network is used by [Sal+20] to extract image features, which are fed to a decoder with five Conditional Batch Normalization blocks followed by an occupancy function.
3.9 Overview of VANet [YTZ21], a unified approach for both single and multi-view reconstruction with a two-branch architecture.
3.10 3D-VRVT takes one image as input and uses a Vision Transformer encoder to extract a feature vector. This is then fed to a decoder that outputs the voxel representation of the object.
4.1 Samples from 6 main categories of the synthetic textureless dataset.
4.2 The Blender scene. The 3D model is surrounded by multiple lights and cameras.
4.3 (a) How the different nodes are connected in Blender and (b) the render settings used to obtain the depth map data.
4.4 The ShapeNet category. 24 renders of 200 models for 13 main ShapeNet objects along with depth maps and surface normals are provided.
4.5 Objects in the dataset have depth variations at many scales, with some, like the rubber duck, having a largely smooth surface with uniform normals, and others, like the San Diego Convention Center or the Thai statue, having many deviations in their depth and normal vectors.
4.6 The top row shows the results of our skin-detection algorithm that removes the person wearing the clothes from the images. The middle row shows the raw output from the Kinect with a lot of noise and a hole in the depth and normal maps (right leg). The bottom row shows the output after the post-processing steps.
4.7 The shader in Blender used to model a transparent material. We use this to set the transparent material's refractive index, color, absorption, and transmission properties.
4.8 The five different HDRIs used to render the transparent dataset.
4.9 The Blender settings used for transparent dataset creation.
4.10 Transparent objects in the dataset.
4.11 Groundtruth labels for a single camera orientation and object rotation in the five worlds.
5.1 The Sketch Reconstruction Multi-task Autoencoder (SRMA) network. It has 11M trainable parameters.
5.2 The Residual Sketch Reconstruction Vision Transformer (RSRVT) network. It has 22M trainable parameters.
6.1 Visualization of the qualitative errors on random samples in the test data.
6.2 Visualization of the output on real objects from [BFS18] when trained on our synthetic data.
6.3 Visualization of the qualitative errors on a random sample from our real-world test data.
6.4 Effect of using a larger input size to train the network.
6.5 The Sketch Reconstruction Multi-task Autoencoder (SRMA) network without the self-attention. It has 11M trainable parameters.
6.6 The effect of removing the self-attention layers. The network without self-attention produces more "blurred" normals and misses low-frequency details near the edges.
6.7 The Sketch Reconstruction Autoencoder (SRAE) network. It has 7M trainable parameters.
6.8 The evolution of the difference between training and validation loss over time for the three transparent experiments.
6.9 The RSRVT network successfully locates the transparent object in real photographs and reconstructs fairly good depth and normal maps.
6.10 The Sketch Reconstruction Vision Transformer (SRVT) network. It has 14M trainable parameters.
6.11 Ablation study 1: Removing the residual blocks and shortcut paths from the network.
6.12 The Vision Transformer (ViT) network without the feature extractor and reconstruction networks. It has 23M trainable parameters, comparable to the 22M parameters of the original RSRVT network.
6.13 Ablation study 2: Removing the feature extractor and reconstruction networks, only keeping the Vision Transformer.
6.14 Ablation study 3: Removing the Vision Transformer, only keeping the feature extractor and reconstruction networks.
List of Tables

3.1 Summary of the existing 3D datasets of textureless objects.
3.2 Summary of objects in the textureless surfaces dataset [BFS18]. Sequences of data samples were captured using a Kinect device at 5 FPS with varying lighting conditions across sequences.
3.3 Summary of the existing 3D datasets of opaque objects.
3.4 Summary of all 3D reconstruction networks discussed in this paper.
4.1 These elements in the scene are used in various combinations to generate 'sequences' of data.
4.2 The synthetic textureless dataset has 48 objects divided into seven subcategories with 2635 unique 3D models and 364,800 samples in total.
4.3 Summary of objects in the supplementary dataset of real objects.
6.1 The training (R), validation (V), and test (T) splits for the textureless dataset.
6.2 Results of the intra-category experiments. The furniture and clothing categories show the best results, with more than 92.12% and 88.87% of the predicted normals, respectively, having less than a 30° angular difference from the groundtruth. This shows that our dataset allows for a good inter-class generalization ability.
6.3 Results of the inter-category experiments. The degree of generalization to new categories is less than that for objects within the same category (Table 6.2). This is because the network learns strong shape priors that do not generalize well to very different geometries.
6.4 Results on the ShapeNet objects. Performance improves greatly when more shapes are seen during training. This shows the network can learn shape representations from our textureless renders.
6.5 Comparison of the normal map reconstruction between a network trained on our synthetic dataset (S) and the same network trained on real cloth data from [BFS18] (R). When trained on our synthetic data, the same network gives better surface normals for all four real objects other than the cloth object, which was used to train the real network and where our results are comparable.
6.6 Results of reconstructing depth and normal vectors of our real dataset using the baseline network trained on our synthetic clothing dataset.
6.7 Results of the second ablation study on the textureless data, where we use only a single decoder.
6.8 The training (R), validation (V), and test (T) splits of the synthetic transparent dataset. Three different experiments are performed on the dataset using these splits.
6.9 The results of the transparent experiments described in Section 6.3.1.
6.10
6.11 Ablation study 1: The results of the transparent experiments with the SRVT network. The average $E_N$ error increases by 10% without the residual blocks and shortcut paths.
6.12 Ablation study 2: The results of the transparent experiments without the autoencoder part. Only 2.7% of the surface normals have an angular error of less than 30 degrees in these experiments, indicating this network's complete inability to reconstruct the normals.
6.13 Ablation study 3: The results of the transparent experiments without the autoencoder part. Only 2.7% of the surface normals have an angular error of less than 30 degrees in these experiments, indicating this network's complete inability to reconstruct the normals.
Acknowledgement
I want to start by expressing my gratitude to the TU Kaiserslautern Mensa and the
Bistro 36 for providing a comfortable space to sit down and work on my thesis with
easy access to coffee, sandwiches, and cheesecakes. Similarly, my thanks go to the
vending machines in the university library and building 46 that let me buy coffee
after the cafeteria was closed and the recharging stations around the campus that let
me put money on my card to pay for said coffees.
I would also like to sincerely thank my thesis supervisor, Dr. Muhammad Zeshan
Afzal, for providing me with continuous guidance, encouragement, technical support,
essential equipment, and lab space. I also want to thank Dr. Afzal, Prof. Dr. Didier
Stricker, and Prof. Dr. Ralf Hinze for their understanding and support during
difficult times. Thanks to their flexibility, I could complete my work despite several
extraordinary situations that could have easily derailed me. I am genuinely grateful
for their consideration.
Thank you to Maxwell Bald for proofreading my work several times over without
getting sick of it. Similarly, I would be remiss not to mention Amna Abdul Rehman,
who provided me access to Grammarly Premium, which assisted me in writing this
thesis. And a big shoutout to GitHub Copilot and Chrome Remote Desktop, without
whom this thesis would not have been finished.
Finally, I would like to thank my parents for accepting my decision to move to
the other side of the world for my Master’s. Also, to my Bachelor’s supervisor
Dr. Muhammad Imran Malik, for showing me the possibilities that awaited me in
Germany and encouraging me to move here.
Abstract
Performing 3D reconstruction from a single 2D input is a challenging problem that has received growing attention in the literature. Traditionally it was approached as an ill-posed optimization problem, but with the advent of learning-based methods, the performance of 3D reconstruction has
also significantly improved. However, the state-of-the-art approaches mainly focus on
datasets with highly textured images. Most of these methods are trained on datasets
like ShapeNet, which contain rendered images of well-textured objects. However,
in natural scenes, many objects are textureless and challenging to reconstruct.
Unlike textured surfaces, reconstruction of textureless surfaces has not received as
much attention mainly because of a lack of large-scale annotated datasets. Some
recent works have also focused on textureless surfaces, many of which are trained on a small real-world dataset containing 26k images of five textureless deformable objects. Transparent surfaces have received even less attention from the deep
learning community, with most works using traditional computer vision methods to
reconstruct these surfaces. Most techniques depend on inferring the shape of the
objects by how light is reflected off the surfaces. However, this may not be possible
in the case of transparent surfaces as they allow some light to pass through them,
and the algorithms now have to deal with light refraction and absorption in addition
to reflections. To facilitate further research in this direction, we present a synthetic
dataset generation strategy for images of both textureless and transparent objects with corresponding depth map and surface normal map groundtruth. We also make
available three new datasets: a large synthetic textureless dataset containing 364k
samples and 2635 3D models, a small real-world textureless dataset containing 4k
samples and six objects, and a large transparent object dataset containing 126k
samples and ten 3D models. We also propose an autoencoder-based network for
learning to reconstruct the depth maps and surface normal maps from a single image
for textureless objects. Furthermore, we propose a novel architecture that combines
a Vision Transformer with a residual autoencoder and uses an auxiliary silhouette
output to find transparent objects in realistic scenes and reconstruct their depth
maps and surface normal maps.
Keywords: 3D Reconstruction, Single-View, Textureless, Transparent
1 Introduction
„The world is changed. I feel it in the water. I feel
it in the earth. I smell it in the air.
—Lady Galadriel
The Lord of the Rings
Three-dimensional reconstruction is the task of inferring the geometric structure
of a scene from a set of two-dimensional images. Given one or more images of a
scene, the goal is to recover the 3D shape and position of the shown objects. This
problem is important to the scientific community because of its numerous appli-
cations in diverse fields. For example, in the medical industry, 3D reconstruction
may aid in diagnosis and treatment by reconstructing a patient’s internal anatomy
from 2D scans [Chi+11]. In robotics, depth perception and understanding of the
three-dimensional structure of the scene may be necessary for a robot to navigate its
environment successfully. Autonomous vehicles and surgical robots are examples of
such robots, which rely heavily on the depth information and 3D reconstruction of
their environment [Tes19]. Similarly, 3D reconstruction is a core technology in the
entertainment industry, including gaming and augmented reality [Haf+17]. Other
fields that rely on 3D reconstruction include virtual tourism [El-+04], city plan-
ning [Vos03], 3D object detection [HR14], and aerospace engineering [Pau14].
3D reconstruction has been studied for decades by researchers working in computer
graphics, computer vision, machine learning, and, more recently, deep learning.
However, despite strong interest by the scientific community and numerous ap-
plications in various fields, it remains an ill-posed problem, mainly because 3D
reconstruction is a general task that can take many different forms depending on
the application, and no single solution works in every case. In addition, the 2D
images do not uniquely define the 3D geometry of a scene, as a point in
two-dimensional space can be a projection of an infinite number of 3D points. This
is illustrated in Figure 1.1.
Fig. 1.1: Two examples illustrating loss of 3D information in 2D images. (a) A cuboid and a cylinder projected to a rectangle in 2D. (b) A cone and a sphere projected to a circle in 2D.
Deep neural networks are more robust to noise and variations in input data than traditional methods. Additionally, they can learn complex features directly from data without human intervention, which makes them well suited for tasks such as 3D reconstruction, where input images exhibit a great deal of variability. Vision-based neural networks, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), that are trained on images learn to identify and understand
shape and pattern cues in the images that help them perform computer vision tasks.
For 3D reconstruction, factors such as the material of the surfaces that are being
reconstructed, the type of provided visual input, the number of available viewpoints,
and the kind of available groundtruth 3D geometry all determine what sort of visual
cues a neural network should learn from the input to be able to reconstruct the 3D
geometry of the surface successfully.
This is particularly true when attempting to reconstruct surfaces with complex properties, such as textureless surfaces or transparent objects that are difficult to distinguish from the background. For example, even for humans, it is difficult to identify the different faces of a smooth textureless surface with a homogeneous white color. Similarly, because the background and even the back faces of transparent objects are visible in the input images, it is difficult to separate the front faces of the object from the background and to decide which visual information to rely on and which to discard. These difficulties matter in practice because textureless surfaces exist everywhere, from
white walls and plain clothes to soft tissue in endoscopy images, and reconstructing
them in 3D has many practical applications [LHH16; Wid+19]. For example, for
robots working in an industrial setup or a product line where they have to handle
different, mostly textureless small bits and pieces, understanding their correct 3D
shape may be crucial for operation [Lam+13; SS17; ZZD20]. Similarly, 3D recon-
struction of "weakly-textured" soft tissue or semi-transparent veins might be useful
in medical diagnosis [Böv+20]. In the same way, transparent objects like glass,
plastic, and water are also often found in real-world scenes. Reconstruction of these
surfaces may also find virtual and augmented reality applications when working
with scenes containing large, homogeneous surfaces or objects made from glass or
other transparent materials [Haf+17]. The lack of distinctive features on textureless
surfaces makes reconstructing them in 3D challenging. Similarly, unnecessary back-
ground information in images of transparent objects poses a problem. This becomes
even more complicated when only a single view of the object is available [BFS18;
Kha+22].
The reconstruction task has been addressed with a variety of learning-based methods [BFS18; Gol+18; LK21; Shi+19; TA19; Wan+18]. However, most of these
works focus on reconstructing well-textured surfaces, and there are only a few deep
learning methods for reconstructing textureless surfaces [BFS18; TA19], and even
fewer that reconstruct transparent objects [Kar+22]. In this thesis, we propose using
an autoencoder with self-attention for reconstructing textureless objects. In addition,
we propose a novel architecture combining a ViT with the CNN autoencoder, showing
that this network can reconstruct the 3D geometry of transparent objects from a
single view with high accuracy.
Data is one of the most crucial parts of all machine learning algorithms [Gup+21;
PKS15], particularly deep learning, which relies heavily on large amounts of data [Aji+15; Bar18; Sun+17; ÇN21]. The scarcity of large public 3D datasets for textureless and transparent surfaces hinders the progress of research in this
field [Kha+22]. There are several ways to represent the groundtruth 3D geometry,
such as depth maps, normal maps, 3D point clouds, meshes, or voxels [Kha+22].
Datasets containing RGB images with corresponding depth information for the scene
in the image as a depth map or a surface normals map are sometimes called RGB-D
datasets [Cai+17]. Whereas 3D point clouds, voxels, and meshes provide a complete
3D shape, depth maps and surface normal maps from a single viewpoint represent
a very sparse 3D geometry, sometimes also called 2.5D [Cai+17]. It is relatively
easier to reconstruct the depth maps and surface normal maps of one view of the
object than a complete 3D mesh [BFS18]. However, textureless and transparent
surfaces have limited visual cues that indicate their 3D shape, making reconstructing
them more complicated than standard 3D reconstruction [BFS18]. This problem
becomes even more ill-defined when the input is a single RGB image. Therefore,
reconstruction of depth maps and surface normal maps of one view at a time (so-called 2.5D) is an excellent first step toward complete 3D shape reconstruction of
textureless surfaces [BFS18]. For this reason, we also introduce three new datasets
containing 2.5D groundtruth geometry for textureless and transparent surfaces. The
main contributions of this thesis include:
• A novel two-part RGB-D dataset of synthetic textureless surfaces containing:
1. The first part features one 3D model of each of 35 everyday objects rendered without textures under four lighting conditions and viewed from three elevation angles and 360 azimuth steps, giving 302,400 samples.
2. The second part contains 2600 3D models, with 200 from each of the 13 most common object categories in ShapeNet [Cha+15], rendered textureless under one lighting condition and viewed from three elevation angles and eight 45° azimuth steps, giving 62,400 samples.
Each sample has an RGB image of the object rendered in front of a black background and the corresponding depth map and surface normal map. To our knowledge, this is the first large-scale synthetic dataset of textureless surfaces.
• A small supplementary dataset of real-world textureless objects captured with an RGB-D camera, containing 4672 samples across six different objects. This data is an extension of an existing textureless surfaces dataset [BFS18].
• Another synthetic dataset of transparent surfaces containing 126,000 RGB images, depth maps, and surface normal maps, showing ten different 3D models in five real-world environments from 2520 viewpoints in each environment.
• We also describe our data generation pipelines in detail and make the source code public, which other researchers can use to generate similar datasets in an automated way. In particular, both synthetic datasets can be extended to include more ShapeNet objects or 3D models with minimal setup. The source code for using the Kinect v2 camera to obtain groundtruth depth maps and normal maps is also published.
• An encoder-decoder architecture with multiple decoders for the reconstruction of textureless surfaces from a single RGB image.
• A hybrid Vision Transformer and autoencoder network with residual connections, shortcut branches, and an auxiliary silhouette output, for automatically segmenting transparent objects from a real-world background and then reconstructing the depth map and normal map of the foreground.
The remainder of this thesis is structured as follows: Chapter 2 provides background
information on this research topic, Chapter 3 discusses related work, including
existing datasets and neural networks, and Chapter 4 describes the new datasets for
textureless and transparent objects created in this thesis. Chapter 5 explains our
methodology, network architectures, and loss functions, while Chapter 6 reports on our experiments, results, and ablation studies. Finally, Chapter 7 concludes the thesis with a summary of contributions, limitations, and directions for future work.
2 Background
„Never trust anything that can think for itself if
you can’t see where it keeps its brain.
—Arthur Weasley
Harry Potter and the Chamber of Secrets
This chapter provides the foundation for understanding the problem of 3D reconstruc-
tion. It formally defines the broader concept of 3D reconstruction and introduces its
different forms, including single-view and multi-view reconstruction, scene recon-
struction, object reconstruction, and the role of the type of object’s surface material
in 3D reconstruction. It also explores various ways of representing 3D shapes in a
computer, and finally defines the problem of 3D reconstruction in the context of this
thesis.
2.1 Problem Definition
In computer graphics, 3D reconstruction is the process of recovering the three-
dimensional geometry of an object from two-dimensional visual information. Given
one or more images showing the same scene from different viewpoints, the goal is
to reconstruct the 3D shape and position of the objects in the 2D images. This can
be seen as a function $f: \mathcal{I} \mapsto S$ that takes as input a set of $k$ images $\mathcal{I} = \{I_1, I_2, \dots, I_k\}$ and outputs a 3D shape $S$.
Each image $I$ is a collection of points $p = (x, y)$, where $x$ and $y$ are the pixel coordinates of the point in the image. Let $p_1$ be a point in $I_1$ corresponding to a real point $P = (\bar{x}, \bar{y}, \bar{z})$ in 3D space, where $\bar{x}$ and $\bar{y}$ are the horizontal and vertical locations of the same point in world coordinates, and $\bar{z}$ is the distance of the point from the camera (Figure 2.1a). Let $p_2, \dots, p_k$ be points in the other $k-1$ images corresponding to the same 3D point $P$. Then, for all $p \in I_i$ and for $i = 1, 2, \dots, k$, the goal is to find the corresponding 3D point $P$. This is the problem of 3D reconstruction.
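The ambiguity that makes this problem ill-posed can be made explicit with the standard pinhole camera model. The symbols below ($K$, $R$, $t$, $C$, $\lambda$) are the usual intrinsics, rotation, translation, camera center, and depth along the viewing ray; they are standard notation assumed here for illustration rather than symbols defined elsewhere in this thesis.

```latex
% Projection of a 3D point P to a pixel p (homogeneous coordinates), and the
% back-projection of p, which is a whole ray of 3D points rather than a single point.
\begin{align}
  \lambda\,\tilde{p} &= K\,[R \mid t]\,\tilde{P},
      \qquad \tilde{p} = (x, y, 1)^{\top},\; \tilde{P} = (\bar{x}, \bar{y}, \bar{z}, 1)^{\top},\\
  P(\lambda) &= C + \lambda\, R^{\top} K^{-1}\,\tilde{p}, \qquad \lambda > 0.
\end{align}
```

Every choice of $\lambda > 0$ in the second equation projects to the same pixel $p$, which is why a single image cannot uniquely determine the depth $\bar{z}$ and why single-view reconstruction has to rely on prior knowledge about shape.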
Fig. 2.1: An object viewed from multiple cameras. (a) A 3D point P in world space viewed from different viewpoints. (b) Occlusions may make a point invisible in some images.
2.1.1 Multi-View and Single-View
There are two main ways to reconstruct a 3D scene: multi-view reconstruction from
several images taken from different perspectives and monocular reconstruction from
a single image. Both tasks aim to infer the 3D geometry of the scene and differ in
the number of images $k$ used to reconstruct the scene. Specifically, in multi-view reconstruction $k > 1$, and in single-view reconstruction $k = 1$.
In multi-view reconstruction, where several viewpoints of the scene are available,
the problem becomes relatively easier as the reconstruction algorithms can use the
extra information to make better decisions about the 3D geometry of the scene.
For example, in Figure 2.1b, the 3D point $P$ is visible in more than one image, so the reconstruction algorithm can use the information from all these images to infer the 3D location of the point. However, in the single-view case, where only one image is available, the back faces are completely invisible, some 3D points may be self-occluded or occluded by other objects in the scene, and the algorithm has to use this limited amount of information to make assumptions about the 3D geometry. This makes the problem more challenging. For example, in the top-left view in Figure 2.1b, the 3D point $P$ is occluded, and the reconstruction algorithm would
have to guess its 3D location. This is analogous to how humans perceive the world.
We can see the 3D geometry of the world around us because we have two eyes that
enable stereoscopic vision. However, if someone was born blind in one eye, they
would lack strong depth perception and would have a hard time understanding the
3D geometry of the world around them. This is why the problem of single-view
reconstruction is more challenging than multi-view reconstruction.
However, multiple images of the same scene from different angles are not always
readily available. For example, a robot exploring a new environment may only have
access to a single camera due to hardware limitations. In such a case, the robot
would have to rely on single-view reconstruction to infer the 3D geometry of its
environment and navigate through it. Capturing multiple images is often impractical
in many real-world situations, and single-view reconstruction is the only feasible
option. This is why we cannot always rely on multi-view reconstruction algorithms,
and it is essential to develop algorithms that can reconstruct 3D geometry from a
single image.
Continuing the analogy to human vision above, if a person loses vision in one eye later in life instead of being born that way, they retain better depth perception, as their brain has already learned a model of the 3D world from previous experience [Wel16; Ric70]. While training single-view
reconstruction algorithms, we can use the same analogy to teach the algorithms
strong shape priors that let them reconstruct 3D geometry from a single image at
test time.
The same image can be a projection of infinitely many different 3D shapes, which
makes correctly reconstructing the 3D shape from a single image very hard. Re-
construction of the non-visible faces of the object is challenging in particular as the
input image often provides no information about their shape. Bautista et al. showed
that many existing monocular 3D reconstruction networks learn shape priors over
training object categories to solve this problem, which makes it difficult for these
networks to generalize to unseen object categories or scenes [Bau+21]. Tatarchenko
et al. demonstrated that single-view 3D reconstruction networks do not reason about
the 3D geometry of the object from visual shape cues but rather rely on object
recognition to perform the reconstruction [Tat+19].
2.1.2 Object and Scene Reconstruction
In the literature, scene reconstruction and object reconstruction are also often dif-
ferentiated. Scene reconstruction aims to recover the 3D geometry of the entire
scene. On the other hand, in object reconstruction, we want to reconstruct a single
surface. A scene is usually made up of several objects, and scene reconstruction
involves reconstructing the geometry of all the objects in the scene and their relative
positions. The reconstruction of a whole scene is generally more challenging than re-
constructing a single object because it involves dealing with more complex shadows,
occlusions, and interactions between objects. In real-world scenes, single objects
are rarely encountered in isolation. However, obtaining dense 3D geometry of the
whole scene is expensive, and reconstructing the entire scene is often unnecessary.
For example, it is not necessary to reconstruct the entire scene for a self-driving car; it is sufficient to reconstruct the road and the objects on the road, such as other cars, pedestrians, and traffic signs.
Similarly, a robot might be only interested in the geometry of the objects in its
immediate vicinity or directly in front of it. In such cases, it is sufficient to reconstruct
the geometry of some objects in the scene and not the entire scene. This is why
object reconstruction is a more practical task than scene reconstruction and is often
the focus of many 3D reconstruction algorithms. These algorithms segment the scene
into a background and foreground and then reconstruct the foreground object or
objects. It is also possible to use object reconstruction as a building block for scene
reconstruction.
2.1.3 Surface Type and Reconstruction
One factor that plays a substantial role in the performance of the reconstruction
algorithms is the material of which the object is made. The object's material
affects how light interacts with the object and how the object reflects light. This
changes the object’s appearance in the image, and the reconstruction algorithm has
to consider this. Algorithms that try to reconstruct geometry by looking for visual
cues such as edges and corners in the image by detecting sharp changes in intensity
or color need to be able to ignore these cues when they are caused by the lighting or
shadows and not the actual geometry of the object. This is why the object’s material
is a significant factor in the performance of the reconstruction algorithm.
Similarly, reconstruction becomes harder if no textures exist in the scene. Without
any distinctive features on the object surfaces, it becomes tougher to infer the 3D
shape of an object from a 2D image, making it complicated to create a complete and
accurate 3D model of the surface. This is because the visual cues that traditional
reconstruction algorithms use to learn the 3D shape of well-textured objects may
not be present in the image of a non-textured object.
While all opaque objects reflect light in one way or another, translucent and trans-
parent objects let a fraction of the light pass through them. The amount of light
that passes through the transparent object and how it bends as it passes through
depends on factors such as thickness, color, and refractive index of the transparent
material, and this determines how the object appears in the image. The reconstruction algorithm therefore has to deal with both reflection and refraction of light, as well as with the fact that the background is visible through the object, which makes the reconstruction of transparent objects even more challenging than that of opaque objects.
2.2 Shape in Three Dimensions
There are many different ways to represent the 3D shape of a scene. The depth and
normal maps can be used to represent the partial geometry of the scene, which is
limited to the surfaces of the objects directly facing the camera. Point clouds, meshes,
and voxels can be used for a more complete representation. 3D reconstruction networks can be trained to predict any of these representations from 2D images. In this section, we briefly introduce these different ways of representing 3D data.
2.2.1 Depth Map
For each pixel in an image, a depth map provides the distance from the camera to the corresponding point in space. This gives a single-channel image of the same size as the input image, with the corresponding depth value $z$ at each $(x, y)$ position. The absolute depth values are sometimes mapped to the range [0, 255] and, together with a normal RGB image, the depth map is given as the fourth channel of a so-called RGB-D image, with points closer to the camera appearing darker and points further away appearing brighter.
As depth maps are created from a single viewpoint, they represent a very sparse
3D geometry of the scene containing only points directly in the line of sight of the
camera. They say nothing about the occluded planes, nor do they say anything about
the 3D orientation of different faces of an object in the scene. For this reason, RGB-D
images are sometimes called 2.5D because they cannot represent a complete 3D
topology on their own. They are a partial surface model, with only the shape of the
front face of the surface represented.
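As a concrete illustration of the mapping just described, the sketch below normalizes a metric depth map to [0, 255] and stacks it with an RGB image into a four-channel RGB-D array. The helper name and the assumption that zero marks invalid depth are illustrative choices, not part of the thesis pipeline.

```python
import numpy as np

def to_rgbd(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Stack an RGB image (H, W, 3) and a metric depth map (H, W) into an
    RGB-D array (H, W, 4). Depth is linearly mapped to [0, 255] so that
    nearer points are darker; zero depth is treated as invalid."""
    valid = depth > 0
    d_min, d_max = depth[valid].min(), depth[valid].max()
    depth_u8 = np.zeros_like(depth, dtype=np.uint8)
    depth_u8[valid] = (255 * (depth[valid] - d_min) / (d_max - d_min)).astype(np.uint8)
    return np.dstack([rgb.astype(np.uint8), depth_u8])

# Example with random data standing in for a real RGB image and Kinect-style depth map.
rgb = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
depth = np.random.uniform(0.5, 4.5, (480, 640)).astype(np.float32)
rgbd = to_rgbd(rgb, depth)
print(rgbd.shape)  # (480, 640, 4)
```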
2.2.2 Normal Map
A normal vector or a “normal” to a surface is a three-dimensional vector perpendicu-
lar to the surface at a given point. Analogous to depth maps, normal maps provide
normals for each pixel in an image. This means that we can tell the 3D orientation
of the surface at any given point in space that is visible in the image.
When RGB-D datasets are combined with normal maps, we can extract both the
distance and orientation of every point in a scene from a single viewpoint. However,
since we only see the scene from one perspective, information about the hidden
surfaces of the objects in the scene cannot be completely represented. Like depth
maps, normal maps also represent a partial surface model.
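Normal maps are often derived directly from depth maps by differentiation (this is also how the groundtruth normals of the real-world dataset discussed in Chapter 3 were obtained). A minimal sketch of this idea, with a hypothetical helper that ignores the camera intrinsics for simplicity, so the result is only a rough approximation of the true normals:

```python
import numpy as np

def normals_from_depth(depth: np.ndarray) -> np.ndarray:
    """Approximate per-pixel surface normals from a depth map (H, W) using
    finite differences. Intrinsics are ignored, so this is only a coarse,
    orthographic approximation of the true normals."""
    dz_dy, dz_dx = np.gradient(depth)            # partial derivatives of depth
    normals = np.dstack([-dz_dx, -dz_dy, np.ones_like(depth)])
    norm = np.linalg.norm(normals, axis=2, keepdims=True)
    return normals / np.clip(norm, 1e-8, None)   # unit-length normal vectors

depth = np.random.uniform(0.5, 4.5, (480, 640)).astype(np.float32)
n = normals_from_depth(depth)
print(n.shape)  # (480, 640, 3)
```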
12 Chapter 2 Background
2.2.3 Point Cloud
A point cloud is a set of 3D points in space. Theoretically, they can exactly represent
a complete 3D scene by storing the position of every point in space. Depending on
how many and which points are present in the point cloud, it can be both a solid
and a surface model of the scene. However, due to limited computational memory, it
is often necessary to downsample them to reduce the size of the dataset. This can be
done by removing the points very close to each other or not needed to understand
the visible shape of the 3D surfaces.
Point clouds can be extremely useful for representing 3D shapes but they can also be
difficult to work with. Sometimes it is necessary to convert them to a mesh to get a
more accurate representation of the object.
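Given a depth map and the camera intrinsics, a point cloud can be obtained by back-projecting every pixel along its viewing ray. The sketch below assumes a simple pinhole model; the focal lengths and principal point are illustrative values, not calibration data from this thesis.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W), given pinhole intrinsics, into an
    (N, 3) array of 3D points in the camera coordinate frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.dstack([x, y, z]).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels

depth = np.random.uniform(0.5, 4.5, (480, 640)).astype(np.float32)
cloud = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(cloud.shape)  # (307200, 3) here, since every pixel has valid depth
```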
2.2.4 3D Mesh
A mesh is a collection of 3D points (or vertices) connected with edges to form the
surfaces (or faces) of 3D objects. Vertices are connected in a way that the faces
are made up of many polygons adjacent to each other. Usually, these polygons are
triangles, and the meshes are called “triangulated meshes”. Meshes can be used to
represent the surface models of a 3D scene as “wireframes”.
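As a small illustration of this vertex/face representation, the following made-up example encodes a unit square as a triangulated mesh with four vertices and two triangular faces:

```python
import numpy as np

# A unit square in the x-y plane represented as a triangulated mesh:
# four 3D vertices and two triangular faces given as vertex indices.
vertices = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
])
faces = np.array([
    [0, 1, 2],  # first triangle
    [0, 2, 3],  # second triangle
])
print(vertices.shape, faces.shape)  # (4, 3) (2, 3)
```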
2.2.5 Voxel
A voxel is a 3D equivalent of a pixel. Voxel-based models represent objects as a
collection of cubes (or voxels) stacked in space like a 3D grid. They represent a
discretized, solid model of the scene. The accuracy of the 3D model depends on the
size of the voxels making up the objects. The bigger the voxels, the more “pixelated”
the surfaces of the objects appear.
Like meshes, voxel grids can also be generated directly from point clouds where sev-
eral adjoining points are all approximated to a single voxel (or a cube) in space. This
process is called voxelization and is one way of downsampling the point clouds.
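Voxelization can be sketched as quantizing point coordinates to a regular grid and marking the corresponding cells as occupied. The following minimal version assumes an axis-aligned grid and is only meant to illustrate the idea; practical pipelines typically use dedicated libraries.

```python
import numpy as np

def voxelize(points: np.ndarray, voxel_size: float, grid_dim: int) -> np.ndarray:
    """Convert an (N, 3) point cloud into a binary occupancy grid of shape
    (grid_dim, grid_dim, grid_dim) by snapping each point to its voxel."""
    origin = points.min(axis=0)                        # corner of the grid
    idx = np.floor((points - origin) / voxel_size).astype(int)
    idx = np.clip(idx, 0, grid_dim - 1)                # points outside are clamped
    grid = np.zeros((grid_dim, grid_dim, grid_dim), dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

points = np.random.uniform(-1.0, 1.0, (10000, 3))
occupancy = voxelize(points, voxel_size=2.0 / 32, grid_dim=32)
print(occupancy.sum(), "occupied voxels out of", occupancy.size)
```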
2.3 Scope of the Thesis
In this thesis, we focus on single-view 3D reconstruction and propose deep learning algorithms for reconstructing the geometry of textureless and transparent objects in particular, and we only consider the reconstruction of a single object in the scene. However, given the complex and severely ill-posed nature of this problem, and the lack of public labeled datasets with 3D geometry of textureless and transparent objects, we reconstruct the depth maps and surface normal maps of these objects (the so-called 2.5D representation) instead of performing dense surface reconstruction.
3 Related Work
„But the old ways serve us no longer.
—Gellert Grindelwald
Fantastic Beasts: The Crimes of Grindelwald
This chapter discusses the related work in the field of 3D reconstruction using deep
learning, primarily focusing on single-view reconstruction. The chapter starts with
a brief overview of the trend of deep learning for 3D reconstruction. As datasets
are an essential part of deep learning research, it discusses the 3D datasets in the
literature. Then, the chapter details the existing deep neural networks for monocular
3D reconstruction of opaque objects, including both textured and textureless objects,
and shows that very few existing methods focus solely on textureless surfaces. Finally,
it reviews the existing methods for the 3D reconstruction of transparent objects.
3.1 Deep Learning for 3D Reconstruction
The problem of 3D reconstruction from visual data has been well-studied in computer
vision literature, but reconstructing 3D geometry from images was mostly approached as an ill-posed optimization problem before 2015, when researchers started using convolutional neural networks
for this task. Figure 3.1 shows the trend of related publications since then. In this
section, we shed some light on existing surveys that have previously reviewed the
3D reconstruction methods in the literature.
Zollhöfer et al. [Zol+18] published a report in 2018 on the state-of-the-art in
monocular 3D reconstruction of faces. The authors mainly focus on optimization-
based algorithms for facial reconstruction and briefly mention the emerging trend of
using learning-based techniques for this task. They conclude that they “expect to see
heavy use of techniques based on deep learning in the future”. In 2019, Yuniarti and
Suciati [YS19] formally defined 3D reconstruction as a learning problem and showed
an exponentially growing interest in 3D reconstruction among the deep learning
community. This paper talks about the different ways of representing shapes in 3D,
such as parametric models, meshes, and point clouds, lists the 3D datasets available
at that time and summarizes various deep learning methods for 3D reconstruction.
Fig. 3.1: Interest in using deep learning-based methods for 3D reconstruction is reflected in the number of publications on ScienceDirect matching the keywords "3d reconstruction" AND "deep learning", which has been growing exponentially since 2015.
Han et al. [HLB19] published a more extensive review of single and multi-view 3D
reconstruction later that year. This work distinguishes between reconstructing scenes
and objects in isolation and reviews techniques for both.
A large body of work in this area focuses solely on producing depth maps from images,
which partially represent 3D geometry. In this context, Laga [Lag19] extensively
surveyed more than 100 key contributions using learning-based approaches for
recovering depth maps from RGB images. More reviews published in the following
years show the shift in trend from using plain CNNs to recurrent neural networks
(RNNs), residual networks, and generative adversarial networks (GANs) for 3D
reconstruction with encouraging results [Liu+21; MN21]. Fu et al. [Fu+21] also
published a review of single-view 3D reconstruction methods, focusing only on
objects in isolation. They cover networks proposed between 2016 and 2019 in their
review.
3.2 Existing Datasets
This section discusses the datasets used for 3D reconstruction in the literature. We
first discuss the 3D datasets of opaque objects and then transparent objects. The
datasets of opaque objects are divided into textureless datasets and datasets of regular, well-textured objects.
Most of these datasets are generated from CAD models, and real datasets captured
directly with 3D sensors are less common and have fewer samples because of the
difficulty of obtaining them. These datasets include groundtruth 3D geometry in
different formats, as already introduced in the previous chapter.
3.2.1 Textureless Datasets
Most existing 3D datasets of opaque objects contain well-textured objects, and only a few specifically targeting textureless surfaces exist in the literature. Table 3.1
summarizes the 3D datasets of textureless objects discussed here.
Tab. 3.1: Summary of the existing 3D datasets of textureless objects.

Dataset (Year)   | Data Type                   | Sensor Type           | Sensor Model                      | Scene Type                             | Objects | Samples
T-LESS (2017)    | Color, Depth                | Structured Light, TOF | Kinect V2, PrimeSense Carmine 1.0 | Isolated objects / focussed on objects | 30      | 38k
Ley (2017)       | Color                       | -                     | -                                 | Isolated objects / cluttered scene     | 2       | 450
Bednarik (2018)  | Color, Depth, Normals, Mesh | TOF                   | Kinect V2                         | Isolated objects                       | 5       | 26k
3.2.1.1 T-LESS
Hodaň et al. [Hod+17] published one of the first RGB-D datasets of textureless
surfaces, named T-LESS, in 2017. It contains around 38k training images of 30
industry-relevant small textureless objects. For training data, synchronized images
from a time-of-flight RGB-D sensor, a structured-light RGB-D sensor, and a regular
RGB camera were captured against a black background by sampling each object
systematically along a sphere with 10° elevation steps between 85° and −85°, and 5° azimuth steps. The test dataset contains annotations for the task of 6D pose
estimation, including two 3D models of each object. It is technically possible to
also use this dataset for training networks to reconstruct 3D models from images
of textureless objects. However, as this dataset was initially captured for the robot
manipulation and pose estimation tasks, the objects in it are made up of basic
geometric shapes. They show few depth variations or surface deformations, which
makes this dataset unsuitable for 3D reconstruction. No examples exist in the
literature using the T-LESS dataset to train networks for 3D reconstruction.
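The systematic sphere sampling described above, i.e. enumerating elevation and azimuth steps and converting them to camera positions, can be sketched as follows. The step sizes mirror the T-LESS description, while the radius and the helper function itself are illustrative and not taken from any released code.

```python
import numpy as np

def camera_positions(radius=1.0, elev_step=10, elev_range=(-85, 85), azim_step=5):
    """Enumerate camera positions on a sphere around the object, given
    elevation/azimuth step sizes in degrees. Returns an (N, 3) array."""
    positions = []
    for elev in range(elev_range[0], elev_range[1] + 1, elev_step):
        for azim in range(0, 360, azim_step):
            e, a = np.radians(elev), np.radians(azim)
            positions.append([
                radius * np.cos(e) * np.cos(a),   # x
                radius * np.cos(e) * np.sin(a),   # y
                radius * np.sin(e),               # z (up)
            ])
    return np.asarray(positions)

poses = camera_positions()
print(len(poses), "viewpoints")  # 18 elevations x 72 azimuths = 1296
```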
3.2.1.2 Ley Datasets
In the same year, Ley et al. [LHH16] published a small real-world dataset containing
two textureless objects: white walls and a cat sculpture. The wall dataset contains seven different viewpoints, with 30 images each, of a single scene containing a sofa and a chair in front of white walls, where the walls are defined as the "textureless"
objects, and the couch and chair are textured. Similarly, the cat dataset contains 14
viewpoints of a white cat sculpture with 30 images each. This dataset was used to
reconstruct 3D point clouds of the scenes using a multi-view stereo (MVS) pipeline,
and because of its tiny size, it is not suitable for deep learning-based approaches.
3.2.1.3 Bednarik Dataset
Bednarik et al. [BFS18] made available another RGB-D dataset in 2018 containing 5
deformable textureless objects. Four of these objects are clothing items (cloth, hoody,
sweater, tshirt) and the fifth a crumpled sheet of paper, all of which exhibit a significant amount of deformation, giving depth maps and surface normals with many variations. The objects have no texture or colors on them. Figure 3.2a shows
some samples from this dataset.
Like T-LESS, they used a time-of-flight camera (Microsoft Kinect v2) to capture
synchronized RGB images and corresponding depth maps of real-world objects.
The surface normals were then computed by differentiating the depth maps. This
dataset contains 26,445 samples in total, with each sample containing an RGB image
showing the object in front of a black background, depth and surface normal maps
of the object corresponding to that image, and for a small subset of the cloth object,
also triangulated 3D meshes. The tshirt, hoody, and sweater were worn by a person
who made random motions to simulate realistic creases. The sheet of cloth was fixed
to a bar on the wall and manually deformed, and the piece of paper was crumpled
by hand to create different depths. Different combinations of four light sources were
used to create lighting variations across different recording sequences. This included
three fixed lights in front of the objects on the right, left, and center, and one moving
Fig. 3.2: Examples of images in the datasets of Bednarik et al. and Golyanik et al., which are used to evaluate some of the networks in this paper. (a) The textureless surfaces dataset [BFS18] contains RGB images and corresponding normal and depth maps for 5 different real objects. (b) The synthetic point cloud dataset of Golyanik et al. has a deforming thin plate rendered with 4 different textures under 5 different illuminations (figures adapted from [BFS18; Gol+18]).
dynamic light in the room. Table 3.2 summarizes the number of samples of each
kind of object.
Tab. 3.2:
Summary of objects in the textureless surfaces dataset [BFS18]. Sequences of
data samples were captured using a Kinect device at 5 FPS with varying lighting
conditions across sequences.
cloth tshirt sweater hoody paper
sequences 18 12 4 1 3
samples 15,799 6739 2203 517 1187
Several works have shown encouraging performance for the depth and normal map
reconstruction tasks [BFS18; TA19] and the 3D point cloud reconstruction task
[Shi+19] for textureless objects using this dataset. But like most real-world datasets,
this is limited by the number of samples and types of objects it contains, with all
five objects in this dataset having similar geometric structures. The Kinect v2 sensor
often produces depth maps with noise and holes, and [BFS18] used interpolation
to fill these holes, giving depth maps, and subsequently normal maps, of a lower quality than those that can be obtained from 3D modeling software.
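Hole filling of this kind is commonly done by interpolating missing pixels from their valid neighbors. The sketch below uses nearest-neighbor interpolation with SciPy as one plausible choice; it is not the exact procedure used in [BFS18].

```python
import numpy as np
from scipy.interpolate import griddata

def fill_depth_holes(depth: np.ndarray) -> np.ndarray:
    """Fill zero-valued (missing) pixels in a Kinect-style depth map by
    nearest-neighbor interpolation from the surrounding valid pixels."""
    h, w = depth.shape
    yy, xx = np.mgrid[0:h, 0:w]
    valid = depth > 0
    filled = griddata(
        points=np.stack([yy[valid], xx[valid]], axis=-1),  # coordinates of valid pixels
        values=depth[valid],                               # their depth values
        xi=(yy, xx),                                       # query every pixel
        method="nearest",
    )
    return filled.astype(depth.dtype)

depth = np.random.uniform(0.5, 4.5, (120, 160)).astype(np.float32)
depth[40:60, 50:80] = 0.0                      # simulate a hole
print((fill_depth_holes(depth) == 0).sum())    # 0 remaining holes
```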
3.2.2 More Opaque Datasets
The 3D datasets of regular opaque objects are much more common than those for
textureless objects. We discuss the most popular of these here, which are summarized
in Table 3.3.
Tab. 3.3: Summary of the existing 3D datasets of opaque objects.

Dataset (Year)       | Data Type           | Sensor Type/Model | Scene Type       | Samples | Objects
ShapeNetCore (2015)  | Mesh, Voxels        | Synthetic         | -                | 51k     | 55
R2N2 (2016)          | Color, Mesh, Voxels | Synthetic         | Isolated objects | 50k     | 13
Golyanik (2021)      | Color, Point Cloud  | Synthetic         | Isolated objects | 5k      | 1
3.2.2.1 ShapeNet
The ShapeNet dataset [Cha+15] (2015) is one of the largest public repositories of 3D
objects containing millions of richly annotated 3D CAD models organized according
to the WordNet [Mil95] hierarchy; around 220M CAD models are classified into
3135 WordNet synsets [GB19]. ShapeNet includes many smaller subsets, and an
often used subset is called ShapeNetCore, which comprises approximately 51,300
densely annotated and manually verified 3D models of 55 common objects. The
3D models are represented as 3D meshes, vertices, and faces. Each vertex is a 3D
point in space, and each face is a polygon defined by three vertices. These models
are provided as Wavefront files along with their texture information and the pre-
computed voxelizations of the models. However, it does not provide any rendered
images of the 3D models, and no examples of rendered images of these models
without textures exist in the literature. In addition, while ShapeNet provides the 3D
geometry and voxel representations of the objects, it does not directly provide the
depth maps or surface normal maps. It is particularly ill-suited in its raw form for
training neural networks to reconstruct depth or surface normal maps from a single
view. It is available for download from the ShapeNet website [Tea].
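Because ShapeNet ships only meshes, any depth or normal supervision has to be rendered from them. One possible way to render a depth map from a ShapeNet mesh is sketched below using trimesh and pyrender; the file path is a placeholder, and this is not the Blender-based pipeline used later in this thesis.

```python
import numpy as np
import trimesh
import pyrender

# Load a ShapeNet model (placeholder path) and normalize it to a unit box.
mesh = trimesh.load("model_normalized.obj", force="mesh")
mesh.apply_translation(-mesh.bounding_box.centroid)
mesh.apply_scale(1.0 / mesh.bounding_box.extents.max())

scene = pyrender.Scene()
scene.add(pyrender.Mesh.from_trimesh(mesh, smooth=False))

# Place a pinhole camera in front of the object and render offscreen.
camera = pyrender.PerspectiveCamera(yfov=np.pi / 3.0)
cam_pose = np.eye(4)
cam_pose[2, 3] = 2.0                       # move the camera back along +z
scene.add(camera, pose=cam_pose)

renderer = pyrender.OffscreenRenderer(viewport_width=224, viewport_height=224)
color, depth = renderer.render(scene)      # depth is a float (H, W) array in scene units
renderer.delete()
print(depth.shape, depth.max())
```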
3.2.2.2 R2N2 Dataset
In 2016, Choy et al. [Cho+16] defined a benchmark on the 13 of the 55 ShapeNetCore categories with the most 3D models. These include around
50k models of a plane, bench, cabinet, car, chair, monitor, lamp, speaker, firearm,
sofa, table, phone, and watercraft. Choy et al. [Cho+16] also published up to 24
texture-mapped renders per model showing these ShapeNet objects from random
views. These renders have a resolution of 137 × 137 pixels. This subset of the ShapeNet
dataset, also known as the R2N2 dataset, became the de facto standard dataset for
many subsequent 3D reconstruction networks [Wan+18; Sal+20; YTZ21; LK21].
3.2.2.3 Golyanik Dataset
Golyanik et al. generated a synthetic 3D dataset in point cloud representation. Using
Blender [Com18], they created a 3D scene with a thin plate undergoing various
isometric non-linear deformations. Four kinds of textures (endoscopy, graffiti, clothes, and carpet) were mapped onto the deformed 3D model, which was then
illuminated in various settings using five different light sources. The scene was
viewed from five separate cameras at different angles. In this way, a total of 4648
states were generated. Each state is represented with 732 3D points sampled on a
regular grid at rest and a consistent topology across states. For each state, there is
also a corresponding rendered 2D image viewing the object from one of the cameras.
Figure 3.2b shows some samples from this dataset.
3.2.3 Transparent Datasets
We now explore the RGB-D datasets of transparent objects that already exist. As
we are interested in reconstructing the depth maps and surface normal maps from
a single view, we will only discuss datasets that can be used for this task and are
publicly available. Not many transparent datasets with other types of 3D groundtruth
are available to the best of our knowledge.
3.2.3.1 TransProteus
The TransProteus dataset [Epp+22] is a large synthetic dataset of 3D shapes of
transparent objects generated with Blender. This dataset was first published in
2022 and is one of the most recent datasets for transparent objects. It contains 50k
images of transparent containers with liquids or solid objects inside. Around 13k
random objects from the ShapeNet dataset were used for both the containers and
the objects within. Three separate depth maps are provided for each image, one
for the glass container, one for the object within, and one for the opening of the
container. Groundtruth segmentation masks are also provided for the glass container
and the object inside. The data is rendered inside synthetic scenes containing a
ground plane with 1450 material textures, a realistic environment created with 500
different High Dynamic Range Imaging (HDRI) images providing a wide range of
natural lighting conditions, and the object itself. In addition to the synthetic data,
TransProteus also provides 104 actual photographs of transparent objects and their
corresponding depth maps, which were captured using the RealSense depth sensor
[Kes+17].
3.2.3.2 ClearGrasp
The ClearGrasp dataset [Saj+20] is another synthetically generated 3D dataset of
transparent objects published in 2020. Like TransProteus, it contains over 50k images
of transparent objects but does not contain solid objects inside glass containers. It
uses only nine different CAD models, 33 HDRI environments for lighting, and 65
materials in its synthetic scenes, which means this dataset lacks the variety of
TransProteus. However, it provides groundtruth 3D data in the form of not only
depth maps and segmentation masks but also surface normal maps and object
boundary maps, which makes it possible to train networks for tasks such as depth
estimation, normal map reconstruction, and boundary detection. The dataset also
has a real-world component comprising 286 images and corresponding depth maps
of 10 transparent objects.
3.2.3.3 TransCG
TransCG [Fan+22] is another transparent dataset that was also published in 2022
for the task of robotic object grasping. It is the first large dataset containing 57,715
RGB-D images of real-world transparent objects. It includes 130 scenes having
51 everyday household transparent objects and several opaque objects captured
from various angles, including entirely transparent objects, translucent objects, and
objects with many small dense holes. The dataset is collected with two RealSense
cameras using a semi-automatic pipeline that outputs the 6D pose of the transparent
object, which is then used to generate the groundtruth depth maps, surface normals,
and segmentation masks.
3.3 Reconstruction of Opaque Objects
This section introduces some of the recent methods proposed for reconstructing 3D
surfaces from a single 2D image. These are summarized in Table 3.4.
3.3.1 Bednarik et al.
Fig. 3.3: The textureless surface reconstruction network [BFS18] (left) consists of an encoder Λ that takes a masked image I^n_m as input and outputs a latent representation Λ(I^n_m). This is followed by three parallel decoders Φ_N, Φ_D, and Φ_C that use this latent representation for reconstructing the normal map, depth map, and a 3D mesh, respectively. The indices of all maxpool operations in the encoder are saved when downsampling (right). These indices are later used for non-linear upsampling in corresponding decoder layers.
Bednarik et al. [BFS18] introduced a general framework for reconstructing the 3D
shape of textureless surfaces with an encoder–decoder architecture. Using a single
RGB image, they reconstruct the normal maps, depth maps, and triangulated meshes
for the objects in the images. Figure 3.3 shows an overview of their architecture.
This network has an encoder connected to three separate decoders, one each for
reconstructing the normal map, depth map, and the triangulated mesh. The encoder
takes an RGB image of size 224 × 224 × 3 and creates a latent representation of size
7 × 7 × 256. This encoding is fed to the three decoders.
The architecture of the encoder and the depth and normal decoders is based on
SegNet [BHC15]. The encoder has the same layers as VGG-16 [SZ14] except for the
fully connected layers. However, in contrast to VGG-16, the output channels at
the convolutional blocks are 32, 64, 128, 256, and 256, respectively. As the normal maps
and depth maps have the same spatial size as the input image, the normal and
depth decoders are symmetric to the encoder, with both having the same architecture
except for the number of channels at the final output layer; the normal decoder has
three channels and the depth decoder has one channel. Like SegNet, pooling indices
at the max pooling layers in the encoder are saved, and used in the normal and
depth decoders to perform non-linear upsampling. For the mesh decoder, a smaller
network with a single convolutional layer followed by average pooling and a fully
connected layer is used.
The depth decoder is trained by minimizing the absolute difference between the
predicted and ground-truth depth values of the foreground, giving the loss function
L_D = \frac{1}{N} \sum_{n=1}^{N} \frac{\sum_i \left| D^n_i - \Phi_D(\Lambda(I^n_m))_i \right| B^n_i}{\sum_i B^n_i}, (3.1)

where D^n is the ground-truth depth map and Φ_D is the depth decoder, which takes
the encoder output on the masked input image Λ(I^n_m) and returns the predicted
depth map. The absolute difference is only calculated for the foreground pixels, i.e.,
where the foreground mask B^n has the value 1, and the sum of absolute differences
is averaged over all the foreground pixels.
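The following PyTorch sketch illustrates this masked L1 depth loss; the tensor shapes and the function name are assumptions for illustration and do not come from the published Keras implementation.

```python
import torch

def masked_depth_loss(pred_depth, gt_depth, fg_mask, eps=1e-8):
    """Masked L1 depth loss in the spirit of Eq. (3.1).

    pred_depth, gt_depth: (N, 1, H, W) predicted and ground-truth depth maps.
    fg_mask: (N, 1, H, W) binary foreground mask B.
    """
    abs_diff = (pred_depth - gt_depth).abs() * fg_mask
    # Average the absolute differences over the foreground pixels of each sample,
    # then average over the batch of N samples.
    per_sample = abs_diff.flatten(1).sum(dim=1) / (fg_mask.flatten(1).sum(dim=1) + eps)
    return per_sample.mean()
```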
To train the normal decoder, the angular distance between the predicted and ground-
truth normal vectors and the length of the predicted normal vectors are both opti-
mized using the loss function
L_N = \frac{1}{N} \sum_{n=1}^{N} \frac{\sum_i \left( \kappa L_a(N^n_i, \bar{N}^n_i) + L_l(\bar{N}^n_i) \right) B^n_i}{\sum_i B^n_i} (3.2)

with

L_a(N^n_i, \bar{N}^n_i) = \arccos\left( \frac{N^n_i \cdot \bar{N}^n_i}{\|N^n_i\| \, \|\bar{N}^n_i\|} \right) \frac{1}{\pi}, (3.3)

L_l(\bar{N}^n_i) = \left( \|\bar{N}^n_i\| - 1 \right)^2 (3.4)

where L_a is the angular distance calculated as the arccos of the cosine similarity
between the predicted and ground-truth normal vectors, L_l is the term that prefers
unit normal vectors, and κ is a hyperparameter that sets the relative influence
of the two terms. Furthermore, N^n is the ground-truth normal map, and
\bar{N}^n = Φ_N(Λ(I^n_m)) is the predicted normal map. As with the depth loss, the normal loss
is only calculated for foreground pixels.
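A matching PyTorch sketch of this normal loss is shown below; again, the shapes, the clamping of the cosine for numerical stability, and the function name are assumptions rather than details of the original implementation.

```python
import math
import torch
import torch.nn.functional as F

def masked_normal_loss(pred_n, gt_n, fg_mask, kappa=10.0, eps=1e-8):
    """Angular-distance plus unit-length loss in the spirit of Eqs. (3.2)-(3.4).

    pred_n, gt_n: (N, 3, H, W) predicted and ground-truth normal maps.
    fg_mask: (N, 1, H, W) binary foreground mask B.
    """
    cos = F.cosine_similarity(pred_n, gt_n, dim=1)                 # (N, H, W)
    ang = torch.acos(cos.clamp(-1 + 1e-6, 1 - 1e-6)) / math.pi     # L_a, in [0, 1]
    length = (pred_n.norm(dim=1) - 1.0) ** 2                       # L_l, prefers unit normals
    per_pixel = (kappa * ang + length) * fg_mask.squeeze(1)
    per_sample = per_pixel.flatten(1).sum(dim=1) / (fg_mask.flatten(1).sum(dim=1) + eps)
    return per_sample.mean()
```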
Finally, for the triangulated mesh prediction, the mesh decoder optimizes the mean
squared error between predicted and ground-truth vertex coordinates. That is,
L_C = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{V} \sum_{i=1}^{V} \left\| v^n_i - \Phi_C(\Lambda(I^n_m))_i \right\|^2 (3.5)
As all three decoders take input from the same encoder with the same latent rep-
resentation, they can be trained either jointly or separately. When trained jointly,
the authors of [BFS18] show the accuracy of the reconstruction improves because
the encoder is able to learn more robust feature extractors. The textureless dataset
described in Section 3.2.1.3 was used for training and testing this network, and
experiments showed poor reconstruction accuracy for 3D meshes compared to normal
and depth maps.
The network was trained using the Adam optimizer [KB14] with a fixed learning rate
of 10^{-3} and κ = 10. The authors used Keras [Cho+15] with a Tensorflow [Mar+15]
backend for implementation and published the source code. At run-time, the network
takes 0.016 s to predict both depth and normal maps together, and 0.01 s when
predicting either the depth or normal map individually.
3.3.2 Patch-Net
Fig. 3.4:
Patch-Net uses Bednarik et al.’s network with only depth and normal decoders.
The input image is divided into overlapping patches, and predictions for each
patch are obtained separately. Patch predictions are stitched to form the complete
depth and normal maps.
Tsoli and Argyros [TA19] proposed a patch-based variation for better textureless
reconstruction. They take the network from [BFS18] and change the block sizes
to match VGG-16 [SZ14], i.e., 64, 128, 256, 512, 512. They also remove the
mesh decoder, keeping only the normal and depth decoders. They divide the input
image into overlapping patches and get per patch reconstructions for normal and
depth maps. These patches are then stitched together to get the final normal and
depth maps at the input image resolution, and use bilateral filtering to smooth out
inconsistencies that were not resolved by stitching. They call this network Patch-Net.
Since the network expects a 224 × 224 spatial size of the input, each patch can have
that size with the full image being even larger. This allows Patch-Net to get a higher
resolution reconstruction than [BFS18] with better accuracy and generalization. It
uses the loss functions of Equations (3.2) and (3.1) on each patch to compute the
normal and depth loss respectively.
The network was trained using the Adam optimizer with a fixed learning rate of 10^{-3}.
The authors extended the source code from [BFS18], and trained their network on
an Nvidia Titan V GPU with 12 GB memory. This code is not publicly available, and
the authors do not report inference-time performance.
3.3.3 HDM-Net
Fig. 3.5: Overview of the HDM-Net [Gol+18] architecture. It has an encoder that takes an RGB image of size 224 × 224 × 3 and encodes it into a latent representation of size 28 × 28 × 128. This is then used by the decoder to reconstruct a 3D point cloud of the surface with 73² points.
The Hybrid Deformation Model Network (HDM-Net) [Gol+18] is another approach
for reconstructing deformable surfaces from a single view. Like [BFS18] and Patch-
Net, HDM-Net uses an encoder–decoder architecture (Figure 3.5), but with only one
decoder instead. However, the encoder and decoder are not symmetric to each other
in this network. They also have a smaller depth, with only 9 convolution layers
in the encoder instead of 13 convolution layers in the VGG-16-based architectures.
The upsampling in the decoder is performed using transposed convolutions, as
in [RFB15], except at the first decoder layer where a non-linear max-unpooling
operation similar to [BHC15; BFS18; TA19] is used. HDM-Net directly learns the
3D shape and gives a dense reconstruction of the surface of size 73 × 73 × 3 as a
point cloud. It is trained on the synthetic point cloud data (Section 3.2.2.3) of a thin
non-rigid plate undergoing various non-linear deformations, with a known shape
at rest. Three different domain-specific loss functions are used to jointly optimize
the output of the network, with the goal of learning texture-dependent surface
deformations, shading, and contours for effective handling of occlusions.
The first loss function is a common 3D regression loss that computes the 3D error by
penalizing the difference between the predicted 3D geometry S_n and the ground-truth
3D geometry S'_n, that is,

L_{3D} = \frac{1}{N} \sum_{n=1}^{N} \left\| S'_n - S_n \right\|_F^2 (3.6)
where \|\cdot\|_F is the Frobenius norm. For each state n, the squared Frobenius norm of
the difference between the predicted and ground-truth geometries is calculated, and
then averaged over all N states.
An isometry prior is used to constrain the regression space using an isometric loss
that penalizes the roughness in the predicted surface by ensuring that neighboring
vertices are located close to each other. The loss function is expressed in terms of
the predicted geometry S_n and its smooth version \bar{S}_n:

L_{iso} = \frac{1}{N} \sum_{n=1}^{N} \left\| \bar{S}_n - S_n \right\|_F (3.7)

with

\bar{S}_n = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{x^2 + y^2}{\sigma^2} \right) * S_n (3.8)
where * is the convolution operator, σ² is the variance of the Gaussian, and x and y
stand for the point coordinates.
The third loss function optimizes the contour shapes by computing a reprojection
loss. The predicted and ground-truth 3D geometries are first projected onto a 2D
plane before computing their difference as

L_{cont} = \frac{1}{N} \sum_{n=1}^{N} \left\| \tau(\pi(S_n)) - \tau(\pi(S'_n)) \right\|_F^2 (3.9)
where π is a differentiable 3D-to-2D projection function and τ is a function that
thresholds all positive values to 1 using a combination of tanh and ReLU. This gives
contours as 0–1 transitions. The total loss is computed by adding all three losses
with equal weights.
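A rough PyTorch sketch of the 3D regression loss and the isometry prior is given below; the Gaussian kernel size and standard deviation are assumptions, since the exact smoothing parameters are not stated here, and the kernel is normalized to sum to one rather than using the 1/(2πσ²) prefactor.

```python
import torch
import torch.nn.functional as F

def hdmnet_3d_and_iso_losses(pred, gt, sigma=1.0, ksize=5):
    """Sketch of Eq. (3.6) and Eqs. (3.7)-(3.8) for surface grids.

    pred, gt: (N, 3, 73, 73) predicted and ground-truth surface geometries.
    """
    # 3D regression loss: mean squared Frobenius norm of the difference.
    loss_3d = ((gt - pred) ** 2).flatten(1).sum(dim=1).mean()

    # Isometry prior: compare the prediction with a Gaussian-smoothed version of itself.
    ax = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2
    g1d = torch.exp(-(ax ** 2) / sigma ** 2)                       # matches the exp(-(x^2+y^2)/sigma^2) form
    kernel = torch.outer(g1d, g1d)
    kernel = (kernel / kernel.sum()).view(1, 1, ksize, ksize).repeat(3, 1, 1, 1)
    smoothed = F.conv2d(pred, kernel, padding=ksize // 2, groups=3)
    loss_iso = (smoothed - pred).flatten(1).norm(dim=1).mean()     # Frobenius norm per state

    return loss_3d, loss_iso
```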
HDM-Net was trained for 95 epochs on a GEFORCE GTX 1080Ti GPU with 11 GB of
global memory. The training relied on the PyTorch framework [Pas+19] and took
2 days to complete. At inference time, the network can reconstruct frames with a
frequency of 200 Hz, or 0.005 s per frame. The source code was not published.
3.3.4 IsMo-GAN
An improved version of HDM-Net is the Isometry-Aware Monocular Generative Ad-
versarial Network (IsMo-GAN) [Shi+19], which introduces two key modifications to
achieve a 10–30% reduction in reconstruction error in different cases, including the
reconstruction of textureless surfaces. First, IsMo-GAN has an integrated Object Detection
Network (OD-Net) that generates a confidence map separating the background from
the foreground. Secondly, IsMo-GAN is trained in an adversarial setting, which is
different from the training of the simple auto-encoder-based networks discussed
in previous sections. The OD-Net is a simplified version of U-Net [RFB15] with
fewer layers than the original. It takes a 224 × 224 × 3 RGB image and outputs a
grayscale confidence map indicating the position of the foreground in the image. The
confidence map is binarized [Ots79] and the target image is extracted using Suzuki
et al.'s algorithm [Suz+85]. The masked-out input image is then passed to the
Reconstruction Network (Rec-Net), which has skip connections like HDM-Net and
has a similar architecture but with fewer layers. Like HDM-Net, the Rec-Net outputs
a 73 × 73 × 3 point cloud. OD-Net and Rec-Net together make up the generator
of IsMo-GAN. The discriminator network consists of four convolution layers followed
by a fully connected layer and a sigmoid function. IsMo-GAN uses the LeakyReLU
activation everywhere, instead of the ReLU used in all other networks discussed
previously. Figure 3.6 shows an overview of the IsMo-GAN network.

Fig. 3.6: Overview of IsMo-GAN [Shi+19]. The generator network accepts a masked RGB image, segmented by the object detection network (OD-Net), and returns a 3D point cloud. The output and ground truth are fed to the discriminator, which serves as a surface regularizer.
IsMo-GAN penalizes the output of the Rec-Net with the 3D loss (Equation (3.6))
and the isometric loss (Equation (3.7)) from HDM-Net, where the predicted geometry is
equal to the generator output on the input image, i.e., S_n = G(I). In addition to
this, for adversarial training, IsMo-GAN uses the binary cross entropy (BCE) loss [Goo+14],
defined as

L_G = -\frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \log\left( D(G(I^n_m)) \right) (3.10)

for the generator G, and

L_D = -\frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \left[ \log\left( D(S'_m) \right) + \log\left( 1 - D(G(I^n_m)) \right) \right] (3.11)

for the discriminator D, where M is the number of states, and N is the number
of images for each state. The adversarial loss is then defined as the sum of the
generator and discriminator losses

L_{adv} = L_G + L_D, (3.12)
and it represents the overall objective of the training which encourages IsMo-GAN
to generate more realistic surfaces. It is a key component that lets IsMo-GAN
outperform HDM-Net [Gol+18] by 10–15% quantitatively as well as qualitatively on
real images. The adversarial loss makes up for the undesired effects of the 3D loss
and the isometry prior by acting as a novel regularizer for the surface deformations.
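A minimal PyTorch sketch of these two adversarial terms, assuming the discriminator outputs probabilities in (0, 1), could look as follows; the function name and the epsilon for numerical stability are illustrative assumptions.

```python
import torch

def adversarial_losses(d_fake, d_real, eps=1e-8):
    """Generator and discriminator terms in the spirit of Eqs. (3.10)-(3.11).

    d_fake: discriminator scores D(G(I)) on reconstructed surfaces.
    d_real: discriminator scores D(S') on ground-truth surfaces.
    """
    loss_g = -torch.log(d_fake + eps).mean()                                  # Eq. (3.10)
    loss_d = -(torch.log(d_real + eps) + torch.log(1 - d_fake + eps)).mean()  # Eq. (3.11)
    return loss_g, loss_d
```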
This network is trained and evaluated on the same dataset as HDM-Net, as well as
on the 3D mesh data of the
cloth
object from the subset of the Bednarik et al.’s real
textureless surfaces dataset.
The OD-Net and Rec-Net were both trained separately for 30 and 130 epochs
respectively, using the Adam optimizer with a fixed learning rate of 10^{-3} and a batch
size of 8. IsMo-GAN was implemented using PyTorch, but the source code was not
made public. It takes 0.004 s to run an inference, which is a 20% improvement over
HDM-Net.
3.3.5 Pixel2Mesh
Fig. 3.7:
The Pixel2Mesh [Wan+18] network consists of two parallel networks that take an
RGB image and a coarse ellipsoid 3D mesh, and learn to regress the 3D shape of
the object in the image. The key contribution is the graph-based convolutions and
unpooling operators in the bottom half of the network.
Pixel2Mesh [Wan+18] is a deep learning network that reconstructs 3D shape as a
triangulated mesh from a single RGB image. It was proposed in 2018 and is one
of the earliest methods for monocular 3D reconstruction. Its primary idea is to
use graph-based convolutions [Bro+17] to regress the mesh vertices. The network
is made up of two parts: a VGG-16-based feature extractor and a graph-based
convolution network (GCN). The feature extractor network takes a 224 × 224 image
of the object to be reconstructed. Additionally, the GCN takes an ellipsoid mesh with 156 vertices
and 462 edges. The feature extractor network then feeds the extracted perceptual
features at different stages to the GCN in a cascaded manner, which refines the initial
mesh in a coarse-to-fine manner by adding details at each stage. The GCN finally
outputs a mesh with 2466 vertices (Figure 3.7). Each mesh deformation block in the
GCN is made of 14 layers of graph-based convolutions with ResNet-like [He+15]
skip connections. Their job is to optimize the position of existing vertices to get a
mesh matching the object shape. This is followed by a graph unpooling layer that
interpolates the mesh to increase the number of vertices.
Pixel2Mesh combines four different loss functions to optimize its weights. These
include the Chamfer loss [FSG17] to constrain the location of mesh vertices, a
normal consistency loss, a Laplacian regularization to maintain the neighborhood
relationships when deforming the mesh, and an edge length loss to prevent outliers.
The total loss is then calculated as a weighted sum of the individual losses. The
network is trained and evaluated on the R2N2 subset [Cho+16] of the ShapeNet
dataset [Cha+15], which consists of synthetically rendered images and 3D mesh
ground truth. The network is also qualitatively evaluated on the Stanford Online
Products dataset [Oh +16], which contains real-world images of objects without
any 3D labels.
Pixel2Mesh was implemented using Tensorflow and the official source code is
available on GitHub. It used the Adam optimizer with a weight decay of 10^{-5}
and a batch size of 1 to train for 50 epochs, with an initial learning rate of 3 × 10^{-5}.
The training took 72 h on an Nvidia Titan X GPU with 12 GB memory, and the trained
network can reconstruct a mesh containing 2466 vertices in 15.58 ms.
3.3.6 Salvi et al.
Fig. 3.8:
The attentioned ResNet-18 [He+15] network with four self-attention blocks
[Vas+17] added to it. This encoder network is used by [Sal+20] to extract
image features, which are fed to a decoder with five Conditional Batch Normaliza-
tion blocks followed by an occupancy function.
A newer category of networks adds self-attention modules [Vas+17; Zha+19] to
3D reconstruction networks. Salvi et al. [Sal+20] proposed one such network, which improves
Occupancy Networks (ONets) [Mes+19] by adding self-attention to them. ONets
consist of three parts: a feature extractor, a decoder, and a continuous decision
boundary function, called the occupancy function o: R³ → {0, 1}, that classifies each
point from the space as whether or not it belongs to the surface. This provides a
general 3D representation that allows extracting meshes at any resolution. ONets
are an extension of the autoencoders discussed in previous sections, where the
encoders functioned as feature extractors, followed by a decoder to reconstruct the
3D shape.
In the networks discussed previously as well as ONets introduced in [Mes+19], the
feature extractors are based on CNNs. Standard CNNs work with local receptive
fields and need very deep architectures to successfully model global dependencies.
This is because the features they learn are relatively shallow and do not capture the
long-range correlations in natural images. To address this limitation, self-attention
modules were introduced that calculate the response at a given position as a weighted
sum of the features at all positions. This allows them to efficiently model global
dependencies with much smaller networks than traditional CNNs. Salvi et al. show
that adding self-attention modules at different locations in the feature extractor
can improve the performance of an Occupancy Network. When used earlier in the
network, self-attention allows the network to focus more on finer details. When
used later in the network, it allows the network to extract better structural features.
Figure 3.8 depicts one such feature extractor proposed by [Sal+20], showing a
ResNet-18 [He+15] network with four self-attention modules.
They train their network on the synthetic R2N2 dataset [Cho+16] (see Section 3.2.2.2)
using an ensemble approach, where the ensemble is made up of one specialized
ONet for each object type. This is supported by their experiments which show that
self-attention-based ONets have better results if trained for each category separately.
The network was also qualitatively evaluated on a subset of the Stanford Online
Products dataset [Oh +16], which contains real images, and showed a more con-
sistent and better reconstruction of meshes when compared to existing approaches.
Self-attention in decoders was not used due to computational limitations.
The Adam optimizer with a learning rate of 10^{-3} and a weight decay of 10^{-5} was used for
training the network for 200K steps. All other hyperparameters were kept the same
as in [Mes+19]. The source code for this network is not available.
Fig. 3.9:
Overview of VANet [YTZ21], a unified approach for both single and multi-view
reconstruction with a two-branch architecture.
3.3.7 VANet
Another network that uses the attention mechanism is the View Attention Guided
Network (VANet) [YTZ21]. It uses channel-wise view attention and a dual pathway
network for better reconstruction of occluded parts of the objects, and defines
a unified approach for both single and multi-view reconstruction. As shown in
Figure 3.9, the proposed architecture consists of a main pathway and an auxiliary
pathway. The main path uses the first view of a scene to reconstruct a 3D mesh. If
any more views are available, they are then fed to the auxiliary path, which aligns
them with the main view and uses the additional information from these new views
to refine the reconstructed mesh. The main view features after the encoder are
pooled along the spatial dimensions using global average pooling to get a channel
descriptor of shape 1 × 1 × C. This is then sent to a system of fully connected layers
followed by a sigmoid function to generate a channel-wise attention map α_main.
These attention weights are then used to re-calibrate the computed feature maps.
If auxiliary views are available, they are used to enhance the less-visible parts in
the original view. A max pooling operation is used to select permutation invariant
auxiliary view features, which are multiplied by 1 − α_main and finally added to the
main view features. These are then sent to a vertex prediction module to generate
the reconstructed 3D mesh. The vertex prediction module is based on the mesh
deformation module of Pixel2Mesh [Wan+18].
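The channel-wise view attention described above can be sketched in PyTorch roughly as follows; the bottleneck size of the fully connected layers and the module name are assumptions, not details taken from the VANet paper.

```python
import torch
import torch.nn as nn

class ViewAttention(nn.Module):
    """Sketch of VANet-style channel-wise view attention; layer sizes are assumptions."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, main_feat, aux_feats=None):
        # main_feat: (N, C, H, W); aux_feats: optional (N, V, C, H, W) auxiliary-view features.
        n, c, _, _ = main_feat.shape
        alpha = self.fc(main_feat.mean(dim=(2, 3))).view(n, c, 1, 1)  # channel descriptor -> attention
        out = main_feat * alpha
        if aux_feats is not None:
            aux = aux_feats.max(dim=1).values        # permutation-invariant max pooling over views
            out = out + aux * (1.0 - alpha)          # enhance less-visible parts of the main view
        return out
```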
VANet is trained using the same four loss functions as Pixel2Mesh, and evaluated
on the R2N2 subset [Cho+16] of the ShapeNet dataset [Cha+15]. Using the Adam
optimizer with an initial learning rate of 2 × 10^{-5} and a batch size of 1, the network
was trained for 20 epochs. It was implemented in Tensorflow but the source code
was not published.
3.3.8 3D-VRVT
Fig. 3.10:
3D-VRVT takes one image as input and uses a Vision Transformer encoder to
extract a feature vector. This is then fed to a decoder that outputs the voxel
representation of the object.
Vaswani et al. [Vas+17] initially proposed the Transformer architecture for natural
language processing (NLP) tasks. These methods used the self-attention mechanism
to let the network understand longer sequences of text to compute a representation
for the whole sequence. Salvi et al. [Sal+20] used the self-attention mechanism from
Transformers in their “attentioned” ResNet encoder to extract better features for 3D
reconstruction. However, their input is not sequential (Section 3.3.6). Kolesnikov
et al. [Kol+21] proposed a novel architecture called Vision Transformers that breaks
down images into patches and treats those patches as part of a sequence. Using a
linear projection, vector embeddings for each patch are obtained. This sequence
of patch embeddings is then fed to a Transformer network. Inspired by this, Li
and Kuang [LK21] proposed a Vision Transformer-based network (Figure 3.10) for
reconstructing voxels from a single image. They call this network 3D-VRVT.
3D-VRVT uses a Vision Transformer as an encoder that takes a 224 × 224 RGB image
as input and produces a feature vector of size 768, which is fed to a decoder network.
The decoder network has a fully connected layer that upscales the feature vector to
2048 and then reshapes it into a 3D tensor of shape 256 × 2³. This is followed by four
3D deconvolutions with a kernel size of 4, stride 2, and padding 1 that iteratively
refine the 3D grid until it has the resolution 32 × 32³. Each deconvolution operation
is also followed by a 3D batch normalization and a GELU activation function. Then,
a final deconvolution with kernel size 1 is applied to get a grid of 1 × 32³. This is
passed through a sigmoid activation function before getting the final voxel output.
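This decoder can be sketched in PyTorch as below; the intermediate channel counts between the four deconvolutions are assumptions, since only the input (256) and output channel sizes are stated above.

```python
import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    """Sketch of a 3D-VRVT-style voxel decoder; intermediate channel counts are assumptions."""

    def __init__(self, feat_dim=768, channels=(256, 192, 128, 64, 32)):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 2048)  # 2048 = 256 * 2^3
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [
                nn.ConvTranspose3d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm3d(c_out),
                nn.GELU(),
            ]
        self.deconvs = nn.Sequential(*blocks)
        self.head = nn.Sequential(nn.ConvTranspose3d(channels[-1], 1, kernel_size=1), nn.Sigmoid())

    def forward(self, feat):
        x = self.fc(feat).view(-1, 256, 2, 2, 2)   # 2^3 grid with 256 channels
        x = self.deconvs(x)                        # four upsampling steps: 2^3 -> 32^3
        return self.head(x)                        # (N, 1, 32, 32, 32) voxel occupancy grid
```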
The network was trained on the ShapeNet dataset. It used an SGD optimizer with a
momentum of 0.9 and a warm-up cosine annealing learning rate schedule, with the
learning rate ranging between 2 × 10^{-5} and 2 × 10^{-3}. The training relied on a PyTorch
implementation and continued for 600 epochs on an Nvidia Titan V GPU, including 10 warm-up epochs. At
test time, it takes 8.82 ms to reconstruct an object with this network.
Tab. 3.4: Summary of all 3D reconstruction networks discussed in this thesis.

- Bednarik et al. [BFS18] (2018). Architecture: VAE with one encoder and three decoders. Output: normal map, depth map, and 3D mesh. Method: based on SegNet [BHC15] with a VGG-16 [SZ14] backbone. Dataset type: real, deformable, textureless surfaces.
- Patch-Net [TA19] (2019). Architecture: VAE with one encoder and two decoders. Output: normal and depth maps. Method: converts the image to patches, gets the 3D shape of the patches using [BFS18], and stitches them together. Dataset type: real, deformable, textureless surfaces.
- Hybrid Deformation Model Network (HDM-Net) [Gol+18] (2018). Architecture: VAE with one encoder and one decoder. Output: 3D point cloud. Method: simple autoencoder with ResNet-like [He+15] skip connections; combines a 3D regression loss with an isometry prior and a contour loss. Dataset type: synthetic, deformable, well-textured surfaces.
- Isometry-Aware Monocular Generative Adversarial Network (IsMo-GAN) [Shi+19] (2019). Architecture: GAN with two sequential VAEs as a generator and a simple CNN as discriminator. Output: 3D point cloud. Method: integrates an OD-Net to segment the foreground, and trains in an adversarial setting along with the 3D loss and isometry prior from [Gol+18]. Dataset type: synthetic, deformable, well-textured surfaces and real, deformable, textureless surfaces.
- Pixel2Mesh [Wan+18] (2018). Architecture: two-lane network with a feature extractor and a graph-based mesh predictor (GCN). Output: 3D mesh. Method: feature extractor based on VGG-16 [SZ14], feeds cascaded features to the GCN that uses graph convolutions. Dataset type: synthetic, rigid, well-textured surfaces.
- View Attention Guided Network (VANet) [YTZ21] (2021). Architecture: two-lane feature extractor for both single- and multi-view reconstruction, followed by a mesh prediction network. Output: 3D mesh. Method: uses channel-wise attention and information from all available views to extract features, which are then sent to a Pixel2Mesh-based [Wan+18] mesh vertex predictor. Dataset type: synthetic, rigid, well-textured surfaces.
- Salvi et al. [Sal+20] (2020). Architecture: VAE based on ONets [Mes+19] with self-attention [Vas+17] in the encoder. Output: parametric representation. Method: ResNet-18 [He+15] encoder with self-attention modules, followed by a decoder and an occupancy function. Dataset type: synthetic, rigid, well-textured surfaces.
- 3D-VRVT [LK21] (2021). Architecture: encoder-decoder architecture. Output: voxel grid. Method: encoder based on Vision Transformers [Kol+21], followed by a decoder made up of 3D deconvolutions. Dataset type: both synthetic and real, rigid, well-textured surfaces.
3.4 Evaluation Metrics
We described different 3D reconstruction methods in this chapter, which were trained
on various datasets and evaluated using different error metrics. In this section, we
define the error metrics commonly used for the evaluation of 3D reconstruction
methods.
1. Depth Error (E_D): The depth error metric is used to compute the accuracy of
depth map predictions. Let Θ_K and Θ'_K be the point clouds associated with the
predicted and ground-truth depth maps respectively, with the camera matrix K.
To remove the inherent global scale ambiguity [EPF14] in the prediction, Θ_K is
aligned to the ground-truth depth map D' to get an aligned point cloud \bar{Θ}_K as

\bar{\Theta}_K = \Omega(\Theta_K, D') (3.13)

where Ω is the Procrustes transformation [SG02]. Then, the depth error E_D is
calculated as

E_D = \frac{1}{N} \sum_{n=1}^{N} \frac{\sum_i \left\| \Theta'_{K,i} - \bar{\Theta}_{K,i} \right\| B^n_i}{\sum_i B^n_i}. (3.14)

Note that the foreground mask B in the equation ensures that the error is only
calculated for foreground pixels. Smaller depth errors are preferred.
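For a single sample, and assuming the predicted point cloud has already been Procrustes-aligned to the ground truth, the masked depth error can be sketched as follows; names and shapes are illustrative assumptions.

```python
import numpy as np

def depth_error(aligned_pred_pts, gt_pts, fg_mask):
    """Masked per-pixel point distance in the spirit of Eq. (3.14), for one sample.

    aligned_pred_pts, gt_pts: (H, W, 3) per-pixel 3D points; fg_mask: (H, W) binary mask.
    """
    dist = np.linalg.norm(gt_pts - aligned_pred_pts, axis=-1)
    return (dist * fg_mask).sum() / fg_mask.sum()
```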
2. Mean Angular Error (E_MAE): The mean angular error metric is used to
calculate the accuracy of normal maps, by computing the average difference
between the predicted and ground-truth normal vectors. The angular errors
for all samples are calculated using Equation (3.3), and then averaged for all
samples. Smaller angular errors indicate better predictions.
3. Volumetric IoU (E_IoU): The Intersection over Union (IoU) metric for meshes
is calculated as the volume of the intersection of ground-truth and predicted
meshes, divided by the volume of their union. Larger values are better.
4. Chamfer Distance (E_CD): The Chamfer distance is a measure of similarity between
two point clouds. It takes the distance of each point into account by finding,
for each point in a point cloud, the nearest point in the other cloud, and summing
their squared distances:

E_{CD} = \frac{1}{|\Theta|} \sum_{x \in \Theta} \min_{y \in \Theta'} \|x - y\|^2 + \frac{1}{|\Theta'|} \sum_{x \in \Theta'} \min_{y \in \Theta} \|x - y\|^2 (3.15)

where \|\cdot\|^2 is the squared Euclidean distance. A smaller CD score indicates a
better reconstruction.
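A brute-force NumPy sketch of Eq. (3.15) for small point clouds is given below; the function name is an assumption, and a KD-tree would be preferable for large clouds.

```python
import numpy as np

def chamfer_distance(p, q):
    """Chamfer distance of Eq. (3.15) between point clouds p (N, 3) and q (M, 3)."""
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)   # (N, M) squared pairwise distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```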
5. Chamfer-L1 (E_CD1): The Chamfer distance (CD) has a high computational
cost for meshes because of a large number of points, so an approximation
called Chamfer-L1 is defined. It uses the L1-norm instead of the Euclidean
distance [Sal+20]. Smaller values are preferred.
6. Normal Consistency (E_NC): The normal consistency score is defined as the
average absolute dot product of normals in one mesh and normals at the
corresponding nearest neighbors in the other mesh. It is computed similarly to
Chamfer-L1 but the L1-norm is replaced with the dot product of the normal
vectors on one mesh with their projection on the other mesh [Sal+20]. Normal
consistency shows how similar the shapes of two volumes are, and is useful in
cases such as where two meshes might overlap significantly, giving a high IoU,
but have a different surface shape. Higher normal consistency is preferred.
7. Earth Mover's Distance (E_EMD): The Earth Mover's Distance computes the cost of transforming one pile of
dirt, or one probability distribution, into another. It was introduced in [RTG00]
as a metric for image retrieval. In case of 3D reconstruction, it computes the
cost of transforming the set of predicted vertices into the ground-truth vertices.
The lower the cost, the better the prediction.
8. F-score (E_F):
The F-score evaluates the distance between object surfaces [Kna+17; Tat+19].
It is defined as the harmonic mean between precision and recall. Precision
measures reconstruction accuracy by counting the percentage of predicted
points that lie within a certain distance from the ground truth. Recall measures
completeness by counting the percentage of points on the ground truth that lie
within a certain distance from the prediction. The distance threshold τ can be
varied to control the strictness of the F-score. In the results reported in this
thesis, τ = 10^{-4}.
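A small NumPy sketch of the F-score at threshold τ is given below; the brute-force distance computation and the function name are illustrative assumptions.

```python
import numpy as np

def f_score(pred, gt, tau=1e-4):
    """F-score between predicted (N, 3) and ground-truth (M, 3) point sets."""
    d = np.sqrt(((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1))  # (N, M) pairwise distances
    precision = (d.min(axis=1) < tau).mean()   # fraction of predicted points close to the ground truth
    recall = (d.min(axis=0) < tau).mean()      # fraction of ground-truth points close to the prediction
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```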
4 Our Datasets
„Data! Data! Data! I can’t make bricks without clay!“
—Sherlock Holmes
The Adventure of the Copper Beeches
This chapter discusses the three new RGB-D datasets created for this thesis's experiments:
a synthetic and a real-world dataset of textureless objects, and a synthetic
dataset of transparent objects. These
datasets consist of samples of RGB images of textureless or transparent objects,
corresponding depth maps, and surface normal maps. We discuss the composition
of each dataset, the motivation behind their creation, and how they were used in
experiments. We also examine their limitations and see how they can be further
improved.
4.1 Textureless Datasets
We introduce two new RGB-D datasets for textureless surfaces: a synthetically
generated large-scale dataset created using a 3D modeling tool and a smaller real-
world dataset collected with a time-of-flight camera. Each instance in these datasets
shows a single textureless object from a particular viewpoint on a black background,
using a render of a 3D model in the synthetic dataset and a photograph of an actual
object in the real dataset. The datasets also include foreground masks indicating
the object’s location in each image and provide depth maps and surface normals
corresponding to each image. The intended purpose of these datasets is to train
neural networks for reconstructing depth maps and surface normals from a single
RGB image containing textureless objects.
4.1.1 Synthetic Textureless Dataset
This section describes the first major contribution of this thesis: a large-scale synthetic
RGB-D dataset of textureless objects.
Fig. 4.1: Samples from 6 main categories of the synthetic textureless dataset.
4.1.1.1 Motivation
As discussed in Chapter 3, most existing RGB-D datasets either contain textured
objects or have no labels for the objects’ surface. Very few datasets contain textureless
objects, but those that do are small in size, making them unsuitable for most
practical applications. Deep neural networks need large and diverse datasets to
learn generalizable representations of the world, and in the case of 3D datasets of
textureless objects, this remains an unmet requirement. However, capturing large
amounts of real-world data is difficult and time-consuming, and it is not always
possible to collect a large number of images of textureless objects from a variety of
viewpoints because of logistical and hardware constraints. Therefore, we created a
large-scale synthetic dataset of textureless objects to fill this gap.
The intended purpose of this dataset is to train deep neural networks to reconstruct
depth maps and surface normal maps from a single RGB image. As synthetic datasets
aim to mimic real-world data, the dataset is also designed to be as realistic as possible
and contains various lighting conditions and object poses. It is designed to be large
enough to train deep neural networks with millions of parameters, and it contains
a variety of objects from different categories to allow for generalization. Since the
end goal of the research in 3D reconstruction is to deploy these models in real-world
applications, this synthetic dataset should ideally allow for the training of models
that generalize well to real-world data. We discuss the dataset’s limitations and
how it can be improved in Section 4.1.1.5 and perform experiments to evaluate its
usefulness in the real world in Chapter 6.
Fig. 4.2: The Blender scene. The 3D model is surrounded by multiple lights and cameras.
4.1.1.2 Generation Process
The open-source 3D modeling tool Blender [Com18] is used to generate this data.
With the 3D model placed at the origin in an empty Blender scene, we added
several lights of different types for realistic shadows. Multiple cameras viewed the
objects from different elevations as the object was rotated around itself by uniformly
sampled azimuth steps. Figure 4.2 provides an overview of the general setup.
The scene has three cameras, with the first parallel to the object directly in front
of it at a 90° angle to the x-axis. The other two cameras are slightly elevated and
tilted downwards or upwards. For the six main categories, the elevation angles
of the upwards and downwards cameras randomly vary between -30° and 30°.
For the shapenet category, these angles are fixed at -45° and 45°. The object is
rotated around itself through 360 unit azimuth steps for the main categories and
eight 45° azimuth steps for the shapenet category. The scene is rendered at every
rotation while viewing it from each camera one by one to ensure that we capture
the complex deformations on the object from as many views as possible. This allows
for a comprehensive view of the object from all sides.
In natural scenes, many factors, such as illumination, shadows, and occlusions, can
affect a surface’s perceived shape. Samples in this dataset contain single objects and
do not deal with inter-object occlusions. Depending on the object’s shape, some
samples may include self-occlusions, but we do not provide labels for occlusion.
However, we include multiple lights and sources to create various illuminations and
shadows. In our scene, two bright spotlights shine on the object from the front-left
and front-right sides. These lights have a warm white color resembling the standard
incandescent lamps with an RGB value of (1.0, 0.945, 0.875). Blue-tinged sunlight
with an RGB value of (0.785, 0.883, 1.0) is added far above the object, and a soft glowing
white ambient light is placed beneath the object to illuminate the bottom faces of
the objects. The sunlight has a power of 2 Watts per square meter, and the ambient
floor light has a power of 100 Watts. This creates multiple shadows on the object
from all sides, giving it more realism. The scene is rendered under multiple lighting
configurations, with one or more lights turned on simultaneously. This ensures
a wide range of shadows and lighting effects is included in the dataset, allowing
networks to understand how shadows contribute to shape perception and to be able
to create illumination invariant 3D reconstructions.
Various lighting and camera angle configurations give 12 different sequences. We also
observe that textureless surfaces do not necessarily mean colorless objects but only a
lack of distinctive surface textures. We extend the definition of textureless surfaces
to include surfaces with a single homogeneous color. Therefore, all sequences are
rendered once using a bare, colorless model with no texture added and again with a
diffuse material of a random but uniform color mapped onto the whole surface. This
is different from [BFS18] where all objects are only grayscale. This way, we obtain
24 different configurations for each model, as summarized in Table 4.1.
Tab. 4.1: These elements in the scene are used in various combinations to generate 'sequences' of data.

Variable | Name  | Description
---------|-------|-------------------------------------
Lights   | Ls    | Only sunlight on.
         | Ll    | Left lamp and sunlight on.
         | Lr    | Right lamp and sunlight on.
         | La    | Both lamps and sunlight on.
Camera   | down  | Above the object, looking down.
         | front | At object height, looking straight.
         | up    | Below the object, looking up.
Color    | Yes   | Uniform color on the model.
         | No    | Model completely colorless.
The background in all the rendered images is a black plane, and we provide a fore-
ground mask indicating the pixels containing the object in the images. Like [BFS18],
we label the samples with groundtruth depth maps and surface normal maps. We
export the RGB images and depth maps using the Combined and Z data passes in
Blender and connect Blender’s compositing nodes as shown in Figure 4.3. To ensure
the visibility of widely differently sized objects in the camera frame, we manually
adjusted the distance and the exact viewing angles of all three cameras for each
object while keeping the general configuration discussed previously.

Fig. 4.3: This shows (a) how the different nodes are connected in Blender and (b) the render settings used to obtain the depth map data.

Depth values were then normalized in the range of 0 and 1. Like [BFS18], we differentiate the
depth maps to obtain the surface normals using Algorithm 1 for the objects in the
main categories. For the shapenet category, the normals are exported directly from
the Normal data pass in Blender. All normal vectors have a unit length and values
between -1 and 1. Samples in this dataset have a resolution of 512x512, twice the
size of existing datasets. This means it is possible to identify low-level features, which
will allow for more accurate reconstruction. However, increased computational and
storage resources are necessary to accommodate the larger image size. Most existing
neural networks only expect input of 224x224, so we conclude that samples of size
512x512 provide an adequate balance between quality and computation. Images are
saved in PNG format, and depth and normal maps are saved as floating-point NumPy
arrays. We also provide the source code for a data loader for conveniently reading
the data samples using PyTorch. Figure 4.1 shows examples of images, depth maps,
and surface normals for the main objects in the dataset. The top seven rows in the
figure enumerate the 35 objects, while the last row shows selected objects rendered
in non-white homogeneous colors.
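As an illustration, a minimal PyTorch data loader for such a layout might look like the sketch below; the file naming scheme and directory structure are assumptions and may differ from the published loader.

```python
import glob
import os

import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class TexturelessSamples(Dataset):
    """Minimal sketch of a loader for RGB images with depth and normal maps."""

    def __init__(self, root):
        # Assumed layout: one PNG per sample with matching *_depth.npy / *_normals.npy files.
        self.images = sorted(glob.glob(os.path.join(root, "**", "*.png"), recursive=True))

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_path = self.images[idx]
        image = np.asarray(Image.open(img_path).convert("RGB"), dtype=np.float32) / 255.0
        depth = np.load(img_path.replace(".png", "_depth.npy"))      # (512, 512), values in [0, 1]
        normals = np.load(img_path.replace(".png", "_normals.npy"))  # (512, 512, 3), values in [-1, 1]
        return image, depth, normals
```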
We make the source code used for dataset generation publicly available on GitHub
(https://github.com/saifkhichi96/blender_texless_data), which can be used to generate
more textureless groundtruth RGB-D data in an automated way. The source code also
provides the ability to choose between Blender-exported surface normals or those
obtained from depth maps. It expects input as a collection of Wavefront files organized
in a flat or ShapeNet-like hierarchy. We used the Cycles rendering engine and rendered
the images on an Intel(R) Core(TM) i7-6700K CPU with a 4.00 GHz processor. Each
sample takes around 30 seconds to render, which reduces to 5 seconds if a GPU is used
for rendering.

Algorithm 1: An algorithm to compute surface normals from a depth map.
Data: a grayscale depth map image X of size (H, W)
Result: the corresponding surface normals map of size (H, W, 3)
1: X̂ ← smooth(X)
2: Z_x ← gradient_x(X̂)
3: Z_y ← gradient_y(X̂)
4: N ← stack(−Z_x, −Z_y, 1)
5: N ← N / ‖N‖
6: return N
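A NumPy/OpenCV sketch of this procedure is shown below; the Gaussian smoothing kernel size and the use of Sobel filters for the gradients are assumptions.

```python
import cv2
import numpy as np

def normals_from_depth(depth, ksize=5):
    """Compute a surface normal map from a depth map, following Algorithm 1."""
    smoothed = cv2.GaussianBlur(depth.astype(np.float32), (ksize, ksize), 0)
    zx = cv2.Sobel(smoothed, cv2.CV_32F, 1, 0, ksize=3)   # horizontal depth gradient
    zy = cv2.Sobel(smoothed, cv2.CV_32F, 0, 1, ksize=3)   # vertical depth gradient
    normals = np.dstack((-zx, -zy, np.ones_like(smoothed)))
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)  # unit-length vectors in [-1, 1]
    return normals
```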
4.1.1.3 Data Description
In this section, we describe the contents of the synthetic dataset. This dataset is
made up of two parts.
Fig. 4.4:
The ShapeNet category. 24 renders of 200 models for 13 main ShapeNet objects
along with depth maps and surface normals are provided.
The first part of this dataset contains 35 common objects with one 3D model and 8640
samples per object from 6 different shape categories, including animals, clothing,
furniture, statues, vehicles, and misc. The items of clothing and furniture, in
particular, are often encountered in real scenes. Many times, these objects are
either minimally textured or completely textureless. Architectural sculptures are
another regularly occurring feature in many real-world scenes across the world, and
many times, these are also devoid of any colors or elaborate textures on them. The
dataset contains samples of statues, including some intricate, life-size 3D models
with complex shapes and depth variations. Each object is rendered from hundreds
of perspectives and under different illuminations. For textureless objects that have
otherwise homogeneous surfaces with consistent color, changes in lighting and
shadows are one of the only noticeable visual cues which, however, are not indicative
of their 3D shape. We include renders of these objects from many perspectives with
changing lighting to help networks learn illumination-independent features for
reconstruction.
The second part, later referred to as the shapenet category, focuses on various 3D
geometries instead of illumination changes on the same shapes. It is comprised of
13 objects from the Choy et al. [Cho+16] subset of the ShapeNet dataset [Cha+15],
with 200 different 3D models each. For each of these 2600 models, there are 24
samples rendered under constant lighting and from fewer views (Figure 4.4).
Fig. 4.5:
Objects in the dataset have depth variations at many scales, with some like the
rubber duck having a largely smooth surface with uniform normals and others like
the San Diego Convention Center or the Thai statue having many deviations in
their depth and normal vectors.
This dataset contains objects with diverse shapes and varying degrees of realism
in terms of deformations, polygon count, and size variations in the real world,
ranging from a tiny rubber duck to several life-size statues and a model of the San
Diego Convention Center building (Figure 4.5). This ensures that the dataset has
depth variations at many different scales across the objects. This variety is important
for training neural networks as it allows them to learn how to recognize and respond
to objects of different types and sizes. The range of deformations in this dataset
also helps train AI models to be more accurate in their predictions, as they must
account for a wider range of potential irregularities. Table 4.2 lists all objects in the
dataset.
Tab. 4.2: The synthetic textureless dataset has 48 objects divided into seven subcategories with 2635 unique 3D models and 364,800 samples in total.

Category  | Objects                                                                                          | Models | Samples
----------|--------------------------------------------------------------------------------------------------|--------|--------
animals   | asian_dragon, cats, duck, pig, stanford_bunny, stanford_dragon                                   | 6      | 51,840
clothing  | cape, dress, hoodie, jacket, shirt, suit, tracksuit, tshirt                                      | 8      | 69,120
furniture | armchair, bed, chair, desk, rocking_chair, sofa                                                  | 6      | 51,840
statues   | armadillo, buddha, lucy, roman, thai                                                             | 5      | 43,200
vehicles  | bicycle, car, jeep, ship, spacehship                                                             | 5      | 43,200
misc      | diego, kettle, plants, teapot, skeleton                                                          | 5      | 43,200
shapenet  | plane, bench, cabinet, car, chair, display, lamp, speaker, rifle, sofa, table, phone, watercraft | 13×200 | 62,400
4.1.1.4 Post-Processing
The synthetic dataset did not require much post-processing as our Blender pipeline
outputs samples with cleanly segmented background and foreground, depth maps
without any noise or holes, and high-quality surface normals when exported directly
from Blender. However, we ensure that depth values always lie between 0 and
1, with 1 representing the furthest point from the camera (i.e., the background).
Similarly, normal vectors always have a unit length and values between -1 and 1. We
also align our normal vectors such that the positive x-axis is facing right, the positive
y-axis is facing upwards, and the positive z-axis is facing away from the camera. This
is the same convention used by ShapeNetCore v2 [Cha+15].
4.1.1.5 Limitations
The Blender scene we use to generate data represents a tiny subset of the variations
that exist in natural scenes. In reality, shadows and illumination changes caused
by lighting are only one of the factors that affect the appearance of objects. For
example, natural environments are often cluttered with objects rarely appearing
in isolation. Our dataset does not include samples that model the occlusions or
truncations caused by other objects. Furthermore, the objects in our dataset are
rendered on a black background, which is not the case in natural scenes, and this
means that the networks trained on this data may not generalize to realistic scenes
with complex backgrounds. However, our dataset is a good starting point for future
research. We provide foreground masks for our objects, which technically makes
it trivially possible to replace the background pixels with an actual scene before
feeding the images to a neural network for training. However, we do not do that in
this thesis and leave that as future work.
4.1.2 Real Textureless Dataset
This section describes the small supplementary dataset of real-world textureless
objects.
4.1.2.1 Motivation
The synthetic dataset described in Section 4.1.1 is a good starting point for training
textureless object reconstruction networks. However, it is not a realistic representa-
tion of natural scenes. We collected a small dataset of real-world textureless objects
to validate the generalization ability of our synthetic data to real surfaces. This
dataset is not meant to be used for training; instead, it was created to evaluate
the performance of deep neural networks trained on the synthetic dataset on real-
world data. This dataset can also be seen as an extension of the textureless dataset
of [BFS18] as it was collected using the same methodology and contains similar
objects.
4.1.2.2 Collection Process
In a room with a large window on one wall allowing natural light, the Kinect was
mounted on a tripod with adjustable height facing away from the window. The
objects to be captured were positioned between 0.5m and 1.25m away from the
camera. The clothing items were worn by a person making random motions in
front of the camera. Other objects were placed on a stationary surface before the
camera and manually rotated around themselves. Synchronized RGB images and
depth maps were obtained from the camera, and surface normals were computed by
smoothing the depth maps with a 5 × 5 Gaussian filter and differentiating them. The
accuracy of depth values obtained through Kinect is affected by temperature [WS16],
so we let the camera run for half an hour to reach a stable temperature before
starting data capture every time.
The room was illuminated by a combination of natural light from the window and
four different fluorescent light bulbs inside the room. The real-world scene setup
was similar to the simulated Blender scene described in Section 4.1.1.2, with the
addition of a window opposite the object and behind the cameras. Camera height
and viewing angle were varied to get the
up
,
down
, and
front
camera positions. The
light bulbs were switched on and off randomly to provide different lighting across
sequences, with each sequence having at least two lights to create complex shadows.
During post-processing, a segmentation algorithm was used to retain only the object
of interest and remove the person and all backgrounds.
4.1.2.3 Data Description
This dataset contains 4,672 samples of six real-world low-texture objects. These
include four deformable objects: a hoody, a shirt, shorts, and a tshirt, and two rigid
objects: a chair and a lamp. The dataset of [BFS18] used only controlled artificial
lighting. We collected our data under daylight and a combination
of daylight and multiple random, artificial light sources. Table 4.3 gives a summary
of objects in our real-world dataset.
Tab. 4.3: Summary of objects in the supplementary dataset of real objects.

Object    | hoody | shirt | shorts | tshirt | chair | lamp
----------|-------|-------|--------|--------|-------|-----
Sequences | 3     | 4     | 1      | 9      | 2     | 1
Samples   | 508   | 671   | 387    | 1545   | 1201  | 360
4.1.2.4 Post-Processing
After capturing raw data with the Kinect camera, we perform several post-processing
steps to prepare the data for use (Figure 4.6).
Hole filling. As the Kinect computes depth by looking at the phase-shifted infrared
light reflected from the scene, it can sometimes not "see" surfaces that absorb infrared
waves, are translucent, or are very shiny. These missing values lead to holes in the
depth map. To fill these values, we interpolated the missing parts using OpenCV's
Navier-Stokes-based inpainting method with a radius of 7 [BBS01].
Background segmentation. A simple skin-color detection algorithm was used to
detect the body of the person wearing the clothes. The surface was covered in a
skin-colored cloth for other objects placed on a flat surface. These skin-colored pixels,
along with all pixels further than 1.25 meters from the camera, were labeled as
background. As all objects captured are of continuous shape, any small components
found in the image were assumed to be noise or pixels that escaped skin detection.
These pixels were labeled as background, while the largest contour in the scene was
labeled as foreground containing the object of interest. Finally, a morphological
closing operation was performed to fill any small holes in the foreground mask. A
binary mask of the foreground for each sample is included in the dataset.
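A simplified version of this segmentation step, assuming a precomputed skin mask and metric depth, could look like the following sketch; the morphological kernel size is an assumption, while the 1.25 m depth cut-off is taken from the description above.

```python
import cv2
import numpy as np

def foreground_mask(depth_m, skin_mask, max_depth=1.25):
    """Segment the object of interest from depth and skin-color cues.

    depth_m: (H, W) depth in meters; skin_mask: (H, W) binary mask of skin-colored pixels.
    """
    candidate = ((depth_m < max_depth) & (skin_mask == 0)).astype(np.uint8)
    # Keep only the largest contour, assumed to be the continuous object of interest.
    contours, _ = cv2.findContours(candidate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    mask = np.zeros_like(candidate)
    if contours:
        largest = max(contours, key=cv2.contourArea)
        cv2.drawContours(mask, [largest], -1, color=1, thickness=cv2.FILLED)
    # Morphological closing fills small holes left in the foreground mask.
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
```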
Normalization. The normal vectors were normalized to unit length with values in
the range [-1, 1] by dividing them by their magnitude. All depth values were also
normalized between [0, 1], which makes the dataset invariant to the arbitrary choice
of camera distance from the object. The background pixels have a depth value of 1,
and the foreground pixels have values between 0 and 0.99.
Fig. 4.6:
The top row of the figure shows the results of our skin-detection algorithm that
removes the person wearing the clothes from the images. The middle row shows
the raw output from Kinect with a lot of noise and a hole in the depth and normal
maps (right leg). The bottom row shows the output after post-processing steps.
4.1.2.5 Limitations
Like synthetic data, objects in this dataset are also shown before a black background
which is a potential limitation. The original background was removed as this dataset
extends [BFS18] which also uses a black background. Furthermore, unlike the
synthetic dataset, the image resolution of this dataset is 224x224, and the depth
maps and surface normals are often noisy or over-smoothed because of hardware
limitations and manual post-processing.
4.2 Transparent Dataset
In this section, we describe the dataset of transparent objects that we generated using
3D modeling software to train deep networks for reconstructing depth and normal maps
of transparent objects.
4.2.1 Motivation
As discussed in Section 3.2.3, there are no publicly available large RGB-D datasets
of transparent objects. As images of transparent objects are inherently even more
challenging to reconstruct than textureless objects, we hypothesize that we would
need a network with more parameters to learn to reconstruct transparent objects,
and the more parameters a network has, the more data it needs to learn. However,
capturing groundtruth depth maps or surface normal maps of transparent objects
is even more complex than for textureless objects because time-of-flight sensors like
Kinect cannot measure depth through transparent objects. Therefore, we generate
this synthetic dataset to train the network for transparent object reconstruction.
4.2.2 Generation Process
We use Blender with a similar pipeline as discussed in Section 4.1.1.2 to generate
the transparent dataset. However, there are a few notable differences.
First, we create a new shader to model a clear glass-like transparent material,
as shown in Figure 4.7. This uses a Principled BSDF (bidirectional scattering
distribution function) shader, which is based on the Disney principled model called
the Physics Based Rendering (PBR) shader, and determines how the surface scatters
the light by computing the probability that an incident ray of light will be reflected
(scattered) at a given angle. For a glass-like appearance, the "transmission" value is
set to 1 for a fully-transparent surface. The "specular" value is set to 0.5, which is
calculated using the Fresnel formula given by the following equation:
\text{specular} = \left( \frac{ior - 1}{ior + 1} \right)^2 \div 0.08 (4.1)

where ior is the index of refraction and has a value of 1.5 for glass. We use the
GGX microfacet distribution [Wal+07] and the Christensen-Burley approximation
of physically-based volume scattering to simulate subsurface scattering [Gue20].
Using a Mix Shader, the output of the Principled BSDF shader is combined with a
Translucent BSDF shader that adds the Lambertian diffuse transmission [San14] to
the surface.
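As a quick sanity check of Equation (4.1), the following snippet (illustrative only) reproduces the specular value of 0.5 for glass:

```python
def specular_from_ior(ior: float) -> float:
    """Blender 'specular' value from the index of refraction, as in Eq. (4.1)."""
    return ((ior - 1) / (ior + 1)) ** 2 / 0.08

print(specular_from_ior(1.5))  # ≈ 0.5 for glass (ior = 1.5)
```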
Fig. 4.7:
The shader in Blender used to model a transparent material. We use this to set
the transparent material’s refractive index, color, absorption, and transmission
properties.
Additionally, the reconstruction of transparent surfaces depends on more than just
the image of the object in isolation. The reconstruction algorithms need to model
light interactions with the object to successfully reconstruct the shape of transparent
objects where the surrounding environment is visible through them. These light
refractions and reflections are highly dependent on the surrounding environment,
so, unlike the textureless dataset, the transparent objects are not rendered on a
black background. The objects are placed in five different real-world environments
using High Dynamic Range Images (HDRIs) of real scenes, shown in Figure 4.8. To
achieve this, we use Blender's image-based lighting via the Environment Texture node (Figure 4.9a).
These HDRIs contain a 360-degree panorama of the scene, and Blender computes
realistic lighting from the light sources present in the HDRI itself. This allows us to
render transparent objects with many lifelike lighting conditions, specular reflections,
shadows, and background surfaces. In each of these five worlds, seven cameras are positioned at elevation angles in 15-degree steps over the range from -45 to 45 degrees, which results in 35 unique backgrounds in the rendered images.

(a) Studio (b) Bedroom (c) Christmas (d) Country Hall (e) Fireplace
Fig. 4.8: The five different HDRIs used to render the transparent dataset.

Each object is then rotated around itself by unit azimuth steps through
360 degrees. All cameras use depth of field to add a blur effect to the background
(Figure 4.9b), and a Denoise node is used to reduce noise in the rendered images
(Figure 4.9c). Unlike the textureless dataset, all groundtruth data, including the
depth maps, surface normals, and segmentation masks, are exported from Blender
for higher data quality and have a resolution of 512x512.
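As a rough illustration of this camera rig, the sketch below places the seven elevation cameras and rotates an object through the azimuth steps in Blender Python. The distance value and the object name are hypothetical, and this is a sketch of the idea rather than our exact generation script.

```python
import math
import bpy

scene = bpy.context.scene

# An empty at the origin for the cameras to track (the object sits at the origin).
target = bpy.data.objects.new("camera_target", None)
scene.collection.objects.link(target)

RADIUS = 3.0                        # hypothetical camera distance from the object
for elev in range(-45, 46, 15):     # seven elevations: -45, -30, ..., 45 degrees
    phi = math.radians(elev)
    cam_data = bpy.data.cameras.new(f"cam_{elev}")
    cam_obj = bpy.data.objects.new(f"cam_{elev}", cam_data)
    cam_obj.location = (RADIUS * math.cos(phi), 0.0, RADIUS * math.sin(phi))
    scene.collection.objects.link(cam_obj)
    # Aim the camera at the origin using a Track To constraint.
    track = cam_obj.constraints.new(type='TRACK_TO')
    track.target = target
    track.track_axis = 'TRACK_NEGATIVE_Z'
    track.up_axis = 'UP_Y'

# Each object is rotated around its vertical axis in unit azimuth steps.
obj = bpy.data.objects["teapot"]    # hypothetical object name
for azimuth in range(360):
    obj.rotation_euler[2] = math.radians(azimuth)
    # ... render the RGB image, depth map, normal map, and mask here ...
```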
(a) World settings
(b) Camera settings
(c) The compositing nodes used to generate depth and normal maps.
Fig. 4.9: The Blender settings used for transparent dataset creation.
4.2.3 Data Description
This dataset has 126,000 samples of 10 everyday objects made of glass. Of these, five are entirely transparent, and the other five have some opaque parts. The objects include a teapot, a transparent water bottle, another bottle with an opaque label, a bowl, a cup, a flower pot, an electric kettle, a bookshelf, a table, and a chair. The objects are shown in Figure 4.10.
(a) Teapot (b) Bottle (c) Cup (d) Bowl (e) Flower Pot
(f) Kettle (g) Cola (h) Bookshelf (i) Table (j) Chair
Fig. 4.10: Transparent objects in the dataset.
For each object, 12,600 samples show the object in the five environments, viewed from seven camera heights and all 360 horizontal rotations. For a single camera orientation and object rotation, the groundtruth labels are identical in all five worlds, as shown in Figure 4.11. This allows the dataset to be used for modeling the effect of the environment on the object and helps neural networks learn background-invariant features that are more representative of the actual geometry of the transparent object.
4.2.4 Limitations
As we show later in our experiments, while this dataset provides a good starting
point for reconstructing transparent objects, it has a few limitations that make it
challenging for the neural networks to generalize to the real world. The first of these
is the limited number of shapes in the dataset. While the dataset contains 2520
viewpoints of each object, only ten unique objects exist. In the real world, there are
hundreds of transparent objects with different refractive indices, surface thicknesses, degrees of transparency, colors of the transparent material, and many other properties. The refractive index affects how light changes as it passes through the surface, the surface thickness affects how light is reflected off it, the transparency controls how "see-through" the object is, and transparent materials can still have different colors. The dataset does not contain these variations and instead uses a single colorless glass-like material.

(a) Studio (b) Bedroom (c) Christmas (d) Country Hall (e) Fireplace
(f) Foreground Mask (g) Depth Map (h) Normals Map
Fig. 4.11: Groundtruth labels for a single camera orientation and object rotation in the five worlds.
4.3 Data Sources and Licenses
We release our datasets under the Creative Commons Attribution 4.0 International
(CC BY 4.0) license. None of the data contains any personally identifiable information. Some 3D models rendered in our datasets include religious or cultural
symbols, including the statues of Buddha, the Christian angel Lucy, Hindu symbolism
in the Thai statue, and elements of Chinese culture in models of dragons. Images
of all models are rendered in neutral situations with no offensive or controversial
animations or modifications.
The 3D models we used to render the dataset were obtained from several sources in the public domain. The armadillo, asian_dragon, buddha, lucy, stanford_bunny, stanford_dragon, and thai models were downloaded from the Stanford 3D Scanning Repository [Lev+05]. These models, together with Martin Newell's Utah teapot [Cha10], were included in this dataset for their popularity. The Stanford bunny and the Utah teapot, in particular, are often used in this field, and they were included in our dataset to make it comparable to other datasets and methods that feature these objects.
The models diego, duck, pig, and skeleton were downloaded from Keenan's 3D Model Repository [Cra21]. Keenan Crane of Carnegie Mellon University published this repository under the CC0 1.0 Universal (CC0 1.0) Public Domain License. The remaining 24 models were all free from CGTrader [CGT11] with a Royalty Free License.
5 Methodology
„Oh, come on! There’s always something to learn.
—The Joker
Batman: Arkham City
This chapter introduces the methodology we use for learning to reconstruct textureless and transparent objects. First, we describe the neural network architecture used
for reconstructing textureless surfaces and explain the loss functions utilized for
training the network. The following section describes the network architecture and
loss functions for learning to reconstruct transparent surfaces. For both cases, we
use the data described in Chapter 4 to train the networks.
5.1 Textureless Surfaces
In this section, we describe the network architecture and loss functions used for
reconstructing the depth maps and surface normal maps from a single RGB image of
textureless objects.
5.1.1 Network Architecture
We use a neural network architecture similar to the one proposed by [BFS18]
for learning to reconstruct textureless surfaces. The network is an autoencoder
consisting of three parts: an encoder network for feature extraction and two decoder
networks for reconstructing depth maps and normal maps. This is illustrated in
Figure 5.1.
The encoder consists of five blocks, each consisting of several convolutional layers
with batch normalization and ReLU activations and a max pooling layer at the
end. The first convolutional layer in each block increases the number of channels: from 3 at the network input to 32, 64, and 128 in the first three blocks, and to 256 in the last two blocks. The spatial dimensions decrease by a factor of 2 at each max pooling layer, and the pooling indices are stored for later use in the decoder. All convolutions use
a kernel of size 3 with ’same’ padding and a stride of 1.

Fig. 5.1: The Sketch Reconstruction Multi-task Autoencoder (SRMA) network. It has 11M trainable parameters.

The pooling layers use a kernel of size 2 with stride 2. Before the pooling layer in the last two blocks,
we add the self-attention layers from [Zha+19]. Both decoders are symmetric to the
encoder and have the same architecture except for the number of output channels
at the last layer. The decoder for the depth map has a single output channel, and
the normal map decoder has three output channels. In the decoders, the max pooling layers of the encoder are mirrored by a non-linear max unpooling operation that uses the stored pooling indices to upsample the feature maps, as proposed in [BKC17].
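A minimal PyTorch sketch of this pooling-index bookkeeping is shown below. The number of convolutions per block and the channel widths are illustrative, not the exact SRMA configuration.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Conv-BN-ReLU layers followed by max pooling that also returns the
    pooling indices so the decoder can later unpool to the same positions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)

    def forward(self, x):
        x = self.convs(x)
        return self.pool(x)   # returns (pooled features, pooling indices)

class DecoderBlock(nn.Module):
    """Mirror of the encoder block: max unpooling with the stored indices,
    followed by Conv-BN-ReLU layers."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x, indices):
        return self.convs(self.unpool(x, indices))
```

This is the SegNet-style unpooling of [BKC17]: the decoder places values back at the exact locations selected by the encoder's pooling, which preserves fine spatial detail without learned upsampling.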
We know it is possible to compute the normal vectors of a surface from its depth map by differentiation (Algorithm 1), which shows that depth maps and normal maps are dependent variables and share a common feature space. The encoder $g: I_m \mapsto \Lambda$ aims to find this space as a feature map of size $(H/32, W/32, 256)$, where $H$ and $W$ are the height and width of the input image, respectively, $I_m$ is the masked input image with the background set to zero, and $\Lambda$ is the feature map. The decoder $h_d: \Lambda \mapsto D$ aims to reconstruct the depth map $D$ from the feature map $\Lambda$, and the decoder $h_n: \Lambda \mapsto N$ aims to reconstruct the normal map $N$ from the feature map $\Lambda$. The overall network $f: I_m \mapsto (D, N)$ is then defined as $f(I_m) = (h_d(g(I_m)),\, h_n(g(I_m)))$.

The network can take an input of arbitrary size as long as the spatial dimensions are divisible by 32, and both $D$ and $N$, together known as the 2.5D sketch of the input image, have the same spatial size. We call this network the Sketch Reconstruction Multi-task Autoencoder (SRMA).
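For intuition about this dependence between the two outputs, the following sketch estimates normals from a depth map with finite differences. It is a simplified stand-in for the differentiation procedure referenced above (Algorithm 1), assuming unit pixel spacing.

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth: torch.Tensor) -> torch.Tensor:
    """Estimate surface normals from a depth map of shape (B, 1, H, W) using
    finite differences: n is proportional to (-dz/dx, -dz/dy, 1), normalized."""
    dzdx = F.pad(depth[..., :, 1:] - depth[..., :, :-1], (0, 1, 0, 0))  # horizontal gradient
    dzdy = F.pad(depth[..., 1:, :] - depth[..., :-1, :], (0, 0, 0, 1))  # vertical gradient
    normals = torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1)
    return F.normalize(normals, dim=1)   # unit-length normal per pixel
```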
5.1.2 Loss Functions
We now define the loss functions for training this network. The first loss function is the edge loss $L_E$, which is defined as the mean squared error (MSE) between the gradients of the prediction $x$ and the groundtruth $y$. Let $\tilde{x}$ and $\tilde{y}$ be the foreground pixels in $x$ and $y$ respectively, with the background pixels set to zero. The edge loss is then defined as
$$L_E(x, y) = \frac{1}{2N} \sum_{n=1}^{N} \frac{1}{|B_n|} \sum_{i \in B_n} \Big[ \big(d_x(\tilde{x}^n)_i - d_x(\tilde{y}^n)_i\big)^2 + \big(d_y(\tilde{x}^n)_i - d_y(\tilde{y}^n)_i\big)^2 \Big], \tag{5.1}$$
where $N$ is the batch size, $B_n$ is the set of groundtruth foreground pixels of sample $n$, and $d_x$ and $d_y$ are the horizontal and vertical gradients calculated using finite differences. The edge loss encourages the network to learn the edges of the foreground objects and is a component of both the depth and normal loss functions described next.
The second loss function is the depth loss $L_D$, which is defined as the weighted sum of the mean absolute error (MAE) and the edge loss $L_E$ between the predicted depth map $x_D$ and the groundtruth depth map $y_D$:
$$L_D(x_D, y_D) = \frac{1}{\mu + \eta}\left(\mu\, \frac{1}{N}\sum_{n=1}^{N} \frac{1}{|B_n|}\sum_{i\in B_n} \big|\tilde{x}^n_{D_i} - \tilde{y}^n_{D_i}\big| + \eta\, L_E(x_D, y_D)\right), \tag{5.2}$$
where $\mu$ and $\eta$ are hyperparameters that control the relative importance of the MAE and edge loss components. The depth loss is used by the network to learn the depth of the textureless objects in the foreground. We use $\mu = 10$ and $\eta = 1$.
The third loss function is the normal loss $L_N$, which is defined as the weighted sum of three components: the edge loss $L_E$ between the predicted normal map $x_N$ and the groundtruth normal map $y_N$, an angular loss based on the cosine similarity between $x_N$ and $y_N$, and a length loss between the magnitudes of the predicted normal vectors and unit length. The normal loss is defined as
$$L_N(x_N, y_N) = \frac{\kappa\, L_A(x_N, y_N) + \eta\, L_E(x_N, y_N) + \tau\, L_L(x_N)}{\kappa + \eta + \tau}, \tag{5.3}$$
where $L_L$ is the length loss given by
$$L_L(x_N) = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{|B_n|}\sum_{i\in B_n} \big(1 - \|\tilde{x}^n_{N_i}\|\big), \tag{5.4}$$
and $L_A$ is the angular loss computed as the arccosine of the cosine similarity:
$$L_A(x_N, y_N) = \frac{1}{N}\sum_{n=1}^{N} \cos^{-1}\!\left(\frac{1}{|B_n|}\sum_{i\in B_n} \frac{\langle \tilde{x}^n_{N_i},\, \tilde{y}^n_{N_i}\rangle}{\max\big(\|\tilde{x}^n_{N_i}\|\,\|\tilde{y}^n_{N_i}\|,\ \epsilon\big)}\right) \times \frac{1}{\pi}, \tag{5.5}$$
where $\langle\cdot,\cdot\rangle$ is the vector dot product and $\epsilon = 10^{-7}$ is a small constant to avoid division by zero. The length term encourages the network to output unit vectors, while the angular term helps the network learn the orientation of the normal vectors. As in the depth loss, the edge loss term encourages better normal predictions around the object's edges. We set $\kappa = 10$ and $\tau = 1$.
The overall loss function, $L$, is then given by
$$L(x, y) = \alpha\, L_D(x_D, y_D) + \beta\, L_N(x_N, y_N), \tag{5.6}$$
where $\alpha$ and $\beta$ are hyperparameters that control the relative importance of the depth and normal losses; both are set to 1.
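A compact PyTorch sketch of these foreground-masked losses follows. It is an illustrative approximation rather than our exact implementation: the helper names are ours, the edge term for normals is averaged over channels, and the length term uses an absolute deviation from unit length.

```python
import torch
import torch.nn.functional as F

def _masked_mean(values, mask):
    """Average `values` over foreground pixels (mask > 0), separately per sample."""
    total = (values * mask).flatten(1).sum(dim=1)
    count = mask.flatten(1).sum(dim=1).clamp(min=1)
    return total / count

def edge_loss(pred, gt, mask):
    """MSE between finite-difference gradients of the foreground regions (Eq. 5.1)."""
    def grads(t):
        dx = F.pad(t[..., :, 1:] - t[..., :, :-1], (0, 1, 0, 0))   # horizontal
        dy = F.pad(t[..., 1:, :] - t[..., :-1, :], (0, 0, 0, 1))   # vertical
        return dx, dy
    px, py = grads(pred * mask)
    gx, gy = grads(gt * mask)
    per_pixel = ((px - gx) ** 2 + (py - gy) ** 2).mean(dim=1, keepdim=True)
    return 0.5 * _masked_mean(per_pixel, mask).mean()

def depth_loss(pred_d, gt_d, mask, mu=10.0, eta=1.0):
    """Weighted foreground MAE plus edge loss (Eq. 5.2)."""
    mae = _masked_mean((pred_d - gt_d).abs(), mask).mean()
    return (mu * mae + eta * edge_loss(pred_d, gt_d, mask)) / (mu + eta)

def normal_loss(pred_n, gt_n, mask, kappa=10.0, eta=1.0, tau=1.0, eps=1e-7):
    """Angular, edge, and length terms on the foreground normals (Eqs. 5.3-5.5)."""
    cos = F.cosine_similarity(pred_n * mask, gt_n * mask, dim=1, eps=eps)
    mean_cos = _masked_mean(cos.unsqueeze(1), mask).clamp(-1 + eps, 1 - eps)
    angular = (torch.acos(mean_cos) / torch.pi).mean()
    length = _masked_mean((1.0 - pred_n.norm(dim=1, keepdim=True)).abs(), mask).mean()
    edge = edge_loss(pred_n, gt_n, mask)
    return (kappa * angular + eta * edge + tau * length) / (kappa + eta + tau)

def total_loss(pred_d, pred_n, gt_d, gt_n, mask, alpha=1.0, beta=1.0):
    """Overall loss of Eq. (5.6)."""
    return alpha * depth_loss(pred_d, gt_d, mask) + beta * normal_loss(pred_n, gt_n, mask)
```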
5.2 Transparent Surfaces
In this section, we describe the network architecture and loss functions used for
reconstructing 2.5D sketches of transparent objects.
5.2.1 Network Architecture
We start by describing the network architecture. The proposed architecture combines
a Vision Transformer [Kol+21] with a modified version of the autoencoder network
for textureless objects introduced above and is shown in Figure 5.2.
Fig. 5.2:
The Residual Sketch Reconstruction Vision Transformer (RSRVT) network. It has
22M trainable parameters.
This network takes as input an RGB image $I$ of size $224 \times 224 \times 3$ and outputs the corresponding depth map $D$ and surface normal map $N$ with the same spatial size. It also has a third, auxiliary output: a silhouette mask $M$ showing the "shadow" of the transparent object. This auxiliary output is used to encourage the network to find the transparent object in the foreground because, unlike the textureless case where a masked image was used as the network input, the input to this network shows transparent objects on complex, real-life backgrounds.
Transformers use sequential input and have demonstrated great potential in NLP tasks such as machine translation [Wan+19], where they use text sequences as input. In [Kol+21], Kolesnikov et al. proposed the Vision Transformer (ViT) architecture for image classification tasks. The ViT architecture uses a sequence of patches of the input image as input to the transformer. The patches are extracted using a sliding window of size $P \times P$ with a stride of $P$ and then flattened to a vector before being passed to the Transformer encoder along with positional embeddings. We use the ViT-Tiny model implementation from the PyTorch Image Models library [Wig19], which is a "tiny" ViT model consisting of 12 transformer layers with 3 attention heads and 192 hidden units. The patch size $P$ is set to 16; each patch is flattened to a vector of size $P^2 = 256$ and then mapped to a vector of size $D = 192$ using a trainable linear projection. The output of this projection is called the patch embeddings, and for an input size of $224 \times 224$ we get a sequence of 196 patch embeddings, to which an additional learnable [cls] token embedding is prepended to stay close to the original Transformer architecture. In traditional classification tasks, the patch embeddings are fed to the Transformer encoder, whose output is passed through an MLP classification head to predict the class label. In our case, we set the MLP output size to 784, which is reshaped to $7 \times 7 \times 16$ and represents the Transformer features. This can be seen as a function $t: I \mapsto \Lambda_T$, where $I$ is the input image and $\Lambda_T$ are the Transformer features. The output is then passed to the silhouette decoder.
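As a hedged illustration of this step (assuming the timm model name vit_tiny_patch16_224; the exact configuration in our code may differ), the Transformer features can be obtained as follows.

```python
import timm
import torch

# ViT-Tiny backbone from the PyTorch Image Models (timm) library. Setting
# num_classes=784 makes the classification head output a 784-dimensional
# vector, which is reshaped into the 7x7x16 "Transformer features" Lambda_T
# (shown channels-first here, as is usual in PyTorch).
vit = timm.create_model("vit_tiny_patch16_224", pretrained=True, num_classes=784)

images = torch.randn(2, 3, 224, 224)                   # a dummy batch of RGB images
transformer_features = vit(images).view(-1, 16, 7, 7)  # shape (2, 16, 7, 7)
```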
The silhouette decoder contains five transpose convolutional layers with a kernel size of 4, padding of 1, and a stride of 2 that upscale the input features to the same spatial size as the input image. Each transpose convolution is followed by batch normalization and a GELU activation, and the number of channels changes to 64, 128, 256, 128, and 64, respectively. These are followed by a $1 \times 1$ convolutional layer with unit stride and no padding, and the output is passed through a sigmoid activation to obtain the silhouette features $\Lambda_M$ of size $224 \times 224 \times 3$. This can be seen as a function $s: \Lambda_T \mapsto \Lambda_M$. The output of the silhouette decoder is stacked with the input image to form a new input $I'$ of size $224 \times 224 \times 6$ for another feature extraction network, which aims to learn features representing the 3D structure of the transparent object only. The network uses the information in $\Lambda_M$ to learn to segment the object from the background.
The feature extraction network is based on ResNet-18 [He+16] without the final average pooling and fully-connected layers. We store the pooling indices of the max pooling operation after the first convolutional layer in the network, and, analogous to the "attentioned" encoder described in Section 5.1.1, we add self-attention layers after the last two residual blocks. This network is represented by the function $f': I' \mapsto \Lambda_F$, where $I'$ is the 6-channel input image and $\Lambda_F$ is the feature representation of the transparent object, with a size of $224 \times 224 \times 512$. We add a shortcut path $k: \Lambda_T \mapsto \Lambda_{T'}$ consisting of a single $1 \times 1$ convolutional layer, which maps the Transformer features to the same size as the feature representation of the transparent object. The two paths are then added together to form the final feature representation before being passed to the reconstruction network.
Unlike the reconstruction decoders in the SRMA network in Section 5.1.1, the reconstruction network here uses only a single decoder that produces a 4-channel output, the 2.5D sketch $S$, where the first channel contains the depth map $D$ and the other three channels contain the normal map $N$. It is made up of three blocks of two transpose convolutions each, where the first convolution uses a kernel of size 4 with a stride of 2 and padding of 1 and halves the number of channels, while the second convolution uses a kernel of size 3 with stride and padding of 1, keeping the number of channels the same. A ReLU activation follows both convolutions, and a batch normalization follows each block. This is followed by an unpooling layer that uses the stored indices from the feature extraction network. After this, there are three more convolution layers with kernel sizes of 2, 3, and 1, strides of 2, 1, and 1, and padding of 0, 1, and 0, respectively. A ReLU activation follows the first two convolutions. The number of channels changes from 64 to 32 in the first convolution, stays unchanged in the second, and changes to 4 in the final convolution, giving a 4-channel output of size $224 \times 224 \times 4$. This can be seen as a function $g': I' \mapsto S$, where $D = S[:,:,0]$ and $N = S[:,:,1{:}4]$.
The overall network can then be described by the following equation:
$$S = g'\big(f'(s(t(I)) \oplus I) + k(t(I))\big), \tag{5.7}$$
where $\oplus$ is the stacking (channel concatenation) operator. The network also produces an auxiliary output, the silhouette mask $M$ of size $224 \times 224 \times 1$, by applying a $1 \times 1$ convolution to the output of the silhouette decoder, $\Lambda_M$; this output is used during training to encourage the network to find the transparent object in the foreground. The ViT-Tiny and the ResNet-18 used in the network are both pre-trained on ImageNet [Den+09]. The network has a total of 22M parameters and is trained end-to-end using the loss functions described in the following section.
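The composition in Eq. (5.7) can be sketched as follows. The callables passed in are placeholders for the sub-networks described above, not their actual implementations.

```python
import torch

def rsrvt_forward(image, t, s, f_prime, k, g_prime, mask_head):
    """Forward pass corresponding to Eq. (5.7). The callables t, s, f_prime, k,
    g_prime and mask_head stand for the ViT, the silhouette decoder, the feature
    extractor, the shortcut projection, the reconstruction decoder and the 1x1
    mask head; they are placeholders for the actual sub-networks."""
    lam_t = t(image)                            # Transformer features
    lam_m = s(lam_t)                            # silhouette features, same H x W as image
    stacked = torch.cat([lam_m, image], dim=1)  # the stacking operator: 3 + 3 = 6 channels
    features = f_prime(stacked) + k(lam_t)      # residual shortcut path
    sketch = g_prime(features)                  # 4-channel 2.5D sketch S
    depth, normals = sketch[:, :1], sketch[:, 1:4]
    mask = mask_head(lam_m)                     # auxiliary silhouette mask M
    return depth, normals, mask
```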
5.2.2 Loss Functions
We use the same loss functions as in the case of the opaque surfaces but calculate
them for the whole image instead of for the foreground pixels only. However,
datasets do not always have valid depth or normal vector values for the background.
Therefore, we set the background depth to always be 1 and the background normal vectors to point towards the positive z-axis, i.e., [0, 0, 1].
The edge loss $L'_E(x, y)$ is now defined as
$$L'_E(x, y) = \frac{1}{2N}\sum_{n=1}^{N} \frac{1}{|I_n|}\sum_{i\in I_n} \Big[\big(d_x(x^n)_i - d_x(y^n)_i\big)^2 + \big(d_y(x^n)_i - d_y(y^n)_i\big)^2\Big], \tag{5.8}$$
where $I_n$ is the input image and $|I_n| = H \times W$ is the number of pixels in the image.
Similarly, the depth loss $L'_D$ is defined as
$$L'_D(x_D, y_D) = \frac{1}{\mu + \eta}\left(\mu\, \frac{1}{N}\sum_{n=1}^{N}\frac{1}{|I_n|}\sum_{i\in I_n} \big|x^n_{D_i} - y^n_{D_i}\big| + \eta\, L'_E(x_D, y_D)\right), \tag{5.9}$$
and the normal loss $L'_N$ is defined as
$$L'_N(x_N, y_N) = \frac{\kappa\, L'_A(x_N, y_N) + \eta\, L'_E(x_N, y_N) + \tau\, L'_L(x_N)}{\kappa + \eta + \tau}, \tag{5.10}$$
where $L'_L$ is the length loss given by
$$L'_L(x_N) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{|I_n|}\sum_{i\in I_n} \big(1 - \|x^n_{N_i}\|\big), \tag{5.11}$$
and $L'_A$ is computed as
$$L'_A(x_N, y_N) = \frac{1}{N}\sum_{n=1}^{N} \cos^{-1}\!\left(\frac{1}{|I_n|}\sum_{i\in I_n} \frac{\langle x^n_{N_i},\, y^n_{N_i}\rangle}{\max\big(\|x^n_{N_i}\|\,\|y^n_{N_i}\|,\ \epsilon\big)}\right) \times \frac{1}{\pi}. \tag{5.12}$$
Additionally, we define a new loss function for the transparency mask, called the silhouette loss $L_S(x, y)$, which is the binary cross-entropy loss between the sigmoid of the predicted transparency mask $x_B$ and the groundtruth transparency mask $y_B$. It is given by
$$L_S(x_B, y_B) = \frac{1}{N}\sum_{n=1}^{N} \sum_{i\in I_n} \Big[-\hat{y}^n_{B_i}\log\big(\hat{x}^n_{B_i}\big) - \big(1 - \hat{y}^n_{B_i}\big)\log\big(1 - \hat{x}^n_{B_i}\big)\Big], \tag{5.13}$$
where $\hat{x} = \sigma(x)$, $\hat{y} = \sigma(y)$, and $\sigma$ is the sigmoid function. This loss encourages the network to locate the transparent object in the image and segment it from the background.
The overall loss function for transparent surfaces, $L'$, is then given by
$$L'(x, y) = \alpha\, L'_D(x_D, y_D) + \beta\, L'_N(x_N, y_N) + \gamma\, L_S(x_B, y_B), \tag{5.14}$$
where $\alpha$, $\beta$, and $\gamma$ are hyperparameters that control the relative importance of the depth, normal, and silhouette losses and are set to 1, 1, and 0.2, respectively.
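A sketch of the combined transparent loss follows, reusing the depth_loss and normal_loss helpers from the earlier textureless sketch with an all-ones mask, and approximating the silhouette term with BCE-with-logits against a binary groundtruth mask (a common simplification of Eq. 5.13).

```python
import torch
import torch.nn.functional as F

def transparent_loss(pred_d, pred_n, pred_mask_logits,
                     gt_d, gt_n, gt_mask,
                     alpha=1.0, beta=1.0, gamma=0.2):
    """Overall transparent loss of Eq. (5.14). depth_loss and normal_loss are
    the helpers from the textureless sketch, called with an all-ones mask so
    that every pixel contributes."""
    full = torch.ones_like(pred_d)                       # no foreground masking
    l_d = depth_loss(pred_d, gt_d, full)                 # Eq. (5.9)
    l_n = normal_loss(pred_n, gt_n, full)                # Eq. (5.10)
    l_s = F.binary_cross_entropy_with_logits(pred_mask_logits, gt_mask)
    return alpha * l_d + beta * l_n + gamma * l_s
```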
6 Experiments and Results
„Do not cite the Deep Magic to me, Witch! I was
there when it was written.
—Aslan, the King of Narnia
The Chronicles of Narnia: The Lion,
the Witch, and the Wardrobe
This chapter describes the experiments conducted on our datasets and the results of
the proposed deep neural network architectures. It starts with describing the evalu-
ation metrics used to report the experiment results. The experiments are divided
into two parts: the first part evaluates the presented approach for textureless object
reconstruction, and the second evaluates the presented approach for transparent
object reconstruction. In both parts, we first discuss the experimental setup, the goals
of these experiments, and the dataset splits used. Then, we present the quantitative
and qualitative results of the experiments. Finally, we conduct several ablation
studies and discuss them.
6.1 Evaluation Metrics
The depth error metric $E_D$ is defined as the mean absolute error between the foreground pixels of the predicted and true depth values. As we normalize our depth values between 0 and 1, the depth error is not in real-world units but rather a percentage. We compute the mean absolute difference in depth values using Eq. 3.14 multiplied by 100.
The normal error metric $E_N$ is defined as the mean angular distance in degrees between the predicted and groundtruth surface normals and is computed as
$$E_N(x_N, y_N) = \frac{1}{N}\sum_{n=1}^{N} \frac{\sum_{i} \big(L_a(x_N, y_N)\big)_i\, B^n_i}{\sum_{i} B^n_i}, \tag{6.1}$$
where $L_a$ is defined in Eq. 3.3 and $B^n$ denotes the foreground mask of sample $n$.
For both $E_D$ and $E_N$, smaller values indicate better results. Additionally, the percentage of surface normals with less than 10°, 20°, and 30° angular error is also reported; higher percentages indicate better quality of the normal vectors. Note that all metrics are calculated only for the foreground pixels.
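A minimal sketch of how these foreground-only metrics can be computed is shown below. Shapes are assumed to be (B, 1, H, W) for depth and mask and (B, 3, H, W) for normals; this is not our exact evaluation code.

```python
import torch
import torch.nn.functional as F

def depth_error(pred_d, gt_d, mask):
    """E_D: mean absolute depth error over foreground pixels, reported as a
    percentage of the normalized [0, 1] depth range."""
    abs_diff = (pred_d - gt_d).abs() * mask
    return 100.0 * abs_diff.sum() / mask.sum().clamp(min=1)

def normal_error(pred_n, gt_n, mask, eps=1e-7):
    """E_N: mean angular distance in degrees over foreground pixels, plus the
    percentage of normals below the 10/20/30 degree thresholds."""
    cos = F.cosine_similarity(pred_n, gt_n, dim=1, eps=eps)
    angles = torch.rad2deg(torch.acos(cos.clamp(-1 + eps, 1 - eps)))  # (B, H, W)
    fg = mask.squeeze(1) > 0
    angles_fg = angles[fg]
    below = {t: (angles_fg < t).float().mean().item() * 100 for t in (10, 20, 30)}
    return angles_fg.mean(), below
```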
6.2 Textureless Object Reconstruction
In this section, we describe the experimental setup and the data splits used for
the reconstruction of textureless objects, discuss the results of these experiments
using the metrics introduced above, and present ablation studies on the textureless
network.
6.2.1 Experimental Setup
For textureless reconstruction, we use the attentioned autoencoder network intro-
duced in Section 5.1.1. The network is trained and evaluated on the synthetic
textureless dataset described in Section 4.1.1 using the data splits listed in Table
6.1. For the six main categories, the network is trained once using the training and
validation splits of that category. In the first set of experiments, these six networks
are evaluated on the test split of the same category. These experiments evaluate the
generalization of the network to similar shapes in new lighting conditions. In the
second set of experiments, the six networks are evaluated on the test split of the
other five categories. These experiments evaluate the generalization of the network
to previously unseen shapes. In another experiment, the data splits in Table 6.1b are
used to train and test the network on the ShapeNet objects. This test evaluates how
well the network generalizes to new views of the objects that it has seen before. The
results of these experiments are presented in Section 6.2.2.
Tab. 6.1: The training (R), validation (V), and test (T) splits for the textureless dataset.
(a) Main categories.
        L_l   L_r   L_a   L_s
down    R     R     V     R
front   R     R     V     R
up      R     R     T     T
(b) The ShapeNet category.
        0     45    90    135   180   225   270   315
        R     R     R     V     R     R     R     T
        R     R     R     V     R     R     R     T
        R     R     R     V     R     R     R     T
In all these experiments, the network is trained for 50 epochs using a batch size of
64 split across 2 GPUs. The network is trained using the Adam optimizer [KB14]
with an initial learning rate of 0.01 and a weight decay of 0.0001. The learning rate
is reduced by a factor of 2 if the validation loss does not decrease for 30 epochs. An
input size of 224 × 224 is used, and all other hyperparameters use the values listed in Section 5.1.2.
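A sketch of this training setup in PyTorch is shown below. The model, data loaders, and loss function are supplied by the caller, and the multi-GPU splitting is omitted; it illustrates the optimizer and learning-rate schedule described above rather than reproducing our training script.

```python
import torch

def train_srma(model, train_loader, val_loader, loss_fn, epochs=50):
    """Adam with learning rate 0.01 and weight decay 1e-4, halving the learning
    rate when the validation loss does not decrease for 30 epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=30)

    for _ in range(epochs):
        model.train()
        for images, gt_depth, gt_normals, masks in train_loader:
            optimizer.zero_grad()
            pred_depth, pred_normals = model(images)
            loss = loss_fn(pred_depth, pred_normals, gt_depth, gt_normals, masks)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(
                loss_fn(*model(images), gt_depth, gt_normals, masks).item()
                for images, gt_depth, gt_normals, masks in val_loader)
        scheduler.step(val_loss)
    return model
```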
Theoretically, using a larger input image should improve the reconstruction quality because the network can learn to "see" the image in more detail. However, training the network with larger images is computationally more expensive and requires more memory. Most existing datasets and networks use an input size of 224 × 224 pixels, and we also use this size for our experiments. However, to demonstrate how larger input images affect the reconstruction quality, we also train the network once with an input size of 512 × 512 pixels and compare the results to the 224 × 224 pixel network.
6.2.2 Results
In the first set of experiments, we perform intra-class evaluations on the proposed
network, which is trained once on each of the six main categories in the synthetic
textureless dataset and evaluated on a subset of the same objects in unseen lighting.
The results shown in Table 6.2 demonstrate that the network learns the shape of the textureless objects reasonably well under illumination changes.
Tab. 6.2: Results of the intra-category experiments. The furniture and clothing categories show the best results, with 92.12% and 88.87% of the predicted normals, respectively, having less than a 30° angular difference from the groundtruth. This shows that our dataset allows for a good intra-class generalization ability.
Category    E_D          E_N           mAE<10   mAE<20   mAE<30
animals 8.11±4.95 15.82±7.24 50.11 75.88 86.41
clothing 9.25±5.14 15.29±3.48 45.81 77.28 88.87
furniture 4.57±8.01 11.46±8.65 71.06 86.34 92.12
misc 3.11±1.77 16.42±9.77 58.95 74.37 82.66
statues 5.76±6.91 16.02±9.91 52.17 77.19 87.56
vehicles 5.14±6.31 17.25±8.05 52.05 71.61 82.00
mean 5.99±5.52 15.38±7.85 55.03 77.11 86.60
Objects in a single category sometimes share similar geometric structures. For
example, all the statues have humanoid shapes with a face, torso, arms, and legs
and have many small but gradual surface orientation changes. On the other hand,
the furniture largely consists of plain surfaces with sharp corners and planes often
bending at right angles. Table 6.3 lists the results of the inter-category experiments
with networks trained on one object category and evaluated on objects in all other
categories. These experiments show the ability of the network to learn geometry-
independent features and generalize to new categories of textureless objects.
Fig. 6.1: Visualization of the qualitative errors on random samples in the test data.
Figure 6.1 visualizes the output of the network trained on the synthetic furniture
category on three randomly selected furniture objects and a statue, showing the strongest errors near the edges.
Tab. 6.3:
Results of the inter-category experiments. The degree of generalization to new
categories is less than that for objects within the same category (Table 6.2). This
is because the network learns strong shape priors that do not generalize well to
very different geometries.
Train      Test        E_D          E_N           mAE<10   mAE<20   mAE<30
animals
clothing 19.31±4.94 25.65±5.06 17.94 50.08 70.68
furniture 21.94±6.38 25.95±6.42 22.02 50.97 71.18
misc 18.71±5.86 33.74±10.04 16.63 41.38 57.76
statues 20.05±5.05 28.23±5.01 15.40 45.49 67.29
vehicles 20.79±4.43 35.88±9.22 10.37 33.48 53.64
mean 20.16±5.33 29.89±7.15 16.472 44.28 64.11
clothing
animals 17.27±4.63 24.65±5.04 22.35 52.65 71.83
furniture 24.04±8.25 28.16±7.52 12.88 45.81 69.20
misc 18.96±5.64 34.23±11.64 13.54 40.09 58.13
statues 20.04±6.81 25.59±2.95 14.91 48.04 70.87
vehicles 22.64±5.74 33.70±8.85 9.96 34.70 56.85
mean 20.59±6.21 29.27±7.20 14.73 44.26 65.38
furniture
animals 24.85±6.79 29.77±9.01 18.74 46.34 65.61
clothing 25.45±7.89 34.68±8.31 10.47 34.11 57.30
misc 25.14±12.69 43.12±14.61 12.42 30.99 46.68
statues 36.01±14.53 40.43±11.62 9.89 30.21 50.11
vehicles 25.24±8.52 52.65±21.95 10.47 25.40 39.08
mean 27.34±10.08 40.13±13.10 12.40 33.41 51.76
For the shapenet category, Table 6.4 shows the results for each individual object.
During training, the network viewed each of the 200 3D models for all 13 objects
Tab. 6.3: Contd.
Train      Test        E_D          E_N           mAE<10   mAE<20   mAE<30
misc
animals 20.29±5.11 34.84±8.88 14.18 37.10 55.77
clothing 25.13±5.66 33.54±6.36 8.72 31.28 54.51
furniture 21.20±5.50 32.19±8.06 14.92 39.43 60.18
statues 23.35±5.48 31.68±9.42 19.34 41.05 59.29
vehicles 20.20±5.79 54.88±19.61 6.96 20.91 34.83
mean 22.03±5.51 37.43±10.47 12.82 33.95 52.92
statues
animals 22.62±5.36 28.45±5.18 17.04 45.71 66.59
clothing 24.25±4.96 31.54±6.44 12.72 39.59 62.16
furniture 23.33±7.05 32.50±4.31 12.17 38.25 59.73
misc 23.86±6.61 40.88±11.15 9.91 31.11 50.38
vehicles 24.13±6.65 46.84±13.98 7.00 24.74 43.01
mean 23.64±6.13 36.04±8.21 11.77 35.88 56.37
vehicles
animals 22.26±7.15 28.87±4.21 14.50 42.92 65.10
clothing 24.27±9.35 31.34±5.14 13.50 37.18 58.49
furniture 22.40±5.64 29.62±4.97 15.41 43.05 63.13
misc 23.85±8.60 37.87±9.89 12.34 32.45 49.95
statues 24.97±7.76 30.95±5.07 13.70 38.95 60.51
mean 23.55±7.70 31.73±5.86 13.89 38.91 59.44
from six of the eight azimuth angles. Results were evaluated on one previously unseen view of each 3D model and reported individually for each object.
These results are significantly better than the experiments on the main categories because of the larger number of shapes present. The average $E_D$ is 13.05, showing a 35% improvement over the best-performing main category. Similarly, the average $E_N$ of 23.41 degrees shows a 20% improvement, and 70% of the normals have less than 30° angular error in the shapenet category, which is a 5-point improvement. In the main categories, there were only 6-8 unique 3D models per category, whereas, in the shapenet category, there are 200 3D models per category, but for each model only 24 viewpoints are present instead of thousands of viewpoints as in the main categories. The network has thus seen more unique shapes during training but fewer unique viewpoints. The results indicate that training data should include as many unique shapes as possible for better generalization to unseen shapes in the real world.
To demonstrate the generalization capability of the network to real-world data,
we choose the real textureless dataset from Bednarik et al. [BFS18]. This dataset
has five items, and [BFS18] defines an experiment where the “cloth” object in this
Tab. 6.4:
Results on the shapenet objects. Performance improves greatly when more shapes
are seen during training. This shows the network can learn shape representations
from our textureless renders.
Object      E_D          E_N           mAE<10   mAE<20   mAE<30
plane 17.59±7.73 35.35±19.50 24.56 41.64 52.46
bench 13.05±4.63 25.60± 9.13 34.73 53.43 65.27
cabinet 8.23±4.01 12.60±6.26 76.97 85.01 87.89
car 8.26±3.84 22.14± 9.87 39.91 62.05 74.28
chair 14.75±5.36 24.08±13.37 39.51 59.29 70.58
display 13.13±6.07 22.53±12.51 50.91 67.26 75.14
lamp 16.71±9.59 26.48± 9.97 25.13 47.56 64.55
speaker 9.76±5.61 16.03±10.90 63.61 76.50 83.26
rifle 17.71±7.94 33.97±16.57 21.60 40.50 53.67
sofa 9.25±3.57 14.00± 6.71 62.51 78.66 86.25
table 12.02±4.78 18.99±13.59 57.85 68.77 75.78
phone 13.90±8.78 24.55±14.17 50.08 66.60 74.90
watercraft 15.24±8.13 28.06±15.60 32.85 50.63 62.63
mean 13.05±6.16 23.41±12.17 44.63 61.38 71.28
dataset is used to train the network and 100 samples from each object are used for
testing. We train our network on the real images of the cloth object from [BFS18]
for 50 epochs. In Table 6.5, the performance of this network for the task of normal
map reconstruction is compared with the network trained on the synthetic clothing
category.
Fig. 6.2:
Visualization of the output on real objects from [BFS18] when trained on our
synthetic data.
In another experiment, we train our network on the “clothing” subset of our synthetic
textureless dataset, and evaluate it on our real-world textureless dataset described
in Section 4.1.2. This experiment is conducted to show that networks trained on
our synthetic dataset can generalize to real-world RGB-D data captured with depth
sensors. Table 6.6 shows the results for this experiment using all images in the real
dataset for evaluation. Results for each object are reported individually.
Tab. 6.5: Comparison of normal map reconstruction between a network trained on our synthetic dataset (S) and the same network trained on the real cloth data from [BFS18] (R). When trained on our synthetic data, the same network gives better surface normals for all four real objects other than the cloth object, which was used to train the real network and where our results are comparable.
Test Object   Train Set   E_N          mAE<10   mAE<20   mAE<30
cloth         S           37.58±5.95    6.46     22.46    42.92
              R           33.96±6.47   11.31     32.95    53.40
hoody         S           36.96±2.31    6.58     23.95    45.74
              R           40.94±2.85    5.29     18.95    36.70
paper         S           43.87±6.92    4.28     16.08    32.24
              R           45.91±7.32    4.02     15.19    30.54
sweater       S           40.66±2.66    4.73     17.90    36.51
              R           47.85±3.67    3.26     12.60    26.45
tshirt        S           35.89±4.27    6.85     24.68    46.71
              R           42.43±6.24    4.56     17.15    34.34
Figure 6.3 shows the qualitative results of these experiments on a random sample
from the dataset. These results are only slightly worse than the results of the
inter-class experiment on the synthetic dataset.
Fig. 6.3:
Visualization of the qualitative errors on a random sample from our real-world test
data.
In our final experiment, we evaluate the effect of input size on the reconstruction
quality. The SRMA network is trained on the ’animals’ category of the synthetic
textureless dataset once using an input size of 224x224 and again using an input
size of 512x512. The qualitative results of reconstructing a random animal from
the test set with both these networks are shown in Figure 6.4. We observe that the
network trained with a larger input size can learn to reconstruct the depth maps and
normal vectors with greater detail. As illustrated in the figure, tricky edges and fine
Tab. 6.6:
Results of reconstructing depth and normal vectors of our real dataset using the
baseline network trained on our synthetic clothing dataset.
Object      E_D          E_N           mAE<10   mAE<20   mAE<30
chair 29.62±8.89 43.68±8.54 10.15 31.39 49.84
hoody 36.60±8.54 36.55±3.68 9.38 31.37 55.20
lamp 23.30±6.27 74.26±7.07 3.39 11.85 21.88
shirt 36.99±7.85 36.24±4.41 8.88 31.54 56.49
shorts 42.35±7.12 26.78±3.01 12.68 42.35 73.03
tshirt 36.12±9.22 35.45±3.38 10.20 34.77 59.92
mean 34.16±7.98 42.16±5.02 9.11 30.55 52.73
details are more visible in the predictions of the network with the larger input size, indicating that it can learn more fine-grained details.
(a) When trained with a larger input size (cyan), the network shows better generalization. Notice how the validation loss at 512 is closer to the training loss and more stable, compared to 224. (b) Low-level features are reconstructed better.
Fig. 6.4: Effect of using a larger input size to train the network.
6.2.3 Ablation Studies
We perform a series of ablation studies on the SRMA network introduced in Sec-
tion 5.1.1 by systematically removing different components and seeing how it affects
the network performance.
In the first experiment, we remove the self-attention layers, keeping the rest of the
network the same. This decreases the number of parameters in the network from
11.21M to 11.05M. Figure 6.5 shows the updated network architecture.
We train this network on the furniture category of the synthetic dataset and compare
the results with the "attentioned" network trained on the same dataset. Our goal
is to see what role self-attention plays in the network. In [Sal+20], where they
Fig. 6.5:
The Sketch Reconstruction Multi-task Autoencoder (SRMA) network without the
self-attention. It has 11M trainable parameters.
used a similar "attentioned" ResNet-18 network as their encoder, they found that the
self-attention layers helped the network learn better low-level details such as edges
and corners. As shown in Figure 6.6, we observe the same trend in our network.
Without self-attention, the network outputs normal vectors with fewer fine details and a more "blurred" appearance.
(a) Original Prediction (b) Normals without Self-Attention
Fig. 6.6: The effect of removing the self-attention layers. The network without self-attention produces more "blurred" normals and misses low-level details near the edges.
In the second ablation study, we want to determine if the two decoders are necessary.
We modify our network by removing one of the decoders and changing the output
channels of the other decoder to 4. The first channel of this output gives the
depth map, and the other three give the normals map. This reduces the number
of parameters by 4M, giving us a new network called the Sketch Reconstruction
Autoencoder (SRAE), containing 7M parameters. This is illustrated in Figure 6.7.
Fig. 6.7:
The Sketch Reconstruction Autoencoder (SRAE) network. It has 7M trainable
parameters.
We repeat the intra-class experiment from Table 6.2 on the single-decoder network
and report the new results in Table 6.7. Comparing the two tables, it can be seen
that removing one decoder slightly deteriorates the output quality, but this decrease is minimal in most cases.
Tab. 6.7:
Results of the second ablation study on the textureless data where we use only a
single decoder.
Object      E_D          E_N           mAE<10   mAE<20   mAE<30
animals 5.26±6.65 13.15±8.29 60.75 81.59 89.54
clothing 19.07±7.69 22.79±3.97 23.57 56.02 75.99
furniture 4.73±8.74 12.21±8.70 67.73 83.63 90.47
misc 2.41±1.33 16.41±9.83 58.34 74.00 82.56
statues 6.07±7.55 16.23±8.03 48.16 73.74 85.64
vehicles 3.89±5.64 17.32±8.50 52.42 70.81 81.08
This indicates that using two separate decoders for the depth map and the normals map may not be strictly necessary. This is also supported by the fact that depth maps and normal maps are closely related and can be computed from each other, which means that sharing decoder weights should not affect the reconstruction performance significantly. If anything, it should encourage the predicted depth maps and normal maps to be more consistent with each other. Using this knowledge, in the network used for transparent surface reconstruction in the following sections, we only use a single decoder in the autoencoder part of the network.
6.3 Transparent Object Reconstruction
In this section, we describe the experimental setup and the data splits used for the
reconstruction of transparent objects, discuss the results of these experiments, and
present ablation studies on the transparent network.
6.3.1 Experimental Setup
For transparent reconstruction, we use the RSRVT network introduced in Section
5.2.1. The network is trained and evaluated on the synthetic transparent dataset
described in Section 4.2 using the data splits listed in Table 6.8.
Tab. 6.8:
The training (R), validation (V), and test (T) splits of the synthetic transparent
dataset. Three different experiments are performed on the dataset using these
splits.
                -45°    -30°    -15°     0°     15°     30°     45°
Studio V V V V V V V
Bedroom T T T T T T T
Christmas R R R R R R R
Country Hall R R R R R R R
Fireplace R R R R R R R
Studio R R R V R R R
Bedroom R R R T R R R
Christmas R R R - R R R
Country Hall R R R - R R R
Fireplace R R R - R R R
Studio R R R R R - R
Bedroom - - - - - T+V -
Christmas R R R R R - R
Country Hall R R R R R - R
Fireplace R R R R R - R
In the first experiment, we train the network on all views of the objects in three of the five scenes (Studio, Bedroom, and Christmas) and evaluate it on the remaining two scenes (Country Hall and Fireplace). This experiment evaluates how well the network generalizes previously seen objects and viewpoints to new, unseen scenes. As the objects are transparent, even though they are viewed from the same angles, they appear different in the new scenes due to the different lighting conditions and backgrounds visible through them.
In the second experiment, the network is trained on objects viewed from six out of
the total seven cameras in all five scenes and evaluated on the Bedroom scene viewed
from the remaining camera. This experiment evaluates the network’s generalization
to unseen viewpoints of the same objects in known environments.
The third experiment trains the network on objects viewed from six cameras in
four scenes and evaluates it on the remaining scene (Bedroom) viewed from the
remaining camera. This is the most challenging experiment as the network has
never seen this environment or the objects from this camera angle before, but it
has seen the same objects from other camera angles in other environments. This
experiment evaluates the network’s generalization to unseen backgrounds and
camera perspectives. The results of these three experiments are presented in the
following section.
Like the textureless case, the networks here are also trained for 50 epochs with a
batch size of 64, split across 2 GPUs, and using the Adam optimizer [KB14] with
an initial learning rate of 0.01 and a weight decay of 0.0001. The learning rate is
reduced by a factor of 2 if the validation loss does not decrease for 30 epochs. An
input size of 224 × 224 is used, and all other hyperparameters use the values listed in Section 5.2.2.
6.3.2 Results
Predictably, the network performs best when asked to generalize to unseen HDRI
environments, keeping the evaluated objects and camera viewpoints the same.
Tab. 6.9: The results of the transparent experiments described in Section 6.3.1.
       E_D           E_A           mAE<10   mAE<20   mAE<30
I      4.99±3.64     12.09±6.92    70.32    85.58    90.77
II     19.52±8.91    28.14±14.84   27.86    54.56    67.36
III    23.07±11.70   37.88±16.17   15.23    36.07    50.21
In the second and third experiments, the network overfits to the training data, as
illustrated by the training time behavior of the three experiments in Figure 6.8. Only
experiment I shows the validation loss getting closer to the training loss over time.
Tab. 6.10: The sizes of the training, validation, and test splits in the three transparent experiments.
       Training Set   Validation Set   Test Set   Not Used
I      7560           2520             2520       0
II     10,800         360              360        1080
III    8640           360              360        1440
This can be explained by looking at the ratios of data splits in the three experiments in
Table 6.10. It becomes apparent that the test sets in the second and third experiments
are too small, while the number of training samples increases even further in both. This indicates that the unsatisfactory results in these two experiments are caused by the choice of data splits rather than by the network itself. This is supported by the fact that the depth and normal errors are markedly better in the first experiment, where the training-to-test ratio is more balanced and the test set is larger than in the other two.
Fig. 6.8:
The evolution of the difference between training and validation loss over time for
the three transparent experiments.
To compensate for this choice of test sets, we define another experiment to test whether the network can generalize to unseen objects. In this experiment, we use photographs of real transparent objects and evaluate on them the network from the second experiment, which was trained with the largest number of training samples. These photographs contain bottles, bowls, cups, and glasses, which have shapes similar to some of the objects in the synthetic dataset but are not identical. The backgrounds and lighting conditions are also entirely unseen, and the real objects are made of different types of transparent materials than the single thin glass simulated in the synthetic dataset.
6.3.3 Ablation Studies
We perform a series of ablation studies on the RSRVT network introduced in Sec-
tion 5.2.1 by systematically removing different components and seeing how the
training-time behavior and reconstruction quality change.
In the first set of experiments, we remove the shortcut path from the transformer
features to the encoder features. Additionally, we change the number of layers
Fig. 6.9:
The RSRVT network successfully locates the transparent object in real photographs
and reconstructs fairly good depth and normal maps.
in the encoder and decoder to match the SRMA network instead of ResNet-18,
remove the residual connections, and replace them with skip connections between
the encoder and decoder for the non-linear max-unpooling operation. We call this
network the Sketch Reconstruction Vision Transformer (SRVT). Figure 6.10 shows
the architecture of the SRVT network.
Fig. 6.10:
The Sketch Reconstruction Vision Transformer (SRVT) network. It has 14M
trainable parameters.
In this study, we want to see how removing the residual blocks and shortcut paths from the network affects the training-time behavior and the ability to converge. As
shown in Figure 6.11a, the SRVT network is less stable than the RSRVT network,
with the SRVT validation loss showing a 30% greater standard deviation than the
validation loss with the RSRVT network. Neither of the networks can converge to a
low validation loss, but the RSRVT network can reach a lower validation loss than
the SRVT network. This suggests that the RSRVT network can generalize better
than the SRVT network, which is also supported by the qualitative results shown in
Figure 6.11b. This indicates that the residual blocks and shortcut paths are important
for the stability and generalization of the network.
(a) Network loss. (b) Qualitative results of ablation study 1. Compared to the output of RSRVT in Figure 6.9, this shows significant deterioration.
Fig. 6.11: Ablation study 1: Removing the residual blocks and shortcut paths from the network.
Table 6.11 shows the quantitative results of this study.
Tab. 6.11: Ablation study 1: The results of the transparent experiments with the SRVT network. The average E_N error increases by 10% without the residual blocks and shortcut paths.
       E_D           E_A           mAE<10   mAE<20   mAE<30
I      10.69±7.34    16.98±10.78   54.71    78.53    85.86
II     25.29±11.40   32.21±15.18   24.09    47.82    62.24
III    24.55±10.65   36.75±13.98   15.27    37.61    53.94
Next, we remove the feature extractor and reconstruction networks altogether, only
keeping the Vision Transformer and the Silhouette Decoder, renamed the Sketch
Decoder, which, instead of the 3-channel silhouette features, now produces a 4-
channel prediction containing the depth and normal maps. To compensate for the significant reduction in the number of network parameters, we replace the ViT-Tiny with the ViT-Small network in the Transformer Encoder. Figure 6.12 shows this network architecture.
In this study, we want to see if the Vision Transformer alone can reconstruct depth
and normal maps from a single image. As shown in Figure 6.13, this is not the case,
and the ViT network alone cannot reconstruct the 3D shape, indicating that the
feature extractor and reconstruction networks are essential for reconstructing the
depth maps and surface normals.
Table 6.12 shows the quantitative results of this study.
Finally, we remove the transformer encoder and decoder from the network, only
keeping the feature extractor and reconstruction networks. Like SRVT, this does
not use the residual blocks but instead has skip connections between the encoder
Fig. 6.12: The Vision Transformer (ViT) network without the feature extractor and reconstruction networks. It has 23M trainable parameters, comparable to the 22M parameters of the original RSRVT network.
(a) The network loss does not decrease, indicating that the ViT alone cannot learn 3D shape. (b) The network primarily outputs noise but does show some response in the depth maps around the image areas where the transparent object is.
Fig. 6.13: Ablation study 2: Removing the feature extractor and reconstruction networks, only keeping the Vision Transformer.
Tab. 6.12: Ablation study 2: The results of the transparent experiments without the autoencoder part. Only about 2.7% of the surface normals have an angular error of less than 30 degrees in these experiments, indicating this network's complete inability to reconstruct the normals.
       E_D           E_A           mAE<10   mAE<20   mAE<30
I      18.88±11.31   82.15±10.84   0.14     1.26     3.24
II     19.71±6.13    81.27±10.36   0.79     1.58     3.22
III    23.24±5.63    83.10±7.79    0.02     0.26     1.73
and decoder for the non-linear max-unpooling operation. We call this network the
Sketch Reconstruction Autoencoder (SRAE). It has 7M trainable parameters and has
the same architecture as the purple part in Figure 6.10, with the feature extractor taking a 3-channel input instead of a 6-channel one. This ablation study is meant to see precisely what role the ViT plays in the RSRVT and SRVT networks, as we already saw that on its own it contributes very little to shape reconstruction.
Fig. 6.14:
Ablation study 3: Removing the Vision Transformer, only keeping the feature
extractor and reconstruction networks.
Figure 6.14 shows that, despite the ViT being removed, the network still learns to
reconstruct the foreground objects’ depth maps and surface normals. However, the
network cannot segment the transparent objects from the background as, with the
silhouette part of the network removed, it does not have a mechanism to locate the
boundaries of the transparent object. This indicates that the ViT in the SRVT and
RSRVT networks plays a crucial role in the reconstruction pipeline. We show that all
parts of the RSRVT network have a role in the reconstruction pipeline, and removing
any of them will result in a network that cannot reconstruct the 3D shape of the
transparent objects.
Table 6.13 shows the quantitative results of this study.
Tab. 6.13: Ablation study 3: The results of the transparent experiments with the SRAE network, i.e., with the Vision Transformer removed and only the feature extractor and reconstruction networks kept.
       E_D          E_A           mAE<10   mAE<20   mAE<30
I      5.48±6.36    12.90±7.75    61.06    82.10    90.07
II     18.34±8.50   25.93±13.96   30.47    57.68    72.70
III    15.78±5.10   29.09±10.21   17.65    44.37    61.39
7 Conclusion
„We’re going to leave the world better than we
found it.
—Daenerys Targaryen
Game of Thrones
This chapter presents the conclusions of the thesis. It starts with a summary of the
main contributions of the thesis. Then, it discusses the limitations of the proposed
methods and future work. Finally, it presents the final remarks.
7.1 Summary of Contributions
In this thesis, our goal was to develop a learning-based approach for reconstructing
the depth maps and surface normal maps, also called a 2.5D sketch, from a single
RGB image of an object made out of either textureless or transparent materials. In
both cases, our input consists of an image showing one object in isolation, either
with or without a background. The output is a 2.5D sketch of the object, which
contains a depth map indicating the distance of every pixel from the camera and a
surface normal map indicating the orientation of the surface at every pixel. The 2.5D
sketch can be used for various applications, such as depth estimation for robotic
manipulation, furniture placement using augmented reality, autonomous driving, or
as a first step to obtaining a dense 3D model of the object. We treated the problems
of reconstructing the two types of surfaces separately and proposed two different
neural networks to learn their reconstruction.
7.1.1 Reconstruction of Textureless Surfaces
We created two new datasets for textureless surfaces, including a large synthetic
dataset generated in Blender and a smaller real-world dataset collected using a
Kinect camera. Our main contributions in this area are the following:
1.
The synthetic dataset is the first large-scale 3D database of textureless surfaces.
This dataset provides the 3D shape of the objects as depth maps and surface
normal maps and is our primary dataset for the experiments reconstructing
textureless surfaces. The dataset is divided into two main parts: the first part
has six different categories, each with 6-8 3D models of objects and 8640
samples per object. The second part has 13 different objects from the well-
known ShapeNet dataset and contains 24 samples each for 200 unique 3D
models of each object. Together, both parts contain 364,800 samples and
2635 unique 3D models. In addition to depth and normal maps, the dataset
contains RGB images showing textureless objects on a black background and
segmentation masks indicating the pixels containing the object.
2.
The real-world dataset contains 4672 samples of six real textureless objects
collected using a Kinect camera. Like the synthetic dataset, it contains depth
and normal maps, RGB images, and segmentation masks. However, the
groundtruth is noisier than the synthetic dataset. There are six unique items in
the dataset, four of which are clothing worn by the subjects who then made
random movements while the Kinect camera was recording the data. The other
two items are a table lamp and a chair. Each object is textureless, has a gray or
white color, and is shown on a black background.
3.
The source code for generating the synthetic dataset is open-source and can be
used to generate other datasets for more types of objects or the same objects
in new environments.
4.
We propose an autoencoder-based network that uses a single encoder with
two different decoders for learning to reconstruct the depth maps and surface
normal maps of textureless surfaces. The network is trained using a combined
loss for both decoders, where the depth loss is based on the mean absolute error, and the normal loss is based on cosine similarity. Both loss functions have an additional component to encourage better reconstruction of the objects' edges.
The network is trained on the synthetic dataset and then tested on both the
synthetic and the real-world dataset.
Our datasets and the network architecture are described in detail in Chapter 4
and Chapter 5, respectively. The results of the experiments are presented in
Chapter 6. The source code for generating the synthetic dataset is available at https://github.com/saifkhichi96/blender_texless_data, and the source code for collecting data with the Kinect camera is available at https://github.com/saifkhichi96/kinect_v2-data. The network architecture and training code are available at https://github.com/saifkhichi96/master-thesis-sources.
7.1.2 Reconstruction of Transparent Surfaces
We created a new dataset for transparent surfaces, including ten different 3D models.
We rendered each model from 2520 viewpoints in five different real-world environ-
ments using Blender, with a clear glass-like transparent material for the objects. The
dataset contains 126,000 samples, each with a depth map, a surface normal map,
and an RGB image. Unlike the textureless dataset, the background is not fully black
and influences the object’s appearance as the lighting is computed directly from the
background, and some of the backgrounds are visible through the transparent object.
The source code for data generation is also provided.
To reconstruct the transparent objects in our dataset, we propose a novel network
architecture that combines a Vision Transformer with a convolutional autoencoder,
and uses residual connections and shortcut paths. This network has an auxiliary output that learns the silhouette space of the transparent objects, which lets the network learn to locate and segment the transparent objects from the background. This is done using the ViT, and the output is then passed to the convolutional autoencoder that reconstructs the depth and normal maps. The network is trained using three loss functions: a new binary cross-entropy loss for silhouette prediction and the depth and normal losses used in the textureless case. We use the synthetic
transparent dataset created in this thesis for training and evaluation. We also
present the qualitative evaluation of the network on a few real-world photographs
of transparent objects which are not part of the dataset but demonstrate that the
network can generalize to real-world data.
7.2 Limitations and Future Work
In our experiments, we encountered several challenges caused by the limitations of
our datasets, which we discuss in the following paragraphs and propose recommen-
dations for future work.
The depth and surface normal maps only provide a partial geometry of the 3D object from one view. A better 3D representation might be a dense point cloud or a volumetric representation like voxels, which provides the complete groundtruth 3D shape of the object and all its faces. However, textureless surface reconstruction is a new research area, and this dataset can serve as a starting point for the reconstruction of such surfaces: unlike a full mesh reconstruction, where occluded faces of the object also need to be reconstructed, learning to predict depth maps or surface normals of only the front faces of the objects is an easier task. Our future datasets can include other types of 3D representations, such as voxel grids or 3D meshes. Of the 2,635 3D models in our textureless dataset, 2,600 were taken from the ShapeNet dataset, which is publicly available and contains meshes and normalized voxel grids for these models. The remaining 35 models were obtained from other public sources that also provide 3D meshes, which we can use to generate voxel grids for them (a minimal voxelization sketch follows this paragraph). This will allow us to extend our dataset to include voxel grids.
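The sketch below shows how such a voxelization step could look, assuming the trimesh library; the file name, normalization, and resolution are illustrative assumptions rather than the exact procedure we would use.

```python
import numpy as np
import trimesh

def mesh_to_voxels(mesh_path, resolution=32):
    mesh = trimesh.load(mesh_path, force="mesh")
    # Normalize the mesh into a unit cube so that all models share the same scale.
    mesh.apply_translation(-mesh.bounding_box.centroid)
    mesh.apply_scale(1.0 / max(mesh.extents))
    # Voxelize at the requested resolution and fill the interior.
    voxels = mesh.voxelized(pitch=1.0 / resolution).fill()
    return np.asarray(voxels.matrix, dtype=bool)

# Example (hypothetical file name):
# occupancy = mesh_to_voxels("chair.obj", resolution=32)
```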
The objects in our datasets are displayed on a black background, and we compute our loss functions and evaluation metrics on the segmented foreground. This means that we do not care about the background, but since the network is never shown a real background image, it cannot even distinguish between the background and the foreground on its own, let alone reconstruct its depth or normals. The first part of this problem can be addressed by computing the existing loss functions on the entire image instead of only the foreground, or by adding another loss term that thresholds the network outputs and compares them against the groundtruth segmentation masks. In short, adding supervision for background segmentation would allow the network to distinguish between the background and the foreground, but as the dataset only contains black backgrounds, the network would only learn to correctly segment the foreground object when the background is black. This is not the case in the real world, and the network would need to be trained with textureless objects on real backgrounds to generalize. This is a limitation of our datasets and not a problem with the network architecture. It can be solved either by using a different dataset or by extending our dataset to superimpose the objects on real backgrounds, which is straightforward as the segmentation masks are already available (a minimal compositing sketch follows this paragraph). In addition, our primary dataset is synthetic, unlike [BFS18], which collects real-world data using a Microsoft Kinect. This is because collecting large amounts of 3D data from real scenes is often difficult. However, synthetic datasets often do not capture all the variables present in real scenes and only represent a subset of the complexities in the wild. Our approach therefore suffers from the synthetic-to-real domain shift problem, and the network might not always generalize well to the real world. In the future, our network for textureless surfaces can be retrained on our datasets extended with real backgrounds to improve its generalization to the real world.
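A minimal sketch of such a compositing step, assuming per-sample render, mask, and background image files (the file names below are hypothetical), could look as follows:

```python
import numpy as np
from PIL import Image

def composite(render_path, mask_path, background_path):
    render_img = Image.open(render_path).convert("RGB")
    render = np.asarray(render_img, dtype=np.float32)
    # Foreground mask of the textureless object, scaled to [0, 1].
    mask = np.asarray(Image.open(mask_path).convert("L").resize(render_img.size),
                      dtype=np.float32) / 255.0
    background = np.asarray(Image.open(background_path).convert("RGB").resize(render_img.size),
                            dtype=np.float32)

    # Keep object pixels from the render and fill everything else with the background.
    alpha = mask[..., None]
    out = alpha * render + (1.0 - alpha) * background
    return Image.fromarray(out.astype(np.uint8))

# Example (hypothetical file names):
# composite("render_0001.png", "mask_0001.png", "kitchen.jpg").save("composited_0001.png")
```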
Similarly, our dataset for transparent surfaces is limited in terms of the number of
objects it contains. While there are 126,000 data samples, they only show different viewpoints of 10 3D models, a small number that restrains the network from generalizing to a broader range of objects. Other limitations of the transparent
dataset include the small number of HDRI environments used and the type of
transparent material. We only use five HDRI environments, and all objects in
our dataset have a glass-like material with the same thickness, refractive index,
reflectance, and color. However, in reality, transparent objects include a much wider
variety of materials, such as plastic, glass, and water, each with different properties.
The dataset should be extended to include new objects and materials in the future.
Our source code can be used to generate data with the same material for new 3D models in new HDRI environments. In order to include more types of transparent materials, new shaders will need to be created in Blender before our code can be used to generate groundtruth data for them (a minimal shader setup is sketched below). As our code also allows generating data directly from the model archives downloaded from the ShapeNet database, the renders generated by our code can be used with the 3D groundtruth data from ShapeNet itself to train networks for reconstructing transparent surfaces as 3D meshes, voxels, or point clouds.
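For illustration, the following sketch shows how a new transparent material could be set up through Blender's Python API (bpy). The node setup and property values (such as the IOR and roughness) are assumptions for this sketch and not the exact shader used for our dataset.

```python
import bpy

def make_glass_material(name="ThinGlass", ior=1.45, roughness=0.0):
    mat = bpy.data.materials.new(name=name)
    mat.use_nodes = True
    nodes, links = mat.node_tree.nodes, mat.node_tree.links
    nodes.clear()

    # Glass BSDF node with the desired optical properties.
    glass = nodes.new(type="ShaderNodeBsdfGlass")
    glass.inputs["IOR"].default_value = ior
    glass.inputs["Roughness"].default_value = roughness

    # Route the shader into the material output.
    output = nodes.new(type="ShaderNodeOutputMaterial")
    links.new(glass.outputs["BSDF"], output.inputs["Surface"])
    return mat

# Example: assign the material to the active object.
# bpy.context.object.data.materials.append(make_glass_material())
```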
To conclude, we demonstrate Vision Transformers’ potential for 3D reconstruction
when combined with a CNN-based network and provide 2.5D datasets as a starting
point for reconstructing complex objects with textureless and transparent surfaces.
Using the networks proposed here and the datasets we provide, in conjunction with more complex datasets, performance on this task can be improved in the future. We hope that our datasets and code will be useful to the research community and aid the development of better 3D reconstruction methods.
Bibliography
[Aji+15]
AR Ajiboye, Ruzaini Abdullah-Arshah, H Qin, and H Isah-Kebbe. „Evaluating the
effect of dataset size on predictive model using supervised learning technique“.
In: Int. J. Comput. Syst. Softw. Eng 1.1 (2015), pp. 75–84 (cit. on p. 3).
[Bar18]
Jayme Garcia Arnal Barbedo. „Impact of dataset size and variety on the effec-
tiveness of deep learning and transfer learning for plant disease classification“.
In: Computers and electronics in agriculture 153 (2018), pp. 46–53 (cit. on p. 3).
[Bau+21]
Miguel Angel Bautista, Walter Talbott, Shuangfei Zhai, Nitish Srivastava, and
Joshua M Susskind. „On the generalization of learning-based 3d reconstruction“.
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer
Vision. 2021, pp. 2180–2189 (cit. on p. 9).
[BBS01]
Marcelo Bertalmio, Andrea L Bertozzi, and Guillermo Sapiro. „Navier-stokes,
fluid dynamics, and image and video inpainting“. In: Proceedings of the 2001
IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
CVPR 2001. Vol. 1. IEEE. 2001, pp. I–I (cit. on p. 48).
[BFS18]
Jan Bednarik, Pascal Fua, and Mathieu Salzmann. „Learning to reconstruct
texture-less deformable surfaces from a single view“. In: 2018 International
Conference on 3D Vision (3DV). IEEE. 2018, pp. 606–615 (cit. on pp. 3, 4, 18, 19,
23–26, 29, 34, 42, 43, 47–49, 57, 69–71, 86).
[BHC15]
Vijay Badrinarayanan, Ankur Handa, and Roberto Cipolla. „SegNet: A Deep
Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise
Labelling“. In: CoRR abs/1505.07293 (2015). arXiv: 1505.07293 (cit. on pp. 23, 26, 34).
[BKC17]
Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. „Segnet: A deep convo-
lutional encoder-decoder architecture for image segmentation“. In: IEEE transac-
tions on pattern analysis and machine intelligence 39.12 (2017), pp. 2481–2495
(cit. on p. 58).
[Böv+20]
Judith Böven, Johannes Boos, Andrea Steuwe, et al. „Diagnostic value and
forensic relevance of a novel photorealistic 3D reconstruction technique in post-
mortem CT“. In: The British Journal of Radiology 93.1112 (2020), p. 20200204
(cit. on p. 2).
[Bro+17]
Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Van-
dergheynst. „Geometric deep learning: going beyond euclidean data“. In: IEEE
Signal Processing Magazine 34.4 (2017), pp. 18–42 (cit. on p. 29).
[Cai+17]
Ziyun Cai, Jungong Han, Li Liu, and Ling Shao. „RGB-D datasets using microsoft
kinect or similar sensors: a survey“. In: Multimedia Tools and Applications 76.3
(2017), pp. 4313–4355 (cit. on p. 3).
[Cha+15]
Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, et al. ShapeNet: An
Information-Rich 3D Model Repository. Tech. rep. arXiv:1512.03012 [cs.GR].
Stanford University — Princeton University — Toyota Technological Institute at
Chicago, 2015 (cit. on pp. 4, 20, 30, 32, 45, 46).
[Chi+11]
Ligia-Domnica Chiorean, Teodora Szasz, Mircea-Florin Vaida, and Alin Voina.
„3D reconstruction and volume computing in medical imaging“. In: Acta Technica
Napocensis 52.3 (2011), p. 18 (cit. on p. 1).
[Cho+16]
Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese.
„3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruc-
tion“. In: Proceedings of the European Conference on Computer Vision (ECCV).
2016 (cit. on pp. 20, 30–32, 45).
[ÇN21]
Ayşe Nur Çayır and Tuğba Selcen Navruz. „Effect of Dataset Size on Deep
Learning in Voice Recognition“. In: 2021 3rd International Congress on Human-
Computer Interaction, Optimization and Robotic Applications (HORA). 2021, pp. 1–
5 (cit. on p. 3).
[Com18]
Blender Online Community. Blender - a 3D modelling and rendering package.
Blender Foundation. Stichting Blender Foundation, Amsterdam, 2018 (cit. on
pp. 21, 41).
[Den+09]
Jia Deng, Wei Dong, Richard Socher, et al. „Imagenet: A large-scale hierarchi-
cal image database“. In: 2009 IEEE conference on computer vision and pattern
recognition. IEEE. 2009, pp. 248–255 (cit. on p. 63).
[El-+04]
Sabry F El-Hakim, J-A Beraldin, Michel Picard, and Guy Godin. „Detailed 3D
reconstruction of large-scale heritage sites with integrated techniques“. In: IEEE
computer graphics and applications 24.3 (2004), pp. 21–29 (cit. on p. 1).
[EPF14]
David Eigen, Christian Puhrsch, and Rob Fergus. „Depth map prediction from a
single image using a multi-scale deep network“. In: Advances in neural informa-
tion processing systems 27 (2014) (cit. on p. 36).
[Epp+22]
Sagi Eppel, Haoping Xu, Yi Ru Wang, and Alan Aspuru-Guzik. „Predicting 3D
shapes, masks, and properties of materials inside transparent containers, using
the TransProteus CGI dataset“. In: Digital Discovery 1.1 (2022), pp. 45–60 (cit. on
p. 21).
[Fan+22]
Hongjie Fang, Hao-Shu Fang, Sheng Xu, and Cewu Lu. „TransCG: A Large-Scale
Real-World Dataset for Transparent Object Depth Completion and a Grasping
Baseline“. In: IEEE Robotics and Automation Letters 7.3 (2022), pp. 7383–7390
(cit. on p. 22).
[FSG17]
Haoqiang Fan, Hao Su, and Leonidas J Guibas. „A point set generation network
for 3d object reconstruction from a single image“. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2017, pp. 605–613 (cit. on
p. 30).
[Fu+21]
Kui Fu, Jiansheng Peng, Qiwen He, and Hanxiao Zhang. „Single image 3D object
reconstruction based on deep learning: A review“. In: Multimedia Tools and
Applications 80.1 (2021), pp. 463–498 (cit. on p. 16).
[GB19]
David Griffiths and Jan Boehm. „A review on deep learning techniques for 3D
sensed data classification“. In: Remote Sensing 11.12 (2019), p. 1499 (cit. on
p. 20).
[Gol+18]
Vladislav Golyanik, Soshi Shimada, Kiran Varanasi, and Didier Stricker. „HDM-
Net: Monocular Non-Rigid 3D Reconstruction with Learned Deformation Model“.
In: CoRR abs/1803.10193 (2018). arXiv: 1803.10193 (cit. on pp. 3, 19, 21, 26, 29, 34).
[Goo+14]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al. „Generative adversarial
nets“. In: Advances in neural information processing systems 27 (2014) (cit. on
p. 28).
[Gue20]
Ezra Thess Mendoza Guevarra. „Blending with Blender: The Shading Workspace“.
In: Modeling and Animation Using Blender. Springer, 2020, pp. 117–152 (cit. on
p. 50).
[Gup+21]
Nitin Gupta, Shashank Mujumdar, Hima Patel, et al. „Data quality for machine
learning tasks“. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge
Discovery & Data Mining. 2021, pp. 4040–4041 (cit. on p. 3).
[Haf+17]
Jahanzeb Hafeez, Seunghyun Lee, Soonchul Kwon, and Alaric Hamacher. „Image
based 3D reconstruction of texture-less objects for VR contents“. In: International
journal of advanced smart convergence 6.1 (2017), pp. 9–17 (cit. on pp. 1, 2).
[He+15]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. „Deep Residual
Learning for Image Recognition“. In: CoRR abs/1512.03385 (2015). arXiv: 1512.03385 (cit. on pp. 30, 31, 34, 35).
[He+16]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. „Deep residual learn-
ing for image recognition“. In: Proceedings of the IEEE conference on computer
vision and pattern recognition. 2016, pp. 770–778 (cit. on p. 62).
[HLB19]
Xian-Feng Han, Hamid Laga, and Mohammed Bennamoun. „Image-based 3D
object reconstruction: State-of-the-art and trends in the deep learning era“.
In: IEEE transactions on pattern analysis and machine intelligence 43.5 (2019),
pp. 1578–1604 (cit. on p. 16).
[Hod+17]
Tomáš Hodan, Pavel Haluza, Štepán Obdržálek, et al. „T-LESS: An RGB-D dataset
for 6D pose estimation of texture-less objects“. In: 2017 IEEE Winter Conference
on Applications of Computer Vision (WACV). IEEE. 2017, pp. 880–888 (cit. on
p. 17).
[HR14]
Mohsen Hejrati and Deva Ramanan. „Analysis by synthesis: 3d object recognition
by object reconstruction“. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. 2014, pp. 2449–2456 (cit. on p. 1).
[Kar+22]
Ali Karami, Roberto Battisti, Fabio Menna, and Fabio Remondino. „3D DIGITI-
ZATION OF TRANSPARENT AND GLASS SURFACES: STATE OF THE ART AND
ANALYSIS OF SOME METHODS“. In: The International Archives of Photogram-
metry, Remote Sensing and Spatial Information Sciences 43 (2022), pp. 695–702
(cit. on p. 3).
[KB14]
Diederik P Kingma and Jimmy Ba. „Adam: A method for stochastic optimization“.
In: arXiv preprint arXiv:1412.6980 (2014) (cit. on pp. 25, 66, 76).
[Kes+17]
Leonid Keselman, John Iselin Woodfill, Anders Grunnet-Jepsen, and Achintya
Bhowmik. „Intel realsense stereoscopic depth cameras“. In: Proceedings of the
IEEE conference on computer vision and pattern recognition workshops. 2017,
pp. 1–10 (cit. on p. 21).
[Kha+22]
Muhammad Saif Ullah Khan, Alain Pagani, Marcus Liwicki, Didier Stricker, and
Muhammad Zeshan Afzal. „Three-Dimensional Reconstruction from a Single
RGB Image Using Deep Learning: A Review“. In: Journal of Imaging 8.9 (2022)
(cit. on p. 3).
[Kna+17]
Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. „Tanks and
temples: Benchmarking large-scale scene reconstruction“. In: ACM Transactions
on Graphics (ToG) 36.4 (2017), pp. 1–13 (cit. on p. 37).
[Kol+21]
Alexander Kolesnikov, Alexey Dosovitskiy, Dirk Weissenborn, et al. „An Image is
Worth 16x16 Words: Transformers for Image Recognition at Scale“. In: (2021)
(cit. on pp. 33, 35, 61).
[Lag19]
Hamid Laga. „A survey on deep learning architectures for image-based depth
reconstruction“. In: arXiv preprint arXiv:1906.06113 (2019) (cit. on p. 16).
[Lam+13]
Jens Lambrecht, Martin Kleinsorge, Martin Rosenstrauch, and Jörg Krüger.
„Spatial programming for industrial robots through task demonstration“. In:
International Journal of Advanced Robotic Systems 10.5 (2013), p. 254 (cit. on
p. 2).
[LHH16]
Andreas Ley, Ronny Hänsch, and Olaf Hellwich. „Reconstructing White Walls:
Multi-View, Multi-Shot 3D Reconstruction of Textureless Surfaces“. In: ISPRS
Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences 3.3
(2016) (cit. on pp. 2, 18).
[Liu+21]
Caixia Liu, Dehui Kong, Shaofan Wang, et al. „Deep3D reconstruction: meth-
ods, data, and challenges“. In: Frontiers of Information Technology & Electronic
Engineering 22.5 (2021), pp. 652–672 (cit. on p. 16).
[LK21]
Xi Li and Ping Kuang. „3D-VRVT: 3D Voxel Reconstruction from A Single Image
with Vision Transformer“. In: 2021 International Conference on Culture-oriented
Science & Technology (ICCST). IEEE. 2021, pp. 343–348 (cit. on pp. 3, 20, 33,
35).
[Mes+19]
Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and
Andreas Geiger. „Occupancy networks: Learning 3d reconstruction in function
space“. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2019, pp. 4460–4470 (cit. on pp. 31, 35).
[Mil95]
George A Miller. „WordNet: A Lexical Database for English“. In: Communications
of the ACM 38.11 (1995), pp. 39–41 (cit. on p. 20).
[MN21]
Bogdan Maxim and Sergiu Nedevschi. „A survey on the current state of the art on
deep learning 3D reconstruction“. In: 2021 IEEE 17th International Conference on
Intelligent Computer Communication and Processing (ICCP). IEEE. 2021, pp. 283–
290 (cit. on p. 16).
[Oh +16]
Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. „Deep metric
learning via lifted structured feature embedding“. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2016, pp. 4004–4012
(cit. on pp. 30, 31).
[Ots79]
Nobuyuki Otsu. „A threshold selection method from gray-level histograms“. In:
IEEE transactions on systems, man, and cybernetics 9.1 (1979), pp. 62–66 (cit. on
p. 28).
[Pas+19]
Adam Paszke, Sam Gross, Francisco Massa, et al. „PyTorch: An Imperative Style,
High-Performance Deep Learning Library“. In: Advances in Neural Information
Processing Systems 32. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, et al.
Curran Associates, Inc., 2019, pp. 8024–8035 (cit. on p. 27).
[Pau14]
I Paun. „Multi-View ICP for 3D reconstruction of unknown space debris“. MA
thesis. 2014 (cit. on p. 1).
[PKS15]
Joseph Prusa, Taghi M. Khoshgoftaar, and Naeem Seliya. „The Effect of Dataset
Size on Training Tweet Sentiment Classifiers“. In: 2015 IEEE 14th International
Conference on Machine Learning and Applications (ICMLA). 2015, pp. 96–102
(cit. on p. 3).
[RFB15]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. „U-Net: Convolutional Net-
works for Biomedical Image Segmentation“. In: CoRR abs/1505.04597 (2015).
arXiv: 1505.04597 (cit. on pp. 26, 28).
[Ric70]
Whitman Richards. „Stereopsis and stereoblindness“. In: Experimental brain
research 10.4 (1970), pp. 380–388 (cit. on p. 9).
[RTG00]
Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. „The earth mover’s distance
as a metric for image retrieval“. In: International journal of computer vision 40.2
(2000), pp. 99–121 (cit. on p. 37).
[Saj+20]
Shreeyak Sajjan, Matthew Moore, Mike Pan, et al. „Clear grasp: 3d shape es-
timation of transparent objects for manipulation“. In: 2020 IEEE International
Conference on Robotics and Automation (ICRA). IEEE. 2020, pp. 3634–3642 (cit.
on p. 22).
[Sal+20]
Andrey Salvi, Nathan Gavenski, Eduardo Pooch, Felipe Tasoniero, and Rodrigo
Barros. „Attention-based 3D Object Reconstruction from a Single Image“. In:
2020 International Joint Conference on Neural Networks (IJCNN). IEEE. 2020,
pp. 1–8 (cit. on pp. 20, 30, 31, 33, 35, 37, 72).
[San14]
J Koppal Sanjeev. „Lambertian reflectance“. In: Computer Vision: A Reference
Guide. Ed. by Katsushi Ikeuchi. Boston, MA: Springer US (2014), pp. 441–443
(cit. on p. 50).
[SG02]
Mikkel B Stegmann and David Delgado Gomez. „A brief introduction to statistical
shape analysis“. In: Informatics and mathematical modelling, Technical University
of Denmark, DTU 15.11 (2002) (cit. on p. 36).
[Shi+19]
Soshi Shimada, Vladislav Golyanik, Christian Theobalt, and Didier Stricker.
„IsMo-GAN: Adversarial Learning for Monocular Non-Rigid 3D Reconstruction“.
In: CoRR abs/1904.12144 (2019). arXiv: 1904.12144 (cit. on pp. 3, 19, 27, 28, 34).
[SS17]
Zhexiong Shang and Zhigang Shen. „Real-time 3D reconstruction on construction
site using visual SLAM and UAV“. In: arXiv preprint arXiv:1712.07122 (2017)
(cit. on p. 2).
[Sun+17]
Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. „Revisiting
Unreasonable Effectiveness of Data in Deep Learning Era“. In: Proceedings of
the IEEE International Conference on Computer Vision (ICCV). Oct. 2017 (cit. on
p. 3).
[Suz+85]
Satoshi Suzuki et al. „Topological structural analysis of digitized binary images
by border following“. In: Computer vision, graphics, and image processing 30.1
(1985), pp. 32–46 (cit. on p. 28).
[SZ14]
Karen Simonyan and Andrew Zisserman. „Very deep convolutional networks
for large-scale image recognition“. In: arXiv preprint arXiv:1409.1556 (2014)
(cit. on pp. 23, 25, 34, 35).
[TA19]
Aggeliki Tsoli and Antonis. A. Argyros. „Patch-Based Reconstruction of a Texture-
less Deformable 3D Surface from a Single RGB Image“. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision (ICCV) Workshops. Oct.
2019 (cit. on pp. 3, 19, 25, 26, 34).
[Tat+19]
Maxim Tatarchenko, Stephan R Richter, René Ranftl, et al. „What do single-view
3d reconstruction networks learn?“ In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition. 2019, pp. 3405–3414 (cit. on pp. 9,
37).
[Vas+17]
Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. „Attention is all you need“.
In: Advances in neural information processing systems 30 (2017) (cit. on pp. 30,
31, 33, 35).
[Vos03]
George Vosselman. „3d reconstruction of roads and trees for city modelling“. In:
International archives of photogrammetry, remote sensing and spatial information
sciences 34.3/W13 (2003), pp. 231–236 (cit. on p. 1).
[Wal+07]
Bruce Walter, Stephen R Marschner, Hongsong Li, and Kenneth E Torrance. „Mi-
crofacet Models for Refraction through Rough Surfaces.“ In: Rendering techniques
2007 (2007), 18th (cit. on p. 50).
[Wan+18]
Nanyang Wang, Yinda Zhang, Zhuwen Li, et al. „Pixel2mesh: Generating 3d
mesh models from single rgb images“. In: Proceedings of the European conference
on computer vision (ECCV). 2018, pp. 52–67 (cit. on pp. 3, 20, 29, 32, 35).
[Wan+19]
Qiang Wang, Bei Li, Tong Xiao, et al. „Learning Deep Transformer Models for
Machine Translation“. In: Proceedings of the 57th Annual Meeting of the Associa-
tion for Computational Linguistics. Florence, Italy: Association for Computational
Linguistics, July 2019, pp. 1810–1822 (cit. on p. 61).
[Wel16]
Andrew E Welchman. „The human brain in depth: how we see in 3D“. In: Annual
review of vision science 2 (2016), pp. 345–376 (cit. on p. 9).
[Wid+19]
Aji Resindra Widya, Yusuke Monno, Masatoshi Okutomi, et al. „Whole stomach
3D reconstruction and frame localization from monocular endoscope video“.
In: IEEE Journal of Translational Engineering in Health and Medicine 7 (2019),
pp. 1–10 (cit. on p. 2).
[WS16]
Oliver Wasenmüller and Didier Stricker. „Comparison of kinect v1 and v2 depth
images in terms of accuracy and precision“. In: Asian Conference on Computer
Vision. Springer. 2016, pp. 34–45 (cit. on p. 47).
[YS19]
Anny Yuniarti and Nanik Suciati. „A review of deep learning techniques for 3D
reconstruction of 2D images“. In: 2019 12th International Conference on Informa-
tion & Communication Technology and System (ICTS). IEEE. 2019, pp. 327–331
(cit. on p. 15).
[YTZ21]
Yi Yuan, Jilin Tang, and Zhengxia Zou. „Vanet: a View Attention Guided Net-
work for 3d Reconstruction from Single and Multi-View Images“. In: 2021 IEEE
International Conference on Multimedia and Expo (ICME). 2021, pp. 1–6 (cit. on
pp. 20, 32, 35).
[Zha+19]
Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. „Self-
attention generative adversarial networks“. In: International conference on ma-
chine learning. PMLR. 2019, pp. 7354–7363 (cit. on pp. 31, 58).
[Zol+18]
Michael Zollhöfer, Patrick Stotko, Andreas Görlitz, et al. „State of the art on 3D
reconstruction with RGB-D cameras“. In: Computer graphics forum. Vol. 37. 2.
Wiley Online Library. 2018, pp. 625–652 (cit. on p. 15).
[ZZD20]
Tianyu Zhou, Qi Zhu, and Jing Du. „Intuitive robot teleoperation for civil engi-
neering operations with virtual reality and deep learning scene reconstruction“.
In: Advanced Engineering Informatics 46 (2020), p. 101170 (cit. on p. 2).
Websites
[CGT11] CGTrader. 3D Model Store. 2011. URL:https://cgtrader.com/ (cit. on p. 55).
[Cha10]
Siddhartha Chaudhuri. Computer Graphics at Stanford University. 2010. URL: https://graphics.stanford.edu/courses/cs148-10-summer/as3/code/as3/teapot.obj (cit. on p. 55).
[Cho+15] François Chollet et al. Keras. 2015. URL:https://keras.io (cit. on p. 25).
[Cra21]
Keenan Crane. Keenan’s 3D Model Repository. 2021. URL: https://www.cs.cmu.edu/~kmcrane/Projects/ModelRepository (cit. on p. 55).
[Lev+05]
Marc Levoy, J Gerth, B Curless, and K Pull. The Stanford 3D Scanning Repository.
2005. URL: https://graphics.stanford.edu/data/3Dscanrep (cit. on p. 55).
[Mar+15]
Martín Abadi, Ashish Agarwal, Paul Barham, et al. TensorFlow: Large-Scale Ma-
chine Learning on Heterogeneous Systems. Software available from tensorflow.org.
2015. URL:https://www.tensorflow.org/ (cit. on p. 25).
[Tea]
ShapeNet Research Team. About ShapeNet. URL: https://shapenet.org/about/ (cit. on p. 20).
[Tes19]
Tesla. This 3D reconstruction shows the immense amount of depth information
a Tesla can collect from just a few seconds of video from the vehicle’s 8 cameras
PIC.TWITTER.COM/W2X6PKM2EB. Apr. 2019. URL: https://twitter.com/tesla/status/1120815737654767616 (cit. on p. 1).
[Wig19] Ross Wightman. PyTorch Image Models. 2019 (cit. on p. 61).
Glossary
Abbreviations
The following abbreviations are used in this manuscript:
2D Two-Dimensional
AI Artificial Intelligence
BSDF Bidirectional Scattering Distribution Function
3D Three-Dimensional
CAD Computer-Aided Design
CD Chamfer Distance
CNN Convolutional Neural Network
CPU Central Processing Unit
EMD Earth Mover’s Distance
GAN Generative Adversarial Network
GB Gigabytes
GCN Graph Convolutional Network
GELU Gaussian Error Linear Unit
GPU Graphics Processing Unit
HDM-Net Hybrid Deformation Model Network
HDRI High Dynamic Range Image
IoU Intersection over Union
IsMo-GAN Isometry-Aware Monocular Generative Adversarial Network
MAE Mean Absolute Error
MLP Multi-Layer Perceptron
MSE Mean Squared Error
MVS Multi-View Stereo
NC Normal Consistency
NLP Natural Language Processing
ONet Occupancy Network
PBR Physics-Based Rendering
PNG Portable Network Graphics
R2N2 Recurrent Reconstruction Neural Network
RGB Red-Green-Blue
RGB-D Red-Green-Blue-Depth
ReLU Rectified Linear Unit
RNN Recurrent Neural Network
RSRVT Residual Sketch Reconstruction Vision Transformer
SGD Stochastic Gradient Descent
SNMT SegNet Multi-Task
SRAE Sketch Reconstruction Autoencoder
SRMA Sketch Reconstruction Multi-task Autoencoder
SRVT Sketch Reconstruction Vision Transformer
SfM Structure from Motion
TOF Time-of-Flight
VAE Variational Autoencoder
VANET View Attention Guided Network
VGG Visual Geometry Group
ViT Vision Transformer
VOC Visual Object Classes
VRVT Voxel Reconstruction Vision Transformer