
Object Detection and Pose Estimation Based on Convolutional Neural Networks Trained with Synthetic Data


Abstract

Instance-based object detection and fine pose estimation is an active research problem in computer vision. While the traditional interest-point-based approaches for pose estimation are precise, their applicability in robotic tasks relies on controlled environments and rigid objects with detailed textures. CNN-based approaches, on the other hand, have shown impressive results in uncontrolled environments for more general object recognition tasks like category-based coarse pose estimation, but the need of large datasets of fully-annotated training images makes them unfavourable for tasks like instance-based pose estimation. We present a novel approach that combines the robustness of CNNs with a fine-resolution instance-based 3D pose estimation, where the model is trained with fully-annotated synthetic training data, generated automatically from the 3D models of the objects. We propose an experimental setup in which we can carefully examine how the model trained with synthetic data performs on real images of the objects. Results show that the proposed model can be trained only with synthetic renderings of the objects' 3D models and still be successfully applied on images of the real objects, with precision suitable for robotic tasks like object grasping. Based on the results, we present more general insights about training neural models with synthetic images for application on real-world images.
Josip Josifovski¹, Matthias Kerzel¹, Christoph Pregizer², Lukas Posniak², Stefan Wermter¹
To recognize an object and determine its location and pose
in the environment is still a challenging task in computer
vision and an active research topic. Retrieving the spatial
information about an object is crucial for robotic tasks
like vision-based scene understanding, grasping or object
manipulation. Performing such a task becomes complex if the
objects and their environment are not specifically adjusted
for robots by controlling lighting conditions, the positions of
the objects or their textures, or placing distinctive markers
on the objects.
Historically, most models for instance-based object recog-
nition have extracted local descriptors around points of
interest in images [1], [2]. While these features are invariant
to rotation, scale, and partial occlusions, they are not very
robust to illumination changes or partial deformations of the
object. Furthermore, since they depend on interest points and
local features, they only work well for objects with texture.
*The authors gratefully acknowledge partial support from the German
Research Foundation DFG under project CML (TRR 169), and the Hamburg
Landesforschungsförderungsprojekt CROSS.
¹Knowledge Technology, Department of Informatics, University
of Hamburg, Germany. 4josifov / kerzel / wermter
²Viewlicity GmbH, Schäferkampsallee 42, 20357 Hamburg, Germany
pregizer / posniak
Fig. 1. Top: the 3D models used for generating synthetic training data.
Bottom: the estimated 3D pose of the objects and a robotic hand using
our CNN-based pose estimation model. The 3D poses are visualized by
overlaying the object contours onto the image. The model can successfully
handle textured or texture-less objects as well as deviations between the 3D
model used for training and the real object.
Recently, there has been a shift in the state of the art by
using Convolutional Neural Network (CNN) approaches for
computer vision tasks, such as image classification [3] or
object detection [4]. The advantage of CNNs is their ability for
learning discriminative, hierarchical features directly from
the underlying training data, removing the need for designing
domain-specific feature extractors. CNNs have shown strong
generalization abilities, making them more robust to lighting
conditions, object textures, partial occlusions, and even ob-
ject deformations. However, the performance of CNNs relies
upon the available training data, which is scarce for more
complex computer vision tasks like pose estimation, due to
the many variables involved. Because of this and the need for
annotated data, manual preparation of a dataset becomes a
time-consuming process. In contrast, advances in computer-
generated imagery allow automation of the data generation
process by creating fully-annotated, synthetic data. However,
the benefits come at the cost of decreased realism, and it is
an open research question if such data can be useful for
training if the model is in the end used to process real-world
data from a real vision sensor.
2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Madrid, Spain, October 1-5, 2018
978-1-5386-8093-3/18/$31.00 ©2018 IEEE 6269
In this paper, we present a CNN-based approach for
instance-based object detection and 3D pose estimation that
uses synthetic training data. Further, we describe the proce-
dure for creating different types of synthetic training data
and an experimental setup which allows for an objective
evaluation of the model’s performance on real images when
it has been trained with synthetic training data.
The results contribute to a better understanding of the
model architecture, the training procedure, and the training
data properties which are useful to “bridge the gap” between
synthetic and real data for object detection and pose estima-
tion. Finally, we summarise general advice on training neural
models with synthetic images for applications based on real-
world images and demonstrate the use of our approach in
a robotic grasping scenario with NICO the Neuro-Inspired
COmpanion humanoid robot [5] depicted in Figure 1.
The recent success of convolutional neural networks in image
classification, e.g., on the challenging ImageNet benchmark
dataset [3], [6], has sparked interest in applying CNN-based
methods to additional computer vision tasks. Since
then, CNN-based models that perform simultaneous ob-
ject classification and localization [4] were developed, with
recent models [7], [8] showing considerable improvement
in the accuracy of the predictions or the processing time.
However, pose estimation is a harder task than localization
since it subsumes classification, localization, and recognition
of the correct orientation of multiple objects in an image.
Due to the complexity of the problem and the need for
fully-annotated training datasets of images, which capture
the variety of factors that need to be learned, very few
methods tackle instance-based object pose estimation using
completely CNN-based models.
Wohlhart et al. [9] introduce a CNN-based method for
calculating descriptors that encode the object’s class and
pose for an instance-based recognition. These descriptors are
generated by a CNN that is trained to generate well-separated
descriptors in Euclidean space so that the distance between
descriptors for the same object but different orientations
is proportional to the difference in the object’s pose. The
training is enriched with synthetic training data using 3D
models of the objects to generate images under different
camera viewpoints. To determine the object and its pose
in a test image, they extract a descriptor of the image and
perform nearest-neighbour search to descriptors generated
from predefined views for each object and its pose.
In contrast, Su et al. [10] rely exclusively on CNN-
based architectures to determine an object’s bounding box,
class, and viewpoint for category-based pose estimation.
They generate synthesized RGB training data from various
3D models within each category and use the data to train
a CNN for camera viewpoint estimation. Regarding the
network architecture, the authors indicate that the problem
of viewpoint estimation is category-dependent; therefore, the
classical CNN network topology cannot be applied when
training for multiple categories. Su et al. evaluate the model
when trained with synthesized-only, with real-only, and with
combined training data. They report that synthesized data
training is better than training with real data, while the
combination of both yields slightly better results.
A recent approach by Tobin et al. [11] evaluates the
transfer of knowledge from a simulated to a real-world
scenario for a CNN-based model for object localization. They
create a simulated environment that resembles the real world
regarding 3D configuration and randomly render objects
without photo-realistic appearance, while randomizing col-
ors, lighting conditions or object positions in the generated
images. They hypothesize that if the randomization is high
enough, the CNN-based model should generalize well just
from simulated data to perform a localization task in the
real world without additional fine-tuning. The trained model
is then tested in a real-world grasping task. According to
their results, the task can be learned only from the simulated
training images.
In contrast, Planche et al. [12] focus on creating realistic
depth scans from 3D models in simulation, by modeling
the error of the sensor device. They apply the synthetic
depth scans to train CNN descriptors based on the method
proposed by Wohlhart et al. [9]. Their experiments indicate
that the added realism to the simulated data improves the
performance of the task, thus making the transition from
simulation to real world smoother.
Finally, in parallel to our work, two other approaches have
been published very recently: the PoseCNN [13], which is
a CNN-based model with complex architecture that tackles
the pose estimation problem through a three-stage process,
and the BB8 model [14] which is a two-stage approach.
While both approaches tackle similar problems, the main
difference to our approach is that they rely on real images
of the objects, using synthetic data only for augmentation
of the training data or improvement of the results, while
our approach relies only on synthetic views generated from
the objects’ 3D models. Both methods also treat the pose
estimation as a regression problem, while we treat it as a
classification problem. In this regard, our methodology is
informed by the work of Su et al. [10]. However, unlike
their approach, where only the viewpoint estimator is
trained with synthetic data, we train both the object detector
and the viewpoint estimator with synthetic data. Furthermore,
their aim has been category-based object pose estimation, and
their results, while impressive for such a general problem,
are still not precise enough to be practically useful in robotic
tasks. Therefore, we reduce the scope of the problem to
instance-based detection and pose estimation and propose
a different architecture with the aim of increasing the precision
of the model, such that it can be used for practical tasks,
e.g., in robotic grasping of objects whose 3D models are
known. Our underlying research question, the transferability
of vision networks trained with synthetic data to a real-world
scenario, is most related to the work of Tobin et al. [11].
However, where their model maps from an image of objects
to their positions, we map from an image to the objects’ 3D
poses, so that information about the rotation is also available.
Fig. 2. The proposed approach: a) Pose estimation model as a sequence of object detector and viewpoint estimator. b) Procedure for training of the object
detector and viewpoint estimation components.
To achieve instance-based object detection and pose esti-
mation, we use the modular architecture presented in Figure
2a. The first module of the architecture is a CNN-based
detector which detects and classifies regions of the image.
For a given image it outputs the bounding boxes and the
labels of the detected objects. The labeled bounding box
output of the detector module becomes the input for an
object-specific CNN-based viewpoint estimation model. The
viewpoint estimation model calculates the 3D pose under
which the object appears in the bounding box by solving
a related problem: estimating the viewpoint of the camera
related to the object’s coordinate system. The viewpoint is
represented by the azimuth and elevation angle of the camera
relative to the object’s coordinate system and the in-plane
rotation of the camera around its optical axis. These three
angles are the output of the viewpoint estimation component.
The final output of the pose estimation model is then a 3D
pose of the object within the detected bounding box¹. Since
the architecture is modular, the detector or viewpoint estima-
tor can easily be replaced with a different implementation.
For the detector component we use the Faster R-CNN model
[16]. For the viewpoint estimator component we propose the
model defined in Section III-B which is inspired by [10].
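The modular structure described above can be sketched as follows. This is only an illustration of the detector-then-estimator sequence, not the implementation used in the experiments: the `Detection` and `Pose` containers, the callable signatures, and the function name `estimate_poses` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Detection:
    label: str  # object class predicted by the detector
    box: Box    # bounding box predicted by the detector


@dataclass
class Pose:
    label: str
    box: Box
    azimuth: float    # degrees
    elevation: float  # degrees
    inplane: float    # degrees


# Hypothetical interfaces: any detector or estimator with these signatures fits.
Detector = Callable[[object], List[Detection]]
ViewpointEstimator = Callable[[object, Box], Tuple[float, float, float]]


def estimate_poses(image, detector: Detector,
                   estimators: Dict[str, ViewpointEstimator]) -> List[Pose]:
    """Run the detector, then the object-specific estimator on each box."""
    poses = []
    for det in detector(image):
        estimator = estimators.get(det.label)
        if estimator is None:
            continue  # no viewpoint model was trained for this object
        azimuth, elevation, inplane = estimator(image, det.box)
        poses.append(Pose(det.label, det.box, azimuth, elevation, inplane))
    return poses
```

Because the two components only share the labeled bounding box, either callable can be swapped for another implementation without touching the other.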
A. Synthetic datasets
For training the detector and viewpoint estimation com-
ponents, we use the training procedure shown in Figure 2b.
The input for the training procedure is the 3D models of the
objects and images representing the environment which serve
as substitute background2. From the input data, we produce
1The output of our architecture does not provide a direct 6DoF pose, but
it is sufficient to reconstruct the complete 6DoF pose under known camera
intrinsics in combination with the object’s 3D model. For this purpose, well-
established approaches like PnP [15] can be used.
2These images are an optional input since we can also use a synthetically-
generated background. However, we expect that real-world images similar
to the environment where the model is applied, e.g. ”indoor” or ”outdoor”,
would reduce the number of false positives by the off-shelf detector
two types of training datasets: an object detection dataset and
an object viewpoint estimation dataset.
The dataset for object detection is produced by importing
the 3D models of the object into the Open Source 3D creation
suite Blender [17], randomizing their position and orientation
as well as the scene lighting conditions. The objects are
then rendered on a transparent background. The generated
image is overlaid on a random substitute-background image.
The annotation includes the labelled bounding boxes of the
objects for image detection. For viewpoint estimation, we
create a separate dataset for each object. First, the 3D model
of the object is imported into Blender, and the virtual camera
is positioned on a random elevation and azimuth position
related to the object’s coordinate system. Then, the camera
is rotated towards the object so that the camera’s optical axis
crosses through the object’s origin. The camera is rotated by
a random angle (in-plane rotation) around its optical axis.
The lighting in the scene is randomized, and an image is
rendered depicting the object on a transparent background.
The generated image is cropped according to the object’s
bounding box and overlaid on a substitute-background
image. The annotation of each image contains the azimuth,
elevation and in-plane rotation angles. Since the datasets
are created in simulation, the annotations are generated
automatically without the need for human assistance or a
complicated turn-table acquisition setup.
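The camera placement for the viewpoint dataset can be illustrated with a small geometric sketch. It assumes a z-up object coordinate system, an elevation range of [-90, 90] degrees, and uniform sampling of the three angles; none of these conventions is stated in the text, and the function names are hypothetical. The actual rendering is done in Blender and is not reproduced here.

```python
import math
import random


def camera_position(azimuth_deg, elevation_deg, distance=1.0):
    """Point on a sphere around the object origin (z is 'up' by assumption)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (distance * math.cos(el) * math.cos(az),
            distance * math.cos(el) * math.sin(az),
            distance * math.sin(el))


def optical_axis(position):
    """Unit vector from the camera towards the object origin."""
    norm = math.sqrt(sum(c * c for c in position))
    return tuple(-c / norm for c in position)


def sample_viewpoint(rng):
    """Draw one random annotation (all angles in degrees).

    Uniform sampling in angle space is an assumption; the sampling
    distribution is not specified in the text.
    """
    return {
        "azimuth": rng.uniform(0.0, 360.0),
        "elevation": rng.uniform(-90.0, 90.0),
        "inplane": rng.uniform(0.0, 360.0),  # roll around the optical axis
    }


rng = random.Random(0)
annotation = sample_viewpoint(rng)
position = camera_position(annotation["azimuth"], annotation["elevation"], 2.0)
axis = optical_axis(position)  # points at the object's origin
```

The sampled dictionary is exactly the annotation stored with the rendered image, which is why no manual labelling is needed.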
We use the synthetic datasets to train the object detector
and the object-specific viewpoint estimators. Once trained,
the object detector and object-specific viewpoint estimators
can be applied sequentially to estimate the 3D pose of the
objects in an image.
B. Viewpoint estimator architecture
The viewpoint estimation with a neural network can be
modelled as a regression or a classification task. One problem
with regression is that mapping the rotational components
to numerical values suffers from discontinuities (e.g. 0 and
360 degrees represent the same angle). To overcome the
discontinuity problem, some authors [18], [19] suggest a
prior coarse estimation of the angle, where the angle is
classified to belong to one of the four 90-degree intervals
to eliminate the discontinuity, and then a regressor for each
interval is trained. However, another problem arises with
symmetrical objects, where special care has to be taken to
identify the symmetries of the object to be able to train the
regressor, as it has been done in [14].
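The discontinuity can be made concrete with a short sketch. It compares a naive numerical difference with a wrap-around difference; the function name `angular_difference` is hypothetical.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees (0..180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


# A regressor trained on raw angle values sees 359 and 1 as far apart,
# although the two viewpoints differ by only 2 degrees.
naive_error = abs(359.0 - 1.0)
true_error = angular_difference(359.0, 1.0)
```

A regression loss computed on the raw values would penalise a nearly correct prediction heavily near the 0/360 boundary, which is the problem the classification formulation avoids.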
Because of these drawbacks, we define the viewpoint
estimation problem as a classification task, similarly to [10],
with rotation represented by 10-degree bins for azimuth,
elevation and in-plane rotation of the object w.r.t. the
camera. Unlike [10], where independent output units for
azimuth, elevation and in-plane rotation are used, we propose
an alternative architecture, where the azimuth and elevation
outputs are paired together into a two-dimensional azimuth-
elevation array of output units, each representing a unique
viewpoint around the object, while the in-plane rotation units
are kept as an independent one-dimensional array. The reason
for this change is that the azimuth and elevation angles
are conceptually different from the in-plane rotation angle:
when the azimuth or elevation is changed we effectively see
a different region of the object, while when the in-plane
rotation is changed and the other angles are kept constant,
we still see the same region of the object but it appears
rotated in the image. We expect that such organization of
the output layer would lead to learning more appropriate
high-level features for viewpoint estimation by the network.
For the convolutional part of the model, we use the VGG-
16 [20] architecture. In contrast to Su et al. [10], where
the VGG-16 convolutional base is shared among all objects,
and the final fully-connected and output layers are category-
specific, we train a complete model for each object, to make
sure that the accuracy is maximized by learning only single-
object-specific features. While this decision might influence
the scalability of the model if many objects need to be esti-
mated at once, our analyses have shown that for each object,
retraining just the top convolutional layers gives optimal
results, therefore the lower layers of the convolutional base
can be shared in order to have an optimal trade-off between
scalability and accuracy.
Fig. 3. Architecture of the proposed CNN-based viewpoint estimator
The proposed architecture of the viewpoint estimator is
presented in Figure 3. At the bottom is the VGG-16 con-
volutional base, followed by a single fully-connected layer
with rectified linear activation that serves as an intermediate
between the convolutional base and the output, and a soft-
max output layer where the output units represent 10-degree
bins for the elevation-azimuth pairs and the in-plane rotation.
We have restricted the range of elevation to 180 degrees since
only one of the azimuth or elevation angles should be defined
in the full 360-degree range to cover all possible viewpoints
on an imaginary sphere around the object. The range of the
in-plane rotation angle is 360 degrees as the camera can be
rotated fully in any azimuth-elevation point.
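A minimal sketch of how angles could map to output units under this layout is given below. The unit counts (36 × 18 azimuth-elevation units and 36 in-plane units) follow from 10-degree bins over the stated ranges but are not given explicitly in the text; the elevation interval [-90, 90) and the row-major ordering of the 2D array are assumptions, and all names are hypothetical.

```python
BIN_SIZE = 10.0
N_AZIMUTH = 36    # 360 degrees / 10-degree bins
N_ELEVATION = 18  # 180 degrees / 10-degree bins (assumed range [-90, 90))
N_INPLANE = 36    # 360 degrees / 10-degree bins


def azimuth_elevation_index(azimuth_deg, elevation_deg):
    """Flat index into the 2D azimuth-elevation array (row-major by elevation)."""
    az_bin = int((azimuth_deg % 360.0) // BIN_SIZE)
    # Clamp so that an elevation of exactly +90 falls into the last bin.
    el_bin = min(int((elevation_deg + 90.0) // BIN_SIZE), N_ELEVATION - 1)
    return el_bin * N_AZIMUTH + az_bin


def inplane_index(inplane_deg):
    """Index into the independent 1D in-plane rotation array."""
    return int((inplane_deg % 360.0) // BIN_SIZE)


def bin_center(bin_index, offset=0.0):
    """Angle at the center of a bin; use offset=-90 for elevation bins."""
    return offset + (bin_index + 0.5) * BIN_SIZE
```

Under these assumptions the output layer would have 648 + 36 units, compared with 36 + 18 + 36 units if the three angles were predicted independently.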
To train the viewpoint model, we use categorical
cross-entropy. As indicated by Su et al. [10], the output units
of the viewpoint model represent geometrical entities and
not real classes. Therefore, when encoding the geometrical
information in the class labels, we use soft-classification by
allowing one input to belong to multiple classes (multiple
neighboring bins) independently with a different probability.
For each training image, we generate the probability for
each bin by centering a normal distribution on the bin
corresponding to the ground truth angle values of that image.
We sample the distribution over the neighboring bins to get
probabilities for each bin. For the azimuth-elevation part we
sample a 2D normal distribution, and for the in-plane rotation
part, we sample a 1D normal distribution.
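The soft-label construction can be sketched as below. The standard deviation (`sigma`, in bins), the neighbourhood size (`radius`), and the normalisation of the sampled values to sum to one are assumptions, since the text does not give them. The sketch also ignores the pole geometry for elevation, which it treats as non-periodic.

```python
import math


def circular_distance(i, j, n):
    """Distance between two bins on a ring of n bins."""
    d = abs(i - j) % n
    return min(d, n - d)


def soft_labels_1d(true_bin, n_bins=36, sigma=1.0, radius=2):
    """Soft target for the in-plane rotation: a sampled 1D normal distribution."""
    weights = [0.0] * n_bins
    for b in range(n_bins):
        d = circular_distance(b, true_bin, n_bins)
        if d <= radius:  # only neighboring bins receive probability mass
            weights[b] = math.exp(-0.5 * (d / sigma) ** 2)
    total = sum(weights)
    return [w / total for w in weights]


def soft_labels_2d(true_az, true_el, n_az=36, n_el=18, sigma=1.0, radius=2):
    """Soft target for azimuth-elevation pairs: a sampled 2D normal distribution."""
    grid = [[0.0] * n_az for _ in range(n_el)]
    for el in range(n_el):
        d_el = abs(el - true_el)  # elevation is treated as non-periodic
        for az in range(n_az):
            d_az = circular_distance(az, true_az, n_az)  # azimuth wraps around
            if d_el <= radius and d_az <= radius:
                grid[el][az] = math.exp(-0.5 * (d_az ** 2 + d_el ** 2) / sigma ** 2)
    total = sum(map(sum, grid))
    return [[w / total for w in row] for row in grid]


def cross_entropy(target, predicted, eps=1e-12):
    """Categorical cross-entropy between a soft target and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))
```

With such targets, a prediction that falls into a bin adjacent to the ground truth is penalised less than one that is far away, which matches the geometric meaning of the output units.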
We conduct several experiments to evaluate the proposed
architecture and the effects of training with synthetic data.
For this purpose we use the T-LESS dataset [21], which
provides 3D models of various objects taken from the
electricity installation domain, as well as RGB images of
the real objects and a precise 6DoF annotation of the objects
in the images. The dataset is challenging since the objects are
texture-less, symmetrical and similar to each other, with
test images which contain a high degree of object occlusion.
For the experiments, we always train the model with
synthetic training images, created as described in Section
III-A. However, we test the model with two types of test
images: real and synthetic. The synthetic test images are
carefully created to resemble the real test images. We achieve
this using the rendering procedure described in III-A, but
instead of randomizing the object positions, to create a
synthetic test image we use the ground truth information
of a corresponding real test image from the T-LESS dataset.
This way, in the synthetic test images the actual objects are
overlaid with the renderings of their 3D models. An example
of a real and synthetic test image is shown in Figure 4.
This leads to two test sets with the following difference: in
the real test set, we have images of the real objects while in
the synthetic test set we have images of the rendered objects
based on their 3D models. By comparing the performance
of the model on the two test sets, we can see to what
degree the transfer from synthetic to real data influences the
performance of the model. Using this approach, we conduct
experiments with the detector and viewpoint estimator com-
ponents as separate modules, as well as with the detector-
estimator architecture as a whole.
Fig. 4. Example test images. Left: real test image from the T-LESS dataset.
Right: Synthetic test image created from the corresponding real image by
overlaying rendered 3D models of the objects on the place of the actual objects.
A. Object detection experiment
For the detection of the bounding box of an object, we rely
on the Faster R-CNN model [7], which has been proposed
as a category-based object detector and is usually trained
on real images. However, it is unclear if the Faster R-CNN
model will be able to successfully detect the objects in
real images if it is trained with synthetic images of their
3D models. Also, it is unknown whether it is necessary to
generate photo-realistic synthetic images or the detector can
learn important features of the objects even if the objects are
rendered without high realism. To address these questions,
we perform an experiment.
1) Experimental setup: Using the procedure outlined in
Section III-A, we create two synthetic datasets for object
detection, which we refer to as the “high realism” and “low
realism” synthetic dataset. In both datasets, all variables like
object positions and substitute backgrounds are the same;
the only difference is the realism in the rendering of the 3D
models. For the “low realism” dataset we render the objects
using the Blender Internal [17] rendering engine under a light
source that generates a high contrast between different parts
of the object, while for the “high realism” dataset we use the
Blender Cycles [17] rendering engine under random lighting
conditions and random film exposure time. The difference in
realism from the two renderers emerges from the concept of
how they render an image within a virtual scene, with the
Cycles renderer producing more photo-realistic images, as
indicated in [22].
For each dataset, we create 5000 synthetic training images
and 2016 synthetic test images³, each synthetic test image
corresponding to a real test image from the T-LESS dataset. Sample
training images from the two synthetically-generated datasets
are presented in Figure 5.
Fig. 5. Synthetic training images for object detection experiment. Left:
“low realism” condition; Right: “high realism” condition.
³The T-LESS dataset does not provide images of the environment without
any objects in them, which we need as an input in order to use them as
substitute backgrounds for the synthetic training data. To overcome this, we
split the T-LESS test images into two parts. The first part contains the scenes
1, 5, 6, 7, 8, 9 and 10, and the second part contains the scenes 2, 3, 4 and 11. We
use the first part for creating substitute environment backgrounds without
objects, where the environment background image is generated by covering
the objects in the image with a random region of the image that does not
contain any objects. The second part contains a total of 2016 images which
we use as real test images and as a basis for generating synthetic test images.
For the object detector component, we use the
implementation of the Faster R-CNN model available from the
TensorFlow Object Detection API [16]. The model is
pre-trained on the COCO dataset [23] of real images. We use
the default configuration parameters [24] and we initialize
the model with pre-trained weights. For a single training run,
starting from the pre-trained state, we fine-tune the model
on synthetic training data for 5000 steps. After 5000 steps
we measure the performance of the model on the synthetic
test data and the real test data. The metric we use is mean
Average Precision (mAP), as it has been defined for the
PASCAL Visual Object Classes (VOC) Challenge [25] and
implemented in the Tensorflow Object Detection API [24].
We perform 20 independent experiment runs, 10 times
using the “high realism” synthetic dataset, and 10 times
using the “low realism” synthetic dataset. Before each run,
we randomize the order of the training images.
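The PASCAL VOC metric counts a detection as a true positive when its label is correct and its overlap with the ground truth box, measured as intersection over union (IoU), exceeds 0.5. A minimal sketch of that matching criterion follows; it uses continuous box coordinates and hypothetical function names, and it leaves out the ranking and precision-recall integration that produce the final mAP value.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_label, pred_box, gt_label, gt_box, threshold=0.5):
    """A detection counts only if both the class and the overlap are right."""
    return (pred_label == gt_label
            and intersection_over_union(pred_box, gt_box) > threshold)
```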
TABLE I
Training data               mAP on real test images    mAP on synthetic test images
“low realism” (10 runs)     0.584, 0.598, 0.585,       0.605, 0.67, 0.616,
                            0.593, 0.622, 0.594,       0.636, 0.697, 0.624,
                            0.667, 0.56, 0.607, …      0.692, 0.615, 0.656, …
“high realism” (10 runs)    0.631, 0.634, 0.585,       0.734, 0.749, 0.723,
                            0.603, 0.632, 0.64,        0.678, 0.71, 0.733,
                            0.576, 0.666, 0.679, …     0.666, 0.732, 0.755, …
average                     0.614 ± 0.032              0.683 ± 0.048
2) Results: The mAP results of the detector, given in
Table I, indicate that the detector is able to learn from the
synthetic training data and correctly detect and classify the
bounding boxes of the objects, overcoming the challenges
like high object similarity and object occlusion in the test
images. Sample results of the detected bounding boxes are
shown in Figure 6 (left).
To see if more “realistically-appearing” rendering of the
training data changes the detector’s performance on the real
images, we use the mAP values of each experiment run to
perform a two-tailed independent-samples t-test under the
null hypothesis that there is no change in the detector’s
performance under the two conditions. The results (t(10) =
-1.8912, p-value = 0.0748) indicate that we cannot reject the
null hypothesis that assumes equal performance under the
two conditions. This result is somewhat surprising because it
indicates that the detector can generalize from unrealistically
rendered objects as well as from ones that are realistically
rendered. This is an important observation, because
producing a “high-realism” training image is more
resource-consuming than producing a “low-realism” training image,
e.g., in our experiments it takes about seven times longer to
produce a “high-realism” image.
The second important question is the possibility of
transferring the detector trained with synthetic images to real
images. If we take the detector’s performance on synthetic
test images as a reference performance when there is no
difference in the nature of the training and test data, the
question is how close to that performance can the detector
come when tested on real images. Under the null hypothesis
that there is no difference in the performance between the
two types of test images, we perform a dependent-samples
t-test, using the column-wise mAP values from Table I.
According to the results of the paired-samples t-test for the
“low realism” condition (t(10) = 3.270133, p-value = 0.00425)
and for the “high realism” condition (t(10) = 6.842536,
p-value = 0.0000021), we can reject the null hypothesis that the
performance of the model when it is trained with synthetic
and applied on synthetic images is as good as when it is
applied on real images. Comparing the average mAP on real
test images to the average mAP on synthetic test images in
Table I, we can conclude that there is a drop of about 6
percent in the performance of the detector model when it is
trained with synthetic and applied on real data.
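The two kinds of test statistic used in this section can be sketched with the Python standard library. This is a generic illustration and does not reproduce the reported values: the pooled-variance (Student) form of the independent-samples test is an assumption, the function names are hypothetical, and the conversion of t to a p-value is left to a statistics package.

```python
import math
from statistics import mean, variance


def independent_t(a, b):
    """Student's t statistic for two independent samples (pooled variance)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2  # statistic and degrees of freedom


def paired_t(a, b):
    """t statistic for paired samples, e.g. the same run scored on two test sets."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / math.sqrt(variance(diffs) / n)
    return t, n - 1  # statistic and degrees of freedom
```

The independent form fits the comparison of the two rendering conditions, where the runs are unrelated. The paired form fits the comparison of real and synthetic test images, where each trained model is evaluated on both. In practice, `scipy.stats.ttest_ind` and `scipy.stats.ttest_rel` return the p-value as well.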
Fig. 6. Left: Results of the object detection experiment on a test image from
the T-LESS dataset. The predicted bounding boxes that have met the True
Positive (TP) condition according to the mAP metric are shown together
with the ground truth bounding boxes (red color).
Right: Sample images used in the viewpoint estimation experiment: a)
Synthetic training images. b) Real test images of the objects within their
bounding box. c) Synthetic test images generated from the corresponding
real test images.
B. Object viewpoint estimation experiment
In this section, we evaluate the viewpoint estimation model
proposed in Section III-B. The experiment evaluates how
precisely the model can estimate the object’s 3D-pose given
its 2D-view in a bounding box, and whether it can be
successfully applied on real images after training solely with
synthetic images. The experiment is carried out assuming an
ideal bounding box detector, i.e., the training and test input
to the viewpoint estimator are the ideal bounding boxes of
the objects. We address a non-ideal viewpoint estimation in
Section IV-C, where we use the predicted bounding boxes
of the detector component to test the detector-estimator
architecture as a whole.
1) Experimental setup: In this experiment, we use the 3D
models of objects 5, 8 and 18 from the T-LESS dataset. These
objects have different shapes, different levels of symmetry
and different levels of detailed features, providing enough
variability to test if the viewpoint estimator architecture is
robust under different object properties. For each object, we
generate a synthetic dataset of approximately one million
synthetic training images using their 3D models as described
in Section III-A. Examples of the synthetic training images,
as well as the synthetic and real test images, are shown in
Figure 6 (right).
To train the viewpoint estimators, we first load the weights
of the VGG-16 convolutional base [26] pre-trained on the
ImageNet [6] dataset and randomly initialize the weights
of the newly attached intermediate fully-connected layer and
the output layer. For each epoch, we randomly select about
50 000 synthetic training samples and train the model with
a batch size of 32 samples (footnote 4). The training
procedure has two stages: a transfer-learning stage and a
fine-tuning stage (footnote 5). In the transfer-learning
stage, we train the model for 20 epochs, adjusting only the
weights of the fully-connected layers while keeping the
weights of the VGG-16 convolutional base fixed. For this
stage we use the rmsprop [27] optimizer with default
parameters. After these 20 epochs, we run the fine-tuning
stage until the model converges; here, besides the
fully-connected layers, we also adjust the top 6
convolutional layers of VGG-16. For the fine-tuning stage,
we use Stochastic Gradient Descent (SGD) with a learning
rate of 10^-4 and a momentum of 0.9. At the end of each
epoch, we evaluate the model on both the synthetic and the
real test images. We perform 5 independent training runs for
each object to account for the effects of randomness in the
results.
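The per-epoch subsampling described above can be illustrated with a
minimal sketch. It assumes that the roughly 50 000 samples of an epoch are
drawn without replacement from the full pool of about one million images;
the text does not state the sampling scheme, and the function names
epoch_batches and batches_per_epoch are hypothetical.

```python
import math
import random


def epoch_batches(num_samples, samples_per_epoch=50_000, batch_size=32, rng=None):
    """Yield lists of sample indices for one epoch.

    Assumption: a fresh random subset of `samples_per_epoch` indices is drawn
    from the full pool without replacement and split into mini-batches.
    """
    rng = rng or random.Random()
    subset = rng.sample(range(num_samples), min(samples_per_epoch, num_samples))
    for start in range(0, len(subset), batch_size):
        yield subset[start:start + batch_size]


def batches_per_epoch(samples_per_epoch=50_000, batch_size=32):
    """Number of mini-batches in one epoch (the last batch may be smaller)."""
    return math.ceil(samples_per_epoch / batch_size)


if __name__ == "__main__":
    batches = list(epoch_batches(1_000_000, rng=random.Random(0)))
    print(len(batches), len(batches[0]), len(batches[-1]))
```

With these values an epoch consists of ceil(50 000 / 32) mini-batches, the
last of which is only partially filled.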
2) Metric for viewpoint estimation: To measure the
performance of the model, we follow the guidelines proposed
by Xiang et al. [28]. The output prediction of our viewpoint
estimator for a given input image is the center of the
10-degree azimuth-elevation and in-plane rotation bins with
the highest probability. We consider a prediction to be
correct if the difference between the ground truth value and
the predicted value is within ±x degrees for each of the
three angles. The accuracy is then the proportion of correct
predictions, and we refer to this metric as “accuracy with x
degrees tolerance”. Due to the discretization, the finest
tolerance we can measure is 5 degrees (footnote 6). We also
report the accuracy for each of the azimuth, elevation and
in-plane rotation angles independently, where a prediction
is correct if the predicted angle is within 5 degrees of the
true value.
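The metric can be sketched in a few lines of Python. This is an
illustration under stated assumptions, not the evaluation code used in the
experiments: angles are given in degrees as (azimuth, elevation, in-plane
rotation) tuples, and differences are computed with wrap-around at 360
degrees, which the text does not specify. All function names are
hypothetical. The last function reproduces the random-guess estimate for
18 x 36 x 36 bins.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles in degrees (0..180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def is_correct(pred, truth, tolerance_deg):
    """True if every angle of `pred` is within the tolerance of `truth`.

    `pred` and `truth` are (azimuth, elevation, in_plane) tuples in degrees.
    """
    return all(angular_difference(p, t) <= tolerance_deg
               for p, t in zip(pred, truth))


def accuracy_with_tolerance(preds, truths, tolerance_deg):
    """Proportion of predictions that are correct for all three angles."""
    correct = sum(is_correct(p, t, tolerance_deg) for p, t in zip(preds, truths))
    return correct / len(preds)


def chance_level(elev_bins=18, azim_bins=36, inplane_bins=36):
    """Probability that a uniform random guess hits the correct bin triple."""
    return 1.0 / (elev_bins * azim_bins * inplane_bins)


if __name__ == "__main__":
    preds = [(5, 5, 5), (355, 15, 5), (45, 25, 95)]
    truths = [(8, 3, 9), (2, 12, 1), (90, 25, 95)]
    for tol in (5, 15, 25):
        print(tol, accuracy_with_tolerance(preds, truths, tol))
    print(chance_level() * 100, "percent")
```

The per-angle accuracies follow the same pattern, applying
angular_difference to a single angle instead of all three.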
3) Results: Table II presents the accuracy results for 5 runs
per object. The results indicate that the model can
generalize and learn features that are relevant to the
viewpoint estimation task independently of the object’s
properties, and that the features learned from synthetic
images are relevant and transferable to real-world data. The
model nevertheless performs better on the synthetic test
images than on the real test images. While one obvious
reason for this is the different nature of the training and
test data, we assume that the result is also influenced by
the properties of the 3D models: the models we use from the
T-LESS dataset are manually reconstructed and do not contain
all of the fine details of the real objects. We assume that
the performance would increase if more detailed 3D models of
the objects were available.

TABLE II. Accuracy results for 5 runs per object.

object  metric                      synthetic test images  real test images
obj 05  accuracy (5° tolerance)     0.376 ± 0.009          0.196 ± 0.015
        accuracy (15° tolerance)    0.881 ± 0.005          0.741 ± 0.014
        accuracy (25° tolerance)    0.909 ± 0.005          0.829 ± 0.009
        azimuth accuracy            0.632 ± 0.006          0.4 ± 0.018
        elevation accuracy          0.69 ± 0.011           0.574 ± 0.019
        in-plane rotation accuracy  0.746 ± 0.007          0.59 ± 0.01
obj 08  accuracy (5° tolerance)     0.276 ± 0.008          0.135 ± 0.007
        accuracy (15° tolerance)    0.713 ± 0.008          0.509 ± 0.011
        accuracy (25° tolerance)    0.78 ± 0.005           0.595 ± 0.009
        azimuth accuracy            0.523 ± 0.015          0.352 ± 0.014
        elevation accuracy          0.546 ± 0.011          0.409 ± 0.021
        in-plane rotation accuracy  0.576 ± 0.007          0.431 ± 0.012
obj 18  accuracy (5° tolerance)     0.261 ± 0.018          0.09 ± 0.011
        accuracy (15° tolerance)    0.76 ± 0.014           0.388 ± 0.034
        accuracy (25° tolerance)    0.863 ± 0.008          0.553 ± 0.03
        azimuth accuracy            0.46 ± 0.018           0.229 ± 0.009
        elevation accuracy          0.666 ± 0.014          0.495 ± 0.021
        in-plane rotation accuracy  0.566 ± 0.021          0.443 ± 0.043

4 The model and training hyper-parameters, such as the number of units of
the hidden layer, the batch size, the covariance of the normal distribution
for generating the bin probabilities, or the number of VGG-16 convolutional
layers to fine-tune, were determined by additional hyper-parameter
optimization analysis.
5 We followed DeepLearningSandbox/DeepLearningSandbox/blob/master/transfer learning/
to implement two-stage training with Keras [26], accessed on: 28.08.2017.
6 A random-guess predictor with this metric under a tolerance angle of 5
degrees would have a 1 in 18*36*36, or about 0.004%, chance to be correct.
An important property of our viewpoint estimator is the
ability to handle objects that look similar from different
angles. An example is presented in Figure 7, where the
probability distribution of the output layer encodes two
distinct viewpoints of the object that look identical. This
shows that the model can handle symmetrical objects and
that information about the object’s appearance and symmetries
can be inferred from the activations of the output layer.
Fig. 7. Prediction of the model for a symmetrical object. Left: Image
of object 05 from the T-LESS dataset. Right: The probability distribution
at the output layer for the given input: lighter color indicates higher
probability. The viewpoint (elevation-azimuth) distribution has two maxima:
one corresponds to the actual ground truth viewpoint, and one corresponds
to the viewpoint under which the object looks similar.
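One way to read such ambiguous viewpoints from the output layer is to look
for local maxima of the elevation-azimuth distribution. The sketch below is
only an illustration: it assumes the distribution is available as a list of
rows (elevation bins) and columns (azimuth bins, treated as periodic), and
the names local_maxima and bin_center are hypothetical. The text does not
describe how, or whether, peaks were extracted in the experiments.

```python
def local_maxima(grid, min_prob=0.0):
    """Return (row, col, prob) for cells strictly greater than all neighbours.

    Rows index elevation bins (no wrap-around); columns index azimuth bins
    and wrap around, since azimuth is periodic. Cells below `min_prob` are
    skipped. Results are sorted by decreasing probability.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    peaks = []
    for r in range(n_rows):
        for c in range(n_cols):
            value = grid[r][c]
            if value < min_prob:
                continue
            neighbours = [
                grid[rr][(c + dc) % n_cols]
                for rr in (r - 1, r, r + 1) if 0 <= rr < n_rows
                for dc in (-1, 0, 1)
                if not (rr == r and dc == 0)
            ]
            if all(value > n for n in neighbours):
                peaks.append((r, c, value))
    return sorted(peaks, key=lambda p: p[2], reverse=True)


def bin_center(index, bin_size_deg=10.0, start_deg=0.0):
    """Center angle of a bin; `start_deg` is the lower edge of bin 0."""
    return start_deg + (index + 0.5) * bin_size_deg


if __name__ == "__main__":
    grid = [[0.0] * 36 for _ in range(18)]
    grid[9][4], grid[9][5], grid[9][22] = 0.5, 0.1, 0.4
    for row, col, prob in local_maxima(grid, min_prob=0.2):
        print(bin_center(row), bin_center(col), prob)
```

For a symmetrical object, two peaks of comparable height would indicate two
viewpoints that the model cannot tell apart from appearance alone.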
C. Combined object detection and viewpoint estimation
There are two additional sources of error when predicted
bounding boxes are used for the viewpoint estimation instead
of the ideal ones. First, the object might not be detected
at all by the bounding-box detector. Second, unlike an ideal
bounding box, a predicted bounding box rarely matches the
borders of the object in an image exactly, which complicates
the prediction of the correct viewpoint.
To evaluate the pose estimation model as a whole, we
use the True Positive (TP) bounding boxes predicted by the
detector trained with “low-realism” synthetic data from the
experiment outlined in Section IV-A as an input to the trained
object-specific viewpoint estimators from Section IV-B.
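The selection of True Positive boxes can be sketched as follows. This is a
simplified illustration, assuming boxes in (x_min, y_min, x_max, y_max)
format and an Intersection-over-Union (IoU) threshold of 0.5, the usual
PASCAL VOC convention [25]; the threshold actually used is defined in
Section IV-A and is not restated here. A full mAP evaluation additionally
matches class labels and allows each ground truth box to be matched only
once, which this sketch omits. The function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def true_positive_boxes(predicted, ground_truth, threshold=0.5):
    """Keep predicted boxes that overlap some ground truth box sufficiently.

    Assumption: a prediction counts as TP if its IoU with any ground truth
    box reaches `threshold`.
    """
    return [p for p in predicted
            if any(iou(p, g) >= threshold for g in ground_truth)]


if __name__ == "__main__":
    predicted = [(0, 0, 10, 10), (20, 20, 30, 30)]
    ground_truth = [(1, 1, 10, 10)]
    # Only the first prediction overlaps the ground truth box enough.
    print(true_positive_boxes(predicted, ground_truth))
```

The boxes that pass this filter would then be cropped and passed to the
viewpoint estimator in place of the ideal bounding boxes.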
The comparative results presented in Figure 8 show that
if the bounding boxes predicted by the detector are used
instead of the ideal ones, the viewpoint estimator can still
correctly recover the 3D pose without a significant drop in
accuracy. Furthermore, from the gap in accuracy between real
and synthetic test images under different tolerance angles,
we can conclude that the benefit of effortless creation of
synthetic training data comes at the cost of decreased
accuracy and depends on the quality of the 3D models of the
objects. Still, the accuracy achieved in our experiments on
the challenging test images from T-LESS, which is about 70
percent for a tolerance of 25 degrees, indicates that the
proposed architecture and training procedure can be
successfully applied in real-world tasks. Sample results of
the estimated 3D pose, obtained by combining the predicted
bounding box and the estimated viewpoint, are shown in
Figure 9.
Fig. 8. Comparison of the accuracy in pose estimation on synthetic and
real test images, for predicted and ideal bounding boxes, under different
tolerance angles. The accuracy is calculated as the average over the three
objects used in the experiment.
D. Application of the pose estimation model in robotic tasks
To demonstrate that the proposed pose estimation approach
can be extended to real-world tasks, we train the detector
and viewpoint estimator using 3D models of toy objects and
a three-fingered robotic hand (footnote 7). The results
show that the model can successfully estimate the 3D pose of
the objects without strict control of the environment in
which it is applied and without the need for images of the
real objects. The 3D models used for training and sample
results are shown in Figure 1, while further results are
available in the accompanying video attachment.
V. CONCLUSION
We present and evaluate a novel approach for instance-based
pose estimation using CNNs, suitable for determining the 3D
pose of objects in robotic manipulation tasks in loosely
controlled environments. We use automatically generated
synthetic training data based on 3D models to satisfy the
high training data requirement of CNN approaches. Results on
real-world images indicate that photo-realistic renderings
are not necessary for good performance. We hypothesize that
a key to this result is the use of a deep convolutional
architecture that is pre-trained on real images but then
fine-tuned only in the top layers.
In future work, we will investigate which factors are
crucial to further improve the generalization ability of the
model such that the performance on real images comes closer
to the one achieved on synthetic data. Since the activations
of the output layer of the model can be used to infer
shape-relevant information about the object, we also plan to
apply the model as a low-level visual processor whose output
aids visually guided high-level tasks like shape-specific
robotic grasping.
7 The hand is the RH4D from
Fig. 9. Test images from the T-LESS dataset showing the estimated object
pose of the three objects, obtained by combining the bounding box detection
and the estimation of the viewpoint of the object within the bounding box.
The 3D poses are visualized by drawing the contours of the 3D models.
REFERENCES
[1] D. G. Lowe, “Object recognition from local scale-invariant features,”
Proceedings of the Seventh IEEE International Conference on Computer
Vision, vol. 2, pp. 1150–1157, 1999.
[2] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient
alternative to sift or surf,” in Computer Vision (ICCV), 2011 IEEE
international conference on. IEEE, 2011, pp. 2564–2571.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classifi-
cation with Deep Convolutional Neural Networks,” in Advances in
Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges,
L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012,
pp. 1097–1105.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 580–587.
[5] M. Kerzel, E. Strahl, S. Magg, N. Navarro-Guerrero, S. Heinrich,
and S. Wermter, “Nico - neuro-inspired companion: A developmental
humanoid robot platform for multimodal interaction,” in IEEE Inter-
national Symposium on Robot and Human Interactive Communication
(RO-MAN), Aug 2017, pp. 113–120.
[6] Jia Deng, Wei Dong, R. Socher, Li-Jia Li, Kai Li, and Li Fei-Fei,
“ImageNet: A large-scale hierarchical image database,” 2009 IEEE
Conference on Computer Vision and Pattern Recognition, pp. 248–
255, 2009.
[7] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards
Real-Time Object Detection with Region Proposal Networks,” NIPS,
pp. 1–10, 2015.
[8] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look
Once: Unified, Real-Time Object Detection,” CVPR 2016, pp. 779–788.
[9] P. Wohlhart and V. Lepetit, “Learning descriptors for object recogni-
tion and 3D pose estimation,” 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), no. 1, pp. 3109–3118, 2015.
[10] H. Su, C. R. Qi, Y. Li, and L. J. Guibas, “Render for CNN:
Viewpoint estimation in images using CNNs trained with rendered
3D model views,” Proceedings of the IEEE International Conference
on Computer Vision, vol. 11-18-Dece, pp. 2686–2694, 2016.
[11] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel,
“Domain randomization for transferring deep neural networks from
simulation to the real world,” in Intelligent Robots and Systems (IROS),
2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 23–30.
[12] B. Planche, Z. Wu, K. Ma, S. Sun, S. Kluckner, T. Chen, A. Hutter,
S. Zakharov, H. Kosch, and J. Ernst, “Depthsynth: Real-time realistic
synthetic data generation from cad models for 2.5 d recognition,” arXiv
preprint arXiv:1702.08558, 2017.
[13] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “Posecnn: A
convolutional neural network for 6d object pose estimation in cluttered
scenes,” arXiv preprint arXiv:1711.00199, 2017.
[14] M. Rad and V. Lepetit, “Bb8: A scalable, accurate, robust to partial
occlusion method for predicting the 3d poses of challenging objects
without using depth,” in International Conference on Computer Vision,
vol. 1, no. 4, 2017, p. 5.
[15] X. S. Gao, X. R. Hou, J. Tang, and H. F. Cheng, “Complete
solution classification for the perspective-three-point problem,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 25,
no. 8, pp. 930–943, 2003.
[16] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi,
I. Fischer, Z. Wojna, Y. Song, S. Guadarrama et al., “Speed/accuracy
trade-offs for modern convolutional object detectors,” in IEEE CVPR,
[17] “Blender - a 3D modelling and rendering package,” Blender Institute,
Amsterdam. [Online]. Available:
[18] M. Schwarz, H. Schulz, and S. Behnke, “RGB-D Object Recogni-
tion and Pose Estimation based on Pre-trained Convolutional Neural
Network Features,” IEEE International Conference on Robotics and
Automation (ICRA’15), no. May, pp. 1329–1335, 2015.
[19] P. Fischer, A. Dosovitskiy, and T. Brox, “Image orientation estimation
with convolutional networks,” Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intelligence and Lecture
Notes in Bioinformatics), vol. 9358, pp. 368–378, 2015.
[20] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks
for Large-Scale Image Recognition,” arXiv preprint, pp. 1–10, 2014.
[21] T. Hodan, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and
X. Zabulis, “T-less: An rgb-d dataset for 6d pose estimation of
texture-less objects,” in Applications of Computer Vision (WACV), 2017
IEEE Winter Conference on. IEEE, 2017, pp. 880–888.
[22] E. Valenza, Blender 2.6 Cycles: Materials and Textures Cookbook.
Packt Publishing Ltd, 2013.
[23] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects
in context,” Lecture Notes in Computer Science (including subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), vol. 8693 LNCS, no. PART 5, pp. 740–755, 2014.
[24] “Tensorflow Object Detection API,” 2017. [Online]. Available: detection
[25] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and
A. Zisserman, “The pascal visual object classes (VOC) challenge,”
International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[26] F. Chollet, “Keras,” 2015. [Online]. Available:
[27] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient
by a running average of its recent magnitude.” pp. 26–31, 2012.
[28] Y. Xiang, R. Mottaghi, and S. Savarese, “Beyond PASCAL: A
benchmark for 3D object detection in the wild,” 2014 IEEE Winter
Conference on Applications of Computer Vision, WACV 2014, pp. 75–
82, 2014.
... Simulator-based methods: Synthetic datasets have been shown to improve performances across a variety of tasks including, but not limited to, object detection [33,26], trajectory prediction [51,7], depth estimation [3,31], semantic and instance segmentation [5,49], human pose estimation [45], object 6DoF pose estimation [26], 3D reconstruction [9], tracking and optical flow [50]. CARLA [19] is a popular simulator that relies on manually designed environment maps and places 3D object assets for vehicles, pedestrians, and dynamic entities in the environment. ...
... Simulator-based methods: Synthetic datasets have been shown to improve performances across a variety of tasks including, but not limited to, object detection [33,26], trajectory prediction [51,7], depth estimation [3,31], semantic and instance segmentation [5,49], human pose estimation [45], object 6DoF pose estimation [26], 3D reconstruction [9], tracking and optical flow [50]. CARLA [19] is a popular simulator that relies on manually designed environment maps and places 3D object assets for vehicles, pedestrians, and dynamic entities in the environment. ...
Full-text available
High-quality structured data with rich annotations are critical components in intelligent vehicle systems dealing with road scenes. However, data curation and annotation require intensive investments and yield low-diversity scenarios. The recently growing interest in synthetic data raises questions about the scope of improvement in such systems and the amount of manual work still required to produce high volumes and variations of simulated data. This work proposes a synthetic data generation pipeline that utilizes existing datasets, like nuScenes, to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation, mimicking real scene properties with high-fidelity, along with mechanisms to diversify samples in a physically meaningful way. We demonstrate improvements in mIoU metrics by presenting qualitative and quantitative experiments with real and synthetic data for semantic segmentation on the Cityscapes and KITTI-STEP datasets. All relevant code and data is released on github (
... These methods can achieve very high levels of accuracy but need a lot of data to accurately train the network to work well in real-world cases. One particular approach based on DL relies on a synthetic dataset to train a CNN to predict the object's rotation, such as in [28,29]. After the literature review, the approach proposed in these two papers seemed the most suitable for our system, due to its flexibility and extendibility to new objects or procedures. ...
Full-text available
The current study aimed to propose a Deep Learning (DL) based framework to retrieve in real-time the position and the rotation of an object in need of maintenance from live video frames only. For testing the positioning performances, we focused on intervention on a generic Fused Deposition Modeling (FDM) 3D printer maintenance. Lastly, to demonstrate a possible Augmented Reality (AR) application that can be built on top of this, we discussed a specific case study using a Prusa i3 MKS FDM printer. This method was developed using a You Only Look Once (YOLOv3) network for object detection to locate the position of the FDM 3D printer and a subsequent Rotation Convolutional Neural Network (RotationCNN), trained on a dataset of artificial images, to predict the rotations’ parameters for attaching the 3D model. To train YOLOv3 we used an augmented dataset of 1653 real images, while to train the RotationCNN we utilized a dataset of 99.220 synthetic images, showing the FDM 3D Printer with different orientations, and fine-tuned it using 235 real images tagged manually. The YOLOv3 network obtained an AP (Average Precision) of 100% with Intersection Over Unit parameter of 0.5, while the RotationCNN showed a mean Geodesic Distance of 0.250 (σ = 0.210) and a mean accuracy to detect the correct rotation r of 0.619 (σ = 0.130), considering as acceptable the range [ r − 10, r + 10]. We then evaluate the CAD system performances with 10 non-expert users: the average speed improved from 9.61 (σ = 1.53) to 5.30 (σ = 1.30) and the average number of actions to complete the task from 12.60 (σ = 2.15) to 11.00 (σ = 0.89). This work is a further step through the adoption of DL and AR in the assistance domain. In future works, we will overcome the limitations of this approach and develop a complete mobile CAD system that could be extended to any object that presents a 3D counterpart model.
... [26][27][28] Deep learning GANs were presented for synthetic medical dataset generation, and for medical image classification, the performance of CNN was improved by adding synthetic datasets. 28 In addition, physical engines such as Unreal and Unity, 3D creation suites such as Blender, and simulators built on engines such as AirSim are also utilized to generate photorealistic synthetic data to solve several computer vision problems, including classification, 29 object detection, 30 pose estimation, 31 and semantic segmentation problems. [32][33][34] In addition, a simple synthetic dataset generation method that involves cutting object masks and placing them into real images is proposed, reducing the reliance on graphic renderings and simplifying its implementation without photorealistic rendering or complex scene composition. ...
Although deep-learning-based approaches have demonstrated impressive performance in object detection tasks, the requirement for large datasets of annotated training images limits the feasibility of deep neural networks. For example, obtaining a large number of crack images of a dam is unlikely, particularly in the absence of open-source datasets. To address this problem, the authors have developed three synthetic data generators based on virtual scene simulation and image processing for generating large amounts of labeled dam surface crack data. These synthetic data combined with public-available images of cracks on pavement and concrete are further used to train a state-of-the-art object detection neural network, resulting in a 29.2% improvement in the overall crack detection mean average precision (mAP) compared to using only images of cracks on pavement and concrete. Furthermore, given the necessity for further analysis of some critical cracks, an image-processing-based approach for segmenting the crack in each detected bounding box and estimating its length and thickness is provided.
Full-text available
The 6D pose estimation of an object from an image is a central problem in many domains of Computer Vision (CV) and researchers have struggled with this issue for several years. Traditional pose estimation methods (1) leveraged on geometrical approaches, exploiting manually annotated local features, or (2) relied on 2D object representations from different points of view and their comparisons with the original image. The two methods mentioned above are also known as Feature-based and Template-based, respectively. With the diffusion of Deep Learning (DL), new Learning-based strategies have been introduced to achieve the 6D pose estimation, improving traditional methods by involving Convolutional Neural Networks (CNN). This review analyzed techniques belonging to different research fields and classified them into three main categories: Template-based methods, Feature-based methods, and Learning-Based methods. In recent years, the research mainly focused on Learning-based methods, which allow the training of a neural network tailored for a specific task. For this reason, most of the analyzed methods belong to this category, and they have been in turn classified into three sub-categories: Bounding box prediction and Perspective-n-Point (PnP) algorithm-based methods, Classification-based methods, and Regression-based methods. This review aims to provide a general overview of the latest 6D pose recovery methods to underline the pros and cons and highlight the best-performing techniques for each group. The main goal is to supply the readers with helpful guidelines for the implementation of performing applications even under challenging circumstances such as auto-occlusions, symmetries, occlusions between multiple objects, and bad lighting conditions.
Full-text available
The expansion of renewable energies is being driven by the gradual phaseout of fossil fuels in order to reduce greenhouse gas emissions, the steadily increasing demand for energy and, more recently, by geopolitical events. The offshore wind energy sector is on the verge of a massive expansion in Europe, the United Kingdom, China, but also in the USA, South Korea and Vietnam. Accordingly, the largest marine infrastructure projects to date will be carried out in the upcoming decades, with thousands of offshore wind turbines being installed. In order to accompany this process globally and to provide a database for research, development and monitoring, this dissertation presents a deep learning-based approach for object detection that enables the derivation of spatiotemporal developments of offshore wind energy infrastructures from satellite-based radar data of the Sentinel-1 mission. For training the deep learning models for offshore wind energy infrastructure detection, an approach is presented that makes it possible to synthetically generate remote sensing data and the necessary annotation for the supervised deep learning process. In this synthetic data generation process, expert knowledge about image content and sensor acquisition techniques is made machine-readable. Finally, extensive and highly variable training data sets are generated from this knowledge representation, with which deep learning models can learn to detect objects in real-world satellite data. The method for the synthetic generation of training data based on expert knowledge offers great potential for deep learning in Earth observation. Applications of deep learning based methods can be developed and tested faster with this procedure. Furthermore, the synthetically generated and thus controllable training data offer the possibility to interpret the learning process of the optimised deep learning models. 
The method developed in this dissertation to create synthetic remote sensing training data was finally used to optimise deep learning models for the global detection of offshore wind energy infrastructure. For this purpose, images of the entire global coastline from ESA's Sentinel-1 radar mission were evaluated. The derived data set includes over 9,941 objects, which distinguish offshore wind turbines, transformer stations and offshore wind energy infrastructures under construction from each other. In addition to this spatial detection, a quarterly time series from July 2016 to June 2021 was derived for all objects. This time series reveals the start of construction, the construction phase and the time of completion with subsequent operation for each object. The derived offshore wind energy infrastructure data set provides the basis for an analysis of the development of the offshore wind energy sector from July 2016 to June 2021. For this analysis, further attributes of the detected offshore wind turbines were derived. The most important of these are the height and installed capacity of a turbine. The turbine height was calculated by a radargrammetric analysis of the previously detected Sentinel-1 signal and then used to statistically model the installed capacity. The results show that in June 2021, 8,885 offshore wind turbines with a total capacity of 40.6~GW were installed worldwide. The largest installed capacities are in the EU (15.2~GW), China (14.1~GW) and the United Kingdom (10.7~GW). From July 2016 to June 2021, China has expanded 13~GW of offshore wind energy infrastructure. The EU has installed 8~GW and the UK 5.8~GW of offshore wind energy infrastructure in the same period. This temporal analysis shows that China was the main driver of the expansion of the offshore wind energy sector in the period under investigation. The derived data set for the description of the offshore wind energy sector was made publicly available. 
It is thus freely accessible to all decision-makers and stakeholders involved in the development of offshore wind energy projects. Especially in the scientific context, it serves as a database that enables a wide range of investigations. Research questions regarding offshore wind turbines themselves as well as the influence of the expansion in the coming decades can be investigated. This supports the imminent and urgently needed expansion of offshore wind energy in order to promote sustainable expansion in addition to the expansion targets that have been set.
High-quality structured data with rich annotations are critical components in intelligent vehicle systems dealing with road scenes. However, data curation and annotation require intensive investments and yield low-diversity scenarios. The recently growing interest in synthetic data raises questions about the scope of improvement in such systems and the amount of manual work still required to produce high volumes and variations of simulated data. This work proposes a synthetic data generation pipeline that utilizes existing datasets, like nuScenes, to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation, mimicking real scene properties with high-fidelity, along with mechanisms to diversify samples in a physically meaningful way. We demonstrate improvements in mIoU metrics by presenting qualitative and quantitative experiments with real and synthetic data for semantic segmentation on the Cityscapes and KITTI-STEP datasets. All relevant code and data is released on github\(^{3}\) ( KeywordsSynthetic dataRoad scenesSelf-drivingSemantic segmentation
6D pose estimation from a single RGB image is a fundamental task in computer vision. We introduce a two-stage 6D pose estimation method for texture-less objects. Instead of utilizing the object mask in almost current monocular methods, we propose an edge representation for texture-less objects. An object is represented by the combination of visible edges corresponding to the object’s 6D pose, allowing the neural network to focus more on the object’s invariant global shape and structure, rather than indistinguishable local patches with image noise and similar texture. Given an RGB image, the proposed method predicts the direction and distance to a certain object keypoint from all object pixels within the range of object edge representation, establishes voting-based sparse 2D-3D correspondences, and solves the object pose with PnP algorithm. In the experiments, we found that directly replacing the object mask with the edge representation can bring a performance improvement in two current two-stage pipelines without any modification. Further evaluations on three different benchmark datasets containing symmetric and occluded objects show our method outperforms the state-of-the-art methods using RGB images only.
Conference Paper
Full-text available
Interdisciplinary research, drawing from robotics, artificial intelligence, neuroscience, psychology, and cognitive science, is a cornerstone to advance the state-of-the-art in multimodal human-robot interaction and neuro-cognitive mod-eling. Research on neuro-cognitive models benefits from the embodiment of these models into physical, humanoid agents that possess complex, human-like sensorimotor capabilities for multimodal interaction with the real world. For this purpose, we develop and introduce NICO (Neuro-Inspired COmpanion), a humanoid developmental robot that fills a gap between necessary sensing and interaction capabilities and flexible design. This combination makes it a novel neuro-cognitive research platform for embodied sensorimotor computational and cognitive models in the context of multimodal interaction as shown in our results.
Estimating the 6D pose of known objects is important for robots to interact with objects in the real world. The problem is challenging due to the variety of objects as well as the complexity of the scene caused by clutter and occlusion between objects. In this work, we introduce a new Convolutional Neural Network (CNN) for 6D object pose estimation named PoseCNN. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. PoseCNN is able to handle symmetric objects and is also robust to occlusion between objects. In addition, we contribute a large scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN provides very good estimates using only color as input.
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Technical Report
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.