
# YOEO - You Only Encode Once: A CNN for Embedded Object Detection and Semantic Segmentation

## Abstract and Figures

Fast and accurate visual perception utilizing a robot's limited hardware resources is necessary for many mobile robot applications. We are presenting YOEO, a novel hybrid CNN which unifies previous object detection and semantic segmentation approaches using one shared encoder backbone to increase performance and accuracy. We show that it outperforms previous approaches on the TORSO-21 and Cityscapes datasets.
Florian Vahl¹, Jan Gutsche¹, Marc Bestmann¹, and Jianwei Zhang¹
Index Terms: Computer vision, Robotics, Machine learning, Object recognition, Segmentation
I. INTRODUCTION
Computer vision is an important tool in the field of robotics, as it produces highly detailed observations of the robot’s environment. Especially in the context of the RoboCup Humanoid Soccer League, where the autonomous robots are restricted to sensors equivalent to human senses [1], the detection of objects and features not only needs to be robust and accurate but also has to run in near real-time on the robot itself. Furthermore, the robots need to make accurate detections even under changing natural light conditions in a dynamic environment.
Many things, such as the ball, other robots, or the goalposts, can sufficiently be described by bounding boxes. These include a localization on the image with a rough shape and a specification of the class. But such a representation yields problems for stuff classes like the floor or the field lines, where a pixel-wise semantic segmentation describing the shape is more important than the instance abstraction of the bounding box. In many other domains that involve mobile robots, for example automated delivery systems, a similar need to perform these two types of detection in near real-time using limited hardware resources can be expected. CNN-based approaches such as YOLOv4 [2] for object detection and U-NET [3] for semantic segmentation have dominated the field of computer vision in recent years. Using separate approaches for both the object detection and the semantic segmentation is inefficient, because both networks feature their own encoder and similar features are encoded twice.
This research was partially funded by the German Research Foundation (DFG) and the National Science Foundation of China (NSFC) in project Crossmodal Learning, TRR-169.
¹ All the authors are with the Department of Informatics, University of Hamburg, 22527 Hamburg, Germany, [florian.vahl, jan.gutsche, marc.bestmann, jianwei.zhang]@uni-hamburg.de
Therefore, it is reasonable to share one encoder backbone between multiple decoders generating outputs for different purposes. Additionally, this helps with generalization because the features are needed in different contexts. With this in mind, we designed a novel neural network architecture, YOEO (You Only Encode Once, pronounced [ˈjojo]), that provides both bounding box instance detections and pixel-precise segmentations while minimizing computational overhead by using a shared encoder backbone. The model was optimized using the OpenVINO™ toolkit and deployed on an Intel® Neural Compute Stick 2 (NCS2) vision processing unit (VPU) used in our Wolfgang-OP robot [4].
II. RELATED WORK
Starting with AlexNet [5] in 2012, approaches using CNNs in the context of computer vision grew in popularity. Outside of simple classification tasks, CNN architectures like YOLO [6] and SSD [7] gained in influence. Both predict bounding boxes for the detected objects. This allows for easy differentiation and classification of multiple objects at the same time.
The field of semantic segmentation is also dominated by CNNs nowadays. Networks like FCN [8] or U-NET [3] use this approach to generate promising results for pixel-precise classifications.
Using a shared encoder backbone for both object detection and semantic segmentation was proposed by Teichmann et al. in 2018 [9]. Their approach features bounding box object detection, image classification, and semantic segmentation. They mainly focus on the field of autonomous driving, generating predictions for vehicle detection, street scenery classification, and road segmentation at the same time. The findings are promising, but the overall model size is too large for real-time inference on an embedded device and the image classification part is not needed for our use case. Furthermore, the object detection decoder architecture is quite different from the YOLO object detection, which is state of the art in the RoboCup humanoid soccer domain.
A combined approach that is based on an older version of the YOLO architecture was developed by the automotive supplier Valeo and is described in Real-time Joint Object Detection and Semantic Segmentation Network for Automated Driving [10]. It uses a combination of a ResNet10-like encoder with a YOLOv2- and FCN8-like decoder, but the paper gives few implementation and deployment details.
The field of panoptic segmentation features models like Panoptic FPN [11] that predict both instance and semantic segmentation. This allows for pixel-precise instance segmentation by removing the abstraction of the bounding box. While this seems very promising, it also requires an expensive pixel-precise annotation for all classes, including ones where the abstraction of a bounding box is sufficient. A pixel-precise prediction for these classes would also waste resources during inference.
III. APPROACHES
We trained and evaluated multiple architectures to construct a model that fits our needs. All of the presented architectures feature two YOLO detection heads at different resolutions and a U-NET-like decoder which share a common encoder backbone. They are also fully convolutional, meaning they do not use any fully connected layers and are easily scalable to different resolutions. The encoder and YOLO heads are taken from the YOLOv3 or YOLOv4 architectures. The decoder consists of a U-NET-like topology, meaning feature maps of the encoder are upsampled multiple times using nearest-neighbor interpolation. Each time, the upsampled feature map is concatenated with the corresponding feature map in the encoder branch using a skip connection and a convolutional layer. This layer is also used to reduce the number of channels and convert the features to a more spatial topology.
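The decoder step described above (nearest-neighbor upsampling, concatenation with the encoder feature map of matching resolution, then a convolution that reduces the channel count) can be sketched in plain Python. This is only an illustration of the data flow, not the reference implementation: the function names are hypothetical, feature maps are nested lists indexed as [channel][row][column], and the learned 3x3 convolution is replaced by a fixed 1x1 channel mix as a simplifying assumption.

```python
def upsample_nearest(fmap, factor=2):
    """Nearest-neighbor upsampling of a [channel][row][col] feature map."""
    return [
        [[row[x // factor] for x in range(len(row) * factor)]
         for row in channel for _ in range(factor)]
        for channel in fmap
    ]


def concat_channels(a, b):
    """Concatenate two feature maps of equal spatial size along the channel axis."""
    assert len(a[0]) == len(b[0]) and len(a[0][0]) == len(b[0][0])
    return a + b


def conv1x1(fmap, weights):
    """Mix channels per pixel; weights[o][i] maps input channel i to output channel o.

    Assumption: stands in for the learned convolution that reduces the channels.
    """
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(w_out[i] * fmap[i][y][x] for i in range(len(fmap)))
          for x in range(width)]
         for y in range(height)]
        for w_out in weights
    ]


def decoder_step(deep, skip, weights):
    """Upsample the deeper map, attach the skip connection, reduce the channels."""
    return conv1x1(concat_channels(upsample_nearest(deep), skip), weights)


deep = [[[1, 2], [3, 4]]]               # 1 channel, 2x2 (deeper encoder level)
skip = [[[0] * 4 for _ in range(4)]]    # 1 channel, 4x4 (skip connection)
out = decoder_step(deep, skip, weights=[[1.0, 1.0]])  # 2 channels in, 1 out
```

Here a 2x2 map is upsampled to 4x4, joined with a 4x4 skip map, and reduced from two channels to one. Repeating the step with progressively higher-resolution skip maps gives the U-NET-like decoder path.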
We used the TORSO-21 Dataset [12] to train and evaluate different variants of the network. The dataset is used because it resembles the intended deployment domain of the architecture. It features many images with bounding boxes as well as segmentation classes. The images were downscaled to an input resolution of 416 by 416 pixels, which is also the output resolution of the segmentation head. Data augmentation in the form of random changes to the sharpness, brightness, and hue of the image as well as random vertical mirroring, translation (max 10% of the image size), and scaling (80% to 150%) has been applied.
The network parameters were optimized using the ADAM optimizer [13]. The loss function consists of the sum of a YOLOv4 loss for object detection and a cross-entropy loss for semantic segmentation. Both are weighted equally.
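The equally weighted sum of both losses can be illustrated with a small sketch. It assumes the YOLO detection loss has already been computed elsewhere and is passed in as a plain number. The function names and the nested-list layout of the logits are hypothetical and not taken from the reference implementation.

```python
import math


def pixel_cross_entropy(logits, target):
    """Mean cross-entropy over all pixels.

    logits[y][x] is a list of per-class scores, target[y][x] the true class index.
    """
    total, count = 0.0, 0
    for row_logits, row_target in zip(logits, target):
        for scores, cls in zip(row_logits, row_target):
            peak = max(scores)  # subtract the maximum for numerical stability
            log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
            total += log_sum - scores[cls]
            count += 1
    return total / count


def combined_loss(detection_loss, logits, target, w_det=1.0, w_seg=1.0):
    """Sum of detection and segmentation loss; both weights default to one."""
    return w_det * detection_loss + w_seg * pixel_cross_entropy(logits, target)


# Two pixels, three classes, uninformative scores: cross-entropy equals ln(3).
logits = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
target = [[0, 2]]
loss = combined_loss(detection_loss=0.5, logits=logits, target=target)
```

With equal weights the segmentation term simply adds to the detection term, so neither decoder is favored during training of the shared encoder.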
Our open source reference implementation¹ is a fork of PyTorch-YOLOv3², which served as the starting point of our software architecture.
While trying to find the optimal architecture for our use case, we investigated the following architecture layouts:
- A YOLOv3-tiny for the encoder and bounding box prediction. A U-NET decoder is connected with all four feature maps down to a resolution of 52x52 and includes three convolutional blocks (3x3 convolution, batch normalization, and ReLU) before the output. The model served as a baseline for further experiments. (YOEO-rev-0)
- A YOLOv4-tiny for the encoder and bounding box prediction. YOLOv4-tiny uses a larger stride instead of maxpooling in the first layers, so the output of the first convolution is already downsampled by a factor of two. We therefore tested whether we should
  - add a convolution with a stride of one in front of it and start the native resolution skip connection there (YOEO-rev-1), or
  - use the downsampled feature map as the first skip connection (YOEO-rev-2).
- A segmentation decoder head with one or two convolutions after the last upsampling. (YOEO-rev-3 and YOEO-rev-4, respectively)
- A segmentation decoder head with a 1x1 convolution instead of the 3x3 after the last upsampling. (YOEO-rev-5)
- A segmentation decoder which starts with the deepest feature map in the encoder. (YOEO-rev-6)
- A variant of YOEO-rev-2 which starts with the deepest feature map in the encoder. (YOEO-rev-7)
- A segmentation decoder head with residual skip connections after the last upsampling. (YOEO-rev-8)
- A segmentation decoder that starts with the deepest feature map in the encoder and has only one high-resolution skip connection to gain a performance benefit. (YOEO-rev-9)
¹ github.com/bit-bots/YOEO
² github.com/eriklindernoren/PyTorch-YOLOv3
The complete model architectures are also provided in the repository of the reference implementation¹.
In the end, we settled on a YOLOv4-tiny with a segmentation decoder that uses encoder feature maps from all resolutions except the native resolution (YOEO-rev-7). This architecture is shown in figure 1. In the selection, we considered the object detection and segmentation performance as well as the overall runtime of the network. The detailed evaluation of the different architectures can be seen in section IV.
This final architecture was converted to the ONNX format. The model was then optimized by the model optimizer of the OpenVINO toolkit and converted to the OpenVINO toolkit’s intermediate representation format, which itself was used to deploy the network on an Intel NCS2 VPU on the Wolfgang-OP robot. This allows for efficient inference of the network, which is crucial in our power, weight, and performance-limited use case.
IV. EVALUATION
In the following, we compare the different revisions of our YOEO model as well as YOLOv4-tiny and U-NET regarding their precision and runtime on two different datasets and domains.
A. Metrics
To quantify the precision of the segmentation images, we have selected the common Jaccard index, also known as the Intersection over Union (IoU) metric. For the evaluation of the bounding box object detection, the mean average precision with a minimum of 50% IoU (mAP50) was used. The inference speed of the models was evaluated using the frames per second (FPS) measure, running either on an NVIDIA® GeForce® RTX 2080 Ti or with model optimizations on the Intel NCS2 VPU ready for deployment.
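Both accuracy metrics build on the same intersection-over-union idea, which the following sketch shows for a class mask and for a pair of bounding boxes. The function names and data layouts are hypothetical. The full mAP50 additionally averages precision over recall levels and classes, which is omitted here; only the 50% IoU matching criterion is shown.

```python
def mask_iou(pred, target, cls):
    """Jaccard index of one class between two label images (lists of rows)."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += (p == cls) and (t == cls)
            union += (p == cls) or (t == cls)
    return inter / union if union else float("nan")


def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A detection counts as correct for mAP50 if its IoU reaches 50%."""
    return box_iou(pred_box, gt_box) >= threshold


pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
print(mask_iou(pred, target, cls=1))               # 2 shared of 4 covered pixels
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))     # overlap 50, union 150
```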
B. Datasets
During our evaluation phase, we have used two datasets for training and evaluation. As it perfectly matches our use case, we have used the TORSO-21 Dataset [12] from the RoboCup soccer domain to train, test, and select from the various approaches mentioned in section III. Additionally, we have used the Cityscapes Dataset [14] to evaluate our final model architecture and to quantitatively compare it with other architectures.
Fig. 1. Layout of the final YOEO architecture (YOEO-rev-7). The shape of the output feature map is noted next to the layers. Maxpooling layers are marked in orange while nearest-neighbor upsampling is red. The convolutional layers (blue) also include batch normalization and a leaky ReLU activation function. The number of filters before the output layers (purple, pink) depends on the number of predicted classes; this figure shows the layout for the TORSO-21 dataset with three bounding box and three segmentation classes.
The TORSO-21 dataset consists of two image collections: one contains images and labels of a simulated soccer environment, the other is a diverse dataset of 8894 training and 1570 test images. The images were recorded by five different camera types from twelve locations containing six ball types. Classes included in this collection are ball, robot, and L-, T-, and X-intersections of the soccer field lines as bounding box labels, and goalpost labels as four-point polygons. Additionally, it provides segmentation masks for stuff classes like the field area and lines. We converted the goalpost polygons to bounding boxes as required by the YOLO architecture. We chose to not make use of the line intersection labels, as we want to use the semantic segmentation of the lines instead. As the images do not share a common resolution or aspect ratio, they get scaled to match 416 pixels in the largest axis and padded with black on the other axis. This results in a 416 by 416 pixel square image as described above.
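The preprocessing described above can be sketched as follows: a four-point goalpost polygon is reduced to its enclosing bounding box, and the image is scaled so that its largest axis is 416 pixels before the shorter axis is padded. The function names are hypothetical. Whether the padding is centered or applied on one side is not stated in the text; centered padding is an assumption made for this example.

```python
def letterbox_params(width, height, target=416):
    """Scale factor, scaled size, and padding offsets for a square target image.

    Assumption: the black padding is split evenly on both sides.
    """
    scale = target / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = (target - new_w) // 2, (target - new_h) // 2
    return scale, new_w, new_h, pad_x, pad_y


def polygon_to_box(points):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a polygon."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def box_to_letterbox(box, scale, pad_x, pad_y):
    """Map a box from original image coordinates into the padded square image."""
    x0, y0, x1, y1 = box
    return (x0 * scale + pad_x, y0 * scale + pad_y,
            x1 * scale + pad_x, y1 * scale + pad_y)


# Example: a 640x480 image with one goalpost polygon.
scale, new_w, new_h, pad_x, pad_y = letterbox_params(640, 480)
goalpost = polygon_to_box([(100, 50), (120, 50), (125, 300), (95, 300)])
mapped = box_to_letterbox(goalpost, scale, pad_x, pad_y)
```

In this example the image is scaled to 416 by 312 pixels and padded by 52 pixels above and below, so the box labels have to be shifted by the same offset to stay aligned with the image.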
The Cityscapes dataset is commonly used as a benchmark for vision algorithms, especially in the autonomous driving domain. It contains several collections from stereo video recordings of street scenes from 50 different cities. We have used the gtFine collection consisting of 5,000 images with pixel-level instance and semantic segmentation. 30 classes are labeled that can be combined into eight groups. As our initial problem space does not require as many classes, we have decided to train and evaluate our model architecture only based on the groups for the semantic segmentation. For object detection, we used all classes that supported instance segmentations. As there are only eight of them, no grouping was applied. The instance segmentations were converted to bounding boxes which are compatible with the predictions of our YOLO decoder. For this conversion, we have utilized a tool³ by Till Beemelmanns. Instead of applying padding to the input images, we scaled the rather wide but consistent resolution of 1024 x 2048 pixels directly to a 416 by 416 pixel square.
³ github.com/TillBeemelmanns/cityscapes-to-coco-conversion
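To illustrate the idea behind this conversion, the sketch below derives a bounding box from a binary instance mask and then applies the direct, non-uniform resize to the square input. It is a stand-in written for this example and does not reproduce the cited conversion tool. The function names are hypothetical, and the 1024 x 2048 resolution is read as height by width.

```python
def mask_to_box(mask):
    """Bounding box (x_min, y_min, x_max, y_max) of a binary mask in pixel indices.

    Returns None for an empty mask.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def resize_box(box, src_w=2048, src_h=1024, dst=416):
    """Scale a box with independent factors per axis (no padding)."""
    scale_x, scale_y = dst / src_w, dst / src_h
    x0, y0, x1, y1 = box
    return x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y


instance = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(mask_to_box(instance))                    # (2, 1, 4, 2)
print(resize_box((512, 256, 1024, 512)))        # (104.0, 104.0, 208.0, 208.0)
```

The second call shows the distortion caused by skipping the padding: a box that is twice as wide as it is high in the original image becomes square in the network input.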
TABLE I
PERFORMANCE OF THE EVALUATED ARCHITECTURES ON THE TORSO-21 TEST DATASET.
| Architecture | mAP50 | IoU | FPS |
|---|---|---|---|
| YOEO-rev-0 | 78.87% | 82.62% | 175.45 |
| YOEO-rev-1 | 84.10% | 83.37% | 133.96 |
| YOEO-rev-2 | 83.87% | 83.49% | 138.76 |
| YOEO-rev-3 | 77.91% | 82.22% | 185.78 |
| YOEO-rev-4 | 77.97% | 82.21% | 180.64 |
| YOEO-rev-5 | 78.71% | 82.27% | 179.59 |
| YOEO-rev-6 | 78.95% | 85.28% | 154.36 |
| YOEO-rev-7 | 83.36% | 85.02% | 137.14 |
| YOEO-rev-8 | 78.62% | 82.53% | 173.24 |
| YOEO-rev-9 | 79.12% | 84.23% | 163.40 |
| YOLOv4-tiny | 83.56% | - | 174.67 |
| U-NET | - | 81.58% | 152.07 |
C. Architecture Comparison
The different proposed architecture layouts have been trained for up to 60 epochs on the TORSO-21 dataset. The results are presented in table I. The performance of the object detection is measured using the mAP50 metric, while the IoU is used for semantic segmentation. The frames per second (FPS) are measured on an NVIDIA® GeForce® RTX 2080 Ti and do not include the inference optimizations applied to the deployed model. Standard deviations for the FPS measurements are negligibly small. The reference U-NET model uses a full YOLOv4-tiny encoder with skip connections on every level (similar to the segmentation decoder of YOEO-rev-7).
It can be seen that the YOLOv4-tiny based models have a clear advantage over the YOLOv3-tiny ones when it comes to object detection performance. We therefore strongly prefer a YOLOv4-tiny network. It is also evident that starting the segmentation decoder at a deeper encoder level results in a higher IoU. The change from normal (YOEO-rev-0) to deeper (YOEO-rev-6) increases the IoU from 82.62% to 85.28%. While this change is relatively small, it is still significant for classes like the field, where the majority is easy to detect but the interesting edge cases are only solved by the deeper model. This includes cases where other intentionally unlabeled soccer fields are visible in the background, images with natural light (areas might be over- or underexposed), or cases where there are obstructions (people, robots, cables, laptops, ...) on the field that are part of the field class due to the simplified annotation method for field annotations described in the TORSO-21 paper [12]. Deeper encoder feature maps help in this regard because they are less spatially specific, which allows the decoder to focus on a larger context, while also having more non-linearities, which allow for an approximation of more complex data distributions. This is demonstrated in figure 2.
Fig. 2. This figure shows an exemplary difference between the segmentation of the YOEO-rev-0 (left) and YOEO-rev-6 (right) segmentation decoder. It shows how the deeper decoder of YOEO-rev-6 handles large occlusions of the field (green) better. It also correctly classifies the bright area lit by the sun in the upper left of the image as field instead of line (red).
The experiments also show that, for models with a full resolution skip connection, using two convolutional blocks instead of one after the last upsampling in the segmentation decoder head mainly lengthens the runtime without a notable accuracy benefit. Using a 1x1 convolution in the last layer has nearly no effect despite impacting the performance (see YOEO-rev-5).
The residual connections in the segmentation decoder made no notable difference in the accuracy while slightly slowing down the model during inference. The improved gradient flow resulted in faster training times, but it was not significant enough to justify the slower inference.
Using the encoder layout from YOEO-rev-2, which does not include a full-resolution skip connection, over the one from YOEO-rev-1 seems reasonable because the accuracy trade-off is negligible and the lower resolution skip connection results in a faster inference with a smaller memory footprint. It also enables the usage of pre-trained YOLOv4-tiny encoder weights that are available for many datasets.
The combined YOEO network outperforms the standalone versions of both the YOLOv4-tiny and U-NET even while sharing the encoder. This seems to be the case because the different domains of both decoders help the encoder to generalize better during training. The combined encoder also results in an inference speed benefit of 68.70% when we compare the combined runtime of the individual networks with the YOEO-rev-7 architecture as seen in table I.
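The 68.70% figure can be reproduced from the FPS values in table I under the assumption that the two standalone networks run one after the other, so their per-frame runtimes add up. The helper name below is hypothetical.

```python
def sequential_fps(*fps_values):
    """Frame rate when several networks process each frame one after another."""
    return 1.0 / sum(1.0 / fps for fps in fps_values)


yolo_fps, unet_fps, yoeo_fps = 174.67, 152.07, 137.14  # values from table I

baseline_fps = sequential_fps(yolo_fps, unet_fps)
speedup_percent = (yoeo_fps / baseline_fps - 1.0) * 100.0

print(f"sequential baseline: {baseline_fps:.2f} FPS")
print(f"YOEO-rev-7 speed benefit: {speedup_percent:.2f}%")
```

Running both standalone networks in sequence yields roughly 81.3 FPS, and the 137.14 FPS of YOEO-rev-7 are about 68.7% higher than that.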
Fig. 3. Visualization of the predictions (right) of the final model on a test image (left) from the TORSO-21 dataset. The prediction includes the successful detection of another robot (yellow), the goalposts (blue), and the ball (pink) while ignoring the robot itself and the humans on the field as intended. Further, an accurate segmentation of field (green) and lines (red) can be seen.
Therefore, it is reasonable to settle for a YOLOv4-tiny based architecture that utilizes the full encoder for the semantic segmentation without using a full resolution skip connection or residuals in the decoder (YOEO-rev-7). It is also the best out of the YOLOv4-tiny models when ranking them using the geometric mean of the evaluated attributes including the runtime.
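The ranking can be retraced with a short sketch. Two assumptions are made here that the text does not spell out: the YOLOv4-tiny based revisions are taken to be YOEO-rev-1, YOEO-rev-2, and YOEO-rev-7 (as read from section III), and the evaluated attributes are taken to be the three columns of table I (mAP50, IoU, FPS).

```python
import math

# (mAP50 in %, IoU in %, FPS) from table I; the selection of rows is an assumption.
results = {
    "YOEO-rev-1": (84.10, 83.37, 133.96),
    "YOEO-rev-2": (83.87, 83.49, 138.76),
    "YOEO-rev-7": (83.36, 85.02, 137.14),
}


def geometric_mean(values):
    """Geometric mean computed via logarithms to avoid large products."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


ranking = sorted(results, key=lambda name: geometric_mean(results[name]),
                 reverse=True)
for name in ranking:
    print(f"{name}: {geometric_mean(results[name]):.2f}")
```

Under these assumptions YOEO-rev-7 ranks first, narrowly ahead of YOEO-rev-2. Because rescaling any single attribute multiplies every model's geometric mean by the same factor, the order does not depend on the units of the attributes.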
This final model has been deployed as described in section III. The model runs at 6.7 FPS with less than 2 W of power usage on the embedded hardware of the robot.
D. Performance on Cityscapes
We also tested our architecture on the Cityscapes dataset. The results can be seen in table II. The dataset is slightly out of domain for this architecture and it can be seen that the YOLOv4-tiny model struggles with the high number of small objects in the dataset. This is typical for this family of object detectors [15]. With these limitations in mind, our approach still outperforms the similar YOLO-based approach from Valeo [10] in all tested categories on the dataset while being comparable in size and speed (see table II). A complete runtime comparison was not possible due to a lack of information regarding their implementation and deployment. An exemplary prediction is shown in figure 4.
Fig. 4. Predictions from YOEO on the Cityscapes dataset. Stuff segmentation classes: flat (red), construction (green), object (yellow), nature (blue), sky (orange), and void (transparent). Visible things: cars (brown), trucks (pink), and persons (blue).
TABLE II
PERFORMANCE OF THE FINAL ARCHITECTURE ON THE CITYSCAPES TEST DATASET.
| Metric | Class | YOEO-rev-7 | MTL Valeo [10] |
|---|---|---|---|
| AP | Person | 26.87% | 19.31% |
| AP | Rider | 29.97% | - |
| AP | Bicycle | 27.51% | 18.98% |
| AP | Car | 51.05% | 41.10% |
| AP | Bus | 29.88% | - |
| AP | Motorcycle | 3.29% | - |
| AP | Truck | 14.52% | - |
| AP | Train | 8.11% | - |
| IoU | Flat (e.g. Road) | 91.76% | 59.66% |
| IoU | Construction | 81.43% | 63.63% |
| IoU | Object (e.g. Poles) | 40.99% | - |
| IoU | Nature | 85.45% | 69.49% |
| IoU | Sky | 86.08% | 62.28% |
V. CONCLUSION & FUTURE WORK
With YOEO we present a hybrid CNN approach based on YOLOv4-tiny for object detection and U-NET for semantic segmentation utilizing only one encoder backbone. Our architecture outperforms both standalone versions in precision, indicating better generalization of the encoder, while simultaneously gaining a 68.70% speed benefit over running both standalone architectures sequentially. Many robotics computer vision tasks could benefit from this hybrid approach.
Training the YOEO architecture requires a diverse dataset containing stuff and things classes, which is more expensive to create than a dataset just containing bounding boxes. On the other hand, it is less expensive to label than a fully annotated panoptic segmentation dataset. It is also possible to convert datasets containing panoptic segmentations as shown with the Cityscapes dataset.
The YOEO architecture could be improved further by using automated parameter optimization. In addition, a classification decoder head could be appended to the shared encoder to provide classifications of e.g. weather, game states, or perturbations such as glare or staining of the camera. Combining other segmentation architectures such as BiSeNet V2 [16] could also improve the results.
ACKNOWLEDGMENT
Thanks to Erik Linder-Norén for the original PyTorch-YOLOv3 implementation. Also thanks to the members of the Hamburg Bit-Bots for helping to develop this architecture.
REFERENCES
[1] “RoboCup Soccer Humanoid League Laws of the Game,” RCHL-2020-Rules-Dec23.pdf (accessed November 21, 2020).
[2] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection,” preprint arXiv:2004.10934, 2020.
[3] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015.
[4] M. Bestmann, J. Güldenstein, F. Vahl, and J. Zhang, “Wolfgang-OP: A Robust Humanoid Robot Platform for Research and Competitions,” in IEEE Humanoids 2021, 07 2021.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Advances in Neural Information Processing Systems, vol. 25, 2012.
[6] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” preprint arXiv:1506.02640, 2015.
[7] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single Shot Multibox Detector,” in European Conference on Computer Vision. Springer, 2016.
[8] J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[9] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun, “Multinet: Real-Time Joint Semantic Reasoning for Autonomous Driving,” in Intelligent Vehicles Symposium (IV). IEEE, 2018.
[10] G. Sistu, I. Leang, and S. Yogamani, “Real-Time Joint Object Detection and Semantic Segmentation Network for Automated Driving,” preprint arXiv:1901.03912, 2019.
[11] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic Feature Pyramid Networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6399–6408.
[12] M. Bestmann, T. Engelke, N. Fiedler, J. Güldenstein, J. Gutsche, J. Hagge, and F. Vahl, “TORSO-21 Dataset: Typical Objects in RoboCup Soccer 2021,” in RoboCup 2021, 2021.
[13] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in 3rd International Conference on Learning Representations, ICLR, 2015.
[14] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes Dataset for Semantic Urban Scene Understanding,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[15] N.-D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, “An Evaluation of Deep Learning Methods for Small Object Detection,” Journal of Electrical and Computer Engineering, 2020.
[16] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, “BiSeNet V2: Bilateral Network With Guided Aggregation for Real-Time Semantic Segmentation,” preprint arXiv:2004.02147, 2020.
Intel and OpenVINO are trademarks of Intel Corporation or its subsidiaries. NVIDIA and GeForce are registered trademarks of NVIDIA Corporation. The authors are not affiliated with Intel Corporation or NVIDIA Corporation.