
YOEO - You Only Encode Once: A CNN for Embedded Object Detection and Semantic Segmentation


Abstract

Fast and accurate visual perception utilizing a robot's limited hardware resources is necessary for many mobile robot applications. We are presenting YOEO, a novel hybrid CNN which unifies previous object detection and semantic segmentation approaches using one shared encoder backbone to increase performance and accuracy. We show that it outperforms previous approaches on the TORSO-21 and Cityscapes datasets.
Florian Vahl¹ and Jan Gutsche¹ and Marc Bestmann¹ and Jianwei Zhang¹
Index Terms: Computer vision, Robotics, Machine
learning, Object recognition, Segmentation
Computer vision is an important tool in the field of
robotics, as it produces highly detailed observations
of the robot’s environment. Especially in the context
of the RoboCup Humanoid Soccer League, where
the autonomous robots are restricted to only use
sensors equivalent to human senses [1], the detection
of objects and features does not only need to be robust
and accurate but also be done in near real-time on the
robot itself. Furthermore, the robots need to be able to
make accurate detections even with changing natural
light conditions in a dynamic environment.
Many things, such as the ball, other robots, or the
goalposts, can sufficiently be described by bounding
boxes. These include a localization on the image with
a rough shape and a specification of the class. But
such a representation yields problems for stuff classes
like the floor or the field lines where a pixel-wise
semantic segmentation describing the shape is more
important than the instance abstraction of the bounding
box. In many other domains that involve mobile
robots, for example automated delivery systems, a
similar need to perform these two types of detection
in near real-time using limited hardware resources
can be expected. CNN-based approaches such as
YOLOv4 [2] for object detection and U-NET [3] for
semantic segmentation have dominated the field of
computer vision in recent years. Using separate
approaches for object detection and semantic
segmentation leads to computational overhead, since
both networks feature their own encoder and similar
features are encoded twice.
This research was partially funded by the German Research
Foundation (DFG) and the National Science Foundation of China
(NSFC) in project Crossmodal Learning, TRR-169.
¹ All the authors are with the Department of Informatics,
University of Hamburg, 22527 Hamburg, Germany
[florian.vahl, jan.gutsche, marc.bestmann,
Therefore, it is reasonable to share one encoder
backbone between multiple decoders generating out-
puts for different purposes. Additionally, this helps
with generalization because the features are needed
in different contexts. With this in mind, we designed
a novel neural network architecture, YOEO (You
Only Encode Once, pronounced [ˈjojo]), that provides
both bounding box instance detections and
pixel-precise segmentations while minimizing
computational overhead by using a shared encoder backbone.
The model was optimized using the OpenVINO™
toolkit and deployed on an Intel® Neural Compute
Stick 2 (NCS2) vision processing unit (VPU) used in
our Wolfgang-OP robot [4].
Starting with AlexNet [5] in 2012, approaches
using CNNs in the context of computer vision grew
in popularity. Outside of simple classification tasks,
CNN architectures like YOLO [6] and SSD [7] gained
in influence. Both predict bounding boxes for the
detected objects. This allows for easy differentiation
and classification of multiple objects at the same
time.
The field of semantic segmentation is also domi-
nated by CNNs nowadays. Networks like FCN [8] or
U-NET [3] use this approach to generate promising
results for pixel-precise classifications.
Using a shared encoder backbone for both ob-
ject detection and semantic segmentation has been
proposed by Teichmann et al. in 2018 [9]. Their
approach features bounding box object detection, im-
age classification, and semantic segmentation. They
mainly focus on the field of autonomous driving,
generating predictions for vehicle detection, street
scenery classification, and road segmentation at the
same time. The findings are promising, but the overall
model size is too large for a real-time inference on an
embedded device and the image classification part is
not needed for our use case. Furthermore, the object
detection decoder architecture is quite different from
the YOLO object detection which is state of the art
in the RoboCup humanoid soccer domain.
A combined approach that is based on an older
version of the YOLO architecture was developed
by the automotive supplier Valeo and is described
in Real-time Joint Object Detection and Semantic
Segmentation Network for Automated Driving [10]. It
uses a combination of a ResNet10-like encoder with
a YOLOv2- and FCN8-like decoder, but the paper provides
few implementation and deployment details.
The field of panoptic segmentation features models
like Panoptic FPN [11] that predict both instance and
semantic segmentation. This allows for pixel-precise
instance segmentation by removing the abstraction
of the bounding box. While this seems very promising,
it also requires an expensive pixel-precise annotation
for all classes, including ones where the abstraction
of a bounding box is sufficient. A pixel-precise
prediction for these classes would also waste resources
during inference.
We trained and evaluated multiple architectures to
construct a model that fits our needs. All of the
presented architectures feature two YOLO detection
heads at different resolutions and a U-NET-like de-
coder which share a common encoder backbone.
They are also fully convolutional, meaning they do
not use any fully connected layers and are easily scal-
able to different resolutions. The encoder and YOLO
heads are taken from the YOLOv3 or YOLOv4
architectures. The decoder consists of a U-NET-
like topology meaning feature maps of the encoder
are upsampled multiple times using nearest-neighbor
interpolation. Each time the upsampled feature map
is concatenated with the corresponding feature map
in the encoder branch using a skip connection and a
convolutional layer. This layer is also used to reduce
the number of channels and convert the features to a
more spatial topology.
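The decoder step described above (nearest-neighbor upsampling followed by concatenation with the matching encoder feature map) can be illustrated with a minimal pure-Python sketch. Feature maps are represented as nested lists in channel, row, column order. The names `upsample_nearest` and `concat_channels` are illustrative and are not taken from the reference implementation, and the convolution that follows the concatenation is omitted.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # [channel][row][column]


def upsample_nearest(fmap: FeatureMap, factor: int = 2) -> FeatureMap:
    """Repeat every pixel `factor` times along height and width."""
    upsampled = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide_row = [value for value in row for _ in range(factor)]
            # Copy the widened row so the repeated rows are independent lists.
            rows.extend(list(wide_row) for _ in range(factor))
        upsampled.append(rows)
    return upsampled


def concat_channels(decoder: FeatureMap, skip: FeatureMap) -> FeatureMap:
    """Stack two feature maps of equal spatial size along the channel axis."""
    if (len(decoder[0]), len(decoder[0][0])) != (len(skip[0]), len(skip[0][0])):
        raise ValueError("spatial sizes of decoder and skip feature maps differ")
    return decoder + skip


def shape(fmap: FeatureMap) -> tuple:
    """Return (channels, height, width) of a feature map."""
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


# One decoder feature map (1 channel, 2x2) and one encoder skip map (2 channels, 4x4).
deep = [[[1.0, 2.0], [3.0, 4.0]]]
skip = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]

up = upsample_nearest(deep)          # 1 channel, 4x4
merged = concat_channels(up, skip)   # 3 channels, 4x4
```

In a real network the concatenated tensor would then pass through the convolutional layer mentioned above, which reduces the number of channels again.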
We used the TORSO-21 Dataset [12] to train and
evaluate different variants of the network. The dataset
is used because it resembles the intended deployment
domain of the architecture. It features many images
with bounding boxes as well as segmentation classes.
The images were downscaled to an input resolution of
416 by 416 pixels which is also the output resolution
of the segmentation head. Data augmentation in the
form of random changes to the sharpness, brightness,
and hue of the image as well as random vertical
mirroring, translation (max 10% of the image size),
and scaling (80% to 150%) has been applied.
The network parameters were optimized using the
ADAM optimizer [13]. The loss function consists of
the sum of a YOLOv4 loss for object detection and
a cross-entropy loss for semantic segmentation. Both
are weighted equally.
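As a minimal sketch of how the two loss terms combine, the following example computes a per-pixel cross-entropy and adds it to a detection loss with equal weight. The YOLOv4 loss itself is not reproduced here and is passed in as a plain number; the function names and all numeric values are hypothetical.

```python
import math
from typing import Sequence


def pixel_cross_entropy(probabilities: Sequence[Sequence[float]],
                        labels: Sequence[int]) -> float:
    """Mean cross-entropy over pixels; each row holds the class probabilities of one pixel."""
    losses = [-math.log(p[label]) for p, label in zip(probabilities, labels)]
    return sum(losses) / len(losses)


def combined_loss(detection_loss: float, segmentation_loss: float) -> float:
    """Unweighted sum, i.e. both terms enter with weight 1."""
    return detection_loss + segmentation_loss


# Two pixels, three segmentation classes (hypothetical values).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]

seg_loss = pixel_cross_entropy(probs, labels)
total = combined_loss(detection_loss=1.25, segmentation_loss=seg_loss)
```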
Our open source reference implementation¹ is a
fork of PyTorch-YOLOv3², which served as the starting
point of our software architecture.
While trying to find the optimal architecture for our
use case, we investigated the following architectures:
• A YOLOv3-tiny for the encoder and bounding box prediction. A U-NET decoder is connected with all four feature maps down to a resolution of 52x52 and includes three convolutional blocks (3x3 convolution, batch normalization, and ReLU) before the output. The model served as a baseline for further experiments. (YOEO-rev-0)
• A YOLOv4-tiny for the encoder and bounding box prediction. YOLOv4-tiny uses a larger stride instead of maxpooling in the first layers, so the output of the first convolution is already downsampled by a factor of two. We therefore tested whether we should
  ◦ add a convolution with a stride of one in front of it and start the native resolution skip connection here (YOEO-rev-1), or
  ◦ use the downsampled feature map as the first skip connection (YOEO-rev-2).
• A segmentation decoder head with one or two convolutions after the last upsampling. (YOEO-rev-3 and YOEO-rev-4, respectively)
• A segmentation decoder head with a 1x1 convolution instead of the 3x3 after the last upsampling. (YOEO-rev-5)
• A segmentation decoder which starts with the deepest feature map in the encoder. (YOEO-rev-6)
• A variant of YOEO-rev-2 which starts with the deepest feature map in the encoder. (YOEO-rev-7)
• A segmentation decoder head with residual skip connections after the last upsampling. (YOEO-rev-8)
• A segmentation decoder that starts with the deepest feature map in the encoder and has only one high-resolution skip connection to gain a performance benefit. (YOEO-rev-9)
The complete model architectures are also provided
in the repository of the reference implementation¹.
In the end, we settled for a YOLOv4-tiny with
a segmentation decoder that uses encoder feature
maps from all resolutions except the native resolution
(YOEO-rev-7). This architecture is shown in figure 1.
In the selection, we considered the object detection
and segmentation performance as well as the overall
runtime of the network. The detailed evaluation of
the different architectures can be seen in section IV.
This final architecture was converted to the ONNX
format. The model was then optimized by the model
optimizer of the OpenVINO toolkit and converted to
the OpenVINO toolkit’s intermediate representation
format, which itself was used to deploy the net-
work on an Intel NCS2 VPU on the Wolfgang-OP
robot. This allows for efficient inference of the net-
work which is crucial in our power, weight, and
performance-limited use case.
In the following, we will compare the different re-
visions of our YOEO model as well as YOLOv4-tiny
and U-NET regarding their precision and runtime on
two different datasets and domains.
A. Metrics
To quantify the precision of the segmentation im-
ages, we have selected the common Jaccard index
also known as Intersection over Union (IoU) metric.
For the evaluation of the bounding box object detec-
tion, the mean average precision with a minimum of
50% IoU (mAP50) was used. The inference speed of
the models was evaluated using the frames per sec-
ond (FPS) measure, running either on an NVIDIA®
GeForce® RTX 2080 Ti or with model optimizations
on the Intel NCS2 VPU ready for deployment.
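The IoU metric can be sketched for both output types as follows. The box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for this illustration. Only the matching criterion behind mAP50 is shown; the full metric additionally ranks detections by confidence and averages the precision over recall levels and classes.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


def mask_iou(prediction: Sequence[Sequence[int]],
             target: Sequence[Sequence[int]],
             class_id: int) -> float:
    """Jaccard index of one class between two label images of equal size."""
    intersection = union = 0
    for pred_row, target_row in zip(prediction, target):
        for p, t in zip(pred_row, target_row):
            in_pred, in_target = p == class_id, t == class_id
            intersection += in_pred and in_target
            union += in_pred or in_target
    return intersection / union if union else 0.0


def is_true_positive(prediction: Box, ground_truth: Box, threshold: float = 0.5) -> bool:
    """Matching criterion behind mAP50: a detection counts at IoU >= 0.5."""
    return box_iou(prediction, ground_truth) >= threshold
```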
B. Datasets
During our evaluation phase, we have used two
datasets for training and evaluation. As it perfectly
matches our use case, we have used the TORSO-21
Dataset [12] from the RoboCup soccer domain to
train, test, and select from the various approaches
mentioned in section III. Additionally, we have used
the Cityscapes Dataset [14] to evaluate our final
model architecture and to quantitatively compare it
with other architectures.
Fig. 1. Layout of the final YOEO architecture (YOEO-rev-7).
The shape of the output feature map is noted next to the layers.
Maxpooling layers are marked in orange while nearest-neighbor
upsampling is red. The convolutional layers (blue) also include
batch normalization and a leaky ReLU activation function. The
number of filters before the output layers (purple, pink) depends
on the number of predicted classes; this figure shows the layout
for the TORSO-21 dataset with three bounding box and three
segmentation classes.
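As the caption of Fig. 1 notes, the number of filters in the output layers depends on the number of predicted classes. A minimal sketch of that relation follows. It assumes the usual YOLO convention of three anchor boxes per detection head, each with four box values and one objectness score, and one output channel per segmentation class. Neither assumption is stated explicitly in the text, and the function names are illustrative.

```python
def yolo_head_filters(num_classes: int, anchors_per_head: int = 3) -> int:
    """Output channels of one YOLO head: per anchor 4 box values, 1 objectness, class scores."""
    return anchors_per_head * (4 + 1 + num_classes)


def segmentation_head_filters(num_classes: int) -> int:
    """One output channel per segmentation class (assumed)."""
    return num_classes


# TORSO-21: three bounding box classes and three segmentation classes.
torso21_detection = yolo_head_filters(num_classes=3)
torso21_segmentation = segmentation_head_filters(3)

# Cityscapes: eight instance classes are used for object detection.
cityscapes_detection = yolo_head_filters(num_classes=8)
```

Under these assumptions, switching datasets only changes the width of the last layer of each head, while the shared encoder keeps its layout.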
The TORSO-21 dataset consists of two image
collections: one contains images and labels of a
simulated soccer environment, while the other is a diverse
dataset of 8894 training and 1570 test images.
The images were recorded by five different
camera types from twelve locations containing six
ball types. Classes included in this collection are
ball, robot, and L-, T-, and X-intersections of the
soccer field lines as bounding box labels, and goal-
post labels as four-point polygons. Additionally, it
provides segmentation masks for stuff classes like
the field area and lines. We converted the goalpost
polygons to bounding boxes as required by the YOLO
architecture. We chose to not make use of the line
intersection labels, as we want to use the semantically
segmented lines, which contain more information. As
the images do not share a common resolution or
aspect ratio, they get scaled to match 416 pixels in
the largest axis and padded with black on the other
axis. This results in a 416 by 416 pixel square image
as described above.
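The two preprocessing steps above, converting goalpost polygons to bounding boxes and fitting images into a 416 by 416 square, can be sketched as follows. This is an illustration under stated assumptions rather than the code used by the authors: the function names are hypothetical, the example image size and polygon are made up, and the padding is split evenly between both sides, which the text does not specify.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def polygon_to_bbox(polygon: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def letterbox_geometry(width: int, height: int, target: int = 416):
    """Scale factor, scaled size and padding offset that fit an image into a square."""
    scale = target / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = target - new_w, target - new_h
    # Padding is split evenly here; the text does not say where the black bars go.
    return scale, (new_w, new_h), (pad_x // 2, pad_y // 2)


def letterbox_box(box, scale, offset):
    """Map a box from original image coordinates into the padded square."""
    x_min, y_min, x_max, y_max = box
    off_x, off_y = offset
    return (x_min * scale + off_x, y_min * scale + off_y,
            x_max * scale + off_x, y_max * scale + off_y)


# Hypothetical 1920x1080 image with a goalpost labelled by four points.
goalpost = [(960.0, 200.0), (1000.0, 205.0), (990.0, 700.0), (950.0, 695.0)]
scale, size, offset = letterbox_geometry(1920, 1080)
box = letterbox_box(polygon_to_bbox(goalpost), scale, offset)
```

The labels have to be transformed with the same scale and offset as the image, otherwise the boxes no longer line up with the padded input.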
The Cityscapes dataset is commonly used as a
benchmark for vision algorithms, especially in the
autonomous driving domain. It contains several col-
lections from stereo video recordings of street scenes
from 50 different cities. We have used the gtFine
collection consisting of 5,000 images with pixel-level
instance and semantic segmentation. 30 classes are
labeled that can be combined into eight groups. As
our initial problem space does not require as many
classes, we have decided to train and evaluate our
model architecture only based on the groups for the
semantic segmentation. For object detection, we used
all classes that supported instance segmentations. As
there are only eight of them, no grouping was applied.
The instance segmentations were converted to bound-
ing boxes which are compatible with the predictions
of our YOLO decoder. For this conversion, we have
utilized a tool³ by Till Beemelmanns. Instead of
applying padding to the input images, we scaled the
rather wide but consistent resolution of 1024 x 2048
pixels directly to a 416 by 416 pixel square.
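The conversion from instance segmentations to bounding boxes, and the direct rescaling to a square input, can be illustrated with a short sketch. This is not the conversion tool mentioned above; the function names and the tiny example mask are hypothetical, and boxes are given as inclusive pixel indices.

```python
from typing import Optional, Sequence, Tuple


def mask_to_bbox(mask: Sequence[Sequence[int]],
                 instance_id: int) -> Optional[Tuple[int, int, int, int]]:
    """Tightest box (x_min, y_min, x_max, y_max) around one instance, in pixel indices."""
    xs, ys = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value == instance_id:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # the instance does not appear in this mask
    return min(xs), min(ys), max(xs), max(ys)


def resize_box(box, src_size, dst_size=(416, 416)):
    """Scale a box when the image is resized without preserving the aspect ratio."""
    (src_w, src_h), (dst_w, dst_h) = src_size, dst_size
    sx, sy = dst_w / src_w, dst_h / src_h
    x_min, y_min, x_max, y_max = box
    return x_min * sx, y_min * sy, x_max * sx, y_max * sy


# Tiny hypothetical instance mask: 0 is background, 7 marks one object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 7, 7, 0, 0],
    [0, 7, 7, 7, 0],
    [0, 0, 0, 0, 0],
]
box = mask_to_bbox(mask, instance_id=7)

# A box in a 2048 pixel wide, 1024 pixel high image mapped to the 416x416 input.
scaled = resize_box((512, 256, 1024, 512), src_size=(2048, 1024))
```

Because width and height are scaled by different factors, objects appear horizontally compressed in the network input, and the box coordinates are compressed in the same way.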
³ cityscapes-to-coco-conversion
Table I. Architecture comparison on the TORSO-21 dataset.
Architecture   mAP50    IoU      FPS
YOEO-rev-0     78.87%   82.62%   175.45
YOEO-rev-1     84.10%   83.37%   133.96
YOEO-rev-2     83.87%   83.49%   138.76
YOEO-rev-3     77.91%   82.22%   185.78
YOEO-rev-4     77.97%   82.21%   180.64
YOEO-rev-5     78.71%   82.27%   179.59
YOEO-rev-6     78.95%   85.28%   154.36
YOEO-rev-7     83.36%   85.02%   137.14
YOEO-rev-8     78.62%   82.53%   173.24
YOEO-rev-9     79.12%   84.23%   163.40
YOLOv4-tiny    83.56%   -        174.67
U-NET          -        81.58%   152.07
C. Architecture Comparison
The different proposed architecture layouts have
been trained for up to 60 epochs on the TORSO-21
dataset. The results are presented in table I. The
performance of the object detection is measured
using the mAP50 metric, while the IoU is used for
semantic segmentation. The frames per second (FPS)
are measured on an NVIDIA® GeForce® RTX 2080
Ti and do not include the inference optimizations
applied to the deployed model. Standard deviations
for the FPS measurements are negligibly small. The
reference U-NET model uses a full YOLOv4-tiny
encoder with skip connections on every level (similar
to the segmentation decoder of YOEO-rev-7).
It can be seen that the YOLOv4-tiny based models
have a clear advantage over the YOLOv3-tiny ones
when it comes to object detection performance. We,
therefore, strongly prefer a YOLOv4-tiny network.
Also, it is evident that a deeper starting segmentation
decoder results in a higher IoU. The change from nor-
mal (YOEO-rev-0) to deeper (YOEO-rev-6) increases
the IoU from 82.62% to 85.28%. While this change
is relatively small, it is still significant for classes like
the field where the majority is easy to detect but the
interesting edge cases are only solved by the deeper
model. This includes cases where other intentionally
unlabeled soccer fields are visible in the background,
images with natural light (areas might be over- or
underexposed), or cases where there are obstructions
(people, robots, cables, laptops, ...) on the field that
are part of the field class due to the simplified anno-
tation method for field annotations described in the
TORSO-21 paper [12]. Deeper encoder feature maps
help in this regard because they are less spatially
specific, allowing the decoder to focus on a larger
context, while also having more non-linearities, which
allow for an approximation of more complex data
distributions. This is demonstrated in figure 2.
Fig. 2. This figure shows an exemplary difference between
the segmentation of the YOEO-rev-0 (left) and YOEO-rev-6
(right) segmentation decoder. It shows how the deeper decoder of
YOEO-rev-6 handles large occlusions of the field (green) better.
It also correctly classifies the bright area lit by the sun in the
upper left of the image as field instead of line (red).
The experiments also show that using two convolutional
blocks instead of one after the last upsampling in the
segmentation decoder head mainly results in a longer
runtime rather than any performance benefit for models
with a full-resolution skip connection.
Using a 1x1 convolution in the last layer has nearly
no effect despite impacting the performance (see
table I).
The residual connections in the segmentation de-
coder head (YOEO-rev-8) made no notable differ-
ence in the accuracy while slightly slowing down
the model during inference. The improved gradient
flow resulted in faster training times, but it was not
significant enough to justify the slower inference.
Preferring the encoder layout of YOEO-rev-2, which
does not include a full-resolution skip connection,
over that of YOEO-rev-1 seems reasonable because the
accuracy trade-off is negligible and the lower resolution
skip-connection results in a faster inference with a
smaller memory footprint. It also enables the usage
of pre-trained YOLOv4-tiny encoder weights that are
available for many datasets.
The combined YOEO network outperforms the
standalone versions of both the YOLOv4-tiny and
U-NET even while sharing the encoder. This seems
to be the case because the different domains of both
decoders help the encoder to generalize better during
training. The combined encoder also results in an
inference speed benefit of 68.70% when we compare
the combined runtime of the individual networks with
the YOEO-rev-7 architecture as seen in table I.
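The speed benefit stated above can be reproduced from the FPS values in table I. The sketch below assumes that "combined runtime" means running YOLOv4-tiny and U-NET one after another on the same frame, so that their per-frame times add up; the function names are illustrative.

```python
def sequential_fps(*fps_values: float) -> float:
    """Effective frame rate when several networks run one after another."""
    return 1.0 / sum(1.0 / fps for fps in fps_values)


def speed_benefit_percent(combined_fps: float, baseline_fps: float) -> float:
    """Relative frame rate gain of the combined network over the baseline."""
    return (combined_fps / baseline_fps - 1.0) * 100.0


# FPS values from table I.
yolov4_tiny_fps, unet_fps, yoeo_rev7_fps = 174.67, 152.07, 137.14

baseline = sequential_fps(yolov4_tiny_fps, unet_fps)
benefit = speed_benefit_percent(yoeo_rev7_fps, baseline)
print(f"sequential: {baseline:.2f} FPS, benefit: {benefit:.2f}%")
```

Under this assumption the two standalone networks together reach about 81.3 FPS, and the 137.14 FPS of YOEO-rev-7 corresponds to a gain of roughly 68.7%.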
Fig. 3. Visualization of the predictions (right) of the final model
on a test image (left) from the TORSO-21 dataset. The prediction
includes the successful detection of another robot (yellow), the
goalposts (blue), and the ball (pink) while ignoring the robot
itself and the humans on the field as intended. Further, an accurate
segmentation of field (green) and lines (red) can be seen.
Therefore, it is reasonable to settle for a
YOLOv4-tiny based architecture that utilizes the full
encoder for the semantic segmentation without using
a full resolution skip connection or residuals in the
decoder (YOEO-rev-7). It is also the best out of the
YOLOv4-tiny models when ranking them using the
geometric mean of the evaluated attributes including
the runtime.
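The ranking by geometric mean can be sketched with the values from table I. The exact attributes and any normalization used by the authors are not given, so this example assumes the three table columns (mAP50, IoU, FPS) are used as they are, and that revisions 1, 2, and 7 are the YOLOv4-tiny based models, as the architecture descriptions suggest. Rescaling a column by a constant factor does not change the order under a geometric mean.

```python
import math

# mAP50 [%], IoU [%], FPS from table I for the revisions described as
# YOLOv4-tiny based (assumed selection, see the lead-in above).
results = {
    "YOEO-rev-1": (84.10, 83.37, 133.96),
    "YOEO-rev-2": (83.87, 83.49, 138.76),
    "YOEO-rev-7": (83.36, 85.02, 137.14),
}


def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


scores = {name: geometric_mean(values) for name, values in results.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these inputs YOEO-rev-7 ranks first, narrowly ahead of YOEO-rev-2, which is in line with the selection described above.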
This final model has been deployed as described
in section III. The model runs at 6.7 FPS with less
than 2 Watt power usage on the embedded hardware
of the robot.
D. Performance on Cityscapes
We also tested our architecture on the Cityscapes
dataset. The results can be seen in table II. The dataset
is slightly out of domain for this architecture and it
can be seen that the YOLOv4-tiny model struggles
with the high number of small objects in the dataset.
This is typical for this family of object detectors [15].
With these limitations in mind, our approach still
outperforms the similar YOLO-based approach from
Valeo [10] in all tested categories on the dataset while
being comparable in size and speed (see table II). A
complete runtime comparison was not possible due to
a lack of information regarding their implementation
and deployment. An exemplary prediction is shown
in figure 4.
Fig. 4. Predictions from YOEO on the Cityscapes dataset.
Stuff segmentation classes: flat (red), construction (green), object
(yellow), nature (blue), sky (orange), and void (transparent).
Visible things: cars (brown), trucks (pink), and persons (blue).
Table II. Results on the Cityscapes dataset.
Metric   Class                 YOEO-rev-7   MTL Valeo [10]
AP       Person                26.87%       19.31%
AP       Rider                 29.97%       -
AP       Bicycle               27.51%       18.98%
AP       Car                   51.05%       41.10%
AP       Bus                   29.88%       -
AP       Motorcycle            03.29%       -
AP       Truck                 14.52%       -
AP       Train                 08.11%       -
IoU      Flat (e.g. Road)      91.76%       59.66%
IoU      Construction          81.43%       63.63%
IoU      Object (e.g. Poles)   40.99%       -
IoU      Nature                85.45%       69.49%
IoU      Sky                   86.08%       62.28%
With YOEO we present a hybrid CNN approach
based on YOLOv4-tiny for object detections and
U-NET for semantic segmentation utilizing only one
encoder backbone. Our architecture outperforms both
standalone versions in precision indicating better
generalization of the encoder while simultaneously
gaining a 68.70% speed benefit over running both
standalone architectures sequentially. Many robotics
computer vision tasks could benefit from this hybrid
approach.
Training the YOEO architecture requires a diverse
dataset containing stuff and things classes, which is
more expensive to create than a dataset just containing
bounding boxes. On the other hand, it is less
expensive to label than a fully annotated panoptic
segmentation dataset. It is also possible to convert
datasets containing panoptic segmentations as shown
with the Cityscapes dataset.
The YOEO architecture could be improved further
by using automated parameter optimization for the
hyperparameters. Additionally, a classification de-
coder head could be appended to the shared encoder
to provide classifications of e.g. weather, game states,
or perturbations such as glare or staining of the
camera. Combining other segmentation architectures
such as BiSeNet V2 [16] could also improve the
Thanks to Erik Linder-Norén for the original PyTorch-YOLOv3
implementation. Also thanks to the members of the Hamburg
Bit-Bots for helping to develop this architecture.
[1] “RoboCup Soccer Humanoid League Laws of the Game 2019/2020,” content/uploads/RCHL-2020-Rules-Dec23.pdf, (accessed November 21,
[2] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection,” preprint arXiv:2004.10934, 2020.
[3] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015.
[4] M. Bestmann, J. Güldenstein, F. Vahl, and J. Zhang, “Wolfgang-OP: A Robust Humanoid Robot Platform for Research and Competitions,” in IEEE Humanoids 2021, 07
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet Classification with Deep Convolutional Neural Networks,” Advances in neural information processing systems, vol. 25,
[6] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” preprint arXiv:1506.02640, 2015.
[7] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single Shot Multibox Detector,” in European conference on computer vision. Springer,
[8] J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015.
[9] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun, “Multinet: Real-Time Joint Semantic Reasoning for Autonomous Driving,” in Intelligent Vehicles Symposium (IV). IEEE, 2018.
[10] G. Sistu, I. Leang, and S. Yogamani, “Real-Time Joint Object Detection and Semantic Segmentation Network for Automated Driving,” preprint arXiv:1901.03912, 2019.
[11] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic Feature Pyramid Networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6399–6408.
[12] M. Bestmann, T. Engelke, N. Fiedler, J. Güldenstein, J. Gutsche, J. Hagge, and F. Vahl, “TORSO-21 Dataset: Typical Objects in RoboCup Soccer 2021,” in RoboCup 2021, 2021.
[13] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in 3rd International Conference on Learning Representations, ICLR, 2015.
[14] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes Dataset for Semantic Urban Scene Understanding,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[15] N.-D. Nguyen, T. Do, T. D. Ngo, and D.-D. Le, “An Evaluation of Deep Learning Methods for Small Object Detection,” Journal of Electrical and Computer Engineering,
[16] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, “Bisenet v2: Bilateral Network With Guided Aggregation for Real-Time Semantic Segmentation,” preprint arXiv:2004.02147, 2020.
Intel and OpenVINO are trademarks of Intel Corporation or its subsidiaries. NVIDIA and GeForce
are registered trademarks of NVIDIA Corporation. The authors are not affiliated with Intel
Corporation or NVIDIA Corporation.
... The robot is also equipped with an Intel Neural Compute Stick 2 for neural network inference [29]. Because of positive experiences with the OpenVINO-toolkit [30] in the deployment of YOEO [31] to our robots, we follow a similar strategy for this approach. First, the model architecture is defined and trained in PyTorch [32]. ...
... To achieve similar generalization capabilities for real-world approaches, however, a large amount of training data recorded in different scenarios is required. Using a neural network capable of detecting the field boundary [31], [35] or conventional approaches [36], [37] to remove all depth estimations above it can address the issue. However, this can negatively impact detections above or close to the field boundary, such as robots or goalposts. ...
... In future work, domain-specific adaptions of the architectures and newer approaches to the problem like guided upsampling blocks [22] need to be evaluated. Plus, we are very interested in integrating depth estimation into shared backbone networks such as YOEO [31]. Also, new realworld datasets have to be recorded, possibly during test games, to improve the applicability in real-world games. ...
Conference Paper
Full-text available
We showcase a pipeline to train, evaluate, and deploy deep learning architectures for monocular depth estimation in the RoboCup Soccer Humanoid domain. In contrast to previous approaches, we apply the methods on embedded systems in highly dynamic but heavily constrained environments. The results indicate that our monocular depth estimation pipeline is usable in the RoboCup environment.
... The software stack the Hamburg Bit-Bots RoboCup team has developed allows the robot to perform the tasks required for autonomously playing in the soccer competition. A state-of-theart vision pipeline [41] detects objects and segments the captured images into the playing field and field lines. These features are transformed into Cartesian space using inverse projection mapping. ...
Full-text available
Navigation is a crucial component of any mobile robot. For humanoid robots, this task is more challenging than for wheeled robots. Their capabilities can not be modeled by the velocity and acceleration limits used for wheeled navigation, as the discrete nature of taking steps interferes with the assumption that acceleration is possible at any time. Footstep planning can provide a viable alternative. It enables the robot to step onto or over obstacles and navigate more precisely as it considers individual footstep placement. However, modeling the capabilities of where a robot can place its feet while remaining stable given a specific walking engine and the robot’s dynamic state is challenging. This thesis proposes using a model-free reinforcement learning approach to train a policy that outputs the footstep poses for a walking engine. This allows learning the capabilities of the robot and the walking engine to minimize the time required for navigation. The policy is trained and evaluated in simulation. However, the employed walking engine has been used successfully on several real robots, which could ease the sim-to-real transfer. The evaluation shows a significant performance increase compared to velocity-based navigation while still being robust to noisy measurements.
... YOLOv4-P5 only employed the features of three scales to represent the different scale objects, and also achieve excellent performance. Different from the above multi-scale representation, YOLOv3-Tiny (Redmon & Farhadi, 2018), YOLOv4-Tiny and YOEO (Vahl et al., 2021) integrated the features of the two scales for object detection, so as to achieve the balance between detection accuracy, model parameters and detection speed. Although these models have achieved impressive performance, they do not fully consider the matching relationships between the object distribution and the detection head at different resolutions. ...
Multi-scale detection plays an important role in object detection models. However, researchers usually feel blank on how to reasonably configure detection heads combining multi-scale features at different input resolutions. We find that there are different matching relationships between the object distribution and the detection head at different input resolutions. Based on the instructive findings, we propose a lightweight traffic object detection network based on matching between detection head and object distribution, termed as MHD-Net. It consists of three main parts. The first is the detection head and object distribution matching strategy, which guides the rational configuration of detection head, so as to leverage multi-scale features to effectively detect objects at vastly different scales. The second is the cross-scale detection head configuration guideline, which instructs to replace multiple detection heads with only two detection heads possessing of rich feature representations to achieve an excellent balance between detection accuracy, model parameters, FLOPs and detection speed. The third is the receptive field enlargement method, which combines the dilated convolution module with shallow features of backbone to further improve the detection accuracy at the cost of increasing model parameters very slightly. The proposed model achieves more competitive performance than other models on BDD100K dataset and our proposed ETFOD-v2 dataset. The code will be available.
Full-text available
Low-level details and high-level semantics are both essential to the semantic segmentation task. However, to speed up the model inference, current approaches almost always sacrifice the low-level details, leading to a considerable decrease in accuracy. We propose to treat these spatial details and categorical semantics separately to achieve high accuracy and high efficiency for real-time semantic segmentation. For this purpose, we propose an efficient and effective architecture with a good trade-off between speed and accuracy, termed Bilateral Segmentation Network (BiSeNet V2). This architecture involves the following: (i) A detail branch, with wide channels and shallow layers to capture low-level details and generate high-resolution feature representation; (ii) A semantics branch, with narrow channels and deep layers to obtain high-level semantic context. The detail branch has wide channel dimensions and shallow layers, while the semantics branch has narrow channel dimensions and deep layers. Due to the reduction in the channel capacity and the use of a fast-downsampling strategy, the semantics branch is lightweight and can be implemented by any efficient model. We design a guided aggregation layer to enhance mutual connections and fuse both types of feature representation. Moreover, a booster training strategy is designed to improve the segmentation performance without any extra inference cost. Extensive quantitative and qualitative evaluations demonstrate that the proposed architecture shows favorable performance compared to several state-of-the-art real-time semantic segmentation approaches. Specifically, for a \(2048\times 1024\) input, we achieve 72.6% Mean IoU on the Cityscapes test set with a speed of 156 FPS on one NVIDIA GeForce GTX 1080 Ti card, which is significantly faster than existing methods, yet we achieve better segmentation accuracy. The code and trained models are available online at
Conference Paper
We present our open humanoid robot platform Wolfgang. The described hardware focuses on four aspects. Firstly, the robustness against falls is improved by integrating 3D printed elastic elements. Additionally, a high control loop frequency is achieved by using new custom control electronics. Furthermore, a torsion spring is applied to reduce the torque on the knee joints. Finally, computational power is provided through the combination of different processors. The paper also presents the ROS-based software stack that is used in RoboCup.
Conference Paper
We present a dataset specifically designed to be used as a benchmark to compare vision systems in the RoboCup Humanoid Soccer domain. The dataset is composed of a collection of images taken in various real-world locations as well as a collection of simulated images. It enables comparing vision approaches with a meaningful and expressive metric. The contributions of this paper consist of providing a comprehensive and annotated dataset, an overview of the recent approaches to vision in RoboCup, methods to generate vision training data in a simulated environment, and an approach to increase the variety of a dataset by automatically selecting a diverse set of images from a larger pool. Additionally, we provide a baseline of YOLOv4 and YOLOv4-tiny on this dataset.
Small object detection is an interesting topic in computer vision. With the rapid development of deep learning, it has attracted many researchers, whose innovations include region proposals, grid-cell division, multiscale feature maps, and new loss functions. As a result, object detection performance has improved significantly in recent years. However, most state-of-the-art detectors, both one-stage and two-stage, still struggle to detect small objects. In this study, we evaluate current state-of-the-art deep learning models of both kinds, such as Fast RCNN, Faster RCNN, RetinaNet, and YOLOv3. We provide an in-depth assessment of the advantages and limitations of these models. Specifically, we run the models with different backbones on different datasets with multiscale objects to find out which types of objects suit each combination of model and backbone. Extensive empirical evaluation was conducted on two standard datasets, namely a small object dataset and a filtered dataset from PASCAL VOC 2007. Finally, comparative results and analyses are presented.
While most approaches to semantic reasoning have focused on improving performance, in this paper we argue that computational times are very important in order to enable real-time applications such as autonomous driving. Towards this goal, we present an approach to joint classification, detection and semantic segmentation via a unified architecture where the encoder is shared amongst the three tasks. Our approach is very simple, can be trained end-to-end and performs extremely well on the challenging KITTI dataset, outperforming the state-of-the-art in the road segmentation task. Our approach is also very efficient, taking less than 100 ms to perform all tasks.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
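The top-1 and top-5 error rates quoted above count a prediction as correct when the true label is among the k highest-scoring classes. The following is a minimal sketch of that metric in plain Python; the function name and the toy scores are hypothetical and are not taken from the paper.

```python
def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k best scores.

    scores: list of per-class score lists, one per sample.
    labels: list of true class indices, one per sample.
    """
    errors = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to k entries.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


# Toy example with three samples and three classes (illustrative values).
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]

print(top_k_error(scores, labels, k=1))  # two of three samples are wrong
print(top_k_error(scores, labels, k=2))  # one of three samples is wrong
```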
Conference Paper
It is widely agreed that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at .
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.
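The mean IU (mean intersection over union, also written mean IoU) reported for segmentation results can be illustrated on flattened label maps. The following is a minimal sketch in plain Python under simplifying assumptions: the function name and the toy labels are hypothetical, classes that appear in neither map are skipped, and real benchmarks typically accumulate the counts over a whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union for two flat lists of class indices.

    Assumes pred and target are non-empty and have the same length.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # The pixel lies in both the predicted and the true region.
            intersection[t] += 1
            union[t] += 1
        else:
            # The pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Skip classes that occur in neither prediction nor ground truth.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


# Six pixels, three classes (illustrative values).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

# Per-class IoU: 1/3, 2/3 and 1/2, so the mean is 0.5.
print(mean_iou(pred, target, num_classes=3))
```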