FisheyePixPro: Self-supervised Pretraining using Fisheye Images for Semantic Segmentation
Ramchandra Cheke1, Ganesh Sistu2, Ciarán Eising1, Pepijn van de Ven1, Varun Ravi Kumar3 and Senthil Yogamani2
1University of Limerick, Ireland
2Valeo Vision Systems, Ireland
3Valeo DAR Kronach, Germany
Abstract
Self-supervised learning has been an active area of research in the past few years. Contrastive learning is a type of self-supervised learning method that has achieved significant performance improvements on image classification tasks. However, no work has applied it to fisheye images for autonomous driving. In this paper, we propose FisheyePixPro, an adaptation of the pixel-level contrastive learning method PixPro [1] for fisheye images. This is the first attempt to pretrain a contrastive learning based model directly on fisheye images in a self-supervised manner. We evaluate the performance of the learned representations on the WoodScape dataset using a segmentation task. Our FisheyePixPro model achieves a 65.78 mIoU score, a significant improvement over the PixPro model. This indicates that pretraining a model on fisheye images yields better performance on a downstream task.
INTRODUCTION
Recent advancements in deep learning have acted as a catalyst for achieving human-level performance in various computer vision tasks. Availability of large datasets, development of novel architectures and access to faster GPUs are the key factors in the success of deep learning. One of the main challenges in training a deep neural network in a supervised way is the requirement for a large amount of labelled data, which is costly to generate. Self-supervised learning methods focus on learning a generic visual representation from a large amount of unlabelled images, alleviating the requirement for an annotated dataset. Self-supervised learning can be divided into two major categories: 1) pretext task methods and 2) contrastive learning.
In pretext task methods, the labels are generated by defining a pseudo task, with the intuition that the network should learn generic features while solving a pretext task. Examples of such pretext tasks are context prediction [2], image colourisation [3], jigsaw puzzles [4], and rotation prediction [5]. The transfer learning performance of these tasks was limited as the network was unable to learn robust feature representations while solving pretext tasks [6].
Figure 1: Sample images from the KITTI-360 dataset.
Contrastive learning means learning by comparing input samples. The objective of contrastive learning is to maximise the agreement between "similar" inputs or "positive pairs" and to maximise the distance between "dissimilar" inputs or "negative pairs" in the embedding space. Two views from a single image can be considered a positive pair, while two views from different images can be considered a negative pair. Contrastive learning methods are based on the principle of instance discrimination [7], where each image is considered as a single class, and the aim is to distinguish each class from the other classes. In order to classify two different views from the same image as a single class, the need for data augmentation arises. Hence, data augmentation proves to be one of the critical aspects of contrastive learning. Numerous methods [8, 9, 10, 11, 12] have shown promising results on downstream image classification tasks on the ImageNet-1K dataset using a ResNet-50 backbone pretrained with a contrastive learning framework. However, to the best of our knowledge, no work has been done on leveraging fisheye images to pretrain a model using contrastive learning methods.
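As illustration only (this is not the loss used in this paper), the following is a minimal PyTorch sketch of an instance-level contrastive objective, where the two augmented views of each image in a batch form the positive pair and all other images act as negatives; the temperature value and tensor shapes are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Illustrative InfoNCE loss for instance discrimination.

    z1, z2: (N, D) embeddings of two augmented views of the same N images.
    The i-th rows of z1 and z2 form a positive pair; every other image in
    the batch serves as a negative.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (N, N) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)     # diagonal entries are the positives

# Example: embeddings of 8 images under two augmentations
loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))
```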
Traditional deep learning models offer little performance benefit when applied directly to fisheye images (e.g. Figure 1) due to the large radial distortion in fisheye images. Still, fisheye cameras are one of the major components of computer vision systems in autonomous driving cars because only four fisheye cameras are necessary to provide full 360° coverage around the vehicle. Therefore, they have become popular in near-field sensing at low speed [13, 14]. Several experiments have been conducted to enhance the performance of CNNs on fisheye datasets, for example by investigating the impact of adversarial attacks [15] on a multi-task visual perception network [16]. The domain of autonomous driving involves object detection [17, 18, 19], soiling detection [20, 21, 22], semantic segmentation [23], weather classification [24, 25], dynamic object detection [26], depth prediction [27, 28, 29, 30, 31], fusion [32], key-point detection and description [33] and multi-task learning [34, 35, 36]. It also poses many challenges due to the highly dynamic and interactive nature of surrounding objects in automotive scenarios [37].
Figure 2: Self-supervised FisheyePixPro training framework for pretraining the PixPro [1] model using fisheye images.
Pretext task-based models have lower performance compared to contrastive learning frameworks [10]. As a result, we chose to adopt a contrastive learning framework. However, contrastive learning-based methods such as the Simple framework for Contrastive Learning of visual Representations (SimCLR) require a very large batch size to maintain the ratio of negative pairs. The Momentum Contrast (MoCo) method [12] addresses this issue by using a momentum encoder and maintaining a queue of previously generated samples that can be utilised as negative pairs. These approaches use the notion of instance discrimination, with the network being pretrained on the ImageNet dataset. The ImageNet dataset typically consists of a single object per image. Thus, two different views from a single image will share some features of the main object in the image. Therefore, instance discrimination methods can be applied to the ImageNet dataset. In contrast, fisheye images collected for self-driving purposes are fundamentally different from the ImageNet dataset: they contain multiple objects such as buses, cars, bikes, roads, humans and traffic signs in a single image. Due to this, instance discrimination methods like MoCo and SimCLR are not a suitable choice for pretraining models with fisheye images for autonomous driving.
The PixPro [1] method is based on pixel-level contrastive learning. In this method, each pixel in a given image is considered as a single class and the objective is to differentiate each pixel from the other pixels within the same image. The main advantage of PixPro is that, unlike SimCLR, it does not require a large batch size. The negatives are selected based on the features obtained from different pixels of the same image. Additionally, the pixel propagation module provides a smoothing effect that removes noise and allows propagation of features between similar pixels.
In this work, we propose a novel training method, FisheyePixPro, which is the first attempt to train a contrastive learning based model directly on fisheye images in a self-supervised manner. We use PixPro [1] as a base for pretraining on fisheye images. Fisheye images exhibit geometric distortion, which leads to a drop in performance when an ImageNet pretrained model is applied directly to fisheye images. We demonstrate that FisheyePixPro pretrained representations obtain a higher score on the segmentation task than the standard PixPro model.
METHODS
Datasets
We used a subset of the KITTI-360 [38] dataset and a Valeo internal fisheye image dataset for pretraining. KITTI-360 is a large-scale 3D video dataset with 300k images and 3D laser point clouds. The dataset was collected with the help of a station wagon using two fisheye cameras, one on each side, covering a 360° view. Sample fisheye images from the KITTI-360 dataset are shown in Figure 1. To remove duplicate images, a total of 50k images was sampled for pretraining. In addition, we used an internal fisheye dataset from Valeo. These fisheye images were obtained under the same conditions as WoodScape [39] and consist of around 50k unlabelled fisheye images. In total, 100k images were used for pretraining.
In addition, we evaluated the performance of the learned representations on the WoodScape dataset using a segmentation task. The WoodScape dataset consists of 10k images in total, with annotations for nine classes: road, lanes, curbs, person, rider, vehicle, bicycle, motorcycle and traffic sign. Of the 10k images, only 8215 are publicly available. We therefore randomly chose 7200 images for training a DeepLabv3+ model with a ResNet-50 encoder on the segmentation task and evaluated performance on the remaining images as a test set.
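As a hedged sketch of the split described above, the snippet below draws a random 7200-image training subset; the file names and random seed are hypothetical, since the paper does not specify them.

```python
import random

# Hypothetical list of the 8215 publicly available WoodScape image names.
all_images = [f"{i:05d}.png" for i in range(8215)]

random.seed(0)                                   # assumed seed, not given in the paper
train_images = random.sample(all_images, 7200)   # training split
test_images = sorted(set(all_images) - set(train_images))
print(len(train_images), len(test_images))       # 7200 1015
```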
FisheyePixPro Pretraining for Segmentation
PixPro is based on two properties of an image: spatial sensitivity and spatial smoothing. Spatial sensitivity is defined as the ability to differentiate between adjacent pixels. This property is useful in the delineation of boundary areas. On the other hand, the spatial smoothing operation involves the removal of noise or high-frequency signals from an image. These two properties are the core components of the pretext task. The features from the corresponding pixels of the two views taken from the randomly cropped image are encouraged to be consistent. This pixel-level pretext task focuses on learning representations from two different views of the same image by minimising the distance between the two pixel-level representations using a cosine similarity loss.
The model architecture is shown in Figure 2. It is a Siamese architecture with two input branches that process different views of the same input image under different data augmentations. One branch consists of a ResNet-50 encoder with a projection head and the pixel propagation module, whereas the other branch consists of only a momentum encoder with a projection head. A random crop is extracted from the given image and resized to 224×224 pixels. Different data augmentations such as random horizontal flip, colour jitter, grayscale, Gaussian blur and solarisation are applied to the input image. The encoder and the momentum encoder are used to compute features from these two extracted patches. The spatial resolution of the feature maps reduces to 7×7. Each pixel in the feature map is then mapped back to the original image space, and the distance between every pair of pixels in the two feature maps is computed according to equation (1):

A(i, j) = \begin{cases} 1 & \text{if } \operatorname{dist}(i, j) \le \tau \\ 0 & \text{if } \operatorname{dist}(i, j) > \tau \end{cases} \qquad (1)
If the distance between two pixels from different views is less than the threshold τ, those two pixels are considered a positive pair; if the distance is greater than τ, they are considered a negative pair. The typical value of τ is 0.7.
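A minimal sketch of how the assignment rule in equation (1) can be evaluated, assuming the 7×7 feature-map cells of both views have already been mapped back to image-space coordinates; the function and variable names are ours, and the distances are assumed to already be in the normalised units to which the threshold τ = 0.7 refers.

```python
import torch

def positive_pair_mask(coords_a, coords_b, tau=0.7):
    """Binary assignment A(i, j) from equation (1).

    coords_a: (Na, 2) image-space coordinates of the cells of view 1.
    coords_b: (Nb, 2) image-space coordinates of the cells of view 2.
    Distances are assumed to be expressed in the same normalised units
    as the threshold tau (0.7 in the paper).
    """
    dist = torch.cdist(coords_a, coords_b)   # (Na, Nb) pairwise distances
    return (dist <= tau).float()             # 1 for positive pairs, 0 otherwise

# Example with 7x7 = 49 cells per view and made-up coordinates
mask = positive_pair_mask(torch.rand(49, 2), torch.rand(49, 2))
```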
The pixel propagation module is applied to only one branch of the network, after the regular encoder. The purpose of this module is to provide a smoothed representation using a self-attention mechanism, according to equation (2):

y_i = \sum_{j} \left( \max(\cos(x_i, x_j), 0) \right)^{\gamma} \cdot g(x_j) \qquad (2)
where the cosine function measures the similarity between a pair of pixel features and γ is a control parameter for the similarity function, with a default value of 2. The function g(·) is a transformation composed of batch normalisation and a ReLU layer. Finally, the loss is calculated by equation (3), where y_i is the feature of pixel i from the pixel propagation module and x'_j is the feature of the matching pixel j from the momentum encoder branch:

\mathcal{L} = -\cos(y_i, x'_j) - \cos(y_j, x'_i) \qquad (3)
To pretrain FisheyePixPro, the network was first initialised with the publicly available PixPro weights and then further pretrained on fisheye images.
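To make equations (2) and (3) concrete, the following is a rough PyTorch sketch of the pixel propagation step and the negative-cosine loss based on our reading of PixPro [1]; it is not the authors' released code. The transformation g(·) is simplified to a linear layer with batch normalisation and ReLU, features are flattened to (N, dim), and `mask` stands for the assignment matrix A(i, j) from equation (1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelPropagation(nn.Module):
    """Sketch of equation (2): smooth each pixel feature using similar pixels."""

    def __init__(self, dim, gamma=2.0):
        super().__init__()
        self.gamma = gamma
        # Simplified transformation g(.): linear + batch norm + ReLU
        self.g = nn.Sequential(nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU())

    def forward(self, x):                              # x: (N, dim) pixel features
        sim = F.cosine_similarity(x.unsqueeze(1), x.unsqueeze(0), dim=-1)
        weight = sim.clamp(min=0).pow(self.gamma)      # (max(cos, 0))^gamma
        return weight @ self.g(x)                      # y_i = sum_j w_ij * g(x_j)

def pixpro_loss(y, x_mom, mask):
    """Sketch of one term of equation (3); the full loss symmetrises over both views.

    y:     (N, dim) propagated features from the regular branch.
    x_mom: (M, dim) features from the momentum branch.
    mask:  (N, M) assignment A(i, j); the loss averages -cos over positive pairs.
    """
    cos = F.cosine_similarity(y.unsqueeze(1), x_mom.unsqueeze(0), dim=-1)
    return -(cos * mask).sum() / mask.sum().clamp(min=1)

# Example with 49 (7x7) pixel features per view and a made-up assignment mask
ppm = PixelPropagation(dim=256)
loss = pixpro_loss(ppm(torch.randn(49, 256)), torch.randn(49, 256), torch.ones(49, 49))
```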
DeepLabv3+
DeepLabv3+ [40] incorporates an encoder-decoder architecture and is an upgraded version of DeepLabv3 [41]. DeepLabv3+ offers a number of benefits for semantic segmentation tasks, including dense prediction with atrous convolution [42], memory optimisation with depth-wise separable convolution [43] and multi-scale processing using the Atrous Spatial Pyramid Pooling (ASPP) module. The following are some key points of the DeepLabv3+ architecture:
Atrous convolution: Atrous convolution, also called dilated convolution, allows us to increase the spatial resolution of feature maps. The dilation rate in atrous convolution determines the spacing between consecutive values in the kernel. As a result, multi-scale information is acquired by regulating the dilation rate, boosting the network's generalisation capacity. A combined sketch of atrous convolution, depth-wise separable convolution and ASPP is given after this list.
Class          ImageNet (Supervised)   PixPro (Self-supervised)   FisheyePixPro (Self-supervised)
               IoU      Acc            IoU      Acc               IoU      Acc
void           97.23    98.59          97.13    98.29             97.27    98.51
road           93.76    96.12          93.64    96.51             93.91    96.32
lanes          71.46    83.45          69.92    82.47             70.00    83.26
curbs          53.30    81.25          50.05    84.81             52.54    83.05
person         55.29    79.88          52.25    77.02             55.63    78.64
rider          54.92    76.46          52.03    73.84             53.90    76.75
vehicle        88.28    93.11          87.91    92.86             88.56    93.64
bicycle        48.35    72.85          46.45    72.45             48.47    71.81
motorcycle     60.14    80.25          56.74    70.46             59.04    77.14
traffic sign   39.26    65.22          35.76    58.06             38.45    61.26
Table 1: Class-wise IoU score and accuracy on the validation dataset. FisheyePixPro pretraining performs better than PixPro pretraining in a self-supervised setting.
Depth-wise separable convolution: The depth-wise separable convolution operation splits a regular convolution into two components: a depth-wise convolution followed by a point-wise convolution. The depth-wise convolution conducts a spatial convolution for each input channel individually, whereas the point-wise convolution is used to combine the depth-wise convolution's output. This solution not only significantly reduces computational complexity but also enhances performance.
Atrous Spatial Pyramid Pooling: The apparent size of an object varies according to its position in front of the camera. To deal with different sizes of the same object, several studies have proposed extracting features at multiple scales [44] [42]. DeepLabv3+ uses Atrous Spatial Pyramid Pooling (ASPP) with atrous rates of 6, 12 and 18 to process the convolutional neural network output.
Network backbone: In this work, we follow [1, 12, 10] and use a ResNet-50 [45] backbone for pretraining.
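As a rough illustration of the building blocks listed above (not the DeepLabv3+ reference implementation), the sketch below combines depth-wise separable convolutions with the atrous rates 6, 12 and 18 in a small ASPP-style module; the channel sizes and the omission of the image-pooling branch are simplifying assumptions.

```python
import torch
import torch.nn as nn

class SeparableConv(nn.Module):
    """Depth-wise convolution followed by a point-wise (1x1) convolution."""

    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=dilation,
                                   dilation=dilation, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

class ASPP(nn.Module):
    """ASPP-style multi-scale context: parallel atrous branches, then fusion."""

    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1, bias=False)]              # 1x1 branch
            + [SeparableConv(in_ch, out_ch, dilation=r) for r in rates]
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

# Example: 7x7 backbone features with 2048 channels (ResNet-50 output)
out = ASPP(2048, 256)(torch.randn(1, 2048, 7, 7))
print(out.shape)   # torch.Size([1, 256, 7, 7])
```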
EXPERIMENTS
We investigated whether pretraining a FisheyePixPro model
leads to better representation learning than a regular PixPro
model. To evaluate this hypothesis, we adopt state-of-the-art
Deeplabv3+ model with ResNet-50 backbone on WoodScape
dataset. These experiments were carried out using a PyTorch
[46] based implementation. For PixPro model, the ResNet-50
encoder is initialised with the weights provided by [1] and Im-
ageNet model uses weights from ImageNet pretrained weights.
All training images were resized to 640x480 pixels. These models
were trained for 100k iterations using Nvidia V100, 16GB GPU
with batch size of 6. We used SGD optimiser with initial learning
rate=0.01, momentum=0.9, weight decay=0.0005. We adopted
poly learning rate scheduling scheme with power=0.9, minimum
learning rate=0.0001. To overcome the problem of class imbal-
ance, we also used weighted categorical cross-entropy.
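A hedged sketch of the training configuration described above in plain PyTorch; the class weights, the stand-in model and the exact scheduler wiring are assumptions rather than the authors' released configuration.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model and class weights; the actual weights used for
# the WoodScape classes are not given in the paper.
model = nn.Conv2d(3, 10, 1)
class_weights = torch.ones(10)

criterion = nn.CrossEntropyLoss(weight=class_weights)      # weighted categorical CE
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)

max_iters, power, min_lr, base_lr = 100_000, 0.9, 1e-4, 0.01

def poly_lr(it):
    """Poly schedule: lr = base_lr * (1 - it/max_iters)^power, floored at min_lr."""
    lr = base_lr * (1 - it / max_iters) ** power
    return max(lr, min_lr) / base_lr                        # LambdaLR expects a multiplier

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly_lr)

# Inside the training loop, after each iteration: optimizer.step(); scheduler.step()
```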
Table 1 shows a detailed view of the class-wise IoU scores for the PixPro, FisheyePixPro and ImageNet pretrained models on the WoodScape segmentation task. It can be seen that our FisheyePixPro model outperforms the PixPro model on the fisheye dataset, while achieving comparable performance to the supervised ImageNet pretrained model. Note that the supervised ImageNet pretrained model was pretrained on the ImageNet dataset, which requires 1.2M images with classification labels, whereas the FisheyePixPro model was pretrained without using labels.
Figure 3: Qualitative results on the downstream segmentation task on the WoodScape dataset. Rows show fisheye images, ground truth, results from the ImageNet supervised pretrained model (baseline), PixPro and our FisheyePixPro, respectively. The PixPro and FisheyePixPro models were trained using an unlabelled dataset, while the ImageNet pretrained model was trained on 1.2M images with corresponding classification labels.
We also provide visual results on the validation set in Figure 3. Table 2 compares the mean intersection over union (mIoU) score and average accuracy on the WoodScape segmentation task for our FisheyePixPro, PixPro and ImageNet pretrained models. Our FisheyePixPro method achieves a significantly higher mIoU score than the standard PixPro, while the supervised ImageNet pretrained model achieves the highest score.
Pretraining                        mIoU     aAcc
ImageNet (Supervised)              66.2     97.20
PixPro (Self-supervised)           64.19    97.06
FisheyePixPro (Self-supervised)    65.78    97.21
Table 2: Comparing the proposed FisheyePixPro pretraining with PixPro and supervised pretraining on the ImageNet dataset. The results are evaluated on the WoodScape dataset for semantic segmentation.
These results demonstrate that our FisheyePixPro pretraining helps the model learn better representations compared to the PixPro model.
CONCLUSION
In this work, we built a successful pretraining framework using a pixel-level contrastive learning pretext task for fisheye images. We demonstrated that our FisheyePixPro method achieves better feature representations and transfer performance using fisheye images for dense prediction. To the best of our knowledge, this is the first attempt to pretrain a contrastive learning based model directly on fisheye images. Our findings show that there is potential to define a pixel-level pretext task for fisheye images that can alleviate the effect of non-linear distortion and learn generic visual representations.
ACKNOWLEDGEMENTS
“This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant number 18/CRT/6049.” “The authors wish to acknowledge the Irish Centre for High-End Computing (ICHEC) for the provision of computational facilities and support.” The authors would like to thank Valeo Vision Systems for providing the opportunity to work on the research project. We would also like to thank Lucie Yahiaoui for providing a detailed review.
References
[1] Z. Xie, Y. Lin, Z. Zhang, Y. Cao, S. Lin, and H. Hu, “Propagate
yourself: Exploring pixel-level consistency for unsupervised visual
representation learning,” 2021 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 16679–16688, 2021. 1,
2,3
[2] C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual rep-
resentation learning by context prediction,” in 2015 IEEE Inter-
national Conference on Computer Vision (ICCV), pp. 1422–1430,
2015. 1
[3] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in Computer Vision – ECCV 2016 (B. Leibe, J. Matas, N. Sebe, and M. Welling, eds.), (Cham), pp. 649–666, Springer International Publishing, 2016. 1
[4] M. Noroozi and P. Favaro, “Unsupervised learning of visual repre-
sentations by solving jigsaw puzzles,” in Computer Vision – ECCV
2016 (B. Leibe, J. Matas, N. Sebe, and M. Welling, eds.), (Cham),
pp. 69–84, Springer International Publishing, 2016. 1
[5] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised rep-
resentation learning by predicting image rotations,” ArXiv,
vol. abs/1803.07728, 2018. 1
[6] I. Misra and L. v. d. Maaten, “Self-supervised learning of pretext-
invariant representations,” in Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition, pp. 6707–6717,
2020. 1
[7] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox, “Discriminative unsupervised feature learning with exemplar convolutional neural networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 9, pp. 1734–1747, 2015. 1
[8] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learn-
ing via non-parametric instance discrimination,” 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 3733–
3742, 2018. 1
[9] P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” in Advances in Neural Information Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), vol. 32, Curran Associates, Inc., 2019. 1
[10] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple frame-
work for contrastive learning of visual representations,” in Proceed-
ings of the 37th International Conference on Machine Learning
(H. D. III and A. Singh, eds.), vol. 119 of Proceedings of Machine
Learning Research, pp. 1597–1607, PMLR, 13–18 Jul 2020. 1,2,3
[11] I. Misra and L. van der Maaten, “Self-supervised learning of pretext-invariant representations,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6706–6716, 2020. 1
[12] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast
for unsupervised visual representation learning,” in 2020 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 9726–9735, 2020. 1,2,3
[13] C. Eising, J. Horgan, and S. Yogamani, “Near-field perception for
low-speed vehicle automation using surround-view fisheye cam-
eras,” IEEE Transactions on Intelligent Transportation Systems,
pp. 1–18, 2021. 1
[14] L. Gallagher, V. R. Kumar, S. Yogamani, and J. B. McDonald, “A
hybrid sparse-dense monocular slam system for autonomous driv-
ing,” in Proc. of ECMR, pp. 1–8, IEEE, 2021. 1
[15] I. Sobh, A. Hamed, V. Ravi Kumar, and S. Yogamani, “Adversar-
ial attacks on multi-task visual perception for autonomous driv-
ing,” Journal of Imaging Science and Technology, vol. 65, no. 6,
pp. 60408–1, 2021. 1
[16] V. Ravi Kumar, S. Yogamani, H. Rashed, G. Sistu, C. Witt, I. Leang, S. Milz, and P. Mäder, “Omnidet: Surround view cameras based multi-task visual perception network for autonomous driving,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 2830–2837, 2021. 1
[17] H. Rashed, E. Mohamed, G. Sistu, V. Ravi Kumar, C. Eising, A. El-
Sallab, and S. Yogamani, “Generalized Object Detection on Fisheye
Cameras for Autonomous Driving: Dataset, Representations and
Baseline,” in Proceedings of the Winter Conference on Applications
of Computer Vision, pp. 2272–2280, 2021. 2
[18] A. Dahal, V. R. Kumar, S. Yogamani, and C. Eising, “An online learning system for wireless charging alignment using surround-view fisheye cameras,” arXiv preprint arXiv:2105.12763, 2021. 2
[19] H. Rashed, E. Mohamed, V. Ravi Kumar, G. Sistu, C. Eising, A. El-Sallab, and S. Yogamani, “FisheyeYOLO: Object Detection on Fisheye Cameras for Autonomous Driving,” Machine Learning for Autonomous Driving NeurIPS 2020 Virtual Workshop, 2020. 2
[20] M. Uricar, G. Sistu, H. Rashed, A. Vobecky, V. Ravi Kumar, P. Krizek, F. Burger, and S. Yogamani, “Let's get dirty: Gan based data augmentation for camera lens soiling detection in autonomous driving,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 766–775, 2021. 2
[21] A. Das, P. Křížek, G. Sistu, F. Bürger, S. Madasamy, M. Uřičář, V. Ravi Kumar, and S. Yogamani, “TiledSoilingNet: Tile-level Soiling Detection on Automotive Surround-view Cameras Using Coverage Metric,” in 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pp. 1–6, IEEE, 2020. 2
[22] M. Uřičář, J. Ulicny, G. Sistu, H. Rashed, P. Krizek, D. Hurych, A. Vobecky, and S. Yogamani, “Desoiling dataset: Restoring soiled areas on automotive fisheye cameras,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0, 2019. 2
[23] A. Dahal, E. Golab, R. Garlapati, V. Ravi Kumar, and S. Yogamani,
“RoadEdgeNet: Road Edge Detection System Using Surround View
Camera Images,” in Electronic Imaging, 2021. 2
[24] M. M. Dhananjaya, V. R. Kumar, and S. Yogamani, “Weather and
light level classification for autonomous driving: Dataset, baseline
and active learning,” in 2021 IEEE International Intelligent Trans-
portation Systems Conference (ITSC), pp. 2816–2821, 2021. 2
[25] L. Yahiaoui, M. Uřičář, A. Das, and S. Yogamani, “Let the sunshine in: Sun glare detection on automotive surround-view cameras,” Electronic Imaging, vol. 2020, no. 16, pp. 80–1, 2020. 2
[26] M. Yahiaoui, H. Rashed, L. Mariotti, G. Sistu, I. Clancy, L. Yahiaoui, V. R. Kumar, and S. K. Yogamani, “Fisheyemodnet: Moving object detection on surround-view cameras for autonomous driving,” ArXiv, vol. abs/1908.11789, 2019. 2
[27] V. R. Kumar, S. A. Hiremath, M. Bach, S. Milz, C. Witt, C. Pinard, S. Yogamani, and P. Mäder, “Fisheyedistancenet: Self-supervised scale-aware distance estimation using monocular fisheye camera for autonomous driving,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 574–581, 2020. 2
[28] R. K. Varun, S. Yogamani, S. Milz, and P. Mäder, “FisheyeDistanceNet++: Self-Supervised Fisheye Distance Estimation with Self-Attention, Robust Loss Function and Camera View Generalization,” in Electronic Imaging, 2021. 2
[29] V. Ravi Kumar, M. Klingner, S. Yogamani, S. Milz, T. Fingscheidt, and P. Mäder, “Syndistnet: Self-supervised monocular fisheye camera distance estimation synergized with semantic segmentation for autonomous driving,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 61–71, 2021. 2
[30] R. K. Varun, M. Klingner, S. Yogamani, M. Bach, S. Milz, T. Fingscheidt, and P. Mäder, “SVDistNet: Self-supervised near-field distance estimation on surround view fisheye cameras,” IEEE Transactions on Intelligent Transportation Systems, 2021. 2
[31] R. K. Varun, S. Yogamani, M. Bach, C. Witt, S. Milz, and P. Mäder, “UnRectDepthNet: Self-Supervised Monocular Depth Estimation using a Generic Framework for Handling Common Camera Distortion Models,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, 2020. 2
[32] K. El Madawi, H. Rashed, A. El Sallab, O. Nasr, H. Kamel, and
S. Yogamani, “Rgb and lidar fusion based 3d semantic segmentation
for autonomous driving,” in 2019 IEEE Intelligent Transportation
Systems Conference (ITSC), pp. 7–12, IEEE, 2019. 2
[33] A. Konrad, C. Eising, G. Sistu, J. B. McDonald, R. C. Villing, and
S. K. Yogamani, “Fisheyesuperpoint: Keypoint detection and de-
scription network for fisheye images,” ArXiv, vol. abs/2103.00191,
2021. 2
[34] G. Sistu, I. Leang, and S. Yogamani, “Real-time joint object detection and semantic segmentation network for automated driving,” NeurIPS 2018 Workshop on Machine Learning on the Phone and other Consumer Devices, 2019. 2
[35] P. Maddu, W. Doherty, G. Sistu, I. Leang, M. Uřičář, S. Chennupati, H. Rashed, J. Horgan, C. Hughes, and S. Yogamani, “Fisheyemultinet: Real-time multi-task learning architecture for surround-view automated parking system,” in Proceedings of the Irish Machine Vision and Image Processing Conference (IMVIP), 2019. 2
[36] I. Leang, G. Sistu, F. Bürger, A. Bursuc, and S. Yogamani, “Dynamic task weighting methods for multi-task networks in autonomous driving systems,” in 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pp. 1–8, IEEE, 2020. 2
[37] S. Houben, S. Abrecht, M. Akila, A. Bär, et al., “Inspect, Understand, Overcome: A Survey of Practical Methods for AI Safety,” CoRR, vol. abs/2104.14235, 2021. 2
[38] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon,
“Bundle adjustment — a modern synthesis,” in Vision Algorithms:
Theory and Practice (B. Triggs, A. Zisserman, and R. Szeliski,
eds.), (Berlin, Heidelberg), pp. 298–372, Springer Berlin Heidel-
berg, 2000. 2
[39] S. K. Yogamani, C. Hughes, J. Horgan, G. Sistu, P. Varley, et al., “Woodscape: A multi-task, multi-camera fisheye dataset for autonomous driving,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9307–9317, 2019. 2
[40] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam,
“Encoder-decoder with atrous separable convolution for semantic
image segmentation,” in Proceedings of the European conference
on computer vision (ECCV), pp. 801–818, 2018. 3
[41] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017. 3
[42] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L.
Yuille, “Deeplab: Semantic image segmentation with deep convo-
lutional nets, atrous convolution, and fully connected crfs,” IEEE
transactions on pattern analysis and machine intelligence, vol. 40,
no. 4, pp. 834–848, 2017. 3
[43] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient con-
volutional neural networks for mobile vision applications,” 2017. 3
[44] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
network,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 2881–2890, 2017. 3
[45] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proceedings of the IEEE conference on com-
puter vision and pattern recognition, pp. 770–778, 2016. 3
[46] M. Contributors, “MMSegmentation: Openmmlab semantic
segmentation toolbox and benchmark.” https://github.com/
open-mmlab/mmsegmentation, 2020. 3