Car crash detection in videos
Veronica Radu∗, Mihai Nan†, Mihai Trăscău‡, David Traian Iancu§, Alexandra Ștefania Ghiță¶ and Adina Magda Florea‖
Faculty of Automatic Control and Computers, University Politehnica of Bucharest
Romania
Email: ∗veronica.radu@stud.acs.upb.ro, †mihai.nan@upb.ro, ‡mihai.trascau@upb.ro, §david.iancu@upb.ro,
¶stefania.a.ghita@upb.ro, ‖adina.florea@upb.ro
Abstract—The increasing number of cars and excessive traffic congestion in cities are major problems at the present time. Statistics show that more and more accidents happen daily, and many of them could be avoided. This article aims to develop a system capable of detecting the possibility of an accident by analyzing a video sequence. To this end, the paper presents a dataset built on top of those available on the Internet and a series of video classification models. Based on the experimental results obtained, we show the main vulnerabilities of each model. The main contribution is that we tested different architectures on a dataset that contains videos recorded from a first-person camera perspective, which makes this behavior harder to generalize but more useful for autonomous driving systems.
Index Terms—autonomous driving, deep learning, car crash
detection, video classification, depth estimation
I. INTRODUCTION
In recent years the number of cars has increased considerably and, implicitly, so has the number of accidents. Many of them are caused by human error, and lately the research community has become interested in finding a solution, focusing its attention on autonomous driving systems. Recognizing car accidents plays an essential role in building a good self-driving car or driving assistance system. However, most of the models proposed so far are limited: they rely on small datasets, they assume a static camera position, and they require tremendous work to produce complex annotations of the data. Therefore, this problem is far from solved and research is still required to find an efficient and robust solution. In this paper, we propose an approach to this problem using a classifier based on deep neural networks.
For a human being it is natural to estimate how close the objects around us are, but for a machine it is hard to predict depth from a single image. This kind of understanding is essential for autonomous driving systems, robots, and localization and navigation systems, and it can be captured easily using specialized sensors. Considering that this information is essential for correctly detecting an accident, we decided to use a depth estimation module in the proposed model.
The main contributions of our work include the following:
• We proposed a protocol for creating a dataset that can be used for the problem of car crash detection.
• We proposed an architecture based on ResNet [1] and Long Short-Term Memory (LSTM) for video classification to detect accidents.
• We have demonstrated how the performance of the proposed classifier can be improved by using a depth estimation module.
II. RELATED WORK
In recent years, there has been extensive research in this
area and several approaches have been discovered. In [2]
the authors proposed an architecture that aims to recognize
real-time accidents, using YOLO [3] to detect the objects in
the scene and the tracking algorithm proposed by Danelljan
in [4], based on the discriminative correlation filter method.
To detect car crashes, they used the VIolent Flows (ViF)
descriptor to highlight the magnitude changes of the motion
vectors, previously computed with an optical flow algorithm.
This vector of frequencies was then used to train a Support
Vector Machine (SVM) classifier.
Another supervised method, based on classic computer vision techniques, was proposed in [5]. Similar to the previously described paper, it has three main components: object detection (using Mask R-CNN [6]), vehicle tracking (using a simplified version of the centroid tracking algorithm), and a classification component that determines whether a vehicle was involved in an accident by assigning a score based on the overlapping bounding boxes between two objects and on anomalies in acceleration, trajectory, and change of angle.
An and Kim [7] considered that the algorithms proposed for
the problem of car crash detection were limited to several types
of scenarios. Therefore, they proposed an approach that takes
into account road scenes and situations. The extended version
works with data from the 3D sensor and treats two possible
situations differently: normal driving and car parking. In the
proposed algorithm, the data provided by the accelerometer
are processed to reduce noise and are used to correct the
drift of the gyroscope. An important step in this algorithm
is the calibration of the sensor. This calibration contains two
stages: an intrinsic calibration and an extrinsic calibration. The
effective detection of accidents is done according to some
threshold values.
Furthermore, this problem can be formulated as anomaly
detection in videos and the baseline for this approach is pro-
posed by Liu et al. in [8]. In this paper, they introduce a model
based on Generative Adversarial Networks (GAN) in which
the generator is a modified version of the UNet architecture
[9]. This component receives as input a sequence of frames and
predicts the next frame. To compare the generated image with
the ground truth they used intensity loss and gradient loss. In
addition, they added a motion constraint: enforcing the optical flow between the predicted image and the real one to be close. For this component, they used the FlowNet architecture
[10]. To improve the quality of the generated images, they use
a discriminator to distinguish between a real and a fake image.
This architecture is trained on normal data and it is tested on
videos with abnormal events.
Our approaches differ from those proposed in the literature because we do not use a specialized object detection module. We start from the premise that a module based on a network with good performance on the image classification task will be able to detect spatial dependencies and context-related features. We also demonstrate that the performance of this approach can be improved by using depth maps.
III. DATASET
For the problem we are trying to solve, it is a real challenge to find the perfect dataset. Although thousands of videos from closed-circuit television (CCTV) cameras or dashcams showing different traffic scenes can be found on YouTube, it requires tremendous work to properly annotate them and create a complex dataset. In our experiments, we used videos recorded from a first-person camera perspective, because they offer scene diversity and heterogeneity for the observed objects. We managed to find two datasets: one published by VSLab [11] (an academic research group involved in predicting car accidents) and the A3D dataset used in [12] for anomaly detection.
A. VSLab dashcam video dataset
There are 1730 videos captured by dashcams mounted on vehicles in Taiwan and collected by VSLab (in the following sections we will refer to this as the VSLab dataset). Of these, 620 are positive videos (where an accident occurs), covering several types of accidents: motorbike hits car (42.6%), car hits car (19.7%), motorbike hits motorbike (15.6%), and other types (20%).
In most of the samples, the scenes are very crowded, with many pedestrians, and they include many vehicle types: mopeds, bicycles, motorcycles, etc. The diversity is further extended by different weather and illumination conditions.
Although the dataset contains complex scenarios, the number of samples is not large enough to support such a variety. Regarding the labeling, the dataset is split into positive and negative samples, without offering any information about the moment when the accident occurs.
As for the video properties, our analysis shows that each sample has a duration of 4 s at 25 FPS, and most of the samples have 720p resolution.
B. A3D dataset
A dataset similar to the one described above is A3D [12]. It contains 203 compilation videos, totaling 1500 crashes. There are 18 different types of accidents, involving bikes, motorcycles, cars, trucks, persons, and even animals. The videos are captured in various weather conditions (sunny, rainy, snowy) and at different moments of the day (mostly during the day, but an important percentage of the crashes were collected at night).
The data is annotated with the start and end time of each crash, the index of the accident in the current video, and two values that indicate whether the ego-vehicle is involved in the accident and whether it is the only participant in the crash.
Approximately 60% of the videos capture an accident in which the driver is involved, while the rest of the samples offer a third-person view of the crashes.
Compared with the VSLab dataset, our analysis shows that the scenes with accidents do not all have the same length, nor the same FPS. We analyzed this dataset and found that most of the videos have 30 FPS and that the length of each video varies. Regarding the image resolution, all the videos are 1280 pixels wide and 720 pixels high.
C. Our dataset variant
In our experiments, we needed a dataset with a sufficient number of samples from each category: positive and negative (normal traffic conditions). For this reason, we decided to combine the negative videos from the VSLab dataset with all the samples from A3D, obtaining in this way a balanced dataset.
Also, because A3D has a complex annotation, it can be used for many tasks. Because the labelling includes a binary value for each frame (1 for an accident and 0 when there is no accident), we can use the VSLab dataset in the same way, considering that all its frames are labeled with 0. However, there is a constraint when using the A3D dataset: the labels can be applied only after the videos are split using a frame rate of 10 FPS. Also, since the VSLab dataset has a fixed length of 4 seconds, we split the VSLab videos using 15 FPS, obtaining in total 60 images per video. To keep the same length for the videos from A3D, we kept only the videos with more than 60 images and then tried to find the best 60-image window that contains some frames before the accident (at least 5). For our experiments, we used K-frame sequences; the label for each sample is the label of the last frame in the sequence.
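As a clarifying sketch, the window selection and sequence labelling described above could be expressed as follows; the function names and the exact rule for choosing the "best" window are our own assumptions, not the released preprocessing code.

```python
import numpy as np

def select_window(frame_labels, window=60, min_pre=5):
    """Pick a 60-frame window keeping at least `min_pre` frames before the
    first accident frame (label 1). Returns (start, end) or None if the
    video is unusable. Hypothetical helper, not the authors' exact script."""
    n = len(frame_labels)
    if n < window:
        return None
    first_crash = next((i for i, l in enumerate(frame_labels) if l == 1), None)
    if first_crash is None:
        return 0, window                      # negative video: first 60 frames
    if first_crash < min_pre:
        return None                           # not enough frames before the crash
    start = min(first_crash - min_pre, n - window)
    return start, start + window

def sequence_label(frame_labels, start, k):
    """A K-frame sequence takes the label of its last frame."""
    return frame_labels[start + k - 1]
```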
IV. PROPOSED METHODS
We chose to tackle this problem in a supervised manner, using a fully labelled dataset built on two existing datasets, and we formulated it as a binary classification problem. A video is a sequence of images, so the difference between video classification and image classification is that in videos we also have to capture the temporal dependencies between frames.
A. Depth estimation
Although the most common and accurate method for measuring distances is using LIDAR sensors, it is unaffordable for an everyday driver. Recent publications [13]–[16] proved that it is possible to train depth estimation models using stereo pairs or monocular videos. Since distance is extremely important for understanding a traffic scene, for
our problem it might be useful to have the depth map for each frame in a video.

Fig. 1. A sample from the Lyft L5 dataset (left: the original image; right: the image with the depth points collected from the sensor).

Fig. 2. The architecture of the proposed classification model, based on ResNet for the extraction of spatial characteristics and LSTM for the detection of dependencies between frames.

We managed to find two such pre-trained
models for depth prediction, namely Monodepth2 [14] and
PackNet [13]. To compare their performance, we evaluated
them on a new dataset called Lyft L5 [17]. However, this
dataset does not have dense maps that can be used as ground
truth and therefore we evaluated the predictions only on the points extracted from the LIDAR sensors.
We used the validation and testing parts of the Lyft L5 perception dataset, totaling 218 scenes. From each scene, we extracted the first and the last samples and projected the point cloud from the top LIDAR sensor onto the image from the front camera. The depth map was computed using the x and y coordinates of the points for the pixel positions and the z values as the depth values. An example can be observed in Figure 1.
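A minimal sketch of this projection step is shown below, assuming the points are already expressed in the front camera's frame and that a 3x3 intrinsic matrix K is available (the Lyft L5 devkit exposes these transforms; the helper name is ours).

```python
import numpy as np

def sparse_depth_map(points_cam, K, height, width):
    """Project LIDAR points (N x 3, already in the camera frame) onto the
    image plane and keep z as depth. Pixels without a LIDAR return stay 0.
    `K` is the 3x3 camera intrinsic matrix (an assumption)."""
    depth = np.zeros((height, width), dtype=np.float32)
    pts = points_cam[points_cam[:, 2] > 0]      # keep points in front of the camera
    proj = (K @ pts.T).T                        # (N, 3) homogeneous image coordinates
    u = (proj[:, 0] / proj[:, 2]).astype(int)
    v = (proj[:, 1] / proj[:, 2]).astype(int)
    z = pts[:, 2]
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth[v[valid], u[valid]] = z[valid]
    return depth
```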
B. Video classification approaches
The first proposed approach consists of a binary video
classifier that contains two fundamental components: a module
to extract relevant spatial features from each frame of the video
and a module to analyze the previously extracted features
to discover the temporal dependencies between them. There
are lots of architectures that have successfully addressed the
problem of image classification, such as ResNet [1], DenseNet
[18], Inception [19], VGG [20], EfficientNet [21]. All these
models were trained on the ImageNet [22] dataset and can
extract important features from images. In our experiments we
decided to compare the performances of two of them: ResNet
[1] – widely used in classification tasks as well as a backbone
for many computer vision tasks and EfficientNet [21] – a
new model published in 2019 that promises 10 times better
efficiency than the existing models. The best performances
were obtained with ResNet. To determine the temporal depen-
dencies between the obtained features we decided to use an
LSTM module. Finally, we kept the output provided by the LSTM at the last time step and passed it through fully connected layers to obtain the classification. Before the last fully connected layer, we used the ReLU activation function, followed by Dropout (p = 0.5). The architecture of this approach is presented in Figure 2.
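A minimal PyTorch sketch of this ResNet + LSTM classifier is given below; the hidden sizes and the number of fully connected layers are illustrative assumptions, since the exact dimensions are not listed in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

class CrashClassifier(nn.Module):
    """Sketch of the ResNet + LSTM classifier described above.
    Layer sizes are assumptions, not the exact configuration."""
    def __init__(self, hidden=256):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # 2048-d per frame
        self.lstm = nn.LSTM(2048, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(64, 2),
        )

    def forward(self, clip):                    # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.features(clip.flatten(0, 1)).flatten(1)   # (B*T, 2048)
        out, _ = self.lstm(feats.view(b, t, -1))
        return self.head(out[:, -1])            # keep only the last LSTM step
```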
In the second proposed approach, we also included the depth map obtained for each frame. For this extended version, we ran several experiments. Initially, we changed the first layer of ResNet50 to accept a 4-channel input and attached the depth map to each frame as a fourth channel. Another approach was to use a separate stream of 2D convolutions for the depth and to feed all the RGB frames through ResNet50; we then concatenated the two input streams and passed the features through an LSTM and then through several dense layers, as before. Another experiment was to add the depth information as an auxiliary loss, by using the decoder from the U-Net [23] architecture with ResNet [1] as an encoder; we also kept the residual connections for this architecture. The best results were obtained with the two-stream variant, and the architecture proposed for this variant is presented in Figure 3. To fuse the data from the two streams, we decided to use the concatenation operation. To generate the depth maps, we chose to use the Monodepth2 [14] architecture, trained on KITTI [24].

Fig. 3. The architecture of the extended model with two streams: one that analyzes the RGB frames and one that analyzes the depth maps.
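A sketch of the best-performing two-stream variant could look like the following; the depth branch (the "CNNs-based module" in Figure 3) and all layer sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoStreamCrashClassifier(nn.Module):
    """Sketch of the two-stream variant: ResNet features for RGB frames,
    a small CNN for depth maps, concatenation, then LSTM + dense layers.
    The depth CNN and layer sizes are illustrative assumptions."""
    def __init__(self, hidden=256):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        self.rgb_stream = nn.Sequential(*list(backbone.children())[:-1])   # 2048-d
        self.depth_stream = nn.Sequential(                                  # 1-channel depth maps
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(2048 + 64, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(),
                                  nn.Dropout(0.5), nn.Linear(64, 2))

    def forward(self, rgb, depth):              # rgb: (B,T,3,H,W), depth: (B,T,1,H,W)
        b, t = rgb.shape[:2]
        f_rgb = self.rgb_stream(rgb.flatten(0, 1)).flatten(1)
        f_d = self.depth_stream(depth.flatten(0, 1)).flatten(1)
        fused = torch.cat([f_rgb, f_d], dim=1).view(b, t, -1)   # concatenate the streams
        out, _ = self.lstm(fused)
        return self.head(out[:, -1])
```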
V. RESULTS AND DISCUSSION
A. Implementation details
In all our experiments we used Adam as the optimizer, with a learning rate of 0.0001, which we reduced on plateau with a patience of 2 epochs. In each experiment, the model was trained for 15 epochs. After labelling, the dataset was balanced (50% positive samples and 50% negative samples) and split into train (80%), validation (10%) and test (10%). We performed all the experiments using two Tesla P100 PCIe GPUs.
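For clarity, this training setup corresponds to a configuration along the following lines, reusing the CrashClassifier sketch above; the data loaders and validation routine are placeholders, not part of the paper.

```python
import torch

# Training setup as described above (data loaders and evaluate() are placeholders).
model = CrashClassifier().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(15):
    model.train()
    for clips, labels in train_loader:          # placeholder DataLoader
        optimizer.zero_grad()
        loss = criterion(model(clips.cuda()), labels.cuda())
        loss.backward()
        optimizer.step()
    val_loss = evaluate(model, val_loader)      # placeholder validation routine
    scheduler.step(val_loss)                    # reduce the learning rate on plateau
```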
B. Results for the depth estimation module
The implementations of Monodepth2 [25] and PackNet
[26] architectures are publicly available on GitHub and the
weights are also provided for several experiments. The metrics
used for evaluation and the results are presented in Table
I. Also, because the training dataset and the evaluation dataset have different scales, we scaled the predictions using the median ground truth, the method also proposed in [14].

TABLE I
Results of the evaluation on the Lyft L5 dataset. We tested the Monodepth2 architecture trained with monocular (M), stereo (S), or joint (MS) supervision and the PackNet architecture trained on the KITTI (K) and DDAD (D) datasets.

Model            Abs.Rel.  Sqr.Rel.  RMSE     RMSElog
Monodepth2 (M)   0.42225   0.05622   0.12139  0.47523
Monodepth2 (S)   0.45228   0.06419   0.12250  0.47827
Monodepth2 (MS)  0.45113   0.06360   0.12584  0.48332
PackNet (K)      0.42703   0.05761   0.12582  0.47101
PackNet (D)      0.98865   0.25018   0.26557  0.95633

Comparing the five pre-trained models we tested, we
can conclude that all the pre-trained models, except PackNet
trained on DDAD [13], have similar results. Based on a qualitative analysis, we observed that with PackNet the objects are slightly distorted in the generated depth maps, so we decided to use Monodepth2 as our depth estimation module.
However, there are several problems with this approach: we need a 1:1 correspondence between frames and depth maps, but Monodepth2 uses a standard set of resolutions and crops the margins of each input sample in order to focus on the centre part. Therefore, we had to resize the input image and to interpolate the generated depth map. Another important problem is the fact that our dataset does not provide the intrinsic parameters of the cameras, and the pose network from Monodepth2 depends on those. However, we wanted to check whether these depth maps could give some architectures hints regarding the distances between cars (especially with respect to the ego-vehicle). In Figure 4 we can observe some generated depth maps for the proposed dataset.
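The per-prediction median scaling and the resize/interpolation workaround mentioned above can be sketched as follows; the bilinear interpolation mode is our assumption.

```python
import numpy as np
import torch
import torch.nn.functional as F

def median_scale(pred, gt_sparse):
    """Scale a predicted depth map by the ratio of median ground-truth depth
    to median predicted depth, evaluated only at pixels with LIDAR returns
    (the scaling strategy proposed in Monodepth2)."""
    mask = gt_sparse > 0
    return pred * (np.median(gt_sparse[mask]) / np.median(pred[mask]))

def resize_depth(depth, height, width):
    """Interpolate a depth map produced at the network's fixed resolution
    back to the original frame size (bilinear interpolation; an assumption)."""
    d = torch.from_numpy(depth)[None, None].float()
    return F.interpolate(d, size=(height, width), mode="bilinear",
                         align_corners=False)[0, 0].numpy()
```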
C. Results for video classification approaches
We compared two of the best-performing models from the ImageNet competition: ResNet50 [1] and EfficientNet [27]. First, we trained each model to predict the presence or absence of an accident in a single image. Then, to move from image classification to video classification, we iterated over each frame, predicted the class probabilities, and averaged the results over the last K frames. After that, we chose the class with the maximum averaged prediction as the label for the current frame.
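This frame-averaging step is simple enough to sketch directly; the helper below is our own formulation of the rule described above, not the authors' code.

```python
from collections import deque
import numpy as np

def rolling_video_labels(frame_probs, k=5):
    """Average the class probabilities of the last `k` frames and pick the
    class with the maximum averaged probability as the current label.
    `frame_probs` is an iterable of per-frame probability vectors."""
    window = deque(maxlen=k)
    labels = []
    for p in frame_probs:
        window.append(np.asarray(p))
        labels.append(int(np.mean(window, axis=0).argmax()))
    return labels
```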
We varied the number of frames we average over between 5, 7 and 10 but, according to Table II, a window size of 5 achieved the best accuracy for each architecture.

TABLE II
Accuracy of the averaged predictions for different window sizes and fine-tuned models

Model            Window Size  Accuracy
ResNet50         5            78.58%
ResNet50         7            78.56%
ResNet50         10           78.48%
EfficientNet-b0  5            76.07%
EfficientNet-b0  7            76.05%
EfficientNet-b0  10           76.04%
EfficientNet-b4  5            76.76%
EfficientNet-b4  7            76.02%
EfficientNet-b4  10           75.75%

The results
shown in Table II were obtained after training each architecture completely. When we froze the first layers, the results were lower by about 4%. We also added dropout before the last dense layer. Comparing the results of EfficientNet [27] and ResNet [1], we can observe that ResNet50 achieved the highest accuracy on this task, but EfficientNet converges faster, the time for an epoch being halved.
Fig. 4. Examples of qualitative results obtained for estimating the depth using samples from the proposed dataset.

For our second approach, instead of averaging consecutive frames, we built the architecture presented in Figure 2. In training, we used sequences of 4 images and tried different augmentations (brightness and contrast adjustment, horizontal flipping, random affine transformations), but the best accuracy we obtained was 74.46%. Looking at different inputs and their
predictions, we noticed that most of the classification errors
are made when the ego vehicle is involved in the accident.
Therefore, to improve the ability of the network to under-
stand the scene, we added depth information. For this part, we
tried three different architectures, presented in section IV-B.
The results of these experiments are presented in Table III.
The best accuracy we obtained was using the depth as a separate stream and concatenating it with the features extracted by ResNet. Also, it seems that using the depth as a fourth image channel does not have a good impact on the tested architecture.

TABLE III
Results of the architectures that use depth information

Experiment            Accuracy
no depth input        74.46%
4-channel input       72.31%
2-stream LSTM         76.32%
auxiliary depth loss  74.58%

TABLE IV
Inference time (msec) of the proposed architectures obtained for sequences with a variable number of frames

                      Sequence Length
Experiment            4        8        16       32       60
no depth input        12.877   19.537   30.904   56.537   103.427
4-channel input       11.153   18.465   30.780   56.292   103.967
2-stream LSTM         20.970   30.154   46.772   80.288   144.081
auxiliary depth loss  28.353   48.772   72.821   137.723  259.198
Table IV presents the inference time (milliseconds) for our
architectures, with and without the depth information. For each
model, we averaged the inference time for 300 iterations on a
Tesla P100 PCIe 16GB. These results confirm that the model
can be used in real-time scenarios, even for sequences with a
larger number of frames.
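The timing protocol can be sketched as below; the warm-up runs and the explicit CUDA synchronization are our assumptions about how such GPU measurements are usually made, not details stated in the paper.

```python
import time
import torch

@torch.no_grad()
def mean_inference_time_ms(model, clip, iters=300, warmup=10):
    """Average GPU inference time in milliseconds over `iters` runs.
    `clip` is a dummy input tensor of the desired sequence length."""
    model.eval()
    for _ in range(warmup):                 # warm-up runs (an assumption)
        model(clip)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(clip)
    torch.cuda.synchronize()                # wait for all GPU work to finish
    return (time.time() - start) * 1000 / iters
```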
VI. CONCLUSIONS AND FUTURE WORK
We tested different architectures and, from the results obtained, we can conclude that it is a real challenge to find an architecture that is tolerant to all the changes and the diversity of the videos recorded by dashcams. We discovered that all the models are sensitive to the accidents in which the ego-vehicle is involved. They usually recognize an accident when a car is too close to the vehicle on which the camera is mounted, but they cannot detect when the ego-vehicle is hit from behind.
Our experiments showed that we cannot classify accidents using a standard state-of-the-art video classification model, because there is great diversity in the types of accidents and the differences between accident and non-accident videos are sometimes very subtle. However, we achieved better results by classifying videos frame by frame, being able in this way to detect the moment when the accident begins.
Depth information is a key point in many problems. In this paper we analyzed the state-of-the-art methods, evaluated several pre-trained models on a new dataset and examined how the depth information can improve a model. From our results, we can conclude that both Monodepth2 [14] and PackNet [13] achieve good performance on a new dataset, even without fine-tuning. Also, using Monodepth2 [14] we generated the depth maps for another dataset and improved the accuracy of a car crash detection architecture by 2%, by including the depth information as input.
Intrinsic camera parameters are very important in Mon-
odepth2 [14], but some methods also learn these parameters
from videos. Gordon et al. [28] proposed an approach that
solves this problem of determining the parameters. They
proposed an architecture that generates the depth, egomotion,
object motion and determines camera intrinsics from monoc-
ular videos, using only consistency across neighbouring video
frames as the supervision signal. As a future step, we could try this approach to avoid the inconsistencies of different types of cameras.
ACKNOWLEDGMENT
This work was supported by both the CCDI-UEFISCDI grant ROBIN (PN-III-P1-1.2-PCCDI2017-0734), "Roboții și Societatea: Sisteme Cognitive pentru Roboți Personali și Vehicule Autonome" (Robots and Society: Cognitive Systems for Personal Robots and Autonomous Vehicles), and the grant PETRA (PN-III-P2-2.1-PED2019-4995), "Detecția și urmărirea persoanelor pentru roboți sociali și mașini autonome" (Person detection and tracking for social robots and autonomous cars).
REFERENCES
[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 770–778, 2016.
[2] V. E. M. Arceda and E. L. Riveros, “Fast car crash detection in video,”
2018 XLIV Latin American Computer Conference (CLEI), pp. 632–637,
2018.
[3] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,”
CoRR, vol. abs/1804.02767, 2018.
[4] M. Danelljan, G. Häger, F. Shahbaz Khan, and M. Felsberg, "Accurate scale estimation for robust visual tracking," in Proceedings of the British Machine Vision Conference, BMVA Press, 2014.
[5] P. Earnest, D. Chand, S. Gupta, and K. Goutham, “Computer vision-
based accident detection in traffic surveillance,” 11 2019.
[6] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, "Mask R-CNN," 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, 2017.
[7] B. An and Y. Kim, “Improved crash detection algorithm for vehicle
crash detection,” Journal of the Semiconductor & Display Technology,
vol. 19, no. 3, pp. 93–99, 2020.
[8] W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for
anomaly detection - a new baseline,” 2018 IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 6536–6545, 2018.
[9] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks
for biomedical image segmentation,” in MICCAI, 2015.
[10] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, "Flownet: Learning optical flow with convolutional networks," 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2758–2766, 2015.
[11] F.-H. Chan, Y.-T. Chen, Y. Xiang, and M. Sun, “Vslab dashcam video
dataset.” https://aliensunmin.github.io/project/dashcam/. Accessed on: 2
February 2020.
[12] Y. Yao, M. Xu, Y. Wang, D. J. Crandall, and E. M. Atkins, “Un-
supervised traffic accident detection in first-person videos,” CoRR,
vol. abs/1903.00618, 2019.
[13] V. Guizilini, R. Ambrus, S. Pillai, and A. Gaidon, “3d packing for self-
supervised monocular depth estimation,” 2020 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 2482–2491,
2020.
[14] C. Godard, O. M. Aodha, M. Firman, and G. Brostow, “Digging
into self-supervised monocular depth estimation,” in 2019 IEEE/CVF
International Conference on Computer Vision (ICCV), pp. 3827–3837,
2019.
[15] Z. Yin and J. Shi, “Geonet: Unsupervised learning of dense depth, optical
flow and camera pose,” 2018 IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 1983–1992, 2018.
[16] V. Casser, S. Pirk, R. Mahjourian, and A. Angelova, “Depth prediction
without the sensors: Leveraging structure for unsupervised learning from
monocular videos,” in AAAI, 2019.
[17] J. Houston, G. Zuidhof, L. Bergamini, Y. Ye, A. Jain, S. Omari,
V. Iglovikov, and P. Ondruska, “One thousand and one hours: Self-
driving motion prediction dataset.” https://level5.lyft.com/dataset/, 2020.
[18] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and
K. Keutzer, “Densenet: Implementing efficient convnet descriptor pyra-
mids,” arXiv preprint arXiv:1404.1869, 2014.
[19] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking
the inception architecture for computer vision,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, pp. 2818–
2826, 2016.
[20] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[21] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for con-
volutional neural networks,” in International Conference on Machine
Learning, pp. 6105–6114, PMLR, 2019.
[22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet:
A large-scale hierarchical image database,” in 2009 IEEE conference on
computer vision and pattern recognition, pp. 248–255, IEEE, 2009.
[23] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks
for biomedical image segmentation,” in International Conference on
Medical image computing and computer-assisted intervention, pp. 234–
241, Springer, 2015.
[24] J. Fritsch, T. Kuehnl, and A. Geiger, “A new performance measure and
evaluation benchmark for road detection algorithms,” in International
Conference on Intelligent Transportation Systems (ITSC), 2013.
[25] "Monodepth2: Monocular depth estimation implementation." https://github.com/AlanNaoto/monodepth2.
[26] “Packnet-sfm: 3d packing for self-supervised monocular depth estima-
tion.” https://github.com/TRI-ML/packnet-sfm. Accessed on: 20 January
2021.
[27] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for
convolutional neural networks,” ArXiv, vol. abs/1905.11946, 2019.
[28] A. Gordon, H. Li, R. Jonschkowski, and A. Angelova, “Depth from
videos in the wild: Unsupervised monocular depth learning from un-
known cameras,” in Proceedings of the IEEE/CVF International Con-
ference on Computer Vision, pp. 8977–8986, 2019.