Drivable path detection using CNN sensor fusion for autonomous
driving in the snow
Nathir A. Rawashdeh^a, Jeremy P. Bos^b, Nader J. Abu-Alrub^b
^a Dept. of Applied Computing, ^b Dept. of Electrical and Computer Engineering,
Michigan Technological University, 1400 Townsend Drive, Houghton, MI USA 49931
ABSTRACT
This work targets the problem of drivable path detection in poor weather conditions including on snow covered roads. A
successful drivable path detection algorithm is vital for safe autonomous driving of passenger cars. Poor weather conditions
degrade vehicle perception systems, including cameras, radar, and laser ranging. Convolutional Neural Network (CNN)
based multi-modal sensor fusion is applied to path detection. A multi-stream encoder-decoder network that fuses camera,
LiDAR, and Radar data is presented here in order to overcome the asymmetrical degradation of sensors by complementing
their measurements. The model was trained and tested using a manually labeled subset from the DENSE dataset. Multiple
metrics were used to assess the model performance.
Keywords: drivable path, autonomous driving, sensor fusion, adverse weather, snow, convolutional neural networks
1. INTRODUCTION
Advanced Driver Assistance Systems (ADAS) define the next milestones toward fully autonomous vehicles. Nonetheless, many challenges must still be overcome to achieve true and full autonomy. These challenges include public acceptance, government regulations, the expensive and complicated processes needed to create datasets, and the corner cases presented by poor or severe weather conditions. The current trend in autonomous vehicle research in general, and in dealing with poor weather conditions in particular, is to benefit from improvements in sensor performance and affordability and to equip vehicles with a mix of sensors that complement each other and provide the vehicle with the best possible perception of its surroundings 1, 2. Typical sensors found in autonomous vehicles are cameras 3, 4, 5, millimeter wave (MMW) Radars, global positioning system (GPS) receivers, inertial measurement units (IMU), Light Detection and Ranging (LiDAR) sensors, and ultrasonic sensors 6. Moreover, the continuous advancement of microprocessor computational power encourages the use of the high-bandwidth data flow generated by the vehicle perception system and facilitates the deployment of state-of-the-art computational methods that can be executed in real time. The next section provides a brief background on key concepts addressed in this work, followed by a description of the project and its goals. Section 3 covers dataset selection, the subset used, and data preprocessing and labeling. Section 4 presents the deep learning model architecture designed and applied in this work, and Section 5 describes training, testing, and the resulting model performance. Finally, conclusions are drawn and improvements and enhancements to the model are suggested.
2. BACKGROUND
2.1 Semantic Segmentation
In semantic segmentation, instead of detecting an object in an image, each pixel is classified individually and assigned to the class that it best represents. In other words, semantic segmentation is a pixel-level, or pixel-wise, classification. A typical semantic segmentation Convolutional Neural Network (CNN) is made of an encoder network and a decoder network. The encoder network downsamples the inputs and extracts features, while the decoder network uses those features to reconstruct and upsample the input and finally assign each pixel to a class. Two key components of decoder networks are the so-called MaxUnpooling layer and the transpose convolution layer. The MaxUnpooling layer is the counterpart of the MaxPooling layer; it distributes the values of a smaller input over a larger output. Several distribution methods exist. A common approach is to store the locations of the maximum values in a MaxPooling layer and use these locations to place the values back in matching positions in a corresponding MaxUnpooling layer. This approach requires the encoder-decoder network to be symmetrical, with each MaxPooling layer in the encoder having a corresponding MaxUnpooling layer on the decoder side. Another approach is to place the values in a predetermined
location (e.g., the upper left corner) of the area at which the kernel is pointing. This method is known as “bed of nails” and is the one used in the model presented in this work. The transpose convolution layer is the opposite of a regular convolution layer: it comprises a moving kernel that scans the input and convolves the values to fill the output image. The output volume of both layers, MaxUnpooling and transpose convolution, can be controlled by adjusting the kernel size, padding, and stride. A major difference between the two is that MaxUnpooling layers have no trainable parameters, while in transpose convolution the kernel values are trainable. There have been many successful models that perform semantic segmentation 6. For example, the very deep convolutional neural network VGG16 7 was originally used for object detection and image classification but was later extended by other researchers to perform semantic segmentation. Another example is SegNet 8, a symmetrical encoder-decoder network that uses VGG16 as its encoder. In addition, ERFNet 9 is an asymmetrical encoder-decoder network that was designed and tested for autonomous driving. Finally, U-Net 10, designed for biomedical applications, uses a U-shaped architecture and skip connections that copy feature maps forward to corresponding layers on the upsampling side to achieve higher segmentation resolution.
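As an illustration of these upsampling mechanisms, the short PyTorch sketch below contrasts index-based MaxUnpooling, a simple “bed of nails” placement as described above, and a transpose convolution layer. It is a minimal toy example with arbitrary sizes, not code from the model itself.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)  # a toy single-channel feature map

# Index-based unpooling: the MaxPool layer records where each maximum came
# from, and the MaxUnpool layer places the values back at those locations.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
y, idx = pool(x)          # y: 1x1x4x4, idx: locations of the maxima
x_rec = unpool(y, idx)    # back to 1x1x8x8, zeros everywhere except the maxima

# "Bed of nails" unpooling: place each value at a fixed position (upper-left
# corner) of its output block and fill the rest with zeros; no indices needed.
def bed_of_nails(t, k=2):
    n, c, h, w = t.shape
    out = t.new_zeros(n, c, h * k, w * k)
    out[:, :, ::k, ::k] = t
    return out

x_bon = bed_of_nails(y)   # also 1x1x8x8

# Transpose convolution: a trainable upsampling layer; kernel size, padding,
# and stride control the output volume, and the kernel weights are learned.
tconv = nn.ConvTranspose2d(1, 1, kernel_size=3, padding=1, stride=1)
x_up = tconv(x_bon)       # spatial size preserved (3x3 kernel, padding 1, stride 1)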
2.2 Drivable Path Detection
A drivable path is “the space in which the car can move safely in a physical sense, but without taking symbolic information into account” 11. Detecting it is especially important in scenarios where a vehicle is in a parking lot, off-road, on an unmarked road, or driving in snowy or foggy conditions. Drivable path detection can be implemented as a preliminary step before lane or object detection, since it greatly reduces the input volume for the possibly deeper networks that carry out other detection tasks, for example by limiting the search area of a lane detection algorithm to regions classified as “drivable” (see the sketch below). Drivable path detection is a semantic segmentation problem in which the goal is to generate a pixel-wise classification after training on a pixel-wise labeled dataset. The number of classes is usually two, i.e., drivable path and non-drivable path 12.
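As a small, hedged illustration of restricting a downstream search area with a drivable-path mask, the sketch below assumes a hypothetical camera frame and a binary pixel mask produced elsewhere in a pipeline.

import numpy as np

def restrict_search_area(image: np.ndarray, drivable_mask: np.ndarray) -> np.ndarray:
    """Zero out everything outside the drivable region so that a subsequent
    lane-detection step only scans pixels classified as drivable.
    image: H x W x 3 camera frame; drivable_mask: H x W array of 0/1 labels."""
    return image * drivable_mask[..., np.newaxis].astype(image.dtype)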
2.3 Sensor Fusion
The goal of sensor fusion is to combine readings from multiple sensors to decrease measurement uncertainty. It can also be used to gain new types of information or perspectives not available from the original readings individually. Sensor fusion techniques can be categorized in multiple ways. For example, they can be classified as homogeneous or non-homogeneous fusion: in the former, measurements from similar sensors are combined, while in non-homogeneous fusion data are taken from different sensing modalities. Examples of homogeneous sensor fusion include refining GPS location estimates using data from multiple satellites, and using multiple cameras in a stereovision configuration to extract depth information. On the other hand, combining data from cameras, LiDAR, and Radar in autonomous driving applications is an example of non-homogeneous sensor fusion.
Another important categorization of sensor fusion techniques relies on the stage at which the fusion takes place. Sensor fusion methods can be classified into early fusion (or raw data fusion) and late fusion. In early fusion, raw data are stacked or concatenated and then used as a unified data container. For example, researchers 13 inserted the depth information from a LiDAR into a fourth channel of the camera data to form RGBD images that are then processed by their model. In late fusion, sensor measurements are processed separately and decisions are made based on the individual data streams; these decisions are then combined at a higher level of abstraction. Between these two extreme cases, sensor fusion can take place at any point along the CNN data processing and feature extraction pipeline, as illustrated in the sketch below.
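The following minimal sketch contrasts the two extremes under simple assumptions: a single RGB frame, a LiDAR depth map already projected to the image plane, and two hypothetical per-stream classifiers. It is illustrative only and is not the fusion scheme used later in this paper.

import numpy as np

rgb = np.random.rand(256, 480, 3).astype(np.float32)    # camera frame (placeholder)
depth = np.random.rand(256, 480, 1).astype(np.float32)  # LiDAR depth projected to image plane

# Early fusion: stack raw modalities into one RGBD container before any model runs.
rgbd = np.concatenate([rgb, depth], axis=-1)             # shape (256, 480, 4)

# Late fusion: each modality is classified separately and the per-pixel
# decisions (here, drivable-path pseudo-probabilities) are combined afterwards.
def camera_branch(x):   # placeholder per-stream model
    return x.mean(axis=-1, keepdims=True)

def lidar_branch(x):    # placeholder per-stream model
    return x

p_cam = camera_branch(rgb)         # (256, 480, 1)
p_lidar = lidar_branch(depth)      # (256, 480, 1)
p_fused = 0.5 * (p_cam + p_lidar)  # combine decisions at a higher abstraction level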
There are various methods for sensor fusion, and their level of complexity varies considerably. Sensor fusion can be as simple as averaging the readings of two sensors, or it can rely on techniques such as Kalman filtering and deep learning models. Example methods of data fusion in deep learning networks include reshaping and concatenating the outputs of each stream into a 1-D vector that is fed into a fully connected network which handles the fusion 14. In another published example 15, trainable parameters are used to control information exchange between separate streams of sensing modalities; this “cross-fusion” allows the model to variably integrate the data at any depth. In a third example 16, the entropy of each sensor stream is calculated and used to steer the fusion toward useful information. The main motivation for sensor fusion in autonomous driving applications is to overcome the asymmetrical degradation of the different sensing modalities under varying weather conditions. Table 1 summarizes key advantages and disadvantages of cameras, LiDAR, and Radar, the three main sensors in this context. While cameras and LiDARs offer the best resolution, both perform poorly in bad weather. Cameras are also sensitive to changing lighting conditions and glare. In contrast, Radars have a much lower detection resolution but are not affected by poor lighting or weather conditions 13, 1. The complementary nature of these sensors highlights the importance of using all three together in a successful autonomous driving solution.
Table 1. Automotive sensor comparison.

Sensor  | Advantages                                              | Disadvantages
Camera  | High resolution; detects color                          | Sensitive to weather; sensitive to lighting
LiDAR   | Long range; high resolution; wide FOV                   | Sensitive to weather; expensive
Radar   | Long range; detects velocity; applicable in all weather | Low resolution; very sensitive
3. DRIVING DATASETS
This work uses convolutional neural networks and sensor fusion to address the problem of detecting a drivable path in adverse weather conditions. The proposed model is a multi-stream (one stream per sensor) deep convolutional neural network that downsamples the feature maps of each stream, fuses the data in a fully connected network, and upsamples the maps again to perform pixel-wise classification. This effort adapts earlier work 12, 13, 15, 16; however, while other researchers fused camera and LiDAR data only 12, 13, 15, or implemented object detection only 16, this work attempts to fuse camera, LiDAR, and Radar data to detect drivable paths in adverse weather conditions.
A suitable dataset for this work must contain camera, LiDAR, and Radar data. Annotation for semantic segmentation (i.e., pixel-wise labeling of drivable and non-drivable classes) is also desirable, and the data should have been recorded in poor weather conditions in which snow, rain, and fog are dominant. Furthermore, roads should have snow tracks and snowbanks on the sides or be completely covered in snow. Table 2 presents a brief survey of the most common and recent open autonomous driving datasets. The closest candidate to meeting all of these conditions is the DENSE dataset. Despite not having annotations for semantic segmentation, it was selected for its relevance to the goals of this project and its ease of access; however, a subset had to be manually labeled in order to train the CNN model.
3.1 The DENSE dataset
The DENSE project is a European effort that tackles the problems of autonomous driving in severe weather conditions 25, 26, 16. Researchers in this project collected a very large dataset by driving more than 10,000 km in northern Europe while recording data from multiple cameras, multiple LiDARs, Radar, GPS, IMU, a road friction sensor, and thermal cameras. The dataset comprises 12,000 samples, i.e., momentary measurements, and is annotated with 2-D and 3-D bounding boxes for object detection. Moreover, the dataset is split into smaller subsets depending on weather conditions and time of day, for example snow-day, fog-night, clear-day, etc. Finally, the dataset is available for public access.
3.2 Data Preprocessing and Labeling
The original camera images in the dataset have a size of 1920 by 1024 pixels; they were scaled down to 480 by 256 for faster training and testing. LiDAR data are stored in NumPy array format and had to be converted to images, rescaled (to 480 x 256), and normalized. Radar data are saved in JSON files, one file per frame. Each file contains a dictionary of detected objects and multiple readings for each object, including x-coordinates, y-coordinates, distance, velocity, etc. This coordinate system is parallel to the vehicle plane. To convert it to the vertical plane facing the vehicle, only the y-coordinates are considered: each y-distance is placed on an image and then extruded vertically to form a vertical line, where the lines and their horizontal locations correspond to objects and the intensities of the lines represent distances. Figure 1 shows an example of this projection process and its result. Finally, the output image is rescaled (to 480 x 256) and normalized. This radar data representation is similar to what was implemented in previous work 16.
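A minimal sketch of this radar-to-image projection is given below. The JSON field names ("y", "dist"), the lateral range, and the intensity convention (closer objects brighter) are illustrative assumptions, not the exact keys or scaling of the DENSE files.

import json
import numpy as np
import cv2  # OpenCV, used here only for resizing

def radar_json_to_image(path, width=1920, height=1024, out_size=(480, 256), max_range=100.0):
    """Project radar detections onto a vertical image plane: each object becomes
    a full-height vertical line whose horizontal position encodes its lateral
    location and whose intensity encodes its distance (assumed: closer = brighter)."""
    with open(path) as f:
        objects = json.load(f)  # assumed: per-frame dict or list of object readings
    img = np.zeros((height, width), dtype=np.float32)
    for obj in objects.values() if isinstance(objects, dict) else objects:
        y = float(obj["y"])        # lateral offset (assumed key)
        dist = float(obj["dist"])  # radial distance (assumed key)
        col = int((y / max_range + 0.5) * (width - 1))       # map lateral offset to a column
        col = int(np.clip(col, 0, width - 1))
        img[:, col] = 1.0 - min(dist / max_range, 1.0)       # extrude a vertical line
    img = cv2.resize(img, out_size)                          # rescale to 480 x 256
    return img / max(float(img.max()), 1e-6)                 # normalize to [0, 1]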
Table 2. Recently published autonomous driving datasets.

Dataset          | Camera data | LiDAR data | Radar data | Labels | Adverse weather | Year, Ref.
DENSE            | ✓           | ✓          | ✓          |        | ✓               | 2020, 16
End of the Earth | ✓           | ✓          |            |        | ✓               | 2020, 17
BDD100K          | ✓           |            |            | ✓      | ✓               | 2020, 18
Astyx            | ✓           | ✓          | ✓          |        |                 | 2019, 19
Oxford           | ✓           | ✓          |            |        | ✓               | 2020, 20
Argoverse        | ✓           | ✓          |            | ✓      |                 | 2019, 21
nuImages         | ✓           |            |            | ✓      |                 | 2020, 22
Apollo           | ✓           | ✓          |            | ✓      |                 | 2020, 23
Kitti            | ✓           | ✓          |            | ✓      |                 | 2013, 24
Since the original dataset contained annotations for object detection only, pixel-wise labels were needed to train a deep learning CNN model for semantic segmentation. A total of 1000 samples were randomly selected from the “snowy day” split and manually labeled using the Computer Vision Annotation Tool (CVAT) 27. The two classes considered here are drivable and non-drivable regions. The resulting labels are masks of zeros and ones that have the same size as the original camera images (1920 x 1024 pixels). The drivable path class is not limited to streets; it also includes parking lots, entrances, exits, etc. It also completely disregards lane markers, snow tracks, tramway tracks, and other lines that appear on the road, since those areas remain drivable. The non-drivable class includes sky, trees, people, animals, vehicles, bicycles, snowbanks, buildings, pavements, etc. It is worth mentioning that in some frames the amount of snow was so large that a drivable path was unrecognizable even to the human eye; in such cases, deciding whether an area is safe to drive on becomes subjective and depends on personal judgment. The sketch below illustrates how such polygon annotations can be converted into binary masks.
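The following is a minimal sketch of turning polygon annotations into pixel-wise label masks and resizing them to the training resolution; the polygon list format is a hypothetical, simplified stand-in for an actual CVAT export.

import numpy as np
import cv2

def polygons_to_mask(polygons, src_size=(1024, 1920), dst_size=(256, 480)):
    """polygons: list of N x 2 arrays of (x, y) image coordinates outlining
    drivable regions (hypothetical, simplified annotation format)."""
    mask = np.zeros(src_size, dtype=np.uint8)  # 1920 x 1024 label canvas
    for poly in polygons:
        cv2.fillPoly(mask, [np.asarray(poly, dtype=np.int32)], 1)  # drivable = 1
    # nearest-neighbor resize keeps the mask binary at the 480 x 256 training size
    return cv2.resize(mask, (dst_size[1], dst_size[0]), interpolation=cv2.INTER_NEAREST)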
Figure 1. Radar data preprocessing: (a) projecting the y-coordinates onto the image plane; (b) a processed radar frame.
4. SENSOR FUSION ARCHITECTURE
The convolutional neural network presented here consists of three subnetworks: an encoder network that downsamples the inputs and extracts features, a fully connected network that fuses the data, and a decoder network that upsamples the fused data and reconstructs the image. Figure 2 illustrates the implemented architecture. The network was designed to be as compact as possible, because very deep encoder-decoder networks are computationally expensive. For this reason, the decoder network was not designed with as many layers as the encoder network, which explains the asymmetrical architecture adopted here. The encoder network is made of three streams: a camera stream, a LiDAR stream, and a Radar stream. Since camera images contain more information, the camera stream is made deeper than the other two streams: it has four blocks, where each block consists of two convolutional layers, a batch normalization layer, and a ReLU layer followed by a MaxPooling layer.
LiDAR data are not as dense as camera data, so the LiDAR stream has less depth and downsamples its inputs more aggressively; it consists of three of the same building blocks used in the camera stream. Similarly, the Radar stream is shallower than the LiDAR stream and is made of only two blocks. The outputs of the three streams are then reshaped and concatenated into a one-dimensional vector. This vector is fed to the fully connected network, which consists of three hidden layers with ReLU activations. Finally, the outputs of the fully connected network are reshaped into a 2-D array that is fed to the decoder network. The decoder network consists of four consecutive stages of MaxUnpooling and transpose convolution that upsample the data back to the size of the input (480 x 256). Figure 2 and Table 3 detail the model architecture. It is worth noting that the total number of learnable parameters of this network is 9,856,832. An illustrative code sketch of this architecture is given after Figure 2.
Figure 2. Deep CNN Model architecture (numbers represent the output volume of each block).
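For concreteness, the following PyTorch sketch reproduces the layer structure described above and the kernel sizes, padding, and strides listed in Table 3, assuming single-channel LiDAR and Radar images of size 256 x 480. The per-block channel widths are not specified in the text, so the values used here (camera 16-32-64-128, LiDAR 16-32-64, Radar 16-32) are assumptions chosen only so that the concatenated encoder outputs match the 3360-element fully connected input; the sketch is illustrative rather than a reproduction of the trained model.

import torch
import torch.nn as nn

def conv_block(c_in, c_out, pad, pool_k):
    # two 3x3 convolutions, batch norm, ReLU, then MaxPool (per Table 3)
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=pad),
        nn.Conv2d(c_out, c_out, 3, padding=pad),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(pool_k, stride=pool_k),
    )

def radar_block(c_in, c_out):
    # per Table 3: 3x3 conv (stride 1), 3x3 conv (stride 2), then 4x4 MaxPool
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3),
        nn.Conv2d(c_out, c_out, 3, stride=2),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(4, stride=4),
    )

def bed_of_nails(x, k=2):
    # "bed of nails" unpooling: each value goes to the upper-left corner of a k x k block
    n, c, h, w = x.shape
    out = x.new_zeros(n, c, h * k, w * k)
    out[:, :, ::k, ::k] = x
    return out

class FusionNet(nn.Module):
    def __init__(self):
        super().__init__()
        # camera stream: 4 blocks, padding 1, 3x3 MaxPool (stride 3)
        self.cam = nn.Sequential(conv_block(3, 16, 1, 3), conv_block(16, 32, 1, 3),
                                 conv_block(32, 64, 1, 3), conv_block(64, 128, 1, 3))
        # LiDAR stream: 3 blocks, no padding, 4x4 MaxPool (stride 4)
        self.lidar = nn.Sequential(conv_block(1, 16, 0, 4), conv_block(16, 32, 0, 4),
                                   conv_block(32, 64, 0, 4))
        # Radar stream: 2 blocks
        self.radar = nn.Sequential(radar_block(1, 16), radar_block(16, 32))
        # fully connected fusion: 3360 -> 2048 -> 1024 -> 480
        self.fc = nn.Sequential(nn.Linear(3360, 2048), nn.ReLU(inplace=True),
                                nn.Linear(2048, 1024), nn.ReLU(inplace=True),
                                nn.Linear(1024, 480), nn.ReLU(inplace=True))
        # decoder: four stages of unpooling + 3x3 transpose convolution (padding 1)
        self.tconvs = nn.ModuleList([nn.ConvTranspose2d(1, 16, 3, padding=1),
                                     nn.ConvTranspose2d(16, 16, 3, padding=1),
                                     nn.ConvTranspose2d(16, 8, 3, padding=1),
                                     nn.ConvTranspose2d(8, 1, 3, padding=1)])

    def forward(self, cam, lidar, radar):                         # inputs: 256 x 480
        feats = [self.cam(cam), self.lidar(lidar), self.radar(radar)]
        fused = torch.cat([f.flatten(1) for f in feats], dim=1)   # (N, 3360)
        x = self.fc(fused).view(-1, 1, 16, 30)                    # reshape to a 2-D map
        for tconv in self.tconvs:
            x = tconv(bed_of_nails(x))   # each stage doubles the size, up to 256 x 480
        return x                         # per-pixel logit for the "drivable" class

# usage: logits = FusionNet()(cam_batch, lidar_batch, radar_batch)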
5. TRAINING AND TESTING
The deep convolutional neural network model was implemented in Python using the PyTorch deep learning library. Training and testing were executed on Google Colab with GPU utilization. The manually labeled data subset comprised 1000 samples of camera, LiDAR, and Radar data; it was divided into 800 samples for training and 200 for testing. The training batch size was 10 samples, and the model was trained for 600 epochs. A binary cross-entropy loss with a sigmoid layer was used, with stochastic gradient descent as the optimizer (learning rate 0.001, momentum 0.9). Figure 3 shows the loss versus the training samples during the training phase. No separate validation subset was created; hyperparameters were tuned iteratively using the training subset itself. The model was then tested with the 200 samples of the testing subset. The output of the model was postprocessed with image dilation and erosion with varying kernel sizes to reduce the amount of noise in the pixel classification output and to close small areas (see the sketch below).
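A minimal sketch of this training configuration is shown below. It assumes the FusionNet sketch from the previous section and a hypothetical SnowDataset that yields (camera, LiDAR, radar, mask) tensors; it mirrors the stated hyperparameters (batch size 10, 600 epochs, SGD with learning rate 0.001 and momentum 0.9, binary cross-entropy with a sigmoid layer) but is not the authors' exact script.

import torch
from torch import nn, optim
from torch.utils.data import DataLoader

# SnowDataset is a hypothetical Dataset returning (camera, lidar, radar, mask) tensors.
train_loader = DataLoader(SnowDataset(split="train"), batch_size=10, shuffle=True)

model = FusionNet().cuda()          # architecture sketch from the previous section
criterion = nn.BCEWithLogitsLoss()  # binary cross entropy with a sigmoid layer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

for epoch in range(600):
    for cam, lidar, radar, mask in train_loader:
        cam, lidar, radar, mask = cam.cuda(), lidar.cuda(), radar.cuda(), mask.cuda()
        optimizer.zero_grad()
        logits = model(cam, lidar, radar)   # (N, 1, 256, 480) drivable-path logits
        loss = criterion(logits, mask)      # mask: same shape, values in {0, 1}
        loss.backward()
        optimizer.step()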
Table 3. Deep CNN architecture details.

Network          | Layer            | Kernel | Padding | Stride
Encoder (Camera) | Conv 0, Conv 1   | 3      | 1       | 1
                 | MaxPool 0        | 3      | 0       | 3
                 | Conv 2, Conv 3   | 3      | 1       | 1
                 | MaxPool 1        | 3      | 0       | 3
                 | Conv 4, Conv 5   | 3      | 1       | 1
                 | MaxPool 2        | 3      | 0       | 3
                 | Conv 6, Conv 7   | 3      | 1       | 1
                 | MaxPool 3        | 3      | 0       | 3
Encoder (LiDAR)  | Conv 0, Conv 1   | 3      | 0       | 1
                 | MaxPool 0        | 4      | 0       | 4
                 | Conv 2, Conv 3   | 3      | 0       | 1
                 | MaxPool 1        | 4      | 0       | 4
                 | Conv 4, Conv 5   | 3      | 0       | 1
                 | MaxPool 2        | 4      | 0       | 4
Encoder (Radar)  | Conv 0           | 3      | 0       | 1
                 | Conv 1           | 3      | 0       | 2
                 | MaxPool 0        | 4      | 0       | 4
                 | Conv 2           | 3      | 0       | 1
                 | Conv 3           | 3      | 0       | 2
                 | MaxPool 1        | 4      | 0       | 4
Fully Connected  | 3360 → 2048 → 1024 → 480
Decoder          | MaxUnpool 0      | 2      | 0       | 2
                 | TConv 0          | 3      | 1       | 1
                 | MaxUnpool 1      | 2      | 0       | 2
                 | TConv 1          | 3      | 1       | 1
                 | MaxUnpool 2      | 2      | 0       | 2
                 | TConv 2          | 3      | 1       | 1
                 | MaxUnpool 3      | 2      | 0       | 2
                 | TConv 3          | 3      | 1       | 1
The simplest metric for measuring the accuracy of semantic segmentation is known as pixel accuracy, which is the ratio of the correctly identified positives and negatives to the total number of pixels in the image, calculated as

PA = \frac{TP + TN}{TP + TN + FP + FN} ,    (1)

where TP, TN, FP, and FN are true positives, true negatives, false positives, and false negatives, respectively. Pixel accuracy is calculated for each sample in the testing set, and the average of these values represents the total accuracy of the model. In the case of binary classification, however, pixel accuracy alone is insufficient and can be misleading: when a certain class is underrepresented in a sample, pixel accuracy can falsely report high accuracy simply because there are not enough pixels of that class to test the model 28. For this reason, Mean Intersection over Union (MIoU) is also presented here. MIoU is another common metric for semantic segmentation; it is
the ratio of the intersection of the target mask and the prediction mask to their union:

IoU = \frac{|T \cap P|}{|T \cup P|} ,    (2)

where T is the target (ground-truth) mask and P is the prediction mask. Similar to pixel accuracy, the IoU is calculated for each frame and the final metric is the average of these values; however, MIoU is calculated for each class separately. Pixel accuracy and MIoU for both classes (i.e., drivable and non-drivable) are shown in Table 4. Both eq. (1) and eq. (2) as written apply to binary classification only. Figure 4 shows the three metrics for each sample in the testing subset, and Figure 5 shows four selected snow driving frames with the corresponding camera, LiDAR, Radar, ground truth, and model output. A sketch of the postprocessing and metric computation is given below.
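A minimal sketch of the postprocessing (dilation followed by erosion, i.e., a morphological closing) and of the two metrics is shown below; the threshold and kernel size are illustrative assumptions.

import numpy as np
import cv2

def postprocess(logits, threshold=0.5, k=5):
    """Threshold the sigmoid output, then dilate and erode (morphological closing)
    to suppress pixel noise and close small holes in the prediction."""
    prob = 1.0 / (1.0 + np.exp(-logits))
    pred = (prob > threshold).astype(np.uint8)
    kernel = np.ones((k, k), np.uint8)
    return cv2.erode(cv2.dilate(pred, kernel), kernel)

def pixel_accuracy(pred, target):
    # eq. (1): (TP + TN) / (TP + TN + FP + FN)
    return float((pred == target).mean())

def iou(pred, target, cls=1):
    # eq. (2): |T ∩ P| / |T ∪ P| for one class (drivable = 1, non-drivable = 0)
    p, t = (pred == cls), (target == cls)
    union = np.logical_or(p, t).sum()
    return float(np.logical_and(p, t).sum() / union) if union else 1.0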
Figure 3. Losses vs training samples.
Figure 4. Accuracy vs test samples.
Table 4. Model performance.

Metric                          | Model accuracy
Pixel accuracy                  | 95.04%
MIoU, “drivable path” class     | 81.35%
MIoU, “non-drivable path” class | 93.58%
Table 4 conveys very good results for pixel accuracy and for the MIoU of the non-drivable path class. Referring to Figure 5, it is evident that the model can successfully delineate the general boundary of the area in which a vehicle can move safely. It does so while ignoring the various lines and edges that appear on the road surface and that could otherwise be interpreted as edges of a drivable path. The model is also resilient to fog and other low-visibility situations and, to some extent, avoids pedestrians, animals, and other vehicles. Finally, the model presented here is quite compact: compared to the other models mentioned earlier, it has far fewer layers and parameters, which means that it trains faster and, although not tested, could possibly be used in real-time applications. On the other hand, the MIoU for the drivable path class is not as high as the other two metrics, the model does not perfectly avoid pedestrians, and the segmentation boundaries it generates are rough and of low resolution, which could be caused by insufficient depth, especially in the decoder network.

Figure 5. Selected example frames (camera, radar, LiDAR, ground truth, and model outputs). Segmentation color code: green is drivable; purple is non-drivable.
6. CONCLUSIONS
This paper presents a deep learning CNN model for detecting the drivable path in snow driving. The results are very good due to the use of multi-modal sensor fusion. Several improvements could be investigated to enhance the model. For example, advanced semantic segmentation techniques such as “skip connections”, which copy feature maps from
early levels in the network and reuse them in later levels, could improve the classification resolution. Furthermore, storing indices from MaxPooling layers and using them in MaxUnpooling layers during upsampling could also improve the output resolution. The preprocessing techniques for the LiDAR and Radar data could be revisited as well; better representations of LiDAR and Radar data should lead to better feature extraction and would certainly improve the final results. Finally, an important improvement is to utilize the other measurements usually provided by LiDARs and Radars. For example, most LiDARs measure intensities and the number of reflected beams in addition to distances, while Radars measure velocities and angles in addition to distances. This information can be used to build multi-channel data arrays similar to what cameras provide: since the camera stream benefits from three RGB channels, the LiDAR stream could likewise be augmented with distance, beam intensity, and number-of-beams channels, and the Radar stream could include distance, velocity, and angle channels, as sketched below. The extra information should help the model distinguish between objects in scenarios where objects made of different materials are located at the same distance.
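A minimal sketch of building such multi-channel arrays is shown below; the per-channel maps (distance, intensity, and beam count for LiDAR; distance, velocity, and angle for Radar) are placeholders assumed to be already projected to the 256 x 480 image plane.

import numpy as np

H, W = 256, 480
# placeholder single-channel maps, assumed already projected to the image plane
lidar_distance, lidar_intensity, lidar_beams = (np.random.rand(H, W).astype(np.float32) for _ in range(3))
radar_distance, radar_velocity, radar_angle = (np.random.rand(H, W).astype(np.float32) for _ in range(3))

# stack per-sensor channels the same way a camera provides three RGB channels
lidar_input = np.stack([lidar_distance, lidar_intensity, lidar_beams], axis=0)  # (3, 256, 480)
radar_input = np.stack([radar_distance, radar_velocity, radar_angle], axis=0)   # (3, 256, 480)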
REFERENCES
[1] Wang, Z., Wu, Y. and Niu, Q., “Multi-Sensor Fusion in Automated Driving: A Survey,” IEEE Access 8, 2847–
2868 (2020).
[2] Rawashdeh, N. A. and Jasim, H. T., “Multi-sensor input path planning for an autonomous ground vehicle,” 2013
9th Int. Symp. Mechatronics Its Appl. ISMA 2013, 9–14 (2013).
[3] Aladem, M., Rawashdeh, S. and Rawashdeh, N., “Evaluation of a Stereo Visual Odometry Algorithm for
Passenger Vehicle Navigation,” SAE Tech. Pap. 2017-March(March) (2017).
[4] Rawashdeh, N. A. and Rawashdeh, S. A., “Scene Structure Classification as Preprocessing for Feature-Based
Visual Odometry,” SAE Int. J. Passeng. Cars - Electron. Electr. Syst. 11(3) (2018).
[5] Abdo, A., Ibrahim, R. and Rawashdeh, N. A., “Mobile Robot Localization Evaluations with Visual Odometry in
Varying Environments Using Festo-Robotino,” SAE Tech. Pap. 2020-April(April) (2020).
[6] Grigorescu, S., Trasnea, B., Cocias, T. and Macesanu, G., “A survey of deep learning techniques for autonomous
driving,” J. F. Robot. 37(3), 362–386 (2020).
[7] Simonyan, K. and Zisserman, A., “Very deep convolutional networks for large-scale image recognition,” 3rd Int.
Conf. Learn. Represent. ICLR 2015 - Conf. Track Proc., International Conference on Learning Representations,
ICLR (2015).
[8] Badrinarayanan, V., Kendall, A. and Cipolla, R., “SegNet: A Deep Convolutional Encoder-Decoder Architecture
for Image Segmentation,” IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017).
[9] Romera, E., Alvarez, J. M., Bergasa, L. M. and Arroyo, R., “ERFNet: Efficient Residual Factorized ConvNet for
Real-Time Semantic Segmentation,” IEEE Trans. Intell. Transp. Syst. 19(1), 263–272 (2018).
[10] Ronneberger, O., Fischer, P. and Brox, T., “U-net: Convolutional networks for biomedical image segmentation,”
Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics) 9351, 234–241,
Springer Verlag (2015).
[11] Michalke, T., Kastner, R., Herbert, M., Fritsch, J. and Goerick, C., “Adaptive multi-cue fusion for robust detection
of unmarked inner-city streets,” IEEE Intell. Veh. Symp. Proc., 1–8 (2009).
[12] Li, Q., Chen, L., Li, M., Shaw, S. L. and Nüchter, A., “A sensor-fusion drivable-region and lane-detection system
for autonomous vehicle navigation in challenging road scenarios,” IEEE Trans. Veh. Technol. 63(2), 540–555
(2014).
[13] Shahian Jahromi, B., Tulabandhula, T. and Cetin, S., “Real-Time Hybrid Multi-Sensor Fusion Framework for
Perception in Autonomous Vehicles,” Sensors 19(20), 4357 (2019).
[14] Shin, J., Park, H. and Paik, J., “Fire recognition using spatio-temporal two-stream convolutional neural network
with fully connected layer-fusion,” IEEE Int. Conf. Consum. Electron. - Berlin, ICCE-Berlin 2018-Septe, IEEE
Computer Society (2018).
[15] Caltagirone, L., Bellone, M., Svensson, L. and Wahde, M., “LIDAR–camera fusion for road detection using fully
convolutional neural networks,” Rob. Auton. Syst. 111, 125–131 (2019).
[16] Bijelic, M., Gruber, T., Mannan, F., Kraus, F., Ritter, W., Dietmayer, K. and Heide, F., “Seeing Through Fog
Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather,” 11679–11689 (2019).
[17] Bos, J. P., Chopp, D. J., Kurup, A. and Spike, N., “Autonomy at the end of the earth: an inclement weather
autonomous driving data set,” Auton. Syst. Sensors, Process. Secur. Veh. Infrastruct. 2020 11415, M. C. Dudzik
and S. M. Jameson, Eds., 6, SPIE (2020).
[18] Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V. and Darrell, T., “BDD100K: A Diverse
Driving Dataset for Heterogeneous Multitask Learning,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Recognit., 2633–2642 (2018).
[19] Meyer, M. and Kuschk, G., [Automotive Radar Dataset for Deep Learning Based 3D Object Detection] (2019).
[20] Maddern, W., Pascoe, G., Linegar, C. and Newman, P., “1 Year, 1000km: The Oxford RobotCar Dataset.”
[21] Chang, M.-F., Lambert, J. W., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S.,
Ramanan, D. and Hays, J., “Argoverse: 3D Tracking and Forecasting with Rich Maps,” Conf. Comput. Vis. Pattern
Recognit. (2019).
[22] Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G. and Beijbom,
O., “nuScenes: A multimodal dataset for autonomous driving,” Proc. IEEE Comput. Soc. Conf. Comput. Vis.
Pattern Recognit., 11618–11628 (2019).
[23] Ma, Y., Zhu, X., Zhang, S., Yang, R., Wang, W. and Manocha, D., “TrafficPredict: Trajectory Prediction for
Heterogeneous Traffic-Agents.”
[24] Geiger, A., Lenz, P., Stiller, C. and Urtasun, R., “Vision meets Robotics: The KITTI Dataset,” Int. J. Robot. Res.
(2013).
[25] “DENSE 24/7 - Startseite - DENSE - 24/7 Automotive Sensing System.”,
https://www.dense247.eu/home/index.html (accessed Nov. 29, 2020).
[26] “DENSE Datasets - Ulm University.”, https://www.uni-ulm.de/en/in/driveu/projects/dense-datasets (accessed
Nov. 29, 2020).
[27] “Computer Vision Annotation Tool.”, https://cvat.org/auth/login (accessed Oct. 18, 2020).
[28] “Evaluating image segmentation models.”, https://www.jeremyjordan.me/evaluating-image-segmentation-models
(accessed Nov. 28, 2020).