Nathir A. Rawashdeh, Jeremy P. Bos, Nader J. Abu-Alrub, "Drivable path
detection using CNN sensor fusion for autonomous driving in the snow," Proc.
SPIE 11748, Autonomous Systems: Sensors, Processing, and Security for
Vehicles and Infrastructure 2021, 1174806 (12 April 2021); doi:
10.1117/12.2587993
Event: SPIE Defense + Commercial Sensing, 2021, Online Only
Drivable path detection using CNN sensor fusion for autonomous
driving in the snow
Nathir A. Rawashdeh (a), Jeremy P. Bos (b), Nader J. Abu-Alrub (b)
(a) Dept. of Applied Computing, (b) Dept. of Electrical and Computer Engineering,
Michigan Technological University, 1400 Townsend Drive, Houghton, MI 49931, USA
ABSTRACT
This work targets the problem of drivable path detection in poor weather conditions including on snow covered roads. A
successful drivable path detection algorithm is vital for safe autonomous driving of passenger cars. Poor weather conditions
degrade vehicle perception systems, including cameras, radar, and laser ranging. Convolutional Neural Network (CNN)
based multi-modal sensor fusion is applied to path detection. A multi-stream encoder-decoder network that fuses camera,
LiDAR, and Radar data is presented here in order to overcome the asymmetrical degradation of sensors by complementing
their measurements. The model was trained and tested using a manually labeled subset from the DENSE dataset. Multiple
metrics were used to assess the model performance.
Keywords: drivable path, autonomous driving, sensor fusion, adverse weather, snow, convolutional neural networks
1. INTRODUCTION
Advanced Driver Assistance Systems (ADAS) define the next milestones toward fully autonomous vehicles. Nonetheless, many challenges must still be overcome to achieve true and full autonomy. These challenges include public acceptance, government regulations, the expensive and complicated processes needed to create datasets, and corner cases presented by poor or severe weather conditions. The current trend in autonomous vehicle research in general, and in dealing with poor weather conditions in particular, is to benefit from improvements in sensor performance and affordability and to equip vehicles with a mix of sensors that complement each other and provide the vehicle with the best possible perception of its surroundings [1, 2]. Typical sensors found in automated vehicles are cameras [3, 4, 5], millimeter wave (MMW) Radars, global positioning system (GPS) receivers, inertial measurement units (IMU), Light Detection and Ranging (LiDAR), and ultrasonic sensors [6]. Moreover, the continuous advancement of microprocessor computational power encourages the use of the high-bandwidth data flow generated by the vehicle perception system and facilitates the deployment of state-of-the-art computational methods that can be executed in real time. The next section provides a brief background on key concepts addressed in this work, followed by a description of the project and its goals. Section 3 covers dataset selection, the subset used, and data preprocessing and labeling. Section 4 presents the deep learning model architecture designed and applied in this work, followed by training and testing details and a summary of the model performance. Finally, conclusions are drawn and suggestions for improvements and enhancements to the model are made.
2. BACKGROUND
2.1 Semantic Segmentation
In semantic segmentation, instead of detecting an object in an image, each pixel is classified individually and assigned to the class it best represents. In other words, semantic segmentation is a pixel-level, or pixel-wise, classification. A typical semantic segmentation Convolutional Neural Network (CNN) is made of an encoder network and a decoder network. The encoder network downsamples the inputs and extracts the features, while the decoder network uses those features to reconstruct and upsample the input and finally assign each pixel to a class. Two key components in decoder networks are the so-called MaxUnpooling layer and the transpose convolution layer. The MaxUnpooling layer is the counterpart of the MaxPooling layer: it distributes the values in an input of smaller size over an output of larger size. Several distribution methods exist. A common approach is to store the locations of the maximum values in a MaxPooling layer and use these locations to place the values back in the matching positions in a corresponding MaxUnpooling layer. This approach requires that the encoder-decoder network be symmetrical, in which each MaxPooling layer in the encoder has a corresponding MaxUnpooling layer on the decoder side. Another approach is to place the values in a predetermined
location (e.g., the upper-left corner) of the area at which the kernel is pointing. This method is known as "bed of nails" and is used in the model presented in this work. The transpose convolution layer is the opposite of a regular convolution layer: a moving kernel scans the input and convolves the values to fill the output image. The output volume of both layers, MaxUnpooling and transpose convolution, can be controlled by adjusting the kernel size, padding, and stride. A major difference between the two is that MaxUnpooling layers have no trainable parameters, whereas in transpose convolution the kernel values are trainable. Many successful models perform semantic segmentation [6]. For example, the very deep convolutional neural network VGG16 [7] was originally used for object detection and image classification but was later extended by other researchers to perform semantic segmentation. Another example is SegNet [8], a symmetrical encoder-decoder network that uses VGG16 as its encoder. In addition, ERFNet [9] is an asymmetrical encoder-decoder network that was designed and tested for autonomous driving. Finally, U-Net [10], designed for biomedical applications, uses a U-shaped architecture and copies feature maps forward to corresponding layers on the upsampling side to achieve higher segmentation resolution.
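To make the upsampling behavior concrete, the following minimal PyTorch sketch (not from the paper) contrasts index-based MaxUnpooling with a trainable transpose convolution; the tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)

# MaxPool2d with return_indices=True stores the locations of the maxima so a
# paired MaxUnpool2d can place the values back at those positions. A decoder
# without stored indices can instead place values at a fixed position in each
# cell ("bed of nails"), as done in the model presented in this work.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
pooled, indices = pool(x)              # 1 x 1 x 4 x 4
restored = unpool(pooled, indices)     # 1 x 1 x 8 x 8, zeros except at the maxima

# Transpose convolution also upsamples, but its kernel weights are trainable.
tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2)
upsampled = tconv(pooled)              # 1 x 1 x 8 x 8, learned upsampling

print(restored.shape, upsampled.shape)
```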
2.2 Drivable Path Detection
A drivable path is "the space in which the car can move safely in a physical sense, but without taking symbolic information into account" [11]. Detecting it is particularly important when a vehicle is in a parking lot, off-road, on an unmarked road, or driving in snowy or foggy conditions. Drivable path detection can be implemented as a preliminary step to lane or object detection. It greatly reduces the input volume for possibly deeper networks that carry out other detection tasks, for example by limiting the search area of a lane detection algorithm to regions classified as "drivable." Drivable path detection is a semantic segmentation problem in which the goal is to generate a pixel-wise classification after training on
a pixel-wise labeled dataset. The number of classes is usually two, i.e., drivable path and non-drivable path [12].
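As an illustration of using the drivable mask as a preliminary step, the short sketch below (not part of the paper) masks a camera frame with a predicted binary drivable mask so that a downstream lane detector only sees the region classified as drivable; the array shapes are placeholders.

```python
import cv2
import numpy as np

def limit_search_region(frame_bgr, drivable_mask):
    """Keep only the pixels labeled drivable so a subsequent lane-detection
    step operates on a much smaller search area."""
    mask = (drivable_mask > 0).astype(np.uint8)
    return cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask)

# Dummy frame and a mask covering the lower half of the image.
frame = np.random.randint(0, 255, (256, 480, 3), dtype=np.uint8)
mask = np.zeros((256, 480), dtype=np.uint8)
mask[128:, :] = 1
roi = limit_search_region(frame, mask)
```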
2.3 Sensor Fusion
The goal of sensor fusion is to combine readings from multiple sensors to decrease measurement uncertainty. It can also be used to gain a new type of information or perspective beyond the original readings. Sensor fusion techniques can be categorized in multiple ways. For example, they can be classified as homogeneous or non-homogeneous fusion: in the former, measurements from similar sensors are combined, whereas in non-homogeneous fusion, data are taken from different sensing modalities. Examples of homogeneous sensor fusion include refining GPS location estimates by using data from multiple satellites, and using multiple cameras in a stereovision configuration to extract depth information. On the other hand, combining data from cameras, LiDAR, and Radar in autonomous driving applications is an example of non-homogeneous sensor fusion.
Another important categorization of sensor fusion techniques relies on the stage at which the fusion takes place. Sensor fusion methods can be classified into early fusion (or raw-data fusion) and late fusion. In early fusion, raw data are stacked or concatenated and then used as a unified data container. For example, researchers [13] inserted the depth information from a LiDAR into the fourth channel of the camera data to form RGBD images that are then processed by their model. In late fusion, sensor measurements are processed separately and decisions are made based on the individual data streams; these decisions are then combined at a higher level of abstraction. Between these two extremes, sensor fusion can take place at any point along the data processing and feature extraction pipeline of a CNN.
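A minimal example of early fusion in this RGBD style is sketched below (assumed array shapes and value ranges; the depth map is a placeholder for LiDAR depth already projected onto the camera image plane).

```python
import numpy as np

# Camera frame (H x W x 3) and per-pixel LiDAR depth on the same image plane.
rgb = np.random.randint(0, 255, (256, 480, 3), dtype=np.uint8)
depth = np.random.rand(256, 480).astype(np.float32) * 80.0   # meters, placeholder

# Early (raw-data) fusion: normalize both modalities and append the depth map
# as a fourth channel, producing an RGBD array a single network can consume.
rgbd = np.dstack([rgb.astype(np.float32) / 255.0, depth / depth.max()])
print(rgbd.shape)   # (256, 480, 4)
```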
There are various methods for sensor fusion, and their complexity varies considerably. Sensor fusion can be as simple as averaging the readings of two sensors, or it can rely on techniques such as Kalman filtering or deep learning models to handle the fusion. Example data fusion methods in deep learning networks include reshaping and concatenating the outputs of each stream into a 1-D vector, which is fed into a fully connected network that handles the fusion [14]. In another published example [15], trainable parameters control the information exchange between separate streams of sensing modalities; this "cross-fusion" allows the model to variably integrate the data at any depth. In a third example [16], the entropy of each sensor stream is calculated and used to steer the fusion toward useful information. The main motivation for sensor fusion in autonomous driving applications is to overcome the asymmetrical degradation of different sensing modalities under varying weather conditions. Table 1 summarizes key advantages and disadvantages of cameras, LiDAR, and Radar, the three main sensors in this context. While cameras and LiDARs offer the best resolution, both degrade in adverse weather. Cameras are also sensitive to changing lighting conditions and object glare. In contrast, Radars have a much lower detection resolution but are not affected by poor lighting or weather conditions [13, 1]. The complementary nature of these sensors highlights the importance of using all three of them together for a successful autonomous driving solution.
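As a rough illustration of steering fusion by information content, the sketch below blends two feature maps with weights derived from their entropies; this is only a toy version of the idea and not the mechanism of any of the cited models. It assumes non-negative activations (e.g., after a ReLU).

```python
import torch

def entropy(feat, eps=1e-8):
    """Shannon entropy of a non-negative feature map, normalized to sum to 1."""
    p = feat.flatten()
    p = p / (p.sum() + eps)
    return -(p * (p + eps).log()).sum()

def entropy_weighted_fusion(feat_a, feat_b, eps=1e-8):
    """Weighted average of two feature maps, favoring the lower-entropy
    (presumably more structured, less degraded) stream."""
    w_a = 1.0 / (entropy(feat_a) + eps)
    w_b = 1.0 / (entropy(feat_b) + eps)
    return (w_a * feat_a + w_b * feat_b) / (w_a + w_b)

cam_feat = torch.rand(1, 64, 32, 60)     # placeholder camera features
lidar_feat = torch.rand(1, 64, 32, 60)   # placeholder LiDAR features
fused = entropy_weighted_fusion(cam_feat, lidar_feat)
```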
Table 1. Automotive sensor comparison.
  Sensor | Advantages                                                | Disadvantages
  Camera | High resolution; detects color                            | Sensitive to weather; sensitive to lighting
  LiDAR  | Long range; high resolution; wide FOV                     | Sensitive to weather; expensive
  Radar  | Long range; detects velocity; applicable for all weather  | Low resolution; very sensitive
3. DRIVING DATASETS
This work uses convolutional neural networks and sensor fusion to address the problem of detecting a drivable path in adverse weather conditions. The proposed model is a multi-stream (one stream per sensor) deep convolutional neural network that downsamples the feature maps of each stream, fuses the data in a fully connected network, and upsamples the maps again to perform pixel-wise classification. This effort adapts previously presented work [12, 13, 15, 16]; however, while other researchers fused camera and LiDAR data only [12, 13, 15] or implemented object detection only [16], this work fuses camera, LiDAR, and Radar data to detect drivable paths in adverse weather conditions.
A suitable dataset for this work must contain camera, LiDAR, and Radar data. Annotation for semantic segmentation (i.e., pixel-wise labeling of drivable and non-drivable classes) is also desirable, and the data should have been recorded in poor weather conditions in which snow, rain, and fog are dominant. Furthermore, roads should have snow tracks and snowbanks on the sides or be completely covered in snow. Table 2 presents a brief survey of the most common and recent open autonomous driving datasets. The closest candidate to meeting all of these conditions is the DENSE dataset. Despite not having annotations for semantic segmentation, it was selected due to its relevance to the goals of this project and its ease of access; however, a subset had to be manually labeled in order to train the CNN model.
3.1 The DENSE dataset
The DENSE project is a European effort that tackles the problems of autonomous driving in severe weather conditions [25, 26, 16]. Researchers in this project collected a very large dataset by driving more than 10,000 km in northern Europe while recording data from multiple cameras, multiple LiDARs, Radar, GPS, IMU, a road friction sensor, and thermal cameras. The dataset comprises 12,000 samples, i.e., momentary measurements, and is annotated with 2-D and 3-D bounding boxes for object detection. Moreover, the dataset is split into smaller subsets by weather condition and time of day, for example snow-day, fog-night, and clear-day. Finally, the dataset is available for public access.
3.2 Data Preprocessing and Labeling
The original camera images in the dataset have a size of 1920 by 1024 pixels. They were scaled down to 480 by 256 for faster training and testing. LiDAR data are stored as NumPy arrays that had to be converted to images, rescaled (to 480 x 256), and normalized. Radar data are saved in JSON files, one file per frame. Each file contains a dictionary of detected objects and multiple readings for each object, including x-coordinates, y-coordinates, distance, velocity, etc. This coordinate system is parallel to the vehicle plane. To convert it to the vertical plane facing the vehicle, only the y-coordinates are considered. Each y-distance was placed on an image and then extruded vertically to form a vertical line, so that the lines and their horizontal locations correspond to objects while the intensities of the lines represent distances. Figure 1 shows an example of this projection process and its result. Finally, the output image was rescaled (to 480 x 256) and normalized. This radar data representation is similar to what was implemented in previous work [16].
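The sketch below outlines this preprocessing under stated assumptions: OpenCV resizing, LiDAR frames stored as dense 2-D NumPy arrays, and radar JSON entries keyed with hypothetical "y" and "dist" fields with an assumed metric-to-pixel scaling. It is an illustration, not the exact code used.

```python
import json
import cv2
import numpy as np

TARGET_W, TARGET_H = 480, 256   # network input size used in this work

def preprocess_camera(path):
    """Load a 1920x1024 camera frame, downscale it, and normalize to [0, 1]."""
    img = cv2.imread(path)
    img = cv2.resize(img, (TARGET_W, TARGET_H))
    return img.astype(np.float32) / 255.0

def preprocess_lidar(path):
    """Load a LiDAR frame stored as a NumPy array, rescale, and normalize.
    A dense 2-D array layout is assumed here."""
    arr = np.load(path).astype(np.float32)
    arr = cv2.resize(arr, (TARGET_W, TARGET_H))
    return arr / (arr.max() + 1e-6)

def preprocess_radar(path, lateral_span=40.0, max_range=100.0):
    """Turn one radar JSON frame into a vertical-line image: each detection is
    extruded into a vertical line whose column follows its lateral (y) offset
    and whose intensity encodes distance. Key names and scaling are assumptions."""
    with open(path) as f:
        detections = json.load(f)
    frame = np.zeros((TARGET_H, TARGET_W), dtype=np.float32)
    for det in detections.values():
        # Map a lateral offset in [-lateral_span/2, +lateral_span/2] m to a column.
        col = int((det["y"] / lateral_span + 0.5) * (TARGET_W - 1))
        if 0 <= col < TARGET_W:
            frame[:, col] = min(det["dist"], max_range) / max_range
    return frame
```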
Table 2. Recently published autonomous driving datasets, compared in terms of camera data, LiDAR data, labels, and adverse weather coverage: DENSE, 2020 [16]; End of the Earth, 2020 [17]; BDD100K, 2020 [18]; Astyx, 2019 [19]; Oxford, 2020 [20]; Argoverse, 2019 [21]; nuImages, 2020 [22]; Apollo, 2020 [23]; KITTI, 2013 [24].
Since the original dataset contains annotations for object detection only, pixel-wise labels were needed to train a deep learning CNN model for semantic segmentation. A total of 1000 samples were randomly selected from the "snowy day" split and manually labeled using the Computer Vision Annotation Tool (CVAT) [27]. The two classes considered here are drivable and non-drivable regions. The resulting labels are masks of zeros and ones with the same size as the original camera images (1920 x 1024 pixels). The drivable class is not limited to streets; it also includes parking lots, entrances, exits, etc. Lane markers, snow tracks, tramway tracks, and other lines that appear on the road surface are disregarded, since those areas are drivable. The non-drivable class includes sky, trees, people, animals, vehicles, bicycles, snowbanks, buildings, pavements, etc. It is worth mentioning that in some frames the amount of snow was so large that a drivable path was unrecognizable even to the human eye; in such cases, deciding whether an area is safe to drive on becomes subjective and depends on personal judgment.
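When the full-resolution masks are brought down to the 480 x 256 network input size, nearest-neighbor interpolation keeps them strictly binary; a small hedged sketch of such a loading step is shown below (the file layout is an assumption).

```python
import cv2
import numpy as np

def load_label_mask(mask_path, target_size=(480, 256)):
    """Load a full-resolution (1920x1024) binary mask exported from CVAT and
    downscale it to the network input size without introducing new values."""
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    mask = cv2.resize(mask, target_size, interpolation=cv2.INTER_NEAREST)
    return (mask > 0).astype(np.float32)   # 1 = drivable, 0 = non-drivable
```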
Figure 1. Radar data preprocessing: (a) projecting the y-coordinates on the image plane; (b) a processed radar frame.
4. SENSOR FUSION ARCHITECTURE
The convolutional neural network presented here consists of three subnetworks: an encoder network that downsamples the inputs and extracts the features, a fully connected network that fuses the data, and a decoder network that upsamples the fused data and reconstructs the image. Figure 2 illustrates the implemented architecture. The network was designed to be as compact as possible, because very deep encoder-decoder networks are computationally expensive. For this reason, the decoder network was not designed with as many layers as the encoder network, which explains the asymmetrical architecture adopted here. The encoder network is made of three streams: a camera stream, a LiDAR stream, and a Radar stream. Since camera images contain more information, the camera stream is made deeper than the other two streams; it has four blocks, where each block consists of two convolutional layers, a batch normalization layer, and a ReLU layer followed by a MaxPooling layer.
LiDAR data are not as dense as camera data, so the LiDAR stream has less depth and downsamples the inputs more aggressively. It consists of three of the same building blocks used in the camera stream. Similarly, the Radar stream is shallower than the LiDAR stream: it is made of only two blocks. The outputs of the three streams are then reshaped and concatenated into a one-dimensional vector. This vector is fed to the fully connected network, which consists of three hidden layers with ReLU activations. Finally, the output of the fully connected network is reshaped into a 2-D array that is fed to the decoder network. The decoder network consists of four consecutive stages of MaxUnpooling and transpose convolution that upsample the data back to the size of the input (480 x 256). Figure 2 and Table 3 detail the model architecture. It is worth noting that the total number of learnable parameters of this network is 9,856,832.
Figure 2. Deep CNN Model architecture (numbers represent the output volume of each block).
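The following PyTorch sketch reproduces the overall structure of this architecture under stated assumptions: the stream depths, pooling factors, fully connected sizes (2048, 1024, 480), and the four-stage bed-of-nails decoder follow the text and Table 3, while the convolutional channel widths (and therefore the flattened fusion size, reported as 3360 in Table 3) are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, pad, pool_k, pool_s):
    """Encoder building block from the text: two 3x3 convolutions, batch norm,
    ReLU, then max pooling. Channel widths in this sketch are assumptions."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=pad),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=pad),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=pool_k, stride=pool_s))

def radar_block(in_ch, out_ch):
    """Radar block per Table 3: a stride-1 and a stride-2 3x3 convolution
    (no padding) followed by 4x4 max pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=4, stride=4))

def bed_of_nails_unpool(x, factor=2):
    """Place each value in the upper-left corner of an empty factor x factor
    cell ("bed of nails"); no stored indices or parameters are required."""
    n, c, h, w = x.shape
    out = x.new_zeros(n, c, h * factor, w * factor)
    out[:, :, ::factor, ::factor] = x
    return out

class FusionNet(nn.Module):
    """Three-stream encoder, fully connected fusion, and unpooling decoder."""
    def __init__(self, in_size=(256, 480)):
        super().__init__()
        self.camera = nn.Sequential(                 # four blocks, 3x3/stride-3 pooling
            conv_block(3, 8, 1, 3, 3), conv_block(8, 16, 1, 3, 3),
            conv_block(16, 32, 1, 3, 3), conv_block(32, 32, 1, 3, 3))
        self.lidar = nn.Sequential(                  # three blocks, 4x4/stride-4 pooling
            conv_block(1, 8, 0, 4, 4), conv_block(8, 16, 0, 4, 4),
            conv_block(16, 16, 0, 4, 4))
        self.radar = nn.Sequential(radar_block(1, 8), radar_block(8, 8))
        with torch.no_grad():                        # find the flattened fusion size
            h, w = in_size
            n = sum(m(torch.zeros(1, c, h, w)).numel()
                    for m, c in [(self.camera, 3), (self.lidar, 1), (self.radar, 1)])
        self.fusion = nn.Sequential(                 # FC fusion: n -> 2048 -> 1024 -> 480
            nn.Linear(n, 2048), nn.ReLU(inplace=True),
            nn.Linear(2048, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 480))
        self.tconvs = nn.ModuleList(                 # four unpool + transpose-conv stages
            [nn.ConvTranspose2d(1, 1, kernel_size=3, padding=1) for _ in range(4)])

    def forward(self, cam, lidar, radar):
        feats = torch.cat([self.camera(cam).flatten(1),
                           self.lidar(lidar).flatten(1),
                           self.radar(radar).flatten(1)], dim=1)
        x = self.fusion(feats).view(-1, 1, 16, 30)   # 480 values -> 16 x 30 map
        for tconv in self.tconvs:
            x = tconv(bed_of_nails_unpool(x))        # 16x30 -> 32x60 -> ... -> 256x480
        return x                                     # per-pixel logits

out = FusionNet()(torch.randn(1, 3, 256, 480),
                  torch.randn(1, 1, 256, 480),
                  torch.randn(1, 1, 256, 480))
print(out.shape)   # torch.Size([1, 1, 256, 480])
```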
5. TRAINING AND TESTING
The deep convolutional neural network model was implemented in Python using the PyTorch deep learning library. Training and testing were executed on Google Colab with GPU utilization. The manually labeled data subset comprised 1000 samples of camera, LiDAR, and Radar data; it was divided into 800 samples for training and 200 for testing. The training batch size was 10 samples, and the model was trained for 600 epochs. The binary cross entropy with sigmoid layer loss function was used, with stochastic gradient descent as the optimizer, a learning rate of 0.001, and a momentum of 0.9. Figure 3 shows the losses versus the training samples during the training phase. No separate validation subset was created; hyperparameters were tuned iteratively using the training subset itself. The model was then tested with the 200 samples of the testing subset. The output of the model was postprocessed with image dilation and erosion with varying kernel sizes to reduce the amount of noise in the pixel classification output and to close small areas.
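A hedged sketch of this training and postprocessing setup is shown below; it reuses the FusionNet sketch from the previous section, substitutes a small synthetic dataset for the real labeled subset, and fixes the morphology kernel size that the paper varied.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the manually labeled subset (the paper used 800
# real training samples); shapes follow the preprocessing described above.
train_set = TensorDataset(torch.randn(20, 3, 256, 480),     # camera
                          torch.randn(20, 1, 256, 480),     # LiDAR
                          torch.randn(20, 1, 256, 480),     # Radar
                          torch.randint(0, 2, (20, 1, 256, 480)).float())  # masks

model = FusionNet()                                  # sketch from the previous section
loader = DataLoader(train_set, batch_size=10, shuffle=True)
criterion = nn.BCEWithLogitsLoss()                   # binary cross entropy with sigmoid
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

for epoch in range(600):                             # 600 epochs, as reported
    for cam, lidar, radar, target in loader:         # target: 1 = drivable, 0 = non-drivable
        optimizer.zero_grad()
        loss = criterion(model(cam, lidar, radar), target)
        loss.backward()
        optimizer.step()

def postprocess(logits, kernel_size=5):
    """Threshold the output and apply dilation followed by erosion (closing)
    to suppress small noisy regions and close small holes. The paper varied
    the kernel size; 5 is an arbitrary choice here."""
    mask = (torch.sigmoid(logits)[0, 0] > 0.5).cpu().numpy().astype(np.uint8)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.erode(cv2.dilate(mask, kernel), kernel)
```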
Table 3. Deep CNN architecture details (kernel size, padding, stride).

Encoder network, camera stream:
  Conv 0, Conv 1   | kernel 3 | padding 1 | stride 1
  MaxPool 0        | kernel 3 | padding 0 | stride 3
  Conv 2, Conv 3   | kernel 3 | padding 1 | stride 1
  MaxPool 1        | kernel 3 | padding 0 | stride 3
  Conv 4, Conv 5   | kernel 3 | padding 1 | stride 1
  MaxPool 2        | kernel 3 | padding 0 | stride 3
  Conv 6, Conv 7   | kernel 3 | padding 1 | stride 1
  MaxPool 3        | kernel 3 | padding 0 | stride 3

Encoder network, LiDAR stream:
  Conv 0, Conv 1   | kernel 3 | padding 0 | stride 1
  MaxPool 0        | kernel 4 | padding 0 | stride 4
  Conv 2, Conv 3   | kernel 3 | padding 0 | stride 1
  MaxPool 1        | kernel 4 | padding 0 | stride 4
  Conv 4, Conv 5   | kernel 3 | padding 0 | stride 1
  MaxPool 2        | kernel 4 | padding 0 | stride 4

Encoder network, Radar stream:
  Conv 0           | kernel 3 | padding 0 | stride 1
  Conv 1           | kernel 3 | padding 0 | stride 2
  MaxPool 0        | kernel 4 | padding 0 | stride 4
  Conv 2           | kernel 3 | padding 0 | stride 1
  Conv 3           | kernel 3 | padding 0 | stride 2
  MaxPool 1        | kernel 4 | padding 0 | stride 4

Fully connected network: 3360 -> 2048 -> 1024 -> 480

Decoder network:
  MaxUnpool 0      | kernel 2 | padding 0 | stride 2
  TConv 0          | kernel 3 | padding 1 | stride 1
  MaxUnpool 1      | kernel 2 | padding 0 | stride 2
  TConv 1          | kernel 3 | padding 1 | stride 1
  MaxUnpool 2      | kernel 2 | padding 0 | stride 2
  TConv 2          | kernel 3 | padding 1 | stride 1
  MaxUnpool 3      | kernel 2 | padding 0 | stride 2
  TConv 3          | kernel 3 | padding 1 | stride 1
The simplest metric for measuring semantic segmentation accuracy is pixel accuracy, the ratio of correctly identified positives and negatives to the total number of pixels in the image:

\text{Pixel accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)

where TP, TN, FP, and FN are true positives, true negatives, false positives, and false negatives, respectively. Pixel accuracy is calculated for each sample in the testing set, and the average of these values represents the overall accuracy of the model. For binary classification, however, pixel accuracy alone is insufficient and can be misleading: when a class is underrepresented in a sample, pixel accuracy can report a deceptively high value simply because there are not enough pixels of that class to test the model on [28]. For this reason, Mean Intersection over Union (MIoU) is also reported here. MIoU is another common metric for semantic segmentation; it is
the ratio of the intersection of the target mask and the prediction mask to their union:

\text{IoU} = \frac{|\text{target} \cap \text{prediction}|}{|\text{target} \cup \text{prediction}|} \qquad (2)

As with pixel accuracy, the IoU is calculated for each frame and the final metric is the average of these values; however, MIoU is calculated for each class separately. Pixel accuracy and MIoU for both classes (i.e., drivable and non-drivable) are shown in Table 4. Both Eq. (1) and Eq. (2) apply to binary classification only. Figure 4 shows the three metrics for each sample in the testing subset. Figure 5 shows four selected snow-driving frames with camera, LiDAR, Radar, ground truth, and model output.
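A small sketch of how these per-frame metrics can be computed and averaged is given below; it mirrors Eqs. (1) and (2) but is not the paper's evaluation code, and the prediction and ground-truth masks here are random placeholders.

```python
import numpy as np

def pixel_accuracy(pred, target):
    """Eq. (1): fraction of pixels classified correctly over both classes."""
    return float(np.mean(pred == target))

def iou(pred, target, cls):
    """Eq. (2): intersection over union for one class of a binary mask."""
    p, t = (pred == cls), (target == cls)
    union = np.logical_or(p, t).sum()
    return np.logical_and(p, t).sum() / union if union > 0 else np.nan

# Placeholder predictions and ground-truth masks (0/1); in practice these come
# from the thresholded model outputs and the manually labeled test masks.
preds = [np.random.randint(0, 2, (256, 480)) for _ in range(200)]
targets = [np.random.randint(0, 2, (256, 480)) for _ in range(200)]

acc = np.mean([pixel_accuracy(p, t) for p, t in zip(preds, targets)])
miou_drivable = np.nanmean([iou(p, t, 1) for p, t in zip(preds, targets)])
miou_non_drivable = np.nanmean([iou(p, t, 0) for p, t in zip(preds, targets)])
```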
Figure 3. Losses vs training samples.
Figure 4. Accuracy vs test samples.
Table 4. Model performance.
  Metric                            | Model accuracy
  Pixel accuracy                    | 95.04%
  MIoU, "drivable path" class       | 81.35%
  MIoU, "non-drivable path" class   | 93.58%
Table 4 shows very good results for pixel accuracy and for MIoU on the non-drivable path class. Referring to Figure 5, it is evident that the model can successfully delineate the general boundary of the area in which a vehicle can move safely. It can do so while ignoring all sorts of lines and edges that appear on the road and that could otherwise be interpreted as edges of a drivable path. The model also shows considerable resilience in foggy and other poor-visibility situations, and it avoids pedestrians, animals, and other vehicles to some extent.
Figure 5. Selected example frames: camera, Radar, LiDAR, ground truth, and model outputs. Segmentation color code: green is drivable; purple is non-drivable.
Finally, the model presented here is quite compact. Compared to the other models mentioned earlier, it has far fewer layers and parameters, which means that it trains faster and, although this was not tested, it could possibly be used in real-time applications. On the other hand, the MIoU for the drivable path class is not as high as the other two metrics. Moreover, the model does not perfectly avoid pedestrians. Finally, the segmentation boundaries generated by the model are rough and of low resolution, which could be caused by insufficient depth, especially in the decoder network.
6. CONCLUSIONS
This paper presents a CNN deep learning model for detecting the drivable path when driving in snow. The results are very good due to the use of multi-modal sensor fusion. Several improvements could be investigated to enhance the model. For example, advanced semantic segmentation techniques such as "skip connections," which copy feature maps from
early levels of the network for use in later levels, could improve the classification resolution. Furthermore, storing indices from MaxPooling layers and using them in MaxUnpooling layers during upsampling could also improve the output resolution. The preprocessing techniques for LiDAR and Radar data could be revisited; a better representation of LiDAR and Radar data should lead to better feature extraction and would certainly improve the final results. Finally, an important improvement is to utilize other measurements usually provided by LiDARs and Radars. For example, most LiDARs measure intensities and the number of reflected beams in addition to distances, while Radars measure velocities and angles in addition to distances. This information can be used to build multi-channel data arrays similar to what cameras provide. Just as the camera stream benefits from RGB data, the LiDAR stream can be augmented with distance, beam intensity, and beam count channels, and the Radar stream can include distance, velocity, and angle information. The extra data should help the model distinguish between objects in scenarios where objects made of different materials are located at the same distance.
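As a purely illustrative sketch of this suggested extension, the additional measurements could be stacked as channels in the same way a camera provides RGB; all per-pixel maps below are placeholders.

```python
import numpy as np

H, W = 256, 480
# LiDAR stream: distance, reflected-beam intensity, and beam count per pixel.
lidar_input = np.stack([np.random.rand(H, W),      # distance (placeholder)
                        np.random.rand(H, W),      # intensity (placeholder)
                        np.random.rand(H, W)])     # number of beams (placeholder)
# Radar stream: distance, velocity, and angle per pixel.
radar_input = np.stack([np.random.rand(H, W),      # distance (placeholder)
                        np.random.rand(H, W),      # velocity (placeholder)
                        np.random.rand(H, W)])     # angle (placeholder)
print(lidar_input.shape, radar_input.shape)        # (3, 256, 480) each
```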
REFERENCES
[1] Wang, Z., Wu, Y. and Niu, Q., “Multi-Sensor Fusion in Automated Driving: A Survey,” IEEE Access 8, 2847–2868 (2020).
[2] Rawashdeh, N. A. and Jasim, H. T., “Multi-sensor input path planning for an autonomous ground vehicle,” 2013
9th Int. Symp. Mechatronics Its Appl. ISMA 2013, 914 (2013).
[3] Aladem, M., Rawashdeh, S. and Rawashdeh, N., “Evaluation of a Stereo Visual Odometry Algorithm for
Passenger Vehicle Navigation,” SAE Tech. Pap. 2017-March(March) (2017).
[4] Rawashdeh, N. A. and Rawashdeh, S. A., “Scene Structure Classification as Preprocessing for Feature-Based
Visual Odometry,” SAE Int. J. Passeng. Cars - Electron. Electr. Syst. 11(3) (2018).
[5] Abdo, A., Ibrahim, R. and Rawashdeh, N. A., “Mobile Robot Localization Evaluations with Visual Odometry in
Varying Environments Using Festo-Robotino,” SAE Tech. Pap. 2020-April(April) (2020).
[6] Grigorescu, S., Trasnea, B., Cocias, T. and Macesanu, G., “A survey of deep learning techniques for autonomous driving,” J. Field Robot. 37(3), 362–386 (2020).
[7] Simonyan, K. and Zisserman, A., “Very deep convolutional networks for large-scale image recognition,” 3rd Int.
Conf. Learn. Represent. ICLR 2015 - Conf. Track Proc., International Conference on Learning Representations,
ICLR (2015).
[8] Badrinarayanan, V., Kendall, A. and Cipolla, R., “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation,” IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017).
[9] Romera, E., Alvarez, J. M., Bergasa, L. M. and Arroyo, R., “ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation,” IEEE Trans. Intell. Transp. Syst. 19(1), 263–272 (2018).
[10] Ronneberger, O., Fischer, P. and Brox, T., “U-net: Convolutional networks for biomedical image segmentation,” Lect. Notes Comput. Sci. 9351, 234–241, Springer Verlag (2015).
[11] Michalke, T., Kastner, R., Herbert, M., Fritsch, J. and Goerick, C., “Adaptive multi-cue fusion for robust detection
of unmarked inner-city streets,” IEEE Intell. Veh. Symp. Proc., 1–8 (2009).
[12] Li, Q., Chen, L., Li, M., Shaw, S. L. and Nüchter, A., “A sensor-fusion drivable-region and lane-detection system for autonomous vehicle navigation in challenging road scenarios,” IEEE Trans. Veh. Technol. 63(2), 540–555 (2014).
[13] Shahian Jahromi, B., Tulabandhula, T. and Cetin, S., “Real-Time Hybrid Multi-Sensor Fusion Framework for
Perception in Autonomous Vehicles,” Sensors 19(20), 4357 (2019).
[14] Shin, J., Park, H. and Paik, J., “Fire recognition using spatio-temporal two-stream convolutional neural network
with fully connected layer-fusion,” IEEE Int. Conf. Consum. Electron. - Berlin, ICCE-Berlin 2018-Septe, IEEE
Computer Society (2018).
[15] Caltagirone, L., Bellone, M., Svensson, L. and Wahde, M., “LIDAR–camera fusion for road detection using fully convolutional neural networks,” Rob. Auton. Syst. 111, 125–131 (2019).
[16] Bijelic, M., Gruber, T., Mannan, F., Kraus, F., Ritter, W., Dietmayer, K. and Heide, F., “Seeing Through Fog
Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather,” 11679–11689 (2019).
[17] Bos, J. P., Chopp, D. J., Kurup, A. and Spike, N., “Autonomy at the end of the earth: an inclement weather
autonomous driving data set,” Auton. Syst. Sensors, Process. Secur. Veh. Infrastruct. 2020 11415, M. C. Dudzik
and S. M. Jameson, Eds., 6, SPIE (2020).
[18] Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V. and Darrell, T., “BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2633–2642 (2018).
[19] Meyer, M. and Kuschk, G., [Automotive Radar Dataset for Deep Learning Based 3D Object Detection] (2019).
[20] Maddern, W., Pascoe, G., Linegar, C. and Newman, P., “1 Year, 1000 km: The Oxford RobotCar Dataset,” Int. J. Robot. Res.
[21] Chang, M.-F., Lambert, J. W., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S.,
Ramanan, D. and Hays, J., “Argoverse: 3D Tracking and Forecasting with Rich Maps,” Conf. Comput. Vis. Pattern
Recognit. (2019).
[22] Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G. and Beijbom, O., “nuScenes: A multimodal dataset for autonomous driving,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 11618–11628 (2019).
[23] Ma, Y., Zhu, X., Zhang, S., Yang, R., Wang, W. and Manocha, D., “TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents,” Proc. AAAI Conf. Artif. Intell. (2019).
[24] Geiger, A., Lenz, P., Stiller, C. and Urtasun, R., “Vision meets Robotics: The KITTI Dataset,” Int. J. Robot. Res.
(2013).
[25] “DENSE 24/7 - Startseite - DENSE - 24/7 Automotive Sensing System.”,
https://www.dense247.eu/home/index.html (accessed Nov. 29, 2020).
[26] “DENSE Datasets - Ulm University.”, https://www.uni-ulm.de/en/in/driveu/projects/dense-datasets (accessed
Nov. 29, 2020).
[27] “Computer Vision Annotation Tool.”, https://cvat.org/auth/login (accessed Oct. 18, 2020).
[28] “Evaluating image segmentation models.”, https://www.jeremyjordan.me/evaluating-image-segmentation-models
(accessed Nov. 28, 2020).