
Real-Time Hybrid Multi-Sensor Fusion Framework for Perception in Autonomous Vehicles


Abstract

Many sensor fusion frameworks have been proposed in the literature, using different combinations and configurations of sensors and fusion methods. Most of the focus has been on improving accuracy; the feasibility of implementing these frameworks in an autonomous vehicle is less explored. Some fusion architectures perform very well in lab conditions with powerful computational resources; in real-world applications, however, they cannot be implemented on an embedded edge computer due to their high cost and computational demand. We propose a new hybrid multi-sensor fusion pipeline configuration that performs environment perception tasks for autonomous vehicles, such as road segmentation, obstacle detection, and tracking. This fusion framework uses a proposed encoder-decoder based Fully Convolutional Neural Network (FCNx) and a traditional Extended Kalman Filter (EKF) nonlinear state estimator. It also uses a configuration of camera, LiDAR, and radar sensors best suited to each fusion method. The goal of this hybrid framework is to provide a cost-effective, lightweight, modular, and robust (in case of a sensor failure) fusion system. The FCNx algorithm improves road detection accuracy compared to benchmark models while maintaining the real-time efficiency needed on an autonomous vehicle's embedded computer. Tested on over 3K road scenes, our fusion algorithm shows better performance in various environment scenarios than baseline benchmark networks. Moreover, the algorithm is implemented in a vehicle and tested using actual sensor data collected from a vehicle, performing real-time environment perception.
Babak Shahian Jahromi 1,*, Theja Tulabandhula 2 and Sabri Cetin 1,*
1Mechanical and Industrial Engineering, University of Illinois, Chicago, IL 60607, USA
2Information and Decision Sciences, University of Illinois, Chicago, IL 60607, USA;
*Correspondence: (B.S.J.); (S.C.)
Received: 13 August 2019; Accepted: 3 October 2019; Published: 9 October 2019
Keywords: environment perception; sensor fusion; LiDAR; road segmentation; fully convolutional
network; object detection and tracking; extended Kalman filter; autonomous vehicle
1. Introduction
Sensors in autonomous vehicles (AV) are used in two main categories: localization to measure
where the vehicle is on the road and environment perception to detect what is around the vehicle.
Global Navigation Satellite System (GNSS), Inertial Measurement Unit (IMU), and vehicle odometry
sensors are used to localize the AV. Localization is needed so that the AV knows its position with
respect to the environment. LiDAR, camera, and radar sensors are used for environment perception.
Understanding the surroundings is needed for the AV to navigate the environment safely. The output
of localization and environment perception is sent to the AV path planner to decide on what actions
to take. Those actions are sent to the AV motion control, which in turn will send the optimal control
inputs to the actuators (accelerator, brakes, and steering). The AV architecture is shown in Figure 1.
In this research paper, we propose a hybrid multi-sensor fusion framework configuration for
autonomous driving. In addition to accuracy improvement, this modular framework takes into account
and combines the strengths of nonlinear state estimators i.e., Extended Kalman Filter technique with
the strength of deep learning algorithms i.e., Fully Convolutional Network, based on the sensor type
Sensors 2019,19, 4357; doi:10.3390/s19204357
and configuration. This hybrid sensor fusion technique is used for understanding the environment
around the autonomous vehicle to provide rich information on the safe driveable road region as well
as detecting and tracking the objects on the road. This method was tested via an embedded in-vehicle
computer, and the results were compared to the baseline networks and ground truth information.
There is extensive research on environment perception for autonomous vehicles, including
the sensors used in an AV, sensor data processing, and various fusion algorithms. The sensors can be
categorized into three main categories: (1) camera; (2) LiDAR; and (3) radar. Also, the fusion algorithms
can be categorized into two main categories: (1) sensor fusion using state estimators i.e., Kalman filters
(KF); and (2) machine learning-based methods i.e., deep neural networks (DNN). The literature
has often focused on improving algorithm accuracy; the implementation feasibility of
these algorithms in an autonomous vehicle, however, has been less explored. An efficient,
lightweight, modular, and robust pipeline is essential. Therefore, a sensor fusion method configuration
that balances fusion model complexity against real-world, real-time applicability while
improving the environment perception accuracy is fundamental.
Figure 1. Autonomous vehicle architecture. Our paper focuses on the perception (red area).
The paper is structured as follows. We review the sensor fusion literature in Section 2. We review
the related perception sensor technology (i.e., camera, LiDAR and radar) used in our implementation
in Section 3. Next, we present our sensor fusion pipeline overview in Section 4, along with
how we processed the sensor data and implemented the fusion methods (i.e., details of our proposed
FCNx and a review of the traditional EKF). In Section 5, the experiment procedure, tools and dataset are
discussed. The experimental results and evaluation are presented in Section 6. We conclude our paper
in Section 7.
2. Fusion Systems Literature Review
Reviewing related work on fusion systems [
], we can see that a lot of focus has been given
to camera and LiDAR sensors and performing different computer vision tasks using deep learning
methods. Some literature focused on single sensor (camera or LiDAR) and single method (deep
learning or traditional computer vision) techniques [
]. Some literature focused on LiDAR +
camera multi-sensor approach [
] and a single fusion method. Other literature focused on
camera + radar + LiDAR multi-sensor fusion using state estimators i.e., Kalman filters or particle
filters [
]. Also, some works focused on the fusion method techniques without any real-world
implementation or any specific set of sensors. In [
], the authors reviewed autonomous vehicle
systems technology. They present the history of autonomous driver assistance systems and the sensor
technology used in autonomous vehicles. They also explored standard sensor fusion methods as
well as methodologies that connect the sensor information with actuation decision. In [
], a review
of environment perception algorithms and modeling methods in intelligent vehicles is presented
focusing on lane and road detection, traffic sign recognition, vehicle tracking, behavior analysis, and
scene understanding. An overview of sensors and sensor fusion methods in autonomous vehicles is
given [
]. In [
], an RGBD architecture is used for object recognition using two separate Convolutional
Neural Networks (CNN). In [
], LiDAR and camera information is fused using a boosted decision tree
classifier for road detection. Vision-based road detection is explored using geometrical transforms in [ ].
Reference [
] proposed a unified approach to perform joint classification, detection, and semantic
segmentation with a focus on efficiency. Deep learning-based road detection is developed using
only LiDAR data and a fully convolutional network (FCN) in [
]. In [
], a high-level sensor data
fusion architecture is designed and tested for ADAS applications such as emergency stop assistance.
Reference [
] fused camera images and LiDAR point clouds to detect road using several FCNs with
three different fusion strategies: early fusion, cross fusion, and late fusion. In [
], millimeter-wave
radar and mono camera sensors are fused for vehicle detection and tracking. They present a fusion
approach to achieve the optimal balance between vehicle detection accuracy and computational efficiency.
Reference [
] presented an encoder-decoder FCN called SegNet for pixel-wise segmentation of road and
indoor scene images. In [
], a deep learning framework called ChiNet is proposed for sensor fusion
using a stereo camera to fuse depth and appearance features on separate branches. ChiNet has two
input branches for depth and appearance and two output branches for segmentation of road and objects.
In [
], a fusion system is proposed based on a 2D laser scanner and camera sensors with Unscented
Kalman filter (UKF) and joint probabilistic data association to detect vehicles. Reference [
] proposes a
choice of fusion techniques using a Kalman filter, given a sensor configuration and its model parameters.
Reference [
] investigated the problem of estimating wheelchair states in an indoor environment using
UKF to fuse sensor data. A deep learning-based fusion framework for end-to-end urban driving is
proposed using LiDAR and camera [
]. A two-stage fusion method is used for hypothesis generation
and verification using a stereo camera and LiDAR to improve vehicle detection in [
]. A real-time vehicle
detection and tracking system is proposed for complex urban scenarios that fuses a camera, a 2D LiDAR
and road context from prior road map knowledge using a CNN and a KF [
]. Reference [
] proposed
a system in which a camera is used for pedestrian, bicyclist, and vehicle detection, which is then used
to improve tracking and movement classification using radar and LiDAR. A multi-spectral pedestrian
detection system is proposed using deep learning from a thermal camera and an RGB camera in [ ].
In terms of architectures, Reference [
] proposed a convolutional neural network to perform semantic
segmentation called fully convolutional network (FCN). Reference [
] improved this architecture by
proposing the deconvolutional network (DeconvNet) by using decoding layers with deconvolutional
filters that upsample the down-sampled feature maps. In SegNet [
], DeconvNet was improved by
using connections between encoder and decoder layers. In U-Net, another encoder-decoder-based
architecture for semantic segmentation, all the feature maps are transferred from the encoding layers to
the decoding layers via skip connections. Mask-RCNN [
] was also proposed; it combines the
Faster-RCNN [
] object detection network with an FCN in one architecture. Our goal is to provide a
cost-effective, lightweight fusion system solution that can be implemented on a relatively cheap and
efficient embedded edge computer.
3. Autonomous Vehicle Perception Sensors
There are multiple sensors used for perception in autonomous cars that cover the environment
around the autonomous vehicle (AV). These sensors are: 1. Camera, 2. LiDAR (Light Detection and
Ranging), 3. Radar (Radio Detection and Ranging), and 4. Sonar (Sound Navigation and Ranging) to a
lesser extent.
In this paper, we stick with three primary sensors: camera, LiDAR, and radar. The main sensors'
placement on a typical autonomous vehicle, their coverage, and their maximum range are shown in Figure 2.
A camera is affected by environmental variations such as occlusions, illumination and weather
variation; LiDAR is affected by bad weather such as snow and rain; and radar has low detection accuracy
and resolution. Using these complementary sensors with the appropriate sensor fusion method suited
for them compensates for the weaknesses of each sensor. It also increases the fault tolerance and accuracy
of the entire perception framework.
Figure 2. Surround sensors used for environment perception in autonomous vehicles. Their placement
on the vehicle, coverage, and maximum range are shown. Green areas show radar coverage (long-range,
medium-range and short-range), orange areas show camera coverage around the car, and blue indicates
LiDAR 360-degree coverage.
3.1. Camera
Cameras are the primary sensors used for high-resolution tasks like object classification, semantic
image segmentation, scene perception, and tasks that require color perception like traffic lights or signs
recognition. Cameras work on the basic principle of detecting the light emitted from the objects on a
photosensitive surface through a lens, as shown in Figure 3. The photosensitive surface, also called
the image plane, is where the light is stored as an image. After light photons pass through a camera
lens, they hit a light-sensitive surface which creates several electrons. The photosensitive surfaces
measure the amount of light they receive and convert that into electron movements. Those moving
electrons form a charge which is then converted to a voltage by a capacitor. The voltage is amplified
and passed through an Analog-Digital Converter (ADC). The ADC converts the amplified voltage to a
digital number i.e., a digital gray value (pixels) that is related to the intensity of light. The Image Signal
Processor (ISP) enhances the quality of the noisy captured image. Image noise originates from imperfect
lenses, filters and image sensors. Image sensors are color blind by default and cannot differentiate
between different wavelengths. Therefore, we use microlenses or filters in front of each pixel to only
allow a certain wavelength. One filter pattern to map wavelength to color is the RGB pattern. The most
common RGB pattern is the Bayer pattern or Bayer filtering. The position of the object can be defined
as a vector with three elements [x, y, z] in space which is going to get projected through a lens to a point
on the image plane [x’, y’]. The projected points are converted from the metric unit to pixels so the
image can be used for further processing and extracting information such as detection, classification,
tracking of objects that are in the path of our autonomous vehicle (ego vehicle).
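The projection from a 3D point [x, y, z] to image-plane coordinates [x', y'] and then to pixels can be illustrated with an ideal pinhole model. The following is a minimal sketch, not the calibration model used in this work: the focal length, pixel size, and principal point (`f_mm`, `pixel_size_mm`, `cx`, `cy`) are hypothetical values, and lens distortion is ignored.

```python
import math


def project_point(p_cam, f_mm, pixel_size_mm, cx, cy):
    """Project a 3D point (camera frame, z pointing forward) to pixel coordinates.

    Assumes an ideal pinhole camera without lens distortion.
    """
    x, y, z = p_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    # Perspective projection onto the image plane, in metric units (mm).
    x_img = f_mm * x / z
    y_img = f_mm * y / z
    # Convert from metric units to pixels and shift to the principal point.
    u = x_img / pixel_size_mm + cx
    v = y_img / pixel_size_mm + cy
    return u, v


# Hypothetical camera: 4 mm lens, 2 micrometer pixels, 1280 x 720 image.
u, v = project_point((1.0, 0.5, 10.0), f_mm=4.0, pixel_size_mm=0.002, cx=640, cy=360)
print(round(u), round(v))
```

A point on the optical axis maps to the principal point, and points farther from the camera move toward it, which is the perspective effect the calibration step has to model.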
Figure 3. Camera sensor processing pipeline.
Before using the camera sensors for our implementation and testing applications, we calibrate
each camera; this is due to the use of lenses in the camera sensors. Camera lenses distort the images
they create. We rectified these distortions by using pictures of known objects i.e., checkerboards.
The dimensions and geometry of the checkerboards are known; therefore, we can compute the
calibration parameters by comparing the camera output images with the original images to account
for those distortions. There are two major camera chips in the market today, namely: Charge-Coupled
Device (CCD) and Complementary Metal Oxide Semiconductor (CMOS) [
]. There is also a newer
camera technology called Live-MOS, which combines the strengths of CCD and CMOS. In
a CCD chip, the photosites (RGB blocks) are passive, and the electrons at those sites are processed one
column at a time. Amplification and A/D conversion also happen outside of the sensor. In CMOS,
as opposed to CCD, the photosites are active, and each includes its own amplifier and ADC. We used a
CMOS camera for our application due to the lower cost, lower power consumption, and faster data
readout. The advantage CCD has over CMOS is its high light sensitivity, which results in better image
quality at the cost of higher power consumption that can lead to heat issues and a higher price.
3.2. RADAR
Radar stands for Radio Detection and Ranging. Radar works on the principle of transmitting and
detecting electromagnetic (EM) radio waves. Radars can measure information about the obstacles they
detect like relative position based on the wave travel time and relative velocity based on the Doppler
property of EM waves. Radar EM waves are not affected by different lighting
or weather conditions; thus, they can detect at day or night, in foggy, cloudy, or snowy conditions.
Automotive radars are usually mounted either behind a bumper or in the vehicle grille; the vehicles
are designed to ensure the radar performance (the radar antenna ability to focus EM energy) is not
affected no matter where they are mounted. For automotive applications, Frequency-Modulated
Continuous Wave (FMCW) type radar is typically used [
]. FMCW radar transmits continuous waves
or chirps in which the frequency increases (or decreases) over time usually in a sawtooth waveform.
These radars typically work at frequencies of 24 gigahertz (GHz), 77 GHz, and 79 GHz. These
frequencies correspond to millimeter-scale wavelengths; hence, they are also called Millimeter Wave (MMW)
radars. There are three major classes of automotive radar systems depending on the application: SRR
(Short-Range Radar) mostly for parking assist and collision proximity warning, MRR (Medium-Range
Radar) mostly for blind-spot detection, side/rear collision avoidance and LRR (Long-Range Radar) for
adaptive cruise control and early collision detection mounted upfront. FMCW radar hardware consists
of multiple components as shown in Figure 4.
Figure 4. Radar sensor processing pipeline.
The Frequency Synthesizer generates the FMCW wave or chirp at the desired frequency. The wave
is then amplified using a power amplifier depending on the radar class and the desired distance
coverage for the radar. The further the desired range, the higher the power amplification needed for
the signal. The transmit (TX) antenna converts the electrical energy of the signals to electromagnetic
signals and transmits them through the air. The receive (RX) antenna receives the reflected signal.
Field of View (FoV) of radar is determined by the antenna type, pattern, and gain value, discussing
which is out of the scope of this paper. The antenna also increases the power of the signal by focusing
the energy in a specific direction. The received signal is amplified through a Low Noise Amplifier
(LNA) since some of the EM energy is attenuated by moving through air and the targets it hits during
the signal travel. The amplified received signal is sent to a mixer. The mixer multiplies the reflected
signal by the original signal generated by the synthesizer. Through this operation, the mixer shifts
the frequency of the EM wave to the difference between the original and reflected signal frequencies.
The mixer then outputs this delta frequency, also known as the frequency shift or intermediate frequency
(IF). The output of the mixer is converted from analog to digital using an analog-to-digital converter
(ADC) so we can further process the signal for target detection and tracking.
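The mixer operation described above can be checked numerically: multiplying two sinusoids yields components at the difference and the sum of their frequencies, and the difference term is the IF that is kept. The sketch below uses toy frequencies in the Hz range (an assumption made only to keep the simulation small; real automotive radars operate at GHz), and the function name `mixer_output_peaks` is hypothetical.

```python
import numpy as np


def mixer_output_peaks(f_tx, f_rx, fs=1000.0, duration=1.0):
    """Multiply two tones and return the two dominant output frequencies (Hz)."""
    t = np.arange(int(fs * duration)) / fs
    # Ideal mixer: sample-wise product of transmitted and received signals.
    mixed = np.cos(2 * np.pi * f_tx * t) * np.cos(2 * np.pi * f_rx * t)
    spectrum = np.abs(np.fft.rfft(mixed))
    freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)
    # The two strongest bins are the difference and sum frequencies.
    strongest = np.argsort(spectrum)[-2:]
    return np.sort(freqs[strongest])


peaks = mixer_output_peaks(f_tx=200.0, f_rx=180.0)
print(peaks)  # lower value is the difference (IF) term, upper value is the sum term
```

In hardware, the sum term is removed by low-pass filtering, so only the difference frequency reaches the ADC.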
3.3. LiDAR
LiDAR stands for Light Detection and Ranging. LiDAR works on the principle of emitting
a laser light or Infrared (IR) beam and receiving the reflection of the light beams to measure the
surrounding environment. The LiDARs used for autonomous vehicles mostly operate around the 900 nanometer
(nm) wavelength, but some use a longer wavelength, which works better in fog or rain. LiDAR outputs
point cloud data (PCD) which contain the position (x, y, z coordinates) and intensity information of
the objects. Intensity value shows the object’s reflectivity (the amount of light reflected by the object).
The position of the objects is detected using the speed of light and time of flight (ToF). There are
various classes of LiDAR. The two major ones are Mechanical LiDAR and Solid State LiDAR (SSL).
In Mechanical/Spinning LiDAR, the laser is emitted into and received from the environment through rotating
lenses driven by an electric motor to capture a desired FoV around the ego vehicle. Solid State
LiDARs (SSL) do not have any spinning lenses to direct the laser light; instead, they steer the laser
lights electronically. These LiDARs are more robust, reliable, and less expensive than their mechanical
counterparts, but their downside is a smaller FoV compared to the mechanical ones. It is
worth noting that the laser beams generated in LiDARs are class 1 lasers, the lowest intensity of
four laser classes; therefore, their operation is considered harmless to the human eye.
Typical mechanical LiDAR hardware consists of various components as shown in Figure 5.
Figure 5. LiDAR sensor processing pipeline.
The laser light is generated and emitted using diode emitters; it is passed through rotating
lenses to focus its energy at various angles. The light is then reflected off the objects and received
through another set of rotating lenses. Reflected light information is received through photoreceivers.
The timer/counter then calculates the ToF between the emitted and received laser beams to measure
the object position state information. The resulting point cloud information is then sent to an on-board
vehicle processor for further processing. LiDAR adds angular resolution and higher accuracy to the
camera and radar sensors.
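The time-of-flight principle reduces to a short calculation: the beam travels to the object and back, so the range is half the round-trip distance. The sketch below also converts a range plus beam angles into x, y, z coordinates; the angle convention (azimuth in the horizontal plane, elevation above it) is an assumption for illustration, since each LiDAR vendor defines its own frame.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def tof_to_range(tof_s):
    """Range in meters from a round-trip time of flight in seconds."""
    return C * tof_s / 2.0


def to_cartesian(r, azimuth_deg, elevation_deg):
    """Convert range and beam angles to x, y, z (assumed angle convention)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = r * math.cos(el) * math.cos(az)
    y = r * math.cos(el) * math.sin(az)
    z = r * math.sin(el)
    return x, y, z


# A return received about 333.6 ns after emission is roughly 50 m away.
r = tof_to_range(333.6e-9)
print(round(r, 1), to_cartesian(r, azimuth_deg=30.0, elevation_deg=-2.0))
```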
Similar to cameras, LiDAR captures 2D (X-Y direction) spatial information; however, unlike
cameras, it also captures range to create 3D spatial information about the environment around the ego vehicle.
By measuring range, LiDAR adds angular resolution (horizontal azimuth and vertical) for better
measurement accuracy compared to the camera. Also, having a higher frequency and shorter
wavelengths enables LiDAR to be more accurate than radar sensor measurements. A typical LiDAR
has a range of about 40 to 100 m, a resolution accuracy of 1.5–10 cm, a vertical angular resolution
of 0.35 to 2 degrees, a horizontal angular resolution of 0.2 degrees, and an operating frequency of
10–20 Hz.
4. Hybrid Sensor Fusion Algorithm Overview
The three main perception sensors used in autonomous vehicles have their strengths and
weaknesses. Therefore, the information from them needs to be combined (fused) to make the best
sense of the environment. Here, we propose a hybrid sensor fusion algorithm to this end. The hybrid
sensor fusion algorithm consists of two parts that run in parallel as shown in Figure 6. In each part,
a set configuration of sensors and a fusion method is used that is best suited for the fusion task at hand.
Figure 6. Multi-sensor fusion algorithm pipeline.
The first part deals with high-resolution tasks of object classification, localization, and semantic
road segmentation by using the camera (vision) and LiDAR sensors.
The camera’s raw video frames and the depth channel from the LiDAR are combined before being
sent to a Deep Neural Network (DNN) in charge of the object classification and road segmentation tasks.
Deep networks such as Convolutional Neural Networks (CNN) and Fully Convolutional Networks
(FCN) have shown better performance and accuracy for computer vision (CV) tasks compared to
traditional methods i.e., feature extractors (SIFT, HOG) with a classic Machine Learning (ML) algorithm
(SVM) [30,31].
The combination of camera and LiDAR with an FCN architecture gives us the best sensor-method
combination for performing classification and segmentation. In this paper, we prioritize real-time
performance and accuracy. Therefore, instead of segmenting the road scene into many
fine-grained classes, as has often been done in the literature [
], we only perform segmentation for
two classes: the free space (driveable area) of the road and the non-driveable area for our
autonomous vehicle. We can also use an efficient and fast network like YOLO V3 [
] to detect obstacles
within the segmented road. For automated driving, knowledge of driveable space and obstacle classes
on the road is essential for path planning and decision making.
The second part deals with the task of detecting objects and tracking their states by using LiDAR
and radar sensors. The LiDAR point cloud data (PCD) and radar signals are processed and fused
at the object level. The LiDAR and radar data processing will result in clusters of obstacles on the
road within the region of interest (ROI) with their states. The fusion of processed sensor data at the
object level is referred to as late fusion. The resulting late fused LiDAR and radar data is sent to a state
estimation method to best combine the noisy measured states of each sensor. The Kalman Filter (KF)
is the method of choice for state estimation; however, it works on the assumption of linear motion
models. Since the motion of obstacles like cars is non-linear, we use a modified version of the KF,
namely the Extended Kalman Filter (EKF). Knowing and tracking the states of the obstacles on the
road helps to predict and account for their behavior in the AV path planning and decision making
stacks. Finally, we overlay each fusion output and visualize them on the car monitor.
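The object-level LiDAR and radar fusion with an EKF can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this work: the state is assumed to be [px, py, vx, vy] with a constant-velocity motion model, LiDAR is assumed to measure position directly (a linear update), and radar is assumed to measure range, bearing, and range rate (a non-linear update linearized with a Jacobian). All function names and noise values are hypothetical.

```python
import numpy as np

H_LIDAR = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])  # LiDAR observes position only


def predict(x, P, dt, accel_var=1.0):
    """Propagate state and covariance with a constant-velocity model."""
    F = np.eye(4)
    F[0, 2] = F[1, 3] = dt
    G = np.array([[0.5 * dt**2, 0.0], [0.0, 0.5 * dt**2], [dt, 0.0], [0.0, dt]])
    Q = G @ G.T * accel_var  # process noise from unmodeled acceleration
    return F @ x, F @ P @ F.T + Q


def h_radar(x):
    """Non-linear radar measurement: range, bearing, range rate."""
    px, py, vx, vy = x
    rho = np.hypot(px, py)
    return np.array([rho, np.arctan2(py, px), (px * vx + py * vy) / rho])


def jacobian_radar(x):
    """Jacobian of h_radar, used to linearize the radar update."""
    px, py, vx, vy = x
    r2 = px**2 + py**2
    r = np.sqrt(r2)
    r3 = r2 * r
    return np.array([
        [px / r, py / r, 0.0, 0.0],
        [-py / r2, px / r2, 0.0, 0.0],
        [py * (vx * py - vy * px) / r3, px * (vy * px - vx * py) / r3, px / r, py / r],
    ])


def update(x, P, z, H, R, z_pred, angle_index=None):
    """Kalman measurement update; H is the (possibly linearized) measurement matrix."""
    y = z - z_pred
    if angle_index is not None:
        # Wrap the bearing residual into [-pi, pi).
        y[angle_index] = (y[angle_index] + np.pi) % (2 * np.pi) - np.pi
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return x + K @ y, (np.eye(len(x)) - K @ H) @ P


# One cycle: predict, then fuse a LiDAR and a radar detection of the same object.
x, P = np.array([9.0, 6.0, 0.0, 0.0]), np.eye(4) * 10.0
x, P = predict(x, P, dt=0.1)
x, P = update(x, P, np.array([10.0, 5.0]), H_LIDAR, np.eye(2) * 0.01, H_LIDAR @ x)
z_radar = np.array([11.2, 0.46, 1.3])
x, P = update(x, P, z_radar, jacobian_radar(x), np.diag([0.09, 0.001, 0.09]),
              h_radar(x), angle_index=1)
print(x)
```

The same `update` function serves both sensors; only the measurement function and its matrix differ, which is what makes the object-level fusion modular when a sensor drops out.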
4.1. Object Classification and Road Segmentation using Deep Learning
For road semantic segmentation, we use camera images as well as depth information from the
LiDAR. We combine these raw data along the channel dimension, which results in an RGBD image with
four channels. We then send this information to our proposed Fully Convolutional
Network (FCNx). FCNx is an encoder-decoder-based network with modified filter numbers and sizes
in the downsampling and upsampling portions and improved skip connections to the decoder section.
The model is trained and tested on actual sensor data on our embedded NVIDIA GPU-powered
computers. Finally, the model performance is evaluated by comparing its predictions with baseline
benchmark networks and the ground truth labels.
4.1.1. Camera-LiDAR Raw Data Fusion
Before diving into the FCN architecture, we need to prepare the data that feeds into it. The camera
and LiDAR sensors provide the input data for the FCN (Figure 7). The camera provides 2-dimensional
(2D) color images with three RGB channels (Height × Width × 3). The LiDAR provides a high-resolution
depth map in addition to the point cloud data. We combine the unprocessed raw data of LiDAR and
camera (early fusion). The resulting RGBD image has four channels (H × W × 4). The RGBD
image contains the 2D appearance features of the camera image with 3D depth features of LiDAR
to give us a rich illumination-invariant spatial image. This image is fed to the FCN to extract road
features for semantic segmentation and to a CNN for object detection and localization, depending on
the application.
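The early fusion step amounts to stacking the depth map as a fourth channel on the RGB image. The sketch below assumes the LiDAR depth map is already dense and registered to the camera image (same height and width), and it normalizes both inputs to [0, 1]; the normalization and the `max_depth_m` value are assumptions for illustration, not details given in this paper.

```python
import numpy as np


def fuse_rgbd(rgb, depth, max_depth_m=100.0):
    """Stack an H x W x 3 RGB image and an H x W depth map into H x W x 4."""
    if rgb.shape[:2] != depth.shape:
        raise ValueError("depth map must be registered to the RGB image size")
    rgb_n = rgb.astype(np.float32) / 255.0
    # Clip so that returns beyond the assumed maximum range saturate at 1.0.
    depth_n = np.clip(depth.astype(np.float32) / max_depth_m, 0.0, 1.0)
    return np.concatenate([rgb_n, depth_n[..., None]], axis=-1)


rgb = np.zeros((4, 6, 3), dtype=np.uint8)
depth = np.full((4, 6), 25.0)
rgbd = fuse_rgbd(rgb, depth)
print(rgbd.shape)
```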
Figure 7. Early fusion of LiDAR depth image and camera image as an input to a DNN pipeline.
4.1.2. Proposed Fully Convolutional Network (FCNx) Architecture
The architecture consists of two main parts: an encoder and a decoder. First, we have the encoder;
it is based on the VGG16 architecture [
], with its fully connected layers replaced with convolutional
layers. The purpose of the encoder is to extract features and spatial information from our four-channel
RGBD input image. It uses a series of convolutional layers to extract features and max-pooling layers
to reduce the size of the feature maps. Second, we have a decoder, which is based on the FCN8
architecture [
]. Its purpose is to rebuild the prediction of pixel classes back to the original image size
while maintaining low-level and high-level information integrity. It uses transposed convolutional
layers to upsample the output of the last convolutional layer of the encoder [
]. Also, it uses skip
convolutional layers to combine the finer low-level features from encoder layers with the coarser
high-level features of transposed convolutional (upsampled) layers.
In our proposed network, which builds on the above and is shown in Figure 8, we combine the encoder
output from layer four with the upsampled layer seven. This combination is an element-wise addition.
This combined feature map is then added to the output of the third layer skip connection. In the skip
connection of the layer three output, we use a second convolutional layer to further extract features
from layer three output. The addition of this convolutional layer adds some features that would be
extracted in layer four. Including some basic layer four level feature maps will help the layer three skip
connection to represent a combined feature map of layer three and layer four. This combined feature
map is then added to the upsampled layer nine, which itself represents a combined feature map of
layer seven and layer four. This addition is shown to give a better accuracy and lower Cross-Entropy
loss compared to the base VGG16-FCN8 architecture. The better performance can be explained by the
fact that having some similar layer four feature maps can help better align the extracted features when
performing the last addition. We will refer to our proposed architecture as FCNx.
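The order of the additions in the decoder can be traced with a shape-level sketch. This is only a data-flow illustration under strong assumptions: 1 × 1 convolutions (plain matrix products) stand in for the skip convolutions, nearest-neighbour upsampling stands in for the learned transposed convolutions, and the feature map sizes and weights are small random placeholders. None of these stand-ins are the layers of the actual FCNx.

```python
import numpy as np


def conv1x1(x, w):
    """1 x 1 convolution: mixes channels independently at every pixel."""
    return x @ w  # (H, W, C_in) @ (C_in, C_out)


def upsample2x(x):
    """Nearest-neighbour 2x upsampling (stand-in for a transposed convolution)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def decoder_sketch(layer3, layer4, layer7, w3a, w3b, w4, w7):
    up7 = upsample2x(conv1x1(layer7, w7))   # upsampled layer seven
    fused4 = up7 + conv1x1(layer4, w4)      # element-wise addition with layer four
    up9 = upsample2x(fused4)                # upsampled "layer nine"
    # Layer-three skip connection with a second convolution, as described above.
    skip3 = conv1x1(np.maximum(conv1x1(layer3, w3a), 0.0), w3b)
    return up9 + skip3                      # final element-wise addition


rng = np.random.default_rng(0)
layer3 = rng.normal(size=(8, 8, 16))
layer4 = rng.normal(size=(4, 4, 32))
layer7 = rng.normal(size=(2, 2, 64))
w3a, w3b = rng.normal(size=(16, 16)), rng.normal(size=(16, 2))
w4, w7 = rng.normal(size=(32, 2)), rng.normal(size=(64, 2))
scores = decoder_sketch(layer3, layer4, layer7, w3a, w3b, w4, w7)
print(scores.shape)  # two class scores per location at layer-three resolution
```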
Figure 8. Overview of the object detection and road segmentation architecture via a Fully Convolutional
Neural Network. The tiles and colors indicate different layers of the FCNx and their type respectively.
Encoder consists of: convolution (teal) and max pooling (green) layers. Decoder consists of: skip
convolution (purple) and transposed convolution (yellow). The size and number of feature maps are
shown at the top and bottom of each layer.
The purpose of the FCNx applied to camera and LiDAR fused raw data is to segment the road
image into free navigation space area and non-driveable area. The output can be further processed
and sent through a plug and play detector network (YOLO [
], SSD [
]) for object detection
and localization depending on the situation.
4.2. Testing Obstacle Detection and Tracking using Kalman Filtering
In this section, we use the radar and LiDAR sensors to detect obstacles and measure their states
by processing each sensor's data, i.e., the radar beat signal and the LiDAR point cloud, individually. We then
perform a late data fusion, or object-level fusion, to combine the processed data from the radar and
LiDAR. Then, using a non-linear Kalman Filter method, we take those noisy sensor measurements to
estimate and track obstacle states with a higher accuracy than either sensor alone.
4.2.1. Radar Beat Signal Data Processing
The radar sensor can detect obstacles and their states in a four step process as shown in Figure 9.
The first step is to process the noisy digitized mixed or beat signal we receive from the radar. The beat
signal is sent through an internal radar ADC to get converted to a digital signal. The digital signal is in
Sensors 2019,19, 4357 11 of 23
the time domain and comprises multiple frequency components. In the second step, we use a
one-dimensional (1D) Fast Fourier Transform (FFT), also known as 1st stage or Range FFT, to transform
the signal from the time domain to frequency domain, and separate all its frequency components.
The output of the FFT is a frequency response represented by signal power or amplitude in dBm unit
versus beat frequency in MHz unit. Each peak in the frequency response represents a detected target.
Figure 9.
Radar data processing pipeline. Plots from left to right: Noisy radar beat signal, output of the
1D-FFT showing a detection with its range, output of the 2D-FFT showing a detection with its range
and Doppler velocity, and output of the 2-Dimensional Constant False Alarm Rate (2D-CFAR) filtering
showing the detection with noises filtered out.
The x-axis can be converted from beat frequency of the targets to range of the targets by:
$$R = \frac{c \cdot T_{sweep} \cdot f_b}{2 \cdot B_{sweep}} \tag{1}$$
where $R$ is the range of the target, $c$ is the speed of light and $f_b$ is the beat frequency of the target. $T_{sweep}$ and $B_{sweep}$ are the chirp/sweep time and bandwidth respectively, which can be calculated by:
$$T_{sweep} = 5.5 \cdot \frac{2 \cdot R_{max}}{c} \tag{2}$$
$$B_{sweep} = \frac{c}{2 \cdot d_{res}} \tag{3}$$
where $R_{max}$ is the radar maximum range and $d_{res}$ is the range resolution of the radar. The constant
5.5 comes from the assumption that the sweep time is often between 5 and 6 times the round-trip delay corresponding to
the maximum radar range.
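As a minimal sketch of Equations (1) to (3), the snippet below synthesizes a noise-free beat signal for a single target and recovers its range from the strongest frequency bin. The radar parameters (200 m maximum range, 1 m resolution), the sample count, and all function names are illustrative assumptions, and a naive DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath
import math

C = 3.0e8  # speed of light in m/s (rounded)

def sweep_parameters(r_max, d_res):
    """Chirp time and bandwidth from Equations (2) and (3)."""
    t_sweep = 5.5 * 2.0 * r_max / C
    b_sweep = C / (2.0 * d_res)
    return t_sweep, b_sweep

def range_from_beat(f_beat, t_sweep, b_sweep):
    """Equation (1): convert a beat frequency in Hz to a range in metres."""
    return C * t_sweep * f_beat / (2.0 * b_sweep)

def dominant_frequency(samples, sample_rate):
    """Naive DFT; returns the frequency (Hz) of the strongest bin below Nyquist."""
    n = len(samples)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n

# Hypothetical radar: 200 m maximum range, 1 m range resolution.
t_sweep, b_sweep = sweep_parameters(r_max=200.0, d_res=1.0)
true_range = 110.0
f_beat_true = 2.0 * b_sweep * true_range / (C * t_sweep)  # Equation (1) inverted

n_samples = 256
sample_rate = n_samples / t_sweep  # one chirp fills the sample window
beat = [math.cos(2.0 * math.pi * f_beat_true * i / sample_rate)
        for i in range(n_samples)]

estimated_range = range_from_beat(
    dominant_frequency(beat, sample_rate), t_sweep, b_sweep)
print(f"estimated range: {estimated_range:.1f} m")
```

With these assumed parameters each frequency bin corresponds to one range-resolution cell, which is why the recovered range lands within one cell of the simulated target.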
As discussed, the range FFT output gives us the beat frequency, amplitude, phase, and range
(from the equations above) of the targets. To measure the velocity (or Doppler velocity) of the targets,
we need to find the Doppler frequency shift, which is the rate of change of phase across radar chirps.
The target phase changes from one chirp to another. Therefore, after acquiring the range FFT on the
radar chirps, in the third step, we run another FFT (Doppler FFT) to measure the rate of change of
phase i.e., the Doppler frequency shift. The output of the Doppler FFT is a 3D map represented by
signal power, range, and Doppler velocity. This 3D map is also referred to as Range Doppler Map
(RDM). RDM gives us an overview of the targets range and velocity.
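The Doppler step can be illustrated for a single range bin. The sketch below assumes a 77 GHz carrier, a fixed chirp duration, and 128 chirps (none of which are stated in the text) and uses the standard monostatic relation between Doppler shift and radial velocity, f_d = 2v/λ. A naive DFT across chirps replaces the Doppler FFT.

```python
import cmath
import math

C = 3.0e8
CARRIER_HZ = 77.0e9          # assumed automotive radar carrier frequency
WAVELENGTH = C / CARRIER_HZ
CHIRP_TIME = 7.33e-6         # assumed chirp duration in seconds
N_CHIRPS = 128               # assumed number of chirps per frame

def doppler_velocity(range_bin_samples, chirp_time, wavelength):
    """DFT across chirps for one range bin; returns radial velocity in m/s.

    A positive value means the phase advances from chirp to chirp.
    """
    n = len(range_bin_samples)
    best_k, best_mag = 0, -1.0
    for k in range(n):
        acc = sum(s * cmath.exp(-2j * math.pi * k * m / n)
                  for m, s in enumerate(range_bin_samples))
        if abs(acc) > best_mag:
            best_k, best_mag = k, abs(acc)
    if best_k >= n // 2:     # upper half of the spectrum holds negative shifts
        best_k -= n
    doppler_hz = best_k / (n * chirp_time)
    return doppler_hz * wavelength / 2.0

# Simulate the chirp-to-chirp phase rotation of one target.
true_velocity = 20.0
doppler_true = 2.0 * true_velocity / WAVELENGTH
samples = [cmath.exp(2j * math.pi * doppler_true * m * CHIRP_TIME)
           for m in range(N_CHIRPS)]

estimated_velocity = doppler_velocity(samples, CHIRP_TIME, WAVELENGTH)
velocity_resolution = WAVELENGTH / (2.0 * N_CHIRPS * CHIRP_TIME)
print(f"estimated {estimated_velocity:.2f} m/s, "
      f"resolution {velocity_resolution:.2f} m/s")
```

The estimate is quantized to the velocity resolution, so it matches the simulated velocity only to within half a Doppler bin.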
Figure 10.
1D and 2D-CFAR sample windows (left) with 1D-CFAR multi object detection (right).
The plot shows signal strength vs. range. Five objects are detected (the blue peaks) at ranges 50, 100,
200, 350 and 700 feet.
RDM can be quite noisy since reflected radar signals can come from unwanted sources such as
the ground and buildings, which can create false alarms in our object detection task. In the fourth and
final step, this noise or clutter is filtered out to avoid such false positives. One filtering method
widely used in automotive applications is Cell Averaging CFAR (CA-CFAR) [38], which is a dynamic
thresholding method, i.e., it varies the threshold based on the local noise level of the signal (see Figure 10).
CA-CFAR slides a window across the 2D FFT output cells we generate. The sliding window in a 2D
CFAR includes: the Cell Under Test (CUT), i.e., the current cell tested for target presence; the guard cells
(GC), i.e., the cells around the CUT that keep the target signal from leaking into the surrounding cells
and biasing the noise estimate; and finally the reference or training cells (RC), i.e., the cells
surrounding the guard cells. The local noise level is estimated by averaging the signal in
the training cells, and the threshold is this noise estimate scaled by an offset factor. If the signal in the CUT is smaller than the threshold, it is removed. Anything above the
threshold is considered a detection.
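The thresholding rule above can be sketched in one dimension as follows. The window sizes, the scale factor, and the synthetic signal are assumptions chosen for illustration, and the signal is taken to be in linear power units (for dB values the offset would be added instead of multiplied). The 2D case applies the same idea over a rectangular window.

```python
def ca_cfar_1d(power, n_train=8, n_guard=2, scale=5.0):
    """Return indices of cells whose power exceeds the local CA-CFAR threshold.

    n_train and n_guard are counted on EACH side of the cell under test (CUT).
    """
    detections = []
    half = n_train + n_guard
    for cut in range(half, len(power) - half):
        # Training cells lie outside the guard cells on both sides of the CUT.
        leading = power[cut - half: cut - n_guard]
        trailing = power[cut + n_guard + 1: cut + half + 1]
        noise = (sum(leading) + sum(trailing)) / (len(leading) + len(trailing))
        if power[cut] > scale * noise:
            detections.append(cut)
    return detections

# Noise floor with mild deterministic ripple and two injected targets.
signal = [1.0 + 0.1 * ((i * 7) % 5) for i in range(100)]
signal[30] = 20.0
signal[70] = 15.0
hits = ca_cfar_1d(signal)
print(hits)
```

Because the threshold follows the local average, a strong target raises the threshold for its neighbours, which is the reason guard cells are excluded from the noise estimate of the target's own cell.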
At this point, we have radar CFAR detections with range and Doppler velocity information. However, object
detection and tracking in real time is a computationally expensive process. Therefore, we cluster the
radar detections that belong to the same obstacle to improve the performance of our pipeline.
We use a Euclidean distance-based clustering algorithm. In this method, all the radar detection points
that are within the size of the target are considered one cluster. The cluster is assigned a range
and velocity at its center equal to the mean of the ranges and velocities of all the detections in the cluster.
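A minimal sketch of this clustering step is shown below. It assumes each detection is an `(x, y, velocity)` tuple and that `max_gap` plays the role of the target size; both are illustrative choices, and single-linkage grouping is used as one simple form of Euclidean clustering.

```python
import math

def euclidean_clusters(points, max_gap):
    """Single-linkage clustering: points closer than max_gap share a cluster."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = [j for j in unvisited
                    if math.dist(points[current][:2], points[j][:2]) <= max_gap]
            for j in near:
                unvisited.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return sorted(clusters)

def summarize(points, cluster):
    """Mean x, mean y and mean velocity of one cluster."""
    n = len(cluster)
    return tuple(sum(points[i][d] for i in cluster) / n for d in range(3))

# Hypothetical detections as (x in m, y in m, radial velocity in m/s).
detections = [(10.0, 0.0, 5.0), (10.5, 0.4, 5.2), (11.0, -0.3, 4.8),
              (40.0, 3.0, -2.0), (40.8, 3.5, -2.2)]
clusters = euclidean_clusters(detections, max_gap=2.0)
summaries = [summarize(detections, c) for c in clusters]
print(clusters, summaries)
```

Each cluster is then handed to the tracker as a single detection, which reduces the number of measurements processed per frame.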
4.2.2. LiDAR Point Cloud Data Processing
LiDAR gives us rich information with point clouds (which include position coordinates x, y, z
and intensity i) as well as a depth map. The first step to process the LiDAR point cloud data (PCD), as
shown in Figure 11, involves downsampling and filtering the data. The raw LiDAR point cloud is high
resolution and covers a long distance (for example, a 64-lens laser acquires more than 1 million points
per second). Dealing with such a large number of points is very computationally expensive and will
affect the real-time performance of our pipeline. Therefore, we downsample and filter. We downsample
using a voxel grid; we define a cubic voxel within the PCD and keep only one point per voxel.
After downsampling, we use a region of interest (ROI) box to filter out any points outside of that box that
are not of interest to us (i.e., points from non-road objects like buildings). In the next step we segment
the filtered PCD into obstacles and the road. Knowledge of the road and objects on the road are the
two segments most important for the automated driving task. The PCD segmentation happens at
the point level; hence, processing every obstacle point individually would require a lot of resources and slow down the pipeline. In
order to improve the performance of our pipeline, similar to the radar data, we cluster the obstacle
segments based on their proximity to neighboring points (Euclidean Clustering) and assign each
cluster a new position coordinate (x, y, z), which is the mean of all the points within that
cluster. Finally, we can define bounding boxes of the size of the clusters and visualize the obstacles with
the bounding boxes.
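The downsampling and ROI filtering steps can be sketched as follows. The voxel size, the box limits, and the choice of the voxel centroid as the single retained point are assumptions for illustration; a production pipeline would typically rely on a point cloud library instead.

```python
def voxel_downsample(points, voxel_size):
    """Keep one representative (the centroid) per occupied cubic voxel."""
    buckets = {}
    for p in points:
        key = tuple(int(c // voxel_size) for c in p[:3])
        buckets.setdefault(key, []).append(p)
    return [tuple(sum(q[d] for q in bucket) / len(bucket) for d in range(3))
            for bucket in buckets.values()]

def crop_roi(points, lower, upper):
    """Discard points outside an axis-aligned region-of-interest box."""
    return [p for p in points
            if all(lo <= c <= hi for c, lo, hi in zip(p, lower, upper))]

# Hypothetical (x, y, z) points in metres.
cloud = [(0.1, 0.1, 0.0), (0.2, 0.3, 0.1), (0.4, 0.2, 0.05),
         (5.0, 1.0, 0.2), (5.1, 1.1, 0.3), (60.0, 0.0, 0.5)]

downsampled = voxel_downsample(cloud, voxel_size=0.5)
filtered = crop_roi(downsampled, lower=(-10.0, -5.0, -2.0), upper=(30.0, 5.0, 2.0))
print(len(cloud), len(downsampled), len(filtered))
```

In this toy cloud, six points collapse to three voxels, and the point at 60 m falls outside the assumed ROI box.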
Figure 11.
LiDAR data processing pipeline. Images from left to right: Dense LiDAR point cloud data,
downsampled and filtered LiDAR data, segmented point cloud into road and obstacles, clustered
obstacles in the segmented point cloud. The center dark spot is the ego vehicle.
For segmenting the PCD into road and obstacles, we need to separate the road plane from the
obstacles. In order to find the road plane, we use the Random Sample Consensus (RANSAC)
method [39]. Using RANSAC, we fit a plane to our 3D point cloud data. At each iteration, RANSAC picks
a random sample of our points and fits a plane through it. We consider the road points as inliers and
obstacle points as outliers. RANSAC counts the inliers (points belonging to the candidate plane) and returns the
plane with the highest number of inliers, or equivalently the lowest number of outliers (obstacles), as the road plane.
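A compact sketch of RANSAC plane fitting is given below. The distance threshold, iteration count, fixed random seed, and the perfectly flat synthetic road are assumptions that keep the example deterministic; real road surfaces are noisy and need a tuned threshold.

```python
import random

def fit_plane(p1, p2, p3):
    """Plane through three points as (a, b, c, d) with unit normal, or None."""
    u = [p2[i] - p1[i] for i in range(3)]
    v = [p3[i] - p1[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = sum(c * c for c in n) ** 0.5
    if norm < 1e-9:
        return None  # the three sampled points are collinear
    n = [c / norm for c in n]
    d = -sum(n[i] * p1[i] for i in range(3))
    return n[0], n[1], n[2], d

def ransac_plane(points, threshold, iterations=100, seed=0):
    """Return the indices of the largest inlier set found."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        plane = fit_plane(*rng.sample(points, 3))
        if plane is None:
            continue
        a, b, c, d = plane
        inliers = [i for i, (x, y, z) in enumerate(points)
                   if abs(a * x + b * y + c * z + d) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers

# Synthetic scene: a flat road at z = 0 and one slanted obstacle above it.
road = [(float(x), float(y), 0.0) for x in range(10) for y in range(-3, 4)]
obstacle = [(5.0 + 0.1 * k, 1.0, 0.5 + 0.1 * k) for k in range(10)]
points = road + obstacle

road_indices = ransac_plane(points, threshold=0.05)
obstacle_indices = [i for i in range(len(points)) if i not in set(road_indices)]
print(len(road_indices), len(obstacle_indices))
```

The points left over as outliers are the obstacle candidates that are passed on to the clustering step.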
4.2.3. Extended Kalman Filtering
After the semantic road segmentation using our FCNx network, in this section, we show the
implementation of single object tracking via our vehicle embedded computer using the traditional
Extended Kalman Filter (EKF) algorithm. This is to showcase the real-world implementation of object
tracking, predicting and maintaining their states (2-dimensional positions and velocities) over time.
The states of the object being tracked are $(P_x, P_y)$ (the object's positions in x and y-direction) and $(V_x, V_y)$ (the
object's velocities in x and y-direction). For state tracking, we use a state estimation method called
Kalman filtering (KF).
A standard KF involves three steps: initialization, prediction, and update. For the prediction
step, we use the object's current state $x_k$ (2D-position and 2D-velocity) and assume that the object
follows a constant velocity (CV) motion model to estimate the state at the next time step $x_{k+1}$ using the state
transition function $F$ in Equation (4). The object covariance $P$ represents the state estimate uncertainty in
Equation (5). We also model the noise in the KF: the state transition noise, the motion noise covariance, and the
measurement noise covariance. After getting the measurement from our LiDAR or radar sensor,
we map the state to the actual measured value of the sensors using the measurement function $H$
in Equation (6). We calculate the residual covariance from Equation (7).
We then compute the Kalman gain $K$, which combines the uncertainty of our predicted state with the uncertainty
of the sensor measurements. The Kalman gain gives more weight to the less uncertain value, either
the predicted state or the sensor measurement. We calculate the updated state estimate as
well as its covariance from Equations (9) and (10).
$$P_{k+1} = (I - K \cdot H) \cdot P_k \tag{10}$$
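To make the predict and update steps concrete, the sketch below implements a textbook linear Kalman filter with a constant velocity model and a position-only (LiDAR-like) measurement. The matrix helpers, noise values, time step, and simulated target are all assumptions for illustration, and the equations follow the conventional textbook form rather than any specific notation from the paper.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

H = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]  # the sensor observes position only

def predict(x, P, dt, q):
    """Constant velocity prediction: x' = F x, P' = F P F^T + Q."""
    F = [[1.0, 0.0, dt, 0.0], [0.0, 1.0, 0.0, dt],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    Q = [[q if i == j else 0.0 for j in range(4)] for i in range(4)]
    return matmul(F, x), add(matmul(matmul(F, P), transpose(F)), Q)

def update(x, P, z, r):
    """Measurement update with residual y, residual covariance S and gain K."""
    R = [[r, 0.0], [0.0, r]]
    y = sub(z, matmul(H, x))
    S = add(matmul(matmul(H, P), transpose(H)), R)
    K = matmul(matmul(P, transpose(H)), inv2(S))
    x_new = add(x, matmul(K, y))
    P_new = matmul(sub(identity(4), matmul(K, H)), P)  # same form as Equation (10)
    return x_new, P_new

# Target moving at 1 m/s along x, measured every 0.1 s without noise.
x = [[0.0], [0.0], [0.0], [0.0]]                     # state: Px, Py, Vx, Vy
P = [[10.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
for step in range(1, 51):
    x, P = predict(x, P, dt=0.1, q=1e-4)
    x, P = update(x, P, z=[[0.1 * step], [0.0]], r=0.01)
print([round(v[0], 3) for v in x])
```

Although only positions are measured, the filter recovers the velocity because the constant velocity model couples position and velocity through the state transition.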
The standard Kalman filter can only be used when both of our models (the motion model and the
measurement model) are linear, with the assumption that our data has a Gaussian distribution. When the data
is Gaussian and the motion and measurement models are linear, the output data will also have a
Gaussian distribution. However, in a system, the motion model or the measurement model or both
can be nonlinear. When either or both of these models are nonlinear, the input states go through a
nonlinear transformation and the output no longer follows a Gaussian distribution; hence, the standard KF
equations are not valid and the filter may not converge. In the case of nonlinear models, we have to use a
nonlinear state estimator like an Extended Kalman filter (EKF). EKF fixes this problem by linearizing
the nonlinear functions (state transition and measurement functions) around the mean of the current
state estimate using a Taylor series expansion and taking the Jacobian of our functions [40]. In EKF,
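the system equations are:
$$x_{k+1} = f(x_k, u_k) + w_k \tag{11}$$
$$z_{k+1} = h(x_{k+1}) + v_{k+1} \tag{12}$$
where $f$ and $h$ are the nonlinear state transition and measurement functions respectively, $u_k$ is the control input, and $w_k$ and $v_{k+1}$ are the process and measurement noise terms.
We linearize these functions by finding their Jacobians, so the linearized approximated system becomes:
$$x_{k+1} \approx f(\hat{x}_k, u_k) + F \, (x_k - \hat{x}_k) + w_k \tag{13}$$
$$z_{k+1} \approx h(\hat{x}_{k+1}) + H \, (x_{k+1} - \hat{x}_{k+1}) + v_{k+1} \tag{14}$$
where $\hat{x}$ denotes the current state estimate, and $F$ and $H$ are approximated as $F = \partial f / \partial x$ and $H = \partial h / \partial x$.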
For our tracking implementation, we use two sensor measurements: LiDAR and radar.
From LiDAR point cloud data, we acquire the object's timestamped position in $(P_x, P_y)$. From radar data, we acquire the object's timestamped position in $(P_x, P_y)$ as well
as the object's timestamped velocity in $(V_x, V_y)$. LiDAR sensor measurements are in
Cartesian coordinates and therefore linear. Radar sensor measurements are typically in polar coordinates
(unless a separate radar signal processing is done), i.e., timestamped $\rho$, the radial distance from the
sensor to the detected objects, $\varphi$, the angle between the radar beam and the x-axis, as well as $\dot{\rho}$, the radial
velocity along the radar beam.
For this nonlinear case we need to map our predicted state vector from Cartesian to polar
coordinates using a nonlinear measurement function $h(x)$; in order to linearize the measurement
function, we take its Jacobian using Equation (15):
$$H = \begin{bmatrix}
\dfrac{P_x}{\sqrt{P_x^2 + P_y^2}} & \dfrac{P_y}{\sqrt{P_x^2 + P_y^2}} & 0 & 0 \\
-\dfrac{P_y}{P_x^2 + P_y^2} & \dfrac{P_x}{P_x^2 + P_y^2} & 0 & 0 \\
\dfrac{P_y (V_x P_y - V_y P_x)}{(P_x^2 + P_y^2)^{1.5}} & \dfrac{P_x (V_y P_x - V_x P_y)}{(P_x^2 + P_y^2)^{1.5}} & \dfrac{P_x}{\sqrt{P_x^2 + P_y^2}} & \dfrac{P_y}{\sqrt{P_x^2 + P_y^2}}
\end{bmatrix} \tag{15}$$
where $P_x$ and $P_y$ are the object's positions in x and y-direction; $V_x$ and $V_y$ are the object's velocities in x
and y-direction. We use these measurement values in the Extended Kalman filter algorithm.
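The Cartesian-to-polar mapping and its Jacobian can be sketched as below. The function names and the state ordering `[Px, Py, Vx, Vy]` are assumptions, and a finite-difference approximation is included only as a sanity check of the analytic derivatives.

```python
import math

def radar_measurement(state):
    """Map [Px, Py, Vx, Vy] to polar (range, bearing, range rate)."""
    px, py, vx, vy = state
    rho = math.hypot(px, py)
    return [rho, math.atan2(py, px), (px * vx + py * vy) / rho]

def radar_jacobian(state):
    """Analytic Jacobian of radar_measurement with respect to the state."""
    px, py, vx, vy = state
    d2 = px * px + py * py
    if d2 < 1e-12:
        raise ValueError("Jacobian undefined at the sensor origin")
    d1 = math.sqrt(d2)
    d3 = d2 * d1
    cross = vx * py - vy * px
    return [
        [px / d1, py / d1, 0.0, 0.0],
        [-py / d2, px / d2, 0.0, 0.0],
        [py * cross / d3, -px * cross / d3, px / d1, py / d1],
    ]

def numerical_jacobian(state, eps=1e-6):
    """Forward-difference approximation used to cross-check the analytic form."""
    base = radar_measurement(state)
    columns = []
    for j in range(4):
        bumped = list(state)
        bumped[j] += eps
        columns.append([(b - a) / eps
                        for a, b in zip(base, radar_measurement(bumped))])
    return [list(row) for row in zip(*columns)]

state = [3.0, 4.0, 1.0, -2.0]
analytic = radar_jacobian(state)
numeric = numerical_jacobian(state)
max_error = max(abs(a - n) for ra, rn in zip(analytic, numeric)
                for a, n in zip(ra, rn))
print(f"largest analytic vs numeric difference: {max_error:.2e}")
```

The explicit guard at the origin matters in practice, since a target reported at zero range would otherwise cause a division by zero in the update step.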
5. Experiments Procedure Overview
For our FCNx architecture, we use a combined dataset for training. The dataset is a combination
of the UC Berkeley DeepDrive (BDD) dataset [41], the KITTI dataset [42] from the Karlsruhe Institute of
Technology and Toyota Technological Institute at Chicago, and a
self-generated dataset collected from the sensors installed in our test vehicle. We validated the
proposed algorithm on a dataset with 3000 training and 3000 testing road scene samples. Our sensor
data was acquired from a ZED camera and an Ouster LiDAR mounted on our test vehicle capturing
data from streets of the city of Chicago. The BDD and KITTI dataset are annotated and labeled at object
and pixel levels. For our dataset, we performed the annotating and labeling manually. We use these
annotations as the ground truth labels. For our annotations, we mark the road with the color yellow.
We trained the network by using the following tunable hyper-parameters. The parameter
values are found based on trial and error with different values and combinations that give the best
performance. Some of the hyper-parameters we picked are as follows: learning rate of 2
batch size of 5 and the keep probability of 0.5. We train the network on an Ubuntu 18.04.3 LTS
machine with an NVIDIA GeForce RTX 2080 Ti GPU and 64 GB of memory, then upload the model to an embedded
NVIDIA Xavier edge computer in our vehicle for real-time inference. From the obtained results of
our network we calculate the Cross-Entropy (C.E.) loss. This metric gives a measure of how well our
segmentation network is performing. We define our performance goal in terms of the validation C.E. loss
and strive to improve upon two baseline benchmark networks, namely FCN8 [23] and U-Net [43].
Another important factor for automated driving is performing inference in real time. Our network
experiment procedure is shown in Figure 12.
Figure 12.
Overview of the experiment process. Training a Deep Neural Network model requires
powerful computing resources; hence, for training we use a GPU powered workstation. Inference is
less computationally expensive; thus, when the model is trained, we use an embedded GPU enabled
device for testing and inference at the edge in-vehicle.
There are different metrics to measure the performance of our free navigation space segmentation
network like pixel accuracy, mean Intersect over Union (mIoU) and the Cross-Entropy (C.E.) loss.
The most common loss function is a pixel-wise Cross-Entropy. In this loss, we compare each pixel's
predicted class to the ground truth image pixel labels. We repeat this process for all pixels in each
image and take the average. In our segmentation we have two classes (i.e., binary), road and not-a-road.
The Cross-Entropy loss is defined as:
$$C.E.\ Loss = -\left( p \cdot \log(\hat{p}) + (1 - p) \cdot \log(1 - \hat{p}) \right) \tag{16}$$
where the probability of pixel ground-truth values for the road is defined as $P(Y_{true} = road) = p$
and for not-a-road as $P(Y_{true} = not\text{-}road) = 1 - p$. The predicted values for road and
not-a-road are defined as $P(Y_{pred} = road) = \hat{p}$ and $P(Y_{pred} = not\text{-}road) = 1 - \hat{p}$.
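As a small worked example of Equation (16), the sketch below averages the binary cross-entropy over a toy 2×2 label mask. The clipping constant `eps` and the toy predictions are assumptions added to avoid taking the logarithm of zero and to show how confidence affects the loss.

```python
import math

def pixel_cross_entropy(p_true, p_pred, eps=1e-7):
    """Binary cross-entropy of one pixel; p_pred is clipped away from 0 and 1."""
    p_pred = min(max(p_pred, eps), 1.0 - eps)
    return -(p_true * math.log(p_pred)
             + (1.0 - p_true) * math.log(1.0 - p_pred))

def mean_cross_entropy(labels, predictions):
    """Average the per-pixel loss over a 2D image."""
    pairs = [(t, p) for row_t, row_p in zip(labels, predictions)
             for t, p in zip(row_t, row_p)]
    return sum(pixel_cross_entropy(t, p) for t, p in pairs) / len(pairs)

labels = [[1, 1], [0, 0]]                  # 1 = road, 0 = not-a-road
confident = [[0.9, 0.9], [0.1, 0.1]]       # predicted road probabilities
unsure = [[0.6, 0.6], [0.4, 0.4]]

loss_confident = mean_cross_entropy(labels, confident)
loss_unsure = mean_cross_entropy(labels, unsure)
print(f"{loss_confident:.4f} {loss_unsure:.4f}")
```

Both predictions classify every pixel correctly after thresholding at 0.5, yet the less confident one incurs a higher loss, which is what makes cross-entropy more informative than pixel accuracy during training.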
To show the implementation of single object tracking via the traditional EKF on our embedded
edge computer, we use the LiDAR and radar sensor measurements ($P_x$, $P_y$: the object's positions in
x and y-direction; $V_x$, $V_y$: the object's velocities in x and y-direction). We use these measurement
values in the Extended Kalman filter algorithm. For our testing purposes, the ground truth (the actual
path of the object) is measured manually and used for calculating the root mean square error (RMSE) from
Equation (17) to measure the single object tracking algorithm performance.
$$RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left\| \hat{x}_{estimate}(t) - \hat{x}_{groundtruth}(t) \right\|^2} \tag{17}$$
where $\hat{x}_{estimate}$ is the EKF predicted state, $\hat{x}_{groundtruth}$ is the actual state of the test vehicle, and $n$ is the number of time steps.
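The RMSE computation can be sketched as follows. The function evaluates the error separately for each state component, which matches the way separate x and y values are reported later, and the sample trajectories are made-up numbers for illustration only.

```python
import math

def rmse_per_state(estimates, ground_truth):
    """Root mean square error for each state component over all time steps."""
    n = len(estimates)
    dims = len(estimates[0])
    return [math.sqrt(sum((e[d] - g[d]) ** 2
                          for e, g in zip(estimates, ground_truth)) / n)
            for d in range(dims)]

# Hypothetical (x, y) positions in metres at three time steps.
estimated_path = [(1.0, 2.0), (2.1, 2.9), (2.9, 4.1)]
actual_path = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]

rmse_x, rmse_y = rmse_per_state(estimated_path, actual_path)
print(f"RMSE x: {rmse_x:.4f}, RMSE y: {rmse_y:.4f}")
```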
6. Experimental Results and Discussion
The results of the hybrid sensor fusion framework are presented in this section. We measure
the performance using an evaluation metric appropriate for the fusion method; for our FCNx road
segmentation we use the Cross-Entropy (CE) loss metric. We also show the results of implementing
single object tracking using a traditional EKF fusion on our embedded edge computer. For tracking
evaluation we use the root mean square error (RMSE) metric. We compare our proposed network
architecture with two benchmarks FCN8 [23] and U-Net [43] networks architectures in Table 1.
Table 1. Comparison of our FCNx architecture with FCN8 and U-Net architectures.
Network | Architecture | Skip Connection Operation | Residual Connections
FCN8 [23] | Asymmetrical encoder-decoder base | Skip connections between the encoder and the decoder using element-wise sum operator | Three convolution groups for three downsampling layers before the upsampling path
U-Net [43] | Symmetrical encoder-decoder base | Skip connections between the encoder and the decoder using concatenation operator | All downsampling layers information is passed to the upsampling path
FCNx (Our model) | Asymmetrical encoder-decoder base | Skip connections between the encoder and the decoder using element-wise sum operator | Four convolution groups for three downsampling layers before the upsampling path
Analytically, we compare our proposed network's performance with the two benchmark networks, FCN8 [23]
and U-Net [43], by measuring the Cross-Entropy loss and inference time metrics
shown in Table 2. Under the same conditions, i.e., the same training input data, hyperparameter
choices and computing power, our FCNx model reduces the CE loss by 3.2% compared to the FCN8 model
while maintaining a similar inference time. We also see an improvement of 3.4% over the U-Net model;
although the U-Net network showed a faster inference time, the segmentation accuracy improvement
of FCNx offers a better trade-off. Table 2 results show that FCNx achieves
better free navigation space segmentation accuracy than the baseline benchmark networks with
an inference time of 185 ms. We also compare our proposed network with the two baseline
benchmarks and show their detection accuracy/cross-entropy loss over ten training cycles
in Figure 13.
Table 2. Comparison of our architecture performance with FCN8 and U-Net networks.
Method | Cross Entropy Loss (%) | Inference Time (ms)
FCN8 [23] | 6.8 | 180
U-Net [43] | 7.2 | 92
FCNx (Our model) | 3.6 | 185
Figure 13.
Comparison of FCN8 (a), U-Net (b), and FCNx (c) architectures training C.E. Loss over 10
training cycles (d).
Next, we show the results of our FCNx network. We tested our FCNx network on approximately
3000 road scenes and show eight of the more challenging scenes along with the ground truth and the
results in Figures 14 and 15. In all scenes shown, by visual examination we can see that our model
(row three) performs the road segmentation better and closer to the ground truth (row two) than the
benchmark network FCN8 (row four). In Figure 14, we show four scenes from left to right: highway
clear weather with no traffic, highway shadow mix with traffic, urban dim with cars parked, and
urban shadow mix with cars parked. Our FCNx segmentation was smoother at segmenting road
border lines than FCN8. For example, in Figure 14, FCN8 classified parts of the right-side parking area and right-side
sidewalk as the road in the row four, columns three and four images, respectively. FCN8 also showed
some difficulty with False Positives (FP). For example, it falsely classified parts of the left-side grass
area and parts of the bicycle path on the right side as the road in the row four, columns one and two
images, respectively.
Also, in Figure 15, we show four scenes from left to right: city shadow mix scene with moving
bicyclist and vehicle, back-road shadow mix with no traffic, urban city road clear sky with cars parked,
tunnel dim with no traffic. We again see that our network works better at the borders of the road
and at distinguishing between sidewalk and road in both the near field and the far field. This can
be attributed to the sharing and additional filtering of the encoder feature maps before fusion at different
decoder levels. In all eight scenes, we highlight the areas where our FCNx network performs better pixel
classification and road segmentation with yellow bounding boxes.
Figure 14.
Results of semantic segmentation of the road. First row: some examples of road scenes in various situations. From left to right: Highway clear weather no
traffic, highway shadow mix with traffic, urban dim with cars parked, and urban shadow mix with cars parked. Second row: the corresponding road ground truth
annotations in yellow overlaid over the original image. Third row: road segmentation by our model FCNx. Row four: road segmentation by UC Berkeley FCN8.
Figure 15.
Results of semantic segmentation of the road. First row: some examples of road scenes in various situations. From left to right: City shadow mix scene
with moving bicyclist and vehicle, back-road shadow mix with no traffic, urban city road clear sky with cars parked, tunnel dim with no traffic. Second row: the
corresponding road ground truth annotations in yellow overlaid over the original image. Third row: road segmentation by our model FCNx. Row four: road
segmentation by UC Berkeley FCN8.
Results of the Extended Kalman filter estimation for a single object tracking are shown in
Figure 16. The purpose is to show an example scenario and the feasibility of implementing object
tracking using our LiDAR and radar sensor data via a cost-effective embedded computer. Detailed
multi-object tracking, track management, data association for more challenging scenarios are not the
scope of this paper. The red line shows the actual path of the vehicle we are tracking and the blue
and yellow points are the LiDAR and radar measurements, respectively. The orange points are the
EKF fused output. The EKF fused predictions are closer to the actual path than individual LiDAR and
radar measurements.
Figure 16.
Results of EKF estimated state predictions of a target vehicle from sensors on an ego vehicle.
The blue points indicate LiDAR measurements of the target vehicle. The yellow points indicate radar
measurements of the target vehicle. The red path shows the actual path the target vehicle traveled, and
the orange points show the EKF fused predictions from both LiDAR and radar measurements.
We evaluate the performance of the Extended Kalman filter predictions by measuring the root
mean square error. RMSE measures how well our prediction values fit the actual path of the target
vehicle. RMSE values range from zero to positive infinity depending on the data, and a lower RMSE indicates
better accuracy. In our simple left-turn scenario experiment (Figure 16), we achieve RMSE values of 0.065 and
0.061 (less than 0.1) for the x and y-position of our tracked target vehicle, which shows an improvement
over the individual LiDAR and radar sensors.
We show the results of each fusion method in our hybrid sensor fusion pipeline. This information
gives our AV a detailed map of the environment, which is then sent to the path planning and
controls stack of the automated driving pipeline. Having a clear understanding of the surrounding
environment can result in optimal decision making, and generating optimal control inputs to the
actuators (accelerator, brakes, steering) of our AV.
Finally, we comment on the robustness of our proposed fusion framework in case of sensor failure
by discussing the following sensor-failure scenarios:
Camera sensor failure: In this case, the pipeline's free navigation space segmentation is affected,
but it is still able to segment the road using the LiDAR depth map alone. Detecting and tracking
objects is unaffected and is done via the late fused (object level) LiDAR and radar data with the
EKF algorithm.
Radar sensor failure: In this case, LiDAR can measure the speed of the objects via the range and
ToF measurements. This is slightly less accurate, but still usable for the EKF to track the
state of the detected objects. The road segmentation and object classification remain unaffected.
LiDAR sensor failure: In this case, the framework does not fail. Since the camera and radar can essentially perform
the segmentation and tracking tasks on their own, not having LiDAR data will not completely
derail or disturb the fusion flow. However, not having the LiDAR depth map or point cloud position
coordinates can reduce the segmentation quality as well as the tracking accuracy.
In all three scenarios, corrupted inputs can potentially influence the quality of segmentation,
detection, and tracking. A concrete analysis of robustness is left for future work.
7. Conclusions
The results of each fusion method in our hybrid sensor fusion algorithm give our autonomous
vehicle (AV) a detailed map of the environment. Having a clear understanding of the surrounding
environment can result in optimal decision making, and generation of optimal control inputs to the
actuators (accelerator, brakes, steering) of our AV. In this paper, a new sensor fusion framework
configuration is proposed and we demonstrate successful real-time environment perception in a
vehicle. Appropriate combinations of Vision + LiDAR and LiDAR + Radar sensor fusion are successfully
demonstrated using a hybrid pipeline: the proposed FCNx and the classic EKF. The algorithm uses
our proposed FCNx deep learning framework and is implemented in an edge-computing device in real
time using actual sensor data. The proposed solution and implementation enable fusion of camera,
radar and LiDAR in a cost-effective embedded solution. For future work, coordinating cloud computing and edge
computing would be the next step to further enhance this framework.
Author Contributions:
Conceptualization, B.S.J.; Formal analysis, B.S.J. and T.T.; Funding acquisition, B.S.J. and
S.C.; Investigation, B.S.J.; Methodology, B.S.J. and T.T.; Project administration, B.S.J., T.T. and S.C.; Resources, B.S.J.;
Software, B.S.J.; Supervision, T.T. and S.C.; Validation, B.S.J. and T.T.; Visualization, B.S.J.; Writing original draft,
B.S.J.; Writing, review and editing, B.S.J., T.T. and S.C.
Funding: This research was funded by University of Illinois at Chicago, College of Engineering.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Xiao, L.; Wang, R.; Dai, B.; Fang, Y.; Liu, D.; Wu, T. Hybrid conditional random field based camera-LIDAR
fusion for road detection. Inf. Sci. 2018, 432, 543–558.
2. Xiao, L.; Dai, B.; Liu, D.; Hu, T.; Wu, T. Crf based road detection with multi-sensor fusion. In Proceedings of
the 2015 IEEE Intelligent Vehicles Symposium (IV), Seoul, Korea, 28 June–1 July 2015; pp. 192–198.
3. Broggi, A. Robust real-time lane and road detection in critical shadow conditions. In Proceedings of
the International Symposium on Computer Vision-ISCV, Coral Gables, FL, USA, 21–23 November 1995;
pp. 353–358.
4. Teichmann, M.; Weber, M.; Zoellner, M.; Cipolla, R.; Urtasun, R. Multinet: Real-time joint semantic reasoning
for autonomous driving. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu,
China, 26–30 June 2018; pp. 1013–1020.
5. Sobh, I.; Amin, L.; Abdelkarim, S.; Elmadawy, K.; Saeed, M.; Abdeltawab, O.; Gamal, M.; El Sallab, A.
End-To-End multi-modal sensors fusion system for urban automated driving. In Proceedings of the 2018
NIPS MLITS Workshop: Machine Learning for Intelligent Transportation Systems, Montreal, QC, Canada,
3–8 December 2018.
6. Eitel, A.; Springenberg, J.T.; Spinello, L.; Riedmiller, M.; Burgard, W. Multimodal deep learning for robust
RGB-D object recognition. In Proceedings of 2015 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 681–687.
7. John, V.; Nithilan, M.; Mita, S.; Tehrani, H.; Konishi, M.; Ishimaru, K.; Oishi, T. Sensor Fusion of Intensity
and Depth Cues using the ChiNet for Semantic Segmentation of Road Scenes. In Proceedings of 2018 IEEE
Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 585–590.
8. Garcia, F.; Martin, D.; De La Escalera, A.; Armingol, J.M. Sensor fusion methodology for vehicle detection.
IEEE Intell. Transp. Syst. Mag. 2017, 9, 123–133.
9. Nada, D.; Bousbia-Salah, M.; Bettayeb, M. Multi-sensor data fusion for wheelchair position estimation with
unscented Kalman Filter. Int. J. Autom. Comput. 2018, 15, 207–217.
10. Jagannathan, S.; Mody, M.; Jones, J.; Swami, P.; Poddar, D. Multi-sensor fusion for Automated Driving:
Selecting model and optimizing on Embedded platform. In Proceedings of the Autonomous Vehicles and
Machines 2018, Burlingame, CA, USA, 28 January–2 February 2018; pp. 1–5.
11. Shahian-Jahromi, B.; Hussain, S.A.; Karakas, B.; Cetin, S. Control of autonomous ground vehicles: A brief
technical review. IOP Conf. Ser. Mater. Sci. Eng. 2017, 224, 012029.
12. Zhu, H.; Yuen, K.V.; Mihaylova, L.; Leung, H. Overview of environment perception for intelligent vehicles.
IEEE Trans. Intell. Transp. Syst. 2017, 18, 2584–2601.
13. Kocić, J.; Jovičić, N.; Drndarević, V. Sensors and sensor fusion in autonomous vehicles. In Proceedings of the
2018 26th Telecommunications Forum (TELFOR), Belgrade, Serbia, 20–21 November 2018; pp. 420–425.
14. Caltagirone, L.; Scheidegger, S.; Svensson, L.; Wahde, M. Fast LIDAR-based road detection using fully
convolutional neural networks. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV),
Redondo Beach, CA, USA, 11–14 June 2017; pp. 1019–1024.
15. Aeberhard, M.; Kaempchen, N. High-level sensor data fusion architecture for vehicle surround environment
perception. In Proceedings of the 8th International Workshop on Intelligent Transportation (WIT 2011),
Hamburg, Germany, 22–23 March 2011.
16. Caltagirone, L.; Bellone, M.; Svensson, L.; Wahde, M. LIDAR-camera fusion for road detection using fully
convolutional neural networks. Rob. Autom. Syst. 2018.
17. Wang, X.; Xu, L.; Sun, H.; Xin, J.; Zheng, N. On-road vehicle detection and tracking using MMW radar and
monovision fusion. IEEE Trans. Intell. Transp. Syst. 2016, 17, 2075–2084.
18. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for
image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
19. Zhang, F.; Clarke, D.; Knoll, A. Vehicle detection based on LiDAR and camera fusion. In Proceedings
of the 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), Qingdao, China,
8–11 October 2014; pp. 1620–1625.
20. Verma, S.; Eng, Y.H.; Kong, H.X.; Andersen, H.; Meghjani, M.; Leong, W.K.; Shen, X.; Zhang, C.; Ang, M.H.;
Rus, D. Vehicle Detection, Tracking and Behavior Analysis in Urban Driving Environments Using Road
Context. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA),
Brisbane, Australia, 20–25 May 2018.
21. Cho, H.; Seo, Y.W.; Kumar, B.V.; Rajkumar, R.R. A multi-sensor fusion system for moving object detection
and tracking in urban driving environments. In Proceedings of the 2014 IEEE International Conference on
Robotics and Automation (ICRA), Hong Kong, China, 31 May–5 June 2014; pp. 1836–1843.
22. Wagner, J.; Fischer, V.; Herman, M.; Behnke, S. Multispectral Pedestrian Detection using Deep Fusion
Convolutional Neural Networks. In Proceedings of the 24th European Symposium on Artificial Neural
Networks, Computational Intelligence and Machine Learning (ESANN), Bruges, Belgium, 27–29 April 2016.
23. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings
of the IEEE Conference on Computer Vision And Pattern Recognition, Boston, MA, USA, 7–12 June 2015;
pp. 3431–3440.
24. Noh, H.; Hong, S.; Han, B. Learning Deconvolution Network for Semantic Segmentation. In Proceedings
of the 2015 IEEE International Conference on Computer Vision (ICCV), Las Condes, Chile,
11–18 December 2015.
25. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture
for Image Segmentation. arXiv 2015, arXiv:1505.07293.
26. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R.B. Mask R-CNN. In Proceedings of the IEEE International
Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
27. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal,
QC, Canada, 7–12 December 2015.
28. York, T.; Jain, R. Fundamentals of Image Sensor Performance. Available online:
https://www.cse.wustl.edu/~jain/cse567-11/ftp/imgsens/ (accessed on 10 August 2019).
29. Patole, S.M.; Torlak, M.; Wang, D.; Ali, M. Automotive radars: A review of signal processing techniques.
IEEE Signal Process Mag. 2017, 34, 22–35.
30. Mukhtar, A.; Xia, L.; Tang, T.B. Vehicle Detection Techniques for Collision Avoidance Systems: A Review.
IEEE Trans. Intell. Transp. Syst. 2015, 16, 2318–2338.
31. Sun, Z.; Bebis, G.; Miller, R. On-Road Vehicle Detection: A Review. IEEE Trans. Pattern Anal. Mach. Intell.
2006, 28, 694–711.
32. Barowski, T.; Szczot, M.; Houben, S. Fine-Grained Vehicle Representations for Autonomous Driving.
In Proceedings of 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI,
USA, 4–7 November 2018; pp. 3797–3804.
33. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
34. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv
2014, arXiv:1409.1556.
35. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection.
In Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA,
26 June–1 July 2016; pp. 779–788.
36. Shafiee, M.J.; Chywl, B.; Li, F.; Wong, A. Fast YOLO: A fast you only look once system for real-time embedded
object detection in video. arXiv 2017, arXiv:1709.05943.
37. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector.
In Proceedings of the European Conference on Computer Vision, Amsterdam, Netherlands, 6–8 October 2016;
pp. 21–37.
38. Kabakchiev, C.; Doukovska, L.; Garvanov, I. Cell averaging constant false alarm rate detector with Hough
transform in randomly arriving impulse interference. Cybern. Inf. Tech. 2006, 6, 83–89.
39. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to
Image Analysis and Automated Cartography. Commun. ACM 1981, 24, 381–395.
40. Ljung, L. Asymptotic behavior of the extended Kalman filter as a parameter estimator for linear systems.
IEEE Trans. Autom. Control 1979, 24, 36–50.
41. Yu, F.; Xian, W.; Chen, Y.; Liu, F.; Liao, M.; Madhavan, V.; Darrell, T. Bdd100k: A diverse driving video
database with scalable annotation tooling. arXiv 2018, arXiv:1805.04687.
Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Rob. Res.
1231–1237. [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation.
In Proceedings of International Conference on Medical Image Computing and Computer-Assisted
Intervention, Munich, Germany, 5–9 October 2015.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license.
... Nowadays, more than a decade after the first self-driving car winning the Defense Advanced Research Projects Agency (DARPA) Challenge, the interest in creating and deploying completely autonomous vehicles is at an all-time high. An autonomous vehicle requires trustworthy solutions to produce an accurate map of its surroundings, which is only possible through the use of multi-sensor perception systems equipped with RAdio Detection And Ranging (RADAR), Cameras, and Light Detection And Ranging (LiDAR) sensors [1,2,3,4]. These multi-sensor perception systems empower vehicles with the ability to detect the distance and speed of nearby obstacles, as well as their shape, so they can drive safely through the environment, contributing to the various levels of driving automation defined by the Society of Automotive Engineers (SAE). ...
... LiDAR technology is constantly evolving and thus finding different usages in a wide range of applications. Since accurate and exact measurements of the environment using a 3D point cloud representation can aid perception systems [3], they are used in a variety of applications, including obstacle identification, object and vehicle detection [8], pedestrian recognition and tracking [9], and ground segmentation for road filtering [10,11]. Nonetheless, a variety of noise sources, including internal components [12], mutual interference [13,14], reflectivity issues [15], light [5], and severe weather conditions [4,16,17], can corrupt LiDAR's sparse 3D point clouds, making it harder to understand the vehicle's surroundings. ...
... An autonomous vehicle must recognize and understand the world around it to safely navigate its surroundings. Therefore, the majority of autonomous cars employ multi-sensor perception systems that typically include RADAR, Cameras, and LiDAR sensors [1,2,3,4]. Each of these sensors offers several advantages and disadvantages. ...
The interest in developing and deploying fully autonomous vehicles on our public roads is in full swing. As a result, interest in autonomous driving capabilities, already implemented in modern automobiles via Advanced Driver Assistance Systems (ADAS), is at an all-time high. However, employing highly reliable perception systems to navigate the environment is a very demanding task, requiring multiple sensors like Cameras, Radio Detection And Ranging (RADAR), and Light Detection And Ranging (LiDAR). The latter is critical for determining the distance and speed of surrounding obstacles while producing high-resolution 3D representations of the surrounding environment in real-time. Despite being regarded as a game-changer in the autonomous driving paradigm, LiDAR sensors can be susceptible to several noise sources, including internal components, mutual interference, light, and severe weather conditions. Notwithstanding the numerous mitigation approaches for weather denoising described in the literature, some issues still exist, ranging from high complexity to low accuracy or even systems with reduced overall performance. Thus, this dissertation proposes the ALFA-Pd framework, an embedded weather denoising approach that supports the existing and new state-of-the-art outlier removal methods using a reconfigurable hardware platform. ALFA-Pd contributes a new weather denoising approach by combining the Dynamic Radius Outlier Removal (DROR) and the Low-Intensity Outlier Removal (LIOR) algorithms. The performed evaluation shows that Dynamic low-Intensity Outlier Removal (DIOR) can achieve better accuracy and performance while guaranteeing real-time requirements when compared to other state-of-the-art solutions.
... Different methods using deep learning for autonomous vehicles and driving have been presented elsewhere, in a survey [9]. In this paper, we focus on camera-based image processing; work using active sensors and sensor fusion has also been presented elsewhere and represents an active field of study [10,11]. ...
... The labels for each of the 15 images will be unique (the labels for image 2 will start from the maximum value of the labels of image 1, plus 1, and so on), meaning that at the end, they can be joined together into a common label image, as shown in Figure 8. The resulting regions are similar to the superpixels described in [11], but these have the added advantage that they are the result of semantic segmentation and have a high probability that each region belongs to a single object-they are grouped by meaning and not simply by color or texture properties, as in the case of superpixels. The coded pixels will have a value of 0 if they belong to the free space (they are not obstacle points), or have a value between 1 and 15 if they belong to an object. ...
... The labels for each of the 15 images will be unique (the labels for image 2 will start from the maximum value of the labels of image 1, plus 1, and so on), meaning that at the end, they can be joined together into a common label image, as shown in Figure 8. The resulting regions are similar to the superpixels described in [11], but these have the added advantage that they are the result of semantic segmentation and have a high probability that each region belongs to a single object-they are grouped by meaning and not simply by color or texture properties, as in the case of superpixels. The problem is now re-stated as the problem of assigning a unique object identifier to each region. ...
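The label-offset scheme described in the two excerpts above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the cited authors' code: the function name `merge_label_images`, the nested-list image representation, and the rule that a later image overwrites an earlier one where both are nonzero are all hypothetical choices, since the excerpts do not specify them.

```python
def merge_label_images(label_images):
    """Merge per-image label maps into one map with globally unique labels.

    Each input is a 2D list of ints of identical size, where 0 means free
    space and 1..k are region labels local to that image.
    """
    height = len(label_images[0])
    width = len(label_images[0][0])
    merged = [[0] * width for _ in range(height)]
    offset = 0  # largest label value handed out so far
    for image in label_images:
        local_max = 0
        for r in range(height):
            for c in range(width):
                label = image[r][c]
                if label != 0:
                    # Shift local labels so they start at (previous maximum + 1).
                    # Assumption: later images overwrite earlier ones on overlap.
                    merged[r][c] = label + offset
                    local_max = max(local_max, label)
        offset += local_max
    return merged


first = [[1, 0], [2, 0]]
second = [[0, 1], [0, 1]]
print(merge_label_images([first, second]))
```

Here the single region of the second image is renumbered to 3 because the first image already used labels 1 and 2, so the two maps can share one common label image without collisions.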
Detecting the objects surrounding a moving vehicle is essential for autonomous driving and for any kind of advanced driving assistance system; such a system can also be used for analyzing the surrounding traffic as the vehicle moves. The most popular techniques for object detection are based on image processing; in recent years, they have become increasingly focused on artificial intelligence. Systems using monocular vision are increasingly popular for driving assistance, as they do not require complex calibration and setup. The lack of three-dimensional data is compensated for by the efficient and accurate classification of the input image pixels. The detected objects are usually identified as cuboids in the 3D space, or as rectangles in the image space. Recently, instance segmentation techniques have been developed that are able to identify the freeform set of pixels that form an individual object, using complex convolutional neural networks (CNNs). This paper presents an alternative to these instance segmentation networks, combining much simpler semantic segmentation networks with light, geometrical post-processing techniques, to achieve instance segmentation results. The semantic segmentation network produces four semantic labels that identify the quarters of the individual objects: top left, top right, bottom left, and bottom right. These pixels are grouped into connected regions, based on their proximity and their position with respect to the whole object. Each quarter is used to generate a complete object hypothesis, which is then scored according to object pixel fitness. The individual homogeneous regions extracted from the labeled pixels are then assigned to the best-fitted rectangles, leading to complete and freeform identification of the pixels of individual objects. The accuracy is similar to instance segmentation-based methods but with reduced complexity in terms of trainable parameters, which leads to a reduced demand for computational resources.
... The experimental results show the robustness and effectiveness of the proposed approach, which outperforms the competitive methods [57] Road segmentation, obstacle detection, and vehicle tracking based on an encoder-decoder-based FCN, an extended Kalman filter, and camera, lidar, and radar sensor fusion for autonomous vehicles ...
... Test data are collected by some of the known commercial lidar systems combined with optical cameras, or using KITTI datasets. There are various applications of CNNs, where some studies use them only for a specific step, such as vehicle detection [51,52], pedestrian detection [53], and object classification [54][55][56][57][58], while some try to use them for the whole process. For example, in [59], the authors proposed an end-to-end (E2E) self-driving algorithm utilizing a CNN that provided the vehicle speed and angle as outputs based on the input camera and 2D lidar data. ...
The development of light detection and ranging (lidar) technology began in the 1960s, following the invention of the laser, which represents the central component of this system, integrating laser scanning with an inertial measurement unit (IMU) and Global Positioning System (GPS). Lidar technology is spreading to many different areas of application, from those in autonomous vehicles for road detection and object recognition, to those in the maritime sector, including object detection for autonomous navigation, monitoring ocean ecosystems, mapping coastal areas, and other diverse applications. This paper presents lidar system technology and reviews its application in the modern road transportation and maritime sector. Some of the better-known lidar systems for practical applications, on which current commercial models are based, are presented, and their advantages and disadvantages are described and analyzed. Moreover, current challenges and future trends of application are discussed. This paper also provides a systematic review of recent scientific research on the application of lidar system technology and the corresponding computational algorithms for data analysis, mainly focusing on deep learning algorithms, in the modern road transportation and maritime sector, based on an extensive analysis of the available scientific literature.
... These approaches are currently mainly the focus of the automotive domain, where different information, coming from diverse sensors (Lidar, Radar, Camera, Ego-data, etc.) should be fused to enable autonomous driving. In [22], lidar depth information and 2D image data are combined into an RGBD image, which is then processed by a convolutional network for image segmentation and object detection. Another approach [23] provided a fusion method for vehicle detection, which combines laser information, visual data, and GPS data. ...
... Multi-Sensor Perception: To maximize information extraction from a driving scene, data is collected from a diverse set of sensors, e.g., cameras, lidar, and radar, to promote perception robustness. There are two primary schemes for processing these multisensory inputs: early fusion [24,30] and late fusion [29]. The former combines all sensory features into a single feature at an early point in the ADS pipeline, but is susceptible to sensing noise. ...
Due to the high performance and safety requirements of self-driving applications, the complexity of modern autonomous driving systems (ADS) has been growing, instigating the need for more sophisticated hardware which could add to the energy footprint of the ADS platform. Addressing this, edge computing is poised to encompass self-driving applications, enabling the compute-intensive autonomy-related tasks to be offloaded for processing at compute-capable edge servers. Nonetheless, the intricate hardware architecture of ADS platforms, in addition to the stringent robustness demands, set forth complications for task offloading which are unique to autonomous driving. Hence, we present ROMANUS, a methodology for robust and efficient task offloading for modular ADS platforms with multi-sensor processing pipelines. Our methodology entails two phases: (i) the introduction of efficient offloading points along the execution path of the involved deep learning models, and (ii) the implementation of a runtime solution based on Deep Reinforcement Learning to adapt the operating mode according to variations in the perceived road scene complexity, network connectivity, and server load. Experiments on the object detection use case demonstrated that our approach is 14.99% more energy-efficient than pure local execution while achieving a 77.06% reduction in risky behavior from a robust-agnostic offloading baseline.
... For advanced driver assistance systems (ADAS), authors in [12] propose a high-level sensor data fusion architecture. Reference [13] provides a hybrid multi-sensor fusion architecture that performs environment perception tasks such as road segmentation, obstacle identification, and tracking. In [14] a fusion method is presented based on fuzzy logic to calculate the object's distance by separately parsing image and point cloud data. ...
One of the main perception challenges for autonomous vehicles is car detection. Classic vision-based car identification approaches are insufficiently accurate, particularly for small objects, whereas sensors such as Lidars help in detecting objects of all shapes and sizes but are still limited in classifying and recognizing detected obstacles. To fully exploit the benefits of Lidar's depth information and vision's obstacle classification capabilities, this paper presents an object detection and distance estimation approach based on Lidar and camera fusion. The two sensors have different characteristics and must be aligned by performing a geometrical transformation and projection to fuse their data. The main purpose of the conducted research is to fuse sensor data to estimate the distance of objects detected using Tiny YOLOv4. Finally, the results of the evaluations on the KITTI datasets show that the proposed approach enables both object detection and distance estimation.
... Similarly, autonomous vehicle applications like those presented in [42] can utilize the VVSP platform without significant modifications. ...
In this paper, we present a hardware and software platform for signal processing (SPP) in long-range, multi-spectral, electro-optical systems (MSEOS). Such systems integrate various cameras such as lowlight color, medium or long-wave-infrared thermal and short-wave-infrared cameras together with other sensors such as laser range finders, radars, GPS receivers, etc. on rotational pan-tilt positioner platforms. An SPP is designed with the main goal to control all components of an MSEOS and execute complex signal processing algorithms such as video stabilization, artificial intelligence-based target detection, target tracking, video enhancement, target illumination, multi-sensory image fusion, etc. Such algorithms might be very computationally demanding, so an SPP enables them to run by splitting processing tasks between a field-programmable gate array (FPGA) unit, a multicore microprocessor (MCuP) and a graphic processing unit (GPU). Additionally, multiple SPPs can be linked together via an internal Gbps Ethernet-based network to balance the processing load. A detailed description of the SPP system and experimental results of workloads for typical algorithms on demonstrational MSEOS are given. Finally, we give remarks regarding upgrading SPPs as novel FPGAs, MCuPs and GPUs become available.
... The nonlinear state estimates required for object tracking were obtained by fusing camera, LIDAR, and radar data using an extended Kalman filter [27]. ...
Autonomous robotic systems (ARS) serve in many areas of daily life. Sensors are critically important for these systems. The sensor data obtained from the environment should be as accurate and reliable as possible and correctly interpreted by the autonomous robot. Since each sensor has advantages and disadvantages relative to the others, they should be used together to reduce errors. In this study, Convolutional Neural Network (CNN) based sensor fusion was applied to ARS to contribute to autonomous driving. In a real-time application, a camera and LIDAR sensor were tested with these networks. The novelty of this work is that a new CNN was trained on a uniquely collected data set and sensor fusion was performed between CNN layers. The results showed that the CNN based sensor fusion process was more effective than the individual usage of the sensors on the ARS.
Real-time machine vision applications running on resource-constrained embedded systems face challenges in maintaining performance. An especially challenging scenario arises when multiple applications execute at the same time, creating contention for the computational resources of the system. This contention results in an increase in the inference delay of the machine vision applications, which can be unacceptable for time-critical tasks. To address this challenge, we propose an adaptive model selection framework that mitigates the impact of system contention and prevents unexpected increases in inference delay by trading off the application accuracy minimally. The framework has two parts, which are performed pre-deployment and at runtime. The pre-deployment part profiles the system for contention in a black-box manner and produces a model set that is specifically optimized for the contention levels observed in the system. The runtime part predicts the inference delays of each model considering the system contention and selects the best model according to the predictions for each frame. Compared to a fixed individual model with similar accuracy, our framework improves the performance by significantly reducing the inference delay violations against a specified threshold. We implement our framework on the Nvidia Jetson TX2 platform and show that our approach achieves greater than 20% reductions in delay violations over the individual baseline models.
In this paper, we present a short overview of sensors and sensor fusion in autonomous vehicles. We focus on fusing data from the key sensors in autonomous vehicles: camera, radar, and lidar. The current state of the art in this area is presented, including a 3D object detection method that leverages both image and 3D point cloud information, a moving object detection and tracking system, and occupancy grid mapping used for navigation and localization in dynamic environments. It is shown that including more sensors in a sensor fusion system improves the performance and robustness of the solution. Moreover, using camera data in localization and mapping, which are traditionally solved with radar and lidar data, improves the perceived model of the environment. Sensor fusion has a crucial role in autonomous systems overall; therefore, it is one of the fastest developing areas in the autonomous vehicles domain.
We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network [1]. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN [2] and also with the well known DeepLab-LargeFOV [3] and DeconvNet [4] architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and most efficient inference memory-wise as compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at
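The pooling-index upsampling described in the SegNet abstract above can be illustrated with a minimal pure-Python sketch. It is not the SegNet implementation: the 2x2 window with stride 2, the single-channel nested-list feature map with even dimensions, and the function names `max_pool_2x2` and `max_unpool_2x2` are assumptions made for illustration, and the trainable convolutions that turn the sparse output into dense feature maps are omitted.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2.

    Returns the pooled values and, for each pooled value, the (row, col)
    position in the input where the maximum was found.
    """
    pooled, indices = [], []
    for r in range(0, len(feature_map), 2):
        pooled_row, index_row = [], []
        for c in range(0, len(feature_map[0]), 2):
            window = [(feature_map[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            # On ties, max() keeps the first candidate in window order.
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices, height, width):
    """Put each pooled value back at its recorded position; all else stays 0."""
    output = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            output[r][c] = value
    return output


x = [[1, 2, 0, 1],
     [3, 4, 1, 5],
     [9, 0, 2, 2],
     [1, 1, 3, 2]]
pooled, indices = max_pool_2x2(x)
print(pooled)
print(max_unpool_2x2(pooled, indices, 4, 4))
```

The unpooling step has no learned parameters: it only needs the stored positions, which is why the abstract says it eliminates the need for learning to upsample. The result is sparse, with one nonzero entry per pooling window.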
In this work, a deep learning approach has been developed to carry out road detection by fusing LIDAR point clouds and camera images. An unstructured and sparse point cloud is first projected onto the camera image plane and then upsampled to obtain a set of dense 2D images encoding spatial information. Several fully convolutional neural networks (FCNs) are then trained to carry out road detection, either by using data from a single sensor, or by using three fusion strategies: early, late, and the newly proposed cross fusion. Whereas in the former two fusion approaches, the integration of multimodal information is carried out at a predefined depth level, the cross fusion FCN is designed to directly learn from data where to integrate information; this is accomplished by using trainable cross connections between the LIDAR and the camera processing branches. To further highlight the benefits of using a multimodal system for road detection, a data set consisting of visually challenging scenes was extracted from driving sequences of the KITTI raw data set. It was then demonstrated that, as expected, a purely camera-based FCN severely underperforms on this data set. A multimodal system, on the other hand, is still able to provide high accuracy. Finally, the proposed cross fusion FCN was evaluated on the KITTI road benchmark where it achieved excellent performance, with a MaxF score of 96.03%, ranking it among the top-performing approaches.
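The first step mentioned in the abstract above, projecting a LIDAR point cloud onto the camera image plane, can be sketched with a standard pinhole camera model. This is a generic illustration rather than the cited authors' pipeline: the function name `project_points`, the parameter names, and the camera-frame convention (x right, y down, z forward) are assumptions, lens distortion is ignored, and the subsequent upsampling to dense images is not shown.

```python
def project_points(points_lidar, rotation, translation,
                   fx, fy, cx, cy, width, height):
    """Project 3D LIDAR points to pixel coordinates.

    rotation (3x3 nested list) and translation (length 3) map LIDAR
    coordinates into the camera frame; fx, fy, cx, cy are pinhole
    intrinsics in pixels. Returns (u, v, depth) for points that land
    inside the image and lie in front of the camera.
    """
    pixels = []
    for p in points_lidar:
        # Rigid transform into the camera frame: p_cam = R * p + t
        x, y, z = (sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
                   for i in range(3))
        if z <= 0:
            continue  # behind the camera, cannot be imaged
        u = fx * x / z + cx
        v = fy * y / z + cy
        if 0 <= u < width and 0 <= v < height:
            pixels.append((u, v, z))
    return pixels


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
cloud = [(0, 0, 10), (1, 0.5, 10), (0, 0, -5), (100, 0, 10)]
print(project_points(cloud, identity, [0, 0, 0], 100, 100, 50, 40, 100, 80))
```

Only the first two points survive in this example: the third lies behind the camera and the fourth projects outside the 100x80 image. Keeping the depth alongside each pixel is what allows a sparse depth image to be built from the projected points.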
We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at
Object detection is considered one of the most challenging problems in the field of computer vision, as it involves the combination of object classification and object localization within a scene. Recently, deep neural networks (DNNs) have been demonstrated to achieve superior object detection performance compared to other approaches, with YOLOv2 (an improved You Only Look Once model) being one of the state-of-the-art DNN-based object detection methods in terms of both speed and accuracy. Although YOLOv2 can achieve real-time performance on a powerful GPU, it remains very challenging to leverage this approach for real-time object detection in video on embedded computing devices with limited computational power and limited memory. In this paper, we propose a new framework called Fast YOLO, a fast You Only Look Once framework which accelerates YOLOv2 to be able to perform object detection in video on embedded devices in a real-time manner. First, we leverage the evolutionary deep intelligence framework to evolve the YOLOv2 network architecture and produce an optimized architecture (referred to as O-YOLOv2 here) that has 2.8X fewer parameters with just a ~2% IOU drop. To further reduce power consumption on embedded devices while maintaining performance, a motion-adaptive inference method is introduced into the proposed Fast YOLO framework to reduce the frequency of deep inference with O-YOLOv2 based on temporal motion characteristics. Experimental results show that the proposed Fast YOLO framework can reduce the number of deep inferences by an average of 38.13%, and an average speedup of ~3.3X for object detection in video compared to the original YOLOv2, leading Fast YOLO to run an average of ~18FPS on a Nvidia Jetson TX1 embedded system.
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.