Probabilistic End-to-End Vehicle Navigation in
Complex Dynamic Environments with
Multimodal Sensor Fusion
Peide Cai, Sukai Wang, Yuxiang Sun, Ming Liu, Senior Member, IEEE
Abstract—All-day and all-weather navigation is a critical capa-
bility for autonomous driving, which requires proper reaction to
varied environmental conditions and complex agent behaviors.
Recently, with the rise of deep learning, end-to-end control
for autonomous vehicles has been well studied. However, most
works are solely based on visual information, which can be
degraded by challenging illumination conditions such as dim
light or total darkness. In addition, they usually generate and
apply deterministic control commands without considering the
uncertainties in the future. In this paper, based on imitation
learning, we propose a probabilistic driving model with multi-
perception capability utilizing the information from the camera,
lidar and radar. We further evaluate its driving performance
online on our new driving benchmark, which includes various
environmental conditions (e.g., urban and rural areas, traffic
densities, weather and times of the day) and dynamic obstacles
(e.g., vehicles, pedestrians, motorcyclists and bicyclists). The
results suggest that our proposed model outperforms baselines
and achieves excellent generalization performance in unseen
environments with heavy traffic and extreme weather.
Index Terms—Automation technologies for smart cities, au-
tonomous vehicle navigation, multi-modal perception, sensorimo-
tor learning, motion planning and control.
I. INTRODUCTION
IN THE field of autonomous driving, traditional naviga-
tion methods are commonly implemented with modular
pipelines [1], [2], which split the navigation task into individ-
ual sub-problems, such as perception, planning and control.
These modules often rely on a multitude of engineering
components to produce reliable environmental representations,
robust decisions and safe control actions. However, since the
separate modules depend on each other, errors can accumulate
through the system. Moreover, each component requires careful
and time-consuming hand engineering.
In recent years, with the unprecedented success of deep
learning, an alternative method called end-to-end control [3]–
[12] has arisen. This paradigm mimics the human brain and
maps the raw sensory input (e.g., RGB images) to control
output (e.g., steering angle) in an end-to-end fashion. In
addition, it substitutes laborious hand engineering by learning
control policies directly on data from human drivers with deep
networks, where explicit programming or modeling of each
possible scenario is not needed. Moreover, it can adapt to
complex noise characteristics of different environments during
training, which cannot be captured well by analytical methods.
All authors are with The Hong Kong University of Science and
Technology, Hong Kong SAR, China.
Fig. 1. Snapshots of different driving scenarios (left to right: ClearDay,
RainySunset and DrizzleNight) with global route directions and sensor data
information. For visualization, we project the lidar data (y-channel, i.e., the
height information) and radar data (relative speed to the ego-vehicle) to the
image plane. Brighter points mean larger values. It can be seen that the
information from the lidar and radar is more consistent across different
environmental conditions than that from the camera.
While end-to-end driving has been considerably fruitful,
there exist three critical deficiencies in the prior works.
1) Visual information is over-emphasized. Most works
depend solely on cameras for scene understanding and decision
making [3]–[14]. However, although cameras are versatile and
cheap, they have difficulty capturing fine-grained 3-D information.
In addition, camera-based perception is easily affected by
challenging illumination and weather conditions, such as the
DrizzleNight case shown in Fig. 1. Because of the dim light
and raindrops in this scene, the blue car far ahead on the left
can be difficult to recognize. In such scenarios, vision-based
driving systems can be dangerous. However, the blue car is
quite distinguishable from the speed distribution in the radar
data.
2) The probabilistic nature of executable actions is not well
explored. Most works output deterministic commands to the
vehicle [15], [16]; however, non-determinism is a key aspect of
control, and it is useful in many safety-critical tasks such
as collision checking and risk-aware motion planning [17]. A
more reasonable approach, therefore, is to predict a motion
distribution indicating what the driving platform could do
rather than what it must do.
3) The prior end-to-end methods are not evaluated suf-
ficiently in terms of the navigation task. Most works are
evaluated by first collecting a driving dataset with ground-truth
annotations (e.g., expert control actions) and then measuring
the average prediction error offline on the test set [6], [9], [10],
[13], [14], [17]. However, different from the computer vision
tasks such as object detection, the priority of driving should
be safety and robustness rather than accuracy. As indicated
in [18], the offline prediction error cannot reliably reflect the
actual driving quality. Therefore, online evaluation is more
reasonable and should be given more attention. One critical
concern for online evaluation is the environmental complexity,
yet prior related works either test their methods in static
maps [11], [12], [16], [19], [20], or scenarios with low-level
complexity [3]–[5], [7], [8], [15].
The aforementioned limitations motivate our exploration
to enhance the perception capability for end-to-end driving
systems. To this end, we propose a mixed sensor setup combin-
ing a camera, lidar and radar. The multimodal information is
processed by uniform alignment and projection onto the image
plane. Then, ResNet [21] is used for feature extraction. Based
on this setup, we introduce a probabilistic motion planning
(PMP) network to learn a deep probabilistic driving policy
from expert provided data, which outputs both a distribution
of future motion based on the Gaussian mixture model (GMM)
[9], [17], [22], and a deterministic control action. Finally,
we evaluate the driving performance of our model online
on a new benchmark with extensive experiments. The main
contributions of this letter are summarized as follows.
• An end-to-end navigation method with multimodal sensor
fusion and probabilistic motion planning, named PMP-net,
for improving perception capability and considering
uncertainties in the future.
• A new online benchmark, named DeepTest, to perform
analysis of driving systems in high-fidelity simulated
environments with varied maps, weather, lighting conditions
and traffic densities.
• Extensive evaluation and human-level driving performance
of the proposed PMP-net, presented in unseen urban and
rural areas with extreme weather and heavy traffic.

II. RELATED WORK

End-to-end control is designed with deep networks to
directly learn a mapping from raw sensory data to control
outputs. The pioneering ALVINN system [23] developed in 1989
uses a multilayer perceptron to learn the directions a vehicle
should steer. With the recent advancement of deep learning,
end-to-end control techniques have experienced tremendous
success. For example, using more powerful modern convo-
lutional neural networks (CNNs) and higher computational
power, Bojarski et al. [3] demonstrate impressive performance
in simple real-world driving scenarios such as on flat or
barrier-free roads. Xu et al. [6] develop an end-to-end ar-
chitecture to predict future vehicle egomotion from a large-
scale video dataset. However, these works realize only a lane-
following task; goal-directed navigation is not studied.
To enable goal-directed autonomous driving, Codevilla et al.
[5] propose a conditional imitation learning pipeline. In this
work, the vehicle is able to take a specific turn at intersections
based on high-level navigational commands such as turn left
and turn right. Follow-up works include [7], [12], [13] and
[14]. Another trend of adding guidance to the control policy
is using global route, which is a richer representation of
the intended moving directions than turning commands. For
example, Gao et al. [4] render routes on 2D floor maps and
call them intentions. Then, a neural-network motion controller
maps intentions and camera images directly to robot actions.
Pokle et al. [16] follow this idea and implement a deep
local trajectory planner and a velocity controller to compute
motion commands based on the path generated by a global
planner. However, these two works only focus on indoor robot
navigation. For outdoor driving applications, Cai et al. [20]
realize high-speed autonomous drifting in racing scenarios
guided by route information with deep reinforcement learning.
However, the control policy is only evaluated in static maps.
Hecker et al. [10] propose to learn a control policy with GPS-
based route planners and surround-view cameras. However, as
with many other works [6], [9], [13], [17], this work is only
evaluated offline by analysing the average prediction error,
which provides little information about the actual driving quality.
Inspired by the route-guided navigation methods mentioned
above, we use a global planner to compute paths towards
destinations in outdoor driving areas. For the low-level reactive
control, we implement an end-to-end network translating the
global route into driving actions (steering, throttle and brake).
Based on this architecture, point-to-point autonomous driving
can be realized. The network is trained with imitation learning
and can adapt to varied environments to drive appropriately
(e.g., slow down at intersections) and safely (e.g., slow down
for a car, and urgently stop for jaywalkers). Similar to [4] and
[16], we assume that the localization information is available
during system operation. However, different from [4] and [16],
our work focuses on complicated outdoor driving scenarios,
and combines multimodal sensors complementing each other
to generate unified perception results.
In addition, our approach relates to the work of probabilistic
driving models. To improve the capability of handling long-
term plans with imitation learning, Amini et al. [9] propose a
variational network to predict a full distribution over possible
steering commands. Similarly, Huang et al. [17] propose to
use GMM to predict a distribution of future vehicle trajecto-
ries. These works explicitly consider uncertainties of future
motions on logged data with offline metrics. By contrast, we
evaluate our probabilistic driving model online with varied
environmental conditions (e.g., rainy nights with heavy traffic),
which has not been studied in this context before.
III. METHODOLOGY
A. Formulation
We formulate the problem of autonomous vehicle navigation
as a goal-directed motion planning task to be solved by an
end-to-end network architecture with imitation learning. The
goal is to control the vehicle to drive safely and robustly in
complex outdoor areas to achieve point-to-point navigation,
like a human driver. To this end, we design a probabilistic
driving model using multimodal perceptions from the camera,
Fig. 2. The architecture of our probabilistic motion planning network (PMP-net). It receives the multimodal sensory input and plans a motion distribution
for 3 seconds into the future, based on which a PID controller is designed to generate a control action a2. In addition, PMP-net generates another action a1
in an end-to-end fashion. Then the variance of the planned motion distribution is used to fuse the dual actions for controlling the vehicle.
lidar and radar. In addition, we choose the latest CARLA
simulation (0.9.7) [24] to train and evaluate the system1. The
entire pipeline of our PMP-net is shown in Fig. 2.
B. Dataset Collection
To make the model successfully learn the knowledge of
goal-directed reactive control in the context of outdoor driving,
we collect a large-scale dataset with a global planner and an
expert demonstrator in CARLA. At the beginning of each
driving episode, the ego-vehicle is spawned at a random
position p. Then a collision-free coarse route (ranging from
350 m to 1500 m) from p to a destination d is provided by a
global planner. The vehicle then follows this route at a speed of
around 40 km/h while reacting to local environments to avoid
collisions, such as slowing down for a slow-moving car
ahead. Additionally, the vehicle reasonably slows
the speed down to 15 km/h at intersections to ensure safety. In
the process of data collection, we record the vehicle velocities,
yaw angles, RGB images, lidar/radar data and expert driving
actions (i.e., steering, throttle and brake) at 10 Hz. Moreover,
in order to increase the complexity of our dataset, we focus
on the following two aspects:
1) Complexity of Environments: a) The datasets from prior
works [5], [7], [8] are generated only in one map with two
lanes and 90-degree turns (Town01 in Fig. 3). By contrast,
we use five urban maps for data collection, which consist
of different types of intersections and even roundabouts, and
multiple lanes on roads; b) We set nine combinations of
weather (clear, drizzle and rainy) and illumination (daytime,
sunset and night). Heavier rain leads to more puddles on roads,
and thus brings a greater reflection effect for visual perception.
1 Different from the older versions of CARLA (0.8.x) used in [5], [7] and
[8], which contain only two urban maps, the latest CARLA environment
provides seven maps covering both urban and rural areas, with more avail-
able sensors, improved physical dynamics and more realistic illuminations.
2) Complexity of Road Agents: a) We set pedestrians with
different appearances (children and adults) randomly running
or walking along the sidewalks and crosswalks. They
occasionally disobey traffic rules and cross the road abruptly
without warning, which increases the safety burden for
autonomous driving; b) We set different types of vehicles (e.g.,
cars, trucks, vans, jeeps, buses, motorcyclists and bicyclists)
with multiple appearances navigating around the cities at
varied speeds. Based on a) and b), we apply four levels of
traffic density for data collection: empty, few, regular and
dense. Note that these road agents are controlled by the AI
engine from CARLA to construct realistic city scenarios.
The setups mentioned above can be partially viewed in Fig.
3 and more can be viewed in our supplementary videos. These
help to generate sufficient interactions between the ego-vehicle
and road agents in diverse environments. Based on these
setups, we finally collect 360 high-fidelity driving episodes,
which last 10.8 hours in total with 389 thousand frames and
cover a driving distance of 247 km.
C. Model Architecture
1) Global Planning: The global planner is separate from
the deep networks. It is implemented with the A* algorithm
to plan a high-level coarse route from the start point to the
destination based on static town maps. Similar to [16] and
[20], we down-sample the full global route Gf to local relevant
routes G during navigation, which is shown in (1):

G = {(x1, y1), (x2, y2), . . . , (x130, y130)}.   (1)

Note that the first waypoint (x1, y1) in G is the closest
waypoint in Gf to the current location of the vehicle, and
the distance between every two adjacent points is 0.4 m. The
waypoints are then flattened into a 260-dimensional vector
to be processed by dense layers with fully connected ReLU
layers. The extracted feature is a higher-dimensional vector
fg ∈ R2048.
(a) Town01-ClearDay (b) Town02-DrizzleSunset (c) Town03-RainySunset (d) Town04-ClearNight (f) Town05-RainyNight
Fig. 3. Overview of our dataset: varied maps, weather and illumination conditions with increasing traffic densities (top to bottom). Noticeable road agents
are bounded by color boxes. Note that this figure shows only a small part of the environmental setups; please see Section III-B for more details.
Columns (a-c) show there can sometimes be jaywalkers running across the roads, for which the ego-vehicle will urgently slow down or completely stop
to ensure safety. In addition, it can be seen that in rainy scenarios, especially in RainyNight, the surroundings are considerably blurred (e.g., the unclear
motorcyclist in the Regular setting of column (f)), leading to potential risks for the vision-based driving models [5], [7], [8], [14].
Fig. 4. Multimodal data processing. We achieve data alignment by projecting
the lidar point clouds and radar measurements to the image plane and combining
them to form the ralidar image. Then, two ResNet34 modules are
used to extract features from the camera and ralidar images. Brighter points
mean larger values in the projected images. Noticeable road agents in the
projected radar image are bounded by white boxes.
2) Multi-Perception: To capture environmental
information, the camera records color textures in a 2D image
plane, while the lidar captures 3-D spatial locations and the
radar records movement information (i.e., speeds of obstacles
relative to the ego-vehicle). We combine these sensors together
in our network so that the vehicle is able to sense different
dimensions of its surroundings.
Specifically, we project the lidar point clouds and radar data
to the image plane with the same width and height as the
camera images. We name the result the ralidar image (250 × 600 × 4),
in which the first three channels encode the 3-D coordinates and
the fourth channel encodes relative speeds, as shown in Fig. 4.
In this way, the multimodal measurements are aligned on the
same space and can be uniformly processed with CNNs. In
this work, we use ResNet34 [21] as the backbone to extract
environmental features from the camera and ralidar images.
The results are feature vectors fi ∈ R2048 and fr ∈ R2048.
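The projection step can be sketched roughly as below. This is an illustrative NumPy sketch under assumptions not stated in the text: a pinhole camera model with a hypothetical intrinsic matrix K, and a simple per-pixel overwrite when points collide; the function name `make_ralidar_image` is ours.

```python
import numpy as np

def make_ralidar_image(points, K, H=250, W=600):
    """Project lidar/radar returns into a 4-channel 'ralidar' image.

    points: (N, 4) array of [x, y, z, relative_speed] in the camera
    frame (z pointing forward). K: 3x3 pinhole intrinsic matrix.
    The first three channels encode the 3-D coordinates and the
    fourth the relative speed, as in the ralidar image above.
    """
    img = np.zeros((H, W, 4), dtype=np.float32)
    pts = points[points[:, 2] > 0]      # keep points in front of the camera
    uvw = (K @ pts[:, :3].T).T          # pinhole projection
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    img[v[ok], u[ok]] = pts[ok]         # write [x, y, z, speed] per pixel
    return img

# Toy example: one return 10 m ahead, closing at 2 m/s (illustrative K).
K = np.array([[300.0, 0.0, 300.0],
              [0.0, 300.0, 125.0],
              [0.0, 0.0, 1.0]])
ralidar = make_ralidar_image(np.array([[0.0, 0.0, 10.0, -2.0]]), K)
```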
3) End-to-End Action Generation: In addition to the sen-
sory data and the global route, our network also takes as input
the velocity of the ego-vehicle (vx, vy) to the dense layers. The
extracted feature is a higher-dimensional vector fv ∈ R2048.
Then the features [fi, fr, fv, fg] are handled in two ways:
a) we concatenate them into a vector fc ∈ R8192 for further
processing, and b) in the spirit of [16], we fuse them with
an attention mechanism defined in (2), where the coefficients
a = [ai, ar, av, ag] reflect the relative importance of the features
in changing environments:

ff = ai·fi + ar·fr + av·fv + ag·fg.   (2)

The coefficients a are computed by transforming fc with dense
layers and softmax activation. After such feature fusion, a
control action a1 composed of steering, throttle and brake is
generated by projecting ff with fully connected ReLU layers.
Inspired by [18], we use the L1 loss function for this module
as it is better correlated to the online driving performance.
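The attention-style fusion can be illustrated as follows. In this NumPy sketch the dense layers that produce the coefficients are reduced to a single hypothetical weight matrix W and bias b, and the coefficient-weighted sum is our reading of the fusion in (2); none of the names come from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(f_i, f_r, f_v, f_g, W, b):
    """Fuse four 2048-d features with softmax attention coefficients.

    The features are concatenated into f_c (8192-d); a linear map
    (W, b: stand-ins for the trained dense layers) plus softmax yields
    a = [a_i, a_r, a_v, a_g]; the fused feature f_f is the
    coefficient-weighted sum of the four features.
    """
    f_c = np.concatenate([f_i, f_r, f_v, f_g])   # (8192,)
    a = softmax(W @ f_c + b)                     # (4,), non-negative, sums to 1
    f_f = a[0] * f_i + a[1] * f_r + a[2] * f_v + a[3] * f_g
    return f_f, a

rng = np.random.default_rng(0)
feats = [rng.standard_normal(2048) for _ in range(4)]
W, b = rng.standard_normal((4, 8192)) * 0.01, np.zeros(4)
f_f, a = attention_fuse(*feats, W, b)
```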
4) Probabilistic Motion Planning: In this module, we aim
to learn a fully parameterized distribution over possible ego-
motions (i.e., velocities and yaw angles) for 3.0 s into the
future, as shown in Fig. 2. We adopt the GMM to repre-
sent such a distribution due to its excellent approximation
properties. Specifically, the combined feature fcin our work
is transformed by dense layers into GMM parameters (i.e.,
weight, mean and variance) to describe the distribution of
future motions. Similar to [9] and [17], the negative log-
likelihood (NLL) loss function is used for this module.
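For a one-dimensional Gaussian mixture, the NLL loss used for this module can be written out as below. This is an illustrative NumPy sketch: the actual network predicts GMM parameters jointly over velocities and yaw angles per future time step, which is collapsed here to scalar targets for clarity.

```python
import numpy as np

def gmm_nll(y, weights, means, variances):
    """Mean negative log-likelihood of scalar targets y under a
    1-D Gaussian mixture with (K,) parameters (weights sum to 1)."""
    y = np.atleast_1d(y)[:, None]                 # (N, 1)
    # Per-component Gaussian densities, shape (N, K).
    dens = np.exp(-0.5 * (y - means) ** 2 / variances) / np.sqrt(
        2.0 * np.pi * variances)
    # Mixture likelihood per sample, then mean NLL over the batch.
    return float(-np.log((dens * weights).sum(axis=1)).mean())

# A 2-component mixture: the loss is lower for targets near a mode.
w = np.array([0.5, 0.5])
mu = np.array([0.0, 5.0])
var = np.array([1.0, 1.0])
near = gmm_nll([0.1], w, mu, var)   # close to the first mode
far = gmm_nll([2.5], w, mu, var)    # between the two modes
```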
As mentioned in [22], the advantage of probabilistic model-
ing is that we can make a decision by evaluating its statistical
properties. In this work, based on the mean values (µ) of the
planned motion distribution, we further design a PID controller
to calculate a control action a2composed of steering, throttle
and brake. The target point for this PID controller (assumed to
be k frames in the future) is set to the point 5 m ahead of the
vehicle, obtained by integrating µ over the planning horizon.
Then, the final action af to control the vehicle is computed by
examining the reliability of the motion distribution through its
accumulated variance σ2:

af = (1 − λ)a1 + λa2,   λ = exp(−c1 · max(0, Σi=1..k σi2 − c2)),

where c1 and c2 are positive constants.
In this way, higher planning uncertainty leads to smaller λ,
thus the final action will depend more on a1. We believe
that we can take advantage of both end-to-end control and
probabilistic modeling by performing such reliability-aware
action fusion.
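The reliability-aware fusion rule can be sketched as follows. This is a NumPy sketch under stated assumptions: the constants c1 and c2 are illustrative stand-ins (the trained values are not reported), and the exact form of the exponent, with an accumulated-variance threshold, is an assumption consistent with the text's claim that higher uncertainty yields a smaller λ.

```python
import numpy as np

def fuse_actions(a1, a2, sigma2_sum, c1=1.0, c2=0.5):
    """Reliability-aware fusion of the end-to-end action a1 and the
    PID action a2.

    sigma2_sum: accumulated variance of the planned motion distribution
    over the k frames up to the target point. Higher uncertainty gives
    a smaller lambda, so the final action relies more on a1.
    """
    lam = np.exp(-c1 * max(0.0, sigma2_sum - c2))
    return (1.0 - lam) * a1 + lam * a2, lam

a1 = np.array([0.1, 0.5, 0.0])   # [steer, throttle, brake] from end-to-end head
a2 = np.array([0.0, 0.3, 0.0])   # [steer, throttle, brake] from PID controller
low_unc, lam_low = fuse_actions(a1, a2, sigma2_sum=0.1)
high_unc, lam_high = fuse_actions(a1, a2, sigma2_sum=5.0)
```

With low accumulated variance λ saturates at 1 and the PID action a2 is used; with high variance λ decays towards 0 and the end-to-end action a1 dominates.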
TABLE I
                     Training Conditions     New Weather             New Town                New Town & Weather
Town Name            Town03 (urban)          Town05 (urban)          Town07 (rural)          Town06 (urban)
Traffic Density      Empty  Regular  Dense   Empty  Regular  Dense   Empty  Regular  Dense   Empty  Regular  Dense
Success Rate (%)
  CIL [5]            38     16       16      33     11       0       0      0        0       16     11       0
  CIL-R              83     55       38      33     22       16      22     11       11      11     11       11
  INT [4]            16     33       11      83     5        5       38     22       5       94     61       16
  PMP (ours)         100    72       88      100    77       77      100    83       72      100    88       83
Wrong Lane (%)
  CIL [5]            66.05  45.16    50.87   57.22  64.41    46.18   35.55  36.81    40.71   44.14  52.37    52.03
  CIL-R              26.60  25.57    19.07   26.58  36.64    41.86   8.88   7.20     3.35    42.50  50.72    51.61
  INT [4]            0.00   0.04     0.01    0.07   0.12     0.15    0.00   0.00     0.00    0.12   0.13     0.28
  PMP (ours)         0.02   0.00     0.01    0.40   0.48     0.50    0.04   0.00     0.01    0.43   0.40     0.61
Overspeed (%)
  CIL [5]            0.33   0.37     0.16    0.10   0.00     0.04    0.00
  CIL-R              0.14   0.13     0.08    0.04   0.00     0.00    0.33   0.09     0.28    0.10   0.16     1.54
  INT [4]            17.70  11.18    5.85    17.09  15.14    8.52    19.03  11.87    14.84   37.12  30.22    31.04
  PMP (ours)         0.11   0.22     0.12    0.14   0.00     0.06    0.26   0.30     0.28    0.40   0.28     0.36
IV. EXPERIMENTS
A. Training Setup
We train the proposed PMP-net on our large-scale driving
dataset introduced in Section III-B. The full dataset is divided
into a training set and a validation set according to the ratio
of 7:1, leading to 340K training samples2. We use the Adam
optimizer with a learning rate of 0.0001, and the batch size is
90. Based on these setups, the model is trained on two Nvidia
GeForce RTX 2080 Ti GPUs for about 75 hours, with 234K
training steps to achieve convergence. For comparison, we also
train and finetune three other baselines on the same training
set, all of which are designed for visual navigation:
• CIL: The conditional imitation learning network introduced
in [5]. This maps the camera images and ego-velocities
directly to control actions, based on four discrete commands
for goal-directed navigation: follow lane, turn left, turn right
and go straight at the intersection.
• CIL-R: We replace the original image processing module
of CIL (which is relatively shallow) with ResNet34, to
evaluate if deeper models perform better for our task.
• INT: The intention-net introduced in [4] with the backbone
of ResNet34 for fair comparisons. This maps the camera
images and global routes to control actions. Note that the
original intention-net takes the indoor floor maps rendered
with routes for directions. We replace it with the local
relevant routes G introduced in (1).
B. Evaluation
1) DeepTest Benchmark: We evaluate the online driving
performance for different models on our proposed DeepTest
2 Note the test set is not considered because we evaluate our model online
in Section IV-B by making the ego-vehicle directly interact with dynamic
environments.
benchmark in CARLA. Compared with the previous bench-
marks in [7] and [24], DeepTest has many more environmental
setups, such as more test maps, weather conditions and
interactions with road agents. In addition, different from [7] and [24],
we set zero tolerance for collision events, which means that
any degree of collisions with static (e.g., trees) or dynamic
(e.g., pedestrians) objects leads to a failed episode.
In our benchmark, different methods are tested on four
maps. For each map, we set three levels of traffic densities:
empty,regular and dense. Therefore, each driving model
relates to 12 driving tasks. Note that denser traffic leads to
harder driving tasks as it involves more dynamic obstacles on
the road. In each task, we further set 18 goal-directed episodes
with varied weather conditions. Therefore, to fully evaluate
PMP-net and the other three baselines, 864 driving episodes
should be conducted. Finally, the evaluation process costs 4
days on our computer and covers a driving distance of 855
km. Compared with the environmental setups in the training
set (Section III-B), we consider new maps, illuminations
and weather in DeepTest to test the generalization capability.
Specifically, we add an unseen rural map Town07 and an
urban map Town06. Town07 brings new challenges to test the
negotiation skills with narrow roads and many non-signalized
crossings. In addition, we add four extreme illumination and
weather conditions: ClearDark, DrizzleDark, StormDark and
StormSunset. The new Dark and Storm (i.e., heavy rain)
settings, which are shown in Fig. 5, bring extra challenges
to driving with limited vision. Similar to [5], we do not
consider traffic lights in this work. For quantification of the
driving performance, three metrics are adopted as follows:
• SR: success rate. An episode is considered successful
if the agent reaches a given goal without any collision
within a time limit. Based on this, we calculate the
success rate for models in different tasks.
• WL: the proportion of the period spent in a wrong lane
to the total driving time.
• OVSP: the proportion of the overspeeding period to the
total driving time. The speed limit is set to 20 km/h at
intersections and 50 km/h elsewhere.
Fig. 5. Online evaluation results of PMP-net in our DeepTest benchmark. The environment setups, driving velocities and control actions are shown in yellow
text. Noticeable road agents (e.g., jaywalkers) are bounded by green boxes. The range of steering is [-1, 1], while for throttle and brake the range is [0, 1].
The sample driving behaviors are: (c) lane-following, turning at (b,d,e) intersections or (a) roundabouts, (g) lane-changing, (f,h,i,j) vehicle-, bicyclist- or
motorcyclist-following, and (k,m) urgently slowing down for jaywalkers. All of these behaviors are performed autonomously and safely by PMP-net in an
end-to-end fashion without hand-crafted rules.
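For clarity, the per-episode quantities behind these metrics can be computed from per-frame logs along the following lines. This is a plain-Python sketch with a hypothetical log format (the benchmark's actual implementation is not given); the field names are ours.

```python
def evaluate_episode(frames, reached_goal, collided, timed_out):
    """Compute per-episode success, wrong-lane ratio (WL) and
    overspeed ratio (OVSP) from per-frame logs.

    Each frame is assumed to be a dict with 'wrong_lane' (bool),
    'speed' in km/h and 'at_intersection' (bool).
    """
    # Zero-tolerance policy: any collision (or timeout) fails the episode.
    success = reached_goal and not collided and not timed_out
    n = len(frames)
    wl = sum(f["wrong_lane"] for f in frames) / n
    # Speed limit: 20 km/h at intersections, 50 km/h elsewhere.
    ovsp = sum(
        f["speed"] > (20.0 if f["at_intersection"] else 50.0)
        for f in frames) / n
    return success, wl, ovsp

frames = [
    {"wrong_lane": False, "speed": 40.0, "at_intersection": False},
    {"wrong_lane": True,  "speed": 55.0, "at_intersection": False},
    {"wrong_lane": False, "speed": 25.0, "at_intersection": True},
    {"wrong_lane": False, "speed": 15.0, "at_intersection": True},
]
ok, wl, ovsp = evaluate_episode(frames, reached_goal=True,
                                collided=False, timed_out=False)
```

The benchmark-level SR is then the fraction of successful episodes per task, while WL and OVSP are time-weighted averages over the driven period.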
2) Quantitative Analysis: We show the results on the
DeepTest benchmark in Table I. In the following, the analyses
are given from two perspectives: the ability and the quality of
autonomous driving.
Ability: Success rate (SR) is used to measure the self-
driving ability, which is a crucial concern in this area.
It can be seen that the CIL model presents the worst results,
failing to achieve even a single successful episode in Town07.
In addition, although in Town03 we only set new routes with
similar environments to the training dataset, CIL still presents
low SRs (16~38%). With the help of a deeper backbone in
CIL-R, the performance is improved. For example, the SR in
Town03-empty increases from 38% to 83%.
By changing the model structure to INT, better generaliza-
tion performance on certain new environments is achieved, for
example, the SR in Town06-Regular increases from 11%
to 61%. However, INT performs worse than CIL-R in Town03
and some other new environments such as Town05-Dense.
Generally, INT and CIL-R show similarly low performance
in outdoor driving areas, especially in heavy traffic.
This is because they rely solely on visual perception, which often
struggles in tough environments such as StormDark. By
contrast, PMP-net achieves a much higher SR in all evaluation
setups, which indicates a superior generalization capability. In
particular, the SR increases to 100% in all environments for
the empty traffic, and to 72~88% for regular and dense traffic.
Quality: We use WL and OVSP to evaluate the driving
quality of different models. Due to the lack of concrete
direction guidance, CIL and CIL-R both have high WL values
(3.35~66.05%). Specifically, they often navigate the vehicle
to drive in the correct direction but in the wrong lanes. With
the help of the global route information, the models are able
to drive more accurately, as we can see by the WL values
for INT and PMP, which are all close to 0%. However, INT
tends to control the vehicle to drive at high speeds without
slowing down at intersections. This unsafe phenomenon leads
to high OVSP values for INT (5.85~37.12%), while PMP
still performs well on this metric (0.0~0.4%).
Generally, the remarkable improvements of PMP-net on the
benchmark w.r.t. the other three baselines confirm that our
proposed model can effectively learn and deploy the driving
knowledge in complex dynamic environments.
3) Qualitative Analysis: Fig. 5 shows the qualitative results
of PMP-net. When there are no obstacles ahead on straight
roads, our model drives relatively fast, at about 40 km/h (Fig.
5-(c)). When taking turns or following road agents, our model
reasonably slows down as a human driver would, as shown in
Fig. 5-(a,b,d,f,i). In addition, we show some results in extreme
conditions. In Fig. 5-(e), the traffic is heavy with many vehicles
driving at an intersection. Although the model is directed to
turn right, it applies full brake as another vehicle blocks the
road ahead. Moreover, in Fig. 5-(g,h), we set dense traffic on a
dark night where slow-moving obstacles are ahead of the ego-
vehicle. In these scenes with limited vision, PMP-net is also
able to drive safely by reducing the throttle to slow down when
changing/following lanes. Furthermore, the most challenging
scene is shown in Fig. 5-(m). In the StormDark environment,
there is a small child running across the road abruptly without
any warning. In this scene it is difficult even for a human
driver to notice the danger, because the surroundings cannot be
seen clearly. Surprisingly, our model slows down in time by
applying the brake to avoid an accident. Fig. 5-(k) is another
similar scenario. For interpretation, the planned motion
distribution of Fig. 5-(m) is attached, where we can see that the
planned speed drops rapidly within a short horizon (~0.5 s)
with low variance. We attribute such prominent performance to
our multimodal and probabilistic setup. More related driving
behaviors are shown in the supplementary videos.
V. CONCLUSION
In this paper, to realize autonomous driving in outdoor
dynamic environments, we proposed a deep navigation model
named PMP-net, which is based on multimodal sensors (a
camera, lidar and radar) and probabilistic end-to-end control.
We collected a large-scale driving dataset in the CARLA sim-
ulator and trained the model with imitation learning. In order
to fully evaluate the driving performance, we further proposed
a new online benchmark, DeepTest, whose environmental
complexity exceeds that of previous benchmarks. By setting
varied illumination, weather and traffic conditions in different
towns, we showed that our model achieves excellent driving
and generalization performance in both unseen urban and rural
areas with extreme weather and heavy traffic with dynamic
objects (e.g., vehicles, bicyclists and jaywalkers).
To further extend PMP-net for real autonomous vehicles,
the reality gap should be considered. 1) For the discrepancy in
sensory input, we can finetune the model with real-world data.
Lidar and radar readings are more consistent between real and
simulated deployments than camera images, which can help
regularize the finetuning process for domain adaptation.
2) For the discrepancy in driving platforms, we can adjust the
parameters of the PID controller to adapt to different vehicle
properties [14], thanks to the modular design of our network.
REFERENCES
[1] J. Leonard, J. How, S. Teller, M. Berger, S. Campbell, G. Fiore,
L. Fletcher, E. Frazzoli, A. Huang, S. Karaman et al., “A perception-
driven autonomous urban vehicle,” Journal of Field Robotics, vol. 25,
no. 10, pp. 727–774, 2008.
[2] E. D. Dickmanns, “The development of machine vision for road vehicles
in the last decade,” in Intelligent Vehicle Symposium, 2002. IEEE, vol. 1.
IEEE, 2002, pp. 268–281.
[3] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp,
P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., “End
to end learning for self-driving cars,” arXiv preprint arXiv:1604.07316,
2016.
[4] W. Gao, D. Hsu, W. S. Lee, S. Shen, and K. Subramanian, “Intention-
net: Integrating planning and deep learning for goal-directed autonomous
navigation,” in Conference on Robot Learning, 2017, pp. 185–194.
[5] F. Codevilla, M. Miiller, A. López, V. Koltun, and A. Dosovitskiy,
“End-to-end driving via conditional imitation learning,” in 2018 IEEE
International Conference on Robotics and Automation (ICRA). IEEE,
2018, pp. 1–9.
[6] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving
models from large-scale video datasets,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2017, pp. 2174–
[7] F. Codevilla, E. Santana, A. M. Lopez, and A. Gaidon, “Exploring the
limitations of behavior cloning for autonomous driving,” in The IEEE
International Conference on Computer Vision (ICCV), October 2019.
[8] L. Tai, P. Yun, Y. Chen, C. Liu, H. Ye, and M. Liu, “Visual-based
autonomous driving deployment from a stochastic and uncertainty-aware
perspective,” in 2019 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), 2019.
[9] A. Amini, G. Rosman, S. Karaman, and D. Rus, “Variational end-to-
end navigation and localization,” in 2019 International Conference on
Robotics and Automation (ICRA), May 2019, pp. 8958–8964.
[10] S. Hecker, D. Dai, and L. Van Gool, “End-to-end learning of driving
models with surround-view cameras and route planners,” in Proceedings
of the European Conference on Computer Vision (ECCV), 2018, pp.
[11] P. Karkus, X. Ma, D. Hsu, L. Kaelbling, W. S. Lee, and T. Lozano-Perez,
“Differentiable algorithm networks for composable robot learning,” in
Proceedings of Robotics: Science and Systems, Freiburg im Breisgau,
Germany, June 2019.
[12] M. Mueller, A. Dosovitskiy, B. Ghanem, and V. Koltun, “Driving policy
transfer via modularity and abstraction,” in Proceedings of The 2nd
Conference on Robot Learning, ser. Proceedings of Machine Learning
Research, A. Billard, A. Dragan, J. Peters, and J. Morimoto, Eds.,
vol. 87. PMLR, 29–31 Oct 2018, pp. 1–15.
[13] P. Cai, Y. Sun, Y. Chen, and M. Liu, “Vision-based trajectory planning
via imitation learning for autonomous vehicles,” in 2019 IEEE Intelligent
Transportation Systems Conference (ITSC), Oct 2019, pp. 2736–2742.
[14] P. Cai, Y. Sun, H. Wang, and M. Liu, “VTGNet: A vision-based
trajectory generation network for autonomous vehicles in urban envi-
ronments,” arXiv preprint arXiv:2004.12591, 2020.
[15] M. Bansal, A. Krizhevsky, and A. Ogale, “Chauffeurnet: Learning to
drive by imitating the best and synthesizing the worst,” in Proceedings
of Robotics: Science and Systems, Freiburg im Breisgau, Germany, June
2019.
[16] A. Pokle, R. Martín-Martín, P. Goebel, V. Chow, H. M. Ewald, J. Yang,
Z. Wang, A. Sadeghian, D. Sadigh, S. Savarese, and M. Vázquez, “Deep
local trajectory replanning and control for robot navigation,” in 2019
International Conference on Robotics and Automation (ICRA), May
2019, pp. 5815–5822.
[17] X. Huang, S. G. McGill, B. C. Williams, L. Fletcher, and G. Rosman,
“Uncertainty-aware driver trajectory prediction at urban intersections,”
in 2019 International Conference on Robotics and Automation (ICRA).
IEEE, 2019, pp. 9718–9724.
[18] F. Codevilla, A. M. Lopez, V. Koltun, and A. Dosovitskiy, “On offline
evaluation of vision-based driving models,” in The European Conference
on Computer Vision (ECCV), September 2018.
[19] M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena, “From
perception to decision: A data-driven approach to end-to-end motion
planning for autonomous ground robots,” in 2017 IEEE International
Conference on Robotics and Automation (ICRA), May 2017, pp. 1527–
[20] P. Cai, X. Mei, L. Tai, Y. Sun, and M. Liu, “High-speed autonomous
drifting with deep reinforcement learning,” IEEE Robotics and Automa-
tion Letters, vol. 5, pp. 1247–1254, 2020.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2016, pp. 770–778.
[22] J. Wiest, M. Höffken, U. Kreßel, and K. Dietmayer, “Probabilistic
trajectory prediction with gaussian mixture models,” in 2012 IEEE
Intelligent Vehicles Symposium. IEEE, 2012, pp. 141–146.
[23] D. A. Pomerleau, “ALVINN: An autonomous land vehicle in a neural
network,” in Advances in neural information processing systems, 1989,
pp. 305–313.
[24] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun,
“CARLA: An open urban driving simulator,” in Proceedings of the 1st
Annual Conference on Robot Learning, ser. Proceedings of Machine
Learning Research, S. Levine, V. Vanhoucke, and K. Goldberg, Eds.,
vol. 78. PMLR, 13–15 Nov 2017, pp. 1–16.
Supplement File of Probabilistic End-to-End
Vehicle Navigation in Complex Dynamic
Environments with Multimodal Sensor Fusion
Peide Cai, Sukai Wang, Yuxiang Sun, Ming Liu, Senior Member, IEEE
Fig. S1: Three levels of traffic density in our DeepTest benchmark: (1) Empty, (2) Regular, (3) Dense.
Fig. S2: Varied illumination (clear, sunset, night and dark) and weather (clear, drizzle, rainy and storm) conditions considered
in this work.
A. Traffic Densities
We set three levels of traffic density for our DeepTest
benchmark: empty, regular and dense. The specific settings
are listed in Table S1 and illustrated in Fig. S1.
TABLE S1: Different traffic densities in this work
Type Number of pedestrians Number of vehicles
Empty 0 0
Regular 40~75 60
Dense 60~150 80~120
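When configuring evaluation episodes, the density levels of Table S1 can be encoded and sampled as follows. This is a minimal sketch; the dictionary layout and function names are illustrative, not part of the benchmark code, and ranges written "a~b" are assumed to be sampled uniformly per episode.

```python
import random

# Hypothetical encoding of the DeepTest traffic-density levels (Table S1).
TRAFFIC_DENSITIES = {
    "empty":   {"pedestrians": (0, 0),    "vehicles": (0, 0)},
    "regular": {"pedestrians": (40, 75),  "vehicles": (60, 60)},
    "dense":   {"pedestrians": (60, 150), "vehicles": (80, 120)},
}

def sample_traffic(level, rng=random):
    """Draw concrete pedestrian/vehicle counts for one evaluation episode."""
    cfg = TRAFFIC_DENSITIES[level]
    return {kind: rng.randint(lo, hi) for kind, (lo, hi) in cfg.items()}
```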
B. Illumination and Weather Conditions
This paper involves four illumination conditions (i.e., daytime,
sunset, night and dark) and four weather conditions (i.e.,
clear, drizzle, rainy and storm). The specific settings can be
viewed in Fig. S2.
C. PID Control
To translate the planned motion distribution into actions,
two PID controllers are designed for lateral (steering) and
longitudinal (throttle and brake) control. The parameters are
shown in Table S3.
TABLE S2: Quantitative comparison of PMP-NO-RDR (which removes the radar input from PMP-net) with
other models. ↑ means larger numbers are better, ↓ means smaller numbers are better. The bold font highlights the best results
in each column.
Training Conditions New Weather New Town New Town & Weather
Town Name Town03 (urban) Town05 (urban) Town07 (rural) Town06 (urban)
Traffic Density Empty Regular Dense Empty Regular Dense Empty Regular Dense Empty Regular Dense
Success Rate (%)
CIL[S1] 38 16 16 33 11 0 0 0 0 16 11 0
CIL-R 83 55 38 33 22 16 22 11 11 11 11 11
INT[S2] 16 33 11 83 5 5 38 22 5 94 61 16
PMP-NO-RDR 50 38 55 16 22 38 44 44 22 100 72 50
PMP (ours) 100 72 88 100 77 77 100 83 72 100 88 83
Wrong Lane (%)
CIL[S1] 66.05 45.16 50.87 57.22 64.41 46.18 35.55 36.81 40.71 44.14 52.37 52.03
CIL-R 26.60 25.57 19.07 26.58 36.64 41.86 8.88 7.20 3.35 42.50 50.72 51.61
INT[S2] 0.00 0.04 0.01 0.07 0.12 0.15 0.00 0.00 0.00 0.12 0.13 0.28
PMP-NO-RDR 6.43 4.69 1.67 7.93 1.78 1.64 0.78 0.42 0.42 2.21 1.30 1.54
PMP (ours) 0.02 0.00 0.01 0.40 0.48 0.50 0.04 0.00 0.01 0.43 0.40 0.61
Overspeed (%)
CIL[S1] 0.33 0.37 0.16 0.10 0.00 0.04 0.00
CIL-R 0.14 0.13 0.08 0.04 0.00 0.00 0.33 0.09 0.28 0.10 0.16 1.54
INT[S2] 17.70 11.18 5.85 17.09 15.14 8.52 19.03 11.87 14.84 37.12 30.22 31.04
PMP-NO-RDR 0.48 0.04 0.15 0.16 0.00 0.02 0.18 0.27 0.09 0.03 0.22 0.12
PMP (ours) 0.11 0.22 0.12 0.14 0.00 0.06 0.26 0.30 0.28 0.40 0.28 0.36
Fig. S3: Failure cases of collision events from PMP-NO-RDR. Top to bottom: onboard camera images, projected radar images
(not the input of PMP-NO-RDR, only for interpretation of the performance degradation) and lidar images (y-channel). Brighter
points mean larger values. The obstacles hit by the ego-vehicle are bounded by blue boxes.
TABLE S3: PID parameters used in this work
Type Proportional (P) Integral (I) Derivative (D)
Lateral 0.70 0.00 0.00
Longitudinal 0.25 0.20 0.00
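The two controllers of Table S3 can be sketched as below. The discrete PID form and the convention of tracking a conservative target speed (mean minus one standard deviation of the planned speed distribution) are our own illustrative assumptions; the paper only specifies the gains.

```python
class PID:
    """Discrete PID controller (gains taken from Table S3)."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

lateral = PID(0.70, 0.00, 0.00)       # steering
longitudinal = PID(0.25, 0.20, 0.00)  # throttle / brake

def control(heading_error, speed_mean, speed_std, current_speed, dt=0.05):
    # Assumed convention: track a conservative target speed so that high
    # planning uncertainty (large variance) automatically slows the vehicle.
    target_speed = max(0.0, speed_mean - speed_std)
    steer = max(-1.0, min(1.0, lateral.step(heading_error, dt)))
    accel = longitudinal.step(target_speed - current_speed, dt)
    if accel > 0:
        return steer, min(accel, 1.0), 0.0   # throttle, no brake
    return steer, 0.0, min(-accel, 1.0)      # brake, no throttle
```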
In this section, we aim to explore whether PMP-net effectively
complements the lidar and radar in the form of the ralidar image
introduced in the paper, or whether a camera and lidar alone are
sufficient. Therefore, we remove the radar input from PMP-net
and train an ablated model named PMP-NO-RDR, which
takes as input the ego-velocity, the global route and the
information from the camera and lidar. The training setup of
this model is the same as that of PMP-net.
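The ablation can be pictured as dropping one branch of a late-fusion head that concatenates per-modality feature vectors before the planning layers. The sketch below is schematic only; the encoder dimensions and the toy random-projection "encoder" are placeholders, not the architecture of PMP-net.

```python
import numpy as np

def encode(x, dim, seed):
    """Toy stand-in for a trained per-modality encoder network."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x.size, dim))
    return np.tanh(x.ravel() @ w)

def fuse(camera, lidar, radar=None, velocity=None, route=None):
    """Late fusion by concatenation; the radar branch is dropped in PMP-NO-RDR."""
    feats = [encode(camera, 128, 0), encode(lidar, 64, 1)]
    if radar is not None:
        feats.append(encode(radar, 64, 2))
    for extra in (velocity, route):
        if extra is not None:
            feats.append(np.asarray(extra, dtype=float).ravel())
    return np.concatenate(feats)
```

With these placeholder sizes, removing the radar branch simply shortens the fused feature vector; the rest of the network is unchanged, which mirrors how the ablated model keeps the same training setup.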
A. Quantitative Analysis
As with the other models, PMP-NO-RDR is evaluated on
our DeepTest benchmark with the same setup. The quantitative
comparison results are shown in Table S2. For the driving
Fig. S4: Evaluation results of PMP-net in a new urban environment with dynamic illumination and weather conditions. The
parameters of dynamic environments such as sun altitude angle and fog are shown on the bottom of snapshots. Noticeable
road agents are bounded by green boxes. The red lines indicate the global routes.
ability, it can be seen that the SR (success rate) of PMP-NO-RDR
drops from 100% (for PMP) to 16~100% in Empty traffic
settings, and from 72~88% (for PMP) to 22~55% in Dense
traffic settings. For the driving quality, the OVSP (overspeed)
values of PMP-NO-RDR are similar to those of PMP, while
the WL (wrong lane) values increase from 0.00~0.61% (for
PMP) to 0.42~7.93%. These results suggest that the radar input
indeed complements the lidar in PMP-net and improves the
overall navigation performance. Note that even without the
radar input, PMP-NO-RDR still performs better than the other
three baselines, i.e., CIL, CIL-R and INT, with lower OVSPs
and higher SRs, especially in Dense traffic settings.
B. Qualitative Analysis
We show three failure cases of collision events from PMP-NO-RDR
in Fig. S3 to interpret its performance degradation,
which mainly happens when interacting with dynamic obstacles.
We can see that the vehicle sometimes collides with other
road agents when changing lanes (Fig. S3-(c)) or turning at
intersections (Fig. S3-(a,b)). In these scenes, although other
obstacles block the road ahead, PMP-NO-RDR still controls
the vehicle to move forward by applying throttle. For
interpretation, we show the related radar/lidar images in the
second/third row of Fig. S3. It can be seen that the static
environment and the dynamic obstacles are clearly separated
in the radar images because of their different speeds relative
to the ego-vehicle. This characteristic of radar information
is significant for comprehensive scene understanding,
complementing the lidar information, which only captures the
static properties (e.g., position, length and height) of the
surroundings.
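The static/dynamic separation described above comes from the Doppler (radial) velocity that each radar return carries: after compensating the ego-motion component, moving obstacles stand out from the static background. A minimal sketch, with an illustrative threshold that is not from the paper:

```python
import numpy as np

def split_by_motion(radial_speeds, ego_radial, threshold=0.5):
    """Return boolean masks (dynamic, static) over radar returns.

    radial_speeds -- measured radial velocity of each return (m/s)
    ego_radial    -- expected radial velocity of a static point given
                     the ego-vehicle's own motion (m/s)
    threshold     -- residual speed above which a return counts as dynamic
    """
    residual = np.abs(np.asarray(radial_speeds) - np.asarray(ego_radial))
    dynamic = residual > threshold
    return dynamic, ~dynamic
```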
Fig. S5: Evaluation results of PMP-net on a new vehicle chevrolet-impala. The parameters of dynamic environments such as
sun altitude angle and fog are shown on the bottom of snapshots. Noticeable road agents are bounded by green boxes. The
red lines indicate the global routes.
A. New Town with Unseen Features
We further test the driving performance of PMP-net in a
more realistic town named Town10HD. This test is much
more difficult than DeepTest because we add the following new
features, which are also unseen in the training stage.
• Dynamic illumination (changing from daytime to night)
and weather (changing from clear to rainy) conditions within
a single driving episode.
• Road lamps and vehicle lights at night.
• Fog and wind weather conditions.
The results show that our PMP-net still navigates the vehicle
robustly and safely in these scenarios. We show qualitative
results in Fig. S4 and some demo videos on the website. A
tough scene is shown in Fig. S4-(f). On a foggy and rainy
night, a small child wearing dark-colored clothes abruptly
crosses the road, further obscured by other vehicles’ lights.
Even a human driver would find it difficult to react in time
because the child can hardly be seen. Nevertheless, PMP-net
stops the vehicle in time by applying full brake to avoid an
accident.
B. New Driving Platform
Similar to [S3], to further test the cross-platform general-
ization performance of PMP-net, we change the test vehicle
from audi-a2 to chevrolet-impala. Since the new vehicle is much
longer, we scale the steering output of the network
by 1.4 times. The test scene is the same as the one in
Section S.III.A. We show qualitative results in Fig. S5 and
a demo video on our website. Note that the new vehicle has
controllable lights in CARLA; therefore, it is further programmed
to turn on the LowBeam at night and the Blinker when turning at
intersections. This setup brings new visual effects to camera
images. The results confirm that PMP-net also performs well
on this new platform in dynamic environments, such as the
collision avoidance event shown in Fig. S5-(b,c), and the
ability of driving at rainy night shown in Fig. S5-(d).
REFERENCES
[S1] F. Codevilla, M. Miiller, A. López, V. Koltun, and
A. Dosovitskiy, “End-to-end driving via conditional imi-
tation learning,” in 2018 IEEE International Conference
on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–9.
[S2] W. Gao, D. Hsu, W. S. Lee, S. Shen, and K. Subra-
manian, “Intention-net: Integrating planning and deep
learning for goal-directed autonomous navigation,” in
Conference on Robot Learning, 2017, pp. 185–194.
[S3] P. Cai, Y. Sun, H. Wang, and M. Liu, “VTGNet:
A vision-based trajectory generation network for au-
tonomous vehicles in urban environments,” arXiv
preprint arXiv:2004.12591, 2020.