Probabilistic End-to-End Vehicle Navigation in
Complex Dynamic Environments with
Multimodal Sensor Fusion
Peide Cai, Sukai Wang, Yuxiang Sun, Ming Liu, Senior Member, IEEE
Abstract—All-day and all-weather navigation is a critical capa-
bility for autonomous driving, which requires proper reaction to
varied environmental conditions and complex agent behaviors.
Recently, with the rise of deep learning, end-to-end control
for autonomous vehicles has been well studied. However, most
works are solely based on visual information, which can be
degraded by challenging illumination conditions such as dim
light or total darkness. In addition, they usually generate and
apply deterministic control commands without considering the
uncertainties in the future. In this paper, based on imitation
learning, we propose a probabilistic driving model with multi-
perception capability utilizing the information from the camera,
lidar and radar. We further evaluate its driving performance
online on our new driving benchmark, which includes various
environmental conditions (e.g., urban and rural areas, traffic
densities, weather and times of the day) and dynamic obstacles
(e.g., vehicles, pedestrians, motorcyclists and bicyclists). The
results suggest that our proposed model outperforms baselines
and achieves excellent generalization performance in unseen
environments with heavy traffic and extreme weather.
Index Terms—Automation technologies for smart cities, au-
tonomous vehicle navigation, multi-modal perception, sensorimo-
tor learning, motion planning and control.
I. INTRODUCTION
IN THE field of autonomous driving, traditional naviga-
tion methods are commonly implemented with modular
pipelines [1], [2], which split the navigation task into individ-
ual sub-problems, such as perception, planning and control.
These modules often rely on a multitude of engineering
components to produce reliable environmental representations,
robust decisions and safe control actions. However, since the
separate modules depend on each other, errors can accumulate
through the system. Moreover, each component requires careful
and time-consuming hand engineering.
In recent years, with the unprecedented success of deep
learning, an alternative method called end-to-end control [3]–
[12] has arisen. This paradigm mimics the human brain and
maps the raw sensory input (e.g., RGB images) to control
output (e.g., steering angle) in an end-to-end fashion. In
addition, it substitutes laborious hand engineering by learning
control policies directly on data from human drivers with deep
networks, where explicit programming or modeling of each
possible scenario is not needed. Moreover, it can adapt to
complex noise characteristics of different environments during
training, which cannot be captured well by analytical methods.
All authors are with The Hong Kong University of Science and
Technology, Hong Kong SAR, China.
Fig. 1. Snapshots of different driving scenarios (left to right: ClearDay,
RainySunset and DrizzleNight) with global route directions and sensor data
information. For visualization, we project the lidar data (y-channel, i.e., the
height information) and radar data (relative speed to the ego-vehicle) to the
image plane. Brighter points mean larger values. It can be seen that the
information from the lidar and radar is more consistent across different
environmental conditions than that from the camera.
While end-to-end driving has been considerably fruitful,
there exist three critical deficiencies in the prior works.
1) Visual information is over-emphasized. Most works
depend solely on cameras for scene understanding and decision
making [3]–[14]. However, although cameras are versatile and
cheap, they have difficulty capturing fine-grained 3-D information.
In addition, camera-based perception is easily affected by
challenging illumination and weather conditions, such as the
DrizzleNight case shown in Fig. 1. Because of the dim light
and raindrops in this scene, the blue car far ahead on the left
can be difficult to recognize. In such scenarios, vision-based
driving systems can be dangerous. However, the blue car is
quite distinguishable from the speed distribution in the radar
data.
2) The probabilistic nature of executable actions is not well
explored. Most works output deterministic commands to the
vehicle [15], [16]; however, non-determinism is a key aspect of
control, and it is useful in many safety-critical tasks such
as collision checking and risk-aware motion planning [17]. A
more reasonable approach, therefore, is to predict a motion
distribution indicating what the driving platform could do
rather than what it must do.
3) The prior end-to-end methods are not evaluated suf-
ficiently in terms of the navigation task. Most works are
evaluated by first collecting a driving dataset with ground-truth
annotations (e.g., expert control actions) and then measuring
the average prediction error offline on the test set [6], [9], [10],
[13], [14], [17]. However, different from the computer vision
tasks such as object detection, the priority of driving should
be safety and robustness rather than accuracy. As indicated
in [18], the offline prediction error cannot reliably reflect the
actual driving quality. Therefore, online evaluation is more
reasonable and should be given more attention. One critical
concern for online evaluation is the environmental complexity,
yet prior related works either test their methods in static
maps [11], [12], [16], [19], [20], or scenarios with low-level
complexity [3]–[5], [7], [8], [15].
The aforementioned limitations motivate our exploration
to enhance the perception capability for end-to-end driving
systems. To this end, we propose a mixed sensor setup combin-
ing a camera, lidar and radar. The multimodal information is
processed by uniform alignment and projection onto the image
plane. Then, ResNet [21] is used for feature extraction. Based
on this setup, we introduce a probabilistic motion planning
(PMP) network to learn a deep probabilistic driving policy
from expert provided data, which outputs both a distribution
of future motion based on the Gaussian mixture model (GMM)
[9], [17], [22], and a deterministic control action. Finally,
we evaluate the driving performance of our model online
on a new benchmark with extensive experiments. The main
contributions of this letter are summarized as follows.
• An end-to-end navigation method with multimodal sensor
fusion and probabilistic motion planning, named PMP-net,
for improving perception capability and considering
uncertainties in the future.
• A new online benchmark, named DeepTest, to perform
analysis of driving systems in high-fidelity simulated
environments with varied maps, weather, lighting conditions
and traffic densities.
• Extensive evaluation and human-level driving performance
of the proposed PMP-net, presented in unseen urban and
rural areas with extreme weather and heavy traffic.

II. RELATED WORK

End-to-end control is designed with deep networks to
directly learn a mapping from raw sensory data to control
outputs. The pioneering ALVINN system [23] developed in 1989
uses a multilayer perceptron to learn the directions a vehicle
should steer. With the recent advancement of deep learning,
end-to-end control techniques have experienced tremendous
success. For example, using more powerful modern convo-
lutional neural networks (CNNs) and higher computational
power, Bojarski et al. [3] demonstrate impressive performance
in simple real-world driving scenarios such as on flat or
barrier-free roads. Xu et al. [6] develop an end-to-end ar-
chitecture to predict future vehicle egomotion from a large-
scale video dataset. However, these works realize only a lane-
following task; goal-directed navigation is not studied.
To enable goal-directed autonomous driving, Codevilla et al.
[5] propose a conditional imitation learning pipeline. In this
work, the vehicle is able to take a specific turn at intersections
based on high-level navigational commands such as turn left
and turn right. Follow-up works include [7], [12], [13] and
[14]. Another trend of adding guidance to the control policy
is using global route, which is a richer representation of
the intended moving directions than turning commands. For
example, Gao et al. [4] render routes on 2D floor maps and
call them intentions. Then, a neural-network motion controller
maps intentions and camera images directly to robot actions.
Pokle et al. [16] follow this idea and implement a deep
local trajectory planner and a velocity controller to compute
motion commands based on the path generated by a global
planner. However, these two works only focus on indoor robot
navigation. For outdoor driving applications, Cai et al. [20]
realize high-speed autonomous drifting in racing scenarios
guided by route information with deep reinforcement learning.
However, the control policy is only evaluated in static maps.
Hecker et al. [10] propose to learn a control policy with GPS-
based route planners and surround-view cameras. However, as
with many other works [6], [9], [13], [17], this work is only
evaluated offline by analysing the average prediction error,
which provides little information about the actual driving quality.
Inspired by the route-guided navigation methods mentioned
above, we use a global planner to compute paths towards
destinations in outdoor driving areas. For the low-level reactive
control, we implement an end-to-end network translating the
global route into driving actions (steering, throttle and brake).
Based on this architecture, point-to-point autonomous driving
can be realized. The network is trained with imitation learning
and can adapt to varied environments to drive appropriately
(e.g., slow down at intersections) and safely (e.g., slow down
for a car, and urgently stop for jaywalkers). Similar to [4] and
[16], we assume that the localization information is available
during system operation. However, different from [4] and [16],
our work focuses on complicated outdoor driving scenarios,
and combines multimodal sensors complementing each other
to generate unified perception results.
In addition, our approach relates to the work of probabilistic
driving models. To improve the capability of handling long-
term plans with imitation learning, Amini et al. [9] propose a
variational network to predict a full distribution over possible
steering commands. Similarly, Huang et al. [17] propose to
use GMM to predict a distribution of future vehicle trajecto-
ries. These works explicitly consider uncertainties of future
motions on logged data with offline metrics. By contrast, we
evaluate our probabilistic driving model online with varied
environmental conditions (e.g., rainy nights with heavy traffic),
which has not been studied in this context before.
III. METHODOLOGY
A. Formulation
We formulate the problem of autonomous vehicle navigation
as a goal-directed motion planning task to be solved by an
end-to-end network architecture with imitation learning. The
goal is to control the vehicle to drive safely and robustly in
complex outdoor areas to achieve point-to-point navigation,
like a human driver. To this end, we design a probabilistic
driving model using multimodal perceptions from the camera,
Fig. 2. The architecture of our probabilistic motion planning network (PMP-net). It receives the multimodal sensory input and plans a motion distribution
for 3 seconds into the future, based on which a PID controller is designed to generate a control action a2. In addition, PMP-net generates another action a1
in an end-to-end fashion. Then the variance of the planned motion distribution is used to fuse the dual actions for controlling the vehicle.
lidar and radar. In addition, we choose the latest CARLA
simulation (0.9.7) [24] to train and evaluate the system1. The
entire pipeline of our PMP-net is shown in Fig. 2.
B. Dataset Collection
To make the model successfully learn the knowledge of
goal-directed reactive control in the context of outdoor driving,
we collect a large-scale dataset with a global planner and an
expert demonstrator in CARLA. At the beginning of each
driving episode, the ego-vehicle is spawned at a random
position p. Then a collision-free coarse route (ranging from
350 m to 1500 m) from p to a destination d is provided by a
global planner. The vehicle then follows this route at a speed of
around 40 km/h while reacting to local environments to avoid
collisions, such as slowing down for a slow-moving car
ahead. Additionally, the vehicle reasonably slows
the speed down to 15 km/h at intersections to ensure safety. In
the process of data collection, we record the vehicle velocities,
yaw angles, RGB images, lidar/radar data and expert driving
actions (i.e., steering, throttle and brake) at 10 Hz. Moreover,
in order to increase the complexity of our dataset, we focus
on the following two aspects:
1) Complexity of Environments: a) The datasets from prior
works [5], [7], [8] are generated only in one map with two
lanes and 90-degree turns (Town01 in Fig. 3). By contrast,
we use five urban maps for data collection, which consist
of different types of intersections and even roundabouts, and
multiple lanes on roads; b) We set nine combinations of
weather (clear, drizzle and rainy) and illumination (daytime,
sunset and night). Heavier rain leads to more puddles on roads,
and thus brings a greater reflection effect for visual perception.
1 Different from the older versions of CARLA (0.8.x) used in [5], [7] and
[8], which contain only two urban maps, the latest CARLA environment
provides seven maps covering both urban and rural areas, with more avail-
able sensors, improved physical dynamics and more realistic illuminations.
2) Complexity of Road Agents: a) We set pedestrians with
different appearances (children and adults) randomly running
or walking along the sidewalks and crosswalks. They
occasionally disobey traffic rules and cross the road abruptly
without warning, which increases the safety burden for
autonomous driving; b) We set different types of vehicles (e.g.,
cars, trucks, vans, jeeps, buses, motorcyclists and bicyclists)
with multiple appearances navigating around the cities at
varied speeds. Based on a) and b), we apply four levels of
traffic density for data collection: empty, few, regular and
dense. Note that these road agents are controlled by the AI
engine from CARLA to construct realistic city scenarios.
The setups mentioned above can be partially viewed in Fig.
3 and more can be viewed in our supplementary videos. These
help to generate sufficient interactions between the ego-vehicle
and road agents in diverse environments. Based on these
setups, we finally collect 360 high-fidelity driving episodes,
which last 10.8 hours in total with 389 thousand frames and
cover a driving distance of 247 km.
C. Model Architecture
1) Global Planning: The global planner is separate from
the deep networks. It is implemented with the A* algorithm
to plan a high-level coarse route from the start point to the
destination based on static town maps. Similar to [16] and
[20], we down-sample the full global route Gf to local relevant
routes G during navigation, which is shown in (1):

G = {(x1, y1), (x2, y2), . . . , (x130, y130)}.   (1)

Note that the first waypoint (x1, y1) in G is the closest
waypoint in Gf to the current location of the vehicle, and
the distance between every two adjacent points is 0.4 m. The
waypoints are then flattened into a 260-dimensional vector
to be processed by dense layers with fully connected ReLU
layers. The extracted feature is a higher-dimensional vector
fg ∈ R2048.
(a) Town01-ClearDay (b) Town02-DrizzleSunset (c) Town03-RainySunset (d) Town04-ClearNight (f) Town05-RainyNight
Fig. 3. Overview of our dataset: varied maps, weather and illumination conditions with increasing traffic densities (top to bottom). Noticeable road agents
are bounded by color boxes. Note that this figure shows only a small part of the environmental setups; please see Section III-B for more details.
Columns (a-c) show there can sometimes be jaywalkers running across the roads, for which the ego-vehicle will urgently slow down or completely stop
to ensure safety. In addition, it can be seen that in rainy scenarios, especially in RainyNight, the surroundings are considerably blurred (e.g., the unclear
motorcyclist in the Regular setting of column (f)), leading to potential risks for the vision-based driving models [5], [7], [8], [14].
Fig. 4. Multimodal data processing. We achieve data alignment by projecting
the lidar point clouds and radar measurements to the image plane and combining
them to form the ralidar image. Then, two ResNet34 modules are
used to extract features from the camera and ralidar images. Brighter points
mean larger values in the projected images. Noticeable road agents in the
projected radar image are bounded by white boxes.
2) Multi-Perception: To capture environmental
information, the camera records color textures in a 2D image
plane, while the lidar captures 3-D spatial locations and the
radar records movement information (i.e., speeds of obstacles
relative to the ego-vehicle). We combine these sensors together
in our network so that the vehicle is able to sense different
dimensions of its surroundings.
Specifically, we project the lidar point clouds and radar data
to the image plane with the same width and height as the
camera images. We name the result the ralidar image (250 × 600 × 4),
in which the first three channels encode the 3-D coordinates and
the fourth channel encodes relative speeds, as shown in Fig. 4.
In this way, the multimodal measurements are aligned on the
same space and can be uniformly processed with CNNs. In
this work, we use ResNet34 [21] as the backbone to extract
environmental features from the camera and ralidar images.
The results are feature vectors fi ∈ R2048 and fr ∈ R2048.
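The projection step can be sketched roughly as below. This is an illustrative NumPy sketch under assumptions not stated in the text: a pinhole camera model with a hypothetical intrinsic matrix K, and a simple per-pixel overwrite when points collide; the function name `make_ralidar_image` is ours.

```python
import numpy as np

def make_ralidar_image(points, K, H=250, W=600):
    """Project lidar/radar returns into a 4-channel 'ralidar' image.

    points: (N, 4) array of [x, y, z, relative_speed] in the camera
    frame (z pointing forward). K: 3x3 pinhole intrinsic matrix.
    The first three channels encode the 3-D coordinates and the
    fourth the relative speed, as in the ralidar image above.
    """
    img = np.zeros((H, W, 4), dtype=np.float32)
    pts = points[points[:, 2] > 0]      # keep points in front of the camera
    uvw = (K @ pts[:, :3].T).T          # pinhole projection
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    img[v[ok], u[ok]] = pts[ok]         # write [x, y, z, speed] per pixel
    return img

# Toy example: one return 10 m ahead, closing at 2 m/s (illustrative K).
K = np.array([[300.0, 0.0, 300.0],
              [0.0, 300.0, 125.0],
              [0.0, 0.0, 1.0]])
ralidar = make_ralidar_image(np.array([[0.0, 0.0, 10.0, -2.0]]), K)
```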
3) End-to-End Action Generation: In addition to the sen-
sory data and the global route, our network also takes as input
the velocity of the ego-vehicle (vx, vy) to the dense layers. The
extracted feature is a higher-dimensional vector fv ∈ R2048.
Then the features [fi, fr, fv, fg] are handled in two ways:
a) we concatenate them into a vector fc ∈ R8192 for further
processing, and b) in the spirit of [16], we fuse them with
an attention mechanism defined in (2), where the coefficients
a = [ai, ar, av, ag] reflect the relative importance of the features
in changing environments:

ff = ai·fi + ar·fr + av·fv + ag·fg.   (2)

The coefficients a are computed by transforming fc with dense
layers and softmax activation. After such feature fusion, a
control action a1 composed of steering, throttle and brake is
generated by projecting ff with fully connected ReLU layers.
Inspired by [18], we use the L1 loss function for this module
as it is better correlated to the online driving performance.
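The attention-style fusion can be illustrated as follows. In this NumPy sketch the dense layers that produce the coefficients are reduced to a single hypothetical weight matrix W and bias b, and the coefficient-weighted sum is our reading of the fusion in (2); none of the names come from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(f_i, f_r, f_v, f_g, W, b):
    """Fuse four 2048-d features with softmax attention coefficients.

    The features are concatenated into f_c (8192-d); a linear map
    (W, b: stand-ins for the trained dense layers) plus softmax yields
    a = [a_i, a_r, a_v, a_g]; the fused feature f_f is the
    coefficient-weighted sum of the four features.
    """
    f_c = np.concatenate([f_i, f_r, f_v, f_g])   # (8192,)
    a = softmax(W @ f_c + b)                     # (4,), non-negative, sums to 1
    f_f = a[0] * f_i + a[1] * f_r + a[2] * f_v + a[3] * f_g
    return f_f, a

rng = np.random.default_rng(0)
feats = [rng.standard_normal(2048) for _ in range(4)]
W, b = rng.standard_normal((4, 8192)) * 0.01, np.zeros(4)
f_f, a = attention_fuse(*feats, W, b)
```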
4) Probabilistic Motion Planning: In this module, we aim
to learn a fully parameterized distribution over possible ego-
motions (i.e., velocities and yaw angles) for 3.0 s into the
future, as shown in Fig. 2. We adopt the GMM to repre-
sent such a distribution due to its excellent approximation
properties. Specifically, the combined feature fcin our work
is transformed by dense layers into GMM parameters (i.e.,
weight, mean and variance) to describe the distribution of
future motions. Similar to [9] and [17], the negative log-
likelihood (NLL) loss function is used for this module.
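For a one-dimensional Gaussian mixture, the NLL loss used for this module can be written out as below. This is an illustrative NumPy sketch: the actual network predicts GMM parameters jointly over velocities and yaw angles per future time step, which is collapsed here to scalar targets for clarity.

```python
import numpy as np

def gmm_nll(y, weights, means, variances):
    """Mean negative log-likelihood of scalar targets y under a
    1-D Gaussian mixture with (K,) parameters (weights sum to 1)."""
    y = np.atleast_1d(y)[:, None]                 # (N, 1)
    # Per-component Gaussian densities, shape (N, K).
    dens = np.exp(-0.5 * (y - means) ** 2 / variances) / np.sqrt(
        2.0 * np.pi * variances)
    # Mixture likelihood per sample, then mean NLL over the batch.
    return float(-np.log((dens * weights).sum(axis=1)).mean())

# A 2-component mixture: the loss is lower for targets near a mode.
w = np.array([0.5, 0.5])
mu = np.array([0.0, 5.0])
var = np.array([1.0, 1.0])
near = gmm_nll([0.1], w, mu, var)   # close to the first mode
far = gmm_nll([2.5], w, mu, var)    # between the two modes
```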
As mentioned in [22], the advantage of probabilistic model-
ing is that we can make a decision by evaluating its statistical
properties. In this work, based on the mean values (µ) of the
planned motion distribution, we further design a PID controller
to calculate a control action a2composed of steering, throttle
and brake. The target point for this PID controller (assumed to
be k frames in the future) is set to the point 5 m ahead of the
vehicle, obtained by integrating µ over the planning horizon.
Then, the final action af to control the vehicle is computed by
examining the reliability of the motion distribution through its
accumulated variance σ2:

af = (1 − λ)a1 + λa2,   λ = exp(−c1 · max(0, Σi=1..k σi2 − c2)),

where c1 and c2 are positive constants.
In this way, higher planning uncertainty leads to smaller λ,
thus the final action will depend more on a1. We believe
that we can take advantage of both end-to-end control and
probabilistic modeling by performing such reliability-aware
action fusion.
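The reliability-aware fusion rule can be sketched as follows. This is a NumPy sketch under stated assumptions: the constants c1 and c2 are illustrative stand-ins (the trained values are not reported), and the exact form of the exponent, with an accumulated-variance threshold, is an assumption consistent with the text's claim that higher uncertainty yields a smaller λ.

```python
import numpy as np

def fuse_actions(a1, a2, sigma2_sum, c1=1.0, c2=0.5):
    """Reliability-aware fusion of the end-to-end action a1 and the
    PID action a2.

    sigma2_sum: accumulated variance of the planned motion distribution
    over the k frames up to the target point. Higher uncertainty gives
    a smaller lambda, so the final action relies more on a1.
    """
    lam = np.exp(-c1 * max(0.0, sigma2_sum - c2))
    return (1.0 - lam) * a1 + lam * a2, lam

a1 = np.array([0.1, 0.5, 0.0])   # [steer, throttle, brake] from end-to-end head
a2 = np.array([0.0, 0.3, 0.0])   # [steer, throttle, brake] from PID controller
low_unc, lam_low = fuse_actions(a1, a2, sigma2_sum=0.1)
high_unc, lam_high = fuse_actions(a1, a2, sigma2_sum=5.0)
```

With low accumulated variance λ saturates at 1 and the PID action a2 is used; with high variance λ decays towards 0 and the end-to-end action a1 dominates.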
TABLE I
                     Training Conditions     New Weather             New Town                New Town & Weather
Town Name            Town03 (urban)          Town05 (urban)          Town07 (rural)          Town06 (urban)
Traffic Density      Empty  Regular  Dense   Empty  Regular  Dense   Empty  Regular  Dense   Empty  Regular  Dense
Success Rate (%)
  CIL [5]            38     16       16      33     11       0       0      0        0       16     11       0
  CIL-R              83     55       38      33     22       16      22     11       11      11     11       11
  INT [4]            16     33       11      83     5        5       38     22       5       94     61       16
  PMP (ours)         100    72       88      100    77       77      100    83       72      100    88       83
Wrong Lane (%)
  CIL [5]            66.05  45.16    50.87   57.22  64.41    46.18   35.55  36.81    40.71   44.14  52.37    52.03
  CIL-R              26.60  25.57    19.07   26.58  36.64    41.86   8.88   7.20     3.35    42.50  50.72    51.61
  INT [4]            0.00   0.04     0.01    0.07   0.12     0.15    0.00   0.00     0.00    0.12   0.13     0.28
  PMP (ours)         0.02   0.00     0.01    0.40   0.48     0.50    0.04   0.00     0.01    0.43   0.40     0.61
Overspeed (%)
  CIL [5]            0.33   0.37     0.16    0.10   0.00     0.04    0.00
  CIL-R              0.14   0.13     0.08    0.04   0.00     0.00    0.33   0.09     0.28    0.10   0.16     1.54
  INT [4]            17.70  11.18    5.85    17.09  15.14    8.52    19.03  11.87    14.84   37.12  30.22    31.04
  PMP (ours)         0.11   0.22     0.12    0.14   0.00     0.06    0.26   0.30     0.28    0.40   0.28     0.36
IV. EXPERIMENTS
A. Training Setup
We train the proposed PMP-net on our large-scale driving
dataset introduced in Section III-B. The full dataset is divided
into a training set and a validation set according to the ratio
of 7:1, leading to 340K training samples2. We use the Adam
optimizer with a learning rate of 0.0001, and the batch size is
90. Based on these setups, the model is trained on two Nvidia
GeForce RTX 2080 Ti GPUs for about 75 hours, with 234K
training steps to achieve convergence. For comparison, we also
train and finetune three other baselines on the same training
set, all of which are designed for visual navigation:
• CIL: The conditional imitation learning network introduced
in [5]. This maps the camera images and ego-velocities
directly to control actions, based on four discrete commands
for goal-directed navigation: follow lane, turn left, turn right
and go straight at the intersection.
• CIL-R: We replace the original image processing module
of CIL (which is relatively shallow) with ResNet34, to
evaluate if deeper models perform better for our task.
• INT: The intention-net introduced in [4] with the backbone
of ResNet34 for fair comparisons. This maps the camera
images and global routes to control actions. Note that the
original intention-net takes the indoor floor maps rendered
with routes for directions. We replace it with the local
relevant routes G introduced in (1).
B. Evaluation
1) DeepTest Benchmark: We evaluate the online driving
performance for different models on our proposed DeepTest
2 Note the test set is not considered because we evaluate our model online
in Section IV-B by making the ego-vehicle directly interact with dynamic
environments.
benchmark in CARLA. Compared with the previous bench-
marks in [7] and [24], DeepTest has many more environmental
setups, such as more test maps, weather conditions and
interactions with road agents. In addition, different from [7] and [24],
we set zero tolerance for collision events, which means that
any degree of collisions with static (e.g., trees) or dynamic
(e.g., pedestrians) objects leads to a failed episode.
In our benchmark, different methods are tested on four
maps. For each map, we set three levels of traffic densities:
empty,regular and dense. Therefore, each driving model
relates to 12 driving tasks. Note that denser traffic leads to
harder driving tasks as it involves more dynamic obstacles on
the road. In each task, we further set 18 goal-directed episodes
with varied weather conditions. Therefore, to fully evaluate
PMP-net and the other three baselines, 864 driving episodes
should be conducted. Finally, the evaluation process costs 4
days on our computer and covers a driving distance of 855
km. Compared with the environmental setups in the training
set (Section III-B), we consider new maps, illuminations
and weather in DeepTest to test the generalization capability.
Specifically, we add an unseen rural map Town07 and an
urban map Town06. Town07 brings new challenges to test the
negotiation skills with narrow roads and many non-signalized
crossings. In addition, we add four extreme illumination and
weather conditions: ClearDark, DrizzleDark, StormDark and
StormSunset. The new Dark and Storm (i.e., heavy rain)
settings, which are shown in Fig. 5, bring extra challenges
to driving with limited vision. Similar to [5], we do not
consider traffic lights in this work. For quantification of the
driving performance, three metrics are adopted as follows:
• SR: success rate. An episode is considered successful
if the agent reaches a given goal without any collision
within a time limit. Based on this, we calculate the
success rate for models in different tasks.
• WL: the proportion of the period spent in a wrong lane
to the total driving time.
• OVSP: the proportion of the overspeeding period to the
total driving time. The speed limit is set to 20 km/h at
intersections and 50 km/h elsewhere.
Fig. 5. Online evaluation results of PMP-net in our DeepTest benchmark. The environment setups, driving velocities and control actions are shown in yellow
text. Noticeable road agents (e.g., jaywalkers) are bounded by green boxes. The range of steering is [-1, 1], while for throttle and brake the range is [0, 1].
The sample driving behaviors are: (c) lane-following, turning at (b,d,e) intersections or (a) roundabouts, (g) lane-changing, (f,h,i,j) vehicle-, bicyclist- or
motorcyclist-following, and (k,m) urgently slowing down for jaywalkers. All of these behaviors are performed autonomously and safely by PMP-net in an
end-to-end fashion without hand-crafted rules.
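For clarity, the per-episode quantities behind these metrics can be computed from per-frame logs along the following lines. This is a plain-Python sketch with a hypothetical log format (the benchmark's actual implementation is not given); the field names are ours.

```python
def evaluate_episode(frames, reached_goal, collided, timed_out):
    """Compute per-episode success, wrong-lane ratio (WL) and
    overspeed ratio (OVSP) from per-frame logs.

    Each frame is assumed to be a dict with 'wrong_lane' (bool),
    'speed' in km/h and 'at_intersection' (bool).
    """
    # Zero-tolerance policy: any collision (or timeout) fails the episode.
    success = reached_goal and not collided and not timed_out
    n = len(frames)
    wl = sum(f["wrong_lane"] for f in frames) / n
    # Speed limit: 20 km/h at intersections, 50 km/h elsewhere.
    ovsp = sum(
        f["speed"] > (20.0 if f["at_intersection"] else 50.0)
        for f in frames) / n
    return success, wl, ovsp

frames = [
    {"wrong_lane": False, "speed": 40.0, "at_intersection": False},
    {"wrong_lane": True,  "speed": 55.0, "at_intersection": False},
    {"wrong_lane": False, "speed": 25.0, "at_intersection": True},
    {"wrong_lane": False, "speed": 15.0, "at_intersection": True},
]
ok, wl, ovsp = evaluate_episode(frames, reached_goal=True,
                                collided=False, timed_out=False)
```

The benchmark-level SR is then the fraction of successful episodes per task, while WL and OVSP are time-weighted averages over the driven period.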
2) Quantitative Analysis: We show the results on the
DeepTest benchmark in Table I. In the following, the analyses
are given from two perspectives: the ability and the quality of
autonomous driving.
Ability: Success rate (SR) is used to measure the self-
driving ability, which is a crucial concern in this area.
It can be seen that the CIL model presents the worst results,
failing to achieve even a single successful episode in Town07.
In addition, although in Town03 we only set new routes with
similar environments to the training dataset, CIL still presents
low SRs (16~38%). With the help of a deeper backbone in
CIL-R, the performance is improved. For example, the SR in
Town03-empty increases from 38% to 83%.
By changing the model structure to INT, better generaliza-
tion performance on certain new environments is achieved, for
example, the SR in Town06-Regular increases from 11%
to 61%. However, INT performs worse than CIL-R in Town03
and some other new environments such as Town05-Dense.
Generally, INT and CIL-R show similarly low performance
in outdoor driving areas, especially in heavy traffic.
This is because they rely solely on visual perception, which often
struggles in tough environments such as StormDark. By
contrast, PMP-net achieves a much higher SR in all evaluation
setups, which indicates a superior generalization capability. In
particular, the SR increases to 100% in all environments for
the empty traffic, and to 72~88% for regular and dense traffic.
Quality: We use WL and OVSP to evaluate the driving
quality of different models. Due to the lack of concrete
direction guidance, CIL and CIL-R both have high WL values
(3.35~66.05%). Specifically, they often navigate the vehicle
to drive in the correct direction but in the wrong lanes. With
the help of the global route information, the models are able
to drive more accurately, as we can see by the WL values
for INT and PMP, which are all close to 0%. However, INT
tends to control the vehicle to drive at high speeds without
slowing down at intersections. This unsafe phenomenon leads
to high OVSP values for INT (5.85~37.12%), while PMP
still performs well on this metric (0.0~0.4%).
Generally, the remarkable improvements of PMP-net on the
benchmark w.r.t. the other three baselines confirm that our
proposed model can effectively learn and deploy the driving
knowledge in complex dynamic environments.
3) Qualitative Analysis: Fig. 5 shows the qualitative results
of PMP-net. When there are no obstacles ahead on straight
roads, our model drives relatively fast, at about 40 km/h (Fig.
5-(c)). When taking turns or following road agents, our model
reasonably slows down as a human driver would, as shown in
Fig. 5-(a,b,d,f,i). In addition, we show some results in extreme
conditions. In Fig. 5-(e), the traffic is heavy with many vehicles
driving at an intersection. Although the model is directed to
turn right, it applies full brake as another vehicle blocks the
road ahead. Moreover, in Fig. 5-(g,h), we set dense traffic on a
dark night where slow-moving obstacles are ahead of the ego-
vehicle. In these scenes with limited vision, PMP-net is also
able to drive safely by reducing the throttle to slow down when
changing/following lanes. Furthermore, the most challenging
scene is shown in Fig. 5-(m). In the StormDark environment,
there is a small child running across the road abruptly without
any warning. In this scene it is difficult even for a human
driver to notice the danger, because the surroundings cannot be
seen clearly. Surprisingly, our model slows down in time by
applying the brake to avoid an accident. Fig. 5-(k) is another
similar scenario. For interpretation, the planned motion
distribution of Fig. 5-(m) is attached, where we can see that the
planned speed drops rapidly within a short horizon (~0.5 s)
with low variance. We attribute such prominent performance to
our multimodal and probabilistic setup. More related driving
behaviors are shown in the supplementary videos.
V. CONCLUSION
In this paper, to realize autonomous driving in outdoor
dynamic environments, we proposed a deep navigation model
named PMP-net, which is based on multimodal sensors (a
camera, lidar and radar) and probabilistic end-to-end control.
We collected a large-scale driving dataset in the CARLA sim-
ulator and trained the model with imitation learning. In order
to fully evaluate the driving performance, we further proposed
a new online benchmark, DeepTest, whose environmental
complexity exceeds that of previous benchmarks. By setting
varied illumination, weather and traffic conditions in different
towns, we showed that our model achieves excellent driving
and generalization performance in both unseen urban and rural
areas with extreme weather and heavy traffic with dynamic
objects (e.g., vehicles, bicyclists and jaywalkers).
To further extend PMP-net for real autonomous vehicles,
the reality gap should be considered. 1) For the discrepancy in
sensory input, we can finetune the model with real-world data.
Lidar and radar readings are more consistent between real and
simulated deployments than camera images, which can help
regularize the finetuning process for domain adaptation.
2) For the discrepancy in driving platforms, we can adjust the
parameters of the PID controller to adapt to different vehicle
properties [14], thanks to the modular design of our network.
REFERENCES
[1] J. Leonard, J. How, S. Teller, M. Berger, S. Campbell, G. Fiore,
L. Fletcher, E. Frazzoli, A. Huang, S. Karaman et al., “A perception-
driven autonomous urban vehicle,” Journal of Field Robotics, vol. 25,
no. 10, pp. 727–774, 2008.
[2] E. D. Dickmanns, “The development of machine vision for road vehicles
in the last decade,” in Intelligent Vehicle Symposium, 2002. IEEE, vol. 1.
IEEE, 2002, pp. 268–281.
[3] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp,
P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., “End
to end learning for self-driving cars,” arXiv preprint arXiv:1604.07316,
2016.
[4] W. Gao, D. Hsu, W. S. Lee, S. Shen, and K. Subramanian, “Intention-
net: Integrating planning and deep learning for goal-directed autonomous
navigation,” in Conference on Robot Learning, 2017, pp. 185–194.
[5] F. Codevilla, M. Miiller, A. López, V. Koltun, and A. Dosovitskiy,
“End-to-end driving via conditional imitation learning,” in 2018 IEEE
International Conference on Robotics and Automation (ICRA). IEEE,
2018, pp. 1–9.
[6] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving
models from large-scale video datasets,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2017, pp. 2174–
[7] F. Codevilla, E. Santana, A. M. Lopez, and A. Gaidon, “Exploring the
limitations of behavior cloning for autonomous driving,” in The IEEE
International Conference on Computer Vision (ICCV), October 2019.
[8] L. Tai, P. Yun, Y. Chen, C. Liu, H. Ye, and M. Liu, “Visual-based
autonomous driving deployment from a stochastic and uncertainty-aware
perspective,” in 2019 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), 2019.
[9] A. Amini, G. Rosman, S. Karaman, and D. Rus, “Variational end-to-
end navigation and localization,” in 2019 International Conference on
Robotics and Automation (ICRA), May 2019, pp. 8958–8964.
[10] S. Hecker, D. Dai, and L. Van Gool, “End-to-end learning of driving
models with surround-view cameras and route planners,” in Proceedings
of the European Conference on Computer Vision (ECCV), 2018, pp.
[11] P. Karkus, X. Ma, D. Hsu, L. Kaelbling, W. S. Lee, and T. Lozano-Perez,
“Differentiable algorithm networks for composable robot learning,” in
Proceedings of Robotics: Science and Systems, Freiburg im Breisgau,
Germany, June 2019.
[12] M. Mueller, A. Dosovitskiy, B. Ghanem, and V. Koltun, “Driving policy
transfer via modularity and abstraction,” in Proceedings of The 2nd
Conference on Robot Learning, ser. Proceedings of Machine Learning
Research, A. Billard, A. Dragan, J. Peters, and J. Morimoto, Eds.,
vol. 87. PMLR, 29–31 Oct 2018, pp. 1–15.
[13] P. Cai, Y. Sun, Y. Chen, and M. Liu, “Vision-based trajectory planning
via imitation learning for autonomous vehicles,” in 2019 IEEE Intelligent
Transportation Systems Conference (ITSC), Oct 2019, pp. 2736–2742.
[14] P. Cai, Y. Sun, H. Wang, and M. Liu, “VTGNet: A vision-based
trajectory generation network for autonomous vehicles in urban envi-
ronments,” arXiv preprint arXiv:2004.12591, 2020.
[15] M. Bansal, A. Krizhevsky, and A. Ogale, “Chauffeurnet: Learning to
drive by imitating the best and synthesizing the worst,” in Proceedings
of Robotics: Science and Systems, Freiburg im Breisgau, Germany, June
2019.
[16] A. Pokle, R. Martín-Martín, P. Goebel, V. Chow, H. M. Ewald, J. Yang,
Z. Wang, A. Sadeghian, D. Sadigh, S. Savarese, and M. Vázquez, “Deep
local trajectory replanning and control for robot navigation,” in 2019
International Conference on Robotics and Automation (ICRA), May
2019, pp. 5815–5822.
[17] X. Huang, S. G. McGill, B. C. Williams, L. Fletcher, and G. Rosman,
“Uncertainty-aware driver trajectory prediction at urban intersections,”
in 2019 International Conference on Robotics and Automation (ICRA).
IEEE, 2019, pp. 9718–9724.
[18] F. Codevilla, A. M. Lopez, V. Koltun, and A. Dosovitskiy, “On offline
evaluation of vision-based driving models,” in The European Conference
on Computer Vision (ECCV), September 2018.
[19] M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena, “From
perception to decision: A data-driven approach to end-to-end motion
planning for autonomous ground robots,” in 2017 IEEE International
Conference on Robotics and Automation (ICRA), May 2017, pp. 1527–
[20] P. Cai, X. Mei, L. Tai, Y. Sun, and M. Liu, “High-speed autonomous
drifting with deep reinforcement learning,” IEEE Robotics and Automa-
tion Letters, vol. 5, pp. 1247–1254, 2020.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2016, pp. 770–778.
[22] J. Wiest, M. Höffken, U. Kreßel, and K. Dietmayer, “Probabilistic
trajectory prediction with gaussian mixture models,” in 2012 IEEE
Intelligent Vehicles Symposium. IEEE, 2012, pp. 141–146.
[23] D. A. Pomerleau, “ALVINN: An autonomous land vehicle in a neural
network,” in Advances in neural information processing systems, 1989,
pp. 305–313.
[24] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun,
“CARLA: An open urban driving simulator,” in Proceedings of the 1st
Annual Conference on Robot Learning, ser. Proceedings of Machine
Learning Research, S. Levine, V. Vanhoucke, and K. Goldberg, Eds.,
vol. 78. PMLR, 13–15 Nov 2017, pp. 1–16.
Supplement File of Probabilistic End-to-End
Vehicle Navigation in Complex Dynamic
Environments with Multimodal Sensor Fusion
Peide Cai, Sukai Wang, Yuxiang Sun, Ming Liu, Senior Member, IEEE
Fig. S1: Three levels of traffic density in our DeepTest benchmark: (1) Empty, (2) Regular, (3) Dense.
Fig. S2: Varied illumination (clear, sunset, night and dark) and weather (clear, drizzle, rainy and storm) conditions considered
in this work.
A. Traffic Densities
We set three levels of traffic density for our DeepTest
benchmark: empty, regular and dense. The specific settings
are listed in Table S1 and illustrated in Fig. S1.
TABLE S1: Different traffic densities in this work
Type Number of pedestrians Number of vehicles
Empty 0 0
Regular 40~75 60
Dense 60~150 80~120
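When configuring evaluation episodes, the density levels of Table S1 can be encoded and sampled as follows. This is a minimal sketch; the dictionary layout and function names are illustrative, not part of the benchmark code, and ranges written "a~b" are assumed to be sampled uniformly per episode.

```python
import random

# Hypothetical encoding of the DeepTest traffic-density levels (Table S1).
TRAFFIC_DENSITIES = {
    "empty":   {"pedestrians": (0, 0),    "vehicles": (0, 0)},
    "regular": {"pedestrians": (40, 75),  "vehicles": (60, 60)},
    "dense":   {"pedestrians": (60, 150), "vehicles": (80, 120)},
}

def sample_traffic(level, rng=random):
    """Draw concrete pedestrian/vehicle counts for one evaluation episode."""
    cfg = TRAFFIC_DENSITIES[level]
    return {kind: rng.randint(lo, hi) for kind, (lo, hi) in cfg.items()}
```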
B. Illumination and Weather Conditions
This paper involves four illumination conditions (i.e., daytime,
sunset, night and dark) and four weather conditions (i.e.,
clear, drizzle, rainy and storm). The specific settings can be
viewed in Fig. S2.
C. PID Control
To translate the planned motion distribution into actions,
two PID controllers are designed for lateral (steering) and
longitudinal (throttle and brake) control. The parameters are
shown in Table S3.
TABLE S2: Quantitative comparison of PMP-NO-RDR (which removes the radar input from PMP-net) with
other models. ↑ means larger numbers are better, ↓ means smaller numbers are better. The bold font highlights the best results
in each column.
Training Conditions New Weather New Town New Town & Weather
Town Name Town03 (urban) Town05 (urban) Town07 (rural) Town06 (urban)
Traffic Density Empty Regular Dense Empty Regular Dense Empty Regular Dense Empty Regular Dense
Success Rate (%)
CIL[S1] 38 16 16 33 11 0 0 0 0 16 11 0
CIL-R 83 55 38 33 22 16 22 11 11 11 11 11
INT[S2] 16 33 11 83 5 5 38 22 5 94 61 16
PMP-NO-RDR 50 38 55 16 22 38 44 44 22 100 72 50
PMP (ours) 100 72 88 100 77 77 100 83 72 100 88 83
Wrong Lane (%)
CIL[S1] 66.05 45.16 50.87 57.22 64.41 46.18 35.55 36.81 40.71 44.14 52.37 52.03
CIL-R 26.60 25.57 19.07 26.58 36.64 41.86 8.88 7.20 3.35 42.50 50.72 51.61
INT[S2] 0.00 0.04 0.01 0.07 0.12 0.15 0.00 0.00 0.00 0.12 0.13 0.28
PMP-NO-RDR 6.43 4.69 1.67 7.93 1.78 1.64 0.78 0.42 0.42 2.21 1.30 1.54
PMP (ours) 0.02 0.00 0.01 0.40 0.48 0.50 0.04 0.00 0.01 0.43 0.40 0.61
Overspeed (%)
CIL[S1] 0.33 0.37 0.16 0.10 0.00 0.04 0.00
CIL-R 0.14 0.13 0.08 0.04 0.00 0.00 0.33 0.09 0.28 0.10 0.16 1.54
INT[S2] 17.70 11.18 5.85 17.09 15.14 8.52 19.03 11.87 14.84 37.12 30.22 31.04
PMP-NO-RDR 0.48 0.04 0.15 0.16 0.00 0.02 0.18 0.27 0.09 0.03 0.22 0.12
PMP (ours) 0.11 0.22 0.12 0.14 0.00 0.06 0.26 0.30 0.28 0.40 0.28 0.36
Fig. S3: Failure cases of collision events from PMP-NO-RDR. Top to bottom: onboard camera images, projected radar images
(not the input of PMP-NO-RDR, only for interpretation of the performance degradation) and lidar images (y-channel). Brighter
points mean larger values. The obstacles hit by the ego-vehicle are bounded by blue boxes.
TABLE S3: PID parameters used in this work
Type Proportional (P) Integral (I) Derivative (D)
Lateral 0.70 0.00 0.00
Longitudinal 0.25 0.20 0.00
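The two controllers of Table S3 can be sketched as below. The discrete PID form and the convention of tracking a conservative target speed (mean minus one standard deviation of the planned speed distribution) are our own illustrative assumptions; the paper only specifies the gains.

```python
class PID:
    """Discrete PID controller (gains taken from Table S3)."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

lateral = PID(0.70, 0.00, 0.00)       # steering
longitudinal = PID(0.25, 0.20, 0.00)  # throttle / brake

def control(heading_error, speed_mean, speed_std, current_speed, dt=0.05):
    # Assumed convention: track a conservative target speed so that high
    # planning uncertainty (large variance) automatically slows the vehicle.
    target_speed = max(0.0, speed_mean - speed_std)
    steer = max(-1.0, min(1.0, lateral.step(heading_error, dt)))
    accel = longitudinal.step(target_speed - current_speed, dt)
    if accel > 0:
        return steer, min(accel, 1.0), 0.0   # throttle, no brake
    return steer, 0.0, min(-accel, 1.0)      # brake, no throttle
```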
In this section, we aim to explore whether PMP-net effectively
complements the lidar and radar in the form of the ralidar image
introduced in the paper, or whether a camera and lidar alone are
sufficient. Therefore, we remove the radar input from PMP-net
and train an ablated model named PMP-NO-RDR, which
takes as input the ego-velocity, the global route and the
information from the camera and lidar. The training setup of
this model is the same as that of PMP-net.
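The ablation can be pictured as dropping one branch of a late-fusion head that concatenates per-modality feature vectors before the planning layers. The sketch below is schematic only; the encoder dimensions and the toy random-projection "encoder" are placeholders, not the architecture of PMP-net.

```python
import numpy as np

def encode(x, dim, seed):
    """Toy stand-in for a trained per-modality encoder network."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((x.size, dim))
    return np.tanh(x.ravel() @ w)

def fuse(camera, lidar, radar=None, velocity=None, route=None):
    """Late fusion by concatenation; the radar branch is dropped in PMP-NO-RDR."""
    feats = [encode(camera, 128, 0), encode(lidar, 64, 1)]
    if radar is not None:
        feats.append(encode(radar, 64, 2))
    for extra in (velocity, route):
        if extra is not None:
            feats.append(np.asarray(extra, dtype=float).ravel())
    return np.concatenate(feats)
```

With these placeholder sizes, removing the radar branch simply shortens the fused feature vector; the rest of the network is unchanged, which mirrors how the ablated model keeps the same training setup.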
A. Quantitative Analysis
As with the other models, PMP-NO-RDR is evaluated on
our DeepTest benchmark with the same setup. The quantitative
comparison results are shown in Table S2. For the driving
Fig. S4: Evaluation results of PMP-net in a new urban environment with dynamic illumination and weather conditions. The
parameters of dynamic environments such as sun altitude angle and fog are shown on the bottom of snapshots. Noticeable
road agents are bounded by green boxes. The red lines indicate the global routes.
ability, it can be seen that the SR (success rate) of PMP-NO-RDR
drops from 100% (for PMP) to 16~100% in Empty traffic
settings, and from 72~88% (for PMP) to 22~55% in Dense
traffic settings. For the driving quality, the OVSP (overspeed)
values of PMP-NO-RDR are similar to those of PMP, while
the WL (wrong lane) values increase from 0.00~0.61% (for
PMP) to 0.42~7.93%. These results suggest that the radar input
indeed complements the lidar in PMP-net and improves the
overall navigation performance. Note that even without the
radar input, PMP-NO-RDR still performs better than the other
three baselines, i.e., CIL, CIL-R and INT, with lower OVSPs
and higher SRs, especially in Dense traffic settings.
B. Qualitative Analysis
We show three failure cases of collision events from PMP-NO-RDR
in Fig. S3 to interpret its performance degradation,
which mainly happens when interacting with dynamic obstacles.
We can see that the vehicle sometimes collides with other
road agents when changing lanes (Fig. S3-(c)) or turning at
intersections (Fig. S3-(a,b)). In these scenes, although other
obstacles block the road ahead, PMP-NO-RDR still controls
the vehicle to move forward by applying throttle. For
interpretation, we show the related radar/lidar images in the
second/third row of Fig. S3. It can be seen that the static
environment and the dynamic obstacles are clearly separated
in the radar images because of their different speeds relative
to the ego-vehicle. This characteristic of radar information
is significant for comprehensive scene understanding,
complementing the lidar information, which only captures the
static properties (e.g., position, length and height) of the
surroundings.
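The static/dynamic separation described above comes from the Doppler (radial) velocity that each radar return carries: after compensating the ego-motion component, moving obstacles stand out from the static background. A minimal sketch, with an illustrative threshold that is not from the paper:

```python
import numpy as np

def split_by_motion(radial_speeds, ego_radial, threshold=0.5):
    """Return boolean masks (dynamic, static) over radar returns.

    radial_speeds -- measured radial velocity of each return (m/s)
    ego_radial    -- expected radial velocity of a static point given
                     the ego-vehicle's own motion (m/s)
    threshold     -- residual speed above which a return counts as dynamic
    """
    residual = np.abs(np.asarray(radial_speeds) - np.asarray(ego_radial))
    dynamic = residual > threshold
    return dynamic, ~dynamic
```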
Fig. S5: Evaluation results of PMP-net on a new vehicle chevrolet-impala. The parameters of dynamic environments such as
sun altitude angle and fog are shown on the bottom of snapshots. Noticeable road agents are bounded by green boxes. The
red lines indicate the global routes.
A. New Town with Unseen Features
We further test the driving performance of PMP-net in a
more realistic town named Town10HD. This test is much
more difficult than DeepTest because we add the following new
features, which are also unseen in the training stage.
• Dynamic illumination (changing from daytime to night)
and weather (changing from clear to rainy) conditions within
a single driving episode.
• Road lamps and vehicle lights at night.
• Fog and wind weather conditions.
The results show that our PMP-net still navigates the vehicle
robustly and safely in these scenarios. We show qualitative
results in Fig. S4 and some demo videos on the website. A
tough scene is shown in Fig. S4-(f). On a foggy and rainy
night, a small child wearing dark-colored clothes abruptly
crosses the road, further obscured by other vehicles’ lights.
Even a human driver would find it difficult to react in time
because the child can hardly be seen. Nevertheless, PMP-net
stops the vehicle in time by applying full brake to avoid an
accident.
B. New Driving Platform
Similar to [S3], to further test the cross-platform general-
ization performance of PMP-net, we change the test vehicle
from audi-a2 to chevrolet-impala. Since the new vehicle is much
longer, we scale the steering output of the network
by 1.4 times. The test scene is the same as the one in
Section S.III.A. We show qualitative results in Fig. S5 and
a demo video on our website. Note that the new vehicle has
controllable lights in CARLA; therefore, it is further programmed
to turn on the LowBeam at night and the Blinker when turning at
intersections. This setup brings new visual effects to camera
images. The results confirm that PMP-net also performs well
on this new platform in dynamic environments, such as the
collision avoidance event shown in Fig. S5-(b,c), and the
ability of driving at rainy night shown in Fig. S5-(d).
REFERENCES
[S1] F. Codevilla, M. Miiller, A. López, V. Koltun, and
A. Dosovitskiy, “End-to-end driving via conditional imi-
tation learning,” in 2018 IEEE International Conference
on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–9.
[S2] W. Gao, D. Hsu, W. S. Lee, S. Shen, and K. Subra-
manian, “Intention-net: Integrating planning and deep
learning for goal-directed autonomous navigation,” in
Conference on Robot Learning, 2017, pp. 185–194.
[S3] P. Cai, Y. Sun, H. Wang, and M. Liu, “VTGNet:
A vision-based trajectory generation network for au-
tonomous vehicles in urban environments,” arXiv
preprint arXiv:2004.12591, 2020.