
Vision-Based Trajectory Planning via Imitation Learning for Autonomous Vehicles

Authors: Peide Cai, Yuxiang Sun, Yuying Chen, Ming Liu
Abstract—Reliable trajectory planning like that of human drivers in real-world dynamic urban environments is a critical capability for autonomous driving. To this end, we develop a vision- and imitation-learning-based planner to generate collision-free trajectories several seconds into the future. Our network consists of three sub-networks that conduct three basic driving tasks: keep straight, turn left and turn right. During the planning process, high-level commands are received as prior information to select a specific sub-network. We create our dataset from the RobotCar dataset, and the experimental results suggest that our planner is able to reliably generate trajectories in various driving tasks, such as turning at different intersections, lane-keeping on curved roads and changing lanes for collision avoidance.
I. INTRODUCTION
To achieve autonomous driving, a vehicle needs to model its surroundings using perception techniques such as 3-D SLAM [1]–[3]. The extracted information can then be used to plan collision-free trajectories to the goal position [4]. However, it remains a challenge to bring autonomous vehicles up to the capabilities of human drivers, especially in real-world environments, where trajectory planning is a crucial task.
Within the trajectory planning framework, the solution trajectory is represented as a time-parametrized function $\pi(t): [0, T] \rightarrow \mathcal{X}$, where $T$ is the planning horizon and $\mathcal{X}$ is the configuration space of the vehicle [5]. Human drivers who plan to reach a destination in an unknown environment are usually informed by navigation software about which direction to drive, such as keeping straight or turning left at the next intersection. Based on this observation, we propose a vision-based imitation learning method to plan trajectories with high-level driving commands (keep straight, turn left, turn right).
Basically, methods for planning and decision-making for
autonomous vehicles can be divided into three approaches:
traditional sequential planning, end-to-end control and end-
to-end planning. Their pipelines are illustrated in Fig. 1.
Traditional approaches treat planning as an optimization problem in which the optimal actions with respect to a cost function should be selected. The various approaches differ in the form of both the actions and the cost functions [6], which in most work have to be carefully designed by hand. Besides, many of them are based on map building, and the maps are difficult to update dynamically under limited computational resources, so they may not reflect changes in the environment quickly enough [7], [8]. Additionally, traditional approaches classically rely on a perception system to extract information in the form of manually designed features from raw sensory input, which constrains their adaptivity to generic environments [8].

This work was supported by the National Natural Science Foundation of China (Grant No. U1713211) and the Research Grants Council of Hong Kong SAR Government, China, under Project No. 11210017 and No. 21202816, awarded to Prof. Ming Liu.
Peide Cai, Yuxiang Sun, Yuying Chen and Ming Liu are with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China. (e-mail: pcaiaa@connect.ust.hk; sun.yuxiang@outlook.com; ychenco@connect.ust.hk; eelium@ust.hk)

Fig. 1. Different approaches for trajectory planning and decision-making on autonomous vehicles: traditional sequential planning (perception, behavioral layer, motion planning, control), end-to-end control (sensor input directly to control output), and end-to-end planning followed by a control module.
Recently, a number of deep learning methods that integrate perception, planning and control have been proposed and have achieved impressive results on vehicle navigation; these are known as end-to-end control [9]–[11]. However, these prior efforts mostly express the problem as learning a mapping from perceptual inputs directly to vehicle control commands (steering angle and throttle). The deficiencies of this treatment are twofold. (i) The vehicle is not steerable at an intersection: since it considers only the camera input for decision making, it may take a wrong turn for lack of high-level navigation commands [12]. (ii) The learned behavioral policy performs well only on data collected with the specifically calibrated actuation setup [13].
In this paper, to tackle the problems above, we propose a novel network that maps visual perception and state information to future trajectories with the help of a high-level turning command. Here the perception, behavioral layer and motion planning modules of the traditional sequential planning framework are integrated to form an end-to-end planning method. The proposed network has been successfully trained on our dataset created from RobotCar [14], and proved reliable in comparison with different baselines. The main contributions of this paper are summarized as follows:
- By imitating the behavior of human drivers, we introduce a novel mapless learning-based planning method for autonomous vehicles in different situations, with less prior information than traditional approaches.
- Our method is reliable when turning at an intersection and in other challenging scenes such as changing lanes for collision avoidance, slowing down for a slow car and lane-keeping on curved roads.
- The planned trajectories from our network can be further translated into steering and throttle commands by controllers designed for different vehicles, which is more general than end-to-end control methods.
II. RELATED WORK
A. End-to-end Control Methods
ALVINN [15] in 1989 was the pioneering attempt to use a shallow neural network for autonomous driving. Since the method was very simple and comprised a limited number of neural network layers, it could only drive in very simple scenarios with few obstacles, far from being applicable in a real traffic environment. Even so, it demonstrated the potential of a well-trained end-to-end neural network to navigate a vehicle on public roads.
Inspired by ALVINN, another framework called DAVE-2 was proposed by NVIDIA [9] in 2016 using a similar idea, but benefiting from more powerful modern Convolutional Neural Networks (CNNs), larger data sets for training and higher computing power from GPUs. During driving, a camera mounted in the center portion behind the windshield captures the images in front of the car and feeds them into a CNN to compute steering commands. By using a CNN to learn the lane-following task without manual intervention, this framework achieved remarkable success in relatively simple real-world scenarios, such as flat or barrier-free roads. However, the framework was only preliminary, and more work needs to be done to improve system robustness.
These works mainly address the lane-following task. In order to resolve the ambiguity that occurs at intersections and extend the problem into a goal-directed navigation task, [12] and [16] redesigned the network to receive not only perceptual inputs but also the driver's intention. The neural-network motion controller in [16], trained on data from the first floor of a building, can also drive the robot successfully on the second floor, which has a different geometric arrangement and visual appearance.
B. End-to-end Planning Methods
To make the driving model more generic, some recent methods adopt an end-to-end planning formulation that leaves the control component outside the neural network [17], [18]. To investigate whether neural networks are capable of predicting the path 5 seconds into the future, [17] tested different networks with various inputs, such as a sequence of gray-scale images, past ego motion, detections of surrounding objects and lane marking estimations. All models were trained via imitation learning on 7 hours of real-world driving data collected in rural Europe. The results showed that the path predicted by an LSTM or CNN-LSTM is smooth and feasible in many situations. However, the network was not designed to handle decision-making, and only the lane-keeping task was considered. In addition, interaction with the surroundings was not taken into account because of data limitations. The ChauffeurNet proposed in [18] is a partially end-to-end planning method because the perception module, which processes raw sensor inputs into top-down representations of the environment, is independent of the planning network. These mid-level inputs are received by a recurrent neural network (RNN) to compute a driving trajectory, after which a control module translates the trajectory into steering and acceleration. By augmenting the data with synthesized perturbations and augmenting the imitation loss with losses that discourage bad behavior, this method achieved good results on driving tasks such as stopping for stop signs and lane-following along straight or curved roads. Different from [18], our planning architecture is mapless and fully end-to-end.
III. METHODOLOGY
A. Network Architecture
The goal of our work is to plan trajectories like those planned by humans. We therefore assume that the history driving trajectories before the current time were produced by human drivers, and we train our network by learning from those trajectories. At every time step, the inputs of the network are the camera images $\{i_1, \dots, i_K\}$ and the vehicle's motion state information $\{m_1, \dots, m_K\}$ over the past 1.5 seconds (assume $K$ steps). The output of the network is a collision-free trajectory looking 3 seconds (assume $P$ steps) into the future, $\{m^1_{plan}, \dots, m^P_{plan}\}$. We have three sub-networks, selected by the prior command information $c$, to conduct the different driving tasks. The high-level command takes one of three values: "Turn left at the coming intersection", "Keep straight" and "Turn right at the coming intersection". The movement information contains the lateral position $x$ and longitudinal position $z$ together with the speed $v$, which can be easily acquired and corrected by GPS, IMU, etc. Note that the positions are transformed from the world frame to the local one. Once a specific sub-network is chosen, the images are first processed by an image module $I(i)$, implemented as a convolutional network, which extracts a feature vector of length 128 from each image. Then a balance module $B(m)$ expands the length of each history movement vector from 3 to 32 to balance the influence of the vision and movement feature vectors, where we draw on the experience of [16]. The balance module is implemented as a fully connected network. The outputs of these two modules are concatenated at every time step into joint vectors of length 160, represented by:

$$J_n = \langle I(i_n), B(m_n) \rangle, \quad 1 \le n \le K \qquad (1)$$

where $I(i_n)$ is the output of the image module for the $n$-th image, $B(m_n)$ is the output of the balance module for the $n$-th movement vector, and $J_n$ is the $n$-th joint vector. The set $J = \{J_1, \dots, J_K\}$ is then processed by an LSTM network and further reshaped by a fully connected layer to obtain the final output vector of length $3 \times P$, of which every 3 cells represent a state $m_{plan}$ at one time step in the preview horizon. Here $m_{plan}$ also consists of $x$, $z$ and $v$.
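To make the data flow concrete, below is a minimal PyTorch sketch of one such sub-network, written from the description above rather than taken from the authors' code. The dimensions (128-D visual features, 32-D expanded state, 160-D joint vectors, a final vector of length 3 × P) follow the text; the class and variable names, and the choice to use the last hidden state of the LSTM as the many-to-one output, are our assumptions.

```python
import torch
import torch.nn as nn

class BalanceModule(nn.Module):
    """B(m): expands each 3-D state vector (x, z, v) to length 32."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(3, 32), nn.ReLU())

    def forward(self, m):              # m: (batch, K, 3)
        return self.fc(m)              # -> (batch, K, 32)

class TrajectoryPlanner(nn.Module):
    """One sub-network of the CNN-LSTM+State planner (one per command)."""
    def __init__(self, image_module, P, hidden=512, layers=3):
        super().__init__()
        self.image_module = image_module   # I(i): 224x224 RGB image -> 128-D feature
        self.balance = BalanceModule()
        self.lstm = nn.LSTM(input_size=128 + 32, hidden_size=hidden,
                            num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, 3 * P)   # (x, z, v) for each of P future steps
        self.P = P

    def forward(self, images, states):
        # images: (batch, K, 3, 224, 224); states: (batch, K, 3)
        b, k = images.shape[:2]
        vis = self.image_module(images.flatten(0, 1)).view(b, k, -1)   # (b, K, 128)
        joint = torch.cat([vis, self.balance(states)], dim=-1)         # (b, K, 160), Eq. (1)
        _, (h, _) = self.lstm(joint)       # many-to-one: keep only the last hidden state
        return self.head(h[-1]).view(b, self.P, 3)                     # (b, P, 3)
```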
[Figure 2 diagram: K input images and K history state vectors (location_x, location_z, velocity) are encoded by CNN and FC modules into visual features of length 128 and history features of length 32 per step (K = 12 in the figure), concatenated into combined features of length 160, passed through a many-to-one LSTM and a final FC layer, and decoded into the output (x, z, velocity) × P; the prior known current command (Turn Left / Keep Straight / Turn Right) selects the sub-network.]

Fig. 2. The CNN-LSTM+State network architecture for trajectory planning. The command information takes three discrete values: Turn Left, Turn Right and Keep Straight. Different sub-networks are then chosen to compute a trajectory. Note that the three network architectures based on different commands are identical; they only differ in the training data. The figure is best viewed in color.
[Figure 3 diagram: input image 224×224×3 → Conv1 (109×109×16) → Conv2 (52×52×32) → Conv3 (24×24×48) → Conv4 (10×10×64) → flatten (6400) → FC 1024 → FC 512 → FC 128.]

Fig. 3. The CNN module for processing every image at 224×224 pixel resolution. The output feature is a single vector of length 128 after four convolutional layers and three fully connected layers. Each convolutional layer is followed by a 2×2 max pooling layer.
Fig. 2 shows our proposed CNN-LSTM+State network architecture. Overall, the deep network planner $F$ essentially performs a multivariable regression task, which can be formulated by the following equation:

$$\{m^1_{plan}, \dots, m^P_{plan}\} = F\big(\{i_1, \dots, i_K\}, \{m_1, \dots, m_K\} \,\big|\, c\big) \qquad (2)$$
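Since the conditioning in Eq. (2) is implemented by switching between three identical sub-networks that differ only in their training data, the dispatch itself is simple. A minimal sketch, assuming a constructor `make_subnet` that builds a planner such as the `TrajectoryPlanner` sketched above:

```python
import torch.nn as nn

class ConditionalPlanner(nn.Module):
    """Holds one identical sub-network per high-level command and dispatches on c (Eq. 2)."""
    def __init__(self, make_subnet):
        super().__init__()
        self.subnets = nn.ModuleDict({
            "turn_left": make_subnet(),
            "keep_straight": make_subnet(),
            "turn_right": make_subnet(),
        })

    def forward(self, images, states, command):
        # command: one of "turn_left", "keep_straight", "turn_right"
        return self.subnets[command](images, states)
```

At test time, the high-level command received from the navigation software picks the sub-network, e.g. `planner(images, states, "turn_left")`.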
B. Network Details
The CNN module for processing the input RGB images
is shown in Fig. 3, which consists of four convolutional
layers, four max pooling layers and three fully connected
layers. For the four convolutional layers, the kernel sizes
are 7, 6, 5, 5 and the corresponding numbers of filters are
16, 32, 48, 64. The strides are 1 for the four layers. The
features are then flattened to get a single vector of length
6400. Then a multilayer perceptron of three fully connected
layers transforms it into the final output feature vector of
length 128. We performed batch normalization [19] after all
hidden layers. In the CNN module, we use ReLU activation
functions after each convolutional layer.
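A minimal PyTorch sketch of this CNN module follows, reconstructed from the stated kernel sizes (7, 6, 5, 5), filter counts (16, 32, 48, 64), stride 1, 2×2 max pooling, batch normalization, ReLU activations, and the 6400 → 1024 → 512 → 128 fully connected stack. The exact placement of batch normalization within each block is our assumption.

```python
import torch.nn as nn

class ImageCNN(nn.Module):
    """Image module I(i): 224x224x3 RGB image -> 128-D feature vector (Fig. 3)."""
    def __init__(self):
        super().__init__()
        def block(cin, cout, k):
            # conv (stride 1) -> batch norm -> ReLU -> 2x2 max pooling
            return nn.Sequential(nn.Conv2d(cin, cout, k), nn.BatchNorm2d(cout),
                                 nn.ReLU(), nn.MaxPool2d(2))
        self.features = nn.Sequential(
            block(3, 16, 7),    # 224 -> 218 -> 109
            block(16, 32, 6),   # 109 -> 104 -> 52
            block(32, 48, 5),   #  52 ->  48 -> 24
            block(48, 64, 5),   #  24 ->  20 -> 10  (10 * 10 * 64 = 6400)
        )
        self.mlp = nn.Sequential(
            nn.Linear(6400, 1024), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Linear(512, 128),
        )

    def forward(self, x):                                 # x: (batch, 3, 224, 224)
        return self.mlp(self.features(x).flatten(1))      # -> (batch, 128)
```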
For the LSTM module, the number of features in the hidden state is 512. The number of recurrent layers is set to 3, which means that three LSTMs are stacked together; each layer receives the output of the previous one, and the third LSTM layer produces the final result.
For each training sample, given the ground truth $trj_{gt}$ and the planned $(3 \times P)$ vector $trj_{plan}$, our loss function is defined as follows:

$$\ell\big(trj_{plan}, trj_{gt}\big) = \sum_{n=1}^{P} \Big[ \big(x^n_{plan} - x^n_{gt}\big)^2 + \big(z^n_{plan} - z^n_{gt}\big)^2 + \big(v^n_{plan} - v^n_{gt}\big)^2 \Big] \qquad (3)$$
C. Data Collection and Preprocessing
1) Data Collection: We create our dataset from the RobotCar dataset, which was recorded in urban areas with 6 cameras mounted on the vehicle, along with LiDAR, GPS and INS information. The dataset covers over 1000 km of driving and contains many turns at crossroads and interactions with dynamic objects on the road, such as slowing down for a slow or parked car and changing lanes for collision avoidance. The ground truth position information can be acquired from the fused GPS+Inertial solution recorded at a frequency of 50 Hz, which is also provided in the dataset.
From RobotCar, we extract camera images and position information to train and test the planning network. To visualize and measure the performance, we need to project the trajectories and neighboring objects into a top-down view at every frame. To achieve this, it is necessary to detect and compute the position, orientation and size of each dynamic object around the ego-vehicle. For these two purposes, we pick pairwise images from the stereo camera on top of the vehicle in RobotCar, using the left and right cameras with a baseline of 24 cm. We can then compute disparity and restore the depth of the environment from the paired images.
The data distribution and splitting are shown in Fig. 4. The final training dataset contains 52,200 images (left camera only), corresponding to about 2 hours of daytime driving. We labelled every image with a command telling the vehicle where to go based on the ground truth trajectory. Before finalizing the dataset, we removed the sequences with poor GPS signals to ensure the quality of the training data. Note that in this dataset, we do not consider situations where the car stops for traffic lights or stop signs.

[Figure 4 diagram: Keep Straight 60%, Turn Right 24%, Turn Left 16%; within the keep straight portion: normal driving 55%, driving on curved roads 2%, following a car 22%, changing lanes for collision avoidance 11%, slowing down for a slow car 10%; each situation is split into training, validation and test sets.]

Fig. 4. Data distribution and splitting, totaling 52,200 images for three driving tasks. For each situation, the split ratio for training, validation and test is 35:4:11. The stacked column bar on the right shows the distribution of data in the keep straight situation, where we have balanced the portion of normal and abnormal driving cases. The turning situations do not contain much interaction with other road agents. The figure is best viewed in color.

Fig. 5. The pipeline to extract object information around the ego-vehicle. Images in row 1 come from the left camera. Row 2 shows the detected objects from Mask R-CNN. Row 3 shows the estimated orientation and size from PSMNet. Row 4 shows the restored depth information for left camera images. The figure is best viewed in color.
2) Data Preprocessing: To train the network, we need to equip every image with the trajectory of the ego-vehicle over the past 1.5 seconds and the next 3 seconds. We first interpolate the ground truth UTM (Universal Transverse Mercator) position and velocity series to the image timestamps recorded at 15 Hz. For every image, the corresponding UTM coordinates spanning 4.5 seconds are then converted to the local coordinate system.
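As an illustration of this step, the sketch below interpolates a GPS+INS series onto the image timestamps and converts UTM positions around one image into the vehicle's local frame. The sign conventions (x lateral, z longitudinal) and the use of an INS heading angle are our assumptions, and the function names are hypothetical.

```python
import numpy as np

def resample(t_src, values, t_img):
    """Linearly interpolate a 1-D series recorded at times t_src onto image timestamps t_img."""
    return np.interp(t_img, t_src, values)

def utm_to_local(easting, northing, ego_e, ego_n, ego_yaw):
    """Convert UTM points into the ego frame at one image timestamp.
    ego_yaw: vehicle heading measured from the UTM easting axis (assumed convention).
    Returns x (lateral) and z (longitudinal/forward) coordinates."""
    de, dn = easting - ego_e, northing - ego_n
    c, s = np.cos(ego_yaw), np.sin(ego_yaw)
    z = c * de + s * dn          # component along the driving direction
    x = s * de - c * dn          # lateral component (sign is a convention choice)
    return x, z
```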
To visualize the road objects when analyzing the planning performance, we design a pipeline (Fig. 5) to extract the surrounding objects' size, orientation and position relative to the ego-vehicle. We first use Mask R-CNN [20], [21] to detect the objects in every image frame, and the acquired 2D bounding boxes are then used to estimate the objects' size and orientation [22], [23]. The pyramid stereo matching network (PSMNet) [24] is then utilized to compute the depth information for the left camera images. Finally, the objects' positions are computed from the depth and the detected masks along with the camera parameters.
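The last step of this pipeline reduces to standard stereo geometry: depth from disparity using the 24 cm baseline, then back-projection of the masked pixels with the camera intrinsics. A hedged sketch, assuming a rectified pinhole model; the intrinsic parameters and helper names below are placeholders, not values from the paper.

```python
import numpy as np

BASELINE = 0.24   # stereo baseline in meters (stated in Sec. III-C)

def depth_from_disparity(disparity, fx):
    """Per-pixel depth (m) from a disparity map (px) of a rectified stereo pair."""
    return fx * BASELINE / np.maximum(disparity, 1e-6)

def object_position(mask, depth, fx, fy, cx, cy):
    """Approximate 3-D position of one detected object in the camera frame.
    mask: boolean instance mask (HxW); depth: per-pixel depth map (HxW)."""
    v, u = np.nonzero(mask)                 # pixel rows/columns inside the mask
    z = np.median(depth[v, u])              # robust object depth
    x = (np.median(u) - cx) * z / fx        # right of the optical axis
    y = (np.median(v) - cy) * z / fy        # below the optical axis
    return np.array([x, y, z])
```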
IV. EXPERIMENTS
We train the three sub-tasks, keep straight, turn left and turn right, separately in our CNN-LSTM+State network, which is implemented in PyTorch. We use the Adam optimizer [25] for training with an initial learning rate of 0.001 and a batch size of 32. We also use the same dataset and training procedure for the other variants of our method. Compared with our full method, CNN-LSTM+State, these models do not use the previous state information for planning trajectories:
- CNN-FC: uses the image series looking back 1.5 seconds as input, with a CNN as visual encoder and fully connected layers (FC) as decoder.
- CNN-LSTM: uses the image series looking back 1.5 seconds as input, with a CNN as visual encoder and an LSTM as decoder.

Fig. 6. Box plots on different scenes, (a) Keep Straight, (b) Turn Left and (c) Turn Right, showing the distribution of displacement error over the 3-second preview time for the CNN-FC, CNN-LSTM and our full CNN-LSTM+State methods. The x-axis shows the preview time in seconds (0.5 s to 3.0 s). The figure is best viewed in color.
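Putting the earlier sketches together, a minimal imitation-learning training loop under the stated hyper-parameters (Adam, learning rate 0.001, batch size 32) and the loss of Eq. (3) might look as follows. The dataset object, the number of epochs and the tensor shapes are our assumptions; `TrajectoryPlanner` refers to the sketch in Section III-A.

```python
import torch
from torch.utils.data import DataLoader

def train(planner, dataset, epochs=50, device="cuda"):
    """Train one command-specific sub-network by imitating recorded human trajectories."""
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(planner.parameters(), lr=1e-3)
    planner.to(device).train()
    for _ in range(epochs):
        for images, states, traj_gt in loader:
            # images: (B, K, 3, 224, 224); states: (B, K, 3); traj_gt: (B, P, 3)
            images = images.to(device)
            states = states.to(device)
            traj_gt = traj_gt.to(device)
            traj_plan = planner(images, states)
            # Eq. (3): squared error over (x, z, v) summed across the P planned steps
            loss = ((traj_plan - traj_gt) ** 2).sum(dim=(1, 2)).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return planner
```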
A. Quantitative Evaluation
1) Metrics: We use several metrics to measure the performance of the different methods. All metrics are averaged over all test samples. $E(\ell_2)$ is the average displacement error [26], i.e. the mean distance between all planned points and the true points of a trajectory. Similarly, $E(lateral)$ and $E(longi)$ denote the lateral and longitudinal errors. $E(final\ disp.)$ is the final displacement error [26], meaning the distance between the planned and true final destinations of a trajectory. $E(speed)$ measures the mean speed error over the preview horizon. The average time $T$ for the different networks to plan a trajectory is also recorded. We expanded the future trajectory of every frame into a driving area $D$ according to the vehicle's width and then calculated the IoU as:

$$IoU = \frac{\left| D_{gt} \cap D_{plan} \right|}{\left| D_{gt} \cup D_{plan} \right|} \qquad (4)$$
TABLE I
QUANTITATIVE RESULTS BY DIFFERENT METHODS. SMALLER VALUES ENCODE BETTER PERFORMANCE EXCEPT FOR IOU.
(The unit for T is ms, for IoU it is %, for E(speed) it is m/s; the remaining error metrics are in m.)

Scenes          Methods           T      DLJ     IoU     E(speed)  E(l2)   E(lateral)  E(longi)  E(final disp.)
Keep Straight   CNN-FC            88.80  -0.26   83.61   0.78      1.33    0.22        1.27      2.60
                CNN-LSTM          89.08  -0.13   82.62   0.88      1.35    0.25        1.28      2.53
                CNN-LSTM+State    91.17  -0.33   85.68   0.30      0.77    0.20        0.70      1.54
Turn Left       CNN-FC            91.38  -0.55   68.59   0.88      1.31    0.52        1.11      2.72
                CNN-LSTM          96.40  -0.33   70.42   0.91      1.35    0.48        1.16      2.73
                CNN-LSTM+State    91.01  -0.50   74.50   0.35      0.61    0.35        0.41      1.47
Turn Right      CNN-FC            90.20  -0.41   67.97   0.74      1.22    0.50        1.00      2.53
                CNN-LSTM          90.49  -0.23   68.93   0.70      1.16    0.49        0.94      2.33
                CNN-LSTM+State    89.10  -0.70   73.24   0.32      0.63    0.38        0.41      1.48
Fig. 7. For every time step, the errors of the planned lateral and longitudinal positions after 1.0 second over the whole sequence, as computed by the CNN-FC, CNN-LSTM and our full CNN-LSTM+State methods. The x-axis shows the time in the sequence in seconds. The figure is best viewed in color.
Fig. 8. The green line represents the path of the vehicle in the test sequence, which contains many turns at intersections suitable for measuring the performance of our model.¹
Additionally, we use the dimensionless jerk (DLJ) metric to measure the smoothness of the trajectory [27], where $v(t)$ is the movement speed and $t_1$ and $t_2$ are the start and end times of the movement. The higher the value of DLJ, the smoother the trajectory:

$$DLJ \triangleq -\frac{(t_2 - t_1)^3}{v^2_{peak}} \int_{t_1}^{t_2} \left| \frac{d^2 v}{dt^2} \right|^2 dt, \qquad v_{peak} \triangleq \max_{t \in [t_1, t_2]} v(t) \qquad (5)$$
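For reference, the displacement-based metrics and the dimensionless jerk can be computed directly from the planned and ground-truth trajectories. Below is a NumPy sketch under our own assumptions: trajectories are (P, 3) arrays of (x, z, v) sampled at a fixed interval dt, and the lateral/longitudinal errors are taken as mean absolute errors; the driving-area IoU is omitted because it additionally requires the polygon of the vehicle footprint.

```python
import numpy as np

def displacement_metrics(plan, gt):
    """plan, gt: (P, 3) arrays of (x, z, v) for one planned/ground-truth trajectory."""
    d = np.linalg.norm(plan[:, :2] - gt[:, :2], axis=1)   # point-wise position error
    return {
        "E_l2": d.mean(),                                  # average displacement error
        "E_final_disp": d[-1],                             # final displacement error
        "E_lateral": np.abs(plan[:, 0] - gt[:, 0]).mean(),
        "E_longi": np.abs(plan[:, 1] - gt[:, 1]).mean(),
        "E_speed": np.abs(plan[:, 2] - gt[:, 2]).mean(),
    }

def dimensionless_jerk(v, dt):
    """DLJ of Eq. (5) for a speed profile v sampled at interval dt (higher = smoother)."""
    accel = np.gradient(v, dt)                 # dv/dt
    jerk = np.gradient(accel, dt)              # d^2 v / dt^2
    duration = dt * (len(v) - 1)
    return -(duration ** 3) / (np.max(v) ** 2) * np.sum(jerk ** 2) * dt  # rectangle-rule integral
```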
2) Results: Table I shows the quantitative results of the experiments among the various models. From the third column of Table I, it can be seen that the proposed method, CNN-LSTM+State, can run at about 11 Hz, faster than the other methods when turning left and right, which makes it capable of real-time applications. When the decoder is changed from FC to LSTM, the CNN-LSTM network shows a smaller or comparable final displacement error for all scenes and a larger IoU for turning left and right. Furthermore, it plans smoother trajectories with a higher DLJ. This may be due to the fact that an LSTM has an advantage over FC layers in processing temporal information. Finally, by feeding both the image series and the past state into our full CNN-LSTM+State network, the performance improves considerably, with a higher IoU and lower errors on the prediction of speed and positions in the three test scenes. This shows the importance of past state information for planning future trajectories.

¹ https://robotcar-dataset.robots.ox.ac.uk/datasets/2015-08-17-13-30-19/

Fig. 9. Results for turning left and right at different intersections by our CNN-LSTM+State model. Row 1 shows the driving areas generated by the trajectories. Row 2 shows the top-down view. The horizontal axis indicates the lateral coordinates and the vertical axis indicates the longitudinal coordinates, both in meters. The figure is best viewed in color.

Fig. 10. Some special cases for the qualitative evaluation, with columns (a), (b), (c1), (c2), (d1) and (d2). Our CNN-LSTM+State model generates smoother trajectories than the ground truth when the GPS+INS data deviates slightly from the real driving trajectory, which is shown in column (a). Column (b) shows the situation when driving on curved roads. Column (c1) shows the situation when changing lanes for collision avoidance. Column (c2) shows the corresponding top-down view, of which the horizontal axis indicates the lateral coordinates and the vertical axis indicates the longitudinal coordinates, both in meters. Column (d1) shows the situation when slowing down for a slow car. Column (d2) shows the corresponding speed information, of which the horizontal axis indicates the time in seconds and the vertical axis indicates the speed in m/s. The figure is best viewed in color.
Fig. 6 shows the distribution of the displacement error in the three test scenes over the preview time for the different models. For each scene, the displacement error increases with the preview time for all three models; the CNN-FC and CNN-LSTM have similar results, while our full CNN-LSTM+State method performs best, with a lower error median and a more concentrated error distribution, especially at the 1.0 s mark in the turn left situation.
We also tested our model on a whole consecutive sequence of our dataset lasting about 20 minutes. The error results for the different models are shown in Fig. 7 and the driving path is visualized in Fig. 8. It can be seen that the CNN-LSTM+State model plans more accurate positions than the other two methods, especially for the prediction of longitudinal positions.
B. Qualitative Evaluation
We further provide some qualitative analysis to investigate
the effectiveness of our model. As introduced earlier, we
use the positions recorded by GPS+INS data to form the
ground truth trajectories. However, they occasionally deviate
a little from the real driving path due to instability of GPS
signals. In these scenes, our method can plan much smoother
trajectories than the ground truth, as shown in column (a) of Fig. 10.
When turning left or right at an intersection, our model can plan a trajectory very close to that of a human driver, as shown in Fig. 9. For the keep straight situation, we mainly concentrate on the following three specific tasks:
(A) lane-keeping on curved roads, shown in column (b) of Fig. 10. Note that the curvature of the road in this scene is smaller than when turning left or right.
(B) changing lanes for collision avoidance, shown in columns (c1)-(c2) of Fig. 10.
(C) slowing down for a slow car, shown in columns (d1)-(d2) of Fig. 10.
In situation (A), the planning performance is still reliable
when the road becomes winding. In situation (B), the model
plans a trajectory to bypass the obstacle ahead and keep
going straight. In situation (C), when the vehicle drives near
a slow or parked car, the model plans a low speed to ensure
safety.
V. CONCLUSION
To imitate the driving behavior of human drivers as well
as avoid the deficiencies of end-to-end control methods,
we proposed an end-to-end planning method achieved by
a CNN-LSTM network architecture. We used the front-view
image stream and state information in the past 1.5 seconds as
input, to plan a possible collision-free trajectory containing
speed and lateral/longitudinal positions 3.0 seconds in the
future. A CNN module was used to extract visual features
for comprehending the environment, and an LSTM module
was utilized to consider temporal dependencies. The results
suggest that our model plans trajectories close to ground truth
when turning at various intersections or keeping straight.
We also tested our model on three sub-tasks of keeping straight, where the generated future trajectories could guide the vehicle to bypass a stopped vehicle ahead for collision avoidance, drive on curved roads and slow down for a parked car.
Some limitations still exist in this work: we ignored the treatment of traffic lights, and we did not consider the planning performance under different weather or lighting conditions. To further improve the performance of our model and apply it in practical applications, more labelled driving data could be utilized for training. Other sensors, such as a thermal imaging camera, could also be added to enlarge the perceptual range [28] and help the network achieve better planning results. Besides, the restored 3-D information about the surrounding road agents could also be valuable for the planning network, which we leave to future work.
REFERENCES
[1] Y. Sun, M. Liu, and M. Q.-H. Meng, “Improving rgb-d slam in dynamic environments: A motion removal approach,” Robotics and Autonomous Systems, vol. 89, pp. 110–122, 2017.
[2] T. Sun, Y. Sun, M. Liu, and D.-Y. Yeung, “Movable-object-aware
visual slam via weakly supervised semantic segmentation,” arXiv
preprint arXiv:1906.03629, 2019.
[3] Y. Sun, M. Liu, and M. Q.-H. Meng, “Motion removal for reliable rgb-d slam in dynamic environments,” Robotics and Autonomous Systems, vol. 108, pp. 115–128, 2018.
[4] C. Wang, W. Chi, Y. Sun, and M. Q.-H. Meng, “Autonomous Robotic
Exploration by Incremental Road Map Construction,” IEEE Transac-
tions on Automation Science and Engineering, pp. 1–12, 2019.
[5] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, “A survey of motion planning and control techniques for self-driving urban vehicles,” IEEE Transactions on Intelligent Vehicles, vol. 1, no. 1, pp. 33–55, 2016.
[6] E. Rehder, J. Quehl, and C. Stiller, “Driving like a human: Imitation
learning for path planning using convolutional neural networks,” in
International Conference on Robotics and Automation Workshops,
2017.
[7] P. Wu, Y. Cao, Y. He, and D. Li, “Vision-based robot path planning
with deep learning,” in International Conference on Computer Vision
Systems. Springer, 2017, pp. 101–111.
[8] W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision-
making for autonomous vehicles,” Annual Review of Control, Robotics,
and Autonomous Systems, vol. 1, pp. 187–210, 2018.
[9] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., “End to end learning for self-driving cars,” arXiv preprint arXiv:1604.07316, 2016.
[10] J. Jhung, I. Bae, J. Moon, T. Kim, J. Kim, and S. Kim, “End-
to-end steering controller with cnn-based closed-loop feedback for
autonomous vehicles,” in 2018 IEEE Intelligent Vehicles Symposium
(IV). IEEE, 2018, pp. 617–622.
[11] H. M. Eraqi, M. N. Moustafa, and J. Honer, “End-to-end deep learning
for steering autonomous vehicles considering temporal dependencies,”
arXiv preprint arXiv:1710.03804, 2017.
[12] F. Codevilla, M. Miiller, A. López, V. Koltun, and A. Dosovitskiy,
“End-to-end driving via conditional imitation learning,” in 2018 IEEE
International Conference on Robotics and Automation (ICRA). IEEE,
2018, pp. 1–9.
[13] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving
models from large-scale video datasets,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2017, pp.
2174–2182.
[14] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 year, 1000 km:
The oxford robotcar dataset,” The International Journal of Robotics
Research, vol. 36, no. 1, pp. 3–15, 2017.
[15] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural
network,” in Advances in neural information processing systems, 1989,
pp. 305–313.
[16] W. Gao, D. Hsu, W. S. Lee, S. Shen, and K. Subramanian, “Intention-net: Integrating planning and deep learning for goal-directed autonomous navigation,” arXiv preprint arXiv:1710.05627, 2017.
[17] M. Bergqvist and O. Rödholm, “Deep Path Planning Using Images
and Object Data,” Master’s thesis, Chalmers University of Technology,
Gothenburg, Sweden, 2018.
[18] M. Bansal, A. Krizhevsky, and A. Ogale, “Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst,” arXiv preprint arXiv:1812.03079, 2018.
[19] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[20] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in
Proceedings of the IEEE international conference on computer vision,
2017, pp. 2961–2969.
[21] W. Abdulla, “Mask r-cnn for object detection and instance segmen-
tation on keras and tensorflow,” https://github.com/matterport/Mask_
RCNN, 2017.
[22] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka, “3d bounding
box estimation using deep learning and geometry,” in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition,
2017, pp. 7074–7082.
[23] cersar, “3d_detection,” https://github.com/cersar/3D_detection, 2019.
[24] J.-R. Chang and Y.-S. Chen, “Pyramid stereo matching network,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2018, pp. 5410–5418.
[25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-
tion,” arXiv preprint arXiv:1412.6980, 2014.
[26] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool, “You’ll never
walk alone: Modeling social behavior for multi-target tracking,” in
Computer Vision, 2009 IEEE 12th International Conference on. IEEE,
2009, pp. 261–268.
[27] N. Hogan and D. Sternad, “Sensitivity of smoothness measures to movement duration, amplitude, and arrests,” Journal of Motor Behavior, vol. 41, no. 6, pp. 529–534, 2009.
[28] Y. Sun, W. Zuo, and M. Liu, “Rtfnet: Rgb-thermal fusion network
for semantic segmentation of urban scenes,” IEEE Robotics and
Automation Letters, 2019.