Conference PaperPDF Available

Tactical Decision Making for Lane Changing with Deep Reinforcement Learning

  • Facebook AI Research
Tactical Decision Making for Lane Changing
with Deep Reinforcement Learning
Mustafa Mukadam
Georgia Institute of Technology, USA
Akansel Cosgun
Honda Research Institute, USA
Alireza Nakhaei
Honda Research Institute, USA
Kikuo Fujimura
Honda Research Institute, USA
In this paper we consider the problem of autonomous lane changing for self
driving cars in a multi-lane, multi-agent setting. We present a framework that
demonstrates a more structured and data efficient alternative to end-to-end complete
policy learning on problems where the high-level policy is hard to formulate
using traditional optimization or rule based methods but well designed low-level
controllers are available. The framework uses deep reinforcement learning solely
to obtain a high-level policy for tactical decision making, while still maintaining a
tight integration with the low-level controller, thus getting the best of both worlds.
This is possible with Q-masking, a technique with which we are able to incorporate
prior knowledge, constraints and information from a low-level controller, directly
in to the learning process thereby simplifying the reward function and making
learning faster and efficient. We provide preliminary results in a simulator and
show our approach to be more efficient than a greedy baseline, and more successful
and safer than human driving.
1 Introduction and Related Work
In recent years there has been a growing interest in self driving cars. Building such autonomous
systems has been an active area of research [
] due to a high potential in leading to more efficient
road networks that are much safer for the passengers. One of the fundamental skills a self driving
car must possess is an ability to perform efficient lane change maneuvers in a safe manner. This is
especially critical in a multi-lane highway setting in the presence fast moving traffic (as shown in
Figure 1a), since if anything goes wrong, at best it leads to congestion and at worst to accidents [
Reasoning about interactions with other agents and forming an efficient long term strategy while
maintaining safety makes this problem challenging and complex.
Prior work on lane changing consists of a diverse set of approaches with early work considering vision
based control [
]. Other methods track trajectories [
], use fuzzy control [
], model predictive
control [
], generate a steering command with adaptive control [
], consider planning [
], and
mixed logic programming [
]. However majority of the prior work considers the problem only
from a local perspective, i.e. changing between adjacent lanes while avoiding the few neighboring
cars. There is no notion of a goal, like reaching an exit, which would require reasoning about long
term decisions on a strategic level when present on a multi-lane highway among traffic. Formulating
a control or optimization based problem to handle such a scenario is not straight forward, would
require a lot of hand design and tuning, may work only on a subset of cases, and would generally
be intractable. The primary roadblock is that there is no abstraction of what the overall ideal policy
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
oo oo
(b) (c)
Figure 1: (a) Typical multi-lane highway (California, USA). (b) 5 lane highway in the simulator with traffic
(red), where the ego car (green) is tasked with reaching the exit 1.5km away in the least amount of time. (c) Ego
car’s visibility in the simulator (left) and the corresponding occupancy grid (right) with the ego car at its center.
should look like, only the ideal outcome is know: reaching the exit safely and efficiently (in least
amount of time).
Reinforcement learning provides a way to learn arbitrary policies giving specific goals. In recent years
learning based methods have been used to address similar or related problems, like learning from
human driving [
], inverse reinforcement learning [
], end-to-end methods that map perception
inputs (mainly images) directly to control commands [
], and methods that understand
the scene via learning to make driving decisions [19, 20].
With the recent success of deep reinforcement learning [
], in this work we investigate its use and
place in solving the autonomous lane changing problem. In general learning a full policy than can
reason about tactical decisions while at same time address continuous control and collision avoidance
can be exceedingly difficult with large notorious to train networks. An ideal approach to provide
the right balance would be to learn the hard to define high-level tactical policy while relying on
established optimization or rule based method for low-level control. Along these lines we propose
a framework that tightly integrates the two paradigms using Q-masking that essentially forces the
agent to explore only the required states and learns a subset of the space of Q-values. Within this
framework we focus the raw deep learning power to only obtain a high-level decision making policy
for tactical lane changing. By incorporating prior knowledge about the system, constraints of the
problem and information from the low-level controller using Q-masking, we can heavily simplify the
reward function and make the overall learning faster and efficient. Basing low-level decisions on a
controller we are able to completely eliminate collisions during training or testing, which makes it
possible to perform training directly on real systems. We present preliminary benchmarks and show
our framework can outperform a greedy baseline in efficiency and humans driving in the simulator in
safety and success.
2 Problem Setup
We consider the problem of autonomous lane changing in a multi-lane-multi-agent setting and set up
the problem environment in the commonly used traffic simulator, SUMO [22].
The simulation environment consists of a
lane highway as shown in Figure 1b
with minimum and maximum speed limits of
m/s and
m/s respectively that all cars must
The traffic density is generated using SUMO, where each lane can be assigned a probability
of emitting a car at the start position every second, with a random start speed and a random
target speed that it will stay near. These cars use a car follow model [
] and are controlled by SUMO
to avoid collisions with each other. For the sake of simplicity we do not allow any traffic cars to
change lanes.
Ego car:
The ego car is tasked with reaching the exit at the right most lane,
km from the start
position, in minimum amount of time (i.e. maximum average speed), while respecting the speed
limits and avoiding collisions with traffic. Missing the exit would be considered a failure. Our aim
is to learn a high-level tactical decision making strategy such that the ego car makes efficient lane
change maneuvers while relying on the low-level controller for collision free lane changing between
adjacent lanes. The ego car’s performance is evaluated on success rate and average speed (i.e. time to
reach the exit) and is compared to a greedy baseline and humans driving in the simulator. We set up
the ego car in SUMO such that the simulator has no control over it, and the other cars do not avoid
any collisions with the ego car. This allows our approach to fully govern the behavior of the ego car.
3 Deep RL for Tactical Decision Making
We train the ego car using deep reinforcement learning to learn a high-level policy that can make
tactical decisions. For lane change maneuvers, we break down the high level decisions in to the
following 5 actions that can be taken at any time step: (1)
: no-op i.e. take no action and maintain
current speed, (2)
: accelerate for a constant amount this time step, (3)
: decelerate for a constant
amount this time step, (4)
: make a left lane change, and (5)
: make a right lane change. The
network learns to map state inputs to Q-values [21] for these actions, as shown in Figure 2a.
The inputs to the network is the state of the ego car, which consists of internal and external information.
Scalar inputs, velocity
, lane
, and distance to goal
, are chosen to represent internal information
all of which are scaled between 0 and 1. Velocity can varied between 0 and 1 i.e.
respectively, lanes are numbered 0 to 1 from right to left, and distance to goal decreases from the start
position of the car at 1 to the exit at 0. A binary occupancy grid around a chosen visibility region
of the ego car (with the ego car in the center) is used to input external information. An example
occupancy grid of size 42
5 is shown in Figure 1c where the visibility of the car is 50m in front and
back with a longitudinal discretization of 2.5m per cell (all cars are 5m in length), and 2 lanes to the
left and right with a one cell per lane discretization in the lateral direction. To capture relative motion
of the traffic we also input a history of the occupancy grid from previous time steps, as separate
The network architecture we use is shown in Figure 2a. The occupancy grid with history is passed
though a single convolution layer, flattened out and then concatenated with the output of a fully
connected layer with the scalar inputs. The concatenation is passed through a fully connected layer to
give the final output of 5 Q-values associated with the 5 tactical actions. A common practice is to
use a max or soft-max operation on the Q-values to choose an action [
]. In this work we introduce
a technique called Q-masking, which is injected between the Q-values and the max operation (see
Section 4). Using this technique we are able to incorporate prior information about the problem that
the agent does not need to learn from scratch through exploration. We can incorporate low-level
optimization, control or rule based information, like generating trajectories to make an adjacent
certain lane change happen while avoiding collisions. The combined effect is that learning becomes
faster and more efficient while the reward function is massively simplified. We use the following
sparse reward function,
rT=+10 l= 0; exit reached
10 ×l l 6= 0; exit missed (1)
is the lane in which the ego car is when it reaches the target distance
from the start position.
A positive terminal reward is given for success (reaching the exit) and an increasingly negative
terminal reward the further the ego car ends up away from the exit lane. Note the simplicity of the
reward and the absence of competing components like maintaining speed limits or avoid collisions. A
discount factor along with this reward encourages the agent to reach the exit in the smallest number
of time steps i.e. maintaining a higher average speed.
During training, actions are taken in an -greedy manner and is annealed. To train the network we
simulate full trajectories until the terminal state. Two experience buffers are maintained: good and
bad. All transitions i.e. state, action and reward tuples from successful trajectories are saved in the
good buffer while transition from failed trajectories are saved in the bad buffer. For any transition the
expected reward is back calculated from the terminal reward,
rt+γyt+1 otherwise (2)
&''()*+',- ./01
CD;*E(> F
:&+F9/* 0+9F
J/0&/-I+&KE>1. >
(a) (b)
Figure 2: (a) Our framework consisting of the deep Q network and the low-level module that interface together
using Q-masking. (b) A GUI displaying ego car’s location and speed, and available actions.
is the discount factor. While any trajectory is being simulated at each time step the network
is optimized with the following loss function,
L(θ) = ytQ(st, at, θ)2(3)
using a mini batch of transitions equally sampled from the good and the bad buffer. The two
separate buffers help maintain a decent exposure to successful executions when the exploration might
constantly lead to failed trajectories thus avoiding the network getting stuck in a local minima.
4 Q-masking
We use deep Q-learning to learn a policy to make decisions on a tactical level. In a typical Q-learning
network, a mapping between states and Q-values associated to each action is learned. Then a max (or
soft max) operator can be applied on the output layer of Q-values to pick the best action. In this work
we propose a technique, Q-masking, in the form of a mask that is applied on the output Q-values
before taking the max operation as shown in Figure 2a. The direct effect of this is that when taking
the max operation to choose the best action, we consider the Q-values associated with only a subset
of actions, which are dictated by a lower-level module.
Given a state the lower-level module can restrict (or mask off) any set of actions that the agent does
not need to explore or learn from their outcomes. For example, in the lane changing problem if the
ego car is say in the left most lane, then taking a left action will result in getting off the highway.
Therefore, it can put a mask on the Q-value associated with the left action such that it is never
selected in such a state. This allows us to incorporate prior knowledge about the system (i.e. highway
shoulders) directly in to the learning process, meaning that we do not need to set up a negative reward
for getting off the highway, thus simplifying the reward function. Also, since the agent does not
explore these states learning itself becomes faster and more efficient. What the agent ends up learning
is a subset of the actual space of Q-values that are necessary. We can also incorporate constraints on
the system in a similar manner that provides similar benefits. For example, if the ego car is driving
at the maximum speed then the accelerate action is masked or if it is at the minimum speed then
decelerate action is masked. Then the agent never needs to spend time learning the speed limits of
the highway.
As discussed in Section 1 many optimization or rule based methods are available to generate a
sequence of low-level actions that can make a car change between adjacent lanes while avoiding
collisions. However, these methods are generally not designed to handle long term decision making
and reasoning about lane change maneuvers in a multi-lane multi-agent setting. In turn modeling and
training an end-to-end system to learn a complete policy that can generate collision free trajectories
while reasoning about tactical level decisions is hard. We use Q-masking as a interface between the
two ideologies and leverage deep learning to exclusively learn a high-level decision making policy
and relying on the low-level module to provide control policies to change between adjacent lanes in a
collision free manner. We can incorporate any optimization or rule based method in the low-level
module such that given a state it masks off actions that can result in impossible to perform lane
Table 1: Benchmark results of our approach against a greedy baseline and human driving.
Ours Baseline Human
vislat =1 vislat=2
Avg Speed (m/s) 26.50 26.27 22.34 29.16
Success (%) 84 91 100 70
Collision (%) 0 0 0 24
changes or in collisions. Then the learning process truly focuses on learning only the high level
strategy. Since collisions are never allowed during training or testing the reward function does not
need to account for it.
In our implementation we incorporate the highway shoulders information and the speed limits in the
lower-level module. We also include a rule based time to collision (TTC) method [
] (we set the
threshold as 10s) that checks for collisions given the state against all actions and masks off those
actions that lead to collision.
5 Results
To evaluate the performance of our framework we compare the network against a greedy baseline
policy and humans driving in the simulator. For this benchmark we set the problem parameters as
vmin =
20m/s and
vmax =
30m/s. The traffic density is set to
Plane =
{0.3, 0.2, 0.2, 0.15, 0.1} for lane 1 to 5 (from right to left) with {20, 22, 25, 27, 29} m/s as the target
speed in those lanes. These settings give faster sparse traffic in the left most lanes, and slower dense
traffic in the right most lanes. Such traffic conditions ensured that non-trivial lane change maneuvers,
like merging and overtaking would be necessary to remain efficient. For any method we restrict
the action space to the tactical decisions described in Section 3 with a constant acceleration and
deceleration rate of 2m/s
. We aggregate the results across 100 trials for each method and record the
success rate of reaching the exit, and the average speed of the ego car.
The network is trained for 10k episodes, with time step of 0.4s, discount factor
0.99 and
-greedy exploration, where
is annealed from 1.0 to 0.1 for 80% of the episodes. For each episode
the ego car is started from the zero position in a random lane with a random speed between the speed
limits. To investigate the effects of visibility of the ego car we train and benchmark two networks
with lateral visibility
of 1 and 2 lanes to the right and to the left while the longitudinal visibility
is fixed at 50m in front and back. For the occupancy grid we use a 2.5m per cell resolution in the
longitudinal direction and a history of 3 previous time steps to give the grid input sizes of 42
and 42×5×4 for the two networks.
A greedy baseline tactical policy for the ego car prioritizes making a right lane change
until it is in the correct lane. Then it tries to go as fast as possible while staying within the speed limits
and not colliding with any car in the front. Same time step of 0.4s is set and to keep comparisons fair
the high-level baseline policy is allowed to access the low-level module (see Section 4) as an oracle,
to reason about speed limits, highway shoulders and collisions.
We aggregated data on 10 human subjects that drove the ego car in the simulator for
10 trials each. As shown in Figure 2b a GUI was set up to indicate the location of the ego car on
the highway relative to the start and exit, its speed and the available actions. The time step of the
simulator was reduced to 0.1s for smoother control. The subjects were asked to drive naturally and
were told that the primary and secondary objectives were to reach the exit and get there as fast as
possible respectively. They were allowed to learn to use the simulator for a few trials before the
data was recorded. Originally, we found that the subjects did not feel comfortable driving with the
low-level module on (specifically the TTC component used for collision avoidance, see Section 4)
and felt like they had to fight against the simulator or weren’t being allowed to take actions that they
considered were safe and possible. So we conducted the experiments with the TTC component of the
low-level module turned off, however regaining what felt like relinquished control actually resulted in
Figure 3: Examples of overtaking (top row) and merging (bottom row) behaviors, from left to right.
many collision as shown in Table 1. This warrants further study and is beyond the scope of this paper.
We collected the success rate, average speed and also collision rate (since collision could happen).
The benchmark results are summarized in Table 1. By design the baseline policy is always successful
in reaching the exit but is very inefficient since it never tries to apply any lane change maneuvers
when stuck behind slow moving traffic. On the other hand the humans inclined to drive faster were
overall much less successful, however majority of the failures were due to collisions and not missing
the exit. Our approach is able to achieve a much higher average speed than the baseline, is more
successful than the humans, and never results in collisions. An improvement in success rate is seen
with increased visibility. The better performance is attained by the network having learned to make
interesting and human-like lane change decision, which result in emergent behaviors like merging
and overtaking (see Figure 3).
These preliminary results show the applicability of deep reinforcement learning in addressing tactical
decision making problems. Our approach is able to strike the right synergy between learning a high-
level policy and using a low-level controller. It hold promise for further investigation in improving
performance with different (deeper) network architectures or applying it on other problem domains
with a similar construction, and on real systems. Further improvements can be made to make the
set up more realistic by considering occlusions and also introducing uncertainty with a probabilistic
occupancy grid.
6 Conclusion
We proposed a framework that leverages the strengths of deep reinforcement learning for high-level
tactical decision making, and traditional optimization or rule-based methods for low-level control, by
striking the right balance between both domains. At the heart of this framework lies, Q-masking, that
provides an interface between the two levels. Using Q-masking we can incorporate prior knowledge,
constraints about the system and information from the lower-level controller, directly in to the training
of the network, simplifying the reward function and making learning faster and more efficient, while
completely eliminating collisions during training or testing. We applied our framework on the
problem of autonomous lane changing for self driving cars, where the network learned a high-level
tactical decision making policy. We presented preliminary results and benchmarked our approach
against a greedy baseline and humans driving in the simulator and showed that our approach is able
to outperform them both on different metrics with a more efficient and much safer policy.
C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark, J. Dolan, D. Duggins, T. Galatali,
C. Geyer, et al., “Autonomous driving in urban environments: Boss and the urban challenge,” Journal of
Field Robotics, vol. 25, no. 8, pp. 425–466, 2008.
J. Ziegler, P. Bender, T. Dang, and C. Stiller, “Trajectory planning for bertha—a local, continuous method,
in Intelligent Vehicles Symposium Proceedings, 2014 IEEE, pp. 450–457, IEEE, 2014.
A. Cosgun, L. Ma, J. Chiu, J. Huang, M. Demir, A. M. Anon, T. Lian, H. Tafish, and S. Al-Stouhi, “Towards
full automated drive in urban environments: A demonstration in gomentum station, california,” in IEEE
Intelligent Vehicles Symposium, IV 2017, Los Angeles, CA, USA, June 11-14, 2017, pp. 1811–1818, 2017.
H. Jula, E. B. Kosmatopoulos, and P. A. Ioannou, “Collision avoidance analysis for lane changing and
merging,IEEE Transactions on vehicular technology, vol. 49, no. 6, pp. 2295–2308, 2000.
C. J. Taylor, J. Košecká, R. Blasi, and J. Malik, A comparative study of vision-based lateral control
strategies for autonomous highway driving,The International Journal of Robotics Research, vol. 18, no. 5,
pp. 442–453, 1999.
J. E. Naranjo, C. Gonzalez, R. Garcia, and T. De Pedro, “Lane-change fuzzy control in autonomous
vehicles for the overtaking maneuver,” IEEE Transactions on Intelligent Transportation Systems, vol. 9,
no. 3, pp. 438–450, 2008.
Z. Shiller and S. Sundar, “Emergency lane-change maneuvers of autonomous vehicles,” Journal of Dynamic
Systems, Measurement, and Control, vol. 120, no. 1, pp. 37–44, 1998.
C. Hatipoglu, U. Ozguner, and K. A. Redmill, “Automated lane change controller design,” IEEE transac-
tions on intelligent transportation systems, vol. 4, no. 1, pp. 13–22, 2003.
P. Falcone, F. Borrelli, J. Asgari, H. E. Tseng, and D. Hrovat, “Predictive active steering control for
autonomous vehicle systems,IEEE Transactions on control systems technology, vol. 15, no. 3, pp. 566–
580, 2007.
P. Petrov and F. Nashashibi, “Adaptive steering control for autonomous lane change maneuver,” in
Intelligent Vehicles Symposium (IV), 2013 IEEE, pp. 835–840, IEEE, 2013.
F. You, R. Zhang, G. Lie, H. Wang, H. Wen, and J. Xu, “Trajectory planning and tracking control for
autonomous lane change maneuver based on the cooperative vehicle infrastructure system,Expert Systems
with Applications, vol. 42, no. 14, pp. 5932–5946, 2015.
Y. Du, Y. Wang, and C.-Y. Chan, “Autonomous lane-change controller via mixed logical dynamical,” in
Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on, pp. 1154–1159,
IEEE, 2014.
H. Tehrani, Q. H. Do, M. Egawa, K. Muto, K. Yoneda, and S. Mita, “General behavior and motion model
for automated lane change,” in Intelligent Vehicles Symposium (IV), 2015 IEEE, pp. 1154–1159, IEEE,
S. Sharifzadeh, I. Chiotellis, R. Triebel, and D. Cremers, “Learning to drive using inverse reinforcement
learning and deep q-networks,” arXiv preprint arXiv:1612.03653, 2016.
U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun, “Off-road obstacle avoidance through end-to-end
learning,” in Advances in neural information processing systems, pp. 739–746, 2006.
J. Koutník, G. Cuccu, J. Schmidhuber, and F. Gomez, “Evolving large-scale neural networks for vision-
based torcs,” 2013.
M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort,
U. Muller, J. Zhang, et al., “End to end learning for self-driving cars,arXiv preprint arXiv:1604.07316,
H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving models from large-scale video
datasets,” arXiv preprint arXiv:1612.01079, 2016.
C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “Deepdriving: Learning affordance for direct perception in
autonomous driving,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2722–
2730, 2015.
S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for au-
tonomous driving,arXiv preprint arXiv:1610.03295, 2016.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller,
A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,Nature,
vol. 518, no. 7540, pp. 529–533, 2015.
D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, “Recent development and applications of sumo-
simulation of urban mobility,International Journal On Advances in Systems and Measurements, vol. 5,
no. 3&4, pp. 128–138, 2012.
S. Krauß, P. Wagner, and C. Gawron, “Metastable states in a microscopic model of traffic flow,Physical
Review E, vol. 55, no. 5, p. 5597, 1997.
[24] R. van der Horst and J. Hogema, Time-to-collision and collision avoidance systems. na, 1993.
... Mircheveska et al. [25] proposed a reinforcement learning method applying random forests for autonomous driving in highway scenarios. Mukadam et al. [26] proposed a deep Q-learning-based method to solve the problem of automatic vehicle lane changing in a multi-lane, multi-vehicle environment. The proposed method takes the vehicle state and the surrounding environment state as network inputs and the output is a score of five driving behaviors. ...
Full-text available
Intersection scenarios are one of the most complex and high-risk traffic scenarios. Therefore, it is important to propose a vehicle driving decision algorithm for intersection scenarios. Most of the related studies have focused on considering explicit collision risks while lacking consideration for potential driving risks. Therefore, this study proposes a deep-reinforcement-learning-based driving decision algorithm to address these problems. In this study, a non-deterministic vehicle driving risk assessment method is proposed for intersection scenarios and introduced into a learning-based intelligent driving decision algorithm. In addition, this study proposes an attention network based on state information. In this study, a typical intersection scenario was constructed using simulation software, and experiments were conducted. The experimental results show that the algorithm proposed in this paper can effectively derive a driving strategy with both driving efficiency and driving safety in the intersection driving scenario. It is also demonstrated that the attentional neural network designed in this study helps intelligent vehicles to perceive the surrounding environment more accurately, improves the performance of intelligent vehicles, as well as accelerates the convergence speed.
... In terms of the learning-based methods, the preferred approach seems to be the variations of Reinforcement Learning techniques applied in a simulated environment [7], [8], [9], [10], [11]. These approaches, although seeming to work well in simulation, have concerns regarding real-world implementation due to the large amount of training data that they require, the exploration of unsafe behaviors during training, and a general inability to handle edge cases. ...
Full-text available
This paper proposes a hierarchical autonomous vehicle navigation architecture, composed of a high-level speed and lane advisory system (SLAS) coupled with low-level trajectory generation and trajectory following modules. Specifically, we target a multi-lane highway driving scenario where an autonomous ego vehicle navigates in traffic. We propose a novel receding horizon mixed-integer optimization based method for SLAS with the objective to minimize travel time while accounting for passenger comfort. We further incorporate various modifications in the proposed approach to improve the overall computational efficiency and achieve real-time performance. We demonstrate the efficacy of the proposed approach in contrast to the existing methods, when applied in conjunction with state-of-the-art trajectory generation and trajectory following frameworks, in a CARLA simulation environment.
... The primitive element representation is incapable of maintaining permutation invariance and independence from the number of vehicles, whereas the occupancy grid representation can eliminate these constraints [164]. Mukadam et al. [165] utilize a history of binary occupancy grids to represent external environment information and concentrate it with the internal state as input. Numerous techniques [166], [167] extend the occupancy grid map with additional channels for other characteristics, such as velocities, headings, lateral displacements, and so on. ...
Grid-centric perception is a crucial field for mobile robot perception and navigation. Nonetheless, grid-centric perception is less prevalent than object-centric perception for autonomous driving as autonomous vehicles need to accurately perceive highly dynamic, large-scale outdoor traffic scenarios and the complexity and computational costs of grid-centric perception are high. The rapid development of deep learning techniques and hardware gives fresh insights into the evolution of grid-centric perception and enables the deployment of many real-time algorithms. Current industrial and academic research demonstrates the great advantages of grid-centric perception, such as comprehensive fine-grained environmental representation, greater robustness to occlusion, more efficient sensor fusion, and safer planning policies. Given the lack of current surveys for this rapidly expanding field, we present a hierarchically-structured review of grid-centric perception for autonomous vehicles. We organize previous and current knowledge of occupancy grid techniques and provide a systematic in-depth analysis of algorithms in terms of three aspects: feature representation, data utility, and applications in autonomous driving systems. Lastly, we present a summary of the current research trend and provide some probable future outlooks.
... Bansal et al. [25] presented a ChauffeurNet scheme, which outputs a driving trajectory that is consumed by a motion control module that translates it into steering and acceleration. Mukadam et al. [26] designed a tactical decision-making for autonomous lane change with deep RL, which includes a Q-masking module to provide constraints, prior knowledge and information of a low-level controller. ...
... Paxton et al. [83] use reinforcement learning to learn both low-level control policies as well as high-level goal-directed "option" policies to search through with MCTS; they use extracted feature inputs and a single hidden layer. Mukadam et al. [84] use deep Q-learning to find a high-level tactical lane-changing strategy that determines when to elect lane changing or acceleration actions to be performed by a low-level controller. ...
In this thesis we propose novel estimation techniques for localization and planning problems, which are key challenges in long-term autonomy. We take inspiration in our methods from non-parametric estimation and use tools such as kernel density estimation, non-linear least-squares optimization, binary masking, and random sampling. We show that these methods, by avoiding explicit parametric models, outperform existing methods that use them. Despite the seeming differences between localization and planning, we demonstrate in this thesis that the problems share core structural similarities. When real or simulation-sampled measurements are expensive, noisy, or high variance, non-parametric estimation techniques give higher-quality results in less time. We first address two localization problems. In order to permit localization with a set of ad hoc-placed radios, we propose an ultra-wideband (UWB) graph realization system to localize the radios. Our system achieves high accuracy and robustness by using kernel density estimation for measurement probability densities, by explicitly modeling antenna delays, and by optimizing this combination with a non-linear least squares formulation. Next, in order to then support robotic navigation, we present a flexible system for simultaneous localization and mapping (SLAM) that combines elements from both traditional dense metric SLAM and topological SLAM, using a binary "masking function" to focus attention. This masking function controls which lidar scans are available for loop closures. We provide several masking functions based on approximate topological class detectors. We then examine planning problems in the final chapter and in the appendix. In order to plan with uncertainty around multiple dynamic agents, we describe Monte-Carlo Policy-Tree Decision Making (MCPTDM), a framework for efficiently computing policies in partially-observable, stochastic, continuous problems. MCPTDM composes a sequence of simpler closed-loop policies and uses marginal action costs and particle repetition to improve cost estimates and sample efficiency by reducing variance. Finally, in the appendix we explore Learned Similarity Monte-Carlo Planning (LSMCP), where we seek to enhance the sample efficiency of partially observable Monte Carlo tree search-based planning by taking advantage of similarities in the final outcomes of similar states and actions. We train a multilayer perceptron to learn a similarity function which we then use to enhance value estimates in the planning. Collectively, we show in this thesis that non-parametric methods promote long-term autonomy by reducing error and increasing robustness across multiple domains.
... However, the performance of such methods for complex and high dimensional driving situations such as dense traffic is limited because of the curse of dimensionality. The combination of Q-learning with neural networks addresses this problem and paves the path for using Q-learning for complex DDM applications [16]. Early implementations of Deep Q-Networks (DQN) was used for lane-keeping control of a vehicle in a race track [17], [18]. ...
Full-text available
Autonomous driving decision-making is a challenging task due to the inherent complexity and uncertainty in traffic. For example, adjacent vehicles may change their lane or overtake at any time to pass a slow vehicle or to help traffic flow. Anticipating the intention of surrounding vehicles, estimating their future states and integrating them into the decision-making process of an automated vehicle can enhance the reliability of autonomous driving in complex driving scenarios. This paper proposes a Prediction-based Deep Reinforcement Learning (PDRL) decision-making model that considers the manoeuvre intentions of surrounding vehicles in the decision-making process for highway driving. The model is trained using real traffic data and tested in various traffic conditions through a simulation platform. The results show that the proposed PDRL model improves the decision-making performance compared to a Deep Reinforcement Learning (DRL) model by decreasing collision numbers, resulting in safer driving.
Most conventional unmanned vehicle control algorithms require human adjustment of parameters and design of precise rules, thus failing to adapt quickly to multiple situations when facing complex environments in the wild. To address these problems, this paper adopts an end-to-end deep reinforcement learning model based on proximal policy optimization algorithm to control the steering, speed and braking of an unmanned vehicle, allowing it to autonomously learn motion control strategies from perception map in un-known environments. A novel environment simulator which contains variable passable areas and obstacles is also proposed to support agents to achieve target reward. The proposed agent model has been proved to receive the highest reward over SAC and has the ability to overcome the complexity of the wild environment generated by the simulator.
Conference Paper
Full-text available
We propose an inverse reinforcement learning (IRL) approach using Deep Q-Networks to extract the rewards in problems with large state spaces. We evaluate the performance of this approach in a simulation-based autonomous driving scenario. Our results resemble the intuitive relation between the reward function and readings of distance sensors mounted at different poses on the car. We also show that, after a few learning rounds, our simulated agent generates collision-free motions and performs human-like lane change behaviour.
Full-text available
Autonomous driving is a multi-agent setting where the host vehicle must apply sophisticated negotiation skills with other road users when overtaking, giving way, merging, taking left and right turns and while pushing ahead in unstructured urban roadways. Since there are many possible scenarios, manually tackling all possible cases will likely yield a too simplistic policy. Moreover, one must balance between unexpected behavior of other drivers/pedestrians and at the same time not to be too defensive so that normal traffic flow is maintained. In this paper we apply deep reinforcement learning to the problem of forming long term driving strategies. We note that there are two major challenges that make autonomous driving different from other robotic tasks. First, is the necessity for ensuring functional safety - something that machine learning has difficulty with given that performance is optimized at the level of an expectation over many instances. Second, the Markov Decision Process model often used in robotics is problematic in our case because of unpredictable behavior of other agents in this multi-agent scenario. We make three contributions in our work. First, we show how policy gradient iterations can be used without Markovian assumptions. Second, we decompose the problem into a composition of a Policy for Desires (which is to be learned) and trajectory planning with hard constraints (which is not learned). The goal of Desires is to enable comfort of driving, while hard constraints guarantees the safety of driving. Third, we introduce a hierarchical temporal abstraction we call an "Option Graph" with a gating mechanism that significantly reduces the effective horizon and thereby reducing the variance of the gradient estimation even further.
Full-text available
We trained a convolutional neural network (CNN) to map raw pixels from a single front-facing camera directly to steering commands. This end-to-end approach proved surprisingly powerful. With minimum training data from humans the system learns to drive in traffic on local roads with or without lane markings and on highways. It also operates in areas with unclear visual guidance such as in parking lots and on unpaved roads. The system automatically learns internal representations of the necessary processing steps such as detecting useful road features with only the human steering angle as the training signal. We never explicitly trained it to detect, for example, the outline of roads. Compared to explicit decomposition of the problem, such as lane marking detection, path planning, and control, our end-to-end system optimizes all processing steps simultaneously. We argue that this will eventually lead to better performance and smaller systems. Better performance will result because the internal components self-optimize to maximize overall system performance, instead of optimizing human-selected intermediate criteria, e.g., lane detection. Such criteria understandably are selected for ease of human interpretation which doesn't automatically guarantee maximum system performance. Smaller networks are possible because the system learns to solve the problem with the minimal number of processing steps. We used an NVIDIA DevBox and Torch 7 for training and an NVIDIA DRIVE(TM) PX self-driving car computer also running Torch 7 for determining where to drive. The system operates at 30 frames per second (FPS).
Conference Paper
Full-text available
The idea of using evolutionary computation to train artificial neural networks, or neuroevolution (NE), for reinforcement learning (RL) tasks has now been around for over 20 years. However, as RL tasks become more challenging, the networks required become larger, as do their genomes. But, scaling NE to large nets (i.e. tens of thousands of weights) is infeasible using direct encodings that map genes one-to-one to network components. In this paper, we scale-up our compressed network encoding where network weight matrices are represented indirectly as a set of Fourier-type coefficients, to tasks that require very-large networks due to the high-dimensionality of their input space. The approach is demonstrated successfully on two reinforcement learning tasks in which the control networks receive visual input: (1) a vision-based version of the octopus control task requiring networks with over 3 thousand weights, and (2) a version of the TORCS driving game where networks with over 1 million weights are evolved to drive a car around a track using video images from the driver's perspective.
Conference Paper
This paper focuses on the design of a model based methodology to optimize lane change maneuvers based on the predictions of neighboring agents. In particular, the dynamics of vehicles are modeled as double integrators, and the lane change actions are indicated by boolean variables while neglecting the lane change dynamics. The objective of the optimal control is to minimize the travel time, maintain saftey distances between the target vehicle and neighboring ones, satisfy the operational constraints of the target vehicle, and comply with the speed limit. The control problem is formulated as a mixed logic dynamic system, and solved by Cplex, a commercial mixed integer optimization solver. Finally, the effectiveness of the proposed methodology is demonstrated by two simulated scenarios in this paper.
Lane change maneuver is one of the most conventional behaviors in driving. Unsafe lane change maneuvers are key factor for traffic accidents and traffic congestion. For drivers’ safety, comfort and convenience, advanced driver assistance systems (ADAS) are presented. The main problem discussed in this paper is the development of an autonomous lane change system. The system can be extended applied in intelligent vehicles in future. It solves two crucial issues – trajectory planning and trajectory tracking. Polynomial method was adopted for describing the trajectory planning issue. Movement of a host vehicle was abstracted into time functions. Moreover, collision detection was mapped into a parameter space by adopting infinite dynamic circles. The second issue was described by backstepping principle. According to the Lyapunov function, a tracking controller with global convergence property was verified. Both the simulations and the experimental results demonstrate the feasibility and effectiveness of the designed method for autonomous lane change.
The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.