Deep Reinforcement Learning
for Autonomous Driving: A Survey
B Ravi Kiran1, Ibrahim Sobh2, Victor Talpaert3, Patrick Mannion4,
Ahmad A. Al Sallab2, Senthil Yogamani5, Patrick Pérez6
Abstract—With the development of deep representation learn-
ing, the domain of reinforcement learning (RL) has become
a powerful learning framework now capable of learning com-
plex policies in high dimensional environments. This review
summarises deep reinforcement learning (DRL) algorithms and
provides a taxonomy of automated driving tasks where (D)RL
methods have been employed, while addressing key computa-
tional challenges in real world deployment of autonomous driving
agents. It also delineates adjacent domains such as behavior cloning, imitation learning and inverse reinforcement learning, which are related but are not classical RL algorithms. The role of simulators in training agents, and methods to validate, test and robustify existing solutions in RL, are also discussed.
Index Terms—Deep reinforcement learning, Autonomous driv-
ing, Imitation learning, Inverse reinforcement learning, Con-
troller learning, Trajectory optimisation, Motion planning, Safe
reinforcement learning.
I. INTRODUCTION
Autonomous driving (AD)1 systems consist of multiple perception-level tasks that have now achieved high precision on account of deep learning architectures. Besides perception, autonomous driving systems involve multiple tasks where classical supervised learning methods are no longer applicable. First, the prediction of the agent's action changes the future sensor observations received from the environment in which the autonomous driving agent operates, for example in the task of selecting an optimal driving speed in an urban area. Second, supervisory signals such as time to collision (TTC) or the lateral error w.r.t. the optimal trajectory of the agent represent the dynamics of the agent as well as the uncertainty in the environment. Such problems require defining a stochastic cost function to be maximized. Third, the agent is required to learn new configurations of the environment, as well as to predict an optimal decision at each instant while driving in its environment. This represents a high dimensional space: the number of unique configurations under which the agent and environment can be observed is combinatorially large. In all such scenarios we aim to solve a sequential decision process, which is formalized under the classical setting of Reinforcement Learning (RL), where the agent is required to learn and represent its environment as well as to act optimally at each instant [1]. The mapping that prescribes the optimal action at each instant is referred to as the policy.
1 Navya, Paris. ravi.kiran@navya.tech
2 Valeo Cairo AI team, Egypt. {ibrahim.sobh, ahmad.el-sallab}@valeo.com
3 U2IS, ENSTA Paris, Institut Polytechnique de Paris & AKKA Technologies, France. victor.talpaert@ensta.fr
4 School of Computer Science, National University of Ireland, Galway. patrick.mannion@nuigalway.ie
5 Valeo Vision Systems. senthil.yogamani@valeo.com
6 Valeo.ai. patrick.perez@valeo.com
1 For easy reference, the main acronyms used in this article are listed in the Appendix (Table IV).
In this review we cover the notions of reinforcement learning and provide a taxonomy of tasks where RL is a promising solution, especially in the domains of driving policy, predictive perception, path and motion planning, and low level controller design. We also focus our review on the different real world deployments of RL in the domain of autonomous driving, expanding our conference paper [2], since such deployments have not yet been reviewed in an academic setting. Finally, we motivate users by demonstrating the key computational challenges and risks when applying current day RL algorithms such as imitation learning and deep Q-learning, among others. We also note from the publication trends in figure 2 that the use of RL or Deep RL applied to autonomous driving or the self driving domain is an emergent field, owing to the recent adoption of RL/DRL algorithms in this domain, which leaves open multiple real world challenges in implementation and deployment. We address these open problems in Section VI.
The main contributions of this work can be summarized as
follows:
• Self-contained overview of RL background for the automotive community as it is not well known.
• Detailed literature review of using RL for different autonomous driving tasks.
• Discussion of the key challenges and opportunities for RL applied to real world autonomous driving.
The rest of the paper is organized as follows. Section II
provides an overview of components of a typical autonomous
driving system. Section III provides an introduction to rein-
forcement learning and briefly discusses key concepts. Section
IV discusses more sophisticated extensions on top of the basic
RL framework. Section V provides an overview of RL applica-
tions for autonomous driving problems. Section VI discusses
challenges in deploying RL for real-world autonomous driving
systems. Section VII concludes this paper with some final
remarks.
II. COMPONENTS OF AD SYSTEM
Figure 1 shows the standard blocks of an AD system, demonstrating the pipeline from sensor stream to control actuation.

Fig. 1. Standard components in a modern autonomous driving system pipeline listing the various tasks. The key problems addressed by these modules are Scene Understanding, Decision and Planning.

The sensor architecture in a modern autonomous driving system notably includes multiple sets of cameras,
radars and LIDARs as well as a GPS-GNSS system for
absolute localisation and inertial measurement Units (IMUs)
that provide 3D pose of the vehicle in space.
The goal of the perception module is the creation of an
intermediate level representation of the environment state (for
example bird-eye view map of all obstacles and agents) that is
to be later utilised by a decision making system that ultimately
produces the driving policy. This state would include lane
position, drivable zone, location of agents such as cars &
pedestrians, state of traffic lights and others. Uncertainties
in the perception propagate to the rest of the information
chain. Robust sensing is critical for safety; thus, using redundant sources increases confidence in detection. This is achieved by a combination of several perception tasks like semantic segmentation [3], [4], motion estimation [5], depth estimation [6], soiling detection [7], etc., which can be efficiently unified
into a multi-task model [8], [9].
A. Scene Understanding
This key module maps the abstract mid-level representation
of the perception state obtained from the perception module to
the high level action or decision making module. Conceptually, three tasks are grouped under this stage, as seen in figure 1: scene understanding, decision and planning. The scene understanding module aims to provide a higher level understanding of the scene; it is built on top of the algorithmic tasks of detection or localisation. By fusing heterogeneous sensor sources, it aims to robustly generalise to situations as the content becomes more abstract.
This information fusion provides a general and simplified
context for the Decision making components.
Fusion provides a sensor agnostic representation of the envi-
ronment and models the sensor noise and detection uncertain-
ties across multiple modalities such as LIDAR, camera, radar,
ultra-sound. This basically requires weighting the predictions
in a principled way.
B. Localization and Mapping
Mapping is one of the key pillars of automated driving [10].
Once an area is mapped, current position of the vehicle can
be localized within the map. The first reliable demonstrations
of automated driving by Google were primarily reliant on
localisation to pre-mapped areas. Because of the scale of
the problem, traditional mapping techniques are augmented
by semantic object detection for reliable disambiguation. In
addition, localised high definition maps (HD maps) can be
used as a prior for object detection.
C. Planning and Driving policy
Trajectory planning is a crucial module in the autonomous
driving pipeline. Given a route-level plan from HD maps or
GPS based maps, this module is required to generate motion-
level commands that steer the agent.
Classical motion planning ignores dynamics and differential
constraints while using translations and rotations required to
move an agent from source to destination poses [11]. A robotic
agent capable of controlling 6-degrees of freedom (DOF) is
said to be holonomic, while an agent with fewer controllable
DOFs than its total DOF is said to be non-holonomic. Classical algorithms such as the A* algorithm, based on Dijkstra's algorithm, do not work in the non-holonomic case for autonomous driving. Rapidly-exploring random trees (RRT) [12] are non-holonomic algorithms that explore the configuration space by random sampling and obstacle free path generation. There are various versions of RRT currently used for motion planning in autonomous driving pipelines.
D. Control
A controller defines the speed, steering angle and braking
actions necessary over every point in the path obtained from
a pre-determined map such as Google maps, or expert driving
recording of the same values at every waypoint. Trajectory
tracking in contrast involves a temporal model of the dynamics
of the vehicle viewing the waypoints sequentially over time.
Current vehicle control methods are founded in classical optimal control theory, which can be stated as the minimisation of a cost function defined over a set of states x(t) and control actions u(t), subject to the vehicle dynamics ẋ = f(x(t), u(t)). The control input is usually defined over a finite time horizon and restricted to a feasible state space x ∈ X_free [14].

Fig. 2. Trend of publications for the keywords 1. "reinforcement learning", 2. "deep reinforcement", and 3. "reinforcement learning" AND ("autonomous cars" OR "autonomous vehicles" OR "self driving"), showing academic publication trends from [13].

Velocity control is based on classical
methods of closed loop control such as PID (proportional-
integral-derivative) controllers, MPC (Model predictive con-
trol). PIDs aim to minimise a cost function consisting of three terms: the current error (proportional term), the effect of past errors (integral term), and the effect of future errors (derivative term). The family of MPC methods, in contrast, aims to stabilize the behavior of the vehicle while tracking the specified path [15]. A review of controllers, motion planning and learning based approaches for the same is provided in [16] for interested readers. Optimal control and
reinforcement learning are intimately related, where optimal
control can be viewed as a model based reinforcement learning
problem where the dynamics of the vehicle/environment are
modeled by well defined differential equations. Reinforcement
learning methods were developed to handle stochastic control problems as well as ill-posed problems with unknown rewards and state transition probabilities. Autonomous vehicle stochastic control is a large domain, and we advise readers to read the survey on this subject by the authors of [17].
III. REINFORCEMENT LEARNING
Machine learning (ML) is a process whereby a computer
program learns from experience to improve its performance at
a specified task [18]. ML algorithms are often classified under
one of three broad categories: supervised learning, unsuper-
vised learning and reinforcement learning (RL). Supervised
learning algorithms are based on inductive inference where
the model is typically trained using labelled data to perform
classification or regression, whereas unsupervised learning en-
compasses techniques such as density estimation or clustering
applied to unlabelled data. By contrast, in the RL paradigm
an autonomous agent learns to improve its performance at
an assigned task by interacting with its environment. Russell
and Norvig define an agent as “anything that can be viewed
as perceiving its environment through sensors and acting
upon that environment through actuators” [19]. RL agents
are not told explicitly how to act by an expert; rather an
agent’s performance is evaluated by a reward function R.
For each state experienced, the agent chooses an action and
receives an occasional reward from its environment based on
the usefulness of its decision. The goal for the agent is to
maximize the cumulative rewards received over its lifetime.
Fig. 3. A graphical decomposition of the different components of an RL algorithm. It also demonstrates the different challenges encountered while training a (D)RL algorithm.

Gradually, the agent can increase its long-term reward by
exploiting knowledge learned about the expected utility (i.e.
discounted sum of expected future rewards) of different state-
action pairs. One of the main challenges in reinforcement
learning is managing the trade-off between exploration and
exploitation. To maximize the rewards it receives, an agent
must exploit its knowledge by selecting actions which are
known to result in high rewards. On the other hand, to
discover such beneficial actions, it has to take the risk of
trying new actions which may lead to higher rewards than
the current best-valued actions for each system state. In other
words, the learning agent has to exploit what it already knows
in order to obtain rewards, but it also has to explore the
unknown in order to make better action selections in the future.
Examples of strategies which have been proposed to manage
this trade-off include ε-greedy and softmax. When adopting the ubiquitous ε-greedy strategy, an agent either selects an action at random with probability 0 < ε < 1, or greedily selects the highest valued action for the current state with the remaining probability 1 − ε. Intuitively, the agent should
explore more at the beginning of the training process when
little is known about the problem environment. As training
progresses, the agent may gradually conduct more exploitation
than exploration. The design of exploration strategies for RL
agents is an area of active research (see e.g. [20]).
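As a concrete illustration of the ε-greedy strategy described above, the following minimal Python sketch (the dictionary-based Q-table and the action list are assumptions for illustration) selects a random action with probability ε and the greedy action otherwise; ε can then be decayed as training progresses.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one.

    Q is assumed to be a dict mapping (state, action) -> value estimate.
    """
    if random.random() < epsilon:
        return random.choice(actions)                               # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))       # exploit

# Example of gradually shifting from exploration to exploitation:
# epsilon = max(0.05, 1.0 * (0.995 ** episode))
```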
Markov decision processes (MDPs) are considered the de
facto standard when formalising sequential decision making
problems involving a single RL agent [21]. An MDP consists
of a set S of states, a set A of actions, a transition function T and a reward function R [22], i.e. a tuple ⟨S, A, T, R⟩. When in any state s ∈ S, selecting an action a ∈ A will result in the environment entering a new state s′ ∈ S with a transition probability T(s, a, s′) ∈ (0, 1), and give a reward R(s, a). This process is illustrated in Fig. 3. The stochastic policy π : S → Δ(A) is a mapping from the state space to a probability distribution over the set of actions, and π(a|s) represents the probability of choosing action a at state s. The goal is to find the optimal policy π*, which results in the highest expected sum of discounted rewards [21]:
$$\pi^* = \arg\max_{\pi} \underbrace{\mathbb{E}_{\pi}\Big\{\sum_{k=0}^{H-1} \gamma^k r_{k+1} \,\Big|\, s_0 = s\Big\}}_{:=V^{\pi}(s)}, \qquad (1)$$
for all states s ∈ S, where r_k = R(s_k, a_k) is the reward at time k and V^π(s), the 'value function' at state s following a policy π, is the expected 'return' (or 'utility') when starting at s and following the policy π thereafter [1]. An important, related concept is the action-value function, a.k.a. 'Q-function', defined as:

$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\Big\{\sum_{k=0}^{H-1} \gamma^k r_{k+1} \,\Big|\, s_0 = s, a_0 = a\Big\}. \qquad (2)$$
The discount factor γ ∈ [0,1] controls how an agent regards future rewards. Low values of γ encourage myopic behaviour where an agent will aim to maximise short term rewards, whereas high values of γ cause agents to be more forward-looking and to maximise rewards over a longer time frame. The horizon H refers to the number of time steps in the MDP. In infinite-horizon problems H = ∞, whereas in episodic domains H has a finite value. Episodic domains may terminate after a fixed number of time steps, or when an agent reaches a specified goal state. The last state reached in an episodic domain is referred to as the terminal state. In finite-horizon or goal-oriented domains discount factors of (close to) 1 may be used to encourage agents to focus on achieving the goal, whereas in infinite-horizon domains lower discount factors may be used to strike a balance between short- and long-term rewards. If the optimal policy for a MDP is known, then V^{π*} may be used to determine the maximum expected discounted sum of rewards available from any arbitrary initial state. A rollout is a trajectory produced in the state space by sequentially applying a policy to an initial state. A MDP
satisfies the Markov property, i.e. system state transitions are
dependent only on the most recent state and action, not on
the full history of states and actions in the decision process.
Moreover, in many real-world application domains, it is not
possible for an agent to observe all features of the environment
state; in such cases the decision-making problem is formulated
as a partially-observable Markov decision process (POMDP).
Solving a reinforcement learning task means finding a policy
πthat maximises the expected discounted sum of rewards
over trajectories in the state space. RL agents may learn
value function estimates, policies and/or environment models
directly. Dynamic programming (DP) refers to a collection of
algorithms that can be used to compute optimal policies given
a perfect model of the environment in terms of reward and
transition functions. Unlike DP, in Monte Carlo methods there
is no assumption of complete environment knowledge. Monte
Carlo methods are incremental in an episode-by-episode sense.
Upon the completion of an episode, the value estimates and
policies are updated. Temporal Difference (TD) methods, on
the other hand, are incremental in a step-by-step sense, making
them applicable to non-episodic scenarios. Like Monte Carlo
methods, TD methods can learn directly from raw experience
without a model of the environment’s dynamics. Like DP, TD
methods learn their estimates based on other estimates.
A. Value-based methods
Q-learning is one of the most commonly used RL algo-
rithms. It is a model-free TD algorithm that learns estimates of
the utility of individual state-action pairs (Q-functions defined
in Eqn. 2). Q-learning has been shown to converge to the
optimum state-action values for a MDP with probability 1,
so long as all actions in all states are sampled infinitely
often and the state-action values are represented discretely
[23]. In practice, Q-learning will learn (near) optimal state-
action values provided a sufficient number of samples are
obtained for each state-action pair. If a Q-learning agent has
converged to the optimal Q values for a MDP and selects
actions greedily thereafter, it will receive the same expected
sum of discounted rewards as calculated by the value function
with π* (assuming that the same arbitrary initial starting state is used for both). Agents implementing Q-learning update their Q values according to the following update rule:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma \max_{a' \in A} Q(s',a') - Q(s,a)\big], \qquad (3)$$

where Q(s,a) is an estimate of the utility of selecting action a in state s, α ∈ [0,1] is the learning rate which controls the degree to which Q values are updated at each time step, and γ ∈ [0,1] is the same discount factor used in Eqn. 1. The
theoretical guarantees of Q-learning hold with any arbitrary
initial Q values [23]; therefore the optimal Q values for a MDP
can be learned by starting with any initial action value function
estimate. The initialisation can be optimistic (each Q(s,a)
returns the maximum possible reward), pessimistic (minimum)
or even using knowledge of the problem to ensure faster
convergence. Deep Q-Networks (DQN) [24] incorporates a
variant of the Q-learning algorithm [25], by using deep neural
networks (DNNs) as a non-linear Q function approximator
over high-dimensional state spaces (e.g. the pixels in a frame
of an Atari game). Practically, the neural network predicts the
value of all actions without the use of any explicit domain-
specific information or hand-designed features. DQN applies the experience replay technique to break the correlation between successive experience samples and to improve sample efficiency. For increased stability, two networks are used where
the parameters of the target network for DQN are fixed for
a number of iterations while updating the parameters of the
online network. Readers are directed to sub-section III-E for
a more detailed introduction to the use of DNNs in Deep RL.
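To make the update rule in Eqn. (3) concrete, the sketch below implements one tabular Q-learning step; the dictionary-based Q-table, the action list and the `done` flag for terminal states are assumptions for illustration, not part of the DQN variant discussed above.

```python
def q_learning_update(Q, s, a, r, s_next, actions, done, alpha=0.1, gamma=0.99):
    """One-step tabular Q-learning update following Eqn. (3)."""
    q_sa = Q.get((s, a), 0.0)
    # No bootstrapping from terminal states.
    best_next = 0.0 if done else max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - q_sa      # TD error w.r.t. the greedy target
    Q[(s, a)] = q_sa + alpha * td_error          # move the estimate toward the TD target
    return Q
```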
B. Policy-based methods
The difference between value-based and policy-based meth-
ods is essentially a matter of where the burden of optimality
resides. Both method types must propose actions and evaluate the resulting behaviour, but while value-based methods focus on estimating the optimal cumulative reward and derive a policy that follows its recommendations, policy-based methods aim to estimate the optimal policy directly, with the value being secondary, if it is calculated at all. Typically, a policy is parameterised as a neural network π_θ. Policy gradient methods use gradient
descent to estimate the parameters of the policy that maximise
the expected reward. The result can be a stochastic policy
where actions are selected by sampling, or a deterministic
policy. Many real-world applications have continuous action
spaces. Deterministic policy gradient (DPG) algorithms [26]
[1] allow reinforcement learning in domains with continuous
actions. Silver et al. [26] proved that a deterministic policy
gradient exists for MDPs satisfying certain conditions, and
that deterministic policy gradients have a simple model-free
form that follows the gradient of the action-value function.
As a result, instead of integrating over both state and action
spaces in stochastic policy gradients, DPG integrates over
the state space only leading to fewer samples in problems
with large action spaces. To ensure sufficient exploration,
actions are chosen using a stochastic policy, while learning a
deterministic target policy. The REINFORCE [27] algorithm
is a straight forward policy-based method. The discounted
cumulative reward gt=PH−1
k=0γkrk+t+1at one time step is
calculated by playing the entire episode, so no estimator is
required for policy evaluation. The parameters are updated into
the direction of the performance gradient:
θ←θ+αγtg∇log πθ(a|s),(4)
where αis the learning rate for a stable incremental update.
Intuitively, we want to encourage state-action pairs that result
in the best possible returns. Trust Region Policy Optimization
(TRPO) [28], works by preventing the updated policies from
deviating too much from previous policies, thus reducing the
chance of a bad update. TRPO optimises a surrogate objective
function where the basic idea is to limit each policy gradient
update as measured by the Kullback-Leibler (KL) divergence
between the current and the new proposed policy. This method
results in monotonic improvements in policy performance.
Proximal Policy Optimization (PPO) [29] proposed a clipped surrogate objective function that adds a penalty for overly large policy changes. Accordingly, PPO policy
optimisation is simpler to implement, and has better sample
complexity while ensuring the deviation from the previous
policy is relatively small.
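For illustration, a minimal sketch of the REINFORCE update in Eqn. (4) is given below, assuming a PyTorch policy network whose per-step log-probabilities (e.g. from `torch.distributions.Categorical(...).log_prob(action)`) and rewards have been collected over one full episode; the optimiser is an assumed external object.

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """REINFORCE: full-episode returns weight the log-probabilities, as in Eqn. (4).

    log_probs: list of log pi_theta(a_t|s_t) tensors collected during the episode.
    rewards:   list of scalar rewards r_1..r_H for the same episode.
    """
    returns, g = [], 0.0
    for r in reversed(rewards):                  # discounted return g_t for each step
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)

    # Negative sign: the optimiser minimises, while we want to maximise expected return.
    loss = -(returns * torch.stack(log_probs)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```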
C. Actor-critic methods
Actor-critic methods are hybrid methods that combine the
benefits of policy-based and value-based algorithms. The
policy structure that is responsible for selecting actions is
known as the ‘actor’. The estimated value function criticises
the actions made by the actor and is known as the ‘critic’.
After each action selection, the critic evaluates the new state to
determine whether the result of the selected action was better
or worse than expected. Both networks need their gradients to learn. Let J(θ) := E_{π_θ}[r] represent a policy objective function, where θ designates the parameters of a DNN. Policy gradient methods search for a local maximum of J(θ). Since optimization in continuous action spaces could be costly and slow, the Deterministic Policy Gradient (DPG) algorithm represents actions as a parameterised function µ(s|θ^µ), where θ^µ refers to the parameters of the actor network. The unbiased estimate of the policy gradient step is then given as:

$$\nabla_\theta J = -\mathbb{E}_{\pi_\theta}\big\{(g - b)\,\nabla_\theta \log \pi_\theta(a|s)\big\}, \qquad (5)$$

where b is the baseline. Using b ≡ 0 is the simplification that leads to the REINFORCE formulation. Williams [27] explains that a well chosen baseline can reduce variance, leading to more stable learning. The baseline b can be chosen as in V^π(s)-, Q^π(s,a)- or 'advantage' A^π(s,a)-based methods.
Deep Deterministic Policy Gradient (DDPG) [30] is a model-
free, off-policy (please refer to subsection III-D for a detailed
distinction), actor-critic algorithm that can learn policies for
continuous action spaces using deep neural net based function
approximation, extending prior work on DPG to large and
high-dimensional state-action spaces. When selecting actions,
exploration is performed by adding noise to the actor policy.
Like DQN, to stabilise learning a replay buffer is used to
minimize data correlation. A separate actor-critic specific
target network is also used. Standard Q-learning is suited to a restricted number of discrete actions, and DDPG also needs a straightforward way to choose an action. Starting from Q-learning, we extend Eqn. 2 to define the optimal Q-value and optimal action as Q* and a*:

$$Q^*(s,a) = \max_{\pi} Q^{\pi}(s,a), \qquad a^* = \arg\max_{a} Q^*(s,a). \qquad (6)$$

In the case of Q-learning, the action is chosen according to the Q-function as in Eqn. 6. DDPG instead chains the evaluation of Q after the action has already been chosen according to the policy. By correcting the Q-values towards the optimal values using the chosen action, we also update the policy towards the optimal action proposition. Thus two separate networks work at estimating Q* and π*.
Asynchronous Advantage Actor Critic (A3C) [31] uses
asynchronous gradient descent for optimization of deep neural
network controllers. Deep reinforcement learning algorithms
based on experience replay such as DQN and DDPG have
demonstrated considerable success in difficult domains such
as playing Atari games. However, experience replay uses a
large amount of memory to store experience samples and
requires off-policy learning algorithms. In A3C, instead of
using an experience replay buffer, agents asynchronously
execute on multiple parallel instances of the environment. In
addition to the reducing correlation of the experiences, the
parallel actor-learners have a stabilizing effect on training
process. This simple setup enables a much larger spectrum
of on-policy as well as off-policy reinforcement learning
algorithms to be applied robustly using deep neural networks.
A3C exceeded the performance of the previous state-of-the-
art at the time on the Atari domain while training for half
the time on a single multi-core CPU instead of a GPU by
combining several ideas. It also demonstrates how using an
estimate of the value function as the previously explained
baseline b reduces variance and improves convergence time. By defining the advantage as A^π(a,s) = Q^π(s,a) − V^π(s), the expression of the policy gradient from Eqn. 5 is rewritten as $\nabla_\theta L = -\mathbb{E}_{\pi_\theta}\{A^{\pi}(a,s)\,\nabla_\theta \log \pi_\theta(a|s)\}$. The critic is trained to minimize $\frac{1}{2}\|A^{\pi_\theta}(a,s)\|^2$. The intuition of using advantage
estimates rather than just discounted returns is to allow the
agent to determine not just how good its actions were, but also
how much better they turned out to be than expected, leading
to reduced variance and more stable training. The A3C model
also demonstrated good performance in 3D environments such
as labyrinth exploration. Advantage Actor Critic (A2C) is
a synchronous version of the asynchronous advantage actor
critic model, that waits for each agent to finish its experience
before conducting an update. The performance of both A2C
and A3C is comparable. Most greedy policies must alternate
between exploration and exploitation, and good exploration
visits the states where the value estimate is uncertain. This
way, exploration focuses on trying to find the most uncertain
state paths as they bring valuable information. In addition to
advantage, explained earlier, some methods use the entropy as
the uncertainty quantity. Most A3C implementations include
this as well. Two methods with common authors are energy-based policies [32] and, more recent and in widespread use, the Soft Actor Critic (SAC) algorithm [33]; both rely on adding an entropy term to the reward function, so the policy objective is updated from Eqn. 1 to Eqn. 7. We refer readers to [33] for an in-depth explanation of the expression

$$\pi^*_{\mathrm{MaxEnt}} = \arg\max_{\pi} \mathbb{E}_{\pi}\Big\{\sum_{t} \big[r(s_t,a_t) + \alpha H(\pi(\cdot|s_t))\big]\Big\}, \qquad (7)$$

shown here to illustrate how the entropy H is added.
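To make the advantage-based updates above concrete, the sketch below (a minimal assumption-laden example in PyTorch, not any specific A3C/A2C/SAC implementation) forms the advantage A(s,a) = R − V(s), weights the policy loss by it, trains the critic on ½A², and adds an optional entropy bonus for exploration.

```python
import torch

def actor_critic_losses(log_prob, entropy, value, ret, entropy_coef=0.01):
    """A2C-style losses for a batch of transitions.

    log_prob: log pi_theta(a|s) from the actor.
    entropy:  entropy of pi_theta(.|s), encouraging exploration.
    value:    critic estimate V(s).
    ret:      observed (discounted) return used as a target for V(s).
    """
    advantage = ret - value                          # A(s,a) ~ R - V(s)
    policy_loss = -(advantage.detach() * log_prob)   # do not backprop through the advantage
    critic_loss = 0.5 * advantage.pow(2)             # 1/2 * A^2
    return (policy_loss - entropy_coef * entropy).mean(), critic_loss.mean()
```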
D. Model-based (vs. Model-free) & On/Off Policy methods
In practical situations, interacting with the real environment
could be limited due to many reasons including safety and
cost. Learning a model for environment dynamics may reduce
the amount of interactions required with the real environ-
ment. Moreover, exploration can be performed on the learned
models. In the case of model-based approaches (e.g. Dyna-
Q [34], R-max [35]), agents attempt to learn the transition
function Tand reward function R, which can be used when
making action selections. Keeping a model approximation of
the environment means storing knowledge of its dynamics, and
allows for fewer interactions with the real environment, which are sometimes costly. By contrast, in model-free approaches such knowledge
is not a requirement. Instead, model-free learners sample the
underlying MDP directly in order to gain knowledge about
the unknown model, in the form of value function estimates
for example. In Dyna-2 [36], the learning agent stores long-
term and short-term memories, where a memory is defined
as the set of features and corresponding parameters used by
an agent to estimate the value function. Long-term memory
is for general domain knowledge which is updated from real
experience, while short-term memory is for specific local
knowledge about the current situation, and the value function
is a linear combination of long and short term memories.
Learning algorithms can be on-policy or off-policy depend-
ing on whether the updates are conducted on fresh trajectories
generated by the policy or by another policy, that could be
generated by an older version of the policy or provided by an
expert. On-policy methods such as SARSA [37], estimate the
value of a policy while using the same policy for control.
However, off-policy methods such as Q-learning [25], use
two policies: the behavior policy, the policy used to generate
behavior; and the target policy, the one being improved on. An
advantage of this separation is that the target policy may be
deterministic (greedy), while the behavior policy can continue
to sample all possible actions [1].
E. Deep reinforcement learning (DRL)
Tabular representations are the simplest way to store learned
estimates (of e.g. values, policies or models), where each state-
action pair has a discrete estimate associated with it. When
estimates are represented discretely, each additional feature
tracked in the state leads to an exponential growth in the
number of state-action pair values that must be stored [38].
This problem is commonly referred to in the literature as the
“curse of dimensionality”, a term originally coined by Bellman
[39]. In simple environments this is rarely an issue, but it
may lead to an intractable problem in real-world applications,
due to memory and/or computational constraints. Learning
over a large state-action space is possible, but may take an
unacceptably long time to learn useful policies. Many real-
world domains feature continuous state and/or action spaces;
these can be discretised in many cases. However, large discreti-
sation steps may limit the achievable performance in a domain,
whereas small discretisation steps may result in a large state-
action space where obtaining a sufficient number of samples
for each state-action pair is impractical. Alternatively, function
approximation may be used to generalise across states and/or
actions, whereby a function approximator is used to store and
retrieve estimates. Function approximation is an active area
of research in RL, offering a way to handle continuous state
and/or action spaces, mitigate against the state-action space
explosion and generalise prior experience to previously unseen
state-action pairs. Tile coding is one of the simplest forms
of function approximation, where one tile represents multiple
states or state-action pairs [38]. Neural networks are also
commonly used to implement function approximation, one of
the most famous examples being Tesauro's application of RL
to backgammon [40]. Recent work has applied deep neural
networks as a function approximation method; this emerging
paradigm is known as deep reinforcement learning (DRL).
DRL algorithms have achieved human level performance (or
above) on complex tasks such as playing Atari games [24] and
playing the board game Go [41].
In DQN [24] it is demonstrated how a convolutional neural
network can learn successful control policies from just raw
video data for different Atari environments. The network
was trained end-to-end and was not provided with any game
specific information. The input to the convolutional neural
network consists of an 84×84×4 tensor of 4 consecutive stacked frames used to capture the temporal information. Through consecutive layers, the network learns how to combine features in order to identify the action most likely to bring the best outcome. One layer consists of several convolutional filters. For instance, the first layer uses 32 filters with 8×8 kernels with stride 4 and applies a rectifier non-linearity. The second layer is 64 filters of 4×4 with stride 2, followed by a rectifier non-linearity. Next comes a third convolutional layer of 64 filters of 3×3 with stride 1, followed by a rectifier. The last
intermediate layer is composed of 512 rectifier units fully
connected. The output layer is a fully-connected linear layer
with a single output for each valid action. For DQN training
stability, two networks are used while the parameters of the
target network are fixed for a number of iterations while
updating the online network parameters. For practical reasons,
the Q(s,a)function is modeled as a deep neural network
that predicts the value of all actions given the input state.
Accordingly, deciding what action to take requires performing
a single forward pass of the network. Moreover, in order
to increase sample efficiency, experiences of the agent are
stored in a replay memory (experience replay), where the Q-
learning updates are conducted on randomly selected samples
from the replay memory. This random selection breaks the
correlation between successive samples. Experience replay
enables reinforcement learning agents to remember and reuse
experiences from the past where observed transitions are stored
for some time, usually in a queue, and sampled uniformly from
this memory to update the network. However, this approach
simply replays transitions at the same frequency that they were
originally experienced, regardless of their significance. An
alternative method is to use two separate experience buckets,
one for positive and one for negative rewards [42]. Then a fixed
fraction from each bucket is selected to replay. This method
is only applicable in domains that have a natural notion of
binary experience. Experience replay has also been extended
with a framework for prioritising experience [43], where
important transitions, based on the TD error, are replayed
more frequently, leading to improved performance and faster
training when compared to the standard experience replay
approach.
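For reference, the convolutional architecture described above can be sketched as follows; the use of PyTorch is an assumption (the original DQN was not implemented in it), but the layer sizes follow the text: an 84×84×4 stacked-frame input and one Q-value output per valid action.

```python
import torch.nn as nn

class DQNNetwork(nn.Module):
    """CNN mirroring the DQN layer sizes described in the text."""
    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 32 filters, 8x8, stride 4
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 64 filters, 4x4, stride 2
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 64 filters, 3x3, stride 1
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 512 fully-connected rectifier units
            nn.Linear(512, num_actions),             # one linear output per valid action
        )

    def forward(self, x):                            # x: (batch, 4, 84, 84)
        return self.head(self.features(x))
```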
The max operator in standard Q-learning and DQN uses the same values both to select and to evaluate an action, resulting in over-optimistic value estimates. Double DQN (D-DQN) [44] tackles this over-estimation problem by selecting the greedy action according to the online network while using the target network to estimate its value. It was shown that
this algorithm not only yields more accurate value estimates,
but leads to much higher scores on several games.
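The decoupling used by D-DQN can be illustrated with the following sketch (assuming PyTorch online and target networks and batched tensors): the online network selects the greedy action while the target network evaluates it.

```python
import torch

def double_dqn_target(online_net, target_net, reward, next_state, done, gamma=0.99):
    """Compute the D-DQN bootstrap target for a batch of transitions."""
    with torch.no_grad():
        next_action = online_net(next_state).argmax(dim=1, keepdim=True)   # selection: online net
        next_q = target_net(next_state).gather(1, next_action).squeeze(1)  # evaluation: target net
        return reward + gamma * next_q * (1.0 - done)                      # no bootstrap at terminal states
```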
In Dueling network architecture [45] the state value function
and associated advantage function are estimated, and then
combined together to estimate action value function. The
advantage of the dueling architecture lies partly in its ability
to learn the state-value function efficiently. In a single-stream
architecture only the value for one of the actions is updated.
However in dueling architecture, the value stream is updated
with every update, allowing for better approximation of the
state values, which in turn need to be accurate for temporal
difference methods like Q-learning.
DRQN [46] applied a modification to the DQN by com-
bining a Long Short Term Memory (LSTM) with a Deep Q-
Network. Accordingly, the DRQN is capable of integrating in-
formation across frames to detect information such as velocity
of objects. DRQN was shown to generalize its policies to the case of complete observations, and when trained on standard Atari games and evaluated against flickering games, DRQN was shown to generalize better than DQN.
IV. EXTENSIONS TO REINFORCEMENT LEARNING
This section introduces and discusses some of the main
extensions to the basic single-agent RL paradigms which
have been introduced over the years. As well as broadening
the applicability of RL algorithms, many of the extensions
discussed here have been demonstrated to improve scalability,
learning speed and/or converged performance in complex
problem domains.
A. Reward shaping
As noted in Section III, the design of the reward function
is crucial: RL agents seek to maximise the return from the
reward function, therefore the optimal policy for a domain
is defined with respect to the reward function. In many real-
world application domains, learning may be difficult due to
sparse and/or delayed rewards. RL agents typically learn how
to act in their environment guided merely by the reward signal.
Additional knowledge can be provided to a learner by the
addition of a shaping reward to the reward naturally received
from the environment, with the goal of improving learning
speed and converged performance. This principle is referred
to as reward shaping. The term shaping has its origins in the
field of experimental psychology, and describes the idea of
rewarding all behaviour that leads to the desired behaviour.
Skinner [47] discovered while training a rat to push a lever that
any movement in the direction of the lever had to be rewarded
to encourage the rat to complete the task. Analogously to
the rat, a RL agent may take an unacceptably long time to
discover its goal when learning from delayed rewards, and
shaping offers an opportunity to speed up the learning process.
Reward shaping allows a reward function to be engineered in a
way to provide more frequent feedback signal on appropriate
behaviours [48], which is especially useful in domains with
sparse rewards. Generally, the return from the reward function
is modified as follows: r0=r+fwhere ris the return from the
original reward function R,fis the additional reward from a
shaping function F, and r0is the signal given to the agent
by the augmented reward function R0. Empirical evidence
has shown that reward shaping can be a powerful tool to
improve the learning speed of RL agents [49]. However, it
can have unintended consequences. The implication of adding
a shaping reward is that a policy which is optimal for the augmented reward function R′ may not in fact also be optimal for the original reward function R. A classic example of reward shaping gone wrong for this exact reason is reported in [49], where the bicycle agent in the experiment would ride in circles to stay upright rather than reach its goal. Difference
rewards (D) [50] and potential-based reward shaping (PBRS)
[51] are two commonly used shaping approaches. Both D
and PBRS have been successfully applied to a wide range of
application domains and have the added benefit of convenient
theoretical guarantees, meaning that they do not suffer from
the same issues as the unprincipled reward shaping approaches
described above (see e.g. [51]–[55]).
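As an illustration of potential-based reward shaping, the sketch below implements the standard PBRS form F(s, s′) = γΦ(s′) − Φ(s) [51] with a user-supplied potential function Φ; the distance-to-goal potential in the comment is purely a hypothetical example.

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Augment the environment reward with a potential-based shaping term.

    r' = r + F(s, s'), with F(s, s') = gamma * phi(s') - phi(s),
    the form with the policy-invariance guarantee discussed above.
    """
    return r + gamma * potential(s_next) - potential(s)

# Hypothetical potential: negative distance to the goal state or next waypoint.
# potential = lambda s: -distance_to_goal(s)
```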
B. Multi-agent reinforcement learning (MARL)
In multi-agent reinforcement learning, multiple RL agents
are deployed into a common environment. The single-agent
MDP framework becomes inadequate when multiple au-
tonomous agents act simultaneously in the same domain.
Instead, the more general stochastic game (SG) may be used in
the case of a Multi-Agent System (MAS) [56]. A SG is defined
as a tuple ⟨S, A_{1...N}, T, R_{1...N}⟩, where N is the number of agents, S is the set of system states, A_i is the set of actions for agent i (and A is the joint action set), T is the transition function, and R_i is the reward function for agent i. The SG looks very similar to the MDP framework, apart from the addition of multiple agents. In fact, for the case of N = 1 a SG then becomes a MDP. The next system state and the rewards received by each agent depend on the joint action a of all of the agents in a SG, where a is derived from the combination of the individual actions a_i for each agent in the system. Each agent may have its own local state perception s_i, which is different to the system state s (i.e. individual agents are not assumed to have full observability of the system). Note also that each
agent may receive a different reward for the same system
state transition, as each agent has its own separate reward
function Ri. In a SG, the agents may all have the same goal
(collaborative SG), totally opposing goals (competitive SG),
or there may be elements of collaboration and competition
between agents (mixed SG). Whether RL agents in a MAS
will learn to act together or at cross-purposes depends on the
reward scheme used for a specific application.
C. Multi-objective reinforcement learning
In multi-objective reinforcement learning (MORL) the re-
ward signal is a vector, where each component represents the
performance on a different objective. The MORL framework
was developed to handle sequential decision making problems
where tradeoffs between conflicting objective functions must
be considered. Examples of real-world problems with multiple
objectives include selecting energy sources (tradeoffs between
fuel cost and emissions) [57] and watershed management
(tradeoffs between generating electricity, preserving reservoir
levels and supplying drinking water) [58]. Solutions to MORL
problems are often evaluated using the concept of Pareto
dominance [59] and MORL algorithms typically seek to learn
or approximate the set of non-dominated solutions. MORL
problems may be defined using the MDP or SG framework as
appropriate, in a similar manner to single-objective problems.
The main difference lies in the definition of the reward
function: instead of returning a single scalar value r, the
reward function R in multi-objective domains returns a vector r consisting of the rewards for each individual objective c ∈ C. Therefore, a regular MDP or SG can be extended to
a Multi-Objective MDP (MOMDP) or Multi-Objective SG
(MOSG) by modifying the return of the reward function. For a
more complete overview of MORL beyond the brief summary
presented in this section, the interested reader is referred to
recent surveys [60], [61].
D. State Representation Learning (SRL)
State Representation Learning refers to feature extraction
& dimensionality reduction to represent the state space with
its history conditioned by the actions and environment of the
agent. A complete review of SRL for control is discussed
in [62]. In the simplest form, SRL maps a high dimensional vector o_t into a small dimensional latent space s_t. The inverse operation decodes the state back into an estimate of the original observation ô_t. The agent then learns to map from
the latent space to the action. Training the SRL chain is
unsupervised in the sense that no labels are required. Reducing
the dimension of the input effectively simplifies the task as
it removes noise and decreases the domain’s size as shown
in [63]. SRL could be a simple auto-encoder (AE), though
various methods exist for observation reconstruction such as
Variational Auto-Encoders (VAE) or Generative Adversarial
Networks (GANs), as well as forward models for predicting
the next state or inverse models for predicting the action given
a transition. A good learned state representation should be
Markovian; i.e. it should encode all necessary information to
be able to select an action based on the current state only, and
not any previous states or actions [62], [64].
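In the simplest auto-encoder form of SRL mentioned above, an encoder compresses the observation o_t into a latent state s_t and a decoder reconstructs ô_t; a minimal PyTorch sketch follows, where the layer sizes and latent dimension are arbitrary assumptions.

```python
import torch.nn as nn

class StateRepresentationAE(nn.Module):
    """Auto-encoder SRL: o_t -> latent s_t -> reconstruction o_hat_t."""
    def __init__(self, obs_dim, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, obs_dim))

    def forward(self, obs):
        state = self.encoder(obs)                 # low-dimensional latent state s_t
        recon = self.decoder(state)               # estimate of the original observation
        return state, recon

# Training is unsupervised: minimise a reconstruction loss such as
# nn.functional.mse_loss(recon, obs); the RL policy then consumes `state`.
```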
E. Learning from Demonstrations
Learning from Demonstrations (LfD) is used by humans
to acquire new skills in an expert to learner knowledge
transmission process. LfD is important for initial exploration
where reward signals are too sparse or the input domain is
too large to cover. In LfD, an agent learns to perform a
task from demonstrations, usually in the form of state-action
pairs, provided by an expert without any feedback rewards.
However, high quality and diverse demonstrations are hard to
collect, leading to learning sub-optimal policies. Accordingly,
learning merely from demonstrations can be used to initialize
the learning agent with a good or safe policy, and then
reinforcement learning can be conducted to enable the dis-
covery of a better policy by interacting with the environment.
Combining demonstrations and reinforcement learning has
been conducted in recent research. AlphaGo [41], combines
search tree with deep neural networks, initializes the policy
network by supervised learning on state-action pairs provided
by recorded games played by human experts. Additionally, a
value network is trained to tell how desirable a board state is.
By conducting self-play and reinforcement learning, AlphaGo
is able to discover new stronger actions and learn from its
mistakes, achieving super human performance. More recently,
AlphaZero [65], developed by the same team, proposed a
general framework for self-play models. AlphaZero is trained
entirely using reinforcement learning and self play, starting
from completely random play, and requires no prior knowledge
of human players. AlphaZero taught itself from scratch how
to master the games of chess, shogi, and Go, beating a
world-champion program in each case. In [66] it is shown that
given the initial demonstration, no explicit exploration is nec-
essary, and we can attain near-optimal performance. Measuring
the divergence between the current policy and the expert policy
for optimization is proposed in [67]. DQfD [68] pre-trains the
agent and uses expert demonstrations by adding them into
the replay buffer with additional priority. Moreover, a training
framework that combines learning from both demonstrations
and reinforcement learning is proposed in [69] for fast learning
agents. Two policies close to maximizing the reward function
can still have large differences in behaviour. To avoid degenerating to a solution which would fit the reward but not the original behaviour, the authors of [70] proposed a method for enforcing that the optimal policy learnt over the rewards should still match the observed policy in behaviour. Behavior Cloning (BC) is applied as a supervised learning problem that maps states to actions based on demonstrations provided by an expert. On the other
hand, Inverse Reinforcement Learning (IRL) is about inferring
the reward function that justifies demonstrations of the expert.
IRL is the problem of extracting a reward function given
observed, optimal behavior [71]. A key motivation is that the
reward function provides a succinct and robust definition of
a task. Generally, IRL algorithms can be expensive to run,
requiring reinforcement learning in an inner loop between
cost estimation to policy training and evaluation. Generative
Adversarial Imitation Learning (GAIL) [72] introduces a way
to avoid this expensive inner loop. In practice, GAIL trains a
policy close enough to the expert policy to fool a discriminator.
This process is similar to GANs [73], [74]. The resulting
policy must travel the same MDP states as the expert, or the
discriminator would pick up the differences. The theory behind
GAIL is an equation simplification: qualitatively, if IRL is
going from demonstrations to a cost function and RL from a
cost function to a policy, then we should altogether be able
to go from demonstration to policy in a single equation while
avoiding the cost function estimation.
V. REINFORCEMENT LEARNING FOR AUTONOMOUS
DRIVING TASKS
Autonomous driving tasks where RL could be applied
include: controller optimization, path planning and trajectory
optimization, motion planning and dynamic path planning,
development of high-level driving policies for complex nav-
igation tasks, scenario-based policy learning for highways,
intersections, merges and splits, reward learning with inverse
reinforcement learning from expert data for intent prediction
for traffic actors such as pedestrians and vehicles, and finally learning of policies that ensure safety and perform risk estimation. Before discussing the applications of DRL to AD tasks, we briefly review the state space, action space and reward schemes in the autonomous driving setting.
A. State Spaces, Action Spaces and Rewards
To successfully apply DRL to autonomous driving tasks,
designing appropriate state spaces, action spaces, and reward
functions is important. Leurent et al. [75] provided a compre-
hensive review of the different state and action representations
which are used in autonomous driving research. Commonly
used state space features for an autonomous vehicle include:
position, heading and velocity of ego-vehicle, as well as other
obstacles in the sensor view extent of the ego-vehicle. To avoid
variations in the dimension of the state space, a Cartesian or
Polar occupancy grid around the ego vehicle is frequently
employed. This is further augmented with lane information
such as lane number (ego-lane or others), path curvature,
past and future trajectory of the ego-vehicle, longitudinal
information such as Time-to-collision (TTC), and finally scene
information such as traffic laws and signal locations.
Using raw sensor data such as camera images, LiDAR,
radar, etc. provides the benefit of finer contextual information,
while using condensed abstracted data reduces the complexity
of the state space. In between, a mid-level representation such
as 2D bird eye view (BEV) is sensor agnostic but still close to
the spatial organization of the scene. Fig. 4is an illustration
of a top down view showing an occupancy grid, past and
projected trajectories, and semantic information about the
scene such as the position of traffic lights. This intermediary
format retains the spatial layout of roads when graph-based
representations would not. Some simulators offer this view
such as Carla or Flow (see Table V-C).
A vehicle policy must control a number of different actu-
ators. Continuous-valued actuators for vehicle control include
steering angle, throttle and brake. Other actuators such as
gear changes are discrete. To reduce complexity and allow
the application of DRL algorithms which work with discrete
action spaces only (e.g. DQN), an action space may be
discretised uniformly by dividing the range of continuous
actuators such as steering angle, throttle and brake into equal-
sized bins (see Section VI-C). Discretisation in log-space
has also been suggested, as many steering angles which are
selected in practice are close to the centre [76]. Discretisation
does have disadvantages however; it can lead to jerky or
unstable trajectories if the step values between actions are too
large. Furthermore, when selecting the number of bins for an
actuator there is a trade-off between having enough discrete
steps to allow for smooth control, and not having so many
steps that action selections become prohibitively expensive to
evaluate. As an alternative to discretisation, continuous values
for actuators may also be handled by DRL algorithms which
learn a policy directly (e.g. DDPG). Temporal abstraction (the options framework [77]) may also be employed to simplify
the process of selecting actions, where agents select options
instead of low-level actions. These options represent a sub-
policy that could extend a primitive action over multiple time
steps.
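The uniform (and log-space) discretisation of continuous actuators discussed above can be sketched as follows; the ±0.5 rad steering range and bin counts are hypothetical values chosen for illustration.

```python
import numpy as np

def uniform_bins(low, high, n_bins):
    """Equal-sized bins over a continuous actuator range (e.g. steering angle)."""
    return np.linspace(low, high, n_bins)

def log_bins(max_abs, n_bins_per_side):
    """Log-spaced bins, denser near the centre where most steering actions lie."""
    positive = np.logspace(-2, 0, n_bins_per_side) * max_abs
    return np.concatenate([-positive[::-1], [0.0], positive])

steering_actions = uniform_bins(-0.5, 0.5, 11)       # hypothetical +/-0.5 rad range, 11 bins
steering_actions_log = log_bins(0.5, 5)              # denser around straight-ahead driving
```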
Designing reward functions for DRL agents for autonomous
driving is still very much an open question. Examples of
criteria for AD tasks include: distance travelled towards a
destination [78], speed of the ego vehicle [78]–[80], keeping
the ego vehicle at a standstill [81], collisions with other road
users or scene objects [78], [79], infractions on sidewalks [78],
keeping in lane, and maintaining comfort and stability while
avoiding extreme acceleration, braking or steering [80], [81],
and following traffic rules [79].
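To illustrate how the criteria listed above might be combined, the following hypothetical weighted-sum reward (all weights and state fields are assumptions, not taken from any cited work) rewards progress and moderate speed while penalising collisions, lane infractions and harsh manoeuvres.

```python
def driving_reward(state):
    """Hypothetical AD reward combining several of the criteria listed above.

    `state` is assumed to expose: distance_gained (m), speed (m/s),
    collision (bool), off_lane (bool), jerk (m/s^3).
    """
    r = 0.0
    r += 1.0 * state.distance_gained          # progress towards the destination
    r += 0.1 * min(state.speed, 15.0)         # reward speed up to a comfortable cap
    r -= 100.0 if state.collision else 0.0    # heavy penalty for collisions
    r -= 10.0 if state.off_lane else 0.0      # penalty for lane/sidewalk infractions
    r -= 0.05 * abs(state.jerk)               # discourage harsh acceleration or braking
    return r
```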
B. Motion Planning & Trajectory optimization
Motion planning is the task of ensuring the existence of a
path between target and destination points. This is necessary
to plan trajectories for vehicles over prior maps usually aug-
mented with semantic information. Path planning in dynamic
environments and varying vehicle dynamics is a key prob-
lem in autonomous driving, for example negotiating the right to pass through an intersection [87] or merging onto highways.
Recent work by authors [89] contains real world motions by
various traffic actors, observed in diverse interactive driving
scenarios. Recently, authors demonstrated an application of
DRL (DDPG) for AD using a full-sized autonomous vehicle
[90]. The system was first trained in simulation, before being
trained in real time using on board computers, and was able
to learn to follow a lane, successfully completing a real-
world trial on a 250 metre section of road. Model-based deep
RL algorithms have been proposed for learning models and
policies directly from raw pixel inputs [91], [92]. In [93],
deep neural networks have been used to generate predictions in
simulated environments over hundreds of time steps. RL is also
suitable for control. Classical optimal control methods like LQR/iLQR are compared with RL methods in [94]. Classical methods are used to perform optimal control in stochastic settings, for example the Linear Quadratic Regulator (LQR) in linear regimes and iterative LQR (iLQR) in non-linear regimes. A recent study in [95] demonstrates that
random search over the parameters for a policy network can
perform as well as LQR.
C. Simulator & Scenario generation tools
Autonomous driving datasets address supervised learning
setup with training sets containing image, label pairs for
various modalities. Reinforcement learning requires an en-
vironment where state-action pairs can be recovered while
modelling dynamics of the vehicle state, environment as well
as the stochasticity in the movement and actions of the
environment and agent respectively. Various simulators are
actively used for training and validating reinforcement learn-
ing algorithms. Table V-C summarises various high fidelity
perception simulators capable of simulating cameras, LiDARs
and radar. Some simulators are also capable of providing the
vehicle state and dynamics. A complete review of sensors and
simulators utilised within the autonomous driving community
is available in [105] for readers. Learned driving policies are
stress tested in simulated environments before moving on to
costly evaluations in the real world. A multi-fidelity reinforcement learning (MFRL) framework is proposed in [106] for the case where multiple simulators are available. In MFRL, a cascade of simulators of increasing fidelity in representing state dynamics (and thus of increasing computational cost) is used to train and validate RL algorithms, finding near optimal policies for the real world with fewer expensive real world samples, demonstrated using a remote controlled car. CARLA
Challenge [107] is a Carla simulator based AD competition
with pre-crash scenarios characterized in a National Highway
Traffic Safety Administration report [108]. The systems are
evaluated in critical scenarios such as: Ego-vehicle loses
control, ego-vehicle reacts to unseen obstacle, lane change
to evade slow leading vehicle among others. The scores of
agents are evaluated as a function of the aggregated distance
travelled in different circuits, and total points discounted due
to infractions.
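As a purely illustrative sketch of this style of scoring (not the official CARLA Challenge metric), a per-route score could take the distance completed and apply a multiplicative penalty for each recorded infraction; the penalty factors and infraction names below are assumptions for illustration.

    # Hypothetical scoring sketch: distance completed per route, discounted
    # multiplicatively for each infraction. Penalty factors and infraction
    # names are illustrative, not the official CARLA Challenge metric.
    PENALTIES = {"collision_pedestrian": 0.50,
                 "collision_vehicle": 0.60,
                 "red_light": 0.70,
                 "off_road": 0.80}

    def route_score(distance_completed_m, infractions):
        score = distance_completed_m
        for infraction in infractions:
            score *= PENALTIES.get(infraction, 1.0)
        return score

    def aggregate_score(routes):
        """routes: list of (distance_completed_m, [infraction, ...]) pairs."""
        return sum(route_score(d, infr) for d, infr in routes)

    print(aggregate_score([(1500.0, ["red_light"]), (900.0, [])]))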
D. LfD and IRL for AD applications
Early work on Behavior Cloning (BC) for driving cars in [109], [110] presented agents that learn from demonstrations (LfD) and try to mimic the behavior of an expert. BC is typically implemented as supervised learning, and accordingly it is hard for BC to adapt to new, unseen situations. An architecture for learning a convolutional neural network end to end in the self-driving car domain was proposed in [111], [112]. The CNN is trained to map raw pixels from a single front-facing camera directly to steering commands. Using a relatively small training dataset from humans/experts, the system learns to drive in traffic on local roads with or without lane markings and on highways. The network learns image representations that detect the road successfully, without being explicitly trained to do so. Authors of [113] proposed to learn the optimization of comfortable driving trajectories from expert demonstrations by human drivers using Maximum Entropy Inverse RL. Authors of [114] used a DQN as the refinement step in IRL to extract the rewards, in an effort to learn human-like lane change behavior.
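As a toy illustration of the behavior cloning setup described above (a sketch, not the cited systems), the snippet below fits a regressor from low-dimensional state features to expert steering commands; in [111] a CNN over raw camera pixels plays the role of this linear model, and all quantities here are synthetic assumptions.

    import numpy as np

    # Synthetic expert demonstrations: state features (e.g. lateral offset,
    # heading error) paired with the expert's steering command.
    rng = np.random.default_rng(0)
    states = rng.standard_normal((1000, 2))
    expert_steering = -1.5 * states[:, 0] - 0.8 * states[:, 1]  # hidden expert rule

    # Behavior cloning as ordinary supervised regression: fit parameters w
    # mapping states to steering by least squares (a CNN over raw pixels
    # would replace this linear model in an end-to-end system).
    X = np.hstack([states, np.ones((len(states), 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(X, expert_steering, rcond=None)

    def cloned_policy(state):
        return np.append(state, 1.0) @ w  # predicted steering command

    print(cloned_policy(np.array([0.2, -0.1])))

As discussed later in Section VI-D, such a cloned policy can drift to states absent from the demonstrations, which motivates data aggregation schemes.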
VI. REAL WORLD CHALLENGES AND FUTURE PERSPECTIVES
In this section, challenges for conducting reinforcement
learning for real-world autonomous driving are presented
and discussed along with the related research approaches for
solving them.
A. Validating RL systems
Henderson et al. [115] described challenges in validating reinforcement learning methods, focusing on policy gradient methods for continuous control such as PPO, DDPG and TRPO, as well as on reproducing benchmarks. They demonstrate with real examples that implementations often have varying code-bases and different hyper-parameter values, and that unprincipled ways of estimating the top-k rollouts can lead to incoherent interpretations of the performance of reinforcement learning algorithms, and furthermore of how well they generalize. The authors conclude that evaluation could be performed either on a well-defined common setup or on real-world tasks. Authors in [116] proposed the automated generation of challenging and rare driving scenarios in high-fidelity photo-realistic simulators. These adversarial scenarios are automatically discovered by parameterising the behavior of pedestrians and other vehicles on the road. Moreover, it is shown that adding these scenarios to the training data of imitation learning increases safety.
Fig. 4. Bird's Eye View (BEV) 2D representation of a driving scene. Left: an occupancy grid. Right: semantic information (traffic lights) combined with past (red) and projected (green) trajectories. The ego car is represented by a green rectangle in both images.
AD Task: Lane Keep
(D)RL method & description: 1. Authors [82] propose a DRL system for discrete actions (DQN) and continuous actions (DDAC) using the TORCS simulator (see Table II). 2. Authors [83] learn discretised and continuous policies using DQNs and Deep Deterministic Actor Critic (DDAC) to follow the lane and maximize average velocity.
Improvements & Tradeoffs: 1. The study concludes that continuous actions provide smoother trajectories, though on the negative side they lead to more restricted termination conditions and slower convergence. 2. Removing memory replay in DQNs helps faster convergence and better performance; one-hot encoding of the action space resulted in abrupt steering control, while DDAC's continuous policy smooths the actions and provides better performance.

AD Task: Lane Change
(D)RL method & description: Authors [84] use Q-learning to learn a policy for the ego-vehicle to perform no operation, lane change to left/right, or accelerate/decelerate.
Improvements & Tradeoffs: This approach is more robust than traditional approaches, which consist in defining fixed way points, velocity profiles and the curvature of the path to be followed by the ego vehicle.

AD Task: Ramp Merging
(D)RL method & description: Authors [85] propose recurrent architectures, namely LSTMs, to model long-term dependencies for the ego vehicle merging onto a highway.
Improvements & Tradeoffs: Past history of the state information is used to perform the merge more robustly.

AD Task: Overtaking
(D)RL method & description: Authors [86] propose a multi-goal RL policy, learnt by Q-Learning or Double-Action Q-Learning (DAQL), that determines individual action decisions based on whether the other vehicle interacts with the agent for that particular goal.
Improvements & Tradeoffs: Improved speed for lane keeping and overtaking with collision avoidance.

AD Task: Intersections
(D)RL method & description: Authors of [87] use a DQN to evaluate the Q-value of state-action pairs to negotiate intersections.
Improvements & Tradeoffs: The Creep-Go actions defined by the authors enable the vehicle to maneuver intersections with restricted space and visibility more safely.

AD Task: Motion Planning
(D)RL method & description: Authors [88] propose an improved A* algorithm that learns a heuristic function using deep neural networks over an image-based input obstacle map.
Improvements & Tradeoffs: Smooth control behavior of the vehicle and better performance compared to a multi-step DQN.

TABLE I
LIST OF AD TASKS THAT REQUIRE (D)RL TO LEARN A POLICY OR BEHAVIOR.
B. Bridging the simulation-reality gap
Simulation-to-real-world transfer learning is an active domain, since simulations are a source of large amounts of cheap data with perfect annotations. Authors of [117] train a robot arm to grasp objects in the real world by performing domain adaptation from simulation to reality, at both the feature level and the pixel level. The vision-based grasping system achieved comparable performance with 50 times fewer real-world samples. Authors in [118] randomized the dynamics of the simulator during training. The resulting policies were capable of generalising to different dynamics without requiring retraining on the real system. In the domain of autonomous driving, authors of [119] train an A3C agent using simulation-to-real translated images of the driving environment. The trained policy was then evaluated on a real-world driving dataset.
Authors in [120] addressed the issue of performing imitation learning in simulation such that it transfers well to real-world images. They achieved this by unsupervised domain translation between simulated and real-world images, which enables learning to predict steering in the real-world domain with ground truth only from the simulated domain. The authors remark that there were no pairwise correspondences between images in the simulated training set and the unlabelled real-world image set. Similarly, [121] performs domain adaptation to map real-world images to simulated images. In contrast to sim-to-real methods, they handle the reality gap during deployment of agents in real scenarios by adapting the real camera streams to the synthetic modality, so as to map the unfamiliar or unseen features of real images back into the simulated environment and states in which the agents have already learnt a policy.
C. Sample efficiency
Animals are usually able to learn new tasks in just a few
trials, benefiting from their prior knowledge about the environ-
ment. However, one of the key challenges for reinforcement
learning is sample efficiency. The learning process requires too
many samples to learn a reasonable policy. This issue becomes
more noticeable when valuable experience is
Simulator | Description
CARLA [78] | Urban simulator, camera & LIDAR streams, with depth & semantic segmentation, location information
TORCS [96] | Racing simulator, camera stream, agent positions, testing control policies for vehicles
AIRSIM [97] | Camera stream with depth and semantic segmentation, support for drones
GAZEBO (ROS) [98] | Multi-robot physics simulator employed for path planning & vehicle control in complex 2D & 3D maps
SUMO [99] | Macro-scale modelling of traffic in cities; used together with motion planning simulators
DeepDrive [100] | Driving simulator based on Unreal, providing multi-camera (eight) streams with depth
Constellation [101] | NVIDIA DRIVE Constellation simulates camera, LIDAR & radar for AD (proprietary)
MADRaS [102] | Multi-Agent Autonomous Driving Simulator built on top of TORCS
Flow [103] | Multi-Agent Traffic Control Simulator built on top of SUMO
Highway-env [104] | A gym-based environment that provides a simulator for highway-based road topologies
Carcraft | Waymo's simulation environment (proprietary)

TABLE II
SIMULATORS FOR RL APPLICATIONS IN ADVANCED DRIVING ASSISTANCE SYSTEMS (ADAS) AND AUTONOMOUS DRIVING.
expensive or even risky to acquire. In the case of robot control
and autonomous driving, sample efficiency is a difficult issue
due to the delayed and sparse rewards found in typical settings,
along with an unbalanced distribution of observations in a
large state space.
Reward shaping enables the agent to learn intermediate
goals by designing a more frequent reward function to en-
courage the agent to learn faster from fewer samples. Authors
in [122] design a second "trauma" replay memory that contains
only collision situations in order to pool positive and negative
experiences at each training step.
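To illustrate the kind of dense, hand-shaped reward discussed here (a sketch with illustrative terms and weights, not a reward taken from the cited works), a lane-keeping reward might combine forward progress with penalties for lane deviation, harsh actuation and collisions:

    def shaped_reward(progress_m, lane_offset_m, jerk, collided,
                      w_progress=1.0, w_offset=0.5, w_jerk=0.1,
                      crash_penalty=100.0):
        """Dense shaped reward for lane keeping (illustrative weights).

        progress_m    : forward distance gained this step
        lane_offset_m : absolute lateral distance from the lane centre
        jerk          : magnitude of the change in actuation this step
        collided      : whether a collision occurred this step
        """
        reward = (w_progress * progress_m
                  - w_offset * lane_offset_m
                  - w_jerk * jerk)
        if collided:
            reward -= crash_penalty  # strong negative signal for collision events
        return reward

    print(shaped_reward(progress_m=1.2, lane_offset_m=0.3, jerk=0.05, collided=False))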
IL-bootstrapped RL: Further efficiency can be achieved when the agent first learns an initial policy offline by performing imitation learning on roll-outs provided by an expert. The agent can then self-improve by applying RL while interacting with the environment.
Actor Critic with Experience Replay (ACER) [123], is a
sample-efficient policy gradient algorithm that makes use of a
replay buffer, enabling it to perform more than one gradient
update using each piece of sampled experience, as well as a
trust region policy optimization method.
Transfer learning is another approach to sample efficiency, which enables the reuse of a policy previously trained on a source task to initialize the learning of a target task. The policy composition approach presented in [124] proposes composing previously learned basis policies so that they can be reused for a novel task, which leads to faster learning of new policies. A survey on transfer learning in RL is presented in [125]. The multi-fidelity reinforcement learning (MFRL) framework [106] was shown to transfer heuristics that guide exploration in high-fidelity simulators and to find near-optimal policies for the real world with fewer real-world samples. Authors in [126] transferred policies learnt by DQN agents on simulated intersections to real-world examples.
Meta-learning algorithms enable agents to adapt to new tasks and learn new skills rapidly from small amounts of experience, benefiting from their prior knowledge about the world. Authors of [127] addressed this issue by training a recurrent neural network on a set of interrelated tasks, where the network input includes the action selected in addition to the reward received in the previous time step. Accordingly, the agent is trained to learn to exploit the structure of the problem dynamically and solve new problems by adjusting its hidden state. A similar approach for designing RL algorithms is presented in [128]: rather than hand-designing a "fast" reinforcement learning algorithm, it is represented as a recurrent neural network and learned from data. In Model-Agnostic Meta-Learning (MAML), proposed in [129], the meta-learner seeks an initialisation of the parameters of a neural network that can be adapted quickly to a new task using only a few examples. Reptile [130] follows a similar first-order approach, and the authors of [131] present a simple gradient-based meta-learning algorithm for adaptation in non-stationary and competitive environments.
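A minimal sketch of the first-order flavour of these ideas ([130]): an initialisation is meta-trained over a family of related 1-D regression tasks so that a few gradient steps adapt it to a new task. The task family, linear model and step sizes below are illustrative assumptions, not an autonomous driving setup.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_task():
        """A task is a random linear function y = a*x + b (illustrative family)."""
        a, b = rng.uniform(-2, 2), rng.uniform(-1, 1)
        return lambda x: a * x + b

    def grad(theta, xs, ys):
        """Gradient of the MSE loss for the linear model y_hat = theta[0]*x + theta[1]."""
        err = theta[0] * xs + theta[1] - ys
        return np.array([np.mean(2 * err * xs), np.mean(2 * err)])

    theta = np.zeros(2)              # meta-learned initialisation
    alpha, beta, k = 0.05, 0.01, 5   # inner step size, outer step size, inner steps

    for _ in range(2000):            # outer (meta-training) loop over tasks
        task = sample_task()
        xs = rng.uniform(-1, 1, 20)
        ys = task(xs)
        adapted = theta.copy()
        for _ in range(k):           # inner-loop adaptation to the sampled task
            adapted -= alpha * grad(adapted, xs, ys)
        # First-order (Reptile-style) meta-update: move the initialisation
        # towards the adapted parameters, ignoring second-order terms.
        theta += beta * (adapted - theta)

    print("meta-learned initialisation:", theta)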
Efficient state representations: World models proposed in [132] learn a compressed spatial and temporal representation of the environment using VAEs. A compact and simple policy is then learned directly from the compressed state representation.
D. Exploration issues with Imitation
In imitation learning, the agent makes use of trajectories
provided by an expert. However, the distribution of states
the expert encounters usually does not cover all the states
the trained agent may encounter during testing. Furthermore, imitation assumes that the observation-action pairs are independent and identically distributed (i.i.d.), an assumption that is violated since the agent's actions influence the future states it visits. One solution consists in using the
Data Aggregation (DAgger) methods [133] where the end-
to-end learned policy is executed, and extracted observation-
action pairs are again labelled by the expert, and aggregated
to the original expert observation-action dataset. Thus, iter-
atively collecting training examples from both reference and
trained policies explores more valuable states and solves this
lack of exploration. Following work on Search-based Struc-
tured Prediction (SEARN) [133], Stochastic Mixing Iterative
Learning (SMILE) trains a stochastic stationary policy over
several iterations and then makes use of a geometric stochastic
mixing of the policies trained. In a standard imitation learning
scenario, the demonstrator is required to cover sufficient states
so as to avoid unseen states during test. This constraint is
costly and requires frequent human intervention. More recently, ChauffeurNet [134] demonstrated the limits of imitation learning, where even 30 million state-action samples were insufficient to learn an optimal policy mapping bird's-eye view images (states) to controls (actions). The authors propose augmenting the demonstrations with simulated examples that introduce perturbations and a higher diversity of scenarios, such as collisions and/or going off the road. Their model includes an agent RNN that outputs the waypoint, agent box position and heading at each iteration. Authors of [135] also identify limits of imitation learning and train a DNN end-to-end using the ego vehicle's raw camera image and the 2D and 3D locations of neighboring vehicles as input, to simultaneously predict the ego-vehicle action as well as neighbouring vehicle trajectories.
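To make the DAgger procedure described above concrete, the following is a minimal toy sketch (the 1-D dynamics, linear expert and linear learner are illustrative assumptions, not the setup of [133]):

    import numpy as np

    rng = np.random.default_rng(0)

    def expert_action(x):
        """Queryable expert: a stabilising linear controller (illustrative)."""
        return -0.8 * x

    def rollout(policy_w, steps=50):
        """Roll out the current learned linear policy on toy 1-D dynamics."""
        x, visited = 2.0, []
        for _ in range(steps):
            visited.append(x)
            u = policy_w * x
            x = 1.1 * x + u + 0.05 * rng.standard_normal()
        return np.array(visited)

    # DAgger loop: execute the learned policy, let the expert relabel the
    # states it visits, aggregate them with the dataset, and refit the policy.
    states = 2.0 * rng.standard_normal(20)        # initial expert-distribution states
    actions = expert_action(states)
    policy_w = 0.0
    for _ in range(10):
        # Least-squares fit of the linear policy u = w * x on aggregated data.
        policy_w = float(np.dot(states, actions) / np.dot(states, states))
        new_states = rollout(policy_w)            # states visited by the learner
        new_actions = expert_action(new_states)   # expert relabels those states
        states = np.concatenate([states, new_states])
        actions = np.concatenate([actions, new_actions])

    print("learned gain:", policy_w, "(expert gain: -0.8)")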
E. Intrinsic Reward functions
In controlled simulated environments such as games, an
explicit reward signal is given to the agent along with its sensor
stream. However, in real-world robotics and autonomous driving, deriving and designing a good reward function is essential so that the desired behaviour may be learned. The most common solution has been reward shaping [136], which consists in supplying additional well-designed rewards to the agent to steer the optimization towards the optimal policy. Rewards, as pointed out earlier in the paper, can be estimated by inverse RL (IRL) [137], which depends on expert demonstrations. In the absence of explicit reward shaping and expert demonstrations, agents can use intrinsic rewards or intrinsic motivation [138] to evaluate whether their actions were good or not. Authors of [139] define curiosity as the error in an agent's ability to predict the consequences of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. In [140], the agent learns a next-state predictor model from its experience, and uses the error of the prediction as an intrinsic reward. This enables the agent to determine what could be a useful behavior even without extrinsic rewards.
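A minimal sketch of prediction-error-based intrinsic reward in the spirit of [140] (the scalar dynamics and linear forward model below are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    def env_step(s, a):
        """Unknown environment dynamics experienced by the agent (illustrative)."""
        return 0.9 * s + 0.5 * a + 0.01 * rng.standard_normal()

    # Learned forward model s_hat' = w[0]*s + w[1]*a, trained online.
    w, lr = np.zeros(2), 0.01

    s = 0.0
    for t in range(500):
        a = rng.uniform(-1, 1)                  # exploratory action
        s_next = env_step(s, a)
        error = (w[0] * s + w[1] * a) - s_next  # forward-model prediction error
        intrinsic_reward = error ** 2           # curiosity bonus: surprise
        w -= lr * 2 * error * np.array([s, a])  # online gradient step on the error
        s = s_next

    print("learned forward model weights:", w)  # should approach the true (0.9, 0.5)

As the forward model improves in familiar states, the intrinsic reward shrinks there and remains high only for novel transitions, which is what drives exploration.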
F. Incorporating safety in DRL
Directly deploying an autonomous vehicle in real environments after training could be dangerous. Different approaches to incorporating safety into DRL algorithms are presented here. For imitation learning based systems, Safe DAgger [141] introduces a safety policy that learns to predict the error made by a primary policy, trained initially with the supervised learning approach, without querying a reference policy. This additional safety policy takes both the partial observation of a state and the primary policy as inputs, and returns a binary label indicating whether the primary policy is likely to deviate from the reference policy without querying it. Authors of [142] addressed safety in multi-agent reinforcement learning for autonomous driving, where a balance is maintained between handling the unexpected behavior of other drivers or pedestrians and not being too defensive, so that normal traffic flow is achieved. While hard constraints are maintained to guarantee the safety of driving, the problem is decomposed into a composition of a policy for desires, to enable comfortable driving, and trajectory planning. Deep reinforcement learning algorithms for control, such as DDPG, are combined with safety-based control in [143], including the artificial potential field method that is widely used for robot path planning. Using the TORCS environment, DDPG is first applied to learn a driving policy in a stable and familiar environment; the policy network and safety-based control are then combined to avoid collisions. It was found that the combination of DRL and safety-based control performs well in most scenarios. In order to enable DRL to escape local optima, speed up the training process and avoid dangerous conditions or accidents, the Survival-Oriented Reinforcement Learning (SORL) model is proposed in [144], where survival is favored over maximizing total reward by modeling the autonomous driving problem as a constrained MDP and introducing a Negative-Avoidance Function to learn from previous failures. The SORL model was found not to be sensitive to the reward function and can use different DRL algorithms such as DDPG. Furthermore, a comprehensive survey on safe reinforcement learning can be found in [145] for interested readers.
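The combination of a learned policy with a rule-based safety layer, as in the approaches above, can be sketched as a simple override mechanism; the observation fields, time-to-collision check, threshold and fallback controller below are illustrative assumptions rather than any of the cited designs.

    def learned_policy(obs):
        """Placeholder for the DRL policy's (steering, acceleration) output."""
        return obs["suggested_steering"], obs["suggested_acceleration"]

    def safety_controller(obs):
        """Rule-based fallback: hold the lane and brake (illustrative)."""
        return 0.0, -3.0

    def safe_act(obs, ttc_threshold_s=2.0):
        """Execute the learned action unless the safety check vetoes it."""
        steering, accel = learned_policy(obs)
        if obs["time_to_collision_s"] < ttc_threshold_s:
            # Learned action deemed unsafe: fall back to the safety controller.
            steering, accel = safety_controller(obs)
        return steering, accel

    obs = {"suggested_steering": 0.1, "suggested_acceleration": 1.0,
           "time_to_collision_s": 1.2}
    print(safe_act(obs))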
G. Multi-agent reinforcement learning
Autonomous driving is a fundamentally multi-agent task;
as well as the ego vehicle being controlled by an agent,
there will also be many other actors present in simulated
and real world autonomous driving settings, such as pedes-
trians, cyclists and other vehicles. Therefore, the continued
development of explicitly multi-agent approaches to learning
to drive autonomous vehicles is an important future research
direction. Several prior methods have already approached the
autonomous driving problem using a MARL perspective, e.g.
[142], [146]–[149].
One important area where MARL techniques could be very
beneficial is in high-level decision making and coordination
between groups of autonomous vehicles, in scenarios such
as overtaking in highway scenarios [149], or negotiating
intersections without signalised control. Another area where
MARL approaches could be of benefit is in the development
of adversarial agents for testing autonomous driving policies
before deployment [148], i.e. agents controlling other vehicles
in a simulation that learn to expose weaknesses in the be-
haviour of autonomous driving policies by acting erratically or
against the rules of the road. Finally, MARL approaches could
potentially have an important role to play in developing safe
policies for autonomous driving [142], as discussed earlier.
VII. CONCLUSION
Reinforcement learning is still an active and emerging area
in real-world autonomous driving applications. Although there are a few successful commercial applications, there is very little literature and there are few large-scale public datasets available. Thus we were motivated to formalize and organize RL applications for autonomous driving. Autonomous driving scenarios involve interacting agents and require negotiation and dynamic decision making, which suits RL. However, there are many challenges to be resolved in order to have mature solutions, which we discuss in detail. In this work, a detailed overview of reinforcement learning theory is presented, along with a comprehensive literature survey on applying RL to autonomous driving tasks.
Challenges, future research directions and opportunities are discussed in section VI. This includes: validating the performance of RL-based systems, bridging the simulation-reality gap, sample efficiency, designing good reward functions, and incorporating safety into decision-making RL systems for autonomous agents.
Framework | Description
OpenAI Baselines [150] | Set of high-quality implementations of different RL and DRL algorithms. The main goal of these baselines is to make it easier for the research community to replicate, refine and create reproducible research.
Unity ML Agents Toolkit [151] | Implements core RL algorithms, games and simulation environments for training RL- or IL-based agents.
RL Coach [152] | Intel AI Lab's modular implementation of RL algorithms, with simple integration of new environments by extending and reusing existing components.
Tensorflow Agents [153] | RL algorithms package, with bandits, from TensorFlow.
rlpyt [154] | Implements the deep Q-learning and policy gradient algorithm families in a single Python package.
bsuite [155] | DeepMind's Behaviour Suite for Reinforcement Learning, which aims at defining metrics for RL agents and automating evaluation and analysis.

TABLE III
OPEN-SOURCE FRAMEWORKS AND PACKAGES FOR STATE-OF-THE-ART RL/DRL ALGORITHMS AND EVALUATION.
Reinforcement learning results are usually difficult to re-
produce and are highly sensitive to hyper-parameter choices,
which are often not reported in detail. Both researchers and
practitioners need to have a reliable starting point where
the well-known reinforcement learning algorithms are implemented, documented and well tested. These frameworks are covered in Table III.
The development of explicitly multi-agent reinforcement
learning approaches to the autonomous driving problem is
also an important future challenge that has not received a lot
of attention to date. MARL techniques have the potential to
make coordination and high-level decision making between
groups of autonomous vehicles easier, as well as providing
new opportunities for testing and validating the safety of
autonomous driving policies.
Furthermore, the implementation of RL algorithms is a challenging task for researchers and practitioners. This work presents examples of well-known and active open-source RL frameworks that provide well-documented implementations, enabling the use, evaluation and extension of different RL algorithms. Finally, we hope that this overview paper encourages further research and applications.
A2C, A3C Advantage Actor Critic, Asynchronous A2C
BC Behavior Cloning
DDPG Deep DPG
DP Dynamic Programming
DPG Deterministic PG
DQN Deep Q-Network
DRL Deep RL
IL Imitation Learning
IRL Inverse RL
LfD Learning from Demonstration
MAML Model-Agnostic Meta-Learning
MARL Multi-Agent RL
MDP Markov Decision Process
MOMDP Multi-Objective MDP
MOSG Multi-Objective SG
PG Policy Gradient
POMDP Partially Observed MDP
PPO Proximal Policy Optimization
QL Q-Learning
RRT Rapidly-exploring Random Trees
SG Stochastic Game
SMDP Semi-Markov Decision Process
TDL Time Difference Learning
TRPO Trust Region Policy Optimization
TABLE IV
ACRONYMS RELATED TO REINFORCEMENT LEARNING (RL).
REFERENCES
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction
(Second Edition). MIT Press, 2018. 1,4,5,6
[2] V. Talpaert., I. Sobh., B. R. Kiran., P. Mannion., S. Yogamani., A. El-
Sallab., and P. Perez., “Exploring applications of deep reinforcement
learning for real-world autonomous driving systems,” in Proceedings of
the 14th International Joint Conference on Computer Vision, Imaging
and Computer Graphics Theory and Applications - Volume 5 VISAPP:
VISAPP,, INSTICC. SciTePress, 2019, pp. 564–572. 1
[3] M. Siam, S. Elkerdawy, M. Jagersand, and S. Yogamani, “Deep
semantic segmentation for automated driving: Taxonomy, roadmap and
challenges,” in 2017 IEEE 20th international conference on intelligent
transportation systems (ITSC). IEEE, 2017, pp. 1–8. 2
[4] K. El Madawi, H. Rashed, A. El Sallab, O. Nasr, H. Kamel, and
S. Yogamani, “Rgb and lidar fusion based 3d semantic segmentation for
autonomous driving,” in 2019 IEEE Intelligent Transportation Systems
Conference (ITSC). IEEE, 2019, pp. 7–12. 2
[5] M. Siam, H. Mahgoub, M. Zahran, S. Yogamani, M. Jagersand, and
A. El-Sallab, “Modnet: Motion and appearance based moving object
detection network for autonomous driving,” in 2018 21st International
Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018,
pp. 2859–2864. 2
[6] V. R. Kumar, S. Milz, C. Witt, M. Simon, K. Amende, J. Petzold,
S. Yogamani, and T. Pech, “Monocular fisheye camera depth estimation
using sparse lidar supervision,” in 2018 21st International Conference
on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2853–
2858. 2
[7] M. Uřičář, P. Křížek, G. Sistu, and S. Yogamani, "Soilingnet: Soiling detection on automotive surround-view cameras," in 2019 IEEE Intelligent Transportation Systems Conference (ITSC). IEEE, 2019, pp. 67–72. 2
[8] G. Sistu, I. Leang, S. Chennupati, S. Yogamani, C. Hughes, S. Milz, and
S. Rawashdeh, “Neurall: Towards a unified visual perception model for
automated driving,” in 2019 IEEE Intelligent Transportation Systems
Conference (ITSC). IEEE, 2019, pp. 796–803. 2
[9] S. Yogamani, C. Hughes, J. Horgan, G. Sistu, P. Varley, D. O’Dea,
M. Uricár, S. Milz, M. Simon, K. Amende et al., “Woodscape: A
multi-task, multi-camera fisheye dataset for autonomous driving,” in
Proceedings of the IEEE International Conference on Computer Vision,
2019, pp. 9308–9318. 2
[10] S. Milz, G. Arbeiter, C. Witt, B. Abdallah, and S. Yogamani, “Visual
slam for automated driving: Exploring the applications of deep learn-
ing,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition Workshops, 2018, pp. 247–257. 2
[11] S. M. LaValle, Planning Algorithms. New York, NY, USA: Cambridge
University Press, 2006. 2
[12] S. M. LaValle and J. James J. Kuffner, “Randomized kinodynamic
planning,” The International Journal of Robotics Research, vol. 20,
no. 5, pp. 378–400, 2001. 2
[13] T. D. Team. Dimensions publication trends. [Online]. Available:
https://app.dimensions.ai/discover/publication 3
[14] Y. Kuwata, J. Teo, G. Fiore, S. Karaman, E. Frazzoli, and J. P. How,
“Real-time motion planning with applications to autonomous urban
driving,” IEEE Transactions on Control Systems Technology, vol. 17,
no. 5, pp. 1105–1118, 2009. 3
[15] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, "A survey of
motion planning and control techniques for self-driving urban vehicles,”
IEEE Transactions on intelligent vehicles, vol. 1, no. 1, pp. 33–55,
2016. 3
[16] W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision-
making for autonomous vehicles,” Annual Review of Control, Robotics,
and Autonomous Systems, no. 0, 2018. 3
[17] S. Kuutti, R. Bowden, Y. Jin, P. Barber, and S. Fallah, “A survey
of deep learning applications to autonomous vehicle control,” IEEE
Transactions on Intelligent Transportation Systems, 2020. 3
[18] T. M. Mitchell, Machine learning, ser. McGraw-Hill series in computer
science. Boston (Mass.), Burr Ridge (Ill.), Dubuque (Iowa): McGraw-
Hill, 1997. 3
[19] S. J. Russell and P. Norvig, Artificial intelligence: a modern approach
(3rd edition). Prentice Hall, 2009. 3
[20] Z.-W. Hong, T.-Y. Shann, S.-Y. Su, Y.-H. Chang, T.-J. Fu, and C.-
Y. Lee, “Diversity-driven exploration strategy for deep reinforcement
learning,” in Advances in Neural Information Processing Systems 31,
S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,
and R. Garnett, Eds., 2018, pp. 10 489–10 500. 3
[21] M. Wiering and M. van Otterlo, Eds., Reinforcement Learning: State-
of-the-Art. Springer, 2012. 4
[22] M. L. Puterman, Markov Decision Processes: Discrete Stochastic
Dynamic Programming, 1st ed. New York, NY, USA: John Wiley
& Sons, Inc., 1994. 4
[23] C. J. Watkins and P. Dayan, “Technical note: Q-learning,” Machine
Learning, vol. 8, no. 3-4, 1992. 4
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski
et al., “Human-level control through deep reinforcement learning,”
Nature, vol. 518, 2015. 4,6
[25] C. J. C. H. Watkins, “Learning from delayed rewards,” Ph.D. disserta-
tion, King’s College, Cambridge, 1989. 4,6
[26] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Ried-
miller, “Deterministic policy gradient algorithms,” in ICML, 2014. 5
[27] R. J. Williams, “Simple statistical gradient-following algorithms for
connectionist reinforcement learning,” Machine Learning, vol. 8, pp.
229–256, 1992. 5
[28] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust
region policy optimization,” in International Conference on Machine
Learning, 2015, pp. 1889–1897. 5
[29] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox-
imal policy optimization algorithms,” arXiv preprint arXiv:1707.06347,
2017. 5
[30] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra, “Continuous control with deep reinforce-
ment learning.” in 4th International Conference on Learning Represen-
tations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference
Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016. 5
[31] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley,
D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep rein-
forcement learning,” in International Conference on Machine Learning,
2016. 5
[32] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement
learning with deep energy-based policies,” in Proceedings of the 34th
International Conference on Machine Learning - Volume 70, ser.
ICML’17. JMLR.org, 2017, pp. 1352–1361. 6
[33] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan,
V. Kumar, H. Zhu, A. Gupta, P. Abbeel et al., “Soft actor-critic
algorithms & applications,” arXiv:1812.05905, 2018. 6
[34] R. S. Sutton, “Integrated architectures for learning, planning, and
reacting based on approximating dynamic programming,” in Machine
Learning Proceedings 1990. Elsevier, 1990. 6
[35] R. I. Brafman and M. Tennenholtz, “R-max-a general polynomial time
algorithm for near-optimal reinforcement learning,” Journal of Machine
Learning Research, vol. 3, no. Oct, 2002. 6
[36] D. Silver, R. S. Sutton, and M. Müller, “Sample-based learning and
search with permanent and transient memories,” in Proceedings of the
25th international conference on Machine learning. ACM, 2008, pp.
968–975. 6
[37] G. A. Rummery and M. Niranjan, “On-line Q-learning using con-
nectionist systems,” Cambridge University Engineering Department,
Cambridge, England, Tech. Rep. TR 166, 1994. 6
[38] R. S. Sutton and A. G. Barto, “Reinforcement learning an introduction–
second edition, in progress (draft),” 2015. 6
[39] R. Bellman, Dynamic Programming. Princeton, NJ, USA: Princeton
University Press, 1957. 6
[40] G. Tesauro, “Td-gammon, a self-teaching backgammon program,
achieves master-level play,” Neural Computing, vol. 6, no. 2, Mar. 1994.
6
[41] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van
Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam,
M. Lanctot et al., “Mastering the game of go with deep neural networks
and tree search,” nature, vol. 529, no. 7587, pp. 484–489, 2016. 6,8
[42] K. Narasimhan, T. Kulkarni, and R. Barzilay, “Language understanding
for text-based games using deep reinforcement learning,” in Pro-
ceedings of the 2015 Conference on Empirical Methods in Natural
Language Processing. Association for Computational Linguistics,
2015, pp. 1–11. 7
[43] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience
replay,” arXiv preprint arXiv:1511.05952, 2015. 7
[44] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning
with double q-learning.” in AAAI, vol. 16, 2016, pp. 2094–2100. 7
[45] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and
N. De Freitas, “Dueling network architectures for deep reinforcement
learning,” arXiv preprint arXiv:1511.06581, 2015. 7
[46] M. Hausknecht and P. Stone, “Deep recurrent q-learning for partially
observable mdps,” CoRR, abs/1507.06527, 2015. 7
[47] B. F. Skinner, The behavior of organisms: An experimental analysis.
Appleton-Century, 1938. 7
[48] E. Wiewiora, “Reward shaping,” in Encyclopedia of Machine Learning
and Data Mining, C. Sammut and G. I. Webb, Eds. Boston, MA:
Springer US, 2017, pp. 1104–1106. 7
[49] J. Randløv and P. Alstrøm, “Learning to drive a bicycle using re-
inforcement learning and shaping,” in Proceedings of the Fifteenth
International Conference on Machine Learning, ser. ICML ’98. San
Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp.
463–471. 7
[50] D. H. Wolpert, K. R. Wheeler, and K. Tumer, “Collective intelligence
for control of distributed dynamical systems,” EPL (Europhysics Let-
ters), vol. 49, no. 6, p. 708, 2000. 8
[51] A. Y. Ng, D. Harada, and S. J. Russell, “Policy invariance under
reward transformations: Theory and application to reward shaping,”
in Proceedings of the Sixteenth International Conference on Machine
Learning, ser. ICML ’99, 1999, pp. 278–287. 8
[52] S. Devlin and D. Kudenko, “Theoretical considerations of potential-
based reward shaping for multi-agent systems,” in Proceedings of the
10th International Conference on Autonomous Agents and Multiagent
Systems (AAMAS), 2011. 8
[53] P. Mannion, S. Devlin, K. Mason, J. Duggan, and E. Howley, “Policy
invariance under reward transformations for multi-objective reinforce-
ment learning,” Neurocomputing, vol. 263, 2017. 8
[54] M. Colby and K. Tumer, “An evolutionary game theoretic analysis of
difference evaluation functions,” in Proceedings of the 2015 Annual
Conference on Genetic and Evolutionary Computation. ACM, 2015,
pp. 1391–1398. 8
[55] P. Mannion, J. Duggan, and E. Howley, “A theoretical and empirical
analysis of reward transformations in multi-objective stochastic games,”
in Proceedings of the 16th International Conference on Autonomous
Agents and Multiagent Systems (AAMAS), 2017. 8
[56] L. Bu¸soniu, R. Babuška, and B. Schutter, “Multi-agent reinforcement
learning: An overview,” in Innovations in Multi-Agent Systems and Ap-
plications - 1, ser. Studies in Computational Intelligence, D. Srinivasan
and L. Jain, Eds. Springer Berlin Heidelberg, 2010, vol. 310. 8
[57] P. Mannion, K. Mason, S. Devlin, J. Duggan, and E. Howley, “Multi-
objective dynamic dispatch optimisation using multi-agent reinforce-
ment learning,” in Proceedings of the 15th International Conference
on Autonomous Agents and Multiagent Systems (AAMAS), 2016. 8
[58] K. Mason, P. Mannion, J. Duggan, and E. Howley, “Applying multi-
agent reinforcement learning to watershed management,” in Proceed-
ings of the Adaptive and Learning Agents workshop (at AAMAS 2016),
2016. 8
[59] V. Pareto, Manual of political economy. OUP Oxford, 1906. 8
[60] D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley, “A survey
of multi-objective sequential decision-making,” Journal of Artificial
Intelligence Research, vol. 48, pp. 67–113, 2013. 8
[61] R. Rădulescu, P. Mannion, D. M. Roijers, and A. Nowé, "Multi-
objective multi-agent decision making: a utility-based analysis and
survey,” Autonomous Agents and Multi-Agent Systems, vol. 34, no. 1,
p. 10, 2020. 8
[62] T. Lesort, N. Diaz-Rodriguez, J.-F. Goudou, and D. Filliat, “State
representation learning for control: An overview,” Neural Networks,
vol. 108, pp. 379 – 392, 2018. 8
[63] A. Raffin, A. Hill, K. R. Traoré, T. Lesort, N. D. Rodríguez, and
D. Filliat, “Decoupling feature extraction from policy learning: assess-
ing benefits of state representation learning in goal based robotics,”
CoRR, vol. abs/1901.08651, 2019. 8
[64] W. Böhmer, J. T. Springenberg, J. Boedecker, M. Riedmiller, and
K. Obermayer, “Autonomous learning of state representations for
control: An emerging field aims to autonomously learn state represen-
tations for reinforcement learning agents from their real-world sensor
observations,” KI-Künstliche Intelligenz, vol. 29, no. 4, pp. 353–362,
2015. 8
[65] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang,
A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the
game of go without human knowledge,” Nature, vol. 550, no. 7676, p.
354, 2017. 8
[66] P. Abbeel and A. Y. Ng, “Exploration and apprenticeship learning
in reinforcement learning,” in Proceedings of the 22nd international
conference on Machine learning. ACM, 2005, pp. 1–8. 9
[67] B. Kang, Z. Jie, and J. Feng, “Policy optimization with demonstra-
tions,” in International Conference on Machine Learning, 2018, pp.
2474–2483. 9
[68] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot,
D. Horgan, J. Quan, A. Sendonaris, I. Osband et al., “Deep q-learning
from demonstrations,” in Thirty-Second AAAI Conference on Artificial
Intelligence, 2018. 9
[69] S. Ibrahim and D. Nevin, “End-to-end framework for fast learning
asynchronous agents,” in the 32nd Conference on Neural Information
Processing Systems, Imitation Learning and its Challenges in Robotics
workshop, 2018. 9
[70] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse rein-
forcement learning,” in Proceedings of the twenty-first international
conference on Machine learning. ACM, 2004, p. 1. 9
[71] A. Y. Ng, S. J. Russell et al., “Algorithms for inverse reinforcement
learning.” in ICML, 2000. 9
[72] J. Ho and S. Ermon, “Generative adversarial imitation learning,” in
Advances in Neural Information Processing Systems, 2016, pp. 4565–
4573. 9
[73] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,”
in Advances in Neural Information Processing Systems 27, 2014, pp.
2672–2680. 9
[74] M. Uřičář, P. Křížek, D. Hurych, I. Sobh, S. Yogamani, and P. Denny,
“Yes, we gan: Applying adversarial techniques for autonomous driv-
ing,” Electronic Imaging, vol. 2019, no. 15, pp. 48–1, 2019. 9
[75] E. Leurent, Y. Blanco, D. Efimov, and O.-A. Maillard, “A survey of
state-action representations for autonomous driving,” HAL archives,
2018. 9
[76] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving
models from large-scale video datasets,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2017, pp.
2174–2182. 9
[77] R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps:
A framework for temporal abstraction in reinforcement learning,”
Artificial intelligence, vol. 112, no. 1-2, pp. 181–211, 1999. 9
[78] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun,
“CARLA: An open urban driving simulator,” in Proceedings of the
1st Annual Conference on Robot Learning, 2017, pp. 1–16. 9,12
[79] C. Li and K. Czarnecki, “Urban driving with multi-objective deep
reinforcement learning,” in Proceedings of the 18th International Con-
ference on Autonomous Agents and MultiAgent Systems. International
Foundation for Autonomous Agents and Multiagent Systems, 2019, pp.
359–367. 9,10
[80] S. Kardell and M. Kuosku, “Autonomous vehicle control via deep
reinforcement learning,” Master’s thesis, Chalmers University of Tech-
nology, 2017. 9
[81] J. Chen, B. Yuan, and M. Tomizuka, “Model-free deep reinforcement
learning for urban autonomous driving,” in 2019 IEEE Intelligent
Transportation Systems Conference (ITSC). IEEE, 2019, pp. 2765–
2771. 9
[82] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “End-to-end
deep reinforcement learning for lane keeping assist,” in MLITS, NIPS
Workshop, vol. 2, 2016. 11
[83] A.-E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep reinforce-
ment learning framework for autonomous driving,” Electronic Imaging,
vol. 2017, no. 19, pp. 70–76, 2017. 11
[84] P. Wang, C.-Y. Chan, and A. de La Fortelle, “A reinforcement learning
based approach for automated lane change maneuvers,” in 2018 IEEE
Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1379–1384. 11
[85] P. Wang and C.-Y. Chan, “Formulation of deep reinforcement learning
architecture toward autonomous driving for on-ramp merge,” in Intel-
ligent Transportation Systems (ITSC), 2017 IEEE 20th International
Conference on. IEEE, 2017, pp. 1–6. 11
[86] D. C. K. Ngai and N. H. C. Yung, “A multiple-goal reinforcement
learning method for complex vehicle overtaking maneuvers,” IEEE
Transactions on Intelligent Transportation Systems, vol. 12, no. 2, pp.
509–522, 2011. 11
[87] D. Isele, R. Rahimi, A. Cosgun, K. Subramanian, and K. Fujimura,
“Navigating occluded intersections with autonomous vehicles using
deep reinforcement learning,” in 2018 IEEE International Conference
on Robotics and Automation (ICRA). IEEE, 2018, pp. 2034–2039.
10,11
[88] A. Keselman, S. Ten, A. Ghazali, and M. Jubeh, “Reinforcement learn-
ing with a* and a deep heuristic,” arXiv preprint arXiv:1811.07745,
2018. 11
[89] W. Zhan, L. Sun, D. Wang, H. Shi, A. Clausse, M. Naumann, J. Küm-
merle, H. Königshof, C. Stiller, A. de La Fortelle, and M. Tomizuka,
“INTERACTION Dataset: An INTERnational, Adversarial and Coop-
erative moTION Dataset in Interactive Driving Scenarios with Semantic
Maps,” arXiv:1910.03088 [cs, eess], 2019. 10
[90] A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J.-M. Allen, V.-D.
Lam, A. Bewley, and A. Shah, “Learning to drive in a day,” in 2019
International Conference on Robotics and Automation (ICRA). IEEE,
2019, pp. 8248–8254. 10
[91] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller, “Embed
to control: A locally linear latent dynamics model for control from raw
images,” in Advances in neural information processing systems, 2015.
10
[92] N. Wahlström, T. B. Schön, and M. P. Deisenroth, “Learning deep
dynamical models from image pixels,” IFAC-PapersOnLine, vol. 48,
no. 28, pp. 1059–1064, 2015. 10
[93] S. Chiappa, S. Racanière, D. Wierstra, and S. Mohamed, “Recurrent
environment simulators,” in 5th International Conference on Learn-
ing Representations, ICLR 2017, Toulon, France, April 24-26, 2017,
Conference Track Proceedings. OpenReview.net, 2017. 10
[94] B. Recht, “A tour of reinforcement learning: The view from contin-
uous control,” Annual Review of Control, Robotics, and Autonomous
Systems, 2008. 10
[95] H. Mania, A. Guy, and B. Recht, “Simple random search of static
linear policies is competitive for reinforcement learning,” in Advances
in Neural Information Processing Systems 31, S. Bengio, H. Wallach,
H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds.,
2018, pp. 1800–1809. 10
[96] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, and
A. Sumner, “Torcs, the open racing car simulator,” Software available
at http://torcs.sourceforge.net, vol. 4, 2000. 12
[97] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity
visual and physical simulation for autonomous vehicles,” in Field and
Service Robotics. Springer, 2018, pp. 621–635. 12
[98] N. Koenig and A. Howard, “Design and use paradigms for gazebo, an
open-source multi-robot simulator,” in 2004 International Conference
on Intelligent Robots and Systems (IROS), vol. 3. IEEE, 2004, pp.
2149–2154. 12
[99] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flötteröd,
R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner, “Micro-
scopic traffic simulation using sumo,” in The 21st IEEE International
Conference on Intelligent Transportation Systems. IEEE, 2018. 12
[100] C. Quiter and M. Ernst, “deepdrive/deepdrive: 2.0,” Mar. 2018.
[Online]. Available: https://doi.org/10.5281/zenodo.1248998 12
[101] Nvidia, “Drive Constellation now available,” https://blogs.nvidia.com/
blog/2019/03/18/drive-constellation-now-available/, 2019, [accessed
14-April-2019]. 12
[102] A. S. et al., “Multi-Agent Autonomous Driving Simulator built on
top of TORCS,” https://github.com/madras-simulator/MADRaS, 2019,
[Online; accessed 14-April-2019]. 12
[103] C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky, and A. M. Bayen, “Flow:
Architecture and benchmarking for reinforcement learning in traffic
control,” CoRR, vol. abs/1710.05465, 2017. 12
[104] E. Leurent, “A collection of environments for autonomous driv-
ing and tactical decision-making tasks,” https://github.com/eleurent/
highway-env, 2019, [Online; accessed 14-April-2019]. 12
[105] F. Rosique, P. J. Navarro, C. Fernández, and A. Padilla, “A systematic
review of perception system and simulators for autonomous vehicles
research,” Sensors, vol. 19, no. 3, p. 648, 2019. 10
[106] M. Cutler, T. J. Walsh, and J. P. How, “Reinforcement learning with
multi-fidelity simulators,” in 2014 IEEE International Conference on
Robotics and Automation (ICRA). IEEE, 2014, pp. 3888–3895. 10,
12
[107] F. C. German Ros, Vladlen Koltun and A. M. Lopez, “Carla au-
tonomous driving challenge,” https://carlachallenge.org/, 2019, [Online;
accessed 14-April-2019]. 10
[108] W. G. Najm, J. D. Smith, M. Yanagisawa et al., “Pre-crash scenario ty-
pology for crash avoidance research,” United States. National Highway
Traffic Safety Administration, Tech. Rep., 2007. 10
[109] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural
network,” in Advances in neural information processing systems, 1989.
10
[110] D. Pomerleau, “Efficient training of artificial neural networks for
autonomous navigation,” Neural Computation, vol. 3, no. 1, 1991. 10
[111] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp,
P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., “End
to end learning for self-driving cars,” in NIPS 2016 Deep Learning
Symposium, 2016. 10
[112] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner,
L. Jackel, and U. Muller, “Explaining how a deep neural net-
work trained with end-to-end learning steers a car,” arXiv preprint
arXiv:1704.07911, 2017. 10
[113] M. Kuderer, S. Gulati, and W. Burgard, “Learning driving styles for
autonomous vehicles from demonstration,” in Robotics and Automation
(ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp.
2641–2646. 10
[114] S. Sharifzadeh, I. Chiotellis, R. Triebel, and D. Cremers, “Learning
to drive using inverse reinforcement learning and deep q-networks,” in
NIPS Workshops, December 2016. 10
[115] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and
D. Meger, “Deep reinforcement learning that matters,” in Thirty-Second
AAAI Conference on Artificial Intelligence, 2018. 10
[116] Y. Abeysirigoonawardena, F. Shkurti, and G. Dudek, “Generating
adversarial driving scenarios in high-fidelity simulators,” in 2019 IEEE
International Conference on Robotics and Automation (ICRA). ICRA,
2019. 10
[117] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrish-
nan, L. Downs, J. Ibarz, P. Pastor, K. Konolige et al., “Using simulation
and domain adaptation to improve efficiency of deep robotic grasping,”
in 2018 IEEE International Conference on Robotics and Automation
(ICRA). IEEE, 2018, pp. 4243–4250. 11
[118] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-
real transfer of robotic control with dynamics randomization,” in 2018
IEEE international conference on robotics and automation (ICRA).
IEEE, 2018, pp. 1–8. 11
[119] Z. W. Xinlei Pan, Yurong You and C. Lu, “Virtual to real rein-
forcement learning for autonomous driving,” in Proceedings of the
British Machine Vision Conference (BMVC), G. B. Tae-Kyun Kim,
Stefanos Zafeiriou and K. Mikolajczyk, Eds. BMVA Press, September
2017. 11
[120] A. Bewley, J. Rigley, Y. Liu, J. Hawke, R. Shen, V.-D. Lam, and
A. Kendall, “Learning to drive from simulation without real world
labels,” in 2019 International Conference on Robotics and Automation
(ICRA). IEEE, 2019, pp. 4818–4824. 11
[121] J. Zhang, L. Tai, P. Yun, Y. Xiong, M. Liu, J. Boedecker, and
W. Burgard, “Vr-goggles for robots: Real-to-sim domain adaptation for
visual control,” IEEE Robotics and Automation Letters, vol. 4, no. 2,
pp. 1148–1155, 2019. 11
[122] H. Chae, C. M. Kang, B. Kim, J. Kim, C. C. Chung, and J. W.
Choi, “Autonomous braking system via deep reinforcement learning,”
2017 IEEE 20th International Conference on Intelligent Transportation
Systems (ITSC), pp. 1–6, 2017. 12
[123] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and
N. de Freitas, “Sample efficient actor-critic with experience replay,” in
5th International Conference on Learning Representations, ICLR 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings.
OpenReview.net, 2017. 12
[124] R. Liaw, S. Krishnan, A. Garg, D. Crankshaw, J. E. Gonzalez,
and K. Goldberg, “Composing meta-policies for autonomous driv-
ing using hierarchical deep reinforcement learning,” arXiv preprint
arXiv:1711.01503, 2017. 12
[125] M. E. Taylor and P. Stone, “Transfer learning for reinforcement learning
domains: A survey,” Journal of Machine Learning Research, vol. 10,
no. Jul, pp. 1633–1685, 2009. 12
[126] D. Isele and A. Cosgun, “Transferring autonomous driving knowledge
on simulated and real intersections,” in Lifelong Learning: A Reinforce-
ment Learning Approach,ICML WORKSHOP 2017, 2017. 12
[127] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo,
R. Munos, C. Blundell, D. Kumaran, and M. Botvinick, “Learning
to reinforcement learn,” Complete CogSci 2017 Proceedings, 2016. 12
[128] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and
P. Abbeel, “Fast reinforcement learning via slow reinforcement learn-
ing,” arXiv preprint arXiv:1611.02779, 2016. 12
[129] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning
for fast adaptation of deep networks,” in Proceedings of the 34th
International Conference on Machine Learning - Volume 70, ser.
ICML’17. JMLR.org, 2017, p. 1126–1135. 12
[130] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning
algorithms,” CoRR, abs/1803.02999, 2018. 12
[131] M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and
P. Abbeel, “Continuous adaptation via meta-learning in nonstationary
and competitive environments,” in 6th International Conference on
Learning Representations, ICLR 2018, Vancouver, BC, Canada, April
30 - May 3, 2018, Conference Track Proceedings. OpenReview.net,
2018. 12
[132] D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy
evolution,” in Advances in Neural Information Processing Systems,
2018. 12
[133] S. Ross and D. Bagnell, “Efficient reductions for imitation learning,”
in Proceedings of the thirteenth international conference on artificial
intelligence and statistics, 2010, pp. 661–668. 12
[134] M. Bansal, A. Krizhevsky, and A. Ogale, “Chauffeurnet: Learning to
drive by imitating the best and synthesizing the worst,” in Robotics:
Science and Systems XV, 2018. 12
[135] T. Buhet, E. Wirbel, and X. Perrotton, “Conditional vehicle trajectories
prediction in carla urban environment,” in Proceedings of the IEEE
International Conference on Computer Vision Workshops, 2019. 13
[136] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward
transformations: Theory and application to reward shaping,” in ICML,
vol. 99, 1999, pp. 278–287. 13
[137] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse rein-
forcement learning,” in Proceedings of the Twenty-first International
Conference on Machine Learning, ser. ICML ’04. ACM, 2004. 13
[138] N. Chentanez, A. G. Barto, and S. P. Singh, “Intrinsically motivated
reinforcement learning,” in Advances in neural information processing
systems, 2005, pp. 1281–1288. 13
[139] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven
exploration by self-supervised prediction,” in International Conference
on Machine Learning (ICML), vol. 2017, 2017. 13
[140] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A.
Efros, “Large-scale study of curiosity-driven learning,” arXiv preprint
arXiv:1808.04355, 2018. 13
[141] J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-end
simulated driving,” in Proceedings of the Thirty-First AAAI Conference
on Artificial Intelligence, San Francisco, California, USA., 2017, pp.
2891–2897. 13
[142] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-
agent, reinforcement learning for autonomous driving,” arXiv preprint
arXiv:1610.03295, 2016. 13
[143] X. Xiong, J. Wang, F. Zhang, and K. Li, “Combining deep reinforce-
ment learning and safety based control for autonomous driving,” arXiv
preprint arXiv:1612.00147, 2016. 13
[144] C. Ye, H. Ma, X. Zhang, K. Zhang, and S. You, “Survival-oriented re-
inforcement learning model: An effcient and robust deep reinforcement
learning algorithm for autonomous driving problem,” in International
Conference on Image and Graphics. Springer, 2017, pp. 417–429. 13
[145] J. Garcıa and F. Fernández, “A comprehensive survey on safe rein-
forcement learning,” Journal of Machine Learning Research, vol. 16,
no. 1, pp. 1437–1480, 2015. 13
[146] P. Palanisamy, “Multi-agent connected autonomous driving using deep
reinforcement learning,” in 2020 International Joint Conference on
Neural Networks (IJCNN). IEEE, 2020, pp. 1–7. 13
[147] S. Bhalla, S. Ganapathi Subramanian, and M. Crowley, “Deep multi
agent reinforcement learning for autonomous driving,” in Advances in
Artificial Intelligence, C. Goutte and X. Zhu, Eds. Cham: Springer
International Publishing, 2020, pp. 67–78. 13
[148] A. Wachi, “Failure-scenario maker for rule-based agent using multi-
agent adversarial reinforcement learning and its application to au-
tonomous driving,” arXiv preprint arXiv:1903.10654, 2019. 13
[149] C. Yu, X. Wang, X. Xu, M. Zhang, H. Ge, J. Ren, L. Sun, B. Chen, and
G. Tan, “Distributed multiagent coordinated learning for autonomous
driving in highways based on dynamic coordination graphs,” IEEE
Transactions on Intelligent Transportation Systems, vol. 21, no. 2, pp.
735–748, 2020. 13
[150] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford,
J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov, “Openai baselines,”
https://github.com/openai/baselines, 2017. 14
[151] A. Juliani, V.-P. Berges, E. Vckay, Y. Gao, H. Henry, M. Mattar, and
D. Lange, “Unity: A general platform for intelligent agents,” arXiv
preprint arXiv:1809.02627, 2018. 14
[152] I. Caspi, G. Leibovich, G. Novik, and S. Endrawis, “Reinforcement
learning coach,” Dec. 2017. [Online]. Available: https://doi.org/10.
5281/zenodo.1134899 14
[153] Sergio Guadarrama, Anoop Korattikara, Oscar Ramirez, Pablo
Castro, Ethan Holly, Sam Fishman, Ke Wang, Ekaterina Gonina,
Neal Wu, Chris Harris, Vincent Vanhoucke, Eugene Brevdo,
“TF-Agents: A library for reinforcement learning in tensorflow,”
https://github.com/tensorflow/agents, 2018, [Online; accessed 25-June-
2019]. [Online]. Available: https://github.com/tensorflow/agents 14
[154] A. Stooke and P. Abbeel, “rlpyt: A research code base for deep
reinforcement learning in pytorch,” arXiv preprint arXiv:1909.01500,
2019. 14
[155] I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva,
K. McKinney, T. Lattimore, C. Szepezvari, S. Singh et al., “Behaviour
suite for reinforcement learning,” arXiv preprint arXiv:1908.03568,
2019. 14
B Ravi Kiran is the technical lead of machine
learning team at Navya, designing and deploying
realtime deep learning architectures for perception
tasks on autonomous shuttles. During his 6 years
in academic research, he has worked on DNNs for
video anomaly detection, online time series anomaly
detection, hyperspectral image processing for tumor
detection. He finished his PhD at Paris-Est in 2014
entitled Energetic lattice based optimization which
was awarded the MSTIC prize. He has worked in
academic research for over 4 years in embedded
programming and in autonomous driving. During his career he has published over 40 journal and conference articles.
Ibrahim Sobh Ibrahim has more than 20 years of
experience in the area of Machine Learning and
Software Development. Dr. Sobh received his PhD
in Deep Reinforcement Learning for fast learning
agents acting in 3D environments. He received his
B.Sc. and M.Sc. degrees in Computer Engineering
from Cairo University Faculty of Engineering. His
M.Sc. thesis is in the field of Machine Learning applied to automatic document summarization.
Ibrahim has participated in several related national
and international mega projects, conferences and
summits. He delivers training and lectures for academic and industrial entities.
Ibrahim’s publications including international journals and conference papers
are mainly in the machine and deep learning fields. His area of research
is mainly in Computer vision, Natural language processing and Speech
processing. Currently, Dr. Sobh is a Senior Expert of AI, Valeo.
Victor Talpaert is a PhD student at U2IS, EN-
STA Paris, Institut Polytechnique de Paris, 91120
Palaiseau, France. His PhD is directed by Bruno
Monsuez since 2017, the lab speciality is in robotics
and complex systems. His PhD is co-sponsored by
AKKA Technologies in Guyancourt, France, through
the guidance of AKKA’s Autonomous Systems
Team. This team has a large focus on Autonomous
Driving (AD) and the automotive industry in general.
His PhD subject is learning decision making for AD,
with assumptions such as a modular AD pipeline,
learned features compatible with classic robotic approaches and ontology
based hierarchical abstractions.
Patrick Mannion is a permanent member of aca-
demic staff at National University of Ireland Gal-
way, where he lectures in Computer Science. He is
also Deputy Editor of The Knowledge Engineering
Review journal. He received a BEng in Civil En-
gineering, a HDip in Software Development and a
PhD in Machine Learning from National University
of Ireland Galway, a PgCert in Teaching & Learning
from Galway-Mayo IT and a PgCert in Sensors
for Autonomous Vehicles from IT Sligo. He is a
former Irish Research Council Scholar and a former
Fulbright Scholar. His main research interests include machine learning, multi-
agent systems, multi-objective optimisation, game theory and metaheuristic
algorithms, with applications to domains such as transportation, autonomous
vehicles, energy systems and smart grids.
Ahmad El Sallab Ahmad El Sallab is the Senior
Chief Engineer of Deep Learning at Valeo Egypt,
and Senior Expert at Valeo Group. Ahmad has 15
years of experience in Machine Learning and Deep
Learning, where he acquired his M.Sc. and Ph.D.
in 2009 and 2013 in the field. He has worked for
reputable multi-national organizations in the industry
since 2005 like Intel and Valeo. He has over 35
publications and book chapters in Deep Learning
in top IEEE and ACM journals and conferences, in
addition to 30 patents, with applications in Speech,
NLP, Computer Vision and Robotics.
Senthil Yogamani is an Artificial Intelligence ar-
chitect and technical leader at Valeo Ireland. He
leads the research and design of AI algorithms for
various modules of autonomous driving systems.
He has over 14 years of experience in computer
vision and machine learning including 12 years of
experience in industrial automotive systems. He is
an author of over 90 publications and 60 patents
with 1300+ citations. He serves in the editorial board
of various leading IEEE automotive conferences
including ITSC, IV and ICVES and advisory board
of various industry consortia including Khronos, Cognitive Vehicles and IS
Auto. He is a recipient of best associate editor award at ITSC 2015 and best
paper award at ITST 2012.
Patrick Pérez is Scientific Director of Valeo.ai,
a Valeo research lab on artificial intelligence for
automotive applications. He is currently on the Edi-
torial Board of the International Journal of Computer
Vision. Before joining Valeo, Patrick Pérez has been
Distinguished Scientist at Technicolor (2009-2019),
researcher at Inria (1993-2000, 2004-2009) and at
Microsoft Research Cambridge (2000-2004). His
research interests include multimodal scene under-
standing and computational imaging.