End-to-End Race Driving with Deep Reinforcement Learning
Maximilian Jaritz1,2, Raoul de Charette1, Marin Toromanoff2, Etienne Perot2 and Fawzi Nashashibi1
Abstract—We present research using the latest reinforcement learning algorithm for end-to-end driving without any mediated perception (object recognition, scene understanding). The newly proposed reward and learning strategies lead together to faster convergence and more robust driving using only the RGB image from a forward-facing camera. An Asynchronous Advantage Actor Critic (A3C) framework is used to learn the car control in a physically and graphically realistic rally game, with the agents evolving simultaneously on tracks with a variety of road structures (turns, hills), graphics (seasons, location) and physics (road adherence). A thorough evaluation is conducted and generalization is proven on unseen tracks and using legal speed limits. Open-loop tests on real sequences of images show some domain adaptation capability of our method.
I. INTRODUCTION
Recent advances prove the feasibility of end-to-end robot
control by replacing the classic chain of perception, planning
and control with a neural network that directly maps sensor
input to control output [11]. For cars, direct perception [5]
and end-to-end control [13] were showcased in the TORCS
car racing game using Reinforcement Learning (RL). As RL relies on trial-and-error strategies, an end-to-end driving prototype still seems too dangerous for real-life learning, and much progress remains to be made: the first studies use simulators with simplified graphics and physics, and the obtained driving results lack realism.
We propose a method (fig. 1) benefiting from recent
asynchronous learning [13] and building on our prelim-
inary work [17] to train an end-to-end agent in World
Rally Championship 6 (WRC6), a realistic car racing game
with stochastic behavior (animations, lighting). In addition, to remain close to real driving conditions, we rely only on the front camera image and speed to predict the full longitudinal and lateral control of the car. Together with our learning strategy, the
method converges faster than previous ones and exhibits
some generalization capacity despite the significantly more
complex environment that exhibits 29.6km of training tracks
with various visual appearances (snow, mountain, coast) and
physics (road adherence). Although it is fully trained in a simulation environment, the algorithm was tested successfully on real driving videos and handled scenarios unseen during training (e.g. oncoming cars).
Section II describes the few related works in end-to-end
driving. Section III details our methodology and learning
strategies. An exhaustive evaluation is discussed in section IV and generalization on real videos is shown in section V.
1Inria, RITS Team, 2 rue Simone Iff, 75012 Paris
surname.last-name@inria.fr
2Valeo Driving Assistance Research, Bobigny
first.lastname@valeo.com
[Fig. 1a: pipeline diagram. The rally game streams the front-camera image (84x84x3, 30fps) and metadata through a dedicated API; the state encoder and policy network (32 outputs) return the control command (steering, gas, brake, hand brake); the reward path (red) is used for training only.]
[Fig. 1b: training performance and crash-frequency maps over the 3 training tracks (29.6km): Snow (SE) 11.61km, Mountain (CN) 13.34km, Coast (UK) 4.59km.]
Fig. 1: (a) Overview of our end-to-end driving in the WRC6
rally environment. The state encoder learns the optimal con-
trol commands (steering, brake, gas, hand brake), using only
84x84 front view images and speed. The stochastic game
environment is complex with realistic physics and graphics.
(b) Performance on 29.6km of training tracks exhibiting a
wide variety of appearance, physics, and road layout. The
agent learned to drive at 73km/h and to take sharp turns and
hairpin bends with few crashes (drawn in yellow).
II. RELATED WORK
Despite early interest in end-to-end driving [18], [3], the trend for self-driving cars is still to use the perception-planning-control paradigm [21], [22], [16]. The slow development of end-to-end driving can be explained by computational and algorithmic limitations that were only recently lifted by deep learning. There have been a number of deep learning approaches to solve end-to-end control (aka "behavioral reflex") for games [15], [14], [13] or robots [10], [11], but still very few were applied to end-to-end driving.
Using supervised learning, Bojarski et al. [4] trained an 8-layer CNN to learn lateral control from a front-view camera, using the steering angle of a real driver as ground truth. They use 72h of training data, with homographic interpolation of three forward-facing cameras to generate synthetic
viewpoints. Similar results were presented in [19] with a
different architecture. Another recent proposal, from [24], is to use privileged learning and to compute the driving loss through comparison with the actual driver decision. An auxiliary pixel-segmentation loss is used, which was previously found to help training convergence. However, in [24] it is not clear whether the network learned to detect the driver action or to predict it. Behavioral cloning is limited by nature
as it only mimics the expert driver and thus cannot adapt to
unseen situations.
An alternative, allowing unsupervised (or self-supervised) learning, is deep Reinforcement Learning (RL), as it uses a reward to train a network to find the most favorable state. The two major advantages of RL are that: a) it does not require ground-truth data and b) the reward can be sparse and delayed, which opens new horizons. Indeed, judging a driving control
decision at the frame level is complex and deep RL allows re-
warding or penalizing after a sequence of decisions. To avoid
getting caught in local optima, experience replay memory
can be used as in [14]. Another solution is asynchronous learning, coined A3C, as proposed by Mnih et al. and successfully applied to ATARI games [13] using the score as a reward. A3C achieves experience decorrelation with multiple agents evolving in different environments at the same time. In [13], deep RL was also applied to the TORCS driving game: a 3-layer CNN is trained with an A3C strategy to jointly learn lateral and longitudinal control from a virtual front camera. The algorithm searches by itself for the best driving strategy based on a reward computed from the car's angle and speed.
There have also been a number of closely related studies in direct perception. Rather than learning the control policy, these methods learn to extract high-level driving features (distance to the preceding vehicle, road curvature, distance to the road border, etc.) which are then used to control the car through simple decision-making and control algorithms. This field yields interesting research from [5] and, recently, [2].
III. METHOD
In this work we aim at learning end-to-end driving in rally conditions with a variety of visual appearances and physics models (road adherence, tire friction). This task is very challenging and cannot be conducted in real cars, as we aim to learn full control, that is, steering, brake, gas and even hand brake to enable drifting. Instead, the training is
done using a dedicated API of a realistic car game (WRC6).
This simulation environment allows us to crash the car while
ensuring we encounter multiple scenarios. We demonstrate
in section V that the simulation training is transposable to
real driving images.
We chose the Asynchronous Learning strategy (A3C) [13]
to train the architecture because it is well suited for experi-
ence decorrelation. The overall pipeline is depicted in fig. 1a.
At every time step, the algorithm receives the state of the game (s), acts on the car through a control command (a), and gets a reward (r) at the next iteration as a supervision signal. The
complete architecture optimizes a driving policy to apply to
the vehicle using only the RGB front view image.
We first motivate our choice of reinforcement learning algorithm (sec. III-A), detail the state encoder architecture (sec. III-B), and then describe the strategy applied for training the reinforcement learning algorithm (sec. III-C).
A. Reinforcement learning algorithms
In the common RL model, an agent interacts with an environment at discrete time steps $t$ by receiving the state $s_t$, on which basis it selects an action $a_t$ as a function of policy $\pi$ with probability $\pi(a_t|s_t)$ and sends it to the environment, where $a_t$ is executed and the next state $s_{t+1}$ is reached with associated reward $r_t$. Both the state $s_{t+1}$ and the reward $r_t$ are returned to the agent, which allows the process to start over. The discounted reward $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, with $\gamma \in [0,1[$, is to be maximized by the agent.
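For illustration, a minimal sketch (not the authors' code) of how the discounted return could be computed backwards over a finite rollout of rewards:

```python
# Minimal sketch: discounted return R_t = sum_{k>=0} gamma^k * r_{t+k},
# accumulated backwards over a finite sequence of rewards.
def discounted_returns(rewards, gamma=0.99):
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# Example: three rewards of 1.0 give R_0 = 1 + 0.99 + 0.99^2
print(discounted_returns([1.0, 1.0, 1.0]))
```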
1) Policy optimization: The output probabilities of the control commands (i.e. steering, gas, brake, hand brake) are determined by the control policy $\pi_\theta$ parameterized by $\theta$ (e.g. the weights of a neural network), which we seek to optimize by estimating the gradient of the expected return $E[R_t]$. To do so, we chose the rather popular REINFORCE method [23] that allows computing an unbiased estimate of $\nabla_\theta E[R_t]$.
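As an illustration only, a hedged PyTorch-style sketch of the REINFORCE estimate, where the surrogate loss $-\log\pi(a_t|s_t)\,R_t$ yields $\nabla_\theta E[R_t]$ in expectation (the tensors below are placeholders, not the authors' code):

```python
import torch

def reinforce_loss(logits, actions, returns):
    """Surrogate loss whose gradient matches the REINFORCE estimator.

    logits:  (T, n_actions) raw policy outputs
    actions: (T,) indices of the sampled actions a_t
    returns: (T,) discounted returns R_t
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_t|s_t)
    return -(log_pi_a * returns).mean()

# Toy usage with random tensors standing in for a policy network output
logits = torch.randn(5, 32, requires_grad=True)
loss = reinforce_loss(logits, torch.randint(0, 32, (5,)), torch.randn(5))
loss.backward()  # gradients flow back through `logits` to the policy parameters
```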
2) Asynchronous Advantage Actor Critic (A3C): As introduced by Mnih et al. [13], in A3C the discounted reward $R_t$ is estimated with a value function $V^{\pi_\theta}(s) = E[R_t | s_t = s]$, and the remaining rewards can be estimated after some steps as the sum of the above value function and the actual rewards: $\hat{R}_t = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k \hat{V}^{\pi_\theta}(s_{t+k})$, where $k$ varies between 0 and the sampling update $t_{max} = 5$. The quantity $\hat{R}_t - \hat{V}^{\pi_\theta}(s_t)$ can be seen as the advantage, i.e. whether the actions $a_t, a_{t+1}, ..., a_{t+t_{max}}$ were actually better or worse than expected. This is of high importance as it allows correction when non-optimal strategies are encountered.
For generality, it is said that the policy $\pi(a_t|s_t; \theta)$ (aka Actor) and the value function $\hat{V}(s_t; \theta')$ (aka Critic) are estimated independently with two neural networks, each one with a different gradient loss. In practice, both networks share all layers but the last fully connected one.
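A hedged sketch (not the authors' code) of the bootstrapped n-step return and advantage described above, computed over one rollout of at most $t_{max}=5$ steps:

```python
# A3C-style n-step targets: R_hat_t = sum_i gamma^i r_{t+i} + gamma^k V(s_{t+k}),
# advantage A_t = R_hat_t - V(s_t), computed backwards over the rollout.
def nstep_returns_and_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """rewards, values: lists of length k <= t_max collected by one agent;
    bootstrap_value: critic estimate V(s_{t+k}), set to 0 if the state is terminal."""
    returns, advantages = [], []
    R = bootstrap_value
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R
        returns.append(R)
        advantages.append(R - v)
    return list(reversed(returns)), list(reversed(advantages))

# Toy usage on a 3-step rollout
rets, advs = nstep_returns_and_advantages([0.2, 0.1, 0.3], [0.5, 0.4, 0.6], 0.45)
```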
In addition to its top performance, the choice of the A3C algorithm is justified by its ability to train small image encoder CNNs without any need for experience replay for decorrelation. This allows training in different environments simultaneously, which is useful for our particular case as the WRC6 environment is non-deterministic.
B. State encoder
Intuitively, unlike for other computer vision tasks, a shallow CNN should be sufficient as car racing should rely mostly on obstacle and road-surface detection. The
chosen network is a 4 layer (3 convolutional) architecture
inspired by [8] but using a dense filtering (stride 1) to
handle far-away vision. It also uses max pooling for more
translational invariance and takes advantage of speed and
previous action in the LSTM. Using a recurrent network is required because there are multiple valid control decisions if we do not account for motion. Our network is displayed
in fig. 2 alongside the one from Mnih et al. [13] which we
compare against in section IV.
[Fig. 2a (ours): 3x84x84 input image -> Conv-Pool-ReLU (32x39x39) -> Conv-Pool-ReLU (32x18x18) -> Conv-ReLU (32x7x7) -> FC-ReLU (256) -> LSTM (289 inputs, including speed v_t and previous action a_{t-1} (32)) -> FC: policy (32) and FC: V (1).]
[Fig. 2b (Mnih et al. [13]): 3x84x84 input image -> Conv-ReLU (16x20x20) -> Conv-ReLU (32x9x9) -> FC-ReLU (256) -> LSTM (256) -> FC: policy (32) and FC: V (1).]
Fig. 2: The state encoder CNN+LSTM architecture used in
our approach (2a). Compared to the one used in [13] (2b), our
network is slightly deeper and comes with a dense filtering
for finer feature extraction and far-away vision.
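The following PyTorch sketch illustrates a CNN+LSTM state encoder in the spirit of fig. 2a. The exact kernel sizes, pooling placement and the way speed v_t and the one-hot previous action a_{t-1} enter the LSTM are not fully specified in the figure, so the values below are assumptions chosen only to reproduce the printed feature-map sizes; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Hedged sketch of a CNN+LSTM state encoder matching the sizes of fig. 2a."""
    def __init__(self, n_actions=32, lstm_size=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=1), nn.MaxPool2d(2), nn.ReLU(),   # 32x39x39
            nn.Conv2d(32, 32, kernel_size=4, stride=1), nn.MaxPool2d(2), nn.ReLU(),  # 32x18x18
            nn.Conv2d(32, 32, kernel_size=5, stride=1), nn.MaxPool2d(2), nn.ReLU(),  # 32x7x7
        )
        self.fc = nn.Sequential(nn.Linear(32 * 7 * 7, 256), nn.ReLU())
        self.lstm = nn.LSTMCell(256 + n_actions + 1, lstm_size)  # 289 inputs: features + a_{t-1} + v_t
        self.policy_head = nn.Linear(lstm_size, n_actions)       # FC: policy (32 classes)
        self.value_head = nn.Linear(lstm_size, 1)                 # FC: V

    def forward(self, image, speed, prev_action_onehot, hidden):
        x = self.fc(self.features(image).flatten(1))
        x = torch.cat([x, prev_action_onehot, speed], dim=1)
        h, c = self.lstm(x, hidden)
        return self.policy_head(h), self.value_head(h), (h, c)

# Toy forward pass on one normalized 84x84 RGB frame
enc = StateEncoder()
hidden = (torch.zeros(1, 256), torch.zeros(1, 256))
logits, value, hidden = enc(torch.zeros(1, 3, 84, 84), torch.zeros(1, 1),
                            torch.zeros(1, 32), hidden)
```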
C. End-to-end learning strategy
Preliminary research highlighted that naively training an
A3C algorithm with a given state encoder does not reach
optimal performance. In fact, we found that control, reward shaping, and agent initialization are crucial for optimal end-to-end driving. Although the literature somewhat details control and reward shaping, agent initialization is completely neglected despite being of high importance.
1) Control: Continuous control with Deep Reinforcement
Learning (DRL) is possible [12], [6], [7] but the common
strategy for A3C is to use a discrete control, easier to
implement. For this rally driving task, the architecture needs
to learn the control commands for steering (-1...1), gas (0...1),
brake (0, 1) and hand brake (0, 1). Note that the brake
and hand brake commands are binary. The latter was added
for the car to learn drifts in hairpin bends, since brake
implies slowing down rather than drifting. The combination
of all control commands has been broken into 32 classes
listed in Table I. Although the total number of actions is
arbitrary, two choices should be highlighted: the prominence
of acceleration classes (18 commands with gas >0) to
encourage speeding up, and the fact that brake and hand
brake are associated with different steering commands to
favor drifting.
# classes   Control commands: Steering | Gas | Brake | Hand brake
27          {-1.0, -0.75, ..., 1.0} | {0.0, 0.5, 1.0} | {0} | {0}
4           {-1.0, -0.5, 0.5, 1.0}  | {0.0}           | {0} | {1}
1           {0.0}                   | {0.0}           | {1} | {0}
TABLE I: The 32 output classes for the policy network. The first column indicates the number of classes with the corresponding sets of control values. Note the prominence of gas commands.
The softmax layer of the policy network outputs the
probabilities of the 32 control classes given the state encoded
by the CNN and LSTM.
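As an illustration, the discrete action set can be enumerated as follows (the 0.25 steering step is our reading of the ellipsis in Table I; this is a sketch, not the authors' code):

```python
from itertools import product

# Sketch of the 32-class discrete action set of Table I.
STEERING = [i * 0.25 for i in range(-4, 5)]   # 9 steering values in [-1, 1]
GAS = [0.0, 0.5, 1.0]

ACTIONS = (
    # 27 driving classes: steering x gas, no brake, no hand brake
    [dict(steer=s, gas=g, brake=0, handbrake=0) for s, g in product(STEERING, GAS)]
    # 4 hand-brake classes to allow drifting in hairpin bends
    + [dict(steer=s, gas=0.0, brake=0, handbrake=1) for s in (-1.0, -0.5, 0.5, 1.0)]
    # 1 pure braking class
    + [dict(steer=0.0, gas=0.0, brake=1, handbrake=0)]
)
assert len(ACTIONS) == 32  # matches the softmax output of the policy network
```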
2) Reward shaping: Reward shaping is crucial to help the
network to converge to the optimal set of solutions. Because
in car racing the score is measured at the end of the track it
is much too sparse to train the agents. Instead, a reward is
computed at each frame using metadata information received
from the game. In [13] the reward is computed as a function
of the difference of angle $\alpha$ between the road and the car's heading, and of the speed $v$. Tests show that it cannot
Fig. 3: Scheme of our training setup. Multiple game instances
run on different machines and communicate through a ded-
icated API with the actor-learner threads which update and
synchronize their weights frequently with the shared target
network in an asynchronous fashion.
prevent the car from sliding along the guard rail, which makes sense since the latter follows the road angle. Eventually, we found it preferable to add the distance $d$ from the middle of the road as a penalty:
$R = v(\cos\alpha - d)$   (1)
Similarly to [9] our conclusion is that the distance penalty
enables the agent to rapidly learn how to stay in the middle of
the track. Additionally, we propose two other rewards using
the road width as detailed in section IV-C.2.
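A minimal sketch of the reward of eq. (1); the units and normalization of the game-provided speed and distance are not specified in the paper and are left as-is here:

```python
import math

def reward(speed, angle, dist_center):
    """Eq. (1): R = v * (cos(alpha) - d).

    speed:       car speed v returned by the game metadata
    angle:       alpha, angle between the road direction and the car heading (radians)
    dist_center: d, distance from the middle of the road (game metadata)
    """
    return speed * (math.cos(angle) - dist_center)
```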
3) Agent initialization: In previous end-to-end DRL at-
tempts [13], [9] the agents are always initialized at the
beginning of the track. Such a strategy will lead to overfitting
at the beginning of the training tracks and is intuitively
inadequate given the decorrelation property of the A3C
algorithm. In practice, restarting at the beginning of the track
leads to better performance on the training tracks but poorer
generalization capability, which we prove in section IV-C.3.
Instead, we chose to initialize the agents (at start or after a crash) randomly on the training tracks, although restricted to checkpoint positions due to technical limitations.
Experiments detailed in section IV-C.3 advocate that random
initialization improves generalization and exploration signif-
icantly.
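A minimal sketch of this respawn strategy, where the checkpoint list is a hypothetical placeholder for the positions exposed by the game:

```python
import random

def respawn_position(checkpoints, random_respawn=True):
    """Pick where to (re)start an agent at episode start or after a crash.

    checkpoints:    list of checkpoint positions along the training track (placeholder)
    random_respawn: True for the random-checkpoint strategy, False for the usual
                    start-of-track respawn used in previous work.
    """
    return random.choice(checkpoints) if random_respawn else checkpoints[0]
```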
IV. EXPERIMENTS
This section describes our architecture setup and reports
quantitative and qualitative performance in the World Rally
Championship 6 (WRC6) racing game. Compared to the
TORCS platform used in [13], [5], [9], WRC6 exhibits a more realistic physics engine (grip, drift), richer graphics (illumination, animations, etc.), and a variety of road shapes (sharp turns,
slopes). Additionally, the WRC6 stochastic behavior makes
each run unique which is harder but closer to real conditions.
For better graph visualization all plots are rolling mean
and deviation over 1000 steps. Performance is best seen in
our video: http://team.inria.fr/rits/drl.
A. Training setup
The training setup is depicted in figure 3. We use a central
machine to run the RL algorithm which communicates with
9 instances of the game split over 2 machines. Each of
the agents communicates via TCP with a WRC6 instance
through a dedicated API specifically developed for this work.
Fig. 4: Rolling mean (dark) and standard deviation (light)
over training on 3 tracks. The agent had more difficulty progressing on the mountain and snow tracks as they exhibit sharp curves, hairpin bends, and slippery roads.
It allows us to retrieve in-game info, compute the reward
and send control back to the game. To speed up the pipeline
and match the CNN input resolution, some costly graphics
effects were disabled and we used a narrower field of view
compared to the in-game view, as shown in figure 1a. The
game's clock runs at 30 FPS and the physics engine is on hold while it waits for the next action.
We use only first-person-view images (RGB, normalized) and speed for testing, to allow a fair comparison with a real driver's knowledge. Likewise, training images do not contain
usual in-game info (head up display, turns, track progress,
etc.) which actually makes the gaming task harder.
Three tracks - a total of 29.6km - were used for training
(3 instances of each) and we reserved two tracks for testing
the generalization of the learning which is discussed in
section V. The agents (cars) start and respawn (after crashes)
at a random checkpoint (always at 0 km/h).
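For illustration, a heavily simplified skeleton of one asynchronous actor-learner thread. The WRC6 TCP API is private, so `env` below is a hypothetical stand-in exposing reset/step, `local_model.act` is a placeholder for sampling from the policy, and the gradient-sharing details are omitted; this is a sketch of the training loop shape, not the authors' implementation.

```python
import threading

def actor_learner(worker_id, make_env, shared_model, local_model, t_max=5):
    """One of the asynchronous workers: roll out up to t_max steps, then update.

    make_env:     factory returning a hypothetical game interface with
                  reset() -> state and step(action) -> (state, reward, done)
    shared_model: parameters shared across all workers (the target network)
    local_model:  this worker's copy, synchronized before each rollout
    """
    env = make_env(worker_id)
    state = env.reset()
    while True:
        local_model.load_state_dict(shared_model.state_dict())  # synchronize weights
        rollout = []
        for _ in range(t_max):
            action = local_model.act(state)                      # sample from pi(a|s)
            next_state, reward, done = env.step(action)
            rollout.append((state, action, reward))
            state = env.reset() if done else next_state
            if done:
                break
        # Compute n-step returns/advantages from `rollout` (see the earlier sketch),
        # derive actor and critic losses, and apply the gradients asynchronously
        # to `shared_model`.

# One thread per game instance (9 in the paper's setup), e.g.:
# threading.Thread(target=actor_learner, args=(i, make_env, shared, local)).start()
```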
B. Performance evaluation
Plots in fig. 4 show the mean (dark) and standard deviation
(light) over the three training tracks for 140 million steps.
During training, the agents progress along the tracks simul-
taneously as expected from the decorrelation capability of
the A3C algorithm. Overall, the agents successfully learned to drive despite the challenging track appearances and physics, at an average speed of 72.88km/h, covering an average distance of 0.72km per run. We observe a high
standard deviation in the covered distance as it depends on
the difficulty of the track part where the car is spawned.
The bottom plot reports how often the car hits objects of
the environment (guard rail, obstacles, etc.). Although not penalized in the reward function, the number of hits decreases through training as hits imply a speed decrease. After training, the
car hits scene objects 5.44 times per kilometer.
During training a run is interrupted if the bot either stops
Fig. 5: Visualization of back-propagation where positive
gradients for the chosen actions are highlighted in blue.
Despite various scenes and road appearances the network
learned to detect road edges and relies on them for control.
progressing or goes in the wrong direction (off-road, wrong way). We refer to this as a "crash". The location of crashes over five-meter segments is colored from black to yellow in fig. 1b, which shows the bot learned to go through slopes, sharp curves and even some hairpin bends.
From a qualitative point of view, the agent drives rather
smoothly and even learned how to drift with the hand-
brake control strategy. However, the bots still do not achieve optimal trajectories from a racing perspective (e.g. taking turns on the inside). This is because the network lacks anticipation, and the car will always try to remain in the track center. On the snow track, the road being very slippery, the vehicle rushing headlong leads to frequent crashes. Although on all tracks the average number of hits is relatively high, the context of a racing game is very complex and we found during our experiments that even the best human players collide with the environment. It is important to highlight that the physics, graphics, dynamics, and tracks are much more complex than in the usual TORCS platform, which explains the lower performance in comparison. In fact, as the architecture
learns end-to-end driving it needs to learn the realistic
underlying dynamics of the car. Additionally, we trained
using tracks with different graphics and physics (e.g. road
adherence) thus increasing the complexity.
To better understand the network decisions, fig. 5 shows guided back-propagation [20] (i.e. the network's positive inner gradients that lead to the chosen action) for several scenarios.
Despite the various scene appearances the agent uses the road
edges and curvature as a strong action indicator. This mimics
existing end-to-end techniques [13], [9], [4] that also learn
lateral controls from road gradients.
C. Comparative evaluation
To evaluate our contribution against the state of the art, we assess separately the proposed choices of state encoder, reward shaping and respawn strategy.
of each factor is carried out by retraining the whole network
while only changing the element of study.
1) State encoder: Fig. 6 compares the performance of
our CNN with dense stride against the smaller network with
larger stride from Mnih et al. [13] (cf. fig. 2). As expected,
in fig. 6a the convergence is faster for the smaller network [13] than for ours (80 versus 130 million steps). However, although a significant impact on the racing style was intuitively expected due to the far-away vision, in fig. 6b both networks seem to lead to similar performance. As a
[Fig. 6 panels: (a) Performance; (b) Racing style, with crash maps for Ours and [13] and sections A and B marked.]
Fig. 6: Evaluation of our state encoder versus the encoder
used in Mnih et al. [13]. In 6a, the smaller CNN from [13]
(cyan) converges faster than ours (orange). In 6b the racing performance of both networks is comparable, though our network is slightly more exploratory, as highlighted at locations A and B. Refer to section IV-C.1 for details.
[Fig. 7 panels: (a) Reward shaping; (b) Respawn strategy.]
Fig. 7: Evaluation of the performance of our strategies for
reward (7a) and respawn (7b). In 7a, the reward used (orange)
is compared to the reward from [13] (cyan). In 7b, our random checkpoint strategy (plain line) and the start strategy (dashed line) for respawn are compared on the three tracks.
matter of fact, the crash locations (highlighted in yellow) are similar even on such a difficult track, with the notable exceptions of the section labeled A, where [13] crashes less often, and section B, which is explored only by our network.
In light of these results, we retrained the whole experi-
ment shown in fig. 4 with the network from [13]. Despite
the longer convergence, the performance of our network is better than that of the smaller network, with +89.9m average distance covered (+14.3%) and -0.8 average car hits per kilometer (-13.0%). This analysis advocates that our network performs better at the cost of longer training.
2) Reward shaping: We now measure the impact of the reward on the driving style. In addition to comparing the reward Ours (eq. 1) and that of Mnih et al. ([13]), we investigate the interest of accounting for the road width ($rw$) in the reward and consequently propose two new rewards:
• a reward penalizing distance only when the car is off the road lane: $R = v(\cos\alpha - \max(|d| - 0.5\,rw, 0))$, named Ours w/ margin;
• a smooth reward penalizing distance with a sigmoid: $R = v(\cos\alpha - \frac{1}{1 + e^{-4(|d| - 0.5\,rw)}})$, named Ours Sigmoid.
Both of these rewards penalize more when the car leaves the road, but the latter also avoids a singularity. A visualization of the four rewards as a function of the car's lateral position is displayed in fig. 7a right, assuming a constant speed for simplification.
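Complementing the sketch of eq. (1) above, a sketch of the two road-width-aware rewards follows; the penalty uses |d|, the unsigned distance to the track center, and the sign of the sigmoid exponent is our reconstruction of the formula:

```python
import math

def reward_with_margin(v, alpha, d, rw):
    """Ours w/ margin: no distance penalty while the car stays within the lane (|d| <= rw/2)."""
    return v * (math.cos(alpha) - max(abs(d) - 0.5 * rw, 0.0))

def reward_sigmoid(v, alpha, d, rw):
    """Ours Sigmoid: smooth penalty rising from ~0 inside the road to ~1 outside it."""
    return v * (math.cos(alpha) - 1.0 / (1.0 + math.exp(-4.0 * (abs(d) - 0.5 * rw))))
```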
Fig. 7a left shows training performance as the number
Fig. 8: (a) Influence of the speed limit on the number of car hits for the three tracks, and estimation of the hits with a "real speed limit" (dashed lines) calculated from the design speed (cf. sec. V-2). (b) Prediction of longitudinal and lateral commands on real videos. Note that the agent can handle situations never encountered (other road users, multiple lanes).
of car hits per km. For a fair comparison, the architecture of [13] is used and trained on a single track. Our three rewards using the distance from the track center lead to a significant drop in hits while converging faster than the reward of Mnih et al., i.e. Mnih et al. (9.2 hits/km) versus Ours (2.3 hits/km). Out of those three, the partly constant function Ours w/ margin performs worst, probably because it is less suited for optimization via gradient descent. As could be expected, the fewer hits come at the cost of a slower overall speed: Mnih et al. (106.3km/h) compared to Ours (91.4km/h). To conclude, the agent trained with the reward of Mnih et al. drives faster but with a much rougher style.
3) Respawn strategy: To evaluate our respawn strategy
we compared to a similar network trained using the standard
respawn at the start of the track. When the bots start at
different positions, track completion is not a valid metric1. Instead, we use track exploration as a percentage of the track length. In fig. 7b, our random checkpoint strategy (plain
lines) exhibits a significantly better exploration than the
usual start strategy (dashed lines). For the easiest coast track
(green), both strategies reach full exploration of the track but
random checkpoint is faster. For complex snow (blue) and
mountain (green) tracks, our strategy improves greatly the
exploration by +32.20% and +65.19%, respectively.
The improvement due to our strategy is easily explained
as it makes full use of the A3C decorrelation capability.
Because it sees a wider variety of environments the network
is forced generalize more. The fig. 7b also shows that
the bots tend to progress along the track in a non-linear
fashion. The most logical explanation is that some track
segments (e.g. sharp turns) require many attempts before
bots succeed, leading to a training progress in fits and starts.
To summarize, this comparative evaluation showed that
our state encoder, respawn strategy and reward shaping greatly improve the end-to-end driving performance. A comparison of discrete versus continuous control was also conducted but is not reported as both exhibit similar performance.
V. GENERALIZATION
Because the WRC6 game is stochastic (animations, slight
illumination changes, etc.) the performance reported on the
1 E.g.: a bot starting halfway through the track could not reach more than 50% completion. Hence, track completion would be a biased metric.
training tracks already reflects some generalization.
Still, we want to address these questions: Can the agent drive
on unseen tracks? Can it drive respecting the speed limit?
How does it perform on real images?
1) Unseen tracks: We tested the learned agent on tracks with different road layouts, which the agent could follow at high speed, showing that the network incorporated general driving concepts rather than learning a track by heart. Qualitative performance is shown in the supplementary video.
2) Racing vs. normal driving: As the reward favors speed without any direct penalization of collisions, the agent learned to go fast. While appropriate for racing games, this is dangerous for 'normal driving'. To evaluate how our algorithm could be transposed to normal driving with speed limits, we evaluate the influence of speed per track in fig. 8a. As one could expect, the number of crashes (collisions, off-road events) significantly decreases at lower speeds.
Dashed lines in the figure show the performance in a real
speed limit scenario. To estimate the real speed limit, we
use the computed curvature and superelevation of each road
segment to compute the ad-hoc design speed as defined in
the infrastructure standards [1].
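The exact design-speed computation is not detailed in the paper; as a rough illustration only, the standard curve-speed relation from geometric highway design, $V^2 = 127\,R\,(e + f)$ with $V$ in km/h, curve radius $R$ in m, superelevation $e$ and side-friction factor $f$, could be applied per road segment. The side-friction value below is an assumed constant, whereas the standards make it speed dependent.

```python
import math

def design_speed_kmh(radius_m, superelevation, side_friction=0.15):
    """Rough curve design-speed estimate (an assumption, not the paper's exact method):
    V^2 = 127 * R * (e + f), the standard relation from geometric highway design [1].

    radius_m:       curve radius, the inverse of the computed road curvature
    superelevation: cross-slope e of the road segment (e.g. 0.06 for 6%)
    side_friction:  assumed side-friction factor f
    """
    return math.sqrt(127.0 * radius_m * (superelevation + side_friction))

# Example: a 100 m radius curve with 6% superelevation -> roughly 52 km/h
print(design_speed_kmh(100.0, 0.06))
```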
3) Real videos: Finally, we tested our agent on real videos (web footage, cropped and resized), and fig. 8b shows the
guided back propagation and control output of a few frames.
Although the results are partial as we cannot act on the
video (i.e. control commands are never applied), the decision
performance is surprisingly good for such a shallow network.
In various environments the bot is capable of outputting the correct decision. To the best of our knowledge, it is the first time a deep RL driving policy is shown to work on real images, which lets us foresee that simulation-based RL can be used as an initialization strategy for decision-making networks.
VI. CONCLUSION
This paper introduced a framework and several learning
strategies for end-to-end driving. The stochastic environment
used is significantly more complex than existing ones and
exhibits a variety of different physics/graphics with sharp
road layouts. Compared to prior research, we learned the
full control (lateral and longitudinal) including hand brake
for drifts, in a completely self-supervised manner. The com-
parative evaluation proved the importance of the learning strategy, as our performance is above that of existing approaches despite the significantly more challenging task and the longer training tracks. Additionally, the bot shows generalization capacities in racing mode, and the performance is even better with ad-hoc speed limits. Finally, experiments on real videos
show our training can be transposed to camera images.
End-to-end driving is a challenging task and research in this field is still at an early stage. These results constitute a step towards a real end-to-end driving platform.
REFERENCES
[1] A policy on Geometric Design of Highways and Streets. American
Association of State Highway and Transportation Officials, 2001.
[2] M. Al-Qizwini, I. Barjasteh, H. Al-Qassab, and H. Radha. Deep
learning algorithm for autonomous driving using googlenet. In
Intelligent Vehicles (IV), 2017 IEEE, pages 89–96. IEEE, 2017.
[3] M. Bajracharya, A. Howard, L. H. Matthies, B. Tang, and M. Turmon.
Autonomous off-road navigation with end-to-end learning for the lagr
program. Journal of Field Robotics, 26(1):3–25, 2009.
[4] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp,
P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to
end learning for self-driving cars. arXiv preprint arXiv:1604.07316,
2016.
[5] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. Deepdriving: Learning
affordance for direct perception in autonomous driving. In Proceedings
of the IEEE International Conference on Computer Vision, pages
2722–2730, 2015.
[6] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel.
Benchmarking deep reinforcement learning for continuous control.
In International Conference on Machine Learning, pages 1329–1338,
2016.
[7] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep q-
learning with model-based acceleration. In M. F. Balcan and K. Q.
Weinberger, editors, Proceedings of The 33rd International Conference
on Machine Learning, volume 48 of Proceedings of Machine Learning
Research, pages 2829–2838, New York, New York, USA, 20–22 Jun
2016. PMLR.
[8] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. arXiv preprint arXiv:1605.02097, 2016.
[9] B. Lau. Using Keras and Deep Deterministic Policy Gradient to play
TORCS, 2016.
[10] Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp. Off-road
obstacle avoidance through end-to-end learning. In NIPS, pages 739–
746, 2005.
[11] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training
of deep visuomotor policies. Journal of Machine Learning Research,
17(39):1–40, 2016.
[12] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra. Continuous control with deep reinforce-
ment learning. arXiv preprint arXiv:1509.02971, 2015.
[13] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley,
D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep
reinforcement learning. In International Conference on Machine
Learning, 2016.
[14] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou,
D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement
learning. arXiv preprint arXiv:1312.5602, 2013.
[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,
et al. Human-level control through deep reinforcement learning.
Nature, 518(7540):529–533, 2015.
[16] M. Montemerlo, J. Becker, S. Bhat, H. Dahlkamp, D. Dolgov, S. Et-
tinger, D. Haehnel, T. Hilden, G. Hoffmann, B. Huhnke, et al. Junior:
The stanford entry in the urban challenge. Journal of field Robotics,
25(9):569–597, 2008.
[17] E. Perot, M. Jaritz, M. Toromanoff, and R. De Charette. End-to-end
driving in a realistic racing game with deep reinforcement learning. In
International conference on Computer Vision and Pattern Recognition-
Workshop, 2017.
[18] D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural
network. In Advances in neural information processing systems, pages
305–313, 1989.
[19] V. Rausch, A. Hansen, E. Solowjow, C. Liu, E. Kreuzer, and J. K.
Hedrick. Learning a deep neural net policy for end-to-end control of
autonomous vehicles. In American Control Conference (ACC), 2017,
pages 4914–4919. IEEE, 2017.
[20] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller.
Striving for simplicity: The all convolutional net. arXiv preprint
arXiv:1412.6806, 2014.
[21] Z. Sun, G. Bebis, and R. Miller. On-road vehicle detection: A review.
IEEE transactions on Pattern Analysis and Machine Intelligence,
28(5):694–711, 2006.
[22] C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark,
J. Dolan, D. Duggins, T. Galatali, C. Geyer, et al. Autonomous driving
in urban environments: Boss and the urban challenge. Journal of Field
Robotics, 25(8):425–466, 2008.
[23] R. J. Williams. Simple statistical gradient-following algorithms for
connectionist reinforcement learning. Machine learning, 8(3-4):229–
256, 1992.
[24] H. Xu, Y. Gao, F. Yu, and T. Darrell. End-to-end learning of
driving models from large-scale video datasets. arXiv preprint
arXiv:1612.01079, 2016.