Outracing champion Gran Turismo drivers with deep reinforcement learning

Peter R. Wurman1,*, Samuel Barrett2, Kenta Kawamoto4, James MacGlashan2, Kaushik Subramanian3, Thomas J. Walsh2, Roberto Capobianco3, Alisa Devlic3, Franziska Eckert3, Florian Fuchs3, Leilani Gilpin2, Piyush Khandelwal2, Varun Kompella2, HaoChih Lin3, Patrick MacAlpine2, Declan Oller2, Takuma Seno4, Craig Sherstan2, Michael D. Thomure2, Houmehr Aghabozorgi2, Leon Barrett2, Rory Douglas2, Dion Whitehead2, Peter Dürr3, Peter Stone2, Michael Spranger4, and Hiroaki Kitano4

1 Sony AI, Boston, Massachusetts, USA
2 Sony AI, North America (various locations)
3 Sony AI, Zurich, Switzerland
4 Sony AI, Tokyo, Japan
* peter.wurman@sony.com
ABSTRACT

Many potential applications of artificial intelligence involve making real-time decisions in physical systems while interacting with humans. Automobile racing represents an extreme example of these conditions; drivers must execute complex tactical maneuvers to pass or block opponents while operating their vehicles at their traction limits [1]. Racing simulations, such as the PlayStation game Gran Turismo, faithfully reproduce the nonlinear control challenges of real race cars while also encapsulating the complex multi-agent interactions. Here we describe how we trained agents for Gran Turismo that can compete with the world's best e-sports drivers. We combine state-of-the-art model-free deep reinforcement learning algorithms with mixed-scenario training to learn an integrated control policy that combines exceptional speed with impressive tactics. In addition, we construct a reward function that enables the agent to be competitive while adhering to racing's important, but under-specified, sportsmanship rules. We demonstrate the capabilities of our agent, Gran Turismo Sophy, by winning a head-to-head competition against four of the world's best Gran Turismo drivers. By describing how we trained championship-level racers, we illuminate the possibilities and challenges of using these techniques to control complex dynamical systems in domains where agents must respect imprecisely defined human norms.
Introduction

Deep reinforcement learning (deep RL) has been a key component of impressive recent artificial intelligence milestones in domains such as Atari [2], Go [3,4], StarCraft [5], and DOTA [6]. For deep RL to impact robotics and automation, researchers must demonstrate success controlling complex physical systems. In addition, many potential applications of robotics require interacting in close proximity to humans while respecting imprecisely specified human norms. Automobile racing is a domain that poses exactly these challenges; it requires real-time control of vehicles with complex, non-linear dynamics while operating within inches of opponents. Fortunately, it is also a domain for which highly realistic simulations exist, making it amenable to experimentation with machine learning approaches.
Research on autonomous racing has accelerated in recent years, leveraging full-sized [7-10], scale [11-15], and simulated [16-25] vehicles. A common approach pre-computes trajectories [26,27] and uses model predictive control to execute those trajectories [7,28]. However, when driving at the absolute limits of friction, small modeling errors can be catastrophic. Racing against other drivers puts even greater demands on modeling accuracy, introduces complex aerodynamic interactions, and further requires that engineers design control schemes that continuously predict and adapt to the trajectories of other cars. Racing with real driver-less vehicles still appears to be several years away, as the recent Indy Autonomous Challenge curtailed its planned head-to-head competition to time trials and simple obstacle avoidance [29].
Researchers have explored various ways to use machine learning to avoid this modeling complexity, including using supervised learning to model vehicle dynamics [8,12,30], and using imitation learning [31], evolutionary approaches [32], or reinforcement learning [16,21] to learn driving policies. Although some studies achieved super-human performance in solo driving [24], or progressed to simple passing scenarios [16,20,25,33], none have tackled racing at the highest levels.
To be successful, racers must become highly skilled in four areas: (1) race car control, (2) racing tactics, (3) racing etiquette, and (4) racing strategy. To control the car, drivers develop a detailed understanding of the dynamics of their vehicle and the idiosyncrasies of the track on which they are racing. Upon this foundation, drivers build the tactical skills needed to pass and defend against opponents, executing precise maneuvers at high speed with little margin for error. At the same time, drivers must conform to highly refined, but imprecisely specified, sportsmanship rules. Finally, drivers employ strategic thinking when modeling opponents and deciding when and how to attempt a pass.
In this article we describe how we used model-free, off-policy deep RL to build a champion-level racing agent, which we call Gran Turismo Sophy (GT Sophy). GT Sophy was developed to compete with the world's best players of the highly realistic PlayStation™ 4 (PS4) game Gran Turismo (GT) Sport [34], developed by Polyphony Digital, Inc. We demonstrate GT Sophy by competing against top human drivers on three car and track combinations that posed different racing challenges. The car used on the first track, Dragon Trail Seaside (Seaside), was a high-performance road vehicle. On the second track, Lake Maggiore GP (Maggiore), the vehicle was equivalent to the FIA GT3 class of race cars. The third and final race took place on the Circuit de la Sarthe (Sarthe), famous as the home of the 24 Hours of Le Mans. This race featured the Red Bull X2019 Competition race car, which can reach speeds in excess of 300 km/h. Though lacking strategic savvy, in the process of winning the races against humans, GT Sophy demonstrated significant advances in the first three of the four skill areas mentioned above.
Approach

The training configuration is illustrated in Figure 1(a). Gran Turismo runs only on PlayStations, which necessitated that the agent run on a separate computing device and communicate asynchronously with the game via TCP. Although GT ran only in real time, each GT Sophy instance controlled up to 20 cars on its PlayStation, which accelerated data collection. We typically trained GT Sophy from scratch using 10-20 PlayStations, an equal number of compute instances, and a GPU machine that asynchronously updated the neural networks.
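To make the data flow concrete, a minimal sketch of one rollout worker's loop in the spirit of this configuration is shown below; the `GameClient`-style and `TrainerClient`-style objects, their method names, and the episode length are illustrative stand-ins, not the actual interfaces used for GT Sophy.

```python
class RolloutWorker:
    """Sketch of a rollout worker: runs the latest policy against one game instance
    and streams <s, a, r> tuples to the trainer's experience replay buffer."""

    def __init__(self, game_client, trainer_client, policy, num_cars=20):
        self.game = game_client        # hypothetical wrapper around the TCP link to one PS4
        self.trainer = trainer_client  # hypothetical link to the trainer (ERB + policy updates)
        self.policy = policy           # most recent policy received from the trainer
        self.num_cars = num_cars

    def run_episode(self, scenario, duration_s=150.0, hz=10):
        self.game.load_scenario(scenario)
        states = self.game.get_states()                  # one observation per controlled car
        for _ in range(int(duration_s * hz)):
            actions = [self.policy.act(s) for s in states]
            self.game.send_actions(actions)              # the game advances in real time, asynchronously
            next_states, rewards = self.game.get_states_and_rewards()
            for s, a, r in zip(states, actions, rewards):
                self.trainer.store(s, a, r)              # lands in the experience replay buffer
            states = next_states
        # pick up whatever parameters the trainer has published most recently
        self.policy = self.trainer.latest_policy() or self.policy
```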
The agent's core actions were mapped to two continuous-valued dimensions: changing velocity (accelerating or braking) and steering (left or right). The effect of the actions was enforced by the game to be consistent with the physics of the environment; GT Sophy cannot brake harder than humans, but it can learn more precisely when to brake. GT Sophy interacted with the game at 10 Hz, which we claim does not give GT Sophy a particular advantage over professional gamers [35] or athletes [36].
As is common [26,27], the agent was given a static map defining the left and right edges and the center line of the track. We encoded the approaching course segment as 60 equally spaced 3D points along each edge of the track and the center line (Figure 1(b)). The span of the points in any given observation was a function of the current velocity so as to always represent approximately the next 6 seconds of travel. The points were computed from the track map and presented to the neural network in the agent's egocentric frame of reference.
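A minimal sketch of how such a velocity-dependent lookahead could be computed from a precomputed track map is given below; the array layout, function name, and the simple speed-times-horizon conversion are assumptions for illustration only.

```python
import numpy as np

def course_points_ahead(line_xyz, arclength, car_l, speed_mps,
                        lookahead_s=6.0, n_points=60):
    """Sample n_points equally spaced 3D points along one track line (left edge,
    centerline, or right edge) covering roughly the next lookahead_s seconds of
    travel at the current speed.

    line_xyz:   (K, 3) array of 3D points along the line, ordered by arclength.
    arclength:  (K,) cumulative distance in meters of each point from the start line.
    car_l:      the car's current centerline distance in meters from the start line.
    speed_mps:  current speed in meters per second.
    """
    lap_length = arclength[-1]
    span = speed_mps * lookahead_s                       # distance covered in ~6 s
    targets = (car_l + np.linspace(0.0, span, n_points)) % lap_length
    # interpolate each coordinate along the arclength parameterization, wrapping at the lap
    points = np.stack(
        [np.interp(targets, arclength, line_xyz[:, i], period=lap_length) for i in range(3)],
        axis=1,
    )
    return points  # these would then be rotated into the car's egocentric frame
```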
Through an API, GT Sophy observed the positions, velocities, accelerations, and other relevant state information about itself and all opponents. To make opponent information amenable to deep learning, GT Sophy maintained two lists of their state features: one for cars in front of the agent and one for cars behind. Both lists were ordered from closest to farthest and limited by a maximum range.
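The following sketch illustrates one way such fixed-size, nearest-first opponent lists could be assembled; the attribute names, feature layout, range, and list lengths are hypothetical and not the exact encoding used in the paper.

```python
import numpy as np

def opponent_features(ego, opponents, max_range_m=100.0, max_cars=4, feat_dim=9):
    """Build two fixed-size opponent lists: cars ahead of and cars behind the agent,
    each ordered nearest-first and truncated or zero-padded to max_cars entries.
    `ego` and each opponent are assumed to expose a centerline distance `l` plus
    relative position/velocity/acceleration vectors (3 + 3 + 3 = feat_dim values)."""
    ahead, behind = [], []
    for opp in opponents:
        gap = opp.l - ego.l                      # signed centerline distance to the opponent
        if abs(gap) > max_range_m:
            continue                             # outside the encoding range
        feats = np.concatenate([opp.rel_position, opp.rel_velocity, opp.rel_acceleration])
        (ahead if gap >= 0 else behind).append((abs(gap), feats))

    def pack(cars):
        cars = sorted(cars, key=lambda c: c[0])[:max_cars]      # closest first
        out = np.zeros((max_cars, feat_dim))
        for i, (_, feats) in enumerate(cars):
            out[i] = feats
        return out

    return pack(ahead), pack(behind)
```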
We trained GT Sophy using a novel deep reinforcement learning algorithm we call quantile-regression soft actor-critic (QR-SAC). This approach learns a policy (actor) that selects an action based on the agent's observations, and a value function (critic) that estimates the future rewards of each possible action. QR-SAC extends the soft actor-critic approach [37,38] by modifying it to handle N-step returns [39] and replacing the expected value of future rewards with a representation of the probability distributions of those rewards [40]. QR-SAC trains the neural networks asynchronously; it samples data from an experience replay buffer (ERB) [41], while actors simultaneously practice driving using the most recent policy and continuously fill the buffer with their new experiences.
The agent was given a progress reward [24] for the speed with which it advanced around the track and penalties if it went out of bounds, hit a wall, or lost traction. These shaping rewards allowed the agent to quickly receive positive feedback for staying on the track and driving fast. Notably, GT Sophy learned to get around the track in only a few hours, and learned to be faster than 95% of the humans in our reference data set [42] within a day or two. However, as shown in Figure 1(c), it trained for another nine or more days—accumulating over 45,000 driving hours—shaving off tenths of seconds, until its lap times stopped improving. With this training procedure, GT Sophy achieved superhuman time trial performance on all three tracks. Figure 1(d) shows the distribution of the best single lap times for over 17,700 players all driving the same car on Maggiore (the track with the smallest gap between GT Sophy and the humans). Figure 1(e) shows how consistent GT Sophy's lap times were, with a mean lap time about equal to the single best recorded human lap time.
The progress reward alone was not enough to incentivize the agent to win the race. If the opponent was sufficiently fast, the agent would learn to follow it and accumulate large rewards without risking potentially catastrophic collisions. As in prior work [25], adding rewards specifically for passing helped the agent learn to overtake other cars. We used a passing reward that was proportional to the distance by which the agent improved its position relative to each opponent within the local region. The reward was symmetric; if an opponent gained ground on the agent, the agent would see a proportional negative reward.
Like many other sports, racing, both physical and virtual, requires human judges. These stewards immediately review racing "incidents" and make decisions about which drivers, if any, receive penalties. A car with a penalty is forced by the game engine to slow down to 100 km/h in certain penalty zones on the track for the penalty duration. While a small amount of unintentional car-to-car contact is fairly common and considered acceptable, racing rules describe a variety of conditions under which drivers may be penalized. The rules are somewhat ambiguous and stewards' judgements incorporate a lot of context, like the impact the contact has on the immediate future of the cars involved. The fact that judges' decisions are subjective and contextual makes it difficult to encode these rules in a way that gives the agent clear signals to learn from. Racing etiquette is an example of the challenges that AI practitioners face when designing agents that interact with humans who expect those agents to conform to behavioral norms [43].
The observations the agent gets from the game include a flag when car contact occurs, but do not indicate whether a penalty was deserved. We experimented with several approaches to encode etiquette as instantaneous penalties based on situational analysis of the collisions. However, as we tried to more accurately model blame assignment, the resulting policies were judged much too aggressive by stewards and test drivers. For the final races, we opted for a conservative approach that penalized the agent for any collision in which it was involved (regardless of fault), with some additional penalties if the collision was likely to be considered unacceptable. Figures 2(a-h) isolate the effects of collision penalties and other key design choices made during this project.
While many applications of RL to games employ self-play to improve performance [3,44], the straightforward application of self-play was inadequate in this setting. For example, as a human enters a difficult corner, they may brake a fraction of a second earlier than the agent would. Even a small bump at the wrong moment can cause an opponent to lose control of their car. By racing against only copies of itself, the agent was ill-prepared for the imprecision it would see with human opponents. If the agent following does not anticipate the possibility of the opponent braking early, it will not be able to avoid rear-ending the human driver and will be assessed a penalty. This feature of racing—that one player's sub-optimal choice causes the other player to be penalized—is not a feature of zero-sum games like Go and chess. To alleviate this issue, we used a mixed population of opponents, including agents curated from prior experiments and the game's (relatively slower) built-in AI. Figure 2(e) shows the importance of these choices.
In addition, the opportunities to learn certain skills are rare. We call this the exposure problem; certain states of the world are not accessible to the agent without the "cooperation" of its opponents. For example, to execute a "slingshot pass", a car must be in the slipstream of an opponent on a long straightaway, a condition which may occur naturally a few times or not at all in an entire race. If that opponent always drives only on the right, the agent will learn to pass only on the left, and would be easily foiled by a human who chose to drive on the left. To address this issue, we developed a process we called mixed-scenario training. We worked with a retired competitive GT driver to identify a small number of race situations that were likely to be pivotal on each track. We then configured scenarios that presented the agent with noisy variations of those critical situations. In slipstream passing scenarios, we used simple PID controllers to ensure that the opponents followed certain trajectories, such as driving on the left, that we wanted our agent to be prepared for. Figure 1(f) shows the full-track and specialized scenarios for Sarthe. Importantly, all scenarios were present throughout the training regime; no sequential curriculum was needed. We used a form of stratified sampling [45] to ensure that situational diversity was present throughout training. Figure 2(h) shows that this technique resulted in more robust skills being learned.
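A minimal sketch of this kind of mixed-scenario assignment is shown below; the scenario names, sampling weights, and randomization ranges are purely illustrative and not the hand-tuned values used for GT Sophy.

```python
import random

# Every scenario type stays in the mix for the whole run (no sequential curriculum),
# drawn according to hand-tuned ratios. The weights below are invented for illustration.
SCENARIO_RATIOS = {
    "1v0_full_track": 0.15,
    "1v1_full_track": 0.15,
    "1v2_full_track": 0.10,
    "1v3_full_track": 0.15,
    "1v7_full_track": 0.15,
    "grid_start": 0.10,
    "slipstream_pass": 0.10,
    "final_chicane": 0.10,
}

def assign_scenario(rng=random):
    """Draw the next training scenario for a rollout worker and randomize its details."""
    names = list(SCENARIO_RATIOS)
    weights = [SCENARIO_RATIOS[n] for n in names]
    name = rng.choices(names, weights=weights, k=1)[0]
    return {
        "name": name,
        "track_position_m": rng.uniform(0, 13_600),   # Sarthe is roughly 13.6 km (illustrative)
        "start_speed_kph": rng.uniform(100, 280),
        "opponent_policy": rng.choice(["past_agent", "built_in_ai", "pid_follower"]),
    }
```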
Results

To evaluate GT Sophy, we raced the agent in two events against top GT drivers. The first event was on July 2nd, 2021, and involved both time trial and head-to-head races. In the time trial race, three of the world's top drivers were asked to try to beat GT Sophy's lap times. Although the human drivers were allowed to see a "ghost" of GT Sophy as they drove around the track, GT Sophy won all three matches. The results are shown in Figure 3(f).
The head-to-head race was held at Polyphony headquarters and, though limited to top Japanese players due to pandemic travel restrictions, included four of the world's best GT drivers. These drivers formed a team to compete against four instances of GT Sophy. Points were awarded to the team based on the final positions ({10, 8, 6, 5, 4, 3, 2, 1} from first to last), with Sarthe, the final and most challenging race, counting double. Each team started in either the odd or even positions based on their best qualifying time. The human drivers won the team event on July 2nd by a score of 86-70.
After examining GT Sophy's July 2nd performance, we improved the training regime, increased the network size, made small modifications to some features and rewards, and improved the population of opponents. GT Sophy handily won the rematch held on October 21st, 2021 by an overall team score of 104-52. Starting in the odd positions, team GT Sophy improved four spots on Seaside and Maggiore, and two on Sarthe. Figure 3(a-c) shows the relative positions of the cars through each race and the points earned by each individual.
One of the advantages of using deep RL to develop a racing agent is that it eliminates the need for engineers to program how and when to execute the skills needed to win the race—as long as it is exposed to the right conditions, the agent learns to do the right thing by trial and error. We observed that GT Sophy was able to perform multiple types of corner passing, use the slipstream effectively, disrupt the draft of a following car, block, and execute emergency maneuvers. Figure 3(d) shows particularly compelling evidence of GT Sophy's generalized tactical competence. The diagram illustrates a situation from the July 2nd event in which two GT Sophy cars both pass two human cars on a single corner on Maggiore. This kind of tactical competence was not limited to any particular part of the course. Figure 3(e) shows the number of passes that occurred on different sections of Sarthe from 100 4v4 races between two different GT Sophy policies. While slipstream passing on the straightaways was most common, the results show that GT Sophy was able to take advantage of passing opportunities on many different sections of Sarthe.
Although GT Sophy demonstrated enough tactical skill to beat expert humans in head-to-head racing, there are many areas for improvement, particularly in the area of strategic decision making. For example, GT Sophy takes the first opportunity to pass on a straightaway, sometimes leaving enough room on the same stretch of track for the opponent to use the slipstream to pass back. GT Sophy also aggressively tries to pass an opponent with a looming penalty, whereas a strategic human driver may wait and make the easy pass when the opponent is forced to slow down.
Conclusions

Simulated automobile racing is a domain that requires real-time, continuous control in an environment with highly realistic, complex physics. The success of GT Sophy in this environment shows, for the first time, that it is possible to train AI agents that are better than the top human racers across a range of car and track types. This result can be seen as another important step in the ongoing progression of competitive tasks in which computers can beat the very best people, such as chess, Go, Jeopardy, poker, and StarCraft. In the context of previous landmarks of this kind, GT Sophy is the first that deals with head-to-head, competitive, high-speed racing, which requires advanced tactics and subtle sportsmanship considerations. Agents like GT Sophy have the potential to make racing games more enjoyable, provide realistic, high-level competition for training professional drivers, and discover new racing techniques. The success of deep RL in this environment suggests that these techniques may soon impact real-world systems such as collaborative robotics, aerial drones, or autonomous vehicles.
Online content

A supplement is available that provides pseudocode for the training procedures and algorithms. The supplement also includes a table enumerating the hyperparameters. Videos of the races against the human drivers are available at https://sonyai.github.io/gt_sophy_public.
References

1. Milliken, W. F., Milliken, D. L. et al. Race Car Vehicle Dynamics, vol. 400 (Society of Automotive Engineers, Warrendale, PA, 1995).
2. Mnih, V. et al. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
3. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
4. Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).
5. Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019).
6. Berner, C. et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680 (2019).
7. Laurense, V. A., Goh, J. Y. & Gerdes, J. C. Path-tracking for autonomous vehicles at the limit of friction. In 2017 American Control Conference (ACC), 5586–5591 (IEEE, 2017).
8. Spielberg, N. A., Brown, M., Kapania, N. R., Kegelman, J. C. & Gerdes, J. C. Neural network vehicle models for high-performance automated driving. Science Robotics 4, DOI: 10.1126/scirobotics.aaw1975 (2019). https://robotics.sciencemag.org/content/4/28/eaaw1975.full.pdf.
9. Burke, K. Data makes it beta: Roborace returns for second season with updateable self-driving vehicles powered by NVIDIA DRIVE. https://blogs.nvidia.com/blog/2020/10/29/roborace-second-season-nvidia-drive/.
10. Leporati, G. No driver? No problem—this is the Indy Autonomous Challenge. https://arstechnica.com/cars/2021/07/a-science-fair-or-the-future-of-racing-the-indy-autonomous-challenge/.
11. Williams, G., Drews, P., Goldfain, B., Rehg, J. M. & Theodorou, E. A. Aggressive driving with model predictive path integral control. In 2016 IEEE International Conference on Robotics and Automation (ICRA), 1433–1440 (IEEE, 2016).
12. Williams, G., Drews, P., Goldfain, B., Rehg, J. M. & Theodorou, E. A. Information-theoretic model predictive control: Theory and applications to autonomous driving. IEEE Transactions on Robotics 34, 1603–1622 (2018).
13. Pan, Y. et al. Agile autonomous driving using end-to-end deep imitation learning. In Proceedings of Robotics: Science and Systems, DOI: 10.15607/RSS.2018.XIV.056 (Pittsburgh, Pennsylvania, 2018).
14. Pan, Y. et al. Imitation learning for agile autonomous driving. The International Journal of Robotics Research 39, 286–302 (2020).
15. Amazon Web Services. AWS DeepRacer League. https://aws.amazon.com/deepracer/league/ (2019). [Online; accessed 01-June-2020].
16. Pyeatt, L. D. & Howe, A. E. Learning to race: Experiments with a simulated race car. In FLAIRS Conference, 357–361 (Citeseer, 1998).
17. Chaperot, B. & Fyfe, C. Improving artificial intelligence in a motocross game. In 2006 IEEE Symposium on Computational Intelligence and Games, 181–186 (IEEE, 2006).
18. Cardamone, L., Loiacono, D. & Lanzi, P. L. Evolving competitive car controllers for racing games with neuroevolution. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, 1179–1186 (2009).
19. Cardamone, L., Loiacono, D. & Lanzi, P. L. On-line neuroevolution applied to the open racing car simulator. In 2009 IEEE Congress on Evolutionary Computation, 2622–2629 (IEEE, 2009).
20. Loiacono, D., Prete, A., Lanzi, L. & Cardamone, L. Learning to overtake in TORCS using simple reinforcement learning. In IEEE Congress on Evolutionary Computation, 1–8 (IEEE, 2010).
21. Jaritz, M., de Charette, R., Toromanoff, M., Perot, E. & Nashashibi, F. End-to-end race driving with deep reinforcement learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 2070–2075 (2018).
22. Weiss, T. & Behl, M. DeepRacing: a framework for autonomous racing. In 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1163–1168 (IEEE, 2020).
23. Weiss, T., Babu, V. S. & Behl, M. Bezier curve based end-to-end trajectory synthesis for agile autonomous driving. In NeurIPS 2020 Machine Learning for Autonomous Driving Workshop (2020).
24. Fuchs, F., Song, Y., Kaufmann, E., Scaramuzza, D. & Dürr, P. Super-human performance in Gran Turismo Sport using deep reinforcement learning. IEEE Robotics and Automation Letters 6, 4257–4264, DOI: 10.1109/LRA.2021.3064284 (2021).
25. Song, Y., Lin, H., Kaufmann, E., Dürr, P. & Scaramuzza, D. Autonomous overtaking in Gran Turismo Sport using curriculum reinforcement learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2021).
26. Theodosis, P. A. & Gerdes, J. C. Nonlinear optimization of a racing line for an autonomous racecar using professional driving techniques. In Dynamic Systems and Control Conference, vol. 45295, 235–241 (American Society of Mechanical Engineers, 2012).
27. Funke, J. et al. Up to the limits: Autonomous Audi TTS. In 2012 IEEE Intelligent Vehicles Symposium, 541–547 (IEEE, 2012).
28. Kritayakirana, K. & Gerdes, J. C. Autonomous vehicle control at the limits of handling. International Journal of Vehicle Autonomous Systems 10, 271–296 (2012).
29. Bonkowski, J. Here's what you missed from the Indy Autonomous Challenge main event. https://www.autoweek.com/racing/more-racing/a38069263/what-missed-indy-autonomous-challenge-main-event/.
30. Rutherford, S. J. & Cole, D. J. Modelling nonlinear vehicle dynamics with neural networks. International Journal of Vehicle Design 53, 260–287 (2010).
31. Pomerleau, D. A. Knowledge-based training of artificial neural networks for autonomous robot driving. In Robot Learning, 19–43 (Springer, 1993).
32. Togelius, J. & Lucas, S. M. Evolving robust and specialized car racing skills. In 2006 IEEE International Conference on Evolutionary Computation, 1187–1194 (IEEE, 2006).
33. Schwarting, W. et al. Deep latent competition: Learning to race using visual control policies in latent space. arXiv preprint arXiv:2102.09812 (2021).
34. Gran Turismo. https://www.gran-turismo.com/us/.
35. Gozli, D. G., Bavelier, D. & Pratt, J. The effect of action video game playing on sensorimotor learning: Evidence from a movement tracking task. Human Movement Science 38, 152–162 (2014).
36. Davids, K., Williams, A. M. & Williams, J. G. Visual Perception and Action in Sport (Routledge, 2005).
37. Haarnoja, T., Zhou, A., Abbeel, P. & Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 1856–1865 (2018).
38. Haarnoja, T. et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905 (2018).
39. Mnih, V. et al. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937 (PMLR, 2016).
40. Dabney, W., Rowland, M., Bellemare, M. G. & Munos, R. Distributional reinforcement learning with quantile regression. In AAAI (2018).
41. Lin, L. Reinforcement Learning for Robots Using Neural Networks (Carnegie Mellon University, 1992).
42. Kudos Prime Gran Turismo Sport rankings. https://www.kudosprime.com/gts/rankings.php?sec=daily.
43. Siu, H. C. et al. Evaluation of human-AI teams for learned and rule-based agents in Hanabi. arXiv preprint arXiv:2107.07630 (2021).
44. Tesauro, G. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation 6, 215–219 (1994).
45. Devore, J. L. Probability and Statistics, 6th edn (Brooks/Cole, Belmont, CA, 2004).
Acknowledgements

We thank K. Yamauchi, S. Takano, A. Hayashi, C. Ferreira, N. Nozawa, T. Teramoto, M. Hakim, K. Yamada, S. Sakamoto, T. Ueda, A. Yago, J. Nakata, and H. Imanishi at Polyphony Digital for making the Gran Turismo franchise, providing support throughout the project, and organizing the Race Together events on July 2nd and Oct 21st. We also thank U. Gallizzi, J. Beltran, G. Albowicz, R. Abdul-ahad and the staff at CGEI for access to their PlayStation Now network to train agents and their help building the infrastructure for our experiments. We benefited from the advice of T. Grossenbacher, a retired competitive GT driver. Finally, we thank E. Kato Marcus and E. Ohshima of Sony AI, who managed the partnership activities with Polyphony Digital and Sony Interactive Entertainment.
Author contributions statement

P.W. managed the project. S.B., K.K., P.K., J.M., K.S., and T.W. led the research and development efforts. R.C., A.D., F.E., F.F., L.G., V.K., H.L., P.M., D.O., C.S., T.S. and M.D. participated in the research and the development of GT Sophy and the AI libraries. H.A., L.B., R.D., and D.W. built the research platform that connected to CGEI's PlayStation network. P.S. provided executive support and technical and research advice, and P.D. provided executive support and technical advice. H.K. and M.S. conceived and set up the project, provided executive support, resources, and technical advice, and managed stakeholders.
[Figure 1 panels: (a) training architecture (trainer with ERB and QR-SAC on a GPU; rollout workers on CPUs, each driving one PS4); (b) course representation; (c) learning curves, minimum lap time (s) versus training epoch; (d) distribution of best human lap times on Maggiore (number of players versus lap time), with GT Sophy's 4-, 8-, and 24-hour marks and the built-in AI indicated; (e) GT Sophy lap-time histogram versus the five best humans; (f) Sarthe training scenarios: full-track 1v0/1v1/1v2/1v3/1v7, grid start, slipstream, and final chicane. See the caption below.]
Figure 1. Training: Figure (a) shows an example training configuration. The trainer distributes training scenarios to rollout workers, each of which controls one PS4 running an instance of GT. The agent within the worker runs one copy of the most recent policy, π, to control up to 20 cars. The agent sends an action, a, for each car it controls to the game. Asynchronously, the game computes the next frames and sends each new state, s, to the agent. When the game reports that the action has been registered, the agent reports the state, action, reward tuple ⟨s, a, r⟩ to the trainer, which stores it in the ERB. The trainer samples the ERB to update the policy, π, and Q-function networks. Figure (b) shows the course representation ahead of the car on a sequence of curves on Maggiore if the car were traveling at 200 km/h. Figure (c) shows the distribution of the learning curves on Maggiore from 15 different random seeds. All of the seeds reached superhuman performance. Most reached it in 10 days of training, while the longest took 25 days. Figure (d) shows the distribution of individual players' best lap times on Maggiore as recorded on Kudos Prime [42]. Superimposed on (d) is the number of hours that GT Sophy, using 10 PlayStations with 20 cars each, needed to achieve similar performance. Figure (e) shows a histogram (in orange) of 100 laps from the time trial policy GT Sophy used on July 2nd compared to the five best human drivers' best lap times (circles 1–5) in the Kudos Prime data. Similar graphs for the other two tracks are in the supplement; Maggiore is the only one of the three tracks on which the best human performance was close to GT Sophy. Figure (f) illustrates the training scenarios on Sarthe, including five full-track configurations in which the agent starts with zero, one, two, three, or seven nearby opponents, and three specialized scenarios which are limited to the shaded regions. The actual track positions, opponents, and relative car arrangements are varied to ensure the learned skills are robust.
[Figure 2 panels: (a) RL algorithm (SAC versus QR-SAC), (b) course representation, (c) time-trial rewards, and (d) N-step returns, all measured by lap time (s); (e) opponent population and (f) collision penalties, measured by 4v4 team score versus questionable collisions; (g) skill development, measured by 4v4 team score; (h) success rate on the slipstream test versus training epoch. See the caption below.]
Figure 2. Ablations: Figures (a–d) show the impact of various settings on Maggiore performance using the 2048×2 July 2nd time-trial network. All bars represent the average across five initial seeds, with the full range of the samples displayed as an error bar. In all graphs, the baseline settings are colored in a darker shade of blue. (a) shows that GT Sophy would not be faster than the best human on Maggiore without the QR enhancement to SAC. (b) shows that representing the upcoming track as sequences of points was advantageous. (c) shows that not including the off-course penalty results in a slower lap time and (in parentheses) a much lower percentage of laps without exceeding the course boundaries. Interestingly, (d) shows that the 5-step return used on July 2nd was not the best choice; this was changed to a 7-step return for the October match. Figures (e–h) evaluate the 2048×4 networks and configurations used to train the version of GT Sophy that raced on Oct. 21st. In (e,f), each point represents the average of ten 7-lap 4v4 races on Sarthe against copies of October GT Sophy, and the panels compare the trade-offs between team score and "questionable collisions" (a rough indication of possible penalties). (e) shows that when GT Sophy trained against only the built-in AI, it learned to be too aggressive, and when it trained against an aggressive opponent, it lost its competitive edge. (f) shows that, as elements of the collision penalties are removed from GT Sophy's reward function, it becomes significantly more aggressive. The test drivers and stewards judged the non-baseline policies to be much too unsportsmanlike. To make the importance of the features evaluated in (g) clearer, we tested these variations against a slightly less competitive version of GT Sophy. The results show the importance of the scenario training, using multiple ERBs, and having a passing reward. (h) shows an ablation of elements of the slipstream training over a range of epochs sampled during training. The vertical axis measures the agent's ability to pass a particular slipstream test. The solid lines represent the performance of one seed in each condition, and the dotted lines represent the mean of five seeds over all epochs. Note that the agent's ability to apply the skill fluctuates even in the best (baseline) case because of the changing characteristics of the replay buffer.
[Figure 3 panels: (a–c) race charts with final placings and points:
Seaside: GT Sophy (10), GT Sophy (8), Kokubun (6), GT Sophy (5), GT Sophy (4), Miyazono (3), Yamanaka (2), Ryu (1).
Maggiore: GT Sophy (10), GT Sophy (8), Yamanaka (6), GT Sophy (5), GT Sophy (4), Ryu (3), Kokubun (2), Miyazono (1).
Sarthe: GT Sophy (20), GT Sophy (16), Kokubun (12), Yamanaka (10), GT Sophy (8), GT Sophy (6), Ryu (4), Miyazono (2).
(d) Maggiore double-pass diagram; (e) passing locations on Sarthe; (f) time trial results:
Seaside: Emily Jones 107.964 s, GT Sophy 106.417 s; Maggiore: Valerio Gallo 114.466 s, GT Sophy 114.249 s; Sarthe: Igor Fraga 194.888 s, GT Sophy 193.080 s.
See the caption below.]
Figure 3. Results: Figures (a–c) show how each race unfolded on (a) Seaside, (b) Maggiore, and (c) Sarthe. The distance from the leader is computed as the time since the lead car passed the same position on the track. The legend for each race shows the final places and, in parentheses, the points for each driver. These charts clearly show how, once GT Sophy got a small lead, the human drivers could not catch it. The sharp decreases represent either a driver losing control or paying a penalty for either exceeding the course bounds or colliding with another driver. Sarthe (c) had the most incidents, with GT Sophy getting two penalties for excessive contact and the humans getting one penalty and two warnings. Both the humans and GT Sophy also had several smaller penalties for exceeding the course boundaries, particularly in the final chicane sequence. (d) illustrates an example from the July 2nd race in which two instances of GT Sophy (grey, green) passed two humans (yellow, blue) on a corner on Maggiore. As a reference, the lead GT Sophy car's trajectory when taking the corner alone is shown in red. The example clearly illustrates that GT Sophy's trajectory through the corner is contextual; even though the human drivers tried to protect the inside going into the corner, GT Sophy was able to find two different, faster trajectories. (e) shows the number of passes that occurred on different parts of Sarthe in 100 4v4 races between two GT Sophy policies, demonstrating that the agent has learned to pass on many parts of the track. (f) shows the results from the time trial competition in July.
Methods

Game environment
Since its debut in 1997, the Gran Turismo (GT) franchise has sold over 80 million units. The most recent release, Gran Turismo Sport, is known for precise vehicle dynamics simulation and racing realism, earning it the distinction of being sanctioned by the FIA and selected as a platform for the first Virtual Olympics [46]. GT Sport runs only on PS4s and at a 60 Hz dynamics simulation cycle. A maximum of 20 cars can be in any race.
Our agent ran asynchronously on a separate computer and communicated with the game via HTTP over wired Ethernet. The agent requested the latest observation and made decisions at 10 Hz (every 100 ms). We tested action frequencies from 5 Hz to 60 Hz and found no significant performance gains from acting more frequently than 10 Hz. The agent had to be robust to the infrequent, but real, networking delays. The agent's action was treated the same as a human's game controller input, but only a subset of action capabilities was supported in the GT API. For example, the API did not allow the agent to control gear shifting, the Traction Control System, or the brake balance, all of which can be adjusted in game by human players.
Computing environment
Each experiment used a single trainer on a compute node with either one NVIDIA V100 or half of an NVIDIA A100, coupled with approximately 8 vCPUs and 55 GiB of memory. Some of these trainers were run in PlayStation Now data centers and others in AWS EC2 using p3.2xlarge instances.

Each experiment also used a number of rollout workers, where each rollout worker consisted of a compute node controlling a PS4. In this setup, the PS4 ran the game, and the compute node managed the rollouts by doing tasks such as computing actions, sending them to the game, sending experience streams to the trainer, and getting updated policies from the trainer (see Figure 1(a)). The compute node used approximately 2 vCPUs and 3.3 GB of memory. In the time trial experiments, 10 rollout workers (and therefore 10 PS4s) were used for approximately 8 days. To train policies that could drive in traffic, 21 rollout workers were used for between 7 and 12 days. In both cases, one worker was primarily evaluating intermediate policies rather than generating new training data.
Actions
The GT API enabled control of three independent continuous actions: throttle, brake, and steering. Because the throttle and brake are rarely engaged at the same time, the agent was presented control over the throttle and brake as one continuous action dimension. The combined dimension was scaled to $[-1, 1]$. Positive values engaged the throttle (with maximum throttle at $+1$), while negative values engaged the brake (with maximum braking at $-1$); the value zero engaged neither the throttle nor the brake. The steering dimension was also scaled to $[-1, 1]$, where the extreme values corresponded to the maximum steering angle possible in either direction for the vehicle being controlled.

The policy network selected actions by outputting a squashed normal distribution with a learned mean and diagonal covariance matrix over these two dimensions. The squashed normal distribution enforced sampled actions to always be within the $[-1, 1]$ action bounds [37]. The diagonal covariance matrix values were constrained to lie in the range $(e^{-40}, e^{4})$, allowing for nearly deterministic or nearly uniform random action selection policies to be learned.
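A standard SAC-style squashed-Gaussian policy head of this form could be sketched as follows (PyTorch); the layer shapes and the log-standard-deviation clamp are illustrative choices consistent with the covariance range described above, not the exact implementation used for GT Sophy.

```python
import torch
import torch.nn as nn

# Illustrative bounds on log(std); equivalent to a covariance range of roughly (e^-40, e^4).
LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0

class SquashedGaussianHead(nn.Module):
    """Sketch of a policy head producing the two continuous actions described above:
    combined throttle/brake and steering, both squashed into [-1, 1]."""

    def __init__(self, hidden_dim, action_dim=2):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Linear(hidden_dim, action_dim)

    def forward(self, h):
        mean = self.mean(h)
        log_std = self.log_std(h).clamp(LOG_STD_MIN, LOG_STD_MAX)
        dist = torch.distributions.Normal(mean, log_std.exp())
        pre_tanh = dist.rsample()                       # reparameterized sample
        action = torch.tanh(pre_tanh)                   # squash into [-1, 1]
        # log-probability with the tanh change-of-variables correction used by SAC
        log_prob = dist.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(-1)
```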
Features
A variety of state features were input to the neural networks. These features were either directly available from the game state or processed into more convenient forms, and they were concatenated before being input to the models.

Time trial features - To learn competent time trial performance, the agent needed features that allowed it to learn how the vehicle behaved and what the upcoming course looked like. The list of vehicle features included the car's 3D velocity, 3D angular velocity, 3D acceleration, load on each tire, and tire slip angles. Information about the environment was converted into features including the scalar progress of the car along the track represented as sine and cosine components, the local course surface inclination, the car's orientation with respect to the course center line, and the (left, center, and right) course points describing the course ahead based on the car's velocity. The agent also received indicators if it contacted a fixed barrier or was considered off-course by the game, and it received real values for the game's view of the car's most recent steering angle, throttle intensity, and brake intensity. We relied on the game engine to determine whether the agent was off-course (defined as when three or more tires are out of bounds) because the out-of-bounds regions are not exactly defined by the course edges; kerbs and other tarmac areas outside the track edges are often considered in-bounds.
Racing features - When training the agent to race against other cars, the list of features also included a car contact flag to detect collisions and a slipstream scalar that indicated whether the agent was experiencing the slipstream effect from the cars in front of it. To represent the nearby cars, the agent used a fixed forward and rear distance bound to determine which cars to encode. The cars were ordered by their relative distance to the agent and represented using their relative center-of-mass position, velocity, and acceleration. The combination of features provided the information required for the agent to drive fast and learn to overtake cars while avoiding collisions.

To keep the features described here in a reasonable numerical range when training neural networks, we standardized the inputs based on knowledge of the range of each feature scalar. We assumed the samples were drawn from a uniform distribution over the given range and computed the expected mean and standard deviation. These were used to compute the z-score for each scalar before it was input to the models.
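For a feature known to lie in a range [low, high], this standardization amounts to the small helper sketched below (the example range is invented for illustration).

```python
import math

def uniform_standardizer(low, high):
    """Assume the scalar is uniformly distributed over [low, high] and z-score it
    with the implied mean and standard deviation."""
    mean = (low + high) / 2.0
    std = (high - low) / math.sqrt(12.0)     # standard deviation of U(low, high)
    return lambda x: (x - mean) / std

# e.g. a speed feature assumed to lie in [0, 350] km/h (illustrative range only)
standardize_speed = uniform_standardizer(0.0, 350.0)
print(standardize_speed(200.0))
```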
Rewards
The reward function was a hand-tuned linear combination of reward components computed on the transition between the previous state $s$ and the current state $s'$. The reward components were: course progress ($R_{cp}$), off-course penalty ($R_{soc}$ or $R_{loc}$), wall penalty ($R_w$), tire slip penalty ($R_{ts}$), passing bonus ($R_{ps}$), any-collision penalty ($R_c$), rear-end penalty ($R_r$), and unsporting-collision penalty ($R_{uc}$). The reward weightings for the three tracks are shown in Extended Data Table 1.

Due to the high speeds on Sarthe, training for that track used a slightly different off-course penalty, included the unsporting-collision penalty, and excluded the tire slip penalty. Note that, to reduce variance in time-sensitive rewards, such as course progress and the off-course penalty, we filtered out transitions when network delays were encountered. The components are described in detail below.
Course progress ($R_{cp}$) - Following previous work [24], the primary reward component rewarded the amount of progress made along the track since the last observation. To measure progress, we made use of the state variable $l$ that measured the length (in meters) along the centerline from the start of the track. The agent's centerline distance $l$ was estimated by first projecting its current position to the closest point on the centerline. The progress reward was the difference in $l$ between the previous and current state: $R_{cp}(s, s') \triangleq s'_l - s_l$. To reduce the incentive to cut corners, this reward was masked when the agent was driving off course.
Off-course penalty ($R_{soc}$ or $R_{loc}$) - The off-course reward penalty was proportional to the squared speed at which the agent was traveling, to further discourage corner cutting that may result in a very large gain in position: $R_{soc}(s, s') \triangleq -(s'_o - s_o)\,(s'_{kph})^2$, where $s_o$ is the cumulative time off course and $s_{kph}$ is the speed in kilometers per hour. To avoid an explosion in values at Sarthe, where driving speeds were significantly faster and corners particularly easy to cut, we used a penalty that was proportional to the speed (not squared): $R_{loc}(s, s') \triangleq -(s'_o - s_o)\, s'_{kph}$, and the penalty was doubled for the difficult first and final chicane.
Wall penalty ($R_w$) - To assist the agent in learning to avoid walls, a wall contact penalty was included. This penalty was proportional to the squared speed of the car and the amount of time in contact with the wall since the last observation: $R_w(s, s') \triangleq -(s'_w - s_w)\,(s'_{kph})^2$, where $s_w$ is the cumulative time the agent was in contact with a wall.
Tire slip penalty ($R_{ts}$) - Tire slip makes it more difficult to control the car. To assist learning, we included a penalty when the tires were slipping in a different direction from the one in which they were pointing: $R_{ts}(s, s') \triangleq -\sum_{i=1}^{4} \min(|s'_{tsr,i}|, 1.0)\,|s'_{ts\theta,i}|$, where $s_{tsr,i}$ is the tire slip ratio for the $i$th tire and $s_{ts\theta,i}$ is the angle of the slip from the forward direction of the $i$th tire.
Passing bonus ($R_{ps}$) - As in prior work [25], to incentivize passing opponents, we included a term that positively rewarded gaining ground on and overtaking opponents, and negatively rewarded losing ground to an opponent. The negative reward ensured there were no positive-cycle reward loops to exploit and encouraged defensive play when an opponent was trying to overtake the agent. This reward was defined as $R_{ps}(s, s') \triangleq \sum_i (s_{L_i} - s'_{L_i})\,\max\!\big(\mathbb{1}_{(b,f)}(s_{L_i}),\, \mathbb{1}_{(b,f)}(s'_{L_i})\big)$, where $s_{L_i}$ is the projected centerline signed distance (in meters) from the agent to opponent $i$, and $\mathbb{1}_{(b,f)}(x)$ is an indicator function for when an opponent is no more than $b$ meters behind nor $f$ meters in front of the agent. We used $b = 20$ and $f = 40$ meters to train GT Sophy. The max operator ensures the reward is provided when the agent was within bounds in the previous state or in the current state. In the particularly complex first and final chicane of Sarthe, we masked this passing bonus to strongly discourage the agent from cutting corners to gain a passing reward.
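A direct transcription of this passing bonus into code might look like the following sketch; the gap bookkeeping and opponent identifiers are hypothetical.

```python
def passing_bonus(prev_gaps, curr_gaps, b=20.0, f=40.0):
    """Sketch of R_ps. prev_gaps and curr_gaps map each opponent id to the signed
    centerline distance (m) from the agent to that opponent (positive when the
    opponent is ahead), in the previous and current state respectively."""
    def in_range(x):
        # opponent no more than b meters behind nor f meters in front of the agent
        return 1.0 if -b <= x <= f else 0.0

    reward = 0.0
    for i in prev_gaps:
        gained = prev_gaps[i] - curr_gaps[i]              # > 0 when the agent gained ground
        reward += gained * max(in_range(prev_gaps[i]), in_range(curr_gaps[i]))
    return reward

# Example: the agent closes from 12 m behind an opponent to 4 m behind -> bonus of 8.0
print(passing_bonus({"opp_1": 12.0}, {"opp_1": 4.0}))
```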
Any-collision penalty ($R_c$) - To discourage collisions and pushing cars off the road, we included a reward penalty whenever the agent was involved in any collision. This was defined as a negative indicator whenever the agent collided with another car: $R_c(s, s') \triangleq -\max_{i \in N} s'_{c,i}$, where $s_{c,i}$ is 1 when the agent collided with opponent $i$ and 0 otherwise, and $N$ is the number of opponents.
Rear-end penalty ($R_r$) - Rear-ending an opponent was one of the more common ways to cause an opponent to lose control and for the agent to be penalized by stewards. To discourage bumping from behind, we included the penalty $R_r(s, s') \triangleq -\sum_i s'_{c,i} \cdot \mathbb{1}_{>0}(s'_{l,i} - s'_l) \cdot \|s'_v - s'_{v,i}\|_2^2$, where $s_{c,i}$ is a binary indicator for whether the agent was in a collision with opponent $i$, $\mathbb{1}_{>0}(s_{l,i} - s_l)$ is an indicator for whether opponent $i$ was in front of the agent, $s_v$ is the velocity vector of the agent, and $s_{v,i}$ is the velocity vector of opponent $i$. The penalty was speed dependent to more strongly discourage higher-speed collisions.
Unsporting-collision penalty ($R_{uc}$) - Due to the high speed of cars and the technical difficulty of Sarthe, training the agent to avoid collisions was particularly challenging. Merely increasing the any-collision penalty resulted in very timid agent behavior. To discourage getting into collisions without causing the agent to be too timid, we included an additional collision penalty for Sarthe. Like the any-collision penalty, this penalty was a negative boolean indicator. Unlike the any-collision penalty, it only fired when the agent rear-ended or sideswiped an opponent on a straightaway, or was in a collision in a curve that was not caused by an opponent rear-ending it: $R_{uc}(s, s') \triangleq -\max_{i \in N} u(s', i)$, where $u(s', i)$ indicates an unsporting collision as defined above.
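Putting the components together, the overall reward described at the start of this section is a weighted sum; the sketch below uses invented weights and component values rather than those in Extended Data Table 1.

```python
def total_reward(components, weights):
    """Sketch of the hand-tuned linear combination of reward components. `components`
    maps component names ('cp', 'loc', 'w', 'ts', 'ps', 'c', 'r', 'uc') to their values
    for the transition (s, s'); `weights` holds per-track coefficients."""
    return sum(weights.get(name, 0.0) * value for name, value in components.items())

# Illustrative numbers only -- not the values used for GT Sophy.
example_weights = {"cp": 1.0, "loc": 0.01, "w": 0.01, "ps": 0.5, "c": 5.0, "r": 1.0, "uc": 5.0}
example_components = {"cp": 3.2, "loc": 0.0, "w": 0.0, "ps": 8.0, "c": -1.0, "r": 0.0, "uc": 0.0}
print(total_reward(example_components, example_weights))
```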
Training algorithm
To train our agent, we used a novel extension of the soft actor-critic (SAC) [37] algorithm that we refer to as quantile-regression soft actor-critic (QR-SAC). To give the agent more capacity to predict the variation in the environment during a race, we make use of a quantile-regression Q-function [40] modified to accept continuous actions as inputs. QR-SAC is similar to distributional SAC (DSAC [47]), but uses a different formulation of the value backup and target functions. We used $M = 32$ quantiles and modified the loss function of the QR Q-function with an $N$-step TD backup. The target function, $y_i$, for the $i$-th quantile, $\hat{\tau}_i$, consists of terms for the immediate reward, $R_t = \sum_{i=1}^{N} \gamma^{i-1} r_{t+i}$, the estimated quantile value at the $N$-th future state, $Z_{\hat{\tau}_i}$, and the SAC entropy term. Like existing work using $N$-step backups [39], we do not correct for the off-policy nature of $N$-step returns stored in the replay buffer. To avoid the computational cost of forwarding the policy for intermediate steps of the $N$-step backup, we only include the entropy reward bonus that SAC adds to encourage exploration in the final step of the $N$-step backup. Despite this lack of off-policy correction and limited use of the entropy reward bonus, we found that using $N$-step backups significantly improved performance compared with a standard 1-step backup, as shown in Figure 2(d). To avoid overestimation bias, the $N$-th state quantiles are taken from the Q-function with the smallest $N$-th state mean value [48], indexed by $k$:
$$k = \operatorname*{arg\,min}_{m=1,2} Q(s_{t+N}, a' \mid \theta'_m), \qquad y_i = R_t + \gamma^N \Big( Z_{\hat{\tau}_i}(s_{t+N}, a' \mid \theta'_k) - \alpha \log \pi(a' \mid s_{t+N}, \phi) \Big) \qquad (1)$$
where $\theta$ and $\phi$ are the parameters of the Q-functions and the policy, respectively. Using this target value, $y_i$, the loss function of the Q-function is defined as follows:

$$\delta_{i,j} = y_i - Z_{\hat{\tau}_j}(s_t, a_t \mid \theta), \qquad L(\theta) = \frac{1}{M^2} \sum_i \sum_j \mathbb{E}_{s_t, a_t, R_t, s_{t+N} \sim D,\; a' \sim \pi}\big[ \rho_{\hat{\tau}_i}(\delta_{i,j}) \big] \qquad (2)$$
where $D$ represents data from the experience replay buffer and $\rho$ is a quantile Huber loss function40. Finally, the objective function for the policy is as follows:

$$J(\varphi) = \mathbb{E}_{s \sim D,\ a \sim \pi(a \mid s, \varphi)} \left[ \alpha \log \pi(a \mid s, \varphi) - \min_{i=1,2} Q(s, a \mid \theta_i) \right] \qquad (3)$$
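As a rough PyTorch illustration of equations (1) and (2), the sketch below computes the N-step QR-SAC target and the quantile Huber loss; the `policy.sample` interface, tensor shapes, and Huber threshold are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def qr_sac_target(r_nstep, s_tpn, qnets_targ, policy, alpha):
    """Equation (1): N-step target for each of the M quantiles.

    r_nstep    : [B]      discounted N-step return R_t
    s_tpn      : [B, ...] state s_{t+N}
    qnets_targ : two target Q-networks, each returning quantiles [B, M]
    """
    a_prime, logp = policy.sample(s_tpn)          # a' ~ pi(.|s_{t+N}), log pi(a'|s_{t+N})
    z1 = qnets_targ[0](s_tpn, a_prime)            # [B, M]
    z2 = qnets_targ[1](s_tpn, a_prime)            # [B, M]
    # use the network with the smaller mean value per sample (clipped double Q)
    use_first = (z1.mean(dim=1) <= z2.mean(dim=1)).unsqueeze(1)
    z_k = torch.where(use_first, z1, z2)          # [B, M]
    y = r_nstep.unsqueeze(1) + z_k - alpha * logp.unsqueeze(1)
    return y.detach()                             # [B, M]

def quantile_huber_loss(z_pred, y, tau_hat, kappa=1.0):
    """Equation (2): average quantile Huber loss over all (i, j) quantile pairs."""
    # pairwise TD errors delta[b, i, j] = y[b, i] - z_pred[b, j]
    delta = y.unsqueeze(2) - z_pred.unsqueeze(1)
    huber = F.huber_loss(z_pred.unsqueeze(1).expand_as(delta),
                         y.unsqueeze(2).expand_as(delta),
                         reduction='none', delta=kappa)
    # asymmetric quantile weights |tau_hat_j - 1{delta < 0}|
    weight = torch.abs(tau_hat.view(1, 1, -1) - (delta.detach() < 0).float())
    return (weight * huber).mean()
```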
The Q-functions and policy models used in the October race consist of four hidden layers with 2048 units each and a ReLU activation function. To achieve robust control, dropout49 with a 0.1 drop probability is applied to the policy function50. The parameters are optimized using the Adam optimizer51 with learning rates of $5.0 \times 10^{-5}$ and $2.5 \times 10^{-5}$ for the Q-function and policy, respectively. The discount factor $\gamma$ was 0.9896 and the SAC entropy temperature $\alpha$ was set to 0.01. The mixing parameter used when updating the target model parameters after every algorithm step was set to 0.005. The off-course penalty and the rear-end penalty can produce large penalty values due to the squared speed term, which makes Q-function training unstable due to large loss values. To mitigate this issue, the gradients of the Q-function are clipped by the global norm of the gradients with a threshold of 10.
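A minimal sketch of this optimization setup (network width, learning rates, global-norm gradient clipping at 10, and Polyak mixing of 0.005), written in PyTorch with placeholder input sizes; it is an illustration, not the production training code.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=2048, layers=4, p_drop=0.0):
    """Four hidden layers of 2048 ReLU units; dropout is used only on the policy."""
    mods, d = [], in_dim
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), nn.ReLU()]
        if p_drop > 0:
            mods.append(nn.Dropout(p_drop))
        d = hidden
    mods.append(nn.Linear(d, out_dim))
    return nn.Sequential(*mods)

obs_dim, act_dim, n_quantiles = 256, 3, 32           # placeholder sizes, not the paper's
q_net = mlp(obs_dim + act_dim, n_quantiles)
q_targ = mlp(obs_dim + act_dim, n_quantiles)
q_targ.load_state_dict(q_net.state_dict())           # target network starts as a copy
policy_net = mlp(obs_dim, 2 * act_dim, p_drop=0.1)   # e.g. mean and log-std per action

q_opt = torch.optim.Adam(q_net.parameters(), lr=5.0e-5)
pi_opt = torch.optim.Adam(policy_net.parameters(), lr=2.5e-5)

def q_update(loss, tau=0.005, max_norm=10.0):
    """One Q-function step: clip gradients by global norm, then soft-update the target."""
    q_opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(q_net.parameters(), max_norm)
    q_opt.step()
    with torch.no_grad():
        for p, p_t in zip(q_net.parameters(), q_targ.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```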
The rollout workers send state transition tuples ⟨s, a, r⟩ collected in an episode (of length 150 seconds) to the trainer, which stores the data in an ERB implemented using the Reverb Python library52. The buffer had a capacity of $10^7$ N-step transitions. The trainer began the training loop once 40,000 transitions had been collected and used a mini-batch of size 1024 to update the Q-function and policy. A training epoch comprised 6,000 gradient steps. After each epoch, the trainer sent the latest model parameters to the rollout workers.
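The ERB setup could look roughly like the following with the Reverb library; the table name, port, placeholder arrays, and single-table layout are assumptions (the actual system also split data across multiple tables for stratified sampling, as described below), not the authors' configuration.

```python
import numpy as np
import reverb

# One uniformly sampled, FIFO-evicted table holding up to 10^7 N-step transitions;
# sampling is blocked until 40,000 transitions have been inserted.
table = reverb.Table(
    name='nstep_transitions',                 # assumed table name
    sampler=reverb.selectors.Uniform(),
    remover=reverb.selectors.Fifo(),
    max_size=10_000_000,
    rate_limiter=reverb.rate_limiters.MinSize(40_000),
)
server = reverb.Server(tables=[table], port=8000)

# A rollout worker inserts one <s, a, r> tuple (placeholder arrays shown here);
# the trainer samples mini-batches (1024 items in training) from the same table.
client = reverb.Client('localhost:8000')
client.insert([np.zeros(3, np.float32),       # s (placeholder)
               np.zeros(2, np.float32),       # a (placeholder)
               np.float32(0.0)],              # r (placeholder)
              priorities={'nstep_transitions': 1.0})
batch = list(client.sample('nstep_transitions', num_samples=4))
```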
Training scenarios
Learning to race requires mastering a gamut of skills: surviving a crowded start, making tactical open-road passes, and precisely running the track alone. To encourage basic racing skills, we placed the agent in scenarios with zero, one, two, three, or seven opponents launched nearby (1v0, 1v1, 1v2, 1v3, and 1v7, respectively). To create variety, we randomized track positions, start speeds, spacing between cars, and opponent policies. We leveraged the fact that the game supports 20 cars at a time to maximize PlayStation utilization by launching more than one group on the track. All base scenarios ran for 150 seconds. In addition, to ensure the agent was exposed to situations that would allow it to learn the skills highlighted by our expert advisor, we utilized time- or distance-limited scenarios on specific course sections. Figure 1(f) illustrates the skill scenarios used at Sarthe: 8-car grid starts, 1v1 slipstream passing, and mastering the final chicane in light traffic. Extended Data Figure 1 shows the specialized scenarios used to prepare the agent to race on (f) Seaside and (g) Maggiore. To learn how to avoid catastrophic outcomes at the high-speed Sarthe track, we also incorporated mistake learning53. During policy evaluations, if an agent lost control of the car, the state shortly before the event was recorded and used as a launch point for more training scenarios.
Unlike curriculum training, where early skills are supplanted by later ones or skills build on top of one another in a hierarchical fashion, our training scenarios are complementary and were trained into a single control policy for racing. During training, the trainer assigned new scenarios to each rollout worker by selecting from the set configured for that track based on hand-tuned ratios designed to provide sufficient skill coverage. See Extended Data Figure 1(e) for an example experience replay buffer at Sarthe. However, even with this relative execution balance, random sampling fluctuations from the buffer often led to skills being unlearned between successive training epochs, as shown in Figure 2(h). Therefore, we implemented multi-table stratified sampling to explicitly enforce the proportions of each scenario in each training mini-batch, which significantly stabilized skill retention (Figure 2(g)).
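To illustrate the stratified sampling idea, the sketch below assumes one replay table per scenario (reusing a Reverb-style `client.sample` interface, as in the ERB sketch above) and purely illustrative per-scenario fractions; the actual hand-tuned ratios are not given in the paper.

```python
# One table per training scenario; the fraction of each mini-batch drawn from each
# table is fixed, rather than left to uniform sampling over the whole buffer.
SCENARIO_FRACTIONS = {                      # illustrative placeholders only
    '1v0': 0.15, '1v1': 0.15, '1v2': 0.15, '1v3': 0.15,
    '1v7': 0.20, 'start': 0.10, 'slipstream': 0.05, 'final_chicane': 0.05,
}

def stratified_batch(client, batch_size=1024):
    """Draw a mini-batch whose scenario composition is enforced, not left to chance."""
    batch = []
    for scenario, frac in SCENARIO_FRACTIONS.items():
        n = max(1, int(round(frac * batch_size)))
        batch.extend(client.sample(scenario, num_samples=n))
    return batch[:batch_size]
```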
Policy selection
In machine learning, convergence means that further training will not improve performance. In RL, due to the ongoing exploration and random sampling of experiences, the policy's performance will often continue to vary after convergence (Figure 2(h)). Thus, even with the stabilizing techniques described above, continuing training after convergence produced policies that differed in small ways in their ability to execute the desired racing skills. A subsequent policy, for instance, may become marginally better at the slipstream pass and marginally worse at the chicane. Choosing which policy to race against humans therefore became a complex multi-objective optimization problem.
Extended Data Figure 3 illustrates the policy selection process. Agent policies were saved at regular intervals during training. Each saved policy then competed in a single-race scenario against other AI agents, and various metrics, such as lap times and car collisions, were gathered and used to filter the saved policies to a smaller set of candidates. These candidates were then run through an n-athlon (a set of pre-specified evaluation scenarios) testing their lap speed and their performance in certain tactically important situations, such as starting and using the slipstream. The performance on each scenario was scored, and the results of each policy on each scenario were combined into a single ranked spreadsheet. This spreadsheet, along with various plots and videos, was then reviewed by a human committee to select a small set of policies that seemed the most competitive and the best behaved. From this set, each pair of policies competed in a multi-race round-robin policy-vs-policy (PvP) tournament. These competitions were scored using the same team scoring as in the exhibition event and evaluated on collision metrics. From these results, the committee chose the policies that seemed to have the best chance of winning against the human drivers while minimizing penalties. These final candidate policies were then raced against test drivers at Polyphony, and the subjective reports of the test drivers were factored into the final decision.
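The following sketch illustrates the filter-and-rank steps of this pipeline (a Pareto filter on simple race metrics followed by a weighted n-athlon ranking); the metric names, weights, and aggregation rule are hypothetical stand-ins for the committee's actual criteria.

```python
from dataclasses import dataclass

@dataclass
class PolicyEval:
    name: str
    lap_time: float     # seconds, lower is better
    off_course: float   # off-course penalties per race, lower is better
    collisions: float   # collisions per race, lower is better
    nathlon: dict       # per-scenario scores, higher is better

def pareto_front(evals):
    """Keep policies not dominated on (lap_time, off_course, collisions)."""
    def dominates(a, b):
        no_worse = (a.lap_time <= b.lap_time and a.off_course <= b.off_course
                    and a.collisions <= b.collisions)
        better = (a.lap_time < b.lap_time or a.off_course < b.off_course
                  or a.collisions < b.collisions)
        return no_worse and better
    return [e for e in evals if not any(dominates(o, e) for o in evals)]

def rank_by_nathlon(candidates, weights):
    """Combine per-scenario scores into one ranked list for committee review."""
    scored = [(sum(weights[k] * v for k, v in c.nathlon.items()), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda t: t[0], reverse=True)]
```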
The start of Sarthe posed a particularly challenging problem for policy selection. Because the final chicanes are so close to the starting line, the race was configured with a stationary grid start. From that standing start, all eight cars quickly accelerated and entered the first chicane. While a group of eight GT Sophy agents might get through the chicane fairly smoothly, against human drivers the start was invariably chaotic and a fair amount of bumping occurred. We tried many variations of our reward functions to find a combination that our test drivers deemed an acceptable starter while not giving up too many positions. In the October Sarthe race, we configured GT Sophy to use a policy that started well and, after 2,100 meters, switched to a slightly more competitive policy for the rest of the race (a minimal version of this switching logic is sketched below). Despite the specialized starter, the instance of GT Sophy that began the race in pole position was involved in a collision with a human driver in the first chicane, slid off the course, and fell to last place. Despite that setback, it managed to come back and win the race.
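A minimal sketch of such a distance-triggered switch, assuming hypothetical policy callables and a course-position field; the production switching mechanism is not described beyond the 2,100 m trigger.

```python
SWITCH_POINT_M = 2100.0  # distance at which the more competitive policy takes over

def select_policy(course_position_m, starter_policy, race_policy):
    """Use the well-behaved starter until 2,100 m, then the more competitive policy."""
    return starter_policy if course_position_m < SWITCH_POINT_M else race_policy

def act(state, starter_policy, race_policy):
    # `state['course_position_m']` is a placeholder for the agent's track progress.
    policy = select_policy(state['course_position_m'], starter_policy, race_policy)
    return policy(state)
```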
Immediately after the official race, we ran a friendly rematch against the same drivers but used the starter policy for the whole track. The results were similar to those of the official race.
Fairness versus humans
Competitions between humans and AI systems cannot be made entirely fair; computers and humans think in different ways and with different hardware. Our objective was to make the competition fair enough, while using technical approaches that were consistent with how such an agent could be added to the game. The following list compares some of the dimensions along which GT Sophy differs from human players.
1. Perception: GT Sophy had a map of the course with precise x, y, z information about the points that defined the track boundaries. Humans perceived this information less precisely via vision. However, the course map did not have all of the information about the track, and humans have an advantage in that they could see the kerbs and surface material outside the boundaries, whereas GT Sophy could only sense these by driving on them.
2. Opponents: GT Sophy had precise information about the location, velocity and acceleration of the nearby vehicles. However, it represented these vehicles as single points, whereas humans could perceive the whole vehicle. GT Sophy had a distinct advantage in that it could see vehicles behind it as clearly as those in front, whereas humans have to use the mirrors or the controller to look to the sides and behind them. GT Sophy never practiced against opponents that did not have full visibility, so it did not intentionally take advantage of human blind spots.
3. Vehicle state: GT Sophy had precise information about the load on each tire, the slip angle of each tire and other vehicle state. Humans learn how to control the car with less precise information about these state variables.
4. Vehicle controls: There are certain vehicle controls that the human drivers had access to that GT Sophy did not. In particular, expert human drivers often use the Traction Control System in grid starts and use the transmission controls to change gears.
5. Action frequency: GT Sophy took actions at 10 Hz, which was sufficient to control the car but much less frequent than human actions in GT. Competitive GT drivers use steering and pedal systems that give them 60 Hz control. While a human cannot take 60 distinct actions per second, they can smoothly turn a steering wheel or press a brake pedal. Extended Data Figure 2(b,c) contrasts GT Sophy's 10 Hz control pattern with Igor Fraga's much smoother actions in a corner of Sarthe.
6. Reaction time: GT Sophy's asynchronous communication and inference take around 23–30 ms, depending on the size of the network. While evaluating performance in professional athletes and gamers is a complex field35,36, an oft-quoted metric is that professional athletes have a reaction time of 200–250 ms. To understand how GT Sophy's performance would be impacted if its reaction time were slowed down, we ran experiments in which we introduced artificial delays into its perception pipeline (a minimal sketch of such a delay follows this list). We retrained our agent with delays of 100 ms, 200 ms, and 250 ms in the Maggiore time trial setting, using the same model architecture and algorithm as our time trial baseline. All three of these tests achieved a superhuman lap time.
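As an illustration of how a fixed perception delay can be emulated at a 10 Hz control rate, the sketch below buffers observations for a fixed number of steps; the actual mechanism used in these experiments is not described beyond the stated delays.

```python
from collections import deque

class DelayedPerception:
    """Delay observations by a fixed number of 10 Hz steps to emulate slower reaction.

    At a 10 Hz action rate, a 200 ms delay corresponds to holding observations back
    by two steps. This is an illustrative mechanism, not the authors' implementation.
    """
    def __init__(self, delay_ms, step_ms=100):
        self.steps = max(0, round(delay_ms / step_ms))
        self.buffer = deque()

    def observe(self, obs):
        self.buffer.append(obs)
        # Keep only the most recent (steps + 1) frames; the agent sees the oldest one.
        while len(self.buffer) > self.steps + 1:
            self.buffer.popleft()
        return self.buffer[0]

# Example: a 200 ms delay holds each observation back by two control steps.
delayed = DelayedPerception(delay_ms=200)
```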
Tests versus top GT drivers
The following competitive GT drivers participated in the time trial evaluations:
Emily Jones: 2020 FIA Gran Turismo Championships Manufacturers Series, Team Audi.
Valerio Gallo: Winner of the 2021 Olympic Virtual Series Motor Sport Event; 2nd place in the 2020 FIA Gran Turismo Nations Cup.
Igor Fraga: Winner of the 2018 FIA Gran Turismo Nations Cup; Winner of the 2019 Manufacturer Series championship for Toyota; (real racing) Winner of the 2020 Formula 3 Toyota Racing Series for Charouz Racing System.
GT Sophy won all of the time trial evaluations, as shown in Figure 3(f), and was reliably superhuman on all three tracks, as shown in Figures 1(d,e) and Extended Data Figures 1(a–d). Interestingly, the only human with a time within the range of GT Sophy's 100 lap times on any of the tracks was Valerio Gallo on Maggiore. It is worth noting that the data in Figures 1(d,e) was captured by Polyphony after the time trial event in July. Valerio was the only participant represented in the data who had seen GT Sophy's trajectories on Maggiore, and between those two events, Valerio's best time improved from 114.466 to 114.181 seconds.
It is also interesting to examine what behaviors give GT Sophy such an advantage in time trials. Extended Data Figure 2(a) shows an analysis of Igor's attempt to match GT Sophy on Sarthe, highlighting the places on the course where he fell farther behind. Not surprisingly, the hardest chicanes and corners are the places where GT Sophy has the biggest performance gains. In most of these corners, Igor appears to catch up a little bit by braking later, but is then unable to take the corner itself as fast, resulting in him losing ground overall.
The following competitive GT drivers participated in the team racing event:
Takuma Miyazono: Winner of the 2020 FIA Gran Turismo Nations Cup; Winner of the 2020 FIA Gran Turismo Manufacturer Series for Subaru; Winner of the 2020 GR Supra GT Cup.
Tomoaki Yamanaka: Winner of the 2019 Manufacturer Series for team Toyota.
Ryota Kokubun: Winner of the FIA Gran Turismo Championships 2019 World Tour Round 5, TOKYO Nations Cup; 3rd place in the FIA Gran Turismo Championships 2020 World Finals Nations Cup.
Shotaro Ryu: 2nd place in the Japan National Inter-prefectural e-Sports Championship (National Athletic Meet) 2019, Gran Turismo Division (Youth).
Driver testimonials
The following quotes were captured after the July events.
"I think the AI was very fast turning into the corner. How they approach into it, as well as not losing speed on the exit. We tend to sacrifice a little bit the entry to make the car be in a better position for the exit, but the AI seems to be able to carry more speed into the corner but still be able to have the same kind of exit, or even a faster exit. The AI can create this type of line a lot quicker than us, ... it was not a possibility before because we never realized it. But the AI was able to find it for us." Igor Fraga
"It was really interesting seeing the lines where the AI would go, there were certain corners where I was going out wide and then cutting back in, and the AI was going in all the way around, so I learned a lot about the lines. And also knowing what to prioritize. Going into turn 1 for example, I was braking later than the AI, but the AI would get a much better exit than me and beat me to the next corner. I didn't notice that until I saw the AI and was like 'Okay, I should do that instead'." Emily Jones
"The ghost is always a reference. Even when I train I always use someone else's ghost to improve. And in this case with such a very fast ghost, ... even though I wasn't getting close to it, I was getting closer to my limits." Valerio Gallo
"I hope we can race together more, as I felt a kind of friendly rivalry with [GT Sophy]." (translated from Japanese) Takuma Miyazono
"There is a lot to learn from [GT Sophy], and by that I can improve myself. [GT Sophy] does something original to make the car go faster, and we will know it's reasonable once we see it." (translated from Japanese) Tomoaki Yamanaka
Data availability
There is no static data associated with this project. All data is generated from scratch by the agent each time it learns. Videos of the races are available at https://sonyai.github.io/gt_sophy_public.
Code availability
Pseudo-code detailing the training process and algorithms used is available as a supplement to this article. The agent interface in GT is not enabled in commercial versions of the game; however, Polyphony has provided a small number of universities and research facilities outside of Sony with access to the API and is considering working with other groups.
Competing interests
The authors declare no competing interests with the contents of this manuscript.
Additional Information
Supplementary information is available.
Correspondence and requests for materials should be addressed to P.W. or M.S.
Nature thanks J. C. Gerdes and the anonymous peer reviewers.
Reprints and permissions information is available at www.nature.com/reprints.
References
46. https://olympics.com/en/sport-events/olympic-virtual-motorsport-event/.
47. Ma, X., Xia, L., Zhou, Z., Yang, J. & Zhao, Q. DSAC: Distributional soft actor critic for risk-sensitive reinforcement learning. arXiv preprint arXiv:2004.14547 (2020).
48. Fujimoto, S., van Hoof, H. & Meger, D. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477 (2018).
49. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1929–1958 (2014).
50. Liu, Z., Li, X., Kang, B. & Darrell, T. Regularization matters for policy optimization - an empirical study on continuous control. In International Conference on Learning Representations (2021).
51. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (2015).
52. Cassirer, A. et al. Reverb: A framework for experience replay. arXiv preprint arXiv:2102.04736 (2021).
53. Narvekar, S., Sinapov, J., Leonetti, M. & Stone, P. Source task creation for curriculum learning. In Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2016) (2016).
Course   | Rcp | Rsoc | Rloc | Rw   | Rts  | Rps | Rc | Rr  | Ruc
Seaside  | 1   | 0.01 | 0    | 0.01 | 0.25 | 0.5 | 5  | 0.1 | 0
Maggiore | 1   | 0.01 | 0    | 0.01 | 0.25 | 0.5 | 4  | 0.1 | 0
Sarthe   | 1   | 0    | 5    | 0.01 | 0    | 0.5 | 5  | 0.1 | 5
Extended Data Table 1. Reward weights: Reward weights for each track.
[Extended Data Figure 1 panels: histograms of best human lap times on Sarthe and Seaside, with markers for GT Sophy's training time (2, 4, 8, 16, 24 and 48 hours) and the built-in AI; insets comparing the top five human lap times with GT Sophy; a stacked histogram of ERB contents on Sarthe by course position (0–13 km) for the 1v0, 1v1, 1v2, 1v3, 1v7, start, slipstream and final-chicane scenarios; and the skill-training sections on Seaside and Maggiore (rolling start, chicane series, wide corners, slipstream; rolling start, chicanes 2–3, corners 5–6, chicanes 13–15, corners 12 & 16).]
Extended Data Figure 1. Seaside and Sarthe training: Figures (a,b) and (c,d) show the Kudos Prime data from global
time trial challenges on Seaside and Sarthe, respectively, with the cars used in the competition. Note that these histograms
represent the single best lap time for over 12,000 individual players on Seaside, and almost 9,000 on Sarthe. In both cases, the
secondary diagrams compare the top five human times to a histogram of 100 laps by the July 2nd time trial version of GT
Sophy. In both cases, the data shows that GT Sophy was reliably superhuman, with all 100 laps better than the best human laps.
Not surprisingly, it takes longer for the agent to train on the much longer Sarthe course, taking 48 hours to reach the 99th
percentile of human performance. Figure (e) shows a histogram of a snapshot of the experience replay buffer during training on
Sarthe based on the scenario breakdown in Figure 1(f). The horizontal axis is the course position and the stacked colors
represent the number of samples that were collected in that region from each scenario. In a more condensed format than
Figure 1(f), (f) and (g) show the sections of Seaside and Maggiore that were used for skill training.
[Extended Data Figure 2 panels (b,c): steering, throttle and brake traces for Igor versus GT Sophy through corner 20 of Sarthe, plotted against course position from 7,450 m to 7,950 m; panel (a) is the annotated track map.]
Extended Data Figure 2. Time trial on Sarthe: An analysis of Igor Fraga's best lap in the time trial test compared to GT Sophy's lap. (a) highlights areas of the track where Igor lost time with respect to GT Sophy. Corner 20, highlighted in yellow, shows an interesting effect common to the other corners, in that Igor seems to catch up a little by braking later, but then loses time because he has to brake longer and comes out of the corner slower. (b) shows Igor's steering controls and (c) Igor's throttle and braking compared to GT Sophy on corner 20. Through the steering wheel and brake pedals, Igor is able to give smooth 60 Hz signals compared with GT Sophy's 10 Hz action rate.
Extended Data Figure 3. Policy selection: An illustration of the process by which policies were selected to run in the final race. Starting on the left side of the diagram, thousands of policies were generated and saved during the experiments. They were first filtered within the experiment to select the subset on the Pareto frontier of a simple evaluation criterion trading off lap time versus off-course and collision metrics. The selected policies were run through a series of tests evaluating their overall racing performance against a common set of opponents and their performance on a variety of hand-crafted skill tests. The results were ranked and human judgement was applied to select down to a small number of candidate policies. These policies were matched up in round-robin, policy-vs-policy competitions. The results were again analyzed by the human committee for overall team scores and collision metrics. The best candidate policies were run in short races against test drivers at Polyphony. Their subjective evaluations were included in the final decisions on which policies to run in the October event.