Autonomous Drifting Based on Maximal Safety Probability Learning
Hikaru Hoshino1, Jiaxing Li2, Arnav Menon2, John M. Dolan3, Yorie Nakahira2
Abstract— This paper proposes a novel learning-based frame-
work for autonomous driving based on the concept of maximal
safety probability. Efficient learning requires rewards that are
informative of desirable/undesirable states, but such rewards
are challenging to design manually due to the difficulty of
differentiating better states among many safe states. On the
other hand, learning policies that maximize safety probability
does not require laborious reward shaping but is numerically
challenging because the algorithms must optimize policies based
on binary rewards sparse in time. Here, we show that physics-
informed reinforcement learning can efficiently learn this form
of maximally safe policy. Unlike existing drift control methods,
our approach does not require a specific reference trajectory
or complex reward shaping, and can learn safe behaviors only
from sparse binary rewards. This is enabled by the use of the
physics loss that plays an analogous role to reward shaping.
The effectiveness of the proposed approach is demonstrated
through lane keeping in a normal cornering scenario and safe
drifting in a high-speed racing scenario.
I. INTRODUCTION
Driving in adverse conditions (e.g. high-speed racing or
icy roads with low traction) is challenging for both hu-
man drivers and autonomous vehicles. The vehicle operates
near its handling limits in a highly nonlinear regime and
has to cope with uncertainties due to unmodeled dynamics
and noise in sensing and localization [1]. Furthermore, it
needs to have a very low response time to adapt to the
rapidly changing environment [2]. Deterministic worst-case
frameworks including robust sliding-mode control [3] can
often be efficiently computed but require full system models
and small bounded uncertainties (errors). Techniques based
on set invariance, such as control barrier functions and
barrier certificates [4]–[7], are applied to vehicle systems
with analytical models when these functions can be designed.
Techniques based on probabilistic invariance [8], [9] can
be used to generate safety certificates using samples of
complex (black-box) systems in extreme environments [8]
and occluded environments [9]. Model Predictive Con-
trol (MPC) techniques, including stochastic MPC [10] and
chance-constrained MPC [11], exploit future predictions to
account for uncertainties. However, as the number of possible trajectories grows exponentially with the outlook time horizon, there are often stringent tradeoffs between the outlook time horizon and the computational burden. Thus, it is still challenging
*This work was supported in part by a Grant-in-Aid for Scientific Research (KAKENHI) from the Japan Society for the Promotion of Science (#23K13354).
1H. Hoshino is with the Department of Electrical Materials and Engineering, University of Hyogo, hoshino@eng.u-hyogo.ac.jp
2J. Li, A. Menon, and Y. Nakahira are with the Department of Electrical
and Computer Engineering, Carnegie Mellon University, {jiaxingl,
arnavmen, ynakahira}@andrew.cmu.edu
3J. M. Dolan is with the Robotics Institute, Carnegie Mellon University,
jdolan@andrew.cmu.edu
to ensure safety in adverse and uncertain conditions with
lightweight algorithms suitable for onboard computation.
Meanwhile, for drifting control, various model-based and model-free techniques have been considered. In [12], a bicycle model is used to describe the phenomenon of vehicle drifting, and an unstable "drift equilibrium" has been
found. Sustained drifting has been achieved by stabilizing
a drift equilibrium using various control methods, such as
LQR [13], robust control [14], and MPC [15]. Hierarchical
control architectures are proposed in [16], [17] for general
path tracking. A drawback of these previous works is that
they aim to stabilize a specific equilibrium or reference trajectory, and pre-computation of such a reference is required. As a data-driven approach to drifting control, Probabilistic Inference for Learning Control (PILCO) has been applied [18], but the method is only considered in a single-task setting of minimizing the tracking error with respect to a particular drift equilibrium. A soft actor-critic algorithm was designed to go through sharp corners of a racing circuit in simulations [19]. A Twin Delayed Deep Deterministic Policy Gradient (TD3) controller is developed in [20], and tabular Q-learning is used in [21]. The transfer of RL agents to the real world was studied in [22], [23] using experiments with radio-controlled (RC) model cars. A drift parking task is considered in [24]. However, like other RL problems [25], an appropriate design of the reward function is required to generalize across different drift maneuvering
tasks. The reward functions in these works tend to be
complex, and the distance from a target drift equilibrium or
a reference trajectory is typically used for reward shaping.
In this paper, we propose a novel approach to control
vehicle drifting based on Physics-Informed Reinforcement
Learning (PIRL). Specifically, PIRL is used to learn a control
policy that maximizes the safety probability such that the
vehicle can safely drift or slip in adverse conditions. Major
technical challenges stem from the fact that the objective
function associated with this problem takes multiplicative
or maximum costs over time and that the reward is binary
and sparse in time. To overcome the difficulty of learning
multiplicative or maximum costs, we first present a trans-
formation that converts this problem into RL with additive
costs. To efficiently learn from sparse rewards, we build upon our previous work [26], which imposes physics constraints on the loss function. This term in the loss function plays a role analogous to reward shaping and allows safe policies to be learned only from binary rewards that are sparse in
time. The proposed framework only requires safe events
(safe regions) to be specified without the need for reference
trajectories or laborious reward shaping. The effectiveness of
the proposed approach is demonstrated through lane keeping
in a normal cornering scenario and safe drifting in a high-
speed racing scenario.
A. Notation
Let $\mathbb{R}$ and $\mathbb{R}_+$ be the set of real numbers and the set of nonnegative real numbers, respectively. Let $\mathbb{Z}$ and $\mathbb{Z}_+$ be the set of integers and the set of nonnegative integers. For a set $A$, $A^c$ stands for the complement of $A$, and $\partial A$ for the boundary of $A$. Let $\lfloor x \rfloor \in \mathbb{Z}$ be the greatest integer less than or equal to $x \in \mathbb{R}$. Let $\mathbb{1}[E]$ be an indicator function, which takes 1 when the condition $E$ holds and 0 otherwise. Let $\mathbb{P}[E \mid X_0 = x]$ represent the probability that the condition $E$ holds for a stochastic process $X = \{X_t\}_{t \in \mathbb{R}_+}$ conditioned on $X_0 = x$. Given random variables $X$ and $Y$, let $\mathbb{E}[X]$ be the expectation of $X$ and $\mathbb{E}[X \mid Y = y]$ be the conditional expectation of $X$ given $Y = y$. We use upper-case letters (e.g., $Y$) to denote random variables and lower-case letters (e.g., $y$) to denote their specific realizations. For a scalar function $\phi$, $\partial_x \phi$ stands for the gradient of $\phi$ with respect to $x$, and $\partial_x^2 \phi$ for the Hessian matrix of $\phi$. Let $\mathrm{tr}(M)$ be the trace of the matrix $M$. For $x \in \mathbb{R}^n$ and $A \subset \mathbb{R}^n$, $\mathrm{dist}(x, A) := \inf_{y \in A} \|x - y\|$.
II. MAXIMAL SAFETY PROBABILITY LEARNING
In this section, the problem formulation of estimating
maximal safety probability is given in Sec. II-A, and the
PIRL algorithm [26] is briefly reviewed in Sec. II-B.
A. Problem Formulation
We assume that the vehicle dynamics can be represented as a control system with stochastic noise given by a $w$-dimensional Brownian motion $\{W_t\}_{t \in \mathbb{R}_+}$ starting from $W_0 = 0$. The system state $X_t \in \mathcal{X} \subset \mathbb{R}^n$ is assumed to be observable and evolves according to the following stochastic differential equation (SDE):
$$ dX_t = f(X_t, U_t)\, dt + \sigma(X_t, U_t)\, dW_t, \quad (1) $$
where $U_t \in \mathcal{U} \subset \mathbb{R}^m$ is the control input. Throughout this paper, we assume sufficient regularity in the coefficients of the system (1). That is, the functions $f$ and $\sigma$ are chosen in a way such that the SDE (1) admits a unique strong solution (see, e.g., Section IV.2 of [27]). The size of $\sigma(X_t, U_t)$ is determined from the uncertainties in the disturbance, unmodeled dynamics, and prediction errors of the environmental variables. For numerical approximation of the solutions of the SDE and of optimal control problems, we consider a discretization with respect to time with a constant step size $\Delta t$ under piecewise constant control processes. For $0 = t_0 < t_1 < \cdots < t_k < \ldots$, where $t_k := k \Delta t$, $k \in \mathbb{Z}_+$, and defining the discrete-time state $X_k := X_{t_k}$ with an abuse of notation, the discretized system can be given as
$$ X_{k+1} = F^\pi(X_k, \Delta W_k), \quad (2) $$
where $\Delta W_k := \{W_t\}_{t \in [t_k, t_{k+1})}$, and $F^\pi$ stands for the state transition map derived from (1) under a Markov control policy $\pi : [0, \infty) \times \mathcal{X} \to \mathcal{U}$.
Safety of the system can be defined by using a safe set $\mathcal{C} \subset \mathcal{X}$. For the discretized system (2) and for a given control policy $\pi$, the safety probability $\Psi^\pi$ of the initial state $X_0 = x$ for the outlook horizon $\tau \in \mathbb{R}$ can be characterized as the probability that the state $X_k$ stays within the safe set $\mathcal{C}$ for $k \in \mathcal{N}_\tau := \{0, \ldots, N(\tau)\}$, where $N(\tau) := \lfloor \tau / \Delta t \rfloor$, i.e.,
$$ \Psi^\pi(\tau, x) := \mathbb{P}[X_k \in \mathcal{C},\ \forall k \in \mathcal{N}_\tau \mid X_0 = x, \pi]. \quad (3) $$
Then, the objective of the learning is to estimate the maximal safety probability defined as
$$ \Psi^*(\tau, x) := \sup_{\pi \in \mathcal{P}} \Psi^\pi(\tau, x), \quad (4) $$
where $\mathcal{P}$ is the class of bounded and Borel measurable Markov control policies, and to learn the corresponding safe control policy $\pi^* := \arg\sup_{\pi \in \mathcal{P}} \Psi^\pi$.
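For intuition, the sketch below gives a minimal Monte Carlo estimate of $\Psi^\pi(\tau, x)$ in (3) for a fixed policy, using a simple Euler-Maruyama discretization of (1). The functions f, sigma, pi, and in_safe_set are hypothetical user-supplied placeholders, not part of the paper's implementation.

```python
import numpy as np

def estimate_safety_probability(x0, pi, f, sigma, in_safe_set,
                                tau=5.0, dt=0.05, n_rollouts=1000, w_dim=1,
                                seed=0):
    """Monte Carlo estimate of Psi^pi(tau, x0) in (3) for a fixed policy pi.

    Placeholders supplied by the user:
      f(x, u)        -> drift vector of (1), shape (n,)
      sigma(x, u)    -> diffusion matrix of (1), shape (n, w_dim)
      pi(t, x)       -> control input u
      in_safe_set(x) -> True iff x is in the safe set C
    """
    rng = np.random.default_rng(seed)
    n_steps = int(np.floor(tau / dt))            # N(tau) = floor(tau / dt)
    n_safe = 0
    for _ in range(n_rollouts):
        x, safe = np.array(x0, dtype=float), True
        for k in range(n_steps + 1):             # require X_k in C for k = 0, ..., N(tau)
            if not in_safe_set(x):
                safe = False
                break
            u = pi(k * dt, x)
            dW = rng.normal(0.0, np.sqrt(dt), size=w_dim)
            x = x + f(x, u) * dt + sigma(x, u) @ dW   # Euler-Maruyama step of (1)
        n_safe += int(safe)
    return n_safe / n_rollouts                   # empirical Psi^pi(tau, x0)
```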
B. Physics-informed RL (PIRL)
The training framework is illustrated in Fig. 1. PIRL
integrates RL and Physics-Informed Neural Networks
(PINNs) [28] for efficient estimation of maximal safety
probability. This paper uses a PIRL algorithm based on Deep
Q-Network (DQN) [29]. The overall structure follows from
the standard DQN algorithm, and the optimal action-value
function will be estimated by using a function approximator.
For this function approximator, we use a PINN, which
is a neural network trained by penalizing the discrepancy
from a partial differential equation (PDE) condition that the
safety probability should satisfy. To define an appropriate
RL problem, we consider the augmented state space $\mathcal{S} := \mathbb{R} \times \mathcal{X} \subset \mathbb{R}^{n+1}$ and the augmented state $S_k \in \mathcal{S}$, where we denote the first element of $S_k$ by $H_k$ and the other elements by $X_k$, i.e.,
$$ S_k = [H_k, X_k^\top]^\top, \quad (5) $$
where $H_k$ represents the remaining time before the outlook horizon $\tau$ is reached. Then, consider the stochastic dynamics starting from the initial state $s := [\tau, x^\top]^\top \in \mathcal{S}$ given as follows¹: for all $k \in \mathbb{Z}_+$,
$$ S_{k+1} = \begin{cases} \tilde{F}^\pi(S_k, \Delta W_k), & S_k \notin \mathcal{S}_{\mathrm{abs}}, \\ S_k, & S_k \in \mathcal{S}_{\mathrm{abs}}, \end{cases} \quad (6) $$
with the function $\tilde{F}^\pi$ given by
$$ \tilde{F}^\pi(S_k, \Delta W_k) := \begin{bmatrix} H_k - \Delta t \\ F^\pi(X_k, \Delta W_k) \end{bmatrix}, \quad (7) $$
and the set of absorbing states $\mathcal{S}_{\mathrm{abs}}$ given by
$$ \mathcal{S}_{\mathrm{abs}} := \{ [\tau, x^\top]^\top \in \mathcal{S} \mid \tau < 0 \ \lor\ x \notin \mathcal{C} \}. \quad (8) $$
Then, the following proposition states that the value function
of our RL problem is equivalent to the safety probability.
Proposition 1. Consider the system (6) starting from an initial state $s = [\tau, x^\top]^\top \in \mathcal{S}$ and the reward function $r : \mathcal{S} \to \mathbb{R}$ given by
$$ r(S_k) := \mathbb{1}[H_k \in \mathcal{G}], \quad (9) $$
with $\mathcal{G} := [0, \Delta t)$.
¹With this augmentation, the variable $X_k$ in $S_k$ is no longer equivalent to that in (2), because $X_k$ transitions to itself when $S_k \in \mathcal{S}_{\mathrm{abs}}$. However, we use the same notation for simplicity. See [26] for the precise derivation.
Fig. 1: Framework of training by Physics-informed reinforcement learning (PIRL). The PIRL agent interacts with the environment (vehicle state, horizon, rewards) and trains the action-value function $Q([\tau, x^\top]^\top, a; \theta)$ by minimizing the loss $L = L_D + \lambda L_P + \mu L_B$, which combines the DQN data loss with PINN losses built from the control-oriented vehicle model.
Then, for a given control policy $\pi$, the value function $v^\pi$ defined by
$$ v^\pi(s) := \mathbb{E}\left[ \sum_{k=0}^{N_f} r(S_k) \,\middle|\, S_0 = s, \pi \right], \quad (10) $$
where $N_f := \inf\{ j \in \mathbb{Z}_+ \mid S_j \in \mathcal{S}_{\mathrm{abs}} \}$, takes a value in $[0, 1]$ and is equivalent to the safety probability $\Psi^\pi(\tau, x)$, i.e.,
$$ v^\pi(s) = \Psi^\pi(\tau, x). \quad (11) $$
Proof. The proof follows from Proposition 1 of [26] with a minor rewrite of the reward and value functions.
Thus, the problem of safety probability estimation can be solved by an episodic RL problem with the action-value function $q^\pi(s, u)$, defined as the value of taking an action $u \in \mathcal{U}$ in state $s$ and thereafter following the policy $\pi$:
$$ q^\pi(s, u) := \mathbb{E}\left[ \sum_{k=0}^{N_f} r(S_k) \,\middle|\, S_0 = s, U_0 = u, \pi \right]. \quad (12) $$
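To make the augmented construction (5)-(9) concrete, the following sketch wraps a generic one-step simulator of (2) in the dynamics (6)-(8) with the sparse reward (9); step_fn and in_safe_set are hypothetical placeholders, not the paper's implementation.

```python
import numpy as np

class AugmentedSafetyEnv:
    """Augmented dynamics (6)-(8) with the sparse binary reward (9) (sketch).

    Placeholders supplied by the user:
      step_fn(x, u)  -> one step of the discretized system (2)
      in_safe_set(x) -> True iff x is in the safe set C
    """
    def __init__(self, step_fn, in_safe_set, dt):
        self.step_fn, self.in_safe_set, self.dt = step_fn, in_safe_set, dt

    def reset(self, tau, x):
        self.h, self.x = float(tau), np.array(x, dtype=float)  # S_0 = [tau, x^T]^T
        return np.concatenate(([self.h], self.x))

    def step(self, u):
        # Reward (9) depends only on the remaining horizon H_k.
        reward = float(0.0 <= self.h < self.dt)
        # Absorbing set (8): horizon exhausted or safe set left.
        absorbing = (self.h < 0.0) or (not self.in_safe_set(self.x))
        if absorbing:
            done = True                       # S_{k+1} = S_k; episode ends at N_f
        else:
            self.h -= self.dt                 # transition (7): H decreases by dt,
            self.x = self.step_fn(self.x, u)  # X advances by one step of (2)
            done = False
        return np.concatenate(([self.h], self.x)), reward, done
```

Summing the rewards of one episode yields a zero/one sample whose expectation is the value (10), i.e., the safety probability; the DQN described in Sec. II-B learns the corresponding action-value function (12).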
Furthermore, it is shown in [26] that the maximal safety probability can be characterized by the Hamilton-Jacobi-Bellman (HJB) equation of a stochastic optimal control problem. While the HJB equation can have discontinuous viscosity solutions, the following theorem shows that a slightly conservative but arbitrarily precise approximation can be constructed by considering a set $\mathcal{C}_\epsilon$ smaller than $\mathcal{C}$:
Theorem 1 ([26]). Consider the system (6) derived from the SDE (1) and the action-value function $q_\epsilon^\pi(s, u)$ given by
$$ q_\epsilon^\pi(s, u) := \mathbb{E}\left[ \sum_{k=0}^{N_f} r_\epsilon(S_k) \,\middle|\, S_0 = s, U_0 = u, \pi \right], \quad (13) $$
where the function $r_\epsilon$ is given by
$$ r_\epsilon(S_k) := \mathbb{1}[H_k \in \mathcal{G}]\, l_\epsilon(X_k), \quad (14) $$
with $\mathcal{C}_\epsilon := \{ x \in \mathcal{C} \mid \mathrm{dist}(x, \mathcal{C}^c) \geq \epsilon \}$ and
$$ l_\epsilon(x) := \max\left( 1 - \frac{\mathrm{dist}(x, \mathcal{C}_\epsilon)}{\epsilon},\ 0 \right). \quad (15) $$
Then, if Assumption 1 in [26] holds, the optimal action-value function $q_\epsilon^*(s, u) := \sup_{\pi \in \mathcal{P}} q_\epsilon^\pi(s, u)$ becomes a continuous viscosity solution of the following PDE in the limit of $\Delta t \to 0$:
$$ \partial_s q_\epsilon^*(s, u^*)\, \tilde{f}(s, u^*) + \frac{1}{2} \mathrm{tr}\left( \tilde{\sigma}(s, u^*) \tilde{\sigma}(s, u^*)^\top \partial_s^2 q_\epsilon^*(s, u^*) \right) = 0, \quad (16) $$
where the functions $\tilde{f}$ and $\tilde{\sigma}$ are given by
$$ \tilde{f}(s, u) := \begin{bmatrix} -1 \\ f(x, u) \end{bmatrix}, \qquad \tilde{\sigma}(s, u) := \begin{bmatrix} 0 \\ \sigma(x, u) \end{bmatrix}, \quad (17) $$
and $u^* := \arg\sup_{u \in \mathcal{U}} q^*(s, u)$. The boundary conditions are given by
$$ q_\epsilon^*([0, x^\top]^\top, u^*) = l_\epsilon(x), \quad \forall x \in \mathcal{X}, \quad (18) $$
$$ q_\epsilon^*([\tau, x^\top]^\top, u^*) = 0, \quad \forall \tau \in \mathbb{R},\ \forall x \in \partial\mathcal{C}. \quad (19) $$
Remark 1. When we assume further regularity conditions on the function $l_\epsilon(x)$ (i.e., differentiability of $l_\epsilon(x)$), the PDE (16) can be understood in the classical sense (see, e.g., [27, Theorem IV.4.1]). This means that the PDE condition can be imposed by the technique of PINNs using automatic differentiation of neural networks.
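As a concrete instance of (14)-(15), the sketch below evaluates the smoothed indicator $l_\epsilon$ for an interval safe set of the form $\mathcal{C} = \{|e| \leq E_{\max}\}$ (the lane-keeping set used later in (24)); the closed-form distance and the values of E_max and eps are illustrative assumptions.

```python
def l_eps_lane(e, E_max=1.0, eps=0.1):
    """Smoothed indicator (15) for C = {|e| <= E_max} (cf. (24)).

    For this interval, C_eps = {|e| <= E_max - eps}, so
    dist(x, C_eps) = max(|e| - (E_max - eps), 0).
    """
    d = max(abs(e) - (E_max - eps), 0.0)
    return max(1.0 - d / eps, 0.0)

# l_eps_lane is 1 well inside the lane and ramps linearly to 0 at the boundary,
# consistent with the boundary condition (19):
print([round(l_eps_lane(e), 2) for e in (0.0, 0.85, 0.95, 1.0, 1.2)])
# -> [1.0, 1.0, 0.5, 0.0, 0.0]
```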
Based on the above, a standard DQN algorithm is integrated with a PINN by using a modified loss function given by
$$ L = L_D + \lambda L_P + \mu L_B, \quad (20) $$
where $L_D$ is the data loss of the original DQN, $L_P$ the loss term imposing the PDE (16), and $L_B$ the loss term for the boundary conditions (18) and (19). The parameters $\lambda$ and $\mu$ are the weighting coefficients for the regularization loss terms $L_P$ and $L_B$, respectively. Detailed descriptions of these loss terms are given in [26].
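The sketch below illustrates, in PyTorch-style code, how the residual of the PDE (16) can be penalized through automatic differentiation and combined with a standard DQN data loss as in (20). It is a simplified sketch under assumed interfaces (q_net, f, sigma, and the sampling of collocation points), not the implementation of [26]; in particular, the boundary loss $L_B$ is only indicated.

```python
import torch

def pirl_loss(q_net, q_target_net, batch, s_colloc, f, sigma,
              lam=1e-4, mu=1e-4, gamma=1.0):
    """Composite PIRL loss (20): L = L_D + lambda*L_P + mu*L_B (sketch only).

    Assumed interfaces (placeholders):
      q_net(s), q_target_net(s) -> Q-values for all discrete actions, shape (B, A)
      f(x, a_idx)     -> drift of the control-oriented model, shape (n,)
      sigma(x, a_idx) -> diffusion matrix, shape (n, w)
      batch = (s, a, r, s_next, done) with s = [tau, x^T]^T
      s_colloc        -> collocation points sampled in the augmented state space
    """
    # ---- DQN data loss L_D (standard TD error with a target network) ----
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * q_target_net(s_next).max(dim=1).values
    loss_d = torch.nn.functional.mse_loss(q_sa, y)

    # ---- PINN residual loss L_P enforcing the PDE (16) at collocation points ----
    s_c = s_colloc.clone().requires_grad_(True)
    q_star, a_star = q_net(s_c).max(dim=1)             # greedy value and action u*
    grad_q = torch.autograd.grad(q_star.sum(), s_c, create_graph=True)[0]
    residuals = []
    for i in range(s_c.shape[0]):
        x_i, n = s_c[i, 1:], s_c.shape[1] - 1
        f_tilde = torch.cat([-torch.ones(1), f(x_i, a_star[i])])      # (17)
        sig = sigma(x_i, a_star[i])
        # Hessian of q* w.r.t. x only (the tau-row of sigma_tilde is zero, see (17))
        hess_rows = []
        for j in range(n):
            row = torch.autograd.grad(grad_q[i, 1 + j], s_c,
                                      create_graph=True)[0][i, 1:]
            hess_rows.append(row)
        hess_x = torch.stack(hess_rows)
        second_order = 0.5 * torch.trace(sig @ sig.T @ hess_x)
        residuals.append(grad_q[i] @ f_tilde + second_order)          # LHS of (16)
    loss_p = torch.stack(residuals).pow(2).mean()

    # ---- Boundary loss L_B for (18)-(19) would be added analogously ----
    loss_b = torch.zeros(())

    return loss_d + lam * loss_p + mu * loss_b
```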
III. NUMERICAL EXPERIMENTS
This section discusses numerical results of autonomous driving based on maximal safety probability learning. After presenting the training methodology in Sec. III-A, we provide results for normal cornering in Sec. III-B and high-speed drifting in Sec. III-C. Our implementation is available at https://github.com/hoshino06/pirl_itsc2024.
A. Training Methodology
The PIRL agent learns maximal safety probabilities by
interacting with CARLA [30], an open-source simulator
providing high-fidelity vehicle physics simulations that are
not explicitly described by control-oriented models such as a
bicycle model. The input to the neural network $Q$, the function approximator for our action-value function, is the state $s \in \mathcal{S}$ defined in (5), and the output is a vector of q-values for all discrete actions, as explained below. The state $s$ consists of the vehicle state $x$ and the outlook horizon $\tau$:
$$ s = [\tau, x^\top]^\top, \quad (21) $$
and $x$ can be further decomposed into $x_{\mathrm{vehicle}}$ for the vehicle dynamics and $x_{\mathrm{road}}$ for the vehicle position relative to the road:
$$ x = [\, \underbrace{v_x, \beta, r}_{x_{\mathrm{vehicle}}},\ \underbrace{e, \psi, T}_{x_{\mathrm{road}}} \,]^\top \in \mathbb{R}^{15}, \quad (22) $$
where $x_{\mathrm{vehicle}}$ consists of $v_x$, the vehicle longitudinal velocity; $\beta$, the sideslip angle, defined as $\beta := \arctan(v_y / v_x)$ with the lateral velocity $v_y$; and $r$, the yaw rate. The variable $x_{\mathrm{road}}$ consists of $e$, the lateral error from the center line of the road; $\psi$, the heading error; and $T$, which contains five $(x, y)$ positions of reference points ahead of the vehicle placed on the center line of the lane. The control action $a$ is given by
$$ a = [\delta, d]^\top, \quad (23) $$
where $\delta$ is the steering angle and $d$ is the throttle. They are normalized to $[-1, 1]$ and $[0, 1]$, respectively. To implement our DQN-based algorithm, which admits a discrete action space, the control inputs are restricted to $d \in \{0.6, 0.7, 0.8, 0.9, 1.0\}$ and $\delta \in \{-0.8, -0.4, 0, 0.4, 0.8\}$. The reason for limiting $d \geq 0.6$ is to prevent slow driving with trivially safe behaviors (the vehicle is safe if it never moves). The steering is limited to $|\delta| \leq 0.8$, since high-speed vehicles are prone to rollover if large steering angles are applied [19].
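A minimal sketch of this discrete action set: the 25 $(\delta, d)$ pairs are enumerated so that the index selected by the DQN's argmax can be mapped back to a steering/throttle command. The names and the index ordering are illustrative choices.

```python
import itertools

STEER_SET = [-0.8, -0.4, 0.0, 0.4, 0.8]     # normalized steering delta
THROTTLE_SET = [0.6, 0.7, 0.8, 0.9, 1.0]    # throttle d >= 0.6 to avoid trivially safe slow driving

# 25 discrete actions a = [delta, d]^T, indexed to match the Q-network's 25 outputs
ACTIONS = [(delta, d) for delta, d in itertools.product(STEER_SET, THROTTLE_SET)]

def index_to_action(idx):
    """Map a DQN action index (argmax over the 25 Q-values) to (delta, d)."""
    return ACTIONS[idx]

# e.g., greedy control from a Q-value vector q of length 25:
# delta, d = index_to_action(int(np.argmax(q)))
```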
For better sample efficiency and generalization to unseen regions, PIRL can exploit the structure of a control-oriented model encoded in the form of the PDE (16) and the regularization loss term $L_P$ in (20). In this paper, we used a dynamic bicycle model with a Pacejka tire model. The parameters of the model that correspond to the vehicle dynamics in CARLA are assumed to be known in this paper, but PINNs can also be used for the discovery of coefficients of PDEs from data [28], or the entire tire model can be obtained from online learning [31]. Also, the term $\sigma(X_t, U_t)$ and the convection term for the reference points $T$ were neglected in the simulations; these need to be constructed or learned from data for more precise estimation.
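For reference, here is a minimal sketch of a dynamic bicycle model with a simplified Pacejka (magic formula) tire model, i.e., the kind of control-oriented drift term $f(x, u)$ that can be supplied to the PDE residual above. All parameter values and the throttle-to-force map are illustrative placeholders, not the CARLA-matched parameters used in the paper, and the steering input is treated here as a physical angle in radians rather than a normalized command.

```python
import numpy as np

# Illustrative parameters (NOT the CARLA-matched values used in the paper)
M, IZ = 1500.0, 2500.0                    # mass [kg], yaw inertia [kg m^2]
LF, LR = 1.2, 1.4                         # CG-to-axle distances [m]
B_TIRE, C_TIRE, D_TIRE = 10.0, 1.9, 1.0   # Pacejka coefficients (normalized)
G = 9.81

def pacejka_lateral_force(alpha, fz):
    """Simplified magic formula: Fy = Fz * D * sin(C * arctan(B * alpha))."""
    return fz * D_TIRE * np.sin(C_TIRE * np.arctan(B_TIRE * alpha))

def bicycle_drift(x_vehicle, u):
    """f(x, u) for x_vehicle = [vx, beta, r] and u = [delta, d] (sketch)."""
    vx, beta, r = x_vehicle
    delta, d = u
    vy = vx * np.tan(beta)
    # slip angles of front/rear axles
    alpha_f = np.arctan2(vy + LF * r, vx) - delta
    alpha_r = np.arctan2(vy - LR * r, vx)
    # static normal loads and lateral tire forces
    fzf = M * G * LR / (LF + LR)
    fzr = M * G * LF / (LF + LR)
    fyf = -pacejka_lateral_force(alpha_f, fzf)
    fyr = -pacejka_lateral_force(alpha_r, fzr)
    fx = 5000.0 * d                       # crude throttle-to-force map (placeholder)
    # single-track dynamics
    dvx = (fx - fyf * np.sin(delta)) / M + vy * r
    dvy = (fyf * np.cos(delta) + fyr) / M - vx * r
    dr = (LF * fyf * np.cos(delta) - LR * fyr) / IZ
    dbeta = (dvy * vx - vy * dvx) / (vx**2 + vy**2)   # d/dt arctan(vy/vx)
    return np.array([dvx, dbeta, dr])
```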
B. Normal Cornering
We first show preliminary results with a normal cornering
task. The objective is to keep the vehicle within the lane
while driving on a road. The safe set $\mathcal{C} \subset \mathcal{X}$ is given by
$$ \mathcal{C} = \{ x \in \mathcal{X} \mid |e| \leq E_{\max} \}, \quad (24) $$
where $E_{\max}$ is the maximum distance from the center line of the lane, and it is assumed to be constant. We used a corner in a built-in map of the CARLA simulator (southwest part of Town2, shown in Fig. 2), and the bound of the lateral error is set to $E_{\max} = 1$ m.
Fig. 2: Lane keeping with normal cornering
Fig. 3: Training progress for normal cornering task
During the training, the initial vehicle
speed was randomly selected within [5 m/s, 15 m/s], and the vehicle spawn point was randomized for each episode. Also, the outlook horizon $\tau$ was uniformly randomized between 0 and 5.0 s. For the function approximator $Q$, we used a neural network with 3 hidden layers of 32 units per layer and the hyperbolic tangent (tanh) as the activation function. At the output layer, the sigmoid activation function was used.
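A PyTorch sketch of the function approximator described above: three hidden tanh layers of 32 units and a sigmoid output layer so that the estimated q-values lie in $[0, 1]$, with one output per discrete action. The input dimension follows (21)-(22), and the class name is illustrative.

```python
import torch
import torch.nn as nn

class SafetyQNetwork(nn.Module):
    """Q(s, .; theta) for s = [tau, x^T]^T in R^16 and 25 discrete actions."""
    def __init__(self, state_dim=1 + 15, n_actions=25, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions), nn.Sigmoid(),  # q-values in [0, 1]
        )

    def forward(self, s):
        return self.net(s)

# q = SafetyQNetwork()(torch.randn(4, 16))   # batch of 4 augmented states
```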
Figure 3 shows the training progress with a learning rate of $5 \times 10^{-4}$ and a regularization weight of $1 \times 10^{-4}$. The orange and blue lines show the episode reward and the q-value averaged over a moving window of 500 episodes. The solid curves represent the mean of 10 repeated experiments, and the shaded region shows their standard deviation. While there is a large variance during the transient phase of learning, it converges to similar values after about 20,000 episodes. As defined in (3), the safety probability depends on the initial state and the horizon $\tau$. Figure 4a shows the dependency of the safety probability on the lateral error $e$ and the relative heading angle $\psi$ when the vehicle is placed on the straight part of the road. The vehicle state and the horizon are fixed to $x_{\mathrm{vehicle}} = [10~\mathrm{m/s}, 0, 0]^\top$ and $\tau = 5.0$ s. The learned safety probability is as high as approximately 1 near $(e, \psi) = (0, 0)$, and decreases near the boundaries of the road ($|e| = 1$ m). It also decreases when the heading angle is large. Similarly, Fig. 4b shows that the safety probability decreases as the velocity $v_x$ increases and is higher on the inner side of the curve ($e < 0$). Overall, the above result
appropriately describes how the safety probability depends on the vehicle state and position in the lane. However, the values of the safety probability differed across agents, and further investigation is needed to ensure the accuracy. Nevertheless, it has been confirmed that the learned agents safely go through the corner when they are placed near the center of the road.
Fig. 4: Learned safety probability. (a) Effect of $e$ and $\psi$ (on the straight road). (b) Effect of $e$ and $v_x$ (at the corner).
C. Safe Drifting
Here we present results for safe drifting based on maximal safety probability learning. A racing circuit for CARLA developed in [19] is used, and the training is performed at a specific corner, as shown in Fig. 5. The road width is about 20 m and $E_{\max}$ could be as large as 10.0 m, but the safe set was conservatively set with $E_{\max} = 8.0$ m during the training. The initial vehicle speed was set to 30 m/s, and the slip angle $\beta$ and the yaw rate $r$ were randomized in the ranges
$$ \beta \in [-25~\mathrm{deg}, -20~\mathrm{deg}], \quad (25) $$
$$ r \in [50~\mathrm{deg/s}, 70~\mathrm{deg/s}]. \quad (26) $$
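A small sketch of this episode initialization: the sampling ranges follow (25)-(26) and the fixed 30 m/s initial speed, while the uniform outlook-horizon sampling mirrors the cornering setup and is an assumption here.

```python
import numpy as np

def sample_drift_initial_condition(rng=np.random.default_rng()):
    """Randomized initial condition for the safe-drifting training episodes (sketch)."""
    vx = 30.0                                         # initial speed [m/s]
    beta = np.deg2rad(rng.uniform(-25.0, -20.0))      # slip angle, (25)
    r = np.deg2rad(rng.uniform(50.0, 70.0))           # yaw rate, (26)
    tau = rng.uniform(0.0, 5.0)                       # outlook horizon [s] (assumed)
    return tau, np.array([vx, beta, r])
```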
The neural network has the same architecture as that used above.
For this task, the training progress is strongly influenced by the learning rate, as shown in Fig. 6. As before, the solid curves represent the mean of the averaged rewards of 8 repeated experiments, and the shaded regions show their standard deviations. In the initial phase of training, the averaged reward (empirical safety probability) gradually decreases, but after a while it starts to increase. The timing of this increase is earlier for larger learning rates, but the final episode reward increases as the learning rate is reduced from $2 \times 10^{-4}$ to $5 \times 10^{-5}$. However, at the learning rate of $2 \times 10^{-5}$, the agents failed to learn a safe policy, and the reward dropped to zero in all but 2 cases out of 8 experiments. Also, while the learning rate of $5 \times 10^{-5}$ performed best among the four settings above, there were a few cases where the reward dropped to zero, which were excluded from the plot in Fig. 6. After these trials, we selected an agent trained with the learning rate $5 \times 10^{-5}$ at a checkpoint of 90,000 episodes that behaved best when tested in closed-loop simulations.
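A minimal sketch of how a trained agent can be deployed in closed loop: at each control step the remaining horizon and the current vehicle/road state are stacked into $s = [\tau, x^\top]^\top$, and the greedy action over the 25 discrete Q-values is applied. The environment interface and the index_to_action mapping are the illustrative placeholders introduced earlier, not the paper's code.

```python
import numpy as np

def run_closed_loop(q_net, get_state, apply_control, index_to_action,
                    tau=5.0, dt=0.05):
    """Greedy deployment of a learned safety Q-network (sketch).

    Placeholders: get_state() -> current x in R^15, apply_control(delta, d),
    q_net(s) -> vector of 25 q-values for the augmented state s.
    """
    h = tau
    while h >= 0.0:
        x = get_state()
        s = np.concatenate(([h], x))          # augmented state [tau, x^T]^T
        q_values = q_net(s)
        delta, d = index_to_action(int(np.argmax(q_values)))
        apply_control(delta, d)
        h -= dt                               # remaining horizon decreases
```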
Figure 7a shows the vehicle trajectories simulated using the learned agent. The cross mark (×) in the figure shows the initial position of the vehicle, and 20 trajectories with different initial conditions of $\beta$ and $r$ are illustrated. It can be confirmed that the learned policy navigates through the corner without hitting the boundary of the track for different initial conditions.
Fig. 5: Drifting in a racing circuit by a learned agent
Fig. 6: Training progress for safe drifting with different learning rates
As shown in Fig. 7b, the slip angle and
yaw rate take large values, and the vehicle is safely drifting
while cornering. These safe behaviors have been learned only from sparse binary rewards. It is interesting that such a drifting maneuver can be learned by maximizing the safety probability without providing a specific reference trajectory or laborious reward shaping.
IV. CONCLUSIONS
This paper explored autonomous driving based on the
learning of maximal safety probability by using Physics-
informed Reinforcement Learning (PIRL). Safe drifting was
achieved only from sparse binary rewards without providing
a specific reference trajectory or laborious reward shaping.
Since this concept is related to forward invariance in the
state space, the learned safety probability may be used to
safeguard a nominal controller which does not necessarily
consider safety specifications. Future work includes verifying the accuracy of the learned safety probability and testing the proposed framework in more diverse scenarios.
ACKNOWLEDGMENT
The authors would like to thank Dvij Kalaria for providing
vehicle model parameters for CARLA simulation.
Fig. 7: Closed-loop simulation with the learned policy. (a) Vehicle trajectories. (b) Vehicle state.
REFERENCES
[1] A. Liniger, A. Domahidi, and M. Morari, "Optimization-based autonomous racing of 1:43 scale RC cars," Optimal Control Applications and Methods, vol. 36, no. 5, pp. 628–647, 2015.
[2] J. Kabzan et al., "AMZ driverless: The full autonomous racing system," arXiv:1905.05150 [cs.RO], 2019.
[3] L. Zhang, H. Ding, J. Shi, Y. Huang, H. Chen, K. Guo, and Q. Li,
“An adaptive backstepping sliding mode controller to improve vehicle
maneuverability and stability via torque vectoring control,” IEEE
Transactions on Vehicular Technology, vol. 69, no. 3, pp. 2598–2612,
2020.
[4] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath,
and P. Tabuada, “Control barrier functions: Theory and applications,”
in 2019 18th European Control Conference (ECC), 2019, pp. 3420–
3431.
[5] W. Xiao, N. Mehdipour, A. Collin, A. Y. Bin-Nun, E. Frazzoli, R. D.
Tebbens, and C. Belta, “Rule-based optimal control for autonomous
driving,” in Proceedings of the ACM/IEEE 12th International Confer-
ence on Cyber-Physical Systems, 2021, p. 143–154.
[6] Y. Huang and Y. Chen, "Safety-guaranteed driving control of automated vehicles via integrated CLFs and CDBFs," in Modeling, Estimation and Control Conference (MECC 2021), 2021, pp. 153–159.
[7] M. Black, M. Jankovic, A. Sharma, and D. Panagou, “Future-focused
control barrier functions for autonomous vehicle control,” in 2023
American Control Conference (ACC), 2023, pp. 3324–3331.
[8] S. Gangadhar, Z. Wang, H. Jing, and Y. Nakahira, “Adaptive safe con-
trol for driving in uncertain environments,” in 2022 IEEE Intelligent
Vehicles Symposium (IV). IEEE, Jun. 2022, pp. 1662–1668.
[9] S. Gangadhar, Z. Wang, K. Poku, N. Yamada, K. Honda, Y. Nakahira,
H. Okuda, and T. Suzuki, “An occlusion-and interaction-aware
safe control strategy for autonomous vehicles,” IFAC-PapersOnLine,
vol. 56, no. 2, pp. 7926–7933, 2023.
[10] T. Brüdigam, M. Olbrich, D. Wollherr, and M. Leibold, "Stochastic model predictive control with a safety guarantee for automated driving," IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 22–36, 2023.
[11] A. Carvalho, Y. Gao, S. Lefevre, and F. Borrelli, “Stochastic predictive
control of autonomous vehicles in uncertain environments,” in 12th
International Symposium on Advanced Vehicle Control, 2014, pp. 712–
719.
[12] R. Y. Hindiyeh and J. Christian Gerdes, “A controller framework for
autonomous drifting: Design, stability, and experimental validation,”
J. Dyn. Syst. Meas. Control, vol. 136, no. 5, p. 051015, Jul. 2014.
[13] A. Bárdos, A. Domina, V. Tihanyi, Z. Szalay, and L. Palkovics, "Implementation and experimental evaluation of a MIMO drifting controller on a test vehicle," in 2020 IEEE Intelligent Vehicles Symposium (IV), 2020, pp. 1472–1478.
[14] D. Xu, G. Wang, L. Qu, and C. Ge, “Robust control with uncertain
disturbances for vehicle drift motions,” Applied Sciences, vol. 11,
no. 11, 2021.
[15] G. Bellegarda and Q. Nguyen, “Dynamic vehicle drifting with nonlin-
ear mpc and a fused kinematic-dynamic bicycle model,” IEEE Control
Systems Letters, vol. 6, pp. 1958–1963, 2022.
[16] B. Yang, Y. Lu, X. Yang, and Y. Mo, “A hierarchical control framework
for drift maneuvering of autonomous vehicles,” in 2022 International
Conference on Robotics and Automation (ICRA), 2022, pp. 1387–
1393.
[17] G. Chen, X. Zhao, Z. Gao, and M. Hua, “Dynamic drifting control
for general path tracking of autonomous vehicles,” IEEE Transactions
on Intelligent Vehicles, vol. 8, no. 3, pp. 2527–2537, Mar. 2023.
[18] M. Cutler and J. P. How, “Autonomous drifting using simulation-aided
reinforcement learning,” in 2016 IEEE International Conference on
Robotics and Automation (ICRA), 2016, pp. 5442–5448.
[19] P. Cai, X. Mei, L. Tai, Y. Sun, and M. Liu, “High-speed autonomous
drifting with deep reinforcement learning,” IEEE Robotics and Au-
tomation Letters, vol. 5, no. 2, pp. 1247–1254, 2020.
[20] L. Orgován, T. Bécsi, and S. Aradi, "Autonomous drifting using reinforcement learning," Periodica Polytechnica Transportation Engineering, vol. 49, no. 3, pp. 292–300, 2021. [Online]. Available: https://pp.bme.hu/tr/article/view/18581
[21] S. H. Tóth, Á. Bárdos, and Z. J. Viharos, "Tabular Q-learning based reinforcement learning agent for autonomous vehicle drift initiation and stabilization," IFAC-PapersOnLine, vol. 56, no. 2, pp. 4896–4903, 2023.
[22] D. Schnieders, S. Bhattacharjee, K. D. Kabara, and K. D. Kabara,
“Autonomous drifting rc car with reinforcement learning,” https://api.
semanticscholar.org/CorpusID:13744888, 2018.
[23] F. Domberg, C. C. Wembers, H. Patel, and G. Schildbach, “Deep
drifting: Autonomous drifting of arbitrary trajectories using deep
reinforcement learning,” in 2022 International Conference on Robotics
and Automation (ICRA), 2022, pp. 7753–7759.
[24] B. Leng, Y. Yu, M. Liu, L. Cao, X. Yang, and L. Xiong, “Deep rein-
forcement learning-based drift parking control of automated vehicles,”
Sci. China Tech. Sci., vol. 66, no. 4, pp. 1152–1165, Apr. 2023.
[25] G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Paduraru,
S. Gowal, and T. Hester, “Challenges of real-world reinforcement
learning: definitions, benchmarks and analysis,” Mach. Learn., vol.
110, no. 9, pp. 2419–2468, Sep. 2021.
[26] H. Hoshino and Y. Nakahira, “Physics-informed RL for maximal
safety probability estimation,” in 2024 American Control Conference
(ACC), 2024, pp. 3561–3568.
[27] W. H. Fleming and H. M. Soner, Controlled Markov Processes and Viscosity Solutions, 2nd ed. Springer, 2006.
[28] M. Raissi, P. Perdikaris, and G. E. Karniadakis, “Physics-informed
neural networks: A deep learning framework for solving forward and
inverse problems involving nonlinear partial differential equations,” J.
Comput. Phys., vol. 378, pp. 686–707, Feb. 2019.
[29] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,
S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran,
D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through
deep reinforcement learning,” Nature, vol. 518, pp. 529–533, 2015.
[30] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun,
“CARLA: An open urban driving simulator,” in Proceedings of the
1st Annual Conference on Robot Learning, Nov 2017, pp. 1–16.
[31] D. Kalaria, Q. Lin, and J. M. Dolan, “Adaptive planning and control
with time-varying tire models for autonomous racing using extreme
learning machine,” arXiv: 2303.08235 [cs.RO], 2023.