Robust RL with LLM-Driven Data Synthesis and
Policy Adaptation for Autonomous Driving
Sihao Wu1, Jiaxu Liu1, Xiangyu Yin1, Guangliang Cheng1, Meng Fang1,
Xingyu Zhao2, Xinping Yi3, Xiaowei Huang1
1University of Liverpool, 2WMG, University of Warwick, 3Southeast University
{firstname.lastname}@liverpool.ac.uk, xingyu.zhao@warwick.ac.uk, xyi@seu.edu.cn
Abstract: The integration of Large Language Models (LLMs) into autonomous
driving systems demonstrates strong common sense and reasoning abilities, effec-
tively addressing the pitfalls of purely data-driven methods. Current LLM-based
agents require lengthy inference times and face challenges in interacting with real-
time autonomous driving environments. A key open question is whether we can
effectively leverage the knowledge from LLMs to train an efficient and robust Re-
inforcement Learning (RL) agent. This paper introduces RAPID, a novel Robust
Adaptive Policy Infusion and Distillation framework, which trains specialized
mix-of-policy RL agents using data synthesized by an LLM-based driving agent
and online adaptation. RAPID features three key designs: 1) utilization of offline
data collected from an LLM agent to distil expert knowledge into RL policies for
faster real-time inference; 2) introduction of robust distillation in RL to inherit
both performance and robustness from the LLM-based teacher; and 3) employment
of a mix-of-policy approach for joint decision decoding with a policy adapter.
Through fine-tuning via online environment interaction, RAPID reduces the for-
getting of LLM knowledge while maintaining adaptability to different tasks. Ex-
tensive experiments demonstrate RAPID’s capability to effectively integrate LLM
knowledge into scaled-down RL policies in an efficient, adaptable, and robust
way. Code and checkpoints will be made publicly available upon acceptance.
Keywords: Reinforcement Learning, Robust Knowledge Distillation, LLM
1 Introduction
The integration of Large Language Models (LLMs) with emergent capabilities into autonomous
driving presents an innovative approach [1,2,3]. Previous work suggests that LLMs can significantly
enhance the common sense and reasoning abilities of autonomous vehicles, effectively addressing
several pitfalls of purely data-driven methods [4,5,6]. However, LLMs face several challenges,
primarily in generating effective end-to-end instructions in real-time and dynamic driving scenarios.
This limitation stems from two primary factors: the extended inference time required by LLM-based
agents [5] and the difficulty these agents face in continuous data collection and learning [7], which
renders them unsuitable for real-time decision-making in dynamic driving environments. Further-
more, faster and smaller models, which are often preferred for real-time applications, have a higher
risk of being vulnerable to adversarial attacks compared to larger models [8,9,10,11]. These
challenges drive us to tackle the following questions:
How to develop an efficient, adaptable, and robust agent that can leverage the
capabilities of the LLM-based agent for autonomous driving?
One potential solution is to use the LLM as a teacher policy to instruct the learning of a lighter, spe-
cialized student RL policy through knowledge distillation [12,13]. This allows the student model
to inherit the reasoning abilities of the LLM while being lightweight enough for real-time infer-
ence. Another approach is to employ LLMs to generate high-level plans or instructions, which are
then executed by a separate controller [14]. This decouples the reasoning and execution processes,
allowing for faster reaction times. Furthermore, techniques such as Instruction Tuning (IT) [15]
and In-Context Learning (ICL) [16] can adapt the LLM to new tasks without extensive fine-tuning.
However, these approaches still have limitations. Knowledge distillation may result in information
forgetting or loss of generalization ability. Generating high-level plans relies on the strong assump-
tion that the LLM can provide complete and accurate instructions. IT and ICL are sensitive to the
choice of prompts and demonstrations, which require careful design for each task [17,18].
To tackle the above challenges, we propose RAPID, a Robust Adaptive Policy Infusion and
Distillation framework that incorporates LLM knowledge into offline RL for autonomous driving,
to leverage the common sense and robustness of LLMs and to solve challenging scenarios such as unseen corner cases. Our method encompasses several designs: (1) We utilize offline data collected from LLM-based agents to facilitate the distillation of expert knowledge into faster policies for real-time inference. (2) We introduce robust distillation in offline RL to inherit not only the performance but also the robustness of LLM teachers. (3) We introduce a mix-of-policies approach for joint action decoding with a policy adapter. Through fine-tuning via online environment interaction, we prevent the forgetting of LLM knowledge while keeping adaptability to various RL environments.
To the best of our knowledge, this work pioneers the distillation of knowledge from LLM-based
agents into RL policies through offline training combined with online adaptation in the context
of autonomous driving. Through extensive experiments, we demonstrate RAPID’s capability to
effectively integrate LLM knowledge into scaled-down RL policies in a transferable, robust, and
efficient way.
2 Preliminaries
Notation. We consider a sequential decision-making problem, formalized as a Markov Decision Process (MDP) $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ represent the state and action spaces, respectively. The transition probability function is denoted by $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to P(\mathcal{S})$ and the reward function by $\mathcal{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$; $\gamma$ denotes the discount factor. The objective is to acquire an optimal policy $\pi: \mathcal{S} \to \mathcal{A}$ that maximizes the expected cumulative return, $\max_\pi \mathbb{E}[\sum_t \gamma^t r_t]$. The policy $\pi$ is parameterized by $\theta$. A typical gradient-based RL algorithm minimizes a surrogate loss $J(\theta)$ by gradient descent with respect to $\theta$; this loss is estimated from sampled trajectories, each comprising a sequence of state-action-reward tuples.
Offline RL. The aim is to learn effective policies from pre-collected datasets, eliminating the need for further interaction. Given a dataset $\mathcal{D} = \{(s, a, r, s')\}$ containing trajectories collected under an unknown behavior policy $\pi_B$, the iterative Q-update step with learned policy $\pi$ is expressed as
$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; J(Q, \pi, \mathcal{D}), \qquad \text{(Policy Evaluation)} \tag{1}$$
where
$$J(Q, \pi, \mathcal{D}) := \mathbb{E}_{s,a,s' \sim \mathcal{D}}\Big[\big(r(s,a) + \gamma\, \mathbb{E}_{a' \sim \hat{\pi}(a'|s')}\big[\hat{Q}^{k}(s', a')\big] - Q(s,a)\big)^{2}\Big]. \tag{2}$$
With the updated Q-function, the policy is improved by
$$\hat{\pi}^{k+1} \leftarrow \arg\max_{\pi}\; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi^{k}(a|s)}\big[\hat{Q}^{k+1}(s, a)\big]. \qquad \text{(Policy Improvement)} \tag{3}$$
Conservative Q-Learning. Offline RL algorithms following this fundamental approach are often challenged by the issue of action distribution shift [19]. Therefore, [20] proposed conservative Q-learning, where the Q-values of Out-Of-Distribution (OOD) actions are penalized:
$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; J(Q, \pi, \mathcal{D}) + \alpha\Big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot|s)}[Q(s,a)] - \mathbb{E}_{s,a \sim \mathcal{D}}[Q(s,a)]\Big), \tag{4}$$
where $\mu$ is an approximation of the policy that maximizes the current Q-function. Moreover, [21] found that the effectiveness of offline RL algorithms is significantly influenced by the characteristics of the dataset, which motivated us to explore the influence of LLM-generated datasets.
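For concreteness, the conservative penalty of Eq. (4) for a discrete-action Q-network can be sketched as follows. This is a minimal PyTorch-style illustration under our own assumptions (the experiments in Sec. 4 use the d3rlpy implementation instead); the log-sum-exp over actions is the usual approximation of $\mathbb{E}_{a \sim \mu}[Q(s,a)]$ in the discrete case.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    """Discrete-action CQL loss: TD error of Eq. (2) plus the penalty of Eq. (4).

    `batch` is assumed to hold tensors 's', 'a', 'r', 's_next', 'done'.
    """
    s, a, r, s_next, done = (batch[k] for k in ("s", "a", "r", "s_next", "done"))

    # TD term J(Q, pi, D) from Eq. (2), with a target network standing in for Q^k.
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(dim=1).values
    td_loss = F.mse_loss(q_sa, target)

    # Conservative term of Eq. (4): push down Q on OOD actions (the log-sum-exp over
    # the action set approximates E_{a~mu}[Q(s,a)]) and push up Q on dataset actions.
    conservative = (torch.logsumexp(q_net(s), dim=1) - q_sa).mean()
    return td_loss + alpha * conservative
```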
3 RAPID: Robust Distillation and Adaptive Policy Infusion
3.1 Offline Dataset Collection
As shown in Fig. 2(a), we conducted a closed-loop driving experiment on HighwayEnv [22] using GPT-3.5 [23] to collect the offline dataset. The vehicle density and number of lanes in HighwayEnv can be adjusted, and we choose LANE-3-DENSITY-2 as the base environment. As a text-only LLM, GPT-3.5 cannot directly interact with the HighwayEnv simulator. To facilitate its observation and decision-making processes, the experiment incorporates perception tools and agent prompts, enabling GPT-3.5 to effectively engage with the simulated environment. The prompts have the following stages: (1) Prefix prompt: the LLM obtains the current driving scene and historical information. (2) Reasoning: by employing the ReAct framework [24], the LLM-based agent reasons about the appropriate driving actions based on the scene. (3) Output decision: the LLM outputs its decision on which meta-action to take. The agent has access to 5 meta-actions: lane left, lane right, faster, slower, and idle. More details about the prompt setup are provided in Appendix F. Through the iterative closed-loop process described above, we collect the dataset $\mathcal{D}_{\text{LLM}} = \{(s, a, r, s') \mid a \sim \pi_{\text{LLM}}(a|s)\}$, where $\pi_{\text{LLM}}$ is the LLM agent.
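Phase 1 can be summarised as the closed loop sketched below. The `llm_agent` callable is a hypothetical wrapper (not part of the released code) that builds the prompt of Appendix F, queries GPT-3.5, and maps its answer to one of the five meta-action indices; the environment configuration keys follow the highway-env documentation.

```python
import gymnasium as gym
import highway_env  # noqa: F401  (registers the highway environments)

def collect_llm_dataset(llm_agent, n_transitions=3000):
    """Roll out the LLM-based agent pi_LLM and store (s, a, r, s', done) transitions."""
    env = gym.make("highway-v0")
    env.unwrapped.configure({"lanes_count": 3, "vehicles_density": 2})
    dataset = []
    obs, _ = env.reset()
    while len(dataset) < n_transitions:
        action = llm_agent(obs)  # meta-action index decided by the prompted GPT-3.5
        next_obs, reward, terminated, truncated, _ = env.step(action)
        dataset.append((obs, action, reward, next_obs, terminated))
        obs = next_obs
        if terminated or truncated:
            obs, _ = env.reset()
    return dataset
```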
3.2 Robustness Regularized Distillation
Figure 1: Cumulative reward on LANE-3-DENSITY-2 with progressively increasing observation (s) attack strength (Clean, Uniform, Gaussian, FGSM, PGD). The LLM-based agent π_LLM exhibits improved robustness compared with vanilla Offline DQN.
Recall the offline RL objective in Eqs. (3)-(4). Let the LLM-distil policy be $\pi_{\text{distil}}$; with the collected dataset $\mathcal{D}_{\text{LLM}}$, offline training optimizes $J(Q, \pi_{\text{distil}}, \mathcal{D}_{\text{LLM}})$ for an improved Q-function and then updates the policy with respect to Q. Empirically, as shown in Fig. 1, the LLM-based agent $\pi_{\text{LLM}}$ is more robust against malicious state injection in the autonomous driving setting. However, a distilled offline policy is not as robust as the LLM, as demonstrated by [25]: the Q-value can change drastically over neighbouring states, leading to an unstable policy. Therefore, vanilla offline RL algorithms cannot robustly distil information into the LLM-distil agent. Inspired by [10, 11], we formulate a novel training objective by introducing a discrepancy term into Eq. (4), allowing the distillation of adversarial knowledge to the offline agent.
$$J_{\text{robust}}(Q, \pi, \mathcal{D}) := J(Q, \pi, \mathcal{D}) + \alpha\Big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot|s)}[Q(s,a)] - \mathbb{E}_{s,a \sim \mathcal{D}}[Q(s,a)]\Big) + \beta\, \mathbb{E}_{s,a \sim \mathcal{D}}\bigg[\log \frac{\sigma(Q(\tilde{s}, a))}{\mathrm{onehot}(a)}\bigg], \quad \text{where } \tilde{s} = \arg\max_{\|\tilde{s} - s\|_{2} \le \epsilon} \mathbb{E}_{s,a \sim \mathcal{D}}\bigg[\log \frac{\sigma(Q(\tilde{s}, a))}{\mathrm{onehot}(a)}\bigg]. \tag{5}$$
In Eq. (5), $\sigma(\cdot)$ denotes the softmax function, characterizing the probability that the Q-network assigns to each action, and $\mathrm{onehot}(\cdot)$ converts the selected action from the offline dataset into a one-hot vector, characterizing the definite event. Therefore, $\mathbb{E}_{s,a \sim \mathcal{D}}\big[\log \frac{\sigma(Q(\tilde{s}, a))}{\mathrm{onehot}(a)}\big]$ can be viewed as the KL divergence between the Q-network output distribution and the categorical distribution of the one-hot action. Essentially, the arg max constraint identifies the adversarial state that yields the worst Q performance, while the objective $J_{\text{robust}}$ seeks the optimal Q-function that neutralizes the adversarial attack. This process forms an adversarial training procedure, where $\alpha$ and $\beta$ are hyperparameters that balance the conservative and robustness terms, respectively.
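To make the robustness term of Eq. (5) concrete, the sketch below approximates the inner arg max with a few projected sign-gradient steps inside the $\ell_2$ ball (this PGD-style approximation is our assumption; the paper only states the constraint), and implements the divergence to the one-hot label as a cross-entropy between the softmax of the Q-values at the perturbed state and the dataset action.

```python
import torch
import torch.nn.functional as F

def robust_term(q_net, s, a, epsilon=0.1, steps=10, step_size=0.01):
    """Approximate the beta-weighted term of Eq. (5) for a batch of (s, a) pairs."""
    delta = torch.zeros_like(s, requires_grad=True)
    for _ in range(steps):
        # Inner objective: divergence between softmax(Q(s~, .)) and onehot(a),
        # treated here as a cross-entropy against the dataset action a.
        loss = F.cross_entropy(q_net(s + delta), a.long())
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()                      # ascend the objective
            norm = delta.flatten(1).norm(dim=1).clamp(min=1e-12)  # per-sample l2 norm
            scale = (epsilon / norm).clamp(max=1.0)
            delta *= scale.view(-1, *([1] * (delta.dim() - 1)))   # project onto the l2 ball
    s_tilde = (s + delta).detach()                                # adversarial state
    return F.cross_entropy(q_net(s_tilde), a.long())              # weighted by beta in Eq. (5)
```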
3.3 LLM Knowledge Infused Robust Policy with Environment Adaptation
In autonomous driving, the RL policy typically observes the ego vehicle as well as the surrounding vehicles' information, and the ego action is predicted by considering the motion of all captured vehicles. Assuming $V$ captured vehicles, $F$ vehicle features, and $A$ action features¹, the state can be encapsulated by a matrix in $\mathbb{R}^{V \times F}$, and the action is simply a vector in $\mathbb{R}^{A}$.
¹features: highway-env.farama.org/observations, actions: highway-env.farama.org/actions
Figure 2: Our proposed RAPID framework; only the modules marked as trainable are trained. (a) Phase 1 (Offline Data Collection): collect state-action rollouts from the environment with the LLM agent and store them in a replay buffer. (b) Phase 2 (LLM Knowledge Distillation via Offline RL): distill LLM knowledge into an offline policy using the collected data; the adapter policy is frozen and its output tokens are masked by zero gates. (c) Phase 3 (Online RL Agent Adaptation): adapt the pre-trained model online by interacting with the environment; the LLM-distilled policy is frozen, and the zero gate is trained for progressive adaptation.
A typical way to obtain actions from observations is to concatenate all $V$ feature rows into a single vector and feed it into an MLP $: \mathbb{R}^{VF} \to \mathbb{R}^{A}$. However, when the vanilla RL agent is expanded to multiple policies for a joint decision, i.e., the action is modelled via interdisciplinary collaboration (e.g., LLM-distilled knowledge and online environment interaction), such a simple MLP encodes all vehicles into one unified embedding, which fails to model the explicit decision process for each individual vehicle and thus lacks explainability. Below, we discuss how to incorporate different sources of knowledge for joint decision prediction.
3.3.1 Mixture-of-Policies via Vehicle Features Tokenization
Let the observation be $s \in \mathbb{R}^{V \times F}$ and assume we have $N$ policies for the joint decision. Our approach is to view the state $s$ as a sequence of tokens, where the sequence is of length $V$ and each token is of dimension $F$. Borrowing ideas from language models [26, 27, 28], we implement a mixture-of-policies (MoP) for joint decision-making. We illustrate the process of obtaining an action $a$ from a state $s$ as follows. Assume the $N$ policies $\pi_{1...N}: \mathbb{R}^{V \times F} \to \mathbb{R}^{V \times D}$, where $D$ is the latent token dimension. The state is first fed into a router network $G: \mathbb{R}^{V \times F} \to \mathbb{R}^{V \times N}$ to get the policy weights $G(s) \in \mathbb{R}^{V \times N}$. Then a topK and a softmax are applied column-wise to select the $K$ most influential policies and obtain the normalized weights $\omega = \mathrm{softmax}(\mathrm{topK}(G(s))) \in \mathbb{R}^{V \times N}$. Next, we calculate the output sequences of all policies, $t = \{\pi_i(s)\}_{i=1}^{N} \in \mathbb{R}^{N \times V \times D}$. The mixed policy output is then the weighted mean of these sequences, expressed as $\tilde{a} = \sum_{i=1}^{N} (\omega)_i \cdot t_i \in \mathbb{R}^{V \times D}$. Finally, the action is obtained via the action decoder $\mathrm{dec}: \mathbb{R}^{V \times D} \to \mathbb{R}^{A}$. In one equation, the joint policy $s \to a$ is formulated by
$$a = \Pi_{\text{MoP}}(s; \theta_d, \theta_r, \theta_p) := \mathrm{dec}_{\theta_d}\!\left(\sum_{i=1}^{N} \big[\mathrm{softmax}\big(\mathrm{topK}(G_{\theta_r}(s))\big)\big]_i \cdot \pi_i\big(s; \theta_p^{(i)}\big)\right), \tag{6}$$
where $\theta_d$, $\theta_r$, $\theta_p$ are respectively the parameters of the action decoder, the router, and the policies, and $\Pi_{\text{MoP}}$ represents the mixed policy for the joint decision from both the distilled policy and the online adaptation policy. In practice, we design the policies as full-attention transformers and the decoder $\mathrm{dec}_{\theta_d}$ as a ViT-style transformer, taking $1 + V$ tokens as the input sequence, where the first token is an extra learnable token. The extra token's embedding is decoded as the predicted action. For the detailed implementation of the MoP policies and the action decoder, please refer to Appendix C.
Remark 1. One simple alternative to joint action prediction is to mix actions instead of policies. Regarding the $N$ policies as $\mathrm{mlp}_{1...N}: \mathbb{R}^{VF} \to \mathbb{R}^{A}$, the final action $a \in \mathbb{R}^{A}$ is then the merge of their respective decisions, i.e., $a = \mathrm{merge}(\mathrm{mlp}_{1...N}(\mathrm{flat}(s)); w_{1...N})$, where $\mathrm{flat}: \mathbb{R}^{V \times F} \to \mathbb{R}^{VF}$. However, this approach presents several problems. Q1: The weights $w_{1...N}$, despite being learnable, are fixed vehicle-wise; this means all vehicles in policy 1 share the same $w_1$ for action prediction, and likewise for the other policies. Q2: The weights are independent of the observations, so they do not generalize to constantly varying environments. Q3: The merge is not a sparse selection, meaning all policy candidates are selected for the action decision, resulting in a loss of computational efficiency as $N$ increases. Using Eq. (6), since $\omega \in \mathbb{R}^{N \times V}$, each policy has $V$ different weights, one per vehicle; this resolves Q1. Meanwhile, since $\omega$ is generated from the routed state $G(s)$, which adapts to varying states, this resolves Q2. Q3 is also resolved through topK: when $K < N$, the selection over policies is sparse. A minimal sketch of this routing is given after this remark.
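To make the routing in Eq. (6) concrete, the following is a minimal PyTorch sketch of the MoP module for arbitrary $N$ and $K$ (module names, the masking used to realise topK, and the latent dimension are our own choices; the actual architecture is detailed in Appendix C).

```python
import torch
import torch.nn as nn

class MixtureOfPolicies(nn.Module):
    """Eq. (6): route V vehicle tokens to N token-level policies, mix them, then decode."""

    def __init__(self, policies, decoder, feat_dim, top_k=2):
        super().__init__()
        self.policies = nn.ModuleList(policies)   # each policy: (B, V, F) -> (B, V, D)
        self.decoder = decoder                    # action decoder: (B, V, D) -> (B, A)
        self.router = nn.Linear(feat_dim, len(policies))
        self.top_k = top_k

    def forward(self, s):                         # s: (B, V, F)
        logits = self.router(s)                   # routing scores G(s): (B, V, N)
        if self.top_k < logits.size(-1):          # sparse selection via topK
            kth = logits.topk(self.top_k, dim=-1).values[..., -1:]
            logits = logits.masked_fill(logits < kth, float("-inf"))
        w = torch.softmax(logits, dim=-1)         # omega: per-vehicle policy weights
        tokens = torch.stack([p(s) for p in self.policies], dim=-1)  # (B, V, D, N)
        mixed = (tokens * w.unsqueeze(2)).sum(dim=-1)                # weighted mix: (B, V, D)
        return self.decoder(mixed)                # predicted action
```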
3.3.2 Online Adaptation to Offline Policy with Mixture-of-Policies
Our employment of Eq. (6) is illustrated in Fig. 2 as a special case where $N = K = 2$. The two policies come from two sources: the distilled language-model knowledge (LLM-distill policy, $\pi_1 = \pi_{\text{distil}}$) and the online environment interaction (online adapter policy, $\pi_2 = \pi_{\text{adapt}}$). The robust distillation principle (Eq. (5)) is integrated into the offline distillation phase (Fig. 2(b)). The objectives for the Q-network and the joint-policy update are, respectively,
$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; J_{\text{robust}}\big(Q,\; \Pi_{\text{MoP}}(\cdot \mid \cdot\,; \theta_d, \theta_r, \theta_p),\; \mathcal{D}_{\text{off}}\big), \qquad \text{(Robust Offline Q-Learning)} \tag{7}$$
$$\hat{\Pi}^{k+1}_{\text{MoP}} \leftarrow \arg\max_{\theta_d,\, \theta_r,\, \theta_p^{(1)}}\; \mathbb{E}_{s \sim \mathcal{D}_{\text{off}},\, a \sim \Pi^{k}_{\text{MoP}}(\cdot \mid s;\, \theta_d, \theta_r, \theta_p)}\big[\hat{Q}^{k+1}(s, a)\big]. \qquad \text{(LLM Policy Improvement)} \tag{8}$$
The parameters $\theta_p^{(2)}$ of $\pi_2$ ($\pi_{\text{adapt}}$) are frozen during policy improvement (Eq. (8)), since LLM knowledge should only be distilled into $\pi_1$ ($\pi_{\text{distil}}$). With an arbitrary RL algorithm, the online adaptation phase objective can be expressed as
$$\hat{\Pi}^{*}_{\text{MoP}} = \arg\max_{\theta_d,\, \theta_r,\, \theta_p^{(2)}}\; \mathbb{E}_{s \sim \mathcal{D}_{\text{on}},\, a \sim \Pi_{\text{MoP}}(\cdot \mid s;\, \theta_d, \theta_r, \theta_p)}\big[Q(s, a)\big]. \qquad \text{(Adapter Policy Optimization)} \tag{9}$$
In Eq. (9), the learned policy interacts with the environment, rolling out $(s, a, r)$ tuples. With the distilled LLM knowledge frozen in $\pi_{\text{distil}}$, parameterized by $\theta_p^{(1)}$, we fine-tune $\pi_{\text{adapt}}$, parameterized by $\theta_p^{(2)}$, to adapt the MoP policy $\Pi_{\text{MoP}}$ with the LLM prior to the actual RL environment.
3.3.3 Zero Gating Adapter Policy
To eliminate the influence of πadapt on πdistil during the offline phase, we adopt the concept of
zero gating [29] for initializing the policy. Specifically, the router network’s corresponding expert
weights for πadapt are masked with a trainable zero vector. As a result, the MoP token in phase 2
only considers the impact of πdistil(s). During phase 3, we enable the training of zero gates, allowing
adaptation tokens to progressively inject newly acquired online signals into the MoP policy ΠMoP.
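A possible reading of this zero-gating scheme is sketched below: the adapter expert's routing weight is multiplied by a scalar gate initialised to zero, so that Phase 2 sees only $\pi_{\text{distil}}$, and the gate is unfrozen in Phase 3. The exact parameterisation in RAPID (e.g., a zero vector per token rather than a scalar) may differ, and the weights are left unrenormalised here for simplicity.

```python
import torch
import torch.nn as nn

class ZeroGatedMoP(nn.Module):
    """Scale the adapter expert's routing weight by a zero-initialised, gateable scalar."""

    def __init__(self, router, adapter_idx=1):
        super().__init__()
        self.router = router                           # produces logits of shape (B, V, N)
        self.adapter_idx = adapter_idx
        self.zero_gate = nn.Parameter(torch.zeros(1))  # trainable gate, initialised to 0

    def forward(self, s):
        w = torch.softmax(self.router(s), dim=-1)      # expert weights omega: (B, V, N)
        keep = torch.ones_like(w)
        keep[..., self.adapter_idx] = 0.0              # weights of the non-adapter experts
        only_adapter = 1.0 - keep                      # selects the adapter expert
        # Phase 2: zero_gate == 0 and frozen, so the adapter weight is exactly zero.
        # Phase 3: zero_gate is trained, progressively injecting the online adapter.
        return w * keep + w * only_adapter * self.zero_gate

# Phase 2: gated.zero_gate.requires_grad_(False); Phase 3: gated.zero_gate.requires_grad_(True)
```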
4 Experiments
4.1 Experiment Setting
We conduct experiments using the HighwayEnv simulation environment [22]. We consider three driving scenarios with increasing levels of complexity: LANE-3-DENSITY-2, LANE-4-DENSITY-2.5, and LANE-5-DENSITY-3; detailed task descriptions are delegated to Appendix D. This paper constructs three types of datasets: the random $\mathcal{D}_{\text{rand}}$, the LLM-collected $\mathcal{D}_{\text{LLM}}$, and the combined dataset $\mathcal{D}^{*}_{\text{off}}$. The construction details and the optimal mixing ratio for $\mathcal{D}^{*}_{\text{off}}$ are discussed in Sec. 4.2. We compare RAPID's performance in the offline phase with several state-of-the-art RL methods: DQN [30], DDQN [31], and CQL [20]. In the online phase, we employ the DQN algorithm as the adaptation method under our RAPID framework. To validate the efficacy of the $J_{\text{robust}}$ term in Eq. (5), we utilize four different attack methods, Uniform, Gaussian, FGSM, and PGD, to evaluate the robustness of the distilled policy. All baselines are implemented with the d3rlpy library [32]; the hyperparameter settings are reported in Appendix E. Each method is trained for a total of 10K training iterations, with evaluations performed every 1K iterations. We employ the same reward function as defined in highway-env-rewards².
²rewards: highway-env.farama.org/rewards
4.2 Fusing LLM Generated Dataset (Phase 1)
Figure 3: Cumulative reward over different LLM training dataset ratios p (%) under the offline training framework in the HIGHWAY-FAST environment, for DDQN and CQL (Conventional vs. RAPID). The result is averaged over 5 random trials.
To validate the efficacy of the LLM-collected dataset $\mathcal{D}_{\text{LLM}}$, we build the offline dataset by combining it with a random dataset $\mathcal{D}_{\text{rand}}$. We evaluate the cumulative rewards on $\mathcal{D}_{\text{off}} = \mathrm{sample}(\{\mathcal{D}_{\text{LLM}}, \mathcal{D}_{\text{rand}}\}; \{p, 1-p\})$ with two offline algorithms, DDQN [31] and CQL [20], to find the best ratio $p$. To clarify, $\mathcal{D}_{\text{rand}}$ is sampled using a random behavioural policy, serving as a baseline for data collection, while $\mathcal{D}_{\text{LLM}}$ is sampled via the pre-trained LLM agent $\pi_{\text{LLM}}$. We gather 3K transitions using $\pi_{\text{LLM}}$ for each trial; therefore, both the $\mathcal{D}_{\text{rand}}$ and $\mathcal{D}_{\text{LLM}}$ datasets contain 15K transitions.
According to Fig. 3, the pure $\mathcal{D}_{\text{rand}}$ with $p = 0\%$ cannot support a well-trained offline policy, while utilizing only $\mathcal{D}_{\text{LLM}}$ can also harm the performance. However, augmenting offline datasets with partial LLM-generated data can significantly boost the policy's cumulative reward. With such evidence, we choose $p = 25\%$, within the 12.5%-50% sweet spot, and define the resulting dataset as our final $\mathcal{D}^{*}_{\text{off}}$.
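Operationally, building the mixed dataset amounts to subsampling the two buffers at ratio $p$; a minimal sketch is given below (transition-level sampling and the fixed target size are our simplification of the procedure described above).

```python
import random

def mix_datasets(d_llm, d_rand, p=0.25, size=15000, seed=0):
    """Build D_off by drawing a fraction p from D_LLM and (1 - p) from D_rand."""
    rng = random.Random(seed)
    n_llm = int(p * size)
    mixed = rng.sample(d_llm, n_llm) + rng.sample(d_rand, size - n_llm)
    rng.shuffle(mixed)
    return mixed

# Example: the final offline dataset used in Phase 2.
# d_off_star = mix_datasets(d_llm, d_rand, p=0.25)
```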
4.3 Offline LLM Knowledge Distillation (Phase 2)
Table 1: The cumulative rewards in Phase 2, averaged over 3 seeds. RAPID (with and without J_robust) only enables the π_distil part (as depicted in Fig. 2) in offline training. RAPID (w/o J_robust) distils the LLM knowledge with Eq. (4), while RAPID uses the robust distillation of Eq. (5). Blue/orange indicate the first/second highest return for each buffer; bold denotes the best within each environment.

| Environment | Dataset | CQL | DQN | DDQN | RAPID (w/o J_robust) | RAPID |
|---|---|---|---|---|---|---|
| LANE-3-DENSITY-2 | D_rand | 3.01±1.49 | 4.40±1.57 | 4.86±1.58 | 3.47±1.45 | 3.66±1.58 |
| LANE-3-DENSITY-2 | D_LLM | 7.96±1.63 | 9.14±8.80 | 10.83±4.94 | 12.22±7.03 | 14.18±2.43 |
| LANE-3-DENSITY-2 | D*_off | 13.06±1.11 | 8.66±6.93 | 12.22±7.03 | 12.72±0.59 | 15.33±5.30 |
| LANE-4-DENSITY-2.5 | D_rand | 3.84±2.59 | 2.41±0.92 | 1.96±0.52 | 2.18±0.59 | 3.60±2.67 |
| LANE-4-DENSITY-2.5 | D_LLM | 2.56±0.90 | 3.99±1.18 | 5.82±6.16 | 3.27±2.05 | 4.60±2.69 |
| LANE-4-DENSITY-2.5 | D*_off | 7.36±0.60 | 7.14±6.12 | 7.61±6.76 | 10.19±1.30 | 10.29±0.20 |
| LANE-5-DENSITY-3 | D_rand | 1.53±0.29 | 5.59±2.84 | 3.72±2.85 | 3.13±0.36 | 2.16±0.36 |
| LANE-5-DENSITY-3 | D_LLM | 2.23±0.29 | 5.54±7.98 | 4.76±1.89 | 5.26±4.61 | 1.45±2.11 |
| LANE-5-DENSITY-3 | D*_off | 5.58±3.93 | 5.99±4.10 | 5.05±2.06 | 5.38±0.63 | 6.14±2.85 |
Phase 2 only uses $\pi_{\text{distil}}$ for offline training, as described in Sec. 3.3. We evaluate RAPID on three environments, LANE-3-DENSITY-2, LANE-4-DENSITY-2.5, and LANE-5-DENSITY-3, with the three different types of datasets $\mathcal{D}_{\text{rand}}$, $\mathcal{D}_{\text{LLM}}$, and $\mathcal{D}^{*}_{\text{off}}$ described in Sec. 4.2. In particular, we collect $\mathcal{D}^{*}_{\text{off}}$ with LANE-3-DENSITY-2 only, and apply this dataset to the trials LANE-3-DENSITY-2-$\mathcal{D}^{*}_{\text{off}}$, LANE-4-DENSITY-2.5-$\mathcal{D}^{*}_{\text{off}}$, and LANE-5-DENSITY-3-$\mathcal{D}^{*}_{\text{off}}$.
We present the offline results in Tab. 1 and observe the following: (1) With $\mathcal{D}_{\text{LLM}}$, policies generally exhibit better offline performance than with $\mathcal{D}_{\text{rand}}$, while $\mathcal{D}^{*}_{\text{off}}$, as a mixture of the two, improves upon both the randomly- and LLM-generated datasets; this again confirms our conclusion in Sec. 4.2. (2) Our approach, RAPID, consistently outperforms conventional methods, with RAPID using the mixed $\mathcal{D}^{*}_{\text{off}}$ achieving the highest rewards overall. (3) $J_{\text{robust}}$ does not impact the clean performance in the offline training phase. (4) $\mathcal{D}^{*}_{\text{off}}$ collected from LANE-3-DENSITY-2 achieves better performance across all tasks, indicating that the LLM-generated dataset contains general knowledge applicable to different tasks with the same state and action spaces.
Figure 4: Performance of online adaptation (Phase 3), measured in cumulative rewards over training iterations, across three environments: (a) LANE-3-DENSITY-2, (b) LANE-4-DENSITY-2.5, and (c) LANE-5-DENSITY-3, comparing Online RL, Offline RL, RAPID, and the LLM policy. Before the 5K-iteration mark, we pre-train the π_distil policy of RAPID using the method described in Phase 2. Note that π_distil stays frozen in Phase 3. We report the average performance over 5 random trials.
4.4 Online Adaptation Performance (Phase 3)
To evaluate the online adaptation ability of RAPID, we employ the pre-trained $\pi_{\text{distil}}$ from Phase 2 (with the $\mathcal{D}^{*}_{\text{off}}$ collected from LANE-3-DENSITY-2), and then train $\pi_{\text{adapt}}$ and its zero gate (as depicted in Fig. 2(c)) by interacting with different online environments. We compare it with vanilla Online RL (DQN) and Offline RL (DQN) that do not use the RAPID MoP policy. We train the RAPID policy for 5K training epochs during the offline phase, followed by 10K online epochs; the Online DQN starts from the 5K-th epoch.
As depicted in Fig. 4: (1) The cumulative rewards for $\pi_{\text{distil}}$ are sufficiently high, due to the common-sense knowledge and reasoning abilities introduced from $\pi_{\text{LLM}}$, whereas conventional offline RL cannot perform well. (2) With the offline phase pre-trained on LANE-3-DENSITY-2, online adaptation with the RAPID MoP on the same environment, LANE-3-DENSITY-2 (Fig. 4(a)), achieves significantly higher rewards. This demonstrates the necessity of online adaptation for generalizing knowledge to practical application. (3) We further conduct zero-shot adaptation on the offline-unseen environments LANE-4-DENSITY-2.5 and LANE-5-DENSITY-3. In Fig. 4(b-c), we observe that RAPID achieves competitive performance not only against the vanilla online approach but also toward the large $\pi_{\text{LLM}}$. This highlights the efficacy of the RAPID framework in task adaptation.
4.5 Robust Distillation Performance
We evaluate the robustness of multiple distillation algorithms using four different attack methods: Uniform, Gaussian, FGSM, and PGD. Specifically, we employ a 10-step PGD with a designated step size of 0.01. For both the FGSM and PGD attacks, the attack radius $\epsilon$ is set to 0.1; for the Uniform and Gaussian attacks, $\epsilon$ is set to 0.2. The observation is normalized before the attack and then denormalized for standard RL policy understanding. In RAPID, we set $\beta$ to 0.5 to balance the robustness distillation term according to Eq. (5).
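The four observation attacks can be sketched as follows, reconstructed from the hyperparameters stated above; the exact loss driving the gradient attacks and the $\ell_\infty$ clipping are our assumptions, and the normalisation/denormalisation of the observation is omitted.

```python
import torch
import torch.nn.functional as F

def attack_observation(q_net, s, kind="pgd", eps=0.1, steps=10, step_size=0.01):
    """Return a perturbed (normalised) observation for the given attack type."""
    if kind == "uniform":
        return s + torch.empty_like(s).uniform_(-0.2, 0.2)
    if kind == "gaussian":
        return s + 0.2 * torch.randn_like(s)

    # FGSM / PGD: maximise the loss w.r.t. the currently greedy action.
    a_clean = q_net(s).argmax(dim=-1).detach()
    delta = torch.zeros_like(s, requires_grad=True)
    n_iters, alpha = (1, eps) if kind == "fgsm" else (steps, step_size)
    for _ in range(n_iters):
        loss = F.cross_entropy(q_net(s + delta), a_clean)
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += alpha * grad.sign()
            delta.clamp_(-eps, eps)          # stay within the attack radius
    return (s + delta).detach()
```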
As illustrated in Tab. 2, the conventional methods, Online DQN and Offline DQN, are not able to effectively defend against strong adversarial attacks like FGSM and PGD. Although Offline DQN is trained on the dataset $\mathcal{D}^{*}_{\text{off}}$, which is partially collected by the LLM policy $\pi_{\text{LLM}}$, it still struggles to maintain performance under these attacks. RAPID (w/o $J_{\text{robust}}$), which undergoes offline training and online adaptation without using $J_{\text{robust}}$, also performs weakly when facing strong adversarial attacks. In contrast, the full RAPID method with $J_{\text{robust}}$ demonstrates superior robustness against various adversarial attacks across all three environments. We note that $J_{\text{robust}}$ may approximate the robustness of the LLM by building the robust soft label in Eq. (5). Overall, the $J_{\text{robust}}$ regularizer plays a crucial role in bolstering the robustness of the model by effectively utilizing the robust knowledge obtained from the LLM-based teacher.
4.6 Ablation Study
MoP Routing Analysis. In Fig. 5, we visualize the contribution of each policy ($\pi_{\text{distil}}$ and $\pi_{\text{adapt}}$) to the final predicted action at each vehicle position after online adaptation.
Table 2: Adversarial performance against observation perturbation, averaged over 5 trials. Offline DQN, RAPID, and RAPID (w/o J_robust) are trained on the dataset D*_off. RAPID and RAPID (w/o J_robust) undergo offline training and online adaptation, while Offline DQN only goes through the offline training process. Columns report the return under each attack. Bold indicates the best return against observation perturbation within each environment.

| Environment | Method | Clean | Uniform | Gaussian | FGSM | PGD |
|---|---|---|---|---|---|---|
| LANE-3-DENSITY-2 | Online DQN | 12.48±8.56 | 11.74±7.75 | 10.72±8.27 | 2.09±1.62 | 1.09±1.32 |
| LANE-3-DENSITY-2 | Offline DQN | 8.66±6.93 | 6.73±2.04 | 6.29±1.28 | 2.12±0.23 | 1.73±1.04 |
| LANE-3-DENSITY-2 | RAPID (w/o J_robust) | 19.79±3.42 | 17.01±4.30 | 15.86±5.21 | 3.40±1.26 | 1.26±1.02 |
| LANE-3-DENSITY-2 | RAPID | 20.42±2.59 | 20.41±1.58 | 20.23±1.26 | 15.34±2.97 | 14.77±1.23 |
| LANE-4-DENSITY-2.5 | Online DQN | 11.58±0.56 | 10.49±4.52 | 10.18±4.72 | 6.03±0.86 | 6.12±0.78 |
| LANE-4-DENSITY-2.5 | Offline DQN | 7.14±6.12 | 5.20±1.47 | 6.56±1.06 | 2.56±1.06 | 2.20±1.47 |
| LANE-4-DENSITY-2.5 | RAPID (w/o J_robust) | 13.79±3.48 | 12.87±1.43 | 11.61±5.63 | 5.53±3.27 | 3.26±1.24 |
| LANE-4-DENSITY-2.5 | RAPID | 14.34±2.84 | 13.63±5.26 | 13.91±7.84 | 10.87±2.68 | 8.39±2.11 |
| LANE-5-DENSITY-3 | Online DQN | 6.05±9.41 | 6.64±1.46 | 5.48±7.67 | 1.55±1.10 | 1.54±1.11 |
| LANE-5-DENSITY-3 | Offline DQN | 5.99±4.10 | 2.55±2.10 | 2.22±1.99 | 1.22±1.03 | 0.88±0.58 |
| LANE-5-DENSITY-3 | RAPID (w/o J_robust) | 8.47±2.64 | 4.28±1.44 | 3.20±2.06 | 0.34±0.24 | 0.79±0.33 |
| LANE-5-DENSITY-3 | RAPID | 7.83±3.41 | 5.14±3.19 | 5.22±2.68 | 2.96±0.76 | 1.62±0.48 |
Figure 5: Visualization of the contribution of π_distil and π_adapt to the final predicted action for each vehicle (0-11). The example is randomly sampled by interacting with LANE-4-DENSITY-2.5 using Π_MoP. More samples are shown in Appendix B.4.
In general, $\pi_{\text{distil}}$ dominates the contribution across all vehicles, as $\pi_{\text{adapt}}$ is initially masked out during offline training. However, after the online phase, $\pi_{\text{adapt}}$ provides a more balanced contribution, especially for the ego vehicle and nearby vehicles (such as vehicles 0, 2, and 4 in Fig. 5), indicating the effectiveness of online adaptation. Interestingly, $\pi_{\text{distil}}$ still contributes substantially for distant vehicles, suggesting the value of LLM-distilled knowledge in understanding the overall traffic context. This demonstrates the efficacy of the MoP approach in integrating knowledge from both the LLM-based teacher and the environment interaction, leveraging the most relevant expertise for each vehicle based on its domain and relative position.
Additional Ablation Studies. Please refer to Appendix B.
5 Conclusion
We propose RAPID, a promising approach for leveraging LLMs’ reasoning abilities and common
sense knowledge to enhance the performance of RL agents in heterogeneous autonomous driving
tasks. Meanwhile, RAPID overcomes challenges such as the long inference time of LLMs and
policy knowledge overwriting. The robust knowledge distillation method enables the student RL
policy to inherit the robustness of the LLM teacher.
Limitation and future work. Although we have conducted tests in three distinct autonomous driv-
ing environments and validated the closed-loop RL policy in real-time, the scope of the analysis is
restricted. The 2D HighwayEnv remains overly simplistic for comprehensive autonomous driving
evaluation. To further establish the efficacy of our online adaptation policy, it is necessary to assess
its performance with Visual Language Models [33,34] in more realistic and complex autonomous
driving environments, like CARLA [35].
References
[1] X. Tian, J. Gu, B. Li, Y. Liu, C. Hu, Y. Wang, K. Zhan, P. Jia, X. Lang, and H. Zhao. Drivevlm:
The convergence of autonomous driving and large vision-language models. arXiv preprint
arXiv:2402.12289, 2024.
[2] Y. Jin, X. Shen, H. Peng, X. Liu, J. Qin, J. Li, J. Xie, P. Gao, G. Zhou, and J. Gong. Surre-
aldriver: Designing generative driver agent simulation framework in urban contexts based on
large language model. arXiv preprint arXiv:2309.13193, 2023.
[3] B. Jin, X. Liu, Y. Zheng, P. Li, H. Zhao, T. Zhang, Y. Zheng, G. Zhou, and J. Liu. Adapt:
Action-aware driving caption transformer. In 2023 IEEE International Conference on Robotics
and Automation (ICRA), pages 7554–7561. IEEE, 2023.
[4] D. Fu, X. Li, L. Wen, M. Dou, P. Cai, B. Shi, and Y. Qiao. Drive like a human: Rethinking
autonomous driving with large language models. arXiv preprint arXiv:2307.07162, 2023.
[5] L. Wen, D. Fu, X. Li, X. Cai, T. Ma, P. Cai, M. Dou, B. Shi, L. He, and Y. Qiao. Dilu: A
knowledge-driven approach to autonomous driving with large language models. arXiv preprint
arXiv:2309.16292, 2023.
[6] L. Chen, O. Sinavski, J. Hünermann, A. Karnsund, A. J. Willmott, D. Birch, D. Maund, and
J. Shotton. Driving with llms: Fusing object-level vector modality for explainable autonomous
driving. arXiv preprint arXiv:2310.01957, 2023.
[7] T. Carta, C. Romac, T. Wolf, S. Lamprier, O. Sigaud, and P.-Y. Oudeyer. Grounding large lan-
guage models in interactive environments with online reinforcement learning. In International
Conference on Machine Learning, pages 3676–3713. PMLR, 2023.
[8] B. Huang, M. Chen, Y. Wang, J. Lu, M. Cheng, and W. Wang. Boosting accuracy and robust-
ness of student models via adaptive adversarial distillation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pages 24668–24677, June
2023.
[9] B. Zi, S. Zhao, X. Ma, and Y.-G. Jiang. Revisiting adversarial robustness distillation: Robust
soft labels make student better. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, pages 16443–16452, 2021.
[10] M. Goldblum, L. Fowl, S. Feizi, and T. Goldstein. Adversarially robust distillation. In Pro-
ceedings of the AAAI conference on artificial intelligence, volume 34, pages 3996–4003, 2020.
[11] J. Zhu, J. Yao, B. Han, J. Zhang, T. Liu, G. Niu, J. Zhou, J. Xu, and H. Yang. Reliable
adversarial distillation with unreliable teachers. arXiv preprint arXiv:2106.04928, 2021.
[12] R. Shi, Y. Liu, Y. Ze, S. S. Du, and H. Xu. Unleashing the power of pre-trained language
models for offline reinforcement learning. arXiv preprint arXiv:2310.20587, 2023.
[13] Z. Zhou, B. Hu, P. Zhang, C. Zhao, and B. Liu. Large language model is a good policy teacher
for training reinforcement learning agents. arXiv preprint arXiv:2311.13373, 2023.
[14] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch. Language models as zero-shot planners:
Extracting actionable knowledge for embodied agents, 2022.
[15] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le.
Finetuned language models are zero-shot learners, 2022.
[16] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances
in neural information processing systems, 33:1877–1901, 2020.
[17] Z. Wang, X. Pan, D. Yu, D. Yu, J. Chen, and H. Ji. Zemi: Learning zero-shot semi-parametric
language models from multiple tasks. arXiv preprint arXiv:2210.00185, 2022.
[18] Y. Gu, P. Ke, X. Zhu, and M. Huang. Learning instructions with unlabeled data for zero-shot
cross-task generalization. arXiv preprint arXiv:2210.09175, 2022.
[19] A. Kumar, J. Fu, M. Soh, G. Tucker, and S. Levine. Stabilizing off-policy q-learning via
bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019.
[20] A. Kumar, A. Zhou, G. Tucker, and S. Levine. Conservative q-learning for offline reinforce-
ment learning, 2020.
[21] K. Schweighofer, M.-c. Dinu, A. Radler, M. Hofmarcher, V. P. Patil, A. Bitto-Nemling,
H. Eghbal-zadeh, and S. Hochreiter. A dataset perspective on offline reinforcement learning.
In Conference on Lifelong Learning Agents, pages 470–517. PMLR, 2022.
[22] E. Leurent. An environment for autonomous driving decision-making. https://github.
com/eleurent/highway-env, 2018.
[23] OpenAI. Introducing chatgpt. https://openai.com/index/chatgpt/, 2023.
[24] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing
reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
[25] R. Yang, C. Bai, X. Ma, Z. Wang, C. Zhang, and L. Han. Rorl: Robust offline reinforcement
learning via conservative smoothing. Advances in Neural Information Processing Systems, 35:
23851–23866, 2022.
[26] A. Clark, D. de Las Casas, A. Guy, A. Mensch, M. Paganini, J. Hoffmann, B. Damoc, B. Hecht-
man, T. Cai, S. Borgeaud, et al. Unified scaling laws for routed language models. In Interna-
tional conference on machine learning, pages 4057–4086. PMLR, 2022.
[27] H. Hazimeh, Z. Zhao, A. Chowdhery, M. Sathiamoorthy, Y. Chen, R. Mazumder, L. Hong,
and E. Chi. Dselect-k: Differentiable selection in the mixture of experts with applications to
multi-task learning. Advances in Neural Information Processing Systems, 34:29335–29347,
2021.
[28] Y. Zhou, T. Lei, H. Liu, N. Du, Y. Huang, V. Zhao, A. M. Dai, Q. V. Le, J. Laudon, et al.
Mixture-of-experts with expert choice routing. Advances in Neural Information Processing
Systems, 35:7103–7114, 2022.
[29] R. Zhang, J. Han, C. Liu, P. Gao, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, and Y. Qiao. Llama-
adapter: Efficient fine-tuning of language models with zero-init attention, 2023.
[30] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep rein-
forcement learning. nature, 518(7540):529–533, 2015.
[31] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double q-learning.
In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016.
[32] T. Seno and M. Imai. d3rlpy: An offline deep reinforcement learning library. The Journal of
Machine Learning Research, 23(1):14205–14224, 2022.
[33] C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, P. Luo, A. Geiger, and H. Li. Drivelm:
Driving with graph visual question answering. arXiv preprint arXiv:2312.14150, 2023.
[34] W. Wang, J. Xie, C. Hu, H. Zou, J. Fan, W. Tong, Y. Wen, S. Wu, H. Deng, Z. Li, et al.
Drivemlm: Aligning multi-modal large language models with behavioral planning states for
autonomous driving. arXiv preprint arXiv:2312.09245, 2023.
[35] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. Carla: An open urban driving
simulator. In Conference on robot learning, pages 1–16. PMLR, 2017.
[36] S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review,
and perspectives on open problems, 2020.
[37] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller.
Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[38] X. Yin, S. Wu, J. Liu, M. Fang, X. Zhao, X. Huang, and W. Ruan. Rerogcrl: Representation-
based robustness in goal-conditioned reinforcement learning, 2023.
[39] S. Fujimoto, D. Meger, and D. Precup. Off-policy deep reinforcement learning without explo-
ration, 2019.
[40] I. Kostrikov, A. Nair, and S. Levine. Offline reinforcement learning with implicit q-learning. In
International Conference on Learning Representations, 2022. URL https://openreview.
net/forum?id=68n2s9ZJWF8.
[41] S. Fujimoto and S. S. Gu. A minimalist approach to offline reinforcement learning, 2021.
[42] Y. Wu, S. Zhai, N. Srivastava, J. M. Susskind, J. Zhang, R. Salakhutdinov, and H. Goh. Un-
certainty weighted offline reinforcement learning, 2021. URL https://openreview.net/
forum?id=7hMenh--8g.
[43] Y. Gu, L. Dong, F. Wei, and M. Huang. Knowledge distillation of large language models, 2023.
[44] C.-Y. Hsieh, C.-L. Li, C.-K. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C.-Y. Lee, and
T. Pfister. Distilling step-by-step! outperforming larger language models with less training
data and smaller model sizes. arXiv preprint arXiv:2305.02301, 2023.
[45] OpenAI. GPT-4 technical report. https://arxiv.org/abs/2303.08774, 2023.
[46] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra,
P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv
preprint arXiv:2307.09288, 2023.
[47] A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang. Language agent
tree search unifies reasoning acting and planning in language models. arXiv preprint
arXiv:2310.04406, 2023.
[48] B. Y. Lin, Y. Fu, K. Yang, P. Ammanabrolu, F. Brahman, S. Huang, C. Bhagavatula, Y. Choi,
and X. Ren. Swiftsage: A generative agent with fast and slow thinking for complex interactive
tasks. arXiv preprint arXiv:2305.17390, 2023.
[49] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su. Mind2web:
Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070, 2023.
[50] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, Y. Bisk, D. Fried, U. Alon,
et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint
arXiv:2307.13854, 2023.
[51] K. Nottingham, P. Ammanabrolu, A. Suhr, Y. Choi, H. Hajishirzi, S. Singh, and R. Fox. Do em-
bodied agents dream of pixelated sheep?: Embodied decision making using language guided
world modelling. arXiv preprint arXiv:2301.12050, 2023.
[52] C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, and Y. Su. Llm-planner: Few-
shot grounded planning for embodied agents with large language models. In Proceedings of
the IEEE/CVF International Conference on Computer Vision, pages 2998–3009, 2023.
[53] Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang. Describe, explain, plan and select: Interactive
planning with large language models enables open-world multi-task agents. arXiv preprint
arXiv:2302.01560, 2023.
[54] Wayve. Lingo-1: Exploring natural language for autonomous driving. https://wayve.ai/
thinking/lingo-natural-language-autonomous-driving/, 2023.
[55] Z. Yang, X. Jia, H. Li, and J. Yan. A survey of large language models for autonomous driving.
arXiv preprint arXiv:2311.01043, 2023.
[56] Y. Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot. Specializing smaller language models
towards multi-step reasoning. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and
J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning,
volume 202 of Proceedings of Machine Learning Research, pages 10421–10430. PMLR, 23–
29 Jul 2023. URL https://proceedings.mlr.press/v202/fu23d.html.
[57] L. Beyer, X. Zhai, A. Royer, L. Markeeva, R. Anil, and A. Kolesnikov. Knowledge distillation:
A good teacher is patient and consistent. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 10925–10934, 2022.
[58] P. West, C. Bhagavatula, J. Hessel, J. Hwang, L. Jiang, R. Le Bras, X. Lu, S. Welleck,
and Y. Choi. Symbolic knowledge distillation: from general language models to common-
sense models. In M. Carpuat, M.-C. de Marneffe, and I. V. Meza Ruiz, editors, Proceed-
ings of the 2022 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, pages 4602–4625, Seattle, United States,
July 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.naacl-main.341.
URL https://aclanthology.org/2022.naacl-main.341.
[59] F. Iliopoulos, V. Kontonis, C. Baykal, G. Menghani, K. Trinh, and E. Vee. Weighted distillation
with unlabeled examples. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances
in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?
id=M34VHvEU4NZ.
[60] R. Smith, J. A. Fries, B. Hancock, and S. H. Bach. Language models in the loop: Incorporating
prompting into weak supervision. arXiv preprint arXiv:2205.02318, 2022.
[61] S. Wang, Y. Liu, Y. Xu, C. Zhu, and M. Zeng. Want to reduce labeling cost? GPT-3 can
help. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Findings of the Associ-
ation for Computational Linguistics: EMNLP 2021, pages 4195–4205, Punta Cana, Domini-
can Republic, Nov. 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.
findings-emnlp.354. URL https://aclanthology.org/2021.findings-emnlp.354.
[62] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polo-
sukhin. Attention is all you need. Advances in neural information processing systems, 30,
2017.
A Related Work
A.1 Offline RL for Autonomous Driving
Offline RL algorithms are designed to learn policies from a static dataset, eliminating the need
for interaction with the real environment [36]. Compared to conventional online RL [37] and the
extended goal-conditioned RL [38], this approach is especially beneficial in scenarios where such
interaction is prohibitively expensive or risky, such as in autonomous driving. Offline RL algorithms
have demonstrated the capability to surpass expert-level performance [39,20,40]. In general, these
algorithms employ policy regularization [39,41] and out-of-distribution (OOD) penalization [20,42]
as strategies to prevent value overestimation. In this paper, we pioneer the use of LLM-generated data to train offline RL. Although several works [43, 44] focus on distillation for LLMs, none of them considers distilling into RL.
A.2 LLM for Autonomous Driving
Recent advancements in LLMs [16,45,46] demonstrate their powerful embodied abilities, provid-
ing the possibility to distil knowledge from humans to autonomous systems. LLMs exhibit a strong
aptitude for general reasoning [47,48], web agents [49,50], and embodied robotics [51,52,53].
Inspired by the superior common-sense capability of LLM-based agents, a substantial body of research is dedicated to LLM-based autonomous driving. Wayve [54] introduced an open-loop driving commentator called LINGO-1, which integrates vision, language, and action to enhance the interpretation and training of driving models. DiLu [5] developed a framework utilizing LLMs as agents for closed-loop driving tasks, with a memory module to record experiences. To enhance stability and generalization performance, [4] utilised the reasoning, interpretation, and memorization abilities of LLMs to enhance autonomous driving. More research on this vibrant field is summarised in [55]. However, all these methods suffer from huge resource requirements, long inference times, and unstable performance. To bridge this gap, we propose a knowledge distillation framework, from LLM to RL, which enhances the applicability and stability of real-world autonomous driving.
A.3 Distillation for LLM
Knowledge distillation has proven successful in transferring knowledge from large-scale, more
competent teacher models to small-scale student models, making them more affordable for prac-
tical applications [56,57,58]. This method facilitates learning from limited labelled data, as the
larger teacher model is commonly employed to generate a training dataset with noisy pseudo-labels
[59,60,61]. Further, [44] involves extracting rationales from LLMs as additional supervisory signals
to train small-scale models within a multi-task framework. [43] utilises reverse Kullback-Leibler di-
vergence to ensure that the student model does not overestimate the low-probability regions in the
teacher distribution. Currently, knowledge distillation from LLM to RL for autonomous driving
remains unexplored.
B Additional Experiments
B.1 Comparison between MLP and RAPID(Attentive) based Policy
Table 3: Comparison between MLP and RAPID(Attentive) based policy under offline RL training. Bold denotes the best within each environment. Results are averaged across 3 seeds.

| Environment | Dataset | MLP: CQL | MLP: DQN | MLP: DDQN | RAPID(Attentive): CQL | RAPID(Attentive): DQN | RAPID(Attentive): DDQN |
|---|---|---|---|---|---|---|---|
| LANE-3-DENSITY-2 | D_rand | 3.01±1.49 | 4.40±1.57 | 4.86±1.58 | 3.50±0.84 | 3.66±1.58 | 2.56±0.18 |
| LANE-3-DENSITY-2 | D_LLM | 7.96±1.63 | 9.14±8.80 | 10.83±4.94 | 12.81±7.51 | 14.18±2.43 | 5.65±3.13 |
| LANE-3-DENSITY-2 | D*_off | 13.06±1.11 | 8.66±6.93 | 12.22±7.03 | 12.84±0.68 | 15.33±5.30 | 13.95±1.27 |
| LANE-4-DENSITY-2.5 | D_rand | 3.84±2.59 | 2.41±0.92 | 1.96±0.52 | 6.69±3.82 | 3.60±2.67 | 1.70±0.46 |
| LANE-4-DENSITY-2.5 | D_LLM | 2.56±0.90 | 3.99±1.18 | 5.82±6.16 | 5.98±1.30 | 4.60±2.69 | 2.23±0.72 |
| LANE-4-DENSITY-2.5 | D*_off | 7.36±0.60 | 7.14±6.12 | 7.61±6.76 | 8.14±0.87 | 10.29±0.20 | 3.16±0.31 |
| LANE-5-DENSITY-3 | D_rand | 1.53±0.29 | 5.59±2.84 | 3.72±2.85 | 2.58±1.09 | 2.16±0.36 | 1.83±1.48 |
| LANE-5-DENSITY-3 | D_LLM | 2.23±0.29 | 5.54±7.98 | 4.76±1.89 | 2.28±1.65 | 1.45±2.11 | 2.33±0.66 |
| LANE-5-DENSITY-3 | D*_off | 5.58±3.93 | 5.99±4.10 | 5.05±2.06 | 3.37±1.61 | 6.14±2.85 | 3.17±0.41 |
To verify the contribution of the RAPID(Attentive) architecture (as shown in Fig. 8) to the offline training phase, we conduct extra experiments in this section. As illustrated in Tab. 3, we compare DQN, DDQN, and CQL under the MLP and RAPID architectures, respectively. As the environment complexity increases, the performance gap between RAPID and MLP narrows, suggesting that RAPID handles simpler environments more effectively. In summary, we conclude: (1) the RAPID + DQN method achieves the best performance among all methods, thus we choose DQN as the backbone of RAPID for offline training; (2) the attentive architecture demonstrates superior performance compared to the MLP, particularly in less complex environments.
B.2 Impact of Jrobust
Figure 6: Offline training J_robust loss term over training iterations under LANE-3-DENSITY-2-D*_off, for (a) LANE-3-DENSITY-2, (b) LANE-4-DENSITY-2.5, and (c) LANE-5-DENSITY-3.
Building upon the results presented in Fig. 6, we provide an analysis of the impact of $J_{\text{robust}}$ across the three environments. The experimental setup is the same as in Sec. 4.3: RAPID is trained for 10K epochs based on LANE-3-DENSITY-2-$\mathcal{D}^{*}_{\text{off}}$, LANE-4-DENSITY-2.5-$\mathcal{D}^{*}_{\text{off}}$, and LANE-5-DENSITY-3-$\mathcal{D}^{*}_{\text{off}}$. From Fig. 6, we observe that $J_{\text{robust}}$ gradually decreases and converges to 0 during training. These results align with Tab. 2, as the defense mechanism demonstrates superior performance in both LANE-3-DENSITY-2 and LANE-4-DENSITY-2.5. In the more complex LANE-5-DENSITY-3 environment, the loss remains relatively high, around 1.6, which leads to suboptimal defense performance compared to the other scenarios. This suggests that the increased complexity and vehicle density in this setting pose additional challenges for the defense mechanism. Overall, the consistency across different settings highlights the robustness and effectiveness of our approach.
B.3 Impact of β
Table 4: Adversarial performance of different β parameters against observation perturbation. Columns report the return under each attack.

| Environment | Method | Clean | Uniform | Gaussian | FGSM | PGD |
|---|---|---|---|---|---|---|
| LANE-3-DENSITY-2 | RAPID (w/o J_robust) | 19.79±3.42 | 17.01±4.30 | 15.86±5.21 | 3.40±1.26 | 1.26±1.02 |
| LANE-3-DENSITY-2 | RAPID (β = 0.1) | 20.74±3.66 | 21.22±2.95 | 20.75±3.43 | 10.97±3.23 | 10.24±2.67 |
| LANE-3-DENSITY-2 | RAPID (β = 0.5) | 20.42±2.59 | 20.41±1.58 | 20.23±1.26 | 15.34±2.97 | 14.77±1.23 |
| LANE-3-DENSITY-2 | RAPID (β = 0.8) | 18.36±3.46 | 18.23±3.98 | 17.68±2.54 | 13.51±4.14 | 11.68±4.11 |
| LANE-4-DENSITY-2.5 | RAPID (w/o J_robust) | 13.79±3.48 | 12.87±1.43 | 11.61±5.63 | 5.53±3.27 | 3.26±1.24 |
| LANE-4-DENSITY-2.5 | RAPID (β = 0.1) | 14.12±1.76 | 12.66±4.87 | 11.38±6.29 | 8.17±3.75 | 7.13±2.42 |
| LANE-4-DENSITY-2.5 | RAPID (β = 0.5) | 14.34±2.84 | 13.63±5.26 | 13.91±7.84 | 10.87±2.68 | 8.39±2.11 |
| LANE-4-DENSITY-2.5 | RAPID (β = 0.8) | 14.21±5.23 | 12.24±3.97 | 12.68±4.25 | 8.69±3.24 | 9.16±4.13 |
| LANE-5-DENSITY-3 | RAPID (w/o J_robust) | 8.47±2.64 | 4.28±1.44 | 3.20±2.06 | 0.34±0.24 | 0.79±0.33 |
| LANE-5-DENSITY-3 | RAPID (β = 0.1) | 6.98±4.84 | 6.29±4.59 | 6.21±2.96 | 1.90±1.67 | 2.19±1.23 |
| LANE-5-DENSITY-3 | RAPID (β = 0.5) | 7.83±3.41 | 5.14±3.19 | 5.22±2.68 | 2.96±0.76 | 1.62±0.48 |
| LANE-5-DENSITY-3 | RAPID (β = 0.8) | 6.17±2.19 | 4.13±4.28 | 5.77±3.56 | 2.89±0.32 | 2.67±0.89 |
To investigate the effect of different $\beta$ values (the hyperparameter in Eq. (5)), we compare $\beta \in \{0.1, 0.5, 0.8\}$ on LANE-3-DENSITY-2, LANE-4-DENSITY-2.5, and LANE-5-DENSITY-3, respectively. Based on the results presented in Tab. 4, we make several observations regarding the impact of $\beta$ on the performance of our robust distillation approach. When setting $\beta = 0.8$, we notice a decline in clean performance across all three environments; this degradation can be attributed to the increased emphasis on adversarial examples during training, which may lead to a trade-off with clean accuracy. On the other hand, when setting $\beta = 0.1$, the return under attack is relatively lower than in the other settings, suggesting that the adversarial training strength is not sufficient to provide adequate robustness against adversarial perturbations. Considering these findings, we determine that setting $\beta = 0.5$ strikes a balance between maintaining clean performance and achieving satisfactory robustness.
B.4 Visualization of π_distil and π_adapt
Figure 7: Five more random examples visualizing the contribution of π_distil and π_adapt (from the router) to the final predicted action in LANE-4-DENSITY-2.5 using Π_MoP.
C Network Implementation
Figure 8: Architecture of (a) the policy networks (transformer encoders) π_distil(·) and π_adapt(·), which tokenize the V vehicle feature rows and produce policy tokens; (b) the action decoder (transformer encoder) dec(·), which prepends an extra learnable token to the MoP policy tokens and decodes it into the predicted action; (c) the transformer backbone (encoder-only): linear projection, multi-head attention, and MLP blocks with normalization.
C.1 Transformer Encoder
We employ the same encoder-only transformer $f_{\text{TransE}}$ as [62].
C.2 Policy Networks
The policy network consists of a linear projection and a transformer encoder that takes the projected state as input. Let the state be $s \in \mathbb{R}^{V \times F}$. The policy network first projects $s \to s_v$ with a trainable linear projection, then regards $s_v$ as the input token sequence for the transformer. The transformer processes these embeddings through $L$ layers of self-attention and feedforward networks; in our case, we set $L = 2$. Mathematically, $\pi_{\text{distil}/\text{adapt}}$ computes $s_p \leftarrow f^{\text{policy}}_{\text{TransE}}(f_{\text{Proj}}(s)) \in \mathbb{R}^{V \times D}$.
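A minimal sketch of one policy network in Fig. 8(a), assuming standard nn.TransformerEncoder blocks; the token dimension and number of heads are placeholders rather than the paper's exact values.

```python
import torch.nn as nn

class PolicyEncoder(nn.Module):
    """Linear projection of vehicle features followed by L = 2 transformer encoder layers."""

    def __init__(self, feat_dim, token_dim=64, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, token_dim)   # f_Proj
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # f_TransE

    def forward(self, s):                            # s: (B, V, F)
        return self.encoder(self.proj(s))            # policy tokens: (B, V, D)
```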
C.3 Action Decoder
The action decoder transforms the encoded state representations into an action using another transformer encoder. It takes the mix-of-policy tokens from the routed policy networks as input. Given the routed MoP tokens $s_m \in \mathbb{R}^{V \times D}$, we first concatenate $s_m$ with an extra learnable token $s_e \in \mathbb{R}^{1 \times D}$, then feed $\mathrm{concat}(s_m, s_e) \in \mathbb{R}^{(V+1) \times D}$ into the transformer encoder. We regard the first output token as the action token. Essentially, $\mathrm{dec}: \mathbb{R}^{V \times D} \to \mathbb{R}^{A}$ computes $a \leftarrow f^{\text{action}}_{\text{TransE}}(\mathrm{concat}(s_m, s_e))_{0,:} \in \mathbb{R}^{1 \times A}$.
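Correspondingly, the action decoder of Fig. 8(b) can be sketched as below: a single learnable token is concatenated with the routed MoP tokens and its output embedding is mapped to the A action values (the placement of the extra token and the final linear head are our assumptions).

```python
import torch
import torch.nn as nn

class ActionDecoder(nn.Module):
    """Decode the routed MoP tokens plus one extra learnable token into action values."""

    def __init__(self, token_dim=64, n_actions=5, n_heads=4, n_layers=2):
        super().__init__()
        self.action_token = nn.Parameter(torch.zeros(1, 1, token_dim))  # s_e
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # f_TransE (action)
        self.head = nn.Linear(token_dim, n_actions)

    def forward(self, mop_tokens):                   # mop_tokens: (B, V, D)
        extra = self.action_token.expand(mop_tokens.size(0), -1, -1)
        x = torch.cat([extra, mop_tokens], dim=1)    # (B, V + 1, D)
        return self.head(self.encoder(x)[:, 0])      # action-token embedding -> R^A
```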
D Environment Details
Each of the three environments has a continuous state space and a discrete action space, and the maximum episode horizon is 30. Tab. 5 describes the configuration of the highway environment in the reinforcement learning setting. The observation configuration specifies the observation type as KinematicsObservation. This observation type represents the surrounding vehicles as a $V \times F$ array, where $V$ denotes the number of nearby vehicles and $F$ represents the size of the feature set describing each vehicle; the specific features included in the observation are listed in the features field of the configuration. The KinematicsObservation provides essential information about the neighbouring vehicles, such as their presence, positions in the x-y coordinate system, and velocities along the x and y axes. These features are represented as absolute values, independent of the agent's frame of reference, and are not subjected to any normalization. Fig. 9 illustrates the state features under the KinematicsObservation setting, on which we build the vehicle feature tokenization introduced in Sec. 3.3.
Table 5: Highway Environment Configuration

| Parameter | Value |
|---|---|
| Observation Type | KinematicsObservation |
| Observation Features | presence, x, y, vx, vy |
| Observation Absolute | True |
| Observation Normalize | False |
| Observation Vehicles Count | 15 |
| Observation See Behind | True |
| Action Type | DiscreteMetaAction |
| Action Target Speeds | np.linspace(0, 32, 9) |
| Lanes Count | 3 (4, 5) |
| Duration | 30 |
| Vehicles Density | 2 (2.5, 3) |
| Show Trajectories | True |
| Render Agent | True |
Figure 9: State features under the KinematicsObservation setting: one row of features per vehicle, from the ego vehicle to Vehicle V−1.
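For reference, the configuration in Tab. 5 can be reproduced with highway-env roughly as follows. This is a sketch: the key names follow highway-env's configuration dictionary, the lane count and vehicle density are set to one of the task variants, and, depending on the installed version, the config may instead need to be applied via env.unwrapped.configure(config) followed by env.reset().

```python
import gymnasium as gym
import numpy as np
import highway_env  # noqa: F401  (importing registers the highway environments)

config = {
    "observation": {
        "type": "Kinematics",  # KinematicsObservation
        "features": ["presence", "x", "y", "vx", "vy"],
        "vehicles_count": 15,
        "absolute": True,
        "normalize": False,
        "see_behind": True,
    },
    "action": {
        "type": "DiscreteMetaAction",
        "target_speeds": np.linspace(0, 32, 9).tolist(),
    },
    "lanes_count": 4,         # 3, 4 or 5 depending on the task variant
    "vehicles_density": 2.5,  # 2, 2.5 or 3 depending on the task variant
    "duration": 30,
    "show_trajectories": True,
    "render_agent": True,
}

env = gym.make("highway-v0", config=config)
obs, info = env.reset()
print(obs.shape)  # (15, 5): V vehicles x F features
```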
E Hyperparameters
In this work, the conventional DQN, DDQN, and CQL baselines share the same MLP network architecture. For
all experiments, the hyperparameters of our backbone architectures and algorithms are reported in
Tab. 6. Our implementation is based on the open-source d3rlpy library [32].
Table 6: Hyperparameters for conventional Offline RL methods.
Hyperparameter                   DQN       DDQN      CQL
Batch size                       32        32        64
Learning rate                    5×10⁻³    5×10⁻⁴    5×10⁻⁴
Target network update interval   50        50        50
Total timesteps                  10000     10000     10000
Timesteps per epoch              100       100       100
Buffer capacity                  1×10⁵     1×10⁵     1×10⁵
Number of tests                  10        10        10
Critic hidden dim                256       256       256
Critic hidden layers             2         2         2
Critic activation function       ReLU      ReLU      ReLU
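The shared critic in Tab. 6 is a plain MLP with two hidden layers of 256 ReLU units. The following sketch illustrates that architecture only; training itself is run through d3rlpy, and the flattened state size and action count used here are illustrative assumptions.

```python
import torch.nn as nn

def make_critic(state_dim: int, num_actions: int,
                hidden_dim: int = 256, hidden_layers: int = 2) -> nn.Sequential:
    """Q-network shared by the DQN / DDQN / CQL baselines."""
    layers, in_dim = [], state_dim
    for _ in range(hidden_layers):
        layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
        in_dim = hidden_dim
    layers.append(nn.Linear(in_dim, num_actions))  # one Q-value per discrete action
    return nn.Sequential(*layers)

# e.g. a flattened 15 x 5 kinematic state and 5 discrete meta-actions:
critic = make_critic(state_dim=15 * 5, num_actions=5)
```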
F Prompt Setup
In this section, we detail the specific prompt design and give an example of the interaction between
the LLM-based agent and the environment.
Prefix Prompt. As shown in Fig. 10, the prefix prompt consists mainly of an introduction to the
autonomous driving task, a description of the scenario, common sense rules, and instructions for the
output format. The previous decision and its explanation are retrieved from the experience buffer. The
current scenario information plays an important role in decision-making and is dynamically generated
from the current decision frame. The driving scenario description contains the speeds and positions of
the ego and surrounding vehicles. The common sense rules section embeds the driving style that guides
the vehicle's behaviour. Finally, the answer format constrains the output to the discrete actions,
which closes the loop with the HighwayEnv simulation.
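A paraphrased sketch of how such a prefix prompt can be assembled is shown below. The exact wording used in our experiments is the one in Fig. 10; the template and placeholder names here only illustrate the structure, and the action vocabulary is assumed to follow HighwayEnv's DiscreteMetaAction names.

```python
PREFIX_PROMPT = """You are a driving assistant controlling the ego vehicle on a highway.

## Previous decision and explanation
{previous_decision}

## Current driving scenario
{scenario_description}

## Common sense rules
{driving_style_rules}

## Output format
Reply with exactly one of the discrete actions
[LANE_LEFT, IDLE, LANE_RIGHT, FASTER, SLOWER], followed by a short explanation.
"""

def build_prefix_prompt(previous_decision: str,
                        scenario_description: str,
                        driving_style_rules: str) -> str:
    # Fill the template with the experience-buffer entry, the dynamically
    # generated scenario description, and the driving-style rules.
    return PREFIX_PROMPT.format(
        previous_decision=previous_decision,
        scenario_description=scenario_description,
        driving_style_rules=driving_style_rules,
    )
```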
Interaction. We present one example to help readers better understand the reasoning process of GPT-3.5.
As shown in Fig. 11, the ego car first checks the available actions and their safety outcomes. In the
first round of thought, GPT-3.5 tries to understand the situation of the ego car and checks the lanes
available for decision-making. After several rounds of interaction, it checks whether the action keep
speed is safe with respect to vehicle 7. Finally, it outputs the decision idle and explains that
maintaining the current speed and lane keeps a safe distance from the surrounding cars.
Figure 10: The prefix prompt used before interacting with HighwayEnv.
Figure 11: Interaction between the LLM-based agent and the HighwayEnv environment.
Conference Paper
Full-text available
We investigate the challenge of task planning for multi-task embodied agents in open-world environments. 2 Two main difficulties are identified: 1) executing plans in an open-world environment (e.g., Minecraft) necessitates accurate and multi-step reasoning due to the long-term nature of tasks, and 2) as vanilla planners do not consider how easy the current agent can achieve a given sub-task when ordering parallel sub-goals within a complicated plan, the resulting plan could be inefficient or even infeasible. To this end, we propose "Describe, Explain, Plan and Select" (DEPS), an interactive planning approach based on Large Language Models (LLMs). DEPS facilitates better error correction on initial LLM-generated plan by integrating description of the plan execution process and providing self-explanation of feedback when encountering failures during the extended planning phases. Furthermore, it includes a goal selector, which is a trainable module that ranks parallel candidate sub-goals based on the estimated steps of completion, consequently refining the initial plan. Our experiments mark the milestone of the first zero-shot multi-task agent that can robustly accomplish 70+ Minecraft tasks and nearly double the overall performances. Further testing reveals our method's general effectiveness in popularly adopted non-open-ended domains as well (i.e., ALFWorld and tabletop manipulation). The ablation and exploratory studies detail how our design beats the counterparts and provide a promising update on the ObtainDiamond grand challenge with our approach. The code is released at https://github.com/CraftJarvis/MC-Planner.
Conference Paper
Full-text available
Recent studies have uncovered the potential of Large Language Models (LLMs) in addressing complex sequential decision-making tasks through the provision of high-level instructions. However, LLM-based agents lack specialization in tackling specific target problems, particularly in real-time dynamic environments. Additionally, deploying an LLM-based agent in practical scenarios can be both costly and time-consuming. On the other hand, reinforcement learning (RL) approaches train agents that specialize in the target task but often suffer from low sampling efficiency and high exploration costs. In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task. We conducted experiments on challenging MiniGrid and Habitat environments, specifically designed for embodied AI research, to evaluate the effectiveness of our framework. The results clearly demonstrate that our approach achieves superior performance compared to strong baseline methods. Our code is available at https://github.com/ZJLAB-AMMI/LLM4Teach.
Article
Full-text available
While Goal-Conditioned Reinforcement Learning (GCRL) has gained attention, its algorithmic robustness against adversarial perturbations remains unexplored. The attacks and robust representation training methods that are designed for traditional RL become less effective when applied to GCRL. To address this challenge, we first propose the Semi-Contrastive Representation attack, a novel approach inspired by the adversarial contrastive attack. Unlike existing attacks in RL, it only necessitates information from the policy function and can be seamlessly implemented during deployment. Then, to mitigate the vulnerability of existing GCRL algorithms, we introduce Adversarial Representation Tactics, which combines Semi-Contrastive Adversarial Augmentation with Sensitivity-Aware Regularizer to improve the adversarial robustness of the underlying RL agent against various types of perturbations. Extensive experiments validate the superior performance of our attack and defence methods across multiple state-of-the-art GCRL algorithms. Our code is available at https://github.com/TrustAI/ReRoGCRL.
Article
We propose a new strategy for applying large pre-trained language models to novel tasks when labeled training data is limited. Rather than apply the model in a typical zero-shot or few-shot fashion, we treat the model as the basis for labeling functions in a weak supervision framework. To create a classifier, we first prompt the model to answer multiple distinct queries about an example and define how the possible responses should be mapped to votes for labels and abstentions. We then denoise these noisy label sources using the Snorkel system and train an end classifier with the resulting training data. Our experimental evaluation shows that prompting large language models within a weak supervision framework can provide significant gains in accuracy. On the WRENCH weak supervision benchmark, this approach can significantly improve over zero-shot performance, an average 19.5% reduction in errors. We also find that this approach produces classifiers with comparable or superior accuracy to those trained from hand-engineered rules.