Model-Based Reinforcement Learning with Multi-Step
Plan Value Estimation
Haoxin Lin$^{a,b,*}$, Yihao Sun$^{a,*}$, Jiaji Zhang$^{a}$ and Yang Yu$^{a,b,c,**}$
$^{a}$National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu, China
$^{b}$Polixir Technologies, Nanjing, Jiangsu, China
$^{c}$Peng Cheng Laboratory, Shenzhen, Guangdong, China
Abstract. A promising way to improve the sample efficiency of reinforcement learning is model-based methods, in which many explorations and evaluations can happen in the learned models to save real-world samples. However, when the learned model has a non-negligible model error, sequential steps in the model are hard to evaluate accurately, limiting the model's utilization. This paper proposes to alleviate this issue by introducing multi-step plans into policy optimization for model-based RL. We employ a multi-step plan value estimation, which evaluates the expected discounted return after executing a sequence of action plans at a given state, and update the policy by directly computing the multi-step policy gradient via plan value estimation. The new model-based reinforcement learning algorithm MPPVE (Model-based Planning Policy Learning with Multi-step Plan Value Estimation) shows a better utilization of the learned model and achieves better sample efficiency than state-of-the-art model-based RL approaches. The code is available at https://github.com/HxLyn3/MPPVE.
1 Introduction
Reinforcement Learning (RL) has attracted close attention in recent
years. Despite its empirical success in simulated domains and games,
insufficient sample efficiency is still one of the critical problems
hindering the application of RL in reality [32]. One promising way to improve the sample efficiency is to train and utilize world models [32], which is known as model-based RL (MBRL). World model learning has recently received significant developments, including the elimination of the compounding error issue [31] and causal model learning studies [9, 34] for achieving high-fidelity models, and has also gained real-world applications [27, 26]. Nevertheless, this paper focuses on the dyna-style MBRL framework [29], which augments the replay buffer with world-model-generated data for off-policy RL.
In dyna-style MBRL, the model is often learned by supervised learning to fit the observed transition data, which is simple to train but exhibits non-negligible model error [31]. Specifically, consider a generated $k$-step model rollout $s_t, a_t, \hat{s}_{t+1}, a_{t+1}, \ldots, \hat{s}_{t+k}$, where $\hat{s}$ stands for fake states. The deviation error of the fake states increases with $k$, since the error accumulates gradually as the state transitions in imagination. If updated on fake states with a large deviation, the policy will be misled by policy gradients given by biased state-action value estimates. The influence of model error on the directions of policy gradients is demonstrated in Figure 1.
$^{*}$ Equal Contribution.
$^{**}$ Corresponding Author. Email: yuy@nju.edu.cn
Figure 1. We make real branched rollouts in the real environment and use the learned model to generate fake rollouts starting from the same real states. For each fake rollout and its corresponding real rollout, we show their deviation along with the normalized cosine similarity between their multi-step policy gradients.
Therefore, dyna-style MBRL often introduces techniques to reduce the impact of the model error. For example, MBPO [14] proposes the branched rollout scheme with a gradually growing branch length to truncate imaginary model rollouts, keeping unreliable fake samples out of policy optimization. BMPO [17] further truncates model rollouts to a shorter branch length than MBPO after the same number of environmental sampling steps, since its bidirectional dynamics model can make imaginary rollouts in both the forward and the backward direction.
We argue that, while the above methods try to reduce the impact of the accumulated model error by shortening the rollouts, avoiding explicit reliance on the rolled-out fake states can also help. This paper proposes employing a $k$-step plan value estimation $Q(s_t, a_t, \ldots, a_{t+k-1})$ to evaluate sequential action plans. When using a $k$-step rollout to update the policy, we can then compute the policy gradient only at the starting real state $s_t$, and not at any fake states. Therefore, the $k$-step policy gradients given directly by the plan value are less influenced by the model error. Based on this key idea, we formulate a new model-based algorithm called Model-based Planning Policy Learning with Multi-step Plan Value Estimation (MPPVE). The difference between MPPVE and previous model-based methods in updating the actor is shown in Figure 2.
Figure 2. A comparison between MPPVE and previous model-based methods in updating the actor. (a) Data generation: environmental data is stored in $D_{env}$, while model data is stored in $D_{model}$; green $s_t$ stands for the real environmental state, red $\hat{s}_{t+1}, \ldots, \hat{s}_{t+k}$ stand for fake states, and the darker the red, the greater the expected deviation error. (b) Illustration of how previous model-based methods update the actor with action-value estimation. (c) Illustration of how MPPVE updates the actor with $k$-step plan value estimation: the $k$-step policy gradients are given directly by the plan value estimation when the actor plans starting from the real environmental state.
In general, our contributions are summarized as follows:
• We present a tabular planning policy iteration method that alternates between planning policy evaluation and planning policy improvement, and theoretically prove its convergence to the optimal policy.
• We propose a new model-based algorithm, MPPVE, for general continuous settings based on the theoretical tabular planning policy iteration. It updates the policy by directly computing multi-step policy gradients via plan value estimation for short plans starting from real environmental states, mitigating the misleading impact of the compounding error.
• We empirically verify that the multi-step policy gradients computed by MPPVE are less influenced by model error and more accurate than those computed by previous model-based RL methods.
• We show that MPPVE outperforms recent state-of-the-art model-based algorithms in terms of sample efficiency while retaining final performance close to the converged performance of model-free algorithms on MuJoCo [30] benchmarks.
2 Related Work
This work is related to dyna-style MBRL [29] and multi-step planning.
Dyna-style MBRL methods generate fake transitions with a dynamics model to accelerate value approximation or policy learning. MVE [12] uses a dynamics model to simulate the short-term horizon and Q-learning to estimate the long-term value, improving the quality of target values for training. STEVE [5] builds on MVE to better estimate the Q value via stochastic ensemble model-based value expansion. SLBO [20] regards the dynamics model as a simulator and directly uses TRPO [25] to optimize the policy with whole trajectories sampled in it. Moreover, MBPO [14] builds on the off-policy RL algorithm SAC [13] and updates the policy with a mixture of data from the real environment and imaginary branched rollouts. Some recent dyna-style MBRL methods focus on reducing the influence of model error. For instance, M2AC [24] masks high-uncertainty model-generated data with a masking mechanism. BMPO [17] utilizes both a forward and a backward model to separate the compounding error in different directions, which yields model data with less error than using the forward model only. This work also adopts the dyna-style MBRL framework and focuses on mitigating the impact of the accumulated model error.
Multi-step planning methods usually utilize the learned dynamics model to make plans. Model Predictive Control (MPC) [6] obtains an optimal action sequence by sampling multiple sequences and applies the first action of that sequence to the environment. MB-MF [22] adopts the random-shooting method as an instantiation of MPC, which samples several action sequences randomly and uniformly in a learned neural model. PETS [10] uses CEM [4] instead, which samples actions from a distribution close to previous samples that yielded high rewards, improving the optimization efficiency.
Planning can also be incorporated into a differentiable neural network so that the planning policy is learned end-to-end [23, 28, 15]. MAAC [11] proposes to estimate the policy gradients by back-propagating through the learned dynamics model using the pathwise derivative estimator during model-based planning. Furthermore, DDPPO [18] additionally learns a gradient model by minimizing the gradient error of the dynamics model to provide more accurate policy gradients through the model. This work also learns the planning policy end-to-end, but back-propagates the policy gradients through the multi-step plan value instead, avoiding reliance on the fake states.
The concept of multi-step sequential actions has also been introduced in other model-based approaches. [1, 2, 16, 7] propose multi-step dynamics models that directly output the outcome of executing a sequence of actions. Unlike these, we introduce multi-step sequential actions in the value function instead of the dynamics model.
The multi-step plan value used in this paper is also presented in GPM [33]. Our approach differs from GPM in two aspects: 1) GPM adopts the model-free paradigm, while ours adopts the model-based paradigm; 2) GPM aims to enhance exploration and therefore presents a plan generator that outputs multi-step actions based on the plan value, while we propose model-based planning policy improvement based on the plan value estimation to mitigate the influence of compounding error and increase sample efficiency.
3 Preliminaries
A Markov Decision Process (MDP) is described by a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces respectively, the probability density function $p: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, \infty)$ represents the transition distribution over $s_{t+1} \in \mathcal{S}$ conditioned on $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$, $r: \mathcal{S} \times \mathcal{A} \times \mathbb{R} \to [0, \infty)$ is the reward distribution conditioned on $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$, and $\gamma$ is the discount factor. We use $\rho_\pi: \mathcal{S} \to [0, \infty)$ to denote the on-policy distribution over states induced by the dynamics function $p(s_{t+1} \mid s_t, a_t)$ and the policy $\pi(a_t \mid s_t)$. The goal of RL is to find the optimal policy that maximizes the expected cumulative discounted reward $\mathbb{E}_{\rho_\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$.
Recent MBRL methods aim to build a model of the dynamics function $p$ using supervised learning with data $D_{env}$ collected via interaction with the real environment. Fake data $D_{model}$ generated via model rollouts is then used by an RL algorithm, in addition to the real data, to improve sample efficiency.
4 Method
In this section, we first derive tabular planning policy iteration and verify that the optimal policy can be attained with a convergence guarantee. We then present a practical neural algorithm for general continuous environments based on this theory.
4.1 Derivation of Planning Policy Iteration
Planning policy iteration is a multi-step extension of policy iteration that optimizes a planning policy by alternating between planning policy evaluation and planning policy improvement. For the sake of theoretical analysis, our derivation focuses on the tabular setting.
4.1.1 Planning Policy
With the dynamics function $p$, at any state the agent can generate a plan consisting of a sequence of actions to perform in the next few steps in turn. We denote the $k$-step planning policy as $\pi^k$; then, given state $s_t$, the plan $\tau_t^k = (a_t, a_{t+1}, \ldots, a_{t+k-1})$ can be predicted with $\tau_t^k \sim \pi^k(\cdot \mid s_t)$. Concretely, for $m \in \{0, 1, \ldots, k-1\}$, $a_{t+m} \sim \pi(\cdot \mid s_{t+m})$ and $s_{t+m+1} \sim p(\cdot \mid s_{t+m}, a_{t+m})$, we have
$$\pi^k(\tau_t^k \mid s_t) = \pi(a_t \mid s_t) \sum_{s_{t+1}^{k-1}} \prod_{i=t+1}^{t+k-1} p(s_i \mid s_{i-1}, a_{i-1})\, \pi(a_i \mid s_i), \qquad (1)$$
where $s_{t+1}^{k-1} = (s_{t+1}, \ldots, s_{t+k-1})$ is the state sequence at the future $k-1$ steps. The planning policy only gives a temporally consecutive plan of fixed length $k$, not a full plan of actions up to termination.
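For intuition, the following minimal Python sketch composes a $k$-step plan by alternating policy sampling and model prediction, mirroring Eq. (1). The interfaces `policy.sample` and `model.step` are assumptions made for illustration, not names from the released code.

```python
import numpy as np

def sample_plan(policy, model, s_t, k):
    """Compose a k-step plan tau_t^k by alternating the single-step policy
    and the learned one-step dynamics model, mirroring Eq. (1)."""
    plan, s = [], s_t
    for _ in range(k):
        a = policy.sample(s)      # a_{t+m} ~ pi(. | s_{t+m})
        plan.append(a)
        s, _ = model.step(s, a)   # s_{t+m+1}, r_{t+m} ~ p_theta(. | s_{t+m}, a_{t+m})
    return np.array(plan)         # (a_t, ..., a_{t+k-1})
```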
4.1.2 Planning Policy Evaluation
Given a stochastic policy $\pi \in \Pi$, the $k$-step plan value function [33] is defined as
$$Q^\pi(s_t, \tau_t^k) = Q^\pi(s_t, a_t, a_{t+1}, \ldots, a_{t+k-1}) = \mathbb{E}_{p, r, \pi}\left[\sum_{m=0}^{k-1} \gamma^m r_{t+m} + \gamma^k \sum_{m=0}^{\infty} \gamma^m r_{t+k+m}\right] \qquad (2)$$
$$= \mathbb{E}_{p, r}\left[\sum_{m=0}^{k-1} \gamma^m r_{t+m} + \gamma^k\, \mathbb{E}_{\hat{\tau}^k \sim \pi^k}\left[Q^\pi(s_{t+k}, \hat{\tau}^k)\right]\right]. \qquad (3)$$
The first $k$ terms $(r_t, r_{t+1}, \ldots, r_{t+k-1})$ are the instant rewards for the $k$ actions of the plan $\tau_t^k$ taken step by step from state $s_t$, while the remaining terms are the future rewards obtained by making decisions according to $\pi$ from state $s_{t+k}$ onward.
According to the recursive Bellman equation (3), the extended Bellman backup operator $\mathcal{T}^\pi$ can be written as
$$\mathcal{T}^\pi Q(s_t, \tau_t^k) = \mathbb{E}_{p, r}\left[\sum_{m=0}^{k-1} \gamma^m r_{t+m}\right] + \gamma^k\, \mathbb{E}_p\left[V(s_{t+k})\right], \qquad (4)$$
where
$$V(s_t) = \mathbb{E}_{\tau_t^k \sim \pi^k}\left[Q(s_t, \tau_t^k)\right] \qquad (5)$$
is the state value function. This update rule for the $k$-step plan value function differs from multi-step Temporal Difference (TD) learning [3], even though plan value learning naturally involves multi-step bootstrapping. Specifically, the plan value is updated without bias because the multi-step rewards $(r_t, r_{t+1}, \ldots, r_{t+k-1})$ correspond exactly to the input plan, whereas multi-step TD learning of the single-step action-value needs importance sampling or Tree-backup [3] to correct the bias caused by the difference between the target policy and the behavior policy.
Starting from any function $Q: \mathcal{S} \times \mathcal{A}^k \to \mathbb{R}$ and applying $\mathcal{T}^\pi$ repeatedly, we obtain the plan value function of the policy $\pi$.
Lemma 1 (Planning Policy Evaluation). Given any initial mapping $Q_0: \mathcal{S} \times \mathcal{A}^k \to \mathbb{R}$ with $|\mathcal{A}| < \infty$, update $Q_i$ to $Q_{i+1}$ with $Q_{i+1} = \mathcal{T}^\pi Q_i$ for all $i \in \mathbb{N}$; then $\{Q_i\}$ converges to the plan value of policy $\pi$ as $i \to \infty$.
Proof. For any $i \in \mathbb{N}$, after updating $Q_i$ to $Q_{i+1}$ with $\mathcal{T}^\pi$, we have
$$Q_{i+1}(s_t, \tau_t^k) = \mathcal{T}^\pi Q_i(s_t, \tau_t^k) = \mathbb{E}_{p, r}\left[\sum_{m=0}^{k-1} \gamma^m r_{t+m} + \gamma^k\, \mathbb{E}_{\hat{\tau}^k \sim \pi^k}\left[Q_i(s_{t+k}, \hat{\tau}^k)\right]\right], \qquad (6)$$
for any $s_t \in \mathcal{S}$ and $\tau_t^k \in \mathcal{A}^k$. Then
$$\left\|Q_{i+1} - Q^\pi\right\|_\infty = \max_{s_t \in \mathcal{S},\, \tau_t^k \in \mathcal{A}^k}\left|Q_{i+1}(s_t, \tau_t^k) - Q^\pi(s_t, \tau_t^k)\right| \le \gamma^k \left\|Q_i - Q^\pi\right\|_\infty. \qquad (7)$$
So $Q_i = Q^\pi$ is a fixed point of this update rule, and the sequence $\{Q_i\}$ converges to the plan value of policy $\pi$ as $i \to \infty$.
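To make the operator concrete, the following NumPy sketch implements Lemma 1 for a small tabular MDP. The array layout (`P[s, a, s2]` transition probabilities, `R[s, a]` expected rewards, `pi[s, a]` single-step policy) is an assumption made for illustration, not part of the paper.

```python
import numpy as np
from itertools import product

def planning_policy_evaluation(P, R, pi, gamma, k, iters=200):
    """Tabular planning policy evaluation (Lemma 1): repeatedly apply the
    extended Bellman backup of Eq. (4) to a plan-value table Q[s, tau]."""
    S, A = R.shape
    plans = list(product(range(A), repeat=k))            # every plan tau in A^k
    Q = np.zeros((S, len(plans)))

    def plan_prob(s, tau):
        # pi^k(tau | s) from Eq. (1): roll the state distribution forward,
        # weighting by the probability of choosing each action of the plan.
        w = np.zeros(S); w[s] = 1.0
        for m, a in enumerate(tau):
            w = w * pi[:, a]                             # choose a_{t+m}
            if m < k - 1:
                w = w @ P[:, a, :]                       # move to s_{t+m+1}
        return w.sum()

    for _ in range(iters):
        # V(s) = sum_tau pi^k(tau | s) Q(s, tau), Eq. (5)
        V = np.array([sum(plan_prob(s, tau) * Q[s, j]
                          for j, tau in enumerate(plans)) for s in range(S)])
        Q_new = np.zeros_like(Q)
        for s in range(S):
            for j, tau in enumerate(plans):
                d = np.zeros(S); d[s] = 1.0              # state distribution
                ret, disc = 0.0, 1.0
                for a in tau:                            # fixed plan actions
                    ret += disc * float(d @ R[:, a])     # expected r_{t+m}
                    d = d @ P[:, a, :]
                    disc *= gamma
                Q_new[s, j] = ret + disc * float(d @ V)  # backup of Eq. (4)
        Q = Q_new
    return Q
```

Each outer iteration applies the backup of Eq. (4) to every state-plan entry, so by Lemma 1 the table contracts toward $Q^\pi$ at rate $\gamma^k$.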
4.1.3 Planning Policy Improvement
After attaining the corresponding plan value, the policy $\pi$ can be updated with
$$\pi_{\mathrm{new}} = \arg\max_{\pi \in \Pi} \sum_{\tau_t^k \in \mathcal{A}^k} \pi^k(\tau_t^k \mid s_t)\, Q^{\pi_{\mathrm{old}}}(s_t, \tau_t^k) \qquad (8)$$
for each state. We will show that $\pi_{\mathrm{new}}$ achieves a greater plan value than $\pi_{\mathrm{old}}$ after applying Eq. (8).
Lemma 2 (Planning Policy Improvement). Given any mapping $\pi_{\mathrm{old}} \in \Pi: \mathcal{S} \to \Delta(\mathcal{A})$ with $|\mathcal{A}| < \infty$, update $\pi_{\mathrm{old}}$ to $\pi_{\mathrm{new}}$ with Eq. (8); then $Q^{\pi_{\mathrm{new}}}(s_t, \tau_t^k) \ge Q^{\pi_{\mathrm{old}}}(s_t, \tau_t^k)$ for all $s_t \in \mathcal{S}$ and $\tau_t^k \in \mathcal{A}^k$.
Proof. By the definition of the state value function, we have
$$V^{\pi_{\mathrm{old}}}(s_t) = \sum_{\tau_t^k \in \mathcal{A}^k} \pi_{\mathrm{old}}^k(\tau_t^k \mid s_t)\, Q^{\pi_{\mathrm{old}}}(s_t, \tau_t^k) \le \sum_{\tau_t^k \in \mathcal{A}^k} \pi_{\mathrm{new}}^k(\tau_t^k \mid s_t)\, Q^{\pi_{\mathrm{old}}}(s_t, \tau_t^k), \qquad (9)$$
for any $s_t \in \mathcal{S}$. Repeatedly applying inequality (9), the result is given by
$$Q^{\pi_{\mathrm{old}}}(s_t, \tau_t^k) = \mathbb{E}_{p, r}\left[\sum_{m=0}^{k-1} \gamma^m r_{t+m} + \gamma^k V^{\pi_{\mathrm{old}}}(s_{t+k})\right]$$
$$\le \mathbb{E}_{p, r}\left[\sum_{m=0}^{k-1} \gamma^m r_{t+m} + \gamma^k \sum_{\tau^k \in \mathcal{A}^k} \pi_{\mathrm{new}}^k(\tau^k \mid s_{t+k})\, Q^{\pi_{\mathrm{old}}}(s_{t+k}, \tau^k)\right]$$
$$= \mathbb{E}_{p, r, \pi_{\mathrm{new}}}\left[\sum_{m=0}^{2k-1} \gamma^m r_{t+m} + \gamma^{2k} V^{\pi_{\mathrm{old}}}(s_{t+2k})\right] \le \cdots$$
$$\le \mathbb{E}_{p, r, \pi_{\mathrm{new}}}\left[\sum_{m=0}^{\infty} \gamma^m r_{t+m}\right] = Q^{\pi_{\mathrm{new}}}(s_t, \tau_t^k), \qquad (10)$$
for all $s_t \in \mathcal{S}$ and $\tau_t^k \in \mathcal{A}^k$.
4.1.4 Planning Policy Iteration
The whole planning policy iteration process alternates between planning policy evaluation and planning policy improvement until the sequence of policies $\{\pi_i\}$ converges to the optimal policy $\pi^*$, whose plan value at any state-plan pair is the greatest among all $\pi \in \Pi$.
Theorem 1 (Planning Policy Iteration). Given any initial mapping $\pi_0 \in \Pi: \mathcal{S} \to \Delta(\mathcal{A})$ with $|\mathcal{A}| < \infty$, compute the corresponding $Q^{\pi_i}$ in the planning policy evaluation step and update $\pi_i$ to $\pi_{i+1}$ in the planning policy improvement step for all $i \in \mathbb{N}$; then $\{\pi_i\}$ converges to the optimal policy $\pi^*$ such that $Q^{\pi^*}(s_t, \tau_t^k) \ge Q^{\pi}(s_t, \tau_t^k)$ for all $s_t \in \mathcal{S}$, $\tau_t^k \in \mathcal{A}^k$, and $\pi \in \Pi$.
Proof. The sequence $\{\pi_i\}$ converges to some $\pi^*$ since $\{Q^{\pi_i}\}$ is monotonically increasing in $i$ and bounded above. By Lemma 2, at convergence we cannot find any $\pi \in \Pi$ satisfying
$$V^{\pi^*}(s_t) < \sum_{\tau_t^k \in \mathcal{A}^k} \pi^k(\tau_t^k \mid s_t)\, Q^{\pi^*}(s_t, \tau_t^k), \qquad (11)$$
for any $s_t \in \mathcal{S}$. Then we obtain
$$Q^{\pi^*}(s_t, \tau_t^k) = \mathbb{E}_{p, r}\left[\sum_{m=0}^{k-1} \gamma^m r_{t+m} + \gamma^k V^{\pi^*}(s_{t+k})\right]$$
$$\ge \mathbb{E}_{p, r}\left[\sum_{m=0}^{k-1} \gamma^m r_{t+m} + \gamma^k \sum_{\tau^k \in \mathcal{A}^k} \pi^k(\tau^k \mid s_{t+k})\, Q^{\pi^*}(s_{t+k}, \tau^k)\right]$$
$$= \mathbb{E}_{p, r, \pi}\left[\sum_{m=0}^{2k-1} \gamma^m r_{t+m} + \gamma^{2k} V^{\pi^*}(s_{t+2k})\right] \ge \cdots$$
$$\ge \mathbb{E}_{p, r, \pi}\left[\sum_{m=0}^{\infty} \gamma^m r_{t+m}\right] = Q^{\pi}(s_t, \tau_t^k), \qquad (12)$$
for all $s_t \in \mathcal{S}$, $\tau_t^k \in \mathcal{A}^k$, and $\pi \in \Pi$. Hence $\pi^*$ is the optimal policy.
4.2 Model-based Planning Policy Learning with
Multi-step Plan Value Estimation
Tabular planning policy iteration cannot be directly applied to scenarios with an inaccessible dynamics function and continuous state-action spaces. Therefore, we propose a neural algorithm based on planning policy iteration for general applications, called Model-based Planning Policy Learning with Multi-step Plan Value Estimation (MPPVE).
Algorithm 1 MPPVE
Input: initial neural parameters $\theta$, $\phi$, $\psi$, $\bar{\psi}$; plan length $k$; environment buffer $D_{env}$; model buffer $D_{model}$; start size $U$; batch size $B$; learning rates $\lambda_Q$, $\lambda_\pi$.
1: Explore for $U$ environmental steps and add the data to $D_{env}$
2: for $N$ epochs do
3:   Train the model $p_\theta$ on $D_{env}$ by maximizing Eq. (13)
4:   for $E$ steps do
5:     Perform an action in the environment according to $\pi_\phi$; add the environmental transition to $D_{env}$
6:     for $M$ model rollouts do
7:       Sample $s_t$ from $D_{env}$ and make a model rollout using policy $\pi_\phi$; add the generated samples to $D_{model}$
8:     end for
9:     for $G$ critic updates do
10:      Sample $B$ $k$-step trajectories from $D_{env} \cup D_{model}$ to update the critic $Q_\psi$ via $\psi \leftarrow \psi - \lambda_Q \hat{\nabla}_\psi J_Q(\psi)$ by Eq. (18)
11:      Update the target critic via $\bar{\psi} \leftarrow \tau\psi + (1-\tau)\bar{\psi}$
12:    end for
13:    Sample $B$ states from $D_{env}$ to update the policy $\pi_\phi$ via $\phi \leftarrow \phi - \lambda_\pi \hat{\nabla}_\phi J_{\pi^k}(\phi)$ by Eq. (21)
14:  end for
15: end for
As shown in Figure 2(c), MPPVE adopts the framework of model-based policy optimization [14] and consists of a dynamics model $p_\theta$, an actor $\pi_\phi$, and a critic $Q_\psi$, where $\theta$, $\phi$, and $\psi$ are neural parameters. The algorithm is described in three parts: 1) model learning; 2) multi-step plan value estimation; 3) model-based planning policy improvement.
4.2.1 Model Learning
Like MBPO [14], we use an ensemble of neural networks that takes the state-action pair as input and outputs a Gaussian distribution over the next state and reward, jointly representing the dynamics function $p$ and the reward function $r$. That is, $p_\theta(s_{t+1}, r_t \mid s_t, a_t) = \mathcal{N}(\mu_\theta(s_t, a_t), \Sigma_\theta(s_t, a_t))$, where $\mu_\theta(s_t, a_t)$ and $\Sigma_\theta(s_t, a_t)$ are the outputs of the network. The model is trained to maximize the expected log-likelihood:
$$J_p(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D_{env}}\left[\log p_\theta(s_{t+1}, r_t \mid s_t, a_t)\right]. \qquad (13)$$
4.2.2 Multi-step Plan Value Estimation
To approximate the $k$-step plan value function with continuous inputs, we use a deep Q-network [21, 19] $Q_\psi(s_t, \tau_t^k)$ with parameters $\psi$. The plan value is estimated by minimizing the expected multi-step TD error:
$$J_Q(\psi) = \mathbb{E}_{(s_t, \tau_t^k, r_t^k, s_{t+k}) \sim D_{env} \cup D_{model}}\left[\ell_{TD}(s_t, \tau_t^k, r_t^k, s_{t+k})\right], \qquad (14)$$
with
$$r_t^k = (r_t, r_{t+1}, \ldots, r_{t+k-1}), \qquad (15)$$
$$\ell_{TD}(s_t, \tau_t^k, r_t^k, s_{t+k}) = \frac{1}{2}\left(Q_\psi(s_t, \tau_t^k) - y_{\bar{\psi}}(r_t^k, s_{t+k})\right)^2, \qquad (16)$$
$$y_{\bar{\psi}}(r_t^k, s_{t+k}) = \sum_{m=0}^{k-1} \gamma^m r_{t+m} + \gamma^k\, \mathbb{E}_{\hat{\tau}^k \sim \pi^k}\left[Q_{\bar{\psi}}(s_{t+k}, \hat{\tau}^k)\right], \qquad (17)$$
where $\bar{\psi}$ denotes the parameters of the target network, which is used to stabilize training [21]. The gradient of Eq. (14) can be estimated without bias by
$$\hat{\nabla}_\psi J_Q(\psi) = \left(Q_\psi(s_t, \tau_t^k) - \hat{y}_{\bar{\psi}}(r_t^k, s_{t+k})\right)\nabla_\psi Q_\psi(s_t, \tau_t^k), \qquad (18)$$
with
$$\hat{y}_{\bar{\psi}}(r_t^k, s_{t+k}) = \sum_{m=0}^{k-1} \gamma^m r_{t+m} + \gamma^k Q_{\bar{\psi}}(s_{t+k}, \tau_{t+k}^k), \qquad (19)$$
where $(s_t, \tau_t^k, r_t^k, s_{t+k})$ is sampled from the replay buffer and $\tau_{t+k}^k$ is sampled according to the current planning policy.
Figure 3. Learning curves of MPPVE (red) and the six baselines on MuJoCo (v3) continuous control tasks. The blue dashed lines indicate the asymptotic performance of SAC on these tasks for reference. Solid lines indicate the mean and shaded areas the standard error over eight random seeds. Each evaluation, taken every 1,000 environmental steps, computes the average return over ten episodes.
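A compact PyTorch-style sketch of the critic update described by Eqs. (14)-(19) is given below; `q_net`, `q_target_net`, and `planning_policy` are assumed callables, and the batch is assumed to hold $k$-step segments:

```python
import torch

def critic_loss(q_net, q_target_net, planning_policy, batch, gamma, k):
    """k-step TD loss for the plan-value critic, following Eqs. (14)-(19)."""
    s_t, plan, rewards, s_tk = batch                   # rewards: (B, k)
    with torch.no_grad():
        next_plan = planning_policy(s_tk)              # tau^k_{t+k} from the current planning policy
        discounts = gamma ** torch.arange(k, device=rewards.device, dtype=rewards.dtype)
        k_step_return = (rewards * discounts).sum(-1)  # sum_m gamma^m r_{t+m}
        target = k_step_return + gamma ** k * q_target_net(s_tk, next_plan)  # Eq. (19)
    td_error = q_net(s_t, plan) - target
    return 0.5 * (td_error ** 2).mean()                # Eq. (16), averaged over the batch
```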
4.2.3 Model-based Planning Policy Improvement
The probabilistic policy $\pi_\phi$ is represented by a conditioned Gaussian distribution over actions. That is, $\pi_\phi(a_t \mid s_t) = \mathcal{N}(u_\phi(s_t), \sigma_\phi(s_t))$, where $u_\phi(s_t)$ and $\sigma_\phi(s_t)$ are the outputs of the policy network. As shown in Eq. (1), a $k$-step model-based planning policy $\pi_{\phi,\theta}^k$ is composed of the dynamics model $p_\theta$ and the policy $\pi_\phi$. Since the plan value function is represented by a differentiable neural network, we can train the model-based planning policy by minimizing
$$J_{\pi^k}(\phi) = \mathbb{E}_{s_t \sim D_{env}}\left[\mathbb{E}_{\tau_t^k \sim \pi_{\phi,\theta}^k(\cdot \mid s_t)}\left[-Q_\psi(s_t, \tau_t^k)\right]\right], \qquad (20)$$
whose gradient can be approximated by
$$\hat{\nabla}_\phi J_{\pi^k}(\phi) = -\nabla_{\tau_t^k} Q_\psi(s_t, \tau_t^k)\, \nabla_\phi \tau_t^k, \qquad (21)$$
where $\tau_t^k$ is sampled according to the current planning policy $\pi_{\phi,\theta}^k$.
During the model-based planning policy improvement step, the actor makes $k$-step plans starting from $s_t$ using the policy $\pi_\phi$ and the model $p_\theta$. The corresponding state-plan pairs are then fed into the neural plan value function to directly compute the gradient (21). The policy gradients within these $k$ steps are therefore mainly affected only by the bias of the plan value estimate at $s_t$, avoiding the influence of action-value estimates corrupted by compounding error at the following $k-1$ states, which affects previous model-based methods.
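The corresponding actor update can be sketched as follows (PyTorch-style; `policy.rsample` and `model.predict` are assumed reparameterized-sampling and model-prediction interfaces, not names from the released code). The plan is rolled out from a real state $s_t$ and scored by the plan value at $s_t$, so minimizing this loss realizes the gradient of Eq. (21):

```python
import torch

def planning_policy_loss(q_net, policy, model, s_t, k):
    """Roll a k-step plan from a real state and score it with the plan value.
    Gradients flow into the policy parameters through the planned actions."""
    s, plan = s_t, []
    for _ in range(k):
        a = policy.rsample(s)          # a_{t+m} with a pathwise (reparameterized) gradient
        plan.append(a)
        s = model.predict(s, a)        # imagined hat{s}_{t+m+1} from the learned model
    plan = torch.cat(plan, dim=-1)     # tau_t^k = (a_t, ..., a_{t+k-1})
    return -q_net(s_t, plan).mean()    # minimizing this maximizes Q_psi(s_t, tau_t^k), Eq. (20)
```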
The complete algorithm is described in Algorithm 1. Like MBPO [14], we also utilize model rollouts to generate fake transitions. The primary difference from MBPO is that only the critic is additionally trained with $k$-step rollouts starting from fake states, to support a larger Update-To-Data (UTD) [8] ratio, while the actor is trained only with $k$-step rollouts starting from real states and at a lower update frequency than the critic. There are two reasons for this design: 1) the plan value is more difficult to estimate than the action-value, so the critic can guide the actor only after learning from sufficient and diverse samples; 2) it ensures that the $k$-step policy gradient is computed only on the starting real state, never on any fake states.
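Schematically, one inner iteration of Algorithm 1 therefore looks as follows; this is only a sketch with assumed callables that illustrates the asymmetric update frequencies, not the full implementation:

```python
def mppve_inner_step(update_critic, update_actor, sample_mixed_segments,
                     sample_real_states, G, batch_size):
    """Critic: G updates on k-step segments from D_env and D_model (high UTD ratio).
    Actor: a single update on k-step plans rolled out from real states in D_env."""
    for _ in range(G):
        update_critic(sample_mixed_segments(batch_size))  # segments from D_env and D_model
    update_actor(sample_real_states(batch_size))           # real starting states only
```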
5 Experiments
In this section, we conduct three experiments to answer: 1) How well does our method perform on RL benchmark tasks compared with a broad range of state-of-the-art model-free and model-based RL methods? 2) Does our proposed model-based planning policy improvement based on plan value estimation provide more accurate policy gradients than previous model-based RL methods? 3) How does the plan length $k$ affect our method?
5.1 Comparison in RL Benchmarks
We evaluate MPPVE on seven MuJoCo continuous control tasks [30]: InvertedPendulum, Hopper, Swimmer, HalfCheetah, Walker2d, Ant, and Humanoid. All tasks use version v3 with the default settings. Four model-based methods and two model-free methods are selected as baselines:
• STEVE [5] uses the $k$-step return computed from an ensemble dynamics model, an ensemble reward model, and an ensemble of Q functions as the Q target, facilitating efficient action-value estimation.
• MBPO [14] is a representative model-based algorithm. It updates the policy with a mixture of data from the real environment and imaginary branched rollouts, improving the efficiency of policy optimization. A comparison to MBPO is necessary since we follow its model-based framework.
• MBPO + STEVE is a combination of MBPO and STEVE. Replacing the value estimation in MBPO with the value expansion method of STEVE, while keeping MBPO's original policy optimization process, enables both efficient value estimation and efficient policy optimization. We compare MPPVE with this algorithm to show that we improve sample efficiency beyond purely adding value expansion to MBPO.
Figure 4. The influence of model error on the directions of $k$-step policy gradients given by MPPVE and MBPO on HalfCheetah ($k=3$), Hopper ($k=4$), and Ant ($k=4$). We make real branched rollouts in the real environment and use the learned model to generate fake rollouts starting from the same real states. (a) For each fake rollout and its corresponding real rollout, we show their deviation along with the normalized cosine similarity between their $k$-step policy gradients. (b) We select some samples and plot further, where each group of orange and blue points at the same X-coordinate corresponds to the same starting real state. (c) We count the ratio of points with severely inaccurate $k$-step policy gradients for each interval of state rollout error.
• DDPPO [18] adopts a two-model learning method to control the prediction error and the gradient error of the dynamics model. In the policy optimization phase, it uses the prediction model to roll out data and the gradient model to compute the policy gradient. We compare MPPVE with this model-based algorithm to show that, even without a gradient model, the policy gradient can be accurate enough to achieve better sample efficiency.
• GPM [33] is a model-free algorithm that also proposes the concept of plan value and features an inherent mechanism for generating temporally coordinated plans for exploration. We compare MPPVE with GPM to demonstrate the power of the multi-step plan value in the model-based paradigm.
• SAC [13] is a state-of-the-art model-free algorithm that obtains competitive rewards after training. We use SAC's asymptotic performance as a reference to better evaluate all algorithms.
Figure 3 shows the learning curves of all approaches, along with SAC's asymptotic performance. MPPVE achieves strong performance with fewer environmental samples than the baselines. Taking Hopper as an example, MPPVE reaches 90% of the asymptotic performance (about 3000) after 50k steps, whereas DDPPO, MBPO, and MBPO + STEVE need about 75k steps, and the other three methods (STEVE, SAC, and GPM) reach only about 1000 after 100k steps. MPPVE thus learns about 1.5x faster than DDPPO, MBPO, and MBPO + STEVE, and dominates STEVE, SAC, and GPM in terms of learning speed on the Hopper task. After training, MPPVE achieves a final performance close to the asymptotic performance of SAC on all seven MuJoCo benchmarks. These results show that MPPVE has both high sample efficiency and competitive performance.
5.2 On Policy Gradients
We conduct a study to verify that, as claimed, the multi-step policy gradients computed by MPPVE are less influenced by model error and more accurate than those computed by previous model-based RL methods. Without loss of generality, we choose only MBPO for comparison, since many other model-based methods compute policy gradients in the same way as MBPO.
Figure 5. Study of the plan length on the Hopper task, comparing MPPVE with $k=1$, $k=3$, $k=5$, and $k=10$. (a) Evaluation of the episodic reward during the learning process. (b) Learning the plan value estimate for the same fixed policy. (c) Average of the bias between the neural value estimate and the true value evaluation over the state-action(s) space. (d) Standard error of the bias between the neural value estimate and the true value evaluation over the state-action(s) space.
To exclude irrelevant factors as far as possible, we fix the policy and the learned dynamics model after enough environmental samples have been collected, and then learn the multi-step plan value function and the action-value function, respectively, until both have converged. Next, we measure the influence of model error on the directions of the policy gradients computed by MPPVE and by MBPO. Specifically, we sample some real environmental states from the replay buffer and, starting from them, make $k$-step real rollouts with the perfect oracle dynamics and generate $k$-step imaginary rollouts with the learned model. Both MPPVE and MBPO can compute $k$-step policy gradients on the generated fake rollouts and on their corresponding real rollouts. MPPVE uses the $k$-step plan value estimate at the first state of a rollout to compute the $k$-step policy gradient directly, while MBPO computes single-step policy gradients with action-value estimates at all states of the rollout and averages them over the $k$ steps. For each fake rollout and its corresponding real rollout, we can then measure their deviation along with the normalized cosine similarity between their $k$-step policy gradients given by MBPO or MPPVE.
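Concretely, the similarity measure can be computed as below (a sketch; mapping the cosine from $[-1, 1]$ to $[0, 1]$ is our reading of "normalized", consistent with the 0.5 threshold corresponding to a 90-degree deviation in Figure 4(c)):

```python
import numpy as np

def normalized_cosine_similarity(grad_real, grad_fake):
    """Cosine similarity between flattened k-step policy gradients computed on a
    real rollout and on its corresponding fake rollout, rescaled to [0, 1]."""
    g1, g2 = np.ravel(grad_real), np.ravel(grad_fake)
    cos = float(g1 @ g2) / (np.linalg.norm(g1) * np.linalg.norm(g2) + 1e-12)
    return 0.5 * (cos + 1.0)
```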
Figure 4(a) shows the influence of model error on the directions of the $k$-step policy gradients on HalfCheetah ($k=3$), Hopper ($k=4$), and Ant ($k=4$), where orange stands for MPPVE and blue for MBPO. Each point corresponds to one starting real state for making $k$-step rollouts; its X-coordinate is the error of the fake rollout relative to the real rollout, and its Y-coordinate is the normalized cosine similarity between the $k$-step policy gradients computed on the real rollout and on the fake rollout. On the one hand, points with a larger state rollout error tend to have a smaller normalized cosine similarity of policy gradients, for both MPPVE and MBPO. This tendency reveals the influence of model error on the policy gradients. On the other hand, the orange points lie almost entirely above the blue points, which means that the $k$-step policy gradients computed by MPPVE are less influenced by model error.
For a clearer comparison, we select only about a dozen real states and plot them in Figure 4(b), where each group of orange and blue points at the same X-coordinate corresponds to the same starting real state. We connect the points of the same color to form a line. The orange line is always above the blue line, indicating the advantage of MPPVE in computing policy gradients. Furthermore, we count the ratio of points whose normalized cosine similarity between the $k$-step policy gradients computed on real and fake rollouts is smaller than 0.5, for each interval of state rollout error, as shown in Figure 4(c). A normalized cosine similarity smaller than 0.5 means that the direction of the $k$-step policy gradient deviates by more than 90 degrees. For MBPO, the ratio of severely biased policy gradients increases as the state rollout error increases and approaches 1 when the state rollout error is large enough. By contrast, MPPVE provides severely inaccurate policy gradients only for a tiny fraction of points, even under large state rollout error.
In summary, by directly computing multi-step policy gradients via plan value estimation, MPPVE provides more accurate policy gradients than MBPO, whose way of computing policy gradients is also adopted by other previous state-of-the-art model-based RL methods.
5.3 On Plan Length
The previous section empirically shows the advantage of MPPVE. A natural further question is what plan length $k$ is appropriate. Intuitively, a larger length allows the policy to be updated on longer model rollouts, which may improve sample efficiency. Nevertheless, it also makes learning the plan value function more difficult. We next run ablations to better understand the effect of the plan length $k$.
Figure 5 presents an empirical study of this hyper-parameter. The performance first increases and then decreases as the plan length $k$ increases. Specifically, $k=3$ and $k=5$ learn fastest at the beginning of training, followed by $k=1$ and $k=10$, while $k=1$ and $k=3$ are more stable than $k=5$ and $k=10$ in the subsequent training phase and also perform better. Figures 5(b), 5(c), and 5(d) explain why. We plot the policy evaluation curves for the same fixed policy in Figure 5(b); they indicate that a larger $k$ makes learning the plan value function faster. We then quantitatively analyze the estimation quality in Figures 5(c) and 5(d). Specifically, we define the normalized bias of the $k$-step plan value estimate $Q_\psi(s_t, \tau_t^k)$ as
$$\left(Q_\psi(s_t, \tau_t^k) - Q^\pi(s_t, \tau_t^k)\right) \big/ \left|\mathbb{E}_{s \sim \rho_\pi}\!\left[\mathbb{E}_{\tau^k \sim \pi^k(\cdot \mid s)}\!\left[Q^\pi(s, \tau^k)\right]\right]\right|,$$
where the actual plan value $Q^\pi(s_t, \tau_t^k)$ is obtained by Monte Carlo sampling in the real environment. Strikingly, compared to the other two settings, $k=5$ and $k=10$ have a high normalized average and standard error of the bias during training, indicating that a value function with too large a plan length is hard to fit and causes fluctuations in policy learning. $k=3$ achieves a trade-off between stability and learning speed of plan value estimation, so it performs best.
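The Monte Carlo reference value used in this normalized bias can be estimated as sketched below; the environment interface that resets to an arbitrary state and the attribute names are assumptions made for illustration:

```python
def monte_carlo_plan_value(make_env_at, policy, s_t, plan, gamma, horizon=1000):
    """Monte Carlo estimate of Q^pi(s_t, tau_t^k): execute the fixed plan in the
    real environment, then follow pi, accumulating discounted rewards."""
    env = make_env_at(s_t)                       # real environment reset to s_t
    ret, disc = 0.0, 1.0
    for a in plan:                               # the k planned actions
        obs, r, done = env.step(a)
        ret += disc * r
        disc *= gamma
        if done:
            return ret
    for _ in range(horizon - len(plan)):         # then act according to pi
        a = policy.sample(obs)
        obs, r, done = env.step(a)
        ret += disc * r
        disc *= gamma
        if done:
            break
    return ret
```

Averaging several such returns per state-plan pair, and averaging them over on-policy states and plans, also gives the denominator of the normalized bias.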
In conclusion, a suitably enlarged $k$ can lead to higher sample efficiency. However, an extremely large $k$ makes the training of MPPVE unstable. Stabilizing MPPVE when the plan length is large, in order to improve sample efficiency further, remains an open problem.
6 Conclusion
In this work, we propose a novel model-based RL method, Model-based Planning Policy Learning with Multi-step Plan Value Estimation (MPPVE). The algorithm is derived from tabular planning policy iteration, whose theoretical analysis shows that any initial policy converges to the optimal policy in tabular settings. For general continuous settings, we empirically show that directly computing multi-step policy gradients via plan value estimation, the key idea of MPPVE, is less influenced by model error and provides more accurate policy gradients than previous model-based RL methods. Experimental results demonstrate that MPPVE achieves better sample efficiency than previous state-of-the-art model-free and model-based methods while retaining competitive performance on several continuous control tasks. In the future, we will explore the scalability of MPPVE and study how to estimate the plan value stably when the plan length is large, in order to improve sample efficiency further.
7 Acknowledgments
This work is supported by National Key Research and Development
Program of China (2020AAA0107200), the National Science Foun-
dation of China (61921006), and The Major Key Project of PCL
(PCL2021A12).
References
[1] Kavosh Asadi, Evan Cater, Dipendra Misra, and Michael L. Littman, 'Towards a simple approach to multi-step model-based reinforcement learning', CoRR, abs/1811.00128, (2018).
[2] Kavosh Asadi, Dipendra Misra, Seungchan Kim, and Michael L. Littman, 'Combating the compounding-error problem with a multi-step model', CoRR, abs/1905.13320, (2019).
[3] Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, and Richard S. Sutton, 'Multi-step reinforcement learning: A unifying algorithm', in Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI'18), New Orleans, Louisiana, (2018).
[4] Zdravko I. Botev, Dirk P. Kroese, Reuven Y. Rubinstein, and Pierre L'Ecuyer, 'Chapter 3 - The cross-entropy method for optimization', in Handbook of Statistics, volume 31, 35–59, Elsevier, (2013).
[5] Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee, 'Sample-efficient reinforcement learning with stochastic ensemble value expansion', (2018).
[6] E. F. Camacho and C. B. Alba, Model Predictive Control, Advanced Textbooks in Control and Signal Processing, Springer London, 2013.
[7] Tong Che, Yuchen Lu, George Tucker, Surya Bhupatiraju, Shane Gu, Sergey Levine, and Yoshua Bengio, 'Combining model-based and model-free RL via multi-step control variates', https://openreview.net, (2018).
[8] Xinyue Chen, Che Wang, Zijian Zhou, and Keith W. Ross, 'Randomized ensembled double Q-learning: Learning fast without a model', in 9th International Conference on Learning Representations (ICLR'21), Virtual Conference, (2021).
[9] Xiong-Hui Chen, Yang Yu, Zheng-Mao Zhu, Zhihua Yu, Zhenjun Chen, Chenghe Wang, Yinan Wu, Hongqiu Wu, Rong-Jun Qin, Ruijin Ding, and Fangsheng Huang, 'Adversarial counterfactual environment model learning', CoRR, abs/2206.04890, (2022).
[10] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine, 'Deep reinforcement learning in a handful of trials using probabilistic dynamics models', in Advances in Neural Information Processing Systems 31 (NeurIPS'18), Montréal, Canada, (2018).
[11] Ignasi Clavera, Yao Fu, and Pieter Abbeel, 'Model-augmented actor-critic: Backpropagating through paths', in 8th International Conference on Learning Representations (ICLR'20), Addis Ababa, Ethiopia, (2020).
[12] Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I. Jordan, Joseph E. Gonzalez, and Sergey Levine, 'Model-based value estimation for efficient model-free reinforcement learning', CoRR, abs/1803.00101, (2018).
[13] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine, 'Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor', in Proceedings of the 35th International Conference on Machine Learning (ICML'18), Stockholm, Sweden, (2018).
[14] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine, 'When to trust your model: Model-based policy optimization', in Advances in Neural Information Processing Systems 32 (NeurIPS'19), Vancouver, Canada, (2019).
[15] Péter Karkus, Xiao Ma, David Hsu, Leslie Pack Kaelbling, Wee Sun Lee, and Tomás Lozano-Pérez, 'Differentiable algorithm networks for composable robot learning', in Robotics: Science and Systems XV (RSS'19), Freiburg im Breisgau, Germany, (2019).
[16] Nan Rosemary Ke, Amanpreet Singh, Ahmed Touati, Anirudh Goyal, Yoshua Bengio, Devi Parikh, and Dhruv Batra, 'Modeling the long term future in model-based reinforcement learning', in 7th International Conference on Learning Representations (ICLR'19), New Orleans, LA, (2019).
[17] Hang Lai, Jian Shen, Weinan Zhang, and Yong Yu, 'Bidirectional model-based policy optimization', in Proceedings of the 37th International Conference on Machine Learning (ICML'20), Virtual Conference, (2020).
[18] Chongchong Li, Yue Wang, Wei Chen, Yuting Liu, Zhi-Ming Ma, and Tie-Yan Liu, 'Gradient information matters in policy optimization by back-propagating through model', in 10th International Conference on Learning Representations (ICLR'22), Virtual Conference, (2022).
[19] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra, 'Continuous control with deep reinforcement learning', in 4th International Conference on Learning Representations (ICLR'16), San Juan, Puerto Rico, (2016).
[20] Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma, 'Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees', in 7th International Conference on Learning Representations (ICLR'19), New Orleans, LA, (2019).
[21] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis, 'Human-level control through deep reinforcement learning', Nature, 518(7540), 529–533, (2015).
[22] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine, 'Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning', in 2018 IEEE International Conference on Robotics and Automation (ICRA'18), Brisbane, Australia, (2018).
[23] Masashi Okada, Luca Rigazio, and Takenobu Aoshima, 'Path integral networks: End-to-end differentiable optimal control', CoRR, abs/1706.09597, (2017).
[24] Feiyang Pan, Jia He, Dandan Tu, and Qing He, 'Trust the model when it is confident: Masked model-based actor-critic', in Advances in Neural Information Processing Systems 33 (NeurIPS'20), Virtual Conference, (2020).
[25] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz, 'Trust region policy optimization', in Proceedings of the 32nd International Conference on Machine Learning (ICML'15), Lille, France, (2015).
[26] Wenjie Shang, Qingyang Li, Zhiwei Qin, Yang Yu, Yiping Meng, and Jieping Ye, 'Partially observable environment estimation with uplift inference for reinforcement learning based recommendation', Machine Learning, 110(9), 2603–2640, (2021).
[27] Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and Anxiang Zeng, 'Virtual-Taobao: Virtualizing real-world online retail environment for reinforcement learning', in Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI'19), Honolulu, Hawaii, (2019).
[28] Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn, 'Universal planning networks: Learning generalizable representations for visuomotor control', in Proceedings of the 35th International Conference on Machine Learning (ICML'18), Stockholm, Sweden, (2018).
[29] Richard S. Sutton, 'Integrated architectures for learning, planning, and reacting based on approximating dynamic programming', in Proceedings of the 7th International Conference on Machine Learning (ICML'90), Austin, Texas, (1990).
[30] Emanuel Todorov, Tom Erez, and Yuval Tassa, 'MuJoCo: A physics engine for model-based control', in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'12), Vilamoura, Portugal, (2012).
[31] Tian Xu, Ziniu Li, and Yang Yu, 'Error bounds of imitating policies and environments for reinforcement learning', IEEE Transactions on Pattern Analysis and Machine Intelligence, (2021).
[32] Yang Yu, 'Towards sample efficient reinforcement learning', in Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI'18), Stockholm, Sweden, (2018).
[33] Haichao Zhang, Wei Xu, and Haonan Yu, 'Generative planning for temporally coordinated exploration in reinforcement learning', in 10th International Conference on Learning Representations (ICLR'22), Virtual Conference, (2022).
[34] Zheng-Mao Zhu, Xiong-Hui Chen, Hong-Long Tian, Kun Zhang, and Yang Yu, 'Offline reinforcement learning with causal structured world models', CoRR, abs/2206.01474, (2022).
... Many efforts aim to improve the sample efficiency of RL from different perspectives, such as dynamics modeling (Chua et al., 2018;Luo et al., 2019;Janner et al., 2019;Lin et al., 2023Lin et al., , 2025, reward redistribution (Gangwani et al., 2020;Ren et al., 2022;Lin et al., 2024), ensemble learning (Hasselt et al., 2016;Fujimoto et al., 2018;Chen et al., 2021), and neural parameter resetting (Nikishin et al., 2022;Kim et al., 2023). The essence of these works is to improve the estimation quality of the state-action value function, i.e., Q function (Sutton & Barto, 2018). ...
... The success of MBPO is attributed to the large amount of reliable data generated through branched model roll-outs, allowing Q estimation to be performed efficiently and with high quality. MAAC (Clavera et al., 2020), DDPPO and MPPVE (Lin et al., 2023) make further improvements within the MBPO framework. However, REDQ (Chen et al., 2021) discovers that allowing Q estimation to be effective under large UTD settings does not necessarily require a dynamics model. ...
Article
Full-text available
A promising way to address sequential decision-making is through reinforcement learning methods. One significant factor hindering the application of reinforcement learning in the real world is its low sample efficiency, primarily due to the large number of trial-and-error samples required for value estimation. Reducing the number of environmental samples necessary to fit the value function has become essential to improve sample efficiency. However, with limited data, overly frequent policy iterations can cause existing value estimation methods to fall into the trap of overestimation. This issue arises from the accumulation of value bias in a single class of bootstrapping paths. To address this, we design a novel state-action value estimation method called generalized Q estimation (QGene) in this paper. QGene is based on a novel generalized Q function, which evaluates the expected cumulative rewards that can be generated after any length of state-action sequences. This newly defined generalized Q function inherently possesses multiple Bellman equations, allowing it to be fitted with diverse targets and generate diversified bootstrapping paths to mitigate the accumulation of value bias. Furthermore, we incorporate a conservative estimation technique to effectively avoid overestimation. Experiments show that QGene can more accurately evaluate the policy in the online setting and significantly improve sample efficiency.
... It has been been integrated into RL across various domains, capitalizing on their strengths in sequence pattern recognition from static datasets and functioning as a memory-based architecture, which aids in task understanding and credit assignment. Applications of Transformers in RL include offline RL (Chebotar et al., 2023;Yamagata et al., 2023;Wu et al., 2024), offline-to-online fine-tuning (Zheng et al., 2022;Ma & Li, 2024;Zhang et al., 2023), handling partially observable states (Parisotto et al., 2020;Ni et al., 2024;Lu et al., 2024), and model-based RL (Lin et al., 2023). However, the use of Transformers within a model-free online RL framework, specifically for sequence action prediction and evaluation, remains largely unexplored (Yuan et al., 2024). ...
... It has been been integrated into RL across various domains, capitalizing on their strengths in sequence pattern recognition from static datasets and functioning as a memory-based architecture, which aids in task understanding and credit assignment. Applications of Transformers in RL include offline RL (Chebotar et al., 2023;Yamagata et al., 2023;Wu et al., 2024), offline-to-online fine-tuning (Zheng et al., 2022;Ma & Li, 2024;Zhang et al., 2023), handling partially observable states (Parisotto et al., 2020;Ni et al., 2024;Lu et al., 2024), and model-based RL (Lin et al., 2023). However, the use of Transformers within a model-free online RL framework, specifically for sequence action prediction and evaluation, remains largely unexplored (Yuan et al., 2024). ...
Preprint
Full-text available
This work introduces Transformer-based Off-Policy Episodic Reinforcement Learning (TOP-ERL), a novel algorithm that enables off-policy updates in the ERL framework. In ERL, policies predict entire action trajectories over multiple time steps instead of single actions at every time step. These trajectories are typically parameterized by trajectory generators such as Movement Primitives (MP), allowing for smooth and efficient exploration over long horizons while capturing high-level temporal correlations. However, ERL methods are often constrained to on-policy frameworks due to the difficulty of evaluating state-action values for entire action sequences, limiting their sample efficiency and preventing the use of more efficient off-policy architectures. TOP-ERL addresses this shortcoming by segmenting long action sequences and estimating the state-action values for each segment using a transformer-based critic architecture alongside an n-step return estimation. These contributions result in efficient and stable training that is reflected in the empirical results conducted on sophisticated robot learning environments. TOP-ERL significantly outperforms state-of-the-art RL methods. Thorough ablation studies additionally show the impact of key design choices on the model performance.
... Human beings strive to anticipate future situations to prevent costly mistakes. One naive approach to incorporate this ability into AI is through future trajectory prediction using model-based RL [40,19,31,54,28,26]. However, this approach is feasible only if the state transition model is known or easily learnable, which is often not the case in multi-agent systems. ...
Preprint
Full-text available
Understanding cognitive processes in multi-agent interactions is a primary goal in cognitive science. It can guide the direction of artificial intelligence (AI) research toward social decision-making in multi-agent systems, which includes uncertainty from character heterogeneity. In this paper, we introduce an episodic future thinking (EFT) mechanism for a reinforcement learning (RL) agent, inspired by cognitive processes observed in animals. To enable future thinking functionality, we first develop a multi-character policy that captures diverse characters with an ensemble of heterogeneous policies. Here, the character of an agent is defined as a different weight combination on reward components, representing distinct behavioral preferences. The future thinking agent collects observation-action trajectories of the target agents and uses the pre-trained multi-character policy to infer their characters. Once the character is inferred, the agent predicts the upcoming actions of target agents and simulates the potential future scenario. This capability allows the agent to adaptively select the optimal action, considering the predicted future scenario in multi-agent interactions. To evaluate the proposed mechanism, we consider the multi-agent autonomous driving scenario with diverse driving traits and multiple particle environments. Simulation results demonstrate that the EFT mechanism with accurate character inference leads to a higher reward than existing multi-agent solutions. We also confirm that the effect of reward improvement remains valid across societies with different levels of character diversity.
... However, using ensemble environment models, which include transition dynamics and reward functions, involves extra computation. Model-based planning policy learning with multistep plan value estimation (MPPVE) [11] collects k-step model rollouts, where the start state is real and others are generated by the model. However, only the gradient on the start state is computed, reducing computation pressure and alleviating model error. ...
Preprint
Full-text available
Temporal difference (TD) learning is a fundamental technique in reinforcement learning that updates value estimates for states or state-action pairs using a TD target. This target represents an improved estimate of the true value by incorporating both immediate rewards and the estimated value of subsequent states. Traditionally, TD learning relies on the value of a single subsequent state. We propose an enhanced multi-state TD (MSTD) target that utilizes the estimated values of multiple subsequent states. Building on this new MSTD concept, we develop complete actor-critic algorithms that include management of replay buffers in two modes, and integrate with deep deterministic policy optimization (DDPG) and soft actor-critic (SAC). Experimental results demonstrate that algorithms employing the MSTD target significantly improve learning performance compared to traditional methods.
... M2AC computes the uncertainty of each sample by measuring the disagreement between one model versus the rest of the models in an ensemble model, and stores only a batch of samples with the least uncertainty. Lin et al. [92] proposed MPPVE to estimate a multi-step plan value function. MPPVE updates the value function avoiding gradient propagation through the multi-step plan, thus reducing the effect of the model-error. ...
Article
Reinforcement learning (RL) interacts with the environment to solve sequential decision-making problems via a trial-and-error approach. Errors are always undesirable in real-world applications, even though RL excels at playing complex video games that permit several trial-and-error attempts. To improve sample efficiency and thus reduce errors, model-based reinforcement learning (MBRL) is believed to be a promising direction, as it constructs environment models in which trial-and-errors can occur without incurring actual costs. In this survey, we investigate MBRL with a particular focus on the recent advancements in deep RL. There is a generalization error between the learned model of a non-tabular environment and the actual environment. Consequently, it is crucial to analyze the disparity between policy training in the environment model and that in the actual environment, guiding algorithm design for improved model learning, model utilization, and policy training. In addition, we discuss the recent developments of model-based techniques in other forms of RL, such as offline RL, goal-conditioned RL, multi-agent RL, and meta-RL. Furthermore, we discuss the applicability and benefits of MBRL for real-world tasks. Finally, this survey concludes with a discussion of the promising future development prospects for MBRL. We believe that MBRL has great unrealized potential and benefits in real-world applications, and we hope this survey will encourage additional research on MBRL.
Article
Full-text available
Spatial reasoning in Large Language Models (LLMs) serves as a foundation for embodied intelligence. However, even in simple maze environments, LLMs often struggle to plan correct paths due to hallucination issues. To address this, we propose S2ERS, an LLM-based technique that integrates entity and relation extraction with the on-policy reinforcement learning algorithm Sarsa for optimal path planning. We introduce three key improvements: (1) To tackle the hallucination of spatial, we extract a graph structure of entities and relations from the text-based maze description, aiding LLMs in accurately comprehending spatial relationships. (2) To prevent LLMs from getting trapped in dead ends due to context inconsistency hallucination by long-term reasoning, we insert the state-action value function Q into the prompts, guiding the LLM’s path planning. (3) To reduce the token consumption of LLMs, we utilize multi-step reasoning, dynamically inserting local Q-tables into the prompt to assist the LLM in outputting multiple steps of actions at once. Our comprehensive experimental evaluation, conducted using closed-source LLMs ChatGPT 3.5, ERNIE-Bot 4.0 and open-source LLM ChatGLM-6B, demonstrates that S2ERS significantly mitigates the spatial hallucination issues in LLMs, and improves the success rate and optimal rate by approximately 29% and 19%, respectively, in comparison to the SOTA CoT methods.
Article
In model-based reinforcement learning, the conventional approach to addressing world model bias is to use gradient optimization methods. However, relying on a single policy obtained by gradient optimization in response to world model bias inevitably yields an inherently biased policy, because the policy is constrained by imperfect and ever-changing state-action data. The gap between the world model and the real environment can never be completely eliminated. This article introduces a novel approach that explores a variety of policies instead of focusing on either world model bias or single-policy bias. Specifically, we introduce the Multi-Step Pruning Policy (MSPP), which aims to reduce redundant actions and compress the action and state spaces, encouraging different perspectives within the same world model. To achieve this, we use multiple pruning policies in parallel and integrate their outputs using the cross-entropy method. Additionally, we provide a convergence analysis of the pruning policy in the tabular setting and a theoretical framework for the parameter updates. In the experiments, the newly proposed MSPP method demonstrates a comprehensive understanding of the world model and outperforms existing state-of-the-art model-based reinforcement learning baselines.
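One plausible reading of the cross-entropy-method integration step is sketched below (the policy interface, the scoring function, and the hyperparameters are assumptions; the paper's exact procedure may differ):

```python
import numpy as np

def cem_integrate(proposals, score_fn, n_iters=3, n_samples=64, elite_frac=0.2, rng=None):
    """Fuse action proposals from several pruning policies with a CEM-style loop.
    proposals: (P, A) array of continuous actions suggested by P parallel policies.
    score_fn:  maps a (N, A) batch of candidate actions to N scores, e.g. values
               estimated by rolling the candidates out in the learned world model."""
    rng = rng or np.random.default_rng()
    mu, sigma = proposals.mean(axis=0), proposals.std(axis=0) + 1e-3
    n_elite = max(1, int(n_samples * elite_frac))
    for _ in range(n_iters):
        cand = rng.normal(mu, sigma, size=(n_samples, proposals.shape[1]))
        elite = cand[np.argsort(score_fn(cand))[-n_elite:]]  # keep highest-scoring candidates
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu  # integrated action
```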
Article
Safe reinforcement learning (RL) has shown great potential for building safe general-purpose robotic systems. While many existing works have focused on post-training policy safety, it remains an open problem to ensure safety during training as well as to improve exploration efficiency. Motivated by these challenges, this work develops shielded planning guided policy optimization (SPPO), a new model-based safe RL method that augments policy optimization algorithms with path planning and a shielding mechanism. In particular, SPPO is equipped with shielded planning for guided exploration and efficient data collection via model predictive path integral (MPPI) control, along with an advantage-based shielding rule to keep these processes safe. Based on the collected safe data, a task-oriented parameter optimization (TOPO) method is used for policy improvement, together with an observation-independent latent dynamics enhancement. In addition, SPPO provides explicit theoretical guarantees, i.e., clear theoretical bounds for training safety, deployment safety, and the learned policy performance. Experiments demonstrate that SPPO outperforms baselines in terms of policy performance, learning efficiency, and safety during training.
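For reference, a generic MPPI planner looks roughly as follows (a minimal sketch assuming learned dynamics and cost functions; SPPO additionally applies its shielding rule before executing the planned actions, which is omitted here):

```python
import numpy as np

def mppi(dynamics, cost, state, horizon=20, n_rollouts=256, lam=1.0, sigma=0.3, act_dim=2):
    """Model predictive path integral control: perturb a nominal action sequence,
    roll each perturbation out through the (learned) dynamics, and re-weight the
    perturbations by the exponentiated negative trajectory cost."""
    nominal = np.zeros((horizon, act_dim))
    noise = np.random.normal(0.0, sigma, size=(n_rollouts, horizon, act_dim))
    costs = np.zeros(n_rollouts)
    for k in range(n_rollouts):
        s = state
        for t in range(horizon):
            a = nominal[t] + noise[k, t]
            costs[k] += cost(s, a)
            s = dynamics(s, a)
    weights = np.exp(-(costs - costs.min()) / lam)
    weights /= weights.sum()
    return nominal + np.tensordot(weights, noise, axes=1)  # importance-weighted plan, shape (horizon, act_dim)
```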
Article
Full-text available
Reinforcement learning (RL) aims at searching for the best policy model for decision making, and has been shown powerful for sequential recommendations. Training a policy by RL, however, requires an environment to interact with. In many real-world applications, policy training in the real environment can cause an unbearable cost due to the exploration. Environment estimation from past data is thus an appealing way to unleash the power of RL in these applications. Estimating the environment is, essentially, extracting the causal effect model from the data. However, real-world applications are often too complex to offer fully observable environment information. Therefore, quite possibly there are unobserved variables lying behind the data, which can obstruct an effective estimation of the environment. In this paper, by treating the hidden variables as a hidden policy, we propose a partially-observed multi-agent environment estimation (POMEE) approach to learn the partially-observed environment. To better extract the causal relationship between actions and rewards, we design a deep uplift inference network (DUIN) model to learn the causal effects of different actions. By implementing the environment model in the DUIN structure, we propose a POMEE with uplift inference (POMEE-UI) approach to generate a partially-observed environment with a causal reward mechanism. We analyze the effect of our method in both artificial and real-world environments. We first use an artificial recommender environment, abstracted from a real-world application, to verify the effectiveness of POMEE-UI. We then test POMEE-UI in the real application of Didi Chuxing. Experiment results show that POMEE-UI can effectively estimate the hidden variables, leading to a more reliable virtual environment. The online A/B testing results show that POMEE can derive a well-performing recommender policy in the real-world application.
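For context, uplift inference estimates the incremental causal effect of an action on the outcome; in its standard two-treatment form (a generic definition, not the paper's exact estimator), the uplift for context x is

$$u(x) = \mathbb{E}[\,Y \mid X = x,\ T = 1\,] - \mathbb{E}[\,Y \mid X = x,\ T = 0\,],$$

where Y is the outcome (reward), X the observed context, and T the treatment (action) indicator; DUIN learns such per-action causal effects with a neural network so that the estimated environment can assign rewards through a causal mechanism.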
Article
Full-text available
Recent model-free reinforcement learning algorithms have proposed incorporating learned dynamics models as a source of additional data with the intention of reducing sample complexity. Such methods hold the promise of incorporating imagined data coupled with a notion of model uncertainty to accelerate the learning of continuous control tasks. Unfortunately, they rely on heuristics that limit usage of the dynamics model. We present model-based value expansion, which controls for uncertainty in the model by only allowing imagination to a fixed depth. By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, we improve value estimation, which, in turn, reduces the sample complexity of learning.
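Concretely, the H-step value expansion target rolls the learned model forward for a fixed number of steps and caps the rollout with the learned critic (notation ours):

$$\hat{V}_H(s_t) = \sum_{i=0}^{H-1} \gamma^{i}\, \hat{r}_{t+i} + \gamma^{H}\, \hat{V}_\phi(\hat{s}_{t+H}),$$

where \hat{s}_{t+i} and \hat{r}_{t+i} are imagined states and rewards produced by the learned dynamics model starting from \hat{s}_t = s_t, and the fixed depth H bounds how far model error can compound into the value estimate.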
Article
Full-text available
In this paper, we introduce Path Integral Networks (PI-Net), a recurrent network representation of the Path Integral optimal control algorithm. The network includes both system dynamics and cost models, used for optimal-control-based planning. PI-Net is fully differentiable, learning both dynamics and cost models end-to-end by back-propagation and stochastic gradient descent. Because of this, PI-Net can learn to plan. PI-Net has several advantages: it can generalize to unseen states thanks to planning, it can be applied to continuous control tasks, and it allows for a wide variety of learning schemes, including imitation and reinforcement learning. Preliminary experimental results show that PI-Net, trained by imitation learning, can mimic control demonstrations for two simulated problems: a linear system and a pendulum swing-up problem. We also show that PI-Net is able to learn dynamics and cost models latent in the demonstrations.
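The iterative path integral control rule that PI-Net unrolls as a differentiable recurrent computation has the generic form (our notation):

$$u_t \leftarrow u_t + \sum_{k=1}^{K} w_k\, \delta u_t^{(k)}, \qquad w_k = \frac{\exp(-S_k / \lambda)}{\sum_{j=1}^{K} \exp(-S_j / \lambda)},$$

where \delta u^{(k)} are sampled control perturbations, S_k is the cost of the k-th simulated trajectory under the dynamics and cost models, and \lambda is a temperature; since every operation in this update is differentiable, both models can be trained end-to-end through the planner.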
Article
In sequential decision-making, imitation learning (IL) trains a policy efficiently by mimicking expert demonstrations. Various imitation methods have been proposed and empirically evaluated; meanwhile, their theoretical understanding needs further study, and the compounding error in long-horizon decisions is a major issue. In this paper, we first analyze the value gap between the expert policy and imitated policies for two imitation methods, behavioral cloning (BC) and generative adversarial imitation. The results support that generative adversarial imitation can reduce the compounding error compared to BC. Furthermore, we establish lower bounds of IL under two settings, suggesting the significance of environment interactions in IL. By considering the environment transition model as a dual agent, IL can also be used to learn the environment model. Therefore, based on the bounds of imitating policies, we further analyze the performance of imitating environments. The results show that environment models can be more effectively imitated by generative adversarial imitation than by BC. In particular, we obtain a policy evaluation error that is linear in the effective planning horizon w.r.t. the model bias, suggesting a novel application of adversarial imitation for model-based reinforcement learning (MBRL). We hope these results can inspire future advances in IL and MBRL.
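As a rough illustration of the shape of such bounds (simplified; constants, error measures, and the paper's precise statements differ), with effective horizon H = 1/(1 - \gamma) and per-step imitation error \epsilon:

$$\big|V^{\pi_E} - V^{\pi_{\mathrm{BC}}}\big| = O(\epsilon H^{2}), \qquad \big|V^{\pi_E} - V^{\pi_{\mathrm{GAIL}}}\big| = O(\epsilon H),$$

i.e., behavioral cloning can compound errors quadratically in the horizon while adversarial imitation scales linearly, which is the property the abstract carries over to imitating environment models for MBRL.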
Conference Paper
Reinforcement learning is a major tool for realizing intelligent agents that can autonomously adapt to the environment. With deep models, reinforcement learning has shown great potential in complex tasks such as playing games from pixels. However, current reinforcement learning techniques still suffer from requiring a huge amount of interaction data, which could result in unbearable costs in real-world applications. In this article, we share our understanding of the problem, and discuss possible ways to alleviate the sample cost of reinforcement learning, from the aspects of exploration, optimization, environment modeling, experience transfer, and abstraction. We also discuss some challenges in real-world applications, with the hope of inspiring future research.
Conference Paper
We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
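The trust-region update at the heart of TRPO maximizes a surrogate objective subject to a KL-divergence constraint on the policy change (standard formulation):

$$\max_\theta \ \mathbb{E}_{s, a \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A^{\pi_{\theta_{\mathrm{old}}}}(s, a) \right] \quad \text{s.t.} \quad \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big) \right] \le \delta,$$

where A is the advantage under the old policy and \delta bounds the average policy change per update; in practice TRPO solves this approximately with a natural-gradient step and a line search.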
Article
The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
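The core update behind the deep Q-network is the one-step TD regression loss with a replay buffer D and a periodically updated target network with parameters \theta^- (standard formulation):

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \Big( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \Big)^{2} \right],$$

where sampling minibatches uniformly from the replay buffer decorrelates the updates and the frozen target network stabilizes the bootstrapped target.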