Distributional Soft Actor-Critic: Off-Policy
Reinforcement Learning for Addressing Value
Estimation Errors
Jingliang Duan, Yang Guan, Shengbo Eben Li*, Yangang Ren, Qi Sun, and Bo Cheng

This study is supported by Beijing NSF under JQ18010, and NSF China under 51575293 and U20A20334. Special thanks are given to TOYOTA for funding this study. Jingliang Duan and Yang Guan contributed equally to this work. All correspondence should be sent to S. Li (email: lisb04@gmail.com). J. Duan, Y. Guan, S. Li, Y. Ren, Q. Sun, and B. Cheng are with the State Key Lab of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China. They are also with the Center for Intelligent Connected Vehicles and Transportation, Tsinghua University. Email: duanjl15@163.com; (guany17, ryg18)@mails.tsinghua.edu.cn; (lishbo, qisun, chengbo)@tsinghua.edu.cn.
Abstract—In reinforcement learning (RL), function approximation errors are known to easily lead to Q-value overestimations, thus greatly reducing policy performance. This paper presents a distributional soft actor-critic (DSAC) algorithm, an off-policy RL method for continuous control settings, which improves policy performance by mitigating Q-value overestimations. We first show theoretically that learning a distribution function of state-action returns can effectively mitigate Q-value overestimations because it adaptively adjusts the update stepsize of the Q-value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution by keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.

Index Terms—Reinforcement learning, overestimation, distributional soft actor-critic (DSAC).
I. INTRODUCTION
DEEP neural networks (NNs) provide rich representations
that can enable reinforcement learning (RL) algorithms
to master a variety of challenging domains, from games to
robotic control [1]–[5]. However, most RL algorithms tend to
learn unrealistically high state-action values (i.e., Q-values),
known as overestimations, thereby resulting in suboptimal
policies.
The overestimations of RL were first found in the Q-
learning algorithm [6], which is the prototype of most existing
value-based RL algorithms [7]. For this algorithm, van Hasselt
et al. (2016) demonstrated that any kind of estimation errors
can induce an upward bias, irrespective of whether these errors
are caused by system noise, function approximation, or any
other sources [8]. The overestimation bias is first induced by the max operator over all noisy Q-estimates of the same state, which tends to prefer overestimated to underestimated Q-values [9]–[11]. This overestimation bias is further propagated and exaggerated through temporal difference learning [7], wherein the Q-estimate of a state is updated using
the Q-estimate of its subsequent state. Deep RL algorithms,
such as Deep Q-Networks (DQN) [1], employ a deep NN to
estimate the Q-value. Although the deep NN can provide rich
representations with the potential for low asymptotic approxi-
mation errors, overestimations still exist, even in deterministic
environments [8], [12]. Fujimoto et al. (2018) showed that the
overestimation problem also persists in actor-critic RL [12],
such as Deterministic Policy Gradient (DPG) and Deep DPG
(DDPG) [13], [14]. In practice, inaccurate estimation exists
in almost all RL algorithms because, on the one hand, any
algorithm will introduce some estimation biases and variances,
simply because the true Q-values are initially unknown [7].
On the other hand, function approximation errors are usually
unavoidable. This is particularly problematic because inaccu-
rate estimation can cause arbitrarily suboptimal actions to be
overestimated, resulting in a suboptimal policy.
To reduce overestimations in standard Q-learning, Double
Q-learning [15] was developed to decouple the max operation
into action selection and evaluation. To update one of these two
Q-networks, one Q-network is used to determine the greedy
policy, while another Q-network is used to determine its value,
resulting in unbiased estimates. Double DQN [8], a deep
variant of Double Q-learning, deals with the overestimation
problem of DQN, in which the target Q-network of DQN pro-
vides a natural candidate for the second Q-network. However,
these two methods can only handle discrete action spaces.
Fujimoto et al. (2018) developed actor-critic variants of the
standard Double DQN and Double Q-learning for continuous
control, by making action selections using the policy optimized
with respect to the corresponding Q-estimate [12]. However,
the actor-critic Double DQN suffers from similar overestima-
tions as DDPG, because the online and target Q-estimates
are too similar to provide an independent estimation. While
actor-critic Double Q-learning is more effective, it introduces
additional Q and policy networks at the cost of increasing
the computation time for each iteration. Finally, Fujimoto et
al. (2018) proposed Clipped Double Q-learning by taking
the minimum value between the two Q-estimates [12], which
is used in Twin Delayed Deep Deterministic policy gradient
(TD3) and Soft Actor-Critic (SAC) [16], [17]. However, this
method may introduce a considerable underestimation bias and
still requires an additional Q-network.
In this paper, we propose a new RL algorithm, called
distributional soft actor-critic (DSAC), to improve policy per-
formance by mitigating Q-value overestimations. The contri-
butions and novelty of this paper are summarized as follows:
1) A distributional soft policy iteration (DSPI) framework is
developed by embedding the return distribution function
in maximum entropy RL to learn a continuous distribution
of state-action returns (also called return distribution).
The impact of the return distribution learning on the
accuracy of Q-value estimation was barely discussed in
existing distributional RL algorithms, such as [18]–[23].
In this paper, we first show that Q-value overestimations can be mitigated by learning a distribution function of state-action returns. This is because, compared with most RL algorithms that directly learn the expectation of state-action returns (i.e., the Q-value) [1], [3], [8], [12], [14], [16], the return distribution learning is capable of adaptively adjusting the update stepsize of Q-values.
2) Based on the developed DSPI framework, we propose
the DSAC algorithm by replacing the clipped double Q-
learning of SAC [16], [17] with the return distribution
learning. In comparison with RL algorithms that use
double value networks to mitigate overestimations [8],
[12], [15]–[17], DSAC improves the Q-value estimation
accuracy by only employing a single return distribution
network, which also leads to higher time efficiency.
3) Different from existing distributional RL algorithms that
learn a discrete return distribution [18]–[23], the pro-
posed DSAC is capable of learning a continuous return
distribution by keeping the variance of the state-action
returns within a reasonable range to address exploding
and vanishing gradient problems. Therefore, DSAC re-
laxes the need for human-designed discrete ranges and
intervals. Besides, compared with most distributional
RL algorithms that can only handle discrete and low-
dimensional action spaces [18]–[22], DSAC is applicable
to continuous control settings by optimizing an indepen-
dent stochastic policy network.
4) Experiments on MuJoCo benchmarks demonstrate that
the proposed DSAC algorithm outperforms or matches
all baselines across all benchmark tasks in terms of the
final performance.
The paper is organized as follows. In Section II, we intro-
duce the related works. Section III describes some preliminar-
ies of RL and develops a DSPI framework. In Section IV, we
analyze the role of the distributional return function in solving
overestimations. Section V presents the DSAC algorithm and
PABAL architecture. In Section VI, we present experimental
results that show the efficacy of DSAC. Section VII concludes
this paper.
II. RELATED WORK
Over the last decade, numerous deep RL algorithms have
appeared [1], [3], [12], [14], [16], [23]–[26]. This paper aims
to propose a new RL algorithm to mitigate Q-value overes-
timations by learning a distribution of state-action returns,
thereby improving policy performance. We also incorporate
the off-policy formulation to improve sample efficiency, and
the maximum entropy framework based on the stochastic
policy to encourage exploration. Besides, our algorithm mainly
focuses on continuous control settings. With reference to al-
gorithms such as DDPG [14], the off-policy learning and
continuous control can be easily enabled by learning separate
Q and policy networks in an actor-critic architecture. There-
fore, we mainly review prior works on the maximum entropy
framework and distributional RL in this section.
Maximum entropy RL favors stochastic policies by aug-
menting the optimization objective with the expected policy
entropy. While many prior RL algorithms consider the policy
entropy, they only use it as a regularizer [3], [24], [25].
Recently, several papers have noted the connection between
Q-learning and policy gradient methods in the setting of the
maximum entropy framework [27]–[29]. Early maximum en-
tropy RL algorithms usually only consider the policy entropy
of current states [27], [30], [31]. Unlike them, soft Q-learning
directly augments the reward with an entropy term, such that
the optimal policy aims to reach states where they will have
high policy entropy in the future [32]. Haarnoja et al. (2018)
further developed an off-policy actor-critic variant of the Soft
Q-learning for large continuous domains, called SAC [16],
[17]. In this paper, we build on the work of [16], [17] for
implementing the maximum entropy framework.
Distributional RL, in which one models the distribution over returns whose expectation is the value function, was recently introduced by Bellemare et al. [18]. They proposed
a distributional RL algorithm, called C51, which achieved
great performance improvements on many Atari 2600 bench-
marks. Since then, many distributional RL algorithms and related analyses have appeared in the literature [19]–[22].
Like DQN, these works can only handle discrete and low-
dimensional action spaces, as they select actions according
to their Q-networks. Barth-Maron et al. (2018) combined the distributional return function with an actor-critic framework for policy learning in continuous control domains, and proposed the Distributed Distributional Deep Deterministic Policy Gradient (D4PG) algorithm [23]. Inspired by this line of distributional RL research, Dabney et al. (2020) found through mouse experiments that the brain represents possible future rewards not as a single mean, but as a probability distribution [33]. Existing distributional RL algorithms usually learn
a discrete return distribution because it is computationally
friendly. However, this poses a problem: we need to divide the
return distribution into multiple discrete intervals in advance.
This is inconvenient because different tasks usually require
different division numbers and intervals. In addition, the role of the distributional return function in reducing overestimations has barely been discussed before.
III. PRELIMINARIES AND DISTRIBUTIONAL SOFT POLICY ITERATION
In this section, we first describe the notations and introduce
the concept of maximum entropy RL. Then the distributional
soft policy iteration (DSPI) framework is developed.
A. Notation
We consider the standard reinforcement learning (RL) setting wherein an agent interacts with an environment $\mathcal{E}$ in discrete time. This environment can be modeled as a Markov Decision Process, defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, p)$. The state space $\mathcal{S}$ and action space $\mathcal{A}$ are assumed to be continuous, $\mathcal{R}(r_t|s_t,a_t): \mathcal{S}\times\mathcal{A}\rightarrow\mathcal{P}(r_t)$ is a stochastic reward function mapping a state-action pair $(s_t,a_t)$ to a distribution over a set of bounded rewards, and the unknown state transition probability $p(s_{t+1}|s_t,a_t): \mathcal{S}\times\mathcal{A}\rightarrow\mathcal{P}(s_{t+1})$ maps a given $(s_t,a_t)$ to the probability distribution over $s_{t+1}$. For the sake of simplicity, the current and next state-action pairs are also denoted as $(s,a)$ and $(s',a')$, respectively.

At each time step $t$, the agent receives a state $s_t\in\mathcal{S}$ and selects an action $a_t\in\mathcal{A}$. In return, the agent receives the next state $s_{t+1}\in\mathcal{S}$ and a scalar reward $r_t\sim\mathcal{R}(\cdot|s_t,a_t)$. The process continues until the agent reaches a terminal state, after which the process restarts. The agent's behavior is defined by a stochastic policy $\pi(a_t|s_t): \mathcal{S}\rightarrow\mathcal{P}(a_t)$, which maps a given state to a probability distribution over actions. We will use $\rho_\pi(s)$ and $\rho_\pi(s,a)$ to denote the state and state-action distribution induced by policy $\pi$.
B. Maximum Entropy RL
The goal in standard RL is to learn a policy which maximizes the expected future accumulated return $\mathbb{E}_{(s_{i\ge t},a_{i\ge t})\sim\rho_\pi,\, r_{i\ge t}\sim\mathcal{R}(\cdot|s_i,a_i)}\big[\sum_{i=t}^{\infty}\gamma^{i-t}r_i\big]$, where $\gamma\in[0,1)$ is the discount factor. In this paper, we consider a more general entropy-augmented objective [16], [17], [32], which augments the reward with a policy entropy term $\mathcal{H}$,

$$J_\pi = \mathbb{E}_{(s_{i\ge t},a_{i\ge t})\sim\rho_\pi,\, r_{i\ge t}\sim\mathcal{R}(\cdot|s_i,a_i)}\Big[\sum_{i=t}^{\infty}\gamma^{i-t}\big[r_i + \alpha\mathcal{H}(\pi(\cdot|s_i))\big]\Big], \quad (1)$$

where

$$\mathcal{H}(\pi(\cdot|s)) = -\int_{a\in\mathcal{A}} \pi(a|s)\log\pi(a|s)\,\mathrm{d}a = \mathbb{E}_{a\sim\pi(\cdot|s)}\big[-\log\pi(a|s)\big].$$

This objective improves the exploration efficiency of the policy by maximizing both the expected future return and the policy entropy. The temperature parameter $\alpha$ determines the relative importance of the entropy term against the reward. Maximum entropy RL gradually approaches conventional RL as $\alpha\rightarrow 0$.
We use $G_t = \sum_{i=t}^{\infty}\gamma^{i-t}\big[r_i - \alpha\log\pi(a_i|s_i)\big]$ to denote the entropy-augmented accumulated return from $s_t$, also called the soft return. The soft Q-value of policy $\pi$ is defined as

$$Q^\pi(s_t,a_t) = \mathbb{E}_{r\sim\mathcal{R}(\cdot|s_t,a_t)}[r] + \gamma\,\mathbb{E}_{(s_{i>t},a_{i>t})\sim\rho_\pi,\, r_{i>t}\sim\mathcal{R}(\cdot|s_i,a_i)}[G_{t+1}], \quad (2)$$

which describes the expected soft return for selecting $a_t$ in state $s_t$ and thereafter following policy $\pi$.

The optimal maximum entropy policy is learned by a maximum entropy variant of the policy iteration method, which alternates between soft policy evaluation and soft policy improvement, called soft policy iteration. In the soft policy evaluation process, given a policy $\pi$, the soft Q-value can be learned by repeatedly applying a soft Bellman operator $\mathcal{T}^\pi$ under policy $\pi$ given by

$$\mathcal{T}^\pi Q^\pi(s,a) = \mathbb{E}_{r\sim\mathcal{R}(\cdot|s,a)}[r] + \gamma\,\mathbb{E}_{s'\sim p,\,a'\sim\pi}\big[Q^\pi(s',a') - \alpha\log\pi(a'|s')\big]. \quad (3)$$

The goal of the soft policy improvement process is to find a new policy $\pi_{\mathrm{new}}$ that is better than the current policy $\pi_{\mathrm{old}}$, such that $J_{\pi_{\mathrm{new}}} \ge J_{\pi_{\mathrm{old}}}$. Hence, we can update the policy directly by maximizing the entropy-augmented objective in (1) in terms of the soft Q-value,

$$\pi_{\mathrm{new}} = \arg\max_{\pi} J_\pi = \arg\max_{\pi}\,\mathbb{E}_{s\sim\rho_\pi,\,a\sim\pi}\big[Q^{\pi_{\mathrm{old}}}(s,a) - \alpha\log\pi(a|s)\big]. \quad (4)$$
The convergence and optimality of soft policy iteration have
been verified in [16], [17], [28], [32].
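To make the soft Bellman backup in (3) concrete, the following minimal Python sketch applies soft policy evaluation to a hypothetical two-state, two-action MDP. The transition matrix, rewards, policy, and temperature below are illustrative assumptions, not values from this paper:

import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative values only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                   # R[s, a] expected reward
              [0.5, 2.0]])
pi = np.array([[0.7, 0.3],                  # pi[s, a] stochastic policy
               [0.4, 0.6]])
gamma, alpha = 0.99, 0.2                    # discount factor and temperature

def soft_bellman_backup(Q):
    """One application of the soft Bellman operator T^pi in (3)."""
    # Soft state value: V(s') = E_{a'~pi}[Q(s', a') - alpha * log pi(a'|s')].
    V = np.sum(pi * (Q - alpha * np.log(pi)), axis=1)
    # T^pi Q(s, a) = E[r] + gamma * E_{s'~p}[V(s')].
    return R + gamma * P @ V

# Repeated application converges to the soft Q-value of pi (soft policy evaluation).
Q = np.zeros_like(R)
for _ in range(1000):
    Q = soft_bellman_backup(Q)
print(Q)

Iterating this backup is exactly the soft policy evaluation step; alternating it with the policy improvement step (4) gives soft policy iteration.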
C. Distributional Soft Policy Iteration
Next, we develop the distributional soft policy iteration (DSPI) framework by extending maximum entropy RL into a distributional learning version. Firstly, we define the soft state-action return of policy $\pi$ from a state-action pair $(s_t,a_t)$ as

$$Z^\pi(s_t,a_t) = r_t + \gamma G_{t+1}, \quad (s_{i>t},a_{i>t})\sim\rho_\pi,\ r_{i\ge t}\sim\mathcal{R}(\cdot|s_i,a_i),$$

which is usually a random variable due to the randomness in the state transition $p$, the reward function $\mathcal{R}$, and the policy $\pi$. From (2), it is clear that

$$Q^\pi(s,a) = \mathbb{E}[Z^\pi(s,a)]. \quad (5)$$

Instead of just considering the expected state-action return $Q^\pi(s,a)$, one can choose to directly model the distribution of the soft returns $Z^\pi(s,a)$. We define $\mathcal{Z}^\pi(Z^\pi(s,a)|s,a): \mathcal{S}\times\mathcal{A}\rightarrow\mathcal{P}(Z^\pi(s,a))$ as a mapping from $(s,a)$ to a distribution over soft state-action returns, and call it the soft state-action return distribution or distributional value function. The distributional variant of the Bellman operator in the maximum entropy framework can be derived as

$$\mathcal{T}^\pi_{\mathcal{D}} Z^\pi(s,a) \stackrel{D}{=} r + \gamma\big(Z^\pi(s',a') - \alpha\log\pi(a'|s')\big), \quad (6)$$

where $r\sim\mathcal{R}(\cdot|s,a)$, $s'\sim p$, $a'\sim\pi$, and $A \stackrel{D}{=} B$ denotes that two random variables $A$ and $B$ have equal probability laws. The distributional variant of policy iteration has been proved to converge to the optimal return distribution and policy uniformly in [18]. We can further prove that DSPI, which alternates between (6) and (4), also leads to policy improvement with respect to the maximum entropy objective (1). Details are provided in Appendix A.

Suppose $\mathcal{T}^\pi_{\mathcal{D}} Z(s,a) \sim \mathcal{T}^\pi_{\mathcal{D}}\mathcal{Z}(\cdot|s,a)$, where $\mathcal{T}^\pi_{\mathcal{D}}\mathcal{Z}(\cdot|s,a)$ denotes the distribution of $\mathcal{T}^\pi_{\mathcal{D}} Z(s,a)$. To implement (6), we can directly update the soft return distribution by

$$\mathcal{Z}_{\mathrm{new}} = \arg\min_{\mathcal{Z}}\,\mathbb{E}_{(s,a)\sim\rho_\pi}\big[d\big(\mathcal{T}^\pi_{\mathcal{D}}\mathcal{Z}_{\mathrm{old}}(\cdot|s,a),\, \mathcal{Z}(\cdot|s,a)\big)\big], \quad (7)$$

where $d$ is some metric to measure the distance between two distributions. For calculation convenience, many practical distributional RL algorithms employ the Kullback-Leibler (KL) divergence, denoted as $D_{\mathrm{KL}}$, as the metric [18], [23].
IV. OVERESTIMATION BIAS
This section mainly focuses on the impact of the state-action return distribution learning on reducing overestimation. Therefore, the entropy coefficient $\alpha$ is assumed to be $0$ here. Previous studies analyzed the Q-value estimation bias of Q-learning in tabular cases [6], [15]. In Section IV-A, we derive the analytical expression of the Q-value estimation bias from the perspective of function approximation. Then, Section IV-B analyzes the Q-estimate bias of the return distribution learning and reveals its mechanism for mitigating overestimations.
A. Overestimation in Q-learning
In Q-learning with discrete actions, suppose the Q-value is approximated by a Q-function $Q_\theta(s,a)$ with parameters $\theta$. Defining the greedy target $y = \mathbb{E}[r] + \gamma\,\mathbb{E}_{s'}[\max_{a'}Q_\theta(s',a')]$, the Q-estimate $Q_\theta(s,a)$ can be updated by minimizing the loss $(y - Q_\theta(s,a))^2/2$ using gradient descent methods, i.e.,

$$\theta_{\mathrm{new}} = \theta + \beta\big(y - Q_\theta(s,a)\big)\nabla_\theta Q_\theta(s,a), \quad (8)$$

where $\beta$ is the learning rate. However, in practical applications, the Q-estimate $Q_\theta(s,a)$ usually contains random errors, which may be caused by system noises and function approximation. Denoting the current true Q-value as $\tilde{Q}$, we assume

$$Q_\theta(s,a) = \tilde{Q}(s,a) + \epsilon_Q, \quad (9)$$

where the random error $\epsilon_Q$ has zero mean and is independent of $(s,a)$ and $\theta$. To distinguish the random error of $Q_\theta(s,a)$ from that of $Q_\theta(s',a')$, the random error of $Q_\theta(s',a')$ is denoted as $\epsilon'_Q$. Clearly, $\epsilon'_Q$ may cause inaccuracy on the right-hand side of (8). Let $\theta_{\mathrm{true}}$ represent the post-update parameters obtained based on the true target $\tilde{y}$, that is,

$$\theta_{\mathrm{true}} = \theta + \beta\big(\tilde{y} - Q_\theta(s,a)\big)\nabla_\theta Q_\theta(s,a),$$

where $\tilde{y} = \mathbb{E}[r] + \gamma\,\mathbb{E}_{s'}[\max_{a'}\tilde{Q}(s',a')]$.

Supposing $\beta$ is sufficiently small, the post-update Q-function can be well-approximated by linearizing around $\theta$ using Taylor's expansion:

$$Q_{\theta_{\mathrm{true}}}(s,a) \approx Q_\theta(s,a) + \beta\big(\tilde{y} - Q_\theta(s,a)\big)\|\nabla_\theta Q_\theta(s,a)\|_2^2,$$
$$Q_{\theta_{\mathrm{new}}}(s,a) \approx Q_\theta(s,a) + \beta\big(y - Q_\theta(s,a)\big)\|\nabla_\theta Q_\theta(s,a)\|_2^2.$$

Then, in expectation, the estimation bias of the post-update Q-estimate $Q_{\theta_{\mathrm{new}}}(s,a)$ is

$$\begin{aligned}
\Delta(s,a) &= \mathbb{E}_{\epsilon'_Q}\big[Q_{\theta_{\mathrm{new}}}(s,a) - Q_{\theta_{\mathrm{true}}}(s,a)\big]\\
&\approx \beta\big(\mathbb{E}_{\epsilon'_Q}[y] - \tilde{y}\big)\|\nabla_\theta Q_\theta(s,a)\|_2^2\\
&= \beta\gamma\Big(\mathbb{E}_{\epsilon'_Q}\big[\mathbb{E}_{s'}[\max_{a'}Q_\theta(s',a')]\big] - \mathbb{E}_{s'}[\max_{a'}\tilde{Q}(s',a')]\Big)\|\nabla_\theta Q_\theta(s,a)\|_2^2.
\end{aligned}$$

Defining

$$\begin{aligned}
\delta &= \mathbb{E}_{\epsilon'_Q}\big[\mathbb{E}_{s'}[\max_{a'}Q_\theta(s',a')]\big] - \mathbb{E}_{s'}[\max_{a'}\tilde{Q}(s',a')]\\
&= \mathbb{E}_{s'}\Big[\mathbb{E}_{\epsilon'_Q}[\max_{a'}Q_\theta(s',a')] - \max_{a'}\tilde{Q}(s',a')\Big]\\
&= \mathbb{E}_{s'}\Big[\mathbb{E}_{\epsilon'_Q}\big[\max_{a'}\big(\tilde{Q}(s',a') + \epsilon'_Q\big)\big] - \max_{a'}\tilde{Q}(s',a')\Big],
\end{aligned} \quad (10)$$

$\Delta(s,a)$ can be rewritten as:

$$\Delta(s,a) \approx \beta\gamma\delta\|\nabla_\theta Q_\theta(s,a)\|_2^2.$$

Although $\epsilon'_Q$ is independent of $(s',a')$, it cannot be extracted from the max operator of $\max_{a'}(\tilde{Q}(s',a') + \epsilon'_Q)$. This is because for each $(s',a')$, $\epsilon'_Q$ is a random variable rather than a fixed value. In fact, it has been verified by previous research that $\mathbb{E}_{\epsilon'_Q}[\max_{a'}(\tilde{Q}(s',a') + \epsilon'_Q)] - \max_{a'}\tilde{Q}(s',a') \ge 0$ [9], [15]. Therefore, it is clear that

$$\Delta(s,a) \ge 0,$$

which indicates that $\Delta(s,a)$ is an upward bias. In fact, any kind of estimation error can induce an upward bias due to the max operator. Although it is reasonable to expect a small upward bias caused by a single update, these overestimation errors can be further exaggerated through temporal difference (TD) learning, which may result in a large overestimation bias and suboptimal policy updates.
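The inequality $\mathbb{E}_{\epsilon'_Q}[\max_{a'}(\tilde{Q} + \epsilon'_Q)] \ge \max_{a'}\tilde{Q}$ behind (10) is easy to verify numerically. The short Python sketch below (with made-up true Q-values and noise scale, purely for illustration) estimates the expected max of noisy Q-estimates by Monte Carlo and shows the resulting upward bias:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true Q-values for 5 actions at some next state s' (illustrative).
Q_true = np.array([1.0, 0.8, 0.5, 0.2, 0.0])
noise_std = 0.5          # std of the zero-mean estimation error eps'_Q
n_samples = 100_000      # Monte Carlo samples

# E[max_a'(Q_true + eps'_Q)]: add independent zero-mean noise, take max, average.
noisy_max = np.max(Q_true + noise_std * rng.standard_normal((n_samples, Q_true.size)), axis=1)
bias = noisy_max.mean() - Q_true.max()

print(f"max_a' Q_true           = {Q_true.max():.3f}")
print(f"E[max_a'(Q_true + eps)] = {noisy_max.mean():.3f}")
print(f"upward bias delta       = {bias:.3f}   (always >= 0)")

Even though the noise has zero mean, the max over noisy estimates is biased upward, which is exactly the source of the overestimation analyzed above.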
B. Return Distribution for Reducing Overestimation
Before discussing the distributional version of Q-learning, we first assume that the random returns $Z(s,a)$ obey a Gaussian distribution $\mathcal{Z}(\cdot|s,a)$. Suppose the mean (i.e., Q-value) and standard deviation of the Gaussian distribution are approximated by two independent functions $Q_\theta(s,a)$ and $\sigma_\psi(s,a)$, with parameters $\theta$ and $\psi$, i.e., $\mathcal{Z}_{\theta,\psi}(\cdot|s,a) = \mathcal{N}(Q_\theta(s,a), \sigma_\psi(s,a)^2)$.

Similar to standard Q-learning, we first define a random greedy target $y_{\mathcal{D}} = r + \gamma Z(s',a'^*)$, where $a'^* = \arg\max_{a'} Q_\theta(s',a')$. Suppose $y_{\mathcal{D}} \sim \mathcal{Z}_{\mathrm{target}}(\cdot|s,a)$, which is also assumed to be a Gaussian distribution. Note that even if $Z(s,a)$ and $y_{\mathcal{D}}$ are not strictly Gaussian, we can still use a Gaussian to approximate their distributions, which will not affect the subsequent analysis. Since $\mathbb{E}[y_{\mathcal{D}}] = \mathbb{E}[r] + \gamma\,\mathbb{E}_{s'}[\max_{a'}Q_\theta(s',a')]$ is equal to $y$ in (8), it follows that $\mathcal{Z}_{\mathrm{target}}(\cdot|s,a) = \mathcal{N}(y, \sigma_{\mathrm{target}}^2)$. Considering the loss function in (7) under the KL divergence measurement, $Q_\theta(s,a)$ and $\sigma_\psi(s,a)$ are updated by minimizing

$$D_{\mathrm{KL}}\big(\mathcal{Z}_{\mathrm{target}}(\cdot|s,a), \mathcal{Z}_{\theta,\psi}(\cdot|s,a)\big) = \log\frac{\sigma_\psi(s,a)}{\sigma_{\mathrm{target}}} + \frac{\sigma_{\mathrm{target}}^2 + (y - Q_\theta(s,a))^2}{2\sigma_\psi(s,a)^2} - \frac{1}{2}, \quad (11)$$

that is,

$$\begin{aligned}
\theta_{\mathrm{new}} &= \theta + \beta\,\frac{y - Q_\theta(s,a)}{\sigma_\psi(s,a)^2}\nabla_\theta Q_\theta(s,a),\\
\psi_{\mathrm{new}} &= \psi + \beta\,\frac{\Delta\sigma^2 + (y - Q_\theta(s,a))^2}{\sigma_\psi(s,a)^3}\nabla_\psi\sigma_\psi(s,a),
\end{aligned} \quad (12)$$

where $\Delta\sigma^2 = \sigma_{\mathrm{target}}^2 - \sigma_\psi(s,a)^2$. Compared with standard Q-learning, $\sigma_\psi(s,a)$ plays the role of adaptively adjusting the update stepsize of $Q_\theta(s,a)$. In particular, the update stepsize of $Q_\theta(s,a)$ decreases quadratically as $\sigma_\psi(s,a)$ increases. Supposing $Q_\theta(s,a)$ also obeys (9), the post-update parameters obtained based on the true target value $\tilde{y}$ are given by

$$\theta_{\mathrm{true}} = \theta + \beta\,\frac{\tilde{y} - Q_\theta(s,a)}{\sigma_\psi(s,a)^2}\nabla_\theta Q_\theta(s,a). \quad (13)$$

Similar to the derivation of $\Delta(s,a)$, the overestimation bias of $Q_{\theta_{\mathrm{new}}}(s,a)$ in distributional Q-learning is

$$\Delta_{\mathcal{D}}(s,a) \approx \frac{\beta\gamma\delta\|\nabla_\theta Q_\theta(s,a)\|_2^2}{\sigma_\psi(s,a)^2} = \frac{\Delta(s,a)}{\sigma_\psi(s,a)^2}. \quad (14)$$

Obviously, the overestimation error $\Delta_{\mathcal{D}}(s,a)$ is inversely proportional to $\sigma_\psi(s,a)^2$. In an ideal situation, when $\tilde{Q}(s,a) = \tilde{y}$, that is, when $\tilde{Q}(s,a)$ has converged after a period of learning, we can derive that

$$\mathbb{E}_{\epsilon_Q,\epsilon'_Q}[\sigma_{\psi_{\mathrm{new}}}(s,a)] \ge \sigma_\psi(s,a) + \beta\,\frac{\sigma_{\mathrm{target}}^2 - \sigma_\psi(s,a)^2 + \gamma^2\delta^2 + \mathbb{E}_{\epsilon_Q}[\epsilon_Q^2]}{\sigma_\psi(s,a)^3}\|\nabla_\psi\sigma_\psi(s,a)\|_2^2,$$

where this inequality holds approximately since we drop higher-order terms in the Taylor approximation. See Appendix B-A for details of the derivation.

Because $\sigma_{\psi_{\mathrm{new}}}$ is also the standard deviation for the next time step, this indicates that by repeatedly applying (12), the standard deviation $\sigma_\psi(s,a)$ of the return distribution tends to become larger in areas with high $\sigma_{\mathrm{target}}$ and random errors $\epsilon_Q$. Moreover, $\sigma_{\mathrm{target}}$ is often positively related to the randomness of the system $p$, the reward function $\mathcal{R}$, and the return distribution $\mathcal{Z}(\cdot|s',a')$ of subsequent state-action pairs. Since the overestimation bias $\Delta_{\mathcal{D}}(s,a)$ is inversely proportional to $\sigma_\psi(s,a)^2$ according to (14), distributional Q-learning can be used to mitigate overestimations caused by task randomness and approximation errors.
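The adaptive stepsize in (12) is simply the gradient of the Gaussian negative log-likelihood with respect to the mean. A minimal Python check (with arbitrary illustrative numbers) makes the $1/\sigma^2$ scaling explicit by comparing the standard Q-learning step (8) with the distributional step (12):

import numpy as np

beta = 0.1            # learning rate
y, Q = 2.0, 1.0       # target and current Q-estimate (illustrative values)
grad_norm_sq = 1.0    # ||grad_theta Q_theta(s, a)||^2, taken as 1 for a scalar parameter

# Standard Q-learning step on the Q-estimate, eq. (8):
dQ_standard = beta * (y - Q) * grad_norm_sq

# Distributional step on the mean, eq. (12): the step is scaled by 1/sigma^2.
for sigma in (0.5, 1.0, 2.0, 4.0):
    dQ_dist = beta * (y - Q) / sigma**2 * grad_norm_sq
    print(f"sigma = {sigma:3.1f}:  standard step = {dQ_standard:.3f},  "
          f"distributional step = {dQ_dist:.4f}")
# Larger return variance -> smaller Q-value step -> smaller overestimation bias, cf. (14).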
V. DISTRIBUTIONAL SOFT ACTOR-CRITIC
In this section, based on the developed DSPI framework, we derive the learning rules of the continuous return distribution, and propose the DSAC algorithm by replacing the clipped double Q-learning of SAC [16], [17] with the return distribution learning. We will consider a parameterized distributional value function $\mathcal{Z}_\theta(\cdot|s,a)$ and a stochastic policy $\pi_\phi(\cdot|s)$, where $\theta$ and $\phi$ are parameters. In this paper, both the state-action return distribution and the policy are modeled as Gaussians with mean and covariance given by neural networks (NNs). We will next derive update rules for the parameters of these NNs.
A. Algorithm
1) Distributional Soft Policy Evaluation: Considering the loss function in (7), the soft state-action return distribution can be trained to minimize

$$J_{\mathcal{Z}}(\theta) = \mathbb{E}_{(s,a)\sim\mathcal{B}}\Big[D_{\mathrm{KL}}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}\mathcal{Z}_{\theta'}(\cdot|s,a),\, \mathcal{Z}_\theta(\cdot|s,a)\big)\Big], \quad (15)$$

where $\mathcal{B}$ is a replay buffer of previously sampled experience, and $\theta'$ and $\phi'$ are the parameters of the target return distribution and target policy, which are used to stabilize the learning process and evaluate the target. For practical applications, $\sigma_{\mathrm{target}}$ in (11) is unknown. Therefore, we cannot directly update $\mathcal{Z}_\theta(\cdot|s,a)$ using the objective shown in (11). After analysis, we obtain the following objective function, equivalent to (15):

$$J_{\mathcal{Z}}(\theta) = -\mathbb{E}_{\substack{(s,a,r,s')\sim\mathcal{B},\,a'\sim\pi_{\phi'},\\ Z(s',a')\sim\mathcal{Z}_{\theta'}(\cdot|s',a')}}\Big[\log\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)\,\big|\,\mathcal{Z}_\theta(\cdot|s,a)\big)\Big].$$

We provide details of the derivation in Appendix B-B.

The parameters $\theta$ can be optimized with the following gradients

$$\nabla_\theta J_{\mathcal{Z}}(\theta) = -\mathbb{E}_{\substack{(s,a,r,s')\sim\mathcal{B},\,a'\sim\pi_{\phi'},\\ Z(s',a')\sim\mathcal{Z}_{\theta'}}}\Big[\nabla_\theta\log\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)\,\big|\,\mathcal{Z}_\theta(\cdot|s,a)\big)\Big].$$

Since $\mathcal{Z}_\theta$ is assumed to be a Gaussian model, it can be expressed as $\mathcal{Z}_\theta(\cdot|s,a) = \mathcal{N}(Q_\theta(s,a), \sigma_\theta(s,a)^2)$, where $Q_\theta(s,a)$ and $\sigma_\theta(s,a)$ are the outputs of the value network. This makes the Gaussian variant of the update gradients

$$\begin{aligned}
\nabla_\theta J_{\mathcal{Z}}(\theta) &= -\mathbb{E}\Bigg[\nabla_\theta\log\frac{\exp\!\big(-\tfrac{(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a) - Q_\theta(s,a))^2}{2\sigma_\theta(s,a)^2}\big)}{\sqrt{2\pi}\,\sigma_\theta(s,a)}\Bigg]\\
&= \mathbb{E}\Bigg[\nabla_\theta\frac{\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a) - Q_\theta(s,a)\big)^2}{2\sigma_\theta(s,a)^2} + \frac{\nabla_\theta\sigma_\theta(s,a)}{\sigma_\theta(s,a)}\Bigg],
\end{aligned}$$

where the expectations are taken over $(s,a,r,s')\sim\mathcal{B}$, $a'\sim\pi_{\phi'}$, $Z(s',a')\sim\mathcal{Z}_{\theta'}$. Denoting $\Psi_{\mathcal{Z}}(\theta) = \log\mathcal{P}(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)\,|\,\mathcal{Z}_\theta(\cdot|s,a))$, to understand the composition of $\nabla_\theta J_{\mathcal{Z}}(\theta)$ more intuitively, we can rewrite it as

$$\nabla_\theta J_{\mathcal{Z}}(\theta) = \mathbb{E}_{\substack{(s,a,r,s')\sim\mathcal{B},\,a'\sim\pi_{\phi'},\\ Z(s',a')\sim\mathcal{Z}_{\theta'}}}\Big[-\frac{\partial\Psi_{\mathcal{Z}}(\theta)}{\partial Q_\theta(s,a)}\nabla_\theta Q_\theta(s,a) - \frac{\partial\Psi_{\mathcal{Z}}(\theta)}{\partial\sigma_\theta(s,a)}\nabla_\theta\sigma_\theta(s,a)\Big], \quad (16)$$

where

$$\frac{\partial\Psi_{\mathcal{Z}}(\theta)}{\partial Q_\theta(s,a)} = \frac{\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a) - Q_\theta(s,a)}{\sigma_\theta(s,a)^2},$$
$$\frac{\partial\Psi_{\mathcal{Z}}(\theta)}{\partial\sigma_\theta(s,a)} = \frac{\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a) - Q_\theta(s,a)\big)^2}{\sigma_\theta(s,a)^3} - \frac{1}{\sigma_\theta(s,a)}.$$

It can easily be deduced from $\frac{\partial\Psi_{\mathcal{Z}}(\theta)}{\partial Q_\theta(s,a)}$ that the update stepsize of $Q_\theta(s,a)$ decreases quadratically as $\sigma_\theta(s,a)$ increases, thereby mitigating Q-value overestimations. However, the gradients $\nabla_\theta J_{\mathcal{Z}}(\theta)$ are prone to explode as $\sigma_\theta(s,a)\rightarrow 0$, or to vanish as $\sigma_\theta(s,a)\rightarrow\infty$. To address this problem, we propose two options to keep $\sigma_\theta(s,a)$ within a reasonable range. The first is to limit the minimum value of $\sigma_\theta(s,a)$ by

$$\sigma_\theta(s,a) = \max(\sigma_\theta(s,a), \sigma_{\min}). \quad (17)$$

Note that if $\sigma_{\min}\ge 1$, we always have $\Delta_{\mathcal{D}}(s,a)\le\Delta(s,a)$. Therefore, in this paper, we let $\sigma_{\min} = 1$. The second is to clip $\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)$ in $\frac{\partial\Psi_{\mathcal{Z}}(\theta)}{\partial\sigma_\theta(s,a)}$ to keep it close to the expectation $Q_\theta(s,a)$ of the current soft return distribution, thus stabilizing the learning process of $\sigma_\theta(s,a)$ and indirectly controlling its range, i.e.,

$$\frac{\partial\Psi_{\mathcal{Z}}(\theta)}{\partial\sigma_\theta(s,a)} = \frac{\big(\overline{\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)} - Q_\theta(s,a)\big)^2}{\sigma_\theta(s,a)^3} - \frac{1}{\sigma_\theta(s,a)},$$

where

$$\overline{\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)} = \mathrm{clip}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a),\, Q_\theta(s,a) - b,\, Q_\theta(s,a) + b\big), \quad (18)$$

where $\mathrm{clip}(x, A, B)$ denotes that $x$ is clipped into the range $[A, B]$ and $b$ is the clipping boundary.

The target networks mentioned above use a slow-moving update rate, parameterized by $\tau$, such as

$$\theta' \leftarrow \tau\theta + (1-\tau)\theta', \quad \phi' \leftarrow \tau\phi + (1-\tau)\phi'.$$
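For concreteness, a minimal PyTorch-style sketch of this distributional critic update is given below. It assumes hypothetical value_net and target_value_net modules mapping (s, a) to the mean and standard deviation of the return distribution, a target_policy whose sample(s') returns an action and its log-probability, and a learnable log_alpha tensor; these names and the mean/std loss split are our own illustrative choices, not the authors' released implementation:

import torch
from torch.distributions import Normal

def dsac_critic_loss(batch, value_net, target_value_net, target_policy,
                     log_alpha, gamma=0.99, sigma_min=1.0, clip_b=10.0):
    """Sketch of the distributional soft policy evaluation loss, cf. (15)-(18)."""
    s, a, r, s_next = batch            # tensors of shape [N, ...]
    alpha = log_alpha.exp().detach()

    with torch.no_grad():
        # Sample a' ~ pi_phi'(.|s') and its log-probability.
        a_next, logp_next = target_policy.sample(s_next)
        # Target return distribution Z_theta'(.|s', a') and a sample Z(s', a').
        q_next, std_next = target_value_net(s_next, a_next)
        z_next = Normal(q_next, std_next.clamp(min=sigma_min)).sample()
        # Distributional soft Bellman target, eq. (6): r + gamma * (Z' - alpha * log pi').
        td_target = r + gamma * (z_next - alpha * logp_next)

    q, std = value_net(s, a)
    std = std.clamp(min=sigma_min)                      # eq. (17)
    # Clip the target used for the std gradient so it stays near Q_theta(s, a), eq. (18).
    target_for_std = torch.max(torch.min(td_target, q.detach() + clip_b),
                               q.detach() - clip_b)

    # Negative log-likelihood form of (15). The loss is split so the mean term sees the
    # unclipped target while the std term sees the clipped one, matching (16)-(18).
    mean_loss = -Normal(q, std.detach()).log_prob(td_target).mean()
    std_loss = -Normal(q.detach(), std).log_prob(target_for_std).mean()
    return mean_loss + std_loss

Summing the two terms reproduces the two gradient components in (16), with the clipping applied only to the standard-deviation term as prescribed by (18).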
2) Distributional Soft Policy Improvement: The policy can be learned by directly maximizing a parameterized variant of the objective in (4):

$$\begin{aligned}
J_\pi(\phi) &= \mathbb{E}_{s\sim\mathcal{B},\,a\sim\pi_\phi}\big[Q_\theta(s,a) - \alpha\log\pi_\phi(a|s)\big]\\
&= \mathbb{E}_{s\sim\mathcal{B},\,a\sim\pi_\phi}\Big[\mathbb{E}_{Z(s,a)\sim\mathcal{Z}_\theta(\cdot|s,a)}[Z(s,a)] - \alpha\log\pi_\phi(a|s)\Big].
\end{aligned}$$

If $a$ is unbounded, given the parameters of the action distribution, such as the mean and variance of a Gaussian distribution, $\log\pi_\phi(a|s)$ can be easily calculated. On the other hand, if $a$ is bounded to a finite interval, its log-likelihood can also be obtained in the manner given in Appendix B-C.

There are several options, such as the log-derivative and reparameterization tricks, for maximizing $J_\pi(\phi)$ [34]. In this paper, we apply the reparameterization trick because it can reduce the gradient estimation variance.

If the soft Q-value function $Q_\theta(s,a)$ is explicitly parameterized through parameters $\theta$, we only need to express the random action $a$ as a deterministic variable, i.e.,

$$a = f_\phi(\xi_a; s), \quad (19)$$

where $\xi_a\in\mathbb{R}^{\dim(\mathcal{A})}$ is an auxiliary variable sampled from some fixed distribution. In particular, since $\pi_\phi(\cdot|s)$ is assumed to be Gaussian in this paper, $f_\phi(\xi_a; s)$ can be formulated as

$$f_\phi(\xi_a; s) = a_{\mathrm{mean}} + \xi_a \odot a_{\mathrm{std}},$$

where $a_{\mathrm{mean}}\in\mathbb{R}^{\dim(\mathcal{A})}$ and $a_{\mathrm{std}}\in\mathbb{R}^{\dim(\mathcal{A})}$ are the mean and standard deviation of $\pi_\phi(\cdot|s)$, $\odot$ represents the Hadamard product, and $\xi_a\sim\mathcal{N}(0, I_{\dim(\mathcal{A})})$. Then the policy update gradients can be approximated with

$$\nabla_\phi J_\pi(\phi) = \mathbb{E}_{s\sim\mathcal{B},\,\xi_a}\Big[-\alpha\nabla_\phi\log\pi_\phi(a|s) + \big(\nabla_a Q_\theta(s,a) - \alpha\nabla_a\log\pi_\phi(a|s)\big)\nabla_\phi f_\phi(\xi_a; s)\Big].$$

If $Q_\theta(s,a)$ cannot be expressed explicitly through $\theta$, the policy update gradients can be obtained in the manner given in Appendix B-D.
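As an illustration of the reparameterization in (19) combined with the bounded-action correction of Appendix B-C, the following PyTorch sketch samples a tanh-squashed Gaussian action (bounded to [-1, 1]) and returns its log-probability; the network sizes, log-std clamp range, and class names are illustrative assumptions, not the exact released architecture:

import torch
import torch.nn as nn
from torch.distributions import Normal

class SquashedGaussianPolicy(nn.Module):
    """pi_phi(.|s): a Gaussian in pre-squash space, squashed to [-1, 1] by tanh."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.GELU(),
                                  nn.Linear(hidden, hidden), nn.GELU())
        self.mean_head = nn.Linear(hidden, action_dim)
        self.log_std_head = nn.Linear(hidden, action_dim)

    def sample(self, s):
        h = self.body(s)
        mean = self.mean_head(h)
        log_std = self.log_std_head(h).clamp(-20, 2)   # keep std in a sane range
        dist = Normal(mean, log_std.exp())
        u = dist.rsample()                             # u = mean + xi * std, eq. (19)
        a = torch.tanh(u)                              # squash into [-1, 1]
        # log pi(a|s) = log mu(u|s) - sum_i log(1 - tanh(u_i)^2), cf. Appendix B-C
        logp = dist.log_prob(u).sum(-1)
        logp -= torch.log(1.0 - a.pow(2) + 1e-6).sum(-1)
        return a, logp

Because rsample keeps the dependence on the network parameters, gradients of $J_\pi(\phi)$ flow through the sampled action, which is exactly the variance-reducing reparameterization trick described above.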
3) Pseudo-code: Finally, according to [17], the temperature $\alpha$ is updated by minimizing the following objective

$$J(\alpha) = \mathbb{E}_{(s,a)\sim\mathcal{B}}\big[\alpha\big(-\log\pi_\phi(a|s) - \overline{\mathcal{H}}\big)\big],$$

where $\overline{\mathcal{H}}$ is the expected entropy. In addition, two-timescale updates, i.e., less frequent policy updates, usually result in higher-quality policy updates [12]. Therefore, the policy, temperature and target networks are updated every $m$ iterations in this paper. The final algorithm is listed in Algorithm 1. Fig. 1 shows the diagram of DSAC.
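A compact sketch of this temperature adaptation, assuming a learnable log_alpha tensor, a batch of policy log-probabilities logp_batch, and the target entropy of -dim(A) listed in Table IV (all other names are ours):

import torch

action_dim = 6                                   # e.g., HalfCheetah-v2
target_entropy = -float(action_dim)              # expected entropy H = -dim(A)
log_alpha = torch.zeros(1, requires_grad=True)   # optimize log(alpha) to keep alpha > 0
alpha_opt = torch.optim.Adam([log_alpha], lr=5e-5)

def update_alpha(logp_batch):
    """One gradient step on J(alpha) = E[alpha * (-log pi(a|s) - H)]."""
    alpha_loss = -(log_alpha.exp() * (logp_batch.detach() + target_entropy)).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()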
Algorithm 1 DSAC Algorithm
Initialize parameters θ, φ, and α
Initialize target parameters θ' ← θ, φ' ← φ
Initialize learning rates βZ, βπ, βα and τ
Initialize iteration index k = 0
repeat
    Select action a ∼ πφ(a|s)
    Observe reward r and new state s'
    Store transition tuple (s, a, r, s') in buffer B
    Sample N transitions (s, a, r, s') from B
    Update soft return distribution θ ← θ − βZ ∇θ JZ(θ)
    if k mod m = 0 then
        Update policy φ ← φ + βπ ∇φ Jπ(φ)
        Adjust temperature α ← α − βα ∇α J(α)
        Update target networks:
            θ' ← τθ + (1 − τ)θ', φ' ← τφ + (1 − τ)φ'
    end if
    k = k + 1
until convergence
Fig. 1. DSAC diagram. The return distribution and policy are approximated by two NNs, called the distributional value network and the policy network, respectively. DSAC first updates the distributional value network based on the samples collected from the buffer. Then, the output of the value network is used to guide the update of the policy network.
B. Architecture
Algorithm 1 and Fig. 1 show the operation process of DSAC
in a serial way. Like most off-policy RL algorithms, we can
use parallel or distributed learning techniques to improve the
learning efficiency of DSAC. Therefore, we build a new parallel asynchronous buffer-actor-learner architecture (PABAL), referring to other high-throughput learning architectures such as IMPALA and Ape-X [3], [35], [36]. As shown in Fig. 2, buffers, actors and learners are all distributed across multiple workers, which are used to improve the efficiency of storage and sampling, exploration, and updating, respectively. All communication between modules is asynchronous.
Both actors and learners asynchronously synchronize the
parameters from the shared memory. The experience generated
by each actor is asynchronously and randomly sent to a
certain buffer at each time step. Each buffer continuously
stores data and sends the sampled experience to a random
learner. Relying on the received sampled data, the learners
calculate the update gradients using their local functions, and
then use these gradients to update the shared value and policy
functions. In this paper, we implement DSAC and other off-
policy baseline algorithms within the PABAL architecture.
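The sketch below illustrates the buffer/actor/learner decomposition with Python threads and queues. It is a deliberately simplified, single-process stand-in for PABAL (the real architecture distributes these modules across worker processes with shared memory for parameters), and all class and function names are ours:

import queue
import random
import threading
import time

experience_q = queue.Queue(maxsize=10_000)   # actors -> buffers
sample_q = queue.Queue(maxsize=100)          # buffers -> learners
replay = []                                  # simple in-memory replay storage

def actor():
    # Interacts with (a stand-in for) the environment and ships transitions out.
    while True:
        transition = (random.random(), random.random(), random.random(), random.random())
        experience_q.put(transition)
        time.sleep(0.001)

def buffer(batch_size=32):
    # Stores incoming experience and keeps the sample queue filled with minibatches.
    while True:
        replay.append(experience_q.get())
        if len(replay) >= batch_size:
            sample_q.put(random.sample(replay, batch_size))

def learner():
    # Pulls sampled minibatches and would compute update gradients here.
    while True:
        batch = sample_q.get()
        _ = len(batch)   # placeholder for gradient computation and parameter update

for target, n in ((actor, 6), (buffer, 3), (learner, 4)):   # 6 actors, 3 buffers, 4 learners
    for _ in range(n):
        threading.Thread(target=target, daemon=True).start()
time.sleep(1.0)   # let the toy pipeline run briefly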
Fig. 2. The PABAL architecture. Buffers, actors, and learners are all distributed across multiple workers. Communication between different modules is asynchronous.
VI. EXPERIMENTAL VERIFICATION
A. Benchmarks
To evaluate our algorithm, we measure its performance and
Q-value estimation bias on a suite of MuJoCo continuous control tasks without modifications to the environments [37], interfaced
through OpenAI Gym [38]. Fig. 3 shows the benchmark tasks
used in this paper. See Appendix C-A for brief descriptions
of these benchmarks.
Fig. 3: Tasks. (a) Humanoid-v2: $(s\times a)\in\mathbb{R}^{376}\times\mathbb{R}^{17}$. (b) HalfCheetah-v2: $(s\times a)\in\mathbb{R}^{17}\times\mathbb{R}^{6}$. (c) Ant-v2: $(s\times a)\in\mathbb{R}^{111}\times\mathbb{R}^{8}$. (d) Walker2d-v2: $(s\times a)\in\mathbb{R}^{17}\times\mathbb{R}^{6}$. (e) InvertedDoublePendulum-v2: $(s\times a)\in\mathbb{R}^{11}\times\mathbb{R}^{1}$.
B. Baselines
We compare our algorithm against Deep Deterministic
Policy Gradient (DDPG) [14], Trust Region Policy Optimiza-
tion (TRPO) [24], Proximal Policy Optimization (PPO) [25],
Distributed Distributional Deep Deterministic Policy Gradients
(D4PG) [23], Twin Delayed Deep Deterministic policy gradi-
ent (TD3) [12], and Soft Actor-Critic (SAC) [17]. DDPG, TRPO,
PPO, D4PG, TD3 and SAC are mainstream RL algorithms,
which have been extensively verified and applied in a variety
of challenging domains. Using these algorithms as baselines,
the performance of the proposed DSAC algorithm can be
evaluated objectively.
We additionally compare our method with our proposed
Twin Delayed Distributional Deep Deterministic policy gra-
dient algorithm (TD4), which is developed by replacing the
clipped double Q-learning in TD3 with the distributional
return learning; Double Q-learning variant of SAC (Double-Q
SAC), in which we replace the clipped double Q-learning of
SAC with the actor-critic variant of double Q-learning [12],
[15]; and single Q-value variant of SAC (Single-Q SAC), in
which we replace the clipped double Q-learning of SAC with
traditional TD learning. See Appendix C-B, C-C and C-D
for detailed descriptions of Double-Q SAC, Single-Q SAC
and TD4 algorithms. Double-Q SAC and Single-Q SAC are
adapted from SAC. Table I gives a basic description of DSAC
and each baseline. It is clear that DSAC, SAC, Double-Q
SAC and Single-Q SAC algorithms respectively use the return
distribution learning, clipped double Q-learning, double Q-
learning and traditional TD learning for policy evaluation. This
is the only difference between these algorithms. Therefore, we
can assess the impact of the distribution learning by comparing
DSAC with SAC, Single-Q SAC and Double-Q SAC. Besides,
we compare DSAC with TD4, which uses the distribution
learning but not maximum entropy, to assess the impact of
policy entropy.
All the off-policy algorithms mentioned above are imple-
mented in the proposed PABAL architecture, including 4
learners, 6 actors and 3 buffers. We use a fully connected
network with 5 hidden layers, consisting of 256 units per
layer, with Gaussian Error Linear Units (GELU) in each layer [39], for both actor and critic. For the distributional value function
and stochastic policy, we use a Gaussian distribution with
mean and covariance given by a NN, where the covariance
matrix is diagonal. In this case, each NN maps the input
states to the mean and logarithm of standard deviation of the
Gaussian distribution. The Adam method [40] with a cosine
annealing learning rate is used to update all the parameters.
All algorithms adopt almost the same NN architecture and
hyperparameters. Table IV in Appendix C-E provides more
detailed hyperparameters of all algorithms.
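As a reference point, a value network matching this description (five hidden layers of 256 GELU units, outputting the mean and the logarithm of the standard deviation of the return distribution) could look like the sketch below; the module and variable names are ours, not the released code:

import torch
import torch.nn as nn

class DistributionalValueNet(nn.Module):
    """Maps (s, a) to the mean and std of the Gaussian return distribution Z_theta."""
    def __init__(self, state_dim, action_dim, hidden=256, n_hidden=5):
        super().__init__()
        layers, in_dim = [], state_dim + action_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(in_dim, hidden), nn.GELU()]
            in_dim = hidden
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, 2)      # outputs [Q_theta(s, a), log sigma_theta(s, a)]

    def forward(self, s, a):
        q, log_std = self.head(self.body(torch.cat([s, a], dim=-1))).chunk(2, dim=-1)
        return q.squeeze(-1), log_std.exp().squeeze(-1)

This matches the value_net interface assumed in the critic-loss sketch of Section V-A.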
C. Results
1) Performance: We train 5 different runs of each algorithm
with different random seeds, with evaluations every 20000
iterations. Each evaluation calculates the average return over
5 episodes without exploration noise, where the maximum
length of each episode is 1000 time steps. The learning curves
are shown in Fig. 4 and results in Table II. Results show
that the proposed DSAC algorithm outperforms or matches all
other baseline algorithms across all benchmark tasks in terms
of the final performance. For example, compared with famous
RL algorithms such as SAC, TD3, PPO, and DDPG, DSAC
gains 20.0%, 63.8%, 39.8%, 97.6% improvements on the
most complex Humanoid-v2 task, respectively. This indicates
TABLE I
BASIC DESCRIPTION OF DSAC AND BASELINES.

Algorithm      | Algorithm Type | Policy Type   | Policy Evaluation                            | Policy Improvement
DSAC (Ours)    | off-policy     | Stochastic    | Continuous soft return distribution learning | Soft policy gradient
SAC [17]       | off-policy     | Stochastic    | Clipped double Q-learning                    | Soft policy gradient
Double-Q SAC   | off-policy     | Stochastic    | Double Q-learning                            | Soft policy gradient
Single-Q SAC   | off-policy     | Stochastic    | Traditional TD learning                      | Soft policy gradient
TD4            | off-policy     | Deterministic | Continuous return distribution learning      | Policy gradient
TD3 [12]       | off-policy     | Deterministic | Clipped double Q-learning                    | Policy gradient
DDPG [14]      | off-policy     | Deterministic | Traditional TD learning                      | Policy gradient
D4PG [23]      | off-policy     | Deterministic | Discrete return distribution learning        | Policy gradient
TRPO [24]      | on-policy      | Stochastic    | Traditional TD learning                      | Constrained Policy Optimization
PPO [25]       | on-policy      | Stochastic    | Traditional TD learning                      | Proximal Policy Optimization
Fig. 4: Training curves on continuous control benchmarks: (a) Humanoid-v2, (b) Ant-v2, (c) Walker2d-v2, (d) HalfCheetah-v2, (e) InvertedDoublePendulum-v2. The solid lines correspond to the mean and the shaded regions correspond to the 95% confidence interval over 5 runs.
that the final performance of DSAC on these benchmarks
exceeds the state of the art. Fig. 5 visually shows the con-
trol performance of DSAC and SAC on Humanoid-v2. It is
obvious that DSAC realizes a movement closer to human
running. Among DSAC, SAC, Single-Q SAC and Double-
Q SAC, DSAC has achieved the best performance on all
tasks, which shows that the return distribution learning is an
important measure to improve policy performance. Besides,
TD4 also outperforms TD3 and DDPG on most tasks, which
shows that algorithms with deterministic policies also benefit
greatly from the return distribution learning. As TD4 exceeds the performance of D4PG, which learns a discrete return distribution, by a wide margin on Humanoid-v2, Ant-v2 and HalfCheetah-v2, this indicates that learning a continuous distribution leads to significant performance improvements in most cases. Compared with TD4, DSAC achieves 33.8%,
22.1%, 10.4%, 8.0% improvements on Humanoid-v2, Ant-
v2, Walker2d-v2, and HalfCheetah-v2, respectively, suggesting
that the maximum entropy framework is an effective measure
to achieve good performance.
2) Q-value Estimation Accuracy: To evaluate the impact of
the return distribution learning on Q-value estimation accuracy,
this section compares the estimation bias of DSAC, SAC,
Double-Q SAC and Single-Q SAC on different benchmarks.
The Q-value estimation bias is equal to the difference between
the Q-value estimate and the true Q-value. To approximate the
true Q-value, we calculate the average actual discounted return
over states of 10 episodes every 20000 iterations (evaluate up
to the first 200 states per episode). Fig. 6 graphs the average Q-
value estimation and true Q-value curves during learning. Ta-
ble III gives the average relative Q-value estimation bias which
equals the Q-value estimation bias divided by the true Q-value.
TABLE II
AVERAGE FINAL RETURN. THE MAXIMUM VALUE FOR EACH TASK IS BOLDED. ± CORRESPONDS TO A SINGLE STANDARD DEVIATION OVER 5 RUNS.

Task         | Humanoid-v2 | Ant-v2    | Walker2d-v2 | HalfCheetah-v2 | InvDoublePendulum-v2
DSAC (Ours)  | 10824±347   | 9547±346  | 6920±405    | 17479±148      | 9359.7±0.2
SAC          | 9019±292    | 7856±416  | 5878±580    | 17300±39       | 9359.6±0.2
Double-Q SAC | 9844±396    | 7682±428  | 5881±227    | 16926±132      | 9359.4±0.6
Single-Q SAC | 8525±488    | 6783±197  | 2176±1251   | 16445±815      | 9355.2±3.6
TD4          | 8090±789    | 7821±262  | 6270±435    | 16187±538      | 9320.2±18.3
TD3          | 6610±1062   | 7828±642  | 4864±512    | 5619±5779      | 9315.5±10.4
DDPG         | 5477±2438   | 6060±747  | 2849±690    | 11214±6861     | 9198.0±13.1
D4PG         | 175±53      | 2367±303  | 6588±260    | 7215±89        | 9300.9±16.3
PPO          | 7743±267    | 5889±111  | 6654±492    | 9517±936       | 9318.7±0.7
TRPO         | 581±56      | 3767±573  | 2870±28     | 3274±346       | 9324.6±2.8
Fig. 5: DSAC vs. SAC on Humanoid-v2. (a) DSAC (Ours). (b) SAC.
Note that this part excludes the InvDoublePendulum-v2 task because, due to its simplicity, a good policy is learned before the value function converges.
Compared with Single-Q SAC, which updates the Q-value using the traditional TD learning method, the relative overestimation bias of DSAC is reduced by 10.53, 5.76, 926.09, and 1.89 percentage points on Humanoid-v2, Ant-v2, Walker2d-v2, and HalfCheetah-v2, respectively. Our results support the theoretical analysis
in Section IV-B, i.e., the return distribution learning can be
used to reduce overestimations without introducing any addi-
tional value or policy network. As a comparison, SAC (which uses clipped double Q-learning) and Double-Q SAC (which uses double Q-learning) suffer from underestimations during the learning
procedure. While the effect of each value learning method
varies from task to task, the Q-value estimation accuracy
of DSAC is higher than SAC and Double-Q SAC in most
cases. This explains why DSAC exceeds Single-Q SAC, SAC,
and Double-Q SAC on most benchmarks by a wide margin.
Therefore, our results demonstrate that the return distribution
learning can greatly improve policy performance by mitigating
overestimations.
3) Time Efficiency: Fig. 7 compares the time efficiency of
different off-policy algorithms. Results show that the average
wall-clock time consumption per 1000 iterations of DSAC
is comparable to DDPG, and much lower than SAC, TD3,
and Double-Q SAC. This is because, unlike double Q-
learning and clipped double Q-learning, the return distribution
learning does not need to introduce any additional value
network or policy network (excluding target networks) to
reduce overestimations.
D. Ablation Studies
As shown in Table IV, compared with SAC, DSAC introduces two hyperparameters: 1) the minimum standard deviation $\sigma_{\min}$ in (17), and 2) the clipping boundary $b$ in (18). These two hyperparameters are employed to prevent exploding and vanishing gradient problems when learning the continuous distributional value function $\mathcal{Z}_\theta(\cdot|s,a)$.
We first take the Ant-v2 task as an example to analyze the influence of $\sigma_{\min}$ on the final performance. From (16), the gradients $\nabla_\theta J_{\mathcal{Z}}(\theta)$ are prone to explode as $\sigma_\theta(s,a)\rightarrow 0$. Therefore, $\sigma_\theta(s,a)$ should be bounded from below by a specific positive value. Besides, according to the analysis in Section IV-B, if $\sigma_{\min}\ge 1$, we always have $\Delta_{\mathcal{D}}(s,a)\le\Delta(s,a)$. But a too large $\sigma_{\min}$ may reduce the estimation accuracy of the return distribution. Therefore, this paper sets $\sigma_{\min} = 1$. Fig. 8a graphs the average final return of DSAC under different $\sigma_{\min}$ values on Ant-v2. Our results show that when $\sigma_{\min} = 1$, DSAC achieves the best final performance on Ant-v2, which is consistent with the above analysis.
We additionally perform an ablation study to compare the performance of DSAC with different clipping boundaries $b$. Our results are presented in Fig. 8b. In this paper, the clipping boundary $b$ is employed to stabilize the learning process of $\sigma_\theta(s,a)$ and keep it in a reasonable range. Results indicate that compared with the performance of removing the clipping boundary trick from DSAC (i.e., $b = +\infty$), the inclusion of $b$ (for different $b$ values) generally improves performance.
Fig. 6: Average true Q-value vs. estimated Q-value on (a) Humanoid-v2, (b) Ant-v2, (c) Walker2d-v2, and (d) HalfCheetah-v2. The solid lines correspond to the mean and the shaded regions correspond to the 95% confidence interval over 5 runs.
TABLE III
AVERAGE RELATIVE Q-VALUE ESTIMATION BIAS OVER 5 RUNS. WE AVERAGE THE RELATIVE ESTIMATION BIAS FROM 1.5 MILLION TO 3 MILLION ITERATIONS FOR EACH RUN. + AND − INDICATE OVERESTIMATION AND UNDERESTIMATION, RESPECTIVELY. ± CORRESPONDS TO A SINGLE STANDARD DEVIATION OVER 5 RUNS.

Algorithm    | Main difference                         | Humanoid-v2    | Ant-v2         | Walker2d-v2       | HalfCheetah-v2
DSAC (Ours)  | Continuous return distribution learning | +5.32%±0.62%   | +3.48%±0.69%   | +17.71%±2.30%     | -0.33%±0.18%
Single-Q SAC | Traditional TD learning                 | +15.85%±1.06%  | +9.24%±5.74%   | +943.80%±683.94%  | +1.56%±1.67%
SAC          | Clipped double Q-learning               | -10.16%±1.37%  | -4.07%±0.66%   | -1.45%±1.06%      | -0.99%±0.66%
Double-Q SAC | Double Q-learning                       | -4.63%±1.70%   | -16.68%±4.21%  | -12.84%±4.03%     | -0.33%±0.32%
Fig. 7. Algorithm comparison in terms of time efficiency on the
Ant-v2 benchmark. Each boxplot is drawn based on values of 50
evaluations. All evaluations were performed on a single computer
with a 2.4 GHz 20 core Intel Xeon CPU.
Therefore, DSAC appears to benefit greatly from the clipping boundary trick. However, the final performance is somewhat sensitive to the value of $b$. This is because a too small $b$ will reduce the learning accuracy of the return distribution, while a too large $b$ cannot effectively limit the range of $\sigma_\theta(s,a)$. In practical applications, it is usually necessary to select an appropriate $b$ value according to the range of the state-action return $Z(s,a)$, which limits the flexibility of the DSAC algorithm. We will focus on this issue in the future.

Fig. 8: Average final return of DSAC under different hyperparameters on Ant-v2 over 5 runs. (a) Performance under different $\sigma_{\min}$ (with $b = 10$). (b) Performance under different $b$ (with $\sigma_{\min} = 1$).
VII. CONCLUSIONS
In this paper, we propose an off-policy RL algorithm for continuous control settings, called distributional soft actor-critic (DSAC), to mitigate Q-value overestimations, thereby improving policy performance. We first show theoretically that the update stepsize of the Q-value function in distributional RL decreases quadratically as the standard deviation of state-action returns increases, thus mitigating Q-value overestimations. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL, which alternates between distributional soft policy evaluation and soft policy improvement. Next, a deep off-policy actor-critic variant of DSPI, i.e., DSAC, is proposed to directly learn a continuous return distribution by keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC and 9 baselines (such as SAC, TD3, PPO, and DDPG) on the suite of MuJoCo tasks. Results show that DSAC outperforms or matches all other baseline algorithms across all benchmarks.
APPENDIX A
PROOF OF CONVERGENCE OF DISTRIBUTIONAL SOFT POLICY ITERATION
In this appendix, we present proofs to show that Distributional Soft Policy Iteration (DSPI), which alternates between (6) and (4), leads to policy improvement with respect to the maximum entropy objective. The proofs borrow heavily from the policy evaluation and policy improvement theorems of Q-learning, distributional RL and soft Q-learning [7], [16], [18].
Lemma 1 (Distributional Soft Policy Evaluation). Consider the distributional soft Bellman backup operator $\mathcal{T}^\pi_{\mathcal{D}}$ in (6) and a soft state-action return distribution function $\mathcal{Z}_0(Z_0(s,a)|s,a): \mathcal{S}\times\mathcal{A}\rightarrow\mathcal{P}(Z_0(s,a))$, which maps a state-action pair $(s,a)$ to a distribution over random soft state-action returns $Z_0(s,a)$, and define $Z_{i+1}(s,a) = \mathcal{T}^\pi_{\mathcal{D}} Z_i(s,a)$, where $Z_{i+1}(s,a)\sim\mathcal{Z}_{i+1}(\cdot|s,a)$. Then the sequence $\mathcal{Z}_i$ will converge to $\mathcal{Z}^\pi$ as $i\rightarrow\infty$.

Proof. Let $\mathbb{Z}$ denote the space of soft return functions $Z$. Define the entropy-augmented reward as $r_\pi(s,a) = r(s,a) - \gamma\alpha\log\pi(a'|s')$ and rewrite the distributional soft Bellman operator as

$$\mathcal{T}^\pi_{\mathcal{D}} Z(s,a) \stackrel{D}{=} r_\pi(s,a) + \gamma Z(s',a'),$$

where $r\sim\mathcal{R}(\cdot|s,a)$, $s'\sim p$, $a'\sim\pi$. Then we can directly apply the standard convergence results for policy evaluation of distributional RL [18], that is, $\mathcal{T}^\pi_{\mathcal{D}}: \mathbb{Z}\rightarrow\mathbb{Z}$ is a $\gamma$-contraction in terms of some measure. Therefore, $\mathcal{T}^\pi_{\mathcal{D}}$ has a unique fixed point, which is $Z^\pi$, and the sequence $Z_i$ will converge to it as $i\rightarrow\infty$, i.e., $\mathcal{Z}_i$ will converge to $\mathcal{Z}^\pi$ as $i\rightarrow\infty$.
Lemma 2 (Soft Policy Improvement). Let $\pi_{\mathrm{new}}$ be the optimal solution of the maximization problem defined in (4). Then $Q^{\pi_{\mathrm{new}}}(s,a) \ge Q^{\pi_{\mathrm{old}}}(s,a)$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$.

Proof. From (4), one has

$$\pi_{\mathrm{new}}(\cdot|s) = \arg\max_{\pi}\,\mathbb{E}_{a\sim\pi}\big[Q^{\pi_{\mathrm{old}}}(s,a) - \alpha\log\pi(a|s)\big], \quad \forall s\in\mathcal{S}, \quad (20)$$

then it is obvious that

$$\mathbb{E}_{a\sim\pi_{\mathrm{new}}}\big[Q^{\pi_{\mathrm{old}}}(s,a) - \alpha\log\pi_{\mathrm{new}}(a|s)\big] \ge \mathbb{E}_{a\sim\pi_{\mathrm{old}}}\big[Q^{\pi_{\mathrm{old}}}(s,a) - \alpha\log\pi_{\mathrm{old}}(a|s)\big], \quad \forall s\in\mathcal{S}. \quad (21)$$

Next, from (3), it follows that

$$\begin{aligned}
Q^{\pi_{\mathrm{old}}}(s,a) &= \mathbb{E}_{r\sim\mathcal{R}(\cdot|s,a)}[r] + \gamma\,\mathbb{E}_{s'\sim p,\,a'\sim\pi_{\mathrm{old}}}\big[Q^{\pi_{\mathrm{old}}}(s',a') - \alpha\log\pi_{\mathrm{old}}(a'|s')\big]\\
&\le \mathbb{E}_{r\sim\mathcal{R}(\cdot|s,a)}[r] + \gamma\,\mathbb{E}_{s'\sim p,\,a'\sim\pi_{\mathrm{new}}}\big[Q^{\pi_{\mathrm{old}}}(s',a') - \alpha\log\pi_{\mathrm{new}}(a'|s')\big]\\
&\;\;\vdots\\
&\le Q^{\pi_{\mathrm{new}}}(s,a), \quad \forall (s,a)\in\mathcal{S}\times\mathcal{A},
\end{aligned}$$

where we have repeatedly expanded $Q^{\pi_{\mathrm{old}}}$ on the right-hand side by applying (3).
Theorem 1 (Distributional Soft Policy Iteration). Distributional soft policy iteration, which alternates between distributional soft policy evaluation and soft policy improvement, converges to a policy $\pi^*$ such that $Q^{\pi^*}(s,a) \ge Q^{\pi}(s,a)$ for all $\pi$ and all $(s,a)\in\mathcal{S}\times\mathcal{A}$, assuming that $|\mathcal{A}| < \infty$ and the reward is bounded.

Proof. Let $\pi_k$ denote the policy at iteration $k$. For any $\pi_k$, we can always find its associated $\mathcal{Z}^{\pi_k}$ through the distributional soft policy evaluation process, following Lemma 1. Therefore, we can obtain $Q^{\pi_k}$ according to (5). By Lemma 2, the sequence $Q^{\pi_k}(s,a)$ is monotonically increasing for all $(s,a)\in\mathcal{S}\times\mathcal{A}$. Since $Q^\pi$ is bounded everywhere for all $\pi$ (both the reward and the policy entropy are bounded), the policy sequence $\pi_k$ converges to some $\pi^\dagger$ as $k\rightarrow\infty$. At convergence, it must follow that

$$\mathbb{E}_{a\sim\pi^\dagger}\big[Q^{\pi^\dagger}(s,a) - \alpha\log\pi^\dagger(a|s)\big] \ge \mathbb{E}_{a\sim\pi}\big[Q^{\pi^\dagger}(s,a) - \alpha\log\pi(a|s)\big], \quad \forall\pi,\ \forall s\in\mathcal{S}. \quad (22)$$

Using the same iterative argument as in Lemma 2, we have

$$Q^{\pi^\dagger}(s,a) \ge Q^{\pi}(s,a), \quad \forall\pi,\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}.$$

Hence $\pi^\dagger$ is optimal, i.e., $\pi^\dagger = \pi^*$.
APPENDIX B
DERIVATIONS

A. Derivation of the Standard Deviation in Distributional Q-learning
Since the random error $\epsilon_Q$ in (9) is assumed to be independent of $(s,a)$, $\delta$ in (10) can be further expressed as

$$\delta = \mathbb{E}_{\epsilon'_Q}\big[\mathbb{E}_{s'}[\max_{a'}Q_\theta(s',a')]\big] - \mathbb{E}_{s'}[\max_{a'}\tilde{Q}(s',a')] = \mathbb{E}_{\epsilon'_Q}\Big[\mathbb{E}_{s'}\big[\max_{a'}Q_\theta(s',a') - \max_{a'}\tilde{Q}(s',a')\big]\Big].$$

Defining $\eta = \mathbb{E}_{s'}\big[\max_{a'}Q_\theta(s',a') - \max_{a'}\tilde{Q}(s',a')\big]$, it follows that

$$\delta = \mathbb{E}_{\epsilon'_Q}[\eta].$$

From (12), we linearize the post-update standard deviation around $\psi$ using Taylor's expansion

$$\sigma_{\psi_{\mathrm{new}}}(s,a) \approx \sigma_\psi(s,a) + \beta\,\frac{\Delta\sigma^2 + (y - Q_\theta(s,a))^2}{\sigma_\psi(s,a)^3}\|\nabla_\psi\sigma_\psi(s,a)\|_2^2.$$

Then, in expectation, the post-update standard deviation is

$$\mathbb{E}_{\epsilon_Q,\epsilon'_Q}[\sigma_{\psi_{\mathrm{new}}}(s,a)] \approx \sigma_\psi(s,a) + \beta\,\frac{\Delta\sigma^2 + \mathbb{E}_{\epsilon_Q,\epsilon'_Q}[(y - Q_\theta(s,a))^2]}{\sigma_\psi(s,a)^3}\|\nabla_\psi\sigma_\psi(s,a)\|_2^2.$$

Since $\mathbb{E}_{\epsilon_Q}[\epsilon_Q] = 0$, the $\mathbb{E}_{\epsilon_Q,\epsilon'_Q}[(y - Q_\theta(s,a))^2]$ term can be expanded as

$$\begin{aligned}
\mathbb{E}_{\epsilon_Q,\epsilon'_Q}[(y - Q_\theta(s,a))^2]
&= \mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[(\mathbb{E}[r] + \gamma\mathbb{E}_{s'}[\max_{a'}Q_\theta(s',a')] - Q_\theta(s,a))^2\big]\\
&= \mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[(\mathbb{E}[r] + \gamma\mathbb{E}_{s'}[\max_{a'}\tilde{Q}(s',a')] + \gamma\eta - \tilde{Q}(s,a) - \epsilon_Q)^2\big]\\
&= \mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[(\tilde{y} - \tilde{Q}(s,a) + \gamma\eta - \epsilon_Q)^2\big]\\
&= (\tilde{y} - \tilde{Q}(s,a))^2 + \mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[(\gamma\eta - \epsilon_Q)^2\big] + \mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[2(\tilde{y} - \tilde{Q}(s,a))(\gamma\eta - \epsilon_Q)\big]\\
&= (\tilde{y} - \tilde{Q}(s,a))^2 + \gamma^2\mathbb{E}_{\epsilon'_Q}[\eta^2] + \mathbb{E}_{\epsilon_Q}[\epsilon_Q^2] + 2\gamma(\tilde{y} - \tilde{Q}(s,a))\mathbb{E}_{\epsilon'_Q}[\eta] - 2\big(\gamma\mathbb{E}_{\epsilon'_Q}[\eta] + \tilde{y} - \tilde{Q}(s,a)\big)\mathbb{E}_{\epsilon_Q}[\epsilon_Q]\\
&= (\tilde{y} - \tilde{Q}(s,a))^2 + \gamma^2\mathbb{E}_{\epsilon'_Q}[\eta^2] + \mathbb{E}_{\epsilon_Q}[\epsilon_Q^2] + 2\gamma\delta(\tilde{y} - \tilde{Q}(s,a)).
\end{aligned}$$

In an ideal situation, when $\tilde{Q}(s,a) = \tilde{y}$, that is, when $\tilde{Q}(s,a)$ has converged after a period of learning, we further have

$$\mathbb{E}_{\epsilon_Q,\epsilon'_Q}[(y - Q_\theta(s,a))^2] = \gamma^2\mathbb{E}_{\epsilon'_Q}[\eta^2] + \mathbb{E}_{\epsilon_Q}[\epsilon_Q^2].$$

Furthermore, since $\mathbb{E}_{\epsilon'_Q}[\eta^2] \ge \mathbb{E}_{\epsilon'_Q}[\eta]^2$, we have

$$\begin{aligned}
\mathbb{E}_{\epsilon_Q,\epsilon'_Q}[\sigma_{\psi_{\mathrm{new}}}(s,a)]
&\approx \sigma_\psi(s,a) + \beta\,\frac{\Delta\sigma^2 + \gamma^2\mathbb{E}_{\epsilon'_Q}[\eta^2] + \mathbb{E}_{\epsilon_Q}[\epsilon_Q^2]}{\sigma_\psi(s,a)^3}\|\nabla_\psi\sigma_\psi(s,a)\|_2^2\\
&\ge \sigma_\psi(s,a) + \beta\,\frac{\Delta\sigma^2 + \gamma^2\mathbb{E}_{\epsilon'_Q}[\eta]^2 + \mathbb{E}_{\epsilon_Q}[\epsilon_Q^2]}{\sigma_\psi(s,a)^3}\|\nabla_\psi\sigma_\psi(s,a)\|_2^2\\
&= \sigma_\psi(s,a) + \beta\,\frac{\Delta\sigma^2 + \gamma^2\delta^2 + \mathbb{E}_{\epsilon_Q}[\epsilon_Q^2]}{\sigma_\psi(s,a)^3}\|\nabla_\psi\sigma_\psi(s,a)\|_2^2.
\end{aligned}$$
B. Derivation of the Objective Function for Soft Return Distribution Update

From (7), the loss function for the soft state-action return distribution under the KL-divergence measurement is

$$\begin{aligned}
J_{\mathcal{Z}}(\theta)
&= \mathbb{E}_{(s,a)\sim\mathcal{B}}\Big[D_{\mathrm{KL}}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}\mathcal{Z}_{\theta'}(\cdot|s,a),\, \mathcal{Z}_\theta(\cdot|s,a)\big)\Big]\\
&= \mathbb{E}_{(s,a)\sim\mathcal{B}}\Big[\sum_{\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)}\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)\,\big|\,\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}\mathcal{Z}_{\theta'}(\cdot|s,a)\big)\log\frac{\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)\,\big|\,\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}\mathcal{Z}_{\theta'}(\cdot|s,a)\big)}{\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)\,\big|\,\mathcal{Z}_\theta(\cdot|s,a)\big)}\Big]\\
&= -\mathbb{E}_{(s,a)\sim\mathcal{B}}\Big[\sum_{\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)}\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)\,\big|\,\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}\mathcal{Z}_{\theta'}(\cdot|s,a)\big)\log\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)\,\big|\,\mathcal{Z}_\theta(\cdot|s,a)\big)\Big] + c\\
&= -\mathbb{E}_{(s,a)\sim\mathcal{B}}\Big[\mathbb{E}_{\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)\sim\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}\mathcal{Z}_{\theta'}(\cdot|s,a)}\log\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)\,\big|\,\mathcal{Z}_\theta(\cdot|s,a)\big)\Big] + c\\
&= -\mathbb{E}_{(s,a)\sim\mathcal{B}}\Big[\mathbb{E}_{(r,s')\sim\mathcal{B},\,a'\sim\pi_{\phi'},\,Z(s',a')\sim\mathcal{Z}_{\theta'}(\cdot|s',a')}\log\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)\,\big|\,\mathcal{Z}_\theta(\cdot|s,a)\big)\Big] + c\\
&= -\mathbb{E}_{\substack{(s,a,r,s')\sim\mathcal{B},\,a'\sim\pi_{\phi'},\\ Z(s',a')\sim\mathcal{Z}_{\theta'}(\cdot|s',a')}}\Big[\log\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)\,\big|\,\mathcal{Z}_\theta(\cdot|s,a)\big)\Big] + c,
\end{aligned}$$

where $c$ is a term independent of $\theta$.
C. Probability Density of the Bounded Actions

For algorithms with a stochastic policy, we use an unbounded Gaussian as the action distribution $\mu$. However, in practice, the action usually needs to be bounded to a finite interval denoted as $[a_{\min}, a_{\max}]$, where $a_{\min}\in\mathbb{R}^{\dim(\mathcal{A})}$ and $a_{\max}\in\mathbb{R}^{\dim(\mathcal{A})}$. Let $u\in\mathbb{R}^{\dim(\mathcal{A})}$ denote a random variable sampled from $\mu$. To account for the action constraint, we project $u$ into a desired action by

$$a = \frac{a_{\max} - a_{\min}}{2}\odot\tanh(u) + \frac{a_{\max} + a_{\min}}{2},$$

where $\odot$ represents the Hadamard product and $\tanh$ is applied element-wise. From [16], the probability density of $a$ is given by

$$\pi(a|s) = \mu(u|s)\,\Big|\det\Big(\frac{\mathrm{d}a}{\mathrm{d}u}\Big)\Big|^{-1}.$$

The log-likelihood of $\pi(a|s)$ can be expressed as

$$\log\pi(a|s) = \log\mu(u|s) - \sum_{i=1}^{\dim(\mathcal{A})}\Big(\log\big(1 - \tanh^2(u_i)\big) + \log\frac{a_{\max,i} - a_{\min,i}}{2}\Big).$$
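A small NumPy sketch of this correction, for a hypothetical 2-dimensional action with illustrative bounds, is given below; it simply evaluates the formula above for one reparameterized sample u:

import numpy as np

def squashed_log_prob(u, mean, std, a_min, a_max):
    """log pi(a|s) for a = (a_max - a_min)/2 * tanh(u) + (a_max + a_min)/2."""
    # log-density of the unbounded Gaussian mu(u|s) (diagonal covariance)
    log_mu = np.sum(-0.5 * ((u - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi))
    # change-of-variables correction for the tanh squashing and rescaling
    correction = np.sum(np.log(1.0 - np.tanh(u) ** 2 + 1e-8) + np.log((a_max - a_min) / 2.0))
    return log_mu - correction

# Illustrative numbers only: a 2-D action bounded to [-2, 2] x [0, 1].
mean, std = np.array([0.3, -0.1]), np.array([0.5, 0.8])
a_min, a_max = np.array([-2.0, 0.0]), np.array([2.0, 1.0])
u = mean + std * np.random.default_rng(0).standard_normal(2)   # reparameterized sample
a = (a_max - a_min) / 2.0 * np.tanh(u) + (a_max + a_min) / 2.0
print("action:", a, " log pi(a|s):", squashed_log_prob(u, mean, std, a_min, a_max))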
D. Policy Update Gradients Based on the Soft State-Action Return

If $Q_\theta(s,a)$ cannot be expressed explicitly through $\theta$, then besides (19), we also need to reparameterize the random return $Z(s,a)$ as

$$Z(s,a) = g_\theta(\xi_Z; s, a).$$

In this case, we have

$$\nabla_\phi J_\pi(\phi) = \mathbb{E}_{s\sim\mathcal{B},\,\xi_Z,\,\xi_a}\Big[-\alpha\nabla_\phi\log\pi_\phi(a|s) + \big(\nabla_a g_\theta(\xi_Z; s, a) - \alpha\nabla_a\log\pi_\phi(a|s)\big)\nabla_\phi f_\phi(\xi_a; s)\Big].$$

Besides, the distribution $\mathcal{Z}_\theta$ offers a richer set of predictions for learning than its expected value $Q_\theta$. Therefore, we can also choose to maximize the $i$th percentile of $\mathcal{Z}_\theta$,

$$J_{\pi,i}(\phi) = \mathbb{E}_{s\sim\mathcal{B},\,a\sim\pi_\phi}\big[P_i(\mathcal{Z}_\theta(s,a)) - \alpha\log\pi_\phi(a|s)\big],$$

where $P_i$ denotes the $i$th percentile. For example, $i$ should be a smaller value for learning risk-aware policies. The gradients of this objective can also be easily approximated using the reparameterization trick.
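Since $\mathcal{Z}_\theta$ is Gaussian here, its $i$th percentile has the closed form $Q_\theta + \sigma_\theta\,\Phi^{-1}(i/100)$, where $\Phi^{-1}$ is the standard normal quantile function. A PyTorch sketch of this risk-aware objective (percentile level and dummy tensors are illustrative assumptions) is:

import torch
from torch.distributions import Normal

def percentile_actor_objective(q, std, logp, alpha, percentile=10.0):
    """J_{pi,i}(phi) with P_i(Z_theta) = Q_theta + sigma_theta * Phi^{-1}(i/100)."""
    # Standard-normal quantile via the inverse CDF of Normal(0, 1).
    z_i = Normal(0.0, 1.0).icdf(torch.tensor(percentile / 100.0))
    p_i = q + std * z_i                     # ith percentile of the Gaussian return
    return (p_i - alpha * logp).mean()      # maximize this (e.g., minimize its negative)

# Illustrative usage with dummy tensors of batch size 4:
q = torch.tensor([1.0, 2.0, 0.5, 1.5])
std = torch.tensor([1.0, 0.5, 2.0, 1.0])
logp = torch.tensor([-1.2, -0.8, -1.5, -1.0])
print(percentile_actor_objective(q, std, logp, alpha=0.2))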
APPENDIX C
EXPERIMENTAL DETAILS
A. Brief Descriptions of Benchmarks
The Humanoid-v2 task aims to make a three-dimensional
bipedal robot walk forward as fast as possible, without falling
over. Its state is described by 376-dimensional information,
including the position and velocity of joints, the inertia and
velocity at the center of mass, and actuator forces. The action
of this task is composed of the torque applied over 17 joints.
The reward function is designed to punish the actions that cost
a lot of energy or cause mission failure. Similarly, Walker2d-
v2 is a two-dimensional bipedal robot which possesses 17-
dimensional states and 6-dimensional actions. The Ant-v2 task
aims to make a four-legged creature walk forward as fast as
possible with a 111-dimensional state vector to describe the
position and velocity of each joint. Its action consists of the
torque of 8 joints, and the reward is also designed to punish
the actions that cost a lot of energy or cause mission failure.
Analogously, HalfCheetah-v2 is a two-legged cheetah with
17-dimensional states and 6-dimensional actions. The goal of
InvertedDoublePendulum-v2, which is described by an 11-
dimensional state vector, is to make two linked poles stand
up on a cart as long as possible by applying a force on the
cart. See https://github.com/openai/gym/tree/master/gym/envs
for all details.
B. Double-Q SAC Algorithm
Suppose the soft Q-value and policy are approximated by parameterized functions $Q_\theta(s,a)$ and $\pi_\phi(a|s)$, respectively. A pair of soft Q-value functions $(Q_{\theta_1}, Q_{\theta_2})$ and policies $(\pi_{\phi_1}, \pi_{\phi_2})$ are required in Double-Q SAC, where $\pi_{\phi_1}$ is updated with respect to $Q_{\theta_1}$ and $\pi_{\phi_2}$ with respect to $Q_{\theta_2}$. Given separate target soft Q-value functions $(Q_{\theta'_1}, Q_{\theta'_2})$ and target policies $(\pi_{\phi'_1}, \pi_{\phi'_2})$, the update targets of $Q_{\theta_1}$ and $Q_{\theta_2}$ are calculated as:

$$y_1 = r + \gamma\big(Q_{\theta'_2}(s',a') - \alpha\log\pi_{\phi'_1}(a'|s')\big), \quad a'\sim\pi_{\phi'_1},$$
$$y_2 = r + \gamma\big(Q_{\theta'_1}(s',a') - \alpha\log\pi_{\phi'_2}(a'|s')\big), \quad a'\sim\pi_{\phi'_2}.$$

The soft Q-values can be trained by directly minimizing

$$J_Q(\theta_i) = \mathbb{E}_{(s,a,r,s')\sim\mathcal{B},\,a'\sim\pi_{\phi'_i}}\big[(y_i - Q_{\theta_i}(s,a))^2\big], \quad \text{for } i\in\{1,2\}.$$

The policies can be learned by directly maximizing a parameterized variant of the objective function in (4),

$$J_\pi(\phi_i) = \mathbb{E}_{s\sim\mathcal{B}}\,\mathbb{E}_{a\sim\pi_{\phi_i}}\big[Q_{\theta_i}(s,a) - \alpha\log\pi_{\phi_i}(a|s)\big].$$
The pseudo-code of Double-Q SAC is shown in Algorithm 2.
Algorithm 2 Double-Q SAC Algorithm
Initialize parameters θ1, θ2, φ1, φ2, and α
Initialize target parameters θ'1 ← θ1, θ'2 ← θ2, φ'1 ← φ1, φ'2 ← φ2
Initialize learning rates βQ, βπ, βα and τ
Initialize iteration index k = 0
repeat
    Select action a ∼ πφ1(a|s)
    Observe reward r and new state s'
    Store transition tuple (s, a, r, s') in buffer B
    Sample N transitions (s, a, r, s') from B
    Update soft Q-functions θi ← θi − βQ ∇θi JQ(θi) for i ∈ {1, 2}
    if k mod m = 0 then
        Update policies φi ← φi + βπ ∇φi Jπ(φi) for i ∈ {1, 2}
        Adjust temperature α ← α − βα ∇α J(α)
        Update target networks:
            θ'i ← τθi + (1 − τ)θ'i for i ∈ {1, 2}
            φ'i ← τφi + (1 − τ)φ'i for i ∈ {1, 2}
    end if
    k = k + 1
until convergence
C. Single-Q SAC Algorithm
Suppose the soft Q-value and policy are approximated by parameterized functions $Q_\theta(s,a)$ and $\pi_\phi(a|s)$, respectively. Given a separate target soft Q-value function $Q_{\theta'}$ and target policy $\pi_{\phi'}$, the update target of $Q_\theta$ is calculated as:

$$y = r + \gamma\big(Q_{\theta'}(s',a') - \alpha\log\pi_{\phi'}(a'|s')\big), \quad a'\sim\pi_{\phi'}.$$

The soft Q-value can be trained by directly minimizing

$$J_Q(\theta) = \mathbb{E}_{(s,a,r,s')\sim\mathcal{B},\,a'\sim\pi_{\phi'}}\big[(y - Q_\theta(s,a))^2\big].$$

The policy can be learned by directly maximizing a parameterized variant of the objective function in (4),

$$J_\pi(\phi) = \mathbb{E}_{s\sim\mathcal{B}}\,\mathbb{E}_{a\sim\pi_\phi}\big[Q_\theta(s,a) - \alpha\log\pi_\phi(a|s)\big].$$
The pseudo-code of Single-Q SAC is shown in Algorithm 3.
Algorithm 3 Single-Q SAC Algorithm
  Initialize parameters θ, φ and α
  Initialize target parameters θ' ← θ, φ' ← φ
  Initialize learning rates β_Q, β_π, β_α and target smoothing coefficient τ
  Initialize iteration index k = 0
  repeat
    Select action a ∼ π_φ(a|s)
    Observe reward r and new state s'
    Store transition tuple (s, a, r, s') in buffer B
    Sample N transitions (s, a, r, s') from B
    Update soft Q-function θ ← θ − β_Q ∇_θ J_Q(θ)
    if k mod m = 0 then
      Update policy φ ← φ + β_π ∇_φ J_π(φ)
      Adjust temperature α ← α − β_α ∇_α J(α)
      Update target networks:
        θ' ← τθ + (1 − τ)θ', φ' ← τφ + (1 − τ)φ'
    end if
    k = k + 1
  until convergence
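The actor and temperature updates shared by Algorithms 2 and 3 can likewise be sketched with the reparameterization trick. The Gaussian policy with an rsample(s) -> (action, log_prob) interface is an assumption for illustration, and the temperature loss shown is the standard SAC formulation rather than a form defined in this appendix.

# Editor's sketch of the SAC-style actor and temperature updates (PyTorch
# assumed; the policy interface is hypothetical).
import torch

def actor_and_alpha_losses(s, policy, q_func, log_alpha, target_entropy):
    a, logp = policy.rsample(s)        # reparameterized a ~ pi_phi(.|s)
    alpha = log_alpha.exp().detach()
    # Maximize J_pi = E[Q(s,a) - alpha*log pi(a|s)]  <=>  minimize the negative
    policy_loss = (alpha * logp - q_func(s, a)).mean()
    # Adjust alpha so the policy entropy tracks the expected entropy H
    alpha_loss = -(log_alpha.exp() * (logp + target_entropy).detach()).mean()
    return policy_loss, alpha_loss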
D. TD4 Algorithm
Consider a parameterized state-action return distribution function $Z_\theta(\cdot|s,a)$ and a deterministic policy $\pi_\phi(s)$, where $\theta$ and $\phi$ are parameters. The target networks $Z_{\theta'}(\cdot|s,a)$ and $\pi_{\phi'}(s)$ are used to stabilize learning. The return distribution can be trained by minimizing
$J_Z(\theta) = -\mathbb{E}_{(s,a,r,s')\sim\mathcal{B},\, a'\sim\pi_{\phi'},\, Z(s',a')\sim Z_{\theta'}(\cdot|s',a')}\Big[\log P\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}Z(s,a)\,\big|\,Z_\theta(\cdot|s,a)\big)\Big],$
where
$\mathcal{T}^{\pi}_{\mathcal{D}}Z(s,a) \overset{D}{=} r(s,a) + \gamma Z(s',a')$
and
$a' = \pi_{\phi'}(s') + \epsilon, \quad \epsilon\sim\mathrm{clip}(\mathcal{N}(0,\sigma^2), -c, c).$
The calculation of $\nabla_\theta J_Z(\theta)$ is similar to that of DSAC. The policy can be learned by directly maximizing the expected return
$J_\pi(\phi) = \mathbb{E}_{s\sim\mathcal{B}}\big[Q_\theta(s, \pi_\phi(s))\big].$
The pseudo-code is shown in Algorithm 4.
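The key step is forming the smoothed distributional target. The sketch below is an editor's illustration under an assumed Gaussian parameterization of $Z_\theta(\cdot|s,a)$ with mean/std heads; the z_net, z_targ, and pi_targ interfaces are hypothetical, not the released code.

# Editor's sketch of the TD4 target construction and return-distribution loss.
import torch

def td4_critic_loss(batch, z_net, z_targ, pi_targ,
                    gamma=0.99, sigma=0.2, c=0.5):
    s, a, r, s_next = batch
    with torch.no_grad():
        # Target policy smoothing: a' = pi_{phi'}(s') + clipped Gaussian noise
        noise = (torch.randn_like(a) * sigma).clamp(-c, c)
        a_next = pi_targ(s_next) + noise
        # T_D Z(s, a) = r + gamma * Z(s', a'),  Z(s', a') ~ Z_{theta'}(.|s', a')
        target_return = r + gamma * z_targ.sample(s_next, a_next)
    mean, std = z_net(s, a)
    # J_Z(theta): negative log-likelihood of the target return under Z_theta
    return -torch.distributions.Normal(mean, std).log_prob(target_return).mean()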
E. Hyperparameters
Table IV lists the hyperparameters of all algorithms.
ACKNOWLEDGMENT
We would like to acknowledge Dongjie Yu for his valuable
suggestions. The authors are grateful to the Editor-in-Chief,
the Associate Editor, and anonymous reviewers for their valu-
able comments.
Algorithm 4 TD4 Algorithm
  Initialize parameters θ and φ
  Initialize target parameters θ' ← θ, φ' ← φ
  Initialize learning rates β_Z, β_π and target smoothing coefficient τ
  Initialize iteration index k = 0
  repeat
    Select action with exploration noise a = π_φ(s) + ε, ε ∼ N(0, σ̂²)
    Observe reward r and new state s'
    Store transition tuple (s, a, r, s') in buffer B
    Sample N transitions (s, a, r, s') from B
    Calculate the action for target policy smoothing a' = π_φ'(s') + ε, ε ∼ clip(N(0, σ²), −c, c)
    Update return distribution θ ← θ − β_Z ∇_θ J_Z(θ)
    if k mod m = 0 then
      Update policy φ ← φ + β_π ∇_φ J_π(φ)
      Update target networks:
        θ' ← τθ + (1 − τ)θ', φ' ← τφ + (1 − τ)φ'
    end if
    k = k + 1
  until convergence
TABLE IV
DETAILED HYPERPARAMETERS

Shared
  Optimizer: Adam (β_1 = 0.9, β_2 = 0.999)
  Number of hidden layers: 5
  Number of hidden units per layer: 256
  Nonlinearity of hidden layers: GELU
  Replay buffer size: 5 × 10^5
  Batch size: 256
  Actor learning rate: cosine annealing 5e−5 → 1e−6
  Critic learning rate: cosine annealing 8e−5 → 1e−6
  Discount factor (γ): 0.99
  Update interval (m): 2
  Target smoothing coefficient (τ): 0.001
  Reward scale: 0.2
  Number of actor processes: 6
  Number of learner processes: 4
  Number of buffer processes: 3
Stochastic policy
  Learning rate of α: cosine annealing 5e−5 → 1e−6
  Expected entropy (H): −dim(A)
Deterministic policy
  Exploration noise: ε ∼ N(0, 0.1²)
Distributional value function
  Bounds of variance: σ_min = 1
  Clipping boundary: b = 10
TD4, TD3
  Policy smoothing noise: ε ∼ clip(N(0, 0.2²), −0.5, 0.5)
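The learning rates labeled "cosine annealing" in Table IV decay from the initial value to the final value over training. A minimal sketch of such a schedule follows; the total step count is an assumption, not a value reported here.

# Editor's sketch of a cosine-annealed learning rate (total_steps assumed).
import math

def cos_anneal_lr(step, total_steps, lr_init=8e-5, lr_end=1e-6):
    """Decay the learning rate from lr_init to lr_end along a half cosine."""
    frac = min(step, total_steps) / total_steps
    return lr_end + 0.5 * (lr_init - lr_end) * (1.0 + math.cos(math.pi * frac))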
REFERENCES
[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,
et al., “Human-level control through deep reinforcement learning,”
Nature, vol. 518, no. 7540, p. 529, 2015.
[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van
Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam,
M. Lanctot, et al., “Mastering the game of go with deep neural networks
and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
[3] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley,
D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep
reinforcement learning,” in Proceedings of the 33rd International Con-
ference on Machine Learning, (ICML 2016), (New York City, NY, USA),
pp. 1928–1937, 2016.
[4] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang,
A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., “Mastering
the game of go without human knowledge,” Nature, vol. 550, no. 7676,
p. 354, 2017.
[5] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical
reinforcement learning for self-driving decision-making without reliance
on labelled driving data,” IET Intelligent Transport Systems, vol. 14,
no. 5, pp. 297–305, 2020.
[6] C. J. C. H. Watkins, Learning from delayed rewards. PhD thesis, King’s
College, Cambridge, 1989.
[7] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
MIT press, 2018.
[8] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning
with double q-learning,” in Proceedings of the 30th Conference on
Artificial Intelligence (AAAI 2016), (Phoenix, Arizona, USA), pp. 2094–
2100, 2016.
[9] S. Thrun and A. Schwartz, “Issues in using function approximation
for reinforcement learning,” in Proceedings of the 1993 Connectionist
Models Summer School, (Hillsdale NJ. Lawrence Erlbaum), 1993.
[10] D. Lee, B. Defourny, and W. B. Powell, “Bias-corrected q-learning to
control max-operator bias in q-learning,” in 2013 IEEE Symposium on
Adaptive Dynamic Programming and Reinforcement Learning (ADPRL),
pp. 93–99, IEEE, 2013.
[11] D. Lee and W. B. Powell, “Bias-corrected q-learning with multistate
extension,” IEEE Transactions on Automatic Control, vol. 64, no. 10,
pp. 4011–4023, 2019.
[12] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function ap-
proximation error in actor-critic methods,” in Proceedings of the 35th
International Conference on Machine Learning (ICML 2018), (Stockholmsmässan, Stockholm, Sweden), pp. 1587–1596, PMLR, 2018.
[13] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller,
“Deterministic policy gradient algorithms,” in Proceedings of the 31st
International Conference on Machine Learning (ICML 2014), (Bejing,
China), pp. 387–395, PMLR, 2014.
[14] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra, “Continuous control with deep reinforcement
learning,” in 4th International Conference on Learning Representations
(ICLR 2016), (San Juan, Puerto Rico), 2016.
[15] H. van Hasselt, “Double q-learning,” in 23rd Advances in Neural
Information Processing Systems (NeurIPS 2010), (Vancouver, British
Columbia, Canada), pp. 2613–2621, 2010.
[16] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic
actor,” in Proceedings of the 35th International Conference on Ma-
chine Learning (ICML 2018), (Stockholmsmässan, Stockholm, Sweden), pp. 1861–1870, PMLR, 2018.
[17] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan,
V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al., “Soft actor-critic
algorithms and applications,” arXiv preprint arXiv:1812.05905, 2018.
[18] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspec-
tive on reinforcement learning,” in Proceedings of the 34th International
Conference on Machine Learning, (ICML 2017), (Sydney, NSW, Aus-
tralia), pp. 449–458, 2017.
[19] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, “Distribu-
tional reinforcement learning with quantile regression,” in Proceedings
of the 32nd Conference on Artificial Intelligence, (AAAI 2018), (New
Orleans, Louisiana, USA), pp. 2892–2901, 2018.
[20] W. Dabney, G. Ostrovski, D. Silver, and R. Munos, “Implicit quantile
networks for distributional reinforcement learning,” in Proceedings of
the 35th International Conference on Machine Learning (ICML 2018),
(Stockholmsmässan, Stockholm, Sweden), pp. 1096–1105, PMLR, 2018.
[21] M. Rowland, M. Bellemare, W. Dabney, R. Munos, and Y. W. Teh, “An
analysis of categorical distributional reinforcement learning,” in Inter-
national Conference on Artificial Intelligence and Statistics, (AISTATS
2018), (Playa Blanca, Lanzarote, Canary Islands, Spain), pp. 29–37,
PMLR, 2018.
[22] C. Lyle, M. G. Bellemare, and P. S. Castro, “A comparative analysis of
expected and distributional reinforcement learning,” in Proceedings of
the 33rd Conference on Artificial Intelligence (AAAI 2019), (Honolulu,
Hawaii, USA), pp. 4504–4511, 2019.
[23] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan,
D. TB, A. Muldal, N. Heess, and T. P. Lillicrap, “Distributed distribu-
tional deterministic policy gradients,” in 6th International Conference
on Learning Representations, (ICLR 2018), (Vancouver, BC, Canada),
2018.
[24] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz,
“Trust region policy optimization,” in Proceedings of the 32nd Interna-
tional Conference on Machine Learning, (ICML 2015), (Lille, France),
pp. 1889–1897, 2015.
[25] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox-
imal policy optimization algorithms,” arXiv preprint arXiv:1707.06347,
2017.
[26] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez,
Z. Wang, S. Eslami, M. Riedmiller, et al., “Emergence of locomotion
behaviours in rich environments,” arXiv preprint arXiv:1707.02286,
2017.
[27] B. O’Donoghue, R. Munos, K. Kavukcuoglu, and V. Mnih, “Combining
policy gradient and q-learning,” in 4th International Conference on
Learning Representations (ICLR 2016), (San Juan, Puerto Rico), 2016.
[28] J. Schulman, X. Chen, and P. Abbeel, “Equivalence between policy
gradients and soft q-learning,” arXiv preprint arXiv:1704.06440, 2017.
[29] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Bridging the
gap between value and policy based reinforcement learning,” in 30th
Advances in Neural Information Processing Systems (NeurIPS 2017),
(Long Beach, CA, USA), pp. 2775–2785, 2017.
[30] B. Sallans and G. E. Hinton, “Reinforcement learning with factored
states and actions,” Journal of Machine Learning Research, vol. 5, no. 8,
pp. 1063–1088, 2004.
[31] R. Fox, A. Pakman, and N. Tishby, “Taming the noise in reinforcement
learning via soft updates,” in Proceedings of the 32nd Conference on
Uncertainty in Artificial Intelligence (UAI 2016), (Arlington, Virginia,
United States), pp. 202–211, AUAI Press, 2016.
[32] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement learning
with deep energy-based policies,” in Proceedings of the 34th Interna-
tional Conference on Machine Learning, (ICML 2017), (Sydney, NSW,
Australia), pp. 1352–1361, 2017.
[33] W. Dabney, Z. Kurth-Nelson, N. Uchida, C. K. Starkweather, D. Hass-
abis, R. Munos, and M. Botvinick, “A distributional code for value in
dopamine-based reinforcement learning,” Nature, pp. 1–5, 2020.
[34] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv
preprint arXiv:1312.6114, 2013.
[35] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward,
Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu,
“IMPALA: Scalable distributed deep-RL with importance weighted
actor-learner architectures,” in Proceedings of the 35th International
Conference on Machine Learning (ICML 2018), (Stockholmsmässan, Stockholm, Sweden), pp. 1407–1416, PMLR, 2018.
[36] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel,
H. Van Hasselt, and D. Silver, “Distributed prioritized experience
replay,” arXiv preprint arXiv:1803.00933, 2018.
[37] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for
model-based control,” in 2012 IEEE/RSJ International Conference on
Intelligent Robots and Systems, (IROS 2012), (Vilamoura, Algarve,
Portugal), pp. 5026–5033, IEEE, 2012.
[38] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schul-
man, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint
arXiv:1606.01540, 2016.
[39] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv
preprint arXiv:1606.08415, 2016.
[40] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
in 3rd International Conference on Learning Representations, (ICLR
2015), (San Diego, CA, USA), 2015.
Jingliang Duan received the B.S. degree from the
College of Automotive Engineering, Jilin University,
Changchun, China, in 2015. He studied as a visiting
student researcher in Department of Mechanical En-
gineering, University of California, Berkeley, USA,
in 2019. He received his Ph.D. degree in the School
of Vehicle and Mobility, Tsinghua University, Bei-
jing, China, in 2021. His research interests include decision and control of autonomous vehicles, reinforcement learning and adaptive dynamic programming, and driver behaviour analysis.
Yang Guan received the B.S. degree from the School of Mechanical Engineering, Beijing Institute of Technology, Beijing, China, in 2017. He is pursuing his Ph.D. degree in the School of Vehicle and Mobility, Tsinghua University, Beijing, China. His research interests include decision-making of autonomous vehicles and reinforcement learning.
Shengbo Eben Li (SM’16) received the M.S. and
Ph.D. degrees from Tsinghua University in 2006 and
2009. He worked at Stanford University, University
of Michigan, and University of California, Berkeley.
He is currently a tenured professor at Tsinghua Uni-
versity. His active research interests include intel-
ligent vehicles and driver assistance, reinforcement
learning and distributed control, optimal control and
estimation, etc.
He is the author of over 100 journal/conference
papers, and the co-inventor of over 20 Chinese
patents. He was the recipient of the Best Paper Award at the 2014 IEEE ITS Symposium, the Best Paper Award at the 14th ITS Asia Pacific Forum, the National Award for Technological Invention in China (2013), the Excellent Young Scholar award of NSF China (2016), and the Young Professorship of the Changjiang Scholar Program (2016). He is an IEEE Senior Member and serves as an associate editor of IEEE ITSM and IEEE Trans. ITS, among others.
Yangang Ren received the B.S. degree from the
Department of Automotive Engineering, Tsinghua
University, Beijing, China, in 2018. He is currently
pursuing his Ph.D. degree in the School of Vehicle
and Mobility, Tsinghua University, Beijing, China.
His research interests include decision and control
of autonomous driving, reinforcement learning, and
adversarial learning.
Qi Sun received his Ph.D. degree in Automotive
Engineering from Ecole Centrale de Lille, France,
in 2017. He did scientific research and completed
his Ph.D. dissertation in CRIStAL Research Center
at Ecole Centrale de Lille, France, between 2013
and 2016. He is currently a postdoctoral researcher at the State
Key Laboratory of Automotive Safety and Energy
and at the School of Vehicle and Mobility, Tsinghua
University, Beijing, China. His active research inter-
ests include intelligent vehicles, automatic driving
technology, distributed control and optimal control.
Bo Cheng received the B.S. and M.S. degrees in
automotive engineering from Tsinghua University,
Beijing, China, in 1985 and 1988, respectively, and
the Ph.D. degree in mechanical engineering from
the University of Tokyo, Tokyo, Japan, in 1998.
He is currently a Professor with School of Ve-
hicle and Mobility, Tsinghua University, and the
Dean of Tsinghua University–Suzhou Automotive
Research Institute. He is the author of more than 100
peer-reviewed journal/conference papers and the co-
inventor of 40 patents. His active research interests
include autonomous vehicles, driver-assistance systems, active safety, and
vehicular ergonomics, among others. Dr. Cheng is also the Chairman of the
Academic Board of SAE-Beijing, a member of the Council of the Chinese
Ergonomics Society, and a Committee Member of National 863 Plan, among
others.