Distributional Soft Actor-Critic: Off-Policy
Reinforcement Learning for Addressing Value
Estimation Errors
Jingliang Duan, Yang Guan, Shengbo Eben Li*, Yangang Ren, Qi Sun, and Bo Cheng

©2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

This study is supported by Beijing NSF with JQ18010, NSF China with 51575293 and U20A20334. Special thanks should be given to TOYOTA for funding this study. Jingliang Duan and Yang Guan contributed equally to this work. All correspondence should be sent to S. Li (email: lisb04@gmail.com). J. Duan, Y. Guan, S. Li, Y. Ren, Q. Sun, and B. Cheng are with the State Key Lab of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing, 100084, China. They are also with the Center for Intelligent Connected Vehicles and Transportation, Tsinghua University. Email: duanjl15@163.com; (guany17, ryg18)@mails.tsinghua.edu.cn; (lishbo, qisun, chengbo)@tsinghua.edu.cn.
Abstract—In reinforcement learning (RL), function approximation errors are known to easily lead to Q-value overestimations, thus greatly reducing policy performance. This paper presents a distributional soft actor-critic (DSAC) algorithm, which is an off-policy RL method for continuous control settings, to improve policy performance by mitigating Q-value overestimations. We first discover in theory that learning a distribution function of state-action returns can effectively mitigate Q-value overestimations because it is capable of adaptively adjusting the update step size of the Q-value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution by keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.

Index Terms—Reinforcement learning, overestimation, distributional soft actor-critic (DSAC).
I. INTRODUCTION
DEEP neural networks (NNs) provide rich representations
that can enable reinforcement learning (RL) algorithms
to master a variety of challenging domains, from games to
robotic control [1]–[5]. However, most RL algorithms tend to
learn unrealistically high state-action values (i.e., Q-values),
known as overestimations, thereby resulting in suboptimal
policies.
The overestimations of RL were first found in the Q-
learning algorithm [6], which is the prototype of most existing
value-based RL algorithms [7]. For this algorithm, van Hasselt
et al. (2016) demonstrated that any kind of estimation error can induce an upward bias, irrespective of whether these errors are caused by system noise, function approximation, or any other source [8]. The overestimation bias is first induced by the max operator over all noisy Q-estimates of the same
state, which tends to prefer overestimated to underestimated
Q-values [9]–[11]. This overestimation bias will be further
propagated and exaggerated through the temporal difference
learning [7], wherein the Q-estimate of a state is updated using
the Q-estimate of its subsequent state. Deep RL algorithms,
such as Deep Q-Networks (DQN) [1], employ a deep NN to
estimate the Q-value. Although the deep NN can provide rich
representations with the potential for low asymptotic approxi-
mation errors, overestimations still exist, even in deterministic
environments [8], [12]. Fujimoto et al. (2018) showed that the
overestimation problem also persists in actor-critic RL [12],
such as Deterministic Policy Gradient (DPG) and Deep DPG
(DDPG) [13], [14]. In practice, inaccurate estimation exists in almost all RL algorithms because, on the one hand, any algorithm will introduce some estimation biases and variances simply because the true Q-values are initially unknown [7]. On the other hand, function approximation errors are usually unavoidable. This is particularly problematic because inaccurate estimation can cause arbitrarily suboptimal actions to be overestimated, resulting in a suboptimal policy.
To reduce overestimations in standard Q-learning, Double
Q-learning [15] was developed to decouple the max operation
into action selection and evaluation. To update one of these two
Q-networks, one Q-network is used to determine the greedy
policy, while another Q-network is used to determine its value,
resulting in unbiased estimates. Double DQN [8], a deep
variant of Double Q-learning, deals with the overestimation
problem of DQN, in which the target Q-network of DQN pro-
vides a natural candidate for the second Q-network. However,
these two methods can only handle discrete action spaces.
Fujimoto et al. (2018) developed actor-critic variants of the
standard Double DQN and Double Q-learning for continuous
control, by making action selections using the policy optimized
with respect to the corresponding Q-estimate [12]. However,
the actor-critic Double DQN suffers from similar overestima-
tions as DDPG, because the online and target Q-estimates
are too similar to provide an independent estimation. While
actor-critic Double Q-learning is more effective, it introduces
additional Q and policy networks at the cost of increasing
the computation time for each iteration. Finally, Fujimoto et
al. (2018) proposed Clipped Double Q-learning by taking
the minimum value between the two Q-estimates [12], which
is used in Twin Delayed Deep Deterministic policy gradient
(TD3) and Soft Actor-Critic (SAC) [16], [17]. However, this
method may introduce a considerable underestimation bias and
still requires an additional Q-network.
In this paper, we propose a new RL algorithm, called
distributional soft actor-critic (DSAC), to improve policy per-
formance by mitigating Q-value overestimations. The contri-
butions and novelty of this paper are summarized as follows:
1) A distributional soft policy iteration (DSPI) framework is
developed by embedding the return distribution function
in maximum entropy RL to learn a continuous distribution
of state-action returns (also called return distribution).
The impact of return distribution learning on the accuracy of Q-value estimation was barely discussed in existing distributional RL algorithms, such as [18]–[23]. In this paper, we first find that Q-value overestimations can be mitigated by learning a distribution function of state-action returns. This is because, compared with most RL algorithms that directly learn the expectation of state-action returns (i.e., the Q-value) [1], [3], [8], [12], [14], [16], return distribution learning is capable of adaptively adjusting the update step size of Q-values.
2) Based on the developed DSPI framework, we propose
the DSAC algorithm by replacing the clipped double Q-
learning of SAC [16], [17] with the return distribution
learning. In comparison with RL algorithms that use
double value networks to mitigate overestimations [8],
[12], [15]–[17], DSAC improves the Q-value estimation
accuracy by only employing a single return distribution
network, which also leads to higher time efficiency.
3) Different from existing distributional RL algorithms that
learn a discrete return distribution [18]–[23], the pro-
posed DSAC is capable of learning a continuous return
distribution by keeping the variance of the state-action
returns within a reasonable range to address exploding
and vanishing gradient problems. Therefore, DSAC re-
laxes the need for human-designed discrete ranges and
intervals. Besides, compared with most distributional
RL algorithms that can only handle discrete and low-
dimensional action spaces [18]–[22], DSAC is applicable
to continuous control settings by optimizing an indepen-
dent stochastic policy network.
4) Experiments on MuJoCo benchmarks demonstrate that
the proposed DSAC algorithm outperforms or matches
all baselines across all benchmark tasks in terms of the
final performance.
The paper is organized as follows. In Section II, we intro-
duce the related works. Section III describes some preliminar-
ies of RL and develops a DSPI framework. In Section IV, we
analyze the role of the distributional return function in solving
overestimations. Section V presents the DSAC algorithm and
PABAL architecture. In Section VI, we present experimental
results that show the efficacy of DSAC. Section VII concludes
this paper.
II. RELATED WORK
Over the last decade, numerous deep RL algorithms have
appeared [1], [3], [12], [14], [16], [23]–[26]. This paper aims
to propose a new RL algorithm to mitigate Q-value overes-
timations by learning a distribution of state-action returns,
thereby improving policy performance. We also incorporate
the off-policy formulation to improve sample efficiency, and
the maximum entropy framework based on the stochastic
policy to encourage exploration. Besides, our algorithm mainly focuses on continuous control settings. With reference to algorithms such as DDPG [14], off-policy learning and continuous control can be easily enabled by learning separate Q and policy networks in an actor-critic architecture. Therefore, we mainly review prior works on the maximum entropy framework and distributional RL in this section.
Maximum entropy RL favors stochastic policies by aug-
menting the optimization objective with the expected policy
entropy. While many prior RL algorithms consider the policy
entropy, they only use it as a regularizer [3], [24], [25].
Recently, several papers have noted the connection between
Q-learning and policy gradient methods in the setting of the
maximum entropy framework [27]–[29]. Early maximum en-
tropy RL algorithms usually only consider the policy entropy
of current states [27], [30], [31]. Unlike them, soft Q-learning directly augments the reward with an entropy term, such that the optimal policy aims to reach states that will have high policy entropy in the future [32]. Haarnoja et al. (2018) further developed an off-policy actor-critic variant of soft Q-learning for large continuous domains, called SAC [16], [17]. In this paper, we build on the work of [16], [17] for implementing the maximum entropy framework.
Distributional RL, in which one models the distribution over returns whose expectation is the value function, was recently introduced by Bellemare et al. [18]. They proposed a distributional RL algorithm, called C51, which achieved great performance improvements on many Atari 2600 benchmarks. Since then, many distributional RL algorithms and their inherent analyses have appeared in the literature [19]–[22]. Like DQN, these works can only handle discrete and low-dimensional action spaces, as they select actions according to their Q-networks. Barth-Maron et al. (2018) combined the distributional return function with an actor-critic framework for policy learning in continuous control domains, and proposed the Distributed Distributional Deep Deterministic Policy Gradient algorithm (D4PG) [23]. Inspired by this line of distributional RL research, Dabney et al. (2020) found through mouse experiments that the brain represents possible future rewards not as a single mean, but instead as a probability distribution [33]. Existing distributional RL algorithms usually learn a discrete return distribution because it is computationally friendly. However, this poses a problem: we need to divide the return distribution into multiple discrete intervals in advance. This is inconvenient because different tasks usually require different division numbers and intervals. In addition, the role of the distributional return function in mitigating overestimations was barely discussed before.
III. PRELIMINARIES AND DISTRIBUTIONAL SOFT POLICY ITERATION

In this section, we first describe the notation and introduce the concept of maximum entropy RL. Then the distributional soft policy iteration (DSPI) framework is developed.
A. Notation
We consider the standard reinforcement learning (RL) setting wherein an agent interacts with an environment $\mathcal{E}$ in discrete time. This environment can be modeled as a Markov Decision Process, defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, p)$. The state space $\mathcal{S}$ and action space $\mathcal{A}$ are assumed to be continuous, $\mathcal{R}(r_t|s_t,a_t): \mathcal{S}\times\mathcal{A}\to\mathcal{P}(r_t)$ is a stochastic reward function mapping a state-action pair $(s_t,a_t)$ to a distribution over a set of bounded rewards, and the unknown state transition probability $p(s_{t+1}|s_t,a_t): \mathcal{S}\times\mathcal{A}\to\mathcal{P}(s_{t+1})$ maps a given $(s_t,a_t)$ to the probability distribution over $s_{t+1}$. For the sake of simplicity, the current and next state-action pairs are also denoted as $(s,a)$ and $(s',a')$, respectively.

At each time step $t$, the agent receives a state $s_t\in\mathcal{S}$ and selects an action $a_t\in\mathcal{A}$. In return, the agent receives the next state $s_{t+1}\in\mathcal{S}$ and a scalar reward $r_t\sim\mathcal{R}(\cdot|s_t,a_t)$. The process continues until the agent reaches a terminal state, after which the process restarts. The agent's behavior is defined by a stochastic policy $\pi(a_t|s_t): \mathcal{S}\to\mathcal{P}(a_t)$, which maps a given state to a probability distribution over actions. We will use $\rho_\pi(s)$ and $\rho_\pi(s,a)$ to denote the state and state-action distribution induced by policy $\pi$.
B. Maximum Entropy RL
The goal in standard RL is to learn a policy which maximizes the expected future accumulated return $\mathbb{E}_{(s_{i\ge t},a_{i\ge t})\sim\rho_\pi,\, r_i\sim\mathcal{R}(\cdot|s_i,a_i)}[\sum_{i=t}^{\infty}\gamma^{i-t} r_i]$, where $\gamma\in[0,1)$ is the discount factor. In this paper, we consider a more general entropy-augmented objective [16], [17], [32], which augments the reward with a policy entropy term $\mathcal{H}$,

$$J_\pi = \mathbb{E}_{(s_{i\ge t},a_{i\ge t})\sim\rho_\pi,\, r_i\sim\mathcal{R}(\cdot|s_i,a_i)}\Big[\sum_{i=t}^{\infty}\gamma^{i-t}\big[r_i + \alpha\mathcal{H}(\pi(\cdot|s_i))\big]\Big], \qquad (1)$$

where

$$\mathcal{H}(\pi(\cdot|s)) = -\int_{a\in\mathcal{A}} \pi(a|s)\log\pi(a|s)\,\mathrm{d}a = \mathbb{E}_{a\sim\pi(\cdot|s)}\big[-\log\pi(a|s)\big].$$

This objective improves the exploration efficiency of the policy by maximizing both the expected future return and the policy entropy. The temperature parameter $\alpha$ determines the relative importance of the entropy term against the reward. Maximum entropy RL gradually approaches conventional RL as $\alpha\to 0$.

We use $G_t = \sum_{i=t}^{\infty}\gamma^{i-t}\big[r_i - \alpha\log\pi(a_i|s_i)\big]$ to denote the entropy-augmented accumulated return from $s_t$, also called the soft return. The soft Q-value of policy $\pi$ is defined as

$$Q^\pi(s_t,a_t) = \mathbb{E}_{r\sim\mathcal{R}(\cdot|s_t,a_t)}[r] + \gamma\,\mathbb{E}_{(s_{i>t},a_{i>t})\sim\rho_\pi,\, r_{i>t}\sim\mathcal{R}(\cdot|s_i,a_i)}[G_{t+1}], \qquad (2)$$

which describes the expected soft return for selecting $a_t$ in state $s_t$ and thereafter following policy $\pi$.

The optimal maximum entropy policy is learned by a maximum entropy variant of the policy iteration method, which alternates between soft policy evaluation and soft policy improvement, called soft policy iteration. In the soft policy evaluation process, given a policy $\pi$, the soft Q-value can be learned by repeatedly applying a soft Bellman operator $\mathcal{T}^\pi$ under policy $\pi$, given by

$$\mathcal{T}^\pi Q^\pi(s,a) = \mathbb{E}_{r\sim\mathcal{R}(\cdot|s,a)}[r] + \gamma\,\mathbb{E}_{s'\sim p,\, a'\sim\pi}\big[Q^\pi(s',a') - \alpha\log\pi(a'|s')\big]. \qquad (3)$$

The goal of the soft policy improvement process is to find a new policy $\pi_{\mathrm{new}}$ that is better than the current policy $\pi_{\mathrm{old}}$, such that $J_{\pi_{\mathrm{new}}}\ge J_{\pi_{\mathrm{old}}}$. Hence, we can update the policy directly by maximizing the entropy-augmented objective in (1) in terms of the soft Q-value,

$$\pi_{\mathrm{new}} = \arg\max_\pi J_\pi = \arg\max_\pi\,\mathbb{E}_{s\sim\rho_\pi,\, a\sim\pi}\big[Q^{\pi_{\mathrm{old}}}(s,a) - \alpha\log\pi(a|s)\big]. \qquad (4)$$

The convergence and optimality of soft policy iteration have been verified in [16], [17], [28], [32].
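To make the soft return concrete, the following minimal sketch computes $G_t$ for a finite sampled trajectory; the reward and log-probability values are hypothetical, and truncating the infinite sum at the end of the recorded trajectory is an assumption.

import numpy as np

def soft_return(rewards, log_probs, gamma=0.99, alpha=0.2):
    """Entropy-augmented return G_t = sum_i gamma^(i-t) * (r_i - alpha*log pi(a_i|s_i)),
    computed backwards over a finite sampled trajectory."""
    g = 0.0
    for r, logp in zip(reversed(rewards), reversed(log_probs)):
        g = (r - alpha * logp) + gamma * g
    return g

# toy usage: three steps of a hypothetical rollout
print(soft_return(rewards=[1.0, 0.5, 2.0], log_probs=[-1.2, -0.8, -1.5]))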
C. Distributional Soft Policy Iteration
Next, we develop the distributional soft policy iteration (DSPI) framework by extending maximum entropy RL into a distributional learning version. Firstly, we define the soft state-action return of policy $\pi$ from a state-action pair $(s_t,a_t)$ as

$$Z^\pi(s_t,a_t) = r_t + \gamma G_{t+1}, \quad (s_{i>t},a_{i>t})\sim\rho_\pi,\; r_i\sim\mathcal{R}(\cdot|s_i,a_i),$$

which is usually a random variable due to the randomness in the state transition $p$, reward function $\mathcal{R}$ and policy $\pi$. From (2), it is clear that

$$Q^\pi(s,a) = \mathbb{E}\big[Z^\pi(s,a)\big]. \qquad (5)$$

Instead of just considering the expected state-action return $Q^\pi(s,a)$, one can choose to directly model the distribution of the soft returns $Z^\pi(s,a)$. We define $\mathcal{Z}^\pi(Z^\pi(s,a)|s,a): \mathcal{S}\times\mathcal{A}\to\mathcal{P}(Z^\pi(s,a))$ as a mapping from $(s,a)$ to a distribution over soft state-action returns, and call it the soft state-action return distribution or distributional value function. The distributional variant of the Bellman operator in the maximum entropy framework can be derived as

$$\mathcal{T}^\pi_{\mathcal{D}} Z^\pi(s,a) \overset{D}{=} r + \gamma\big(Z^\pi(s',a') - \alpha\log\pi(a'|s')\big), \qquad (6)$$

where $r\sim\mathcal{R}(\cdot|s,a)$, $s'\sim p$, $a'\sim\pi$, and $A\overset{D}{=}B$ denotes that two random variables $A$ and $B$ have equal probability laws. The distributional variant of policy iteration has been proved to converge to the optimal return distribution and policy uniformly in [18]. We can further prove that DSPI, which alternates between (6) and (4), also leads to policy improvement with respect to the maximum entropy objective (1). Details are provided in Appendix A.

Suppose $\mathcal{T}^\pi_{\mathcal{D}} Z(s,a) \sim \mathcal{T}^\pi_{\mathcal{D}}\mathcal{Z}(\cdot|s,a)$, where $\mathcal{T}^\pi_{\mathcal{D}}\mathcal{Z}(\cdot|s,a)$ denotes the distribution of $\mathcal{T}^\pi_{\mathcal{D}} Z(s,a)$. To implement (6), we can directly update the soft return distribution by

$$\mathcal{Z}_{\mathrm{new}} = \arg\min_{\mathcal{Z}}\,\mathbb{E}_{(s,a)\sim\rho_\pi}\Big[d\big(\mathcal{T}^\pi_{\mathcal{D}}\mathcal{Z}_{\mathrm{old}}(\cdot|s,a),\, \mathcal{Z}(\cdot|s,a)\big)\Big], \qquad (7)$$

where $d$ is some metric to measure the distance between two distributions. For calculation convenience, many practical distributional RL algorithms employ the Kullback-Leibler (KL) divergence, denoted as $D_{\mathrm{KL}}$, as the metric [18], [23].
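As a small illustration of (6), the sketch below forms one Monte-Carlo sample of the distributional soft Bellman target from a sampled next-state return; the Gaussian parameters are placeholders, not values from the paper.

import numpy as np

def distributional_soft_target(r, z_next, logp_next, gamma=0.99, alpha=0.2):
    """One sample of the distributional soft Bellman target in (6):
    T_D Z(s, a) = r + gamma * (Z(s', a') - alpha * log pi(a'|s'))."""
    return r + gamma * (z_next - alpha * logp_next)

# toy usage: z_next sampled from a (hypothetical) Gaussian return distribution at (s', a')
rng = np.random.default_rng(0)
z_next = rng.normal(loc=5.0, scale=2.0)
print(distributional_soft_target(r=1.0, z_next=z_next, logp_next=-1.3))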
IV. OVERESTIMATION BIAS

This section mainly focuses on the impact of the state-action return distribution learning on reducing overestimation. Therefore, the entropy coefficient $\alpha$ is assumed to be 0 here. Previous studies analyzed the Q-value estimation bias of Q-learning in tabular cases [6], [15]. In Section IV-A, we derive the analytical expression of the Q-value estimation bias from the perspective of function approximation. Then, Section IV-B analyzes the Q-estimate bias of the return distribution learning and reveals its mechanism for mitigating overestimations.
A. Overestimation in Q-learning
In Q-learning with discrete actions, suppose the Q-value is approximated by a Q-function $Q_\theta(s,a)$ with parameters $\theta$. Defining the greedy target $y = \mathbb{E}[r] + \gamma\mathbb{E}_{s'}[\max_{a'} Q_\theta(s',a')]$, the Q-estimate $Q_\theta(s,a)$ can be updated by minimizing the loss $(y - Q_\theta(s,a))^2/2$ using gradient descent methods, i.e.,

$$\theta_{\mathrm{new}} = \theta + \beta\big(y - Q_\theta(s,a)\big)\nabla_\theta Q_\theta(s,a), \qquad (8)$$

where $\beta$ is the learning rate. However, in practical applications, the Q-estimate $Q_\theta(s,a)$ usually contains random errors, which may be caused by system noises and function approximation. Denoting the current true Q-value as $\tilde{Q}$, we assume

$$Q_\theta(s,a) = \tilde{Q}(s,a) + \epsilon_Q, \qquad (9)$$

where the random error $\epsilon_Q$ has zero mean and is independent of $(s,a)$ and $\theta$. To distinguish the random error of $Q_\theta(s,a)$ from that of $Q_\theta(s',a')$, the random error of $Q_\theta(s',a')$ is denoted as $\epsilon'_Q$. Clearly, $\epsilon'_Q$ may cause inaccuracy on the right-hand side of (8). Let $\theta_{\mathrm{true}}$ represent the post-update parameters obtained based on the true target $\tilde{y}$, that is,

$$\theta_{\mathrm{true}} = \theta + \beta\big(\tilde{y} - Q_\theta(s,a)\big)\nabla_\theta Q_\theta(s,a),$$

where $\tilde{y} = \mathbb{E}[r] + \gamma\mathbb{E}_{s'}[\max_{a'}\tilde{Q}(s',a')]$.

Supposing $\beta$ is sufficiently small, the post-update Q-function can be well-approximated by linearizing around $\theta$ using Taylor's expansion:

$$Q_{\theta_{\mathrm{true}}}(s,a) \approx Q_\theta(s,a) + \beta\big(\tilde{y} - Q_\theta(s,a)\big)\|\nabla_\theta Q_\theta(s,a)\|_2^2,$$
$$Q_{\theta_{\mathrm{new}}}(s,a) \approx Q_\theta(s,a) + \beta\big(y - Q_\theta(s,a)\big)\|\nabla_\theta Q_\theta(s,a)\|_2^2.$$

Then, in expectation, the estimation bias of the post-update Q-estimate $Q_{\theta_{\mathrm{new}}}(s,a)$ is

$$\begin{aligned}
\Delta(s,a) &= \mathbb{E}_{\epsilon'_Q}\big[Q_{\theta_{\mathrm{new}}}(s,a) - Q_{\theta_{\mathrm{true}}}(s,a)\big] \\
&\approx \beta\big(\mathbb{E}_{\epsilon'_Q}[y] - \tilde{y}\big)\|\nabla_\theta Q_\theta(s,a)\|_2^2 \\
&= \beta\gamma\Big(\mathbb{E}_{\epsilon'_Q}\mathbb{E}_{s'}[\max_{a'} Q_\theta(s',a')] - \mathbb{E}_{s'}[\max_{a'}\tilde{Q}(s',a')]\Big)\|\nabla_\theta Q_\theta(s,a)\|_2^2.
\end{aligned}$$

Defining

$$\begin{aligned}
\delta &= \mathbb{E}_{\epsilon'_Q}\mathbb{E}_{s'}[\max_{a'} Q_\theta(s',a')] - \mathbb{E}_{s'}[\max_{a'}\tilde{Q}(s',a')] \\
&= \mathbb{E}_{s'}\Big[\mathbb{E}_{\epsilon'_Q}[\max_{a'} Q_\theta(s',a')] - \max_{a'}\tilde{Q}(s',a')\Big] \\
&= \mathbb{E}_{s'}\Big[\mathbb{E}_{\epsilon'_Q}\big[\max_{a'}\big(\tilde{Q}(s',a') + \epsilon'_Q\big)\big] - \max_{a'}\tilde{Q}(s',a')\Big],
\end{aligned} \qquad (10)$$

$\Delta(s,a)$ can be rewritten as:

$$\Delta(s,a) \approx \beta\gamma\delta\,\|\nabla_\theta Q_\theta(s,a)\|_2^2.$$

Although $\epsilon'_Q$ is independent of $(s',a')$, it cannot be extracted from the max operator of $\max_{a'}(\tilde{Q}(s',a') + \epsilon'_Q)$. This is because for each $(s',a')$, $\epsilon'_Q$ is a random variable rather than a fixed value. In fact, it has been verified by previous research that $\mathbb{E}_{\epsilon'_Q}[\max_{a'}(\tilde{Q}(s',a') + \epsilon'_Q)] - \max_{a'}\tilde{Q}(s',a') \ge 0$ [9], [15]. Therefore, it is clear that

$$\Delta(s,a) \ge 0,$$

which indicates that $\Delta(s,a)$ is an upward bias. In fact, any kind of estimation error can induce an upward bias due to the max operator. Although it is reasonable to expect a small upward bias caused by a single update, these overestimation errors can be further exaggerated through temporal difference (TD) learning, which may result in large overestimation bias and suboptimal policy updates.
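The inequality underlying (10) can be checked numerically; the sketch below uses hypothetical true Q-values over four actions and zero-mean Gaussian noise.

import numpy as np

# numerical check that the max operator turns zero-mean estimation noise into an
# upward bias: E[max_a(Q~(s',a') + eps)] >= max_a Q~(s',a')
rng = np.random.default_rng(0)
q_true = np.array([1.0, 0.8, 0.5, 0.2])          # assumed true Q-values over 4 actions
noise = rng.normal(0.0, 0.5, size=(100000, 4))   # zero-mean approximation errors
biased_max = np.max(q_true + noise, axis=1).mean()
print(f"max of true Q: {q_true.max():.3f}, mean of noisy max: {biased_max:.3f}")
# the noisy max is systematically larger, i.e. delta >= 0 in (10)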
B. Return Distribution for Reducing Overestimation
Before discussing the distributional version of Q-learning, we first assume that the random returns $Z(s,a)$ obey a Gaussian distribution $\mathcal{Z}(\cdot|s,a)$. Suppose the mean (i.e., Q-value) and standard deviation of the Gaussian distribution are approximated by two independent functions $Q_\theta(s,a)$ and $\sigma_\psi(s,a)$, with parameters $\theta$ and $\psi$, i.e., $\mathcal{Z}_{\theta,\psi}(\cdot|s,a) = \mathcal{N}(Q_\theta(s,a), \sigma_\psi(s,a)^2)$.

Similar to standard Q-learning, we first define a random greedy target $y_{\mathcal{D}} = r + \gamma Z(s',a'^{*})$, where $a'^{*} = \arg\max_{a'} Q_\theta(s',a')$. Suppose $y_{\mathcal{D}}\sim\mathcal{Z}_{\mathrm{target}}(\cdot|s,a)$, which is also assumed to be a Gaussian distribution. Note that even if $Z(s,a)$ and $y_{\mathcal{D}}$ are not strictly Gaussian, we can still use the Gaussian to approximate their distributions, which will not affect the subsequent analysis. Since $\mathbb{E}[y_{\mathcal{D}}] = \mathbb{E}[r] + \gamma\mathbb{E}_{s'}[\max_{a'} Q_\theta(s',a')]$ is equal to $y$ in (8), it follows that $\mathcal{Z}_{\mathrm{target}}(\cdot|s,a) = \mathcal{N}(y, \sigma_{\mathrm{target}}^2)$. Considering the loss function in (7) under the KL divergence measurement, $Q_\theta(s,a)$ and $\sigma_\psi(s,a)$ are updated by minimizing

$$D_{\mathrm{KL}}\big(\mathcal{Z}_{\mathrm{target}}(\cdot|s,a), \mathcal{Z}_{\theta,\psi}(\cdot|s,a)\big) = \log\frac{\sigma_\psi(s,a)}{\sigma_{\mathrm{target}}} + \frac{\sigma_{\mathrm{target}}^2 + (y - Q_\theta(s,a))^2}{2\sigma_\psi(s,a)^2} - \frac{1}{2}, \qquad (11)$$

that is,

$$\theta_{\mathrm{new}} = \theta + \beta\,\frac{y - Q_\theta(s,a)}{\sigma_\psi(s,a)^2}\,\nabla_\theta Q_\theta(s,a),$$
$$\psi_{\mathrm{new}} = \psi + \beta\,\frac{\Delta\sigma^2 + (y - Q_\theta(s,a))^2}{\sigma_\psi(s,a)^3}\,\nabla_\psi\sigma_\psi(s,a), \qquad (12)$$

where $\Delta\sigma^2 = \sigma_{\mathrm{target}}^2 - \sigma_\psi(s,a)^2$. Compared with standard Q-learning, $\sigma_\psi(s,a)$ plays the role of adaptively adjusting the update step size of $Q_\theta(s,a)$. In particular, the update step size of $Q_\theta(s,a)$ decreases squarely as $\sigma_\psi(s,a)$ increases. Supposing $Q_\theta(s,a)$ also obeys (9), the post-update parameters obtained based on the true target value $\tilde{y}$ are given by

$$\theta_{\mathrm{true}} = \theta + \beta\,\frac{\tilde{y} - Q_\theta(s,a)}{\sigma_\psi(s,a)^2}\,\nabla_\theta Q_\theta(s,a). \qquad (13)$$

Similar to the derivation of $\Delta(s,a)$, the overestimation bias of $Q_{\theta_{\mathrm{new}}}(s,a)$ in distributional Q-learning is

$$\Delta_{\mathcal{D}}(s,a) \approx \frac{\beta\gamma\delta\,\|\nabla_\theta Q_\theta(s,a)\|_2^2}{\sigma_\psi(s,a)^2} = \frac{\Delta(s,a)}{\sigma_\psi(s,a)^2}. \qquad (14)$$

Obviously, the overestimation error $\Delta_{\mathcal{D}}(s,a)$ is inversely proportional to $\sigma_\psi(s,a)^2$. In an ideal situation, when $\tilde{Q}(s,a) = \tilde{y}$, that is, $\tilde{Q}(s,a)$ has converged after a period of learning, we can derive that

$$\mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[\sigma_{\psi_{\mathrm{new}}}(s,a)\big] \gtrsim \sigma_\psi(s,a) + \beta\,\frac{\sigma_{\mathrm{target}}^2 - \sigma_\psi(s,a)^2 + \gamma^2\delta^2 + \mathbb{E}_{\epsilon_Q}[\epsilon_Q^2]}{\sigma_\psi(s,a)^3}\,\|\nabla_\psi\sigma_\psi(s,a)\|_2^2,$$

where this inequality holds approximately since we drop higher-order terms in the Taylor approximation. See Appendix B-A for details of the derivation.

Because $\sigma_{\psi_{\mathrm{new}}}$ is also the standard deviation for the next time step, this indicates that by repeatedly applying (12), the standard deviation $\sigma_\psi(s,a)$ of the return distribution tends toward a larger value in areas with high $\sigma_{\mathrm{target}}$ and random errors $\epsilon_Q$. Moreover, $\sigma_{\mathrm{target}}$ is often positively related to the randomness of the system $p$, the reward function $\mathcal{R}$ and the return distribution $\mathcal{Z}(\cdot|s',a')$ of subsequent state-action pairs. Since the overestimation bias $\Delta_{\mathcal{D}}(s,a)$ is inversely proportional to $\sigma_\psi(s,a)^2$ according to (14), distributional Q-learning can be used to mitigate overestimations caused by task randomness and approximation errors.
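As a scalar illustration of (12), the sketch below treats $Q$ and $\sigma$ themselves as the learnable parameters (so all gradients reduce to 1) and shows how the same target moves the Q-estimate far less when the learned standard deviation is large; the numbers are hypothetical.

import numpy as np

def distributional_q_update(q, sigma, y, sigma_target, beta=0.1):
    """One step of the mean/std updates in (12) for a single (s, a) pair,
    treating Q and sigma themselves as the parameters. The effective step
    size on Q shrinks as 1/sigma^2."""
    q_new = q + beta * (y - q) / sigma**2
    sigma_new = sigma + beta * ((sigma_target**2 - sigma**2) + (y - q)**2) / sigma**3
    return q_new, max(sigma_new, 1.0)   # sigma_min = 1, as in (17)

# toy usage: the same noisy target moves Q far less when sigma is large
print(distributional_q_update(q=0.0, sigma=1.0, y=2.0, sigma_target=1.5))
print(distributional_q_update(q=0.0, sigma=3.0, y=2.0, sigma_target=1.5))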
V. DISTRIBUTIONAL SOFT ACTOR-CRITIC

In this section, based on the developed DSPI framework, we derive the learning rules of the continuous return distribution, and propose the DSAC algorithm by replacing the clipped double Q-learning of SAC [16], [17] with the return distribution learning. We will consider a parameterized distributional value function $\mathcal{Z}_\theta(\cdot|s,a)$ and a stochastic policy $\pi_\phi(\cdot|s)$, where $\theta$ and $\phi$ are parameters. In this paper, both the state-action return distribution and policy functions are modeled as Gaussians with mean and covariance given by neural networks (NNs). We will next derive the update rules for the parameters of these NNs.
A. Algorithm
1) Distributional Soft Policy Evaluation: Considering the loss function in (7), the soft state-action return distribution can be trained to minimize (7) under the KL-divergence measurement,

$$J_{\mathcal{Z}}(\theta) = \mathbb{E}_{(s,a)\sim\mathcal{B}}\Big[D_{\mathrm{KL}}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}\mathcal{Z}_{\theta'}(\cdot|s,a),\, \mathcal{Z}_\theta(\cdot|s,a)\big)\Big], \qquad (15)$$

where $\mathcal{B}$ is a replay buffer of previously sampled experience, and $\theta'$ and $\phi'$ are the parameters of the target return distribution and policy functions, which are used to stabilize the learning process and evaluate the target. For practical applications, $\sigma_{\mathrm{target}}$ in (11) is unknown. Therefore, we cannot directly update $\mathcal{Z}_\theta(\cdot|s,a)$ using the objective shown in (11). After analysis, we get the following objective function equivalent to (15):

$$J_{\mathcal{Z}}(\theta) = -\mathbb{E}_{(s,a,r,s')\sim\mathcal{B},\, a'\sim\pi_{\phi'},\, Z(s',a')\sim\mathcal{Z}_{\theta'}(\cdot|s',a')}\Big[\log\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)\,\big|\,\mathcal{Z}_\theta(\cdot|s,a)\big)\Big].$$

We provide details of the derivation in Appendix B-B.

The parameters $\theta$ can be optimized with the following gradients:

$$\nabla_\theta J_{\mathcal{Z}}(\theta) = -\mathbb{E}_{(s,a,r,s')\sim\mathcal{B},\, a'\sim\pi_{\phi'},\, Z(s',a')\sim\mathcal{Z}_{\theta'}}\Big[\nabla_\theta\log\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)\,\big|\,\mathcal{Z}_\theta(\cdot|s,a)\big)\Big].$$

Since $\mathcal{Z}_\theta$ is assumed to be a Gaussian model, it can be expressed as $\mathcal{Z}_\theta(\cdot|s,a) = \mathcal{N}(Q_\theta(s,a), \sigma_\theta(s,a)^2)$, where $Q_\theta(s,a)$ and $\sigma_\theta(s,a)$ are the outputs of the value network. This makes the Gaussian variant of the update gradients

$$\nabla_\theta J_{\mathcal{Z}}(\theta) = -\mathbb{E}_{(s,a,r,s')\sim\mathcal{B},\, a'\sim\pi_{\phi'},\, Z(s',a')\sim\mathcal{Z}_{\theta'}}\Bigg[\nabla_\theta\log\frac{e^{-\frac{(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a) - Q_\theta(s,a))^2}{2\sigma_\theta(s,a)^2}}}{\sqrt{2\pi}\,\sigma_\theta(s,a)}\Bigg] = \mathbb{E}_{(s,a,r,s')\sim\mathcal{B},\, a'\sim\pi_{\phi'},\, Z(s',a')\sim\mathcal{Z}_{\theta'}}\Bigg[\nabla_\theta\frac{\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a) - Q_\theta(s,a)\big)^2}{2\sigma_\theta(s,a)^2} + \frac{\nabla_\theta\sigma_\theta(s,a)}{\sigma_\theta(s,a)}\Bigg].$$

Denoting $\Psi_{\mathcal{Z}}(\theta) = \log\mathcal{P}(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)\,|\,\mathcal{Z}_\theta(\cdot|s,a))$, to understand the composition of $\nabla_\theta J_{\mathcal{Z}}(\theta)$ more intuitively, we can rewrite it as

$$\nabla_\theta J_{\mathcal{Z}}(\theta) = -\mathbb{E}_{(s,a,r,s')\sim\mathcal{B},\, a'\sim\pi_{\phi'},\, Z(s',a')\sim\mathcal{Z}_{\theta'}}\Bigg[\frac{\partial\Psi_{\mathcal{Z}}(\theta)}{\partial Q_\theta(s,a)}\nabla_\theta Q_\theta(s,a) + \frac{\partial\Psi_{\mathcal{Z}}(\theta)}{\partial\sigma_\theta(s,a)}\nabla_\theta\sigma_\theta(s,a)\Bigg], \qquad (16)$$

where

$$\frac{\partial\Psi_{\mathcal{Z}}(\theta)}{\partial Q_\theta(s,a)} = \frac{\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a) - Q_\theta(s,a)}{\sigma_\theta(s,a)^2}, \qquad \frac{\partial\Psi_{\mathcal{Z}}(\theta)}{\partial\sigma_\theta(s,a)} = \frac{\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a) - Q_\theta(s,a)\big)^2}{\sigma_\theta(s,a)^3} - \frac{1}{\sigma_\theta(s,a)}.$$

It can be easily deduced from $\frac{\partial\Psi_{\mathcal{Z}}(\theta)}{\partial Q_\theta(s,a)}$ that the update step size of $Q_\theta(s,a)$ decreases squarely as $\sigma_\theta(s,a)$ increases, thereby mitigating Q-value overestimations. However, the gradients $\nabla_\theta J_{\mathcal{Z}}(\theta)$ are prone to explode as $\sigma_\theta(s,a)\to 0$, or to vanish as $\sigma_\theta(s,a)\to\infty$. To address this problem, we propose two options to keep $\sigma_\theta(s,a)$ within a reasonable range. The first is to limit the minimum value of $\sigma_\theta(s,a)$ by

$$\sigma_\theta(s,a) = \max\big(\sigma_\theta(s,a), \sigma_{\min}\big). \qquad (17)$$

Note that if $\sigma_{\min}\ge 1$, we always have $\Delta_{\mathcal{D}}(s,a)\le\Delta(s,a)$. Therefore, in this paper, we let $\sigma_{\min} = 1$. The second is to clip $\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)$ in $\frac{\partial\Psi_{\mathcal{Z}}(\theta)}{\partial\sigma_\theta(s,a)}$ to keep it close to the expectation $Q_\theta(s,a)$ of the current soft return distribution, thus stabilizing the learning process of $\sigma_\theta(s,a)$ and indirectly controlling its range, i.e.,

$$\frac{\partial\Psi_{\mathcal{Z}}(\theta)}{\partial\sigma_\theta(s,a)} = \frac{\big(\overline{\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)} - Q_\theta(s,a)\big)^2}{\sigma_\theta(s,a)^3} - \frac{1}{\sigma_\theta(s,a)},$$

where

$$\overline{\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)} = \mathrm{clip}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a),\, Q_\theta(s,a) - b,\, Q_\theta(s,a) + b\big), \qquad (18)$$

where $\mathrm{clip}(x, A, B)$ denotes that $x$ is clipped into the range $[A, B]$ and $b$ is the clipping boundary.

The target networks mentioned above use a slow-moving update rate, parameterized by $\tau$, such as

$$\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad \phi' \leftarrow \tau\phi + (1-\tau)\phi'.$$
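A minimal PyTorch-style sketch of this critic update is given below; it is not the authors' implementation, and the value_net/policy_net interfaces (returning the Gaussian mean and standard deviation, and a reparameterized sample with its log-probability) are assumptions. It combines the negative Gaussian log-likelihood objective with the sigma_min clamp of (17), the target clipping of (18), and the split gradient of (16) via detached terms.

import torch

def critic_loss(value_net, target_value_net, policy_net, batch,
                gamma=0.99, alpha=0.2, sigma_min=1.0, b=10.0):
    """Sketch of the DSAC critic update: Gaussian negative log-likelihood of a
    sampled distributional soft target, with sigma clamping and target clipping."""
    s, a, r, s_next = batch                       # tensors sampled from the replay buffer
    q, sigma = value_net(s, a)                    # assumed heads: mean and std of Z_theta
    sigma = torch.clamp(sigma, min=sigma_min)     # eq. (17)

    with torch.no_grad():
        a_next, logp_next = policy_net.sample(s_next)          # assumed reparameterized sampler
        q_next, sigma_next = target_value_net(s_next, a_next)
        z_next = q_next + sigma_next * torch.randn_like(q_next)  # Z(s',a') ~ N(q', sigma'^2)
        target = r + gamma * (z_next - alpha * logp_next)        # sample of T_D Z(s,a), eq. (6)
        target_clipped = torch.max(torch.min(target, q + b), q - b)  # eq. (18), for the sigma term

    # -log N(target | q, sigma^2); the mean term uses the raw target, the std term
    # uses the clipped target, mirroring the split gradient in (16)
    mean_term = (target - q) ** 2 / (2.0 * sigma.detach() ** 2)
    std_term = (target_clipped - q.detach()) ** 2 / (2.0 * sigma ** 2) + torch.log(sigma)
    return (mean_term + std_term).mean()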
2) Distributional Soft Policy Improvement: The policy can be learned by directly maximizing a parameterized variant of the objective in (4):

$$J_\pi(\phi) = \mathbb{E}_{s\sim\mathcal{B},\, a\sim\pi_\phi}\big[Q_\theta(s,a) - \alpha\log(\pi_\phi(a|s))\big] = \mathbb{E}_{s\sim\mathcal{B},\, a\sim\pi_\phi}\Big[\mathbb{E}_{Z(s,a)\sim\mathcal{Z}_\theta(\cdot|s,a)}[Z(s,a)] - \alpha\log(\pi_\phi(a|s))\Big].$$

If $a$ is unbounded, given the parameters of the action distribution, such as the mean and variance of the Gaussian distribution, $\log(\pi_\phi(a|s))$ can be easily calculated. On the other hand, if $a$ is bounded to a finite interval, its log-likelihood can also be obtained in the manner given in Appendix B-C.

There are several options, such as the log-derivative and reparameterization tricks, for maximizing $J_\pi(\phi)$ [34]. In this paper, we apply the reparameterization trick because it can reduce the gradient estimation variance.

If the soft Q-value function $Q_\theta(s,a)$ is explicitly parameterized through parameters $\theta$, we only need to express the random action $a$ as a deterministic variable, i.e.,

$$a = f_\phi(\xi_a; s), \qquad (19)$$

where $\xi_a\in\mathbb{R}^{\dim(\mathcal{A})}$ is an auxiliary variable which is sampled from some fixed distribution. In particular, since $\pi_\phi(\cdot|s)$ is assumed to be Gaussian in this paper, $f_\phi(\xi_a; s)$ can be formulated as

$$f_\phi(\xi_a; s) = a_{\mathrm{mean}} + \xi_a\odot a_{\mathrm{std}},$$

where $a_{\mathrm{mean}}\in\mathbb{R}^{\dim(\mathcal{A})}$ and $a_{\mathrm{std}}\in\mathbb{R}^{\dim(\mathcal{A})}$ are the mean and standard deviation of $\pi_\phi(\cdot|s)$, $\odot$ represents the Hadamard product and $\xi_a\sim\mathcal{N}(0, I_{\dim(\mathcal{A})})$. Then the policy update gradients can be approximated with

$$\nabla_\phi J_\pi(\phi) = \mathbb{E}_{s\sim\mathcal{B},\,\xi_a}\Big[-\alpha\nabla_\phi\log(\pi_\phi(a|s)) + \big(\nabla_a Q_\theta(s,a) - \alpha\nabla_a\log(\pi_\phi(a|s))\big)\nabla_\phi f_\phi(\xi_a; s)\Big].$$

If $Q_\theta(s,a)$ cannot be expressed explicitly through $\theta$, the policy update gradients can be obtained in the manner given in Appendix B-D.
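The corresponding policy update can be sketched as the following loss (the negated objective) under the reparameterization trick; the network interfaces are assumptions, and the tanh squashing correction of Appendix B-C is omitted here for brevity.

import torch

def policy_loss(value_net, policy_net, states, alpha=0.2):
    """Sketch of the reparameterized policy objective: maximize
    E[Q_theta(s, a) - alpha * log pi_phi(a|s)] with a = f_phi(xi; s)."""
    mean, std = policy_net(states)                # assumed Gaussian policy heads
    xi = torch.randn_like(mean)
    a = mean + xi * std                           # reparameterized action, eq. (19)
    logp = torch.distributions.Normal(mean, std).log_prob(a).sum(dim=-1)
    q, _ = value_net(states, a)                   # Q_theta(s, a): mean of Z_theta
    return (alpha * logp - q).mean()              # minimizing this maximizes J_pi(phi)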
3) Pseudo-code: Finally, according to [17], the temperature $\alpha$ is updated by minimizing the following objective:

$$J(\alpha) = \mathbb{E}_{(s,a)\sim\mathcal{B}}\big[-\alpha\log\pi_\phi(a|s) - \alpha\overline{\mathcal{H}}\big],$$

where $\overline{\mathcal{H}}$ is the expected entropy. In addition, two-timescale updates, i.e., less frequent policy updates, usually result in higher quality policy updates [12]. Therefore, the policy, temperature and target networks are updated every $m$ iterations in this paper. The final algorithm is listed in Algorithm 1. Fig. 1 shows the diagram of DSAC.
Algorithm 1 DSAC Algorithm
Initialize parameters θ, φ and α
Initialize target parameters θ' ← θ, φ' ← φ
Initialize learning rates β_Z, β_π, β_α and τ
Initialize iteration index k = 0
repeat
    Select action a ∼ π_φ(a|s)
    Observe reward r and new state s'
    Store transition tuple (s, a, r, s') in buffer B
    Sample N transitions (s, a, r, s') from B
    Update soft return distribution θ ← θ − β_Z ∇_θ J_Z(θ)
    if k mod m = 0 then
        Update policy φ ← φ + β_π ∇_φ J_π(φ)
        Adjust temperature α ← α − β_α ∇_α J(α)
        Update target networks:
            θ' ← τθ + (1 − τ)θ', φ' ← τφ + (1 − τ)φ'
    end if
    k ← k + 1
until Convergence
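The temperature adjustment and the slow-moving target update used in Algorithm 1 can be sketched as follows; optimizing log α instead of α (to keep the temperature positive) is a common implementation choice and an assumption here, as is taking the target entropy −dim(A) from Table IV.

import torch

def temperature_loss(log_alpha, logp_batch, target_entropy):
    """Sketch of J(alpha): alpha is pushed up when the policy entropy falls
    below the target entropy, and pushed down otherwise."""
    alpha = log_alpha.exp()
    return -(alpha * (logp_batch.detach() + target_entropy)).mean()

def soft_update(target_net, net, tau=0.001):
    """Slow-moving target update theta' <- tau*theta + (1 - tau)*theta'."""
    for p_t, p in zip(target_net.parameters(), net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)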
Fig. 1. DSAC diagram. The return distribution and policy are approximated by two NNs, called the distributional value network and the policy network, respectively. DSAC first updates the distributional value network based on the samples collected from the buffer. Then, the output of the value network is used to guide the update of the policy network.
B. Architecture
Algorithm 1 and Fig. 1 show the operation process of DSAC
in a serial way. Like most off-policy RL algorithms, we can
use parallel or distributed learning techniques to improve the
learning efficiency of DSAC. Therefore, we build a new parallel asynchronous buffer-actor-learner architecture (PABAL), referring to other high-throughput learning architectures such as IMPALA and Ape-X [3], [35], [36]. As shown in
Fig. 2, buffers, actors and learners are all distributed across
multiple workers, which are used to improve the efficiency of
storage and sampling, exploration, and updating, respectively.
All communication between modules is asynchronous.
Both actors and learners asynchronously synchronize the
parameters from the shared memory. The experience generated
by each actor is asynchronously and randomly sent to a
certain buffer at each time step. Each buffer continuously
stores data and sends the sampled experience to a random
learner. Relying on the received sampled data, the learners
calculate the update gradients using their local functions, and
then use these gradients to update the shared value and policy
functions. In this paper, we implement DSAC and other off-
policy baseline algorithms within the PABAL architecture.
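A single-machine toy sketch of the PABAL data flow is given below, using Python threads and queues in place of distributed workers; it only illustrates the asynchronous actor-buffer-learner decoupling and is not the authors' implementation.

import queue, random, threading, time
from collections import deque

# actor threads generate experience, a buffer thread stores and samples it,
# and a learner thread consumes sampled batches; all communication uses queues
experience_q, sample_q = queue.Queue(), queue.Queue(maxsize=8)
replay = deque(maxlen=100000)

def actor(env_step):
    while True:
        experience_q.put(env_step())          # (s, a, r, s') from a local policy copy

def buffer(batch_size=256):
    while True:
        replay.append(experience_q.get())
        if len(replay) >= batch_size and not sample_q.full():
            sample_q.put(random.sample(replay, batch_size))

def learner(update_fn):
    while True:
        update_fn(sample_q.get())             # compute gradients, update shared networks

if __name__ == "__main__":
    dummy_step = lambda: (0.0, 0.0, random.random(), 0.0)   # placeholder environment step
    dummy_update = lambda batch: time.sleep(0.01)           # placeholder learner update
    for target in (lambda: actor(dummy_step), buffer, lambda: learner(dummy_update)):
        threading.Thread(target=target, daemon=True).start()
    time.sleep(1.0)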
Fig. 2. The PABAL architecture. Buffers, actors, and learners are all distributed across multiple workers. Communication between different modules is asynchronous.
VI. EXPERIMENTAL VERIFICATION
A. Benchmarks
To evaluate our algorithm, we measure its performance and Q-value estimation bias on a suite of MuJoCo continuous control tasks without any modifications to the environments [37], interfaced through OpenAI Gym [38]. Fig. 3 shows the benchmark tasks used in this paper. See Appendix C-A for brief descriptions of these benchmarks.
Fig. 3: Tasks. (a) Humanoid-v2: $(s\times a)\in\mathbb{R}^{376}\times\mathbb{R}^{17}$. (b) HalfCheetah-v2: $(s\times a)\in\mathbb{R}^{17}\times\mathbb{R}^{6}$. (c) Ant-v2: $(s\times a)\in\mathbb{R}^{111}\times\mathbb{R}^{8}$. (d) Walker2d-v2: $(s\times a)\in\mathbb{R}^{17}\times\mathbb{R}^{6}$. (e) InvertedDoublePendulum-v2: $(s\times a)\in\mathbb{R}^{11}\times\mathbb{R}^{1}$.
B. Baselines
We compare our algorithm against Deep Deterministic
Policy Gradient (DDPG) [14], Trust Region Policy Optimiza-
tion (TRPO) [24], Proximal Policy Optimization (PPO) [25],
Distributed Distributional Deep Deterministic Policy Gradients
(D4PG) [23], Twin Delayed Deep Deterministic policy gradi-
ent (TD3) [12], and Soft Actor-Critic (SAC) [17]. DDPG, TRPO,
PPO, D4PG, TD3 and SAC are mainstream RL algorithms,
which have been extensively verified and applied in a variety
of challenging domains. Using these algorithms as baselines,
the performance of the proposed DSAC algorithm can be
evaluated objectively.
We additionally compare our method with our proposed
Twin Delayed Distributional Deep Deterministic policy gra-
dient algorithm (TD4), which is developed by replacing the
clipped double Q-learning in TD3 with the distributional
return learning; Double Q-learning variant of SAC (Double-Q
SAC), in which we replace the clipped double Q-learning of
SAC with the actor-critic variant of double Q-learning [12],
[15]; and single Q-value variant of SAC (Single-Q SAC), in
which we replace the clipped double Q-learning of SAC with
traditional TD learning. See Appendix C-B, C-C and C-D
for detailed descriptions of Double-Q SAC, Single-Q SAC
and TD4 algorithms. Double-Q SAC and Single-Q SAC are
adapted from SAC. Table I gives a basic description of DSAC
and each baseline. It is clear that DSAC, SAC, Double-Q
SAC and Single-Q SAC algorithms respectively use the return
distribution learning, clipped double Q-learning, double Q-
learning and traditional TD learning for policy evaluation. This
is the only difference between these algorithms. Therefore, we
can assess the impact of the distribution learning by comparing
DSAC with SAC, Single-Q SAC and Double-Q SAC. Besides,
we compare DSAC with TD4, which uses the distribution
learning but not maximum entropy, to assess the impact of
policy entropy.
All the off-policy algorithms mentioned above are imple-
mented in the proposed PABAL architecture, including 4
learners, 6 actors and 3 buffers. We use a fully connected
network with 5 hidden layers, consisting of 256 units per layer, with Gaussian Error Linear Units (GELU) [39] as the activation of each layer, for both the actor and the critic. For the distributional value function
and stochastic policy, we use a Gaussian distribution with
mean and covariance given by a NN, where the covariance
matrix is diagonal. In this case, each NN maps the input
states to the mean and logarithm of standard deviation of the
Gaussian distribution. The Adam method [40] with a cosine
annealing learning rate is used to update all the parameters.
All algorithms adopt almost the same NN architecture and
hyperparameters. Table IV in Appendix C-E provides more
detailed hyperparameters of all algorithms.
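For concreteness, a sketch of such a distributional value network is given below (layer sizes from Table IV; the output convention of a mean and a log standard deviation follows the description above, the rest is an assumption).

import torch
import torch.nn as nn

class DistributionalValueNet(nn.Module):
    """Sketch of the critic of Section VI-B: 5 hidden layers of 256 GELU units,
    outputting the mean and log standard deviation of Z_theta(.|s, a)."""
    def __init__(self, state_dim, action_dim, hidden=256, n_layers=5):
        super().__init__()
        layers, in_dim = [], state_dim + action_dim
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, hidden), nn.GELU()]
            in_dim = hidden
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, 2)   # [mean, log_std]

    def forward(self, s, a):
        out = self.head(self.body(torch.cat([s, a], dim=-1)))
        mean, log_std = out.chunk(2, dim=-1)
        sigma = torch.clamp(log_std.exp(), min=1.0)   # sigma_min = 1, eq. (17)
        return mean.squeeze(-1), sigma.squeeze(-1)

# usage sketch on Ant-v2 dimensions (111-dimensional state, 8-dimensional action)
net = DistributionalValueNet(111, 8)
q, sigma = net(torch.zeros(32, 111), torch.zeros(32, 8))
print(q.shape, sigma.shape)   # torch.Size([32]) torch.Size([32])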
C. Results
1) Performance: We train 5 different runs of each algorithm with different random seeds, with evaluations every 20000 iterations. Each evaluation calculates the average return over 5 episodes without exploration noise, where the maximum length of each episode is 1000 time steps. The learning curves are shown in Fig. 4 and the results in Table II. The results show that the proposed DSAC algorithm outperforms or matches all other baseline algorithms across all benchmark tasks in terms of the final performance. For example, compared with well-known RL algorithms such as SAC, TD3, PPO, and DDPG, DSAC gains 20.0%, 63.8%, 39.8%, and 97.6% improvements on the most complex Humanoid-v2 task, respectively.
TABLE I
BASIC DESCRIPTION OF DSAC AND BASELINES.

Algorithm | Algorithm Type | Policy Type | Policy Evaluation | Policy Improvement
DSAC (Ours) | off-policy | Stochastic | Continuous soft return distribution learning | Soft policy gradient
SAC [17] | off-policy | Stochastic | Clipped double Q-learning | Soft policy gradient
Double-Q SAC | off-policy | Stochastic | Double Q-learning | Soft policy gradient
Single-Q SAC | off-policy | Stochastic | Traditional TD learning | Soft policy gradient
TD4 | off-policy | Deterministic | Continuous return distribution learning | Policy gradient
TD3 [12] | off-policy | Deterministic | Clipped double Q-learning | Policy gradient
DDPG [14] | off-policy | Deterministic | Traditional TD learning | Policy gradient
D4PG [23] | off-policy | Deterministic | Discrete return distribution learning | Policy gradient
TRPO [24] | on-policy | Stochastic | Traditional TD learning | Constrained Policy Optimization
PPO [25] | on-policy | Stochastic | Traditional TD learning | Proximal Policy Optimization
Fig. 4: Training curves on continuous control benchmarks: (a) Humanoid-v2, (b) Ant-v2, (c) Walker2d-v2, (d) HalfCheetah-v2, (e) InvertedDoublePendulum-v2. The solid lines correspond to the mean and the shaded regions correspond to the 95% confidence interval over 5 runs.
This indicates that the final performance of DSAC on these benchmarks exceeds the state of the art. Fig. 5 visually shows the control performance of DSAC and SAC on Humanoid-v2. It is obvious that DSAC realizes a movement closer to human running. Among DSAC, SAC, Single-Q SAC and Double-Q SAC, DSAC achieves the best performance on all tasks, which shows that the return distribution learning is an important measure to improve policy performance. Besides, TD4 also outperforms TD3 and DDPG on most tasks, which shows that algorithms with deterministic policies also benefit greatly from the return distribution learning. As TD4 exceeds the performance of D4PG, which learns a discrete return distribution, by a wide margin on Humanoid-v2, Ant-v2 and HalfCheetah-v2, this indicates that learning a continuous distribution causes significant performance improvements in most cases. Compared with TD4, DSAC achieves 33.8%, 22.1%, 10.4%, and 8.0% improvements on Humanoid-v2, Ant-v2, Walker2d-v2, and HalfCheetah-v2, respectively, suggesting that the maximum entropy framework is an effective measure to achieve good performance.
2) Q-value Estimation Accuracy: To evaluate the impact of the return distribution learning on Q-value estimation accuracy, this section compares the estimation bias of DSAC, SAC, Double-Q SAC and Single-Q SAC on different benchmarks. The Q-value estimation bias is equal to the difference between the Q-value estimate and the true Q-value. To approximate the true Q-value, we calculate the average actual discounted return over the states of 10 episodes every 20000 iterations (evaluating up to the first 200 states per episode). Fig. 6 graphs the average Q-value estimate and true Q-value curves during learning. Table III gives the average relative Q-value estimation bias, which equals the Q-value estimation bias divided by the true Q-value.
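A sketch of this evaluation procedure is given below: the actual discounted return along an evaluation episode serves as a stand-in for the true Q-value, and the relative bias is the normalized difference against the critic's estimates; function names are illustrative.

import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Actual discounted return G_t for every state of one evaluation episode."""
    g, out = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

def relative_bias(q_estimates, rewards, gamma=0.99, max_states=200):
    """Average relative Q-value estimation bias, as in Table III:
    (estimate - true) / true, averaged over the first max_states states."""
    true_q = discounted_returns(rewards, gamma)[:max_states]
    est = np.asarray(q_estimates)[:max_states]
    return np.mean((est - true_q) / true_q)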
TABLE II
AVERAGE FINAL RETURN. MAXIMUM VALUE FOR EACH TASK IS BOLDED. ± CORRESPONDS TO A SINGLE STANDARD DEVIATION OVER 5 RUNS.

Task | Humanoid-v2 | Ant-v2 | Walker2d-v2 | HalfCheetah-v2 | InvDoublePendulum-v2
DSAC (Ours) | 10824±347 | 9547±346 | 6920±405 | 17479±148 | 9359.7±0.2
SAC | 9019±292 | 7856±416 | 5878±580 | 17300±39 | 9359.6±0.2
Double-Q SAC | 9844±396 | 7682±428 | 5881±227 | 16926±132 | 9359.4±0.6
Single-Q SAC | 8525±488 | 6783±197 | 2176±1251 | 16445±815 | 9355.2±3.6
TD4 | 8090±789 | 7821±262 | 6270±435 | 16187±538 | 9320.2±18.3
TD3 | 6610±1062 | 7828±642 | 4864±512 | 5619±5779 | 9315.5±10.4
DDPG | 5477±2438 | 6060±747 | 2849±690 | 11214±6861 | 9198.0±13.1
D4PG | 175±53 | 2367±303 | 6588±260 | 7215±89 | 9300.9±16.3
PPO | 7743±267 | 5889±111 | 6654±492 | 9517±936 | 9318.7±0.7
TRPO | 581±56 | 3767±573 | 2870±28 | 3274±346 | 9324.6±2.8
Fig. 5: DSAC vs SAC on Humanoid-v2. (a) DSAC (Ours). (b) SAC.
Note that this part excludes the InvertedDoublePendulum-v2 task because, due to its simplicity, a good policy has been learned before the value function converges.

Compared with Single-Q SAC, which updates the Q-value using the traditional TD learning method, the overestimation bias of DSAC is reduced by 10.53, 5.76, 926.09, and 1.89 percentage points on Humanoid-v2, Ant-v2, Walker2d-v2, and HalfCheetah-v2, respectively. Our results support the theoretical analysis in Section IV-B, i.e., the return distribution learning can be used to reduce overestimations without introducing any additional value or policy network. As a comparison, SAC (which uses clipped double Q-learning) and Double-Q SAC (which uses double Q-learning) suffer from underestimations during the learning procedure. While the effect of each value learning method varies from task to task, the Q-value estimation accuracy of DSAC is higher than that of SAC and Double-Q SAC in most cases. This explains why DSAC exceeds Single-Q SAC, SAC, and Double-Q SAC on most benchmarks by a wide margin. Therefore, our results demonstrate that the return distribution learning can greatly improve policy performance by mitigating overestimations.
3) Time Efficiency: Fig. 7 compares the time efficiency of different off-policy algorithms. Results show that the average wall-clock time consumption per 1000 iterations of DSAC is comparable to DDPG, and much lower than that of SAC, TD3, and Double-Q SAC. This is because, unlike double Q-learning and clipped double Q-learning, the return distribution learning does not need to introduce any additional value network or policy network (excluding target networks) to reduce overestimations.
D. Ablation Studies
As shown in Table IV, compared with SAC, DSAC introduces two hyperparameters: 1) the minimum standard deviation $\sigma_{\min}$ in (17), and 2) the clipping boundary $b$ in (18). These two hyperparameters are employed to prevent exploding and vanishing gradient problems when learning the continuous distributional value function $\mathcal{Z}_\theta(\cdot|s,a)$.

We first take the Ant-v2 task as an example to analyze the influence of $\sigma_{\min}$ on the final performance. From (16), the gradients $\nabla_\theta J_{\mathcal{Z}}(\theta)$ are prone to explode as $\sigma_\theta(s,a)\to 0$. Therefore, $\sigma_\theta(s,a)$ should be bounded below by a specific positive value. Besides, according to the analysis in Section IV-B, if $\sigma_{\min}\ge 1$, we always have $\Delta_{\mathcal{D}}(s,a)\le\Delta(s,a)$. However, a too large $\sigma_{\min}$ may reduce the estimation accuracy of the return distribution. Therefore, this paper sets $\sigma_{\min} = 1$. Fig. 8a graphs the average final return of DSAC under different $\sigma_{\min}$ values on Ant-v2. Our results show that when $\sigma_{\min} = 1$, DSAC achieves the best final performance on Ant-v2, which is consistent with the above analysis.

We additionally perform an ablation study to compare the performance of DSAC with different clipping boundaries $b$. Our results are presented in Fig. 8b. In this paper, the clipping boundary $b$ is employed to stabilize the learning process of $\sigma_\theta(s,a)$ and keep it in a reasonable range. Results indicate that, compared with the performance of removing the clipping boundary trick from DSAC (i.e., $b = +\infty$), the inclusion of $b$ (for different $b$ values) generally improves performance.
Fig. 6: Average true Q-value vs estimated Q-value on (a) Humanoid-v2, (b) Ant-v2, (c) Walker2d-v2 and (d) HalfCheetah-v2. The solid lines correspond to the mean and the shaded regions correspond to the 95% confidence interval over 5 runs.
TABLE III
AVERAGE RELATIVE Q-VALUE ESTIMATION BIAS OVER 5 RUNS. WE AVERAGE THE RELATIVE ESTIMATION BIAS FROM 1.5 MILLION TO 3 MILLION ITERATIONS FOR EACH RUN. + AND − INDICATE OVERESTIMATION AND UNDERESTIMATION, RESPECTIVELY. ± CORRESPONDS TO A SINGLE STANDARD DEVIATION OVER 5 RUNS.

Algorithm | Main difference | Humanoid-v2 | Ant-v2 | Walker2d-v2 | HalfCheetah-v2
DSAC (Ours) | Continuous return distribution learning | +5.32%±0.62% | +3.48%±0.69% | +17.71%±2.30% | -0.33%±0.18%
Single-Q SAC | Traditional TD learning | +15.85%±1.06% | +9.24%±5.74% | +943.80%±683.94% | +1.56%±1.67%
SAC | Clipped double Q-learning | -10.16%±1.37% | -4.07%±0.66% | -1.45%±1.06% | -0.99%±0.66%
Double-Q SAC | Double Q-learning | -4.63%±1.70% | -16.68%±4.21% | -12.84%±4.03% | -0.33%±0.32%
Fig. 7. Algorithm comparison in terms of time efficiency on the Ant-v2 benchmark. Each boxplot is drawn based on the values of 50 evaluations. All evaluations were performed on a single computer with a 2.4 GHz 20-core Intel Xeon CPU.
Therefore, DSAC appears to benefit greatly from the clipping boundary trick. However, the final performance is somewhat sensitive to the value of $b$. This is because too small a $b$ will reduce the learning accuracy of the return distribution, while too large a $b$ cannot effectively limit the range of $\sigma_\theta(s,a)$. In practical applications, it is usually necessary to select an appropriate $b$ value according to the range of the state-action return $Z(s,a)$, which limits the flexibility of the DSAC algorithm. We will focus on this issue in the future.

Fig. 8: Average final return of DSAC under different hyperparameters on Ant-v2 over 5 runs. (a) Performance under different $\sigma_{\min}$ (with $b = 10$). (b) Performance under different $b$ (with $\sigma_{\min} = 1$).
VII. CONCLUSIONS

In this paper, we propose an off-policy RL algorithm for continuous control settings, called distributional soft actor-critic (DSAC), to mitigate Q-value overestimations, thereby improving policy performance. We first discover in theory that the update step size of the Q-value function in distributional RL decreases squarely as the standard deviation of state-action returns increases, thus mitigating Q-value overestimations. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL, which alternates between distributional soft policy evaluation and soft policy improvement. Next, a deep off-policy actor-critic variant of DSPI, i.e., DSAC, is proposed to directly learn a continuous return distribution by keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC and 9 baselines (such as SAC, TD3, PPO, and DDPG) on the suite of MuJoCo tasks. Results show that DSAC outperforms or matches all other baseline algorithms across all benchmarks.
APPENDIX A
PROOF OF CONVERGENCE OF DISTRIBUTIONAL SOFT POLICY ITERATION
In this appendix, we present proofs to show that Distribu-
tional Soft Policy Iteration (DSPI), which alternates between
(6) and (4), would lead to policy improvement with respect
to the maximum entropy objective. The proofs borrow heavily
from the policy evaluation and policy improvement theorems
of Q-learning, distributional RL and soft Q-learning [7], [16],
[18].
Lemma 1. (Distributional Soft Policy Evaluation). Consider the distributional soft Bellman backup operator $\mathcal{T}^\pi_{\mathcal{D}}$ in (6) and a soft state-action return distribution function $\mathcal{Z}_0(Z_0(s,a)|s,a): \mathcal{S}\times\mathcal{A}\to\mathcal{P}(Z_0(s,a))$, which maps a state-action pair $(s,a)$ to a distribution over random soft state-action returns $Z_0(s,a)$, and define $Z_{i+1}(s,a) = \mathcal{T}^\pi_{\mathcal{D}} Z_i(s,a)$, where $Z_{i+1}(s,a)\sim\mathcal{Z}_{i+1}(\cdot|s,a)$. Then the sequence $\mathcal{Z}_i$ will converge to $\mathcal{Z}^\pi$ as $i\to\infty$.

Proof. Let $\mathbb{Z}$ denote the space of soft return functions $Z$. Define the entropy-augmented reward as $r_\pi(s,a) = r(s,a) - \gamma\alpha\log\pi(a'|s')$ and rewrite the distributional soft Bellman operator as

$$\mathcal{T}^\pi_{\mathcal{D}} Z(s,a) \overset{D}{=} r_\pi(s,a) + \gamma Z(s',a'),$$

where $r\sim\mathcal{R}(\cdot|s,a)$, $s'\sim p$, $a'\sim\pi$. Then we can directly apply the standard convergence results for policy evaluation of distributional RL [18], that is, $\mathcal{T}^\pi_{\mathcal{D}}: \mathbb{Z}\to\mathbb{Z}$ is a $\gamma$-contraction in terms of some measure. Therefore, $\mathcal{T}^\pi_{\mathcal{D}}$ has a unique fixed point, which is $Z^\pi$, and the sequence $Z_i$ will converge to it as $i\to\infty$, i.e., $\mathcal{Z}_i$ will converge to $\mathcal{Z}^\pi$ as $i\to\infty$. ∎
Lemma 2. (Soft Policy Improvement). Let $\pi_{\mathrm{new}}$ be the optimal solution of the maximization problem defined in (4). Then $Q^{\pi_{\mathrm{new}}}(s,a)\ge Q^{\pi_{\mathrm{old}}}(s,a)$ for $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$.

Proof. From (4), one has

$$\pi_{\mathrm{new}}(\cdot|s) = \arg\max_\pi\,\mathbb{E}_{a\sim\pi}\big[Q^{\pi_{\mathrm{old}}}(s,a) - \alpha\log\pi(a|s)\big], \quad \forall s\in\mathcal{S}, \qquad (20)$$

then it is obvious that

$$\mathbb{E}_{a\sim\pi_{\mathrm{new}}}\big[Q^{\pi_{\mathrm{old}}}(s,a) - \alpha\log\pi_{\mathrm{new}}(a|s)\big] \ge \mathbb{E}_{a\sim\pi_{\mathrm{old}}}\big[Q^{\pi_{\mathrm{old}}}(s,a) - \alpha\log\pi_{\mathrm{old}}(a|s)\big], \quad \forall s\in\mathcal{S}. \qquad (21)$$

Next, from (3), it follows that

$$\begin{aligned}
Q^{\pi_{\mathrm{old}}}(s,a) &= \mathbb{E}_{r\sim\mathcal{R}(\cdot|s,a)}[r] + \gamma\,\mathbb{E}_{s'\sim p,\, a'\sim\pi_{\mathrm{old}}}\big[Q^{\pi_{\mathrm{old}}}(s',a') - \alpha\log\pi_{\mathrm{old}}(a'|s')\big] \\
&\le \mathbb{E}_{r\sim\mathcal{R}(\cdot|s,a)}[r] + \gamma\,\mathbb{E}_{s'\sim p,\, a'\sim\pi_{\mathrm{new}}}\big[Q^{\pi_{\mathrm{old}}}(s',a') - \alpha\log\pi_{\mathrm{new}}(a'|s')\big] \\
&\;\;\vdots \\
&\le Q^{\pi_{\mathrm{new}}}(s,a), \quad \forall(s,a)\in\mathcal{S}\times\mathcal{A},
\end{aligned}$$

where we have repeatedly expanded $Q^{\pi_{\mathrm{old}}}$ on the right-hand side by applying (3). ∎
Theorem 1. (Distributional Soft Policy Iteration). The distributional soft policy iteration, which alternates between distributional soft policy evaluation and soft policy improvement, converges to a policy $\pi^{*}$ such that $Q^{\pi^{*}}(s,a)\ge Q^{\pi}(s,a)$ for $\forall\pi$ and $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$, assuming that $|\mathcal{A}|<\infty$ and the reward is bounded.

Proof. Let $\pi_k$ denote the policy at iteration $k$. For $\pi_k$, we can always find its associated $\mathcal{Z}^{\pi_k}$ through the distributional soft policy evaluation process following from Lemma 1. Therefore, we can obtain $Q^{\pi_k}$ according to (5). By Lemma 2, the sequence $Q^{\pi_k}(s,a)$ is monotonically increasing for $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$. Since $Q^\pi$ is bounded everywhere for $\forall\pi$ (both the reward and policy entropy are bounded), the policy sequence $\pi_k$ converges to some $\pi^{*}$ as $k\to\infty$. At convergence, it must follow that

$$\mathbb{E}_{a\sim\pi^{*}}\big[Q^{\pi^{*}}(s,a) - \alpha\log\pi^{*}(a|s)\big] \ge \mathbb{E}_{a\sim\pi}\big[Q^{\pi^{*}}(s,a) - \alpha\log\pi(a|s)\big], \quad \forall\pi,\, \forall s\in\mathcal{S}. \qquad (22)$$

Using the same iterative argument as in Lemma 2, we have

$$Q^{\pi^{*}}(s,a) \ge Q^{\pi}(s,a), \quad \forall\pi,\, \forall(s,a)\in\mathcal{S}\times\mathcal{A}.$$

Hence $\pi^{*}$ is optimal. ∎
APPENDIX B
DERIVATIONS
A. Derivation of the Standard Deviation in Distributional Q-learning

Since the random error $\epsilon_Q$ in (9) is assumed to be independent of $(s,a)$, $\delta$ in (10) can be further expressed as

$$\delta = \mathbb{E}_{\epsilon'_Q}\mathbb{E}_{s'}[\max_{a'} Q(s',a')] - \mathbb{E}_{s'}[\max_{a'}\tilde{Q}(s',a')] = \mathbb{E}_{\epsilon'_Q}\mathbb{E}_{s'}\big[\max_{a'} Q_\theta(s',a') - \max_{a'}\tilde{Q}(s',a')\big].$$

Defining $\eta = \mathbb{E}_{s'}\big[\max_{a'} Q_\theta(s',a') - \max_{a'}\tilde{Q}(s',a')\big]$, it follows that

$$\delta = \mathbb{E}_{\epsilon'_Q}[\eta].$$

From (12), we linearize the post-update standard deviation around $\psi$ using Taylor's expansion:

$$\sigma_{\psi_{\mathrm{new}}}(s,a) \approx \sigma_\psi(s,a) + \beta\,\frac{\Delta\sigma^2 + (y - Q_\theta(s,a))^2}{\sigma_\psi(s,a)^3}\,\|\nabla_\psi\sigma_\psi(s,a)\|_2^2.$$

Then, in expectation, the post-update standard deviation is

$$\mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[\sigma_{\psi_{\mathrm{new}}}(s,a)\big] \approx \sigma_\psi(s,a) + \beta\,\frac{\Delta\sigma^2 + \mathbb{E}_{\epsilon_Q,\epsilon'_Q}[(y - Q_\theta(s,a))^2]}{\sigma_\psi(s,a)^3}\,\|\nabla_\psi\sigma_\psi(s,a)\|_2^2.$$

Since $\mathbb{E}_{\epsilon_Q}[\epsilon_Q] = 0$, the $\mathbb{E}_{\epsilon_Q,\epsilon'_Q}[(y - Q_\theta(s,a))^2]$ term can be expanded as

$$\begin{aligned}
\mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[(y - Q_\theta(s,a))^2\big]
&= \mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[(\mathbb{E}[r] + \gamma\mathbb{E}_{s'}[\max_{a'} Q_\theta(s',a')] - Q_\theta(s,a))^2\big] \\
&= \mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[(\mathbb{E}[r] + \gamma\mathbb{E}_{s'}[\max_{a'}\tilde{Q}(s',a')] + \gamma\eta - \tilde{Q}(s,a) - \epsilon_Q)^2\big] \\
&= \mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[(\tilde{y} - \tilde{Q}(s,a) + \gamma\eta - \epsilon_Q)^2\big] \\
&= (\tilde{y} - \tilde{Q}(s,a))^2 + \mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[(\gamma\eta - \epsilon_Q)^2\big] + \mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[2(\tilde{y} - \tilde{Q}(s,a))(\gamma\eta - \epsilon_Q)\big] \\
&= (\tilde{y} - \tilde{Q}(s,a))^2 + \gamma^2\mathbb{E}_{\epsilon'_Q}[\eta^2] + \mathbb{E}_{\epsilon_Q}[\epsilon_Q^2] + 2\gamma(\tilde{y} - \tilde{Q}(s,a))\mathbb{E}_{\epsilon'_Q}[\eta] - 2\big(\gamma\mathbb{E}_{\epsilon'_Q}[\eta] + \tilde{y} - \tilde{Q}(s,a)\big)\mathbb{E}_{\epsilon_Q}[\epsilon_Q] \\
&= (\tilde{y} - \tilde{Q}(s,a))^2 + \gamma^2\mathbb{E}_{\epsilon'_Q}[\eta^2] + \mathbb{E}_{\epsilon_Q}[\epsilon_Q^2] + 2\gamma\delta(\tilde{y} - \tilde{Q}(s,a)).
\end{aligned}$$

In an ideal situation, when $\tilde{Q}(s,a) = \tilde{y}$, that is, $\tilde{Q}(s,a)$ has converged after a period of learning, we further have

$$\mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[(y - Q_\theta(s,a))^2\big] = \gamma^2\mathbb{E}_{\epsilon'_Q}[\eta^2] + \mathbb{E}_{\epsilon_Q}[\epsilon_Q^2].$$

Furthermore, since $\mathbb{E}_{\epsilon'_Q}[\eta^2]\ge\mathbb{E}_{\epsilon'_Q}[\eta]^2$, we have

$$\begin{aligned}
\mathbb{E}_{\epsilon_Q,\epsilon'_Q}\big[\sigma_{\psi_{\mathrm{new}}}(s,a)\big]
&\approx \sigma_\psi(s,a) + \beta\,\frac{\Delta\sigma^2 + \gamma^2\mathbb{E}_{\epsilon'_Q}[\eta^2] + \mathbb{E}_{\epsilon_Q}[\epsilon_Q^2]}{\sigma_\psi(s,a)^3}\,\|\nabla_\psi\sigma_\psi(s,a)\|_2^2 \\
&\ge \sigma_\psi(s,a) + \beta\,\frac{\Delta\sigma^2 + \gamma^2\mathbb{E}_{\epsilon'_Q}[\eta]^2 + \mathbb{E}_{\epsilon_Q}[\epsilon_Q^2]}{\sigma_\psi(s,a)^3}\,\|\nabla_\psi\sigma_\psi(s,a)\|_2^2 \\
&= \sigma_\psi(s,a) + \beta\,\frac{\Delta\sigma^2 + \gamma^2\delta^2 + \mathbb{E}_{\epsilon_Q}[\epsilon_Q^2]}{\sigma_\psi(s,a)^3}\,\|\nabla_\psi\sigma_\psi(s,a)\|_2^2.
\end{aligned}$$
B. Derivation of the Objective Function for Soft Return Distribution Update

From (7), the loss function for the soft state-action return distribution under the KL-divergence measurement is

$$\begin{aligned}
J_{\mathcal{Z}}(\theta)
&= \mathbb{E}_{(s,a)\sim\mathcal{B}}\Big[D_{\mathrm{KL}}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}\mathcal{Z}_{\theta'}(\cdot|s,a),\, \mathcal{Z}_\theta(\cdot|s,a)\big)\Big] \\
&= \mathbb{E}_{(s,a)\sim\mathcal{B}}\Bigg[\sum_{\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)} \mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)\,\big|\,\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}\mathcal{Z}_{\theta'}(\cdot|s,a)\big)\, \log\frac{\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)\,\big|\,\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}\mathcal{Z}_{\theta'}(\cdot|s,a)\big)}{\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)\,\big|\,\mathcal{Z}_\theta(\cdot|s,a)\big)}\Bigg] \\
&= -\mathbb{E}_{(s,a)\sim\mathcal{B}}\Bigg[\sum_{\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)} \mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)\,\big|\,\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}\mathcal{Z}_{\theta'}(\cdot|s,a)\big)\, \log\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)\,\big|\,\mathcal{Z}_\theta(\cdot|s,a)\big)\Bigg] + c \\
&= -\mathbb{E}_{(s,a)\sim\mathcal{B}}\Big[\mathbb{E}_{\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)\sim\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}}\mathcal{Z}_{\theta'}(\cdot|s,a)}\, \log\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)\,\big|\,\mathcal{Z}_\theta(\cdot|s,a)\big)\Big] + c \\
&= -\mathbb{E}_{(s,a)\sim\mathcal{B}}\Big[\mathbb{E}_{(r,s')\sim\mathcal{B},\, a'\sim\pi_{\phi'},\, Z(s',a')\sim\mathcal{Z}_{\theta'}(\cdot|s',a')}\, \log\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)\,\big|\,\mathcal{Z}_\theta(\cdot|s,a)\big)\Big] + c \\
&= -\mathbb{E}_{(s,a,r,s')\sim\mathcal{B},\, a'\sim\pi_{\phi'},\, Z(s',a')\sim\mathcal{Z}_{\theta'}(\cdot|s',a')}\Big[\log\mathcal{P}\big(\mathcal{T}^{\pi_{\phi'}}_{\mathcal{D}} Z(s,a)\,\big|\,\mathcal{Z}_\theta(\cdot|s,a)\big)\Big] + c,
\end{aligned}$$

where $c$ is a term independent of $\theta$.
C. Probability Density of the Bounded Actions

For algorithms with a stochastic policy, we use an unbounded Gaussian as the action distribution $\mu$. However, in practice, the action usually needs to be bounded to a finite interval denoted as $[a_{\min}, a_{\max}]$, where $a_{\min}\in\mathbb{R}^{\dim(\mathcal{A})}$ and $a_{\max}\in\mathbb{R}^{\dim(\mathcal{A})}$. Let $u\in\mathbb{R}^{\dim(\mathcal{A})}$ denote a random variable sampled from $\mu$. To account for the action constraint, we project $u$ into a desired action by

$$a = \frac{a_{\max} - a_{\min}}{2}\odot\tanh(u) + \frac{a_{\max} + a_{\min}}{2},$$

where $\odot$ represents the Hadamard product and $\tanh$ is applied element-wise. From [16], the probability density of $a$ is given by

$$\pi(a|s) = \mu(u|s)\,\Big|\det\Big(\frac{\mathrm{d}a}{\mathrm{d}u}\Big)\Big|^{-1}.$$

The log-likelihood of $\pi(a|s)$ can be expressed as

$$\log\pi(a|s) = \log\mu(u|s) - \sum_{i=1}^{\dim(\mathcal{A})}\Big(\log\big(1 - \tanh^2(u_i)\big) + \log\frac{a_{\max,i} - a_{\min,i}}{2}\Big).$$
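A sketch of this bounded-action log-likelihood is given below; the small constant added inside the logarithm is a numerical-stability assumption, not part of the derivation.

import torch

def squashed_log_prob(mean, std, u, a_min, a_max):
    """log pi(a|s) for a = (a_max - a_min)/2 * tanh(u) + (a_max + a_min)/2:
    Gaussian log-density of u minus the change-of-variables correction."""
    log_mu = torch.distributions.Normal(mean, std).log_prob(u).sum(dim=-1)
    correction = (torch.log(1.0 - torch.tanh(u) ** 2 + 1e-6)     # 1e-6 for stability (assumption)
                  + torch.log((a_max - a_min) / 2.0)).sum(dim=-1)
    return log_mu - correction

# toy usage on a 2-dimensional action bounded to [-1, 1]
mean, std = torch.zeros(2), torch.ones(2)
u = torch.distributions.Normal(mean, std).sample()
print(squashed_log_prob(mean, std, u, a_min=torch.tensor(-1.0), a_max=torch.tensor(1.0)))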
D. Policy Update Gradients Based on the Soft State-Action Return

If $Q_\theta(s,a)$ cannot be expressed explicitly through $\theta$, besides (19), we also need to reparameterize the random return $Z(s,a)$ as

$$Z(s,a) = g_\theta(\xi_Z; s, a).$$

In this case, we have

$$\nabla_\phi J_\pi(\phi) = \mathbb{E}_{s\sim\mathcal{B},\,\xi_Z,\,\xi_a}\Big[-\alpha\nabla_\phi\log(\pi_\phi(a|s)) + \big(\nabla_a g_\theta(\xi_Z; s, a) - \alpha\nabla_a\log(\pi_\phi(a|s))\big)\nabla_\phi f_\phi(\xi_a; s)\Big].$$

Besides, the distribution $\mathcal{Z}_\theta$ offers a richer set of predictions for learning than its expected value $Q_\theta$. Therefore, we can also choose to maximize the $i$th percentile of $\mathcal{Z}_\theta$:

$$J_{\pi,i}(\phi) = \mathbb{E}_{s\sim\mathcal{B},\, a\sim\pi_\phi}\big[P_i(\mathcal{Z}_\theta(s,a)) - \alpha\log(\pi_\phi(a|s))\big],$$

where $P_i$ denotes the $i$th percentile. For example, $i$ should be a smaller value for risk-aware policy learning. The gradients of this objective can also be easily approximated using the reparameterization trick.
APPENDIX C
EXPERIMENTAL DETAILS
A. Brief Descriptions of Benchmarks
The Humanoid-v2 task aims to make a three-dimensional
bipedal robot walk forward as fast as possible, without falling
over. Its state is described by 376-dimensional information,
including the position and velocity of joints, the inertia and
velocity at the center of mass, and actuator forces. The action
of this task is composed of the torque applied over 17 joints.
The reward function is designed to punish the actions that cost
a lot of energy or cause mission failure. Similarly, Walker2d-
v2 is a two-dimensional bipedal robot which possesses 17-
dimensional states and 6-dimensional actions. The Ant-v2 task
aims to make a four-legged creature walk forward as fast as
possible with a 111-dimensional state vector to describe the
position and velocity of each joint. Its action consists of the
torque of 8 joints, and the reward is also designed to punish
the actions that cost a lot of energy or cause mission failure.
Analogously, HalfCheetah-v2 is a two-legged cheetah with
17-dimensional states and 6-dimensional actions. The goal of
InvertedDoublePendulum-v2, which is described by an 11-
dimensional state vector, is to make two linked poles stand
up on a cart as long as possible by applying a force on the
cart. See https://github.com/openai/gym/tree/master/gym/envs
for all details.
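For reference, the task dimensions above can be checked directly, assuming a Gym version that still provides the MuJoCo *-v2 tasks (and a working mujoco-py installation):

import gym

env = gym.make("Humanoid-v2")
print(env.observation_space.shape, env.action_space.shape)   # (376,), (17,)

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())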
B. Double-Q SAC Algorithm

Suppose the soft Q-value and policy are approximated by parameterized functions $Q_\theta(s,a)$ and $\pi_\phi(a|s)$, respectively. A pair of soft Q-value functions $(Q_{\theta_1}, Q_{\theta_2})$ and policies $(\pi_{\phi_1}, \pi_{\phi_2})$ are required in Double-Q SAC, where $\pi_{\phi_1}$ is updated with respect to $Q_{\theta_1}$ and $\pi_{\phi_2}$ with respect to $Q_{\theta_2}$. Given separate target soft Q-value functions $(Q_{\theta'_1}, Q_{\theta'_2})$ and policies $(\pi_{\phi'_1}, \pi_{\phi'_2})$, the update targets of $Q_{\theta_1}$ and $Q_{\theta_2}$ are calculated as:

$$y_1 = r + \gamma\big(Q_{\theta'_2}(s',a') - \alpha\log(\pi_{\phi'_1}(a'|s'))\big), \quad a'\sim\pi_{\phi'_1},$$
$$y_2 = r + \gamma\big(Q_{\theta'_1}(s',a') - \alpha\log(\pi_{\phi'_2}(a'|s'))\big), \quad a'\sim\pi_{\phi'_2}.$$

The soft Q-value can be trained by directly minimizing

$$J_Q(\theta_i) = \mathbb{E}_{(s,a,r,s')\sim\mathcal{B},\, a'\sim\pi_{\phi'_i}}\big[(y_i - Q_{\theta_i}(s,a))^2\big], \quad \text{for } i\in\{1,2\}.$$

The policy can be learned by directly maximizing a parameterized variant of the objective function in (4):

$$J_\pi(\phi_i) = \mathbb{E}_{s\sim\mathcal{B}}\mathbb{E}_{a\sim\pi_{\phi_i}}\big[Q_{\theta_i}(s,a) - \alpha\log(\pi_{\phi_i}(a|s))\big].$$

The pseudo-code of Double-Q SAC is shown in Algorithm 2.
Algorithm 2 Double-Q SAC Algorithm
Initialize parameters θ_1, θ_2, φ_1, φ_2 and α
Initialize target parameters θ'_1 ← θ_1, θ'_2 ← θ_2, φ'_1 ← φ_1, φ'_2 ← φ_2
Initialize learning rates β_Q, β_π, β_α and τ
Initialize iteration index k = 0
repeat
    Select action a ∼ π_{φ_1}(a|s)
    Observe reward r and new state s'
    Store transition tuple (s, a, r, s') in buffer B
    Sample N transitions (s, a, r, s') from B
    Update soft Q-function θ_i ← θ_i − β_Q ∇_{θ_i} J_Q(θ_i) for i ∈ {1, 2}
    if k mod m = 0 then
        Update policy φ_i ← φ_i + β_π ∇_{φ_i} J_π(φ_i) for i ∈ {1, 2}
        Adjust temperature α ← α − β_α ∇_α J(α)
        Update target networks:
            θ'_i ← τθ_i + (1 − τ)θ'_i for i ∈ {1, 2}
            φ'_i ← τφ_i + (1 − τ)φ'_i for i ∈ {1, 2}
    end if
    k ← k + 1
until Convergence
C. Single-Q SAC Algorithm

Suppose the soft Q-value and policy are approximated by parameterized functions $Q_\theta(s,a)$ and $\pi_\phi(a|s)$, respectively. Given a separate target soft Q-value function $Q_{\theta'}$ and policy $\pi_{\phi'}$, the update target of $Q_\theta$ is calculated as:

$$y = r + \gamma\big(Q_{\theta'}(s',a') - \alpha\log(\pi_{\phi'}(a'|s'))\big), \quad a'\sim\pi_{\phi'}.$$

The soft Q-value can be trained by directly minimizing

$$J_Q(\theta) = \mathbb{E}_{(s,a,r,s')\sim\mathcal{B},\, a'\sim\pi_{\phi'}}\big[(y - Q_\theta(s,a))^2\big].$$

The policy can be learned by directly maximizing a parameterized variant of the objective function in (4):

$$J_\pi(\phi) = \mathbb{E}_{s\sim\mathcal{B}}\mathbb{E}_{a\sim\pi_\phi}\big[Q_\theta(s,a) - \alpha\log(\pi_\phi(a|s))\big].$$

The pseudo-code of Single-Q SAC is shown in Algorithm 3.
Algorithm 3 Single-Q SAC Algorithm
Initialize parameters θ, φ, and α
Initialize target parameters θ′ ← θ, φ′ ← φ
Initialize learning rates βQ, βπ, βα and τ
Initialize iteration index k = 0
repeat
    Select action a ∼ πφ(a|s)
    Observe reward r and new state s′
    Store transition tuple (s, a, r, s′) in buffer B
    Sample N transitions (s, a, r, s′) from B
    Update soft Q-function θ ← θ − βQ∇θ JQ(θ)
    if k mod m = 0 then
        Update policy φ ← φ + βπ∇φ Jπ(φ)
        Adjust temperature α ← α − βα∇α J(α)
        Update target networks:
            θ′ ← τθ + (1 − τ)θ′, φ′ ← τφ + (1 − τ)φ′
    end if
    k = k + 1
until Convergence
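Complementing the target computation sketched for Double-Q SAC, the single-critic actor update maximizing Jπ(φ) can be sketched as below (PyTorch-style, reparameterized action sample; the module names are illustrative assumptions, not the authors' implementation).

import torch

def single_q_actor_loss(actor, critic, batch_states, alpha):
    actions, log_probs = actor.rsample_with_log_prob(batch_states)
    q_values = critic(batch_states, actions)
    # Gradient ascent on Jπ(φ) corresponds to descending this negated objective.
    return (alpha * log_probs - q_values).mean()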
D. TD4 Algorithm
Consider a parameterized state-action return distribution function Zθ(·|s, a) and a deterministic policy πφ(s), where θ and φ are parameters. The target networks Zθ′(·|s, a) and πφ′(s) are used to stabilize learning. The return distribution can be trained to minimize

JZ(θ) = E_{(s,a,r,s′)∼B, a′∼πφ′, Z(s′,a′)∼Zθ′(·|s′,a′)}[−log P(T^{πφ′}_D Z(s, a) | Zθ(·|s, a))],

where

T^π_D Z(s, a) := r(s, a) + γZ(s′, a′), with equality understood in distribution,

and

a′ = πφ′(s′) + ε, ε ∼ clip(N(0, σ²), −c, c).

The calculation of ∇θ JZ(θ) is similar to that of DSAC. The policy can be learned by directly maximizing the expected return

Jπ(φ) = E_{s∼B}[Qθ(s, πφ(s))].
The pseudo-code is shown in Algorithm 4.
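A minimal sketch of the TD4 target construction, i.e., target-policy smoothing with clipped Gaussian noise followed by the distributional target r + γZ(s′, a′), is given below (PyTorch-style; batch keys and module interfaces are assumptions, not the experimental code).

import torch

@torch.no_grad()
def td4_target_return_sample(batch, target_actor, target_critic,
                             gamma, sigma=0.2, c=0.5):
    r, s_next = batch["reward"], batch["next_state"]
    # a′ = πφ′(s′) + ε,  ε ∼ clip(N(0, σ²), −c, c)   (target policy smoothing)
    a_det = target_actor(s_next)
    noise = (torch.randn_like(a_det) * sigma).clamp(-c, c)
    a_next = a_det + noise
    # Z(s′, a′) ∼ Zθ′(·|s′, a′): a return sample from the target distribution
    z_next = target_critic.sample_return(s_next, a_next)
    return r + gamma * z_next  # sample of T_D Z(s, a) used in the log-likelihood loss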
E. Hyperparameters
Table IV lists the hyperparameters of all algorithms.
ACKNOWLEDGMENT
We would like to acknowledge Dongjie Yu for his valuable
suggestions. The authors are grateful to the Editor-in-Chief,
the Associate Editor, and anonymous reviewers for their valu-
able comments.
Algorithm 4 TD4 Algorithm
Initialize parameters θ and φ
Initialize target parameters θ′ ← θ, φ′ ← φ
Initialize learning rates βZ, βπ and τ
Initialize iteration index k = 0
repeat
    Select action with exploration noise a = πφ(s) + ε, ε ∼ N(0, σ̂²)
    Observe reward r and new state s′
    Store transition tuple (s, a, r, s′) in buffer B
    Sample N transitions (s, a, r, s′) from B
    Calculate action for target policy smoothing a′ = πφ′(s′) + ε, ε ∼ clip(N(0, σ²), −c, c)
    Update return distribution θ ← θ − βZ∇θ JZ(θ)
    if k mod m = 0 then
        Update policy φ ← φ + βπ∇φ Jπ(φ)
        Update target networks:
            θ′ ← τθ + (1 − τ)θ′, φ′ ← τφ + (1 − τ)φ′
    end if
    k = k + 1
until Convergence
TABLE IV
DETAILED HYPERPARAMETERS

Shared
    Optimizer: Adam (β1 = 0.9, β2 = 0.999)
    Number of hidden layers: 5
    Number of hidden units per layer: 256
    Nonlinearity of hidden layer: GELU
    Replay buffer size: 5×10⁵
    Batch size: 256
    Actor learning rate: cosine anneal 5e-5 → 1e-6
    Critic learning rate: cosine anneal 8e-5 → 1e-6
    Discount factor (γ): 0.99
    Update interval (m): 2
    Target smoothing coefficient (τ): 0.001
    Reward scale: 0.2
    Number of actor processes: 6
    Number of learner processes: 4
    Number of buffer processes: 3
Stochastic policy
    Learning rate of α: cosine anneal 5e-5 → 1e-6
    Expected entropy (H̄): −dim(A)
Deterministic policy
    Exploration noise: ε ∼ N(0, 0.1²)
Distributional value function
    Bounds of variance: σmin = 1
    Clipping boundary: b = 10
TD4, TD3
    Policy smoothing noise: clip(N(0, 0.2²), −0.5, 0.5)
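As an illustration of the cosine-annealed learning rates in Table IV (e.g., the actor rate decaying from 5e-5 to 1e-6), the sketch below uses PyTorch's CosineAnnealingLR; the horizon T_max and the placeholder parameters are assumptions for exposition, not the exact schedule used in the experiments.

import torch

params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder parameters
optimizer = torch.optim.Adam(params, lr=5e-5, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=3_000_000, eta_min=1e-6)  # T_max = total training iterations (assumed)

for step in range(10):  # stand-in for the training loop
    optimizer.step()
    scheduler.step()    # learning rate decays from 5e-5 toward 1e-6 along a cosine curve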
Jingliang Duan received the B.S. degree from the
College of Automotive Engineering, Jilin University,
Changchun, China, in 2015. He studied as a visiting
student researcher in Department of Mechanical En-
gineering, University of California, Berkeley, USA,
in 2019. He received his Ph.D. degree in the School
of Vehicle and Mobility, Tsinghua University, Bei-
jing, China, in 2021. His research interests include
decision and control of autonomous vehicle, rein-
forcement learning and adaptive dynamic program-
ming, and driver behaviour analysis.
Yang Guan received the B.S. degree from school
of mechanical engineering, Beijing institute of tech-
nology, Beijing, China, in 2017. He is pursuing his
Ph.D. degree in the School of Vehicle and Mobility,
Tsinghua University, Beijing, China. His research
interests include decision-making of autonomous
vehicle, and reinforcement learning.
Shengbo Eben Li (SM’16) received the M.S. and
Ph.D. degrees from Tsinghua University in 2006 and
2009. He worked at Stanford University, University
of Michigan, and University of California, Berkeley.
He is currently a tenured professor at Tsinghua Uni-
versity. His active research interests include intel-
ligent vehicles and driver assistance, reinforcement
learning and distributed control, optimal control and
estimation, etc.
He is the author of over 100 journal/conference
papers, and the co-inventor of over 20 Chinese
patents. He was the recipient of Best Paper Award in 2014 IEEE ITS
Symposium, Best Paper Award in 14th ITS Asia Pacific Forum, National
Award for Technological Invention in China (2013), Excellent Young Scholar
of NSF China (2016), Young Professorship of Changjiang Scholar Program
(2016). He is an IEEE Senior Member and serves as an associate editor
of IEEE ITSM and IEEE Trans. ITS, among others.
Yangang Ren received the B.S. degree from the
Department of Automotive Engineering, Tsinghua
University, Beijing, China, in 2018. He is currently
pursuing his Ph.D. degree in the School of Vehicle
and Mobility, Tsinghua University, Beijing, China.
His research interests include decision and control
of autonomous driving, reinforcement learning, and
adversarial learning.
Qi Sun received his Ph.D. degree in Automotive
Engineering from Ecole Centrale de Lille, France,
in 2017. He did scientific research and completed
his Ph.D. dissertation in CRIStAL Research Center
at Ecole Centrale de Lille, France, between 2013
and 2016. He is currently a postdoctoral researcher at the State
Key Laboratory of Automotive Safety and Energy
and at the School of Vehicle and Mobility, Tsinghua
University, Beijing, China. His active research inter-
ests include intelligent vehicles, automatic driving
technology, distributed control and optimal control.
Bo Cheng received the B.S. and M.S. degrees in
automotive engineering from Tsinghua University,
Beijing, China, in 1985 and 1988, respectively, and
the Ph.D. degree in mechanical engineering from
the University of Tokyo, Tokyo, Japan, in 1998.
He is currently a Professor with School of Ve-
hicle and Mobility, Tsinghua University, and the
Dean of Tsinghua University–Suzhou Automotive
Research Institute. He is the author of more than 100
peer-reviewed journal/conference papers and the co-
inventor of 40 patents. His active research interests
include autonomous vehicles, driver-assistance systems, active safety, and
vehicular ergonomics, among others. Dr. Cheng is also the Chairman of the
Academic Board of SAE-Beijing, a member of the Council of the Chinese
Ergonomics Society, and a Committee Member of National 863 Plan, among
others.