Bayesian Distributional Policy Gradients
Luchen Li, 1A. Aldo Faisal 1,2,3
1Brain & Behaviour Lab, Dept. of Computing, Imperial College London, UK
2Brain & Behaviour Lab, Dept. of Bioengineering, Imperial College London, UK
3UKRI Centre in AI for Healthcare, Imperial College London, UK
Distributional Reinforcement Learning (RL) maintains the
entire probability distribution of the reward-to-go, i.e. the re-
turn, providing more learning signals that account for the
uncertainty associated with policy performance, which may
be beneﬁcial for trading off exploration and exploitation and
policy learning in general. Previous works in distributional
RL focused mainly on computing the state-action-return dis-
tributions, here we model the state-return distributions. This
enables us to translate successful conventional RL algorithms
that are based on state values into distributional RL. We for-
mulate the distributional Bellman operation as an inference-
based auto-encoding process that minimises Wasserstein met-
rics between target/model return distributions. The proposed
algorithm, BDPG (Bayesian Distributional Policy Gradients),
uses adversarial training in joint-contrastive learning to esti-
mate a variational posterior from the returns. Moreover, we
can now interpret the return prediction uncertainty as an in-
formation gain, which allows to obtain a new curiosity mea-
sure that helps BDPG steer exploration actively and efﬁ-
ciently. We demonstrate in a suite of Atari 2600 games and
MuJoCo tasks, including well known hard-exploration chal-
lenges, how BDPG learns generally faster and with higher
asymptotic performance than reference distributional RL al-
In reinforcement learning (RL), the performance of a policy
is evaluated by the (discounted) accumulated future rewards,
a random variable known as the reward-to-go or the return.
Instead of maintaining the expectations of returns as scalar
value functions, distributional RL estimates return distribu-
tions. Keeping track of the uncertainties around returns has
initially been leveraged as a means to raise risk awareness in
RL (Morimura et al. 2010; Lattimore and Hutter 2012). Re-
cently, a line of research pioneered by (Bellemare, Dabney,
and Munos 2017) applied the distributional Bellman opera-
tor for control purposes. Distributional RL is shown to out-
perform previous successful deep RL methods in Atari-57
when combined with other avant-garde developments in RL
(Hessel et al. 2018; Dabney et al. 2018a).
Copyright © 2021, Association for the Advancement of Artiﬁcial
Intelligence (www.aaai.org). All rights reserved.
The critical hurdle in distributional RL is to minimise
a Wasserstein distance between the distributions of a re-
turn and its Bellman target, under which the Bellman op-
eration is a contraction mapping (Bellemare, Dabney, and
Munos 2017). A differentiable Wasserstein distance estima-
tor can be obtained in its dual form with constrained Kan-
torovich potentials (Gulrajani et al. 2017; Arjovsky, Chin-
tala, and Bottou 2017), or approximated by restricting the
search for couplings to a set of smoothed joint probabili-
ties with entropic regularisations (Cuturi 2013; Montavon,
uller, and Cuturi 2016; Genevay et al. 2016; Luise et al.
2018). Alternatively, a Bayesian inference perspective redi-
rects the search space to a set of probabilistic encoders that
map data in the input space to codes in a latent space (Bous-
quet et al. 2017; Tolstikhin et al. 2018; Ambrogioni et al.
2018). Bayesian approaches rely on inference to bypass rigid
and sub-optimal distributions that are usually entailed other-
wise, while retaining differentiability and tractability. More-
over, predictions based on inference, the expectation across
a latent space, are more robust to unseen data (Blundell et al.
2015) and thus able to generalise better.
In contrast to previous distributional RL work that fo-
cuses on state-action-return distributions, here we investi-
gate state-return distributions and prove that its Bellman
operator is also a contraction in Wasserstein metrics. This
opens up the possibility of converting state-value algorithms
into distributional RL settings. We then formulate the dis-
tributional Bellman operation as an inference-based auto-
encoding process that minimises Wasserstein metrics be-
tween continuous distributions of the Bellman target and es-
timated return. A second beneﬁt of our inference model is
that the learned posterior enables a curiosity bonus in the
form of information gain (IG), which is leveraged as internal
reward to boost exploration efﬁciency. We explicitly calcu-
late the entropy reduction in a latent space corresponding to
return probability estimation as a KL divergence. In contrast
to previous work (Bellemare et al. 2016; Sekar et al. 2020;
Ball et al. 2020) in which IG was approximated with ensem-
ble entropy or prediction gains, we obtain analytical results
from our variational inference scheme.
To test our fully Bayesian approach and curiosity-driven
exploration mechanism against a distributional RL back-
arXiv:2103.11265v1 [cs.LG] 20 Mar 2021
drop, we embed these two innovations into a policy gradi-
ent framework. Both innovations would also work for value-
based and off-policy policy gradients methods where the
state-action-return distribution is modelled instead.
We evaluate and compare our method to other distribu-
tional RL approaches on the Arcade Learning Environment
Atari 2600 games (Bellemare et al. 2013), including some
of the best known hard-exploration cases, and on MuJoCo
continuous-control tasks (Todorov, Erez, and Tassa 2012).
To conclude we perform ablation experiments, where we in-
vestigate our exploration mechanism and the length of boot-
strapping in distributional Bellman backup.
Our key contributions in this work are two-fold: we derive
ﬁrst, a fully inference-based generative approach to distri-
butional Bellman operations; and second, a novel curiosity-
driven exploration mechanism formulated as posterior infor-
mation gains attributed to return prediction uncertainty.
Wasserstein Variational Inference
In this subsection, we discuss how Wasserstein metrics in a
data space can be estimated in a Bayesian fashion using ad-
versarial training. Notation-wise we use calligraphic letters
for spaces, capital letters for random variables and lower-
case letters for values. We denote probability distributions
and densities with the same notation, discriminated by the
argument being capital or lower-case, respectively.
In optimal transport problems (Villani 2008), divergences
between two probability distributions are estimated as the
cost required to transport probability mass from one to the
other. Consider input spaces X ∈ Rn,Y ∈ Rmand a pair-
wise cost function c:X × Y 7→ R+. For two probability
measures α:X 7→ P, β :Y 7→ P, an optimal transport
divergence is deﬁned as
Lc(α, β) := inf
c(x, y)dγ(x, y),(1)
where Γ(α, β)is a set of joint distributions or couplings on
X × Y with marginals αand βrespectively. Particularly,
when Y=Xand the cost function is derived from a metric
over X,d:X × X 7→ R+, via c(x, y) = dp(x, y), p ≥1,
the p-Wasserstein distance on Xis given as
Wp(α, β) := Ldp(α, β )1/p.(2)
Now consider a generative process through a latent vari-
able Z∈ Z ∈ Rlwith a prior pZ(Z), a decoder pθ(X|Z)
and an encoder (amortised inference estimator) qφ(Z|X), in
which the parameters φ, θ are trained to mimic the data dis-
tribution pX(X)implicitly represented by the i.i.d. training
samples. The density corresponding to the model distribu-
tion can be expressed as pG(x) = Ez∼pZ[pθ(x|z)]. For a
deterministic1decoder X=Gθ(Z),pGcan be thought of
1For the purpose of generative modelling, the intuition of min-
imising Wasserstein metrics between target/model distributions
(instead of stronger probability density discrepancies such as f-
divergences) is to still see meaningful gradients when the model
manifold and the true distribution’s support have few intersections
without introducing noise to the model distribution (by using a di-
rected continuous mapping) that renders reconstructions blurry.
as the push-forward of pZthrough Gθ, i.e. pG=Gθ#pZ.
Minimising the Wasserstein distance between pX(X)and
pG(X)is thereby equivalent to ﬁnding an optimal trans-
port plan between pX(X)and pZ(Z), and matching the ag-
gregated posterior Q(Z) := Ex∼pXqφ(Z|x)to the prior
pZ(Z)(Bousquet et al. 2017; Tolstikhin et al. 2018; Ambro-
gioni et al. 2018; Rosca, Lakshminarayanan, and Mohamed
2018; He et al. 2019)
p(pX, pG) = inf
Marginal matching in Zis sometimes preferred for gener-
ative models, since it alleviates the posterior collapse prob-
lem (Zhao, Song, and Ermon 2017; Hoffman and Johnson
2016) by enabling Zto be more diversely distributed for
different x’s. However, when doing so, Eq. (3) is no longer a
proper inference objective, as it enforces neither posterior-
contrastive nor joint-contrastive learning. In fact, the en-
coder needs not to approximate the true posterior pθ(Z|X)
exactly to satisfy the marginal match. In contrast, our ap-
proach maintains a fully Bayesian inference pipeline.
While explicit variational inference requires all proba-
bility densities to have analytical expressions, we bypass
this by direct density matching through adversarial training
(Goodfellow et al. 2014), which requires only that densities
can be sampled from for gradient backpropagation, thereby
allowing for a degenerate decoder pθ(X|Z) = δGθ(Z)(X).
Lemma 1. (Donahue, Kr¨
uhl, and Darrell 2017)
Let p(X, Z )and q(X, Z)denote the joint sampling
distribution induced by the decoder and encoder respec-
tively, Dψa discriminator, and deﬁne F(ψ, φ, θ) :=
Ex,z∼plog Dψ(x, z)+Ex,z ∼qlog 1−Dψ(x, z).For
any encoder and decoder, deterministic or stochastic, the
optimal discriminator Dψ∗= argmaxDψF(ψ, φ, θ)is the
Radon-Nikodym derivative of measure p(X, Z)w.r.t.
p(X, Z ) + q(X, Z).The encoder and decoder’s objective
for an optimal discriminator C(φ, θ) := F(ψ∗, φ, θ)can be
written in the Jenson-Shannon divergence C(φ, θ) =
2JSp, q−log 4,in which the global minimum is achieved
if and only if p(X, Z ) = q(X, Z).
We jointly minimise Wasserstein metrics between
model/target distributions in Xand conduct variational in-
ference adversarially. pGwill be shown to be modelling the
distribution of random returns, leading to a novel approach
to accomplishing distributional Bellman operations.
Distributional Reinforcement Learning
We start by laying out RL and policy gradients notation, then
explain the distributional perspective of RL, as well as pre-
vious solutions to it.
Policy Gradients A standard RL task is framed within a
Markov decision process (MDP) S,A,R, P, γ(Puterman
1994), where Sand Adenote the state and action spaces
respectively, R:S × A 7→ Rna potentially stochastic re-
ward function, P:S × A 7→ P(S)a transition probability
density function, and γ∈(0,1) a temporal discount factor.
An RL agent has a policy that maps states to a probability
distribution over actions π:S 7→ P(A).
The return Gπunder the policy πis a random variable
that represents the sum of discounted future rewards and the
state-dependent return is Gπ(s) := P∞
t=0 γtrt, s0=s.
A state value function is deﬁned as the expected return
Vπ(s) := E[Gπ(s)], a state-action value function the ex-
pected state-action return Qπ(s, a) := E[Gπ(s, a)]. The
Bellman operator Tπ(Bellman 1957) is deﬁned as
TπVπ(s) := Eπ,R,P r+γV π(s0),(4)
TπQπ(s, a) := ER,P,π r+γQπ(s0, a0).(5)
Policy gradient methods (Sutton et al. 1999) optimise a
parameterised policy πby directly ascending the gradient of
a policy performance objective such as
Es∼dπ,a∼π(·|s)log π(a|s)Aπ(s, a)(Mnih et al. 2016;
Schulman et al. 2016) with respect to the parameters of
π, where dπ(s)is the marginal state density induced by
π, and the advantage function Aπcan be estimated as
Distributional RL In distributional reinforcement learn-
ing, the distributions of returns instead of their expectations
(i.e. value functions) are maintained. The distributional Bell-
man equation in terms of the state-action return is (Belle-
mare, Dabney, and Munos 2017)
TπGπ(s, a) :D
=R(s, a) + γGπ(S0, A0).(6)
The distribution equation U:D
=Vspeciﬁes that the random
variable Uis distributed by the same law as is random vari-
able V. The reward R(s, a), next state-action tuple S0, A0
and its return Gπ(S0, A0)are random variables, with com-
pound randomness stemmed from π, P , and R.
Eq. (6) is a contraction mapping in the p-th order Wasser-
stein metrics Wp(Bellemare, Dabney, and Munos 2017).
Previously, Eq. (6) is exploited in a Q-learning style value
iteration, with the distribution of Gπ(s, a)represented as
a particle set, updated either through cross-entropy loss
(Bellemare, Dabney, and Munos 2017), quantile regression
(Dabney et al. 2018a,b; Rowland et al. 2019), or Sinkhorn
iterations (Martin et al. 2020). Particle-based (ensemble-
)critics Gπ(s, a)are incorporated into conventional off-
policy policy gradient methods by (Barth-Maron et al. 2018)
and (Kuznetsov et al. 2020). A continuous Gπ(s, a)distribu-
tion can be conferred via Wasserstein-GAN (WGAN) (Ar-
jovsky, Chintala, and Bottou 2017), and has been investi-
gated in both Q-learning (Doan, Mazoure, and Lyle 2018)
and policy gradients (Freirich et al. 2019). We remark that
these works all estimate return distributions with empirical
approximations, e.g. particle set or WGAN.
We begin with proving that the distributional Bellman oper-
ation in terms of state-return distributions is also a contrac-
tion mapping in Wasserstein metrics. We then show resem-
blance between distributional Bellman update and a varia-
tional Bayesian solution to return distributions, leading to a
novel distributional RL approach. Thereafter, we propose an
internal incentive that leverages posterior IG stemmed from
return estimation inaccuracy.
Distributional Bellman Operator for State-Return
First, in the same sense that Eq. (6) extends Eq. (5), we ex-
tend Eq. (4) and deﬁne the distributional Bellman operator
regarding the state return Gπ(s)as
=R(s) + γGπ(S0).(7)
Now we demonstrate that Eq. (7) is also a contraction in p-
For notional convenience, we write the inﬁmum-p-
Wasserstein metric in Eq. (2) in terms of random variables:
dp(X, Y ) := Wp(α, β), X ∼α, Y ∼β.
Let G ∈ Rndenote a space of returns valid in the MDP,
and Ω∈P(G)(S)a space of state-return distributions with
bounded moments. Represent as ωthe collection of distribu-
s∈ S}, in which ω(s)is the distribution of ran-
dom return G(s). For any two distributions ω1, ω2∈Ω, the
supremum-p-Wasserstein metric on Ωis deﬁned as (Belle-
mare, Dabney, and Munos 2017; Rowland et al. 2018)
dp(ω1, ω2) := sup
Lemma 2. ¯
dpis a metric over state-return distributions.
The proof is a straightforward analogue to that of Lemma
2 in (Bellemare, Dabney, and Munos 2017), substituting the
state space Sfor state-action space S × A.
Proposition 1. The distributional Bellman operator for
state-return distributions is a γ-contraction in ¯
Proof. The reward R(s)∈ G is a random vector such that
R(s) = RAR(s, a)˜π(a|s)da, where ˜πdenotes the nor-
malised policy π.
Represent the marginal state transition kernel under policy
πas Pπ(s0|s) = RAP(s0|s, a)˜π(a|s)da. Then deﬁne a cor-
responding transition operator Pπ:G 7→ G
With the marginal state transition operator substituted for
the action-dependent one, the rest of the proof is analogous
to that of Lemma 3 presented by (Bellemare, Dabney, and
We therefore conclude that Eq. (7) has a unique ﬁxed
point Gπ. Proposition 1 vindicates backing up distributions
of the state return Gπ(s)by minimising Wasserstein metrics
to a target distribution.
For the policy gradient theorem (Sutton et al. 1999) to
hold, one would need at each encountered stan unbiased
estimator of EPk=0 γkrt+kin computing the policy gra-
dient. In distributional RL, such a quantity is obtained by
sampling from the approximated return distribution (or av-
eraging across such samples). The Bellman operator being a
contraction ensures convergence to a unique true on-policy
Algorithm 1 Bayesian Distributional Policy Gradients
1: Initialise encoder qφ(Z|X, S), generator Gθ(Z, S),
prior pθ(Z|S), discriminator Dψ(X, Z, S )and policy π
2: While not converge:
// roll out
3: training batch D ← ∅
4: For t= 0,...,k−1,∀threads:
5: execute at∼π(·|st), get rt,st+1
6: sample return zt∼pθ(·|st), gt←Gθ(zt, st)
7: update D
8: sample last return zk∼pθ(·|sk), gk←Gθ(zk, sk)
// estimate advantage for whole batch
9: For t∈ D:
10: estimate advantage ˆ
Atwith rt:t+k−1, gt:t+k
using any estimation method
11: Bellman target xt←ˆ
12: get curiosity reward rc
13: get augmented advantage ˆ
tby substituting rtin ˆ
// train with mini batch B⊂ D
14: For t∈B:
15: encode ˜zt∼qφ(·|xt, st)
16: sample from joint pθ:zt∼pθ(·|st),˜xt←G¯
// take gradients
17: update Dψby ascending
|B|Pt∈Blog Dψ(˜xt, zt, st) +log 1−Dψ(xt,˜zt, st)
18: update encoder, prior by ascending
|B|Pt∈Blog 1−Dψ(˜xt, zt, st)+ log Dψ(xt,˜zt, st)
19: update Gθby descending 1
20: update πby ascending 1
using any policy gradient method
21: Return π
return distribution, whose expectation is thereby also un-
biased. The same holds also for sample estimates of the
state-conditioned reward-to-go and thereby for the advan-
Inference in Distributional Bellman Update
We now proceed to show that the distribution of TπGπ(s)
can be interpreted as the target distribution pX, and hence
propose a new approach to distributional RL. Speciﬁcally,
let the data space X=Gbe the space of returns. ∀s∈ S, we
shorthand as such x(s) := TπGπ(s), g(s) := Gπ(s), thus
x(s), g(s)∈ X . We view the Bellman target x(s)as a sam-
ple from the empirical data distribution x(s)∼pX(X|s),
whilst the estimated return g(s)is generated from the model
distribution g(s)∼pG(X|s). The state sis an observable
condition to the generative model: its distribution is of no
interest to and not modelled in the Bayesian system.
We factorise the s-conditioned sampling distributions in
Lemma 1 such that
pθ(X, Z |s) := pθ(X|Z, s)pθ(Z|s),
qφ(X, Z |s) := pX(X|s)qφ(Z|X, s).(10)
pθ(X|Z, s) = δGθ(Z,s)(X)is a deterministic decoder.
The intuition of a state-conditioned, learned prior for Z
instead of a simple, ﬁxed one, is to add stochasticity for the
prior and encoder to meet halfway. Similar to the encoder,
we represent the prior also in a variational fashion and sam-
ple through re-parameterisation during gradient estimation.
Lemma 1 implies that training Dψand the generative
model alternatingly with factorisation in Eq. (10) would suf-
ﬁce to both have the encoder qφ(Z|X, s)approximating the
true posterior pθ(Z|X, s) := pθ(X|Z, s)pθ(Z|s)/pX(X|s)
and to reconstruct in X(Dumoulin et al. 2017; Donahue,
uhl, and Darrell 2017). Notice that a globally ob-
servable condition sis orthogonal to Lemma 1 and Eq. (3).
And so is a learned prior: both the true posterior and the
push-forward are relative to the prior pθ(Z|s).
In our work, in contrast, Gθis deemed ﬁxed in relation to
the minimax game, leaving the encoder,prior and discrimi-
nator to be trained in the minimax game
θ(z, s), z, s)
+Ex∼pX(X|s)Ez∼qφ(Z|x,s)log 1−Dψ(x, z, s).
The overhead bar ¯
(·)denotes that gradient is not back-
propagated through the parameter in question. This means
qφ(Z|X, s)is still trained to approximate the true posterior
induced by the current Gθ, irrespective of the capability of
the latter for reconstruction. Meanwhile, the reconstruction
is achieved by minimising a Wasserstein metric in X
φ(Z|x,s)dpx, Gθ(z, s).(12)
Essentially, we are alternating between training the en-
coder and prior via Eq. (11) and training the generator via
Eq. (12). We will use a ﬁxed prior pZand omit state de-
pendence in the ensuing discussion, as they do not affect
convergence. If the encoder approximates the true posterior
everywhere in X, the aggregated posterior Q(Z)is naturally
matched to the prior pZ(Z), so long as pθ(X|Z)is properly
normalised, as is indeed the case when it’s degenerate. As
such, meeting the constraint on the search space in Eq. (3) is
a necessary condition to accurate posterior approximation.
Note that in Eq. (3), EpXEqφ[Gθ(Z)] is the push-forward
of Q(Z),Gθ#Q. The primal form of WppX, pG, where
pG=Gθ#pZ, is thereby the inﬁmum of WppX, Gθ#Q
over qφs.t. Q=pZ. Therefore, Wp(pX, Gθ#Q)is an upper
bound to the true objective Wp(pX, Gθ#pZ)upon Q=pZ.
Learning converges as we explain in the following. And
to provide intuition, we highlight the resemblances to the
Expectation-Maximisation (EM) algorithm. Eq. (11) en-
forces contrastive learning such that the variational posterior
approaches the true posterior, comparable to the E-step in
EM. Eq. (11) allows to compute a bound Wp(pX, Gθ#Q)
in Eq. (12), which is equivalent to the computationally
tractable surrogate objective function of the negative free
energy in EM, or ELBO in variational Bayes. The expected
Wasserstein metric w.r.t. the current qφis then minimised by
updating the parameters of the decoder via Eq. (12). This up-
date is reminiscent of the M-step in EM, which maximises
the expected log likelihood while ﬁxing inference for Z.
In our method Wp(pX, Gθ#Q)acts as an upper bound
when Q=pZ, whereas in EM the surrogate objective is
Figure 1: Learning curves on Atari games with the mean (solid line) and standard deviation (shaded area) across 5runs.
a lower bound. This upper bound decreases in Eq. (11) as
it approaches the true objective Wp(pX, Gθ#pZ). Eq. (12)
then decreases Wp(pX, Gθ#Q)further and consequently
also decreases Wp(pX, Gθ#pZ). Note, the condition Q=
pZdoes not have to hold on each iteration, but can be amor-
tised over iterations. Assuming inﬁnite model expressive-
ness, the discrepancy between Qand pZshrinks monotoni-
cally, as all determinant functions for Q:= EpX[qφ(·|x)] =
pZin Eq. (11) are ﬁxed irrespective of the value of θ. When
qφconverges to the true posterior, Wp(pX, Gθ#Q)is more
sufﬁciently an upper bound due to restricted search space
in the primal form. While Wp(pX, Gθ#Q)functions as
an amortised upper bound, Wp(pX, Gθ#pZ)still decreases
continually (as opposed to from each iteration) and con-
verges to a local minimum.
The merit of the two-step training is two-fold: 1) with only
the distributions over Zunder tuning in the minimax game,
the adversarial training comes off with a weaker topology
and is not relied upon for reconstruction, making its poten-
tial instability less of a concern; and 2) an explicit distance
loss din Xminimises Wpto ensure contraction of return
distribution backups. If everything was trained adversarially
in JS divergence and allowed to reach global optimum, the
decoder and encoder would be reversing each other both in
density domain. In our setting, Qis matched to pZevery-
where in Z, while pGhas minimum Wpdistance to pX.
At each step of environmental interaction, a state return
is sampled via the standard two steps g(s)∼pG(X|s)⇐⇒
z∼pθ(Z|s), g(s)←Gθ(z, s). The one-step Bellman target
x(s)is calculated as r+γg(s0). Generalisation to k-step
bootstrap can be made analogously to the conventional RL.
Exploration through Posterior Information Gain
Curiosity (Schmidhuber 1991; Schaul et al. 2016; Houthooft
et al. 2016; Freirich et al. 2019) produces internal incentives
when external reward is sparse. We explore through encour-
aging visits to regions where the model’s ability to predict
the reward-to-go from current return distribution is weak.
However, the Bellman error x(s)−g(s)is not a prefer-
able indicator, as high x(s)−g(s)may well be attributed
to high moments of g(s)itself under point estimation (i.e.
the aleatoric uncertainty), whereas it is the uncertainty in
value belief due to estimating parameters with limited data
around the state-action tuple (i.e. the epistemic uncertainty)
that should be driving strengthened visitation.
To measure the true reduction in uncertainty about return
prediction, we estimate discrepancies in function space in-
stead of parameter space. Speciﬁcally, the insufﬁciency in
data collection can be interpreted as how much a posterior
distribution of a statistic or parameter inferred from a condi-
tion progresses from a prior distribution with respect to the
action execution that changes this condition, i.e., the IG. A
large IG means a large amount of data is required to achieve
the update. In its simplest form, the condition is implicitly
the data trained on. In exact Bayes, the condition itself can
be thought of as a variable estimated from data, e.g. the ran-
dom return X, hence enabling an explicit IG derived from
existing posterior model qφ(Z|X). Therefore, we deﬁne the
IG u(s)at sin return estimation as
u(s) := KLqφZ|x(s), s
Before the transition, the agent’s estimation for return is
g(s). The action execution enables the computation of the
Bellman target x(s), which would not be viable before the
transition, in which qφZ|g(s), sacts here as a prior. As a
result, u(s)would encourage the agent to make transitions
that maximally acquire new information about Z, hence fa-
cilitating updating pGtowards pX. Upon convergence, g(s)
and x(s)are indistinguishable and the IG approaches 0. The
beneﬁt of our IG is tree-fold: it is moments-invariant, makes
use of all training data, and increases computation complex-
ity only in forward-passing the posterior model when cal-
culating the KL divergence without even requiring gradient
The curiosity reward rc(s)is determined by u(s)and a
truncation scheme R:R+7→ R[0, η·¯u), η, ¯u∈R+
∗, to pre-
vent radical exploration
rc(s) := R(u(s)) := ηt·min u(s),¯u.(14)
We exploit relative value by normalising the clipped u(s)
by a running mean and standard deviation of previous IGs.
The exploration coefﬁcient ηtis logarithmically decayed as
ηt=ηplog t/t, by the rate at which the parametric uncer-
tainty decays (Koenker 2005; Mavrin et al. 2019), where t
is the global training step, and ηan initial value, to assuage
exploration getting more sensitive to the value of u(s)as
parameters become more accurate.
We use rcto augment return backup during policy update,
as the purpose is for the action to lead to uncertain regions by
encouraging curiosity about future steps. When training the
generative model for return distributions we use the original
We investigate a multi-step advantage function. The con-
traction property of the distributional Bellman operator is
propagated from 1-step to k-step scenarios by the same logic
as in conventional RL. The beneﬁt of looking into further
steps for exploration is intuitive viewed from the long-term
goal of RL tasks: the agent should not be complacent about
a state just because it is informative to immediate steps.
The pseudocode in Algorithm 1 presents a mini-batch
version of our methodology BDPG. We denote state return
g(st)as gtfor compactness. Other step-dependent values
are shorthanded accordingly. We use Euclidean distance for
reconstruction, leading to the W2metric being minimised.
kis the number of unroll steps, and is also the maximum
bootstrap length, albeit the two are not necessarily the same.
Policy optimisation enables importance sampling based off-
policy evaluation for re-sampling weights in experience re-
play schemes (Wang et al. 2016; Gruslys et al. 2018). In
continuous control, where the policy is usually a paramet-
ric Gaussian, exploration can be realised by perturbing the
Gaussian mean (Lillicrap et al. 2015; Ciosek et al. 2019), or
maintaining a mixture of Gaussians (Lim et al. 2018). Alter-
natively, random actions can be directly incentivised by reg-
ularising policy (cross-)entropy (Abdolmaleki et al. 2015;
Mnih et al. 2016; Nachum, Norouzi, and Schuurmans 2016;
Akrour et al. 2016; Haarnoja et al. 2018).
A group of works propose to exploit epistemic uncer-
tainty via an approximate posterior distribution of Qval-
ues. Stochastic posterior estimates are constructed through
training on bootstrapped perturbations of data (Osband et al.
2016, 2019), or overlaying learned posterior variance (Chen
et al. 2017; O’Donoghue et al. 2018). While this series of
works can be thought of as posterior sampling w.r.t. Qval-
ues, (Tang and Agrawal 2018) approximates Bayesian infer-
ence by sampling parameters for a distributional RL model.
On the other hand, a particle-based distributional RL model
itself registers notion of dispersion, inspiring risk-averse and
risk-seeking policies (Dabney et al. 2018a) and optimism-
in-the-face-of-uncertainty quantiﬁed by the variance of the
better half of the particle set (Mavrin et al. 2019).
Figure 2: Impact of bootstrap length kand truncation cap ¯u
for information gain at 10M and 150M steps into training.
There are also approaches exploiting dynamics uncer-
tainty (Houthooft et al. 2016), pseudo counts (Bellemare
et al. 2016; Ostrovski et al. 2017; Tang et al. 2017), gradient
of a generative model (Freirich et al. 2019), and good past
experiences (Oh et al. 2018), that do not estimate dispersion
or model disagreement of value functions.
We evaluate our method on the Arcade Learning Environ-
ment Atari 2600 games and continuous control with Mu-
JoCo. We estimate a k-step advantage function using Gener-
alised Advantage Estimation (GAE) (Schulman et al. 2016),
and update the policy using Proximal Policy Optimisation
(PPO) (Schulman et al. 2017) which maximises a clipped
surrogate of the policy gradient objective. For both Atari and
MuJoCo environments, we use 16 parallel workers for data
collection, and train in mini batches. For Atari, we unroll
128 steps with each worker in each training batch for all al-
gorithms, and average scores every 80 training batches. For
MuJoCo, we unroll 256 steps, and average scores every 4
batches. Except for ablation experiments that used rollout
length max(k, 128), the number of unroll steps is also the
bootstrap length k.
We compare to other distributional RL baselines on eight
of the Atari games, including some of the recognised hard-
exploration tasks: Freeway,Hero,Seaquest and Venture.
Direct comparisons to previous works are not meaningful
due to compounded discrepancies. To allow for a meaning-
ful comparison, we implement our own versions of base-
lines, ﬁxing other algorithmic implementation choices such
that the tested algorithms vary only in how return distribu-
tions are estimated and in the exploration scheme. We mod-
ify two previous algorithms (Freirich et al. 2019) and (Dab-
ney et al. 2018b), retaining their return distribution estima-
tors as benchmarks: a generative model Wasserstein-GAN
Figure 3: Learning curves on MuJoCo tasks with the mean (solid line) and standard deviation (shaded area) across 5runs.
(Arjovsky, Chintala, and Bottou 2017) (PPO+WGAN), and
a discrete approximation of distribution updated through
quantile regression (Koenker 2005) (PPO+QR). Impor-
tantly, all of our baselines are distributional RL solutions
that maintain state-return distributions.
Our BDPG is evaluated in two versions: 1) with the naive
add-noise-and-argmax (Mnih et al. 2016) exploration mech-
anism (BDPG naive), and 2) one that explores with the pro-
posed curiosity reward (BDPG). Naive exploration is also
used for the PPO+WGAN and PPO+QR baselines. Learn-
ing curves in Figure 1 suggest that with exploration mech-
anism ﬁxed, the proposed Bayesian approach BDPG naive
outperforms or is comparable to WGAN and QR in 6out of
8games. Morever, BDPG is always better or equal to BDPG
naive, vindicating our exploration scheme, and is able to get
the highest score among all tested algorithms in all four of
the hard-exploration games tested on.
We conduct ablation and parameter studies to investigate
the impact of the bootstrap length k, and of the truncation
cap ¯uon the IG u(s), on Atari games Breakout and Q∗bert.
In particular, k= 1 and ¯u→ ∞ are looked at as ablation
cases. Average scores of the batch started at 10M and 150M
steps into training are shown in Figure 2. Each coloured
pixel corresponds to the best outcome with respect to ηvalue
among its selection sweep according to average long-term
performance for each combination of kand ¯u. We found
that as training progresses, short kcomes to display a more
prohibitive effect, as the model becomes more discrimina-
tive about the environment, and lack of learning signals, i.e.
fewer rewards to calculate the Bellman target with, becomes
increasingly suppressive. Our experiments suggest that al-
though the best bootstrap length depends on the task, longer
bootstrapping generally produces better long-term perfor-
mance. But a long bootstrap length does not work well with
a large ¯u, a possible explanation is that as kincreases, the
variance in the Bellman targets multiplies. In this scenario,
the agent may encounter states with which it is very unfa-
miliar. The value of u(s)can grow unbounded and the ten-
dency to explore get out of hand if we do not curb it. More-
over, such extreme values can also jeopardise subsequent
curiosity comprehension through the normalisation of u(s).
This phenomenon justiﬁes the application of our truncation
scheme, especially for larger k. In addition, we found that
choosing too large kdoes not diminish performance dras-
tically, potentially due to the return distribution already ac-
counting for some degree of reward uncertainty, which is a
helpful characteristic when prototyping agents.
In the continuous control tasks with MuJoCo, we focus
on the ability of distributional RL algorithms to generalise,
and less the challenge of exploration. Therefore, we com-
pare the performance of BDPG naive against the benchmark
distributional RL algorithm PPO+WGAN, a generative so-
lution that does not conduct inference. Both are stripped of
exploration incentives. Noticeable amounts of variance dis-
played throughout training for both algorithms may be due
to that they both involve adversarial training. As shown in
Figure 3, however, our model outperforms the benchmark in
all cases with distinct margins. We believe this is because
WGAN does not take expectations across an amortised in-
ference space that accounts for better generalisation. This
proves to be highly beneﬁcial for reasoning about return
distributions in continuous control tasks such as MuJoCo
environments, where robustness in the face of unseen data
weighs up more in behaviour stability.
We formulate the distributional Bellman operation as an
inference-based auto-encoding process. We demonstrated
contraction of the Bellman operator for state-return distribu-
tions, expanding on the distributional RL body of work that
focused on state-action returns, to date. Our tractable solu-
tion alternates between minimising Wasserstein metrics to
continuous distributions of the Bellman target and adversar-
ial training for joint-contrastive learning to estimate a varia-
tional posterior from bootstrapped targets. This allows us to
distill the beneﬁts of Bayesian inference into distributional
RL, where predictions of returns are based on expectation
and thus more accurate in the face of unseen data. As a sec-
ond innovation we use the availability of a variational pos-
terior to derive a curiosity-driven exploration mechanism,
which we show is more efﬁciently solving hard-exploration
tasks. Either of our two contributions can be combined with
other building blocks to form new RL algorithms, e.g. in Ex-
plainable RL (Beyret, Shafti, and Faisal 2019). We believe
that our innovations link and expand the applicability and
efﬁciency of distributional RL methods.
We are grateful for our funding support: a Department of
Computing PhD Award to LL and a UKRI Turing AI Fel-
lowship (EP/V025449/1) to AAF.
Abdolmaleki, A.; et al. 2015. Model-Based Relative Entropy
Stochastic Search. In Adv. Neural Inform. Proc. Sys. (NIPS)
Akrour, R.; et al. 2016. Model-Free Trajectory Optimization
for Reinforcement Learning. In Proc. the 33nd Intl. Conf.
on Machine Learning (ICML), volume 48, 2961–2970. New
York, NY, USA.
Ambrogioni, L.; et al. 2018. Wasserstein Variational Infer-
ence. In Adv. Neural Inform. Proc. Sys. (NIPS) 31, 2473–
Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein
Generative Adversarial Networks. In Proc. the 34th Intl.
Conf. on Machine Learning (ICML), volume 70, 214–223.
Ball, P.; et al. 2020. Ready Policy One: World Building
Through Active Learning. In Proc. the 37th Intl. Conf. on
Machine Learning (ICML), volume 119, 591–601. Virtual.
Barth-Maron, G.; et al. 2018. Distributed Distributional De-
terministic Policy Gradients. In Proc. 6th Intl. Conf. on
Learning Representations (ICLR).
Bellemare, M.; et al. 2016. Unifying Count-Based Explo-
ration and Intrinsic Motivation. In Adv. Neural Inform. Proc.
Sys. (NIPS) 29, 1471–1479.
Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A
Distributional Perspective on Reinforcement Learning. In
Proc. the 34th Intl. Conf. on Machine Learning (ICML), vol-
ume 70, 449–458. Sydney, Australia.
Bellemare, M. G.; et al. 2013. The Arcade Learning En-
vironment: An Evaluation Platform for General Agents. In
Journal of Artiﬁcial Intelligence Research.
Bellman, R. 1957. Dynamic Programming. Princeton, NJ,
USA: Princeton University Press, 1st edition.
Beyret, B.; Shafti, A.; and Faisal, A. A. 2019. Dot-to-
Dot: Explainable Hierarchical Reinforcement Learning for
Robotic Manipulation. In 2019 IEEE/RSJ Intl. Conf. on In-
telligent Robots and Systems (IROS), 5014–5019. IEEE.
Blundell, C.; et al. 2015. Weight Uncertainty in Neural Net-
work. In Proc. the 32nd Intl. Conf. on Machine Learning
(ICML), volume 37, 1613–1622. Lille, France.
Bousquet, O.; et al. 2017. From Optimal Transport to Gener-
ative Modeling: the VEGAN cookbook. arXiv 1705.07642.
Chen, R. Y.; et al. 2017. UCB Exploration via Q-Ensembles.
Ciosek, K.; et al. 2019. Better Exploration with Optimistic
Actor Critic. In Adv. Neural Inform. Proc. Sys. (NIPS) 32,
Cuturi, M. 2013. Sinkhorn Distances: Lightspeed Computa-
tion of Optimal Transport. In Adv. Neural Inform. Proc. Sys.
(NIPS) 26, 2292–2300.
Dabney, W.; et al. 2018a. Implicit Quantile Networks for
Distributional Reinforcement Learning. In Proc. the 35th
Intl. Conf. on Machine Learning (ICML), volume 80, 1096–
1105. Stockholm, Sweden.
Dabney, W.; et al. 2018b. Distributional Reinforcement
Learning With Quantile Regression. In Proc. the 32nd AAAI
Conf. on Artiﬁcial Intelligence.
Doan, T.; Mazoure, B.; and Lyle, C. 2018. GAN Q-learning.
Donahue, J.; Kr¨
uhl, P.; and Darrell, T. 2017. Adver-
sarial Feature Learning. In Proc. the 5th Intl. Conf. Learning
Dumoulin, V.; et al. 2017. Adversarially Learned Inference.
In Proc. the 5th Intl. Conf. Learning Repres. (ICLR).
Freirich, D.; et al. 2019. Distributional Multivariate Pol-
icy Evaluation and Exploration with the Bellman GAN. In
Proc. the 36th Intl. Conf. on Machine Learning (ICML), vol-
ume 97, 1983–1992. Long Beach, CA, USA.
Genevay, A.; et al. 2016. Stochastic Optimization for Large-
scale Optimal Transport. In Adv. Neural Inform. Proc. Sys.
(NIPS) 29, 3440–3448.
Goodfellow, I. J.; et al. 2014. Generative Adversarial Net-
works. In Adv. Neural Inform. Proc. Sys. (NIPS) 27, 2672–
Gruslys, A.; et al. 2018. The Reactor: A Sample-Efﬁcient
Actor-Critic Architecture. In Proc. the 6th Intl. Conf. Learn-
ing Representations (ICLR).
Gulrajani, I.; et al. 2017. Improved Training of Wasser-
stein GANs. In Adv. Neural Inform. Proc. Sys. (NIPS) 30,
Haarnoja, T.; et al. 2018. Soft Actor-Critic: Off-Policy
Maximum Entropy Deep Reinforcement Learning with a
Stochastic Actor. In Proc. the 35th Intl. Conf. on Machine
Learning (ICML), volume 80, 1861–1870.
He, J.; et al. 2019. Lagging Inference Networks and Poste-
rior Collapse in Variational Autoencoders. In Proc. the 7th
Intl. Conf. Learning Representations (ICLR).
Hessel, M.; et al. 2018. Rainbow: Combining Improvements
in Deep Reinforcement Learning. In Proc. the 32nd AAAI
Conf. on Artiﬁcial Intelligence.
Hoffman, M. D.; and Johnson, M. J. 2016. ELBO surgery:
yet another way to carve up the variational evidence lower
bound. In In Workshop of Approximate Bayesian Inference
in Neural Information Processing Systems 29.
Houthooft, R.; et al. 2016. VIME: Variational Information
Maximizing Exploration. In Adv. Neural Inform. Proc. Sys.
(NIPS) 29, 1109–1117.
Koenker, R. 2005. Quantile Regression. Cambridge Univer-
Kuznetsov, A.; et al. 2020. Controlling Overestimation Bias
with Truncated Mixture of Continuous Distributional Quan-
tile Critics. In Proc. 37th Intl. Conf. on Machine Learning
Lattimore, T.; and Hutter, M. 2012. PAC Bounds for Dis-
counted MDPs. In Proc. the 23rd Intl. Conf. on Algorithmic
Learning Theory, 320–334.
Lillicrap, T. P.; et al. 2015. Continuous control with deep
reinforcement learning. In Proc. the 3rd Intl. Conf. Learning
Lim, S.; et al. 2018. Actor-Expert: A Framework for using
Q-learning in Continuous Action Spaces. arXiv 1810.09103.
Luise, G.; et al. 2018. Differential Properties of Sinkhorn
Approximation for Learning with Wasserstein Distance. In
Adv. Neural Inform. Proc. Sys. (NIPS) 31, 5859–5870.
Martin, J.; et al. 2020. Stochastically Dominant Distribu-
tional Reinforcement Learning. In Proc. the 37th Intl. Conf.
on Machine Learning (ICML). Virtual.
Mavrin, B.; et al. 2019. Distributional Reinforcement Learn-
ing for Efﬁcient Exploration. In Proc. the 36th Intl. Conf. on
Machine Learning (ICML), volume 97, 4424–4434. Long
Beach, CA, USA.
Mnih, V.; et al. 2016. Asynchronous Methods for Deep Re-
inforcement Learning. In Proc. the 33rd Intl. Conf. on Ma-
chine Learning (ICML), volume 48, 1928–1937. New York,
Montavon, G.; M¨
uller, K.-R.; and Cuturi, M. 2016. Wasser-
stein Training of Restricted Boltzmann Machines. In Adv.
Neural Inform. Proc. Sys. (NIPS) 29, 3718–3726.
Morimura, T.; et al. 2010. Parametric Return Density Esti-
mation for Reinforcement Learning. In Proc. the 26th Conf.
on Uncertainty in Artiﬁcial Intelligence.
Nachum, O.; Norouzi, M.; and Schuurmans, D. 2016. Im-
proving Policy Gradient by Exploring Under-appreciated
Rewards. In Proc. the 4th Intl. Conf. Learning Represen-
O’Donoghue, B.; et al. 2018. The Uncertainty Bellman
Equation and Exploration. In Proc. the 35th Intl. Conf. on
Machine Learning (ICML), volume 80, 3839–3848. Stock-
Oh, J.; et al. 2018. Self-Imitation Learning. In Proc. the
35th Intl. Conf. on Machine Learning (ICML), volume 80,
3878–3887. Stockholm, Sweden.
Osband, I.; et al. 2016. Deep Exploration via Bootstrapped
DQN. In Adv. Neural Inform. Proc. Sys. (NIPS) 29, 4026–
Osband, I.; et al. 2019. Deep Exploration via Randomized
Value Functions. Journal of Machine Learning Research 20:
Ostrovski, G.; et al. 2017. Count-Based Exploration with
Neural Density Models. In Proc. the 34th Intl. Conf. on
Machine Learning (ICML), volume 70, 2721–2730. Inter-
national Convention Centre, Sydney, Australia.
Puterman, M. L. 1994. Markov Decision Processes : Dis-
crete Stochastic Dynamic Programming. John Wiley &
Rosca, M.; Lakshminarayanan, B.; and Mohamed, S. 2018.
Distribution Matching in Variational Inference. arXiv
Rowland, M.; et al. 2018. An Analysis of Categorical
Distributional Reinforcement Learning. In Proc. 21st Intl.
Conf. on Artiﬁcial Intelligence and Statistics (AISTATS),
volume 84. Lanzarote, Spain.
Rowland, M.; et al. 2019. Statistics and Samples in Distribu-
tional Reinforcement Learning. In Proc. the 36th Intl. Conf.
on Machine Learning (ICML), volume 97, 5528–5536. Long
Beach, CA, USA.
Schaul, T.; et al. 2016. Prioritized Experience Replay. In
Proc. the 4th Intl. Conf. Learning Representations (ICLR).
Schmidhuber, J. 1991. Curious model-building control sys-
tems. In [Proceedings] 1991 IEEE International Joint Conf.
on Neural Networks, volume 2, 1458–1463.
Schulman, J.; et al. 2016. High-Dimensional Continuous
Control Using Generalized Advantage Estimation. In Proc.
the 4th Intl. Conf. Learning Representations (ICLR).
Schulman, J.; et al. 2017. Proximal Policy Optimization Al-
gorithms. arXiv 1707.06347.
Sekar, R.; et al. 2020. Planning to Explore via Self-
Supervised World Models. In Proc. the 37th Intl. Conf. on
Machine Learning (ICML), volume 119, 8583–8592. Vir-
Sutton, R. S.; et al. 1999. Policy Gradient Methods for Rein-
forcement Learning with Function Approximation. In Adv.
Neural Inform. Proc. Sys. (NIPS) 12, 1057–1063.
Tang, H.; et al. 2017. #Exploration: A Study of Count-Based
Exploration for Deep Reinforcement Learning. In Adv. Neu-
ral Inform. Proc. Sys. (NIPS) 30, 2753–2762.
Tang, Y.; and Agrawal, S. 2018. Exploration by Distribu-
tional Reinforcement Learning. In Proc. the 27th Interna-
tional Joint Conf. on Artiﬁcial Intelligence, 2710–2716.
Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A
physics engine for model-based control. In 2012 IEEE/RSJ
Intl. Conf. on Intelligent Robots and Systems, 5026–5033.
Tolstikhin, I.; et al. 2018. Wasserstein Auto-Encoders. In
Proc. the 6th Intl. Conf. Learning Representations (ICLR).
Villani, C. 2008. Optimal Transport: Old and New.
Grundlehren der mathematischen Wissenschaften. Springer
Wang, Z.; et al. 2016. Sample Efﬁcient Actor-Critic with
Experience Replay. In Intl. Conf. Learning Reps. (ICLR).
Zhao, S.; Song, J.; and Ermon, S. 2017. InfoVAE: In-
formation Maximizing Variational Autoencoders. ArXiv