Available via license: CC BY-NC-ND 4.0


Bayesian Distributional Policy Gradients

Luchen Li 1, A. Aldo Faisal 1,2,3

1Brain & Behaviour Lab, Dept. of Computing, Imperial College London, UK

2Brain & Behaviour Lab, Dept. of Bioengineering, Imperial College London, UK

3UKRI Centre in AI for Healthcare, Imperial College London, UK

{l.li17, aldo.faisal}@imperial.ac.uk

Abstract

Distributional Reinforcement Learning (RL) maintains the

entire probability distribution of the reward-to-go, i.e. the re-

turn, providing more learning signals that account for the

uncertainty associated with policy performance, which may

be beneﬁcial for trading off exploration and exploitation and

policy learning in general. Previous works in distributional

RL focused mainly on computing the state-action-return distributions;

here we model the state-return distributions. This

enables us to translate successful conventional RL algorithms

that are based on state values into distributional RL. We for-

mulate the distributional Bellman operation as an inference-

based auto-encoding process that minimises Wasserstein met-

rics between target/model return distributions. The proposed

algorithm, BDPG (Bayesian Distributional Policy Gradients),

uses adversarial training in joint-contrastive learning to esti-

mate a variational posterior from the returns. Moreover, we

can now interpret the return prediction uncertainty as an information

gain, which allows us to obtain a new curiosity mea-

sure that helps BDPG steer exploration actively and efﬁ-

ciently. We demonstrate in a suite of Atari 2600 games and

MuJoCo tasks, including well known hard-exploration chal-

lenges, how BDPG learns generally faster and with higher

asymptotic performance than reference distributional RL al-

gorithms.

Introduction

In reinforcement learning (RL), the performance of a policy

is evaluated by the (discounted) accumulated future rewards,

a random variable known as the reward-to-go or the return.

Instead of maintaining the expectations of returns as scalar

value functions, distributional RL estimates return distribu-

tions. Keeping track of the uncertainties around returns has

initially been leveraged as a means to raise risk awareness in

RL (Morimura et al. 2010; Lattimore and Hutter 2012). Re-

cently, a line of research pioneered by (Bellemare, Dabney,

and Munos 2017) applied the distributional Bellman opera-

tor for control purposes. Distributional RL is shown to out-

perform previous successful deep RL methods in Atari-57

when combined with other avant-garde developments in RL

(Hessel et al. 2018; Dabney et al. 2018a).

Copyright © 2021, Association for the Advancement of Artiﬁcial

Intelligence (www.aaai.org). All rights reserved.

The critical hurdle in distributional RL is to minimise

a Wasserstein distance between the distributions of a re-

turn and its Bellman target, under which the Bellman op-

eration is a contraction mapping (Bellemare, Dabney, and

Munos 2017). A differentiable Wasserstein distance estima-

tor can be obtained in its dual form with constrained Kan-

torovich potentials (Gulrajani et al. 2017; Arjovsky, Chin-

tala, and Bottou 2017), or approximated by restricting the

search for couplings to a set of smoothed joint probabili-

ties with entropic regularisations (Cuturi 2013; Montavon,

Müller, and Cuturi 2016; Genevay et al. 2016; Luise et al.

2018). Alternatively, a Bayesian inference perspective redi-

rects the search space to a set of probabilistic encoders that

map data in the input space to codes in a latent space (Bous-

quet et al. 2017; Tolstikhin et al. 2018; Ambrogioni et al.

2018). Bayesian approaches rely on inference to bypass rigid

and sub-optimal distributions that are usually entailed other-

wise, while retaining differentiability and tractability. More-

over, predictions based on inference, the expectation across

a latent space, are more robust to unseen data (Blundell et al.

2015) and thus able to generalise better.

In contrast to previous distributional RL work that focuses

on state-action-return distributions, here we investigate

state-return distributions and prove that their Bellman

operator is also a contraction in Wasserstein metrics. This

opens up the possibility of converting state-value algorithms

into distributional RL settings. We then formulate the dis-

tributional Bellman operation as an inference-based auto-

encoding process that minimises Wasserstein metrics be-

tween continuous distributions of the Bellman target and es-

timated return. A second beneﬁt of our inference model is

that the learned posterior enables a curiosity bonus in the

form of information gain (IG), which is leveraged as internal

reward to boost exploration efﬁciency. We explicitly calcu-

late the entropy reduction in a latent space corresponding to

return probability estimation as a KL divergence. In contrast

to previous work (Bellemare et al. 2016; Sekar et al. 2020;

Ball et al. 2020) in which IG was approximated with ensem-

ble entropy or prediction gains, we obtain analytical results

from our variational inference scheme.

To test our fully Bayesian approach and curiosity-driven exploration mechanism against a distributional RL backdrop, we embed these two innovations into a policy gradient framework. Both innovations would also work for value-based and off-policy policy gradient methods where the state-action-return distribution is modelled instead.

arXiv:2103.11265v1 [cs.LG] 20 Mar 2021

We evaluate and compare our method to other distribu-

tional RL approaches on the Arcade Learning Environment

Atari 2600 games (Bellemare et al. 2013), including some

of the best known hard-exploration cases, and on MuJoCo

continuous-control tasks (Todorov, Erez, and Tassa 2012).

To conclude we perform ablation experiments, where we in-

vestigate our exploration mechanism and the length of boot-

strapping in distributional Bellman backup.

Our key contributions in this work are two-fold: first, a fully inference-based generative approach to distributional Bellman operations; and second, a novel curiosity-driven exploration mechanism formulated as posterior information gains attributed to return prediction uncertainty.

Preliminaries

Wasserstein Variational Inference

In this subsection, we discuss how Wasserstein metrics in a

data space can be estimated in a Bayesian fashion using ad-

versarial training. Notation-wise we use calligraphic letters

for spaces, capital letters for random variables and lower-

case letters for values. We denote probability distributions

and densities with the same notation, discriminated by the

argument being capital or lower-case, respectively.

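In optimal transport problems (Villani 2008), divergences between two probability distributions are estimated as the cost required to transport probability mass from one to the other. Consider input spaces $\mathcal{X} \subseteq \mathbb{R}^n$, $\mathcal{Y} \subseteq \mathbb{R}^m$ and a pairwise cost function $c: \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}_+$. For two probability measures $\alpha \in \mathcal{P}(\mathcal{X})$, $\beta \in \mathcal{P}(\mathcal{Y})$, an optimal transport divergence is defined as

$$\mathcal{L}_c(\alpha, \beta) := \inf_{\gamma \in \Gamma(\alpha, \beta)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\gamma(x, y), \tag{1}$$

where $\Gamma(\alpha, \beta)$ is the set of joint distributions or couplings on $\mathcal{X} \times \mathcal{Y}$ with marginals $\alpha$ and $\beta$ respectively. In particular, when $\mathcal{Y} = \mathcal{X}$ and the cost function is derived from a metric over $\mathcal{X}$, $d: \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}_+$, via $c(x, y) = d^p(x, y)$, $p \geq 1$, the $p$-Wasserstein distance on $\mathcal{X}$ is given as

$$W_p(\alpha, \beta) := \mathcal{L}_{d^p}(\alpha, \beta)^{1/p}. \tag{2}$$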
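To make Eq. (2) concrete, the following is a minimal sketch for the special case of one-dimensional empirical distributions with equally many samples, where the optimal coupling is known to pair the sorted samples. The function name `wasserstein_p_1d` and the equal-sample-size restriction are assumptions made for this illustration and are not part of the method described in the paper.

```python
import numpy as np

def wasserstein_p_1d(x, y, p=2):
    """p-Wasserstein distance between two equal-sized 1-D empirical samples.

    In one dimension with cost |x - y|^p (p >= 1), an optimal coupling
    matches the i-th smallest sample of x with the i-th smallest of y.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    # Average transport cost under the sorted coupling, then take the p-th root.
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))
```

For example, shifting every sample by a constant moves all mass by that constant, so the distance equals the size of the shift regardless of $p$.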

Now consider a generative process through a latent variable $Z \in \mathcal{Z} \subseteq \mathbb{R}^l$ with a prior $p_Z(Z)$, a decoder $p_\theta(X|Z)$ and an encoder (amortised inference estimator) $q_\phi(Z|X)$, in which the parameters $\phi, \theta$ are trained to mimic the data distribution $p_X(X)$ implicitly represented by the i.i.d. training samples. The density corresponding to the model distribution can be expressed as $p_G(x) = \mathbb{E}_{z \sim p_Z}[p_\theta(x|z)]$. For a deterministic¹ decoder $X = G_\theta(Z)$, $p_G$ can be thought of as the push-forward of $p_Z$ through $G_\theta$, i.e. $p_G = G_\theta \# p_Z$. Minimising the Wasserstein distance between $p_X(X)$ and $p_G(X)$ is thereby equivalent to finding an optimal transport plan between $p_X(X)$ and $p_Z(Z)$, and matching the aggregated posterior $Q(Z) := \mathbb{E}_{x \sim p_X}[q_\phi(Z|x)]$ to the prior $p_Z(Z)$ (Bousquet et al. 2017; Tolstikhin et al. 2018; Ambrogioni et al. 2018; Rosca, Lakshminarayanan, and Mohamed 2018; He et al. 2019)

$$W_p^p(p_X, p_G) = \inf_{q_\phi : Q = p_Z} \mathbb{E}_{X \sim p_X} \mathbb{E}_{Z \sim q_\phi} \big[ d^p\big(X, G_\theta(Z)\big) \big]. \tag{3}$$

¹ For the purpose of generative modelling, the intuition of minimising Wasserstein metrics between target/model distributions (instead of stronger probability density discrepancies such as f-divergences) is to still see meaningful gradients when the model manifold and the true distribution's support have few intersections without introducing noise to the model distribution (by using a directed continuous mapping) that renders reconstructions blurry.

Marginal matching in Zis sometimes preferred for gener-

ative models, since it alleviates the posterior collapse prob-

lem (Zhao, Song, and Ermon 2017; Hoffman and Johnson

2016) by enabling Zto be more diversely distributed for

different x’s. However, when doing so, Eq. (3) is no longer a

proper inference objective, as it enforces neither posterior-

contrastive nor joint-contrastive learning. In fact, the encoder

need not approximate the true posterior pθ(Z|X)

exactly to satisfy the marginal match. In contrast, our ap-

proach maintains a fully Bayesian inference pipeline.

While explicit variational inference requires all proba-

bility densities to have analytical expressions, we bypass

this by direct density matching through adversarial training

(Goodfellow et al. 2014), which requires only that densities

can be sampled from for gradient backpropagation, thereby

allowing for a degenerate decoder $p_\theta(X|Z) = \delta_{G_\theta(Z)}(X)$.

Lemma 1. (Donahue, Krähenbühl, and Darrell 2017) Let $p(X, Z)$ and $q(X, Z)$ denote the joint sampling distributions induced by the decoder and encoder respectively, $D_\psi$ a discriminator, and define $F(\psi, \phi, \theta) := \mathbb{E}_{x,z \sim p}[\log D_\psi(x, z)] + \mathbb{E}_{x,z \sim q}[\log(1 - D_\psi(x, z))]$. For any encoder and decoder, deterministic or stochastic, the optimal discriminator $D_{\psi^*} = \arg\max_{D_\psi} F(\psi, \phi, \theta)$ is the Radon-Nikodym derivative of measure $p(X, Z)$ w.r.t. $p(X, Z) + q(X, Z)$. The encoder and decoder's objective for an optimal discriminator $C(\phi, \theta) := F(\psi^*, \phi, \theta)$ can be written in terms of the Jensen-Shannon divergence as $C(\phi, \theta) = 2\,\mathrm{JS}(p, q) - \log 4$, whose global minimum is achieved if and only if $p(X, Z) = q(X, Z)$.
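The identity in Lemma 1 can be checked numerically on a small discrete example. The sketch below is only an illustration: the two probability vectors stand in for the joint distributions $p(X, Z)$ and $q(X, Z)$ on a four-point space, and the helper names `js_divergence` and `objective` are hypothetical.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two strictly positive discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def objective(d, p, q):
    """F = E_p[log D] + E_q[log(1 - D)] for a discriminator table d with entries in (0, 1)."""
    return float(np.sum(p * np.log(d)) + np.sum(q * np.log(1.0 - d)))

# Toy joint distributions over four (x, z) outcomes (illustrative values only).
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.4, 0.3, 0.2, 0.1])

# Optimal discriminator: the density of p with respect to p + q.
d_star = p / (p + q)
c = objective(d_star, p, q)
```

Here `c` coincides with `2 * js_divergence(p, q) - np.log(4)`, and any other discriminator table yields a smaller value of `objective`.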

We jointly minimise Wasserstein metrics between

model/target distributions in $\mathcal{X}$ and conduct variational inference

adversarially. $p_G$ will be shown to model the

distribution of random returns, leading to a novel approach

to accomplishing distributional Bellman operations.

Distributional Reinforcement Learning

We start by laying out RL and policy gradients notation, then

explain the distributional perspective of RL, as well as pre-

vious solutions to it.

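Policy Gradients. A standard RL task is framed within a Markov decision process (MDP) $\langle \mathcal{S}, \mathcal{A}, R, P, \gamma \rangle$ (Puterman 1994), where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces respectively, $R: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}^n$ a potentially stochastic reward function, $P: \mathcal{S} \times \mathcal{A} \mapsto \mathcal{P}(\mathcal{S})$ a transition probability density function, and $\gamma \in (0, 1)$ a temporal discount factor. An RL agent has a policy that maps states to a probability distribution over actions $\pi: \mathcal{S} \mapsto \mathcal{P}(\mathcal{A})$.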

The return $G^\pi$ under the policy $\pi$ is a random variable that represents the sum of discounted future rewards, and the state-dependent return is $G^\pi(s) := \sum_{t=0}^{\infty} \gamma^t r_t$, $s_0 = s$. A state value function is defined as the expected return $V^\pi(s) := \mathbb{E}[G^\pi(s)]$, and a state-action value function as the expected state-action return $Q^\pi(s, a) := \mathbb{E}[G^\pi(s, a)]$. The Bellman operator $T^\pi$ (Bellman 1957) is defined as

$$T^\pi V^\pi(s) := \mathbb{E}_{\pi, R, P}\big[ r + \gamma V^\pi(s') \big], \tag{4}$$

$$T^\pi Q^\pi(s, a) := \mathbb{E}_{R, P, \pi}\big[ r + \gamma Q^\pi(s', a') \big]. \tag{5}$$

Policy gradient methods (Sutton et al. 1999) optimise a parameterised policy $\pi$ by directly ascending the gradient of a policy performance objective such as $\mathbb{E}_{s \sim d^\pi, a \sim \pi(\cdot|s)}[\log \pi(a|s) A^\pi(s, a)]$ (Mnih et al. 2016; Schulman et al. 2016) with respect to the parameters of $\pi$, where $d^\pi(s)$ is the marginal state density induced by $\pi$, and the advantage function $A^\pi$ can be estimated as $T^\pi V^\pi(s) - V^\pi(s)$.

Distributional RL. In distributional reinforcement learning, the distributions of returns instead of their expectations (i.e. value functions) are maintained. The distributional Bellman equation in terms of the state-action return is (Bellemare, Dabney, and Munos 2017)

$$T^\pi G^\pi(s, a) :\overset{D}{=} R(s, a) + \gamma G^\pi(S', A'). \tag{6}$$

The distribution equation $U :\overset{D}{=} V$ specifies that the random variable $U$ is distributed by the same law as the random variable $V$. The reward $R(s, a)$, next state-action tuple $S', A'$ and its return $G^\pi(S', A')$ are random variables, with compound randomness stemming from $\pi$, $P$, and $R$.

Eq. (6) is a contraction mapping in the $p$-th order Wasserstein metrics $W_p$ (Bellemare, Dabney, and Munos 2017). Previously, Eq. (6) has been exploited in a Q-learning style value iteration, with the distribution of $G^\pi(s, a)$ represented as a particle set, updated either through cross-entropy loss (Bellemare, Dabney, and Munos 2017), quantile regression (Dabney et al. 2018a,b; Rowland et al. 2019), or Sinkhorn iterations (Martin et al. 2020). Particle-based (ensemble) critics $G^\pi(s, a)$ are incorporated into conventional off-policy policy gradient methods by Barth-Maron et al. (2018) and Kuznetsov et al. (2020). A continuous $G^\pi(s, a)$ distribution can be obtained via Wasserstein-GAN (WGAN) (Arjovsky, Chintala, and Bottou 2017), and has been investigated in both Q-learning (Doan, Mazoure, and Lyle 2018) and policy gradients (Freirich et al. 2019). We remark that these works all estimate return distributions with empirical approximations, e.g. particle set or WGAN.

Methods

We begin with proving that the distributional Bellman oper-

ation in terms of state-return distributions is also a contrac-

tion mapping in Wasserstein metrics. We then show resem-

blance between distributional Bellman update and a varia-

tional Bayesian solution to return distributions, leading to a

novel distributional RL approach. Thereafter, we propose an

internal incentive that leverages posterior IG stemming from

return estimation inaccuracy.

Distributional Bellman Operator for State-Return

First, in the same sense that Eq. (6) extends Eq. (5), we extend Eq. (4) and define the distributional Bellman operator regarding the state return $G^\pi(s)$ as

$$T^\pi G^\pi(s) :\overset{D}{=} R(s) + \gamma G^\pi(S'). \tag{7}$$

Now we demonstrate that Eq. (7) is also a contraction in $p$-Wasserstein metrics.

For notational convenience, we write the infimum-$p$-Wasserstein metric in Eq. (2) in terms of random variables: $d_p(X, Y) := W_p(\alpha, \beta)$, $X \sim \alpha$, $Y \sim \beta$. Let $\mathcal{G} \subseteq \mathbb{R}^n$ denote a space of returns valid in the MDP, and $\Omega \subseteq \mathcal{P}(\mathcal{G})^{\mathcal{S}}$ a space of state-return distributions with bounded moments. Represent as $\omega$ the collection of distributions $\{\omega(s) \mid s \in \mathcal{S}\}$, in which $\omega(s)$ is the distribution of random return $G(s)$. For any two distributions $\omega_1, \omega_2 \in \Omega$, the supremum-$p$-Wasserstein metric on $\Omega$ is defined as (Bellemare, Dabney, and Munos 2017; Rowland et al. 2018)

$$\bar{d}_p(\omega_1, \omega_2) := \sup_{s \in \mathcal{S}} d_p\big(G_1(s), G_2(s)\big). \tag{8}$$

Lemma 2. $\bar{d}_p$ is a metric over state-return distributions.

The proof is a straightforward analogue to that of Lemma 2 in (Bellemare, Dabney, and Munos 2017), substituting the state space $\mathcal{S}$ for the state-action space $\mathcal{S} \times \mathcal{A}$.

Proposition 1. The distributional Bellman operator for state-return distributions is a $\gamma$-contraction in $\bar{d}_p$.

Proof. The reward $R(s) \in \mathcal{G}$ is a random vector such that $R(s) = \int_{\mathcal{A}} R(s, a) \tilde{\pi}(a|s) \, da$, where $\tilde{\pi}$ denotes the normalised policy $\pi$. Represent the marginal state transition kernel under policy $\pi$ as $P^\pi(s'|s) = \int_{\mathcal{A}} P(s'|s, a) \tilde{\pi}(a|s) \, da$. Then define a corresponding transition operator $P^\pi: \mathcal{G} \mapsto \mathcal{G}$

$$P^\pi G(s) :\overset{D}{=} G(S'), \quad S' \sim P^\pi(\cdot|s). \tag{9}$$

With the marginal state transition operator substituted for the action-dependent one, the rest of the proof is analogous to that of Lemma 3 presented by Bellemare, Dabney, and Munos (2017).

We therefore conclude that Eq. (7) has a unique fixed point $G^\pi$. Proposition 1 vindicates backing up distributions of the state return $G^\pi(s)$ by minimising Wasserstein metrics to a target distribution.
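For readers who want the omitted step spelled out, the chain of inequalities behind the contraction can be sketched as follows. This is a reconstruction of the standard argument under stated assumptions rather than the authors' own derivation: it assumes that $d_p$ satisfies $d_p(aU, aV) \le |a| \, d_p(U, V)$ for a scalar $a$, that $d_p(A + U, A + V) \le d_p(U, V)$ for a random variable $A$ independent of $U$ and $V$, and that mixing over next states cannot increase the distance beyond its supremum over states.

$$
\begin{aligned}
d_p\big(T^\pi G_1(s), T^\pi G_2(s)\big)
&= d_p\big(R(s) + \gamma P^\pi G_1(s),\; R(s) + \gamma P^\pi G_2(s)\big) \\
&\le \gamma \, d_p\big(P^\pi G_1(s),\; P^\pi G_2(s)\big) \\
&\le \gamma \sup_{s' \in \mathcal{S}} d_p\big(G_1(s'), G_2(s')\big)
= \gamma \, \bar{d}_p(\omega_1, \omega_2).
\end{aligned}
$$

Taking the supremum over $s$ on the left-hand side then gives $\bar{d}_p(T^\pi \omega_1, T^\pi \omega_2) \le \gamma \, \bar{d}_p(\omega_1, \omega_2)$, where $T^\pi \omega_i$ is used here as shorthand for the collection of distributions of $T^\pi G_i(s)$.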

For the policy gradient theorem (Sutton et al. 1999) to hold, one would need at each encountered $s_t$ an unbiased estimator of $\mathbb{E}\big[\sum_{k=0}^{\infty} \gamma^k r_{t+k}\big]$ in computing the policy gradient. In distributional RL, such a quantity is obtained by sampling from the approximated return distribution (or averaging across such samples). The Bellman operator being a contraction ensures convergence to a unique true on-policy return distribution, whose expectation is thereby also unbiased. The same holds also for sample estimates of the state-conditioned reward-to-go and thereby for the advantage function.

Algorithm 1 Bayesian Distributional Policy Gradients

1: Initialise encoder $q_\phi(Z|X, S)$, generator $G_\theta(Z, S)$, prior $p_\theta(Z|S)$, discriminator $D_\psi(X, Z, S)$ and policy $\pi$
2: While not converged:
     // roll out
3:   training batch $\mathcal{D} \leftarrow \emptyset$
4:   For $t = 0, \ldots, k-1$, $\forall$ threads:
5:     execute $a_t \sim \pi(\cdot|s_t)$, get $r_t$, $s_{t+1}$
6:     sample return $z_t \sim p_\theta(\cdot|s_t)$, $g_t \leftarrow G_\theta(z_t, s_t)$
7:     update $\mathcal{D}$
8:   sample last return $z_k \sim p_\theta(\cdot|s_k)$, $g_k \leftarrow G_\theta(z_k, s_k)$
     // estimate advantage for whole batch
9:   For $t \in \mathcal{D}$:
10:    estimate advantage $\hat{A}_t$ with $r_{t:t+k-1}$, $g_{t:t+k}$ using any estimation method
11:    Bellman target $x_t \leftarrow \hat{A}_t + g_t$
12:    get curiosity reward $r^c_t$ by Eq. (13)-(14)
13:    get augmented advantage $\hat{A}^c_t$ by substituting $r_t$ in $\hat{A}_t$ with $r_t + r^c_t$
     // train with mini batch $B \subset \mathcal{D}$
14:  For $t \in B$:
15:    encode $\tilde{z}_t \sim q_\phi(\cdot|x_t, s_t)$
16:    sample from joint $p_\theta$: $z_t \sim p_\theta(\cdot|s_t)$, $\tilde{x}_t \leftarrow G_{\bar{\theta}}(z_t, s_t)$
     // take gradients
17:  update $D_\psi$ by ascending $\frac{1}{|B|} \sum_{t \in B} \log D_\psi(\tilde{x}_t, z_t, s_t) + \log\big(1 - D_\psi(x_t, \tilde{z}_t, s_t)\big)$
18:  update encoder, prior by ascending $\frac{1}{|B|} \sum_{t \in B} \log\big(1 - D_\psi(\tilde{x}_t, z_t, s_t)\big) + \log D_\psi(x_t, \tilde{z}_t, s_t)$
19:  update $G_\theta$ by descending $\frac{1}{|B|} \sum_{t \in B} \| x_t - G_\theta(\tilde{z}_t, s_t) \|_2^2$
20:  update $\pi$ by ascending $\frac{1}{|B|} \sum_{t \in B} \log \pi(a_t|s_t) \hat{A}^c_t$ using any policy gradient method
21: Return $\pi$

Inference in Distributional Bellman Update

We now proceed to show that the distribution of $T^\pi G^\pi(s)$ can be interpreted as the target distribution $p_X$, and hence propose a new approach to distributional RL. Specifically, let the data space $\mathcal{X} = \mathcal{G}$ be the space of returns. For all $s \in \mathcal{S}$, we use the shorthand $x(s) := T^\pi G^\pi(s)$, $g(s) := G^\pi(s)$, thus $x(s), g(s) \in \mathcal{X}$. We view the Bellman target $x(s)$ as a sample from the empirical data distribution $x(s) \sim p_X(X|s)$, whilst the estimated return $g(s)$ is generated from the model distribution $g(s) \sim p_G(X|s)$. The state $s$ is an observable condition to the generative model: its distribution is of no interest to and not modelled in the Bayesian system.

We factorise the $s$-conditioned sampling distributions in Lemma 1 such that

$$p_\theta(X, Z|s) := p_\theta(X|Z, s) \, p_\theta(Z|s), \qquad q_\phi(X, Z|s) := p_X(X|s) \, q_\phi(Z|X, s). \tag{10}$$

$p_\theta(X|Z, s) = \delta_{G_\theta(Z, s)}(X)$ is a deterministic decoder.

The intuition of a state-conditioned, learned prior for Z

instead of a simple, ﬁxed one, is to add stochasticity for the

prior and encoder to meet halfway. Similar to the encoder,

we represent the prior also in a variational fashion and sam-

ple through re-parameterisation during gradient estimation.

Lemma 1 implies that training $D_\psi$ and the generative model alternately with the factorisation in Eq. (10) would suffice both for the encoder $q_\phi(Z|X, s)$ to approximate the true posterior $p_\theta(Z|X, s) := p_\theta(X|Z, s) p_\theta(Z|s) / p_X(X|s)$ and to reconstruct in $\mathcal{X}$ (Dumoulin et al. 2017; Donahue, Krähenbühl, and Darrell 2017). Notice that a globally observable condition $s$ is orthogonal to Lemma 1 and Eq. (3). So is a learned prior: both the true posterior and the push-forward are relative to the prior $p_\theta(Z|s)$.

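In our work, in contrast, $G_\theta$ is deemed fixed in relation to the minimax game, leaving the encoder, prior and discriminator to be trained in the minimax game

$$\max_{D_\psi} \min_{q_\phi, p_\theta} \; \mathbb{E}_{z \sim p_\theta(Z|s)} \big[ \log D_\psi\big(G_{\bar{\theta}}(z, s), z, s\big) \big] + \mathbb{E}_{x \sim p_X(X|s)} \mathbb{E}_{z \sim q_\phi(Z|x, s)} \big[ \log\big(1 - D_\psi(x, z, s)\big) \big]. \tag{11}$$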

The overhead bar $\bar{(\cdot)}$ denotes that the gradient is not backpropagated through the parameter in question. This means $q_\phi(Z|X, s)$ is still trained to approximate the true posterior induced by the current $G_\theta$, irrespective of the capability of the latter for reconstruction. Meanwhile, the reconstruction is achieved by minimising a Wasserstein metric in $\mathcal{X}$

$$\min_{G_\theta} \; \mathbb{E}_{x \sim p_X(X|s)} \mathbb{E}_{z \sim q_{\bar{\phi}}(Z|x, s)} \big[ d^p\big(x, G_\theta(z, s)\big) \big]. \tag{12}$$
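The three training signals in Eq. (11), Eq. (12) and steps 17 to 19 of Algorithm 1 can be written as plain batch averages. The sketch below is an illustration only: it takes already-computed discriminator outputs as NumPy arrays rather than any particular network, the function names are hypothetical, and the stop-gradient behaviour indicated by the overhead bar is not modelled because no automatic differentiation is involved.

```python
import numpy as np

def discriminator_objective(d_prior_pair, d_encoder_pair):
    """Objective ascended by the discriminator (Algorithm 1, step 17).

    d_prior_pair:   D(x_tilde, z, s) for z drawn from the prior, x_tilde = G(z, s)
    d_encoder_pair: D(x, z_tilde, s) for z_tilde drawn from the encoder given target x
    Both are arrays of probabilities strictly inside (0, 1).
    """
    d_prior_pair = np.asarray(d_prior_pair, dtype=float)
    d_encoder_pair = np.asarray(d_encoder_pair, dtype=float)
    return float(np.mean(np.log(d_prior_pair) + np.log(1.0 - d_encoder_pair)))

def encoder_prior_objective(d_prior_pair, d_encoder_pair):
    """Label-flipped objective ascended by the encoder and prior (step 18)."""
    d_prior_pair = np.asarray(d_prior_pair, dtype=float)
    d_encoder_pair = np.asarray(d_encoder_pair, dtype=float)
    return float(np.mean(np.log(1.0 - d_prior_pair) + np.log(d_encoder_pair)))

def reconstruction_loss(x, x_recon):
    """Squared Euclidean reconstruction loss descended by the generator (step 19)."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_recon, dtype=float)
    # Sum over the return dimension, average over the mini batch.
    return float(np.mean(np.sum(diff ** 2, axis=-1)))
```

When the discriminator cannot tell the two kinds of pairs apart and outputs 0.5 everywhere, both adversarial objectives equal $-\log 4$, which matches the minimum value $2\,\mathrm{JS}(p, q) - \log 4$ at $p = q$ in Lemma 1.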

Essentially, we are alternating between training the encoder and prior via Eq. (11) and training the generator via Eq. (12). We will use a fixed prior $p_Z$ and omit state dependence in the ensuing discussion, as they do not affect convergence. If the encoder approximates the true posterior everywhere in $\mathcal{X}$, the aggregated posterior $Q(Z)$ is naturally matched to the prior $p_Z(Z)$, so long as $p_\theta(X|Z)$ is properly normalised, as is indeed the case when it is degenerate. As such, meeting the constraint on the search space in Eq. (3) is a necessary condition for accurate posterior approximation. Note that in Eq. (3), $\mathbb{E}_{p_X} \mathbb{E}_{q_\phi}[G_\theta(Z)]$ is the push-forward of $Q(Z)$, $G_\theta \# Q$. The primal form of $W_p(p_X, p_G)$, where $p_G = G_\theta \# p_Z$, is thereby the infimum of $W_p(p_X, G_\theta \# Q)$ over $q_\phi$ s.t. $Q = p_Z$. Therefore, $W_p(p_X, G_\theta \# Q)$ is an upper bound to the true objective $W_p(p_X, G_\theta \# p_Z)$ upon $Q = p_Z$.

Learning converges as we explain in the following. To provide intuition, we highlight the resemblances to the Expectation-Maximisation (EM) algorithm. Eq. (11) enforces contrastive learning such that the variational posterior approaches the true posterior, comparable to the E-step in EM. Eq. (11) allows us to compute a bound $W_p(p_X, G_\theta \# Q)$ in Eq. (12), which is equivalent to the computationally tractable surrogate objective function of the negative free energy in EM, or ELBO in variational Bayes. The expected Wasserstein metric w.r.t. the current $q_\phi$ is then minimised by updating the parameters of the decoder via Eq. (12). This update is reminiscent of the M-step in EM, which maximises the expected log likelihood while fixing inference for $Z$.

Figure 1: Learning curves on Atari games with the mean (solid line) and standard deviation (shaded area) across 5 runs.

In our method $W_p(p_X, G_\theta \# Q)$ acts as an upper bound when $Q = p_Z$, whereas in EM the surrogate objective is a lower bound. This upper bound decreases in Eq. (11) as it approaches the true objective $W_p(p_X, G_\theta \# p_Z)$. Eq. (12) then decreases $W_p(p_X, G_\theta \# Q)$ further and consequently also decreases $W_p(p_X, G_\theta \# p_Z)$. Note that the condition $Q = p_Z$ does not have to hold on each iteration, but can be amortised over iterations. Assuming infinite model expressiveness, the discrepancy between $Q$ and $p_Z$ shrinks monotonically, as all determinant functions for $Q := \mathbb{E}_{p_X}[q_\phi(\cdot|x)] = p_Z$ in Eq. (11) are fixed irrespective of the value of $\theta$. When $q_\phi$ converges to the true posterior, $W_p(p_X, G_\theta \# Q)$ is more sufficiently an upper bound due to restricted search space in the primal form. While $W_p(p_X, G_\theta \# Q)$ functions as an amortised upper bound, $W_p(p_X, G_\theta \# p_Z)$ still decreases continually (as opposed to from each iteration) and converges to a local minimum.

The merit of the two-step training is two-fold: 1) with only the distributions over $\mathcal{Z}$ under tuning in the minimax game, the adversarial training comes off with a weaker topology and is not relied upon for reconstruction, making its potential instability less of a concern; and 2) an explicit distance loss $d$ in $\mathcal{X}$ minimises $W_p$ to ensure contraction of return distribution backups. If everything were trained adversarially in JS divergence and allowed to reach global optimum, the decoder and encoder would be reversing each other in the density domain. In our setting, $Q$ is matched to $p_Z$ everywhere in $\mathcal{Z}$, while $p_G$ has minimum $W_p$ distance to $p_X$.

At each step of environmental interaction, a state return is sampled via the standard two steps $g(s) \sim p_G(X|s) \iff z \sim p_\theta(Z|s), \; g(s) \leftarrow G_\theta(z, s)$. The one-step Bellman target $x(s)$ is calculated as $r + \gamma g(s')$. Generalisation to $k$-step bootstrap can be made analogously to conventional RL.
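As a small worked example of the target construction just described, the sketch below computes a $k$-step Bellman target from a reward sequence and one sampled return at the bootstrap state. It assumes scalar rewards and returns and no episode termination inside the window; the function name `k_step_bellman_target` is hypothetical.

```python
import numpy as np

def k_step_bellman_target(rewards, bootstrap_return, gamma):
    """x = sum_{i=0}^{k-1} gamma^i * r_i + gamma^k * g(s_k).

    rewards:          the k rewards collected after leaving the current state
    bootstrap_return: one sampled return g(s_k) at the state reached after k steps
    gamma:            discount factor in (0, 1)
    """
    rewards = np.asarray(rewards, dtype=float)
    k = len(rewards)
    discounts = gamma ** np.arange(k)
    return float(np.sum(discounts * rewards) + gamma ** k * bootstrap_return)
```

With a single reward this reduces to the one-step target $r + \gamma g(s')$.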

Exploration through Posterior Information Gain

Curiosity (Schmidhuber 1991; Schaul et al. 2016; Houthooft

et al. 2016; Freirich et al. 2019) produces internal incentives

when external reward is sparse. We explore through encour-

aging visits to regions where the model’s ability to predict

the reward-to-go from current return distribution is weak.

However, the Bellman error $x(s) - g(s)$ is not a preferable indicator, as high $x(s) - g(s)$ may well be attributed to high moments of $g(s)$ itself under point estimation (i.e. the aleatoric uncertainty), whereas it is the uncertainty in value belief due to estimating parameters with limited data around the state-action tuple (i.e. the epistemic uncertainty) that should be driving strengthened visitation.

To measure the true reduction in uncertainty about return

prediction, we estimate discrepancies in function space in-

stead of parameter space. Speciﬁcally, the insufﬁciency in

data collection can be interpreted as how much a posterior

distribution of a statistic or parameter inferred from a condi-

tion progresses from a prior distribution with respect to the

action execution that changes this condition, i.e., the IG. A

large IG means a large amount of data is required to achieve

the update. In its simplest form, the condition is implicitly

the data trained on. In exact Bayes, the condition itself can

be thought of as a variable estimated from data, e.g. the ran-

dom return X, hence enabling an explicit IG derived from

existing posterior model $q_\phi(Z|X)$. Therefore, we define the IG $u(s)$ at $s$ in return estimation as

$$u(s) := \mathrm{KL}\big( q_\phi(Z \,|\, x(s), s) \;\|\; q_\phi(Z \,|\, g(s), s) \big). \tag{13}$$
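If the encoder outputs a diagonal Gaussian over $Z$, the KL divergence in Eq. (13) has a closed form. The diagonal Gaussian parameterisation is an assumption made for this sketch, since the text here does not specify the form of $q_\phi$, and the function name `kl_diag_gaussians` is hypothetical.

```python
import numpy as np

def kl_diag_gaussians(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, diag(sigma1^2)) || N(mu2, diag(sigma2^2)) ), summed over dimensions.

    In the setting of Eq. (13), (mu1, sigma1) would come from encoding the
    Bellman target x(s) and (mu2, sigma2) from encoding the estimate g(s).
    All standard deviations must be strictly positive.
    """
    mu1, sigma1, mu2, sigma2 = (
        np.asarray(a, dtype=float) for a in (mu1, sigma1, mu2, sigma2)
    )
    per_dim = (
        np.log(sigma2 / sigma1)
        + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
        - 0.5
    )
    return float(np.sum(per_dim))
```

The value is zero when the two encodings coincide, which corresponds to a vanishing information gain once $g(s)$ and $x(s)$ are indistinguishable.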

Before the transition, the agent's estimation for return is $g(s)$. The action execution enables the computation of the Bellman target $x(s)$, which would not be viable before the transition, so $q_\phi(Z|g(s), s)$ acts here as a prior. As a result, $u(s)$ would encourage the agent to make transitions that maximally acquire new information about $Z$, hence facilitating updating $p_G$ towards $p_X$. Upon convergence, $g(s)$ and $x(s)$ are indistinguishable and the IG approaches 0. The benefit of our IG is three-fold: it is moments-invariant, makes use of all training data, and increases computation complexity only in forward-passing the posterior model when calculating the KL divergence without even requiring gradient backpropagation.

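The curiosity reward $r^c(s)$ is determined by $u(s)$ and a truncation scheme $\mathcal{R}: \mathbb{R}_+ \mapsto [0, \eta \cdot \bar{u})$, $\eta, \bar{u} \in \mathbb{R}_+^*$, to prevent radical exploration

$$r^c(s) := \mathcal{R}(u(s)) := \eta_t \cdot \min\big(u(s), \bar{u}\big). \tag{14}$$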

We exploit relative value by normalising the clipped $u(s)$ by a running mean and standard deviation of previous IGs. The exploration coefficient $\eta_t$ is logarithmically decayed as $\eta_t = \eta \sqrt{\log t / t}$, by the rate at which the parametric uncertainty decays (Koenker 2005; Mavrin et al. 2019), where $t$ is the global training step, and $\eta$ an initial value, to assuage exploration getting more sensitive to the value of $u(s)$ as parameters become more accurate.
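The truncation and decay can be sketched in a few lines. The sketch assumes the decay schedule reads $\eta_t = \eta \sqrt{\log t / t}$, which is how the garbled expression is interpreted here, and it omits the running-statistics normalisation for brevity; the function names are hypothetical.

```python
import math

def exploration_coefficient(eta, t):
    """eta_t = eta * sqrt(log(t) / t) for a global training step t >= 1."""
    return eta * math.sqrt(math.log(t) / t)

def curiosity_reward(u, u_cap, eta, t):
    """Eq. (14): the information gain u is clipped at u_cap, then scaled by eta_t."""
    return exploration_coefficient(eta, t) * min(u, u_cap)
```

Under this reading the coefficient is zero at $t = 1$ and shrinks towards zero for large $t$, so the curiosity bonus fades as training progresses.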

We use rcto augment return backup during policy update,

as the purpose is for the action to lead to uncertain regions by

encouraging curiosity about future steps. When training the

generative model for return distributions we use the original

reward only.

We investigate a multi-step advantage function. The con-

traction property of the distributional Bellman operator is

propagated from 1-step to k-step scenarios by the same logic

as in conventional RL. The beneﬁt of looking into further

steps for exploration is intuitive viewed from the long-term

goal of RL tasks: the agent should not be complacent about

a state just because it is informative to immediate steps.

The pseudocode in Algorithm 1 presents a mini-batch version of our methodology BDPG. We denote state return $g(s_t)$ as $g_t$ for compactness. Other step-dependent values are shorthanded accordingly. We use Euclidean distance for reconstruction, leading to the $W_2$ metric being minimised. $k$ is the number of unroll steps, and is also the maximum bootstrap length, although the two are not necessarily the same.

Related Work

Policy optimisation enables importance sampling based off-

policy evaluation for re-sampling weights in experience re-

play schemes (Wang et al. 2016; Gruslys et al. 2018). In

continuous control, where the policy is usually a paramet-

ric Gaussian, exploration can be realised by perturbing the

Gaussian mean (Lillicrap et al. 2015; Ciosek et al. 2019), or

maintaining a mixture of Gaussians (Lim et al. 2018). Alter-

natively, random actions can be directly incentivised by reg-

ularising policy (cross-)entropy (Abdolmaleki et al. 2015;

Mnih et al. 2016; Nachum, Norouzi, and Schuurmans 2016;

Akrour et al. 2016; Haarnoja et al. 2018).

A group of works propose to exploit epistemic uncertainty

via an approximate posterior distribution of Q values.

Stochastic posterior estimates are constructed through

training on bootstrapped perturbations of data (Osband et al.

2016, 2019), or overlaying learned posterior variance (Chen

et al. 2017; O’Donoghue et al. 2018). While this series of

works can be thought of as posterior sampling w.r.t. Q values,

Tang and Agrawal (2018) approximate Bayesian inference

by sampling parameters for a distributional RL model.

On the other hand, a particle-based distributional RL model

itself registers a notion of dispersion, inspiring risk-averse and

risk-seeking policies (Dabney et al. 2018a) and optimism-

in-the-face-of-uncertainty quantiﬁed by the variance of the

better half of the particle set (Mavrin et al. 2019).

Figure 2: Impact of bootstrap length $k$ and truncation cap $\bar{u}$ for information gain at 10M and 150M steps into training.

There are also approaches exploiting dynamics uncer-

tainty (Houthooft et al. 2016), pseudo counts (Bellemare

et al. 2016; Ostrovski et al. 2017; Tang et al. 2017), gradient

of a generative model (Freirich et al. 2019), and good past

experiences (Oh et al. 2018), that do not estimate dispersion

or model disagreement of value functions.

Evaluation

We evaluate our method on the Arcade Learning Environ-

ment Atari 2600 games and continuous control with Mu-

JoCo. We estimate a k-step advantage function using Gener-

alised Advantage Estimation (GAE) (Schulman et al. 2016),

and update the policy using Proximal Policy Optimisation

(PPO) (Schulman et al. 2017) which maximises a clipped

surrogate of the policy gradient objective. For both Atari and

MuJoCo environments, we use 16 parallel workers for data

collection, and train in mini batches. For Atari, we unroll

128 steps with each worker in each training batch for all al-

gorithms, and average scores every 80 training batches. For

MuJoCo, we unroll 256 steps, and average scores every 4

batches. Except for ablation experiments that used rollout

length max(k, 128), the number of unroll steps is also the

bootstrap length k.
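For reference, Generalised Advantage Estimation over one rollout can be sketched as below. This is a generic version of the estimator under stated assumptions, not the authors' implementation: it assumes scalar rewards, no episode boundary inside the rollout, and a `values` array holding one return estimate per visited state plus one for the bootstrap state (in BDPG these would be the sampled returns $g_t$). The function name and the parameter `lam` are chosen for this illustration.

```python
import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    """Generalised Advantage Estimation for a single rollout.

    rewards: length-k sequence of rewards r_0 .. r_{k-1}
    values:  length-(k+1) sequence of return estimates, last one for the bootstrap state
    gamma:   discount factor
    lam:     GAE mixing parameter in [0, 1]
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = np.zeros_like(rewards)
    running = 0.0
    # Accumulate discounted one-step residuals backwards through the rollout.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam` to 0 recovers the one-step estimate $r + \gamma g(s') - g(s)$, while `lam` equal to 1 uses the full rollout before bootstrapping.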

We compare to other distributional RL baselines on eight

of the Atari games, including some of the recognised hard-

exploration tasks: Freeway,Hero,Seaquest and Venture.

Direct comparisons to previous works are not meaningful

due to compounded discrepancies. To allow for a meaning-

ful comparison, we implement our own versions of base-

lines, ﬁxing other algorithmic implementation choices such

that the tested algorithms vary only in how return distribu-

tions are estimated and in the exploration scheme. We mod-

ify two previous algorithms (Freirich et al. 2019) and (Dab-

ney et al. 2018b), retaining their return distribution estima-

tors as benchmarks: a generative model Wasserstein-GAN

(Arjovsky, Chintala, and Bottou 2017) (PPO+WGAN), and

a discrete approximation of distribution updated through

quantile regression (Koenker 2005) (PPO+QR). Impor-

tantly, all of our baselines are distributional RL solutions

that maintain state-return distributions.

Figure 3: Learning curves on MuJoCo tasks with the mean (solid line) and standard deviation (shaded area) across 5 runs.
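As an illustration of the kind of estimator retained in the PPO+QR baseline, the sketch below computes a quantile regression (pinball) loss between a set of predicted quantile locations and samples of a target return. It is a simplified example under stated assumptions: quantile levels are taken at the midpoints (i + 0.5)/N, the loss is averaged over both quantiles and samples, and the plain pinball loss is used rather than any smoothed variant. The function name is hypothetical.

```python
def quantile_regression_loss(quantiles, target_samples):
    """Pinball loss of N quantile estimates against target return samples.

    quantiles[i] estimates the return quantile at level tau_i = (i + 0.5) / N.
    Assumption: plain (unsmoothed) pinball loss, averaged over all pairs.
    """
    n = len(quantiles)
    loss = 0.0
    for i, theta in enumerate(quantiles):
        tau = (i + 0.5) / n
        for z in target_samples:
            u = z - theta
            # Under-estimates are weighted by tau, over-estimates by 1 - tau.
            loss += u * (tau - (1.0 if u < 0.0 else 0.0))
    return loss / (n * len(target_samples))
```

The asymmetric weighting is what drives each estimate towards its own quantile level of the target distribution rather than towards the mean.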

Our BDPG is evaluated in two versions: 1) with the naive

add-noise-and-argmax (Mnih et al. 2016) exploration mech-

anism (BDPG naive), and 2) one that explores with the pro-

posed curiosity reward (BDPG). Naive exploration is also

used for the PPO+WGAN and PPO+QR baselines. Learn-

ing curves in Figure 1 suggest that with exploration mech-

anism ﬁxed, the proposed Bayesian approach BDPG naive

outperforms or is comparable to WGAN and QR in 6 out of

8 games. Moreover, BDPG is always better than or equal to BDPG

naive, vindicating our exploration scheme, and achieves

the highest score among all tested algorithms in all four of

the hard-exploration games tested.

We conduct ablation and parameter studies to investigate

the impact of the bootstrap length k, and of the truncation

cap ū on the IG u(s), on the Atari games Breakout and Q*bert.

In particular, k = 1 and ū → ∞ are examined as ablation

cases. Average scores of the batches starting at 10M and 150M

steps into training are shown in Figure 2. Each coloured

pixel corresponds to one combination of k and ū, and shows

the best outcome over the sweep of η values, selected by

average long-term performance. We found

that as training progresses, a short k becomes increasingly

detrimental: the model becomes more discriminative

about the environment, and the lack of learning signals, i.e.

fewer rewards to calculate the Bellman target with, becomes

increasingly limiting. Our experiments suggest that although

the best bootstrap length depends on the task, longer

bootstrapping generally produces better long-term performance.

However, a long bootstrap length does not work well with

a large ū. A possible explanation is that as k increases, the

variance in the Bellman targets compounds. In this scenario,

the agent may encounter states with which it is very unfamiliar.

The value of u(s) can then grow unbounded, and the tendency

to explore gets out of hand if we do not curb it. Moreover,

such extreme values can also distort the interpretation of

subsequent curiosity values through the normalisation of u(s).

This phenomenon justiﬁes the application of our truncation

scheme, especially for larger k. In addition, we found that

choosing too large a k does not diminish performance drastically,

potentially because the return distribution already accounts

for some degree of reward uncertainty, which is a

helpful characteristic when prototyping agents.
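The interaction between extreme values of u(s) and normalisation can be illustrated with a small sketch. The normalisation used here (division by the largest value in the batch) is a hypothetical stand-in chosen only for illustration, since this section does not specify the scheme. The function name and the parameter `u_cap` (playing the role of ū) are likewise assumptions, and u(s) is assumed non-negative.

```python
def truncate_and_normalise(u_values, u_cap=None):
    """Clip information-gain values at u_cap, then rescale to [0, 1].

    Assumptions: u_values are non-negative; normalisation divides by the
    batch maximum (hypothetical scheme). u_cap=None means no truncation,
    i.e. the ablation case where the cap tends to infinity.
    """
    if u_cap is None:
        clipped = list(u_values)
    else:
        clipped = [min(u, u_cap) for u in u_values]
    peak = max(clipped)
    if peak <= 0.0:
        return [0.0 for _ in clipped]
    return [u / peak for u in clipped]


batch = [1.0, 2.0, 100.0]  # one unfamiliar state yields an extreme value
print(truncate_and_normalise(batch))             # [0.01, 0.02, 1.0]
print(truncate_and_normalise(batch, u_cap=4.0))  # [0.25, 0.5, 1.0]
```

Without the cap, a single outlier compresses the curiosity signal of every other state towards zero; with the cap, the relative differences among ordinary states are preserved.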

In the continuous control tasks with MuJoCo, we focus

on the ability of distributional RL algorithms to generalise,

and less on the challenge of exploration. Therefore, we com-

pare the performance of BDPG naive against the benchmark

distributional RL algorithm PPO+WGAN, a generative so-

lution that does not conduct inference. Both are stripped of

exploration incentives. The noticeable variance displayed

throughout training for both algorithms may be because

both involve adversarial training. As shown in

Figure 3, however, our model outperforms the benchmark in

all cases by distinct margins. We believe this is because

WGAN, unlike our model, does not take expectations across

an amortised inference space, which is what accounts for better

generalisation. Such inference proves highly beneﬁcial for

reasoning about return distributions in continuous control

tasks such as the MuJoCo environments, where robustness in

the face of unseen data matters more for behavioural stability.

Conclusion

We formulate the distributional Bellman operation as an

inference-based auto-encoding process. We demonstrated

contraction of the Bellman operator for state-return distribu-

tions, expanding on the distributional RL body of work that

has to date focused on state-action returns. Our tractable solu-

tion alternates between minimising Wasserstein metrics to

continuous distributions of the Bellman target and adversar-

ial training for joint-contrastive learning to estimate a varia-

tional posterior from bootstrapped targets. This allows us to

distill the beneﬁts of Bayesian inference into distributional

RL, where predictions of returns are based on expectation

and thus more accurate in the face of unseen data. As a sec-

ond innovation we use the availability of a variational pos-

terior to derive a curiosity-driven exploration mechanism,

which we show solves hard-exploration tasks more

efﬁciently. Either of our two contributions can be combined with

other building blocks to form new RL algorithms, e.g. in Ex-

plainable RL (Beyret, Shafti, and Faisal 2019). We believe

that our innovations link and expand the applicability and

efﬁciency of distributional RL methods.

Acknowledgments

We are grateful for our funding support: a Department of

Computing PhD Award to LL and a UKRI Turing AI Fel-

lowship (EP/V025449/1) to AAF.

References

Abdolmaleki, A.; et al. 2015. Model-Based Relative Entropy

Stochastic Search. In Adv. Neural Inform. Proc. Sys. (NIPS)

28, 3537–3545.

Akrour, R.; et al. 2016. Model-Free Trajectory Optimization

for Reinforcement Learning. In Proc. the 33rd Intl. Conf.

on Machine Learning (ICML), volume 48, 2961–2970. New

York, NY, USA.

Ambrogioni, L.; et al. 2018. Wasserstein Variational Infer-

ence. In Adv. Neural Inform. Proc. Sys. (NIPS) 31, 2473–

2482.

Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein

Generative Adversarial Networks. In Proc. the 34th Intl.

Conf. on Machine Learning (ICML), volume 70, 214–223.

Sydney, Australia.

Ball, P.; et al. 2020. Ready Policy One: World Building

Through Active Learning. In Proc. the 37th Intl. Conf. on

Machine Learning (ICML), volume 119, 591–601. Virtual.

Barth-Maron, G.; et al. 2018. Distributed Distributional De-

terministic Policy Gradients. In Proc. 6th Intl. Conf. on

Learning Representations (ICLR).

Bellemare, M.; et al. 2016. Unifying Count-Based Explo-

ration and Intrinsic Motivation. In Adv. Neural Inform. Proc.

Sys. (NIPS) 29, 1471–1479.

Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A

Distributional Perspective on Reinforcement Learning. In

Proc. the 34th Intl. Conf. on Machine Learning (ICML), vol-

ume 70, 449–458. Sydney, Australia.

Bellemare, M. G.; et al. 2013. The Arcade Learning En-

vironment: An Evaluation Platform for General Agents. In

Journal of Artiﬁcial Intelligence Research.

Bellman, R. 1957. Dynamic Programming. Princeton, NJ,

USA: Princeton University Press, 1st edition.

Beyret, B.; Shafti, A.; and Faisal, A. A. 2019. Dot-to-

Dot: Explainable Hierarchical Reinforcement Learning for

Robotic Manipulation. In 2019 IEEE/RSJ Intl. Conf. on In-

telligent Robots and Systems (IROS), 5014–5019. IEEE.

Blundell, C.; et al. 2015. Weight Uncertainty in Neural Net-

works. In Proc. the 32nd Intl. Conf. on Machine Learning

(ICML), volume 37, 1613–1622. Lille, France.

Bousquet, O.; et al. 2017. From Optimal Transport to Gener-

ative Modeling: the VEGAN cookbook. arXiv 1705.07642.

Chen, R. Y.; et al. 2017. UCB Exploration via Q-Ensembles.

arXiv 1706.01502.

Ciosek, K.; et al. 2019. Better Exploration with Optimistic

Actor Critic. In Adv. Neural Inform. Proc. Sys. (NIPS) 32,

1787–1798.

Cuturi, M. 2013. Sinkhorn Distances: Lightspeed Computa-

tion of Optimal Transport. In Adv. Neural Inform. Proc. Sys.

(NIPS) 26, 2292–2300.

Dabney, W.; et al. 2018a. Implicit Quantile Networks for

Distributional Reinforcement Learning. In Proc. the 35th

Intl. Conf. on Machine Learning (ICML), volume 80, 1096–

1105. Stockholm, Sweden.

Dabney, W.; et al. 2018b. Distributional Reinforcement

Learning With Quantile Regression. In Proc. the 32nd AAAI

Conf. on Artiﬁcial Intelligence.

Doan, T.; Mazoure, B.; and Lyle, C. 2018. GAN Q-learning.

arXiv 1805.04874.

Donahue, J.; Krähenbühl, P.; and Darrell, T. 2017. Adver-

sarial Feature Learning. In Proc. the 5th Intl. Conf. Learning

Representations (ICLR).

Dumoulin, V.; et al. 2017. Adversarially Learned Inference.

In Proc. the 5th Intl. Conf. Learning Repres. (ICLR).

Freirich, D.; et al. 2019. Distributional Multivariate Pol-

icy Evaluation and Exploration with the Bellman GAN. In

Proc. the 36th Intl. Conf. on Machine Learning (ICML), vol-

ume 97, 1983–1992. Long Beach, CA, USA.

Genevay, A.; et al. 2016. Stochastic Optimization for Large-

scale Optimal Transport. In Adv. Neural Inform. Proc. Sys.

(NIPS) 29, 3440–3448.

Goodfellow, I. J.; et al. 2014. Generative Adversarial Net-

works. In Adv. Neural Inform. Proc. Sys. (NIPS) 27, 2672–

2680.

Gruslys, A.; et al. 2018. The Reactor: A Sample-Efﬁcient

Actor-Critic Architecture. In Proc. the 6th Intl. Conf. Learn-

ing Representations (ICLR).

Gulrajani, I.; et al. 2017. Improved Training of Wasser-

stein GANs. In Adv. Neural Inform. Proc. Sys. (NIPS) 30,

5769–5779.

Haarnoja, T.; et al. 2018. Soft Actor-Critic: Off-Policy

Maximum Entropy Deep Reinforcement Learning with a

Stochastic Actor. In Proc. the 35th Intl. Conf. on Machine

Learning (ICML), volume 80, 1861–1870.

He, J.; et al. 2019. Lagging Inference Networks and Poste-

rior Collapse in Variational Autoencoders. In Proc. the 7th

Intl. Conf. Learning Representations (ICLR).

Hessel, M.; et al. 2018. Rainbow: Combining Improvements

in Deep Reinforcement Learning. In Proc. the 32nd AAAI

Conf. on Artiﬁcial Intelligence.

Hoffman, M. D.; and Johnson, M. J. 2016. ELBO surgery:

yet another way to carve up the variational evidence lower

bound. In Workshop of Approximate Bayesian Inference

in Neural Information Processing Systems 29.

Houthooft, R.; et al. 2016. VIME: Variational Information

Maximizing Exploration. In Adv. Neural Inform. Proc. Sys.

(NIPS) 29, 1109–1117.

Koenker, R. 2005. Quantile Regression. Cambridge Univer-

sity Press.

Kuznetsov, A.; et al. 2020. Controlling Overestimation Bias

with Truncated Mixture of Continuous Distributional Quan-

tile Critics. In Proc. 37th Intl. Conf. on Machine Learning

(ICML). Virtual.

Lattimore, T.; and Hutter, M. 2012. PAC Bounds for Dis-

counted MDPs. In Proc. the 23rd Intl. Conf. on Algorithmic

Learning Theory, 320–334.

Lillicrap, T. P.; et al. 2015. Continuous control with deep

reinforcement learning. In Proc. the 3rd Intl. Conf. Learning

Representations (ICLR).

Lim, S.; et al. 2018. Actor-Expert: A Framework for using

Q-learning in Continuous Action Spaces. arXiv 1810.09103.

Luise, G.; et al. 2018. Differential Properties of Sinkhorn

Approximation for Learning with Wasserstein Distance. In

Adv. Neural Inform. Proc. Sys. (NIPS) 31, 5859–5870.

Martin, J.; et al. 2020. Stochastically Dominant Distribu-

tional Reinforcement Learning. In Proc. the 37th Intl. Conf.

on Machine Learning (ICML). Virtual.

Mavrin, B.; et al. 2019. Distributional Reinforcement Learn-

ing for Efﬁcient Exploration. In Proc. the 36th Intl. Conf. on

Machine Learning (ICML), volume 97, 4424–4434. Long

Beach, CA, USA.

Mnih, V.; et al. 2016. Asynchronous Methods for Deep Re-

inforcement Learning. In Proc. the 33rd Intl. Conf. on Ma-

chine Learning (ICML), volume 48, 1928–1937. New York,

NY, USA.

Montavon, G.; Müller, K.-R.; and Cuturi, M. 2016. Wasser-

stein Training of Restricted Boltzmann Machines. In Adv.

Neural Inform. Proc. Sys. (NIPS) 29, 3718–3726.

Morimura, T.; et al. 2010. Parametric Return Density Esti-

mation for Reinforcement Learning. In Proc. the 26th Conf.

on Uncertainty in Artiﬁcial Intelligence.

Nachum, O.; Norouzi, M.; and Schuurmans, D. 2016. Im-

proving Policy Gradient by Exploring Under-appreciated

Rewards. In Proc. the 4th Intl. Conf. Learning Represen-

tations (ICLR).

O’Donoghue, B.; et al. 2018. The Uncertainty Bellman

Equation and Exploration. In Proc. the 35th Intl. Conf. on

Machine Learning (ICML), volume 80, 3839–3848. Stock-

holm, Sweden.

Oh, J.; et al. 2018. Self-Imitation Learning. In Proc. the

35th Intl. Conf. on Machine Learning (ICML), volume 80,

3878–3887. Stockholm, Sweden.

Osband, I.; et al. 2016. Deep Exploration via Bootstrapped

DQN. In Adv. Neural Inform. Proc. Sys. (NIPS) 29, 4026–

4034.

Osband, I.; et al. 2019. Deep Exploration via Randomized

Value Functions. Journal of Machine Learning Research 20:

1–62.

Ostrovski, G.; et al. 2017. Count-Based Exploration with

Neural Density Models. In Proc. the 34th Intl. Conf. on

Machine Learning (ICML), volume 70, 2721–2730. Sydney,

Australia.

Puterman, M. L. 1994. Markov Decision Processes: Dis-

crete Stochastic Dynamic Programming. John Wiley &

Sons, Inc.

Rosca, M.; Lakshminarayanan, B.; and Mohamed, S. 2018.

Distribution Matching in Variational Inference. arXiv

1802.06847.

Rowland, M.; et al. 2018. An Analysis of Categorical

Distributional Reinforcement Learning. In Proc. 21st Intl.

Conf. on Artiﬁcial Intelligence and Statistics (AISTATS),

volume 84. Lanzarote, Spain.

Rowland, M.; et al. 2019. Statistics and Samples in Distribu-

tional Reinforcement Learning. In Proc. the 36th Intl. Conf.

on Machine Learning (ICML), volume 97, 5528–5536. Long

Beach, CA, USA.

Schaul, T.; et al. 2016. Prioritized Experience Replay. In

Proc. the 4th Intl. Conf. Learning Representations (ICLR).

Schmidhuber, J. 1991. Curious model-building control sys-

tems. In Proc. 1991 IEEE Intl. Joint Conf.

on Neural Networks, volume 2, 1458–1463.

Schulman, J.; et al. 2016. High-Dimensional Continuous

Control Using Generalized Advantage Estimation. In Proc.

the 4th Intl. Conf. Learning Representations (ICLR).

Schulman, J.; et al. 2017. Proximal Policy Optimization Al-

gorithms. arXiv 1707.06347.

Sekar, R.; et al. 2020. Planning to Explore via Self-

Supervised World Models. In Proc. the 37th Intl. Conf. on

Machine Learning (ICML), volume 119, 8583–8592. Vir-

tual.

Sutton, R. S.; et al. 1999. Policy Gradient Methods for Rein-

forcement Learning with Function Approximation. In Adv.

Neural Inform. Proc. Sys. (NIPS) 12, 1057–1063.

Tang, H.; et al. 2017. #Exploration: A Study of Count-Based

Exploration for Deep Reinforcement Learning. In Adv. Neu-

ral Inform. Proc. Sys. (NIPS) 30, 2753–2762.

Tang, Y.; and Agrawal, S. 2018. Exploration by Distribu-

tional Reinforcement Learning. In Proc. the 27th Interna-

tional Joint Conf. on Artiﬁcial Intelligence, 2710–2716.

Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A

physics engine for model-based control. In 2012 IEEE/RSJ

Intl. Conf. on Intelligent Robots and Systems, 5026–5033.

Tolstikhin, I.; et al. 2018. Wasserstein Auto-Encoders. In

Proc. the 6th Intl. Conf. Learning Representations (ICLR).

Villani, C. 2008. Optimal Transport: Old and New.

Grundlehren der mathematischen Wissenschaften. Springer

Berlin Heidelberg.

Wang, Z.; et al. 2016. Sample Efﬁcient Actor-Critic with

Experience Replay. In Intl. Conf. Learning Reps. (ICLR).

Zhao, S.; Song, J.; and Ermon, S. 2017. InfoVAE: In-

formation Maximizing Variational Autoencoders. arXiv

1706.02262.