Bag of Policies for Distributional Deep Exploration
Asen Nachkov (1), Luchen Li (1), Giulia Luise (1), Filippo Valdettaro (1), Aldo Faisal (1,2,3)
(1) Department of Computing, Imperial College London, London, UK
(2) Department of Bioengineering, Imperial College London, London, UK
(3) Chair of Digital Health & Data Science, University of Bayreuth, Bayreuth, Bavaria, Germany
Abstract
Efficient exploration in complex environments re-
mains a major challenge for reinforcement learning
(RL). Compared to previous Thompson sampling-
inspired mechanisms that enable temporally ex-
tended exploration, i.e., deep exploration, we focus
on deep exploration in distributional RL. We de-
velop here a general purpose approach, Bag of Poli-
cies (BoP), that can be built on top of any return
distribution estimator by maintaining a population
of its copies. BoP consists of an ensemble of mul-
tiple heads that are updated independently. During
training, each episode is controlled by only one of
the heads and the collected state-action pairs are
used to update all heads off-policy, leading to dis-
tinct learning signals for each head which diversify
learning and behaviour. To test whether an optimistic ensemble method can improve distributional RL as it has improved scalar RL, e.g. through Bootstrapped DQN, we implement the BoP approach with a population of distributional actor-critics using Bayesian Distributional Policy Gradients (BDPG). The population thus approximates a posterior distribution over return distributions along with a posterior distribution over policies. Another benefit of building upon BDPG is that it allows us to analyze global posterior uncertainty alongside a local curiosity bonus for exploration. As BDPG is already an optimistic method, this pairing lets us investigate whether optimism can accumulate in distributional RL.
Overall BoP results in greater robustness and speed
during learning as demonstrated by our experimen-
tal results on ALE Atari games.
1 INTRODUCTION
Distributional RL (DiRL) has rapidly established its place
among reinforcement learning (RL) algorithms Belle-
mare et al. [2017] as a powerful improvement over non-
distributional value-based counterparts Lyle et al. [2019]. In
DiRL, the agent does not learn a single summary statistic of
the return for each state-action pair, but instead learns the
whole return distribution. The agent's behaviour is thus evaluated against multiple possible outcomes, which in turn affect the policy update. While this does lead to more stable learning and better performance Lyle et al. [2019], it does not in itself change the way actions are selected; in distributional extensions of value-based RL such as C51 Bellemare et al. [2017] and QR-DQN Dabney et al. [2018b], the agent still takes actions according to the mean of the estimated return distribution in each state-action pair. Thus, estimating a return
distribution, at its core, provides performance advantages
from better representation and evaluation which are unre-
lated to the action-selection and the exploration behaviour
of the agent.
One of the major challenges in DiRL is to better leverage
an estimated return distribution for action selection Dabney
et al. [2018a]. In a discrete action setting, the agent could
take the action maximizing the mean plus a time-dependent
scaled standard deviation of the return Mavrin et al. [2019].
This approach, combined with computing the upper variance
as an embodiment of the “optimism in the face of uncer-
tainty”, has yielded considerable improvements in explo-
ration. A DiRL solution based on Wasserstein-GAN Freirich
et al. [2019] captures uncertainty in the gradient magnitude
of parameters. In the recent Bayesian Distributional Policy
Gradients (BDPG) Li and Faisal [2021], a curiosity bonus
in the form of an information gain is added to the reward
function, which motivates the agent to choose actions whose
outcome has high epistemic uncertainty in terms of the re-
turn distribution estimation. These strategies all bias the
data-collecting policy with a curiosity component computed
from the return distribution modelling, instead of using only
the mean returns as done in conventional RL.

(Paper accepted at the E-pi UAI Workshop 2023; arXiv:2308.01759v1 [cs.LG], 3 Aug 2023.)

Figure 1: Schematic and data flow in BoP. A convolutional feature extractor processes image frames (pixel space); each actor-critic head pairs a policy/actor with a generator/decoder/critic and is trained together with an encoder, a prior, and a discriminator. Data flows from pixel space (image frames) through latent space (prior return, encoded return) to return space (Bellman target, sample from the return model).
Meanwhile, among the most effective exploration approaches in scalar RL centred around the idea of optimism in the face of uncertainty are those based on posterior sampling, or Thompson sampling Thompson [1935], as we shall discuss in more detail in the Related Work section. These methods typically maintain an ensemble of randomized copies of the same function and rely on model diversity to select actions that are either fairly certainly optimal or still uncertain, thereby expanding the data space efficiently and gathering maximal information.
In this work, we combine the ensemble method with already optimistic DiRL to see whether the extra optimism from diversified model estimation is beneficial. To this end, we extend BDPG Li and Faisal [2021] to multiple distributional actor-critic models, achieving a combination of Thompson sampling and curiosity-bonus-induced optimism.
Crucially, we are now able to leverage the epistemic uncer-
tainty of the agent in a two-pronged manner: First, through
the Thompson sampling enabled by an ensemble of policies
and return distributions (heads); Second, by using a local
information gain inferred from the observed inaccuracies in
each head’s return distribution estimate.
Our contributions in this paper are as follows: we introduce
the Bag of Policies framework, analyze its properties and
the off-policy training aspects. We also explore variations of
our algorithm and test its performance on the Atari Arcade
Learning Environment Bellemare et al. [2013]. Our results suggest that Thompson sampling can indeed improve upon already optimistic DiRL. This finding is promising, as it indicates that optimism can accumulate in DiRL.
2 PRELIMINARIES
A reinforcement learning (RL) task is modelled as a Markov decision process (MDP) (S, A, R, P, γ) Puterman [1994], where S is the set of possible states of the environment, A is the set of actions the agent can take, R is a reward function, P is the transition probability density for transitioning to a new state given the agent's current state and their selected action, and γ is the discount factor. States and actions can be continuous in our framework. The agent also learns or controls a possibly probabilistic policy π : S → P(A), mapping from states to a distribution over actions.
In our implementation the policy π is parameterized and updated with policy gradients Sutton [1999], Mnih et al. [2016], Schulman et al. [2016], maximizing an explicit objective function of the agent's performance such as E_{s ∼ d^π, a ∼ π(·|s)} [A(s, a) log π(a|s)], where d^π is the stationary marginal state density induced by π and A(s, a) is the advantage function.
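As a minimal illustration of this surrogate objective (a sketch only, not the authors' implementation; the function name and tensor shapes are ours), the batch estimate of E[A(s, a) log π(a|s)] can be written as a loss to minimize:

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Negated surrogate objective E[A(s, a) log pi(a|s)] estimated over a batch.

    log_probs:  log pi(a_t|s_t) for the actions taken, shape (B,)
    advantages: advantage estimates A(s_t, a_t), shape (B,), treated as constants
    """
    # Detach the advantages so gradients flow only through log pi(a|s).
    return -(advantages.detach() * log_probs).mean()
```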
The return for a given state s ∈ S is defined as the sum of discounted future rewards that the agent collects starting from state s, G^π(s) := Σ_{t=0}^∞ γ^t r_t with s_0 = s. We consider state-dependent returns in this work, although the idea is equally applicable to action-dependent returns. The return is
a random variable with potential aleatoric uncertainty (e.g.
from stochastic state transitions). DiRL methods learn not
only a single statistic of the return in a given state, but a
representation of the whole distribution. Similar to the Bell-
man operator defined in conventional RL with respect to the
mean of the return, one can use the distributional Bellman
operator Bellemare et al. [2017].
T^π G^π(s) =_D R(s) + γ G^π(s'),    (1)
where R(s) is the random variable of the reward, s' ∼ P(·|s, a), and a ∼ π(·|s), to learn the state return distribution.
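For intuition, a sample-based application of this operator (a sketch under the state-dependent return convention above; the helper and its arguments are illustrative, not part of BDPG) simply shifts and discounts return samples drawn at the next state:

```python
import torch

def distributional_bellman_target(rewards, next_return_samples, gamma=0.99, dones=None):
    """Sample-based target x(s) = r + gamma * G(s') for a batch of transitions.

    rewards:             shape (B,), immediate rewards r_t
    next_return_samples: shape (B, N), N samples of G(s_{t+1}) drawn from the return model
    dones:               optional (B,) terminal mask; the bootstrap term is dropped at episode ends
    """
    not_done = 1.0 if dones is None else (1.0 - dones.float()).unsqueeze(-1)
    return rewards.unsqueeze(-1) + gamma * not_done * next_return_samples
```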
We test our idea of improving optimistic DiRL using (also
optimistic) deep exploration on the DiRL method BDPG Li
and Faisal [2021], as it provides an optimistic exploration
bonus. BDPG interprets the distributional Bellman operator
as a variational auto-encoding (VAE) process. Meanwhile,
this VAE provides an estimate of information gain between
samples from target and model return distributions, giving
rise to the quantification of the epistemic uncertainty in re-
turn distribution modelling around the current state under
the current policy. This information gain is then used to aug-
ment reward signals to encourage exploration. As an already (locally) optimistic DiRL method, BDPG is a good baseline to pair with Thompson sampling in our investigation of whether optimism can accumulate in DiRL.
We introduce the superscript i in the notation π^i to denote the i-th policy in an ensemble of K distributional actor-critic heads. Accordingly, A^i_t is the advantage of the i-th policy at timestep t. We adopt the BDPG notation and shorthand the model's output state return as g(s) := G^π(s), and its Bellman backup target as x(s) := T^π G^π(s); both live in the return space. BDPG matches g(s) to x(s) in the Wasserstein metric (under which the distributional Bellman operator is known to be a contraction Bellemare et al. [2017]) with a deterministic decoder p_θ(x|z, s) that maps a latent random variable z to a return sample g(s), a variational encoder q_ϕ(z|x, s) that approximates the posterior over z conditional on observed data x(s), and a discriminator D_ψ(x, z, s) that enforces joint-contrastive learning Donahue et al. [2017], now that the decoder is non-differentiable. For algorithmic details please refer to the original paper Li and Faisal [2021]. Dependency on s is omitted when no confusion arises. We adapt the per-head shorthand from the BDPG notation: x^i_t for the Bellman targets, g^i_t for the samples from the generators, and z^i_t for the latent variables of the generative process.
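To make this notation concrete, the sketch below lays out the three components of one head's return model as simple PyTorch modules (an illustrative skeleton with assumed layer sizes and names, not the BDPG reference implementation):

```python
import torch
import torch.nn as nn

class ReturnDecoder(nn.Module):
    """Deterministic decoder p_theta(x|z, s): maps a latent z and state features to a return sample g(s)."""
    def __init__(self, state_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar return sample
        )

    def forward(self, z, s):
        return self.net(torch.cat([z, s], dim=-1))


class ReturnEncoder(nn.Module):
    """Variational encoder q_phi(z|x, s): diagonal-Gaussian posterior over the latent."""
    def __init__(self, state_dim, latent_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.log_std = nn.Linear(hidden, latent_dim)

    def forward(self, x, s):
        h = self.body(torch.cat([x, s], dim=-1))
        return torch.distributions.Normal(self.mu(h), self.log_std(h).exp())


class Discriminator(nn.Module):
    """D_psi(x, z, s): tells encoder pairs (x, z~q) apart from decoder pairs (x~G, z~p)."""
    def __init__(self, state_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x, z, s):
        return self.net(torch.cat([x, z, s], dim=-1))
```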
3 RELATED WORK
Closest to our work are the multi-worker or multi-head approaches that leverage Thompson sampling Thompson [1935] to diversify behaviour. Note that, unlike parallel approaches such as A3C Mnih et al. [2016] and variants Liang et al. [2018], Zhang et al. [2019], in which the multiple workers are updated with the same gradient, Thompson sampling allows the workers to learn from bootstrapped signals Efron and Tibshirani [1994] and to focus on where the current understanding is insufficient, rather than merely covering more state space more quickly than a single worker.
One early example of Thompson sampling in deep RL is Bootstrapped DQN Osband et al. [2016], which utilizes multiple Q-value "heads". Each head is initialized randomly and differently and then trained on its own learning signal (slightly different data, or even identical data but distinct gradients). Thanks to this diversity, the heads' evaluations of a state-action pair tend to converge to the true value (which then allows the optimal action to be chosen) once the pair has been visited sufficiently often. That is, an action is selected either because it is optimal or because its evaluation is uncertain, which makes the selection optimistic. This uncertainty is the epistemic uncertainty, which diminishes as training data increases. Bootstrapped DQN and its variants Osband et al. [2019], Chen et al. [2017], O'Donoghue et al. [2018] can be thought of as representing a posterior distribution over the Q-function, and thus give a measure of epistemic uncertainty in the Q-value estimation. Similarly, Tang and Agrawal [2018] approximate Thompson sampling by drawing network parameters for an RL model. To further encourage optimism in the face of uncertainty, gradient diversification An et al. [2021] can effectively boost the performance of an ensemble method.
The ensemble method is also important because it enables deep exploration by engaging the same policy for the duration of a whole episode. However, existing ensemble-based exploration schemes are designed for conventional scalar RL, and the purpose of this work is to investigate whether deep exploration can improve performance in distributional RL and, further, whether the global optimism induced by the ensemble effect can enhance exploration efficiency when the agent is already optimistically biased on a per-state basis. Combining the ensemble technique with DiRL is not a trivial exercise and the benefit of diversity does not propagate automatically: we would now be learning a posterior distribution over distributions, and diversity over distributions is harder to maintain than diversity over scalars.
We base our approach on BDPG Li and Faisal [2021] which
is a DiRL approach made up of a single distributional actor-
critic and provides a local curiosity bonus favouring states
whose return distribution estimations suffer from high epis-
temic uncertainty. BoP in contrast maintains a posterior
distribution over return distributions and over policies, keep-
ing track of both aleatoric and epistemic uncertainties about
return distribution estimation simultaneously.
4 BAG OF POLICIES (BOP)
In the following we lay out the structure, theory and variants
of BoP. A schematic can be found in Fig. 1 and for the
pseudocode please refer to Algorithm 1.
Algorithm 1 Bag of Policies
1: Initialize prior p_θ(z|s), encoder q_ϕ(z|x, s), discriminator D_ψ(x, z, s), actor-critic population {G^i_θ(z, s), π^i}_{i=1}^K
2: while not converged do
3:   // Roll-out stage
4:   training batch D ← ∅
5:   sample k ∼ Uniform(1, ..., K)   // actor-critic to act
6:   for t = 0 to T − 1 do
7:     execute a^k_t ∼ π^k(·|s_t), get r_t, s_{t+1}
8:     sample return z_t ∼ p_θ(·|s_t), g^i_t ← G^i_θ(z_t, s_t), ∀i
9:     D ← D ∪ {s_t, a_t, r_t, g^i_t, log π^i(a_t|s_t), ∀i}
10:    end for
11:   sample last return z_T ∼ p_θ(·|s_T), g^i_T ← G^i_θ(z_T, s_T), ∀i
12:   // Data-driven estimation stage
13:   for t ∈ D, ∀i do
14:     estimate off-policy advantage A^i_t
15:     compute Bellman target x^i_t ← A^i_t + g^i_t
16:     compute mixed curiosity reward r̂^i_t ∝ KL( q_ϕ(·|x^k_t, s_t) || q_ϕ(·|g^i_t, s_t) )
17:     augment A^i_t by replacing r_t with r_t + r̂^i_t
18:   end for
19:   // Update stage
20:   // Train with minibatch B ⊂ D
21:   for t ∈ B do
22:     encode z̃^i_t ∼ q_ϕ(·|x^i_t, s_t), ∀i
23:     sample adversaries z^i_t ∼ p_θ(·|s_t), x̃^i_t ← G^i_θ̄(z^i_t, s_t), ∀i
24:     take averages z̄_t = (1/K) Σ_{i=1}^K z̃^i_t,  x̄_t = (1/K) Σ_{i=1}^K x̃^i_t
25:   end for
26:   update D_ψ by ascending (1/|B|) Σ_{t∈B} [ log D_ψ(x̄_t, z_t, s_t) + log(1 − D_ψ(x_t, z̄_t, s_t)) ]
27:   update encoder and prior by ascending (1/|B|) Σ_{t∈B} [ log(1 − D_ψ(x̄_t, z_t, s_t)) + log D_ψ(x_t, z̄_t, s_t) ]
28:   update G^i_θ, ∀i, by descending (1/|B|) Σ_{t∈B} ||x^i_t − G^i_θ(z̃^i_t, s_t)||²_2
29:   update π^i, ∀i, by ascending (1/|B|) Σ_{t∈B} log π^i(a_t|s_t) A^i_t
30: end while
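For readers who prefer code, the outer loop of Algorithm 1 can be sketched in Python as follows. This is a condensed, illustrative skeleton only: `env`, the per-head objects in `heads`, `prior`, `encoder`, `discriminator` and their methods are assumed interfaces introduced here, not part of the released implementation.

```python
import random

def bop_outer_loop(env, heads, prior, encoder, discriminator, num_iterations):
    """Skeleton of the BoP loop: roll-out, data-driven estimation, and update stages."""
    K = len(heads)
    for _ in range(num_iterations):
        # --- Roll-out stage: one uniformly sampled active head controls the whole episode ---
        k = random.randrange(K)
        batch = []
        s, done = env.reset(), False
        while not done:
            a = heads[k].act(s)                                  # behaviour policy = head k
            s_next, r, done = env.step(a)
            z = prior.sample(s)
            g = [h.return_sample(z, s) for h in heads]           # per-head return samples g^i_t
            batch.append((s, a, r, g))
            s = s_next

        # --- Data-driven estimation stage: per-head advantages, Bellman targets, curiosity ---
        advs = {i: h.offpolicy_advantage(batch, behaviour_head=k) for i, h in enumerate(heads)}
        targets = {i: [adv_t + g_t[i] for (_, _, _, g_t), adv_t in zip(batch, advs[i])]
                   for i in range(K)}
        for i, h in enumerate(heads):
            # Eq. (2): KL between encoder posteriors conditioned on the shared on-policy
            # target x^k_t and on head i's own prediction g^i_t.
            bonus = [encoder.information_gain(x_k, g_t[i], s_t)
                     for (s_t, _, _, g_t), x_k in zip(batch, targets[k])]
            h.cache(targets[i], advs[i], bonus)

        # --- Update stage: every head trains on the same data with its own targets ---
        for h in heads:
            h.update_critic()     # Wasserstein / joint-contrastive update against its own x^i_t
            h.update_policy()     # PPO step with curiosity-augmented advantages
        discriminator.update(batch)
```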
4.1 ARCHITECTURE
The Bag of Policies (BoP) framework can be applied to any
DiRL estimator. In this work we have chosen BDPG Li and
Faisal [2021] as the implementation framework because it
exploits local epistemic uncertainty. We use the same architecture as BDPG, except that we maintain multiple distributional actor-critic heads. Specifically, we refer to each distributional actor-critic pair (policy + return distribution) as a head. Each episode is rolled out by a single head sampled uniformly at random, and all heads are trained on the collected data for data efficiency. The head that generated the trajectory learns on-policy, whilst all other heads learn from the same batch of data off-policy. Throughout training, the heads are updated by different gradients because they were initialized differently and each learns from its own Bellman target, which entails random sampling from its own return distribution at subsequent steps. This ensures diversity among the policies, giving rise to the effect of Thompson sampling.
The training of BoP entails an outer loop of three stages: a roll-out stage, a data-driven estimation stage, and an update stage. Most of the algorithm is the same as that of BDPG, so we only highlight the different or extra aspects incurred by using multiple distributional actor-critic heads, as well as the care we took to maintain diversity. During the roll-out stage, at the start of each episode, the agent selects a head uniformly at random, which we call the active head k, and executes actions according to the policy of that head for the duration of the whole episode. Thus, the sequence of collected transitions (s_t, a_t, r_t) is determined entirely by the policy of the active head as the behaviour policy. Meanwhile, the other heads still compute the log-probability of the selected action, log π^i(a_t|s_t), and form an estimate of the return distribution for each state in the episode by generating a sample g^i_t from each distributional critic. These local evaluations provide each head with a unique sample estimate of the learning objective when updating its parameters.
During the data-driven estimation stage, the agent computes the advantages A^i_t and the Bellman targets x^i_t, both being unique to each respective head. These local estimates also engender a local curiosity bonus r̂^i_t that is proportional to the entropy reduction in the latent space when the conditioning return is a more accurate estimate (the on-policy Bellman target x^k_t) as opposed to a model prediction (a sample from the model, g^i_t):

r̂^i_t ∝ Information_gain(s_t, i) := KL( q_ϕ(·|x^k_t, s_t) || q_ϕ(·|g^i_t, s_t) ),    (2)

in which the state s_t is generated by the active head k.
This entropy reduction, or equivalently information gain, quantifies the epistemic uncertainty about how well each head is modelling the return distribution. The posterior x^k_t is shared across heads as it is the only Bellman backup target among all x^i_t that is estimated on-policy and thus suffers the least bias and variance. The curiosity bonus is then added to the external reward for computing the augmented advantage function used to update the local policies.

Figure 2: Comparison of BDPG and BoP on selected Atari environments.
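If the encoder posteriors are diagonal Gaussians, as in common VAE parameterisations, the information gain of Eq. (2) can be computed directly with a KL divergence. The sketch below is illustrative (the `encoder` callable and the scaling are assumptions, not the exact BDPG code):

```python
import torch
from torch.distributions import kl_divergence

def curiosity_bonus(encoder, x_k, g_i, s, scale=1.0):
    """Information gain of Eq. (2): KL( q(.|x^k_t, s_t) || q(.|g^i_t, s_t) ).

    encoder(x, s) is assumed to return a diagonal-Gaussian torch distribution q(z|x, s).
    x_k: on-policy Bellman target from the active head k (the more accurate condition)
    g_i: return sample predicted by head i (the model's own prediction)
    """
    with torch.no_grad():
        q_target = encoder(x_k, s)   # posterior conditioned on the Bellman target
        q_model = encoder(g_i, s)    # posterior conditioned on head i's prediction
        # Sum per-dimension KL terms of the diagonal Gaussians to get one bonus per state.
        return scale * kl_divergence(q_target, q_model).sum(dim=-1)
```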
During the update stage, all policies π^i are updated on the data generated by the active policy π^k. The active head is updated on-policy as in BDPG. For the other heads we use an off-policy advantage function estimated with V-trace Espeholt et al. [2018] at each timestep. Subsequently, each head's critic is updated by minimizing a Wasserstein distance between its current return distribution model and the distribution of a Bellman target estimated from its own predictions at subsequent timesteps. As in BDPG, this is achieved by minimizing the squared distance between the Bellman target x^i_t and the sample g^i_t while enforcing joint-contrastive learning adversarially.
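For the off-policy heads, the V-trace targets and advantages of Espeholt et al. [2018] can be sketched as follows for a single trajectory; this is an illustrative re-implementation (termination masking omitted), not the authors' code:

```python
import torch

@torch.no_grad()
def vtrace(behaviour_logp, target_logp, rewards, values, bootstrap_value,
           gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets and policy-gradient advantages for one trajectory of length T.

    behaviour_logp, target_logp, rewards, values: tensors of shape (T,)
    bootstrap_value: scalar tensor V(s_T) used to bootstrap the final step
    """
    ratios = torch.exp(target_logp - behaviour_logp)       # importance weights pi / mu
    rhos = torch.clamp(ratios, max=rho_bar)
    cs = torch.clamp(ratios, max=c_bar)
    values_tp1 = torch.cat([values[1:], bootstrap_value.view(1)])
    deltas = rhos * (rewards + gamma * values_tp1 - values)

    # Backward recursion: (vs_t - V_t) = delta_t + gamma * c_t * (vs_{t+1} - V_{t+1})
    vs_minus_v = torch.zeros_like(values)
    acc = torch.zeros(())
    for t in reversed(range(rewards.shape[0])):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    vs = vs_minus_v + values

    vs_tp1 = torch.cat([vs[1:], bootstrap_value.view(1)])
    advantages = rhos * (rewards + gamma * vs_tp1 - values)
    return vs, advantages
```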
We used a single common set of hyperparameters for implementing BoP on Atari, as shown in Table 1. The choices for the number of heads are listed in Table 2. The policy heads are updated with Proximal Policy Optimization (PPO) Schulman et al. [2017].
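For reference, the clipped PPO surrogate used to update each policy head can be sketched as below (illustrative; the clip range corresponds to the linearly annealed value in Table 1):

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.1):
    """Clipped PPO surrogate (Schulman et al., 2017), returned as a loss to minimize."""
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```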
4.2 DESIGN CONSIDERATIONS &
IMPLICATIONS
In this subsection we discuss the possible variants of BoP en-
abled by its ensemble characteristic, and which of them are
conducive to diversity. To understand these and the design
decisions leading to the final settings described above, we
first explain the need for diversification and the off-policy
side of our framework.
Note that the following considerations for diversification do not arise in ensemble methods applied to scalar RL, where behaviour and model diversity are much easier to uphold: a surprising backup target is effectively unseen by a scalar model but merely less likely under a distribution model. The same level of diversity therefore provides a weaker learning signal in the distributional case. This means that if the posterior over return distributions is not sufficiently diverse, which would happen if ensemble techniques and DiRL were combined naively, each return distribution model will only tweak the likelihood of the current backup target rather than fundamentally change its return prediction; after such light updates the ensemble members remain similar, forming a vicious cycle.
With the multiple heads maintained in BoP, each head i can form a return estimate g^i_t at the current timestep and produce a Bellman target x^i_t by sampling from itself at each of the future timesteps. Thus, there are choices about whether to share these targets across heads in two places: 1. when updating the critics, and 2. when computing the local curiosity bonus.
First we considered sharing the Bellman target x^k_t across critics, i.e., using the target of the current active head k to update all critics. We found that in practice sharing Bellman targets causes the diversity between policies to collapse, as measured by metrics such as the sum of absolute differences between the action probabilities assigned by any two policies and the cosine similarity between two policies. This resulted in poorer exploration outcomes than with individual Bellman targets. Therefore, in the final BoP framework, the on-policy Bellman target is not shared and each critic head is trained against its own target, which provides (more) unique bootstrapped signals to each head. Formally, we use the following variation of the distributional Bellman equation:

T^{π^k} G^{π^i}(s_t) =_D r_t + γ G^{π^i}(s_{t+1}),    (3)
with s_{t+1} ∼ P(s_{t+1}|s_t, a_t), a_t ∼ π^k(·|s_t).
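For concreteness, the two diversity measures mentioned above can be computed over the heads' action distributions at the same states as in this illustrative sketch (not the authors' code):

```python
import torch
import torch.nn.functional as F

def policy_diversity(action_probs):
    """Mean pairwise L1 distance and cosine similarity between per-head action distributions.

    action_probs: tensor of shape (K, B, num_actions), head-wise action probabilities
                  evaluated at the same batch of B states.
    """
    K = action_probs.shape[0]
    l1, cos = [], []
    for i in range(K):
        for j in range(i + 1, K):
            p, q = action_probs[i], action_probs[j]
            l1.append((p - q).abs().sum(dim=-1).mean())              # sum of absolute differences
            cos.append(F.cosine_similarity(p, q, dim=-1).mean())     # cosine similarity
    return torch.stack(l1).mean(), torch.stack(cos).mean()
```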
Since the states, actions and rewards are collected by the active head k, only this head is trained on-policy, while all other heads are trained off-policy. Thus, the final version of BoP mixes on-policy and off-policy learning, in contrast to other algorithms Bellemare et al. [2017], Mnih et al. [2016], Li and Faisal [2021], Osband et al. [2016], Dabney et al. [2018b], which are either one or the other.
Num. parallel envs: 16                       Evaluation frequency: every 2e4 timesteps
Num. stacked frames: 4                       Num. heads: variable
GAE λ: 0.95                                  Discount γ: 0.99
Num. episodes per roll-out: 4                Minibatch size: 256
Roll-out timesteps: 128                      Value loss weight: 0.5
Entropy coefficient: 0.01                    Learning rate: linearly 2.5e−4 → 0
PPO clip range: linearly 0.1 → 0             Curiosity: mixed
Actor-critics updated: all, on all the data  x^i_t computed by: critic i
Actions during roll-out: sampled             Actions during testing: greedy
Table 1: Hyperparameters for implementing the BoP model on Atari games.

Each distributional actor-critic member in BoP has a curiosity bonus r̂^i_t applied to its own policy. Having discussed the reason for sharing the posterior return estimate x^k_t in Eq. (2), namely its accuracy, we now consider the choices for the prior g_t. This prior can either be computed from the head itself, KL( q(·|x^k_t, s_t) || q(·|g^i_t, s_t) ) as in Eq. (2), or shared, KL( q(·|x^k_t, s_t) || q(·|g^k_t, s_t) ), where q(z|x, s) is the variational encoder for the return distribution. We use the local return samples g^i_t as the prior for computing the information gain. This is because a sample surprising to one return distribution model might not surprise another, and each model should keep track of its own accuracy in order to correctly bias local action selection, so that agreement among the models on policies also implicitly reflects agreement on the level of surprise.
Next, we consider the schedule for updating heads: we can
update all of them on every batch of freshly collected data,
or just some of them. It is indeed feasible to update only
one head - e.g. the active one or the most uncertain one, as
measured by the curiosity bonuses. By updating only the
most uncertain head at each iteration, the agent maximizes
the reduction in epistemic uncertainty while still keeping
the heads sufficiently diverse (by updating only one head
in each update iteration). However, we found in practice
that this diminishes sample efficiency due to limiting the
number of heads being trained with a given amount of data.
Similar empirical findings were made when training with
other setups that selectively update members like updating
a randomly sampled head according to its epistemic uncer-
tainty, or training each head on 50% of their most uncertain
samples. For that reason, in BoP we update all heads on all the data, which gives the fastest learning without sacrificing sufficient diversity among the heads, and therefore maximizes the ensemble benefit on balance.
During testing, on the other hand, the question arises as
to how action selection should be performed now that we
have multiple policies. BoP selects an action by averag-
ing the action distributions of all policy heads and then
chooses the greedy action from that distribution Wiering
and van Hasselt [2008]. Since all policies in our implemen-
tation are Gaussian, this is easy to realize. In practice we
observed more stable performance and higher scores when
using this “average-then-argmax” approach compared to
other options, such as each head picking its greedy action and then performing a majority vote over the most popular / similar action across heads ("argmax-then-vote"), or selecting actions from only one randomly chosen head ("sample-then-argmax"). The benefit of the "average-then-argmax" tactic over the others comes from relying on model agreement, which reflects what has actually been learned and understood, rather than on uncertainty-quenching exploration.

Figure 3: Comparison of BoP with one head (K = 1) vs. many heads (K = 3 or K = 5) on Breakout and Pong. The shading shows one standard error from the mean. Exponential smoothing was applied to the scores on Breakout. The numbers of heads shown for the multi-head cases are the minimal numbers that enable a significant improvement over the one-head baseline.
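A minimal sketch of the "average-then-argmax" rule, written here for categorical action probabilities as used on the discrete Atari actions (illustrative; for Gaussian policies the heads' densities would be averaged analogously):

```python
import torch

def average_then_argmax(action_probs_per_head):
    """Greedy test-time action from the average of the heads' action distributions.

    action_probs_per_head: tensor of shape (K, num_actions), one probability vector per head.
    """
    mean_probs = action_probs_per_head.mean(dim=0)   # average the K policy heads
    return int(torch.argmax(mean_probs))             # greedy action of the averaged distribution
```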
Finally, we considered how to choose the number of ensemble members. We found that a small number of 3 to 5 strikes a good balance between effective exploration and training speed. For calibration of this statement, we found that each additional head adds about 15% more FLOPS to the total amount of computation. The increase in performance from having many heads saturates as policies start to converge and thus become more and more correlated, which justifies using a small number of heads to increase performance and exploration efficiency without increasing training time too much.
Atari env     #   BoP       #   A3C       #   Boot
Freeway       3   34        16  0.1       10  34
Breakout      3   686       16  682       10  855
Hero          3   37,728    16  32,464    10  21,021
MsPacman      3   9,057     16  654       10  2,983
Qbert         3   20,583    16  15,149    10  15,092
Alien         5   5,470     16  518       10  2,436
Asterix       5   42,500    16  22,141    10  19,713
Frostbite     3   5,940     16  191       10  2,181
Amidar        5   1,123     16  264       10  1,272
BeamRider     3   6,499     16  22,708    10  23,429
StarGunner    3   56,200    16  138,218   10  55,725
Seaquest      3   2,027     16  2,355     10  9,083
Table 2: Testing results achieved across 3 runs. The numbers of heads ("#") for BoP were set by us; those for A3C and Bootstrapped DQN ("Boot") were set in their original works Mnih et al. [2016], Osband et al. [2016].
5 EXPERIMENTAL RESULTS
We tested the performance of BoP on a diverse set of Atari
environments, and compared to three baselines with which
BoP shares similarity in different aspects: to Bootstrapped
DQN Osband et al. [2016] as an ensemble-based exploration
in scalar RL, to BDPG Li and Faisal [2021] as a locally op-
timistic DiRL exploration approach and the algorithm we
built upon, and to A3C Mnih et al. [2016] which facilitates
covering extensive data space mainly by deploying multi-
ple workers in parallel. The set of Atari environments was chosen to represent an exemplar mix of hard-exploration environments, mazes, and shooter games. Pixel-based ob-
servations were pre-processed by standard Atari wrappers
as laid out in Dabney et al. [2018b], including cropping the
image frame, using only grayscaled frames, frame stacking,
and taking the maximum values of any two consecutive
frames.
First of all, we compare the performance of BDPG and BoP on Freeway, Hero and MsPacman, with learning curves shown in Fig. 2. Compared to BDPG, BoP improves significantly on sample efficiency, albeit not necessarily on asymptotic performance. We note that the only difference between BDPG and BoP is that BoP is a bootstrapped ensemble of BDPG, so the improvement in learning speed is in principle attributable to the ensemble effect. At this stage, however, it cannot be determined whether it is the deep exploration enabled by Thompson sampling or the sheer multitude of workers that accounts for the advantage of the ensemble technique.
To this end, we move on to investigate the impact of the number of ensemble members, i.e. heads, on a multi-worker agent's performance. In Table 2, we compare BoP against the multi-worker / multi-head methods Bootstrapped DQN and A3C. Although BoP does not beat the baselines on all selected environments, it outperforms the others most frequently. Crucially, the numbers of heads for A3C (K = 16) and Bootstrapped DQN (K = 10), as proposed in their original publications, are substantially higher than those needed for BoP (K = 3, or K = 5 in two cases). Moreover, in all games where deep exploration is essential (such as MsPacman and Frostbite), BoP substantially outperforms the baselines, by 50%–200%. These promising results suggest that increasing the multitude of workers alone is not the key to better performance. Combined with the observations made in the comparison to BDPG, one can establish that the strengths of BoP are attributable to deep exploration. Meanwhile, we also observed that BoP is not as competitive in games that feature moving targets but do not demand exploration, such as BeamRider and Seaquest. We surmise that this is because BoP became overly explorative in response to the moving targets, whose motion, unlike in Freeway, could not be handled simply by acting boldly.

Figure 4: Testing scores for BoP on Atari. We test the agent periodically while training. Shaded areas indicate variation in performance across multiple runs.
Another experiment we conducted to investigate the ensemble effect was to compare a one-head BoP against a few-head BoP (Fig. 3). We noted a substantial improvement when the number of heads increased from one to even a very small value (e.g. 3), which marks the qualitative transformation from a flat algorithm into an ensemble one. This finding reinforces the notion that the benefit of BoP is not due merely to more workers, but to the fundamentally distinct way in which an ensemble method works: by posterior sampling, and therefore by favouring uncertain areas for exploration.
This performance improvement outweighs the increase in compute time for a few-head BoP. This is probably due to the better exploration capability of each head: operating with an independently updated policy and learning from its own estimation errors, an action is selected either because the heads agree it is fairly certainly optimal or because it is uncertain. However, when the number of ensemble members becomes too large, BoP loses this advantage. As mentioned earlier, we observed that on Atari each additional head adds about 15% more FLOPS over the full duration of training compared to having only one head. This linear relationship also holds with 10 or 15 heads: the marginal cost in terms of FLOPS is constant. The marginal benefit of adding another head, however, decreases quickly with the number of heads, which is why we use fewer heads than other multi-worker algorithms. In conclusion, BoP is most advantageous when using a handful of members.
The learning curves for all Atari environments on which BoP was tested can be found in Fig. 4. For instance, Frostbite is a game that requires very specific sequences of actions, similar to maze games like MsPacman or QBert that require deep exploration, as decisions taken early on have long-term impacts.
6 DISCUSSION & FUTURE WORK
The Bag of Policies (BoP) algorithm presented here is a
multi-head distributional estimator applicable to a large
class of algorithms and thus extends deep exploration to
distributional RL settings. Our most salient finding is that
the benefits of the optimism in the face of uncertainty can
accumulate for DiRL exploration, as we were able to sub-
stantially improve upon a DiRL approach that is optimistic
on a per-state basis simply by building an ensemble of it and performing Thompson sampling. This suggests that, at least in DiRL, Thompson sampling and a curiosity bonus, although
both facilitate optimism, can work in conjunction to enhance
performance further.
In addition, as our experimental results showed, the advantage of BoP is not attributable to the mere multitude of workers: BoP surpasses other multi-worker / multi-head algorithms with far fewer heads, and a few-head BoP boosts performance considerably over its one-head counterpart. These experiments substantiate that the ensemble technique empowering BoP functions in a fundamentally different way from naively summing the efforts of parallel workers, namely through Thompson sampling, which enables optimistic and deep exploration at the same time.
On average, BoP achieves better learning speed and asymp-
totic performance than baselines. We have observed the
biggest improvement over the baselines in maze-like games
like
MsPacman
or
QBert
where the agent has to choose
a path in a labyrinth while collecting various items scat-
tered throughout it and avoiding the enemies many moves
down the line. These kinds of environments require deep
exploration and the agent can experience vastly different
outcomes depending on which path it takes. Hence, these
environments provide a good example where the BoP explo-
ration capabilities can improve the agent’s performance.
In a nutshell, BoP has demonstrated not only that deep exploration is viable in DiRL, offering extensions to any DiRL setting (e.g. Bellemare et al. [2017], Dabney et al. [2018b,a], Freirich et al. [2019], Doan et al. [2018], Martin et al. [2020], Barth-Maron et al. [2018], Singh et al. [2020], Kuznetsov et al. [2020], Choi et al. [2019]), but also that deep exploration can further improve learning on top of an already optimistic exploration strategy.
Acknowledgments
We are grateful for our funding support. At the time of this
work, GL and AF were sponsored by a UKRI Turing AI Fellow-
ship (EP/V025449/1), LL and FV by the PhD sponsorship
of the Department of Computing, Imperial College London.
References
G. An, S. Moon, J. Kim, and H. O. Song. Uncertainty-
based offline reinforcement learning with diversified q-
ensemble. In Advances in Neural Information Processing
Systems, volume 34, pages 7436–7447, 2021.
G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney,
D. Horgan, D. TB, A. Muldal, N. Heess, and T. Lillicrap.
Distributed distributional deterministic policy gradients.
In Proceedings of the 6th International Conference on
Learning Representations (ICLR), 2018.
M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling.
The Arcade Learning Environment: An evaluation plat-
form for general agents. Journal of Artificial Intelligence
Research, 47:253–279, 2013.
M. G. Bellemare, W. Dabney, and R. Munos. A distribu-
tional perspective on reinforcement learning. Proceedings
of the 34th International Conference on Machine Learn-
ing, 70:449–458, 2017.
R. Y. Chen, S. Sidor, P. Abbeel, and J. Schulman. UCB
exploration via q-ensembles, 2017.
Y. Choi, K. Lee, and S. Oh. Distributional deep reinforce-
ment learning with a mixture of Gaussians. In 2019
International Conference on Robotics and Automation
(ICRA), pages 9791–9797, 2019.
W. Dabney, G. Ostrovski, D. Silver, and R. Munos. Implicit
quantile networks for distributional reinforcement learn-
ing. Proceedings of the 35th International Conference on
Machine Learning, 80:1096–1105, 2018a.
W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos.
Distributional reinforcement learning with quantile re-
gression. Proceedings of the AAAI Conference on Artifi-
cial Intelligence, 2018b.
T. Doan, B. Mazoure, and C. Lyle. GAN q-learning, 2018.
J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial
feature learning. Proceedings of the 5th International
Conference on Learning Representations (ICLR), 2017.
B. Efron and R. J. Tibshirani. An introduction to the boot-
strap. CRC press, 1994.
L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1407–1416, Stockholmsmässan, Stockholm, Sweden, 2018.
D. Freirich, T. Shimkin, R. Meir, and A. Tamar. Distribu-
tional multivariate policy evaluation and exploration with
the bellman gan. In Proc. the 36th Intl. Conf. on Machine
Learning (ICML), volume 97, pages 1983–1992, Long
Beach, CA, USA, 2019.
A. Kuznetsov, P. Shvechikov, A. Grishin, and D. Vetrov.
Controlling overestimation bias with truncated mixture
of continuous distributional quantile critics. In Proceed-
ings of the 37th International Conference on Machine
Learning, 2020.
L. Li and A. Faisal. Bayesian distributional policy gradi-
ents. Proceedings of the AAAI Conference on Artificial
Intelligence, 35(1):8429–8437, 2021.
J. Liang, V. Makoviychuk, A. Handa, N. Chentanez, M. Macklin, and D. Fox. GPU-accelerated robotic simulation for distributed reinforcement learning. In Conference on Robot Learning, pages 270–282. PMLR, 2018.
C. Lyle, M. G. Bellemare, and P. S. Castro. A compara-
tive analysis of expected and distributional reinforcement
learning. Proceedings of the AAAI Conference on Artifi-
cial Intelligence, 33:4504–4511, 2019.
J. Martin, M. Lyskawinski, X. Li, and B. Englot. Stochas-
tically dominant distributional reinforcement learning.
In Proceedings of the 37th International Conference on
Machine Learning, 2020.
B. Mavrin et al. Distributional reinforcement learning for efficient exploration. Proceedings of the 36th International Conference on Machine Learning, 97:4424–4434, 2019.
V. Mnih et al. Asynchronous methods for deep reinforcement learning. Proceedings of The 33rd International Conference on Machine Learning, 48:1928–1937, 2016.
B. O’Donoghue, I. Osband, R. Munos, and V. Mnih. The un-
certainty Bellman equation and exploration. In Proceed-
ings of the 35th International Conference on Machine
Learning, volume 80, pages 3839–3848, Stockholmsmäs-
san, Stockholm Sweden, 10–15 Jul 2018.
I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. Advances in Neural Information Processing Systems, 29:4026–4034, 2016.
I. Osband, B. Van Roy, D. J. Russo, and Z. Wen. Deep
exploration via randomized value functions. Journal of
Machine Learning Research, 20:1–62, 2019.
M. L. Puterman. Markov Decision Processes: Discrete
stochastic dynamic programming. John Wiley & Sons,
1994.
J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel.
High-dimensional continuous control using generalized
advantage estimation. Proceedings of the 4th Interna-
tional Conference on Learning Representations (ICLR),
2016.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and
O. Klimov. Proximal policy optimization algorithms.
arXiv preprint arXiv:1707.06347, 2017.
R. Singh, K. Lee, and Y. Chen. Sample-based distributional
policy gradient, 2020.
R. S. Sutton. Policy gradient methods for reinforcement
learning with function approximation. Advances in
Neural Information Processing Systems, 12:1057–1063,
1999.
Y. Tang and S. Agrawal. Exploration by distributional rein-
forcement learning. Proceedings of the 27th International
Joint Conference on Artificial Intelligence, pages 2710–
2716, 2018.
W. R. Thompson. On the theory of apportionment. Ameri-
can Journal of Mathematics, 57(2):450–456, 1935.
M. A. Wiering and H. P. van Hasselt. Ensemble algorithms
in reinforcement learning. IEEE Transactions on Systems,
Man, and Cybernetics, Part B, 38(4):930–936, 2008.
Z. Zhang, J. Chen, Z. Chen, and W. Li. Asynchronous episodic deep deterministic policy gradient: Toward continuous control in computationally complex environments. IEEE Transactions on Cybernetics, 2019.