On-Policy Deep Reinforcement Learning for the Average-Reward Criterion
Yiming Zhang$^{1}$  Keith W. Ross$^{2\,1}$
Abstract
We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL). We first consider bounding the difference of the long-term average reward for two policies. We show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) results in a non-meaningful bound in the average-reward setting. By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the two policies and Kemeny's constant. Based on this bound, we develop an iterative procedure which produces a sequence of monotonically improved policies for the average reward criterion. This iterative procedure can then be combined with classic DRL (Deep Reinforcement Learning) methods, resulting in practical DRL algorithms that target the long-run average reward criterion. In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.
1. Introduction
The goal of Reinforcement Learning (RL) is to build agents
that can learn high-performing behaviors through trial-and-
error interactions with the environment. Broadly speak-
ing, modern RL tackles two kinds of problems: episodic
tasks and continuing tasks. In episodic tasks, the agent-
environment interaction can be broken into separate distinct
episodes, and the performance of the agent is simply the
sum of the rewards accrued within an episode. Examples
of episodic tasks include training an agent to learn to play
Go (Silver et al.,2016;2018), where the episode terminates
when the game ends. In continuing tasks, such as robotic
locomotion (Peters & Schaal, 2008; Schulman et al., 2015; Haarnoja et al., 2018) or in a queuing scenario (Tadepalli & Ok, 1994; Sutton & Barto, 2018), there is no natural separation of episodes and the agent-environment interaction continues indefinitely. The performance of an agent in a continuing task is more difficult to quantify since the total sum of rewards is typically infinite.

$^1$New York University  $^2$New York University Shanghai. Correspondence to: Yiming Zhang <yiming.zhang@cs.nyu.edu>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).
One way of making the long-term reward objective mean-
ingful for continuing tasks is to apply discounting so that
the infinite-horizon return is guaranteed to be finite for any
bounded reward function. However the discounted objec-
tive biases the optimal policy to choose actions that lead to
high near-term performance rather than to high long-term
performance. Such an objective is not appropriate when the
goal is to optimize long-term behavior, i.e., when the natural
objective underlying the task at hand is non-discounted. In
particular, we note that for the vast majority of benchmarks
for reinforcement learning such as Atari games (Mnih et al.,
2013) and MuJoCo (Todorov et al.,2012), a non-discounted
performance measure is used to evaluate the trained policies.
Although in many circumstances, non-discounted criteria
are more natural, most of the successful DRL algorithms
today have been designed to optimize a discounted crite-
rion during training. One possible work-around for this
mismatch is to simply train with a discount factor that is
very close to one. Indeed, from the Blackwell optimality
theory of MDPs (Blackwell,1962), we know that if the dis-
count factor is very close to one, then an optimal policy for
the infinite-horizon discounted criterion is also optimal for
the long-run average-reward criterion. However, although
Blackwell’s result suggests we can simply use a large dis-
count factor to optimize non-discounted criteria, problems
with large discount factors are in general more difficult to
solve (Petrik & Scherrer,2008;Jiang et al.,2015;2016;
Lehnert et al.,2018). Researchers have also observed that
state-of-the-art DRL algorithms typically break down when
the discount factor gets too close to one (Schulman et al.,
2016;Andrychowicz et al.,2020).
In this paper we seek to develop algorithms for finding
high-performing policies for average-reward DRL problems.
Instead of trying to simply use standard discounted DRL
algorithms with large discount factors, we instead attack the
problem head-on, seeking to directly optimize the average-
reward criterion. While the average reward setting has been
extensively studied in the classical Markov Decision Pro-
cess literature (Howard,1960;Blackwell,1962;Veinott,
1966;Bertsekas et al.,1995), and has to some extent been
studied for tabular RL (Schwartz,1993;Mahadevan,1996;
Abounadi et al.,2001;Wan et al.,2020), it has received
relatively little attention in the DRL community. In this
paper, our focus is on developing average-reward on-policy
DRL algorithms.
One major source of difficulty with modern on-policy DRL
algorithms lies in controlling the step-size for policy updates.
In order to have better control over step-sizes, Schulman
et al. (2015) constructed a lower bound on the difference
between the expected discounted return for two arbitrary
policies $\pi$ and $\pi'$ by building upon the work of Kakade & Langford (2002). The bound is a function of the divergence between these two policies and the discount factor.
Schulman et al. (2015) showed that iteratively maximizing
this lower bound generates a sequence of monotonically
improved policies for their discounted return.
In this paper, we first show that the policy improvement theo-
rem from Schulman et al. (2015) results in a non-meaningful
bound in the average reward case. We then derive a novel
result which lower bounds the difference of the average long-
run rewards. The bound depends on the average divergence
between the policies and on the so-called Kemeny con-
stant, which measures to what degree the irreducible Markov
chains associated with the policies are “well-mixed”. We
show that iteratively maximizing this lower bound guaran-
tees monotonic average reward policy improvement.
Similar to the discounted case, the problem of maximizing
the lower bound can be approximated with DRL algorithms
which can be optimized using samples collected in the en-
vironment. In particular, we describe in detail the Average
Reward TRPO (ATRPO) algorithm, which is the average re-
ward variant of the TRPO algorithm (Schulman et al.,2015).
Using the MuJoCo simulated robotic benchmark, we carry
out extensive experiments demonstrating the effectiveness
of ATRPO compared to its discounted counterpart, in
particular on the most challenging MuJoCo tasks. Notably,
we show that ATRPO can significantly out-perform TRPO
on a set of high-dimensional continuing control tasks.
Our main contributions can be summarized as follows:
• We extend the policy improvement bound from Schulman et al. (2015) and Achiam et al. (2017) to the average reward setting. We demonstrate that our new bound depends on the average divergence between the two policies and on the mixing time of the underlying Markov chain.

• We use the aforementioned policy improvement bound to derive novel on-policy deep reinforcement learning algorithms for optimizing the average reward.

• Most modern DRL algorithms introduce a discount factor during training even when the natural objective of interest is undiscounted. This leads to a discrepancy between the evaluation and training objective. We demonstrate that optimizing the average reward directly can effectively address this mismatch and lead to much stronger performance.
2. Preliminaries
Consider a Markov Decision Process (MDP) (Sutton & Barto, 2018) $(\mathcal{S}, \mathcal{A}, P, r, \mu)$ where the state space $\mathcal{S}$ and action space $\mathcal{A}$ are assumed to be finite. The transition probability is denoted by $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, the bounded reward function by $r : \mathcal{S} \times \mathcal{A} \to [r_{\min}, r_{\max}]$, and $\mu : \mathcal{S} \to [0, 1]$ is the initial state distribution. Let $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ be a stationary policy, where $\Delta(\mathcal{A})$ is the probability simplex over $\mathcal{A}$, and let $\Pi$ be the set of all stationary policies.
We consider two classes of MDPs:

Assumption 1 (Ergodic). For every stationary policy, the induced Markov chain is irreducible and aperiodic.

Assumption 2 (Aperiodic Unichain). For every stationary policy, the induced Markov chain contains a single aperiodic recurrent class and a finite but possibly empty set of transient states.

By definition, any MDP which satisfies Assumption 1 is also unichain. We note that most MDPs of practical interest belong to these two classes. We will mostly focus on MDPs which satisfy Assumption 1 in the main text. In the supplementary material, we will address the aperiodic unichain case. Here we present the two objective formulations for continuing control tasks: the average reward criterion and the discounted reward criterion.
Average Reward Criterion
The average reward objective is defined as:
$$\rho(\pi) := \lim_{N\to\infty} \frac{1}{N}\, \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{N-1} r(s_t, a_t)\right] = \mathbb{E}_{\substack{s\sim d_\pi \\ a\sim\pi}}\left[r(s, a)\right]. \tag{1}$$
Here $d_\pi(s) := \lim_{N\to\infty} \frac{1}{N}\sum_{t=0}^{N-1} P_{\tau\sim\pi}(s_t = s)$ is the stationary state distribution under policy $\pi$, and $\tau = (s_0, a_0, \ldots)$ is a sample trajectory. The limits in $\rho(\pi)$ and $d_\pi(s)$ are guaranteed to exist under our assumptions. Since the MDP is aperiodic, it can also be shown that $d_\pi(s) = \lim_{t\to\infty} P_{\tau\sim\pi}(s_t = s)$. In the unichain case, the average reward $\rho(\pi)$ does not depend on the initial state for any policy $\pi$ (Bertsekas et al., 1995). We express the average-reward bias function as
$$\bar{V}^\pi(s) := \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty} \big(r(s_t, a_t) - \rho(\pi)\big) \,\Big|\, s_0 = s\right]$$
and the average-reward action-bias function as
$$\bar{Q}^\pi(s, a) := \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty} \big(r(s_t, a_t) - \rho(\pi)\big) \,\Big|\, s_0 = s,\ a_0 = a\right].$$
We define the average-reward advantage function as
$$\bar{A}^\pi(s, a) := \bar{Q}^\pi(s, a) - \bar{V}^\pi(s).$$
Discounted Reward Criterion
For some discount factor $\gamma \in (0, 1)$, the discounted reward objective is defined as
$$\rho_\gamma(\pi) := \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right] = \frac{1}{1-\gamma}\, \mathbb{E}_{\substack{s\sim d_{\pi,\gamma} \\ a\sim\pi}}\left[r(s, a)\right] \tag{2}$$
where $d_{\pi,\gamma}(s) := (1-\gamma)\sum_{t=0}^{\infty} \gamma^t P_{\tau\sim\pi}(s_t = s)$ is known as the future discounted state visitation distribution under policy $\pi$. Note that unlike the average reward objective, the discounted objective depends on the initial state distribution $\mu$. It can be easily shown that $d_{\pi,\gamma}(s) \to d_\pi(s)$ for all $s$ as $\gamma \to 1$. The discounted value function is defined as $V^\pi_\gamma(s) := \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s\right]$ and the discounted action-value function as $Q^\pi_\gamma(s, a) := \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right]$. Finally, the discounted advantage function is defined as $A^\pi_\gamma(s, a) := Q^\pi_\gamma(s, a) - V^\pi_\gamma(s)$.
It is well-known that $\lim_{\gamma\to 1}(1-\gamma)\rho_\gamma(\pi) = \rho(\pi)$, implying that the discounted and average reward objectives are equivalent in the limit as $\gamma$ approaches 1 (Blackwell, 1962). We further discuss the relationship between the discounted and average reward criteria in Appendix A and prove that $\lim_{\gamma\to 1} A^\pi_\gamma(s, a) = \bar{A}^\pi(s, a)$ (see Corollary A.1). The proofs of all results in the subsequent sections, if not given, can be found in the supplementary material.
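Before moving on, the following is a minimal numerical sketch (not code from the paper) illustrating the two limits just stated, $d_{\pi,\gamma} \to d_\pi$ and $(1-\gamma)\rho_\gamma(\pi) \to \rho(\pi)$, on a small hypothetical chain; the transition matrix, rewards, and initial distribution below are arbitrary placeholders.

```python
import numpy as np

# Hypothetical 3-state chain induced by some fixed policy pi.
P = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4],
              [0.5, 0.2, 0.3]])          # P_pi(s'|s)
r = np.array([1.0, 0.0, 2.0])            # expected reward under pi in each state
mu = np.array([1.0, 0.0, 0.0])           # initial state distribution
n = len(r)

# Stationary distribution d_pi: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); d /= d.sum()
rho = d @ r                              # Eq. (1): rho(pi) = E_{s~d_pi}[r(s)]

for gamma in [0.9, 0.99, 0.999]:
    # d_{pi,gamma} = (1 - gamma) * mu^T (I - gamma P)^{-1}, per its definition.
    d_gamma = (1 - gamma) * mu @ np.linalg.inv(np.eye(n) - gamma * P)
    rho_gamma = d_gamma @ r / (1 - gamma)   # Eq. (2)
    print(gamma, np.abs(d_gamma - d).max(), (1 - gamma) * rho_gamma - rho)
```

As $\gamma$ increases toward 1, both printed gaps shrink toward zero, matching the equivalence discussed above.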
3. Monotonic Improvement Guarantees for Discounted RL
In much of the on-policy DRL literature (Schulman et al., 2015; 2017; Wu et al., 2017; Vuong et al., 2019; Song et al., 2020), algorithms iteratively update policies by maximizing them within a local region, i.e., at iteration $k$ we find a policy $\pi_{k+1}$ by maximizing $\rho_\gamma(\pi)$ within some region $D(\pi, \pi_k) \le \delta$ for some divergence measure $D$. By using different choices of $D$ and $\delta$, this approach allows us to control the step-size of each update, which can lead to better sample efficiency (Peters & Schaal, 2008). Schulman et al. (2015) derived a policy improvement bound based on a specific choice of $D$:
$$\rho_\gamma(\pi_{k+1}) - \rho_\gamma(\pi_k) \ge \frac{1}{1-\gamma}\,\mathbb{E}_{\substack{s\sim d_{\pi_k} \\ a\sim\pi_{k+1}}}\left[A^{\pi_k}_\gamma(s, a)\right] - C\cdot\max_s\left[D_{TV}(\pi_{k+1} \| \pi_k)[s]\right] \tag{3}$$
where $D_{TV}(\pi' \| \pi)[s] := \frac{1}{2}\sum_a |\pi'(a|s) - \pi(a|s)|$ is the total variation divergence, and $C = 4\epsilon\gamma/(1-\gamma)^2$ where $\epsilon$ is some constant. Schulman et al. (2015) showed that by choosing $\pi_{k+1}$ which maximizes the right hand side of (3), we are guaranteed to have $\rho_\gamma(\pi_{k+1}) \ge \rho_\gamma(\pi_k)$. This provided the theoretical foundation for an entire class of on-policy DRL algorithms (Schulman et al., 2015; 2017; Wu et al., 2017; Vuong et al., 2019; Song et al., 2020).
A natural question that arises here is whether the iterative procedure described by Schulman et al. (2015) also guarantees improvement for the average reward. Since the discounted and average reward objectives become equivalent as $\gamma \to 1$, one may conjecture that we can also lower bound the policy performance difference of the average reward objective by simply letting $\gamma \to 1$ in the bounds of Schulman et al. (2015). Unfortunately this results in a non-meaningful bound (see supplementary material for proof).
Proposition 1. Consider the bounds in Theorem 1 of Schulman et al. (2015) and Corollary 1 of Achiam et al. (2017). The right hand side of both bounds times $1-\gamma$ goes to negative infinity as $\gamma \to 1$.
Since $\lim_{\gamma\to 1}(1-\gamma)\big(\rho_\gamma(\pi') - \rho_\gamma(\pi)\big) = \rho(\pi') - \rho(\pi)$, Proposition 1 says that the policy improvement guarantee from Schulman et al. (2015) and Achiam et al. (2017) becomes trivial when $\gamma \to 1$ and thus does not generalize to the average reward setting. In the next section, we will derive a novel policy improvement bound for the average reward objective, which in turn can be used to generate monotonically improved policies w.r.t. the average reward.
4. Main Results
4.1. Average Reward Policy Improvement Theorem
Let $d_\pi \in \mathbb{R}^{|\mathcal{S}|}$ be the probability column vector whose components are $d_\pi(s)$. Let $P_\pi \in \mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}$ be the transition matrix under policy $\pi$, whose $(s, s')$ component is $P_\pi(s'|s) = \sum_a P(s'|s, a)\pi(a|s)$, and let $P^\star_\pi := \lim_{N\to\infty}\frac{1}{N}\sum_{t=0}^{N} P^t_\pi$ be the limiting distribution of the transition matrix. For aperiodic unichain MDPs, $P^\star_\pi = \lim_{t\to\infty} P^t_\pi = \mathbf{1}d^T_\pi$.
Suppose we have a new policy $\pi'$ obtained via some update rule from the current policy $\pi$. Similar to the discounted case, we would like to measure their performance difference $\rho(\pi') - \rho(\pi)$ using an expression which depends on $\pi$ and some divergence metric between the two policies. The following identity shows that $\rho(\pi') - \rho(\pi)$ can be expressed using the average reward advantage function of $\pi$.

Lemma 1. Under Assumption 2:
$$\rho(\pi') - \rho(\pi) = \mathbb{E}_{\substack{s\sim d_{\pi'} \\ a\sim\pi'}}\left[\bar{A}^\pi(s, a)\right] \tag{4}$$
for any two stochastic policies $\pi$ and $\pi'$.
Lemma 1 is an extension of the well-known policy difference lemma from Kakade & Langford (2002) to the average reward case. A similar result was proven by Even-Dar et al. (2009) and Neu et al. (2010). For completeness, we provide a simple proof in the supplementary material. Note that this expression depends on samples drawn from $\pi'$. However, we can show through the following lemma that when $d_\pi$ and $d_{\pi'}$ are “close” w.r.t. the TV divergence, we can evaluate $\rho(\pi')$ using samples from $d_\pi$ (see supplementary material for proof).

Lemma 2. Under Assumption 2, the following bound holds for any two stochastic policies $\pi$ and $\pi'$:
$$\left|\rho(\pi') - \rho(\pi) - \mathbb{E}_{\substack{s\sim d_\pi \\ a\sim\pi'}}\left[\bar{A}^\pi(s, a)\right]\right| \le 2\epsilon\, D_{TV}(d_{\pi'} \| d_\pi) \tag{5}$$
where $\epsilon = \max_s\left|\mathbb{E}_{a\sim\pi'(\cdot|s)}\left[\bar{A}^\pi(s, a)\right]\right|$.
Lemma 2 implies that
$$\rho(\pi') \approx \rho(\pi) + \mathbb{E}_{\substack{s\sim d_\pi \\ a\sim\pi'}}\left[\bar{A}^\pi(s, a)\right] \tag{6}$$
when $d_\pi$ and $d_{\pi'}$ are “close”. However, in order to study how policy improvement is connected to changes in the actual policies themselves, we need to analyze the relationship between changes in the policies and changes in stationary distributions. It turns out that the sensitivity of the stationary distributions in relation to the policies is related to the structure of the underlying Markov chain.
Let $M_\pi \in \mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}$ be the mean first passage time matrix whose element $M_\pi(s, s')$ is the expected number of steps it takes to reach state $s'$ from $s$ under policy $\pi$. Under Assumption 1, the matrix $M_\pi$ can be calculated via (see Theorem 4.4.7 of Kemeny & Snell (1960))
$$M_\pi = (I - Z_\pi + E Z_{\pi\,\mathrm{dg}}) D_\pi \tag{7}$$
where $Z_\pi = (I - P_\pi + P^\star_\pi)^{-1}$ is known as the fundamental matrix of the Markov chain (Kemeny & Snell, 1960) and $E$ is a square matrix consisting of all ones. The subscript ‘dg’ on a square matrix refers to taking the diagonal of said matrix and placing zeros everywhere else. $D_\pi \in \mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}$ is a diagonal matrix whose elements are $1/d_\pi(s)$.
One important property of mean first passage times is that for any MDP which satisfies Assumption 1, the quantity
$$\kappa_\pi = \sum_{s'} d_\pi(s') M_\pi(s, s') = \mathrm{trace}(Z_\pi) \tag{8}$$
is a constant independent of the starting state for any policy $\pi$ (Theorem 4.4.10 of Kemeny & Snell (1960)). The constant $\kappa_\pi$ is sometimes referred to as Kemeny's constant (Grinstead & Snell, 2012). This constant can be interpreted as the mean number of steps it takes to get to any goal state, weighted by the stationary distribution of the goal states. This weighted mean does not depend on the starting state, as mentioned just above.

It can be shown that the value of Kemeny's constant is also related to the mixing time of the Markov chain, i.e., how fast the chain converges to the stationary distribution (see Appendix C for additional details).
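To make these quantities concrete, here is a minimal numerical sketch (not from the paper) that computes the fundamental matrix, the mean first passage time matrix of Eq. (7), and Kemeny's constant of Eq. (8) for a small hypothetical ergodic chain; the transition matrix is an arbitrary placeholder.

```python
import numpy as np

# Hypothetical policy-induced transition matrix P_pi for a 3-state chain.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
n = P.shape[0]

# Stationary distribution d_pi: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); d /= d.sum()

P_star = np.outer(np.ones(n), d)             # limiting matrix 1 d^T
Z = np.linalg.inv(np.eye(n) - P + P_star)    # fundamental matrix Z_pi
D = np.diag(1.0 / d)
Z_dg = np.diag(np.diag(Z))
E = np.ones((n, n))
M = (np.eye(n) - Z + E @ Z_dg) @ D           # mean first passage times, Eq. (7)

print(M @ d)                                 # sum_{s'} d(s') M(s, s') for each start s
print(np.trace(Z))                           # Kemeny's constant, Eq. (8)
```

The printed vector $\sum_{s'} d_\pi(s') M_\pi(s, s')$ has identical entries across starting states $s$ and matches $\mathrm{trace}(Z_\pi)$, as stated above.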
The following result connects the sensitivity of the stationary distribution to changes in the policy.

Lemma 3. Under Assumption 1, the divergence between the stationary distributions $d_\pi$ and $d_{\pi'}$ can be upper bounded by the average divergence between policies $\pi$ and $\pi'$:
$$D_{TV}(d_{\pi'} \| d_\pi) \le (\kappa - 1)\,\mathbb{E}_{s\sim d_\pi}\left[D_{TV}(\pi' \| \pi)[s]\right] \tag{9}$$
where $\kappa = \max_\pi \kappa_\pi$.

For Markov chains with a small mixing time, where an agent can quickly get to any state, Kemeny's constant is relatively small and Lemma 3 shows that the stationary distributions are not highly sensitive to small changes in the policy. On the other hand, for Markov chains that have high mixing times, the factor can become very large. In this case Lemma 3 shows that small changes in the policy can have a large impact on the resulting stationary distributions.
Combining the bounds in Lemma 2 and Lemma 3 gives us the following result:

Theorem 1. Under Assumption 1, the following bounds hold for any two stochastic policies $\pi$ and $\pi'$:
$$D^-_\pi(\pi') \le \rho(\pi') - \rho(\pi) \le D^+_\pi(\pi') \tag{10}$$
where
$$D^\pm_\pi(\pi') = \mathbb{E}_{\substack{s\sim d_\pi \\ a\sim\pi'}}\left[\bar{A}^\pi(s, a)\right] \pm 2\xi\,\mathbb{E}_{s\sim d_\pi}\left[D_{TV}(\pi' \| \pi)[s]\right]$$
and $\xi = (\kappa - 1)\max_s \mathbb{E}_{a\sim\pi'}\left|\bar{A}^\pi(s, a)\right|$.

The bounds in Theorem 1 are guaranteed to be finite. Analogous to the discounted case, the multiplicative factor $\xi$ provides guidance on the step-sizes for policy updates. Note that Theorem 1 holds for MDPs satisfying Assumption 1; in Appendix D we discuss how a similar result can be derived for the more general aperiodic unichain case.
The bound in Theorem 1 is given in terms of the TV divergence; however, the KL divergence is more commonly used in practice. The relationship between the TV divergence and the KL divergence is given by Pinsker's inequality (Tsybakov, 2008), which says that for any two distributions $p$ and $q$: $D_{TV}(p \| q) \le \sqrt{D_{KL}(p \| q)/2}$. We can then show that
$$\mathbb{E}_{s\sim d_\pi}\left[D_{TV}(\pi' \| \pi)[s]\right] \le \mathbb{E}_{s\sim d_\pi}\left[\sqrt{D_{KL}(\pi' \| \pi)[s]/2}\right] \le \sqrt{\mathbb{E}_{s\sim d_\pi}\left[D_{KL}(\pi' \| \pi)[s]\right]/2} \tag{11}$$
where the second inequality comes from Jensen's inequality. The inequality in (11) shows that the bounds in Theorem 1 still hold when $\mathbb{E}_{s\sim d_\pi}[D_{TV}(\pi' \| \pi)[s]]$ is substituted with $\sqrt{\mathbb{E}_{s\sim d_\pi}[D_{KL}(\pi' \| \pi)[s]]/2}$.

Algorithm 1 Approximate Average Reward Policy Iteration
1: Input: $\pi_0$
2: for $k = 0, 1, 2, \ldots$ do
3:   Policy Evaluation: Evaluate $\bar{A}^{\pi_k}(s, a)$ for all $s, a$
4:   Policy Improvement:
$$\pi_{k+1} = \arg\max_\pi D^-_{\pi_k}(\pi) \tag{12}$$
   where
$$D^-_{\pi_k}(\pi) = \mathbb{E}_{\substack{s\sim d_{\pi_k} \\ a\sim\pi}}\left[\bar{A}^{\pi_k}(s, a)\right] - \xi\sqrt{2\,\mathbb{E}_{s\sim d_{\pi_k}}\left[D_{KL}(\pi \| \pi_k)[s]\right]}$$
   and $\xi = (\kappa - 1)\max_s \mathbb{E}_{a\sim\pi}\left|\bar{A}^{\pi_k}(s, a)\right|$
5: end for
4.2. Approximate Policy Iteration

One direct consequence of Theorem 1 is that iteratively maximizing the $D^-_\pi(\pi')$ term in the bound generates a monotonically improving sequence of policies w.r.t. the average reward objective. Algorithm 1 gives an approximate policy iteration algorithm that produces such a sequence of policies.

Proposition 2. Given an initial policy $\pi_0$, Algorithm 1 is guaranteed to generate a sequence of policies $\pi_1, \pi_2, \ldots$ such that $\rho(\pi_0) \le \rho(\pi_1) \le \rho(\pi_2) \le \cdots$.

Proof. At iteration $k$, $\mathbb{E}_{s\sim d_{\pi_k}, a\sim\pi}[\bar{A}^{\pi_k}(s, a)] = 0$ and $\mathbb{E}_{s\sim d_{\pi_k}}[D_{KL}(\pi \| \pi_k)[s]] = 0$ for $\pi = \pi_k$. By Theorem 1 and (12), $\rho(\pi_{k+1}) - \rho(\pi_k) \ge 0$.

However, Algorithm 1 is difficult to implement in practice since it requires exact knowledge of $\bar{A}^{\pi_k}(s, a)$ and the transition matrix. Furthermore, calculating the term $\xi$ is impractical for high-dimensional problems. In the next section, we will introduce a sample-based algorithm which approximates the update rule in Algorithm 1.
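For intuition, the following is a minimal tabular sketch (not from the paper) of the surrogate $D^-_{\pi_k}(\pi)$ maximized in Eq. (12), assuming the exact quantities it requires ($\bar{A}^{\pi_k}$, $d_{\pi_k}$, and $\kappa$) have already been computed, e.g. with the fundamental-matrix snippet above; the arrays and the helper name are hypothetical.

```python
import numpy as np

def surrogate_lower_bound(A_bar, d_k, pi_k, pi_new, kappa):
    """Evaluate D^-_{pi_k}(pi_new) from Eq. (12) for a tabular MDP.

    A_bar:  |S| x |A| array of average-reward advantages A_bar^{pi_k}(s, a)
    d_k:    |S| stationary distribution of pi_k
    pi_k:   |S| x |A| current policy (strictly positive rows summing to 1)
    pi_new: |S| x |A| candidate policy (strictly positive rows summing to 1)
    kappa:  max_pi kappa_pi, or any upper bound on it
    """
    # Expected advantage term E_{s~d_k, a~pi_new}[A_bar(s, a)].
    expected_adv = d_k @ np.sum(pi_new * A_bar, axis=1)

    # Penalty coefficient xi = (kappa - 1) * max_s E_{a~pi_new}|A_bar(s, a)|.
    xi = (kappa - 1.0) * np.max(np.sum(pi_new * np.abs(A_bar), axis=1))

    # Average KL divergence E_{s~d_k}[KL(pi_new(.|s) || pi_k(.|s))].
    kl_per_state = np.sum(pi_new * (np.log(pi_new) - np.log(pi_k)), axis=1)
    avg_kl = d_k @ kl_per_state

    return expected_adv - xi * np.sqrt(2.0 * avg_kl)
```

Maximizing this quantity over a family of candidate policies reproduces one policy improvement step of Algorithm 1; by Proposition 2 the resulting average reward cannot decrease.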
5. Practical Algorithm
As noted in the previous section, Algorithm 1 is not practical for problems with large state and action spaces. In this section, we will discuss how Algorithm 1 and Theorem 1 can be used in practice to create algorithms which can effectively solve high-dimensional DRL problems with the use of trust region methods.

In Appendix F, we will also discuss how Theorem 1 can be used to solve DRL problems with average cost safety constraints. RL with safety constraints is an important class of problems with practical implications (Amodei et al., 2016). Trust region methods have been successfully applied to this class of problems as they provide worst-case constraint violation guarantees for evaluating the cost constraint values for policy updates (Achiam et al., 2017; Yang et al., 2020; Zhang et al., 2020). However, the aforementioned theoretical guarantees were only shown to apply to discounted cost constraints. Tessler et al. (2019) pointed out that trust-region based methods such as the Constrained Policy Optimization (CPO) algorithm (Achiam et al., 2017) cannot be used for average cost constraints. Contrary to this belief, in Appendix F we demonstrate that Theorem 1 provides a worst-case constraint violation guarantee for average costs, and trust-region-based constrained RL methods can easily be modified to accommodate average cost constraints.
5.1. Average Reward Trust Region Methods
For DRL problems, it is common to consider some parameterized policy class $\Pi_\Theta = \{\pi_\theta : \theta \in \Theta\}$. Our goal is to devise a computationally tractable version of Algorithm 1 for policies in $\Pi_\Theta$. We can rewrite the unconstrained optimization problem in (12) as a constrained problem:
$$\begin{aligned} \underset{\pi_\theta \in \Pi_\Theta}{\text{maximize}}\quad & \mathbb{E}_{\substack{s\sim d_{\pi_{\theta_k}} \\ a\sim\pi_\theta}}\left[\bar{A}^{\pi_{\theta_k}}(s, a)\right] \\ \text{subject to}\quad & \bar{D}_{KL}(\pi_\theta \| \pi_{\theta_k}) \le \delta \end{aligned} \tag{13}$$
where $\bar{D}_{KL}(\pi_\theta \| \pi_{\theta_k}) := \mathbb{E}_{s\sim d_{\pi_{\theta_k}}}\left[D_{KL}(\pi_\theta \| \pi_{\theta_k})[s]\right]$. Importantly, the advantage function $\bar{A}^{\pi_{\theta_k}}(s, a)$ appearing in (13) is the average-reward advantage function, defined as the action-bias minus the bias, and not the discounted advantage function. The constraint set $\{\pi_\theta \in \Pi_\Theta : \bar{D}_{KL}(\pi_\theta \| \pi_{\theta_k}) \le \delta\}$ is called the trust region set. The problem (13) can be regarded as an average reward variant of the trust region problem from Schulman et al. (2015). The step-size $\delta$ is treated as a hyperparameter in practice and should ideally be tuned for each specific task. However, we note that in the average reward setting, the choice of step-size is related to the mixing time of the underlying Markov chain (since it is related to the multiplicative factor $\xi$ in Theorem 1). When the mixing time is small, a larger step-size can be chosen and vice versa. While it is impractical to calculate the optimal step-size, in certain applications domain knowledge on the mixing time can serve as a guide for tuning $\delta$.
When we set $\pi_{\theta_{k+1}}$ to be the optimal solution to (13), similar to the discounted case, the policy improvement guarantee no longer holds. However, we can show that $\pi_{\theta_{k+1}}$ has the following worst-case performance degradation guarantee:

Proposition 3. Let $\pi_{\theta_{k+1}}$ be the optimal solution to (13) for some $\pi_{\theta_k} \in \Pi_\Theta$. The policy performance difference between $\pi_{\theta_{k+1}}$ and $\pi_{\theta_k}$ can be lower bounded by
$$\rho(\pi_{\theta_{k+1}}) - \rho(\pi_{\theta_k}) \ge -\xi_{\pi_{\theta_{k+1}}}\sqrt{2\delta} \tag{14}$$
where $\xi_{\pi_{\theta_{k+1}}} = (\kappa_{\pi_{\theta_{k+1}}} - 1)\max_s \mathbb{E}_{a\sim\pi_{\theta_{k+1}}}\left|\bar{A}^{\pi_{\theta_k}}(s, a)\right|$.

Proof. Since $\bar{D}_{KL}(\pi_{\theta_k} \| \pi_{\theta_k}) = 0$, $\pi_{\theta_k}$ is feasible. The objective value is 0 for $\pi_\theta = \pi_{\theta_k}$. The bound follows from (10) and (11) where the average KL is bounded by $\delta$.
Several algorithms have been proposed for efficiently solving the discounted version of (13): Schulman et al. (2015) and Wu et al. (2017) convert (13) into a convex problem via Taylor approximations; another approach is to first solve (13) in the non-parametric policy space and then project the result back into the parameter space (Vuong et al., 2019; Song et al., 2020). These algorithms can also be adapted to the average reward case and are theoretically justified via Theorem 1 and Proposition 3. In the next section, we provide a specific example of how this can be done for one such algorithm.
5.2. Average Reward TRPO (ATRPO)
In this section, we introduce ATRPO, which is an average-
reward modification of the TRPO algorithm (Schulman
et al.,2015). Similar to TRPO, we apply Taylor approxi-
mations to
(13)
. This gives us a new optimization problem
which can be solved exactly using Lagrange duality (Boyd
et al.,2004). The solution to this approximate problem
gives an explicit update rule for the policy parameters which
then allows us to perform policy updates using an actor-
critic framework. More details can be found in Appendix E.
Algorithm 2 provides a basic outline of ATRPO.

Algorithm 2 Average Reward TRPO (ATRPO)
1: Input: Policy parameters $\theta_0$, critic net parameters $\phi_0$, learning rate $\alpha$, trajectory truncation parameter $N$.
2: for $k = 0, 1, 2, \ldots$ do
3:   Collect a truncated trajectory $\{s_t, a_t, s_{t+1}, r_t\}$, $t = 1, \ldots, N$, from the environment using $\pi_{\theta_k}$.
4:   Calculate the sample average reward of $\pi_{\theta_k}$ via $\rho = \frac{1}{N}\sum_{t=1}^{N} r_t$.
5:   for $t = 1, 2, \ldots, N$ do
6:     Get target $\bar{V}^{\text{target}}_t = r_t - \rho + \bar{V}_{\phi_k}(s_{t+1})$
7:     Get advantage estimate: $\hat{A}(s_t, a_t) = r_t - \rho + \bar{V}_{\phi_k}(s_{t+1}) - \bar{V}_{\phi_k}(s_t)$
8:   end for
9:   Update critic by $\phi_{k+1} \leftarrow \phi_k - \alpha\nabla_\phi L(\phi_k)$ where
$$L(\phi_k) = \frac{1}{N}\sum_{t=1}^{N}\left(\bar{V}_{\phi_k}(s_t) - \bar{V}^{\text{target}}_t\right)^2$$
10:  Use $\hat{A}(s_t, a_t)$ to update $\theta_k$ using the TRPO policy update (Schulman et al., 2015).
11: end for

The major differences between ATRPO and TRPO are as follows:

(i) The critic network in Algorithm 2 approximates the average-reward bias rather than the discounted value function.

(ii) ATRPO must estimate the average return $\rho$ of the current policy.

(iii) The targets for the bias and the advantage are calculated without discount factors, and the average return $\rho$ is subtracted from the reward. Simply setting the discount factor to 1 in TRPO does not lead to Algorithm 2.

(iv) ATRPO also assumes that the underlying task is a continuing infinite-horizon task. But since in practice we cannot run infinitely long trajectories, all trajectories are truncated at some large truncation value $N$. Unlike TRPO, during training we do not allow for episodic tasks where episodes terminate early (before $N$). For the MuJoCo environments, we will address this by having the agent not only resume locomotion after falling but also incur a penalty for falling (see Section 6).
In Algorithm 2, for illustrative purposes, we use the average
reward one-step bootstrapped estimate for the target of the
critic and the advantage function. In practice, we instead de-
velop and use an average-reward version of the Generalized
Advantage Estimator (GAE) from Schulman et al. (2016).
In Appendix Gwe provide more details on how GAE can
be generalized to the average-reward case.
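As a concrete illustration, here is a minimal sketch (assuming numpy arrays of per-step rewards and critic outputs; it is not code from the paper) of the one-step bootstrapped targets and advantage estimates in steps 4–7 of Algorithm 2. Roughly speaking, the $\lambda$-weighted estimator used in practice replaces the one-step residual with a $\lambda$-weighted sum of such residuals (see Appendix G).

```python
import numpy as np

def atrpo_one_step_targets(rewards, values, values_next):
    """Sketch of Algorithm 2, steps 4-7 (hypothetical helper, not the paper's code).

    rewards:     r_t for t = 1..N
    values:      critic bias estimates V_bar(s_t)
    values_next: critic bias estimates V_bar(s_{t+1})
    """
    rho_hat = np.mean(rewards)                 # step 4: sample average reward
    targets = rewards - rho_hat + values_next  # step 6: critic targets
    advantages = targets - values              # step 7: advantage estimates
    return rho_hat, targets, advantages
```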
6. Experiments
We conducted experiments comparing the performance of
ATRPO and TRPO on continuing control tasks. We con-
sider three tasks (Ant, HalfCheetah, and Humanoid) from
the MuJoCo physical simulator (Todorov et al.,2012) imple-
Figure 1.
Comparing performance of ATRPO and TRPO with different discount factors. The
x
-axis is the number of agent-environment
interactions and the
y
-axis is the total return averaged over 10 seeds. The solid line represents the agents’ performance on evaluation
trajectories of maximum length 1,000 (top row) and 10,000 (bottom row). The shaded region represents one standard deviation.
mented using OpenAI gym (Brockman et al.,2016), where
the natural goal is to train the agents to run as fast as possible
without falling.
6.1. Evaluation Protocol
Even though the MuJoCo benchmark is commonly trained
using the discounted objective (see e.g. Schulman et al.
(2015), Wu et al. (2017), Lillicrap et al. (2016), Schulman
et al. (2017), Haarnoja et al. (2018), Vuong et al. (2019)), it
is always evaluated without discounting. Similarly, we also
evaluate performance using the undiscounted total-reward
objective for both TRPO and ATRPO.
Specifically for each environment, we train a policy for 10
million environment steps. During training, every 100,000
steps, we run 10 separate evaluation trajectories with the
current policy without exploration (i.e., the policy is kept
fixed and deterministic). For each evaluation trajectory we
calculate the undiscounted return of the trajectory until the
agent falls or until 1,000 steps, whichever comes first. We
then report the average undiscounted return over the 10
trajectories. Note that this is the standard evaluation metric
for the MuJoCo environments. In order to understand the
performance of the agent for long time horizons, we also
report the performance of the agent evaluated on trajectories
of maximum length 10,000.
6.2. Comparing ATRPO and TRPO
To simulate an infinite-horizon setting during training, we
do the following: when the agent falls, the trajectory does
not terminate; instead the agent incurs a large reset cost for
falling, and then continues the trajectory from a random
start state. The reset cost is set to 100. However, we show in
the supplementary material (Appendix I.2) that the results
are largely insensitive to the choice of reset cost. We note
that this modification does not change the underlying goal
of the task. We also point out that the reset cost is only ap-
plied during training and is not used in the evaluation phase
described in the previous section. Hyperparameter settings
and other additional details can be found in Appendix H.
We plot the performance for ATRPO and TRPO trained with
different discount factors in Figure 1. We see that TRPO
with its best discount factor can perform as well as ATRPO
for the simplest environment HalfCheetah. But ATRPO
provides dramatic improvements in Ant and Humanoid. In
particular for the most challenging environment Humanoid,
ATRPO performs on average
50.1%
better than TRPO with
its best discount factor when evaluated on trajectories of
maximum length 1000. The improvement is even greater
when the agents are evaluated on trajectories of maximum
length 10,000 where the performance boost jumps to
913%
.
In Appendix I.1, we provide an additional set of experiments
Figure 2.
Speed-time plot of a single trajectory (maximum length 10,000) for ATRPO and Discounted TRPO in the Humanoid-v3
environment. The solid line represents the speed of the agent at the corresponding timesteps.
demonstrating that ATRPO also significantly outperforms
TRPO when TRPO is trained without the reset scheme de-
scribed at the beginning of this section (i.e. the standard
MuJoCo setting.)
We make two observations regarding discounting. First, we
note that increasing the discount factor does not necessar-
ily lead to better performance for TRPO. A larger discount
factor in principle enables the algorithm to seek a policy
that performs well for the average-reward criterion (Black-
well,1962). Unfortunately, a larger discount factor can
also increase the variance of the gradient estimator (Zhao
et al.,2011;Schulman et al.,2016), increase the complex-
ity of the policy space (Jiang et al.,2015), lead to slower
convergence (Bertsekas et al.,1995;Agarwal et al.,2020),
and degrade generalization in limited data settings (Amit
et al.,2020). Moreover, algorithms with discounting are
known to become unstable as $\gamma \to 1$ (Naik et al., 2019).
Secondly, for TRPO the best discount factor is different for
each environment (0.99 for HalfCheetah and Ant, 0.95 for
Humanoid). The discount factor therefore serves as a hyperparameter which can be tuned to improve performance; choosing a suboptimal discount factor can have significant consequences. Both of these observations are consistent with what was seen in the literature (Andrychowicz et al., 2020).
We have shown here that using the average reward crite-
rion directly not only delivers superior performance but also
obviates the need to tune the discount factor.
6.3. Understanding Long Run Performance
Next, we demonstrate that agents trained using the aver-
age reward criterion are better at optimizing for long-term
returns. Here, we first train Humanoid with 10 million sam-
ples with ATRPO and with TRPO with a discount factor of
0.95 (shown to be the best discount factor in the previous ex-
periments). Then for evaluation, we run the trained ATRPO
and TRPO policies for a trajectory of 10,000 timesteps (or
until the agent falls). We use the same random seeds for
the two algorithms. Figure 2is a plot of the speed of the
agent at each time step of the trajectory, using the seed that
gives the best performance for discounted TRPO. We see
in Figure 2that the discounted algorithm gives a higher
initial speed at the beginning of the trajectory. However its
overall speed is much more erratic throughout the trajectory,
resulting in the agent falling over after approximately 5000
steps. This coincides with the notion of discounting where
more emphasis is placed at the beginning of the trajectory
and ignores longer-term behavior. On the other hand, the
average-reward policy while having a slightly lower ve-
locity overall throughout its trajectory is able to sustain
the trajectory much longer, thus giving it a higher total re-
turn. In fact, we observed that for all 10 random seeds we
tested, the average reward agent is able to finish the entire
10,000 time step trajectory without falling. In Table 1 we present the summary statistics of trajectory length for all trajectories using discounted TRPO; we note that the median trajectory length for the discounted TRPO agent is 452.5, meaning that on average TRPO performs significantly worse than what is reported in Figure 2.
Table 1.
Summary statistics for all 10 trajectories using a
Humanoid-v3 agent trained with TRPO
Min Max Average Median Std
108 4806 883.1 452.5 1329.902
7. Related Work
Dynamic programming algorithms for finding the optimal
average reward policies have been well-studied (Howard,
1960;Blackwell,1962;Veinott,1966). Several tabular Q-
learning-like algorithms for problems with unknown dy-
namics have been proposed, such as R-Learning (Schwartz,
1993), RVI Q-Learning (Abounadi et al.,2001), CSV-
Learning (Yang et al.,2016), and Differential Q-Learning
(Wan et al.,2020). Mahadevan (1996) conducted a thor-
ough empirical analysis of the R-Learning algorithm. We
note that much of the previous work on average reward RL
focuses on the tabular setting without function approxima-
tions, and the theoretical properties of many of these Q-
learning-based algorithm are not well understood (in partic-
ular R-learning). More recently, POLITEX updates policies
using a Boltzmann distribution over the sum of action-value
function estimates of the previous policies (Abbasi-Yadkori
et al.,2019) and Wei et al. (2020) introduced a model-free
algorithm for optimizing the average reward of weakly-
communicating MDPs.
For policy gradient methods, Baxter & Bartlett (2001)
showed that if $1/(1-\gamma)$ is large compared to the mixing time of the Markov chain induced by the MDP, then the gradient of $\rho_\gamma(\pi)$ can accurately approximate the gradient of $\rho(\pi)$. Kakade (2001a) extended this result and provided an error bound on using an optimal discounted policy to maximize the average reward. In contrast, our work directly deals with the average reward objective and provides theoretical guidance on the optimal step size for each policy update.
Policy improvement bounds have been extensively explored
in the discounted case. The results from Schulman et al.
(2015) are extensions of Kakade & Langford (2002). Pirotta
et al. (2013) also proposed an alternative generalization to
Kakade & Langford (2002). Achiam et al. (2017) improved
upon Schulman et al. (2015) by replacing the maximum
divergence with the average divergence.
8. Conclusion
In this paper, we introduce a novel policy improvement
bound for the average reward criterion. The bound is based
on the average divergence between two policies and Ke-
meny’s constant or mixing time of the Markov chain. We
show that previously existing policy improvement bounds for the discounted case result in a non-meaningful bound for the average reward objective. Our work provides the theo-
retical justification and the means to generalize the popular
trust-region based algorithms to the average reward setting.
Based on this theory, we propose ATRPO, a modification
of the TRPO algorithm for on-policy DRL. We demonstrate
through a series of experiments that ATRPO is highly effec-
tive on high-dimensional continuing control tasks.
Acknowledgements
We would like to extend our gratitude to Quan Vuong and
the anonymous reviewers for their constructive comments
and suggestions. We also thank Shuyang Ling, Che Wang,
Zining (Lily) Wang, and Yanqiu Wu for the insightful dis-
cussions on this work.
References
Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N.,
Szepesvari, C., and Weisz, G. Politex: Regret bounds for
policy iteration using expert prediction. In International
Conference on Machine Learning, pp. 3692–3702, 2019.
Abounadi, J., Bertsekas, D., and Borkar, V. S. Learning
algorithms for markov decision processes with average
cost. SIAM Journal on Control and Optimization, 40(3):
681–698, 2001.
Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained
policy optimization. In Proceedings of the 34th Interna-
tional Conference on Machine Learning-Volume 70, pp.
22–31. JMLR. org, 2017.
Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Op-
timality and approximation with policy gradient methods
in markov decision processes. In Conference on Learning
Theory, pp. 64–66. PMLR, 2020.
Altman, E. Constrained Markov decision processes, vol-
ume 7. CRC Press, 1999.
Amit, R., Meir, R., and Ciosek, K. Discount factor as a
regularizer in reinforcement learning. In International
conference on machine learning, 2020.
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schul-
man, J., and Mané, D. Concrete problems in ai safety.
arXiv preprint arXiv:1606.06565, 2016.
Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., et al. What matters in on-policy reinforcement learning? A large-scale empirical study. arXiv preprint arXiv:2006.05990, 2020.
Baxter, J. and Bartlett, P. L. Infinite-horizon policy-gradient
estimation. Journal of Artificial Intelligence Research,
15:319–350, 2001.
Bertsekas, D. P. Dynamic programming and optimal control, volume 1–2. Athena Scientific, Belmont, MA, 1995.
Blackwell, D. Discrete dynamic programming. The Annals
of Mathematical Statistics, pp. 719–726, 1962.
Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.
Brémaud, P. Markov Chains Gibbs Fields, Monte Carlo
Simulation and Queues. Springer, 2 edition, 2020.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J.,
Schulman, J., Tang, J., and Zaremba, W. Openai gym,
2016.
Cho, G. E. and Meyer, C. D. Comparison of perturbation
bounds for the stationary distribution of a markov chain.
Linear Algebra and its Applications, 335(1-3):137–150,
2001.
Even-Dar, E., Kakade, S. M., and Mansour, Y. Online
markov decision processes. Mathematics of Operations
Research, 34(3):726–736, 2009.
Grinstead, C. M. and Snell, J. L. Introduction to probability.
American Mathematical Soc., 2012.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-
critic: Off-policy maximum entropy deep reinforcement
learning with a stochastic actor. International Conference
on Machine Learning (ICML), 2018.
Horn, R. A. and Johnson, C. R. Matrix analysis. Cambridge
university press, 2012.
Howard, R. A. Dynamic programming and markov pro-
cesses. John Wiley, 1960.
Hunter, J. J. Stationary distributions and mean first passage
times of perturbed markov chains. Linear Algebra and
its Applications, 410:217–243, 2005.
Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The depen-
dence of effective planning horizon on model accuracy.
In Proceedings of the 2015 International Conference on
Autonomous Agents and Multiagent Systems, pp. 1181–
1189. Citeseer, 2015.
Jiang, N., Singh, S. P., and Tewari, A. On structural proper-
ties of mdps that bound loss due to shallow planning. In
IJCAI, pp. 1640–1647, 2016.
Kakade, S. Optimizing average reward using discounted
rewards. In International Conference on Computational
Learning Theory, pp. 605–615. Springer, 2001a.
Kakade, S. and Langford, J. Approximately optimal approxi-
mate reinforcement learning. In International Conference
on Machine Learning, volume 2, pp. 267–274, 2002.
Kakade, S. M. A natural policy gradient. Advances in neural
information processing systems, 14, 2001b.
Kallenberg, L. Linear Programming and Finite Markovian
Control Problems. Centrum Voor Wiskunde en Informat-
ica, 1983.
Kemeny, J. and Snell, I. Finite Markov Chains. Van Nos-
trand, New Jersey, 1960.
Lehmann, E. L. and Casella, G. Theory of point estimation.
Springer Science & Business Media, 2006.
Lehnert, L., Laroche, R., and van Seijen, H. On value func-
tion representation of long horizon problems. In Proceed-
ings of the AAAI Conference on Artificial Intelligence,
volume 32, 2018.
Levin, D. A. and Peres, Y. Markov chains and mixing times,
volume 107. American Mathematical Soc., 2017.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez,
T., Tassa, Y., Silver, D., and Wierstra, D. Continuous
control with deep reinforcement learning. International
Conference on Learning Representations (ICLR), 2016.
Mahadevan, S. Average reward reinforcement learning:
Foundations, algorithms, and empirical results. Machine
learning, 22(1-3):159–195, 1996.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M. Play-
ing atari with deep reinforcement learning. NIPS Deep
Learning Workshop, 2013.
Naik, A., Shariff, R., Yasui, N., and Sutton, R. S. Discounted
reinforcement learning is not an optimization problem.
NeurIPS Optimization Foundations for Reinforcement
Learning Workshop, 2019.
Neu, G., Antos, A., György, A., and Szepesvári, C. Online
markov decision processes under bandit feedback. In
Advances in Neural Information Processing Systems, pp.
1804–1812, 2010.
Peters, J. and Schaal, S. Reinforcement learning of motor
skills with policy gradients. Neural networks, 21(4):682–
697, 2008.
Petrik, M. and Scherrer, B. Biasing approximate dynamic
programming with a lower discount factor. In Twenty-
Second Annual Conference on Neural Information Pro-
cessing Systems-NIPS 2008, 2008.
Pirotta, M., Restelli, M., Pecorino, A., and Calandriello,
D. Safe policy iteration. In International Conference on
Machine Learning, pp. 307–315, 2013.
Ross, K. W. Constrained Markov decision processes with queueing applications. Dissertation Abstracts International Part B: Science and Engineering, 46(4), 1985.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz,
P. Trust region policy optimization. In International
Conference on Machine Learning, pp. 1889–1897, 2015.
Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel,
P. High-dimensional continuous control using general-
ized advantage estimation. International Conference on
Learning Representations (ICLR), 2016.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. Proximal policy optimization algorithms.
arXiv preprint arXiv:1707.06347, 2017.
Schwartz, A. A reinforcement learning method for maxi-
mizing undiscounted rewards. In Proceedings of the tenth
international conference on machine learning, volume
298, pp. 298–305, 1993.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L.,
Van Den Driessche, G., Schrittwieser, J., Antonoglou, I.,
Panneershelvam, V., Lanctot, M., et al. Mastering the
game of go with deep neural networks and tree search.
nature, 529(7587):484, 2016.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai,
M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Grae-
pel, T., et al. A general reinforcement learning algorithm
that masters chess, shogi, and go through self-play. Sci-
ence, 362(6419):1140–1144, 2018.
Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A.,
Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tiru-
mala, D., et al. V-mpo: on-policy maximum a posteriori
policy optimization for discrete and continuous control.
International Conference on Learning Representations,
2020.
Sutton, R. S. and Barto, A. G. Reinforcement learning: An
introduction. MIT press, 2018.
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour,
Y. Policy gradient methods for reinforcement learning
with function approximation. In Advances in neural in-
formation processing systems, pp. 1057–1063, 2000.
Tadepalli, P. and Ok, D. H-learning: A reinforcement
learning method to optimize undiscounted average re-
ward. Technical Report 94-30-01, Oregon State Univer-
sity, 1994.
Tessler, C., Mankowitz, D. J., and Mannor, S. Reward con-
strained policy optimization. International Conference
on Learning Representation (ICLR), 2019.
Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics
engine for model-based control. In 2012 IEEE/RSJ Inter-
national Conference on Intelligent Robots and Systems,
pp. 5026–5033. IEEE, 2012.
Tsybakov, A. B. Introduction to nonparametric estimation.
Springer Science & Business Media, 2008.
Veinott, A. F. On finding optimal policies in discrete dy-
namic programming with no discounting. The Annals of
Mathematical Statistics, 37(5):1284–1294, 1966.
Vuong, Q., Zhang, Y., and Ross, K. W. Supervised policy
update for deep reinforcement learning. In International
Conference on Learning Representation (ICLR), 2019.
Wan, Y., Naik, A., and Sutton, R. S. Learning and plan-
ning in average-reward markov decision processes. arXiv
preprint arXiv:2006.16318, 2020.
Wei, C.-Y., Jafarnia-Jahromi, M., Luo, H., Sharma, H., and
Jain, R. Model-free reinforcement learning in infinite-
horizon average-reward markov decision processes. In
International conference on machine learning, 2020.
Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba,
J. Scalable trust-region method for deep reinforcement
learning using kronecker-factored approximation. In Ad-
vances in neural information processing systems (NIPS),
pp. 5285–5294, 2017.
Yang, S., Gao, Y., An, B., Wang, H., and Chen, X. Efficient
average reward reinforcement learning using constant
shifting values. In AAAI, pp. 2258–2264, 2016.
Yang, T.-Y., Rosca, J., Narasimhan, K., and Ramadge, P. J.
Projection-based constrained policy optimization. In
International Conference on Learning Representation
(ICLR), 2020.
Zhang, Y., Vuong, Q., and Ross, K. First order constrained
optimization in policy space. Advances in Neural Infor-
mation Processing Systems, 33, 2020.
Zhao, T., Hachiya, H., Niu, G., and Sugiyama, M. Anal-
ysis and improvement of policy gradient estimation. In
Advances in Neural Information Processing Systems, pp.
262–270, 2011.
Supplementary Materials
A. Relationship Between the Discounted and Average Reward Criteria
We first introduce the average reward Bellman equations (Sutton & Barto, 2018):
$$\bar{V}^\pi(s) = \sum_a \pi(a|s)\left[r(s, a) - \rho(\pi) + \sum_{s'} P(s'|s, a)\bar{V}^\pi(s')\right] \tag{15}$$
$$\bar{Q}^\pi(s, a) = r(s, a) - \rho(\pi) + \sum_{s'} P(s'|s, a)\sum_{a'} \pi(a'|s')\bar{Q}^\pi(s', a'). \tag{16}$$
From these we can easily show that:
$$\bar{V}^\pi(s) = \sum_a \pi(a|s)\bar{Q}^\pi(s, a) \tag{17}$$
$$\bar{Q}^\pi(s, a) = r(s, a) - \rho(\pi) + \sum_{s'} P(s'|s, a)\bar{V}^\pi(s'). \tag{18}$$
Note that these equations take a slightly different form compared to the discounted Bellman equations: there are no discount factors and the rewards are now replaced with $r(s, a) - \rho(\pi)$.
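As a quick numerical illustration (not from the paper), the policy-averaged form of Eq. (15), together with the normalization $d_\pi^T \bar{V}^\pi = 0$ that is consistent with the series definition of the bias, can be solved as a single linear system; the chain and rewards below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical policy-induced chain and per-state expected rewards.
P_pi = np.array([[0.2, 0.5, 0.3],
                 [0.3, 0.3, 0.4],
                 [0.5, 0.2, 0.3]])     # P_pi(s'|s)
r_pi = np.array([1.0, 0.0, 2.0])       # E_{a~pi}[r(s, a)] for each state
n = len(r_pi)

# Stationary distribution d_pi.
evals, evecs = np.linalg.eig(P_pi.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); d /= d.sum()

# Linear system: (I - P_pi) V + rho * 1 = r_pi, with normalization d^T V = 0.
A = np.zeros((n + 1, n + 1))
A[:n, :n] = np.eye(n) - P_pi
A[:n, n] = 1.0
A[n, :n] = d
b = np.append(r_pi, 0.0)
sol = np.linalg.solve(A, b)
V_bar, rho = sol[:n], sol[n]
print(rho, d @ r_pi)                   # the gain matches E_{s~d_pi}[r_pi(s)]
```

The printed gain equals $\mathbb{E}_{s\sim d_\pi}[r_\pi(s)]$, in agreement with Eq. (1).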
The following classic result relates the discounted value function and the average reward bias function.

Proposition A.1 (Blackwell, 1962). For a given stationary policy $\pi$ and discount factor $\gamma \in (0, 1)$,
$$\lim_{\gamma\to 1}\left(V^\pi_\gamma(s) - \frac{\rho(\pi)}{1-\gamma}\right) = \bar{V}^\pi(s) \tag{19}$$
for all $s \in \mathcal{S}$.

Note that Proposition A.1 applies to any MDP; we will however restrict our discussion to the unichain case to coincide with the scope of the paper. From Proposition A.1, it is clear that $\lim_{\gamma\to 1}(1-\gamma)\rho_\gamma(\pi) = \rho(\pi)$, i.e. the discounted and average reward objectives are equivalent in the limit as $\gamma$ approaches 1. We can derive similar relations for the action-bias function and advantage function.

Corollary A.1. For a given stationary policy $\pi$ and discount factor $\gamma \in (0, 1)$,
$$\lim_{\gamma\to 1}\left(Q^\pi_\gamma(s, a) - \frac{\rho(\pi)}{1-\gamma}\right) = \bar{Q}^\pi(s, a) \tag{20}$$
$$\lim_{\gamma\to 1} A^\pi_\gamma(s, a) = \bar{A}^\pi(s, a) \tag{21}$$
for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$.
Proof. From Proposition A.1, we can rewrite (19) as
$$V^\pi_\gamma(s) = \frac{\rho(\pi)}{1-\gamma} + \bar{V}^\pi(s) + g(\gamma, s) \tag{22}$$
where $\lim_{\gamma\to 1} g(\gamma, s) = 0$. We then expand $Q^\pi_\gamma(s, a)$ using the Bellman equation:
$$\begin{aligned}
Q^\pi_\gamma(s, a) &= r(s, a) + \gamma\sum_{s'} P(s'|s, a)V^\pi_\gamma(s') \\
&= r(s, a) + \gamma\sum_{s'} P(s'|s, a)\left(\frac{\rho(\pi)}{1-\gamma} + \bar{V}^\pi(s') + g^\pi(\gamma, s')\right) \\
&= r(s, a) + \frac{\gamma\rho(\pi)}{1-\gamma} + \gamma\sum_{s'} P(s'|s, a)\left(\bar{V}^\pi(s') + g^\pi(\gamma, s')\right) \\
&= r(s, a) - \rho(\pi) + \frac{\rho(\pi)}{1-\gamma} + \sum_{s'} P(s'|s, a)\bar{V}^\pi(s') - (1-\gamma)\sum_{s'} P(s'|s, a)\bar{V}^\pi(s') + \gamma\sum_{s'} P(s'|s, a)g^\pi(\gamma, s') \\
&= \bar{Q}^\pi(s, a) + \frac{\rho(\pi)}{1-\gamma} - (1-\gamma)\sum_{s'} P(s'|s, a)\bar{V}^\pi(s') + \gamma\sum_{s'} P(s'|s, a)g^\pi(\gamma, s')
\end{aligned}$$
where we used Proposition A.1 for the second equality. Note that the last two terms in the last equality approach 0 as $\gamma \to 1$; rearranging the terms and taking the limit as $\gamma \to 1$ gives us Equation (20).

We can then similarly rewrite (20) as
$$Q^\pi_\gamma(s, a) = \frac{\rho(\pi)}{1-\gamma} + \bar{Q}^\pi(s, a) + h(\gamma, s, a) \tag{23}$$
with $\lim_{\gamma\to 1} h(\gamma, s, a) = 0$. This allows us to rewrite the discounted advantage function as
$$\begin{aligned}
A^\pi_\gamma(s, a) &= Q^\pi_\gamma(s, a) - V^\pi_\gamma(s) \\
&= \bar{Q}^\pi(s, a) + \frac{\rho(\pi)}{1-\gamma} + h^\pi(s, a, \gamma) - \bar{V}^\pi(s) - \frac{\rho(\pi)}{1-\gamma} - g^\pi(s, \gamma) \\
&= \bar{A}^\pi(s, a) + h^\pi(s, a, \gamma) - g^\pi(s, \gamma)
\end{aligned}$$
Since $h^\pi(s, a, \gamma)$ and $g^\pi(s, \gamma)$ both approach 0 as $\gamma$ approaches 1, taking the limit as $\gamma \to 1$ gives us Equation (21).
B. Proofs
B.1. Proof of Proposition 1
Proposition 1. Consider the bounds in Theorem 1 of Schulman et al. (2015) and Corollary 1 of Achiam et al. (2017). The right hand side of both bounds times $1-\gamma$ goes to negative infinity as $\gamma \to 1$.

Proof. We will give a proof for the case of Corollary 1 in Achiam et al. (2017); a similar argument can be applied to the bound in Theorem 1 of Schulman et al. (2015).

We first state Corollary 1 of Achiam et al. (2017), which says that for any two stationary policies $\pi$ and $\pi'$:
$$\rho_\gamma(\pi') - \rho_\gamma(\pi) \ge \frac{1}{1-\gamma}\left[\mathbb{E}_{\substack{s\sim d_{\pi,\gamma} \\ a\sim\pi'}}\left[A^\pi_\gamma(s, a)\right] - \frac{2\gamma\epsilon_\gamma}{1-\gamma}\,\mathbb{E}_{s\sim d_{\pi,\gamma}}\left[D_{TV}(\pi' \| \pi)[s]\right]\right] \tag{24}$$
where $\epsilon_\gamma = \max_s\left|\mathbb{E}_{a\sim\pi'}\left[A^\pi_\gamma(s, a)\right]\right|$. Since $d_{\pi,\gamma}$ approaches the stationary distribution $d_\pi$ as $\gamma \to 1$, we can multiply the right hand side of (24) by $(1-\gamma)$ and take the limit, which gives us:
$$\lim_{\gamma\to 1}\left(\mathbb{E}_{\substack{s\sim d_{\pi,\gamma} \\ a\sim\pi'}}\left[A^\pi_\gamma(s, a)\right] - \frac{2\gamma\epsilon_\gamma}{1-\gamma}\,\mathbb{E}_{s\sim d_{\pi,\gamma}}\left[D_{TV}(\pi' \| \pi)[s]\right]\right) = \mathbb{E}_{\substack{s\sim d_\pi \\ a\sim\pi'}}\left[\bar{A}^\pi(s, a)\right] - 2\epsilon\,\mathbb{E}_{s\sim d_\pi}\left[D_{TV}(\pi' \| \pi)[s]\right]\lim_{\gamma\to 1}\frac{\gamma}{1-\gamma} = -\infty$$
Here $\epsilon = \max_s\left|\mathbb{E}_{a\sim\pi'}\left[\bar{A}^\pi(s, a)\right]\right|$. The first equality is a direct result of Corollary A.1.
B.2. Proof of Lemma 1
Lemma 1. Under Assumption 2:
$$\rho(\pi') - \rho(\pi) = \mathbb{E}_{\substack{s\sim d_{\pi'} \\ a\sim\pi'}}\left[\bar{A}^\pi(s, a)\right] \tag{4}$$
for any two stochastic policies $\pi$ and $\pi'$.

Proof. We give two approaches for this proof. In the first approach, we directly expand the right-hand side using the definition of the advantage function and the Bellman equation, which gives us:
$$\begin{aligned}
\mathbb{E}_{\substack{s\sim d_{\pi'} \\ a\sim\pi'}}\left[\bar{A}^\pi(s, a)\right] &= \mathbb{E}_{\substack{s\sim d_{\pi'} \\ a\sim\pi'}}\left[\bar{Q}^\pi(s, a) - \bar{V}^\pi(s)\right] \\
&= \mathbb{E}_{\substack{s\sim d_{\pi'} \\ a\sim\pi'}}\left[r(s, a) - \rho(\pi) + \mathbb{E}_{s'\sim P(\cdot|s, a)}\left[\bar{V}^\pi(s')\right] - \bar{V}^\pi(s)\right] \\
&= \rho(\pi') - \rho(\pi) + \mathbb{E}_{\substack{s\sim d_{\pi'},\ a\sim\pi' \\ s'\sim P(\cdot|s, a)}}\left[\bar{V}^\pi(s')\right] - \mathbb{E}_{s\sim d_{\pi'}}\left[\bar{V}^\pi(s)\right]
\end{aligned}$$
Since $d_{\pi'}(s)$ is the stationary distribution:
$$\mathbb{E}_{\substack{s\sim d_{\pi'},\ a\sim\pi' \\ s'\sim P(\cdot|s, a)}}\left[\bar{V}^\pi(s')\right] = \sum_s d_{\pi'}(s)\sum_a \pi'(a|s)\sum_{s'} P(s'|s, a)\bar{V}^\pi(s') = \sum_s d_{\pi'}(s)\sum_{s'} P_{\pi'}(s'|s)\bar{V}^\pi(s') = \sum_{s'} d_{\pi'}(s')\bar{V}^\pi(s')$$
Therefore,
$$\mathbb{E}_{\substack{s\sim d_{\pi'},\ a\sim\pi' \\ s'\sim P(\cdot|s, a)}}\left[\bar{V}^\pi(s')\right] - \mathbb{E}_{s\sim d_{\pi'}}\left[\bar{V}^\pi(s)\right] = 0$$
which gives us the desired result.

Alternatively, we can directly apply Proposition A.1 and Corollary A.1 to Lemma 6.1 of Kakade & Langford (2002) and take the limit as $\gamma \to 1$.
B.3. Proof of Lemma 2
Lemma 2. Under Assumption 2, the following bound holds for any two stochastic policies $\pi$ and $\pi'$:
$$\left|\rho(\pi') - \rho(\pi) - \mathbb{E}_{\substack{s\sim d_\pi \\ a\sim\pi'}}\left[\bar{A}^\pi(s, a)\right]\right| \le 2\epsilon\, D_{TV}(d_{\pi'} \| d_\pi) \tag{5}$$
where $\epsilon = \max_s\left|\mathbb{E}_{a\sim\pi'(\cdot|s)}\left[\bar{A}^\pi(s, a)\right]\right|$.

Proof.
$$\begin{aligned}
\left|\rho(\pi') - \rho(\pi) - \mathbb{E}_{\substack{s\sim d_\pi \\ a\sim\pi'}}\left[\bar{A}^\pi(s, a)\right]\right| &= \left|\mathbb{E}_{\substack{s\sim d_{\pi'} \\ a\sim\pi'}}\left[\bar{A}^\pi(s, a)\right] - \mathbb{E}_{\substack{s\sim d_\pi \\ a\sim\pi'}}\left[\bar{A}^\pi(s, a)\right]\right| \\
&= \left|\sum_s \mathbb{E}_{a\sim\pi'}\left[\bar{A}^\pi(s, a)\right]\big(d_{\pi'}(s) - d_\pi(s)\big)\right| \\
&\le \sum_s\left|\mathbb{E}_{a\sim\pi'}\left[\bar{A}^\pi(s, a)\right]\big(d_{\pi'}(s) - d_\pi(s)\big)\right| \\
&\le \max_s\left|\mathbb{E}_{a\sim\pi'}\left[\bar{A}^\pi(s, a)\right]\right|\,\|d_{\pi'} - d_\pi\|_1 \\
&= 2\epsilon\, D_{TV}(d_{\pi'} \| d_\pi)
\end{aligned}$$
where the last inequality follows from Hölder's inequality.
B.4. Proof of Lemma 3
Lemma 3. Under Assumption 1, the divergence between the stationary distributions $d_\pi$ and $d_{\pi'}$ can be upper bounded by the average divergence between policies $\pi$ and $\pi'$:
$$D_{TV}(d_{\pi'} \| d_\pi) \le (\kappa - 1)\,\mathbb{E}_{s\sim d_\pi}\left[D_{TV}(\pi' \| \pi)[s]\right] \tag{9}$$
where $\kappa = \max_\pi \kappa_\pi$.

Proof. Our proof is based on Markov chain perturbation theory (Cho & Meyer, 2001; Hunter, 2005). Note first that
$$(d^T_{\pi'} - d^T_\pi)(I - P_{\pi'} + P^\star_{\pi'}) = d^T_{\pi'} - d^T_\pi - d^T_{\pi'} + d^T_\pi P_{\pi'} = d^T_\pi P_{\pi'} - d^T_\pi = d^T_\pi(P_{\pi'} - P_\pi) \tag{25}$$
Right multiplying (25) by $(I - P_{\pi'} + P^\star_{\pi'})^{-1}$ gives us:
$$d^T_{\pi'} - d^T_\pi = d^T_\pi(P_{\pi'} - P_\pi)(I - P_{\pi'} + P^\star_{\pi'})^{-1} \tag{26}$$
Recall that $Z_{\pi'} = (I - P_{\pi'} + P^\star_{\pi'})^{-1}$ and $M_{\pi'} = (I - Z_{\pi'} + E Z_{\pi'\,\mathrm{dg}})D_{\pi'}$. Rearranging the terms we find that
$$Z_{\pi'} = I + E Z_{\pi'\,\mathrm{dg}} - M_{\pi'}(D_{\pi'})^{-1} \tag{27}$$
Plugging (27) into (26) gives us
$$d^T_{\pi'} - d^T_\pi = d^T_\pi(P_{\pi'} - P_\pi)\big(I + E Z_{\pi'\,\mathrm{dg}} - M_{\pi'}(D_{\pi'})^{-1}\big) = d^T_\pi(P_{\pi'} - P_\pi)\big(I - M_{\pi'}(D_{\pi'})^{-1}\big) \tag{28}$$
where the last equality is due to $(P_{\pi'} - P_\pi)E = 0$.

Let $\|\cdot\|_p$ denote the operator norm of a matrix; in particular, $\|\cdot\|_1$ and $\|\cdot\|_\infty$ are the maximum absolute column sum and maximum absolute row sum respectively. By the submultiplicative property of operator norms (Horn & Johnson, 2012), we have:
$$\|d_{\pi'} - d_\pi\|_1 = \left\|\big(I - M_{\pi'}(D_{\pi'})^{-1}\big)^T\big(P^T_{\pi'} - P^T_\pi\big)d_\pi\right\|_1 \le \left\|\big(I - M_{\pi'}(D_{\pi'})^{-1}\big)^T\right\|_1\left\|\big(P^T_{\pi'} - P^T_\pi\big)d_\pi\right\|_1 = \left\|I - M_{\pi'}(D_{\pi'})^{-1}\right\|_\infty\left\|\big(P^T_{\pi'} - P^T_\pi\big)d_\pi\right\|_1 \tag{29}$$
We can rewrite $\|I - M_{\pi'}(D_{\pi'})^{-1}\|_\infty$ as
$$\left\|I - M_{\pi'}(D_{\pi'})^{-1}\right\|_\infty = \max_s\left(\sum_{s'} M_{\pi'}(s, s')d_{\pi'}(s') - 1\right) = \kappa_{\pi'} - 1 \tag{30}$$
Finally we bound $\|(P^T_{\pi'} - P^T_\pi)d_\pi\|_1$ by
$$\begin{aligned}
\left\|\big(P^T_{\pi'} - P^T_\pi\big)d_\pi\right\|_1 &= \sum_{s'}\left|\sum_s\left(\sum_a P(s'|s, a)\pi'(a|s) - P(s'|s, a)\pi(a|s)\right)d_\pi(s)\right| \\
&\le \sum_{s', s}\left|\sum_a P(s'|s, a)\big(\pi'(a|s) - \pi(a|s)\big)\right|d_\pi(s) \\
&\le \sum_{s, s', a} P(s'|s, a)\left|\pi'(a|s) - \pi(a|s)\right|d_\pi(s) \\
&\le \sum_{s, a}\left|\pi'(a|s) - \pi(a|s)\right|d_\pi(s) \\
&= 2\,\mathbb{E}_{s\sim d_\pi}\left[D_{TV}(\pi' \| \pi)[s]\right]
\end{aligned} \tag{31}$$
Plugging back into (29) and setting $\kappa = \max_\pi \kappa_\pi$ gives the desired result.
C. Kemeny’s Constant and Mixing Time
Proposition C.1. Under Assumption 1, let $1 = \lambda_1(\pi) > \lambda_2(\pi) \ge \cdots \ge \lambda_{|\mathcal{S}|}(\pi) > -1$ be the eigenvalues of $P_\pi$. We have
$$\kappa_\pi \le 1 + \frac{|\mathcal{S}| - 1}{1 - \lambda(\pi)} \tag{32}$$
where $\lambda(\pi) = \max_{i=2,\ldots,|\mathcal{S}|}|\lambda_i(\pi)|$.

Proof. For brevity, we omit $\pi$ from the notations in our proof. Let $\lambda$ be an eigenvalue of $P$ and $u$ its corresponding eigenvector. Since $P$ is aperiodic, $\lambda \ne -1$; we then have
$$(I - P + P^\star)u = u - Pu + \lim_{n\to\infty} P^n u = (1 - \lambda)u + u\lim_{n\to\infty}\lambda^n = \left(1 - \lambda + \lim_{n\to\infty}\lambda^n\right)u \tag{33}$$
where $\lim_{n\to\infty}\lambda^n = 1$ when $\lambda = 1$ and $0$ when $|\lambda| < 1$. Therefore, $(I - P + P^\star)$ has eigenvalues $1, 1-\lambda_2, \ldots, 1-\lambda_{|\mathcal{S}|}$. The fundamental matrix $Z = (I - P + P^\star)^{-1}$ has eigenvalues $1, \frac{1}{1-\lambda_2}, \ldots, \frac{1}{1-\lambda_{|\mathcal{S}|}}$. We can then upper bound Kemeny's constant by
$$\kappa = \mathrm{trace}(Z) = 1 + \sum_{i=2}^{|\mathcal{S}|}\frac{1}{1-\lambda_i} \le 1 + \sum_{i=2}^{|\mathcal{S}|}\frac{1}{1-|\lambda_i|} \le 1 + \frac{|\mathcal{S}| - 1}{1-\lambda} \tag{34}$$
The quantity $\lambda^\star(\pi)$ is called the Second Largest Eigenvalue Modulus (SLEM). The Perron-Frobenius theorem implies that the transition matrix $P_\pi$ converges to the limiting matrix $P^\star_\pi$ at an exponential rate, and the rate of convergence is determined by the SLEM (see Theorem 4.3.8 of Brémaud (2020) for more details). In fact, the mixing time of a Markov chain is directly related to the SLEM: Markov chains with a larger SLEM take longer to mix, and vice versa (Levin & Peres, 2017).
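As a small illustration (our own sketch, not from the paper), the eigenvalue bound (32) can be compared against the exact value $\kappa_\pi = \operatorname{trace}(Z_\pi)$ on a random aperiodic chain:

```python
import numpy as np

rng = np.random.default_rng(2)
S = 8
P_pi = rng.random((S, S)) + 0.05                    # strictly positive, hence irreducible and aperiodic
P_pi /= P_pi.sum(-1, keepdims=True)

w, v = np.linalg.eig(P_pi.T)
d = np.real(v[:, np.argmin(np.abs(w - 1))]); d /= d.sum()
Z = np.linalg.inv(np.eye(S) - P_pi + np.outer(np.ones(S), d))

kappa = np.trace(Z)                                 # Kemeny's constant, Eq. (34)
slem = np.sort(np.abs(np.linalg.eigvals(P_pi)))[-2]  # second largest eigenvalue modulus
print(kappa <= 1 + (S - 1) / (1 - slem) + 1e-10)
```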
D. Average Reward Policy Improvement Bound for Aperiodic Unichain MDPs
In this section, we consider general aperiodic unichain MDPs, i.e. MDPs which satisfy Assumption 2. We note that Lemma 1 and Lemma 2 both hold under Assumption 2. We can then show the following in the general aperiodic unichain case:
Lemma D.1. For any aperiodic unichain MDP:
$$D_{\mathrm{TV}}(d_{\pi'} \,\|\, d_\pi) \le \zeta\, \mathbb{E}_{s \sim d_\pi}\big[D_{\mathrm{TV}}(\pi' \,\|\, \pi)[s]\big] \qquad (35)$$
where $\zeta = \max_\pi \|Z_\pi\|_\infty$.
Proof. Note that
$$d_{\pi'}^T - d_\pi^T = d_\pi^T (P_{\pi'} - P_\pi)(I - P_{\pi'} + P^\star_{\pi'})^{-1}$$
from Equation (26) still holds in the general aperiodic unichain case. By the submultiplicative property, we have:
$$
\begin{aligned}
\|d_{\pi'} - d_\pi\|_1 &= \big\| \big((I - P_{\pi'} + P^\star_{\pi'})^{-1}\big)^T (P_{\pi'}^T - P_\pi^T)\, d_\pi \big\|_1 \\
&\le \big\| \big((I - P_{\pi'} + P^\star_{\pi'})^{-1}\big)^T \big\|_1 \, \big\| (P_{\pi'}^T - P_\pi^T)\, d_\pi \big\|_1 \\
&= \big\| (I - P_{\pi'} + P^\star_{\pi'})^{-1} \big\|_\infty \, \big\| (P_{\pi'}^T - P_\pi^T)\, d_\pi \big\|_1
\end{aligned} \qquad (36)
$$
Using the same argument as (31) to bound $\big\| (P_{\pi'}^T - P_\pi^T)\, d_\pi \big\|_1$ and setting $\zeta = \max_\pi \|Z_\pi\|_\infty$ gives the desired result.
Combining Lemma 2 and Lemma D.1 gives us the following result:
Theorem 2. For any aperiodic unichain MDP, the following bounds hold for any two stochastic policies $\pi$ and $\pi'$:
$$\rho(\pi') - \rho(\pi) \le \mathbb{E}_{\substack{s \sim d_\pi \\ a \sim \pi'}}\big[\bar A_\pi(s, a)\big] + 2\tilde\xi\, \mathbb{E}_{s \sim d_\pi}\big[D_{\mathrm{TV}}(\pi' \,\|\, \pi)[s]\big] \qquad (37)$$
$$\rho(\pi') - \rho(\pi) \ge \mathbb{E}_{\substack{s \sim d_\pi \\ a \sim \pi'}}\big[\bar A_\pi(s, a)\big] - 2\tilde\xi\, \mathbb{E}_{s \sim d_\pi}\big[D_{\mathrm{TV}}(\pi' \,\|\, \pi)[s]\big] \qquad (38)$$
where $\tilde\xi = \zeta \max_s \mathbb{E}_{a \sim \pi'}\big|\bar A_\pi(s, a)\big|$.
The constant $\zeta$ is always finite; therefore we can similarly apply the approximate policy iteration procedure from Algorithm 1 to generate a sequence of monotonically improving policies.
E. Derivation of ATRPO
In this section, we give the derivation and additional details of the ATRPO algorithm presented in Algorithm 2. The algorithm is similar to TRPO in the discounted case but with several notable distinctions. Recall the trust region optimization problem from (13):
$$
\begin{aligned}
\underset{\pi_\theta \in \Pi_\theta}{\text{maximize}} \quad & \mathbb{E}_{\substack{s \sim d_{\pi_{\theta_k}} \\ a \sim \pi_\theta}}\big[\bar A_{\pi_{\theta_k}}(s, a)\big] \\
\text{subject to} \quad & \bar D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\theta_k}) \le \delta
\end{aligned}
$$
We once again note that the objective above is the expectation of the average-reward advantage function and not the standard discounted advantage function. As done in the derivation of discounted TRPO, we can approximate this problem by performing a first-order Taylor approximation on the objective and a second-order approximation on the KL constraint¹ around $\theta_k$, which gives us:
$$
\begin{aligned}
\underset{\theta}{\text{maximize}} \quad & g^T (\theta - \theta_k) \\
\text{subject to} \quad & \tfrac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k) \le \delta
\end{aligned} \qquad (39)
$$
¹The gradient and first-order Taylor approximation of $\bar D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\theta_k})$ at $\theta = \theta_k$ are zero.
where
$$g := \mathbb{E}_{\substack{s \sim d_{\pi_{\theta_k}} \\ a \sim \pi_{\theta_k}}}\Big[ \nabla_\theta \log \pi_\theta(a|s)\big|_{\theta = \theta_k}\, \bar A_{\pi_{\theta_k}}(s, a) \Big] \qquad (40)$$
and
$$H := \mathbb{E}_{\substack{s \sim d_{\pi_{\theta_k}} \\ a \sim \pi_{\theta_k}}}\Big[ \nabla_\theta \log \pi_\theta(a|s)\big|_{\theta = \theta_k}\, \nabla_\theta \log \pi_\theta(a|s)\big|_{\theta = \theta_k}^T \Big] \qquad (41)$$
Note that this approximation is good provided that the step size $\delta$ is small. The term $g$ is the average-reward policy gradient at $\theta = \theta_k$ with an additional baseline term (Sutton et al., 2000), and $H$ is the Fisher Information Matrix (FIM) (Lehmann & Casella, 2006). The FIM is a symmetric matrix and always positive semi-definite. If we assume $H$ is always positive definite, we can solve (39) analytically with a Lagrange duality argument, which yields the solution:
$$\theta = \theta_k + \sqrt{\frac{2\delta}{g^T H^{-1} g}}\, H^{-1} g \qquad (42)$$
The update rule in (42) has the same form as that of natural policy gradients (Kakade, 2001b) for the average-reward case. Similar to discounted TRPO, both $g$ and $H$ can be approximated using samples drawn from the policy $\pi_{\theta_k}$. The FIM $H$ here is identical to the FIM $H$ for Natural Gradient and TRPO. However, the definition of $g$ is different from the definition of $g$ for discounted TRPO since it includes the average-reward advantage function.
Thus, in order to estimate $g$ we need to estimate
$$\bar A_{\pi_{\theta_k}}(s, a) = \bar Q_{\pi_{\theta_k}}(s, a) - \bar V_{\pi_{\theta_k}}(s) \qquad (43)$$
This can be done in various ways. One approach is to approximate the average-reward bias $\bar V_{\pi_{\theta_k}}(s)$ and then use a one-step TD backup (as was done in Algorithm 2) to estimate the action-bias function. Concretely, combining (43) and the Bellman equation in (18) gives
$$\bar A_{\pi_{\theta_k}}(s, a) = r(s, a) - \rho(\pi_{\theta_k}) + \mathbb{E}_{s' \sim P(\cdot|s, a)}\big[\bar V_{\pi_{\theta_k}}(s')\big] - \bar V_{\pi_{\theta_k}}(s) \qquad (44)$$
This expression involves the average-reward bias $\bar V_{\pi_{\theta_k}}(s)$, which we can approximate using a critic network $\bar V_{\phi_k}(s)$, giving line 7 in Algorithm 2. It remains to specify what the target should be for updating the critic parameter $\phi$. For this, we can similarly make use of the Bellman equation for the average-reward bias in Equation (15), which gives line 6 in Algorithm 2.
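The sketch below illustrates these two steps in numpy-style pseudocode (our own variable names and interface, not the paper's implementation): the critic regression target from the Bellman equation for the bias, and the one-step TD estimate of the average-reward advantage in (44).

```python
import numpy as np

def atrpo_targets_and_advantages(rewards, values, next_values):
    """rewards[t] = r(s_t, a_t); values[t] and next_values[t] are critic
    estimates of the bias V_bar(s_t) and V_bar(s_{t+1})."""
    rho_hat = rewards.mean()                      # sample estimate of the average reward
    v_target = rewards - rho_hat + next_values    # critic target (line 6 of Algorithm 2)
    adv = v_target - values                       # one-step advantage, Eq. (44) (line 7)
    return v_target, adv
```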
Finally, like discounted TRPO, after applying the update term (42), we use a backtracking line search to find an update which has a positive advantage value and also maintains satisfaction of the KL constraint. We also apply the conjugate gradient method to estimate $H^{-1} g$.
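For concreteness, here is a minimal sketch of the update just described (our own simplification, not the authors' code); `fisher_vector_product`, `surrogate_advantage`, and `mean_kl` stand in for the usual sample-based estimates built from the batch collected under $\pi_{\theta_k}$.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g given only Fisher-vector products fvp(v) = H v."""
    x = np.zeros_like(g)
    r, p = g.copy(), g.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = fvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def atrpo_step(theta, g, fvp, surrogate_advantage, mean_kl,
               delta=0.01, backtrack_coeff=0.8, max_backtracks=10):
    """One ATRPO policy update: natural-gradient direction plus backtracking line search."""
    x = conjugate_gradient(fvp, g)                    # x ~= H^{-1} g
    step = np.sqrt(2 * delta / (x @ fvp(x))) * x      # full step from Eq. (42)
    for i in range(max_backtracks):
        theta_new = theta + (backtrack_coeff ** i) * step
        # Accept the first candidate that improves the surrogate objective
        # while keeping the (approximate) KL constraint satisfied.
        if surrogate_advantage(theta_new) > 0 and mean_kl(theta_new) <= delta:
            return theta_new
    return theta                                      # keep the old parameters if the search fails
```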
F. Reinforcement Learning with Average Cost Constraints
F.1. The Constrained RL Problem
In addition to learning to improve its long-term performance, many real-world applications of RL also require the agent to satisfy certain safety constraints. A mathematically principled framework for incorporating safety constraints into RL is the Constrained Markov Decision Process (CMDP). A CMDP (Kallenberg, 1983; Ross, 1985; Altman, 1999) is an MDP equipped with a constraint set $\Pi_c$; a CMDP problem finds a policy $\pi$ that maximizes an agent's long-run reward subject to $\pi \in \Pi_c$. We consider two forms of constraint sets: the average cost constraint set $\{\pi \in \Pi : \rho_c(\pi) \le b\}$ and the discounted cost constraint set $\{\pi \in \Pi : \rho_{c,\gamma}(\pi) \le b\}$. Here $b$ is a given constraint bound, and the cost constraint functions are given by
$$\rho_c(\pi) := \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}_{\tau \sim \pi}\Bigg[ \sum_{t=0}^{N-1} c(s_t, a_t) \Bigg] \qquad (45)$$
$$\rho_{c,\gamma}(\pi) := \mathbb{E}_{\tau \sim \pi}\Bigg[ \sum_{t=0}^{\infty} \gamma^t c(s_t, a_t) \Bigg] \qquad (46)$$
for some bounded cost function $c : S \times A \to [c_{\min}, c_{\max}]$.
F.2. Constrained RL via Local Policy Update
Directly adding cost constraints to any iterative policy improvement algorithm can be sample inefficient, since the cost constraint needs to be evaluated using samples from the new policy after every policy update. Instead, Achiam et al. (2017) proposed updating $\pi_{\theta_k}$ via the following optimization problem:
$$
\begin{aligned}
\underset{\pi_\theta \in \Pi_\theta}{\text{maximize}} \quad & \mathbb{E}_{\substack{s \sim d_{\pi_{\theta_k}} \\ a \sim \pi_\theta}}\big[A^{\pi_{\theta_k}}_{\gamma}(s, a)\big] \\
\text{subject to} \quad & \tilde\rho_{c,\gamma}(\pi_\theta) \le b, \\
& \bar D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\theta_k}) \le \delta.
\end{aligned} \qquad (47)
$$
Here,
$$\tilde\rho_{c,\gamma}(\pi_\theta) := \rho_{c,\gamma}(\pi_{\theta_k}) + \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d_{\pi_{\theta_k}},\, a \sim \pi_\theta}\Big[ A^{\pi_{\theta_k}}_{c,\gamma}(s, a) \Big] \qquad (48)$$
is a surrogate cost function used to approximate the cost constraint, and $A^{\pi_{\theta_k}}_{c,\gamma}(s, a)$ is the discounted cost advantage function, where we replace the reward with the cost². Note that (48) can be evaluated using samples from $\pi_{\theta_k}$. By Corollary 2 of Achiam et al. (2017) and (11):
$$\big| \rho_{c,\gamma}(\pi_\theta) - \tilde\rho_{c,\gamma}(\pi_\theta) \big| \le \frac{\gamma\, \epsilon_{c,\gamma}}{(1 - \gamma)^2} \sqrt{2 \bar D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\theta_k})} \qquad (49)$$
where $\epsilon_{c,\gamma} = \max_s \big| \mathbb{E}_{a \sim \pi'}\big[A^{\pi}_{c,\gamma}(s, a)\big] \big|$. This shows that the surrogate cost is a good approximation to $\rho_{c,\gamma}(\pi_\theta)$ when $\pi_\theta$ and $\pi_{\theta_k}$ are close w.r.t. the KL divergence. Using (49) and the trust region constraint, the worst-case constraint violation when $\pi_{\theta_{k+1}}$ is the solution to (47) can be upper bounded (Proposition 2 of Achiam et al. (2017)).
This framework is problematic when the cost constraint is undiscounted. Define the average surrogate cost as
$$\tilde\rho_c(\pi_\theta) := \rho_c(\pi_{\theta_k}) + \mathbb{E}_{\substack{s \sim d_{\pi_{\theta_k}} \\ a \sim \pi_\theta}}\big[\bar A^{\pi_{\theta_k}}_c(s, a)\big] \qquad (50)$$
where $\bar A^{\pi_{\theta_k}}_c(s, a)$ is the average cost advantage function. We can easily show that
$$\lim_{\gamma \to 1}\, (1 - \gamma)\big(\rho_{c,\gamma}(\pi_\theta) - \tilde\rho_{c,\gamma}(\pi_\theta)\big) = \rho_c(\pi_\theta) - \tilde\rho_c(\pi_\theta) \quad \text{and} \quad \lim_{\gamma \to 1}\, \frac{\gamma\, \epsilon_{c,\gamma}}{1 - \gamma} \sqrt{2 \bar D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\theta_k})} = \infty
$$
However, by Theorem 1³ and (11):
$$\big| \rho_c(\pi_\theta) - \tilde\rho_c(\pi_\theta) \big| \le \xi^{\pi_\theta}_c \sqrt{2 \bar D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\theta_k})} \qquad (51)$$
where $\xi^{\pi_\theta}_c = (\kappa - 1) \max_s \mathbb{E}_{a \sim \pi_\theta}\big|\bar A^{\pi_{\theta_k}}_c(s, a)\big|$. We then have the following result:
Proposition F.1. Suppose $\pi_\theta$ and $\pi_{\theta_k}$ satisfy the constraints $\tilde\rho_c(\pi_\theta) < b$ and $\bar D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\theta_k}) \le \delta$. Then
$$\rho_c(\pi_\theta) \le b + \xi^{\pi_\theta}_c \sqrt{2\delta} \qquad (52)$$
The upper bound in Proposition F.1 provides a worst-case constraint violation guarantee when $\pi_\theta$ is the solution to the average-cost variant of (47). It is an undiscounted parallel to Proposition 2 in Achiam et al. (2017), which provides a similar guarantee for the discounted case. It shows that, contrary to what was previously believed (Tessler et al., 2019), (47) can easily be modified to accommodate average cost constraints and still satisfy an upper bound on worst-case constraint violation. Scalable algorithms have been proposed for approximately solving (47) (Achiam et al., 2017; Zhang et al., 2020). Proposition F.1 shows that these algorithms can be generalized to average cost constraints with only minor modifications. In the next section, we show how the CPO algorithm (Achiam et al., 2017) can be modified for average cost constraints.
²We can define the discounted value/action-value cost functions and the average cost bias/action-bias functions in a similar manner.
³It is straightforward to show that the theorem still holds when we replace the reward with the cost.
F.3. Average Cost CPO (ACPO)
Consider the average cost variant of (47):
$$
\begin{aligned}
\underset{\pi_\theta \in \Pi_\theta}{\text{maximize}} \quad & \mathbb{E}_{\substack{s \sim d_{\pi_{\theta_k}} \\ a \sim \pi_\theta}}\big[\bar A_{\pi_{\theta_k}}(s, a)\big] \\
\text{subject to} \quad & \tilde\rho_c(\pi_\theta) \le b, \\
& \bar D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\theta_k}) \le \delta.
\end{aligned} \qquad (53)
$$
Similar to TRPO/ATRPO, we can apply first- and second-order Taylor approximations to (53), which then gives us
$$
\begin{aligned}
\underset{\theta}{\text{maximize}} \quad & g^T (\theta - \theta_k) \\
\text{subject to} \quad & \tilde c + \tilde g^T (\theta - \theta_k) \le 0, \\
& \tfrac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k) \le \delta
\end{aligned} \qquad (54)
$$
where $g$ and $H$ were defined in the previous section, $\tilde c = \rho_c(\pi_{\theta_k}) - b$, and
$$\tilde g := \mathbb{E}_{\substack{s \sim d_{\pi_{\theta_k}} \\ a \sim \pi_{\theta_k}}}\Big[ \nabla_\theta \log \pi_\theta(a|s)\big|_{\theta = \theta_k}\, \bar A^{\pi_{\theta_k}}_c(s, a) \Big] \qquad (55)$$
is the gradient of the constraint. Similar to the case of ATRPO, $g$, $\tilde g$, $H$, and $\tilde c$ can all be approximated using samples collected from $\pi_{\theta_k}$. The term $\bar A^{\pi_{\theta_k}}_c(s, a)$ also involves the cost bias function (see Equation (44)), which can be approximated via a separate cost critic network. The optimization problem (54) is a convex optimization problem where strong duality holds; hence it can be solved using a simple Lagrangian argument. The update rule takes the form
$$\theta = \theta_k + \frac{1}{\lambda} H^{-1}\big(g - \nu \tilde g\big) \qquad (56)$$
where $\lambda$ and $\nu$ are Lagrange multipliers satisfying (Achiam et al., 2017)
$$\max_{\lambda, \nu \ge 0}\; -\frac{1}{2\lambda}\Big( g^T H^{-1} g + 2\nu\, g^T H^{-1} \tilde g + \nu^2\, \tilde g^T H^{-1} \tilde g \Big) + \nu \tilde c - \frac{1}{2} \lambda \delta \qquad (57)$$
The dual problem (57) can be solved explicitly (Achiam et al., 2017). Similar to ATRPO, we use the conjugate gradient method to estimate $H^{-1} g$ and $H^{-1} \tilde g$, and perform a backtracking line search procedure to guarantee approximate constraint satisfaction.
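As a sanity check on the approximation (54) (our own illustration; it solves the small quadratically constrained problem with a generic solver rather than the analytic dual solution used by CPO/ACPO), one can inspect a candidate update direction for synthetic $g$, $\tilde g$, $H$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 5
g = rng.normal(size=n)                                   # objective gradient
g_tilde = rng.normal(size=n)                             # constraint gradient
A = rng.normal(size=(n, n)); H = A @ A.T + np.eye(n)     # positive-definite stand-in for the FIM
c_tilde, delta = -0.05, 0.01                             # constraint slack and trust region size

# Solve: maximize g^T x  s.t.  c_tilde + g_tilde^T x <= 0  and  0.5 x^T H x <= delta
res = minimize(
    lambda x: -g @ x,
    x0=np.zeros(n),
    constraints=[
        {"type": "ineq", "fun": lambda x: -(c_tilde + g_tilde @ x)},  # linearized cost constraint
        {"type": "ineq", "fun": lambda x: delta - 0.5 * x @ H @ x},   # quadratic trust region
    ],
    method="SLSQP",
)
x_star = res.x                                           # candidate theta - theta_k
print(g @ x_star, c_tilde + g_tilde @ x_star, 0.5 * x_star @ H @ x_star)
```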
F.4. Experiment Results
For constrained RL algorithms, being able to accurately evaluate the cost constraint for a particular policy is key to learning constraint-satisfying policies. In this section, we consider the MuJoCo agents from Section 6. However, for safety reasons, we wish the agent to maintain its average speed over a trajectory below a certain threshold, which is set at 2.0 for all environments.
Here we use the same evaluation protocol as was introduced in Section 6.1, but we calculate the average cost (total cost / trajectory length) as well as the total return for each evaluation trajectory. We used a maximum trajectory length of 1000 for these experiments. The results of our experiments are plotted in Figure 3.
From Figure 3 we see that ACPO is able to learn high-performing policies while enforcing the average cost constraint.
G. Generalized Advantage Estimator (GAE) for the Average Reward Setting
Suppose the agent collects a batch of data consisting of trajectories, each of length $N$: $\{s_t, a_t, r_t, s_{t+1}\}$, $t = 1, \ldots, N$, using policy $\pi$. Similar to what is commonly done for critic estimation in on-policy methods, we fit some value function $V^\pi_\phi$ parameterized by $\phi$ using data collected with the policy.
We will first review how this is done in the discounted case. Two of the most common ways of calculating the regression target for $V^\pi_\phi$ are the Monte Carlo target, denoted by
$$V^{\text{target}}_t = \sum_{t' = t}^{N} \gamma^{t' - t} r_{t'}, \qquad (58)$$
Figure 3. Performance of ACPO. Unconstrained ATRPO is plotted for comparison. The x-axis is the number of agent-environment interactions and the y-axis is the total return averaged over 10 seeds. The solid line represents the agents' average total return (top row) and average cost (bottom row) on the evaluation trajectories. The shaded region represents one standard deviation.
or the bootstrapped target
$$V^{\text{target}}_t = r_t + \gamma V^\pi_\phi(s_{t+1}). \qquad (59)$$
Using the dataset $\{s_t, V^{\text{target}}_t\}$, we can fit $V^\pi_\phi$ with supervised regression by minimizing the MSE between $V^\pi_\phi(s_t)$ and $V^{\text{target}}_t$.
With the fitted value function, we can estimate the advantage function either with the Monte Carlo estimator
$$\hat A^\pi_{\mathrm{MC}}(s_t, a_t) = \sum_{t' = t}^{N} \gamma^{t' - t} r_{t'} - V^\pi_\phi(s_t)$$
or the bootstrap estimator
$$\hat A^\pi_{\mathrm{BS}}(s_t, a_t) = r_t + \gamma V^\pi_\phi(s_{t+1}) - V^\pi_\phi(s_t).$$
When the Monte Carlo advantage estimator is used to approximate the policy gradient, it does not introduce a bias but tends to have high variance, whereas the bootstrapped estimator introduces a bias but tends to have lower variance. These two estimators are seen as the two extreme ends of the bias-variance trade-off. In order to have better control over the bias and variance, Schulman et al. (2016) used the idea of eligibility traces (Sutton & Barto, 2018) and introduced the Generalized Advantage Estimator (GAE). The GAE takes the form
$$\hat A_{\mathrm{GAE}}(s_t, a_t) = \sum_{t' = t}^{N} (\gamma\lambda)^{t' - t}\, \delta_{t'} \qquad (60)$$
where
$$\delta_{t'} = r_{t'} + \gamma V^\pi_\phi(s_{t'+1}) - V^\pi_\phi(s_{t'}) \qquad (61)$$
and $\lambda \in [0, 1]$ is the eligibility trace parameter. We can then use the parameter $\lambda$ to tune the bias-variance trade-off. It is worth noting two special cases corresponding to the bootstrap and Monte Carlo estimators:
$$\lambda = 0: \quad \hat A_{\mathrm{GAE}}(s_t, a_t) = r_t + \gamma V^\pi_\phi(s_{t+1}) - V^\pi_\phi(s_t)$$
$$\lambda = 1: \quad \hat A_{\mathrm{GAE}}(s_t, a_t) = \sum_{t' = t}^{N} \gamma^{t' - t} r_{t'} - V^\pi_\phi(s_t)$$
For infinite horizon tasks, the discount factor $\gamma$ is used to reduce variance by downweighting rewards far into the future (Schulman et al., 2016). Also noted in Schulman et al. (2016) is that for any $l \gg 1/(1-\gamma)$, $\gamma^l$ decreases rapidly, and any effects resulting from actions after $l \approx 1/(1-\gamma)$ steps are effectively "forgotten". This approach in essence converts a continuous control task into an episodic task where any rewards received after $l \approx 1/(1-\gamma)$ steps become negligible. This undermines the original continuing nature of the task and could prove especially problematic for problems where the effects of actions are delayed far into the future. However, increasing $\gamma$ would lead to an increase in variance. Thus in practice $\gamma$ is often treated as a hyperparameter to balance the effective horizon of the task and the variance of the gradient estimator.
To mitigate this, we now describe how critics can be formulated for the average reward. A key difference is that in the discounted case we use $V^\pi_\phi$ to approximate the discounted value function, whereas in the average reward case $\bar V^\pi_\phi$ is used to approximate the average reward bias function.
Let
$$\hat\rho_\pi = \frac{1}{N} \sum_{t=1}^{N} r_t$$
denote the estimated average reward. The Monte Carlo target for the average reward value function is
$$\bar V^{\text{target}}_t = \sum_{t' = t}^{N} \big(r_{t'} - \hat\rho_\pi\big) \qquad (62)$$
and the bootstrapped target is
$$\bar V^{\text{target}}_t = r_t - \hat\rho_\pi + \bar V^\pi_\phi(s_{t+1}). \qquad (63)$$
Note that our targets (62)-(63) are distinctly different from the traditional discounted targets (58)-(59).
The Monte Carlo and bootstrap estimators for the average reward advantage function are:
$$\hat A^\pi_{\mathrm{MC}}(s_t, a_t) = \sum_{t' = t}^{N} \big(r_{t'} - \hat\rho_\pi\big) - \bar V^\pi_\phi(s_t)$$
$$\hat A^\pi_{\mathrm{BS}}(s_t, a_t) = r_t - \hat\rho_\pi + \bar V^\pi_\phi(s_{t+1}) - \bar V^\pi_\phi(s_t)$$
We can similarly extend the GAE to the average reward setting:
$$\hat A_{\mathrm{GAE}}(s_t, a_t) = \sum_{t' = t}^{N} \lambda^{t' - t}\, \delta_{t'} \qquad (64)$$
where
$$\delta_{t'} = r_{t'} - \hat\rho_\pi + \bar V^\pi_\phi(s_{t'+1}) - \bar V^\pi_\phi(s_{t'}), \qquad (65)$$
and set the target for the value function to
$$\bar V^{\text{target}}_t = r_t - \hat\rho_\pi + \bar V^\pi_\phi(s_{t+1}) + \sum_{t' = t+1}^{N} \lambda^{t' - t}\, \delta_{t'} \qquad (66)$$
The two special cases corresponding to $\lambda = 0$ and $\lambda = 1$ are
$$\lambda = 0: \quad \hat A_{\mathrm{GAE}}(s_t, a_t) = r_t - \hat\rho_\pi + \bar V^\pi_\phi(s_{t+1}) - \bar V^\pi_\phi(s_t)$$
$$\lambda = 1: \quad \hat A_{\mathrm{GAE}}(s_t, a_t) = \sum_{t' = t}^{N} \big(r_{t'} - \hat\rho_\pi\big) - \bar V^\pi_\phi(s_t)$$
We note again that the average reward advantage estimator is distinct from the discounted case. To summarize, in the average reward setting:
• The parameterized value function is used to fit the average reward bias function.
• The reward term $r_t$ in the discounted formulation is replaced by $r_t - \hat\rho_\pi$.
• Without any discount factors, recent and future experiences are weighed equally, thus respecting the continuing nature of the task.
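A minimal numpy sketch of the average-reward GAE computation in (64)-(66) is given below (our own illustration; function and variable names are not from the paper):

```python
import numpy as np

def average_reward_gae(rewards, values, next_values, lam=0.95):
    """Average-reward GAE advantages and critic targets for one trajectory.

    rewards[t]     : r_t
    values[t]      : critic bias estimate  V_bar(s_t)
    next_values[t] : critic bias estimate  V_bar(s_{t+1})
    """
    rho_hat = rewards.mean()                              # estimated average reward
    deltas = rewards - rho_hat + next_values - values     # Eq. (65)
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):               # backward recursion for Eq. (64)
        running = deltas[t] + lam * running
        adv[t] = running
    v_targets = adv + values                              # equivalent to the target in Eq. (66)
    return adv, v_targets
```

Setting lam=0 recovers the bootstrap estimator and lam=1 the Monte Carlo estimator above; as noted in Appendix H, the advantage values are then normalized by their batch mean and standard deviation before being used for policy updates.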
H. Experimental Setup
All experiments were implemented in PyTorch 1.3.1 and Python 3.7.4 on Intel Xeon Gold 6230 processors. We based our TRPO implementation on https://github.com/ikostrikov/pytorch-trpo and https://github.com/Khrylx/PyTorch-RL. Our CPO implementation is our own PyTorch implementation based on https://github.com/jachiam/cpo and https://github.com/openai/safety-starter-agents. Our hyperparameter selections were also based on these implementations. Our choice of hyperparameters was motivated by wanting to put discounted TRPO in the best possible light and compare its performance with ATRPO. Our hyperparameter choices for ATRPO mirrored the discounted case, since we wanted to understand how performance for the average reward case differs while controlling for all other variables.
With the exception of the results in Appendix I.2, the reset cost is set to 100 in all three environments. In the original implementation of the MuJoCo environments in OpenAI Gym, the maximum episode length is set to 1000⁴; we removed this restriction in our experiments in order to study long-run performance.
We used a two-layer feedforward neural network with tanh activations for both our policy and critic networks. The policy is Gaussian with a diagonal covariance matrix. The policy network outputs a mean vector and a vector containing the state-independent log standard deviations. States are normalized by the running mean and the running standard deviation before being fed to any network. We used the GAE for advantage estimation (see Appendix G). The advantage values are normalized by their batch mean and batch standard deviation before being used for policy updates. Learning rates are linearly annealed to 0 over the course of training. Note that these settings are common in most open-source implementations of TRPO and other on-policy algorithms. For training and evaluation, we used different random seeds (i.e. the random seeds used to generate the evaluation trajectories are different from those used during training). Table 2 summarizes the hyperparameters used in our experiments.
Table 2. Hyperparameter Setup

Hyperparameter                               TRPO/ATRPO    CPO/ACPO
No. of hidden layers                         2             2
No. of hidden nodes                          64            64
Activation                                   tanh          tanh
Initial log std                              -0.5          -1
Batch size                                   5000          5000
GAE parameter (reward)                       0.95          0.95
GAE parameter (cost)                         N/A           0.95
Learning rate for policy                     3 × 10⁻⁴      3 × 10⁻⁴
Learning rate for reward critic net          3 × 10⁻⁴      3 × 10⁻⁴
Learning rate for cost critic net            N/A           3 × 10⁻⁴
L2-regularization coeff. for critic net      3 × 10⁻³      3 × 10⁻³
Damping coeff.                               0.01          0.01
Backtracking coeff.                          0.8           0.8
Max backtracking iterations                  10            10
Max conjugate gradient iterations            10            10
Trust region bound δ                         0.01          0.01
I. Additional Experiments
I.1. Comparing with TRPO Trained Without Resets
Figure 4 repeats the experiments presented in Figure 1, except that discounted TRPO is trained in the standard MuJoCo setting without any resets (i.e. during training, when the agent falls, the trajectory terminates). The maximum length of a TRPO training episode is 1000. This is identical to how TRPO is trained in the literature for the MuJoCo environments. We apply the same evaluation protocol introduced in Section 6.1. We note that when TRPO is trained in the standard MuJoCo setting, ATRPO still outperforms discounted TRPO by a significant margin.

⁴See https://github.com/openai/gym/blob/master/gym/envs/__init__.py
Figure 4. Comparing the performance of ATRPO and TRPO with different discount factors. TRPO is trained without the reset scheme. The x-axis is the number of agent-environment interactions and the y-axis is the total return averaged over 10 seeds. The solid line represents the agents' performance on evaluation trajectories of maximum length 1,000 (top row) and 10,000 (bottom row). The shaded region represents one standard deviation.
Figure 5. Comparing the performance of ATRPO and TRPO trained with and without the reset costs. The curves for TRPO are for the best discount factor for each environment. The x-axis is the number of agent-environment interactions and the y-axis is the total return averaged over 10 seeds. The solid line represents the agents' performance on evaluation trajectories of maximum length 1,000 (top row) and 10,000 (bottom row). The shaded region represents one standard deviation.
In Figure 5 we plot the performance of the best discount factor for each environment for TRPO trained with and without the reset scheme (i.e. the best-performing TRPO curves from Figure 1 and Figure 4). ATRPO is also plotted for comparison. We note here that the performance of TRPO trained with and without the reset scheme is quite similar; this further supports the notion that introducing the reset scheme does not alter the goal of the tasks.
I.2. Sensitivity Analysis on Reset Cost
For the experiments presented in Figure 2, we introduced a reset cost in order to simulate an infinite horizon setting. Here
we analyze the sensitivity of the results with respect to this reset cost.
Figure 6. Comparing ATRPO trained with different reset costs to discounted TRPO with the best discount factor for each environment. The x-axis is the number of agent-environment interactions and the y-axis is the total return averaged over 10 seeds. The solid line represents the agents' performance on evaluation trajectories of maximum length 1,000. The shaded region represents one standard deviation.
Figure 6 shows that ATRPO is largely insensitive to the choice of reset cost. We note, though, that for Humanoid, extremely large reset costs (200 and 500) do negatively impact performance, but the results remain above those of TRPO.
References

Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C., and Weisz, G. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pp. 3692-3702, 2019.

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 22-31. JMLR.org, 2017.

Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Optimality and approximation with policy gradient methods in Markov decision processes. In Conference on Learning Theory, pp. 64-66. PMLR, 2020.

Amit, R., Meir, R., and Ciosek, K. Discount factor as a regularizer in reinforcement learning. In International Conference on Machine Learning, 2020.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., et al. What matters in on-policy reinforcement learning? A large-scale empirical study. arXiv preprint arXiv:2006.05990, 2020.