
Algorithms for Reinforcement Learning

Draft of the lecture published in the

Synthesis Lectures on Artiﬁcial Intelligence and Machine Learning

series

by

Morgan & Claypool Publishers

Csaba Szepesvári

June 9, 2009∗

Contents

1 Overview 3

2 Markov decision processes 7

2.1 Preliminaries ................................... 7

2.2 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3 Value functions .................................. 12

2.4 Dynamic programming algorithms for solving MDPs . . . . . . . . . . . . . . 16

3 Value prediction problems 17

3.1 Temporal diﬀerence learning in ﬁnite state spaces . . . . . . . . . . . . . . . 18

3.1.1 Tabular TD(0) .............................. 18

3.1.2 Every-visit Monte-Carlo . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.1.3 TD(λ): Unifying Monte-Carlo and TD(0) . . . . . . . . . . . . . . . . 23

3.2 Algorithms for large state spaces . . . . . . . . . . . . . . . . . . . . . . . . 25

3.2.1 TD(λ) with function approximation . . . . . . . . . . . . . . . . . . . 29

3.2.2 Gradient temporal diﬀerence learning . . . . . . . . . . . . . . . . . . 33

3.2.3 Least-squares methods . . . . . . . . . . . . . . . . . . . . . . . . . . 36

∗Last update: August 18, 2010


3.2.4 The choice of the function space . . . . . . . . . . . . . . . . . . . . . 42

4 Control 45

4.1 A catalog of learning problems . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.2 Closed-loop interactive learning . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2.1 Online learning in bandits . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2.2 Active learning in bandits . . . . . . . . . . . . . . . . . . . . . . . . 49

4.2.3 Active learning in Markov Decision Processes . . . . . . . . . . . . . 50

4.2.4 Online learning in Markov Decision Processes . . . . . . . . . . . . . 51

4.3 Direct methods .................................. 56

4.3.1 Q-learning in ﬁnite MDPs . . . . . . . . . . . . . . . . . . . . . . . . 56

4.3.2 Q-learning with function approximation . . . . . . . . . . . . . . . . 59

4.4 Actor-critic methods ............................... 62

4.4.1 Implementing a critic . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.4.2 Implementing an actor . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5 For further exploration 72

5.1 Further reading .................................. 72

5.2 Applications.................................... 73

5.3 Software...................................... 73

5.4 Acknowledgements ................................ 73

A The theory of discounted Markovian decision processes 74

A.1 Contractions and Banach’s ﬁxed-point theorem . . . . . . . . . . . . . . . . 74

A.2 Application to MDPs ............................... 78

Abstract

Reinforcement learning is a learning paradigm concerned with learning to control a

system so as to maximize a numerical performance measure that expresses a long-term

objective. What distinguishes reinforcement learning from supervised learning is that

only partial feedback is given to the learner about the learner’s predictions. Further,

the predictions may have long term eﬀects through inﬂuencing the future state of the

controlled system. Thus, time plays a special role. The goal in reinforcement learning

is to develop eﬃcient learning algorithms, as well as to understand the algorithms’

merits and limitations. Reinforcement learning is of great interest because of the large

number of practical applications that it can be used to address, ranging from problems

in artiﬁcial intelligence to operations research or control engineering. In this book, we

focus on those algorithms of reinforcement learning that build on the powerful theory of

dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas together with a large number of state-of-the-art algorithms, followed by the discussion of their theoretical properties and limitations.

Figure 1: The basic reinforcement learning scenario

Keywords: reinforcement learning; Markov Decision Processes; temporal difference learning; stochastic approximation; two-timescale stochastic approximation; Monte-Carlo methods; simulation optimization; function approximation; stochastic gradient methods; least-squares methods; overfitting; bias-variance tradeoff; online learning; active learning; planning; simulation; PAC-learning; Q-learning; actor-critic methods; policy gradient; natural gradient

1 Overview

Reinforcement learning (RL) refers to both a learning problem and a subﬁeld of machine

learning. As a learning problem, it refers to learning to control a system so as to maxi-

mize some numerical value which represents a long-term objective. A typical setting where

reinforcement learning operates is shown in Figure 1: A controller receives the controlled

system’s state and a reward associated with the last state transition. It then calculates an

action which is sent back to the system. In response, the system makes a transition to a

new state and the cycle is repeated. The problem is to learn a way of controlling the system

so as to maximize the total reward. The learning problems diﬀer in the details of how the

data is collected and how performance is measured.
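The interaction loop of Figure 1 can be sketched in a few lines of code. This is an illustrative skeleton only, not an algorithm from the book: `env_step`, `policy`, and the toy system below are made-up names standing in for a concrete system and controller.

```python
def run_loop(env_step, policy, x0, horizon=10):
    """The controller-system loop of Figure 1: observe the state and reward,
    compute an action, send it to the system, repeat.

    `env_step(x, a) -> (next_state, reward)` and `policy(x) -> action` are
    placeholders for a concrete system and controller."""
    x, total = x0, 0.0
    for _ in range(horizon):
        a = policy(x)            # controller computes an action from the state
        x, r = env_step(x, a)    # system transitions and emits a reward
        total += r               # performance measure: total reward collected
    return total

# A toy system: integer state, actions +1/-1, reward 1 whenever the state is 0.
def toy_step(x, a):
    return x + a, (1.0 if x == 0 else 0.0)

print(run_loop(toy_step, lambda x: -1 if x > 0 else +1, x0=3))  # 4.0
```

The controller here is a fixed rule; the learning problems discussed in the book differ precisely in how such a rule is improved from the observed states and rewards.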

In this book, we assume that the system that we wish to control is stochastic. Further,

we assume that the measurements available on the system’s state are detailed enough so

that the controller can avoid reasoning about how to collect information about the


state. Problems with these characteristics are best described in the framework of Markovian

Decision Processes (MDPs). The standard approach to ‘solve’ MDPs is to use dynamic

programming, which transforms the problem of ﬁnding a good controller into the problem

of ﬁnding a good value function. However, apart from the simplest cases when the MDP has

very few states and actions, dynamic programming is infeasible. The RL algorithms that

we discuss here can be thought of as a way of turning the infeasible dynamic programming

methods into practical algorithms so that they can be applied to large-scale problems.

There are two key ideas that allow RL algorithms to achieve this goal. The ﬁrst idea is to

use samples to compactly represent the dynamics of the control problem. This is important

for two reasons: First, it allows one to deal with learning scenarios when the dynamics is

unknown. Second, even if the dynamics is available, exact reasoning that uses it might

be intractable on its own. The second key idea behind RL algorithms is to use powerful

function approximation methods to compactly represent value functions. The signiﬁcance

of this is that it allows dealing with large, high-dimensional state- and action-spaces. What

is more, the two ideas ﬁt nicely together: Samples may be focused on a small subset of the

spaces they belong to, which clever function approximation techniques might exploit. It is

the understanding of the interplay between dynamic programming, samples and function

approximation that is at the heart of designing, analyzing and applying RL algorithms.

The purpose of this book is to allow the reader to have a chance to peek into this beautiful

ﬁeld. However, certainly we are not the ﬁrst to set out to accomplish this goal. In 1996,

Kaelbling et al. wrote a nice, compact survey about the approaches and algorithms

available at the time (Kaelbling et al., 1996). This was followed by the publication of the book

by Bertsekas and Tsitsiklis (1996), which detailed the theoretical foundations. A few years

later Sutton and Barto, the ‘fathers’ of RL, published their book, where they presented their

ideas on RL in a very clear and accessible manner (Sutton and Barto, 1998). A more recent

and comprehensive overview of the tools and techniques of dynamic programming/optimal

control is given in the two-volume book by Bertsekas (2007a,b) which devotes one chapter

to RL methods.¹ At times, when a field is rapidly developing, books can get out of date

pretty quickly. In fact, to keep up with the growing body of new results, Bertsekas maintains

an online version of his Chapter 6 of Volume II of his book, which, at the time of writing

this survey, counted as much as 160 pages (Bertsekas, 2010). Other recent books on the

subject include the book of Gosavi (2003) who devotes 60 pages to reinforcement learning

algorithms in Chapter 9, concentrating on average cost problems, or that of Cao (2007) who

focuses on policy gradient methods. Powell (2007) presents the algorithms and ideas from an

operations research perspective and emphasizes methods that are capable of handling large

¹ In this book, RL is called neuro-dynamic programming or approximate dynamic programming. The

term neuro-dynamic programming stems from the fact that, in many cases, RL algorithms are used with

artiﬁcial neural networks.


control spaces, Chang et al. (2008) focuses on adaptive sampling (i.e., simulation-based

performance optimization), while the center of the recent book by Busoniu et al. (2010) is

function approximation.

Thus, by no means do RL researchers lack a good body of literature. However, what seems

to be missing is a self-contained and yet relatively short summary that can help newcomers

to the ﬁeld to develop a good sense of the state of the art, as well as existing researchers to

broaden their overview of the ﬁeld, an article, similar to that of Kaelbling et al. (1996), but

with updated contents. Filling this gap is the very purpose of this short book.

Having the goal of keeping the text short, we had to make a few, hopefully not too troubling, compromises. The first compromise we made was to present results only for the total

expected discounted reward criterion. This choice is motivated by the fact that this is the criterion

that is both widely used and the easiest to deal with mathematically. The next compro-

mise is that the background on MDPs and dynamic programming is kept ultra-compact

(although an appendix is added that explains these basic results). Apart from these, the

book aims to cover a bit of all aspects of RL, up to the level that the reader should be

able to understand the whats and hows, as well as to implement the algorithms presented.

Naturally, we still had to be selective in what we present. Here, the decision was to focus

on the basic algorithms, ideas, as well as the available theory. Special attention was paid to

describing the choices of the user, as well as the tradeoﬀs that come with these. We tried

to be impartial as much as possible, but some personal bias, as usual, surely remained. The

pseudocode of almost twenty algorithms was included, hoping that this will make it easier

for the practically inclined reader to implement the algorithms described.

The target audience is advanced undergraduate and graduate students, as well as researchers

and practitioners who want to get a good overview of the state of the art in RL quickly.

Researchers who are already working on RL might also enjoy reading about parts of the RL

literature that they are not so familiar with, thus broadening their perspective on RL. The

reader is assumed to be familiar with the basics of linear algebra, calculus, and probability

theory. In particular, we assume that the reader is familiar with the concepts of random

variables, conditional expectations, and Markov chains. It is helpful, but not necessary,

for the reader to be familiar with statistical learning theory, as the essential concepts will

be explained as needed. In some parts of the book, knowledge of regression techniques of

machine learning will be useful.

This book has three parts. In the ﬁrst part, in Section 2, we provide the necessary back-

ground. It is here where the notation is introduced, followed by a short overview of the

theory of Markov Decision Processes and the description of the basic dynamic programming

algorithms. Readers familiar with MDPs and dynamic programming should skim through

this part to familiarize themselves with the notation used. Readers who are less familiar with MDPs should spend enough time here before moving on, because the rest of the book

builds heavily on the results and ideas presented here.

The remaining two parts are devoted to the two basic RL problems (cf. Figure 1), one part

devoted to each. In Section 3, the problem of learning to predict values associated with

states is studied. We start by explaining the basic ideas for the so-called tabular case when

the MDP is small enough so that one can store one value per state in an array allocated in

a computer’s main memory. The ﬁrst algorithm explained is TD(λ), which can be viewed

as the learning analogue to value iteration from dynamic programming. After this, we

consider the more challenging situation when there are more states than what ﬁts into a

computer’s memory. Clearly, in this case, one must compress the table representing the

values. Abstractly, this can be done by relying on an appropriate function approximation

method. First, we describe how TD(λ) can be used in this situation. This is followed by the

description of some new gradient based methods (GTD2 and TDC), which can be viewed

as improved versions of TD(λ) in that they avoid some of the convergence diﬃculties that

TD(λ) faces. We then discuss least-squares methods (in particular, LSTD(λ) and λ-LSPE)

and compare them to the incremental methods described earlier. Finally, we describe choices

available for implementing function approximation and the tradeoﬀs that these choices come

with.

The second part (Section 4) is devoted to algorithms that are developed for control learning.

First, we describe methods whose goal is optimizing online performance. In particular,

we describe the “optimism in the face of uncertainty” principle and methods that explore

their environment based on this principle. State of the art algorithms are given both for

bandit problems and MDPs. The message here is that clever exploration methods make

a large diﬀerence, but more work is needed to scale up the available methods to large

problems. The rest of this section is devoted to methods that aim at developing methods

that can be used in large-scale applications. As learning in large-scale MDPs is signiﬁcantly

more diﬃcult than learning when the MDP is small, the goal of learning is relaxed to

learning a good enough policy in the limit. First, direct methods are discussed which aim at

estimating the optimal action-values directly. These can be viewed as the learning analogue

of value iteration of dynamic programming. This is followed by the description of actor-

critic methods, which can be thought of as the counterpart of the policy iteration algorithm

of dynamic programming. Both methods based on direct policy improvement and policy

gradient (i.e., which use parametric policy classes) are presented.

The book is concluded in Section 5, which lists some topics for further exploration.


[Figure labels: Prediction; Value iteration; Policy iteration; Policy search; Control]

Figure 2: Types of reinforcement problems and approaches.

2 Markov decision processes

The purpose of this section is to introduce the notation that will be used in the subsequent

parts and the most essential facts that we will need from the theory of Markov Decision

Processes (MDPs) in the rest of the book. Readers familiar with MDPs should skim through

this section to familiarize themselves with the notation. Readers unfamiliar with MDPs are

suggested to spend enough time with this section to understand the details. Proofs of most

of the results (with some simpliﬁcations) are included in Appendix A. The reader who is

interested in learning more about MDPs is suggested to consult one of the many excellent

books on the subject, such as the books of Bertsekas and Shreve (1978), Puterman (1994),

or the two-volume book by Bertsekas (2007a,b).

2.1 Preliminaries

We use $\mathbb{N}$ to denote the set of natural numbers, $\mathbb{N} = \{0, 1, 2, \ldots\}$, while $\mathbb{R}$ denotes the set of reals. By a vector $v$ (unless it is transposed, $v^\top$), we mean a column vector. The inner product of two finite-dimensional vectors $u, v \in \mathbb{R}^d$ is $\langle u, v \rangle = \sum_{i=1}^d u_i v_i$. The resulting 2-norm is $\|u\|_2^2 = \langle u, u \rangle$. The maximum norm for vectors is defined by $\|u\|_\infty = \max_{i=1,\ldots,d} |u_i|$, while for a function $f : \mathcal{X} \to \mathbb{R}$, $\|\cdot\|_\infty$ is defined by $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$. A mapping $T$ between the metric spaces $(M_1, d_1)$, $(M_2, d_2)$ is called Lipschitz with modulus $L \in \mathbb{R}$ if for any $a, b \in M_1$, $d_2(T(a), T(b)) \le L\, d_1(a, b)$. If $T$ is Lipschitz with a modulus $L \le 1$, it is called a non-expansion. If $L < 1$, the mapping is called a contraction. The indicator function of event $S$ will be denoted by $\mathbb{I}_{\{S\}}$ (i.e., $\mathbb{I}_{\{S\}} = 1$ if $S$ holds and $\mathbb{I}_{\{S\}} = 0$ otherwise). If $v = v(\theta, x)$, $\frac{\partial}{\partial \theta} v$ shall denote the partial derivative of $v$ with respect to $\theta$, which is a $d$-dimensional row vector if $\theta \in \mathbb{R}^d$. The total derivative of some expression $v$ with respect to $\theta$ will be denoted by $\frac{d}{d\theta} v$ (and will be treated as a row vector). Further, $\nabla_\theta v = (\frac{d}{d\theta} v)^\top$. If $P$ is a distribution or a probability measure, then $X \sim P$ means that $X$ is a random variable drawn from $P$.
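Contractions matter because iterating one converges to its unique fixed point (Banach's fixed-point theorem, Appendix A); this is what makes the dynamic programming operators of the next sections computable. A tiny numerical illustration, with an affine map on the reals invented here as the contraction:

```python
def T(v, gamma=0.9, c=1.0):
    # |T(a) - T(b)| = gamma * |a - b|, so T is Lipschitz with modulus gamma;
    # since gamma < 1, T is a contraction.
    return gamma * v + c

# Banach's fixed-point theorem: iterating a contraction from any starting
# point converges to its unique fixed point, here v* = c / (1 - gamma) = 10.
v = 0.0
for _ in range(200):
    v = T(v)
print(abs(v - 1.0 / (1.0 - 0.9)) < 1e-6)  # True
```

The error shrinks by the factor $\gamma$ at every step, mirroring the geometric convergence rates quoted for value iteration later in the book.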


2.2 Markov Decision Processes

For ease of exposition, we restrict our attention to countable MDPs and the discounted total

expected reward criterion. However, under some technical conditions, the results extend to

continuous state-action MDPs, too. This also holds true for the results presented in later

parts of this book.

A countable MDP is defined as a triplet $\mathcal{M} = (\mathcal{X}, \mathcal{A}, \mathcal{P}_0)$, where $\mathcal{X}$ is the countable non-empty set of states and $\mathcal{A}$ is the countable non-empty set of actions. The transition probability kernel $\mathcal{P}_0$ assigns to each state-action pair $(x, a) \in \mathcal{X} \times \mathcal{A}$ a probability measure over $\mathcal{X} \times \mathbb{R}$, which we shall denote by $\mathcal{P}_0(\cdot \mid x, a)$. The semantics of $\mathcal{P}_0$ is the following: for $U \subset \mathcal{X} \times \mathbb{R}$, $\mathcal{P}_0(U \mid x, a)$ gives the probability that the next state and the associated reward belong to the set $U$, provided that the current state is $x$ and the action taken is $a$.² We also fix a discount factor $0 \le \gamma \le 1$ whose role will become clear soon.

The transition probability kernel gives rise to the state transition probability kernel $\mathcal{P}$ which, for any $(x, a, y) \in \mathcal{X} \times \mathcal{A} \times \mathcal{X}$ triplet, gives the probability of moving from state $x$ to some other state $y$ provided that action $a$ was chosen in state $x$:
$$\mathcal{P}(x, a, y) = \mathcal{P}_0(\{y\} \times \mathbb{R} \mid x, a).$$
In addition to $\mathcal{P}$, $\mathcal{P}_0$ also gives rise to the immediate reward function $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, which gives the expected immediate reward received when action $a$ is chosen in state $x$: if $(Y_{(x,a)}, R_{(x,a)}) \sim \mathcal{P}_0(\cdot \mid x, a)$, then
$$r(x, a) = \mathbb{E}\!\left[ R_{(x,a)} \right].$$

In what follows, we shall assume that the rewards are bounded by some quantity $\mathcal{R} > 0$: for any $(x, a) \in \mathcal{X} \times \mathcal{A}$, $|R_{(x,a)}| \le \mathcal{R}$ almost surely.³ It is immediate that if the random rewards are bounded by $\mathcal{R}$, then $\|r\|_\infty = \sup_{(x,a) \in \mathcal{X} \times \mathcal{A}} |r(x, a)| \le \mathcal{R}$ also holds. An MDP is called finite if both $\mathcal{X}$ and $\mathcal{A}$ are finite.

Markov Decision Processes are a tool for modeling sequential decision-making problems

where a decision maker interacts with a system in a sequential fashion. Given an MDP M,

this interaction happens as follows: let $t \in \mathbb{N}$ denote the current time (or stage), and let $X_t \in \mathcal{X}$

² The probability $\mathcal{P}_0(U \mid x, a)$ is defined only when $U$ is a Borel-measurable set. Borel-measurability is a technical notion whose purpose is to prevent some pathologies. The collection of Borel-measurable subsets of $\mathcal{X} \times \mathbb{R}$ includes practically all "interesting" subsets of $\mathcal{X} \times \mathbb{R}$. In particular, they include subsets of the form $\{x\} \times [a, b]$ and subsets which can be obtained from such subsets by taking their complement, or the union (intersection) of at most countable collections of such sets in a recursive fashion.

³ "Almost surely" means the same as "with probability one" and is used to refer to the fact that the statement concerned holds everywhere on the probability space with the exception of a set of events with measure zero.


and $A_t \in \mathcal{A}$ denote the random state of the system and the action chosen by the decision maker at time $t$, respectively. Once the action is selected, it is sent to the system, which makes a transition:
$$(X_{t+1}, R_{t+1}) \sim \mathcal{P}_0(\cdot \mid X_t, A_t). \tag{1}$$
In particular, $X_{t+1}$ is random and $\mathbb{P}(X_{t+1} = y \mid X_t = x, A_t = a) = \mathcal{P}(x, a, y)$ holds for any $x, y \in \mathcal{X}$, $a \in \mathcal{A}$. Further, $\mathbb{E}[R_{t+1} \mid X_t, A_t] = r(X_t, A_t)$. The decision maker then observes the next state $X_{t+1}$ and reward $R_{t+1}$, chooses a new action $A_{t+1} \in \mathcal{A}$, and the process is repeated. The goal of the decision maker is to come up with a way of choosing the actions so as to maximize the expected total discounted reward.

The decision maker can select its actions at any stage based on the observed history. A rule describing the way the actions are selected is called a behavior. A behavior of the decision maker and some initial random state $X_0$ together define a random state-action-reward sequence $((X_t, A_t, R_{t+1});\, t \ge 0)$, where $(X_{t+1}, R_{t+1})$ is connected to $(X_t, A_t)$ by (1) and $A_t$ is the action prescribed by the behavior based on the history $X_0, A_0, R_1, \ldots, X_{t-1}, A_{t-1}, R_t, X_t$.⁴ The return underlying a behavior is defined as the total discounted sum of the rewards incurred:
$$\mathcal{R} = \sum_{t=0}^{\infty} \gamma^t R_{t+1}.$$
Thus, if $\gamma < 1$ then rewards far in the future are worth exponentially less than the reward received at the first stage. An MDP when the return is defined by this formula is called a discounted reward MDP. When $\gamma = 1$, the MDP is called undiscounted.

The goal of the decision maker is to choose a behavior that maximizes the expected return, irrespective of how the process is started. Such a maximizing behavior is said to be optimal.
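For a finite trajectory, the return above is a straightforward computation; a small helper written directly from the formula $\sum_{t \ge 0} \gamma^t R_{t+1}$ (the function name is ours, not the book's):

```python
def discounted_return(rewards, gamma):
    """Sum_{t>=0} gamma^t * rewards[t]: the (truncated) return of a trajectory."""
    g = 0.0
    for r in reversed(rewards):  # Horner-style backward accumulation:
        g = r + gamma * g        # g_t = r_t + gamma * g_{t+1}
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The backward recursion is the same one that value-based algorithms later exploit: the return from time $t$ equals the immediate reward plus $\gamma$ times the return from $t+1$.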

Example 1 (Inventory control with lost sales): Consider the problem of day-to-day control

of an inventory of a ﬁxed maximum size in the face of uncertain demand: Every evening,

the decision maker must decide about the quantity to be ordered for the next day. In the

morning, the ordered quantity arrives with which the inventory is ﬁlled up. During the day,

some stochastic demand is realized, where the demands are independent with a common ﬁxed

distribution; see Figure 3. The goal of the inventory manager is to manage the inventory so

as to maximize the present monetary value of the expected total future income.

The payoff at time step $t$ is determined as follows: the cost associated with purchasing $A_t$ items is $K \mathbb{I}_{\{A_t > 0\}} + c A_t$. Thus, there is a fixed entry cost $K$ of ordering a nonzero number of items, and each item must be purchased at a fixed price $c$. Here $K, c > 0$. In addition, there is a cost of

holding an inventory of size x > 0. In the simplest case, this cost is proportional to the size

⁴ Mathematically, a behavior is an infinite sequence of probability kernels $\pi_0, \pi_1, \ldots, \pi_t, \ldots$, where $\pi_t$ maps histories of length $t$ to a probability distribution over the action space $\mathcal{A}$: $\pi_t = \pi_t(\cdot \mid x_0, a_0, r_0, \ldots, x_{t-1}, a_{t-1}, r_{t-1}, x_t)$.


of the inventory with proportionality factor $h > 0$. Finally, upon selling $z$ units the manager is paid the monetary amount of $p\, z$, where $p > 0$. In order to make the problem interesting, we must have $p > h$; otherwise, there is no incentive to order new items.

This problem can be represented as an MDP as follows: let the state $X_t$ on day $t \ge 0$ be the size of the inventory in the evening of that day. Thus, $\mathcal{X} = \{0, 1, \ldots, M\}$, where $M \in \mathbb{N}$ is the maximum inventory size. The action $A_t$ gives the number of items ordered in the evening of day $t$. Thus, we can choose $\mathcal{A} = \{0, 1, \ldots, M\}$, since there is no need to consider orders larger than the inventory size. Given $X_t$ and $A_t$, the size of the next inventory is given by
$$X_{t+1} = \big( (X_t + A_t) \wedge M - D_{t+1} \big)^+, \tag{2}$$
where $a \wedge b$ is a shorthand notation for the minimum of the numbers $a$ and $b$, $(a)^+ = a \vee 0 = \max(a, 0)$ is the positive part of $a$, and $D_{t+1} \in \mathbb{N}$ is the demand on the $(t+1)$th day. By assumption, $(D_t;\, t > 0)$ is a sequence of independent and identically distributed (i.i.d.) integer-valued random variables. The revenue made on day $t + 1$ is
$$R_{t+1} = -K \mathbb{I}_{\{A_t > 0\}} - c \big( (X_t + A_t) \wedge M - X_t \big)^+ - h X_t + p \big( (X_t + A_t) \wedge M - X_{t+1} \big)^+. \tag{3}$$

Equations (2)–(3) can be written in the compact form
$$(X_{t+1}, R_{t+1}) = f(X_t, A_t, D_{t+1}), \tag{4}$$
with an appropriately chosen function $f$. Then, $\mathcal{P}_0$ is given by
$$\mathcal{P}_0(U \mid x, a) = \mathbb{P}\big( f(x, a, D) \in U \big) = \sum_{d=0}^{\infty} \mathbb{I}_{\{f(x,a,d) \in U\}}\, p_D(d).$$
Here $p_D(\cdot)$ is the probability mass function of the random demands and $D \sim p_D(\cdot)$. This finishes the definition of the MDP underlying the inventory optimization problem.
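The transition function $f$ of equation (4) translates directly into code. A minimal sketch, in which the parameter values ($M$, $K$, $c$, $h$, $p$) are invented for illustration; the model itself leaves them free:

```python
# Hypothetical parameter values; M, K, c, h, p are free parameters of the model.
M, K, c, h, p = 20, 2.0, 1.0, 0.1, 2.0

def f(x, a, d):
    """Transition function of equation (4): maps (X_t, A_t, D_{t+1})
    to (X_{t+1}, R_{t+1}) via equations (2) and (3)."""
    filled = min(x + a, M)            # (x + a) ∧ M: stock after the morning delivery
    y = max(filled - d, 0)            # equation (2): next evening's inventory
    r = (-K * (a > 0)                 # fixed entry cost of ordering
         - c * max(filled - x, 0)     # purchase cost of the delivered items
         - h * x                      # holding cost for the evening stock
         + p * max(filled - y, 0))    # revenue: p per unit sold during the day
    return y, r

# One day: 5 items in stock, order 7, demand of 6.
y, r = f(5, 7, 6)
print(y, r)  # 6 2.5
```

Sampling the demand $d$ from $p_D(\cdot)$ and iterating $f$ simulates trajectories of this MDP, which is exactly how the sample-based algorithms of later sections would interact with it.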

Inventory control is just one of the many operations research problems that give rise to

an MDP. Other problems include optimizing transportation systems, optimizing schedules

or production. MDPs arise naturally in many engineering optimal control problems, too,

such as the optimal control of chemical, electronic or mechanical systems (the latter class

includes the problem of controlling robots). Quite a few information theory problems can

also be represented as MDPs (e.g., optimal coding, optimizing channel allocation, or sensor

networks). Another important class of problems comes from ﬁnance. These include, amongst

others, optimal portfolio management and option pricing.


Figure 3: Illustration of the inventory management problem

In the case of the inventory control problem, the MDP was conveniently speciﬁed by a

transition function f(cf., (4)). In fact, transition functions are as powerful as transition

kernels: any MDP gives rise to some transition function fand any transition function f

gives rise to some MDP.

In some problems, not all actions are meaningful in all states. For example, ordering more

items than one has room for in the inventory does not make much sense. However, such meaningless (or forbidden) actions can always be remapped to other actions, just as was done above. In some cases, this is unnatural and leads to convoluted dynamics. Then, it might be better to introduce an additional mapping which assigns the

set of admissible actions to each state.

In some MDPs, some states are impossible to leave: if $x$ is such a state, then $X_{t+s} = x$ holds almost surely for any $s \ge 1$ provided that $X_t = x$, no matter what actions are selected after time $t$. By convention, we will assume that no reward is incurred in such terminal or absorbing states. An MDP with such states is called episodic. An episode then is the (generally random) time period from the beginning of time until a terminal state is reached. In an episodic MDP, we often consider undiscounted rewards, i.e., when $\gamma = 1$.

Example 2 (Gambling): A gambler enters a game in which she may stake any fraction $A_t \in [0, 1]$ of her current wealth $X_t \ge 0$. She wins her stake back and as much more with probability $p \in [0, 1]$, while she loses her stake with probability $1 - p$. Thus, the fortune of the gambler evolves according to
$$X_{t+1} = (1 + S_{t+1} A_t) X_t.$$
Here $(S_t;\, t \ge 1)$ is a sequence of independent random variables taking values in $\{-1, +1\}$ with $\mathbb{P}(S_{t+1} = 1) = p$. The goal of the gambler is to maximize the probability that her wealth reaches an a priori given value $w^* > 0$. It is assumed that the initial wealth is in $[0, w^*]$. This problem can be represented as an episodic MDP, where the state space is $\mathcal{X} = [0, w^*]$ and the action space is $\mathcal{A} = [0, 1]$.⁵ We define
$$X_{t+1} = (1 + S_{t+1} A_t) X_t \wedge w^*, \tag{5}$$
when $0 \le X_t < w^*$, and make $w^*$ a terminal state: $X_{t+1} = X_t$ if $X_t = w^*$. The immediate reward is zero as long as $X_{t+1} < w^*$ and is one when the state reaches $w^*$ for the first time:
$$R_{t+1} = \begin{cases} 1, & X_t < w^* \text{ and } X_{t+1} = w^*; \\ 0, & \text{otherwise.} \end{cases}$$
If we set the discount factor to one, the total reward along any trajectory will be one or zero depending on whether the wealth reaches $w^*$. Thus, the expected total reward is just the probability that the gambler's fortune reaches $w^*$.
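Since the expected total reward equals the probability of reaching $w^*$, it can be estimated by simulating episodes. A Monte-Carlo sketch under assumed values ($p = 0.45$, $w^* = 1$, initial wealth $0.5$) and two made-up stationary policies; none of these numbers come from the text:

```python
import random

def success_probability(policy, p=0.45, w_star=1.0, x0=0.5,
                        n_episodes=2000, max_steps=200, seed=0):
    """Monte-Carlo estimate of the probability that the gambler's wealth
    reaches w_star, i.e., the expected total undiscounted reward."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_episodes):
        x = x0
        for _ in range(max_steps):
            if x >= w_star:                       # terminal state: goal reached
                wins += 1
                break
            if x <= 0.0:                          # ruined: wealth can never grow again
                break
            a = policy(x)                         # stake, as a fraction of wealth
            s = 1 if rng.random() < p else -1     # win or lose the stake
            x = min((1 + s * a) * x, w_star)      # equation (5)
    return wins / n_episodes

# Bold play (stake everything) wins with probability p in one bet from x0 = 0.5;
# timid play (stake 10%) must win many bets in a row against unfavorable odds.
bold = success_probability(lambda x: 1.0)
timid = success_probability(lambda x: 0.1)
print(bold, timid)
```

With $p < 1/2$ the classical result is that bold play does well, and the simulation reflects this: the bold estimate sits near $p$, well above the timid one.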

Based on the two examples presented so far, the reader unfamiliar with MDPs might believe

that all MDPs come with handy finite, one-dimensional state- and action-spaces. If only this were true! In fact, in practical applications the state- and action-spaces are often very large, multidimensional spaces. For example, in a robot control application, the dimensionality of the state space can be 3–6 times the number of joints the robot has. An industrial robot's state space might easily be 12–20 dimensional, while the state space of a humanoid

robot might easily have 100 dimensions. In a real-world inventory control application, items

would have multiple types, the prices and costs would also change based on the state of

the “market”, whose state would thus also become part of the MDP’s state. Hence, the

state space in any such practical application would be very large and very high dimensional.

The same holds for the action spaces. Thus, working with large, multidimensional state- and

action-spaces should be considered the normal situation, while the examples presented in this

section with their one-dimensional, small state spaces should be viewed as the exceptions.

2.3 Value functions

The obvious way of ﬁnding an optimal behavior in some MDP is to list all behaviors and

then identify the ones that give the highest possible value for each initial state. Since, in

general, there are too many behaviors, this plan is not viable. A better approach is based on

computing value functions. In this approach, one ﬁrst computes the so-called optimal value

function, which then allows one to determine an optimal behavior with relative ease.

The optimal value, $V^*(x)$, of state $x \in \mathcal{X}$ gives the highest achievable expected return when the process is started from state $x$. The function $V^* : \mathcal{X} \to \mathbb{R}$ is called the optimal value function. A behavior that achieves the optimal values in all states is optimal.

⁵ Hence, in this case the state and action spaces are continuous. Notice that our definition of MDPs is general enough to encompass this case, too.

Deterministic stationary policies represent a special class of behaviors which, as we shall see soon, play an important role in the theory of MDPs. They are specified by some mapping $\pi$ which maps states to actions (i.e., $\pi : \mathcal{X} \to \mathcal{A}$). Following $\pi$ means that at any time $t \ge 0$ the action $A_t$ is selected using
$$A_t = \pi(X_t). \tag{6}$$
More generally, a stochastic stationary policy (or just stationary policy) $\pi$ maps states to distributions over the action space. When referring to such a policy $\pi$, we shall use $\pi(a|x)$ to denote the probability of action $a$ being selected by $\pi$ in state $x$. Note that if a stationary policy is followed in an MDP, i.e., if
$$A_t \sim \pi(\cdot \mid X_t), \quad t \in \mathbb{N},$$
the state process $(X_t;\, t \ge 0)$ will be a (time-homogeneous) Markov chain. We will use $\Pi_{\mathrm{stat}}$ to denote the set of all stationary policies. For brevity, in what follows, we will often say just "policy" instead of "stationary policy", hoping that this will not cause confusion.

A stationary policy and an MDP induce what is called a Markov reward process (MRP): an MRP is determined by a pair $\mathcal{M} = (\mathcal{X}, \mathcal{P}_0)$, where now $\mathcal{P}_0$ assigns a probability measure over $\mathcal{X} \times \mathbb{R}$ to each state. An MRP $\mathcal{M}$ gives rise to the stochastic process $((X_t, R_{t+1});\, t \ge 0)$, where $(X_{t+1}, R_{t+1}) \sim \mathcal{P}_0(\cdot \mid X_t)$. (Note that $(Z_t;\, t \ge 0)$, $Z_t = (X_t, R_t)$, is a time-homogeneous Markov process, where $R_0$ is an arbitrary random variable, while $((X_t, R_{t+1});\, t \ge 0)$ is a second-order Markov process.) Given a stationary policy $\pi$ and the MDP $\mathcal{M} = (\mathcal{X}, \mathcal{A}, \mathcal{P}_0)$, the transition kernel of the MRP $(\mathcal{X}, \mathcal{P}_0^\pi)$ induced by $\pi$ and $\mathcal{M}$ is defined using $\mathcal{P}_0^\pi(\cdot \mid x) = \sum_{a \in \mathcal{A}} \pi(a|x)\, \mathcal{P}_0(\cdot \mid x, a)$. An MRP is called finite if its state space is finite.

Let us now define value functions underlying stationary policies.^6 For this, let us fix some policy π ∈ Π_stat. The value function V^π : X → R underlying π is defined by

    V^π(x) = E[ Σ_{t=0}^∞ γ^t R_{t+1} | X_0 = x ],   x ∈ X,    (7)

with the understanding (i) that the process (R_t; t ≥ 1) is the "reward-part" of the process ((X_t, A_t, R_{t+1}); t ≥ 0) obtained when following policy π and (ii) that X_0 is selected at random such that P(X_0 = x) > 0 holds for all states x. This second condition makes the conditional expectation in (7) well-defined for every state. As long as the initial state distribution satisfies this condition, its particular choice has no influence on the definition of the values.

^6 Value functions can also be defined underlying any behavior, analogously to the definition given below.


The value function underlying an MRP is defined in the same way and is denoted by V:

    V(x) = E[ Σ_{t=0}^∞ γ^t R_{t+1} | X_0 = x ],   x ∈ X.

It will also be useful to define the action-value function Q^π : X × A → R underlying a policy π ∈ Π_stat in an MDP. Assume that the first action A_0 is selected at random such that P(A_0 = a) > 0 holds for all a ∈ A, while for the subsequent stages of the decision process the actions are chosen by following policy π. Let ((X_t, A_t, R_{t+1}); t ≥ 0) be the resulting stochastic process, where X_0 is as in the definition of V^π. Then

    Q^π(x, a) = E[ Σ_{t=0}^∞ γ^t R_{t+1} | X_0 = x, A_0 = a ],   x ∈ X, a ∈ A.

Similarly to V*(x), the optimal action-value Q*(x, a) at the state-action pair (x, a) is defined as the maximum of the expected return under the constraints that the process starts at state x and the first action chosen is a. The underlying function Q* : X × A → R is called the optimal action-value function.

The optimal value and action-value functions are connected by the following equations:

    V*(x) = sup_{a∈A} Q*(x, a),   x ∈ X,

    Q*(x, a) = r(x, a) + γ Σ_{y∈X} P(x, a, y) V*(y),   x ∈ X, a ∈ A.

In the class of MDPs considered here, an optimal stationary policy always exists:

    V*(x) = sup_{π∈Π_stat} V^π(x),   x ∈ X.

In fact, any policy π ∈ Π_stat that satisfies the equality

    Σ_{a∈A} π(a|x) Q*(x, a) = V*(x)    (8)

simultaneously for all states x ∈ X is optimal. Notice that in order for (8) to hold, π(·|x) must be concentrated on the set of actions that maximize Q*(x, ·). In general, given some action-value function Q : X × A → R, an action that maximizes Q(x, ·) for some state x is called greedy with respect to Q in state x. A policy that chooses only greedy actions with respect to Q in all states is called greedy w.r.t. Q.

Thus, a greedy policy with respect to Q* is optimal, i.e., the knowledge of Q* alone is sufficient for finding an optimal policy. Similarly, knowing V*, r, and P also suffices to act optimally.

The next question is how to find V* or Q*. Let us start with the simpler question of how to find the value function of a policy:

Fact 1 (Bellman Equations for Deterministic Policies): Fix an MDP M = (X, A, P_0), a discount factor γ, and a deterministic policy π ∈ Π_stat. Let r be the immediate reward function of M. Then V^π satisfies

    V^π(x) = r(x, π(x)) + γ Σ_{y∈X} P(x, π(x), y) V^π(y),   x ∈ X.    (9)

This system of equations is called the Bellman equation for V^π. Define the Bellman operator underlying π, T^π : R^X → R^X, by

    (T^π V)(x) = r(x, π(x)) + γ Σ_{y∈X} P(x, π(x), y) V(y),   x ∈ X.

With the help of T^π, Equation (9) can be written in the compact form

    T^π V^π = V^π.    (10)

Note that this is a linear system of equations in V^π and T^π is an affine linear operator. If 0 < γ < 1, then T^π is a maximum-norm contraction and the fixed-point equation T^π V = V has a unique solution.

When the state space X is finite, say, it has D states, R^X can be identified with the D-dimensional Euclidean space, and V ∈ R^X can be thought of as a D-dimensional vector: V ∈ R^D. With this identification, T^π V can also be written as r^π + γ P^π V with an appropriately defined vector r^π ∈ R^D and matrix P^π ∈ R^{D×D}. In this case, (10) can be written in the form

    r^π + γ P^π V^π = V^π.    (11)
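In a finite MDP, Equation (11) can be solved directly for V^π, since V^π = (I − γP^π)^{−1} r^π. The following NumPy sketch does this for a small two-state example; the numbers in r_pi and P_pi are made up purely for illustration:

```python
import numpy as np

# Exact policy evaluation via Equation (11): solve (I - gamma * P_pi) V = r_pi.
# The two-state example below is hypothetical.
gamma = 0.9
r_pi = np.array([1.0, 0.0])          # expected immediate reward under the policy
P_pi = np.array([[0.5, 0.5],         # P_pi[x, y]: transition probabilities
                 [0.0, 1.0]])        # state 1 is absorbing with zero reward

# Solve the linear system rather than forming the matrix inverse explicitly.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# V_pi is the unique fixed point of T^pi: r_pi + gamma * P_pi @ V_pi = V_pi.
```

Since T^π is a γ-contraction for 0 < γ < 1, the matrix I − γP^π is invertible and the system always has a unique solution.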

The above facts also hold true in MRPs, where the Bellman operator T : R^X → R^X is defined by

    (T V)(x) = r(x) + γ Σ_{y∈X} P(x, y) V(y),   x ∈ X.

The optimal value function is known to satisfy a certain fixed-point equation:

Fact 2 (Bellman Optimality Equations): The optimal value function satisfies the fixed-point equation

    V*(x) = sup_{a∈A} { r(x, a) + γ Σ_{y∈X} P(x, a, y) V*(y) },   x ∈ X.    (12)


Define the Bellman optimality operator T* : R^X → R^X by

    (T* V)(x) = sup_{a∈A} { r(x, a) + γ Σ_{y∈X} P(x, a, y) V(y) },   x ∈ X.    (13)

Note that this is a nonlinear operator due to the presence of the supremum. With the help of T*, Equation (12) can be written compactly as

    T* V* = V*.

If 0 < γ < 1, then T* is a maximum-norm contraction, and the fixed-point equation T* V = V has a unique solution.

In order to minimize clutter, in what follows we will write expressions like (T^π V)(x) as T^π V(x), with the understanding that the application of the operator T^π takes precedence over the application of the point-evaluation operator "·(x)".

The action-value functions underlying a policy (or an MRP) and the optimal action-value function also satisfy fixed-point equations similar to the previous ones:

Fact 3 (Bellman Operators and Fixed-point Equations for Action-value Functions): With a slight abuse of notation, define T^π : R^{X×A} → R^{X×A} and T* : R^{X×A} → R^{X×A} as follows:

    T^π Q(x, a) = r(x, a) + γ Σ_{y∈X} P(x, a, y) Q(y, π(y)),   (x, a) ∈ X × A,    (14)

    T* Q(x, a) = r(x, a) + γ Σ_{y∈X} P(x, a, y) sup_{a'∈A} Q(y, a'),   (x, a) ∈ X × A.    (15)

Note that T^π is again affine linear, while T* is nonlinear. The operators T^π and T* are maximum-norm contractions. Further, the action-value function of π, Q^π, satisfies T^π Q^π = Q^π and Q^π is the unique solution to this fixed-point equation. Similarly, the optimal action-value function Q* satisfies T* Q* = Q* and Q* is the unique solution to this fixed-point equation.

2.4 Dynamic programming algorithms for solving MDPs

The above facts provide the basis for the value- and policy-iteration algorithms.

Value iteration generates a sequence of value functions

    V_{k+1} = T* V_k,   k ≥ 0,

where V_0 is arbitrary. Thanks to Banach's fixed-point theorem, (V_k; k ≥ 0) converges to V* at a geometric rate.

Value iteration can also be used in conjunction with action-value functions, in which case it takes the form

    Q_{k+1} = T* Q_k,   k ≥ 0,

which again converges to Q* at a geometric rate. The idea is that once V_k (or Q_k) is close to V* (resp., Q*), a policy that is greedy with respect to V_k (resp., Q_k) will be close to optimal. In particular, the following bound is known to hold: Fix an action-value function Q and let π be a greedy policy w.r.t. Q. Then the value of policy π can be lower bounded as follows (e.g., Singh and Yee, 1994, Corollary 2):

    V^π(x) ≥ V*(x) − (2 / (1 − γ)) ‖Q − Q*‖_∞,   x ∈ X.    (16)
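As a minimal illustration, the sketch below runs value iteration Q_{k+1} = T* Q_k on a hypothetical two-state, two-action MDP and then extracts a greedy policy, which by the bound (16) is near-optimal once Q is close to Q*:

```python
import numpy as np

# Value iteration on action-values: Q_{k+1} = T* Q_k. The MDP is hypothetical:
# in state 0, action 0 stays put (reward 0) and action 1 moves to the absorbing
# state 1 (reward 1); state 1 yields no further reward.
gamma = 0.9
r = np.array([[0.0, 1.0],                        # r[x, a]
              [0.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],          # P[x, a, y]
              [[0.0, 1.0], [0.0, 1.0]]])

Q = np.zeros((2, 2))
for _ in range(200):
    # (T* Q)(x, a) = r(x, a) + gamma * sum_y P(x, a, y) * max_{a'} Q(y, a')
    Q = r + gamma * P @ Q.max(axis=1)

greedy_policy = Q.argmax(axis=1)    # greedy w.r.t. Q, near-optimal once Q ~ Q*
```

Here `P @ Q.max(axis=1)` exploits NumPy's stacked-matrix multiplication: for each (x, a) it computes Σ_y P(x, a, y) max_{a'} Q(y, a').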

Policy iteration works as follows. Fix an arbitrary initial policy π_0. At iteration k > 0, compute the action-value function underlying π_k (this is called the policy evaluation step). Next, given Q^{π_k}, define π_{k+1} as a policy that is greedy with respect to Q^{π_k} (this is called the policy improvement step). After k iterations, policy iteration gives a policy that is no worse than the policy that is greedy w.r.t. the value function computed using k iterations of value iteration, provided the two procedures are started with the same initial value function. However, the computational cost of a single step of policy iteration is much higher (because of the policy evaluation step) than that of one update of value iteration.
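The two steps above can be sketched in a few lines of NumPy; the evaluation step solves the linear Bellman equations exactly, and the loop stops when the greedy policy no longer changes. The tiny MDP used here is hypothetical:

```python
import numpy as np

# Policy iteration: exact policy evaluation alternated with greedy improvement.
gamma = 0.9
r = np.array([[0.0, 1.0],                        # r[x, a] (hypothetical MDP)
              [0.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],          # P[x, a, y]
              [[0.0, 1.0], [0.0, 1.0]]])
n_states, n_actions = r.shape

pi = np.zeros(n_states, dtype=int)               # arbitrary initial policy
for _ in range(10):
    # Policy evaluation: V^pi solves V = r_pi + gamma * P_pi V.
    r_pi = r[np.arange(n_states), pi]
    P_pi = P[np.arange(n_states), pi]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: act greedily w.r.t. Q^pi.
    Q = r + gamma * P @ V
    new_pi = Q.argmax(axis=1)
    if np.array_equal(new_pi, pi):
        break                                    # policy is stable, hence optimal
    pi = new_pi
```

On a finite MDP the loop terminates after finitely many iterations, since there are only finitely many deterministic policies and each improvement step is non-decreasing in value.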

3 Value prediction problems

In this section, we consider the problem of estimating the value function V underlying some Markov reward process (MRP). Value prediction problems arise in a number of ways: estimating the probability of some future event, the expected time until some event occurs, or the (action-)value function underlying some policy in an MDP are all value prediction problems. Specific applications include estimating the failure probability of a large power grid (Frank et al., 2008) or estimating taxi-out times of flights at busy airports (Balakrishna et al., 2008), to mention just two of the many possibilities.

Since the value of a state is defined as the expectation of the random return when the process is started from the given state, an obvious way of estimating this value is to compute an average over multiple independent realizations started from the given state. This is an instance of the so-called Monte-Carlo method. Unfortunately, the variance of the returns can be high, which means that the quality of the estimates will be poor. Also, when interacting with a system in a closed-loop fashion (i.e., when estimation happens while interacting with the system), it might be impossible to reset the state of the system to some particular state. In this case, the Monte-Carlo technique cannot be applied without introducing some additional bias. Temporal difference (TD) learning (Sutton, 1984, 1988), which is without doubt one of the most significant ideas in reinforcement learning, is a method that can be used to address these issues.

3.1 Temporal diﬀerence learning in ﬁnite state spaces

The unique feature of TD learning is that it uses bootstrapping: predictions are used as targets during the course of learning. In this section, we first introduce the most basic TD algorithm and explain how bootstrapping works. Next, we compare TD learning to (vanilla) Monte-Carlo methods and argue that both of them have their own merits. Finally, we present the TD(λ) algorithm, which unifies the two approaches. Here we consider only the case of small, finite MRPs, where the value estimates of all the states can be stored in the main memory of a computer in an array or table; this is known as the tabular case in the reinforcement learning literature. Extensions of the ideas presented here to large state spaces, where a tabular representation is not feasible, will be described in the subsequent sections.

3.1.1 Tabular TD(0)

Fix some finite Markov reward process M. We wish to estimate the value function V underlying M given a realization ((X_t, R_{t+1}); t ≥ 0) of M. Let V̂_t(x) denote the estimate of the value of state x at time t (say, V̂_0 ≡ 0). In the tth step, TD(0) performs the following calculations:

    δ_{t+1} = R_{t+1} + γ V̂_t(X_{t+1}) − V̂_t(X_t),
    V̂_{t+1}(x) = V̂_t(x) + α_t δ_{t+1} I{X_t = x},   x ∈ X.    (17)

Here the step-size sequence (α_t; t ≥ 0) consists of (small) nonnegative numbers chosen by the user. Algorithm 1 shows the pseudocode of this algorithm.

A closer inspection of the update equation reveals that the only value changed is the one associated with X_t, i.e., the state just visited (cf. line 2 of the pseudocode). Further, when α_t ≤ 1, the value of X_t is moved towards the "target" R_{t+1} + γ V̂_t(X_{t+1}). Since the target depends on the estimated value function, the algorithm uses bootstrapping. The term "temporal difference" in the name of the algorithm comes from the fact that δ_{t+1} is defined as the


Algorithm 1 The function implementing the tabular TD(0) algorithm. This function must be called after each transition.

function TD0(X, R, Y, V)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, V is the array storing the current value estimates
1: δ ← R + γ · V[Y] − V[X]
2: V[X] ← V[X] + α · δ
3: return V

difference between the values of states corresponding to successive time steps. In particular, δ_{t+1} is called a temporal difference error.
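A compact Python version of the TD(0) update is given below; it is fed a stream of transitions from a hypothetical two-state reward process (state 0 pays a reward of 1 and moves to the absorbing state 1, which pays nothing):

```python
import numpy as np

# Tabular TD(0), cf. update (17) and Algorithm 1. A constant step-size alpha
# is used; the two-state reward process below is hypothetical.
gamma, alpha = 0.9, 0.1
V = np.zeros(2)

def td0_update(V, x, r, y):
    """One TD(0) step for the observed transition (x, r, y)."""
    delta = r + gamma * V[y] - V[x]      # temporal difference error
    V[x] += alpha * delta                # move V[x] towards the target
    return V

for _ in range(200):
    V = td0_update(V, 0, 1.0, 1)         # state 0: reward 1, next state 1
    V = td0_update(V, 1, 0.0, 1)         # state 1: reward 0, stays put

# True values here: V(1) = 0 and V(0) = 1 + gamma * V(1) = 1.
```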

Just like many other algorithms in reinforcement learning, tabular TD(0) is a stochastic approximation (SA) algorithm. It is easy to see that if it converges, then it must converge to a function V̂ such that the expected temporal difference given V̂,

    F V̂(x) := E[ R_{t+1} + γ V̂(X_{t+1}) − V̂(X_t) | X_t = x ],

is zero for all states x, or at least for all states that are sampled infinitely often. A simple calculation shows that F V̂ = T V̂ − V̂, where T is the Bellman operator underlying the MRP considered. By Fact 1, F V̂ = 0 has a unique solution, the value function V. Thus, if TD(0) converges (and all states are sampled infinitely often), then it must converge to V.

To study the algorithm's convergence properties, for simplicity, assume that (X_t; t ∈ N) is a stationary, ergodic Markov chain.^7 Further, identify the approximate value functions V̂_t with D-dimensional vectors as before (e.g., V̂_{t,i} = V̂_t(x_i), i = 1, ..., D, where D = |X| and X = {x_1, ..., x_D}). Then, assuming that the step-size sequence satisfies the Robbins-Monro (RM) conditions,

    Σ_{t=0}^∞ α_t = ∞,   Σ_{t=0}^∞ α_t^2 < +∞,

the sequence (V̂_t ∈ R^D; t ∈ N) will track the trajectories of the ordinary differential equation (ODE)

    v̇(t) = c F(v(t)),   t ≥ 0,    (18)

where c = 1/D and v(t) ∈ R^D (e.g., Borkar, 1998). Borrowing the notation used in (11), the above ODE can be written as

    v̇ = r + (γP − I) v.

Note that this is a linear ODE. Since the eigenvalues of γP − I all lie in the open left half of the complex plane, this ODE is globally asymptotically stable. From this, using standard results of SA, it follows that V̂_t converges almost surely to V.

^7 Recall that a Markov chain (X_t; t ∈ N) is ergodic if it is irreducible, aperiodic, and positive recurrent. Practically, this means that the law of large numbers holds for sufficiently regular functions of the chain.

On step-sizes.  Since many of the algorithms that we will discuss use step-sizes, it is worthwhile to spend some time on discussing their choice. A simple step-size sequence that satisfies the above conditions is α_t = c/t, with c > 0. More generally, any step-size sequence of the form α_t = c t^{−η} will work as long as 1/2 < η ≤ 1. Of these step-size sequences, η = 1 gives the smallest step-sizes. Asymptotically, this choice will be the best, but from the point of view of the transient behavior of the algorithm, choosing η closer to 1/2 will work better (since with this choice the step-sizes are bigger and thus the algorithm will make larger moves). It is possible to do even better than this. In fact, a simple method called iterate averaging, due to Polyak and Juditsky (1992), is known to achieve the best possible asymptotic rate of convergence. However, despite its appealing theoretical properties, iterate averaging is rarely used in practice. In fact, in practice people often use constant step-sizes, which clearly violates the RM conditions. This choice is justified on two grounds: first, the algorithms are often used in a non-stationary environment (i.e., the policy to be evaluated might change); second, the algorithms are often used only in the small-sample regime. (When a constant step-size is used, the parameters converge in distribution. The variance of the limiting distribution will be proportional to the step-size chosen.) There is also a great deal of work going into developing methods that tune step-sizes automatically; see Sutton (1992), Schraudolph (1999), and George and Powell (2006) and the references therein. However, the jury is still out on which of these methods is the best.

With a small change, the algorithm can also be used on an observation sequence of the form ((X_t, R_{t+1}, Y_{t+1}); t ≥ 0), where (X_t; t ≥ 0) is an arbitrary ergodic Markov chain over X and (Y_{t+1}, R_{t+1}) ∼ P_0(·|X_t). The change concerns the definition of the temporal differences:

    δ_{t+1} = R_{t+1} + γ V̂_t(Y_{t+1}) − V̂_t(X_t).

Then, with no extra conditions, V̂_t still converges almost surely to the value function underlying the MRP (X, P_0). In particular, the distribution of the states (X_t; t ≥ 0) does not play a role here.

This is interesting for multiple reasons. For example, if the samples are generated using a simulator, we may be able to control the distribution of the states (X_t; t ≥ 0) independently of the MRP. This might be useful to counterbalance any unevenness in the stationary distribution underlying the Markov kernel P. Another use is to learn about some target policy in an MDP while following some other policy, often called the behavior policy. Assume for simplicity that the target policy is deterministic. Then ((X_t, R_{t+1}, Y_{t+1}); t ≥ 0) could be obtained by skipping all those state-action-reward-next-state quadruples in the trajectory generated by the behavior policy where the action taken does not match the action that would have been taken in the given state by the target policy, while keeping the rest. This technique might allow one to learn about multiple policies at the same time (more generally, about multiple long-term prediction problems). Learning about one policy while following another is called off-policy learning. Because of this, we shall also call learning based on triplets ((X_t, R_{t+1}, Y_{t+1}); t ≥ 0) with Y_{t+1} ≠ X_{t+1} off-policy learning. A third, technical use is when the goal is to apply the algorithm to an episodic problem. In this case, the triplets (X_t, R_{t+1}, Y_{t+1}) are chosen as follows: first, Y_{t+1} is sampled from the transition kernel P(X_t, ·). If Y_{t+1} is not a terminal state, we let X_{t+1} = Y_{t+1}; otherwise, X_{t+1} ∼ P_0(·), where P_0 is a user-chosen distribution over X. In other words, when a terminal state is reached, the process is restarted from the initial state distribution P_0. The period between the time of a restart from P_0 and reaching a terminal state is called an episode (hence the name episodic problems). This way of generating a sample shall be called continual sampling with restarts from P_0.

Being a standard linear SA method, the rate of convergence of tabular TD(0) will be of the usual order O(1/√t) (consult the paper by Tadić (2004) and the references therein for precise results). However, the constant factor in the rate will be largely influenced by the choice of the step-size sequence, the properties of the kernel P_0, and the value of γ.

3.1.2 Every-visit Monte-Carlo

As mentioned before, one can also estimate the value of a state by computing sample means, giving rise to the so-called every-visit Monte-Carlo method. Here we define more precisely what we mean by this and compare the resulting method to TD(0).

To firm up the ideas, consider some episodic problem (otherwise, it is impossible to finitely compute the return of a given state, since the trajectories are infinitely long). Let the underlying MRP be M = (X, P_0) and let ((X_t, R_{t+1}, Y_{t+1}); t ≥ 0) be generated by continual sampling in M with restarts from some distribution P_0 defined over X. Let (T_k; k ≥ 0) be the sequence of times when an episode starts (thus, for each k, X_{T_k} is sampled from P_0). For a given time t, let k(t) be the unique episode index such that t ∈ [T_k, T_{k+1}). Let

    R_t = Σ_{s=t}^{T_{k(t)+1}−1} γ^{s−t} R_{s+1}    (19)

denote the return from time t on until the end of the episode. Clearly, V(x) = E[R_t | X_t = x] for any state x such that P(X_t = x) > 0. Hence, a sensible way of updating the estimates is to use

    V̂_{t+1}(x) = V̂_t(x) + α_t (R_t − V̂_t(x)) I{X_t = x},   x ∈ X.


Algorithm 2 The function that implements the every-visit Monte-Carlo algorithm to estimate value functions in episodic MDPs. This routine must be called at the end of each episode with the state-reward sequence collected during the episode. Note that the algorithm as shown here has linear time- and space-complexity in the length of the episodes.

function EveryVisitMC(X_0, R_1, X_1, R_2, ..., X_{T−1}, R_T, V)
Input: X_t is the state at time t, R_{t+1} is the reward associated with the tth transition, T is the length of the episode, V is the array storing the current value function estimate
1: sum ← 0
2: for t ← T − 1 downto 0 do
3:   sum ← R_{t+1} + γ · sum
4:   target[X_t] ← sum
5:   V[X_t] ← V[X_t] + α · (target[X_t] − V[X_t])
6: end for
7: return V

Monte-Carlo methods such as the above one, since they use multi-step predictions of the value (cf. Equation (19)), are called multi-step methods. The pseudocode of this update rule is shown as Algorithm 2.
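Algorithm 2 translates directly into Python. The short episode fed to it below (a two-step chain with a single terminal reward) is a hypothetical example:

```python
import numpy as np

# Every-visit Monte-Carlo update (Algorithm 2) for one episode. The trajectory
# is processed backwards so each state's return is accumulated in linear time.
gamma, alpha = 1.0, 0.1

def every_visit_mc(states, rewards, V):
    """states = [X_0, ..., X_{T-1}], rewards = [R_1, ..., R_T]."""
    G = 0.0                                # return collected after time t
    for t in range(len(states) - 1, -1, -1):
        G = rewards[t] + gamma * G         # sum <- R_{t+1} + gamma * sum
        V[states[t]] += alpha * (G - V[states[t]])
    return V

# Hypothetical episodic chain 0 -> 1 -> 2 (terminal), reward 1 on the final
# transition only; true (undiscounted) values are V(0) = V(1) = 1.
V = np.zeros(3)
for _ in range(100):
    V = every_visit_mc([0, 1], [0.0, 1.0], V)
```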

This algorithm is again an instance of stochastic approximation. As such, its behavior is governed by the ODE v̇(t) = V − v(t). Since the unique globally asymptotically stable equilibrium of this ODE is V, V̂_t again converges to V almost surely. Since both algorithms achieve the same goal, one may wonder which algorithm is better.

TD(0) or Monte-Carlo?  First, let us consider an example where TD(0) converges faster. Consider the undiscounted episodic MRP shown in Figure 4. The initial state is either 1 or 2. With high probability the process starts at state 1, while the process starts at state 2 less frequently. Consider now how TD(0) behaves at state 2. By the time state 2 is visited for the kth time, on average state 3 has already been visited 10k times. Assume that α_t = 1/(t + 1). At state 3, the TD(0) update reduces to averaging the Bernoulli rewards incurred upon leaving state 3. At the kth visit of state 2, Var[V̂_t(3)] ≈ 1/(10k) (clearly, E[V̂_t(3)] = V(3) = 0.5). Thus, the target of the update of state 2 will be an estimate of the true value of state 2 with accuracy increasing with k. Now, consider the Monte-Carlo method. The Monte-Carlo method ignores the estimate of the value of state 3 and uses the Bernoulli rewards directly. In particular, Var[R_t | X_t = 2] = 0.25, i.e., the variance of the target does not change with time. In this example, this makes the Monte-Carlo method slower to converge, showing that bootstrapping might indeed help sometimes.

Figure 4: An episodic Markov reward process. In this example, all transitions are deterministic. The reward is zero, except when transitioning from state 3 to state 4, when it is given by a Bernoulli random variable with parameter 0.5. State 4 is a terminal state. When the process reaches the terminal state, it is reset to start at state 1 or 2. The probability of starting at state 1 is 0.9, while the probability of starting at state 2 is 0.1.

To see an example where bootstrapping is not helpful, imagine that the problem is modified so that the reward associated with the transition from state 3 to state 4 is made deterministically

equal to one. In this case, the Monte-Carlo method becomes faster, since R_t = 1 is the true target value, while for the value of state 2 to get close to its true value, TD(0) has to wait until the estimate of the value at state 3 becomes close to its true value. This slows down the convergence of TD(0). In fact, one can imagine a longer chain of states, where state i + 1 follows state i for i ∈ {1, ..., N}, and the only time a nonzero reward is incurred is when transitioning from state N − 1 to state N. In this example, the rate of convergence of the Monte-Carlo method is not impacted by the value of N, while TD(0) gets slower as N increases (for an informal argument, see Sutton, 1988; for a formal one with exact rates, see Beleznay et al., 1999).

3.1.3 TD(λ): Unifying Monte-Carlo and TD(0)

The previous examples show that both Monte-Carlo and TD(0) have their own merits. Interestingly, there is a way to unify these approaches. This is achieved by the so-called TD(λ) family of methods (Sutton, 1984, 1988). Here, λ ∈ [0, 1] is a parameter that allows one to interpolate between the Monte-Carlo and TD(0) updates: λ = 0 gives TD(0) (hence the name TD(0)), while λ = 1, i.e., TD(1), is equivalent to a Monte-Carlo method. In essence, given some λ > 0, the targets in the TD(λ) update are given as a mixture of the multi-step return predictions

    R_{t:k} = Σ_{s=t}^{t+k} γ^{s−t} R_{s+1} + γ^{k+1} V̂_t(X_{t+k+1}),

where the mixing coefficients are the exponential weights (1 − λ)λ^k, k ≥ 0. Thus, for λ > 0, TD(λ) is a multi-step method. The algorithm is made incremental by the introduction of the so-called eligibility traces.

In fact, the eligibility traces can be defined in multiple ways, and hence TD(λ) exists in correspondingly many forms. The update rule of TD(λ) with the so-called accumulating traces is as follows:

    δ_{t+1} = R_{t+1} + γ V̂_t(X_{t+1}) − V̂_t(X_t),
    z_{t+1}(x) = I{x = X_t} + γλ z_t(x),
    V̂_{t+1}(x) = V̂_t(x) + α_t δ_{t+1} z_{t+1}(x),
    z_0(x) = 0,   x ∈ X.

Here z_t(x) is the eligibility trace of state x. The rationale of the name is that the value of z_t(x) modulates the influence of the TD error on the update of the value stored at state x. In another variant of the algorithm, the eligibility traces are updated according to

    z_{t+1}(x) = max( I{x = X_t}, γλ z_t(x) ),   x ∈ X.

This is called the replacing-traces update. In these updates, the trace-decay parameter λ controls the amount of bootstrapping: when λ = 0, the above algorithms become identical to TD(0) (since lim_{λ→0+} (1 − λ) Σ_{k≥0} λ^k R_{t:k} = R_{t:0} = R_{t+1} + γ V̂_t(X_{t+1})). When λ = 1, we get the TD(1) algorithm, which with accumulating traces simulates the previously described every-visit Monte-Carlo algorithm in episodic problems. (For an exact equivalence, one needs to assume that the value updates happen only at the end of trajectories, up to which point the updates are just accumulated. The statement then follows because the discounted sum of temporal differences along a trajectory from a start state to a terminal state telescopes and gives the sum of rewards along the trajectory.) Replacing traces with λ = 1 correspond to a version of the Monte-Carlo algorithm in which a state is updated only when it is encountered for the first time in a trajectory. The corresponding algorithm is called the first-visit Monte-Carlo method. The formal correspondence between the first-visit Monte-Carlo method and TD(1) with replacing traces is known to hold for the undiscounted case only (Singh and Sutton, 1996). Algorithm 3 gives the pseudocode corresponding to the


Algorithm 3 The function that implements the tabular TD(λ) algorithm with replacing traces. This function must be called after each transition.

function TDLambda(X, R, Y, V, z)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, V is the array storing the current value function estimate, z is the array storing the eligibility traces
1: δ ← R + γ · V[Y] − V[X]
2: for all x ∈ X do
3:   z[x] ← γ · λ · z[x]
4:   if X = x then
5:     z[x] ← 1
6:   end if
7:   V[x] ← V[x] + α · δ · z[x]
8: end for
9: return (V, z)

variant with replacing traces.
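Algorithm 3 can be sketched in vectorized Python. The episodic chain used below (0 → 1 → terminal, with a single reward of 1 on the last transition) is hypothetical, and the traces are cleared at the start of each episode:

```python
import numpy as np

# Tabular TD(lambda) with replacing traces, cf. Algorithm 3.
gamma, lam, alpha = 0.9, 0.5, 0.1

def td_lambda_update(V, z, x, r, y):
    delta = r + gamma * V[y] - V[x]     # temporal difference error
    z *= gamma * lam                    # decay all eligibility traces
    z[x] = 1.0                          # replacing trace for the visited state
    V += alpha * delta * z              # credit every eligible state
    return V, z

# Hypothetical episodic chain: state 0 -> state 1 -> terminal state 2;
# the only reward (1.0) arrives on the final transition.
V, z = np.zeros(3), np.zeros(3)
for _ in range(500):
    z[:] = 0.0                                   # new episode: clear the traces
    V, z = td_lambda_update(V, z, 0, 0.0, 1)
    V, z = td_lambda_update(V, z, 1, 1.0, 2)

# True values: V(1) = 1, V(0) = gamma * V(1) = 0.9, V(2) = 0 (terminal).
```

Note how the single TD error on the final transition also nudges V[0] through its decayed trace γλ, which is precisely the multi-step credit assignment that distinguishes λ > 0 from TD(0).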

In practice, the best value of λ is determined by trial and error. In fact, the value of λ can be changed even during the run of the algorithm without impacting convergence. This holds for a wide range of other possible eligibility-trace updates (for precise conditions, see Bertsekas and Tsitsiklis, 1996, Sections 5.3.3 and 5.3.6). The replacing-traces version of the algorithm is believed to perform better in practice (for some examples of when this happens, consult Sutton and Barto, 1998, Section 7.8). It has been noted that λ > 0 is helpful when the learner has only partial knowledge of the state, or (in the related situation) when function approximation is used to approximate the value functions in a large state space, the topic of the next section.

In summary, TD(λ) allows one to estimate value functions in MRPs. It generalizes Monte-Carlo methods, it can be used in non-episodic problems, and it allows for bootstrapping. Further, by appropriately tuning λ, it can converge significantly faster than Monte-Carlo methods or TD(0).

3.2 Algorithms for large state spaces

When the state space is large (or infinite), it is not feasible to keep a separate value for each state in memory. In such cases, we often seek an estimate of the values in the form

    V_θ(x) = θ^⊤ φ(x),   x ∈ X,

where θ ∈ R^d is a vector of parameters and φ : X → R^d is a mapping of states to d-dimensional vectors. For a state x, the components φ_i(x) of the vector φ(x) are called the features of state x, and φ is called a feature-extraction method. The individual functions φ_i : X → R defining the components of φ are called basis functions.

Examples of function approximation methods.  Given access to the state, the features (or basis functions) can be constructed in a great many different ways. If x ∈ R (i.e., X ⊂ R), one may use a polynomial, Fourier, or wavelet basis up to some order. For example, in the case of a polynomial basis, φ(x) = (1, x, x^2, ..., x^{d−1})^⊤, or one may use an orthogonal system of polynomials if a suitable measure (such as the stationary distribution) over the states is available. This latter choice may help to increase the convergence speed of the incremental algorithms that we will discuss soon.
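For a scalar state, the polynomial basis gives a simple concrete instance of the linear architecture V_θ(x) = θ^⊤φ(x); the parameter values below are arbitrary, chosen only for illustration:

```python
import numpy as np

# Linear value-function approximation with a polynomial basis:
# phi(x) = (1, x, x^2, ..., x^{d-1}) and V_theta(x) = theta . phi(x).
d = 4

def phi(x):
    """Feature vector of a scalar state x."""
    return np.array([x ** i for i in range(d)])

theta = np.array([0.5, -1.0, 0.0, 2.0])   # arbitrary illustrative parameters

def V(x):
    return float(theta @ phi(x))

# E.g., V(2) = 0.5 - 1*2 + 0*4 + 2*8 = 14.5.
```

Learning then amounts to adjusting the d numbers in theta rather than one value per state, which is what makes the approach viable on large state spaces.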

In the case of multidimensional state spaces, the tensor-product construction is a commonly used way to construct features given features of the states' individual components. The tensor-product construction works as follows: imagine that X ⊂ X_1 × X_2 × ... × X_k. Let φ^(i) : X_i → R^{d_i} be a feature extractor defined for the ith state component. The tensor-product feature extractor φ = φ^(1) ⊗ ... ⊗ φ^(k) has d = d_1 d_2 ... d_k components, which can be conveniently indexed using multi-indices of the form (i_1, ..., i_k), 1 ≤ i_j ≤ d_j, j = 1, ..., k. Then

    φ_{(i_1,...,i_k)}(x) = φ^(1)_{i_1}(x_1) φ^(2)_{i_2}(x_2) ... φ^(k)_{i_k}(x_k).

When X ⊂ R^k, one particularly popular choice is to use radial basis function (RBF) networks, where φ^(i)(x_i) = (G(|x_i − x_i^(1)|), ..., G(|x_i − x_i^(d_i)|))^⊤. Here x_i^(j) ∈ R (j = 1, ..., d_i) is fixed by the user and G is a suitable function. A typical choice for G is G(z) = exp(−η z^2), where η > 0 is a scale parameter. The tensor-product construction in this case places Gaussians at the points of a regular grid, and the ith basis function becomes

    φ_i(x) = exp(−η ‖x − x^(i)‖^2),

where x^(i) ∈ X now denotes a point on a regular d_1 × ... × d_k grid. A related method is to use kernel smoothing:

    V_θ(x) = Σ_{i=1}^d θ_i G(‖x − x^(i)‖) / Σ_{j=1}^d G(‖x − x^(j)‖)
           = Σ_{i=1}^d θ_i [ G(‖x − x^(i)‖) / Σ_{j=1}^d G(‖x − x^(j)‖) ].    (20)

More generally, one may use V_θ(x) = Σ_{i=1}^d θ_i s_i(x), where s_i ≥ 0 and Σ_{i=1}^d s_i(x) ≡ 1 holds for any x ∈ X. In this case, we say that V_θ is an averager. Averagers are important in reinforcement learning because the mapping θ ↦ V_θ is a non-expansion in the max-norm, which makes averagers "well-behaved" when used together with approximate dynamic programming.
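The kernel-smoothing estimate (20) is easy to realize in code; the centers and parameters below are arbitrary illustrative values. Because the weights s_i(x) are nonnegative and sum to one, the prediction is a convex combination of the parameters, which is exactly the averager property:

```python
import numpy as np

# Kernel smoothing, Equation (20), with Gaussian kernel G(z) = exp(-eta z^2).
# The centers x^{(i)} and parameters theta are arbitrary illustrative values.
eta = 1.0
centers = np.array([0.0, 1.0, 2.0])     # the points x^{(i)}
theta = np.array([0.0, 1.0, 4.0])       # one parameter per center

def G(z):
    return np.exp(-eta * z ** 2)

def V(x):
    w = G(np.abs(x - centers))
    w = w / w.sum()                     # weights s_i(x): nonnegative, sum to 1
    return float(w @ theta)

# Being a convex combination of theta, the output of V can never leave the
# interval [min(theta), max(theta)], for any input x.
```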

An alternative to the above is to use binary features, i.e., φ(x) ∈ {0, 1}^d. Binary features may be advantageous from a computational point of view: when φ(x) ∈ {0, 1}^d, then V_θ(x) = Σ_{i: φ_i(x)=1} θ_i. Thus, the value of state x can be computed at the cost of s additions if φ(x) is s-sparse (i.e., if only s elements of φ(x) are nonzero), provided that there is a direct way of computing the indices of the nonzero components of the feature vector.

This is the case when state aggregation is used to define the features. In this case, the coordinate functions of φ (the individual features) correspond to indicators of non-overlapping regions of the state space X whose union covers X (i.e., the regions form a partition of the state space). Clearly, in this case, θ^⊤ φ(x) is constant over each individual region; thus, state aggregation essentially "discretizes" the state space. A state-aggregation function approximator is also an averager.
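As a tiny sketch of sparse evaluation with state aggregation (the binning of [0, 1) into 10 equal intervals is a hypothetical choice), the value of a state is read off with a single addition, since exactly one binary feature is active:

```python
import numpy as np

# State aggregation over [0, 1): feature i is the indicator of the interval
# [i/10, (i+1)/10), so phi(x) is 1-sparse and V_theta(x) is a single lookup.
d = 10
theta = np.arange(d, dtype=float)        # arbitrary illustrative parameters

def active_indices(x):
    """Indices of the nonzero (binary) features of state x in [0, 1)."""
    return [int(x * d)]

def V(x):
    # Sum over active features only: cost is s additions for an s-sparse phi(x).
    return sum(theta[i] for i in active_indices(x))
```

The same pattern (store theta densely, enumerate only the active indices) applies to tile coding, where each of the s tilings contributes one active index.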

Another choice that leads to binary features is tile coding (originally called CMAC; Albus, 1971, 1981). In the simplest version of tile coding, the basis functions of φ correspond to indicator functions of multiple shifted partitions (tilings) of the state space: if s tilings are used, φ is s-sparse. To make tile coding an effective function approximation method, the offsets of the tilings corresponding to different dimensions should be different.

The curse of dimensionality.  The issue with tensor-product constructions, state aggregation, and straightforward tile coding is that when the state space is high-dimensional, they quickly become intractable: for example, a tiling of [0, 1]^D with cubical regions with side-lengths of ε gives rise to d = ε^{−D}-dimensional feature and parameter vectors. If ε = 1/2 and D = 100, we get the enormous number d ≈ 10^30. This is problematic, since state representations with hundreds of dimensions are common in applications. At this stage, one may wonder whether it is possible at all to successfully deal with applications where the state lives in a high-dimensional space. What often comes to the rescue is that the actual problem complexity might be much lower than what is predicted by merely counting the number of dimensions of the state variable (although there is no guarantee that this happens). To see why this sometimes holds, note that the same problem can have multiple representations, some of which may come with low-dimensional state variables, some with high-dimensional ones. Since, in many cases, the state representation is chosen by the user in a conservative fashion, it may happen that in the chosen representation many of the state variables are irrelevant. It may also happen that the states that are actually encountered lie on (or close to) a low-dimensional submanifold of the chosen high-dimensional "state space".

To illustrate this, imagine an industrial robot arm with say 3 joints and 6 degrees of freedom.

The intrinsic dimensionality of the state is then 12, twice the number of degrees of freedom

of the arm since the dynamics is second-order. One (approximate) state representation

is to take high resolution camera images of the arm in close succession (to account for the

dynamics) from multiple angles (to account for occlusions). The dimensionality of the chosen


state representation will easily be in the range of millions, yet the intrinsic dimensionality

will still be 12. In fact, the more cameras we have, the higher the dimensionality will be.

A simple-minded approach that aims to minimize the dimensionality would suggest using as few cameras as possible. But more information should not hurt! Therefore, the quest

should be for clever algorithms and function approximation methods that can deal with

high-dimensional but low complexity problems.

Possibilities include using strip-like tilings combined with hash functions, interpolators that use low-discrepancy grids (Lemieux, 2009, Chapters 5 and 6), or random projections (Dasgupta and Freund, 2008). Nonlinear function approximation methods (examples of which include neural networks with sigmoidal transfer functions in the hidden layers, or RBF networks where the centers are also treated as parameters) and nonparametric techniques also hold great promise.
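As a sketch of the random-projection idea, the raw high-dimensional representation can simply be multiplied by a random matrix before learning. This is a generic Gaussian projection, not the specific construction of Dasgupta and Freund (2008), and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 10_000, 50                        # raw and projected dimensions (illustrative)
A = rng.normal(size=(d, D)) / np.sqrt(d)  # random projection matrix

x = rng.normal(size=D)                   # a high-dimensional raw state representation
phi = A @ x                              # d-dimensional features, computed in O(d * D)

# Random projections approximately preserve distances (Johnson-Lindenstrauss),
# so learning can proceed in the much smaller d-dimensional feature space.
ratio = np.linalg.norm(phi) / np.linalg.norm(x)
```

The ratio above concentrates around 1, which is the sense in which the geometry of the raw representation survives the projection.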

Nonparametric methods In a nonparametric method, the user does not start with a fixed finite-dimensional representation, as in the previous examples, but allows the representation to grow and change as needed. For example, in a k-nearest-neighbor method for regression, given the data D_n = [(x_1, v_1), ..., (x_n, v_n)], where x_i ∈ R^k, v_i ∈ R, the value at location x is predicted using

$$V_D^{(k)}(x) = \sum_{i=1}^{n} v_i \, \frac{K_D^{(k)}(x, x_i)}{k},$$

where $K_D^{(k)}(x, x')$ is one when $x'$ is closer to $x$ than the $k$th closest neighbor of $x$ in $D$, and is zero otherwise. Note that $k = \sum_{j=1}^{n} K_D^{(k)}(x, x_j)$. Replacing $k$ in the above expression with this sum and replacing $K_D^{(k)}(x, \cdot)$ with some other data-based kernel $K_D$ (e.g., a Gaussian centered around $x$ with standard deviation proportional to the distance to the $k$th nearest neighbor), we arrive at nonparametric kernel smoothing:

$$V_D^{(k)}(x) = \sum_{i=1}^{n} v_i \, \frac{K_D(x, x_i)}{\sum_{j=1}^{n} K_D(x, x_j)},$$

which should be compared to its parametric counterpart (20). Other examples include methods that work by finding an appropriate function in some large (infinite-dimensional) function space that fits the data, e.g., by minimizing an empirical error. The function space is usually a reproducing kernel Hilbert space, which is a convenient choice from the point of view of optimization. In special cases, we get spline smoothers (Wahba, 2003) and Gaussian process regression (Rasmussen and Williams, 2005). Another idea is to split the input space recursively into finer regions using some heuristic criterion and then predict the values in the leaves with some simple method, leading to tree-based methods. The border between parametric and nonparametric methods

is blurry. For example, a linear predictor in which the number of basis functions is allowed to change (i.e., new basis functions are introduced as needed) becomes a nonparametric method. Thus, when one experiments with different feature-extraction methods, from the point of view of the overall tuning process one really uses a nonparametric technique. In fact, from this viewpoint it follows that in practice "true" parametric methods are rarely used, if they are used at all.
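The k-nearest-neighbor and kernel-smoothing predictors discussed above can be sketched as follows. One-dimensional inputs and a fixed Gaussian bandwidth are simplifications of the text, which suggests tying the bandwidth to the distance of the kth nearest neighbor:

```python
import numpy as np

def knn_predict(x, X, v, k=3):
    """k-nearest-neighbor regression: average the values of the k points
    of the data set closest to the query x."""
    d = np.abs(np.asarray(X) - x)          # distances (1-d inputs for simplicity)
    nearest = np.argsort(d)[:k]
    return v[nearest].mean()

def kernel_smooth(x, X, v, h=0.5):
    """Nadaraya-Watson kernel smoothing with a Gaussian kernel of fixed
    bandwidth h: a weighted average of all observed values."""
    w = np.exp(-0.5 * ((np.asarray(X) - x) / h) ** 2)
    return float(np.dot(w, v) / w.sum())

X = np.array([0.0, 1.0, 2.0, 3.0])
v = np.array([0.0, 1.0, 2.0, 3.0])
print(knn_predict(1.1, X, v, k=2))   # averages v at x=1 and x=2 -> 1.5
```

Note how neither function commits to a finite-dimensional parameter vector in advance: the "representation" is the data set itself, which grows with n.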

The advantage of nonparametric methods is their inherent flexibility. However, this usually comes at the price of increased computational complexity. Therefore, when using nonparametric methods, efficient implementations are important (e.g., one should use k-d trees when implementing nearest-neighbor methods, or the fast Gauss transform when implementing a Gaussian smoother). Also, nonparametric methods must be carefully tuned, as they can easily overfit or underfit. For example, in a k-nearest-neighbor method, if k is too large the method will introduce too much smoothing (i.e., it will underfit), while if k is too small it will fit to the noise (i.e., overfit). Overfitting will be discussed further in Section 3.2.4. For more information about nonparametric regression, the reader is advised to consult the books by Härdle (1990); Györfi et al. (2002); Tsybakov (2009).

Although our discussion below will assume a parametric function approximation method

(and in many cases linear function approximation), many of the algorithms can be extended

to nonparametric techniques. We will mention when such extensions exist as appropriate.

Up to now, the discussion implicitly assumed that the state is accessible for measurement.

This is, however, rarely the case in practical applications. Luckily, the methods that we

will discuss below do not actually need to access the states directly, but they can perform

equally well when some “suﬃciently descriptive feature-based representation” of the states

is available (such as the camera images in the robot-arm example). A common way of

arriving at such a representation is to construct state estimators (or observers, in control

terminology) based on the history of the observations, which has a large literature both in

machine learning and control. The discussion of these techniques, however, lies outside of

the scope of the present paper.

3.2.1 TD(λ) with function approximation

Let us return to the problem of estimating the value function V of a Markov reward process M = (X, P_0), but now assume that the state space is large (or even infinite). Let D = ((X_t, R_{t+1}); t ≥ 0) be a realization of M. The goal, as before, is to estimate the value function of M given D in an incremental manner.

Choose a smooth parametric function-approximation method (V_θ; θ ∈ R^d) (i.e., for any θ ∈ R^d, V_θ : X → R is such that ∇_θ V_θ(x) exists for all x ∈ X). The generalization of


Algorithm 4 The function implementing the TD(λ) algorithm with linear function approximation. This function must be called after each transition.
function TDLambdaLinFApp(X, R, Y, θ, z)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, θ ∈ R^d is the parameter vector of the linear function approximation, z ∈ R^d is the vector of eligibility traces
1: δ ← R + γ · θ⊤ϕ[Y] − θ⊤ϕ[X]
2: z ← ϕ[X] + γ · λ · z
3: θ ← θ + α · δ · z
4: return (θ, z)

tabular TD(λ) with accumulating eligibility traces to the case when the value functions are approximated using members of (V_θ; θ ∈ R^d) uses the following updates (Sutton, 1984, 1988):

$$\begin{aligned}
\delta_{t+1} &= R_{t+1} + \gamma V_{\theta_t}(X_{t+1}) - V_{\theta_t}(X_t),\\
z_{t+1} &= \nabla_\theta V_{\theta_t}(X_t) + \gamma\lambda\, z_t,\\
\theta_{t+1} &= \theta_t + \alpha_t\, \delta_{t+1}\, z_{t+1},\\
z_0 &= 0.
\end{aligned} \tag{21}$$

Here z_t ∈ R^d. Algorithm 4 shows the pseudocode of this algorithm.
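In Python, one step of update (21) for the linear case (cf. Algorithm 4) might be written as follows; the step size, discount, and trace-decay values are illustrative:

```python
import numpy as np

def td_lambda_step(theta, z, phi_x, phi_y, r,
                   alpha=0.1, gamma=0.99, lam=0.8):
    """One TD(lambda) update with linear function approximation and
    accumulating eligibility traces, following update (21)."""
    delta = r + gamma * theta @ phi_y - theta @ phi_x   # TD error
    z = phi_x + gamma * lam * z                         # eligibility trace
    theta = theta + alpha * delta * z                   # parameter update
    return theta, z

# Tabular special case: with indicator features phi(x_i) = e_i, the update
# reduces to tabular TD(lambda), as argued in the text.
theta, z = np.zeros(3), np.zeros(3)
phi = np.eye(3)
theta, z = td_lambda_step(theta, z, phi[0], phi[1], r=1.0)
```

Because only θ and the trace z are carried between calls, the per-step cost is O(d), independent of the size of the state space.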

To see that this algorithm is indeed a generalization of tabular TD(λ), assume that X = {x_1, ..., x_D} and let V_θ(x) = θ⊤ϕ(x) with ϕ_i(x) = I{x = x_i}. Note that since V_θ is linear in the parameters (i.e., V_θ = θ⊤ϕ), it holds that ∇_θ V_θ = ϕ. Hence, identifying z_{t,i} (resp., θ_{t,i}) with z_t(x_i) (resp., V̂_t(x_i)), we see that update (21) indeed reduces to the previous one.

In the off-policy version of TD(λ), the definition of δ_{t+1} becomes

$$\delta_{t+1} = R_{t+1} + \gamma V_{\theta_t}(Y_{t+1}) - V_{\theta_t}(X_t).$$

Unlike in the tabular case, under off-policy sampling convergence is no longer guaranteed; in fact, the parameters may diverge (see, e.g., Bertsekas and Tsitsiklis, 1996, Example 6.7, p. 307). This happens with linear function approximation when the distribution of (X_t; t ≥ 0) does not match the stationary distribution of the MRP M. Another case when the algorithm may diverge is when it is used with a nonlinear function-approximation method (see, e.g., Bertsekas and Tsitsiklis, 1996, Example 6.6, p. 292). For further examples of instability, see Baird (1995); Boyan and Moore (1995).

On the positive side, almost sure convergence can be guaranteed when (i) a linear function-approximation method is used with ϕ : X → R^d; (ii) the stochastic process (X_t; t ≥ 0) is an ergodic Markov process whose stationary distribution µ is the same as the stationary distribution of the MRP M; and (iii) the step-size sequence satisfies the RM conditions (Tsitsiklis and Van Roy, 1997; Bertsekas and Tsitsiklis, 1996, p. 222, Section 5.3.7). In the results cited, it is also assumed that the components of ϕ (i.e., ϕ_1, ..., ϕ_d) are linearly independent. When this holds, the limit of the parameter vector will be unique. In the other case, i.e., when the features are redundant, the parameters will still converge, but the limit will depend on the parameter vector's initial value. However, the limiting value function will be unique (Bertsekas, 2010).

Assuming that TD(λ) converges, let θ^{(λ)} denote the limiting value of θ_t. Let

$$F = \{ V_\theta \mid \theta \in \mathbb{R}^d \}$$

be the space of functions that can be represented using the chosen features ϕ. Note that F is a linear subspace of the vector space of all real-valued functions with domain X. The limit θ^{(λ)} is known to satisfy the so-called projected fixed-point equation

$$V_{\theta^{(\lambda)}} = \Pi_{F,\mu}\, T^{(\lambda)} V_{\theta^{(\lambda)}}, \tag{22}$$

where the operators T^{(λ)} and Π_{F,µ} are defined as follows. For m ∈ N, let T^{[m]} be the m-step lookahead Bellman operator:

$$T^{[m]} \hat{V}(x) = \mathbb{E}\left[ \sum_{t=0}^{m} \gamma^{t} R_{t+1} + \gamma^{m+1} \hat{V}(X_{m+1}) \,\middle|\, X_0 = x \right].$$

Clearly, V, the value function to be estimated, is a fixed point of T^{[m]} for any m ≥ 0. Assume that λ < 1. Then the operator T^{(λ)} is defined as the exponentially weighted average of T^{[0]}, T^{[1]}, ...:

$$T^{(\lambda)} \hat{V}(x) = (1-\lambda) \sum_{m=0}^{\infty} \lambda^{m}\, T^{[m]} \hat{V}(x).$$

For λ = 1, we let T^{(1)} V̂ = lim_{λ→1⁻} T^{(λ)} V̂ = V. Notice that for λ = 0, T^{(0)} = T. The operator Π_{F,µ} is a projection: it projects functions of states onto the linear space F with respect to the weighted 2-norm ‖f‖²_µ = Σ_{x∈X} f²(x) µ(x):

$$\Pi_{F,\mu} \hat{V} = \operatorname*{argmin}_{f \in F} \| \hat{V} - f \|_{\mu}.$$

The essence of the proof of convergence of TD(λ) is that the composite operator Π_{F,µ} T^{(λ)} is a contraction with respect to the norm ‖·‖_µ. This result heavily exploits that µ is the stationary distribution underlying M (which defines T^{(λ)}). For other distributions, the composite operator might not be a contraction, in which case TD(λ) might diverge.

As to the quality of the solution found, the following error bound holds for the fixed point of (22):

$$\| V_{\theta^{(\lambda)}} - V \|_{\mu} \le \frac{1}{\sqrt{1-\gamma_{\lambda}}} \, \| \Pi_{F,\mu} V - V \|_{\mu}.$$

Here γ_λ = γ(1−λ)/(1−λγ) is the contraction modulus of Π_{F,µ} T^{(λ)} (Tsitsiklis and Van Roy, 1999a; Bertsekas, 2007b). (For sharper bounds, see Yu and Bertsekas 2008; Scherrer 2010.)

From the error bound we see that V_{θ^{(1)}} is the best approximation to V within F with respect to the norm ‖·‖_µ (this should come as no surprise, as TD(1) minimizes this mean-squared error by design). We also see that as we let λ → 0, the bound allows for larger errors. It is known that this is not an artifact of the analysis. In fact, in Example 6.5 of the book by Bertsekas and Tsitsiklis (1996) (p. 288), a simple MRP with n states and a one-dimensional feature extractor ϕ is given such that V_{θ^{(0)}} is a very poor approximation to V, while V_{θ^{(1)}} is a reasonable approximation. Thus, in order to get good accuracy when working with λ < 1, it is not enough to choose the function space F so that the best approximation to V has small error. At this stage, however, one might wonder whether using λ < 1 makes sense at all. A recent paper by Van Roy (2006) suggests that when considering performance-loss bounds instead of approximation errors and the full control learning task (cf. Section 4), λ = 0 will in general be at no disadvantage compared to λ = 1, at least when state aggregation is considered. Thus, while the mean-squared error of the solution might be large, when the solution is used in control, the performance of the resulting policy will still be as good as that of the one obtained by calculating the TD(1) solution. However, the major reason to prefer TD(λ) with λ < 1 over TD(1) is that empirical evidence suggests that it converges much faster than TD(1); the latter, at least for practical sample sizes, often produces very poor estimates (e.g., Sutton and Barto, 1998, Section 8.6).
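For a small, fully known MRP, the fixed point of (22) can be computed directly. The sketch below does this for λ = 0 on a hypothetical two-state MRP (all numbers are illustrative): in matrix form, the TD(0) parameter solves Φ⊤D(Φ − γPΦ)θ = Φ⊤Dr with D = diag(µ), and it can differ markedly from the µ-weighted best fit to V, in line with the discussion above.

```python
import numpy as np

# A hypothetical 2-state MRP (illustrative numbers, not from the text).
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])           # transition matrix
r = np.array([1.0, 0.0])             # expected immediate rewards
gamma = 0.9
mu = np.array([0.5, 0.5])            # stationary distribution of P
Phi = np.array([[1.0], [2.0]])       # one linear feature per state
D = np.diag(mu)

# lambda = 0 projected fixed point: Phi^T D (Phi - gamma P Phi) theta = Phi^T D r
A = Phi.T @ D @ (Phi - gamma * P @ Phi)
b = Phi.T @ D @ r
theta = np.linalg.solve(A, b)

V = np.linalg.solve(np.eye(2) - gamma * P, r)   # true value function
theta_best = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ V)  # mu-weighted best fit
print(theta, theta_best)   # the TD(0) solution differs from the best fit
```

Here θ ≈ 1.05 while the best µ-weighted fit is θ ≈ 2.9, a small-scale analogue of the poor-TD(0)-solution phenomenon from Example 6.5 cited above.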

TD(λ) solves a model Sutton et al. (2008) and Parr et al. (2008) observed independently of each other that the solution obtained by TD(0) can be thought of as the solution of a deterministic MRP with linear dynamics. In fact, as we will argue now, this also holds in the case of TD(λ).

This suggests that if the deterministic MRP captures the essential features of the original MRP, V_{θ^{(λ)}} will be a good approximation to V. To firm up this statement, following Parr et al. (2008), let us study the Bellman error

$$\Delta^{(\lambda)}(\hat{V}) = T^{(\lambda)} \hat{V} - \hat{V}$$

of V̂ : X → R under T^{(λ)}. Note that ∆^{(λ)}(V̂) : X → R. A simple contraction argument shows that

$$\| V - \hat{V} \|_{\infty} \le \frac{1}{1-\gamma} \, \| \Delta^{(\lambda)}(\hat{V}) \|_{\infty}.$$

Hence, if ∆^{(λ)}(V̂) is small, V̂ is close to V.


The following error decomposition can be shown to hold:⁸

$$\Delta^{(\lambda)}(V_{\theta^{(\lambda)}}) = (1-\lambda) \sum_{m \ge 0} \lambda^{m} \Delta^{[r]}_{m} + \gamma \Big( (1-\lambda) \sum_{m \ge 0} \lambda^{m} \Delta^{[\varphi]}_{m} \Big) \theta^{(\lambda)}.$$

Here ∆^{[r]}_m = r_m − Π_{F,µ} r_m and ∆^{[ϕ]}_m = P^{m+1}ϕ⊤ − Π_{F,µ} P^{m+1}ϕ⊤ are the errors of modeling the m-step rewards and transitions with respect to the features ϕ, respectively; r_m : X → R is defined by r_m(x) = E[R_{m+1} | X_0 = x], and P^{m+1}ϕ⊤ denotes a function that maps states to d-dimensional row vectors and is defined by P^{m+1}ϕ⊤(x) = (P^{m+1}ϕ_1(x), ..., P^{m+1}ϕ_d(x)). Here P^m ϕ_i : X → R is the function defined by P^m ϕ_i(x) = E[ϕ_i(X_m) | X_0 = x]. Thus, we see that the Bellman error will be small if the m-step immediate rewards and the m-step feature expectations are well captured by the features. We can also see that as λ gets closer to 1, it becomes more important for the features to capture the structure of the value function, and as λ gets closer to 0, it becomes more important to capture the structure of the immediate rewards and the immediate feature expectations. This suggests that the "best" value of λ (i.e., the one that minimizes ‖∆^{(λ)}(V_{θ^{(λ)}})‖) may depend on whether the features are more successful at capturing the short-term or the long-term dynamics (and rewards).

3.2.2 Gradient temporal difference learning

That TD(λ) can diverge in off-policy learning situations spoils its otherwise immaculate record. In Section 3.2.3, we will introduce some methods that avoid this issue. However, as we will see, the computational (time and storage) complexity of these methods is significantly larger than that of TD(λ). In this section, we present two recent algorithms introduced by Sutton et al. (2009b,a) which also overcome the instability issue, converge to the TD(λ) solutions in the on-policy case, and yet are almost as efficient as TD(λ).

For simplicity, we consider the case when λ = 0, ((X_t, R_{t+1}, Y_{t+1}); t ≥ 0) is a stationary process, X_t ∼ ν (ν can be different from the stationary distribution of P), and linear function approximation is used with linearly independent features. Assume that θ^{(0)}, the solution to (22), exists. Consider the objective function

$$J(\theta) = \| V_\theta - \Pi_{F,\nu}\, T V_\theta \|^{2}_{\nu}. \tag{23}$$

Notice that all solutions to (22) are minimizers of J, and there are no other minimizers of J when (22) has solutions. Thus, minimizing J will give a solution to (22). Let θ* denote a minimizer of J. Since, by assumption, the features are linearly independent, the minimizer

⁸ Parr et al. (2008) observed this for λ = 0. The extension to λ > 0 is new.


of J is unique, i.e., θ* is well defined. Introduce the shorthand notations

$$\begin{aligned}
\delta_{t+1}(\theta) &= R_{t+1} + \gamma V_\theta(Y_{t+1}) - V_\theta(X_t) \\
&= R_{t+1} + \gamma\, \theta^{\top} \varphi'_{t+1} - \theta^{\top} \varphi_{t}, \\
\varphi_{t} &= \varphi(X_t), \\
\varphi'_{t+1} &= \varphi(Y_{t+1}).
\end{aligned} \tag{24}$$

A simple calculation allows us to rewrite J in the following form:

$$J(\theta) = \mathbb{E}[\delta_{t+1}(\theta)\varphi_{t}]^{\top} \, \mathbb{E}[\varphi_{t}\varphi_{t}^{\top}]^{-1} \, \mathbb{E}[\delta_{t+1}(\theta)\varphi_{t}]. \tag{25}$$
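Expression (25), the objective that Sutton et al. (2009b,a) call the mean-squared projected Bellman error, can be estimated from sampled transitions by replacing the three expectations with empirical averages. A minimal sketch, where the sampled features and rewards are synthetic placeholders:

```python
import numpy as np

def mspbe(theta, phis, phis_next, rewards, gamma=0.9):
    """Empirical estimate of J(theta) in (25): the expectations
    E[delta_{t+1} phi_t] and E[phi_t phi_t^T] are replaced by sample means."""
    deltas = rewards + gamma * phis_next @ theta - phis @ theta  # TD errors
    e_dphi = (deltas[:, None] * phis).mean(axis=0)               # E[delta phi]
    c = phis.T @ phis / len(phis)                                # E[phi phi^T]
    return float(e_dphi @ np.linalg.solve(c, e_dphi))

rng = np.random.default_rng(0)
n, d = 1000, 3
phis = rng.normal(size=(n, d))       # phi(X_t)  (synthetic data)
phis_next = rng.normal(size=(n, d))  # phi(Y_{t+1})
rewards = rng.normal(size=n)
value = mspbe(np.zeros(d), phis, phis_next, rewards)
assert value >= 0.0  # J is a squared weighted norm, hence nonnegative
```

Since E[ϕϕ⊤] is a Gram matrix, the quadratic form is nonnegative, matching the interpretation of J as a squared norm.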

Taking the gradient of the objective function, we get

$$\nabla_\theta J(\theta) = -2\, \mathbb{E}\big[(\varphi_{t} - \gamma \varphi'_{t+1}) \varphi_{t}^{\top}\big]\, w(\theta), \tag{26}$$