
Algorithms for Reinforcement Learning

Draft of the lecture published in the

Synthesis Lectures on Artiﬁcial Intelligence and Machine Learning

series

by

Morgan & Claypool Publishers

Csaba Szepesvári

June 9, 2009∗

Contents

1 Overview 3

2 Markov decision processes 7

2.1 Preliminaries ................................... 7

2.2 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3 Value functions .................................. 12

2.4 Dynamic programming algorithms for solving MDPs . . . . . . . . . . . . . . 16

3 Value prediction problems 17

3.1 Temporal diﬀerence learning in ﬁnite state spaces . . . . . . . . . . . . . . . 18

3.1.1 Tabular TD(0) .............................. 18

3.1.2 Every-visit Monte-Carlo . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.1.3 TD(λ): Unifying Monte-Carlo and TD(0) . . . . . . . . . . . . . . . . 23

3.2 Algorithms for large state spaces . . . . . . . . . . . . . . . . . . . . . . . . 25

3.2.1 TD(λ) with function approximation . . . . . . . . . . . . . . . . . . . 29

3.2.2 Gradient temporal diﬀerence learning . . . . . . . . . . . . . . . . . . 33

3.2.3 Least-squares methods . . . . . . . . . . . . . . . . . . . . . . . . . . 36

∗Last update: August 18, 2010


3.2.4 The choice of the function space . . . . . . . . . . . . . . . . . . . . . 42

4 Control 45

4.1 A catalog of learning problems . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.2 Closed-loop interactive learning . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2.1 Online learning in bandits . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2.2 Active learning in bandits . . . . . . . . . . . . . . . . . . . . . . . . 49

4.2.3 Active learning in Markov Decision Processes . . . . . . . . . . . . . 50

4.2.4 Online learning in Markov Decision Processes . . . . . . . . . . . . . 51

4.3 Direct methods .................................. 56

4.3.1 Q-learning in ﬁnite MDPs . . . . . . . . . . . . . . . . . . . . . . . . 56

4.3.2 Q-learning with function approximation . . . . . . . . . . . . . . . . 59

4.4 Actor-critic methods ............................... 62

4.4.1 Implementing a critic . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.4.2 Implementing an actor . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5 For further exploration 72

5.1 Further reading .................................. 72

5.2 Applications.................................... 73

5.3 Software...................................... 73

5.4 Acknowledgements ................................ 73

A The theory of discounted Markovian decision processes 74

A.1 Contractions and Banach’s ﬁxed-point theorem . . . . . . . . . . . . . . . . 74

A.2 Application to MDPs ............................... 78

Abstract

Reinforcement learning is a learning paradigm concerned with learning to control a

system so as to maximize a numerical performance measure that expresses a long-term

objective. What distinguishes reinforcement learning from supervised learning is that

only partial feedback is given to the learner about the learner’s predictions. Further,

the predictions may have long term eﬀects through inﬂuencing the future state of the

controlled system. Thus, time plays a special role. The goal in reinforcement learning

is to develop eﬃcient learning algorithms, as well as to understand the algorithms’

merits and limitations. Reinforcement learning is of great interest because of the large

number of practical applications that it can be used to address, ranging from problems

in artiﬁcial intelligence to operations research or control engineering. In this book, we

focus on those algorithms of reinforcement learning that build on the powerful theory of

dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas together with a large number of state-of-the-art algorithms, followed by the discussion of their theoretical properties and limitations.

Figure 1: The basic reinforcement learning scenario

Keywords: reinforcement learning; Markov Decision Processes; temporal difference learning; stochastic approximation; two-timescale stochastic approximation; Monte-Carlo methods; simulation optimization; function approximation; stochastic gradient methods; least-squares methods; overfitting; bias-variance tradeoff; online learning; active learning; planning; simulation; PAC-learning; Q-learning; actor-critic methods; policy gradient; natural gradient

1 Overview

Reinforcement learning (RL) refers to both a learning problem and a subﬁeld of machine

learning. As a learning problem, it refers to learning to control a system so as to maxi-

mize some numerical value which represents a long-term objective. A typical setting where

reinforcement learning operates is shown in Figure 1: A controller receives the controlled

system’s state and a reward associated with the last state transition. It then calculates an

action which is sent back to the system. In response, the system makes a transition to a

new state and the cycle is repeated. The problem is to learn a way of controlling the system

so as to maximize the total reward. The learning problems diﬀer in the details of how the

data is collected and how performance is measured.
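The interaction loop of Figure 1 can be sketched in a few lines of code. This is an illustrative skeleton only, not an algorithm from the book: `env_step`, `policy`, and the toy system below are made-up names standing in for a concrete system and controller.

```python
def run_loop(env_step, policy, x0, horizon=10):
    """The controller-system loop of Figure 1: observe the state and reward,
    compute an action, send it to the system, repeat.

    `env_step(x, a) -> (next_state, reward)` and `policy(x) -> action` are
    placeholders for a concrete system and controller."""
    x, total = x0, 0.0
    for _ in range(horizon):
        a = policy(x)            # controller computes an action from the state
        x, r = env_step(x, a)    # system transitions and emits a reward
        total += r               # performance measure: total reward collected
    return total

# A toy system: integer state, actions +1/-1, reward 1 whenever the state is 0.
def toy_step(x, a):
    return x + a, (1.0 if x == 0 else 0.0)

print(run_loop(toy_step, lambda x: -1 if x > 0 else +1, x0=3))  # 4.0
```

The controller here is a fixed rule; the learning problems discussed in the book differ precisely in how such a rule is improved from the observed states and rewards.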

In this book, we assume that the system that we wish to control is stochastic. Further,

we assume that the measurements available on the system’s state are detailed enough so

that the controller can avoid reasoning about how to collect information about the


state. Problems with these characteristics are best described in the framework of Markovian

Decision Processes (MDPs). The standard approach to ‘solve’ MDPs is to use dynamic

programming, which transforms the problem of ﬁnding a good controller into the problem

of ﬁnding a good value function. However, apart from the simplest cases when the MDP has

very few states and actions, dynamic programming is infeasible. The RL algorithms that

we discuss here can be thought of as a way of turning the infeasible dynamic programming

methods into practical algorithms so that they can be applied to large-scale problems.

There are two key ideas that allow RL algorithms to achieve this goal. The ﬁrst idea is to

use samples to compactly represent the dynamics of the control problem. This is important

for two reasons: First, it allows one to deal with learning scenarios when the dynamics is

unknown. Second, even if the dynamics is available, exact reasoning that uses it might

be intractable on its own. The second key idea behind RL algorithms is to use powerful

function approximation methods to compactly represent value functions. The signiﬁcance

of this is that it allows dealing with large, high-dimensional state- and action-spaces. What

is more, the two ideas ﬁt nicely together: Samples may be focused on a small subset of the

spaces they belong to, which clever function approximation techniques might exploit. It is

the understanding of the interplay between dynamic programming, samples and function

approximation that is at the heart of designing, analyzing and applying RL algorithms.

The purpose of this book is to allow the reader to have a chance to peek into this beautiful

ﬁeld. However, certainly we are not the ﬁrst to set out to accomplish this goal. In 1996,

Kaelbling et al. wrote a nice, compact survey about the approaches and algorithms

available at the time (Kaelbling et al., 1996). This was followed by the publication of the book

by Bertsekas and Tsitsiklis (1996), which detailed the theoretical foundations. A few years

later Sutton and Barto, the ‘fathers’ of RL, published their book, where they presented their

ideas on RL in a very clear and accessible manner (Sutton and Barto, 1998). A more recent

and comprehensive overview of the tools and techniques of dynamic programming/optimal

control is given in the two-volume book by Bertsekas (2007a,b) which devotes one chapter

to RL methods.¹ At times, when a field is rapidly developing, books can get out of date

pretty quickly. In fact, to keep up with the growing body of new results, Bertsekas maintains

an online version of his Chapter 6 of Volume II of his book, which, at the time of writing

this survey, counted as much as 160 pages (Bertsekas, 2010). Other recent books on the

subject include the book of Gosavi (2003) who devotes 60 pages to reinforcement learning

algorithms in Chapter 9, concentrating on average cost problems, or that of Cao (2007) who

focuses on policy gradient methods. Powell (2007) presents the algorithms and ideas from an

operations research perspective and emphasizes methods that are capable of handling large

¹ In this book, RL is called neuro-dynamic programming or approximate dynamic programming. The

term neuro-dynamic programming stems from the fact that, in many cases, RL algorithms are used with

artiﬁcial neural networks.


control spaces, Chang et al. (2008) focuses on adaptive sampling (i.e., simulation-based

performance optimization), while the center of the recent book by Busoniu et al. (2010) is

function approximation.

Thus, by no means do RL researchers lack a good body of literature. However, what seems

to be missing is a self-contained and yet relatively short summary that can help newcomers

to the ﬁeld to develop a good sense of the state of the art, as well as existing researchers to

broaden their overview of the ﬁeld, an article, similar to that of Kaelbling et al. (1996), but

with updated contents. Filling this gap is the very purpose of this short book.

Having the goal of keeping the text short, we had to make a few, hopefully not too troubling, compromises. The first compromise we made was to present results only for the total

expected discounted reward criterion. This choice is motivated by the fact that this is the criterion

that is both widely used and the easiest to deal with mathematically. The next compro-

mise is that the background on MDPs and dynamic programming is kept ultra-compact

(although an appendix is added that explains these basic results). Apart from these, the

book aims to cover a bit of all aspects of RL, up to the level that the reader should be

able to understand the whats and hows, as well as to implement the algorithms presented.

Naturally, we still had to be selective in what we present. Here, the decision was to focus

on the basic algorithms, ideas, as well as the available theory. Special attention was paid to

describing the choices of the user, as well as the tradeoﬀs that come with these. We tried

to be impartial as much as possible, but some personal bias, as usual, surely remained. The

pseudocode of almost twenty algorithms was included, hoping that this will make it easier

for the practically inclined reader to implement the algorithms described.

The target audience is advanced undergraduate and graduate students, as well as researchers

and practitioners who want to get a good overview of the state of the art in RL quickly.

Researchers who are already working on RL might also enjoy reading about parts of the RL

literature that they are not so familiar with, thus broadening their perspective on RL. The

reader is assumed to be familiar with the basics of linear algebra, calculus, and probability

theory. In particular, we assume that the reader is familiar with the concepts of random

variables, conditional expectations, and Markov chains. It is helpful, but not necessary,

for the reader to be familiar with statistical learning theory, as the essential concepts will

be explained as needed. In some parts of the book, knowledge of regression techniques of

machine learning will be useful.

This book has three parts. In the ﬁrst part, in Section 2, we provide the necessary back-

ground. It is here where the notation is introduced, followed by a short overview of the

theory of Markov Decision Processes and the description of the basic dynamic programming

algorithms. Readers familiar with MDPs and dynamic programming should skim through

this part to familiarize themselves with the notation used. Readers who are less familiar with MDPs should spend enough time here before moving on, because the rest of the book

builds heavily on the results and ideas presented here.

The remaining two parts are devoted to the two basic RL problems (cf. Figure 1), one part

devoted to each. In Section 3, the problem of learning to predict values associated with

states is studied. We start by explaining the basic ideas for the so-called tabular case when

the MDP is small enough so that one can store one value per state in an array allocated in

a computer’s main memory. The ﬁrst algorithm explained is TD(λ), which can be viewed

as the learning analogue to value iteration from dynamic programming. After this, we

consider the more challenging situation when there are more states than what ﬁts into a

computer’s memory. Clearly, in this case, one must compress the table representing the

values. Abstractly, this can be done by relying on an appropriate function approximation

method. First, we describe how TD(λ) can be used in this situation. This is followed by the

description of some new gradient based methods (GTD2 and TDC), which can be viewed

as improved versions of TD(λ) in that they avoid some of the convergence diﬃculties that

TD(λ) faces. We then discuss least-squares methods (in particular, LSTD(λ) and λ-LSPE)

and compare them to the incremental methods described earlier. Finally, we describe choices

available for implementing function approximation and the tradeoﬀs that these choices come

with.

The second part (Section 4) is devoted to algorithms that are developed for control learning.

First, we describe methods whose goal is optimizing online performance. In particular,

we describe the “optimism in the face of uncertainty” principle and methods that explore

their environment based on this principle. State of the art algorithms are given both for

bandit problems and MDPs. The message here is that clever exploration methods make

a large diﬀerence, but more work is needed to scale up the available methods to large

problems. The rest of this section is devoted to methods that aim at developing methods

that can be used in large-scale applications. As learning in large-scale MDPs is signiﬁcantly

more diﬃcult than learning when the MDP is small, the goal of learning is relaxed to

learning a good enough policy in the limit. First, direct methods are discussed which aim at

estimating the optimal action-values directly. These can be viewed as the learning analogue

of value iteration of dynamic programming. This is followed by the description of actor-

critic methods, which can be thought of as the counterpart of the policy iteration algorithm

of dynamic programming. Both methods based on direct policy improvement and policy

gradient (i.e., which use parametric policy classes) are presented.

The book is concluded in Section 5, which lists some topics for further exploration.


[Figure labels: Prediction; Value iteration; Policy iteration; Policy search; Control]

Figure 2: Types of reinforcement problems and approaches.

2 Markov decision processes

The purpose of this section is to introduce the notation that will be used in the subsequent

parts and the most essential facts that we will need from the theory of Markov Decision

Processes (MDPs) in the rest of the book. Readers familiar with MDPs should skim through

this section to familiarize themselves with the notation. Readers unfamiliar with MDPs are

suggested to spend enough time with this section to understand the details. Proofs of most

of the results (with some simpliﬁcations) are included in Appendix A. The reader who is

interested in learning more about MDPs is suggested to consult one of the many excellent

books on the subject, such as the books of Bertsekas and Shreve (1978), Puterman (1994),

or the two-volume book by Bertsekas (2007a,b).

2.1 Preliminaries

We use $\mathbb{N}$ to denote the set of natural numbers, $\mathbb{N} = \{0, 1, 2, \ldots\}$, while $\mathbb{R}$ denotes the set of reals. By a vector $v$ (unless it is transposed, $v^\top$), we mean a column vector. The inner product of two finite-dimensional vectors $u, v \in \mathbb{R}^d$ is $\langle u, v \rangle = \sum_{i=1}^d u_i v_i$. The resulting 2-norm is $\|u\|_2^2 = \langle u, u \rangle$. The maximum norm for vectors is defined by $\|u\|_\infty = \max_{i=1,\ldots,d} |u_i|$, while for a function $f : \mathcal{X} \to \mathbb{R}$, $\|\cdot\|_\infty$ is defined by $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$. A mapping $T$ between the metric spaces $(M_1, d_1)$, $(M_2, d_2)$ is called Lipschitz with modulus $L \in \mathbb{R}$ if for any $a, b \in M_1$, $d_2(T(a), T(b)) \le L\, d_1(a, b)$. If $T$ is Lipschitz with a modulus $L \le 1$, it is called a non-expansion. If $L < 1$, the mapping is called a contraction. The indicator function of event $S$ will be denoted by $\mathbb{I}_{\{S\}}$ (i.e., $\mathbb{I}_{\{S\}} = 1$ if $S$ holds and $\mathbb{I}_{\{S\}} = 0$ otherwise). If $v = v(\theta, x)$, $\frac{\partial}{\partial \theta} v$ shall denote the partial derivative of $v$ with respect to $\theta$, which is a $d$-dimensional row vector if $\theta \in \mathbb{R}^d$. The total derivative of some expression $v$ with respect to $\theta$ will be denoted by $\frac{d}{d\theta} v$ (and will be treated as a row vector). Further, $\nabla_\theta v = (\frac{d}{d\theta} v)^\top$. If $P$ is a distribution or a probability measure, then $X \sim P$ means that $X$ is a random variable drawn from $P$.
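Contractions matter because iterating one converges to its unique fixed point (Banach's fixed-point theorem, Appendix A); this is what makes the dynamic programming operators of the next sections computable. A tiny numerical illustration, with an affine map on the reals invented here as the contraction:

```python
def T(v, gamma=0.9, c=1.0):
    # |T(a) - T(b)| = gamma * |a - b|, so T is Lipschitz with modulus gamma;
    # since gamma < 1, T is a contraction.
    return gamma * v + c

# Banach's fixed-point theorem: iterating a contraction from any starting
# point converges to its unique fixed point, here v* = c / (1 - gamma) = 10.
v = 0.0
for _ in range(200):
    v = T(v)
print(abs(v - 1.0 / (1.0 - 0.9)) < 1e-6)  # True
```

The error shrinks by the factor $\gamma$ at every step, mirroring the geometric convergence rates quoted for value iteration later in the book.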


2.2 Markov Decision Processes

For ease of exposition, we restrict our attention to countable MDPs and the discounted total

expected reward criterion. However, under some technical conditions, the results extend to

continuous state-action MDPs, too. This also holds true for the results presented in later

parts of this book.

A countable MDP is defined as a triplet $\mathcal{M} = (\mathcal{X}, \mathcal{A}, \mathcal{P}_0)$, where $\mathcal{X}$ is the countable non-empty set of states and $\mathcal{A}$ is the countable non-empty set of actions. The transition probability kernel $\mathcal{P}_0$ assigns to each state-action pair $(x, a) \in \mathcal{X} \times \mathcal{A}$ a probability measure over $\mathcal{X} \times \mathbb{R}$, which we shall denote by $\mathcal{P}_0(\cdot \mid x, a)$. The semantics of $\mathcal{P}_0$ is the following: for $U \subset \mathcal{X} \times \mathbb{R}$, $\mathcal{P}_0(U \mid x, a)$ gives the probability that the next state and the associated reward belong to the set $U$, provided that the current state is $x$ and the action taken is $a$.² We also fix a discount factor $0 \le \gamma \le 1$ whose role will become clear soon.

The transition probability kernel gives rise to the state transition probability kernel $\mathcal{P}$ which, for any $(x, a, y) \in \mathcal{X} \times \mathcal{A} \times \mathcal{X}$ triplet, gives the probability of moving from state $x$ to some other state $y$ provided that action $a$ was chosen in state $x$:
$$\mathcal{P}(x, a, y) = \mathcal{P}_0(\{y\} \times \mathbb{R} \mid x, a).$$
In addition to $\mathcal{P}$, $\mathcal{P}_0$ also gives rise to the immediate reward function $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, which gives the expected immediate reward received when action $a$ is chosen in state $x$: if $(Y_{(x,a)}, R_{(x,a)}) \sim \mathcal{P}_0(\cdot \mid x, a)$, then
$$r(x, a) = \mathbb{E}\!\left[ R_{(x,a)} \right].$$

In what follows, we shall assume that the rewards are bounded by some quantity $\mathcal{R} > 0$: for any $(x, a) \in \mathcal{X} \times \mathcal{A}$, $|R_{(x,a)}| \le \mathcal{R}$ almost surely.³ It is immediate that if the random rewards are bounded by $\mathcal{R}$, then $\|r\|_\infty = \sup_{(x,a) \in \mathcal{X} \times \mathcal{A}} |r(x, a)| \le \mathcal{R}$ also holds. An MDP is called finite if both $\mathcal{X}$ and $\mathcal{A}$ are finite.

Markov Decision Processes are a tool for modeling sequential decision-making problems

where a decision maker interacts with a system in a sequential fashion. Given an MDP M,

this interaction happens as follows: let $t \in \mathbb{N}$ denote the current time (or stage), and let $X_t \in \mathcal{X}$

² The probability $\mathcal{P}_0(U \mid x, a)$ is defined only when $U$ is a Borel-measurable set. Borel-measurability is a technical notion whose purpose is to prevent some pathologies. The collection of Borel-measurable subsets of $\mathcal{X} \times \mathbb{R}$ includes practically all "interesting" subsets of $\mathcal{X} \times \mathbb{R}$. In particular, they include subsets of the form $\{x\} \times [a, b]$ and subsets which can be obtained from such subsets by taking their complement, or the union (intersection) of at most countable collections of such sets in a recursive fashion.

³ "Almost surely" means the same as "with probability one" and is used to refer to the fact that the statement concerned holds everywhere on the probability space with the exception of a set of events with measure zero.


and $A_t \in \mathcal{A}$ denote the random state of the system and the action chosen by the decision maker at time $t$, respectively. Once the action is selected, it is sent to the system, which makes a transition:
$$(X_{t+1}, R_{t+1}) \sim \mathcal{P}_0(\cdot \mid X_t, A_t). \tag{1}$$
In particular, $X_{t+1}$ is random and $\mathbb{P}(X_{t+1} = y \mid X_t = x, A_t = a) = \mathcal{P}(x, a, y)$ holds for any $x, y \in \mathcal{X}$, $a \in \mathcal{A}$. Further, $\mathbb{E}[R_{t+1} \mid X_t, A_t] = r(X_t, A_t)$. The decision maker then observes the next state $X_{t+1}$ and reward $R_{t+1}$, chooses a new action $A_{t+1} \in \mathcal{A}$, and the process is repeated. The goal of the decision maker is to come up with a way of choosing the actions so as to maximize the expected total discounted reward.

The decision maker can select its actions at any stage based on the observed history. A rule describing the way the actions are selected is called a behavior. A behavior of the decision maker and some initial random state $X_0$ together define a random state-action-reward sequence $((X_t, A_t, R_{t+1});\, t \ge 0)$, where $(X_{t+1}, R_{t+1})$ is connected to $(X_t, A_t)$ by (1) and $A_t$ is the action prescribed by the behavior based on the history $X_0, A_0, R_1, \ldots, X_{t-1}, A_{t-1}, R_t, X_t$.⁴ The return underlying a behavior is defined as the total discounted sum of the rewards incurred:
$$\mathcal{R} = \sum_{t=0}^{\infty} \gamma^t R_{t+1}.$$
Thus, if $\gamma < 1$ then rewards far in the future are worth exponentially less than the reward received at the first stage. An MDP when the return is defined by this formula is called a discounted reward MDP. When $\gamma = 1$, the MDP is called undiscounted.

The goal of the decision maker is to choose a behavior that maximizes the expected return, irrespective of how the process is started. Such a maximizing behavior is said to be optimal.
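For a finite trajectory, the return above is a straightforward computation; a small helper written directly from the formula $\sum_{t \ge 0} \gamma^t R_{t+1}$ (the function name is ours, not the book's):

```python
def discounted_return(rewards, gamma):
    """Sum_{t>=0} gamma^t * rewards[t]: the (truncated) return of a trajectory."""
    g = 0.0
    for r in reversed(rewards):  # Horner-style backward accumulation:
        g = r + gamma * g        # g_t = r_t + gamma * g_{t+1}
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The backward recursion is the same one that value-based algorithms later exploit: the return from time $t$ equals the immediate reward plus $\gamma$ times the return from $t+1$.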

Example 1 (Inventory control with lost sales): Consider the problem of day-to-day control

of an inventory of a ﬁxed maximum size in the face of uncertain demand: Every evening,

the decision maker must decide about the quantity to be ordered for the next day. In the

morning, the ordered quantity arrives with which the inventory is ﬁlled up. During the day,

some stochastic demand is realized, where the demands are independent with a common ﬁxed

distribution; see Figure 3. The goal of the inventory manager is to manage the inventory so

as to maximize the present monetary value of the expected total future income.

The payoff at time step $t$ is determined as follows: the cost associated with purchasing $A_t$ items is $K \mathbb{I}_{\{A_t > 0\}} + c A_t$. Thus, there is a fixed entry cost $K$ of ordering a nonzero number of items, and each item must be purchased at a fixed price $c$. Here $K, c > 0$. In addition, there is a cost of

holding an inventory of size x > 0. In the simplest case, this cost is proportional to the size

⁴ Mathematically, a behavior is an infinite sequence of probability kernels $\pi_0, \pi_1, \ldots, \pi_t, \ldots$, where $\pi_t$ maps histories of length $t$ to a probability distribution over the action space $\mathcal{A}$: $\pi_t = \pi_t(\cdot \mid x_0, a_0, r_0, \ldots, x_{t-1}, a_{t-1}, r_{t-1}, x_t)$.


of the inventory with proportionality factor $h > 0$. Finally, upon selling $z$ units the manager is paid the monetary amount of $p\, z$, where $p > 0$. In order to make the problem interesting, we must have $p > h$; otherwise, there is no incentive to order new items.

This problem can be represented as an MDP as follows: let the state $X_t$ on day $t \ge 0$ be the size of the inventory in the evening of that day. Thus, $\mathcal{X} = \{0, 1, \ldots, M\}$, where $M \in \mathbb{N}$ is the maximum inventory size. The action $A_t$ gives the number of items ordered in the evening of day $t$. Thus, we can choose $\mathcal{A} = \{0, 1, \ldots, M\}$, since there is no need to consider orders larger than the inventory size. Given $X_t$ and $A_t$, the size of the next inventory is given by
$$X_{t+1} = \big( (X_t + A_t) \wedge M - D_{t+1} \big)^+, \tag{2}$$
where $a \wedge b$ is a shorthand notation for the minimum of the numbers $a$ and $b$, $(a)^+ = a \vee 0 = \max(a, 0)$ is the positive part of $a$, and $D_{t+1} \in \mathbb{N}$ is the demand on the $(t+1)$th day. By assumption, $(D_t;\, t > 0)$ is a sequence of independent and identically distributed (i.i.d.) integer-valued random variables. The revenue made on day $t + 1$ is
$$R_{t+1} = -K \mathbb{I}_{\{A_t > 0\}} - c \big( (X_t + A_t) \wedge M - X_t \big)^+ - h X_t + p \big( (X_t + A_t) \wedge M - X_{t+1} \big)^+. \tag{3}$$

Equations (2)–(3) can be written in the compact form
$$(X_{t+1}, R_{t+1}) = f(X_t, A_t, D_{t+1}), \tag{4}$$
with an appropriately chosen function $f$. Then, $\mathcal{P}_0$ is given by
$$\mathcal{P}_0(U \mid x, a) = \mathbb{P}\big( f(x, a, D) \in U \big) = \sum_{d=0}^{\infty} \mathbb{I}_{\{f(x,a,d) \in U\}}\, p_D(d).$$
Here $p_D(\cdot)$ is the probability mass function of the random demands and $D \sim p_D(\cdot)$. This finishes the definition of the MDP underlying the inventory optimization problem.
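The transition function $f$ of equation (4) translates directly into code. A minimal sketch, in which the parameter values ($M$, $K$, $c$, $h$, $p$) are invented for illustration; the model itself leaves them free:

```python
# Hypothetical parameter values; M, K, c, h, p are free parameters of the model.
M, K, c, h, p = 20, 2.0, 1.0, 0.1, 2.0

def f(x, a, d):
    """Transition function of equation (4): maps (X_t, A_t, D_{t+1})
    to (X_{t+1}, R_{t+1}) via equations (2) and (3)."""
    filled = min(x + a, M)            # (x + a) ∧ M: stock after the morning delivery
    y = max(filled - d, 0)            # equation (2): next evening's inventory
    r = (-K * (a > 0)                 # fixed entry cost of ordering
         - c * max(filled - x, 0)     # purchase cost of the delivered items
         - h * x                      # holding cost for the evening stock
         + p * max(filled - y, 0))    # revenue: p per unit sold during the day
    return y, r

# One day: 5 items in stock, order 7, demand of 6.
y, r = f(5, 7, 6)
print(y, r)  # 6 2.5
```

Sampling the demand $d$ from $p_D(\cdot)$ and iterating $f$ simulates trajectories of this MDP, which is exactly how the sample-based algorithms of later sections would interact with it.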

Inventory control is just one of the many operations research problems that give rise to

an MDP. Other problems include optimizing transportation systems, optimizing schedules

or production. MDPs arise naturally in many engineering optimal control problems, too,

such as the optimal control of chemical, electronic or mechanical systems (the latter class

includes the problem of controlling robots). Quite a few information theory problems can

also be represented as MDPs (e.g., optimal coding, optimizing channel allocation, or sensor

networks). Another important class of problems comes from ﬁnance. These include, amongst

others, optimal portfolio management and option pricing.


Figure 3: Illustration of the inventory management problem

In the case of the inventory control problem, the MDP was conveniently speciﬁed by a

transition function f(cf., (4)). In fact, transition functions are as powerful as transition

kernels: any MDP gives rise to some transition function fand any transition function f

gives rise to some MDP.

In some problems, not all actions are meaningful in all states. For example, ordering more

items than one has room for in the inventory does not make much sense. However, such meaningless (or forbidden) actions can always be remapped to other actions, just as was done above. In some cases, this is unnatural and leads to convoluted dynamics. Then, it might be better to introduce an additional mapping which assigns the

set of admissible actions to each state.

In some MDPs, some states are impossible to leave: if $x$ is such a state, then $X_{t+s} = x$ holds almost surely for any $s \ge 1$ provided that $X_t = x$, no matter what actions are selected after time $t$. By convention, we will assume that no reward is incurred in such terminal or absorbing states. An MDP with such states is called episodic. An episode then is the (generally random) time period from the beginning of time until a terminal state is reached. In an episodic MDP, we often consider undiscounted rewards, i.e., when $\gamma = 1$.

Example 2 (Gambling): A gambler enters a game in which she may stake any fraction $A_t \in [0, 1]$ of her current wealth $X_t \ge 0$. She wins her stake back and as much more with probability $p \in [0, 1]$, while she loses her stake with probability $1 - p$. Thus, the fortune of the gambler evolves according to
$$X_{t+1} = (1 + S_{t+1} A_t) X_t.$$
Here $(S_t;\, t \ge 1)$ is a sequence of independent random variables taking values in $\{-1, +1\}$ with $\mathbb{P}(S_{t+1} = 1) = p$. The goal of the gambler is to maximize the probability that her wealth reaches an a priori given value $w^* > 0$. It is assumed that the initial wealth is in $[0, w^*]$. This problem can be represented as an episodic MDP, where the state space is $\mathcal{X} = [0, w^*]$ and the action space is $\mathcal{A} = [0, 1]$.⁵ We define
$$X_{t+1} = (1 + S_{t+1} A_t) X_t \wedge w^*, \tag{5}$$
when $0 \le X_t < w^*$, and make $w^*$ a terminal state: $X_{t+1} = X_t$ if $X_t = w^*$. The immediate reward is zero as long as $X_{t+1} < w^*$ and is one when the state reaches $w^*$ for the first time:
$$R_{t+1} = \begin{cases} 1, & X_t < w^* \text{ and } X_{t+1} = w^*; \\ 0, & \text{otherwise.} \end{cases}$$
If we set the discount factor to one, the total reward along any trajectory will be one or zero depending on whether the wealth reaches $w^*$. Thus, the expected total reward is just the probability that the gambler's fortune reaches $w^*$.
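Since the expected total reward equals the probability of reaching $w^*$, it can be estimated by simulating episodes. A Monte-Carlo sketch under assumed values ($p = 0.45$, $w^* = 1$, initial wealth $0.5$) and two made-up stationary policies; none of these numbers come from the text:

```python
import random

def success_probability(policy, p=0.45, w_star=1.0, x0=0.5,
                        n_episodes=2000, max_steps=200, seed=0):
    """Monte-Carlo estimate of the probability that the gambler's wealth
    reaches w_star, i.e., the expected total undiscounted reward."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_episodes):
        x = x0
        for _ in range(max_steps):
            if x >= w_star:                       # terminal state: goal reached
                wins += 1
                break
            if x <= 0.0:                          # ruined: wealth can never grow again
                break
            a = policy(x)                         # stake, as a fraction of wealth
            s = 1 if rng.random() < p else -1     # win or lose the stake
            x = min((1 + s * a) * x, w_star)      # equation (5)
    return wins / n_episodes

# Bold play (stake everything) wins with probability p in one bet from x0 = 0.5;
# timid play (stake 10%) must win many bets in a row against unfavorable odds.
bold = success_probability(lambda x: 1.0)
timid = success_probability(lambda x: 0.1)
print(bold, timid)
```

With $p < 1/2$ the classical result is that bold play does well, and the simulation reflects this: the bold estimate sits near $p$, well above the timid one.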

Based on the two examples presented so far, the reader unfamiliar with MDPs might believe

that all MDPs come with handy finite, one-dimensional state- and action-spaces. If only this were true! In fact, in practical applications the state- and action-spaces are often very large, multidimensional spaces. For example, in a robot control application, the dimensionality of the state space can be 3–6 times the number of joints the robot has. An industrial robot's state space might easily be 12–20 dimensional, while the state space of a humanoid

robot might easily have 100 dimensions. In a real-world inventory control application, items

would have multiple types, the prices and costs would also change based on the state of

the “market”, whose state would thus also become part of the MDP’s state. Hence, the

state space in any such practical application would be very large and very high dimensional.

The same holds for the action spaces. Thus, working with large, multidimensional state- and

action-spaces should be considered the normal situation, while the examples presented in this

section with their one-dimensional, small state spaces should be viewed as the exceptions.

2.3 Value functions

The obvious way of ﬁnding an optimal behavior in some MDP is to list all behaviors and

then identify the ones that give the highest possible value for each initial state. Since, in

general, there are too many behaviors, this plan is not viable. A better approach is based on

computing value functions. In this approach, one ﬁrst computes the so-called optimal value

function, which then allows one to determine an optimal behavior with relative ease.

The optimal value, $V^*(x)$, of state $x \in \mathcal{X}$ gives the highest achievable expected return when the process is started from state $x$. The function $V^* : \mathcal{X} \to \mathbb{R}$ is called the optimal value function. A behavior that achieves the optimal values in all states is optimal.

⁵ Hence, in this case the state and action spaces are continuous. Notice that our definition of MDPs is general enough to encompass this case, too.

Deterministic stationary policies represent a special class of behaviors which, as we shall see soon, play an important role in the theory of MDPs. They are specified by some mapping $\pi$ which maps states to actions (i.e., $\pi : \mathcal{X} \to \mathcal{A}$). Following $\pi$ means that at any time $t \ge 0$ the action $A_t$ is selected using
$$A_t = \pi(X_t). \tag{6}$$
More generally, a stochastic stationary policy (or just stationary policy) $\pi$ maps states to distributions over the action space. When referring to such a policy $\pi$, we shall use $\pi(a|x)$ to denote the probability of action $a$ being selected by $\pi$ in state $x$. Note that if a stationary policy is followed in an MDP, i.e., if
$$A_t \sim \pi(\cdot \mid X_t), \quad t \in \mathbb{N},$$
the state process $(X_t;\, t \ge 0)$ will be a (time-homogeneous) Markov chain. We will use $\Pi_{\mathrm{stat}}$ to denote the set of all stationary policies. For brevity, in what follows, we will often say just "policy" instead of "stationary policy", hoping that this will not cause confusion.

A stationary policy and an MDP induce what is called a Markov reward process (MRP): an MRP is determined by a pair $\mathcal{M} = (\mathcal{X}, \mathcal{P}_0)$, where now $\mathcal{P}_0$ assigns a probability measure over $\mathcal{X} \times \mathbb{R}$ to each state. An MRP $\mathcal{M}$ gives rise to the stochastic process $((X_t, R_{t+1});\, t \ge 0)$, where $(X_{t+1}, R_{t+1}) \sim \mathcal{P}_0(\cdot \mid X_t)$. (Note that $(Z_t;\, t \ge 0)$, $Z_t = (X_t, R_t)$, is a time-homogeneous Markov process, where $R_0$ is an arbitrary random variable, while $((X_t, R_{t+1});\, t \ge 0)$ is a second-order Markov process.) Given a stationary policy $\pi$ and the MDP $\mathcal{M} = (\mathcal{X}, \mathcal{A}, \mathcal{P}_0)$, the transition kernel of the MRP $(\mathcal{X}, \mathcal{P}_0^\pi)$ induced by $\pi$ and $\mathcal{M}$ is defined using $\mathcal{P}_0^\pi(\cdot \mid x) = \sum_{a \in \mathcal{A}} \pi(a|x)\, \mathcal{P}_0(\cdot \mid x, a)$. An MRP is called finite if its state space is finite.

Let us now define value functions underlying stationary policies.^6 For this, let us fix some policy π ∈ Π_stat. The value function V^π : X → R underlying π is defined by

    V^π(x) = E[ Σ_{t=0}^∞ γ^t R_{t+1} | X_0 = x ],   x ∈ X,    (7)

with the understanding (i) that the process (R_t; t ≥ 1) is the "reward-part" of the process ((X_t, A_t, R_{t+1}); t ≥ 0) obtained when following policy π and (ii) that X_0 is selected at random such that P(X_0 = x) > 0 holds for all states x. This second condition makes the conditional expectation in (7) well-defined for every state. As long as the initial state distribution satisfies this condition, its particular choice has no influence on the definition of the values.

^6 Value functions can also be defined underlying any behavior, analogously to the definition given below.


The value function underlying an MRP is defined in the same way and is denoted by V:

    V(x) = E[ Σ_{t=0}^∞ γ^t R_{t+1} | X_0 = x ],   x ∈ X.

It will also be useful to define the action-value function Q^π : X × A → R underlying a policy π ∈ Π_stat in an MDP. Assume that the first action A_0 is selected at random such that P(A_0 = a) > 0 holds for all a ∈ A, while for the subsequent stages of the decision process the actions are chosen by following policy π. Let ((X_t, A_t, R_{t+1}); t ≥ 0) be the resulting stochastic process, where X_0 is as in the definition of V^π. Then

    Q^π(x, a) = E[ Σ_{t=0}^∞ γ^t R_{t+1} | X_0 = x, A_0 = a ],   x ∈ X, a ∈ A.

Similarly to V*(x), the optimal action-value Q*(x, a) at the state-action pair (x, a) is defined as the maximum of the expected return under the constraints that the process starts at state x and the first action chosen is a. The underlying function Q* : X × A → R is called the optimal action-value function.

The optimal value and action-value functions are connected by the following equations:

    V*(x) = sup_{a∈A} Q*(x, a),   x ∈ X,

    Q*(x, a) = r(x, a) + γ Σ_{y∈X} P(x, a, y) V*(y),   x ∈ X, a ∈ A.

In the class of MDPs considered here, an optimal stationary policy always exists:

    V*(x) = sup_{π∈Π_stat} V^π(x),   x ∈ X.

In fact, any policy π ∈ Π_stat that satisfies the equality

    Σ_{a∈A} π(a|x) Q*(x, a) = V*(x)    (8)

simultaneously for all states x ∈ X is optimal. Notice that in order for (8) to hold, π(·|x) must be concentrated on the set of actions that maximize Q*(x, ·). In general, given some action-value function Q : X × A → R, an action that maximizes Q(x, ·) for some state x is called greedy with respect to Q in state x. A policy that chooses only greedy actions with respect to Q in all states is called greedy w.r.t. Q.

Thus, a greedy policy with respect to Q* is optimal, i.e., the knowledge of Q* alone is sufficient for finding an optimal policy. Similarly, knowing V*, r, and P also suffices to act optimally.

The next question is how to find V* or Q*. Let us start with the simpler question of how to find the value function of a policy:

Fact 1 (Bellman Equations for Deterministic Policies): Fix an MDP M = (X, A, P_0), a discount factor γ, and a deterministic policy π ∈ Π_stat. Let r be the immediate reward function of M. Then V^π satisfies

    V^π(x) = r(x, π(x)) + γ Σ_{y∈X} P(x, π(x), y) V^π(y),   x ∈ X.    (9)

This system of equations is called the Bellman equation for V^π. Define the Bellman operator underlying π, T^π : R^X → R^X, by

    (T^π V)(x) = r(x, π(x)) + γ Σ_{y∈X} P(x, π(x), y) V(y),   x ∈ X.

With the help of T^π, Equation (9) can be written in the compact form

    T^π V^π = V^π.    (10)

Note that this is a linear system of equations in V^π and T^π is an affine linear operator. If 0 < γ < 1, then T^π is a maximum-norm contraction and the fixed-point equation T^π V = V has a unique solution.

When the state space X is finite, say, it has D states, R^X can be identified with the D-dimensional Euclidean space, and V ∈ R^X can be thought of as a D-dimensional vector: V ∈ R^D. With this identification, T^π V can also be written as r^π + γ P^π V with an appropriately defined vector r^π ∈ R^D and matrix P^π ∈ R^{D×D}. In this case, (10) can be written in the form

    r^π + γ P^π V^π = V^π.    (11)
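In a finite MDP, Equation (11) can be solved directly for V^π, since V^π = (I − γP^π)^{−1} r^π. The following NumPy sketch does this for a small two-state example; the numbers in r_pi and P_pi are made up purely for illustration:

```python
import numpy as np

# Exact policy evaluation via Equation (11): solve (I - gamma * P_pi) V = r_pi.
# The two-state example below is hypothetical.
gamma = 0.9
r_pi = np.array([1.0, 0.0])          # expected immediate reward under the policy
P_pi = np.array([[0.5, 0.5],         # P_pi[x, y]: transition probabilities
                 [0.0, 1.0]])        # state 1 is absorbing with zero reward

# Solve the linear system rather than forming the matrix inverse explicitly.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# V_pi is the unique fixed point of T^pi: r_pi + gamma * P_pi @ V_pi = V_pi.
```

Since T^π is a γ-contraction for 0 < γ < 1, the matrix I − γP^π is invertible and the system always has a unique solution.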

The above facts also hold true in MRPs, where the Bellman operator T : R^X → R^X is defined by

    (T V)(x) = r(x) + γ Σ_{y∈X} P(x, y) V(y),   x ∈ X.

The optimal value function is known to satisfy a certain fixed-point equation:

Fact 2 (Bellman Optimality Equations): The optimal value function satisfies the fixed-point equation

    V*(x) = sup_{a∈A} { r(x, a) + γ Σ_{y∈X} P(x, a, y) V*(y) },   x ∈ X.    (12)


Define the Bellman optimality operator T* : R^X → R^X by

    (T* V)(x) = sup_{a∈A} { r(x, a) + γ Σ_{y∈X} P(x, a, y) V(y) },   x ∈ X.    (13)

Note that this is a nonlinear operator due to the presence of the supremum. With the help of T*, Equation (12) can be written compactly as

    T* V* = V*.

If 0 < γ < 1, then T* is a maximum-norm contraction, and the fixed-point equation T* V = V has a unique solution.

In order to minimize clutter, in what follows we will write expressions like (T^π V)(x) as T^π V(x), with the understanding that the application of the operator T^π takes precedence over the application of the point-evaluation operator "·(x)".

The action-value functions underlying a policy (or an MRP) and the optimal action-value function also satisfy fixed-point equations similar to the previous ones:

Fact 3 (Bellman Operators and Fixed-point Equations for Action-value Functions): With a slight abuse of notation, define T^π : R^{X×A} → R^{X×A} and T* : R^{X×A} → R^{X×A} as follows:

    T^π Q(x, a) = r(x, a) + γ Σ_{y∈X} P(x, a, y) Q(y, π(y)),   (x, a) ∈ X × A,    (14)

    T* Q(x, a) = r(x, a) + γ Σ_{y∈X} P(x, a, y) sup_{a'∈A} Q(y, a'),   (x, a) ∈ X × A.    (15)

Note that T^π is again affine linear, while T* is nonlinear. The operators T^π and T* are maximum-norm contractions. Further, the action-value function of π, Q^π, satisfies T^π Q^π = Q^π and Q^π is the unique solution to this fixed-point equation. Similarly, the optimal action-value function Q* satisfies T* Q* = Q* and Q* is the unique solution to this fixed-point equation.

2.4 Dynamic programming algorithms for solving MDPs

The above facts provide the basis for the value- and policy-iteration algorithms.

Value iteration generates a sequence of value functions

    V_{k+1} = T* V_k,   k ≥ 0,

where V_0 is arbitrary. Thanks to Banach's fixed-point theorem, (V_k; k ≥ 0) converges to V* at a geometric rate.

Value iteration can also be used in conjunction with action-value functions, in which case it takes the form

    Q_{k+1} = T* Q_k,   k ≥ 0,

which again converges to Q* at a geometric rate. The idea is that once V_k (or Q_k) is close to V* (resp., Q*), a policy that is greedy with respect to V_k (resp., Q_k) will be close to optimal. In particular, the following bound is known to hold: Fix an action-value function Q and let π be a greedy policy w.r.t. Q. Then the value of policy π can be lower bounded as follows (e.g., Singh and Yee, 1994, Corollary 2):

    V^π(x) ≥ V*(x) − (2 / (1 − γ)) ‖Q − Q*‖_∞,   x ∈ X.    (16)
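As a minimal illustration, the sketch below runs value iteration Q_{k+1} = T* Q_k on a hypothetical two-state, two-action MDP and then extracts a greedy policy, which by the bound (16) is near-optimal once Q is close to Q*:

```python
import numpy as np

# Value iteration on action-values: Q_{k+1} = T* Q_k. The MDP is hypothetical:
# in state 0, action 0 stays put (reward 0) and action 1 moves to the absorbing
# state 1 (reward 1); state 1 yields no further reward.
gamma = 0.9
r = np.array([[0.0, 1.0],                        # r[x, a]
              [0.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],          # P[x, a, y]
              [[0.0, 1.0], [0.0, 1.0]]])

Q = np.zeros((2, 2))
for _ in range(200):
    # (T* Q)(x, a) = r(x, a) + gamma * sum_y P(x, a, y) * max_{a'} Q(y, a')
    Q = r + gamma * P @ Q.max(axis=1)

greedy_policy = Q.argmax(axis=1)    # greedy w.r.t. Q, near-optimal once Q ~ Q*
```

Here `P @ Q.max(axis=1)` exploits NumPy's stacked-matrix multiplication: for each (x, a) it computes Σ_y P(x, a, y) max_{a'} Q(y, a').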

Policy iteration works as follows. Fix an arbitrary initial policy π_0. At iteration k > 0, compute the action-value function underlying π_k (this is called the policy evaluation step). Next, given Q^{π_k}, define π_{k+1} as a policy that is greedy with respect to Q^{π_k} (this is called the policy improvement step). After k iterations, policy iteration gives a policy that is no worse than the policy that is greedy w.r.t. the value function computed using k iterations of value iteration, provided the two procedures are started with the same initial value function. However, the computational cost of a single step of policy iteration is much higher (because of the policy evaluation step) than that of one update of value iteration.
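The two steps above can be sketched in a few lines of NumPy; the evaluation step solves the linear Bellman equations exactly, and the loop stops when the greedy policy no longer changes. The tiny MDP used here is hypothetical:

```python
import numpy as np

# Policy iteration: exact policy evaluation alternated with greedy improvement.
gamma = 0.9
r = np.array([[0.0, 1.0],                        # r[x, a] (hypothetical MDP)
              [0.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],          # P[x, a, y]
              [[0.0, 1.0], [0.0, 1.0]]])
n_states, n_actions = r.shape

pi = np.zeros(n_states, dtype=int)               # arbitrary initial policy
for _ in range(10):
    # Policy evaluation: V^pi solves V = r_pi + gamma * P_pi V.
    r_pi = r[np.arange(n_states), pi]
    P_pi = P[np.arange(n_states), pi]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: act greedily w.r.t. Q^pi.
    Q = r + gamma * P @ V
    new_pi = Q.argmax(axis=1)
    if np.array_equal(new_pi, pi):
        break                                    # policy is stable, hence optimal
    pi = new_pi
```

On a finite MDP the loop terminates after finitely many iterations, since there are only finitely many deterministic policies and each improvement step is non-decreasing in value.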

3 Value prediction problems

In this section, we consider the problem of estimating the value function V underlying some Markov reward process (MRP). Value prediction problems arise in a number of ways: estimating the probability of some future event, the expected time until some event occurs, or the (action-)value function underlying some policy in an MDP are all value prediction problems. Specific applications include estimating the failure probability of a large power grid (Frank et al., 2008) or estimating taxi-out times of flights at busy airports (Balakrishna et al., 2008), to mention just two of the many possibilities.

Since the value of a state is defined as the expectation of the random return when the process is started from the given state, an obvious way of estimating this value is to compute an average over multiple independent realizations started from the given state. This is an instance of the so-called Monte-Carlo method. Unfortunately, the variance of the returns can be high, which means that the quality of the estimates will be poor. Also, when interacting with a system in a closed-loop fashion (i.e., when estimation happens while interacting with the system), it might be impossible to reset the state of the system to some particular state. In this case, the Monte-Carlo technique cannot be applied without introducing some additional bias. Temporal difference (TD) learning (Sutton, 1984, 1988), which is without doubt one of the most significant ideas in reinforcement learning, is a method that can be used to address these issues.

3.1 Temporal diﬀerence learning in ﬁnite state spaces

The unique feature of TD learning is that it uses bootstrapping: predictions are used as targets during the course of learning. In this section, we first introduce the most basic TD algorithm and explain how bootstrapping works. Next, we compare TD learning to (vanilla) Monte-Carlo methods and argue that both of them have their own merits. Finally, we present the TD(λ) algorithm, which unifies the two approaches. Here we consider only the case of small, finite MRPs, where the value estimates of all the states can be stored in the main memory of a computer in an array or table; this is known as the tabular case in the reinforcement learning literature. Extensions of the ideas presented here to large state spaces, where a tabular representation is not feasible, will be described in the subsequent sections.

3.1.1 Tabular TD(0)

Fix some finite Markov reward process M. We wish to estimate the value function V underlying M given a realization ((X_t, R_{t+1}); t ≥ 0) of M. Let V̂_t(x) denote the estimate of the value of state x at time t (say, V̂_0 ≡ 0). In the tth step, TD(0) performs the following calculations:

    δ_{t+1} = R_{t+1} + γ V̂_t(X_{t+1}) − V̂_t(X_t),
    V̂_{t+1}(x) = V̂_t(x) + α_t δ_{t+1} I{X_t = x},   x ∈ X.    (17)

Here the step-size sequence (α_t; t ≥ 0) consists of (small) nonnegative numbers chosen by the user. Algorithm 1 shows the pseudocode of this algorithm.

A closer inspection of the update equation reveals that the only value changed is the one associated with X_t, i.e., the state just visited (cf. line 2 of the pseudocode). Further, when α_t ≤ 1, the value of X_t is moved towards the "target" R_{t+1} + γ V̂_t(X_{t+1}). Since the target depends on the estimated value function, the algorithm uses bootstrapping. The term "temporal difference" in the name of the algorithm comes from the fact that δ_{t+1} is defined as the


Algorithm 1 The function implementing the tabular TD(0) algorithm. This function must be called after each transition.

function TD0(X, R, Y, V)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, V is the array storing the current value estimates
1: δ ← R + γ · V[Y] − V[X]
2: V[X] ← V[X] + α · δ
3: return V

difference between the values of states corresponding to successive time steps. In particular, δ_{t+1} is called a temporal difference error.
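A compact Python version of the TD(0) update is given below; it is fed a stream of transitions from a hypothetical two-state reward process (state 0 pays a reward of 1 and moves to the absorbing state 1, which pays nothing):

```python
import numpy as np

# Tabular TD(0), cf. update (17) and Algorithm 1. A constant step-size alpha
# is used; the two-state reward process below is hypothetical.
gamma, alpha = 0.9, 0.1
V = np.zeros(2)

def td0_update(V, x, r, y):
    """One TD(0) step for the observed transition (x, r, y)."""
    delta = r + gamma * V[y] - V[x]      # temporal difference error
    V[x] += alpha * delta                # move V[x] towards the target
    return V

for _ in range(200):
    V = td0_update(V, 0, 1.0, 1)         # state 0: reward 1, next state 1
    V = td0_update(V, 1, 0.0, 1)         # state 1: reward 0, stays put

# True values here: V(1) = 0 and V(0) = 1 + gamma * V(1) = 1.
```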

Just like many other algorithms in reinforcement learning, tabular TD(0) is a stochastic approximation (SA) algorithm. It is easy to see that if it converges, then it must converge to a function V̂ such that the expected temporal difference given V̂,

    F V̂(x) := E[ R_{t+1} + γ V̂(X_{t+1}) − V̂(X_t) | X_t = x ],

is zero for all states x, or at least for all states that are sampled infinitely often. A simple calculation shows that F V̂ = T V̂ − V̂, where T is the Bellman operator underlying the MRP considered. By Fact 1, F V̂ = 0 has a unique solution, the value function V. Thus, if TD(0) converges (and all states are sampled infinitely often), then it must converge to V.

To study the algorithm's convergence properties, for simplicity, assume that (X_t; t ∈ N) is a stationary, ergodic Markov chain.^7 Further, identify the approximate value functions V̂_t with D-dimensional vectors as before (e.g., V̂_{t,i} = V̂_t(x_i), i = 1, ..., D, where D = |X| and X = {x_1, ..., x_D}). Then, assuming that the step-size sequence satisfies the Robbins-Monro (RM) conditions,

    Σ_{t=0}^∞ α_t = ∞,   Σ_{t=0}^∞ α_t^2 < +∞,

the sequence (V̂_t ∈ R^D; t ∈ N) will track the trajectories of the ordinary differential equation (ODE)

    v̇(t) = c F(v(t)),   t ≥ 0,    (18)

where c = 1/D and v(t) ∈ R^D (e.g., Borkar, 1998). Borrowing the notation used in (11), the above ODE can be written as

    v̇ = r + (γP − I) v.

Note that this is a linear ODE. Since the eigenvalues of γP − I all lie in the open left half of the complex plane, this ODE is globally asymptotically stable. From this, using standard results of SA, it follows that V̂_t converges almost surely to V.

^7 Recall that a Markov chain (X_t; t ∈ N) is ergodic if it is irreducible, aperiodic, and positive recurrent. Practically, this means that the law of large numbers holds for sufficiently regular functions of the chain.

On step-sizes.  Since many of the algorithms that we will discuss use step-sizes, it is worthwhile to spend some time on discussing their choice. A simple step-size sequence that satisfies the above conditions is α_t = c/t, with c > 0. More generally, any step-size sequence of the form α_t = c t^{−η} will work as long as 1/2 < η ≤ 1. Of these step-size sequences, η = 1 gives the smallest step-sizes. Asymptotically, this choice will be the best, but from the point of view of the transient behavior of the algorithm, choosing η closer to 1/2 will work better (since with this choice the step-sizes are bigger and thus the algorithm will make larger moves). It is possible to do even better than this. In fact, a simple method called iterate averaging, due to Polyak and Juditsky (1992), is known to achieve the best possible asymptotic rate of convergence. However, despite its appealing theoretical properties, iterate averaging is rarely used in practice. In fact, in practice people often use constant step-sizes, which clearly violates the RM conditions. This choice is justified on two grounds: first, the algorithms are often used in a non-stationary environment (i.e., the policy to be evaluated might change); second, the algorithms are often used only in the small-sample regime. (When a constant step-size is used, the parameters converge in distribution. The variance of the limiting distribution will be proportional to the step-size chosen.) There is also a great deal of work going into developing methods that tune step-sizes automatically; see Sutton (1992), Schraudolph (1999), and George and Powell (2006) and the references therein. However, the jury is still out on which of these methods is the best.

With a small change, the algorithm can also be used on an observation sequence of the form ((X_t, R_{t+1}, Y_{t+1}); t ≥ 0), where (X_t; t ≥ 0) is an arbitrary ergodic Markov chain over X and (Y_{t+1}, R_{t+1}) ∼ P_0(·|X_t). The change concerns the definition of the temporal differences:

    δ_{t+1} = R_{t+1} + γ V̂_t(Y_{t+1}) − V̂_t(X_t).

Then, with no extra conditions, V̂_t still converges almost surely to the value function underlying the MRP (X, P_0). In particular, the distribution of the states (X_t; t ≥ 0) does not play a role here.

This is interesting for multiple reasons. For example, if the samples are generated using a simulator, we may be able to control the distribution of the states (X_t; t ≥ 0) independently of the MRP. This might be useful to counterbalance any unevenness in the stationary distribution underlying the Markov kernel P. Another use is to learn about some target policy in an MDP while following some other policy, often called the behavior policy. Assume for simplicity that the target policy is deterministic. Then ((X_t, R_{t+1}, Y_{t+1}); t ≥ 0) could be obtained by skipping all those state-action-reward-next-state quadruples in the trajectory generated by the behavior policy where the action taken does not match the action that would have been taken in the given state by the target policy, while keeping the rest. This technique might allow one to learn about multiple policies at the same time (more generally, about multiple long-term prediction problems). Learning about one policy while following another is called off-policy learning. Because of this, we shall also call learning based on triplets ((X_t, R_{t+1}, Y_{t+1}); t ≥ 0) with Y_{t+1} ≠ X_{t+1} off-policy learning. A third, technical use is when the goal is to apply the algorithm to an episodic problem. In this case, the triplets (X_t, R_{t+1}, Y_{t+1}) are chosen as follows: first, Y_{t+1} is sampled from the transition kernel P(X_t, ·). If Y_{t+1} is not a terminal state, we let X_{t+1} = Y_{t+1}; otherwise, X_{t+1} ∼ P_0(·), where P_0 is a user-chosen distribution over X. In other words, when a terminal state is reached, the process is restarted from the initial state distribution P_0. The period between the time of a restart from P_0 and reaching a terminal state is called an episode (hence the name episodic problems). This way of generating a sample shall be called continual sampling with restarts from P_0.

Being a standard linear SA method, the rate of convergence of tabular TD(0) will be of the usual order O(1/√t) (consult the paper by Tadić (2004) and the references therein for precise results). However, the constant factor in the rate will be largely influenced by the choice of the step-size sequence, the properties of the kernel P_0, and the value of γ.

3.1.2 Every-visit Monte-Carlo

As mentioned before, one can also estimate the value of a state by computing sample means, giving rise to the so-called every-visit Monte-Carlo method. Here we define more precisely what we mean by this and compare the resulting method to TD(0).

To firm up the ideas, consider some episodic problem (otherwise, it is impossible to finitely compute the return of a given state, since the trajectories are infinitely long). Let the underlying MRP be M = (X, P_0) and let ((X_t, R_{t+1}, Y_{t+1}); t ≥ 0) be generated by continual sampling in M with restarts from some distribution P_0 defined over X. Let (T_k; k ≥ 0) be the sequence of times when an episode starts (thus, for each k, X_{T_k} is sampled from P_0). For a given time t, let k(t) be the unique episode index such that t ∈ [T_k, T_{k+1}). Let

    R_t = Σ_{s=t}^{T_{k(t)+1}−1} γ^{s−t} R_{s+1}    (19)

denote the return from time t on until the end of the episode. Clearly, V(x) = E[R_t | X_t = x] for any state x such that P(X_t = x) > 0. Hence, a sensible way of updating the estimates is to use

    V̂_{t+1}(x) = V̂_t(x) + α_t (R_t − V̂_t(x)) I{X_t = x},   x ∈ X.


Algorithm 2 The function that implements the every-visit Monte-Carlo algorithm to estimate value functions in episodic MDPs. This routine must be called at the end of each episode with the state-reward sequence collected during the episode. Note that the algorithm as shown here has linear time- and space-complexity in the length of the episodes.

function EveryVisitMC(X_0, R_1, X_1, R_2, ..., X_{T−1}, R_T, V)
Input: X_t is the state at time t, R_{t+1} is the reward associated with the tth transition, T is the length of the episode, V is the array storing the current value function estimate
1: sum ← 0
2: for t ← T − 1 downto 0 do
3:   sum ← R_{t+1} + γ · sum
4:   target[X_t] ← sum
5:   V[X_t] ← V[X_t] + α · (target[X_t] − V[X_t])
6: end for
7: return V

Monte-Carlo methods such as the above one, since they use multi-step predictions of the value (cf. Equation (19)), are called multi-step methods. The pseudocode of this update rule is shown as Algorithm 2.
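Algorithm 2 translates directly into Python. The short episode fed to it below (a two-step chain with a single terminal reward) is a hypothetical example:

```python
import numpy as np

# Every-visit Monte-Carlo update (Algorithm 2) for one episode. The trajectory
# is processed backwards so each state's return is accumulated in linear time.
gamma, alpha = 1.0, 0.1

def every_visit_mc(states, rewards, V):
    """states = [X_0, ..., X_{T-1}], rewards = [R_1, ..., R_T]."""
    G = 0.0                                # return collected after time t
    for t in range(len(states) - 1, -1, -1):
        G = rewards[t] + gamma * G         # sum <- R_{t+1} + gamma * sum
        V[states[t]] += alpha * (G - V[states[t]])
    return V

# Hypothetical episodic chain 0 -> 1 -> 2 (terminal), reward 1 on the final
# transition only; true (undiscounted) values are V(0) = V(1) = 1.
V = np.zeros(3)
for _ in range(100):
    V = every_visit_mc([0, 1], [0.0, 1.0], V)
```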

This algorithm is again an instance of stochastic approximation. As such, its behavior is governed by the ODE v̇(t) = V − v(t). Since the unique globally asymptotically stable equilibrium of this ODE is V, V̂_t again converges to V almost surely. Since both algorithms achieve the same goal, one may wonder which algorithm is better.

TD(0) or Monte-Carlo?  First, let us consider an example where TD(0) converges faster. Consider the undiscounted episodic MRP shown in Figure 4. The initial state is either 1 or 2. With high probability the process starts at state 1, while the process starts at state 2 less frequently. Consider now how TD(0) behaves at state 2. By the time state 2 is visited for the kth time, on average state 3 has already been visited 10k times. Assume that α_t = 1/(t + 1). At state 3, the TD(0) update reduces to averaging the Bernoulli rewards incurred upon leaving state 3. At the kth visit of state 2, Var[V̂_t(3)] ≈ 1/(10k) (clearly, E[V̂_t(3)] = V(3) = 0.5). Thus, the target of the update of state 2 will be an estimate of the true value of state 2 with accuracy increasing with k. Now, consider the Monte-Carlo method. The Monte-Carlo method ignores the estimate of the value of state 3 and uses the Bernoulli rewards directly. In particular, Var[R_t | X_t = 2] = 0.25, i.e., the variance of the target does not change with time. In this example, this makes the Monte-Carlo method slower to converge, showing that bootstrapping might indeed help sometimes.

Figure 4: An episodic Markov reward process. In this example, all transitions are deterministic. The reward is zero, except when transitioning from state 3 to state 4, when it is given by a Bernoulli random variable with parameter 0.5. State 4 is a terminal state. When the process reaches the terminal state, it is reset to start at state 1 or 2. The probability of starting at state 1 is 0.9, while the probability of starting at state 2 is 0.1.

To see an example where bootstrapping is not helpful, imagine that the problem is modified so that the reward associated with the transition from state 3 to state 4 is made deterministically

equal to one. In this case, the Monte-Carlo method becomes faster, since R_t = 1 is the true target value, while for the value of state 2 to get close to its true value, TD(0) has to wait until the estimate of the value at state 3 becomes close to its true value. This slows down the convergence of TD(0). In fact, one can imagine a longer chain of states, where state i + 1 follows state i for i ∈ {1, ..., N}, and the only time a nonzero reward is incurred is when transitioning from state N − 1 to state N. In this example, the rate of convergence of the Monte-Carlo method is not impacted by the value of N, while TD(0) gets slower as N increases (for an informal argument, see Sutton, 1988; for a formal one with exact rates, see Beleznay et al., 1999).

3.1.3 TD(λ): Unifying Monte-Carlo and TD(0)

The previous examples show that both Monte-Carlo and TD(0) have their own merits. Interestingly, there is a way to unify these approaches. This is achieved by the so-called TD(λ) family of methods (Sutton, 1984, 1988). Here, λ ∈ [0, 1] is a parameter that allows one to interpolate between the Monte-Carlo and TD(0) updates: λ = 0 gives TD(0) (hence the name TD(0)), while λ = 1, i.e., TD(1), is equivalent to a Monte-Carlo method. In essence, given some λ > 0, the targets in the TD(λ) update are given as a mixture of the multi-step return predictions

    R_{t:k} = Σ_{s=t}^{t+k} γ^{s−t} R_{s+1} + γ^{k+1} V̂_t(X_{t+k+1}),

where the mixing coefficients are the exponential weights (1 − λ)λ^k, k ≥ 0. Thus, for λ > 0, TD(λ) is a multi-step method. The algorithm is made incremental by the introduction of the so-called eligibility traces.

In fact, the eligibility traces can be defined in multiple ways, and hence TD(λ) exists in correspondingly many forms. The update rule of TD(λ) with the so-called accumulating traces is as follows:

    δ_{t+1} = R_{t+1} + γ V̂_t(X_{t+1}) − V̂_t(X_t),
    z_{t+1}(x) = I{x = X_t} + γλ z_t(x),
    V̂_{t+1}(x) = V̂_t(x) + α_t δ_{t+1} z_{t+1}(x),
    z_0(x) = 0,   x ∈ X.

Here z_t(x) is the eligibility trace of state x. The rationale of the name is that the value of z_t(x) modulates the influence of the TD error on the update of the value stored at state x. In another variant of the algorithm, the eligibility traces are updated according to

    z_{t+1}(x) = max( I{x = X_t}, γλ z_t(x) ),   x ∈ X.

This is called the replacing-traces update. In these updates, the trace-decay parameter λ controls the amount of bootstrapping: when λ = 0, the above algorithms become identical to TD(0) (since lim_{λ→0+} (1 − λ) Σ_{k≥0} λ^k R_{t:k} = R_{t:0} = R_{t+1} + γ V̂_t(X_{t+1})). When λ = 1, we get the TD(1) algorithm, which with accumulating traces simulates the previously described every-visit Monte-Carlo algorithm in episodic problems. (For an exact equivalence, one needs to assume that the value updates happen only at the end of trajectories, up to which point the updates are just accumulated. The statement then follows because the discounted sum of temporal differences along a trajectory from a start state to a terminal state telescopes and gives the sum of rewards along the trajectory.) Replacing traces with λ = 1 correspond to a version of the Monte-Carlo algorithm in which a state is updated only when it is encountered for the first time in a trajectory. The corresponding algorithm is called the first-visit Monte-Carlo method. The formal correspondence between the first-visit Monte-Carlo method and TD(1) with replacing traces is known to hold for the undiscounted case only (Singh and Sutton, 1996). Algorithm 3 gives the pseudocode corresponding to the


Algorithm 3 The function that implements the tabular TD(λ) algorithm with replacing traces. This function must be called after each transition.

function TDLambda(X, R, Y, V, z)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, V is the array storing the current value function estimate, z is the array storing the eligibility traces
1: δ ← R + γ · V[Y] − V[X]
2: for all x ∈ X do
3:   z[x] ← γ · λ · z[x]
4:   if X = x then
5:     z[x] ← 1
6:   end if
7:   V[x] ← V[x] + α · δ · z[x]
8: end for
9: return (V, z)

variant with replacing traces.
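Algorithm 3 can be sketched in vectorized Python. The episodic chain used below (0 → 1 → terminal, with a single reward of 1 on the last transition) is hypothetical, and the traces are cleared at the start of each episode:

```python
import numpy as np

# Tabular TD(lambda) with replacing traces, cf. Algorithm 3.
gamma, lam, alpha = 0.9, 0.5, 0.1

def td_lambda_update(V, z, x, r, y):
    delta = r + gamma * V[y] - V[x]     # temporal difference error
    z *= gamma * lam                    # decay all eligibility traces
    z[x] = 1.0                          # replacing trace for the visited state
    V += alpha * delta * z              # credit every eligible state
    return V, z

# Hypothetical episodic chain: state 0 -> state 1 -> terminal state 2;
# the only reward (1.0) arrives on the final transition.
V, z = np.zeros(3), np.zeros(3)
for _ in range(500):
    z[:] = 0.0                                   # new episode: clear the traces
    V, z = td_lambda_update(V, z, 0, 0.0, 1)
    V, z = td_lambda_update(V, z, 1, 1.0, 2)

# True values: V(1) = 1, V(0) = gamma * V(1) = 0.9, V(2) = 0 (terminal).
```

Note how the single TD error on the final transition also nudges V[0] through its decayed trace γλ, which is precisely the multi-step credit assignment that distinguishes λ > 0 from TD(0).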

In practice, the best value of λ is determined by trial and error. In fact, the value of λ can be changed even during the run of the algorithm without impacting convergence. This holds for a wide range of other possible eligibility-trace updates (for precise conditions, see Bertsekas and Tsitsiklis, 1996, Sections 5.3.3 and 5.3.6). The replacing-traces version of the algorithm is believed to perform better in practice (for some examples of when this happens, consult Sutton and Barto, 1998, Section 7.8). It has been noted that λ > 0 is helpful when the learner has only partial knowledge of the state, or (in the related situation) when function approximation is used to approximate the value functions in a large state space, the topic of the next section.

In summary, TD(λ) allows one to estimate value functions in MRPs. It generalizes Monte-Carlo methods, it can be used in non-episodic problems, and it allows for bootstrapping. Further, by appropriately tuning λ, it can converge significantly faster than Monte-Carlo methods or TD(0).

3.2 Algorithms for large state spaces

When the state space is large (or infinite), it is not feasible to keep a separate value for each state in memory. In such cases, we often seek an estimate of the values in the form

    V_θ(x) = θ^⊤ φ(x),   x ∈ X,

where θ ∈ R^d is a vector of parameters and φ : X → R^d is a mapping of states to d-dimensional vectors. For a state x, the components φ_i(x) of the vector φ(x) are called the features of state x, and φ is called a feature-extraction method. The individual functions φ_i : X → R defining the components of φ are called basis functions.

Examples of function approximation methods.  Given access to the state, the features (or basis functions) can be constructed in a great many different ways. If x ∈ R (i.e., X ⊂ R), one may use a polynomial, Fourier, or wavelet basis up to some order. For example, in the case of a polynomial basis, φ(x) = (1, x, x^2, ..., x^{d−1})^⊤, or one may use an orthogonal system of polynomials if a suitable measure (such as the stationary distribution) over the states is available. This latter choice may help to increase the convergence speed of the incremental algorithms that we will discuss soon.
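For a scalar state, the polynomial basis gives a simple concrete instance of the linear architecture V_θ(x) = θ^⊤φ(x); the parameter values below are arbitrary, chosen only for illustration:

```python
import numpy as np

# Linear value-function approximation with a polynomial basis:
# phi(x) = (1, x, x^2, ..., x^{d-1}) and V_theta(x) = theta . phi(x).
d = 4

def phi(x):
    """Feature vector of a scalar state x."""
    return np.array([x ** i for i in range(d)])

theta = np.array([0.5, -1.0, 0.0, 2.0])   # arbitrary illustrative parameters

def V(x):
    return float(theta @ phi(x))

# E.g., V(2) = 0.5 - 1*2 + 0*4 + 2*8 = 14.5.
```

Learning then amounts to adjusting the d numbers in theta rather than one value per state, which is what makes the approach viable on large state spaces.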

In the case of multidimensional state spaces, the tensor-product construction is a commonly used way to construct features given features of the states' individual components. The tensor-product construction works as follows: imagine that X ⊂ X_1 × X_2 × ... × X_k. Let φ^(i) : X_i → R^{d_i} be a feature extractor defined for the ith state component. The tensor-product feature extractor φ = φ^(1) ⊗ ... ⊗ φ^(k) has d = d_1 d_2 ... d_k components, which can be conveniently indexed using multi-indices of the form (i_1, ..., i_k), 1 ≤ i_j ≤ d_j, j = 1, ..., k. Then

    φ_{(i_1,...,i_k)}(x) = φ^(1)_{i_1}(x_1) φ^(2)_{i_2}(x_2) ... φ^(k)_{i_k}(x_k).

When X ⊂ R^k, one particularly popular choice is to use radial basis function (RBF) networks, where φ^(i)(x_i) = (G(|x_i − x_i^(1)|), ..., G(|x_i − x_i^(d_i)|))^⊤. Here x_i^(j) ∈ R (j = 1, ..., d_i) is fixed by the user and G is a suitable function. A typical choice for G is G(z) = exp(−η z^2), where η > 0 is a scale parameter. The tensor-product construction in this case places Gaussians at the points of a regular grid, and the ith basis function becomes

    φ_i(x) = exp(−η ‖x − x^(i)‖^2),

where x^(i) ∈ X now denotes a point on a regular d_1 × ... × d_k grid. A related method is to use kernel smoothing:

    V_θ(x) = Σ_{i=1}^d θ_i G(‖x − x^(i)‖) / Σ_{j=1}^d G(‖x − x^(j)‖)
           = Σ_{i=1}^d θ_i [ G(‖x − x^(i)‖) / Σ_{j=1}^d G(‖x − x^(j)‖) ].    (20)

More generally, one may use V_θ(x) = Σ_{i=1}^d θ_i s_i(x), where s_i ≥ 0 and Σ_{i=1}^d s_i(x) ≡ 1 holds for any x ∈ X. In this case, we say that V_θ is an averager. Averagers are important in reinforcement learning because the mapping θ ↦ V_θ is a non-expansion in the max-norm, which makes averagers "well-behaved" when used together with approximate dynamic programming.
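The kernel-smoothing estimate (20) is easy to realize in code; the centers and parameters below are arbitrary illustrative values. Because the weights s_i(x) are nonnegative and sum to one, the prediction is a convex combination of the parameters, which is exactly the averager property:

```python
import numpy as np

# Kernel smoothing, Equation (20), with Gaussian kernel G(z) = exp(-eta z^2).
# The centers x^{(i)} and parameters theta are arbitrary illustrative values.
eta = 1.0
centers = np.array([0.0, 1.0, 2.0])     # the points x^{(i)}
theta = np.array([0.0, 1.0, 4.0])       # one parameter per center

def G(z):
    return np.exp(-eta * z ** 2)

def V(x):
    w = G(np.abs(x - centers))
    w = w / w.sum()                     # weights s_i(x): nonnegative, sum to 1
    return float(w @ theta)

# Being a convex combination of theta, the output of V can never leave the
# interval [min(theta), max(theta)], for any input x.
```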

An alternative to the above is to use binary features, i.e., φ(x) ∈ {0, 1}^d. Binary features may be advantageous from a computational point of view: when φ(x) ∈ {0, 1}^d, then V_θ(x) = Σ_{i: φ_i(x)=1} θ_i. Thus, the value of state x can be computed at the cost of s additions if φ(x) is s-sparse (i.e., if only s elements of φ(x) are nonzero), provided that there is a direct way of computing the indices of the nonzero components of the feature vector.

This is the case when state aggregation is used to define the features. In this case, the coordinate functions of φ (the individual features) correspond to indicators of non-overlapping regions of the state space X whose union covers X (i.e., the regions form a partition of the state space). Clearly, in this case, θ^⊤ φ(x) is constant over each individual region; thus, state aggregation essentially "discretizes" the state space. A state-aggregation function approximator is also an averager.
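As a tiny sketch of sparse evaluation with state aggregation (the binning of [0, 1) into 10 equal intervals is a hypothetical choice), the value of a state is read off with a single addition, since exactly one binary feature is active:

```python
import numpy as np

# State aggregation over [0, 1): feature i is the indicator of the interval
# [i/10, (i+1)/10), so phi(x) is 1-sparse and V_theta(x) is a single lookup.
d = 10
theta = np.arange(d, dtype=float)        # arbitrary illustrative parameters

def active_indices(x):
    """Indices of the nonzero (binary) features of state x in [0, 1)."""
    return [int(x * d)]

def V(x):
    # Sum over active features only: cost is s additions for an s-sparse phi(x).
    return sum(theta[i] for i in active_indices(x))
```

The same pattern (store theta densely, enumerate only the active indices) applies to tile coding, where each of the s tilings contributes one active index.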

Another choice that leads to binary features is tile coding (originally called CMAC; Albus, 1971, 1981). In the simplest version of tile coding, the basis functions of φ correspond to indicator functions of multiple shifted partitions (tilings) of the state space: if s tilings are used, φ is s-sparse. To make tile coding an effective function approximation method, the offsets of the tilings corresponding to different dimensions should be different.

The curse of dimensionality.  The issue with tensor-product constructions, state aggregation, and straightforward tile coding is that when the state space is high-dimensional, they quickly become intractable: for example, a tiling of [0, 1]^D with cubical regions with side-lengths of ε gives rise to d = ε^{−D}-dimensional feature and parameter vectors. If ε = 1/2 and D = 100, we get the enormous number d ≈ 10^30. This is problematic, since state representations with hundreds of dimensions are common in applications. At this stage, one may wonder whether it is possible at all to successfully deal with applications where the state lives in a high-dimensional space. What often comes to the rescue is that the actual problem complexity might be much lower than what is predicted by merely counting the number of dimensions of the state variable (although there is no guarantee that this happens). To see why this sometimes holds, note that the same problem can have multiple representations, some of which may come with low-dimensional state variables, some with high-dimensional ones. Since, in many cases, the state representation is chosen by the user in a conservative fashion, it may happen that in the chosen representation many of the state variables are irrelevant. It may also happen that the states that are actually encountered lie on (or close to) a low-dimensional submanifold of the chosen high-dimensional "state space".

To illustrate this, imagine an industrial robot arm with say 3 joints and 6 degrees of freedom.

The intrinsic dimensionality of the state is then 12, twice the number of degrees of freedom

of the arm since the dynamics is second-order. One (approximate) state representation

is to take high resolution camera images of the arm in close succession (to account for the

dynamics) from multiple angles (to account for occlusions). The dimensionality of the chosen


state representation will easily be in the range of millions, yet the intrinsic dimensionality

will still be 12. In fact, the more cameras we have, the higher the dimensionality will be.

A simple-minded approach that aims to minimize the dimensionality would suggest using as few cameras as possible. But more information should not hurt! Therefore, the quest

should be for clever algorithms and function approximation methods that can deal with

high-dimensional but low complexity problems.

Possibilities include using strip-like tilings combined with hash functions, interpolators that use low-discrepancy grids (Lemieux, 2009, Chapters 5 and 6), or random projections (Dasgupta and Freund, 2008). Nonlinear function approximation methods (examples of which include neural networks with sigmoidal transfer functions in the hidden layers, or RBF networks where the centers are also treated as parameters) and nonparametric techniques also hold great promise.
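As a sketch of the random-projection idea, the raw high-dimensional representation can simply be multiplied by a random matrix before learning. This is a generic Gaussian projection, not the specific construction of Dasgupta and Freund (2008), and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 10_000, 50                        # raw and projected dimensions (illustrative)
A = rng.normal(size=(d, D)) / np.sqrt(d)  # random projection matrix

x = rng.normal(size=D)                   # a high-dimensional raw state representation
phi = A @ x                              # d-dimensional features, computed in O(d * D)

# Random projections approximately preserve distances (Johnson-Lindenstrauss),
# so learning can proceed in the much smaller d-dimensional feature space.
ratio = np.linalg.norm(phi) / np.linalg.norm(x)
```

The ratio above concentrates around 1, which is the sense in which the geometry of the raw representation survives the projection.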

Nonparametric methods In a nonparametric method, the user does not start with a fixed finite-dimensional representation, as in the previous examples, but allows the representation to grow and change as needed. For example, in a k-nearest-neighbor method for regression, given the data D_n = [(x_1, v_1), ..., (x_n, v_n)], where x_i ∈ R^k, v_i ∈ R, the value at location x is predicted using

$$V_D^{(k)}(x) = \sum_{i=1}^{n} v_i \, \frac{K_D^{(k)}(x, x_i)}{k},$$

where $K_D^{(k)}(x, x')$ is one when $x'$ is closer to $x$ than the $k$th closest neighbor of $x$ in $D$, and is zero otherwise. Note that $k = \sum_{j=1}^{n} K_D^{(k)}(x, x_j)$. Replacing $k$ in the above expression with this sum and replacing $K_D^{(k)}(x, \cdot)$ with some other data-based kernel $K_D$ (e.g., a Gaussian centered around $x$ with standard deviation proportional to the distance to the $k$th nearest neighbor), we arrive at nonparametric kernel smoothing:

$$V_D^{(k)}(x) = \sum_{i=1}^{n} v_i \, \frac{K_D(x, x_i)}{\sum_{j=1}^{n} K_D(x, x_j)},$$

which should be compared to its parametric counterpart (20). Other examples include methods that work by finding an appropriate function in some large (infinite-dimensional) function space that fits the data, e.g., by minimizing an empirical error. The function space is usually a reproducing kernel Hilbert space, which is a convenient choice from the point of view of optimization. In special cases, we get spline smoothers (Wahba, 2003) and Gaussian process regression (Rasmussen and Williams, 2005). Another idea is to split the input space recursively into finer regions using some heuristic criterion and then predict the values in the leaves with some simple method, leading to tree-based methods. The border between parametric and nonparametric methods

is blurry. For example, a linear predictor in which the number of basis functions is allowed to change (i.e., new basis functions are introduced as needed) becomes a nonparametric method. Thus, when one experiments with different feature-extraction methods, from the point of view of the overall tuning process one really uses a nonparametric technique. In fact, from this viewpoint it follows that in practice "true" parametric methods are rarely used, if they are used at all.
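The k-nearest-neighbor and kernel-smoothing predictors discussed above can be sketched as follows. One-dimensional inputs and a fixed Gaussian bandwidth are simplifications of the text, which suggests tying the bandwidth to the distance of the kth nearest neighbor:

```python
import numpy as np

def knn_predict(x, X, v, k=3):
    """k-nearest-neighbor regression: average the values of the k points
    of the data set closest to the query x."""
    d = np.abs(np.asarray(X) - x)          # distances (1-d inputs for simplicity)
    nearest = np.argsort(d)[:k]
    return v[nearest].mean()

def kernel_smooth(x, X, v, h=0.5):
    """Nadaraya-Watson kernel smoothing with a Gaussian kernel of fixed
    bandwidth h: a weighted average of all observed values."""
    w = np.exp(-0.5 * ((np.asarray(X) - x) / h) ** 2)
    return float(np.dot(w, v) / w.sum())

X = np.array([0.0, 1.0, 2.0, 3.0])
v = np.array([0.0, 1.0, 2.0, 3.0])
print(knn_predict(1.1, X, v, k=2))   # averages v at x=1 and x=2 -> 1.5
```

Note how neither function commits to a finite-dimensional parameter vector in advance: the "representation" is the data set itself, which grows with n.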

The advantage of nonparametric methods is their inherent flexibility. However, this usually comes at the price of increased computational complexity. Therefore, when using nonparametric methods, efficient implementations are important (e.g., one should use k-d trees when implementing nearest-neighbor methods, or the fast Gauss transform when implementing a Gaussian smoother). Also, nonparametric methods must be carefully tuned, as they can easily overfit or underfit. For example, in a k-nearest-neighbor method, if k is too large the method will introduce too much smoothing (i.e., it will underfit), while if k is too small it will fit to the noise (i.e., overfit). Overfitting will be discussed further in Section 3.2.4. For more information about nonparametric regression, the reader is advised to consult the books by Härdle (1990); Györfi et al. (2002); Tsybakov (2009).

Although our discussion below will assume a parametric function approximation method

(and in many cases linear function approximation), many of the algorithms can be extended

to nonparametric techniques. We will mention when such extensions exist as appropriate.

Up to now, the discussion implicitly assumed that the state is accessible for measurement.

This is, however, rarely the case in practical applications. Luckily, the methods that we

will discuss below do not actually need to access the states directly, but they can perform

equally well when some “suﬃciently descriptive feature-based representation” of the states

is available (such as the camera images in the robot-arm example). A common way of

arriving at such a representation is to construct state estimators (or observers, in control

terminology) based on the history of the observations, which has a large literature both in

machine learning and control. The discussion of these techniques, however, lies outside of

the scope of the present paper.

3.2.1 TD(λ) with function approximation

Let us return to the problem of estimating the value function V of a Markov reward process M = (X, P_0), but now assume that the state space is large (or even infinite). Let D = ((X_t, R_{t+1}); t ≥ 0) be a realization of M. The goal, as before, is to estimate the value function of M given D in an incremental manner.

Choose a smooth parametric function-approximation method (V_θ; θ ∈ R^d) (i.e., for any θ ∈ R^d, V_θ : X → R is such that ∇_θ V_θ(x) exists for all x ∈ X). The generalization of


Algorithm 4 The function implementing the TD(λ) algorithm with linear function approximation. This function must be called after each transition.
function TDLambdaLinFApp(X, R, Y, θ, z)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, θ ∈ R^d is the parameter vector of the linear function approximation, z ∈ R^d is the vector of eligibility traces
1: δ ← R + γ · θ⊤ϕ[Y] − θ⊤ϕ[X]
2: z ← ϕ[X] + γ · λ · z
3: θ ← θ + α · δ · z
4: return (θ, z)

tabular TD(λ) with accumulating eligibility traces to the case when the value functions are approximated using members of (V_θ; θ ∈ R^d) uses the following updates (Sutton, 1984, 1988):

$$\begin{aligned}
\delta_{t+1} &= R_{t+1} + \gamma V_{\theta_t}(X_{t+1}) - V_{\theta_t}(X_t),\\
z_{t+1} &= \nabla_\theta V_{\theta_t}(X_t) + \gamma\lambda\, z_t,\\
\theta_{t+1} &= \theta_t + \alpha_t\, \delta_{t+1}\, z_{t+1},\\
z_0 &= 0.
\end{aligned} \tag{21}$$

Here z_t ∈ R^d. Algorithm 4 shows the pseudocode of this algorithm.
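In Python, one step of update (21) for the linear case (cf. Algorithm 4) might be written as follows; the step size, discount, and trace-decay values are illustrative:

```python
import numpy as np

def td_lambda_step(theta, z, phi_x, phi_y, r,
                   alpha=0.1, gamma=0.99, lam=0.8):
    """One TD(lambda) update with linear function approximation and
    accumulating eligibility traces, following update (21)."""
    delta = r + gamma * theta @ phi_y - theta @ phi_x   # TD error
    z = phi_x + gamma * lam * z                         # eligibility trace
    theta = theta + alpha * delta * z                   # parameter update
    return theta, z

# Tabular special case: with indicator features phi(x_i) = e_i, the update
# reduces to tabular TD(lambda), as argued in the text.
theta, z = np.zeros(3), np.zeros(3)
phi = np.eye(3)
theta, z = td_lambda_step(theta, z, phi[0], phi[1], r=1.0)
```

Because only θ and the trace z are carried between calls, the per-step cost is O(d), independent of the size of the state space.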

To see that this algorithm is indeed a generalization of tabular TD(λ), assume that X = {x_1, ..., x_D} and let V_θ(x) = θ⊤ϕ(x) with ϕ_i(x) = I{x = x_i}. Note that since V_θ is linear in the parameters (i.e., V_θ = θ⊤ϕ), it holds that ∇_θ V_θ = ϕ. Hence, identifying z_{t,i} (resp., θ_{t,i}) with z_t(x_i) (resp., V̂_t(x_i)), we see that update (21) indeed reduces to the previous one.

In the off-policy version of TD(λ), the definition of δ_{t+1} becomes

$$\delta_{t+1} = R_{t+1} + \gamma V_{\theta_t}(Y_{t+1}) - V_{\theta_t}(X_t).$$

Unlike in the tabular case, under off-policy sampling convergence is no longer guaranteed; in fact, the parameters may diverge (see, e.g., Bertsekas and Tsitsiklis, 1996, Example 6.7, p. 307). This happens with linear function approximation when the distribution of (X_t; t ≥ 0) does not match the stationary distribution of the MRP M. Another case when the algorithm may diverge is when it is used with a nonlinear function-approximation method (see, e.g., Bertsekas and Tsitsiklis, 1996, Example 6.6, p. 292). For further examples of instability, see Baird (1995); Boyan and Moore (1995).

On the positive side, almost sure convergence can be guaranteed when (i) a linear function-approximation method is used with ϕ : X → R^d; (ii) the stochastic process (X_t; t ≥ 0) is an ergodic Markov process whose stationary distribution µ is the same as the stationary distribution of the MRP M; and (iii) the step-size sequence satisfies the RM conditions (Tsitsiklis and Van Roy, 1997; Bertsekas and Tsitsiklis, 1996, p. 222, Section 5.3.7). In the results cited, it is also assumed that the components of ϕ (i.e., ϕ_1, ..., ϕ_d) are linearly independent. When this holds, the limit of the parameter vector will be unique. In the other case, i.e., when the features are redundant, the parameters will still converge, but the limit will depend on the parameter vector's initial value. However, the limiting value function will be unique (Bertsekas, 2010).

Assuming that TD(λ) converges, let θ^{(λ)} denote the limiting value of θ_t. Let

$$F = \{ V_\theta \mid \theta \in \mathbb{R}^d \}$$

be the space of functions that can be represented using the chosen features ϕ. Note that F is a linear subspace of the vector space of all real-valued functions with domain X. The limit θ^{(λ)} is known to satisfy the so-called projected fixed-point equation

$$V_{\theta^{(\lambda)}} = \Pi_{F,\mu}\, T^{(\lambda)} V_{\theta^{(\lambda)}}, \tag{22}$$

where the operators T^{(λ)} and Π_{F,µ} are defined as follows. For m ∈ N, let T^{[m]} be the m-step lookahead Bellman operator:

$$T^{[m]} \hat{V}(x) = \mathbb{E}\left[ \sum_{t=0}^{m} \gamma^{t} R_{t+1} + \gamma^{m+1} \hat{V}(X_{m+1}) \,\middle|\, X_0 = x \right].$$

Clearly, V, the value function to be estimated, is a fixed point of T^{[m]} for any m ≥ 0. Assume that λ < 1. Then the operator T^{(λ)} is defined as the exponentially weighted average of T^{[0]}, T^{[1]}, ...:

$$T^{(\lambda)} \hat{V}(x) = (1-\lambda) \sum_{m=0}^{\infty} \lambda^{m}\, T^{[m]} \hat{V}(x).$$

For λ = 1, we let T^{(1)} V̂ = lim_{λ→1⁻} T^{(λ)} V̂ = V. Notice that for λ = 0, T^{(0)} = T. The operator Π_{F,µ} is a projection: it projects functions of states onto the linear space F with respect to the weighted 2-norm ‖f‖²_µ = Σ_{x∈X} f²(x) µ(x):

$$\Pi_{F,\mu} \hat{V} = \operatorname*{argmin}_{f \in F} \| \hat{V} - f \|_{\mu}.$$

The essence of the proof of convergence of TD(λ) is that the composite operator Π_{F,µ} T^{(λ)} is a contraction with respect to the norm ‖·‖_µ. This result heavily exploits that µ is the stationary distribution underlying M (which defines T^{(λ)}). For other distributions, the composite operator might not be a contraction, in which case TD(λ) might diverge.

As to the quality of the solution found, the following error bound holds for the fixed point of (22):

$$\| V_{\theta^{(\lambda)}} - V \|_{\mu} \le \frac{1}{\sqrt{1-\gamma_{\lambda}}} \, \| \Pi_{F,\mu} V - V \|_{\mu}.$$

Here γ_λ = γ(1−λ)/(1−λγ) is the contraction modulus of Π_{F,µ} T^{(λ)} (Tsitsiklis and Van Roy, 1999a; Bertsekas, 2007b). (For sharper bounds, see Yu and Bertsekas 2008; Scherrer 2010.)

From the error bound we see that V_{θ^{(1)}} is the best approximation to V within F with respect to the norm ‖·‖_µ (this should come as no surprise, as TD(1) minimizes this mean-squared error by design). We also see that as we let λ → 0, the bound allows for larger errors. It is known that this is not an artifact of the analysis. In fact, in Example 6.5 of the book by Bertsekas and Tsitsiklis (1996) (p. 288), a simple MRP with n states and a one-dimensional feature extractor ϕ is given such that V_{θ^{(0)}} is a very poor approximation to V, while V_{θ^{(1)}} is a reasonable approximation. Thus, in order to get good accuracy when working with λ < 1, it is not enough to choose the function space F so that the best approximation to V has small error. At this stage, however, one might wonder whether using λ < 1 makes sense at all. A recent paper by Van Roy (2006) suggests that when considering performance-loss bounds instead of approximation errors and the full control learning task (cf. Section 4), λ = 0 will in general be at no disadvantage compared to λ = 1, at least when state aggregation is considered. Thus, while the mean-squared error of the solution might be large, when the solution is used in control, the performance of the resulting policy will still be as good as that of the one obtained by calculating the TD(1) solution. However, the major reason to prefer TD(λ) with λ < 1 over TD(1) is that empirical evidence suggests that it converges much faster than TD(1); the latter, at least for practical sample sizes, often produces very poor estimates (e.g., Sutton and Barto, 1998, Section 8.6).
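For a small, fully known MRP, the fixed point of (22) can be computed directly. The sketch below does this for λ = 0 on a hypothetical two-state MRP (all numbers are illustrative): in matrix form, the TD(0) parameter solves Φ⊤D(Φ − γPΦ)θ = Φ⊤Dr with D = diag(µ), and it can differ markedly from the µ-weighted best fit to V, in line with the discussion above.

```python
import numpy as np

# A hypothetical 2-state MRP (illustrative numbers, not from the text).
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])           # transition matrix
r = np.array([1.0, 0.0])             # expected immediate rewards
gamma = 0.9
mu = np.array([0.5, 0.5])            # stationary distribution of P
Phi = np.array([[1.0], [2.0]])       # one linear feature per state
D = np.diag(mu)

# lambda = 0 projected fixed point: Phi^T D (Phi - gamma P Phi) theta = Phi^T D r
A = Phi.T @ D @ (Phi - gamma * P @ Phi)
b = Phi.T @ D @ r
theta = np.linalg.solve(A, b)

V = np.linalg.solve(np.eye(2) - gamma * P, r)   # true value function
theta_best = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ V)  # mu-weighted best fit
print(theta, theta_best)   # the TD(0) solution differs from the best fit
```

Here θ ≈ 1.05 while the best µ-weighted fit is θ ≈ 2.9, a small-scale analogue of the poor-TD(0)-solution phenomenon from Example 6.5 cited above.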

TD(λ) solves a model Sutton et al. (2008) and Parr et al. (2008) observed independently of each other that the solution obtained by TD(0) can be thought of as the solution of a deterministic MRP with linear dynamics. In fact, as we will argue now, this also holds in the case of TD(λ).

This suggests that if the deterministic MRP captures the essential features of the original MRP, V_{θ^{(λ)}} will be a good approximation to V. To firm up this statement, following Parr et al. (2008), let us study the Bellman error

$$\Delta^{(\lambda)}(\hat{V}) = T^{(\lambda)} \hat{V} - \hat{V}$$

of V̂ : X → R under T^{(λ)}. Note that ∆^{(λ)}(V̂) : X → R. A simple contraction argument shows that

$$\| V - \hat{V} \|_{\infty} \le \frac{1}{1-\gamma} \, \| \Delta^{(\lambda)}(\hat{V}) \|_{\infty}.$$

Hence, if ∆^{(λ)}(V̂) is small, V̂ is close to V.


The following error decomposition can be shown to hold:⁸

$$\Delta^{(\lambda)}(V_{\theta^{(\lambda)}}) = (1-\lambda) \sum_{m \ge 0} \lambda^{m} \Delta^{[r]}_{m} + \gamma \Big( (1-\lambda) \sum_{m \ge 0} \lambda^{m} \Delta^{[\varphi]}_{m} \Big) \theta^{(\lambda)}.$$

Here ∆^{[r]}_m = r_m − Π_{F,µ} r_m and ∆^{[ϕ]}_m = P^{m+1}ϕ⊤ − Π_{F,µ} P^{m+1}ϕ⊤ are the errors of modeling the m-step rewards and transitions with respect to the features ϕ, respectively; r_m : X → R is defined by r_m(x) = E[R_{m+1} | X_0 = x], and P^{m+1}ϕ⊤ denotes a function that maps states to d-dimensional row vectors and is defined by P^{m+1}ϕ⊤(x) = (P^{m+1}ϕ_1(x), ..., P^{m+1}ϕ_d(x)). Here P^m ϕ_i : X → R is the function defined by P^m ϕ_i(x) = E[ϕ_i(X_m) | X_0 = x]. Thus, we see that the Bellman error will be small if the m-step immediate rewards and the m-step feature expectations are well captured by the features. We can also see that as λ gets closer to 1, it becomes more important for the features to capture the structure of the value function, and as λ gets closer to 0, it becomes more important to capture the structure of the immediate rewards and the immediate feature expectations. This suggests that the "best" value of λ (i.e., the one that minimizes ‖∆^{(λ)}(V_{θ^{(λ)}})‖) may depend on whether the features are more successful at capturing the short-term or the long-term dynamics (and rewards).

3.2.2 Gradient temporal difference learning

That TD(λ) can diverge in off-policy learning situations spoils its otherwise immaculate record. In Section 3.2.3, we will introduce some methods that avoid this issue. However, as we will see, the computational (time and storage) complexity of these methods is significantly larger than that of TD(λ). In this section, we present two recent algorithms introduced by Sutton et al. (2009b,a) which also overcome the instability issue, converge to the TD(λ) solutions in the on-policy case, and yet are almost as efficient as TD(λ).

For simplicity, we consider the case when λ = 0, ((X_t, R_{t+1}, Y_{t+1}); t ≥ 0) is a stationary process, X_t ∼ ν (ν can be different from the stationary distribution of P), and linear function approximation is used with linearly independent features. Assume that θ^{(0)}, the solution to (22), exists. Consider the objective function

$$J(\theta) = \| V_\theta - \Pi_{F,\nu}\, T V_\theta \|^{2}_{\nu}. \tag{23}$$

Notice that all solutions to (22) are minimizers of J, and there are no other minimizers of J when (22) has solutions. Thus, minimizing J will give a solution to (22). Let θ* denote a minimizer of J. Since, by assumption, the features are linearly independent, the minimizer

⁸ Parr et al. (2008) observed this for λ = 0. The extension to λ > 0 is new.


of J is unique, i.e., θ* is well defined. Introduce the shorthand notations

$$\begin{aligned}
\delta_{t+1}(\theta) &= R_{t+1} + \gamma V_\theta(Y_{t+1}) - V_\theta(X_t) \\
&= R_{t+1} + \gamma\, \theta^{\top} \varphi'_{t+1} - \theta^{\top} \varphi_{t}, \\
\varphi_{t} &= \varphi(X_t), \\
\varphi'_{t+1} &= \varphi(Y_{t+1}).
\end{aligned} \tag{24}$$

A simple calculation allows us to rewrite J in the following form:

$$J(\theta) = \mathbb{E}[\delta_{t+1}(\theta)\varphi_{t}]^{\top} \, \mathbb{E}[\varphi_{t}\varphi_{t}^{\top}]^{-1} \, \mathbb{E}[\delta_{t+1}(\theta)\varphi_{t}]. \tag{25}$$
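Expression (25), the objective that Sutton et al. (2009b,a) call the mean-squared projected Bellman error, can be estimated from sampled transitions by replacing the three expectations with empirical averages. A minimal sketch, where the sampled features and rewards are synthetic placeholders:

```python
import numpy as np

def mspbe(theta, phis, phis_next, rewards, gamma=0.9):
    """Empirical estimate of J(theta) in (25): the expectations
    E[delta_{t+1} phi_t] and E[phi_t phi_t^T] are replaced by sample means."""
    deltas = rewards + gamma * phis_next @ theta - phis @ theta  # TD errors
    e_dphi = (deltas[:, None] * phis).mean(axis=0)               # E[delta phi]
    c = phis.T @ phis / len(phis)                                # E[phi phi^T]
    return float(e_dphi @ np.linalg.solve(c, e_dphi))

rng = np.random.default_rng(0)
n, d = 1000, 3
phis = rng.normal(size=(n, d))       # phi(X_t)  (synthetic data)
phis_next = rng.normal(size=(n, d))  # phi(Y_{t+1})
rewards = rng.normal(size=n)
value = mspbe(np.zeros(d), phis, phis_next, rewards)
assert value >= 0.0  # J is a squared weighted norm, hence nonnegative
```

Since E[ϕϕ⊤] is a Gram matrix, the quadratic form is nonnegative, matching the interpretation of J as a squared norm.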

Taking the gradient of the objective function, we get

$$\nabla_\theta J(\theta) = -2\, \mathbb{E}\big[(\varphi_{t} - \gamma \varphi'_{t+1}) \varphi_{t}^{\top}\big]\, w(\theta), \tag{26}$$