
Reinforcement Learning in Robotics:

A Survey

Jens Kober∗†, J. Andrew Bagnell‡, Jan Peters§¶

email: jkober@cor-lab.uni-bielefeld.de, dbagnell@ri.cmu.edu, mail@jan-peters.net

∗Bielefeld University, CoR-Lab Research Institute for Cognition and Robotics, Universitätsstr. 25, 33615 Bielefeld, Germany
†Honda Research Institute Europe, Carl-Legien-Str. 30, 63073 Offenbach/Main, Germany
‡Carnegie Mellon University, Robotics Institute, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
§Max Planck Institute for Intelligent Systems, Department of Empirical Inference, Spemannstr. 38, 72076 Tübingen, Germany
¶Technische Universität Darmstadt, FB Informatik, FG Intelligent Autonomous Systems, Hochschulstr. 10, 64289 Darmstadt, Germany

Reinforcement learning oﬀers to robotics a frame-

work and set of tools for the design of sophisticated

and hard-to-engineer behaviors. Conversely, the chal-

lenges of robotic problems provide inspiration,

impact, and validation for developments in reinforce-

ment learning. The relationship between disciplines

has suﬃcient promise to be likened to that between

physics and mathematics. In this article, we attempt

to strengthen the links between the two research com-

munities by providing a survey of work in reinforce-

ment learning for behavior generation in robots. We

highlight both key challenges in robot reinforcement

learning as well as notable successes. We discuss how

contributions tamed the complexity of the domain and

study the role of algorithms, representations, and prior

knowledge in achieving these successes. As a result, a

particular focus of our paper lies on the choice between

model-based and model-free as well as between value

function-based and policy search methods. By analyz-

ing a simple problem in some detail we demonstrate

how reinforcement learning approaches may be prof-

itably applied, and we note throughout open questions

and the tremendous potential for future research.

keywords: reinforcement learning, learning control,

robot, survey

1 Introduction

A remarkable variety of problems in robotics may

be naturally phrased as ones of reinforcement learn-

ing. Reinforcement learning (RL) enables a robot to

autonomously discover an optimal behavior through

trial-and-error interactions with its environment. In-

stead of explicitly detailing the solution to a problem,

in reinforcement learning the designer of a control task


provides feedback in terms of a scalar objective func-

tion that measures the one-step performance of the

robot. Figure 1 illustrates the diverse set of robots

that have learned tasks using reinforcement learning.

Consider, for example, attempting to train a robot

to return a table tennis ball over the net (Muelling

et al., 2012). In this case, the robot might make

observations of dynamic variables specifying ball posi-

tion and velocity and the internal dynamics of the joint

position and velocity. This might in fact capture well

the state s of the system – providing a complete statis-

tic for predicting future observations. The actions a

available to the robot might be the torque sent to mo-

tors or the desired accelerations sent to an inverse dy-

namics control system. A function π that generates

the motor commands (i.e., the actions) based on the

incoming ball and current internal arm observations

(i.e., the state) would be called the policy. A rein-

forcement learning problem is to ﬁnd a policy that

optimizes the long term sum of rewards R(s, a); a re-

inforcement learning algorithm is one designed to ﬁnd

such a (near)-optimal policy. The reward function in

this example could be based on the success of the hits

as well as secondary criteria like energy consumption.

1.1 Reinforcement Learning in the

Context of Machine Learning

In the problem of reinforcement learning, an agent ex-

plores the space of possible strategies and receives feed-

back on the outcome of the choices made. From this

information, a “good” – or ideally optimal – policy

(i.e., strategy or controller) must be deduced.

Reinforcement learning may be understood by con-

trasting the problem with other areas of study in ma-

chine learning. In supervised learning (Langford and

Zadrozny, 2005), an agent is directly presented a se-

quence of independent examples of correct predictions

to make in diﬀerent circumstances. In imitation learn-

ing, an agent is provided demonstrations of actions of

a good strategy to follow in given situations (Argall

et al., 2009; Schaal, 1999).

To aid in understanding the RL problem and its

relation with techniques widely used within robotics,

Figure 2 provides a schematic illustration of two axes

of problem variability: the complexity of sequential in-

teraction and the complexity of reward structure. This



Figure 1: This ﬁgure illustrates a small sample of robots

with behaviors that were reinforcement learned. These

cover the whole range of aerial vehicles, robotic arms,

autonomous vehicles, and humanoid robots. (a) The

OBELIX robot is a wheeled mobile robot that learned to

push boxes (Mahadevan and Connell, 1992) with a value

function-based approach (Picture reprint with permission

of Sridhar Mahadevan). (b) A Zebra Zero robot arm

learned a peg-in-hole insertion task (Gullapalli et al., 1994)

with a model-free policy gradient approach (Picture reprint

with permission of Rod Grupen). (c) Carnegie Mellon’s au-

tonomous helicopter leveraged a model-based policy search

approach to learn a robust ﬂight controller (Bagnell and

Schneider, 2001). (d) The Sarcos humanoid DB learned

a pole-balancing task (Schaal, 1996) using forward models

(Picture reprint with permission of Stefan Schaal).

hierarchy of problems, and the relations between them,

is a complex one, varying in manifold attributes and

diﬃcult to condense to something like a simple linear

ordering on problems. Much recent work in the ma-

chine learning community has focused on understand-

ing the diversity and the inter-relations between prob-

lem classes. The ﬁgure should be understood in this

light as providing a crude picture of the relationship

between areas of machine learning research important

for robotics.

Each problem subsumes those that are both below

and to the left in the sense that one may always frame

the simpler problem in terms of the more complex one;

note that some problems are not linearly ordered. In

this sense, reinforcement learning subsumes much of

the scope of classical machine learning as well as con-

textual bandit and imitation learning problems. Re-

duction algorithms (Langford and Zadrozny, 2005) are

used to convert eﬀective solutions for one class of prob-

lems into eﬀective solutions for others, and have proven

to be a key technique in machine learning.

At lower left, we ﬁnd the paradigmatic problem of

supervised learning, which plays a crucial role in ap-

plications as diverse as face detection and spam ﬁlter-

ing. In these problems (including binary classiﬁcation

and regression), a learner’s goal is to map observations

(typically known as features or covariates) to actions

which are usually a discrete set of classes or a real

value. These problems possess no interactive compo-

nent: the design and analysis of algorithms to address

[Figure 2: problem classes arranged along two axes – reward structure complexity and interactive/sequential complexity – including binary classification, cost-sensitive learning, structured prediction, supervised learning, imitation learning, contextual bandit, baseline distribution RL, and reinforcement learning.]

Figure 2: An illustration of the inter-relations between

well-studied learning problems in the literature along axes

that attempt to capture both the information and com-

plexity available in reward signals and the complexity of

sequential interaction between learner and environment.

Each problem subsumes those to the left and below; reduc-

tion techniques provide methods whereby harder problems

(above and right) may be addressed using repeated appli-

cation of algorithms built for simpler problems. (Langford

and Zadrozny, 2005)

these problems rely on training and testing instances

as independent and identically distributed random vari-

ables. This rules out any notion that a decision made

by the learner will impact future observations: su-

pervised learning algorithms are built to operate in a

world in which every decision has no eﬀect on the fu-

ture examples considered. Further, within supervised

learning scenarios, during a training phase the “cor-

rect” or preferred answer is provided to the learner, so

there is no ambiguity about action choices.

More complex reward structures are also often stud-

ied: one such is known as cost-sensitive learning, where

each training example and each action or prediction is

annotated with a cost for making such a prediction.

Learning techniques exist that reduce such problems

to the simpler classiﬁcation problem, and active re-

search directly addresses such problems as they are

crucial in practical learning applications.

Contextual bandit or associative reinforcement

learning problems begin to address the fundamental

problem of exploration-vs-exploitation, as information

is provided only about a chosen action and not what-

might-have-been. These ﬁnd wide-spread application

in problems ranging from pharmaceutical drug discov-

ery to ad placement on the web, and are one of the

most active research areas in the ﬁeld.

Problems of imitation learning and structured pre-

diction may be seen to vary from supervised learning

on the alternate dimension of sequential interaction.

Structured prediction, a key technique used within

computer vision and robotics, where many predictions

are made in concert by leveraging inter-relations be-

tween them, may be seen as a simpliﬁed variant of

imitation learning (Daumé III et al., 2009; Ross et al.,

2011a). In imitation learning, we assume that an ex-

pert (for example, a human pilot) that we wish to

mimic provides demonstrations of a task. While “cor-

rect answers” are provided to the learner, complexity

arises because any mistake by the learner modiﬁes the

future observations from what would have been seen

had the expert chosen the controls. Such problems

provably lead to compounding errors and violate the

basic assumption of independent examples required for


successful supervised learning. In fact, in sharp con-

trast with supervised learning problems where only a

single data-set needs to be collected, repeated inter-

action between learner and teacher appears to both

necessary and suﬃcient (Ross et al., 2011b) to provide

performance guarantees in both theory and practice in

imitation learning problems.

Reinforcement learning embraces the full complex-

ity of these problems by requiring both interactive,

sequential prediction as in imitation learning as well

as complex reward structures with only “bandit” style

feedback on the actions actually chosen. It is this

combination that enables so many problems of rele-

vance to robotics to be framed in these terms; it is

this same combination that makes the problem both

information-theoretically and computationally hard.

We note here brieﬂy the problem termed “Baseline

Distribution RL”: this is the standard RL problem with

the additional beneﬁt for the learner that it may draw

initial states from a distribution provided by an ex-

pert instead of simply an initial state chosen by the

problem. As we describe further in Section 5.1, this

additional information of which states matter dramat-

ically aﬀects the complexity of learning.

1.2 Reinforcement Learning in the

Context of Optimal Control

Reinforcement Learning (RL) is very closely related

to the theory of classical optimal control, as well

as dynamic programming, stochastic programming,

simulation-optimization, stochastic search, and opti-

mal stopping (Powell, 2012). Both RL and optimal

control address the problem of ﬁnding an optimal pol-

icy (often also called the controller or control policy)

that optimizes an objective function (i.e., the accu-

mulated cost or reward), and both rely on the notion

of a system being described by an underlying set of

states, controls and a plant or model that describes

transitions between states. However, optimal control

assumes perfect knowledge of the system’s description

in the form of a model (i.e., a function T that de-

scribes what the next state of the robot will be given

the current state and action). For such models, op-

timal control ensures strong guarantees which, never-

theless, often break down due to model and compu-

tational approximations. In contrast, reinforcement

learning operates directly on measured data and re-

wards from interaction with the environment. Rein-

forcement learning research has placed great focus on

addressing cases which are analytically intractable us-

ing approximations and data-driven techniques. One

of the most important approaches to reinforcement

learning within robotics centers on the use of classi-

cal optimal control techniques (e.g. Linear-Quadratic

Regulation and Diﬀerential Dynamic Programming)

applied to system models learned via repeated interaction with

the environment (Atkeson, 1998; Bagnell and Schnei-

der, 2001; Coates et al., 2009). A concise discussion

of viewing reinforcement learning as “adaptive optimal

control” is presented in (Sutton et al., 1991).

1.3 Reinforcement Learning in the

Context of Robotics

Robotics as a reinforcement learning domain dif-

fers considerably from most well-studied reinforcement

learning benchmark problems. In this article, we high-

light the challenges faced in tackling these problems.

Problems in robotics are often best represented with

high-dimensional, continuous states and actions (note

that the 10-30 dimensional continuous actions common

in robot reinforcement learning are considered large

(Powell, 2012)). In robotics, it is often unrealistic to

assume that the true state is completely observable

and noise-free. The learning system will not be able

to know precisely in which state it is and even vastly

different states might look very similar. Thus, robot

reinforcement learning problems are often modeled as partially

observed, a point we take up in detail in our formal

model description below. The learning system must

hence use ﬁlters to estimate the true state. It is often

essential to maintain the information state of the en-

vironment that not only contains the raw observations

but also a notion of uncertainty on its estimates (e.g.,

both the mean and the variance of a Kalman ﬁlter

tracking the ball in the robot table tennis example).

Experience on a real physical system is tedious to

obtain, expensive and often hard to reproduce. Even

getting to the same initial state is impossible for the

robot table tennis system. Every single trial run, also

called a roll-out, is costly and, as a result, such ap-

plications force us to focus on diﬃculties that do not

arise as frequently in classical reinforcement learning

benchmark examples. In order to learn within a rea-

sonable time frame, suitable approximations of state,

policy, value function, and/or system dynamics need

to be introduced. However, while real-world experi-

ence is costly, it usually cannot be replaced by learning

in simulations alone. In analytical or learned models

of the system, even small modeling errors can accumu-

late and lead to substantially different behavior, at least for

highly dynamic tasks. Hence, algorithms need to be

robust with respect to models that do not capture all

the details of the real system, also referred to as under-

modeling, and to model uncertainty. Another chal-

lenge commonly faced in robot reinforcement learning

is the generation of appropriate reward functions. Re-

wards that guide the learning system quickly to success

are needed to cope with the cost of real-world expe-

rience. This problem is called reward shaping (Laud,

2004) and represents a substantial manual contribu-

tion. Specifying good reward functions in robotics re-

quires a fair amount of domain knowledge and may

often be hard in practice.

Not every reinforcement learning method is equally

suitable for the robotics domain. In fact, many of

the methods thus far demonstrated on diﬃcult prob-

lems have been model-based (Atkeson et al., 1997;

Abbeel et al., 2007; Deisenroth and Rasmussen, 2011)

and robot learning systems often employ policy search

methods rather than value function-based approaches

(Gullapalli et al., 1994; Miyamoto et al., 1996; Bagnell

and Schneider, 2001; Kohl and Stone, 2004; Tedrake


et al., 2005; Peters and Schaal, 2008a,b; Kober and

Peters, 2009; Deisenroth et al., 2011). Such design

choices stand in contrast to possibly the bulk of the

early research in the machine learning community

(Kaelbling et al., 1996; Sutton and Barto, 1998). We

attempt to give a fairly complete overview of real

robot reinforcement learning, citing most original pa-

pers while grouping them based on the key insights

employed to make the Robot Reinforcement Learn-

ing problem tractable. We isolate key insights such

as choosing an appropriate representation for a value

function or policy, incorporating prior knowledge, and

transferring knowledge from simulations.

This paper surveys a wide variety of tasks where re-

inforcement learning has been successfully applied to

robotics. If a task can be phrased as an optimiza-

tion problem and exhibits temporal structure, rein-

forcement learning can often be proﬁtably applied to

both phrase and solve that problem. The goal of this

paper is twofold. On the one hand, we hope that

this paper can provide indications for the robotics

community as to which types of problems can be tackled

by reinforcement learning and provide pointers to ap-

proaches that are promising. On the other hand, for

the reinforcement learning community, this paper can

point out novel real-world test beds and remarkable

opportunities for research on open questions. We fo-

cus mainly on results that were obtained on physical

robots with tasks going beyond typical reinforcement

learning benchmarks.

We concisely present reinforcement learning tech-

niques in the context of robotics in Section 2. The chal-

lenges in applying reinforcement learning in robotics

are discussed in Section 3. Diﬀerent approaches to

making reinforcement learning tractable are treated

in Sections 4 to 6. In Section 7, the example of ball-

in-a-cup is employed to highlight which of the various

approaches discussed in the paper have been particu-

larly helpful to make such a complex task tractable.

Finally, in Section 8, we summarize the speciﬁc prob-

lems and beneﬁts of reinforcement learning in robotics

and provide concluding thoughts on the problems and

promise of reinforcement learning in robotics.

2 A Concise Introduction to

Reinforcement Learning

In reinforcement learning, an agent tries to maxi-

mize the accumulated reward over its life-time. In an

episodic setting, where the task is restarted after each

end of an episode, the objective is to maximize the to-

tal reward per episode. If the task is on-going without

a clear beginning and end, either the average reward

over the whole life-time or a discounted return (i.e., a

weighted average where distant rewards have less inﬂu-

ence) can be optimized. In such reinforcement learning

problems, the agent and its environment may be mod-

eled as being in a state s ∈ S and can perform actions

a ∈ A, each of which may be members of either dis-

crete or continuous sets and can be multi-dimensional.

A state s contains all relevant information about the

current situation to predict future states (or observ-

ables); an example would be the current position of a

robot in a navigation task1. An action a is used to con-

trol (or change) the state of the system. For example,

in the navigation task we could have the actions corre-

sponding to torques applied to the wheels. For every

step, the agent also gets a reward R, which is a scalar

value and assumed to be a function of the state and

observation. (It may equally be modeled as a random

variable that depends on only these variables.) In the

navigation task, a possible reward could be designed

based on the energy costs of the actions taken and re-

wards for reaching targets. The goal of reinforcement

learning is to ﬁnd a mapping from states to actions,

called policy π, that picks actions a in given states

s maximizing the cumulative expected reward. The

policy π is either deterministic or probabilistic. The

former always uses the exact same action for a given

state in the form a = π(s), the latter draws a sample

from a distribution over actions when it encounters a

state, i.e., a ∼ π(s, a) = P(a|s). The reinforcement

learning agent needs to discover the relations between

states, actions, and rewards. Hence exploration is re-

quired which can either be directly embedded in the

policy or performed separately and only as part of the

learning process.

Classical reinforcement learning approaches are

based on the assumption that we have a Markov Deci-

sion Process (MDP) consisting of the set of states S,

set of actions A, the rewards R, and transition probabil-

ities T that capture the dynamics of a system. Transi-

tion probabilities (or densities in the continuous state

case) T(s′, a, s) = P(s′|s, a) describe the effects of the

actions on the state. Transition probabilities general-

ize the notion of deterministic dynamics to allow for

modeling outcomes that are uncertain even given the full state.

The Markov property requires that the next state s′

and the reward only depend on the previous state s

and action a (Sutton and Barto, 1998), and not on ad-

ditional information about the past states or actions.

In a sense, the Markov property recapitulates the idea

of state – a state is a suﬃcient statistic for predicting

the future, rendering previous observations irrelevant.

In general in robotics, we may only be able to ﬁnd

some approximate notion of state.

Diﬀerent types of reward functions are commonly

used, including rewards depending only on the current

state R=R(s), rewards depending on the current state

and action R=R(s, a), and rewards including the tran-

sitions R=R(s′, a, s). Most of the theoretical guar-

antees only hold if the problem adheres to a Markov

structure; however, in practice, many approaches work

very well for many problems that do not fulﬁll this

requirement.
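
For concreteness, the MDP ingredients just discussed – states, actions, transition probabilities T, a reward function R(s, a), and a stochastic policy π(a|s) – can be written down in tabular form. The following minimal sketch (Python/NumPy) uses a hypothetical two-state, two-action toy problem, purely for illustration, and samples a single interaction step.

```python
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions (illustration only).
# T[s, a, s_next] = P(s_next | s, a)
T = np.array([[[0.9, 0.1],    # state 0, action 0
               [0.2, 0.8]],   # state 0, action 1
              [[0.0, 1.0],    # state 1, action 0
               [0.5, 0.5]]])  # state 1, action 1
# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
# Stochastic policy pi[s, a] = P(a | s); rows sum to one
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

rng = np.random.default_rng(0)
s = 0                                # current state
a = rng.choice(2, p=pi[s])           # a ~ pi(.|s)
r = R[s, a]                          # scalar reward signal
s_next = rng.choice(2, p=T[s, a])    # s' ~ T(s'|s, a)
print(s, a, r, s_next)
```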

1When only observations but not the complete state is avail-

able, the suﬃcient statistics of the ﬁlter can alternatively

serve as state s. Such a state is often called information or

belief state.


2.1 Goals of Reinforcement Learning

The goal of reinforcement learning is to discover an

optimal policy π∗ that maps states (or observations)

to actions so as to maximize the expected return J,

which corresponds to the cumulative expected reward.

There are diﬀerent models of optimal behavior (Kael-

bling et al., 1996) which result in diﬀerent deﬁnitions

of the expected return. A ﬁnite-horizon model only at-

tempts to maximize the expected reward for the hori-

zon H, i.e., the next H (time-)steps h:

J = E[ Σ_{h=0}^{H} R_h ].

This setting can also be applied to model problems

where it is known how many steps are remaining.

Alternatively, future rewards can be discounted by

a discount factor γ (with 0 ≤ γ < 1):

J = E[ Σ_{h=0}^{∞} γ^h R_h ].

This is the setting most frequently discussed in clas-

sical reinforcement learning texts. The parameter γ

aﬀects how much the future is taken into account and

needs to be tuned manually. As illustrated in (Kael-

bling et al., 1996), this parameter often qualitatively

changes the form of the optimal solution. Policies

designed by optimizing with small γ are myopic and

greedy, and may lead to poor performance if we ac-

tually care about longer term rewards. It is straight-

forward to show that the optimal control law can be

unstable if the discount factor is too low (e.g., it is

not diﬃcult to show this destabilization even for dis-

counted linear quadratic regulation problems). Hence,

discounted formulations are frequently inadmissible in

robot control.

In the limit when γ approaches 1, the metric ap-

proaches what is known as the average-reward crite-

rion (Bertsekas, 1995),

J = lim_{H→∞} E[ (1/H) Σ_{h=0}^{H} R_h ].

This setting has the problem that it cannot distin-

guish between policies that initially gain a transient of

large rewards and those that do not. This transient

phase, also called preﬁx, is dominated by the rewards

obtained in the long run. If a policy accomplishes both

an optimal preﬁx as well as an optimal long-term be-

havior, it is called bias optimal (Lewis and Puterman,

2001). An example in robotics would be the tran-

sient phase during the start of a rhythmic movement,

where many policies will accomplish the same long-

term reward but diﬀer substantially in the transient

(e.g., there are many ways of starting the same gait

in dynamic legged locomotion) allowing for room for

improvement in practical application.

In real-world domains, the shortcomings of the dis-

counted formulation are often more critical than those

of the average reward setting as stable behavior is often

more important than a good transient (Peters et al.,

2004). We also often encounter an episodic control

task, where the task runs only for H time-steps and

is then reset (potentially by human intervention) and

started over. This horizon, H, may be arbitrarily large,

as long as the expected reward over the episode can

be guaranteed to converge. As such episodic tasks are

probably the most frequent ones, ﬁnite-horizon models

are often the most relevant.
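
As a simple illustration of the three notions of return above, the short sketch below (Python/NumPy; the reward sequence is made up purely for illustration) computes the finite-horizon, discounted, and empirical average returns for one sequence of rewards R_0, ..., R_H.

```python
import numpy as np

rewards = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.5])  # hypothetical R_0 .. R_H
H = len(rewards) - 1
gamma = 0.9

finite_horizon_return = rewards.sum()                             # Σ_{h=0}^{H} R_h
discounted_return = np.sum(gamma ** np.arange(H + 1) * rewards)   # Σ_{h=0}^{∞} γ^h R_h, truncated at H
average_reward = rewards.mean()                                    # empirical average over the episode

print(finite_horizon_return, discounted_return, average_reward)
```

Smaller values of γ shrink the contribution of later rewards, which makes the resulting policies more myopic, as discussed above.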

Two natural goals arise for the learner. In the ﬁrst,

we attempt to ﬁnd an optimal strategy at the end of

a phase of training or interaction. In the second, the

goal is to maximize the reward over the whole time the

robot is interacting with the world.

In contrast to supervised learning, the learner must

ﬁrst discover its environment and is not told the opti-

mal action it needs to take. To gain information about

the rewards and the behavior of the system, the agent

needs to explore by considering previously unused ac-

tions or actions it is uncertain about. It needs to de-

cide whether to play it safe and stick to well known ac-

tions with (moderately) high rewards or to dare trying

new things in order to discover new strategies with an

even higher reward. This problem is commonly known

as the exploration-exploitation trade-oﬀ.

In principle, reinforcement learning algorithms for

Markov Decision Processes with performance guar-

antees are known (Kakade, 2003; Kearns and Singh,

2002; Brafman and Tennenholtz, 2002) with polyno-

mial scaling in the size of the state and action spaces,

an additive error term, as well as in the horizon length

(or a suitable substitute including the discount factor

or “mixing time” (Kearns and Singh, 2002)). However,

state spaces in robotics problems are often tremen-

dously large as they scale exponentially in the num-

ber of state variables and often are continuous. This

challenge of exponential growth is often referred to as

the curse of dimensionality (Bellman, 1957) (also dis-

cussed in Section 3.1).

Off-policy methods learn independently of the em-

ployed policy, i.e., an explorative strategy that is dif-

ferent from the desired ﬁnal policy can be employed

during the learning process. On-policy methods collect

sample information about the environment using the

current policy. As a result, exploration must be built

into the policy and determines the speed of the policy

improvements. Such exploration and the performance

of the policy can result in an exploration-exploitation

trade-oﬀ between long- and short-term improvement

of the policy. Modeling exploration with prob-

ability distributions has surprising implications, e.g.,

stochastic policies have been shown to be the optimal

stationary policies for selected problems (Sutton et al.,

1999; Jaakkola et al., 1993) and can even break the

curse of dimensionality (Rust, 1997). Furthermore,

stochastic policies often allow the derivation of new

policy update steps with surprising ease.

The agent needs to determine a correlation between

actions and reward signals. An action taken does not

have to have an immediate eﬀect on the reward but

can also inﬂuence a reward in the distant future. The

diﬃculty in assigning credit for rewards is directly re-


lated to the horizon or mixing time of the problem. It

also increases with the dimensionality of the actions as

not all parts of the action may contribute equally.

The classical reinforcement learning setup is an MDP

where, in addition to the states S, actions A, and re-

wards R, we also have transition probabilities T(s′, a, s).

Here, the reward is modeled as a reward function

R(s, a). If both the transition probabilities and reward

function are known, this can be seen as an optimal

control problem (Powell, 2012).

2.2 Reinforcement Learning in the

Average Reward Setting

We focus on the average-reward model in this section.

Similar derivations exist for the ﬁnite horizon and dis-

counted reward cases. In many instances, the average-

reward case is more suitable in a robotic setting

as we do not have to choose a discount factor and we

do not have to explicitly consider time in the deriva-

tion.

To make a policy able to be optimized by continuous

optimization techniques, we write a policy as a condi-

tional probability distribution π(s, a) = P(a|s). Below,

we consider restricted policies that are parametrized

by a vector θ. In reinforcement learning, the policy

is usually considered to be stationary and memory-

less. Reinforcement learning and optimal control aim

at finding the optimal policy π∗ or equivalent pol-

icy parameters θ∗ which maximize the average return

J(π) = Σ_{s,a} µπ(s) π(s, a) R(s, a), where µπ is the sta-

tionary state distribution generated by policy π acting

in the environment, i.e., the MDP. It can be shown

(Puterman, 1994) that such policies that map states

(even deterministically) to actions are suﬃcient to en-

sure optimality in this setting – a policy needs neither

to remember previous states visited, actions taken, or

the particular time step. For simplicity and to ease

exposition, we assume that this distribution is unique.

Markov Decision Processes where this fails (i.e., non-

ergodic processes) require more care in analysis, but

similar results exist (Puterman, 1994). The transitions

between states s caused by actions a are modeled as

T(s, a, s′) = P(s′|s, a). We can then frame the control

problem as an optimization of

max_π  J(π) = Σ_{s,a} µπ(s) π(s, a) R(s, a),        (1)

s.t.   µπ(s′) = Σ_{s,a} µπ(s) π(s, a) T(s, a, s′),  ∀s′ ∈ S,        (2)

       1 = Σ_{s,a} µπ(s) π(s, a),        (3)

       π(s, a) ≥ 0,  ∀s ∈ S, a ∈ A.

Here, Equation (2) deﬁnes stationarity of the state dis-

tributions µπ (i.e., it ensures that it is well defined) and

Equation (3) ensures a proper state-action probability

distribution. This optimization problem can be tack-

led in two substantially diﬀerent ways (Bellman, 1967,

1971). We can search the optimal solution directly in

this original, primal problem or we can optimize in

the Lagrange dual formulation. Optimizing in the pri-

mal formulation is known as policy search in reinforce-

ment learning while searching in the dual formulation

is known as a value function-based approach.
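
For a tabular MDP, the primal objective and the stationarity constraint can be evaluated directly for a fixed stochastic policy. A minimal sketch (Python/NumPy), assuming arrays T, R, and pi like those in the earlier toy example and an induced Markov chain with a unique stationary distribution, is:

```python
import numpy as np

def average_return(T, R, pi, n_iters=1000):
    """Evaluate J(pi) = sum_{s,a} mu_pi(s) pi(a|s) R(s,a) for a tabular MDP.

    T[s, a, s'] : transition probabilities, R[s, a] : rewards,
    pi[s, a]    : stochastic policy P(a|s).
    Assumes the chain induced by pi has a unique stationary distribution.
    """
    n_states = T.shape[0]
    # State-to-state transition matrix under the policy:
    # P_pi[s, s'] = sum_a pi(a|s) T(s, a, s')
    P_pi = np.einsum('sa,sap->sp', pi, T)
    # Fixed-point iteration for the stationarity condition mu = mu P_pi (Eq. (2))
    mu = np.full(n_states, 1.0 / n_states)
    for _ in range(n_iters):
        mu = mu @ P_pi
    J = np.sum(mu[:, None] * pi * R)   # objective of Eq. (1)
    return J, mu

# Example usage (with toy arrays such as the ones sketched earlier):
# J, mu = average_return(T, R, pi)
```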

2.2.1 Value Function Approaches

Much of the reinforcement learning literature has fo-

cused on solving the optimization problem in Equa-

tions (1-3) in its dual form (Gordon, 1999; Puterman,

1994)2. Using Lagrange multipliers Vπ(s′) and R̄, we

can express the Lagrangian of the problem by

L = Σ_{s,a} µπ(s) π(s, a) R(s, a)
  + Σ_{s′} Vπ(s′) [ Σ_{s,a} µπ(s) π(s, a) T(s, a, s′) − µπ(s′) ]
  + R̄ [ 1 − Σ_{s,a} µπ(s) π(s, a) ]

  = Σ_{s,a} µπ(s) π(s, a) [ R(s, a) + Σ_{s′} Vπ(s′) T(s, a, s′) − R̄ ]
  − Σ_{s′} Vπ(s′) µπ(s′) Σ_{a′} π(s′, a′) + R̄,

where Σ_{a′} π(s′, a′) = 1.

Using the property Σ_{s′,a′} V(s′) µπ(s′) π(s′, a′) =

Σ_{s,a} V(s) µπ(s) π(s, a), we can obtain the Karush-

Kuhn-Tucker conditions (Kuhn and Tucker, 1950) by

differentiating with respect to µπ(s)π(s, a), which yields

extrema at

∂L/∂(µπ(s)π(s, a)) = R(s, a) + Σ_{s′} Vπ(s′) T(s, a, s′) − R̄ − Vπ(s) = 0.

This statement implies that there are as many equa-

tions as the number of states multiplied by the num-

ber of actions. For each state there can be one or

several optimal actions a∗ that result in the same

maximal value, and, hence, can be written in terms

of the optimal action a∗ as Vπ∗(s) = R(s, a∗) − R̄ +

Σ_{s′} Vπ∗(s′) T(s, a∗, s′). As a∗ is generated by the same

optimal policy π∗, we know the condition for the mul-

tipliers at optimality is

V∗(s) = max_{a∗} [ R(s, a∗) − R̄ + Σ_{s′} V∗(s′) T(s, a∗, s′) ],        (4)

where V∗(s) is a shorthand notation for Vπ∗(s). This

statement is equivalent to the Bellman Principle of

Optimality (Bellman, 1957)3 that states “An optimal

policy has the property that whatever the initial state

and initial decision are, the remaining decisions must

constitute an optimal policy with regard to the state

resulting from the ﬁrst decision.” Thus, we have to

perform an optimal action a∗, and, subsequently, fol-

low the optimal policy π∗in order to achieve a global

optimum. When evaluating Equation (4), we realize

that the optimal value function V∗(s) corresponds to the

2For historical reasons, what we call the dual is often referred

to in the literature as the primal. We argue that the problem

of optimizing expected reward is the fundamental problem,

and values are an auxiliary concept.

3This optimality principle was originally formulated for a set-

ting with discrete time steps and continuous states and ac-

tions but is also applicable for discrete states and actions.


long term additional reward, beyond the average re-

ward R̄, gained by starting in state s while taking op-

timal actions a∗ (according to the optimal policy π∗).

This principle of optimality has also been crucial in

enabling the ﬁeld of optimal control (Kirk, 1970).

Hence, we have a dual formulation of the origi-

nal problem that serves as a condition for optimality.

Many traditional reinforcement learning approaches

are based on identifying (possibly approximate) solu-

tions to this equation, and are known as value function

methods. Instead of directly learning a policy, they

ﬁrst approximate the Lagrangian multipliers V∗(s),

also called the value function, and use it to reconstruct

the optimal policy. The value function Vπ(s) is defined

equivalently; however, instead of always taking the op-

timal action a∗, the action a is picked according to a

policy π:

Vπ(s) = Σ_a π(s, a) [ R(s, a) − R̄ + Σ_{s′} Vπ(s′) T(s, a, s′) ].

Instead of the value function Vπ(s), many algorithms

rely on the state-action value function Qπ(s, a),

which has advantages for determining the optimal pol-

icy as shown below. This function is defined as

Qπ(s, a) = R(s, a) − R̄ + Σ_{s′} Vπ(s′) T(s, a, s′).

In contrast to the value function Vπ(s), the state-

action value function Qπ(s, a) explicitly contains the

information about the effects of a particular action.

The optimal state-action value function is

Q∗(s, a) = R(s, a) − R̄ + Σ_{s′} V∗(s′) T(s, a, s′)
         = R(s, a) − R̄ + Σ_{s′} [ max_{a′} Q∗(s′, a′) ] T(s, a, s′).

It can be shown that an optimal, deterministic pol-

icy π∗(s) can be reconstructed by always picking the

action a∗ in the current state that leads to the state s′

with the highest value V∗(s′):

π∗(s) = arg max_a [ R(s, a) − R̄ + Σ_{s′} V∗(s′) T(s, a, s′) ].

If the optimal value function V∗(s′) and the transi-

tion probabilities T(s, a, s′) for the following states are

known, determining the optimal policy is straightfor-

ward in a setting with discrete actions as an exhaustive

search is possible. For continuous spaces, determining

the optimal action a∗is an optimization problem in it-

self. If both states and actions are discrete, the value

function and the policy may, in principle, be repre-

sented by tables and picking the appropriate action is

reduced to a look-up. For large or continuous spaces

representing the value function as a table becomes in-

tractable. Function approximation is employed to ﬁnd

a lower dimensional representation that matches the

real value function as closely as possible, as discussed

in Section 2.4. Using the state-action value function

Q∗(s, a) instead of the value function V∗(s),

π∗(s) = arg max_a Q∗(s, a),

avoids having to calculate the weighted sum over the

successor states, and hence no knowledge of the tran-

sition function is required.

A wide variety of methods of value function based

reinforcement learning algorithms that attempt to es-

timate V∗(s) or Q∗(s, a) have been developed and

can be split mainly into three classes: (i) dynamic

programming-based optimal control approaches such

as policy iteration or value iteration, (ii) rollout-based

Monte Carlo methods and (iii) temporal diﬀerence

methods such as TD(λ) (Temporal Diﬀerence learn-

ing), Q-learning, and SARSA (State-Action-Reward-

State-Action).

Dynamic Programming-Based Methods require a

model of the transition probabilities T(s′, a, s) and the

reward function R(s, a) to calculate the value function.

The model does not necessarily need to be predeter-

mined but can also be learned from data, potentially

incrementally. Such methods are called model-based.

Typical methods include policy iteration and value it-

eration.

Policy iteration alternates between the two phases

of policy evaluation and policy improvement. The ap-

proach is initialized with an arbitrary policy. Policy

evaluation determines the value function for the cur-

rent policy. Each state is visited and its value is up-

dated based on the current value estimates of its suc-

cessor states, the associated transition probabilities, as

well as the policy. This procedure is repeated until the

value function converges to a ﬁxed point, which corre-

sponds to the true value function. Policy improvement

greedily selects the best action in every state accord-

ing to the value function as shown above. The two

steps of policy evaluation and policy improvement are

iterated until the policy does not change any longer.

Policy iteration only updates the policy once the

policy evaluation step has converged. In contrast,

value iteration combines the steps of policy evalua-

tion and policy improvement by directly updating the

value function based on Eq. (4) every time a state is

updated.
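
A minimal sketch of (relative) value iteration for the average-reward Bellman equation (4), assuming a small tabular MDP that is unichain and aperiodic so that the gain estimate converges to R̄, could look as follows (Python/NumPy; a generic illustration, not code from the cited works):

```python
import numpy as np

def relative_value_iteration(T, R, n_iters=1000, ref_state=0):
    """Tabular (relative) value iteration for the average-reward Bellman
    equation (4).  T[s, a, s'] = P(s'|s, a), R[s, a] = expected reward.
    Assumes a unichain, aperiodic MDP.  Returns the differential value
    function V, the gain estimate R_bar, and a greedy policy."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    R_bar = 0.0
    for _ in range(n_iters):
        # One-step look-ahead for all state-action pairs
        Q = R + np.einsum('sap,p->sa', T, V)
        W = Q.max(axis=1)
        R_bar = W[ref_state]   # gain estimate, converges to the average reward
        V = W - R_bar          # subtract reference value to keep V bounded
    policy = Q.argmax(axis=1)  # greedy policy w.r.t. the final look-ahead
    return V, R_bar, policy
```

Policy iteration would instead alternate a full evaluation of the current policy with a greedy improvement step, as described above.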

Monte Carlo Methods use sampling in order to es-

timate the value function. This procedure can be

used to replace the policy evaluation step of the dy-

namic programming-based methods above. Monte

Carlo methods are model-free, i.e., they do not need

an explicit transition function. They perform roll-

outs by executing the current policy on the system,

hence operating on-policy. The frequencies of transi-

tions and rewards are kept track of and are used to

form estimates of the value function. For example, in

an episodic setting the state-action value of a given

state action pair can be estimated by averaging all the

returns that were received when starting from them.
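
A minimal sketch of such a Monte Carlo estimate, assuming complete episodes are available as lists of (state, action, reward) tuples (a hypothetical interface), is shown below; it averages the observed returns for every visited state-action pair.

```python
from collections import defaultdict

def monte_carlo_q_estimate(episodes):
    """Every-visit Monte Carlo estimate of Q(s, a) from complete episodes.
    `episodes` is a list of rollouts, each a list of (state, action, reward)."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Return from step h onward = sum of rewards until the end of the episode
        g = 0.0
        tail_returns = []
        for (_, _, r) in reversed(episode):
            g += r
            tail_returns.append(g)
        tail_returns.reverse()
        for (s, a, _), g_h in zip(episode, tail_returns):
            returns_sum[(s, a)] += g_h
            returns_count[(s, a)] += 1
    return {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}
```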

Temporal Diﬀerence Methods, unlike Monte Carlo

methods, do not have to wait until an estimate of the

return is available (i.e., at the end of an episode) to

update the value function. Rather, they use tempo-

ral errors and only have to wait until the next time

7

step. The temporal error is the diﬀerence between the

old estimate and a new estimate of the value function,

taking into account the reward received in the current

sample. These updates are done iteratively and, in

contrast to dynamic programming methods, only take

into account the sampled successor states rather than

the complete distributions over successor states. Like

the Monte Carlo methods, these methods are model-

free, as they do not use a model of the transition func-

tion to determine the value function. In this setting,

the value function cannot be calculated analytically

but has to be estimated from sampled transitions in

the MDP. For example, the value function could be

updated iteratively by

V′(s) = V(s) + α [ R(s, a) − R̄ + V(s′) − V(s) ],

where V(s) is the old estimate of the value function,

V′(s) the updated one, and α is a learning rate. This

update step is called the TD(0)-algorithm in the dis-

counted reward case. In order to perform action selec-

tion a model of the transition function is still required.

The equivalent temporal diﬀerence learning algo-

rithm for state-action value functions is the average

reward case version of SARSA with

Q′(s, a) = Q(s, a) + α [ R(s, a) − R̄ + Q(s′, a′) − Q(s, a) ],

where Q(s, a) is the old estimate of the state-action

value function and Q′(s, a) the updated one. This al-

gorithm is on-policy as both the current action a as

well as the subsequent action a′ are chosen according

to the current policy π. The oﬀ-policy variant is called

R-learning (Schwartz, 1993), which is closely related to

Q-learning, with the updates

Q′(s, a) = Q(s, a) + α [ R(s, a) − R̄ + max_{a′} Q(s′, a′) − Q(s, a) ].

These methods do not require a model of the transi-

tion function for determining the deterministic optimal

policy π∗(s). H-learning (Tadepalli and Ok, 1994) is

a related method that estimates a model of the tran-

sition probabilities and the reward function in order

to perform updates that are reminiscent of value iter-

ation.
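
A single tabular update of the average-reward SARSA rule above might be sketched as follows (Python/NumPy). Maintaining the estimate of R̄ with a second learning rate β is one common choice and an assumption of this sketch, not the only possibility.

```python
import numpy as np

def sarsa_avg_reward_step(Q, R_bar, s, a, r, s_next, a_next,
                          alpha=0.1, beta=0.01):
    """One tabular update of average-reward SARSA (as in the text).

    Q      : array of shape (n_states, n_actions)
    R_bar  : running estimate of the average reward
    (s, a, r, s_next, a_next) : one observed transition plus the next
             on-policy action a' ~ pi(.|s').
    """
    td_error = r - R_bar + Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
    R_bar += beta * td_error   # drift the average-reward estimate (one common choice)
    return Q, R_bar
```

The R-learning variant would replace Q[s_next, a_next] by max over a′ of Q[s_next, a′], making the update off-policy.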

An overview of publications using value function

based methods is presented in Table 1. Here, model-

based methods refer to all methods that employ a

predetermined or a learned model of system dynam-

ics.

2.2.2 Policy Search

The primal formulation of the problem in terms of pol-

icy rather than value offers many features relevant to

robotics. It allows for a natural integration of expert

knowledge, e.g., through both structure and initializa-

tions of the policy. It allows domain-appropriate pre-

structuring of the policy in an approximate form with-

out changing the original problem. Optimal policies

often have many fewer parameters than optimal value

functions. For example, in linear quadratic control,

the value function has quadratically many parameters

in the dimensionality of the state-variables while the

policy requires only linearly many parameters. Local

search in policy space can directly lead to good results

as exhibited by early hill-climbing approaches (Kirk,

1970), as well as more recent successes (see Table 2).

Additional constraints can be incorporated naturally,

e.g., regularizing the change in the path distribution.

As a result, policy search often appears more natural

to robotics.

Nevertheless, policy search has been considered the

harder problem for a long time as the optimal solution

cannot directly be determined from Equations (1-3)

while the solution of the dual problem leveraging the Bell-

man Principle of Optimality (Bellman, 1957) enables

dynamic programming based solutions.

Notwithstanding this, in robotics, policy search has

recently become an important alternative to value

function based methods due to better scalability as

well as the convergence problems of approximate value

function methods (see Sections 2.3 and 4.2). Most pol-

icy search methods optimize locally around existing

policies π, parametrized by a set of policy parameters

θi, by computing changes in the policy parameters ∆θi

that will increase the expected return, resulting in it-

erative updates of the form

θi+1 = θi + ∆θi.

The computation of the policy update is the key

step here and a variety of updates have been pro-

posed ranging from pairwise comparisons (Strens and

Moore, 2001; Ng et al., 2004a) over gradient estima-

tion using ﬁnite policy diﬀerences (Geng et al., 2006;

Kohl and Stone, 2004; Mitsunaga et al., 2005; Roberts

et al., 2010; Sato et al., 2002; Tedrake et al., 2005),

and general stochastic optimization methods (such as

Nelder-Mead (Bagnell and Schneider, 2001), cross en-

tropy (Rubinstein and Kroese, 2004) and population-

based methods (Goldberg, 1989)) to approaches com-

ing from optimal control such as diﬀerential dynamic

programming (DDP) (Atkeson, 1998) and multiple

shooting approaches (Betts, 2001). We may broadly

break down policy-search methods into “black box”

and “white box” methods. Black box methods are gen-

eral stochastic optimization algorithms (Spall, 2003)

using only the expected return of policies, estimated by

sampling, and do not leverage any of the internal struc-

ture of the RL problem. These may be very sophisti-

cated techniques (Tesch et al., 2011) that use response

surface estimates and bandit-like strategies to achieve

good performance. White box methods take advan-

tage of some of additional structure within the rein-

forcement learning domain, including, for instance, the

(approximate) Markov structure of problems, devel-

oping approximate models, value-function estimates

when available (Peters and Schaal, 2008c), or even

simply the causal ordering of actions and rewards. A

major open issue within the ﬁeld is the relative mer-

its of these two approaches: in principle, white

box methods leverage more information, but with the

exception of models (which have been demonstrated

repeatedly to often make tremendous performance im-

provements, see Section 6), the performance gains are


Value Function Approaches

Model-Based: Bakker et al. (2006); Hester et al. (2010, 2012); Kalmár et al. (1998); Martínez-Marín and Duckett (2005); Schaal (1996); Touzet (1997)

Model-Free: Asada et al. (1996); Bakker et al. (2003); Benbrahim et al. (1992); Benbrahim and Franklin (1997); Birdwell and Livingston (2007); Bitzer et al. (2010); Conn and Peters II (2007); Duan et al. (2007, 2008); Fagg et al. (1998); Gaskett et al. (2000); Gräve et al. (2010); Hafner and Riedmiller (2007); Huang and Weng (2002); Huber and Grupen (1997); Ilg et al. (1999); Katz et al. (2008); Kimura et al. (2001); Kirchner (1997); Konidaris et al. (2011a, 2012); Kroemer et al. (2009, 2010); Kwok and Fox (2004); Latzke et al. (2007); Mahadevan and Connell (1992); Matarić (1997); Morimoto and Doya (2001); Nemec et al. (2009, 2010); Oßwald et al. (2010); Paletta et al. (2007); Pendrith (1999); Platt et al. (2006); Riedmiller et al. (2009); Rottmann et al. (2007); Smart and Kaelbling (1998, 2002); Soni and Singh (2006); Tamošiūnaitė et al. (2011); Thrun (1995); Tokic et al. (2009); Touzet (1997); Uchibe et al. (1998); Wang et al. (2006); Willgoss and Iqbal (1999)

Table 1: This table illustrates different value function based reinforcement learning methods employed for robotic tasks (both average and discounted reward cases) and associated publications.

traded-oﬀ with additional assumptions that may be vi-

olated and less mature optimization algorithms. Some

recent work including (Stulp and Sigaud, 2012; Tesch

et al., 2011) suggests that much of the benefit of policy

search is achieved by black-box methods.

Some of the most popular white-box general re-

inforcement learning techniques that have translated

particularly well into the domain of robotics include:

(i) policy gradient approaches based on likelihood-

ratio estimation (Sutton et al., 1999), (ii) policy up-

dates inspired by expectation-maximization (Tous-

saint et al., 2010), and (iii) the path integral methods

(Kappen, 2005).

Let us brieﬂy take a closer look at gradient-based

approaches ﬁrst. The updates of the policy parameters

are based on a hill-climbing approach, that is, following

the gradient of the expected return J for a defined

step-size α:

θi+1 = θi + α ∇θJ.

Diﬀerent methods exist for estimating the gradient

∇θJ and many algorithms require tuning of the step-

size α.

In finite difference gradients, P perturbed policy pa-

rameters are evaluated to obtain an estimate of the

gradient. Here we have ∆Ĵp ≈ J(θi + ∆θp) − Jref, where

p = [1..P] are the individual perturbations, ∆Ĵp the es-

timate of their influence on the return, and Jref is a

reference return, e.g., the return of the unperturbed

parameters. The gradient can now be estimated by

linear regression:

∇θJ ≈ (∆Θ^T ∆Θ)^{−1} ∆Θ^T ∆Ĵ,

where the matrix ∆Θ contains all the stacked samples

of the perturbations ∆θp and ∆Ĵ contains the corre-

sponding ∆Ĵp.

number of perturbations needs to be at least as large

as the number of parameters. The approach is very

straightforward and even applicable to policies that

are not diﬀerentiable. However, it is usually consid-

ered to be very noisy and ineﬃcient. For the ﬁnite

difference approach, tuning the step-size α for the up-

date, the number of perturbations P, and the type

and magnitude of perturbations are all critical tuning

factors.
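
A minimal sketch of this finite-difference estimator, assuming a hypothetical function rollout_return(theta) that executes the policy with parameters θ (on the robot or in simulation) and returns the obtained episode return, is:

```python
import numpy as np

def finite_difference_gradient(rollout_return, theta, n_perturbations=20,
                               magnitude=0.1):
    """Estimate grad_theta J by regressing return differences on parameter
    perturbations.  For a well-posed regression, n_perturbations should be
    at least as large as the number of parameters.

    rollout_return : callable theta -> (noisy) estimate of the return J(theta)
    theta          : current policy parameters (1-D array)
    """
    J_ref = rollout_return(theta)                       # reference return
    delta_thetas = magnitude * np.random.randn(n_perturbations, theta.size)
    delta_Js = np.array([rollout_return(theta + d) - J_ref for d in delta_thetas])
    # grad ≈ (ΔΘ^T ΔΘ)^(-1) ΔΘ^T ΔĴ, solved via least squares for stability
    grad, *_ = np.linalg.lstsq(delta_thetas, delta_Js, rcond=None)
    return grad

# Gradient ascent update with step-size alpha:
# theta = theta + alpha * finite_difference_gradient(rollout_return, theta)
```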

Likelihood ratio methods rely on the insight that in

an episodic setting, where the episodes τ are generated

according to the distribution Pθ(τ) = P(τ|θ), with the

return of an episode Jτ = Σ_{h=1}^{H} R_h and number of

steps H, the expected return for a set of policy param-

eters θ can be expressed as

Jθ = Σ_τ Pθ(τ) Jτ.        (5)

The gradient of the episode distribution can be written

as4

∇θ Pθ(τ) = Pθ(τ) ∇θ log Pθ(τ),        (6)

which is commonly known as the likelihood ratio

or REINFORCE (Williams, 1992) trick. Combining

Equations (5) and (6) we get the gradient of the ex-

pected return in the form

∇θJθ = Σ_τ ∇θ Pθ(τ) Jτ = Σ_τ Pθ(τ) ∇θ log Pθ(τ) Jτ = E[ ∇θ log Pθ(τ) Jτ ].

If we have a stochastic policy πθ(s, a) that generates

the episodes τ, we do not need to keep track of the

probabilities of the episodes but can directly express

the gradient in terms of the policy as ∇θ log Pθ(τ) =

Σ_{h=1}^{H} ∇θ log πθ(sh, ah). Finally, the gradient of the ex-

pected return with respect to the policy parameters

can be estimated as

∇θJθ = E[ Σ_{h=1}^{H} ∇θ log πθ(sh, ah) Jτ ].

If we now take into account that rewards at the

beginning of an episode cannot be caused by actions

4From multi-variate calculus we have ∇θ log Pθ(τ) = ∇θ Pθ(τ)/Pθ(τ).


taken at the end of an episode, we can replace the re-

turn of the episode Jτ by the state-action value func-

tion Qπ(s, a) and get (Peters and Schaal, 2008c)

∇θJθ = E[ Σ_{h=1}^{H} ∇θ log πθ(sh, ah) Qπ(sh, ah) ],

which is equivalent to the policy gradient theorem (Sut-

ton et al., 1999). In practice, it is often advisable to

subtract a reference Jref, also called a baseline, from the

return of the episode Jτ or the state-action value func-

tion Qπ(s, a), respectively, to get better estimates, simi-

lar to the ﬁnite diﬀerence approach. In these settings,

the exploration is automatically taken care of by the

stochastic policy.
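
A minimal sketch of the resulting episodic REINFORCE estimator, assuming a linear-Gaussian policy a ∼ N(θ^T φ(s), σ²) and episodes supplied as lists of (features, action, reward) tuples (both assumptions of this illustration, not the only possible choices), is:

```python
import numpy as np

def reinforce_gradient(episodes, theta, sigma=0.1, baseline=0.0):
    """Likelihood-ratio (REINFORCE) gradient estimate for a linear-Gaussian
    policy a ~ N(theta^T phi(s), sigma^2).

    episodes : list of rollouts, each a list of (features, action, reward)
    Returns an estimate of grad_theta J_theta.
    """
    grad = np.zeros_like(theta)
    for episode in episodes:
        J_tau = sum(r for (_, _, r) in episode)          # return of the episode
        # sum_h grad_theta log pi_theta(a_h | s_h) for the Gaussian policy
        score = sum(((a - theta @ phi) / sigma**2) * phi
                    for (phi, a, _) in episode)
        grad += score * (J_tau - baseline)               # subtract a baseline
    return grad / len(episodes)

# Policy update: theta = theta + alpha * reinforce_gradient(episodes, theta)
```

Here exploration is provided by the stochastic policy itself, and the baseline plays the role of the reference return Jref discussed above.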

Initial gradient-based approaches such as ﬁnite dif-

ference gradients or REINFORCE (REward Incre-

ment = Nonnegative Factor times Oﬀset Reinforce-

ment times Characteristic Eligibility) (Williams, 1992)

have been rather slow. The weight perturbation

algorithm is related to REINFORCE but can deal

with non-Gaussian distributions which signiﬁcantly

improves the signal to noise ratio of the gradient

(Roberts et al., 2010). Recent natural policy gradient

approaches (Peters and Schaal, 2008c,b) have allowed

for faster convergence which may be advantageous for

robotics as it reduces the learning time and required

real-world interactions.

A diﬀerent class of safe and fast policy search meth-

ods, that are inspired by expectation-maximization,

can be derived when the reward is treated as an im-

proper probability distribution (Dayan and Hinton,

1997). Some of these approaches have proven success-

ful in robotics, e.g., reward-weighted regression (Peters

and Schaal, 2008a), Policy Learning by Weighting Ex-

ploration with the Returns (Kober and Peters, 2009),

Monte Carlo Expectation-Maximization (Vlassis et al.,

2009), and Cost-regularized Kernel Regression (Kober

et al., 2010). Algorithms with closely related update

rules can also be derived from diﬀerent perspectives

including Policy Improvements with Path Integrals

(Theodorou et al., 2010) and Relative Entropy Policy

Search (Peters et al., 2010a).

Finally, the Policy Search by Dynamic Programming

(Bagnell et al., 2003) method is a general strategy that

combines policy search with the principle of optimality.

The approach learns a non-stationary policy backward

in time like dynamic programming methods, but does

not attempt to enforce the Bellman equation and the

resulting approximation instabilities (See Section 2.4).

The resulting approach provides some of the strongest

guarantees that are currently known under function

approximation and limited observability. It has been

demonstrated in learning walking controllers and in

ﬁnding near-optimal trajectories for map exploration

(Kollar and Roy, 2008). The resulting method is more

expensive than the value function methods because it

scales quadratically in the eﬀective time horizon of the

problem. Like DDP methods (Atkeson, 1998), it is tied

to a non-stationary (time-varying) policy.

An overview of publications using policy search

methods is presented in Table 2.

One of the key open issues in the ﬁeld is determining

when it is appropriate to use each of these methods.

Some approaches leverage signiﬁcant structure speciﬁc

to the RL problem (e.g. (Theodorou et al., 2010)), in-

cluding reward structure, Markovanity, causality of re-

ward signals (Williams, 1992), and value-function esti-

mates when available (Peters and Schaal, 2008c). Oth-

ers embed policy search as a generic, black-box, prob-

lem of stochastic optimization (Bagnell and Schneider,

2001; Lizotte et al., 2007; Kuindersma et al., 2011;

Tesch et al., 2011). Signiﬁcant open questions remain

regarding which methods are best in which circum-

stances and further, at an even more basic level, how

eﬀective leveraging the kinds of problem structures

mentioned above is in practice.

2.3 Value Function Approaches versus

Policy Search

Some methods attempt to ﬁnd a value function or pol-

icy which eventually can be employed without signif-

icant further computation, whereas others (e.g., the

roll-out methods) perform the same amount of com-

putation each time.

If a complete optimal value function is known, a

globally optimal solution follows simply by greed-

ily choosing actions to optimize it. However, value-

function based approaches have thus far been diﬃcult

to translate into high dimensional robotics as they re-

quire function approximation for the value function.

Most theoretical guarantees no longer hold for this ap-

proximation and even ﬁnding the optimal action can

be a hard problem due to the brittleness of the ap-

proximation and the cost of optimization. For high

dimensional actions, it can be as hard to compute an

improved policy for all states in policy search as to

find a single optimal action on-policy for one state by

searching the state-action value function.

In principle, a value function requires total cover-

age of the state space and the largest local error de-

termines the quality of the resulting policy. A par-

ticularly signiﬁcant problem is the error propagation

in value functions. A small change in the policy may

cause a large change in the value function, which again

causes a large change in the policy. While this may

lead more quickly to good, possibly globally optimal

solutions, such learning processes often prove unsta-

ble under function approximation (Boyan and Moore,

1995; Kakade and Langford, 2002; Bagnell et al., 2003)

and are considerably more dangerous when applied to

real systems where overly large policy deviations may

lead to dangerous decisions.

In contrast, policy search methods usually only con-

sider the current policy and its neighborhood in or-

der to gradually improve performance. The result is

that usually only local optima, and not the global one,

can be found. However, these methods work well in

conjunction with continuous features. Local coverage

and local errors result in improved scalability in

robotics.

Policy search methods are sometimes called actor-

only methods; value function methods are sometimes


Policy Search

Gradient: Deisenroth and Rasmussen (2011); Deisenroth et al. (2011); Endo et al. (2008); Fidelman and Stone (2004); Geng et al. (2006); Guenter et al. (2007); Gullapalli et al. (1994); Hailu and Sommer (1998); Ko et al. (2007); Kohl and Stone (2004); Kolter and Ng (2009a); Michels et al. (2005); Mitsunaga et al. (2005); Miyamoto et al. (1996); Ng et al. (2004a,b); Peters and Schaal (2008c,b); Roberts et al. (2010); Rosenstein and Barto (2004); Tamei and Shibata (2009); Tedrake (2004); Tedrake et al. (2005)

Other: Abbeel et al. (2006, 2007); Atkeson and Schaal (1997); Atkeson (1998); Bagnell and Schneider (2001); Bagnell (2004); Buchli et al. (2011); Coates et al. (2009); Daniel et al. (2012); Donnart and Meyer (1996); Dorigo and Colombetti (1993); Erden and Leblebicioğlu (2008); Kalakrishnan et al. (2011); Kober and Peters (2009); Kober et al. (2010); Kolter et al. (2008); Kuindersma et al. (2011); Lizotte et al. (2007); Matarić (1994); Pastor et al. (2011); Peters and Schaal (2008a); Peters et al. (2010a); Schaal and Atkeson (1994); Stulp et al. (2011); Svinin et al. (2001); Tamošiūnaitė et al. (2011); Yasuda and Ohkura (2008); Youssef (2005)

Table 2: This table illustrates different policy search reinforcement learning methods employed for robotic tasks and associated publications.

called critic-only methods. The idea of a critic is to

ﬁrst observe and estimate the performance of choosing

controls on the system (i.e., the value function), then

derive a policy based on the gained knowledge. In

contrast, the actor directly tries to deduce the optimal

policy. A set of algorithms called actor-critic meth-

ods attempt to incorporate the advantages of each: a

policy is explicitly maintained, as is a value-function

for the current policy. The value function (i.e., the

critic) is not employed for action selection. Instead,

it observes the performance of the actor and decides

when the policy needs to be updated and which action

should be preferred. The resulting update step fea-

tures the local convergence properties of policy gradi-

ent algorithms while reducing update variance (Green-

smith et al., 2004). There is a trade-oﬀ between the

beneﬁt of reducing the variance of the updates and

having to learn a value function as the samples re-

quired to estimate the value function could also be

employed to obtain better gradient estimates for the

update step. Rosenstein and Barto (2004) propose an

actor-critic method that additionally features a super-

visor in the form of a stable policy.
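
To make the actor-critic interplay concrete, the following minimal sketch (in Python, on a hypothetical five-state chain; all constants and the toy dynamics are illustrative assumptions, not taken from the cited works) maintains a softmax policy as the actor and a TD(0) state-value estimate as the critic, and uses the critic's temporal-difference error to scale the actor's policy-gradient update.

import numpy as np

n_states, n_actions = 5, 2
theta = np.zeros((n_states, n_actions))   # actor: softmax policy parameters
v = np.zeros(n_states)                    # critic: state-value estimates
alpha_actor, alpha_critic, gamma = 0.05, 0.1, 0.95
rng = np.random.default_rng(0)

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def step(s, a):
    # Toy chain dynamics: action 1 moves right, action 0 moves left;
    # reward 1 only when the rightmost state is reached.
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

s = 0
for t in range(5000):
    p = policy(s)
    a = rng.choice(n_actions, p=p)
    s_next, r = step(s, a)
    done = s_next == n_states - 1
    td_error = r + (0.0 if done else gamma * v[s_next]) - v[s]  # critic evaluates the actor
    v[s] += alpha_critic * td_error                  # critic update (TD(0))
    grad_log = -p
    grad_log[a] += 1.0                               # gradient of the log softmax policy
    theta[s] += alpha_actor * td_error * grad_log    # actor update scaled by the TD error
    s = 0 if done else s_next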

2.4 Function Approximation

Function approximation (Rivlin, 1969) is a family of

mathematical and statistical techniques used to rep-

resent a function of interest when it is computation-

ally or information-theoretically intractable to repre-

sent the function exactly or explicitly (e.g. in tabular

form). Typically, in reinforcement learning the func-

tion approximation is based on sample data collected

during interaction with the environment. Function ap-

proximation is critical in nearly every RL problem, and

becomes inevitable in continuous state ones. In large

discrete spaces it is also often impractical to visit or

even represent all states and actions, and function ap-

proximation in this setting can be used as a means to

generalize to neighboring states and actions.

Function approximation can be employed to rep-

resent policies, value functions, and forward mod-

els. Broadly speaking, there are two kinds of func-

tion approximation methods: parametric and non-

parametric. A parametric function approximator uses

a finite set of parameters or arguments; the goal is to find parameters that make this approximation fit

the observed data as closely as possible. Examples in-

clude linear basis functions and neural networks. In

contrast, non-parametric methods expand representa-

tional power in relation to collected data and hence

are not limited by the representational power of a cho-

sen parametrization (Bishop, 2006). A prominent ex-

ample that has found much use within reinforcement

learning is Gaussian process regression (Rasmussen

and Williams, 2006). A fundamental problem with us-

ing supervised learning methods developed in the lit-

erature for function approximation is that most such

methods are designed for independently and identi-

cally distributed sample data. However, the data gen-

erated by the reinforcement learning process is usually

neither independent nor identically distributed. Usu-

ally, the function approximator itself plays some role

in the data collection process (for instance, by serving

to define a policy that we execute on a robot).

Linear basis function approximators form one of the

most widely used approximate value function tech-

niques in continuous (and discrete) state spaces. This

is largely due to the simplicity of their representa-

tion as well as a convergence theory, albeit limited, for

the approximation of value functions based on samples

(Tsitsiklis and Van Roy, 1997). Let us brieﬂy take a

closer look at a radial basis function network to illus-

trate this approach. The value function maps states to

a scalar value. The state space can be covered by a grid

of points, each of which corresponds to the center of a

Gaussian-shaped basis function. The value of the ap-

proximated function is the weighted sum of the values

of all basis functions at the query point. As the in-

ﬂuence of the Gaussian basis functions drops rapidly,

the value of the query points will be predominantly


inﬂuenced by the neighboring basis functions. The

weights are set in a way to minimize the error between

the observed samples and the reconstruction. For the

mean squared error, these weights can be determined

by linear regression. Kolter and Ng (2009b) discuss

the beneﬁts of regularization of such linear function

approximators to avoid over-ﬁtting.
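
As a hedged illustration of this construction, the following sketch fits such a radial basis function approximation of a one-dimensional value function by ridge-regularized linear regression; the target function, basis grid, and regularization constant are assumptions chosen only for the example.

import numpy as np

centers = np.linspace(0.0, 1.0, 10)       # grid of basis-function centers
width = 0.1                               # shared Gaussian width

def features(s):
    s = np.atleast_1d(s)
    return np.exp(-(s[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

# Sampled states and noisy observed returns standing in for V(s).
rng = np.random.default_rng(0)
states = rng.uniform(0.0, 1.0, size=200)
values = np.sin(2 * np.pi * states) + 0.1 * rng.standard_normal(200)

Phi = features(states)
ridge = 1e-3                              # regularization to avoid over-fitting
w = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(len(centers)), Phi.T @ values)

def v_hat(s):
    return features(s) @ w                # weighted sum of basis activations at the query

print(v_hat(np.array([0.25, 0.75])))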

Other possible function approximators for value

functions include wire fitting, which Baird and Klopf (1993) suggested as an approach that makes continuous action selection feasible. The Fourier basis has been suggested by Konidaris et al. (2011b). Even dis-

cretizing the state-space can be seen as a form of func-

tion approximation where coarse values serve as es-

timates for a smooth continuous function. One ex-

ample is tile coding (Sutton and Barto, 1998), where

the space is subdivided into (potentially irregularly shaped) regions called tiles; one such subdivision is a tiling. The number of differ-

ent tilings determines the resolution of the ﬁnal ap-

proximation. For more examples, please refer to Sec-

tions 4.1 and 4.2.
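
The following minimal sketch illustrates tile coding for a scalar state in [0, 1); the number of tilings, tiles per tiling, and learning rate are illustrative assumptions.

import numpy as np

n_tilings, tiles_per_tiling = 4, 10
offsets = np.linspace(0.0, 1.0 / tiles_per_tiling, n_tilings, endpoint=False)
weights = np.zeros((n_tilings, tiles_per_tiling + 1))

def active_tiles(s):
    # One active tile index per tiling, shifted by that tiling's offset.
    return np.floor((s + offsets) * tiles_per_tiling).astype(int)

def value(s):
    idx = active_tiles(s)
    return weights[np.arange(n_tilings), idx].sum()

def update(s, target, alpha=0.1):
    idx = active_tiles(s)
    error = target - value(s)
    weights[np.arange(n_tilings), idx] += alpha / n_tilings * error

update(0.42, target=1.0)
print(value(0.42), value(0.43), value(0.9))  # nearby states generalize, distant ones do not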

Policy search also beneﬁts from a compact represen-

tation of the policy as discussed in Section 4.3.

Models of the system dynamics can be represented

using a wide variety of techniques. In this case, it is

often important to model the uncertainty in the model

(e.g., by a stochastic model or Bayesian estimates of

model parameters) to ensure that the learning algo-

rithm does not exploit model inaccuracies. See Sec-

tion 6 for a more detailed discussion.

3 Challenges in Robot

Reinforcement Learning

Reinforcement learning is generally a hard problem

and many of its challenges are particularly apparent

in the robotics setting. As the states and actions of

most robots are inherently continuous, we are forced to

consider the resolution at which they are represented.

We must decide how ﬁne grained the control is that we

require over the robot, whether we employ discretiza-

tion or function approximation, and what time step we

establish. Additionally, as the dimensionality of both

states and actions can be high, we face the “Curse of

Dimensionality” (Bellman, 1957) as discussed in Sec-

tion 3.1. As robotics deals with complex physical sys-

tems, samples can be expensive due to the long ex-

ecution time of complete tasks, required manual in-

terventions, and the need for maintenance and repair. In

these real-world measurements, we must cope with the

uncertainty inherent in complex physical systems. A

robot requires that the algorithm runs in real-time.

The algorithm must be capable of dealing with delays

in sensing and execution that are inherent in physi-

cal systems (see Section 3.2). A simulation might al-

leviate many problems but these approaches need to

be robust with respect to model errors as discussed

in Section 3.3. An often underestimated problem is

the goal speciﬁcation, which is achieved by designing

a good reward function. As noted in Section 3.4, this

choice can make the diﬀerence between feasibility and

Figure 3: This Figure illustrates the state space used in

the modeling of a robot reinforcement learning task of pad-

dling a ball.

an unreasonable amount of exploration.

3.1 Curse of Dimensionality

When Bellman (1957) explored optimal control in dis-

crete high-dimensional spaces, he faced an exponential

explosion of states and actions for which he coined the

term “Curse of Dimensionality”. As the number of di-

mensions grows, exponentially more data and compu-

tation are needed to cover the complete state-action

space. For example, if we assume that each dimension

of a state-space is discretized into ten levels, we have

10 states for a one-dimensional state-space, 10^3 = 1000 unique states for a three-dimensional state-space, and 10^n possible states for an n-dimensional state-space.

Evaluating every state quickly becomes infeasible with

growing dimensionality, even for discrete states. Bell-

man originally coined the term in the context of opti-

mization, but it also applies to function approximation

and numerical integration (Donoho, 2000). While su-

pervised learning methods have tamed this exponen-

tial growth by considering only competitive optimality

with respect to a limited class of function approxima-

tors, such results are much more diﬃcult in reinforce-

ment learning where data must be collected throughout the state-space to ensure global optimality.
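
The exponential growth in the example above can be made concrete in a few lines (the choice of ten levels per dimension is the assumption from the text):

# Number of grid cells with ten discretization levels per state dimension.
for n_dims in (1, 3, 7, 20):
    print(n_dims, "dimensions:", 10 ** n_dims, "cells")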

Robotic systems often have to deal with these high

dimensional states and actions due to the many de-

grees of freedom of modern anthropomorphic robots.

For example, in the ball-paddling task shown in Fig-

ure 3, a proper representation of a robot’s state would

consist of its joint angles and velocities for each of its

seven degrees of freedom as well as the Cartesian po-

sition and velocity of the ball. The robot’s actions

would be the generated motor commands, which often

are torques or accelerations. In this example, we have

2×(7 + 3) = 20 state dimensions and 7-dimensional

continuous actions. Obviously, other tasks may re-

quire even more dimensions. For example, human-

like actuation often follows the antagonistic principle

(Yamaguchi and Takanishi, 1997) which additionally

enables control of stiﬀness. Such dimensionality is a


major challenge for both the robotics and the rein-

forcement learning communities.

In robotics, such tasks are often rendered tractable

to the robot engineer by a hierarchical task decom-

position that shifts some complexity to a lower layer

of functionality. Classical reinforcement learning ap-

proaches often consider a grid-based representation

with discrete states and actions, often referred to as

a grid-world. A navigational task for mobile robots

could be projected into this representation by employ-

ing a number of actions like “move to the cell to the

left” that use a lower level controller that takes care

of accelerating, moving, and stopping while ensuring

precision. In the ball-paddling example, we may sim-

plify by controlling the robot in racket space (which is

lower-dimensional as the racket is orientation-invariant

around the string’s mounting point) with an opera-

tional space control law (Nakanishi et al., 2008). Many

commercial robot systems also encapsulate some of the

state and action components in an embedded control

system (e.g., trajectory fragments are frequently used

as actions for industrial robots). However, this form

of a state dimensionality reduction severely limits the

dynamic capabilities of the robot according to our ex-

perience (Schaal et al., 2002; Peters et al., 2010b).

The reinforcement learning community has a long

history of dealing with dimensionality using computa-

tional abstractions. It oﬀers a larger set of applicable

tools ranging from adaptive discretizations (Buşoniu

et al., 2010) and function approximation approaches

(Sutton and Barto, 1998) to macro-actions or op-

tions (Barto and Mahadevan, 2003; Hart and Grupen,

2011). Options allow a task to be decomposed into

elementary components and quite naturally translate

to robotics. Such options can autonomously achieve a

sub-task, such as opening a door, which reduces the

planning horizon (Barto and Mahadevan, 2003). The

automatic generation of such sets of options is a key

issue in order to enable such approaches. We will dis-

cuss approaches that have been successful in robot re-

inforcement learning in Section 4.

3.2 Curse of Real-World Samples

Robots inherently interact with the physical world.

Hence, robot reinforcement learning suﬀers from most

of the resulting real-world problems. For example,

robot hardware is usually expensive, suﬀers from wear

and tear, and requires careful maintenance. Repair-

ing a robot system is a non-negligible eﬀort associ-

ated with cost, physical labor and long waiting peri-

ods. To apply reinforcement learning in robotics, safe

exploration becomes a key issue of the learning process

(Schneider, 1996; Bagnell, 2004; Deisenroth and Ras-

mussen, 2011; Moldovan and Abbeel, 2012), a problem

often neglected in the general reinforcement learning

community. Perkins and Barto (2002) have come up

with a method for constructing reinforcement learn-

ing agents based on Lyapunov functions. Switching

between the underlying controllers is always safe and

oﬀers basic performance guarantees.

However, several more aspects of the real-world

make robotics a challenging domain. As the dynamics

of a robot can change due to many external factors

ranging from temperature to wear, the learning pro-

cess may never fully converge, i.e., it needs a “tracking

solution” (Sutton et al., 2007). Frequently, the en-

vironment settings during an earlier learning period

cannot be reproduced. External factors are not al-

ways clear – for example, how light conditions aﬀect

the performance of the vision system and, as a result,

the task’s performance. This problem makes compar-

ing algorithms particularly hard. Furthermore, the ap-

proaches often have to deal with uncertainty due to in-

herent measurement noise and the inability to observe

all states directly with sensors.

Most real robot learning tasks require some form

of human supervision, e.g., putting the pole back on

the robot’s end-eﬀector during pole balancing (see Fig-

ure 1d) after a failure. Even when an automatic reset

exists (e.g., by having a smart mechanism that resets

the pole), learning speed becomes essential as a task

on a real robot cannot be sped up. In some tasks like

a slowly rolling robot, the dynamics can be ignored;

in others like a ﬂying robot, they cannot. Especially

in the latter case, often the whole episode needs to be

completed as it is not possible to start from arbitrary

states.

For such reasons, real-world samples are expensive

in terms of time, labor and, potentially, ﬁnances. In

robotic reinforcement learning, it is often considered

to be more important to limit the real-world interac-

tion time instead of limiting memory consumption or

computational complexity. Thus, sample eﬃcient al-

gorithms that are able to learn from a small number

of trials are essential. In Section 6 we will point out

several approaches that allow the amount of required

real-world interactions to be reduced.

Since the robot is a physical system, there are strict

constraints on the interaction between the learning al-

gorithm and the robot setup. For dynamic tasks, the

movement cannot be paused and actions must be se-

lected within a time-budget without the opportunity

to pause to think, learn or plan between actions. These

constraints are less severe in an episodic setting where

the time intensive part of the learning can be post-

poned to the period between episodes. Hester et al.

(2012) have proposed a real-time architecture for model-

based value function reinforcement learning methods

taking into account these challenges.

As reinforcement learning algorithms are inherently

implemented on a digital computer, the discretiza-

tion of time is unavoidable even though physical systems are inherently continuous-time systems. Time-

discretization of the actuation can generate undesir-

able artifacts (e.g., the distortion of distance between

states) even for idealized physical systems, which can-

not be avoided. As most robots are controlled at ﬁxed

sampling frequencies (in the range between 500Hz and

3kHz) determined by the manufacturer of the robot,

the upper bound on the rate of temporal discretization

is usually pre-determined. The lower bound depends

on the horizon of the problem, the achievable speed of

changes in the state, as well as delays in sensing and


actuation.

All physical systems exhibit such delays in sensing

and actuation. The state of the setup (represented by

the ﬁltered sensor signals) may frequently lag behind

the real state due to processing and communication de-

lays. More critically, there are also communication de-

lays in actuation as well as delays due to the fact that

neither motors, gear boxes nor the body’s movement

can change instantly. Due to these delays, actions may

not have instantaneous eﬀects but are observable only

several time steps later. In contrast, in most general

reinforcement learning algorithms, the actions are as-

sumed to take eﬀect instantaneously as such delays

would violate the usual Markov assumption. This ef-

fect can be addressed by putting some number of re-

cent actions into the state. However, this signiﬁcantly

increases the dimensionality of the problem.
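
One hedged way to realize this state augmentation is sketched below: the observation is concatenated with a short history of the most recent commands, so that delayed actuation effects remain predictable from the augmented state. The class name, dimensions, and delay length are illustrative assumptions.

from collections import deque
import numpy as np

class DelayAugmentedState:
    def __init__(self, obs_dim, act_dim, delay_steps):
        # Keep the last few commands; they become part of the state.
        self.history = deque([np.zeros(act_dim)] * delay_steps, maxlen=delay_steps)

    def reset(self, obs):
        for i in range(len(self.history)):
            self.history[i] = np.zeros_like(self.history[i])
        return self.augment(obs)

    def augment(self, obs):
        # State = current (possibly delayed) observation plus recent commands.
        return np.concatenate([obs] + list(self.history))

    def record_action(self, action):
        self.history.append(np.asarray(action))

aug = DelayAugmentedState(obs_dim=3, act_dim=2, delay_steps=2)
state = aug.reset(np.zeros(3))                 # 3 + 2*2 = 7-dimensional state
aug.record_action([0.1, -0.2])
state = aug.augment(np.array([0.5, 0.0, 0.1]))
print(state.shape)                             # (7,)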

The problems related to time-budgets and delays

can also be avoided by increasing the duration of the

time steps. One downside of this approach is that the

robot cannot be controlled as precisely; another is that

it may complicate a description of system dynamics.

3.3 Curse of Under-Modeling and Model

Uncertainty

One way to oﬀset the cost of real-world interaction is to

use accurate models as simulators. In an ideal setting,

this approach would render it possible to learn the be-

havior in simulation and subsequently transfer it to the

real robot. Unfortunately, creating a suﬃciently accu-

rate model of the robot and its environment is chal-

lenging and often requires a large number of data samples. As

small model errors due to this under-modeling accu-

mulate, the simulated robot can quickly diverge from

the real-world system. When a policy is trained using

an imprecise forward model as simulator, the behav-

ior will not transfer without signiﬁcant modiﬁcations

as experienced by Atkeson (1994) when learning the

underactuated pendulum swing-up. The authors have

achieved a direct transfer in only a limited number of

experiments; see Section 6.1 for examples.

For tasks where the system is self-stabilizing (that

is, where the robot does not require active control

to remain in a safe state or return to it), transfer-

ring policies often works well. Such tasks often fea-

ture some type of dampening that absorbs the energy

introduced by perturbations or control inaccuracies.

If the task is inherently stable, it is safer to assume

that approaches that were applied in simulation work

similarly in the real world (Kober and Peters, 2010).

Nevertheless, tasks can often be learned better in the

real world than in simulation due to complex mechan-

ical interactions (including contacts and friction) that

have proven diﬃcult to model accurately. For exam-

ple, in the ball-paddling task (Figure 3) the elastic

string that attaches the ball to the racket always pulls

back the ball towards the racket even when hit very

hard. Initial simulations (including friction models,

restitution models, dampening models, models for the

elastic string, and air drag) of the ball-racket contacts

indicated that these factors would be very hard to con-

trol. In a real experiment, however, the reﬂections of

the ball on the racket proved to be less critical than in

simulation and the stabilizing forces due to the elas-

tic string were suﬃcient to render the whole system

self-stabilizing.

In contrast, in unstable tasks small variations have

drastic consequences. For example, in a pole balanc-

ing task, the equilibrium of the upright pole is very

brittle and constant control is required to stabilize the

system. Transferred policies often perform poorly in

this setting. Nevertheless, approximate models serve

a number of key roles which we discuss in Section 6,

including verifying and testing the algorithms in simu-

lation, establishing proximity to theoretically optimal

solutions, calculating approximate gradients for local

policy improvement, identifying strategies for collecting

more data, and performing “mental rehearsal”.

3.4 Curse of Goal Speciﬁcation

In reinforcement learning, the desired behavior is im-

plicitly speciﬁed by the reward function. The goal of

reinforcement learning algorithms then is to maximize

the accumulated long-term reward. While often dra-

matically simpler than specifying the behavior itself,

in practice, it can be surprisingly diﬃcult to deﬁne a

good reward function in robot reinforcement learning.

The learner must observe variance in the reward signal

in order to be able to improve a policy: if the same

return is always received, there is no way to determine

which policy is better or closer to the optimum.

In many domains, it seems natural to provide re-

wards only upon task achievement – for example, when

a table tennis robot wins a match. This view results

in an apparently simple, binary reward speciﬁcation.

However, a robot may receive such a reward so rarely

that it is unlikely to ever succeed in the lifetime of a

real-world system. Instead of relying on simpler bi-

nary rewards, we frequently need to include interme-

diate rewards in the scalar reward function to guide

the learning process to a reasonable solution, a pro-

cess known as reward shaping (Laud, 2004).

Beyond the need to shorten the eﬀective problem

horizon by providing intermediate rewards, the trade-

oﬀ between diﬀerent factors may be essential. For in-

stance, hitting a table tennis ball very hard may re-

sult in a high score but is likely to damage a robot or

shorten its life span. Similarly, changes in actions may

be penalized to avoid high frequency controls that are

likely to be very poorly captured with tractable low

dimensional state-space or rigid-body models. Rein-

forcement learning algorithms are also notorious for

exploiting the reward function in ways that are not

anticipated by the designer. For example, if the dis-

tance between the ball and the desired highest point

is part of the reward in ball paddling (see Figure 3),

many locally optimal solutions would attempt to sim-

ply move the racket upwards and keep the ball on it.

Reward shaping gives the system a notion of closeness

to the desired behavior instead of relying on a reward

that only encodes success or failure (Ng et al., 1999).
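
As a hedged illustration of these trade-offs, the sketch below contrasts a sparse success reward with a shaped reward that combines a distance-to-goal term with penalties on control effort and on high-frequency changes in the action; the feature names and weights are assumptions for the example, not the cost functions used in the cited systems.

import numpy as np

def sparse_reward(ball_height, target_height, tol=0.05):
    # Success only when the ball reaches the desired apex: rarely observed.
    return 1.0 if abs(ball_height - target_height) < tol else 0.0

def shaped_reward(ball_height, target_height, action, prev_action):
    closeness = -abs(ball_height - target_height)        # distance-to-goal term
    effort = -0.01 * float(np.dot(action, action))        # penalize large torques
    smoothness = -0.1 * float(np.sum((action - prev_action) ** 2))  # penalize jerky control
    return closeness + effort + smoothness

a, a_prev = np.array([0.2, -0.1]), np.array([0.0, 0.0])
print(sparse_reward(0.8, 1.0), shaped_reward(0.8, 1.0, a, a_prev))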

Often the desired behavior can be most naturally


represented with a reward function in a particular

state and action space. However, this representation

does not necessarily correspond to the space where

the actual learning needs to be performed due to both

computational and statistical limitations. Employing

methods to render the learning problem tractable of-

ten results in different, more abstract state and action

spaces which might not allow accurate representation

of the original reward function. In such cases, a reward artfully specified in terms of the features of the space

in which the learning algorithm operates can prove re-

markably eﬀective. There is also a trade-oﬀ between

the complexity of the reward function and the com-

plexity of the learning problem. For example, in the

ball-in-a-cup task (Section 7) the most natural reward

would be a binary value depending on whether the ball

is in the cup or not. To render the learning problem

tractable, a less intuitive reward needed to be devised

in terms of a Cartesian distance with additional direc-

tional information (see Section 7.1 for details). An-

other example is Crusher (Ratliﬀ et al., 2006a), an

outdoor robot, where the human designer was inter-

ested in a combination of minimizing time and risk to

the robot. However, the robot reasons about the world

on the long time horizon scale as if it was a very sim-

ple, deterministic, holonomic robot operating on a ﬁne

grid of continuous costs. Hence, the desired behavior

cannot be represented straightforwardly in this state-

space. Nevertheless, a remarkably human-like behav-

ior that seems to respect time and risk priorities can

be achieved by carefully mapping features describing

each state (discrete grid location with features com-

puted by an on-board perception system) to cost.

Inverse optimal control, also known as inverse re-

inforcement learning (Russell, 1998), is a promising

alternative to specifying the reward function manu-

ally. It assumes that a reward function can be recon-

structed from a set of expert demonstrations. This

reward function does not necessarily correspond to

the true reward function, but provides guarantees on

the resulting performance of learned behaviors (Abbeel

and Ng, 2004; Ratliﬀ et al., 2006b). Inverse optimal

control was initially studied in the control community

(Kalman, 1964) and in the ﬁeld of economics (Keeney

and Raiﬀa, 1976). The initial results were only ap-

plicable to limited domains (linear quadratic regulator

problems) and required closed form access to plant and

controller, hence samples from human demonstrations

could not be used. Russell (1998) brought the ﬁeld

to the attention of the machine learning community.

Abbeel and Ng (2004) deﬁned an important constraint

on the solution to the inverse RL problem when reward

functions are linear in a set of features: a policy that is

extracted by observing demonstrations has to earn the

same reward as the policy that is being demonstrated.

Ratliﬀ et al. (2006b) demonstrated that inverse op-

timal control can be understood as a generalization

of ideas in machine learning of structured prediction

and introduced eﬃcient sub-gradient based algorithms

with regret bounds that enabled large scale application

of the technique within robotics. Ziebart et al. (2008)

extended the technique developed by Abbeel and Ng

(2004) by rendering the idea robust and probabilis-

tic, enabling its eﬀective use for both learning poli-

cies and predicting the behavior of sub-optimal agents.

These techniques, and many variants, have been re-

cently successfully applied to outdoor robot navigation

(Ratliﬀ et al., 2006a; Silver et al., 2008, 2010), manipu-

lation (Ratliﬀ et al., 2007), and quadruped locomotion

(Ratliﬀ et al., 2006a, 2007; Kolter et al., 2007).

More recently, the notion that complex policies can

be built on top of simple, easily solved optimal con-

trol problems by exploiting rich, parametrized re-

ward functions has been exploited within reinforce-

ment learning more directly. In (Sorg et al., 2010;

Zucker and Bagnell, 2012), complex policies are de-

rived by adapting a reward function for simple opti-

mal control problems using policy search techniques.

Zucker and Bagnell (2012) demonstrate that this tech-

nique can enable eﬃcient solutions to robotic marble-

maze problems that eﬀectively transfer between mazes

of varying design and complexity. These works high-

light the natural trade-oﬀ between the complexity of

the reward function and the complexity of the under-

lying reinforcement learning problem for achieving a

desired behavior.

4 Tractability Through

Representation

As discussed above, reinforcement learning provides

a framework for a remarkable variety of problems of

signiﬁcance to both robotics and machine learning.

However, the computational and information-theoretic

consequences that we outlined above accompany this

power and generality. As a result, naive application of

reinforcement learning techniques in robotics is likely

to be doomed to failure. The remarkable successes

that we reference in this article have been achieved

by leveraging a few key principles – eﬀective repre-

sentations, approximate models, and prior knowledge

or information. In the following three sections, we

review these principles and summarize how each has

been made eﬀective in practice. We hope that under-

standing these broad approaches will lead to new suc-

cesses in robotic reinforcement learning by combining

successful methods and encourage research on novel

techniques that embody each of these principles.

Much of the success of reinforcement learning meth-

ods has been due to the clever use of approximate

representations. The need for such approximations

is particularly pronounced in robotics, where table-

based representations (as discussed in Section 2.2.1)

are rarely scalable. The diﬀerent ways of making rein-

forcement learning methods tractable in robotics are

tightly coupled to the underlying optimization frame-

work. Reducing the dimensionality of states or ac-

tions by smart state-action discretization is a repre-

sentational simpliﬁcation that may enhance both pol-

icy search and value function-based methods (see Sec-

tion 4.1). A value function-based approach requires an

accurate and robust but general function approxima-

tor that can capture the value function with suﬃcient


precision (see Section 4.2) while maintaining stabil-

ity during learning. Policy search methods require a

choice of policy representation that controls the com-

plexity of representable policies to enhance learning

speed (see Section 4.3). An overview of publications

that make particular use of eﬃcient representations to

render the learning problem tractable is presented in

Table 3.

4.1 Smart State-Action Discretization

Decreasing the dimensionality of state or action spaces

eases most reinforcement learning problems signiﬁ-

cantly, particularly in the context of robotics. Here, we

give a short overview of diﬀerent attempts to achieve

this goal with smart discretization.

Hand Crafted Discretization. A variety of authors

have manually developed discretizations so that ba-

sic tasks can be learned on real robots. For low-

dimensional tasks, we can generate discretizations

straightforwardly by splitting each dimension into a

number of regions. The main challenge is to ﬁnd the

right number of regions for each dimension that allows

the system to achieve a good ﬁnal performance while

still learning quickly. Example applications include

balancing a ball on a beam (Benbrahim et al., 1992),

one degree of freedom ball-in-a-cup (Nemec et al.,

2010), two degree of freedom crawling motions (Tokic

et al., 2009), and gait patterns for four legged walking

(Kimura et al., 2001). Much more human experience

is needed for more complex tasks. For example, in a

basic navigation task with noisy sensors (Willgoss and

Iqbal, 1999), only some combinations of binary state

or action indicators are useful (e.g., you can drive left

and forward at the same time, but not backward and

forward). The state space can also be based on vastly

diﬀerent features, such as positions, shapes, and colors,

when learning object aﬀordances (Paletta et al., 2007)

where both the discrete sets and the mapping from

sensor values to the discrete values need to be crafted.

Kwok and Fox (2004) use a mixed discrete and contin-

uous representation of the state space to learn active

sensing strategies in a RoboCup scenario. They ﬁrst

discretize the state space along the dimension with

the strongest non-linear inﬂuence on the value func-

tion and subsequently employ a linear value function

approximation (Section 4.2) for each of the regions.
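
A minimal sketch of such a hand-crafted discretization is given below: each continuous state dimension is split into a fixed number of regions and the joint bin index serves as the discrete state; the bounds and bin counts are illustrative assumptions.

import numpy as np

lower = np.array([-1.0, -2.0])     # per-dimension lower bounds
upper = np.array([1.0, 2.0])       # per-dimension upper bounds
bins = np.array([10, 8])           # regions per dimension

def discretize(state):
    ratios = (np.clip(state, lower, upper) - lower) / (upper - lower)
    idx = np.minimum((ratios * bins).astype(int), bins - 1)   # per-dimension bin index
    return int(np.ravel_multi_index(idx, bins))               # single discrete state id

print(discretize(np.array([0.0, 0.0])), "out of", bins.prod(), "states in total")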

Learned from Data. Instead of specifying the dis-

cretizations by hand, they can also be built adap-

tively during the learning process. For example, a

rule based reinforcement learning approach automati-

cally segmented the state space to learn a cooperative

task with mobile robots (Yasuda and Ohkura, 2008).

Each rule is responsible for a local region of the state-

space. The importance of the rules are updated based

on the rewards and irrelevant rules are discarded. If

the state is not covered by a rule yet, a new one is

added. In the related ﬁeld of computer vision, Pi-

ater et al. (2011) propose an approach that adaptively

and incrementally discretizes a perceptual space into

discrete states, training an image classiﬁer based on

the experience of the RL agent to distinguish visual

classes, which correspond to the states.

Meta-Actions. Automatic construction of meta-

actions (and the closely related concept of options)

has fascinated reinforcement learning researchers and

there are various examples in the literature. The idea

is to have more intelligent actions that are composed

of a sequence of movements and that in themselves

achieve a simple task. A simple example would be to

have a meta-action “move forward 5m.” A lower level

system takes care of accelerating, stopping, and cor-

recting errors. For example, in (Asada et al., 1996),

the state and action sets are constructed in a way that

repeated action primitives lead to a change in the state

to overcome problems associated with the discretiza-

tion. Q-learning and dynamic programming based ap-

proaches have been compared in a pick-n-place task

(Kalmár et al., 1998) using modules. Huber and Gru-

pen (1997) use a set of controllers with associated

predicate states as a basis for learning turning gaits

with a quadruped. Fidelman and Stone (2004) use a

policy search approach to learn a small set of parame-

ters that controls the transition between a walking and

a capturing meta-action in a RoboCup scenario. A

task of transporting a ball with a dog robot (Soni and

Singh, 2006) can be learned with semi-automatically

discovered options. Using only the sub-goals of prim-

itive motions, a humanoid robot can learn a pour-

ing task (Nemec et al., 2009). Other examples in-

clude foraging (Matarić, 1997) and cooperative tasks

(Matarić, 1994) with multiple robots, grasping with

restricted search spaces (Platt et al., 2006), and mo-

bile robot navigation (Dorigo and Colombetti, 1993).

If the meta-actions are not ﬁxed in advance, but rather

learned at the same time, these approaches are hierar-

chical reinforcement learning approaches as discussed

in Section 5.2. Konidaris et al. (2011a, 2012) propose

an approach that constructs a skill tree from human

demonstrations. Here, the skills correspond to options

and are chained to learn a mobile manipulation skill.

Relational Representations. In a relational repre-

sentation, the states, actions, and transitions are not

represented individually. Entities of the same prede-

ﬁned type are grouped and their relationships are con-

sidered. This representation may be preferable for

highly geometric tasks (which frequently appear in

robotics) and has been employed to learn to navigate

buildings with a real robot in a supervised setting (Co-

cora et al., 2006) and to manipulate articulated objects

in simulation (Katz et al., 2008).

4.2 Value Function Approximation

Function approximation has always been the key com-

ponent that allowed value function methods to scale

into interesting domains. In robot reinforcement learn-

ing, the following function approximation schemes

have been popular and successful. Using function approximation for the value function can be combined with using function approximation for learning a model of the system (as discussed in Section 6) in the case of model-based reinforcement learning approaches.

Smart State-Action Discretization
Approach: Employed by...
Hand crafted: Benbrahim et al. (1992); Kimura et al. (2001); Kwok and Fox (2004); Nemec et al. (2010); Paletta et al. (2007); Tokic et al. (2009); Willgoss and Iqbal (1999)
Learned: Piater et al. (2011); Yasuda and Ohkura (2008)
Meta-actions: Asada et al. (1996); Dorigo and Colombetti (1993); Fidelman and Stone (2004); Huber and Grupen (1997); Kalmár et al. (1998); Konidaris et al. (2011a, 2012); Matarić (1994, 1997); Platt et al. (2006); Soni and Singh (2006); Nemec et al. (2009)
Relational Representation: Cocora et al. (2006); Katz et al. (2008)

Value Function Approximation
Approach: Employed by...
Physics-inspired Features: An et al. (1988); Schaal (1996)
Neural Networks: Benbrahim and Franklin (1997); Duan et al. (2008); Gaskett et al. (2000); Hafner and Riedmiller (2003); Riedmiller et al. (2009); Thrun (1995)
Neighbors: Hester et al. (2010); Mahadevan and Connell (1992); Touzet (1997)
Local Models: Bentivegna (2004); Schaal (1996); Smart and Kaelbling (1998)
GPR: Gräve et al. (2010); Kroemer et al. (2009, 2010); Rottmann et al. (2007)

Pre-structured Policies
Approach: Employed by...
Via Points & Splines: Kuindersma et al. (2011); Miyamoto et al. (1996); Roberts et al. (2010)
Linear Models: Tamei and Shibata (2009)
Motor Primitives: Kohl and Stone (2004); Kober and Peters (2009); Peters and Schaal (2008c,b); Stulp et al. (2011); Tamošiūnaitė et al. (2011); Theodorou et al. (2010)
GMM & LLM: Deisenroth and Rasmussen (2011); Deisenroth et al. (2011); Guenter et al. (2007); Lin and Lai (2012); Peters and Schaal (2008a)
Neural Networks: Endo et al. (2008); Geng et al. (2006); Gullapalli et al. (1994); Hailu and Sommer (1998); Bagnell and Schneider (2001)
Controllers: Bagnell and Schneider (2001); Kolter and Ng (2009a); Tedrake (2004); Tedrake et al. (2005); Vlassis et al. (2009); Zucker and Bagnell (2012)
Non-parametric: Kober et al. (2010); Mitsunaga et al. (2005); Peters et al. (2010a)

Table 3: This table illustrates different methods of making robot reinforcement learning tractable by employing a suitable representation.

Unfortunately the max-operator used within the

Bellman equation and temporal-diﬀerence updates can

theoretically make most linear or non-linear approxi-

mation schemes unstable for either value iteration or

policy iteration. Quite frequently such an unstable

behavior is also exhibited in practice. Linear func-

tion approximators are stable for policy evaluation,

while non-linear function approximation (e.g., neural

networks) can even diverge if just used for policy eval-

uation (Tsitsiklis and Van Roy, 1997).

Physics-inspired Features. If good hand-crafted fea-

tures are known, value function approximation can be

accomplished using a linear combination of features.

However, good features are well known in robotics only

for a few problems, such as features for local stabiliza-

tion (Schaal, 1996) and features describing rigid body

dynamics (An et al., 1988). Stabilizing a system at

an unstable equilibrium point is the most well-known

example, where a second order Taylor expansion of

the state together with a linear value function approx-

imator often suﬃce as features in the proximity of the

equilibrium point. For example, Schaal (1996) showed

that such features suﬃce for learning how to stabilize a

pole on the end-eﬀector of a robot when within ±15−30

degrees of the equilibrium angle. For suﬃcient fea-

tures, linear function approximation is likely to yield

good results in an on-policy setting. Nevertheless, it is

straightforward to show that impoverished value func-

tion representations (e.g., omitting the cross-terms in

quadratic expansion in Schaal’s set-up) will make it

impossible for the robot to learn this behavior. Sim-

ilarly, it is well known that linear value function ap-

proximation is unstable in the oﬀ-policy case (Tsitsiklis

and Van Roy, 1997; Gordon, 1999; Sutton and Barto,

1998).
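
As a hedged illustration, the sketch below builds such physics-inspired features from a second-order expansion of the state around the equilibrium, explicitly keeping the cross-terms whose omission is noted above as making the behavior unlearnable; the state dimensions are an assumption for the example.

import numpy as np

def quadratic_features(x):
    # x is the state expressed relative to the equilibrium point.
    x = np.asarray(x, dtype=float)
    quad = np.outer(x, x)
    iu = np.triu_indices(len(x))            # keep each cross-term x_i * x_j once
    return np.concatenate(([1.0], x, quad[iu]))

x = np.array([0.1, -0.05, 0.02, 0.0])       # e.g. pole angle/velocity, cart position/velocity
phi = quadratic_features(x)
w = np.zeros_like(phi)                      # weights fit by regression or temporal-difference learning
v_hat = float(w @ phi)                      # linear value function on quadratic features
print(phi.shape)                            # (1 + 4 + 10,) = (15,)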

Neural Networks. As good hand-crafted features are

rarely available, various groups have employed neural

networks as global, non-linear value function approxi-

mation. Many different flavors of neural networks have been applied in robotic reinforcement learning.

Figure 4: The Brainstormer Tribots won the RoboCup 2006 MidSize League (Riedmiller et al., 2009) (Picture reprinted with permission of Martin Riedmiller).

For

example, multi-layer perceptrons were used to learn

a wandering behavior and visual servoing (Gaskett

et al., 2000). Fuzzy neural networks (Duan et al., 2008)

and explanation-based neural networks (Thrun, 1995)

have allowed robots to learn basic navigation. CMAC

neural networks have been used for biped locomotion

(Benbrahim and Franklin, 1997).

The Brainstormers RoboCup soccer team is a par-

ticularly impressive application of value function ap-

proximation (see Figure 4). It used multi-layer per-

ceptrons to learn various sub-tasks such as learning

defenses, interception, position control, kicking, mo-

tor speed control, dribbling and penalty shots (Hafner

and Riedmiller, 2003; Riedmiller et al., 2009). The re-

sulting components contributed substantially to win-

ning the world cup several times in the simulation and

the mid-size real robot leagues. As neural networks

are global function approximators, overestimating the

value function at a frequently occurring state will in-

crease the values predicted by the neural network for

all other states, causing fast divergence (Boyan and

Moore, 1995; Gordon, 1999). Riedmiller et al. (2009)

solved this problem by always deﬁning an absorbing

state where they set the value predicted by their neu-

ral network to zero, which “clamps the neural network

down” and thereby prevents divergence. It also allows

re-iterating on the data, which results in an improved

value function quality. The combination of iteration

on data with the clamping technique appears to be the

key to achieving good performance with value function

approximation in practice.

Generalize to Neighboring Cells. As neural net-

works are globally affected by local errors, much

work has focused on simply generalizing from neigh-

boring cells. One of the earliest papers in robot re-

inforcement learning (Mahadevan and Connell, 1992)

introduced this idea by statistically clustering states to

speed up a box-pushing task with a mobile robot, see

Figure 1a. This approach was also used for a naviga-

tion and obstacle avoidance task with a mobile robot

(Touzet, 1997). Similarly, decision trees have been

used to generalize states and actions to unseen ones,

e.g., to learn a penalty kick on a humanoid robot (Hes-

ter et al., 2010). The core problem of these methods

is the lack of scalability to high-dimensional state and

action spaces.

Local Models. Local models can be seen as an ex-

tension of generalization among neighboring cells to

generalizing among neighboring data points. Locally

weighted regression creates particularly eﬃcient func-

tion approximation in the context of robotics both in

supervised and reinforcement learning. Here, regres-

sion errors are down-weighted with distance from the query point to train local models. The predictions of these

local models are combined using the same weighting

functions. Using local models for value function ap-

proximation has allowed learning a navigation task

with obstacle avoidance (Smart and Kaelbling, 1998),

a pole swing-up task (Schaal, 1996), and an air hockey

task (Bentivegna, 2004).

Gaussian Process Regression. Parametrized global

or local models need to be specified in advance, which requires a

trade-oﬀ between representational accuracy and the

number of parameters. A non-parametric function ap-

proximator like Gaussian Process Regression (GPR)

could be employed instead, but potentially at the cost

of a higher computational complexity. GPR has the

added advantage of providing a notion of uncertainty

about the approximation quality for a query point.

Hovering with an autonomous blimp (Rottmann et al.,

2007) has been achieved by approximating the state-

action value function with a GPR. Similarly, another

paper shows that grasping can be learned using Gaus-

sian process regression (Gräve et al., 2010) by addi-

tionally taking into account the uncertainty to guide

the exploration. Grasping locations can be learned

by approximating the rewards with a GPR, and try-

ing candidates with predicted high rewards (Kroemer

et al., 2009), resulting in an active learning approach.

High reward uncertainty allows intelligent exploration

in reward-based grasping (Kroemer et al., 2010) in a

bandit setting.

4.3 Pre-structured Policies

Policy search methods greatly beneﬁt from employ-

ing an appropriate function approximation of the pol-

icy. For example, when employing gradient-based ap-

proaches, the trade-oﬀ between the representational

power of the policy (in the form of many policy pa-

rameters) and the learning speed (related to the num-

ber of samples required to estimate the gradient) needs

to be considered. To make policy search approaches

tractable, the policy needs to be represented with a

function approximation that takes into account do-

main knowledge, such as task-relevant parameters or

generalization properties. As the next action picked

by a policy depends on the current state and ac-

tion, a policy can be seen as a closed-loop controller.

Roberts et al. (2011) demonstrate that care needs to be

taken when selecting closed-loop parameterizations for

weakly-stable systems, and suggest forms that are par-

ticularly robust during learning. However, especially for episodic RL tasks, sometimes open-loop policies (i.e., policies where the actions depend only on the time) can also be employed.

Figure 5: Boston Dynamics LittleDog jumping (Kolter and Ng, 2009a) (Picture reprinted with permission of Zico Kolter).

Via Points & Splines. An open-loop policy may of-

ten be naturally represented as a trajectory, either

in the space of states or targets or directly as a set

of controls. Here, the actions are only a function

of time, which can be considered as a component of

the state. Such spline-based policies are very suitable

for compressing complex trajectories into few param-

eters. Typically the desired joint or Cartesian posi-

tion, velocities, and/or accelerations are used as ac-

tions. To minimize the required number of parame-

ters, not every point is stored. Instead, only impor-

tant via-points are considered and other points are in-

terpolated. Miyamoto et al. (1996) optimized the po-

sition and timing of such via-points in order to learn

a kendama task (a traditional Japanese toy similar to

ball-in-a-cup). A well-known type of via-point representation is the spline, which relies on piecewise-defined

smooth polynomial functions for interpolation. For

example, Roberts et al. (2010) used a periodic cubic

spline as a policy parametrization for a ﬂapping system

and Kuindersma et al. (2011) used a cubic spline to

represent arm movements in an impact recovery task.
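
A hedged sketch of such a spline-based open-loop policy is given below: a cubic spline through a small set of via-points yields the desired position (and, by differentiation, velocity) as a function of time; the via-point values here are placeholders standing in for learned policy parameters.

import numpy as np
from scipy.interpolate import CubicSpline

via_times = np.array([0.0, 0.5, 1.0, 1.5, 2.0])      # fixed knot times in seconds
via_positions = np.array([0.0, 0.4, 1.0, 0.6, 0.0])  # learned policy parameters (placeholders)

policy = CubicSpline(via_times, via_positions)

t = np.linspace(0.0, 2.0, 9)
desired_pos = policy(t)        # action: desired joint position at time t
desired_vel = policy(t, 1)     # first derivative, if velocities are also commanded
print(np.round(desired_pos, 2))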

Linear Models. If model knowledge of the system is

available, it can be used to create features for lin-

ear closed-loop policy representations. For example,

Tamei and Shibata (2009) used policy-gradient rein-

forcement learning to adjust a model that maps from

human EMG signals to forces that in turn is used in a

cooperative holding task.

Motor Primitives. Motor primitives combine linear

models describing dynamics with parsimonious move-

ment parametrizations. While originally biologically-

inspired, they have seen considerable success in representing

basic movements in robotics such as a reaching move-

ment or basic locomotion. These basic movements

can subsequently be sequenced and/or combined to

achieve more complex movements. For both goal ori-

ented and rhythmic movement, diﬀerent technical rep-

resentations have been proposed in the robotics com-

munity. Dynamical system motor primitives (Ijspeert

et al., 2003; Schaal et al., 2007) have become a popular

representation for reinforcement learning of discrete

movements. The dynamical system motor primitives

always have a strong dependence on the phase of the

movement, which corresponds to time. They can be

employed as an open-loop trajectory representation.

Nevertheless, they can also be employed as a closed-

loop policy to a limited extent. In our experience, they

oﬀer a number of advantages over via-point or spline

based policy representation (see Section 7.2). The dy-

namical system motor primitives have been trained

with reinforcement learning for a T-ball batting task

(Peters and Schaal, 2008c,b), an underactuated pendu-

lum swing-up and a ball-in-a-cup task (Kober and Pe-

ters, 2009), ﬂipping a light switch (Buchli et al., 2011),

pouring water (Tamošiūnaitė et al., 2011), and play-

ing pool and manipulating a box (Pastor et al., 2011).

For rhythmic behaviors, a representation based on the

same biological motivation but with a fairly diﬀerent

technical implementation (based on half-elliptical loci) has been used to acquire the gait patterns for

an Aibo robot dog locomotion (Kohl and Stone, 2004).
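
For illustration, the following hedged sketch implements a one-dimensional discrete dynamical-system motor primitive in the spirit of Ijspeert et al. (2003): a point attractor toward the goal is modulated by a phase-dependent forcing term whose basis-function weights are the parameters a reinforcement learning method would adapt. The gains, basis placement, and integration scheme are illustrative assumptions.

import numpy as np

alpha_z, beta_z, alpha_x, tau = 25.0, 6.25, 3.0, 1.0
n_basis = 10
centers = np.exp(-alpha_x * np.linspace(0.0, 1.0, n_basis))      # basis centers in phase space
widths = 1.0 / (np.diff(centers, append=centers[-1] / 2) ** 2 + 1e-8)
weights = np.zeros(n_basis)          # policy parameters adapted by reinforcement learning

def forcing(x, g, y0):
    psi = np.exp(-widths * (x - centers) ** 2)
    return (psi @ weights) / (psi.sum() + 1e-10) * x * (g - y0)

def rollout(y0=0.0, g=1.0, dt=0.01, T=1.5):
    y, z, x = y0, 0.0, 1.0
    traj = []
    for _ in range(int(T / dt)):
        f = forcing(x, g, y0)
        z += dt / tau * (alpha_z * (beta_z * (g - y) - z) + f)   # transformation system
        y += dt / tau * z
        x += dt / tau * (-alpha_x * x)                            # canonical system (phase)
        traj.append(y)
    return np.array(traj)

print(rollout()[-1])   # with zero weights the trajectory converges near the goal g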

Gaussian Mixture Models and Radial Basis Function

Models. When more general policies with a strong

state-dependence are needed, general function approx-

imators based on radial basis functions, also called

Gaussian kernels, become reasonable choices. While

learning with ﬁxed basis function centers and widths

often works well in practice, estimating them is chal-

lenging. These centers and widths can also be esti-

mated from data prior to the reinforcement learning

process. This approach has been used to generalize

an open-loop reaching movement (Guenter et al., 2007;

Lin and Lai, 2012) and to learn the closed-loop cart-

pole swingup task (Deisenroth and Rasmussen, 2011).

Globally linear models were employed in a closed-loop

block stacking task (Deisenroth et al., 2011).


Neural Networks are another general function ap-

proximation used to represent policies. Neural os-

cillators with sensor feedback have been used to

learn rhythmic movements where open and closed-

loop information were combined, such as gaits for

a two legged robot (Geng et al., 2006; Endo et al.,

2008). Similarly, a peg-in-hole (see Figure 1b), a ball-

balancing task (Gullapalli et al., 1994), and a naviga-

tion task (Hailu and Sommer, 1998) have been learned

with closed-loop neural networks as policy function ap-

proximators.

Locally Linear Controllers. As local linearity is

highly desirable in robot movement generation to

avoid actuation diﬃculties, learning the parameters of

a locally linear controller can be a better choice than

using a neural network or radial basis function repre-

sentation. Several of these controllers can be combined

to form a global, inherently closed-loop policy. This

type of policy has allowed for many applications, in-

cluding learning helicopter ﬂight (Bagnell and Schnei-

der, 2001), learning biped walk patterns (Tedrake,

2004; Tedrake et al., 2005), driving a radio-controlled

(RC) car, learning a jumping behavior for a robot dog

(Kolter and Ng, 2009a) (illustrated in Figure 5), and

balancing a two wheeled robot (Vlassis et al., 2009).

Operational space control was also learned by Peters

and Schaal (2008a) using locally linear controller mod-

els. In a marble maze task, Zucker and Bagnell (2012)

used such a controller as a policy that expressed the

desired velocity of the ball in terms of the directional

gradient of a value function.
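
A hedged sketch of such a policy is given below: several local linear controllers, each valid around its own operating point, are blended by proximity weights into one smooth, closed-loop control law; the centers, gains, and bandwidth are illustrative assumptions.

import numpy as np

centers = np.array([[-1.0], [0.0], [1.0]])           # operating points of the local models
K = np.array([[[-2.0]], [[-3.0]], [[-2.0]]])         # local feedback gain matrices
k = np.array([[0.5], [0.0], [-0.5]])                 # local feedforward offsets
bandwidth = 0.5

def policy(x):
    d2 = np.sum((x - centers) ** 2, axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    w = w / w.sum()                                   # proximity-based blending weights
    u_local = np.einsum('idj,ij->id', K, x - centers) + k   # each local control u_i = K_i (x - c_i) + k_i
    return w @ u_local                                # blended control command

print(policy(np.array([0.3])))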

Non-parametric Policies. Policies based on non-

parametric regression approaches often allow a more

data-driven learning process. This approach is often

preferable over the purely parametric policies listed

above because the policy structure can evolve during

the learning process. Such approaches are especially

useful when a policy is learned to adjust the existing behaviors of a lower-level controller, such as when

choosing among diﬀerent robot human interaction pos-

sibilities (Mitsunaga et al., 2005), selecting among dif-

ferent striking movements in a table tennis task (Pe-

ters et al., 2010a), and setting the meta-actions for

dart throwing and table tennis hitting tasks (Kober

et al., 2010).

5 Tractability Through Prior

Knowledge

Prior knowledge can dramatically help guide the learn-

ing process. It can be included in the form of initial

policies, demonstrations, initial models, a predeﬁned

task structure, or constraints on the policy such as

torque limits or ordering constraints of the policy pa-

rameters. These approaches signiﬁcantly reduce the

search space and, thus, speed up the learning process.

Providing a (partially) successful initial policy allows

a reinforcement learning method to focus on promising

regions in the value function or in policy space, see Sec-

tion 5.1. Pre-structuring a complex task such that it

can be broken down into several more tractable ones

can signiﬁcantly reduce the complexity of the learn-

ing task, see Section 5.2. An overview of publications

using prior knowledge to render the learning problem

tractable is presented in Table 4. Constraints may also

limit the search space, but often pose new, additional

problems for the learning methods. For example, pol-

icy search methods often do not handle hard limits on

the policy well. Relaxing such constraints (a trick of-

ten applied in machine learning) is not feasible if they

were introduced to protect the robot in the ﬁrst place.

5.1 Prior Knowledge Through

Demonstration

People and other animals frequently learn using a com-

bination of imitation and trial and error. When learn-

ing to play tennis, for instance, an instructor will re-

peatedly demonstrate the sequence of motions that

form an orthodox forehand stroke. Students subse-

quently imitate this behavior, but still need hours of

practice to successfully return balls to a precise loca-

tion on the opponent’s court. Input from a teacher

need not be limited to initial instruction. The instruc-

tor may provide additional demonstrations in later

learning stages (Latzke et al., 2007; Ross et al., 2011a), which can also be used as differential feedback

(Argall et al., 2008).

This combination of imitation learning with rein-

forcement learning is sometimes termed apprenticeship

learning (Abbeel and Ng, 2004) to emphasize the need

for learning both from a teacher and by practice. The

term “apprenticeship learning” is often employed to re-

fer to “inverse reinforcement learning” or “inverse op-

timal control” but is intended here to be employed in

this original, broader meaning. For a recent survey

detailing the state of the art in imitation learning for

robotics, see (Argall et al., 2009).

Using demonstrations to initialize reinforcement

learning provides multiple beneﬁts. Perhaps the most

obvious beneﬁt is that it provides supervised training