Reinforcement Learning in Robotics:
A Survey
Jens Kober∗†   J. Andrew Bagnell‡   Jan Peters§¶
email: jkober@cor-lab.uni-bielefeld.de, dbagnell@ri.cmu.edu, mail@jan-peters.net
∗ Bielefeld University, CoR-Lab Research Institute for Cognition and Robotics, Universitätsstr. 25, 33615 Bielefeld, Germany
† Honda Research Institute Europe, Carl-Legien-Str. 30, 63073 Offenbach/Main, Germany
‡ Carnegie Mellon University, Robotics Institute, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
§ Max Planck Institute for Intelligent Systems, Department of Empirical Inference, Spemannstr. 38, 72076 Tübingen, Germany
¶ Technische Universität Darmstadt, FB Informatik, FG Intelligent Autonomous Systems, Hochschulstr. 10, 64289 Darmstadt, Germany
Reinforcement learning offers to robotics a frame-
work and set of tools for the design of sophisticated
and hard-to-engineer behaviors. Conversely, the chal-
lenges of robotic problems provide both inspiration,
impact, and validation for developments in reinforce-
ment learning. The relationship between disciplines
has sufficient promise to be likened to that between
physics and mathematics. In this article, we attempt
to strengthen the links between the two research com-
munities by providing a survey of work in reinforce-
ment learning for behavior generation in robots. We
highlight both key challenges in robot reinforcement
learning as well as notable successes. We discuss how
contributions tamed the complexity of the domain and
study the role of algorithms, representations, and prior
knowledge in achieving these successes. As a result, a
particular focus of our paper lies on the choice between
model-based and model-free as well as between value
function-based and policy search methods. By analyz-
ing a simple problem in some detail we demonstrate
how reinforcement learning approaches may be prof-
itably applied, and we note throughout open questions
and the tremendous potential for future research.
keywords: reinforcement learning, learning control,
robot, survey
1 Introduction
A remarkable variety of problems in robotics may
be naturally phrased as ones of reinforcement learn-
ing. Reinforcement learning (RL) enables a robot to
autonomously discover an optimal behavior through
trial-and-error interactions with its environment. In-
stead of explicitly detailing the solution to a problem,
in reinforcement learning the designer of a control task
provides feedback in terms of a scalar objective func-
tion that measures the one-step performance of the
robot. Figure 1 illustrates the diverse set of robots
that have learned tasks using reinforcement learning.
Consider, for example, attempting to train a robot
to return a table tennis ball over the net (Muelling
et al., 2012). In this case, the robot might make observations of dynamic variables specifying ball position and velocity and the internal dynamics of the joint position and velocity. This might in fact capture well the state s of the system, providing a complete statistic for predicting future observations. The actions a
available to the robot might be the torque sent to mo-
tors or the desired accelerations sent to an inverse dy-
namics control system. A function π that generates
the motor commands (i.e., the actions) based on the
incoming ball and current internal arm observations
(i.e., the state) would be called the policy. A rein-
forcement learning problem is to find a policy that
optimizes the long term sum of rewards R(s, a); a re-
inforcement learning algorithm is one designed to find
such a (near)-optimal policy. The reward function in
this example could be based on the success of the hits
as well as secondary criteria like energy consumption.
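To make these ingredients concrete, the following minimal Python sketch shows the generic agent-environment interaction loop underlying such a formulation; the dynamics, policy parametrization, and reward used here are hypothetical placeholders rather than the actual table tennis setup. A policy maps the observed state to an action, the environment returns the next state and a scalar reward, and the learner accumulates the reward it seeks to maximize.

    import random

    def policy(state, theta):
        # A (stochastic) policy: here simply a random command in [-1, 1] scaled by a
        # single parameter; a real policy would map the full ball and joint state
        # to motor commands.
        return theta * random.uniform(-1.0, 1.0)

    def environment_step(state, action):
        # Hypothetical stand-in for the robot/simulator dynamics: returns the next
        # state and a scalar reward R(s, a).
        next_state = [s + 0.01 * action for s in state]
        reward = -sum(s * s for s in next_state) - 0.001 * action ** 2
        return next_state, reward

    def rollout(theta, horizon=100):
        # One episode: interact for `horizon` steps and sum the rewards.
        state, total_reward = [0.1, -0.2], 0.0
        for _ in range(horizon):
            action = policy(state, theta)
            state, reward = environment_step(state, action)
            total_reward += reward
        return total_reward

    print(rollout(theta=0.5))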
1.1 Reinforcement Learning in the
Context of Machine Learning
In the problem of reinforcement learning, an agent ex-
plores the space of possible strategies and receives feed-
back on the outcome of the choices made. From this
information, a “good” – or ideally optimal – policy
(i.e., strategy or controller) must be deduced.
Reinforcement learning may be understood by con-
trasting the problem with other areas of study in ma-
chine learning. In supervised learning (Langford and
Zadrozny, 2005), an agent is directly presented a se-
quence of independent examples of correct predictions
to make in different circumstances. In imitation learn-
ing, an agent is provided demonstrations of actions of
a good strategy to follow in given situations (Argall
et al., 2009; Schaal, 1999).
To aid in understanding the RL problem and its
relation with techniques widely used within robotics,
Figure 2 provides a schematic illustration of two axes
of problem variability: the complexity of sequential in-
teraction and the complexity of reward structure. This
[Figure 1 panels: (a) OBELIX robot, (b) Zebra Zero robot, (c) autonomous helicopter, (d) Sarcos humanoid DB]
Figure 1: This figure illustrates a small sample of robots
with behaviors that were reinforcement learned. These
cover the whole range of aerial vehicles, robotic arms,
autonomous vehicles, and humanoid robots. (a) The
OBELIX robot is a wheeled mobile robot that learned to
push boxes (Mahadevan and Connell, 1992) with a value
function-based approach (Picture reprint with permission
of Sridhar Mahadevan). (b) A Zebra Zero robot arm
learned a peg-in-hole insertion task (Gullapalli et al., 1994)
with a model-free policy gradient approach (Picture reprint
with permission of Rod Grupen). (c) Carnegie Mellon’s au-
tonomous helicopter leveraged a model-based policy search
approach to learn a robust flight controller (Bagnell and
Schneider, 2001). (d) The Sarcos humanoid DB learned
a pole-balancing task (Schaal, 1996) using forward models
(Picture reprint with permission of Stefan Schaal).
hierarchy of problems, and the relations between them,
is a complex one, varying in manifold attributes and
difficult to condense to something like a simple linear
ordering on problems. Much recent work in the ma-
chine learning community has focused on understand-
ing the diversity and the inter-relations between prob-
lem classes. The figure should be understood in this
light as providing a crude picture of the relationship
between areas of machine learning research important
for robotics.
Each problem subsumes those that are both below
and to the left in the sense that one may always frame
the simpler problem in terms of the more complex one;
note that some problems are not linearly ordered. In
this sense, reinforcement learning subsumes much of
the scope of classical machine learning as well as con-
textual bandit and imitation learning problems. Re-
duction algorithms (Langford and Zadrozny, 2005) are
used to convert effective solutions for one class of prob-
lems into effective solutions for others, and have proven
to be a key technique in machine learning.
At lower left, we find the paradigmatic problem of
supervised learning, which plays a crucial role in ap-
plications as diverse as face detection and spam filter-
ing. In these problems (including binary classification
and regression), a learner’s goal is to map observations
(typically known as features or covariates) to actions
which are usually a discrete set of classes or a real
value. These problems possess no interactive compo-
nent: the design and analysis of algorithms to address
[Figure 2: problem classes arranged along two axes, reward structure complexity (vertical) and interactive/sequential complexity (horizontal): binary classification, cost-sensitive learning, structured prediction, supervised learning, imitation learning, contextual bandit, baseline distribution RL, and reinforcement learning.]
Figure 2: An illustration of the inter-relations between
well-studied learning problems in the literature along axes
that attempt to capture both the information and com-
plexity available in reward signals and the complexity of
sequential interaction between learner and environment.
Each problem subsumes those to the left and below; reduc-
tion techniques provide methods whereby harder problems
(above and right) may be addressed using repeated appli-
cation of algorithms built for simpler problems. (Langford
and Zadrozny, 2005)
these problems rely on training and testing instances
as independent and identically distributed random variables. This rules out any notion that a decision made
by the learner will impact future observations: su-
pervised learning algorithms are built to operate in a
world in which every decision has no effect on the fu-
ture examples considered. Further, within supervised
learning scenarios, during a training phase the “cor-
rect” or preferred answer is provided to the learner, so
there is no ambiguity about action choices.
More complex reward structures are also often stud-
ied: one such is known as cost-sensitive learning, where
each training example and each action or prediction is
annotated with a cost for making such a prediction.
Learning techniques exist that reduce such problems
to the simpler classification problem, and active re-
search directly addresses such problems as they are
crucial in practical learning applications.
Contextual bandit or associative reinforcement
learning problems begin to address the fundamental
problem of exploration-vs-exploitation, as information
is provided only about a chosen action and not what-
might-have-been. These find wide-spread application
in problems ranging from pharmaceutical drug discovery to ad placement on the web, and are one of the
most active research areas in the field.
Problems of imitation learning and structured pre-
diction may be seen to vary from supervised learning
on the alternate dimension of sequential interaction.
Structured prediction, a key technique used within
computer vision and robotics, where many predictions
are made in concert by leveraging inter-relations be-
tween them, may be seen as a simplified variant of
imitation learning (Daumé III et al., 2009; Ross et al.,
2011a). In imitation learning, we assume that an ex-
pert (for example, a human pilot) that we wish to
mimic provides demonstrations of a task. While “cor-
rect answers” are provided to the learner, complexity
arises because any mistake by the learner modifies the
future observations from what would have been seen
had the expert chosen the controls. Such problems
provably lead to compounding errors and violate the
basic assumption of independent examples required for
successful supervised learning. In fact, in sharp con-
trast with supervised learning problems where only a
single data-set needs to be collected, repeated inter-
action between learner and teacher appears to be both necessary and sufficient (Ross et al., 2011b) to provide
performance guarantees in both theory and practice in
imitation learning problems.
Reinforcement learning embraces the full complex-
ity of these problems by requiring both interactive,
sequential prediction as in imitation learning as well
as complex reward structures with only “bandit” style
feedback on the actions actually chosen. It is this
combination that enables so many problems of rele-
vance to robotics to be framed in these terms; it is
this same combination that makes the problem both
information-theoretically and computationally hard.
We note here briefly the problem termed “Baseline
Distribution RL”: this is the standard RL problem with
the additional benefit for the learner that it may draw
initial states from a distribution provided by an ex-
pert instead of simply an initial state chosen by the
problem. As we describe further in Section 5.1, this
additional information of which states matter dramat-
ically affects the complexity of learning.
1.2 Reinforcement Learning in the
Context of Optimal Control
Reinforcement Learning (RL) is very closely related
to the theory of classical optimal control, as well
as dynamic programming, stochastic programming,
simulation-optimization, stochastic search, and opti-
mal stopping (Powell, 2012). Both RL and optimal
control address the problem of finding an optimal pol-
icy (often also called the controller or control policy)
that optimizes an objective function (i.e., the accu-
mulated cost or reward), and both rely on the notion
of a system being described by an underlying set of
states, controls and a plant or model that describes
transitions between states. However, optimal control
assumes perfect knowledge of the system’s description
in the form of a model (i.e., a function T that describes what the next state of the robot will be given
the current state and action). For such models, op-
timal control ensures strong guarantees which, never-
theless, often break down due to model and compu-
tational approximations. In contrast, reinforcement
learning operates directly on measured data and re-
wards from interaction with the environment. Rein-
forcement learning research has placed great focus on
addressing cases which are analytically intractable us-
ing approximations and data-driven techniques. One
of the most important approaches to reinforcement
learning within robotics centers on the use of classi-
cal optimal control techniques (e.g. Linear-Quadratic
Regulation and Differential Dynamic Programming)
to system models learned via repeated interaction with
the environment (Atkeson, 1998; Bagnell and Schnei-
der, 2001; Coates et al., 2009). A concise discussion
of viewing reinforcement learning as “adaptive optimal
control” is presented in (Sutton et al., 1991).
1.3 Reinforcement Learning in the
Context of Robotics
Robotics as a reinforcement learning domain dif-
fers considerably from most well-studied reinforcement
learning benchmark problems. In this article, we high-
light the challenges faced in tackling these problems.
Problems in robotics are often best represented with
high-dimensional, continuous states and actions (note
that the 10-30 dimensional continuous actions common
in robot reinforcement learning are considered large
(Powell, 2012)). In robotics, it is often unrealistic to
assume that the true state is completely observable
and noise-free. The learning system will not be able
to know precisely in which state it is and even vastly
different states might look very similar. Thus, robot reinforcement learning problems are often modeled as partially
observed, a point we take up in detail in our formal
model description below. The learning system must
hence use filters to estimate the true state. It is often
essential to maintain the information state of the en-
vironment that not only contains the raw observations
but also a notion of uncertainty on its estimates (e.g.,
both the mean and the variance of a Kalman filter
tracking the ball in the robot table tennis example).
Experience on a real physical system is tedious to
obtain, expensive and often hard to reproduce. Even
getting to the same initial state is impossible for the
robot table tennis system. Every single trial run, also
called a roll-out, is costly and, as a result, such ap-
plications force us to focus on difficulties that do not
arise as frequently in classical reinforcement learning
benchmark examples. In order to learn within a rea-
sonable time frame, suitable approximations of state,
policy, value function, and/or system dynamics need
to be introduced. However, while real-world experi-
ence is costly, it usually cannot be replaced by learning
in simulations alone. In analytical or learned models
of the system even small modeling errors can accumu-
late to a substantially different behavior, at least for
highly dynamic tasks. Hence, algorithms need to be
robust with respect to models that do not capture all
the details of the real system, also referred to as under-
modeling, and to model uncertainty. Another chal-
lenge commonly faced in robot reinforcement learning
is the generation of appropriate reward functions. Re-
wards that guide the learning system quickly to success
are needed to cope with the cost of real-world expe-
rience. This problem is called reward shaping (Laud,
2004) and represents a substantial manual contribu-
tion. Specifying good reward functions in robotics re-
quires a fair amount of domain knowledge and may
often be hard in practice.
Not every reinforcement learning method is equally
suitable for the robotics domain. In fact, many of
the methods thus far demonstrated on difficult prob-
lems have been model-based (Atkeson et al., 1997;
Abbeel et al., 2007; Deisenroth and Rasmussen, 2011)
and robot learning systems often employ policy search
methods rather than value function-based approaches
(Gullapalli et al., 1994; Miyamoto et al., 1996; Bagnell
and Schneider, 2001; Kohl and Stone, 2004; Tedrake
et al., 2005; Peters and Schaal, 2008a,b; Kober and
Peters, 2009; Deisenroth et al., 2011). Such design
choices stand in contrast to possibly the bulk of the
early research in the machine learning community
(Kaelbling et al., 1996; Sutton and Barto, 1998). We
attempt to give a fairly complete overview on real
robot reinforcement learning citing most original pa-
pers while grouping them based on the key insights
employed to make the Robot Reinforcement Learn-
ing problem tractable. We isolate key insights such
as choosing an appropriate representation for a value
function or policy, incorporating prior knowledge, and
transferring knowledge from simulations.
This paper surveys a wide variety of tasks where re-
inforcement learning has been successfully applied to
robotics. If a task can be phrased as an optimiza-
tion problem and exhibits temporal structure, rein-
forcement learning can often be profitably applied to
both phrase and solve that problem. The goal of this
paper is twofold. On the one hand, we hope that
this paper can provide indications for the robotics
community as to which types of problems can be tackled by reinforcement learning and provide pointers to promising approaches. On the other hand, for
the reinforcement learning community, this paper can
point out novel real-world test beds and remarkable
opportunities for research on open questions. We fo-
cus mainly on results that were obtained on physical
robots with tasks going beyond typical reinforcement
learning benchmarks.
We concisely present reinforcement learning tech-
niques in the context of robotics in Section 2. The chal-
lenges in applying reinforcement learning in robotics
are discussed in Section 3. Different approaches to
making reinforcement learning tractable are treated
in Sections 4 to 6. In Section 7, the example of ball-
in-a-cup is employed to highlight which of the various
approaches discussed in the paper have been particu-
larly helpful to make such a complex task tractable.
Finally, in Section 8, we summarize the specific prob-
lems and benefits of reinforcement learning in robotics
and provide concluding thoughts on the problems and
promise of reinforcement learning in robotics.
2 A Concise Introduction to
Reinforcement Learning
In reinforcement learning, an agent tries to maxi-
mize the accumulated reward over its life-time. In an
episodic setting, where the task is restarted after each
end of an episode, the objective is to maximize the to-
tal reward per episode. If the task is on-going without
a clear beginning and end, either the average reward
over the whole life-time or a discounted return (i.e., a
weighted average where distant rewards have less influ-
ence) can be optimized. In such reinforcement learning
problems, the agent and its environment may be modeled as being in a state s ∈ S and able to perform actions a ∈ A, each of which may be members of either discrete or continuous sets and can be multi-dimensional.
A state s contains all relevant information about the
current situation to predict future states (or observ-
ables); an example would be the current position of a
robot in a navigation task¹. An action a is used to con-
trol (or change) the state of the system. For example,
in the navigation task we could have the actions corre-
sponding to torques applied to the wheels. For every
step, the agent also gets a reward R, which is a scalar
value and assumed to be a function of the state and
observation. (It may equally be modeled as a random
variable that depends on only these variables.) In the
navigation task, a possible reward could be designed
based on the energy costs for taken actions and re-
wards for reaching targets. The goal of reinforcement
learning is to find a mapping from states to actions,
called policy π, that picks actions ain given states
smaximizing the cumulative expected reward. The
policy π is either deterministic or probabilistic. The former always uses the exact same action for a given state in the form a = π(s); the latter draws a sample from a distribution over actions when it encounters a state, i.e., a ∼ π(s, a) = P(a|s). The reinforcement
learning agent needs to discover the relations between
states, actions, and rewards. Hence exploration is re-
quired which can either be directly embedded in the
policy or performed separately and only as part of the
learning process.
Classical reinforcement learning approaches are
based on the assumption that we have a Markov Deci-
sion Process (MDP) consisting of the set of states S,
set of actions A, the rewards R, and transition probabilities T that capture the dynamics of the system. Transition probabilities (or densities in the continuous state case) T(s, a, s′) = P(s′|s, a) describe the effects of the actions on the state. Transition probabilities generalize the notion of deterministic dynamics to allow for modeling outcomes that are uncertain even given the full state.
The Markov property requires that the next state s′
and the reward only depend on the previous state s
and action a(Sutton and Barto, 1998), and not on ad-
ditional information about the past states or actions.
In a sense, the Markov property recapitulates the idea
of state – a state is a sufficient statistic for predicting
the future, rendering previous observations irrelevant.
In general in robotics, we may only be able to find
some approximate notion of state.
Different types of reward functions are commonly
used, including rewards depending only on the current
state R = R(s), rewards depending on the current state and action R = R(s, a), and rewards including the transitions R = R(s, a, s′). Most of the theoretical guarantees only hold if the problem adheres to a Markov structure; in practice, however, many approaches work very well for problems that do not fulfill this requirement.
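To illustrate these ingredients, the sketch below encodes a hypothetical three-state corridor navigation MDP in Python (not an example taken from the literature above): S and A are small discrete sets, T(s, a, s′) is a table of transition probabilities, and R(s, a) combines a small action cost with a bonus for stepping toward the goal.

    # Toy MDP: states 0..2 in a corridor, goal is state 2.
    states = [0, 1, 2]
    actions = ["left", "right"]

    def transition(s, a, s_next):
        # T(s, a, s'): the move succeeds with probability 0.8, otherwise the robot stays.
        target = min(s + 1, 2) if a == "right" else max(s - 1, 0)
        if s_next == target:
            return 0.8 if target != s else 1.0
        return 0.2 if s_next == s else 0.0

    def reward(s, a):
        # R(s, a): small action cost, bonus for stepping toward the goal state.
        return (1.0 if s == 1 and a == "right" else 0.0) - 0.01

    # Sanity check: transition probabilities out of each (s, a) pair sum to one.
    for s in states:
        for a in actions:
            assert abs(sum(transition(s, a, sn) for sn in states) - 1.0) < 1e-9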
¹ When only observations but not the complete state is available, the sufficient statistics of the filter can alternatively serve as state s. Such a state is often called information or belief state.
2.1 Goals of Reinforcement Learning
The goal of reinforcement learning is to discover an
optimal policy π∗ that maps states (or observations)
to actions so as to maximize the expected return J,
which corresponds to the cumulative expected reward.
There are different models of optimal behavior (Kael-
bling et al., 1996) which result in different definitions
of the expected return. A finite-horizon model only at-
tempts to maximize the expected reward for the hori-
zon H, i.e., the next H (time-)steps h:

    J = E{ ∑_{h=0}^{H} R_h }.
This setting can also be applied to model problems
where it is known how many steps are remaining.
Alternatively, future rewards can be discounted by
a discount factor γ (with 0 ≤ γ < 1):

    J = E{ ∑_{h=0}^{∞} γ^h R_h }.
This is the setting most frequently discussed in clas-
sical reinforcement learning texts. The parameter γ
affects how much the future is taken into account and
needs to be tuned manually. As illustrated in (Kael-
bling et al., 1996), this parameter often qualitatively
changes the form of the optimal solution. Policies
designed by optimizing with small γ are myopic and
greedy, and may lead to poor performance if we ac-
tually care about longer term rewards. It is straight-
forward to show that the optimal control law can be
unstable if the discount factor is too low (e.g., it is
not difficult to show this destabilization even for dis-
counted linear quadratic regulation problems). Hence,
discounted formulations are frequently inadmissible in
robot control.
In the limit when γ approaches 1, the metric approaches what is known as the average-reward criterion (Bertsekas, 1995),

    J = lim_{H→∞} E{ (1/H) ∑_{h=0}^{H} R_h }.
This setting has the problem that it cannot distin-
guish between policies that initially gain a transient of
large rewards and those that do not. This transient
phase, also called prefix, is dominated by the rewards
obtained in the long run. If a policy accomplishes both
an optimal prefix as well as an optimal long-term be-
havior, it is called bias optimal (Lewis and Puterman, 2001). An example in robotics would be the tran-
sient phase during the start of a rhythmic movement,
where many policies will accomplish the same long-
term reward but differ substantially in the transient
(e.g., there are many ways of starting the same gait
in dynamic legged locomotion) allowing for room for
improvement in practical application.
In real-world domains, the shortcomings of the dis-
counted formulation are often more critical than those
of the average reward setting as stable behavior is often
more important than a good transient (Peters et al.,
2004). We also often encounter an episodic control
task, where the task runs only for H time-steps and is then reset (potentially by human intervention) and started over. This horizon, H, may be arbitrarily large,
as long as the expected reward over the episode can
be guaranteed to converge. As such episodic tasks are
probably the most frequent ones, finite-horizon models
are often the most relevant.
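The three notions of expected return can be contrasted on a concrete reward sequence; the short sketch below uses an arbitrary, made-up sequence of rewards and computes the finite-horizon sum, the discounted return, and the average reward of that single trajectory.

    rewards = [1.0, 0.0, 0.5, 2.0, 0.0, 1.5]   # hypothetical rewards R_0..R_5
    gamma = 0.9

    finite_horizon = sum(rewards)                                    # sum_{h=0}^{H} R_h
    discounted = sum(gamma ** h * r for h, r in enumerate(rewards))  # sum_h gamma^h R_h
    average = sum(rewards) / len(rewards)                            # (1/H) sum_h R_h

    print(finite_horizon, discounted, average)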
Two natural goals arise for the learner. In the first,
we attempt to find an optimal strategy at the end of
a phase of training or interaction. In the second, the
goal is to maximize the reward over the whole time the
robot is interacting with the world.
In contrast to supervised learning, the learner must
first discover its environment and is not told the opti-
mal action it needs to take. To gain information about
the rewards and the behavior of the system, the agent
needs to explore by considering previously unused ac-
tions or actions it is uncertain about. It needs to de-
cide whether to play it safe and stick to well known ac-
tions with (moderately) high rewards or to dare trying
new things in order to discover new strategies with an
even higher reward. This problem is commonly known
as the exploration-exploitation trade-off.
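A common and very simple way of trading off these two behaviors is ε-greedy action selection, sketched below for a tabular state-action value function (the Q-table entries are hypothetical): with probability ε the agent tries a random action, and otherwise it exploits the action currently believed to be best.

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        # With probability epsilon explore a random action,
        # otherwise exploit the greedy action arg max_a Q(s, a).
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((state, a), 0.0))

    Q = {(0, "left"): 0.2, (0, "right"): 0.5}   # toy value estimates
    print(epsilon_greedy(Q, state=0, actions=["left", "right"]))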
In principle, reinforcement learning algorithms for
Markov Decision Processes with performance guar-
antees are known (Kakade, 2003; Kearns and Singh,
2002; Brafman and Tennenholtz, 2002) with polyno-
mial scaling in the size of the state and action spaces,
an additive error term, as well as in the horizon length
(or a suitable substitute including the discount factor
or “mixing time” (Kearns and Singh, 2002)). However,
state spaces in robotics problems are often tremen-
dously large as they scale exponentially in the num-
ber of state variables and often are continuous. This
challenge of exponential growth is often referred to as
the curse of dimensionality (Bellman, 1957) (also dis-
cussed in Section 3.1).
Off-policy methods learn independently of the employed policy, i.e., an explorative strategy that is dif-
ferent from the desired final policy can be employed
during the learning process. On-policy methods collect
sample information about the environment using the
current policy. As a result, exploration must be built
into the policy and determines the speed of the policy
improvements. Such exploration and the performance
of the policy can result in an exploration-exploitation
trade-off between long- and short-term improvement
of the policy. Modeling exploration with probability distributions has surprising implications, e.g.,
stochastic policies have been shown to be the optimal
stationary policies for selected problems (Sutton et al.,
1999; Jaakkola et al., 1993) and can even break the
curse of dimensionality (Rust, 1997). Furthermore,
stochastic policies often allow the derivation of new
policy update steps with surprising ease.
The agent needs to determine a correlation between
actions and reward signals. An action taken does not
have to have an immediate effect on the reward but
can also influence a reward in the distant future. The
difficulty in assigning credit for rewards is directly re-
lated to the horizon or mixing time of the problem. It
also increases with the dimensionality of the actions as
not all parts of the action may contribute equally.
The classical reinforcement learning setup is an MDP where, in addition to the states S, actions A, and rewards R, we also have transition probabilities T(s, a, s′).
Here, the reward is modeled as a reward function
R(s, a). If both the transition probabilities and reward
function are known, this can be seen as an optimal
control problem (Powell, 2012).
2.2 Reinforcement Learning in the
Average Reward Setting
We focus on the average-reward model in this section.
Similar derivations exist for the finite horizon and dis-
counted reward cases. In many instances, the average-reward case is more suitable in a robotic setting
as we do not have to choose a discount factor and we
do not have to explicitly consider time in the deriva-
tion.
To make a policy able to be optimized by continuous
optimization techniques, we write a policy as a condi-
tional probability distribution π(s, a) = P(a|s). Below,
we consider restricted policies that are parametrized by a vector θ. In reinforcement learning, the policy is usually considered to be stationary and memoryless. Reinforcement learning and optimal control aim at finding the optimal policy π∗ or equivalent policy parameters θ∗ which maximize the average return

    J(π) = ∑_{s,a} µ^π(s) π(s, a) R(s, a),

where µ^π is the stationary state distribution generated by policy π acting
in the environment, i.e., the MDP. It can be shown
(Puterman, 1994) that such policies that map states
(even deterministically) to actions are sufficient to en-
sure optimality in this setting: a policy needs to remember neither previous states visited, actions taken, nor the particular time step. For simplicity and to ease
exposition, we assume that this distribution is unique.
Markov Decision Processes where this fails (i.e., non-
ergodic processes) require more care in analysis, but
similar results exist (Puterman, 1994). The transitions
between states s caused by actions a are modeled as T(s, a, s′) = P(s′|s, a). We can then frame the control
problem as an optimization of
    max_π J(π) = ∑_{s,a} µ^π(s) π(s, a) R(s, a),                      (1)
    s.t. µ^π(s′) = ∑_{s,a} µ^π(s) π(s, a) T(s, a, s′), ∀s′ ∈ S,        (2)
    1 = ∑_{s,a} µ^π(s) π(s, a),                                        (3)
    π(s, a) ≥ 0, ∀s ∈ S, a ∈ A.
Here, Equation (2) defines stationarity of the state dis-
tributions µ^π (i.e., it ensures that it is well defined) and
Equation (3) ensures a proper state-action probability
distribution. This optimization problem can be tack-
led in two substantially different ways (Bellman, 1967,
1971). We can search the optimal solution directly in
this original, primal problem or we can optimize in
the Lagrange dual formulation. Optimizing in the pri-
mal formulation is known as policy search in reinforce-
ment learning while searching in the dual formulation
is known as a value function-based approach.
2.2.1 Value Function Approaches
Much of the reinforcement learning literature has fo-
cused on solving the optimization problem in Equa-
tions (1-3) in its dual form (Gordon, 1999; Puterman,
1994)². Using Lagrange multipliers V^π(s′) and R̄, we can express the Lagrangian of the problem by

    L = ∑_{s,a} µ^π(s) π(s, a) R(s, a)
        + ∑_{s′} V^π(s′) [ ∑_{s,a} µ^π(s) π(s, a) T(s, a, s′) − µ^π(s′) ]
        + R̄ [ 1 − ∑_{s,a} µ^π(s) π(s, a) ]
      = ∑_{s,a} µ^π(s) π(s, a) [ R(s, a) + ∑_{s′} V^π(s′) T(s, a, s′) − R̄ ]
        − ∑_{s′} V^π(s′) µ^π(s′) ∑_{a′} π(s′, a′) + R̄,

where the inner sum ∑_{a′} π(s′, a′) = 1. Using the property ∑_{s′,a′} V(s′) µ^π(s′) π(s′, a′) = ∑_{s,a} V(s) µ^π(s) π(s, a), we can obtain the Karush-Kuhn-Tucker conditions (Kuhn and Tucker, 1950) by differentiating with respect to µ^π(s) π(s, a), which yields extrema at

    ∂_{µ^π π} L = R(s, a) + ∑_{s′} V^π(s′) T(s, a, s′) − R̄ − V^π(s) = 0.
This statement implies that there are as many equa-
tions as the number of states multiplied by the num-
ber of actions. For each state there can be one or several optimal actions a∗ that result in the same maximal value and, hence, the condition can be written in terms of the optimal action a∗ as V^π(s) = R(s, a∗) − R̄ + ∑_{s′} V^π(s′) T(s, a∗, s′). As a∗ is generated by the same optimal policy π∗, we know the condition for the multipliers at optimality is

    V∗(s) = max_{a∗} [ R(s, a∗) − R̄ + ∑_{s′} V∗(s′) T(s, a∗, s′) ],    (4)

where V∗(s) is a shorthand notation for V^{π∗}(s). This
statement is equivalent to the Bellman Principle of
Optimality (Bellman, 1957)³ that states “An optimal
policy has the property that whatever the initial state
and initial decision are, the remaining decisions must
constitute an optimal policy with regard to the state
resulting from the first decision.” Thus, we have to
perform an optimal action a∗, and, subsequently, follow the optimal policy π∗ in order to achieve a global optimum. When evaluating Equation (4), we realize that the optimal value function V∗(s) corresponds to the
² For historical reasons, what we call the dual is often referred to in the literature as the primal. We argue that the problem of optimizing expected reward is the fundamental problem, and values are an auxiliary concept.
³ This optimality principle was originally formulated for a setting with discrete time steps and continuous states and actions but is also applicable for discrete states and actions.
long-term additional reward, beyond the average reward R̄, gained by starting in state s while taking optimal actions a∗ (according to the optimal policy π∗).
This principle of optimality has also been crucial in
enabling the field of optimal control (Kirk, 1970).
Hence, we have a dual formulation of the origi-
nal problem that serves as condition for optimality.
Many traditional reinforcement learning approaches
are based on identifying (possibly approximate) solu-
tions to this equation, and are known as value function
methods. Instead of directly learning a policy, they
first approximate the Lagrange multipliers V∗(s), also called the value function, and use it to reconstruct the optimal policy. The value function V^π(s) is defined equivalently; however, instead of always taking the optimal action a∗, the action a is picked according to a policy π:

    V^π(s) = ∑_a π(s, a) [ R(s, a) − R̄ + ∑_{s′} V^π(s′) T(s, a, s′) ].
Instead of the value function V^π(s), many algorithms rely on the state-action value function Q^π(s, a), which has advantages for determining the optimal policy as shown below. This function is defined as

    Q^π(s, a) = R(s, a) − R̄ + ∑_{s′} V^π(s′) T(s, a, s′).
In contrast to the value function V^π(s), the state-action value function Q^π(s, a) explicitly contains the information about the effects of a particular action. The optimal state-action value function is

    Q∗(s, a) = R(s, a) − R̄ + ∑_{s′} V∗(s′) T(s, a, s′)
             = R(s, a) − R̄ + ∑_{s′} [ max_{a′} Q∗(s′, a′) ] T(s, a, s′).
It can be shown that an optimal, deterministic policy π∗(s) can be reconstructed by always picking the action a in the current state that leads to the state s′ with the highest value V∗(s′):

    π∗(s) = arg max_a [ R(s, a) − R̄ + ∑_{s′} V∗(s′) T(s, a, s′) ].
If the optimal value function V∗(s′) and the transition probabilities T(s, a, s′) for the following states are known, determining the optimal policy is straightforward in a setting with discrete actions, as an exhaustive search is possible. For continuous spaces, determining the optimal action a is an optimization problem in it-
self. If both states and actions are discrete, the value
function and the policy may, in principle, be repre-
sented by tables and picking the appropriate action is
reduced to a look-up. For large or continuous spaces
representing the value function as a table becomes in-
tractable. Function approximation is employed to find
a lower dimensional representation that matches the
real value function as closely as possible, as discussed
in Section 2.4. Using the state-action value function Q∗(s, a) instead of the value function V∗(s),

    π∗(s) = arg max_a Q∗(s, a),
avoids having to calculate the weighted sum over the
successor states, and hence no knowledge of the tran-
sition function is required.
A wide variety of value function based reinforcement learning algorithms that attempt to estimate V∗(s) or Q∗(s, a) have been developed and
can be split mainly into three classes: (i) dynamic
programming-based optimal control approaches such
as policy iteration or value iteration, (ii) rollout-based
Monte Carlo methods and (iii) temporal difference
methods such as TD(λ) (Temporal Difference learn-
ing), Q-learning, and SARSA (State-Action-Reward-
State-Action).
Dynamic Programming-Based Methods require a
model of the transition probabilities T(s, a, s′) and the reward function R(s, a) to calculate the value function.
The model does not necessarily need to be predeter-
mined but can also be learned from data, potentially
incrementally. Such methods are called model-based.
Typical methods include policy iteration and value it-
eration.
Policy iteration alternates between the two phases
of policy evaluation and policy improvement. The ap-
proach is initialized with an arbitrary policy. Policy
evaluation determines the value function for the cur-
rent policy. Each state is visited and its value is up-
dated based on the current value estimates of its suc-
cessor states, the associated transition probabilities, as
well as the policy. This procedure is repeated until the
value function converges to a fixed point, which corre-
sponds to the true value function. Policy improvement
greedily selects the best action in every state accord-
ing to the value function as shown above. The two
steps of policy evaluation and policy improvement are
iterated until the policy does not change any longer.
Policy iteration only updates the policy once the
policy evaluation step has converged. In contrast,
value iteration combines the steps of policy evalua-
tion and policy improvement by directly updating the
value function based on Eq. (4) every time a state is
updated.
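For small discrete problems with a known model, value iteration can be written in a few lines. The sketch below is a minimal illustration under the more common discounted formulation (rather than the average-reward form of Eq. (4)); the tabular T and R dictionaries it expects are assumed inputs.

    def value_iteration(states, actions, T, R, gamma=0.95, tol=1e-6):
        # T[(s, a, s_next)] is the transition probability, R[(s, a)] the reward.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                # Bellman backup: max over actions of reward plus discounted expected value.
                new_v = max(
                    R[(s, a)] + gamma * sum(T.get((s, a, sn), 0.0) * V[sn] for sn in states)
                    for a in actions
                )
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:
                return V

    def greedy_policy(states, actions, T, R, V, gamma=0.95):
        # Extract pi(s) = arg max_a [R(s, a) + gamma * sum_s' T(s, a, s') V(s')].
        policy = {}
        for s in states:
            policy[s] = max(
                actions,
                key=lambda a: R[(s, a)] + gamma * sum(T.get((s, a, sn), 0.0) * V[sn] for sn in states),
            )
        return policy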
Monte Carlo Methods use sampling in order to es-
timate the value function. This procedure can be
used to replace the policy evaluation step of the dy-
namic programming-based methods above. Monte
Carlo methods are model-free, i.e., they do not need
an explicit transition function. They perform roll-
outs by executing the current policy on the system,
hence operating on-policy. The frequencies of transi-
tions and rewards are recorded and used to form estimates of the value function. For example, in
an episodic setting the state-action value of a given
state action pair can be estimated by averaging all the
returns that were received when starting from them.
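A minimal sketch of such a Monte Carlo estimate is given below, assuming episodes are available as lists of (state, action, reward) tuples; the episodes shown are hypothetical. The state-action value of each pair is estimated by averaging the returns observed after its first visit in each episode.

    from collections import defaultdict

    def monte_carlo_q(episodes):
        # First-visit Monte Carlo: average the (undiscounted) return observed
        # after the first visit of each (state, action) pair.
        returns = defaultdict(list)
        for episode in episodes:                      # episode: [(s, a, r), ...]
            seen = set()
            for t, (s, a, _) in enumerate(episode):
                if (s, a) not in seen:
                    seen.add((s, a))
                    ret = sum(r for _, _, r in episode[t:])
                    returns[(s, a)].append(ret)
        return {sa: sum(rs) / len(rs) for sa, rs in returns.items()}

    episodes = [[(0, "right", 0.0), (1, "right", 1.0)],
                [(0, "left", 0.0), (0, "right", 0.0), (1, "right", 1.0)]]
    print(monte_carlo_q(episodes))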
Temporal Difference Methods, unlike Monte Carlo
methods, do not have to wait until an estimate of the
return is available (i.e., at the end of an episode) to
update the value function. Rather, they use tempo-
ral errors and only have to wait until the next time
step. The temporal error is the difference between the
old estimate and a new estimate of the value function,
taking into account the reward received in the current
sample. These updates are done iteratively and, in
contrast to dynamic programming methods, only take
into account the sampled successor states rather than
the complete distributions over successor states. Like
the Monte Carlo methods, these methods are model-
free, as they do not use a model of the transition func-
tion to determine the value function. In this setting,
the value function cannot be calculated analytically
but has to be estimated from sampled transitions in
the MDP. For example, the value function could be
updated iteratively by

    V′(s) = V(s) + α( R(s, a) − R̄ + V(s′) − V(s) ),

where V(s) is the old estimate of the value function, V′(s) the updated one, and α is a learning rate. This
update step is called the TD(0)-algorithm in the dis-
counted reward case. In order to perform action selec-
tion a model of the transition function is still required.
The equivalent temporal difference learning algo-
rithm for state-action value functions is the average
reward case version of SARSA with
    Q′(s, a) = Q(s, a) + α( R(s, a) − R̄ + Q(s′, a′) − Q(s, a) ),

where Q(s, a) is the old estimate of the state-action value function and Q′(s, a) the updated one. This algorithm is on-policy, as both the current action a as well as the subsequent action a′ are chosen according to the current policy π. The off-policy variant is called R-learning (Schwartz, 1993), which is closely related to Q-learning, with the updates

    Q′(s, a) = Q(s, a) + α( R(s, a) − R̄ + max_{a′} Q(s′, a′) − Q(s, a) ).
These methods do not require a model of the transi-
tion function for determining the deterministic optimal policy π∗(s). H-learning (Tadepalli and Ok, 1994) is
a related method that estimates a model of the tran-
sition probabilities and the reward function in order
to perform updates that are reminiscent of value iter-
ation.
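The tabular updates above translate directly into code. The following sketch applies the average-reward SARSA update to a single observed transition (s, a, R, s′, a′); the step sizes and the running estimate used to track the average reward R̄ are illustrative choices, not prescribed by the text.

    def sarsa_update(Q, avg_reward, s, a, r, s_next, a_next,
                     alpha=0.1, beta=0.01):
        # Average-reward SARSA: Q'(s,a) = Q(s,a) + alpha*(R - Rbar + Q(s',a') - Q(s,a)).
        td_error = r - avg_reward + Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
        # Slowly track the average reward Rbar with a second learning rate.
        avg_reward += beta * (r - avg_reward)
        return Q, avg_reward

    Q, rbar = {}, 0.0
    Q, rbar = sarsa_update(Q, rbar, s=0, a="right", r=0.5, s_next=1, a_next="right")
    print(Q, rbar)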
An overview of publications using value function
based methods is presented in Table 1. Here, model-
based methods refer to all methods that employ a
predetermined or a learned model of system dynam-
ics.
2.2.2 Policy Search
The primal formulation of the problem in terms of policy rather than value offers many features relevant to
robotics. It allows for a natural integration of expert
knowledge, e.g., through both structure and initializa-
tions of the policy. It allows domain-appropriate pre-
structuring of the policy in an approximate form with-
out changing the original problem. Optimal policies
often have many fewer parameters than optimal value
functions. For example, in linear quadratic control,
the value function has quadratically many parameters
in the dimensionality of the state-variables while the
policy requires only linearly many parameters. Local
search in policy space can directly lead to good results
as exhibited by early hill-climbing approaches (Kirk,
1970), as well as more recent successes (see Table 2).
Additional constraints can be incorporated naturally,
e.g., regularizing the change in the path distribution.
As a result, policy search often appears more natural
to robotics.
Nevertheless, policy search has been considered the
harder problem for a long time as the optimal solution
cannot directly be determined from Equations (1-3)
while the solution of the dual problem leveraging Bell-
man Principle of Optimality (Bellman, 1957) enables
dynamic programming based solutions.
Notwithstanding this, in robotics, policy search has
recently become an important alternative to value
function based methods due to better scalability as
well as the convergence problems of approximate value
function methods (see Sections 2.3 and 4.2). Most pol-
icy search methods optimize locally around existing
policies π, parametrized by a set of policy parameters
θ_i, by computing changes ∆θ_i in the policy parameters that will increase the expected return, resulting in iterative updates of the form

    θ_{i+1} = θ_i + ∆θ_i.
The computation of the policy update is the key
step here and a variety of updates have been pro-
posed ranging from pairwise comparisons (Strens and
Moore, 2001; Ng et al., 2004a) through gradient estima-
tion using finite policy differences (Geng et al., 2006;
Kohl and Stone, 2004; Mitsunaga et al., 2005; Roberts
et al., 2010; Sato et al., 2002; Tedrake et al., 2005),
and general stochastic optimization methods (such as
Nelder-Mead (Bagnell and Schneider, 2001), cross en-
tropy (Rubinstein and Kroese, 2004) and population-
based methods (Goldberg, 1989)) to approaches com-
ing from optimal control such as differential dynamic
programming (DDP) (Atkeson, 1998) and multiple
shooting approaches (Betts, 2001). We may broadly
break down policy-search methods into “black box”
and “white box” methods. Black box methods are gen-
eral stochastic optimization algorithms (Spall, 2003)
using only the expected return of policies, estimated by
sampling, and do not leverage any of the internal struc-
ture of the RL problem. These may be very sophisti-
cated techniques (Tesch et al., 2011) that use response
surface estimates and bandit-like strategies to achieve
good performance. White box methods take advan-
tage of some of the additional structure within the rein-
forcement learning domain, including, for instance, the
(approximate) Markov structure of problems, devel-
oping approximate models, value-function estimates
when available (Peters and Schaal, 2008c), or even
simply the causal ordering of actions and rewards. A
major open issue within the field is the relative mer-
its of these two approaches: in principle, white
box methods leverage more information, but with the
exception of models (which have been demonstrated
repeatedly to often make tremendous performance im-
provements, see Section 6), the performance gains are
Value Function Approaches
Approach Employed by. . .
Model-Based Bakker et al. (2006); Hester et al. (2010, 2012); Kalmár et al. (1998); Martínez-Marín
and Duckett (2005); Schaal (1996); Touzet (1997)
Model-Free Asada et al. (1996); Bakker et al. (2003); Benbrahim et al. (1992); Benbrahim and
Franklin (1997); Birdwell and Livingston (2007); Bitzer et al. (2010); Conn and
Peters II (2007); Duan et al. (2007, 2008); Fagg et al. (1998); Gaskett et al. (2000);
Gräve et al. (2010); Hafner and Riedmiller (2007); Huang and Weng (2002); Huber
and Grupen (1997); Ilg et al. (1999); Katz et al. (2008); Kimura et al. (2001);
Kirchner (1997); Konidaris et al. (2011a, 2012); Kroemer et al. (2009, 2010); Kwok
and Fox (2004); Latzke et al. (2007); Mahadevan and Connell (1992); Matarić (1997);
Morimoto and Doya (2001); Nemec et al. (2009, 2010); Oßwald et al. (2010); Paletta
et al. (2007); Pendrith (1999); Platt et al. (2006); Riedmiller et al. (2009); Rottmann
et al. (2007); Smart and Kaelbling (1998, 2002); Soni and Singh (2006); Tamošiūnaitė
et al. (2011); Thrun (1995); Tokic et al. (2009); Touzet (1997); Uchibe et al. (1998);
Wang et al. (2006); Willgoss and Iqbal (1999)
Table 1: This table illustrates different value function based reinforcement learning methods employed for robotic tasks
(both average and discounted reward cases) and associated publications.
traded off against additional assumptions that may be vi-
olated and less mature optimization algorithms. Some
recent work including (Stulp and Sigaud, 2012; Tesch
et al., 2011) suggest that much of the benefit of policy
search is achieved by black-box methods.
Some of the most popular white-box general re-
inforcement learning techniques that have translated
particularly well into the domain of robotics include:
(i) policy gradient approaches based on likelihood-
ratio estimation (Sutton et al., 1999), (ii) policy up-
dates inspired by expectation-maximization (Tous-
saint et al., 2010), and (iii) the path integral methods
(Kappen, 2005).
Let us briefly take a closer look at gradient-based
approaches first. The updates of the policy parameters
are based on a hill-climbing approach, that is, following the gradient of the expected return J for a defined step-size α:

    θ_{i+1} = θ_i + α ∇_θ J.

Different methods exist for estimating the gradient ∇_θ J, and many algorithms require tuning of the step-size α.
In finite difference gradients, P perturbed policy parameters are evaluated to obtain an estimate of the gradient. Here we have ∆Ĵ_p ≈ J(θ_i + ∆θ_p) − J_ref, where p = [1..P] are the individual perturbations, ∆Ĵ_p the estimate of their influence on the return, and J_ref is a reference return, e.g., the return of the unperturbed parameters. The gradient can now be estimated by linear regression

    ∇_θ J ≈ (∆Θ^T ∆Θ)^{-1} ∆Θ^T ∆Ĵ,

where the matrix ∆Θ contains all the stacked samples of the perturbations ∆θ_p and ∆Ĵ contains the corresponding ∆Ĵ_p. In order to estimate the gradient, the number of perturbations needs to be at least as large as the number of parameters. The approach is very straightforward and even applicable to policies that are not differentiable. However, it is usually considered to be very noisy and inefficient. For the finite difference approach, tuning the step-size α for the update, the number of perturbations P, and the type and magnitude of the perturbations are all critical tuning factors.
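A minimal sketch of this finite difference gradient estimator is shown below; the rollout routine, perturbation magnitude, and number of perturbations are hypothetical placeholders. P random perturbations of the parameters are rolled out and the gradient is recovered from the perturbations and return differences by least squares.

    import numpy as np

    def finite_difference_gradient(rollout, theta, num_perturbations=20, scale=0.05):
        # rollout(theta) -> scalar return J(theta); theta is a parameter vector.
        j_ref = rollout(theta)
        delta_thetas = scale * np.random.randn(num_perturbations, theta.size)
        delta_js = np.array([rollout(theta + dt) - j_ref for dt in delta_thetas])
        # Least-squares estimate of the gradient from perturbations and return differences.
        grad, *_ = np.linalg.lstsq(delta_thetas, delta_js, rcond=None)
        return grad

    # Toy usage with a known quadratic "return" whose true gradient at theta is -2*theta.
    toy_return = lambda th: -np.sum(th ** 2)
    theta = np.array([1.0, -0.5])
    theta = theta + 0.1 * finite_difference_gradient(toy_return, theta)  # one update step
    print(theta)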
Likelihood ratio methods rely on the insight that, in an episodic setting where the episodes τ are generated according to the distribution P^θ(τ) = P(τ|θ), with the return of an episode J^τ = ∑_{h=1}^{H} R_h and number of steps H, the expected return for a set of policy parameters θ can be expressed as

    J^θ = ∑_τ P^θ(τ) J^τ.                                              (5)

The gradient of the episode distribution can be written as⁴

    ∇_θ P^θ(τ) = P^θ(τ) ∇_θ log P^θ(τ),                                (6)

which is commonly known as the likelihood ratio or REINFORCE (Williams, 1992) trick. Combining Equations (5) and (6) we get the gradient of the expected return in the form

    ∇_θ J^θ = ∑_τ ∇_θ P^θ(τ) J^τ = ∑_τ P^θ(τ) ∇_θ log P^θ(τ) J^τ = E{ ∇_θ log P^θ(τ) J^τ }.

⁴ From multi-variate calculus we have ∇_θ log P^θ(τ) = ∇_θ P^θ(τ)/P^θ(τ).
If we have a stochastic policy π^θ(s, a) that generates the episodes τ, we do not need to keep track of the probabilities of the episodes but can directly express the gradient in terms of the policy as ∇_θ log P^θ(τ) = ∑_{h=1}^{H} ∇_θ log π^θ(s_h, a_h). Finally, the gradient of the expected return with respect to the policy parameters can be estimated as

    ∇_θ J^θ = E{ ∑_{h=1}^{H} ∇_θ log π^θ(s_h, a_h) J^τ }.
If we now take into account that rewards at the beginning of an episode cannot be caused by actions taken at the end of an episode, we can replace the return of the episode J^τ by the state-action value function Q^π(s, a) and get (Peters and Schaal, 2008c)

    ∇_θ J^θ = E{ ∑_{h=1}^{H} ∇_θ log π^θ(s_h, a_h) Q^π(s_h, a_h) },
which is equivalent to the policy gradient theorem (Sut-
ton et al., 1999). In practice, it is often advisable to
subtract a reference J_ref, also called a baseline, from the return of the episode J^τ or the state-action value function Q^π(s, a), respectively, to get better estimates, similar to the finite difference approach. In these settings,
the exploration is automatically taken care of by the
stochastic policy.
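The episodic likelihood ratio estimator can be sketched in a few lines for a Gaussian policy with a one-dimensional action and linear mean; the environment data and step size below are hypothetical stand-ins. Per episode, the score ∇_θ log π^θ(s_h, a_h) is summed over the steps and weighted by the baseline-corrected return.

    import numpy as np

    def reinforce_gradient(episodes, theta, sigma=0.1):
        # Gaussian policy: a ~ N(theta^T s, sigma^2), so
        # grad_theta log pi(s, a) = (a - theta^T s) * s / sigma^2.
        grads, returns = [], []
        for episode in episodes:                       # episode: [(s, a, r), ...]
            score = np.zeros_like(theta)
            for s, a, _ in episode:
                s = np.asarray(s, dtype=float)
                score += (a - theta @ s) * s / sigma ** 2
            grads.append(score)
            returns.append(sum(r for _, _, r in episode))
        baseline = np.mean(returns)                    # simple baseline J_ref
        return np.mean([g * (ret - baseline) for g, ret in zip(grads, returns)], axis=0)

    # Hypothetical data: two short episodes with 2-dimensional states.
    theta = np.zeros(2)
    episodes = [[([1.0, 0.0], 0.3, 1.0)], [([0.0, 1.0], -0.2, 0.0)]]
    theta = theta + 0.01 * reinforce_gradient(episodes, theta)
    print(theta)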
Initial gradient-based approaches such as finite dif-
ference gradients or REINFORCE (REward Incre-
ment = Nonnegative Factor times Offset Reinforce-
ment times Characteristic Eligibility) (Williams, 1992)
have been rather slow. The weight perturbation
algorithm is related to REINFORCE but can deal
with non-Gaussian distributions which significantly
improves the signal to noise ratio of the gradient
(Roberts et al., 2010). Recent natural policy gradient
approaches (Peters and Schaal, 2008c,b) have allowed
for faster convergence which may be advantageous for
robotics as it reduces the learning time and required
real-world interactions.
A different class of safe and fast policy search meth-
ods, that are inspired by expectation-maximization,
can be derived when the reward is treated as an im-
proper probability distribution (Dayan and Hinton,
1997). Some of these approaches have proven success-
ful in robotics, e.g., reward-weighted regression (Peters
and Schaal, 2008a), Policy Learning by Weighting Ex-
ploration with the Returns (Kober and Peters, 2009),
Monte Carlo Expectation-Maximization (Vlassis et al.,
2009), and Cost-regularized Kernel Regression (Kober
et al., 2010). Algorithms with closely related update
rules can also be derived from different perspectives
including Policy Improvements with Path Integrals
(Theodorou et al., 2010) and Relative Entropy Policy
Search (Peters et al., 2010a).
Finally, the Policy Search by Dynamic Programming
(Bagnell et al., 2003) method is a general strategy that
combines policy search with the principle of optimality.
The approach learns a non-stationary policy backward
in time like dynamic programming methods, but does
not attempt to enforce the Bellman equation and the
resulting approximation instabilities (see Section 2.4).
The resulting approach provides some of the strongest
guarantees that are currently known under function
approximation and limited observability. It has been
demonstrated in learning walking controllers and in
finding near-optimal trajectories for map exploration
(Kollar and Roy, 2008). The resulting method is more
expensive than the value function methods because it
scales quadratically in the effective time horizon of the
problem. Like DDP methods (Atkeson, 1998), it is tied
to a non-stationary (time-varying) policy.
An overview of publications using policy search
methods is presented in Table 2.
One of the key open issues in the field is determining
when it is appropriate to use each of these methods.
Some approaches leverage significant structure specific
to the RL problem (e.g. (Theodorou et al., 2010)), in-
cluding reward structure, Markovianity, causality of re-
ward signals (Williams, 1992), and value-function esti-
mates when available (Peters and Schaal, 2008c). Oth-
ers embed policy search as a generic, black-box, prob-
lem of stochastic optimization (Bagnell and Schneider,
2001; Lizotte et al., 2007; Kuindersma et al., 2011;
Tesch et al., 2011). Significant open questions remain
regarding which methods are best in which circum-
stances and further, at an even more basic level, how
effective leveraging the kinds of problem structures mentioned above is in practice.
2.3 Value Function Approaches versus
Policy Search
Some methods attempt to find a value function or pol-
icy which eventually can be employed without signif-
icant further computation, whereas others (e.g., the
roll-out methods) perform the same amount of com-
putation each time.
If a complete optimal value function is known, a
globally optimal solution follows simply by greed-
ily choosing actions to optimize it. However, value-
function based approaches have thus far been difficult
to translate into high dimensional robotics as they re-
quire function approximation for the value function.
Most theoretical guarantees no longer hold for this ap-
proximation and even finding the optimal action can
be a hard problem due to the brittleness of the ap-
proximation and the cost of optimization. For high
dimensional actions, it can be as hard to compute an
improved policy for all states in policy search as find-
ing a single optimal action on-policy for one state by
searching the state-action value function.
In principle, a value function requires total cover-
age of the state space and the largest local error de-
termines the quality of the resulting policy. A par-
ticularly significant problem is the error propagation
in value functions. A small change in the policy may
cause a large change in the value function, which again
causes a large change in the policy. While this may
lead more quickly to good, possibly globally optimal
solutions, such learning processes often prove unsta-
ble under function approximation (Boyan and Moore,
1995; Kakade and Langford, 2002; Bagnell et al., 2003)
and are considerably more dangerous when applied to
real systems where overly large policy deviations may
lead to dangerous decisions.
In contrast, policy search methods usually only con-
sider the current policy and its neighborhood in or-
der to gradually improve performance. The result is
that usually only local optima, and not the global one,
can be found. However, these methods work well in
conjunction with continuous features. Local coverage
and local errors result in improved scalability in
robotics.
Policy search methods are sometimes called actor-
only methods; value function methods are sometimes
Policy Search
Approach: Employed by…
Gradient: Deisenroth and Rasmussen (2011); Deisenroth et al. (2011); Endo et al. (2008); Fidelman and Stone (2004); Geng et al. (2006); Guenter et al. (2007); Gullapalli et al. (1994); Hailu and Sommer (1998); Ko et al. (2007); Kohl and Stone (2004); Kolter and Ng (2009a); Michels et al. (2005); Mitsunaga et al. (2005); Miyamoto et al. (1996); Ng et al. (2004a,b); Peters and Schaal (2008c,b); Roberts et al. (2010); Rosenstein and Barto (2004); Tamei and Shibata (2009); Tedrake (2004); Tedrake et al. (2005)
Other: Abbeel et al. (2006, 2007); Atkeson and Schaal (1997); Atkeson (1998); Bagnell and Schneider (2001); Bagnell (2004); Buchli et al. (2011); Coates et al. (2009); Daniel et al. (2012); Donnart and Meyer (1996); Dorigo and Colombetti (1993); Erden and Leblebicioğlu (2008); Kalakrishnan et al. (2011); Kober and Peters (2009); Kober et al. (2010); Kolter et al. (2008); Kuindersma et al. (2011); Lizotte et al. (2007); Matarić (1994); Pastor et al. (2011); Peters and Schaal (2008a); Peters et al. (2010a); Schaal and Atkeson (1994); Stulp et al. (2011); Svinin et al. (2001); Tamošiūnaitė et al. (2011); Yasuda and Ohkura (2008); Youssef (2005)
Table 2: This table illustrates different policy search reinforcement learning methods employed for robotic tasks and associated publications.
called critic-only methods. The idea of a critic is to
first observe and estimate the performance of choosing
controls on the system (i.e., the value function), then
derive a policy based on the gained knowledge. In
contrast, the actor directly tries to deduce the optimal
policy. A set of algorithms called actor-critic meth-
ods attempt to incorporate the advantages of each: a
policy is explicitly maintained, as is a value-function
for the current policy. The value function (i.e., the
critic) is not employed for action selection. Instead,
it observes the performance of the actor and decides
when the policy needs to be updated and which action
should be preferred. The resulting update step fea-
tures the local convergence properties of policy gradi-
ent algorithms while reducing update variance (Green-
smith et al., 2004). There is a trade-off between the
benefit of reducing the variance of the updates and
having to learn a value function as the samples re-
quired to estimate the value function could also be
employed to obtain better gradient estimates for the
update step. Rosenstein and Barto (2004) propose an
actor-critic method that additionally features a super-
visor in the form of a stable policy.
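To make the actor-critic structure concrete, the following minimal sketch shows a single update with a linear critic and a Gaussian policy over linear features; it is purely illustrative (the feature vectors, step sizes, and exploration noise are assumptions, not taken from any of the cited implementations).

```python
import numpy as np

def actor_critic_update(theta, w, phi_s, phi_s_next, action, reward,
                        alpha_actor=1e-3, alpha_critic=1e-2,
                        gamma=0.99, sigma=0.1):
    """One step of a linear actor-critic: V(s) = w^T phi(s),
    a ~ N(theta^T phi(s), sigma^2)."""
    # Temporal-difference error computed by the critic
    delta = reward + gamma * w.dot(phi_s_next) - w.dot(phi_s)
    # Critic: move the value estimate towards the TD target
    w = w + alpha_critic * delta * phi_s
    # Actor: policy-gradient step in which the TD error replaces the
    # sampled return, reducing the variance of the update
    grad_log_pi = (action - theta.dot(phi_s)) / sigma**2 * phi_s
    theta = theta + alpha_actor * delta * grad_log_pi
    return theta, w
```

Note that in this sketch the critic is used only to shape the actor's update, never to select actions, in line with the division of roles described above.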
2.4 Function Approximation
Function approximation (Rivlin, 1969) is a family of
mathematical and statistical techniques used to rep-
resent a function of interest when it is computation-
ally or information-theoretically intractable to repre-
sent the function exactly or explicitly (e.g. in tabular
form). Typically, in reinforcement learning the func-
tion approximation is based on sample data collected
during interaction with the environment. Function ap-
proximation is critical in nearly every RL problem, and
becomes inevitable in continuous state ones. In large
discrete spaces it is also often impractical to visit or
even represent all states and actions, and function ap-
proximation in this setting can be used as a means to
generalize to neighboring states and actions.
Function approximation can be employed to rep-
resent policies, value functions, and forward mod-
els. Broadly speaking, there are two kinds of func-
tion approximation methods: parametric and non-
parametric. A parametric function approximator uses
a finite set of parameters or arguments, with the goal of finding parameters that make this approximation fit the observed data as closely as possible. Examples in-
clude linear basis functions and neural networks. In
contrast, non-parametric methods expand representa-
tional power in relation to collected data and hence
are not limited by the representation power of a cho-
sen parametrization (Bishop, 2006). A prominent ex-
ample that has found much use within reinforcement
learning is Gaussian process regression (Rasmussen
and Williams, 2006). A fundamental problem with us-
ing supervised learning methods developed in the lit-
erature for function approximation is that most such
methods are designed for independently and identi-
cally distributed sample data. However, the data gen-
erated by the reinforcement learning process is usually
neither independent nor identically distributed. Usu-
ally, the function approximator itself plays some role
in the data collection process (for instance, by serving
to define a policy that we execute on a robot).
Linear basis function approximators form one of the
most widely used approximate value function tech-
niques in continuous (and discrete) state spaces. This
is largely due to the simplicity of their representa-
tion as well as a convergence theory, albeit limited, for
the approximation of value functions based on samples
(Tsitsiklis and Van Roy, 1997). Let us briefly take a
closer look at a radial basis function network to illus-
trate this approach. The value function maps states to
a scalar value. The state space can be covered by a grid
of points, each of which corresponds to the center of a
Gaussian-shaped basis function. The value of the ap-
proximated function is the weighted sum of the values
of all basis functions at the query point. As the in-
fluence of the Gaussian basis functions drops rapidly,
the value of the query points will be predominantly
influenced by the neighboring basis functions. The
weights are set in a way to minimize the error between
the observed samples and the reconstruction. For the
mean squared error, these weights can be determined
by linear regression. Kolter and Ng (2009b) discuss
the benefits of regularization of such linear function
approximators to avoid over-fitting.
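As an illustration of this construction, the following sketch fits a radial basis function network to sampled value targets on a one-dimensional state space by ordinary least squares; the grid, bandwidth, and target data are arbitrary assumptions.

```python
import numpy as np

centers = np.linspace(0.0, 1.0, 10)   # grid of Gaussian basis-function centers
width = 0.1                            # common bandwidth (an assumption)

def features(s):
    # value of every basis function at the query state s
    return np.exp(-0.5 * ((s - centers) / width) ** 2)

# sampled states with associated value estimates (stand-in data)
states = np.random.rand(200)
targets = np.sin(2.0 * np.pi * states)

# weights that minimize the mean squared reconstruction error
Phi = np.vstack([features(s) for s in states])
weights, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

def value(s):
    # weighted sum of the basis-function activations at the query point
    return weights.dot(features(s))
```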
Other possible function approximators for value
functions include wire fitting, which Baird and Klopf (1993) suggested as an approach that makes continuous action selection feasible. The Fourier basis was suggested by Konidaris et al. (2011b). Even dis-
cretizing the state-space can be seen as a form of func-
tion approximation where coarse values serve as es-
timates for a smooth continuous function. One ex-
ample is tile coding (Sutton and Barto, 1998), where
the space is subdivided into (potentially irregularly
shaped) regions, called tiles; one such partition is a tiling. The number of differ-
ent tilings determines the resolution of the final ap-
proximation. For more examples, please refer to Sec-
tions 4.1 and 4.2.
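A minimal sketch of tile coding for a one-dimensional state (regular, axis-aligned tilings only; the number of tilings, resolution, and state range are arbitrary choices) could look as follows.

```python
import numpy as np

def active_tiles(s, n_tilings=4, tiles_per_dim=8, low=0.0, high=1.0):
    """Return the index of the active tile in each tiling for a scalar state s.
    Each tiling is a uniform grid shifted by a different offset."""
    indices = []
    tile_width = (high - low) / tiles_per_dim
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings           # stagger the tilings
        idx = int((s - low + offset) / tile_width)
        idx = min(max(idx, 0), tiles_per_dim)          # clip at the boundary
        indices.append(t * (tiles_per_dim + 1) + idx)  # unique index per tiling
    return indices

# The value estimate is the sum of one learned weight per active tile, so the
# resolution of the approximation follows directly from the number of tilings.
weights = np.zeros(4 * 9)
value = lambda s: sum(weights[i] for i in active_tiles(s))
```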
Policy search also benefits from a compact represen-
tation of the policy as discussed in Section 4.3.
Models of the system dynamics can be represented
using a wide variety of techniques. In this case, it is
often important to model the uncertainty in the model
(e.g., by a stochastic model or Bayesian estimates of
model parameters) to ensure that the learning algo-
rithm does not exploit model inaccuracies. See Sec-
tion 6 for a more detailed discussion.
3 Challenges in Robot
Reinforcement Learning
Reinforcement learning is generally a hard problem
and many of its challenges are particularly apparent
in the robotics setting. As the states and actions of
most robots are inherently continuous, we are forced to
consider the resolution at which they are represented.
We must decide how fine grained the control is that we
require over the robot, whether we employ discretiza-
tion or function approximation, and what time step we
establish. Additionally, as the dimensionality of both
states and actions can be high, we face the “Curse of
Dimensionality” (Bellman, 1957) as discussed in Sec-
tion 3.1. As robotics deals with complex physical sys-
tems, samples can be expensive due to the long ex-
ecution time of complete tasks, required manual in-
terventions, and the need for maintenance and repair. In
these real-world measurements, we must cope with the
uncertainty inherent in complex physical systems. A
robot requires that the algorithm runs in real-time.
The algorithm must be capable of dealing with delays
in sensing and execution that are inherent in physi-
cal systems (see Section 3.2). A simulation might al-
leviate many problems but these approaches need to
be robust with respect to model errors as discussed
in Section 3.3. An often underestimated problem is
the goal specification, which is achieved by designing
a good reward function. As noted in Section 3.4, this
choice can make the difference between feasibility and
Figure 3: This Figure illustrates the state space used in
the modeling of a robot reinforcement learning task of pad-
dling a ball.
an unreasonable amount of exploration.
3.1 Curse of Dimensionality
When Bellman (1957) explored optimal control in dis-
crete high-dimensional spaces, he faced an exponential
explosion of states and actions for which he coined the
term “Curse of Dimensionality”. As the number of di-
mensions grows, exponentially more data and compu-
tation are needed to cover the complete state-action
space. For example, if we assume that each dimension
of a state-space is discretized into ten levels, we have
10 states for a one-dimensional state-space, 10^3 = 1000 unique states for a three-dimensional state-space, and 10^n possible states for an n-dimensional state space.
Evaluating every state quickly becomes infeasible with
growing dimensionality, even for discrete states. Bell-
man originally coined the term in the context of opti-
mization, but it also applies to function approximation
and numerical integration (Donoho, 2000). While su-
pervised learning methods have tamed this exponen-
tial growth by considering only competitive optimality
with respect to a limited class of function approxima-
tors, such results are much more difficult in reinforce-
ment learning where data must be collected throughout the state-space to ensure global optimality.
Robotic systems often have to deal with these high
dimensional states and actions due to the many de-
grees of freedom of modern anthropomorphic robots.
For example, in the ball-paddling task shown in Fig-
ure 3, a proper representation of a robot’s state would
consist of its joint angles and velocities for each of its
seven degrees of freedom as well as the Cartesian po-
sition and velocity of the ball. The robot’s actions
would be the generated motor commands, which often
are torques or accelerations. In this example, we have
2×(7 + 3) = 20 state dimensions and 7-dimensional
continuous actions. Obviously, other tasks may re-
quire even more dimensions. For example, human-
like actuation often follows the antagonistic principle
(Yamaguchi and Takanishi, 1997) which additionally
enables control of stiffness. Such dimensionality is a
major challenge for both the robotics and the rein-
forcement learning communities.
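A back-of-the-envelope calculation makes the scale of the paddling example explicit; the choice of ten discretization levels per dimension is of course arbitrary.

```python
levels = 10                  # discretization levels per dimension (arbitrary)
state_dims = 2 * (7 + 3)     # joint angles/velocities plus ball position/velocity
action_dims = 7              # one motor command per degree of freedom

print(levels ** state_dims)                  # 10**20 discrete states
print(levels ** (state_dims + action_dims))  # 10**27 state-action pairs
```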
In robotics, such tasks are often rendered tractable
to the robot engineer by a hierarchical task decom-
position that shifts some complexity to a lower layer
of functionality. Classical reinforcement learning ap-
proaches often consider a grid-based representation
with discrete states and actions, often referred to as
a grid-world. A navigational task for mobile robots
could be projected into this representation by employ-
ing a number of actions like “move to the cell to the
left” that use a lower level controller that takes care
of accelerating, moving, and stopping while ensuring
precision. In the ball-paddling example, we may sim-
plify by controlling the robot in racket space (which is
lower-dimensional as the racket is orientation-invariant
around the string’s mounting point) with an opera-
tional space control law (Nakanishi et al., 2008). Many
commercial robot systems also encapsulate some of the
state and action components in an embedded control
system (e.g., trajectory fragments are frequently used
as actions for industrial robots). However, this form
of a state dimensionality reduction severely limits the
dynamic capabilities of the robot according to our ex-
perience (Schaal et al., 2002; Peters et al., 2010b).
The reinforcement learning community has a long
history of dealing with dimensionality using computa-
tional abstractions. It offers a larger set of applicable
tools ranging from adaptive discretizations (Buşoniu
et al., 2010) and function approximation approaches
(Sutton and Barto, 1998) to macro-actions or op-
tions (Barto and Mahadevan, 2003; Hart and Grupen,
2011). Options allow a task to be decomposed into
elementary components and quite naturally translate
to robotics. Such options can autonomously achieve a
sub-task, such as opening a door, which reduces the
planning horizon (Barto and Mahadevan, 2003). The
automatic generation of such sets of options is a key
issue in order to enable such approaches. We will dis-
cuss approaches that have been successful in robot re-
inforcement learning in Section 4.
3.2 Curse of Real-World Samples
Robots inherently interact with the physical world.
Hence, robot reinforcement learning suffers from most
of the resulting real-world problems. For example,
robot hardware is usually expensive, suffers from wear
and tear, and requires careful maintenance. Repair-
ing a robot system is a non-negligible effort associ-
ated with cost, physical labor and long waiting peri-
ods. To apply reinforcement learning in robotics, safe
exploration becomes a key issue of the learning process
(Schneider, 1996; Bagnell, 2004; Deisenroth and Ras-
mussen, 2011; Moldovan and Abbeel, 2012), a problem
often neglected in the general reinforcement learning
community. Perkins and Barto (2002) have come up
with a method for constructing reinforcement learn-
ing agents based on Lyapunov functions. Switching
between the underlying controllers is always safe and
offers basic performance guarantees.
However, several more aspects of the real-world
make robotics a challenging domain. As the dynamics
of a robot can change due to many external factors
ranging from temperature to wear, the learning pro-
cess may never fully converge, i.e., it needs a “tracking
solution” (Sutton et al., 2007). Frequently, the en-
vironment settings during an earlier learning period
cannot be reproduced. External factors are not al-
ways clear – for example, how light conditions affect
the performance of the vision system and, as a result,
the task’s performance. This problem makes compar-
ing algorithms particularly hard. Furthermore, the ap-
proaches often have to deal with uncertainty due to in-
herent measurement noise and the inability to observe
all states directly with sensors.
Most real robot learning tasks require some form
of human supervision, e.g., putting the pole back on
the robot’s end-effector during pole balancing (see Fig-
ure 1d) after a failure. Even when an automatic reset
exists (e.g., by having a smart mechanism that resets
the pole), learning speed becomes essential as a task
on a real robot cannot be sped up. In some tasks like
a slowly rolling robot, the dynamics can be ignored;
in others like a flying robot, they cannot. Especially
in the latter case, often the whole episode needs to be
completed as it is not possible to start from arbitrary
states.
For such reasons, real-world samples are expensive
in terms of time, labor and, potentially, finances. In
robotic reinforcement learning, it is often considered
to be more important to limit the real-world interac-
tion time instead of limiting memory consumption or
computational complexity. Thus, sample efficient al-
gorithms that are able to learn from a small number
of trials are essential. In Section 6 we will point out
several approaches that allow the amount of required
real-world interactions to be reduced.
Since the robot is a physical system, there are strict
constraints on the interaction between the learning al-
gorithm and the robot setup. For dynamic tasks, the
movement cannot be paused and actions must be se-
lected within a time-budget without the opportunity
to pause to think, learn or plan between actions. These
constraints are less severe in an episodic setting where
the time intensive part of the learning can be post-
poned to the period between episodes. Hester et al.
(2012) have proposed a real-time architecture for model-
based value function reinforcement learning methods
taking into account these challenges.
As reinforcement learning algorithms are inherently
implemented on a digital computer, the discretiza-
tion of time is unavoidable even though physical systems are inherently continuous-time systems. Time-
discretization of the actuation can generate undesir-
able artifacts (e.g., the distortion of distance between
states) even for idealized physical systems, which can-
not be avoided. As most robots are controlled at fixed
sampling frequencies (in the range between 500Hz and
3kHz) determined by the manufacturer of the robot,
the upper bound on the rate of temporal discretization
is usually pre-determined. The lower bound depends
on the horizon of the problem, the achievable speed of
changes in the state, as well as delays in sensing and
actuation.
All physical systems exhibit such delays in sensing
and actuation. The state of the setup (represented by
the filtered sensor signals) may frequently lag behind
the real state due to processing and communication de-
lays. More critically, there are also communication de-
lays in actuation as well as delays due to the fact that
neither motors, gear boxes nor the body’s movement
can change instantly. Due to these delays, actions may
not have instantaneous effects but are observable only
several time steps later. In contrast, in most general
reinforcement learning algorithms, the actions are as-
sumed to take effect instantaneously as such delays
would violate the usual Markov assumption. This ef-
fect can be addressed by putting some number of re-
cent actions into the state. However, this significantly
increases the dimensionality of the problem.
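The state-augmentation idea can be sketched as a thin wrapper around a generic environment interface; the reset/step interface (reset returning an observation, step returning observation, reward, and a termination flag), the class name, and the fixed history length k are illustrative assumptions only.

```python
from collections import deque
import numpy as np

class DelayAwareWrapper:
    """Augments the observation with the k most recent actions so that
    actuation delays of up to k steps no longer violate the Markov assumption."""

    def __init__(self, env, k, action_dim):
        self.env = env
        self.recent = deque([np.zeros(action_dim) for _ in range(k)], maxlen=k)

    def reset(self):
        for a in self.recent:
            a[:] = 0.0
        obs = self.env.reset()
        return np.concatenate([obs, *self.recent])

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.recent.append(np.asarray(action, dtype=float))
        # The learner now sees the raw sensor state plus the action history,
        # which increases the dimensionality by k * action_dim.
        return np.concatenate([obs, *self.recent]), reward, done
```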
The problems related to time-budgets and delays
can also be avoided by increasing the duration of the
time steps. One downside of this approach is that the
robot cannot be controlled as precisely; another is that
it may complicate a description of system dynamics.
3.3 Curse of Under-Modeling and Model
Uncertainty
One way to offset the cost of real-world interaction is to
use accurate models as simulators. In an ideal setting,
this approach would render it possible to learn the be-
havior in simulation and subsequently transfer it to the
real robot. Unfortunately, creating a sufficiently accu-
rate model of the robot and its environment is chal-
lenging and often requires a large number of data samples. As
small model errors due to this under-modeling accu-
mulate, the simulated robot can quickly diverge from
the real-world system. When a policy is trained using
an imprecise forward model as simulator, the behav-
ior will not transfer without significant modifications
as experienced by Atkeson (1994) when learning the
underactuated pendulum swing-up. The authors have
achieved a direct transfer in only a limited number of
experiments; see Section 6.1 for examples.
For tasks where the system is self-stabilizing (that
is, where the robot does not require active control
to remain in a safe state or return to it), transfer-
ring policies often works well. Such tasks often fea-
ture some type of dampening that absorbs the energy
introduced by perturbations or control inaccuracies.
If the task is inherently stable, it is safer to assume
that approaches that were applied in simulation work
similarly in the real world (Kober and Peters, 2010).
Nevertheless, tasks can often be learned better in the
real world than in simulation due to complex mechan-
ical interactions (including contacts and friction) that
have proven difficult to model accurately. For exam-
ple, in the ball-paddling task (Figure 3) the elastic
string that attaches the ball to the racket always pulls
back the ball towards the racket even when hit very
hard. Initial simulations (including friction models,
restitution models, dampening models, models for the
elastic string, and air drag) of the ball-racket contacts
indicated that these factors would be very hard to con-
trol. In a real experiment, however, the reflections of
the ball on the racket proved to be less critical than in
simulation and the stabilizing forces due to the elas-
tic string were sufficient to render the whole system
self-stabilizing.
In contrast, in unstable tasks small variations have
drastic consequences. For example, in a pole balanc-
ing task, the equilibrium of the upright pole is very
brittle and constant control is required to stabilize the
system. Transferred policies often perform poorly in
this setting. Nevertheless, approximate models serve
a number of key roles which we discuss in Section 6,
including verifying and testing the algorithms in simu-
lation, establishing proximity to theoretically optimal
solutions, calculating approximate gradients for local
policy improvement, identifying strategies for collecting
more data, and performing “mental rehearsal”.
3.4 Curse of Goal Specification
In reinforcement learning, the desired behavior is im-
plicitly specified by the reward function. The goal of
reinforcement learning algorithms then is to maximize
the accumulated long-term reward. While often dra-
matically simpler than specifying the behavior itself,
in practice, it can be surprisingly difficult to define a
good reward function in robot reinforcement learning.
The learner must observe variance in the reward signal
in order to be able to improve a policy: if the same
return is always received, there is no way to determine
which policy is better or closer to the optimum.
In many domains, it seems natural to provide re-
wards only upon task achievement – for example, when
a table tennis robot wins a match. This view results
in an apparently simple, binary reward specification.
However, a robot may receive such a reward so rarely
that it is unlikely to ever succeed in the lifetime of a
real-world system. Instead of relying on simpler bi-
nary rewards, we frequently need to include interme-
diate rewards in the scalar reward function to guide
the learning process to a reasonable solution, a pro-
cess known as reward shaping (Laud, 2004).
Beyond the need to shorten the effective problem
horizon by providing intermediate rewards, the trade-
off between different factors may be essential. For in-
stance, hitting a table tennis ball very hard may re-
sult in a high score but is likely to damage a robot or
shorten its life span. Similarly, changes in actions may
be penalized to avoid high frequency controls that are
likely to be very poorly captured with tractable low
dimensional state-space or rigid-body models. Rein-
forcement learning algorithms are also notorious for
exploiting the reward function in ways that are not
anticipated by the designer. For example, if the dis-
tance between the ball and the desired highest point
is part of the reward in ball paddling (see Figure 3),
many locally optimal solutions would attempt to sim-
ply move the racket upwards and keep the ball on it.
Reward shaping gives the system a notion of closeness
to the desired behavior instead of relying on a reward
that only encodes success or failure (Ng et al., 1999).
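Potential-based shaping (Ng et al., 1999) is the standard way to add such intermediate rewards without changing which policies are optimal; the sketch below shows the construction, with the potential function for the paddling-like setting being a purely illustrative assumption.

```python
def shaped_reward(r, s, s_next, gamma, potential):
    """Potential-based shaping (Ng et al., 1999): adding
    gamma * Phi(s') - Phi(s) to the reward leaves the optimal policy unchanged."""
    return r + gamma * potential(s_next) - potential(s)

# Illustrative potential for a paddling-like task: closeness of the ball to the
# desired peak height (the dictionary-style state representation is assumed).
desired_height = 1.5
potential = lambda s: -abs(s["ball_height"] - desired_height)
```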
Often the desired behavior can be most naturally
represented with a reward function in a particular
state and action space. However, this representation
does not necessarily correspond to the space where
the actual learning needs to be performed due to both
computational and statistical limitations. Employing
methods to render the learning problem tractable of-
ten result in different, more abstract state and action
spaces which might not allow accurate representation
of the original reward function. In such cases, a reward artfully specified in terms of the features of the space in which the learning algorithm operates can prove re-
markably effective. There is also a trade-off between
the complexity of the reward function and the com-
plexity of the learning problem. For example, in the
ball-in-a-cup task (Section 7) the most natural reward
would be a binary value depending on whether the ball
is in the cup or not. To render the learning problem
tractable, a less intuitive reward needed to be devised
in terms of a Cartesian distance with additional direc-
tional information (see Section 7.1 for details). An-
other example is Crusher (Ratliff et al., 2006a), an
outdoor robot, where the human designer was inter-
ested in a combination of minimizing time and risk to
the robot. However, the robot reasons about the world
on the long time horizon scale as if it was a very sim-
ple, deterministic, holonomic robot operating on a fine
grid of continuous costs. Hence, the desired behavior
cannot be represented straightforwardly in this state-
space. Nevertheless, a remarkably human-like behav-
ior that seems to respect time and risk priorities can
be achieved by carefully mapping features describing
each state (discrete grid location with features com-
puted by an on-board perception system) to cost.
Inverse optimal control, also known as inverse re-
inforcement learning (Russell, 1998), is a promising
alternative to specifying the reward function manu-
ally. It assumes that a reward function can be recon-
structed from a set of expert demonstrations. This
reward function does not necessarily correspond to
the true reward function, but provides guarantees on
the resulting performance of learned behaviors (Abbeel
and Ng, 2004; Ratliff et al., 2006b). Inverse optimal
control was initially studied in the control community
(Kalman, 1964) and in the field of economics (Keeney
and Raiffa, 1976). The initial results were only ap-
plicable to limited domains (linear quadratic regulator
problems) and required closed form access to plant and
controller, hence samples from human demonstrations
could not be used. Russell (1998) brought the field
to the attention of the machine learning community.
Abbeel and Ng (2004) defined an important constraint
on the solution to the inverse RL problem when reward
functions are linear in a set of features: a policy that is
extracted by observing demonstrations has to earn the
same reward as the policy that is being demonstrated.
Ratliff et al. (2006b) demonstrated that inverse op-
timal control can be understood as a generalization
of ideas in machine learning of structured prediction
and introduced efficient sub-gradient based algorithms
with regret bounds that enabled large scale application
of the technique within robotics. Ziebart et al. (2008)
extended the technique developed by Abbeel and Ng
(2004) by rendering the idea robust and probabilis-
tic, enabling its effective use for both learning poli-
cies and predicting the behavior of sub-optimal agents.
These techniques, and many variants, have been re-
cently successfully applied to outdoor robot navigation
(Ratliff et al., 2006a; Silver et al., 2008, 2010), manipu-
lation (Ratliff et al., 2007), and quadruped locomotion
(Ratliff et al., 2006a, 2007; Kolter et al., 2007).
More recently, the notion that complex policies can
be built on top of simple, easily solved optimal con-
trol problems by exploiting rich, parametrized re-
ward functions has been exploited within reinforce-
ment learning more directly. In (Sorg et al., 2010;
Zucker and Bagnell, 2012), complex policies are de-
rived by adapting a reward function for simple opti-
mal control problems using policy search techniques.
Zucker and Bagnell (2012) demonstrate that this tech-
nique can enable efficient solutions to robotic marble-
maze problems that effectively transfer between mazes
of varying design and complexity. These works high-
light the natural trade-off between the complexity of
the reward function and the complexity of the under-
lying reinforcement learning problem for achieving a
desired behavior.
4 Tractability Through
Representation
As discussed above, reinforcement learning provides
a framework for a remarkable variety of problems of
significance to both robotics and machine learning.
However, the computational and information-theoretic
consequences that we outlined above accompany this
power and generality. As a result, naive application of
reinforcement learning techniques in robotics is likely
to be doomed to failure. The remarkable successes
that we reference in this article have been achieved
by leveraging a few key principles – effective repre-
sentations, approximate models, and prior knowledge
or information. In the following three sections, we
review these principles and summarize how each has
been made effective in practice. We hope that under-
standing these broad approaches will lead to new suc-
cesses in robotic reinforcement learning by combining
successful methods and encourage research on novel
techniques that embody each of these principles.
Much of the success of reinforcement learning meth-
ods has been due to the clever use of approximate
representations. The need for such approximations
is particularly pronounced in robotics, where table-
based representations (as discussed in Section 2.2.1)
are rarely scalable. The different ways of making rein-
forcement learning methods tractable in robotics are
tightly coupled to the underlying optimization frame-
work. Reducing the dimensionality of states or ac-
tions by smart state-action discretization is a repre-
sentational simplification that may enhance both pol-
icy search and value function-based methods (see Sec-
tion 4.1). A value function-based approach requires an
accurate and robust but general function approxima-
tor that can capture the value function with sufficient
precision (see Section 4.2) while maintaining stabil-
ity during learning. Policy search methods require a
choice of policy representation that controls the com-
plexity of representable policies to enhance learning
speed (see Section 4.3). An overview of publications
that make particular use of efficient representations to
render the learning problem tractable is presented in
Table 3.
4.1 Smart State-Action Discretization
Decreasing the dimensionality of state or action spaces
eases most reinforcement learning problems signifi-
cantly, particularly in the context of robotics. Here, we
give a short overview of different attempts to achieve
this goal with smart discretization.
Hand Crafted Discretization. A variety of authors
have manually developed discretizations so that ba-
sic tasks can be learned on real robots. For low-
dimensional tasks, we can generate discretizations
straightforwardly by splitting each dimension into a
number of regions. The main challenge is to find the
right number of regions for each dimension that allows
the system to achieve a good final performance while
still learning quickly. Example applications include
balancing a ball on a beam (Benbrahim et al., 1992),
one degree of freedom ball-in-a-cup (Nemec et al.,
2010), two degree of freedom crawling motions (Tokic
et al., 2009), and gait patterns for four legged walking
(Kimura et al., 2001). Much more human experience
is needed for more complex tasks. For example, in a
basic navigation task with noisy sensors (Willgoss and
Iqbal, 1999), only some combinations of binary state
or action indicators are useful (e.g., you can drive left
and forward at the same time, but not backward and
forward). The state space can also be based on vastly
different features, such as positions, shapes, and colors,
when learning object affordances (Paletta et al., 2007)
where both the discrete sets and the mapping from
sensor values to the discrete values need to be crafted.
Kwok and Fox (2004) use a mixed discrete and contin-
uous representation of the state space to learn active
sensing strategies in a RoboCup scenario. They first
discretize the state space along the dimension with
the strongest non-linear influence on the value func-
tion and subsequently employ a linear value function
approximation (Section 4.2) for each of the regions.
Learned from Data. Instead of specifying the dis-
cretizations by hand, they can also be built adap-
tively during the learning process. For example, a
rule based reinforcement learning approach automati-
cally segmented the state space to learn a cooperative
task with mobile robots (Yasuda and Ohkura, 2008).
Each rule is responsible for a local region of the state-
space. The importance of the rules are updated based
on the rewards and irrelevant rules are discarded. If
the state is not covered by a rule yet, a new one is
added. In the related field of computer vision, Pi-
ater et al. (2011) propose an approach that adaptively
and incrementally discretizes a perceptual space into
discrete states, training an image classifier based on
the experience of the RL agent to distinguish visual
classes, which correspond to the states.
Meta-Actions. Automatic construction of meta-
actions (and the closely related concept of options)
has fascinated reinforcement learning researchers and
there are various examples in the literature. The idea
is to have more intelligent actions that are composed
of a sequence of movements and that in themselves
achieve a simple task. A simple example would be to
have a meta-action “move forward 5m.” A lower level
system takes care of accelerating, stopping, and cor-
recting errors. For example, in (Asada et al., 1996),
the state and action sets are constructed in a way that
repeated action primitives lead to a change in the state
to overcome problems associated with the discretiza-
tion. Q-learning and dynamic programming based ap-
proaches have been compared in a pick-n-place task
(Kalmár et al., 1998) using modules. Huber and Gru-
pen (1997) use a set of controllers with associated
predicate states as a basis for learning turning gaits
with a quadruped. Fidelman and Stone (2004) use a
policy search approach to learn a small set of parame-
ters that controls the transition between a walking and
a capturing meta-action in a RoboCup scenario. A
task of transporting a ball with a dog robot (Soni and
Singh, 2006) can be learned with semi-automatically
discovered options. Using only the sub-goals of prim-
itive motions, a humanoid robot can learn a pour-
ing task (Nemec et al., 2009). Other examples in-
clude foraging (Matarić, 1997) and cooperative tasks
(Matarić, 1994) with multiple robots, grasping with
restricted search spaces (Platt et al., 2006), and mo-
bile robot navigation (Dorigo and Colombetti, 1993).
If the meta-actions are not fixed in advance, but rather
learned at the same time, these approaches are hierar-
chical reinforcement learning approaches as discussed
in Section 5.2. Konidaris et al. (2011a, 2012) propose
an approach that constructs a skill tree from human
demonstrations. Here, the skills correspond to options
and are chained to learn a mobile manipulation skill.
Relational Representations. In a relational repre-
sentation, the states, actions, and transitions are not
represented individually. Entities of the same prede-
fined type are grouped and their relationships are con-
sidered. This representation may be preferable for
highly geometric tasks (which frequently appear in
robotics) and has been employed to learn to navigate
buildings with a real robot in a supervised setting (Co-
cora et al., 2006) and to manipulate articulated objects
in simulation (Katz et al., 2008).
4.2 Value Function Approximation
Function approximation has always been the key com-
ponent that allowed value function methods to scale
into interesting domains. In robot reinforcement learn-
ing, the following function approximation schemes
have been popular and successful. Using function
Smart State-Action Discretization
Approach: Employed by…
Hand crafted: Benbrahim et al. (1992); Kimura et al. (2001); Kwok and Fox (2004); Nemec et al. (2010); Paletta et al. (2007); Tokic et al. (2009); Willgoss and Iqbal (1999)
Learned: Piater et al. (2011); Yasuda and Ohkura (2008)
Meta-actions: Asada et al. (1996); Dorigo and Colombetti (1993); Fidelman and Stone (2004); Huber and Grupen (1997); Kalmár et al. (1998); Konidaris et al. (2011a, 2012); Matarić (1994, 1997); Platt et al. (2006); Soni and Singh (2006); Nemec et al. (2009)
Relational Representation: Cocora et al. (2006); Katz et al. (2008)
Value Function Approximation
Approach: Employed by…
Physics-inspired Features: An et al. (1988); Schaal (1996)
Neural Networks: Benbrahim and Franklin (1997); Duan et al. (2008); Gaskett et al. (2000); Hafner and Riedmiller (2003); Riedmiller et al. (2009); Thrun (1995)
Neighbors: Hester et al. (2010); Mahadevan and Connell (1992); Touzet (1997)
Local Models: Bentivegna (2004); Schaal (1996); Smart and Kaelbling (1998)
GPR: Gräve et al. (2010); Kroemer et al. (2009, 2010); Rottmann et al. (2007)
Pre-structured Policies
Approach: Employed by…
Via Points & Splines: Kuindersma et al. (2011); Miyamoto et al. (1996); Roberts et al. (2010)
Linear Models: Tamei and Shibata (2009)
Motor Primitives: Kohl and Stone (2004); Kober and Peters (2009); Peters and Schaal (2008c,b); Stulp et al. (2011); Tamošiūnaitė et al. (2011); Theodorou et al. (2010)
GMM & LLM: Deisenroth and Rasmussen (2011); Deisenroth et al. (2011); Guenter et al. (2007); Lin and Lai (2012); Peters and Schaal (2008a)
Neural Networks: Endo et al. (2008); Geng et al. (2006); Gullapalli et al. (1994); Hailu and Sommer (1998); Bagnell and Schneider (2001)
Controllers: Bagnell and Schneider (2001); Kolter and Ng (2009a); Tedrake (2004); Tedrake et al. (2005); Vlassis et al. (2009); Zucker and Bagnell (2012)
Non-parametric: Kober et al. (2010); Mitsunaga et al. (2005); Peters et al. (2010a)
Table 3: This table illustrates different methods of making robot reinforcement learning tractable by employing a suitable representation.
approximation for the value function can be com-
bined with using function approximation for learn-
ing a model of the system (as discussed in Section 6)
in the case of model-based reinforcement learning ap-
proaches.
Unfortunately the max-operator used within the
Bellman equation and temporal-difference updates can
theoretically make most linear or non-linear approxi-
mation schemes unstable for either value iteration or
policy iteration. Quite frequently such an unstable
behavior is also exhibited in practice. Linear func-
tion approximators are stable for policy evaluation,
while non-linear function approximation (e.g., neural
networks) can even diverge if just used for policy eval-
uation (Tsitsiklis and Van Roy, 1997).
Physics-inspired Features. If good hand-crafted fea-
tures are known, value function approximation can be
accomplished using a linear combination of features.
However, good features are well known in robotics only
for a few problems, such as features for local stabiliza-
tion (Schaal, 1996) and features describing rigid body
dynamics (An et al., 1988). Stabilizing a system at
an unstable equilibrium point is the most well-known
example, where a second order Taylor expansion of
the state together with a linear value function approx-
imator often suffice as features in the proximity of the
equilibrium point. For example, Schaal (1996) showed
that such features suffice for learning how to stabilize a
pole on the end-effector of a robot when within ±15-30
degrees of the equilibrium angle. For sufficient fea-
tures, linear function approximation is likely to yield
good results in an on-policy setting. Nevertheless, it is
straightforward to show that impoverished value func-
tion representations (e.g., omitting the cross-terms in
quadratic expansion in Schaal’s set-up) will make it
impossible for the robot to learn this behavior. Sim-
ilarly, it is well known that linear value function ap-
proximation is unstable in the off-policy case (Tsitsiklis
and Van Roy, 1997; Gordon, 1999; Sutton and Barto,
1998).
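The quadratic feature expansion referred to above, including the cross-terms, can be sketched as follows; the two-dimensional pole state is assumed purely for readability.

```python
import numpy as np

def quadratic_features(x):
    """Second-order expansion of the state, including cross-terms.
    For x = [angle, angular_velocity] near the equilibrium, a value function
    that is linear in these features corresponds to a quadratic V(x)."""
    x = np.asarray(x)
    quadratic = np.outer(x, x)[np.triu_indices(len(x))]  # x_i * x_j for i <= j
    return np.concatenate(([1.0], x, quadratic))

# V(x) = w^T quadratic_features(x); omitting the cross-term (angle times
# angular velocity) impoverishes the representation as discussed above.
w = np.zeros(quadratic_features(np.zeros(2)).size)
```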
Neural Networks. As good hand-crafted features are
rarely available, various groups have employed neural
networks as global, non-linear value function approxi-
mation. Many different flavors of neural networks have
Figure 4: The Brainstormer Tribots won the RoboCup 2006 MidSize League (Riedmiller et al., 2009) (picture reprinted with permission of Martin Riedmiller).
been applied in robotic reinforcement learning. For
example, multi-layer perceptrons were used to learn
a wandering behavior and visual servoing (Gaskett
et al., 2000). Fuzzy neural networks (Duan et al., 2008)
and explanation-based neural networks (Thrun, 1995)
have allowed robots to learn basic navigation. CMAC
neural networks have been used for biped locomotion
(Benbrahim and Franklin, 1997).
The Brainstormers RoboCup soccer team is a par-
ticularly impressive application of value function ap-
proximation (see Figure 4). It used multi-layer per-
ceptrons to learn various sub-tasks such as learning
defenses, interception, position control, kicking, mo-
tor speed control, dribbling and penalty shots (Hafner
and Riedmiller, 2003; Riedmiller et al., 2009). The re-
sulting components contributed substantially to win-
ning the world cup several times in the simulation and
the mid-size real robot leagues. As neural networks
are global function approximators, overestimating the
value function at a frequently occurring state will in-
crease the values predicted by the neural network for
all other states, causing fast divergence (Boyan and
Moore, 1995; Gordon, 1999). Riedmiller et al. (2009)
solved this problem by always defining an absorbing
state where they set the value predicted by their neu-
ral network to zero, which “clamps the neural network
down” and thereby prevents divergence. It also allows
re-iterating on the data, which results in an improved
value function quality. The combination of iteration
on data with the clamping technique appears to be the
key to achieving good performance with value function
approximation in practice.
Generalize to Neighboring Cells. As neural net-
works are globally affected by local errors, much
work has focused on simply generalizing from neigh-
boring cells. One of the earliest papers in robot re-
inforcement learning (Mahadevan and Connell, 1992)
introduced this idea by statistically clustering states to
speed up a box-pushing task with a mobile robot, see
Figure 1a. This approach was also used for a naviga-
tion and obstacle avoidance task with a mobile robot
(Touzet, 1997). Similarly, decision trees have been
used to generalize states and actions to unseen ones,
e.g., to learn a penalty kick on a humanoid robot (Hes-
ter et al., 2010). The core problem of these methods
is the lack of scalability to high-dimensional state and
action spaces.
Local Models. Local models can be seen as an ex-
tension of generalization among neighboring cells to
generalizing among neighboring data points. Locally
weighted regression creates particularly efficient func-
tion approximation in the context of robotics both in
supervised and reinforcement learning. Here, regres-
sion errors are weighted down by proximity to the query point to train local models. The predictions of these
local models are combined using the same weighting
functions. Using local models for value function ap-
proximation has allowed learning a navigation task
with obstacle avoidance (Smart and Kaelbling, 1998),
a pole swing-up task (Schaal, 1996), and an air hockey
task (Bentivegna, 2004).
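A memory-based sketch of locally weighted regression, with a Gaussian weighting kernel whose bandwidth is an arbitrary choice, illustrates the scheme.

```python
import numpy as np

def lwr_predict(query, X, y, bandwidth=0.1):
    """Locally weighted linear regression: fit a local linear model around the
    query point, weighting each sample by its proximity to the query."""
    # Gaussian weights that decay with distance to the query point
    d = np.linalg.norm(X - query, axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    # Weighted least squares on the augmented inputs [x, 1]
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    W = np.diag(w)
    beta = np.linalg.solve(Xa.T @ W @ Xa + 1e-8 * np.eye(Xa.shape[1]),
                           Xa.T @ W @ y)
    return np.append(query, 1.0) @ beta
```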
Gaussian Process Regression. Parametrized global
or local models need to be pre-specified, which requires a
trade-off between representational accuracy and the
number of parameters. A non-parametric function ap-
proximator like Gaussian Process Regression (GPR)
could be employed instead, but potentially at the cost
of a higher computational complexity. GPR has the
added advantage of providing a notion of uncertainty
about the approximation quality for a query point.
Hovering with an autonomous blimp (Rottmann et al.,
2007) has been achieved by approximating the state-
action value function with a GPR. Similarly, another
paper shows that grasping can be learned using Gaus-
sian process regression (Gräve et al., 2010) by addi-
tionally taking into account the uncertainty to guide
the exploration. Grasping locations can be learned
by approximating the rewards with a GPR, and try-
ing candidates with predicted high rewards (Kroemer
et al., 2009), resulting in an active learning approach.
High reward uncertainty allows intelligent exploration
in reward-based grasping (Kroemer et al., 2010) in a
bandit setting.
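A simplified sketch of uncertainty-guided candidate selection with GPR (here using scikit-learn's GaussianProcessRegressor; the grasp features, the data, and the upper-confidence acquisition rule are assumptions, not the cited authors' implementations) is given below.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Previously tried grasp candidates (features) and the rewards they obtained
X_seen = np.random.rand(20, 3)
r_seen = np.random.rand(20)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-3)
gp.fit(X_seen, r_seen)

# Score new candidates by predicted reward plus an exploration bonus
# proportional to the predictive uncertainty (an upper-confidence rule).
X_candidates = np.random.rand(100, 3)
mean, std = gp.predict(X_candidates, return_std=True)
best = X_candidates[np.argmax(mean + 1.0 * std)]
```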
4.3 Pre-structured Policies
Policy search methods greatly benefit from employ-
ing an appropriate function approximation of the pol-
icy. For example, when employing gradient-based ap-
proaches, the trade-off between the representational
power of the policy (in the form of many policy pa-
rameters) and the learning speed (related to the num-
ber of samples required to estimate the gradient) needs
to be considered. To make policy search approaches
tractable, the policy needs to be represented with a
function approximation that takes into account do-
main knowledge, such as task-relevant parameters or
generalization properties. As the next action picked
by a policy depends on the current state and ac-
tion, a policy can be seen as a closed-loop controller.
Roberts et al. (2011) demonstrate that care needs to be
taken when selecting closed-loop parameterizations for
weakly-stable systems, and suggest forms that are par-
ticularly robust during learning. However, especially
Figure 5: Boston Dynamics LittleDog jumping (Kolter and Ng, 2009a) (Picture reprint with permission of Zico Kolter).
for episodic RL tasks, sometimes open-loop policies
(i.e., policies where the actions depend only on the
time) can also be employed.
Via Points & Splines. An open-loop policy may of-
ten be naturally represented as a trajectory, either
in the space of states or targets or directly as a set
of controls. Here, the actions are only a function
of time, which can be considered as a component of
the state. Such spline-based policies are very suitable
for compressing complex trajectories into few param-
eters. Typically the desired joint or Cartesian posi-
tion, velocities, and/or accelerations are used as ac-
tions. To minimize the required number of parame-
ters, not every point is stored. Instead, only impor-
tant via-points are considered and other points are in-
terpolated. Miyamoto et al. (1996) optimized the po-
sition and timing of such via-points in order to learn
a kendama task (a traditional Japanese toy similar to
ball-in-a-cup). A well-known type of via-point representation is the spline, which relies on piecewise-defined smooth polynomial functions for interpolation. For
example, Roberts et al. (2010) used a periodic cubic
spline as a policy parametrization for a flapping system
and Kuindersma et al. (2011) used a cubic spline to
represent arm movements in an impact recovery task.
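A via-point policy can be sketched with a cubic spline (using scipy.interpolate.CubicSpline); the single joint, the via-point values, and their timing are illustrative policy parameters that a learner would adapt.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Policy parameters: a handful of via-points for a single joint
via_times = np.array([0.0, 0.3, 0.6, 1.0])        # timing of the via-points
via_positions = np.array([0.0, 0.8, -0.2, 0.0])   # desired joint positions

policy = CubicSpline(via_times, via_positions)      # open-loop trajectory

# Executing the policy: actions depend only on time
t = np.linspace(0.0, 1.0, 500)
desired_pos = policy(t)
desired_vel = policy(t, 1)   # first derivative gives the velocity profile
```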
Linear Models. If model knowledge of the system is
available, it can be used to create features for lin-
ear closed-loop policy representations. For example,
Tamei and Shibata (2009) used policy-gradient rein-
forcement learning to adjust a model that maps from
human EMG signals to forces that in turn is used in a
cooperative holding task.
Motor Primitives. Motor primitives combine linear
models describing dynamics with parsimonious move-
ment parametrizations. While originally biologically-
inspired, they have seen considerable success in representing
basic movements in robotics such as a reaching move-
ment or basic locomotion. These basic movements
can subsequently be sequenced and/or combined to
achieve more complex movements. For both goal ori-
ented and rhythmic movement, different technical rep-
resentations have been proposed in the robotics com-
munity. Dynamical system motor primitives (Ijspeert
et al., 2003; Schaal et al., 2007) have become a popular
representation for reinforcement learning of discrete
movements. The dynamical system motor primitives
always have a strong dependence on the phase of the
movement, which corresponds to time. They can be
employed as an open-loop trajectory representation.
Nevertheless, they can also be employed as a closed-
loop policy to a limited extent. In our experience, they
offer a number of advantages over via-point or spline
based policy representation (see Section 7.2). The dy-
namical system motor primitives have been trained
with reinforcement learning for a T-ball batting task
(Peters and Schaal, 2008c,b), an underactuated pendu-
lum swing-up and a ball-in-a-cup task (Kober and Pe-
ters, 2009), flipping a light switch (Buchli et al., 2011),
pouring water (Tamoši¯unait˙e et al., 2011), and play-
ing pool and manipulating a box (Pastor et al., 2011).
For rhythmic behaviors, a representation based on the
same biological motivation but with a fairly different
technical implementation (based on half-elliptical lo-
cuses) has been used to acquire the gait patterns for Aibo robot dog locomotion (Kohl and Stone, 2004).
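A rough sketch of a one-dimensional discrete dynamical system motor primitive in the spirit of Ijspeert et al. (2003) is given below; the gains, basis-function placement, and integration scheme are simplifying assumptions, and only the forcing-term weights would be adapted by reinforcement learning.

```python
import numpy as np

def dmp_rollout(weights, x0=0.0, g=1.0, tau=1.0, dt=0.002,
                alpha=25.0, beta=6.25, alpha_s=3.0):
    """Integrate a one-dimensional dynamical system motor primitive.
    The learnable forcing term f(s) is a weighted sum of Gaussian basis
    functions of the phase variable s, so learning only adapts `weights`."""
    n_basis = len(weights)
    centers = np.exp(-alpha_s * np.linspace(0.0, 1.0, n_basis))
    widths = n_basis ** 1.5 / centers
    x, v, s = x0, 0.0, 1.0
    trajectory = []
    for _ in range(int(tau / dt)):
        psi = np.exp(-widths * (s - centers) ** 2)
        f = s * (g - x0) * psi.dot(weights) / (psi.sum() + 1e-10)
        # transformation system: spring-damper towards the goal plus forcing term
        v += dt / tau * (alpha * (beta * (g - x) - v) + f)
        x += dt / tau * v
        # canonical system: phase decays from 1 to 0, encoding the movement time
        s += dt / tau * (-alpha_s * s)
        trajectory.append(x)
    return np.array(trajectory)
```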
Gaussian Mixture Models and Radial Basis Function
Models. When more general policies with a strong
state-dependence are needed, general function approx-
imators based on radial basis functions, also called
Gaussian kernels, become reasonable choices. While
learning with fixed basis function centers and widths
often works well in practice, estimating them is chal-
lenging. These centers and widths can also be esti-
mated from data prior to the reinforcement learning
process. This approach has been used to generalize
a open-loop reaching movement (Guenter et al., 2007;
Lin and Lai, 2012) and to learn the closed-loop cart-
pole swingup task (Deisenroth and Rasmussen, 2011).
Globally linear models were employed in a closed-loop
block stacking task (Deisenroth et al., 2011).
Neural Networks are another general function ap-
proximation used to represent policies. Neural os-
cillators with sensor feedback have been used to
learn rhythmic movements where open and closed-
loop information were combined, such as gaits for
a two legged robot (Geng et al., 2006; Endo et al.,
2008). Similarly, a peg-in-hole (see Figure 1b), a ball-
balancing task (Gullapalli et al., 1994), and a naviga-
tion task (Hailu and Sommer, 1998) have been learned
with closed-loop neural networks as policy function ap-
proximators.
Locally Linear Controllers. As local linearity is
highly desirable in robot movement generation to
avoid actuation difficulties, learning the parameters of
a locally linear controller can be a better choice than
using a neural network or radial basis function repre-
sentation. Several of these controllers can be combined
to form a global, inherently closed-loop policy. This
type of policy has allowed for many applications, in-
cluding learning helicopter flight (Bagnell and Schnei-
der, 2001), learning biped walk patterns (Tedrake,
2004; Tedrake et al., 2005), driving a radio-controlled
(RC) car, learning a jumping behavior for a robot dog
(Kolter and Ng, 2009a) (illustrated in Figure 5), and
balancing a two wheeled robot (Vlassis et al., 2009).
Operational space control was also learned by Peters
and Schaal (2008a) using locally linear controller mod-
els. In a marble maze task, Zucker and Bagnell (2012)
used such a controller as a policy that expressed the
desired velocity of the ball in terms of the directional
gradient of a value function.
Non-parametric Policies. Policies based on non-parametric regression approaches often allow a more data-driven learning process. This approach is often preferable to the purely parametric policies listed above because the policy structure can evolve during the learning process. Such approaches are especially useful when a policy is learned to adjust the existing behaviors of a lower-level controller, such as when choosing among different robot-human interaction pos-
sibilities (Mitsunaga et al., 2005), selecting among dif-
ferent striking movements in a table tennis task (Pe-
ters et al., 2010a), and setting the meta-actions for
dart throwing and table tennis hitting tasks (Kober
et al., 2010).
5 Tractability Through Prior
Knowledge
Prior knowledge can dramatically help guide the learn-
ing process. It can be included in the form of initial
policies, demonstrations, initial models, a predefined
task structure, or constraints on the policy such as
torque limits or ordering constraints of the policy pa-
rameters. These approaches significantly reduce the
search space and, thus, speed up the learning process.
Providing a (partially) successful initial policy allows
a reinforcement learning method to focus on promising
regions in the value function or in policy space, see Sec-
tion 5.1. Pre-structuring a complex task such that it
can be broken down into several more tractable ones
can significantly reduce the complexity of the learn-
ing task, see Section 5.2. An overview of publications
using prior knowledge to render the learning problem
tractable is presented in Table 4. Constraints may also
limit the search space, but often pose new, additional
problems for the learning methods. For example, policy search methods often do not handle hard limits on the policy well. Relaxing such constraints (a trick of-
ten applied in machine learning) is not feasible if they
were introduced to protect the robot in the first place.
5.1 Prior Knowledge Through
Demonstration
People and other animals frequently learn using a com-
bination of imitation and trial and error. When learn-
ing to play tennis, for instance, an instructor will re-
peatedly demonstrate the sequence of motions that
form an orthodox forehand stroke. Students subse-
quently imitate this behavior, but still need hours of
practice to successfully return balls to a precise loca-
tion on the opponent’s court. Input from a teacher
need not be limited to initial instruction. The instruc-
tor may provide additional demonstrations in later learning stages (Latzke et al., 2007; Ross et al., 2011a), which can also be used as differential feedback (Argall et al., 2008).
This combination of imitation learning with rein-
forcement learning is sometimes termed apprenticeship
learning (Abbeel and Ng, 2004) to emphasize the need
for learning both from a teacher and by practice. The
term “apprenticeship learning” is often employed to re-
fer to “inverse reinforcement learning” or “inverse op-
timal control” but is intended here to be employed in
this original, broader meaning. For a recent survey
detailing the state of the art in imitation learning for
robotics, see (Argall et al., 2009).
Using demonstrations to initialize reinforcement
learning provides multiple benefits. Perhaps the most
obvious benefit is that it provides supervised training