

Trajectory-Based Optimal Control Techniques

By Stefan Schaal and Christopher G. Atkeson

Digital Object Identifier 10.1109/MRA.2010.936957

In a not too distant future, robots will be a natural part of

daily life in human society, providing assistance in many

areas ranging from clinical applications, education, and caregiving to normal household environments [1]. It is hard to

imagine that all possible tasks can be preprogrammed in such

robots. Robots need to be able to learn, either by themselves

or with the help of human supervision. Additionally, wear and

tear on robots in daily use needs to be automatically compen-

sated for, which requires a form of continuous self-calibration,

another form of learning. Finally, robots need to react to sto-

chastic and dynamic environments, i.e., they need to learn

how to optimally adapt to uncertainty and unforeseen

changes. Robot learning is going to be a key ingredient for the

future of autonomous robots.

While robot learning covers a rather large field, from learn-

ing to perceive, to plan, to make decisions, etc., we will focus

this review on topics of learning control, in particular, as it is

concerned with learning control in simulated or actual physi-

cal robots. In general, learning control refers to the process of

acquiring a control strategy for a particular control system and

a particular task by trial and error. Learning control is usually

distinguished from adaptive control [2] in that the learning sys-

tem can have rather general optimization objectives—not just,

e.g., minimal tracking error—and is permitted to fail during

the process of learning, while adaptive control emphasizes fast

convergence without failure. Thus, learning control resembles

the way that humans and animals acquire new movement

strategies, while adaptive control is a special case of learning

control that fulfills stringent performance constraints, e.g., as

needed in life-critical systems like airplanes.

Learning control has been an active topic of research for at

least three decades. However, given the lack of working robots

that actually use learning components, more work needs to be

done before robot learning will make it beyond the laboratory

environment. This article will survey some ongoing and past

activities in robot learning to assess where the field stands and

where it is going. We will largely focus on nonwheeled robots

and less on topics of state estimation, as typically explored in

wheeled robots [3]–[6], and we emphasize learning in continuous

state-action spaces rather than discrete state-action spaces [7], [8].

We will illustrate the different topics of robot learning with

examples from our own research with anthropomorphic and

humanoid robots.

The Basics of Learning Control

A key question in learning control is what it is that should be

learned. To address this issue, it is helpful to begin with one of

the most general frameworks of learning control, as originally

developed in the middle of the 20th century in the fields of

optimization theory, optimal control, and in particular,

dynamic programming [9], [10]. Here, the goal of learning

control was formalized as the need to acquire a task-dependent control policy π that maps a continuous-valued state vector x of a controlled system and its environment, possibly in a time-dependent way, to a continuous-valued control vector u:

u = π(x, t, θ).    (1)

The parameter vector θ contains the problem-specific parameters in the policy π that need to be adjusted by the learning system. The controlled system can generally be expressed as a nonlinear dynamics function

ẋ = f(x, u, t, ε_x)    (2)

with observation equations

y = h(x, u, t, ε_y)    (3)

that describe how the observations y of the system are derived from the full-state vector x; the terms ε_x and ε_y denote noise terms. Thus, learning control means finding a (usually nonlinear) function π that is adequate for a given desired behavior and movement system. A repertoire of motor skills is composed of many such policies that are sequenced and superimposed to achieve complex motor skills.
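To make the policy notation concrete, the following minimal Python sketch shows how (1)–(3) fit together in a discrete-time rollout. The point-mass dynamics, the linear feedback policy, and all function names are illustrative assumptions, not part of the article.

```python
import numpy as np

def policy(x, t, theta):
    """u = pi(x, t, theta): here simply a linear feedback law in the state."""
    return theta @ x  # theta is a (dim_u x dim_x) gain matrix

def dynamics(x, u, t, rng, dt=0.01):
    """x_dot = f(x, u, t, eps_x) for an illustrative noisy point mass, integrated with Euler."""
    pos, vel = x
    eps = 0.01 * rng.standard_normal(2)     # process noise eps_x
    x_dot = np.array([vel, u[0]]) + eps     # acceleration equals the commanded force
    return x + dt * x_dot

def observe(x, u, t, rng):
    """y = h(x, u, t, eps_y): noisy measurement of the full state."""
    return x + 0.001 * rng.standard_normal(x.shape)

def rollout(theta, x0, steps=500):
    """Run the closed loop and return the observation sequence."""
    rng = np.random.default_rng(0)
    x, ys = np.asarray(x0, float), []
    for k in range(steps):
        t = 0.01 * k
        u = policy(x, t, theta)
        x = dynamics(x, u, t, rng)
        ys.append(observe(x, u, t, rng))
    return np.array(ys)

# Example: a hand-tuned PD-like parameter matrix drives the mass toward the origin.
trajectory = rollout(theta=np.array([[-4.0, -2.0]]), x0=[1.0, 0.0])
```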

How the control policy is learned, however, can proceed in

many different ways. Assuming that the model equations (2) and

(3) are unknown, one classical approach is to learn these models

using methods of function approximation and then compute a

controller based on the estimated model, which is often discussed

as the certainty-equivalence principle in the adaptive control liter-

ature [2]. Such techniques are summarized under the name

model-based learning, indirect learning, or internal model learn-

ing. Alternatively, model-free learning of the policy is possible

given an optimization or reward criterion, usually using methods

from optimal control or reinforcement learning. Such model-free

learning is also known as direct learning, since the policy is learned

directly, i.e., without a detour through model identification.

It is useful to distinguish between several general classes of

motor tasks that could be the goal of learning. Regulator tasks

keep the system at a particular set point

of operation—a typical example is bal-

ancing a pole on a fingertip or standing

upright on two legs. Tracking tasks

require the control system to follow a

given desired trajectory within the abil-

ities of the control system. Discrete

movement tasks, also called one-shot

tasks, are defined by achieving a particu-

lar goal at which the motor skill termi-

nates. A basketball foul shot or grasping a

cup of coffee are representative exam-

ples. Periodic movement tasks are typical

in the domain of locomotion. Finally,

complex movement tasks are composed

of sequencing and superimposing simpler

motor skills, e.g., leading to complex

manipulation skills like emptying a dish-

washer or assembling a bookshelf.

From the viewpoint of machine learning, robot learning can

be classified as supervised learning, reinforcement learning,

learning modularizations, or learning feature representations

that subserve learning. All learning methods can benefit from

giving the learning system prior knowledge about how to

accomplish a motor task, and imitation learning or learning

from demonstration is a popular approach to introduce this bias.

In summary, the goal of robot learning is to find an appro-

priate control policy to accomplish a given movement task,

assuming that no traditional methods exist to compute the

control policy. Approaches to robot learning can be classified

and discussed using three dimensions: direct versus indirect

control, the learning method used, and the class of tasks in

question (Figure 1).

Approaches to Robot Learning

We will use the classification in Figure 1 in the following sec-

tions to guide our survey of current and previous work in robot

learning. Given space constraints, this survey is not meant to be

comprehensive but rather to present illustrative projects in the

various areas.

Learning Internal Models for Control

Using learning to acquire internal models for control is useful

when the analytical models are too complex to derive, and/or

when it can be expected that the models change over time, e.g.,

due to wear and tear. Various kinds of internal models are used in

robotics. The most well known are kinematics and dynamic

models. For instance, the direct kinematics of a robot relates joint

variables q to end-effector variables y, i.e., y = g(q) [11]. Dynamics models include kinetic terms like forces or torques, as in (2). The previous models are forward models, i.e., they model the causal relationship between input and output variables, and they are proper functions. Often, however, what is needed in control are inverse models, e.g., the inverse kinematics q = g⁻¹(y) or the inverse dynamics u = f⁻¹(q, q̇, t). As discussed in [12], inverse models are often not functions, as the inverse relationships may be a one-to-many map, i.e., just a relation. Such

cases pose a problem to learning methods and can be addressed with special techniques and representations [13]–[16].

Figure 1. Classification of robot learning along three dimensions: direct versus indirect control (model-free versus model-based control), learning method (supervised learning, reinforcement learning, imitation learning, learning modularity, learning representations), and class of task (regulator, tracking, one-shot, periodic, and complex/composite tasks). Topics further out on the arrows can be considered more complex research topics than topics closer to the center.

Nonlinear function approximation is needed to learn inter-

nal models. It should be noted, as will be explained later, that

function approximation is also required for other robot learning

problems, e.g., to represent value functions, reward functions,

or policies in reinforcement learning—thus, function approxi-

mation has a wide applicability in robot learning. While most

machine-learning problems in function approximation work by

processing a given data set in an offline fashion, robot learning

has several features that require specialized algorithms:

• Data are available in abundance, typically at a rate from 60 to 1,000 data points per second.
• Given this continuous stream of data, learning should never stop but continue forever without degradation over time. For instance, degradation happens in many algorithms if the same data point is given to the learning system repeatedly, e.g., when the robot is standing still.
• Given the high dimensionality of most interesting robotic systems, the complexity of the function to be learned is often unknown in advance, and the function approximation system needs to be able to add new learning resources as learning proceeds.
• Learning should happen in real time, be data efficient (squeeze the most information out of each data point), and be computationally efficient (to achieve real-time learning and lookup).
• Learning needs to be robust toward shifting input distributions, e.g., as is typical when practicing calligraphy on one day and tennis on another day, a topic discussed in the context of catastrophic interference [17].
• Learning needs to be able to detect relevant features in the input from ideally hundreds or thousands of input dimensions, and it needs to automatically exclude irrelevant and redundant inputs.

These requirements narrow down the learning algorithms

that are applicable to function approximation for robot learn-

ing. One approach that has favorable performance is learning

with piecewise linear models using nonparametric regression

techniques [17]–[22]. Essentially, this technique finds, in the

spirit of a first-order Taylor series expansion, the linearization

of the function at an input point, and the region (also called a

kernel) in which this linearization holds within a certain error

bound. Learning this region is the most complex part of these

techniques, and the latest developments use Bayesian statistics

[23] and dimensionality reduction [22].
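As a flavor of what such nonparametric regression does, here is a minimal locally weighted linear regression predictor with a fixed Gaussian kernel. It is a didactic simplification: the cited receptive-field methods learn the kernel regions incrementally and in high dimensions, which this sketch does not attempt.

```python
import numpy as np

def lwr_predict(query, X, Y, bandwidth=0.3):
    """Locally weighted linear regression: fit a linear model around the query point.

    X: (N, d) inputs, Y: (N,) targets. A Gaussian kernel centered at the query
    weights the training points; the weighted least-squares fit is the local
    first-order (Taylor-like) model, and its value at the query is returned.
    """
    diff = X - query
    w = np.exp(-0.5 * np.sum(diff**2, axis=1) / bandwidth**2)   # kernel weights
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])               # affine local model
    W = np.diag(w)
    beta = np.linalg.solve(Xb.T @ W @ Xb + 1e-8 * np.eye(Xb.shape[1]),
                           Xb.T @ W @ Y)                        # ridge-stabilized weighted fit
    return np.append(query, 1.0) @ beta

# Example: learn y = sin(x) from noisy samples and predict at one query point.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
Y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(200)
print(lwr_predict(np.array([1.0]), X, Y))   # close to sin(1.0)
```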

A new development, largely due to increasingly faster com-

puting hardware, is the application of Gaussian process regres-

sion (GPR) to function approximation in robots [24]–[26].

GPR is a powerful function approximation tool that has

gained popularity due to its sound theory, high fitting accu-

racy, and the relative ease of application with public-domain

software libraries. As it requires an iterative optimization that

needs to invert a matrix of size N × N, where N is the number

of training data points, GPR quickly saturates the computa-

tional resources with moderately many data points. Thus, scal-

ability to continual and real-time learning in complex robots

will require further research developments; some research

along these lines is given in [25] and [27].
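A bare-bones NumPy sketch of GPR makes the scaling issue concrete: the N × N kernel matrix has to be factored, which is the step that saturates computational resources for large N. Hyperparameters are fixed by hand here, whereas real implementations optimize them; all names below are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.5, signal_var=1.0):
    """Squared-exponential kernel matrix between row-wise input sets A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def gpr_fit_predict(X, y, Xq, noise_var=1e-2):
    """Gaussian process regression: requires an O(N^3) factorization of the N x N kernel matrix."""
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)                      # the expensive step for large N
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = rbf_kernel(Xq, X)
    mean = Ks @ alpha                              # predictive mean at the query points
    v = np.linalg.solve(L, Ks.T)
    var = rbf_kernel(Xq, Xq).diagonal() - np.sum(v**2, axis=0)   # predictive variance
    return mean, var

# Example: regress a 1-D function from 300 noisy samples.
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(300)
mean, var = gpr_fit_predict(X, y, np.linspace(-3, 3, 5).reshape(-1, 1))
```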

Example Application

As mentioned earlier, learning inverse models can be challeng-

ing, since the inverse model problem is often a relation and not

a function, with a one-to-many mapping. Applying any arbi-

trary nonlinear function approximation method to the inverse

model problem can lead to unpredictably bad performance, as

the training data can form nonconvex solution spaces in which

averaging is inappropriate [12]. A particularly interesting

approach in control involves learning local linearizations of a

forward model (which is a proper function) and learning an

inverse mapping within the local region of the forward model;

see also [15] and [28].

Ting et al. [23] demonstrated such a forward-inverse model

learning approach with Bayesian locally weighted regression

(BLWR) to learn an inverse kinematics model for a haptic

robot arm (Figure 2) for a task-space tracking task. Training

data consisted of the arm’s joint angles q, joint velocities q̇, end-effector positions in Cartesian space y, and end-effector velocities ẏ. From these data, a differential forward kinematics model ẏ = J(q) q̇ was learned, where J is the Jacobian matrix. The transformation from q̇ to ẏ can be assumed to be locally linear at a particular configuration q of the robot arm. BLWR is used to learn the forward model in a piecewise linear fashion.

The goal of the robot task is to track a

desired trajectory (y, ẏ) specified only in terms of x, z Cartesian positions and

velocities, i.e., the movement is sup-

posed to be in a vertical plane in front of

the robot, but the exact position of the

vertical plane is not given. Thus, the task

has one degree of redundancy. To learn

an inverse kinematics model, the local

regions from the piecewise linear for-

ward model can be reused since any local

inverse is also locally linear within these

regions. Moreover, for locally linear

models, all solution spaces for the inverse

model are locally convex, such that an inverse can be learned without problems. The redundancy issue can be solved by applying an additional weight to each data point according to a reward function, resulting in reward-weighted locally weighted regression [15].

Figure 2. (a) Phantom robot. (b) Learned inverse kinematics solution; the difference between the actual and desired trajectory is small.

Figure 2 shows the performance of the learned inverse model (Learned IK) in a figure-eight tracking task. The learned model performs with root-mean-squared tracking errors in positions and velocities very close to those of the analytical inverse kinematics solution. This performance was acquired from five minutes of real-time training data.
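The control loop behind such a task-space tracking experiment can be sketched as resolved-rate control with a learned local Jacobian: the piecewise linear forward model supplies an estimate of J(q) near the current configuration, and its damped pseudoinverse maps the desired task-space velocity to joint velocities. The data layout and function names below are assumptions for illustration, not the BLWR implementation of [23].

```python
import numpy as np

def learned_jacobian(q, local_models):
    """Return the Jacobian of the locally linear forward model whose center is nearest to q.

    local_models: list of (center, J) pairs, e.g., extracted from a piecewise
    linear (locally weighted) forward model y_dot = J(q) q_dot.
    """
    centers = np.array([c for c, _ in local_models])
    idx = np.argmin(np.linalg.norm(centers - q, axis=1))
    return local_models[idx][1]

def ik_step(q, y, y_des, yd_des, local_models, dt=0.001, kp=10.0, damping=1e-4):
    """One resolved-rate step: command q_dot = J^+ (yd_des + kp * (y_des - y))."""
    J = learned_jacobian(q, local_models)
    yd_ref = yd_des + kp * (y_des - y)               # task-space reference velocity
    JJt = J @ J.T + damping * np.eye(J.shape[0])     # damped pseudoinverse of the learned Jacobian
    qd = J.T @ np.linalg.solve(JJt, yd_ref)
    return q + dt * qd                               # integrate the commanded joint velocities
```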

Model-Based Learning

In considering model-based learning, it is useful to start by

assuming that the model is perfect. Later, we will address the

question of how to design a controller that is robust to flaws in

the learned model.

Conventional Dynamic Programming

Designing controllers for linear models is well understood. Work

in reinforcement learning has focused on using techniques derived

from dynamic programming to design controllers for models

that are nonlinear. A large part of our own work has emphasized

pushing back the curse of dimensionality, as the memory and

computational cost of dynamic programming increase exponen-

tially with the dimensionality of the state-action space.

Dynamic programming provides a way to find globally

optimal control policies when the model of the control system

is known. This section focuses on offline planning of nonlinear

control policies for control problems with continuous states

and actions, deterministic time invariant discrete time dynam-

ics, x

kþ1

¼f(x

k

,u

k

), and a time-invariant one-step cost or

reward function L(x,u)—equivalent formulations exist for

continuous time systems [29]–[31]. We are addressing steady-

state policies, i.e., policies that are not time variant and have an

infinite time horizon. One approach to dynamic programming

is to approximate the value function V(x) (the optimal total

future cost from each state V(x)¼minukP1

k¼0L(xk,uk)) by

repeatedly solving the Bellman equation V(x)¼minu

fL(x,u)þV(f(x,u))gat sampled states xuntil the value

function estimates have converged to globally optimal val-

ues. Typically, the value function and control law are repre-

sented on a regular grid—it should be noted that more

efficient adaptive grid methods [32], [33] or function approx-

imation methods [7] also exist. Some type of interpolation is

used to approximate these functions within each grid cell. If

each dimension of the state and action is represented with a

resolution R, and the dimensionality of the state is d

x

and that

of the action is d

u

, the computational cost of the conven-

tional approach is proportional to Rdx3Rduand the memory

cost is proportional to Rdx.Thisisknownasthecurseof

dimensionality [9].

We have shown that dynamic programming can be sped up

by randomly sampling actions on each sweep rather than

exhaustively minimizing the Bellman equation with respect to

the action [34]. At each state on each update, the current best

action is reevaluated and compared to some number of random

actions. Our studies have found that only looking at one ran-

dom action on each update is most efficient. It is more effective

to propagate information about future values by reevaluating

the current best action on each update than it is to put a lot of

resources into searching for the absolute best action.
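A minimal grid-based sketch of this idea follows: plain value iteration over a sampled set of states, with the Bellman minimization restricted to the stored best action plus one randomly drawn candidate per update. The crude nearest-neighbor interpolation and the dynamics/cost interfaces are assumptions, not the implementation of [34].

```python
import numpy as np

def value_iteration_random_actions(states, dynamics, cost, nearest_index,
                                   u_low, u_high, sweeps=200, seed=0):
    """Dynamic programming with one random candidate action per state update.

    states: (S, d) array of grid states; dynamics(x, u) -> next state;
    cost(x, u) -> one-step cost; nearest_index(x) -> index of the closest grid
    state (a crude stand-in for interpolation of the value function).
    """
    rng = np.random.default_rng(seed)
    S = len(states)
    V = np.zeros(S)                       # value function estimate on the grid
    policy = np.zeros((S, len(u_low)))    # current best action at each grid state

    for _ in range(sweeps):
        for i, x in enumerate(states):
            # Re-evaluate the best action so far and compare against one random action.
            candidates = [policy[i], rng.uniform(u_low, u_high)]
            best_q, best_u = np.inf, policy[i]
            for u in candidates:
                q = cost(x, u) + V[nearest_index(dynamics(x, u))]  # Bellman backup
                if q < best_q:
                    best_q, best_u = q, u
            V[i], policy[i] = best_q, best_u
    return V, policy
```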

With this speedup in action search, currently available

cluster computers can easily handle ten-dimensional problems

(approximately 10^10 points can handle grids of size 50^6, 20^8, or 10^10, for example). Current supercomputers are created by net-

working hundreds or thousands of conventional computers.

The obvious way to implement dynamic programming on

such a cluster is to partition the grid representing the value

function and policy across the individual computing nodes,

with the borders shared between multiple nodes. When a

border cell is updated by its host node, the new value must be

communicated to all nodes that have copies of that cell. We

have implemented dynamic programming in a cluster of up to

100 nodes, with each node having eight CPU cores and 16 GB

of memory. For example, running a cluster of 40 nodes on a

six-dimensional problem with 50^6 cells, about 6 GB is used on

each node to store its region of the value function and policy.

Decomposing Problems

One way to reduce the curse of dimensionality is to break

problems into parts and develop a controller for each part sep-

arately. Each subsystem could be ten-dimensional, given the

earlier results, and a system that combined two subsystems

could be 20 dimensional. For example, we are interested in

developing a controller for biped walking [35]. We can

approximately model the dynamics of a biped with separate

models for sagittal and lateral control. These models are linked

by common actions, such as when to put down and lift the

feet. Thus, there are two parts of the state vector x: variables

that are part of the sagittal state x_s and variables that are part of the lateral state x_l. There are three parts of the action vector u: variables that are part of the sagittal action u_s, variables that are part of the lateral action u_l, and variables that affect both systems u_sl. We can perform dynamic programming on the sagittal system and produce a value function V_s(x_s) and do the same with the lateral system to produce V_l(x_l). We can choose an optimal action by minimizing L(x, u) + V(f(x, u)) with respect to u, with V(x) approximated by V_s(x_s) + V_l(x_l). This approximation

ignores the linking of the two systems in the future and can be

improved by adding elements to the one-step costs for each

subsystem that bias the shared actions to behave as if the other

system was present. For example, deviations from the timing

usually seen in the complete system can be penalized.

Trajectory Optimization and Trajectory Libraries

Another way to handle complex systems is trajectory optimiza-

tion. Given a model, a variety of approaches can be used to find

a locally optimal sequence of commands for a given initial posi-

tion and one-step cost [36]–[38]. Interestingly, trajectory optimi-

zation is quite popular for generating motion in animation [39].

However, trajectory optimization is not so popular in robotics,

because it appears that it does not produce a control law but just

a fixed sequence of commands. This is not a correct view.


To generate a control policy, trajectory optimization can

be applied to many initial conditions, and the resulting com-

mands can be interpolated as needed. If that is the case, why

do we need to deal with dynamic programming and the curse

of dimensionality? Dynamic programming is a global opti-

mizer, while trajectory optimization finds local optima. Often,

the local optima found are not acceptable. Some way to bias

trajectory optimization to produce reasonable trajectories

would be useful. Also, if interpolation of the results will be

done, it would be useful to produce consistent results so that

similar initial conditions lead to similar costs. There may be

discontinuities between nearby trajectories that must be

handled by interpolation of actions.

One trick to improve trajectories is to use neighboring tra-

jectories to somehow bias or guide the optimization process. A

simple way to do this is to use a neighboring trajectory as the

initial trajectory in the trajectory-optimization process. Trajec-

tories can be reoptimized using each neighbor in turn as the

initial trajectory, and the best result so far can be retained. We

have explored building explicit libraries of optimized trajecto-

ries to handle large perturbations in bipedal standing balance

[40]. One way of using the library is to use the optimized

action corresponding to the nearest state in the library at each

time step. Another way is to store the derivative of the opti-

mized action with respect to state and use that derivative to

modify the suggested action. A third way is to look up states

from multiple trajectories and generate a weighted blend of

the suggested actions.
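A nearest-neighbor version of such a library lookup, with the optional first-order correction by the stored action derivative, can be sketched as follows; the data layout and class name are assumptions for illustration.

```python
import numpy as np

class TrajectoryLibrary:
    """Stores (state, action, d_action/d_state) triples harvested from optimized trajectories."""

    def __init__(self):
        self.states, self.actions, self.gains = [], [], []

    def add(self, x, u, dudx):
        self.states.append(np.asarray(x))
        self.actions.append(np.asarray(u))
        self.gains.append(np.asarray(dudx))

    def lookup(self, x, use_gain=True):
        """Return the action of the nearest stored state, optionally corrected to first order."""
        X = np.array(self.states)
        i = int(np.argmin(np.linalg.norm(X - x, axis=1)))
        u = self.actions[i].copy()
        if use_gain:
            u += self.gains[i] @ (np.asarray(x) - X[i])   # local feedback correction
        return u
```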

The first and second derivatives of a trajectory’s cost with

respect to state can be used to generate a local Taylor series

model of the value function: V(x) = V_0 + V_x x + xᵀ V_xx x.

Given a quadratic local model of the value function, it is possible

to compute the optimal action and its first derivative, the feed-

back gains. These observations led to a trajectory optimization

method based on second-order gradient descent, differential

dynamic programming (DDP) [29]. Although this trajectory

optimization method is no longer considered the most efficient

way to find an optimal trajectory [sequential quadratic program-

ming (SQP) methods are currently preferred in many fields such

as aerospace and animation], the local models of the value function and policy that DDP produces are useful for machine learning. For example, the local model of the policy can be used

in a trajectory library to interpolate or extrapolate actions. Dis-

crepancies in adjacent local models of the value function can be

used to determine where to allocate additional library resources.
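To see how a local quadratic value model yields both an action and feedback gains, consider one backward step with linearized dynamics and quadratic cost expansions: the minimizing action change is affine in the state deviation. The sketch below computes that affine law in the style of a DDP/LQR backward step; the matrix naming follows common usage rather than the article's notation.

```python
import numpy as np

def ddp_backward_step(A, B, l_x, l_u, l_xx, l_uu, l_ux, V_x, V_xx):
    """One backward step of a DDP/LQR-style recursion.

    Inputs: linearized dynamics dx' ~ A dx + B du, derivatives of the one-step
    cost (l_*), and the quadratic value model (V_x, V_xx) of the next time step.
    Returns the feedforward action change k, the feedback gains K, and the
    updated quadratic value model for the current time step.
    """
    Q_x = l_x + A.T @ V_x
    Q_u = l_u + B.T @ V_x
    Q_xx = l_xx + A.T @ V_xx @ A
    Q_uu = l_uu + B.T @ V_xx @ B
    Q_ux = l_ux + B.T @ V_xx @ A

    k = -np.linalg.solve(Q_uu, Q_u)        # feedforward correction of the action
    K = -np.linalg.solve(Q_uu, Q_ux)       # feedback gains (first derivative of the local policy)

    V_x_new = Q_x + K.T @ Q_uu @ k + K.T @ Q_u + Q_ux.T @ k
    V_xx_new = Q_xx + K.T @ Q_uu @ K + K.T @ Q_ux + Q_ux.T @ K
    return k, K, V_x_new, V_xx_new
```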

Robustness

Robustness has not been addressed well in robot learning.

Studies often focus on robustness to additive noise. It is much

more difficult to design controllers that are robust to the corre-

lated errors caused by parameter error or model structure

error. One approach to designing robust controllers is to opti-

mize controller parameters by simulating a controller control-

ling a noisy robot [41]. It is more useful to optimize controller

parameters by controlling a set of robots, each with different

robot parameters. This allows the effect of correlated control-

ler errors across time to be handled in the optimization.

It is not clear how to perform a similar optimization over a

set of models in dynamic programming. Using additive noise

and performing stochastic dynamic programming does not

capture the effect of correlated errors. One approach is to

make the model parameters into model states and perform sto-

chastic dynamic programming on information states that

describe distributions of actual states and model parameters.

However, this creates a large increase in the number of states,

which is not practical for dynamic programming.

Bar-Shalom and Tse showed that DDP can be used to

locally optimize controller robustness as well as exploration

[42], [43]. This work provides an efficient solution to optimize

the typically high-dimensional information state, which

includes the means and covariances of the original model states

and the means and covariances of the model parameters.

Representing the uncertainty using a parametric probability

distribution (means and covariances) also reduces the compu-

tational cost of propagating uncertainty forward in time. The

dynamics of the system are given by an extended Kalman fil-

ter. The key observation is that the cost of uncertainty (the

state and model parameter covariances) is given by

Trace(V_xx Σ), the trace of the product of the second derivative of the value function and the covariance matrix of the state.

Minimizing the additional cost due to uncertainty makes the

controller more robust and guides exploration.

Example Application

We implemented DDP on an actual robot as part of a learning from demonstration experiment (Figure 3). Several robustness issues arose, since models are never perfect, especially learned models:

1) We needed initial trajectories that were consistent with the learned models, and sometimes reasonable or feasible trajectories do not exist due to modeling error in the learned model.
2) During optimization, the forward integration of a learned model in time often blows up when the learned model is inaccurate or when the plant is unstable and the current policy fails to stabilize it.
3) The backward integration to produce a value function and a corresponding policy uses derivatives of the learned model, which are often quite inaccurate in the early stages of learning, producing inaccurate value function estimates and ineffective policies.
4) Dynamic planners amplify modeling error, because they take advantage of any modeling error that reduces cost, and because some planners use derivatives, which can be quite inaccurate.
5) The new knowledge gained in attempting a task may not change the predictions the system makes about the task (falling down might not tell us much about the forces needed in walking).

In the task shown in Figure 3, we used a direct reinforcement learning approach that adjusted the task goals in addition to optimal control to overcome modeling errors that the learning system did not handle [44].

Figure 3. The robot swinging up an inverted pendulum.

We use another form of one-link pendulum swing-up as

an example problem to provide the reader with a visualizable

example of a value function and policy (Figure 4). In this one-

link pendulum swing-up, a motor at the base of the pendulum

swings a rigid arm from the downward stable equilibrium to

the upright unstable equilibrium and balances the arm there.

What makes this challenging is that the one-step cost function

penalizes the amount of torque used and the deviation of

the current position from the goal. The controller must try

to minimize the total cost of the trajectory. The one-step

cost function for this example is a weighted sum of the

squared position errors (θ̃: the difference between the current angle and the goal angle) and the squared torques τ:

L(x, u) = 0.1 θ̃² T + τ² T,

where 0.1 weights the position error relative to the torque penalty and T is the time step of the simulation (0.01 s). Including the time step T in the optimi-

zation criterion allows comparison with controllers with dif-

ferent time steps and continuous time controllers. There are

no costs associated with the joint velocity. Figure 4 shows the

optimal value function and policy. The optimal trajectory is

shown as a yellow line in the value function plot and as a black

line with a yellow border in the policy plot [Figure 4(b) and

(c)]. The value function is cut off above 20 so that we can see

the details of the part of the value function that determines the

optimal trajectory. The goal is at the state (0,0).
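For reference, the one-step cost and a simple discrete-time pendulum model for such a swing-up study can be written down directly. Only the cost structure (0.1 θ̃² T + τ² T with T = 0.01 s) follows the text; the physical parameters and the Euler integration are illustrative assumptions. These two functions plug directly into a grid-based value iteration such as the sketch shown earlier.

```python
import numpy as np

T = 0.01  # simulation time step (s)

def pendulum_step(x, tau, g=9.81, m=1.0, l=1.0, b=0.05):
    """Euler step of a one-link pendulum; x = [angle measured from the upright goal, angular velocity]."""
    th, thd = x
    thdd = (tau - b * thd + m * g * l * np.sin(th)) / (m * l**2)   # gravity destabilizes the upright pose
    return np.array([th + T * thd, thd + T * thdd])

def one_step_cost(x, tau):
    """L(x, u) = 0.1 * theta_tilde^2 * T + tau^2 * T, as in the swing-up example."""
    th_err = x[0]                 # angle relative to the upright goal
    return 0.1 * th_err**2 * T + tau**2 * T
```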

Model-Free Learning

There are several popular methods of approaching model-

free robot learning. Value function-based methods are dis-

cussed in the context of actor-critic methods, temporal dif-

ference (TD) learning, and Q-learning. A novel wave of

algorithms avoids value functions and focuses on directly

learning the policy, either with gradient methods or proba-

bilistic methods.

Value Function Approaches

Instead of using dynamic programming, the value function

V(x) can be estimated with TD learning [7], [45]. Essentially,

TD enforces the validity of the Bellman equations for tempo-

rally adjacent states, which can be shown to lead to a spatially

consistent estimate of the value function for a given policy. To

improve the policy, TD needs to be coupled to a simultaneous

policy update using actor-critic methods [7].

Alternatively, instead of the value function V(x), the action

value function Q(x,u) can be used, which is defined as

Q(x, u) = L(x_0, u_0) + min_{u_k} Σ_{k=1}^{∞} L(x_k, u_k) [7], [46]. Know-

ing Q(x,u) for all actions in a state allows choosing the one

with the maximal (or minimal for penalty costs) Q-value as

the optimal action. Q-learning can be conceived of as TD

learning in the joint space of states and actions.
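For a discrete state-action grid, the Q-learning update takes only a few lines; continuous problems replace the table with a function approximator, which is where the difficulties discussed next arise. The environment interface below is an assumption for illustration.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.98, eps=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy, minimizing one-step costs.

    Assumed interface: env.reset() -> state index; env.step(s, a) -> (next_state, cost, done).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = rng.integers(n_actions) if rng.random() < eps else int(np.argmin(Q[s]))
            s_next, cost, done = env.step(s, a)
            target = cost + (0.0 if done else gamma * np.min(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])      # TD error drives the update
            s = s_next
    return Q
```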

TD and Q-learning work well for discrete state-action

spaces but become more problematic in continuous state-

action scenarios. In continuous spaces, function approximators

need to be used to represent the value function and policy.

Achieving reliable estimation of these functions usually

requires a large number of samples that densely fill the relevant

space for learning, which is hard to accomplish in actual

experiments with complex robot systems. There are also no

guarantees that, during learning, the robot will not be given

unsafe commands. Thus, many practical approaches learn first

in simulations (which is essentially a model-based approach) until reasonable performance is achieved, before continuing to experiment on an actual robot to adjust the control policy to the true physics of the world [47].

Figure 4. (a) Configurations from the simulated one-link pendulum optimal trajectory every half second and at the end of the trajectory. (b) Value function for the one-link example. (c) Policy for the one-link example. (d) Trajectory-based approach: random states (dots) and trajectories (black lines) used to plan the one-link swing-up, superimposed on a contour map of the value function [33].

In the end, it is intractable to find a globally optimal control

policy in high dimensional robot systems, as global optimality

requires exploration of the entire state-action space. Thus,

local optimization such as trajectory optimization seems to be

more practical, using initialization of the policy from some

informed guess, for instance, imitation learning [44], [48]–

[51]. Fitted Q-iteration is an example of a model-free learning

algorithm that approximates the Q-function only along some

sampled trajectories [52], [53]. Recent developments have

given up on estimating the value function and rather focus

directly on learning the control policy from trajectory rollouts,

which is the topic of the following sections.

Policy Gradient Methods

Policy gradient methods usually assume that the cost of a motor skill can be written as

J(x_0) = E_τ { Σ_{k=0}^{N} γ^k L(x_k, u_k) },    (4)

which is the expected sum of discounted rewards (γ ∈ [0, 1]) over a (potentially infinite) time horizon N. The expectation E{·} is taken over all trajectories τ that start in state x_0. The goal is to find the motor commands u_k that optimize this cost function. Most approaches assume that there is a start state x = x_0 and/or a start-state distribution [54]. The control policy is also often compactly parameterized, e.g., by means of a basis-function representation u = θᵀφ(x), where θ are the policy parameters [see also (1)], and φ(x) is a vector of nonlinear basis functions provided by the user. Mainly for the purpose of exploration, the policy can be chosen to be stochastic, e.g., with a normal distribution u ~ N(θᵀφ(x), Σ), although cases exist where only a stochastic policy is optimal [54].

The essence of policy gradient methods is to compute the gradient ∂J/∂θ and optimize (4) with gradient-based incremental updates. As discussed in more detail in [55], a variety of algorithms exist to compute the gradient. Finite-difference gradients [56] perform a perturbation analysis of the parameter vector θ and estimate the gradient from a first-order numerical Taylor series expansion. The REINFORCE algorithm [57], [58] is a straightforward derivative computation of the logarithm of (4), assuming as the probability of a trajectory

p_θ(τ) = p(x_0) Π_{k=1}^{N} p(x_k | x_{k-1}, u_{k-1}) π_θ(u_{k-1} | x_{k-1}),

and emphasizing that the parameters θ only appear in the stochastic policy π_θ, such that many terms in the gradient computation drop out.

GPOMDP [59] and methods based on the policy gradient

theorem [54] are more efficient versions of REINFORCE

(for more details, see [55]). Peters and Schaal [60] suggested a

second-order gradient method derived from insights of [61]

and [62], which is currently among the fastest gradient-learn-

ing approaches. Reference [63] emphasized that the choice of

injecting noise in the stochastic policy can strongly influence

the efficiency of the gradient updates.
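An episodic REINFORCE-style gradient estimate for a Gaussian, linear-in-features policy can be sketched as follows. It is deliberately bare-bones: no baselines, natural gradient, or variance-reduction refinements from the cited work, and the feature and rollout interfaces are assumptions.

```python
import numpy as np

def reinforce_gradient(theta, features, rollout, n_rollouts=20, sigma=0.1, seed=0):
    """Monte-Carlo REINFORCE estimate of dJ/dtheta for u ~ N(theta^T phi(x), sigma^2 I).

    Assumed interfaces: features(x) -> phi(x); rollout(policy) -> list of (x, u, cost)
    tuples for one episode. The returned gradient points toward increasing total
    cost, so a cost minimizer would step along the negative gradient.
    """
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(n_rollouts):
        def policy(x):
            mean = theta.T @ features(x)
            return mean + sigma * rng.standard_normal(mean.shape)   # exploration noise
        episode = rollout(policy)
        total_cost = sum(c for _, _, c in episode)
        score = np.zeros_like(theta)
        for x, u, _ in episode:
            phi = features(x)
            score += np.outer(phi, u - theta.T @ phi) / sigma**2    # gradient of the log-likelihood
        grad += score * total_cost
    return grad / n_rollouts
```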

Policy gradient methods can scale to high-dimensional state-action spaces, at the cost of finding only locally optimal control policies, and have become rather popular in robotics [64]–[66]. One drawback of policy gradients is that they require manual tuning of gradient parameters, which can be tedious. Probabilistic methods, as discussed in the next section, try to eliminate gradient computations.

Probabilistic Direct Policy Learning

Transforming reinforcement learning into a probabilistic estima-

tion approach is inspired by the hope of bringing to bear the

wealth of statistical learning techniques that were developed over

the last 20 years of machine-learning research. An early attempt

can be found in [67], where reinforcement learning was formu-

lated as an expectation–maximization (EM) algorithm [68]. The

important idea was to treat the reward L(x, u) as a pseudoprobability, i.e., it has to be strictly positive, and the integral over the state-action space of the reward has to result in a finite number. Transforming traditional convex reward functions with the exponential function is often used to achieve this property, at the cost that the learning problem gets slightly altered by this change of cost function. Equation (4) can thus be thought of as a likelihood, and the corresponding log likelihood becomes

log J(x) = log ∫ p_θ(τ) R(τ) dτ,  where  R(τ) = Σ_{k=0}^{N} γ^k L(x_k, u_k).    (5)

This log likelihood can be optimized with the EM algo-

rithm. In [15], such an approach was used to learn operational

space controllers, where the reinforcement learning compo-

nent enabled a consistent resolution of redundancy. In [69],

the previous approach was extended to learning from trajectories; see also the contribution by Kober and Peters (pp. 55–62). Extending this line of work, [70] and [71] added a more thorough treatment of learning in the infinite discounted-horizon case, where the algorithm can essentially determine the most suitable temporal window for optimization.

Another way of transforming reinforcement learning into a

statistical estimation problem was suggested in [72] and [73].

Here, it was realized that optimization with the stochastic

Hamilton-Jacobi-Bellman equations can be transformed into a

path-integral estimation problem, which can be derived with

the Feynman-Kac theorem [31], [74]. While this formulation

is normally based on value functions and requires a model-

based approach, Theodorou et al. [31] realized that even

model-free methods can be obtained. The resulting reinforcement learning algorithm resembles that of [69], however, without the requirement that the reinforcement signal is a pseudoprobability. Because of its grounding in first principles of optimal control theory, its simplicity, and the absence of open learning parameters except for the exploration noise, this algorithm

might be one of the most straightforward methods of trajec-

tory-based reinforcement learning to date. It should also be


mentioned that [75] developed a

model-based reinforcement learning

framework with a special probabilistic

control cost for discrete state-action

spaces that, in its limit to continuous

state-action spaces, will result in a path-

integral formulation.
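The common core of these probabilistic and path-integral approaches can be caricatured as reward-weighted averaging of explored policy parameters: sample perturbed parameter vectors, run rollouts, turn costs into exponentiated weights, and average. The sketch below shows only that skeleton; it is not the exact EM-based or path-integral update of the cited papers, and its names are assumptions.

```python
import numpy as np

def reward_weighted_update(theta, rollout_cost, n_samples=20, noise_std=0.1, lam=1.0, seed=0):
    """One generation of exploration-and-reweighting in policy parameter space.

    rollout_cost(theta) -> scalar trajectory cost for executing the policy with
    parameters theta. Lower cost yields a larger weight exp(-cost / lam), and the
    new parameter vector is the weight-normalized average of the sampled parameters.
    """
    rng = np.random.default_rng(seed)
    samples = theta + noise_std * rng.standard_normal((n_samples, theta.size))
    costs = np.array([rollout_cost(s) for s in samples])
    weights = np.exp(-(costs - costs.min()) / lam)   # subtract the minimum for numerical stability
    weights /= weights.sum()
    return weights @ samples                         # reward-weighted parameter average
```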

Example Application

Figure 5 illustrates our application of

path-integral reinforcement learning to

a robot-learning problem [31]. The

robot dog is to jump across a gap. The

jump should make as much forward

progress as possible, as it is a maneuver

in a legged locomotion competition,

which scores the speed of the robot.

The robot has three degrees of freedom (DoFs) per leg, and thus a total of 12

DoFs. Each DoF was represented as a

parameterized movement primitive [76] with 50 basis func-

tions. An initial seed behavior was taught by learning from

demonstration, which allowed the robot barely to reach the

other side of the gap without falling into the gap—the demon-

stration was generated from a manual adjustment of knot

points in a spline-based trajectory plan for each leg.

Path-integral reinforcement learning primarily used the forward progress as a reward and slightly penalized the squared acceleration of each DoF and the squared norm of the parameter vector, i.e., a typical form of complexity regularization [77]. Learning was performed in a physics simulator of the robot dog, as the real robot dog was not available for this experiment. Figure 5 illustrates that after about 30 trials, the performance of the robot was significantly improved, such that after the jump, almost the entire body was lying on the other side of the gap. It should be noted that applying path-integral reinforcement learning was algorithmically very simple, and manual tuning only focused on generating a good cost function.

Figure 5. (a) Actual and simulated robot dog. (b) Learning curve of optimizing the jump behavior with path-integral reinforcement learning.

Imitation Learning, Policy Parameterizations, and

Inverse Reinforcement Learning

While space constraints will not allow us to go into more

detail, three interwoven topics in robot learning are worth

mentioning.

First, imitation learning has become a popular topic to initi-

alize and speed up robot learning. Reviews on this topic can

be found, for instance, in [48], [49], and [78].

Second, determining useful parameterizations for control

policies is a topic that is often discussed in conjunction with

imitation learning. Many different approaches have been sug-

gested in the literature, for instance, based on splines [79], hid-

den Markov models [80], nonlinear attractor systems [76], and

other methods. Billard et al. [78] provide a survey of this topic.

Finally, designing useful reward functions remains one of the

most time-consuming and frustrating topics in robot learning.

Thus, extracting the reward function from observed behavior is

a topic of great importance for robot learning and imitation

learning under the assumption that the observed behavior is

optimal under a certain criterion. Inverse reinforcement learning

[81], apprenticeship learning [82], and maximum margin plan-

ning [83] are some of the prominent examples in the literature.

Conclusions

Recent trends in robot learning are to use trajectory-based

optimal control techniques and reinforcement learning to scale to complex robotic systems. On the one hand, increased compu-

tational power and multiprocessing, and on the other hand,

probabilistic reinforcement learning methods and function

approximation, have contributed to a steadily increasing inter-

est in robot learning. Imitation learning has helped signifi-

cantly to start learning with reasonable initial behavior.

However, many applications are still restricted to rather low-

dimensional domains and toy applications. Future work will

have to demonstrate the continual and autonomous learning

abilities, which were alluded to in the introduction.

Acknowledgments

This research was supported in part by National Science

Foundation grants ECS-0326095, EEC-0540865, ECCS-0824077, IIS-0535282, CNS-0619937, IIS-0917318, CBET-0922784, and EECS-0926052, the DARPA program on Learning

Locomotion, the Okawa Foundation, and the ATR Compu-

tational Neuroscience Laboratories.

Keywords

Robot learning, learning control, reinforcement learning,

optimal control.

References

[1] S. Schaal, “The new robotics—Towards human-centered machines,” HFSP J. Frontiers Interdisciplinary Res. Life Sci., vol. 1, no. 2, pp. 115–126, 2007.

[2] K. J. Åström and B. Wittenmark, Adaptive Control. Reading, MA: Addison-Wesley, 1989.

[3] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. Cambridge, MA:

MIT Press, 2005.


[4] M. Buehler, The DARPA Urban Challenge: Autonomous Vehicles in City

Traffic, 1st ed. New York: Springer-Verlag, 2009.

[5] M. Buehler, K. Iagnemma, and S. Singh, The 2005 DARPA Grand Chal-

lenge: The Great Robot Race. New York: Springer-Verlag, 2007.

[6] M. Roy, G. Gordon, and S. Thrun, “Finding approximate POMDP solu-

tions through belief compression,” J. Artif. Intell. Res., vol. 23, pp. 1–40,

2005.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cam-

bridge, MA: MIT Press, 1998.

[8] J. Si, Handbook of Learning and Approximate Dynamic Programming. Hobo-

ken, NJ: IEEE Press/Wiley-Interscience, 2004.

[9] R. Bellman, Dynamic Programming. Princeton, NJ: Princeton Univ. Press,

1957.

[10] P. Dyer and S. R. McReynolds, The Computation and Theory of Optimal

Control. New York: Academic, 1970.

[11] L. Sciavicco and B. Siciliano, Modelling and Control of Robot Manipulators.

New York: Springer-Verlag, 2000.

[12] M. I. Jordan and D. E. Rumelhart, “Supervised learning with a distal

teacher,” Cogn. Sci., vol. 16, pp. 307–354, 1992.

[13] A. D’Souza, S. Vijayakumar, and S. Schaal, “Learning inverse kine-

matics,” in Proc. IEEE Int. Conf. Intelligent Robots and Systems (IROS

2001), Maui, HI, Oct. 29–Nov. 3, 2001, pp. 298–301.

[14] D. Bullock, S. Grossberg, and F. H. Guenther, “A self-organizing neural

model of motor equivalent reaching and tool use by a multijoint arm,” J.

Cogn. Neurosci., vol. 5, no. 4, pp. 408–435, 1993.

[15] J. Peters and S. Schaal, “Learning to control in operational space,” Int. J.

Robot. Res., vol. 27, pp. 197–212, 2008.

[16] Z. Ghahramani and M. I. Jordan, “Supervised learning from incomplete

data via an EM approach,” in Advances in Neural Information Processing Sys-

tems 6, J. D. Cowan, G. Tesauro, and J. Alspector, Eds. San Mateo, CA:

Morgan Kaufmann, 1994, pp. 120–127.

[17] S. Schaal and C. G. Atkeson, “Constructive incremental learning from

only local information,” Neural Comput., vol. 10, no. 8, pp. 2047–2084,

1998.

[18] W. S. Cleveland, “Robust locally weighted regression and smoothing

scatterplots,” J. Amer. Statist. Assoc., vol. 74, pp. 829–836, 1979.

[19] C. G. Atkeson, “Using local models to control movement,” in Advances

in Neural Information Processing Systems 1, D. Touretzky, Ed. San Mateo,

CA: Morgan Kaufmann, 1989, pp. 157–183.

[20] C. G. Atkeson, A. W. Moore, and S. Schaal, “Locally weighted

learning,” Artif. Intell. Rev., vol. 11, no. 1–5, pp. 11–73, 1997.

[21] C. G. Atkeson, A. W. Moore, and S. Schaal, “Locally weighted learning

for control,” Artif. Intell. Rev., vol. 11, no. 1–5, pp. 75–113, 1997.

[22] S. Vijayakumar, A. D’Souza, and S. Schaal, “Incremental online learning

in high dimensions,” Neural Comput., vol. 17, no. 12, pp. 2602–2634, 2005.

[23] J.-A. Ting, A. D’Souza, S. Vijayakumar, and S. Schaal, “A Bayesian

approach to empirical local linearizations for robotics,” in Proc. Int. Conf.

Robotics and Automation (ICRA2008), Pasadena, CA, May 19–23, 2008,

pp. 2860–2865.

[24] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine

Learning. Cambridge, MA: MIT Press, 2006.

[25] D. Nguyen-Tuong, M. Seeger, and J. Peters, “Local gaussian process

regression for real time online model learning and control,” in Proc. Advan-

ces in Neural Information Processing Systems 21 (NIPS 2008), D. Schuurmans,

Y. Bengio, and D. Koller, Eds. Vancouver, BC, Dec. 8–11, 2009,

pp. 1193–1200.

[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, “Gaussian process

dynamic programming,” Neurocomputing, vol. 72, no. 7–9, pp. 1508–

1524, 2009.

[27] L. Csató and M. Opper, “Sparse representation for Gaussian process

models,” in Proc. Advances in Neural Information Processing Systems 13 (NIPS

2000), Denver, CO, 2001, pp. 444–450.

[28] D. M. Wolpert and M. Kawato, “Multiple paired forward and inverse

models for motor control,” Neural Netw., vol. 11, no. 7–8, pp. 1317–

1329, 1998.

[29] D. H. Jacobson and D. Q. Mayne, Differential Dynamic Programming. New

York: American Elsevier, 1970.

[30] K. Doya, “Reinforcement learning in continuous time and space,” Neu-

ral Comput., vol. 12, no. 1, pp. 219–245, Jan. 2000.

[31] E. Theodorou, J. Buchli, and S. Schaal, “Reinforcement learning in high

dimensional state spaces: A path integral approach,” submitted for

publication.

[32] R. Munos and A. Moore, “Variable resolution discretization in optimal

control,” Mach. Learn., vol. 49, no. 2/3, p. 33, 2002.

[33] C. G. Atkeson and B. J. Stephens, “Random sampling of states in

dynamic programming,” IEEE Trans. Syst., Man, Cybern. B, vol. 38,

no. 4, pp. 924–929, 2008.

[34] C. G. Atkeson, “Randomly sampling actions in dynamic programming,”

in Proc. IEEE Int. Symp. Approximate Dynamic Programming and Reinforce-

ment Learning, 2007, ADPRL’07, pp. 185–192.

[35] E. Whitman and C. G. Atkeson, “Control of a walking biped using a

combination of simple policies,” in Proc. IEEE/RAS Int. Conf. Humanoid

Robotics, Paris, France, Dec. 7–10, 2009, pp. 520–527.

[36] Tomlab Optimization Inc. (2010). PROPT—Matlab optimal control

software [Online]. Available: http://tomdyn.com/

[37] Technische Universit€at Darmstadt. (2010). DIRCOL: A direct colloca-

tion method for the numerical solution of optimal control problems

[Online]. Available: http://www.sim.informatik.tu-darmstadt.de/sw/dircol

[38] Stanford Business Software Corporation. (2010). SNOPT; Software for

large-scale nonlinear programming [Online]. Available: http://www.sbsi-sol-optimize.com/asp/sol_product_snopt.htm

[39] A. Safonova, J. K. Hodgins, and N. S. Pollard, “Synthesizing physically

realistic human motion in low-dimensional, behavior-specific spaces,”

ACM Trans. Graph. J. (SIGGRAPH 2004 Proc.), vol. 23, no. 3, pp. 514–

521, 2004.

[40] C. Liu and C. G. Atkeson, “Standing balance control using a

trajectory library,” presented at the IEEE/RSJ Int. Conf. Intelligent

Robots and Systems (IROS 2009), 2009.

[41] A. Ng, “Pegasus: A policy search method for large MDPs and

POMDPs,” presented at the Uncertainty in Artificial Intelligence (UAI),

2000.

[42] E. Tse, Y. Bar-Shalom, and L. Meier, III, “Wide-sense adaptive dual

control for nonlinear stochastic systems,” IEEE Trans. Automat. Contr.,

vol. 18, no. 2, pp. 98–108, 1973.

[43] Y. Bar-Shalom and E. Tse, “Caution, probing and the value of informa-

tion in the control of uncertain systems,” Ann. Econ. Social Meas., vol. 4,

no. 3, pp. 323–338, 1976.

[44] C. G. Atkeson and S. Schaal, “Robot learning from demonstration,” in

Proc. 14th Int. Conf. Machine Learning (ICML‘97), D. H. Fisher, Jr., Ed.

Nashville, TN, July 8–12, 1997, pp. 12–20.

[45] R. S. Sutton, “Learning to predict by the methods of temporal differ-

ences,” Mach. Learn., vol. 3, no. 1, pp. 9–44, 1988.

[46] C. J. C. H. Watkins, “Learning from delayed rewards,” Ph.D. thesis,

Cambridge Univ., U.K., 1989.

[47] J. Morimoto and K. Doya, “Acquisition of stand-up behavior by a real

robot using hierarchical reinforcement learning,” Robot. Auton. Syst.,

vol. 36, no. 1, pp. 37–51, 2001.

[48] S. Schaal, “Is imitation learning the route to humanoid robots?” Trends

Cogn. Sci., vol. 3, no. 6, pp. 233–242, 1999.

[49] S. Schaal, A. Ijspeert, and A. Billard, “Computational approaches to

motor learning by imitation,” Philos. Trans. R. Soc. London B, Biol. Sci.,

vol. 358, no. 1431, pp. 537–547, 2003.

[50] C. G. Atkeson and S. Schaal, “Learning tasks from a single demon-

stration,” in Proc. IEEE Int. Conf. Robotics and Automation (ICRA’97),

Albuquerque, NM, Apr. 20–25, 1997, pp. 1706–1712.

[51] S. Schaal, “Learning from demonstration,” in Proc. Advances in Neural

Information Processing Systems 9, M. C. Mozer, M. Jordan, and T. Petsche,

Eds. Cambridge, MA, 1997, pp. 1040–1046.

[52] D. Ernst, P. Geurts, and L. Wehenkel, “Tree-based batch mode rein-

forcement learning,” J. Mach. Learn. Res., vol. 6, pp. 503–556, 2005.

[53] G. Neumann and J. Peters, “Fitted Q-iteration by advantage weighted

regression,” in Proc. Advances in Neural Information Processing Systems 21

(NIPS 2008), D. Schuurmans, Y. Bengio, and D. Koller, Eds. Vancouver,

BC, Dec. 8–11, 2009, pp. 1177–1184.


[54] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient

methods for reinforcement learning with function approximation,” in

Proc. Advances in Neural Processing Systems 12, S. A. Solla, T. K. Leen, and

K.-R. Müller, Eds. Denver, CO, 2000.

[55] J. Peters and S. Schaal, “Reinforcement learning of motor skills with

policy gradients,” Neural Netw., vol. 21, no. 4, pp. 682–697, May 2008.

[56] P. Sadegh and J. Spall, “Optimal random perturbations for stochastic

approximation using a simultaneous perturbation gradient approx-

imation,” presented at the Proc. American Control Conf., 1997.

[57] R. J. Williams, “Simple statistical gradient-following algorithms for con-

nectionist reinforcement learning,” Mach. Learn., vol. 8, no. 3–4, pp. 229–

256, 1992.

[58] V. Gullapalli, “A stochastic reinforcement learning algorithm for learning

real-valued functions,” Neural Netw., vol. 3, no. 6, pp. 671–692, 1990.

[59] D. Aberdeen and J. Baxter, “Scaling internal-state policy-gradient meth-

ods for POMDPs,” in Proc. 19th Int. Conf. Machine Learning (ICML-2002),

Sydney, Australia, 2002, pp. 3–10.

[60] J. Peters and S. Schaal, “Natural actor critic,” Neurocomputing, vol. 71,

no. 7–9, pp. 1180–1190, 2008.

[61] S. Amari, “Natural gradient learning for over- and under-complete bases

in ICA,” Neural Comput., vol. 11, no. 8, pp. 1875–1883, Nov. 1999.

[62] S. Kakade, “Natural policy gradient,” presented at the Advances in Neu-

ral Information Processing Systems, Vancouver, CA, 2002.

[63] T. Rückstieß, M. Felder, and J. Schmidhuber, “State-dependent explo-

ration for policy gradient methods,” presented at the European Conf.

Machine Learning and Principles and Practice of Knowledge Discovery in

Databases 2008, Part II, LNAI 5212, 2008.

[64] G. Endo, J. Morimoto, T. Matsubara, J. Nakanish, and G. Cheng,

“Learning CPG-based biped locomotion with a policy gradient method:

Application to a humanoid robot,” Int. J. Robot. Res., vol. 27, no. 2,

pp. 213–228, 2008.

[65] R. Tedrake, T. W. Zhang, and S. Seung, “Stochastic policy gradient rein-

forcement learning on a simple 3D biped,” in Proc. Int. Conf. Intelligent

Robots and Systems (IROS 2004), Sendai, Japan, Oct. 2004, pp. 2849–2854.

[66] J. Peters and S. Schaal, “Policy gradient methods for robotics,” in Proc.

IEEE Int. Conf. Intelligent Robotics Systems (IROS 2006), Beijing, Oct. 9–

15, 2006, pp. 2219–2225.

[67] P. Dayan and G. Hinton, “Using EM for reinforcement learning,” Neural

Comput., vol. 9, no. 2, pp. 271–278, 1997.

[68] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood

from incomplete data via the EM algorithm,” J. R. Statist. Soc. B, vol. 39,

no. 1, pp. 1–38, 1977.

[69] J. Kober and J. Peters, “Learning motor primitives in robotics,” in Proc.

Advances in Neural Information Processing Systems 21 (NIPS 2008),D.

Schuurmans, Y. Bengio, and D. Koller, Eds. Vancouver, BC, Dec. 8–11,

2009, pp. 297–304.

[70] M. Toussaint and A. Storkey, “Probabilistic inference for solving discrete

and continuous state Markov decision processes,” presented at the 23rd

Int. Conf. Machine Learning (ICML 2006), 2006.

[71] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis, “Learning model-

free control by a Monte-Carlo EM algorithm,” Auton. Robots, vol. 27,

no. 2, pp. 123–130, 2009.

[72] H. J. Kappen, “Linear theory for control of nonlinear stochastic systems,”

Phys. Rev. Lett., vol. 95, no. 20, pp. 200201–200204, Nov. 2005.

[73] H. J. Kappen, “An introduction to stochastic control theory, path inte-

grals and reinforcement learning,” in Cooperative Behavior in Neural Systems,

vol. 887, J. Marro, P. L. Garrido, and J. J. Torres, Eds. 2007, pp. 149–181.

[74] E. Theodorou, J. Buchli, and S. Schaal, “Path integral stochastic optimal

control for rigid body dynamics,” presented at the IEEE Int. Symp.

Approximate Dynamic Programming and Reinforcement Learning

(ADPRL2009), Nashville, TN, Mar. 30–Apr. 2, 2009.

[75] E. Todorov, “Efficient computation of optimal actions,” Proc. Nat. Acad.

Sci. USA, vol. 106, no. 28, pp. 11478–11483, July 2009.

[76] A. Ijspeert, J. Nakanishi, and S. Schaal, “Learning attractor landscapes for

learning motor primitives,” in Advances in Neural Information Processing Systems

15, S. Becker, S. Thrun, and K. Obermayer, Eds. 2003, pp. 1547–1554.

[77] C. M. Bishop, Pattern Recognition and Machine Learning. New York:

Springer-Verlag, 2006.

[78] A. Billard, S. Calinon, R. Dillmann, and S. Schaal, “Robot programming

by demonstration,” in Handbook of Robotics, vol. 1, B. Siciliano and O. Khatib,

Eds. Cambridge, MA: MIT Press, 2008, ch. 59.

[79] Y. Wada and M. Kawato, “Trajectory formation of arm movement by a

neural network with forward and inverse dynamics models,” Syst. Comput.

Jpn., vol. 24, pp. 37–50, 1994.

[80] T. Inamura, I. Toshima, H. Tanie, and Y. Nakamura, “Embodied sym-

bol emergence based on mimesis theory,” Int. J. Robot. Res., vol. 23,

no. 4–5, p. 363, Apr.-May 2004.

[81] A. Y. Ng and S. Russell, “Algorithms for inverse reinforcement

learning,” in Proc. 17th Int. Conf. Machine Learning (ICML 2000), Stanford,

CA, 2000, pp. 663–670.

[82] P. Abbeel and A. Ng, “Apprenticeship learning via inverse reinforcement

learning,” in Proc. 21st Int. Conf. Machine Learning, 2004.

[83] N. Ratliff, D. Silver, and J. A. Bagnell, “Learning to search: Functional

gradient techniques for imitation learning,” Auton. Robots, vol. 27, no. 1,

pp. 25–53, 2009.

Stefan Schaal is a professor of computer science, neuro-

science, and biomedical engineering at the University of

Southern California and an invited researcher at the ATR

Computational Neuroscience Laboratory in Japan. He has

coauthored more than 200 papers in refereed journals and

conferences. He is a cofounder of the IEEE/RAS International Conference on Humanoid Robotics as well as Robotics: Science and Systems. He serves on the editorial board of Neural Networks,

International Journal of Humanoid Robotics, and Frontiers in Neuro-

robotics. He is a Member of the German National Academic

Foundation (Studienstiftung des Deutschen Volkes), Alexander

von Humboldt Foundation, Society for Neuroscience, the

Society for Neural Control of Movement, the IEEE, and AAAS.

His research interests include topics of statistical and machine

learning, neural networks, computational neuroscience, func-

tional brain imaging, nonlinear dynamics, nonlinear control

theory, and biomimetic robotics.

Christopher G. Atkeson received his M.S. degree in applied

mathematics (computer science) from Harvard University and

his Ph.D. degree in brain and cognitive sciences from Massa-

chusetts Institute of Technology (MIT). He is a professor at

the Robotics Institute and Human–Computer Interaction

Institute, Carnegie Mellon University. He joined the MIT faculty in 1986 and moved to the Georgia Institute of

Technology College of Computing in 1994. He has received

the National Science Foundation Presidential Young Investi-

gator Award, Sloan Research Fellowship, and Teaching

Award from the MIT Graduate Student Council. His research

focuses on humanoid robotics and robot learning by using

challenging dynamic tasks such as juggling. His specific

research interests include nonparametric learning, memory-

based learning including approaches based on trajectory libra-

ries, reinforcement learning, and other forms of learning based

on optimal control, learning from demonstration, and model-

ing human behavior.

Address for Correspondence: Stefan Schaal, Computer Science,

Neuroscience, and Biomedical Engineering, University of

Southern California, Los Angeles, CA 90089-2905 USA.

E-mail: sschaal@usc.edu.
