
Learning Control in Robotics

Authors: Stefan Schaal and Christopher G. Atkeson

Abstract

Recent trends in robot learning are to use trajectory-based optimal control techniques and reinforcement learning to scale to complex robotic systems. On the one hand, increased computational power and multiprocessing, and on the other hand, probabilistic reinforcement learning methods and function approximation, have contributed to a steadily increasing interest in robot learning. Imitation learning has helped significantly to start learning with reasonable initial behavior. However, many applications are still restricted to rather low-dimensional domains and toy applications. Future work will have to demonstrate the continual and autonomous learning abilities, which were alluded to in the introduction.
Trajectory-Based Optimal Control Techniques
In a not too distant future, robots will be a natural part of
daily life in human society, providing assistance in many
areas ranging from clinical applications, education and care
giving, to normal household environments [1]. It is hard to
imagine that all possible tasks can be preprogrammed in such
robots. Robots need to be able to learn, either by themselves
or with the help of human supervision. Additionally, wear and
tear on robots in daily use needs to be automatically compen-
sated for, which requires a form of continuous self-calibration,
another form of learning. Finally, robots need to react to sto-
chastic and dynamic environments, i.e., they need to learn
how to optimally adapt to uncertainty and unforeseen
changes. Robot learning is going to be a key ingredient for the
future of autonomous robots.
While robot learning covers a rather large field, from learn-
ing to perceive, to plan, to make decisions, etc., we will focus
this review on topics of learning control, in particular, as it is
concerned with learning control in simulated or actual physi-
cal robots. In general, learning control refers to the process of
acquiring a control strategy for a particular control system and
a particular task by trial and error. Learning control is usually
distinguished from adaptive control [2] in that the learning sys-
tem can have rather general optimization objectives (not just,
e.g., minimal tracking error) and is permitted to fail during
the process of learning, while adaptive control emphasizes fast
convergence without failure. Thus, learning control resembles
the way that humans and animals acquire new movement
strategies, while adaptive control is a special case of learning
control that fulfills stringent performance constraints, e.g., as
needed in life-critical systems like airplanes.
Learning control has been an active topic of research for at
least three decades. However, given the lack of working robots
that actually use learning components, more work needs to be
done before robot learning will make it beyond the laboratory
environment. This article will survey some ongoing and past
activities in robot learning to assess where the field stands and
where it is going. We will largely focus on nonwheeled robots
and less on topics of state estimation, as typically explored in
wheeled robots [3]-[6], and we emphasize learning in continuous
state-action spaces rather than discrete state-action spaces [7], [8].
We will illustrate the different topics of robot learning with
examples from our own research with anthropomorphic and
humanoid robots.
The Basics of Learning Control
A key question in learning control is what it is that should be
learned. To address this issue, it is helpful to begin with one of
the most general frameworks of learning control, as originally
developed in the middle of the 20th century in the fields of
optimization theory, optimal control, and in particular,
dynamic programming [9], [10]. Here, the goal of learning
control was formalized as the need to acquire a task-dependent
control policy π that maps a continuous-valued state vector x
of a controlled system and its environment, possibly depending
also on time t, to a continuous-valued control vector u:
u = π(x, t, θ).   (1)

The parameter vector θ contains the problem-specific
parameters in the policy π that need to be adjusted by the
learning system. The controlled system can generally be
expressed as a nonlinear dynamics function

ẋ = f(x, u, t, ε_x)   (2)

with observation equations

y = h(x, u, t, ε_y)   (3)

that describe how the observations y of the system are derived
from the full state vector x; the terms ε_x and ε_y denote noise
terms. Thus, learning control means finding a (usually nonlinear)
function π that is adequate for a given desired behavior
and movement system. A repertoire of motor skills is com-
posed of many such policies that are sequenced and superim-
posed to achieve complex motor skills.
How the control policy is learned, however, can proceed in
many different ways. Assuming that the model equations (2) and
(3) are unknown, one classical approach is to learn these models
using methods of function approximation and then compute a
controller based on the estimated model, which is often discussed
as the certainty-equivalence principle in the adaptive control liter-
ature [2]. Such techniques are summarized under the name
model-based learning, indirect learning, or internal model learn-
ing. Alternatively, model-free learning of the policy is possible
given an optimization or reward criterion, usually using methods
from optimal control or reinforcement learning. Such model-free
learning is also known as direct learning, since the policy is learned
directly, i.e., without a detour through model identification.
It is useful to distinguish between several general classes of
motor tasks that could be the goal of learning. Regulator tasks
keep the system at a particular set point
of operation; a typical example is bal-
ancing a pole on a fingertip or standing
upright on two legs. Tracking tasks
require the control system to follow a
given desired trajectory within the abil-
ities of the control system. Discrete
movement tasks, also called one-shot
tasks, are defined by achieving a particu-
lar goal at which the motor skill termi-
nates. A basketball foul shot or grasping a
cup of coffee are representative exam-
ples. Periodic movement tasks are typical
in the domain of locomotion. Finally,
complex movement tasks are composed
of sequencing and superimposing simpler
motor skills, e.g., leading to complex
manipulation skills like emptying a dish-
washer or assembling a bookshelf.
From the viewpoint of machine learning, robot learning can
be classified as supervised learning, reinforcement learning,
learning modularizations, or learning feature representations
that subserve learning. All learning methods can benefit from
giving the learning system prior knowledge about how to
accomplish a motor task, and imitation learning or learning
from demonstration is a popular approach to introduce this bias.
In summary, the goal of robot learning is to find an appro-
priate control policy to accomplish a given movement task,
assuming that no traditional methods exist to compute the
control policy. Approaches to robot learning can be classified
and discussed using three dimensions: direct versus indirect
control, the learning method used, and the class of tasks in
question (Figure 1).
Approaches to Robot Learning
We will use the classification in Figure 1 in the following sec-
tions to guide our survey of current and previous work in robot
learning. Given space constraints, this survey is not meant to be
comprehensive but rather to present illustrative projects in the
various areas.
Learning Internal Models for Control
Using learning to acquire internal models for control is useful
when the analytical models are too complex to derive, and/or
when it can be expected that the models change over time, e.g.,
due to wear and tear. Various kinds of internal models are used in
robotics. The most well known are kinematics and dynamic
models. For instance, the direct kinematics of a robot relates joint
variables q to end-effector variables y, i.e., y = g(q) [11].
Dynamics models include kinetic terms like forces or torques, as
in (2). The previous models are forward models, i.e., they model
the causal relationship between input and output variables, and
they are proper functions. Often, however, what is needed in
control are inverse models, e.g., the inverse kinematics q = g^{-1}(y)
or the inverse dynamics u = f^{-1}(q, q̇, q̈). As discussed in
[12], inverse models are often not functions, as the inverse rela-
tionships may be a one-to-many map, i.e., just a relation. Such
[Figure 1 labels: Direct Versus Indirect Control (Model-Free Control, Model-Based Control); Class of Task (Regulator Task, Tracking Task, One-Shot Tasks, Periodic Tasks, Complex/Composite Tasks); Learning Method (Supervised Learning, Reinforcement Learning, Learning Modularity, Learning Representations, Imitation Learning).]
Figure 1. Classification of robot learning along three dimensions. Topics further out
on the arrows can be considered more complex research topics than topics closer to
the center.
cases pose a problem to learning methods and can be addressed
with special techniques and representations [13]-[16].
Nonlinear function approximation is needed to learn inter-
nal models. It should be noted, as will be explained later, that
function approximation is also required for other robot learning
problems, e.g., to represent value functions, reward functions,
or policies in reinforcement learning; thus, function approxi-
mation has a wide applicability in robot learning. While most
machine-learning problems in function approximation work by
processing a given data set in an offline fashion, robot learning
has several features that require specialized algorithms:
- Data are available in abundance, typically at a rate from
  60 to 1,000 data points per second.
- Given this continuous stream of data, learning should
  never stop but continue forever without degradation
  over time. For instance, degradation happens in many
  algorithms if the same data point is given to the learning
  system repeatedly, e.g., when the robot is standing still.
- Given the high dimensionality of most interesting
  robotic systems, the complexity of the function to be
  learned is often unknown in advance, and the function
  approximation system needs to be able to add new
  learning resources as learning proceeds.
- Learning should happen in real time, be data efficient
  (squeeze the most information out of each data point),
  and be computationally efficient (to achieve real-time
  learning and lookup).
- Learning needs to be robust toward shifting input distri-
  butions, e.g., as is typical when practicing calligraphy on
  one day and tennis on another day, a topic discussed in
  the context of catastrophic interference [17].
- Learning needs to be able to detect relevant features in
  the input from ideally hundreds or thousands of input
  dimensions, and it needs to automatically exclude irrele-
  vant and redundant inputs.
These requirements narrow down the learning algorithms
that are applicable to function approximation for robot learn-
ing. One approach that has favorable performance is learning
with piecewise linear models using nonparametric regression
techniques [17]-[22]. Essentially, this technique finds, in the
spirit of a first-order Taylor series expansion, the linearization
of the function at an input point, and the region (also called a
kernel) in which this linearization holds within a certain error
bound. Learning this region is the most complex part of these
techniques, and the latest developments use Bayesian statistics
[23] and dimensionality reduction [22].
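To make the flavor of these local methods concrete, the following is a minimal, hedged sketch of a memory-based locally weighted regression query (a batch relative of the incremental, receptive-field-based algorithms cited above, not a reimplementation of them); the Gaussian kernel bandwidth and the ridge term are illustrative choices.

```python
import numpy as np

def lwr_predict(X, y, x_query, bandwidth=0.3, ridge=1e-6):
    """Locally weighted linear regression prediction at a single query point.

    X: (N, d) training inputs, y: (N,) training targets. A Gaussian kernel
    centered on x_query weights each training point, and a weighted,
    ridge-regularized linear model is fit within that region."""
    # Kernel weights define the local region of validity of the linear model
    dists = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-0.5 * dists / bandwidth ** 2)

    # Weighted least squares with a bias term
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xa.T @ (w[:, None] * Xa) + ridge * np.eye(Xa.shape[1])
    b = Xa.T @ (w * y)
    beta = np.linalg.solve(A, b)
    return np.append(x_query, 1.0) @ beta

# Toy usage: learn y = sin(x) from noisy samples and query at x = 1.0
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(200)
print(lwr_predict(X, y, np.array([1.0])))   # roughly sin(1.0) ~ 0.84
```

In the incremental variants, the kernels (receptive fields) and their local linear models are instead updated online as each data point arrives, which is what makes them suitable for the continual-learning requirements listed above.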
A new development, largely due to increasingly faster com-
puting hardware, is the application of Gaussian process regres-
sion (GPR) to function approximation in robots [24]-[26].
GPR is a powerful function approximation tool that has
gained popularity due to its sound theory, high fitting accu-
racy, and the relative ease of application with public-domain
software libraries. As it requires an iterative optimization that
needs to invert a matrix of size N × N, where N is the number
of training data points, GPR quickly saturates the computa-
tional resources with moderately many data points. Thus, scal-
ability to continual and real-time learning in complex robots
will require further research developments; some research
along these lines is given in [25] and [27].
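A bare-bones sketch makes the scaling bottleneck explicit: the Cholesky factorization of the N × N kernel matrix below costs O(N^3) time and O(N^2) memory, which is what limits vanilla GPR on continually growing robot data sets. The squared-exponential kernel and fixed hyperparameters are simplifying assumptions; a full GPR implementation would optimize them.

```python
import numpy as np

def gp_predict(X, y, X_star, length_scale=1.0, signal_var=1.0, noise_var=0.01):
    """Predictive mean of vanilla Gaussian process regression with a
    squared-exponential kernel and fixed (illustrative) hyperparameters."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return signal_var * np.exp(-0.5 * d2 / length_scale**2)

    K = k(X, X) + noise_var * np.eye(len(X))          # N x N kernel matrix
    L = np.linalg.cholesky(K)                          # O(N^3): the bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return k(X_star, X) @ alpha                        # predictive mean

# Toy usage
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (100, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(100)
print(gp_predict(X, y, np.array([[1.0]])))             # roughly sin(1.0)
```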
Example Application
As mentioned earlier, learning inverse models can be challeng-
ing, since the inverse model problem is often a relation and not
a function, with a one-to-many mapping. Applying any arbi-
trary nonlinear function approximation method to the inverse
model problem can lead to unpredictably bad performance, as
the training data can form nonconvex solution spaces in which
averaging is inappropriate [12]. A particularly interesting
approach in control involves learning local linearizations of a
forward model (which is a proper function) and learning an
inverse mapping within the local region of the forward model;
see also [15] and [28].
Ting et al. [23] demonstrated such a forward-inverse model
learning approach with Bayesian locally weighted regression
(BLWR) to learn an inverse kinematics model for a haptic
robot arm (Figure 2) for a task-space tracking task. Training
data consisted of the arm’s joint angles q, joint velocities q̇,
end-effector position in Cartesian space y, and end-effector
velocities ẏ. From this data, a differential forward kinematics
model ẏ = J(q) q̇ was learned, where J is the Jacobian matrix.
The transformation from q̇ to ẏ can be assumed to be locally
linear at a particular configuration q of the robot arm. BLWR
is used to learn the forward model in a
piecewise linear fashion.
The goal of the robot task is to track a
desired trajectory (y, ẏ) specified only in
terms of x, z Cartesian positions and
velocities, i.e., the movement is sup-
posed to be in a vertical plane in front of
the robot, but the exact position of the
vertical plane is not given. Thus, the task
has one degree of redundancy. To learn
an inverse kinematics model, the local
regions from the piecewise linear for-
ward model can be reused since any local
inverse is also locally linear within these
regions. Moreover, for locally linear
models, all solution spaces for the inverse
Figure 2. (a) Phantom robot. (b) Learned inverse kinematics solution; the difference
between the actual and desired trajectory is small.
model are locally convex, such that an inverse can be learned
without problems. The redundancy issue can be solved by
applying an additional weight to each data point according to a
reward function, resulting in reward-weighted locally
weighted regression [15].
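The following sketch conveys the idea in simplified form; it is not the BLWR algorithm of [23] itself. Within one local region, the differential forward model ẏ = J(q) q̇ is estimated by ordinary least squares from velocity pairs, and a local inverse is obtained with a damped least-squares inverse of the estimated Jacobian (the paper instead resolves the redundancy with the reward-weighting described above).

```python
import numpy as np

def estimate_local_jacobian(qdot, ydot, ridge=1e-6):
    """Least-squares estimate of the local differential forward model
    ydot = J(q) qdot from velocity pairs collected near one configuration q."""
    # Solve ydot = qdot @ J.T in a ridge-regularized least-squares sense
    JT = np.linalg.solve(qdot.T @ qdot + ridge * np.eye(qdot.shape[1]),
                         qdot.T @ ydot)
    return JT.T                                   # shape (n_task, n_joints)

def inverse_velocity(J, ydot_des, damping=1e-3):
    """Damped least-squares inverse of the learned local forward model: one
    simple way to pick joint velocities for a desired task-space velocity."""
    return J.T @ np.linalg.solve(J @ J.T + damping * np.eye(J.shape[0]),
                                 ydot_des)

# Toy usage: redundant 3-joint arm, 2-D task space, hypothetical true Jacobian
rng = np.random.default_rng(2)
J_true = rng.standard_normal((2, 3))
qdot = rng.standard_normal((200, 3))
ydot = qdot @ J_true.T + 0.01 * rng.standard_normal((200, 2))
J_hat = estimate_local_jacobian(qdot, ydot)
print(inverse_velocity(J_hat, np.array([0.1, 0.0])))
```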
Figure 2 shows the performance of the learned inverse model
(Learned IK) in a figure-eight tracking task. The learned model achieves
root-mean-squared tracking errors in positions and velocities very close
to those of the analytical inverse kinematics solution. This performance
was acquired from five minutes of real-time training data.
Model-Based Learning
In considering model-based learning, it is useful to start by
assuming that the model is perfect. Later, we will address the
question of how to design a controller that is robust to flaws in
the learned model.
Conventional Dynamic Programming
Designing controllers for linear models is well understood. Work
in reinforcement learning has focused on using techniques derived
from dynamic programming to design controllers for models
that are nonlinear. A large part of our own work has emphasized
pushing back the curse of dimensionality, as the memory and
computational cost of dynamic programming increase exponen-
tially with the dimensionality of the state-action space.
Dynamic programming provides a way to find globally
optimal control policies when the model of the control system
is known. This section focuses on offline planning of nonlinear
control policies for control problems with continuous states
and actions, deterministic time-invariant discrete-time dynamics,
x_{k+1} = f(x_k, u_k), and a time-invariant one-step cost or
reward function L(x, u); equivalent formulations exist for
continuous-time systems [29]-[31]. We are addressing steady-
state policies, i.e., policies that are not time variant and have an
infinite time horizon. One approach to dynamic programming
is to approximate the value function V(x) (the optimal total
future cost from each state V(x)¼minukP1
k¼0L(xk,uk)) by
repeatedly solving the Bellman equation V(x)¼minu
fL(x,u)þV(f(x,u))gat sampled states xuntil the value
function estimates have converged to globally optimal val-
ues. Typically, the value function and control law are repre-
sented on a regular gridit should be noted that more
efficient adaptive grid methods [32], [33] or function approx-
imation methods [7] also exist. Some type of interpolation is
used to approximate these functions within each grid cell. If
each dimension of the state and action is represented with a
resolution R, and the dimensionality of the state is d_x and that
of the action is d_u, the computational cost of the conven-
tional approach is proportional to R^{d_x} × R^{d_u} and the memory
cost is proportional to R^{d_x}. This is known as the curse of
dimensionality [9].
We have shown that dynamic programming can be sped up
by randomly sampling actions on each sweep rather than
exhaustively minimizing the Bellman equation with respect to
the action [34]. At each state on each update, the current best
action is reevaluated and compared to some number of random
actions. Our studies have found that only looking at one ran-
dom action on each update is most efficient. It is more effective
to propagate information about future values by reevaluating
the current best action on each update than it is to put a lot of
resources into searching for the absolute best action.
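A minimal sketch of this randomized Bellman backup, under simplifying assumptions (nearest-neighbor interpolation of the value function, a deterministic model, and a discount factor that keeps the toy problem well posed), could look as follows.

```python
import numpy as np

def random_action_value_iteration(states, f, L, action_sampler,
                                  n_sweeps=50, gamma=0.95):
    """Value iteration in which each Bellman backup compares only the stored
    best action with one freshly sampled random action.

    states: (S, d) array of grid/sampled states; f(x, u) deterministic
    discrete-time dynamics; L(x, u) one-step cost. Nearest-neighbor lookup
    stands in for the grid interpolation used in practice."""
    V = np.zeros(len(states))
    best_u = [action_sampler() for _ in range(len(states))]

    def V_of(x):                                  # nearest-neighbor value
        return V[np.argmin(np.sum((states - x) ** 2, axis=1))]

    for _ in range(n_sweeps):
        for i, x in enumerate(states):
            # Re-evaluating the stored best action propagates new value
            # information; one random action provides the exploration.
            candidates = [best_u[i], action_sampler()]
            costs = [L(x, u) + gamma * V_of(f(x, u)) for u in candidates]
            j = int(np.argmin(costs))
            V[i], best_u[i] = costs[j], candidates[j]
    return V, best_u

# Toy usage: drive a double integrator (position, velocity) to the origin
dt = 0.1
f = lambda x, u: np.array([x[0] + dt * x[1], x[1] + dt * u])
L = lambda x, u: dt * (x[0] ** 2 + 0.1 * x[1] ** 2 + 0.01 * u ** 2)
grid = np.array([[p, v] for p in np.linspace(-2, 2, 15)
                 for v in np.linspace(-2, 2, 15)])
V, policy = random_action_value_iteration(grid, f, L,
                                           lambda: np.random.uniform(-5, 5))
```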
With this speedup in action search, currently available
cluster computers can easily handle ten-dimensional problems
(approximately 10^10 points can handle grids of size 50^6, 20^8, or
10^10, for example). Current supercomputers are created by net-
working hundreds or thousands of conventional computers.
The obvious way to implement dynamic programming on
such a cluster is to partition the grid representing the value
function and policy across the individual computing nodes,
with the borders shared between multiple nodes. When a
border cell is updated by its host node, the new value must be
communicated to all nodes that have copies of that cell. We
have implemented dynamic programming in a cluster of up to
100 nodes, with each node having eight CPU cores and 16 GB
of memory. For example, running a cluster of 40 nodes on a
six-dimensional problem with 50^6 cells, about 6 GB is used on
each node to store its region of the value function and policy.
Decomposing Problems
One way to reduce the curse of dimensionality is to break
problems into parts and develop a controller for each part sep-
arately. Each subsystem could be ten-dimensional, given the
earlier results, and a system that combined two subsystems
could be 20 dimensional. For example, we are interested in
developing a controller for biped walking [35]. We can
approximately model the dynamics of a biped with separate
models for sagittal and lateral control. These models are linked
by common actions, such as when to put down and lift the
feet. Thus, there are two parts of the state vector x: variables
that are part of the sagittal state x_s and variables that are part of
the lateral state x_l. There are three parts of the action vector u:
variables that are part of the sagittal action u_s, variables that are
part of the lateral action u_l, and variables that affect both sys-
tems, u_sl. We can perform dynamic programming on the sagit-
tal system and produce a value function V_s(x_s) and do the same
with the lateral system, V_l(x_l). We can choose an optimal action
by minimizing L(x, u) + V(f(x, u)) with respect to u, with
V(x) approximated by V_s(x_s) + V_l(x_l). This approximation
ignores the linking of the two systems in the future and can be
improved by adding elements to the one-step costs for each
subsystem that bias the shared actions to behave as if the other
system was present. For example, deviations from the timing
usually seen in the complete system can be penalized.
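A sketch of the resulting action selection, with hypothetical subsystem models f_s and f_l, value functions V_s and V_l, and a full one-step cost supplied by the user, could look like this.

```python
import numpy as np

def choose_action(x_s, x_l, f_s, f_l, L_full, V_s, V_l, candidate_actions):
    """Greedy one-step action selection with a decomposed value function:
    V(x) is approximated by V_s(x_s) + V_l(x_l), and each candidate action
    u = (u_s, u_l, u_sl) is scored by its one-step cost plus the approximate
    value of the successor states of the two subsystems."""
    best_u, best_cost = None, np.inf
    for u_s, u_l, u_sl in candidate_actions:
        cost = (L_full(x_s, x_l, u_s, u_l, u_sl)
                + V_s(f_s(x_s, u_s, u_sl))
                + V_l(f_l(x_l, u_l, u_sl)))
        if cost < best_cost:
            best_u, best_cost = (u_s, u_l, u_sl), cost
    return best_u
```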
Trajectory Optimization and Trajectory Libraries
Another way to handle complex systems is trajectory optimiza-
tion. Given a model, a variety of approaches can be used to find
a locally optimal sequence of commands for a given initial posi-
tion and one-step cost [36]-[38]. Interestingly, trajectory optimi-
zation is quite popular for generating motion in animation [39].
However, trajectory optimization is not so popular in robotics,
because it appears that it does not produce a control law but just
a fixed sequence of commands. This is not a correct view.
To generate a control policy, trajectory optimization can
be applied to many initial conditions, and the resulting com-
mands can be interpolated as needed. If that is the case, why
do we need to deal with dynamic programming and the curse
of dimensionality? Dynamic programming is a global opti-
mizer, while trajectory optimization finds local optima. Often,
the local optima found are not acceptable. Some way to bias
trajectory optimization to produce reasonable trajectories
would be useful. Also, if interpolation of the results will be
done, it would be useful to produce consistent results so that
similar initial conditions lead to similar costs. There may be
discontinuities between nearby trajectories that must be
handled by interpolation of actions.
One trick to improve trajectories is to use neighboring tra-
jectories to somehow bias or guide the optimization process. A
simple way to do this is to use a neighboring trajectory as the
initial trajectory in the trajectory-optimization process. Trajec-
tories can be reoptimized using each neighbor in turn as the
initial trajectory, and the best result so far can be retained. We
have explored building explicit libraries of optimized trajecto-
ries to handle large perturbations in bipedal standing balance
[40]. One way of using the library is to use the optimized
action corresponding to the nearest state in the library at each
time step. Another way is to store the derivative of the opti-
mized action with respect to state and use that derivative to
modify the suggested action. A third way is to look up states
from multiple trajectories and generate a weighted blend of
the suggested actions.
The first and second derivatives of a trajectory’s cost with
respect to state can be used to generate a local Taylor series
model of the value function: V(x) = V_0 + V_x x + x^T V_xx x.
Given a quadratic local model of the value function, it is possible
to compute the optimal action and its first derivative, the feed-
back gains. These observations led to a trajectory optimization
method based on second-order gradient descent, differential
dynamic programming (DDP) [29]. Although this trajectory
optimization method is no longer considered the most efficient
way to find an optimal trajectory [sequential quadratic program-
ming (SQP) methods are currently preferred in many fields such
as aerospace and animation], the local models of the value func-
tion and policy that DDP produces are useful for machine
learning. For example, the local model of the policy can be used
in a trajectory library to interpolate or extrapolate actions. Dis-
crepancies in adjacent local models of the value function can be
used to determine where to allocate additional library resources.
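The backward recursion that produces these local quadratic value models and feedback gains can be sketched as a single DDP/iLQR-style step; the regularization of Q_uu and the line-search machinery of practical implementations are omitted, and the variable names are illustrative.

```python
import numpy as np

def ddp_backward_step(A, B, l_x, l_u, l_xx, l_uu, l_ux, V_x, V_xx):
    """One backward step of DDP/iLQR around a nominal trajectory point.

    A, B: linearized dynamics (x_{k+1} ~ A dx + B du); l_* are derivatives of
    the one-step cost; V_x, V_xx: value-function derivatives at the next state.
    Returns the feedforward correction k, feedback gains K, and the updated
    local quadratic value model at the current state."""
    Q_x  = l_x  + A.T @ V_x
    Q_u  = l_u  + B.T @ V_x
    Q_xx = l_xx + A.T @ V_xx @ A
    Q_uu = l_uu + B.T @ V_xx @ B
    Q_ux = l_ux + B.T @ V_xx @ A

    Q_uu_inv = np.linalg.inv(Q_uu)
    k = -Q_uu_inv @ Q_u          # feedforward action correction
    K = -Q_uu_inv @ Q_ux         # locally optimal feedback gains

    V_x_new  = Q_x  + K.T @ Q_uu @ k + K.T @ Q_u + Q_ux.T @ k
    V_xx_new = Q_xx + K.T @ Q_uu @ K + K.T @ Q_ux + Q_ux.T @ K
    return k, K, V_x_new, V_xx_new
```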
Robustness
Robustness has not been addressed well in robot learning.
Studies often focus on robustness to additive noise. It is much
more difficult to design controllers that are robust to the corre-
lated errors caused by parameter error or model structure
error. One approach to designing robust controllers is to opti-
mize controller parameters by simulating a controller control-
ling a noisy robot [41]. It is more useful to optimize controller
parameters controlling a set of robots, each with different
robot parameters. This allows the effect of correlated control-
ler errors across time to be handled in the optimization.
It is not clear how to perform a similar optimization over a
set of models in dynamic programming. Using additive noise
and performing stochastic dynamic programming does not
capture the effect of correlated errors. One approach is to
make the model parameters into model states and perform sto-
chastic dynamic programming on information states that
describe distributions of actual states and model parameters.
However, this creates a large increase in the number of states,
which is not practical for dynamic programming.
Bar-Shalom and Tse showed that DDP can be used to
locally optimize controller robustness as well as exploration
[42], [43]. This work provides an efficient solution to optimize
the typically high-dimensional information state, which
includes the means and covariances of the original model states
and the means and covariances of the model parameters.
Representing the uncertainty using a parametric probability
distribution (means and covariances) also reduces the compu-
tational cost of propagating uncertainty forward in time. The
dynamics of the system are given by an extended Kalman fil-
ter. The key observation is that the cost of uncertainty (the
state and model parameter covariances) is given by
Trace(V_xx Σ), the trace of the product of the second derivative
of the value function and the covariance matrix of the state.
Minimizing the additional cost due to uncertainty makes the
controller more robust and guides exploration.
Example Application
We implemented DDP on an actual robot as part of a learning
from demonstration experiment (Figure 3). Several robustness
issues arose since models are never perfect, especially learned
models. 1) We needed initial trajectories that were consistent
with the learned models, and sometimes reasonable or feasible
trajectories do not exist due to modeling error in the learned
model. 2) During optimization, the forward integration of a
learned model in time often blows up when the learned model
is inaccurate or when the plant is unstable and the current policy
fails to stabilize it. 3) The backward integration to produce a
value function and a corresponding policy uses derivatives of the
learned model, which are often quite inaccurate in the early
stages of learning, producing inaccurate value function estimates
and ineffective policies. 4) Dynamic planners amplify modeling
Figure 3. The robot swinging up an inverted pendulum.
error, because they take advantage of any modeling error that
reduces cost, and because some planners use derivatives, which
can be quite inaccurate. 5) The new knowledge gained in
attempting a task may not change the predictions the system
makes about the task (falling down might not tell us much about
the forces needed in walking). In the task shown in Figure 3, we
used a direct reinforcement learning approach that adjusted the
task goals in addition to optimal control to overcome modeling
errors that the learning system did not handle [44].
We use another form of one-link pendulum swing-up as
an example problem to provide the reader with a visualizable
example of a value function and policy (Figure 4). In this one-
link pendulum swing-up, a motor at the base of the pendulum
swings a rigid arm from the downward stable equilibrium to
the upright unstable equilibrium and balances the arm there.
What makes this challenging is that the one-step cost function
penalizes the amount of torque used and the deviation of
the current position from the goal. The controller must try
to minimize the total cost of the trajectory. The one-step
cost function for this example is a weighted sum of the
squared position errors (θ̃: difference between current angle
and the goal angle) and the squared torques τ:
L(x, u) = 0.1 θ̃^2 T + τ^2 T, where 0.1 weights the position
error relative to the torque penalty and T is the time step of
the simulation (0.01 s). Including the time step T in the optimi-
zation criterion allows comparison with controllers with dif-
ferent time steps and continuous time controllers. There are
no costs associated with the joint velocity. Figure 4 shows the
optimal value function and policy. The optimal trajectory is
shown as a yellow line in the value function plot and as a black
line with a yellow border in the policy plot [Figure 4(b) and
(c)]. The value function is cut off above 20 so that we can see
the details of the part of the value function that determines the
optimal trajectory. The goal is at the state (0,0).
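A sketch of this example's dynamics and one-step cost is given below; the pendulum's mass, length, and friction values are illustrative assumptions, not the ones used to generate Figure 4.

```python
import numpy as np

# One-link pendulum swing-up: state x = (theta, theta_dot) with theta = 0 at
# the upright goal, action u = motor torque. Mass, length, and friction are
# illustrative assumptions, not the values used in the article.
T = 0.01                      # simulation time step (s), as in the text
m, l, g, b = 1.0, 1.0, 9.81, 0.1

def step(x, u):
    """Euler integration of the pendulum dynamics for one time step."""
    theta, theta_dot = x
    theta_ddot = (u + m * g * l * np.sin(theta) - b * theta_dot) / (m * l ** 2)
    return np.array([theta + T * theta_dot, theta_dot + T * theta_ddot])

def one_step_cost(x, u):
    """L(x, u) = 0.1 * theta^2 * T + u^2 * T; no cost on the joint velocity."""
    return 0.1 * x[0] ** 2 * T + u ** 2 * T

# The swing-up starts hanging down, i.e., at x0 = (pi, 0), and must reach (0, 0)
x0 = np.array([np.pi, 0.0])
print(one_step_cost(x0, 0.0))   # cost of doing nothing at the start state
```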
Model-Free Learning
There are several popular methods of approaching model-
free robot learning. Value function-based methods are dis-
cussed in the context of actor-critic methods, temporal dif-
ference (TD) learning, and Q-learning. A novel wave of
algorithms avoids value functions and focuses on directly
learning the policy, either with gradient methods or proba-
bilistic methods.
Value Function Approaches
Instead of using dynamic programming, the value function
V(x) can be estimated with TD learning [7], [45]. Essentially,
TD enforces the validity of the Bellman equations for tempo-
rally adjacent states, which can be shown to lead to a spatially
consistent estimate of the value function for a given policy. To
improve the policy, TD needs to be coupled to a simultaneous
policy update using actor-critic methods [7].
Alternatively, instead of the value function V(x), the action
value function Q(x,u) can be used, which is defined as
Q(x, u) = L(x_0, u_0) + min_{u_k} Σ_{k=1}^{∞} L(x_k, u_k) [7], [46]. Know-
ing Q(x,u) for all actions in a state allows choosing the one
with the maximal (or minimal for penalty costs) Q-value as
the optimal action. Q-learning can be conceived of as TD
learning in the joint space of states and actions.
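For reference, a single tabular Q-learning backup in this cost-minimization setting can be written as the following sketch, a discrete state-action simplification of what the continuous case requires.

```python
def q_learning_update(Q, s, a, cost, s_next, actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning backup for a cost-minimization setting:
    Q(s, a) <- Q(s, a) + alpha * [cost + gamma * min_a' Q(s', a') - Q(s, a)].
    Q is a dict mapping (state, action) pairs to estimated costs-to-go."""
    target = cost + gamma * min(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
    return Q

# Toy usage on a two-state chain: being in state 1 costs 1, state 0 costs 0
Q = {}
Q = q_learning_update(Q, s=1, a="left", cost=1.0, s_next=0,
                      actions=["left", "stay"])
print(Q)
```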
TD and Q-learning work well for discrete state-action
spaces but become more problematic in continuous state-
action scenarios. In continuous spaces, function approximators
need to be used to represent the value function and policy.
Achieving reliable estimation of these functions usually
requires a large number of samples that densely fill the relevant
space for learning, which is hard to accomplish in actual
experiments with complex robot systems. There are also no
guarantees that, during learning, the robot will not be given
unsafe commands. Thus, many practical approaches learn first
Figure 4. (a) Configurations from the simulated one-link pendulum optimal trajectory every half second and at the end of the
trajectory. (b) Value function for one-link example. (c) Policy for one-link example. (d) Trajectory-based approach: random states
(dots) and trajectories (black lines) used to plan one-link swing-up, superimposed on a contour map of the value function [33].
in simulations (which is essentially a model-based approach)
until reasonable performance is achieved, before continuing to
experiment on an actual robot to adjust the control policy to
the true physics of the world [47].
In the end, it is intractable to find a globally optimal control
policy in high dimensional robot systems, as global optimality
requires exploration of the entire state-action space. Thus,
local optimization such as trajectory optimization seems to be
more practical, using initialization of the policy from some
informed guess, for instance, imitation learning [44], [48]
[51]. Fitted Q-iteration is an example of a model-free learning
algorithm that approximates the Q-function only along some
sampled trajectories [52], [53]. Recent developments have
given up on estimating the value function and rather focus
directly on learning the control policy from trajectory rollouts,
which is the topic of the following sections.
Policy Gradient Methods
Policy gradient methods usually assume that the cost of a motor
skill can be written as

J(x_0) = E_τ{ Σ_{k=0}^{N} γ^k L(x_k, u_k) },   (4)

which is the expected sum of discounted rewards (γ ∈ [0, 1])
over a (potentially infinite) time horizon N. The expecta-
tion E_τ{·} is taken over all trajectories τ that start in state x_0.
The goal is to find the motor commands u_k that optimize
this cost function. Most approaches assume that there is a
start state x = x_0 and/or a start state distribution [54]. The
control policy is also often compactly parameterized, e.g.,
by means of a basis function representation u = θ^T φ(x),
where θ are the policy parameters [see also (1)], and φ(x) is a
vector of nonlinear basis functions provided by the user.
Mainly for the purpose of exploration, the policy can
be chosen to be stochastic, e.g., with a normal distribution
u ~ N(θ^T φ(x), Σ), although cases exist where only a sto-
chastic policy is optimal [54].
The essence of policy gradient methods is to compute the
gradient ∂J/∂θ and optimize (4) with gradient-based incremental
updates. As discussed in more detail in [55], a variety of algorithms
exist to compute the gradient. Finite-difference gradients [56]
perform a perturbation analysis of the parameter vector θ and
estimate the gradient from a first-order numerical Taylor series
expansion. The REINFORCE algorithm [57], [58] is a
straightforward derivative computation of the logarithm of
(4), assuming as the probability of a trajectory p_θ(τ) =
p(x_0) Π_{k=1}^{N} p(x_k | x_{k-1}, u_{k-1}) π_θ(u_{k-1} | x_{k-1}), and emphasizing
that the parameters θ only appear in the stochastic policy π_θ,
such that many terms in the gradient computation drop out.
GPOMDP [59] and methods based on the policy gradient
theorem [54] are more efficient versions of REINFORCE
(for more details, see [55]). Peters and Schaal [60] suggested a
second-order gradient method derived from insights of [61]
and [62], which is currently among the fastest gradient-learn-
ing approaches. Reference [63] emphasized that the choice of
injecting noise in the stochastic policy can strongly influence
the efficiency of the gradient updates.
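As an illustration, a basic episodic REINFORCE-style gradient estimate for the linear-Gaussian policy u ~ N(θ^T φ(x), σ² I) introduced above can be sketched as follows; no baseline or natural-gradient correction is included, so the estimate is high variance compared with the methods of [54], [59], and [60].

```python
import numpy as np

def reinforce_gradient(rollouts, theta, phi, sigma=0.1, gamma=0.99):
    """Episodic REINFORCE-style estimate of dJ/dtheta for a Gaussian policy
    u ~ N(theta^T phi(x), sigma^2 I) in a cost-minimization setting.

    rollouts: list of trajectories, each a list of (x, u, cost) tuples.
    The returned gradient is meant for a descent step, theta -= lr * grad."""
    grad = np.zeros_like(theta)
    for traj in rollouts:
        # Discounted cost of the whole trajectory (no baseline subtracted)
        R = sum(gamma ** k * c for k, (_, _, c) in enumerate(traj))
        # Sum of log-policy gradients along the trajectory
        g = np.zeros_like(theta)
        for x, u, _ in traj:
            feat = phi(x)
            g += np.outer(feat, u - theta.T @ feat) / sigma ** 2
        grad += R * g
    return grad / len(rollouts)
```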
Policy gradient methods can scale to high-dimensional
state-action spaces, at the cost of finding only locally optimal
control policies, and have become rather popular in robotics
[64]-[66]. One drawback of policy gradients is that they
require manual tuning of gradient parameters, which can be
tedious. Probabilistic methods, as discussed in the next section,
try to eliminate gradient computations.
Probabilistic Direct Policy Learning
Transforming reinforcement learning into a probabilistic estima-
tion approach is inspired by the hope of bringing to bear the
wealth of statistical learning techniques that were developed over
the last 20 years of machine-learning research. An early attempt
can be found in [67], where reinforcement learning was formu-
lated as an expectation-maximization (EM) algorithm [68]. The
important idea was to treat the reward L(x, u) as a pseudoprob-
ability, i.e., it has to be strictly positive, and the integral over the
state-action space of the reward has to result in a finite number.
Transforming traditional convex reward functions with the
exponential function is often used to achieve this property at the
cost that the learning problem gets slightly altered by this change
of cost function. Equation (4) can thus be thought of as a likeli-
hood, and the corresponding log likelihood becomes
log J(x) = log ∫_τ p_θ(τ) R(τ) dτ, where R(τ) = Σ_{k=0}^{N} γ^k L(x_k, u_k).   (5)
This log likelihood can be optimized with the EM algo-
rithm. In [15], such an approach was used to learn operational
space controllers, where the reinforcement learning compo-
nent enabled a consistent resolution of redundancy. In [69],
the previous approach was extended to learning from trajecto-
ries; see also the contribution by Kober and Peters (pp. 55-62).
Extending this line of work, [70] and [71] added a more thorough treatment of
learning in the infinite discounted horizon case, where the
algorithm can essentially determine the most suitable temporal
window for optimization.
Another way of transforming reinforcement learning into a
statistical estimation problem was suggested in [72] and [73].
Here, it was realized that optimization with the stochastic
Hamilton-Jacobi-Bellman equations can be transformed into a
path-integral estimation problem, which can be derived with
the Feynman-Kac theorem [31], [74]. While this formulation
is normally based on value functions and requires a model-
based approach, Theodorou et al. [31] realized that even
model-free methods can be obtained. The resulting reinforce-
ment learning algorithm resembles the one of [69], however,
without the requirement that reinforcement is a pseudoprob-
ability. Because of its grounding in first-order principles of
optimal control theory, its simplicity, and no open learning
parameters except for the exploration noise, this algorithm
might be one of the most straightforward methods of trajec-
tory-based reinforcement learning to date. It should also be
mentioned that [75] developed a
model-based reinforcement learning
framework with a special probabilistic
control cost for discrete state-action
spaces that, in its limit to continuous
state-action spaces, will result in a path-
integral formulation.
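The spirit of these probability-weighted updates can be conveyed with a drastically simplified, episode-level sketch: exploration samples of the policy parameters are reweighted with exponentiated (negated and normalized) trajectory costs and averaged. Per-time-step weighting with the cost-to-go, as in the actual path-integral derivation, is omitted.

```python
import numpy as np

def reward_weighted_update(param_samples, costs, lam=1.0):
    """Episode-level probability-weighted parameter update: exploratory
    parameter samples theta_i (e.g., theta + eps_i) are reweighted with
    exponentiated, negated, normalized trajectory costs and averaged.

    param_samples: (K, d) parameters used in K rollouts; costs: (K,) costs."""
    costs = np.asarray(costs, dtype=float)
    c = (costs - costs.min()) / (costs.max() - costs.min() + 1e-10)
    w = np.exp(-c / lam)                 # low cost -> high weight
    w /= w.sum()
    return (w[:, None] * param_samples).sum(axis=0)   # new mean parameters

# Toy usage: three noisy parameter vectors; the cheapest rollout dominates
samples = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]])
print(reward_weighted_update(samples, costs=[5.0, 1.0, 4.0]))
```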
Example Application
Figure 5 illustrates our application of
path-integral reinforcement learning to
a robot-learning problem [31]. The
robot dog is to jump across a gap. The
jump should make as much forward
progress as possible, as it is a maneuver
in a legged locomotion competition,
which scores the speed of the robot.
The robot has three degrees of freedom
(DoFs) per leg, and thus a total of 12
DoFs. Each DoF was represented as a
parameterized movement primitive [76] with 50 basis func-
tions. An initial seed behavior was taught by learning from
demonstration, which allowed the robot barely to reach the
other side of the gap without falling into the gap; the demon-
stration was generated from a manual adjustment of knot
points in a spline-based trajectory plan for each leg.
Path-integral reinforcement learning primarily used the
forward progress as a reward and slightly penalized the squared
acceleration of each DoF and the squared norm of the parame-
ter vector, i.e., a typical form of complexity regularization
[77]. Learning was performed on a physical simulator of the
robot dog, as the real robot dog was not available for this
experiment. Figure 5 illustrates that after about 30 trials, the
performance of the robot was significantly improved, such that
after the jump, almost the entire body was lying on the other
side of the gap. It should be noted that applying path-integral
reinforcement learning was algorithmically very simple, and
manual tuning focused only on generating a good cost function.
Imitation Learning, Policy Parameterizations, and
Inverse Reinforcement Learning
While space constraints will not allow us to go into more
detail, three interwoven topics in robot learning are worth
mentioning.
First, imitation learning has become a popular topic to initi-
alize and speed up robot learning. Reviews on this topic can
be found, for instance, in [48], [49], and [78].
Second, determining useful parameterizations for control
policies is a topic that is often discussed in conjunction with
imitation learning. Many different approaches have been sug-
gested in the literature, for instance, based on splines [79], hid-
den Markov models [80], nonlinear attractor systems [76], and
other methods. Billard et al. [78] provide a survey of this topic.
Finally, designing useful reward functions remains one of the
most time-consuming and frustrating topics in robot learning.
Thus, extracting the reward function from observed behavior is
a topic of great importance for robot learning and imitation
learning under the assumption that the observed behavior is
optimal under a certain criterion. Inverse reinforcement learning
[81], apprenticeship learning [82], and maximum margin plan-
ning [83] are some of the prominent examples in the literature.
Conclusions
Recent trends in robot learning are to use trajectory-based
optimal control techniques and reinforcement learning to scale
to complex robotic systems. On the one hand, increased compu-
tational power and multiprocessing, and on the other hand,
probabilistic reinforcement learning methods and function
approximation, have contributed to a steadily increasing inter-
est in robot learning. Imitation learning has helped signifi-
cantly to start learning with reasonable initial behavior.
However, many applications are still restricted to rather low-
dimensional domains and toy applications. Future work will
have to demonstrate the continual and autonomous learning
abilities, which were alluded to in the introduction.
Acknowledgments
This research was supported in part by National Science
Foundation grants ECS-0326095, EEC-0540865, and ECCS-
0824077, IIS-0535282, CNS-0619937, IIS-0917318, CBET-
0922784, EECS-0926052, the DARPA program on Learning
Locomotion, the Okawa Foundation, and the ATR Compu-
tational Neuroscience Laboratories.
Keywords
Robot learning, learning control, reinforcement learning,
optimal control.
References
[1] S. Schaal, “The new robotics: Towards human-centered machines,” HFSP
J. Frontiers Interdisciplinary Res. Life Sci., vol. 1, no. 2, pp. 115-126, 2007.
[2] K. J. Åstrom and B. Wittenmark, Adaptive Control. Reading, MA: Addi-
son-Wesley, 1989.
[3] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. Cambridge, MA:
MIT Press, 2005.
Figure 5. (a) Actual and simulated robot dog. (b) Learning curve of optimizing the
jump behavior with path-integral reinforcement learning.
[4] M. Buehler, The DARPA Urban Challenge: Autonomous Vehicles in City
Traffic, 1st ed. New York: Springer-Verlag, 2009.
[5] M. Buehler, K. Iagnemma, and S. Singh, The 2005 DARPA Grand Chal-
lenge: The Great Robot Race. New York: Springer-Verlag, 2007.
[6] M. Roy, G. Gordon, and S. Thrun, “Finding approximate POMDP solu-
tions through belief compression,” J. Artif. Intell. Res., vol. 23, pp. 1-40,
2005.
[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cam-
bridge, MA: MIT Press, 1998.
[8] J. Si, Handbook of Learning and Approximate Dynamic Programming. Hobo-
ken, NJ: IEEE Press/Wiley-Interscience, 2004.
[9] R. Bellman, Dynamic Programming. Princeton, NJ: Princeton Univ. Press,
1957.
[10] P. Dyer and S. R. McReynolds, The Computation and Theory of Optimal
Control. New York: Academic, 1970.
[11] L. Sciavicco and B. Siciliano, Modelling and Control of Robot Manipulators.
New York: Springer-Verlag, 2000.
[12] M. I. Jordan and D. E. Rumelhart, “Supervised learning with a distal
teacher,” Cogn. Sci., vol. 16, pp. 307-354, 1992.
[13] A. D’Souza, S. Vijayakumar, and S. Schaal, “Learning inverse kine-
matics,” in Proc. IEEE Int. Conf. Intelligent Robots and Systems (IROS
2001), Maui, HI, Oct. 29-Nov. 3, 2001, pp. 298-301.
[14] D. Bullock, S. Grossberg, and F. H. Guenther, “A self-organizing neural
model of motor equivalent reaching and tool use by a multijoint arm,” J.
Cogn. Neurosci., vol. 5, no. 4, pp. 408-435, 1993.
[15] J. Peters and S. Schaal, “Learning to control in operational space,” Int. J.
Robot. Res., vol. 27, pp. 197-212, 2008.
[16] Z. Ghahramani and M. I. Jordan, “Supervised learning from incomplete
data via an EM approach,” in Advances in Neural Information Processing Sys-
tems 6, J. D. Cowan, G. Tesauro, and J. Alspector, Eds. San Mateo, CA:
Morgan Kaufmann, 1994, pp. 120-127.
[17] S. Schaal and C. G. Atkeson, “Constructive incremental learning from
only local information,” Neural Comput., vol. 10, no. 8, pp. 2047-2084,
1998.
[18] W. S. Cleveland, “Robust locally weighted regression and smoothing
scatterplots,” J. Amer. Statist. Assoc., vol. 74, pp. 829-836, 1979.
[19] C. G. Atkeson, “Using local models to control movement,” in Advances
in Neural Information Processing Systems 1, D. Touretzky, Ed. San Mateo,
CA: Morgan Kaufmann, 1989, pp. 157-183.
[20] C. G. Atkeson, A. W. Moore, and S. Schaal, “Locally weighted
learning,” Artif. Intell. Rev., vol. 11, no. 1-5, pp. 11-73, 1997.
[21] C. G. Atkeson, A. W. Moore, and S. Schaal, “Locally weighted learning
for control,” Artif. Intell. Rev., vol. 11, no. 1-5, pp. 75-113, 1997.
[22] S. Vijayakumar, A. D’Souza, and S. Schaal, “Incremental online learning
in high dimensions,” Neural Comput., vol. 17, no. 12, pp. 2602-2634, 2005.
[23] J.-A. Ting, A. D’Souza, S. Vijayakumar, and S. Schaal, “A Bayesian
approach to empirical local linearizations for robotics,” in Proc. Int. Conf.
Robotics and Automation (ICRA2008), Pasadena, CA, May 19-23, 2008,
pp. 2860-2865.
[24] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine
Learning. Cambridge, MA: MIT Press, 2006.
[25] D. Nguyen-Tuong, M. Seeger, and J. Peters, “Local gaussian process
regression for real time online model learning and control,” in Proc. Advan-
ces in Neural Information Processing Systems 21 (NIPS 2008), D. Schuurmans,
Y. Bengio, and D. Koller, Eds. Vancouver, BC, Dec. 8-11, 2009,
pp. 1193-1200.
[26] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, “Gaussian process
dynamic programming,” Neurocomputing, vol. 72, no. 7-9, pp. 1508-1524, 2009.
[27] L. Csat’o and M. Opper, “Sparse representation for gaussian process
models,” in Proc. Advances in Neural Information Processing Systems 13 (NIPS
2000), Denver, CO, 2001, pp. 444-450.
[28] D. M. Wolpert and M. Kawato, “Multiple paired forward and inverse
models for motor control,” Neural Netw., vol. 11, no. 7-8, pp. 1317-1329, 1998.
[29] D. H. Jacobson and D. Q. Mayne, Differential Dynamic Programming. New
York: American Elsevier, 1970.
[30] K. Doya, “Reinforcement learning in continuous time and space,” Neu-
ral Comput., vol. 12, no. 1, pp. 219-245, Jan. 2000.
[31] E. Theodorou, J. Buchli, and S. Schaal, “Reinforcement learning in high
dimensional state spaces: A path integral approach,” submitted for
publication.
[32] R. Munos and A. Moore, “Variable resolution discretization in optimal
control,” Mach. Learn., vol. 49, no. 2/3, p. 33, 2002.
[33] C. G. Atkeson and B. J. Stephens, “Random sampling of states in
dynamic programming,” IEEE Trans. Syst., Man, Cybern. B, vol. 38,
no. 4, pp. 924-929, 2008.
[34] C. G. Atkeson, “Randomly sampling actions in dynamic programming,”
in Proc. IEEE Int. Symp. Approximate Dynamic Programming and Reinforce-
ment Learning (ADPRL’07), 2007, pp. 185-192.
[35] E. Whitman and C. G. Atkeson, “Control of a walking biped using a
combination of simple policies,” in Proc. IEEE/RAS Int. Conf. Humanoid
Robotics, Paris, France, Dec. 7-10, 2009, pp. 520-527.
[36] Tomlab Optimization Inc. (2010). PROPT: Matlab optimal control
software [Online]. Available: http://tomdyn.com/
[37] Technische Universitat Darmstadt. (2010). DIRCOL: A direct colloca-
tion method for the numerical solution of optimal control problems
[Online]. Available: http://www.sim.informatik.tu-darmstadt.de/sw/dircol
[38] Stanford Business Software Corporation. (2010). SNOPT; Software for
large-scale nonlinear programming [Online]. Available: http://www.sbsi-
sol-optimize.com/asp/sol_product_snopt.htm
[39] A. Safonova, J. K. Hodgins, and N. S. Pollard, “Synthesizing physically
realistic human motion in low-dimensional, behavior-specific spaces,”
ACM Trans. Graph. J. (SIGGRAPH 2004 Proc.), vol. 23, no. 3, pp. 514-521, 2004.
[40] L. Chenggang and C. G. Atkeson, “Standing balance control using a
trajectory library,” presented at the IEEE/RSJ Int. Conf. Intelligent
Robots and Systems (IROS 2009), 2009.
[41] A. Ng, “Pegasus: A policy search method for large MDPs and
POMDPs,” presented at the Uncertainty in Artificial Intelligence (UAI),
2000.
[42] E. Tse, Y. Bar-Shalom, and L. Meier, III, “Wide-sense adaptive dual
control for nonlinear stochastic systems,” IEEE Trans. Automat. Contr.,
vol. 18, no. 2, pp. 98-108, 1973.
[43] Y. Bar-Shalom and E. Tse, “Caution, probing and the value of informa-
tion in the control of uncertain systems,” Ann. Econ. Social Meas., vol. 4,
no. 3, pp. 323-338, 1976.
[44] C. G. Atkeson and S. Schaal, “Robot learning from demonstration,” in
Proc. 14th Int. Conf. Machine Learning (ICML‘97), D. H. Fisher, Jr., Ed.
Nashville, TN, July 8-12, 1997, pp. 12-20.
[45] R. S. Sutton, “Learning to predict by the methods of temporal differ-
ences,” Mach. Learn., vol. 3, no. 1, pp. 9-44, 1988.
[46] C. J. C. H. Watkins, “Learning with delayed rewards,” Ph.D. thesis,
Cambridge Univ., U.K., 1989.
[47] J. Morimoto and K. Doya, “Acquisition of stand-up behavior by a real
robot using hierarchical reinforcement learning,” Robot. Auton. Syst.,
vol. 36, no. 1, pp. 37-51, 2001.
[48] S. Schaal, “Is imitation learning the route to humanoid robots?” Trends
Cogn. Sci., vol. 3, no. 6, pp. 233-242, 1999.
[49] S. Schaal, A. Ijspeert, and A. Billard, “Computational approaches to
motor learning by imitation,” Philos. Trans. R. Soc. London B, Biol. Sci.,
vol. 358, no. 1431, pp. 537-547, 2003.
[50] C. G. Atkeson and S. Schaal, “Learning tasks from a single demon-
stration,” in Proc. IEEE Int. Conf. Robotics and Automation (ICRA’97),
Albuquerque, NM, Apr. 20-25, 1997, pp. 1706-1712.
[51] S. Schaal, “Learning from demonstration,” in Proc. Advances in Neural
Information Processing Systems 9, M. C. Mozer, M. Jordan, and T. Petsche,
Eds. Cambridge, MA, 1997, pp. 1040-1046.
[52] D. Ernst, P. Geurts, and L. Wehenkel, “Tree-based batch mode rein-
forcement learning,” J. Mach. Learn. Res., vol. 6, pp. 503-556, 2005.
[53] G. Neumann and J. Peters, “Fitted Q-iteration by advantage weighted
regression,” in Proc. Advances in Neural Information Processing Systems 21
(NIPS 2008), D. Schuurmans, Y. Bengio, and D. Koller, Eds. Vancouver,
BC, Dec. 8-11, 2009, pp. 1177-1184.
[54] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient
methods for reinforcement learning with function approximation,” in
Proc. Advances in Neural Processing Systems 12, S. A. Solla, T. K. Leen, and
K.-R. Muller, Eds. Denver, CO, 2000.
[55] J. Peters and S. Schaal, “Reinforcement learning of motor skills with
policy gradients,” Neural Netw., vol. 21, no. 4, pp. 682-697, May 2008.
[56] P. Sadegh and J. Spall, “Optimal random perturbations for stochastic
approximation using a simultaneous perturbation gradient approx-
imation,” presented at the Proc. American Control Conf., 1997.
[57] R. J. Williams, “Simple statistical gradient-following algorithms for con-
nectionist reinforcement learning,” Mach. Learn., vol. 8, no. 3-4, pp. 229-256, 1992.
[58] V. Gullapalli, “A stochastic reinforcement learning algorithm for learning
real-valued functions,” Neural Netw., vol. 3, no. 6, pp. 671-692, 1990.
[59] D. Aberdeen and J. Baxter, “Scaling internal-state policy-gradient meth-
ods for POMDPs,” in Proc. 19th Int. Conf. Machine Learning (ICML-2002),
Sydney, Australia, 2002, pp. 3-10.
[60] J. Peters and S. Schaal, “Natural actor critic,” Neurocomputing, vol. 71,
no. 7-9, pp. 1180-1190, 2008.
[61] S. Amari, “Natural gradient learning for over- and under-complete bases
in ICA,” Neural Comput., vol. 11, no. 8, pp. 1875-1883, Nov. 1999.
[62] S. Kakade, “Natural policy gradient,” presented at the Advances in Neu-
ral Information Processing Systems, Vancouver, CA, 2002.
[63] T. Ruckstieß, M. Felder, and J. Schmidhuber, “State-dependent explo-
ration for policy gradient methods,” presented at the European Conf.
Machine Learning and Principles and Practice of Knowledge Discovery in
Databases 2008, Part II, LNAI 5212, 2008.
[64] G. Endo, J. Morimoto, T. Matsubara, J. Nakanish, and G. Cheng,
“Learning CPG-based biped locomotion with a policy gradient method:
Application to a humanoid robot,” Int. J. Robot. Res., vol. 27, no. 2,
pp. 213-228, 2008.
[65] R. Tedrake, T. W. Zhang, and S. Seung, “Stochastic policy gradient rein-
forcement learning on a simple 3D biped,” in Proc. Int. Conf. Intelligent
Robots and Systems (IROS 2004), Sendai, Japan, Oct. 2004, pp. 2849-2854.
[66] J. Peters and S. Schaal, “Policy gradient methods for robotics,” in Proc.
IEEE Int. Conf. Intelligent Robotics Systems (IROS 2006), Beijing, Oct. 9-15, 2006, pp. 2219-2225.
[67] P. Dayan and G. Hinton, “Using EM for reinforcement learning,” Neural
Comput., vol. 9, no. 2, pp. 271-278, 1997.
[68] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood
from incomplete data via the EM algorithm,” J. R. Statist. Soc. B, vol. 39,
no. 1, pp. 1-38, 1977.
[69] J. Kober and J. Peters, “Learning motor primitives in robotics,” in Proc.
Advances in Neural Information Processing Systems 21 (NIPS 2008),D.
Schuurmans, Y. Bengio, and D. Koller, Eds. Vancouver, BC, Dec. 8-11,
2009, pp. 297-304.
[70] M. Toussaint and A. Storkey, “Probabilistic inference for solving discrete
and continuous state Markov decision processes,” presented at the 23rd
Int. Conf. Machine Learning (ICML 2006), 2006.
[71] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis, “Learning model-
free control by a Monte-Carlo EM algorithm,” Auton. Robots, vol. 27,
no. 2, pp. 123-130, 2009.
[72] H. J. Kappen, “Linear theory for control of nonlinear stochastic systems,”
Phys. Rev. Lett., vol. 95, no. 20, pp. 200201-200204, Nov. 2005.
[73] H. J. Kappen, “An introduction to stochastic control theory, path inte-
grals and reinforcement learning,” in Cooperative Behavior in Neural Systems,
vol. 887, J. Marro, P. L. Garrido, and J. J. Torres, Eds. 2007, pp. 149-181.
[74] E. Theodorou, J. Buchli, and S. Schaal, “Path integral stochastic optimal
control for rigid body dynamics,” presented at the IEEE Int. Symp.
Approximate Dynamic Programming and Reinforcement Learning
(ADPRL2009), Nashville, TN, Mar. 30-Apr. 2, 2009.
[75] E. Todorov, “Efficient computation of optimal actions,” Proc. Nat. Acad.
Sci. USA, vol. 106, no. 28, pp. 11478-11483, July 2009.
[76] A. Ijspeert, J. Nakanishi, and S. Schaal, “Learning attractor landscapes for
learning motor primitives,” in Advances in Neural Information Processing Systems
15, S. Becker, S. Thrun, and K. Obermayer, Eds. 2003, pp. 1547-1554.
[77] C. M. Bishop, Pattern Recognition and Machine Learning. New York:
Springer-Verlag, 2006.
[78] A. Billard, S. Calinon, R. Dillmann, and S. Schaal, “Robot programming
by demonstration,” in Handbook of Robotics,vol.1,B.SicilianoandO.Khatib,
Eds. Cambridge, MA: MIT Press, 2008, ch. 59.
[79] Y. Wada and M. Kawato, “Trajectory formation of arm movement by a
neural network with forward and inverse dynamics models,” Syst. Comput.
Jpn., vol. 24, pp. 37-50, 1994.
[80] T. Inamura, I. Toshima, H. Tanie, and Y. Nakamura, “Embodied sym-
bol emergence based on mimesis theory,” Int. J. Robot. Res., vol. 23,
no. 4-5, p. 363, Apr.-May 2004.
[81] A. Y. Ng and S. Russell, “Algorithms for inverse reinforcement
learning,” in Proc. 17th Int. Conf. Machine Learning (ICML 2000), Stanford,
CA, 2000, pp. 663-670.
[82] P. Abbeel and A. Ng, “Apprenticeship learning via inverse reinforcement
learning,” in Proc. 21st Int. Conf. Machine Learning, 2004.
[83] N. Ratliff, D. Silver, and J. A. Bagnell, “Learning to search: Functional
gradient techniques for imitation learning,” Auton. Robots, vol. 27, no. 1,
pp. 25-53, 2009.
Stefan Schaal is a professor of computer science, neuro-
science, and biomedical engineering at the University of
Southern California and an invited researcher at the ATR
Computational Neuroscience Laboratory in Japan. He has
coauthored more than 200 papers in refereed journals and
conferences. He is a cofounder of the IEEE/RAS International
Conference on Humanoid Robotics as well as Robotics: Science and
Systems. He serves on the editorial board of Neural Networks,
International Journal of Humanoid Robotics, and Frontiers in Neuro-
robotics. He is a Member of the German National Academic
Foundation (Studienstiftung des Deutschen Volkes), Alexander
von Humboldt Foundation, Society for Neuroscience, the
Society for Neural Control of Movement, the IEEE, and AAAS.
His research interests include topics of statistical and machine
learning, neural networks, computational neuroscience, func-
tional brain imaging, nonlinear dynamics, nonlinear control
theory, and biomimetic robotics.
Christopher G. Atkeson received his M.S. degree in applied
mathematics (computer science) from Harvard University and
his Ph.D. degree in brain and cognitive sciences from Massa-
chusetts Institute of Technology (MIT). He is a professor at
the Robotics Institute and the Human–Computer Interaction
Institute, Carnegie Mellon University. He joined the MIT faculty
in 1986 and moved to the Georgia Institute of
Technology College of Computing in 1994. He has received
the National Science Foundation Presidential Young Investigator
Award, a Sloan Research Fellowship, and a Teaching Award
from the MIT Graduate Student Council. His research
focuses on humanoid robotics and robot learning by using
challenging dynamic tasks such as juggling. His specific
research interests include nonparametric learning; memory-based
learning, including approaches based on trajectory libraries;
reinforcement learning and other forms of learning based on
optimal control; learning from demonstration; and modeling
human behavior.
Address for Correspondence: Stefan Schaal, Computer Science,
Neuroscience, and Biomedical Engineering, University of
Southern California, Los Angeles, CA 90089-2905 USA.
E-mail: sschaal@usc.edu.