Efficient Reinforcement Learning for Motor Control
Marc Peter Deisenroth and Carl Edward Rasmussen
Department of Engineering, University of Cambridge
Trumpington Street, Cambridge CB2 1PZ, UK
MP Deisenroth acknowledges support by the German Research Foundation (DFG) through grant RA 1030/1-3 to CE Rasmussen.
Abstract Artificial learners often require many more trials
than humans or animals when learning motor control tasks
in the absence of expert knowledge. We implement two key
ingredients of biological learning systems, generalization and
incorporation of uncertainty into the decision-making process,
to speed up artificial learning. We present a coherent and fully
Bayesian framework that allows for efficient artificial learning
in the absence of expert knowledge. The success of our learning
framework is demonstrated on challenging nonlinear control
problems in simulation and in hardware.
I. INTRODUCTION
Learning from experience is a fundamental characteristic
of intelligence and holds great potential for artificial sys-
tems. Computational approaches for artificial learning from
experience are studied in reinforcement learning (RL) and
adaptive control. Although these fields have been studied
for decades, the rate at which artificial systems learn still
lags behind biological learners with respect to the amount
of experience required to solve a task. Experience can be
gathered by direct interaction with the environment.
Traditionally, learning even relatively simple control tasks
from scratch has been considered “daunting” [14] in the
absence of strong task-specific prior assumptions. In the
context of robotics, one popular approach employs knowl-
edge provided by a human “teacher” to restrict the solution
space [1], [3], [14]. However, expert knowledge can be
difficult to obtain, expensive, or simply not available. In the
context of control systems and in the absence of strong task-
specific knowledge, artificial learning algorithms often need
more trials than physically feasible.
In this paper, we propose a principled way of learning
control tasks without any expert knowledge or engineered
solutions. Our approach mimics two fundamental properties
of human experience-based learning. The first important
characteristic of humans is that we can generalize experience
to unknown situations. Second, humans explicitly model
and incorporate uncertainty into their decisions [6]. We
present a general and fully Bayesian framework for efficient
RL by coherently combining generalization and uncertainty
representation.
Unlike for discrete domains [10], generalization and in-
corporation of uncertainty into the decision-making process
are not consistently combined in RL although heuristics
exist [1]. In the context of motor control, generalization
typically requires a model or a simulator, that is, an internal
representation of the system dynamics. Since our objective
is to reduce the interactions with the real system needed
to successfully learn a motor control task, we have to face
the problem of sparse data. Thus, we explicitly require a
probabilistic model to additionally represent and quantify
uncertainty. For this purpose, we use flexible non-parametric
probabilistic Gaussian process (GP) models to extract valu-
able information from data and to reduce model bias.
II. GENERAL SETUP
We consider discrete-time control problems with
continuous-valued states x and external control signals
(actions) u. The dynamics of the system are described by a
Markov decision process (MDP), a computational framework
for decision-making under uncertainty. An MDP is a tuple
of four objects: the state space, the action space (also
called the control space), the one-step transition function
f, and the immediate cost function c(x) that penalizes the
distance to a given target x_target. The deterministic transition
dynamics

x_t = f(x_{t-1}, u_{t-1})    (1)

are not known in advance. However, in the context of a
control problem, we assume that the immediate cost function
c(·) can be chosen given the target x_target.
The goal is to find a policy π that minimizes the expected
long-term cost

V^π(x_0) = \sum_{t=0}^{T} E_{x_t}[ c(x_t) ]    (2)

of following a policy π for a finite horizon of T time steps.
The function V^π is called the value function, and V^π(x_0) is
called the value of the state x_0 under policy π.
A policy π is defined as a function that maps states to
actions. We consider stationary deterministic policies that are
parameterized by a vector ψ. Therefore, u_{t-1} = π(x_{t-1}, ψ)
and x_t = f(x_{t-1}, π(x_{t-1}, ψ)). Thus, a state x_t at time t
depends implicitly on the policy parameters ψ.
More precisely: In the context of motor control problems,
we aim to find a good policy π that leads to a low expected
long-term cost V^π(x_0) given an initial state distribution
p(x_0). We assume that no task-specific expert knowledge
is available. Furthermore, we desire to minimize interactions
with the real system. The setup we consider therefore cor-
responds to an RL problem with very limited interaction
resources.
We decompose the learning problem into a hierarchy of
three sub-problems, described in Figure 1.
Fig. 1. The learning problem can be divided into three hierarchical
problems. At the bottom layer, the transition dynamics f are learned. Based
on the transition dynamics, the value function V^π can be evaluated using
approximate inference techniques. At the top layer, an optimal control
problem has to be solved to determine a model-optimal policy π.
Algorithm 1 Fast learning for control
1: set policy to random                        ▷ policy initialization
2: loop
3:   execute policy                            ▷ interaction
4:   record collected experience
5:   learn probabilistic dynamics model        ▷ bottom layer
6:   loop                                      ▷ policy search
7:     simulate system with policy π           ▷ intermediate layer
8:     compute expected long-term cost for π, eq. (2)
9:     improve policy                          ▷ top layer
10:  end loop
11: end loop
At the bottom layer, a probabilistic model of the transition
dynamics f is learned (Section II-B). Given the model of the
transition dynamics and a policy π, the expected long-term
cost in equation (2) is evaluated. This policy evaluation requires
the computation of the predictive state distributions p(x_t) for
t = 1, . . . , T (intermediate layer in Figure 1, Section II-C).
At the top layer (Section II-D), the policy parameters ψ are
optimized based on the result of the policy evaluation. This
parameter optimization is called an indirect policy search.
The search is typically non-convex and requires iterative
optimization techniques. The policy evaluation and policy
improvement steps alternate until the policy search converges
to a local optimum. If the transition dynamics are given, the
two top layers correspond to an optimal control problem.
A. High-Level Summary of the Learning Approach
A high-level description of the proposed framework is
given in Algorithm 1. Initially, we set the policy to random
(line 1). The framework involves learning in two stages:
First, when interacting with the system (line 3) experience
is collected (line 4) and the internal probabilistic dynam-
ics model is updated based on both historical and novel
observations (line 5). Second, the policy is refined in the
light of the updated dynamics model (loop over lines 7–9)
using approximate inference and gradient-based optimization
techniques for policy evaluation and policy improvement,
respectively. The model-optimized policy is applied to the
real system (line 3) to gather novel experience (line 4).
The subsequent model update (line 5) accounts for pos-
sible discrepancies between the predicted and the actually
encountered state trajectory. With increasing experience, the
probabilistic model describes the dynamics well in regions
of the state space that are promising, that is, regions along
trajectories with low expected cost.
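To make Algorithm 1 concrete, the Python sketch below shows how the two learning stages interleave. It is a minimal illustration under stated assumptions, not the authors' implementation; env_rollout, learn_dynamics_model, and optimize_policy are hypothetical callables standing in for the three layers described in the following subsections.

```python
import numpy as np

def run_learning(env_rollout, learn_dynamics_model, optimize_policy,
                 n_trials=10, horizon=25):
    """Sketch of Algorithm 1: alternate between gathering experience
    on the real system and refining the policy on the learned model."""
    # Line 1: start with a random policy (here: a random parameter vector).
    policy_params = np.random.randn(50)
    data_x, data_y = [], []                      # recorded experience

    for trial in range(n_trials):                # outer loop (lines 2-11)
        # Lines 3-4: execute the current policy and record the trajectory.
        states, actions = env_rollout(policy_params, horizon)
        data_x.append(np.hstack([states[:-1], actions]))
        data_y.append(states[1:])

        # Line 5: (re-)learn the probabilistic dynamics model (bottom layer).
        model = learn_dynamics_model(np.vstack(data_x), np.vstack(data_y))

        # Lines 6-10: policy search using the model only (intermediate/top layer).
        policy_params = optimize_policy(policy_params, model, horizon)

    return policy_params
```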
B. Bottom Layer: Learning the Transition Dynamics
We learn the short-term transition dynamics f in equation (1)
with Gaussian process models [13]. A GP can be
considered a distribution over functions and is utilized for
state-of-the-art Bayesian non-parametric regression [13]. GP
regression combines both flexible non-parametric modeling
and tractable Bayesian inference.
The GP dynamics model takes as input a representation
of state-action pairs (xt1,ut1). The GP targets are a
representation of the consecutive states xt=f(xt1,ut1).
The dynamics models are learned using evidence maximiza-
tion [13]. A key advantage of GPs is that a parametric
structure of the underlying function does not need to be
known. Instead, an adaptive, probabilistic model for the
latent transition dynamics is inferred directly from observed
data. The GP also consistently describes the uncertainty
about the model itself.
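As an illustration of this bottom layer, the following sketch fits one GP per state dimension to observed transitions. The kernel choice and the use of scikit-learn's GaussianProcessRegressor (which fits hyper-parameters by maximizing the log marginal likelihood, i.e., the evidence) are assumptions made for the example; the paper does not prescribe a particular implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

def learn_dynamics_model(X, Y):
    """Fit one GP per state dimension mapping (x_{t-1}, u_{t-1}) -> x_t.

    X: (n, D+U) array of state-action pairs, Y: (n, D) array of next states.
    Hyper-parameters are set by evidence maximization, as in the paper;
    the squared-exponential kernel with per-input length-scales is assumed.
    """
    models = []
    for d in range(Y.shape[1]):
        kernel = (ConstantKernel(1.0) * RBF(length_scale=np.ones(X.shape[1]))
                  + WhiteKernel(noise_level=1e-2))
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        gp.fit(X, Y[:, d])
        models.append(gp)
    return models

def predict_next_state(models, x, u):
    """Gaussian predictive distribution over the next state (means, stds)."""
    xu = np.atleast_2d(np.concatenate([x, u]))
    means, stds = zip(*[gp.predict(xu, return_std=True) for gp in models])
    return np.concatenate(means), np.concatenate(stds)
```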
C. Intermediate Layer: Approximate Inference for Long-
Term Predictions
For an arbitrary pair (x,u), the GP returns a Gaus-
sian predictive distribution p(f(x,u)). Thus, when we
simulate the probabilistic GP model forward in time, the
predicted states are uncertain. We therefore need to be able
to predict successor states when the current state is given by
a probability distribution. Generally, a Gaussian distributed
state followed by nonlinear dynamics (as modeled by a
GP) results in a non-Gaussian successor state. We adopt the
results from [4], [11] and approximate the true predictive
distribution by a Gaussian with the same mean and the same
covariance matrix (exact moment matching). Throughout all
computations, we explicitly take the model uncertainty into
account by averaging over all plausible dynamics models
captured by the GP. To predict a successor state, we average
over both the uncertainty in the current state and the uncer-
tainty about the possibly imprecise dynamics model itself.
Thus, we reduce model bias, which is one of the strongest
arguments against model-based learning algorithms [2], [3],
[16].
Although the policy is deterministic, we need to consider
distributions over actions: For a single deterministic state
x_t, the policy will deterministically return a single action.
However, during the forward simulation, the states are given
by a probability distribution p(x_t), t = 0, . . . , T. Therefore,
we require the predictive distribution p(π(x_t)) over actions
to determine the distribution p(x_{t+1}) of the consecutive state.
We focus on nonlinear policies π represented by a radial
basis function (RBF) network. Therefore,

π(x) = \sum_{i=1}^{N} β_i φ_i(x),    (3)

where the basis functions φ_i are axis-aligned Gaussians
centered at µ_i, i = 1, . . . , N. An RBF network is equivalent
to the mean function of a GP or a Gaussian mixture model.
In short, the policy parameters ψ of the RBF policy are
the locations µ_i and the weights β_i as well as the length-scales
of the Gaussian basis functions φ_i and the amplitude
of the latent policy π.
Fig. 2. Quadratic (red, dashed) and saturating (blue, solid) cost functions.
The x-axis shows the distance of the state to the target, the y-axis shows the
corresponding immediate cost. In contrast to the quadratic cost function, the
saturating cost function can encode that a state is simply “far away” from
the target. The quadratic cost function pays much attention to how “far
away” the state really is.
The RBF policy in equation (3) allows
for an analytic computation of a distribution over actions, as
required for consistent predictions.
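A minimal sketch of evaluating the RBF policy in equation (3) for a single state follows; the parameter layout (centers, weights, shared length-scales) and the optional control limit are illustrative assumptions rather than the exact parameterization used in the paper.

```python
import numpy as np

def rbf_policy(x, centers, weights, lengthscales, u_max=None):
    """Evaluate pi(x) = sum_i beta_i * phi_i(x) with axis-aligned Gaussian
    basis functions phi_i centered at mu_i.

    centers: (N, D), weights: (N,), lengthscales: (D,).
    """
    diff = (x - centers) / lengthscales                 # (N, D)
    phi = np.exp(-0.5 * np.sum(diff ** 2, axis=1))      # (N,) basis activations
    u = phi @ weights
    if u_max is not None:                               # optional torque/force limit
        u = np.clip(u, -u_max, u_max)
    return u
```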
The GP model for the transition dynamics and the RBF
policy allow for the computation of the joint Gaussian
probability distribution p(x_t, u_t), which is required to compute
the distribution p(x_{t+1}) of the consecutive state via
moment matching. By iteration, we can thus compute an
approximate distribution of the state sequence x_0, . . . , x_T,
which is required to evaluate the expected long-term cost in
equation (2).
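The analytic moment-matching rollout itself is involved; the sketch below substitutes a simple particle-based stand-in that fits a Gaussian to propagated samples at every step. It matches empirical moments rather than computing them in closed form as in [4], [11], but it conveys how a Gaussian state distribution is pushed through the GP dynamics and the policy. It assumes the per-dimension GP interface from the earlier sketch.

```python
import numpy as np

def simulate_gaussian_rollout(models, policy, m0, S0, horizon,
                              n_particles=200, rng=None):
    """Approximate p(x_1), ..., p(x_T) by a Gaussian at every time step.

    models: list of per-dimension GPs exposing predict(..., return_std=True)
    policy: callable mapping a state to an action
    m0, S0: mean and covariance of the initial state distribution
    """
    rng = np.random.default_rng(rng)
    m, S = np.asarray(m0, float), np.asarray(S0, float)
    means, covs = [], []
    for _ in range(horizon):
        # Sample plausible current states; jitter keeps the covariance valid.
        xs = rng.multivariate_normal(m, S + 1e-8 * np.eye(len(m)),
                                     size=n_particles)
        nxt = np.empty_like(xs)
        for i, x in enumerate(xs):
            u = np.atleast_1d(policy(x))
            xu = np.atleast_2d(np.concatenate([x, u]))
            mu = np.array([gp.predict(xu)[0] for gp in models])
            sd = np.array([gp.predict(xu, return_std=True)[1][0]
                           for gp in models])
            # Sample from the GP predictive to average over model uncertainty.
            nxt[i] = mu + sd * rng.standard_normal(mu.shape)
        # Moment matching: keep only the mean and covariance of the particles.
        m, S = nxt.mean(axis=0), np.cov(nxt, rowvar=False)
        means.append(m)
        covs.append(S)
    return means, covs
```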
D. Top Layer: Policy Optimization
The optimization problem at the top layer in Figure 1
corresponds to finding policy parameters ψ that minimize the
expected long-term finite-horizon cost V^{π_ψ} in equation (2).
We employ a conjugate gradients minimizer, which re-
quires the partial derivatives of the value function with
respect to the policy parameters. These derivatives are
computed analytically by repeated application of the chain
rule [12]. Hence, our approach is a gradient-based policy
search method.
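In code, the top layer reduces to a standard nonlinear minimization of V^{π_ψ} with respect to ψ. The paper supplies analytically computed gradients to a conjugate gradients minimizer; the sketch below uses SciPy's CG method and lets it approximate the gradient numerically, which is only meant to illustrate the interface and is far slower than the analytic scheme.

```python
import numpy as np
from scipy.optimize import minimize

def optimize_policy(psi0, expected_long_term_cost, method="CG"):
    """Gradient-based policy search (top layer).

    expected_long_term_cost: callable psi -> V(psi), evaluated by the
    approximate inference of the intermediate layer.  The paper uses
    analytic derivatives; here SciPy estimates them numerically.
    """
    result = minimize(expected_long_term_cost, np.asarray(psi0, float),
                      method=method)
    return result.x
```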
III. COST FUNCTION
We assume that the immediate cost function c in equation (2)
does not incorporate any solution-specific knowledge
such as penalties on the control signal or speed variables (in
regulator problems). Only the target state xtarget is given. An
autonomous learner must be able to learn the remainder of
the task by itself: If the system reaches the target state but
overshoots due to too high velocities, the learning algorithm
should account for this kind of failing in a next trial. We
employ a cost function that solely uses a geometric distance
d of the current state to the target state. Thus, overshooting
causes higher long-term cost than staying close to the target.
A cost function commonly used in optimal control (par-
ticularly in combination with linear systems) is the quadratic
cost (red, dashed in Figure 2). One problem with the
quadratic cost function is that the long-term cost in equa-
tion (2) is highly dependent on the worst state along a pre-
dicted state trajectory. A second problem with the quadratic
cost is that the expected cumulative cost in equation (2) is
highly sensitive to details of a distribution that essentially
(a) Initially, when the mean of the
state is far away from the tar-
get, uncertain states (red, dashed-
dotted) are preferred to more certain
states with a more peaked distribu-
tion (black, dashed). This leads to
initial exploration.
(b) Finally, when the mean of the
state is close to the target, cer-
tain states with peaked distributions
cause less expected cost and are
therefore preferred to more uncer-
tain states (red, dashed-dotted). This
leads to exploitation once close to
the target.
Fig. 3. Automatic exploration and exploitation due to the saturating cost
function (blue, solid). The x-axes describe the state space. The target state
is the origin.
encode that the model has lost track of the state. In particular
in the early stages of learning, the predictive state uncertainty
may grow rapidly with the time horizon. To avoid an extreme
dependence on these arbitrary details, we instead use the cost
function

c(x) = 1 − exp( −(a²/2) d(x, x_target)² )    (4)

that is locally quadratic but which saturates at unity for large
deviations d from the desired target x_target (blue function,
solid, in Figure 2). In equation (4), the Euclidean distance
from the state x to the target state is denoted by d, and
the parameter a controls the width of the cost function. The
saturating cost function in equation (4) resembles the cost
function in human reasoning [5].
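The saturating cost in equation (4) is straightforward to implement; the sketch below is a direct transcription, with the target state and the width parameter a passed in explicitly.

```python
import numpy as np

def saturating_cost(x, x_target, a):
    """c(x) = 1 - exp(-(a^2 / 2) * d(x, x_target)^2), d Euclidean distance.

    Locally quadratic near the target, saturating at 1 far away;
    1/a sets the width of the cost function."""
    d2 = np.sum((np.asarray(x) - np.asarray(x_target)) ** 2)
    return 1.0 - np.exp(-0.5 * a ** 2 * d2)
```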
A. Exploration and Exploitation
The saturating cost function in equation (4) allows for
natural exploration even if the policy is greedy, that is,
it minimizes the expected long-term cost in equation (2).
This property is illustrated in Figure 3. If the mean of a
state distribution p(x_t) is far away from the target x_target, a
wide state distribution is more likely to have substantial tails
in some low-cost region than a fairly peaked distribution
(Figure 3(a)). If we initially start from a state distribution
in a high-cost region, the saturating cost therefore leads to
automatic exploration by favoring uncertain states.
If the mean of the state distribution is close to the
target as in Figure 3(b), wide distributions are likely to
have substantial tails in high-cost regions. By contrast, the
mass of a peaked distribution is more concentrated in low-
cost regions. In this case, a greedy policy prefers peaked
distributions close to the target, which leads to exploitation.
Hence, even for a greedy policy, the combination of a
probabilistic dynamics model and a saturating cost function
leads to exploration as long as the states are far away from
the target. Once close to the target, a greedy policy does not
veer from a trajectory that led the system to certain states
close to the target.
One way to encourage further exploration is to modify the
objective function in equation (2). Incorporation of the state
uncertainty itself would be an option, but it would lead to
extreme designs [8]. However, we are particularly interested
in exploring promising regions of the state space, where
“promising” is directly defined by the saturating cost function
c in equation (4). Therefore, we consider the variance of
the predicted cost, which can be computed analytically. To
encourage goal-directed exploration, we therefore minimize
the objective function

Ṽ^π(x_0) = \sum_{t=0}^{T} ( E_{x_t}[c(x_t)] + b σ_{x_t}[c(x_t)] ).    (5)

Here, σ denotes the standard deviation of the predicted cost.
For b < 0, uncertainty in the cost is encouraged; for b > 0,
uncertainty in the cost is penalized.
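The paper evaluates the mean and standard deviation of the saturating cost under a Gaussian state distribution in closed form. The sketch below estimates the same two moments by sampling, purely to illustrate the exploration objective in equation (5); the analytic expressions are not reproduced here, and the function names are illustrative.

```python
import numpy as np

def cost_moments(mean, cov, x_target, a, n_samples=2000, rng=None):
    """Monte-Carlo estimate of E[c(x)] and sigma[c(x)] for x ~ N(mean, cov).

    The paper computes these moments analytically; sampling is used here
    only to keep the example short."""
    rng = np.random.default_rng(rng)
    xs = rng.multivariate_normal(mean, cov, size=n_samples)
    d2 = np.sum((xs - np.asarray(x_target)) ** 2, axis=1)
    c = 1.0 - np.exp(-0.5 * a ** 2 * d2)
    return c.mean(), c.std()

def exploratory_value(cost_moments_per_step, b):
    """Objective of equation (5): sum_t E[c(x_t)] + b * sigma[c(x_t)].
    b < 0 rewards cost uncertainty (exploration); b > 0 penalizes it."""
    return sum(m + b * s for m, s in cost_moments_per_step)
```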
What is the difference between the variance of the state
and the variance of the cost? The variance of the predicted
cost depends on the variance of the state: If the state dis-
tribution is fairly peaked, the variance of the corresponding
cost is always small. However, an uncertain state does not
necessarily cause a wide cost distribution: If the mean of
the state distribution is in a high-cost region and the tails of
the distribution do not substantially cover low-cost regions,
the uncertainty of the predicted cost is very low. The only
case the cost distribution can be uncertain is if a) the state
is uncertain and b) a non-negligible part of the mass of the
state distribution is in a low-cost region. Hence, using the
uncertainty of the cost for exploration avoids extreme designs
by exploring regions that might be close to the target.
Exploration-favoring policies (b < 0) do not greedily
follow the most promising path (in terms of the expected
cost (2)), but they aim to gather more information to find
a better strategy in the long-term. By contrast, uncertainty-
averse policies (b > 0) follow the most promising path
and after finding a solution, they never veer from it. If
uncertainty-averse policies find a solution, they presumably
find it quicker (in terms of number of trials required) than an
exploration-favoring strategy. However, exploration-favoring
strategies find solutions more reliably. Furthermore, they
often ultimately provide better solutions than
uncertainty-averse policies.
IV. RESULTS
Our proposed learning framework is applied to two under-
actuated nonlinear control problems: the Pendubot [15] and
the inverted pendulum. The under-actuation of both systems
makes myopic policies fail. In the following, we exactly
follow the steps in Algorithm 1.
A. Pendubot
The Pendubot depicted in Figure 4 is a two-link, under-
actuated robot [15]. Torque can be exerted at the first joint,
but not at the second. The system has four continuous-valued
state variables: two joint angles and two joint angular ve-
locities. The angles of the joints, θ_1 and θ_2, are measured
anti-clockwise from upright. An applied external torque
u ∈ [−3.5, 3.5] Nm controls the first joint. In our simulation, the
values for the masses and the lengths of the pendulums are
Fig. 4. Pendubot. The control task is to swing both links up and to balance
them in the inverted position by applying a torque to the first joint only.
m_1 = 0.5 kg = m_2 and ℓ_1 = 0.6 m = ℓ_2. The sampling
frequency is set to 13.3 Hz, which is fairly low for this kind
of problem: A sampling frequency of 2,000 Hz is used in [9].
Starting from the position where both joints hang down,
the objective is to swing the Pendubot up and to balance it in
the inverted position. Note that the dynamics of the system
can be chaotic.
Our cost function penalizes the Euclidean distance d from
the tip of the outer pendulum to the target state.
The width 1/a = 0.5 m of the cost function in
equation (4) is chosen such that the immediate cost is about
unity as long as the distance between the pendulum tip
and the target state is greater than the length ℓ_2 of the
outer pendulum. Thus, the tip of the outer pendulum has
to cross horizontal to significantly reduce the immediate
cost from unity. Initially, we set the exploration parameter
in equation (5) to b = −0.2 to favor more exploration in
predicted high-reward regions of the state space. We increase
the exploration parameter linearly, such that it reaches 0 in
the last trial. The learning algorithm is fairly robust to the
selection of the exploration parameter b in equation (5): In
most cases, we could learn the tasks with b ∈ [−0.5, −0.2].
Figure 5 sketches a solution to the Pendubot problem after
an experience of approximately 90 s. The learned controller
attempts to keep the pendulums aligned, which, from a
mechanical point of view, leads to a faster swing-up motion.
B. Inverted Pendulum (Cart-Pole)
The inverted pendulum shown in Figure 6 consists of a
cart with mass m_1 and an attached pendulum with mass m_2
and length ℓ, which swings freely in the plane. The pendulum
angle θis measured anti-clockwise from hanging down. The
cart moves horizontally on a track with an applied external
force u. The state of the system is given by the position x
and the velocity ẋ of the cart and the angle θ and the angular
velocity θ̇ of the pendulum.
The objective is to swing the pendulum up and to balance
it in the inverted position in the middle of the track by simply
pushing the cart to the left and to the right.
We reported simulation results of this system in [12]. In
this paper, we demonstrate our learning algorithm in real
hardware. Unlike classical control methods, our algorithm
learns a model of the system dynamics in equation (1)
from data only. It is therefore not necessary to provide a
probably inaccurate idealized mathematical description of the
transition dynamics that includes parameters, such as friction,
motor constants, or delays. Since the pendulum length is ℓ = 125 mm, we set
Fig. 5. Illustration of the learned Pendubot swing up. Six snapshots of the
swing up (top left to bottom right) are shown. The cross marks the target
state of the tip of the outer pendulum. The green bar shows the applied
torque. The gray bar shows the immediate reward (negative cost plus 1). In
order to swing the Pendubot up, energy is induced first, and the Pendubot
swings left and then right up. Close to the target in the inverted position
(red cross), the controller no longer applies significant torques and keeps
the Pendubot close to the target.
Fig. 6. Inverted pendulum. The task is to swing the pendulum up and to
balance it in the inverted position in the middle of the track by applying
horizontal forces to the cart only.
the sampling frequency to 10 Hz, which is about five times
faster than the characteristic frequency of the pendulum.
Furthermore, we choose the cost function in equation (4)
with 1/a ≈ 0.07 m, such that the cost incurred does not
substantially differ from unity if the distance between the
pendulum tip and the target state is greater than ℓ. The force
is constrained to u ∈ [−10, 10] N.
Following Algorithm 1, we initialized the learning system
with two trials of length T= 2.5 s, where random actions
(horizontal forces to the cart) were applied. The five seconds
of data collected in these trials were used to train a first
probabilistic dynamics model. Using this model to internally
Fig. 7. Inverted pendulum in hardware; snapshots of a controlled trajectory.
The pendulum is swung up and balanced in the inverted position close to
the target state (green cross). To solve this task, our algorithm required only
17.5 s of interaction with the physical system.
simulate the dynamics, the parameters of the RBF controller
were optimized. In the third trial, this controller is applied
to the real system. The controller manages to keep the
cart in the middle of the track, but the pendulum does
not go beyond horizontal—the system never experienced
states where the pendulum is above horizontal. However,
it takes the new observations into account and re-train the
probabilistic dynamics model. With the uncertainty in the
predictions decreases and a good policy for to the updated
model is found. Applying this new policy for another 2.5 s
leads to the fourth trial where the controller swings the
pendulum up, but drastically overshoots. However, for the
first time states close to the target state ware encountered.
Taking these observations into account, the dynamics model
is updated, and the corresponding controller is learned. In the
fifth trial, the controller learned to reduce the angular velocity
substantially since falling over leads to high expected cost.
After two more trials, the learned controller can solve the
cart-pole task based on a total of only 17.5 s of experience.
Figure 7 shows snapshots of a test trajectory of 20 s length.
A video showing the entire learning process can be found at
http://mlg.eng.cam.ac.uk/marc/.
Our learning algorithm is very general and worked imme-
diately when we applied it to real hardware. Since we could
derive all required parameters (the width of the cost function
and the sampling frequency) from the length of the pendulum,
no parameter tuning was necessary.
V. DISCUSSION
Our approach learns very fast in terms of the amount of
experience required to solve a task. However, the current
implementation requires about ten minutes of CPU time
on a standard PC per policy search. The most demanding
computations are the approximate inference based on mo-
ment matching and the computation of the derivatives, which
require O(Tn²D³) operations. Here, T is the prediction
horizon, n the size of the dynamics training set, and D
is the dimension of the state. Once the policy has been
learned, the policy can be implemented and applied in real
time (Section IV-B).
The model of the transition dynamics f in equation (1) is
probabilistic, but the internal simulation is fully deterministic:
For a given policy parameterization and an initial state
distribution p(x_0), the approximate inference is deterministic
and does not require any sampling. This property is still
valid if the transition dynamics f and/or the policy π
are stochastic. Due to the deterministic simulative model,
any optimization method for deterministic functions can be
employed for the policy search.
The algorithm generalizes directly to multiple actuators
by using a policy in equation (3) with multivariate outputs.
With this approach, we successfully applied our learning
framework to the Pendubot task with two actuators (not
discussed in this paper).
We have demonstrated learning in the special case where
we assume that the state is fully observable. In principle,
there is nothing to hinder the use of the algorithm when
observations are noisy. After learning a generative model for
the latent transitions, the hidden state can be tracked using
the GP-ADF filter proposed in [4]. The learning algorithm
and the involved computations generalize directly to this
setting.
Our experience is that the probabilistic GP dynamics
model leads to fairly robust controllers: First, since the
model can be considered a distribution over all models that
plausibly explain the experience, incorporation of new ex-
perience does not usually make previously plausible models
implausible. Second, the moment-matching approximation
used in the approximate inference is a conservative approximation
of a distribution: Let q be the approximate Gaussian
distribution computed by moment matching and p be
the true predictive distribution; then we minimize KL(p||q).
Minimizing KL(p||q) ensures that q is non-zero wherever the
true distribution p is non-zero. This is an important issue
in the context of coherent predictions and, therefore, robust
control: The approximate distribution q is not overconfident,
but it can be too cautious since it captures all modes of the
true distribution, as shown by [7]. If we can still learn a
controller using the admittedly conservative moment-matching
approximation, the controller is expected to be robust.
If a deterministic dynamics model is used, incorporation
of new experience can drastically change the model. We
observed that this model change can have a strong influence
on the optimization procedure [12]. In the case of the Pendubot,
we have experimental evidence that the deterministic learning
algorithm cannot explore the state space sufficiently well,
that is, it never comes close to the target state at all.
The general form of the saturating cost function in
equation (4) can be chosen for arbitrary control problems.
Therefore, it is not problem specific. However, it clearly
favors the incorporation of uncertainty into dynamics models.
Hence, it can be considered algorithm-specific.
Our learning algorithm does not require an explicit global
model of the value function V^π, but instead evaluates the
value function for an initial state distribution p(x_0). Although
global value function models are often used to derive an
optimal policy, they are an additional source of errors. It is
often unclear how an error in the value function model affects
the policy if the model is not exact.
VI. CONCLUSION
We proposed a general framework for efficient reinforce-
ment learning in the context of motor control problems. The
key ingredient of this framework is a probabilistic model
for the transition dynamics, which mimics two important
features of biological learners: the ability to generalize and
the explicit incorporation of uncertainty into the decision-
making process. We successfully applied our algorithm to
the simulated Pendubot and to the cart-pole problem in real
hardware, demonstrating the flexibility and the success of
our approach. To the best of our knowledge, we report an
unprecedented speed of learning for both tasks.
REFERENCES
[1] Pieter Abbeel, Morgan Quigley, and Andrew Y. Ng. Using Inaccurate
Models in Reinforcement Learning. In Proceedings of the 23rd
International Conference on Machine Learning, pages 1–8, Pittsburgh,
PA, USA, June 2006.
[2] Christopher G. Atkeson and Juan C. Santamaría. A Comparison of
Direct and Model-Based Reinforcement Learning. In Proceedings of
the International Conference on Robotics and Automation, 1997.
[3] Christopher G. Atkeson and Stefan Schaal. Robot Learning from
Demonstration. In Proceedings of the 14th International Conference
on Machine Learning, pages 12–20, Nashville, TN, USA, July 1997.
Morgan Kaufmann.
[4] Marc P. Deisenroth, Marco F. Huber, and Uwe D. Hanebeck. Analytic
Moment-based Gaussian Process Filtering. In Proceedings of the
26th International Conference on Machine Learning, pages 225–232,
Montreal, Canada, June 2009. Omnipress.
[5] Konrad P. Körding and Daniel M. Wolpert. The Loss Function of
Sensorimotor Learning. In Proceedings of the National Academy of
Sciences, volume 101, pages 9839–9842, 2004.
[6] Konrad P. Körding and Daniel M. Wolpert. Bayesian Decision Theory
in Sensorimotor Control. Trends in Cognitive Sciences, 10(7):319–326,
June 2006.
[7] Malte Kuss and Carl E. Rasmussen. Assessing Approximations for
Gaussian Process Classification. In Advances in Neural Information
Processing Systems 18, pages 699–706. The MIT Press, Cambridge,
MA, USA, 2006.
[8] David J. C. MacKay. Information-Based Objective Functions for
Active Data Selection. Neural Computation, 4:590–604, 1992.
[9] Rowland O’Flaherty, Ricardo G. Sanfelice, and Andrew R. Teel.
Robust Global Swing-Up of the Pendubot Via Hybrid Control. In
Proceedings of the 2008 American Control Conference, pages 1424–
1429, Seattle, WA, USA, June 2008.
[10] Pascal Poupart, Nikos Vlassis, Jesse Hoey, and Kevin Regan. An
Analytic Solution to Discrete Bayesian Reinforcement Learning. In
Proceedings of the 23rd International Conference on Machine Learn-
ing, pages 697–704, Pittsburgh, PA, USA, 2006. ACM.
[11] Joaquin Quiñonero-Candela, Agathe Girard, Jan Larsen, and Carl E.
Rasmussen. Propagation of Uncertainty in Bayesian Kernel Models—
Application to Multiple-Step Ahead Forecasting. In IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing,
volume 2, pages 701–704, April 2003.
[12] Carl E. Rasmussen and Marc P. Deisenroth. Recent Advances in
Reinforcement Learning, volume 5323 of Lecture Notes in Computer
Science, chapter Probabilistic Inference for Fast Learning in Control,
pages 229–242. Springer-Verlag, November 2008.
[13] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Pro-
cesses for Machine Learning. Adaptive Computation and Machine
Learning. The MIT Press, Cambridge, MA, USA, 2006.
[14] Stefan Schaal. Learning From Demonstration. In Advances in Neural
Information Processing Systems 9, pages 1040–1046. The MIT Press,
Cambridge, MA, USA, 1997.
[15] Mark W. Spong and Daniel J. Block. The Pendubot: A Mechatronic
System for Control Research and Education. In Proceedings of the
Conference on Decision and Control, pages 555–557, 1995.
[16] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An
Introduction. Adaptive Computation and Machine Learning. The MIT
Press, Cambridge, MA, USA, 1998.