PILCO: A Model-Based and Data-Efficient Approach to Policy Search
Marc Peter Deisenroth marc@cs.washington.edu
Department of Computer Science & Engineering, University of Washington, USA
Carl Edward Rasmussen cer54@cam.ac.uk
Department of Engineering, University of Cambridge, UK
Abstract

In this paper, we introduce pilco, a practical, data-efficient model-based policy search method. Pilco reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, pilco can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement. We report unprecedented learning efficiency on challenging and high-dimensional control tasks.
1. Introduction and Related Work
To date, reinforcement learning (RL) often suffers from data inefficiency, i.e., RL requires too many trials to learn a particular task. For example, learning one of the simplest RL tasks, the mountain-car, often requires tens if not hundreds or thousands of trials—independent of whether policy iteration, value iteration, or policy search methods are used. Hence, RL methods are often largely inapplicable to mechanical systems that quickly wear out, e.g., low-cost robots.

Increasing data efficiency requires either informative prior knowledge or extracting more information from available data. In this paper, we do not assume that any expert knowledge is available (e.g., in terms of demonstrations or differential equations for the dynamics). Instead, we elicit a general policy-search framework for data-efficient learning from scratch.
Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).
Generally, model-based methods, i.e., methods that learn a dynamics model of the environment, are more promising for efficiently extracting valuable information from available data than model-free methods such as Q-learning or TD-learning. One reason why model-based methods are not widely used in learning from scratch is that they suffer from model bias, i.e., they inherently assume that the learned dynamics model sufficiently accurately resembles the real environment, see, e.g., (Schneider, 1997; Schaal, 1997; Atkeson & Santamaría, 1997). Model bias is especially an issue when only a few samples and no informative prior knowledge about the task to be learned are available.

Fig. 1 illustrates how model bias affects learning. Given a small data set of observed deterministic transitions (left), multiple transition functions plausibly could have generated the data (center). Choosing a single one can have severe consequences: when long-term predictions (or trajectories sampled from this model) leave the training data, the predictions of the function approximator are essentially arbitrary, but they are claimed with full confidence! By contrast, a probabilistic function approximator places a posterior distribution over the transition function (right) and expresses the level of uncertainty about the model.

Hence, for learning from scratch, we first require a probabilistic dynamics model to express model uncertainty. We employ non-parametric probabilistic Gaussian processes (GPs) for this purpose. Second, model uncertainty must be incorporated into planning and policy evaluation. Deterministic approximate inference techniques for policy evaluation allow us to apply policy search based on analytic policy gradients. An explicit value function model is not required. Based on these ideas, we propose a model-based policy search method, which we call pilco (probabilistic inference for learning control). Pilco achieves unprecedented data efficiency in continuous state-action domains and is directly applicable to physical systems, e.g., robots.
Figure 1. Small data set of observed transitions (left), multiple plausible deterministic function approximators (center), probabilistic function approximator (right). In all three panels, the horizontal axis shows the input (x_i, u_i) and the vertical axis the function value f(x_i, u_i). The probabilistic approximator models uncertainty about the latent function.
A common approach in designing adaptive controllers that take uncertainty of the model parameters into account is to add an extra term to the cost function of a minimum-variance controller (Fabri & Kadirkamanathan, 1998). Here, the uncertainty of the model parameters is penalized to improve the model-parameter estimation. Abbeel et al. (2006) proposed further successful heuristics to deal with inaccurate models: based on good-guess parametric dynamics models, locally optimal controllers, and temporal bias terms to account for model discrepancies, very impressive results were obtained. Schneider (1997) and Bagnell & Schneider (2001) proposed to account for model bias by explicitly modeling and averaging over model uncertainty. Pilco builds upon the successful approach by Schneider (1997), where model uncertainty is treated as temporally uncorrelated noise. However, pilco neither requires sampling methods for planning, nor is it restricted to a finite number of plausible models.
Algorithms with GP dynamics models in RL were presented by Rasmussen & Kuss (2004), Ko et al. (2007), and Deisenroth et al. (2009). Shortcomings of these approaches are that the dynamics models are either learned by motor babbling, which is data inefficient, or that value function models have to be maintained, which does not scale well to high dimensions. The approaches by Engel et al. (2003) and Wilson et al. (2010) are based on GP value function models and, thus, suffer from the same problems. As an indirect policy search method, pilco does not require an explicit value function model.

An extension of pilco to planning and control under task-space constraints in a robotic manipulation task is presented in (Deisenroth et al., 2011).
Throughout this paper, we consider dynamic systems

    x_t = f(x_{t−1}, u_{t−1})    (1)

with continuous-valued states x ∈ R^D and controls u ∈ R^F and unknown transition dynamics f. The objective is to find a deterministic policy/controller π: x ↦ π(x) = u that minimizes the expected return

    J^π(θ) = Σ_{t=0}^{T} E_{x_t}[c(x_t)],    x_0 ∼ N(µ_0, Σ_0),    (2)

of following π for T steps, where c(x_t) is the cost (negative reward) of being in state x_t at time t. We assume that π is a function parametrized by θ and that c encodes some information about a target state x_target.
2. Model-based Indirect Policy Search
In the following, we detail the key components of the pilco policy-search framework: the dynamics model, analytic approximate policy evaluation, and gradient-based policy improvement.
2.1. Dynamics Model Learning
Pilco's probabilistic dynamics model is implemented as a GP, where we use tuples (x_{t−1}, u_{t−1}) ∈ R^{D+F} as training inputs and differences ∆_t = x_t − x_{t−1} + ε ∈ R^D, ε ∼ N(0, Σ_ε), Σ_ε = diag([σ²_{ε_1}, ..., σ²_{ε_D}]), as training targets. The GP yields one-step predictions

    p(x_t | x_{t−1}, u_{t−1}) = N(x_t | µ_t, Σ_t),    (3)
    µ_t = x_{t−1} + E_f[∆_t],    (4)
    Σ_t = var_f[∆_t].    (5)
Throughout this paper, we consider a prior mean function m ≡ 0 and the squared exponential (SE) kernel k with automatic relevance determination. The SE covariance function is defined as

    k(x̃, x̃') = α² exp(−½ (x̃ − x̃')ᵀ Λ⁻¹ (x̃ − x̃'))    (6)

with x̃ := [xᵀ uᵀ]ᵀ. Here, we define α² as the variance of the latent function f and Λ := diag([ℓ²_1, ..., ℓ²_D]), which depends on the characteristic length-scales ℓ_i. Given n training inputs X̃ = [x̃_1, ..., x̃_n] and corresponding training targets y = [∆_1, ..., ∆_n]ᵀ, the posterior GP hyper-parameters (length-scales ℓ_i, signal variance α², noise variances Σ_ε) are learned by evidence maximization (Rasmussen & Williams, 2006).
The posterior predictive distribution p(∆_* | x̃_*) for an arbitrary, but known, test input x̃_* is Gaussian with mean and variance

    m_f(x̃_*) = E_f[∆_*] = k_*ᵀ (K + σ²_ε I)⁻¹ y = k_*ᵀ β,    (7)
    σ²_f(x̃_*) = var_f[∆_*] = k_** − k_*ᵀ (K + σ²_ε I)⁻¹ k_*,    (8)

respectively, where k_* := k(X̃, x̃_*), k_** := k(x̃_*, x̃_*), β := (K + σ²_ε I)⁻¹ y, and K is the Gram matrix with entries K_ij = k(x̃_i, x̃_j).
For multivariate targets, we train conditionally independent GPs for each target dimension, i.e., the GPs are independent for deterministically given test inputs. For uncertain inputs, the target dimensions covary.
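To make Sec. 2.1 concrete, the following sketch trains one GP per target dimension on state differences with an SE-ARD kernel whose hyper-parameters are fit by evidence maximization. It is a minimal illustration under assumptions, not the authors' code: scikit-learn's GaussianProcessRegressor stands in for the paper's GP implementation, and the function names (train_dynamics_models, predict_next_state) are our own.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

def train_dynamics_models(X, U, X_next):
    """Fit one GP per target dimension on inputs x~ = (x, u) and targets ∆ = x' − x."""
    inputs = np.hstack([X, U])            # training inputs x̃ = [x, u], shape (n, D+F)
    targets = X_next - X                  # training targets ∆_t = x_t − x_{t−1}
    models = []
    for a in range(targets.shape[1]):
        # SE kernel with ARD length-scales, signal variance α², and noise variance σ²_ε;
        # hyper-parameters are learned by maximizing the marginal likelihood (evidence).
        kernel = (ConstantKernel(1.0) * RBF(length_scale=np.ones(inputs.shape[1]))
                  + WhiteKernel(noise_level=1e-2))
        gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=3)
        gp.fit(inputs, targets[:, a])
        models.append(gp)
    return models

def predict_next_state(models, x, u):
    """One-step prediction for a known input, Eqs. (3)-(5): x_{t−1} + E_f[∆_t] and var_f[∆_t]."""
    xu = np.hstack([x, u])[None, :]
    mean, var = [], []
    for gp in models:
        m, s = gp.predict(xu, return_std=True)
        mean.append(m[0])
        var.append(s[0] ** 2)
    return x + np.array(mean), np.array(var)
```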
2.2. Policy Evaluation
Minimizing and evaluating J^π in Eq. (2) requires long-term predictions of the state evolution. To obtain the state distributions p(x_1), ..., p(x_T), we cascade one-step predictions, see Eqs. (3)–(5). Doing this properly requires mapping uncertain test inputs through the GP dynamics model. In the following, we assume that these test inputs are Gaussian distributed and extend the results from Quiñonero-Candela et al. (2003) to the multivariate case and the incorporation of controls.
For predicting x_t from p(x_{t−1}), we require a joint distribution p(x_{t−1}, u_{t−1}). As the control u_{t−1} = π(x_{t−1}, θ) is a function of the state, we compute the desired joint as follows: First, we compute the mean µ_u and the covariance Σ_u of the predictive control distribution p(u_{t−1}) by integrating out the state. Subsequently, the cross-covariance cov[x_{t−1}, u_{t−1}] is computed. Finally, we approximate the joint state-control distribution p(x̃_{t−1}) = p(x_{t−1}, u_{t−1}) by a Gaussian with the correct mean and covariance. These computations depend on the parametrization of the policy π. For many interesting controller parametrizations, the required computations can be performed analytically, although often neither p(u_{t−1}) nor p(x_{t−1}, u_{t−1}) are exactly Gaussian (Deisenroth, 2010).
From now on, we assume a joint Gaussian distribution p(x̃_{t−1}) = N(x̃_{t−1} | µ̃_{t−1}, Σ̃_{t−1}) at time t−1. When predicting the distribution

    p(∆_t) = ∫ p(f(x̃_{t−1}) | x̃_{t−1}) p(x̃_{t−1}) dx̃_{t−1},    (9)

we integrate out the random variable x̃_{t−1}. Note that the transition probability p(f(x̃_{t−1}) | x̃_{t−1}) is obtained from the posterior GP distribution. Computing the exact predictive distribution in Eq. (9) is analytically intractable. Therefore, we approximate p(∆_t) by a Gaussian using exact moment matching, see Fig. 2.
Figure 2. GP prediction at an uncertain input. The input distribution p(x_{t−1}, u_{t−1}) is assumed Gaussian (lower right panel). When propagating it through the GP model (upper right panel), we obtain the shaded distribution p(∆_t) in the upper left panel, which we approximate by a Gaussian with the exact mean and variance.
For the time being, assume the mean µ_∆ and the covariance Σ_∆ of the predictive distribution p(∆_t) are known. Then, a Gaussian approximation to the desired distribution p(x_t) is given as N(x_t | µ_t, Σ_t) with

    µ_t = µ_{t−1} + µ_∆,    (10)
    Σ_t = Σ_{t−1} + Σ_∆ + cov[x_{t−1}, ∆_t] + cov[∆_t, x_{t−1}],    (11)
    cov[x_{t−1}, ∆_t] = cov[x_{t−1}, u_{t−1}] Σ_u⁻¹ cov[u_{t−1}, ∆_t],    (12)

where the computation of the cross-covariances in Eq. (12) depends on the policy parametrization, but can often be performed analytically. The computation of the cross-covariance cov[x_{t−1}, ∆_t] in Eq. (11) is detailed by Deisenroth (2010).
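The cascading of one-step predictions in Eqs. (9)–(12) can be summarized as the following loop, which is our own schematic rendering rather than the authors' implementation. The two callables passed in are assumptions: joint_moments returns the Gaussian approximation of p(x̃_{t−1}) = p(x_{t−1}, u_{t−1}) for the chosen policy, and gp_moment_match returns the moments of p(∆_t) together with the cross-covariance cov[x_{t−1}, ∆_t], e.g., via Eqs. (12)–(23).

```python
def propagate_state_distributions(mu0, Sigma0, T, joint_moments, gp_moment_match):
    """Cascade moment-matched one-step predictions to obtain Gaussian
    approximations of p(x_1), ..., p(x_T), following Eqs. (9)-(12)."""
    mu_x, Sigma_x = mu0, Sigma0
    state_dists = []
    for _ in range(T):
        # Gaussian approximation of the joint state-control distribution p(x, u).
        mu_xu, Sigma_xu = joint_moments(mu_x, Sigma_x)
        # Moment-matched GP prediction of ∆ (Eq. (9)) and cross-covariance cov[x, ∆].
        mu_d, Sigma_d, cov_x_d = gp_moment_match(mu_xu, Sigma_xu)
        # Next-state moments, Eqs. (10)-(11).
        mu_x = mu_x + mu_d
        Sigma_x = Sigma_x + Sigma_d + cov_x_d + cov_x_d.T
        state_dists.append((mu_x, Sigma_x))
    return state_dists
```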
In the following, we compute the mean µ_∆ and the variance Σ_∆ of the predictive distribution, see Eq. (9).
2.2.1. Mean Prediction
Following the law of iterated expectations, for target dimensions a = 1, ..., D, we obtain

    µ_∆^a = E_{x̃_{t−1}}[E_f[f(x̃_{t−1}) | x̃_{t−1}]] = E_{x̃_{t−1}}[m_f(x̃_{t−1})]
          = ∫ m_f(x̃_{t−1}) N(x̃_{t−1} | µ̃_{t−1}, Σ̃_{t−1}) dx̃_{t−1}    (13)
          = β_aᵀ q_a    (14)

with β_a = (K_a + σ²_{ε_a} I)⁻¹ y_a and q_a = [q_{a_1}, ..., q_{a_n}]ᵀ. With m_f given in Eq. (7), the entries of q_a ∈ R^n are

    q_{a_i} = ∫ k_a(x̃_i, x̃_{t−1}) N(x̃_{t−1} | µ̃_{t−1}, Σ̃_{t−1}) dx̃_{t−1}
            = α²_a |Σ̃_{t−1} Λ_a⁻¹ + I|^{−1/2} exp(−½ ν_iᵀ (Σ̃_{t−1} + Λ_a)⁻¹ ν_i),    (15)
    ν_i := x̃_i − µ̃_{t−1}.    (16)

Here, ν_i in Eq. (16) is the difference between the training input x̃_i and the mean of the “test” input distribution p(x_{t−1}, u_{t−1}).
2.2.2. Covariance Matrix of the Prediction
To compute the predictive covariance matrix Σ_∆ ∈ R^{D×D}, we distinguish between diagonal and off-diagonal elements. Using the law of iterated variances, we obtain for target dimensions a, b = 1, ..., D

    σ²_{aa} = E_{x̃_{t−1}}[var_f[∆_a | x̃_{t−1}]] + E_{f, x̃_{t−1}}[∆²_a] − (µ_∆^a)²,    (17)
    σ²_{ab} = E_{f, x̃_{t−1}}[∆_a ∆_b] − µ_∆^a µ_∆^b,    a ≠ b,    (18)

respectively, where µ_∆^a is known from Eq. (14). The off-diagonal terms do not contain the additional term E_{x̃_{t−1}}[cov_f[∆_a, ∆_b | x̃_{t−1}]] because of the conditional independence assumption of the GP models: different target dimensions do not covary for given x̃_{t−1}.

First, we compute the terms that are common to both the diagonal and off-diagonal entries: with the Gaussian approximation N(x̃_{t−1} | µ̃_{t−1}, Σ̃_{t−1}) of p(x̃_{t−1}) and the law of iterated expectations, we obtain

    E_{f, x̃_{t−1}}[∆_a ∆_b] = E_{x̃_{t−1}}[E_f[∆_a | x̃_{t−1}] E_f[∆_b | x̃_{t−1}]]
                            = ∫ m_f^a(x̃_{t−1}) m_f^b(x̃_{t−1}) p(x̃_{t−1}) dx̃_{t−1}    (19)

due to the conditional independence of ∆_a and ∆_b given x̃_{t−1}. Using now the definition of the mean function m_f in Eq. (7), we obtain

    E_{f, x̃_{t−1}}[∆_a ∆_b] = β_aᵀ Q β_b,    (20)
    Q := ∫ k_a(X̃, x̃_{t−1}) k_b(X̃, x̃_{t−1})ᵀ p(x̃_{t−1}) dx̃_{t−1}.    (21)

Using standard results from Gaussian multiplications and integration, we obtain the entries Q_ij of Q ∈ R^{n×n}

    Q_ij = k_a(x̃_i, µ̃_{t−1}) k_b(x̃_j, µ̃_{t−1}) |R|^{−1/2} exp(½ z_ijᵀ R⁻¹ Σ̃_{t−1} z_ij),    (22)

where we defined R := Σ̃_{t−1}(Λ_a⁻¹ + Λ_b⁻¹) + I and z_ij := Λ_a⁻¹ ν_i + Λ_b⁻¹ ν_j, with ν_i taken from Eq. (16). Hence, the off-diagonal entries of Σ_∆ are fully determined by Eqs. (14)–(16), (18), and (20)–(22).

From Eq. (17), we see that the diagonal entries of Σ_∆ contain the additional term

    E_{x̃_{t−1}}[var_f[∆_a | x̃_{t−1}]] = α²_a − tr((K_a + σ²_{ε_a} I)⁻¹ Q)    (23)

with Q given in Eq. (22). This term is the expected variance of the latent function (see Eq. (8)) under the distribution of x̃_{t−1}.
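For completeness, the entries σ²_ab of Σ_∆ in Eqs. (18), (20)–(23) can be transcribed as below. This is our own, deliberately unoptimized sketch (the double loop over the n training points would be vectorized in practice); the argument layout and the optional Kinv_a switch for diagonal entries are our choices, not the paper's interface.

```python
import numpy as np

def se_kernel(x, y, alpha2, lambda_diag):
    """SE kernel of Eq. (6) with diagonal Λ given by lambda_diag."""
    d = x - y
    return alpha2 * np.exp(-0.5 * np.sum(d * d / lambda_diag))

def predictive_cov_entry(X_tilde, mu_in, Sigma_in,
                         beta_a, alpha2_a, lambda_a, mu_da,
                         beta_b, alpha2_b, lambda_b, mu_db,
                         Kinv_a=None):
    """σ²_ab from Eqs. (18), (20)-(22); pass Kinv_a = (K_a + σ²_ε_a I)⁻¹ when a == b
    to add the diagonal-only term of Eqs. (17), (23)."""
    n, E = X_tilde.shape
    nu = X_tilde - mu_in                                   # ν_i, Eq. (16)
    La_inv, Lb_inv = np.diag(1.0 / lambda_a), np.diag(1.0 / lambda_b)
    R = Sigma_in @ (La_inv + Lb_inv) + np.eye(E)           # R, Eq. (22)
    R_inv_S = np.linalg.solve(R, Sigma_in)                 # R⁻¹ Σ̃
    ka = np.array([se_kernel(x, mu_in, alpha2_a, lambda_a) for x in X_tilde])
    kb = np.array([se_kernel(x, mu_in, alpha2_b, lambda_b) for x in X_tilde])
    Q = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            z = La_inv @ nu[i] + Lb_inv @ nu[j]            # z_ij
            Q[i, j] = (ka[i] * kb[j] / np.sqrt(np.linalg.det(R))
                       * np.exp(0.5 * z @ R_inv_S @ z))    # Q_ij, Eq. (22)
    cov = beta_a @ Q @ beta_b - mu_da * mu_db              # Eqs. (18), (20)
    if Kinv_a is not None:                                 # diagonal entry, Eqs. (17), (23)
        cov += alpha2_a - np.trace(Kinv_a @ Q)
    return cov
```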
With the Gaussian approximation N(∆_t | µ_∆, Σ_∆) of p(∆_t), we obtain a Gaussian approximation N(x_t | µ_t, Σ_t) of p(x_t) through Eqs. (10)–(12).
To evaluate the expected return J^π in Eq. (2), it remains to compute the expected values

    E_{x_t}[c(x_t)] = ∫ c(x_t) N(x_t | µ_t, Σ_t) dx_t,    t = 0, ..., T,    (24)

of the cost c with respect to the predictive state distributions. We assume that the cost c is chosen so that Eq. (24) can be solved analytically, e.g., polynomials. In this paper, we use

    c(x) = 1 − exp(−‖x − x_target‖² / (2σ²_c)) ∈ [0, 1],    (25)

which is a squared exponential subtracted from unity. In Eq. (25), x_target is the target state and σ²_c controls the width of c. This unimodal cost can be considered a smooth approximation of a 0-1 cost of a target area.
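For the saturating cost in Eq. (25), the expectation in Eq. (24) has a closed form that follows from the same Gaussian integrals used above. The paper only states that Eq. (24) is analytic for this cost, so the formula and sketch below are our reconstruction.

```python
import numpy as np

def expected_saturating_cost(mu, Sigma, x_target, sigma_c):
    """E[c(x)] for c(x) = 1 − exp(−‖x − x_target‖² / (2 σ_c²)) and x ~ N(mu, Sigma):
    E[c(x)] = 1 − |I + Σ/σ_c²|^{-1/2} exp(−½ (µ − x_target)ᵀ (Σ + σ_c² I)⁻¹ (µ − x_target))."""
    D = mu.shape[0]
    d = mu - x_target
    det_term = np.linalg.det(np.eye(D) + Sigma / sigma_c**2) ** -0.5
    quad = d @ np.linalg.solve(Sigma + sigma_c**2 * np.eye(D), d)
    return 1.0 - det_term * np.exp(-0.5 * quad)
```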
2.3. Analytic Gradients for Policy Improvement
Both µ_t and Σ_t are functionally dependent on the mean µ_u and the covariance Σ_u of the control signal (and on θ) through µ̃_{t−1} and Σ̃_{t−1}, respectively, see Eqs. (15), (16), and (22), for instance. Hence, we can analytically compute the gradients of the expected return J^π with respect to the policy parameters θ, which we sketch in the following. We obtain the derivative dJ^π/dθ by repeated application of the chain rule: First, we swap the order of differentiating and summing in Eq. (2), and with E_t := E_{x_t}[c(x_t)], we obtain

    dE_t/dθ = (dE_t/dp(x_t)) (dp(x_t)/dθ) := (∂E_t/∂µ_t) (dµ_t/dθ) + (∂E_t/∂Σ_t) (dΣ_t/dθ),    (26)

where we used the shorthand notation dE_t/dp(x_t) := {∂E_t/∂µ_t, ∂E_t/∂Σ_t} for taking the derivative of E_t with respect to both the mean and covariance of x_t. Second, from Sec. 2.2, we know that the predicted mean µ_t and the covariance Σ_t are functionally dependent on the moments of p(x_{t−1}) and the controller parameters θ through u_{t−1}. By applying the chain rule to Eq. (26), we thus obtain

    dp(x_t)/dθ = (∂p(x_t)/∂p(x_{t−1})) (dp(x_{t−1})/dθ) + ∂p(x_t)/∂θ,    (27)
    ∂p(x_t)/∂p(x_{t−1}) = {∂µ_t/∂p(x_{t−1}), ∂Σ_t/∂p(x_{t−1})}.    (28)

From here onward, we focus on dµ_t/dθ, see Eq. (26), but computing dΣ_t/dθ in Eq. (26) is similar. We get

    dµ_t/dθ = (∂µ_t/∂µ_{t−1}) (dµ_{t−1}/dθ) + (∂µ_t/∂Σ_{t−1}) (dΣ_{t−1}/dθ) + ∂µ_t/∂θ.    (29)

Since dp(x_{t−1})/dθ in Eq. (27) is known from time step t−1 and ∂µ_t/∂p(x_{t−1}) is computed by applying the chain rule to Eqs. (14)–(16), we conclude with

    ∂µ_t/∂θ = (∂µ_∆/∂p(u_{t−1})) (∂p(u_{t−1})/∂θ) = (∂µ_∆/∂µ_u) (∂µ_u/∂θ) + (∂µ_∆/∂Σ_u) (∂Σ_u/∂θ).    (30)

The partial derivatives ∂µ_u/∂θ and ∂Σ_u/∂θ, see Eq. (30), depend on the policy parametrization θ.
Algorithm 1 pilco
1: init: Sample controller parameters θ ∼ N(0, I). Apply random control signals and record data.
2: repeat
3:   Learn probabilistic (GP) dynamics model, see Sec. 2.1, using all data.
4:   Model-based policy search, see Secs. 2.2–2.3.
5:   repeat
6:     Approximate inference for policy evaluation, see Sec. 2.2: get J^π(θ), Eqs. (10)–(12), (24).
7:     Gradient-based policy improvement, see Sec. 2.3: get dJ^π(θ)/dθ, Eqs. (26)–(30).
8:     Update parameters θ (e.g., CG or L-BFGS).
9:   until convergence; return θ*
10:  Set π* ← π(θ*).
11:  Apply π* to system (single trial/episode) and record data.
12: until task learned
The individual partial derivatives in Eqs. (26)–(30) can be computed analytically by repeated application of the chain rule to Eqs. (10)–(12), (14)–(16), (20)–(23), and (26)–(30). We omit further lengthy details and refer to (Deisenroth, 2010) for more information.

Analytic derivatives allow for standard gradient-based non-convex optimization methods, e.g., CG or L-BFGS, which return optimized policy parameters θ*. Analytic gradient computation of J^π is much more efficient than estimating policy gradients through sampling: for the latter, the variance in the gradient estimate grows quickly with the number of parameters (Peters & Schaal, 2006).
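Putting Secs. 2.1–2.3 together, Algorithm 1 can be rendered as the following skeleton. It is a sketch under assumptions, not the released code: the environment interface (random_rollout, rollout), the helper train_dynamics_models, and expected_return_and_grad (which would implement Secs. 2.2–2.3 and return J^π(θ) and dJ^π/dθ) are placeholders we introduce here, and scipy's L-BFGS-B stands in for the CG/L-BFGS step of line 8.

```python
import numpy as np
from scipy.optimize import minimize

def pilco_loop(env, n_policy_params, train_dynamics_models,
               expected_return_and_grad, n_random_trials=1, n_iterations=10):
    """Skeleton of Algorithm 1: alternate model learning, analytic policy search,
    and a single trial on the system with the improved policy."""
    theta = np.random.randn(n_policy_params)                       # line 1: θ ~ N(0, I)
    data = [env.random_rollout() for _ in range(n_random_trials)]  # random control signals
    for _ in range(n_iterations):                                  # lines 2-12
        X, U, X_next = (np.vstack(part) for part in zip(*data))    # all data so far
        dynamics = train_dynamics_models(X, U, X_next)             # line 3: GP model (Sec. 2.1)
        # lines 5-9: policy search with J^π(θ) and dJ^π/dθ from Secs. 2.2-2.3
        result = minimize(lambda th: expected_return_and_grad(th, dynamics),
                          theta, jac=True, method='L-BFGS-B')
        theta = result.x                                           # line 10: π* ← π(θ*)
        data.append(env.rollout(theta))                            # line 11: apply π*, record data
    return theta
```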
3. Experimental Results
In this section, we report pilco's success in efficiently learning challenging control tasks, including both standard benchmark problems and high-dimensional control problems. In all cases, pilco learns completely from scratch by following the steps detailed in Alg. 1. The results discussed in the following are typical, i.e., they represent neither best nor worst cases. Videos and further information will be made available at http://mlg.eng.cam.ac.uk/carl/pilco and at http://cs.uw.edu/homes/marc/pilco.
3.1. Cart-Pole Swing-up
Pilco was applied to learning to control a real cart-pole system, see Fig. 3. The system consists of a cart with mass 0.7 kg running on a track and a freely swinging pendulum with mass 0.325 kg attached to the cart. The state of the system is the position of the cart, the velocity of the cart, the angle of the pendulum, and the angular velocity. A horizontal force u ∈ [−10, 10] N could be applied to the cart. The objective was to learn a controller to swing the pendulum up and to balance it in the inverted position in the middle of the track. A linear controller is not capable of doing this (Raiko & Tornio, 2009). The learned state-feedback controller was a nonlinear RBF network, i.e.,

    π(x, θ) = Σ_{i=1}^{n} w_i φ_i(x),    (31)
    φ_i(x) = exp(−½ (x − µ_i)ᵀ Λ⁻¹ (x − µ_i)),    (32)

with n = 50 squared exponential basis functions centered at µ_i. In our experiment, θ = {w_i, Λ, µ_i} ∈ R^305.
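The RBF controller of Eqs. (31)–(32) is a small function of the state; a direct transcription is shown below (our naming, with Λ represented by its diagonal). The commanded force must additionally respect u ∈ [−10, 10] N; how the authors enforce that bound is not specified here, so the clipping in the sketch is illustrative only.

```python
import numpy as np

def rbf_policy(x, weights, centers, lambda_diag, u_max=10.0):
    """π(x) = Σ_i w_i φ_i(x) with φ_i(x) = exp(−½ (x − µ_i)ᵀ Λ⁻¹ (x − µ_i)), Eqs. (31)-(32).

    weights: (n,), centers: (n, D), lambda_diag: (D,) diagonal of Λ.
    The clipping to [−u_max, u_max] is illustrative; the paper only states that the
    admissible force is u ∈ [−10, 10] N."""
    diff = x - centers                                        # (n, D)
    phi = np.exp(-0.5 * np.sum(diff**2 / lambda_diag, axis=1))
    return float(np.clip(weights @ phi, -u_max, u_max))
```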
Pilco successfully learned a sufficiently good dynamics model and a good controller for this standard benchmark problem fully automatically in only a handful of trials and a total experience of 17.5 s. Snapshots of a 20 s test trajectory are shown in Fig. 3.
3.2. Cart-Double-Pendulum Swing-up
In the following, we show the results for pilco learning a dynamics model and a controller for the cart-double-pendulum swing-up. The cart-double-pendulum system consists of a cart (mass 0.5 kg) running on a track and a freely swinging two-link pendulum (each link of mass 0.5 kg) attached to it. The state of the system is the position x_1 and the velocity ẋ_1 of the cart and the angles θ_2, θ_3 and the angular velocities of both attached pendulums. The control signals |u| ≤ 20 N were horizontal forces to the cart. Initially, the system was expected to be in a state x_0 at location x, where both pendulums hung down. The objective was to learn a policy π* to swing the double pendulum up to the inverted position and to balance it with the cart being at the expected start location x. A linear controller is not capable of solving this problem.

A standard control approach to the cart-double-pendulum task is to design two separate controllers, one for the swing-up and one linear controller for the balancing task, see for instance (Zhong & Röck, 2001), requiring prior knowledge about the task's solution. Unlike this engineered solution, pilco fully automatically learned a dynamics model and a single nonlinear RBF controller, see Eq. (31), with n = 200 and θ ∈ R^1816, to jointly solve the swing-up and balancing. For this, pilco required about 20–30 trials, corresponding to an interaction time of about 60 s–90 s.
3.3. Unicycle Riding
We applied pilco to riding a 5-DoF unicycle in a realistic simulation of the one shown in Fig. 4(a).
Figure 3. Real cart-pole system. Snapshots of a controlled trajectory of 20 s length after having learned the task. To solve the swing-up plus balancing, pilco required only 17.5 s of interaction with the physical system.
Figure 4. Robotic unicycle system and simulation results. (a) Robotic unicycle (wheel, frame, flywheel). (b) Histogram (after 1,000 test runs) of the distances d of the flywheel from being upright over time (d ≤ 3 cm, d ∈ (3, 10] cm, d ∈ (10, 50] cm, d > 50 cm). The state space is R^12, the control space R^2.
The unicycle is 0.76 m high and consists of a 1 kg wheel, a 23.5 kg frame, and a 10 kg flywheel mounted perpendicularly to the frame. Two torques could be applied to the unicycle: the first torque |u_w| ≤ 10 Nm was applied directly on the wheel and mimics a human rider using pedals. The torque produced longitudinal and tilt accelerations. Lateral stability of the wheel could be maintained by steering the wheel toward the falling direction of the unicycle and by applying a torque |u_t| ≤ 50 Nm to the flywheel. The dynamics of the robotic unicycle can be described by 12 coupled first-order ODEs, see (Forster, 2009).

The goal was to ride the unicycle, i.e., to prevent it from falling. To solve the balancing task, we used a linear controller π(x, θ) = Ax + b with θ = {A, b} ∈ R^28. The covariance Σ_0 of the initial state was 0.25² I, allowing each angle to be off by about 30° (twice the standard deviation).

Pilco differs from conventional controllers in that it learns a single controller for all control dimensions jointly. Thus, pilco takes the correlation of all control and state dimensions into account during planning and control. Learning separate controllers for each control variable is often unsuccessful (Naveh et al., 1999).

Pilco required about 20 trials (experience of about 30 s) to learn a dynamics model and a controller that keeps the unicycle upright.
Table 1. Pilco's data efficiency scales to high dimensions.

                     cart-pole   cart-double-pole   unicycle
    state space      R^4         R^6                R^12
    # trials         10          20–30              20
    experience       20 s        60 s–90 s          20 s–30 s
    parameter space  R^305       R^1816             R^28
The interaction time is fairly short since a trial was aborted when the turntable hit the ground, which happened quickly during the five random trials used for initialization. Fig. 4(b) shows empirical results after 1,000 test runs with the learned policy: differently-colored bars show the distance of the flywheel from a fully upright position. Depending on the initial configuration of the angles, the unicycle had a transient phase of about a second. After 1.2 s, either the unicycle had fallen or the learned controller had managed to balance it very closely to the desired upright position. The success rate was approximately 93%; bringing the unicycle upright from extreme initial configurations was sometimes impossible due to the torque constraints.
3.4. Data Efficiency
Tab. 1 summarizes the results presented in this paper: for each task, the dimensionality of the state and parameter spaces is listed together with the required number of trials and the corresponding total interaction time. The table shows that pilco can efficiently find good policies even in high dimensions. The generality of this statement depends on both the complexity of the dynamics model and the controller to be learned.

In the following, we compare pilco's data efficiency (required interaction time) to other RL methods that learn the previously discussed tasks from scratch, i.e., without informative prior knowledge. This excludes methods relying on known dynamics models or expert demonstrations.
Fig. 5 shows the interaction time with the cart-pole system required by pilco and by algorithms in the literature that solved this task from scratch (Kimura & Kobayashi, 1999), (Doya, 2000), (Coulom, 2002), (Wawrzynski & Pacut, 2004), (Riedmiller, 2005), (Raiko & Tornio, 2009), (van Hasselt, 2010).
Figure 5. Data efficiency for learning the cart-pole task in the absence of expert knowledge. The horizontal axis chronologically orders the references according to their publication date (KK: Kimura & Kobayashi 1999; D: Doya 2000; C: Coulom 2002; WP: Wawrzynski & Pacut 2004; R: Riedmiller 2005; RT: Raiko & Tornio 2009; vH: van Hasselt 2010; pilco: Deisenroth & Rasmussen 2011). The vertical axis shows the required interaction time with the cart-pole system on a log scale (10¹ s to 10⁵ s).
Dynamics models were only learned by Doya (2000) and Raiko & Tornio (2009), using RBF networks and multi-layered perceptrons, respectively. Note that both the NFQ algorithm by Riedmiller (2005) and the (C)ACLA algorithms by van Hasselt (2010) were applied to balancing the pole without swing-up. In all cases without state-space discretization, cost functions similar to ours were used. Fig. 5 demonstrates pilco's data efficiency, since pilco outperforms any other algorithm by at least one order of magnitude.
We cannot present comparisons for the cart-double-pendulum swing-up or unicycle riding: to the best of our knowledge, fully autonomous learning has not yet succeeded in learning these tasks from scratch.
4. Discussion and Conclusion
Trial-and-error learning leads to some limitations in the discovered policy: pilco is not an optimal control method; it merely finds a solution for the task. There are no guarantees of global optimality: since the optimization problem for learning the policy parameters is not convex, the discovered solution is invariably only locally optimal. It is also conditional on the experience the learning system was exposed to. In particular, the learned dynamics models are only confident in areas of the state space previously observed.

Pilco exploits analytic gradients of an approximation to the expected return J^π for indirect policy search. Obtaining nonzero gradients depends on two factors: the state distributions p(x_1), ..., p(x_T) along a predicted trajectory and the width σ_c of the immediate cost in Eq. (25). If the cost is very peaked, say, a 0-1 cost with 0 being exactly in the target and 1 otherwise, and the dynamics model is poor, i.e., the distributions p(x_1), ..., p(x_T) nowhere cover the target region (implicitly defined through σ_c), pilco obtains gradients with value zero and gets stuck in a local optimum.
Although pilco is relatively robust against the choice of the width σ_c of the cost in Eq. (25), there is no guarantee that pilco always learns with a 0-1 cost. However, we have evidence that pilco can learn with this cost, e.g., pilco could solve the cart-pole task with a cost width σ_c set to 10⁻⁶ m. Hence, pilco's unprecedented data efficiency cannot solely be attributed to any kind of reward shaping.
One of pilco's key benefits is the reduction of model bias by explicitly incorporating model uncertainty into planning and control. Pilco, however, does not take temporal correlation into account. Instead, model uncertainty is treated similarly to uncorrelated noise. This can result in an under-estimation of model uncertainty (Schneider, 1997). On the other hand, the moment-matching approximation used for approximate inference is typically a conservative approximation. Simulation results suggest that the predictive distributions p(x_1), ..., p(x_T) used for policy evaluation are usually not overconfident.
The probabilistic dynamics model was crucial to pilco's learning success: we also applied the pilco framework with a deterministic dynamics model to a simulated cart-pole swing-up. For a fair comparison, we used the posterior mean function of a GP, i.e., only the model uncertainty was discarded. Learning from scratch with this deterministic model was unsuccessful because of the missing representation of model uncertainty: since the initial training set for the dynamics model did not contain states close to the target state, the predictive model was overconfident during planning (see Fig. 1, center). When predictions left the regions close to the training set, the model's extrapolation eventually fell back to the uninformative prior mean function (with zero variance), yielding essentially useless predictions.
We introduced pilco, a practical model-based policy search method using analytic gradients for policy improvement. Pilco advances state-of-the-art RL methods in terms of learning speed by at least an order of magnitude. Key to pilco's success is a principled way of reducing model bias in model learning, long-term planning, and policy learning. Pilco does not rely on expert knowledge, such as demonstrations or task-specific prior knowledge. Nevertheless, pilco allows for unprecedented data-efficient learning from scratch in continuous state and control domains. Demo code will be made publicly available at http://mloss.org.

The results in this paper suggest using probabilistic dynamics models for planning and policy learning to account for model uncertainties in the small-sample case—even if the underlying system is deterministic.
Acknowledgements
We are very grateful to Jan Peters and Drew Bagnell
for valuable suggestions concerning the presentation of
this work. M. Deisenroth has been supported by ONR
MURI grant N00014-09-1-1052 and by Intel Labs.
References
Abbeel, P., Quigley, M., and Ng, A. Y. Using Inaccurate Models in Reinforcement Learning. In Proceedings of the ICML, pp. 1–8, 2006.

Atkeson, C. G. and Santamaría, J. C. A Comparison of Direct and Model-Based Reinforcement Learning. In Proceedings of the ICRA, 1997.

Bagnell, J. A. and Schneider, J. G. Autonomous Helicopter Control using Reinforcement Learning Policy Search Methods. In Proceedings of the ICRA, pp. 1615–1620, 2001.

Coulom, R. Reinforcement Learning Using Neural Networks, with Applications to Motor Control. PhD thesis, Institut National Polytechnique de Grenoble, 2002.

Deisenroth, M. P. Efficient Reinforcement Learning using Gaussian Processes. KIT Scientific Publishing, 2010. ISBN 978-3-86644-569-7.

Deisenroth, M. P., Rasmussen, C. E., and Peters, J. Gaussian Process Dynamic Programming. Neurocomputing, 72(7–9):1508–1524, 2009.

Deisenroth, M. P., Rasmussen, C. E., and Fox, D. Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning. In Proceedings of R:SS, 2011.

Doya, K. Reinforcement Learning in Continuous Time and Space. Neural Computation, 12(1):219–245, 2000. ISSN 0899-7667.

Engel, Y., Mannor, S., and Meir, R. Bayes Meets Bellman: The Gaussian Process Approach to Temporal Difference Learning. In Proceedings of the ICML, pp. 154–161, 2003.

Fabri, S. and Kadirkamanathan, V. Dual Adaptive Control of Nonlinear Stochastic Systems using Neural Networks. Automatica, 34(2):245–253, 1998.

Forster, D. Robotic Unicycle. Report, Department of Engineering, University of Cambridge, UK, 2009.

Kimura, H. and Kobayashi, S. Efficient Non-Linear Control by Combining Q-learning with Local Linear Controllers. In Proceedings of the ICML, pp. 210–219, 1999.

Ko, J., Klein, D. J., Fox, D., and Haehnel, D. Gaussian Processes and Reinforcement Learning for Identification and Control of an Autonomous Blimp. In Proceedings of the ICRA, pp. 742–747, 2007.

Naveh, Y., Bar-Yoseph, P. Z., and Halevi, Y. Nonlinear Modeling and Control of a Unicycle. Journal of Dynamics and Control, 9(4):279–296, 1999.

Peters, J. and Schaal, S. Policy Gradient Methods for Robotics. In Proceedings of the IROS, pp. 2219–2225, 2006.

Quiñonero-Candela, J., Girard, A., Larsen, J., and Rasmussen, C. E. Propagation of Uncertainty in Bayesian Kernel Models—Application to Multiple-Step Ahead Forecasting. In Proceedings of the ICASSP, pp. 701–704, 2003.

Raiko, T. and Tornio, M. Variational Bayesian Learning of Nonlinear Hidden State-Space Models for Model Predictive Control. Neurocomputing, 72(16–18):3702–3712, 2009.

Rasmussen, C. E. and Kuss, M. Gaussian Processes in Reinforcement Learning. In NIPS, pp. 751–759, 2004.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. The MIT Press, 2006.

Riedmiller, M. Neural Fitted Q Iteration—First Experiences with a Data Efficient Neural Reinforcement Learning Method. In Proceedings of the ECML, 2005.

Schaal, S. Learning From Demonstration. In NIPS, pp. 1040–1046, 1997.

Schneider, J. G. Exploiting Model Uncertainty Estimates for Safe Dynamic Control Learning. In NIPS, pp. 1047–1053, 1997.

van Hasselt, H. Insights in Reinforcement Learning. Wöhrmann Print Service, 2010. ISBN 978-90-39354964.

Wawrzynski, P. and Pacut, A. Model-free off-policy Reinforcement Learning in Continuous Environment. In Proceedings of the IJCNN, pp. 1091–1096, 2004.

Wilson, A., Fern, A., and Tadepalli, P. Incorporating Domain Models into Bayesian Optimization for RL. In ECML-PKDD, pp. 467–482, 2010.

Zhong, W. and Röck, H. Energy and Passivity Based Control of the Double Inverted Pendulum on a Cart. In Proceedings of the CCA, pp. 896–901, 2001.
... Some modern data-driven PID tuning methods such as deep learning, RL and genetic optimization have been presented in [9], [10] method by employing RL and deep neural networks has been derived in [11]. Doerr et al. [12] proposed a modelbased RL framework by resorting to one of the famous model-based RL approaches, namely, probabilistic inference for learning control (PILCO) [13], which has data-efficient structure that rely on Bayesian inference with Gaussian process. In particular, authors in [12] obtained optimal PID parameters for a seven degree-of-freedom robot arm balancing an inverted pendulum by considering the PID as an RL agent with augmented state-space and analytically solving error gradients to design the optimal gains. ...
... Model-based RL methods are more promising in terms of data efficiency while learning control policies [13]. One such model-based on-policy RL method which has gain popularity is PILCO [13]. ...
... Model-based RL methods are more promising in terms of data efficiency while learning control policies [13]. One such model-based on-policy RL method which has gain popularity is PILCO [13]. This method uses Bayesian inference with Gaussian process (GP) to learn dynamics model and control policies. ...
Conference Paper
Proportional-Integral-Derivative (PID) controller is widely used across various industrial process control applications because of its straightforward implementation. However, it can be challenging to fine-tune the PID parameters in practice to achieve robust performance. The paper proposes a model-based reinforcement learning (RL) framework to tune PID controllers leveraging the probabilistic inference for learning control (PILCO) method. In particular, an optimal policy given by PILCO is transformed into a set of robust PID tuning parameters for underactuated mechanical systems. The robustness of the devised controller is verified with simulation studies for a benchmark cart-pole system under server disturbances and system parameter uncertainties.
... We extend these ideas to the more general framework of learning latent dynamics models for reinforcement learning (RL). Recent advances in model-based RL have showcased the potential improvements in the performance and sample complexity that can be gained by learning accurate latent dynamics models (Deisenroth & Rasmussen, 2011;Hafner et al., 2019a;Schrittwieser et al., 2020). These models summarize the transitions that the agent experiences by interacting with its environment in a low-dimensional latent code that is learnt alongside their dynamics. ...
... ,Hafner et al. (2019a),Kaiser et al. (2019) andHa & Schmidhuber (2018) learn world models with the purpose of learning policies, either by training the agent within the world model entirely, or by extracting useful latent features of the environment. Dynamics models are also pervasive in planning tasks.Deisenroth & Rasmussen (2011) use Gaussian process regression to learn environment dynamics for planning in a sample efficient manner.Schrittwieser et al. (2020) learn a latent state dynamics model without a reconstruction objective to play chess, shogi and Go using Monte Carlo Tree Search.Hafner et al. (2019b) learn a recurrent state space model, representing laten ...
Preprint
Full-text available
Humans are skillful navigators: We aptly maneuver through new places, realize when we are back at a location we have seen before, and can even conceive of shortcuts that go through parts of our environments we have never visited. Current methods in model-based reinforcement learning on the other hand struggle with generalizing about environment dynamics out of the training distribution. We argue that two principles can help bridge this gap: latent learning and parsimonious dynamics. Humans tend to think about environment dynamics in simple terms -- we reason about trajectories not in reference to what we expect to see along a path, but rather in an abstract latent space, containing information about the places' spatial coordinates. Moreover, we assume that moving around in novel parts of our environment works the same way as in parts we are familiar with. These two principles work together in tandem: it is in the latent space that the dynamics show parsimonious characteristics. We develop a model that learns such parsimonious dynamics. Using a variational objective, our model is trained to reconstruct experienced transitions in a latent space using locally linear transformations, while encouraged to invoke as few distinct transformations as possible. Using our framework, we demonstrate the utility of learning parsimonious latent dynamics models in a range of policy learning and planning tasks.
... A popular one is to start with a Gaussian distribution, pass it through a nonlinear transformation model, approximate the resulting non-Gaussian distribution with a Gaussian one, and then repeat this procedure. When the transition dynamics is expressed as a Gaussian Process (GP) using Gaussian kernel, PILCO [4] derived analytical formulas for propagating the exact mean and covariance of the system states across a time step, thereby converting the stochastic optimization problem into a deterministic one. The analytical gradients were then used to update the policy with gradient descent. ...
... We use OpenAI gym's environment to simulate the scenario and allow for continuous action space. Following [4], the mass of cart and pole is taken as 0.7, 0.325 kg, and the controller is: ...
Preprint
Full-text available
In this paper, we introduce a novel online model-based reinforcement learning algorithm that uses Unscented Transform to propagate uncertainty for the prediction of the future reward. Previous approaches either approximate the state distribution at each step of the prediction horizon with a Gaussian, or perform Monte Carlo simulations to estimate the rewards. Our method, depending on the number of sigma points employed, can propagate either mean and covariance with minimal points, or higher-order moments with more points similarly to Monte Carlo. The whole framework is implemented as a computational graph for online training. Furthermore, in order to prevent explosion in the number of sigma points when propagating through a generic state-dependent uncertainty model, we add sigma-point expansion and contraction layers to our graph, which are designed using the principle of moment matching. Finally, we propose gradient descent inspired by Sequential Quadratic Programming to update policy parameters in the presence of state constraints. We demonstrate the proposed method with two applications in simulation. The first one designs a stabilizing controller for the cart-pole problem when the dynamics is known with state-dependent uncertainty. The second example, following up on our previous work, tunes the parameters of a control barrier function-based Quadratic Programming controller for a leader-follower problem in the presence of input constraints.
... Across a range of continuous control tasks, we demonstrate that ALM achieves higher sample efficiency than prior model-based and model-free RL methods, including on tasks that stymie prior MBRL methods. Because ALM does not require ensembles (Chua et al., 2018;Janner et al., 2019) or decision-time planning (Deisenroth & Rasmussen, 2011;Sikchi et al., 2020;Morgan et al., 2021), our open-source implementation performs updates 10× and 6× faster than MBPO Janner et al. (2019) and REDQ Chen et al. (2021) respectively, and achieves near-optimal returns in about 50% less time than SAC. ...
... Prior model-based RL methods use models in many ways, using it to search for optimal action sequences (Garcia et al., 1989;Springenberg et al., 2020;Hafner et al., 2018;Chua et al., 2018;Hafner et al., 2019;Xie et al., 2020), to generate synthetic data (Sutton, 1991;Luo et al., 2018;Hafner et al., 2019;Janner et al., 2019;Shen et al., 2020), to better estimate the value function (Deisenroth & Rasmussen, 2011;Chua et al., 2018;Buckman et al., 2018;Feinberg et al., 2018), or some combination thereof Hamrick et al., 2020;Hansen et al., 2022). Similar to prior work on stochastic value gradients Hafner et al., 2019;Clavera et al., 2020;, our approach uses model rollouts to estimate the value function for a policy gradient. ...
Preprint
Full-text available
While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample efficient than their model-free counterparts, learning to model raw observations from high dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representation of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL on policy exploration or model guarantees, our bound is directly on the overall RL objective. We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods. While such sample efficient methods typically are computationally demanding, our method attains the performance of SAC in about 50\% less wall-clock time.
... When APG is used for trajectory tracking with a fixedlength horizon, it is highly related to a paradigm termed Backpropagation-Through-Time (BPTT), which has been extensively studied [27], [30], [31], [32], [33], [34]. Bakker et al. for example use Recurrent Neural Networks (RNN) for offline policy learning [35]. ...
Preprint
Full-text available
Control design for robotic systems is complex and often requires solving an optimization to follow a trajectory accurately. Online optimization approaches like Model Predictive Control (MPC) have been shown to achieve great tracking performance, but require high computing power. Conversely, learning-based offline optimization approaches, such as Reinforcement Learning (RL), allow fast and efficient execution on the robot but hardly match the accuracy of MPC in trajectory tracking tasks. In systems with limited compute, such as aerial vehicles, an accurate controller that is efficient at execution time is imperative. We propose an Analytic Policy Gradient (APG) method to tackle this problem. APG exploits the availability of differentiable simulators by training a controller offline with gradient descent on the tracking error. We address training instabilities that frequently occur with APG through curriculum learning and experiment on a widely used controls benchmark, the CartPole, and two common aerial robots, a quadrotor and a fixed-wing drone. Our proposed method outperforms both model-based and model-free RL methods in terms of tracking error. Concurrently, it achieves similar performance to MPC while requiring more than an order of magnitude less computation time. Our work provides insights into the potential of APG as a promising control method for robotics. To facilitate the exploration of APG, we open-source our code and make it available at https://github.com/lis-epfl/apg_trajectory_tracking.
... The predictive equations for a Gaussian test input have been looked at before [38], [39]. Generally, if a Gaussian input is multiplied with the nonlinear GP predictive distribution, the resulting distribution is non-Gaussian, ...
Article
Full-text available
Inspired by the success of control barrier functions (CBFs) in addressing safety, and the rise of data-driven techniques for modeling functions, we propose a non-parametric approach for online synthesis of CBFs using Gaussian Processes (GPs). A dynamical system is defined to be safe if a subset of its states remains within the prescribed set, also called the safe set . CBFs achieve safety by designing a candidate function a priori. However, designing such a function can be challenging. Consider designing a CBF in a disaster recovery scenario where safe and navigable regions need to be determined. The decision boundary for safety here is unknown and cannot be designed a priori. Moreover, CBFs employ a parametric design approach and cannot handle arbitrary changes to the safe set in practice. In our approach, we work with safety samples to construct the CBF online by assuming a flexible GP prior on these samples, and term our formulation as a Gaussian CBF. GPs have favorable properties such as analytical tractability and robust uncertainty estimation. This allows realizing the posterior with high safety guarantees while also computing associated partial derivatives analytically for safe control. Moreover, Gaussian CBFs can change the safe set arbitrarily based on sampled data, thus allowing non-convex safe sets. We validated experimentally on a quadrotor by demonstrating safe control for 1) arbitrary safe sets, 2) collision avoidance with online safe set synthesis, 3) and juxtaposed Gaussian CBFs with CBFs in the presence of noisy states. The experiment video link is: https://youtu.be/HX6uokvCiGk.
... Approaches to learning dynamical models have mainly focused on gradient descent-based methods, with early work on RNNs in the 1990s (Schmidhuber 1990). More recent work includes PILCO (Deisenroth and Rasmussen 2011), which is a probabilistic model-based policy search method and Black-DROPS (Chatzilygeroudis et al. 2017) that employs CMA-ES for data-efficient optimization of complex control problems. Additionally, interest has increased in learning dynamical models directly from high-dimensional images for robotic tasks (Watter et al. 2015;Hafner et al. 2018) and also video games (Guzdial, Li, and Riedl 2017). ...
Article
Deep reinforcement learning approaches have shown impressive results in a variety of different domains, however, more complex heterogeneous architectures such as world models require the different neural components to be trained separately instead of end-to-end. While a simple genetic algorithm recently showed end-to-end training is possible, it failed to solve a more complex 3D task. This paper presents a method called Deep Innovation Protection (DIP) that addresses the credit assignment problem in training complex heterogenous neural network models end-to-end for such environments. The main idea behind the approach is to employ multiobjective optimization to temporally reduce the selection pressure on specific components in multi-component network, allowing other components to adapt. We investigate the emergent representations of these evolved networks, which learn to predict properties important for the survival of the agent, without the need for a specific forward-prediction loss.
... Englert et al. [2013] presented a model-based approach to imitation learning that shares a similar objective to our approach. They modified the model-based policy search algorithm PILCO [Deisenroth and Rasmussen, 2011] such that it minimizes the KL to the distribution of the demonstrator instead of maximizing the reward. While our objective is similar, we obtain a closed form solution for linear feedback controllers with our approach while Englert et al. [2013] obtain a highly non-linear policy by performing a computationally heavy, non-convex optimization. ...
Thesis
Full-text available
Robots had a great impact on the manufacturing industry ever since the early seventies when companies such as KUKA and ABB started deploying their first industrial robots. These robots merely performed very specific tasks in specific ways within well-defined environments. Still, they proved to be very useful as they could exceed human performance at these tasks. However, in order to enable robots to enter our daily life, they need to become more versatile and need to operate in much less structured environments. This thesis is partly devoted to stretching these limitations by means of learning, namely imitation learning (IL) and inverse reinforcement learning (IRL). Reinforcement learning (RL) is a powerful approach to enable robots to solve a task in an unknown environment. The practitioner describes a desired behavior by specifying a reward function and the robot autonomously interacts with the environment in order to find a control policy that generates high accumulated reward. However, RL is not suitable for teaching new tasks by non-experts because specifying appropriate reward functions can be difficult. Demonstrating the desired behavior is often easier for non-experts. Imitation learning can be used in order to enable the robot to reproduce the demonstrations. However, without explicitly inferring and modeling the intentions of the demonstrations, it can become difficult to solve the task for unseen situations. Inverse reinforcement learning (IRL) therefore aims to infer a reward function from the demonstrations, such that optimizing this reward function yields the desired behavior even for different situations. This thesis introduces a unifying approach to solve the inverse reinforcement learning problem in the same way as the reinforcement learning problem. This is achieved by framing both problems as information projection problems, i.e., we strive to minimize the relative entropy between a probabilistic model of the robot behavior and a given desired distribution. Furthermore, a trust region on the robot behavior is used to stabilize the optimization. For inverse reinforcement learning, the desired distribution is implicitly given by the expert demonstrations. The resulting optimization can be efficiently solved using state-of-the-art reinforcement learning methods. For reinforcement learning, the log-likelihood of the desired distribution is given by the reward function. The resulting optimization problem corresponds to a standard reinforcement learning formulation, except for an additional objective of maximizing the entropy of the robot behavior. This entropy objective adds little overhead to the optimization, but can lead to better exploration and more diversified policies. Trust-region I-projections are not only useful for training robots, but can also be applied to other machine learning problems. I-projections are typically used for variational inference, in order to approximate an intractable distribution by a simpler model. However, the resulting optimization problems are usually optimized based on stochastic gradient descent which often suffers from high variance in the gradient estimates. As trust-region I-projections where shown to be effective for reinforcement learning and inverse reinforcement learning, this thesis also explores their use for variational inference. More specifically, trust-region I-projections are investigated for the problem of approximating an intractable distribution by a Gaussian mixture model (GMM) with an adaptive number of components. 
GMMs are highly desirable for variational inference because they can yield arbitrary accurate approximations while inference from GMMs is still relatively cheap. In order to make learning the GMM feasible, we derive a lower bound that enables us to decompose the objective function. The optimization can then be performed by iteratively updating individual components using a technique from reinforcement learning. The resulting method is capable of learning approximations of significantly higher quality than existing variational inference methods. Due to the similarity of the underlying optimization problems, the insights gained from our variational inference method are also useful for IL and IRL. Namely, a similar lower bound can be applied also for the I-projection formulation of imitation learning. However, whereas for variational inference the lower bound serves to decompose the objective function, for imitation learning it allows us to provide a reward signal to the robot that does not depend on its behavior. Compared to reward functions that are relative to the current behavior of the robot---which are typical for popular adversarial methods---behavior-independent reward functions have the advantages that we can show convergence even for greedy optimization. Furthermore, behavior-independent reward functions solve the inverse reinforcement learning problem, thereby closing the gap between imitation learning and IRL. However, algorithms derived from our non-adversarial formulation are actually very similar to existing AIL methods, and we can even show that adversarial inverse reinforcement learning (AIRL) is indeed an instance of our formulation. AIRL was derived from an adversarial formulation, and we point out several problems of that derivation. In contrast, we show that AIRL can be straightforwardly derived from out non-adversarial formulation. Furthermore, we demonstrate that the non-adversarial formulation can be also used to derive novel algorithms by presenting a non-adversarial method for offline imitation learning.
Preprint
While reinforcement learning can achieve impressive results for complex tasks, the learned policies are generally prone to fail in downstream tasks with even minor model mismatch or unexpected perturbations. Recent works have demonstrated that a policy population with diverse behavior characteristics can generalize to downstream environments with various discrepancies. However, such policies might cause catastrophic damage when deployed in practical scenarios like real-world systems, due to the unrestricted behaviors of trained policies. Furthermore, training diverse policies without regulating their behavior can yield too few feasible policies for extrapolating to a wide range of test conditions with dynamics shifts. In this work, we aim to train diverse policies under a regularization of the behavior patterns. We motivate our paradigm by observing the inverse dynamics in the environment with partial state information and propose Diversity in Regulation (DiR), which trains diverse policies with regulated behaviors to discover desired patterns that benefit generalization. Extensive empirical results on numerous variations of different environments indicate that our method attains improvements over other diversity-driven counterparts.
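The abstract does not state the training objective explicitly; purely as an illustration of a diversity-regularized, behavior-constrained population objective of this general kind, one could write (all symbols are our own assumptions, not DiR's):

\[
\max_{\pi_1,\dots,\pi_K}\; \sum_{k=1}^{K} \mathbb{E}_{\pi_k}\!\big[R(\tau)\big] \;+\; \alpha\, D(\pi_1,\dots,\pi_K)
\quad\text{s.t.}\quad \pi_k \in \Pi_{\text{regulated}}\;\;\forall k,
\]

where $D$ measures behavioral diversity across the population, $\alpha$ trades off return against diversity, and the constraint set restricts each policy to regulated behavior patterns.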
Conference Paper
The acquisition and improvement of motor skills and control policies for robotics from trial and error is of essential importance if robots are ever to leave precisely pre-structured environments. However, to date only a few existing reinforcement learning methods have been scaled to the domains of high-dimensional robots such as manipulators, legged robots, or humanoids. Policy gradient methods remain one of the few exceptions and have found a variety of applications. Nevertheless, the application of such methods is not without peril if done in an uninformed manner. In this paper, we give an overview of learning with policy gradient methods for robotics with a strong focus on recent advances in the field. We outline previous applications to robotics and show how the most recently developed methods can significantly improve learning performance. Finally, we evaluate our most promising algorithm in the application of hitting a baseball with an anthropomorphic arm.
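As a concrete illustration of the class of methods surveyed above, here is a minimal sketch of a likelihood-ratio ("REINFORCE"-style) policy gradient estimate for a linear-Gaussian policy; this is a generic textbook construction under our own assumptions, not code from the paper.

import numpy as np

def gaussian_policy_gradient(theta, states, actions, returns, sigma=0.1):
    # Linear-Gaussian policy: a ~ N(theta^T s, sigma^2 I).
    # Likelihood-ratio estimator: average over samples of grad log pi(a|s) times the return.
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        mean = theta.T @ s
        grad_log_pi = np.outer(s, a - mean) / sigma**2   # d log pi(a|s) / d theta
        grad += G * grad_log_pi
    return grad / len(states)

# Policy improvement by gradient ascent (step size is an arbitrary placeholder):
# theta = theta + 1e-3 * gaussian_policy_gradient(theta, states, actions, returns)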
Article
This thesis is a study of practical methods to estimate value functions with feedforward neural networks in model-based reinforcement learning. Focus is placed on problems in continuous time and space, such as motor-control tasks. In this work, the continuous TD(λ) algorithm is refined to handle situations with discontinuous states and controls, and the vario-eta algorithm is proposed as a simple but efficient method to perform gradient descent. The main contributions of this thesis are experimental successes that clearly indicate the potential of feedforward neural networks to estimate high-dimensional value functions. Linear function approximators have often been preferred in reinforcement learning, but successful value-function estimation in previous work has been restricted to mechanical systems with very few degrees of freedom. The method presented in this thesis was tested successfully on an original task of learning to swim with a simulated articulated robot, with 4 control variables and 12 independent state variables, which is significantly more complex than problems that have been solved with linear function approximators so far.
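For reference, a minimal sketch of one semi-gradient TD(λ) update with an eligibility trace, shown for a linear value function for brevity (the thesis itself uses feedforward neural networks); all names and default constants are our own.

import numpy as np

def td_lambda_step(w, e, phi_s, phi_s_next, reward,
                   alpha=0.01, gamma=0.99, lam=0.9):
    # Value estimate V(s) = w^T phi(s); semi-gradient TD(lambda) update.
    delta = reward + gamma * (w @ phi_s_next) - (w @ phi_s)  # TD error
    e = gamma * lam * e + phi_s                              # accumulating eligibility trace
    w = w + alpha * delta * e                                # trace-weighted correction
    return w, e

For a neural-network value function, phi_s is replaced by the gradient of the network output with respect to its weights, so the trace accumulates that gradient instead of the raw features.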
Article
A unicycle system is composed of a unicycle and a rider. This system is inherently unstable, but together with a skilled rider it can be autonomously controlled and stabilized. A dynamical investigation, a control design, and a numerical solution of a nonlinear autonomous unicycle model are presented. The use of a nonlinear model for the control design is shown in this paper to be of great importance. A three-rigid-body physical model was selected for the dynamical study of the system. In a linearized model, important physical characteristics of the unicycle system disappear (e.g., interactions between the longitudinal and lateral dynamics are neglected), and therefore such a model is not recommended for the control design. A nonlinear control law, which replaces the rider in stabilizing the model, was derived in the present work using a nonlinear unicycle model. A simulation study shows good performance of this controller. Time spectral element methods are developed and used for integrating the nonlinear equations of motion. The approach employs the time-discontinuous Galerkin method, which leads to A-stable, high-order-accurate time integration schemes.
Conference Paper
The object of Bayesian modelling is the predictive distribution, which, in a forecasting scenario, enables evaluation of forecasted values and their uncertainties. We focus on reliably estimating the predictive mean and variance of forecasted values using Bayesian kernel-based models such as the Gaussian process and the relevance vector machine. We derive novel analytic expressions for the predictive mean and variance for Gaussian kernel shapes under the assumption of a Gaussian input distribution in the static case, and of a recursive Gaussian predictive density in iterative forecasting. The capability of the method is demonstrated for forecasting of time series and compared to approximate methods.
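To give a flavour of these expressions (in our own notation, for a squared-exponential kernel with signal variance $\sigma_f^2$, length-scale matrix $\Lambda$, training inputs $x_i$, and $\beta = (K + \sigma_n^2 I)^{-1} y$): for an uncertain test input $x_* \sim \mathcal{N}(\mu, \Sigma)$ the predictive mean remains analytic,

\[
\mathbb{E}\big[f(x_*)\big] = \sum_i \beta_i\, q_i,
\qquad
q_i = \sigma_f^2\, \big|\Sigma\Lambda^{-1} + I\big|^{-1/2}
\exp\!\Big(-\tfrac{1}{2}(x_i - \mu)^\top (\Sigma + \Lambda)^{-1} (x_i - \mu)\Big).
\]

Expressions of exactly this type are what make closed-form propagation of uncertain inputs through a GP possible, as exploited in the moment-matching policy evaluation of the main paper.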
Conference Paper
The paper considers the design of a nonlinear controller for the double inverted pendulum (DIP), a system consisting of two inverted pendulums mounted on a cart. The swing-up controller, which brings the pendulums from any initial position to the unstable up-up position, is designed based on passivity properties and energy shaping. While the swing-up controller drives the DIP into a region of attraction around the unstable up-up position, the balance controller, designed on the basis of the linearized model, stabilizes the DIP at the unstable equilibrium. The simulation results show the effectiveness of the proposed nonlinear design method for the DIP system.
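The overall controller architecture described above (energy-based swing-up handing over to a linear balancing controller inside its region of attraction) can be sketched schematically as follows; the switching test, gains, and swing-up law are placeholders of our own, not values from the paper.

import numpy as np

def dip_hybrid_controller(x, K_balance, swingup_law, near_upright):
    # x: state of cart + two pendulums; K_balance: linear gain from the linearized model.
    if near_upright(x):                 # inside (an estimate of) the region of attraction
        return -K_balance @ x           # linear balance control about the up-up equilibrium
    return swingup_law(x)               # passivity/energy-shaping swing-up control

# Example switching test (threshold and state layout are arbitrary placeholders):
# near_upright = lambda x: np.max(np.abs(x[1:3])) < 0.2   # both pendulum angles close to upright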
Article
A suboptimal dual adaptive system is developed for control of stochastic, nonlinear, discrete-time plants that are affine in the control input. The nonlinear functions are assumed to be unknown, and neural networks are used to approximate them. Both Gaussian radial basis function and sigmoidal multilayer perceptron neural networks are considered, and parameter adjustment is based on Kalman filtering. The result is a control law that takes into consideration the uncertainty of the parameter estimates, thereby eliminating the need to perform prior open-loop plant identification. The performance of the system is analyzed by simulation and Monte Carlo analysis.
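Kalman-filter-based parameter adjustment of this kind treats the network weights as the state of a slowly varying (random-walk) system; a minimal sketch for the output weights of an RBF network, which enter the prediction linearly, is given below (all names, noise levels, and the random-walk assumption are our own, not the paper's).

import numpy as np

def kalman_weight_update(w, P, phi, y, r=0.01, q=1e-4):
    # Measurement model: y = phi^T w + noise (variance r); weights follow a random walk (process noise q).
    P = P + q * np.eye(len(w))          # predict step for the weight covariance
    S = phi @ P @ phi + r               # innovation variance (scalar)
    K = P @ phi / S                     # Kalman gain
    w = w + K * (y - phi @ w)           # correct weights with the prediction error
    P = P - np.outer(K, phi) @ P        # posterior weight covariance
    return w, P

The covariance P quantifies the parameter uncertainty that a dual controller can take into account when choosing control inputs.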
Article
This paper studies identification and model predictive control of nonlinear hidden state-space models. Nonlinearities are modelled with neural networks, and system identification is done with variational Bayesian learning. In addition to robustness of control, the stochastic approach allows for various control schemes, including combinations of direct and indirect control, as well as using probabilistic inference for control. We study the noise-robustness, speed, and accuracy of three different control schemes, as well as the effect of changing horizon lengths and initialisation methods, using a simulated cart–pole system. The simulations indicate that the proposed method is able to find a representation of the system state that makes control easier, especially under high noise.
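As a rough, generic illustration of model predictive control with a learned forward model, the sketch below uses simple random shooting over a finite horizon; the paper's own variational Bayesian state-space models and control schemes are not reproduced here, and every name and constant is a placeholder.

import numpy as np

def random_shooting_mpc(state, model_step, reward_fn, horizon=20,
                        n_candidates=500, action_dim=1, rng=None):
    # Return the first action of the best random control sequence under the learned model.
    rng = rng or np.random.default_rng()
    best_return, best_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:
            s = model_step(s, a)        # one-step prediction of the learned dynamics model
            total += reward_fn(s, a)
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action

At each control step only the first action is applied, the state is then observed or re-estimated, and the optimization is repeated, which is the receding-horizon principle underlying model predictive control.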