PILCO: A Model-Based and Data-Efficient Approach to Policy Search

Marc Peter Deisenroth marc@cs.washington.edu
Department of Computer Science & Engineering, University of Washington, USA

Carl Edward Rasmussen cer54@cam.ac.uk
Department of Engineering, University of Cambridge, UK
Abstract
In this paper, we introduce pilco, a practical, data-efficient model-based policy search method. Pilco reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, pilco can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement. We report unprecedented learning efficiency on challenging and high-dimensional control tasks.
1. Introduction and Related Work
To date, reinforcement learning (RL) often suffers from being data inefficient, i.e., RL requires too many trials to learn a particular task. For example, learning one of the simplest RL tasks, the mountain-car, often requires tens if not hundreds or thousands of trials—independent of whether policy iteration, value iteration, or policy search methods are used. Hence, RL methods are often largely inapplicable to mechanical systems that quickly wear out, e.g., low-cost robots.

Increasing data efficiency requires either having informative prior knowledge or extracting more information from available data. In this paper, we do not assume that any expert knowledge is available (e.g., in terms of demonstrations or differential equations for the dynamics). Instead, we elicit a general policy-search framework for data-efficient learning from scratch.
Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).
Generally, model-based methods, i.e., methods that learn a dynamics model of the environment, are more promising to efficiently extract valuable information from available data than model-free methods such as Q-learning or TD-learning. One reason why model-based methods are not widely used in learning from scratch is that they suffer from model bias, i.e., they inherently assume that the learned dynamics model sufficiently accurately resembles the real environment, see, e.g., (Schneider, 1997; Schaal, 1997; Atkeson & Santamaría, 1997). Model bias is especially an issue when only a few samples and no informative prior knowledge about the task to be learned are available.
Fig. 1 illustrates how model bias affects learning. Given a small data set of observed deterministic transitions (left), multiple transition functions plausibly could have generated the data (center). Choosing a single one causes severe consequences: When long-term predictions (or sampled trajectories from this model) leave the training data, the predictions of the function approximator are essentially arbitrary, but they are claimed with full confidence! By contrast, a probabilistic function approximator places a posterior distribution over the transition function (right) and expresses the level of uncertainty about the model.
Hence, for learning from scratch, we first require a probabilistic dynamics model to express model uncertainty. We employ non-parametric probabilistic Gaussian processes (GPs) for this purpose. Second, model uncertainty must be incorporated into planning and policy evaluation. Deterministic approximate inference techniques for policy evaluation allow us to apply policy search based on analytic policy gradients. An explicit value function model is not required. Based on these ideas, we propose a model-based policy search method, which we call pilco (probabilistic inference for learning control). Pilco achieves unprecedented data efficiency in continuous state-action domains and is directly applicable to physical systems, e.g., robots.
Figure 1. Small data set of observed transitions (left), multiple plausible deterministic function approximators (center),
probabilistic function approximator (right). The probabilistic approximator models uncertainty about the latent function.
A common approach in designing adaptive controllers that take uncertainty of the model parameters into account is to add an extra term to the cost function of a minimum-variance controller (Fabri & Kadirkamanathan, 1998). Here, the uncertainty of the model parameters is penalized to improve the model parameter estimation. Abbeel et al. (2006) proposed other successful heuristics to deal with inaccurate models. Based on good-guess parametric dynamics models, locally optimal controllers, and temporal bias terms to account for model discrepancies, very impressive results were obtained. Schneider (1997) and Bagnell & Schneider (2001) proposed to account for model bias by explicitly modeling and averaging over model uncertainty. Pilco builds upon the successful approach by Schneider (1997), where model uncertainty is treated as temporally uncorrelated noise. However, pilco neither requires sampling methods for planning, nor is it restricted to a finite number of plausible models.
Algorithms with GP dynamics models in RL were presented by Rasmussen & Kuss (2004), Ko et al. (2007), and Deisenroth et al. (2009). Shortcomings of these approaches are that the dynamics models are either learned by motor babbling, which is data inefficient, or value function models have to be maintained, which does not scale well to high dimensions. The approaches by Engel et al. (2003) and Wilson et al. (2010) are based on GP value function models and, thus, suffer from the same problems. As an indirect policy search method, pilco does not require an explicit value function model.

An extension of pilco to deal with planning and control under consideration of task-space constraints in a robotic manipulation task is presented in (Deisenroth et al., 2011).
Throughout this paper, we consider dynamic systems

x_t = f(x_{t−1}, u_{t−1})   (1)

with continuous-valued states x ∈ R^D and controls u ∈ R^F and unknown transition dynamics f. The objective is to find a deterministic policy/controller π : x ↦ π(x) = u that minimizes the expected return

J^π(θ) = Σ_{t=0}^{T} E_{x_t}[c(x_t)] ,  x_0 ∼ N(µ_0, Σ_0) ,   (2)

of following π for T steps, where c(x_t) is the cost (negative reward) of being in state x_t at time t. We assume that π is a function parametrized by θ and that c encodes some information about a target state x_target.
2. Model-based Indirect Policy Search

In the following, we detail the key components of the pilco policy-search framework: the dynamics model, analytic approximate policy evaluation, and gradient-based policy improvement.
2.1. Dynamics Model Learning

Pilco's probabilistic dynamics model is implemented as a GP, where we use tuples (x_{t−1}, u_{t−1}) ∈ R^{D+F} as training inputs and differences ∆_t = x_t − x_{t−1} + ε ∈ R^D, ε ∼ N(0, Σ_ε), Σ_ε = diag([σ_{ε_1}, …, σ_{ε_D}]), as training targets. The GP yields one-step predictions

p(x_t | x_{t−1}, u_{t−1}) = N(x_t | µ_t, Σ_t) ,   (3)
µ_t = x_{t−1} + E_f[∆_t] ,   (4)
Σ_t = var_f[∆_t] .   (5)
Throughout this paper, we consider a prior mean function m ≡ 0 and the squared exponential (SE) kernel k with automatic relevance determination. The SE covariance function is defined as

k(x̃, x̃′) = α² exp(−½ (x̃ − x̃′)^⊤ Λ^{−1} (x̃ − x̃′))   (6)

with x̃ := [x^⊤ u^⊤]^⊤. Here, we define α² as the variance of the latent function f and Λ := diag([ℓ²_1, …, ℓ²_D]), which depends on the characteristic length-scales ℓ_i.
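The SE covariance function of Eq. (6) translates directly into code. The following numpy sketch (function name and batch vectorization are ours) evaluates the kernel for batches of inputs, with Λ the diagonal matrix of squared length-scales:

```python
import numpy as np

def se_ard_kernel(X1, X2, alpha2, lengthscales):
    """SE kernel with ARD, Eq. (6):
    k(x, x') = alpha^2 * exp(-0.5 (x - x')^T Lambda^{-1} (x - x')),
    Lambda = diag(lengthscales^2)."""
    # Scale each input dimension by its characteristic length-scale.
    Z1 = X1 / lengthscales          # (n1, d)
    Z2 = X2 / lengthscales          # (n2, d)
    # Squared scaled distances between all pairs of inputs.
    sq = (np.sum(Z1**2, axis=1)[:, None] + np.sum(Z2**2, axis=1)[None, :]
          - 2.0 * Z1 @ Z2.T)
    return alpha2 * np.exp(-0.5 * np.maximum(sq, 0.0))
```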
Given n training inputs X̃ = [x̃_1, …, x̃_n] and corresponding training targets y = [∆_1, …, ∆_n]^⊤, the posterior GP hyper-parameters (length-scales ℓ_i, signal variance α², noise variances Σ_ε) are learned by evidence maximization (Rasmussen & Williams, 2006).
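Assembling the GP training set from a recorded rollout is straightforward; a minimal sketch (assuming one control per transition, so that `controls` has one row fewer than `states`):

```python
import numpy as np

def make_gp_training_set(states, controls):
    """Build GP training data from one rollout.

    states:   (T+1, D) array of visited states x_0, ..., x_T
    controls: (T, F) array of applied controls u_0, ..., u_{T-1}

    Inputs are the tuples (x_{t-1}, u_{t-1}); targets are the
    state differences Delta_t = x_t - x_{t-1}.
    """
    X = np.hstack([states[:-1], controls])  # (T, D+F) training inputs
    y = states[1:] - states[:-1]            # (T, D) training targets
    return X, y
```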
The posterior predictive distribution p(∆_* | x̃_*) for an arbitrary, but known, test input x̃_* is Gaussian with mean and variance

m_f(x̃_*) = E_f[∆_*] = k_*^⊤ (K + σ²_ε I)^{−1} y = k_*^⊤ β ,   (7)
σ²_f(x̃_*) = var_f[∆_*] = k_** − k_*^⊤ (K + σ²_ε I)^{−1} k_* ,   (8)

respectively, where k_* := k(X̃, x̃_*), k_** := k(x̃_*, x̃_*), β := (K + σ²_ε I)^{−1} y, and K is the Gram matrix with entries K_ij = k(x̃_i, x̃_j).
For multivariate targets, we train conditionally independent GPs for each target dimension, i.e., the GPs are independent for deterministically given test inputs. For uncertain inputs, the target dimensions covary.
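For a single target dimension, Eqs. (7)–(8) can be written down almost line by line. The sketch below (fixed hyper-parameters, no evidence maximization; names are ours) is a minimal illustration, not the full pilco implementation:

```python
import numpy as np

def gp_predict(X, y, x_star, alpha2, lengthscales, noise_var):
    """GP posterior mean and variance at a known test input, Eqs. (7)-(8)."""
    def k(A, B):
        # SE kernel with ARD, Eq. (6)
        ZA, ZB = A / lengthscales, B / lengthscales
        sq = (np.sum(ZA**2, axis=1)[:, None] + np.sum(ZB**2, axis=1)[None, :]
              - 2.0 * ZA @ ZB.T)
        return alpha2 * np.exp(-0.5 * np.maximum(sq, 0.0))

    n = len(X)
    K = k(X, X)                                         # Gram matrix
    k_star = k(X, x_star[None, :])[:, 0]                # k_* = k(X, x_*)
    beta = np.linalg.solve(K + noise_var * np.eye(n), y)    # beta of Eq. (7)
    mean = k_star @ beta                                # Eq. (7)
    v = np.linalg.solve(K + noise_var * np.eye(n), k_star)
    var = alpha2 - k_star @ v                           # Eq. (8), k_** = alpha^2
    return mean, var
```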
2.2. Policy Evaluation

Minimizing and evaluating J^π in Eq. (2) requires long-term predictions of the state evolution. To obtain the state distributions p(x_1), …, p(x_T), we cascade one-step predictions, see Eqs. (3)–(5). Doing this properly requires mapping uncertain test inputs through the GP dynamics model. In the following, we assume that these test inputs are Gaussian distributed and extend the results from Quiñonero-Candela et al. (2003) to the multivariate case and the incorporation of controls.

For predicting x_t from p(x_{t−1}), we require a joint distribution p(x_{t−1}, u_{t−1}). As the control u_{t−1} = π(x_{t−1}, θ) is a function of the state, we compute the desired joint as follows: First, we compute the mean µ_u and the covariance Σ_u of the predictive control distribution p(u_{t−1}) by integrating out the state. Subsequently, the cross-covariance cov[x_{t−1}, u_{t−1}] is computed. Finally, we approximate the joint state-control distribution p(x̃_{t−1}) = p(x_{t−1}, u_{t−1}) by a Gaussian with the correct mean and covariance. These computations depend on the parametrization of the policy π. For many interesting controller parametrizations, the required computations can be performed analytically, although often neither p(u_{t−1}) nor p(x_{t−1}, u_{t−1}) is exactly Gaussian (Deisenroth, 2010).
From now on, we assume a joint Gaussian distribution p(x̃_{t−1}) = N(x̃_{t−1} | µ̃_{t−1}, Σ̃_{t−1}) at time t − 1. When predicting the distribution

p(∆_t) = ∫ p(f(x̃_{t−1}) | x̃_{t−1}) p(x̃_{t−1}) dx̃_{t−1} ,   (9)

we integrate out the random variable x̃_{t−1}. Note that the transition probability p(f(x̃_{t−1}) | x̃_{t−1}) is obtained from the posterior GP distribution. Computing the exact predictive distribution in Eq. (9) is analytically intractable. Therefore, we approximate p(∆_t) by a Gaussian using exact moment matching, see Fig. 2.
Figure 2. GP prediction at an uncertain input. The input distribution p(x_{t−1}, u_{t−1}) is assumed Gaussian (lower right panel). When propagating it through the GP model (upper right panel), we obtain the shaded distribution p(∆_t), upper left panel. We approximate p(∆_t) by a Gaussian with the exact mean and variance (upper left panel).
For the time being, assume the mean µ_∆ and the covariance Σ_∆ of the predictive distribution p(∆_t) are known. Then, a Gaussian approximation to the desired distribution p(x_t) is given as N(x_t | µ_t, Σ_t) with

µ_t = µ_{t−1} + µ_∆ ,   (10)
Σ_t = Σ_{t−1} + Σ_∆ + cov[x_{t−1}, ∆_t] + cov[∆_t, x_{t−1}] ,   (11)
cov[x_{t−1}, ∆_t] = cov[x_{t−1}, u_{t−1}] Σ_u^{−1} cov[u_{t−1}, ∆_t] ,   (12)

where the computation of the cross-covariances in Eq. (12) depends on the policy parametrization, but can often be computed analytically. The computation of the cross-covariance cov[x_{t−1}, ∆_t] in Eq. (11) is detailed by Deisenroth (2010).

In the following, we compute the mean µ_∆ and the variance Σ_∆ of the predictive distribution, see Eq. (9).
2.2.1. Mean Prediction

Following the law of iterated expectations, for target dimensions a = 1, …, D, we obtain

µ^a_∆ = E_{x̃_{t−1}}[E_f[f(x̃_{t−1}) | x̃_{t−1}]] = E_{x̃_{t−1}}[m_f(x̃_{t−1})]
      = ∫ m_f(x̃_{t−1}) N(x̃_{t−1} | µ̃_{t−1}, Σ̃_{t−1}) dx̃_{t−1}   (13)
      = β_a^⊤ q_a   (14)

with β_a = (K_a + σ²_{ε_a} I)^{−1} y_a and q_a = [q_{a_1}, …, q_{a_n}]^⊤. With m_f given in Eq. (7), the entries of q_a ∈ R^n are

q_{a_i} = ∫ k_a(x̃_i, x̃_{t−1}) N(x̃_{t−1} | µ̃_{t−1}, Σ̃_{t−1}) dx̃_{t−1}
        = α²_a |Σ̃_{t−1} Λ_a^{−1} + I|^{−1/2} exp(−½ ν_i^⊤ (Σ̃_{t−1} + Λ_a)^{−1} ν_i) ,   (15)
ν_i := x̃_i − µ̃_{t−1} .   (16)

Here, ν_i in Eq. (16) is the difference between the training input x̃_i and the mean of the “test” input distribution p(x_{t−1}, u_{t−1}).
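Eqs. (14)–(16) admit a direct implementation. The sketch below (names are ours) computes µ^a_∆ for one target dimension, given the moments of the Gaussian test-input distribution:

```python
import numpy as np

def mean_prediction(X, beta_a, alpha2_a, Lambda_a, mu_tilde, Sigma_tilde):
    """Predictive mean for one target dimension under a Gaussian
    test-input distribution N(mu_tilde, Sigma_tilde), Eqs. (14)-(16):
    mu_Delta^a = beta_a^T q_a."""
    d = X.shape[1]
    nu = X - mu_tilde                       # Eq. (16): nu_i = x_i - mu_tilde
    # determinant factor |Sigma_tilde Lambda_a^{-1} + I|^{-1/2} of Eq. (15)
    det = np.linalg.det(Sigma_tilde @ np.linalg.inv(Lambda_a) + np.eye(d))
    # quadratic forms nu_i^T (Sigma_tilde + Lambda_a)^{-1} nu_i for all i
    S_inv = np.linalg.inv(Sigma_tilde + Lambda_a)
    quad = np.einsum('ij,jk,ik->i', nu, S_inv, nu)
    q_a = alpha2_a / np.sqrt(det) * np.exp(-0.5 * quad)   # Eq. (15)
    return beta_a @ q_a                                   # Eq. (14)
```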
2.2.2. Covariance Matrix of the Prediction

To compute the predictive covariance matrix Σ_∆ ∈ R^{D×D}, we distinguish between diagonal and off-diagonal elements: Using the law of iterated variances, we obtain for target dimensions a, b = 1, …, D

σ²_{aa} = E_{x̃_{t−1}}[var_f[∆_a | x̃_{t−1}]] + E_{f, x̃_{t−1}}[∆²_a] − (µ^a_∆)² ,   (17)
σ²_{ab} = E_{f, x̃_{t−1}}[∆_a ∆_b] − µ^a_∆ µ^b_∆ ,  a ≠ b ,   (18)

respectively, where µ^a_∆ is known from Eq. (14). The off-diagonal terms do not contain the additional term E_{x̃_{t−1}}[cov_f[∆_a, ∆_b | x̃_{t−1}]] because of the conditional independence assumption of the GP models: Different target dimensions do not covary for given x̃_{t−1}.

First, we compute the terms that are common to both the diagonal and off-diagonal entries: With the Gaussian approximation N(x̃_{t−1} | µ̃_{t−1}, Σ̃_{t−1}) of p(x̃_{t−1}) and the law of iterated expectations, we obtain

E_{f, x̃_{t−1}}[∆_a ∆_b] = E_{x̃_{t−1}}[E_f[∆_a | x̃_{t−1}] E_f[∆_b | x̃_{t−1}]]
  = ∫ m^a_f(x̃_{t−1}) m^b_f(x̃_{t−1}) p(x̃_{t−1}) dx̃_{t−1}   (19)

due to the conditional independence of ∆_a and ∆_b given x̃_{t−1}. Using now the definition of the mean function m_f in Eq. (7), we obtain

E_{f, x̃_{t−1}}[∆_a ∆_b] = β_a^⊤ Q β_b ,   (20)
Q := ∫ k_a(X̃, x̃_{t−1}) k_b(X̃, x̃_{t−1})^⊤ p(x̃_{t−1}) dx̃_{t−1} .   (21)

Using standard results from Gaussian multiplications and integration, we obtain the entries Q_ij of Q ∈ R^{n×n}:

Q_ij = k_a(x̃_i, µ̃_{t−1}) k_b(x̃_j, µ̃_{t−1}) |R|^{−1/2} exp(½ z_ij^⊤ R^{−1} Σ̃_{t−1} z_ij) ,   (22)

where we defined R := Σ̃_{t−1}(Λ_a^{−1} + Λ_b^{−1}) + I and z_ij := Λ_a^{−1} ν_i + Λ_b^{−1} ν_j with ν_i taken from Eq. (16). Hence, the off-diagonal entries of Σ_∆ are fully determined by Eqs. (14)–(16), (18), and (20)–(22).

From Eq. (17), we see that the diagonal entries of Σ_∆ contain an additional term

E_{x̃_{t−1}}[var_f[∆_a | x̃_{t−1}]] = α²_a − tr((K_a + σ²_{ε_a} I)^{−1} Q)   (23)

with Q given in Eq. (22). This term is the expected variance of the latent function (see Eq. (8)) under the distribution of x̃_{t−1}.
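Eq. (22) is the computational core of the covariance computation. A direct, unoptimized sketch for two output dimensions a and b (function and variable names are ours):

```python
import numpy as np

def Q_matrix(X, mu_tilde, Sigma_tilde, alpha2_a, Lambda_a, alpha2_b, Lambda_b):
    """Entries Q_ij of Eq. (22), needed in E[Delta_a Delta_b] = beta_a^T Q beta_b."""
    n, d = X.shape
    nu = X - mu_tilde                               # nu_i of Eq. (16)
    iLa, iLb = np.linalg.inv(Lambda_a), np.linalg.inv(Lambda_b)
    R = Sigma_tilde @ (iLa + iLb) + np.eye(d)       # R of Eq. (22)
    iR_Sigma = np.linalg.solve(R, Sigma_tilde)      # R^{-1} Sigma_tilde
    sqrt_detR = np.sqrt(np.linalg.det(R))

    def k(alpha2, iL, b):
        # SE kernel values k(x_i, b) for all training inputs x_i
        diff = X - b
        return alpha2 * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, iL, diff))

    ka = k(alpha2_a, iLa, mu_tilde)                 # k_a(x_i, mu_tilde)
    kb = k(alpha2_b, iLb, mu_tilde)                 # k_b(x_j, mu_tilde)
    Q = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            z = iLa @ nu[i] + iLb @ nu[j]           # z_ij of Eq. (22)
            Q[i, j] = ka[i] * kb[j] / sqrt_detR * np.exp(0.5 * z @ iR_Sigma @ z)
    return Q
```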
With the Gaussian approximation N(∆_t | µ_∆, Σ_∆) of p(∆_t), we obtain a Gaussian approximation N(x_t | µ_t, Σ_t) of p(x_t) through Eqs. (10)–(12).
To evaluate the expected return J^π in Eq. (2), it remains to compute the expected values

E_{x_t}[c(x_t)] = ∫ c(x_t) N(x_t | µ_t, Σ_t) dx_t ,   (24)

t = 0, …, T, of the cost c with respect to the predictive state distributions. We assume that the cost c is chosen so that Eq. (24) can be solved analytically, e.g., polynomials. In this paper, we use

c(x) = 1 − exp(−‖x − x_target‖² / σ²_c) ∈ [0, 1] ,   (25)

which is a squared exponential subtracted from unity. In Eq. (25), x_target is the target state and σ²_c controls the width of c. This unimodal cost can be considered a smooth approximation of a 0-1 cost of a target area.
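Since Eq. (25) is an unnormalized Gaussian subtracted from one, the expectation in Eq. (24) has a closed form via a standard Gaussian integral. The derivation and names in the sketch below are ours: writing c(x) = 1 − exp(−½ (x − x*)^⊤ W (x − x*)) with W = (2/σ²_c) I, one obtains E[c] = 1 − |I + ΣW|^{−1/2} exp(−½ (µ − x*)^⊤ (Σ + W^{−1})^{−1} (µ − x*)).

```python
import numpy as np

def expected_saturating_cost(mu, Sigma, x_target, sigma_c2):
    """Closed-form E[c(x)] for the cost of Eq. (25) under x ~ N(mu, Sigma)."""
    d = len(mu)
    W = (2.0 / sigma_c2) * np.eye(d)        # c(x) = 1 - exp(-0.5 m^T W m)
    m = mu - x_target
    S1 = np.eye(d) + Sigma @ W
    quad = m @ np.linalg.solve(Sigma + np.linalg.inv(W), m)
    return 1.0 - np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(S1))
```

For Σ → 0 this reduces to the deterministic cost c(µ), as expected.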
2.3. Analytic Gradients for Policy Improvement

Both µ_t and Σ_t are functionally dependent on the mean µ_u and the covariance Σ_u of the control signal (and θ) through µ̃_{t−1} and Σ̃_{t−1}, respectively, see Eqs. (15), (16), and (22), for instance. Hence, we can analytically compute the gradients of the expected return J^π with respect to the policy parameters θ, which we sketch in the following. We obtain the derivative dJ^π/dθ by repeated application of the chain-rule: First, we swap the order of differentiating and summing in Eq. (2), and with E_t := E_{x_t}[c(x_t)], we obtain

dE_t/dθ = (dE_t/dp(x_t)) (dp(x_t)/dθ) := (∂E_t/∂µ_t)(dµ_t/dθ) + (∂E_t/∂Σ_t)(dΣ_t/dθ) ,   (26)

where we used the shorthand notation dE_t/dp(x_t) := {∂E_t/∂µ_t, ∂E_t/∂Σ_t} for taking the derivative of E_t with respect to both the mean and covariance of x_t. Second, from Sec. 2.2, we know that the predicted mean µ_t and the covariance Σ_t are functionally dependent on the moments of p(x_{t−1}) and the controller parameters θ through u_{t−1}. By applying the chain-rule to Eq. (26), we thus obtain

dp(x_t)/dθ = (∂p(x_t)/∂p(x_{t−1})) (dp(x_{t−1})/dθ) + ∂p(x_t)/∂θ ,   (27)
∂p(x_t)/∂p(x_{t−1}) = {∂µ_t/∂p(x_{t−1}), ∂Σ_t/∂p(x_{t−1})} .   (28)

From here onward, we focus on dµ_t/dθ, see Eq. (26), but computing dΣ_t/dθ in Eq. (26) is similar. We get

dµ_t/dθ = (∂µ_t/∂µ_{t−1})(dµ_{t−1}/dθ) + (∂µ_t/∂Σ_{t−1})(dΣ_{t−1}/dθ) + ∂µ_t/∂θ .   (29)

Since dp(x_{t−1})/dθ in Eq. (27) is known from time step t − 1 and ∂µ_t/∂p(x_{t−1}) is computed by applying the chain-rule to Eqs. (14)–(16), we conclude with

∂µ_t/∂θ = (∂µ_∆/∂p(u_{t−1})) (∂p(u_{t−1})/∂θ) = (∂µ_∆/∂µ_u)(∂µ_u/∂θ) + (∂µ_∆/∂Σ_u)(∂Σ_u/∂θ) .   (30)

The partial derivatives ∂µ_u/∂θ and ∂Σ_u/∂θ, see Eq. (30), depend on the policy parametrization θ. The
Algorithm 1 pilco
1: init: Sample controller parameters θ ∼ N(0, I). Apply random control signals and record data.
2: repeat
3:   Learn probabilistic (GP) dynamics model, see Sec. 2.1, using all data.
4:   Model-based policy search, see Sec. 2.2–2.3.
5:   repeat
6:     Approximate inference for policy evaluation, see Sec. 2.2: get J^π(θ), Eqs. (10)–(12), (24).
7:     Gradient-based policy improvement, see Sec. 2.3: get dJ^π(θ)/dθ, Eqs. (26)–(30).
8:     Update parameters θ (e.g., CG or L-BFGS).
9:   until convergence; return θ*
10:  Set π* ← π(θ*).
11:  Apply π* to system (single trial/episode) and record data.
12: until task learned
individual partial derivatives in Eqs. (26)–(30) can be computed analytically by repeated application of the chain-rule to Eqs. (10)–(12), (14)–(16), (20)–(23), and (26)–(30). We omit further lengthy details and refer to (Deisenroth, 2010) for more information.

Analytic derivatives allow for standard gradient-based non-convex optimization methods, e.g., CG or L-BFGS, which return optimized policy parameters θ*. Analytic gradient computation of J^π is much more efficient than estimating policy gradients through sampling: For the latter, the variance in the gradient estimate grows quickly with the number of parameters (Peters & Schaal, 2006).
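The overall control flow of Alg. 1 can be sketched in a few lines. Everything below is a toy stand-in (a scripted 1-D plant, a one-step surrogate in place of the GP model with moment matching, random search in place of CG/L-BFGS); only the structure of the loop is meant to mirror the algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, T=20):
    """Line 11: apply the current policy for one episode and record data."""
    x, data = 1.0, []
    for _ in range(T):
        u = theta[0] * x + theta[1]                   # linear toy policy
        x_next = 0.9 * x + 0.1 * u + 0.01 * rng.standard_normal()
        data.append((x, u, x_next))
        x = x_next
    return data

def surrogate_return(theta, data):
    """Lines 3 and 6: stand-in for model learning plus analytic policy
    evaluation; here a one-step squared-state cost on recorded states."""
    return sum((0.9 * x + 0.1 * (theta[0] * x + theta[1])) ** 2
               for x, _, _ in data) / len(data)

def pilco_loop(n_episodes=5, n_search=50):
    theta = rng.standard_normal(2)                    # line 1: random init
    data = rollout(theta)                             # line 1: record data
    for _ in range(n_episodes):                       # line 2: repeat
        for _ in range(n_search):                     # lines 5-9: policy search
            cand = theta + 0.1 * rng.standard_normal(2)
            if surrogate_return(cand, data) < surrogate_return(theta, data):
                theta = cand                          # line 8: update theta
        data += rollout(theta)                        # line 11: new episode
    return theta
```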
3. Experimental Results

In this section, we report pilco's success in efficiently learning challenging control tasks, including both standard benchmark problems and high-dimensional control problems. In all cases, pilco learns completely from scratch by following the steps detailed in Alg. 1. The results discussed in the following are typical, i.e., they represent neither best nor worst cases. Videos and further information will be made available at http://mlg.eng.cam.ac.uk/carl/pilco and at http://cs.uw.edu/homes/marc/pilco.
3.1. Cart-Pole Swing-up

Pilco was applied to learning to control a real cart-pole system, see Fig. 3. The system consists of a cart with mass 0.7 kg running on a track and a freely swinging pendulum with mass 0.325 kg attached to the cart. The state of the system is the position of the cart, the velocity of the cart, the angle of the pendulum, and the angular velocity. A horizontal force u ∈ [−10, 10] N could be applied to the cart. The objective was to learn a controller to swing the pendulum up and to balance it in the inverted position in the middle of the track. A linear controller is not capable of doing this (Raiko & Tornio, 2009). The learned state-feedback controller was a nonlinear RBF network, i.e.,

π(x, θ) = Σ_{i=1}^{n} w_i φ_i(x) ,   (31)
φ_i(x) = exp(−½ (x − µ_i)^⊤ Λ^{−1} (x − µ_i))   (32)

with n = 50 squared exponential basis functions centered at µ_i. In our experiment, θ = {w_i, Λ, µ_i} ∈ R^305.
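Eqs. (31)–(32) describe a plain RBF network. A minimal sketch of the controller evaluation (in the experiments, the parameters θ = {w_i, Λ, µ_i} are the optimization variables; the function name is ours):

```python
import numpy as np

def rbf_policy(x, centers, weights, Lambda):
    """RBF-network controller of Eqs. (31)-(32):
    pi(x) = sum_i w_i * exp(-0.5 (x - mu_i)^T Lambda^{-1} (x - mu_i))."""
    diff = centers - x                              # (n, D) differences x - mu_i
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Lambda), diff)
    return weights @ np.exp(-0.5 * quad)            # scalar control signal
```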
Pilco successfully learned a sufficiently good dynamics model and a good controller for this standard benchmark problem fully automatically in only a handful of trials and a total experience of 17.5 s. Snapshots of a 20 s test trajectory are shown in Fig. 3.
3.2. Cart-Double-Pendulum Swing-up

In the following, we show the results for pilco learning a dynamics model and a controller for the cart-double-pendulum swing-up. The cart-double-pendulum system consists of a cart (mass 0.5 kg) running on a track and a freely swinging two-link pendulum (each link of mass 0.5 kg) attached to it. The state of the system is the position x_1 and the velocity ẋ_1 of the cart and the angles θ_2, θ_3 and the angular velocities of both attached pendulums. The control signals |u| ≤ 20 N were horizontal forces to the cart. Initially, the system was expected to be in a state x_0 at location x, where both pendulums hung down. The objective was to learn a policy π* to swing the double pendulum up to the inverted position and to balance it with the cart being at the expected start location x. A linear controller is not capable of solving this problem.

A standard control approach to solving the cart-double-pendulum task is to design two separate controllers, one for the swing-up and one linear controller for the balancing task, see for instance (Zhong & Röck, 2001), requiring prior knowledge about the task's solution. Unlike this engineered solution, pilco fully automatically learned a dynamics model and a single nonlinear RBF controller, see Eq. (31), with n = 200 and θ ∈ R^1816 to jointly solve the swing-up and balancing. For this, pilco required about 20–30 trials, corresponding to an interaction time of about 60 s–90 s.
3.3. Unicycle Riding

We applied pilco to riding a 5-DoF unicycle in a realistic simulation of the one shown in Fig. 4(a). The
Figure 3. Real cart-pole system. Snapshots of a controlled trajectory of 20 s length after having learned the task. To solve the swing-up plus balancing, pilco required only 17.5 s of interaction with the physical system.
Figure 4. Robotic unicycle system and simulation results. (a) Robotic unicycle, consisting of a wheel, a frame, and a flywheel. (b) Histogram (after 1,000 test runs) of the distances of the flywheel from being upright, with bins d ≤ 3 cm, d ∈ (3, 10] cm, d ∈ (10, 50] cm, and d > 50 cm. The state space is R^12, the control space R^2.
unicycle is 0.76 m high and consists of a 1 kg wheel, a 23.5 kg frame, and a 10 kg flywheel mounted perpendicularly to the frame. Two torques could be applied to the unicycle: The first torque |u_w| ≤ 10 Nm was applied directly on the wheel and mimics a human rider using pedals. This torque produced longitudinal and tilt accelerations. Lateral stability of the wheel could be maintained by steering the wheel toward the falling direction of the unicycle and by applying a torque |u_t| ≤ 50 Nm to the flywheel. The dynamics of the robotic unicycle can be described by 12 coupled first-order ODEs, see (Forster, 2009).

The goal was to ride the unicycle, i.e., to prevent it from falling. To solve the balancing task, we used a linear controller π(x, θ) = Ax + b with θ = {A, b} ∈ R^28. The covariance Σ_0 of the initial state was 0.25² I, allowing each angle to be off by about 30° (twice the standard deviation).

Pilco differs from conventional controllers in that it learns a single controller for all control dimensions jointly. Thus, pilco takes the correlation of all control and state dimensions into account during planning and control. Learning separate controllers for each control variable is often unsuccessful (Naveh et al., 1999).
Pilco required about 20 trials (experience of about 30 s) to learn a dynamics model and a controller that keeps the unicycle upright. The interaction time

Table 1. Pilco's data efficiency scales to high dimensions.

                     cart-pole   cart-double-pole   unicycle
  state space        R^4         R^6                R^12
  # trials           ≤ 10        20–30              ≈ 20
  experience         ≈ 20 s      ≈ 60 s–90 s        ≈ 20 s–30 s
  parameter space    R^305       R^1816             R^28
is fairly short since a trial was aborted when the turntable hit the ground, which happened quickly during the five random trials used for initialization. Fig. 4(b) shows empirical results after 1,000 test runs with the learned policy: Differently-colored bars show the distance of the flywheel from a fully upright position. Depending on the initial configuration of the angles, the unicycle had a transient phase of about a second. After 1.2 s, either the unicycle had fallen or the learned controller had managed to balance it very closely to the desired upright position. The success rate was approximately 93%; bringing the unicycle upright from extreme initial configurations was sometimes impossible due to the torque constraints.
3.4. Data Efficiency

Tab. 1 summarizes the results presented in this paper: For each task, the dimensionality of the state and parameter spaces is listed together with the required number of trials and the corresponding total interaction time. The table shows that pilco can efficiently find good policies even in high dimensions. The generality of this statement depends on both the complexity of the dynamics model and the controller to be learned.

In the following, we compare pilco's data efficiency (required interaction time) to other RL methods that learn the previously discussed tasks from scratch, i.e., without informative prior knowledge. This excludes methods relying on known dynamics models or expert demonstrations.

Fig. 5 shows the interaction time with the cart-pole system required by pilco and algorithms in the literature that solved this task from scratch (Kimura & Kobayashi, 1999), (Doya, 2000), (Coulom, 2002), (Wawrzynski & Pacut, 2004), (Riedmiller, 2005), (Raiko & Tornio, 2009), (van Hasselt, 2010). Dynamics models were only learned by Doya (2000) and Raiko & Tornio (2009), using RBF networks and multi-layered perceptrons, respectively. Note that both the NFQ algorithm by Riedmiller (2005) and the (C)ACLA algorithms by van Hasselt (2010) were applied to balancing the pole without swing-up. In all cases without state-space discretization, cost functions similar to ours were used. Fig. 5 demonstrates pilco's data efficiency since pilco outperforms any other algorithm by at least one order of magnitude.

Figure 5. Data efficiency for learning the cart-pole task in the absence of expert knowledge (KK: Kimura & Kobayashi 1999; D: Doya 2000; C: Coulom 2002; WP: Wawrzynski & Pacut 2004; R: Riedmiller 2005; RT: Raiko & Tornio 2009; vH: van Hasselt 2010; pilco: Deisenroth & Rasmussen 2011). The horizontal axis chronologically orders the references according to their publication date. The vertical axis shows the required interaction time with the cart-pole system on a log-scale.

We cannot present comparisons for the cart-double-pendulum swing-up or unicycle riding: To the best of our knowledge, fully autonomous learning has not yet succeeded in learning these tasks from scratch.
4. Discussion and Conclusion

Trial-and-error learning leads to some limitations in the discovered policy: Pilco is not an optimal control method; it merely finds a solution for the task. There are no guarantees of global optimality: Since the optimization problem for learning the policy parameters is not convex, the discovered solution is invariably only locally optimal. It is also conditional on the experience the learning system was exposed to. In particular, the learned dynamics models are only confident in areas of the state space previously observed.
Pilco exploits analytic gradients of an approximation to the expected return J^π for indirect policy search. Obtaining nonzero gradients depends on two factors: the state distributions p(x_1), …, p(x_T) along a predicted trajectory and the width σ_c of the immediate cost in Eq. (25). If the cost is very peaked, say, a 0-1 cost with 0 being exactly in the target and 1 otherwise, and the dynamics model is poor, i.e., the distributions p(x_1), …, p(x_T) nowhere cover the target region (implicitly defined through σ_c), pilco obtains gradients with value zero and gets stuck in a local optimum. Although pilco is relatively robust against the choice of the width σ_c of the cost in Eq. (25), there is no guarantee that pilco always learns with a 0-1 cost. However, we have evidence that pilco can learn with this cost, e.g., pilco could solve the cart-pole task with a cost width of σ_c = 10^−6 m. Hence, pilco's unprecedented data efficiency cannot solely be attributed to any kind of reward shaping.
One of pilco's key benefits is the reduction of model bias by explicitly incorporating model uncertainty into planning and control. Pilco, however, does not take temporal correlation into account. Instead, model uncertainty is treated similarly to uncorrelated noise. This can result in an underestimation of model uncertainty (Schneider, 1997). On the other hand, the moment-matching approximation used for approximate inference is typically a conservative approximation. Simulation results suggest that the predictive distributions p(x_1), …, p(x_T) used for policy evaluation are usually not overconfident.
The probabilistic dynamics model was crucial to pilco's learning success: We also applied the pilco framework with a deterministic dynamics model to a simulated cart-pole swing-up. For a fair comparison, we used the posterior mean function of a GP, i.e., only the model uncertainty was discarded. Learning from scratch with this deterministic model was unsuccessful because of the missing representation of model uncertainty: Since the initial training set for the dynamics model did not contain states close to the target state, the predictive model was overconfident during planning (see Fig. 1, center). When predictions left the regions close to the training set, the model's extrapolation eventually fell back to the uninformative prior mean function (with zero variance), yielding essentially useless predictions.

We introduced pilco, a practical model-based policy search method using analytic gradients for policy improvement. Pilco advances state-of-the-art RL methods in terms of learning speed by at least an order of magnitude. Key to pilco's success is a principled way of reducing model bias in model learning, long-term planning, and policy learning. Pilco does not rely on expert knowledge, such as demonstrations or task-specific prior knowledge. Nevertheless, pilco allows for unprecedented data-efficient learning from scratch in continuous state and control domains. Demo code will be made publicly available at http://mloss.org.

The results in this paper suggest using probabilistic dynamics models for planning and policy learning to account for model uncertainties in the small-sample case—even if the underlying system is deterministic.
Acknowledgements

We are very grateful to Jan Peters and Drew Bagnell for valuable suggestions concerning the presentation of this work. M. Deisenroth has been supported by ONR MURI grant N000140911052 and by Intel Labs.
References

Abbeel, P., Quigley, M., and Ng, A. Y. Using Inaccurate Models in Reinforcement Learning. In Proceedings of the ICML, pp. 1–8, 2006.

Atkeson, C. G. and Santamaría, J. C. A Comparison of Direct and Model-Based Reinforcement Learning. In Proceedings of the ICRA, 1997.

Bagnell, J. A. and Schneider, J. G. Autonomous Helicopter Control using Reinforcement Learning Policy Search Methods. In Proceedings of the ICRA, pp. 1615–1620, 2001.

Coulom, R. Reinforcement Learning Using Neural Networks, with Applications to Motor Control. PhD thesis, Institut National Polytechnique de Grenoble, 2002.

Deisenroth, M. P. Efficient Reinforcement Learning using Gaussian Processes. KIT Scientific Publishing, 2010. ISBN 978-3866445697.

Deisenroth, M. P., Rasmussen, C. E., and Peters, J. Gaussian Process Dynamic Programming. Neurocomputing, 72(7–9):1508–1524, 2009.

Deisenroth, M. P., Rasmussen, C. E., and Fox, D. Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning. In Proceedings of R:SS, 2011.

Doya, K. Reinforcement Learning in Continuous Time and Space. Neural Computation, 12(1):219–245, 2000. ISSN 0899-7667.

Engel, Y., Mannor, S., and Meir, R. Bayes Meets Bellman: The Gaussian Process Approach to Temporal Difference Learning. In Proceedings of the ICML, pp. 154–161, 2003.

Fabri, S. and Kadirkamanathan, V. Dual Adaptive Control of Nonlinear Stochastic Systems using Neural Networks. Automatica, 34(2):245–253, 1998.

Forster, D. Robotic Unicycle. Report, Department of Engineering, University of Cambridge, UK, 2009.

Kimura, H. and Kobayashi, S. Efficient Non-Linear Control by Combining Q-learning with Local Linear Controllers. In Proceedings of the ICML, pp. 210–219, 1999.

Ko, J., Klein, D. J., Fox, D., and Haehnel, D. Gaussian Processes and Reinforcement Learning for Identification and Control of an Autonomous Blimp. In Proceedings of the ICRA, pp. 742–747, 2007.

Naveh, Y., Bar-Yoseph, P. Z., and Halevi, Y. Nonlinear Modeling and Control of a Unicycle. Journal of Dynamics and Control, 9(4):279–296, 1999.

Peters, J. and Schaal, S. Policy Gradient Methods for Robotics. In Proceedings of the IROS, pp. 2219–2225, 2006.

Quiñonero-Candela, J., Girard, A., Larsen, J., and Rasmussen, C. E. Propagation of Uncertainty in Bayesian Kernel Models—Application to Multiple-Step Ahead Forecasting. In Proceedings of the ICASSP, pp. 701–704, 2003.

Raiko, T. and Tornio, M. Variational Bayesian Learning of Nonlinear Hidden State-Space Models for Model Predictive Control. Neurocomputing, 72(16–18):3702–3712, 2009.

Rasmussen, C. E. and Kuss, M. Gaussian Processes in Reinforcement Learning. In NIPS, pp. 751–759, 2004.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. The MIT Press, 2006.

Riedmiller, M. Neural Fitted Q Iteration—First Experiences with a Data Efficient Neural Reinforcement Learning Method. In Proceedings of the ECML, 2005.

Schaal, S. Learning From Demonstration. In NIPS, pp. 1040–1046, 1997.

Schneider, J. G. Exploiting Model Uncertainty Estimates for Safe Dynamic Control Learning. In NIPS, pp. 1047–1053, 1997.

van Hasselt, H. Insights in Reinforcement Learning. Wöhrmann Print Service, 2010. ISBN 978-9039354964.

Wawrzynski, P. and Pacut, A. Model-free off-policy Reinforcement Learning in Continuous Environment. In Proceedings of the IJCNN, pp. 1091–1096, 2004.

Wilson, A., Fern, A., and Tadepalli, P. Incorporating Domain Models into Bayesian Optimization for RL. In ECML-PKDD, pp. 467–482, 2010.

Zhong, W. and Röck, H. Energy and Passivity Based Control of the Double Inverted Pendulum on a Cart. In Proceedings of the CCA, pp. 896–901, 2001.