This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no longer be accessible.
Tracking Control for a Spherical Pendulum via
Curriculum Reinforcement Learning
Pascal Klink1, Florian Wolf1, Kai Ploeger1, Jan Peters1,3, and Joni Pajarinen2
Abstract—Reinforcement Learning (RL) allows learning non-
trivial robot control laws purely from data. However, many successful
applications of RL have relied on ad-hoc regularizations,
such as hand-crafted curricula, to stabilize learning performance.
In this paper, we pair a recent algorithm for automatically
building curricula with RL on massively parallelized simulations
to learn a tracking controller for a spherical pendulum on a
robotic arm via RL. Through an improved optimization scheme
that better respects the non-Euclidean task structure, we allow
the method to reliably generate curricula of trajectories to be
tracked, resulting in faster and more robust learning compared
to an RL baseline that does not exploit this form of structured
learning. The learned policy matches the performance of an
optimal control baseline on the real system, demonstrating the
potential of curriculum RL to jointly learn state estimation and
control for non-linear tracking tasks.
Index Terms—Non-Linear Control, Reinforcement Learning
I. INTRODUCTION
Due to a steady increase in available computation over the
last decades, reinforcement learning (RL) [1] has been applied
to increasingly challenging learning tasks both in simulated
[2], [3] and robotic domains [4]–[6]. Learning control of non-
trivial systems via reinforcement learning (RL) is particularly
appealing when dealing with partially observable systems and
high-dimensional observations such as images, or if quick
generalization to multiple related tasks is desired.
In this paper, we provide another demonstration of the po-
tential of reinforcement learning to find solutions to a non-
trivial control task that has, to the best of our knowledge, not
been tackled using learning-based methods. More precisely,
we focus on the tracking control of a spherical pendulum
attached to a four degrees-of-freedom Barrett Whole Arm
Manipulator (WAM) [7], as shown in Figure 1. The partial ob-
servability of the system arising from access to only positional
information paired with an inherently unstable, underactuated
system and non-trivial kinematics results in a challenge for
modern reinforcement learning algorithms.
With reinforcement learning being applied to increasingly
demanding learning tasks such as the one presented in this
paper, different strategies for improving learning performance,
such as guiding the learning agent through highly shaped and
1P. Klink, F. Wolf, K. Ploeger and J. Peters are with the Intelligent Au-
tonomous Systems Group at the Technical University of Darmstadt, Germany.
Correspondence to: pascal.klink@tu-darmstadt.de
2J. Pajarinen is with the Department of Electrical Engineering and Automa-
tion at Aalto University, Finland.
3J. Peters is also with the German Research Center for AI (Research
Department: Systems AI for Robot Learning), Hessian.AI and the Centre of
Cognitive Science.
Fig. 1: An image of our simulation (left) and robot envi-
ronment (right) of the spherical pendulum tracking task. The
eight-shaped target trajectories evolving in three spatial dimen-
sions require coordinated control to simultaneously balance
the pendulum and achieve good tracking performance. The
pendulum is mounted to a Barrett WAM robotic arm and is
tracked by an Optitrack system. In the left image, the colored
dots visualize the upcoming target trajectory to be followed,
and the blue line visualizes the achieved trajectory.
informative reward functions [8], [9], have evolved. In this
paper, we improve the training performance of the learning
agent via curricula, i.e., tailored sequences of learning tasks
that adapt the environment’s complexity to the capability of
the learning agent. For the considered tracking task, we adapt
the complexity via the target trajectories that are to be tracked
by the controller, starting from small deviations from an
initial position and progressing to a set of eight-shaped target
trajectories requiring the robot to move in all three dimensions.
Scheduling the complexity of the learning tasks is subject
to ongoing research [10], and solutions to this problem are
motivated from different perspectives, such as two-player
games [11] or the maximization of intrinsic motivation [12].
In this paper, we generate the curriculum of tasks using the
CURROT algorithm [13], which defines the curriculum as
a constrained interpolation between an initial- and desired
distribution of training tasks and is well-suited to our goal of
directing learning to a set of target trajectories. Applications
of CURROT have so far relied on training tasks that can be
represented in a low-dimensional vector space. We create a
curriculum over desired trajectories, a high-dimensional space
of learning tasks, allowing us to benchmark the CURROT
algorithm in this unexplored setting.
We demonstrate that the sampling-based optimization scheme
of CURROT that drives the evolution of the learning tasks
faces challenges in high-dimensional scenarios. Furthermore,
the default assumption of a Euclidean distance on the vector
space of learning tasks can lead to curricula that do not
facilitate learning. Addressing both pitfalls, we obtain robust
convergence to the target distribution of tasks, resulting in
a tracking controller that can be applied to the real system.
Contributions:
•We demonstrate a simulation-based approach for learn-
ing tracking controllers for an underactuated, partially
observable, and highly unstable non-linear system that
directly transfer to reality.
•Our approach includes a curriculum reinforcement learn-
ing method that reliably works with high-dimensional
task spaces equipped with Mahalanobis distances, such
as trajectories, commonly encountered in robotics.
•Through ablations, we confirm the robustness of our
method and provide insights into the importance of the
policy structure for generalization in tracking tasks.
II. RELATED WORK
As of today, there exist many demonstrations of applying
reinforcement learning (RL) to real-world robotic problems,
ranging from locomotion [6], [14], [15] to object manipulation
[4], [16], [17], where the RL agents need to process high-
dimensional observations, such as images [4] or grids of
surface height measurements [6] in order to produce appro-
priate actions. The RL agent typically controls the robot via
desired joint positions [6], [15], joint position deltas [4], joint
velocities [16], or even joint torques [15]. Depending on the
application scenario, actions are restricted to a manifold of
safe actions [16], [17].
Spherical Pendulum: Inverted pendulum systems have been
investigated since the 1960s [18] as an archetype of an
inherently unstable system and are a long-standing evaluation
task for reinforcement learning algorithms [19], with swing-up
and stabilization tasks successfully solved on real systems via
RL [20], [21]. Other learning-based approaches tune linear
quadratic regulators (LQRs) and PID controllers in a data-
driven manner to successfully stabilize an inverted pendulum
mounted on a robotic arm [22], [23]. The extension of the
one-dimensional inverted pendulum task to two dimensions
has been widely studied in the control community, resulting
in multiple real-world applications in which the pendulum has
been mounted either to an omnidirectional moving base [24],
[25], a platform driven via leading screws [26], a SCARA
robotic arm [27], or a seven degrees-of-freedom collaborative
robotic arm [28]. The controllers for these systems were
synthesized either via linear controller design in task space
[27], a time-variant LQR around pre-planned trajectories [28],
linear output regulation [25], sliding-mode control [24], or
feedback linearization [26]. In these approaches, the control
laws assumed observability of the complete state, requiring
specially designed pendulum systems featuring joint encoders
or magneto-resistive sensors and additional processing logic
to infer velocities.
In this paper, we learn tracking control of a spherical pendulum
on a robotic arm from position-only observations via reinforce-
ment learning. To the best of our knowledge, this has not yet
been achieved, and we believe that the combination of non-
trivial kinematics, underactuation, and partial observability is
a good opportunity to demonstrate the capabilities of modern
deep RL agents.
Curriculum Reinforcement Learning: The complexity of
this learning task provides an opportunity to utilize methods
from the field of curriculum reinforcement learning [10].
These methods improve the learning performance of RL agents
in various application scenarios [3], [5], [6] by adaptively
modifying environment aspects of a contextual- [29] or, more
generally, a configurable Markov Decision Process [30]. As do
their application scenarios, motivations for and realizations of
these algorithms differ widely, e.g., in the form of two-player
games [11], [31], approaches that maximize intrinsic motiva-
tion [12], [32], or as interpolations between task distributions
[13], [33]. We will focus on the CURROT algorithm [13]
belonging to the last category of approaches, as it is well suited
for our goal of tracking a specific set of target trajectories and
has so far been applied to rather low-dimensional settings,
allowing us to extend its application scenarios to the high-
dimensional space of trajectories faced here.
III. REINFORCEMENT LEARNING SYSTEM
In this section, we describe the trajectory tracking task, its sim-
ulation in IsaacSim [34], and the curriculum learning approach
[13] we utilized to speed up learning in this environment.
A. Simulation Environment and Policy Representation
As shown in Figure 1, we aim to learn a tracking task of
a spherical pendulum that is mounted on a four degrees-of-
freedom Barrett Whole Arm Manipulator (WAM) [7] via a
3D printed universal joint1. The robot can be approximately
modeled as an underactuated rigid body system
$$M(q)\ddot{q} = c(q, \dot{q}) + g(q) + \tau_{\mathrm{pad}} \quad (1)$$
with six degrees of freedom $q = [q_w \; q_p] \in \mathbb{R}^6$ that represent the joint positions of the Barrett WAM ($q_w$) and the pendulum ($q_p$), and four control signals $\tau \in \mathbb{R}^4$ that drive the joints of the Barrett WAM, where $\tau_{\mathrm{pad}} = [\tau \; 0 \; 0]$ appends the (always zero) controls for the non-actuated universal joint of the spherical pendulum. The universal joint does not possess any encoders, and we can infer the state of the pole only through position measurements provided by an OptiTrack system [35] at 120 Hz. Hence, although the Barrett WAM can be controlled at 500 Hz and delivers updates on its joint positions at the same frequency, we run the control law only at 125 Hz due to the OptiTrack frequency. In the following, we denote a variable's
1We designed the universal joint such that it has a large range of motion.
Furthermore, the use of skateboard bearings resulted in low joint friction.
Fig. 2: The policy is a standard feedforward neural network with [1024, 512, 256, 256] hidden layers that observes a history $O_t$ of joint positions $q_{w,t}$ and pole directions $x_{p,t}$, a history $A_t$ of past actions $\tau_t$, and a lookahead $T_t$ of the trajectory $\gamma(t)$ to be followed.
value at a discrete time index as $x_t$ and the value at arbitrary continuous time as $x(t)$. We learn a tracking control law for following desired trajectories $\gamma: [t_s, t_e] \mapsto \mathbb{R}^3$ of the pendulum tip from a fixed initial configuration $q_{w,0}$. The control law generates torques on top of a gravity compensation term $g(q_w)$ based on a history of positional observations, applied torques, and information about the desired trajectory $\gamma$
$$\tau_t = \pi(O_t, A_t, T_t) + g(q_{w,t}), \quad O_t = \{o_{t-i} \mid i \in [0, K-1]\},$$
$$T_t = \{(\gamma_{t+\Delta_i}, \dot{\gamma}_{t+\Delta_i}) \mid i \in [1, L]\}, \quad A_t = \{\tau_{t-i} \mid i \in [1, K]\}, \quad (2)$$
where $K{=}15$, $L{=}20$, and the $\Delta_i$'s are spread out over the interval $[0, 1.04]$ (Figure 2) to capture both the immediately upcoming positions and velocities of $\gamma(t)$ as well as the future behavior of the trajectory. An observation $o_t$ is given by the joint position of the Barrett WAM $q_{w,t}$ as well as a three-dimensional unit vector $x_{p,t} \in \mathbb{R}^3$ that represents the orientation of the pole (Figure 2). In simulation, we compute this vector using the difference between the pendulum tip $x_{\mathrm{tip},t}$ and the pendulum base $x_{\mathrm{base},t}$. On the real system, we compute this vector from OptiTrack measurements of four points on the pendulum. We reconstruct neither the pendulum joint positions $q_{p,t}$ nor the joint velocities $\dot{q}_t$ since this information is implicitly contained in the observation and action histories $O_t$ and $A_t$.
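To make the input structure of Eq. (2) concrete, the following minimal sketch (illustrative names only, not the authors' implementation) assembles the flat policy input from the observation history $O_t$, the action history $A_t$, and the trajectory lookahead $T_t$; the values $K{=}15$, $L{=}20$, and the 1.04 s lookahead horizon are taken from the text, while the exact spacing of the $\Delta_i$'s is an assumption.

```python
import numpy as np

K, L = 15, 20                              # history and lookahead lengths from the paper
DELTAS = np.linspace(0.0, 1.04, L)         # lookahead offsets over [0, 1.04] s (assumed spacing)

def build_policy_input(obs_history, action_history, gamma, gamma_dot, t):
    """Assemble the flat input (O_t, A_t, T_t) of Eq. (2).

    obs_history:      list of the most recent observations o_t = [q_w, x_p] (7-dim arrays)
    action_history:   list of the most recent torques tau_t (4-dim arrays)
    gamma, gamma_dot: callables returning the target position/velocity (3-dim) at a time
    """
    O_t = np.concatenate(obs_history[-K:])       # positional history
    A_t = np.concatenate(action_history[-K:])    # action history
    T_t = np.concatenate([np.concatenate([gamma(t + d), gamma_dot(t + d)]) for d in DELTAS])
    return np.concatenate([O_t, A_t, T_t])       # fed to the feedforward policy network
```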
We learn πvia the proximal policy optimization (PPO)
algorithm implemented in the RL Games library [36]. This
choice is motivated by our use of the IsaacSim simulation
environment [34], which allows us to simulate a large number
of environments in parallel on a single GPU2. The chosen PPO
implementation is designed to leverage this parallel simulation
during training. Screenshots of the simulation environment
2We used 2048 parallel environments for learning.
are shown in Figures 1 and 2. The trajectories evolve over
a total duration of 12 seconds, resulting in 12·125=1500
steps per episode. The reward function at a given time-step t
mainly penalizes tracking failures and additionally regularizes
excessive movement of the robot
$$r(q_t, \tau_t) = \begin{cases} -\dfrac{\alpha}{1-\gamma}, & \text{if } \mathrm{tipped}(q_t) \\[4pt] 1 - 1000\,\|\gamma_t - x_{\mathrm{tip},t}\|_2^2 - 10^{-1}\,\|\dot{q}_{w,t}\|_2^2 - 10^{-1}\,\|q_{w,t} - q_{w,0}\|_2^2 - 10^{-3}\,\|\tau_t\|_2^2, & \text{else.} \end{cases} \quad (3)$$
The function $\mathrm{tipped}(q_t)$ returns true if either $|q_{p,t}| \geq 0.5\pi$ or if the z-coordinate of the pendulum tip $x_{\mathrm{tip},t}$ is less than five centimeters above the z-coordinate of the pendulum base $x_{\mathrm{base},t}$. The episode ends if $\mathrm{tipped}(q_t)$ evaluates to true. The large amplification of the tracking error is required since $\|\gamma_t - x_{\mathrm{tip},t}\|_2$ is measured in meters. With the chosen amplification, a tracking error of three centimeters leads to a penalty of $-0.9$. In our experiments, we use a discount factor of $\gamma{=}0.992$ and evaluate the learning system for multiple $\alpha$.
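A minimal sketch of the per-step reward in Eq. (3); the coefficients are the ones stated above, alpha is treated as the magnitude of the tipping penalty, and the function signature is illustrative rather than the authors' code.

```python
import numpy as np

def step_reward(q_w, q_w0, dq_w, tau, gamma_t, x_tip, tipped, alpha=8.0, discount=0.992):
    """Per-step reward of Eq. (3); positions in meters, torques normalized."""
    if tipped:
        # -alpha / (1 - gamma): the discounted value of receiving -alpha at every future step
        return -alpha / (1.0 - discount)
    tracking = 1000.0 * np.sum((gamma_t - x_tip) ** 2)   # dominant tracking-error penalty
    vel_reg  = 1e-1 * np.sum(dq_w ** 2)                  # joint-velocity regularization
    pos_reg  = 1e-1 * np.sum((q_w - q_w0) ** 2)          # deviation from the initial posture
    act_reg  = 1e-3 * np.sum(tau ** 2)                   # action regularization
    return 1.0 - tracking - vel_reg - pos_reg - act_reg
```

With these coefficients, a 3 cm tracking error alone contributes $1000 \cdot 0.03^2 = 0.9$, matching the $-0.9$ penalty quoted above.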
B. Facilitating Sim2Real Transfer
To enable successful transfer from simulation to reality, we
first created a rigid-body model of the Barrett WAM (Eq. 1)
based on the kinematic and inertial data sheets from Barrett
Technology [7] in the MuJoCo physics simulator [37]. We
chose the MuJoCo simulator for initial investigations since it
allows us to more accurately model the actuation of the Barrett
WAM via tendons and differentials3. In contrast to the simplified model (1), this more faithful model of the Barrett WAM requires an extended state space $q_{\mathrm{ext}} = [q_w \; q_r \; q_p] \in \mathbb{R}^{10}$, in which the joint positions $q_w$ and rotor positions $q_r$ of the Barrett WAM are coupled via tendons that transfer the torques generated at the rotors to the joints (and vice versa). The joint encoders of the WAM are located at the rotors, and hence, we can only observe $q_r$, which may differ from $q_w$ depending on the stiffness of the tendons. During our initial evaluations, we found that modeling this discrepancy between measured and real joint positions as well as delayed actions (approximated as an exponential filter)
$$\tilde{\tau}_t = \omega \odot \tilde{\tau}_{t-1} + (1 - \omega) \odot \tau_t, \quad \omega \in [0, 1]^4, \quad (4)$$
where $\odot$ represents the element-wise multiplication of vectors and $1$ is a vector of all ones, were required to achieve stable
behavior of the learned policy on the real system. When not
modeling these effects, the actions generated by the learned
policies resulted in unstable feedback loops. A final extension
to the model is given by simulating a Stribeck-like behavior of friction by compensating the Coulomb friction modeled by MuJoCo
$$\tilde{\tau}_{a,t} = \tilde{\tau}_t + c \odot \tanh(\beta \odot \dot{q}_w), \quad (5)$$
where $c \in \mathbb{R}^4_{\geq 0}$ is the coefficient of Coulomb friction simulated by MuJoCo and $\beta \in \mathbb{R}^4_{\geq 0}$ is the reduction of friction due to
3IsaacSim also has support for tendon modelling. However, this support is significantly more restricted at the moment, preventing us from recreating the tendon structure of the Barrett WAM in simulation.
movement. Having completed our model, we then adjusted the tendon stiffness, rotor armature, damping, Coulomb friction $c$, as well as $\omega$ and $\beta$ using trajectories from the real system.
Given the lack of possibilities to model the tendon drives of the Barrett WAM in IsaacSim, we simulate the robot without tendons and model the discrepancies between $q_r$ observed by the policy and $q_w$ by a simple spring-damper model
$$\ddot{q}_r = K_P (q_w - T_q q_r) + K_D (\dot{q}_w - T_q \dot{q}_r) + T_\tau \tilde{\tau}, \quad (6)$$
where $T_q, T_\tau \in \mathbb{R}^{4\times 4}$ model the transformation of joint positions and torques via the tendons and $K_P, K_D \in \mathbb{R}^{4\times 4}$ model the spring-damper properties of the tendons.
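A minimal sketch of how the action filter (4), the friction term (5), and the spring-damper rotor model (6) could be chained in a per-step simulation update; the integration scheme, time step, and matrix values are illustrative placeholders, not the identified parameters.

```python
import numpy as np

def filtered_action(tau, tau_prev, omega):
    """Exponential action-delay filter of Eq. (4), element-wise over the four joints."""
    return omega * tau_prev + (1.0 - omega) * tau

def applied_torque(tau_filt, dq_w, c, beta):
    """Stribeck-like compensation of the simulated Coulomb friction, Eq. (5)."""
    return tau_filt + c * np.tanh(beta * dq_w)

def rotor_step(q_r, dq_r, q_w, dq_w, tau_filt, K_P, K_D, T_q, T_tau, dt=0.002):
    """Explicit-Euler integration (illustrative) of the spring-damper rotor model, Eq. (6)."""
    ddq_r = K_P @ (q_w - T_q @ q_r) + K_D @ (dq_w - T_q @ dq_r) + T_tau @ tau_filt
    dq_r = dq_r + dt * ddq_r
    q_r = q_r + dt * dq_r
    return q_r, dq_r
```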
Given the policy’s reliance on Optitrack measurements of
the pendulum, which are exchanged over the network, we
measured the time delays arising from the communication
over the network stack. We then modeled these delays in the
simulation, as detailed in Appendix A.
During learning, we randomize the masses within 75% and
125% of their nominal values and randomize damping and
Coulomb friction within 50% and 150% of their nominal values. Additionally, we add zero-mean Gaussian distributed noise with a standard deviation of 0.005 to the actions generated by the agent, which are normalized between $-1$ and $1$. The observations are corrupted by uniform noise within $[-0.01, 0.01]$. Finally, the amount of action delay is also randomized by sampling $\omega$ from $[0.5, 0.9]$, and $\beta$ is set to zero 25% of the time and sampled from $[0, 100]$ otherwise.
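The domain randomization described above could look as follows; whether each quantity is drawn per joint or per episode is not specified in the text, so the per-joint draws below are an assumption, as are the function and key names.

```python
import numpy as np

rng = np.random.default_rng()

def sample_episode_randomization(nominal):
    """Draw randomized model parameters once per episode from a dict of nominal values."""
    return {
        "mass":     nominal["mass"] * rng.uniform(0.75, 1.25),
        "damping":  nominal["damping"] * rng.uniform(0.5, 1.5),
        "friction": nominal["friction"] * rng.uniform(0.5, 1.5),
        "omega":    rng.uniform(0.5, 0.9, size=4),                    # action-delay filter, Eq. (4)
        "beta":     np.where(rng.random(4) < 0.25, 0.0,
                             rng.uniform(0.0, 100.0, size=4)),        # friction reduction, Eq. (5)
    }

def corrupt(action, observation):
    """Per-step action and observation noise (actions normalized to [-1, 1])."""
    noisy_action = action + rng.normal(0.0, 0.005, size=action.shape)
    noisy_obs = observation + rng.uniform(-0.01, 0.01, size=observation.shape)
    return noisy_action, noisy_obs
```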
C. Trajectory Representations
We represent the target trajectories $\gamma: [t_s, t_e] \mapsto \mathbb{R}^3$ via a constrained three-dimensional LTI system that is driven by a sequence of jerks (time-derivatives of accelerations)
$$\forall t \in [t_s, t_e]: \; \gamma(t) \in \mathcal{P} \quad (7)$$
$$\forall t \in [t_s, t_e]: \; \left\| \tfrac{d^3}{dt^3} \gamma(t) \right\|_2 \leq j_{\mathrm{UB}} \quad (8)$$
$$\gamma(t_s) = \gamma(t_e), \quad \dot{\gamma}(t_s) = \dot{\gamma}(t_e) = 0, \quad \ddot{\gamma}(t_s) = \ddot{\gamma}(t_e) = 0 \quad (9)$$
with a convex set $\mathcal{P} \subset \mathbb{R}^3$ of allowed positions. We model the LTI system as three individual triple-integrator models. For simplicity of exposition, we focus on only one of the three systems, i.e., $\gamma: [t_s, t_e] \mapsto \mathbb{R}$. The full system is obtained by simple "concatenation" of three copies of the following system
$$\dot{x}(t) = A x(t) + B u(t), \quad A = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \quad (10)$$
with $x_i(t) = \frac{d^{i-1}}{dt^{i-1}} \gamma(t)$ and $u(t) = \frac{d^3}{dt^3} \gamma(t)$. To represent the trajectories as finite-dimensional vectors $u \in \mathbb{R}^K$, we assume that the control trajectory of jerks $u(t)$ is piece-wise constant
$$u(t) = \sum_{k=1}^{K} u_k \mathbb{1}_k(t), \quad \mathbb{1}_k(t) = \begin{cases} 1, & \text{if } t_{k-1} \leq t < t_k \\ 0, & \text{else} \end{cases}$$
with $t_0 = t_s$ and $t_K = t_e$. This assumption allows us to represent $x(t)$ at time $t$ as a linear combination of the initial system state and the piece-wise constant jerks
$$x(t) = \Phi(t_s, t) x(t_s) + \underbrace{\begin{bmatrix} \psi(t_s, t_1, t) & \psi(t_1, t_2, t) & \ldots & \psi(t_{K-1}, t_K, t) \end{bmatrix}}_{\Psi(t) \in \mathbb{R}^{3 \times K}} \underbrace{\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_K \end{bmatrix}}_{u \in \mathbb{R}^K}. \quad (11)$$
We derive $\Phi$ and $\psi$ in the appendix. With the closed-form solution (11), we can rewrite Constraint (9) as a system of three linear equations
$$x(t_e) = \Phi(t_s, t_e) x(t_s) + \Psi(t_e) u \;\Leftrightarrow\; x(t_e) - \Phi(t_s, t_e) x(t_s) = \Psi(t_e) u \;\Leftrightarrow\; 0 = \Psi(t_e) u. \quad (12)$$
We know that $x(t_e) - \Phi(t_s, t_e) x(t_s) = 0$ due to the form of $\Phi(t_s, t_e)$ and since our initial state $x(t_s)$ is, per definition, given by $x_s = [\gamma(t_s) \; 0 \; 0]$. We can hence represent all trajectories that fulfill Constraint (9) in a $(K{-}3)$-dimensional basis of the kernel $\ker(\Psi(t_e))$. We refer to vectors in this kernel as $\tilde{u} \in \mathbb{R}^{K-3}$. The two remaining constraints (7) and (8) specify a convex set in $\ker(\Psi(t_e))$. As described in the next section, generating a curriculum over trajectories requires sampling in an $\epsilon$-ball around a given kernel element $\tilde{u}$ within the convex set, which we perform using simple rejection sampling.
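A minimal sketch of the trajectory parameterization: the analytic $\psi$ of Eq. (21) is used to assemble $\Psi(t_e)$, whose null space (via SciPy's null_space) yields the kernel basis $\Gamma$ mapping the $(K{-}3)$-dimensional context to piece-wise constant jerks. The time grid below uses the $K{=}20$ segments on $[1, 10.5]$ s mentioned later in Section V; everything else is illustrative.

```python
import numpy as np
from scipy.linalg import null_space

def psi(t_l, t_h, t):
    """Column psi(t_l, t_h, t) of Psi(t) for one triple integrator, cf. Eq. (21)."""
    if not (t_l <= t_h <= t):
        return np.zeros(3)
    d1, d2, d3 = t_h - t_l, t_h**2 - t_l**2, t_h**3 - t_l**3
    return np.array([t**2 * d1 / 2 - t * d2 / 2 + d3 / 6, t * d1 - d2 / 2, d1])

def build_psi(t_s, t_e, K, t=None):
    """Psi(t) in R^{3 x K} for K piece-wise constant jerk segments on [t_s, t_e]."""
    t = t_e if t is None else t
    knots = np.linspace(t_s, t_e, K + 1)
    return np.stack([psi(knots[k], min(knots[k + 1], t), t) for k in range(K)], axis=1)

K, t_s, t_e = 20, 1.0, 10.5
Psi_te = build_psi(t_s, t_e, K)
Gamma = null_space(Psi_te)                                # (K, K-3): basis of ker(Psi(t_e))
u = Gamma @ np.random.default_rng().normal(size=K - 3)    # one jerk sequence fulfilling (9)
assert np.allclose(Psi_te @ u, 0.0)                       # end state returns to the start, cf. Eq. (12)
```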
D. Curriculum Reinforcement Learning
By now, we can represent target trajectories $\gamma(t)$ via a vector $c = [\tilde{u}_1 \; \tilde{u}_2 \; \tilde{u}_3] \in \mathbb{R}^{3(K-3)}$ that represents the trajectory behavior in the three spatial dimensions. We will treat $\gamma(t)$ and $c$ interchangeably for the remainder of this paper and refer to $c$ as context or task, following the wording in [13]. We are interested in learning a policy $\pi$ that performs well on a target distribution $\mu(\gamma) = \mu(c)$ over trajectories. To facilitate learning, we use the curriculum method CURROT [13], which creates a curriculum of task distributions $p_i(c)$ by iteratively minimizing their Wasserstein distance $W_2(p, \mu)$ to the target distribution $\mu(c)$ under a given distance function $d(c_1, c_2)$ and subject to a performance constraint
$$\arg\min_{p} W_2(p, \mu) \quad \text{s.t.} \quad p(V(\pi, \delta)) = 1, \quad (13)$$
where $V(\pi, \delta) = \{c \in \mathcal{C} \mid J(\pi, c) \geq \delta\}$ is the set of contexts $c \in \mathcal{C}$ in which the agent achieves a performance $J(\pi, c) = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t r(q_t, \tau_t)]$ of at least $\delta$. We refer to [13] for the precise definition and derivation of the algorithm and, for brevity, only state the resulting algorithm.
The task distribution $p_i(c)$ is represented by a set of $N$ particles, i.e., $\hat{p}_i(c) = \frac{1}{N} \sum_{n=1}^{N} \delta_{c_{p_i,n}}(c)$ with $\delta_{c_{\mathrm{ref}}}(c)$ being the Dirac distribution on $c_{\mathrm{ref}}$. Each particle is updated by minimizing the distance $d(c, c_{\mu,\phi(n)})$ to a target particle $c_{\mu,\phi(n)}$
$$\min_{c \in \mathcal{C}} d(c, c_{\mu,\phi(n)}) \quad \text{s.t.} \quad \hat{J}(\pi, c) \geq \delta, \quad d(c, c_{p_i,n}) \leq \epsilon, \quad (14)$$
Fig. 3: Task sampling scheme used by CURROT. (Left) A particle-based representation $\hat{p}_i(c)$ of the task distribution $p_i(c)$ is updated to minimize the Wasserstein distance $W_2(\hat{p}_i, \hat{\mu})$ while keeping all particles in the feasible set $V(\pi, \delta)$ of tasks in which agent $\pi$ achieves a performance of at least $\delta$. The yellow lines indicate which particles of $\hat{p}_i$ have been matched to $\hat{\mu}$ to compute $W_2(\hat{p}_i, \hat{\mu})$. (Right) In practice, CURROT needs to rely on an approximation $\hat{V}(\pi, \delta)$ of $V(\pi, \delta)$, which is why a trust region $d(c_{p_i,n}, c) \leq \epsilon$ is introduced to avoid the overly greedy exploitation of approximation errors. The indicated trust region (black dotted line) belongs to the non-opaque particle.
where $\hat{J}(\pi, c)$ is a prediction of $J(\pi, c)$ using Nadaraya-Watson kernel regression [38]
$$\hat{J}(\pi, c) = \frac{\sum_{l=1}^{L} K_h(c, c_l) J_l}{\sum_{l=1}^{L} K_h(c, c_l)}, \quad K_h(c, c_l) = \exp\left(-\frac{d(c, c_l)^2}{2h^2}\right). \quad (15)$$
The parameter $\epsilon$ in (14) limits the displacements of the particles within one update step, preventing the exploitation of faulty performance estimates $\hat{J}(\pi, c)$. The kernel bandwidth $h$ is set to a fraction of $\epsilon$, e.g., $h = 0.3\epsilon$ in [13], given its purpose of capturing the trend of $J(\pi, c)$ within the trust region around $c_{p_i,n}$. The $L$ contexts $c_l$ and episodic returns $J_l$ used for predicting the agent performance are stored in two buffers, for whose update rules we refer to [13]. The $N$ target particles $c_{\mu,\phi(n)}$ are in each iteration sampled from $\mu(c)$, and the permutation $\phi(n)$ assigning them to $c_{p_i,n}$ is obtained by minimizing an assignment problem
$$W_2(\hat{p}_i, \hat{\mu}) = \min_{\phi \in \mathrm{Perm}(N)} \left( \frac{1}{N} \sum_{n=1}^{N} d(c_{p_i,n}, c_{\mu,\phi(n)})^2 \right)^{\frac{1}{2}}. \quad (16)$$
If we can optimize $d(c_{p_i,n}, c_{\mu,\phi(n)})$ to zero for each particle in each iteration, we essentially sample from $\mu(c)$. Figure 3 shows a schematic visualization of CURROT. A crucial ingredient in the CURROT algorithm is the distance function $d(c_1, c_2)$ that expresses the (dis)similarity between two learning tasks. So far, $d$ has been assumed to be the Euclidean distance in continuous spaces in [13]. A critical part of our experimental investigation of the benefit of curricula for learning tracking control will be the comparison of the Euclidean distance between the context vectors $c_1$ and $c_2$ and a Mahalanobis distance [39]. In the following section, we describe this distance and other improvements that we benchmark in the experimental section.
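A compact sketch of the particle machinery in Eqs. (14)-(16) under a Euclidean distance, using SciPy's linear_sum_assignment for the matching step; candidates are drawn with a simple direction-radius scheme rather than uniformly in the half-ball, so this only approximately mirrors the description above and is not the authors' GPU implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def nadaraya_watson(c, contexts, returns, h):
    """Performance prediction J_hat(pi, c) of Eq. (15)."""
    d2 = np.sum((contexts - c) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * h ** 2))
    return np.sum(w * returns) / (np.sum(w) + 1e-12)

def match_targets(particles, targets):
    """Assignment problem of Eq. (16): phi[n] is the target matched to particle n."""
    cost = np.sum((particles[:, None, :] - targets[None, :, :]) ** 2, axis=-1)
    _, phi = linear_sum_assignment(cost)
    return phi

def update_particle(c_n, c_target, contexts, returns, eps, delta, h, rng, n_cand=256):
    """Approximately solve Eq. (14) by sampling candidates around the particle c_n."""
    dirs = rng.normal(size=(n_cand, c_n.size))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    cands = c_n + rng.uniform(0.0, eps, size=(n_cand, 1)) * dirs      # stay in the trust region
    half_ball = (cands - c_n) @ (c_target - c_n) >= 0.0               # default half-ball condition
    feasible = np.array([nadaraya_watson(c, contexts, returns, h) >= delta for c in cands])
    valid = half_ball & feasible
    if not np.any(valid):
        return c_n                                                    # keep the particle in place
    dist = np.linalg.norm(cands - c_target, axis=1)
    dist[~valid] = np.inf
    return cands[np.argmin(dist)]
```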
IV. IMPROVED CURRICULUM GENERATION
The CURROT algorithm has so far been evaluated in rather low-dimensional scenarios, with two- or three-dimensional context spaces $\mathcal{C}$ that lend themselves to a Euclidean interpretation. In this section, we describe technical adjustments of the CURROT algorithm that improve the creation of curricula over trajectories, i.e., over a high-dimensional context space $\mathcal{C}$ with a more intricate metric structure.
A. Affine Metrics
In [13], the CURROT algorithm has been evaluated under the assumption of a Euclidean metric
$$d(c_1, c_2) = \|c_1 - c_2\|_2 = \sqrt{(c_1 - c_2)^T (c_1 - c_2)}$$
in continuous context spaces $\mathcal{C}$. For our trajectory representation, this corresponds to a Euclidean distance between elements in $\ker(\Psi(t_e))$. However, according to Eq. (11), we know that the difference between two (one-dimensional) LTI system states is given by
$$x_1(t) - x_2(t) = \Psi(t)(u_1 - u_2).$$
This observation allows us to compute the distance of the trajectories $\gamma_1(t), \gamma_2(t)$ generated by $c_1, c_2$ via a Mahalanobis distance
$$d_\Psi(c_1, c_2) = \sqrt{(c_1 - c_2)^T A (c_1 - c_2)}, \quad A = \Gamma_3^T \begin{bmatrix} \Psi_3(t_s) \\ \Psi_3(t_1) \\ \vdots \\ \Psi_3(t_e) \end{bmatrix}^T \begin{bmatrix} \Psi_3(t_s) \\ \Psi_3(t_1) \\ \vdots \\ \Psi_3(t_e) \end{bmatrix} \Gamma_3,$$
where $\Gamma \in \mathbb{R}^{K \times (K-3)}$ maps the elements $\tilde{u} \in \ker(\Psi(t_e))$ to jerk sequences $u$. We then "repeat" $\Gamma$ and $\Psi(t)$ to capture the three spatial dimensions, forming the block-diagonal matrices $\Gamma_3 = \mathrm{blkdiag}\{\Gamma\}_{n=1}^3$ and $\Psi_3(t) = \mathrm{blkdiag}\{\Psi(t)\}_{n=1}^3$. The Mahalanobis distance can be computed with no change to the algorithm by whitening the contexts $c$ and computing the Euclidean distance in the whitened space.
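A minimal sketch of assembling the metric matrix $A$ and whitening the contexts so that the existing Euclidean machinery can be reused unchanged; Psi_stack is assumed to be a list of $\Psi(t)$ matrices evaluated on a time grid (e.g., produced by the build_psi sketch in Section III-C) and Gamma the kernel basis from the same sketch.

```python
import numpy as np
from scipy.linalg import block_diag, cholesky

def metric_matrix(Psi_stack, Gamma):
    """A = Gamma_3^T M^T M Gamma_3 with M stacking the block-diagonal Psi_3(t) over time."""
    Gamma3 = block_diag(Gamma, Gamma, Gamma)                   # repeat the kernel basis for x, y, z
    M = np.vstack([block_diag(P, P, P) for P in Psi_stack])    # stacked Psi_3(t) matrices
    return Gamma3.T @ (M.T @ M) @ Gamma3

def whiten(contexts, A, jitter=1e-8):
    """Map contexts (rows) so that Euclidean distances equal the Mahalanobis distance d_Psi."""
    L = cholesky(A + jitter * np.eye(A.shape[0]), lower=True)  # A = L L^T
    return contexts @ L                                        # ||(c1 - c2)^T L||_2 = d_Psi(c1, c2)
```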
B. Sampling-Based Optimization
The optimization of (14) is carried out in parallel by uniformly sampling contexts in an $n$-dimensional $\epsilon$-half-ball
$$B^n_{\geq 0}(c_{p_i,n}, \epsilon) = \left\{ c \;\middle|\; \|c - c_{p_i,n}\|_2 \leq \epsilon \,\wedge\, \langle c - c_{p_i,n}, c_{\mu,\phi(n)} - c_{p_i,n} \rangle \geq 0 \right\}$$
around $c_{p_i,n}$ and selecting the sample with minimum distance to $c_{\mu,\phi(n)}$ that fulfills the performance constraint. $\langle \cdot, \cdot \rangle$ denotes the dot product. In higher dimensions, this sampling scheme faces two problems. Firstly, the mass of a ball is increasingly concentrated on its surface for higher dimensions, resulting in samples that are increasingly concentrated at the border of the trust region. Secondly, the chance of sampling a context $c$ for which $d(c, c_{\mu,\phi(n)}) < d(c, c_{p_i,n})$ decreases dramatically for higher dimensions as soon as $d(c_{p_i,n}, c_{\mu,\phi(n)}) \leq \epsilon$. To
remedy both problems, we first sample unit vectors that make
Fig. 4: (Left) In higher dimensions, the Euclidean norm $\|c\|_2$ of a vector in the $n$-dimensional ball $B^n$ increasingly converges to one. The angle between a context $c$ in the $n$-dimensional half-sphere $S^n_{\geq 0}$ and any $n$-dimensional vector $e_n$ is with increasing certainty larger than $45^\circ = 0.25\pi$. (Right) This behavior requires adapting the sampling-based optimization of Objective (14) to sample those unit vectors that make an angle of less than $45^\circ$ with a descent direction and scale them uniformly in $[0, \epsilon]$. Unlike the default CURROT sampling scheme (black samples), this sampling scheme (colored dots, color indicates density) more robustly finds descent directions in high-dimensional tasks.
an angle of less than $\theta = 0.25\pi$ with the descent direction $c_{\mu,\phi(n)} - c_{p_i,n}$. Such unit vectors can be sampled using, e.g., the sampling scheme described in [40]. We then scale these search-direction vectors by a scalar that we uniformly sample from the interval $[0, \epsilon]$. Figure 4 contrasts the new sampling scheme with the one introduced with the CURROT algorithm in [13].
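A minimal sketch of the adapted proposal distribution: unit vectors within a 45° cone around the descent direction, scaled uniformly on $[0, \epsilon]$. The cone directions are constructed by mixing the normalized descent direction with a random orthogonal component at a uniformly drawn angle, which is an illustrative construction rather than the specific scheme of [40].

```python
import numpy as np

def sample_cone_directions(descent, theta, n, rng):
    """Unit vectors making an angle of at most theta with the (nonzero) descent direction."""
    d_hat = descent / np.linalg.norm(descent)
    v = rng.normal(size=(n, descent.size))
    v -= (v @ d_hat)[:, None] * d_hat                  # project out the descent direction
    v /= np.linalg.norm(v, axis=1, keepdims=True)      # random orthogonal unit vectors
    ang = rng.uniform(0.0, theta, size=(n, 1))         # angle to the descent direction
    return np.cos(ang) * d_hat + np.sin(ang) * v

def propose_candidates(c_n, c_target, eps, n, rng, theta=0.25 * np.pi):
    """Candidate contexts for Objective (14): cone directions scaled uniformly in [0, eps]."""
    dirs = sample_cone_directions(c_target - c_n, theta, n, rng)
    return c_n + rng.uniform(0.0, eps, size=(n, 1)) * dirs
```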
C. Tracking Metrics other than Reward
The constraint $p(V(\pi, \delta)) = 1$ in Objective (13) controls the curriculum's progression towards $\mu(c)$ by preventing it from sampling contexts in which the agent does not fulfill a performance threshold $\delta$. We generalize this constraint to define $V$ based on an arbitrary function $M(\pi, c) \in \mathbb{R}$ obtained from a rollout of the policy $\pi$ in a context $c$. We hence define $V(\pi, \delta) = \{c \in \mathcal{C} \mid M(\pi, c) \geq \delta\}$. The same Nadaraya-Watson kernel regression introduced in Section III-D can approximate $M(\pi, c)$. In our setting, the increased flexibility enables restricting training to those trajectories for which the agent can stabilize the pendulum throughout almost the whole episode, i.e., almost all of the 1500 episode steps. Encoding this restriction via a fixed lower bound on the episode return is hard to achieve due to, e.g., regularizing terms on the joint velocities and the penalty for imprecise tracking of $\gamma(t)$. These terms can result in highly differing returns for episodes in which the agent stabilized the pendulum for the entire episode.
D. GPU Implementation
The authors of [13] provide an implementation of CURROT
in NumPy [41] and SciPy [42], computing the assignment to
the target distribution particles using the SciPy-provided linear
sum assignment solver. Given the large number of parallel
simulations that we utilize, our application of CURROT needed
Fig. 5: A visualization of the eight-shaped target trajectories
γ(t)(in yellow) that the learning agent is required to track in
our experiments. The trajectories are projected onto a dome
that is centered around the robot. Due to the particular shape
of the trajectories, we can represent them via both a low-
and high-dimensional parametric description, providing the
possibility to test how CURROT scales to high-dimensional
context representations. Compared to the low-dimensional
representation, eight-shaped trajectories are only a small part
of the full, high-dimensional context space.
to work with a large number of particles $N$ and contexts for performance prediction $L$. We hence created a GPU-based implementation using PyTorch [43]. To solve the assignment problem (16), we implemented a default auction algorithm [44] using the PyKeOps library [45], which provides highly efficient CUDA routines for reduction operations on large arrays. We also use the PyKeOps library for the Nadaraya-Watson kernel regression. The GPU implementation of CURROT and the code for running the experiments described in the next section will be made publicly available upon acceptance.
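For reference, a minimal CPU sketch of a forward auction algorithm for the assignment problem (16) (Bertsekas-style, terminating within $N\epsilon$ of the optimum); the authors' version presumably runs the corresponding max-reductions on the GPU via PyKeOps, which is not reproduced here.

```python
import numpy as np

def auction_assignment(cost, eps=1e-3):
    """Forward auction minimizing the total assignment cost of an N x N cost matrix."""
    n = cost.shape[0]
    benefit = -cost                            # the auction maximizes benefits
    prices = np.zeros(n)
    owner = np.full(n, -1, dtype=int)          # owner[j]: particle currently holding target j
    phi = np.full(n, -1, dtype=int)            # phi[i]: target assigned to particle i
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        values = benefit[i] - prices
        j = int(np.argmax(values))
        best = values[j]
        values[j] = -np.inf
        second = values.max() if n > 1 else best - eps
        prices[j] += best - second + eps       # bidding increment keeps the price rising
        if owner[j] >= 0:                      # displace the previous owner of target j
            phi[owner[j]] = -1
            unassigned.append(owner[j])
        owner[j], phi[i] = i, j
    return phi
```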
V. EXPERIMENTS
In this section, we answer the following questions by evaluat-
ing the described learning system in simulation as well as on
the real system:
•Do curricula stabilize or speed up learning in the trajec-
tory tracking task?
•How do the proposed changes to the CURROT algorithm
alter the generated curriculum and its benefit on the
learning agent?
•Does the behavior learned in simulation transfer to the
real system?
The experiment requires the agent to track eight-shaped tra-
jectories projected onto a sphere (Figure 5). The target distri-
bution µ(γ)of tasks encodes eight-shaped trajectories whose
maximal distance to the starting position is 0.36-0.4m in the
x-dimension and 0.18-0.2m in the y-dimension. We choose
the z-coordinate of the trajectory such that the trajectory has
a constant distance to the first joint of the Barrett WAM, i.e.,
moves on a sphere centered on this joint.
We chose this particular task since the trajectories encoded
by the target distributions µ(γ)can, in addition to the param-
eterization via jerks, be parameterized in a two-dimensional
Legend: PPO, CURROTAO, CURROTA, CURROT, CURROTL.
Fig. 6: (Left) Ablation over different tipping penalties $\alpha$ and their effect on the average required number of epochs until successfully completing the target trajectories. (Middle) Completion rate (i.e., fraction of maximum steps per episode) over epochs for $\alpha = -8$ for different learning methods. (Right) Achieved tracking error during the agent lifetime over epochs for different learning methods. In the middle and right plot, thick lines represent the median, and the shaded areas represent interquartile ranges. Statistics are computed from 10 seeds.
parameter space, which enables us to benchmark how the CURROT algorithm proposed in [13] behaves both in low- and high-dimensional task parameterizations. For the two-dimensional representation, the context directly encodes the maximum distance in the x- and y-dimension when generating curricula in the two-dimensional context space $\mathcal{C}_L \subset \mathbb{R}^2$. When representing trajectories via jerks, we compose the jerk sequence $u$ of $K{=}20$ constant segments evenly spread over the interval $[1, 10.5]$. The first and last second of each trajectory are always stationary at $x(t_s)$. Hence, the actual movement happens within $[1, 11]$ seconds. Due to constraint (9) of starting and ending in $x(t_s)$, the parameterization reduces to 17 dimensions for each task-space dimension, i.e., $\mathcal{C}_H \subset \mathbb{R}^{51}$. For building the curricula, we define the set $V(\pi, \delta)$ to contain those trajectories for which the policy manages to keep the pendulum upright for at least 1400 steps, i.e., those trajectories which fully complete their movements during the lifetime of the agent (remember that the policy is stationary for the last second, i.e., the last 125 out of 1500 steps).
We compare the default PPO learner against four variants of the CURROT method introduced in Section III-D:
•CURROT: The default algorithm as introduced by [13], using our GPU-based implementation and using $M(\pi, c)$ instead of $J(\pi, c)$ to define $V(\pi, \delta)$.
•CURROTL: The default algorithm exploiting the low-dimensional parameterization of the target trajectories to generate curricula in $\mathbb{R}^2$ instead of $\mathbb{R}^{51}$.
•CURROTA: A variation of CURROT that uses the metric $d_\Psi$ to capture the dissimilarity between the generated trajectories rather than the context variables.
•CURROTAO: The version of CURROT that combines the use of $d_\Psi$ with improvements to the sampling-based optimization of Objective (14).
For all curricula, we choose the trust-region parameter $\epsilon$ of Objective (14) according to the method described in [13], i.e., setting $\epsilon \approx 0.05 \max_{c_1, c_2 \in \mathcal{C}} d(c_1, c_2)$. All curricula train on an initial distribution $p_0(c)$ of trajectories that barely deviate from the initial position until $\mathbb{E}_{p_0}[\hat{M}(\pi, c)] \geq \delta$, at which point the methods start updating the context distribution. All methods train for 262 million learning steps, where a policy update is performed after 64 environment steps, resulting in $64 \cdot 2048 = 131072$ samples generated between policy updates.
A. Quantitative Results
Figure 6 shows the performance of the learned policies.
More precisely, we show the average tracking error during
the agent’s lifetime and the number of completed trajectory
steps on µ(γ). While the tracking errors behave similarly
between the investigated methods, the curricula shorten the
required training iterations until we can track complete target
trajectories.
The results indicate that by first focusing on trajectories
that can be tracked entirely and then gradually transforming
them into more complicated ones, we avoid wasteful biased
sampling of initial parts of the trajectory due to system resets
once the pendulum falls over.
We additionally ablate the results over the penalty term $\alpha$ that the agent receives when the pendulum topples over (Eq. 3). Figure 6 shows that its influence on the learning speed of the agent is limited, as the number of epochs required by PPO to completely track the target trajectories stays relatively constant even when increasing $\alpha$ by a factor of three.
For the curriculum methods themselves, we can make two
observations. First, in the high-dimensional context space
CH⊂R51, all curricula learn the task reliably with comparable
learning speed. Second, operating in the low-dimensional
context space $\mathcal{C}_L \subset \mathbb{R}^2$ does not lead to faster learning. Both results surprised us since, in the high-dimensional context space, we expected that exploiting the structure of the context space via $d_\Psi(c_1, c_2)$ and the improved optimization would significantly improve the curricula. Furthermore, we expected the low-dimensional representation $\mathcal{C}_L \subset \mathbb{R}^2$ to ease the performance estimation $\hat{J}(\pi, c)$ via kernel regression since we can more densely populate the context space $\mathcal{C}_L$ with samples of the current agent performance. The following section highlights why the observed performance did not behave according to our expectations.
(a) CURROTAO (b) CURROTA (c) CURROT (d) CURROTL
Fig. 7: Evaluation of training distributions $p_i$ for different ablations of CURROT. Brighter colors indicate later iterations. The black dotted line indicates the "boundaries" of the support of the target distribution. Note that the distributions have been projected onto a 2D plane, omitting the z-coordinate.
Legend: CURROTAO, CURROTA, CURROT, CURROTL.
Fig. 8: Evolution of the Wasserstein distance $W_2(p, \mu)$ compared to the initial distance over learning epochs for different variants of CURROT. The dashed horizontal lines represent $\epsilon / W_2(p_0, \mu)$, i.e., the fraction between the trust region $\epsilon$ for Objective (14) and the initial Wasserstein distance $W_2(p_0, \mu)$. Thick lines represent the medians, and shaded areas visualize interquartile ranges. Statistics are computed from 10 seeds.
B. Qualitative Analysis of Generated Curricula
To better understand the dynamics of the generated curricula,
we visualize the generated trajectories throughout different
learning epochs and the evolution of the Wasserstein distance
$W_2(p_i, \mu)$ in Figures 7 and 8. Focusing on the evolution of Wasserstein distances shown in Figure 8, we can see that only CURROTAO and CURROTL converge to the target distribution, achieving zero Wasserstein distance. CURROT and CURROTA do not converge to $\mu(c)$: after initially exhibiting fast progression towards $\mu(c)$, they slow down as the Wasserstein distance approaches the value of the trust-region parameter $\epsilon$. This slowing-down behavior is precisely due to the naive sampling in the half-ball of the default CURROT algorithm, which we discussed in Section IV-B. If the target contexts are well outside the trust region, even samples that make an angle larger than $0.25\pi$ with the descent direction $c_{\mu,\phi(n)} - c_{p_i,n}$ decrease the distance to the target $c_{\mu,\phi(n)}$. Once the target contexts are on or within the boundary of the trust region, the effect visualized in Figure 4 takes place, preventing further approach to the target samples.
As shown in Figure 7, the effect of the resulting bias on the
generated trajectories greatly depends on the chosen metric.
By measuring dissimilarity via the Euclidean distance between
contexts $c_1$ and $c_2$, CURROT generates trajectories that behave entirely differently from the target trajectories during the initial and later stages of training. Incorporating domain knowledge via $d_\Psi$ allows CURROTA to generate trajectories with similar
qualitative behavior to the target trajectories throughout the
learning process.
While underlining the importance of the proposed improve-
ments in CURROTAO, the high performance achieved by CURROT and CURROTA, despite the potentially strong dis-
similarity in generated trajectories, stresses a critical obser-
vation: The success of a curriculum is inherently dependent
on the generalization capability of the learning agent. By
conditioning the policy behavior on limited-time lookahead
windows of the target trajectory Tt, the learning agent seems
capable of generalizing well to unseen trajectories as long as
those trajectories visit similar task-space positions as the tra-
jectories in the training distribution. Consequently, the failure
of CURROT to generate trajectories of similar shape to those
in µ(γ)is compensated for by the generalization capabilities
of the learning agent. All in all, the results indicate that
convergence of $p_i(c)$ to $\mu(c)$ is only a sufficient condition
for good agent performance on µ(c), but not a necessary one.
C. Alternative Trajectory Representation
The surprising effectiveness of CURROT, despite its ignorance
of the context space structure, led us to conclude that the policy
structure leads to rather strong generalization capabilities of
the agent, concealing shortcomings of the generated curricula.
We changed the policy architecture to test this hypothesis,
replacing the trajectory lookahead $T_t$ simply by the contextual parameter $c \in \mathbb{R}^{51}$ and the current time index $t \in \mathbb{R}$.
While this representation still contains all required information
about the desired target position γ(t)at time step t, it does
not straightforwardly allow the agent to exploit common
subsections of two different trajectories. Figure 9 visualizes
the results of this experiment. Comparing Figures 6 and 9,
we see that the different context representation slows down
the learning progress of all curricula and consequently leads
to higher tracking errors after 2000 epochs. We also see that
the lookahead $T_t$ benefits learning with PPO, as with the
new context representation, none of the 10 seeds learn to
complete the trajectory within 2000 epochs. More importantly,
we see how the inadequate metric of CURROT now leads to a failure in the curriculum generation, with $W_2(p_i, \mu)$
staying almost constant for the entire 2000 epochs as the agent
Legend: PPO, CURROTAO, CURROTA, CURROT, CURROTL.
Fig. 9: (Top Left) Completion rate (i.e., fraction of maximum steps per episode) over epochs for $\alpha = -8$ for different learning methods. (Top Right) Tracking error achieved during the agent lifetime over epochs for different learning methods. (Bottom Left) Wasserstein distance between training and target distribution over epochs. (Bottom Right) Percentage of training tasks $c \sim p_i(c)$ for which $M(\pi, c) \geq \delta$ over epochs. We show medians and interquartile ranges computed from 10 seeds.
struggles to solve the tasks in the curriculum. This failure to
generate tasks of adequate complexity is also shown in Figure
9, where we visualize the percentage of tasks csampled by
the curriculum for which M(π, c)≥δ. As we can see, this
percentage drops to zero under CURROT once the algorithm
starts updating p0(c). The other curricula maintain a non-
zero success percentage. We can also observe a pronounced
drop in success rate for CURROTA, which is not present for
CURROTAO, potentially due to the more targeted sampling
in the approximate update step of the context distribution
resulting in more similar trajectories.
Apart from this ablation, we also performed experiments for
increasing context space dimensions, generating curricula in
up to 399-dimensional context spaces. The results in Appendix
C show that CURROTAO generates beneficial curricula across
all investigated dimensions and trajectory representations,
while the curricula of CURROTA become less effective in higher dimensions for the alternative trajectory representation presented in this section. For CURROT, the resulting picture
stays unchanged with poor observed performance for the
alternative trajectory representation regardless of the context
space dimension.
D. Real Robot Results
To assess transferability to the real world and put the achieved
results into perspective, we evaluated the three best policies
learned with PPO and CURROTAO for $\alpha = -8$ on the real
robot and compared them to an optimal control baseline. We
evaluated each seed on 10 trajectories sampled from µ(γ).
(a) DDP (b) PPO (c) CURROTAO
Fig. 10: Generated trajectories on the real robot. We visu-
alize projections to the xy- (top) and yz-plane (bottom) to
highlight the three-dimensional nature of the trajectory. The
reference trajectories are shown in blue. Other colors indicate
trajectories that have been generated by the different (learned)
controllers. For PPO and CURROTAO, we evaluate the three
best-performing seeds, indicated by colors.
The target trajectories are shown in Figure 10 and Figure 11
shows snapshots of the policy execution on the real system.
Given the architectural simplicity of the agent policy, it was
easy to embed it in a C++-based ROS [46] controller using the
Eigen library, receiving the pole information from Optitrack
via UDP packets. The execution time of the policy network
was less than a millisecond and hence posed no issue for our
target control frequency of 125 Hz. Given that the pole starts
in an upright position during training, we attached a thread to
the tip of the pendulum to stabilize the pendulum via a pulley
system before starting the controller. We then simultaneously
release the thread and start the controller. Given the negligible
weight of the thread, we did not observe any interference with
the pole. To ensure safety during the policy execution, we first
executed the policies in a MuJoCo simulation embedded in the
ROS ecosystem. We monitored the resulting minimum- and
maximum joint positions $q_{\min}$ and $q_{\max}$, and defined a safe region $S$, which the agent is not supposed to leave during execution on the real system
$$S = \left[\bar{q} - 1.25(q_{\max} - q_{\min}), \; \bar{q} + 1.25(q_{\max} - q_{\min})\right],$$
where $\bar{q} = \frac{1}{2}(q_{\min} + q_{\max})$. Inspired by the results of [28],
/2(qmin +qmax). Inspired by the results of [28],
we decided to compare the results of PPO and CURROTAO
to an optimal control baseline that we obtained by computing
a time-varying linear feedback controller using the differential
dynamic programming (DDP) algorithm implemented in the
Crocoddyl library [47]. At the convergence of DDP, we can
obtain a time-varying linear controller from the internally
computed linearization of the dynamics on the optimal trajec-
tory. We use the same cost function as for the reinforcement
learning agent, simply removing the penalty term for tipping
the pendulum, as the gradient-based DDP does not run into
danger of tipping the pendulum. The obtained time-varying
controller requires access to full state information, i.e., position
and velocity of the robot and pole, which we infer using a
high-gain non-linear observer [48], whose gains we tuned on
the real system using the synthesized controller to achieve the
best tracking performance.
We performed the tracking experiments twice on different
(a) t ≈ 4.0 (b) t ≈ 5.2 (c) t ≈ 6.5 (d) t ≈ 8.0 (e) t ≈ 11
Fig. 11: Snapshots of the policy learned with CURROTAO during execution on the real robot. The dotted red line visualizes the target trajectory to be tracked by the policy. The generated trajectory is visualized by the colored line, where brighter colors indicate later time-steps.
days, obtaining 20 trajectories per seed and method, from
which we can compute statistics. Figure 10 visualizes the
result of the policy rollouts on the real system, and Table I
provides quantitative data. As shown in Figure 10, the policies
learned with CURROTAO seem to track the reference trajecto-
ries more precisely than the other methods. This impression is
TABLE I: Mean and standard deviation of tracking errors achieved with different controllers. We evaluate both on a ROS-embedded MuJoCo simulation (Sim) and the real robot (Real). For PPO and CURROTAO, the color of the seeds corresponds to the trajectories shown in Figure 10. In each row, statistics are computed from 20 policy executions.

DDP
Gain | Sim Completion | Sim Error [cm] | Real Completion | Real Error [cm]
High | 1.00           | 0.75 ± 0.02    | -               | -
Low  | 1.00           | 2.58 ± 0.11    | 1.00            | 3.11 ± 0.12

CURROTAO
Seed | Sim Completion | Sim Error [cm] | Real Completion | Real Error [cm]
2    | 1.00           | 1.98 ± 0.06    | 1.00            | 2.30 ± 0.16
4    | 1.00           | 1.80 ± 0.07    | 1.00            | 2.44 ± 0.09
5    | 1.00           | 1.85 ± 0.07    | 1.00            | 2.10 ± 0.13
Avg. | 1.00           | 1.88 ± 0.10    | 1.00            | 2.28 ± 0.19

PPO
Seed | Sim Completion | Sim Error [cm] | Real Completion | Real Error [cm]
1    | 1.00           | 3.10 ± 0.14    | 1.00            | 2.82 ± 0.20
5    | 1.00           | 2.30 ± 0.08    | 1.00            | 2.71 ± 0.20
8    | 1.00           | 2.15 ± 0.07    | 0.55            | 4.18 ± 0.66
Avg. | 1.00           | 2.52 ± 0.43    | 0.85            | 3.07 ± 0.68
backed up by the data in Table I, where the average tracking
performance of both DDP and PPO on the real robot is about
35% worse than that of CURROTAO . Comparing the results of
the ROS-embedded MuJoCo simulation and the execution on
the real robot, we see that, on average, the performance on the
real system is about 20% worse across all methods. Regarding
reliability, one out of the three policies learned with PPO did
not reliably perform the tracking task, as it left the safe region
$S$ during execution. We can observe a significantly worse real
robot tracking performance for this particular seed in Figure
10. Looking at the DDP results again, we see a distinction between high and low gains. The high-gain setting corresponds to precisely using reward function (3), resulting in the best track-
ing performance across all methods in simulation. However,
the high gains of the generated time-varying linear feedback
controllers resulted in unstable behavior in the real system.
To obtain stable controllers in the real system, we needed to
increase the regularization of actions, position, and velocity by
a factor of 30. We assume that more sophisticated methods that
better account for uncertainty in the model parameters could
further improve performance. Since our baseline only aims to
put the learned behavior into perspective, we did not explore
such advanced methods. Instead, we interpreted the results as
evidence that deep RL-based methods can learn precise control
of highly unstable systems comparable to classical control
methods.
Looking back at Figure 6, we see that the tracking errors
in the ROS-embedded MuJoCo simulation in Table I are
slightly worse than the error observed in Isaac Sim, where
PPO consistently achieved a tracking error of less than 2cm,
and CURROTAO consistently achieved a tracking error of less
than 1.5cm. This performance gap may be caused by our
approximate modeling of actuation delay and tendons, and we
expect additional efforts on system identification, modeling,
and domain randomization to close this gap.
VI. CONCLUSIONS
We presented an approach that learns a tracking controller for
an inverted spherical pendulum mounted to a four degrees-
of-freedom Barrett Whole Arm Manipulator. We showed that
increasingly available massively parallel simulators allow off-
the-shelf reinforcement learning algorithms paired with cur-
ricula to reliably learn this non-trivial partially observable
control task across policy- and task-space representations. Our
evaluations of curricula and their effect on learning success
showed multiple interesting results. Apart from confirming
the sample-complexity benefit of learning the tracking task
via curricula, we showed that a) the generation of curricula
is possible in high-dimensional context spaces and b) that
high-dimensionality does not need to make the curriculum
generation less efficient. However, we also saw that the very
structure of our learning agent was a significant factor in the
robustness of the generated curricula, allowing it to track target
trajectories that are significantly different from the trajectories
encountered in the curricula. These findings motivate future
investigations into the interplay between agent generalization
and curricula. From a technical point of view, we demonstrated
the importance of appropriately encoding the structure of the
context space $\mathcal{C}$ via the distance function of CURROT, particu-
larly when the generalization capability of the agent is limited.
An interesting next step is to generalize CURROT to work
with arbitrary Riemannian manifolds. On the robotic side, we
see much potential in applications to other robotic tasks, e.g.,
locomotion problems. On this particular setup, investigating
the control of a non-rigidly attached inverted pendulum would
allow us to tackle more complicated movements that, e.g.,
require the robot to thrust the pendulum into the air and
catch it again. Furthermore, a non-rigidly attached pendulum
would pose an additional challenge for modeling the system in
simulation and deriving controllers using optimal control, as
contact friction becomes essential to balancing the non-rigidly
attached pendulum.
ACKNOWLEDGEMENTS
Joni Pajarinen was supported by Research Council of Finland
(formerly Academy of Finland) (decision 345521).
APPENDIX A
MODELLING NETWORK COMMUNICATION DELAYS
Since our measurements indicated a non-negligible chance
of delayed network packets, we modeled this effect during
training. With the simulation advancing in discrete timesteps,
we modeled the network delays in multiples of simulation
steps. More formally, the observation of the pendulum $x_{p,t}$ at time $t$ only becomes available to the agent at time $t + \delta_t$, where $\delta_t \in [0, 1, 2, 3, 4]$. Furthermore, even if $t + i + \delta_{t+i} < t + \delta_t$ for some $i > 0$, the observation at $t + i$ cannot become available before time $t + \delta_t$. We realized this behavior with a FIFO queue, where we sample $\delta_t$ upon entry of an observation $x_{p,t}$.
We also observed packet losses over the network. Since those
losses seemed to correlate with packet delays, we, in each
timestep, drop the first packet in the queue with a chance of
25%. Hence, the longer the queue is non-empty, i.e., packets
are subject to delays, the higher the chance of packets being
lost. The probabilities for the delays are given by
$$p(\delta_t) = \begin{bmatrix} 0.905 & 0.035 & 0.02 & 0.02 & 0.02 \end{bmatrix}_{\delta_t}. \quad (17)$$
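A minimal sketch of the delay-and-drop queue described in this appendix; the delay is sampled on entry according to Eq. (17), packets are released strictly in FIFO order, and in each step the head of the remaining (still delayed) packets is dropped with 25% probability. The exact ordering of release, drop, and insertion within a step is an assumption.

```python
import random
from collections import deque

DELAY_VALUES = [0, 1, 2, 3, 4]
DELAY_PROBS = [0.905, 0.035, 0.02, 0.02, 0.02]   # Eq. (17)

class DelayedObservationQueue:
    """FIFO queue modeling network delay and correlated packet loss of pole observations."""

    def __init__(self, drop_prob=0.25):
        self.queue = deque()          # entries: (release_step, observation)
        self.drop_prob = drop_prob
        self.last = None              # most recently released observation

    def push(self, step, obs):
        delay = random.choices(DELAY_VALUES, weights=DELAY_PROBS)[0]
        self.queue.append((step + delay, obs))

    def step(self, step):
        # release everything that is due; a delayed head blocks all later packets (FIFO)
        while self.queue and self.queue[0][0] <= step:
            self.last = self.queue.popleft()[1]
        # packets still waiting in the queue may be lost
        if self.queue and random.random() < self.drop_prob:
            self.queue.popleft()
        return self.last
```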
APPENDIX B
ANALYTIC SOLUTION TO THE LTI SYSTEM EQUATIONS
Given that Constraints (7) and (8) on the LTI system specify
a convex set, which can be relatively easily dealt with, we
turn towards Constraint (9), for which we need to derive the
closed-form solution of the LTI system (10)
$$x(t) = \Phi(t_s, t) x(t_s) + \int_{t_s}^{t} \Phi(\tau, t) B u(\tau) \, d\tau. \quad (18)$$
The transition matrix $\Phi(t_s, t)$ is given by
$$\Phi(t_s, t) = e^{A \Delta_s} = I + A \Delta_s + \frac{A^2 \Delta_s^2}{2} + \ldots + \frac{A^k \Delta_s^k}{k!} = I + A \Delta_s + \frac{\Delta_s^2}{2} \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & \Delta_s & \frac{\Delta_s^2}{2} \\ 0 & 1 & \Delta_s \\ 0 & 0 & 1 \end{bmatrix}, \quad (19)$$
where $\Delta_s = t - t_s$. We can now turn towards the second term in Equation (18). For solving the corresponding integral, we exploit the assumption that the control signal $u(t)$ is piece-wise constant on the intervals $[t_i, t_{i+1})$ with $t_s = t_0 < t_1 < \ldots < t_{K-1} < t_K = t_e$. With that, the second term reduces to
$$\int_{t_0}^{t} \Phi(\tau, t) B u(\tau) \, d\tau = \sum_{k=1}^{K} u_k \int_{t_{k-1}}^{\min(t_k, t)} \Phi(\tau, t) B \, d\tau. \quad (20)$$
We are hence left to solve
$$\int_{t_l}^{t_h} \Phi(\tau, t) B \, d\tau = \int_{t_l}^{t_h} \begin{bmatrix} \tfrac{(t-\tau)^2}{2} \\ t - \tau \\ 1 \end{bmatrix} d\tau = \left[ \begin{matrix} \tfrac{t^2 \tau}{2} - \tfrac{t \tau^2}{2} + \tfrac{\tau^3}{6} \\ t\tau - \tfrac{\tau^2}{2} \\ \tau \end{matrix} \right]_{\tau = t_l}^{t_h} = \begin{bmatrix} \tfrac{t^2 t_h}{2} - \tfrac{t t_h^2}{2} + \tfrac{t_h^3}{6} \\ t t_h - \tfrac{t_h^2}{2} \\ t_h \end{bmatrix} - \begin{bmatrix} \tfrac{t^2 t_l}{2} - \tfrac{t t_l^2}{2} + \tfrac{t_l^3}{6} \\ t t_l - \tfrac{t_l^2}{2} \\ t_l \end{bmatrix} = \begin{bmatrix} \tfrac{t^2 \Delta_{lh}}{2} - \tfrac{t \tilde{\Delta}^2_{lh}}{2} + \tfrac{\tilde{\Delta}^3_{lh}}{6} \\ t \Delta_{lh} - \tfrac{\tilde{\Delta}^2_{lh}}{2} \\ \Delta_{lh} \end{bmatrix} = \psi(t_l, t_h, t), \quad (21)$$
where $\Delta_{lh} = t_h - t_l$ and $\tilde{\Delta}^i_{lh} = t_h^i - t_l^i$. We assume $t_l \leq t_h \leq t$ and otherwise define $\psi(t_l, t_h, t)$ to be zero.
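A quick numerical sanity check of Eqs. (19) and (21): the closed-form $\psi$ is compared against a quadrature of the integral in (20) using SciPy's quad (the values of $t_l$, $t_h$, $t$ are arbitrary).

```python
import numpy as np
from scipy.integrate import quad

def phi(t_s, t):
    """Transition matrix of Eq. (19) for the triple integrator."""
    d = t - t_s
    return np.array([[1.0, d, d**2 / 2], [0.0, 1.0, d], [0.0, 0.0, 1.0]])

def psi(t_l, t_h, t):
    """Closed-form column of Eq. (21), assuming t_l <= t_h <= t."""
    d1, d2, d3 = t_h - t_l, t_h**2 - t_l**2, t_h**3 - t_l**3
    return np.array([t**2 * d1 / 2 - t * d2 / 2 + d3 / 6, t * d1 - d2 / 2, d1])

def psi_numeric(t_l, t_h, t):
    """Numerically integrate Phi(tau, t) B = [(t - tau)^2 / 2, t - tau, 1]^T over [t_l, t_h]."""
    rows = [lambda tau: (t - tau) ** 2 / 2, lambda tau: t - tau, lambda tau: 1.0]
    return np.array([quad(f, t_l, t_h)[0] for f in rows])

t_l, t_h, t = 1.0, 2.5, 4.0
assert np.allclose(psi(t_l, t_h, t), psi_numeric(t_l, t_h, t))
```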
APPENDIX C
HIGH-DIMENSIONAL ABLATIONS
To test the robustness of the different CURROT versions w.r.t. changes in context space dimensions, we increased the number of sections used to represent the jerk trajectory $u(t)$. We tested three numbers, resulting in 99, 198, and 399 context space
Legend: CURROTAO, CURROTA, CURROT, CURROTAO (JC), CURROTA (JC), CURROT (JC).
Fig. 12: Quantitative results for different CURROT versions under increasing dimensions. We show the mean Epochs to Completion (left), Final Average Tracking Error (middle), and Final (Normalized) Wasserstein Distance (right). The error bars indicate the standard error. Statistics are computed from 10 seeds. The abbreviation (JC) stands for experiments in which the agent is given the alternative trajectory representation described in Section V-C.
dimensions. When increasing the dimension, we observed that the condition number of the whitening matrix for CURROTAO and CURROTA increased significantly, leading to high-jerk interpolations, which were smooth in position and velocity but exhibited strong oscillations in acceleration. We counteracted this behavior by not only measuring the LTI system state via the matrix $A$ (Section IV-A) but also adding the transform $\Gamma_3$ as additional rows to the stacked entries $\Psi_3(t)$ in the definition of $A$, where $\Gamma$ maps the elements of $\ker(\Psi(t_e))$ to piece-wise constant jerk trajectories and $\Gamma_3$ is its block-diagonal version as defined in the main paper. The resulting explicit regularization of the generated jerks prevented the previously observed high-jerk interpolations.
Figure 12 shows the results of the experiments with increasing context space dimensions. The required number of epochs to fully track the target trajectories and the final tracking performance stay almost constant for all methods when using the default trajectory representation, as it allows for good generalization of learned behavior. However, the Wasserstein distance between the final context distribution of the curriculum $p_f(c)$ and the target distribution $\mu(c)$ increases with the context space dimension for CURROTA and CURROT. When using the alternative trajectory representation from Section V-C (indicated by (JC) in Figure 12), this increasingly poor convergence to $\mu(c)$ leads to a noticeable performance decrease in the required epochs to completion and final tracking performance for CURROTA. The performance of CURROTAO decreases only slightly, as the convergence of $p_f(c)$ to $\mu(c)$ seems unaffected by higher-dimensional context spaces. As discussed in the main paper, CURROT does not allow for good learning with the alternative trajectory representation, rendering the observed tracking performance rather uninformative, as it is computed on partially tracked trajectories. The presented results highlight the importance of the improved optimization scheme implemented in CURROTAO, which was not obvious from the experiments in the main paper.