High Acceleration Reinforcement Learning for
Real-World Juggling with Binary Rewards
Kai Ploeger, Michael Lutter, Jan Peters
Computer Science Department, Technical University of Darmstadt
{kai, michael, jan}@robot-learning.de
Abstract: Robots that can learn in the physical world will be important to enable robots to escape
their stiff and pre-programmed movements. For dynamic high-acceleration tasks, such as juggling,
learning in the real world is particularly challenging, as one must push the limits of the robot and its
actuation without harming the system, amplifying the necessity of sample efficiency and safety for
robot learning algorithms. In contrast to prior work, which mainly focuses on the learning algorithm,
we propose a learning system that directly incorporates these requirements in the design of the policy
representation, initialization, and optimization. We demonstrate that this system enables the high-speed
Barrett WAM manipulator to learn juggling two balls from 56 minutes of experience with a binary
reward signal. The final policy juggles continuously for up to 33 minutes or about 4500 repeated
catches. The videos documenting the learning process and the evaluation can be found at
https://sites.google.com/view/jugglingbot.
Keywords: Reinforcement Learning, Dynamic Manipulation, Juggling
1 Introduction
Robot learning is one promising approach to overcome the stiff and pre-programmed movements
of current robots. When learning a task, the robot autonomously explores different movements and
improves its behavior using scalar rewards. In recent years, research has focused largely on improving
task-agnostic deep reinforcement learning (DRL) algorithms by changing either the optimization
[1, 2], the simulation to use perturbed physics parameters [3, 4], or the task to gradually increase
complexity [5]. While these approaches have propelled learning robots to very complex domains in
simulation, ranging from full-body control of humanoids [6] to control of dexterous hands [7, 8],
most of these approaches are not applicable to learning on physical systems as they neglect the intricate
complexities of the real world. On the physical system, the learning is constrained to real-time and
a single instance. Hence, the learning must not damage the robot with jerky actions and must be
sample efficient.
Consider the high-acceleration task of juggling two balls next to each other with a single
anthropomorphic manipulator. The manipulator is required to throw a ball upwards, move to the right,
catch and throw the second ball, and return to the left in time to catch the first ball. To sustain
this cyclic juggling pattern, the robot must always throw the ball sufficiently vertically and maintain
precise timing. Therefore, this task pushes the limits of the robot to achieve the required high
accelerations (of up to 8g), while maintaining precise control of the end-effector and the safety of the
physical system. The task is not only inherently difficult to master¹ but also requires learning on the
physical system. Real-world experience is required as simulation models, while good at simulating
contact-free rigid bodies, cannot represent the non-linear effects close to the torque limits of the
actuators and the highly dynamic contacts between end-effector and balls. For the tendon-driven
Barrett WAM, rigid-body simulators also cannot model the dominating cable dynamics at high
accelerations. Hence, simulation-based solutions cannot be transferred for very dynamic tasks given
our current simulators. For this specific task even robot to robot transfer between robots of the same
Equal contribution
1For reference, the untrained human jugglers of the lab achieve 2 repeated catches and improve to about 20
repeated catches after a few hours of training.
4th Conference on Robot Learning (CoRL 2020), Cambridge MA, USA.
arXiv:2010.13483v3 [cs.RO] 31 Oct 2020
(a) (b) (c) (d) (e)
Figure 1: The juggling movement consisting of four separate movements, which are repeated to
achieve juggling of two balls with a single anthropomorphic manipulator: (a) throw ball 1, (b) catch
ball 2, (c) throw ball 2, (d) catch ball one, (e) repeat.
model is not possible. The learned policy fails immediately when transferred to a different Barrett
WAM. Therefore, the optimal policy depends on the exact robot instance and must be learned on the
individual robot. The high accelerations amplify the safety and sample efficiency requirements for
learning on the physical system as collisions at high velocities severely damage the system. Further-
more, the high acceleration movements induce a lot wear and tear and cannot be executed for days.
Therefore, high acceleration tasks are an ideal problem to test the limitations of current learning
algorithms on the physical system.
To emphasize important considerations for building a real-world learning system for high-acceleration
tasks, we describe our robot learning setup for learning to juggle and the system design decisions we
made. We limit ourselves to existing methods to focus solely on the requirements for the real-world
application of current learning approaches. This is in contrast to many prior works, which mainly
propose new policy representations or policy optimizations and demonstrate that the new algorithm
can be applied to the physical system. Instead, we focus on a single challenging task and evaluate
which existing representations and algorithms can be used to build a safe and sample-efficient robot
learning system. We also highlight limitations of learning approaches which, given the current
knowledge, we do not dare to apply to the physical system for robot juggling. Afterwards, we describe
one approach to solve robot toss juggling with two balls and an anthropomorphic manipulator and
validate the chosen approach in the real world. The chosen approach, in our opinion, optimally
combines engineering expertise and learning to obtain a reliable, safe, and sample-efficient solution
for this task.
Our contribution is (1) the application of robot learning to the challenging task of single-arm toss
juggling with two balls and (2) highlighting the challenges of applying a learning system in the
physical world for a high-acceleration task. In the following, we first cover the prior work on robot
juggling and learning dynamical tasks on the physical robot in Section 2. Afterwards, we compare
different approaches to learn real-world toss juggling in Section 3 and describe the implemented
learning system in Section 4. The experimental setup and results are presented in Section 5.
2 Related Work
2.1 Robot Juggling
For decades, robot juggling has been used to showcase the ingenuity of mechanical system design and
control engineering, as listed in Table 1. Starting with Claude Shannon in the 1970s, many different
juggling machines were built.² For example, the Shannon juggler used cups to bounce up to five balls
off a surface [9], the devil-sticking machine stabilized a stick in the air [10, 11], and paddle jugglers,
built from designated hardware [12, 9] or by attaching tennis rackets to manipulators [13, 14, 15, 16, 17,
18, 19], juggled multiple balls using paddling. Toss jugglers were built using manipulators to juggle
one [20,21] or two balls [22] and even humanoids were used to juggle up to three balls using two
arms [23,24]. Most of these approaches proposed new engineering solutions for movements and
controllers, showing that these systems achieve juggling when the control parameters are manually
fine-tuned. Only a few approaches used supervised learning for model learning [14, 11], behavioral
cloning [18], or evolutionary strategies with dense fitness functions [25] to achieve paddle juggling
and devil sticking. We build upon the vast experience on end-effector and controller design for
robot juggling, but in contrast to the prior work, we demonstrate, to the best of our knowledge, the
first robot learning system that learns toss juggling with two balls and a single anthropomorphic
manipulator in the real world using only binary rewards. Using this approach, we achieve juggling
of up to 33 minutes and high repeatability between trials.

² A historic overview of robot juggling is available at https://www.youtube.com/watch?v=2ZfaADDlH4w.
Table 1: Prior work on different types of robot juggling

Juggling Type     Approach                  Papers
Devil Sticking    Engineered                [9]
Devil Sticking    Model Learning            [11]
Paddle Juggling   Engineered                [15, 16, 9, 17, 12, 19]
Paddle Juggling   Imitation                 [18]
Paddle Juggling   Model Learning            [14]
Paddle Juggling   Evolutionary Strategies   [25]
Toss Juggling     Engineered                [22, 24, 23, 21, 20]
Toss Juggling     Reinforcement Learning    [Ours]
2.2 Learning Dynamical Tasks on the Physical Robot
Despite the recent surge of deep reinforcement learning algorithms for controlling robots, most of
these approaches are constrained to learn in simulation due to sample complexity and the high risk
of catastrophic policies. Only relatively few DRL approaches have been applied to physical robots,
e.g., robot manipulators [26,27,8,28,29,30] or mobile robots [31,32]. Most of the work for
manipulators focuses on non-dynamic tasks. Only Schwab et al. [30] and Büchler et al. [33] applied
DRL to learn the dynamic tasks of Ball-in-a-Cup or robot table tennis. In [30], the authors
built a fully automated environment to achieve large-scale data collection and engineered classical
safety mechanisms to avoid damaging the physical system. Using the safe and automated environment,
SAC-X was able to learn Ball-in-a-Cup from raw pixels within three days [30]. Most other
approaches for learning dynamical tasks on the physical system use more classical robot learning
techniques. These algorithms combine engineering and task knowledge with learning to achieve
sample-efficient and safe learning that does not require completely safe and fully automated
environments. For example, they combine model learning with trajectory optimization/model predictive
control [34, 8, 11], or model-free reinforcement learning with engineered policy representations, expert
policy initialization, and dense rewards [35, 36, 18, 37, 38]. In our work, we extend the classical
robot learning approach to a robot learning system that learns the high acceleration task of juggling
with designed feature representations, expert policy initialization, and binary rewards instead of
dense rewards. We also discuss the necessary design decisions that incorporate engineering and task
expertise to achieve safety and sample efficiency. This focus is in contrast to most prior work as
these mostly highlight the details of the learning algorithms but not the many engineering details
that enable learning on the physical system.
3 System Design for High-Acceleration Tasks
Designing a robot learning system for robot juggling can be approached from different learning
paradigms with different benefits. In the following, we briefly discuss these trade-offs to motivate
our approach presented in Section 4.
3.1 Model-based vs. Model-free Learning
To minimize the wear and tear during the high acceleration movements and minimize the manual
resets of picking up the balls, one desires to learn with minimal trials. Commonly, model-based
reinforcement learning (MBRL) methods are much more sample efficient than model-free
reinforcement learning (MFRL) [39, 40]. However, MBRL requires learning an accurate model
describing the system such that the optimal policy transfers to the real system. For the considered
task of robot juggling, the model would need to accurately describe the rigid-body dynamics of the
manipulator, the stiction of the cable-drives, the contacts of the end-effectors with the balls, and the
ball-movement. The model would also need to be robust to out-of-distribution model exploitation
to avoid optimizing a spurious and potentially harmful solution. Out-of-distribution exploitation
is especially challenging for deep networks as the networks do not generalize well to previously
unexplored states and this may result in unpredictable behaviors potentially damaging the system.
We are not aware of a model-learning approach that can capture the different phenomena, ranging
from multi-point contacts to low-level stiction, with sufficiently high fidelity and is robust to
out-of-distribution generalization. Therefore, we resort to an MFRL approach. Besides the theoretical
aspects, the practical implementation of observing the necessary quantities to construct such accurate
models is challenging by itself, e.g., the collisions of the ball and end-effector are obscured by the
end-effector and hence not observable.
3.2 Open-Loop Policy vs. Closed-Loop Policy
Closed-loop policies are favorable for robot juggling as the interactions between the end-effector
and ball are not consistent. A closed-loop policy could recover from these irregularities. Learning a
closed-loop policy for juggling is non-trivial as one closes the loop on noisy ball observations. The
noisy observations or potential outliers might cause the system to become unstable. The unstable
behavior might cause the robot to hit its joint limits with high momentum and severely damage the
system. One would need to guarantee that the closed-loop policy is stable for all possible observations,
including many edge cases. For the predominant closed-loop deep network controllers of deep
reinforcement learning, stable behavior cannot be guaranteed, especially as the out-of-distribution
predictions of networks are undefined. Currently, to the best of our knowledge, no network
verification process for complex closed-loop systems exists. Therefore, we currently do not dare to
apply a real-time closed-loop network policy to the physical system where the network is expected
to execute accelerations of up to 8g.
Instead, we use an open-loop policy consisting of a desired position trajectory and a tracking
controller. This representation is sufficient for the task, as well-trained jugglers can juggle basic
patterns blindfolded, and prior literature [9, 11, 17] has shown the stability of open-loop juggling
robots. A naive but sufficient safety verification of this policy can be achieved through the stability
of the tracking controller and by enforcing tight box constraints in joint space on the desired
trajectory (see the sketch at the end of this section). For the juggling setup, the box constraints
prevent self-collisions and hitting the joint limits. Hybrid
approaches that adapt the desired trajectory in closed loop w.r.t. the ball observations exist. We also
tried a naive local adaptation of the desired trajectory using a task-space controller, but this adaptation
even reduced system performance. The system would adapt to the balls during catching but could
not throw straight afterwards. As the open-loop policy already achieved high repeatability and long
juggling duration, we did not investigate more complex hybrid policies further.
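To make the naive safety verification concrete, the following is a minimal Python/numpy sketch of the
box-constraint check described above: the desired cubic-spline trajectory is sampled densely and a
candidate parameter vector is rejected before execution if any joint position leaves a conservative
joint-space box. The limits and the margin are hypothetical placeholders, not the Barrett WAM
specification.

```python
import numpy as np

def violates_box_constraints(q_ref, q_min, q_max, margin=0.05):
    """Return True if any sample of the desired joint trajectory leaves the
    allowed joint-space box, shrunk by a safety margin (rad).

    q_ref        : (T, n_dof) desired joint positions sampled along the spline
    q_min, q_max : (n_dof,) conservative joint limits
    """
    lower = q_min + margin
    upper = q_max - margin
    return bool(np.any(q_ref < lower) or np.any(q_ref > upper))

# Hypothetical limits for a 4-DoF arm (illustrative values only).
q_min = np.array([-2.6, -2.0, -2.8, -0.9])
q_max = np.array([ 2.6,  2.0,  2.8,  3.1])

# q_ref would be obtained by evaluating the cubic splines of a sampled
# parameter vector on a fine time grid before sending it to the robot.
q_ref = np.zeros((500, 4))  # placeholder trajectory
assert not violates_box_constraints(q_ref, q_min, q_max)
```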
4 System Implementation
4.1 Policy Representation
The probabilistic policy is defined by a normal distribution N(θ; μ, Σ) over via-points in joint
space. Each parameter θ = [q_0, ..., q_N, q̇_0, ..., q̇_N, t_0, ..., t_N] corresponds to a possible juggling
movement consisting of a via-point sequence. Each via point is defined by the position q_i, velocity
q̇_i, and duration t_i. To execute the movement, the via-points are interpolated using cubic splines
and tracked by a PD controller with gravity compensation. Therefore, the motor torques at each
time step are computed by
\tau = K_P \left( q_{\text{ref}} - q \right) + K_D \left( \dot{q}_{\text{ref}} - \dot{q} \right) + g(q)

with

q_{\text{ref}}(t) = \sum_{j=0}^{3} a_{i,j}\, (t - t_{i,0})^{j}, \qquad
\dot{q}_{\text{ref}}(t) = \sum_{j=1}^{3} j\, a_{i,j}\, (t - t_{i,0})^{j-1}, \qquad
t_{i,0} = \sum_{k=0}^{i-1} t_k,

the control gains K_P and K_D, and the j-th parameter of the i-th spline a_{i,j}. The gains are set to be
low compared to industrial manipulators to achieve smooth movements. The spline parameters are
computed from the via points by
a_{i,0} = q_i, \qquad a_{i,1} = \dot{q}_i, \qquad
a_{i,2} = 3\, (q_{i+1} - q_i)\, t_i^{-2} - (\dot{q}_{i+1} + 2\dot{q}_i)\, t_i^{-1}, \qquad
a_{i,3} = 2\, (q_i - q_{i+1})\, t_i^{-3} + (\dot{q}_{i+1} + \dot{q}_i)\, t_i^{-2}.
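The following is a minimal numpy sketch of the interpolation and tracking described above: the
cubic-spline coefficients are computed per segment from the via points, and the PD controller with
gravity compensation tracks the resulting reference. The gains K_P, K_D and the gravity model g(q)
are placeholders, as their values are not specified here.

```python
import numpy as np

def spline_coefficients(q_i, q_ip1, qd_i, qd_ip1, t_i):
    """Cubic-spline coefficients a_0..a_3 for one segment between two via
    points (element-wise over the joint dimensions)."""
    a0 = q_i
    a1 = qd_i
    a2 = 3.0 * (q_ip1 - q_i) / t_i**2 - (qd_ip1 + 2.0 * qd_i) / t_i
    a3 = 2.0 * (q_i - q_ip1) / t_i**3 + (qd_ip1 + qd_i) / t_i**2
    return a0, a1, a2, a3

def reference(coeffs, tau):
    """Desired position and velocity at time tau after the segment start t_{i,0}."""
    a0, a1, a2, a3 = coeffs
    q_ref = a0 + a1 * tau + a2 * tau**2 + a3 * tau**3
    qd_ref = a1 + 2.0 * a2 * tau + 3.0 * a3 * tau**2
    return q_ref, qd_ref

def pd_gravity_torque(q, qd, q_ref, qd_ref, K_p, K_d, gravity):
    """tau = K_P (q_ref - q) + K_D (qd_ref - qd) + g(q), with diagonal gain
    matrices K_p, K_d and a gravity model supplied by the robot (placeholder)."""
    return K_p @ (q_ref - q) + K_d @ (qd_ref - qd) + gravity(q)
```

At run time, the controller evaluates the coefficients of the currently active segment at each time step
and applies the resulting torque.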
We initialize the parameters with expert demonstrations to reduce the sample complexity. Especially
in the case of binary rewards, such an initialization is required as the reward signal is sparse. Most
random initializations would not make any contact with the balls. Hence, the RL algorithm could not
infer any information about the desired task. Instead of using kinesthetic demonstrations [36, 35, 18,
37, 38], we directly initialize the interpretable via points manually. This initialization is preferable
for robot juggling because the human demonstrator cannot achieve the necessary accelerations using
kinesthetic teaching.
The desired juggling movement for two balls consists of four repeated movements, i.e., (a) throwing
the first ball, (b) catching the second ball, (c) throwing the second ball, and (d) catching the first ball
(Fig. 1). We define the switching points between these movements as the via-points of the policy and,
to achieve a limit cycle, we keep repeating these via-points. The cyclic pattern is prepended with
an initial stroke movement that quickly enters the limit cycle without dropping a ball. Applying the
limit cycle's PD references from the start would result in a bad first throw. Furthermore, we enforce
symmetry of the cyclic pattern and zero velocity at the via points, reducing the effective
dimensionality from 48 open parameters to only 21.
4.2 Policy Optimization
The policy optimization is framed as an episodic reinforcement learning problem, sampling a single
policy parameter θ_i ∼ N(μ, Σ) per roll-out and evaluating the episodic reward. This framing is
identical to a bandit setting with high-dimensional and continuous actions. For the physical system,
the episodic exploration is favorable over step-based exploration because it yields smooth and stable
action sequences given the policy representation described above. To optimize the policy parameters,
we use a variant of the information-theoretic policy search approach episodic Relative Entropy Policy
Search (eREPS) [41, 42]. Our eREPS variation not only limits the Kullback-Leibler (KL) divergence
when computing the sample weights but also enforces a reverse KL divergence constraint when
updating the policy parameters. This optimization can be solved in closed form for Gaussian
distributions, as described in Abdolmaleki et al. [43]. We use eREPS instead of EM-based [36, 35]
or gradient-based [44, 45] policy search as the KL constraint prevents premature convergence and
large, possibly unsafe, jumps of the policy mean.³
Let the optimization problem of eREPS be defined as
\pi_{k+1} = \arg\max_{\pi} \int \pi(\theta)\, R(\theta)\, d\theta, \quad \text{s.t.} \quad d_{\mathrm{KL}}(\pi_{k+1}\,||\,\pi_k) \leq \epsilon

with the updated policy π_{k+1}, the episodic reward R, and the KL divergence d_KL. With the additional
constraint of π being a probability distribution, this optimization problem can be solved by first
optimizing the dual to obtain the optimal Lagrangian multiplier η and then fitting the new policy
using weighted maximum likelihood. The sample-based optimization of the dual is described by

\eta^{*} = \arg\min_{\eta}\; \eta\epsilon + \eta \log \sum_{i=0}^{N} \pi_k(\theta_i)\, \exp\!\left( R(\theta_i)/\eta \right).
The optimization of the weighted likelihood L to obtain the updated policy is described by

\mu_{k+1}, \Sigma_{k+1} = \arg\max_{(\mu, \Sigma)} \sum_{i=0}^{N} w_i \log\!\big(\mathcal{L}(\mu, \Sigma \,|\, \theta_i)\big) \quad \text{s.t.} \quad d_{\mathrm{KL}}(\pi_k\,||\,\pi_{k+1}) \leq \epsilon

with the weights w_i = \exp(R(\theta_i)/\eta) / \sum_j \exp(R(\theta_j)/\eta). We incorporate the reverse KL constraint
into the optimization to guarantee that the policy parameters adhere to the KL constraint. In the
original eREPS formulation only the KL divergence between the evaluated samples is constrained
but not the KL divergence between the parametrized policies. Especially for high-dimensional problems
and few sample evaluations, the KL divergence between the policies can be larger than ε without this
explicit constraint. The reverse KL divergence is used as this formulation enables solving the policy
update in closed form for Gaussian distributions. For the multivariate Gaussian policy distribution
this update is described by
\mu_{k+1} = \frac{\xi \mu_k + \mu_s}{1 + \xi}, \qquad
\Sigma_{k+1} = \frac{\Sigma_s + \xi\, \Sigma_k + \xi\, (\mu_{k+1} - \mu_k)(\mu_{k+1} - \mu_k)^{T}}{1 + \xi},

\mu_s = \sum_{i=0}^{N} w_i\, \theta_i, \qquad
\Sigma_s = \sum_{i=0}^{N} w_i\, (\theta_i - \mu_{k+1})(\theta_i - \mu_{k+1})^{T}

with the optimal Lagrangian multiplier ξ. This optimal multiplier can be obtained by solving

\xi^{*} = \arg\max_{\xi}\; -\log(|\Sigma_{k+1}|) - \sum_{i=0}^{N} w_i\, (\theta_i - \mu_{k+1})^{T} \Sigma_{k+1}^{-1} (\theta_i - \mu_{k+1})
+ \xi \left( 2\epsilon - \mathrm{tr}\!\left(\Sigma_{k+1}^{-1}\Sigma_k\right) + n - \log\frac{|\Sigma_{k+1}|}{|\Sigma_k|} - (\mu_{k+1} - \mu_k)^{T} \Sigma_{k+1}^{-1} (\mu_{k+1} - \mu_k) \right).

³ Further information about reinforcement learning for robotics can be found in the surveys Kober et al. [46]
and Deisenroth et al. [41].

Figure 2: (a) Mechanical design of the juggling end-effector, the Russian-style juggling balls, and
the ball launcher. The end-effector consists of a funnel to achieve precise control of the throwing
movement. The Russian-style juggling balls are partially filled with sand to avoid bouncing. (b) The
MuJoCo juggling environment used for simulating the rigid-body dynamics.
The complete derivation can be found in the appendix and the source code of eREPS is available at
https://github.com/hanyas/rl.
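Below is a minimal numpy sketch of one eREPS update as described above, under a few stated
simplifications: since the parameters are sampled from π_k, the dual is evaluated with uniform sample
weights, and instead of solving the Lagrangian for ξ with a gradient method (Appendix A), ξ is found
by bisection until the reverse KL constraint between consecutive Gaussian policies is satisfied, which
yields the same kind of KL-bounded step. This is a sketch, not the authors' implementation (available
at https://github.com/hanyas/rl).

```python
import numpy as np
from scipy.optimize import minimize

def reps_weights(R, eps):
    """Optimize the sample-based dual for the temperature eta and return the
    normalized weights w_i proportional to exp(R_i / eta)."""
    R = np.asarray(R, dtype=float)
    R_shift = R - R.max()                       # numerical stability, argmin unchanged

    def dual(log_eta):
        eta = 1e-8 + np.exp(log_eta)            # enforce eta > 0
        return eta * eps + eta * np.log(np.mean(np.exp(R_shift / eta)))

    res = minimize(lambda x: dual(x[0]), x0=[0.0], method="Nelder-Mead")
    eta = 1e-8 + float(np.exp(res.x[0]))
    w = np.exp(R_shift / eta)
    return w / w.sum(), eta

def kl_gaussians(mu0, S0, mu1, S1):
    """d_KL( N(mu0, S0) || N(mu1, S1) ) for multivariate Gaussians."""
    n = mu0.shape[0]
    S1_inv = np.linalg.inv(S1)
    d_mu = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) - n + logdet1 - logdet0
                  + d_mu @ S1_inv @ d_mu)

def weighted_update(theta, w, mu_k, S_k, xi):
    """Closed-form KL-regularized weighted maximum-likelihood update,
    i.e. mu_{k+1} and Sigma_{k+1} as functions of the multiplier xi."""
    mu_s = w @ theta
    mu_new = (xi * mu_k + mu_s) / (1.0 + xi)
    diff = theta - mu_new
    S_s = (w[:, None] * diff).T @ diff
    d_mu = mu_new - mu_k
    S_new = (S_s + xi * S_k + xi * np.outer(d_mu, d_mu)) / (1.0 + xi)
    return mu_new, S_new

def ereps_step(theta, R, mu_k, S_k, eps):
    """One policy update: compute weights, then pick the smallest xi whose
    update keeps d_KL(pi_k || pi_{k+1}) below eps (geometric bisection)."""
    w, _ = reps_weights(R, eps)
    lo, hi = 1e-6, 1e6                          # large xi regularizes strongly, so hi is feasible
    for _ in range(60):
        xi = np.sqrt(lo * hi)
        mu_new, S_new = weighted_update(theta, w, mu_k, S_k, xi)
        if kl_gaussians(mu_k, S_k, mu_new, S_new) > eps:
            lo = xi                             # step too large -> increase regularization
        else:
            hi = xi
    return weighted_update(theta, w, mu_k, S_k, hi)
```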
4.3 Reward Function
The reward function assigns a positive binary reward as long as the robot is juggling. Juggling is
defined as keeping both balls at least 60cm above the floor, which is measured using the external
marker tracker OptiTrack. Therefore, the step-based reward is described by
r(t) = \begin{cases} 1 & \text{if } \min_i (b_{i,z}) \geq 0.6 \\ 0 & \text{otherwise} \end{cases}

with the i-th ball height b_{i,z}. This reward signal maximizes the juggling duration and is not
engineered to incorporate any knowledge about the desired ball or manipulator trajectory. This choice
of reward function is intuitive but also uninformative as a bad action only causes a delayed negative
reward. For example, a bad action within the throwing will cause a zero reward - a drop - seconds
after the action. This delay between action and reward, i.e., ’credit assignment’, is a challenge in
many RL problems. The choice of the binary reward functions is in stark contrast to prior work as
most of the previously proposed approaches use more informative dense rewards [30,8,36,25].
The binary rewards are favorable as these rewards do not require tuning a dense reward function. To
specify the dense reward function one would need to predict the optimal ball trajectory and compute
the distance to the optimal trajectory. Especially as the optimal ball trajectory depends on the robot
capabilities and end-effector this prediction is challenging. Furthermore, one would need to align
the initialization of the juggling movement with the optimal ball trajectory. One could possibly
initialize a good juggling movement but that might not fit with the specified dense reward, e.g., the
dense reward prefers higher throws than the initialization. With the binary reward one only needs to
provide a good initialization and does not need to tune the reward parameters. Besides the reward
tuning aspect, evaluating the dense reward is also more challenging compared to evaluating the
binary reward in the real world. For evaluating the dense reward, one would require precise ball
tracking, which is non-trivial due to frequent occlusions by the end-effector. Even though we used
OptiTrack to track the balls, we had a hard time achieving good tracking performance during the
experiments due to the wear and tear on the reflective tape and the frequent occlusions. Every time a
ball is in the end-effector, the ball is not observable.

Figure 3: (a) Mean juggling duration of final policies learned with varying batch sizes N. The maximal
juggling duration is 10 s. (b) Comparison of the learned and hand-tuned policy, each over 30 episodes
with a maximum duration of 120 s on the real system. The learned policy achieves an average juggling
duration of 106.82 s, while the hand-tuned policy achieves 66.52 s.
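For completeness, here is a minimal sketch of the binary step reward defined at the beginning of this
section, together with one possible way to accumulate it into the episodic return. How the step rewards
are summed into R(θ) is an assumption here, since only the step-based reward is specified above.

```python
import numpy as np

def binary_reward(ball_heights_z, threshold=0.6):
    """Step reward: 1 while every ball stays at least `threshold` meters above the floor."""
    return 1.0 if np.min(ball_heights_z) >= threshold else 0.0

def episodic_return(height_trajectory):
    """Assumed accumulation: sum the step rewards until the first drop, so the
    return is proportional to the juggling duration."""
    total = 0.0
    for heights in height_trajectory:   # heights: z-positions of both balls at one time step
        step = binary_reward(heights)
        if step == 0.0:
            break                       # a dropped ball ends the juggle
        total += step
    return total
```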
5 Experiments
5.1 Experimental Setup
The experiments are performed with a Barrett WAM to achieve the high accelerations required for
robot juggling. The 4 degrees of freedom (DoF) variant is used rather than the 7 DoF version to allow
for more precise control at high velocities and to save weight. To achieve high repeatability, we use
75 mm Russian-style juggling balls that dissipate the kinetic energy of the balls and prevent them
from bouncing out of the end-effector. These balls consist of a hollow plastic shell partially filled
with 37.5g of sand (Figure 2a). The hard ball shells also result in more accurate throws compared to
traditional beanbag juggling balls [21,20], as they do not deform. As successfully described in prior
toss juggling approaches [24,23,21,20], the funnel-shaped end-effector passively compensates for
minor differences in ball trajectories by sliding the balls from the edge of the 170mm opening to
the center. The second ball is released via a launching mechanism 3m above the floor, shown in
Figure 2a, to achieve consistent height, velocity, and timing. This mechanism releases the ball by
pushing the ball to an opening using a piston attached to a pulley system. The release of the ball is
detected using OptiTrack to start the episode.
5.2 Simulation Studies
The initial validation of the proposed learning system is performed in MuJoCo to evaluate the
convergence for different numbers of roll-outs per episode across random seeds. Figure 3a shows the
juggling duration distribution of the final policy averaged over 60 different seeds at 10, 25, and 50
roll-outs per episode. For 10 roll-outs per episode, the learning system frequently converges to a
sub-optimal final policy, which does not achieve consistent juggling for 10 seconds. The probability of
converging to a good policy is highest for 25 roll-outs per episode and hence, we use 25 roll-outs per
episode on the physical system. It is important to point out that learning juggling in simulation is
actually more challenging than in the real world, as the stabilizing passive dynamics of the system
could not be modeled accurately.
5.3 Learning on the real Barrett WAM
For learning on the physical Barrett WAM, 20 episodes were performed. During each episode,
25 randomly sampled parameters were executed and the episodic reward was evaluated. If the robot
successfully juggles for 10 s, the roll-out is stopped. Roll-outs that were corrupted due to obvious
environment errors were repeated using the same parameters. Roll-outs with minor variations caused
by the environment initialization were not repeated. After collecting the samples, the policy was
updated using eREPS with a KL constraint of 2. The learning progress is shown in Figure 4. Initially,
the robot on
average achieves between 1 and 3 repeated catches. These initial catches are important, as the robot
would otherwise not receive any meaningful information. Therefore, the rudimentary expert
initialization must achieve some repeated catches to ensure fast learning. After the initial episodes,
the robot rapidly improves the juggling duration. Starting from episode 10, the robot achieves
consistent juggling of 10 seconds and only very few balls are dropped during the start. During the
next 10 episodes, the average reward oscillates as the number of dropped balls varies, but the robot
achieves successful completion for the other trials. At episode 20, the robot achieves perfect juggling
for all 25 randomly sampled parameters. Videos documenting the learning process and the evaluation
can be found at https://sites.google.com/view/jugglingbot.

Figure 4: Reward distributions while learning to juggle on the physical system. Each point corresponds
to a single roll-out on the system. Starting from the expert initialization, which only achieves juggling
for a few repeated catches, the robot continuously improves using only binary rewards. After 56 minutes
of training, the final policy achieves consistent juggling for more than 10 s.
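As a hypothetical sketch of the outer learning loop just described (20 episodes of 25 sampled parameter
vectors, eREPS with a KL bound of 2), the following reuses the `ereps_step` sketch from Section 4.2.
`execute_rollout` is a placeholder for interpolating and executing a sampled via-point parameter vector
on the robot (or in MuJoCo) and returning its episodic reward.

```python
import numpy as np

def train(mu_0, Sigma_0, execute_rollout, n_episodes=20, n_rollouts=25, kl_bound=2.0):
    """Episodic policy search loop as used on the physical system (sketch)."""
    mu, Sigma = mu_0.copy(), Sigma_0.copy()
    for episode in range(n_episodes):
        thetas = np.random.multivariate_normal(mu, Sigma, size=n_rollouts)
        rewards = np.array([execute_rollout(theta) for theta in thetas])
        mu, Sigma = ereps_step(thetas, rewards, mu, Sigma, eps=kl_bound)
        print(f"episode {episode + 1}: mean reward {rewards.mean():.1f}")
    # The deterministic policy mean is executed for the final evaluation.
    return mu, Sigma
```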
To test the repeatability and stability of the learned policy, the deterministic policy mean of episode
20 is executed for 30 repeated roll-outs with a maximum duration of 120 seconds. The achieved
performance is compared to a hand-tuned policy in Figure 3b. Averaging 106.82 s, the learned policy
performs significantly better than the hand-tuned policy with 66.51 s. The weaker performance of
the hand-tuned policy within the stroke-based initiation movement matches our expectations, as tuning
the stroke-based movement is the hardest part of the manual parameter tuning. Neither policy achieves
perfect repeatability due to the residual stochasticity of the environment.
To test the stability of the learned policy, the juggling was repeatedly executed and the maximum
juggling duration recorded. The learned policy achieved juggling for 33 minutes on its second try,
which corresponds to more than 4500 repeated catches, after 15 minutes on the first try. The high
number of repeated catches highlights the precision of the Barrett WAM, the end-effector design,
and both policies. Once the juggling is initiated successfully, the policies can recover from minor
variations due to the passive stability induced by the end-effector design.
6 Conclusion
We described a robot learning system capable of learning toss juggling of two balls with a single
anthropomorphic manipulator using only binary rewards. We demonstrated that our system can learn
this high-acceleration task within 56 minutes of experience by utilizing sufficient engineering and task
knowledge in the design of the robot learning system. Starting from a rudimentary expert initialization,
the system consistently improves until achieving repeated juggling of up to 33 minutes, which
corresponds to more than 4500 repeated catches. Furthermore, the learned policy outperforms a
hand-tuned policy in terms of repeatability and achieves a significantly higher average reward across
30 trials. In addition, we highlighted and discussed the incorporated engineering and task expertise
required to make learning on the physical system viable. This discussion should help future scientists
and practitioners to address the needs of a physical system when building future robot learning systems.
Nevertheless, this approach also pointed out the shortcomings of state-of-the-art robot learning
approaches for learning dynamic tasks on the physical system. Despite the incorporated engineering
and task knowledge, the learning still takes up to 5 hours and hence reiterates the necessity for more
sample-efficient representations and learning approaches for sparse and binary rewards.
Acknowledgments
This project has received funding from the European Union’s Horizon 2020 research and innovation
program under grant agreement No #640554 (SKILLS4ROBOTS). Furthermore, this research was
also supported by grants from ABB and NVIDIA.
References
[1] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization
algorithms. arXiv preprint arXiv:1707.06347, 2017.
[2] S. Fujimoto, H. Van Hoof, and D. Meger. Addressing function approximation error in actor-
critic methods. arXiv preprint arXiv:1802.09477, 2018.
[3] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron,
M. Plappert, G. Powell, A. Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint
arXiv:1808.00177, 2018.
[4] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomiza-
tion for transferring deep neural networks from simulation to the real world. In International
Conference on Intelligent Robots and Systems (IROS), 2017.
[5] P. Klink, H. Abdulsamad, B. Belousov, and J. Peters. Self-paced contextual reinforcement
learning. Conference on Robot Learning (CoRL), 2019.
[6] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. Eslami,
M. Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint
arXiv:1707.02286, 2017.
[7] I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino,
M. Plappert, G. Powell, R. Ribas, et al. Solving rubik’s cube with a robot hand. arXiv preprint
arXiv:1910.07113, 2019.
[8] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar. Deep dynamics models for learning
dexterous manipulation. arXiv preprint arXiv:1909.11652, 2019.
[9] S. Schaal and C. G. Atkeson. Open loop stable control strategies for robot juggling. In Inter-
national Conference on Robotics and Automation (ICRA), 1993.
[10] S. Schaal and C. G. Atkeson. Robot juggling: An implementation of memory-based learning.
1994.
[11] S. Schaal and C. G. Atkeson. Robot juggling: implementation of memory-based learning.
IEEE Control Systems, 1994.
[12] P. Reist and R. D’Andrea. Bouncing an unconstrained ball in three dimensions with a blind
juggling robot. In International conference on Robotics and Automation (ICRA), 2009.
[13] E. W. Aboaf, C. G. Atkeson, and D. J. Reinkensmeyer. Task-level robot learning. In Interna-
tional Conference on Robotics and Automation (ICRA), 1988.
[14] E. W. Aboaf, S. M. Drucker, and C. G. Atkeson. Task-level robot learning: Juggling a tennis
ball more accurately. In International Conference on Robotics and Automation (ICRA), 1989.
[15] A. A. Rizzi, L. L. Whitcomb, and D. E. Koditschek. Distributed real-time control of a spatial
robot juggler. Computer, 1992.
[16] A. A. Rizzi and D. E. Koditschek. Further progress in robot juggling: The spatial two-juggle.
Departmental Papers (ESE), 1993.
[17] S. Schaal, D. Sternad, and C. G. Atkeson. One-handed juggling: A dynamical approach to a
rhythmic movement task. 1996.
[18] J. Kober, K. Mülling, O. Krömer, C. H. Lampert, B. Schölkopf, and J. Peters. Movement
templates for learning of hitting and batting. In International Conference on Robotics and
Automation (ICRA), 2010.
[19] M. Müller, S. Lupashin, and R. D’Andrea. Quadrocopter ball juggling. In International
Conference on Intelligent Robots and Systems (IROS), 2011.
[20] T. Sakaguchi, Y. Masutani, and F. Miyazaki. A study on juggling tasks. In International
Workshop on Intelligent Robots and Systems (IROS), 1991.
[21] T. Sakaguchi, M. Fujita, H. Watanabe, and F. Miyazaki. Motion planning and control for a
robot performer. In International Conference on Robotics and Automation (ICRA), 1993.
[22] T. Kizaki and A. Namiki. Two ball juggling with high-speed hand-arm and high-speed vision
system. In International Conference on Robotics and Automation (ICRA), 2012.
[23] M. Riley and C. G. Atkeson. Robot catching: Towards engaging human-humanoid interaction.
Autonomous Robots (AURO), 2002.
[24] J. Kober, M. Glisson, and M. Mistry. Playing catch and juggling with a humanoid robot. In
International Conference on Humanoid Robots (Humanoids), 2012.
[25] S. Schaal and D. Sternad. Learning passive motor control strategies with genetic algorithms,
pages 913–918. Addison-Wesley, Redwood City, CA, 1993.
[26] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700
robot hours. In 2016 IEEE international conference on robotics and automation (ICRA), pages
3406–3413. IEEE, 2016.
[27] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination
for robotic grasping with deep learning and large-scale data collection. The International
Journal of Robotics Research (IJRR), 2018.
[28] T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. A. Ojea, E. Solowjow, and
S. Levine. Residual reinforcement learning for robot control. In International Conference on
Robotics and Automation (ICRA), pages 6023–6029. IEEE, 2019.
[29] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly,
M. Kalakrishnan, V. Vanhoucke, et al. Qt-opt: Scalable deep reinforcement learning for vision-
based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
[30] D. Schwab, T. Springenberg, M. F. Martins, T. Lampe, M. Neunert, A. Abdolmaleki, T. Hertweck,
R. Hafner, F. Nori, and M. Riedmiller. Simultaneously learning vision and feature-based
control policies for real-world ball-in-a-cup. arXiv preprint arXiv:1902.04706, 2019.
[31] G. Kahn, P. Abbeel, and S. Levine. Badgr: An autonomous self-supervised learning-based
navigation system. arXiv preprint arXiv:2002.05700, 2020.
[32] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy
deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
[33] D. Büchler, S. Guist, R. Calandra, V. Berenz, B. Schölkopf, and J. Peters. Learning to play
table tennis from scratch using muscular robots. arXiv preprint arXiv:2006.05935, 2020.
[34] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng. An application of reinforcement learning to
aerobatic helicopter flight. In Advances in Neural Information Processing Systems (NeurIPS).
2007.
[35] P. Kormushev, S. Calinon, and D. G. Caldwell. Robot motor skill coordination with EM-based
reinforcement learning. In International Conference on Intelligent Robots and Systems (IROS),
2010.
[36] J. Kober and J. R. Peters. Policy search for motor primitives in robotics. In Advances in Neural
Information Processing Systems (NeurIPS), 2009.
[37] J. Kober, E. Öztop, and J. Peters. Reinforcement learning to adjust robot movements to new
situations. In International Joint Conference on Artificial Intelligence (IJCAI), 2011.
[38] K. Mülling, J. Kober, O. Kroemer, and J. Peters. Learning to select and generalize striking
movements in robot table tennis. The International Journal of Robotics Research (IJRR),
2013.
[39] M. Deisenroth and C. E. Rasmussen. Pilco: A model-based and data-efficient approach to
policy search. In International Conference on machine learning (ICML), 2011.
[40] K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful
of trials using probabilistic dynamics models. In Advances in Neural Information Processing
Systems (NeurIPS), 2018.
[41] M. P. Deisenroth, G. Neumann, J. Peters, et al. A survey on policy search for robotics.
Foundations and Trends® in Robotics, 2013.
[42] J. Peters, K. Mulling, and Y. Altun. Relative entropy policy search. In Conference on Artificial
Intelligence (AAAI), 2010.
[43] A. Abdolmaleki, B. Price, N. Lau, L. P. Reis, and G. Neumann. Deriving and improving
cma-es with information geometric trust regions. In Genetic and Evolutionary Computation
Conference (GECCO), pages 657–664, 2017.
[44] S. M. Kakade. A natural policy gradient. In Advances in neural information processing sys-
tems, pages 1531–1538, 2002.
[45] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.
[46] J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. The
International Journal of Robotics Research (IJRR), 2013.
A KL Constrained Maximum Likelihood Optimization for eREPS
The standard eREPS formulation [41] solves the optimization problem described by
\pi_{k+1} = \arg\max_{\pi} \int \pi(\theta)\, R(\theta)\, d\theta, \quad \text{s.t.} \quad d_{\mathrm{KL}}(\pi_{k+1}\,||\,\pi_k) \leq \epsilon \qquad (1)
via computing the importance weights of each sample and fitting the policy to the weighted samples.
The sample weights are described by
w_i = \frac{\exp\!\left(R(\theta_i)/\eta^{*}\right)}{\sum_{j}\exp\!\left(R(\theta_j)/\eta^{*}\right)}
\quad \text{with} \quad
\eta^{*} = \arg\min_{\eta}\; \eta\epsilon + \eta \log \sum_{i=0}^{N} \pi_k(\theta_i)\, \exp\!\left(R(\theta_i)/\eta\right).
The fitting of the policy to the weighted samples is described by
\mu_{k+1}, \Sigma_{k+1} = \arg\max_{(\mu, \Sigma)} \sum_{i=0}^{N} w_i \log\!\big(\mathcal{L}(\mu, \Sigma \,|\, \theta_i)\big).
In the case of Gaussian policies this optimization can be solved in closed form and the updated
parameters are described by
\mu_{k+1} = \sum_{i=0}^{N} w_i\, \theta_i, \qquad
\Sigma_{k+1} = \sum_{i=0}^{N} w_i\, (\theta_i - \mu_{k+1})(\theta_i - \mu_{k+1})^{T}.
For an in-depth derivation of this approach please refer to [41]. This approach to updating the policy
works well if N ≫ n, with the parameter dimensionality n. In this case, the KL divergence between
two consecutive parametrized policies is smaller than ε. If N ≈ n, the KL divergence between
two consecutive parametrized policies is not necessarily smaller than ε. In this case, the KL
divergence between the weighted and unweighted samples is bounded by ε, but the KL divergence
between the parametrized policies is not. To ensure that the KL divergence between the parametrized
policies is bounded after the maximum likelihood optimization, we change the optimization problem
to include a KL constraint. The constrained objective is described by
\mu_{k+1}, \Sigma_{k+1} = \arg\max_{(\mu, \Sigma)} \sum_{i=0}^{N} w_i \log\!\big(\mathcal{L}(\mu, \Sigma \,|\, \theta_i)\big) \quad \text{s.t.} \quad d_{\mathrm{KL}}(\pi_k\,||\,\pi_{k+1}) \leq \epsilon.
Please note that the order of the KL divergence is switched compared to Equation 1 and that the
KL divergence is not symmetric. We switch from the I-projection to the M-projection because
otherwise this optimization has no closed-form solution for Gaussian policies. For bounding the
distance between two consecutive policies, the reverse KL divergence can be used, as for small KL
divergences both projections are comparable. The constrained optimization problem can be solved
using Lagrangian multipliers as initially derived by [43]. The Lagrangian L for a Gaussian policy is
described by
\mathcal{L} = \sum_{i=0}^{N} w_i \log\!\big(\mathcal{L}(\mu, \Sigma \,|\, \theta_i)\big)
+ \xi \Big( \epsilon - d_{\mathrm{KL}}\big(\mathcal{N}(\mu_k, \Sigma_k)\,||\,\mathcal{N}(\mu_{k+1}, \Sigma_{k+1})\big) \Big)

= -\frac{1}{2} \Big( n \log(2\pi) + \log(|\Sigma_{k+1}|)
+ \sum_{i=0}^{N} w_i\, (\theta_i - \mu_{k+1})^{T} \Sigma_{k+1}^{-1} (\theta_i - \mu_{k+1}) \Big)

\;+\; \frac{\xi}{2} \Big( 2\epsilon - \mathrm{tr}\!\left(\Sigma_{k+1}^{-1}\Sigma_k\right) + n
- \log\frac{|\Sigma_{k+1}|}{|\Sigma_k|} - (\mu_{k+1} - \mu_k)^{T} \Sigma_{k+1}^{-1} (\mu_{k+1} - \mu_k) \Big).
The updates for the mean and covariance can be computed in closed form by setting \nabla_{\mu_{k+1}} \mathcal{L} := 0
and \nabla_{\Sigma_{k+1}} \mathcal{L} := 0, i.e.,

\nabla_{\mu_{k+1}} \mathcal{L} = \Sigma_{k+1}^{-1} \Big( \sum_{i=0}^{N} w_i\, \theta_i + \xi \mu_k - (\xi + 1)\, \mu_{k+1} \Big) := 0

\nabla_{\Sigma_{k+1}} \mathcal{L} = \Sigma_{k+1}^{-1} \Big( \Sigma_s + \xi \Sigma_k + \xi\, (\mu_{k+1} - \mu_k)(\mu_{k+1} - \mu_k)^{T} - (\xi + 1)\, \Sigma_{k+1} \Big) \Sigma_{k+1}^{-1} := 0.
The resulting update rule for the mean and covariance is described by
\mu_{k+1} = \frac{\xi \mu_k + \mu_s}{1 + \xi}, \qquad
\Sigma_{k+1} = \frac{\Sigma_s + \xi\, \Sigma_k + \xi\, (\mu_{k+1} - \mu_k)(\mu_{k+1} - \mu_k)^{T}}{1 + \xi},

\mu_s = \sum_{i=0}^{N} w_i\, \theta_i, \qquad
\Sigma_s = \sum_{i=0}^{N} w_i\, (\theta_i - \mu_{k+1})(\theta_i - \mu_{k+1})^{T}
with the optimal multiplier ξ. The optimal multiplier cannot be obtained in closed form, as μ_{k+1}
and Σ_{k+1} depend on ξ. Hence, ξ must be obtained by solving

\xi^{*} = \arg\max_{\xi}\; -\log(|\Sigma_{k+1}|) - \sum_{i=0}^{N} w_i\, (\theta_i - \mu_{k+1})^{T} \Sigma_{k+1}^{-1} (\theta_i - \mu_{k+1})
+ \xi \left( 2\epsilon - \mathrm{tr}\!\left(\Sigma_{k+1}^{-1}\Sigma_k\right) + n - \log\frac{|\Sigma_{k+1}|}{|\Sigma_k|} - (\mu_{k+1} - \mu_k)^{T} \Sigma_{k+1}^{-1} (\mu_{k+1} - \mu_k) \right)

with gradient-based optimization.