Figure 3
(a) Mean juggling duration of final policies learned with varying batch sizes N. The maximum juggling duration is 10s. (b) Comparison of the learned and hand-tuned policies over 30 episodes each, with a maximum duration of 120s, on the real system. The learned policy achieves an average juggling duration of 106.82s, while the hand-tuned policy achieves 66.51s.
Source publication
Robots that can learn in the physical world will be important to enable robots to escape their stiff and pre-programmed movements. For dynamic high-acceleration tasks, such as juggling, learning in the real-world is particularly challenging as one must push the limits of the robot and its actuation without harming the system, amplifying the necess...
Contexts in source publication
Context 1
... initial validation of the proposed learning system is performed in MuJoCo to evaluate convergence for different numbers of roll-outs per episode across random seeds. Figure 3a shows the juggling duration distribution of the final policy averaged over 60 different seeds at 10, 25, and 50 roll-outs per episode. With 10 roll-outs per episode, the learning system frequently converges to a sub-optimal final policy that does not achieve consistent juggling for the full 10 seconds. ...
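The excerpt above describes a batch-size study: for each N in {10, 25, 50}, an episodic policy-search run is repeated over 60 seeds and the final policy's juggling duration (capped at 10s) is recorded. The sketch below illustrates the shape of such an experiment. It is a minimal sketch, not the paper's implementation: the reward-weighted Gaussian update stands in for whatever episodic policy-search algorithm the paper actually uses, and the simulator stub rollout_duration is a hypothetical placeholder for a MuJoCo rollout.

import numpy as np

MAX_DURATION = 10.0   # s; simulated roll-outs are capped at 10 s
PARAM_DIM = 8         # stand-in for the policy's parameter dimension

def rollout_duration(theta, rng):
    """Dummy simulator stub (illustration only): duration improves as
    the parameters theta approach a fictitious optimum at 1.0."""
    dist = np.linalg.norm(theta - 1.0)
    return float(np.clip(MAX_DURATION - dist + rng.normal(0.0, 0.5),
                         0.0, MAX_DURATION))

def learn(batch_size, rng, episodes=20):
    """One learning run: sample `batch_size` roll-outs per episode and
    apply a simple reward-weighted update to the Gaussian search
    distribution. Returns the deterministic policy mean."""
    mean, std = np.zeros(PARAM_DIM), np.ones(PARAM_DIM)
    for _ in range(episodes):
        thetas = rng.normal(mean, std, size=(batch_size, PARAM_DIM))
        rewards = np.array([rollout_duration(t, rng) for t in thetas])
        w = np.exp(rewards - rewards.max())   # soft-max reward weights
        w /= w.sum()
        mean = w @ thetas                     # weighted mean update
        std = np.sqrt(w @ (thetas - mean) ** 2) + 1e-3
    return mean

# Repeat each batch size over 60 seeds, as in the excerpt above.
durations = {
    n: [rollout_duration(learn(n, np.random.default_rng(s)),
                         np.random.default_rng(s + 10_000))
        for s in range(60)]
    for n in (10, 25, 50)
}
for n, d in durations.items():
    print(f"N={n:>2}: mean final duration {np.mean(d):.2f}s")

The sketch also suggests why small batches can converge sub-optimally: with only 10 roll-outs per episode, the reward-weighted update is driven by very few samples, so the search distribution can collapse prematurely onto a mediocre region, matching the behaviour reported for N=10.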
Context 2
... test the repeatability and stability of the learned policy, the deterministic policy mean of episode 20 is executed for 30 repeated roll-outs with a maximum duration of 120 seconds. The achieved performance is compared to a hand-tuned policy in Figure 3b. Averaging 106.82s, the learned policy performs significantly better than the hand-tuned policy, which averages 66.51s. ...
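A short sketch of how such a repeatability comparison could be evaluated is shown below. The duration arrays are synthetic placeholders, not the paper's measurements, and the rank-based significance test is an assumption of this sketch: the excerpt says "significantly better" but does not name a statistical test.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
MAX_DURATION = 120.0  # s; real-system roll-outs are cut off at 120 s

# Synthetic stand-ins for 30 roll-outs per policy (illustration only;
# means loosely echo the reported averages but the data is fabricated
# for the sketch, not taken from the paper).
learned = np.clip(rng.normal(107.0, 20.0, size=30), 0.0, MAX_DURATION)
hand_tuned = np.clip(rng.normal(66.0, 30.0, size=30), 0.0, MAX_DURATION)

print(f"learned:    {learned.mean():6.2f}s over {learned.size} roll-outs")
print(f"hand-tuned: {hand_tuned.mean():6.2f}s over {hand_tuned.size} roll-outs")

# Durations are censored at 120 s, so a rank-based test is a reasonable
# (assumed) choice for comparing the two policies.
u, p = stats.mannwhitneyu(learned, hand_tuned, alternative="greater")
print(f"Mann-Whitney U = {u:.1f}, one-sided p = {p:.4f}")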