Figure 4 (available under a Creative Commons Attribution 4.0 International license).
Results of the Cart Pole and Mountain Car experiments for different values of the bias parameter b. Each plot shows the length of the evaluation episodes, which were run every 500 steps and averaged over five evaluation runs per training run; the line is the mean of ten separate training runs and the shaded area is the standard error of the mean.
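For readers reproducing such a plot, a minimal sketch (with hypothetical array names and placeholder data, not the authors' code) of how the per-evaluation-point mean and standard error of the mean across the ten training runs could be computed with NumPy:

import numpy as np

# episode_lengths: hypothetical array of shape (num_training_runs, num_eval_points),
# where each entry is already averaged over the five evaluation runs.
episode_lengths = np.random.rand(10, 200) * 500  # placeholder data

mean_curve = episode_lengths.mean(axis=0)                      # mean over the ten training runs
sem_curve = episode_lengths.std(axis=0, ddof=1) / np.sqrt(10)  # standard error of the mean (shaded area)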
Source publication
Potential-based reward shaping is commonly used to incorporate prior knowledge of how to solve the task into reinforcement learning because it can formally guarantee policy invariance. As such, the optimal policy and the ordering of policies by their returns are not altered by potential-based reward shaping. In this work, we highlight the dependenc...
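For context, the standard potential-based shaping term of Ng, Harada and Russell (1999) adds

    F(s, a, s') = γ Φ(s') − Φ(s)

to the environment reward, and it is this form that guarantees the optimal policy and the ordering of policies by their returns are preserved.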
Context in source publication
Context 1
... this experiment we test the theory from Section 5.2 for the choice of the bias parameter b in a deep RL setting with a constant −1 reward. Figure 4(b) shows the average evaluation performance of agents in the Mountain Car environment with differently shifted potential functions, compared with an agent without additional reward shaping. For the same reason as in the Cart Pole experiment, we do not test the comparably small changes to the bias b that compensating for the initial Q-values would create. ...
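As a rough illustration (not the authors' implementation), shifting the potential Φ by a constant bias b changes the per-step shaping reward only by the constant −(1 − γ)b, since γ(Φ(s') + b) − (Φ(s) + b) = γΦ(s') − Φ(s) − (1 − γ)b. A minimal sketch with a hypothetical Mountain Car potential:

GAMMA = 0.99
BIAS = 100.0  # hypothetical value of the bias parameter b

def potential(state, bias=0.0):
    # Hypothetical potential for Mountain Car: favour positions closer to the goal,
    # shifted by a constant bias b. The paper's actual Φ may differ.
    position, velocity = state
    return position + bias

def shaped_reward(state, next_state, env_reward=-1.0, bias=BIAS, gamma=GAMMA):
    # Potential-based shaping term F(s, s') = γ Φ(s') − Φ(s);
    # a constant shift of Φ by b only adds −(1 − γ) b to every step's reward.
    f = gamma * potential(next_state, bias) - potential(state, bias)
    return env_reward + f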
Similar publications
Multi-feed radial distribution systems are used to reduce the losses in the system using reconfiguration techniques. Reconfiguration can reduce the losses in the system only to a certain extent. Introduction of distributed generators has vastly improved the performance of distribution systems. Distributed generators can be used for reduction of los...
The application of deep reinforcement learning algorithms to economic battery dispatch problems has significantly increased recently. However, optimizing battery dispatch over long horizons can be challenging due to delayed rewards. In our experiments we observe poor performance of popular actor-critic algorithms when trained on yearly episodes wit...
Training of deep reinforcement learning agents is slowed considerably by the presence of input dimensions that do not usefully condition the reward function. Existing modules such as layer normalization can be trained with weight decay to act as a form of selective attention, i.e. an input mask, that shrinks the scale of unnecessary inputs, which i...
Alignment with human preferences is commonly framed using a universal reward function, even though human preferences are inherently heterogeneous. We formalize this heterogeneity by introducing user types and examine the limits of the homogeneity assumption. We show that aligning to heterogeneous preferences with a single policy is best achieved us...
This work investigates the implementation of the Deep Deterministic Policy Gradient (DDPG) algorithm to enhance the target-reaching capability of the seven degree-of-freedom (7-DoF) Franka Panda robotic arm. A simulated environment is established by employing OpenAI Gym, PyBullet, and Panda Gym. After 100,000 training time steps, the DDPG algorithm...