Results of the Cart Pole and Mountain Car experiments for different values of the bias parameter b. Each plot shows the length of evaluation episodes run every 500 steps, averaged over five evaluation runs per training run; the curves show the mean over ten separate training runs, with the shaded area indicating the standard error of the mean.
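As a reading aid, the aggregation described in the caption can be sketched as follows (placeholder array shapes and data, not the authors' plotting code): per-checkpoint episode lengths are first averaged over the five evaluation runs, then the mean and standard error are computed over the ten training runs.

```python
import numpy as np

# Illustrative aggregation matching the caption:
# 10 independent training runs, evaluated every 500 steps (here 40 checkpoints),
# with 5 evaluation episodes per checkpoint. Placeholder data only.
eval_lengths = np.random.randint(50, 200, size=(10, 40, 5))

per_run = eval_lengths.mean(axis=2)        # average over the 5 evaluation runs
mean_curve = per_run.mean(axis=0)          # mean over the 10 training runs
sem_curve = per_run.std(axis=0, ddof=1) / np.sqrt(per_run.shape[0])  # standard error of the mean
```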

Source publication
Preprint
Full-text available
Potential-based reward shaping is commonly used to incorporate prior knowledge of how to solve the task into reinforcement learning because it can formally guarantee policy invariance. As such, the optimal policy and the ordering of policies by their returns are not altered by potential-based reward shaping. In this work, we highlight the dependenc...
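For context, potential-based reward shaping adds a shaping term F(s, s') = γΦ(s') − Φ(s) derived from a potential function Φ, which is what yields the policy-invariance guarantee mentioned above. The snippet below is a minimal illustrative sketch of that shaping term, not the authors' implementation.

```python
# Minimal sketch of potential-based reward shaping.
# phi is an arbitrary potential function over states; gamma is the discount factor.
# The shaped reward r + F leaves the optimal policy unchanged because
# F(s, s') = gamma * phi(s') - phi(s) telescopes along any trajectory.

def shaped_reward(reward, state, next_state, phi, gamma=0.99, done=False):
    """Return the environment reward plus the potential-based shaping term."""
    # At terminal states the next potential is conventionally taken as 0,
    # which is where a constant bias added to phi starts to matter.
    next_potential = 0.0 if done else phi(next_state)
    return reward + gamma * next_potential - phi(state)
```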

Context in source publication

Context 1
... this experiment we test the theory in Section 5.2 for the choice of the bias parameter b in a deep RL setting with a constant −1 reward. Figure 4(b) shows the average evaluation performance of agents in the Mountain Car environment with differently shifted potential functions, compared with an agent without additional reward shaping. For the same reason as in the Cart Pole experiment, we do not test the comparably small changes to the bias b that compensating for the initial Q-values would create. ...
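To illustrate what a shifted potential function might look like in this setting, the sketch below adds a constant bias b to a hypothetical position-based potential for Mountain Car. The specific potential, the bias value, and the Gymnasium environment interface are assumptions for illustration, not details taken from the paper. With a constant −1 step reward and terminal potentials taken as zero, the shift only changes the shaping term on transitions into terminal states.

```python
import gymnasium as gym  # assumption: Gymnasium's MountainCar-v0, not necessarily the paper's setup

GAMMA = 0.99
BIAS = 10.0  # the constant shift b applied to the potential (illustrative value)

def phi(state, b=BIAS):
    # Hypothetical potential: the car's position plus the bias b.
    # Mountain Car observations are (position, velocity).
    return state[0] + b

env = gym.make("MountainCar-v0")
state, _ = env.reset(seed=0)
action = env.action_space.sample()
next_state, reward, terminated, truncated, _ = env.step(action)

# Potential-based shaping term; terminal potentials are taken as 0,
# so the bias b only changes the shaping reward on transitions into terminal states.
next_potential = 0.0 if terminated else phi(next_state)
shaped = reward + GAMMA * next_potential - phi(state)
```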

Similar publications

Article
Full-text available
Multi-feed radial distribution systems are used to reduce the losses in the system using reconfiguration techniques. Reconfiguration can reduce the losses in the system only to a certain extent. Introduction of distributed generators has vastly improved the performance of distribution systems. Distributed generators can be used for reduction of los...
Preprint
Full-text available
The application of deep reinforcement learning algorithms to economic battery dispatch problems has significantly increased recently. However, optimizing battery dispatch over long horizons can be challenging due to delayed rewards. In our experiments we observe poor performance of popular actor-critic algorithms when trained on yearly episodes wit...
Preprint
Full-text available
Training of deep reinforcement learning agents is slowed considerably by the presence of input dimensions that do not usefully condition the reward function. Existing modules such as layer normalization can be trained with weight decay to act as a form of selective attention, i.e. an input mask, that shrinks the scale of unnecessary inputs, which i...
Preprint
Full-text available
Alignment with human preferences is commonly framed using a universal reward function, even though human preferences are inherently heterogeneous. We formalize this heterogeneity by introducing user types and examine the limits of the homogeneity assumption. We show that aligning to heterogeneous preferences with a single policy is best achieved us...
Article
Full-text available
This work investigates the implementation of the Deep Deterministic Policy Gradient (DDPG) algorithm to enhance the target-reaching capability of the seven degree-of-freedom (7-DoF) Franka Panda robotic arm. A simulated environment is established by employing OpenAI Gym, PyBullet, and Panda Gym. After 100,000 training time steps, the DDPG algorithm...