Average length of evaluation runs (with ε = 0.05) on a 25×25 Gridworld with potential-based reward shaping where Φ(s) = V*(s).

Source publication
Preprint
Full-text available
Potential-based reward shaping is commonly used to incorporate prior knowledge of how to solve the task into reinforcement learning because it can formally guarantee policy invariance. As such, the optimal policy and the ordering of policies by their returns are not altered by potential-based reward shaping. In this work, we highlight the dependenc...

Context in source publication

Context 1
... figure 1 we show the average episode length for ten evaluation runs every 500 training steps in a simple 25-by-25 Gridworld. The task of the agent is to move from the top left into the bottom right corner of the grid. ...
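The shaping scheme referenced above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: since the true optimal value function V*(s) is not given here, the negative Manhattan distance to the goal is used as a hypothetical stand-in potential Φ(s) for the 25×25 grid.

```python
# Hedged sketch of potential-based reward shaping, F(s, s') = gamma * Phi(s') - Phi(s).
# Assumption: Phi(s) is approximated by negative Manhattan distance to the goal,
# as a stand-in for the (unavailable) optimal value function V*(s).

GAMMA = 0.99
SIZE = 25
GOAL = (SIZE - 1, SIZE - 1)  # bottom-right corner, as in the gridworld task

def potential(state):
    """Illustrative potential: negative Manhattan distance to the goal."""
    row, col = state
    return -(abs(GOAL[0] - row) + abs(GOAL[1] - col))

def shaping_reward(state, next_state, gamma=GAMMA):
    """Potential-based shaping term added to the environment reward.

    Because F depends only on the potentials of s and s', the optimal
    policy (and the return-ordering of policies) is left unchanged.
    """
    return gamma * potential(next_state) - potential(state)
```

A step toward the goal yields a positive shaped reward and a step away yields a negative one, which is the prior knowledge the shaping injects without altering the optimal policy.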

Similar publications

Article
Full-text available
Reinforcement learning (RL) systems can be complex and non-interpretable, making it challenging for non-AI experts to understand or intervene in their decisions. This is due in part to the sequential nature of RL in which actions are chosen because of their likelihood of obtaining future rewards. However, RL agents discard the qualitative features...
Article
Full-text available
Multi-feed radial distribution systems are used to reduce the losses in the system using reconfiguration techniques. Reconfiguration can reduce the losses in the system only to a certain extent. Introduction of distributed generators has vastly improved the performance of distribution systems. Distributed generators can be used for reduction of los...
Preprint
Full-text available
We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$Code, that solves multi-turn code generation using only single-step rewards. Our key...
Article
Full-text available
This work investigates the implementation of the Deep Deterministic Policy Gradient (DDPG) algorithm to enhance the target-reaching capability of the seven degree-of-freedom (7-DoF) Franka Panda robotic arm. A simulated environment is established by employing OpenAI Gym, PyBullet, and Panda Gym. After 100,000 training time steps, the DDPG algorithm...
Preprint
Full-text available
The application of deep reinforcement learning algorithms to economic battery dispatch problems has significantly increased recently. However, optimizing battery dispatch over long horizons can be challenging due to delayed rewards. In our experiments we observe poor performance of popular actor-critic algorithms when trained on yearly episodes wit...