Figure 2 - available via license: Creative Commons Attribution 4.0 International
Average episode return (left) and surprise (right) versus environment interactions (average over 5 seeds, with one shaded standard deviation) in the Maze environment. S-Max and S-Adapt are the only objectives that allow the RL agents to consistently find the goal in the maze. These also cause the largest change in surprise when compared to the random agent.
Source publication
Both entropy-minimizing and entropy-maximizing (curiosity) objectives for unsupervised reinforcement learning (RL) have been shown to be effective in different environments, depending on the environment's level of natural entropy. However, neither method alone results in an agent that will consistently learn intelligent behavior across environments...
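As a point of reference for the figure's "surprise" axis, the following is a minimal sketch of the standard formalization used in surprise-based unsupervised RL; the symbols below (the density model p_θ and the policy's state distribution d^π) are assumptions of this sketch, and the source paper's exact definitions may differ. Surprise is typically the negative log-likelihood of a visited state under a density model fit to the agent's own state history, and the two opposing objectives reward decreasing or increasing it:

```latex
% Assumed, standard formalization; notation in the source paper may differ.
% p_theta: density model fit online to the agent's state history.
% d^pi:    state distribution induced by policy pi.
\text{surprise}(s_t) = -\log p_\theta(s_t),
\qquad
r_t^{\text{S-Min}} = \log p_\theta(s_t),
\qquad
r_t^{\text{S-Max}} = -\log p_\theta(s_t)

% Expected surprise is the cross-entropy H(d^pi, p_theta), which upper-bounds
% the state entropy H(d^pi) and matches it when p_theta fits d^pi well:
\mathbb{E}_{s \sim d^\pi}\!\left[-\log p_\theta(s)\right]
  = H\!\left(d^\pi, p_\theta\right) \;\geq\; H\!\left(d^\pi\right)
```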
Contexts in source publication
Context 1
... the other hand, the S-Max agent learns to navigate the Maze and reach the goal but fails to catch any butterflies in the Butterflies environment. Quantitatively, the S-Min agent achieves the lowest or near-lowest entropy in all environments, while the S-Max agent achieves the highest or near-highest entropy in all environments (Figures 2 to 4), as expected. However, we highlight that the qualitatively interesting direction for entropy control is correlated not with a single objective, but with the scale of the absolute difference in the final entropy achieved by the agent versus that of the Random agent (Figures 2 to 4). ...
Context 2
... the S-Min agent achieves the lowest or near-lowest entropy in all environments, while the S-Max agent achieves the highest or near-highest entropy in all environments (Figures 2 to 4), as expected. However, we highlight that the qualitatively interesting direction for entropy control is correlated not with a single objective, but with the scale of the absolute difference in the final entropy achieved by the agent versus that of the Random agent (Figures 2 to 4). In the Maze environment, the S-Max agent drives a significant increase in entropy over the Random agent, while the S-Min agent achieves a relatively small decrease (Figure 2). ...
Context 3
... we highlight that the qualitatively interesting direction for entropy control is correlated not with a single objective, but with the scale of the absolute difference in the final entropy achieved by the agent versus that of the Random agent (Figures 2 to 4). In the Maze environment, the S-Max agent drives a significant increase in entropy over the Random agent, while the S-Min agent achieves a relatively small decrease (Figure 2). Similarly, in the Butterflies environment, the opposite holds in the large map (Figure 2b). ...
Context 4
... the Maze environment, the S-Max agent drives a significant increase in entropy over the Random agent, while the S-Min agent achieves a relatively small decrease (Figure 2). Similarly, in the Butterflies environment, the opposite holds in the large map (Figure 2b). Interestingly, in the small map, the S-Min and S-Max agents achieve roughly the same absolute change in entropy (Figure 2a). ...
Context 5
... in the Butterflies environment, the opposite holds in the large map (Figure 2b). Interestingly, in the small map, the S-Min and S-Max agents achieve roughly the same absolute change in entropy (Figure 2a). This is because in the smaller map, avoiding butterflies is about as challenging as catching them, while in the larger map, the butterflies are easily avoided. ...
Context 6
... on the success modes of the single-objective agents, the proposed S-Adapt agent can adapt to the entropy landscape to achieve entropy control across all didactic environments (Figures 2 to 4). In Maze, the S-Adapt agent converges to a surprise-maximizing strategy similar to S-Max, as demonstrated by the high entropy achieved by the end of training (Figure 2). ...
Context 7
... on the success modes of the single-objective agents, the proposed S-Adapt agent can adapt to the entropy landscape to achieve entropy control across all didactic environments (Figures 2 to 4). In Maze, the S-Adapt agent converges to a surprise-maximizing strategy similar to S-Max, as demonstrated by the high entropy achieved by the end of training (Figure 2). On the other hand, in Tetris, the S-Adapt agent converges to a surprise-minimizing strategy, achieving low entropy on par with the S-Min agent by the end of training (Figure 4). ...
Context 8
... the S-Adapt agent converges to surprise-maximizing behavior. However, as the size of the grid increases and the density of butterflies decreases, the effect of minimizing entropy becomes much stronger relative to the Random agent, and the S-Adapt agent correctly converges to the surprise-minimizing strategy (Figure 2b). More details on the effect of butterfly density on the behavior of the S-Adapt agent can be found in Appendix B.
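To make the selection criterion described in these contexts concrete, here is a minimal, hypothetical sketch of choosing an entropy-control direction by comparing each objective's final entropy against the Random-agent baseline. This is illustrative only and is not the S-Adapt agent's actual mechanism, which these excerpts do not fully specify; the function and variable names (`choose_objective`, `entropy_random`, `entropy_smin`, `entropy_smax`) are invented for the example.

```python
def choose_objective(entropy_random: float,
                     entropy_smin: float,
                     entropy_smax: float) -> str:
    """Pick the entropy-control direction with the larger absolute effect.

    Compares how far surprise-minimizing (S-Min) and surprise-maximizing
    (S-Max) agents move the final state entropy away from a Random-agent
    baseline, as discussed in the contexts above. Illustrative sketch only.
    """
    effect_min = abs(entropy_random - entropy_smin)  # entropy removed by S-Min
    effect_max = abs(entropy_smax - entropy_random)  # entropy added by S-Max
    return "maximize" if effect_max > effect_min else "minimize"


# Hypothetical numbers in the spirit of the Maze result: S-Max raises entropy
# well above the Random baseline while S-Min lowers it only slightly, so the
# qualitatively interesting direction is maximization.
print(choose_objective(entropy_random=2.0, entropy_smin=1.8, entropy_smax=3.5))
```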
Similar publications
Deep Reinforcement Learning (DRL) is promising for multi-agent path planning problems in which sparse external environmental rewards may cause the agent group to make overly conservative decisions and explore the environment inefficiently. In general, the reward shaping mechanism is used to mitigate the above problems with the additional reward fun...