Fig. 5
Source publication
In reinforcement learning (RL), the agent is expected to explore as many states as possible in the early stage of training and to exploit the gathered information in later stages to discover the highest-return trajectory. Based on this principle, in this paper we soften proximal policy optimization by introducing the entropy and d...
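The abstract describes softening PPO's objective with an entropy term. The exact SPOD objective (Eq. 31) is not reproduced in this excerpt, so the sketch below only illustrates the general idea: a standard PPO clipped surrogate plus an entropy bonus. The names `logp_new`, `logp_old`, `ent_coef`, and `clip_eps` are illustrative, not taken from the paper.

```python
import torch

def entropy_softened_ppo_loss(logp_new, logp_old, advantages, entropy,
                              clip_eps=0.2, ent_coef=0.01):
    """Minimal sketch of an entropy-softened clipped surrogate loss.

    This follows standard PPO conventions with an added entropy bonus;
    the actual SPOD formulation (Eq. 31) may differ.
    """
    ratio = torch.exp(logp_new - logp_old)                              # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()                    # PPO clipped term
    return -(surrogate + ent_coef * entropy.mean())                     # negate: maximize return + entropy
```

With `ent_coef = 0` the entropy bonus vanishes and the loss reduces to the plain PPO clipped objective, which is consistent with the ablation result quoted below (removing the entropy term degenerates SPOD to PPO).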
Contexts in source publication
Context 1
... In this experiment, using a control-variable method, we ablate three core components of SPOD, namely the dual-track advantage estimator (DTAE), the entropy term, and the clipped margin, to quantify each one's contribution to the overall performance of SPOD. First, without DTAE, the agent adopts GAE to estimate the advantage of an action at state s. Fig. 5 shows that DTAE attains a higher cumulative return and faster training than GAE. Then, without the entropy term in Eq. 31, SPOD degenerates to PPO and the algorithm's performance also decreases. Finally, without the clipped margin, the cumulative return remains at an extremely low level throughout the training process ...
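The ablation's fallback estimator is GAE, which is a standard, well-documented algorithm. The sketch below shows GAE for a single unterminated trajectory segment (no done masks), assuming `values` carries one extra bootstrap entry; it is the baseline named in the text, not the paper's DTAE, whose construction is not given in this excerpt.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory segment.

    rewards: array of length T
    values:  array of length T + 1 (includes bootstrap value for the last state)
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        running = delta + gamma * lam * running                  # discounted sum of residuals
        adv[t] = running
    return adv
```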
Similar publications
Reinforcement learning, which acquires a policy that maximizes long-term rewards, has been actively studied. Unfortunately, this type of learning is too slow and difficult to use in practical situations because the state-action space becomes huge in real environments. Many studies have incorporated human knowledge into reinforcement learning. Though human...