Effects of removing actions from the set of admissible actions on the learned policies as the probability of removing each action (σ) increases from 0 to 1. Each perturbation was repeated 32 times per environment; the mean and standard error of the results are shown. (a) The average return for each policy. (b) The average episode length for each policy.

Source publication (preprint, full-text available)
We present ICU-Sepsis, an environment that can be used in benchmarks for evaluating reinforcement learning (RL) algorithms. Sepsis management is a complex task that has been an important topic in applied RL research in recent years. Therefore, MDPs that model sepsis management can serve as part of a benchmark to evaluate RL algorithms on a challeng...

Context in source publication

Context 1
... such policies should have higher variance across runs, since some runs would not permit the actions being exploited to obtain unrealistically high returns. Figure 6 shows that the variance is indeed higher for the Variant than for ICU-Sepsis, where the average return and episode lengths remain more stable as actions are progressively made inadmissible. ...
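The perturbation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (`perturbed_return`, `evaluate`), the placeholder return computation, and the action-count default are all assumptions. A real evaluation would roll out the learned policy in the sepsis MDP with the surviving action set.

```python
import random
import statistics

def perturbed_return(admissible_actions, sigma, rng):
    """Drop each admissible action independently with probability sigma,
    then return a placeholder episode return. (Hypothetical stand-in:
    the real experiment would run the learned policy in the environment
    restricted to the surviving actions.)"""
    kept = [a for a in admissible_actions if rng.random() >= sigma]
    # Placeholder: the "return" shrinks as more actions are removed.
    return len(kept) / max(len(admissible_actions), 1)

def evaluate(sigma, n_runs=32, n_actions=25, seed=0):
    """Repeat the perturbation n_runs times (32 in the figure) and
    report the mean return and its standard error across runs."""
    rng = random.Random(seed)
    actions = list(range(n_actions))
    returns = [perturbed_return(actions, sigma, rng) for _ in range(n_runs)]
    mean = statistics.mean(returns)
    sem = statistics.stdev(returns) / n_runs ** 0.5 if n_runs > 1 else 0.0
    return mean, sem
```

Sweeping `sigma` from 0 to 1 and plotting the resulting means with standard-error bars reproduces the shape of the analysis: a policy that depends on a few exploitable actions shows high variance across runs, because only some runs remove those actions.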