Figure 1 - uploaded by Luchen Li


# (a) Validation on the test set, with parameters trained from different numbers of observed patient variables as an indicator of partial observability, and from the corresponding sets of patient variables while assuming full observability. Results shown are the OPPE outcomes of the current parameters at every training iteration. (b) Influence of the number of expansions in each tree search. (c) Action updates in a single belief, when actions in the search trees are dictated by π or π̂. Negative actions during test are truncated at 0.

## Source publication

Health-related data is noisy and stochastic in implying the true physiological states of patients, limiting information contained in single-moment observations for sequential clinical decision making. We model patient-clinician interactions as partially observable Markov decision processes (POMDPs) and optimize sequential treatment based on belief...

## Contexts in source publication

**Context 1**

... steps), the model and RL agent are validated on the test set, with each per-trajectory importance-sampling (IS) weight clipped into [1e-30, 1e4]. Results in Figure 1 (a) show that modelling as MDPs is more susceptible to the number of patient variables involved and consistently yields poorer performance than modelling as POMDPs, regardless of observation dimensionality, vindicating that patient-clinician interactions are de facto POMDPs. Moreover, as the incompleteness of observations increases in the POMDP framework, both AEHS and AE+A2C learn worse, but AEHS deteriorates slightly less than AE+A2C, and learns faster and better across all three cases. ...
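The clipped per-trajectory IS weighting described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of summed log-probabilities as inputs, and the self-normalized (weighted) form of the estimator are all assumptions.

```python
import numpy as np

def clipped_is_estimate(traj_returns, target_logps, behavior_logps,
                        clip_lo=1e-30, clip_hi=1e4):
    """Off-policy value estimate with clipped per-trajectory IS weights.

    traj_returns:   (N,) return of each test trajectory
    target_logps:   (N,) summed log-prob of the trajectory's actions under pi
    behavior_logps: (N,) summed log-prob of the same actions under mu
    """
    # Per-trajectory IS weight: product over steps of pi(a|b) / mu(a|o)
    w = np.exp(np.asarray(target_logps) - np.asarray(behavior_logps))
    # Clip each trajectory's weight into [1e-30, 1e4], as in the validation
    w = np.clip(w, clip_lo, clip_hi)
    # Self-normalized (weighted) importance-sampling estimate
    return np.sum(w * np.asarray(traj_returns)) / np.sum(w)
```

Clipping bounds the variance contributed by any single trajectory at the cost of some bias, which is why the interval [1e-30, 1e4] is fixed once for all validations.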

**Context 2**

... test whether our heuristics are applied efficiently, we first investigate the influence of the number of node expansions N_s in each suffix tree. As shown in Figure 1 (b), increasing N_s does not dramatically boost performance. This is because the trees are best-first: Eq. (13) always selects the most influential node to expand, implying that there is no need to expand many more nodes to evaluate the root. ...
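The best-first expansion loop described above can be sketched with a priority queue. This is an illustrative stand-in only: the scoring rule here is a placeholder for the paper's Eq. (13), and the `expand` interface is hypothetical.

```python
import heapq

def best_first_search(root_score, expand, n_expansions):
    """Expand the highest-priority ("most influential") node n_expansions
    times; expand(node) -> list of (child_score, child_node) pairs.

    Scores stand in for the influence measure of Eq. (13) in the source.
    """
    # Max-heap via negated scores; a counter breaks score ties stably
    frontier = [(-root_score, 0, "root")]
    counter = 1
    expanded = []
    for _ in range(n_expansions):
        if not frontier:
            break
        neg_score, _, node = heapq.heappop(frontier)
        expanded.append(node)
        for child_score, child in expand(node):
            heapq.heappush(frontier, (-child_score, counter, child))
            counter += 1
    return expanded
```

Because every expansion targets the current most influential node, the root's evaluation stabilizes after few expansions, consistent with N_s having little effect in Figure 1 (b).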

**Context 3**

... we compare how the target policy π is updated and how the look-ahead policy π̂ would appear when selecting actions a_T by π or π̂ respectively during tree explorations, as real actions are unanimously dictated by µ. The results visualized in Figure 1 (c) suggest that π̂ tends to update ahead of π like a precursor, and updates more stably on the whole. In addition, π is able to update faster when simulations are dictated by π̂ than by π. ...

**Context 4**

... histograms of the estimates are shown in Supp. Figure 1, together with the 95% confidence lower bound and the value of the behavior policy R_µ = (1/N) Σ_n R_n. In both cases, the 95% confidence lower bounds of the value of π exceed the value of µ by considerable margins. ...
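The comparison above — a 95% confidence lower bound on the target policy's value against the behavior-policy average R_µ — can be sketched as follows. This is a minimal sketch assuming a one-sided normal-approximation bound; the source may use a different bound construction.

```python
import numpy as np

def conf_lower_bound(estimates, z=1.645):
    """One-sided 95% lower confidence bound on the mean of
    per-trajectory value estimates (normal approximation;
    z = 1.645 is the one-sided 95% quantile)."""
    x = np.asarray(estimates, dtype=float)
    mean = x.mean()
    # Standard error of the mean, with sample (ddof=1) std
    se = x.std(ddof=1) / np.sqrt(len(x))
    return mean - z * se

# Behavior-policy value is the plain average of observed returns:
# R_mu = (1/N) * sum_n R_n, i.e. np.mean(returns)
```

If the lower bound of the π estimates exceeds `np.mean(returns)` for µ, the target policy's advantage holds even under the pessimistic end of the interval, which is the margin the text refers to.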


## Similar publications

Goal-conditioned Hierarchical Reinforcement Learning (HRL) is a promising approach for scaling up reinforcement learning (RL) techniques. However, it often suffers from training inefficiency as the action space of the high-level, i.e., the goal space, is large. Searching in a large goal space poses difficulty for both high-level subgoal generation...