
The Actor Search Tree Critic (ASTC) for Off-Policy POMDP Learning in Medical Decision Making

Luchen Li
Imperial College London
l.li17@imperial.ac.uk

Matthieu Komorowski
Imperial College London
matthieu.komorowski@gmail.com

A. Aldo Faisal
Imperial College London
a.faisal@imperial.ac.uk

Preprint. Work in progress. arXiv:1805.11548v3 [cs.AI] 3 Jun 2018

Abstract

Off-policy reinforcement learning enables near-optimal policies to be learned from suboptimal experience, thereby opening opportunities for artificial intelligence applications in healthcare. Previous works have mainly framed patient-clinician interactions as Markov decision processes, yet true physiological states are not necessarily fully observable from clinical data. We capture this situation with a partially observable Markov decision process, in which an agent optimises its actions over a belief state, represented as a distribution over patient states inferred from the individual's history trajectory. A Gaussian mixture model is fitted to the observed data. Moreover, to account for the fact that nuances in pharmaceutical dosage can result in significantly different effects, we model a continuous policy through a Gaussian approximator directly in the policy space, i.e. the actor. To address the challenge that the infinite number of possible belief states renders exact value iteration intractable, we evaluate and plan only for each encountered belief, using a heuristic search tree that tightly maintains lower and upper bounds on the true value of the belief. We further resort to function approximation to update the value bound estimates, i.e. the critic, so that the tree search can be improved through more compact bounds at the fringe nodes, which are back-propagated to the root. Both actor and critic parameters are learned via gradient-based approaches. Our proposed policy, trained on real intensive care unit data, is capable of dictating dosing of vasopressors and intravenous fluids for sepsis patients in a way that leads to the best patient outcomes.

1 Introduction

Many recent examples [1, 2, 3] have demonstrated that machine learning can deliver above-human performance in classification-based diagnostics. However, key to medicine is diagnosis paired with treatment, i.e. the sequential decisions that have to be made by clinicians for patient treatment. Automatic treatment optimisation based on reinforcement learning has been explored in simulated patients for HIV therapy [4] using fitted Q-iteration, for dynamic insulin dosage in diabetes using model-based reinforcement learning [5], and for anaesthesia depth control using actor-critic methods [6].

We focus on principled closed-loop approaches to learn a near-optimal treatment policy from vast electronic healthcare records (EHRs). Previous works mainly modelled this problem as a Markov decision process (MDP) and learned a policy of actions $\pi(a \mid s) = P(a \mid s)$ based on estimated values of possible actions $a$ in each state $s$. [7] investigated fitted Q-iteration with linear function approximation to optimise treatment of schizophrenia. [8] compared different supervised learning methods to approximate action values, including neural networks, and successfully predicted the weaning of mechanical ventilation and sedation dosages. Besides crafting a reward function that reflects domain knowledge, another avenue explored inverse reinforcement learning to recover one from expert behaviours. For example, [9] proposed hemoglobin-A1c dosages for diabetes patients by implementing Markov chain Monte Carlo sampling to infer the posterior distribution over rewards given observed states and actions. Further, multi-objective optimal treatments were explored [10] with non-deterministic fitted Q-iteration to provide guidelines on antipsychotic drug treatments for schizophrenia patients. [11] inferred hidden states via a discriminative hidden Markov model, and investigated a deep fitted-Q variant to learn heparin dosages to manage thrombosis.

However, the true physiological state of the patient is not necessarily fully observable by clinical measurement techniques. In practice, observability is further restricted by the actual subset of measurements taken in hospital, which limits both the nature and frequency of the data recorded and omits information readily visible to the clinician. These limitations are especially salient in intensive care units (ICUs). Strictly speaking, the decision making process acting on the patient state should therefore be mathematically formulated as a partially observable MDP (POMDP). Consequently, the true patient state is latent, and the patient's physiological belief state $b$ can only be represented as a probability distribution over states given the history of observations and actions. Exhaustive and exact POMDP solutions are computationally intractable because belief states constitute a continuous simplex containing infinitely many possibilities. We alleviate this obstacle by evaluating only the sequentially encountered belief states, learning both upper and lower bounds on their true values, and exploring reachable sequences with a heuristic search tree to obtain locally optimal value estimates $\hat{v}_\tau(b)$. Value bounds in the tree search are learned by back-propagating those of newly expanded nodes to the root; in previous work such bounds are usually computed offline and remain static. We model the lower and upper value bounds as function approximators, so that the tree search can be improved through more accurate value bounds at the fringe nodes as we process the incoming data. Moreover, to realise a continuous action space (e.g. millilitres of medicine dripped per hour) we explicitly use a continuous policy implemented through function approximation. We embed these features in an actor-critic framework and update the parameters of the policy (the actor) and the value bounds (the critic) via gradient-based methods.

2 Preliminaries

POMDP Model

A POMDP framework can be represented by the tuple $\{S, A, O, T, \Omega, R, \gamma, b_0\}$ [12], where $S$ is the state space, $A$ the action space, $O$ the observation space, $T$ the stochastic state transition function $T(s', s, a) = P(s_{t+1} = s' \mid s_t = s, a_t = a)$, $\Omega$ the stochastic observation function $\Omega(o, s', a) = P(o_{t+1} = o \mid s_{t+1} = s', a_t = a)$, $R$ the immediate reward function $r = R(s, a)$, $\gamma \in [0, 1)$ the discount factor weighing the present value of future rewards, and $b_0$ the agent's initial belief before receiving any information.


Fig. 1. (Left) Graphical model of the POMDP. (Right) Tree search. Each circle node represents a belief state, each dotted node an action-belief pair. (a) Select the best fringe node to expand; (b) expand the selected node by choosing the action with the maximal upper value bound and considering all possible observations, and back-propagate value bounds through all its ancestors to the root; (c) no revision is required for previous action choices, so the next best fringe node is expanded; (d) a previous action choice (shown here at horizon 1) is no longer optimal after the latest expansion, and the new optimal action is selected.

The agent's belief state is represented by the probability distribution over states given historical observations and actions, $b_t(s) = P(s \mid b_0, o_1, a_1, \ldots, a_{t-1}, o_t)$. Executing $a$ in belief $b$ and receiving observation $o$, the new belief is updated through:

$$b'(s') = \tau(b, a, o) = \eta \, \Omega(o, s', a) \sum_{s \in S} T(s', s, a) \, b(s) \tag{1}$$

where $\eta = \frac{1}{P(o \mid b, a)}$ is a normalisation constant, and

$$P(o \mid b, a) = \sum_{s' \in S} \Omega(o, s', a) \sum_{s \in S} T(s', s, a) \, b(s) \tag{2}$$
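As a concrete illustration, the discrete belief update of Eqs. 1-2 can be written in a few lines of NumPy. This is a minimal sketch under our own array-layout assumptions, not the authors' implementation:

```python
import numpy as np

def belief_update(b, a, o, T, Omega):
    """Bayesian belief update of Eqs. 1-2 for a discrete POMDP.

    b     : (|S|,)          current belief over states
    a, o  : int             action and observation indices
    T     : (|A|,|S|,|S|)   T[a, s, s'] = P(s' | s, a)
    Omega : (|A|,|S|,|O|)   Omega[a, s', o] = P(o | s', a)
    """
    pred = T[a].T @ b                  # sum_s T(s', s, a) b(s)
    unnorm = Omega[a, :, o] * pred     # weight by observation likelihood
    p_o = unnorm.sum()                 # P(o | b, a), Eq. 2
    return unnorm / p_o                # eta = 1 / P(o | b, a), Eq. 1
```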

The optimal policy $\pi(a \mid b) = P(a \mid b)$ specifies the best action to select in a belief; its value function is the fixed point of the Bellman equation [13]:

$$v^*(b) = \max_{a \in A} \Big[ R_B(b, a) + \gamma \sum_{o \in O} P(o \mid b, a) \, v^*\big(\tau(b, a, o)\big) \Big] \tag{3}$$

where $R_B(b, a) = \sum_{s \in S} R(s, a) \, b(s)$ is the probability-weighted immediate reward. A graphical model of a fragment of the POMDP framework is shown in Fig. 1 (left).
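For completeness, a one-step lookahead backup of Eq. 3 over a finite observation set could be sketched as below, reusing `belief_update` from the previous snippet; recursing this backup exhaustively over all successor beliefs is exactly what becomes intractable:

```python
import numpy as np

def bellman_backup(b, v_fn, R, T, Omega, gamma=0.99):
    """One-step Bellman backup for a belief (Eq. 3).

    v_fn : callable returning an estimate of v*(b') for a successor belief
    R    : (|S|, |A|)  immediate reward matrix, R[s, a] = R(s, a)
    """
    best = -np.inf
    for a in range(T.shape[0]):
        q = float(b @ R[:, a])                     # R_B(b, a)
        pred = T[a].T @ b
        for o in range(Omega.shape[2]):
            p_o = float(Omega[a, :, o] @ pred)     # P(o | b, a), Eq. 2
            if p_o > 0.0:
                q += gamma * p_o * v_fn(belief_update(b, a, o, T, Omega))
        best = max(best, q)
    return best
```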

Continuous Control

Policy gradient methods can go beyond the limits of a finite action space and achieve continuous control. Instead of choosing actions based on action-value estimates, the policy is directly optimised in the policy space. Our policy $\pi(a \mid s, u) = F_{actor}(x(s), u)$ is a function of the state feature vector $x(s)$ parameterised by a weight vector $u$. The objective function measuring the performance of policy $\pi$ is defined as the value, i.e. the expected total future reward, of the start state:

$$J(\pi) = \mathbb{E}_\pi[G_0 \mid s_0] = \mathbb{E}_\pi\Big[\sum_{t=0}^{T-1} \gamma^t r_t \;\Big|\; s_0\Big] \tag{4}$$

where $G_0$ is the return $G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k}$ at $t = 0$. According to the policy gradient theorem [14], the gradient of $J$ w.r.t. $u$ is:¹

$$\nabla_u J(\pi) = \sum_{s \in S} P^{stat}_\pi(s) \sum_{a \in A} q_\pi(s_t, a) \nabla_u \pi(a \mid s_t, u_t) = \mathbb{E}_\pi\Big[G_t \frac{\nabla_u \pi(a_t \mid s_t, u_t)}{\pi(a_t \mid s_t, u_t)}\Big] \tag{5}$$

$P^{stat}_\pi(s)$ denotes the stationary distribution of states under policy $\pi$ (i.e. the chance that a state will be visited within an episode), and $q_\pi(s_t, a_t) = \mathbb{E}_\pi[G_t \mid s_t, a_t]$ is the value of $a_t$ in state $s_t$. $u$ can subsequently be updated through gradient ascent, with step size $\alpha$, in the direction that maximally increases $J$:

$$u_{t+1} = u_t + \alpha G_t \frac{\nabla_u \pi(a_t \mid s_t, u_t)}{\pi(a_t \mid s_t, u_t)} \tag{6}$$

¹ $\nabla_u J(\pi) = \sum_{s \in S} P^{stat}_\pi(s) \sum_{a \in A} q_\pi(s_t, a) \nabla_u \pi(a \mid s_t, u_t) = \mathbb{E}_\pi\big[\sum_{a \in A} q_\pi(s_t, a) \nabla_u \pi(a \mid s_t, u_t)\big] = \mathbb{E}_\pi\big[\sum_{a \in A} \pi(a \mid s_t, u_t) \, q_\pi(s_t, a) \frac{\nabla_u \pi(a \mid s_t, u_t)}{\pi(a \mid s_t, u_t)}\big] = \mathbb{E}_\pi\big[q_\pi(s_t, a_t) \frac{\nabla_u \pi(a_t \mid s_t, u_t)}{\pi(a_t \mid s_t, u_t)}\big] = \mathbb{E}_\pi\big[G_t \frac{\nabla_u \pi(a_t \mid s_t, u_t)}{\pi(a_t \mid s_t, u_t)}\big]$.
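To make Eqs. 4-6 concrete, a minimal Monte-Carlo policy gradient (REINFORCE) sketch with a Gaussian policy whose mean is linear in the features (anticipating Eq. 19) might look as follows; the `env` interface and hyperparameter values are illustrative assumptions:

```python
import numpy as np

def reinforce_episode(env, u, alpha=0.01, sigma=0.5, gamma=0.99):
    """One REINFORCE pass (Eq. 6) with pi(a|s,u) = N(u^T x(s), sigma^2).
    `env` is assumed to expose reset() -> features and
    step(a) -> (features, reward, done)."""
    feats, acts, rews = [], [], []
    x, done = env.reset(), False
    while not done:
        a = np.random.normal(u @ x, sigma)   # sample a continuous action
        x_next, r, done = env.step(a)
        feats.append(x); acts.append(a); rews.append(r)
        x = x_next
    G = 0.0
    for t in reversed(range(len(rews))):     # returns G_t, computed backwards
        G = rews[t] + gamma * G
        # grad log pi = (a - u^T x) x / sigma^2  (cf. Eq. 20)
        grad_log_pi = (acts[t] - u @ feats[t]) * feats[t] / sigma**2
        u = u + alpha * G * grad_log_pi      # gradient ascent, Eq. 6
    return u
```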

3 Methodology

In this section, we introduce how our algorithm combines tree search with policy gradients and function approximation to realise both a continuous action space and efficient online planning for POMDPs. We derive the components of our algorithm as we go along and provide an overview flowchart in Fig. 2.

Heuristic Search in POMDPs

The complexity of solving POMDPs is mainly due to the curse of dimensionality, since a belief state over an $|S|$-dimensional state space is an $(|S| - 1)$-dimensional continuous simplex whose elements sum to one, and the curse of history, since a belief acknowledges all previous observations and actions, whose combinations grow exponentially with the planning horizon [12]. Exact value iteration for Eq. 3, as in conventional reinforcement learning, is therefore computationally intractable.

The optimal value function $v^*(b)$ of a finite-horizon POMDP is piecewise linear and convex in the belief state [15], represented by a set of $|S|$-dimensional convex hyperplanes whose number grows exponentially. Most exhaustive algorithms for POMDPs are dedicated to learning either a lower bound [12, 16] or an upper bound [17, 18, 19] of $v^*(b)$ by maintaining a subset of the aforementioned hyperplanes. Tree-search based solutions usually prune away less likely observations or actions [20, 21, 22], or expand fringe nodes according to a predefined heuristic [23, 24, 25].

Our methodology is compatible with any tree-search based POMDP solution. Here we build on the heuristic tree search introduced in [25] to focus computation on each encountered belief (i.e. plan at decision time) and explore only reachable sequences. Specifically, a search tree rooted at the current belief $b_{curr}$ is built, whose value estimate is confined by a lower bound $\hat{v}_L(b_{curr})$ and an upper bound $\hat{v}_U(b_{curr})$ that become tighter around the true optimal value $v^*(b_{curr})$ after each step of look-ahead. Here we use a subscript on the belief state to denote its temporal position within the episode of environmental experience, and a superscript for the horizon explored in the tree search (analogously for observations and actions). At each update during exploration, only the fringe node that contributes the maximum error to the root $b^0_{curr}$ is expanded:

$$b^*_{curr} = \arg\max_{b^h_{curr},\, h \in \{0, \ldots, H\}} \Big[ \gamma^h \big(\hat{v}_U(b^h_{curr}) - \hat{v}_L(b^h_{curr})\big) \prod_{i=0}^{h-1} P(o^i_{curr} \mid b^i_{curr}, a^i_{curr}) \, \pi_\tau(a^i_{curr} \mid b^i_{curr}) \Big] \tag{7}$$

where $H$ is the maximum horizon explored so far, $\hat{v}_U(b^h_{curr}) - \hat{v}_L(b^h_{curr})$ is the error at the fringe node, $\prod_{i=0}^{h-1} P(o^i_{curr} \mid b^i_{curr}, a^i_{curr}) \, \pi_\tau(a^i_{curr} \mid b^i_{curr})$ the probability of reaching it from the root, and $\gamma^h$ the time discount (Fig. 1 (right) (a)). Once a node is expanded, the value bound estimates of all its ancestors are updated up to the root in a bottom-up fashion analogous to Eq. 3, substituting $v^*$ with $\hat{v}_L$ or $\hat{v}_U$ as appropriate (Fig. 1 (right) (b)), and previous action choices at expanded belief nodes along the path are revised (Fig. 1 (right) (c-d)) to ensure that the optimal action $a^{i*}_{curr}$ in $b^i_{curr}$ for all $h \in \{0, \ldots, H\}$ is always explored based on current estimates:

$$a^{i*}_{curr} = \arg\max_{a \in A} \Big[ R_B(b^i_{curr}, a) + \gamma \sum_{o \in O} P(o \mid b^i_{curr}, a) \, \hat{v}_U\big(\tau(b^i_{curr}, a, o)\big) \Big] \tag{8}$$

Eq. 8 is the deterministic tree policy $\pi_\tau(a \mid b)$ that guides exploration within the tree. Each expansion leads to more compact value bounds at the root. Tree exploration is terminated when the interval between the value bound estimates at the root belief changes only trivially or a time limit is reached.
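To make the node-selection rule of Eq. 7 concrete, scoring fringe nodes could be sketched as follows; the `Node` structure is an illustrative assumption rather than the authors' data layout:

```python
class Node:
    """A belief node in the search tree (illustrative structure)."""
    def __init__(self, belief, v_lo, v_hi, depth, reach_prob):
        self.belief = belief          # belief vector at this node
        self.v_lo = v_lo              # lower value bound v_L(b)
        self.v_hi = v_hi              # upper value bound v_U(b)
        self.depth = depth            # horizon h from the root
        self.reach_prob = reach_prob  # prod_i P(o|b,a) * pi_tau(a|b)

def select_fringe_node(fringe, gamma=0.99):
    """Pick the fringe node with maximal contribution to root error (Eq. 7)."""
    def score(n):
        return gamma ** n.depth * (n.v_hi - n.v_lo) * n.reach_prob
    return max(fringe, key=score)
```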

Gaussian States

Observed patient information is modelled as a Gaussian mixture, each observation being generated from one of a finite set of Gaussian distributions that represent genuine physiological states. The total number of latent states is decided by the Bayesian information criterion [26] through cross-validation on the development set. The terminal state is observable and corresponds to either patient discharge or death. Eq. 1 can be further expressed as:

$$b'(s') = \eta \frac{P^a(s' \mid o) \, P^a(o)}{P^a(s')} \sum_{s \in S} T^a(s', s) \, b(s) = \eta' \frac{P^a(s' \mid o)}{P^a(s')} \sum_{s \in S} T^a(s', s) \, b(s) \tag{9}$$

The superscript $a$ denotes the subset of the data corresponding to the action taken during this transition.² The division into subsets enables parallel computation. $P^a(s' \mid o)$ is the posterior distribution of $s'$ when observing $o$, inferred from the trained Gaussian mixture model.

The transition function $T^a(s', s)$ is learned by maximum a posteriori estimation [27] to allow for transitions that did not occur in the development dataset.

² In our implementation, $P^a(s' \mid o)$ for all $a \in A$ is computed globally as $P(s' \mid o)$ for all $o \in O$, regardless of the action leading to it, because the impact of the observation on the distribution of states is significantly more substantial than that of the action administered.

Prior knowledge on transitions is modelled by Griffiths-Engen-McCloskey (GEM) distributions [28] according to the relative Euclidean distances between state centroids; their elements, if sorted, decrease exponentially and sum to one, reflecting higher probabilities of transiting into similar states. Specifically, a GEM distribution is defined by a discount parameter $c_1$ and a concentration parameter $c_2$, and can be explained by a stick-breaking construction: break a stick for the $k$-th time into two parts, whose length proportions conform to a Beta distribution:

$$V_k \sim \text{Beta}(1 - c_1,\; c_2 + k c_1), \qquad 0 \le c_1 < 1,\; c_2 > -c_1 \tag{10}$$

The length proportions of the broken-off parts relative to the whole stick are then:

$$p_k = \begin{cases} V_1, & k = 1 \\ (1 - V_1)(1 - V_2) \cdots (1 - V_{k-1}) \, V_k, & k = 2, 3, \ldots \end{cases} \tag{11}$$

The probability vector $p$ with elements calculated from Eq. 11 constitutes a GEM distribution.
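For illustration, sampling a GEM distribution via the stick-breaking construction of Eqs. 10-11 might look like the sketch below; the truncation level `K` is our own assumption for finite computation:

```python
import numpy as np

def sample_gem(c1, c2, K, rng=None):
    """Sample a K-truncated GEM(c1, c2) probability vector (Eqs. 10-11).

    c1 : discount parameter, 0 <= c1 < 1
    c2 : concentration parameter, c2 > -c1
    K  : truncation level; leftover stick mass is folded into the last entry
    """
    rng = np.random.default_rng() if rng is None else rng
    p = np.empty(K)
    remaining = 1.0                           # length of the unbroken stick
    for k in range(1, K):
        v = rng.beta(1.0 - c1, c2 + k * c1)   # Eq. 10
        p[k - 1] = remaining * v              # Eq. 11
        remaining *= (1.0 - v)
    p[K - 1] = remaining
    return p                                  # sums to one by construction
```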

Actor-Critic

Adding to Eq. 5 a baseline term $B(s_t)$ as a comparison with $G_t$ significantly reduces the variance of gradient estimates without changing the equality.³ The baseline needs to discern states; a natural candidate is the state value or its parametric approximation $\hat{v}(s, w) = F_{critic}(x(s), w)$. Eq. 6 then becomes:

$$u_{t+1} = u_t + \alpha \big[G_t - \hat{v}(s_t, w_t)\big] \frac{\nabla_u \pi(a_t \mid s_t, u_t)}{\pi(a_t \mid s_t, u_t)} \tag{12}$$

The parametric policy $\pi(a \mid s, u)$ is called the actor, and the parametric value function $\hat{v}(s, w)$ the critic.

³ Because $\sum_{a \in A} B(s) \nabla_u \pi(a \mid s, u) = B(s) \nabla_u \sum_{a \in A} \pi(a \mid s, u) = B(s) \nabla_u 1 = 0$ for all $s \in S$.

The complete empirical return $G_t$ is only available at the end of each episode, so at each update we would otherwise need to look forward to future rewards to form the current estimate of $G_t$. A close approximation of $G_t$ that is available at each decision moment, and thus enables more data-efficient backward-view learning, is the $\lambda$-return $G^\lambda_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$, with $\lambda$ specifying the relative decay rate among the $n$-step returns $G_{t:t+n}$. Since $G^\lambda_t - \hat{v}(s_t, w_t) \approx r_t + \gamma \hat{v}(s_{t+1}, w_t) - \hat{v}(s_t, w_t)$ (denoted $\delta_t$), substituting $G_t$ with $G^\lambda_t$ in Eq. 12 yields:

$$u_{t+1} = u_t + \alpha \delta_t e^u_t \tag{13}$$

$$e^u_t = \gamma\lambda e^u_{t-1} + \frac{\nabla_u \pi(a_t \mid s_t, u_t)}{\pi(a_t \mid s_t, u_t)} \tag{14}$$

$e^u$ is the eligibility trace [29] for $u$, and is initialised to $0$ for every episode. Analogous update rules apply to the critic.
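A compact sketch of one backward-view update per Eqs. 13-14, with a Gaussian actor (anticipating Eqs. 19-20) and a linear TD(λ) critic as the analogous rule, is shown below; hyperparameter values are illustrative:

```python
import numpy as np

def actor_critic_step(x, x_next, a, r, done, u, w, e_u, e_w,
                      alpha=0.01, beta=0.05, gamma=0.99, lam=0.9, sigma=0.5):
    """One TD(lambda) actor-critic update (Eqs. 13-14) with
    pi(a|x,u) = N(u^T x, sigma^2) and a linear critic v(x,w) = w^T x."""
    v = w @ x
    v_next = 0.0 if done else w @ x_next
    delta = r + gamma * v_next - v                    # TD error
    # Actor trace and update (Eqs. 13-14); grad log pi as in Eq. 20
    e_u = gamma * lam * e_u + (a - u @ x) * x / sigma**2
    u = u + alpha * delta * e_u
    # Analogous critic trace and update
    e_w = gamma * lam * e_w + x
    w = w + beta * delta * e_w
    return u, w, e_u, e_w
```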

Actor Search Tree Critic

The history-dependent probabilistic belief state reflects all the information the agent needs about the current time step to optimise its decision. This belief state is used as the state feature vector for the actor-critic, i.e. $x(s) = b$. The rationale is that the state representation should let similar samples update the weight parameters in similar directions: similar situations have similar distributions over states, and each component of the belief determines the responsibility for updating the corresponding component of the weights.

Our critic parameterises the lower and upper value bounds in the tree search, instead of parameterising the value function as a whole (as done in conventional actor-critic methods). Note that the value bounds at the fringe nodes are here updated as we parse the data for off-policy reinforcement learning; in previous works these bounds would have been computed offline and not improved during online planning. We use linear representations for the value bounds:

$$\hat{v}_L(b, w^L) = w^{L\,T} b, \qquad \hat{v}_U(b, w^U) = w^{U\,T} b \tag{15}$$

At each step $t$, a locally optimal value (as opposed to a value function optimal for all belief states) is estimated for the current belief $b_t$ through heuristic tree search, with fringe node values approximated via $w^L$ and $w^U$; we denote it $\hat{v}_\tau(b_t, w^L_t, w^U_t)$. The critic parameters are updated through stochastic gradient descent (SGD), adjusting in the direction that most reduces the error on each training example by minimising the mean squared error between the current approximation and its target $\hat{v}_\tau(b_t, w^L_t, w^U_t)$:⁴

$$w^L_{t+1} = w^L_t + \beta^L \big[\hat{v}_\tau(b_t, w^L_t, w^U_t) - \hat{v}_L(b_t, w^L_t)\big] b_t \tag{16}$$

⁴ $w^L_{t+1} = w^L_t - \tfrac{1}{2}\beta^L \nabla_{w^L}\big[\hat{v}_\tau(b_t, w^L_t, w^U_t) - \hat{v}_L(b_t, w^L_t)\big]^2 = w^L_t + \beta^L\big[\hat{v}_\tau(b_t, w^L_t, w^U_t) - \hat{v}_L(b_t, w^L_t)\big]\nabla_{w^L}\hat{v}_L(b_t, w^L_t) = w^L_t + \beta^L\big[\hat{v}_\tau(b_t, w^L_t, w^U_t) - \hat{v}_L(b_t, w^L_t)\big] b_t$.

$\beta^L$ is the step size for updating $w^L$; similarly for $w^U_t$. As the weight vector also has an impact on the target $\hat{v}_\tau(b_t, w^L_t, w^U_t)$, which is ignored during the SGD update, the update is by definition semi-gradient. Semi-gradient methods usually learn faster than full gradient methods and, with linear approximators (Eq. 15), are guaranteed to converge to (near) a local optimum under the standard stochastic approximation conditions [30]:

$$\sum_{t=1}^{\infty} \beta_t = \infty, \qquad \sum_{t=1}^{\infty} \beta_t^2 < \infty \tag{17}$$

To ensure convergence, we set the step sizes at time step $t$ according to:

$$\beta^L_t = \frac{0.1}{\mathbb{E}\big[\|b_t\|^2_{w^L_t}\big]}, \qquad \beta^U_t = \frac{0.1}{\mathbb{E}\big[\|b_t\|^2_{w^U_t}\big]} \tag{18}$$
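A sketch of the semi-gradient updates of Eq. 16 for both bounds, with the expectation in Eq. 18 replaced by a running average of the squared feature norm (our own simplification), might read:

```python
import numpy as np

def critic_update(b, targets, w_lo, w_hi, b_sq_avg, decay=0.99):
    """Semi-gradient SGD update of both value bounds (Eq. 16).

    b        : belief vector (the critic's linear features)
    targets  : (target_lo, target_hi) from the tree search v_tau
    b_sq_avg : running estimate of E[||b||^2] for the step size (Eq. 18)
    """
    b_sq_avg = decay * b_sq_avg + (1 - decay) * float(b @ b)
    beta = 0.1 / max(b_sq_avg, 1e-8)                  # Eq. 18 heuristic
    target_lo, target_hi = targets
    w_lo = w_lo + beta * (target_lo - w_lo @ b) * b   # Eq. 16, lower bound
    w_hi = w_hi + beta * (target_hi - w_hi @ b) * b   # upper-bound analogue
    return w_lo, w_hi, b_sq_avg
```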

Fig. 2. ASTC algorithm flowchart.

To realise a continuous action space, the actor is modelled as a Gaussian distribution, with the mean approximated as a linear function (for simplicity of gradient computation) of the weights and the belief state:

$$\pi(a \mid b, u) = \mathcal{N}(u^T b, \sigma^2) \tag{19}$$

$\sigma$ is a hyperparameter for the standard deviation. In this circumstance the gradient term in Eq. 14 is calculated as:

$$\frac{\nabla_u \pi(a \mid b, u)}{\pi(a \mid b, u)} = \frac{1}{\sigma^2}\big(a - u^T b\big) \, b \tag{20}$$

The moment-wise error, or temporal difference (TD) error, that drives the actor update in Eq. 13 is:

$$\delta_t = r_t + \gamma \hat{v}_\tau(b_{t+1}, w^L_t, w^U_t) - \hat{v}_\tau(b_t, w^L_t, w^U_t) \tag{21}$$

In off-policy reinforcement learning, as we use retrospective data generated by a behaviour policy $\pi_b$ (the clinicians' actual treatment decisions) to optimise a target policy $\pi$ (our actor), the actor-critic approach is adapted via importance sampling on the eligibility trace [31]:

$$e^u_t = \rho_t \gamma\lambda e^u_{t-1} + \frac{\nabla_u \pi(a_t \mid b_t, u_t)}{\pi(a_t \mid b_t, u_t)} \tag{22}$$

where $\rho_t = \frac{\pi(a_t \mid b_t, u_t)}{\pi_b(a_t \mid b_t)}$ is the importance sampling ratio.⁵ The importance sampling mechanism helps ensure that we are not biased by differences between the two policies in choosing actions (e.g. optimal actions may look different from the clinical actions taken).

⁵ $\pi_b(a_t \mid b_t)$ is equivalent to $\pi_b(a_t \mid s_t)$: both $b_t$ and $s_t$ denote the environmental state at the same moment, depending on whether the agent's notion of state is a belief or a fully observable state.
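Putting Eqs. 19-22 together, one off-policy actor step might look as follows; the behaviour-policy density `pi_b_density` (e.g. estimated from the clinicians' data) is our assumption, as the text does not specify how $\pi_b$ is obtained:

```python
import numpy as np
from scipy.stats import norm

def off_policy_actor_step(b, a, delta, u, e_u, pi_b_density,
                          alpha=0.01, gamma=0.99, lam=0.9, sigma=0.5):
    """Off-policy actor update with importance sampling (Eqs. 19-22).

    b            : belief vector (features)
    a            : continuous action the clinician actually took
    delta        : TD error from Eq. 21
    pi_b_density : callable giving the behaviour policy density pi_b(a|b)
    """
    # Importance sampling ratio rho_t = pi(a|b,u) / pi_b(a|b)
    pi_target = norm.pdf(a, loc=u @ b, scale=sigma)   # Eq. 19
    rho = pi_target / max(pi_b_density(a, b), 1e-8)
    # Trace update per Eq. 22, with grad log pi from Eq. 20
    e_u = rho * gamma * lam * e_u + (a - u @ b) * b / sigma**2
    u = u + alpha * delta * e_u                       # Eq. 13
    return u, e_u
```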

4 Experimental Results

We train and test our algorithm separately on synthetic data and on real ICU patients.

On Synthetic Data

We first synthesise a dataset whose dynamics we have full access to, from which a true theoretical optimal policy $\pi^*$ can be computed by fully model-reliant approaches such as dynamic programming. Note that even with synthetic data, we choose to learn from an existing (simulated) dataset, to test the algorithm's capability for off-policy learning. The dataset is further divided into two mutually exclusive subsets for algorithm development and test. The behaviour policy $\pi_b$ that dictates actions during data generation is made systematically suboptimal relative to $\pi^*$ via $\epsilon$-greedy action selection.⁶

⁶ $\epsilon$ decides the fraction of occasions on which the agent explores actions randomly instead of sticking to the optimal one.


In our synthetic data, the action space contains six discrete (categorical) actions. Observed data are generated from a Gaussian mixture model, with state transition probabilities denser between closer states. All parameters of the data-generating process are unknown to the reinforcement learning agent.

Fig. 3. Action selections in the test set under the proposed/optimal/behaviour policy (shown is $\epsilon = 0.3$) on synthetic data.

Fig. 3 visualises the action selections under $\pi^*$, $\pi$, and $\pi_b$ for the first 200 time steps (to avoid clutter) of the test set. Note that the optimal and behaviour policies are discrete, while the target policy is continuous. A close resemblance between the target policy and the optimal policy can be observed, and there is no trace of the target policy varying with the behaviour policy.

On Retrospective ICU Data

We subsequently apply the methodology to the Medical Information Mart for Intensive Care III (MIMIC-III) [32], a publicly available de-identified electronic healthcare record database of patients in the ICUs of a US hospital. We include adult patients conforming to the international consensus sepsis-3 criteria [33], and exclude admissions where treatment was withdrawn or mortality was undocumented. This selection procedure yields 18,919 ICU admissions in total, which are further divided into development and test sets in the proportion 4:1. Time series data are temporally discretised into 4-hour time steps and aligned to the approximate time of onset of sepsis. Measurements within each 4h period are either averaged or summed according to their clinical implications. The outcome is mortality, either hospital or 90-day mortality, whichever is available.

The maximum dose of vasopressors (mcg/kg/min) and the total volume of intravenous fluids (mL/h) administered within each 4h period define our action space. Vasopressors include norepinephrine, epinephrine, vasopressin, dopamine and phenylephrine, and are converted to norepinephrine equivalents. Intravenous fluids include boluses and background infusions of crystalloids, colloids and blood products, and are normalised by tonicity. The patient variables of interest comprise demographics (age, gender⁷, weight, readmission to ICU⁷, Elixhauser premorbid status), vital signs (modified SOFA, SIRS, Glasgow coma scale, heart rate, systolic/mean/diastolic blood pressure, shock index, respiratory rate, SpO2, temperature), laboratory values (potassium, sodium, chloride, glucose, BUN, creatinine, magnesium, calcium, ionised calcium, carbon dioxide, SGOT, SGPT, total bilirubin, albumin, hemoglobin, white blood cell count, platelet count, PTT, PT, INR, pH, PaO2, PaCO2, base excess, bicarbonate, lactate), ventilation parameters (mechanical ventilation⁷, FiO2), fluid balance (cumulated intravenous fluid intake, mean vasopressor dose over 4h, urine output over 4h, cumulated urine output, cumulated fluid balance since admission), and other interventions (renal replacement therapy⁷, sedation⁷).

⁷ Binary variable.

Missing data in continuous patient variables are imputed via linear interpolation; binary variables are filled via sample-and-hold. All continuous variables are normalised to $[0, 1]$. To promote patient survival (discharge from the ICU), each transition to death is penalised with $-10$ and each transition to discharge is rewarded with $+10$; all non-terminal transitions receive zero reward.

Off-policy policy evaluation (OPPE) of a learned policy is usually performed via importance sampling, where one has to trade variance against bias. [34, 35, 36, 37] have extended this to more accurate estimators that minimise sources of estimation error for discrete action spaces. However, importance-sampling based approaches usually assume coverage by the behaviour policy $\pi_b$ (actions possible under the target policy $\pi$ have to be possible under $\pi_b$) in order to calculate the importance sampling ratio, which is mathematically meaningless in our case where both the target and behaviour policies are continuous. Instead of using OPPE to provide a theoretical policy evaluation, we focus on empirically evaluating our learned policy by examining how the similarity between clinicians' decisions and our suggestions relates to patient outcomes: this provides an empirical validation and is commonly adopted [8, 11] for medical scenarios involving retrospective datasets.

Fig. 4. Distributions of returns vs. action deviations. (Left) Distributions of returns for different levels of average absolute vasopressor deviation between clinicians and the proposed policy per time step. The uppermost subplot shows empirical outcomes for patients whose actually received vasopressors deviated per time step by less than the 1/3 quantile of overall vasopressor deviations (ascending) in the test set, and the lowermost subplot by more than the 2/3 quantile. (Right) The intravenous fluid counterparts.

Fig. 4 shows probability mass functions (histograms) of the returns of start states (i.e. $\gamma^{T-1} r_{T-1}$, $T$ being the length of the time series) in the test set, divided into three mutually exclusive groups according to the average (per time step) absolute deviation between the clinicians' decisions and the proposed dose, in terms of vasopressors or intravenous fluids, within each episode (i.e. individual patient). The boundaries between adjacent groups are set to the terciles (shown as the grey dotted vertical lines in Fig. 5) of the whole test dataset for each drug, to reflect equal weighting. For both drugs, higher returns are more likely to be obtained when doctors behave more closely to our suggestions.

The distributions of action deviations between the clinicians and the proposed policy for each drug, for survivors and non-survivors, with bootstrapping (random sampling with replacement) estimations in Fig. 5, demonstrate that among survivors our proposed policy captures the doctors' decisions most of the time, while the same is not true of non-survivors, especially for intravenous fluids.

Fig. 5. Distributions of action deviations with bootstrapping for all survivors, non-survivors, and overall patients in the test set, plotted separately, with the terciles for each drug plotted as two grey dotted vertical lines separating the whole test set into three groups with equal patient numbers.

5 Conclusion

This article provides an online POMDP solution that takes uncertainty and history information into account in clinical applications. Our proposed policy is capable of dictating near-optimal dosages of vasopressors and intravenous fluids in a continuous action space; behaving similarly to it would lead to significantly better patient outcomes than those in the original retrospective dataset.

Further research directions include investigating inverse reinforcement learning to recover the reward function that clinicians were conforming to, modelling states/observations with non-trivial distributions to more appropriately extract genuine physiological states, and framing the problem as a multi-objective MDP to incorporate multiple criteria.

Our overall aim is to develop clinical decision support systems that provide clinicians with dynamic treatment planning given the previous course of patient measurements and medical interventions, enhancing clinical decision making, not replacing it.

References

[1] V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, and J. Cuadros. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22):2402-2410, 2016.

[2] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017.

[3] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. van der Laak, B. van Ginneken, and C. I. Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60-88, 2017.

[4] D. Ernst, G. B. Stan, J. Goncalves, and L. Wehenkel. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control, pages 667-672, Dec 2006.

[5] M. K. Bothe, L. Dickens, K. Reichel, A. Tellmann, B. Ellger, M. Westphal, and A. A. Faisal. The use of reinforcement learning algorithms to meet the challenges of an artificial pancreas. Expert Review of Medical Devices, 10(5):661-73, 2013.

[6] C. Lowery and A. A. Faisal. Towards efficient, personalized anesthesia using continuous reinforcement learning for propofol infusion control. In 2013 6th International IEEE/EMBS Conference on Neural Engineering (NER), pages 1414-1417, Nov 2013.

[7] S. M. Shortreed, E. Laber, D. J. Lizotte, T. S. Stroup, J. Pineau, and S. A. Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine Learning, 84(1):109-136, Jul 2011.

[8] N. Prasad, L. Cheng, C. Chivers, M. Draugelis, and B. E. Engelhardt. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. In Proceedings of Uncertainty in Artificial Intelligence (UAI), 2017.

[9] H. Asoh, M. Shiro, S. Akaho, T. Kamishima, K. Hasida, E. Aramaki, and T. Kohro. An application of inverse reinforcement learning to medical records of diabetes treatment. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Sep 2013.

[10] D. J. Lizotte and E. B. Laber. Multi-objective Markov decision processes for data-driven decision support. Journal of Machine Learning Research, 17(211):1-28, 2016.

[11] S. Nemati, M. M. Ghassemi, and G. D. Clifford. Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 2978-2981, 2016.

[12] J. Pineau, G. J. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, pages 1025-1032, 2003.

[13] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1st edition, 1957.

[14] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, pages 1057-1063, 1999.

[15] E. J. Sondik. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2):282-304, 1978.

[16] M. T. J. Spaan. A point-based POMDP algorithm for robot planning. In Proceedings of the 2004 IEEE International Conference on Robotics and Automation (ICRA), volume 3, pages 2399-2404, April 2004.

[17] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In Proceedings of the Twelfth International Conference on Machine Learning, pages 362-370, 1995.

[18] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic domains. In Twelfth National Conference on Artificial Intelligence (AAAI-94), pages 1023-1028, 1994.

[19] M. Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13(1):33-94, August 2000.

[20] S. Paquet, B. Chaib-draa, and S. Ross. Hybrid POMDP algorithms. In Proceedings of the Workshop on Multi-Agent Sequential Decision Making in Uncertain Domains (MSDM-2006), 2006.

[21] D. A. McAllester and S. Singh. Approximate planning for factored POMDPs using belief state simplification. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 409-416, 1999.

[22] M. Kearns, Y. Mansour, and A. Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2-3):193-208, November 2002.

[23] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 520-527, 2004.

[24] R. Washington. BI-POMDP: Bounded, incremental partially-observable Markov-model planning. In Proceedings of the 4th European Conference on Planning (ECP), pages 440-451. Springer, 1997.

[25] S. Ross and B. Chaib-draa. AEMS: An anytime online search algorithm for approximate policy refinement in large POMDPs. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2592-2598, 2007.

[26] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.

[27] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA, 2012.

[28] W. Buntine and M. Hutter. A Bayesian view of the Poisson-Dirichlet process. arXiv:1007.0296v2, 2012.

[29] A. H. Klopf. A drive-reinforcement model of single neuron function: An alternative to the Hebbian neuronal model. AIP Conference Proceedings, 151(1):265-270, 1986.

[30] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[31] T. Degris, M. White, and R. S. Sutton. Off-policy actor-critic. In ICML, 2012.

[32] A. E. W. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 2016.

[33] M. Singer, C. S. Deutschman, C. Seymour, et al. The third international consensus definitions for sepsis and septic shock (sepsis-3). JAMA, 315(8):801-810, 2016.

[34] N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pages 652-661, Jun 2016.

[35] A. R. Mahmood, H. P. van Hasselt, and R. S. Sutton. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems 27, pages 3014-3022, 2014.

[36] D. Precup, R. S. Sutton, and S. P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 759-766, 2000.

[37] P. S. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In ICML, 2016.