The Actor Search Tree Critic (ASTC) for Off-Policy
POMDP Learning in Medical Decision Making
Luchen Li
Imperial College London
l.li17@imperial.ac.uk
Matthieu Komorowski
Imperial College London
matthieu.komorowski@gmail.com
A. Aldo Faisal
Imperial College London
a.faisal@imperial.ac.uk
Abstract
Off-policy reinforcement learning enables learning a near-optimal policy from suboptimal
experience, thereby providing opportunities for artificial intelligence applications in
healthcare. Previous works have mainly framed patient-clinician interactions as
Markov decision processes, while true physiological states are not necessarily fully
observable from clinical data. We capture this situation with a partially observable
Markov decision process, in which an agent optimises its actions given a belief
represented as a distribution over patient states inferred from individual history
trajectories. A Gaussian mixture model is fitted to the observed data. Moreover, we
take into account the fact that nuances in pharmaceutical dosage can result in
significantly different effects by modelling a continuous policy through
a Gaussian approximator directly in the policy space, i.e. the actor. To address
the challenge that the infinite number of possible belief states renders exact
value iteration intractable, we evaluate and plan only for each encountered belief,
through a heuristic search tree that maintains tight lower and upper bounds on the
true value of the belief. We further resort to function approximation to update the value-bound
estimates, i.e. the critic, so that the tree search can be improved through
more compact bounds at the fringe nodes, which are back-propagated to the root.
Both actor and critic parameters are learned via gradient-based approaches. Our
proposed policy, trained from real intensive care unit data, is capable of dictating
dosing of vasopressors and intravenous fluids for sepsis patients that leads to the
best patient outcomes.
1 Introduction
Many recent examples [1, 2, 3] have demonstrated that machine learning can deliver above-human
performance in classification-based diagnostics. However, key to medicine is diagnosis paired with
treatment, i.e. the sequential decisions that have to be made by clinicians for patient treatment.
Automatic treatment optimisation based on reinforcement learning has been explored in simulated
patients for HIV therapy using fitted-Q iteration [4], dynamic insulin dosage in diabetes using
model-based reinforcement learning [5], and anaesthesia depth control using actor-critic methods [6].
We focus on principled closed-loop approaches to learn a near-optimal treatment policy from vast
electronic healthcare records (EHRs). Previous works mainly modelled this problem as a Markov
decision process (MDP) and learned a policy of actions $\pi(a \mid s) = P(a \mid s)$ based on estimated values
$v(s)$ of possible actions $a$ in each state $s$. [7] investigated fitted Q-iteration with linear function
approximation to optimise the treatment of schizophrenia. [8] compared different supervised learning
methods to approximate action values, including neural networks, and successfully predicted the
weaning of mechanical ventilation and sedation dosages. In addition to crafting a reward function that
reflects domain knowledge, another avenue explored inverse reinforcement learning to recover one
from expert behaviours. For example, [9] proposed hemoglobin-A1c dosages for diabetes patients by
implementing Markov chain Monte Carlo sampling to infer the posterior distribution over rewards given
observed states and actions. Further, multiple-objective optimal treatments were explored [10] with
non-deterministic fitted-Q to provide guidelines on antipsychotic drug treatments for schizophrenia
patients. [11] inferred hidden states via a discriminative hidden Markov model, and investigated a
deep fitted-Q variant to learn heparin dosages to manage thrombosis.
However, the true physiological state of the patient is not necessarily fully observable by clinical
measurement techniques. Practically, this observability is further restricted by the actual subset of
measures taken in hospital, restricting both the nature and frequency of data recorded and omitting
information readily visible to the clinician. These latent aspects are especially salient in intensive care
units (ICUs). Strictly speaking, the decision making process acting on the patient state should therefore
be mathematically formulated as a partially observable MDP (POMDP). Consequently, the
true patient state is latent, and the patient's physiological belief state $b$ can only be represented as
a probability distribution over states given the history of observations and actions. Exhaustive and
exact POMDP solutions are computationally intractable because belief states constitute a continuous
hyperplane that contains infinitely many possibilities. We alleviate this obstacle by evaluating the sequentially
encountered belief states: we learn both upper and lower bounds of their true values and
explore reachable sequences using heuristic search trees to obtain locally optimal value estimates
$\hat{v}_\tau(b)$. Value bounds are learned within the tree search by repeatedly back-propagating the bounds
of newly expanded nodes to the root; in previous work such bounds are usually computed offline and remain static. We model
the lower and upper value bounds as function approximators so that the tree search can be improved
through more accurate value bounds at the fringe nodes as we process the incoming data. Moreover,
to realise continuous action spaces (e.g. millilitres of medicine dripped per hour) we explicitly use
a continuous policy implemented through function approximation. We embed these features in an
actor-critic framework and update the parameters of the policy (the actor) and of the value bounds (the
critic) via gradient-based methods.
2 Preliminaries
POMDP Model
A POMDP framework can be represented by the tuple $\{\mathcal{S}, \mathcal{A}, \mathcal{O}, T, \Omega, R, \gamma, b_0\}$ [12], where
$\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $\mathcal{O}$ the observation space, $T$ the stochastic
state transition function $T(s', s, a) = P(s_{t+1} = s' \mid s_t = s, a_t = a)$, $\Omega$ the stochastic observation
function $\Omega(o, s', a) = P(o_{t+1} = o \mid s_{t+1} = s', a_t = a)$, $R$ the immediate reward function $r = R(s, a)$,
$\gamma \in [0, 1)$ the discount factor indicating the weighing of the present value of future rewards, and
$b_0$ the agent's initial knowledge before receiving any information.
Fig. 1. (left) Graphical model of the POMDP. (right) Tree search. Each circle node represents a belief
state, each dotted node an action-belief-state pair. (a) Select the best fringe node to expand; (b) expand
the selected node by choosing the action with the maximal upper value bound and considering all
possible observations, and back-propagate the value bounds through all its ancestors to the root; (c) no
revision is required for previous action choices, so expand the next best fringe node; (d) a previous
action choice (the example shows horizon 1) is no longer optimal after the latest expansion, so the
new optimal action is selected.
The agent's belief state is represented by the probability distribution over states given historical
observations and actions, $b_t(s) = P(s \mid b_0, o_1, a_1, \dots, a_{t-1}, o_t)$. Executing $a$ in belief $b$ and receiving
observation $o$, the new belief is updated through:
$$b'(s') = \tau(b, a, o) = \eta\, \Omega(o, s', a) \sum_{s \in \mathcal{S}} T(s', s, a)\, b(s) \tag{1}$$
where $\eta = 1 / P(o \mid b, a)$ is a normalisation constant, and
$$P(o \mid b, a) = \sum_{s' \in \mathcal{S}} \Omega(o, s', a) \sum_{s \in \mathcal{S}} T(s', s, a)\, b(s) \tag{2}$$
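For concreteness, the following is a minimal NumPy sketch of the belief update in Eqs. 1-2, assuming tabular transition and observation models stored as arrays indexed by action; the array layout and the names `belief_update`, `T`, `Omega` are illustrative rather than taken from the paper.

```python
import numpy as np

def belief_update(b, a, o, T, Omega):
    """Belief update of Eqs. (1)-(2) for tabular models.

    b     : (|S|,) array, current belief over states.
    T     : dict mapping action -> (|S|, |S|) array, T[a][s_next, s] = P(s_next | s, a).
    Omega : dict mapping action -> (|O|, |S|) array, Omega[a][o, s_next] = P(o | s_next, a).
    Returns the updated belief b' and the observation likelihood P(o | b, a).
    """
    predicted = T[a] @ b                      # sum_s T(s', s, a) b(s)
    unnormalised = Omega[a][o] * predicted    # Omega(o, s', a) * predicted mass
    p_o = unnormalised.sum()                  # Eq. (2): P(o | b, a)
    if p_o == 0.0:                            # observation impossible under this belief/action
        return unnormalised, 0.0
    return unnormalised / p_o, p_o            # Eq. (1), with eta = 1 / P(o | b, a)
```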
The optimal policy $P(a \mid b) = \pi(a \mid b)$ specifies the best action to select in a belief, its value function
being updated via the fixed point of the Bellman equation [13]:
$$v(b) = \max_{a \in \mathcal{A}} \left[ R_B(b, a) + \gamma \sum_{o \in \mathcal{O}} P(o \mid b, a)\, v(\tau(b, a, o)) \right] \tag{3}$$
where $R_B(b, a) = \sum_{s \in \mathcal{S}} R(s, a)\, b(s)$ is the probability-weighted immediate reward. A graphical model of a
fragment of the POMDP framework is shown in Fig. 1 (left).
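Building on the sketch above, a single point-wise Bellman backup of Eq. 3 for one belief could look as follows; this is again a sketch under the same tabular assumptions, reusing `belief_update` from the previous snippet, with `value_fn` standing in for whatever value estimate is available for successor beliefs (e.g. the bounds introduced in Section 3).

```python
def bellman_backup(b, T, Omega, R, gamma, value_fn, actions, observations):
    """One-step Bellman backup of Eq. (3) for a single belief point b.

    R        : dict mapping action -> (|S|,) array of rewards R(s, a).
    value_fn : callable estimating the value of a successor belief.
    Returns the backed-up value of b and the corresponding greedy action.
    """
    best_value, best_action = float("-inf"), None
    for a in actions:
        q = float(R[a] @ b)                              # R_B(b, a) = sum_s R(s, a) b(s)
        for o in observations:
            b_next, p_o = belief_update(b, a, o, T, Omega)
            if p_o > 0.0:
                q += gamma * p_o * value_fn(b_next)      # gamma * P(o|b,a) * v(tau(b,a,o))
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action
```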
Continuous Control
Policy gradient methods can go beyond the limit of a finite action space and
achieve continuous control. Instead of choosing actions based on action-value estimates, a policy
is directly optimised in the policy space. Our policy $\pi(a \mid s, \mathbf{u}) = F_{\text{actor}}(\mathbf{x}(s), \mathbf{u})$ is a function of the
state feature vector $\mathbf{x}(s)$ parameterised by the weight vector $\mathbf{u}$. The objective function measuring the
performance of policy $\pi$ is defined as the value, or expected total future reward, of the start state:
$$J(\pi) = \mathbb{E}_\pi[G_0 \mid s_0] = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{T-1} \gamma^t r_t \,\Big|\, s_0 \right] \tag{4}$$
where $G_0$ is the return $G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k}$ at $t = 0$. According to the policy gradient theorem [14],
the gradient of $J$ w.r.t. $\mathbf{u}$ is:
$$\nabla_{\mathbf{u}} J(\pi) = \sum_{s \in \mathcal{S}} P^{\text{stat}}_\pi(s) \sum_{a \in \mathcal{A}} q_\pi(s_t, a)\, \nabla_{\mathbf{u}} \pi(a \mid s_t, \mathbf{u}_t) = \mathbb{E}_\pi\!\left[ G_t \frac{\nabla_{\mathbf{u}} \pi(a_t \mid s_t, \mathbf{u}_t)}{\pi(a_t \mid s_t, \mathbf{u}_t)} \right] \tag{5}$$
(Footnote 1: the second equality holds because $\nabla_{\mathbf{u}} J(\pi) = \sum_{s \in \mathcal{S}} P^{\text{stat}}_\pi(s) \sum_{a \in \mathcal{A}} q_\pi(s_t, a) \nabla_{\mathbf{u}} \pi(a \mid s_t, \mathbf{u}_t) = \mathbb{E}_\pi\big[\sum_{a \in \mathcal{A}} q_\pi(s_t, a) \nabla_{\mathbf{u}} \pi(a \mid s_t, \mathbf{u}_t)\big] = \mathbb{E}_\pi\big[\sum_{a \in \mathcal{A}} \pi(a \mid s_t, \mathbf{u}_t)\, q_\pi(s_t, a) \frac{\nabla_{\mathbf{u}} \pi(a \mid s_t, \mathbf{u}_t)}{\pi(a \mid s_t, \mathbf{u}_t)}\big] = \mathbb{E}_\pi\big[q_\pi(s_t, a_t) \frac{\nabla_{\mathbf{u}} \pi(a_t \mid s_t, \mathbf{u}_t)}{\pi(a_t \mid s_t, \mathbf{u}_t)}\big] = \mathbb{E}_\pi\big[G_t \frac{\nabla_{\mathbf{u}} \pi(a_t \mid s_t, \mathbf{u}_t)}{\pi(a_t \mid s_t, \mathbf{u}_t)}\big]$.)
$P^{\text{stat}}_\pi(s)$ denotes the stationary distribution of states under policy $\pi$ (i.e. the chance a state will
be visited within an episode). $q_\pi(s_t, a_t) = \mathbb{E}_\pi[G_t \mid s_t, a_t]$ is the value of $a_t$ in state $s_t$. $\mathbf{u}$ can
subsequently be updated through gradient ascent in the direction (with step size $\alpha$) that maximally
increases $J$:
$$\mathbf{u}_{t+1} = \mathbf{u}_t + \alpha\, G_t \frac{\nabla_{\mathbf{u}} \pi(a_t \mid s_t, \mathbf{u}_t)}{\pi(a_t \mid s_t, \mathbf{u}_t)} \tag{6}$$
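As a worked illustration of Eq. 6, the sketch below applies the Monte-Carlo policy-gradient update once per step of a recorded episode. The helper `grad_log_pi` stands for $\nabla_{\mathbf{u}} \pi / \pi$ of whichever policy parameterisation is used (for instance the Gaussian form adopted later in Eq. 20); all names are illustrative.

```python
def reinforce_episode_update(u, episode, grad_log_pi, gamma, alpha):
    """Monte-Carlo policy-gradient update of Eq. (6), applied over one episode.

    episode     : list of (state_features, action, reward) tuples.
    grad_log_pi : callable (state_features, action, u) -> grad_u pi / pi.
    Returns the updated weight vector u.
    """
    rewards = [r for _, _, r in episode]
    for t, (x, a, _) in enumerate(episode):
        # return G_t = sum_k gamma^k r_{t+k}, as defined after Eq. (4)
        G_t = sum((gamma ** k) * r for k, r in enumerate(rewards[t:]))
        u = u + alpha * G_t * grad_log_pi(x, a, u)   # Eq. (6)
    return u
```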
3 Methodology
In this section, we introduce how our algorithm combines tree search with policy gradients and function
approximation to realise both a continuous action space and efficient online planning for POMDPs. We
derive the components of our algorithm as we go along this section and provide an overview
flowchart in Fig. 2.
Heuristic Search in POMDPs
The complexity of solving POMDPs is mainly due to the curse of
dimensionality, since a belief state over an $|\mathcal{S}|$-dimensional state space lies on an $(|\mathcal{S}| - 1)$-dimensional
continuous simplex, with all its elements summing to one, and the curse of history, since the belief
acknowledges previous observations and actions, whose combinations grow exponentially with the
planning horizon [12]. Exact value iteration for Eq. 3, as in conventional reinforcement learning, is
therefore computationally intractable.
The optimal value function $v(b)$ of a finite-horizon POMDP is piecewise linear and convex in the
belief state [15], represented by a set of $|\mathcal{S}|$-dimensional convex hyperplanes whose total number
grows exponentially. Most exhaustive algorithms for POMDPs are dedicated to learning either a lower
bound [12, 16] or an upper bound [17, 18, 19] of $v(b)$ by maintaining a subset of the aforementioned
hyperplanes. Tree-search based solutions usually prune away less likely observations or actions
[20, 21, 22], or expand fringe nodes according to a predefined heuristic [23, 24, 25].
Our methodology is compatible with any tree-search based POMDP solution. Here we build on the
heuristic tree search introduced in [25] to focus computation on every encountered belief (i.e.
plan at decision time) and explore only reachable sequences. Specifically, a search tree rooted at
the current belief $b_{\text{curr}}$ is built, whose value estimate is confined by its lower bound $\hat{v}_L(b_{\text{curr}})$ and
upper bound $\hat{v}_U(b_{\text{curr}})$, which after each step of look-ahead become tighter around the true optimal value
$v(b_{\text{curr}})$. Here we use a subscript on the belief state to denote its temporal position within the episode
of environmental experience, and a superscript the horizon explored for it in the tree search (analogously
for observations and actions). At each update during exploration, only the fringe node that contributes the
maximum error to the root $b^0_{\text{curr}}$ is expanded:
$$b^*_{\text{curr}} = \arg\max_{b^h_{\text{curr}},\, h \in \{0, \dots, H\}} \left[ \gamma^h \left( \hat{v}_U(b^h_{\text{curr}}) - \hat{v}_L(b^h_{\text{curr}}) \right) \prod_{i=0}^{h-1} P(o^i_{\text{curr}} \mid b^i_{\text{curr}}, a^i_{\text{curr}})\, \pi_\tau(a^i_{\text{curr}} \mid b^i_{\text{curr}}) \right] \tag{7}$$
where $H$ is the maximum horizon explored so far, $\hat{v}_U(b^h_{\text{curr}}) - \hat{v}_L(b^h_{\text{curr}})$ is the error at the fringe
node, $\prod_{i=0}^{h-1} P(o^i_{\text{curr}} \mid b^i_{\text{curr}}, a^i_{\text{curr}})\, \pi_\tau(a^i_{\text{curr}} \mid b^i_{\text{curr}})$ the probability of reaching it from the root, and
$\gamma^h$ the time discount (Fig. 1 (right) (a)). Once a node is expanded, the value-bound estimates of
all its ancestors are updated up to the root in a bottom-up fashion analogous to Eq. 3, substituting $v$ with
$\hat{v}_L$ or $\hat{v}_U$ as appropriate (Fig. 1 (right) (b)), and previous choices of actions in expanded
belief nodes along the path are revised (Fig. 1 (right) (c-d)) to ensure that the optimal action $a^i_{\text{curr}}$ in
$b^i_{\text{curr}}$ for $h \in \{0, \dots, H\}$ is always explored based on current estimates:
$$a^i_{\text{curr}} = \arg\max_{a \in \mathcal{A}} \left[ R_B(b^i_{\text{curr}}, a) + \gamma \sum_{o \in \mathcal{O}} P(o \mid b^i_{\text{curr}}, a)\, \hat{v}_U\!\left(\tau(b^i_{\text{curr}}, a, o)\right) \right] \tag{8}$$
Eq. 8 is the deterministic tree policy $\pi_\tau(a \mid b)$ that guides exploration within the tree. Each expansion
leads to more compact value bounds at the root. Tree exploration is terminated when the interval
between the value bound estimates at the root belief changes trivially or a time limit is reached.
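A minimal sketch of the fringe-selection rule of Eq. 7 is given below; the node structure and names are illustrative, and the expansion step (Eq. 8) together with the bound back-propagation would reuse a backup of the kind sketched after Eq. 3, applied separately to the lower and upper bounds.

```python
from dataclasses import dataclass, field

@dataclass
class BeliefNode:
    """One circle node of the search tree (fields are illustrative)."""
    belief: object               # belief vector at this node
    depth: int                   # horizon h from the root
    reach_prob: float            # prod_i P(o|b,a) * pi_tau(a|b) along the path from the root
    v_lower: float               # current lower bound estimate
    v_upper: float               # current upper bound estimate
    children: list = field(default_factory=list)

def select_fringe(root, gamma):
    """Eq. (7): pick the fringe node with the largest discounted,
    reachability-weighted bound gap, i.e. the node whose expansion can
    reduce the error at the root the most."""
    best, best_score = None, float("-inf")
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:                        # fringe (leaf) node
            score = (gamma ** node.depth) * (node.v_upper - node.v_lower) * node.reach_prob
            if score > best_score:
                best, best_score = node, score
        else:
            stack.extend(node.children)
    return best
```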
Gaussian States
Observed patient information is modelled as a Gaussian mixture, each observation
being generated from one of a finite set of Gaussian distributions that represent genuine physiological
states. The total number of latent states is decided by the Bayesian information criterion [26] through
cross-validation on the development set. The terminal state is observable and corresponds to either
patient discharge or death. Eq. 1 can be further expressed as:
$$b'(s') = \eta\, \frac{P^a(s' \mid o)\, P^a(o)}{P^a(s')} \sum_{s \in \mathcal{S}} T^a(s', s)\, b(s) = \eta'\, \frac{P^a(s' \mid o)}{P^a(s')} \sum_{s \in \mathcal{S}} T^a(s', s)\, b(s) \tag{9}$$
The superscript $a$ denotes a subset of the data divided according to the action taken during the
transition. (In our implementation, $P^a(s' \mid o)$ for $a \in \mathcal{A}$ is computed globally as $P(s' \mid o)$ for all
$o \in \mathcal{O}$, regardless of the action leading to it, because the impact of the observation on the distribution
of states is substantially larger than that of the action administered.) The division into subsets enables
parallel computation. $P^a(s' \mid o)$ is the posterior distribution of $s'$ when observing $o$, obtained from the
trained Gaussian mixture model.
The transition function $T^a(s', s)$ is learned by maximum a posteriori estimation [27] to allow possibilities for
transitions that did not occur in the development dataset.
Prior knowledge on transitions is modelled by Griffiths-Engen-McCloskey (GEM) distributions
[28] according to the relative Euclidean distances between state centroids; the elements of such a
distribution, if sorted, decrease exponentially and sum to one, reflecting higher probabilities of
transiting into similar states. Specifically, a GEM distribution is defined by a discount parameter $c_1$
and a concentration parameter $c_2$, and can be explained by a stick-breaking construction: break a
stick for the $k$-th time into two parts, whose length proportions conform to a Beta distribution:
$$V_k \sim \text{Beta}(1 - c_1,\; c_2 + k c_1), \qquad 0 \le c_1 < 1,\; c_2 > -c_1 \tag{10}$$
Then the length proportions of the broken-off parts relative to the whole stick are:
$$p_k = \begin{cases} V_1, & k = 1 \\ (1 - V_1)(1 - V_2)\cdots(1 - V_{k-1})\, V_k, & k = 2, 3, \dots \end{cases} \tag{11}$$
The probability vector $\mathbf{p}$ consisting of elements calculated from Eq. 11 constitutes a GEM distribution.
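The stick-breaking construction of Eqs. 10-11 can be sketched in a few lines of NumPy; the truncation to a fixed number of atoms and the function name are choices of this sketch (the paper additionally orders the elements according to centroid distances, which is not shown here).

```python
import numpy as np

def sample_gem(n, c1, c2, rng=None):
    """Truncated stick-breaking sample of a GEM(c1, c2) probability vector.

    c1 : discount parameter, 0 <= c1 < 1.
    c2 : concentration parameter, c2 > -c1.
    Returns a length-n probability vector; the mass not broken off after
    n - 1 breaks is lumped into the last entry so the vector sums to one.
    """
    rng = rng or np.random.default_rng()
    p = np.empty(n)
    remaining = 1.0
    for k in range(1, n):
        v_k = rng.beta(1.0 - c1, c2 + k * c1)   # Eq. (10)
        p[k - 1] = remaining * v_k              # Eq. (11)
        remaining *= 1.0 - v_k
    p[-1] = remaining                           # leftover stick mass (truncation)
    return p
```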
Actor-Critic
Including in Eq. 5 a baseline term $B(s_t)$ as a comparison with $G_t$ significantly
reduces the variance of the gradient estimates without changing the equality. (Footnote 3: because
$\sum_{a \in \mathcal{A}} B(s) \nabla_{\mathbf{u}} \pi(a \mid s, \mathbf{u}) = B(s) \nabla_{\mathbf{u}} \sum_{a \in \mathcal{A}} \pi(a \mid s, \mathbf{u}) = B(s) \nabla_{\mathbf{u}} 1 = 0$ for all $s \in \mathcal{S}$.) This baseline is required to
discern states; a natural candidate is the state value or its parametric approximation $\hat{v}(s, \mathbf{w}) =
F_{\text{critic}}(\mathbf{x}(s), \mathbf{w})$. Then Eq. 6 becomes:
$$\mathbf{u}_{t+1} = \mathbf{u}_t + \alpha \left[ G_t - \hat{v}(s_t, \mathbf{w}_t) \right] \frac{\nabla_{\mathbf{u}} \pi(a_t \mid s_t, \mathbf{u}_t)}{\pi(a_t \mid s_t, \mathbf{u}_t)} \tag{12}$$
The parametric policy $\pi(a \mid s, \mathbf{u})$ is called the actor, and the parametric value function $\hat{v}(s, \mathbf{w})$ the
critic.
The complete empirical return $G_t$ is only available at the end of each episode, and therefore at
each update we need to look forward to future rewards to decide the current theoretical estimate of
$G_t$. A close approximation of $G_t$ that is available at each decision moment, and thus enables
more data-efficient backward-view learning, is the $\lambda$-return: $G^\lambda_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$, with $\lambda$
specifying the relative decay rate among the returns $G_{t:t+n}$ available after various numbers of steps. Since
$G^\lambda_t - \hat{v}(s_t, \mathbf{w}_t) \approx r_t + \gamma \hat{v}(s_{t+1}, \mathbf{w}_t) - \hat{v}(s_t, \mathbf{w}_t)$ (denoted $\delta_t$), substituting $G_t$ with $G^\lambda_t$ in Eq. 12
yields:
$$\mathbf{u}_{t+1} = \mathbf{u}_t + \alpha\, \delta_t\, \mathbf{e}^{\mathbf{u}}_t \tag{13}$$
$$\mathbf{e}^{\mathbf{u}}_t = \gamma \lambda\, \mathbf{e}^{\mathbf{u}}_{t-1} + \frac{\nabla_{\mathbf{u}} \pi(a_t \mid s_t, \mathbf{u}_t)}{\pi(a_t \mid s_t, \mathbf{u}_t)} \tag{14}$$
$\mathbf{e}^{\mathbf{u}}$ is the eligibility trace [29] for $\mathbf{u}$, and is initialised to $\mathbf{0}$ at the start of every episode. Analogous update rules
apply to the critic.
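The backward-view update of Eqs. 13-14 amounts to two lines of arithmetic per time step, sketched below with illustrative names; `delta_t` stands in for $G^\lambda_t - \hat{v}(s_t, \mathbf{w}_t)$ and `grad_log_pi_t` for $\nabla_{\mathbf{u}} \pi / \pi$ at the current step.

```python
def actor_trace_step(u, e_u, grad_log_pi_t, delta_t, alpha, gamma, lam):
    """One backward-view actor update (Eqs. 13-14).

    e_u : eligibility trace for u, reset to the zero vector at episode start.
    Returns the updated pair (u, e_u).
    """
    e_u = gamma * lam * e_u + grad_log_pi_t   # Eq. (14): accumulate the trace
    u = u + alpha * delta_t * e_u             # Eq. (13): move along the traced gradient
    return u, e_u
```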
Actor Search Tree Critic
The history-dependent probabilistic belief state reflects the information
the agent needs to know about the current time step to optimise its decision. This belief state is
used as the state feature vector for the actor-critic, i.e. $\mathbf{x}(s) = b$. This is based on the notion that the state
mechanism is supposed to let similar samples update the weight parameters in similar directions, while
similar situations have similar distributions of states, with each component of the distribution implying
the responsibility for updating the corresponding component of the weights.
Our critic parameterises the lower and upper value bounds in the tree search, instead of parameterising
the value function as a whole (as done in conventional actor-critic methods). Note that the value bounds
at the fringe nodes are here updated as we parse the data for off-policy reinforcement learning; in
previous works these bounds would have been computed offline and not improved during online
planning. We use linear representations for the value bounds:
$$\hat{v}_L(b, \mathbf{w}^L) = \mathbf{w}^{L\top} b, \qquad \hat{v}_U(b, \mathbf{w}^U) = \mathbf{w}^{U\top} b \tag{15}$$
At each step $t$, a local (as opposed to a value function optimal for all belief states) optimal value is
estimated for the current belief $b_t$ through heuristic tree search with the fringe node values approximated
by $\mathbf{w}^L$ and $\mathbf{w}^U$, denoted $\hat{v}_\tau(b_t, \mathbf{w}^L_t, \mathbf{w}^U_t)$. The critic parameters are updated through stochastic
gradient descent (SGD) to adjust in the direction that most reduces the error on each training
example, by minimising the mean square error between the current approximation and its target
$\hat{v}_\tau(b_t, \mathbf{w}^L_t, \mathbf{w}^U_t)$:
$$\mathbf{w}^L_{t+1} = \mathbf{w}^L_t + \beta^L \left[ \hat{v}_\tau(b_t, \mathbf{w}^L_t, \mathbf{w}^U_t) - \hat{v}_L(b_t, \mathbf{w}^L) \right] b_t \tag{16}$$
(Footnote 4: $\mathbf{w}^L_{t+1} = \mathbf{w}^L_t - \tfrac{1}{2} \beta^L \nabla_{\mathbf{w}} \left[ \hat{v}_\tau(b_t, \mathbf{w}^L_t, \mathbf{w}^U_t) - \hat{v}_L(b_t, \mathbf{w}^L) \right]^2 = \mathbf{w}^L_t + \beta^L \left[ \hat{v}_\tau(b_t, \mathbf{w}^L_t, \mathbf{w}^U_t) - \hat{v}_L(b_t, \mathbf{w}^L) \right] \nabla_{\mathbf{w}} \hat{v}_L(b_t, \mathbf{w}^L) = \mathbf{w}^L_t + \beta^L \left[ \hat{v}_\tau(b_t, \mathbf{w}^L_t, \mathbf{w}^U_t) - \hat{v}_L(b_t, \mathbf{w}^L) \right] b_t$.)
$\beta^L$ is the step size for updating $\mathbf{w}^L$; similarly for $\mathbf{w}^U_t$. As the weight vector also has an impact on the
target $\hat{v}_\tau(b_t, \mathbf{w}^L_t, \mathbf{w}^U_t)$, which is ignored during the SGD update, the update is by definition a semi-gradient
method, which usually learns faster than full-gradient methods and, with linear approximators (Eq. 15), is
guaranteed to converge to (near) a local optimum under the standard stochastic approximation conditions
[30]:
$$\sum_{t=1}^{\infty} \beta_t = \infty, \qquad \sum_{t=1}^{\infty} \beta_t^2 < \infty \tag{17}$$
To ensure convergence, we set the step sizes at time step $t$ according to:
$$\beta^L_t = \frac{0.1}{\mathbb{E}\!\left[ \lVert b_t \rVert^2_{\mathbf{w}^L_t} \right]}, \qquad \beta^U_t = \frac{0.1}{\mathbb{E}\!\left[ \lVert b_t \rVert^2_{\mathbf{w}^U_t} \right]} \tag{18}$$
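A sketch of one critic update under these equations follows; it assumes the expectation in Eq. 18 is approximated by the current sample $\lVert b_t \rVert^2$ (an assumption of this sketch, since the paper does not spell out how the expectation is estimated), and the same function would be called once for the lower and once for the upper bound weights.

```python
def critic_bound_step(w, b_t, v_tree_target, base=0.1):
    """Semi-gradient update of one linear value bound (Eqs. 15-16, 18).

    w             : weight vector of the bound, v_hat(b) = w^T b.
    v_tree_target : tree-search estimate v_hat_tau(b_t, w^L_t, w^U_t) used as target.
    Returns the updated weight vector.
    """
    beta = base / float(b_t @ b_t)                       # Eq. (18), sample estimate of the norm term
    return w + beta * (v_tree_target - w @ b_t) * b_t    # Eq. (16)
```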
Fig. 2. ASTC algorithm flowchart.
To realise a continuous action space, the actor is modelled as a Gaussian distribution, with its mean
approximated as a linear function (for simplicity of gradient computation) of the weights and the belief state:
$$\pi(a \mid b, \mathbf{u}) = \mathcal{N}(\mathbf{u}^\top b,\, \sigma^2) \tag{19}$$
$\sigma$ is a hyperparameter for the standard deviation. In this circumstance the gradient in Eq. 14 is calculated as:
$$\frac{\nabla_{\mathbf{u}} \pi(a \mid b, \mathbf{u})}{\pi(a \mid b, \mathbf{u})} = \frac{1}{\sigma^2} \left( a - \mathbf{u}^\top b \right) b \tag{20}$$
The moment-wise error, or temporal difference (TD) error, that motivates the update to the actor in Eq. 13 is:
$$\delta_t = r_t + \gamma\, \hat{v}_\tau(b_{t+1}, \mathbf{w}^L_t, \mathbf{w}^U_t) - \hat{v}_\tau(b_t, \mathbf{w}^L_t, \mathbf{w}^U_t) \tag{21}$$
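For reference, sampling from the Gaussian actor of Eq. 19 and evaluating its score (Eq. 20) takes only a few lines; during off-policy training the action comes from the clinician rather than from sampling, so only the score computation is used there, while the sampling branch illustrates how the learned policy would act at deployment. Names are illustrative.

```python
import numpy as np

def gaussian_actor(b, u, sigma, rng=None):
    """Sample an action from the Gaussian actor (Eq. 19) and return the
    score grad_u pi / pi of Eq. (20) for that action."""
    rng = rng or np.random.default_rng()
    mean = float(u @ b)                      # linear mean u^T b
    a = rng.normal(mean, sigma)              # Eq. (19)
    score = (a - mean) * b / sigma ** 2      # Eq. (20)
    return a, score
```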
In off-policy reinforcement learning, as we use retrospective data generated by a behaviour policy
$\pi_b$ (the clinicians' actual treatment decisions) to optimise a target policy $\pi$ (our actor), the
actor-critic approach is tuned via importance sampling on the eligibility trace [31]:
$$\mathbf{e}^{\mathbf{u}}_t = \rho_t \left[ \gamma \lambda\, \mathbf{e}^{\mathbf{u}}_{t-1} + \frac{\nabla_{\mathbf{u}} \pi(a_t \mid b_t, \mathbf{u}_t)}{\pi(a_t \mid b_t, \mathbf{u}_t)} \right] \tag{22}$$
where $\rho_t = \frac{\pi(a_t \mid b_t, \mathbf{u}_t)}{\pi_b(a_t \mid b_t)}$ is the importance sampling ratio. (Footnote 5: $\pi_b(a_t \mid b_t)$ is actually equivalent
to $\pi_b(a_t \mid s_t)$; both $b_t$ and $s_t$ denote the environmental state at the same moment, depending on whether
the agent's notion of state is the belief or the fully observable state.) The importance sampling mechanism helps
ensure that we are not biased by differences between the two policies in choosing actions (e.g. optimal
actions may look different from the clinical actions actually taken).
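Putting Eqs. 13 and 20-22 together, one off-policy actor step could be sketched as below. The handling of the behaviour-policy density `pi_b_prob` and all names are assumptions of this sketch; the paper does not specify how $\pi_b$ is estimated from the retrospective data.

```python
import numpy as np

def off_policy_actor_step(u, e_u, b_t, a_t, sigma, pi_b_prob, delta_t, alpha, gamma, lam):
    """One off-policy actor update combining Eqs. (13), (20), (21 via delta_t) and (22).

    pi_b_prob : density of the clinician's action a_t under the behaviour policy.
    Returns the updated pair (u, e_u).
    """
    mean = float(u @ b_t)
    pi_prob = np.exp(-0.5 * ((a_t - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    score = (a_t - mean) * b_t / sigma ** 2          # Eq. (20)
    rho_t = pi_prob / pi_b_prob                      # importance sampling ratio
    e_u = rho_t * (gamma * lam * e_u + score)        # Eq. (22)
    u = u + alpha * delta_t * e_u                    # Eq. (13)
    return u, e_u
```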
4 Experimental Results
We train and test our algorithm separately on synthetic data and on real ICU patient data.
On Synthetic Data
We first synthesise a dataset for which we have full access to the dynamics, so that a true theoretically
optimal policy $\pi^*$ can be computed by fully model-reliant approaches such as dynamic programming.
Note that, although the data are synthetic, we choose to learn from an existing (simulated) dataset in
order to test the algorithm's capability for off-policy learning. The dataset is further divided into two
mutually exclusive subsets for algorithm development and testing. The suboptimality of our behaviour
policy $\pi_b$, which dictates actions during data generation, is systematically related to $\pi^*$ via
$\epsilon$-greedy selection, where $\epsilon$ determines the fraction of occasions on which the agent explores actions
randomly instead of sticking to the optimal one.
In our synthetic data, the action space contains six discrete (categorical) actions. Observed data are
generated from a Gaussian mixture model, with state transition probabilities denser for closer states.
All parameters used for data generation are unknown to the reinforcement learning agent.
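A behaviour policy of this kind is a one-liner; the sketch below assumes the optimal action is available for each belief during data generation (function name and signature are illustrative).

```python
import numpy as np

def epsilon_greedy_behaviour(optimal_action, num_actions, epsilon, rng=None):
    """epsilon-greedy behaviour policy used to generate the synthetic data:
    with probability epsilon pick one of the discrete actions uniformly at
    random, otherwise follow the optimal action."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))   # exploratory action
    return optimal_action                       # stick to pi*
```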
Fig. 3. Action selections in the test set under the proposed/optimal/behaviour policies (shown for $\epsilon = 0.3$;
synthetic data).
Fig. 3 visualises action selections under $\pi$, $\pi^*$, and $\pi_b$ for the first 200 time steps (to avoid clutter)
of the test set. Note that the optimal and behaviour policies are discrete, while the target policy is
continuous. A strict resemblance between the target policy and the optimal policy can be observed,
and there is no trace of the target policy varying with the behaviour policy.
On Retrospective ICU Data
We subsequently apply the methodology to the Medical Information
Mart for Intensive Care III (MIMIC-III) [32], a publicly available de-identified electronic healthcare
record database of patients in the ICUs of a US hospital. We include adult patients conforming to the
international consensus sepsis-3 criteria [33], and exclude admissions where treatment was withdrawn
or mortality was undocumented. This selection procedure leads to 18,919 ICU admissions in total,
which are further divided into development and test sets in a 4:1 proportion. Time series
data are temporally discretised into 4-hour time steps and aligned to the approximate time of onset
of sepsis. Measurements within each 4-hour period are either averaged or summed according to their clinical
implications. The outcome is mortality, either hospital or 90-day mortality, whichever is available.
The maximum dose of vasopressors (mcg/kg/min) and the total volume of intravenous fluids (mL/h)
administered within each 4-hour period define our action space. Vasopressors include norepinephrine,
epinephrine, vasopressin, dopamine and phenylephrine, and are converted to norepinephrine equivalents.
Intravenous fluids include boluses and background infusions of crystalloids, colloids and blood
products, and are normalised by tonicity. Patient variables of interest comprise demographics
(age, gender (binary), weight, readmission to ICU (binary), Elixhauser premorbid status), vital signs (modified
SOFA, SIRS, Glasgow coma scale, heart rate, systolic/mean/diastolic blood pressure, shock index,
respiratory rate, SpO2, temperature), laboratory values (potassium, sodium, chloride, glucose, BUN,
creatinine, magnesium, calcium, ionised calcium, carbon dioxide, SGOT, SGPT, total bilirubin,
albumin, hemoglobin, white blood cell count, platelet count, PTT, PT, INR, pH, PaO2, PaCO2, base
excess, bicarbonate, lactate), ventilation parameters (mechanical ventilation (binary), FiO2), fluid balance
(cumulated intravenous fluid intake, mean vasopressor dose over 4h, urine output over 4h, cumulated
urine output, cumulated fluid balance since admission), and other interventions (renal replacement
therapy (binary), sedation (binary)).
Missing data in continuous patient variables are imputed via linear interpolation; binary variables
are imputed via sample-and-hold. All continuous variables are normalised to $[0, 1]$. To promote
patient survival (discharge from the ICU), each transition to death is penalised by −10 and each transition to
discharge is rewarded with +10. All non-terminal transitions are zero-rewarded.
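The resulting reward signal is sparse and terminal-only; a minimal sketch of how it could be assigned to a 4-hour-binned admission is given below (the outcome encoding is an assumption of this sketch; the reward values are those stated above).

```python
def episode_rewards(num_steps, outcome):
    """Reward sequence for one ICU admission binned into 4-hour steps:
    zero everywhere except the terminal transition, which is +10 for
    discharge and -10 for death."""
    rewards = [0.0] * num_steps
    rewards[-1] = 10.0 if outcome == "discharge" else -10.0
    return rewards
```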
Off-policy policy evaluation (OPPE) of a learned policy is usually conferred via importance
sampling, where one has to trade off variance against bias. [34, 35, 36, 37] have extended this to
more accurate estimators that minimise sources of estimation error for discrete action spaces. However,
importance-sampling based approaches usually assume coverage by the behaviour policy $\pi_b$ (actions
possible under the target policy $\pi$ have to be possible under $\pi_b$) in order to calculate the importance sampling ratio, which
is mathematically meaningless in our case where both target and behaviour policies are continuous.
Instead of using OPPE to provide a theoretical policy evaluation, we focus on empirically evaluating our
learned policy by examining how the similarity between clinicians' decisions and our suggestions
relates to patient outcomes: this provides an empirical validation and is commonly adopted [8, 11]
for medical scenarios involving retrospective datasets.
Fig. 4. Distributions of returns vs. action deviations. (Left) Distributions of returns for different levels
of average absolute vasopressor deviation between the clinicians' and the proposed policy per time step. The
uppermost subplot shows empirical outcomes for patients whose actually received vasopressors
deviated per time step by less than the first tercile of all vasopressor deviations (sorted ascending) in the test set, and the
lowermost subplot by more than the second tercile. (Right) Intravenous fluid counterparts.
Fig. 4 shows probability mass functions (histograms) of the returns of start states (i.e. $\gamma^{T-1} r_{T-1}$, $T$
being the length of the time series) in the test set, divided into three mutually exclusive groups according
to the average (per time step) absolute deviation between the clinicians' decision and the proposed dose, in
terms of vasopressors or intravenous fluids within each episode (i.e. individual patient), respectively.
The boundaries between adjacent groups are set to the terciles (shown as the grey dotted vertical
lines in Fig. 5) of the whole test dataset for each drug, to reflect equal weighting. It can be observed that,
for both drugs, higher returns are more likely to be obtained when doctors behave more closely to our
suggestions.
The distributions of action deviations between the clinicians and the proposed policy for each drug,
for survivors and non-survivors, with bootstrapped (random sampling with replacement) estimates,
are shown in Fig. 5; they demonstrate that among survivors our proposed policy captures doctors' decisions most of
the time, while the same is not true for non-survivors, especially for intravenous fluids.
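The grouping behind Fig. 4 can be reproduced with a few lines of NumPy; this sketch assumes per-episode mean absolute dose deviations and start-state returns are already available as aligned arrays (all names are illustrative).

```python
import numpy as np

def group_returns_by_deviation(dev_per_episode, return_per_episode):
    """Split episodes into three equal-sized groups by their mean absolute
    dose deviation (tercile boundaries), mirroring the grouping of Fig. 4."""
    lo, hi = np.quantile(dev_per_episode, [1 / 3, 2 / 3])   # tercile boundaries
    return {
        "low deviation": return_per_episode[dev_per_episode <= lo],
        "mid deviation": return_per_episode[(dev_per_episode > lo) & (dev_per_episode <= hi)],
        "high deviation": return_per_episode[dev_per_episode > hi],
    }
```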
Fig. 5. Distributions of action deviations with bootstrapping for all survivors, non-survivors, and
overall patients in the test set, plotted separately, with the terciles for each drug plotted as two grey dotted
vertical lines separating the whole test set into three groups with equal numbers of patients.
5 Conclusion
This article provides an online POMDP solution that takes into account uncertainty and history information
in clinical applications. Our proposed policy is capable of dictating near-optimal dosages of
vasopressors and intravenous fluids in a continuous action space; behaving similarly to it would lead to
significantly better patient outcomes than those in the original retrospective dataset.
Further research directions include investigating inverse reinforcement learning to recover the reward
function that clinicians were conforming to, modelling states/observations with non-trivial distributions
to more appropriately extract genuine physiological states, and phrasing the problem as a multi-objective
MDP to absorb multiple criteria.
Our overall aim is to develop clinical decision support systems that provide clinicians with dynamic
treatment planning given the previous course of patient measurements and medical interventions,
enhancing clinical decision making, not replacing it.
References
[1]
V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner,
T. Madams, and J. Cuadros. Development and validation of a deep learning algorithm for detection of
diabetic retinopathy in retinal fundus photographs. Jama, 316(22):2402–2410, 2016.
[2]
A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun. Dermatologist-level
classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017.
[3]
G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. van der Laak, B. van
Ginneken, and C. I Sánchez. A survey on deep learning in medical image analysis. Medical image analysis,
42:60–88, 2017.
[4]
D. Ernst, G. B. Stan, J. Goncalves, and L. Wehenkel. Clinical data based optimal sti strategies for hiv: a
reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control,
pages 667–672, Dec 2006.
[5]
M. K. Bothe, L. Dickens, K. Reichel, A. Tellmann, B. Ellger, M. Westphal, and A. A. Faisal. The use
of reinforcement learning algorithms to meet the challenges of an artificial pancreas. Expert review of
medical devices, 10(5):661–73, 2013.
[6]
C. Lowery and A. A. Faisal. Towards efficient, personalized anesthesia using continuous reinforcement
learning for propofol infusion control. In 2013 6th International IEEE/EMBS Conference on Neural
Engineering (NER), pages 1414–1417, Nov 2013.
[7]
S. M. Shortreed, E. Laber, D. J. Lizotte, T. S. Stroup, J. Pineau, and S. A. Murphy. Informing sequential
clinical decision-making through reinforcement learning: an empirical study. Machine Learning,
84(1):109–136, Jul 2011.
[8]
N. Prasad, L. Cheng, C. Chivers, M. Draugelis, and B. E Engelhardt. A reinforcement learning approach
to weaning of mechanical ventilation in intensive care units. Proceedings of Uncertainty in Artificial
Intelligence (UAI), 2017.
[9]
H. Asoh, M. Shiro, S. Akaho, T. Kamishima, K. Hasida, E. Aramaki, and T. Kohro. An application of
inverse reinforcement learning to medical records of diabetes treatment. In European Conference on
Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Sep 2013.
[10]
D. J. Lizotte and E. B. Laber. Multi-objective markov decision processes for data-driven decision support.
Journal of Machine Learning Research, 17(211):1–28, 2016.
[11]
S. Nemati, M. M. Ghassemi, and G. D. Clifford. Optimal medication dosing from suboptimal clinical
examples: A deep reinforcement learning approach. 2016 38th Annual International Conference of the
IEEE Engineering in Medicine and Biology Society (EMBC), pages 2978–2981, 2016.
[12] J. Pineau, G. J. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for pomdps. In
IJCAI, pages 1025–1032, 2003.
[13] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1 edition, 1957.
[14]
R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning
with function approximation. In Proceedings of the 12th International Conference on Neural Information
Processing Systems, pages 1057–1063, 1999.
[15]
E. J. Sondik. The optimal control of partially observable markov processes over the infinite horizon:
Discounted costs. Operations Research, 26(2):282–304, 1978.
[16]
M. T. J. Spaan. A point-based pomdp algorithm for robot planning. In Proceedings. ICRA ’04. 2004 IEEE
International Conference on Robotics and Automation, volume 3, pages 2399–2404, April 2004.
[17]
M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environ-
ments: Scaling up. In Proceedings of the Twelfth International Conference on International Conference on
Machine Learning, pages 362–370, 1995.
[18]
A. R. Cassandra, L. P Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic
domains. Twelfth National Conference on Artificial Intelligence (AAAI-94), pages 1023–1028, 1994.
[19]
M. Hauskrecht. Value-function approximations for partially observable markov decision processes. J. Artif.
Int. Res., 13(1):33–94, August 2000.
[20]
S. Paquet, B. Chaib-draa, and S. Ross. Hybrid POMDP algorithms. In Proceedings of the Workshop on
Multi-Agent Sequential Decision Making in Uncertain Domains (MSDM-2006), 2006.
[21]
D. A. McAllester and S. Singh. Approximate planning for factored pomdps using belief state simplification.
In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 409–416, 1999.
[22]
M. Kearns, Y. Mansour, and A. Y. Ng. A sparse sampling algorithm for near-optimal planning in large
markov decision processes. Mach. Learn., 49(2-3):193–208, November 2002.
[23]
T. Smith and R. Simmons. Heuristic search value iteration for pomdps. In Proceedings of the 20th
Conference on Uncertainty in Artificial Intelligence, pages 520–527, 2004.
[24]
R. Washington. BI-POMDP: Bounded, incremental partially-observable Markov-model planning. In
Proceedings of the 4th European Conference on Planning (ECP), pages 440–451. Springer, 1997.
[25]
S. Ross and B. Chaib-Draa. Aems: An anytime online search algorithm for approximate policy refinement
in large pomdps. In Proceedings of the 20th International Joint Conference on Artifical Intelligence, pages
2592–2598, 2007.
[26] G. E. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, March 1978.
[27] K. P. Murphy. Machine learning: a probabilistic perspective. Cambridge, MA, 2012.
[28] W. Buntine and M. Hutter. A bayesian view of the poisson-dirichlet process. arXiv:1007.0296v2, 2012.
[29]
A. H. Klopf. A drive-reinforcement model of single neuron function: An alternative to the hebbian neuronal
model. AIP Conference Proceedings, 151(1):265–270, 1986.
[30] R. S. Sutton and A. G. Barto. Reinforcement Learning : An Introduction. MIT Press, 1998.
[31] T. Degris, M. White, and R. S. Sutton. Off-policy actor-critic. ICML, 2012.
[32]
A. E. W. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits,
L. Anthony Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 2016.
[33]
M. Singer, C.S. Deutschman, C. Seymour, et al. The third international consensus definitions for sepsis
and septic shock (sepsis-3). JAMA, 315(8):801–810, 2016.
[34]
N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings
of The 33rd International Conference on Machine Learning, volume 48, pages 652–661, Jun 2016.
[35]
A. R. Mahmood, H. P. van Hasselt, and R. S. Sutton. Weighted importance sampling for off-policy learning
with linear function approximation. In Advances in Neural Information Processing Systems 27, pages
3014–3022. 2014.
[36]
D. Precup, R. S. Sutton, and S. P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings
of the Seventeenth International Conference on Machine Learning, pages 759–766, 2000.
[37]
Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement
learning. In ICML, 2016.