CONCEPT-DRIVEN OFF-POLICY EVALUATION
Ritam Majumdar, Jack Teversham, Sonali Parbhoo
Imperial College London
{r.majumdar24,jack.teversham22,s.parbhoo}@imperial.ac.uk
ABSTRACT
Evaluating off-policy decisions using batch data poses significant challenges due
to limited sample sizes leading to high variance. To improve Off-Policy Evaluation
(OPE), we must identify and address the sources of this variance. Recent research
on Concept Bottleneck Models (CBMs) shows that using human-explainable con-
cepts can improve predictions and provide better understanding. We propose
incorporating concepts into OPE to reduce variance. Our work introduces a family
of concept-based OPE estimators, proving that they remain unbiased and reduce
variance when concepts are known and predefined. Since real-world applications
often lack predefined concepts, we further develop an end-to-end algorithm to
learn interpretable, concise, and diverse parameterized concepts optimized for
variance reduction. Our experiments with synthetic and real-world datasets show
that both known and learned concept-based estimators significantly improve OPE
performance. Crucially, we show that, unlike other OPE methods, concept-based
estimators are easily interpretable and allow for targeted interventions on specific
concepts, further enhancing the quality of these estimators.
1 INTRODUCTION
In domains like healthcare, education, and public policy, where interacting with the environment can
be risky, prohibitively expensive, or unethical (Sutton & Barto, 2018; Murphy et al., 2001; Mandel
et al., 2014), estimating the value of a policy from batch data before deployment is essential for the
practical application of RL. OPE aims to estimate the effectiveness of a specific policy, known as the
evaluation or target policy, using offline data collected beforehand from a different policy, known
as the behavior policy (e.g., Komorowski et al. (2018a); Precup et al. (2000); Thomas & Brunskill
(2016); Jiang & Li (2016)).
Importance sampling (IS) methods are a popular class of methods for OPE which adjust for distri-
butional mismatches between behavior and target policies by reweighting historical data, yielding
generally unbiased and consistent estimates (Precup et al., 2000). Despite their desirable properties
(Thomas & Brunskill, 2016; Jiang & Li, 2016; Farajtabar et al., 2018), IS methods often face high
variance, especially with limited overlap between behavioral samples and evaluation targets or in
data-scarce conditions. Evaluation policies may outperform behavior policies for specific individuals
or subgroups (Keramati et al., 2021b), making it misleading to rely solely on aggregate policy value
estimates. In practice, however, these groups are often unknown, prompting the need for methods to
learn interpretable characterizations of the circumstances where the evaluation policy benefits certain
individuals over others.
In this paper, we propose performing OPE using interpretable concepts (Koh et al., 2020; Madeira
et al., 2023) instead of relying solely on state and action information. We demonstrate that this
approach offers significant practical benefits for evaluation. These concepts can capture critical
aspects in historical data, such as key transitions in a patient’s treatment or features affecting short-
term outcomes that serve as proxies for long-term results. By learning interpretable concepts from
data, we introduce a new family of concept-based IS estimators that provide more accurate value
estimates and stronger statistical guarantees. Additionally, these estimators allow us to identify which
concepts contribute most to variance in evaluation. When the evaluation is unreliable, we can modify,
intervene on, or remove these high-variance concepts to assess how the resulting evaluation improves
(Marcinkevičs et al., 2024; Madeira et al., 2023).
Consider a physician treating two patients with similar disease dynamics. Although their blood
counts and oxygen levels may differ, their overall disease profiles might be alike. Therefore, if one
patient responds well to a particular treatment, the same treatment could potentially benefit the other.
By learning meaningful concepts based on disease profiles rather than individual symptoms at each
time point, we can more reliably evaluate which actions are likely to be effective. This is illustrated
in Figure 1.
Figure 1: Simple example of a state vs. a concept. In this scenario, the state is the viral load in a patient's blood, whereas the concept is defined as the viral load being above or below a certain threshold $x$. The concept divides patients into two groups, in which different treatments are administered, indicated by the frequency of syringes. We perform evaluation based on these two conceptual groups.
Our work makes the following key contributions: i) We introduce a new family of IS estimators based on interpretable concepts; ii) We derive theoretical conditions ensuring lower variance compared to existing IS estimators; iii) We propose an end-to-end algorithm for optimizing parameterized concepts when concepts are unknown, using OPE characteristics like variance; iv) We show, through synthetic and real experiments, that our estimators for both known and unknown concepts outperform existing ones; v) We interpret the learned concepts to explain OPE characteristics and suggest intervention strategies to further improve OPE estimates.
2 RELATED WORK
Off-Policy Evaluation. There is a long history of methods for performing OPE, broadly categorized
into model-based or model-free (Sutton & Barto, 2018). Model-based methods, such as the Direct
Method (DM), learn a model of the environment to simulate trajectories and estimate the policy value
(Paduraru, 2013; Chow et al., 2015; Hanna et al., 2017; Fonteneau et al., 2013; Liu et al., 2018b).
These methods often rely on strong assumptions about the parametric model for statistical guarantees.
Model-free methods, like IS, correct sampling bias in off-policy data through reweighting to obtain
unbiased estimates (e.g., Precup et al. (2000); Horvitz & Thompson (1952); Thomas & Brunskill
(2016)). Doubly robust (DR) estimators (e.g., Jiang & Li (2016); Farajtabar et al. (2018)) combine
model-based DM and model-free IS for OPE but may fail to reduce variance when both DM and IS
have high variance. Various methods have been developed to refine estimation accuracy in IS, such
as truncating importance weights and estimating weights from steady-state visitation distributions
(Liu et al., 2018a; Xie et al., 2019; Doroudi et al., 2017; Bossens & Thomas, 2024).
Off-Policy Evaluation based on Subgroups. Keramati et al. (2021b) extend OPE to estimate
treatment effects for subgroups and provide actionable insights on which subgroups may benefit
from specific treatments, assuming subgroups are known or identified using regression trees. Unlike
regression trees, which are limited in scalability, our approach employs CBMs to learn interpretable
concepts that directly characterize individuals, enabling a new family of IS estimators based on these
concepts. Similarly, Shen et al. (2021) propose reducing variance by omitting likelihood ratios for
certain states. Our work complements this by summarizing relevant trajectory information using
concepts, rather than omitting states irrelevant to the return. The advantage of using concepts as opposed to states is that, unlike state information, concepts can be easily interpreted and intervened on.
Marginalized Importance Sampling (MIS) estimators (Uehara et al., 2020; Liu et al., 2018a; Nachum
et al., 2019; Zhang et al., 2020b;a) mitigate the high variance of traditional IS by reweighting data
tuples using density ratios computed from state visitation at each time step. These estimators enhance
robustness by focusing on states with high visitation density ratios, thereby marginalizing out less
visited states. However, MIS has its challenges: computing density ratios can introduce high variance,
particularly in complex state spaces, and it obscures which aspects of the state space contribute
directly to variance. Some studies, such as Katdare et al. (2023) and Fujimoto et al. (2023), improve
MIS by decomposing density ratio estimation into components like large density ratio mismatch
and transition probability mismatch. Our work differs from MIS by characterizing states using
interpretable concepts rather than solely relying on density ratios. This approach enables targeted
interventions that enhance policy adjustments, leading to better returns and reduced variance in OPE.
Unlike MIS, our method provides interpretability, which becomes increasingly important as problem
complexity grows. Proposals for hybrid estimators, such as those in Pavse & Hanna (2022a), suggest
using low-dimensional abstraction of state spaces with MIS to manage high-dimensional spaces more
effectively. Our work differs in that we use concepts instead of state abstractions; these can be easily plugged into existing importance sampling OPE definitions, as elaborated in Section 4.2 and Appendix C.
Concept Bottleneck Models. Concept Bottleneck Models (Koh et al., 2020) are a class of prediction
models that first predict a set of human interpretable concepts, and subsequently use these concepts
to predict a downstream label. Variations of these models include learning soft probabilistic concepts
(Mahinpei et al., 2021), learning hierarchical concepts (Panousis et al., 2023) and learning concepts in
a semi-supervised manner (Sawada & Nakamura, 2022). The key advantage of these models is that they allow us to explicitly intervene on concepts and interpret what might happen to a downstream label if certain concepts were changed (Marcinkevičs et al., 2024). Unlike previous works, we leverage
this idea to introduce a new class of estimators for off-policy evaluation where we group trajectories
based on interpretable concepts which are relevant for the downstream evaluation task.
3 PRELIMINARIES
Concept Bottleneck Models. Conventional CBMs learn a mapping from input features $x \in \mathbb{R}^d$ to targets $y$ via interpretable concepts $c \in \mathbb{R}^k$, based on training data of the form $\{x_n, c_n, y_n\}_{n=1}^{N}$. This mapping is a composition of a mapping from inputs to concepts, $f : \mathbb{R}^d \to \mathbb{R}^k$, and a mapping from concepts to targets, $g : \mathbb{R}^k \to \mathbb{R}$. These may be trained via independent, sequential, or joint training (Marcinkevičs et al., 2024). Variations which consider learning concepts in a greedy fashion or in a semi-supervised way include Wu et al. (2022); Havasi et al. (2022).
Markov Decision Processes (MDP). An MDP is defined by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma, T)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ and $R : \mathcal{S} \times \mathcal{A} \to \Delta(\mathbb{R})$ are the transition and reward functions, $\gamma \in [0, 1]$ is the discount factor, and $T \in \mathbb{Z}^{+}$ is the fixed time horizon. A policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ maps each state to a probability distribution over actions in $\mathcal{A}$. A $T$-step trajectory following policy $\pi$ is denoted by $\tau = [(s_t, a_t, r_t, s_{t+1})]_{t=1}^{T}$, where $s_1 \sim d_1$, $a_t \sim \pi(s_t)$, $r_t \sim r(s_t, a_t)$, and $s_{t+1} \sim p(s_t, a_t)$. The value function of policy $\pi$, denoted $V^{\pi} : \mathcal{S} \to \mathbb{R}$, maps each state to the expected discounted sum of rewards starting from that state and following policy $\pi$: $V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t \mid s_1 = s\right]$.
Off-Policy Evaluation. In OPE, we have a dataset of $T$-step trajectories $\mathcal{D} = \{\tau^{(n)}\}_{n=1}^{N}$ independently generated by a behaviour policy $\pi_b$. Our goal is to estimate the value function of another evaluation policy, $\pi_e$. We aim to use $\mathcal{D}$ to produce an estimator, $\hat{V}^{\pi_e}$, that has low mean squared error, $\mathrm{MSE}(V^{\pi_e}, \hat{V}^{\pi_e}) = \mathbb{E}_{\mathcal{D} \sim P^{\tau}_{\pi_b}}\big[(V^{\pi_e} - \hat{V}^{\pi_e})^2\big]$. Here, $P^{\tau}_{\pi_b}$ denotes the distribution of trajectories $\tau$ under $\pi_b$, from which $\mathcal{D}$ is sampled.
4 CONCEPT-BASED OFF-POLICY EVALUATION
In this section, we formally define concepts, outline their desiderata, and present the corresponding OPE estimators. In the following sections, we divide our Concept-OPE studies into two parts: Section 5 covers scenarios where concepts are known from domain knowledge, while Section 6 addresses cases where concepts are unknown and must be learned by optimizing a parameterized representation.
4.1 FORMAL DEFINITION OF THE CONCEPT
Given a dataset $\mathcal{D} = \{\tau^{(n)}\}_{n=1}^{N}$ of $N$ $T$-step trajectories, let $\phi : \mathcal{S} \times \mathcal{A} \times \mathbb{R} \times \mathcal{S} \to \mathcal{C} \subseteq \mathbb{R}^d$ denote a function that maps trajectory histories $h_t$ to interpretable concepts in the $d$-dimensional concept space $\mathcal{C}$. This mapping results in the concept vector $c_t = [c^1_t, c^2_t, \ldots, c^d_t]$ at time $t$, defined as $\phi(h_t)$. These concepts can capture various vital information in the history $h_t$, such as transition dynamics, short-term rewards, influential states, and interdependencies in actions across timesteps. Without loss of generality, in this work we consider concepts $c_t$ to be functions of the current state $s_t$ only. This assumption covers the scenario where concepts capture important information based on the criticality of the state. The concept function $\phi$ satisfies the following desiderata: explainability, conciseness, better trajectory coverage, and diversity. A detailed description of the desiderata is provided in Appendix A.
4.2 CONCEPT-BASED ESTIMATORS FOR OPE
We introduce a new class of concept-based OPE estimators to formalize the application of concepts in
OPE. These estimators are adapted versions of their original non-concept-based counterparts. Here,
we present the results specifically for per-decision IS and standard IS estimators, as these serve as the
foundation for several other estimators. We also demonstrate in Appendix C how these methods can
be extended to other estimators.
Definition 4.1 (Concept-Based Importance Sampling, CIS).

$$\hat{V}^{CIS}_{\pi_e} = \frac{1}{N} \sum_{n=1}^{N} \rho^{(n)}_{0:T} \sum_{t=0}^{T} \gamma^t r^{(n)}_t; \qquad \rho^{(n)}_{0:T} = \prod_{t'=0}^{T} \frac{\pi^c_e(a^{(n)}_{t'} \mid c^{(n)}_{t'})}{\pi^c_b(a^{(n)}_{t'} \mid c^{(n)}_{t'})}$$
Definition 4.2 (Concept-Based Per-Decision Importance Sampling, CPDIS).

$$\hat{V}^{CPDIS}_{\pi_e} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} \gamma^t \rho^{(n)}_{0:t} r^{(n)}_t; \qquad \rho^{(n)}_{0:t} = \prod_{t'=0}^{t} \frac{\pi^c_e(a^{(n)}_{t'} \mid c^{(n)}_{t'})}{\pi^c_b(a^{(n)}_{t'} \mid c^{(n)}_{t'})}$$
Concept-based variants of IS replace the traditional IS ratio with one that leverages the concept $c_t$ at time $t$ instead of the state $s_t$. This enables customized evaluations for various concept types, such as: 1) subgroups with similar short-term outcomes, 2) cases with comparable state-visitation densities, and 3) subjects with high-variance transitions. Details on selecting concept types are in Appendix B.
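To make the estimators concrete, below is a minimal sketch of how CIS and CPDIS could be computed from batch trajectories. The trajectory format, the `concept_fn` implementing $\phi$, and policy objects exposing `prob(action, concept)` are illustrative assumptions for this sketch, not the paper's released implementation.

```python
import numpy as np

def concept_is_estimates(trajectories, pi_e, pi_b, concept_fn, gamma=0.99):
    """Compute CIS and CPDIS estimates (Definitions 4.1 and 4.2).

    trajectories: list of [(s_t, a_t, r_t, s_next), ...] lists.
    pi_e, pi_b: concept-conditioned policies exposing prob(a, c) (assumed interface).
    concept_fn: maps a state s_t to its concept c_t = phi(s_t).
    """
    cis_returns, cpdis_returns = [], []
    for traj in trajectories:
        rho, cpdis, discounted_return = 1.0, 0.0, 0.0
        for t, (s, a, r, _) in enumerate(traj):
            c = concept_fn(s)                          # c_t = phi(s_t)
            rho *= pi_e.prob(a, c) / pi_b.prob(a, c)   # cumulative ratio rho_{0:t}
            cpdis += (gamma ** t) * rho * r            # per-decision weighting (CPDIS)
            discounted_return += (gamma ** t) * r
        cis_returns.append(rho * discounted_return)    # full-trajectory weight rho_{0:T} (CIS)
        cpdis_returns.append(cpdis)
    return float(np.mean(cis_returns)), float(np.mean(cpdis_returns))
```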
5 CONCEPT-BASED OPE UNDER KNOWN CONCEPTS
We first consider the scenario where the concepts are known a priori from domain knowledge and human expertise. Such concepts automatically satisfy the desiderata defined in Appendix A.
5.1 THEORETICAL ANALYSIS OF KNOWN CONCEPTS
In this subsection, we discuss the theoretical guarantees of OPE under known concepts. We make
the completeness assumption where every action of a particular state has a non-zero probability of
appearing in the batch data. When this assumption is satisfied, we obtain unbiasedness and lower
variance when compared with traditional estimators. Proofs follow in Appendix D.
Assumption 5.1 (Completeness). $\forall s \in \mathcal{S}, a \in \mathcal{A}$: if $\pi_b(a \mid s), \pi^c_b(a \mid c) > 0$, then $\pi_e(a \mid s), \pi^c_e(a \mid c) > 0$.

This assumption states that if an action appears in the batch data with some probability, it also has some probability of being chosen under the evaluation policy.
Assumption 5.2. $\forall s \in \mathcal{S}, a \in \mathcal{A}$: $|\pi^c_e(a \mid c) - \pi_e(a \mid s)| < \beta$ and $|\pi^c_b(a \mid c) - \pi_b(a \mid s)| < \beta$.

This assumption constrains concept-based policies to be close to state-based policies, with a maximum allowable difference of $\beta$ chosen by the practitioner. It ensures that the evaluation policy $\pi^c_e$ under concepts remains reflective of the original policy $\pi_e$. If the practitioner is confident in the state representation, they may set a lower $\beta$ to find concepts that align closely with state policies; conversely, a higher $\beta$ allows more deviation between concept and state policies.
Theorem 5.3 (Bias). Under known concepts, when Assumption 5.1 holds, both $\hat{V}^{CIS}_{\pi_e}$ and $\hat{V}^{CPDIS}_{\pi_e}$ are unbiased estimators of the true value function $V^{\pi_e}$. (Proof: see Appendix D for details.)
Theorem 5.4 (Variance comparison with traditional OPE estimators). When $\mathrm{Cov}(\rho^c_{0:t} r_t, \rho^c_{0:k} r_k) \le \mathrm{Cov}(\rho_{0:t} r_t, \rho_{0:k} r_k)$, the variance of known concept-based IS estimators is lower than that of traditional estimators, i.e., $\mathbb{V}_{\pi_b}[\hat{V}^{CIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{IS}]$ and $\mathbb{V}_{\pi_b}[\hat{V}^{CPDIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{PDIS}]$. (Proof: see Appendix D.)
As noted in Jiang & Li (2016), the covariance assumption across timesteps is crucial yet challenging
for OPE variance comparisons. Concepts being interpretable allows a user to design policies which
align with this assumption, thereby reducing variance. We also compare concept-based estimators to
the MIS estimator, the gold standard for minimizing variance via steady-state distribution ratios.
Theorem 5.5 (Variance comparison with the MIS estimator). When $\mathrm{Cov}(\rho^c_{0:t} r_t, \rho^c_{0:k} r_k) \le \mathrm{Cov}\!\left(\frac{d_{\pi_e}(s_t, a_t)}{d_{\pi_b}(s_t, a_t)} r_t, \frac{d_{\pi_e}(s_k, a_k)}{d_{\pi_b}(s_k, a_k)} r_k\right)$, the variance of known concept-based IS estimators is lower than that of the MIS estimator, i.e., $\mathbb{V}_{\pi_b}[\hat{V}^{CIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{MIS}]$ and $\mathbb{V}_{\pi_b}[\hat{V}^{CPDIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{MIS}]$.
Finally, we evaluate the Cramér-Rao (CR) bounds on the MSE and quantify the tightening achieved using concepts.
Theorem 5.6 (Confidence bounds for concept-based estimators). The Cramér-Rao bound on the mean squared error of the CIS and CPDIS estimators under known concepts is tightened by a factor of $K^{2T}$, where $K$ is the ratio of the cardinality of the concept space to that of the state space.
High IS ratios arise from low behavior-policy probabilities $\pi_b$ due to poor batch sampling, leading to worst-case bounds. Concepts address this by better characterizing poorly sampled states, increasing their probabilities and reducing skewed IS ratios, thus tightening the bounds.
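For intuition on Theorem 5.6, under illustrative values not taken from the paper: a concept space one-tenth the size of the state space ($K = 0.1$) with horizon $T = 5$ would tighten the bound by $K^{2T} = 0.1^{10} = 10^{-10}$, showing how quickly coarser concept spaces sharpen the worst-case guarantee.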
5.2 EXPERIMENTAL SETUP AND METRICS
Environments: We consider a synthetic domain, WindyGridworld, and the real-world MIMIC-III dataset of acutely hypotensive ICU patients as our experimental domains for the rest of the paper.
WindyGridworld: We (as human experts) define the concept $c_t = \phi(\text{distance to target}, \text{wind})$ as a function of the distance to the target and the wind acting on the agent at a given state. This concept can take 25 unique values, ranging from 0 to 24. For example, $c_t = 0$ when $\text{distance to target} \in [15, 19] \times [15, 19]$ and $\text{wind} = [0, 0]$, where the first and second coordinates represent the horizontal and vertical features respectively. A detailed description of the known concepts is given in Appendix G.
MIMIC: The concept $c_t \in \mathbb{Z}^{15}$ represents a function of 15 different vital signs (interpretable features) of a patient at a given timestep. The vital signs considered are: Creatinine, FiO2, Lactate, Partial Pressure of Oxygen (PaO2), Partial Pressure of CO2, Urine Output, GCS score, and electrolytes such as Calcium, Chloride, Glucose, HCO3, Magnesium, Potassium, Sodium, and SpO2. Each vital sign is binned into 10 discrete levels, ranging from 0 (very low) to 9 (very high).
For example, a patient with the concept representation $[0, 2, 1, 1, 2, 0, 9, 5, 2, 0, 6, 2, 1, 5, 9]$ shows the following conditions: acute kidney injury (AKI; very low creatinine), severe hypoxemia (very low PaO2), metabolic alkalosis (very high SpO2), and critical electrolyte imbalances (low potassium and magnesium), along with severe hypoglycemia. The normal GCS score indicates preserved neurological function, but over-oxygenation and potential respiratory failure are likely. The combination of anuria, AKI, and hypoglycemia points strongly toward hypotension or shock as underlying causes.
Policy descriptions: For WindyGridworld, we run the PPO algorithm (Schulman et al., 2017) for 10k epochs and take the evaluation policy $\pi_e$ to be the policy at epoch 10k, while the behavior policy $\pi_b$ is the policy at epoch 5k. For MIMIC, we generate the behavior policy $\pi_b$ by running an Approximate Nearest Neighbors algorithm with 200 neighbors, using Manhattan distance as the distance metric. The evaluation policy $\pi_e$ involves a more aggressive use of vasopressors (10% more) compared to the behavior policy. See Appendix F for further details.
Metrics: For the synthetic domain, we measure bias, variance, mean squared error, and the effective sample size (ESS) to assess the quality of our concept-based OPE estimates. The ESS is defined as $N \times \frac{\mathbb{V}_{\pi_e}[\hat{V}^{\text{on-policy}}_{\pi_e}]}{\mathbb{V}_{\pi_b}[\hat{V}_{\pi_e}]}$, where $N$ is the number of trajectories in the off-policy data, and $\hat{V}^{\text{on-policy}}_{\pi_e}$ and $\hat{V}_{\pi_e}$ are the on-policy and OPE estimates of the value function, respectively. For MIMIC, where the true on-policy estimate is unknown due to the unknown transition dynamics and environment model, we consider only variance as the metric. Additionally, we compare the inverse propensity scores (IPS) under concepts and states to better underscore the reasons for variance reduction (Figures 3 and 10).
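As an illustrative computation of the ESS defined above (assuming per-trajectory return samples are available, e.g., from on-policy rollouts in the synthetic domain):

```python
import numpy as np

def effective_sample_size(ope_returns, on_policy_returns):
    """ESS = N * Var[V_on-policy] / Var[V_OPE], as defined above."""
    n = len(ope_returns)
    return n * np.var(on_policy_returns) / np.var(ope_returns)
```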
Figure 2: WindyGridworld: known concept-based estimators have lower variance and MSE and higher ESS than traditional OPE estimators, at the cost of higher bias. MIMIC: known concept-based estimators improve upon the variance.
Figure 3: Inverse propensity score comparisons under concepts and states. The distribution of IPS scores is skewed toward lower values under concepts relative to states, indicating that the variance reduction from concepts stems from the lowered IPS scores.
5.3 RESULTS AND DISCUSSION
Known concept-based estimators demonstrate reduced variance, improved ESS, and lower
MSE compared to traditional estimators, although they come with slightly higher bias. Figure 2
compares known-concept and traditional OPE estimators. We observe a consistent reduction in
variance and an increase in ESS across all sample sizes for the concept-based estimators. Although
our theoretical analysis suggests that known-concept estimators are unbiased, practical results indicate
some bias. While unbiased estimates are generally preferred, they can lead to higher errors when
the behavior policy does not cover all states. This issue is especially pronounced in limited data
settings, which are common in medical applications. Despite this bias-variance trade-off, the MSE
for concept-based OPE estimators shows a 1-2 order of magnitude improvement over traditional
estimators due to significant variance reduction. In the real-world MIMIC example, concept-based
estimators exhibit a variance reduction of one order of magnitude compared to traditional OPE
estimators. This demonstrates that categorizing diverse states—such as varying gridworld positions or
patient vital signs—into shared concepts based on common attributes improves OPE characterization.
The inverse propensity scores (IPS) are more left-skewed under concepts than under states. Figure 3 compares the IPS scores under concept and state estimators. We observe that lower IPS scores occur more frequently under concepts than under states. This indicates that the source of variance reduction in concept-based OPE lies in the lowering of the IPS scores, which is also supported theoretically by Theorem 5.4 when the rewards $r_t$ are fixed to 1. Similar results for $N = \{100, 300, 500, 1500, 2000\}$ can be found in Appendix I.
6 CONCEPT-BASED OPE UNDER UNKNOWN CONCEPTS
While domain knowledge and predefined concepts can enhance OPE, in real-world situations concepts
are typically unknown. In this section, we address cases where concepts are unknown and must be
estimated. We use a parametric representation of concepts via CBMs, which initially may not meet
the required desiderata. This section introduces a methodology to optimize parameterized concepts
to meet these desiderata, alongside improving OPE metrics like variance.
6.1 METHODOLOGY
Algorithm 1 outlines the training methodology. We split the batch trajectories $\mathcal{D}$ into training trajectories $\mathcal{T}_{\text{train}}$ and evaluation trajectories $\mathcal{T}_{\text{OPE}}$, with the evaluation policy $\pi_e$, the behavior policy $\pi_b$, and an OPE estimator (e.g., CIS/CPDIS) known beforehand.
Algorithm 1 Parameterized Concept-based Off-Policy Evaluation
Require: Trajectories $\{\mathcal{T}_{\text{train}}, \mathcal{T}_{\text{OPE}}\}$, policies $\{\pi_e, \pi_b\}$, OPE estimator.
Ensure: CBM parameters $\theta$, concept-policy parameters $\{\theta_b, \theta_e\}$ for $\tilde{\pi}^c$.
Initialize loss terms: $\{L_{\text{output}}, L_{\text{interpretability}}, L_{\text{diversity}}, L_{\text{OPE-metric}}, L_{\text{policy}}\} = 0$
1: while not converged do
2:   for trajectory in $\mathcal{T}_{\text{train}}$ do
3:     for $(s, a, r, s', o)$ in trajectory do  ▷ Choices for $o$: $s'$ (next state) / $r$ (next reward)
4:       $c', o' \leftarrow \text{CBM}(s)$  ▷ CBM predicts concept $c'$ and output label $o'$
5:       $L_{\text{output}} \mathrel{+}= C_{\text{output}}(o, o')$  ▷ e.g., MSE/cross-entropy between true and predicted next state
6:       $L_{\text{interpretability}} \mathrel{+}= C_{\text{interpretability}}(c')$  ▷ e.g., L1 loss over weights
7:       $L_{\text{diversity}} \mathrel{+}= C_{\text{diversity}}(c')$  ▷ e.g., cosine distance between sub-concepts
8:       $L_{\text{policy}} \mathrel{+}= C_{\text{policy}}(c')$  ▷ e.g., MSE/cross-entropy between predicted and true logits (Assumption 5.2)
9:     end for
10:  end for
11:  Returns $\leftarrow$ Estimator($\mathcal{T}_{\text{train}}, \pi_e, \pi_b$, CBM)  ▷ e.g., CIS/CPDIS
12:  $\text{Loss}(\theta, \theta_b, \theta_e) = L_{\text{output}} + L_{\text{interpretability}} + L_{\text{diversity}} + L_{\text{policy}} + C_{\text{OPE-metric}}(\text{Returns})$  ▷ e.g., variance
13:  Gradient descent on $\{\theta, \theta_b, \theta_e\}$ using $\text{Loss}(\theta, \theta_b, \theta_e)$
14: end while
15: Return concept OPE returns $\leftarrow$ Estimator($\mathcal{T}_{\text{OPE}}, \pi_e, \pi_b$, CBM)
We aim to learn our concepts using a CBM parameterized by $\theta$. The CBM maps states to outputs through an intermediary concept layer. In this work, the output $o$ is the next state, indicating that the bottleneck concepts capture transition dynamics. Other possible outputs could include short-term rewards, long-term returns, or any user-defined information of interest present in the batch data. In addition to learning concepts, we also learn parameterized concept policies $\tilde{\pi}^c$, which map concepts to actions and are parameterized by $\theta_b$ and $\theta_e$ for the behavior and evaluation policy respectively.
For each transition tuple $(s, a, r, s')$, the CBM computes a concept vector $c'$ and an output $o'$. Since the concepts are initially unknown, they do not inherently satisfy the concept desiderata and must be learned through constraints. Lines 5-7 impose soft constraints on the concepts to meet these desiderata via loss functions: the losses are updated based on output, interpretability, and diversity, with MSE used for $C_{\text{output}}$, L1 loss for $C_{\text{interpretability}}$, and cosine distance for $C_{\text{diversity}}$. In Line 8, we constrain the difference between the concept policies and the original policies to satisfy Assumption 5.2. For our experiments, we take $\beta = 0$; however, a user can choose a different value to allow more deviation between the concept policies $\tilde{\pi}^c$ and the original policies $\pi$. In Line 11, we evaluate the OPE estimator's returns based on the concepts at the current iteration with metrics like variance. The aggregate loss, $\mathrm{Loss}(\theta, \theta_b, \theta_e)$, guides gradient descent on the CBM parameters $\theta$ and the concept-policy parameters $\theta_b, \theta_e$. Finally, the OPE estimator is applied to $\mathcal{T}_{\text{OPE}}$ using the learned concepts, yielding concept-based OPE returns. Integrating multiple competing loss components makes this problem complex, and, to our knowledge, this is the first approach that incorporates the OPE metric directly into the loss function.
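As one possible realization of Algorithm 1, the sketch below implements a linear concept bottleneck and the loss terms of Lines 5-12 in PyTorch. The policy interfaces (`logits`, `state_logits`), the tensor shapes, and the differentiable `estimator` returning per-trajectory concept-based returns are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearCBM(nn.Module):
    """Linear concept bottleneck: state -> 4-d concept -> predicted next state."""
    def __init__(self, state_dim, concept_dim=4):
        super().__init__()
        self.state_to_concept = nn.Linear(state_dim, concept_dim, bias=False)  # c_t^i = w . f(s_t)
        self.concept_to_output = nn.Linear(concept_dim, state_dim)             # output o' (next state)

    def forward(self, s):
        c = self.state_to_concept(s)
        return c, self.concept_to_output(c)

def training_step(cbm, policy_b, policy_e, batch, estimator, optimizer):
    s, a, r, s_next = batch                                    # minibatch of transitions
    c, o_pred = cbm(s)
    loss_output = F.mse_loss(o_pred, s_next)                   # Line 5: output (next-state) loss
    loss_interp = cbm.state_to_concept.weight.abs().mean()     # Line 6: L1 sparsity over weights
    w = F.normalize(cbm.state_to_concept.weight, dim=1)        # Line 7: diversity via pairwise
    loss_div = (w @ w.T - torch.eye(w.shape[0])).abs().mean()  # cosine similarity of sub-concepts
    # Line 8 (beta = 0): match concept-policy logits to state-policy logits (Assumption 5.2)
    loss_policy = F.mse_loss(policy_b.logits(c), policy_b.state_logits(s)) \
                + F.mse_loss(policy_e.logits(c), policy_e.state_logits(s))
    # Lines 11-12: OPE metric; 'estimator' is assumed to return per-trajectory
    # CIS/CPDIS returns that are differentiable w.r.t. the concept policies.
    returns = estimator(cbm, policy_e, policy_b)
    loss = loss_output + loss_interp + loss_div + loss_policy + returns.var()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```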
6.2 THEORETICAL ANALYSIS OF UNKNOWN CONCEPTS
The theoretical implications of moving from known to unknown concepts differ mainly in the bias, and consequently in the MSE and the confidence bounds, as analyzed below. Proofs are given in Appendix E.
Theorem 6.1 (Bias). Under Assumptions 5.1 and 5.2, the unknown concept-based estimators are biased.

The change of measure from the probability distribution $\pi_b$ to $\pi^c_b$ is not applicable when moving from known to unknown concepts, leading to bias. In the special case where $\pi^c_b(\cdot \mid c_t) = \pi_b(\cdot \mid s_t)$, the estimator is unbiased.
Theorem 6.2 (Variance comparison with traditional OPE estimators). Under Assumption 5.2, when $\mathrm{Cov}(\rho^c_{0:t} r_t, \rho^c_{0:k} r_k) \le \mathrm{Cov}(\rho_{0:t} r_t, \rho_{0:k} r_k)$, the variance of concept-based IS estimators is lower than that of the traditional estimators, i.e., $\mathbb{V}_{\pi_b}[\hat{V}^{CIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{IS}]$ and $\mathbb{V}_{\pi_b}[\hat{V}^{CPDIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{PDIS}]$.
Theorem 6.3 (Variance comparison with the MIS estimator). Under Assumption 5.2, when $\mathrm{Cov}(\rho^c_{0:t} r_t, \rho^c_{0:k} r_k) \le \mathrm{Cov}\!\left(\frac{d_{\pi_e}(s_t, a_t)}{d_{\pi_b}(s_t, a_t)} r_t, \frac{d_{\pi_e}(s_k, a_k)}{d_{\pi_b}(s_k, a_k)} r_k\right)$, then, as with known concepts, the variance is lower than that of the MIS estimator, i.e., $\mathbb{V}_{\pi_b}[\hat{V}^{CIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{MIS}]$ and $\mathbb{V}_{\pi_b}[\hat{V}^{CPDIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{MIS}]$.
Figure 4: For both domains, unknown concept-based estimators show lower variance. In WindyGrid-
world, they improve MSE and ESS but exhibit higher bias compared to traditional OPE estimators.
Similar to known concepts, when the covariance assumption is satisfied, even unknown concept-based estimators can provide lower variance than traditional and MIS estimators. For known concepts, however, this assumption has to be verified by the practitioner, whereas for unknown concepts it can be encoded as a loss term in our methodology to implicitly reduce variance (Line 12).
Theorem 6.4 (Confidence bounds for concept-based estimators). The Cramér-Rao bound on the mean squared error of the CIS and CPDIS estimators loosens by $\epsilon(|\mathbb{E}_{\pi^c_e}[\hat{V}_{\pi_e}]|^2)$ under unknown concepts relative to known concepts. Here, $\mathbb{E}_{\pi^c_e}[\hat{V}_{\pi_e}]$ is the on-policy estimate of the concept-based IS (PDIS) estimator.
The confidence bounds for unknown concepts mirror those for known concepts, with the addition of a bias term whose maximum value is the true on-policy estimate of the estimator. This quantity is typically unknown in real-world scenarios and requires additional domain knowledge to mitigate.
6.3 EXPERIMENTAL SETUP
Environments, policy descriptions, metrics: same as in the known-concepts section (Section 5.2).
Concept representation: In both examples, we use a 4-dimensional concept $c_t \in \mathbb{R}^4$, where each sub-concept is a linear weighted function of human-interpretable features $f$, i.e., $c^i_t = w \cdot f(s_t)$, with $w$ optimized as previously discussed. Detailed descriptions of the features and of the optimized concepts after CBM training are provided in Appendix I. For MIMIC, the features $f$ are normalized vital signs, as threshold information for discretization is unavailable. For brevity, training and hyperparameter details are deferred to Appendix G.
6.4 RESULTS AND DISCUSSION
Optimized concepts using Algorithm 1 yield improvements across all metrics except bias compared to traditional OPE estimators. Significant improvements in variance, MSE, and ESS are observed for the WindyGridworld and MIMIC datasets, with gains of 1-2 and 2-3 orders of magnitude, respectively. This improvement is due to our algorithm's ability to identify concepts that satisfy the desiderata, including achieving variance reduction as specified in Line 12 of the algorithm. However, like known concepts, optimized concepts show higher bias than traditional estimators. This is because, unlike variance, bias cannot be optimized in the loss function without the true on-policy estimate, which is typically unavailable in real-world settings. As a result, external information may be essential for further bias reduction.
Optimized concepts yield improvements across all metrics besides bias over known-concept estimators. Our methodology achieves 1-2 orders of magnitude improvement in variance, MSE, and ESS compared to known concepts. This suggests that our algorithm can learn concepts that surpass human-defined ones in improving OPE metrics, which is particularly valuable with imperfect experts or in highly complex real-world scenarios where perfect expertise is unfeasible. However, these optimized concepts introduce higher bias, primarily because the training algorithm prioritized variance reduction over bias minimization. This bias could be reduced by incorporating variance regularization into the training process.
Optimized concepts are interpretable, show conciseness and diversity. We list the optimized
concepts in Appendix I. These concepts exhibit sparse weights, enhancing their conciseness, with
significant variation in weights across different dimensions of the concepts, reflecting diversity.
This work focuses on linearly varying concepts, but more complex concepts, such as symbolic
representations (Majumdar et al., 2023), could better model intricate environments.
7 INTERVENTIONS ON CONCEPTS FOR INSIGHTS ON EVALUATION
Concepts provide interpretations, allowing practitioners to identify sources of variance—an advantage
over traditional state abstractions like Pavse & Hanna (2022a). Concepts also clarify reasons behind
OPE characteristics, such as high variance, enabling corrective interventions based on domain
knowledge or human evaluation. We outline the details of performing interventions next.
7.1 METHODOLOGY
Given trajectory history $h_t$ and concept $c_t$, we define $c^{\text{int}}_t$ as the intervention (alternative) concept an expert proposes at time $t$. We define a criterion $\kappa : (h_t, c_t) \to \{0, 1\}$, a function constructed from domain expertise that takes $(h_t, c_t)$ as input and outputs a boolean value. This criterion determines whether an intervention should be performed on the current concept $c_t$. For example, if a practitioner has access to true on-policy values, they can estimate which concepts suffer from bias. If a concept does not suffer from bias, the criterion $\kappa(h_t, c_t) = 1$ is satisfied and the concept is not intervened upon; otherwise $\kappa(h_t, c_t) = 0$ and the intervened concept $c^{\text{int}}_t$ is used instead. The final concept $\tilde{c}_t$ is then defined as $\tilde{c}_t = \kappa(h_t, c_t) \cdot c_t + (1 - \kappa(h_t, c_t)) \cdot c^{\text{int}}_t$. In the absence of true on-policy values, the practitioner may choose to intervene using a different criterion instead.
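A minimal sketch of the $\kappa$-gated intervention defined above; the criterion and the expert-proposed concept are placeholders a practitioner would supply, and the urine-output feature index is a hypothetical choice:

```python
import numpy as np

def intervene(h_t, c_t, c_int_t, kappa):
    """Final concept: c~_t = kappa(h_t, c_t) * c_t + (1 - kappa(h_t, c_t)) * c_int_t."""
    k = kappa(h_t, c_t)  # 1: keep the learned concept, 0: use the expert concept
    return k * np.asarray(c_t) + (1 - k) * np.asarray(c_int_t)

# Example MIMIC-style criterion: keep the concept when urine output > 30 ml/hr.
def urine_output_criterion(h_t, c_t, urine_idx=5):  # urine_idx is a placeholder
    latest_state = h_t[-1]
    return 1 if latest_state[urine_idx] > 30 else 0
```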
We define the criterion $\kappa$ for our experiments as follows. In WindyGridworld, we assume access to oracle concepts, listed in Appendix G: when the learned concept $c_t$ matches the true concept, $\kappa(h_t, c_t) = 1$, and 0 otherwise. In MIMIC, interventions are based on a patient's urine output at a specific timestep, with $\kappa(h_t, c_t) = 1$ when urine output > 30 ml/hr, and 0 otherwise. Performing interventions based on urine output enables us to assess the role of kidney function in hypotension management. In this work, we consider three possible intervention strategies, based either on state representations or on domain knowledge.
Interventions that replace concepts with state representations and state-based policies. We intervene on the concept with the state and use state-dependent policies to perform OPE, i.e., $c^{\text{int}}_t = s_t$, $\pi^c_e(a_t \mid \tilde{c}_t) = \pi_e(a_t \mid s_t)$, and $\pi^c_b(a_t \mid \tilde{c}_t) = \pi_b(a_t \mid s_t)$. This can be thought of as a comparative measure a practitioner can use between the concept and state representations.

Interventions that replace concepts with state representations and the maximum likelihood estimate (MLE) of state-based policies. We replace the erroneous concept with the corresponding state and use the MLE of the state-conditioned policy to perform OPE, i.e., $c^{\text{int}}_t = s_t$, $\pi^c_e(a_t \mid \tilde{c}_t) = \mathrm{MLE}(\pi_e(a_t \mid s_t))$, and $\pi^c_b(a_t \mid \tilde{c}_t) = \mathrm{MLE}(\pi_b(a_t \mid s_t))$. This can be thought of as a comparative measure between concepts and states that prioritizes the most confident action.

Interventions using a qualitative concept while retaining concept-based policies. In this approach, a human expert replaces the concept using external domain knowledge, and policies are adjusted to reflect the new concept values. This method aligns with Tang & Wiens (2023), where human-annotated counterfactual trajectories enhance semi-offline OPE. However, while Tang & Wiens focus on quantitative counterfactual annotations in the state representation, we employ human interventions to qualitatively adjust concepts. For WindyGridworld, we use the oracle concepts as our qualitative concept, while for MIMIC, we use the learnt CPDIS estimator as the qualitative concept when intervening on the CIS estimator.
7.2 RESULTS AND INTERPRETATIONS FROM INTERVENTIONS ON LEARNED CONCEPTS
We interpret the optimized concepts in Figure 5. In the WindyGridworld environment, we compare the ground-truth concepts with the optimized ones and observe two additional concepts predicted in the bottom-right region. This likely stems from overfitting to reduce variance in the OPE loss, suggesting a need for inspection and possible intervention. Additionally, we compare our clusters with a state-abstraction baseline (clustering in the state space) and observe that these clusters differ widely from the learnt concepts. For MIMIC, prior studies indicate that patients with urine output above 30 ml/hr are less susceptible to hypotension than those with lower output (Kellum & Prowle, 2018; Singer et al., 2016; Vincent & De Backer, 2013). Using this knowledge, we analyze patient trajectories and find that lower urine output correlates with higher variance, while higher output corresponds to lower variance. This insight helps identify patients who may benefit from targeted interventions.
Interpretable concepts allow for targeted interventions that further enhance OPE estimates. In the WindyGridworld environment, we observe a reduction in bias. This occurs because replacing erroneous concepts with oracle concepts introduces information about the on-policy estimates that was previously missing during the optimization of unknown concepts, all while maintaining the same order of variance and ESS estimates. Similarly, in MIMIC, applying qualitative interventions to states with low urine output further reduces variance by 1-2 orders of magnitude.

Figure 5: Interpretation of optimized concepts. WindyGridworld: the first two subplots compare true oracle concepts with optimized concepts derived from the proposed methodology. Baseline with state abstractions: the third subplot shows OPE performance as the number of state clusters increases, peaking at $K = 33$ clusters before a spike in MSE and subsequent gradual improvement. The fourth subplot highlights state clusters at $K = 33$, the optimal abstraction for OPE, which differs from both oracle and optimized concepts, underscoring the meaningfulness of learned concepts. MIMIC: domain knowledge suggests patients with low urine output exhibit greater variance in learned concepts compared to high-output patients, revealing potential intervention targets.

Figure 6: Interventions: qualitative interventions reduce bias and MSE for unknown estimators in WindyGridworld and lower variance in MIMIC. Behavior-policy-based interventions improve over non-intervened concepts but are outperformed by qualitative interventions.
Not all interventions improve concept-OPE characteristics, and they should be used at the practitioner's discretion. In WindyGridworld, state-based interventions increase bias and MSE compared to qualitative ones, while in MIMIC, they result in higher variance. This arises because traditional state policies ($\pi_b$ and $\pi_e$) fail to compensate for the lack of on-policy information, undermining the advantages of concept-based policies ($\pi^c_b$ and $\pi^c_e$). In contrast, qualitative interventions, such as oracle concepts in WindyGridworld or urine-output thresholds in MIMIC, retain the benefits of concept-based policies and effectively address domain-specific issues. Importantly, this framework allows practitioners to inspect and choose among alternative interventions as needed.
8 CONCLUSIONS, LIMITATIONS AND FUTURE WORK
We introduced a new family of concept-based OPE estimators, demonstrating that known-concept
estimators can outperform traditional ones with greater accuracy and theoretical guarantees. For
unknown concepts, we proposed an algorithm to learn interpretable concepts that improve OPE
evaluations by identifying performance issues and enabling targeted interventions to reduce variance.
These advancements benefit safety-critical fields like healthcare, education, and public policy by
supporting reliable, interpretable policy evaluations. By reducing variance and providing policy
insights, this approach enhances informed decision-making, facilitates personalized interventions,
and refines policies before deployment for greater real-world effectiveness. A limitation of our work
is trajectory distribution mismatch when learning unknown concepts, particularly in low-sample
settings, which can lead to high-variance OPE. Targeted interventions help mitigate this issue. We
also did not address hidden confounding variables or potential CBM concept leakage, focusing
instead on evaluation. Future work will address these challenges and extend our approach to more
general, partially observable environments.
REFERENCES
Panagiotis Anagnostou, Petros T. Barmbas, Aristidis G. Vrahatis, and Sotiris K. Tasoulis. Approximate knn classification for biomedical data, 2020. URL https://arxiv.org/abs/2012.02149.
David M. Bossens and Philip S. Thomas. Low variance off-policy evaluation with state-based
importance sampling, 2024. URL https://arxiv.org/abs/2212.03932.
Markus Böck, Julien Malle, Daniel Pasterk, Hrvoje Kukina, Ramin Hasani, and Clemens Heitzinger.
Superhuman performance on sepsis mimic-iii data by distributional reinforcement learning. PLOS
ONE, 17:e0275358, 11 2022. doi: 10.1371/journal.pone.0275358.
Yinlam Chow, Marek Petrik, and Mohammad Ghavamzadeh. Robust policy optimization with
baseline guarantees. arXiv preprint arXiv:1506.04514, 2015.
Shayan Doroudi, Philip S Thomas, and Emma Brunskill. Importance sampling for fair policy
selection. Grantee Submission, 2017.
Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust
off-policy evaluation. In International Conference on Machine Learning, pp. 1447–1456. PMLR,
2018.
Raphael Fonteneau, Susan A Murphy, Louis Wehenkel, and Damien Ernst. Batch mode reinforcement
learning based on the synthesis of artificial trajectories. Annals of operations research, 208:383–
416, 2013.
Scott Fujimoto, David Meger, and Doina Precup. A deep reinforcement learning approach to
marginalized importance sampling with the successor representation, 2023.
Ary L. Goldberger, Luis A. Nunes Amaral, L. Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, Joseph E. Mietus, George B. Moody, Chung-Kang Peng, and Harry Eugene Stanley. Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. Circulation, 101(23):E215-20, 2000. URL https://api.semanticscholar.org/CorpusID:642375.
Omer Gottesman, Joseph Futoma, Yao Liu, Sonali Parbhoo, Leo Anthony Celi, Emma Brunskill, and
Finale Doshi-Velez. Interpretable off-policy evaluation in reinforcement learning by highlighting
influential transitions, 2020.
Deepak Gupta, Russell Loane, Soumya Gayen, and Dina Demner-Fushman. Medical image retrieval via nearest neighbor search on pre-trained image features, 2022. URL https://arxiv.org/abs/2210.02401.
Josiah Hanna, Peter Stone, and Scott Niekum. Bootstrapping with models: Confidence intervals for
off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31,
2017.
Marton Havasi, Sonali Parbhoo, and Finale Doshi-Velez. Addressing leakage in concept bottleneck
models. Advances in Neural Information Processing Systems, 35:23386–23397, 2022.
Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from
a finite universe. Journal of the American statistical Association, 47(260):663–685, 1952.
Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, pp. 604-613, New York, NY, USA, 1998. Association for Computing Machinery. ISBN 0897919629. doi: 10.1145/276698.276876. URL https://doi.org/10.1145/276698.276876.
Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In
International Conference on Machine Learning, pp. 652–661. PMLR, 2016.
Alistair E W Johnson, Tom J Pollard, Lu Shen, Li-Wei H Lehman, Mengling Feng, Mohammad
Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a
freely accessible critical care database. Scientific data, 3:160035, May 2016. ISSN 2052-4463.
doi: 10.1038/sdata.2016.35. URL https://europepmc.org/articles/PMC4878278.
Pulkit Katdare, Nan Jiang, and Katherine Rose Driggs-Campbell. Marginalized importance sampling for off-environment policy evaluation. In Jie Tan, Marc Toussaint, and Kourosh Darvish (eds.), Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pp. 3778-3788. PMLR, 06-09 Nov 2023. URL https://proceedings.mlr.press/v229/katdare23a.html.
John A. Kellum and John R. Prowle. Acute kidney injury in the critically ill: Clinical epidemiology and outcomes. Nature Reviews Nephrology, 14(10):641-656, 2018. doi: 10.1038/s41581-018-0052-0.
Ramtin Keramati, Omer Gottesman, Leo Anthony Celi, Finale Doshi-Velez, and Emma Brun-
skill. Identification of subgroups with similar benefits in off-policy policy evaluation. CoRR,
abs/2111.14272, 2021a. URL https://arxiv.org/abs/2111.14272.
Ramtin Keramati, Omer Gottesman, Leo Anthony Celi, Finale Doshi-Velez, and Emma Brunskill.
Identification of subgroups with similar benefits in off-policy policy evaluation. arXiv preprint
arXiv:2111.14272, 2021b.
Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5338-5348. PMLR, 13-18 Jul 2020. URL https://proceedings.mlr.press/v119/koh20a.html.
Matthieu Komorowski, Leo A Celi, Omar Badawi, Anthony C Gordon, and A Aldo Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature medicine, 24(11):1716-1720, November 2018a. ISSN 1078-8956. doi: 10.1038/s41591-018-0213-5. URL https://doi.org/10.1038/s41591-018-0213-5.
Matthieu Komorowski, Leo A Celi, Omar Badawi, Anthony C Gordon, and A Aldo Faisal. The
artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care.
Nature medicine, 24(11):1716–1720, 2018b.
Matthieu Komorowski, Leo Anthony Celi, Omar Badawi, Anthony C. Gordon, and A. Aldo Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24:1716-1720, 2018c. doi: 10.1038/s41591-018-0213-5. URL https://doi.org/10.1038/s41591-018-0213-5.
Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-
horizon off-policy estimation. Advances in neural information processing systems, 31, 2018a.
Yao Liu, Omer Gottesman, Aniruddh Raghu, Matthieu Komorowski, Aldo A Faisal, Finale Doshi-
Velez, and Emma Brunskill. Representation balancing mdps for off-policy policy evaluation.
Advances in Neural Information Processing Systems, 31, 2018b.
Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. Understanding the curse of horizon in off-policy evaluation via conditional importance sampling, 2020. URL https://arxiv.org/abs/1910.06508.
Yao Liu, Yannis Flet-Berliac, and Emma Brunskill. Offline policy optimization with eligible actions,
2022. URL https://arxiv.org/abs/2207.00632.
Pedro Madeira, André Carreiro, Alex Gaudio, Luís Rosado, Filipe Soares, and Asim Smailagic.
Zebra: Explaining rare cases through outlying interpretable concepts. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3781–3787, 2023.
Anita Mahinpei, Justin Clark, Isaac Lage, Finale Doshi-Velez, and Weiwei Pan. Promises and pitfalls
of black-box concept learning models. arXiv preprint arXiv:2106.13314, 2021.
Ritam Majumdar, Vishal Jadhav, Anirudh Deodhar, Shirish Karande, Lovekesh Vig, and Venkatara-
mana Runkana. Symbolic regression for pdes using pruned differentiable programs, 2023. URL
https://arxiv.org/abs/2303.07009.
Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy
evaluation across representations with applications to educational games. In AAMAS, volume 1077,
2014.
Ričards Marcinkevičs, Sonia Laguna, Moritz Vandenhirtz, and Julia E Vogt. Beyond concept bottleneck models: How to make black boxes intervenable? arXiv preprint arXiv:2401.13544, 2024.
Anton Matsson and Fredrik D. Johansson. Case-based off-policy policy evaluation using prototype
learning, 2021. URL https://arxiv.org/abs/2111.11113.
S A Murphy, M J van der Laan, J M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410-1423, 2001. doi: 10.1198/016214501753382327. URL https://doi.org/10.1198/016214501753382327. PMID: 20019887.
Ofir Nachum, Yinlam Chow, Bo Dai, and Lihong Li. Dualdice: Behavior-agnostic estimation of
discounted stationary distribution corrections, 2019.
Cosmin Paduraru. Off-policy evaluation in Markov decision processes. PhD thesis, 2013.
Konstantinos P Panousis, Dino Ienco, and Diego Marcos. Hierarchical concept discovery models: A
concept pyramid scheme. arXiv preprint arXiv:2310.02116, 2023.
Brahma S. Pavse and Josiah P. Hanna. Scaling marginalized importance sampling to high-dimensional
state-spaces via state abstraction, 2022a.
Brahma S. Pavse and Josiah P. Hanna. Scaling marginalized importance sampling to high-dimensional
state-spaces via state abstraction, 2022b. URL https://arxiv.org/abs/2212.07486.
Achim Peine, Andreas Hallawa, Jan Bickenbach, Peter Sidler, Andreas Markewitz, Alexandre Levesque, Jeremy Levesque, and Nils Haake. Development and validation of a reinforcement learning algorithm to dynamically optimize mechanical ventilation in critical care. npj Digital Medicine, 4(1):32, 2021. doi: 10.1038/s41746-021-00388-6.
Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-policy policy
evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning,
ICML ’00, pp. 759–766, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN
1558607072.
Yoshihide Sawada and Keigo Nakamura. Concept bottleneck model with additional unsupervised
concepts. IEEE Access, 10:41758–41765, 2022.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347.
Simon P Shen, Yecheng Ma, Omer Gottesman, and Finale Doshi-Velez. State relevance for off-policy
evaluation. In International Conference on Machine Learning, pp. 9537–9546. PMLR, 2021.
Mervyn Singer, Clifford S. Deutschman, Christopher W. Seymour, et al. The third international
consensus definitions for sepsis and septic shock (sepsis-3). JAMA, 315(8):801–810, 2016. doi:
10.1001/jama.2016.0287.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
Shengpu Tang and Jenna Wiens. Counterfactual-augmented importance sampling for semi-offline
policy evaluation, 2023. URL https://arxiv.org/abs/2310.17146.
Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement
learning. In International Conference on Machine Learning, pp. 2139–2148. PMLR, 2016.
Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for
off-policy evaluation, 2020.
Jean-Louis Vincent and Daniel De Backer. Circulatory shock. New England Journal of Medicine,
369(18):1726–1734, 2013. doi: 10.1056/NEJMra1208943.
Carissa Wu, Sonali Parbhoo, Marton Havasi, and Finale Doshi-Velez. Learning optimal summaries
of clinical time-series with concept bottleneck models. In Machine Learning for Healthcare
Conference, pp. 648–672. PMLR, 2022.
Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Towards optimal off-policy evaluation for reinforce-
ment learning with marginalized importance sampling. Advances in neural information processing
systems, 32, 2019.
Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. Gendice: Generalized offline estimation of
stationary values, 2020a.
Shangtong Zhang, Bo Liu, and Shimon Whiteson. Gradientdice: Rethinking generalized offline
estimation of stationary values, 2020b.
A CONCEPT DESIDERATA
Explainability: Explainability ensures that the concept function $\phi$ is composed of human-interpretable functions $f_1, f_2, \ldots, f_n$. Each interpretable function $f_i$ depends on the current state, past actions, rewards, and states, i.e., $s_t, a_{0:t-1}, r_{0:t-1}, s_{0:t-1}$. Mathematically:

$$c_t = \phi(s_t, a_{0:t-1}, r_{0:t-1}, s_{0:t-1}) = \psi\big(f_1(s_t, a_{0:t-1}, r_{0:t-1}, s_{0:t-1}), \ldots, f_n(s_t, a_{0:t-1}, r_{0:t-1}, s_{0:t-1})\big) \tag{1}$$

Here, $\psi$ maps the human-interpretable functions $f_i$ to the concept $c_t$, and both $\phi$ and $\psi$ share the same co-domain $\mathcal{C}$. In essence, $\phi$ can be defined using a single interpretable function or a combination of multiple interpretable functions.
As a running example in this paper (applicable across domains), the concept function $\phi(s_t)$ for diagnosing hypertension can be expressed using human-interpretable features:

$$c_t = \phi(s_t) = \phi(\text{SBP}, \text{DBP}, \text{HR}, \text{Glucose levels}, \text{GCS}, \text{Age}, \text{Weight}) = \psi\big(f_1(\text{SBP}), f_2(\text{DBP}), f_3(\text{HR}), f_4(\text{Glucose levels}), f_5(\text{GCS}), f_6(\text{Age}, \text{Weight})\big)$$

Where:
• $f_1(\text{SBP})$ maps Systolic Blood Pressure to a category (e.g., Low, Normal, High).
• $f_2(\text{DBP})$ maps Diastolic Blood Pressure to a category (e.g., Low, Normal, High).
• $f_3(\text{HR})$ maps Heart Rate to a category (e.g., Low, Normal, High).
• $f_4(\text{Glucose levels})$ maps blood glucose levels to a category (e.g., Low, Normal, High).
• $f_5(\text{GCS})$ maps GCS scores to a category.
• $f_6(\text{Age}, \text{Weight})$ maps age and weight to Body Mass Index (BMI).

This ensures that the concept $\phi(s_t)$ for diagnosing hypertension is built from human-interpretable features, making the diagnostic process explainable. Each function $f_i$ translates raw medical data into intuitive categories that are meaningful to medical practitioners.
Conciseness: Conciseness ensures that the concept function $\phi$ represents the minimal mapping of interpretable functions $f_1, f_2, \ldots, f_n$ to the concept $c_t$. If multiple mappings $\psi_1, \psi_2, \ldots, \psi_m$ satisfy $\phi$, we choose the mapping $\psi$ that provides the simplest composition of $f_i$ describing $c_t$.

For example, obesity can be represented by different combinations of human-interpretable functions; we select the least complex representation that remains interpretable. Two possible representations are:

$$c_t = \psi_1(f_1(\text{height}), f_2(\text{weight}), f_3(\text{SBP}), f_4(\text{DBP}))$$
$$c_t = \psi_2(f_5(\text{BMI}), f_3(\text{SBP}))$$

Since BMI encapsulates both height and weight, and either SBP or DBP accurately summarizes blood pressure pertinent to obesity, the concept $c_t = \psi_2(f_5(\text{BMI}), f_3(\text{SBP}))$ is more concise.
Better Trajectory Coverage: Concept-based policies have a higher coverage than traditional state policies. Mathematically:

$$\sum_{\tau \in \mathcal{T}_1} \sum_{t=0}^{T} \pi^c(a_t \mid c_t) \ge \sum_{\tau \in \mathcal{T}_2} \sum_{t=0}^{T} \pi(a_t \mid s_t) \tag{2}$$

Here, $\pi^c$ and $\pi$ are policies conditioned on concepts and states respectively, $\mathcal{T}_1$ and $\mathcal{T}_2$ are the sets of all possible trajectories under $\pi^c$ and $\pi$, and $T$ is the total number of timesteps.
Diversity: The diversity property ensures that each dimension of the concept at a given timestep captures distinct and independent aspects of the state space, minimizing overlap.

As an example, the concept function $\phi(s_t)$ for a comprehensive patient health assessment can be represented as:

$$\phi(s_t) = [c^1_t, c^2_t, \ldots, c^d_t] = [c^1_t(\text{Cardiovascular Health}),\ c^2_t(\text{Metabolic Health}),\ c^3_t(\text{Respiratory Health})]$$
$$= [\psi_1(f_1(\text{blood pressure}), f_2(\text{cholesterol levels}), f_3(\text{heart rate variability})),\ \psi_2(f_1(\text{blood glucose levels}), f_2(\text{BMI}), f_3(\text{metabolic history})),\ \psi_3(f_1(\text{lung function}), f_2(\text{oxygen saturation}), f_3(\text{respiratory history}))]$$

Each dimension $c^i_t$ of the concept captures unique information, contributing to a holistic assessment of the patient's health without redundancy.
B CHOICE OF CONCEPT TYPES
Concepts capturing subgroups with short-term benefits. If $\phi$ maps state $s_t$ and action $a_t$ to the immediate reward $r_t$, the resulting concepts can identify subgroups with similar short-term benefits, facilitating more personalized OPE, as seen in Keramati et al. (2021a). Unlike Keramati et al. (2021b), we do not limit $\phi$ to a regression tree.

Concepts capturing high-variance transitions. If $\phi$ highlights changes in state $s_t$ and action $a_t$ that cause significant shifts in value estimates, it can capture influential transitions or dynamics from historical data, similar to Gottesman et al. (2020).

Concepts capturing least influential states. If $\phi$ identifies the least (or most) influential states $s_t$, it can help focus on critical states, reducing variance by applying IS ratios only to those states (Bossens & Thomas, 2024).

Concepts capturing state-density information. If $\phi$ extracts information from histories to predict state-action visitation counts, concept-based OPE with $\phi$ functions similarly to marginalized OPE estimators, like Xie et al. (2019), which reweight trajectories based on state-visitation distributions. However, density-based concepts may be less interpretable and harder to intervene on in the context of OPE.
C GENERALIZED CONCEPT-BASED OPE ESTIMATORS
Building on the OPE estimators discussed in the main paper, we extend the integration of concepts
into other popular OPE estimators. Without making any additional assumptions about the estimators’
definitions, concepts can be seamlessly incorporated into the original formulations of these estimators.
Definition C.1 (Concept-based Weighted Importance Sampling, CWIS).
$$\hat{V}^{CWIS}_{\pi_e} = \frac{\sum_{n=1}^{N} \rho^{(n)}_{0:T} \sum_{t=0}^{T} \gamma^t r^{(n)}_t}{\sum_{n=1}^{N} \rho^{(n)}_{0:T}}; \qquad \rho^{(n)}_{0:T} = \prod_{t'=0}^{T} \frac{\pi_e(a^{(n)}_{t'} \mid c^{(n)}_{t'})}{\pi_b(a^{(n)}_{t'} \mid c^{(n)}_{t'})}$$
Definition C.2 (Concept-based Per-Decision Weighted Importance Sampling, CPDWIS).
$$\hat{V}^{CPDWIS}_{\pi_e} = \frac{\sum_{n=1}^{N} \sum_{t=0}^{T} \rho^{(n)}_{0:t} \gamma^t r^{(n)}_t}{\sum_{n=1}^{N} \sum_{t=0}^{T} \rho^{(n)}_{0:t}}; \qquad \rho^{(n)}_{0:t} = \prod_{t'=0}^{t} \frac{\pi_e(a^{(n)}_{t'} \mid c^{(n)}_{t'})}{\pi_b(a^{(n)}_{t'} \mid c^{(n)}_{t'})}$$
Definition C.3 (Concept-based Doubly Robust Estimator, CDR).
$$\hat{V}^{CDR} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \left[ \prod_{k=0}^{t} \frac{\pi_e(a^{(i)}_k \mid c^{(i)}_k)}{\pi_b(a^{(i)}_k \mid c^{(i)}_k)} \left( r^{(i)}_t - \hat{Q}(s^{(i)}_t, a^{(i)}_t) \right) + \hat{V}(s^{(i)}_t) \right]$$
Assuming good model-based estimates $\hat{V}(s_t)$ and $\hat{Q}(s^{(i)}_t, a^{(i)}_t)$, all the advantages of the traditional DR estimator carry over to the concept-space representation. It is important to note that the concepts are only used to reweight the importance sampling ratios and are not incorporated into the model-based estimates. This allows the concepts to take a general form, free of any Markovian assumption, while the model-based terms still satisfy the Bellman equation.
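A minimal NumPy sketch of CDR follows, written directly from Definition C.3 as stated (without a discount factor); `q_hat` and `v_hat` stand for fitted model-based estimates whose training is outside this snippet.

```python
import numpy as np

def concept_doubly_robust(pi_e, pi_b, rewards, q_hat, v_hat):
    """Sketch of the CDR estimator (Definition C.3).

    pi_e, pi_b, rewards: (N, T) arrays as above; q_hat: (N, T) values of
    Q-hat(s_t, a_t); v_hat: (N, T) values of V-hat(s_t). The concepts enter
    only through the importance ratios, matching the discussion above.
    """
    rho = np.cumprod(pi_e / pi_b, axis=1)          # concept-based rho_{0:t}
    per_step = rho * (rewards - q_hat) + v_hat     # DR correction per step
    return per_step.sum(axis=1).mean()             # average over N trajectories
```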
Definition C.4 (Concept-based Marginalized Importance Sampling Estimator, CMIS).
$$\hat{V}^{CMIS} = \sum_{n=1}^{N} \sum_{t=0}^{T} \frac{d_{\pi^c_e}(c_t)}{d_{\pi^c_b}(c_t)} \gamma^t r_t$$
Different algorithms from the DICE family attempt to estimate the state-distribution ratio $d_{\pi_e}(s_t)/d_{\pi_b}(s_t)$. MIS in the concept representation instead accounts for concept-visitation counts. These counts retain all the statistical guarantees of the state representation. A drawback, however, is that concept-visitation counts are less intuitive than the original concept definition, which makes it harder to assess the quality of the OPE.
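For discrete concepts, the density ratio in Definition C.4 admits a simple empirical plug-in, sketched below; this assumes samples of visited concepts are available under both policies (e.g., in simulation), whereas DICE-style estimators would be needed in the general offline case.

```python
from collections import Counter

def concept_visitation_ratio(concepts_e, concepts_b):
    """Empirical plug-in for the CMIS concept-density ratio (Definition C.4).

    concepts_e, concepts_b: lists of (hashable) concept values visited under
    the evaluation and behavior policies. A sketch for discrete concepts.
    """
    count_e, count_b = Counter(concepts_e), Counter(concepts_b)
    n_e, n_b = len(concepts_e), len(concepts_b)
    return {c: (count_e[c] / n_e) / (count_b[c] / n_b)
            for c in count_b}                 # defined where d_b(c) > 0
```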
D KNOWN CONCEPT-BASED OPE ESTIMATORS: THEORETICAL PROOFS
In this section, we provide the detailed proofs for the known-concept scenario.
Theorem. For any arbitrary function $f$, $\mathbb{E}_{c \sim d_{\pi^c}}[f(c)] = \mathbb{E}_{s \sim d_{\pi}}[f(\phi(s))]$.
Proof: See Pavse & Hanna (2022b).
D.1 IS
D.1.1 BIAS
$$\text{Bias} = \big| \mathbb{E}_{\pi^c_b}[\hat{V}^{CIS}_{\pi^c_e}] - \mathbb{E}_{\pi^c_e}[\hat{V}^{CIS}_{\pi^c_e}] \big| \quad (a)$$
$$= \Big| \mathbb{E}_{\pi^c_b}\Big[ \rho^{(n)}_{0:T} \sum_{t=0}^{T} \gamma^t r^{(n)}_t \Big] - \mathbb{E}_{\pi^c_e}[\hat{V}^{CIS}_{\pi^c_e}] \Big| \quad (b)$$
$$= \Big| \sum_{n=1}^{N} \Big( \prod_{t=0}^{T} \pi^c_b(a^{(n)}_t \mid c^{(n)}_t) \Big) \rho^{(n)}_{0:T} \sum_{t=0}^{T} \gamma^t r^{(n)}_t - \mathbb{E}_{\pi^c_e}[\hat{V}^{CIS}_{\pi^c_e}] \Big| \quad (c)$$
$$= \Big| \sum_{n=1}^{N} \prod_{t=0}^{T} \Big( \pi^c_b(a^{(n)}_t \mid c^{(n)}_t) \frac{\pi^c_e(a^{(n)}_t \mid c^{(n)}_t)}{\pi^c_b(a^{(n)}_t \mid c^{(n)}_t)} \Big) \sum_{t=0}^{T} \gamma^t r^{(n)}_t - \mathbb{E}_{\pi^c_e}[\hat{V}^{CIS}_{\pi^c_e}] \Big| \quad (d)$$
$$= \Big| \sum_{n=1}^{N} \prod_{t=0}^{T} \pi^c_e(a^{(n)}_t \mid c^{(n)}_t) \sum_{t=0}^{T} \gamma^t r^{(n)}_t - \mathbb{E}_{\pi^c_e}[\hat{V}^{CIS}_{\pi^c_e}] \Big| = 0 \quad (e)$$
Explanation of steps:
(a) We start from the definition of bias as the difference between the expected values of the estimator under the behavior policy $\pi^c_b$ and the concept-based evaluation policy $\pi^c_e(a \mid c)$.
(b) We expand the respective definitions.
(c) Each term is expanded to represent the probability of the trajectories, factoring in the importance sampling ratio.
(d) Similar terms are grouped. This change of measure is possible because the concepts are known and can modify the trajectory probabilities.
(e) The denominator of the IS term cancels with the probability of the trajectory under $\pi^c_b$. Using the definition $\mathbb{E}_{\pi^c_e}[\hat{V}^{CIS}_{\pi^c_e}] = \sum_{n=1}^{N} \prod_{t=0}^{T} \pi^c_e(a^{(n)}_t \mid c^{(n)}_t) \sum_{t=0}^{T} \gamma^t r^{(n)}_t$, the bias equals zero.
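As a quick numerical illustration of this zero-bias argument, the self-contained Python sketch below builds a toy one-step problem in which several states share a concept and both policies act on concepts; the CIS estimate then converges to the true evaluation value. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four states map onto two concepts; both policies condition on the concept.
n_states, n_actions = 4, 2
phi = np.array([0, 0, 1, 1])                    # state -> concept
pi_b = np.array([[0.7, 0.3], [0.4, 0.6]])       # behavior policy pi_b(a|c)
pi_e = np.array([[0.5, 0.5], [0.2, 0.8]])       # evaluation policy pi_e(a|c)
reward = rng.uniform(0, 1, size=(n_states, n_actions))

def cis_estimate(n_samples):
    s = rng.integers(n_states, size=n_samples)  # uniform initial states
    c = phi[s]
    a = np.array([rng.choice(n_actions, p=pi_b[ci]) for ci in c])
    return np.mean(pi_e[c, a] / pi_b[c, a] * reward[s, a])

true_value = np.mean((pi_e[phi] * reward).sum(axis=1))
print(cis_estimate(200_000), true_value)        # the two should roughly agree
```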
D.1.2 VARIANCE
$$\mathbb{V}[\hat{V}^{CIS}_{\pi^c_e}] = \mathbb{E}_{\pi^c_b}[(\hat{V}^{CIS}_{\pi^c_e})^2] - (\mathbb{E}_{\pi^c_b}[\hat{V}^{CIS}_{\pi^c_e}])^2 \quad (a)$$
We first evaluate the expectation of the square of the estimator:
$$\mathbb{E}_{\pi^c_b}[(\hat{V}^{CIS}_{\pi^c_e})^2] = \mathbb{E}_{\pi^c_b}\Big[ \Big( \rho^{(n)}_{0:T} \sum_{t=0}^{T} \gamma^t r^{(n)}_t \Big)^2 \Big] \quad (b)$$
$$= \mathbb{E}_{\pi^c_b}\Big[ \sum_{t=0}^{T} \sum_{t'=0}^{T} \rho^2_{0:T}\, \gamma^{(t+t')} r^{(n)}_t r^{(n)}_{t'} \Big] \quad (c)$$
$$= \sum_{n=1}^{N} \prod_{t=0}^{T} \frac{(\pi^c_e(a^{(n)}_t \mid c^{(n)}_t))^2}{\pi^c_b(a^{(n)}_t \mid c^{(n)}_t)} \sum_{t=0}^{T} \sum_{t'=0}^{T} \gamma^{(t+t')} r_t r_{t'} \quad (d)$$
Evaluating the second term in the variance expression:
$$(\mathbb{E}_{\pi^c_b}[\hat{V}^{CIS}_{\pi^c_e}])^2 = \Big( \mathbb{E}_{\pi^c_b}\Big[ \rho^{(n)}_{0:T} \sum_{t=0}^{T} \gamma^t r^{(n)}_t \Big] \Big)^2 \quad (e)$$
$$= \sum_{n=1}^{N} \prod_{t=0}^{T} \Big( \pi^c_b(a^{(n)}_t \mid c^{(n)}_t) \frac{\pi^c_e(a^{(n)}_t \mid c^{(n)}_t)}{\pi^c_b(a^{(n)}_t \mid c^{(n)}_t)} \Big)^2 \sum_{t=0}^{T} \sum_{t'=0}^{T} \gamma^{(t+t')} r_t r_{t'} \quad (f)$$
$$= \sum_{n=1}^{N} \prod_{t=0}^{T} \pi^c_e(a^{(n)}_t \mid c^{(n)}_t)^2 \sum_{t=0}^{T} \sum_{t'=0}^{T} \gamma^{(t+t')} r_t r_{t'} \quad (g)$$
Subtracting the squared expectation from the expectation of the squared estimator:
$$\mathbb{V}[\hat{V}^{CIS}_{\pi^c_e}] = \sum_{n=1}^{N} \prod_{t=0}^{T} \Big( \pi^c_e(a^{(n)}_t \mid c^{(n)}_t)^2 \Big( \frac{1}{\pi^c_b(a^{(n)}_t \mid c^{(n)}_t)} - 1 \Big) \Big) \sum_{t=0}^{T} \sum_{t'=0}^{T} \gamma^{(t+t')} r_t r_{t'} \quad (h)$$
Explanation of steps:
(a) We begin with the definition of variance for our estimator.
(b) We evaluate the first term of the variance.
(c), (d) We expand the square of the estimator as the square of a sum of weighted returns.
(e) We calculate the square of the expectation of the estimator.
(f) We expand this squared expectation.
(g) The denominator of the IS ratio cancels with the probability of the trajectory.
(h) We subtract the squared expectation from the expectation of the squared estimator.
D.1.3 VARIANCE COMPARISON BETWEEN CIS RATIOS AND IS RATIOS
$$\text{Theorem. } \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)}\Big] \le \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}\Big]$$
Proof: The proof is similar to Pavse & Hanna (2022b), where we generalize from state abstractions to concepts. Using Lemma D and Assumption 5.1, we can say that:
$$\mathbb{E}_{c \sim d_{\pi^c}} \prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} = \mathbb{E}_{s \sim d_{\pi}} \prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} = 1 \quad (a)$$
Denoting the difference between the two variances as $D$:
$$D = \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}\Big] - \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)}\Big] \quad (b)$$
$$= \mathbb{E}_{\pi_b}\Big[\prod_{t=0}^{T} \Big(\frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}\Big)^2\Big] - \Big(\mathbb{E}_{\pi_b} \prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}\Big)^2 - \mathbb{E}_{\pi^c_b}\Big[\prod_{t=0}^{T} \Big(\frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)}\Big)^2\Big] + \Big(\mathbb{E}_{\pi^c_b} \prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)}\Big)^2 \quad (c)$$
$$= \mathbb{E}_{\pi_b}\Big[\prod_{t=0}^{T} \Big(\frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}\Big)^2\Big] - \mathbb{E}_{\pi^c_b}\Big[\prod_{t=0}^{T} \Big(\frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)}\Big)^2\Big] \quad (d)$$
$$= \sum_{s} \prod_{t=0}^{T} \pi_b(a_t \mid s_t) \Big[\prod_{t=0}^{T} \Big(\frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}\Big)^2\Big] - \sum_{c} \prod_{t=0}^{T} \pi^c_b(a_t \mid c_t) \Big[\prod_{t=0}^{T} \Big(\frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)}\Big)^2\Big] \quad (e)$$
$$= \sum_{c} \Bigg( \sum_{s \in \phi^{-1}(c)} \prod_{t=0}^{T} \pi_b(a_t \mid s_t) \Big[\prod_{t=0}^{T} \Big(\frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}\Big)^2\Big] - \prod_{t=0}^{T} \pi^c_b(a_t \mid c_t) \Big[\prod_{t=0}^{T} \Big(\frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)}\Big)^2\Big] \Bigg) \quad (f)$$
$$= \sum_{c} \Bigg( \sum_{s \in \phi^{-1}(c)} \Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)^2}{\pi_b(a_t \mid s_t)}\Big] - \Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)^2}{\pi^c_b(a_t \mid c_t)}\Big] \Bigg) \quad (g)$$
We analyse the difference of variances for a single fixed concept and denote it as $D'$:
$$D' = \Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)^2}{\pi^c_b(a_t \mid c_t)}\Big] - \Bigg( \sum_{s \in \phi^{-1}(c)} \Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)^2}{\pi_b(a_t \mid s_t)}\Big] \Bigg) \quad (h)$$
Now, if we can show $D' \le 0$ for every concept, where $|c|$ denotes the cardinality of the concept representation, then the difference $D$ will always be non-negative, completing our proof. We use induction on the total number of states merged into a concept, from $1$ to $|c| = n < |S|$. Our induction statement $T(n)$ to prove is that $D' \le 0$ where $n = |c'|$. For $n = 1$, the statement is trivially true, since every concept is then exactly the traditional representation of the state. Our inductive hypothesis states that
$$D' = \Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)^2}{\pi^c_b(a_t \mid c_t)}\Big] - \Bigg( \sum_{s \in \phi^{-1}(c)} \Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)^2}{\pi_b(a_t \mid s_t)}\Big] \Bigg) \le 0 \quad (i)$$
Now, we define $S = \sum_{s \in \phi^{-1}(c)} \big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)^2}{\pi_b(a_t \mid s_t)}\big]$, $C = \prod_{t=0}^{T} \pi^c_e(a_t \mid c_t)$, and $C' = \prod_{t=0}^{T} \pi^c_b(a_t \mid c_t)$, so that the concept term equals $C^2/C'$. After making the substitutions, we obtain
$$C^2 \le S C' \quad (j)$$
This result holds for $|c| = n$ by the induction hypothesis. Now, we add a new state $s_{n+1}$ to the concept as part of the induction step, and obtain the following difference between the state-sum increment and the concept-term increment:
$$D' = S \times \frac{\pi_e(a \mid s_{n+1})^2}{\pi_b(a \mid s_{n+1})} - \frac{C}{C'} \times \frac{\pi_e(a \mid s_{n+1})^2}{\pi_b(a \mid s_{n+1})} \quad (k)$$
Let $\pi_e(a \mid s_{n+1}) = X$ and $\pi_b(a \mid s_{n+1}) = Y$. Substituting, we get:
$$D' = \frac{S X^2}{Y} - \frac{C}{C'} \cdot \frac{X^2}{Y} = \frac{(S C' - C) X^2}{C' Y} \quad (l)$$
$D'$ is smallest when $C$ is largest, hence we substitute $C \le \sqrt{S C'}$ from the induction hypothesis into the expression:
$$D' \ge \frac{(S C' - \sqrt{S C'}) X^2}{C' Y} \quad (m)$$
Whenever the term $S C' - \sqrt{S C'}$ is non-negative, the remaining quantities being positive gives $D' \ge 0$: the state-space sum continues to dominate the concept term after adding the new state. Thus the inductive step holds, and that concludes the proof.
D.1.4 VARIANCE COMPARISON BETWEEN CIS AND IS ESTIMATORS
Theorem. When
$$\mathrm{Cov}\Big(\prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_t,\ \prod_{t'=0}^{k} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_k\Big) \le \mathrm{Cov}\Big(\prod_{t'=0}^{t} \frac{\pi_e(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})} r_t,\ \prod_{t'=0}^{k} \frac{\pi_e(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})} r_k\Big),$$
the variance of known concept-based IS estimators is lower than that of traditional estimators, i.e. $\mathbb{V}_{\pi_b}[\hat{V}^{CIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{IS}]$ and $\mathbb{V}_{\pi_b}[\hat{V}^{CPDIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{PDIS}]$.
Proof: Using Lemma D and Assumption 5.1, we can say that:
$$\mathbb{E}_{c \sim d_{\pi^c}}\Big[ \sum_{t=0}^{T} \prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_t(c_t, a_t) \Big] = \mathbb{E}_{s \sim d_{\pi}}\Big[ \sum_{t=0}^{T} \prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t(s_t, a_t) \Big] \quad (a)$$
The variance for a single sample of the CIS estimator is given by
$$\mathbb{V}[\hat{V}^{CIS}] = \frac{1}{T^2} \Bigg( \sum_{t=0}^{T} \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_t\Big] + 2 \sum_{t<k} \mathrm{Cov}\Big(\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_t,\ \prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_k\Big) \Bigg) \quad (b)$$
The variance for a single sample of the IS estimator is given by
$$\mathbb{V}[\hat{V}^{IS}] = \frac{1}{T^2} \Bigg( \sum_{t=0}^{T} \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t\Big] + 2 \sum_{t<k} \mathrm{Cov}\Big(\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t,\ \prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_k\Big) \Bigg) \quad (c)$$
We take the difference between the variances, and note that the difference of the covariance terms is non-positive by assumption. Hence, if we show that the per-timestep difference of variances is also non-positive, the proof is complete.
$$D = \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_t\Big] - \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t\Big] \quad (d)$$
$$= \mathbb{E}_{\pi^c_b}\Big[\Big(\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_t\Big)^2\Big] - \Big(\mathbb{E}_{\pi^c_b}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_t\Big]\Big)^2 - \mathbb{E}_{\pi_b}\Big[\Big(\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t\Big)^2\Big] + \Big(\mathbb{E}_{\pi_b}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t\Big]\Big)^2 \quad (e)$$
$$= \mathbb{E}_{\pi^c_b}\Big[\Big(\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_t\Big)^2\Big] - \mathbb{E}_{\pi_b}\Big[\Big(\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t\Big)^2\Big] \quad (f)$$
$$= \sum_{c} \prod_{t=0}^{T} \pi^c_b(a_t \mid c_t) \Big[\Big(\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_t\Big)^2\Big] - \sum_{s} \prod_{t=0}^{T} \pi_b(a_t \mid s_t) \Big[\Big(\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t\Big)^2\Big] \quad (g)$$
$$= \sum_{c} \sum_{s \in \phi^{-1}(c)} \prod_{t=0}^{T} \pi^c_b(a_t \mid c_t) \Big[\Big(\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_t\Big)^2\Big] - \sum_{s} \prod_{t=0}^{T} \pi_b(a_t \mid s_t) \Big[\Big(\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t\Big)^2\Big] \quad (h)$$
$$\le R^2_{\max} \Bigg( \sum_{c} \sum_{s \in \phi^{-1}(c)} \prod_{t=0}^{T} \pi^c_b(a_t \mid c_t) \Big[\Big(\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)}\Big)^2\Big] - \sum_{s} \prod_{t=0}^{T} \pi_b(a_t \mid s_t) \Big[\Big(\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}\Big)^2\Big] \Bigg) \quad (i)$$
The rest of the proof is identical to the previous subsection: we perform induction on the cardinality of the concept and show that the term inside the bracket is never positive, thus completing the proof.
D.1.5 UPPER BOUND ON THE VARIANCE
$$\mathbb{V}[\hat{V}^{CIS}_{\pi^c_b}] = \mathbb{E}_{\pi^c_b}\Big[\Big(\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{T} \frac{\pi_e(a_{t'} \mid c_{t'})}{\pi_b(a_{t'} \mid c_{t'})}\Big)^2\Big] - \Big(\mathbb{E}_{\pi^c_b}\Big[\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{T} \frac{\pi_e(a_{t'} \mid c_{t'})}{\pi_b(a_{t'} \mid c_{t'})}\Big]\Big)^2 \quad (a)$$
$$\le \mathbb{E}_{\pi^c_b}\Big[\Big(\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{T} \frac{\pi_e(a_{t'} \mid c_{t'})}{\pi_b(a_{t'} \mid c_{t'})}\Big)^2\Big] \quad (b)$$
$$\le \frac{1}{N} \sum_{n=1}^{N} \Big(\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{T} \frac{\pi_e(a_{t'} \mid c_{t'})}{\pi_b(a_{t'} \mid c_{t'})}\Big)^2 + \frac{7 T^2 R^2_{\max} U_c^{2T} \ln(\frac{2}{\delta})}{3(N-1)} + \sqrt{\frac{\ln(\frac{2}{\delta})}{N^3 - N^2} \sum_{i<j}^{N} (X_i^2 - X_j^2)^2} \quad (c)$$
$$\le T^2 R^2_{\max} U_c^{2T} \Big(\frac{1}{N} + \frac{\ln \frac{2}{\delta}}{3(N-1)}\Big) + \sqrt{\frac{\ln(\frac{2}{\delta})}{N^3 - N^2} \sum_{i<j}^{N} (X_i^2 - X_j^2)^2} \quad (d)$$
Explanation of steps:
(a) We begin with the definition of variance.
(b) The second term is always greater than 0.
(c) Applying the Bernstein inequality, which holds with probability $1-\delta$; $X_i$ refers to the CIS estimate for one sample.
(d) Grouping terms 1 and 2 together, where $U_c = \max \frac{\pi^c_e(a \mid c)}{\pi^c_b(a \mid c)}$.
The first term of the variance dominates the second as the number of samples increases. Thus, the variance is of complexity $O\big(\frac{T^2 R^2_{\max} U_c^{2T}}{N}\big)$.
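For completeness, bound (d) can be transcribed directly into Python; the sketch below does so, with `u_c` the maximum concept-space importance ratio and the Bernstein-style constants taken verbatim from the expression above. The helper name and array interface are ours.

```python
import numpy as np

def cis_variance_bound(cis_estimates, t_horizon, r_max, u_c, delta=0.05):
    """Sketch of the variance upper bound in step (d) above.

    cis_estimates: array of per-trajectory CIS estimates X_i.
    """
    n = len(cis_estimates)
    x2 = np.asarray(cis_estimates) ** 2
    lead = (t_horizon ** 2) * (r_max ** 2) * u_c ** (2 * t_horizon)
    first = lead * (1.0 / n + np.log(2 / delta) / (3 * (n - 1)))
    # Pairwise term: sum over i < j of (X_i^2 - X_j^2)^2.
    diffs = (x2[:, None] - x2[None, :]) ** 2
    pair_sum = diffs[np.triu_indices(n, k=1)].sum()
    second = np.sqrt(np.log(2 / delta) / (n ** 3 - n ** 2) * pair_sum)
    return first + second
```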
D.1.6 UPPER BOUND ON THE MSE
$$\text{MSE} = \text{Bias}^2 + \text{Variance} = \text{Variance} \sim O\Big(\frac{T^2 R^2_{\max} U_c^{2T}}{N}\Big) \quad (a)$$
The upper bound on the MSE of the concept-based IS estimator has the same form as the Cramer-Rao bounds of the traditional IS estimator stated in Jiang & Li (2016). We investigate when the MSE bounds can be tightened in the concept representation. We first write
$$U_c = \max \frac{\pi^c_e(a \mid c)}{\pi^c_b(a \mid c)} = U_s \frac{K_1}{K_2} \quad (b)$$
Here, $U_s = \max \frac{\pi_e(a \mid s)}{\pi_b(a \mid s)}$, $K_1$ is the cardinality of the set of states mapping to the same concept $c$ under the evaluation policy $\pi_e$, while $K_2$ refers to the same quantity under the behavior policy $\pi_b$. Typically, the maximum value of the IS ratio occurs when $\pi_e(a \mid s) \gg \pi_b(a \mid s)$, i.e. the action taken is very likely under the evaluation policy $\pi_e$ but unlikely under the behavior policy $\pi_b$. This typically happens when that particular state has little coverage, or does not appear in the data generated by the behavior policy $\pi_b$. Under concepts, however, similar states are visited and categorized together, which improves the information on the state $s$ through $c$, leading to $K_2 > 1$. On the other hand, as both $\pi^c_e(a \mid c)$ and $\pi_e(a \mid s)$ are close to 1, $K_1 = 1$. Thus, $K = \frac{K_1}{K_2} < 1$, and hence
$$O\Big(\frac{T^2 R^2_{\max} U_c^{2T}}{N}\Big) \sim O\Big(\frac{T^2 R^2_{\max} (U_s K)^{2T}}{N}\Big) \sim O\Big(\frac{T^2 R^2_{\max} U_s^{2T}}{N}\Big) K^{2T} \quad (3)$$
Thus, the concept-based MSE bounds are tightened by a factor of $K^{2T}$.
D.1.7 VARIANCE COMPARISON WITH MIS ESTIMATOR
Theorem. Let $\rho$ be the product of the importance sampling ratios in the state space, and $d_{\pi_e}$, $d_{\pi_b}$ the stationary densities. Then,
$$\mathbb{E}(\rho_{0:T} \mid s_t, a_t) = \frac{d_{\pi_e}(s_t, a_t)}{d_{\pi_b}(s_t, a_t)}$$
Proof: See Liu et al. (2020).
Theorem. Let $X_t$ and $Y_t$ be two sequences of random variables. Then
$$\mathbb{V}\Big(\sum_t Y_t\Big) - \mathbb{V}\Big(\sum_t \mathbb{E}[Y_t \mid X_t]\Big) \ge 2 \sum_{t<k} \mathbb{E}[Y_t Y_k] - 2 \sum_{t<k} \mathbb{E}\big[\mathbb{E}[Y_t \mid X_t]\, \mathbb{E}[Y_k \mid X_k]\big]$$
Proof: See Liu et al. (2020)
Theorem. When
$$\mathrm{Cov}\Big(\prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_t,\ \prod_{t'=0}^{k} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_k\Big) \le \mathrm{Cov}\Big(\frac{d_{\pi_e}(s_t, a_t)}{d_{\pi_b}(s_t, a_t)} r_t,\ \frac{d_{\pi_e}(s_k, a_k)}{d_{\pi_b}(s_k, a_k)} r_k\Big),$$
the variance of known CIS estimators is lower than the variance of the MIS estimator, i.e. $\mathbb{V}_{\pi_b}[\hat{V}^{CIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{MIS}]$.
Proof: We start from the assumption:
$$\mathrm{Cov}\Big(\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_t,\ \prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_k\Big) = \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} \prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_t r_k\Big] - \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_t\Big] \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid c_t)}{\pi^c_b(a_t \mid c_t)} r_k\Big] \quad (a)$$
$$= \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} \prod_{t=0}^{T} \frac{\pi_e(a_k \mid s_k)}{\pi_b(a_k \mid s_k)} K^2 r_t r_k\Big] - \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} K r_t\Big] \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} K r_k\Big] \quad (b)$$
$$\le \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} \prod_{t=0}^{T} \frac{\pi_e(a_k \mid s_k)}{\pi_b(a_k \mid s_k)} r_t r_k\Big] - \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t\Big] \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_k\Big] \quad (c)$$
$$\le \mathbb{E}\Big[\frac{d_{\pi_e}(s_t, a_t)}{d_{\pi_b}(s_t, a_t)} \frac{d_{\pi_e}(s_k, a_k)}{d_{\pi_b}(s_k, a_k)} r_t r_k\Big] - \mathbb{E}\Big[\frac{d_{\pi_e}(s_t, a_t)}{d_{\pi_b}(s_t, a_t)} r_t\Big] \mathbb{E}\Big[\frac{d_{\pi_e}(s_k, a_k)}{d_{\pi_b}(s_k, a_k)} r_k\Big] \quad (d)$$
Explanation of steps:
(a) We begin with the definition of covariance.
(b), (c) Using the definition of $\pi^c$, with $K$ (the ratio of the state-space distribution ratios) $< 1$.
(d) Applying Lemma D.1.7 to both terms.
Finally, using Lemma D.1.7 with $Y_t = \prod_{t'=0}^{T} \frac{\pi_e(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})} r_t$ and $X_t = (s_t, a_t, r_t)$ completes our proof.
D.2 PDIS
D.2.1 BIAS
$$\text{Bias} = \big| \mathbb{E}_{\pi^c_b}[\hat{V}^{CPDIS}_{\pi^c_e}] - \mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi^c_e}] \big| \quad (a)$$
$$= \Big| \mathbb{E}_{\pi^c_b}\Big[\sum_{t=0}^{T} \gamma^t \rho^{(n)}_{0:t} r^{(n)}_t\Big] - \mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi^c_e}] \Big| \quad (b)$$
$$= \Big| \sum_{n=1}^{N} \Big(\prod_{t=0}^{T} \pi^c_b(a^{(n)}_t \mid c^{(n)}_t)\Big) \sum_{t=0}^{T} \gamma^t \rho^{(n)}_{0:t} r^{(n)}_t - \mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi^c_e}] \Big| \quad (c)$$
$$= \Big| \sum_{n=1}^{N} \sum_{t=0}^{T} \gamma^t \Big(\prod_{t'=0}^{t} \pi^c_b(a^{(n)}_{t'} \mid c^{(n)}_{t'}) \frac{\pi^c_e(a^{(n)}_{t'} \mid c^{(n)}_{t'})}{\pi^c_b(a^{(n)}_{t'} \mid c^{(n)}_{t'})}\Big) r^{(n)}_t - \mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi^c_e}] \Big| = 0 \quad (d)$$
Explanation of steps: Similar to CIS.
D.2.2 VARIANCE
Following the process similar to CIS estimator:
$$\mathbb{V}[\hat{V}^{CPDIS}_{\pi^c_b}] = \mathbb{E}_{\pi^c_b}[(\hat{V}^{CPDIS}_{\pi^c_b})^2] - (\mathbb{E}_{\pi^c_b}[\hat{V}^{CPDIS}_{\pi^c_b}])^2 \quad (a)$$
We first evaluate the expectation of the square of the estimator:
$$\mathbb{E}_{\pi^c_b}[(\hat{V}^{CPDIS}_{\pi^c_b})^2] = \mathbb{E}_{\pi^c_b}\Big[\Big(\sum_{t=0}^{T} \gamma^t \rho_{0:t} r_t\Big)^2\Big] \quad (b)$$
$$= \mathbb{E}_{\pi^c_b}\Big[\sum_{t=0}^{T} \sum_{t'=0}^{T} \rho_{0:t}\, \rho_{0:t'}\, \gamma^{(t+t')} r_t r_{t'}\Big] \quad (c)$$
$$= \sum_{n=1}^{N} \sum_{t=0}^{T} \sum_{t'=0}^{T} \Big(\prod_{t''=0}^{t} \pi^c_b(a_{t''} \mid c_{t''}) \frac{\pi^c_e(a_{t''} \mid c_{t''})}{\pi^c_b(a_{t''} \mid c_{t''})}\Big) \Big(\prod_{t'''=0}^{t'} \frac{\pi^c_e(a_{t'''} \mid c_{t'''})}{\pi^c_b(a_{t'''} \mid c_{t'''})}\Big) \gamma^{(t+t')} r_t r_{t'} \quad (d)$$
$$= \sum_{n=1}^{N} \sum_{t=0}^{T} \sum_{t'=0}^{T} \Big(\prod_{t''=0}^{t} \pi^c_e(a_{t''} \mid c_{t''})\Big) \Big(\prod_{t'''=0}^{t'} \frac{\pi^c_e(a_{t'''} \mid c_{t'''})}{\pi^c_b(a_{t'''} \mid c_{t'''})}\Big) \gamma^{(t+t')} r_t r_{t'} \quad (e)$$
Evaluating the second term in the variance expression:
$$(\mathbb{E}_{\pi^c_b}[\hat{V}^{CPDIS}_{\pi^c_b}])^2 = \Big(\sum_{n=1}^{N} \sum_{t=0}^{T} \mathbb{E}_{\pi^c_b}[\gamma^t \rho_{0:t} r_t]\Big)^2 \quad (f)$$
$$= \sum_{n=1}^{N} \sum_{t=0}^{T} \sum_{t'=0}^{T} \Big(\prod_{t''=0}^{t} \pi^c_b(a_{t''} \mid c_{t''}) \frac{\pi^c_e(a_{t''} \mid c_{t''})}{\pi^c_b(a_{t''} \mid c_{t''})}\Big) \Big(\prod_{t'''=0}^{t'} \pi^c_b(a_{t'''} \mid c_{t'''}) \frac{\pi^c_e(a_{t'''} \mid c_{t'''})}{\pi^c_b(a_{t'''} \mid c_{t'''})}\Big) \gamma^{(t+t')} r_t r_{t'} \quad (g)$$
$$= \sum_{n=1}^{N} \sum_{t=0}^{T} \sum_{t'=0}^{T} \Big(\prod_{t''=0}^{t} \pi^c_e(a_{t''} \mid c_{t''})\Big) \Big(\prod_{t'''=0}^{t'} \pi^c_e(a_{t'''} \mid c_{t'''})\Big) \gamma^{(t+t')} r_t r_{t'} \quad (h)$$
Subtracting the squared expectation from the expectation of the squared estimator:
$$\mathbb{V}[\hat{V}^{CPDIS}_{\pi^c_b}] = \sum_{n=1}^{N} \sum_{t=0}^{T} \sum_{t'=0}^{T} \Big(\prod_{t'''=0}^{t} \pi^c_e(a_{t'''} \mid c_{t'''})\Big) \Big(\prod_{t''=0}^{t'} \pi^c_e(a_{t''} \mid c_{t''}) \Big(\frac{1}{\pi^c_b(a_{t''} \mid c_{t''})} - 1\Big)\Big) \gamma^{(t+t')} r_t r_{t'} \quad (i)$$
Explanation of steps: Similar to CIS.
D.2.3 VARIANCE COMPARISON BETWEEN CPDIS RATIOS AND PDIS RATIOS
$$\text{Theorem. } \mathbb{V}\Big[\sum_{t=0}^{T} \prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})}\Big] \le \mathbb{V}\Big[\sum_{t=0}^{T} \prod_{t'=0}^{t} \frac{\pi_e(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})}\Big]$$
Proof: Similar to CIS estimator.
D.2.4 VARIANCE COMPARISON BETWEEN CPDIS AND PDIS ESTIMATORS
Theorem. If, for any fixed $0 \le t \le k < T$,
$$\mathrm{Cov}\Big(\prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_t,\ \prod_{t'=0}^{k} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_k\Big) \le \mathrm{Cov}\Big(\prod_{t'=0}^{t} \frac{\pi_e(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})} r_t,\ \prod_{t'=0}^{k} \frac{\pi_e(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})} r_k\Big),$$
then $\mathbb{V}[\hat{V}^{CPDIS}] \le \mathbb{V}[\hat{V}^{PDIS}]$.
Proof: Similar to CIS estimator.
D.2.5 UPPER BOUND ON THE VARIANCE
$$\mathbb{V}[\hat{V}^{CPDIS}_{\pi^c_b}] = \mathbb{E}_{\pi^c_b}\Big[\Big(\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})}\Big)^2\Big] - \Big(\mathbb{E}_{\pi^c_b}\Big[\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})}\Big]\Big)^2 \quad (a)$$
$$\le \mathbb{E}_{\pi^c_b}\Big[\Big(\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})}\Big)^2\Big] \quad (b)$$
$$\le \frac{1}{N} \sum_{n=1}^{N} \Big(\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})}\Big)^2 + \frac{7 T^2 R^2_{\max} U_c^{2T} \ln(\frac{2}{\delta})}{3(N-1)} + \sqrt{\frac{\ln(\frac{2}{\delta})}{N^3 - N^2} \sum_{i<j}^{N} (X_i^2 - X_j^2)^2} \quad (c)$$
$$\le T^2 R^2_{\max} U_c^{2T} \Big(\frac{1}{N} + \frac{\ln \frac{2}{\delta}}{3(N-1)}\Big) + \sqrt{\frac{\ln(\frac{2}{\delta})}{N^3 - N^2} \sum_{i<j}^{N} (X_i^2 - X_j^2)^2} \quad (d)$$
Explanation of steps: Similar to CIS.
D.2.6 UPPER BOUND ON THE MSE
$$\text{MSE} = \text{Bias}^2 + \text{Variance} = \text{Variance} \sim O\Big(\frac{T^2 R^2_{\max} U_c^{2T}}{N}\Big) \sim O\Big(\frac{T^2 R^2_{\max} U_s^{2T}}{N}\Big) K^{2T} \quad (4)$$
Proof: Similar to CIS estimator.
D.2.7 VARIANCE COMPARISON WITH MIS ESTIMATOR
Theorem. When
$$\mathrm{Cov}\Big(\prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_t,\ \prod_{t'=0}^{k} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_k\Big) \le \mathrm{Cov}\Big(\frac{d_{\pi_e}(s_t, a_t)}{d_{\pi_b}(s_t, a_t)} r_t,\ \frac{d_{\pi_e}(s_k, a_k)}{d_{\pi_b}(s_k, a_k)} r_k\Big),$$
the variance of known CPDIS estimators is lower than the variance of the MIS estimator, i.e. $\mathbb{V}_{\pi_b}[\hat{V}^{CPDIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{MIS}]$.
Proof: Similar to CIS estimator.
E UNKNOWN CONCEPT-BASED OPE ESTIMATORS: THEORETICAL PROOFS
In this section, we provide the theoretical proofs for the unknown-concept scenario.
E.1 IS
E.1.1 BIAS
We begin by stating the expression for the expected value of the CIS estimator under πb:
$$\text{Bias} = \big| \mathbb{E}_{\pi_b}[\hat{V}^{CIS}_{\pi_e}] - \mathbb{E}_{\pi_e}[\hat{V}^{CIS}_{\pi^c_e}] \big| \quad (a)$$
$$= \Big| \mathbb{E}_{\pi_b}\Big[\rho^{(n)}_{0:T} \sum_{t=0}^{T} \gamma^t r^{(n)}_t\Big] - \mathbb{E}_{\pi_e}[\hat{V}^{CIS}_{\pi^c_e}] \Big| \quad (b)$$
$$= \Big| \sum_{n=1}^{N} \Big(\prod_{t=0}^{T} \pi_b(a^{(n)}_t \mid s^{(n)}_t)\Big) \rho^{(n)}_{0:T} \sum_{t=0}^{T} \gamma^t r^{(n)}_t - \mathbb{E}_{\pi_e}[\hat{V}^{CIS}_{\pi^c_e}] \Big| \quad (c)$$
$$= \Big| \sum_{n=1}^{N} \prod_{t=0}^{T} \Big(\pi_b(a^{(n)}_t \mid s^{(n)}_t) \frac{\pi^c_e(a^{(n)}_t \mid \tilde{c}^{(n)}_t)}{\pi^c_b(a^{(n)}_t \mid \tilde{c}^{(n)}_t)}\Big) \sum_{t=0}^{T} \gamma^t r^{(n)}_t - \mathbb{E}_{\pi_e}[\hat{V}^{CIS}_{\pi^c_e}] \Big| \quad (d)$$
Explanation of steps:
(a) We start from the definition of bias as the difference between the expected values of the estimator under the behavior policy $\pi_b$ and the concept-based evaluation policy $\pi_e(a \mid c)$.
(b) We expand the respective definitions.
(c) Each term is expanded to represent the probability of the trajectories, factoring in the importance sampling ratio.
(d) Similar terms are grouped together to concisely represent the impact of the importance sampling ratios.
The bias of the CIS estimator is minimal when the concepts $\tilde{c}_t$ equal the traditional state representations $s_t$; imperfect concept-based sampling thus induces bias. As the concepts are unknown, the reparameterization of the probabilities of the behavior trajectories is not possible, leading to a finite bias, in contrast to the known-concept representations.
E.1.2 VARIANCE
We start with the definition of variance for the CIS estimator:
$$\mathbb{V}[\hat{V}^{CIS}_{\pi_e}] = \mathbb{E}_{\pi_b}[(\hat{V}^{CIS}_{\pi_e})^2] - (\mathbb{E}_{\pi_b}[\hat{V}^{CIS}_{\pi_e}])^2 \quad (a)$$
We first evaluate the expectation of the square of the estimator:
$$\mathbb{E}_{\pi_b}[(\hat{V}^{CIS}_{\pi_e})^2] = \mathbb{E}_{\pi_b}\Big[\Big(\rho^{(n)}_{0:T} \sum_{t=0}^{T} \gamma^t r^{(n)}_t\Big)^2\Big] \quad (b)$$
$$= \mathbb{E}_{\pi_b}\Big[\sum_{t=0}^{T} \sum_{t'=0}^{T} \rho^2_{0:T}\, \gamma^{(t+t')} r^{(n)}_t r^{(n)}_{t'}\Big] \quad (c)$$
$$= \sum_{n=1}^{N} \prod_{t=0}^{T} \pi_b(a^{(n)}_t \mid s^{(n)}_t) \Big(\frac{\pi^c_e(a^{(n)}_t \mid \tilde{c}^{(n)}_t)}{\pi^c_b(a^{(n)}_t \mid \tilde{c}^{(n)}_t)}\Big)^2 \sum_{t=0}^{T} \sum_{t'=0}^{T} \gamma^{(t+t')} r_t r_{t'} \quad (d)$$
Evaluating the second term in the variance expression:
$$(\mathbb{E}_{\pi_b}[\hat{V}^{CIS}_{\pi_e}])^2 = \Big(\mathbb{E}_{\pi_b}\Big[\rho^{(n)}_{0:T} \sum_{t=0}^{T} \gamma^t r^{(n)}_t\Big]\Big)^2 \quad (e)$$
$$= \sum_{n=1}^{N} \prod_{t=0}^{T} \Big(\pi_b(a^{(n)}_t \mid s^{(n)}_t) \frac{\pi^c_e(a^{(n)}_t \mid \tilde{c}^{(n)}_t)}{\pi^c_b(a^{(n)}_t \mid \tilde{c}^{(n)}_t)}\Big)^2 \sum_{t=0}^{T} \sum_{t'=0}^{T} \gamma^{(t+t')} r_t r_{t'} \quad (f)$$
Subtracting the squared expectation from the expectation of the squared estimator:
$$\mathbb{V}[\hat{V}^{CIS}_{\pi_e}] = \sum_{n=1}^{N} \prod_{t=0}^{T} \Big(\pi_b(a^{(n)}_t \mid s^{(n)}_t) - \pi_b(a^{(n)}_t \mid s^{(n)}_t)^2\Big) \Big(\frac{\pi^c_e(a^{(n)}_t \mid \tilde{c}^{(n)}_t)}{\pi^c_b(a^{(n)}_t \mid \tilde{c}^{(n)}_t)}\Big)^2 \sum_{t=0}^{T} \sum_{t'=0}^{T} \gamma^{(t+t')} r_t r_{t'} \quad (g)$$
Explanation of steps:
(a) We begin with the definition of variance for our estimator.
(b) We expand the square of the estimator as the square of a sum of weighted returns.
(c),(d)
We further expand the expected value of this squared sum and evaluate the expected values
under the assumption that trajectories are sampled independently.
(e) We calculate the square of the expectation of the estimator.
(f) We expand this squared expectation.
(g)
We obtain the final expression for variance by subtracting the squared expectation from the
expectation of the squared estimator, simplifying to consider the covariance terms.
E.1.3 VARIANCE COMPARISON BETWEEN CONCEPT IS RATIOS AND TRADITIONAL IS RATIOS
$$\text{Theorem. } \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)}\Big] \le \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}\Big]$$
Proof: The proof is similar to Pavse & Hanna (2022b) and the known-concept case, where we generalize from state abstractions to parameterized concepts. The proof remains intact because we make no assumptions on how the concepts are derived, as long as they satisfy the desiderata. Using Lemma D and Assumption 5.1, we can say that:
$$\mathbb{E}_{c \sim d_{\pi^c}} \prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} = \mathbb{E}_{s \sim d_{\pi}} \prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} = 1 \quad (a)$$
Denoting the difference between the two variances as $D$:
$$D = \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}\Big] - \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)}\Big] \quad (b)$$
$$= \mathbb{E}_{\pi_b}\Big[\prod_{t=0}^{T} \Big(\frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}\Big)^2\Big] - \mathbb{E}_{\pi_b}\Big[\prod_{t=0}^{T} \Big(\frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)}\Big)^2\Big] \quad (c)$$
$$= \sum_{s} \prod_{t=0}^{T} \pi_b(a_t \mid s_t) \Big[\prod_{t=0}^{T} \Big(\frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}\Big)^2\Big] - \sum_{c} \prod_{t=0}^{T} \pi^c_b(a_t \mid \tilde{c}_t) \Big[\prod_{t=0}^{T} \Big(\frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)}\Big)^2\Big] \quad (d)$$
$$= \sum_{c} \Bigg( \sum_{s \in \phi^{-1}(c)} \prod_{t=0}^{T} \pi_b(a_t \mid s_t) \Big[\prod_{t=0}^{T} \Big(\frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}\Big)^2\Big] - \prod_{t=0}^{T} \pi^c_b(a_t \mid \tilde{c}_t) \Big[\prod_{t=0}^{T} \Big(\frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)}\Big)^2\Big] \Bigg) \quad (e)$$
$$= \sum_{c} \Bigg( \sum_{s \in \phi^{-1}(c)} \Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)^2}{\pi_b(a_t \mid s_t)}\Big] - \Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)^2}{\pi^c_b(a_t \mid \tilde{c}_t)}\Big] \Bigg) \quad (f)$$
We analyse the difference of variances for a single fixed concept and denote it as $D'$:
$$D' = \Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)^2}{\pi^c_b(a_t \mid \tilde{c}_t)}\Big] - \Bigg( \sum_{s \in \phi^{-1}(c)} \Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)^2}{\pi_b(a_t \mid s_t)}\Big] \Bigg) \quad (g)$$
Now, if we can show $D' \le 0$ for every concept, where $|c|$ denotes the cardinality of the concept representation, then the difference $D$ will always be non-negative, completing our proof. We use induction on the total number of states merged into a concept, from $1$ to $|c| = n < |S|$. Our induction statement $T(n)$ to prove is that $D' \le 0$ where $n = |c'|$. For $n = 1$, the statement is trivially true, since every concept is then exactly the traditional representation of the state. Our inductive hypothesis states that
$$D' = \Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)^2}{\pi^c_b(a_t \mid \tilde{c}_t)}\Big] - \Bigg( \sum_{s \in \phi^{-1}(c)} \Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)^2}{\pi_b(a_t \mid s_t)}\Big] \Bigg) \le 0 \quad (h)$$
Now, we define $S = \sum_{s \in \phi^{-1}(c)} \big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)^2}{\pi_b(a_t \mid s_t)}\big]$, $C = \prod_{t=0}^{T} \pi^c_e(a_t \mid \tilde{c}_t)$, and $C' = \prod_{t=0}^{T} \pi^c_b(a_t \mid \tilde{c}_t)$, so that the concept term equals $C^2/C'$. After making the substitutions, we obtain
$$C^2 \le S C' \quad (i)$$
This result holds for $|c| = n$ by the induction hypothesis. Now, we add a new state $s_{n+1}$ to the concept as part of the induction step, and obtain the following difference between the state-sum increment and the concept-term increment:
$$D' = S \times \frac{\pi_e(a \mid s_{n+1})^2}{\pi_b(a \mid s_{n+1})} - \frac{C}{C'} \times \frac{\pi_e(a \mid s_{n+1})^2}{\pi_b(a \mid s_{n+1})} \quad (j)$$
Let $\pi_e(a \mid s_{n+1}) = X$ and $\pi_b(a \mid s_{n+1}) = Y$. Substituting, we get:
$$D' = \frac{S X^2}{Y} - \frac{C}{C'} \cdot \frac{X^2}{Y} = \frac{(S C' - C) X^2}{C' Y} \quad (k)$$
$D'$ is smallest when $C$ is largest, hence we substitute $C \le \sqrt{S C'}$ from the induction hypothesis into the expression:
$$D' \ge \frac{(S C' - \sqrt{S C'}) X^2}{C' Y} \quad (l)$$
Whenever the term $S C' - \sqrt{S C'}$ is non-negative, the remaining quantities being positive gives $D' \ge 0$: the state-space sum continues to dominate the concept term after adding the new state. Thus the inductive step holds, and that concludes the proof.
E.1.4 VARIANCE COMPARISON BETWEEN UNKNOWN CIS AND IS ESTIMATORS
Theorem. When
$$\mathrm{Cov}\Big(\prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_t,\ \prod_{t'=0}^{k} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_k\Big) \le \mathrm{Cov}\Big(\prod_{t'=0}^{t} \frac{\pi_e(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})} r_t,\ \prod_{t'=0}^{k} \frac{\pi_e(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})} r_k\Big),$$
the variance of the unknown CIS estimator is lower than that of the IS estimator, i.e. $\mathbb{V}_{\pi_b}[\hat{V}^{CIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{IS}]$.
Proof: Using Lemma D and Assumption 5.1, we can say that:
$$\mathbb{E}_{c \sim d_{\pi^c}}\Big[ \sum_{t=0}^{T} \prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} r_t(c_t, a_t) \Big] = \mathbb{E}_{s \sim d_{\pi}}\Big[ \sum_{t=0}^{T} \prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t(s_t, a_t) \Big] \quad (a)$$
The variance for a single sample of the CIS estimator is given by
$$\mathbb{V}[\hat{V}^{CIS}] = \frac{1}{T^2} \Bigg( \sum_{t=0}^{T} \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} r_t\Big] + 2 \sum_{t<k} \mathrm{Cov}\Big(\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} r_t,\ \prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} r_k\Big) \Bigg) \quad (b)$$
The variance for a single sample of the IS estimator is given by
$$\mathbb{V}[\hat{V}^{IS}] = \frac{1}{T^2} \Bigg( \sum_{t=0}^{T} \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t\Big] + 2 \sum_{t<k} \mathrm{Cov}\Big(\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t,\ \prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_k\Big) \Bigg) \quad (c)$$
We take the difference between the variances, and note that the difference of the covariance terms is non-positive by assumption. Hence, if we show that the per-timestep difference of variances is also non-positive, the proof is complete.
$$D = \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} r_t\Big] - \mathbb{V}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t\Big] \quad (d)$$
$$= \mathbb{E}_{\pi^c_b}\Big[\Big(\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} r_t\Big)^2\Big] - \mathbb{E}_{\pi_b}\Big[\Big(\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t\Big)^2\Big] \quad (e)$$
$$= \sum_{c} \prod_{t=0}^{T} \pi^c_b(a_t \mid \tilde{c}_t) \Big[\Big(\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} r_t\Big)^2\Big] - \sum_{s} \prod_{t=0}^{T} \pi_b(a_t \mid s_t) \Big[\Big(\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t\Big)^2\Big] \quad (f)$$
$$= \sum_{c} \sum_{s \in \phi^{-1}(c)} \prod_{t=0}^{T} \pi^c_b(a_t \mid \tilde{c}_t) \Big[\Big(\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} r_t\Big)^2\Big] - \sum_{s} \prod_{t=0}^{T} \pi_b(a_t \mid s_t) \Big[\Big(\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t\Big)^2\Big] \quad (g)$$
$$\le R^2_{\max} \Bigg( \sum_{c} \sum_{s \in \phi^{-1}(c)} \prod_{t=0}^{T} \pi^c_b(a_t \mid \tilde{c}_t) \Big[\Big(\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)}\Big)^2\Big] - \sum_{s} \prod_{t=0}^{T} \pi_b(a_t \mid s_t) \Big[\Big(\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}\Big)^2\Big] \Bigg) \quad (h)$$
The rest of the proof is identical to the previous subsection of known concepts, wherein we apply
induction over the cardinality of the concepts and show the term inside the bracket is never positive,
thus completing the proof.
E.1.5 UPPER BOUND ON THE BIAS
Unlike the known-concept case, there exists a finite bias for unknown concepts, and its bounds need to be analyzed.
$$\text{Bias} = \Big| \sum_{n=1}^{N} \prod_{t=0}^{T} \Big(\pi_b(a^{(n)}_t \mid s^{(n)}_t) \frac{\pi^c_e(a^{(n)}_t \mid \tilde{c}^{(n)}_t)}{\pi^c_b(a^{(n)}_t \mid \tilde{c}^{(n)}_t)}\Big) \sum_{t=0}^{T} \gamma^t r^{(n)}_t - \mathbb{E}_{\pi^c_e}[\hat{V}^{CIS}_{\pi^c_e}] \Big| \quad (a)$$
$$\le \Big| \sum_{n=1}^{N} \prod_{t=0}^{T} \Big(\pi^c_e(a^{(n)}_t \mid \tilde{c}^{(n)}_t) \frac{\pi_b(a^{(n)}_t \mid s^{(n)}_t)}{\pi^c_b(a^{(n)}_t \mid \tilde{c}^{(n)}_t)}\Big) \sum_{t=0}^{T} \gamma^t r^{(n)}_t \Big| + \big|\mathbb{E}_{\pi^c_e}[\hat{V}^{CIS}_{\pi^c_e}]\big| \quad (b)$$
$$\le \frac{1}{N} \Big| \sum_{n=1}^{N} \prod_{t=0}^{T} \frac{\pi^c_e(a^{(n)}_t \mid \tilde{c}^{(n)}_t)}{\pi^c_b(a^{(n)}_t \mid \tilde{c}^{(n)}_t)} \sum_{t=0}^{T} \gamma^t r^{(n)}_t \Big| + \frac{7 T R_{\max} U_c^{T} \ln(\frac{2}{\delta})}{3(N-1)} + \sqrt{\frac{\ln(\frac{2}{\delta})}{N^3 - N^2} \sum_{i<j}^{N} (X_i - X_j)^2} + \big|\mathbb{E}_{\pi^c_e}[\hat{V}^{CIS}_{\pi_e}]\big| \quad (c)$$
$$\le T R_{\max} U_c^{T} \Big(\frac{1}{N} + \frac{\ln \frac{2}{\delta}}{3(N-1)}\Big) + \sqrt{\frac{\ln(\frac{2}{\delta})}{N^3 - N^2} \sum_{i<j}^{N} (X_i - X_j)^2} + \big|\mathbb{E}_{\pi^c_e}[\hat{V}^{CIS}_{\pi_e}]\big| \quad (d)$$
Explanation of steps:
(a) We begin with the evaluated bias expression.
(b) Applying the triangle inequality.
(c) Applying the Bernstein inequality, which holds with probability $1-\delta$; $X_i$ refers to the CIS estimate for one sample.
(d) Grouping terms 1 and 2 together, where $U_c = \max \frac{\pi^c_e(a \mid \tilde{c})}{\pi^c_b(a \mid \tilde{c})}$.
The first term of the bias dominates the second in the number of samples, while the true expectation of the CIS estimator is unknown in general. Since the maximum possible reward is typically known, the first term dominates the bias expression. Thus, the bias is of complexity $O\big(\frac{T R_{\max} U_c^{T}}{N}\big)$.
E.1.6 UPP ER BOUND ON THE VARIANCE
$$\mathbb{V}[\hat{V}^{CIS}_{\pi_b}] = \mathbb{E}_{\pi_b}\Big[\Big(\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{T} \frac{\pi^c_e(a_{t'} \mid \tilde{c}_{t'})}{\pi^c_b(a_{t'} \mid \tilde{c}_{t'})}\Big)^2\Big] - \Big(\mathbb{E}_{\pi_b}\Big[\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{T} \frac{\pi^c_e(a_{t'} \mid \tilde{c}_{t'})}{\pi^c_b(a_{t'} \mid \tilde{c}_{t'})}\Big]\Big)^2 \quad (a)$$
$$\le \mathbb{E}_{\pi_b}\Big[\Big(\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{T} \frac{\pi^c_e(a_{t'} \mid \tilde{c}_{t'})}{\pi^c_b(a_{t'} \mid \tilde{c}_{t'})}\Big)^2\Big] \quad (b)$$
$$\le \frac{1}{N} \sum_{n=1}^{N} \Big(\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{T} \frac{\pi^c_e(a_{t'} \mid \tilde{c}_{t'})}{\pi^c_b(a_{t'} \mid \tilde{c}_{t'})}\Big)^2 + \frac{7 T^2 R^2_{\max} U_c^{2T} \ln(\frac{2}{\delta})}{3(N-1)} + \sqrt{\frac{\ln(\frac{2}{\delta})}{N^3 - N^2} \sum_{i<j}^{N} (X_i^2 - X_j^2)^2} \quad (c)$$
$$\le T^2 R^2_{\max} U_c^{2T} \Big(\frac{1}{N} + \frac{\ln \frac{2}{\delta}}{3(N-1)}\Big) + \sqrt{\frac{\ln(\frac{2}{\delta})}{N^3 - N^2} \sum_{i<j}^{N} (X_i^2 - X_j^2)^2} \quad (d)$$
Explanation of steps:
(a) We begin with the definition of variance.
(b) The second term is always greater than 0.
(c) Applying the Bernstein inequality, which holds with probability $1-\delta$; $X_i$ refers to the CIS estimate for one sample.
(d) Grouping terms 1 and 2 together, where $U_c = \max \frac{\pi^c_e(a \mid \tilde{c})}{\pi^c_b(a \mid \tilde{c})}$.
The first term of the variance dominates the second as the number of samples increases. Thus, the variance is of complexity $O\big(\frac{T^2 R^2_{\max} U_c^{2T}}{N}\big)$.
E.1.7 UPPER BOUND ON THE MSE
$$\text{MSE} = \text{Bias}^2 + \text{Variance} \quad (a)$$
$$\sim O\Big(\frac{T R_{\max} U_c^{T}}{N}\Big)^2 + \epsilon\big(|\mathbb{E}_{\pi^c_e}[\hat{V}^{CIS}_{\pi_e}]|^2\big) + O\Big(\frac{T^2 R^2_{\max} U_c^{2T}}{N}\Big) \quad (b)$$
$$\sim O\Big(\frac{T^2 R^2_{\max} U_c^{2T}}{N}\Big) + \epsilon\big(|\mathbb{E}_{\pi^c_e}[\hat{V}^{CIS}_{\pi_e}]|^2\big) \quad (c)$$
$$\sim O\Big(\frac{T^2 R^2_{\max} U_s^{2T}}{N}\Big) K^{2T} + \epsilon\big(|\mathbb{E}_{\pi^c_e}[\hat{V}^{CIS}_{\pi_e}]|^2\big) \quad (d)$$
The arguments are similar to the known-concept MSE bounds, with the difference that the expressions for $U_c$, $U_s$ and $K$ are over approximations of concepts instead of true concepts, and there is an irreducible error over $\mathbb{E}_{\pi^c_e}[\hat{V}^{CIS}_{\pi_e}]$, as the distribution is sampled in the concept representation instead of the state representation.
E.1.8 VARIANCE COMPARISON WITH MIS ESTIMATOR
Theorem. When
$$\mathrm{Cov}\Big(\prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_t,\ \prod_{t'=0}^{k} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_k\Big) \le \mathrm{Cov}\Big(\frac{d_{\pi_e}(s_t, a_t)}{d_{\pi_b}(s_t, a_t)} r_t,\ \frac{d_{\pi_e}(s_k, a_k)}{d_{\pi_b}(s_k, a_k)} r_k\Big),$$
the variance is lower than the variance of the MIS estimator, i.e. $\mathbb{V}_{\pi_b}[\hat{V}^{CIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{MIS}]$.
Proof: We start from the assumption:
$$\mathrm{Cov}\Big(\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} r_t,\ \prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} r_k\Big) = \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} \prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} r_t r_k\Big] - \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} r_t\Big] \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi^c_e(a_t \mid \tilde{c}_t)}{\pi^c_b(a_t \mid \tilde{c}_t)} r_k\Big] \quad (a)$$
$$= \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} \prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} K^2 r_t r_k\Big] - \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} K r_t\Big] \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} K r_k\Big] \quad (b)$$
$$\le \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} \prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t r_k\Big] - \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_t\Big] \mathbb{E}\Big[\prod_{t=0}^{T} \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)} r_k\Big] \quad (c)$$
$$\le \mathbb{E}\Big[\frac{d_{\pi_e}(s_t, a_t)}{d_{\pi_b}(s_t, a_t)} \frac{d_{\pi_e}(s_k, a_k)}{d_{\pi_b}(s_k, a_k)} r_t r_k\Big] - \mathbb{E}\Big[\frac{d_{\pi_e}(s_t, a_t)}{d_{\pi_b}(s_t, a_t)} r_t\Big] \mathbb{E}\Big[\frac{d_{\pi_e}(s_k, a_k)}{d_{\pi_b}(s_k, a_k)} r_k\Big] \quad (d)$$
Explanation of steps:
(a) We begin with the definition of covariance.
(b), (c) Using the definition of $\pi^c$, with $K$ (the ratio of the state-space distribution ratios) $< 1$.
(d) Applying Lemma D.1.7 to both terms.
Finally, using Lemma D.1.7 with $Y_t = \prod_{t'=0}^{T} \frac{\pi_e(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})} r_t$ and $X_t = (s_t, a_t, r_t)$ completes our proof.
E.2 PDIS
E.2.1 BIAS
$$\text{Bias} = \big| \mathbb{E}_{\pi_b}[\hat{V}^{CPDIS}_{\pi_e}] - \mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi^c_e}] \big| \quad (a)$$
$$= \Big| \mathbb{E}_{\pi_b}\Big[\sum_{t=0}^{T} \gamma^t \rho^{(n)}_{0:t} r^{(n)}_t\Big] - \mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi^c_e}] \Big| \quad (b)$$
$$= \Big| \sum_{n=1}^{N} \Big(\prod_{t=0}^{T} \pi_b(a^{(n)}_t \mid s^{(n)}_t)\Big) \sum_{t=0}^{T} \gamma^t \rho^{(n)}_{0:t} r^{(n)}_t - \mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi^c_e}] \Big| \quad (c)$$
$$= \Big| \sum_{n=1}^{N} \sum_{t=0}^{T} \gamma^t \Big(\prod_{t'=0}^{t} \pi^c_e(a^{(n)}_{t'} \mid \tilde{c}^{(n)}_{t'}) \frac{\pi_b(a^{(n)}_{t'} \mid s^{(n)}_{t'})}{\pi^c_b(a^{(n)}_{t'} \mid \tilde{c}^{(n)}_{t'})}\Big) r^{(n)}_t - \mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi^c_e}] \Big| \quad (d)$$
Explanation of steps: Similar to CIS.
E.2.2 VARIANCE
Following the process similar to CIS estimator:
$$\mathbb{V}[\hat{V}^{CPDIS}_{\pi_b}] = \mathbb{E}_{\pi_b}[(\hat{V}^{CPDIS}_{\pi_b})^2] - (\mathbb{E}_{\pi_b}[\hat{V}^{CPDIS}_{\pi_b}])^2 \quad (a)$$
We first evaluate the expectation of the square of the estimator:
$$\mathbb{E}_{\pi_b}[(\hat{V}^{CPDIS}_{\pi_b})^2] = \mathbb{E}_{\pi_b}\Big[\Big(\sum_{t=0}^{T} \gamma^t \rho_{0:t} r_t\Big)^2\Big] \quad (b)$$
$$= \mathbb{E}_{\pi_b}\Big[\sum_{t=0}^{T} \sum_{t'=0}^{T} \rho_{0:t}\, \rho_{0:t'}\, \gamma^{(t+t')} r_t r_{t'}\Big] \quad (c)$$
$$= \sum_{n=1}^{N} \sum_{t=0}^{T} \sum_{t'=0}^{T} \Big(\prod_{t''=0}^{t} \pi_b(a_{t''} \mid s_{t''}) \frac{\pi^c_e(a_{t''} \mid \tilde{c}_{t''})}{\pi^c_b(a_{t''} \mid \tilde{c}_{t''})}\Big) \Big(\prod_{t'''=0}^{t'} \frac{\pi^c_e(a_{t'''} \mid \tilde{c}_{t'''})}{\pi^c_b(a_{t'''} \mid \tilde{c}_{t'''})}\Big) \gamma^{(t+t')} r_t r_{t'} \quad (d)$$
Evaluating the second term in the variance expression:
$$(\mathbb{E}_{\pi_b}[\hat{V}^{CPDIS}_{\pi_b}])^2 = \Big(\sum_{n=1}^{N} \sum_{t=0}^{T} \mathbb{E}_{\pi_b}[\gamma^t \rho_{0:t} r_t]\Big)^2 \quad (e)$$
$$= \sum_{n=1}^{N} \sum_{t=0}^{T} \sum_{t'=0}^{T} \Big(\prod_{t''=0}^{t} \pi_b(a_{t''} \mid s_{t''}) \frac{\pi^c_e(a_{t''} \mid \tilde{c}_{t''})}{\pi^c_b(a_{t''} \mid \tilde{c}_{t''})}\Big) \Big(\prod_{t'''=0}^{t'} \pi_b(a_{t'''} \mid s_{t'''}) \frac{\pi^c_e(a_{t'''} \mid \tilde{c}_{t'''})}{\pi^c_b(a_{t'''} \mid \tilde{c}_{t'''})}\Big) \gamma^{(t+t')} r_t r_{t'} \quad (f)$$
Subtracting the squared expectation from the expectation of the squared estimator:
$$\mathbb{V}[\hat{V}^{CPDIS}_{\pi_b}] = \sum_{n=1}^{N} \sum_{t=0}^{T} \sum_{t'=0}^{T} \Big(\prod_{t'''=0}^{t} \pi_b(a_{t'''} \mid s_{t'''}) \frac{\pi^c_e(a_{t'''} \mid \tilde{c}_{t'''})}{\pi^c_b(a_{t'''} \mid \tilde{c}_{t'''})}\Big) \Big(\prod_{t''=0}^{t'} (1 - \pi_b(a_{t''} \mid s_{t''})) \frac{\pi^c_e(a_{t''} \mid \tilde{c}_{t''})}{\pi^c_b(a_{t''} \mid \tilde{c}_{t''})}\Big) \gamma^{(t+t')} r_t r_{t'} \quad (g)$$
Explanation of steps: Similar to CIS
E.2.3 VARIANCE COMPARISON BETWEEN UNKNOWN CPDIS AND PDIS ESTIMATORS
Theorem E.1. When
$$\mathrm{Cov}\Big(\prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_t,\ \prod_{t'=0}^{k} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_k\Big) \le \mathrm{Cov}\Big(\prod_{t'=0}^{t} \frac{\pi_e(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})} r_t,\ \prod_{t'=0}^{k} \frac{\pi_e(a_{t'} \mid s_{t'})}{\pi_b(a_{t'} \mid s_{t'})} r_k\Big),$$
the variance of parameterized CPDIS estimators is lower than that of the PDIS estimator, i.e. $\mathbb{V}_{\pi_b}[\hat{V}^{CPDIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{PDIS}]$.
Proof: Similar to CIS estimator.
E.2.4 UPPER BOUND ON THE BIAS
Unlike the known-concept case, there exists a finite bias for unknown concepts, and its bounds need to be analyzed.
$$\text{Bias} = \Big| \sum_{n=1}^{N} \sum_{t=0}^{T} \gamma^t \Big(\prod_{t'=0}^{t} \pi^c_e(a^{(n)}_{t'} \mid \tilde{c}^{(n)}_{t'}) \frac{\pi_b(a^{(n)}_{t'} \mid s^{(n)}_{t'})}{\pi^c_b(a^{(n)}_{t'} \mid \tilde{c}^{(n)}_{t'})}\Big) r^{(n)}_t - \mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi_e}] \Big| \quad (a)$$
$$\le \Big| \sum_{n=1}^{N} \sum_{t=0}^{T} \gamma^t \Big(\prod_{t'=0}^{t} \pi^c_e(a^{(n)}_{t'} \mid \tilde{c}^{(n)}_{t'}) \frac{\pi_b(a^{(n)}_{t'} \mid s^{(n)}_{t'})}{\pi^c_b(a^{(n)}_{t'} \mid \tilde{c}^{(n)}_{t'})}\Big) r^{(n)}_t \Big| + \big|\mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi_e}]\big| \quad (b)$$
$$\le \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} \gamma^t \prod_{t'=0}^{t} \frac{\pi^c_e(a^{(n)}_{t'} \mid \tilde{c}^{(n)}_{t'})}{\pi^c_b(a^{(n)}_{t'} \mid \tilde{c}^{(n)}_{t'})} r^{(n)}_t + \frac{7 T R_{\max} U_c^{T} \ln(\frac{2}{\delta})}{3(N-1)} + \sqrt{\frac{\ln(\frac{2}{\delta})}{N^3 - N^2} \sum_{i<j}^{N} (X_i - X_j)^2} + \big|\mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi_e}]\big| \quad (c)$$
$$\le T R_{\max} U_c^{T} \Big(\frac{1}{N} + \frac{\ln \frac{2}{\delta}}{3(N-1)}\Big) + \sqrt{\frac{\ln(\frac{2}{\delta})}{N^3 - N^2} \sum_{i<j}^{N} (X_i - X_j)^2} + \big|\mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi_e}]\big| \quad (d)$$
Explanation of steps: Similar to CIS.
E.2.5 UPPER BOUND ON THE VARIANCE
$$\mathbb{V}[\hat{V}^{CPDIS}_{\pi_b}] = \mathbb{E}_{\pi_b}\Big[\Big(\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid \tilde{c}_{t'})}{\pi^c_b(a_{t'} \mid \tilde{c}_{t'})}\Big)^2\Big] - \Big(\mathbb{E}_{\pi_b}\Big[\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid \tilde{c}_{t'})}{\pi^c_b(a_{t'} \mid \tilde{c}_{t'})}\Big]\Big)^2 \quad (a)$$
$$\le \mathbb{E}_{\pi_b}\Big[\Big(\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid \tilde{c}_{t'})}{\pi^c_b(a_{t'} \mid \tilde{c}_{t'})}\Big)^2\Big] \quad (b)$$
$$\le \frac{1}{N} \sum_{n=1}^{N} \Big(\sum_{t=0}^{T} \gamma^t r_t \prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid \tilde{c}_{t'})}{\pi^c_b(a_{t'} \mid \tilde{c}_{t'})}\Big)^2 + \frac{7 T^2 R^2_{\max} U_c^{2T} \ln(\frac{2}{\delta})}{3(N-1)} + \sqrt{\frac{\ln(\frac{2}{\delta})}{N^3 - N^2} \sum_{i<j}^{N} (X_i^2 - X_j^2)^2} \quad (c)$$
$$\le T^2 R^2_{\max} U_c^{2T} \Big(\frac{1}{N} + \frac{\ln \frac{2}{\delta}}{3(N-1)}\Big) + \sqrt{\frac{\ln(\frac{2}{\delta})}{N^3 - N^2} \sum_{i<j}^{N} (X_i^2 - X_j^2)^2} \quad (d)$$
Explanation of steps: Similar to CIS.
E.2.6 UPPER BOUND ON THE MSE
$$\text{MSE} = \text{Bias}^2 + \text{Variance} \quad (a)$$
$$\sim O\Big(\frac{T R_{\max} U_c^{T}}{N}\Big)^2 + \epsilon\big(|\mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi_e}]|^2\big) + O\Big(\frac{T^2 R^2_{\max} U_c^{2T}}{N}\Big) \quad (b)$$
$$\sim O\Big(\frac{T^2 R^2_{\max} U_c^{2T}}{N}\Big) + \epsilon\big(|\mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi_e}]|^2\big) \quad (c)$$
$$\sim O\Big(\frac{T^2 R^2_{\max} U_s^{2T}}{N}\Big) K^{2T} + \epsilon\big(|\mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi_e}]|^2\big) \quad (d)$$
The arguments are similar to the known-concept MSE bounds, with the difference that the expressions for $U_c$, $U_s$ and $K$ are over approximations of concepts instead of true concepts, and there is an irreducible error over $\mathbb{E}_{\pi^c_e}[\hat{V}^{CPDIS}_{\pi_e}]$, as the distribution is sampled in the concept representation instead of the state representation.
E.2.7 VARIANCE COMPARISON WITH MIS ESTIMATOR
Theorem. When
$$\mathrm{Cov}\Big(\prod_{t'=0}^{t} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_t,\ \prod_{t'=0}^{k} \frac{\pi^c_e(a_{t'} \mid c_{t'})}{\pi^c_b(a_{t'} \mid c_{t'})} r_k\Big) \le \mathrm{Cov}\Big(\frac{d_{\pi_e}(s_t, a_t)}{d_{\pi_b}(s_t, a_t)} r_t,\ \frac{d_{\pi_e}(s_k, a_k)}{d_{\pi_b}(s_k, a_k)} r_k\Big),$$
the variance is lower than the variance of the MIS estimator, i.e. $\mathbb{V}_{\pi_b}[\hat{V}^{CPDIS}] \le \mathbb{V}_{\pi_b}[\hat{V}^{MIS}]$.
Explanation of Steps: Similar to CIS
F ENVIRONMENTS
WindyGridworld Figure 7 illustrates the Windy Gridworld environment, a 20x20 grid divided into
regions with varying wind directions and penalties. The agent’s goal is to navigate from a randomly
chosen starting point to a fixed goal in the top-right corner. Off-diagonal winds increase in strength
near non-windy regions, affecting the agent’s movement. Each of the four available actions moves
the agent four steps in the chosen direction. Reaching the goal earns a +5 reward while moving away
results in a -0.2 penalty. Additional negative rewards are based on regional penalties within the grid.
Each episode ends after 200 steps.
The grid is split into 25 blocks, each measuring 4x4 units, with each region carrying a penalty based on the wind strength. Blocks affected by wind display the direction and strength (e.g., '←↑ (-2,+2)' indicates northward and westward winds with a strength of 2 units each). This setup encourages the agent to navigate through non-penalty areas for optimal rewards.
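For readers who want to reproduce the dynamics, a minimal Python sketch of such an environment follows; the per-block wind/penalty layout here is a simple placeholder rather than the exact map of Figure 7.

```python
import numpy as np

class WindyGridworld:
    """Minimal sketch of the environment described above."""

    MOVES = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}  # N, S, E, W

    def __init__(self, size=20, step_size=4, max_steps=200):
        self.size, self.step_size, self.max_steps = size, step_size, max_steps
        self.goal = (size - 1, size - 1)                     # top-right corner

    def reset(self, rng):
        self.pos, self.t = tuple(rng.integers(self.size, size=2)), 0
        return self.pos

    def _wind_penalty(self, pos):
        # Placeholder layout: some 4x4 blocks carry an off-diagonal wind.
        bx, by = pos[0] // 4, pos[1] // 4
        windy = (bx + by) % 3 == 0
        return ((-1, 1) if windy else (0, 0)), (-1.0 if windy else 0.0)

    def step(self, action):
        dx, dy = self.MOVES[action]
        wind, penalty = self._wind_penalty(self.pos)
        x = int(np.clip(self.pos[0] + self.step_size * dx + wind[0], 0, self.size - 1))
        y = int(np.clip(self.pos[1] + self.step_size * dy + wind[1], 0, self.size - 1))
        old_d = abs(self.goal[0] - self.pos[0]) + abs(self.goal[1] - self.pos[1])
        new_d = abs(self.goal[0] - x) + abs(self.goal[1] - y)
        self.pos, self.t = (x, y), self.t + 1
        reward = (5.0 if self.pos == self.goal else (-0.2 if new_d > old_d else 0.0)) + penalty
        done = self.pos == self.goal or self.t >= self.max_steps
        return self.pos, reward, done
```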
MIMIC-III We use the publicly available MIMIC-III database (Johnson et al., 2016) from Phys-
ioNet (Goldberger et al., 2000), which records the treatment and progression of ICU patients at the
Beth Israel Deaconess Medical Center in Boston, Massachusetts. We focus on the task of managing
acutely hypotensive patients in the ICU. Our preprocessing follows the original MIMIC-III steps
detailed in Komorowski et al. (2018c) and used in subsequent works (Keramati et al., 2021b; Matsson
& Johansson, 2021). After processing the data in Excel, we group patients by ’ICU-stayID’ to form
distinct trajectories.
The state space includes 15 features: Creatinine, FiO2, Lactate, Partial Pressure of Oxygen, Partial Pressure of CO2, Urine Output, GCS score, and electrolytes such as Calcium, Chloride, Glucose, HCO3, Magnesium, Potassium, Sodium, and SpO2. Each feature is binned into 10 levels from 0 (very low) to 9 (very high).
Treatments for hypotension include IV fluid bolus administration and vasopressor initiation, with
doses categorized into four levels: "none," "low," "medium," and "high," forming a total action space
of 16 discrete actions. The reward function depends on the next mean arterial pressure (MAP) and
ranges from -1 to 0, linearly distributed between 20 and 65. A MAP above 65 indicates that the
patient is not experiencing hypotension.
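A direct transcription of this reward function into Python is shown below; the function name is ours.

```python
def map_reward(next_map: float) -> float:
    """Reward from the next mean arterial pressure (MAP), as described above:
    0 at MAP >= 65, -1 at MAP <= 20, and linear in between."""
    if next_map >= 65.0:
        return 0.0
    if next_map <= 20.0:
        return -1.0
    return (next_map - 65.0) / (65.0 - 20.0)   # e.g. map_reward(42.5) == -0.5
```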
Figure 7: Schematic of the windy-gridworld environment. The top-right corner is the goal target of the agent. The wind direction and reward penalty are indicated in each region.
G ADDITIONAL EXPERIMENTAL DETAILS
G.1 UNKNOWN CONCEPTS EXPERIMENTAL SETUP
Environments, Policy descriptions, Metrics: Same as those in the known-concepts section.
Training and Hyperparameter Details: We use 400 training, 50 validation, and 50 test trajectories sampled from the behavior policy to train the CBMs, which predict the next state transitions from the current state. The model architecture includes an input layer, a bottleneck, two 256-neuron layers, and an output layer, all with ReLU activations. Training is performed using the Adam optimizer with a learning rate of 1e-3 on an Nvidia P100 GPU (16 GB) within the PyTorch framework.¹
In the first stage, we optimize all losses except the OPE metric to stabilize the initial training process.
In the second stage, the OPE metric is gradually incorporated into the optimization until convergence
is achieved. Finally, in the third stage, we freeze the CBM weights to refine the remaining losses
while controlling variations in the OPE metric.
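A PyTorch sketch of this three-stage schedule is given below. It assumes the loss terms (interpretability, diversity, CBM prediction, OPE metric) are available as callables and that the CBM exposes its concept layers as a submodule; these names and the stage lengths are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def train_three_stage(model, losses, stage_epochs=(50, 100, 50), lr=1e-3):
    """Sketch of the staged optimization described above.

    losses: dict of zero-argument callables returning scalar tensors, with
    keys "interp", "diversity", "cbm", "ope" (names are ours).
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for stage, n_epochs in enumerate(stage_epochs, start=1):
        if stage == 3:  # stage 3: freeze CBM weights, refine remaining losses
            for p in model.concept_layers.parameters():   # assumed submodule
                p.requires_grad_(False)
        for _ in range(n_epochs):
            opt.zero_grad()
            loss = losses["interp"]() + losses["diversity"]() + losses["cbm"]()
            if stage >= 2:  # stage 2 onward: phase in the OPE metric
                loss = loss + losses["ope"]()
            loss.backward()
            opt.step()
```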
¹ The code for replicating the experiments of the paper can be found at: https://github.com/ai4ai-lab/Concept-driven-OPE
This strategy balances the learning of critical on-policy features with maintaining relevant OPE metrics, thereby enhancing concept learning and policy generalization. Despite these efforts, managing the complexity of the loss landscape remains a significant challenge, particularly in dynamic environments, and represents an important direction for future research.
G.2 KNOWN, ORACLE AND INTERVENED CONCEPTS FOR WINDYGRIDWORLD
X        Y        Known Concept  Oracle Concept  Optimized Concept
(0,4)    (0,4)    0              0               0
(4,8)    (0,4)    1              1               1
(8,12)   (0,4)    2              1               1
(12,16)  (0,4)    3              1               0
(16,20)  (0,4)    4              1               1
(0,4)    (4,8)    5              2               2
(4,8)    (4,8)    6              0               0
(8,12)   (4,8)    7              1               1
(12,16)  (4,8)    8              1               1
(16,20)  (4,8)    9              1               0
(0,4)    (8,12)   10             2               2
(4,8)    (8,12)   11             2               2
(8,12)   (8,12)   12             0               0
(12,16)  (8,12)   13             1               1
(16,20)  (8,12)   14             1               1
(0,4)    (12,16)  15             2               2
(4,8)    (12,16)  16             2               2
(8,12)   (12,16)  17             2               2
(12,16)  (12,16)  18             3               3
(16,20)  (12,16)  19             3               3
(0,4)    (16,20)  20             2               2
(4,8)    (16,20)  21             2               2
(8,12)   (16,20)  22             2               2
(12,16)  (16,20)  23             2               2
(16,20)  (16,20)  24             3               3
Table 1: WindyGridworld Concept Information
G.3 ADDITIONAL DESCRIPTION ON POLICIES FOR MIMIC-III
For the MIMIC-III dataset, it is common to generate behavior trajectories using K-nearest neighbors
(KNN) as the true on-policy trajectories are unavailable. Examples of works that generate behavior
trajectories or policies using KNNs include (Gottesman et al., 2020; Böck et al., 2022; Liu et al.,
2022; Keramati et al., 2021b; Komorowski et al., 2018b; Peine et al., 2021). In this paper, we employ
a popular variant of KNN, known as approximate nearest neighbors (ANN) search.
The advantages of ANN over traditional KNN include scalability, reduced computational cost,
efficient indexing, and support for dynamic data. These benefits allow us to generate behavior and
evaluation policies with a larger number of neighbors (200 in this study, which is double that used in
prior works employing KNN) while achieving faster inference times. Examples of papers that use
approximate nearest neighbors in medical applications include (Anagnostou et al., 2020; Gupta et al.,
2022). For readers interested in the foundational work outlining the benefits of ANN over KNN, we
refer to the seminal paper (Indyk & Motwani, 1998).
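A minimal Python sketch of a neighbors-based behavior policy estimate is shown below. For clarity it uses scikit-learn's exact k-NN rather than an ANN index; the paper's ANN search scales better but plays the same role, and the smoothing constant is our choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_behavior_policy(states, actions, n_actions=16, k=200):
    """Sketch of a neighbors-based behavior-policy estimate for MIMIC-III.

    states: (M, d) array of observed states; actions: (M,) int array of the
    actions taken in those states.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(states)

    def pi_b(state):
        _, idx = nn.kneighbors(np.asarray(state).reshape(1, -1))
        counts = np.bincount(actions[idx[0]], minlength=n_actions)
        return (counts + 1e-6) / (counts.sum() + 1e-6 * n_actions)  # smoothed

    return pi_b
```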
Figure 8: Comparison of learned concepts with state abstraction clusters: The first two subplots show the true oracle concepts and the optimized concepts obtained using the methodology described in the main paper. The third subplot illustrates the OPE performance as the number of state clusters varies, showing improvement up to K = 33 clusters, followed by a spike in MSE and then a gradual improvement. The final subplot visualizes the state clusters for K = 33, the best-performing state abstraction for OPE. These clusters lack correspondence with the true oracle concepts or the optimized concepts, highlighting that learned concepts capture more meaningful and useful information compared to state abstractions.
H ABLATION EXPERIMENTS
H.1 STATE ABSTRACTION CLUSTERING BASELINE
In this subsection, we present an ablation study to compare the performance of OPE under concept-
based representations versus state abstractions. This experiment is conducted in the Windy Gridworld
environment. For state abstractions, we apply K-means clustering on the state representations
(coordinates (x, y)) with varying values of K. The results are summarized in Figure 8.
We plot the mean squared error (MSE) of the OPE across different numbers of state abstraction clusters. Initially, the MSE decreases as the number of clusters increases, but it eventually exhibits a sudden rise followed by a downward trend as K grows further. The minimum MSE is observed at K = 33. Upon inspecting the clusters for K = 33, we find that they primarily correspond to local geographical regions, showing no alignment with meaningful features such as the distance from the goal or the wind penalty.
These clusters differ significantly from the learned concepts shown in Figure 8. Moreover, they are
neither readily interpretable nor easily amenable to intervention, highlighting the importance of using
concept-based representations for OPE.
H.2 IMPERFECT CONCEPTS BASELINE
In this ablation study, we evaluate the performance of concept-based OPE when the quality of
concepts is poor or imperfect. Using the Windy Gridworld environment as an example, we define
concepts as functions solely of the horizontal distance to the target. This approach neglects critical
information such as vertical distance to the target, wind effects, and region penalties. As a result,
these concepts violate one of the primary desiderata: diversity. By capturing only one important
concept dimension while disregarding others, these poor concepts fail to represent the full complexity
of the environment.
Figure 9 presents our results with suboptimal concepts. We observe that suboptimal concepts
exhibit inferior OPE characteristics, including higher bias, variance, and MSE, as well as lower ESS,
compared to traditional OPE estimators. This demonstrates that not all concept-based estimators
lead to improved performance; the quality of the concepts plays a crucial role, which is closely tied
to the desiderata they satisfy. Furthermore, this highlights the importance of having an algorithm
capable of learning concepts with favorable OPE characteristics, especially in scenarios involving
imperfect experts or highly complex domains where obtaining expertise is challenging. Nevertheless,
poor concepts still allow for potential interventions, as the root cause of the poor OPE characteristics
can be readily identified.
Figure 9: Imperfect concepts baseline. We consider a scenario where the concepts are just a function of the horizontal distance to the target, ignoring vital information such as the vertical distance to the target and the wind regions, and thereby lacking diversity, one of the important desiderata. We observe that the OPE performance is poor compared to traditional estimators, with higher bias, variance, and MSE, and lower ESS.
H.3 INVERSE PROPENSITY SCORES COMPARISON BETWEEN CONCEPTS AND STATE REPRESENTATIONS
In this ablation study, we compare the inverse propensity scores (IS ratios) for concepts and states in
the Windy Gridworld environment, focusing on known concepts. While the analysis is specific to this
environment, the insights generalize to other domains. From Figure 10, we observe that the IPS scores
under concepts are skewed more towards the left compared to those under states. Quantitatively,
there is a reduction of nearly 1–2 orders of magnitude in the IPS scores. This highlights that the
variance reduction achieved with concepts is directly linked to lower IPS scores, demonstrating a
better characterization under concepts compared to states.
I OPTIMIZED PARAMETERIZED CONCEPTS
Table 2: WindyGridworld: Coefficients of the human-interpretable features learnt while optimizing parameterized concepts. Here, the concept $c_t$ is a 4-dimensional vector $[c^1, c^2, c^3, c^4]$, where $c^i = w_i^T f_i$, with $f_i$ being the human-interpretable features.

                                        CIS                           CPDIS
Feature                                 c1     c2     c3     c4      c1     c2     c3     c4
f1: X-coordinate                        0.15  -0.07   0.05   0.19   -0.23   0.33  -0.03   0.03
f2: Y-coordinate                       -0.02  -0.23   0.07  -0.12   -0.22   0.25   0.02  -0.06
f3: Horizontal distance from target    -0.02   0.07  -0.10   0.00   -0.15  -0.30   0.02  -0.11
f4: Vertical distance from target       0.06  -0.26  -0.09   0.06   -0.11   0.10  -0.04  -0.21
f5: Horizontal wind                     0.05   0.12  -0.12   0.00   -0.15   0.20   0.29  -0.14
f6: Vertical wind                       0.26   0.01  -0.02   0.00   -0.18   0.06  -0.17   0.19
f7: Region penalty                      0.24   0.18  -0.25   0.15    0.23   0.01  -0.11   0.22
f8: Distance to left wall              -0.14  -0.25   0.01   0.05   -0.13   0.24   0.16   0.14
f9: Distance to right wall              0.02   0.00   0.01   0.19   -0.12  -0.28   0.06   0.16
f10: Distance to top wall              -0.01  -0.20  -0.21   0.07   -0.33  -0.05  -0.04  -0.01
f11: Distance to bottom wall           -0.16   0.07   0.22  -0.22    0.06  -0.13   0.13  -0.22
f12: Penalty of left subregion         -0.06   0.08  -0.08  -0.22   -0.07  -0.01   0.03  -0.16
f13: Penalty of right subregion        -0.03   0.02  -0.20  -0.20   -0.07  -0.18  -0.34  -0.21
f14: Penalty of top subregion           0.16   0.19  -0.08  -0.17    0.00   0.04  -0.07   0.21
f15: Penalty of bottom subregion        0.08   0.24   0.05  -0.19    0.17  -0.07  -0.12   0.21
f16: Distance to left subregion        -0.11   0.05   0.00   0.26    0.10  -0.07   0.22   0.04
f17: Distance to right subregion        0.00  -0.17   0.04   0.13    0.05  -0.13   0.06   0.11
f18: Distance to top subregion          0.07  -0.03   0.13   0.08   -0.12   0.01   0.06   0.00
f19: Distance to bottom subregion      -0.06  -0.09  -0.06  -0.01   -0.19  -0.01   0.06   0.13
Constant                               -0.06  -0.16   0.14  -0.01   -0.08  -0.01   0.00   0.13
Table 3: MIMIC: Coefficients of the human-interpretable features learnt while optimizing parameterized concepts.

                                CIS                           CPDIS
Feature                         c1     c2     c3     c4      c1     c2     c3     c4
f1: Creatinine                 -0.08  -0.24   0.19  -0.18   -0.08  -0.24   0.19  -0.18
f2: FiO2                       -0.13   0.00   0.04  -0.06   -0.13   0.00   0.04  -0.06
f3: Lactate                    -0.24  -0.02  -0.23   0.21   -0.24  -0.02  -0.23   0.21
f4: Partial Pressure of O2      0.09  -0.07  -0.06  -0.12    0.09  -0.07  -0.06  -0.12
f5: Partial Pressure of CO2    -0.21   0.16   0.19  -0.03   -0.21   0.16   0.19  -0.03
f6: Urine Output                0.06   0.07   0.06   0.22    0.06   0.07   0.06   0.22
f7: GCS Score                   0.11  -0.05  -0.01   0.15    0.11  -0.05  -0.01   0.15
f8: Calcium                     0.16  -0.20   0.06   0.16    0.16  -0.20   0.06   0.16
f9: Chloride                    0.02  -0.11  -0.04   0.14    0.02  -0.11  -0.04   0.14
f10: Glucose                    0.06  -0.10  -0.10  -0.08    0.05  -0.10  -0.10  -0.08
f11: HCO3                       0.21   0.14  -0.20  -0.22    0.20   0.14  -0.20  -0.22
f12: Magnesium                 -0.15  -0.02  -0.20   0.01   -0.15  -0.02  -0.20   0.01
f13: Potassium                  0.04   0.08   0.15  -0.26    0.04   0.08   0.15  -0.26
f14: Sodium                     0.00  -0.02   0.24   0.19    0.00  -0.02   0.24   0.19
f15: SpO2                      -0.17  -0.20  -0.06  -0.23   -0.17  -0.20  -0.06  -0.23
Figure 10: Inverse propensity score comparisons under concepts and states. Column 1 compares IPS scores between CIS and IS, while column 2 compares IPS scores between CPDIS and PDIS. Rows indicate varying numbers of trajectories. We observe that, across all trajectory samples, the distribution of IPS scores is more left-skewed under concepts than under states. This indicates that the source of variance reduction under concepts in fact lies in the lowered IPS scores.