Interpretable Deep Causal Learning for Moderation Effects
Alberto Caron¹,²  Gianluca Baio¹  Ioanna Manolopoulou¹

¹Department of Statistical Science, University College London, London, UK. ²The Alan Turing Institute, London, UK. Correspondence to: Alberto Caron <alberto.caron.19@ucl.ac.uk>.

Workshop on Interpretable ML in Healthcare at International Conference on Machine Learning (ICML). Copyright 2022 by the author(s).
Abstract
In this extended abstract paper, we address the problem of interpretability and targeted regularization in causal machine learning models. In particular, we focus on the problem of estimating individual causal/treatment effects under observed confounders, which can be controlled for and which moderate the effect of the treatment on the outcome of interest. Black-box ML models adjusted for the causal setting generally perform well in this task, but they lack interpretable output identifying the main drivers of treatment heterogeneity and their functional relationship. We propose a novel deep counterfactual learning architecture for estimating individual treatment effects that can simultaneously: i) convey targeted regularization on, and quantify uncertainty around, the quantity of interest (i.e., the Conditional Average Treatment Effect); ii) disentangle baseline prognostic and moderating effects of the covariates, and output interpretable score functions describing their relationship with the outcome. Finally, we demonstrate the use of the method via a simple simulated experiment and a real-world application.¹

¹Code for full reproducibility can be found at https://github.com/albicaron/ICNN.
1. Introduction
In the past years, there has been growing interest in applying ML methods to causal inference. Disciplines such as precision medicine and the socio-economic sciences inevitably call for highly personalized decision making when designing and deploying policies. Since in these fields real-world exploration of policies through randomized experiments is costly, in order to answer counterfactual questions such as "what would have happened if individual i undertook medical treatment A instead of treatment B",
one can rely on observational data, provided that the confounding factors can be controlled for. Black-box causal ML models proposed in many recent contributions perform remarkably well in the task of estimating individual counterfactual outcomes, but significantly lack interpretability, which is a key component in the design of personalized treatment rules. This is because they jointly model the outcome dependency on the covariates and on the treatment variable. Knowledge of the main moderating factors of a treatment can unequivocally lead to overall better policy design, as moderation effects can be leveraged to achieve higher cumulative utility when deploying the policy (e.g., by avoiding treating patients with uncertain or borderline response, better treatment allocation under budget/resource constraints, etc.). Another main issue of existing causal ML models, related to that of interpretability, is carefully designed regularization (Nie & Wager, 2020; Hahn et al., 2020; Caron et al., 2022b). Large observational studies generally include measurements on a high number of pre-treatment covariates, and disentangling prognostic² and moderating effects allows the application of targeted regularization on both, which avoids incurring unintended finite-sample bias and large variance (see Hahn et al. (2020) for a detailed discussion of Regularization Induced Confounding bias). This is useful in many scenarios where the treatment effect is believed to be a sparser and relatively less complex function of the covariates compared to the baseline prognostic effect, so it necessitates carefully tailored regularization.
1.1. Related Work
Among the most influential and recent contributions on ML regression-based techniques for individualized treatment effects learning, we particularly emphasize the work of (Johansson et al., 2016; Shalit et al., 2017; Yao et al., 2018) on deep learning models, (Alaa & van der Schaar, 2017; 2018) on Gaussian Processes, (Hahn et al., 2020; Caron et al., 2022b) on Bayesian Additive Regression Trees, and finally the literature on the more general class of Meta-Learner models (Künzel et al., 2017; Nie et al., 2020). We refer the reader to (Caron et al., 2022a) for a detailed review of the above methods.
²Prognostic effect is defined as the baseline effect of the covariates on the outcome, in the absence of treatment.
[DAG: $X \rightarrow A$, $X \rightarrow Y$, $A \rightarrow Y$]

$$X = f_X(\varepsilon_X), \qquad A = f_A(X, \varepsilon_A), \qquad Y = f_Y(X, A) + \varepsilon_Y$$

Figure 1. Causal DAG and set of structural equations describing a setting that satisfies the backdoor criterion. The underlying assumption is that conditioning on the confounders $X$ is sufficient to identify the causal effect $A \rightarrow Y$. Models generally assume a mean-zero additive error term for the outcome equation. The red arrow in the DAG represents the moderating effect of $X$ on the $A \rightarrow Y$ relationship.
In particular, we build on top of the contributions by (Nie et al., 2020; Hahn et al., 2020; Caron et al., 2022b), which have previously addressed the issue of targeted regularization in causal ML. Our work proposes a new deep architecture that can separate baseline prognostic and treatment effects and, by borrowing ideas from recent work on Neural Additive Models (NAMs) (Agarwal et al., 2021), a deep learning version of Generalized Additive Models, can output interpretable score functions describing the impact of each covariate in terms of its prognostic and treatment effects.
2. Problem Framework
In this section we briefly introduce the main notation for causal effects identification and estimation under observed confounders, utilizing the framework of Structural Causal Models (SCMs) and do-calculus (Pearl, 2009). We assume we have access to data of observational nature described by the tuple $\mathcal{D}_i = \{X_i, A_i, Y_i\} \sim p(\cdot)$, with $i \in \{1, ..., N\}$, where $X_i \in \mathcal{X}$ is a set of covariates, $A_i \in \mathcal{A}$ a binary manipulative variable, and $Y_i \in \mathbb{R}$ is the outcome. We assume then that the causal relationships between the three variables are fully described by the SCM depicted in Figure 1, both in the form of a causal DAG and a set of structural equations. A causal DAG is a graph $(\mathcal{V}, \mathcal{E})$ made of vertices and edges, where vertices represent the observational random variables, while edges represent causal functional relationships. Notice that we assume, in line with most of the literature, a zero-mean additive error structure for the outcome equation. The ultimate goal is to identify and estimate the Conditional Average Treatment Effect (CATE), defined as the effect on the outcome $Y_i$ of intervening on the manipulative variable $A_i$ by setting it equal to some value $a$ (or $do(A_i = a)$ in the do-calculus notation), conditional on the covariates $X_i$ (i.e., conditional on a patient's characteristics, etc.). In the case of binary $A_i$, CATE is defined as:

$$\text{CATE:} \quad \tau(x_i) = \mathbb{E}[Y_i \,|\, do(A_i = 1), X_i = x] - \mathbb{E}[Y_i \,|\, do(A_i = 0), X_i = x]. \quad (1)$$
In order to identify the quantity in (1) we make two standard assumptions. The first assumption is that there are no unobserved confounders (unconfoundedness), or equivalently, in Pearl's terminology, that $X_i$ satisfies the backdoor criterion. The second assumption is common support, which states that there is no deterministic selection into either of the treatment arms conditional on the covariates, or equivalently that $p(A_i = 1 \,|\, X_i = x) \in (0, 1), \; \forall i$. The latter guarantees that we could theoretically observe data points with $X_i = x$ in each of the two arms of $A$. Under these two assumptions, we can identify CATE $\tau(x_i)$ in terms of observed quantities only, replacing the do-operator in (1) with the factual $A_i$, by conditioning on $X_i$:

$$\mathbb{E}[Y_i \,|\, do(A_i = a), X_i = x] = \mathbb{E}[Y_i \,|\, A_i = a, X_i = x].$$

Once CATE is identified as above, there are different ways in which it can be estimated in practice. We will briefly describe a few of them in the next section.
3. Targeted CATE estimation
Very early works in the literature on CATE estimation proposed fitting a single model $\hat{f}_Y(X_i, A_i)$ (S-Learners). The main drawback of S-Learners is that they are unable to account for any group-specific distributional difference, which becomes more relevant the stronger the selection bias is. Most of the subsequent contributions instead suggested splitting the sample into treatment subgroups and fitting separate, arm-specific models $\hat{f}_{Y_a}(x_i)$ (T-Learners). While T-Learners are able to account for distributional variation attributable to $A_i$, they are less sample efficient, prone to CATE overfitting and to regularization-induced confounding bias (Künzel et al., 2017; Hahn et al., 2020; Caron et al., 2022b). In addition, they do not produce credible intervals directly on CATE, as a CATE estimator is derived as the difference of two separate models' fits, $\hat{\tau}(x_i) = \hat{f}_1(x_i) - \hat{f}_0(x_i)$, with the induced variance being potentially very large:

$$\mathbb{V}\big[\hat{\tau}(x_i)\big] = \mathbb{V}\big[\hat{f}_1(x_i) - \hat{f}_0(x_i)\big] = \mathbb{V}\big[\hat{f}_1(x_i)\big] + \mathbb{V}\big[\hat{f}_0(x_i)\big] - 2\,\text{Cov}\big[\hat{f}_1(x_i), \hat{f}_0(x_i)\big].$$
Finally, some of the most recent additions to the literature (Hahn et al., 2020; Nie et al., 2020; Caron et al., 2022b) proposed using Robinson's (1988) additively separable re-parametrization of the outcome function, which reads:

$$\text{Robinson:} \quad Y_i = \underbrace{\mu(X_i)}_{\text{Prognostic Eff.}} + \underbrace{\tau(X_i)}_{\text{CATE}} A_i + \varepsilon_i, \quad (2)$$

where $\mu(x_i) = \mathbb{E}[Y_i \,|\, do(A_i = 0), X_i = x]$ is the prognostic effect function and $\tau(x_i)$ is the CATE function as defined in (1). We assume, like most contributions, that $\mathbb{E}(\varepsilon_i) = 0$. The distinctive trait of Robinson's parametrization is that the outcome function explicitly includes the function of interest, i.e. CATE $\tau(x_i)$, while in
the usual S- or T-Learner (and subsequent variations of these) parametrizations CATE is implicitly obtained post-estimation as $\hat{\tau}(x_i) = \hat{f}_1(x_i) - \hat{f}_0(x_i)$. This means that (2) is able to differentiate between the baseline prognostic effect $\mu(x_i)$ of the covariates (in the absence of treatment) and their moderating effects embedded in the CATE function $\tau(x_i)$. As a consequence, by utilizing (2), one can convey different degrees of regularization when estimating the two functions. This is particularly useful as CATE is usually believed to display simpler patterns than $\mu(x_i)$; so by estimating it separately, one is able to apply stronger targeted regularization.
3.1. Interpretable Causal Neural Networks
Following (Robinson, 1988), and the more recent work by (Nie et al., 2020; Hahn et al., 2020; Caron et al., 2022b), we propose a very simple deep learning architecture for interpretable and targeted CATE estimation, based on Robinson's parametrization. The architecture is made of two separable neural network blocks that respectively learn the prognostic function $\mu(x_i)$ and the CATE function $\tau(x_i)$, but are "reconnected" at the end of the pipeline to minimize a single loss function, unlike T-Learners, which instead minimize separate loss functions on $f_1(\cdot)$ and $f_0(\cdot)$. Our target loss function to minimize is generally defined as follows:

$$\text{TCNN:} \quad \min_{\mu(\cdot),\,\tau(\cdot)} \mathcal{L}_y\big(\mu(x) + \tau(x)\,a,\; y\big), \quad (3)$$

where $\mathcal{L}_y(\cdot)$ can be any standard loss function (e.g., MSE, negative log-likelihood, etc.). Through its separable block structure, the model allows the design of different NN architectures for learning $\mu(\cdot)$ and $\tau(\cdot)$ while preserving sample efficiency (i.e., avoiding sample splitting as in T-Learners), and produces uncertainty measures around CATE $\tau(\cdot)$ directly. Thus, if $\tau(\cdot)$ is believed to display simple moderating patterns as a function of $X_i$, a shallower NN structure with fewer hidden layers and units, and more aggressive regularization (e.g., a higher regularization rate or dropout probabilities), can be specified, while retaining a higher level of complexity in the $\mu(\cdot)$ block. We generally refer to this model as Targeted Causal Neural Network (TCNN) for simplicity from now onwards. Figure 2 provides a simple visual representation. While in this work we focus on binary intervention variables $A_i$ for simplicity, TCNN can be easily extended to multi-category $A_i$ by adding extra blocks to the structure in Figure 2.
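For concreteness, the separable-block structure and the joint loss in (3) can be sketched in a few lines of PyTorch. This is a minimal illustration only: the layer sizes, dropout rates, MSE choice for $\mathcal{L}_y$ and training step are assumptions, not the exact configuration used in our experiments (see the linked repository for that).

```python
import torch
import torch.nn as nn

class TCNN(nn.Module):
    """Minimal TCNN sketch: two separable blocks for mu(x) and tau(x),
    trained jointly on the single Robinson-style loss in (3).
    Layer sizes and dropout rates are illustrative placeholders."""

    def __init__(self, p, mu_hidden=(50, 50), tau_hidden=(20,), dropout=0.1):
        super().__init__()
        self.mu = self._block(p, mu_hidden, dropout)    # prognostic block (deeper)
        self.tau = self._block(p, tau_hidden, dropout)  # CATE block (shallower)

    @staticmethod
    def _block(p, hidden, dropout):
        layers, d_in = [], p
        for d_out in hidden:
            layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(dropout)]
            d_in = d_out
        layers.append(nn.Linear(d_in, 1))
        return nn.Sequential(*layers)

    def forward(self, x, a):
        mu = self.mu(x).squeeze(-1)
        tau = self.tau(x).squeeze(-1)
        return mu + tau * a, tau  # outcome prediction and CATE estimate

def train_step(model, optimizer, x, a, y):
    """One gradient step on the joint loss L_y(mu(x) + tau(x)a, y)."""
    optimizer.zero_grad()
    y_hat, _ = model(x, a)
    loss = nn.functional.mse_loss(y_hat, y)
    loss.backward()
    optimizer.step()
    return loss.item()
```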
Figure 2. Intuitive TCNN structure. The deep architecture is modelled through a sample-efficient, tailored loss function based on Robinson's parametrization.

In addition to the separable structure, and in order to guarantee a higher level of interpretability on prognostic and moderating factors, we also propose using a recently developed neural network version of Generalized Additive Models (GAMs), named Neural Additive Models (NAMs) (Agarwal et al., 2021), as the two $\mu(\cdot)$ and $\tau(\cdot)$ NN building blocks of TCNN. We refer to this particular version of TCNN as Interpretable Causal Neural Network (ICNN). Contrary to standard NNs, which fully "connect" the inputs to every node in the first hidden layer, NAMs "connect" each single input to its own NN structure and thus output input-specific score functions that fully describe the predicted relationship between each input and the outcome. NAM's score functions have an intuitive interpretation as Shapley values (Shapley, 1953): how much of an impact each input has on the final predicted outcome. The structure of the loss function (3) in ICNN thus becomes additive also in the $P$ covariate-specific $\mu_j(\cdot)$ and $\tau_j(\cdot)$ functions:

$$\text{ICNN:} \quad \min_{\mu(\cdot),\,\tau(\cdot)} \mathcal{L}_y\Big(\sum_{j=1}^{P} \mu_j(x_j) + \sum_{j=1}^{P} \tau_j(x_j)\, a,\; y\Big),$$
where the single $\mu_j(x_j)$ score function represents the Shapley value of covariate $x_j$ in terms of prognostic effect, while $\tau_j(x_j)$ is its Shapley value in terms of moderating effect. Hence, the NAM architecture in ICNN allows us to estimate the impact of each covariate as a prognostic and moderating factor, and to quantify the uncertainty around them as well. Under ICNN, the outcome function thus becomes twice additively separable, as:

$$Y_i = \sum_{j=1}^{P} \mu_j(x_{i,j}) + \sum_{j=1}^{P} \tau_j(x_{i,j})\, A_i + \varepsilon_i, \quad (4)$$

where $i \in \{1, ..., N\}$ and $j \in \{1, ..., P\}$. Naturally, the downside of NAMs is that they might miss out on interaction terms among the covariates. These could possibly be constructed and added manually as additional inputs, although this is neither particularly convenient nor computationally ideal.
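A minimal sketch of a NAM-style building block follows, under the same caveats as before (subnetwork sizes and dropout rates are placeholders). Plugging two such blocks into the TCNN sketch above, one for $\mu(\cdot)$ and one for $\tau(\cdot)$, while keeping the joint loss, would yield the ICNN structure, with the per-covariate scores playing the role of the $\mu_j(x_j)$ and $\tau_j(x_j)$ functions.

```python
import torch
import torch.nn as nn

class NAMBlock(nn.Module):
    """Sketch of a Neural Additive Model block: one small subnetwork per
    covariate, summed into a single additive output. Sizes are
    illustrative placeholders, not the exact experimental setup."""

    def __init__(self, p, hidden=(20, 20), dropout=0.1):
        super().__init__()
        self.feature_nets = nn.ModuleList(
            [self._subnet(hidden, dropout) for _ in range(p)]
        )

    @staticmethod
    def _subnet(hidden, dropout):
        layers, d_in = [], 1  # each subnetwork sees a single covariate
        for d_out in hidden:
            layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(dropout)]
            d_in = d_out
        layers.append(nn.Linear(d_in, 1))
        return nn.Sequential(*layers)

    def forward(self, x):
        # Per-covariate score functions f_j(x_j); shape (n, p)
        scores = torch.cat(
            [net(x[:, j:j + 1]) for j, net in enumerate(self.feature_nets)],
            dim=1,
        )
        return scores.sum(dim=1), scores  # additive prediction + score functions
```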
3.2. Links to Previous Work
We conclude the section by highlighting similarities and differences between TCNN (and ICNN) and other popular methods employing Robinson's parametrization. Unlike the R-Learner (Nie et al., 2020), TCNN is not a multi-step plug-in (and cross-validated) estimator and does not envisage the use of the propensity score. Instead, similarly to Bayesian Causal Forest (BCF) (Hahn et al., 2020; Caron et al., 2022b), estimation in TCNN is carried out in a single, more sample-efficient step, although BCF is inherently Bayesian and relatively computationally intensive. To obtain better coverage properties in terms of uncertainty quantification in both TCNN and ICNN, we implement the MC dropout technique (Gal & Ghahramani, 2016) in both the $\mu(\cdot)$ and $\tau(\cdot)$ blocks to perform approximate Bayesian inference; that is, we re-sample multiple times from the NN model with dropout layers to build an approximate posterior predictive distribution. This produces credible intervals around CATE estimates $\tau(\cdot)$ in a very straightforward way, and, in ICNN specifically, credible intervals around each input's score function, as we will show in the experimental section.
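As an illustration of the resampling step, a hypothetical helper for MC dropout credible bands around $\tau(\cdot)$ might read as follows (the number of samples and the 95% level are arbitrary choices, and `model.tau` refers to the CATE block of the TCNN sketch above):

```python
import torch

def mc_dropout_cate(model, x, n_samples=200):
    """Approximate posterior draws of tau(x) via MC dropout: keep dropout
    active at prediction time and re-sample the network output. Assumes
    `model.tau` maps x to an (n, 1) tensor, as in the TCNN sketch."""
    model.train()  # train mode keeps dropout layers stochastic
    with torch.no_grad():
        draws = torch.stack(
            [model.tau(x).squeeze(-1) for _ in range(n_samples)]
        )
    lo = draws.quantile(0.025, dim=0)
    hi = draws.quantile(0.975, dim=0)
    return draws.mean(dim=0), lo, hi  # point estimate and 95% credible band
```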
4. Experiments
We hereby present results from a simple simulated experiment on CATE estimation, to compare the performance of TCNN and ICNN against some state-of-the-art methods. In addition, we demonstrate how ICNN with MC dropout in particular can be employed to produce highly interpretable score function measures, fully describing the estimated moderating effects of the covariates $x_i$ in $\tau(\cdot)$, and the uncertainty around them. For performance comparison we rely on the root Precision in Estimating Heterogeneous Treatment Effects (PEHE) metric (Hill, 2011), defined as:

$$\sqrt{\text{PEHE}_\tau} = \sqrt{\mathbb{E}\big[(\tau_i(x_i) - \hat{\tau}_i(x_i))^2\big]}, \quad (5)$$

and the list of models we compare includes: the S-Learner version of NNs (S-NN); the T-Learner version of NNs (T-NN); Causal Forest (Wager & Athey, 2018), a particular type of R-Learner (R-CF); a "unique-block", fully connected NN that uses Robinson's parametrization, minimizing the loss function in (3) (R-NN); a "unique-block" NAM, again minimizing the loss function in (3) (R-NAM); our TCNN with fully connected NN blocks; and ICNN. S-NN, T-NN and R-NN all feature two [50, 50] hidden layers. R-NAM features two [20, 20] hidden layers for each input. TCNN features two [50, 50] hidden layers in the $\mu(\cdot)$ block, and one [20] hidden layer in the $\tau(\cdot)$ block. ICNN features two [20, 20] hidden layers in the $\mu(\cdot)$ block, and one [50] hidden layer in the $\tau(\cdot)$ block, for each input.
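In simulation, where the true CATE is known, (5) reduces to an RMSE over individual treatment effects; a short NumPy version for reference:

```python
import numpy as np

def root_pehe(tau_true, tau_hat):
    """Root-PEHE as in (5): RMSE between true and estimated individual
    treatment effects (computable only when the true CATE is known)."""
    return np.sqrt(np.mean((np.asarray(tau_true) - np.asarray(tau_hat)) ** 2))
```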
We simulate $N = 2000$ data points on $P = 10$ correlated covariates, with binary $A_i$ and continuous $Y_i$. The experiment was run for $B = 100$ replications, and average 70%-30% train-test set $\text{PEHE}_\tau$ results, plus 95% Monte Carlo errors, can be found in Table 1. The full description of the data generating process utilized for this simulated experiment can be found in Appendix Section A.

Table 1. Performance on simulated experiment, measured as 70%-30% train-test set PEHE_τ. Bold indicates better performance.

Model  | Train PEHE_τ   | Test PEHE_τ
S-NN   | 1.046 ± 0.007  | 1.076 ± 0.007
T-NN   | 1.021 ± 0.002  | 1.074 ± 0.002
R-CF   | 1.467 ± 0.002  | 1.494 ± 0.002
R-NN   | 0.706 ± 0.003  | 0.712 ± 0.003
R-NAM  | 0.787 ± 0.002  | 0.787 ± 0.002
TCNN   | 0.361 ± 0.001  | 0.362 ± 0.001
ICNN   | 0.328 ± 0.001  | 0.331 ± 0.001

NN models minimizing the Robinson loss function in (3) perform considerably better than the S- and T-Learner baselines on this particular example, especially TCNN and ICNN, which present the additional advantage of conveying targeted regularization. Considering the ICNN model only, we can then access the score functions of the $\tau(\cdot)$ NAM block, which describe the moderating effects of the covariates $x_i$. In particular, in Figure 3 we plot the score function of the first covariate $X_{i,1}$ on CATE $\tau(\cdot)$, plus the approximate Bayesian credible intervals generated through MC dropout resampling (Gal & Ghahramani, 2016). In this specific simulated example, the CATE function is generated as $\tau(x_i) = 3 + 0.8 X_{i,1}^2$. So only $X_{i,1}$, out of all $P = 10$ covariates, drives the simple heterogeneity patterns in treatment response across individuals, in a quadratic form. As Figure 3 shows, ICNN is able in this example to learn a score function that very closely approximates the underlying true relationship $0.8 X_{i,1}^2$, and quantifies the uncertainty around it. Naturally, in a different simulated setup with strong interaction terms among the covariates, the performance of ICNN would inevitably deteriorate compared to the other versions of NN and the other models considered here. Thus, performance and interpretability in this type of scenario would certainly constitute a trade-off.

Figure 3. Score function output from the ICNN model relative to covariate $X_1$, depicting its moderating effect on CATE, plus MC dropout generated credible intervals.
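A Figure 3-style plot can be produced by combining the two helpers sketched above; the snippet below is a hypothetical illustration (it assumes `model.tau` is a `NAMBlock` whose forward pass returns the pair of additive output and per-covariate scores):

```python
import torch
import matplotlib.pyplot as plt

def plot_score_function(model, x, j, n_samples=200):
    """Plot the estimated moderating score tau_j(x_j) with MC dropout
    bands, as in Figure 3. Illustrative sketch only."""
    model.train()  # keep dropout active for MC resampling
    with torch.no_grad():
        draws = torch.stack(
            [model.tau(x)[1][:, j] for _ in range(n_samples)]
        )
    order = torch.argsort(x[:, j])
    xj = x[order, j].numpy()
    mean = draws.mean(dim=0)[order].numpy()
    lo = draws.quantile(0.025, dim=0)[order].numpy()
    hi = draws.quantile(0.975, dim=0)[order].numpy()
    plt.plot(xj, mean, label=f"tau_{j}(x_{j})")
    plt.fill_between(xj, lo, hi, alpha=0.3, label="95% band")
    plt.xlabel(f"x_{j}")
    plt.legend()
    plt.show()
```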
Figure 4. Score functions (or Shapley values) with associated MC dropout bands describing moderation effects of each covariate on estimated CATE: $\tau_j(x_j), \; j \in \{1, ..., P\}$.
4.1. Real-World Example: the ACTG-175 data
Finally, we briefly demonstrate the use of ICNN on a real-world example. Although the focus of the paper so far has been on observational studies, we here analyze data from a randomized experiment, to show that the methods introduced naturally extend to this setting as well, with the non-negligible additional benefit that both the unconfoundedness and common support assumptions hold by construction (i.e., no "causal" arrow going from $X \rightarrow A$ in the Figure 1 DAG). The data we use are taken from the ACTG-175 study, a randomized controlled trial comparing standard mono-therapy against a combination of therapies in the treatment of HIV-1-infected patients with CD4 cell counts between 200 and 500. Details of the design can be found in the original contribution by Hammer et al. (1996). The dataset features $N = 2139$ observations and $P = 12$ covariates $X$ (which are listed in the appendix section below), a binary treatment $A$ (mono-therapy vs multi-therapy) and a continuous outcome $Y$ (the difference in CD4 cell counts between baseline and 20 ± 5 weeks after undertaking the treatment; this is done in order to take into account any individual unobserved time pattern in the CD4 cell count). The aim is to investigate the moderating effects of the covariates in terms of heterogeneity of treatment across patients. In order to do so, we run ICNN and obtain the estimated score functions, together with approximate Bayesian MC dropout bands, for each covariate $X_j$, and we report these in Figure 4. The results generally suggest a good degree of treatment heterogeneity, with most of the covariates playing a significant moderating role.
5. Conclusion
In this extended abstract paper, we have addressed the issue of interpretability and targeted regularization in causal machine learning models for the estimation of heterogeneous/individual treatment effects. In particular, we have proposed a novel deep learning architecture (TCNN) that is able to convey regularization and quantify uncertainty when learning the CATE function, and, in its interpretable version (ICNN), to output interpretable score functions describing the estimated prognostic and moderating effects of the covariates $X_i$. We have benchmarked TCNN and ICNN by comparing their performance against some of the popular methods for CATE estimation on a simple simulated experiment, where we have also illustrated how score functions are very intuitive and interpretable measures for moderation effects analysis. Finally, we have demonstrated the use of ICNN on a real-world dataset based on the ACTG-175 study (Hammer et al., 1996).
References
Abdar, M., Pourpanah, F., Hussain, S., Rezazadegan, D.,
Liu, L., Ghavamzadeh, M., Fieguth, P., Cao, X., Khosravi,
A., Acharya, U. R., Makarenkov, V., and Nahavandi, S.
A review of uncertainty quantification in deep learning:
Techniques, applications and challenges. Information
Fusion, 76:243–297, 2021.
Agarwal, R., Melnick, L., Frosst, N., Zhang, X., Lengerich,
B., Caruana, R., and Hinton, G. E. Neural additive mod-
els: Interpretable machine learning with neural nets. In
Proceedings of the 35th International Conference on Neu-
ral Information Processing Systems, volume 34, pp. 4699–
4711, 2021.
Alaa, A. and van der Schaar, M. Limits of estimating het-
erogeneous treatment effects: Guidelines for practical
algorithm design. In Proceedings of the 35th Interna-
tional Conference on Machine Learning, pp. 129–138,
2018.
Alaa, A. M. and van der Schaar, M. Bayesian inference of
individualized treatment effects using multi-task Gaus-
sian Processes. In Proceedings of the 31st International
Conference on Neural Information Processing Systems,
NIPS’17, pp. 3427–3435, 2017.
Athey, S. and Wager, S. Policy Learning With Observational
Data. Econometrica, 89(1):133–161, January 2021.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,
D. Weight uncertainty in neural networks. In Proceed-
ings of the 32nd International Conference on Interna-
tional Conference on Machine Learning - Volume 37, pp.
1613–1622, 2015.
Caron, A., Baio, G., and Manolopoulou, I. Estimating indi-
vidual treatment effects using non-parametric regression
models: A review. Journal of the Royal Statistical Society:
Series A (Statistics in Society), pp. 1–35, 2022a.
Caron, A., Baio, G., and Manolopoulou, I. Shrinkage
bayesian causal forests for heterogeneous treatment ef-
fects estimation. Journal of Computational and Graphi-
cal Statistics, pp. 1–13, 2022b.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E.,
Hansen, C., Newey, W., and Robins, J. Double/debiased
machine learning for treatment and structural parameters.
The Econometrics Journal, 21(1):C1–C68, 2018.
Chipman, H. A., George, E. I., and McCulloch, R. E. BART:
Bayesian additive regression trees. Ann. Appl. Stat., 4(1):
266–298, 03 2010.
Gal, Y. and Ghahramani, Z. Dropout as a bayesian approxi-
mation: Representing model uncertainty in deep learning.
In Proceedings of The 33rd International Conference on
Machine Learning, volume 48, pp. 1050–1059, 2016.
Hahn, P. R., Carvalho, C. M., Puelz, D., and He, J. Regular-
ization and confounding in linear regression for treatment
effect estimation. Bayesian Anal., 13(1):163–182, 03
2018.
Hahn, P. R., Murray, J. S., and Carvalho, C. M. Bayesian
Regression Tree Models for Causal Inference: Regulariza-
tion, Confounding, and Heterogeneous Effects. Bayesian
Analysis, 15(3):965–1056, 2020.
Hammer, S. M., Katzenstein, D. A., Hughes, M. D., Gun-
dacker, H., Schooley, R. T., Haubrich, R. H., Henry,
W. K., Lederman, M. M., Phair, J. P., Niu, M., Hirsch,
M. S., and Merigan, T. C. A trial comparing nucleoside
monotherapy with combination therapy in hiv-infected
adults with CD4 cell counts from 200 to 500 per cubic
millimeter. N. Engl. J. Med., 335:1081–1090, 1996.
Hartford, J., Lewis, G., Leyton-Brown, K., and Taddy, M.
Deep IV: A flexible approach for counterfactual predic-
tion. In Proceedings of the 34th International Conference
on Machine Learning, volume 70, pp. 1414–1423, 2017.
Hill, J. L. Bayesian nonparametric modeling for causal infer-
ence. Journal of Computational and Graphical Statistics,
20(1):217–240, 2011.
Hodson, R. Precision medicine. Nature, 547(7619), 2016.
Horvitz, D. G. and Thompson, D. J. A generalization of sam-
pling without replacement from a finite universe. Journal
of the American Statistical Association, 47(260):663–685,
1952.
Imbens, G. W. and Rubin, D. B. Causal Inference for Statis-
tics, Social, and Biomedical Sciences: An Introduction.
Cambridge University Press, 2015.
Johansson, F., Shalit, U., and Sontag, D. Learning represen-
tations for counterfactual inference. In Proceedings of
The 33rd International Conference on Machine Learning,
volume 48, pp. 3020–3029, 2016.
Kaddour, J., Zhu, Y., Liu, Q., Kusner, M. J., and Silva, R.
Causal effect inference for structured treatments. In Pro-
ceedings of the 35th International Conference on Neural
Information Processing Systems, volume 34, pp. 24841–
24854, 2021.
Kitagawa, T. and Tetenov, A. Who should be treated? empir-
ical welfare maximization methods for treatment choice.
Econometrica, 86(2):591–616, 2018.
Künzel, S., Sekhon, J., Bickel, P., and Yu, B. Meta-learners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116, 06 2017.
Lakshminarayanan, B., Pritzel, A., and Blundell, C. Sim-
ple and scalable predictive uncertainty estimation using
deep ensembles. In Proceedings of the 31st International
Conference on Neural Information Processing Systems,
pp. 6405–6416. Curran Associates Inc., 2017.
Nie, X. and Wager, S. Quasi-oracle estimation of heteroge-
neous treatment effects. Biometrika, 108(2):299–319, 09
2020.
Nie, X., Brunskill, E., and Wager, S. Learning when-to-treat
policies. Journal of the American Statistical Association,
0(ja):1–58, 2020.
Pearce, T., Leibfried, F., and Brintrup, A. Uncertainty in
neural networks: Approximately bayesian ensembling. In
Proceedings of the Twenty Third International Conference
on Artificial Intelligence and Statistics, volume 108, pp.
234–244, 2020.
Pearl, J. Causality: Models, Reasoning and Inference. Cam-
bridge University Press, USA, 2nd edition, 2009. ISBN
052189560X.
Peters, J., Janzing, D., and Schölkopf, B. Elements of Causal
Inference: Foundations and Learning Algorithms. The
MIT Press, 2017.
Robinson, P. M. Root-n-consistent semiparametric regres-
sion. Econometrica, 56(4):931–954, 1988.
Rubin, D. B. Bayesian inference for causal effects: The role
of randomization. Ann. Statist., 6(1):34–58, 01 1978.
Shalit, U., Johansson, F. D., and Sontag, D. Estimating
individual treatment effect: Generalization bounds and
algorithms. In Proceedings of the 34th International
Conference on Machine Learning - Volume 70, volume 70,
pp. 3076–3085, 2017.
Shapley, L. S. A value for n-person games. In Contribu-
tions to the Theory of Games II, pp. 307–317. Princeton
University Press, 1953.
Wager, S. and Athey, S. Estimation and inference of hetero-
geneous treatment effects using random forests. Journal
of the American Statistical Association, 113(523):1228–
1242, 2018.
Yao, L., Li, S., Li, Y., Huai, M., Gao, J., and Zhang, A. Rep-
resentation learning for treatment effect estimation from
observational data. In Advances in Neural Information
Processing Systems 31, pp. 2633–2643, 2018.
Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. A
robust method for estimating optimal treatment regimes.
Biometrics, 68(4):1010–1018, 2012.
A. Data Generating Process
In this appendix section we briefly describe the data generating process utilized for the simulated experiment in Section 4. We generated $N = 2000$ data points on $P = 10$ correlated covariates, of which 5 continuous and 5 binary, drawn from a Gaussian copula $C^{\text{Gauss}}_{\Theta}(u) = \Phi_{\Theta}\big(\Phi^{-1}(u_1), ..., \Phi^{-1}(u_P)\big)$, where the covariance matrix is such that $\Theta_{jk} = 0.1^{|j-k|} + 0.1\,\mathbb{I}(j = k)$. The data generating process is fully described by the following quantities:

$$\begin{aligned}
\mu(x_i) &= 6 + 0.3 \exp(X_{i,1}) + 1 \cdot X_{i,2}^2 + 1.5\,|X_{i,3}| + 0.8\,X_{i,4}, \\
\tau(x_i) &= 3 + 0.8\,X_{i,1}^2, \\
\pi(x_i) &= \Lambda\!\left(\frac{1.5 + 0.5\,X_{i,1} + \nu_i}{10}\right), \\
A_i &\sim \text{Bernoulli}\big(\pi(x_i)\big), \\
Y_i &= \mu(x_i) + \tau(x_i)\,A_i + \varepsilon_i, \quad \text{where } \varepsilon_i \sim \mathcal{N}(0, \sigma^2),
\end{aligned} \quad (6)$$

where: $\Lambda(\cdot)$ is the logistic cumulative distribution function; the error variance is $\sigma^2 = 0.5$; and $\nu_i \sim \text{Uniform}(0, 1)$. More details on the DGP and the models employed can be found at https://github.com/albicaron/ICNN, for full reproducibility.
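For illustration, a NumPy sketch of (6) follows. Where the original layout is ambiguous (the covariate marginals, which covariates are binarized, and the scaling inside the propensity), the code reflects our reading and should be treated as an assumption; the repository linked above contains the exact generator.

```python
import numpy as np
from scipy.special import expit
from scipy.stats import norm

def simulate_dgp(n=2000, p=10, sigma2=0.5, seed=0):
    """Sketch of the DGP in (6). Marginals, the continuous/binary split,
    and the propensity scaling are assumptions; see the repository."""
    rng = np.random.default_rng(seed)
    # Gaussian copula with covariance Theta_jk = 0.1^|j-k| + 0.1 * I(j=k)
    idx = np.arange(p)
    theta = 0.1 ** np.abs(idx[:, None] - idx[None, :]) + 0.1 * np.eye(p)
    z = rng.multivariate_normal(np.zeros(p), theta, size=n)
    u = norm.cdf(z / np.sqrt(np.diag(theta)))    # correlated uniforms
    x = norm.ppf(u)                              # continuous covariates
    x[:, 5:] = (u[:, 5:] > 0.5).astype(float)    # last 5 made binary
    mu = (6 + 0.3 * np.exp(x[:, 0]) + x[:, 1] ** 2
          + 1.5 * np.abs(x[:, 2]) + 0.8 * x[:, 3])
    tau = 3 + 0.8 * x[:, 0] ** 2
    nu = rng.uniform(0, 1, size=n)
    pi = expit((1.5 + 0.5 * x[:, 0] + nu) / 10)  # logistic CDF Lambda(.)
    a = rng.binomial(1, pi)
    y = mu + tau * a + rng.normal(0.0, np.sqrt(sigma2), size=n)
    return x, a, y, tau
```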
B. The ACTG-175 Trial Data
In the table below we report the description of the 12 covariates utilized in the analysis in Section 4.1.
Table 2. ACTG-175 data covariates X

Variable  | Description
age       | Numeric
wtkg      | Numeric
hemo      | Binary (hemophilia = 1)
homo      | Binary (homosexual = 1)
drugs     | Binary (intravenous drug use = 1)
oprior    | Binary (non-zidovudine antiretroviral therapy prior to initiation of study treatment = 1)
z30       | Binary (zidovudine use in the 30 days prior to treatment initiation = 1)
preanti   | Numeric (number of days of previously received antiretroviral therapy)
race      | Binary
gender    | Binary
str2      | Binary: antiretroviral history (0 = naive, 1 = experienced)
karnof hi | Binary: Karnofsky score (0 = <100, 1 = 100)
One of the main objectives of empirical analysis of experiments and quasi‐experiments is to inform policy decisions that determine the allocation of treatments to individuals with different observable covariates. We study the properties and implementation of the Empirical Welfare Maximization (EWM) method, which estimates a treatment assignment policy by maximizing the sample analog of average social welfare over a class of candidate treatment policies. The EWM approach is attractive in terms of both statistical performance and practical implementation in realistic settings of policy design. Common features of these settings include: (i) feasible treatment assignment rules are constrained exogenously for ethical, legislative, or political reasons, (ii) a policy maker wants a simple treatment assignment rule based on one or more eligibility scores in order to reduce the dimensionality of individual observable characteristics, and/or (iii) the proportion of individuals who can receive the treatment is a priori limited due to a budget or a capacity constraint. We show that when the propensity score is known, the average social welfare attained by EWM rules converges at least at n^(−1/2) rate to the maximum obtainable welfare uniformly over a minimally constrained class of data distributions, and this uniform convergence rate is minimax optimal. We examine how the uniform convergence rate depends on the richness of the class of candidate decision rules, the distribution of conditional treatment effects, and the lack of knowledge of the propensity score. We offer easily implementable algorithms for computing the EWM rule and an application using experimental data from the National JTPA Study.