Deep Learning With DAGs
Sourabh Balgi¹, Adel Daoud², Jose M. Peña¹, Geoffrey T. Wodtke³, Jesse Zhou³
¹Department of Computer and Information Science, Linköping University, Linköping, Sweden
²Institute for Analytical Sociology, Linköping University, Linköping, Sweden
³Department of Sociology, University of Chicago, Chicago, IL, USA
January 10, 2024
Abstract
Social science theories often postulate causal relationships among a set of variables or events. Although directed acyclic graphs (DAGs) are increasingly used to represent these theories, their full potential has not yet been realized in practice. As non-parametric causal models, DAGs require no assumptions about the functional form of the hypothesized relationships. Nevertheless, to simplify the task of empirical evaluation, researchers tend to invoke such assumptions anyway, even though they are typically arbitrary and do not reflect any theoretical content or prior knowledge. Moreover, functional form assumptions can engender bias, whenever they fail to accurately capture the complexity of the causal system under investigation. In this article, we introduce causal-graphical normalizing flows (cGNFs), a novel approach to causal inference that leverages deep neural networks to empirically evaluate theories represented as DAGs. Unlike conventional approaches, cGNFs model the full joint distribution of the data according to a DAG supplied by the analyst, without relying on stringent assumptions about functional form. In this way, the method allows for flexible, semi-parametric estimation of any causal estimand that can be identified from the DAG, including total effects, conditional effects, direct and indirect effects, and path-specific effects. We illustrate the method with a reanalysis of Blau and Duncan's (1967) model of status attainment and Zhou's (2019) model of conditional versus controlled mobility. To facilitate adoption, we provide open-source software together with a series of online tutorials for implementing cGNFs. The article concludes with a discussion of current limitations and directions for future development.
Direct correspondence to: Geoffrey T. Wodtke, University of Chicago, Department of Sociology, 1126 E. 59th
St., Chicago, IL, 60637; wodtke@uchicago.edu. The authors thank Steve Raudenbush, Xiang Zhou, Bernie Koch,
Kaz Yamaguchi, and participants in the conference on “New Methods to Measure Inter-generational Mobility” at the
University of Chicago for helpful comments and discussions. We used ChatGPT, version 4.0, for light copyediting,
assistance with the LaTeX code for our computational and directed acyclic graphs, and aid in debugging Python
scripts. Responsibility for all content and any potential errors in this manuscript rests solely with the authors. This
research was supported by a grant from the U.S. National Science Foundation (No. 2015613).
1 Introduction
Theories in the social sciences often posit systems of causal relationships between variables. Early models of
status attainment, for example, framed social mobility as the result of a multi-generational causal process
(Becker and Tomes, 1979; Blau and Duncan, 1967; Haller and Portes, 1973; Loury, 1981; Sewell et al.,
1970). According to these models, an individual’s social and economic standing is shaped by both ascribed
characteristics and personal achievements. Specifically, factors like parental education and occupation are
thought to influence college aspirations and attendance, which in turn shape career choices and other life
outcomes.
Recent debates among mobility researchers have centered on the role of higher education in the status
attainment process. Some argue that post-secondary education acts as a “great equalizer,” moderating the
influence of family background on career success (Hout, 1988; Torche, 2011). Others maintain that education
primarily perpetuates existing social hierarchies, with any moderating influence driven by confounding factors
like motivation or ability (Karlson and Birkelund, 2019; Zhou, 2019). Regardless of their position, all these
perspectives share a focus on causal systems—sets of interconnected variables that generate, or limit, social
mobility.
Traditionally, linear path analysis—a form of structural equation modeling—was widely used to study
these causal systems, particularly in seminal studies on the status attainment process (Alwin and Hauser,
1975; Blau and Duncan, 1967; Sewell et al., 1970). This approach represents causal relations between
variables through a set of linear and additive equations. It also allows for estimation of multiple causal
effects simultaneously, facilitating the evaluation of both simple and more complex hypotheses about the
causal system under investigation.
While linear path analysis is a powerful method, it has a key drawback: namely, the assumption of
linearity. This assumption is problematic because most theories in the social sciences do not explicitly suggest,
or even subtly imply, linear and additive relationships among variables. Moreover, the social phenomena
under study are rarely, if ever, strictly linear in reality, as complex forms of non-linearity, interaction, and
moderation are all endemic to causal systems involving humans (Abbott, 1988; Hedström and Swedberg,
1998; Lieberson, 1985). As a result, the use of linear path analysis often results in both an unfaithful
translation of theory and an inaccurate approximation of reality.
Structural equation modeling has evolved significantly since the advent of linear path analysis, offering
greater flexibility to accommodate non-linearities and interactions (Bollen, 1989; Bollen et al., 2022; Kline,
2023; Winship and Mare, 1983), but these advances come with a caveat. Most applications still compel
researchers to specify the exact functional form of the causal relationships under study. However, the true
form of these relationships is typically unknown, and in many cases, a hypothesized form cannot even be
derived from theory. Consequently, researchers often resort to arbitrary conventions or a flawed specification
search (MacCallum, 1986; Spirtes et al., 2000), resulting in a disconnect between theory and method as well
as potentially misleading inferences.
In response to these challenges, social scientists are increasingly using directed acyclic graphs (DAGs) to
depict causal systems (Elwert, 2013; Knight and Winship, 2013; Pearl, 2009). As non-parametric structural
equation models (SEMs), DAGs represent causal relationships between variables without presupposing their
functional form. This shift has been transformative for causal modeling in the social sciences. Unlike
traditional SEMs, DAGs faithfully capture the type of prior knowledge that is typically available to analysts,
while requiring no assumptions about the form of the hypothesized causal relations (Pearl, 2010).
Nevertheless, the full utility of DAGs has not yet been realized in empirical practice. When researchers
move from theoretical representation to empirical evaluation, they often dilute the advantages of DAGs by
reintroducing arbitrary assumptions about functional form in order to simplify the task of estimation or
facilitate communication of results (e.g., Wodtke et al. 2011, 2016; Wodtke and Parbst 2017). Alternatively,
when DAGs are used to guide non- or semi-parametric estimation of causal effects, researchers tend to focus
on a single or narrow set of estimands (Daoud and Dubhashi, 2023; Koch et al., 2021; Lundberg et al.,
2021). This approach allows for estimation of selected causal relationships without stringent parametric
assumptions, but most of the hypothesized causal system is not evaluated empirically, even when the data
permit a broader analysis. In this way, DAGs only guide the evaluation of isolated aspects of a causal system,
while other dimensions are overlooked or sidelined.
In this study, we present a new approach to causal inference that combines deep learning with DAGs
to flexibly model entire causal systems. Our approach, which we call a causal-Graphical Normalizing Flow
(cGNF; Balgi et al. 2022a; Javaloy et al. 2023; Wehenkel and Louppe 2021), models the full joint distribution
of the data, as factorized according to a DAG supplied by the analyst, using deep neural networks that impose
minimal functional form restrictions on the hypothesized causal relations. Once the cGNF has been learned
from data, it can be used to simulate any causal estimand identified under the DAG, including total effects,
conditional effects, direct and indirect effects, and path-specific effects, among many others. Additionally,
this approach offers a straightforward method for conducting sensitivity analyses, whenever certain estimands
may not be identified due to unobserved confounding (Balgi et al., 2022b). cGNFs thus provide a highly
versatile method for empirically evaluating theories about causal systems, without the need for restrictive
parametric assumptions.
In the sections that follow, we first offer an overview of DAGs before introducing a class of distribution
models known as normalizing flows. We begin by introducing normalizing flows for univariate distributions
to demonstrate foundational principles, and then we extend these flows to multivariate joint distributions.
Next, we show how any DAG can be modeled as a normalizing flow, and we demonstrate how this flow can
be flexibly parameterized using a special class of invertible neural networks. After a brief primer on deep
learning and how these networks are trained, we then illustrate the method by re-analyzing two seminal
studies of social mobility: Blau and Duncan’s (1967) model of status attainment and Zhou’s (2019) model of
conditional versus controlled mobility. We conclude with a discussion of current limitations and directions
for future development.
To facilitate adoption, we provide open-source software for implementing cGNFs in Python and R, tailored
for common applications in the social sciences. We also provide a series of online tutorials to help acquaint
researchers with the software, its code, and the associated workflow. All these resources are available at
https://github.com/cGNF-Dev.
2 Directed Acyclic Graphs as Non-parametric SEMs
Directed acyclic graphs (DAGs) represent causal relationships among a set of variables (Elwert, 2013; Pearl,
2009, 2010). They consist of nodes and directed edges. The nodes symbolize variables, while the directed
edges between nodes represent causal effects of arbitrary form. The orientation of the edges signals the
direction of influence from one variable to another, and the term “acyclic” specifies that the graph must not
include cycles. In other words, traversing the directed edges from any starting node should never loop back
to the point of origin.
Figure 1 displays a simple DAG with four observed variables: $V_1$, $V_2$, $V_3$, and $V_4$, collectively denoted by $\mathbf{V}$. The directed edges in the graph establish the causal connections between these variables. Specifically, they indicate that $V_1$ causes $V_2$ and $V_3$, which in turn cause $V_4$. They also show that $V_3$ is caused by $V_2$. The epsilon terms, denoted as $\{\epsilon_{V_1}, \epsilon_{V_2}, \epsilon_{V_3}, \epsilon_{V_4}\}$, represent random disturbances that account for unobserved factors influencing each observed variable.
[Figure 1 about here]

Figure 1: A Simple Example of a Directed Acyclic Graph (DAG).
Note: In this DAG, $V_1$ causes $V_2$ and $V_3$, $V_2$ causes $V_3$ and $V_4$, and $V_3$ causes $V_4$. The $\{\epsilon_{V_1}, \epsilon_{V_2}, \epsilon_{V_3}, \epsilon_{V_4}\}$ terms are random disturbances.
In a DAG, variables that are directly caused by a preceding variable are called its children. Conversely, variables that directly cause a subsequent variable are identified as its parents. To illustrate, in Figure 1, $V_3$ has two parents, $V_1$ and $V_2$, but only a single child, $V_4$.
A DAG can be interpreted as a non-parametric structural equation model (SEM), as it represents a set
of causal relationships between variables without prescribing their functional form. Thus, any DAG can be
translated into a corresponding system of assignment equations, where each child variable is determined by
an unspecified function of its parents.
Consider, for instance, the DAG in Figure 1. It can be represented using the following set of structural
equations:
$$
\begin{aligned}
V_1 &:= g_{V_1}(\epsilon_{V_1}) \\
V_2 &:= g_{V_2}(V_1, \epsilon_{V_2}) \\
V_3 &:= g_{V_3}(V_1, V_2, \epsilon_{V_3}) \\
V_4 &:= g_{V_4}(V_2, V_3, \epsilon_{V_4}),
\end{aligned} \tag{1}
$$
where the $:=$ symbol is an assignment operator used to indicate the direction of causal influence. In these equations, the functions $\{g_{V_1}, g_{V_2}, g_{V_3}, g_{V_4}\}$ do not impose any restrictions on the form of the causal relationships among variables. For example, the equation $V_3 := g_{V_3}(V_1, V_2, \epsilon_{V_3})$ only signifies that $V_3$ is determined by an unrestricted function $g_{V_3}$ of its parent variables $V_1$ and $V_2$, along with a random disturbance $\epsilon_{V_3}$ that may follow any distribution.
The non-parametric SEM in Equation (1) can also be summarized more succinctly as follows:
$$
V_i := g_{V_i}(V_i^{p}, \epsilon_{V_i}), \quad V_i \in \{V_1, V_2, V_3, V_4\}, \tag{2}
$$
where $V_i^{p}$ represents the observed parents of any given variable $V_i$ in the set $\mathbf{V} = \{V_1, V_2, V_3, V_4\}$ and $\epsilon_{V_i}$ denotes the unobserved causes affecting this variable. As before, $g_{V_i}$ is an unrestricted function, requiring no commitment to any particular form for the relationship between $V_i$ and its parents.
Under the non-parametric SEM in Equation (1), a larger and more complex probability distribution for
the observed data can be decomposed into several smaller, simpler distributions. These smaller distributions
each involve only a subset of the observed variables, and they can be pieced back together to reconstruct the
full distribution of all the variables taken together.
In general, the joint probability distribution of any $k$ variables $\mathbf{X} = \{X_1, X_2, \ldots, X_k\}$ can be decomposed into a product of conditional distributions.¹ For example, let $f_{\mathbf{X}}(x_1, \ldots, x_k)$ denote the joint probability that $X_1 = x_1$, $X_2 = x_2$, ..., and $X_k = x_k$. The product rule of joint probability allows us to order these variables arbitrarily and then decompose their joint distribution as follows:
$$
f_{\mathbf{X}}(x_1, \ldots, x_k) = f_{X_1}(x_1)\, f_{X_2|x_1}(x_2) \times \cdots \times f_{X_k|x_1 \ldots x_{k-1}}(x_k)
= f_{X_1}(x_1) \prod_{i=2}^{k} f_{X_i|x_1 \ldots x_{i-1}}(x_i), \tag{3}
$$
where $f_{X_i|x_1 \ldots x_{i-1}}(x_i)$ denotes the conditional probability that $X_i = x_i$, given the values of $X_1, \ldots, X_{i-1}$. We refer to Equation (3) as the autoregressive factorization of the joint distribution, as it involves a decomposition where each variable is conditioned upon all of its predecessors.
Each variable, however, may not be sensitive to all of its predecessors. When the observed data are
generated from a model resembling a DAG, variables are directly influenced only by their parents, and
this enables a more economical decomposition of their joint distribution. Specifically, the joint distribution
can be decomposed into a product of conditional probabilities, with each depending only on the parents
of the variable in question. For example, the non-parametric SEM in Equation (1) allows the following decomposition of the joint probability distribution for $\mathbf{V}$:
$$
f_{\mathbf{V}}(v_1, v_2, v_3, v_4) = f_{V_1}(v_1)\, f_{V_2|v_1}(v_2)\, f_{V_3|v_1 v_2}(v_3)\, f_{V_4|v_2 v_3}(v_4)
= \prod_{i=1}^{4} f_{V_i|v_i^{p}}(v_i), \quad V_i \in \{V_1, V_2, V_3, V_4\}, \tag{4}
$$
where $f_{V_i|v_i^{p}}(v_i)$ denotes the conditional probability that a variable $V_i$ takes the value $v_i$, given the values of its parents $v_i^{p}$. This decomposition is referred to as the Markov factorization of the joint probability distribution for the observed data (Pearl, 2009).
While an observational distribution, such as Equation (4), describes the likelihood of seeing different
values in the data under existing conditions, causal inference involves interventional distributions, which
capture how these probabilities would change if certain variables were externally manipulated.
Interventions are represented in DAGs by "mutilating" them. This involves removing incoming edges to the variable or variables being manipulated, and then setting these nodes to fixed values for deterministic interventions, or assigning them values drawn from a prescribed distribution in the case of stochastic interventions. To illustrate, consider a deterministic intervention where the variable $V_3$ is set to the value $v_3^*$ for everyone. Figure 2 displays the mutilated version of our original DAG corresponding to this intervention. Here, the edges from $V_1$, $V_2$, and $\epsilon_{V_3}$ into $V_3$ have been deleted, and $V_3$ has been assigned the value $v_3^*$.
¹ We denote an arbitrary set of $k$ variables as $\mathbf{X} = \{X_1, X_2, \ldots, X_k\}$, while $\mathbf{V} = \{V_1, V_2, \ldots, V_k\}$ represents a set of variables with a defined causal structure.
[Figure 2 about here]

Figure 2: A Mutilated Directed Acyclic Graph (DAG).
Note: In this mutilated DAG, $V_4(v_3^*)$ represents the potential outcome of $V_4$ under an intervention that sets $V_3$ equal to $v_3^*$.
With a mutilated DAG, the corresponding set of non-parametric structural equations also takes a modified
form. Specifically, these equations can now be expressed as follows:
$$
\begin{aligned}
V_1 &:= g_{V_1}(\epsilon_{V_1}) \\
V_2 &:= g_{V_2}(V_1, \epsilon_{V_2}) \\
V_3 &:= v_3^* \\
V_4(v_3^*) &:= g_{V_4}(V_2, v_3^*, \epsilon_{V_4}),
\end{aligned} \tag{5}
$$
where $V_4(v_3^*)$ denotes the potential outcome of $V_4$ when $V_3$ is set to $v_3^*$.
By extension, the interventional joint distribution resulting from this manipulation is given as follows:
$$
f_{\mathbf{V}(v_3^*)}(v_1, v_2, v_3^*, v_4) = f_{V_1}(v_1)\, f_{V_2|v_1}(v_2)\, f_{V_4|v_2 v_3^*}(v_4). \tag{6}
$$
This expression is obtained by removing the conditional probability $f_{V_3|v_1 v_2}(v_3)$ for the manipulated variable $V_3$ from the Markov factorization of the joint distribution. The remaining probabilities are then conditioned on $V_3 = v_3^*$ wherever this variable appears as a parent. Known as a truncated Markov factorization (Pearl, 2009), the resulting distribution, denoted by $f_{\mathbf{V}(v_3^*)}$, describes the joint probability of different outcomes for each variable under an intervention that exposes everyone to $v_3^*$.
With interventional distributions corresponding to different manipulations on the variable $V_3$, we can recover a variety of causal estimands, provided that they are identified from the observed data. For example, under the DAG outlined previously, the average total effect of $V_3$ on $V_4$ is given by the following expression:
$$
\mathrm{ATE}_{V_3 \to V_4} = E[V_4(v_3^*) - V_4(v_3)]
= \sum_{v_1} \sum_{v_2} \sum_{v_4} v_4 \left[ f_{\mathbf{V}(v_3^*)}(v_1, v_2, v_3^*, v_4) - f_{\mathbf{V}(v_3)}(v_1, v_2, v_3, v_4) \right]. \tag{7}
$$
In this equation, $f_{\mathbf{V}(v_3^*)}$ is the interventional distribution after setting $V_3$ to $v_3^*$, while $f_{\mathbf{V}(v_3)}$ is the interventional distribution when $V_3$ is set to a different value $v_3$.²
² If these variables were strictly continuous, the probability-weighted sum in Equation (7) would just be replaced with a density-weighted integral.
A similar procedure can be used to recover other estimands identified from the DAG, such as the $\mathrm{ATE}_{V_2 \to V_4}$ or the $\mathrm{ATE}_{V_2 \to V_3}$, among a variety of other possibilities. In each case, the appropriate outcome variable is averaged over the relevant interventional distributions, and then the resulting averages are compared.
Another approach to obtaining these estimands involves Monte Carlo sampling from the relevant interventional distributions, and then averaging these samples together. For example, the average total effect of $V_3$ on $V_4$ can also be formulated as follows:
$$
\mathrm{ATE}_{V_3 \to V_4} = E[V_4(v_3^*) - V_4(v_3)]
= \lim_{J \to \infty} \frac{1}{J} \sum_{j=1}^{J} \left[ \tilde{V}_4^{j}(v_3^*) - \tilde{V}_4^{j}(v_3) \right], \tag{8}
$$
where $\tilde{V}_4^{j}(v_3^*)$ and $\tilde{V}_4^{j}(v_3)$ denote Monte Carlo samples drawn from interventional distributions with $V_3$ set to $v_3^*$ and $v_3$, respectively. As the total number of samples $J$ approaches infinity, the target estimand is recovered exactly.
In sum, the Markov factorization of the joint distribution, together with the interventional distributions obtained by truncating it in different ways, provides access to all the causal estimands that can be non-parametrically identified from a given DAG. If we had a distribution model based on the Markov factorization,
we could then construct interventional distributions and draw Monte Carlo samples from them in order to
quantify all these effects. In the next section, we introduce a class of distribution models known as normalizing
flows, which are well-suited to this end.
3 An Introduction to Normalizing Flows
A normalizing flow is an invertible transformation designed to map one variable–or a set of variables–onto
another, which follows a standard normal distribution (Kobyzev et al., 2020; Papamakarios et al., 2021;
Rezende and Mohamed, 2015; Tabak and Vanden-Eijnden, 2010; Tabak and Turner, 2013). This mapping
can then be used to simulate new data from either an observational or interventional distribution. This is
achieved by drawing Monte Carlo samples from the standard normal distribution and then transforming
these samples via the inverse of the flow.
3.1 Univariate Normalizing Flows
For a single variable $X_1$ with a probability distribution $f_{X_1}$, a normalizing flow can be formally defined as follows:
$$
Z_1 = h(X_1) \sim \mathcal{N}(0, 1), \tag{9}
$$
where $h$ denotes a function or composition of multiple functions that map $X_1$ onto another variable $Z_1$ equipped with the standard normal distribution. Essentially, normalizing flows are just a transformation of one variable with an arbitrary distribution into another that is normally distributed with zero mean and unit variance.
Normalizing flows accommodate discrete variables by first dequantizing them (Balgi et al., 2022a; Uria
et al., 2013; Ziegler and Rush, 2019). Dequantization turns integers, like zeros and ones, into continuous
values by adding a small amount of random noise, usually drawn from a uniform or normal distribution
with minuscule variance. This recasts discrete variables as continuous, making them suitable for subsequent normalizing transformations. Conversely, discrete variables can be easily restored by rounding their
dequantized values to the nearest integer. Thus, dequantization enables normalizing flows to map any type
of variable–binary, ordinal, polytomous, and so on–to the standard normal distribution. In Part A.1 of the
Appendix, we provide additional details on the use of dequantization with normalizing flows for discrete
data.
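As a concrete illustration of this step, the short sketch below dequantizes a toy ordinal variable with small uniform noise and then recovers the original integers by rounding. It is only a minimal stand-in and may differ from how the cGNF software implements dequantization.

```python
# A minimal sketch of dequantization with small, centered uniform noise.
import numpy as np

rng = np.random.default_rng(0)
v_discrete = rng.integers(0, 3, size=10)                  # an ordinal variable with values 0, 1, 2

noise = rng.uniform(-0.49, 0.49, size=v_discrete.shape)   # small random noise
v_dequantized = v_discrete + noise                        # now effectively continuous

v_restored = np.round(v_dequantized).astype(int)          # rounding recovers the original integers
assert np.array_equal(v_discrete, v_restored)
```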
In general, the functions that compose a normalizing flow $h$ may assume any form, provided they are bijective–that is, as long as they map each input to a single unique output. This constraint ensures that the standard normal variable $Z_1$ can be mapped back to the original variable $X_1$ by applying the inverse of the flow:
$$
X_1 = h^{-1}(Z_1) \sim f_{X_1}. \tag{10}
$$
It also allows for convenient Monte Carlo sampling from the arbitrary distribution $f_{X_1}$. This is accomplished by initially drawing Monte Carlo samples from the standard normal distribution and then transforming these samples using the inverse of the flow.
In addition, the transformation $h$ encodes a particular form for the distribution of $X_1$. This form is given by the change of variables formula and can be expressed as follows:
$$
f_{X_1}(x_1) = f_{Z_1}(h(x_1)) \left| \frac{\partial h}{\partial x_1} \right|. \tag{11}
$$
In this expression, $f_{Z_1}$ represents the standard normal distribution, while $\left| \frac{\partial h}{\partial x_1} \right|$ is the absolute value of the derivative of the normalizing flow with respect to $x_1$. This latter term captures the extent to which the transformation $h$ modifies $X_1$ so that it conforms to a standard normal distribution.
To illustrate, consider a simple and familiar example. Suppose that $X_1$ were normally distributed with mean $\mu$ and variance $\sigma^2$. In this case, a normalizing flow for $X_1$ could be constructed using a single linear transformation:
$$
Z_1 = h(X_1) = \frac{X_1 - \mu}{\sigma} \sim \mathcal{N}(0, 1). \tag{12}
$$
Conversely, inverting the flow maps the standard normal variable $Z_1$ back to $X_1$ as follows:
$$
X_1 = h^{-1}(Z_1) = \sigma Z_1 + \mu \sim \mathcal{N}(\mu, \sigma^2). \tag{13}
$$
This flow just mirrors the common practice of standardizing a normally distributed variable to compute z-scores. Here, a simple linear transformation converts a variable with an arbitrary normal distribution into another variable that adheres to the standard normal distribution.
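The sketch below implements this linear flow with arbitrary illustrative values of $\mu$ and $\sigma$, confirming that the forward transformation standardizes the variable, that the inverse recovers it, and that new samples from $f_{X_1}$ can be drawn by transforming standard normal draws with the inverse.

```python
# A minimal sketch of the linear "z-score" flow in Equations (12) and (13).
import numpy as np

mu, sigma = 5.0, 2.0                              # illustrative values only
rng = np.random.default_rng(0)

x = rng.normal(mu, sigma, size=100_000)           # X1 ~ N(mu, sigma^2)

z = (x - mu) / sigma                              # forward flow, Equation (12)
x_back = sigma * z + mu                           # inverse flow, Equation (13)

print(round(z.mean(), 3), round(z.std(), 3))      # approximately 0 and 1
print(np.allclose(x, x_back))                     # True: the flow is invertible

z_new = rng.standard_normal(100_000)              # Monte Carlo sampling via the inverse flow
x_new = sigma * z_new + mu                        # draws from N(mu, sigma^2)
```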
Although this example serves as a useful illustration, it is contrived because the distribution of $X_1$ has a known parametric form, $f_{X_1} = \mathcal{N}(\mu, \sigma^2)$. Whenever this distribution is known, a normalizing flow can be easily derived, where $X_1$ and $Z_1$ can be mapped back and forth from one to another using relatively simple analytic expressions. In practice, however, the distribution $f_{X_1}$ and the normalizing flow $h$ are typically unknown.³ In this situation, these functions must be inferred by fitting a highly expressive model, such as a deep neural network, to the available data. We address this challenge in the subsequent section, after extending normalizing flows to multivariate distributions.
3.2 Multivariate Normalizing Flows
For a set of random variables $\mathbf{X} = \{X_1, \ldots, X_k\}$ with a joint probability distribution $f_{\mathbf{X}}$, a multivariate normalizing flow can be formally defined as follows:
$$
\mathbf{Z} = h(\mathbf{X}) \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \tag{14}
$$
Similar to the univariate case, $h$ represents a composition of bijective functions that transform the set of variables $\mathbf{X}$ into another set $\mathbf{Z} = \{Z_1, \ldots, Z_k\}$, which follows a multivariate standard normal distribution. This distribution is denoted as $\mathcal{N}(\mathbf{0}, \mathbf{I})$, where $\mathbf{I}$ is the identity matrix, indicating that all elements of $\mathbf{Z}$ are uncorrelated by construction.
The set of variables $\mathbf{Z}$ can be mapped back to $\mathbf{X}$ using the inverse of the flow as follows:
$$
\mathbf{X} = h^{-1}(\mathbf{Z}) \sim f_{\mathbf{X}}. \tag{15}
$$
This allows for convenient Monte Carlo sampling–now from the joint distribution $f_{\mathbf{X}}$–where samples are initially drawn from $\mathcal{N}(\mathbf{0}, \mathbf{I})$ and then transformed with $h^{-1}$.
The form of the joint distribution for $\mathbf{X}$ can also be expressed using the change of variables formula. Specifically, in the multivariate setting, this distribution can be expressed as follows:
$$
f_{\mathbf{X}}(\mathbf{x}) = f_{\mathbf{Z}}(h(\mathbf{x})) \left| \det J_{h(\mathbf{x})} \right|, \tag{16}
$$
where $f_{\mathbf{Z}}$ is the multivariate standard normal distribution and $J_{h(\mathbf{x})}$ is the Jacobian matrix associated with $h(\mathbf{x})$. This matrix contains all the partial first derivatives of the transformation $h(\mathbf{x})$ with respect to each component of $\mathbf{x}$. The term $\left| \det J_{h(\mathbf{x})} \right|$ represents the absolute value of its determinant. Conceptually, this determinant captures how the transformation $h$ modifies each element of $\mathbf{x}$ to map these variables onto another set that follows a multivariate standard normal distribution.
Although the transformations that compose $h$ can take any form as long as they are bijective, we focus on a subclass of multivariate flows with an autoregressive structure (Bengio and Bengio, 1999; Frey, 1998; Kingma et al., 2016; Kobyzev et al., 2020). These flows can be formally represented as follows:
$$
\mathbf{Z} = h(\mathbf{X}) = \{ h_1(X_1; c_1), \ldots, h_i(X_i; c_i(X_1, \ldots, X_{i-1})), \ldots, h_k(X_k; c_k(X_1, \ldots, X_{k-1})) \}. \tag{17}
$$
In this expression, $c_i$ is a function of the first $i-1$ variables in $\mathbf{X}$, known as a conditioner, while each transformation $h_i$, here and henceforth referred to as a normalizer, is a function of its conditioner and the variable $X_i$. In substantive terms, an autoregressive flow like Equation (17) orders the elements of $\mathbf{X}$ arbitrarily and then transforms each variable $X_i$ into a new variable $Z_i$ that follows a standard normal distribution, conditional on its predecessors $X_1, \ldots, X_{i-1}$.⁴ Each conditioner $c_i$ determines the location of the distribution for $X_i$ as a function of $X_1, \ldots, X_{i-1}$, and each normalizer $h_i$ adjusts the shape of this distribution, based on its location given by the conditioner, to follow a standard normal curve.⁵

³ In general, the function $h$ can be conceptualized as the composition of the inverse of the standard normal cumulative distribution function (CDF) and an arbitrary CDF for the variable of interest, $X_1$, provided that it is smooth with a finite first derivative. The composition of the inverse normal CDF with the CDF for $X_1$ yields a normalizing transformation because $F_{X_1}(X_1) = U_1 \sim \mathrm{Uniform}(0, 1)$ and $F_Z^{-1}(U_1) = Z_1 \sim \mathcal{N}(0, 1)$, where $F_{X_1}$ denotes the CDF of $X_1$ and $F_Z$ denotes the standard normal CDF. If $X_1$ is discrete and thus $F_{X_1}$ is not smooth, the variable can be dequantized using a smooth distribution for the added random noise, as detailed in Part A.1 of the Appendix.
The joint distribution for $\mathbf{X}$ that follows from an autoregressive flow and the change of variables formula is given by:
$$
\begin{aligned}
f_{\mathbf{X}}(\mathbf{x}) &= f_{\mathbf{Z}}(h(\mathbf{x})) \left| \det J_{h(\mathbf{x})} \right| \\
&= f_Z(h_1(x_1; c_1)) \prod_{i=2}^{k} f_Z(h_i(x_i; c_i(x_1, \ldots, x_{i-1}))) \left| \det J_{h(\mathbf{x})} \right| \\
&= f_Z(h_1(x_1; c_1)) \left| \frac{\partial h_1}{\partial x_1} \right| \prod_{i=2}^{k} f_Z(h_i(x_i; c_i(x_1, \ldots, x_{i-1}))) \left| \frac{\partial h_i}{\partial x_i} \right|,
\end{aligned} \tag{18}
$$
where $f_Z$ represents the univariate standard normal distribution. In this expression, the second equality comes from factorizing the joint distribution of $\mathbf{Z}$ as $f_{\mathbf{Z}}(\mathbf{z}) = \prod_{i=1}^{k} f_Z(z_i)$ using the product rule for independent and identically distributed variables. The final equality arises from the triangular structure of the Jacobian matrix associated with an autoregressive flow. With a triangular Jacobian matrix, its determinant is the product of its diagonal elements–that is, $\left| \det J_{h(\mathbf{x})} \right| = \prod_{i=1}^{k} \left| \frac{\partial h_i}{\partial x_i} \right|$.
The expression given by Equation (18) closely resembles the joint distribution of $\mathbf{X}$ as factorized in Equation (3) from the previous section. This is because the term $f_Z(h_1(x_1; c_1)) \left| \frac{\partial h_1}{\partial x_1} \right|$ encodes the marginal probability $f_{X_1}(x_1)$, while each component of the product, denoted by $f_Z(h_i(x_i; c_i(x_1, \ldots, x_{i-1}))) \left| \frac{\partial h_i}{\partial x_i} \right|$ for $i = 2, \ldots, k$, encodes the conditional probability $f_{X_i|x_1 \ldots x_{i-1}}(x_i)$. Thus, autoregressive flows are built upon the autoregressive factorization of the joint distribution $f_{\mathbf{X}}(\mathbf{x})$, which does not rely on any independence restrictions among the elements of $\mathbf{X}$ nor any predefined ordering of these variables.
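As a concrete check on Equation (18), the sketch below constructs a two-variable autoregressive flow with linear conditioners and Gaussian normalizers (all parameter values invented for illustration) and verifies that the density implied by the change of variables formula matches the corresponding bivariate normal density computed directly.

```python
# A minimal check of Equation (18) for a two-variable linear-Gaussian autoregressive flow.
import numpy as np
from scipy.stats import norm, multivariate_normal

mu1, s1 = 1.0, 2.0          # X1 ~ N(mu1, s1^2)
a, b, s2 = 0.5, 1.5, 0.8    # X2 | X1 ~ N(a + b*X1, s2^2)

def flow_density(x1, x2):
    """Joint density from the autoregressive flow via the change of variables formula."""
    z1 = (x1 - mu1) / s1                # h1(x1; c1), with derivative 1/s1
    z2 = (x2 - (a + b * x1)) / s2       # h2(x2; c2(x1)), with derivative 1/s2
    return norm.pdf(z1) * (1 / s1) * norm.pdf(z2) * (1 / s2)

# The same joint distribution written directly as a bivariate normal.
mean = [mu1, a + b * mu1]
cov = [[s1**2, b * s1**2],
       [b * s1**2, b**2 * s1**2 + s2**2]]
direct = multivariate_normal(mean, cov)

x1, x2 = 0.7, 2.3
print(np.isclose(flow_density(x1, x2), direct.pdf([x1, x2])))   # True
```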
What if we constructed a normalizing flow similar to Equation (17), but built upon the Markov factoriza-
tion of the joint distribution, as given by an assumed DAG, instead of the autoregressive factorization? This
approach would enable convenient Monte Carlo sampling not only from the observational joint distribution
but also from a broad array of interventional distributions obtained by appropriately truncating the flow.
This is the conceptual foundation underlying causal-graphical normalizing flows, which we introduce in the
next section.
4 causal-Graphical Normalizing Flows
A causal-graphical normalizing flow (cGNF) resembles an autoregressive flow, but with a key distinction: the conditioner for each variable is a function of its parents, as indicated by a directed acyclic graph (DAG). Specifically, for a set of $k$ causally ordered variables $\mathbf{V} = \{V_1, \ldots, V_k\}$, a cGNF can be formulated as follows:
$$
\mathbf{Z} = h(\mathbf{V}) = \{ h_1(V_1; c_1), \ldots, h_i(V_i; c_i(V_i^{p})), \ldots, h_k(V_k; c_k(V_k^{p})) \}. \tag{19}
$$
⁴ With autoregressive flows, each function $h_i$ can be conceptualized as the composition of the inverse of the standard normal cumulative distribution function (CDF) with an arbitrary conditional CDF for $X_i$, given its predecessors.
⁵ The conditioner for $X_1$, denoted by $c_1$, degenerates into a constant because it does not depend on any preceding variables.
In this expression, the conditioner $c_i$ depends on $V_i^{p}$, which represents the parents of the variable $V_i$. The normalizer, denoted by $h_i$, is a function of both the conditioner and $V_i$ in turn.
Essentially, a cGNF arranges the elements of $\mathbf{V}$ in causal order. It then maps each variable $V_i$ to a new variable $Z_i$, which follows a standard normal distribution conditional on the parents of $V_i$. To this end, each conditioner $c_i$ shifts the location of the distribution for $V_i$ as a function of its parents $V_i^{p}$. The normalizer $h_i$ then transforms the shape of this distribution, given its location from the conditioner, to resemble the standard normal curve. The causal order and parent-child relationships among the variables of interest all come from a DAG supplied by the analyst.
The conditioners and normalizers within a cGNF are unknown functions that may be quite complex.
To model these functions, we use a special class of artificial neural networks. Not only do these networks
satisfy all the conditions that define a bijective map, they are also capable of approximating any monotonic
transformation of one variable or set of variables into another (Huang et al., 2018; Wehenkel and Louppe,
2019, 2021).
Artificial neural networks draw their inspiration from the structure and function of human brains (Chollet,
2021; Goodfellow et al., 2016; LeCun et al., 2015). They consist of multiple interconnected nodes, also referred
to as “neurons” or “units,” organized into multiple layers. A typical architecture for a deep neural network
includes an input layer, several intermediary or “hidden” layers, and an output layer.
Data flows through a deep neural network in a hierarchical manner. Nodes within each layer receive
inputs from preceding layers and then generate an output. These outputs, in turn, serve as the inputs for
nodes in subsequent layers. The number of layers and the quantity of nodes within each layer shape the
network’s expressiveness–that is, its ability to approximate a wide range of complex functions.
The connections between nodes in adjacent layers are governed by a set of weights and activation functions. Specifically, each node produces an output by applying an activation function to the weighted sum of
its inputs.
The weights that link nodes across layers are parameters that must be learned, or estimated, from data.
The process of estimating these weights is known as training. During training, the weights are initialized at
random values and then adjusted using an algorithm designed to make the network’s output as accurate as
possible.
[Figure 3 about here]

Figure 3: An Example of a Deep Neural Network Depicted as a Computational Graph.
Note: In this network, there is an input layer with two nodes, a hidden layer with three nodes, and an output layer with a single node. Each layer is fully connected, as indicated by the arrows between nodes, and each connection between nodes is controlled by a weight $w_i$.
Figure 3 illustrates a generic example of a deep neural network using a computational graph. By combining nodes, weights, and activation functions, layer upon layer, networks with this basic structure can
approximate highly complex functions. Indeed, networks with sufficiently flexible architectures are universal
function approximators, capable of accurately modeling any continuous mapping from one finite-dimensional
space to another (Hornik et al., 1989).
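For readers who find code clearer than diagrams, the sketch below builds the small fully connected network from Figure 3 in PyTorch. The choice of activation function is illustrative, and the bias terms added by each layer are not drawn in the figure.

```python
# A minimal PyTorch sketch of the 2-3-1 network depicted in Figure 3.
import torch
import torch.nn as nn

network = nn.Sequential(
    nn.Linear(2, 3),   # input layer (2 nodes) to hidden layer (3 nodes): weights w1-w6
    nn.ReLU(),         # activation applied to each hidden node
    nn.Linear(3, 1),   # hidden layer to output layer (1 node): weights w7-w9
)

x = torch.randn(5, 2)      # a batch of five observations with two inputs each
print(network(x).shape)    # torch.Size([5, 1])
```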
In the following sections, we outline the specialized architecture of the neural networks used to model a
cGNF, and discuss how these networks are trained. Next, we explain how a trained cGNF facilitates Monte
Carlo sampling from the observational joint distribution of the data. Finally, we show how a trained cGNF
also enables Monte Carlo sampling from interventional distributions, allowing for causal inference.
4.1 Parameterizing cGNFs
We use unconstrained monotonic neural networks (UMNNs) to model the transformations that compose a
cGNF (Wehenkel and Louppe, 2019, 2021). UMNNs are invertible and differentiable networks capable of
approximating any monotonic transformation. Their architecture is based on the principle that a monotonic function must exhibit a strictly positive derivative, which can then be integrated to yield the desired
transformation. Specifically, when used to model the transformations in a cGNF, these networks can be
represented as follows:
$$
h_i(V_i; c_i(V_i^{p}); \theta_i) = \int_{0}^{V_i} \beta_i(t; c_i(V_i^{p}; \psi_i); \phi_i)\, dt + \alpha_i(c_i(V_i^{p}; \psi_i)), \tag{20}
$$
where $\theta_i$ denotes the union of the parameters $\phi_i$ and $\psi_i$.
In Equation (20), the conditioner is now modeled using a deep neural network, denoted by $c_i(V_i^{p}; \psi_i)$. To produce its output, this network takes the parents of $V_i$ as inputs and then transforms them using a set of weights, denoted by $\psi_i$, together with the rectified linear unit (ReLU) activation function.⁶ Termed the embedding network, $c_i(V_i^{p}; \psi_i)$ models how $V_i$ varies as a function of its parents $V_i^{p}$ in the assumed DAG.
The output from the embedding network serves two purposes. First, it is used to generate a scalar offset term, denoted by $\alpha_i(c_i(V_i^{p}; \psi_i))$. Second, it also serves as input to another neural network, denoted by $\beta_i(t; c_i(V_i^{p}; \psi_i); \phi_i)$. This other network uses a distinct set of weights, $\phi_i$, and a special activation function–specifically, the exponential linear unit incremented by one (ELUPlus)–to generate an output that is strictly positive.⁷ Referred to as the integrand network, it models the derivative, at a given point $t$, of a monotonic function intended to map the variable $V_i$ onto the standard normal distribution, using the output from the embedding network.
The normalizer, then, is constructed with both the embedding and integrand networks in tandem. It comes from integrating the output of the integrand network, $\beta_i(t; c_i(V_i^{p}; \psi_i); \phi_i)$, from $t = 0$ to the observed value of $V_i$, and then adding the offset term $\alpha_i(c_i(V_i^{p}; \psi_i))$. The integration is performed using Clenshaw-Curtis quadrature, a numerical technique for approximating the area under a curve (Clenshaw and Curtis, 1960). The resulting function, denoted by $h_i(V_i; c_i(V_i^{p}); \theta_i)$, is an invertible, monotonic transformation of the variable $V_i$. Its parameters, $\theta_i = \{\phi_i, \psi_i\}$, are weights associated with the embedding and integrand networks. These weights can be trained to transform $V_i$ into a new variable $Z_i$, which follows the standard normal distribution conditional on its parents $V_i^{p}$.
⁶ The ReLU activation function returns the value zero for any negative input, but for any positive input, it returns the value of the input itself. Formally, this function can be expressed as $f(x) = \max(0, x)$.
⁷ The ELUPlus activation function is designed to produce an output greater than zero, regardless of its input. Formally, this function can be expressed as $f(x) = x + 1$ if $x > 0$, and $f(x) = \exp(x)$ if $x \leq 0$.
[Figure 4 about here]

Figure 4: An Unconstrained Monotonic Neural Network (UMNN) Depicted as a Computational Graph.
Note: This figure illustrates a simple UMNN for variable $V_3$ in our DAG from Figure 1. The conditioner, $c_3$, is modeled using an embedding network. This network has an input layer with two nodes, which correspond to the parents of $V_3$ (i.e., $V_1$ and $V_2$). The input layer is followed by a single hidden layer containing three nodes and an output layer with two nodes, labeled $c_{31}$ and $c_{32}$. The outputs of the embedding network serve as inputs for the integrand network. Specifically, the integrand network has an input layer with three nodes corresponding to $c_{31}$ and $c_{32}$ from the embedding network, as well as the integration points $t$. This input layer is followed by a single hidden layer with two nodes and an output layer with a single node. The normalizer, $h_3$, integrates the output of the integrand network, using $V_3$ as the upper limit of integration. This calculation yields $Z_3$, a transformed version of $V_3$ that conforms to the standard normal distribution.
Figure 4 illustrates a simple example of this model using a computational graph.⁸
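To give a sense of how these pieces fit together in code, the sketch below implements a single UMNN-style normalizer for $V_3$ in PyTorch, following the logic of Equation (20). It is only a rough stand-in for the authors' implementation: the integral is approximated with a fixed-grid trapezoidal rule rather than Clenshaw-Curtis quadrature, and all layer sizes are arbitrary.

```python
# A minimal PyTorch sketch of one UMNN-style normalizer (illustrative only).
import torch
import torch.nn as nn

class ELUPlus(nn.Module):
    """Exponential linear unit incremented by one, so the output is strictly positive."""
    def forward(self, x):
        return nn.functional.elu(x) + 1.0

class ToyUMNNNormalizer(nn.Module):
    def __init__(self, n_parents, embed_dim=2, hidden=16, n_steps=50):
        super().__init__()
        self.n_steps = n_steps
        # Embedding network: maps the parents of Vi to a low-dimensional summary.
        self.embedding = nn.Sequential(
            nn.Linear(n_parents, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )
        # Integrand network: strictly positive output, so its integral is monotonic in Vi.
        self.integrand = nn.Sequential(
            nn.Linear(embed_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), ELUPlus(),
        )
        self.offset = nn.Linear(embed_dim, 1)     # the scalar offset term alpha

    def forward(self, v_i, parents):
        emb = self.embedding(parents)                                 # c_i(parents; psi_i)
        grid = torch.linspace(0.0, 1.0, self.n_steps).view(1, -1)
        t = grid * v_i                                                # integration points from 0 to v_i
        emb_rep = emb.unsqueeze(1).expand(-1, self.n_steps, -1)
        beta = self.integrand(torch.cat([t.unsqueeze(-1), emb_rep], dim=-1)).squeeze(-1)
        integral = torch.trapezoid(beta, t, dim=1).unsqueeze(-1)      # approximate the integral in Eq. (20)
        return integral + self.offset(emb)                            # h_i(v_i; c_i; theta_i)

# Example: the normalizer for V3 in Figure 1, whose parents are V1 and V2.
norm_v3 = ToyUMNNNormalizer(n_parents=2)
v3, parents = torch.randn(8, 1), torch.randn(8, 2)
z3 = norm_v3(v3, parents)       # untrained, so not yet standard normal
print(z3.shape)                 # torch.Size([8, 1])
```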
Thus, for an entire set of $k$ causally ordered variables $\mathbf{V} = \{V_1, \ldots, V_k\}$, a cGNF parameterized by $\theta = \{\theta_1, \ldots, \theta_k\}$ can be compactly expressed as follows:
$$
\mathbf{Z} = h(\mathbf{V}; \theta) \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \tag{21}
$$
In this equation, $\mathcal{N}(\mathbf{0}, \mathbf{I})$ is the multivariate standard normal distribution, $h$ denotes the composition of normalizers for each variable in the data, and $\theta$ represents the full collection of weights from the UMNNs used to model each normalizer.
The architecture for the embedding and integrand networks that compose a UMNN can be customized with a varying number of layers, nodes per layer, and inter-layer connections. In general, more elaborate architectures–with additional layers, nodes, and connections–are better equipped to approximate more complex transformations. When configured with a sufficiently flexible architecture, UMNNs function as universal density approximators, capable of modeling any distribution irrespective of its complexity (Huang et al., 2018; Wehenkel and Louppe, 2019).
For empirical applications in the social sciences, we suggest architectures with at least four hidden layers, a minimum of 10 to 20 nodes per layer, and a fully connected configuration among them. For simplicity, we also recommend using identical architectures across all the normalizers, from $h_1$ to $h_k$, which does not appear to compromise their performance in practice (Wehenkel and Louppe, 2019).
Although increasing the complexity of UMNNs enables more accurate approximation, it also introduces
the risk of over-fitting. Over-fitting occurs when the network begins to model random variation in the sample
data used to train it, capturing noise in addition to the underlying signal. As a result, the trained network
can yield estimates that are less reliable and suffer from greater uncertainty. Nevertheless, the problem
of over-fitting can be mitigated through the use of regularization techniques during training, as we discuss
below. Moreover, recent studies suggest that over-parameterizing neural networks, such that their number
of weights exceeds the number of observations available for training, might actually enhance performance,
regardless of whether regularization methods are employed (Allen-Zhu et al., 2019; Brutzkus et al., 2017;
Yang et al., 2020; Zhang et al., 2021).
4.2 Training cGNFs
To train a cGNF, we adjust the weights, denoted by $\theta = \{\theta_1, \ldots, \theta_k\}$, in the embedding and integrand networks that compose the UMNNs. When adjusting these weights, the objective is to minimize a loss function that quantifies the fit of the model to a set of observed sample data. Specifically, the loss function used to train a cGNF is the negative log-likelihood, derived from the joint distribution for $\mathbf{V}$ using the change of variables formula. This loss function can be formally expressed as follows:
$$
\begin{aligned}
-LL(\theta) &= -\ln \prod_{l=1}^{n} f_{\mathbf{V}}\left(\mathbf{v}^{l}; \theta\right) \\
&= -\ln \prod_{l=1}^{n} f_{\mathbf{Z}}\left(h\left(\mathbf{v}^{l}; \theta\right)\right) \left| \det J_{h(\mathbf{v}^{l}; \theta)} \right| \\
&= -\sum_{l=1}^{n} \ln f_{\mathbf{Z}}\left(h\left(\mathbf{v}^{l}; \theta\right)\right) - \sum_{l=1}^{n} \ln \left| \det J_{h(\mathbf{v}^{l}; \theta)} \right| \\
&= -\sum_{l=1}^{n} \sum_{i=1}^{k} \ln f_{Z}\left(h_i\left(v_i^{l}; c_i\left(v_i^{p,l}\right); \theta_i\right)\right) - \sum_{l=1}^{n} \sum_{i=1}^{k} \ln \left| \frac{\partial h_i}{\partial v_i^{l}} \right|,
\end{aligned} \tag{22}
$$
where $l = 1, \ldots, n$ indexes observations in the sample data. As before, $i = 1, \ldots, k$ indexes the variables under consideration, $f_{\mathbf{Z}}$ denotes the multivariate standard normal distribution, $f_Z$ denotes the univariate standard normal distribution, and $h_i(V_i; c_i(V_i^{p}); \theta_i)$ denotes the UMNN intended to normalize each variable conditional on its parents.

⁸ The UMNN described here can be conceptualized as a model for the following transformation: $z_i = F_Z^{-1}(F_{V_i|v_i^{p}}(v_i))$, where $F_{V_i|v_i^{p}}$ is the cumulative distribution function (CDF) for the variable $V_i$, conditional on its parents, and $F_Z^{-1}$ is the inverse of the standard normal CDF. The composition of the inverse normal CDF with any arbitrary CDF for a continuous variable yields a monotonic normalizing transformation.
To find values for the network weights that minimize the negative log-likelihood, we use the method of
stochastic gradient descent (SGD), an algorithm that iteratively adjusts the weights based on the gradient
of the loss function (LeCun et al., 2015; Chollet, 2021; Goodfellow et al., 2016). The gradient refers to the
set of partial derivatives of the loss function with respect to each weight in the network. Essentially, the
gradient is like an arrow pointing toward the steepest increase in the loss function due to a small change
in the weights. SGD adjusts the weights by moving them in the opposite direction of the gradient, thereby
reducing the loss and improving the model’s fit to the sample data.
At each iteration of the algorithm, the weights are adjusted using the gradient computed on a random
subset of the data, known as a mini-batch, while the size of the adjustments is governed by a hyper-parameter
called the learning rate. When the learning rate is set at a small value, the weights are only adjusted a little
bit after each mini-batch is processed, which prevents the algorithm from overshooting their optimal values.
Completing one pass through all mini-batches in the sample data marks the end of one training epoch.
The algorithm adjusts the weights repeatedly, cycling through mini-batches and epochs over and over, until
it reaches a stopping criterion. In our case, the algorithm terminates when further adjustments to the
weights no longer reduce the loss, as measured on a separate validation sample held out from the data used
for training. This stopping criterion functions as a form of implicit regularization, helping to prevent over-
fitting. This is achieved by halting the training process before further adjustments to the network weights
serve mainly to fit random noise in the training data.
To summarize, the process of training a cGNF using SGD involves the following steps:
1. Partition the sample data randomly into training and validation subsets.
2. Initialize the network weights with random values.
3. Randomly divide the training data into mini-batches.
4. For each mini-batch:
(a) Compute the gradient of the loss function.
(b) Adjust the weights incrementally in the opposite direction of the gradient.
5. Evaluate the loss function using the validation data.
6. Repeat steps 3 to 5 until the validation loss ceases to improve over a set number of epochs.
In general, we advise reserving 20 percent of the sample data for validation while training the cGNF on
the remaining 80 percent. Mini-batch sizes should fall between 64 and 512 observations for best performance.
We also suggest a learning rate less than or equal to 0.001 and terminating the training algorithm after the
validation loss stagnates for 30 to 50 epochs.
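The sketch below illustrates this training recipe end to end, using a one-variable affine flow as a stand-in for a full cGNF. The simulated data, the hyper-parameter values, and names such as `patience` are our own illustrative choices rather than defaults of the cGNF software, and the Adam variant of SGD is used here only so this toy example converges quickly.

```python
# A self-contained sketch of mini-batch training with a validation split and
# patience-based early stopping, using a simple affine flow for one variable.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
data = 3.0 + 2.0 * torch.randn(5000, 1)             # toy sample: V ~ N(3, 4)

n_val = int(0.2 * len(data))                        # hold out 20 percent for validation
val, train = data[:n_val], data[n_val:]

class AffineFlow(nn.Module):
    """z = (v - mu) / exp(log_sigma): a minimal stand-in normalizer."""
    def __init__(self):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(1))
        self.log_sigma = nn.Parameter(torch.zeros(1))

    def neg_log_lik(self, v):
        z = (v - self.mu) * torch.exp(-self.log_sigma)
        # Negative log-likelihood from the change of variables formula, as in Equation (22).
        log_prob = -0.5 * z**2 - 0.5 * math.log(2 * math.pi) - self.log_sigma
        return -log_prob.mean()

flow = AffineFlow()
optimizer = torch.optim.Adam(flow.parameters(), lr=0.001)
batch_size, patience = 256, 30
best_val, epochs_without_improvement = float("inf"), 0

for epoch in range(1000):
    for batch in train[torch.randperm(len(train))].split(batch_size):   # shuffled mini-batches
        optimizer.zero_grad()
        loss = flow.neg_log_lik(batch)
        loss.backward()                          # gradient of the loss
        optimizer.step()                         # adjust weights against the gradient
    with torch.no_grad():
        val_loss = flow.neg_log_lik(val).item()  # evaluate on the validation data
    if val_loss < best_val:
        best_val, epochs_without_improvement = val_loss, 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:   # early stopping
        break

print(flow.mu.item(), flow.log_sigma.exp().item())   # should approach 3 and 2
```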
4.3 Sampling with cGNFs
After the cGNF is trained, it can be used for Monte Carlo sampling from the observational joint distribution
of the data. This is achieved by generating samples from the standard normal distribution and then trans-
forming them using the inverse of the trained cGNF. The inverse of the cGNF is found using a bisection
algorithm, which identifies the inverse by iteratively narrowing the range of values within which it must lie.⁹
To illustrate the sampling process, consider a cGNF trained with the DAG in Figure 1. In this case, the sampling process begins with the first variable in causal order, denoted by $V_1$. To simulate its values, we initially create $J$ Monte Carlo samples from the standard normal distribution. These samples, each denoted by $\tilde{Z}_1^j$ for $j = 1, \ldots, J$, are then transformed as follows: $\tilde{V}_1^j = h_1^{-1}(\tilde{Z}_1^j; c_1; \hat{\theta}_1)$, where the "hat" indicates that the network weights have been estimated by SGD. The result of this transformation, $\tilde{V}_1^j$, represents a Monte Carlo sample from the marginal distribution of $V_1$. It is obtained by applying the inverse of the normalizer for $V_1$ to a sample from the standard normal distribution.
For the next variable, $V_2$, we generate another set of Monte Carlo samples from the standard normal distribution, each denoted by $\tilde{Z}_2^j$ for $j = 1, \ldots, J$. We then transform these samples as follows: $\tilde{V}_2^j = h_2^{-1}(\tilde{Z}_2^j; c_2(\tilde{V}_1^j); \hat{\theta}_2)$, where $\tilde{V}_1^j$ is carried over from the previous step. The result of this transformation, $\tilde{V}_2^j$, represents a Monte Carlo sample from the conditional distribution of $V_2$, given its only parent $V_1$. It comes from transforming a standard normal sample using the inverse of the normalizer for $V_2$.
Continuing this process for $V_3$, the next variable in causal order, we generate a third set of Monte Carlo samples from the standard normal distribution. These samples, denoted by $\tilde{Z}_3^j$ for $j = 1, \ldots, J$, are then transformed as follows: $\tilde{V}_3^j = h_3^{-1}(\tilde{Z}_3^j; c_3(\tilde{V}_1^j, \tilde{V}_2^j); \hat{\theta}_3)$, where $\tilde{V}_1^j$ and $\tilde{V}_2^j$ are both carried over from the previous steps. The result of this transformation, $\tilde{V}_3^j$, represents a Monte Carlo sample from the conditional distribution of $V_3$, given its parents $V_1$ and $V_2$. It is obtained by passing a standard normal sample through the inverse of the normalizer for $V_3$.
For the final variable $V_4$, we generate another set of standard normal samples, each denoted by $\tilde{Z}_4^j$ for $j = 1, \ldots, J$. They are then subjected to the following transformation: $\tilde{V}_4^j = h_4^{-1}(\tilde{Z}_4^j; c_4(\tilde{V}_2^j, \tilde{V}_3^j); \hat{\theta}_4)$, where $\tilde{V}_2^j$ and $\tilde{V}_3^j$ are samples obtained from the previous steps. This transformation yields $\tilde{V}_4^j$, which represents a Monte Carlo sample from the conditional distribution of $V_4$, given its parents $V_2$ and $V_3$. As before, it comes from applying the inverse of the normalizer, now for $V_4$, to a Monte Carlo sample from the standard normal distribution.
⁹ The algorithm starts with an interval defined by the minimum and maximum values of the function, which necessarily bound the desired output. The algorithm then finds the midpoint of these values and uses it to replace the endpoint of the current range that does not contain the inverse, halving the interval's size. This process is iterated until the interval is extremely small, and the midpoint within this range is taken as the value of the inverse. Specifically, for a value $z$ sampled from the standard normal distribution, the algorithm starts with an interval $[x_a, x_b]$ that must contain the inverse. The algorithm replaces this interval with $[x_a, x_{(b-a)/2}]$ if $h(x_{(b-a)/2}) > z$, and with $[x_{(b-a)/2}, x_b]$ otherwise. This process of halving the interval is repeated until its range is minuscule and, then, the midpoint of the final interval is returned as the inverse.
Combining the samples for each variable together, we obtain a random vector $\tilde{\mathbf{V}}^j = \{\tilde{V}_1^j, \tilde{V}_2^j, \tilde{V}_3^j, \tilde{V}_4^j\}$. This vector represents a Monte Carlo sample from the observational joint distribution, as approximated by the trained cGNF.
In general, for a set of $k$ variables arranged in causal order, Monte Carlo sampling from their observational joint distribution is accomplished by first generating standard normal samples, indexed by $j = 1, \ldots, J$, and then transforming them recursively as follows:
$$
\tilde{V}_i^j = h_i^{-1}\left(\tilde{Z}_i^j; c_i\left(\tilde{V}_i^{p,j}\right); \hat{\theta}_i\right) \quad \text{for } i = 1, \ldots, k. \tag{23}
$$
In this expression, $\tilde{Z}_i^j$ denotes a standard normal sample, $h_i^{-1}$ represents the inverse of the normalizer for variable $V_i$, and $\tilde{V}_i^{p,j}$ denotes the simulated values for the parents of $V_i$, which are obtained from previous steps in the sampling algorithm. Transforming $\tilde{Z}_i^j$ with $h_i^{-1}$ yields $\tilde{V}_i^j$, a Monte Carlo sample from the conditional distribution of $V_i$, given its parents. Cycling through these transformations for each variable in causal order generates samples from the full joint distribution of the data, as modeled by the cGNF.
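The sketch below mimics this recursive sampling scheme for the DAG in Figure 1. Because no trained cGNF is available here, hand-written monotone functions with invented coefficients stand in for the estimated normalizers, and a simple bisection routine plays the role of the inversion step described in footnote 9.

```python
# A minimal sketch of recursive (ancestral) sampling via inverted normalizers,
# using invented stand-in functions rather than a trained cGNF.
import numpy as np

rng = np.random.default_rng(0)
J = 2000

def bisect_inverse(h, z, lo=-50.0, hi=50.0, tol=1e-8):
    """Solve h(v) = z for v by repeatedly halving an interval that brackets the solution."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) > z:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Stand-in normalizers h_i(v_i; parents), each strictly increasing in v_i.
h1 = lambda v1: v1
h2 = lambda v2, v1: v2 - 0.8 * v1
h3 = lambda v3, v1, v2: 0.5 * (v3 - 0.5 * v1 - 0.5 * v2) + 0.1 * np.tanh(v3)
h4 = lambda v4, v2, v3: v4 - 0.3 * v2 - 0.6 * v3

samples = np.empty((J, 4))
for j in range(J):
    z1, z2, z3, z4 = rng.standard_normal(4)
    v1 = bisect_inverse(lambda v: h1(v), z1)            # marginal draw of V1
    v2 = bisect_inverse(lambda v: h2(v, v1), z2)        # V2 given its parent V1
    v3 = bisect_inverse(lambda v: h3(v, v1, v2), z3)    # V3 given V1 and V2
    v4 = bisect_inverse(lambda v: h4(v, v2, v3), z4)    # V4 given V2 and V3
    samples[j] = [v1, v2, v3, v4]

print(samples.mean(axis=0).round(2))   # Monte Carlo draws from the implied joint distribution
```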
4.4 Estimating Causal Effects with cGNFs
A cGNF enables Monte Carlo sampling not only from the observational joint distribution but also from
various interventional distributions. To simulate values from an interventional distribution, Monte Carlo
samples are first selected from the standard normal distribution, as before. Next, these samples are transformed using the inverse of the trained cGNF, after truncating it in accordance with the desired intervention.
Causal effects are then estimated by averaging and comparing the samples drawn from different interventional
distributions.
This approach to estimating causal effects is an implementation of the g-computation algorithm, first
proposed by Robins (1986) and later extended by others (Daniel et al., 2011; Imai et al., 2010; Wang and Arah,
2015). When implemented in conjunction with a cGNF, the algorithm is versatile enough to estimate any
causal effect that is non-parametrically identified from the observed data, without imposing any functional
form restrictions on their distribution.
4.4.1 Total Effects
To illustrate, suppose we trained a cGNF based on the DAG in Figure 1, and we were interested in estimating the average total effect of $V_3$ on $V_4$, defined formally as $\mathrm{ATE}_{V_3 \to V_4} = E[V_4(v_3^*) - V_4(v_3)]$. For this estimand, we construct an estimate by Monte Carlo sampling from the interventional distributions that arise after setting $V_3$ at two different values, $v_3^*$ and $v_3$, respectively. To this end, we modify the sampling algorithm outlined in the previous section as follows: (i) we skip the step where samples of $V_3$ are generated, and (ii) we draw Monte Carlo samples for all the other variables after setting $V_3$ to $v_3^*$ and $v_3$, in turn, wherever this variable appears in the conditioners of the cGNF.
Specifically, to simulate samples from the interventional distribution when $V_3$ is set at $v_3^*$, we first generate $J$ Monte Carlo samples from the standard normal distribution, each denoted by $\tilde{Z}_1^j$ for $j = 1, \ldots, J$. These samples are then transformed by computing $\tilde{V}_1^j = h_1^{-1}(\tilde{Z}_1^j; c_1; \hat{\theta}_1)$. Next, we create another $J$ Monte Carlo samples from the standard normal distribution, each denoted by $\tilde{Z}_2^j$ for $j = 1, \ldots, J$, and then transform them by computing $\tilde{V}_2^j = h_2^{-1}(\tilde{Z}_2^j; c_2(\tilde{V}_1^j); \hat{\theta}_2)$. These steps mirror the sampling algorithm for the observational joint distribution, as outlined in the previous section, because $V_1$ and $V_2$ causally precede $V_3$, the variable subject to intervention.
At this juncture, however, the procedure diverges from the sampling algorithm outlined previously. After generating Monte Carlo samples for $V_1$ and $V_2$, we now skip drawing samples for $V_3$ and proceed directly to sampling for $V_4$. Here, we generate a final set of $J$ standard normal samples, each denoted by $\tilde{Z}_4^j$ for $j = 1, \ldots, J$. These samples are then transformed by computing $\tilde{V}_4^j(v_3^*) = h_4^{-1}(\tilde{Z}_4^j; c_4(\tilde{V}_2^j, v_3^*); \hat{\theta}_4)$, where, in the conditioner, $\tilde{V}_2^j$ is carried over from the previous step and $V_3$ is set to $v_3^*$. The result of this transformation, denoted by $\tilde{V}_4^j(v_3^*)$, represents a Monte Carlo sample of $V_4$ from the interventional distribution with $V_3$ set at $v_3^*$.
To simulate samples from the interventional distribution with $V_3$ set at $v_3$, rather than $v_3^*$, we repeat the previous step, only now using this other value for the variable subject to intervention. In particular, we again transform the final set of standard normal samples, denoted as $\tilde{Z}_4^j$ for $j = 1, \ldots, J$, by computing $\tilde{V}_4^j(v_3) = h_4^{-1}(\tilde{Z}_4^j; c_4(\tilde{V}_2^j, v_3); \hat{\theta}_4)$. In the conditioner of this transformation, $\tilde{V}_2^j$ is carried over, as before, while $V_3$ is now set to its alternative value $v_3$. As a result, we obtain $\tilde{V}_4^j(v_3)$, which represents a Monte Carlo sample of $V_4$ from the interventional distribution with $V_3$ set to $v_3$.
An estimate for $\mathrm{ATE}_{V_3 \to V_4}$, the average total effect of $V_3$ on $V_4$, can then be constructed as follows:
$$
\widehat{\mathrm{ATE}}_{V_3 \to V_4} = \frac{1}{J} \sum_{j=1}^{J} \left[ \tilde{V}_4^{j}(v_3^*) - \tilde{V}_4^{j}(v_3) \right], \tag{24}
$$
where $\tilde{V}_4^{j}(v_3^*)$ and $\tilde{V}_4^{j}(v_3)$ denote Monte Carlo samples drawn from interventional distributions with $V_3$ set to $v_3^*$ and $v_3$, respectively. A similar procedure can be used to estimate any other total effect, such as $\mathrm{ATE}_{V_2 \to V_3} = E[V_3(v_2^*) - V_3(v_2)]$ or $\mathrm{ATE}_{V_2 \to V_4} = E[V_4(v_2^*) - V_4(v_2)]$, as long as they are non-parametrically identified. For these other estimands, the sampling algorithm would just be modified to reflect the different interventions of interest.
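The sketch below illustrates this estimation strategy for $\mathrm{ATE}_{V_3 \to V_4}$, using simple closed-form inverse normalizers with invented coefficients in place of a trained cGNF. The point is only the truncation logic: the step that samples $V_3$ is skipped, and the intervened value enters the conditioner for $V_4$.

```python
# A minimal sketch of Equation (24): Monte Carlo g-computation for the ATE of V3 on V4
# under the DAG in Figure 1, with invented stand-in inverse normalizers.
import numpy as np

rng = np.random.default_rng(0)
J = 100_000

# Stand-in inverse normalizers, v_i = h_i^{-1}(z_i; parents).
inv_h1 = lambda z1: z1
inv_h2 = lambda z2, v1: 0.8 * v1 + z2
inv_h4 = lambda z4, v2, v3: 0.3 * v2 + 0.6 * v3 + z4

def sample_v4_under_intervention(v3_value):
    """Draw V4 from the interventional distribution with V3 fixed at v3_value."""
    v1 = inv_h1(rng.standard_normal(J))
    v2 = inv_h2(rng.standard_normal(J), v1)
    # The step that samples V3 is skipped; v3_value enters the conditioner for V4.
    return inv_h4(rng.standard_normal(J), v2, v3_value)

v3_star, v3_base = 1.0, 0.0
ate_hat = np.mean(sample_v4_under_intervention(v3_star)
                  - sample_v4_under_intervention(v3_base))
print(round(ate_hat, 2))   # close to 0.6, the coefficient on V3 in this toy model
```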
4.4.2 Conditional Effects
The sampling algorithm we described for estimating average total effects can also be used to estimate conditional average effects, which reflect the impact of interventions within particular subgroups. For example, suppose we were interested in estimating the conditional average effect of $V_3$ on $V_4$, given $V_2 = v_2$, with a cGNF based on the DAG in Figure 1. This estimand can be formally defined as follows:
$$
\mathrm{CATE}_{V_3 \to V_4 | v_2} = E[V_4(v_3^*) - V_4(v_3) \,|\, V_2 = v_2]. \tag{25}
$$
It captures the effect of $V_3$ on $V_4$ within the subpopulation for which $V_2 = v_2$.
To estimate this effect, we execute the sampling algorithm exactly as outlined previously for the average total effect of $V_3$ on $V_4$. The only difference lies in the final step, where we restrict our comparison of the resulting Monte Carlo samples to those that fall within a particular subgroup. Specifically, we compute the following quantity:
$$\widehat{\mathrm{CATE}}_{V_3 \rightarrow V_4 \mid v_2} = \frac{1}{J_{v_2}} \sum_{j: \tilde{V}_2^j = v_2} \left[ \tilde{V}_4^j(v_3') - \tilde{V}_4^j(v_3) \right]. \tag{26}$$
In this expression, the sum is taken over Monte Carlo samples for which $\tilde{V}_2^j = v_2$, and $J_{v_2}$ denotes the total number of such samples. The simulated variables $\tilde{V}_2^j$, $\tilde{V}_4^j(v_3')$, and $\tilde{V}_4^j(v_3)$ are all defined and generated as in Section 4.4.1. Other conditional effects can be estimated using a similar procedure.10
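The following sketch adapts the previous example to the conditional effect. It again relies on hypothetical placeholder transformations in place of a trained cGNF; the only changes are that $V_2$ is treated as binary and the final comparison is restricted to samples whose simulated $V_2$ equals $v_2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholders for the inverse normalizers of a trained cGNF,
# with V2 made binary so that exact conditioning on V2 = v2 is meaningful.
def h1_inv(z1):
    return z1

def h2_inv(z2, v1):
    return (0.5 * v1 + z2 > 0).astype(float)   # binary V2

def h4_inv(z4, v2, v3):
    return 0.3 * v2 + 0.7 * v3 + z4

def estimate_cate_v3_v4(v3_prime, v3, v2, J=100_000):
    """Conditional average effect of V3 on V4 within the subgroup V2 = v2."""
    v1 = h1_inv(rng.standard_normal(J))
    v2_sim = h2_inv(rng.standard_normal(J), v1)
    z4 = rng.standard_normal(J)
    v4_prime = h4_inv(z4, v2_sim, v3_prime)
    v4 = h4_inv(z4, v2_sim, v3)
    in_subgroup = v2_sim == v2                 # keep only samples with simulated V2 = v2
    return np.mean(v4_prime[in_subgroup] - v4[in_subgroup])

print(estimate_cate_v3_v4(v3_prime=1.0, v3=0.0, v2=1.0))
```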
4.4.3 Joint Effects
Joint effects refer to causal estimands that involve interventions on multiple variables simultaneously. For
example, the average joint effect of $V_2$ and $V_3$ on $V_4$ can be expressed as follows:
$$\mathrm{AJE}_{V_2,V_3 \rightarrow V_4} = E[V_4(v_2', v_3') - V_4(v_2, v_3)]. \tag{27}$$
In this equation, $V_4(v_2', v_3')$ represents the potential outcome of $V_4$ when $V_2$ and $V_3$ are set to $v_2'$ and $v_3'$, respectively. This potential outcome is contrasted with another, denoted by $V_4(v_2, v_3)$, which is defined analogously. The resulting effect captures the combined impact of $V_2$ and $V_3$ on $V_4$.
This effect can also be estimated by Monte Carlo sampling from different interventional distributions,
using an inverted and appropriately truncated cGNF. Here, the sampling algorithm is modified to reflect an
intervention on multiple variables at once: that is, we skip drawing samples for both $V_2$ and $V_3$, and then we generate samples for the remaining variables after setting $V_2$ and $V_3$ at specific values, wherever they appear in the conditioners of the flow.
Specifically, to simulate samples from an interventional distribution when $V_2$ is set at $v_2'$ and $V_3$ is set at $v_3'$, we begin by generating Monte Carlo samples for $V_1$, as outlined previously. We then skip drawing samples for both $V_2$ and $V_3$ and proceed directly to sampling for $V_4$. To this end, we generate $J$ Monte Carlo samples from the standard normal distribution, denoted by $\tilde{Z}_4^j$ for $j = 1, \ldots, J$, and we then transform them by computing $\tilde{V}_4^j(v_2', v_3') = h_4^{-1}(\tilde{Z}_4^j; c_4(v_2', v_3'); \hat{\theta}_4)$. The result of this transformation, denoted by $\tilde{V}_4^j(v_2', v_3')$, represents a Monte Carlo sample of $V_4$ from the interventional distribution with $V_2$ and $V_3$ set at $v_2'$ and $v_3'$, respectively.
To simulate samples from the interventional distribution with $V_2$ and $V_3$ set at $v_2$ and $v_3$, we repeat the previous calculations, now using these other values for the variables subject to intervention. Thus, we again transform the set of standard normal samples, $\tilde{Z}_4^j$ for $j = 1, \ldots, J$, this time by computing $\tilde{V}_4^j(v_2, v_3) = h_4^{-1}(\tilde{Z}_4^j; c_4(v_2, v_3); \hat{\theta}_4)$, where $V_2$ and $V_3$ are set to their alternative values in the conditioner of the flow.
An estimate for $\mathrm{AJE}_{V_2,V_3 \rightarrow V_4}$, the average joint effect of $V_2$ and $V_3$ on $V_4$, can then be constructed as follows:
$$\widehat{\mathrm{AJE}}_{V_2,V_3 \rightarrow V_4} = \frac{1}{J} \sum_{j=1}^{J} \left[ \tilde{V}_4^j(v_2', v_3') - \tilde{V}_4^j(v_2, v_3) \right]. \tag{28}$$
A similar procedure can be used to estimate any other joint effect, including controlled direct effects and
interaction effects (VanderWeele, 2009, 2015).
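A minimal sketch of this joint intervention is given below, under the same hypothetical placeholder assumptions as before: both $V_2$ and $V_3$ are skipped during sampling and set directly in $V_4$'s conditioner.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholders for the inverse normalizers of a trained cGNF.
def h1_inv(z1):
    return z1

def h4_inv(z4, v2, v3):
    return 0.3 * v2 + 0.7 * v3 + z4   # conditioner c4(V2, V3)

def estimate_aje_v2v3_v4(v2_prime, v3_prime, v2, v3, J=100_000):
    """Average joint effect of (V2, V3) on V4: skip sampling both intervened variables."""
    # V1 is sampled as in the observational algorithm, although V4's conditioner
    # in this DAG does not use it directly.
    v1 = h1_inv(rng.standard_normal(J))
    z4 = rng.standard_normal(J)
    v4_prime = h4_inv(z4, v2_prime, v3_prime)   # samples of V4(v2', v3')
    v4 = h4_inv(z4, v2, v3)                     # samples of V4(v2, v3)
    return np.mean(v4_prime - v4)

print(estimate_aje_v2v3_v4(v2_prime=1.0, v3_prime=1.0, v2=0.0, v3=0.0))
```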
4.4.4 Mediation Effects
Monte Carlo sampling, in conjunction with a trained cGNF, can also be used to analyze different types of
causal mediation, including direct, indirect, and path-specific effects (Pearl, 2022; VanderWeele, 2015; Zhou
and Yamamoto, 2023). These effects are all formulated in terms of cross-world potential outcomes, where
an exposure variable is set at one value but the mediating variable or variables are set at their values under
a different exposure condition.
10 For example, if $V_2$ were continuous, we could define the conditional effect over an interval of values and then average the Monte Carlo samples of $\tilde{V}_4^j(v_3')$ and $\tilde{V}_4^j(v_3)$ among cases with values of $\tilde{V}_2^j$ that belong to this interval.
For example, consider the natural indirect effect of $V_2$ on $V_4$ operating via $V_3$, given the DAG from Figure 1. This effect can be formally defined as follows:
$$\mathrm{NIE}_{V_2 \rightarrow V_3 \rightarrow V_4} = E[V_4(v_2') - V_4(v_2', V_3(v_2))]. \tag{29}$$
In this expression, $V_4(v_2') = V_4(v_2', V_3(v_2'))$ represents the conventional potential outcome of $V_4$ when $V_2$ is fixed at $v_2'$ and, by extension, $V_3$ takes its natural value under the same exposure setting. Conversely, $V_4(v_2', V_3(v_2))$ represents the potential outcome of $V_4$ when $V_2$ is set at $v_2'$ but $V_3$ assumes the value it would naturally take if $V_2$ had been set at $v_2$ instead. This is known as a cross-world potential outcome, since it fixes $V_2$ at one value, $v_2'$, while setting $V_3$ at its value from an alternative counterfactual scenario where $V_2$ is instead set at $v_2$. The natural indirect effect captures the influence of $V_2$ on $V_4$ that is transmitted through $V_3$ by comparing the conventional and cross-world potential outcomes.
To estimate the natural indirect effect, we begin by generating $J$ Monte Carlo samples, each drawn from the standard normal distribution and denoted by $\tilde{Z}_1^j$ for $j = 1, \ldots, J$. We then transform them to yield samples for $V_1$, computing $\tilde{V}_1^j = h_1^{-1}(\tilde{Z}_1^j; c_1; \hat{\theta}_1)$ as outlined previously.
Next, we skip drawing samples for $V_2$ and proceed directly to sampling for $V_3$. Specifically, we generate $J$ Monte Carlo samples from the standard normal distribution, denoted by $\tilde{Z}_3^j$ for $j = 1, \ldots, J$, and we then transform them by computing $\tilde{V}_3^j(v_2') = h_3^{-1}(\tilde{Z}_3^j; c_3(\tilde{V}_1^j, v_2'); \hat{\theta}_3)$. In addition, we also transform these samples by computing $\tilde{V}_3^j(v_2) = h_3^{-1}(\tilde{Z}_3^j; c_3(\tilde{V}_1^j, v_2); \hat{\theta}_3)$. The first of these transformations yields $\tilde{V}_3^j(v_2')$, a Monte Carlo sample of $V_3$ from the interventional distribution with $V_2$ set at $v_2'$, while the second produces $\tilde{V}_3^j(v_2)$, a sample from the interventional distribution with $V_2$ now set at $v_2$.
In the final step, we create Monte Carlo samples for $V_4$. Here, we again initiate the sampling process by generating $J$ Monte Carlo samples from the standard normal distribution, denoted by $\tilde{Z}_4^j$ for $j = 1, \ldots, J$. We then apply two different transformations to these samples. The first transformation computes $\tilde{V}_4^j(v_2') = h_4^{-1}(\tilde{Z}_4^j; c_4(v_2', \tilde{V}_3^j(v_2')); \hat{\theta}_4)$, where, in the conditioner, $\tilde{V}_3^j(v_2')$ is carried over from the preceding step. The second transformation computes $\tilde{V}_4^j(v_2', V_3(v_2)) = h_4^{-1}(\tilde{Z}_4^j; c_4(v_2', \tilde{V}_3^j(v_2)); \hat{\theta}_4)$, in which $\tilde{V}_3^j(v_2)$ is carried over from the previous step as well.
Upon completing the sampling algorithm, we obtain Monte Carlo samples of the conventional and cross-world potential outcomes, denoted by $\tilde{V}_4^j(v_2')$ and $\tilde{V}_4^j(v_2', V_3(v_2))$, respectively. With these samples, an estimate for the natural indirect effect can then be constructed as follows:
$$\widehat{\mathrm{NIE}}_{V_2 \rightarrow V_3 \rightarrow V_4} = \frac{1}{J} \sum_{j=1}^{J} \left[ \tilde{V}_4^j(v_2') - \tilde{V}_4^j(v_2', V_3(v_2)) \right]. \tag{30}$$
Similar procedures can be used to estimate natural direct effects, pure indirect effects, and path-specific
effects. In each case, the sampling algorithm is adapted to simulate the particular conventional and cross-
world potential outcomes that compose these other estimands.
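The sketch below illustrates this nested sampling scheme for the natural indirect effect, again with hypothetical placeholder transformations standing in for a trained cGNF. The key detail is that $V_3$ is simulated twice from the same base draws, once under each exposure setting, and only the cross-world version enters the second transformation of $V_4$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholders for the inverse normalizers of a trained cGNF
# based on the DAG in Figure 1.
def h1_inv(z1):
    return z1

def h3_inv(z3, v1, v2):
    return 0.4 * v1 + 0.6 * v2 + z3   # conditioner c3(V1, V2)

def h4_inv(z4, v2, v3):
    return 0.3 * v2 + 0.7 * v3 + z4   # conditioner c4(V2, V3)

def estimate_nie_v2_v3_v4(v2_prime, v2, J=100_000):
    """Natural indirect effect of V2 on V4 via V3, using a cross-world potential outcome."""
    v1 = h1_inv(rng.standard_normal(J))
    # Skip sampling V2; simulate V3 under both exposure settings from the same Z3 draws.
    z3 = rng.standard_normal(J)
    v3_prime = h3_inv(z3, v1, v2_prime)   # samples of V3(v2')
    v3 = h3_inv(z3, v1, v2)               # samples of V3(v2)
    # Simulate V4 with V2 fixed at v2' and V3 set to each counterfactual value.
    z4 = rng.standard_normal(J)
    v4_conventional = h4_inv(z4, v2_prime, v3_prime)  # V4(v2') = V4(v2', V3(v2'))
    v4_cross_world = h4_inv(z4, v2_prime, v3)         # V4(v2', V3(v2))
    return np.mean(v4_conventional - v4_cross_world)

print(estimate_nie_v2_v3_v4(v2_prime=1.0, v2=0.0))
```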
4.5 Sensitivity Analysis
Identifying and consistently estimating any causal effect hinges on assumptions about the absence of unob-
served confounding. Unobserved confounding arises when two variables are influenced by a common cause,
termed a confounder. In such cases, an association between these two variables may not reflect a causal effect
of one on the other. Instead, it could be a result of spurious co-variation due to their shared connection with
an unobserved factor.
In a DAG, unobserved confounding is often represented by dashed or bidirectional arrows (Elwert, 2013; Morgan and Winship, 2015). For example, Panel A of Figure 5 uses a dashed, bidirectional arrow to denote that the disturbance terms, $\epsilon_{V_3}$ and $\epsilon_{V_4}$, are not independent because $V_3$ and $V_4$ share an unobserved common cause. In this scenario, certain estimands cannot be identified, specifically those whose identification relies on the absence of unobserved confounding between $V_3$ and $V_4$, including $\mathrm{ATE}_{V_3 \rightarrow V_4}$, $\mathrm{CATE}_{V_3 \rightarrow V_4 \mid v_2}$, $\mathrm{AJE}_{V_2,V_3 \rightarrow V_4}$, and $\mathrm{NIE}_{V_2 \rightarrow V_3 \rightarrow V_4}$, among others. As a result, cGNFs may fail to provide accurate estimates for these effects.
[Figure 5 about here. Panel A: Disturbance Terms are Not Independent. Panel B: Normalized Disturbances are Correlated.]
Figure 5: Directed Acyclic Graphs (DAGs) Depicting Unobserved Confounding.
Note: In these DAGs, the dashed bidirectional arrow connecting $\epsilon_{V_3}$ and $\epsilon_{V_4}$ indicates that these disturbance terms are not independent (i.e., $V_3$ and $V_4$ share an unobserved common cause). The dashed bidirectional arrow connecting $Z_3$ and $Z_4$ denotes that these variables are correlated, where $\rho_{Z_3,Z_4}$ captures the direction and strength of the relationship. The normalizing transformations, denoted by $h_i(V_i; c_i(V_i^p))$ for $i = 1, \ldots, 4$, map each variable $V_i$ to a standard normal variable $Z_i$, conditional on its parents $V_i^p$. Because each variable $V_i$, given its parents $V_i^p$, varies only as a function of its disturbance $\epsilon_{V_i}$, the transformation $h_i$ can also be conceptualized as mapping this disturbance term to a standard normal variable $Z_i$. Thus, $Z_3$ and $Z_4$ represent normalized transformations of $\epsilon_{V_3}$ and $\epsilon_{V_4}$, respectively, and $\rho_{Z_3,Z_4}$ captures their correlation.
Because the absence of unobserved confounding is neither empirically verifiable nor automatically ensured
by common research designs in the social sciences, it is important to assess the sensitivity of effect estimates
to potential bias. cGNFs offer a straightforward and intuitive approach for conducting such analyses (Balgi
et al., 2022b). It involves specifying a set of sensitivity parameters, which represent correlations between
normalized disturbance terms in the assumed DAG. These correlations are then used to modify the processes
through which a cGNF is trained and utilized to generate Monte Carlo samples. By modifying the training
and sampling procedures in this way, the cGNF is recalibrated to adjust for possible biases due to unobserved
confounding.
To better appreciate this approach, note that mapping any variable $V_i$, conditional on its parents $V_i^p$, to a standard normal variable $Z_i$ is functionally equivalent to mapping its disturbance term $\epsilon_{V_i}$ to the same variable $Z_i$. This is because the only source of variation in $V_i$, holding its parents constant, comes from its disturbance term. The equivalence of these mappings is represented graphically in Panel B of Figure 5, where each variable $Z_i$ is depicted as a transformation of a corresponding disturbance term $\epsilon_{V_i}$.
In analyses devoid of unobserved confounding, where all the disturbance terms are pairwise independent, a
cGNF should transform them into a set of mutually independent, standard normal variables. In other words,
when unobserved confounding is assumed away entirely, a cGNF is designed to map the disturbance terms
to a new set of variables distributed as N(0,I). However, if the disturbance terms are not independent
due to unobserved confounding, a cGNF can be modified to map them instead to a multivariate normal
distribution that preserves their dependence structure.
Specifically, to accommodate dependent disturbances, a cGNF can be reformulated as follows:
$$\mathbf{Z} = h(\mathbf{V}; \theta) \sim \mathcal{N}(\mathbf{0}, \Sigma_Z). \tag{31}$$
In this expression, the normalizers now map the set of variables $\mathbf{V} = \{V_1, \ldots, V_k\}$ into a new set $\mathbf{Z} = \{Z_1, \ldots, Z_k\}$, which is distributed according to a multivariate normal distribution with a mean vector of zeros and a covariance matrix given by $\Sigma_Z$ rather than the identity matrix $\mathbf{I}$.
This distribution can be represented in greater detail as follows:
$$\mathcal{N}(\mathbf{0}, \Sigma_Z) = \mathcal{N}\left( \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho_{Z_1,Z_2} & \cdots & \rho_{Z_1,Z_k} \\ \rho_{Z_1,Z_2} & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ \rho_{Z_1,Z_k} & \cdots & \cdots & 1 \end{bmatrix} \right), \tag{32}$$
where the off-diagonal elements of $\Sigma_Z$ are a set of Pearson correlations. These correlations reflect the direction
and magnitude of the relationships between different disturbance terms, after each has been transformed such
that its marginal distribution is univariate standard normal. Stronger correlations between these normalized
disturbances correspond to a greater degree of unobserved confounding.
To adapt a cGNF for mapping the set of variables $\mathbf{V}$ into a new set $\mathbf{Z}$ that is distributed as $\mathcal{N}(\mathbf{0}, \Sigma_Z)$ rather than $\mathcal{N}(\mathbf{0}, \mathbf{I})$, we need only modify the loss function used to optimize the weights of the UMNNs
during training. This modified loss function can be expressed as follows:
$$-\mathrm{LL}(\theta; \Sigma_Z) = -\ln\left( \prod_{l=1}^{n} f_V\!\left(v^l; \theta\right) \right) = -\ln\left( \prod_{l=1}^{n} f_Z\!\left(h(v^l; \theta); \Sigma_Z\right) \det J_{h(v^l;\theta)} \right) = -\sum_{l=1}^{n} \ln f_Z\!\left(h(v^l; \theta); \Sigma_Z\right) - \sum_{l=1}^{n} \sum_{i=1}^{k} \ln \frac{\partial h_i}{\partial v_i^l}, \tag{33}$$
where $l = 1, \ldots, n$ indexes observations in the sample data, $i = 1, \ldots, k$ indexes the variables, and $f_Z(z^l; \Sigma_Z)$ represents the multivariate normal density with a mean vector of zeros and a covariance matrix $\Sigma_Z$, evaluated at $h(v^l; \theta)$. The only difference between this loss function and the one used to train a standard cGNF lies in the substitution of the covariance matrix $\Sigma_Z$ for the identity matrix $\mathbf{I}$ in the multivariate normal distribution $f_Z$.
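As a rough illustration of how the covariance matrix enters the objective, the sketch below evaluates the loss in Equation (33) given quantities that, in practice, would be produced by the cGNF's forward pass and differentiated with respect to $\theta$ during training. It is meant only to show where $\Sigma_Z$ replaces the identity matrix, not to reproduce the actual training code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def modified_neg_log_likelihood(z, log_abs_det_jac, sigma_z):
    """Negative log-likelihood in Equation (33), given quantities computed by the flow.

    z               : (n, k) array of normalized values h(v_l; theta)
    log_abs_det_jac : (n,) array holding sum_i ln(dh_i/dv_i^l) for each observation
    sigma_z         : (k, k) covariance matrix with sensitivity correlations off the diagonal
    """
    log_fz = multivariate_normal(mean=np.zeros(z.shape[1]), cov=sigma_z).logpdf(z)
    return -np.sum(log_fz + log_abs_det_jac)

# Example: four variables with a correlation of 0.4 between Z3 and Z4.
sigma_z = np.eye(4)
sigma_z[2, 3] = sigma_z[3, 2] = 0.4
z = np.random.default_rng(0).multivariate_normal(np.zeros(4), sigma_z, size=500)
print(modified_neg_log_likelihood(z, np.zeros(500), sigma_z))
```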
After training a cGNF with this modified loss function, the Monte Carlo sampling algorithm must also be adapted to properly generate effect estimates adjusted for confounding bias. To modify the sampling algorithm, we simply draw our initial $J$ Monte Carlo samples of $\tilde{Z}_i^j$, for $j = 1, \ldots, J$ and $i = 1, \ldots, k$, from the multivariate normal distribution given by $\mathcal{N}(\mathbf{0}, \Sigma_Z)$. These samples are then transformed using the inverse of a cGNF that has been optimized with the loss function from Equation (33). Otherwise, the sampling algorithm proceeds exactly as outlined in the previous section.
Effect estimates generated by this modified training and sampling procedure are adjusted for confounding
bias, as represented through the correlations between normalized disturbance terms in ΣZ. Their sensitivity
to different forms of unobserved confounding can then be assessed by training and computing estimates from
multiple cGNFs, using a range of values for these correlations.
To illustrate, consider the goal of estimating the average total effect of $V_3$ on $V_4$ in the presence of unobserved confounding, as depicted in Figure 5. To construct bias-adjusted estimates in this scenario, we first train a cGNF with the following form:
$$\mathbf{Z} = h(\mathbf{V}; \theta) = \begin{bmatrix} Z_1 \\ Z_2 \\ Z_3 \\ Z_4 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & \rho_{Z_3,Z_4} \\ 0 & 0 & \rho_{Z_3,Z_4} & 1 \end{bmatrix} \right) = \mathcal{N}(\mathbf{0}, \Sigma_Z). \tag{34}$$
In this model, $Z_3$ and $Z_4$ can be interpreted as normalized transformations of the disturbance terms $\epsilon_{V_3}$ and $\epsilon_{V_4}$, respectively, with $\rho_{Z_3,Z_4}$ denoting their correlation on this transformed scale. By specifying a range of values for $\rho_{Z_3,Z_4}$ and training different cGNFs using a modified loss function, we could then obtain a corresponding range of effect estimates adjusted for different types of unobserved confounding. To generate these estimates, we implement the sampling algorithm for the $\mathrm{ATE}_{V_3 \rightarrow V_4}$ exactly as described in Section 4.4.1, but with two modifications: our initial samples of $\tilde{Z}_i^j$ are drawn from $\mathcal{N}(\mathbf{0}, \Sigma_Z)$, and then they are transformed using the inverse of a cGNF optimized with the loss function from Equation (33).
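A sketch of the corresponding sampling step is given below. The base samples are drawn from $\mathcal{N}(\mathbf{0}, \Sigma_Z)$ rather than $\mathcal{N}(\mathbf{0}, \mathbf{I})$, and the placeholder transformations stand in for a cGNF retrained with the modified loss; in a real analysis those transformations would be re-estimated for each value of the sensitivity parameter, which is what produces the bias-adjusted estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholders for the inverse normalizers of a cGNF retrained with
# the modified loss in Equation (33). In a real analysis these transformations
# would differ for each value of the sensitivity parameter.
def h1_inv(z1):
    return z1

def h2_inv(z2, v1):
    return 0.5 * v1 + z2

def h4_inv(z4, v2, v3):
    return 0.3 * v2 + 0.7 * v3 + z4

def bias_adjusted_ate_v3_v4(v3_prime, v3, rho_z3_z4, J=100_000):
    """ATE of V3 on V4 with base samples drawn from N(0, Sigma_Z) instead of N(0, I)."""
    sigma_z = np.eye(4)
    sigma_z[2, 3] = sigma_z[3, 2] = rho_z3_z4          # correlation between Z3 and Z4
    z = rng.multivariate_normal(np.zeros(4), sigma_z, size=J)
    v1 = h1_inv(z[:, 0])
    v2 = h2_inv(z[:, 1], v1)
    # V3 is the variable subject to intervention, so its base draw Z3 is not transformed here.
    v4_prime = h4_inv(z[:, 3], v2, v3_prime)
    v4 = h4_inv(z[:, 3], v2, v3)
    return np.mean(v4_prime - v4)

# Sweep the sensitivity parameter over a range of hypothesized values.
for rho in (0.0, 0.2, 0.4):
    print(rho, bias_adjusted_ate_v3_v4(1.0, 0.0, rho))
```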
In sum, cGNFs seamlessly integrate with methods of sensitivity analysis for unobserved confounding.
This is accomplished by modifying the training and sampling process to reflect a prescribed set of correla-
tions among normalized versions of the disturbance terms. This approach to sensitivity analysis is extremely
versatile. It enables construction of bias-adjusted estimates for a wide range of effects across many differ-
ent types of unobserved confounding, simply by varying the correlation structure among the transformed
disturbances.
4.6 Summary
A cGNF arranges a set of variables in causal order and then maps each one to the standard normal dis-
tribution, conditional on its parents, as defined by an assumed DAG. This mapping is accomplished with
UMNNs–a special class of artificial neural networks that can approximate any monotonic transformation–
trained on sample data by the method of SGD. Because UMNNs are invertible, a trained cGNF enables Monte
Carlo sampling from the observational joint distribution of the data by first drawing samples from the stan-
dard normal distribution and then recursively applying the inverse of the flow. Moreover, this sampling
procedure can be selectively truncated to facilitate Monte Carlo sampling from many different interventional
distributions, with the resulting samples used to construct estimates for a wide range of causal effects. All
these procedures can be modified to assess robustness to different types of unobserved confounding.
The workflow for implementing an analysis based on cGNFs can thus be summarized as follows:
1. Draw a DAG, using theory and prior knowledge.
2. Determine which estimands can be identified from the observed data, given the DAG.
3. Train a cGNF on the observed data.
(a) Specify the UMNN architecture.
(b) Specify the SGD hyperparameters.
(c) Execute the training algorithm.
4. Use the cGNF for Monte Carlo sampling from the relevant interventional distributions.
5. Construct estimates for the target estimands using the Monte Carlo samples.
6. Assess the sensitivity of these estimates to unobserved confounding.
The accuracy of the effect estimates produced by this procedure hinges crucially on whether the assumed
DAG is correct. If the assumed DAG is incorrect, estimates from the cGNF will be biased. The accuracy
of these estimates also depends on the expressiveness of the neural networks used to model the cGNF. If
the UMNNs are not sufficiently expressive, they may not provide an accurate approximation for the true
but unknown joint distribution. In this situation, cGNFs may also produce biased estimates, even when the
assumed DAG is accurate.
Beyond the accuracy of the DAG and the expressiveness of the networks, the performance of cGNFs also
depends on the amount of sample data available to train them, whether the training algorithm converges to an
optimal solution from its random initialization, and the number of Monte Carlo samples generated from the
trained model for estimation. To quantify the uncertainty in estimates due to sampling error, training error,
and simulation error taken together, we recommend constructing confidence intervals using the percentiles of
a synthetic sampling distribution generated via the non-parametric bootstrap (Tibshirani and Efron, 1993).
Although computationally demanding to implement, the bootstrap can quantify uncertainty in a wide variety
of predictions from artificial neural networks with a high degree of accuracy (Franke and Neumann, 2000;
Heskes, 1996).
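For reference, the percentile bootstrap we recommend can be sketched as follows. The estimator argument stands in for the full pipeline of retraining a cGNF on each resample and recomputing the effect estimate; a simple sample mean is used here so the example runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_percentile_ci(data, estimator, n_boot=600, alpha=0.10):
    """Percentile bootstrap CI from estimates recomputed on resamples drawn with replacement."""
    n = len(data)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        resample = data[rng.integers(0, n, size=n)]   # draw n rows with replacement
        estimates[b] = estimator(resample)            # e.g., retrain cGNF and estimate an effect
    lower, upper = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# Toy usage with the sample mean standing in for a cGNF effect estimate (90% CI).
data = rng.normal(loc=1.0, scale=2.0, size=1_000)
print(bootstrap_percentile_ci(data, np.mean, n_boot=200))
```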
5 Empirical Illustrations
We illustrate the utility of cGNFs by revisiting two seminal studies of social mobility. The first is Blau and
Duncan’s (1967) classic analysis of status attainment, which pioneered the study of mobility processes using
parametric structural equation models (SEMs). The second, Zhou’s (2019) recent analysis of conditional
versus controlled mobility, leveraged a directed acyclic graph (DAG) to guide semi-parametric estimation of
several specific estimands in a broader causal system.
5.1 Reanalysis of Blau and Duncan (1967)
In The American Occupational Structure, Blau and Duncan (1967) introduced a groundbreaking approach
to studying inter-generational social mobility in the United States. Specifically, they employed linear path
analysis–a form of parametric structural equation modeling–to assess the influence of family background,
education, and early career achievements on later occupational attainment. Drawing from the 1962 “Oc-
cupational Changes in a Generation” (OCG) survey, the study analyzed data from about 20,000 American
men between the ages of 20 and 64, offering a comprehensive view of the adult male workforce at the time.
The study centered its analysis on two main variables: educational attainment and occupational status,
relating these measures across generations (i.e., between fathers and sons). In the OCG survey, educational
attainment was originally measured in years of formal schooling, which was then grouped into nine distinct
categories, ranging from “no formal education” to “postgraduate studies.” Occupational status was measured
using the Duncan Socioeconomic Index (SEI), a scale that assigned an estimated prestige score to each
occupational category in the OCG survey, ranging from 0 to 96. This score was calculated by weighting
and combining the average income and educational levels of different occupations, with weights based on the
relation of these characteristics to a separate set of occupational prestige ratings.
Figure 6: A DAG corresponding to Blau and Duncan’s (1967) Linear Path Model.
Note: $V$ denotes father's educational attainment, $X$ denotes father's occupational status, $U$ denotes son's educational attainment, $W$ denotes the occupational status of the son's first job, and $Y$ denotes the occupational status of the son's job in 1962. The dashed, bidirectional arrow connecting $\epsilon_V$ and $\epsilon_X$ indicates that these disturbances are correlated (i.e., $V$ and $X$ share unobserved common causes).
Using these data, Blau and Duncan (1967) fit a linear path model that included five variables: father’s
educational attainment (V), father’s occupational status (X), son’s educational attainment (U), the occu-
pational status of the son’s first job (W), and the occupational status of the son’s job in 1962 (Y), when
the OCG survey was fielded.11 The hypothesized relationships among them are depicted in Figure 6, which
contains a DAG corresponding to Blau and Duncan's (1967) original model. In this graph, $V$ affects $X$ and $U$; $X$ affects $U$, $W$, and $Y$; $U$ affects $W$ and $Y$; and $W$ affects $Y$. The dashed, bidirectional arrow connecting the disturbance terms for $V$ and $X$ indicates that these variables share unobserved common causes (e.g., grandfather's occupational status). The DAG differs from Blau and Duncan's (1967) linear path model in that it does not impose any restrictions on the functional form of these relationships.
11 We adopt the same notation for these variables as in Blau and Duncan (1967).
In our reanalysis, we employ a cGNF based on the DAG in Figure 6, training it on the OCG data.
Our model architecture includes an embedding network with five hidden layers composed of 100, 90, 80, 70, and 60 nodes, in succession. The integrand network likewise features five hidden layers, with 60, 50, 40, 30, and 20 nodes, respectively. We train the cGNF using stochastic gradient descent (SGD) to
minimize the negative log-likelihood, adopting a batch size of 128 and a learning rate of 0.0001. The training
process is halted when there has been no reduction in the negative log-likelihood for 50 consecutive epochs,
as computed on a one-fifth validation sample held out from the OCG data.
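For readers who wish to mirror this setup, the settings just described can be collected in a plain configuration dictionary like the one below; the field names are ours and are not tied to any particular software interface.

```python
# A minimal sketch of the training configuration described above. The keys are
# illustrative labels, not arguments of any specific library.
config = {
    "embedding_hidden_layers": [100, 90, 80, 70, 60],   # embedding network
    "integrand_hidden_layers": [60, 50, 40, 30, 20],    # integrand network (UMNN)
    "batch_size": 128,
    "learning_rate": 1e-4,
    "early_stopping_patience": 50,    # epochs without improvement in validation loss
    "validation_fraction": 0.2,       # one-fifth holdout from the OCG data
}
```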
After training the cGNF, we use it to estimate several different effects of interest, including the average
total effect of father’s occupational status (X) on son’s occupational status (Y), the average total effect
of son’s education (U) on son’s occupational status (Y), and the natural direct and indirect effects of
father’s occupational status (X) on son’s occupational status (Y), as mediated by son’s education (U). To
quantify the uncertainty in these estimates, we construct 90 percent confidence intervals using the non-
parametric bootstrap with 600 replications. This involves repeatedly retraining the cGNF and recalculating
effect estimates on random samples from the OCG data, drawn with replacement. The confidence intervals
are then given by the 95th and 5th percentiles of the resulting bootstrap distribution. Replication files
for this analysis are available at https://github.com/gtwodtke/deep_learning_with_DAGs/tree/main/
blau_duncan_1967.
Figure 7: Effect of $X$ (Father's Occupational Status) on $Y$ (Son's Occupational Status).
Figure 7 presents an estimated response function describing the average total effect of father’s occupa-
tional status on son’s occupational status. Overall, these results are consistent with Blau and Duncan’s (1967)
linear path model, which showed a strong positive relationship between occupational attainment
across generations. However, estimates from the cGNF also reveal some non-linearity in this relationship,
particularly at the upper and lower ends of the status spectrum among fathers.
Figure 8: Effect of $U$ (Son's Education) on $Y$ (Son's Occupational Status).
In Figure 8, we present a response function that summarizes the average total effect of a son’s education
on his subsequent occupational status. These results show a strong positive relationship between education
and occupational attainment, aligning with Blau and Duncan (1967). As before, estimates from the cGNF
also uncover considerable non-linearity: differences in educational attainment up to high school exert a comparatively modest impact compared to those at the post-secondary level and beyond.
Figure 9 presents the natural direct and indirect effects of father’s occupational status on son’s occu-
pational status, as mediated by son’s education. These effects are based on contrasts between adjacent
quintiles of the status distribution among fathers, and they are consistent with the linear path model of Blau
and Duncan (1967), which demonstrated a sizeable mediating role for educational attainment. Nevertheless,
estimates from the cGNF suggest that education does not seem to play a very important mediating role
at the lower end of the status distribution among fathers. In contrast, at the higher end, education is a
powerful mediator, as indicated by the large indirect effects.
All these estimates are predicated on the critical assumption that there is no unobserved confounding of
the focal relationships. If, for example, factors like motivation or prior achievement in high school, which
are not observed in the OCG data, confound the relationship between a son’s education and occupational
status, then our estimates involving the effects of Uon Ywould be biased.
Figure 10 presents results from a sensitivity analysis assessing the robustness of our findings to hypothet-
ical patterns of unobserved confounding. It displays bias-adjusted estimates for the effect of son’s education
on their occupational status, plotted across values of the sensitivity parameter $\rho_{Z_U,Z_Y}$. Positive values for
Figure 9: Natural direct and indirect effects of $X$ (Father's Occupational Status) on $Y$ (Son's Occupational Status), as mediated via $U$ (Son's Education).
Note: The natural direct and indirect effects are here based on contrasts between adjacent quintiles of the
sample distribution for father’s occupational status. The first through fifth quintiles of this variable are 9,
14, 18, 41, and 61, respectively.
this parameter imply that individuals select into higher levels of education on the basis of unobserved fac-
tors that also lead them to attain higher-status occupations, with larger values signaling stronger forms of
selection.
The results show that, across a range of plausible values for $\rho_{Z_U,Z_Y}$, the estimated response function remains relatively stable. The response function begins to flatten out only at large values for $\rho_{Z_U,Z_Y}$, but
even under this rather extreme level of unobserved confounding, post-secondary education continues to exert
a strong effect on occupational attainment. This stability suggests that our inferences about the causal
relationship between Uand Yare likely robust to unobserved confounding.
In sum, our reanalysis indicates that the status attainment process involves nontrivial departures from
linearity, a pattern that traditional path models fail to capture. By utilizing a cGNF, we circumvent the
restrictive assumptions of standard SEMs, gain the ability to uncover and describe more complex relation-
ships, and enable a seamless assessment of how these relationships vary under different forms of unobserved
confounding.
5.2 Reanalysis of Zhou (2019)
Zhou (2019) examined the influence of higher education on inter-generational income mobility in the United
States. The study showed that the relationship between the incomes of parents and their children is weaker
for college graduates than for those with less education. However, this observed pattern could stem from
non-random selection into post-secondary education. The process through which children select into college
Figure 10: Sensitivity of the Effect of $U$ (Son's Education) on $Y$ (Son's Occupational Status) to Unobserved $U$-$Y$ Confounding.
is complex, as it involves factors like academic performance in high school, which not only confound the
effects of college among children but are also influenced by parental income. To navigate this complicated
selection process, Zhou (2019) used a DAG to guide the design of a semi-parametric estimator capable of
accurately evaluating whether a college degree actually promotes greater mobility.
The study analyzed data from 4,673 respondents in the 1979 National Longitudinal Survey of Youth
(NLSY), all under 19 years old at baseline. The analysis centered on the following variables: the income
rank of parents when the respondent was in high school (X), the income rank of the respondent as an
adult (Y), and whether the respondent graduated from college (C). A set of baseline controls (B) was also
incorporated into the analysis. These include measures of gender, race, urban residence, family structure,
and parental education. Furthermore, the analysis included two additional variables: the respondent’s
educational expectations and scores on the Armed Forces Qualification Test (AFQT), measured during their
high school years. These variables, denoted by $L$ and $A$, respectively, may be influenced by parental income
and may also confound the relationship of education to income among respondents.
We conceptually replicate Zhou’s (2019) analysis using a cGNF based on the DAG in Figure 11. In
this graph, the baseline confounders (B) affect all downstream variables. Parental income (X) influences
a respondent’s income (Y) and their likelihood of college graduation (C). It also influences a respondent’s
educational expectations (L) and AFQT scores (A) in high school. Additionally, attaining a college degree
affects respondent income, and this effect is confounded by their expectations and test scores.
We train our cGNF with the NLSY data, adopting the same architecture, hyper-parameters, and stopping
criterion as outlined in the previous section. After training, we then use our model to estimate two effects
that reflect the influence of education on income mobility.
First, we estimate the conditional effect of parental income on respondent income, given a respondent’s
Figure 11: DAG based on Zhou (2019).
Note: $B$ denotes a set of baseline confounders; $X$ denotes parental income rank; $L$ and $A$ denote educational expectations and AFQT scores when the respondent was in high school, respectively; $C$ denotes whether the respondent graduated from college; and $Y$ denotes the respondent's income rank as an adult. For simplicity, the random disturbances are suppressed from this graph.
educational attainment. This effect can be formally expressed as $\mathrm{CATE}_{X \rightarrow Y \mid C} = E[Y(x') - Y(x) \mid C = c]$. It captures the influence of parental income on respondent income among sub-populations of respondents categorized by their observed level of educational attainment.
Second, we also estimate a joint effect of parental income and respondent education on respondent income. This estimand can be formally expressed as $\mathrm{AJE}_{X,C \rightarrow Y} = E[Y(x', c) - Y(x, c)]$, which is a type of controlled direct effect (VanderWeele, 2015). It captures the influence of parental income on respondent income under a hypothetical intervention where all respondents attain the level of education given by $c$.
In addition, we also estimate the path-specific effects of parental income on respondent income that
operate through educational expectations, AFQT scores, and college graduation. As special types of indirect
effects, the estimands we consider here capture the mediating role of a given variable in the causal chain
connecting parental with respondent income, net of other potential mediators that precede it (Zhou and
Yamamoto, 2023).
Specifically, the first of these effects can be defined as follows:
$$\mathrm{PSE}_{X \rightarrow C \rightarrow Y} = E\!\left[ Y(x', L(x), A(x, L(x))) - Y(x', L(x), A(x, L(x)), C(x, L(x), A(x, L(x)))) \right]. \tag{35}$$
This expression captures an effect of parental income on respondent income mediated through a respondent's educational attainment, but not their prior expectations or test scores. It is represented by the $X \rightarrow C \rightarrow Y$ path in isolation.
Similarly, another path-specific effect captures the influence of parental income on respondent income mediated through AFQT scores, but not prior expectations. This effect can be expressed as follows:
$$\mathrm{PSE}_{X \rightarrow A \rightarrow Y} = E\!\left[ Y(x', L(x)) - Y(x', L(x), A(x, L(x))) \right]. \tag{36}$$
It reflects the influence of parental income transmitted along the $X \rightarrow A \rightarrow Y$ and $X \rightarrow A \rightarrow C \rightarrow Y$ paths combined.
The last effect of interest can be defined as follows:
$$\mathrm{PSE}_{X \rightarrow L \rightarrow Y} = E\!\left[ Y(x') - Y(x', L(x)) \right]. \tag{37}$$
This expression captures the influence of parental income on respondent income mediated through educational expectations, as transmitted along the $X \rightarrow L \rightarrow Y$, $X \rightarrow L \rightarrow A \rightarrow Y$, and $X \rightarrow L \rightarrow A \rightarrow C \rightarrow Y$ paths together. It is equivalent to the natural indirect effect of $X$ on $Y$ via $L$.
These particular effects, while not examined by Zhou (2019), are important nonetheless for evaluating the
study’s broader theoretical model of mobility and selection. They illuminate the potential for parental income
to influence intermediate variables, like expectations and test scores in high school, which subsequently shape
post-secondary attainment and income later in adulthood. In other words, they reflect the complex selection
process hypothesized to contaminate the observed relationship of higher education with greater income
mobility.
Table 1: cGNF Estimates of the Conditional, Controlled, and Path-Specific Effects of Parental Income on Respondent Income

Estimand                                              Point Est.    90% Bootstrap CI
Conditional Effects
  $\mathrm{CATE}_{X \rightarrow Y \mid C=0}$             .102        (.062, .135)
  $\mathrm{CATE}_{X \rightarrow Y \mid C=1}$             .121        (.051, .146)
Controlled Direct Effects
  $\mathrm{AJE}_{X, C=0 \rightarrow Y}$                  .097        (.062, .131)
  $\mathrm{AJE}_{X, C=1 \rightarrow Y}$                  .077        (-.026, .103)
Path-Specific Effects
  $\mathrm{PSE}_{X \rightarrow C \rightarrow Y}$         .003        (-.002, .007)
  $\mathrm{PSE}_{X \rightarrow A \rightarrow Y}$         .023        (.008, .025)
  $\mathrm{PSE}_{X \rightarrow L \rightarrow Y}$         .008        (-.002, .019)

Note: All effects contrast the first with the third sample quartile of parental income ($X$). Confidence intervals are based on the 5th and 95th percentiles of a bootstrap distribution with 600 replicates.
Table 1 presents the conditional, controlled, and path-specific effects of interest, as estimated by our
cGNF. All of them contrast the first quartile with the third quartile of parental income. Overall, they are
consistent with the results reported by Zhou (2019). Estimates for the conditional effects indicate that the
influence of parental income on respondent income, after adjusting for observed selection, is fairly similar among those with and without a college education ($\widehat{\mathrm{CATE}}_{X \rightarrow Y \mid C=0} = .102$ versus $\widehat{\mathrm{CATE}}_{X \rightarrow Y \mid C=1} = .121$).
Furthermore, estimates for the controlled direct effects suggest that the influence of parental income on respondent income would also be fairly similar regardless of any intervention to expand or contract access to higher education ($\widehat{\mathrm{AJE}}_{X, C=0 \rightarrow Y} = .097$ versus $\widehat{\mathrm{AJE}}_{X, C=1 \rightarrow Y} = .077$). This suggests that increasing access to college is unlikely to boost income mobility very much, in line with Zhou's (2019) conclusions.
Our estimates for the path-specific effects point toward a potential explanation. They suggest that the influence of parental income operating exclusively through its impact on respondent education, but not through upstream factors like expectations and test scores in high school, is small ($\widehat{\mathrm{PSE}}_{X \rightarrow C \rightarrow Y} = 0.003$). Rather, the most important mechanism through which parental income affects respondent income appears to involve achievement test scores in high school ($\widehat{\mathrm{PSE}}_{X \rightarrow A \rightarrow Y} = 0.023$). These findings suggest that higher
parental incomes may lead to higher incomes for the next generation partly because they contribute to better
college preparedness, thereby enhancing the likelihood of graduation and subsequent success in the labor
market. Replication files for this analysis are available at https://github.com/gtwodtke/deep_learning_
with_DAGs/tree/main/zhou_2019.
6 Discussion
In this study, we introduced causal-graphical normalizing flows (cGNFs), a novel approach to analyzing
systems of causal relationships between variables. This approach integrates directed acyclic graphs (DAGs)
with unconstrained monotonic neural networks (UMNNs) to model entire causal systems, without relying on
restrictive parametric assumptions. The key advantage of cGNFs lies in their ability to flexibly approximate
the full joint distribution of the data, using its Markov factorization. This facilitates estimation via Monte
Carlo sampling for a wide range of causal estimands, including but not limited to total, conditional, direct,
indirect, and path-specific effects. Extending recent advances in machine learning and causal inference
(Athey et al., 2019; Chernozhukov et al., 2018; Koch et al., 2021; Van Der Laan and Rubin, 2006), cGNFs
transcend the prevailing focus on a narrow set of estimands and unlock the possibility of comprehensively
evaluating more elaborate causal theories.
We illustrated the utility of cGNFs by reanalyzing two seminal studies of social mobility in the United
States. The first reanalysis, drawing on Blau and Duncan’s (1967) study of the status attainment process,
highlighted the method’s ability to uncover and describe nonlinear relationships. These relationships are often
obscured by the functional form constraints traditionally imposed by parametric structural equation models
(SEMs). Our second reanalysis focused on Zhou’s (2019) study of conditional versus controlled mobility. In
this illustration, we demonstrated the ability of cGNFs to detect complex forms of effect moderation and
interaction, to adjust for dynamic selection processes, and to concurrently evaluate multiple aspects of a
broader causal theory. Together, these empirical examples highlight the potential of cGNFs for modeling
entire causal systems.
While cGNFs offer numerous advantages, they are certainly not without limitations, which in turn suggest
important directions for future research. A central limitation is that cGNFs depend on the analyst to identify
a correct DAG summarizing the causal relationships of interest. If the assumed DAG is incorrect, cGNFs
may produce inaccurate estimates for certain effects in the causal system. This limitation is not unique
to cGNFs. It afflicts almost any method of causal inference, including regression imputation, propensity
score matching or weighting, and instrumental variables, where incorrect assumptions about the underlying
structural model can lead to faulty estimates. However, the task of identifying a correct DAG is arguably
more important for cGNFs, since their objective is to evaluate multiple features of a broader causal system.
Notably, cGNFs can still provide accurate estimates for certain causal effects, even when parts of the DAG
are incorrectly specified. For example, they can still accurately estimate the total effect of one variable on
another, even if the DAG assumes an inaccurate set of relationships among the variables that confound this
effect. Future research must systematically establish, in generalizable terms, which regions of the DAG must
be accurately specified for cGNFs to reliably estimate particular effects within the broader causal system.
Similarly, cGNFs do not avoid any of the challenges associated with unobserved confounding, which are
ubiquitous in the social sciences. Identifying and measuring the many different variables that may confound
an effect of interest is a formidable task, regardless of the approach to modeling and estimation. In general,
if a target estimand is not identified due to unobserved confounding, cGNFs will yield inaccurate estimates
for this effect. In the present study, we demonstrated how cGNFs can be seamlessly integrated with methods
for assessing the sensitivity of estimates to unobserved confounding. Beyond this approach, future research
should also explore how cGNFs might be combined with other methods designed to identify causal effects
when unobserved confounding is present. These include non-parametric instrumental variable models (Newey
and Powell, 2003; Newey, 2013) and deep learning techniques for inferring and controlling latent confounders
by proxy (Louizos et al., 2017).
Even with a correct DAG and a cleanly identified estimand, cGNFs face another limitation: at present,
there is no theoretical guarantee that they will provide consistent point estimates or that bootstrap intervals
will have asymptotically valid coverage rates. As universal density approximators, UMNNs possess the
capacity to model any distribution to an arbitrary degree of accuracy (Huang et al., 2018; Wehenkel and
Louppe, 2019). Thus, with a sufficiently expressive architecture and enough data for training, they can, in
principle, recover any feature of the target distribution exactly. However, the precise conditions required
for UMNNs to reach this level of accuracy are not yet fully understood. What are the architectures, hyper-
parameter settings, and volume of data needed for cGNFs to estimate causal effects with no more than a
trivial degree of error?
In Part A.2 of the Appendix, we present results from a series of Monte Carlo experiments indicating that
cGNFs yield estimates with low bias and variance in sufficiently large samples, even with relatively simple
architectures and standard hyper-parameter settings. In general, the bias and variance appear to decline
monotonically as the sample size increases, and for samples of 16,000 cases or more, cGNF estimates for a wide
range of effects are approximately unbiased and exhibit high stability under repeated sampling. We observe
this pattern of results whether the data generating process is very simple (i.e., linear and additive with normal
disturbances), incorporates discrete variables, or involves substantial non-linearity, effect heterogeneity, and
non-normality.
Existing simulation studies additionally suggest that bootstrap methods generate confidence intervals
with satisfactory coverage rates for neural network predictions (Franke and Neumann, 2000; Heskes, 1996).
In some cases, bootstrap intervals even appear to be conservative (Papadopoulos et al., 2001). In Part A.3
of the Appendix, we present results from a Monte Carlo experiment that align with these prior studies.
Specifically, our experiment suggests that, with a sufficiently large sample ensuring minimal bias in cGNF
estimates, bootstrap intervals have coverage rates that are slightly conservative. Nevertheless, until the
asymptotic properties of cGNFs are conclusively established, any inferential statistics derived from them
should be interpreted with caution.
To address uncertainties surrounding inference, a potential solution involves utilizing cGNFs to generate
the components of multiply robust estimators with known asymptotic properties. These estimators include
targeted maximum likelihood, augmented inverse probability weighting, and other related approaches based
on the efficient influence curve (Glynn and Quinn, 2010; Van Der Laan and Rubin, 2006; Zhou, 2022).
Because cGNFs provide access to the full joint distribution of the data, they can be used to construct all the
terms that compose these estimators, including propensity scores, conditional means, and residuals. Using
cGNFs in this way would produce effect estimates endowed with desirable properties, such as consistency,
efficiency, and asymptotic normality. The advantage of cGNFs over other machine learning methods is their
ability to generate the components for many different robust estimators simultaneously, each tailored to a
particular effect in a broader causal system. Thus, future research should further explore the possibility of
combining cGNFs with multiply robust approaches to estimation.
Another challenge for cGNFs involves the limited capacity of neural networks to interpolate or extrap-
olate beyond the observed data. Although cGNFs can accurately model any joint distribution in theory,
their estimates may be less reliable in regions of the data space with few observations. Consequently, the
performance of cGNFs may vary inversely with the degree of sparsity, or in other words, their ability to
model entire causal systems, sans functional form assumptions, comes with an increased hunger for data.
Researchers must therefore exercise caution in assessing the suitability of any positivity conditions necessary
for non-parametric identification of their target estimands.
The computational complexity of training cGNFs presents an additional challenge. For example, in our
reanalysis of Blau and Duncan (1967), training the model and generating point estimates required about
3 hours of wall time on an Nvidia A100 graphical processing unit (GPU). Computing the 600 bootstrap
estimates in this analysis took another 44 hours using a high-performance computing cluster (HPC), with
each bootstrap iteration distributed across a separate central processing unit (CPU).12 While executing this
analysis on a standard personal computer, which is typically equipped with 4 to 16 CPUs and a consumer-
grade GPU, remains possible, the total computation time would significantly increase. Future research
should therefore prioritize increasing the speed and efficiency of the computations needed to train a cGNF
and then generate effect estimates.
Beyond the challenges of sparsity and computational complexity, model selection is another area ripe for
additional research (Koch et al., 2021). In our context, model selection involves comparing and choosing
among a set of candidate cGNFs after training, each distinguished by differences in its architecture or hyper-
parameters. These models might also vary based on the values used to initialize their weights, as the loss
functions associated with cGNFs are non-convex, and it is possible for the training process to terminate in a
local rather than global minimum. The range of approaches for selecting architectures and hyper-parameters
is broad, spanning from entirely random choices to exhaustive grid searches. Alternatively, model selection
could also be directed by theoretical insights and prior knowledge, focusing on models presumed to be
sufficiently expressive and to train smoothly. A common metric used for these comparisons is the best
validation loss achieved before the training algorithm terminates, although other metrics may be worth
exploring (e.g., Akaike or Bayesian information criteria).
In Part A.4 of the Appendix, we provide evidence from another set of Monte Carlo experiments, indicating
that the performance of cGNFs is fairly robust to modest variations in model architecture and hyper-
parameter settings. The experiments demonstrate that, across a range of sensible architectures, batch
sizes, and learning rates, cGNFs consistently produce estimates with low bias and variance in sufficiently
large samples. Despite these encouraging results, the development of more sophisticated model selection
techniques–specifically tailored for deep neural networks designed to estimate causal effects–remains a critical
necessity. Future research focusing on model selection with cGNFs will be essential to further enhance the
reliability of these methods for causal inference (Alaa and Van Der Schaar, 2019; Parikh et al., 2022).
Deep neural networks, including cGNFs, are also not immune to the challenges posed by measurement
error. Inaccuracies in the input data can distort the training process, resulting in erroneous output. While
parametric structural equation models (SEMs) can presently incorporate an extensive set of techniques for
addressing measurement error (Bollen, 1989), analogous methods for use with deep learning models are still
in their infancy (Hu et al., 2022). Because measurement error is pervasive in social science data, future
research should explore the possibility of integrating measurement models into the cGNF architecture. This
could potentially mitigate the impact of inaccurately measured inputs and improve the network’s ability to
12Specifically, we parallelized the calculations for each bootstrap sample across 15 HPC nodes and 40 CPUs per node.
distinguish signal from noise.
These limitations notwithstanding, cGNFs offer enormous potential for learning about causal systems
in the social sciences, transcending parametric SEMs that are typically based on naive assumptions about
functional form. We expect that they will find wide application, not only in research on social mobility but
wherever interest lies in studying broader systems of causal relationships. While current limitations may
forestall the prudent application of cGNFs in some cases, this article lays the foundation for future research
aimed at addressing these challenges, including those related to sparsity, valid inference, computational
complexity, model selection, and measurement error. By building on recent advances in machine learning
and causal inference, cGNFs represent a significant step toward a more complete integration of the deep
learning and causal revolutions (Pearl, 2018; Pearl and Mackenzie, 2018; Sejnowski, 2018).
References
Abbott, A. (1988). Transcending general linear reality. Sociological Theory, 6(2):169–186.
Alaa, A. and Van Der Schaar, M. (2019). Validating causal inference models via influence functions. In
Proceedings of the International Conference on Machine Learning, pages 191–201.
Allen-Zhu, Z., Li, Y., and Liang, Y. (2019). Learning and generalization in overparameterized neural net-
works. In Advances in Neural Information Processing Systems, pages 1–12.
Alwin, D. F. and Hauser, R. M. (1975). The decomposition of effects in path analysis. American Sociological
Review, 40(1):37–47.
Athey, S., Tibshirani, J., and Wager, S. (2019). Generalized random forests. The Annals of Statistics,
47(2):1148–1178.
Balgi, S., Peña, J. M., and Daoud, A. (2022a). Personalized public policy analysis using causal-graphical
normalizing flows. In Proceedings of the AAAI Conference on Artificial Intelligence.
Balgi, S., Peña, J. M., and Daoud, A. (2022b). ρ-GNF: A novel sensitivity analysis approach under unob-
served confounders. arXiv Preprint arXiv:2209.07111.
Becker, G. S. and Tomes, N. (1979). An equilibrium theory of the distribution of income and intergenerational
mobility. Journal of Political Economy, 87(6):1153–1189.
Bengio, Y. and Bengio, S. (1999). Modeling high-dimensional discrete data with multi-layer neural networks.
Advances in Neural Information Processing Systems, pages 400–406.
Blau, P. M. and Duncan, O. D. (1967). The American Occupational Structure. John Wiley and Sons.
Bollen, K. A. (1989). Structural Equations With Latent Variables. John Wiley and Sons.
Bollen, K. A., Fisher, Z., Lilly, A., Brehm, C., Luo, L., Martinez, A., and Ye, A. (2022). Fifty years
of structural equation modeling: A history of generalization, unification, and diffusion. Social Science
Research, 107:102769.
Brutzkus, A., Globerson, A., Malach, E., and Shalev-Shwartz, S. (2017). Stochastic gradient descent
learns over-parameterized networks that provably generalize on linearly separable data. arXiv Preprint
arXiv:1710.10174.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018).
Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal,
21(1):C1–C68.
Chollet, F. (2021). Deep Learning with Python. Simon and Schuster.
Clenshaw, C. W. and Curtis, A. R. (1960). A method for numerical integration on an automatic computer.
Numerische Mathematik, 2:197–205.
Daniel, R. M., De Stavola, B. L., and Cousens, S. N. (2011). Gformula: Estimating causal effects in the
presence of time-varying confounding or mediation using the g-computation formula. The Stata Journal,
11(4):479–517.
Daoud, A. and Dubhashi, D. (2023). Statistical modeling: The three cultures. Harvard Data Science Review,
5(1):1–51.
Elwert, F. (2013). Graphical causal models. In Handbook of Causal Analysis for Social Research, pages
245–273. Springer.
Franke, J. and Neumann, M. H. (2000). Bootstrapping neural networks. Neural Computation, 12(8):1929–
1949.
Frey, B. J. (1998). Graphical Models for Machine Learning and Digital Communication. MIT Press.
Glynn, A. N. and Quinn, K. M. (2010). An introduction to the augmented inverse propensity weighted
estimator. Political Analysis, 18(1):36–56.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Haller, A. O. and Portes, A. (1973). Status attainment processes. Sociology of Education, 46(1):51–91.
Hedström, P. and Swedberg, R. (1998). Social Mechanisms: An Analytical Approach to Social Theory.
Cambridge University Press.
Heskes, T. (1996). Practical confidence and prediction intervals. In Advances in Neural Information Pro-
cessing Systems, pages 176–182.
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approx-
imators. Neural Networks, 2(5):359–366.
Hout, M. (1988). More universalism, less structural mobility: The American occupational structure in the 1980s. American Journal of Sociology, 93(6):1358–1400.
Hu, Z., Ke, Z. T., and Liu, J. S. (2022). Measurement error models: From nonparametric methods to deep
neural networks. Statistical Science, 37(4):473–493.
Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. C. (2018). Neural autoregressive flows. In
International Conference on Machine Learning, pages 2083–2092.
Imai, K., Keele, L., and Tingley, D. (2010). A general approach to causal mediation analysis. Psychological
Methods, 15(4):309–334.
Javaloy, A., Sánchez-Martín, P., and Valera, I. (2023). Causal normalizing flows: From theory to practice.
arXiv preprint arXiv:2306.05415.
Karlson, K. B. and Birkelund, J. F. (2019). Education as a mediator of the association between origins and
destinations: The role of early skills. Research in Social Stratification and Mobility, 64:100436.
Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved
variational inference with inverse autoregressive flow. In Advances in Neural Information Processing
Systems, pages 4743–4751.
Kline, R. B. (2023). Principles and Practice of Structural Equation Modeling. Guilford Publications.
Knight, C. R. and Winship, C. (2013). The causal implications of mechanistic thinking: Identification using
directed acyclic graphs. In Handbook of Causal Analysis for Social Research, pages 275–299. Springer.
Kobyzev, I., Prince, S., and Brubaker, M. (2020). Normalizing flows: An introduction and review of current
methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43:3964–3979.
Koch, B., Sainburg, T., Geraldo, P., Jiang, S., Sun, Y., and Foster, J. G. (2021). Deep learning of potential
outcomes. arXiv preprint arXiv:2110.04442.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
Lieberson, S. (1985). Making It Count: The Improvement of Social Research and Theory. University of
California Press.
Louizos, C., Shalit, U., Mooij, J. M., Sontag, D., Zemel, R., and Welling, M. (2017). Causal effect inference
with deep latent-variable models. In Advances in Neural Information Processing Systems, pages 17346–
17358.
Loury, G. C. (1981). Intergenerational transfers and the distribution of earnings. Econometrica, 49(4):843–
867.
Lundberg, I., Johnson, R., and Stewart, B. M. (2021). What is your estimand? Defining the target quantity
connects statistical evidence to theory. American Sociological Review, 86(3):532–565.
MacCallum, R. (1986). Specification searches in covariance structure modeling. Psychological Bulletin,
100(1):107.
Milnor, J. and Weaver, D. W. (1997). Topology from the Differentiable Viewpoint. Princeton University Press.
Morgan, S. L. and Winship, C. (2015). Counterfactuals and Causal Inference. Cambridge University Press.
Newey, W. K. (2013). Nonparametric instrumental variables estimation. American Economic Review,
103(3):550–556.
Newey, W. K. and Powell, J. L. (2003). Instrumental variable estimation of nonparametric models. Econo-
metrica, 71(5):1565–1578.
Nielsen, D. and Winther, O. (2020). Closing the dequantization gap. In Advances in Neural Information
Processing Systems, pages 1–11.
Papadopoulos, G., Edwards, P. J., and Murray, A. F. (2001). Confidence estimation methods for neural
networks: A practical comparison. IEEE Transactions on Neural Networks, 12(6):1278–1287.
Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. (2021). Normal-
izing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64.
Parikh, H., Varjao, C., Xu, L., and Tchetgen, E. T. (2022). Validating causal inference methods. In
International Conference on Machine Learning, pages 17346–17358.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press.
Pearl, J. (2010). The foundations of causal inference. Sociological Methodology, 40(1):75–149.
Pearl, J. (2018). Theoretical impediments to machine learning with seven sparks from the causal revolution.
arXiv preprint arXiv:1801.04016.
Pearl, J. (2022). Direct and indirect effects. In Probabilistic and Causal Inference: The Works of Judea
Pearl, pages 373–392. ACM Books.
Pearl, J. and Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books,
Inc., USA, 1st edition.
Rezende, D. and Mohamed, S. (2015). Variational inference with normalizing flows. In Proceedings of the
International Conference on Machine Learning, pages 1530–1538.
Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained exposure
period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9):1393–
1512.
Sejnowski, T. J. (2018). The Deep Learning Revolution. MIT Press.
Sewell, W. H., Haller, A. O., and Ohlendorf, G. W. (1970). The educational and early occupational status
attainment process: Replication and revision. American Sociological Review, 35(6):1014–1027.
Spirtes, P., Glymour, C. N., and Scheines, R. (2000). Causation, Prediction, and Search. MIT Press.
Tabak, E. G. and Turner, C. V. (2013). A family of nonparametric density estimation algorithms. Commu-
nications on Pure and Applied Mathematics, 66(2):145–164.
Tabak, E. G. and Vanden-Eijnden, E. (2010). Density estimation by dual ascent of the log-likelihood.
Communications in Mathematical Sciences, 8(1):217–233.
Tibshirani, R. J. and Efron, B. (1993). An Introduction to the Bootstrap. Chapman and Hall/CRC.
Torche, F. (2011). Is a college degree still the great equalizer? Intergenerational mobility across levels of
schooling in the United States. American Journal of Sociology, 117(3):763–807.
Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural autoregressive density-
estimator. In Advances in Neural Information Processing Systems, pages 2175–2183.
Van Der Laan, M. J. and Rubin, D. (2006). Targeted maximum likelihood learning. The International
Journal of Biostatistics, 2(1):1–38.
VanderWeele, T. J. (2009). On the distinction between interaction and effect modification. Epidemiology,
20(6):863–871.
VanderWeele, T. J. (2015). Explanation in Causal Inference: Methods for Mediation and Interaction. Oxford
University Press.
Wang, A. and Arah, O. A. (2015). G-computation demonstration in causal mediation analysis. European
Journal of Epidemiology, 30:1119–1127.
Wehenkel, A. and Louppe, G. (2019). Unconstrained monotonic neural networks. In Advances in Neural
Information Processing Systems, pages 1545–1555.
Wehenkel, A. and Louppe, G. (2021). Graphical normalizing flows. In Proceedings of the International
Conference on Artificial Intelligence and Statistics, pages 37–45.
Winship, C. and Mare, R. D. (1983). Structural equations and path analysis for discrete data. American
Journal of Sociology, 89(1):54–110.
Wodtke, G. T., Elwert, F., and Harding, D. J. (2016). Neighborhood effect heterogeneity by family income
and developmental period. American Journal of Sociology, 121(4):1168–1222.
Wodtke, G. T., Harding, D. J., and Elwert, F. (2011). Neighborhood effects in temporal perspective:
The impact of long-term exposure to concentrated disadvantage on high school graduation. American
Sociological Review, 76(5):713–736.
Wodtke, G. T. and Parbst, M. (2017). Neighborhoods, schools, and academic achievement: A formal
mediation analysis of contextual effects on reading and mathematics abilities. Demography, 54(5):1653–
1676.
Yang, Z., Yu, Y., You, C., Steinhardt, J., and Ma, Y. (2020). Rethinking bias-variance trade-off for gener-
alization of neural networks. In Proceedings of the International Conference on Machine Learning, pages
10767–10777.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2021). Understanding deep learning (still)
requires rethinking generalization. Communications of the ACM, 64(3):107–115.
Zhou, X. (2019). Equalization or selection? Reassessing the “meritocratic power” of a college degree in
intergenerational income mobility. American Sociological Review, 84(3):459–485.
Zhou, X. (2022). Semiparametric estimation for causal mediation analysis with multiple causally ordered
mediators. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3):794–821.
Zhou, X. and Yamamoto, T. (2023). Tracing causal paths from experimental and observational data. The
Journal of Politics, 85(1):250–265.
Ziegler, Z. M. and Rush, A. M. (2019). Latent normalizing flows for discrete sequences. In Proceedings of
the International Conference on Machine Learning, pages 7673–7682.
A Appendix
A.1 Dequantization
A normalizing flow may assume any form, provided that it is bijective and that the transformation and its
inverse are both smooth with finite first derivatives. Transformations that satisfy these criteria are known
as diffeomorphisms (Milnor and Weaver, 1997). For a continuous variable $X_1$, the normalizing flow $Z_1 = h(X_1)$ can be conceptualized as $Z_1 = h(X_1) = F_Z^{-1}(F_{X_1}(X_1))$, where $F_{X_1}$ is the cumulative distribution function (CDF) for $X_1$ and $F_Z^{-1}$ is the inverse of the standard normal CDF. For a continuous variable with
a smooth, invertible, and differentiable CDF, the composition of this CDF with the inverse normal CDF is a
diffeomorphism, and by extension, normalizing flows can readily model continuous distributions by mapping
them to and from the standard normal distribution.
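To make the CDF-composition view concrete, the following sketch (ours, not part of the cGNF software) maps a hypothetical exponentially distributed variable to the standard normal scale and back using off-the-shelf SciPy routines:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x1 = rng.exponential(scale=2.0, size=10_000)      # hypothetical continuous variable X1

    # Forward map: Z1 = h(X1) = F_Z^{-1}(F_{X1}(X1)), with F_{X1} the exponential CDF
    u = stats.expon(scale=2.0).cdf(x1)                # probability integral transform
    z1 = stats.norm.ppf(u)                            # inverse standard normal CDF

    # Inverse map: recover X1 from Z1
    x1_back = stats.expon(scale=2.0).ppf(stats.norm.cdf(z1))
    print(np.allclose(x1, x1_back))                   # True, up to floating-point error

In practice, of course, a flow learns this transformation from data rather than relying on a known CDF; the example simply illustrates the bijection that a trained flow approximates.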
However, if $X_1$ is discrete, its CDF $F_{X_1}$ is neither bijective nor smooth with a finite first derivative.
Thus, to adapt normalizing flows for discrete data, research in this area has focused on integrating them
with different forms of dequantization (e.g., Nielsen and Winther 2020; Uria et al. 2013; Ziegler and Rush
2019). Dequantization converts a discrete variable into a continuous variable by adding a small amount of
random noise to each of the discrete values. The amount of random noise is selected so that the original
discrete values can be easily recovered by rounding off the dequantized variable to its nearest integer.
In our application of normalizing flows, we dequantize discrete variables by adding normally distributed
noise with zero mean and a variance of 1/36, following Balgi et al. (2022a). With a variance of 1/36, nearly all of the random noise added to each discrete value will lie in the range $[-0.5, +0.5]$, such that the discrete values can be recovered by rounding to the nearest integer with virtually no loss of information. The addition of random noise drawn from $N(0, 1/36)$ converts a discrete variable into a continuous variable that follows a multimodal Gaussian mixture distribution, with modes at each of the original values of the discrete variable. The CDF of this Gaussian mixture is bijective and smooth with a finite first derivative, and thus its composition with the inverse normal CDF is a diffeomorphism, as above.
A normalizing flow can then be used to map the dequantized continuous variable with a multimodal
Gaussian mixture distribution to the standard normal distribution. In addition, its inverse can map from
the standard normal distribution back to the Gaussian mixture, and then the original discrete variable can
be recovered by rounding the dequantized variable to its nearest integer. In this way, normalizing flows with
Gaussian dequantization model discrete distributions by approximating them with a multimodal normal
mixture, and then they recover the original mass points of the discrete distribution by rounding.
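As a minimal illustration of this scheme (a sketch of the idea, not the cGNF implementation; the variable and its support are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.integers(low=0, high=4, size=10_000)    # hypothetical discrete variable with values 0-3

    # Dequantize: add N(0, 1/36) noise, yielding a Gaussian mixture with modes at 0, 1, 2, 3
    x_dequant = x + rng.normal(loc=0.0, scale=1/6, size=x.shape)   # sd = 1/6, so variance = 1/36

    # Requantize: round to the nearest integer to recover the original discrete values
    x_requant = np.rint(x_dequant).astype(int)
    print((x == x_requant).mean())    # about 0.997; rounding errors require noise beyond +/-0.5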
A.2 Monte Carlo Experiments on Bias, Variance, and Asymptotic Behavior
To examine the performance of causal-Graphical Normalizing Flows (cGNFs), we conducted a series of Monte
Carlo experiments. These experiments involved training cGNFs and using them to estimate causal effects
with simulated data generated from three different structural equation models. The first data-generating
model was linear and additive with normally distributed disturbances. It can be formally represented as
follows:
\begin{align*}
C &\sim N(0,\, 1) \\
A &\sim N(0.1C,\, 1) \\
L &\sim N(0.2A + 0.2C,\, 1) \\
M &\sim N(0.1A + 0.2C + 0.25L,\, 1) \\
Y &\sim N(0.1A + 0.1C + 0.25M + 0.25L,\, 1)
\end{align*}
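For reference, data from this first model can be simulated as follows (a minimal sketch; the function name is ours):

    import numpy as np

    def simulate_linear_dgm(n, seed=0):
        rng = np.random.default_rng(seed)
        C = rng.normal(0, 1, n)
        A = rng.normal(0.1 * C, 1)
        L = rng.normal(0.2 * A + 0.2 * C, 1)
        M = rng.normal(0.1 * A + 0.2 * C + 0.25 * L, 1)
        Y = rng.normal(0.1 * A + 0.1 * C + 0.25 * M + 0.25 * L, 1)
        return C, A, L, M, Y

    C, A, L, M, Y = simulate_linear_dgm(n=32_000)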
The second data-generating model was more complex, as it involved only discrete variables and also incorporated non-additive relationships among them. This model can be formally represented as follows:
\begin{align*}
C &\sim \text{Multinomial}(0.3,\, 0.5,\, 0.2) \\
A &\sim \text{Bernoulli}(0.3 + 0.1C) \\
L &\sim \begin{cases}
\text{Multinomial}(0.5,\, 0.3,\, 0.2) & \text{if } A = 1 \text{ and } C = 1 \\
\text{Multinomial}(0.3,\, 0.5,\, 0.2) & \text{if } A = 1 \text{ and } C = 2 \\
\text{Multinomial}(0.2,\, 0.3,\, 0.5) & \text{if } A = 1 \text{ and } C = 3 \\
\text{Multinomial}(0.6,\, 0.2,\, 0.2) & \text{if } A = 0
\end{cases} \\
M &\sim \text{Bernoulli}\!\left(\text{logit}^{-1}(-0.5 + 0.4A + 0.2C + 0.3L)\right) \\
Y &\sim \text{Bernoulli}\!\left(\text{logit}^{-1}(-0.5 + 0.3A + 0.1C + 0.3M + 0.3AM + 0.3L)\right)
\end{align*}
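A corresponding sketch for the discrete model, assuming (as we read the notation above) that $C$ and $L$ take values in $\{1, 2, 3\}$ with the listed category probabilities:

    import numpy as np

    def inv_logit(x):
        return 1.0 / (1.0 + np.exp(-x))

    def simulate_discrete_dgm(n, seed=0):
        rng = np.random.default_rng(seed)
        C = rng.choice([1, 2, 3], size=n, p=[0.3, 0.5, 0.2])
        A = rng.binomial(1, 0.3 + 0.1 * C)
        # L depends on both A and C, per the case distinctions above
        probs_a1 = {1: [0.5, 0.3, 0.2], 2: [0.3, 0.5, 0.2], 3: [0.2, 0.3, 0.5]}
        L = np.array([rng.choice([1, 2, 3], p=probs_a1[c] if a == 1 else [0.6, 0.2, 0.2])
                      for a, c in zip(A, C)])
        M = rng.binomial(1, inv_logit(-0.5 + 0.4 * A + 0.2 * C + 0.3 * L))
        Y = rng.binomial(1, inv_logit(-0.5 + 0.3 * A + 0.1 * C + 0.3 * M + 0.3 * A * M + 0.3 * L))
        return C, A, L, M, Y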
The last data-generating model is even more complex. It incorporates both discrete and continuous vari-
ables, nonlinear relationships, effect heterogeneity, heteroscedasticity, and highly non-normal disturbances.
Specifically, this model can be formally represented as follows:
\begin{align*}
C &\sim \text{Laplace}(0,\, 1) \\
A &\sim \text{Bernoulli}\!\left(\text{logit}^{-1}(0.1C)\right) \\
L &\sim \text{Tukey-Lambda}(0.2A + 0.2C + 0.1AC,\, 1,\, 0.3,\, 0.7) \\
M &\sim \text{Student's } t(10) + 0.1A + 0.2C^2 + 0.25L + 0.15AL \\
Y &\sim \text{Normal}(0.1A + 0.1C^2 + 0.2M + 0.2AM + 0.25L^2,\, |C|)
\end{align*}
We simulated 400 to 600 datasets from each model, with sample sizes progressively doubling from 2,000
to 128,000 cases. For each dataset, we trained a cGNF using a batch size of 128, a learning rate of 0.0001, and
a stopping criterion of no reduction in the loss for 50 consecutive epochs, computed on a one-fifth validation
sample. The architecture of the cGNF included an embedding network with five hidden layers, containing
100, 90, 80, 70, and 60 nodes respectively, and an integrand network also composed of five hidden layers,
but with 60, 50, 40, 30, and 20 nodes each.
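In schematic form, the training configuration shared by these experiments can be summarized as follows (the key names below are descriptive labels of ours, not the literal argument names of our software):

    cgnf_config = {
        "batch_size": 128,
        "learning_rate": 1e-4,
        "early_stopping_patience": 50,   # stop after 50 epochs with no reduction in validation loss
        "validation_fraction": 0.2,      # one-fifth of the sample held out for validation
        "embedding_hidden_layers": [100, 90, 80, 70, 60],
        "integrand_hidden_layers": [60, 50, 40, 30, 20],
    }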
After training, we used each cGNF to estimate several causal effects. First, we estimated the average total effects of $A$ on $M$ and $Y$, denoted as $ATE_{A \to M} = E[M(a) - M(a^*)]$ and $ATE_{A \to Y} = E[Y(a) - Y(a^*)]$, respectively. Second, we also estimated the natural direct and indirect effects of $A$ on $M$, as mediated by $L$. These effects can be expressed as $NDE_{A \to M} = E[M(a, L(a^*)) - M(a^*)]$ and $NIE_{A \to L \to M} = E[M(a) - M(a, L(a^*))]$. Lastly, we estimated the path-specific effects of $A$ on $Y$, operating both directly and through $L$ and $M$. These effects can be formally represented as $PSE_{A \to Y} = E[Y(a, L(a^*), M(a^*, L(a^*))) - Y(a^*)]$, $PSE_{A \to M \to Y} = E[Y(a, L(a^*)) - Y(a, L(a^*), M(a^*, L(a^*)))]$, and $PSE_{A \to L \leadsto Y} = E[Y(a) - Y(a, L(a^*))]$. In
all cases, we contrasted $a = 1$ with $a^* = 0$. The true values of these estimands under each data-generating
model are provided in Table 2. Replication files are available at https://github.com/gtwodtke/deep_
learning_with_DAGs/tree/main/MCEs.
Table 2: True Values of Target Estimands in each Monte Carlo Experiment
Estimand                        Linear DGM    Discrete DGM    Non-linear DGM
$ATE_{A \to Y}$                     .180           .143             .325
$PSE_{A \to Y}$                     .100           .109             .189
$PSE_{A \to L \leadsto Y}$          .060           .022             .085
$PSE_{A \to M \to Y}$               .020           .012             .051
$ATE_{A \to M}$                     .150           .113             .207
$NDE_{A \to M}$                     .100           .092             .127
$NIE_{A \to L \to M}$               .050           .020             .080
Note: The true values for our target estimands in the first and second data-generating models (DGMs) were calculated analytically using their non-parametric identification formulas. For the third DGM, these values were computed numerically using 10 billion Monte Carlo samples drawn from the mutilated models. All estimands contrast $a = 1$ with $a^* = 0$.
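To illustrate the mutilated-model approach used for the third DGM, the sketch below approximates $ATE_{A \to M}$ under the linear DGM by Monte Carlo simulation (with far fewer draws than the 10 billion used for Table 2; the function name is ours):

    import numpy as np

    def simulate_m_under_do_a(a, n, seed=0):
        # Draw M from the mutilated linear DGM in which A is set to a by intervention
        rng = np.random.default_rng(seed)
        C = rng.normal(0, 1, n)
        A = np.full(n, float(a))
        L = rng.normal(0.2 * A + 0.2 * C, 1)
        M = rng.normal(0.1 * A + 0.2 * C + 0.25 * L, 1)
        return M

    n = 10_000_000
    ate_am = simulate_m_under_do_a(1, n).mean() - simulate_m_under_do_a(0, n, seed=1).mean()
    print(round(ate_am, 3))    # approximately 0.1 + 0.25 * 0.2 = 0.150, matching Table 2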
Figure 12 presents results from the experiments based on the normal, linear, and additive data-generating
model. Panel (a) plots the bias of the cGNF estimates against the sample size, while Panel (b) displays their
standard deviation. With data simulated from a relatively simple model, cGNF estimates exhibit low bias
and variance, even in relatively small samples. As the sample size increases, the bias and variance converge
toward zero, with both metrics stabilizing around 16,000 cases for all target estimands and then declining
more slowly thereafter. When parallelized over 15 nodes of a high-performance computing cluster (HPC),
each with 40 central processing units (CPUs), the wall time for these experiments was approximately 1 day
and 8 hours.
Figure 13 presents results from the experiments based on the discrete data-generating model. With dis-
crete data, cGNF estimates also generally exhibit low bias and variance, and as the sample size increases, the
bias and variance again appear to converge toward zero. Both metrics stabilize at low levels between 16,000
and 32,000 cases, declining more slowly thereafter. The wall time for these experiments was approximately
1 day and 22 hours, when parallelized over 10 nodes of an HPC with 40 CPUs each.
Figure 14 summarizes results from the experiments involving a data-generating model with both contin-
uous and discrete variables, non-linearity, heterogeneity, heteroscedasticity, and non-normality. With data
simulated from this highly complex process, cGNF estimates initially exhibit nontrivial bias and high vari-
ance, particularly in smaller samples with fewer than 8,000 cases. However, as the sample size increases,
both the bias and variance diminish rapidly. By the time the sample size reaches 16,000 to 32,000 cases, the
bias and variance are small for most estimands. With further increases in sample size, both metrics continue
to converge towards zero, resulting in nearly unbiased estimates with minimal variance in large samples.
These experiments took approximately 2 days and 13 hours to complete, when parallelized across 10 nodes
of an HPC with 40 CPUs each.
A.3 Monte Carlo Experiments on Bootstrap Interval Coverage
To examine the performance of bootstrap confidence intervals for effect estimates from causal-Graphical Nor-
malizing Flows (cGNFs), we conducted another Monte Carlo experiment. In this experiment, we simulated
100 datasets, each containing 8,000 cases. For each dataset, we constructed 90 percent confidence intervals
for an average total effect using the 5th and 95th percentiles of a bootstrap distribution composed of 200
estimates.
The datasets were generated from the following model:
\begin{align*}
C &\sim \text{Bernoulli}(0.6) \\
A &\sim \text{Bernoulli}(0.4 + 0.2C) \\
Y &\sim N(0.2A + 0.4C,\, 1).
\end{align*}
With this model, we targeted the average total effect of $A$ on $Y$, denoted as $ATE_{A \to Y} = E[Y(a) - Y(a^*)]$, which equals 0.2 when contrasting $a = 1$ with $a^* = 0$. We chose a simple data-generating model and a relatively small sample size to ensure that our cGNF estimates for $ATE_{A \to Y}$ would be essentially unbiased, while also keeping the experiment computationally tractable with the resources at our disposal. Computing 200 bootstrap estimates for all 100 simulated datasets necessitates training a total of 20,000 cGNFs, a task
that could easily exceed the wall time limits on the HPC available to us, if we were to use a larger sample
size and/or a more complex data-generating model.
For each simulated dataset, we selected 200 bootstrap samples. Then, for each bootstrap sample, we
trained a cGNF using a batch size of 128, a learning rate of 0.0001, and a stopping criterion of no reduction
in the loss for 50 consecutive epochs, evaluated on a validation sample comprising one-fifth of the data. The
cGNF architecture included an embedding network with five hidden layers, containing 100, 90, 80, 70, and
60 nodes in succession, and an integrand network also composed of five hidden layers with 60, 50, 40, 30,
and 20 nodes each. We computed the upper and lower limits of the confidence intervals using the 95th and
5th percentiles, respectively, of the bootstrap distribution for each simulated dataset.
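The percentile interval construction itself is simple; a minimal sketch (the placeholder estimates below stand in for the 200 ATE estimates obtained by retraining a cGNF on each bootstrap sample):

    import numpy as np

    def percentile_interval(boot_estimates, level=0.90):
        # For a 90 percent interval, return the 5th and 95th percentiles of the bootstrap distribution
        alpha = (1 - level) / 2
        return tuple(np.quantile(boot_estimates, [alpha, 1 - alpha]))

    boot_estimates = np.random.default_rng(0).normal(0.2, 0.05, 200)    # placeholder values only
    print(percentile_interval(boot_estimates))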
Figure 15 displays the results of this experiment. Specifically, it plots the 90 percent bootstrap intervals
for the $ATE_{A \to Y}$, computed across each of the 100 simulated datasets. In this figure, solid lines denote
intervals that cover the true value of the target parameter, whereas dashed lines indicate intervals that fail
to cover it. The figure shows that the 90 percent bootstrap intervals cover the true value of $ATE_{A \to Y}$ in 96
of 100 simulated datasets. This result suggests that the bootstrap intervals achieve their nominal coverage
rate but may be slightly conservative. The total computation time for this experiment was approximately
24 hours. Its replication files are available at https://github.com/gtwodtke/deep_learning_with_DAGs/
tree/main/MCEs.
A.4 Monte Carlo Experiments on Architecture and Hyper-parameters
In addition, we conducted another set of Monte Carlo experiments to investigate how the performance of
cGNFs is influenced by variations in their architecture and hyper-parameter settings. In these experiments,
we trained cGNFs with different architectures and hyper-parameters on 400 simulated datasets. In the
first set of experiments, each dataset, consisting of 32,000 cases, was simulated from the normal, linear, and
additive data-generating model described in Part A.2 of the Appendix. In the second set of experiments, each
dataset was simulated from the data-generating model with both discrete and continuous variables, nonlinear
relationships, effect heterogeneity, heteroscedasticity, and non-normal disturbances, also as outlined in Part
A.2 of the Appendix.
The results from these two experiments are presented in Figures 16 and 17, respectively. Both figures
contain bar charts summarizing the bias and variance for selected cGNF estimates. Each bar corresponds to a
cGNF trained with a specific architecture and hyper-parameter configuration. The “default” bar represents
a cGNF with the architecture, learning rate, and batch size outlined previously for the other simulation
experiments. The bar labeled “default one hidden layer” represents a cGNF with the same setup, minus
the final hidden layer in both the embedding and integrand networks. Similarly, the bar labeled “default
1/4 of nodes” refers to our default cGNF configuration with a 25% reduction in the number of nodes in
each hidden layer. The bar labeled “batch size of 512” represents a cGNF with the default architecture and
learning rate but an increased batch size of 512. Finally, the bar labeled “learning rate of 0.001” refers to a
cGNF with the default architecture and batch size, but a higher learning rate of 0.001.
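In summary, the configurations compared in Figures 16 and 17 can be written out as follows (the labels and the rounding of the reduced layer widths are ours):

    default = {
        "learning_rate": 1e-4,
        "batch_size": 128,
        "embedding_layers": [100, 90, 80, 70, 60],
        "integrand_layers": [60, 50, 40, 30, 20],
    }

    variants = {
        "default": default,
        # same setup minus the final hidden layer in both networks
        "one fewer hidden layer": {**default,
                                   "embedding_layers": default["embedding_layers"][:-1],
                                   "integrand_layers": default["integrand_layers"][:-1]},
        # roughly a 25% reduction in the number of nodes per hidden layer
        "1/4 fewer nodes": {**default,
                            "embedding_layers": [round(0.75 * w) for w in default["embedding_layers"]],
                            "integrand_layers": [round(0.75 * w) for w in default["integrand_layers"]]},
        "batch size of 512": {**default, "batch_size": 512},
        "learning rate of 0.001": {**default, "learning_rate": 1e-3},
    }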
Overall, Figures 16 and 17 suggest that cGNF estimates are fairly robust to modest variations in model
architecture and hyper-parameter settings. Across the configurations tested, we observed levels of bias and
variance that were low and stable for a variety of estimands. The only exception involved the variance of cGNF estimates when the learning rate was increased from 0.0001 to 0.001 (Figure 16), which led to a nontrivial increase in variance. Aside from this, the results suggest that cGNFs generally yield estimates with low
bias and variance for a broad spectrum of sensible architectures and hyper-parameters, provided the sample
size is sufficiently large. The wall time to complete each of these experiments was roughly 20 hours, when
parallelized as above. Replication files are available at https://github.com/gtwodtke/deep_learning_
with_DAGs/tree/main/MCEs.
(a) Bias of cGNF Effect Estimates
(b) Standard deviation of cGNF Effect Estimates
Figure 12: Performance of cGNFs in Monte Carlo Experiments Based on the Normal, Linear, and Additive
Data Generating Process.
Note: Each set of results is based on 600 simulated datasets. All cGNFs have an embedding network with
five hidden layers, composed of 100, 90, 80, 70, and 60 nodes, and an integrand network also with five hidden
layers, composed of 60, 50, 40, 30, and 20 nodes. The models are trained using a learning rate of 0.0001 and
a batch size of 128.
(a) Bias of cGNF Effect Estimates
(b) Standard deviation of cGNF Effect Estimates
Figure 13: Performance of cGNFs in Monte Carlo Experiments Based on the Discrete Data Generating
Process.
Note: Each set of results is based on 400 simulated datasets. All cGNFs have an embedding network with
five hidden layers, composed of 100, 90, 80, 70, and 60 nodes, and an integrand network also with five hidden
layers, composed of 60, 50, 40, 30, and 20 nodes. The models are trained using a learning rate of 0.0001 and
a batch size of 128.
(a) Bias of cGNF Effect Estimates
(b) Standard deviation of cGNF Effect Estimates
Figure 14: Performance of cGNFs in Monte Carlo Experiments Based on the Non-normal, Non-linear, and
Non-additive Data Generating Process.
Note: Each set of results is based on 400 simulated datasets. All cGNFs have an embedding network with
five hidden layers, composed of 100, 90, 80, 70, and 60 nodes, and an integrand network also with five hidden
layers, composed of 60, 50, 40, 30, and 20 nodes. The models are trained using a learning rate of 0.0001 and
a batch size of 128.
Figure 15: Coverage of Bootstrap Intervals for cGNF Effect Estimates in a Simple Monte Carlo Experiment
Note: Results are based on 100 simulated datasets with 200 bootstrap samples per dataset. The figure displays 90 percent confidence intervals based
on the 5th and 95th percentiles of the bootstrap distribution for each simulated dataset. All cGNFs have an embedding network with five hidden
layers, composed of 100, 90, 80, 70, and 60 nodes, and an integrand network also with five hidden layers, composed of 60, 50, 40, 30, and 20 nodes.
The models are trained using a learning rate of 0.0001 and a batch size of 128.
(a) Bias of cGNF Effect Estimates
(b) Standard deviation of cGNF Effect Estimates
Figure 16: Performance of cGNFs across Different Architectures and Hyper-parameter Settings with a
Normal, Linear, and Additive Data Generating Process.
Note: Each set of results is based on 400 simulated datasets, each with a sample size of 32,000 generated from
the linear and additive SEM with normal disturbances. The default hyper-parameters refer to a learning
rate of 0.0001, a batch size of 128, an embedding network with five hidden layers composed of 100, 90, 80, 70, and 60 nodes, and an integrand network also with five hidden layers composed of 60, 50, 40, 30, and 20 nodes.
(a) Bias of cGNF Effect Estimates
(b) Standard deviation of cGNF Effect Estimates
Figure 17: Performance of cGNFs across Different Architectures and Hyper-parameter Settings with a Non-
normal, Non-linear and Non-additive Data Generating Process.
Note: Each set of results is based on 400 simulated datasets, each with a sample size of 32,000 generated
from the nonlinear and non-additive SEM. The default hyper-parameters refer to a learning rate of 0.0001,
a batch size of 128, an embedding network with five hidden layers composed of 100, 90, 80, 70, and 60 nodes, and an integrand network also with five hidden layers composed of 60, 50, 40, 30, and 20 nodes.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Causal mediation analysis concerns the pathways through which a treatment affects an outcome. While most of the mediation literature focuses on settings with a single mediator, a flourishing line of research has examined settings involving multiple mediators, under which path-specific effects (PSEs) are often of interest. We consider estimation of PSEs when the treatment effect operates through K(≥ 1) causally ordered, possibly multivariate mediators. In this setting, the PSEs for many causal paths are not nonparametrically identified, and we focus on a set of PSEs that are identified under Pearl's nonparametric structural equation model. These PSEs are defined as contrasts between the expectations of 2 K+1 potential outcomes and identified via what we call the generalized mediation functional (GMF). We introduce an array of regression-imputation, weighting, and "hybrid" estimators, and, in particular, two K +2-robust and locally semiparametric efficient estimators for the GMF. The latter estimators are well suited to the use of data-adaptive methods for estimating their nuisance functions. We establish the rate conditions required of the nuisance functions for semiparametric efficiency. We also discuss how our framework applies to several estimands that may be of particular interest in empirical applications. The proposed estimators are illustrated with a simulation study and an empirical example.
Article
Full-text available
Normalizing Flows are generative models which produce tractable distributions where both sampling and density evaluation can be efficient and exact. The goal of this survey article is to give a coherent and comprehensive review of the literature around the construction and use of Normalizing Flows for distribution learning. We aim to provide context and explanation of the models, review current state-of-the-art literature, and identify open questions and promising future directions.
Article
We make only one point in this article. Every quantitative study must be able to answer the question: what is your estimand? The estimand is the target quantity—the purpose of the statistical analysis. Much attention is already placed on how to do estimation; a similar degree of care should be given to defining the thing we are estimating. We advocate that authors state the central quantity of each analysis—the theoretical estimand—in precise terms that exist outside of any statistical model. In our framework, researchers do three things: (1) set a theoretical estimand, clearly connecting this quantity to theory; (2) link to an empirical estimand, which is informative about the theoretical estimand under some identification assumptions; and (3) learn from data. Adding precise estimands to research practice expands the space of theoretical questions, clarifies how evidence can speak to those questions, and unlocks new tools for estimation. By grounding all three steps in a precise statement of the target quantity, our framework connects statistical evidence to theory.
Article
Sociological research examining how education mediates the association between occupational origins and destinations has long relied on the origins-education-destinations framework. We argue that the framework would benefit from factoring in processes of early skill formation to better grasp the mechanisms through which education becomes a channel of social reproduction. We propose that education is a mediator of the origins-destinations associations as a result of two processes: The sorting into schooling on early skills and the independent mediating impact of education net of early skills. We outline the implications of this distinction for comparative research, stressing that education can be a mediator of the origins-destinations associations as a result of factors that have little to do with the effects of schools and schooling. Analyzing data from the National Child Development Study and the British Cohort Study, we show that the conventional OED framework may overstate the independent mediating role of education by up to about 25 percent. We discuss the implications of our framework for policies about using education as a vehicle for promoting social mobility.