Entropic Inequality Constraints from e-separation Relations in Directed Acyclic Graphs with Hidden Variables

Noam Finkelstein¹, Beata Zjawin²,³, Elie Wolfe², Ilya Shpitser¹, Robert W. Spekkens²

¹Johns Hopkins University, Department of Computer Science, 3400 N Charles St, Baltimore, MD, USA 21218
²Perimeter Institute for Theoretical Physics, 31 Caroline St. N, Waterloo, Ontario, Canada, N2L 2Y5
³International Centre for Theory of Quantum Technologies, University of Gdańsk, 80-308 Gdańsk, Poland

Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021).
Abstract

Directed acyclic graphs (DAGs) with hidden variables are often used to characterize causal relations between variables in a system. When some variables are unobserved, DAGs imply a notoriously complicated set of constraints on the distribution of observed variables. In this work, we present entropic inequality constraints that are implied by e-separation relations in hidden variable DAGs with discrete observed variables. The constraints can intuitively be understood to follow from the fact that the capacity of variables along a causal pathway to convey information is restricted by their entropy; e.g., at the extreme case, a variable with entropy 0 can convey no information. We show how these constraints can be used to learn about the true causal model from an observed data distribution. In addition, we propose a measure of causal influence called the minimal mediary entropy, and demonstrate that it can augment traditional measures such as the average causal effect.
1 INTRODUCTION
A causal model of a system of random variables can be represented as a directed acyclic graph (DAG), where an edge from a node X to a node Y can be taken to mean that the random variable X is a direct cause of the random variable Y. Such causal models can be used to algorithmically deduce highly non-obvious properties of the system. For example, it is possible to deduce that the probability distribution of observed variables in the system, called the observed data distribution, must satisfy certain constraints.
When some variables in the system are unobserved, the constraints implied by the causal model are not well understood, and, for computational reasons, cannot be feasibly enumerated in full for arbitrary graphs. As a result, a number of methods have been developed for quickly providing a subset of these constraints [1–3]. In this work, we contribute to this literature by describing entropic inequality constraints that hold whenever an e-separation relationship [4, 5] is present in the graph.
The idea underlying these inequality constraints is that mutual information between two variables in a graphical model must be explained by variability of variables (termed bottleneck variables) that lie between them along some path. Such paths need not be directed; a bottleneck variable may constitute the base of a fork structure or the mediary variable in a chain structure along the path. Each such path has a limited capacity for carrying information, which can be quantified in terms of the entropies of the bottleneck variables on that path. At the extreme case, if there is a bottleneck variable along a path with zero entropy, then subsequent variables on that path cannot learn about prior variables through the path, because the bottleneck variable will hold a fixed value regardless of the values taken by any other variables, observed or unobserved. We will quantitatively relate the amount of information that can flow through a path to the entropies of its bottleneck variables below.
Constraints on the observed data distribution implied by a causal model have primarily been used to determine whether the observed data is compatible with a causal model, and to learn the true causal model directly from the observed data. Existing algorithms for learning causal models rely primarily on equality constraints. We suggest that incorporating our proposed inequality constraints, which can easily be read off a graphical model, can meaningfully improve these methods. In addition, we show how the entropy of latent variables can be linked to properties of the observed data distribution, yielding bounds on latent variable entropies or constraints on the observed data distribution.
We also demonstrate that our constraints can be used to bound an intuitive measure of the strength of a causal relationship between two variables, called the Minimum Mediary Entropy (MME). We show that the standard measure, called the Average Causal Effect (ACE), is not well suited to capturing the causal influence strength of a non-binary treatment on outcome, and can be misleading in some settings. For example, the ACE can be 0 even when treatment changes outcome for every subject in the population. The MME overcomes both of these issues, and can serve as an informative complement to the ACE.
The remainder of the paper is organized as follows. In Section 2, we discuss relevant material in causal inference and information theory. We present our constraints in Section 3, and several applications of the constraints in Section 4. Finally, a discussion of related work and directions for future study can be found in Section 5 and Section 6 respectively.
2 PRELIMINARIES
2.1 CAUSAL INFERENCE BACKGROUND
We begin by introducing key ideas from the literature on graphical causal models. Suppose we are interested in a system of related phenomena, each of which can be represented by a random variable. We denote observed variables in the system as Y, unobserved variables as U, and the full set of variables as V ≡ Y ∪ U.
We let G denote a DAG representing the system of interest. Each node in G corresponds to a variable in V. The direct causes of each random variable V are defined to be its parents in G, denoted pa_G(V). We adopt a nonparametric structural equations view of the DAG [6, 7], under which the value of each variable V is a function of its direct causes and exogenous noise, denoted ε_V. The set of these structural equations is denoted F ≡ {f_V(pa_G(V), ε_V) | V ∈ V}. In most causal analyses, the exact form of these functions is unknown. Nevertheless, if the structure of causal dependencies in a system is known to be summarized by a graph G, or, equivalently, to be described by some set of functions F, then the distribution P(V) is known to factorize as

P(V) = ∏_{V ∈ V} P(V | pa_G(V)).   (1)
Equation (1) is the fundamental constraint that G places on the distribution P(V): if the equality holds, then the distribution is in the model; otherwise it is not. When all variables are observed, each term in the factorization is identifiable from observed data, and the constraint may easily be checked. When not all variables are observed, there is no known polynomial-time algorithm for expressing the constraints that the factorization of the full joint distribution places on the observed data distribution. In theory, necessary and sufficient conditions for the observed data distribution to be in the model can be obtained through the use of quantifier elimination algorithms [8], but these have doubly exponential runtime and are prohibitively slow in practice.
We now review d-separation and e-separation, which are properties of the graph G that imply certain properties of the distribution P(V). We first introduce the notion of open and closed paths in conditional distributions. Triples in the graph of the form A → C → B and A ← C → B are said to be open if we do not condition on C, and closed if we do condition on C. Triples of the form A → C ← B, in which C is called a collider, are closed if we do not condition on C or any of its descendants, and open if we do. A path is said to be open under a conditioning set C if all contiguous triples along that path are open under that conditioning set.
Definition 1 (d-separation). Let A, B and C be sets of variables in a DAG. A and B are said to be d-separated by C if all paths between A and B are closed after conditioning on C. This is denoted (A ⊥_d B | C).
It is a well-known consequence of Equation (1) that any d-separation relation (A ⊥_d B | C) in G implies the corresponding conditional independence relation A ⊥ B | C in the distribution P(V). Conditional independence constraints of this form are about sub-populations in which the variables in C take the same value for all subjects. We can only evaluate whether these constraints hold when all variables in C are observed; otherwise there is no way to identify the relevant sub-populations. For that reason, it is impossible to determine whether conditional independences implied by G hold if they have hidden variables in their conditioning sets, leading to the need for other mechanisms to test implications of these independencies.
To describe e-separation, we first introduce the idea that a node can be deleted from a graph by removing the node and all of its incoming and outgoing edges. e-separation can then be defined as follows.

Definition 2 (e-separation). Let A, B, C and D be sets of variables in a DAG. A and B are said to be e-separated by C after deletion of D if (A ⊥_d B | C) after deletion of every variable in D. This is denoted (A ⊥_e B | C upon D).
Conditioning on C may close some paths between A and B, and open others. In the context of e-separation, the set D, which we refer to as a bottleneck for A and B conditional on C, is any set that includes at least one variable from each path between A and B that is open after conditioning on C. If no subset of D is a bottleneck, then D is called a minimal bottleneck. This terminology reflects the fact that, conditional on C, all information shared between A and B – that is, transferred from one to the other or transferred to each from a common source – must flow through D.
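This relation is mechanical to check. The following Python sketch (our own illustration, assuming the networkx library and its built-in d-separation test nx.d_separated, renamed is_d_separator in recent releases) tests an e-separation relation by deleting the nodes in D and then testing d-separation in the remaining graph:

```python
import networkx as nx

def e_separated(G, A, B, C, D):
    """Check (A _||_e B | C upon D): delete every node in D, along
    with all of its incident edges, then test d-separation given C."""
    H = G.copy()
    H.remove_nodes_from(D)
    return nx.d_separated(H, set(A), set(B), set(C))

# The Instrumental graph of Fig. 1(c): A -> D -> B with U -> D, U -> B.
G = nx.DiGraph([("A", "D"), ("D", "B"), ("U", "D"), ("U", "B")])
print(e_separated(G, {"A"}, {"B"}, set(), {"D"}))  # True
```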
It has been shown that every e-separation relationship among observed variables in a graph G corresponds to a constraint on the observed data distribution P(Y) [4]. However, this result is not constructive, in the sense that it does not provide a strategy for deriving such constraints for a given e-separation relationship. The inequality constraints we provide in Section 3 partially fulfill this role; they provide explicit constraints that hold everywhere in the model whenever an e-separation relationship obtains in a graph.

Figure 1: The Unrelated Confounders graph (a), and a split node model for it (b), as well as the Instrumental graph (c), and its split node model (d).
2.1.1 Node Splitting
We will see that e-separation is related to the idea of splitting nodes in a graph. We define a node-splitting operation as follows. Given a graph G and a vertex D in the graph, the node splitting operation returns a new graph G# in which D is split into two vertices. One of the vertices is still called D, and it maintains all edges directed into D in the original graph G, but none of its outgoing edges. This vertex keeps the name D because it will have the same distribution as D in the original graph, as all of its causal parents remain the same. The second random variable is labeled D#, and it inherits all of the edges outgoing from D in the original graph, but none of its incoming edges. Examples of the node splitting operation are illustrated in Fig. 1.
By a result of Evans [4], (A ⊥_e B | C upon D) in G if and only if (A ⊥_d B | C, D#) in G#. Note that the node splitting operation described here is closely related to the operation of node splitting in Single World Intervention Graphs in causal inference [7].
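A minimal sketch of this operation on a networkx DiGraph (helper names are ours), together with a check of Evans' equivalence on the Instrumental graph of Fig. 1(c) and its split version in Fig. 1(d):

```python
import networkx as nx

def split_node(G, d):
    """Return G# in which d keeps its incoming edges and a new
    node d# inherits its outgoing edges."""
    Gs = G.copy()
    d_sharp = str(d) + "#"
    Gs.add_node(d_sharp)
    for child in list(G.successors(d)):
        Gs.remove_edge(d, child)
        Gs.add_edge(d_sharp, child)
    return Gs

G = nx.DiGraph([("A", "D"), ("D", "B"), ("U", "D"), ("U", "B")])
Gs = split_node(G, "D")
# Evans: (A _||_e B | upon D) in G iff (A _||_d B | D#) in G#.
print(nx.d_separated(Gs, {"A"}, {"B"}, {"D#"}))  # True
```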
2.2 ENTROPIES
In this section, we review standard concepts in information theory, which we will use to express our inequality constraints. We begin with the definitions of entropy and mutual information.

Definition 3. The entropy of a random variable X is defined as H(X) ≡ −∑_{x∈X} P(x) log₂ P(x), with the joint entropy of X and Y defined analogously. The mutual information between X and Y is defined as I(X:Y) ≡ H(X) + H(Y) − H(X,Y).
The entropy of a random variable can be thought of as the level of uncertainty one has about its value. Entropy is maximized by a uniform distribution over the domain of a random variable, as there is no reason to think any one value is more probable than another, and minimized by a point distribution, in which there is no uncertainty.

The mutual information between X and Y can be thought of as the amount of certainty we gain about the value of one, on average, if we learn the value of the other. It is maximized when one of X and Y is a deterministic function of the other, and is minimized when they are independent.
The entropy H(X | Y=y) of X conditional on a specific value Y=y is obtained by replacing the distribution P(X) in Definition 3 with P(X | Y=y). The conditional entropy of X given Y, denoted H(X|Y), is defined as the expected value of H(X | Y=y). Conditional mutual information is analogously defined.
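All of these quantities are easy to compute from a joint probability table; the short NumPy sketch below (our own, reused informally in later examples) implements the definitions above:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability table of any shape;
    zero-probability cells contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(pxy):
    """I(X:Y) = H(X) + H(Y) - H(X,Y) for a 2-D joint table."""
    pxy = np.asarray(pxy, dtype=float)
    return entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0)) - entropy(pxy)

# X = Y uniform on {0,1}: I(X:Y) = H(X) = 1 bit; independence gives 0.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```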
3 E-SEPARATION CONSTRAINTS

We have already described the intuition behind our constraints, which can be roughly summarized by the observation that the statistical dependence between random variables must be limited by the total amount of information that can flow through any bottleneck between them. We now describe how the tools introduced in Section 2 help us formalize this intuition.
First, we describe why e-separation helps formalize the idea of blocking "all paths" between two sets of variables. Consider the instrumental variable graph, depicted in Fig. 1(c). A and B are only d-separated by the set {D, U}, where U is unobserved. Consequently, they are not d-separated by any set consisting entirely of observed variables. They are, however, e-separated after deletion of the observed variable D. This tells us that all paths between A and B are through D, and we can take advantage of observed properties of D to bound the dependence between them even when nothing is known about the unobserved variable U. A similar story can be told about the Unrelated Confounders scenario depicted in Fig. 1(a).
When all variables are observed, e-separation does not imply any constraints that are not implied by d-separation, which follows from the fact that d-separation implies all constraints in such cases [9]. However, as illustrated by the examples in Figs. 1(a) and 1(c), e-separation allows us to identify bottlenecks consisting entirely of observed variables between A and B even when paths between A and B cannot be closed by any manner of conditioning on observed variables. To show how e-separation leads to entropic constraints, we will make use of Theorem 4.2 in [4], reframed as follows.
Theorem 4 (Evans [4, Theorem 4.2]). Suppose (A ⊥_e B | C upon D) in G, and that no variable in C is a descendant of any in D. Then there exists a distribution P* over A, B, C, D, D# such that

P(A=a, B=b, D=d | C=c) = P*(A=a, B=b, D=d | C=c, D#=d)   (2)

with A ⊥ B | C, D# in P*. If furthermore no variable in A is a descendant of any in D, then there exists a distribution P* such that P(B=b, D=d | A=a, C=c) = P*(B=b, D=d | A=a, C=c, D#=d) with A ⊥ B | C, D# in P*.¹
We provide the following intuition for this theorem. Our graph G represents the causal relationships within a system of random variables in the real world. The graph G# represents an alternative world in which the causal effects of D are "spoofed" by some random variable D#. That is, children of D in G, which should be functions of D, are instead fooled into being functions of D#.

In the alternative world represented by G#, we suppose that the functional form f_V of a variable V in terms of its parents stays the same for all variables that are shared between graphs. This means that all non-descendants of D have the same joint distribution in our world and in the alternative world, as neither their parents nor the functions defining them in terms of their parents have changed. By contrast, descendants of D in G will have a different distribution in the alternative world, as their distributions are now functions of the distribution of D#, which may be different from that of D, and is unknown.

Now, suppose we condition on a particular value of D# = d in G#. Then, because the functional form of the causal mechanisms is shared across worlds, the descendants of D in G have the same distribution as they have when D = d in the observed world. In addition, all of the non-descendants of D# are marginally independent from D#, because it has no ancestors, so all connecting paths must be collider paths. Therefore, both its non-descendants and its descendants have the same joint distribution they would have had when D = d in the original graph. The results in the theorem then follow when we note that C, and optionally A, are non-descendants of D, and that the relevant independence properties hold in the world of G#.
In general, we cannot know what this P* distribution is, because we never get to observe this alternate world. But when we condition on D#, we are removing precisely the randomness we do not know about, yielding a distribution that we do know about. The fact that P* agrees with P on a subset of their domains, and that it contains known independences, is sufficient to derive informative constraints, as seen below.

¹ In causal inference problems, a distribution P* that satisfies the relevant conditions for this result may be constructed from counterfactual random variables A(d), B(d), D(d) and C(d).
3.1 ENTROPIC CONSTRAINTS FROM E-SEPARATION

We now show how the notion of e-separation permits the formulation of entropic inequality constraints. In these constraints, we use mutual information to represent dependence between sets of variables, and entropy to measure the information-carrying capacity of paths connecting them.
Theorem 5 (Proof in Appendix C). Suppose the variables in D are discrete. Whenever (A ⊥_e B | C upon D), then I(A:B | C, D) ≤ H(D). If in addition no element of C is a descendant of any in D, then for any value c in the domain of C, the following stronger constraints hold:

I(A:B | C=c, D) ≤ H(D | C=c),   (3a)
I(A:B | C, D) ≤ H(D | C).       (3b)

If in addition no element of A is a descendant of any in D, then for any value c in the domain of C, the following even stronger constraints hold:

I(A:B, D | C=c) ≤ H(D | C=c),   (4a)
I(A:B, D | C) ≤ H(D | C).       (4b)
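As a concrete illustration, the sketch below (ours; axes ordered (A, B, D), with C empty, as in the Instrumental graph of Fig. 1(c)) evaluates the unconditional constraint I(A:B | D) ≤ H(D) for a given joint table:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def i_ab_given_d(pabd):
    """I(A:B|D) = H(A,D) + H(B,D) - H(D) - H(A,B,D), axes (A, B, D)."""
    pabd = np.asarray(pabd, dtype=float)
    return (entropy(pabd.sum(axis=1)) + entropy(pabd.sum(axis=0))
            - entropy(pabd.sum(axis=(0, 1))) - entropy(pabd))

def satisfies_theorem5(pabd, tol=1e-12):
    """Check I(A:B|D) <= H(D), the constraint for (A _||_e B | upon D)."""
    h_d = entropy(np.asarray(pabd).sum(axis=(0, 1)))
    return i_ab_given_d(pabd) <= h_d + tol

# A = B = D, all binary and uniform: I(A:B|D) = 0 <= H(D) = 1.
p = np.zeros((2, 2, 2))
p[0, 0, 0] = p[1, 1, 1] = 0.5
print(satisfies_theorem5(p))  # True
```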
This theorem potentially allows us to efficiently discover many entropic inequalities implied by any given graph, such as those implied by Fig. 2. In some cases, as in Fig. 2(a), the theorem recovers all Shannon-type entropic inequality constraints implied by the graph [10–12]. In other cases, as in Fig. 2(b), the graph implies a Shannon-type entropic inequality constraint beyond what Theorem 5 can recover, per a result in [13]. Indeed, entropic inequality constraints can be implied by graphs not exhibiting e-separation relations whatsoever, such as the triangle scenario [11, 14].
The linear quantifier elimination of [10–12] will always discover all the entropic inequalities which can be inferred from Theorem 5. However, the quantifier elimination method is computationally expensive, and is essentially intractable for graphs involving more than six or seven variables (observed and latent combined). Theorem 5, by contrast, provides an approach that is computationally tractable, but is capable of discovering fewer entropic constraints.
Finally, we note that an inverse of Theorem 5 also holds. In particular, if (A ⊥_e B | C upon D) does not hold in G and the variables in D are discrete, then there necessarily exists some distribution P over A, B, C, D in the marginal model of G such that I(A:B | C, D) > H(D) when evaluated on P.
Figure 2: For graph (a), Theorem 5 implies the entropic inequality constraints I(A:XYZ) ≤ H(X) and I(A:YZ) ≤ H(Y). For graph (b), Theorem 5 implies I(A:XYZ) ≤ H(X) and I(A:YZ | X) ≤ H(Y | X). Note, however, that the entropic quantifier elimination method of Chaves et al. [10], as applied by Weilenmann and Colbeck [13], finds that the former inequality for graph (b) can be strengthened into I(A:XYZ) ≤ H(X | Y).
3.2 RELATING E-SEPARATION TO EQUALITY CONSTRAINTS

We have seen that d-separation and e-separation relations imply constraints on the observed data distribution. Verma and Pearl [15] discuss equality constraints for latent variable models that apply in identified post-intervention distributions. Such equality constraints are sometimes called Verma constraints. A general description of the class of these constraints implied by a hidden variable DAG model, as well as a discussion of the properties of these constraints, is given in Ref. [16]. In this section, we examine the relationship between the e-separation-based constraints of Theorem 5 and the d-separation-based conditional independence and Verma constraints.
First, we observe that the presence of d-separation relations and Verma constraints in a graphical model implies the presence of an e-separation relation.
Proposition 6 (Proof in Appendix C). If A is d-separated from B by {C, D}, then A is also e-separated from B by C upon deleting D.
This demonstrates that for any d-separation relation in the graph, it is possible to obtain an entropic constraint corresponding to any minimal bottleneck D through an e-separation relation. More precisely, when A is d-separated from B by {C, D}, then by Proposition 6, it is also the case that A is e-separated from B given C upon deleting D, and therefore Theorem 5 can be applied to obtain entropic constraints. Note, however, that these are necessarily weaker than the entropic constraint I(A:B | C, D) = 0, which follows from the d-separation relation itself.
In summary, every d-separation relation in the graph is an instance of e-separation, but not vice-versa. When an instance of e-separation is also an instance of d-separation, then all the inequality constraints implied by e-separation are rendered defunct by the stronger equality constraints implied by d-separation.
We now show that a similar pattern of deprecating inequalities by equalities occurs in the presence of Verma constraints when certain counterfactual interventions are identifiable.

Proposition 7. Consider a graph G which exhibits the e-separation relation (A ⊥_e B | C upon D) and where no element of C is a descendant of any in D. If the counterfactual distribution P(A(D=d), B(D=d), D(D=d) | C) is identifiable², then the inequalities of Theorem 5 are logically implied whenever the stronger equality constraints

I(A(D=d) : B(D=d) | C) = 0   (5)

are satisfied for all values of d. Note that Equation (5) is satisfied if and only if the margin of the identified counterfactual distribution factorizes, i.e., when

P(A(D=d), B(D=d) | C) ≡ ∑_{d′} P(A(D=d), B(D=d), D(D=d)=d′ | C)

exhibits A(D=d) ⊥ B(D=d) | C.   (6)
The proof directly follows from that of Theorem 5. In proving Theorem 5, we derive entropic inequalities by relating the entropies pertaining to P(A, B, D | C) to entropies pertaining to the P* distribution posited by Theorem 4. That is, Theorem 5 is an entropic consequence of Theorem 4. If the conditions of Proposition 7 are satisfied, then the conditions of Theorem 4 are also automatically satisfied, since one can then explicitly reconstruct

P*(A, B, D=d | C, D#=d#) = P(A(D=d#), B(D=d#), D(D=d#)=d | C).   (7)

There is no opportunity to violate the entropic inequalities of Theorem 5 once the observational data has been confirmed as consistent with Theorem 4. In other words, in order to violate the inequalities of Theorem 5 it must be the case that no P* consistent with Theorem 4 can be constructed, but this contradicts the explicit recipe of Equation (7).
See Refs. [15, 16, 18] for details on how to derive the form of the equality constraints summarized by Equation (6). We note here that P(A(D=d), B(D=d), D(D=d)=d | C) is certainly identifiable if D is not a member of the same district [16] as any element in {A, B} within the subgraph of G over {A, B, C, D} and their ancestors. We also note that the identifiability of merely P(A(D=d), B(D=d) | C) but not of P(A(D=d), B(D=d), D(D=d)=d | C) negates the implication from Equation (6) to Theorem 5. In Appendix A, we provide an example of a graph in which P(A(D=d), B(D=d) | C) is identified, but the entropic constraints of Theorem 5 remain relevant. In addition, we demonstrate that the application of the entropic constraints to identified counterfactual distributions can also result in inequality constraints on the observed data distribution.

² The counterfactual distribution in this theorem allows intervened-on variables and outcomes to intersect. See [17] for a complete identification algorithm for counterfactual distributions of this type.
3.3 CONSTRAINTS AND BOUNDS INVOLVING LATENT VARIABLES

In this section, we consider d-separation relations with hidden variables in the conditioning set. Because we cannot condition on hidden variables, there is no way to check whether the corresponding independence constraints hold in the full data distribution. However, if we have access to auxiliary information about these hidden variables – such as information about their entropy or their cardinality – it is possible to obtain inequality constraints on the observed data distribution.
Proposition 8 (Proof in Appendix C). If (A ⊥_d B | C, U) and H(U | A, B, C=c) ≥ 0, then for any value c in the domain of C:

H(U | C=c) ≥ I(A:B | C=c).   (8)
Note that Proposition 8 holds even if A, B, and C are continuously valued variables.
In many scenarios, we may have more (or be more interested in) information pertaining to the cardinality of a hidden variable than its entropy. We take the cardinality of a set of variables to be the product of the cardinalities of the variables in the set. An upper bound on the cardinality of U entails an upper bound on its entropy. As observed above, the entropy of a random variable is maximized when it takes a uniform distribution. If we let |U| denote the cardinality of U, and recall that the entropy of a uniformly distributed variable with cardinality m is simply log₂(m), then log₂|U| ≥ H(U). The next corollary then follows immediately from Proposition 8, since H(U | ·) ≥ 0 whenever U has finite cardinality:
Corollary 8.1. If (A ⊥_d B | C, U), then for any value c in the domain of C, the cardinality of U may be lower-bounded:

|U| ≥ max_c 2^{I(A:B | C=c)} ≥ 2^{I(A:B | C)}.   (9)
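The derivation behind Corollary 8.1 is a short chain: for discrete U, the entropy of the conditional distribution P(U | C=c) is at most log₂|U|, and Proposition 8 supplies the second step. In display form:

```latex
\log_2 |U| \;\ge\; H(U \mid C{=}c) \;\ge\; I(A : B \mid C{=}c)
  \quad\text{for every value } c,
\qquad\text{hence}\qquad
|U| \;\ge\; \max_c 2^{\,I(A : B \mid C=c)} \;\ge\; 2^{\,I(A : B \mid C)},
```

where the final step uses that I(A:B | C) is the average of I(A:B | C=c) over c, and is therefore no greater than the maximum.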
Finally, we note that both of these inequalities can also be used if we do not know anything about the properties of U, but would like to infer lower bounds for its entropy and cardinality from the observed data. In Section 4.2, we will explore a scenario in genetics in which these bounds and constraints may be of use.
Remark 9. Constraints given in Proposition 8 and Corollary 8.1 are stronger than can be obtained from the e-separation relation (A ⊥_e B | C upon U) on its own.
To demonstrate Remark 9, we consider a set of structural equations consistent with Fig. 1(a). Suppose that D takes the value 0 when U1 ≠ U2, and the value 1 otherwise, and that A and B take the value 0 if D is 0, and values equal to U1 and U2 respectively if D is 1. It follows that A and B are always equal, and therefore I(A:B) = H(A). Now, suppose that U1 and U2 only take values not equal to 0, and that there are at least two values that each takes with nonzero probability. It immediately follows that H(D) < H(A), and therefore that H(D) < I(A:B), as D and A by construction take the value 0 with the same probability, but there is strictly more entropy in the remainder of A's distribution because D is binary and A takes at least two other values with nonzero probability.

Figure 3: Two hidden variable DAGs that share equality constraints over observed variables, but (a) contains e-separation relations that are not in (b).
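A quick numerical check of this construction, instantiated (our choice, consistent with the conditions above) with U1 and U2 independent and uniform on {1, 2}:

```python
import numpy as np

def entropy_of(samples):
    """Plug-in Shannon entropy (bits) of an empirical sample."""
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
n = 200_000
u1 = rng.integers(1, 3, n)   # uniform on {1, 2}, never 0
u2 = rng.integers(1, 3, n)
d = (u1 == u2).astype(int)   # D = 1 iff U1 = U2
a = np.where(d == 1, u1, 0)  # A = U1 when D = 1, else 0
b = np.where(d == 1, u2, 0)  # B = U2 when D = 1, else 0

assert (a == b).all()        # A and B always agree, so I(A:B) = H(A)
print(entropy_of(d))         # ~1.0 = H(D)
print(entropy_of(a))         # ~1.5 = I(A:B) > H(D)
```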
4 APPLICATIONS
In this section, we explore several applications of the constraints developed above. In Sections 4.1 and 4.2, we show how our results can be used to learn about causal models from observational data. In Section 4.3, we further leverage the importance of the entropy of variables along a causal pathway to posit a new measure of causal strength, and observe that this measure can be bounded by an application of Theorem 5.
4.1 CAUSAL DISCOVERY
In this section, we present an example in which two hidden variable DAGs with the same equality constraints present different entropic inequality constraints. The ability to distinguish between models that share equality constraints has the potential to advance the field of causal discovery, in which causal DAGs are learned directly from the observed data. Causal discovery algorithms for learning hidden variable DAGs currently do so using only equality constraints. Our approach may be useful as a post-processing addition to such methods, whereby any graph found to satisfy the equality constraints in the observed data is tested against the entropic inequality constraints implied by e-separation relations in the model.

The hidden variable DAGs in Fig. 3, adapted from Appendix B in Ref. [19], share the same conditional independence constraints, Y1 ⊥ Y3 | Y2, Y5 and Y1 ⊥ Y5, but exhibit different e-separation relations.
In Fig. 3(a), (Y1 ⊥_e Y3Y4 | Y2 upon Y5), (Y1Y2 ⊥_e Y4 upon Y3), and (Y2 ⊥_e Y4 | Y1 upon Y3). Applying Theorem 5 in each case, we obtain the three inequality constraints I(Y1 : Y3Y4Y5 | Y2) ≤ H(Y5 | Y2), I(Y2 : Y3Y4 | Y1) ≤ H(Y3 | Y1), and I(Y1Y2 : Y3Y4) ≤ H(Y3).

In Fig. 3(b), we have added an edge, which removes some e-separation relations. We are left with (Y1 ⊥_e Y3 | Y2 upon Y5) and (Y2 ⊥_e Y4 | Y1 upon Y3). We can again apply Theorem 5 in each case, yielding the inequality constraints I(Y1 : Y3Y5 | Y2) ≤ H(Y5 | Y2) and I(Y2 : Y3Y4 | Y1) ≤ H(Y3 | Y1). The second of these constraints is shared by the graph in Fig. 3(a), and the first is strictly weaker than a constraint in Fig. 3(a).
Models similar to those shown in Fig. 3 sometimes arise in time-series data, where the variables in the chain represent observations taken at consecutive time steps. In such models, it is often assumed that treatments no longer have a direct effect on outcomes after a certain number of time steps. Here, that assumption is encoded in the lack of a direct edge from Y1 to Y4 in Fig. 3(a). We have shown above that this kind of assumption can be falsified even when it does not imply any additional equality constraints, as is often the case. In particular, if the stronger constraints implied by Fig. 3(a) are violated, but the weaker constraints of Fig. 3(b) are not, then the assumption is falsified.
4.2 CAUSAL DISCOVERY IN THE PRESENCE OF LATENT VARIABLES

Figure 4: Identifying direct causal influence in the presence of a confounder with limited cardinality.
In this section, we consider a very simple possible application of the constraints and bounds relating to entropies of unobserved variables in genetics. Consider a causal hypothesis wherein the presence or absence of an unobserved gene influences two aspects of an organism's phenotype. Suppose that due to genetic sequencing studies, the number of variants of the gene in the population – i.e., the cardinality of the corresponding random variable – is known. Two possible hypotheses regarding the causal structure are depicted in Fig. 4, where U represents the gene and X and Y are the phenotype aspects. In Fig. 4(a), one presumes no causal influence of X on Y, whereas in Fig. 4(b), direct causal influence is allowed. In the former case, knowledge of the number of variants of the gene constrains the mutual information between the phenotypes, while in the latter case it is not constrained.
Thus, for certain types of statistical dependencies between X and Y, one can rule out the hypothesis of Fig. 4(a). For example, suppose we know the cardinality of U to be 3. Corollary 8.1 then implies the constraint that the mutual information between X and Y cannot exceed log₂(3) ≈ 1.585. Suppose further that we observe the distribution depicted in Table 1. The mutual information between X and Y in this distribution is ≈ 1.594. Because this mutual information violates the constraint implied by the model in Fig. 4(a), we know this model cannot be correct, and conclude that Fig. 4(b) is correct. More generally, strong statistical dependence between high cardinality variables cannot be explained by a low cardinality common cause, and requires a direct influence between them.
        Y=0    Y=1    Y=2    Y=3
X=0    0.002  0.001  0.400  0.001
X=1    0.003  0.005  0.005  0.066
X=2    0.224  0.003  0.003  0.001
X=3    0.002  0.281  0.001  0.002

Table 1: An example joint distribution over two variables X and Y, each with cardinality 4.
Conversely, suppose Fig. 4(a) is known to be correct, and that there is no direct causal influence between the two aspects of phenotype. If the cardinality of U is not known, it can be bounded from below directly from observed data, according to Corollary 8.1. In this case, the lower bound would be 2^{I(X:Y)} ≈ 2^{1.594} ≈ 3.018. It follows that U must have a cardinality of 4 or above in this setting. The ability to extract such information from observational data may be useful in making substantive scientific decisions, or in guiding future sequencing studies.
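Both numerical claims can be reproduced directly from Table 1 with a few lines of NumPy (our own sketch):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Table 1; rows index X = 0..3, columns index Y = 0..3.
pxy = np.array([[0.002, 0.001, 0.400, 0.001],
                [0.003, 0.005, 0.005, 0.066],
                [0.224, 0.003, 0.003, 0.001],
                [0.002, 0.281, 0.001, 0.002]])

mi = entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0)) - entropy(pxy)
print(round(mi, 3))     # ~1.594 > log2(3) ~ 1.585, ruling out Fig. 4(a)
print(round(2**mi, 3))  # ~3.018, so |U| >= 4 if Fig. 4(a) is correct
```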
In many applied data analyses, different variables may be observed for different subjects, i.e., data on some variables is "missing" for some subjects. A recent line of work has focused on properties of missing data models that can be represented as DAGs [20]. Although the bounds and constraints above have been developed in the context of fully unobserved variables, they can also be used in missing data DAG models, for variables that are not observed for all subjects.
4.3 QUANTIFYING CAUSAL INFLUENCE
The traditional approach to measuring the strength of a causal relationship is by contrasting how different an outcome would be, on average, under two different treatments. Formally, if X is a cause of Y, the ACE is defined as E[Y(X=x) − Y(X=x′)]. While the ACE is a very useful construct, we suggest that it has two important shortcomings, and present an alternative measure of causal strength called the Minimal Mediary Entropy or MME. The MME is based on the idea – explored throughout this work – that the entropy of variables along a causal pathway provides insight into the amount of information that can travel along that pathway.

In a scenario where treatment can be discerned to always cause outcome, we might expect the ACE, as a measure of causal influence, to be large. The example below shows that this is not necessarily the case.
Example 1. Consider a randomized binary treatment X and a ternary outcome Y, with P(Y=0 | X=0) = P(Y=2 | X=0) = 0.5, and P(Y=1 | X=1) = 1. In this setting, ACE = 0, even though treatment affects outcome for every subject in the population.
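Since treatment is randomized in Example 1, E[Y(X=x)] = E[Y | X=x], and the ACE can be checked in a few lines (a sketch of ours):

```python
import numpy as np

y_vals = np.array([0.0, 1.0, 2.0])
p_y_given_x0 = np.array([0.5, 0.0, 0.5])  # P(Y | X=0)
p_y_given_x1 = np.array([0.0, 1.0, 0.0])  # P(Y | X=1)

ace = y_vals @ p_y_given_x1 - y_vals @ p_y_given_x0
print(ace)  # 0.0, although every subject's outcome depends on treatment
```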
In less extreme settings, the ACE may be very small even when treatment affects outcome for almost every subject in the population, or very large, even when very few subjects have an outcome that is affected by treatment.

The ACE is likewise not always well suited to measuring the strength of a causal relationship when the treatment variable is non-binary. In such situations, no one causal contrast represents the causal influence, and the number of possible contrasts grows combinatorially in the cardinality of treatment. We now define the MME and discuss how it can overcome these issues.
Figure 5: Modifying DAGs (a) and (b) by inserting a latent mediary W between X and Y yields DAGs (a′) and (b′) respectively. Note that in (a), even though X and Y are latent confounded, Corollary 10.1 gives MME_{X→Y} ≥ I(A:Y | C=c) for any c by exploiting the fact that A ∈ an(X). Also note that in (b), even though X affects Y both directly and indirectly through D, Corollary 10.1 gives MME_{X→Y} ≥ I(X:Y | D) − H(D) for the direct effect.
Definition 10 (Minimal Mediary Entropy (MME) for Direct Effect). Given a DAG G containing a directed edge X → Y, let G′_{X→W→Y} denote the graph constructed by substituting the single edge X → Y in G with the set of four edges {X → W → Y, W ← U_{WY} → Y}, introducing auxiliary latent variables W and U_{WY}.³ We then define MME_{X→Y} as the smallest entropy H(W) over all structural equations models reproducing the observed data distribution over G′_{X→W→Y} in which W has finite cardinality.

³ If a latent confounder were added between X and W, then although W would still mediate X → Y, X and Y would share a source of unobserved confounding, altering the causal model.
Fig. 5 illustrates the process of edge substitution. Essentially, the edge X → Y in G is interrupted to pass through W in G′_{X→W→Y}, such that the auxiliary latent variable W fully mediates the direct effect of X on Y. Note the caveat that MME is defined in terms of minimizing the entropy of W over finite cardinality W capable of reproducing the observed statistics. If W were allowed to be a continuously valued variable, then the observed data distribution would always be reproducible with arbitrarily small H(W), due to the total lack of restriction in the instrumental model with a continuous mediary [21].

With the presumption of finite cardinality W by fiat, however, we are in a position to exploit Theorem 5 in order to practically lower bound the MME.
Corollary 10.1 (Proof in Appendix C). Suppose that the graphical construction G′_{X→W→Y} exhibits the e-separation relation (A ⊥_e B | C upon {D, W}) for some A ⊂ {X} ∪ an(X) and for some B ⊂ {Y} ∪ desc(Y), where A, B, C, and D are nonoverlapping subsets (C and D possibly empty) of the observed variables in G, and all the variables within D are discrete. Then

MME_{X→Y} ≥ max_c I(A:B | C=c, D) − H(D | C=c)   (10a)
          ≥ I(A:B | C, D) − H(D | C).            (10b)
Suppose that P₀ is a distribution in the model of the extended graph G′_{X→W→Y}, such that P₀ marginalizes to the observed data distribution. Then the entropy H(W) in P₀ is necessarily an upper bound on MME_{X→Y}, i.e., we have found a W with entropy H(W) that fully mediates the causal influence of X on Y. Since W could always reproduce the observed data by simply copying the values of X, we have a trivial upper bound of MME_{X→Y} ≤ H(X).⁴ This upper bound can typically be improved by even partially exploring the space of the distributions in G′_{X→W→Y}.
Consider the simple model X → Y with |X| = 3 and |Y| = 3, and the observed data distribution P(X=x, Y=y) = 5/27 if x = y and 2/27 if x ≠ y. Our corollary gives us the lower bound MME_{X→Y} ≥ I(X:Y) ≈ 0.149, contrasted with the trivial upper bound MME_{X→Y} ≤ H(X) ≈ 1.585. We can improve the trivial upper bound by noting that this distribution can be reproduced by the following functional relationships: W = 0 and Y = U_{WY} when U_{WY} = X, while W = 1 and Y uniformly random when U_{WY} ≠ X, taking U_{WY} to be uniformly distributed with cardinality three. In this model for G′_{X→W→Y} we obtain P(W=0) = 1/3, corresponding to MME_{X→Y} ≤ H(W) ≈ 0.918.

⁴ Consider Example 1, which has ACE_{X→Y} = 0. That example has the feature that I(X:Y) = H(X). Accordingly, the lower bound MME_{X→Y} ≥ I(X:Y) is evidently tight, given the trivial upper bound MME_{X→Y} ≤ H(X).
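All three numbers in this example can be verified directly (our own sketch):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

pxy = np.full((3, 3), 2 / 27)
np.fill_diagonal(pxy, 5 / 27)  # P(X=x, Y=y): 5/27 on the diagonal

h_x = entropy(pxy.sum(axis=1))
mi = h_x + entropy(pxy.sum(axis=0)) - entropy(pxy)
h_w = entropy([1 / 3, 2 / 3])  # P(W=0) = 1/3 in the explicit model

print(round(mi, 3))   # ~0.149: lower bound on MME
print(round(h_x, 3))  # ~1.585: trivial upper bound H(X)
print(round(h_w, 3))  # ~0.918: improved upper bound H(W)
```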
5 RELATED WORK
This work builds most directly on Ref. [4], in which e-separation was introduced and Theorem 4 was derived, both of which are essential to our results. It follows in the tradition of a line of literature that aims to derive symbolic expressions of restrictions on the observed data distribution implied by a causal model with latent variables, including Refs. [2, 18, 22]. Entropic constraints were previously considered in Refs. [10–12]. The entropic constraint for the instrumental scenario appears as Equation (5) in Ref. [11]; see also Appendix E of Ref. [23]. Our work is also closely related to work in the information theory literature on how much information can pass through channels of varying types [24]. Our proposed measure of causal strength, the MME, is motivated by weaknesses in standard causal strength measures (e.g., the ACE), as previously discussed in Ref. [25].
Our results are also related to the causal discovery literature, which seeks to find the causal structures compatible with an observed data distribution [26]. The inequality constraints posed above can be used to check the outputs of existing causal discovery algorithms [26–28].
6 CONCLUSION
In this work, we present inequality constraints implied by e-separation relations in hidden variable DAGs. We have shown that these constraints can be used for a number of purposes, including adjudicating between causal models, bounding the cardinalities of latent variables, and measuring the strength of a causal relationship. e-separation relations can be read directly off a hidden variable DAG, leading to constraints that can be easily obtained.
This work opens up two avenues for future work. The first is that our constraints demonstrate a practical use of e-separation relations, and should motivate the study of fast algorithms for enumerating all such relations in hidden variable DAGs. The second is that the constraints suggest that equality-constraint-based causal discovery algorithms can be improved; understanding how the inequality constraints can best be used to this end will take careful study.
Acknowledgments

This research was supported by Perimeter Institute for Theoretical Physics. Research at Perimeter Institute is supported in part by the Government of Canada through the Department of Innovation, Science and Economic Development Canada and by the Province of Ontario through the Ministry of Colleges and Universities. The first author was supported in part by the Mathematical Institute for Data Science (MINDS) research fellowship. The third author was supported in part by grants ONR N00014-18-1-2760, NSF CAREER 1942239, NSF 1939675, and R01 AI127271-01A1. We thank Murat Kocaoglu for helpful discussion about the MME.
References
[1] E. Wolfe, R. W. Spekkens, and T. Fritz, "The Inflation Technique for Causal Inference with Latent Variables," J. Caus. Inf. 7, 70 (2019).
[2] C. Kang and J. Tian, "Inequality Constraints in Causal Models with Hidden Variables," in Proc. 22nd Conf. UAI (AUAI, 2006) pp. 233–240.
[3] D. Poderini, R. Chaves, I. Agresti, G. Carvacho, and F. Sciarrino, "Exclusivity graph approach to Instrumental inequalities," in Proc. 35th Conf. UAI (AUAI, 2019).
[4] R. J. Evans, "Graphical methods for inequality constraints in marginalized DAGs," in Proc. 2012 IEEE Intern. Work. MLSP (IEEE, 2012) pp. 1–6.
[5] J. Pienaar, "Which causal structures might support a quantum–classical gap?" New J. Phys. 19, 043021 (2017).
[6] J. Pearl, "Causal inference in statistics: An overview," Statist. Surv. 3, 96 (2009).
[7] T. S. Richardson and J. M. Robins, Single World Intervention Graphs (Now Publishers Inc, 2013).
[8] D. Geiger and C. Meek, "Quantifier Elimination for Statistical Problems," in Proc. 15th Conf. UAI (AUAI, 1999) pp. 226–235.
[9] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann, 1988).
[10] R. Chaves, L. Luft, T. O. Maciel, D. Gross, D. Janzing, and B. Schölkopf, "Inferring latent structures via information inequalities," in Proc. 30th Conf. UAI (AUAI, 2014) pp. 112–121.
[11] R. Chaves, L. Luft, and D. Gross, "Causal structures from entropic information: geometry and novel scenarios," New J. Phys. 16, 043001 (2014).
[12] M. Weilenmann and R. Colbeck, "Analysing causal structures with entropy," Proc. Roy. Soc. A 473, 20170483 (2017).
[13] M. Weilenmann and R. Colbeck, "Analysing causal structures in generalised probabilistic theories," Quantum 4, 236 (2020).
[14] B. Steudel and N. Ay, "Information-theoretic inference of common ancestors," Entropy 17, 2304 (2015).
[15] T. Verma and J. Pearl, "Equivalence and synthesis of causal models," in Proc. 6th Conf. UAI (AUAI, 1990).
[16] T. S. Richardson, R. J. Evans, J. M. Robins, and I. Shpitser, "Nested Markov Properties for Acyclic Directed Mixed Graphs," (2017), working paper.
[17] I. Shpitser, T. S. Richardson, and J. M. Robins, "Chapter 41: Multivariate counterfactual systems and causal graphical models," in Probabilistic and Causal Inference: The Works of Judea Pearl (ACM Books, 2021).
[18] J. Tian and J. Pearl, "On the Testable Implications of Causal Models with Hidden Variables," in Proc. 18th Conf. UAI (AUAI, 2002).
[19] R. Bhattacharya, T. Nagarajan, D. Malinsky, and I. Shpitser, "Differentiable Causal Discovery Under Unmeasured Confounding," (2020), arXiv:2010.06978.
[20] K. Mohan, J. Pearl, and J. Tian, "Graphical Models for Inference with Missing Data," in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2013) pp. 1277–1285.
[21] F. Gunsilius, "A path-sampling method to partially identify causal effects in instrumental variable models," (2019), working paper.
[22] A. Balke and J. Pearl, "Nonparametric Bounds on Causal Effects from Partial Compliance Data," J. Am. Stat. Ass. (1993).
[23] J. Henson, R. Lal, and M. F. Pusey, "Theory-independent limits on correlations from generalized Bayesian networks," New J. Phys. 16, 113043 (2014).
[24] A. El Gamal and Y.-H. Kim, Network Information Theory (Cambridge University Press, 2011).
[25] D. Janzing, D. Balduzzi, M. Grosse-Wentrup, B. Schölkopf, et al., "Quantifying causal influences," Ann. Stat. 41, 2324 (2013).
[26] P. L. Spirtes, C. N. Glymour, and R. Scheines, Causation, Prediction, and Search (MIT Press, 2000).
[27] E. V. Strobl, S. Visweswaran, and P. L. Spirtes, "Fast causal inference with non-random missingness by test-wise deletion," Int. J. Data Sci. Analyt. 6, 47 (2018).
[28] D. Bernstein, B. Saeed, C. Squires, and C. Uhler, "Ordering-Based Causal Structure Learning in the Presence of Latent Variables," in Proc. 23rd Int. Conf. Art. Intell. Stat., Vol. 108 (PMLR, 2020) pp. 4098–4108.
[29] R. J. Evans and T. S. Richardson, "Smooth, identifiable supermodels of discrete DAG models with latent variables," Bernoulli 25, 848 (2019).
[30] D. Malinsky, I. Shpitser, and T. S. Richardson, "A potential outcomes calculus for identifying conditional path-specific effects," in Proc. 22nd Int. Conf. Art. Intell. Stat. (2019).
[31] T. Gläßle, D. Gross, and R. Chaves, "Computational tools for solving a marginal problem with applications in Bell non-locality and causal modeling," J. Phys. A 51, 484002 (2018).
[32] E. H. Lieb, "Some convexity and subadditivity properties of entropy," Bull. Am. Math. Soc. 81, 1 (1975).
[33] G. R. Kumar, C. T. Li, and A. El Gamal, "Exact common information," in 2014 IEEE International Symposium on Information Theory (2014) pp. 161–165.
[34] D. Geiger and J. Pearl, "On the Logic of Causal Models," in Proc. 4th Conf. UAI (AUAI, 1998) pp. 136–147.
A COMPARING ENTROPIC INEQUALITIES TO GENERALIZED INDEPENDENCE RELATIONS

Figure S6: In graphs (a) and (b), the entropic inequality constraints are logically implied by equality constraints. Graphs (b) and (c) demonstrate that for a set of variables D, the counterfactual random variable D(D=d) is not necessarily equal to the factual D. Graph (d) provides an example where the entropic inequality constraints remain relevant even though the counterfactual distribution after intervention on an e-separating set over the remaining variables is identified.
In Proposition 7, we showed that for graphical models in which the counterfactual P(A(D=d), B(D=d), D(D=d)=d | C) is identified, the entropic constraints of Theorem 5 are weaker than the corresponding Verma constraints. We now illustrate this point with a few examples. In Fig. S6(a) the counterfactual P(A(D=d), B(D=d), D(D=d)=d) is identified, and in Fig. S6(b) the counterfactual P(A(D=d), B(D=d), D(D=d)=d | C) is identified. Accordingly, our entropic inequalities are implied by equality constraints, due to Proposition 7. The resulting inequality constraints therefore cannot provide any additional information about whether these causal structures are compatible with observed distributions.

By contrast, in Fig. S6(d) the counterfactual P(A(D=d), B(D=d), D(D=d)=d | C) is not identified, even though P(A(D=d), B(D=d) | C) is. Although Fig. S6(d) implies no equality constraints [29], we find that it does entail the entropic inequality constraint following from the e-separation relation (A ⊥_e B upon {D1, D2}). It is therefore an example of a graph in which our inequality constraints are not made redundant by known equality constraints, despite the fact that intervention on D is identified. This example is also an illustration of the fact that not every equality restriction featuring non-adjacent variables in an identifiable counterfactual distribution implies equality restrictions on the observed data distribution. However, some such non-adjacent variables may be involved in inequality restrictions.
The critical identifiability question for determining whether the entropic constraints are made redundant by equality constraints is that of P(A(D=d), B(D=d), D(D=d)=d | C). This distribution involves the counterfactual random variable D(D=d). Note that although any single random variable under intervention on itself is equivalent to the random variable under no intervention, the same does not necessarily hold for sets of random variables. Figs. S6(b) and S6(c) demonstrate this point – because D2 is a descendant of D1, after intervention on both, D2 no longer takes its natural value.
B E-SEPARATION IN IDENTIFIED COUNTERFACTUAL DISTRIBUTIONS

Figure S7: In all three graphs, A and B are e-separated by D after intervention on C. The counterfactual distribution over {A, B, D} after intervention on C is only identified in graphs (a) and (c), however.
A Single World Intervention Graph (SWIG) [7], which represents the model after intervention on one or more random variables, can be obtained through a node-splitting operation as illustrated in Fig. S6(c). As described in Section 3.2, d-separation relations that appear under interventions with identified distributions can be used to derive equality constraints on the observed data distribution. In this section, we explore the significance of e-separation relations in identified counterfactual distributions. We begin by noting that any e-separation relation that exists in a SWIG corresponds to an e-separation in the original DAG.

Proposition 11. (A ⊥_e B | C upon D) after intervention on E only if (A ⊥_e B | C upon {D, E}).
This proposition follows directly from the relationship between the fixing [16] and deletion operations. In particular, fixing and deleting vertices induce the same graphical relationships among the remaining variables in the graph.

It may at first seem that this result indicates that e-separation relations in SWIGs cannot be used to derive inequality constraints on the observed data distribution that are not already implied by e-separation relations in the original model. However, entropic inequality constraints on counterfactual distributions have a different form than such constraints on the factual distribution. This is because entropies of counterfactual variables do not in general correspond to entropies of factual variables, so there is no way to express inequality constraints that follow from e-separation relations in SWIGs as entropic inequalities on the original distribution.
To illustrate this point, consider Fig. S7. In each graph, (A ⊥_e B upon D) in the SWIG resulting from intervention on C. However, in Fig. S7(b), the distribution after intervention on C is not identified, whereas in Figs. S7(a) and S7(c) it is identified as

P(A(c), B(c), D(c)) = ∑_x P(A, B, C=c, D, X=x) / P(C=c | X=x).

This means the entropic inequalities I(A(c):B(c)) ≤ H(D(c)) on this counterfactual distribution (one for each level of C) imply inequality constraints on the observed data distribution as well. These inequality constraints will be obtained in Figs. S7(a) and S7(c), but not in Fig. S7(b).
Moreover, these inequality constraints can be shown to be nontrivial. Since Figs. S7(a) and S7(b) share the same d-separation and e-separation relations, it follows that any distributions compatible with Fig. S7(b) cannot be witnessed as incompatible with Fig. S7(a) using non-nested entropic equalities or inequalities. Consider the following structural equation model for Fig. S7(b): Let U1, U2 and U3 be binary and uniformly distributed, and let X = U2, A = U2 ⊕ ε_A, C = X ⊕ U3, B = C ⊕ U3 ⊕ ε_B, and D = ε_D, where "⊕" indicates addition mod 2 and where ε_k is a random variable very heavily biased towards zero for k ∈ {A, B, D}. This establishes that C and X are uniformly distributed and statistically independent from each other, and hence that P(A, B) = P(A(c=0), B(c=0)). This construction also gives A ⊕ B = U2 ⊕ ε_A ⊕ C ⊕ U3 ⊕ ε_B = U2 ⊕ ε_A ⊕ X ⊕ ε_B = ε_A ⊕ ε_B and hence A ≈ B. This yields I(A(c=0):B(c=0)) ≈ H(A) ≈ 1 whereas H(D(c=0)) = H(D) ≈ 0, strongly violating the entropic inequality I(A(c=0):B(c=0)) ≤ H(D(c=0)), which applies only to Fig. S7(a).
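This construction is easy to simulate (our sketch; the bias of each ε towards zero is set to 0.99 here, U1 plays no role in the structural equations, and entropies are plug-in estimates):

```python
import numpy as np

def entropy_of(samples):
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(1)
n = 500_000
u2 = rng.integers(0, 2, n)
u3 = rng.integers(0, 2, n)

def eps():
    return (rng.random(n) < 0.01).astype(int)  # heavily biased towards 0

x = u2
a = u2 ^ eps()
c = x ^ u3
b = c ^ u3 ^ eps()
d = eps()

# Per the argument above, P(A, B) = P(A(c=0), B(c=0)); condition on C = 0.
m = c == 0
mi_ab = entropy_of(a[m]) + entropy_of(b[m]) - entropy_of(2 * a[m] + b[m])
print(round(mi_ab, 2))             # ~0.86, close to H(A) ~ 1
print(round(entropy_of(d[m]), 2))  # ~0.08, close to 0: inequality violated
```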
C PROOFS
Proof of Theorem 5
Let G# represent the graph in which every node in D is split, and let P* denote the distribution over variables in G#. For notational convenience, we let P_{d#}(· | ·) = P*(· | ·, D#=d#), and let I_{d#} and H_{d#} be the mutual information and entropy in this distribution. Recall that by Theorem 4, if (A ⊥_e B | C upon D), then:

4.i. I_{d#}(A : B | C=c) = 0, and
4.ii. P(A, B, D=d# | C=c) = P_{d#}(A, B, D=d# | C=c).

From the latter condition (4.ii) we readily have that H(· | ·, D=d#) = H_{d#}(· | ·, D=d#).
It should also be clear that

H(X) = H_{d#}(X) whenever X are among the nondescendants of D# in G#.   (S11)

In our argument below, C and D are examples of such a set X. If we view the distribution P_{d#} from which H_{d#} is derived as an interventional distribution, then the above identity follows from the exclusion restriction displayed graphically by rule 3 of the po-calculus [30].
It will prove extremely useful to show that H(· | ·, D) = H_{d#}(· | ·, D). This can be seen to follow from H(· | ·, D=d#) = H_{d#}(· | ·, D=d#) and P(D=d# | ·) = P_{d#}(D=d# | ·) via

H(·_pre | ·_post, D) = ∑_d P(d | ·_post) H(·_pre | ·_post, d)
 = ∑_{d#} P(d# | ·_post) H(·_pre | ·_post, d#)   [summing over dummy index]
 = ∑_{d#} P_{d#}(d# | ·_post) H_{d#}(·_pre | ·_post, d#)   [applying Theorem 4]
 = H_{d#}(·_pre | ·_post, D).   (S12)
We will use conditions (S11) and (S12) to translate entropic constraints on P_{d#} into entropic constraints on P. The two cases in Theorem 5 share the same implicit entropic constraints on P_{d#}, but the implications on P are different: the scope of condition (S11)'s applicability increases under the promise that A are nondescendants of D in G.
From this point on we will focus on deriving entropic inequality constraints on P_{d#} such that all the terms in the derived inequalities are translatable according to conditions (S11) and (S12)⁵, because such constraints apply both to P_{d#} and P. We hereafter denote C=c by simply c for notational brevity. Firstly, consider the following entropic inequalities:

0 ≤ I_{d#}(A:D | c) = H_{d#}(A | c) − H_{d#}(A | c, D),   (S13a)
0 ≤ I_{d#}(B:D | c) = H_{d#}(B | c) − H_{d#}(B | c, D),   (S13b)
0 ≤ H_{d#}(D | A, B, c) = H_{d#}(A, B, D | c) − H_{d#}(A, B | c),   (S13c)
0 ≤ −I_{d#}(A:B | c) = H_{d#}(A, B | c) − H_{d#}(A | c) − H_{d#}(B | c).   (S13d)
The first two are subadditivity inequalities, which follow from the nonnegativity of conditional mutual information. Subadditivity holds for both discrete and continuously valued variables [32]. The penultimate inequality follows from monotonicity (the fact that all conditional entropies are nonnegative). Conditional entropy is only guaranteed to be nonnegative for discrete variables, and this is why Theorem 5 demands that D be discrete. The final inequality is an expression of the fact that I_{d#}(A:B | c) = 0 per (4.i) above. Summing all four inequalities (S13) leads to the derived inequality

0 ≤ H_{d#}(A, B, D | c) − H_{d#}(A | c, D) − H_{d#}(B | c, D).   (S14)

By applying conditions (S11) and (S12) to inequality (S14) we obtain

0 ≤ H(A, B, D | c) − H(A | c, D) − H(B | c, D), i.e., I(A:B | c, D) ≤ H(D | c).   (S15)

⁵ Formally, the problem of inferring the implications of a system of linear inequalities on a strict subset of their variables may be solved by means of Fourier–Motzkin elimination or related algorithms [31].
Now consider the case where we are further promised that
A
are nondescendants of
D
in
G
and hence nondescendants of
D#
in
G#
. This means that in addition to the above results we also have that
Hd#pA|cq“ HpA|cq
. We proceed as before, but
instead of summing all four of the (S13) inequalities we only take the sum of the latter three. This yields
0ďHd#pA,B,D|cq´Hd#pA|cq´Hd#pB|c,Dq(S16)
By applying conditions (S11) and (S12) to inequality (S16) we obtain
0ďHpA,B,D|cq´HpA|cq´HpB|c,Dq
i.e., IpA:B,D|cqď HpD|cq.(S17)
In both cases, the constraint is maintained after taking the expectation of both sides with respect to $C$: because each term in the expectation satisfies the inequality, so does the sum.
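The rearrangement of (S16) into the bound (S17) is itself an identity of Shannon quantities that holds for any joint distribution. The following sketch (our illustration; an arbitrary random table stands in for $P_{d^\#}$ at a fixed $c$) confirms it numerically.

```python
import numpy as np

rng = np.random.default_rng(2)

def H(p):
    """Shannon entropy (bits) of a probability table."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

P = rng.dirichlet(np.ones(3 * 3 * 2)).reshape(3, 3, 2)   # arbitrary P(a, b, d)

HA, HD = H(P.sum(axis=(1, 2))), H(P.sum(axis=(0, 1)))
HBD, HABD = H(P.sum(axis=0)), H(P)

s16_rhs = HABD - HA - (HBD - HD)         # the quantity bounded in (S16)/(S17)
I_A_BD = HA + HBD - HABD                 # I(A : B,D)
assert np.isclose(s16_rhs, HD - I_A_BD)  # so s16_rhs >= 0  <=>  I(A:B,D) <= H(D)
print(f"I(A:B,D) = {I_A_BD:.4f}, H(D) = {HD:.4f}")
```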
When $C$ are not entirely among the nondescendants of $D$ in $G$, we can still obtain $I(A : B \mid C, D) \le H(D)$ by noting that
$$\begin{aligned}
0 &\le I_{d^\#}(A : D \mid C) + I_{d^\#}(B : D \mid C) - I_{d^\#}(A : B \mid C) + H_{d^\#}(D \mid A, B, C) + I_{d^\#}(C : D) && \text{(S18a)} \\
&= H_{d^\#}(A, B, C \mid D) - H_{d^\#}(A, C \mid D) - H_{d^\#}(B, C \mid D) + H_{d^\#}(C \mid D) + H_{d^\#}(D) && \text{(S18b)} \\
&= H(A, B, C \mid D) - H(A, C \mid D) - H(B, C \mid D) + H(C \mid D) + H(D) && \text{(S18c)} \\
&= H(D) - I(A : B \mid C, D). && \text{(S18d)}
\end{aligned}$$
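The equality of (S18a) and (S18d) is purely algebraic: rewriting every mutual information and conditional entropy in terms of joint entropies shows that both sides reduce to $H(A,B,C,D) - H(A,C,D) - H(B,C,D) + H(C,D) + H(D)$. The sketch below (our illustration, using an arbitrary random joint distribution; the helper Hm is made up) verifies this reduction numerically.

```python
import numpy as np

rng = np.random.default_rng(3)

def H(p):
    """Shannon entropy (bits) of a probability table."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Arbitrary joint P(a, b, c, d); the identity (S18a) = (S18d) holds for any P.
P = rng.dirichlet(np.ones(2 * 2 * 3 * 2)).reshape(2, 2, 3, 2)

def Hm(*axes):
    """Entropy of the marginal over the named axes (0=A, 1=B, 2=C, 3=D)."""
    drop = tuple(i for i in range(4) if i not in axes)
    return H(P.sum(axis=drop)) if drop else H(P)

# (S18a): the sum of mutual informations and a conditional entropy.
I_AD_C  = Hm(0, 2) + Hm(2, 3) - Hm(0, 2, 3) - Hm(2)       # I(A:D|C)
I_BD_C  = Hm(1, 2) + Hm(2, 3) - Hm(1, 2, 3) - Hm(2)       # I(B:D|C)
I_AB_C  = Hm(0, 2) + Hm(1, 2) - Hm(0, 1, 2) - Hm(2)       # I(A:B|C)
HD_ABC  = Hm(0, 1, 2, 3) - Hm(0, 1, 2)                    # H(D|A,B,C)
I_CD    = Hm(2) + Hm(3) - Hm(2, 3)                        # I(C:D)
s18a = I_AD_C + I_BD_C - I_AB_C + HD_ABC + I_CD

# (S18d): H(D) - I(A:B|C,D).
I_AB_CD = Hm(0, 2, 3) + Hm(1, 2, 3) - Hm(0, 1, 2, 3) - Hm(2, 3)
s18d = Hm(3) - I_AB_CD

assert np.isclose(s18a, s18d)
print(f"(S18a) = (S18d) = {s18a:.4f}")
```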
Note that this proof technique can be adapted to derive stronger entropic inequalities for graphs which exhibit multiple different $e$-separation relations involving the same $D$ set. If $(A_1 \perp_e B_1 \mid C_1 \text{ upon } D)$ and $(A_2 \perp_e B_2 \mid C_2 \text{ upon } D)$ and so forth, then Theorem 4 still demands the existence of a single $P_{d^\#}$ whose various margins must now satisfy multiple distinct vanishing conditional mutual information equalities. We can accommodate multiple entropic equality constraints on $P_{d^\#}$ just as easily as we can accommodate a single equality constraint: the translation between constraints on $P_{d^\#}$ and $P$ will continue to be governed by conditions (S11) and (S12).
Proof of Proposition 6

If conditioning on some variables $D$ is sufficient to close a path, then that path must go through $D$, and therefore deletion of $D$ eliminates the path. By construction, the deletion operation can never open a path, unlike the conditioning operation. If $(A \perp_d B \mid C, D)$, then all paths from $A$ to $B$ go through $C$ or $D$, or through colliders that are not in $\{C, D\}$ and have no descendants therein. It follows that $(A \perp_e B \mid C \text{ upon } D)$, as after deletion of $D$ all paths through $C$ remain blocked through conditioning, all paths through $D$ are eliminated, and all other paths remain blocked by colliders.
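The deletion operation also suggests a simple computational test of $(A \perp_e B \mid C \text{ upon } D)$: delete the nodes in $D$, then run an ordinary d-separation query on what remains. A minimal sketch using NetworkX follows (our illustration; it assumes NetworkX >= 3.3, where the d-separation test is exposed as nx.is_d_separator; in versions 2.8-3.2 the same test was called nx.d_separated).

```python
import networkx as nx

def e_separated(G, A, B, C, D):
    """Test (A _||_e B | C upon D): delete the nodes in D, then test d-separation."""
    G_del = G.copy()
    G_del.remove_nodes_from(D)
    return nx.is_d_separator(G_del, set(A), set(B), set(C))

# Example: the chain A -> D -> B. A and B are not d-separated by the empty set,
# but deleting the bottleneck D eliminates the only path.
G = nx.DiGraph([("A", "D"), ("D", "B")])
print(e_separated(G, {"A"}, {"B"}, set(), {"D"}))   # True
```

Per Proposition 6, any triple for which the d-separation query already succeeds given $C \cup D$ will also pass this deletion test.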
Proof of Proposition 8

Firstly, note that $(A \perp_d B \mid C, U)$ implies that
$$0 = I(A : B \mid C = c, U) = H(A, U \mid C = c) + H(B, U \mid C = c) - H(A, B, U \mid C = c) - H(U \mid C = c),$$
i.e., that
$$H(U \mid C = c) = H(A, U \mid C = c) + H(B, U \mid C = c) - H(A, B, U \mid C = c). \tag{S19}$$
Expanding $H(A, U \mid C = c) = H(A \mid C = c) + H(U \mid A, C = c)$ and $H(B, U \mid C = c) = H(B \mid C = c) + H(U \mid B, C = c)$, and noting that $H(A, B, U \mid C = c) = H(A, B \mid C = c) + H(U \mid A, B, C = c)$, equation (S19) gives the lower bound
$$H(U \mid C = c) \ge H(A \mid C = c) + H(B \mid C = c) - H(A, B \mid C = c) = I(A : B \mid C = c) \tag{S20}$$
whenever $H(U \mid A, C = c) + H(U \mid B, C = c) \ge H(U \mid A, B, C = c)$. The latter condition holds because $H(U \mid A, C = c) \ge H(U \mid A, B, C = c)$ (conditioning never increases entropy, even for continuously valued variables [32]), and because $H(U \mid \cdot) \ge 0$ whenever $U$ is discrete. The result $H(U) \ge H(U \mid C = c)$ follows from strong subadditivity, i.e., the fact that conditional entropy is never greater than marginal entropy, even for continuously valued variables [32].
To obtain Corollary 8.1, simply note that the constraint is maintained after taking the expectation of both sides with respect to $C$: because each term in the expectation satisfies the inequality whenever the cardinality of $U$ is finite, so does the sum.
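As a concrete check of the bound (our illustration, with $C$ taken to be trivial and all names made up), the sketch below draws a random distribution of the form $P(a, b, u) = P(u)\,P(a \mid u)\,P(b \mid u)$, so that $A \perp B \mid U$ holds by construction, and verifies that $I(A : B) \le H(U)$.

```python
import numpy as np

rng = np.random.default_rng(4)

def H(p):
    """Shannon entropy (bits) of a probability table."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Construct P(a, b, u) with A _||_ B | U: P(a,b,u) = P(u) P(a|u) P(b|u).
nA, nB, nU = 4, 4, 2
pU = rng.dirichlet(np.ones(nU))
pA_U = rng.dirichlet(np.ones(nA), size=nU)      # P(a|u), shape (nU, nA)
pB_U = rng.dirichlet(np.ones(nB), size=nU)      # P(b|u), shape (nU, nB)
P = np.einsum("u,ua,ub->abu", pU, pA_U, pB_U)   # joint P(a, b, u)

HU = H(pU)
HA, HB, HAB = H(P.sum(axis=(1, 2))), H(P.sum(axis=(0, 2))), H(P.sum(axis=2))
I_AB = HA + HB - HAB

assert I_AB <= HU + 1e-9      # the bound of Proposition 8
print(f"I(A:B) = {I_AB:.4f} <= H(U) = {HU:.4f}")
```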
Proof of Corollary 10.1

A consequence of $X \in \mathrm{pa}(W)$ is that $A \subseteq X \cup \mathrm{an}(X)$ implies that no element of $A$ is a descendant of $W$. This allows us to confirm the following sequence of inequalities:
$$\begin{aligned}
&\max_c I(A : B \mid C = c, D) && \text{(S21a)} \\
&\quad \le \max_c \max_{\text{SEMs for } G'} I(A : B, W \mid C = c, D) && \text{(S21b)} \\
&\quad \le \max_c \max_{\text{SEMs for } G'} H(D, W \mid C = c) && \text{(S21c)} \\
&\quad \le \max_c \Big( H(D \mid C = c) + \max_{\text{SEMs for } G'} H(W \mid C = c) \Big) && \text{(S21d)} \\
&\quad \le \max_c H(D \mid C = c) + \max_{\text{SEMs for } G'} H(W) && \text{(S21e)} \\
&\quad = \max_c H(D \mid C = c) + \mathrm{MME}_{X \to Y},
\end{aligned}$$
where all the steps above are consequences of subadditivity except for the step from Equation (S21b) to Equation (S21c), which is just the application of Equation (4a). Finally, Equation (10b) follows from Equation (10a) by convexity.
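Two of the generic steps above, the monotonicity step from (S21a) to (S21b) and the subadditivity step from (S21c) to (S21d), hold for every joint distribution, not only those arising from SEMs for $G'$. The sketch below (our illustration; the helper Hm is made up) spot-checks both on a random joint distribution over $(A, B, D, W)$.

```python
import numpy as np

rng = np.random.default_rng(5)

def H(p):
    """Shannon entropy (bits) of a probability table."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Arbitrary joint P(a, b, d, w); axes 0=A, 1=B, 2=D, 3=W.
P = rng.dirichlet(np.ones(2 * 2 * 2 * 3)).reshape(2, 2, 2, 3)

def Hm(*axes):
    """Entropy of the marginal over the named axes."""
    drop = tuple(i for i in range(4) if i not in axes)
    return H(P.sum(axis=drop)) if drop else H(P)

# Monotonicity, as in (S21a) -> (S21b): I(A:B|D) <= I(A:B,W|D).
I_AB_D   = Hm(0, 2) + Hm(1, 2) - Hm(0, 1, 2) - Hm(2)
I_A_BW_D = Hm(0, 2) + Hm(1, 2, 3) - Hm(0, 1, 2, 3) - Hm(2)
assert I_AB_D <= I_A_BW_D + 1e-9

# Subadditivity, as in (S21c) -> (S21d): H(D,W) <= H(D) + H(W).
assert Hm(2, 3) <= Hm(2) + Hm(3) + 1e-9
print("generic information-inequality steps verified")
```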
D RELATION BETWEEN COMMON ENTROPY AND MME

The MME bears some resemblance to a concept called common entropy [33], which is defined for a distribution $P(X, Y)$ as the smallest possible entropy of an unobserved variable $W$ such that $X \perp Y \mid W$. Unlike the MME, the common entropy is a function only of the probability distribution $P(X, Y)$, and not of the graph $G$. Any $W$ that renders $X$ and $Y$ conditionally independent must also fully mediate the effect of $X$ on $Y$, which at first glance might be taken to mean that the common entropy is an upper bound on the MME, because it implies that the MME can search over a larger set of distributions to obtain a low-entropy mediator. Indeed, in the simple $X \to Y$ model, $\mathrm{MME}_{X \to Y}$ is bounded from above by the common entropy between $X$ and $Y$ for precisely this reason.
However, the common entropy is not an upper bound on the MME in general. To see this, consider the graph presented in Fig. 1(c). This model contains distributions in which $A$ and $B$ are highly correlated, but $D$ and $B$ are entirely uncorrelated. For such distributions, the common entropy of $B$ and $D$ would be $0$, as they are already marginally independent. However, by Corollary 10.1, the MME would be bounded from below by $I(A : B)$, which can be larger than $0$. The intuition for this phenomenon is that if the edge $D \to B$ were missing, $A$ and $B$ would be marginally independent, so a high mutual information between them is evidence for the causal significance of the edge.
E AN INVERSE OF THEOREM 5

The goal of this appendix is to establish that

Proposition 12. If a graph $G$ has the feature $(A \not\perp_e B \mid C \text{ upon } D)$, then there exists a distribution in the marginal model of $G$ with discrete $D$ such that $I(A : B \mid C, D) = I(A : B \mid C) > H(D)$.
We begin by simply noting that

Lemma 13. The marginal model of any graph $G$ whose nodes include $\{A, B, C, D\}$ contains all conditional distributions $P(A, B \mid C, D)$ wherein

1. $P(A, B \mid C, D) = P(A, B \mid C)$, and

2. $P(A, B \mid C)$ is within the marginal model of the graph $G'$ defined by removing the outgoing edges of $D$ in $G$.
The sorts of $P(A, B \mid C, D)$ described in Lemma 13 arise by considering causal models wherein every child of $D$ always ignores the value of $D$, treating $D$ as if it had no descendants. We next invoke the completeness of d-separation. That is,

Lemma 14. If $(A \not\perp_d B \mid C)$ in $G$, then there exists a distribution in the marginal model of $G$ for which $I(A : B \mid C) > 0$.
We note that Lemma 14 follows from

Lemma 15. If $(A \not\perp_d B \mid C)$ in $G$ for singleton nodes $A$ and $B$, then there exists a distribution in the marginal model of $G$ for which $\exists c$ such that $I(A : B \mid C = c) > 0$.
After all, for node sets $\mathbf{A}$ and $\mathbf{B}$ we have $I(\mathbf{A} : \mathbf{B} \mid C) = 0$ if and only if $I(A : B \mid C) = 0$ for all singleton nodes $A \in \mathbf{A}$ and $B \in \mathbf{B}$. Moreover, $I(A : B \mid C) = 0$ if and only if $I(A : B \mid C = c) = 0$ for all $c$ having positive support. We believe that Lemma 15 follows from Theorem 3 in Ref. [34], but we provide an explicit proof of it below for completeness.
By combining Lemmas 13 and 14 we obtain Proposition 12. This follows by noting that whenever a graph $G$ has the feature $(A \not\perp_e B \mid C \text{ upon } D)$, then by definition the graph $G'$ defined by removing the outgoing edges of $D$ in $G$ exhibits $(A \not\perp_d B \mid C)$. To violate the basic inequality in Theorem 5 we apply Lemma 14 while keeping $H(D) < I(A : B \mid C)$. We can make $H(D)$ arbitrarily small by heavily biasing $P(D)$ towards one value.
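For instance (a numerical aside of ours, not part of the original argument), the entropy of a binary $D$ with $P(D = 1) = \epsilon$ is $h(\epsilon) = -\epsilon \log_2 \epsilon - (1 - \epsilon)\log_2(1 - \epsilon)$, which can be driven below any fixed positive $I(A : B \mid C)$:

```python
import numpy as np

# Binary entropy h(eps) -> 0 as the bias eps -> 0, so H(D) can be made
# smaller than any fixed positive I(A:B|C) from Lemma 14's construction.
for eps in (0.1, 0.01, 0.001):
    h = -(eps * np.log2(eps) + (1 - eps) * np.log2(1 - eps))
    print(f"P(D=1) = {eps}:  H(D) = {h:.4f} bits")
```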
Proof of Lemma 15

The following construction yields $I(A : B \mid C = 1) = \log(2)$ whenever $G$ exhibits $(A \not\perp_d B \mid C)$ for singleton nodes $A$ and $B$. If $(A \not\perp_d B \mid C)$ in $G$, then there exists some path in $G$ with end nodes $A$ and $B$ such that all colliders in the path are elements of $C$ and no element of $C$ is present in the path except as a collider. We classify the nodes within the path into three distinct types:

Two incoming edges from the path: These are the colliders in the path, elements of $C$. We take each such node to act as a Kronecker delta function of its two in-path parents; that is, it returns the value $1$ iff the two in-path parents have coinciding values:
$$y(x_1, x_2) = \begin{cases} 0 & \text{with unit probability iff } x_1 \ne x_2, \\ 1 & \text{with unit probability iff } x_1 = x_2. \end{cases} \tag{S22}$$

One incoming edge from the path: These are the mediaries in the path, as well as at least one (perhaps both) of the end nodes of the path. Let each such variable act as the identity function of its single in-path parent:
$$y(x) = \begin{cases} 0 & \text{with unit probability iff } x = 0, \\ 1 & \text{with unit probability iff } x = 1. \end{cases} \tag{S23}$$

Zero incoming edges from the path: These are the bases of forks in the path, as well as potentially one of the end nodes of the path. Let each such variable be uniformly random with cardinality $2$:
$$y() = \begin{cases} 0 & \text{with probability } \tfrac{1}{2}, \\ 1 & \text{with probability } \tfrac{1}{2}. \end{cases} \tag{S24}$$

This construction results in every non-collider being uniformly distributed over $\{0, 1\}$ and always taking the same value as every other non-collider in the path upon postselecting all colliders in the path to take the value $1$. That is, this construction explicitly ensures that $I(A : B \mid C = 1) = \log(2)$.
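To illustrate the construction (our example; the path $A \to Y \leftarrow B$ with the collider $Y \in C$ is assumed), the sketch below samples the end nodes as fair coins per (S24), makes the collider a Kronecker delta per (S22), and estimates $I(A : B \mid Y = 1)$ empirically; it comes out to $\log 2$, i.e. one bit.

```python
import numpy as np

rng = np.random.default_rng(6)

def H(p):
    """Shannon entropy (bits) of a probability table."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Path A -> Y <- B, following (S22)-(S24): A and B are fair coins
# (zero in-path parents), and the collider Y = 1 iff A == B.
N = 200_000
A = rng.integers(0, 2, N)
B = rng.integers(0, 2, N)
Y = (A == B).astype(int)

# Empirical I(A:B | Y=1): postselect on the collider taking the value 1.
sel = Y == 1
joint = np.histogram2d(A[sel], B[sel], bins=2)[0]
joint /= joint.sum()
I = H(joint.sum(axis=1)) + H(joint.sum(axis=0)) - H(joint)
print(f"I(A:B|Y=1) ~ {I:.4f} bits (log 2 = 1 bit)")
```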