Entropic Inequality Constraints from e-separation Relations in Directed Acyclic Graphs with Hidden Variables

Noam Finkelstein^1, Beata Zjawin^{2,3}, Elie Wolfe^2, Ilya Shpitser^1, Robert W. Spekkens^2

^1 Johns Hopkins University, Department of Computer Science, 3400 N Charles St, Baltimore, MD, USA 21218
^2 Perimeter Institute for Theoretical Physics, 31 Caroline St. N, Waterloo, Ontario, Canada N2L 2Y5
^3 International Centre for Theory of Quantum Technologies, University of Gdańsk, 80-308 Gdańsk, Poland
Abstract

Directed acyclic graphs (DAGs) with hidden variables are often used to characterize causal relations between variables in a system. When some variables are unobserved, DAGs imply a notoriously complicated set of constraints on the distribution of observed variables. In this work, we present entropic inequality constraints that are implied by e-separation relations in hidden variable DAGs with discrete observed variables. The constraints can intuitively be understood to follow from the fact that the capacity of variables along a causal pathway to convey information is restricted by their entropy; e.g. at the extreme case, a variable with entropy 0 can convey no information. We show how these constraints can be used to learn about the true causal model from an observed data distribution. In addition, we propose a measure of causal influence called the minimal mediary entropy, and demonstrate that it can augment traditional measures such as the average causal effect.
1 INTRODUCTION
A causal model of a system of random variables can be represented as a directed acyclic graph (DAG), where an edge from a node $X$ to a node $Y$ can be taken to mean that the random variable $X$ is a direct cause of the random variable $Y$. Such causal models can be used to algorithmically deduce highly non-obvious properties of the system. For example, it is possible to deduce that the probability distribution of observed variables in the system, called the observed data distribution, must satisfy certain constraints.
When some variables in the system are unobserved, the con-
straints implied by the causal model are not well understood,
and, for computational reasons, cannot be feasibly enumer-
ated in full for arbitrary graphs. As a result, a number of
methods have been developed for quickly providing a subset
of these constraints [1-3]. In this work, we contribute to this literature by describing entropic inequality constraints that hold whenever an e-separation relationship [4, 5] is present in the graph.
The idea underlying these inequality constraints is that mu-
tual information between two variables in a graphical model
must be explained by variability of variables (termed bottle-
neck variables) that are between them along some path. Such
paths need not be directed; a bottleneck variable may consti-
tute the base of a fork structure or the mediary variable in a
chain structure along the path. Each such path has a limited
capacity for carrying information, which can be quantified
in terms of the entropies of the bottleneck variables on that
path. At the extreme case, if there is a bottleneck variable along a path with zero entropy, then subsequent variables on that path cannot learn about prior variables through the path, because the bottleneck variable will hold a fixed value regardless of the values taken by any other variables, observed or unobserved. We will quantitatively relate the amount of information that can flow through a path to the entropies of its bottleneck variables below.
Constraints on the observed data distribution implied by a
causal model have primarily been used to determine whether
the observed data is compatible with a causal model, and to
learn the true causal model directly from the observed data.
Existing algorithms for learning causal models rely primarily
on equality constraints. We suggest that incorporating our
proposed inequality constraints, which can easily be read off
a graphical model, can meaningfully improve these methods.
In addition, we show how the entropy of latent variables
can be linked to properties of the observed data distribution,
yielding bounds on latent variable entropies or constraints on
the observed data distribution.
We also demonstrate that our constraints can be used to bound an intuitive measure of the strength of a causal relationship between two variables, called the Minimum Mediary Entropy (MME). We show that the standard measure, called the Average Causal Effect (ACE), is not well suited to capturing the strength of causal influence of a non-binary treatment on an outcome, and can be misleading in some settings. For example, the ACE can be 0 even when treatment changes the outcome for every subject in the population. The MME overcomes both of these issues, and can serve as an informative complement to the ACE.

Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021).
The remainder of the paper is organized as follows. In Section 2, we discuss relevant material in causal inference and information theory. We present our constraints in Section 3, and several applications of the constraints in Section 4. Finally, a discussion of related work and directions for future study can be found in Section 5 and Section 6, respectively.
2 PRELIMINARIES
2.1 CAUSAL INFERENCE BACKGROUND
We begin by introducing key ideas from the literature on graphical causal models. Suppose we are interested in a system of related phenomena, each of which can be represented by a random variable. We denote observed variables in the system as $Y$, unobserved variables as $U$, and the full set of variables as $V = Y \cup U$.
We let $G$ denote a DAG representing the system of interest. Each node in $G$ corresponds to a variable in $V$. The direct causes of each random variable $V$ are defined to be its parents in $G$, denoted $\mathrm{pa}_G(V)$. We adopt a nonparametric structural equations view of the DAG [6, 7], under which the value of each variable $V$ is a function of its direct causes and exogenous noise, denoted $\epsilon_V$. The set of these structural equations is denoted $\mathcal{F} = \{ f_V(\mathrm{pa}_G(V), \epsilon_V) \mid V \in V \}$. In most causal analyses, the exact form of these functions is unknown. Nevertheless, if the structure of causal dependencies in a system is known to be summarized by a graph $G$, or, equivalently, to be described by some set of functions $\mathcal{F}$, then the distribution $P(V)$ is known to factorize as

$$P(V) = \prod_{V \in V} P(V \mid \mathrm{pa}_G(V)). \qquad (1)$$
Equation (1) is the fundamental constraint that $G$ places on the distribution $P(V)$: if the equality holds, then the distribution is in the model; otherwise it is not. When all variables are observed, each term in the factorization is identifiable from observed data, and the constraint may easily be checked. When not all variables are observed, there is no known polynomial-time algorithm for expressing the constraints that the factorization of the full joint distribution places on the observed data distribution. In theory, necessary and sufficient conditions for the observed data distribution to be in the model can be obtained through the use of quantifier elimination algorithms [8], but these have doubly exponential runtime and are prohibitively slow in practice.
We now review d-separation and e-separation, which are properties of the graph $G$ that imply certain properties of the distribution $P(V)$. We first introduce the notion of open and closed paths in conditional distributions. Triples in the graph of the form $A \to C \to B$ and $A \leftarrow C \to B$ are said to be open if we do not condition on $C$, and closed if we do condition on $C$. Triples of the form $A \to C \leftarrow B$, in which $C$ is called a collider, are closed if we do not condition on $C$ or any of its descendants, and open if we do. A path is said to be open under a conditioning set $C$ if all contiguous triples along that path are open under that conditioning set.
Definition 1 (d-separation). Let $A$, $B$ and $C$ be sets of variables in a DAG. $A$ and $B$ are said to be d-separated by $C$ if all paths between $A$ and $B$ are closed after conditioning on $C$. This is denoted $(A \perp_d B \mid C)$.
It is a well-known consequence of Equation (1) that any d-separation relation $(A \perp_d B \mid C)$ in $G$ implies the corresponding conditional independence relation $A \perp B \mid C$ in the distribution $P(V)$. Conditional independence constraints of this form are about sub-populations in which the variables in $C$ take the same value for all subjects. We can only evaluate whether these constraints hold when all variables in $C$ are observed; otherwise there is no way to identify the relevant sub-populations. For that reason, it is impossible to determine whether conditional independences implied by $G$ hold if they have hidden variables in their conditioning sets, leading to the need for other mechanisms to test implications of these independencies.
To describe e-separation, we first introduce the idea that a node can be deleted from a graph by removing the node and all of its incoming and outgoing edges. e-separation can then be defined as follows.
Definition 2 (e-separation). Let $A$, $B$, $C$ and $D$ be sets of variables in a DAG. $A$ and $B$ are said to be e-separated by $C$ after deletion of $D$ if $(A \perp_d B \mid C)$ after deletion of every variable in $D$. This is denoted $(A \perp_e B \mid C \text{ upon } D)$.
Conditioning on $C$ may close some paths between $A$ and $B$, and open others. In the context of e-separation, the set $D$, which we refer to as a bottleneck for $A$ and $B$ conditional on $C$, is any set that includes at least one variable from each path between $A$ and $B$ that is open after conditioning on $C$. If no proper subset of $D$ is a bottleneck, then $D$ is called a minimal bottleneck. This terminology reflects the fact that, conditional on $C$, all information shared between $A$ and $B$ (that is, transferred from one to the other or transferred to each from a common source) must flow through $D$.
It has been shown that every e-separation relationship among observed variables in a graph $G$ corresponds to a constraint on the observed data distribution $P(Y)$ [4]. However, this result is not constructive, in the sense that it does not provide a strategy for deriving such constraints for a given e-separation relationship. The inequality constraints we provide in Section 3 partially fulfill this role; they provide explicit constraints that hold everywhere in the model whenever an e-separation relationship obtains in a graph.

Figure 1: The Unrelated Confounders graph (a), and a split node model for it (b), as well as the Instrumental graph (c), and its split node model (d).
2.1.1 Node Splitting
We will see that e-separation is related to the idea of splitting nodes in a graph. We define a node-splitting operation as follows. Given a graph $G$ and a vertex $D$ in the graph, the node-splitting operation returns a new graph $G^\#$ in which $D$ is split into two vertices. One of the vertices is still called $D$, and it maintains all edges directed into $D$ in the original graph $G$, but none of its outgoing edges. This vertex keeps the name $D$ because it will have the same distribution as $D$ in the original graph, as all of its causal parents remain the same. The second random variable is labeled $D^\#$, and it inherits all of the edges outgoing from $D$ in the original graph, but none of its incoming edges. Examples of the node-splitting operation are illustrated in Fig. 1.
By a result of Evans [4], $(A \perp_e B \mid C \text{ upon } D)$ in $G$ if and only if $(A \perp_d B \mid C, D^\#)$ in $G^\#$. Note that the node-splitting operation described here is closely related to the operation of node splitting in Single World Intervention Graphs in causal inference [7].
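The node-splitting operation itself is mechanical. As a minimal sketch (the function name and the edge-list representation of a DAG are our own choices, not from the paper), the following splits a node in a graph stored as a list of directed edges:

```python
# Minimal sketch of node splitting on a DAG stored as a list of
# (parent, child) edges; names here are ours, not the paper's.

def split_node(edges, d):
    """Return edges of G# in which node d is split: d keeps its incoming
    edges, while a new node named d + '#' inherits d's outgoing edges."""
    return [(d + "#", child) if parent == d else (parent, child)
            for parent, child in edges]

# Instrumental graph of Fig. 1(c): A -> D -> B, hidden U -> D and U -> B.
instrumental = [("A", "D"), ("D", "B"), ("U", "D"), ("U", "B")]
print(split_node(instrumental, "D"))
# [('A', 'D'), ('D#', 'B'), ('U', 'D'), ('U', 'B')] -- matching Fig. 1(d)
```

In the result, $D$ retains its parents $A$ and $U$ while $D^\#$ is parentless and feeds $B$, which is exactly the structure Evans's result exploits.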
2.2 ENTROPIES
In this section, we review standard concepts in information
theory, which we will use to express our inequality con-
straints. We begin with the definitions of entropy and mutual
information.
Definition 3. The entropy of a random variable $X$ is defined as $H(X) \equiv -\sum_{x \in X} P(x) \log_2 P(x)$, with the joint entropy of $X$ and $Y$ defined analogously. The mutual information between $X$ and $Y$ is defined as $I(X : Y) = H(X) + H(Y) - H(X, Y)$.
The entropy of a random variable can be thought of as the
level of uncertainty one has about its value. Entropy is maxi-
mized by a uniform distribution over the domain of a random
variable, as there is no reason to think any one value is more
probable than another, and minimized by a point distribution,
in which there is no uncertainty.
The mutual information between $X$ and $Y$ can be thought of as the amount of certainty we gain about the value of one, on average, if we learn the value of the other. It is maximized when one of $X$ and $Y$ is a deterministic function of the other, and is minimized when they are independent.
The entropy $H(X \mid Y = y)$ of $X$ conditional on a specific value of $Y = y$ is obtained by replacing the distribution $P(X)$ in Definition 3 with $P(X \mid Y = y)$. The conditional entropy of $X$ given $Y$, denoted $H(X \mid Y)$, is defined as the expected value of $H(X \mid Y = y)$. Conditional mutual information is analogously defined.
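These quantities are straightforward to compute for finite distributions. The following is a minimal sketch (helper names are our own) implementing Definition 3 for a joint distribution represented as a dictionary from value pairs to probabilities:

```python
from math import log2
from collections import defaultdict

def entropy(dist):
    """H(X) = -sum_x P(x) log2 P(x), in bits (Definition 3)."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, axis):
    """Marginal of one coordinate of a joint distribution {(x, y): p}."""
    out = defaultdict(float)
    for values, p in joint.items():
        out[values[axis]] += p
    return dict(out)

def mutual_information(joint):
    """I(X:Y) = H(X) + H(Y) - H(X,Y)."""
    return (entropy(marginal(joint, 0)) + entropy(marginal(joint, 1))
            - entropy(joint))

# Perfectly correlated fair bits: I(X:Y) = H(X) = 1 bit.
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))  # 1.0
# Independent fair bits: I(X:Y) = 0.
print(mutual_information({(x, y): 0.25 for x in (0, 1) for y in (0, 1)}))  # 0.0
```

The two extreme cases printed here mirror the discussion above: maximal mutual information under deterministic dependence, zero under independence.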
3 E-SEPARATION CONSTRAINTS
We have already described the intuition behind our con-
straints, which can be roughly summarized by the observa-
tion that the statistical dependence between random variables
must be limited by the total amount of information that can
flow through any bottleneck between them. We now describe how the tools introduced in Section 2 help us formalize this intuition.
First, we describe why e-separation helps formalize the idea of blocking "all paths" between two sets of variables. Consider the instrumental variable graph, depicted in Fig. 1(c). $A$ and $B$ are only d-separated by the set $\{D, U\}$, where $U$ is unobserved. Consequently, they are not d-separated by any set consisting entirely of observed variables. They are, however, e-separated after deletion of the observed variable $D$. This tells us that all paths between $A$ and $B$ are through $D$, and we can take advantage of observed properties of $D$ to bound the dependence between them even when nothing is known about the unobserved variable $U$. A similar story can be told about the Unrelated Confounders scenario depicted in Fig. 1(a).
When all variables are observed, e-separation does not imply any constraints that are not implied by d-separation, which follows from the fact that d-separation implies all constraints in such cases [9]. However, as illustrated by the examples in Figs. 1(a) and 1(c), e-separation allows us to identify bottlenecks consisting entirely of observed variables between $A$ and $B$ even when paths between $A$ and $B$ cannot be closed by any manner of conditioning on observed variables. To show how e-separation leads to entropic constraints, we will make use of Theorem 4.2 in [4], reframed as follows.
Theorem 4. (Evans [4, Theorem 4.2]) Suppose $(A \perp_e B \mid C \text{ upon } D)$ in $G$, and that no variable in $C$ is a descendant of any in $D$. Then there exists a distribution $P^*$ over $A, B, C, D, D^\#$ such that

$$P(A = a, B = b, D = d \mid C = c) = P^*(A = a, B = b, D = d \mid C = c, D^\# = d) \qquad (2)$$

with $A \perp B \mid C, D^\#$ in $P^*$. If furthermore no variable in $A$ is a descendant of any in $D$, then there exists a distribution $P^*$ such that $P(B = b, D = d \mid A = a, C = c) = P^*(B = b, D = d \mid A = a, C = c, D^\# = d)$ with $A \perp B \mid C, D^\#$ in $P^*$.^1
We provide the following intuition for this theorem. Our graph $G$ represents the causal relationships within a system of random variables in the real world. The graph $G^\#$ represents an alternative world in which the causal effects of $D$ are "spoofed" by some random variable $D^\#$. That is, children of $D$ in $G$, which should be functions of $D$, are instead fooled into being functions of $D^\#$.
In the alternative world represented by $G^\#$, we suppose that the functional form $f_V$ of a variable $V$ in terms of its parents stays the same for all variables that are shared between graphs. This means that all non-descendants of $D$ have the same joint distribution in our world and in the alternative world, as neither their parents nor the functions defining them in terms of their parents have changed. By contrast, descendants of $D$ in $G$ will have a different distribution in the alternative world, as their distributions are now functions of the distribution of $D^\#$, which may be different from that of $D$, and is unknown.
Now, suppose we condition on a particular value of $D^\# = d$ in $G^\#$. Then, because the functional form of the causal mechanisms is shared across worlds, the descendants of $D$ in $G$ have the same distribution as they have when $D = d$ in the observed world. In addition, all of the non-descendants of $D^\#$ are marginally independent from $D^\#$, because it has no ancestors, so all connecting paths must be collider paths. Therefore, both its non-descendants and its descendants have the same joint distribution they would have had when $D = d$ in the original graph. The results in the theorem then follow when we note that $C$, and optionally $A$, are non-descendants of $D$, and that the relevant independence properties hold in the world of $G^\#$.
In general, we cannot know what this $P^*$ distribution is, because we never get to observe this alternate world. But when we condition on $D^\#$, we are removing precisely the randomness we do not know about, yielding a distribution that we do know about. The fact that $P^*$ agrees with $P$ on a subset of their domains, and that it contains known independences, is sufficient to derive informative constraints, as seen below.

^1 In causal inference problems, a distribution $P^*$ that satisfies the relevant conditions for this result may be constructed from counterfactual random variables $A(d)$, $B(d)$, $D(d)$ and $C(d)$.
3.1 ENTROPIC CONSTRAINTS FROM E-SEPARATION
We now show how the notion of e-separation permits the formulation of entropic inequality constraints. In these constraints, we use mutual information to represent dependence between sets of variables, and entropy to measure the information-carrying capacity of paths connecting them.
Theorem 5. (Proof in Appendix C.) Suppose the variables in $D$ are discrete. Whenever $(A \perp_e B \mid C \text{ upon } D)$, then $I(A : B \mid C, D) \le H(D)$. If in addition no element of $C$ is a descendant of any in $D$, then for any value $c$ in the domain of $C$, the following stronger constraints hold:

$$I(A : B \mid C = c, D) \le H(D \mid C = c) \qquad (3a)$$
$$I(A : B \mid C, D) \le H(D \mid C) \qquad (3b)$$

If in addition no element of $A$ is a descendant of any in $D$, then for any value $c$ in the domain of $C$, the following even stronger constraints hold:

$$I(A : B, D \mid C = c) \le H(D \mid C = c) \qquad (4a)$$
$$I(A : B, D \mid C) \le H(D \mid C) \qquad (4b)$$
This theorem potentially allows us to efficiently discover many entropic inequalities implied by any given graph, such as those implied by Fig. 2. In some cases, as in Fig. 2(a), the theorem recovers all Shannon-type entropic inequality constraints implied by the graph [10-12]. In other cases, as in Fig. 2(b), the graph implies a Shannon-type entropic inequality constraint beyond what Theorem 5 can recover, per a result in [13]. Indeed, entropic inequality constraints can be implied by graphs not exhibiting e-separation relations whatsoever, such as the triangle scenario [11, 14].
The linear quantifier elimination of [10-12] will always discover all the entropic inequalities which can be inferred from Theorem 5. However, the quantifier elimination method is computationally expensive, and is essentially intractable for graphs involving more than six or seven variables (observed and latent combined). Theorem 5, by contrast, provides an approach that is computationally tractable, but is capable of discovering fewer entropic constraints.
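To illustrate how such a constraint can be checked numerically, the following sketch evaluates the strongest form of Theorem 5 with an empty conditioning set, $I(A : B, D) \le H(D)$, on the instrumental graph of Fig. 1(c), where $A$ and $B$ are e-separated upon deletion of $D$ and $A$ is a non-descendant of $D$. The structural equations are toy choices of our own, not from the paper:

```python
from math import log2
from itertools import product
from collections import defaultdict

def H(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marg(joint, idx):
    out = defaultdict(float)
    for values, p in joint.items():
        out[tuple(values[i] for i in idx)] += p
    return out

# Exact joint over (A, B, D) from toy structural equations on the
# instrumental graph: A and U are independent fair bits, D = f(A, U),
# B = g(D, U).
joint = defaultdict(float)
for a, u in product([0, 1], repeat=2):
    d = a ^ u      # D = A XOR U
    b = d & u      # B = D AND U
    joint[(a, b, d)] += 0.25

# Theorem 5 (strongest form, empty C): I(A : B,D) <= H(D).
mi = H(marg(joint, [0])) + H(marg(joint, [1, 2])) - H(joint)
print(mi <= H(marg(joint, [2])) + 1e-12)  # True: the constraint is satisfied
```

Any structural equations compatible with the graph must satisfy the same inequality; a violation in data would therefore falsify the instrumental model.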
Finally, we note that an inverse of Theorem 5 also holds. In particular, if $(A \not\perp_e B \mid C \text{ upon } D)$ in $G$ and the variables in $D$ are discrete, then there necessarily exists some distribution $P$ over $A$, $B$, $C$, $D$ in the marginal model of $G$ such that $I(A : B \mid C, D) > H(D)$ when evaluated on $P$.
Figure 2: For graph (a), Theorem 5 implies the entropic inequality constraints $I(A : XYZ) \le H(X)$ and $I(A : YZ) \le H(Y)$. For graph (b), Theorem 5 implies $I(A : XYZ) \le H(X)$ and $I(A : YZ \mid X) \le H(Y \mid X)$. Note, however, that the entropic quantifier elimination method of Chaves et al. [10], as applied by Weilenmann and Colbeck [13], finds that the former inequality for graph (b) can be strengthened into $I(A : XYZ) \le H(X \mid Y)$.
3.2 RELATING E-SEPARATION TO EQUALITY CONSTRAINTS
We have seen that d-separation and e-separation relations imply constraints on the observed data distribution. Verma and Pearl [15] discuss equality constraints for latent variable models that apply in identified post-intervention distributions. Such equality constraints are sometimes called Verma constraints. A general description of the class of these constraints implied by a hidden variable DAG model, as well as a discussion of the properties of these constraints, is given in Ref. [16]. In this section, we examine the relationship between the e-separation-based constraints of Theorem 5 and the d-separation-based conditional independence and Verma constraints.
First, we observe that the presence of d-separation relations and Verma constraints in a graphical model implies the presence of an e-separation relation.
Proposition 6. (Proof in Appendix C.) If $A$ is d-separated from $B$ by $\{C, D\}$, then $A$ is also e-separated from $B$ by $C$ upon deleting $D$.
This demonstrates that for any d-separation relation in the graph, it is possible to obtain an entropic constraint corresponding to any minimal bottleneck $D$ through an e-separation relation. More precisely, when $A$ is d-separated from $B$ by $\{C, D\}$, then by Proposition 6, it is also the case that $A$ is e-separated from $B$ given $C$ upon deleting $D$, and therefore Theorem 5 can be applied to obtain entropic constraints. Note, however, that these are necessarily weaker than the entropic constraint $I(A : B \mid C, D) = 0$, which follows from the d-separation relation itself.
In summary, every d-separation relation in the graph is an instance of e-separation, but not vice versa. When an instance of e-separation is also an instance of d-separation, then all the inequality constraints implied by e-separation are rendered defunct by the stronger equality constraints implied by d-separation.
We now show that a similar pattern of deprecating inequalities
by equalities occurs in the presence of Verma constraints
when certain counterfactual interventions are identifiable.
Proposition 7. Consider a graph $G$ which exhibits the e-separation relation $(A \perp_e B \mid C \text{ upon } D)$ and where no element of $C$ is a descendant of any in $D$. If the counterfactual distribution $P(A(D=d), B(D=d), D(D=d) \mid C)$ is identifiable,^2 then the inequalities of Theorem 5 are logically implied whenever the stronger equality constraints

$$I(A(D=d) : B(D=d) \mid C) = 0 \qquad (5)$$

are satisfied for all values of $d$. Note that Equation (5) is satisfied if and only if the margin of the identified counterfactual distribution factorizes, i.e., when

$$P(A(D=d), B(D=d) \mid C) = \sum_{d'} P(A(D=d), B(D=d), D(D=d)=d' \mid C)$$

exhibits $A(D=d) \perp B(D=d) \mid C$. $\qquad (6)$
The proof directly follows from that of Theorem 5. In proving Theorem 5, we derive entropic inequalities by relating the entropies pertaining to $P(A, B, D \mid C)$ to entropies pertaining to the $P^*$ distribution posited by Theorem 4. That is, Theorem 5 is an entropic consequence of Theorem 4. If the conditions of Proposition 7 are satisfied, then the conditions of Theorem 4 are also automatically satisfied, since one can then explicitly reconstruct

$$P^*(A, B, D=d \mid C, D^\#=d^\#) = P(A(D=d^\#), B(D=d^\#), D(D=d^\#)=d \mid C). \qquad (7)$$

There is no opportunity to violate the entropic inequalities of Theorem 5 once the observational data has been confirmed as consistent with Theorem 4. In other words, in order to violate the inequalities of Theorem 5 it must be the case that no $P^*$ consistent with Theorem 4 can be constructed, but this contradicts the explicit recipe of Equation (7).
See Refs. [15, 16, 18] for details on how to derive the form of the equality constraints summarized by Equation (6). We note here that $P(A(D=d), B(D=d), D(D=d)=d \mid C)$ is certainly identifiable if $D$ is not a member of the same district ([16]) as any element in $\{A, B\}$ within the subgraph of $G$ over $\{A, B, C, D\}$ and their ancestors. We also note that the identifiability of merely $P(A(D=d), B(D=d) \mid C)$ but not of $P(A(D=d), B(D=d), D(D=d)=d \mid C)$ negates the implication from Equation (6) to Theorem 5. In Appendix A, we provide an example of a graph in which $P(A(D=d), B(D=d) \mid C)$ is identified, but the entropic constraints of Theorem 5 remain relevant. In addition, we demonstrate that the application of the entropic constraints to identified counterfactual distributions can also result in inequality constraints on the observed data distribution.

^2 The counterfactual distribution in this theorem allows intervened-on variables and outcomes to intersect. See [17] for a complete identification algorithm for counterfactual distributions of this type.
3.3 CONSTRAINTS AND BOUNDS INVOLVING LATENT VARIABLES
In this section, we consider d-separation relations with hidden variables in the conditioning set. Because we cannot condition on hidden variables, there is no way to check whether the corresponding independence constraints hold in the full data distribution. However, if we have access to auxiliary information about these hidden variables (such as information about their entropy or their cardinality) it is possible to obtain inequality constraints on the observed data distribution.
Proposition 8. (Proof in Appendix C.) If $(A \perp_d B \mid C, U)$ and $H(U \mid A, B, C = c) \ge 0$, then for any value $c$ in the domain of $C$:

$$H(U \mid C = c) \ge I(A : B \mid C = c) \qquad (8)$$
Note that Proposition 8 holds even if $A$, $B$, and $C$ are continuously valued variables.
In many scenarios, we may have more (or be more interested in) information pertaining to the cardinality of a hidden variable than its entropy. We take the cardinality of a set of variables to be the product of the cardinalities of the variables in the set. An upper bound on the cardinality of $U$ entails an upper bound on its entropy. As observed above, the entropy of a random variable is maximized when it takes a uniform distribution. If we let $|U|$ denote the cardinality of $U$, and recall that the entropy of a uniformly distributed variable with cardinality $m$ is simply $\log_2(m)$, then $\log_2 |U| \ge H(U)$. The next corollary then follows immediately from Proposition 8, since $H(U \mid \cdot) \ge 0$ whenever $U$ has finite cardinality:
Corollary 8.1. If $(A \perp_d B \mid C, U)$, then for any value $c$ in the domain of $C$, the cardinality of $U$ may be lower-bounded:

$$|U| \ge \max_c 2^{I(A : B \mid C = c)} \ge 2^{I(A : B \mid C)}. \qquad (9)$$
Finally, we note that both of these inequalities can also be used if we do not know anything about the properties of $U$, but would like to infer lower bounds for its entropy and cardinality from the observed data. In Section 4.2, we will explore a scenario in genetics in which these bounds and constraints may be of use.
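As a minimal sketch of Corollary 8.1 with an empty conditioning set (the example distribution and function names are our own), the following lower-bounds the cardinality of a hidden common cause from an observed joint distribution:

```python
from math import log2, ceil
from collections import defaultdict

def H(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marg(joint, axis):
    out = defaultdict(float)
    for values, p in joint.items():
        out[values[axis]] += p
    return out

def min_confounder_cardinality(joint):
    """Corollary 8.1 with empty C: any hidden U that d-separates A from B
    must satisfy |U| >= 2^I(A:B)."""
    mi = H(marg(joint, 0)) + H(marg(joint, 1)) - H(joint)
    return ceil(2 ** mi - 1e-9)  # slack guards against floating-point noise

# A = B uniform on four values: I(A:B) = 2 bits, so |U| must be at least 4.
perfectly_correlated = {(v, v): 0.25 for v in range(4)}
print(min_confounder_cardinality(perfectly_correlated))  # 4
```

For independent variables the bound degenerates to 1, i.e. a constant "confounder" suffices, as expected.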
Remark 9. Constraints given in Proposition 8 and Corollary 8.1 are stronger than can be obtained from the e-separation relation $(A \perp_e B \mid C \text{ upon } U)$ on its own.
To demonstrate Remark 9, we consider a set of structural equations consistent with Fig. 1(a). Suppose that $D$ takes the value 0 when $U_1 = U_2$, and the value 1 otherwise, and that $A$ and $B$ take the value 0 if $D$ is 0, and values equal to $U_1$ and $U_2$ respectively if $D$ is 1. It follows that $A$ and $B$ are always equal, and therefore $I(A : B) = H(A)$. Now, suppose that $U_1$ and $U_2$ only take values not equal to 0, and that there are at least two values that each takes with nonzero probability. It immediately follows that $H(D) < H(A)$, and therefore that $H(D) < I(A : B)$, as $D$ and $A$ by construction take the value 0 with the same probability, but there is strictly more entropy in the remainder of $A$'s distribution because $D$ is binary and $A$ takes at least two other values with nonzero probability.

Figure 3: Two hidden variable DAGs that share equality constraints over observed variables, but (a) contains e-separation relations that are not in (b).
4 APPLICATIONS
In this section, we explore several applications of the con-
straints developed above. In Sections 4.1 and 4.2, we show
how our results can be used to learn about causal models from
observational data. In Section 4.3, we further leverage the im-
portance of the entropy of variables along a causal pathway
to posit a new measure of causal strength, and observe that
this measure can be bounded by an application of Theorem 5.
4.1 CAUSAL DISCOVERY
In this section, we present an example in which two hidden
variable DAGs with the same equality constraints present dif-
ferent entropic inequality constraints. The ability to distin-
guish between models that share equality constraints has the
potential to advance the field of causal discovery, in which
causal DAGs are learned directly from the observed data.
Causal discovery algorithms for learning hidden variable
DAGs currently do so using only equality constraints. Our approach may be useful as a post-processing addition to such methods, whereby any graph found to satisfy the equality constraints in the observed data is tested against the entropic inequality constraints implied by e-separation relations in the model.
The hidden variable DAGs in Fig. 3, adapted from Appendix B in Ref. [19], share the same conditional independence constraints, $Y_1 \perp Y_3 \mid Y_2, Y_5$ and $Y_1 \perp Y_5$, but exhibit different e-separation relations.
In Fig. 3(a), $(Y_1 \perp_e Y_3 Y_4 \mid Y_2 \text{ upon } Y_5)$, $(Y_1 Y_2 \perp_e Y_4 \text{ upon } Y_3)$, and $(Y_2 \perp_e Y_4 \mid Y_1 \text{ upon } Y_3)$. Applying Theorem 5 in each case, we obtain the three inequality constraints $I(Y_1 : Y_3 Y_4 Y_5 \mid Y_2) \le H(Y_5 \mid Y_2)$, $I(Y_2 : Y_3 Y_4 \mid Y_1) \le H(Y_3 \mid Y_1)$, and $I(Y_1 Y_2 : Y_3 Y_4) \le H(Y_3)$.
In Fig. 3(b), we have added an edge, which removes some e-separation relations. We are left with $(Y_1 \perp_e Y_3 \mid Y_2 \text{ upon } Y_5)$ and $(Y_2 \perp_e Y_4 \mid Y_1 \text{ upon } Y_3)$. We can again apply Theorem 5 in each case, yielding the inequality constraints $I(Y_1 : Y_3 Y_5 \mid Y_2) \le H(Y_5 \mid Y_2)$ and $I(Y_2 : Y_3 Y_4 \mid Y_1) \le H(Y_3 \mid Y_1)$. The second of these constraints is shared by the graph in Fig. 3(a), and the first is strictly weaker than a constraint in Fig. 3(a).
Models similar to those shown in Fig. 3 sometimes arise in time-series data, where the variables in the chain represent observations taken at consecutive time steps. In such models, it is often assumed that treatments no longer have a direct effect on outcomes after a certain number of time steps. Here, that assumption is encoded in the lack of a direct edge from $Y_1$ to $Y_4$ in Fig. 3(a). We have shown above that this kind of assumption can be falsified even when it does not imply any additional equality constraints, as is often the case. In particular, if the stronger constraints implied by Fig. 3(a) are violated, but the weaker constraints of Fig. 3(b) are not, then the assumption is falsified.
4.2 CAUSAL DISCOVERY IN THE PRESENCE OF LATENT VARIABLES

Figure 4: Identifying direct causal influence in the presence of a confounder with limited cardinality.
In this section, we consider a very simple possible application of the constraints and bounds relating to entropies of unobserved variables in genetics. Consider a causal hypothesis wherein the presence or absence of an unobserved gene influences two aspects of an organism's phenotype. Suppose that, due to genetic sequencing studies, the number of variants of the gene in the population (i.e. the cardinality of the corresponding random variable) is known. Two possible hypotheses regarding the causal structure are depicted in Fig. 4, where $U$ represents the gene and $X$ and $Y$ are the phenotype aspects. In Fig. 4(a), one presumes no causal influence of $X$ on $Y$, whereas in Fig. 4(b), direct causal influence is allowed. In the former case, knowledge of the number of variants of the gene constrains the mutual information between the phenotypes, while in the latter case it is not constrained.
Thus, for certain types of statistical dependencies between $X$ and $Y$, one can rule out the hypothesis of Fig. 4(a). For example, suppose we know the cardinality of $U$ to be 3. Corollary 8.1 then implies the constraint that the mutual information between $X$ and $Y$ cannot exceed $\log_2(3) \approx 1.585$. Suppose further that we observe the distribution depicted in Table 1. The mutual information between $X$ and $Y$ in this distribution is $\approx 1.594$. Because this mutual information violates the constraint implied by the model in Fig. 4(a), we know this model cannot be correct, and conclude that Fig. 4(b) is correct. More generally, strong statistical dependence between high-cardinality variables cannot be explained by a low-cardinality common cause, and requires a direct influence between them.
        Y=0     Y=1     Y=2     Y=3
X=0    0.002   0.001   0.400   0.001
X=1    0.003   0.005   0.005   0.066
X=2    0.224   0.003   0.003   0.001
X=3    0.002   0.281   0.001   0.002

Table 1: An example joint distribution over two variables $X$ and $Y$, each with cardinality 4.
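The computations in this example can be reproduced directly from Table 1. The following sketch computes the mutual information, compares it to the $\log_2(3)$ bound, and evaluates the implied lower bound on the cardinality of $U$:

```python
import math
import numpy as np

# Joint distribution of Table 1 (rows: X = 0..3, columns: Y = 0..3).
P = np.array([[0.002, 0.001, 0.400, 0.001],
              [0.003, 0.005, 0.005, 0.066],
              [0.224, 0.003, 0.003, 0.001],
              [0.002, 0.281, 0.001, 0.002]])

px, py = P.sum(axis=1), P.sum(axis=0)
mi = sum(P[i, j] * math.log2(P[i, j] / (px[i] * py[j]))
         for i in range(4) for j in range(4) if P[i, j] > 0)

print(f"I(X:Y) = {mi:.3f}")                 # ≈ 1.594
print(f"log2(3) = {math.log2(3):.3f}")       # ≈ 1.585: Fig. 4(a) with |U| = 3 is ruled out
print(f"lower bound on |U| = {2**mi:.3f}")   # ≈ 3.02, so |U| >= 4 under Fig. 4(a)
```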
Conversely, suppose Fig. 4(a) is known to be correct, i.e. that there is no direct causal influence between the two aspects of phenotype. If the cardinality of $U$ is not known, it can be bounded from below directly from observed data, according to Corollary 8.1. In this case, the lower bound would be $2^{I(X:Y)} \approx 2^{1.594} \approx 3.018$. It follows that $U$ must have a cardinality of 4 or above in this setting. The ability to extract such information from observational data may be useful in making substantive scientific decisions, or in guiding future sequencing studies.
In many applied data analyses, different variables may be ob-
served for different subjects, i.e., data on some variables is
“missing” for some subjects. A recent line of work has fo-
cused on properties of missing data models that can be repre-
sented as DAGs [20]. Although the bounds and constraints
above have been developed in the context of fully unobserved
variables, they can also be used in missing data DAG models,
for variables that are not observed for all subjects.
4.3 QUANTIFYING CAUSAL INFLUENCE
The traditional approach to measuring the strength of a causal relationship is to contrast how different an outcome would be, on average, under two different treatments. Formally, if $X$ is a cause of $Y$, the ACE is defined as $\mathbb{E}[Y(X=x) - Y(X=x')]$. While the ACE is a very useful construct, we suggest that it has two important shortcomings, and present an alternative measure of causal strength called the Minimal Mediary Entropy, or MME. The MME is based on the idea, explored throughout this work, that the entropy of variables along a causal pathway provides insight into the amount of information that can travel along that pathway.

In a scenario where treatment can be discerned to always cause outcome, we might expect the ACE, as a measure of causal influence, to be large. The example below shows that this is not necessarily the case.
Example 1. Consider a randomized binary treatment $X$ and a ternary outcome $Y$, with $P(Y=0 \mid X=0) = P(Y=2 \mid X=0) = 0.5$ and $P(Y=1 \mid X=1) = 1$. In this setting, $ACE = 0$, even though treatment affects outcome for every subject in the population.
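Because treatment is randomized, $\mathbb{E}[Y(X=x)] = \mathbb{E}[Y \mid X=x]$, so the ACE of Example 1 can be checked with elementary arithmetic:

```python
# Example 1: randomized binary treatment X, ternary outcome Y, with
# P(Y=0|X=0) = P(Y=2|X=0) = 0.5 and P(Y=1|X=1) = 1. Since X is randomized,
# E[Y(X=x)] = E[Y|X=x], so the ACE reduces to a difference of conditional means.
p_y_given_x = {
    0: {0: 0.5, 1: 0.0, 2: 0.5},
    1: {0: 0.0, 1: 1.0, 2: 0.0},
}
mean = {x: sum(y * p for y, p in dist.items()) for x, dist in p_y_given_x.items()}
ace = mean[1] - mean[0]
print(f"E[Y|X=1] = {mean[1]}, E[Y|X=0] = {mean[0]}, ACE = {ace}")
# ACE = 0.0, even though Y(X=1) != Y(X=0) for every subject.
```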
In less extreme settings, the ACE may be very small even when treatment affects outcome for almost every subject in the population, or very large even when very few subjects have an outcome that is affected by treatment.

The ACE is likewise not always well suited to measuring the strength of a causal relationship when the treatment variable is non-binary. In such situations, no one causal contrast represents the causal influence, and the number of possible contrasts grows combinatorially in the cardinality of treatment.
We now define the MME and discuss how it can overcome these issues.
[Fig. 5 depicts four DAGs: (a) over $\{C, A, X, Y, U\}$ and its modification (a') containing additionally $W$ and $U_{WY}$; (b) over $\{D, X, Y, U\}$ and its modification (b') containing additionally $W$ and $U_{WY}$.]
Figure 5: Modifying DAGs (a) and (b) by inserting a latent mediary $W$ between $X$ and $Y$ yields DAGs (a') and (b') respectively. Note that in (a), even though $X$ and $Y$ are confounded by a latent variable, Corollary 10.1 gives $MME_{X \to Y} \ge I(A : Y \mid C=c)$ for any $c$ by exploiting the fact that $A \in an(X)$. Also note that in (b), even though $X$ affects $Y$ both directly and indirectly through $D$, Corollary 10.1 gives $MME_{X \to Y} \ge I(X : Y \mid D) - H(D)$ for the direct effect.
Definition 10 (Minimal Mediary Entropy (MME) for Direct Effect). Given a DAG $G$ containing a directed edge $X \to Y$, let $G'_{X \to W \to Y}$ denote the graph constructed by substituting the single edge $X \to Y$ in $G$ with the set of four edges $\{X \to W \to Y,\; W \leftarrow U_{WY} \to Y\}$, introducing auxiliary latent variables $W$ and $U_{WY}$.³ We then define $MME_{X \to Y}$ as the smallest entropy $H(W)$ over all structural equation models reproducing the observed data distribution over $G'_{X \to W \to Y}$ in which $W$ has finite cardinality.

Fig. 5 illustrates the process of edge substitution. Essentially, the edge $X \to Y$ in $G$ is interrupted to pass through $W$ in $G'_{X \to W \to Y}$, such that the auxiliary latent variable $W$ fully mediates the direct effect of $X$ on $Y$. Note the caveat that the MME is defined in terms of minimizing the entropy of $W$ over finite-cardinality $W$ capable of reproducing the observed statistics. If $W$ were allowed to be a continuously valued variable, then the observed data distribution would always be reproducible with arbitrarily small $H(W)$, due to the total lack of restriction in the instrumental model with a continuous mediary [21].

With the presumption of finite cardinality of $W$ by fiat, however, we are in a position to exploit Theorem 5 in order to practically lower bound the MME.

Corollary 10.1. (Proof in Appendix C.) Suppose that the graphical construction $G'_{X \to W \to Y}$ exhibits the e-separation relation $(A \perp_e B \mid C \text{ upon } \{D, W\})$ for some $A \subset \{X\} \cup an(X)$ and for some $B \subset \{Y\} \cup desc(Y)$, where $A$, $B$, $C$, and $D$ are nonoverlapping subsets ($C$ and $D$ possibly empty) of the observed variables in $G$, and all the variables within $D$ are discrete. Then

$MME_{X \to Y} \ge \max_c\, I(A : B \mid C=c, D) - H(D \mid C=c)$   (10a)
$\qquad\qquad\;\; \ge I(A : B \mid C, D) - H(D \mid C)$.   (10b)

Suppose that $P_0$ is a distribution in the model of the extended graph $G'_{X \to W \to Y}$, such that $P_0$ marginalizes to the observed data distribution. Then the entropy $H(W)$ in $P_0$ is necessarily an upper bound on $MME_{X \to Y}$; i.e., we have found a $W$ with entropy $H(W)$ that fully mediates the causal influence of $X$ on $Y$. Since $W$ could always reproduce the observed data by simply copying the values of $X$, we have a trivial upper bound of $MME_{X \to Y} \le H(X)$.⁴ This upper bound can typically be improved by even partially exploring the space of distributions in $G'_{X \to W \to Y}$.

Consider the simple model $X \to Y$ with $|X| = 3$ and $|Y| = 3$, and the observed data distribution $P(X=x, Y=y) = 5/27$ if $x = y$ and $2/27$ if $x \ne y$. Our corollary gives us the lower bound $MME_{X \to Y} \ge I(X : Y) \approx 0.150$, contrasted with the trivial upper bound $MME_{X \to Y} \le H(X) \approx 1.585$. We can improve the trivial upper bound by noting that this distribution can be reproduced by the following functional relationships: $W = 0$ and $Y = U_{WY}$ when $U_{WY} = X$, while $W = 1$ and $Y$ is uniformly random when $U_{WY} \ne X$, taking $U_{WY}$ to be uniformly distributed with cardinality three. In this model for $G'_{X \to W \to Y}$ we obtain $P(W=0) = 1/3$, corresponding to $MME_{X \to Y} \le H(W) \approx 0.918$.

³ If a latent confounder were added between $X$ and $W$, then although $W$ would still mediate $X \to Y$, $X$ and $Y$ would share a source of unobserved confounding, altering the causal model.

⁴ Consider Example 1, which has $ACE_{X \to Y} = 0$. That example has the feature that $I(X : Y) = H(X)$. Accordingly, the lower bound $MME_{X \to Y} \ge I(X : Y)$ is evidently tight, given the trivial upper bound $MME_{X \to Y} \le H(X)$.
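The bounds in this example are easy to reproduce. The sketch below computes the lower bound $I(X:Y)$, the trivial upper bound $H(X)$, and the improved upper bound $H(W)$ for the mediary model described above:

```python
import math

def h_bits(dist):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Observed distribution: P(x, y) = 5/27 if x == y, else 2/27, for x, y in {0,1,2}.
P = [[5 / 27 if x == y else 2 / 27 for y in range(3)] for x in range(3)]
px = [sum(row) for row in P]                       # uniform: 1/3 each
py = [sum(P[x][y] for x in range(3)) for y in range(3)]
mi = sum(P[x][y] * math.log2(P[x][y] / (px[x] * py[y]))
         for x in range(3) for y in range(3))

lower = mi                    # Corollary 10.1: MME_{X->Y} >= I(X:Y)
trivial_upper = h_bits(px)    # W copying X gives MME <= H(X) = log2(3)
# Model with W = [U_WY != X]: P(W=0) = 1/3, so MME <= H(W) = h(1/3).
improved_upper = h_bits([1 / 3, 2 / 3])

print(f"{lower:.3f} <= MME <= {improved_upper:.3f} (trivial upper: {trivial_upper:.3f})")
```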
5 RELATED WORK
This work builds most directly on Ref. [4], in which e-separation was introduced and Theorem 4 was derived, both of which are essential to our results. It follows in the tradition of a line of literature that aims to derive symbolic expressions of restrictions on the observed data distribution implied by a causal model with latent variables, including Refs. [2, 18, 22].
Entropic constraints were previously considered in Refs. [10–12]. The entropic constraint for the instrumental scenario appears as Equation (5) in Ref. [11]; see also Appendix E of Ref. [23]. Our work is also closely related to work in the information theory literature on how much information can pass through channels of varying types [24]. Our proposed measure of causal strength, the MME, is motivated by weaknesses in standard causal strength measures (e.g. the ACE), which were previously discussed in Ref. [25].
Our results are also related to the causal discovery literature, which seeks to find the causal structures compatible with an observed data distribution [26]. The inequality constraints posed above can be used to check the outputs of existing causal discovery algorithms [26–28].
6 CONCLUSION
In this work, we present inequality constraints implied by e-separation relations in hidden variable DAGs. We have shown that these constraints can be used for a number of purposes, including adjudicating between causal models, bounding the cardinalities of latent variables, and measuring the strength of a causal relationship. e-separation relations can be read directly off a hidden variable DAG, leading to constraints that can be easily obtained.
This work opens up two avenues for future work. The first is that our constraints demonstrate a practical use of e-separation relations, and should motivate the study of fast algorithms for enumerating all such relations in hidden variable DAGs. The second is that the constraints suggest that equality-constraint-based causal discovery algorithms can be improved; understanding how the inequality constraints can best be used to this end will take careful study.
Acknowledgments
This research was supported by Perimeter Institute for Theo-
retical Physics. Research at Perimeter Institute is supported
in part by the Government of Canada through the Department
of Innovation, Science and Economic Development Canada
and by the Province of Ontario through the Ministry of Col-
leges and Universities. The first author was supported in part
by the Mathematical Institute for Data Science (MINDS) re-
search fellowship. The third author was supported in part by
grants ONR N00014-18-1-2760, NSF CAREER 1942239,
NSF 1939675, and R01 AI127271-01A1. We thank Murat
Kocaoglu for helpful discussion about the MME.
References
[1] E. Wolfe, R. W. Spekkens, and T. Fritz, "The Inflation Technique for Causal Inference with Latent Variables," J. Caus. Inf. 7, 70 (2019).
[2] C. Kang and J. Tian, "Inequality Constraints in Causal Models with Hidden Variables," in Proc. 22nd Conf. UAI (AUAI, 2006) pp. 233–240.
[3] D. Poderini, R. Chaves, I. Agresti, G. Carvacho, and F. Sciarrino, "Exclusivity graph approach to Instrumental inequalities," in Proc. 35th Conf. UAI (AUAI, 2019).
[4] R. J. Evans, "Graphical methods for inequality constraints in marginalized DAGs," in Proc. 2012 IEEE Intern. Work. MLSP (IEEE, 2012) pp. 1–6.
[5] J. Pienaar, "Which causal structures might support a quantum–classical gap?" New J. Phys. 19, 043021 (2017).
[6] J. Pearl, "Causal inference in statistics: An overview," Statist. Surv. 3, 96 (2009).
[7] T. S. Richardson and J. M. Robins, Single World Intervention Graphs (Now Publishers Inc, 2013).
[8] D. Geiger and C. Meek, "Quantifier Elimination for Statistical Problems," in Proc. 15th Conf. UAI (AUAI, 1999) pp. 226–235.
[9] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann, 1988).
[10] R. Chaves, L. Luft, T. O. Maciel, D. Gross, D. Janzing, and B. Schölkopf, "Inferring latent structures via information inequalities," in Proc. 30th Conf. UAI (AUAI, 2014) pp. 112–121.
[11] R. Chaves, L. Luft, and D. Gross, "Causal structures from entropic information: geometry and novel scenarios," New J. Phys. 16, 043001 (2014).
[12] M. Weilenmann and R. Colbeck, "Analysing causal structures with entropy," Proc. Roy. Soc. A 473, 20170483 (2017).
[13] M. Weilenmann and R. Colbeck, "Analysing causal structures in generalised probabilistic theories," Quantum 4, 236 (2020).
[14] B. Steudel and N. Ay, "Information-theoretic inference of common ancestors," Entropy 17, 2304 (2015).
[15] T. Verma and J. Pearl, "Equivalence and synthesis of causal models," in Proc. 6th Conf. UAI (AUAI, 1990).
[16] T. S. Richardson, R. J. Evans, J. M. Robins, and I. Shpitser, "Nested Markov Properties for Acyclic Directed Mixed Graphs," (2017), working paper.
[17] I. Shpitser, T. S. Richardson, and J. M. Robins, "Chapter 41: Multivariate counterfactual systems and causal graphical models," in Probabilistic and Causal Inference: The Works of Judea Pearl (ACM Books, 2021).
[18] J. Tian and J. Pearl, "On the Testable Implications of Causal Models with Hidden Variables," in Proc. 18th Conf. UAI (AUAI, 2002).
[19] R. Bhattacharya, T. Nagarajan, D. Malinsky, and I. Shpitser, "Differentiable Causal Discovery Under Unmeasured Confounding," (2020), arXiv:2010.06978.
[20] K. Mohan, J. Pearl, and J. Tian, "Graphical Models for Inference with Missing Data," in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2013) pp. 1277–1285.
[21] F. Gunsilius, "A path-sampling method to partially identify causal effects in instrumental variable models," (2019), working paper.
[22] A. Balke and J. Pearl, "Nonparametric Bounds on Causal Effects from Partial Compliance Data," J. Am. Stat. Ass. (1993).
[23] J. Henson, R. Lal, and M. F. Pusey, "Theory-independent limits on correlations from generalized Bayesian networks," New J. Phys. 16, 113043 (2014).
[24] A. El Gamal and Y.-H. Kim, Network Information Theory (Cambridge University Press, 2011).
[25] D. Janzing, D. Balduzzi, M. Grosse-Wentrup, B. Schölkopf, et al., "Quantifying causal influences," Ann. Stat. 41, 2324 (2013).
[26] P. L. Spirtes, C. N. Glymour, and R. Scheines, Causation, Prediction, and Search (MIT Press, 2000).
[27] E. V. Strobl, S. Visweswaran, and P. L. Spirtes, "Fast causal inference with non-random missingness by test-wise deletion," Int. J. Data Sci. Analyt. 6, 47 (2018).
[28] D. Bernstein, B. Saeed, C. Squires, and C. Uhler, "Ordering-Based Causal Structure Learning in the Presence of Latent Variables," in Proc. 23rd Int. Conf. Art. Intell. Stat., Vol. 108 (PMLR, 2020) pp. 4098–4108.
[29] R. J. Evans and T. S. Richardson, "Smooth, identifiable supermodels of discrete DAG models with latent variables," Bernoulli 25, 848 (2019).
[30] D. Malinsky, I. Shpitser, and T. S. Richardson, "A potential outcomes calculus for identifying conditional path-specific effects," in Proc. 22nd Int. Conf. Art. Intell. Stat. (2019).
[31] T. Gläßle, D. Gross, and R. Chaves, "Computational tools for solving a marginal problem with applications in Bell non-locality and causal modeling," J. Phys. A 51, 484002 (2018).
[32] E. H. Lieb, "Some convexity and subadditivity properties of entropy," Bull. Am. Math. Soc. 81, 1 (1975).
[33] G. R. Kumar, C. T. Li, and A. El Gamal, "Exact common information," in 2014 IEEE International Symposium on Information Theory (2014) pp. 161–165.
[34] D. Geiger and J. Pearl, "On the Logic of Causal Models," in Proc. 4th Conf. UAI (AUAI, 1998) pp. 136–147.
A COMPARING ENTROPIC INEQUALITIES TO GENERALIZED INDEPENDENCE
RELATIONS
[Fig. S6 depicts four graphs: (a) over $\{A, X, D, B, U\}$; (b) over $\{A, D_1, D_2, B, C, U_1, U_2\}$; (c) the result of splitting $D_1$ and $D_2$ in (b) into $D_1^\#, D_1$ and $D_2^\#, D_2$; and (d) over $\{A, B, D_1, D_2, U_1, U_2, U_3\}$.]
Figure S6: In graphs (a) and (b), the entropic inequality constraints are logically implied by equality constraints. Graphs (b) and (c) demonstrate that for a set of variables $D$, the counterfactual random variable $D(D=d)$ is not necessarily equal to the factual $D$. Graph (d) provides an example where the entropic inequality constraints remain relevant even though the counterfactual distribution after intervention on an e-separating set over the remaining variables is identified.
In Proposition 7, we showed that for graphical models in which the counterfactual $P(A(D=d), B(D=d), D(D=d)=d \mid C)$ is identified, the entropic constraints of Theorem 5 are weaker than the corresponding Verma constraints. We now illustrate this point with a few examples. In Fig. S6(a) the counterfactual $P(A(D=d), B(D=d), D(D=d)=d)$ is identified, and in Fig. S6(b) the counterfactual $P(A(D=d), B(D=d), D(D=d)=d \mid C)$ is identified. Accordingly, our entropic inequalities are implied by equality constraints, due to Proposition 7. The resulting inequality constraints therefore cannot provide any additional information about whether these causal structures are compatible with observed distributions.
By contrast, in Fig. S6(d) the counterfactual $P(A(D=d), B(D=d), D(D=d)=d \mid C)$ is not identified, even though $P(A(D=d), B(D=d) \mid C)$ is. Although Fig. S6(d) implies no equality constraints [29], we find that it does entail the entropic inequality constraint following from the e-separation relation $(A \perp_e B \mid \text{upon } \{D_1, D_2\})$. It is therefore an example of a graph in which our inequality constraints are not made redundant by known equality constraints, despite the fact that intervention on $D$ is identified. This example is also an illustration of the fact that not every equality restriction featuring non-adjacent variables in an identifiable counterfactual distribution implies equality restrictions on the observed data distribution. However, some such non-adjacent variables may be involved in inequality restrictions.
The critical identifiability question for determining whether the entropic constraints are made redundant by equality constraints is that of $P(A(D=d), B(D=d), D(D=d)=d \mid C)$. This distribution involves the counterfactual random variable $D(D=d)$. Note that although any single random variable under intervention on itself is equivalent to the random variable under no intervention, the same does not necessarily hold for sets of random variables. Figs. S6(b) and S6(c) demonstrate this point: because $D_2$ is a descendant of $D_1$, after intervention on both, $D_2$ no longer takes its natural value.
B E-SEPARATION IN IDENTIFIED COUNTERFACTUAL DISTRIBUTIONS
[Fig. S7 depicts three graphs, each over $\{C, A, B, D, X\}$: (a) with latent variables $U_1, U_2$; (b) and (c) with latent variables $U_1, U_2, U_3$.]
Figure S7: In all three graphs, $A$ and $B$ are e-separated by $D$ after intervention on $C$. The counterfactual distribution over $\{A, B, D\}$ after intervention on $C$ is only identified in graphs (a) and (c), however.
A Single World Intervention Graph (SWIG) [7], which represents the model after intervention on one or more random variables, can be obtained through a node-splitting operation as illustrated in Fig. S6(c). As described in Section 3.2, d-separation relations that appear under interventions with identified distributions can be used to derive equality constraints on the observed data distribution. In this section, we explore the significance of e-separation relations in identified counterfactual distributions. We begin by noting that any e-separation relation that exists in a SWIG corresponds to an e-separation in the original DAG.

Proposition 11. $(A \perp_e B \mid C \text{ upon } D)$ after intervention on $E$ only if $(A \perp_e B \mid C \text{ upon } \{D, E\})$.

This proposition follows directly from the relationship between the fixing [16] and deletion operations. In particular, fixing and deleting vertices induce the same graphical relationships among the remaining variables in the graph.
It may at first seem that this result indicates that e-separation relations in SWIGs cannot be used to derive inequality constraints on the observed data distribution that are not already implied by e-separation relations in the original model. However, entropic inequality constraints on counterfactual distributions have a different form than such constraints on the factual distribution. This is because entropies of counterfactual variables do not in general correspond to entropies of factual variables, so there is no way to express inequality constraints that follow from e-separation relations in SWIGs as entropic inequalities on the original distribution.
To illustrate this point, consider Fig. S7. In each graph, $(A \perp_e B \mid \text{upon } D)$ in the SWIG resulting from intervention on $C$. However, in Fig. S7(b), the distribution after intervention on $C$ is not identified, whereas in Figs. S7(a) and S7(c) it is identified as $P(A(c), B(c), D(c)) = \sum_x \frac{P(A, B, C=c, D, X=x)}{P(C=c \mid X=x)}$. This means the entropic inequalities $I(A(c) : B(c)) \le H(D(c))$ on this counterfactual distribution (one for each level of $C$) imply inequality constraints on the observed data distribution as well. These inequality constraints will be obtained in Figs. S7(a) and S7(c), but not in Fig. S7(b).
Moreover, these inequality constraints can be shown to be nontrivial. Since Figs. S7(a) and S7(b) share the same d-separation and e-separation relations, it follows that any distribution compatible with Fig. S7(b) cannot be witnessed as incompatible with Fig. S7(a) using non-nested entropic equalities or inequalities. Consider the following structural equation model for Fig. S7(b): let $U_1$, $U_2$ and $U_3$ be binary and uniformly distributed, and let $X = U_2$, $A = U_2 \oplus \epsilon_A$, $C = X \oplus U_3$, $B = C \oplus U_3 \oplus \epsilon_B$, and $D = \epsilon_D$, where $\oplus$ indicates addition mod 2 and where $\epsilon_k$ is a random variable very heavily biased towards zero for $k \in \{A, B, D\}$. This establishes that $C$ and $X$ are uniformly distributed and statistically independent from each other, and hence that $P(A, B) = P(A(c=0), B(c=0))$. This construction also gives $A \oplus B = U_2 \oplus \epsilon_A \oplus C \oplus U_3 \oplus \epsilon_B = U_2 \oplus \epsilon_A \oplus X \oplus \epsilon_B = \epsilon_A \oplus \epsilon_B$, and hence $A \approx B$. This yields $I(A(c=0) : B(c=0)) \approx H(A) \approx 1$, whereas $H(D(c=0)) = H(D) \approx 0$, strongly violating the entropic inequality $I(A(c=0) : B(c=0)) \le H(D(c=0))$, which applies only to Fig. S7(a).
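The violation claimed above can be verified by exhaustively enumerating the structural equation model (here with the bias parameter $\epsilon = 0.001$ standing in for "very heavily biased towards zero"):

```python
import itertools
import math
from collections import defaultdict

def H(pmf):
    """Entropy (bits) of a pmf given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

eps = 0.001  # P(eps_k = 1): stands in for "very heavily biased towards zero"
p_eps = {0: 1 - eps, 1: eps}

# Enumerate the SEM: X = U2, A = U2^eA, C = X^U3, B = C^U3^eB, D = eD.
joint_ab, joint_d = defaultdict(float), defaultdict(float)
for u2, u3, ea, eb, ed in itertools.product((0, 1), repeat=5):
    pr = 0.25 * p_eps[ea] * p_eps[eb] * p_eps[ed]
    x = u2
    c = x ^ u3
    a, b, d = u2 ^ ea, c ^ u3 ^ eb, ed
    joint_ab[(a, b)] += pr
    joint_d[d] += pr

# C is independent of (A, B, D) here, so the identification formula gives
# P(A(c=0), B(c=0)) = P(A, B), and D(c=0) = D.
pa = {v: sum(p for (a, _), p in joint_ab.items() if a == v) for v in (0, 1)}
pb = {v: sum(p for (_, b), p in joint_ab.items() if b == v) for v in (0, 1)}
i_ab = H(pa) + H(pb) - H(joint_ab)
h_d = H(joint_d)
print(f"I(A(c=0):B(c=0)) = {i_ab:.3f} > H(D(c=0)) = {h_d:.3f}")
```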
C PROOFS
Proof of Theorem 5

Let $G^\#$ represent the graph in which every node in $D$ is split, and let $P^*$ denote the distribution over variables in $G^\#$. For notational convenience, we let $P_{d^\#}(\cdot \mid \cdot) = P^*(\cdot \mid \cdot, D^\# = d^\#)$, and let $I_{d^\#}$ and $H_{d^\#}$ be the mutual information and entropy in this distribution. Recall that by Theorem 4, if $(A \perp_e B \mid C \text{ upon } D)$, then:

4.i. $I_{d^\#}(A : B \mid C=c) = 0$, and
4.ii. $P(A, B, D=d^\# \mid C=c) = P_{d^\#}(A, B, D=d^\# \mid C=c)$.

From the latter condition (4.ii) we readily have that $H(\cdot \mid \cdot, D=d^\#) = H_{d^\#}(\cdot \mid \cdot, D=d^\#)$.
It should also be clear that

$H(X) = H_{d^\#}(X)$ whenever $X$ are among the nondescendants of $D^\#$ in $G^\#$.   (S11)

In our argument below, $C$ and $D$ are examples of such a set $X$. If we view the distribution $P_{d^\#}$ from which $H_{d^\#}$ is derived as an interventional distribution, then the above identity follows from the exclusion restriction displayed graphically by rule 3 of the po-calculus [30].
It will prove extremely useful to show that $H(\cdot \mid \cdot, D) = H_{d^\#}(\cdot \mid \cdot, D)$ can be seen to follow from $H(\cdot \mid \cdot, D=d^\#) = H_{d^\#}(\cdot \mid \cdot, D=d^\#)$ and $P(D=d^\# \mid \cdot) = P_{d^\#}(D=d^\# \mid \cdot)$ via

$H(\cdot_{\rm pre} \mid \cdot_{\rm post}, D) = \sum_d P(d \mid \cdot_{\rm post})\, H(\cdot_{\rm pre} \mid \cdot_{\rm post}, d)$
$\quad = \sum_{d^\#} P(d^\# \mid \cdot_{\rm post})\, H(\cdot_{\rm pre} \mid \cdot_{\rm post}, d^\#)$   [summing over dummy index]
$\quad = \sum_{d^\#} P_{d^\#}(d^\# \mid \cdot_{\rm post})\, H_{d^\#}(\cdot_{\rm pre} \mid \cdot_{\rm post}, d^\#)$   [applying Theorem 4]
$\quad = H_{d^\#}(\cdot_{\rm pre} \mid \cdot_{\rm post}, D)$   (S12)
We will use conditions (S11) and (S12) to translate entropic constraints on $P_{d^\#}$ into entropic constraints on $P$. The two cases in Theorem 5 share the same implicit entropic constraints on $P_{d^\#}$, but the implications on $P$ are different: the scope of condition (S11)'s applicability increases under the promise that $A$ are nondescendants of $D$ in $G$.
From this point on we will focus on deriving entropic inequality constraints on $P_{d^\#}$ such that all the terms in the derived inequalities are translatable according to conditions (S11) and (S12),⁵ because such constraints apply both to $P_{d^\#}$ and $P$. We hereafter denote $C=c$ with simply $c$ for notational brevity. Firstly, consider the following entropic inequalities:

$0 \le I_{d^\#}(A : D \mid c) = H_{d^\#}(A \mid c) - H_{d^\#}(A \mid c, D)$,   (S13a)
$0 \le I_{d^\#}(B : D \mid c) = H_{d^\#}(B \mid c) - H_{d^\#}(B \mid c, D)$,   (S13b)
$0 \le H_{d^\#}(D \mid A, B, c) = H_{d^\#}(A, B, D \mid c) - H_{d^\#}(A, B \mid c)$,   (S13c)
$0 \le -I_{d^\#}(A : B \mid c) = H_{d^\#}(A, B \mid c) - H_{d^\#}(A \mid c) - H_{d^\#}(B \mid c)$.   (S13d)
The first two are subadditivity inequalities, which follow from the nonnegativity of conditional mutual information. Subadditivity holds for both discrete and continuously valued variables [32]. The penultimate inequality follows from monotonicity (the fact that all conditional entropies are nonnegative). Conditional entropy is only guaranteed to be nonnegative for discrete variables, and this is why Theorem 5 demands that $D$ be discrete. The final inequality is an expression of the fact that $I_{d^\#}(A : B \mid c) = 0$ per (4.i) above. Summing all four inequalities (S13) leads to the derived inequality

$0 \le H_{d^\#}(A, B, D \mid c) - H_{d^\#}(A \mid c, D) - H_{d^\#}(B \mid c, D)$.   (S14)

By applying conditions (S11) and (S12) to inequality (S14) we obtain

$0 \le H(A, B, D \mid c) - H(A \mid c, D) - H(B \mid c, D)$, i.e., $I(A : B \mid c, D) \le H(D \mid c)$.   (S15)

⁵ Formally, the problem of inferring the implications of a system of linear inequalities on a strict subset of their variables may be solved by means of Fourier–Motzkin elimination or related algorithms [31].
Now consider the case where we are further promised that $A$ are nondescendants of $D$ in $G$ and hence nondescendants of $D^\#$ in $G^\#$. This means that in addition to the above results we also have that $H_{d^\#}(A \mid c) = H(A \mid c)$. We proceed as before, but instead of summing all four of the (S13) inequalities we only take the sum of the latter three. This yields

$0 \le H_{d^\#}(A, B, D \mid c) - H_{d^\#}(A \mid c) - H_{d^\#}(B \mid c, D)$.   (S16)

By applying conditions (S11) and (S12) to inequality (S16) we obtain

$0 \le H(A, B, D \mid c) - H(A \mid c) - H(B \mid c, D)$, i.e., $I(A : B, D \mid c) \le H(D \mid c)$.   (S17)
In both cases, the constraint is maintained after taking the expectation of both sides with respect to $C$. Because each term in the expectation will satisfy the inequality, so will the sum.

When $C$ are not entirely among the nondescendants of $D$ in $G$, then we can still obtain $I(A : B \mid C, D) \le H(D)$ by noting that

$0 \le I_{d^\#}(A : D \mid C) + I_{d^\#}(B : D \mid C) - I_{d^\#}(A : B \mid C) + H_{d^\#}(D \mid A, B, C) + I_{d^\#}(C : D)$   (S18a)
$\quad = H_{d^\#}(A, B, C \mid D) - H_{d^\#}(A, C \mid D) - H_{d^\#}(B, C \mid D) + H_{d^\#}(C \mid D) + H_{d^\#}(D)$   (S18b)
$\quad = H(A, B, C \mid D) - H(A, C \mid D) - H(B, C \mid D) + H(C \mid D) + H(D)$   (S18c)
$\quad = H(D) - I(A : B \mid C, D)$.   (S18d)
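The passage from (S18b) to (S18d) is a purely algebraic entropy identity, and holds for an arbitrary joint distribution; the following sketch verifies it numerically for a randomly drawn distribution over four binary variables:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2))
p /= p.sum()  # an arbitrary joint distribution over binary (A, B, C, D)

def H(axes):
    """Entropy (bits) of the marginal of p over the given axis indices."""
    drop = tuple(i for i in range(4) if i not in axes)
    m = p.sum(axis=drop) if drop else p
    m = m[m > 0]
    return float(-(m * np.log2(m)).sum())

A, B, C, D = 0, 1, 2, 3
# Left-hand side: H(A,B,C|D) - H(A,C|D) - H(B,C|D) + H(C|D) + H(D)
lhs = (H({A, B, C, D}) - H({D})) - (H({A, C, D}) - H({D})) \
    - (H({B, C, D}) - H({D})) + (H({C, D}) - H({D})) + H({D})
# Right-hand side: H(D) - I(A:B|C,D)
i_ab_cd = H({A, C, D}) + H({B, C, D}) - H({A, B, C, D}) - H({C, D})
rhs = H({D}) - i_ab_cd
print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")  # agree up to floating-point error
```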
Note that this proof technique can be adapted to derive stronger entropic inequalities for graphs which exhibit multiple different e-separation relations involving the same $D$ set. If $(A_1 \perp_e B_1 \mid C_1 \text{ upon } D)$ and $(A_2 \perp_e B_2 \mid C_2 \text{ upon } D)$ and so forth, then Theorem 4 still demands the existence of a single $P_{d^\#}$ whose various margins must now satisfy multiple distinct zero conditional mutual information equalities. We can accommodate multiple entropic equality constraints on $P_{d^\#}$ just as easily as we can accommodate a single equality constraint: the translation between constraints on $P_{d^\#}$ and $P$ will continue to be governed by conditions (S11) and (S12).
Proof of Proposition 6

If conditioning on some variables $D$ is sufficient to close a path, then that path must go through $D$, and therefore deletion of $D$ eliminates the path. By construction, the deletion operation can never open a path, unlike the conditioning operation. If $(A \perp_d B \mid C, D)$, then all paths from $A$ to $B$ go through $C$ or $D$, or through colliders that are not in $\{C, D\}$ nor have any descendants therein. It follows that $(A \perp_e B \mid C \text{ upon } D)$, as after deletion of $D$ all paths through $C$ remain blocked through conditioning, all paths through $D$ are eliminated, and all other paths remain blocked by colliders.
Proof of Proposition 8

Firstly, note that $(A \perp_d B \mid C, U)$ implies that

$0 = I(A : B \mid C=c, U) = H(A, U \mid C=c) + H(B, U \mid C=c) - H(A, B, U \mid C=c) - H(U \mid C=c)$,

i.e., that

$H(U \mid C=c) = H(A, U \mid C=c) + H(B, U \mid C=c) - H(A, B, U \mid C=c)$,   (S19)

which, whenever $H(A, B, U \mid C=c) \ge H(A, B \mid C=c)$, gives the lower bound

$H(U \mid C=c) \ge H(A \mid C=c) + H(B \mid C=c) - H(A, B \mid C=c) = I(A : B \mid C=c)$.   (S20)

Note that $H(A, B, U \mid C=c) \ge H(A, B \mid C=c)$ is equivalent to $H(U \mid A, B, C=c) \ge 0$, and that $H(U \mid \cdot) \ge 0$ whenever $U$ is discrete. The result $H(U) \ge H(U \mid C)$ follows from strong subadditivity, i.e., the fact that conditional entropy is never greater than marginal entropy, even for continuously valued variables [32].

To obtain Corollary 8.1, simply note that the constraint is maintained after taking the expectation of both sides with respect to $C$: because each term in the expectation will satisfy the inequality whenever the cardinality of $U$ is finite, so will the sum.
Proof of Corollary 10.1

Proof. A consequence of $X \in pa(W)$ is that $A \subset \{X\} \cup an(X)$ implies that no element of $A$ is a descendant of $W$. This allows us to confirm the following sequence of inequalities, quantified over the structural equation models (SEMs) for $G'_{X \to W \to Y}$ that reproduce the observed data distribution:

$\max_c\, I(A : B \mid C=c, D)$   (S21a)
$\quad \le \max_c\, \min_{\text{SEMs for } G'} I(A : B, W \mid C=c, D)$   (S21b)
$\quad \le \max_c\, \min_{\text{SEMs for } G'} H(D, W \mid C=c)$   (S21c)
$\quad \le \max_c \left( H(D \mid C=c) + \min_{\text{SEMs for } G'} H(W \mid C=c) \right)$   (S21d)
$\quad \le \max_c H(D \mid C=c) + \min_{\text{SEMs for } G'} H(W)$   (S21e)
$\quad = \max_c H(D \mid C=c) + MME_{X \to Y}$,

where all the steps above are consequences of subadditivity except for the step from Equation (S21b) to Equation (S21c), which is just the application of Equation (4a). Finally, Equation (10b) follows from Equation (10a) by convexity.
D RELATION BETWEEN COMMON ENTROPY AND MME
The
MME
bears some resemblance to a concept called common entropy [
33
], which is defined for a distribution
PpX,Y q
as
the smallest possible entropy of an unobserved variable
W
such that
XKY|W
. Unlike the
MME
, the common entropy is a
function only of the probability distribution
PpX,Y q
, and not of the graph
G
. Any
W
that renders
X
and
Y
conditionally
independent must also fully mediate the effect of
X
on
Y
, which at first glance might be taken to mean that the common
entropy is an upper bound on the
MME
, because it implies that the
MME
can search over a larger set of distributions to obtain a
low-entropy mediator. Indeed in the simple
XÑY
model, it is the case that
MMEXÑY
is bounded from above by the common
entropy between Xand Yfor precisely this reason.
However, the common entropy is not an upper bound on the MME in general. To see this, consider the graph presented in Fig. 1(c). This model contains distributions in which $A$ and $B$ are highly correlated, but $D$ and $B$ are entirely uncorrelated. For such distributions, the common entropy of $B$ and $D$ would be 0, as they are already marginally independent. However, by Corollary 10.1, the MME would be bounded from below by $I(A : B)$, which can be larger than 0. The intuition for this phenomenon is that if the edge $D \to B$ were missing, $A$ and $B$ would be marginally independent, so a high mutual information between them is evidence for the causal significance of the edge.
E AN INVERSE OF THEOREM 5

The goal of this appendix is to establish that

Proposition 12. If a graph $G$ has the feature $(A \not\perp_e B \mid C \text{ upon } D)$, then there exists a distribution in the marginal model of $G$ with discrete $D$ such that $I(A : B \mid C, D) = I(A : B \mid C) > H(D)$.
We begin by simply noting that

Lemma 13. The marginal model of any graph $G$ whose nodes include $\{A, B, C, D\}$ contains all conditional distributions $P(A, B \mid C, D)$ wherein

1. $P(A, B \mid C, D) = P(A, B \mid C)$, and
2. $P(A, B \mid C)$ is within the marginal model of the graph $G'$ defined by removing outgoing edges from $D$ in $G$.
The sorts of $P(A, B \mid C, D)$ described in Lemma 13 arise by considering causal models wherein every child of $D$ always ignores the values of $D$, treating $D$ as if it has no descendants. We next invoke the completeness of d-separation. That is,

Lemma 14. If $(A \not\perp_d B \mid C)$ in $G$, then there exists a distribution in the marginal model of $G$ for which $I(A : B \mid C) > 0$.
We note that Lemma 14 follows from

Lemma 15. If $(A \not\perp_d B \mid C)$ in $G$ for singleton nodes $A$ and $B$, then there exists a distribution in the marginal model of $G$ for which $\exists c$ s.t. $I(A : B \mid C=c) > 0$.
After all, $I(\mathbf{A} : \mathbf{B} \mid \mathbf{C}) = 0$ if and only if $I(A : B \mid \mathbf{C}) = 0$ for all singleton nodes $A \in \mathbf{A}$ and $B \in \mathbf{B}$. Moreover, $I(A : B \mid \mathbf{C}) = 0$ if and only if $I(A : B \mid \mathbf{C}=c) = 0$ for all $c$ having positive support. We believe that Lemma 15 explicitly follows from Theorem 3 in Ref. [34], but we provide an explicit proof of it below for completeness.
By combining Lemmas 13 and 14 we obtain Proposition 12. This follows by noting that whenever a graph $G$ has the feature $(A \not\perp_e B \mid C \text{ upon } D)$, then by definition the graph $G'$ defined by removing outgoing edges from $D$ in $G$ exhibits $(A \not\perp_d B \mid C)$. To violate the basic inequality in Theorem 5 we apply Lemma 14 while keeping $H(D) < I(A : B \mid C)$. We can make $H(D)$ arbitrarily small by heavily biasing $P(D)$ towards one value.
Proof of Lemma 15
The following construction yields I(A:B|C = 1) = log(2) whenever G exhibits (A ⊥̸_d B | C) for singleton nodes A and B. If (A ⊥̸_d B | C) in G, then there exists some path in G with end nodes A and B such that all colliders in the path are elements of C and no element of C appears in the path except as a collider. We classify the nodes within the path into three distinct types:
Two Incoming Edges from the Path   These are the colliders in the path, elements of C. We take each such node to act as a Kronecker delta function of its two in-path parents; that is, it returns the value 1 iff the two in-path parents have coinciding values:

y(x₁, x₂) = 0 with unit probability if x₁ ≠ x₂,
            1 with unit probability if x₁ = x₂.   (S22)
One Incoming Edge from the Path   These are the mediaries in the path, as well as at least one (perhaps both) of the end nodes of the path. Let each such variable act as the identity function of its single in-path parent. That is,

y(x) = 0 with unit probability if x = 0,
       1 with unit probability if x = 1.   (S23)
Zero Incoming Edges from the Path   These are the bases of forks in the path, as well as potentially one of the end nodes of the path. Let each such variable be uniformly random with cardinality 2. That is,

y() = 0 with probability 1/2,
      1 with probability 1/2.   (S24)
This construction results in every non-collider being uniformly distributed over {0, 1} and always taking the same value as every other non-collider in the path upon postselecting all colliders in the path to take the value 1. That is, this construction explicitly ensures that I(A:B|C = 1) = log(2).
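As a sanity check, the construction can be enumerated exactly for its smallest instance, which we take here to be the path A → Y ← B with a single collider Y ∈ C (an assumed example; the function name is ours). Both end nodes are uniform bits, the collider is the Kronecker delta of its parents, and postselecting Y = 1 leaves A and B perfectly correlated.

```python
# Illustrative sketch: exact enumeration of the smallest instance of the
# construction, the path A -> Y <- B with collider Y conditioned to 1.
from collections import defaultdict
from math import log2

def conditional_mi_given_collider_one():
    """I(A:B | Y=1) in bits for the path A -> Y <- B."""
    joint = defaultdict(float)
    for a in (0, 1):              # end node A: uniform, no in-path parents
        for b in (0, 1):          # end node B: uniform, no in-path parents
            y = 1 if a == b else 0  # collider: Kronecker delta of its parents
            joint[(a, b, y)] += 0.25
    # Postselect on the collider taking the value 1, then renormalize.
    post = {(a, b): p for (a, b, y), p in joint.items() if y == 1}
    z = sum(post.values())
    post = {k: p / z for k, p in post.items()}
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in post.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b])) for (a, b), p in post.items())
```

The returned value is one bit, which equals log(2) when entropies are measured in nats, matching the claim I(A:B|C = 1) = log(2).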