New J. Phys. 17 (2015) 033002 doi:10.1088/1367-2630/17/3/033002
PAPER
The lesson of causal discovery algorithms for quantum correlations:
causal explanations of Bell-inequality violations require fine-tuning
Christopher J Wood^1,2 and Robert W Spekkens^3
^1 Institute for Quantum Computing, Waterloo, Ontario N2L 3G1, Canada
^2 Department of Physics and Astronomy, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
^3 Perimeter Institute for Theoretical Physics, Waterloo, Ontario N2L 2Y5, Canada
E-mail: rspekkens@perimeterinstitute.ca
Keywords: Bellʼs theorem, nonlocality, causality, quantum theory
Abstract
An active area of research in the fields of machine learning and statistics is the development of causal
discovery algorithms, the purpose of which is to infer the causal relations that hold among a set of
variables from the correlations that these exhibit. We apply some of these algorithms to the correla-
tions that arise for entangled quantum systems. We show that they cannot distinguish correlations
that satisfy Bell inequalities from correlations that violate Bell inequalities, and consequently that they
cannot do justice to the challenges of explaining certain quantum correlations causally. Nonetheless,
by adapting the conceptual tools of causal inference, we can show that any attempt to provide a causal
explanation of nonsignalling correlations that violate a Bell inequality must contradict a core principle
of these algorithms, namely, that an observed statistical independence between variables should not be
explained by fine-tuning of the causal parameters. In particular, we demonstrate the need for such
fine-tuning for most of the causal mechanisms that have been proposed to underlie Bell correlations,
including superluminal causal influences, superdeterminism (that is, a denial of freedom of choice of
settings), and retrocausal influences which do not introduce causal cycles.
1. Introduction
A causal relation, unlike a correlation, is an asymmetric relation that can support inferences about the
consequences of interventions and about counterfactuals. The sun rising and the rooster crowing are strongly
correlated, but to say that the first is the cause of the second is to say more. In particular, it says that forcing the
rooster to crow early will not precipitate an early dawn, whereas causing the sun to rise early (for instance, by
moving the rooster eastward) can lead to some early crowing. Nonetheless, causal structure has implications for
the observed correlations and consequently one can make inferences about the causal structure based on the
observed correlations. Indeed, there has been much progress in the last 25 years on how to make such inferences,
progress that has been primarily due to philosophers and researchers in the field of machine learning and which
is well summarized in the books of Pearl [1] and of Spirtes, Glymour and Scheines (SGS) [2]. Such inference
schemes are known as causal discovery algorithms. In this article, we shall consider the question of what some
prominent causal discovery algorithms have to say about the causal structure that might underlie quantum
correlations, in particular those that violate Bell inequalities.
Suppose that one conducts measurements on a pair of systems that have been prepared together, and then
removed to distant locations such that the outcome at each wing of the experiment is outside the future light
cone of the measurement choice in the other wing. Suppose further that one finds that the correlations so
obtained violate Bell inequalities. If one insists on a causal explanation of these correlations, then it would seem
that one must admit that the causes must propagate faster than the speed of light. But this is in tension with the
fact that one cannot send signals faster than the speed of light. We take this tension to be the mystery of Bellʼs
theorem: if there are indeed superluminal causes, then why can’t we use them to send superluminal signals? In
this article, we will show that the principles behind causal discovery algorithms can clarify the nature of this
tension. We also show that this tension persists in more exotic proposals for achieving a causal explanation of
Bell inequality violations such as superdeterminism, which is an assumption that at least one of the
measurement settings is influenced by a variable that is a common cause of the outcome on the opposite wing
(and hence this setting variable is not freely chosen), and retrocausation, wherein causes propagate counter to
the standard direction of time.
We consider the most prominent causal discovery algorithms, which take as their input the set of conditional
independence (CI) relations that hold among the observed variables. No other feature of the probability
distribution is relevant for them. Our analysis will reveal that such algorithms do not capture the insights of
Bellʼs theorem. It follows that there is an opportunity for researchers in the field of quantum foundations with
expertise on Bellʼs theorem to improve upon existing causal discovery algorithms. Indeed, in the time since a
preprint of this article first appeared, the process has already begun. Inspired by entropic Bell inequalities and
building on the work of [6], it has recently been shown in [7] that the causal structure implies certain entropic
inequalities on the joint probability distribution. We anticipate that there are many more opportunities for
improvements to causal inference based on ideas from the field of quantum foundations.^4
The distinction between causal and inferential concepts is an instance of the distinction between ontic
concepts (those pertaining to reality) and epistemic concepts (those pertaining to our knowledge of reality).
Within the field of statistics, disentangling causal and inferential concepts is notoriously difficult and
controversial, as is the question of when causal claims are supported by the observed correlations. In the
quantum realm, where there is even less agreement about which parts of the formalism refer to ontic concepts
and which refer to epistemic concepts, the problem is compounded [9]. As such, we shall try to present our
analysis in a manner that does not presume any particular interpretation of quantum theory. For instance, given
that different interpretations disagree on whether quantum theory implies an objective indeterminism in nature
or not, we shall not presume any particular answer to this question. Instead, we simply focus on the operational
predictions of the theory.
Some previous work has already considered Bellʼs theorem from the perspective of causal discovery
algorithms. In particular, the books by Pearl [1] and by SGS [2] comment briefly on the question. They both
assert that Bellʼs theorem forces a dilemma between (i) abandoning a particular notion of locality, that there are
no superluminal causal influences, and (ii) abandoning the assumption that if two variables are statistically
dependent, then this is explained either by the existence of a cause from one to the other or a common cause
acting on both, or a combination of the two mechanisms. Assumption (ii) underlies what is called the ‘causal
Markov condition’, but we will refer to it here simply as Reichenbachʼs principle; in a slogan, it asserts that
correlations must be explained causally.^5 One can legitimately quibble with the claim that Bellʼs theorem forces
such a dilemma on the grounds that there are other assumptions that go into the theorem: the absence of
superdeterminism (an assumption that is often characterized as the existence of freedom in the choice of
settings), and the absence of retrocausal influences, for instance. Nonetheless, this is an improvement over the
standard characterization of Bellʼs theorem as forcing a dilemma between abandoning locality and abandoning
realism. It has always been rather unclear what precisely is meant by ‘realism’. Norsen has considered various
philosophical notions of realism and concluded that none seem to have the feature that one could hope to save
locality by abandoning them [10]. For instance, if realism is taken to be a commitment to the existence of an
external world, then the notion of locality—that every causal influence between physical systems propagates
subluminally—already presupposes realism. Furthermore, we will show that the tools of causal inference can also
be used to argue for the implausibility of superdeterminism and retrocausal influences.
Our first conclusion is a relatively straightforward one. We note that in the case of a Bell scenario, where a
pair of systems is prepared together then separated and each subjected to a measurement, all correlations exhibit
the following CI relations among the observable variables:
1. Marginal independence of the measurement setting variables,
2. No-signalling, that is, CI of the outcome at one wing of the experiment and the setting at the opposite wing
given the setting at the first wing.
^4 Other work in the field of machine learning has appealed to statistical features besides CI relations, but not the features of correlations that
are relevant for Bellʼs theorem. Peters et al [8] demonstrate that if one is promised an additive noise model, then features of the joint
distribution can often distinguish cause from effect in the case of a distribution on a pair of variables, where there are no CI relations to guide
the analysis. Other approaches have appealed to the complexity of conditional distributions [3–5].
^5 A defining feature of a common cause is that if the statistical dependence between two variables is to be explained entirely by a common
cause, then it must be the case that conditioning on the common cause makes the variables statistically independent. As we will see, this
feature is built into the framework of causal models. Statements of Reichenbachʼs principle often assert it explicitly.
Except for independences that are due to special degeneracies in the quantum state, these are the only CIs
arising in Bell scenarios. These independences characterize both the correlations that satisfy all the Bell
inequalities, and the correlations that violate some Bell inequality. Therefore, if the causal discovery algorithm
takes as its input not the full distribution but only the CI relations that hold in the distribution (as is the case with
the prominent such algorithms), then this algorithm cannot distinguish correlations that violate Bell inequalities
from correlations that satisfy Bell inequalities. The input to such algorithms is simply too impoverished to see
the difference. It follows that the causal distinctions that do exist between these correlations—those that are
implied by Bellʼs theorem—cannot be recognized by these algorithms. They may consequently make incorrect
assessments of what causal structure is implied by a given set of correlations.
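To see how impoverished this input is, consider the following minimal sketch in Python (our own encoding; the names local and pr_box are ours, not the paper's). It checks that a Bell-inequality-satisfying distribution and the Popescu–Rohrlich box, which violates the CHSH inequality maximally, satisfy exactly the same no-signalling CI relations, so a CI-based algorithm receives identical input in the two cases (the settings are taken to be uniform and independent in both):

```python
from itertools import product

# A minimal sketch (our encoding): two no-signalling boxes P(A, B | S, T).
# 'local' satisfies all Bell inequalities; 'pr_box' is the
# Popescu-Rohrlich box, which violates the CHSH inequality maximally.
def pr_box(a, b, s, t):
    return 0.5 if (a ^ b) == (s & t) else 0.0

def local(a, b, s, t):
    return 0.25  # outcomes uniformly random, uncorrelated with everything

for name, box in [("local", local), ("PR box", pr_box)]:
    # No-signalling: the outcome marginal at one wing is independent of
    # the setting at the other wing, P(A | S, T) = P(A | S).
    for a, s in product([0, 1], repeat=2):
        p_a = [sum(box(a, b, s, t) for b in (0, 1)) for t in (0, 1)]
        assert p_a[0] == p_a[1], "signalling detected"
    print(name, "satisfies the no-signalling CI relations")
```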
By explicitly applying the standard causal discovery algorithms to the CI relations that characterize a Bell
scenario, we draw attention to the fact that the output of such algorithms must be interpreted with great care, lest
one be led to an incorrect conclusion about the viability of certain causal explanations. We look at both the case
where one presumes that the settings and outcomes are the only causally relevant variables, i.e., the case of no
hidden variables, and the case where one allows hidden variables.
Finally, we set aside the details of existing algorithms and consider simply what the core principles
underlying these algorithms imply about the possibility of causal explanations of Bell inequality violations. We
demonstrate that any causal model that can hope to explain Bell-inequality-violating correlations (or even to
explain Bell-inequality-satisfying correlations without recourse to hidden variables) has the feature that in order
to explain the conditional independencies among the observed variables, in particular the no-signalling
constraints, it must involve a fine-tuning of the causal parameters.
So, in the end, we obtain a characterization of Bellʼs theorem that is quite far from its standard
characterization as a denial of ‘local realism’. The assumptions that go into this new characterization are: the
framework of causal models, which incorporates Reichenbachʼs principle that correlations should be explained
causally, as well as the principle that CI relations should not be explained by fine-tuning. As we shall see, the no
fine-tuning principle, applied to the observed independences in a Bell scenario (including the lack of
superluminal signals), implies the lack of superluminal causal influences, which is Bellʼs notion of local
causality. So Bellʼs notion of local causality is derived as one particular consequence of no fine-tuning in this
approach. The real innovation of this approach, however, is that the no fine-tuning principle together with the
observed independences also rule out superdeterminism and retrocausal influences that do not introduce causal
cycles. It follows that all three of the main approaches for providing a causal explanation of Bell correlations,
superluminal causes, superdeterminism and retrocausal influences, are unsatisfactory, and they are all
unsatisfactory for the same reason.
Our approach demonstrates that Bellʼs theorem can be understood as a statement about the possibility of a
causal account of quantum correlations. This characterization is an improvement over the standard one for
several reasons. First, we believe that the question of what constitutes a causal explanation of correlations is more
clearly defined than the question of what constitutes a realist explanation of those correlations. Of course, if one
likes, one can take the notion of causal explanation to be an elucidation of the notion of realism at play in Bellʼs
theorem. In other words, one could take the view that an explanation should be described as realist only if it is
causal. Indeed, the views of many proponents of anti-realism in quantum theory are aptly characterized as a
denial of the need to provide a causal explanation of quantum correlations. The second advantage of our
characterization is that the fine-tuning criticism applies to all of the various attempts to provide a causal
explanation of Bell inequality violations. Accounts in terms of superluminal causes, superdeterminism or acyclic
retrocausation are found to fall under a common umbrella. The conspiratorial flavour of each such account can
be formalized as a need for fine-tuning.
2. Causal structures and causal models
The modern approach to the formal study of causality considers in some detail the significance of interventions
and counterfactuals for defining the notion of a causal relation [1,2]. There is a large literature on whether these
sorts of definitions are adequate [11]. Although questions of this sort are relevant to a discussion of Bellʼs
theorem, they will not be the focus of this article. We begin by describing the mathematical formalism that is
relevant for describing the causal discovery algorithms in [1] and [2]. We follow the presentation of these
authors.
A causal structure is a set of variables V and a set of ordered pairs of distinct variables ⟨X, Y⟩, each specifying that X
is a direct cause of Y relative to V.
Being in a relationship of direct causation is a property that is defined relative to the set of variables being
considered. If one considers a larger set which includes more variables, then what was a direct causal relation in
the first set might become a mediated causal relation in the second.
Such causal structures can be represented conveniently by directed acyclic graphs (DAGs). A directed graph G
corresponds to a set of vertices and a set of directed edges among the vertices (a vertex cannot be connected to
itself). The acyclic property asserts that there are no directed paths in the graph that begin and end at the same
vertex. DAGs represent causal structures in the obvious manner: every variable in V is represented by a vertex,
and for every pair of variables ⟨X, Y⟩ where X is a direct cause of Y, there is a directed edge in the graph between
the associated vertices.^6
As is standard, we use the terminology of family relations in the obvious manner: if X is a cause of Y, direct or
mediated, then X is said to be an ancestor of Y, and Y is said to be a descendant of X. If X is a direct cause of Y, then
X is said to be a parent of Y. The set of all parents of a variable X will be denoted Pa(X) and the set of all
nondescendants of a variable X will be denoted Nd(X). The variables in the causal structure that have no parents
will be called exogenous, while those with parents will be called endogenous.
A deterministic causal model consists of a causal structure and a set Θ of causal and statistical parameters. The
causal parameters describe the functional relations that fix the values of every variable X given its parents Pa(X)
in the causal structure, that is, for every X they describe a function f_X specifying X = f_X(Pa(X)). The statistical
parameters specify a probability distribution over the exogenous variables, that is, a distribution P(X) for every
exogenous X. An example of a deterministic causal model is given in figure 1.
The notion of a general causal model (not necessarily deterministic) can be explained as follows. We start
with a deterministic causal model and modify it in a particular way. When an exogenous variable U is the parent
of only a single other variable, say X (i.e. U is not a common cause of two or more variables), it is possible to
eliminate U from the causal structure, and to replace the deterministic dependence of X on its original set of
parents with a probabilistic dependence on its new set of parents. Specifically, if the deterministic causal model
specifies that X = f(Pa(X)) for some function f (here Pa(X) includes U), then the new causal model specifies a
conditional probability P(X∣Pa′(X)) (here Pa′(X) are the parents relative to the new causal structure, which
excludes U). Specifically, the conditional probability is defined by P(X∣Pa′(X)) = ∑_U P(U) δ_{X, f(Pa′(X), U)}.
It follows that a general causal model consists of a causal structure and a set Θ of causal–statistical parameters.
The causal–statistical parameters specify a conditional probability distribution for every variable given its causal
parents, P(X∣Pa(X)).^7 Exogenous variables have the null set for their causal parents, so that to condition on
their parents is not to condition at all. Consequently, for the exogenous variables, the causal–statistical
parameters simply specify the unconditioned distributions over each of these.^8
Figure 1. An example of a deterministic causal model.
Figure 2. An example of a causal model consisting of a causal structure, represented by a directed acyclic graph, and a set of causal–
statistical parameters, specifying the probability of each variable conditioned on its parents.
^6 One can imagine more general notions of causation wherein directed cycles are allowed, but we will not consider such notions here.
^7 We have chosen to call the parameters of a general causal model ‘causal–statistical’ because if the causal model arises from an underlying
deterministic causal model, then the conditional probabilities in the causal model fold together two different sorts of parameters from the
underlying deterministic causal model: functional dependences of variables on their parents, which are causal parameters, and distributions
over the local noise variables, which are statistical parameters.
An example of a general causal model is given in figure 2. It can be obtained from the deterministic causal
model of figure 1 by eliminating the exogenous variables U and V. (Note that one need not eliminate all
exogenous variables from a deterministic causal model to obtain a nondeterministic causal model—for
instance, S and T have not been eliminated in our example.)
Deterministic causal models are clearly a special case of causal models where all conditional probabilities
correspond to deterministic functions. It is also clear that for any given causal model, one can always view it as
arising from a deterministic causal model where some exogenous variables have been excluded. To obtain such a
deterministic extension of a causal model, it suffices to add new exogenous variables as parents of every
endogenous variable in the model. For the rest of the article, we will focus on the general notion of a causal
model, rather than on deterministic causal models.
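A deterministic extension is easy to realize concretely. A minimal sketch (our construction, not taken from the paper): given the conditional P(X=1∣Pa(X)) of a general causal model, introduce a fresh exogenous variable U, uniformly distributed on [0, 1), and recover X as a deterministic function of its parents together with U:

```python
import random

def f_X(parents, u, p_x1):
    """X = f(Pa(X), U): X is 1 exactly when the noise U falls below
    P(X=1 | Pa(X))."""
    return 1 if u < p_x1(parents) else 0

# Illustrative conditional (the numbers are placeholders):
p_x1 = lambda pa: 0.9 if pa["Y"] == 1 else 0.2
u = random.random()           # statistical parameter: the distribution of U
x = f_X({"Y": 1}, u, p_x1)    # causal parameter: the function f
# Marginalizing over U recovers the conditional, P(X=1 | Y=1) = 0.9,
# exactly as in the delta-function formula of the preceding discussion.
```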
We pause to discuss briefly the possible interpretation of the probabilities in a causal model. One could take a
Bayesian attitude towards these probabilities. In this case, the probability distribution on an exogenous variable
U represents an agentʼs degrees of belief about U, and the conditional probability P(X∣Pa(X)) represents
degrees of belief about X given its parents. Another possibility is to take a frequentist attitude towards the
probabilities. This is arguably the position adopted by Pearl, who describes the auxiliary variables appearing in a
deterministic extension of a causal model as ‘unmeasurable conditions that Nature governs by some undisclosed
probability function’ ([1], p 44). One could even interpret the probabilities as propensities, indicating an
irreducible randomness in oneʼs theory (an option that might be appealing to some when considering the
possibility of explaining quantum correlations in terms of causal models). Our conclusions here will be
independent of this choice.^9
It is worth noting that the fact that exogenous variables are assumed to be independently distributed, which
is part of the definition of a causal model, is a consequence of Reichenbachʼs principle. The principle asserts that
one must explain all correlations by a causal mechanism, so that if two variables are correlated then either one is
a cause of the other, or there is a common cause acting on both (this is not an exclusive or—it could be that two
variables are related by both a common cause and a cause-effect relation). In other words, the exogenous
variables are by definition the variables that one takes to be uncorrelated.
Consider the following question: given a causal model, what sorts of correlations can be observed among the
variables? Clearly, there is a set of joint distributions that are possible, depending on the causal–statistical
parameters that we add to the causal structure to get a causal model.
Consider the example from figure 2. The causal model predicts that the joint distribution over all the
variables should be

P(X, Y, S, T, W) = P(W) P(S) P(T) P(Y∣T, W) P(X∣Y, S, T, W).  (1)

To see this, it suffices to note that in the deterministic extension of this model, depicted in figure 1, we have

P(X, Y, S, T, W, U, V) = P(U) P(V) P(W) P(S) P(T) δ_{Y, f_Y(T,V,W)} δ_{X, f_X(S,T,Y,U,W)},  (2)

where δ denotes the Kronecker delta function, δ_{X,Y} = 1 if and only if X = Y, and consequently

P(X, Y, S, T, W) = ∑_{U,V} P(U) P(V) P(W) P(S) P(T) δ_{Y, f_Y(T,V,W)} δ_{X, f_X(S,T,Y,U,W)}
                 = P(W) P(S) P(T) P(Y∣T, W) P(X∣Y, S, T, W),  (3)

where P(Y∣T, W) ≡ ∑_V P(V) δ_{Y, f_Y(T,V,W)} and P(X∣Y, S, T, W) ≡ ∑_U P(U) δ_{X, f_X(S,T,Y,U,W)}.
In general, a causal model with variables V ≡ {X_1, …, X_n} predicts a joint distribution of the form

P(X_1, …, X_n) = ∏_{i=1,…,n} P(X_i∣Pa(X_i)).  (4)

Essentially, one multiplies together the conditional probabilities for every variable given its parents, all of which
are specified by the causal model. For a DAG that is not a complete graph (i.e. not every pair of nodes is
connected by an edge), the probability distributions that it supports are a subset of the possible distributions over
those variables.
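As an illustration, here is a minimal Python sketch (our encoding; the numerical parameters are illustrative placeholders, not values taken from the paper) that implements equation (4) for the causal structure of figure 2 and confirms that the resulting joint distribution is normalized:

```python
from itertools import product

# The causal structure of figure 2: W, S, T exogenous; Pa(Y) = {T, W};
# Pa(X) = {Y, S, T, W}.  All variables binary; the numbers below are
# illustrative placeholders, not parameters taken from the paper.
P_W = {0: 0.5, 1: 0.5}
P_S = {0: 0.7, 1: 0.3}
P_T = {0: 0.6, 1: 0.4}
P_Y1 = {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.5, (1, 1): 0.9}  # P(Y=1 | T, W)

def P_X1(y, s, t, w):
    """P(X=1 | Y, S, T, W): an arbitrary noisy-parity choice."""
    return 0.9 if (y + s + t + w) % 2 else 0.1

def joint(x, y, s, t, w):
    """Equation (4): the product of each variable's conditional given
    its parents."""
    py = P_Y1[(t, w)] if y == 1 else 1 - P_Y1[(t, w)]
    px = P_X1(y, s, t, w) if x == 1 else 1 - P_X1(y, s, t, w)
    return P_W[w] * P_S[s] * P_T[t] * py * px

# The factorization automatically defines a normalized distribution.
total = sum(joint(*v) for v in product([0, 1], repeat=5))
assert abs(total - 1.0) < 1e-12
```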
We now turn to another question: what properties do all distributions consistent with a given causal
structure have in common? In other words, what are the features of the joint probability distribution that
depend only on the causal structure and not on the causal–statistical parameters? CI relations are an example of
such properties, and they are the ones that most causal discovery algorithms to date have focussed upon.
^8 Such models are sometimes called Markovian. A more general sort of model, which allows bi-directed edges representing the existence of
an unobserved common cause for a pair of variables, is called semi-Markovian.
^9 Although we ultimately favour the Bayesian interpretation.
Recall that variables X and Y are conditionally independent given Z, denoted (X ⊥⊥ Y∣Z), if any of the
following three equivalent conditions holds:

1. P(X∣Y, Z) = P(X∣Z) for all y, z such that P(Y=y, Z=z) > 0,
2. P(Y∣X, Z) = P(Y∣Z) for all x, z such that P(X=x, Z=z) > 0,
3. P(X, Y∣Z) = P(X∣Z) P(Y∣Z) for all z such that P(Z=z) > 0.

An intuitive account of each of these conditions is as follows: in the context of already knowing Z, (1) learning Y
teaches you nothing about X (i.e. Y teaches you nothing more about X than what you already could infer from
knowing Z), (2) learning X teaches you nothing about Y, and (3) X and Y are independently distributed. Note
that marginal independence of X and Y, where P(X, Y) = P(X) P(Y), is simply CI where the conditioning set is
the null set.
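Since the algorithms discussed below take CI relations as input, it is useful to be able to compute them. The following is a minimal Python sketch (our own encoding, not the paper's): a joint distribution is a dictionary from value-tuples to probabilities, and condition 3 above is checked directly.

```python
def marginal(P, idx):
    """Marginalize a joint distribution P (dict: value-tuple -> prob)
    onto the coordinate positions listed in idx."""
    out = {}
    for v, p in P.items():
        key = tuple(v[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def ci(P, X, Y, Z, tol=1e-9):
    """Test (X _||_ Y | Z) via condition 3: P(X,Y|Z) = P(X|Z) P(Y|Z)
    for every z with P(z) > 0.  X, Y, Z are tuples of coordinate
    positions into the value-tuples of P."""
    pz, pxz = marginal(P, Z), marginal(P, X + Z)
    pyz, pxyz = marginal(P, Y + Z), marginal(P, X + Y + Z)
    for x in marginal(P, X):
        for y in marginal(P, Y):
            for z in pz:
                if pz[z] <= tol:
                    continue
                lhs = pxyz.get(x + y + z, 0.0) / pz[z]
                rhs = (pxz.get(x + z, 0.0) / pz[z]) * \
                      (pyz.get(y + z, 0.0) / pz[z])
                if abs(lhs - rhs) > tol:
                    return False
    return True

# Example: a perfectly correlated pair is not marginally independent
# (with Z the empty tuple, CI reduces to marginal independence).
P = {(0, 0): 0.5, (1, 1): 0.5}
print(ci(P, (0,), (1,), ()))   # False
```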
The definition of CI implies that certain logical inferences hold among CI relations. In other words, in a
complete set of CI relations, the CI relations need not be logically independent of one another. In particular, the
semi-graphoid axioms specify some inferences that can be drawn among CI relations. They are:

Symmetry: (X ⊥⊥ Y∣Z) ⇔ (Y ⊥⊥ X∣Z),
Decomposition: (X ⊥⊥ YW∣Z) ⇒ (X ⊥⊥ Y∣Z),
Weak union: (X ⊥⊥ YW∣Z) ⇒ (X ⊥⊥ Y∣ZW),
Contraction: (X ⊥⊥ Y∣Z) and (X ⊥⊥ W∣ZY) ⇒ (X ⊥⊥ YW∣Z).

Any set of variables can be considered as a new variable, so each of the variables X, Y, W and Z appearing in the
axioms should be understood as possibly representing a set of variables. These axioms are quite intuitive.
Decomposition, for instance, states that if, in the context of knowing Z, learning W and Y teaches you nothing
about X, then learning W alone teaches you nothing about X.
Note that if one wants to specify all the CI relations that hold for a given probability distribution, it suffices to
specify a generating set, defined to be a set from which the rest can be obtained by the semi-graphoid axioms. In
this paper, the CI relations will typically be specified by a generating set.
With these tools in hand, we can now discuss the central result concerning what properties of a joint
probability distribution can be inferred from the causal structure.
Theorem 1 (Causal Markov condition). In the joint distribution induced by a causal structure, every variable X is
conditionally independent of its nondescendants given its parents: (X ⊥⊥ Nd(X)∣Pa(X)).
This result follows from equation (4) because

P(X∣Pa(X), Nd(X)) = P(X, Pa(X), Nd(X)) / P(Pa(X), Nd(X))
                  = [P(X∣Pa(X)) ∏_{Y∈Pa(X),Nd(X)} P(Y∣Pa(Y))] / [∏_{Y∈Pa(X),Nd(X)} P(Y∣Pa(Y))]
                  = P(X∣Pa(X)).  (5)
The causal Markov condition implies a CI relation for every variable that is not exogenous in the causal
structure. One can then infer additional CI relations from these by the semi-graphoid axioms.
To see these ideas in action, consider again the example from figure 2. It turns out that (Y ⊥⊥ S∣T) for this
causal structure, as we now demonstrate. Applying the causal Markov condition to Y, whose parents are W and T
and whose only other nondescendant is S (X, being a child of Y, is a descendant), one infers that (Y ⊥⊥ S∣WT).
Applying it to W, S and T one infers (W ⊥⊥ ST), (S ⊥⊥ WT) and (T ⊥⊥ WS) respectively. From the contraction
axiom, (S ⊥⊥ WT) and (S ⊥⊥ Y∣WT) (the latter being equivalent by symmetry to (Y ⊥⊥ S∣WT)) imply
(S ⊥⊥ YWT). From weak union we then obtain (S ⊥⊥ YW∣T), and from decomposition we have (S ⊥⊥ Y∣T),
which is equivalent by symmetry to (Y ⊥⊥ S∣T).
We see that it can be rather laborious to infer CI relations from the causal Markov condition and the semi-
graphoid axioms. Fortunately, there is a graphical criterion for identifying such relations, known as d-separation
[1]. We will not dwell on this notion here, but we present a brief introduction in the appendix.
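For readers who want to experiment, a d-separation test is available in the networkx library (as nx.d_separated from version 2.8; newer releases rename it nx.is_d_separator), so the relation derived above can be read off the graph directly. A minimal sketch, with our own node labels, for the causal structure of figure 2:

```python
import networkx as nx

# The causal structure of figure 2: W, S, T exogenous; Pa(Y) = {T, W};
# Pa(X) = {Y, S, T, W}.  (Node labels are ours.)
G = nx.DiGraph()
G.add_edges_from([("T", "Y"), ("W", "Y"),
                  ("Y", "X"), ("S", "X"), ("T", "X"), ("W", "X")])

# (Y _||_ S | T): d-separation confirms the laborious derivation above.
print(nx.d_separated(G, {"Y"}, {"S"}, {"T"}))   # True
# Conditioning on the collider X instead unblocks the path Y -> X <- S:
print(nx.d_separated(G, {"Y"}, {"S"}, {"X"}))   # False
```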
Note that in addition to the CI relations that are implied by the causal structure, there may be additional CI
relations that are implied by the particular values of the causal–statistical parameters. Such additional CI
relations are problematic for causal discovery algorithms, as we shall see.
3. Causal discovery algorithms
We have described the correlations that are possible for a given causal structure. Causal discovery algorithms
seek to solve the inverse problem: starting from correlations among observed variables, can one infer which
causal structures might account for these correlations? Researchers in this area have indeed devised some
schemes for narrowing down the set of causal structures that can yield a natural explanation of the correlations,
wherein the notion of naturalness at play is one that we shall make explicit shortly. The algorithms look to the
CIs among the variables to infer information about the causal structure.
In general, causal discovery algorithms may be applied directly to experimental data and in this case one
needs to deal with the subtle issue of how to infer CI relations from a finite sample of a probability distribution.
However, in what follows we are going to apply the causal discovery algorithms directly to the distributions
prescribed by quantum theory, so we needn’t worry about this subtlety.
It is worth reviewing a few basic facts about the output of causal discovery algorithms. First of all, two
different causal structures might support precisely the same probability distributions, so that observation of one
of these distributions necessarily leaves one ignorant about which causal structure is at play. As an example, for
three variables, the three causal structures shown in figure 3 all support the same set of probability distributions—
those wherein A and B are conditionally independent given C (these are the DAGs wherein A and B are d-
separated given C). (The general conditions under which two causal structures are observationally equivalent are
given by theorem 1.2.8 in [1].)
It follows that causal discovery algorithms will necessarily sometimes yield an equivalence class of causal
structures. When this occurs, additional information is required if one is to narrow down the causal structure to
a unique possibility, for instance information about the temporal order of some of the variables.
Despite this, one can often narrow down the field of causal possibilities significantly. To get a feeling for how
this works, it is useful to start with a very simple example. Suppose that one has three binary-valued variables,
denoted A, B and C. Suppose further that the joint distribution over the triple, P(A, B, C), is such that

(A ⊥⊥ B), i.e. P(A, B) = P(A) P(B),
¬(A ⊥⊥ C), i.e. P(A, C) ≠ P(A) P(C),
¬(B ⊥⊥ C), i.e. P(B, C) ≠ P(B) P(C).  (6)
What is the natural causal explanation for this sort of correlation? It is as shown in figure 4. The marginal
independence of A and B is explained by their being causally independent.
However, there are other possible causal explanations, such as the one given in figure 5. The reason this is a
possible explanation is that there are two causal mechanisms by which A and B could become correlated, and
it could be that the two types of correlations combine in such a way as to leave A and B marginally independent.
For this to happen, however, the parameters in the causal model cannot be chosen arbitrarily, and it is in this
sense that the explanation is less natural than the one provided by figure 4.
An example helps to make all of this more explicit. We adopt the following notational convention (inspired
by the representation of mixtures in quantum theory):

P(A) = [x] means P(A = x) = 1,
P(A, B) = [x][y] ≡ [xy] means P(A = x, B = y) = 1.
Figure 3. The three causal models consistent with the CI relation (A ⊥⊥ B∣C).
Figure 4. The natural causal model for the set of CI relations given in equation (6).
Consider the following joint distribution, which has the dependences described in equation (6):

P(A, B, C) = (1/4)[000] + (1/4)[010] + (1/4)[100] + (1/4)[111].  (7)

We can easily verify that

P(A, B) = ((1/2)[0] + (1/2)[1]) ((1/2)[0] + (1/2)[1]),

so that A and B are indeed marginally independent. We also have

P(A, C) = P(B, C) = (1/2)[00] + (1/4)[10] + (1/4)[11],

so that A and C are marginally dependent, as are B and C.
The natural explanation is achieved by assuming that the causal structure is as given in figure 4, and the
priors over the exogenous variables and the conditional probabilities for the endogenous variables are as follows:

P(A) = (1/2)[0] + (1/2)[1],
P(B) = (1/2)[0] + (1/2)[1],
P(C∣A, B) = [A·B],

where A·B denotes the product of the values of A and B. Thus in this causal model, A and B are each chosen
uniformly at random, and C is obtained as their product (equivalently, as the logical AND of A and B). One can
easily verify that P(A) P(B) P(C∣A, B) yields the distribution of equation (7).
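This verification is easy to mechanize. A minimal sketch (our encoding), using exact rational arithmetic to reproduce equation (7) from the figure-4 parameters:

```python
from itertools import product
from fractions import Fraction

# The 'natural' model of figure 4: A and B uniform and independent,
# C deterministically their product.
half = Fraction(1, 2)
P = {}
for a, b in product([0, 1], repeat=2):
    c = a * b                      # P(C | A, B) = [A.B]
    P[(a, b, c)] = half * half
print(P)   # {(0,0,0): 1/4, (0,1,0): 1/4, (1,0,0): 1/4, (1,1,1): 1/4} = eq. (7)
```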
The alternative explanation assumes the causal structure of figure 5, with parameters

P(C) = (3/4)[0] + (1/4)[1],
P(B∣C=0) = (2/3)[0] + (1/3)[1],
P(B∣C=1) = [1],
P(A∣B=0, C=0) = (1/2)[0] + (1/2)[1],
P(A∣B=1, C) = [C].

(We need not specify P(A∣B=0, C=1) because P(B=0, C=1) = 0.) The joint distribution one obtains is
again that of equation (7).
The difference between the two explanations becomes clear when we vary the parameters. If we change the
parameters in the first model, for instance to

P(A) = w[0] + (1−w)[1],
P(B) = w′[0] + (1−w′)[1],
P(C∣A, B) = w″[A·B] + (1−w″)[A⊕B],

where ⊕ denotes addition modulo 2, then the joint distribution is no longer of the form of equation (7), but it is
still true that A is independent of B, while A and C are dependent, and B and C are dependent. On the other hand,
modifications to the parameters in the second model do not preserve the pattern of dependences and
independences among A, B and C.
Figure 5. An unnatural causal model for the set of CI relations given in equation (6).
The first causal structure explains the pattern of statistical dependences and independences in a manner that
is robust to changes in the parameters of the causal model, whereas the second causal structure does not. Causal
discovery algorithms therefore favour the first model over the second.
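This contrast can be checked directly. In the sketch below (our parameterization; the extra parameter u, standing for the conditional P(A∣B=0, C=1) that the text leaves unspecified, is an assumption needed to make the second model well defined for all parameter values), both models reproduce equation (7) at the fine-tuned values, but only the first preserves the independence of A and B when its parameters are varied:

```python
from itertools import product

def indep_AB(P, tol=1e-9):
    """Marginal independence of A and B (the first two coordinates)."""
    pa, pb, pab = {}, {}, {}
    for (a, b, c), p in P.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
        pab[(a, b)] = pab.get((a, b), 0) + p
    return all(abs(pab.get((a, b), 0) - pa[a] * pb[b]) < tol
               for a in pa for b in pb)

def fig4(w=0.5, w1=0.5, w2=1.0):
    """Figure-4 model: P(A=0)=w, P(B=0)=w1, and
    P(C|A,B) = w2 [A.B] + (1-w2) [A xor B]."""
    P = {}
    for a, b, c in product([0, 1], repeat=3):
        pc = w2 * (c == a * b) + (1 - w2) * (c == (a ^ b))
        P[(a, b, c)] = (w if a == 0 else 1 - w) * \
                       (w1 if b == 0 else 1 - w1) * pc
    return P

def fig5(q=0.25, r=1/3, s=0.5, t=1.0, u=0.5):
    """Figure-5 model (C -> B, C -> A, B -> A): P(C=1)=q, P(B=1|C=0)=r,
    P(B=1|C=1)=t, P(A=1|B=0,C=0)=s, P(A=1|B=1,C=c)=c, and
    P(A=1|B=0,C=1)=u -- the last is our assumption, since the text
    leaves it unspecified (its context has probability zero by default).
    The defaults are the fine-tuned values given in the text."""
    P = {}
    for a, b, c in product([0, 1], repeat=3):
        pc = q if c == 1 else 1 - q
        pb1 = t if c == 1 else r
        pb = pb1 if b == 1 else 1 - pb1
        pa1 = c if b == 1 else (s if c == 0 else u)
        P[(a, b, c)] = pc * pb * (pa1 if a == 1 else 1 - pa1)
    return P

print(indep_AB(fig4()), indep_AB(fig5()))   # True True at fine-tuned values
print(indep_AB(fig4(w=0.3, w2=0.7)))        # True: independence is robust
print(indep_AB(fig5(q=0.3)))                # False: independence is broken
```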
In the example we have used, all of the variables in the causal model were observed variables. In general (and
especially in a quantum context), one might only observe a subset of the variables that are part of the causal
model. Even in this case, however, one should prefer those causal models wherein the CIs in the probability
distribution over the observed variables are stable to changes in the causal–statistical parameters.
This is the main assumption of the causal discovery algorithms, usually called faithfulness [2] or stability [1].
For a physicist, it is natural to call this an assumption of no fine-tuning. It is the key assumption in our analysis, so
we highlight it:
Faithfulness (no fine-tuning): The probability distribution induced by a causal model M (over the
variables in M or some subset thereof) is faithful (not fine-tuned) if its CIs continue to hold for
any variation of the causal–statistical parameters in M.
In other words, all CIs should be a consequence of the causal structure alone, not a result of the causal–
statistical parameters taking some particular set of values. If one assumes a uniform prior over the space of
causal–statistical parameters, then the parameter choices that can explain CI relations that are not implied by the
causal structure are found to have measure zero.
The second major assumption of CI-based causal discovery algorithms is an appeal to Occamʼs razor, an
assumption that one should favour the most simple or most minimal model that explains the statistics. Again, it
can be applied both for the case where the observed variables are all the variables in the causal model, or the case
where they are some subset thereof.
A causal model M will be said to simulate another causal model M′ on a set of variables V if for every choice of
causal–statistical parameters on M′, there is a choice of causal–statistical parameters on M such that M yields the
same distribution over V as M′ does. We can now define the assumption of minimality.
Minimality: Given two causal models M and M′ that induce a given probability distribution over
a set of observed variables V_O (in general a subset of the variables postulated by each causal
model), if M′ can simulate M on V_O but M cannot simulate M′ on V_O, then M is preferred to M′
as a causal explanation of the probability distribution over V_O.
At first sight, it might seem odd to prefer M over M′ given that M is consistent with fewer distributions over V_O
than M′ is. But the fact that M can explain less than M′ implies that M is more falsifiable than M′, and in the
version of Occamʼs razor espoused by CI-based causal discovery algorithms, the degree of falsifiability is the
figure of merit that one seeks to optimize. More falsifiable theories are to be preferred because, in Pearlʼs words,
‘they provide the scientist with less opportunities to overfit the data ‘hindsightedly’ and therefore command
greater credibility if a fit is found’ ([1], p 49). It follows that a causal model is deemed most simple if it has the
least expressive power, while still doing justice to the observed probability distribution. Note that M might be
preferred to M′ as a causal explanation of the probability distribution over V_O even though M may require more
latent variables and/or more causal arrows than M′; ‘the preference for simplicity [...] is gauged by the expressive
power of a structure, not by its syntactic description’ ([1], p 46). We will see some examples of the consequences
of the assumption of minimality shortly.
It is worth remembering that causal discovery algorithms are fallible. They are best considered a heuristic, an
inference to the best explanation. Indeed, Pearl likens the faithfulness assumption in causal discovery to the
following kind of inference: you see a chair before you and infer that there is a single chair rather than two chairs
positioned such that the one hides the other ([1], p 48). The task of causal discovery can be understood as ‘an
inductive game that scientists play against Nature’([1], p 42).
3.1. Example of causal discovery assuming no latent variables
Variables that are not observed but which are causally relevant are called latent variables, or hidden variables. In
this section, we assume that the observed variables are the only causally relevant variables, i.e. that there are no
hidden variables. We look at a particular example of how faithfulness can help to determine candidate causal
structures from a pattern of dependences in this case. The scheme is equivalent to the one introduced by
Wermuth and Lauritzen [12].
Suppose one is interested in answering the question ‘Does smoking cause lung cancer?’ For each member of a
population of individuals, the value of a variable S is known, indicating whether the individual smoked or not,
and the value of a variable C is known, indicating whether they developed cancer or not. Suppose a correlation
between S and C is observed. Furthermore, suppose that one also has access to a third variable T, indicating
whether the individual had tar in their lungs or not, and suppose that it is found that S and C are conditionally
independent given T. In other words, after conditioning on whether or not there is tar in the lungs, smoking and
lung cancer are no longer correlated. Finally, imagine that these three variables are assumed to be the only
causally relevant ones (we will consider the alternative to this assumption further on). What causal structure is
natural given the observed CI relation? Because we wish to make it very clear how these algorithms work, we will
not simply specify what causal structure they return. Instead, we will look ‘under the hood’ of these algorithms.
We begin by considering every possible hypothesis about the causal ordering. A causal ordering of variables
is an ordering wherein causal influences can only propagate from one variable to another if the second is higher
in the order than the first.
For instance, consider the causal ordering S < T < C. The most general causal structure consistent with
such an ordering is given in figure 6. To get a causal model, we need to supplement this with conditional
probabilities of every variable given its parents, that is, P(S), P(T∣S), and P(C∣T, S). The joint distribution
that this model defines is simply

P(S, T, C) = P(S) P(T∣S) P(C∣T, S).

Given that any distribution can be decomposed in this form, by choosing the conditional probabilities
appropriately we can model any joint distribution P(S, T, C).
But now we make use of the additional
information we have about the joint distribution, namely that (S ⊥⊥ C∣T). This implies that we can take the
parameters in the causal model to be such that P(C∣T, S) = P(C∣T), so that the joint distribution can be
written as

P(S, T, C) = P(S) P(T∣S) P(C∣T),

and, by the assumption of minimality, we drop the causal arrow from S to C, so that the underlying causal
structure is simply given by figure 7.
This simplified causal structure cannot generate an arbitrary probability distribution, but it can generate one
wherein (S ⊥⊥ C∣T). It is a candidate for the true causal structure.
One then simply repeats this procedure for every possible choice of the causal ordering. For instance, for the
ordering C < T < S, the most general causal structure is the one shown in figure 8. The decomposition of the
joint probability corresponding to this causal structure is

P(S, T, C) = P(C) P(T∣C) P(S∣C, T),

but the constraint (S ⊥⊥ C∣T) implies that one can substitute P(S∣C, T) = P(S∣T) in the causal model.
Therefore, by the assumption of minimality, we drop the causal arrow from C to S, yielding a causal structure of
the form given in figure 9. So this is another possible causal structure.
Sometimes different causal orderings lead to the same causal structure; for instance, the orderings
T < S < C and T < C < S both yield the structure given in figure 10.
Other causal orderings, such as S < C < T and C < S < T, are such that the CI constraint does not lead to
any simplification of the causal structure. For instance, for S < C < T, the joint distribution decomposes as

P(S, T, C) = P(S) P(C∣S) P(T∣C, S),

and none of the terms on the right-hand side can be simplified by (S ⊥⊥ C∣T). These two orderings lead to the
two causal structures in figure 11.
Figure 6. The most general DAG for the causal ordering S < T < C.
Figure 7. DAG that captures (S ⊥⊥ C∣T) for the causal ordering S < T < C.
Therefore, in this example, the six possible causal orderings have led to five candidates for the causal structure,
depicted in figures 7, 9, 10 and 11. However, the two causal structures shown in figure 11 do not satisfy
faithfulness, so only the other three are viable.
Suppose finally that in addition to the information about CI, one has information which rules out certain
causal orderings. For instance, in the example we are considering, suppose one has the additional information
that tar in the lungs always appears after a person has smoked, never before. It is then reasonable to rule out any
causal structure that has T < S. This rules out figures 9 and 10. At the end, the only candidate causal structure
which is left is the one given in figure 7, which says that smoking causes tar in the lungs, which causes lung cancer.
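The procedure just walked through is compact enough to sketch in code. The following (our implementation, in the spirit of the scheme of Wermuth and Lauritzen; the helper `independent` matches only the promised relation (S ⊥⊥ C∣T) itself, whereas a full implementation would also consult its semi-graphoid consequences) loops over the six causal orderings, prunes every arrow licensed by the CI relation, and recovers the five candidate structures of figures 7, 9, 10 and 11:

```python
from itertools import permutations

# For each causal ordering, start from the complete DAG along that
# ordering and delete the arrow X -> Y whenever the promised CI
# relations license it, i.e. whenever (Y _||_ X | other predecessors
# of Y).  The only promised relation (up to symmetry) is (S _||_ C | T).
CI = {frozenset({"S", "C"}): frozenset({"T"})}   # pair -> conditioning set

def independent(x, y, given):
    return CI.get(frozenset({x, y})) == frozenset(given)

structures = set()
for order in permutations("STC"):
    edges = set()
    for j, y in enumerate(order):
        for x in order[:j]:
            rest = [v for v in order[:j] if v != x]
            if not independent(x, y, rest):
                edges.add((x, y))
    structures.add(frozenset(edges))

for s in sorted(structures, key=len):
    print(sorted(s))   # five distinct structures: figures 7, 9, 10 and 11
```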
Of course, it needn’t be the case that these observed variables are the only ones that are causally relevant. For
instance, there might be an unobserved genetic factor that predisposes people both to smoke and to develop lung
cancer. Indeed, tobacco companies were quick to point out the possibility of explaining the observed correlation
between smoking and cancer in terms of such a genetic factor. So it is useful also to have causal discovery
algorithms that allow for latent variables.
Before moving on to algorithms that posit latent variables, we pause to note that the algorithm described
here is proven to be correct in the sense that if there exists a set of causal structures that are minimal and faithful
to the observed correlations, then the algorithm will return these structures [12].
More efficient versions of this algorithm are described elsewhere, for instance, the inductive causation (IC)
algorithm described in Pearl [1], which is equivalent to the SGS algorithm of Spirtes et al [2]. There have also
been many proposals to further improve the efficiency of these algorithms (see [1] and [2] for details). These
algorithms have been proven to be correct in the sense that if there exist causal models that are minimal and
faithful, then the algorithms will return them.
Figure 8. The most general DAG for the causal ordering C < T < S.
Figure 9. DAG that captures (S ⊥⊥ C∣T) for the causal ordering C < T < S.
Figure 10. DAG that captures (S ⊥⊥ C∣T) for the causal orderings T < S < C and T < C < S.
Figure 11. DAGs that capture (S ⊥⊥ C∣T) for the causal orderings S < C < T (11(a)) and C < S < T (11(b)).
3.2. Example of causal discovery allowing for latent variables
Causal discovery in the case where one allows latent variables is more complicated. We begin by considering
some of the consequences of the assumption of minimality for causal models with latent variables.
First of all, it is clear that one needn’t consider any causal models wherein a latent variable mediates a relation
between two observed variables, because the set of distributions over the observed variables that can be explained
by such a model is no greater than the set that can be explained by simply postulating a direct causal influence
between the observed variables. Similarly, positing a latent variable that is a common effect of the observed
variables does not change the distributions that can be supported on the observed variables. Latent variables have
nontrivial consequences for the observed distribution only when they act as common causes of the observed
variables.
Consider the following suggestion for a causal explanation of the correlations among a set of observed
variables: there are no causal influences among any of the observed variables, but there is a single latent variable
that has a causal influence on each of them. By choosing the latent variable to take as many values as there are
valuations of the observed variables, one can explain any correlation among the observed variables in this way.
However, if there exists another causal model that can only reproduce a smaller set of possible correlations, while
reproducing the observed correlations, then the principle of minimality dictates that we should prefer the latter.
Of course, one could imagine that further investigations (involving interventions, for instance) might vindicate
the explanation that is less falsifiable over the one that is more falsifiable. This simply is another reminder that
causal discovery algorithms are not infallible—they are heuristics for identifying the most plausible causal
explanations given the evidence.
Now we come to the most subtle part of the causal discovery algorithms that posit latent variables. There is a
difference between applying the criterion of minimality among a set of causal structures that are consistent with
a given distribution over the observed variables and applying the criterion of minimality among a set of causal
structures that are consistent with a given set of CI relations over the observed variables. As we’ve mentioned
before, the algorithms described in [1] and [2] look only at the CI relations and consequently they follow the
latter course. This choice is a significant shortcoming of many prominent causal discovery algorithms, but we
will defer this criticism until the end of this section.
For the moment, we simply explain the consequences of this choice. To do so, it is useful to divide the causal
structures that are consistent with a given distribution over a set of observed variables into two sorts. The first
kind is such that all the latent variables it posits are common causes for at most two of the observed variables.
We’ll say that such a causal structure is limited to pairwise common causes. The other kind is unrestricted, so that
more than two observed variables can be directly influenced by a single latent variable.
It is possible to show [13] that for a given set of CI relations among a set of observed variables, if a causal
model M generates those CI relations faithfully (that is, as a consequence of the causal structure, rather than the
causal–statistical parameters), then there is another causal model M′ that achieves the same CI relations
faithfully but which is limited to pairwise common causes. The assumption of minimality makes M′ preferred
to M.
Therefore, if one is only applying the criterion of minimality among a set of causal structures that are
consistent with the CI relations among the observed variables, then one need only look among causal models
that incorporate pairwise common causes. This is precisely what the standard causal discovery algorithms do.
There is a simplified graphical language for representing the set of causal structures that can be output by
these algorithms. Rather than using a DAG that includes both the latent and the observed variables in the causal
structure, one uses a graph which only includes the observed variables as nodes but uses a larger variety of edges
among these nodes to specify the causal relation that might hold among the associated variables. For instance, a
double-headed arrow between variables X and Y signifies that there is a common cause of X and Y (figure 12). An
arrow that has a circle rather than an arrowhead at one end represents either a common cause or a direct causal
influence or both (figure 13). Finally, an undirected edge with a circle at both its head and its tail represents any
of the five possible ways in which a pair of variables might be related (figure 14). In this way, a set of causal
Figure 12. The interpretation of a bidirected edge in terms of a DAG.
structures that include latent variables can be summarized in a single graph. Following Pearl, we call such graphs
patterns.^10
In order to infer the causal structures with only pairwise common causes that are consistent with a given
pattern, it is not sufficient to simply substitute for every undirected edge (or bi-directed edge or directed edge
with decorated tail) all the possibilities consistent with that edge, as enumerated in figures 12, 13 and 14. One
must eliminate some of the combinations. The definition of a v-structure in a DAG is a head-to-head collision of
two arrows on a node such that the parents do not exert any direct causal influence on one another. The
prescription for finding all the DAGs consistent with a pattern is to consider all the combinations of possibilities
that do not create a new v-structure.
The IC* algorithm described in Pearl [1] (which is equivalent to the causal inference algorithm described in
SGS [2]) takes CI relations as input and returns a pattern. This algorithm is correct in the sense that if there exist
causal structures with only pairwise common causes that are faithful to the observed CI relations, then the
algorithm will return the minimal structures within this set.^11 We will not review the details of the algorithm
here, but we will apply it to a simple example to get a feeling for how it works.
Consider the smoking example again, where the observed variables S, T and C are found to satisfy
(S ⊥⊥ C∣T). The pattern returned by the IC* algorithm in this case is shown in figure 15.
For each undirected edge in this pattern, there are five possibilities in the DAG for what connection holds
between the nodes, as displayed in figure 14. In figure 16 we display all 25 combinations of such possibilities. We
have also shaded out each of the combinations that introduces a new v-structure—these combinations are not
candidates for the causal structure according to the IC* algorithm. Hence, the nine causal structures that remain
are the candidates returned by IC*.
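This count can be reproduced mechanically. In the sketch below (the labels for the five edge types of figure 14 are ours), an option is flagged when it places an arrowhead at T, and a combination is rejected exactly when both of its edges do, since two colliding arrowheads would create a new v-structure on T:

```python
from itertools import product

# Unpacking the pattern of figure 15: each of the two undirected edges
# (S o-o T and T o-o C) stands for one of the five possibilities of
# figure 14.  The dict records, for each edge type, whether it places
# an arrowhead at T.
arrowhead_at_T = {
    "cause into T":              True,   # e.g. S -> T
    "cause out of T":            False,  # e.g. T -> S
    "common cause":              True,   # latent -> S and latent -> T
    "cause into T + common":     True,
    "cause out of T + common":   True,   # the latent still points at T
}

survivors = [(left, right)
             for left, right in product(arrowhead_at_T, repeat=2)
             if not (arrowhead_at_T[left] and arrowhead_at_T[right])]
print(len(survivors))   # 9 of the 25 combinations survive, as in figure 16
```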
How does this answer embody the principles of causal discovery? First, the fact that one unpacks the pattern
into causal structures with only pairwise common causes is a consequence of the minimality assumption, as we
discussed at the beginning of this section. This is the reason that we do not find in the output of the algorithm any
latent variable that is a common cause of all three variables S, T and C.
Figure 13. The interpretation of a directed edge with a circle at its tail in terms of DAGs.
Figure 14. The interpretation of an undirected edge with circles at head and tail in terms of DAGs.
Figure 15. Output pattern of the IC* algorithm for input (S ⊥⊥ C∣T).
^10 More precisely, the analogues of the particular graphs we consider here are Pearlʼs ‘marked patterns’. These have also been called ‘partially
oriented inducing path graphs’ in SGS. We will follow SGSʼs notational convention rather than Pearlʼs when drawing such graphs.
^11 Note, however, that the existence of a causal structure that reproduces the CI relations does not guarantee the existence of one that
reproduces the observed distribution, as we will see at the end of this section. In this sense, the algorithm may still fail to return a valid causal
explanation of the observed distribution.
Now consider the question of why there is neither a direct causal influence between S and C nor a latent
variable that acts as a common cause for the pair. The answer is simply that if either of these sorts of influences
were acting, then we would not find (S ⊥⊥ C∣T); learning S would teach us something about C even though T is
known. In the context of our example, this eliminates the possibility put forward by the tobacco companies of a
hypothetical genetic factor that both predisposes people to smoke and to get lung cancer.
We need not consider the cases where there is also no connection between S and T, nor the cases where there
is also no connection between T and C, because by assumption (S ⊥⊥ C∣T) is the only CI relation, and therefore
¬(S ⊥⊥ T) and ¬(T ⊥⊥ C).
It follows that the 25 structures displayed in figure 16 are the only possibilities that remain among all possible
causal structures with pairwise common causes. So, to explain why the output of the algorithm is justified, we
need only explain why one should eliminate those that introduce a new v-structure. First note that if one
conditions on a variable that is the common effect of two other variables, then we expect a dependence between
those variables (for instance, in digital logic, knowing that the output of an AND gate is 0 implies that the two
inputs cannot both be 1). Therefore for each causal structure that includes a v-structure on T, we would expect
that conditioning on T induces a dependence between the roots of the v-structure, and because one of these
roots is always correlated with S and the other with C, this would imply a dependence between S and C,
contradicting the fact that (S ⊥⊥ C∣T). Alternatively, we can infer that a causal structure including a v-structure on
T contradicts the relation (S ⊥⊥ C∣T) using the d-separation criterion.
What does this imply about whether smoking causes lung cancer? Suppose that we make use of the same
additional information as we considered in section 3.1, namely, that tar in the lungs is always found to occur after
smoking, never before. We can then eliminate all causal structures with an arrow from T to S. What remains are
the three options in figure 17. They are: (i) smoking causes tar in the lungs, which causes cancer, (ii) there is a
latent variable that is a common cause of smoking and having tar in the lungs, and tar in the lungs causes cancer,
and (iii) both mechanisms are in play. If option (ii) holds, then smoking is not a cause of cancer and, unlike the
hypothesis of a genetic factor that predisposes people both to smoke and to develop lung cancer, it is consistent
with the observation that tar screens off smoking from cancer. Of course, this hypothesis remains implausible if
one cannot identify (or imagine) any factor that screens off smoking from tar in the lungs.
Figure 16. The causal structures returned by the IC* algorithm when the input is a distribution over observed variables S, T and C with (S ⊥⊥ C ∣ T). Those that introduce a new v-structure are shaded out.
Figure 17. The causal structures that remain if the ordering S < T is assumed.
We previously highlighted the fact that the causal discovery algorithms of [1] and [2] apply the principle of
minimality within the set of causal structures that are consistent with the CI relations in the observed
distribution, not within the set of those that are consistent with the observed distribution itself. This can be a
problem because these two sets of causal structures can be different [13].
It is best to illustrate this with an example. Consider the case of a triple of observed variables, X, Y and Z. We will compare two causal models. The first posits a latent variable λ which has a direct causal influence on all three observed variables. The second posits three latent variables, λ, μ and ν, each of which has a direct causal influence on a distinct pair of observed variables¹². The two models are illustrated in figure 18.
The two structures imply precisely the same set of CI relations among the observed variables, namely, the null set. However, there are distributions over the triple of observed variables that are only consistent with the first model and not the second. For instance, a joint distribution wherein the three observed variables X, Y and Z are close to perfectly correlated¹³ cannot be generated from the second causal structure for any choice of causal parameters [15, 16]. Therefore, if this is the distribution one has observed, then the second causal structure is not a candidate for the underlying causal model. However, the CI relations one observes for such a distribution are consistent with the second causal structure. So if the input to oneʼs causal discovery algorithm is limited to these relations, then the algorithm can return a causal structure that is inconsistent with the observed distribution. Indeed, because the first causal structure can simulate the second, the principle of minimality would naturally lead one to prefer the second, even though it is inconsistent with the observed distribution.
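A small simulation makes the point vivid (a sketch under our own assumption of 1% independent noise on each observed variable): sampling from the first structure yields three nearly perfectly correlated variables among which no CI relation holds, and that null set of CIs is all a CI-based algorithm ever sees.

```python
import random

random.seed(1)

def sample():
    lam = random.randint(0, 1)                          # single latent common cause
    noisy = lambda bit: bit ^ (random.random() < 0.01)  # 1% independent noise
    return noisy(lam), noisy(lam), noisy(lam)           # X, Y, Z

data = [sample() for _ in range(200_000)]

def p(event, cond=lambda s: True):
    sel = [s for s in data if cond(s)]
    return sum(event(s) for s in sel) / len(sel)

print(round(p(lambda s: s[0] == s[1]), 3))   # X and Y agree ~98% of the time

# Yet X and Y remain dependent given Z, so the set of CI relations is null:
print(round(p(lambda s: s[0] == 1, lambda s: s[2] == 0), 3))                 # ~0.02
print(round(p(lambda s: s[0] == 1, lambda s: s[2] == 0 and s[1] == 1), 3))   # ~0.5
```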
We will see that this deficiency of CI-based causal discovery algorithms becomes manifest when one applies
them to correlations that violate a Bell inequality.
4. Applying causal discovery algorithms to quantum correlations
We now turn to the question of what these algorithms tell us about quantum correlations. We consider only Bell-type experiments involving two systems, two possible settings for each measurement and two possible outcomes for each measurement. Let S and T be the binary variables that specify which measurement was performed on the left and right wings of the experiment respectively, and let A and B be the binary variables that specify the outcomes of the measurements on the left and right wings respectively.
Bellʼs theorem derives constraints on P(A, B ∣ S, T) from assumptions about the causal structure [17]. These assumptions—which Bell justified by appeal to the space-like separation of the two wings of the experiment and the impossibility of superluminal causal influences—are that A is the joint effect of the setting variable S and a common cause variable λ, while B is the joint effect of the setting variable T and λ. The causal structure corresponding to this assumption is presented in figure 19.
This structure implies the following CI relations,

(A ⊥⊥ BT ∣ Sλ) and (B ⊥⊥ AS ∣ Tλ).

Bell called his assumption local causality and formalized it in terms of these CIs. These in turn imply that P(A, B ∣ S, T, λ) = P(A ∣ S, λ) P(B ∣ T, λ), which is known as factorizability. From this condition, together with the assumption that there are no correlations between the settings and the hidden variables, (S ⊥⊥ Tλ) and (T ⊥⊥ Sλ), one can infer that P(A, B ∣ S, T) must satisfy the Bell inequalities [17, 18]. Bellʼs assumption about the causal structure also implies no superluminal signalling: (A ⊥⊥ T ∣ S) and (B ⊥⊥ S ∣ T), the conditions of equation (8).
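The local bound itself is easy to verify by brute force. Under factorizability with setting-independent λ, the correlations are mixtures of deterministic strategies a(s) and b(t), so the CHSH expression E(0,0) + E(0,1) + E(1,0) − E(1,1) is maximized at a deterministic vertex. A sketch of our own enumerating those vertices:

```python
from itertools import product

# E(s,t) = <(-1)^(a(s) XOR b(t))> for deterministic strategies a, b
signs = {(0, 0): +1, (0, 1): +1, (1, 0): +1, (1, 1): -1}

best = max(
    sum(sign * (-1) ** (a[s] ^ b[t]) for (s, t), sign in signs.items())
    for a in product([0, 1], repeat=2)    # left strategy: a = (a(0), a(1))
    for b in product([0, 1], repeat=2)    # right strategy: b = (b(0), b(1))
)
print(best)  # 2 -- the CHSH bound; the quantum correlations below reach 2*sqrt(2)
```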
Figure 18. Two candidate causal structures for explaining correlations between X, Y and Z using latent variables.
¹² This causal scenario has also been considered in the context of a discussion of quantum correlations in [14, 15].
¹³ We cannot take the case where they are perfectly correlated because we want our example to be of a distribution that is faithful to the first causal structure, and perfect correlation would imply that any two variables are conditionally independent given the third.
The fact that quantum correlations can violate Bell inequalities shows that they cannot be explained using the
causal structure of figure 19.
We will now consider the inverse problem to the one considered by Bell. Rather than attempting to infer
constraints on correlations from assumptions about the causal structure, we will attempt to infer conclusions
about possible causal structures from the nature of the correlations that arise in quantum theory. This is the sort
of problem that the causal discovery algorithms were designed to solve.
We will contrast two examples of quantum correlations: one which violates the Bell inequalities and the
other which satisfies the Bell inequalities.
For the latter, we will take a version of the Einstein–Podolsky–Rosen (EPR) experiment [19] in terms of qubits (first proposed by Bohm for spin-1/2 systems [20]). The pair are prepared in the maximally entangled state

|Ψ⟩ = (1/√2)(|+z⟩|+z⟩ + |−z⟩|−z⟩),  (9)

where |±z⟩ are the eigenstates of spin along the ẑ axis. On each wing, the two choices of measurement are between a pair of mutually unbiased bases (the same pair for each wing). For instance, we may measure spin along the ẑ or x̂ axes, as illustrated in figure 20. In this case, if the same measurement is made on both wings (both ẑ or both x̂), one sees perfect correlation between the outcomes, while if different measurements are made (ẑ on one and x̂ on the other), then one sees no correlation between the outcomes. It is well known that these sorts of correlations do not violate any Bell inequality, which is to say that they can be explained by a locally causal model.
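These statistics follow from the Born rule; the numpy sketch below (our own illustration, not code from the paper) computes P(A = B ∣ S, T) for the state of equation (9):

```python
import numpy as np

ket_pz, ket_mz = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # |+z>, |-z>
psi = (np.kron(ket_pz, ket_pz) + np.kron(ket_mz, ket_mz)) / np.sqrt(2)

Z = np.array([[1.0, 0.0], [0.0, -1.0]])
X = np.array([[0.0, 1.0], [1.0, 0.0]])
obs = [Z, X]   # setting 0: spin along z; setting 1: spin along x

def p_equal(s, t):
    """P(A = B | S = s, T = t) for projective spin measurements."""
    return sum(
        psi @ np.kron((np.eye(2) + a * obs[s]) / 2,
                      (np.eye(2) + a * obs[t]) / 2) @ psi
        for a in (+1, -1)
    )

for s in (0, 1):
    for t in (0, 1):
        print(s, t, round(p_equal(s, t), 3))
# same axis -> 1.0 (perfect correlation); different axes -> 0.5 (no correlation)
```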
The other sort of correlation we consider will be those exhibited in the Clauser–Horne–Shimony–Holt (CHSH) experiment [18]. We can take the pair of spins to be prepared in the same maximally entangled state |Ψ⟩ as for the EPR scenario, and the pair of measurements on the left wing to also be of spin along the ẑ or x̂ axes. However, on the right wing, the pair of possible measurements are of spin along the (ẑ + x̂)/√2 axis or along the (ẑ − x̂)/√2 axis, as indicated in figure 21. In this case, one finds that the probability of correlation for the cases (S, T) = (0, 0), (1, 0) and (0, 1) is equal to the probability of anticorrelation for the case (S, T) = (1, 1), and has the value 1/2 + 1/(2√2) ≃ 0.85.
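The same Born-rule computation with the right wingʼs measurement axes rotated by 45° reproduces the value quoted above (again a sketch of our own):

```python
import numpy as np

ket_pz, ket_mz = np.array([1.0, 0.0]), np.array([0.0, 1.0])
psi = (np.kron(ket_pz, ket_pz) + np.kron(ket_mz, ket_mz)) / np.sqrt(2)

Z = np.array([[1.0, 0.0], [0.0, -1.0]])
X = np.array([[0.0, 1.0], [1.0, 0.0]])
left = [Z, X]
right = [(Z + X) / np.sqrt(2), (Z - X) / np.sqrt(2)]

def p_equal(s, t):
    """P(A = B | S = s, T = t) for the CHSH measurement axes."""
    return sum(
        psi @ np.kron((np.eye(2) + a * left[s]) / 2,
                      (np.eye(2) + a * right[t]) / 2) @ psi
        for a in (+1, -1)
    )

for s, t in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(s, t, round(p_equal(s, t), 4))
# (0,0), (0,1), (1,0): 0.8536 = 1/2 + 1/(2*sqrt(2));
# (1,1): 0.1464, i.e. anticorrelation with probability 0.8536
```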
In both the EPR and CHSH scenarios, we assume that the settings S and T are sampled independently.
The input to the standard causal discovery algorithms is limited to CI relations, so we begin by computing
the CIs that hold for the EPR and CHSH experiments. Rather than specifying an exhaustive list, we provide a
generating set (the rest can be obtained by applying the semi-graphoid axioms). They are:
Figure 19. The causal structure corresponding to Bellʼs notion of local causality.
Figure 20. Measurement axes for generating EPR correlations given the quantum state |Ψ⟩ of equation (9).
EPR: (S ⊥⊥ T), (A ⊥⊥ T ∣ S), (B ⊥⊥ S ∣ T), (A ⊥⊥ S), (B ⊥⊥ T),
CHSH: (S ⊥⊥ T), (A ⊥⊥ T ∣ S), (B ⊥⊥ S ∣ T), (A ⊥⊥ S), (B ⊥⊥ T).
Consider the conditions (A ⊥⊥ S) and (B ⊥⊥ T). These assert that the outcome on a wing is independent of the setting on that wing. While true, this independence is not representative of the causal structure. Indeed, it only holds because of the degeneracy of the Schmidt coefficients in the maximally entangled state. If we instead consider the state

|Ψ⟩ = √p |+z⟩|+z⟩ + √(1 − p) |−z⟩|−z⟩,  (10)

where p ≠ 1/2, then ¬(A ⊥⊥ S) and ¬(B ⊥⊥ T). Because it is intuitively clear that the choice of measurement does have a causal influence on the outcome, the independences (A ⊥⊥ S) and (B ⊥⊥ T) are pathological in the context of the causal discovery algorithms. Given that if the EPR (CHSH) experiment is implemented with a state that is close to maximally entangled, it still satisfies (violates) the Bell inequalities, we consider these states instead. (If one likes, p may be taken to be arbitrarily close to 1/2.) We then get the following generating sets of independence relations,
EPR: (S ⊥⊥ T), (A ⊥⊥ T ∣ S), (B ⊥⊥ S ∣ T),
CHSH: (S ⊥⊥ T), (A ⊥⊥ T ∣ S), (B ⊥⊥ S ∣ T),

where (S ⊥⊥ T) asserts the independence of the settings, and (A ⊥⊥ T ∣ S) and (B ⊥⊥ S ∣ T) are the no-signalling conditions (equation (8)).
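To see the role played by the degeneracy of the Schmidt coefficients, the following sketch evaluates the left-wing outcome marginal for the state of equation (10), with the illustrative choice p = 0.6:

```python
import numpy as np

p = 0.6   # any p != 1/2 breaks the pathological independence (A ⊥⊥ S)
ket_pz, ket_mz = np.array([1.0, 0.0]), np.array([0.0, 1.0])
psi = (np.sqrt(p) * np.kron(ket_pz, ket_pz)
       + np.sqrt(1 - p) * np.kron(ket_mz, ket_mz))

Z = np.array([[1.0, 0.0], [0.0, -1.0]])
X = np.array([[0.0, 1.0], [1.0, 0.0]])

for name, M in [("S = z", Z), ("S = x", X)]:
    proj = (np.eye(2) + M) / 2                     # projector onto outcome +1
    print(name, round(psi @ np.kron(proj, np.eye(2)) @ psi, 3))
# S = z: 0.6,  S = x: 0.5 -- the outcome marginal now depends on the setting
```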
The critical point is that the set of independences is the same for the EPR and the CHSH experiments. Since
the input to the causal discovery algorithms that we consider is limited to CI relations, it follows that whatever
causal conclusions these algorithms draw, they will draw the same causal conclusions about the EPR experiment
as they do about the CHSH experiment. And yet, from the fact that the EPR correlations satisfy the Bell
inequalities, we know that they can be explained by local causes while from the fact that the CHSH correlations
violate a Bell inequality, we know that they cannot be so explained.
So the conclusion is that CI-based causal discovery algorithms do not do justice to Bellʼs theorem.
Independences simply do not provide enough information. One needs a causal discovery algorithm that looks at
the strength of correlations to reproduce Bellʼs conclusion.
Despite the inability of the standard causal discovery algorithms to distinguish correlations that violate the
Bell inequalities from those that satisfy them, it is nonetheless interesting to see what happens when one applies
the algorithms to the set of independences we found for the EPR and CHSH experiments. We will refer to these
as nontrivial no-signalling correlations (‘nontrivial’ in the sense that there is some nonvanishing correlation
between the outcomes for some choices of the settings).
In applying the causal discovery algorithms, we will assume for the moment that the setting variable on one wing is a cause of the outcome variable on that wing, that is, we will assume that S is a cause of A and that T is a cause of B. This assumption will be relaxed in section 5. In this case, the assumption that there are no causal cycles then implies that there can be no causal influence from A to S, nor from B to T. Nonetheless, we are still permitting influences from the outcome on one wing to the setting on the other, although, as we will see, the causal discovery algorithms will rule against such influences.
4.1. No latent variables
It is instructive to consider the causal structure that arises for a single representative causal ordering of the variables. We take S < T < A < B. Then, the most general causal structure is illustrated in figure 22. Hence the most general joint distribution for this ordering is of the form

P(S, T, A, B) = P(S) P(T ∣ S) P(A ∣ S, T) P(B ∣ S, T, A).
Figure 21. Measurement axes for generating CHSH correlations given the quantum state |Ψ⟩ of equation (9).
The independence (S ⊥⊥ T) implies that P(T ∣ S) = P(T), and the independence (A ⊥⊥ T ∣ S) implies that P(A ∣ S, T) = P(A ∣ S). The independence (B ⊥⊥ S ∣ T) has no nontrivial implications for this causal ordering, hence the term P(B ∣ S, T, A) cannot be simplified. From these CI relations it follows that the joint distribution can be written as

P(S, T, A, B) = P(S) P(T) P(A ∣ S) P(B ∣ S, T, A),

which corresponds to the causal structure in figure 23(a). If we change the ordering of variables so that B precedes A, then by a similar argument, we obtain the causal structure in figure 23(b). For every other possible causal ordering consistent with our assumption that S < A and T < B, we also obtain one of the causal structures of figure 23.
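This step-by-step simplification is an instance of a general recipe (the construction underlying the Wermuth–Lauritzen algorithm mentioned below): for a fixed ordering, the parents of each variable are a minimal subset of its predecessors that screens it off from the remaining predecessors. A sketch, assuming a hypothetical CI oracle is_ci(var, rest, given):

```python
from itertools import combinations

def parents(var, predecessors, is_ci):
    """Smallest subset P of `predecessors` such that
    (var ⊥⊥ predecessors \\ P | P); `is_ci` is a hypothetical oracle
    answering CI queries about the observed distribution."""
    for size in range(len(predecessors) + 1):
        for P in combinations(predecessors, size):
            rest = [v for v in predecessors if v not in P]
            if not rest or is_ci(var, rest, list(P)):
                return list(P)

# For the no-signalling CIs with ordering S < T < A < B:
#   parents('T', ['S'], ...)           -> []              since (S ⊥⊥ T)
#   parents('A', ['S', 'T'], ...)      -> ['S']           since (A ⊥⊥ T | S)
#   parents('B', ['S', 'T', 'A'], ...) -> ['S', 'T', 'A'] since no CI applies
```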
Consider the causal structure in figure 23(a). Although it faithfully captures (S ⊥⊥ T) and (A ⊥⊥ T ∣ S), it does not faithfully capture (B ⊥⊥ S ∣ T). The only way to explain the independence (B ⊥⊥ S ∣ T) within this causal model is by fine-tuning of the causal parameters in the model, for instance, if the parameters defining P(B ∣ S, T, A) are not independent of those defining P(A ∣ S). A similar problem arises for the causal structure in figure 23(b). It follows that in the case of no latent variables, no causal structure can satisfy faithfulness for the CIs of nontrivial no-signalling correlations.
Note that if, instead of applying the Wermuth–Lauritzen algorithm to the nontrivial no-signalling
correlations, one applies the IC algorithm [1], equivalently the SGS algorithm [2] (which also assume no latent
variables), one finds that it returns a graph that is not a valid pattern, signalling a failure of the algorithm. This is
what one would expect given that the algorithm only promises to return a valid causal structure if there exists one
that satisfies faithfulness, and in this case, there is not.
There is an interesting lesson here for the foundations of quantum theory. Long before Bellʼs work, Einstein had pointed out that if one did not assume hidden variables, then one could only explain the EPR correlations by positing superluminal causes. This argument was made in his comments at the 1927 Solvay conference [21] (see [22] and [9] for more concerning Einsteinʼs arguments on completeness and locality). One can easily cast Einsteinʼs argument into the mold of causal discovery algorithms as follows. If we allow the quantum state ψ, considered as a classical variable, as the only common cause, then the assumption of no superluminal causal influences implies that P(A, B ∣ S, T, ψ) = P(A ∣ S, ψ) P(B ∣ T, ψ), and given that ψ is fixed in the experiment (it is a variable which only takes one possible value), this implies that A and B should be uncorrelated, in contradiction with the EPR correlations.
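In symbols (a restatement of Einsteinʼs step in the notation above): since ψ is fixed at a single value ψ₀,

P(A, B ∣ S, T) = P(A ∣ S, ψ₀) P(B ∣ T, ψ₀) = P(A ∣ S) P(B ∣ T),

that is, (A ⊥⊥ B ∣ S, T), whereas the EPR correlations exhibit perfect correlation between A and B whenever the same measurement is performed on the two wings.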
But Einstein failed to explicitly note another mysterious feature of the EPR correlations, which our analysis
highlights: even if one is willing to countenance superluminal causal influences in an attempt to explain the EPR
correlations without recourse to hidden variables, ensuring that these superluminal causes cannot be used to
send superluminal signals implies that there must be fine-tuning in the underlying causal model.
4.2. Latent variables allowed
If one simply inputs the independences of nontrivial no-signalling correlations into the IC* algorithm of [1],
which allows latent variables, one obtains the pattern illustrated in figure 24 as output.
Figure 22. The most general causal structure for the causal ordering S < T < A < B, assuming no hidden variables.
Figure 23. Possible causal structures for no-signalling correlations, assuming no hidden variables, for causal orderings S < T < A < B, T < S < A < B, S < A < T < B (a) and S < T < B < A, T < S < B < A, T < B < S < A (b).
Recall that the arrows with an empty circle at their tail imply that one can have either a direct causal link or a common cause. If one believes that the settings at each wing are freely chosen, then one is inclined to think that either the setting variables S and T should be direct causes of A and B respectively, or that, if they are not, it is the common cause for A and S and the common cause for B and T that is freely chosen. In this case, we could lump the common causes into the definition of the setting variables without loss of generality.
Besides this caveat about the causal relation between S and A and between T and B, the causal structures with pairwise common causes that are consistent with the pattern that the IC* algorithm has returned are precisely those that capture Bellʼs notion of local causality, illustrated in figure 19. Moreover, the principle of minimality, applied to the causal models consistent with the CI relations, would lead us to favour the causal model of figure 19. But because such a causal model can only yield correlations that satisfy the Bell inequalities, while the CHSH correlations violate them, we know that it cannot provide a causal explanation of the CHSH correlations.
This is how the deficiency of the IC* algorithm manifests itself when applied to quantum correlations. The problem is that a causal structure with latent variables that reproduces the CI relations of a given distribution might not be capable of reproducing the distribution itself. In particular, the causal structure of figure 19 reproduces the CI relations of the distribution P(A, B, S, T) defined by the CHSH experiment, namely (S ⊥⊥ T), (A ⊥⊥ T ∣ S) and (B ⊥⊥ S ∣ T), but it cannot reproduce the distribution itself. As our brief discussion in section 3.2 highlighted, if one applies the principle of minimality among the causal models that are consistent with the CI relations, rather than among the causal models that are consistent with the entire observed distribution, one can mistakenly come to favour a causal model that cannot reproduce the observed distribution.
Of course, we already pointed out in section 4 that the input of the IC* algorithm cannot distinguish Bell-inequality-violating from Bell-inequality-satisfying correlations. So we reiterate our conclusion from section 4: causal discovery algorithms which look only at CIs are inadequate to the task of establishing whether or not correlations can be explained by a locally causal model. We require better algorithms that also take into account the strengths of the correlations.
4.3. Some proposed causal explanations of quantum correlations
We now apply the ideas behind causal discovery algorithms to a few of the existing proposals for providing a
causal explanation of Bell-inequality-violating correlations. We consider three: superluminal causation,
superdeterminism, and retrocausation.
We start by considering the most general kind of causal explanation, where one allows hidden variables.
Causal structures without hidden variables are a special case of these. Nonetheless, we consider the case of no
hidden variables explicitly to ensure that there is no confusion.
4.3.1. Superluminal causation
One option for explaining Bell correlations causally is to assume that there are some superluminal causes, for
instance, a causal influence from the outcome on one wing to the outcome on the other, or from the setting on
one wing to the outcome on the other, or both. The possibilities are illustrated in figure 25.
These sorts of causal explanations of Bell-inequality violations, however, are unsatisfactory in light of the principles embodied in causal discovery algorithms. Given the superluminal causal influences from one wing to the other, the only way to explain the lack of superluminal signals, that is, the CI relations of equation (8), is through a fine-tuning of the causal parameters.
For instance, in figure 25(c), the correlations set up between S and B along the direct causal path could cancel with those set up by the causal path through A. (The path through λ cannot set up correlations between S and B because there is a collider on A in this path and we are not conditioning on A.) Such a cancelation requires fine-tuning of the parameters of the model.
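A linear toy model (our own illustration; the paper does not commit to any particular parametrization) makes the required cancelation explicit: the coefficient of the direct path S → B must be tuned to exactly minus the product of the coefficients along S → A → B.

```python
import random

random.seed(2)

def sample():
    S = random.gauss(0, 1)
    A = S + random.gauss(0, 1)             # S -> A
    B = A - 1.0 * S + random.gauss(0, 1)   # S -> B tuned to -1, cancelling S -> A -> B
    return S, B

data = [sample() for _ in range(100_000)]
mean_S = sum(s for s, _ in data) / len(data)
mean_B = sum(b for _, b in data) / len(data)
cov = sum((s - mean_S) * (b - mean_B) for s, b in data) / len(data)
print(round(cov, 3))   # ~0: S and B appear independent despite two causal paths;
                       # any other coefficient for S -> B would reveal the influence
```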
To salvage no-signalling for the causal structure of figure 25(a), we need a different sort of fine-tuning (a similar sort of fine-tuning mechanism can also be used for the causal structure of figure 25(b)). For instance, it could be that λ = (λ₁, λ₂), where λ₁ is a binary variable that is uniformly distributed, and that B is a function of S ⊕ λ₁, T and λ₂. In this case, we can ensure that (B ⊥⊥ S ∣ T) by virtue of the special distribution on λ₁, which is a kind of fine-tuning.
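This one-time-pad mechanism is easy to simulate (a sketch with the illustrative response function B = (S ⊕ λ₁) ⊕ (T·λ₂)): the independence (B ⊥⊥ S ∣ T) holds exactly when λ₁ is uniform and fails for any other distribution.

```python
import random

random.seed(3)

def sample(p_lambda1=0.5):
    S, T = random.randint(0, 1), random.randint(0, 1)
    lam1 = int(random.random() < p_lambda1)   # the one-time-pad bit
    lam2 = random.randint(0, 1)
    B = (S ^ lam1) ^ (T & lam2)               # B depends on S only via S XOR lam1
    return S, T, B

def p_B1(data, s, t):
    sel = [d for d in data if d[0] == s and d[1] == t]
    return sum(d[2] for d in sel) / len(sel)

data = [sample() for _ in range(200_000)]
print(round(p_B1(data, 0, 0), 2), round(p_B1(data, 1, 0), 2))  # ~0.5, 0.5: no signalling

data = [sample(p_lambda1=0.8) for _ in range(200_000)]
print(round(p_B1(data, 0, 0), 2), round(p_B1(data, 1, 0), 2))  # ~0.8, 0.2: signalling!
```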
Figure 24. The output pattern of the IC* algorithm when applied to nontrivial no-signalling correlations.