Measuring Causal Specificity

Paul E. Griffiths¹, Arnaud Pocheville¹, Brett Calcott², Karola Stotz³, Hyunju Kim⁴ and Rob Knight⁵

¹Department of Philosophy and Charles Perkins Centre, University of Sydney, NSW
²ASU/SFI Center for Complex Biosocial Systems, School of Life Sciences, Arizona State University, Tempe, AZ 85287, USA
³Department of Philosophy, Macquarie University, NSW 2109, Australia
⁴Beyond Center for Fundamental Concepts in Science, Arizona State University, Tempe, AZ 85287, USA
⁵Howard Hughes Medical Institute, and Departments of Chemistry & Biochemistry and Computer Science, and BioFrontiers Institute, University of Colorado at Boulder, Boulder, CO 80309, USA
March 18, 2015
Philosophy of Science, in press
Several authors have argued that causes differ in the degree to which they are 'specific' to their effects. Woodward has used this idea to enrich his influential interventionist theory of causal explanation. Here we propose a way to measure causal specificity using tools from information theory. We show that the specificity of a causal variable is not well-defined without a probability distribution over the states of that variable. We demonstrate the tractability and interest of our proposed measure by measuring the specificity of coding DNA and other factors in a simple model of the production of an mRNA.
This publication was made possible through the support of a grant from the Templeton World Charity Foundation. The opinions expressed in this publication are those of the author(s) and do not necessarily reflect the views of the Templeton World Charity Foundation. Brett Calcott was supported by Joshua Epstein's NIH Director's Pioneer Award, Number DP1OD003874 from the Office of the Director, National Institutes of Health. The paper is the result of a workshop held at the University of Colorado, Boulder, CO with support from the Templeton World Charity Foundation. BC, PG, AP and KS wrote the manuscript, and all authors agreed on the final content. We would like to thank two anonymous referees for their helpful comments.
1 Causal Specificity
Several authors have argued that causes differ in the degree to which they are 'specific' to their effects. The existing literature on causal specificity is mostly qualitative and recognizes that the idea is not yet adequately precise (e.g. Waters, 2007; Weber, 2006, 2013; Woodward, 2010). Marcel Weber has suggested that the next step should be a quantitative measure of specificity (2006, 606). In this article we examine how to measure specificity using tools from information theory.
Causal specificity is often introduced by contrasting the tuning dial and the on/off switch of a radio. Hearing the news is equally dependent on the dial (or digital tuner) taking the value '576' and on the switch taking the value 'ON'. But the dial seems to have a different kind of causal relationship with the news broadcast than the switch. The switch is a non-specific cause, whereas the dial (or digital tuner) is a specific cause. The difference has something to do with the range of alternative effects that can be produced by manipulating the tuner, as opposed to manipulating the switch.
Another widely discussed example of specific and non-specific causes contrasts a coding sequence of DNA with other factors involved in DNA transcription and translation (e.g. Waters, 2007). But this example has to be carefully tailored to produce the desired intuition about specificity (Griffiths & Stotz, 2013). In Section 5 we will show that the causal specificity of coding sequences of DNA differs dramatically in different cases.
Like most of the recent literature, our account of causal specificity makes use of Woodward's interventionist theory of causal explanation (Woodward 2003). We will give only the briefest summary of Woodward's theory here, since it should be well known to the presumptive audience for this paper and Woodward has provided a succinct and readily accessible summary online (Woodward 2012). Woodward construes causation as a relationship between variables in a scientific representation of a system. There is a causal relationship between variables X and Y if it is possible to manipulate the value of Y by intervening to change the value of X. 'Intervention' here is a technical notion with various restrictions. For example, changing a third variable Z that simultaneously changes X and Y does not count as 'intervening' on X. Causal relationships between variables differ in how 'invariant' they are. Invariance is a measure of the range of values of X and Y across which the relationship between X and Y holds. But even relationships with very small ranges of invariance are causal relationships.
Both Kenneth Waters (2007) and Woodward (2010) have suggested that causal specificity is related to 'causal influence' (Lewis 2000 and see Section 2). A causal variable has 'influence' on an effect variable if a range of values of the cause produces a range of values of the effect, as in the example of the tuner. However, whilst Lewis proposed that 'influence' distinguishes causes from non-causes, for Woodward it merely marks out causes that are particularly apt for intervention.
Although Woodward (2010) gives the most complete account of specificity to date there remains much to be done, as he recognizes. Marcel Weber has suggested that causal specificity is merely a variety of Woodward's invariance. A variable is a more specific cause of some other variable, Weber suggests, to the extent that the causal relationship between cause and effect variables is invariant across the range of values of both variables, and to the extent that the two variables have large ranges of values (Weber, 2006, 606). Woodward disagrees, arguing that a causal relationship with these properties may fail to meet some of the other conditions we discuss below, such as being a bijective¹ function from cause variable to effect variable (Woodward, 2010 fn17). An attempt to quantify specificity is one obvious way to move discussion forward. As we will see below, the points that Weber and Woodward are making become much clearer when expressed using a quantitative framework.

¹A function mapping causes to effects will be injective if no effect has more than one cause; surjective if every effect has at least one cause; bijective if it is both injective and surjective – every effect has one and only one cause, and vice versa.
A skeptical reader may wonder why the apparently elusive notion of causal specificity deserves such effort. Our motivation is the same as that of Waters and Weber: clarifying the notion of causal specificity may elucidate the notion of biological specificity, and facilitate the study of specificity in actual biological systems. The term 'specificity' entered biology in the 1890s in response to the extraordinary precision of biochemical reactions, such as the ability to produce an immune response to a single infective agent, or the ability of an enzyme to interact with just one substrate. By the 1940s biological specificity had come to be identified with the precision of stereochemical relationships between biomolecules. In 1958, however, Francis Crick's theoretical breakthrough in understanding protein synthesis introduced a complementary conception of specificity, sometimes referred to as 'informational specificity'. Stereochemical specificity results from the unique, complex 3-dimensional structure of a molecule that allows some molecules but not others to bind to it and interact. In contrast, informational specificity is produced by exploiting combinatorial complexity within a linear sequence, which can be done with a relatively simple and homogeneous molecule such as DNA (see Griffiths & Stotz 2013, ch. 3).
The notion of causal specificity in philosophy of science was not introduced with any a priori assumption that it is the same thing as biological specificity. However, Waters has used the idea of causal specificity to argue that DNA encodes biological specificity for gene products, unlike other factors involved in making those products (Waters, 2007). In contrast, Stotz and Griffiths have used causal specificity to argue that the biological specificity for a gene product is distributed across several of these factors (Griffiths & Stotz, 2013; Stotz, 2006).
A merely intuitive approach to causal specificity is unlikely to be helpful in settling disputes like this. In Section 5 we show that a quantitative approach may allow a more definitive resolution. At the very least, it makes clear which assumptions are driving the different conclusions reached by the parties to the dispute.
2 Specificity and Information
Causal specificity has been characterized by Woodward as a property of the mapping between causes and effects:

My proposal is that, other things being equal, we are inclined to think of C as having more rather than less influence on E (and as a more rather than less specific cause of E) to the extent that it is true that:

(INF) There are a number of different possible states of C (c1 ... cn), a number of different possible states of E (e1 ... em) and a mapping F from C to E such that for many states of C each such state has a unique image under F in E (that is, F is a function or close to it, so that the same state of C is not associated with different states of E, either on the same or different occasions), not too many different states of C are mapped onto the same state of E and most states of E are the image under F of some state of C. (Woodward, 2010, 305)
We propose to quantify Woodward's proposal that a cause becomes more specific as the mapping of cause to effect resembles a bijection.

We start from the simple idea that the more specific the relationship between a cause variable and an effect variable, the more information we will have about the effect after we perform an intervention on the cause. Starting from this idea, we can apply the tools of information theory to measure some properties of causal mappings that relate values of the cause to values of the effect. For simplicity, we restrict ourselves to variables that take nominal values, with no obvious metric relating the diverse values.²
²Variants of our approach to causal specificity are possible for metric variables. The analysis of variance, for example, gives measures that are respectively equivalent to entropy, conditional entropy and mutual information. The information theoretic approach
One property we can measure in this way is Woodward's INF. Rather than describing a relationship as injective or bijective, information theory allows us to express the tendency towards a bijective relationship as a continuous variable. Thus, our informational measure of specificity will preserve the essence of Woodward's proposal while allowing this desirable flexibility.
We use the term 'information' in the classic sense of a reduction of uncertainty (Shannon & Weaver 1949). In information theory, the uncertainty about an event can be measured by the entropy of the probability distribution of events belonging to the same class (see Box 1). Uncertainty about the outcome of throwing a die is measured by the entropy of the probability distribution of the six possible outcomes. Maximum entropy occurs when all six faces of the die have equal probabilities. If the die is loaded, the entropy is smaller and there is less uncertainty about the outcome, because one side is more probable than the others.

Applying this framework to a causal relationship allows one to measure how much knowing the value set by an intervention on a causal variable reduces one's uncertainty about the value of an effect variable. We can measure this reduction of uncertainty by comparing the entropy of the probability distribution of the value of the effect before and after knowing the value of the cause set by an intervention. The greater the difference in entropies, the more our uncertainty has been reduced. The maximum reduction of uncertainty occurs when we start from complete ignorance (i.e., maximum entropy) and when, after knowing the value of the cause set by an intervention, we end up with a completely specified value for the effect (null entropy – for instance, when a die is so heavily loaded that it always comes up 6).
taken here is more general, but the analysis of variance retains more information about the metric (see Garner & McGill 1956 for a comparison). Information theoretic variants have also been developed to deal with continuous variables (e.g. Reshef et al. 2011; Ross 2014).
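The die example can be checked numerically. The following is our own minimal sketch (the loaded distributions are invented for the illustration), not code from the paper:

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair = [1 / 6] * 6            # all six faces equiprobable: maximum entropy
loaded = [0.5] + [0.1] * 5    # one face more probable: less uncertainty
certain = [1.0] + [0.0] * 5   # the die always comes up the same face

print(entropy(fair))     # log2(6) ≈ 2.585 bits
print(entropy(loaded))   # ≈ 2.161 bits, smaller than the fair die
print(entropy(certain))  # 0.0 bits: no uncertainty left
```

Any departure from uniformity lowers the entropy, and a fully determined outcome has null entropy, exactly as described above.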
Box 1. A primer on information theory

Information theory provides us with tools to measure uncertainty, and to measure the reduction of that uncertainty. Importantly, for our purposes, it tells us how information about the value of one variable can reduce the uncertainty about the value of another, related, variable.
The simplest case occurs when a discrete variable has only two values, which can then be known by answering a single question (e.g. by yes or no). The answer is said to convey one unit of information (a bit). If the set of possible values for the variable now contains 2^n equally likely elements, we can remark that n dichotomous questions (n bits) are needed to determine the actual value of the variable. The quantity of information contained in knowing the actual value is thus n = log2(2^n). If we adopt a probabilistic framework where each possible value has equal probability p = 1/2^n, we can say that knowing any actual value of the variable brings −log2 p bits of information. When the values are not equiprobable, the average information gained by knowing an actual value of the variable is measured as an average over the probabilities of the different values. This quantity is the entropy of the probability distribution of the variable, defined as:

H(X) = −∑_{i=1}^{N} p(x_i) log2 p(x_i)

where the x_i represent values of the variable X and N is the number of different values. Entropy measures the uncertainty about the value of the variable and is always non-negative. Uncertainty is maximised (maximum entropy) when each value is equiprobable. Departing from uniformity will always make one (or more) values more probable, and so decrease uncertainty. In a similar way, increasing the number of possible values will increase uncertainty. All of the above can be generalized to cases where the number of possible values is not a power of 2.
If X and Y are two random variables (with respectively N and M different values, noted x_i, y_j), we can define the entropy of the couple X, Y:

H(X,Y) = −∑_{i=1}^{N} ∑_{j=1}^{M} p(x_i, y_j) log2 p(x_i, y_j)
This enables us to define the conditional entropy, representing the amount of uncertainty remaining on Y when we already know X:

H(Y|X) = H(X,Y) − H(X)

In a similar way, the mutual information, that is, the amount of redundant information present in X and Y, is obtained by:

I(X;Y) = H(X) + H(Y) − H(X,Y) = ∑_{i=1}^{N} ∑_{j=1}^{M} p(x_i, y_j) log2 [ p(x_i, y_j) / (p(x_i) p(y_j)) ]

Mutual information can be thought of as the amount of information that one variable, X, contains about the other, Y (normalized variants of mutual information are available).

Conditional entropy is null, and mutual information is maximal, when Y is completely determined by X. Note that conditional entropy is generally asymmetric while mutual information is always symmetric:

I(X;Y) = I(Y;X)

The relationships between these three different measures are represented in figure 1. See Cover and Thomas (2012) for more detail.
Figure 1: Diagram of the relationships between the different informational measures, entropy H(X), conditional entropy H(X|Y) and mutual information I(X;Y).
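The identities in Box 1 can be implemented in a few lines. The sketch below is our own illustration (the function names are ours): it recovers H(X), H(Y), H(X,Y), the conditional entropy H(Y|X) = H(X,Y) − H(X) and the mutual information I(X;Y) = H(X) + H(Y) − H(X,Y) from a joint distribution given as a dictionary:

```python
import math
from collections import defaultdict

def H(probs):
    """Entropy in bits; zero-probability values contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def measures(joint):
    """Given {(x, y): p(x, y)}, return the Box 1 quantities."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p   # marginal p(x)
        py[y] += p   # marginal p(y)
    hx, hy, hxy = H(px.values()), H(py.values()), H(joint.values())
    return {"H(X)": hx, "H(Y)": hy, "H(X,Y)": hxy,
            "H(Y|X)": hxy - hx,        # conditional entropy
            "I(X;Y)": hx + hy - hxy}   # mutual information

# Y completely determined by X: H(Y|X) = 0 and I(X;Y) = H(Y) = 1 bit
print(measures({(0, 'a'): 0.5, (1, 'b'): 0.5}))
```

When X and Y are instead independent (e.g. a uniform joint over all four pairs of values), the same function returns I(X;Y) = 0.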
These ideas can be illustrated with simple diagrams showing how different values of a causal variable (C) map to different values of an effect variable (E). We draw the reader's attention to the fact that these diagrams are causal mappings rather than conventional causal graphs. Nodes represent values of variables, rather than variables, as they would in a causal graph. Likewise, arrows do not represent causal connections between variables, as they would in a causal graph. An arrow connecting a value of a cause to a value of an effect means that interventions which set the cause to that value will lead to the effect having that value, with some probability. For instance, the arrow stemming from c_i and pointing to e_j corresponds to the joint event (ĉ_i, e_j) with probability p(ĉ_i, e_j). The hat in the formula means that the value c_i is fixed by an 'atomic' intervention (see Box 2).
For ease of presentation, we will make some simplifying assumptions:

1. We consider only cases where we start from complete ignorance about the effect (maximum entropy).

2. We assume that all causal values, arrows, and effect values are equiprobable.

3. We consider only cases relating one cause and one effect, ruling out the possibility of confounding factors. However, the same measures could be used in cases with confounding factors, as atomic interventions on the causal value will break the confounding influence of such factors on the association between values of the cause and values of the effect.
The simplest case is a bijection, where each value of the cause corresponds to one value of the effect and vice versa (see figure 2). Here, complete ignorance (maximum entropy) obtains when each value of the effect has a probability of 1/2 before knowing the value set by the intervention on the cause:

H(E) = −2 × (1/2) log2(1/2) = 1 [bit]

After knowing the value of the cause set by the intervention (say, ĉ_1), the effect is now fully specified (it is e_1 with probability 1), and the conditional entropy is null:

H(E|Ĉ) = −2 × (1/2) {1 log2(1) + 0 log2(0)} = 0 [bit]

The information gained by knowing the cause can be obtained by measuring the difference between the entropy before and the entropy after knowing the value set for the cause by the intervention. This quantity is the mutual information between E and Ĉ:

I(E;Ĉ) = H(E) − H(E|Ĉ) = 1 [bit]
These three quantities H(E), H(E|Ĉ), and I(E;Ĉ) measure interesting properties of the causal mapping above. The entropy H(E) measures how large and even the repertoire of possible effects is. It is the amount of information that can be gained by totally specifying an effect among a set of possible effects (here, this is one bit). The conditional entropy H(E|Ĉ) characterizes the remaining uncertainty about an effect when the value set for the cause is known (here it is fully specified, so the uncertainty is 0 bit). Finally, the mutual information I(E;Ĉ) measures the extent to which knowing the value set for the cause specifies the value of the effect (here, knowing the value of the cause brings 1 bit of information).
Another simple case is where any value of the cause can lead to any value of the effect (see figure 3). We only present this as a limiting case, because
Box 2: Causal modeling

Causal modeling provides us with the tools to track the effects of interventions on a system. Where statistical modeling would look at statistical associations between supposed causes and supposed effects, causal modeling introduces the requirement of intervening on the system to compute the causal effect. More precisely, consider a causal model consisting of:

1. a set of functional relationships x_i = f(pa_i, u_i), i = 1...n, where x_i is the value of the variable X_i being caused by X_i's parent variables pa_i, according to some function f, given some background conditions u_i

2. a joint distribution function P(u) on the background factors.

Then the simplest 'atomic' intervention consists in forcing X_i to take some value x_i irrespective of the value of the parent variables pa_i, keeping everything else unchanged. Such an intervention can be written formally with the do() operator. As Pearl writes: "Formally, this atomic intervention, which we denote by do(X_i = x_i) or do(x_i) [or x̂_i] for short, amounts to removing the equation x_i = f(pa_i, u_i) from the model and substituting X_i = x_i in the remaining equations. The new model, when solved for the distribution of X_j, yields the causal effect of X_i on X_j, which is denoted P(x_j | x̂_i)." (Pearl 2009)

The causal effect P(x_j | x̂_i) is to be contrasted with the observational conditional probability P(x_j | x_i), which can be affected by confounding factors leading to spurious associations or spurious independence.
Other recent works in mathematics and computer science have brought information theory together with causal modeling to study information processing in complex systems (Ay & Polani, 2008; Lizier & Prokopenko, 2010). These works also build on Pearl (2009), and are consistent with the work presented here. However, our approach and measures are significantly different, reflecting the fact that we start from a concern with 'causal selection' in a context of intervention and control. The differences between these approaches will be explored in a future paper.

See Pearl (2009, esp. chapter 3) for more details.
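The contrast between P(x_j | x̂_i) and P(x_j | x_i) can be made vivid with a toy simulation. The model below is our own hypothetical example, not taken from Pearl or from the paper: a hidden factor U drives both X and Y, so X and Y are perfectly correlated observationally even though X has no causal effect on Y.

```python
import random
random.seed(0)

def sample(do_x=None):
    """One draw from the toy model. Passing do_x mimics Pearl's do()
    operator: the equation x = f(u) is removed and X is set exogenously."""
    u = random.random() < 0.5    # hidden common cause
    x = u if do_x is None else do_x
    y = u                        # Y depends only on U, never on X
    return x, y

obs = [y for x, y in (sample() for _ in range(10_000)) if x]
intv = [y for x, y in (sample(do_x=True) for _ in range(10_000))]
print(sum(obs) / len(obs))    # P(y | x): spurious association, here 1.0
print(sum(intv) / len(intv))  # P(y | do(x)): true causal effect, ≈ 0.5
```

Conditioning on X = true suggests Y is certain, but intervening to set X = true leaves Y at its base rate: the intervention breaks the confounding path through U.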
Figure 2: Bijection between causal values and effect values.

Figure 3: Any value of the cause can lead to any value of the effect.
manipulating the value of C between c_1 and c_2 would have no effect on the value of E, and so C is not a cause of E on the interventionist account. In this case, as in the previous case:

H(E) = −2 × (1/2) log2(1/2) = 1 [bit]

Because in this case knowing the value set by an intervention on C gives no information about the value of E, the conditional entropy H(E|Ĉ) is equal to H(E) (our uncertainty is unchanged):

H(E|Ĉ) = −2 × (1/2) log2(1/2) = 1 [bit]

Thus, the information gained by knowing the value set for C is nil (C is not a cause of E on this account):

I(E;Ĉ) = H(E) − H(E|Ĉ) = 0 [bit]
Notice that we can approach this null mutual information as a limit of a genuine cause whose different values make decreasingly small differences as regards the value of the effect. This implies that specificity and the interventionist criterion of causation are not fully independent.

Figure 4: A single value of the cause can lead to more than one value of the effect.
These two cases, bijection (figure 2) and exhaustive connection (figure 3), illustrate limit cases of Woodward's 'degree of bijectivity' of causal mappings. We can go further by examining two slightly more complicated cases.

The first is where each value of a cause leads to a proper set of values of the effect (see figure 4). In this case the maximum uncertainty about the effect is:

H(E) = −4 × (1/4) log2(1/4) = 2 [bits]

Furthermore, knowing the cause less than fully specifies the effect. Assuming equiprobability between the two effect values that can be produced by a single value of the cause, the conditional entropy is:

H(E|Ĉ) = −2 × (1/2) {2 × (1/2) log2(1/2) + 2 (0 log2(0))} = 1 [bit]

Thus, the information about the effect gained by knowing the cause is:

I(E;Ĉ) = H(E) − H(E|Ĉ) = 1 [bit]
Figure 5: Different values of the cause lead to the same outcome.
Notice that knowing the value of the cause provides as much information about the effect as in figure 2, but because the repertoire of effects is larger, the remaining uncertainty – H(E|Ĉ) – is not null anymore. The repertoire of effects will be larger if, for instance, we increase the level of detail when describing effects (compare a game of dice based on odd versus even outcomes to a game based on the values of the six individual faces).
Let us now consider the symmetric case (see figure 5). As in figure 2 and figure 3, if we suppose complete ignorance of the effects:

H(E) = 1 [bit]

Although in figure 5 two values of the cause can lead to the same effect, knowing the value of the cause fully specifies the value of the effect just as effectively as it does in figure 2. Thus:

H(E|Ĉ) = 0 [bit]

Therefore, the difference in uncertainty about the effect between not knowing the value of the cause and knowing it is:

I(E;Ĉ) = 1 [bit]
Here again, knowing the cause provides as much information about the effects as in figure 2, but because the repertoire of states of the causal variable is now larger, some values lead to the same effects (this can happen if we increase the level of detail in our description of the cause). Notice that this will not matter if we are interested in controlling the value of the effect: applying c_1 or c_2 will deterministically lead to e_1.

Furthermore, we can distinguish between figure 2 and figure 5 if we introduce a fourth quantity, that is, the entropy characterizing the repertoire of the cause, which in these two cases is the maximum entropy. In figure 2, H(Ĉ) = 1 [bit], whereas in figure 5, H(Ĉ) = 2 [bits].
Thus, both the conditional entropy H(E|Ĉ) and the mutual information I(E;Ĉ) capture aspects of the intuition that causes differ in 'specificity'. Because the prior uncertainty H(E) is not constant – it depends in particular on the size of the repertoire of effects – both measures are needed to understand how much a cause specifies an effect (this is given by I(E;Ĉ)) and how much an effect is specified when knowing the value set for the cause (given by H(E|Ĉ)).
In the cases considered here, if H(E|Ĉ) = 0 then manipulating C provides complete control over E. This corresponds to Woodward's observation (2010, 305) that it is more important that the mapping from C to E is a surjective function than that it is also bijective. Woodward's notion of fine-grained control, however, would be better represented using H(E) and I(E;Ĉ). That is, fine-grained control requires that the repertoire of effects is large and that a cause screens off many of them (recall that we are currently dealing only with nominal variables). In the ideal case, H(E) would tend toward infinity and I(E;Ĉ) would tend toward H(E).
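Under simplifying assumptions 1–3 above, the four quantities for the mappings of figures 2–5 can be computed mechanically. The sketch below is our own illustration; representing a mapping as a dictionary from each cause value to its equiprobable effect values is an assumption of the example, not notation from the paper:

```python
import math
from collections import defaultdict

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def specificity(mapping):
    """mapping: {cause value: [equiprobable effect values]}, with the
    cause values themselves equiprobable. Returns the four quantities
    discussed in the text, in bits."""
    pc = 1 / len(mapping)
    pe = defaultdict(float)
    h_e_given_c = 0.0
    for effects in mapping.values():
        for e in effects:
            pe[e] += pc / len(effects)          # marginal p(e)
        h_e_given_c += pc * H([1 / len(effects)] * len(effects))
    he = H(pe.values())
    return {"H(C)": H([pc] * len(mapping)), "H(E)": he,
            "H(E|C)": h_e_given_c, "I(E;C)": he - h_e_given_c}

print(specificity({'c1': ['e1'], 'c2': ['e2']}))              # figure 2
print(specificity({'c1': ['e1', 'e2'],
                   'c2': ['e1', 'e2']}))                      # figure 3
print(specificity({'c1': ['e1', 'e2'],
                   'c2': ['e3', 'e4']}))                      # figure 4
print(specificity({'c1': ['e1'], 'c2': ['e1'],
                   'c3': ['e2'], 'c4': ['e2']}))              # figure 5
```

The four lines reproduce the values derived above: I(E;Ĉ) is 1, 0, 1 and 1 bit respectively, while H(Ĉ) distinguishes figure 2 (1 bit) from figure 5 (2 bits).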
3 Comparing Two Variables
We now have a proposal for a measure of causal specificity:

SPEC: the specificity of a causal variable is obtained by measuring how much mutual information interventions on that causal variable carry about the effect variable.
It is important to note that, whilst mutual information is a symmetric measure, I(X;Y) = I(Y;X), the mutual information between an intervention and its effect is not symmetrical, because the fact that interventions on C change E does not imply that interventions on E will change C: in general, I(E;Ĉ) ≠ I(C;Ê).
Recall that the aim of producing a measure of causal specificity was to use it to compare different causes of the same effect. So we need to look at a case where an effect depends on more than one upstream causal variable, and compare the mutual information they carry. To do so we will explore some increasingly complex cases involving gene transcription. In each case we focus on (messenger) RNA as the effect variable, and look at the relative specificity of different upstream causal variables.

We begin with a simple case that has already been discussed in the literature, namely comparing the causal contributions of RNA polymerase and DNA coding sequences to the structure of a messenger RNA (Waters, 2007). Both are causes of RNA, since manipulating either makes a difference to the RNA. Polymerase is like the radio on/off button, and the DNA is like the channel tuner, with a number of settings.³
We can formalise this in the following way (figure 6). There are two causal variables, DNA and POL, and one effect variable, RNA. Each variable can take on a number of values. Assume, for now, that there are four possible DNA sequences (d1, d2, d3, d4), and that the RNA polymerase is either present or absent. Our effect variable can thus take on five values—four correspond to the RNA sequences (r1, r2, r3, r4) transcribed from the DNA, and one is a state we will call r0, that occurs when there is no transcription.

³Because we do not impose an order on the values of the DNA variable, it is more like a digital tuner, to which any combination of digits can be entered, than an analogue dial.

Figure 6: Causal mapping and probability distributions for DNA and RNA (left) and POL and RNA (right).
In order to calculate the mutual information, we need to assign each of the values a probability, and these must sum to 1. We begin by simply assigning uniform probabilities over the causal variables, DNA and POL. What does our specificity measure tell us about the two causal variables in this simple case?

When we do the calculation (see Supplementary Online Materials §1), interventions on either DNA or POL carry the same amount of mutual information about RNA:

I(RNA; D̂NA) = 0.5 × 2 = 1 [bit]
I(RNA; P̂OL) = 1 [bit]

They are (given our working assumptions) equally causally specific. That might seem odd, as the DNA sequences can take on four different values, and the polymerase is simply 'present' or 'absent'. Our measurement seems to be saying there is no difference between on/off switches and tuning knobs. What has gone wrong?
To understand why this happens, recall that mutual information measures how much information on average we get by looking at a causal variable. Notice that the value of D̂NA is irrelevant if P̂OL = 'absent', and our uniform distribution sets the probability of this at 0.5. So half the time, when we look at the value of D̂NA, we learn nothing about the system. When P̂OL = 'present', knowing the value of D̂NA is useful: it delivers 2 bits of information. In short, half the time D̂NA gives us 0 bits of information, and the other half of the time 2 bits. Hence, 1 bit on average.
What this shows is that our proposed measure for causal specificity is sensitive to the probability distribution of the causal variables. This means that either our specificity measure is incorrect, or Woodward's INF (Section 2) is missing something, because that condition makes no mention of the probability distributions over the variables. In the next section we will see that this dilemma corresponds to two different approaches to causal specificity.
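We cannot reproduce the paper's Supplementary Online Materials here, but our own reconstruction of the calculation is short. Because the interventions set DNA and POL independently and uniformly, the interventional mutual information coincides with ordinary mutual information in the resulting joint distribution:

```python
import math
from collections import defaultdict
from itertools import product

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), from a dict {(x, y): p}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return H(px.values()) + H(py.values()) - H(joint.values())

# DNA uniform over d1..d4, POL uniform over present/absent, independent.
# RNA is r_i when polymerase is present, r0 otherwise.
dna_rna, pol_rna = defaultdict(float), defaultdict(float)
for d, pol in product(['d1', 'd2', 'd3', 'd4'], ['present', 'absent']):
    p = (1 / 4) * (1 / 2)
    rna = d.replace('d', 'r') if pol == 'present' else 'r0'
    dna_rna[(d, rna)] += p
    pol_rna[(pol, rna)] += p

print(mutual_information(dna_rna))  # 1.0 bit
print(mutual_information(pol_rna))  # 1.0 bit
```

Both causal variables come out at exactly 1 bit, as in the text: the on/off switch and the four-position tuner are equally specific under uniform interventions.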
4 Specific Actual Difference Making

The suggestion that the actual probability distributions of the causal variables matter when assessing which causes are significant is an idea we have heard before. Waters argues that in order to pick out the significant causes, you need to know the actual difference makers. For example, even when it is possible to manipulate POL (which identifies it as a potential cause), if there is no actual difference in POL in a population of cells, as Waters assumes, then it is not a significant cause. Waters' notion of an "actual difference maker" (Waters, 2007, 567) can be related to our specificity measure.

Waters treats the question of whether a variable exhibits actual variation as though it were a binary choice, but it makes sense to treat it as continuous. The 'actual variation' is the entropy of the variable.
Figure 7: Effects of changing the probability of P̂OL = 'present' on several informational measures: the entropy of RNA (the effect), the mutual information between RNA and D̂NA, and the mutual information between RNA and intervening on the presence of polymerase. It can be shown that H(RNA) = I(RNA; D̂NA) + I(RNA; P̂OL) (see Supplementary Online Materials §1). The variation in the effect can thus here be decomposed into the respective contributions of the causes.
To show how this idea fits into our specificity measure, consider how the mutual information (specificity) of each of our two variables D̂NA and P̂OL with RNA changes as we vary the probability distribution of P̂OL (which in turn varies its entropy). In figure 7, each value on the X axis represents a different case. These range from cases where the probability of 'present' is 0 (polymerase is never around) to systems where the probability of 'present' is 1 (polymerase is always around). In these extreme cases, the variable has become a fixed background factor and doesn't actually vary, and thus the entropy H(P̂OL) is 0. When the probability of 'present' is 0.5, P̂OL is maximally variable, and has maximum entropy. The mutual information between P̂OL and RNA is also maximized at this point. Notice also that as we increase p(present) to 1, the mutual information between D̂NA and RNA increases. When P̂OL = 'present' all the time, the full 2 bits of information about RNA can be found in D̂NA.
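The sweep behind figure 7 can be sketched in the same style (our own reconstruction, with our own function names): vary p(present) while keeping DNA uniform, and recompute the two mutual informations at each point.

```python
import math
from collections import defaultdict
from itertools import product

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return H(px.values()) + H(py.values()) - H(joint.values())

def model(p_present):
    """Joint distributions for the DNA/POL/RNA model, with DNA uniform
    over four sequences and p(POL = 'present') set to p_present."""
    dna_rna, pol_rna = defaultdict(float), defaultdict(float)
    for d, (pol, p_pol) in product('1234',
            [('present', p_present), ('absent', 1 - p_present)]):
        p = 0.25 * p_pol
        rna = 'r' + d if pol == 'present' else 'r0'
        dna_rna[('d' + d, rna)] += p
        pol_rna[(pol, rna)] += p
    return dna_rna, pol_rna

for p in (0.0, 0.5, 1.0):
    dna_rna, pol_rna = model(p)
    print(p, mutual_information(dna_rna), mutual_information(pol_rna))
# 0.0 -> both 0.0; 0.5 -> both 1.0; 1.0 -> DNA carries 2.0, POL 0.0
```

At the extremes P̂OL is a fixed background factor carrying no information, and at p(present) = 1 the full 2 bits about RNA sit in D̂NA, matching the endpoints of figure 7.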
Our proposed measure of specificity captures two things: the extent to which a relationship approaches a bijection (Woodward's INF) and the degree to which the cause is an actual difference maker (i.e. the cause also has high entropy). So the mutual information measure appears to capture the degree to which a cause is a 'specific actual difference maker', or SAD for short.
Within our information theoretic framework there is a clear difference between the SAD concept and Woodward's INF. SAD uses the actual probability distribution over the values of a causal variable in some population. INF makes no distinction between the states of a causal variable. We will represent this by supposing that the variable has maximum entropy: all its states are equiprobable. This makes sense when we recall that for Woodward causal variables are sites of intervention. For an idealised external agent intervening on the system, the value of a causal variable is whatever they choose to make it.
It is possible to ﬁnd diﬀerent scientiﬁc contexts in which biologists seem
to approach causal relationships in ways that correspond to SAD and INF
respectively. Waters argues that classical genetics of the Morgan school
was only concerned to characterize causes which actually varied in their
laboratory populations (Waters, 2007). Griﬃths and Stotz argue that some
work in behavioral developmental biology and much work in systems biology sets out
to characterize the eﬀect on the system of forcing all causal variables through
their full range of potential variation (Griﬃths & Stotz, 2013, 198-9). This
kind of research, they argue, is done with the aim of discovering new ways to
intervene in complex systems. The information theoretic framework allows
us to distinguish between the specificity of potential (INF) and actual (SAD)
difference makers.
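The contrast can be made concrete in a few lines. In this sketch (our own illustration; the distribution figures are invented for the example) the same deterministic, bijective mapping is scored once with a maximum-entropy distribution over interventions (INF) and once with a skewed ‘actual population’ distribution (SAD):

```python
from math import log2

def specificity(p_cause, mapping):
    """I(E; C^) when each value of the cause deterministically yields one
    effect value: the entropy of the induced effect distribution."""
    p_effect = {}
    for c, p in p_cause.items():
        e = mapping[c]
        p_effect[e] = p_effect.get(e, 0.0) + p
    return -sum(p * log2(p) for p in p_effect.values() if p > 0)

bijection = {"c1": "e1", "c2": "e2", "c3": "e3", "c4": "e4"}

# INF: treat every intervention as equiprobable (maximum entropy).
inf_spec = specificity({c: 0.25 for c in bijection}, bijection)
# SAD: weight by (hypothetical) actual frequencies in a population where
# one value of the cause dominates.
sad_spec = specificity({"c1": 0.97, "c2": 0.01, "c3": 0.01, "c4": 0.01}, bijection)

print(inf_spec)  # → 2.0
print(sad_spec)  # well under 1 bit: a weak *actual* difference maker
```

The causal structure is identical in both computations; only the probability distribution over the states of the cause differs, which is exactly why specificity is not well-defined without one.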
Our measure of causal speciﬁcity sheds light on another issue that we dis-
cussed in our introduction. Weber proposed that the speciﬁcity of a causal
relationship is simply the range of values of the variables across which a
causal relationship holds, or what Woodward calls the “range of invariance”
(Woodward 2003, 254). Woodward rejected this idea because a causal rela-
tionship might hold across a large range of invariance but fail to be bijective.
Our information theoretic framework captures both why Weber makes this
suggestion and why Woodward’s additional condition is needed. Weber’s
point corresponds to the fact that mutual information between cause and ef-
fect variables will typically be greater when these variables have more values,
simply because the entropy of both variables is higher. Woodward’s caveat
corresponds to the fact that it will not do to increase the number of values of
a cause variable unless the additional values of the cause map onto distinct
values of the eﬀect. Increasing the entropy of the cause variable will not in-
crease mutual information when no additional entropy in the eﬀect variable
is captured. This is why the mutual information between the variables is
the same in ﬁgure 2 and in ﬁgure 5. In terms of the diagram in Box 1, such
an increase in the size of region H(X) would be confined to the sub-region
H(X|Y), with no increase in sub-region I(X;Y). The same point, of course,
holds mutatis mutandis for the eﬀect variable.
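Woodward’s caveat can be illustrated directly: enlarging the range of a cause variable raises specificity only if the extra values map onto distinct values of the effect. A minimal sketch (our own illustration):

```python
from math import log2

def specificity(n_values, effect_of):
    """I(E; C^) for a deterministic cause-effect mapping with equiprobable
    interventions on an n-valued cause (so I(E;C^) = H(E))."""
    p = 1.0 / n_values
    p_effect = {}
    for c in range(n_values):
        e = effect_of(c)
        p_effect[e] = p_effect.get(e, 0.0) + p
    return -sum(q * log2(q) for q in p_effect.values())

# Four cause values mapped bijectively onto four effects: 2 bits.
print(specificity(4, lambda c: c))      # → 2.0
# Eight cause values whose extra values collapse onto the same four effects:
# the added entropy in the cause stays in H(C|E) and the measure is unchanged.
print(specificity(8, lambda c: c % 4))  # → 2.0
```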
In addition to the SAD and INF conceptions of speciﬁcity, there is a third
option corresponding to a suggestion by Weber that causal speciﬁcity should
be assessed on the assumption that causal variables are neither restricted
to their actual variation in some population, nor allowed to vary freely, but
instead restricted to their ‘biologically normal’ range of variation: “What we
need is a distinction between relevant and irrelevant counterfactuals, where
relevant counterfactuals are such that they describe biologically normal possi-
ble interventions” (Weber, 2013, 7, his italics). We will call this REL. Weber
tells us that a biologically normal intervention must (1) involve a naturally
occurring causal process and (2) not kill the organism. More work is ob-
viously needed to make this idea precise, but we will see in Section 5 that
even in this crude form REL provides a useful framework for modeling actual
cases. At a practical level, we interpret REL as assessing causal speciﬁcity
with uniform probability distributions within the range of variation in the
variable that would be produced by known mechanisms acting on relevant
timescales for the causal processes we are trying to model.
5 Distributed Causal Speciﬁcity
We have suggested that causal speciﬁcity can be measured by the amount
of mutual information between variables representing cause and eﬀect. This
implies that the degree of speciﬁcity of a causal relationship depends on the
probability distributions over the two variables, and we have argued that this
relates to Waters’ claim that signiﬁcant causes are speciﬁc actual diﬀerence
makers. We have also taken on board Weber’s point that it may be more
interesting to explore, not the strictly actual variation, but the ‘biologically
normal’ variation (REL). In this section we apply our measure to a more
complex case than the roles of RNA polymerase and DNA in the production
of RNA, namely the role of splicing factors and DNA in the production of
alternatively spliced mRNA. Importantly, we shall also attempt to ﬁll out
these measures with realistic values.
In contemporary molecular biology the image of the gene as a simple
sequence of coding DNA with an adjacent promoter region is very much a
special case. This image remains important in the practice of annotating
genomes with ‘nominal genes’ – regions that resemble reasonably closely the
textbook image (Burian, 2004; Fogle, 2000; Griﬃths & Stotz, 2007; Grif-
ﬁths & Stotz 2013). But a more representative image of the gene, at least
in eukaryotes, is a complex region of DNA whose structure is best under-
stood top-down in light of how that DNA can be used in transcription and
translation to make a range of products. Multiple promoter regions allow
transcripts of diﬀerent lengths to be produced from a single region. This
and other mechanisms allow the same region to be transcribed with diﬀer-
ent reading frames. mRNA editing allows single bases in a transcript to be
changed before translation. Trans-splicing allows diﬀerent DNA regions to
contribute components to a single mRNA. Here, however, we will concen-
trate on the most ubiquitous of these mechanisms, alternative cis-splicing, a
process known to occur, for example, in circa 95% of human genes (nominal
genes).4
Genes are annotated with two kinds of regions, exons and introns. The
typically much larger introns are cut out of the corresponding mRNA and
discarded. In alternative cis-splicing (hereafter just ‘splicing’) there is more
than one way to do this, giving rise to a number of diﬀerent proteins or
functional RNAs. For simplicity, we will ignore mechanisms such as exon
repetition or reversal, and the fact that exon/intron boundaries may vary,
and treat this process as if it were simply a matter of choosing to include or
omit each of a determinate set of exons in the final transcript.
4For more detail on all these processes, see Griffiths and Stotz (2013). It may be useful
to know that the prefix trans- denotes processes involving a different region of the DNA,
whilst the prefix cis- denotes processes involving the same or an immediately adjacent
region.
With alternative splicing, the ﬁnal product is co-determined by the cod-
ing region from which the transcript originates and some combination of
trans-acting factors which bind to the transcript to determine whether cer-
tain exons will be included or excluded. These factors are transcribed from
elsewhere in the genome, and their presence at their site of action requires the
activation of those regions and correct processing, transport and activation
of the product. The entire process thus exempliﬁes the themes of ‘regulated
recruitment and combinatorial control’ characteristic of much recent work on
the control of genome expression (Griﬃths & Stotz 2013; Ptashne & Gann
2002). We will simplify this by representing alternative splicing as a sin-
gle variable each of whose values correspond to a set of trans-acting factors
suﬃcient to determine a unique splice-variant.
The role of alternative splicing is well known, but recent work on causal
speciﬁcity does not treat this issue with much care. Weber states that, “De-
pending on what protein factors are present, a cell can make a considerable
variety of diﬀerent polypeptides from the same gene. Thus we have some
causal speciﬁcity, but it is no match for the extremely high number of diﬀer-
ent protein sequences that may result by substituting nucleic acids” (Weber,
2006, endorsed by Waters, 2007, fn. 28). Here Weber seems to be making a
problematic comparison of the actual range of splicing variants present in
a single organism with the possible genetic variants that could be produced
by mutation. Recently, Weber has explicitly argued for this comparison, ar-
guing that only ‘biologically normal’ interventions should be considered and
that variation in DNA coding sequences is biologically normal. He concludes
that DNA and RNA deserve a unique status amongst biological causes be-
cause their biologically normal ability to vary in a way that inﬂuences the
structure of gene products is “vastly higher (i.e., many orders of magnitude)
than that of any other causal variables that bear the relation INF to protein
sequences (e.g., splicing agents)” (Weber, 2013, 31).
We are not convinced that it is a meaningful comparison to take, for
example, the Drosophila DSCAM gene5 with 38,016 splice variants, all or
most of which are found in any actual population of flies, and say that
alternative splicing has negligible causal specificity because this number of
variants is much less than the number of variants possible by mutation of
the DSCAM coding sequence with no limit on the number of mutational
steps away from the actual sequence (Weber, 2013, 19). This seems to be a classic example
of the way in which philosophers are unable to sustain parity of reasoning
(Oyama 2000, 200ﬀ) when thinking about DNA. The principle that only
‘biologically normal’ variation should be counted is rigorously enforced for
non-genetic causes but not for genetic causes. An anonymous reviewer has
pointed out that even when variation in the coding DNA sequence is re-
stricted to a small (and thus ‘biologically normal’) number of mutational
steps, the number of possible variants expands very rapidly because of the
sheer number of nucleotides (about 6000 in DSCAM). Which ranges of varia-
tion in splicing agents and coding sequences it is meaningful to compare will
depend on the biological question being addressed, as we will now discuss.
5In the Drosophila receptor DSCAM (Down Syndrome Cell Adhesion Molecule), 4 of
the 24 exons of the Dscam gene are arranged in large tandem arrays, whose regulation is an
example of mutually exclusive splicing. One block has 2 exons - leading to 1 of 2 alternative
transmembrane segments, the others contain respectively 12, 48 and 33 alternative exons
- leading to 19,008 different ecto-domains. Neurons not only differ with
respect to which one of the 38,016 variants (in a genome of about 15,000
genes) they express, but also in the exact ratio in which they express up
to 50 variants at a time. Each block of exons seems to possess a unique
mechanism that ensures that only one of the alternative
exons is included in the ﬁnal transcript. For details and references, see Supplementary
Online Materials §3.
To make a meaningful comparison between splicing agents and coding
sequences it is also necessary to specify a population of entities across which
they produce variation. Waters (2007) focuses on two examples in which
most of the actual variation is caused by variation in DNA. The ﬁrst is the
population of phenotypic Drosophila mutants in a classical genetics labora-
tory. The second is the population of RNA transcripts at one point in time
in a bacterial cell in which there is no alternative splicing. Obviously, neither
of these cases is a useful one with which to evaluate the causal speciﬁcity of
splicing agents, but they do exemplify two important classes of comparisons
we might make. First, we might compare the variation between individuals
in an evolving population and seek to determine if variation in DNA coding
sequences is the sole or main speciﬁc diﬀerence maker. Second, we might
consider the transcriptome (population of transcripts) in a single cell, either
at a time or across time, and ask whether variation in DNA coding sequences
is the sole or main speciﬁc diﬀerence maker between these transcripts. We-
ber also considers examples of these two kinds. However, neither Waters nor
Weber considers a third important case, which is the variation between cells
in an organism, both spatial and temporal. This is the kind of variation
that needs to be explained to understand development, the context in which
controversy over the causal roles of genes and other factors most often arises.
Both actual and relevant (‘biologically normal’) variation in genes or
splicing agents will be diﬀerent in each of these three cases. In the case of
an evolving population mutation is a biologically normal source of variation,
but without any limit on the number of mutational steps from the current
sequence, let alone variation in genome size or ploidy, the values of the DNA
variable would simply be every possible genome, which would be both un-
manageable and biologically meaningless. It might seem natural to exclude
any other sources of variation on the grounds that they are not heritable,
but a number of evolutionary theorists would hotly dispute this (e.g. Bon-
duriansky, 2012; Jablonka & Lamb, 2005; Uller, 2012). Furthermore, the
machinery of splicing also changes over evolutionary time, so in the evolu-
tionary case the ‘biologically normal’ variation in splicing is greater than the
amount of variation observed in any actual population. These are very com-
plex issues, and we cannot undertake the extensive work of establishing the
relevant ranges of variation of genetic and other variables in the evolutionary
case in this paper.
Instead, we will examine the simpler case suggested by Waters, the pop-
ulation of RNA transcripts in a single cell at one time. But while Waters
considers only cells with no splicing, we will consider cells with splicing, so
as to make a comparison possible. For the transcriptome of a single cell at
a time, the relevant values of the DNA variable are the diﬀerent sequences
that can be transcribed by the polymerase. If we ignore complexities such
as multiple promoters, we can set this equal to the nominal gene count in
the genome, so that realistic ﬁgures are available. The values of the DNA
variable will be weighted by the probability of each gene being expressed.
The values of the splicing variable can be set equal to the number of splicing
variants from each gene, weighted by the probability of each splice variant.
We now propose a quantiﬁcation of the respective causal speciﬁcity of the
DNA and splicing variables for this very simple case. To further simplify the
exposition we assume that the polymerase is always present (an assumption
which can be relaxed easily, see Supplementary Online Materials §2). We
focus on the mutual information measure outlined above, but we need to
take a slightly diﬀerent approach to compare the speciﬁcity of splicing with
the speciﬁcity of DNA, for we assume that splicing factors are recruited only
after a given strand of DNA has been transcribed. We do this because, in
reality, it is not the case that any set of splicing factors can be combined
with any gene. If we were to model splicing in this way, then the outcome
of most combinations of genes and sets of splicing factors would be that the
system fails to produce any biologically meaningful outcome. So it is both
simpler and more biologically realistic to represent the process sequentially,
as the transcription of an mRNA followed by the recruitment of a set of
splicing factors. In other words, the transcription of a given DNA strand
opens a set of possibilities among a proper subset of the possible combinations
of splicing factors (figure 8). This entails that the information in splicing
factors, measured by H(Ŝ), contains all the information in DNA, measured
by H(D̂). Because the entropy in the DNA variable is conserved in the
entropy of the splicing variable, the mutual information between RNA and
splicing will also conserve the mutual information between RNA and DNA.
Thus, we will need a way to decompose our causal specificity measure into
two components, isolating the separate contributions of DNA and splicing.
As mentioned above, we treat the splicing process as if it were simply
a matter of choosing to include or omit each of a determinate set of exons
in the final transcript. Each value of our splicing variable corresponds to
a set of trans-acting factors sufficient to determine a unique splice-variant
of the RNA. In other words, we consider a bijective relationship between
sets of splicing factors (once recruited) and RNA variants. This bijection
entails that the mutual information between RNA and interventions on
splicing, I(R; Ŝ), is simply equal to the so-called self-information of splicing,
I(Ŝ; Ŝ), which is itself equal to the entropy of splicing, H(Ŝ). We can then
decompose the entropy of splicing according to well-known chain rules:6

I(R; Ŝ) = H(Ŝ) = H(D̂) + H(Ŝ | D̂)

6In the following equations, D and R are the variables DNA, RNA (see figure 7) and S
is the splicing variable.

Figure 8: The simplified relationship between DNA (D), splicing
(S) and RNA (R) variables, assumed in the models in Section 5.
Selection of a value for DNA opens a proper set of possibilities of
Splicing. There is a bijective relationship between Splicing and RNA.

Noting that I(R; D̂) = H(D̂) when the polymerase is always present
(see Section 5), and that H(Ŝ | D̂) = I(R; Ŝ | D̂) (see Supplementary
Online Materials §2), we can rewrite the equation as:

I(R; Ŝ) = I(R; D̂) + I(R; Ŝ | D̂)

This equation provides a decomposition of the mutual information between
RNA and splicing, I(R; Ŝ), into two components: the mutual information
between RNA and DNA, I(R; D̂), and the mutual information between
RNA and splicing conditional on DNA, I(R; Ŝ | D̂). Because I(R; Ŝ | D̂) ≥ 0,
this entails that I(R; Ŝ) ≥ I(R; D̂). If we simply proceed as before, taking
mutual information as a measure of causal speciﬁcity, we ﬁnd that the speci-
ﬁcity of splicing is always greater than or equal to the speciﬁcity of DNA. As
we mentioned above, however, we need to account for the fact that all the in-
formation contributed by DNA to RNA is conserved in the splicing variable.
Fortunately, we can decompose the mutual information in splicing to obtain
two terms which represent the contribution from the DNA and the contri-
bution from the splicing process. The term Hb
Din the decomposition
Srepresents the amount of information which is preserved in the
splicing process but originates in the DNA. The variation in RNA properly
coming from the splicing process is represented by the term HSb
term that, roughly, reﬂects the number of splicing variants per DNA strand.
Thus, if one wants to compare the causal speciﬁcity of splicing and DNA,
one needs to know which of these two terms, Hb
the greatest contribution to IR;b
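The decomposition is easy to verify numerically. In this sketch (our own illustration, assuming equiprobable genes and equiprobable splice variants within each gene), I(R; Ŝ) is computed directly from the joint choice of gene and variant and compared with the sum of the two terms:

```python
from math import log2

def decompose(variants_per_gene):
    """Sequential model: pick a gene equiprobably, then one of that gene's
    splice variants equiprobably; RNA is bijective with the chosen variant.
    Returns (H(D^), H(S^|D^), I(R;S^))."""
    g = len(variants_per_gene)
    h_d = log2(g)
    h_s_given_d = sum((1.0 / g) * log2(k) for k in variants_per_gene)
    # H(S^) computed directly from the joint (gene, variant) distribution;
    # the bijection between splicing and RNA makes I(R;S^) = H(S^).
    probs = [1.0 / (g * k) for k in variants_per_gene for _ in range(k)]
    i_r_s = -sum(p * log2(p) for p in probs)
    return h_d, h_s_given_d, i_r_s

h_d, h_s_d, i_r_s = decompose([3, 3])  # e.g. two genes with three variants each
# Chain rule: I(R;S^) = H(D^) + H(S^|D^), hence I(R;S^) >= I(R;D^) = H(D^).
assert abs(i_r_s - (h_d + h_s_d)) < 1e-9
print(round(h_d, 2), round(h_s_d, 2), round(i_r_s, 2))  # → 1.0 1.58 2.58
```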
The answer will crucially depend on the biological system. In Drosophila,
an important determinant of neuronal diversity is the single Dscam gene with
38,016 splice variants (see Supplementary Online Materials §3). This gives a
maximum entropy of circa log2(38016) = 15.2 bits for H(Ŝ | D̂), with 0 bits
for H(D̂). The diversity of this class of transcripts in Drosophila
is entirely explained by post-transcriptional processing.7
The homologs of this gene in humans, Dscam and DscamL1, present a
very different picture. The number of splicing variants per gene appears to
be no greater than 3. Assuming that the transcription of each of these two
DNA regions is equiprobable, this gives a maximum entropy of circa 1.6 bits
for H(Ŝ | D̂), to be compared with 1 bit for H(D̂). DNA and splicing are
roughly equal determinants of diversity in this class of transcripts.
A more meaningful comparison to the Dscam case in drosophila, however,
may be other classes of vertebrate cell-surface proteins. Generalising from
real cases8we might imagine a class of transcripts that derives from, say 100
7Our decision to use actual ﬁgures for genes and isoforms but assume equiprobability
(maximum entropy) for each variable can be justiﬁed in this particular case on both the
INF and REL approaches (Section 4). The data required for Waters’ SAD approach are
not available, but there is no reason to suppose it would give qualitatively diﬀerent results.
8Dscam is homologous between almost all animals, but in vertebrates the two homol-
ogous genes, Dscam and DscamL1, do not encode multiple isoforms. There are however,
several hundred cell adhesion and surface receptor genes in vertebrates: the Ig superfamily,
related genes, each of which has 150 splicing variants. Assuming once again
that the transcription of any of these DNA regions is equiprobable, this gives
circa 7.2 bits for H(Ŝ | D̂), to be compared with circa 6.6 bits for H(D̂).
Both the DNA and splicing variables are important determinants of diversity
in this class of transcripts.
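These entropy budgets follow directly from the counts. A short computation (our own illustration; the gene and variant counts are those given in the text and its footnotes, with equiprobability assumed throughout):

```python
from math import log2

# Entropy budgets for the three cases in the text, assuming equiprobable
# genes and equiprobable splice variants per gene (maximum entropy).
cases = {
    "Drosophila Dscam":                 (1, 38016),
    "human Dscam / DscamL1":            (2, 3),
    "hypothetical receptor-gene class": (100, 150),
}
for name, (genes, variants_per_gene) in cases.items():
    h_dna = log2(genes)                   # H(D^)
    h_splicing = log2(variants_per_gene)  # H(S^|D^)
    print(f"{name}: H(D^) = {h_dna:.1f} bits, H(S^|D^) = {h_splicing:.1f} bits")
```

The three lines reproduce the figures quoted above: 0.0 versus 15.2 bits, 1.0 versus 1.6 bits, and 6.6 versus 7.2 bits respectively.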
Assigning speciﬁcity to the causes of transcript diversity in a single cell
at a time is relatively tractable. The analyses just given could, in principle,
be extended to the entire transcriptome at one stage in the life-cycle of a
well-studied system such as yeast. But this would be of limited interest.
What is at stake in disagreements over the relative causal roles of coding
regions of DNA and other factors in gene expression would be better rep-
resented by comparing the transcriptome in a cell at diﬀerent times in its
life-cycle, or comparing transcriptomes between diﬀerent cell-types in an or-
ganism. These comparisons are both ways of thinking about development
– the process by which regulated genome expression produces an organism
and its life-cycle. In comparing the same cell across times, a critical feature
is that which genes are transcribed and how their products are processed de-
pends on transcription and processing at earlier times. For the population of
cells in an organism somatic mutations that could arise during development
become relevant, leading to the need to say something about the number of
mutational steps that counts as a ‘biologically realistic’ intervention on this
variable. We hope to confront these complexities in future work.
as well as integrins, cadherins, and selectins. This genetic diversity is combined with com-
plex regulatory patterns, albeit not on the scale of the Dscam expression in Drosophila.
The three neurexin genes display extensive alternative splicing, a process that can po-
tentially generate thousands of neurexin isoforms alone. For details and references, see
Supplementary Online Materials §3.
6 Conclusion
Causal specificity is the label given to an intuitive distinction amongst the
many conditions that are necessary to produce an eﬀect. The speciﬁc causes
are those variables that can be used for ﬁne-grained control of an eﬀect
variable. It has been suggested that a speciﬁc relationship between two
variables is one that resembles a bijective mapping between the values of the
two variables (Woodward, 2010). The concept of causal speciﬁcity can be
clariﬁed considerably by going a step further and attempting to measure it.
Our quantitative measure of speciﬁcity starts from the simple idea that
the more speciﬁc the relationship between a cause variable and an eﬀect vari-
able, the more information we will have about the eﬀect after we perform
an intervention on the cause. Section 2 used information theoretic measures
to express this idea. We found that if the conditional entropy of the effect
given interventions on the cause, H(E | Ĉ), is 0, then manipulating C provides
complete control over E. We argued, however, that the idea of sensitive
manipulation, or fine-grained influence (Woodward, 2010), would be better
represented by measuring the entropy of the effect, H(E), and the mutual
information between cause and effect, I(E; Ĉ). Fine-grained influence
requires both that the repertoire of effects is large and that the state of the
cause contains a great deal of information about the state of the effect. In
the ideal case, H(E) would tend toward infinity and I(E; Ĉ) toward H(E).
Section 3 examined the behavior of I(E; Ĉ) as a measure of causal
speciﬁcity (SPEC). The behavior of the measure depends on the probability
distributions over the states of the variables, as well as the structure of the
causal graph. Other things being equal, a variable with many states that
are rarely or never occupied is a less speciﬁc cause than one equally likely
to be in any of its states, that is, one with higher entropy. Section 4 showed
that this feature is a strength of our proposed measure. It is in line with the
qualitative reasoning of Waters (2007), who argues that the property which
justifies singling out one cause as more significant than another can be its
speciﬁcity with respect to the actual variation seen in some population and
of Weber (2013) who suggests that we focus on the somewhat wider class of
‘biologically normal’ variation.
The sensitivity of our measure to the underlying probability distribu-
tions contrasts with presentations of causal speciﬁcity where it is assumed
that the value can be inferred from the structure of a causal graph. Our at-
tempt to quantify speciﬁcity forces this assumption to become explicit. The
least arbitrary way to represent this assumption in our models would seem
to be to make all values of the causal variables equiprobable. Making this
assumption is probably not appropriate for settling the disputes about the
relative signiﬁcance of various causal factors in biology with which Waters
and Weber are concerned. However, in the broader context of the interven-
tionist account of causation it may be entirely appropriate, because causal
variables are the sites of voluntary intervention by an idealized agent.
Section 5 used our measure to assess the relative speciﬁcity of diﬀerent
causes that contribute to the same eﬀect. The idea of speciﬁcity has been
used to argue that DNA sequences are the most signiﬁcant causes, because
of their supposedly unrivalled degree of speciﬁcity. Our discussion revealed
that this is completely premature. First, it is necessary to specify the causal
process in question. The causes of individual diﬀerences in an evolving pop-
ulation are quite diﬀerent from the causes of transcript diversity in a single
cell, and diﬀerent again from the causes of spatial and temporal diversity
amongst the cells of a single organism. We constructed a simple model with
which we were able to quantify the speciﬁcity of a DNA coding sequence
and of splicing factors with respect to transcript diversity in a single cell at
a time. We showed that the relative speciﬁcity of these two variables can be
very different for different classes of transcripts. The idea that DNA
obviously has an unrivalled degree of specificity seems to arise because
earlier, qualitative discussions implicitly compared the actual variation in the
splicing variable within cells to the possible variation in the DNA variable
on an evolutionary timescale.
While it seems plausible to us that the speciﬁcity of coding DNA as a
cause of evolutionary change is very high, we pointed out that proper explo-
ration of this would require serious thought about which range of variation in
the DNA variable can be meaningfully compared with which range of varia-
tion in other cellular mechanisms. Similar work would be needed before our
measure can be applied to what is arguably the most pressing case, namely
the relative speciﬁcity of diﬀerent causes in development. We hope to focus
on this case in future work.
We believe that the work reported here amply demonstrates the philo-
sophical payoﬀ of developing quantitative measures of causal speciﬁcity.
However, a great deal remains to be done. First, although our measure
provides information about causal specificity rather than the presence of
causation per se, in future work we hope to provide an information theoretic
statement of the interventionist criterion of causation. Second, our measure
of speciﬁcity is only one of several information theoretic measures that can
be used to characterize causal relationships. In future work we hope to ex-
plore the potential of these other measures for the philosophy of causation.
Third, and perhaps most urgently, we gave only minimal attention in this
paper (in Section 4) to the ways in which the relationship between two vari-
ables can be aﬀected by additional variables. In a forthcoming paper we
extend our framework to deal with these interactions.
Supplementary Online Materials can be downloaded from http:
References
Bonduriansky, R. (2012). Rethinking heredity, again. Trends in Ecology and
Evolution, 27(6), 330-336.
Burian, R. M. (2004). Molecular Epigenesis, Molecular Pleiotropy, and
Molecular Gene Deﬁnitions. History and Philosophy of the Life Sciences,
Cover, T. M., & Thomas, J. A. (2012). Elements of Information Theory.
Hoboken, NJ: John Wiley & Sons.
Garner, W. R., & McGill, W. (1956). The relation between information
and variance analyses. Psychometrika, 21 (3), 219-228.
Griﬃths, P. E., & Stotz, K. (2013). Genetics and Philosophy: An intro-
duction. New York: Cambridge University Press.
Jablonka, E., & Lamb, M. J. (2005). Evolution in Four Dimensions :
Genetic, Epigenetic, Behavioral, and Symbolic Variation in the History of
Life. Cambridge, Mass: MIT Press.
Lewis, D. K. (2000). Causation as inﬂuence. Journal of Philosophy, 97,
Oyama, S. (2000). The Ontogeny of Information: Developmental systems
and evolution (Second edition, revised and expanded. ed.). Durham, North
Carolina: Duke University Press.
Pearl, J. (2009). Causality. Cambridge: Cambridge University Press.
Ptashne, M., & Gann, A. (2002). Genes and Signals. Cold Spring Har-
bor, NY: Cold Spring Harbor Laboratory Press.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean,
G., Turnbaugh, P. J., et al. (2011). Detecting Novel Associations in Large
Data Sets. Science, 334 (6062), 1518-1524.
Ross, B. C. (2014). Mutual Information between Discrete and Continu-
ous Data Sets. PLoS ONE, 9 (2), e87357.
Shannon, C. E., & Weaver, W. (1949). The Mathematical Theory of
Communication. Urbana, Ill.: Univ. of Illinois Press.
Uller, T. (2012). Parental eﬀects in development and evolution. In N. J.
Royle, P. T. Smiseth & M. Kölliker (Eds.), The Evolution of Parental Care
(pp. 247-266). Oxford: Oxford University Press.
Waters, C. K. (2007). Causes that make a diﬀerence. Journal of Philos-
ophy, 104 (11), 551-579.
Weber, M. (2006). The Central Dogma as a thesis of causal speciﬁcity.
History and Philosophy of the Life Sciences, 28(4), 595-609.
Weber, M. (2013). Causal Selection versus Causal Parity in Biology:
Relevant Counterfactuals and Biologically Normal Interventions, What If?
On the meaning, relevance and epistemology of counterfactual claims and
thought experiments (pp. 1-44). Konstanz: University of Konstanz.
Woodward, J. (2003). Making Things Happen: A Theory of Causal Explanation. New York & Oxford: Oxford University Press.
Woodward, J. (2010). Causation in biology: stability, speciﬁcity, and the
choice of levels of explanation. Biology & Philosophy, 25 (3), 287-318.
Woodward, J. (2012). Causation and Manipulability. The Stanford
Encyclopedia of Philosophy (Winter 2012 Edition). Retrieved from http:
8 Supplementary Online Materials (Online Materials to be posted at http://philsci-archive.pitt.
8.1 The eﬀect of transcription probability
Here we derive the equations of the curves in Figure 7 (reproduced below
as Figure 10) describing the eﬀect of transcription probability on several
informational measures on RNA, DNA and transcription. For ease of
presentation, we will ignore splicing.
To ease reading, we will write the variables RNA as R (with values r_i),
transcription as T (with values t_h), and DNA as D (with values d_j). Again,
hats on variables mean that their values are fixed by a surgical intervention.
8.1.1 The mutual information between RNA and transcription
We suppose that if there is no transcription (h = 0), there is no RNA
strand produced (i = 0), while if there is transcription (h = 1), there is one
RNA strand produced among n possible variants (i = 1 . . . n). This implies
that once a given value for RNA is obtained (either i = 0, i.e. absence,
or i = 1 . . . n) we also know whether transcription was on or off. In other
words, the joint probability for RNA and transcription is given as follows
(see Figure 9):

p(r_i, t̂_h) = p(r_0) if h = 0 and i = 0;
p(r_i, t̂_h) = p(r_i) for h = 1 and i = 1, 2, . . . , n;    (1)
p(r_i, t̂_h) = 0 otherwise.

Also, by computing the marginal probability of transcription, p(t̂_h), we
obtain:

p(t̂_0) = p(r_0, t̂_0) = p(r_0) and p(t̂_1) = Σ_{i=1}^{n} p(r_i, t̂_1) = Σ_{i=1}^{n} p(r_i).    (2)
Figure 9: Diagram showing events with non-null probabilities in our model of transcription, when splicing is ignored. Transcription can be either on ($h = 1$), in which case a DNA strand $j$ will deterministically lead to an RNA strand $j$, or off ($h = 0$), in which case any DNA strand will lead to a null RNA. (Probabilities assigned to events are for illustrative purposes only, but notice that $p(t_0)$ and $p(t_1)$ sum to 1.)
Figure 10: Eﬀects of changing probability of transcription on sev-
eral informational measures: the entropy of RNA (the eﬀect), the
mutual information between RNA and DNA, and the mutual infor-
mation between RNA and the presence of polymerase.
Now, using (1) and (2), we can compute the mutual information between RNA and transcription:
\begin{align}
I(R; \hat{T}) &= p(r_0) \log \frac{p(r_0)}{p(r_0)\, p(\hat{t}_0)} + \sum_{i=1}^{n} p(r_i) \log \frac{p(r_i)}{p(r_i)\, p(\hat{t}_1)}\\
&= p(r_0) \log \frac{1}{p(\hat{t}_0)} + \sum_{i=1}^{n} p(r_i) \log \frac{1}{p(\hat{t}_1)}\\
&= p(\hat{t}_0) \log \frac{1}{p(\hat{t}_0)} + p(\hat{t}_1) \log \frac{1}{p(\hat{t}_1)}\\
&= H(\hat{T})
\end{align}
The result $I(R; \hat{T}) = H(\hat{T})$ simply reflects that there is a bijection between having transcription set to on (respectively off) and obtaining some non-null (respectively null) RNA. In other words, none of the values for transcription lead to convergent results: there is no loss of information about transcription when it occurs (or not).
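As a sanity check, the identity $I(R;\hat{T}) = H(\hat{T})$ can be verified numerically. The following Python sketch is our own illustration, not part of the original derivation; the transcription probability 0.7 and the number of variants $n = 4$ are arbitrary choices. It builds the joint distribution of Figure 9 and compares the mutual information with the entropy of transcription.

```python
import math

def H(dist):
    """Shannon entropy (in bits) of a distribution given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_info(joint):
    """I(X;Y) in bits from a joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Model of Figure 9: transcription on with probability p_t1; when on, one of
# n RNA variants is produced (uniform over DNA strands, for illustration);
# when off, the null RNA r_0 is produced.
p_t1, n = 0.7, 4
joint_rt = {(0, 0): 1 - p_t1}                       # t off -> null RNA
joint_rt.update({(i, 1): p_t1 / n for i in range(1, n + 1)})

p_T = {0: 1 - p_t1, 1: p_t1}
assert abs(mutual_info(joint_rt) - H(p_T)) < 1e-9   # I(R;T) = H(T)
```

The check goes through for any choice of $p(\hat{t}_1)$ and any distribution over the $n$ variants, since only the on/off split of RNA values carries information about transcription.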
8.1.2 The mutual information between RNA and DNA
We suppose that if there is transcription ($h = 1$), a given strand of DNA ($j = 1, \dots, n$) will deterministically lead to a given strand of RNA ($i = 1, \dots, n$). If there is no transcription ($h = 0$), any strand of DNA will lead to no RNA ($i = 0$) (see Figure 9). In other terms, there is a bijection between DNA and RNA if and only if transcription is on; otherwise, all values of DNA lead to the same null result. We also suppose that the state of the polymerase and the choice of a DNA strand to transcribe are independent events.
We begin with:
\begin{equation}
I(R; \hat{D}) = \sum_{i,j} p(r_i, \hat{d}_j) \log \frac{p(r_i, \hat{d}_j)}{p(r_i)\, p(\hat{d}_j)}
\end{equation}
We will now consider how this measure behaves when we take into account the probability of transcription.
To simplify writing, we first notice that many joint events have null probabilities, which makes them cancel out in the calculation of mutual information. These joint events are $(r_{i>0}, \hat{d}_{j \neq i})$: it is impossible to get another strand of RNA than the one the DNA strand codes for (whatever the transcription state, see Figure 9). Thus, without loss of generality, we can write, splitting the cases with non-null ($i > 0$) and null ($i = 0$) RNA:
\begin{equation}
I(R; \hat{D}) = \sum_{i=1}^{n} p(r_i, \hat{d}_i) \log \frac{p(r_i, \hat{d}_i)}{p(r_i)\, p(\hat{d}_i)} + \sum_{j=1}^{n} p(r_0, \hat{d}_j) \log \frac{p(r_0, \hat{d}_j)}{p(r_0)\, p(\hat{d}_j)}
\end{equation}
Using the diagram in Figure 9, we can easily see the following relationships:
\begin{align}
\text{(a)}\quad & p(\hat{d}_i \mid r_i) = 1, && \text{if } i > 0,\\
\text{(b)}\quad & p(\hat{d}_j \mid r_0) = p(\hat{d}_j), && \text{for } j = 1, \dots, n,\\
\text{(c)}\quad & p(r_i \mid \hat{d}_i) = p(\hat{t}_1), && \text{if } i > 0.
\end{align}
Using these relationships, we can simplify $I(R; \hat{D})$. Due to relationships (a) and (b),
\begin{equation}
I(R; \hat{D}) = \sum_{i=1}^{n} p(r_i) \log \frac{1}{p(\hat{d}_i)}
\end{equation}
Due to relationship (c),
\begin{equation}
I(R; \hat{D}) = p(\hat{t}_1) \sum_{i=1}^{n} p(\hat{d}_i) \log \frac{1}{p(\hat{d}_i)} = p(\hat{t}_1)\, H(\hat{D}) \tag{18}
\end{equation}
This equation reflects the fact that the informativity of DNA is conditional upon the presence of transcription. If transcription were always on, there would be a bijection between DNA and RNA. However, when transcription is sometimes off, there is a loss of information between DNA and the RNA outputs, as several strands of DNA can lead to the same result (no RNA) when there is no transcription. The information loss is simply that part of the DNA entropy which is not present in the mutual information between DNA and RNA, that is, $H(\hat{D}) - I(R; \hat{D}) = p(\hat{t}_0)\, H(\hat{D})$.
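Equation (18) can likewise be checked numerically. The Python sketch below is our own illustration: the distribution over three DNA strands and the transcription probability of 0.6 are arbitrary hypothetical values. The mutual information computed from the joint distribution matches $p(\hat{t}_1)\, H(\hat{D})$.

```python
import math

def H(probs):
    """Shannon entropy (in bits) of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_info(joint):
    """I(X;Y) in bits from a joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

p_t1 = 0.6
p_D = [0.5, 0.25, 0.25]        # hypothetical distribution over DNA strands
joint_rd = {}                  # keys: (r, d)
for j, pd in enumerate(p_D, start=1):
    joint_rd[(j, j)] = p_t1 * pd         # transcription on: r_j from d_j
    joint_rd[(0, j)] = (1 - p_t1) * pd   # transcription off: null RNA

assert abs(mutual_info(joint_rd) - p_t1 * H(p_D)) < 1e-9   # equation (18)
```

Setting `p_t1 = 1` recovers the full bijection case, $I(R;\hat{D}) = H(\hat{D})$.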
8.1.3 The entropy of RNA
Here we derive the entropy of RNA in terms of the mutual information between RNA and DNA and the entropy of transcription. We again split between the cases where there is transcription ($\hat{t}_1$) or none ($\hat{t}_0$). We again use the fact that $p(r_0) = p(\hat{t}_0)$ and that $p(r_i) = p(\hat{t}_1)\, p(\hat{d}_i)$ for $i > 0$. We also remark that $\sum_{i=1}^{n} p(r_i)$ sums to $p(\hat{t}_1)$:
\begin{align}
H(R) &= -\sum_{i=0}^{n} p(r_i) \log p(r_i) \tag{21}\\
&= -\sum_{i=1}^{n} p(\hat{t}_1)\, p(\hat{d}_i) \log \big(p(\hat{t}_1)\, p(\hat{d}_i)\big) - p(r_0) \log p(r_0) \tag{22}\\
&= p(\hat{t}_1)\, H(\hat{D}) - p(\hat{t}_1) \log p(\hat{t}_1) - p(r_0) \log p(r_0) \tag{23}\\
H(R) &= p(\hat{t}_1)\, H(\hat{D}) + H(\hat{T}) \tag{24}
\end{align}
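The decomposition $H(R) = p(\hat{t}_1)\, H(\hat{D}) + H(\hat{T})$ can be verified directly on the marginal distribution of RNA. The Python sketch below uses the same arbitrary hypothetical parameters as before (three DNA strands, transcription probability 0.6); it is an illustration of the identity, not part of the original derivation.

```python
import math

def H(probs):
    """Shannon entropy (in bits) of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_t1 = 0.6
p_D = [0.5, 0.25, 0.25]                        # hypothetical DNA distribution
# Marginal of RNA: r_0 occurs when transcription is off; r_i = t on and d_i chosen.
p_R = [1 - p_t1] + [p_t1 * pd for pd in p_D]

lhs = H(p_R)                                   # entropy of RNA
rhs = p_t1 * H(p_D) + H([p_t1, 1 - p_t1])      # p(t_1) H(D) + H(T), equation (24)
assert abs(lhs - rhs) < 1e-9
```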
8.2 The mutual information between RNA and splicing
8.2.1 When transcription is always on
Here we derive the equations for the mutual information between RNA and splicing.
For the sake of simplicity, we shall first ignore transcription probability and assume that $p(\hat{t}_1) = 1$. This amounts to relaxing the conditionalisation on transcription.
In the model considered here the splicing factor variants are recruited only once a given strand of DNA has been transcribed. In addition, we suppose that the transcription of a given DNA strand opens a set of possibilities among a proper set of splicing factors (see Figure 11). This entails that the information in splicing, $H(\hat{S})$, contains all the information in DNA, $H(\hat{D})$.
In addition, we consider a bijective relationship between splicing factors and RNA variants. This bijection entails that the mutual information between RNA and splicing is equal to the self-information of splicing (that is, the entropy of splicing). We can then decompose the entropy of splicing according to well-known chain rules:
\begin{equation}
I(R; \hat{S}) = H(\hat{S}) = H(\hat{D}) + H(\hat{S} \mid \hat{D}) \tag{31}
\end{equation}
From equation (18), we know that $H(\hat{D}) = I(R; \hat{D})$, assuming that transcription always occurs. In addition, the bijection between splicing and RNA (including the null value) entails that the conditional entropy of splicing (conditioned on DNA) is the conditional mutual information of splicing and RNA given $\hat{D}$, that is, $H(\hat{S} \mid \hat{D}) = I(R; \hat{S} \mid \hat{D})$ (as an immediate calculation would show).

Figure 11: Diagram of our model of splicing, when transcription is assumed to be on. A DNA strand deterministically leads to a proper set of splicing factor variants, each of them deterministically leading to a proper RNA strand.

We can thus rewrite equation (31) as:
\begin{equation}
I(R; \hat{S}) = H(\hat{S}) = I(R; \hat{D}) + I(R; \hat{S} \mid \hat{D}) \tag{32}
\end{equation}
Readers familiar with information theory will recognize the decomposition of the mutual information $I(R; \hat{D}, \hat{S})$, which happens to be, in this particular example, equal to $I(R; \hat{S})$. That is, knowing the value of DNA does not bring us any information as regards RNA in addition to knowing the value of splicing. Notice that equation (32) also provides a decomposition of the entropy of splicing, that is, $H(\hat{S}) = I(R; \hat{S})$, in virtue of the bijection between RNA and splicing.
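Equation (32) can be checked on a minimal instance of the model. The Python sketch below is our own toy example (two DNA strands, each opening its own pair of hypothetical splicing-factor variants, all uniform): it computes the relevant quantities from the joint distribution and verifies $I(R;\hat{S}) = H(\hat{S}) = I(R;\hat{D}) + I(R;\hat{S}\mid\hat{D})$, obtaining the conditional mutual information via the chain rule $I(R;\hat{S}\mid\hat{D}) = I(R;\hat{S},\hat{D}) - I(R;\hat{D})$.

```python
import math
from collections import defaultdict

def marginal(joint, idx):
    """Marginal distribution over the coordinates listed in idx."""
    m = defaultdict(float)
    for k, p in joint.items():
        m[tuple(k[i] for i in idx)] += p
    return m

def mutual_info(joint, ix, iy):
    """I(X;Y) in bits; ix, iy are coordinate-index tuples into the joint's keys."""
    pxy = marginal(joint, ix + iy)
    px, py = marginal(joint, ix), marginal(joint, iy)
    return sum(p * math.log2(p / (px[k[:len(ix)]] * py[k[len(ix):]]))
               for k, p in pxy.items() if p > 0)

# Toy model: 2 DNA strands, each opening its own pair of splicing-factor
# variants; splicing variant s maps one-to-one onto RNA variant r = s.
joint = {}
for d, svars in [(1, (1, 2)), (2, (3, 4))]:
    for s in svars:
        joint[(s, s, d)] = 0.25          # keys: (r, s, d); everything uniform

I_RS = mutual_info(joint, (0,), (1,))
I_RD = mutual_info(joint, (0,), (2,))
I_RS_given_D = mutual_info(joint, (0,), (1, 2)) - I_RD   # chain rule
H_S = -sum(p * math.log2(p) for p in marginal(joint, (1,)).values())

assert abs(I_RS - H_S) < 1e-9                      # bijection R <-> S
assert abs(I_RS - (I_RD + I_RS_given_D)) < 1e-9    # equation (32)
```

In this instance $H(\hat{S}) = 2$ bits, split evenly between the DNA contribution $I(R;\hat{D}) = 1$ bit and the splicing contribution $I(R;\hat{S}\mid\hat{D}) = 1$ bit.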
8.2.2 When transcription can be either on or oﬀ
For the sake of completeness, we now give equation (32) in a version taking into account the probability of transcription. The reasoning is grounded on the hypothesis that a given splicing factor occurs only when there is transcription and a given DNA strand has been chosen. Then, the decomposition of the entropy of splicing reads:
\begin{equation}
H(\hat{S}) = H(\hat{T}) + I(R; \hat{D} \mid \hat{T}) + H(\hat{S} \mid \hat{D}, \hat{T}) \tag{36}
\end{equation}
Again, we take advantage of the bijection between splicing and RNA to replace $H(\hat{S} \mid \hat{D}, \hat{T})$ by $I(R; \hat{S} \mid \hat{D}, \hat{T})$. We also take advantage of the fact that there is no interaction information between DNA, RNA, and transcription, that is, $I(R; \hat{D} \mid \hat{T}) = I(R; \hat{D})$. This can be shown with the calculation sketched below. We again use relationship (1) to simplify, hence if $i > 0$ we have $p(r_i, \hat{d}_j, \hat{t}_1) = p(r_i, \hat{d}_j)$ and $p(r_i, \hat{t}_1) = p(r_i)$. A similar replacement method would hold for $i = 0$, but we directly simplify this term as it is null:
\begin{align}
I(R; \hat{D} \mid \hat{T}) &= 0 + \sum_{i=1}^{n} p(r_i, \hat{d}_i, \hat{t}_1) \log \frac{p(r_i, \hat{d}_i \mid \hat{t}_1)}{p(r_i \mid \hat{t}_1)\, p(\hat{d}_i \mid \hat{t}_1)}\\
&= \sum_{i=1}^{n} p(r_i) \log \frac{1}{p(\hat{d}_i)}\\
&= I(R; \hat{D})
\end{align}
Injecting these terms in equation (36), we obtain:
\begin{equation}
H(\hat{S}) = H(\hat{T}) + I(R; \hat{D}) + I(R; \hat{S} \mid \hat{D}, \hat{T})
\end{equation}
Again, noticing that $H(\hat{T}) = I(R; \hat{T})$, we retrieve an equation similar to equation (32):
\begin{equation}
I(R; \hat{S}) = H(\hat{S}) = I(R; \hat{D}) + I(R; \hat{T}) + I(R; \hat{S} \mid \hat{D}, \hat{T}) \tag{45}
\end{equation}
Readers familiar with information theory will recognize the decomposition of the mutual information $I(R; \hat{D}, \hat{T}, \hat{S})$, which happens to be, in this particular example, equal to $I(R; \hat{S})$. That is, knowing the values of DNA and transcription does not bring us any more information as regards RNA than just knowing the value of splicing. Notice that, similarly to equation (32) in the case where transcription is always on, equation (45) provides a decomposition of the entropy of splicing, that is, $H(\hat{S}) = I(R; \hat{S})$, in virtue of the bijection between RNA and splicing.
To wrap up, in this model transcription adds variation to the set of splicing factor variants (the absence of any factor now belongs to the set of possibilities), and this variation is independent of DNA.
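Equation (45) can be verified on the same toy model once transcription is allowed to be off. The Python sketch below is again our own illustration with arbitrary parameters: it adds a null splicing/RNA outcome when transcription is off and checks the full decomposition.

```python
import math
from collections import defaultdict

def marginal(joint, idx):
    """Marginal distribution over the coordinates listed in idx."""
    m = defaultdict(float)
    for k, p in joint.items():
        m[tuple(k[i] for i in idx)] += p
    return m

def mutual_info(joint, ix, iy):
    """I(X;Y) in bits; ix, iy are coordinate-index tuples into the joint's keys."""
    pxy = marginal(joint, ix + iy)
    px, py = marginal(joint, ix), marginal(joint, iy)
    return sum(p * math.log2(p / (px[k[:len(ix)]] * py[k[len(ix):]]))
               for k, p in pxy.items() if p > 0)

# Toy model: transcription on with probability p_t1; when on, the chosen DNA
# strand recruits one of its own two splicing-factor variants, and splicing
# maps one-to-one onto RNA (r = s); when off, s = r = 0 while DNA stays chosen.
p_t1, p_D = 0.6, [0.5, 0.5]
pairs = {1: (1, 2), 2: (3, 4)}           # DNA strand -> its splicing variants
joint = {}                               # keys: (r, s, d, t)
for d, pd in enumerate(p_D, start=1):
    joint[(0, 0, d, 0)] = (1 - p_t1) * pd        # t off: no factor, null RNA
    for s in pairs[d]:
        joint[(s, s, d, 1)] = p_t1 * pd / 2      # t on: s recruited, r = s

I_RS = mutual_info(joint, (0,), (1,))
I_RD = mutual_info(joint, (0,), (2,))
I_RT = mutual_info(joint, (0,), (3,))
I_RS_given_DT = (mutual_info(joint, (0,), (1, 2, 3))
                 - mutual_info(joint, (0,), (2, 3)))     # chain rule
H_S = -sum(p * math.log2(p) for p in marginal(joint, (1,)).values())

assert abs(I_RS - H_S) < 1e-9                              # bijection R <-> S
assert abs(H_S - (I_RD + I_RT + I_RS_given_DT)) < 1e-9     # equation (45)
```

Note that $I(R;\hat{D})$ comes out as $p(\hat{t}_1)\, H(\hat{D}) = 0.6$ bits here, in agreement with equation (18).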
8.3 Alternative splicing in Drosophila Dscam
The Drosophila receptor DSCAM (Down Syndrome Cell Adhesion Molecule), a member of the immunoglobulin (Ig) superfamily, is a remarkable example of homophilic binding specificity that functions in important biological processes, such as innate immunity and neural wiring. In insects and also crustaceans (e.g. Daphnia), 4 of the 24 exons of the Dscam gene are arranged in large tandem arrays, whose regulation is an example of mutually exclusive splicing. In Drosophila, one block has 2 exons, leading to 1 of 2 alternative transmembrane segments; the others contain respectively 12, 48 and 33 alternative exons, leading to 19,008 different ecto-domains. Together they produce 38,016 alternative protein isoforms, within a genome of 15,016 protein-coding genes. There are several interesting aspects of this system:
1. For each block of exons there seems to exist a unique mechanism that ensures that only one of the alternative exons is included in the final transcript. Only two of the mechanisms are known in some detail. Researchers have identified specific cis-acting sequences and trans-acting splicing factors that tightly regulate splicing of exon 4.2, but for most others the details are not yet known [4, 3].
2. It is not only the large number of alternative transcripts that allows for a high diversity of functions; in addition, most alternative exons are expressed in neurons and found in many combinations. Neurons express up to 50 variants at a time, which makes for an even larger combinatorial spectrum of neuron differentiation. This ensures that branches from different neurons will share, at most, a few isoforms in common. This diversity enables branches of neurons to distinguish between sister branches and branches of other neurons, and also allows for the patterning of neural circuits.
3. There seem to be distinct ways of regulating isoforms in the two different functions. For self-recognition purposes, neurons seem to express DSCAM isoforms in a stochastic yet biased fashion. Which isoform is expressed in a single neuron is unimportant as long as it is sufficiently different from its neighbour's. It might simply be an indirect consequence of the expression of different splicing factors in different neurons that leads to this bias. For appropriate branching patterns, however, the research to date suggests that the expression of Dscam isoforms in some neurons is under tight developmental control. So we find a controlled mix of stochasticity and regulation in the expression of Dscam in Drosophila.
4. Dscam is homologous between almost all animals, which places its origin over 600 million years ago, before the split between the deuterostomes and protostomes. But while in vertebrates the two homologous genes, Dscam and DscamL1, do not encode multiple isoforms, in arthropods the single gene is highly enriched with alternative exons. This leads to the interesting hypothesis that while in simple animals cell adhesion and cell recognition are controlled by complex genes, in complex animals this is done by relatively simple genes. This raises the question of how to account for a molecular diversity large enough to provide specificity for the extraordinarily large number of neurons in the more complex vertebrate brains.
Vertebrates seem to manage their increase in cell recognition specificities through the combinatorial association of different recognition systems, such as gene duplication and the successive divergence of other loci, and via the graded expression of recognition proteins. There exists a large range of cell adhesion, recognition and surface receptor genes in vertebrates: the calcium-independent Ig superfamily, and the calcium-dependent integrins, cadherins, and selectins. The human immunoglobulins (Ig) are the products of three unlinked sets of genes: the immunoglobulin heavy (IGH), the immunoglobulin kappa (IGK), and the immunoglobulin lambda (IGL) genes, with a total of about 150 functional genes. A large number of cadherin superfamily genes have been identified to date, and most of them seem to be expressed in the CNS. At least 80 members of the cadherin superfamily have been shown to be expressed within a single mammalian species. Integrins comprise two different chains, the alpha and beta subunits: mammals possess eighteen alpha and eight beta subunits, while Drosophila has five and two, respectively.
This genetic diversity is combined with complex regulatory patterns. One example is the neurexin and neuroligin proteins in humans, which are all encoded by multiple genes. Neurexin is encoded by three genes, each controlled by two promoters, which produce 6 main forms of neurexin. These genes display relatively extensive alternative splicing, a process that can potentially generate thousands of neurexin isoforms alone [2, 6]. Splice-form diversity is most extensive in the mammalian brain.
1. Alicia M. Celotto and Brenton R. Graveley. Alternative splicing of the Drosophila Dscam pre-mRNA is both temporally and spatially regulated. Genetics, 159:599–608, 2001.
2. Mack E. Crayton III, Bradford C. Powell, Todd J. Vision, and Morgan C. Giddings. Tracking the evolution of alternatively spliced exons within the Dscam family. BMC Evolutionary Biology, 6(16), 2006.
3. Brenton R. Graveley. Mutually exclusive splicing of the insect Dscam pre-mRNA directed by competing intronic RNA secondary structures. Cell, 123:65–73, 2005.
4. Jung Woo Park and Brenton R. Graveley. Complex alternative splicing. In Benjamin J. Blencowe and Brenton R. Graveley, editors, Alternative Splicing in the Postgenomic Era, number 623 in Advances in Experimental Medicine and Biology, pages 50–63. Landes Bioscience and Springer Science+Business Media, 2007.
5. Dietmar Schmucker and Brian Chen. Dscam and DSCAM: complex genes in simple animals, complex animals yet simple genes. Genes Dev., 23:147–156, 2009.
6. Barbara Treutlein, Ozgun Gokce, Stephen R. Quake, and Thomas C. Sudhof. Cartography of neurexin alternative splicing mapped by single-molecule long-read mRNA sequencing. PNAS, 111(13):E1291–E1299, 2014.
7. Julian P. Venables, Jamal Tazi, and François Juge. Regulated functional alternative splicing in Drosophila. Nucleic Acids Research, 40(1):1–10, 2012.
8. Woj M. Wojtowicz, John J. Flanagan, S. Sean Millard, S. Lawrence Zipursky, and James C. Clemens. Alternative splicing of Drosophila Dscam generates axon guidance receptors that exhibit isoform-specific homophilic binding. Cell, 118(5):619–633, 2004.
9. S. Lawrence Zipursky, Woj M. Wojtowicz, and Daisuke Hattori. Got diversity? Wiring the fly brain with Dscam. Trends in Biochemical Sciences, 31(10):581–588, 2006.