ABSTRACT: Information Geometric Causal Inference (IGCI) is a new approach to
distinguish between cause and effect for two variables. It is based on an
independence assumption between input distribution and causal mechanism that
can be phrased in terms of orthogonality in information space. We describe two
intuitive reinterpretations of this approach that make IGCI more accessible to
a broader audience.
Moreover, we show that the described independence is related to the
hypothesis that unsupervised learning and semi-supervised learning only work
for predicting the cause from the effect and not vice versa.
ABSTRACT: While conventional approaches to causal inference are mainly based on conditional (in)dependences, recent methods also account for the shape of (conditional) distributions. The idea is that the causal hypothesis “X causes Y” imposes that the marginal distribution $P_X$ and the conditional distribution $P_{Y|X}$ represent independent mechanisms of nature. Recently it has been postulated that the shortest description of the joint distribution $P_{X,Y}$ should therefore be given by separate descriptions of $P_X$ and $P_{Y|X}$. Since description length in the sense of Kolmogorov complexity is uncomputable, practical implementations rely on other notions of independence. Here we define independence via orthogonality in information space. This way, we can explicitly describe the kind of dependence that occurs between $P_Y$ and $P_{X|Y}$ making the causal hypothesis “Y causes X” implausible. Remarkably, this asymmetry between cause and effect becomes particularly simple if X and Y are deterministically related. We present an inference method that works in this case. We also discuss some theoretical results for the non-deterministic case although it is not clear how to employ them for a more general inference method.
ABSTRACT: We consider two variables that are related to each other by an invertible
function. While it has previously been shown that the dependence structure of
the noise can provide hints to determine which of the two variables is the
cause, we presently show that even in the deterministic (noise-free) case,
there are asymmetries that can be exploited for causal inference. Our method is
based on the idea that if the function and the probability density of the cause
are chosen independently, then the distribution of the effect will, in a
certain sense, depend on the function. We provide a theoretical analysis of
this method, showing that it also works in the low noise regime, and link it to
information geometry. We report strong empirical results on various real-world
data sets from different domains.
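The deterministic asymmetry described above can be illustrated with a small sketch. The abstract does not spell out an estimator, so the slope-based score below, the uniform reference measure (both variables rescaled to [0, 1]), and the names `igci_slope` and `infer_direction` are assumptions made for illustration only, not necessarily the authors' implementation.

```python
import numpy as np


def igci_slope(x, y):
    """Slope-based score for the hypothesis "x causes y" (illustrative assumption).

    Both variables are rescaled to [0, 1] (uniform reference measure), the
    points are sorted by x, and the mean log absolute slope between
    neighbouring points is returned. Lower values favour x -> y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    order = np.argsort(x)
    dx = np.diff(x[order])
    dy = np.diff(y[order])
    keep = (dx != 0) & (dy != 0)  # skip ties, where the log slope is undefined
    return float(np.mean(np.log(np.abs(dy[keep] / dx[keep]))))


def infer_direction(x, y):
    """Return "X->Y" if the forward score is smaller, otherwise "Y->X"."""
    return "X->Y" if igci_slope(x, y) < igci_slope(y, x) else "Y->X"


# Toy example: a uniform grid as the cause and an invertible nonlinear map.
x_demo = np.linspace(0.0, 1.0, 201)
y_demo = x_demo ** 3
direction = infer_direction(x_demo, y_demo)
```

In this toy case the forward score is negative and the backward score is its exact negation, so the sketch prefers X->Y. Real data with noise or ties would need more care than shown here.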
ABSTRACT: A directed acyclic graph (DAG) partially represents the conditional independence structure among observations of a system if the local Markov condition holds, that is, if every variable is independent of its non-descendants given its parents. In general, there is a whole class of DAGs that represents a given set of conditional independence relations. We are interested in properties of this class that can be derived from observations of a subsystem only. To this end, we prove an information theoretic inequality that allows for the inference of common ancestors of observed parts in any DAG representing some unknown larger system. More explicitly, we show that a large amount of dependence in terms of mutual information among the observations implies the existence of a common ancestor that distributes this information. Within the causal interpretation of DAGs our result can be seen as a quantitative extension of Reichenbach's Principle of Common Cause to more than two variables. Our conclusions are valid also for non-probabilistic observations such as binary strings, since we state the proof for an axiomatized notion of mutual information that includes the stochastic as well as the algorithmic version. Comment: 18 pages, 4 figures
ABSTRACT: The causal Markov condition (CMC) is a postulate that links observations to causality. It describes the conditional independences among the observations that are entailed by a causal hypothesis in terms of a directed acyclic graph. In the conventional setting, the observations are random variables and the independence is a statistical one, i.e., the information content of observations is measured in terms of Shannon entropy. We formulate a generalized CMC for any kind of observations on which independence is defined via an arbitrary submodular information measure. Recently, this has been discussed for observations in terms of binary strings where information is understood in the sense of Kolmogorov complexity. Our approach enables us to find computable alternatives to Kolmogorov complexity, e.g., the length of a text after applying existing data compression schemes. We show that our CMC is justified if one restricts the attention to a class of causal mechanisms that is adapted to the respective information measure. Our justification is similar to deriving the statistical CMC from functional models of causality, where every variable is a deterministic function of its observed causes and an unobserved noise term. Our experiments on real data demonstrate the performance of compression based causal inference. Comment: 21 pages, 4 figures
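As a rough illustration of using compressed length as a computable information measure, the sketch below takes the zlib-compressed size of a byte string as its information content and forms a mutual-information-like quantity from it. The choice of zlib, the formula C(x) + C(y) - C(xy), and the function names are assumptions for illustration; the abstract does not say which compressor or estimator the authors used.

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Information content of `data`, approximated by its zlib-compressed size."""
    return len(zlib.compress(data, 9))


def compression_mutual_information(x: bytes, y: bytes) -> int:
    """Mutual-information analogue: C(x) + C(y) - C(x concatenated with y)."""
    return compressed_length(x) + compressed_length(y) - compressed_length(x + y)


# Toy data: two independent pseudo-random strings (seeded for reproducibility).
rng = random.Random(0)
a = bytes(rng.getrandbits(8) for _ in range(2000))
b = bytes(rng.getrandbits(8) for _ in range(2000))

shared = compression_mutual_information(a, a)     # a string shares everything with itself
unrelated = compression_mutual_information(a, b)  # independent strings share almost nothing
```

Here `shared` comes out close to the compressed length of `a`, while `unrelated` stays near zero, which is the qualitative behaviour one would want from such a measure. Real compressors only approximate the submodularity assumed in the theory.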
ABSTRACT: A recent method for causal discovery is in many cases able to infer whether X causes Y or Y causes X for just two observed variables X and Y. It is based on the observation that there exist (non-Gaussian) joint distributions P(X,Y) for which Y may be written as a function of X up to an additive noise term that is independent of X and no such model exists from Y to X. Whenever this is the case, one prefers the causal model X → Y. Here we justify this method by showing that the causal hypothesis Y → X is unlikely because it requires a specific tuning between P(Y) and P(X|Y) to generate a distribution that admits an additive noise model from X to Y. To quantify the amount of tuning required we derive lower bounds on the algorithmic information shared by P(Y) and P(X|Y). This way, our justification is consistent with recent approaches for using algorithmic information theory for causal reasoning. We extend this principle to the case where P(X,Y) almost admits an additive noise model. Our results suggest that the above conclusion is more reliable if the complexity of P(Y) is high. Comment: 17 pages, 1 Figure
Open Systems & Information Dynamics 10/2009; DOI:10.1142/S1230161210000126 · 0.69 Impact Factor
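The additive-noise idea in the abstract above can be sketched as follows: regress the effect on the candidate cause, then measure how strongly the residuals depend on the input. The polynomial regression, the biased HSIC estimate with Gaussian kernels and a median-based bandwidth, and the names `hsic` and `anm_score` are all assumptions chosen to keep the example short. The abstract does not prescribe a regression method or an independence test.

```python
import numpy as np


def hsic(u, v):
    """Biased HSIC estimate with Gaussian kernels (larger means more dependent)."""
    def gram(z):
        d2 = (z[:, None] - z[None, :]) ** 2
        bandwidth = np.median(d2[d2 > 0])  # median heuristic on squared distances
        return np.exp(-d2 / bandwidth)

    n = len(u)
    centering = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(gram(u) @ centering @ gram(v) @ centering)) / n ** 2


def anm_score(cause, effect, degree=3):
    """Dependence between the input and the residuals of a polynomial fit.

    A small score means an additive noise model cause -> effect fits well.
    """
    coeffs = np.polyfit(cause, effect, degree)
    residuals = effect - np.polyval(coeffs, cause)
    return hsic(cause, residuals)


# Toy data: Y = X^3 + noise, with non-Gaussian (uniform) input and noise.
rng = np.random.default_rng(0)
x_data = rng.uniform(-2.0, 2.0, 300)
y_data = x_data ** 3 + rng.uniform(-1.0, 1.0, 300)

forward = anm_score(x_data, y_data)   # hypothesis X -> Y
backward = anm_score(y_data, x_data)  # hypothesis Y -> X
```

One would prefer the direction with the smaller score; for data generated like this, the forward score is expected to be the smaller of the two. A proper application would replace the raw comparison with a calibrated independence test.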
ABSTRACT: We pose a problem called "broadcasting Holevo-information": given an
unknown state taken from an ensemble, the task is to generate a bipartite state
transferring as much Holevo-information to each copy as possible.
We argue that upper bounds on the average information over both copies imply
lower bounds on the quantum capacity required to send the ensemble without
information loss. This is because a channel with zero quantum capacity has a
unitary extension transferring at least as much information to its environment
as it transfers to the output.
For an ensemble being the time orbit of a pure state under a Hamiltonian
evolution, we derive such a bound on the required quantum capacity in terms of
properties of the input and output energy distribution. Moreover, we discuss
relations between the broadcasting problem and entropy power inequalities.
The broadcasting problem arises when a signal should be transmitted by a
time-invariant device such that the outgoing signal has the same timing
information as the incoming signal had. Based on previous results we argue that
this establishes a link between quantum information theory and the theory of
low power computing because the loss of timing information implies loss of free
energy.
Physical Review A 09/2006; 75(2). DOI:10.1103/PhysRevA.75.022309 · 2.81 Impact Factor
ABSTRACT: Setting: We use Bayesian nets as a formalization of the probabilistic and causal relations of a system and present a result that describes how information theoretic means can contribute to the causal inference process.
Task of Causal Inference: Starting from an observation of a subsystem in terms of a probability distribution of random variables, determine the class of Bayesian nets that are consistent with the observation.
[Figure: an observed subsystem $X_1, \dots, X_4$ with distribution $p(x_1, \dots, x_4)$; inference asks for a Bayesian net with joint distribution $p_B$ such that $p(x_1, \dots, x_4) = \sum_u p_B(x_1, \dots, x_4, u)$.]
Graphical Models and Information Theory: For a given directed acyclic graph G whose nodes are discrete random variables $X_1, \dots, X_n$, denote by P(G) the family of joint probability distributions which factor according to G. Then for an arbitrary distribution p of the $X_i$,
$$I_p(X_1, \dots, X_n) = D(p \,\|\, P(G)) + \sum_{i=1}^{n} I_p(X_i, \mathrm{parents}(X_i)), \qquad (*)$$
where
• $D(p \,\|\, P(G)) := \inf_{q \in P(G)} D(p \,\|\, q)$ is the distance of p from the family of distributions P(G), measured in terms of Kullback-Leibler divergence $D(p \,\|\, q) = \sum p \log (p/q)$.
• $I_p$ is the (generalized) mutual information $I_p(X_1, \dots, X_n) = D(p \,\|\, p(x_1) \otimes \cdots \otimes p(x_n))$.
General Question: Relation (*) only holds for distributions p defined on all nodes of the graph, that is, if the whole system has been observed. What can be said in cases of incomplete knowledge, that is, if there are unobserved variables?
[...] U (1) there are no causal interactions among the components of the observed subsystem and (2) the mutual information of the subsystem is maximal and all variables have equal entropy: $H(U) \ge H(X_1) = \cdots = H(X_n)$.
Example (Maximal Interaction): Distributions of binary variables of the form $p_a(x_1, \dots, x_n) \sim \exp(a\, x_1 \cdots x_n)$ ($a \in \mathbb{R}$, $x_i \in \{-1, 1\}$) can be generated using only common roots of order two.
• Result holds also in the algorithmic causal setting introduced in [4] when substituting Kolmogorov complexity for entropy.
Future Work:
1. How far can one go in characterizing causal models using only entropy-like quantities?
2. Consider the decomposition of mutual information into terms originating from the projection of p onto l-interaction spaces: $I(p) = \sum_{l=2}^{k} D(p^{(l)} \,\|\, p^{(l-1)})$. Do causal interpretations for these interaction terms exist?
3. Derive heuristics for causal inference algorithms from information theoretic results as above.
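For readers who want to see why relation (*) holds, a short derivation can be sketched as follows. The notation $H_p$ for Shannon entropy under p, $\mathrm{pa}_i$ for the values of $\mathrm{parents}(X_i)$, and $q^*$ for the minimizing distribution are introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
q^*(x_1,\dots,x_n) &:= \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}_i)
  && \text{(attains } \inf_{q \in P(G)} D(p \,\|\, q) \text{ by Gibbs' inequality)} \\
D(p \,\|\, P(G)) &= D(p \,\|\, q^*)
  = \sum_{i=1}^{n} H_p(X_i \mid \mathrm{pa}_i) - H_p(X_1,\dots,X_n) \\
I_p(X_1,\dots,X_n) &= \sum_{i=1}^{n} H_p(X_i) - H_p(X_1,\dots,X_n) \\
I_p(X_1,\dots,X_n) - D(p \,\|\, P(G))
  &= \sum_{i=1}^{n} \bigl[ H_p(X_i) - H_p(X_i \mid \mathrm{pa}_i) \bigr]
  = \sum_{i=1}^{n} I_p(X_i, \mathrm{parents}(X_i))
\end{align*}
```

In particular, when p itself factors according to G the divergence term vanishes and the mutual information reduces to the sum of the parent-child terms.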