Publications (10)
8.37 Total impact
ABSTRACT: Information Geometric Causal Inference (IGCI) is a new approach to distinguish between cause and effect for two variables. It is based on an independence assumption between input distribution and causal mechanism that can be phrased in terms of orthogonality in information space. We describe two intuitive reinterpretations of this approach that make IGCI more accessible to a broader audience. Moreover, we show that the described independence is related to the hypothesis that unsupervised learning and semi-supervised learning only work for predicting the cause from the effect and not vice versa.
ABSTRACT: While conventional approaches to causal inference are mainly based on conditional (in)dependences, recent methods also account for the shape of (conditional) distributions. The idea is that the causal hypothesis “X causes Y” imposes that the marginal distribution $P_X$ and the conditional distribution $P_{Y|X}$ represent independent mechanisms of nature. Recently it has been postulated that the shortest description of the joint distribution $P_{X,Y}$ should therefore be given by separate descriptions of $P_X$ and $P_{Y|X}$. Since description length in the sense of Kolmogorov complexity is uncomputable, practical implementations rely on other notions of independence. Here we define independence via orthogonality in information space. This way, we can explicitly describe the kind of dependence that occurs between $P_Y$ and $P_{X|Y}$, making the causal hypothesis “Y causes X” implausible. Remarkably, this asymmetry between cause and effect becomes particularly simple if X and Y are deterministically related. We present an inference method that works in this case. We also discuss some theoretical results for the nondeterministic case, although it is not clear how to employ them for a more general inference method.
ABSTRACT: We consider two variables that are related to each other by an invertible function. While it has previously been shown that the dependence structure of the noise can provide hints to determine which of the two variables is the cause, we presently show that even in the deterministic (noise-free) case, there are asymmetries that can be exploited for causal inference. Our method is based on the idea that if the function and the probability density of the cause are chosen independently, then the distribution of the effect will, in a certain sense, depend on the function. We provide a theoretical analysis of this method, showing that it also works in the low noise regime, and link it to information geometry. We report strong empirical results on various real-world data sets from different domains.
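The asymmetry described in this abstract can be illustrated with a slope-based score, one estimator that appears in the IGCI literature. The following is a minimal sketch under stated assumptions, not the authors' implementation: both variables are rescaled to [0, 1] (a uniform reference measure), the relation is deterministic and invertible, and the names `slope_score` and `infer_direction` as well as the cubic example function are hypothetical choices made for illustration.

```python
import math
import random


def _rescale(values):
    """Map values linearly onto [0, 1] (assumed uniform reference measure)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def slope_score(cause, effect):
    """Average log-slope of effect against cause, after sorting by cause.

    A lower score in one direction is read as evidence for that direction.
    """
    pairs = sorted(zip(_rescale(cause), _rescale(effect)))
    total, count = 0.0, 0
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx > 0 and dy != 0:  # skip ties, where the log-slope is undefined
            total += math.log(abs(dy) / dx)
            count += 1
    return total / count


def infer_direction(x, y):
    """Return the direction with the smaller slope score."""
    return "X->Y" if slope_score(x, y) < slope_score(y, x) else "Y->X"


# Illustrative data (an assumption, not from the paper): a uniform cause
# pushed through the invertible map f(x) = x**3, with no noise.
random.seed(0)
x = [random.random() for _ in range(500)]
y = [v ** 3 for v in x]
print(infer_direction(x, y))
```

For a monotone noise-free relation, the two scores are negatives of each other in this sketch, so the decision reduces to the sign of the average log-slope.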
ABSTRACT: A directed acyclic graph (DAG) partially represents the conditional independence structure among observations of a system if the local Markov condition holds, that is, if every variable is independent of its non-descendants given its parents. In general, there is a whole class of DAGs that represents a given set of conditional independence relations. We are interested in properties of this class that can be derived from observations of a subsystem only. To this end, we prove an information-theoretic inequality that allows for the inference of common ancestors of observed parts in any DAG representing some unknown larger system. More explicitly, we show that a large amount of dependence in terms of mutual information among the observations implies the existence of a common ancestor that distributes this information. Within the causal interpretation of DAGs our result can be seen as a quantitative extension of Reichenbach's Principle of Common Cause to more than two variables. Our conclusions are valid also for non-probabilistic observations such as binary strings, since we state the proof for an axiomatized notion of mutual information that includes the stochastic as well as the algorithmic version. Comment: 18 pages, 4 figures
ABSTRACT: The causal Markov condition (CMC) is a postulate that links observations to causality. It describes the conditional independences among the observations that are entailed by a causal hypothesis in terms of a directed acyclic graph. In the conventional setting, the observations are random variables and the independence is a statistical one, i.e., the information content of observations is measured in terms of Shannon entropy. We formulate a generalized CMC for any kind of observations on which independence is defined via an arbitrary submodular information measure. Recently, this has been discussed for observations in terms of binary strings where information is understood in the sense of Kolmogorov complexity. Our approach enables us to find computable alternatives to Kolmogorov complexity, e.g., the length of a text after applying existing data compression schemes. We show that our CMC is justified if one restricts the attention to a class of causal mechanisms that is adapted to the respective information measure. Our justification is similar to deriving the statistical CMC from functional models of causality, where every variable is a deterministic function of its observed causes and an unobserved noise term. Our experiments on real data demonstrate the performance of compression-based causal inference. Comment: 21 pages, 4 figures
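The idea of replacing Kolmogorov complexity with the output length of an existing compressor can be sketched as follows. This is an illustration under assumptions, not the experimental setup of the paper: `zlib` stands in for "existing data compression schemes", the function names are hypothetical, and the inputs are synthetic byte strings rather than real texts.

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a computable stand-in
    for description length; the choice of zlib is an assumption)."""
    return len(zlib.compress(data, 9))


def mutual_information(a: bytes, b: bytes) -> int:
    """Compression-based analogue of mutual information:
    R(a) + R(b) - R(ab), where R is the compressed length."""
    return compressed_length(a) + compressed_length(b) - compressed_length(a + b)


# Synthetic observations (assumed for illustration only).
rng = random.Random(0)
x = bytes(rng.randrange(256) for _ in range(2000))
y = x                                               # a copy shares all of x's content
z = bytes(rng.randrange(256) for _ in range(2000))  # generated independently of x

print(mutual_information(x, y), mutual_information(x, z))
```

The copy shares far more compressed information with `x` than the independently generated string does, which is the kind of dependence signal a compression-based independence test would rely on.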
Conference Paper: Inferring deterministic causal relations.
ABSTRACT: A recent method for causal discovery is in many cases able to infer whether X causes Y or Y causes X for just two observed variables X and Y. It is based on the observation that there exist (non-Gaussian) joint distributions P(X,Y) for which Y may be written as a function of X up to an additive noise term that is independent of X and no such model exists from Y to X. Whenever this is the case, one prefers the causal model X → Y. Here we justify this method by showing that the causal hypothesis Y → X is unlikely because it requires a specific tuning between P(Y) and P(X|Y) to generate a distribution that admits an additive noise model from X to Y. To quantify the amount of tuning required we derive lower bounds on the algorithmic information shared by P(Y) and P(X|Y). This way, our justification is consistent with recent approaches for using algorithmic information theory for causal reasoning. We extend this principle to the case where P(X,Y) almost admits an additive noise model. Our results suggest that the above conclusion is more reliable if the complexity of P(Y) is high. Comment: 17 pages, 1 figure
ABSTRACT: We pose a problem called "broadcasting Holevo information": given an unknown state taken from an ensemble, the task is to generate a bipartite state transferring as much Holevo information to each copy as possible. We argue that upper bounds on the average information over both copies imply lower bounds on the quantum capacity required to send the ensemble without information loss. This is because a channel with zero quantum capacity has a unitary extension transferring at least as much information to its environment as it transfers to the output. For an ensemble being the time orbit of a pure state under a Hamiltonian evolution, we derive such a bound on the required quantum capacity in terms of properties of the input and output energy distribution. Moreover, we discuss relations between the broadcasting problem and entropy power inequalities. The broadcasting problem arises when a signal should be transmitted by a time-invariant device such that the outgoing signal has the same timing information as the incoming signal had. Based on previous results we argue that this establishes a link between quantum information theory and the theory of low-power computing because the loss of timing information implies loss of free energy.
ABSTRACT:
Setting: We use Bayesian nets as a formalization of the probabilistic and causal relations of a system and present a result that describes how information-theoretic means can contribute to the causal inference process.

Task of Causal Inference: Starting from an observation of a subsystem in terms of a probability distribution of random variables, determine the class of Bayesian nets that are consistent with the observation.

[Figure: observation $p(x_1, \dots, x_4)$ of a subsystem $X_1, \dots, X_4$; inference of a Bayesian net $B$ such that $p(x_1, \dots, x_4) = \sum_u p_B(x_1, \dots, x_4, u)$.]

Graphical Models and Information Theory: For a given directed acyclic graph $G$ whose nodes are discrete random variables $X_1, \dots, X_n$, denote by $P(G)$ the family of joint probability distributions which factor according to $G$. Then for an arbitrary distribution $p$ of the $X_i$,

$$I_p(X_1, \dots, X_n) = D(p \,\|\, P(G)) + \sum_{i=1}^{n} I_p(X_i, \mathrm{parents}(X_i)), \qquad (*)$$

where
• $D(p \,\|\, P(G)) := \inf_{q \in P(G)} D(p \,\|\, q)$ is the distance of $p$ from the family of distributions $P(G)$, measured in terms of Kullback-Leibler divergence $D(p \,\|\, q) = \sum p \log (p/q)$.
• $I_p$ is the (generalized) mutual information $I_p(X_1, \dots, X_n) = D(p \,\|\, p(x_1) \otimes \cdots \otimes p(x_n))$.

General Question: Relation (*) only holds for distributions $p$ defined on all nodes of the graph, that is, if the whole system has been observed. What can be said in cases of incomplete knowledge, that is, if there are unobserved variables?

U (1) there are no causal interactions among the components of the observed subsystem and (2) the mutual information of the subsystem is maximal and all variables have equal entropy: $H(U) \ge H(X_1) = \cdots = H(X_n)$.

Example (Maximal Interaction): Distributions of binary variables of the form $p_a(x_1, \dots, x_n) \sim \exp(a\, x_n \cdots x_k)$ ($a \in \mathbb{R}$, $x_i \in \{-1, 1\}$) can be generated using only common roots of order two.
• Result holds also in the algorithmic causal setting introduced in [4] when substituting Kolmogorov complexity for entropy.

Future Work:
1. How far can one go in characterizing causal models using only entropy-like quantities?
2. Consider the decomposition of mutual information into terms originating from the projection of $p$ onto $l$-interaction spaces: $I(p) = \sum_{l=2}^{k} D(p^{(l)} \,\|\, p^{(l-1)})$. Do causal interpretations for these interaction terms exist?
3. Derive heuristics for causal inference algorithms from information-theoretic results as above.
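Relation (*), which splits the mutual information of all variables into the divergence from the graph family plus the parent-child mutual informations, can be checked numerically on a small case. The sketch below is only an illustration under assumptions: the three-node graph, the random joint distribution and all function names are hypothetical, variables are indexed from 0, and the infimum over P(G) is taken to be attained by the product of the conditionals of p given the parents, which is the standard projection for discrete Bayesian nets.

```python
import itertools
import math
import random

# Hypothetical example graph: X0 -> X1 and X0 -> X2 (binary variables).
PARENTS = {0: (), 1: (0,), 2: (0,)}
STATES = list(itertools.product((0, 1), repeat=3))


def random_joint(seed=0):
    """A strictly positive random joint distribution over STATES."""
    rng = random.Random(seed)
    weights = [rng.random() + 0.05 for _ in STATES]
    total = sum(weights)
    return {s: w / total for s, w in zip(STATES, weights)}


def marginal(p, idx):
    """Marginal distribution of the variables listed in idx."""
    out = {}
    for state, prob in p.items():
        key = tuple(state[i] for i in idx)
        out[key] = out.get(key, 0.0) + prob
    return out


def entropy(dist):
    return -sum(pr * math.log(pr) for pr in dist.values() if pr > 0)


def mutual_information(p, a, b):
    """I(X_a; X_b); zero when either index set is empty."""
    if not a or not b:
        return 0.0
    return entropy(marginal(p, a)) + entropy(marginal(p, b)) - entropy(marginal(p, a + b))


def multi_information(p):
    """Sum of single-variable entropies minus the joint entropy."""
    return sum(entropy(marginal(p, (i,))) for i in PARENTS) - entropy(p)


def project_onto_graph(p):
    """q(x) = prod_i p(x_i | parents(x_i)), assumed to minimise D(p || q) over P(G)."""
    q = {}
    for state in STATES:
        prob = 1.0
        for i, pa in PARENTS.items():
            joint = marginal(p, (i,) + pa)[tuple(state[j] for j in (i,) + pa)]
            cond = marginal(p, pa)[tuple(state[j] for j in pa)]
            prob *= joint / cond
        q[state] = prob
    return q


def kl(p, q):
    return sum(pr * math.log(pr / q[s]) for s, pr in p.items() if pr > 0)


p = random_joint()
divergence = kl(p, project_onto_graph(p))
lhs = multi_information(p)
rhs = divergence + sum(mutual_information(p, (i,), pa) for i, pa in PARENTS.items())
print(abs(lhs - rhs) < 1e-9)
```

Both sides agree up to floating-point error for any strictly positive `p`, since the conditional entropies introduced by the projection cancel against those in the parent-child terms.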
Publication Stats
69  Citations  
8.37  Total Impact Points  
Top Journals
Institutions

2010-2012

Max Planck Institute for Mathematics in the Sciences
Leipzig, Saxony, Germany
