
The causal Markov condition (CMC) is a postulate that links observations to causality. It describes the conditional independences among the observations that are entailed by a causal hypothesis in terms of a directed acyclic graph. In the conventional setting, the observations are random variables and the independence is a statistical one, i.e., the information content of observations is measured in terms of Shannon entropy. We formulate a generalized CMC for any kind of observations on which independence is defined via an arbitrary submodular information measure. Recently, this has been discussed for observations in terms of binary strings, where information is understood in the sense of Kolmogorov complexity. Our approach enables us to find computable alternatives to Kolmogorov complexity, e.g., the length of a text after applying existing data compression schemes. We show that our CMC is justified if one restricts attention to a class of causal mechanisms that is adapted to the respective information measure. Our justification is similar to deriving the statistical CMC from functional models of causality, where every variable is a deterministic function of its observed causes and an unobserved noise term. Our experiments on real data demonstrate the performance of compression-based causal inference.
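The computable substitute the abstract proposes can be sketched in a few lines. Here zlib stands in for the paper's generic compression scheme (an illustrative assumption, not the authors' choice), and conditional information is taken as the usual difference R(A ∪ B) − R(B):

```python
import zlib

BASELINE = len(zlib.compress(b"", 9))  # fixed overhead of an empty compression

def info(strings):
    # Compressed length (bytes) of the concatenated set of strings, shifted so
    # that info(set()) == 0. zlib is a stand-in for an arbitrary compression
    # scheme; sorting makes the measure independent of insertion order.
    blob = b"".join(sorted(strings))
    return len(zlib.compress(blob, 9)) - BASELINE

def cond_info(a, b):
    # R(A | B) = R(A ∪ B) - R(B): conditional information under the measure.
    return info(a | b) - info(b)

x = b"the quick brown fox jumps over the lazy dog " * 20
y = x.replace(b"quick", b"rapid")  # a near-copy of x
z = bytes(range(256)) * 4          # content unrelated to x and y

assert info(set()) == 0
# a near-copy of x costs little extra description once x is known:
assert cond_info({y}, {x}) < cond_info({y}, {z})
```

The two assertions check exactly the behaviour the paper needs from such a measure: normalization at the empty set, and low conditional information between strongly dependent observations.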


... From this we can derive two ways of estimating Kolmogorov complexity: through Lempel-Ziv complexity or through the use of compressors. Lempel-Ziv complexity has been used, for example, in compression, in classification ([ZM93]), in the analysis of biomedical signals ([Abo+06], [Li+08], [IM+15]), and in measuring causality ([SJS10]). Compressors have been used extensively for classification ([CV05]) in very diverse domains such as hyperspectral imaging ([VDG12]), DNA ([CV05]), natural languages ([Li+04]), and even music ([FDK15]). ...

... We will therefore explain its production process in detail (Section I.2). Since Lempel-Ziv complexity is an information measure [SJS10], it can serve as the basis of an algorithmic information theory. We will then see that there exist different versions of this algorithmic information theory based on Lempel-Ziv complexity, depending on the authors' needs. ...

... Definition I.2.10 (Information measure, see [SJS10]). We say that R : Ω → R is an information measure if it satisfies the following three properties: ...

Data in the form of symbol strings are highly varied (DNA, text, quantized EEG, ...) and cannot always be modeled. A universal description of symbol strings, independent of probabilities, is therefore needed. Kolmogorov complexity was introduced in the 1960s to address this problem. The concept is simple: a string is complex when no short description of it exists. Kolmogorov complexity is the algorithmic counterpart of Shannon entropy and underpins algorithmic information theory. However, Kolmogorov complexity is not computable in finite time, which makes it unusable in practice. The first to make Kolmogorov complexity operational were Lempel and Ziv in 1976, who proposed restricting the operations allowed in the description. Another approach is to use the length of the string compressed by a lossless compressor. However, these two estimators are ill-defined in the conditional and joint cases, which makes it difficult to extend Lempel-Ziv complexity or compressors into a full algorithmic information theory. Starting from this observation, we introduce a new universal information measure based on Lempel-Ziv complexity, called SALZA. The implementation and well-definedness of our measure allow efficient computation of the quantities of algorithmic information theory. Standard lossless compressors were used by Cilibrasi and Vitányi to build a very popular universal classifier: the normalized compression distance (NCD). For this application, we propose our own estimator, the NSD, and show that it is a universal semi-distance over symbol strings. The NSD outperforms the NCD by adapting naturally to greater data diversity and by defining appropriate conditioning through SALZA. Using the universal prediction qualities of Lempel-Ziv complexity, we then explore questions of causal inference. First, the algorithmic Markov conditions are made computable thanks to SALZA. Then, defining algorithmic directed information for the first time, we propose an algorithmic interpretation of Granger causality. We demonstrate the relevance of our approach on both synthetic and real data.
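The Lempel-Ziv (1976) parsing that underlies such complexity estimates is short to implement. The sketch below counts phrases in the classic exhaustive-history parsing; it illustrates the general estimator and is not an implementation of SALZA:

```python
def lz76_complexity(s):
    """Number of phrases in the Lempel-Ziv (1976) parsing of sequence s.

    Each new phrase is the shortest prefix of the remaining input that has not
    yet appeared as a substring of the data seen so far (up to one symbol
    before the phrase's end); the phrase count is a computable estimator of
    Kolmogorov complexity.
    """
    i, phrases = 0, 0
    while i < len(s):
        j = i + 1
        # grow the candidate phrase while it already occurs earlier
        while j <= len(s) and s[i:j] in s[:j - 1]:
            j += 1
        phrases += 1
        i = j
    return phrases

# canonical example: 0.001.10.100.1000.101 gives 6 phrases
assert lz76_complexity("0001101001000101") == 6
# a constant string has minimal complexity
assert lz76_complexity("0" * 100) == 2
```

The quadratic substring search keeps the sketch readable; production implementations use suffix structures for linear-time parsing.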

... The notion of invariant, autonomous, and independent mechanisms has appeared in various guises throughout the history of causality research [72], [100], [111], [124], [183], [188], [240]. Early work on this was done by Haavelmo [100], stating the assumption that changing one of the structural assignments leaves the other ones invariant. ...

... Overviews are provided by Aldrich [4], Hoover [111], Pearl [183], and Peters et al. [188,Section 2.2]. These seemingly different notions can be unified [124], [240]. ...

The two fields of machine learning and graphical causality arose and have developed separately. However, there is now cross-pollination and increasing interest in both fields to benefit from the advances of the other. In this article, we review fundamental concepts of causal inference and relate them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research. This also applies in the opposite direction: we note that most work in causality starts from the premise that the causal variables are given. A central problem for AI and causality is, thus, causal representation learning, that is, the discovery of high-level causal variables from low-level observations. Finally, we delineate some implications of causality for machine learning and propose key research areas at the intersection of both communities.

... Then the entropy [8] function H(X A ), and the mutual information between a set of variables and the complement set I(X A ; X Ω\A ), are both submodular functions [11]. These have been widely used in applications such as sensor placement [14,33], feature selection [32,3,17,37], observation selection, and causal modeling [56,45]. ...

... The submodular information measures we study in this work have been investigated before in special cases. [45] generalizes an information measure to elements of any ground set, primarily in order to introduce a causal Markov condition over objects other than random variables. Also, in [15], an objective that corresponds to our submodular mutual information was used to show error bounds and hardness for general batch active semi-supervised learning. ...

Information-theoretic quantities like entropy and mutual information have found numerous uses in machine learning. It is well known that there is a strong connection between these entropic quantities and submodularity since entropy over a set of random variables is submodular. In this paper, we study combinatorial information measures that generalize independence, (conditional) entropy, (conditional) mutual information, and total correlation defined over sets of (not necessarily random) variables. These measures strictly generalize the corresponding entropic measures since they are all parameterized via submodular functions that themselves strictly generalize entropy. Critically, we show that, unlike entropic mutual information in general, the submodular mutual information is actually submodular in one argument, holding the other fixed, for a large class of submodular functions whose third-order partial derivatives satisfy a non-negativity property. This turns out to include a number of practically useful cases such as the facility location and set-cover functions. We study specific instantiations of the submodular information measures on these, as well as the probabilistic coverage, graph-cut, and saturated coverage functions, and see that they all have mathematically intuitive and practically useful expressions. Regarding applications, we connect the maximization of submodular (conditional) mutual information to problems such as mutual-information-based, query-based, and privacy-preserving summarization -- and we connect optimizing the multi-set submodular mutual information to clustering and robust partitioning.
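A minimal instantiation of the submodular mutual information described above, using the facility-location function on a toy similarity matrix (the matrix and the cluster structure are illustrative assumptions):

```python
def facility_location(A, sim):
    """f(A) = sum over every ground-set item of its best similarity to a
    member of A (0 for the empty set); a monotone submodular function."""
    if not A:
        return 0.0
    return sum(max(row[j] for j in A) for row in sim)

def submodular_mi(A, B, sim):
    """I_f(A; B) = f(A) + f(B) - f(A ∪ B): the submodular mutual information,
    here instantiated with f = facility location."""
    f = lambda S: facility_location(S, sim)
    return f(A) + f(B) - f(A | B)

# toy similarity matrix: items 0,1 form one cluster, items 2,3 another
sim = [[1.0, 0.9, 0.1, 0.1],
       [0.9, 1.0, 0.1, 0.1],
       [0.1, 0.1, 1.0, 0.9],
       [0.1, 0.1, 0.9, 1.0]]

# items in the same cluster share far more "information" than items across
assert submodular_mi({0}, {1}, sim) > submodular_mi({0}, {2}, sim)
```

This mirrors the paper's observation that such instantiations have mathematically intuitive expressions: here the mutual information is exactly the similarity mass that the two sets can both cover.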

... The answer is yes. Steudel et al. [35] show that independence of Markov kernels is justified when we use a compressor as an information measure, if we restrict ourselves to the class of causal mechanisms that is adapted to the information measure. In general, let X be a set of discrete-valued random variables and Ω be the powerset of X , i.e. the set of all subsets of X . ...

... Definition 1 (Information measure [35]) A function R : Ω → R is an information measure if it satisfies the following axioms: ...

Causal inference from observational data is one of the most fundamental problems in science. In general, the task is to tell whether it is more likely that X caused Y, or vice versa, given only data over their joint distribution. In this paper we propose a general inference framework based on Kolmogorov complexity, as well as a practical and computable instantiation based on the Minimum Description Length principle. Simply put, we propose causal inference by compression. That is, we infer that X is a likely cause of Y if we can better compress the data by first encoding X, and then encoding Y given X, than in the other direction. To show this works in practice, we propose Origo, an efficient method for inferring the causal direction from binary data. Origo employs the lossless Pack compressor and searches for the set of decision trees that encodes the data most succinctly. Importantly, it works directly on the data and requires assumptions about neither the distributions nor the type of causal relations. To evaluate Origo in practice, we provide extensive experiments on synthetic, benchmark, and real-world data, including three case studies. Altogether, the experiments show that Origo reliably infers the correct causal direction in a wide range of settings.
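The "inference by compression" decision rule can be illustrated with a toy two-part MDL scheme. Note the substitutions: zlib replaces the Pack compressor, and a majority-vote mapping table replaces Origo's decision-tree models, so this is only a sketch of the general idea, not of Origo itself:

```python
import random
import zlib
from collections import Counter, defaultdict

def clen(symbols):
    # compressed length in bytes of a symbol sequence (values 0..255)
    return len(zlib.compress(bytes(symbols), 9))

def description_length(cause, effect):
    """Two-part cost of the direction 'cause -> effect': compress the cause,
    a best-guess mapping table, a right/wrong flag stream, and the values the
    table gets wrong. A toy MDL scheme, NOT Origo's decision-tree encoding."""
    votes = defaultdict(Counter)
    for c, e in zip(cause, effect):
        votes[c][e] += 1
    best = {c: v.most_common(1)[0][0] for c, v in votes.items()}
    table = [s for c in sorted(best) for s in (c, best[c])]
    flags = [int(best[c] != e) for c, e in zip(cause, effect)]
    corrections = [e for c, e in zip(cause, effect) if best[c] != e]
    return clen(cause) + clen(table) + clen(flags) + clen(corrections)

random.seed(0)
x = [random.randrange(256) for _ in range(4000)]
y = [v % 16 for v in x]  # a deterministic, many-to-one effect of x

# encoding x first and then y given x is cheaper than the reverse direction
assert description_length(x, y) < description_length(y, x)
```

The asymmetry arises exactly as the abstract describes: the functional direction admits a small, exception-free model, while the anticausal direction must pay for many corrections.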

... • Logarithm of period : For a deterministic dynamic system with periodic behavior, an information function can be defined as the logarithm of the period of a set of components (i.e., the time it takes for the joint state of these components to return to an initial joint state) [39]. This information function measures the number of questions which one should expect to answer in order to locate the position of those components in their cycle. ...

... This enables discussions of Markov chains, Markov random fields [39] and "computational mechanics" [102][103][104][105] to be subsumed in a general formalism and thence applied in algorithmic, vector-spatial or matroidal contexts. ...

We develop a general formalism for representing and understanding structure
in complex systems. In our view, structure is the totality of relationships
among a system's components, and these relationships can be quantified using
information theory. In the interest of flexibility we allow information to be
quantified using any function, including Shannon entropy and Kolmogorov
complexity, that satisfies certain fundamental axioms. Using these axioms, we
formalize the notion of a dependency among components, and show how a system's
structure is revealed in the amount of information assigned to each dependency.
We explore quantitative indices that summarize system structure, providing a
new formal basis for the complexity profile and introducing a new index, the
"marginal utility of information". Using simple examples, we show how these
indices capture intuitive ideas about structure in a quantitative way. Our
formalism also sheds light on a longstanding mystery: that the mutual
information of three or more variables can be negative. We discuss applications
to complex networks, gene regulation, the kinetic theory of fluids and
multiscale cybernetic thermodynamics.

... The causal Markov condition is only expected to hold for a given set of observations if all relevant components of a system have been observed, that is, if there are no confounders (unmeasured causes of two or more of the observations). It can then be proven by assuming a functional model of causality [1, 4, 5]. As an example, consider the observations X1, . . . ...

... Thus, highly redundant strings require a common ancestor in any DAG model. Since the Kolmogorov complexity of a string s is uncomputable, we have argued in recent work [5] that it can be substituted by a measure of complexity in terms of the length of a compressed version of s with respect to a chosen compression scheme (instead of a universal Turing machine), and the above result should still hold approximately. ...

A directed acyclic graph (DAG) partially represents the conditional independence structure among observations of a system if the local Markov condition holds, that is, if every variable is independent of its non-descendants given its parents. In general, there is a whole class of DAGs that represents a given set of conditional independence relations. We are interested in properties of this class that can be derived from observations of a subsystem only. To this end, we prove an information theoretic inequality that allows for the inference of common ancestors of observed parts in any DAG representing some unknown larger system. More explicitly, we show that a large amount of dependence in terms of mutual information among the observations implies the existence of a common ancestor that distributes this information. Within the causal interpretation of DAGs our result can be seen as a quantitative extension of Reichenbach's Principle of Common Cause to more than two variables. Our conclusions are valid also for non-probabilistic observations such as binary strings, since we state the proof for an axiomatized notion of mutual information that includes the stochastic as well as the algorithmic version.

... As it is also trivially 0 for empty input, it is an information measure, and hence we know by the results of Steudel [30] that under our score, tree models themselves are identifiable. ...

How can we discover whether X causes Y or, vice versa, Y causes X, when we are only given a sample over their joint distribution? How can we do this such that X and Y can be univariate, multivariate, or of different cardinalities? And, how can we do so regardless of whether X and Y are of the same or of different data types, be they discrete, numeric, or mixed? These are exactly the questions we answer. We take an information theoretic approach, based on the Minimum Description Length principle, from which it follows that first describing the data over the cause and then that of the effect given the cause is shorter than the reverse direction. Simply put, if Y can be explained more succinctly by a set of classification or regression trees conditioned on X than in the opposite direction, we conclude that X causes Y. Empirical evaluation on a wide range of data shows that our method, Crack, infers the correct causal direction reliably and with high accuracy on a wide range of settings, outperforming the state of the art by a wide margin. Code related to this paper is available at: http://eda.mmci.uni-saarland.de/crack.

... LiNGAM and its variants [3,4] assume that the data generating process is linear and the noise distributions are non-Gaussian. There are other studies related to this topic, such as explaining the underlying theoretical foundation behind asymmetric-property-based methods [21,23,24], addressing the latent variable problem [25], regression-based inference methods [26], and kernel independence test based causation discovery methods [27]. Inferring the direction between a cause-effect pair is the focus of these methods. ...

Causation discovery without manipulation is considered a crucial problem in a variety of applications. The state-of-the-art solutions are applicable only when large numbers of samples are available or the problem domain is sufficiently small. Motivated by observations of the local sparsity properties of causal structures, we propose a general Split-and-Merge framework, named SADA, to enhance the scalability of a wide class of causation discovery algorithms. In SADA, the variables are partitioned into subsets by finding a causal cut on the sparse causal structure over the variables. By running mainstream causation discovery algorithms as basic causal solvers on the subproblems, the complete causal structure can be reconstructed by combining the partial results. SADA benefits from the recursive division technique, since each small subproblem generates a more accurate result under the same number of samples. We theoretically prove that SADA always reduces the scale of the problem without sacrificing accuracy, under the conditions of local causal sparsity and reliable conditional independence tests. We also present a sufficient condition for accuracy enhancement by SADA, even when the conditional independence tests are vulnerable. Extensive experiments on both simulated and real-world datasets verify the improvements in scalability and accuracy achieved by applying SADA together with existing causation discovery algorithms.

... • Logarithm of period: For a deterministic dynamic system with periodic behavior, an information function L(U) can be defined as the logarithm of the period of a set U of components (i.e., the time it takes for the joint state of these components to return to an initial joint state) [54]. This information function measures the number of questions which one should expect to answer in order to locate the position of those components in their cycle. ...

Complex systems display behavior at a range of scales. Large-scale behaviors can emerge from the correlated or dependent behavior of individual small-scale components. To capture this observation in a rigorous and general way, we introduce a formalism for multiscale information theory. Dependent behavior among system components results in overlapping or shared information. A system's structure is revealed in the sharing of information across the system's dependencies, each of which has an associated scale. Counting information according to its scale yields the quantity of scale-weighted information, which is conserved when a system is reorganized. In the interest of flexibility we allow information to be quantified using any function that satisfies two basic axioms. Shannon information and vector space dimension are examples. We discuss two quantitative indices that summarize system structure: an existing index, the complexity profile, and a new index, the marginal utility of information. Using simple examples, we show how these indices capture the multiscale structure of complex systems in a quantitative way.

... Due to the algorithmic Markov condition, postulated in [19], causal structures in nature also imply algorithmic independencies in analogy to the statistical case. We refer the reader to Ref. [30] for further information measures satisfying the polymatroidal axioms. ...

One of the goals of probabilistic inference is to decide whether an
empirically observed distribution is compatible with a candidate Bayesian
network. However, Bayesian networks with hidden variables give rise to highly
non-trivial constraints on the observed distribution. Here, we propose an
information-theoretic approach, based on the insight that conditions on
entropies of Bayesian networks take the form of simple linear inequalities. We
describe an algorithm for deriving entropic tests for latent structures. The
well-known conditional independence tests appear as a special case. While the
approach applies for generic Bayesian networks, we presently adopt the causal
view, and show the versatility of the framework by treating several relevant
problems from that domain: detecting common ancestors, quantifying the strength
of causal influence, and inferring the direction of causation from two-variable
marginals.

We postulate a principle stating that the initial condition of a physical
system is typically algorithmically independent of the dynamical law. We argue
that this links thermodynamics and causal inference. On the one hand, it
entails behaviour that is similar to the usual arrow of time. On the other
hand, it motivates a statistical asymmetry between cause and effect that has
recently been postulated in the field of causal inference, namely, that the
probability distribution P(cause) contains no information about the conditional
distribution P(effect|cause) and vice versa, while P(effect) may contain
information about P(cause|effect).

Previous asymptotically correct algorithms for recovering causal structure from sample probabilities have been limited, even in sparse causal graphs, to a few variables. We describe an asymptotically correct algorithm whose complexity for fixed graph connectivity increases polynomially in the number of vertices, and which may in practice recover sparse graphs with several hundred variables. From sample data with n = 20,000, an implementation of the algorithm on a DECStation 3100 recovers the edges in a linear version of the ALARM network with 37 vertices and 46 edges. Fewer than 8% of the undirected edges are incorrectly identified in the output. Without prior ordering information, the program also determines the direction of edges for the ALARM graph with an error rate of 14%. Processing time is less than 10 seconds. Keywords: DAGs, causal modelling.

Upper and lower bounds are obtained for the joint entropy of a collection of random variables in terms of an arbitrary collection of subset joint entropies. These inequalities generalize Shannon's chain rule for entropy as well as inequalities of Han, Fujishige, and Shearer. A duality between the upper and lower bounds for joint entropy is developed. All of these results are shown to be special cases of general, new results for submodular functions-thus, the inequalities presented constitute a richly structured class of Shannon-type inequalities. The new inequalities are applied to obtain new results in combinatorics, such as bounds on the number of independent sets in an arbitrary graph and the number of zero-error source-channel codes, as well as determinantal inequalities in matrix theory. A general inequality for relative entropies is also developed. Finally, revealing connections of the results to literature in economics, computer science, and physics are explored.
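The inequalities described above specialize to classical entropy inequalities, which are easy to check numerically. The sketch below verifies submodularity of joint entropy and Han's pairwise upper bound on an assumed toy three-bit distribution:

```python
import itertools
import math

# toy joint distribution over three binary variables X0, X1, X2,
# with X2 a noisy XOR of the two fair coins X0 and X1
p = {}
for x0, x1, x2 in itertools.product((0, 1), repeat=3):
    p[(x0, x1, x2)] = 0.25 * (0.9 if x2 == x0 ^ x1 else 0.1)

def H(subset):
    """Joint Shannon entropy (in bits) of the variables indexed by subset."""
    marginal = {}
    for outcome, pr in p.items():
        key = tuple(outcome[i] for i in subset)
        marginal[key] = marginal.get(key, 0.0) + pr
    return -sum(pr * math.log2(pr) for pr in marginal.values() if pr > 0)

def subsets(s):
    s = list(s)
    return [set(c) for r in range(len(s) + 1) for c in itertools.combinations(s, r)]

V = {0, 1, 2}

# submodularity of joint entropy: adding a variable to a larger set
# never yields a larger entropy increase than adding it to a subset
for i in V:
    for A in subsets(V - {i}):
        for B in subsets(V - {i}):
            if A <= B:
                assert H(A | {i}) - H(A) >= H(B | {i}) - H(B) - 1e-9

# Han's inequality: H(X0,X1,X2) <= (1/2) [H(X0,X1) + H(X0,X2) + H(X1,X2)]
assert H(V) <= 0.5 * sum(H(pair) for pair in [(0, 1), (0, 2), (1, 2)]) + 1e-9
```

Any joint distribution would pass these checks; the noisy-XOR choice merely makes the quantities non-trivial.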

A grammar transform is a transformation that converts any data sequence to be compressed into a grammar from which the original data sequence can be fully reconstructed. In a grammar-based code, a data sequence is first converted into a grammar by a grammar transform and then losslessly encoded. In this paper, a greedy grammar transform is first presented; this grammar transform constructs sequentially a sequence of irreducible grammars from which the original data sequence can be recovered incrementally. Based on this grammar transform, three universal lossless data compression algorithms, a sequential algorithm, an improved sequential algorithm, and a hierarchical algorithm, are then developed. These algorithms combine the power of arithmetic coding with that of string matching. It is shown that these algorithms are all universal in the sense that they can achieve asymptotically the entropy rate of any stationary, ergodic source. Moreover, it is proved that their worst case redundancies among all individual sequences of length n are upper-bounded by c log log n / log n, where c is a constant. Simulation results show that the proposed algorithms outperform the Unix Compress and Gzip algorithms, which are based on LZ78 and LZ77, respectively.
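Grammar-based compression can be illustrated with a toy pair-replacement transform in the style of Re-Pair. This is not the paper's irreducible-grammar construction, only a demonstration that a grammar can be extracted and the sequence losslessly reconstructed from it:

```python
def repair_grammar(s):
    """Toy grammar transform: repeatedly replace the most frequent adjacent
    pair of symbols with a fresh nonterminal (Re-Pair style)."""
    rules = {}
    seq = list(s)
    next_sym = 0
    while True:
        pairs = {}
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] = pairs.get((a, b), 0) + 1
        if not pairs:
            break
        best, count = max(pairs.items(), key=lambda kv: kv[1])
        if count < 2:
            break
        sym = f"R{next_sym}"
        next_sym += 1
        rules[sym] = best
        out, i = [], 0
        while i < len(seq):  # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(sym, rules):
    """Reconstruct the original text from the grammar (lossless check)."""
    if sym not in rules:
        return sym
    a, b = rules[sym]
    return expand(a, rules) + expand(b, rules)

s = "abababab"
seq, rules = repair_grammar(s)
assert "".join(expand(t, rules) for t in seq) == s  # fully reconstructible
```

On `"abababab"` the transform produces two rules (a pair rule and a rule over that rule), compressing the eight symbols to a start sequence of two, which is the essence of the grammar-transform idea.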

Spike train analysis generally focuses on two purposes: (1) estimating the neuronal information quantity, and (2) quantifying spike or burst synchronization. We introduce here a new multivariate index based on Lempel-Ziv complexity for spike train analysis. This index, called mutual Lempel-Ziv complexity (MLZC), can both measure spike correlations and estimate the information quantity of spike trains (i.e. characterize the dynamic state). Using simulated spike trains from a Poisson process, we show that the MLZC is able to quantify spike correlations. In addition, using bursting activity generated by electrically coupled Hindmarsh-Rose neurons, the MLZC is able to quantify and characterize burst synchronization when classical measures fail.

Lempel-Ziv complexity (LZ) and derived LZ algorithms have been extensively used to solve information theoretic problems such as coding and lossless data compression. In recent years, LZ has been widely used in biomedical applications to estimate the complexity of discrete-time signals. Despite its popularity as a complexity measure for biosignal analysis, the question of LZ interpretability and its relationship to other signal parameters and to other metrics has not been previously addressed. We have carried out an investigation aimed at gaining a better understanding of the LZ complexity itself, especially regarding its interpretability as a biomedical signal analysis technique. Our results indicate that LZ is particularly useful as a scalar metric to estimate the bandwidth of random processes and the harmonic variability in quasi-periodic signals.

With the rapid emergence of RNA databases and newly identified non-coding RNAs, an efficient compression algorithm for RNA sequence and structural information is needed for the storage and analysis of such data. Although several algorithms for compressing DNA sequences have been proposed, none of them are suitable for the compression of RNA sequences with their secondary structures simultaneously. This kind of compression not only facilitates the maintenance of RNA data, but also supplies a novel way to measure the informational complexity of RNA structural data, raising the possibility of studying the relationship between the functional activities of RNA structures and their complexities, as well as various structural properties of RNA based on compression.
RNACompress employs an efficient grammar-based model to compress RNA sequences and their secondary structures. The main goals of this algorithm are twofold: (1) to present a robust and effective way for RNA structural data compression; (2) to design a suitable model to represent RNA secondary structure as well as derive the informational complexity of the structural data based on compression. Our extensive tests have shown that RNACompress achieves a universally better compression ratio compared with other sequence-specific or common text-specific compression algorithms, such as Gencompress, winrar and gzip. Moreover, a test of the activities of distinct GTP-binding RNAs (aptamers) compared with their structural complexity shows that our defined informational complexity can be used to describe how complexity varies with activity. These results lead to an objective means of comparing the functional properties of heteropolymers from the information perspective.
A universal algorithm for the compression of RNA secondary structure as well as the evaluation of its informational complexity is discussed in this paper. We have developed RNACompress, as a useful tool for academic users. Extensive tests have shown that RNACompress is a universally efficient algorithm for the compression of RNA sequences with their secondary structures. RNACompress also serves as a good measurement of the informational complexity of RNA secondary structure, which can be used to study the functional activities of RNA molecules.

We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.
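The NCD itself is a one-liner given any lossless compressor. A sketch with zlib (the choice of compressor is an assumption; the abstract reports robustness across statistical, dictionary, and block-sorting compressors):

```python
import zlib

def ncd(x, y):
    """Normalized compression distance,
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length under a real-world compressor."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 50
b = a.replace(b"fox", b"cat")          # a near-duplicate of a
c = bytes(reversed(range(256))) * 9    # content unrelated to a

# similar objects are close to 0, unrelated objects are close to 1
assert ncd(a, b) < ncd(a, c)
```

Feeding the resulting pairwise distance matrix to any hierarchical clustering routine reproduces the paper's pipeline in miniature; only the quartet-tree construction is specific to their method.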

While Kolmogorov (1965, 1983) complexity is the accepted absolute
measure of information content of an individual finite object, a
similarly absolute notion is needed for the relation between an
individual data sample and an individual model summarizing the
information in the data, for example, a finite set (or probability
distribution) where the data sample typically came from. The statistical
theory based on such relations between individual objects can be called
algorithmic statistics, in contrast to classical statistical theory that
deals with relations between probabilistic ensembles. We develop the
algorithmic theory of statistic, sufficient statistic, and minimal
sufficient statistic. This theory is based on two-part codes consisting
of the code for the statistic (the model summarizing the regularity, the
meaningful information, in the data) and the model-to-data code. In
contrast to the situation in probabilistic statistical theory, the
algorithmic relation of (minimal) sufficiency is an absolute relation
between the individual model and the individual data sample. We
distinguish implicit and explicit descriptions of the models. We give
characterizations of algorithmic (Kolmogorov) minimal sufficient
statistic for all data samples for both description modes-in the
explicit mode under some constraints. We also strengthen and elaborate
on earlier results for the “Kolmogorov structure function”
and “absolutely nonstochastic objects”-those objects for
which the simplest models that summarize their relevant information
(minimal sufficient statistics) are at least as complex as the objects
themselves. We demonstrate a close relation between the probabilistic
notions and the algorithmic ones: (i) in both cases there is an
“information non-increase” law; (ii) it is shown that a
function is a probabilistic sufficient statistic iff it is with high
probability (in an appropriate sense) an algorithmic sufficient
statistic.
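The two-part codes at the heart of this theory can be illustrated numerically with a toy model class; the Bernoulli parameter grid and the sample below are made-up assumptions, chosen only to show that minimizing model bits plus model-to-data bits recovers the parameter summarizing the regularity:

```python
import math

def two_part_length(data: str, k: int, grid: int = 33) -> float:
    """Two-part code length for a binary string under a Bernoulli model:
    log2(grid) bits name the parameter p = k/(grid - 1), and
    -log2 P(data | p) bits encode the data given that model."""
    p = k / (grid - 1)
    model_bits = math.log2(grid)
    data_bits = 0.0
    for bit in data:
        q = p if bit == "1" else 1 - p
        if q == 0.0:
            return math.inf  # this model cannot produce the data
        data_bits -= math.log2(q)
    return model_bits + data_bits

data = "1" * 75 + "0" * 25  # a sample with 75% ones
best_k = min(range(33), key=lambda k: two_part_length(data, k))
print(best_k / 32)  # the minimizing model sits at p = 0.75
```

The winning model captures only the bit frequency, not the order of the bits; in the algorithmic setting that order is exactly what the model-to-data code pays for.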

A grammar transform is a transformation that converts any data
sequence to be compressed into a grammar from which the original data
sequence can be fully reconstructed. In a grammar-based code, a data
sequence is first converted into a grammar by a grammar transform and
then losslessly encoded. In this paper, a greedy grammar transform is
first presented; this grammar transform constructs sequentially a
sequence of irreducible grammars from which the original data sequence
can be recovered incrementally. Based on this grammar transform, three
universal lossless data compression algorithms, a sequential algorithm,
an improved sequential algorithm, and a hierarchical algorithm, are then
developed. These algorithms combine the power of arithmetic coding with
that of string matching. It is shown that these algorithms are all
universal in the sense that they can achieve asymptotically the entropy
rate of any stationary, ergodic source. Moreover, it is proved that
their worst case redundancies among all individual sequences of length n
are upper-bounded by c log log n/log n, where c is a constant.
Simulation results show that the proposed algorithms outperform the Unix
Compress and Gzip algorithms, which are based on LZ78 and LZ77,
respectively.
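The grammar-transform idea can be illustrated with a greedy pair-replacement in the style of Re-Pair; this is a simplified stand-in, not the paper's sequential irreducible-grammar construction:

```python
from collections import Counter

def greedy_grammar(s: str):
    """Illustrative greedy grammar transform: repeatedly replace the most
    frequent adjacent symbol pair with a fresh nonterminal rule."""
    seq, rules, next_id = list(s), {}, 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break
        nt = f"R{next_id}"
        next_id += 1
        rules[nt] = (a, b)
        out, i = [], 0
        while i < len(seq):  # rewrite the sequence, left to right
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(sym, rules):
    """Reconstruct the original string from the grammar."""
    if sym not in rules:
        return sym
    a, b = rules[sym]
    return expand(a, rules) + expand(b, rules)

start, rules = greedy_grammar("abababab")
print("".join(expand(s, rules) for s in start))  # round-trips to "abababab"
```

A real grammar-based code would then entropy-code the start sequence and the rules; here the point is only that the grammar is lossless and shrinks repetitive input.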

Some simple heuristic properties of conditional independence are shown to form a conceptual framework for much of the theory of statistical inference. This framework is illustrated by an examination of the rôle of conditional independence in several diverse areas of the field of statistics. Topics covered include sufficiency and ancillarity, parameter identification, causal inference, prediction sufficiency, data selection mechanisms, invariant statistical models and a subjectivist approach to model‐building.

Special conditional independence structures have been recognized to be matroids. This opens new possibilities for the application of matroid theory methods (duality, minors, expansions) to the study of conditional independence and, on the other hand, starts a new probabilistic branch of matroid representation theory.

In this article, we propose two well-defined distance metrics for biological sequences based on a universal complexity profile.
To illustrate the metrics, we construct phylogenetic trees of 18 Eutherian mammals from a comparison of their mtDNA sequences
and of 24 coronaviruses from their whole genomes. The resulting monophyletic clusters agree well with the established taxonomic groups.

Information theory answers two fundamental questions in communication theory: what is the ultimate data compression (answer: the entropy H), and what is the ultimate transmission rate of communication (answer: the channel capacity C). For this reason some consider information theory to be a subset of communication theory. We will argue that it is much more. Indeed, it has fundamental contributions to make in statistical physics (thermodynamics), computer science (Kolmogorov complexity or algorithmic complexity), statistical inference (Occam's Razor: “The simplest explanation is best”) and to probability and statistics (error rates for optimal hypothesis testing and estimation). The relationship of information theory to other fields is discussed. Information theory intersects physics (statistical mechanics), mathematics (probability theory), electrical engineering (communication theory) and computer science (algorithmic complexity). We describe these areas of intersection in detail.
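The two answers named here can be made concrete in a few lines; the function names are illustrative, and the binary symmetric channel is chosen only as the simplest channel with a closed-form capacity:

```python
import math

def H(ps):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def bsc_capacity(eps: float) -> float:
    """Capacity of a binary symmetric channel: C = 1 - H(eps)."""
    return 1.0 - H([eps, 1 - eps])

print(H([0.5, 0.5]))      # a fair coin carries 1 bit
print(bsc_capacity(0.0))  # a noiseless binary channel: 1 bit per use
print(bsc_capacity(0.5))  # pure noise: capacity 0
```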

Inferring the causal structure that links n observables is usually based upon detecting statistical dependences and choosing simple graphs that make the joint measure Markovian. Here we argue why causal inference is also possible when the sample size is one. We develop a theory of how to generate causal graphs explaining similarities between single objects. To this end, we replace the notion of conditional stochastic independence in the causal Markov condition with the vanishing of conditional algorithmic mutual information and describe the corresponding causal inference rules. We explain why a consistent reformulation of causal inference in terms of algorithmic complexity implies a new inference principle that also takes into account the complexity of conditional probability densities, making it possible to select among Markov equivalent causal graphs. This insight provides a theoretical foundation for a heuristic principle proposed in earlier work. We also sketch some ideas on how to replace Kolmogorov complexity with decidable complexity criteria. This can be seen as an algorithmic analog of replacing the empirically undecidable question of statistical independence with practical independence tests that are based on implicit or explicit assumptions on the underlying distribution.
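One way to read the sketched replacement of Kolmogorov complexity by decidable criteria is to plug a real compressor into the conditional algorithmic mutual information; the chain-rule-style approximation below, and the choice of bz2, are illustrative assumptions rather than the paper's exact procedure:

```python
import bz2

def C(b: bytes) -> int:
    """Compressed length as a computable stand-in for Kolmogorov complexity."""
    return len(bz2.compress(b))

def cond_mutual_info(x: bytes, y: bytes, z: bytes) -> int:
    """Compression-based approximation of conditional algorithmic mutual
    information: I(x : y | z) ~= C(xz) + C(yz) - C(xyz) - C(z)."""
    return C(x + z) + C(y + z) - C(x + y + z) - C(z)

shared = b"causal inference by compression replaces independence tests with code lengths. " * 12
other = b"the quick brown fox jumps over the lazy dog near the river bank at dawn today. " * 12

# Conditioning on the shared content screens off the dependence between two
# copies of it; conditioning on unrelated material does not.
i_given_shared = cond_mutual_info(shared, shared, shared)
i_given_other = cond_mutual_info(shared, shared, other)
print(i_given_shared < i_given_other)
```

A near-zero value is then read as (approximate) conditional independence, which is the quantity the causal Markov condition asks about.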

It was mentioned by Kolmogorov (1968, IEEE Trans. Inform. Theory 14, 662–664) that the properties of algorithmic complexity and Shannon entropy are similar. We investigate one aspect of this similarity. Namely, we are interested in linear inequalities that are valid for Shannon entropy and for Kolmogorov complexity. It turns out that (1) all linear inequalities that are valid for Kolmogorov complexity are also valid for Shannon entropy and vice versa; (2) all linear inequalities that are valid for Shannon entropy are valid for ranks of finite subsets of linear spaces; (3) the opposite statement is not true; Ingleton's inequality (1971, “Combinatorial Mathematics and Its Applications,” pp. 149–167. Academic Press, San Diego) is valid for ranks but not for Shannon entropy; (4) for some special cases all three classes of inequalities coincide and have simple description. We present an inequality for Kolmogorov complexity that implies Ingleton's inequality for ranks; another application of this inequality is a new simple proof of one of Gács–Körner's results on common information (1973, Problems Control Inform. Theory 2, 149–162).
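The entropy side of claim (1) can be spot-checked numerically: the script below verifies one basic Shannon-type inequality (submodularity, equivalently nonnegativity of conditional mutual information) on a random joint distribution. This illustrates the statement rather than proves it; the variable layout is an assumption of the sketch.

```python
import itertools
import math
import random

random.seed(0)

# A random joint distribution p(x, y, z) over three binary variables.
weights = [random.random() for _ in range(8)]
total = sum(weights)
p = {xyz: w / total
     for xyz, w in zip(itertools.product((0, 1), repeat=3), weights)}

def H(idx):
    """Entropy in bits of the marginal over the coordinates in idx."""
    marg = {}
    for xyz, pr in p.items():
        key = tuple(xyz[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + pr
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

# Submodularity, H(XZ) + H(YZ) >= H(XYZ) + H(Z), i.e. I(X;Y|Z) >= 0,
# holds for every distribution; Ingleton's inequality would not.
cmi = H((0, 2)) + H((1, 2)) - H((0, 1, 2)) - H((2,))
print(cmi >= -1e-12)
```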

A variety of distance measures has been developed in information theory and has proven useful in applications to digital information systems. Since the information of a living organism is stored digitally on the information carrier DNA, it seems natural to apply these methods to genome analysis. We present two applications to genetics: a compression-based distance measure can be used to compute pairwise distances between genomic sequences of unequal lengths and thus recognize the content of a DNA region. The Kullback-Leibler distance serves as the basis for estimating evolutionary conservation across the genomes of different species in order to identify regions with potentially important functionality. Moreover, we show that we can draw conclusions about the biological properties of the sequences thus analyzed.
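The Kullback-Leibler side of this comparison can be sketched by estimating k-mer frequencies from two sequences and computing a symmetrized divergence; the epsilon smoothing and k = 2 are illustrative choices, not the paper's estimator:

```python
import math
from collections import Counter

def kmer_freqs(seq: str, k: int = 2):
    """Empirical k-mer frequency distribution of a sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()}

def sym_kl(seq_a: str, seq_b: str, k: int = 2, eps: float = 1e-6) -> float:
    """Symmetrized Kullback-Leibler divergence between k-mer distributions;
    eps smooths k-mers unseen in one of the sequences."""
    pa, pb = kmer_freqs(seq_a, k), kmer_freqs(seq_b, k)
    kmers = set(pa) | set(pb)
    def kl(p, q):
        return sum(p.get(m, eps) * math.log2(p.get(m, eps) / q.get(m, eps))
                   for m in kmers)
    return kl(pa, pb) + kl(pb, pa)

close = sym_kl("ACGTACGTACGT" * 10, "ACGTACGAACGT" * 10)
far = sym_kl("ACGTACGTACGT" * 10, "GGGGCCCCGGGG" * 10)
print(close < far)  # similar sequences are closer than dissimilar ones
```

Unlike the compression-based distance, this estimate only sees k-mer statistics, so it is blind to longer-range structure; that trade-off is why both measures appear here.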

A new approach to the problem of evaluating the complexity ("randomness") of finite sequences is presented. The proposed complexity measure is related to the number of steps in a self-delimiting production process by which a given sequence is presumed to be generated. It is further related to the number of distinct substrings and the rate of their occurrence along the sequence. The derived properties of the proposed measure are discussed and motivated in conjunction with other well-established complexity criteria.
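The proposed measure is commonly implemented by counting the phrases of an exhaustive-history parsing; the following is a standard sketch of that counting, not the paper's own formulation:

```python
def lz76_complexity(s: str) -> int:
    """Count phrases in a Lempel-Ziv (1976) style parsing: each phrase is the
    shortest prefix of the remainder not occurring earlier as a substring."""
    n, i, c = len(s), 0, 0
    while i < n:
        l = 1
        # extend the phrase while it still occurs inside everything
        # preceding its last symbol
        while i + l <= n and s[i:i + l] in s[:i + l - 1]:
            l += 1
        c += 1
        i += l
    return c

print(lz76_complexity("0" * 20))            # a constant string scores 2
print(lz76_complexity("0110100110010110"))  # an irregular string scores higher
```

The phrase count grows with the number of distinct substrings, matching the characterization in the abstract, and it is this quantity that later work uses as a computable complexity proxy.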

Several recently-proposed data compression algorithms are based on the idea of representing a string by a context-free grammar. Most of these algorithms are known to be asymptotically optimal with respect to a stationary ergodic source and to achieve a low redundancy rate. However, such results do not reveal how effectively these algorithms exploit the grammar model itself; that is, are the compressed strings produced as small as possible? We address this issue by analyzing the approximation ratio of several algorithms, that is, the maximum ratio between the size of the generated grammar and the smallest possible grammar over all inputs. On the negative side, we show that every polynomial-time grammar-compression algorithm has approximation ratio at least 8569/8568 unless P = NP. Moreover, achieving an approximation ratio of o(log n / log log n) would require progress on an algebraic problem in a well-studied area. We then upper and lower bound approximation ratios for the following four previously-proposed grammar-based compression algorithms: Sequential, Bisection, Greedy, and LZ78, each of which employs a distinct approach to compression. These results seem to indicate that there is much room to improve grammar-based compression algorithms.