The Minimum Description Length Principle

Abstract

The minimum description length (MDL) principle is a powerful method of inductive inference, the basis of statistical modeling, pattern recognition, and machine learning. It holds that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data. MDL methods are particularly well-suited for dealing with model selection, prediction, and estimation problems in situations where the models under consideration can be arbitrarily complex, and overfitting the data is a serious concern. This extensive, step-by-step introduction to the MDL principle provides a comprehensive reference (with an emphasis on conceptual issues) that is accessible to graduate students and researchers in statistics, pattern classification, machine learning, and data mining, to philosophers interested in the foundations of statistics, and to researchers in other applied sciences that involve model selection, including biology, econometrics, and experimental psychology. Part I provides a basic introduction to MDL and an overview of the concepts in statistics and information theory needed to understand MDL. Part II treats universal coding, the information-theoretic notion on which MDL is built, and Part III gives a formal treatment of MDL theory as a theory of inductive inference based on universal coding. Part IV provides a comprehensive overview of the statistical theory of exponential families with an emphasis on their information-theoretic properties. The text includes a number of summaries, paragraphs offering the reader a "fast track" through the material, and boxes highlighting the most important concepts.
1 Introducing the Minimum Description
Length Principle
Peter Grünwald
Centrum voor Wiskunde en Informatica
Kruislaan 413
1098 SJ Amsterdam
The Netherlands
pdg@cwi.nl
www.grunwald.nl
This chapter provides a conceptual, entirely nontechnical introduction and overview
of Rissanen’s minimum description length (MDL) principle. It serves as a basis for
the technical introduction given in Chapter 2, in which all the ideas discussed here
are made mathematically precise.
1.1 Introduction and Overview
How does one decide among competing explanations of data given limited obser-
vations? This is the problem of model selection. It stands out as one of the most
important problems of inductive and statistical inference. The minimum descrip-
tion length (MDL) principle is a relatively recent method for inductive inference
that provides a generic solution to the model selection problem. MDL is based on
the following insight: any regularity in the data can be used to compress the data,
that is, to describe it using fewer symbols than the number of symbols needed to
describe the data literally. The more regularities there are, the more the data can
be compressed. Equating “learning” with “finding regularity,” we can therefore say
that the more we are able to compress the data, the more we have learned about
the data. Formalizing this idea leads to a general theory of inductive inference with
several attractive properties:
1. Occam’s razor. MDL chooses a model that trades off goodness-of-fit on the
observed data with ‘complexity’ or ‘richness’ of the model. As such, MDL embodies
a form of Occam’s razor, a principle that is both intuitively appealing and informally
applied throughout all the sciences.
2. No overfitting, automatically. MDL procedures automatically and inher-
ently protect against overfitting and can be used to estimate both the parameters
and the structure (e.g., number of parameters) of a model. In contrast, to avoid
overfitting when estimating the structure of a model, traditional methods such as
maximum likelihood must be modified and extended with additional, typically ad
hoc principles.
3. Bayesian interpretation. MDL is closely related to Bayesian inference, but
avoids some of the interpretation difficulties of the Bayesian approach,¹ especially
in the realistic case when it is known a priori to the modeler that none of the models
under consideration is true. In fact:
4. No need for “underlying truth.” In contrast to other statistical methods,
MDL procedures have a clear interpretation independent of whether or not there
exists some underlying “true” model.
5. Predictive interpretation. Because data compression is formally equivalent
to a form of probabilistic prediction, MDL methods can be interpreted as searching
for a model with good predictive performance on unseen data.
In this chapter, we introduce the MDL principle in an entirely nontechnical way,
concentrating on its most important applications: model selection and avoiding
overfitting. In Section 1.2 we discuss the relation between learning and data
compression. Section 1.3 introduces model selection and outlines a first, ‘crude’
version of MDL that can be applied to model selection. Section 1.4 indicates how
these crude ideas need to be refined to tackle small sample sizes and differences in
model complexity between models with the same number of parameters. Section 1.5
discusses the philosophy underlying MDL, and considers its relation to Occam’s
razor. Section 1.7 briefly discusses the history of MDL. All this is summarized in
Section 1.8.
1.2 The Fundamental Idea: Learning as Data Compression
We are interested in developing a method for learning the laws and regularities in
data. The following example will illustrate what we mean by this and give a first
idea of how it can be related to descriptions of data.
Regularity ... Consider the following three sequences. We assume that each
sequence is 10000 bits long, and we just list the beginning and the end of each
sequence.
00010001000100010001 ... 0001000100010001000100010001 (1.1)
01110100110100100110 ... 1010111010111011000101100010 (1.2)
00011000001010100000 ... 0010001000010000001000110000 (1.3)
The first of these three sequences is a 2500-fold repetition of 0001. Intuitively, the
sequence looks regular; there seems to be a simple ‘law’ underlying it; it might make
sense to conjecture that future data will also be subject to this law, and to predict
that future data will behave according to this law. The second sequence has been
generated by tosses of a fair coin. It is, intuitively speaking, as “random as possible,”
and in this sense there is no regularity underlying it. Indeed, we cannot seem to
find such a regularity either when we look at the data. The third sequence contains
approximately four times as many 0s as 1s. It looks less regular, more random than
the first, but it looks less random than the second. There is still some discernible
regularity in these data, but of a statistical rather than of a deterministic kind.
Again, noticing that such a regularity is there and predicting that future data will
behave according to the same regularity seems sensible.
... and Compression We claimed that any regularity detected in the data can be
used to compress the data, that is, to describe it in a short manner. Descriptions
are always relative to some description method which maps descriptions D′ in a
unique manner to data sets D. A particularly versatile description method is a
general-purpose computer language like C or Pascal. A description of D is then
any computer program that prints D and then halts. Let us see whether our claim
works for the three sequences above. Using a language similar to Pascal, we can
write a program
for i = 1 to 2500; print '0001'; next; halt
which prints sequence (1.1) but is clearly a lot shorter. Thus, sequence (1.1) is indeed
highly compressible. On the other hand, we show in Chapter 2, Section 2.1, that
if one generates a sequence like (1.2) by tosses of a fair coin, then with extremely
high probability, the shortest program that prints (1.2) and then halts will look
something like this:
print '011101001101000010101010...1010111010111011000101100010'; halt
This program’s size is about equal to the length of the sequence. Clearly, it does
nothing more than repeat the sequence.
The third sequence lies in between the first two: generalizing n = 10000 to
arbitrary length n, we show in Chapter 2, Section 2.1 that the first sequence can be
compressed to O(log n) bits; with overwhelming probability, the second sequence
cannot be compressed at all; and the third sequence can be compressed to some
length αn, with 0 < α < 1.
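These compressibility claims are easy to check with an off-the-shelf compressor. The sketch below is an added illustration (not part of the original text): it builds the three kinds of sequences and compares their sizes after compression with Python's zlib. Because each symbol is stored as one ASCII character, even the fair-coin sequence shrinks towards roughly one eighth of its stored size; what matters is the comparison between the three.

import random
import zlib

n = 10_000
seq1 = "0001" * (n // 4)                                   # 2500-fold repetition of 0001
seq2 = "".join(random.choice("01") for _ in range(n))      # fair coin tosses
seq3 = "".join("1" if random.random() < 0.2 else "0"       # roughly four times as many 0s as 1s
               for _ in range(n))

for name, s in [("repetitive", seq1), ("random", seq2), ("biased", seq3)]:
    compressed = len(zlib.compress(s.encode(), 9))
    print(f"{name:10s}: {len(s)} characters -> {compressed} bytes after compression")

The repetitive sequence compresses to a few dozen bytes, the biased sequence to an intermediate size, and the fair-coin sequence the least (the gap between the last two is modest only because zlib is not an optimal code for this source), mirroring the O(log n), αn, and incompressible cases above.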
Example 1.1 (compressing various regular sequences) The regularities un-
derlying sequences (1.1) and (1.3) were of a very particular kind. To illustrate that
any type of regularity in a sequence may be exploited to compress that sequence,
we give a few more examples:
The Number π Evidently, there exists a computer program for generating the
first n digits of π; such a program could be based, for example, on an infinite
series expansion of π. This computer program has constant size, except for the
specification of n, which takes no more than O(log n) bits. Thus, when n is very large,
the size of the program generating the first n digits of π will be very small compared
to n: the π-digit sequence is deterministic, and therefore extremely regular.
Physics Data Consider a two-column table where the first column contains num-
bers representing various heights from which an object was dropped. The second
column contains the corresponding times it took for the object to reach the ground.
Assume both heights and times are recorded to some finite precision. In Section 1.3
we illustrate that such a table can be substantially compressed by first describing
the coefficients of the second-degree polynomial H that expresses Newton’s law,
then describing the heights, and then describing the deviation of the time points
from the numbers predicted by H.
Natural Language Most sequences of words are not valid sentences according to
the English language. This fact can be exploited to substantially compress English
text, as long as it is syntactically mostly correct: by first describing a grammar
for English, and then describing an English text D with the help of that grammar
[Grünwald 1996], D can be described using many fewer bits than are needed without
the assumption that word order is constrained.
1.2.1 Kolmogorov Complexity and Ideal MDL
To formalize our ideas, we need to decide on a description method, that is, a formal
language in which to express properties of the data. The most general choice is
a general-purpose² computer language such as C or Pascal. This choice leads to
the definition of the Kolmogorov complexity [Li and Vitányi 1997] of a sequence as
the length of the shortest program that prints the sequence and then halts. The
lower the Kolmogorov complexity of a sequence, the more regular it is. This notion
seems to be highly dependent on the particular computer language used. However,
it turns out that for every two general-purpose programming languages A and B
and every data sequence D, the length of the shortest program for D written in
language A and the length of the shortest program for D written in language B
differ by no more than a constant c, which does not depend on the length of D. This
so-called invariance theorem says that, as long as the sequence D is long enough,
it is not essential which computer language one chooses, as long as it is general-
purpose. Kolmogorov complexity was introduced, and the invariance theorem was
proved, independently by Kolmogorov [1965], Chaitin [1969] and Solomonoff [1964].
Solomonoff's paper, called "A Formal Theory of Inductive Inference," contained
the idea that the ultimate model for a sequence of data may be identified with the
shortest program that prints the data. Solomonoff’s ideas were later extended by
several authors, leading to an 'idealized' version of MDL [Solomonoff 1978; Li and
Vitányi 1997; Gács, Tromp, and Vitányi 2001]. This idealized MDL is very general
in scope, but not practically applicable, for the following two reasons:
1. Uncomputability. It can be shown that there exists no computer program that,
for every set of data D, when given D as input, returns the shortest program that
prints D [Li and Vitányi 1997].
2. Arbitrariness/dependence on syntax. In practice we are confronted with small
data samples for which the invariance theorem does not say much. Then the
hypothesis chosen by idealized MDL may depend on arbitrary details of the syntax
of the programming language under consideration.
1.2.2 Practical MDL
Like most authors in the field, we concentrate here on nonidealized, practical
versions of MDL that deal explicitly with the two problems mentioned above. The
basic idea is to scale down Solomonoff’s approach so that it does become applicable.
This is achieved by using description methods that are less expressive than general-
purpose computer languages. Such description methods C should be restrictive
enough so that for any data sequence D, we can always compute the length of the
shortest description of D that is attainable using method C; but they should be
general enough to allow us to compress many of the intuitively "regular" sequences.
The price we pay is that, using the “practical” MDL principle, there will always
be some regular sequences which we will not be able to compress. But we already
know that there can be no method for inductive inference at all which will always
give us all the regularity there is, simply because there can be no automated
method that, for any sequence D, finds the shortest computer program that prints
D and then halts. Moreover, it will often be possible to guide a suitable choice of
C by a priori knowledge we have about our problem domain. For example, below
we consider a description method C that is based on the class of all polynomials,
such that with the help of C we can compress all data sets which can meaningfully
be seen as points on some polynomial.
1.3 MDL and Model Selection
Let us recapitulate our main insights so far:
MDL: The Basic Idea
The goal of statistical inference may be cast as trying to find regularity in
the data. "Regularity" may be identified with "ability to compress." MDL
combines these two insights by viewing learning as data compression: it tells
us that, for a given set of hypotheses H and data set D, we should try to find
the hypothesis or combination of hypotheses in H that compresses D most.
This idea can be applied to all sorts of inductive inference problems, but it turns
out to be most fruitful in (and its development has mostly concentrated on) prob-
lems of model selection and, more generally, those dealing with overfitting. Here is
a standard example (we explain the difference between "model" and "hypothesis"
after the example).
Example 1.2 (Model Selection and Overfitting) Consider the points in Fig-
ure 1.1. We would like to learn how the y-values depend on the x-values. To this
end, we may want to fit a polynomial to the points. Straightforward linear re-
gression will give us the leftmost polynomial, a straight line that seems overly
simple: it does not capture the regularities in the data well. Since for any set of n
points there exists a polynomial of the (n − 1)st degree that goes exactly through
all these points, simply looking for the polynomial with the least error will give
us a polynomial like the one in the second picture. This polynomial seems overly
complex: it reflects the random fluctuations in the data rather than the general
pattern underlying it. Instead of picking the overly simple or the overly complex
polynomial, it seems more reasonable to prefer a relatively simple polynomial with
a small but nonzero error, as in the rightmost picture. This intuition is confirmed
by numerous experiments on real-world data from a broad variety of sources [Ris-
sanen 1989; Vapnik 1998; Ripley 1996]: if one naively fits a high-degree polynomial
to a small sample (set of data points), then one obtains a very good fit to the data.
Yet if one tests the inferred polynomial on a second set of data coming from the
same source, it typically fits these test data very badly in the sense that there is a
large distance between the polynomial and the new data points. We say that the
polynomial overfits the data. Indeed, all model selection methods that are used in
practice either implicitly or explicitly choose a tradeoff between goodness-of-fit and
Figure 1.1 A simple, complex and tradeoff (third-degree) polynomial.
complexity of the models involved. In practice, such tradeoffs lead to much better
predictions of test data than one would get by adopting the ‘simplest’ (one-degree)
or most “complex”³ ((n − 1)-degree) polynomial. MDL provides one particular means
of achieving such a tradeoff.
It will be useful to make a precise distinction between “model” and “hypothesis”:
Model vs. Hypothesis
We use the phrase point hypothesis to refer to a single probability distribution
or function. An example is the polynomial 5x² + 4x + 3. A point hypothesis is
also known as a “simple hypothesis” in the statistical literature.
We use the word model to refer to a family (set) of probability distributions or
functions with the same functional form. An example is the set of all second-
degree polynomials. A model is also known as a “composite hypothesis” in the
statistical literature.
We use hypothesis as a generic term, referring to both point hypotheses and
models.
In our terminology, the problem described in Example 1.2 is a "hypothesis selec-
tion problem" if we are interested in selecting both the degree of a polynomial and
the corresponding parameters; it is a “model selection problem” if we are mainly
interested in selecting the degree.
To apply MDL to polynomial or other types of hypothesis and model selection, we
have to make precise the somewhat vague insight “learning may be viewed as data
compression.” This can be done in various ways. In this section, we concentrate on
the earliest and simplest implementation of the idea. This is the so-called two-part
code version of MDL, see Figure 1.2.
Crude⁴, Two-Part Version of MDL principle (Informally Stated)
Let H^(1), H^(2), ... be a list of candidate models (e.g., H^(k) is the set of kth-
degree polynomials), each containing a set of point hypotheses (e.g., individual
polynomials). The best point hypothesis H ∈ H^(1) ∪ H^(2) ∪ ... to explain the
data D is the one which minimizes the sum L(H) + L(D|H), where
L(H) is the length, in bits, of the description of the hypothesis; and
L(D|H) is the length, in bits, of the description of the data when encoded
with the help of the hypothesis.
The best model to explain D is the smallest model containing the selected H.
Figure 1.2 The two-part MDL principle: first, crude implementation of the MDL ideas.
Example 1.3 (Polynomials, cont.) In our previous example, the candidate
hypotheses were polynomials. We can describe a polynomial by describing its
coefficients in a certain precision (number of bits per parameter). Thus, the higher
the degree of a polynomial or the precision, the more⁵ bits we need to describe
it and the more ‘complex’ it becomes. A description of the data ‘with the help
of’ a hypothesis means that the better the hypothesis fits the data, the shorter the
description will be. A hypothesis that fits the data well gives us a lot of information
about the data. Such information can always be used to compress the data (Chapter
2, Section 2.1). Intuitively, this is because we only have to code the errors the
hypothesis makes on the data rather than the full data. In our polynomial example,
the better a polynomial H fits D, the fewer bits we need to encode the discrepancies
between the actual y-values yᵢ and the predicted y-values H(xᵢ). We can typically
find a very complex point hypothesis (large L(H)) with a very good fit (small
L(D|H)). We can also typically find a very simple point hypothesis (small L(H))
with a rather bad fit (large L(D|H)). The sum of the two description lengths will
be minimized at a hypothesis that is quite (but not too) “simple,” with a good (but
not perfect) fit.
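The following sketch turns this reasoning into a toy version of crude two-part MDL for choosing a polynomial degree. It is an added illustration, not part of the original text: it assumes a fixed precision of 32 bits per coefficient for L(H) and Gaussian noise with a known standard deviation for L(D|H), computed with the Shannon-Fano code length −log₂ P(D|H) that is made precise in Section 1.4. Both assumptions are exactly the kind of ad hoc choice that the refined MDL of Section 1.4 is designed to remove.

import numpy as np

def two_part_mdl_degree(x, y, max_degree=8, sigma=1.0, bits_per_coef=32.0):
    """Pick the degree minimizing L(H) + L(D|H), a crude two-part MDL criterion.

    L(H): bits_per_coef bits for each of the (degree + 1) coefficients (an ad hoc choice).
    L(D|H): -log2 P(D|H) under Gaussian noise with known standard deviation sigma.
    """
    n = len(x)
    best_total, best_degree = None, None
    for k in range(max_degree + 1):
        coefs = np.polyfit(x, y, k)              # best-fitting polynomial within H^(k)
        resid = y - np.polyval(coefs, x)
        L_H = (k + 1) * bits_per_coef
        L_D_given_H = (n / 2) * np.log2(2 * np.pi * sigma ** 2) \
                      + np.sum(resid ** 2) / (2 * sigma ** 2 * np.log(2))
        total = L_H + L_D_given_H
        if best_total is None or total < best_total:
            best_total, best_degree = total, k
    return best_degree

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 5 * x ** 2 + 4 * x + 3 + rng.normal(0, 1.0, size=len(x))   # data from 5x^2 + 4x + 3 plus noise
print(two_part_mdl_degree(x, y))                               # typically prints 2

Low degrees are rejected because their large residuals inflate L(D|H); high degrees are rejected because each extra coefficient costs 32 bits while buying almost no reduction in L(D|H).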
1.4 Crude and Refined MDL
Crude MDL picks the H minimizing the sum L(H) + L(D|H). To make this proce-
dure well-defined, we need to agree on precise definitions for the codes (description
methods) giving rise to lengths L(D|H) and L(H). We now discuss these codes in
more detail. We will see that the definition of L(H) is problematic, indicating that
we somehow need to “refine” our crude MDL principle.
Definition of L(D|H) Consider a two-part code as described above, and assume
for the time being that all H under consideration define probability distributions.
If H is a polynomial, we can turn it into a distribution by making the additional
assumption that the Y-values are given by Y = H(X) + Z, where Z is a normally
distributed noise term.
For each H we need to define a code with length L(· | H) such that L(D|H)
can be interpreted as “the code length of D when encoded with the help of H.” It
turns out that for probabilistic hypotheses, there is only one reasonable choice for
this code. It is the so-called Shannon-Fano code, satisfying, for all data sequences
D, L(D|H) = −log P(D|H), where P(D|H) is the probability mass or density of
D according to H (such a code always exists; see Chapter 2, Section 2.1).
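For the polynomial hypotheses above, this prescription can be written out explicitly. If the Y-values are modeled as Y = H(X) + Z with Z Gaussian of known standard deviation σ (the known-σ assumption is made here only to keep the formula short), then, up to a constant that depends only on the precision with which the y-values are recorded and not on H, L(D|H) = −log₂ P(D|H) = (n/2) log₂(2πσ²) + Σᵢ (yᵢ − H(xᵢ))² / (2σ² ln 2) bits. Minimizing this code length over the members of a model is thus the same as minimizing the familiar sum of squared errors; it is also the quantity used for L(D|H) in the degree-selection sketch of Section 1.3.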
Definition of L(H): A Problem for Crude MDL It is more problematic to
find a good code for hypotheses H. Some authors have simply used ‘intuitively
reasonable’ codes in the past, but this is not satisfactory: since the description
length L(H) of any fixed point hypothesis H can be very large under one code,
but quite short under another, our procedure is in danger of becoming arbitrary.
Instead, we need some additional principle for designing a code for H. In the first
publications on MDL [Rissanen 1978, 1983], it was advocated to choose some sort
of minimax code for H, minimizing, in some precisely defined sense, the shortest
worst-case total description length L(H) + L(D|H), where the worst case is over
all possible data sequences. Thus, the MDL principle is employed at a “metalevel”
to choose a code for H. However, this code requires a cumbersome discretization
of the model space H, which is not always feasible in practice. Alternatively,
Barron [1985] encoded H by the shortest computer program that, when input
D, computes P (D|H). While it can be shown that this leads to similar code
lengths, it is computationally problematic. Later, Rissanen [1984] realized that
these problems could be sidestepped by using a one-part rather than a two-part
code. This development culminated in 1996 in a completely precise prescription of
MDL for many, but certainly not all, practical situations [Rissanen 1996]. We call
this modern version of MDL refined MDL:
Refined MDL In refined MDL, we associate a code for encoding D not with
a single H ∈ H, but with the full model H. Thus, given model H, we encode
data not in two parts but we design a single one-part code with lengths L̄(D|H).
This code is designed such that whenever there is a member of (parameter in) H
that fits the data well, in the sense that L(D|H) is small, then the code length
L̄(D|H) will also be small. Codes with this property are called universal codes in the
information-theoretic literature [Barron, Rissanen, and Yu 1998]. Among all such
universal codes, we pick the one that is minimax optimal in a sense made precise
in Chapter 2, Section 2.4. For example, the set H^(3) of third-degree polynomials
is associated with a code with lengths L̄(· | H^(3)) such that, the better the data
D are fit by the best-fitting third-degree polynomial, the shorter the code length
L̄(D | H^(3)). L̄(D | H) is called the stochastic complexity of the data given the model.
Parametric Complexity The second fundamental concept of refined MDL is the
parametric complexity of a parametric model H which we denote by COMP(H).
This is a measure of the ‘richness’ of model H, indicating its ability to fit random
data. This complexity is related to the degrees of freedom in H, but also to the
geometric structure of H; see Example 1.4. To see how it relates to stochastic
complexity, let, for given data D, Ĥ denote the distribution in H which maximizes
the probability, and hence minimizes the code length L(D | Ĥ) of D. It turns out
that
Stochastic complexity of D given H = L(D | Ĥ) + COMP(H).
Refined MDL model selection between two parametric models (such as the models
of first- and second-degree polynomials) now proceeds by selecting the model such
that the stochastic complexity of the given data D is smallest. Although we used
a one-part code to encode data, refined MDL model selection still involves a
tradeoff between two terms: a goodness-of-fit term L(D | Ĥ) and a complexity term
COMP(H). However, because we do not explicitly encode hypotheses H anymore,
there is no arbitrariness anymore. The resulting procedure can be interpreted in
several different ways, some of which provide us with rationales for MDL beyond
the pure coding interpretation (see Chapter 2, Sections 2.5.1–2.5.4):
1. Counting/differential geometric interpretation. The parametric complex-
ity of a model is the logarithm of the number of essentially different, distinguishable
point hypotheses within the model.
2. Two-part code interpretation. For large samples, the stochastic complexity
can be interpreted as a two-part code length of the data after all, where hypotheses
H are encoded with a special code that works by first discretizing the model space
H into a set of “maximally distinguishable hypotheses,” and then assigning equal
code length to each of these.
3. Bayesian interpretation. In many cases, refined MDL model selection coin-
cides with Bayes factor model selection based on a noninformative prior such as
Jeffreys’ prior [Bernardo and Smith 1994].
4. Prequential interpretation. Refined MDL model selection can be interpreted
as selecting the model with the best predictive performance when sequentially
predicting unseen test data, in the sense described in Chapter 2, Section 2.5.4.
This makes it an instance of Dawid’s [1984] prequential model validation and also
relates it to cross-validation methods.
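To make these two quantities concrete, consider the simplest case, the Bernoulli model H of all biased-coin distributions. For binary data of length n, the minimax-optimal universal code of Chapter 2, Section 2.4 (the normalized maximum likelihood code) has COMP(H) equal to log₂ of the sum, over all 2ⁿ binary sequences z, of the maximized likelihood of z. The sketch below, an added illustration rather than part of the original text, computes this sum by grouping sequences according to their number of 1s.

import math

def bernoulli_parametric_complexity(n):
    """COMP(H) for the Bernoulli model: log2 of the sum, over all 2**n sequences,
    of the maximized likelihood (the normalizer of the NML code of Chapter 2)."""
    total = 0.0
    for k in range(n + 1):                      # k ones; the ML estimate is theta = k/n
        theta = k / n
        max_likelihood = theta ** k * (1 - theta) ** (n - k)   # 0**0 == 1 covers k = 0 and k = n
        total += math.comb(n, k) * max_likelihood
    return math.log2(total)                     # grows like 0.5 * log2(n) plus a constant

def stochastic_complexity(seq):
    """-log2 P_thetahat(seq) + COMP(H) for a binary sequence seq of 0s and 1s."""
    n, k = len(seq), sum(seq)
    theta = k / n
    fit = -(k * math.log2(theta) if k else 0.0) \
          - ((n - k) * math.log2(1 - theta) if n - k else 0.0)
    return fit + bernoulli_parametric_complexity(n)

print(bernoulli_parametric_complexity(100))     # the model's parametric complexity at n = 100
print(stochastic_complexity([0, 0, 0, 1] * 25)) # stochastic complexity of a length-100 sequence

Model selection between, say, the Bernoulli model and a richer Markov model would then compare the corresponding stochastic complexities of the same data.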
Refined MDL allows us to compare models of different functional form. It even
accounts for the phenomenon that different models with the same number of
parameters may not be equally “complex”:
Example 1.4 Consider two models from psychophysics describing the relationship
between physical dimensions (e.g., light intensity) and their psychological counter-
parts (e.g., brightness) [Myung, Balasubramanian, and Pitt 2000]: y = ax^b + Z
(Stevens's model) and y = a ln(x + b) + Z (Fechner's model), where Z is a nor-
mally distributed noise term. Both models have two free parameters; nevertheless,
it turns out that in a sense, Stevens’s model is more flexible or complex than Fech-
ner’s. Roughly speaking, this means there are a lot more data patterns that can be
explained by Stevens’s model than can be explained by Fechner’s model. Myung
and co-workers [2000] generated many samples of size 4 from Fechner’s model, using
some fixed parameter values. They then fitted both models to each sample. In 67%
of the trials, Stevens’s model fitted the data better than Fechner’s, even though the
latter generated the data. Indeed, in refined MDL, the ‘complexity’ associated with
Stevens’s model is much larger than the complexity associated with Fechner’s, and
if both models fit the data equally well, MDL will prefer Fechner’s model.
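The flexibility difference is easy to observe in a small simulation in the spirit of Myung and co-workers' experiment. The sketch below is an added illustration: the generating parameter values, noise level, and stimulus values are assumptions (the original ones are not stated here), so the exact percentage will differ from the reported 67%. Each model is fitted by least squares, with the linear parameter a solved in closed form and the nonlinear parameter b searched over a grid.

import numpy as np

rng = np.random.default_rng(1)
x = np.array([1.0, 2.0, 3.0, 4.0])            # four stimulus intensities (assumed values)

def best_rss(y, basis, b_grid):
    """Smallest residual sum of squares for y ~ a * basis(b), with a in closed form
    and b searched over a grid; a simple way to fit both two-parameter models."""
    best = np.inf
    for b in b_grid:
        f = basis(b)
        a = (f @ y) / (f @ f)
        best = min(best, float(np.sum((y - a * f) ** 2)))
    return best

b_grid = np.linspace(0.05, 5.0, 200)           # search range for b (an assumption)
stevens = lambda b: x ** b                     # Stevens's model:  y = a * x**b + Z
fechner = lambda b: np.log(x + b)              # Fechner's model:  y = a * ln(x + b) + Z

wins, trials = 0, 2000
for _ in range(trials):
    y = 2.0 * np.log(x + 1.0) + rng.normal(0, 0.2, size=4)   # data generated by Fechner's model
    if best_rss(y, stevens, b_grid) < best_rss(y, fechner, b_grid):
        wins += 1                              # Stevens's model fits this sample better
print(f"Stevens's model fits better in {100 * wins / trials:.0f}% of the trials")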
Summarizing, refined MDL removes the arbitrary aspect of crude, two-part code
MDL and associates parametric models with an inherent ‘complexity’ that does not
depend on any particular description method for hypotheses. We should, however,
warn the reader that we only discussed a special, simple situation in which we
compared a finite number of parametric models that satisfy certain regularity
conditions. Whenever the models do not satisfy these conditions, or if we compare
an infinite number of models, then the refined ideas have to be extended. We then
obtain a “general” refined MDL principle, which employs a combination of one-part
and two-part codes.
1.5 The MDL Philosophy
The first central MDL idea is that every regularity in data may be used to compress
those data; the second central idea is that learning can be equated with finding
regularities in data. Whereas the first part is relatively straightforward, the second
part of the idea implies that methods for learning from data must have a clear
interpretation independent of whether any of the models under consideration is
“true” or not. Quoting Rissanen [1989], the main originator of MDL:
‘We never want to make the false assumption that the observed data actually were
generated by a distribution of some kind, say Gaussian, and then go on to analyze the
consequences and make further deductions. Our deductions may be entertaining but quite
irrelevant to the task at hand, namely, to learn useful properties from the data.’
- Jorma Rissanen, 1989
Based on such ideas, Rissanen has developed a radical philosophy of learning and
statistical inference that is considerably different from the ideas underlying main-
stream statistics, both frequentist and Bayesian. We now describe this philosophy
in more detail:
1. Regularity as Compression. According to Rissanen, the goal of inductive
inference should be to ‘squeeze out as much regularity as possible’ from the given
data. The main task for statistical inference is to distill the meaningful information
present in the data, that is, to separate structure (interpreted as the regularity, the
‘meaningful information’) from noise (interpreted as the ‘accidental information’).
For the three sequences of Section 1.2, this would amount to the following: the
first sequence would be considered as entirely regular and "noiseless." The second
sequence would be considered as entirely random: all information in the sequence
is accidental, there is no structure present. In the third sequence, the structural part
would (roughly) be the pattern that four times as many 0s as 1s occur; given this
regularity, the description of exactly which of all sequences with four times as many
0s as 1s occurs is the accidental information.
2. Models as Languages. Rissanen interprets models (sets of hypotheses) as
nothing more than languages for describing useful properties of the data; a model
H is identified with its corresponding universal code L̄(· | H). Different individual
hypotheses within the models express different regularities in the data, and may
simply be regarded as statistics, that is, summaries of certain regularities in the
data. These regularities are present and meaningful independently of whether some
H
∈ H is the “true state of nature” or not. Suppose that the model H under
consideration is probabilistic. In traditional theories, one typically assumes that
some P
∈ H generates the data, and then ‘noise’ is defined as a random quantity
relative to this P
. In the MDL view, ‘noise’ is defined relative to the model H as
the residual number of bits needed to encode the data once the model H is given.
Thus, noise is not a random variable: it is a function only of the chosen model and
the actually observed data. Indeed, there is no place for a “true distribution” or a
“true state of nature” in this view: there are only models and data. To bring
out the difference from the ordinary statistical viewpoint, consider the phrase ‘these
experimental data are quite noisy.’ According to a traditional interpretation, such a
statement means that the data were generated by a distribution with high variance.
According to the MDL philosophy, such a phrase means only that the data are not
compressible with the currently hypothesized model; as a matter of principle, it
can never be ruled out that there exists a different model under which the data are
very compressible (not noisy) after all!
3. We Have Only the Data. Many (but not all⁶) other methods of inductive
inference are based on the idea that there exists some “true state of nature,”
typically a distribution assumed to lie in some model H. The methods are then
designed as a means to identify or approximate this state of nature based on as
little data as possible. According to Rissanen,⁷ such methods are fundamentally
flawed. The main reason is that the methods are designed under the assumption
that the true state of nature is in the assumed model H, which is often not the case.
Therefore, such methods only admit a clear interpretation under assumptions that
are typically violated in practice. Many cherished statistical methods are designed in
this way: we mention hypothesis testing, minimum-variance unbiased estimation,
several nonparametric methods, and even some forms of Bayesian inference; see
Example 2.22. In contrast, MDL has a clear interpretation which depends only on
the data, and not on the assumption of any underlying “state of nature.”
Example 1.5 (Models That Are Wrong, Yet Useful) Even though the models
under consideration are often wrong, they can nevertheless be very useful. Examples are
the successful ‘naive Bayes’ model for spam filtering, hidden Markov models for speech
recognition (is speech a stationary ergodic process? probably not), and the use of linear
models in econometrics and psychology. Since these models are evidently wrong, it seems
strange to base inferences on them using methods that are designed under the assumption
that they contain the true distribution. To be fair, we should add that domains such as
spam filtering and speech recognition are not what the fathers of modern statistics had
in mind when they designed their procedures; they were usually thinking about much
simpler domains, where the assumption that some distribution P∗ ∈ H is “true” may not
4. MDL and Consistency. Let H be a probabilistic model, such that each
P ∈ H is a probability distribution. Roughly, a statistical procedure is called
consistent relative to H if, for all P
∈H, the following holds: suppose data are
distributed according to P
. Then given enough data, the learning method will
learn a good approximation of P
with high probability. Many traditional statistical
methods have been designed with consistency in mind (Chapter 2, Section 2.2).
The fact that in MDL, we do not assume a true distribution may suggest that
we do not care about statistical consistency. But this is not the case: we would
still like our statistical method to be such that in the idealized case, where one of
the distributions in one of the models under consideration actually generates the
data, our method is able to identify this distribution, given enough data. If even
in the idealized special case where a ‘truth’ exists within our models, the method
fails to learn it, then we certainly cannot trust it to do something reasonable in
the more general case, where there may not be a “true distribution” underlying the
data at all. So: consistency is important in the MDL philosophy, but it is used as
a sanity check (for a method that has been developed without making distributional
assumptions) rather than as a design principle.
In fact, mere consistency is not sufficient. We would like our method to converge
to the imagined true P
fast, based on as small a sample as possible. Two-part code
MDL with ‘clever’ codes achieves good rates of convergence in this sense (Barron
and Cover [1991], complemented by Zhang [2004], show that in many situations,
the rates are minimax optimal). The same seems to be true for refined one-part
code MDL [Barron et al. 1998], although there is at least one surprising exception
where inference based on the normalized maximum likelihood (NML) and Bayesian
universal model behaves abnormally; see Csiszár and Shields [2000] for the details.
Summarizing this section, the MDL philosophy is quite agnostic about whether
any of the models under consideration is “true”, or whether something like a “true
distribution” even exists. Nevertheless, it has been suggested [Webb 1996; Domingos
1999] that MDL embodies a naive belief that “simple models are a priori more likely
to be true than complex models.” Below we explain why such claims are mistaken.
1.6 MDL and Occam’s Razor
When two models fit the data equally well, MDL will choose the one that is the
“simplest” in the sense that it allows for a shorter description of the data. As such,
it implements a precise form of Occam’s razor, even though, as more and more
data become available, the model selected by MDL may become more and more
‘complex’! Occam’s razor is sometimes criticized for being either (1) arbitrary or
(2) false [Webb 1996; Domingos 1999]. Do these criticisms apply to MDL as well?
“1. Occam’s Razor (and MDL) Is Arbitrary” Because “description length”
is a syntactic notion, it may seem that MDL selects an arbitrary model: different
codes would have led to different description lengths, and therefore, to different
models. By changing the encoding method, we can make ‘complex’ things ‘simple’
and vice versa. This overlooks the fact that we are not allowed to use just any code
we like! ‘Refined’ MDL tells us to use a specific code, independent of any specific
parameterization of the model, leading to a notion of complexity that can also
be interpreted without any reference to ‘description lengths’ (see also Chapter 2,
Section 2.9.1).
“2. Occam’s Razor Is False” It is often claimed that Occam’s razor is false:
we often try to model real-world situations that are arbitrarily complex, so why
should we favor simple models? In the words of Webb [1996], “What good are simple
models of a complex world?”⁸
The short answer is: even if the true data-generating machinery is very complex,
it may be a good strategy to prefer simple models for small sample sizes. Thus,
MDL (and the corresponding form of Occam’s razor) is a strategy for inferring
models from data (“choose simple models at small sample sizes”), not a statement
about how the world works (“simple models are more likely to be true”); indeed,
a strategy cannot be true or false; it is “clever” or “stupid.” And the strategy of
preferring simpler models is clever even if the data-generating process is highly
complex, as illustrated by the following example:
Example 1.6 (“Infinitely” Complex Sources) Suppose that data are subject
to the law Y = g(X) + Z, where g is some continuous function and Z is some
noise term with mean 0. If g is not a polynomial, but X only takes values in a
finite interval, say [−1, 1], we may still approximate g arbitrarily well by taking
polynomials of higher and higher degree. For example, let g(x) = exp(x). Then,
if we use MDL to learn a polynomial for data D = ((x₁, y₁), ..., (xₙ, yₙ)), the
degree of the polynomial f̈^(n) selected by MDL at sample size n will increase with
n, and with high probability, f̈^(n) converges to g(x) = exp(x) in the sense that
max_{x ∈ [−1,1]} |f̈^(n)(x) − g(x)| → 0. Of course, if we had better prior knowledge about
the problem, we could have tried to learn g using a model class M containing the
function y = exp(x). But in general, both our imagination and our computational
resources are limited, and we may be forced to use imperfect models.
If, based on a small sample, we choose the best-fitting polynomial f̂ within the
set of all polynomials, then, even though f̂ will fit the data very well, it is likely
to be quite unrelated to the “true” g, and f̂ may lead to disastrous predictions of
future data. The reason is that, for small samples, the set of all polynomials is very
large compared to the set of possible data patterns that we might have observed.
Therefore, any particular data pattern can only give us very limited information
about which high-degree polynomial best approximates g. On the other hand, if
we choose the best-fitting f̂° in some much smaller set such as the set of second-
degree polynomials, then it is highly probable that the prediction quality (mean
squared error) of f̂° on future data is about the same as its mean squared error on
the data we observed: the size (complexity) of the contemplated model is relatively
small compared to the set of possible data patterns that we might have observed.
Therefore, the particular pattern that we do observe gives us a lot of information
on what second-degree polynomial best approximates g.
Thus, (a) f̂° typically leads to better predictions of future data than f̂; and
(b) unlike f̂, f̂° is reliable in that it gives a correct impression of how well it
will predict future data, even if the “true” g is ‘infinitely’ complex. This idea does
not just appear in MDL, but is also the basis of Vapnik’s [1998] structural risk
minimization approach and many standard statistical methods for nonparametric
inference. In such approaches one acknowledges that the data-generating machin-
ery can be infinitely complex (e.g., not describable by a finite-degree polynomial).
Nevertheless, it is still a good strategy to approximate it by simple hypotheses
(low-degree polynomials) as long as the sample size is small. Summarizing:
The Inherent Difference between Under- and Overfitting
If we choose an overly simple model for our data, then the best-fitting point
hypothesis within the model is likely to be almost the best predictor, within
the simple model, of future data coming from the same source. If we overfit
(choose a very complex model) and there is noise in our data, then, even if
the complex model contains the “true” point hypothesis, the best-fitting point
hypothesis within the model is likely to lead to very bad predictions of future
data coming from the same source.
This statement is very imprecise and is meant more to convey the general idea
than to be completely true. As will become clear in Chapter 2, Section 2.9.1, it
becomes provably true if we use MDL’s measure of model complexity; we measure
prediction quality by logarithmic loss; and we assume that one of the distributions
in H actually generates the data.
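The contrast described in the box is easy to observe numerically. The following sketch is an added illustration under assumed settings (sample size 10, noise standard deviation 0.1): it draws a small sample from Y = exp(X) + Z on [−1, 1], fits both a second-degree polynomial and the interpolating (n − 1)-degree polynomial, and compares their mean squared errors on the training sample and on fresh data from the same source.

import numpy as np

rng = np.random.default_rng(2)
n = 10                                           # small sample size (assumed)

def sample(size):
    x = rng.uniform(-1, 1, size)
    return x, np.exp(x) + rng.normal(0, 0.1, size)   # the "infinitely complex" truth plus noise

x_train, y_train = sample(n)
x_test, y_test = sample(1000)                    # fresh data from the same source

for degree in (2, n - 1):                        # simple model vs. interpolating model
    coefs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    mse_test = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree}: training MSE {mse_train:.4f}, test MSE {mse_test:.4f}")

The degree-2 fit typically has nearly the same error on the training sample and on the fresh data, whereas the degree-9 fit has almost zero training error but a much larger test error: the best-fitting point hypothesis in the overly rich model tells us little about future data.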
1.7 History
The MDL principle has mainly been developed by Jorma Rissanen in a series
of papers starting with [Rissanen 1978]. It has its roots in the theory of Kol-
mogorov or algorithmic complexity [Li and Vitányi 1997], developed in the 1960s
by Solomonoff [1964], Kolmogorov [1965], and Chaitin [1966, 1969]. Among these
authors, Solomonoff (a former student of the famous philosopher of science, Rudolf
Carnap) was explicitly interested in inductive inference. The 1964 paper contains
explicit suggestions on how the underlying ideas could be made practical, thereby
foreshadowing some of the later work on two-part MDL. Although Rissanen was
not aware of Solomonoff’s work at the time, Kolmogorov’s [1965] paper did serve
as an inspiration for Rissanen’s [1978] development of MDL.
Another important inspiration for Rissanen was Akaike’s [1973] information
criterion (AIC) method for model selection, essentially the first model selection
method based on information-theoretic ideas. Even though Rissanen was inspired
by AIC, both the actual method and the underlying philosophy are substantially
different from MDL.
MDL is much more closely related to the minimum message length (MML)
principle, developed by Wallace and his co-workers in a series of papers starting with
the groundbreaking [Wallace and Boulton 1968]; other milestones were [Wallace and
Boulton 1975] and [Wallace and Freeman 1987]. Remarkably, Wallace developed
his ideas without being aware of the notion of Kolmogorov complexity. Although
Rissanen became aware of Wallace’s work before the publication of [Rissanen 1978],
he developed his ideas mostly independently, being influenced rather by Akaike and
Kolmogorov. Indeed, despite the close resemblance of both methods in practice, the
underlying philosophy is quite different (Chapter 2, Section 2.8).
The first publications on MDL only mention two-part codes. Important progress
was made by Rissanen [1984], in which prequential codes are employed for the first
time, and by [Rissanen 1987], which introduced the Bayesian mixture codes into MDL. This
led to the development of the notion of stochastic complexity as the shortest code
length of the data given a model [Rissanen 1986, 1987]. However, the connection
to Shtarkov’s normalized maximum likelihood code was not made until 1996, and
this prevented the full development of the notion of “parametric complexity.”
In the meantime, in his impressive Ph.D. thesis, Barron [1985] showed how a
specific version of the two-part code criterion has excellent frequentist statistical
consistency properties. This was extended by Barron and Cover [1991] who achieved
a breakthrough for two-part codes: they gave clear prescriptions on how to design
codes for hypotheses, relating codes with good minimax code length properties
to rates of convergence in statistical consistency theorems. Some of the ideas of
Rissanen [1987] and Barron and Cover [1991] were, as it were, unified when Rissanen
[1996] introduced a new definition of stochastic complexity based on the normalized
maximum likelihood (NML) code (Chapter 2, Section 2.4). The resulting theory was
summarized for the first time by Barron and co-workers [1998], and is called ‘refined
MDL’ in the present overview.
1.8 Summary and Outlook
We have discussed how regularity is related to data compression, and how MDL
employs this connection by viewing learning in terms of data compression. One
can make this precise in several ways; in idealized MDL one looks for the shortest
program that generates the given data. This approach is not feasible in practice,
and here we concern ourselves with practical MDL. Practical MDL comes in a crude
version based on two-part codes and in a modern, more refined version based on
the concept of universal coding. The basic ideas underlying all these approaches can
be found in the boxes spread throughout the text.
These methods are mostly applied to model selection but can also be used
for other problems of inductive inference. In contrast to most existing statistical
methodology, they can be given a clear interpretation irrespective of whether or not
there exists some “true” distribution generating data; inductive inference is seen
as a search for regular properties in (interesting statistics of) the data, and there is
no need to assume anything outside the model and the data. In contrast to what is
sometimes thought, there is no implicit belief that ‘simpler models are more likely
to be true’; MDL does embody a preference for ‘simple’ models, but this is best
seen as a strategy for inference that can be useful even if the environment is not
simple at all.
In the next chapter, we make precise both the crude and the refined versions of
practical MDL. For this, it is absolutely essential that the reader familiarizes him-
or herself with two basic notions of coding and information theory: the relation
between code length functions and probability distributions, and (for refined MDL),
the idea of universal coding; a large part of Chapter 2 is devoted to these.
Notes
1. See Section 2.8.2, Example 2.22.
2. By this we mean that a universal Turing machine can be implemented in it [Li and Vitányi
1997].
3. Strictly speaking, in our context it is not very accurate to speak of “simple” or “complex”
polynomials; instead we should call the set of first-degree polynomials “simple,” and the set of
100th-degree polynomials “complex.”
4. The terminology ‘crude MDL’ is not standard. It is introduced here for pedagogical reasons,
to make clear the importance of having a single, unified principle for designing codes. It should
be noted that Rissanen’s and Barron’s early theoretical papers on MDL already contain such
principles, albeit in a slightly different form than in their recent papers. Early practical applications
[Quinlan and Rivest 1989; Grünwald 1996] often do use ad hoc two-part codes which really are
‘crude’ in the sense defined here.
5. See the previous endnote.
6. For example, cross-validation cannot easily be interpreted in such terms of ‘a method hunting
for the true distribution.’
7. My own views are somewhat milder in this respect, but this is not the place to discuss them.
8. Quoted with permission from KDD Nuggets 96,28, 1996.
References
Akaike, H. (1973). Information theory as an extension of the maximum likelihood
principle. In B. N. Petrov and F. Csaki (Eds.), Second International Sympo-
sium on Information Theory, pp. 267–281. Budapest: Akademiai Kiado.
Barron, A.R. and T. Cover (1991). Minimum complexity density estimation.
IEEE Transactions on Information Theory, 37 (4), 1034–1054.
Barron, A.R. (1985). Logically Smooth Density Estimation. Ph.D. thesis, De-
partment of Electrical Engineering, Stanford University, Stanford, CA.
Barron, A.R., J. Rissanen, and B. Yu (1998). The Minimum Description Length
Principle in coding and modeling. Special Commemorative Issue: Information
Theory: 1948-1998. IEEE Transactions on Information Theory, 44 (6), 2743–
2760.
Bernardo, J., and A. Smith (1994). Bayesian theory. New York: Wiley.
Chaitin, G. (1966). On the length of programs for computing finite binary
sequences. Journal of the ACM, 13, 547–569.
Chaitin, G. (1969). On the length of programs for computing finite binary
sequences: Statistical considerations. Journal of the ACM, 16, 145–159.
Csiszár, I., and P. Shields (2000). The consistency of the BIC Markov order
estimator. Annals of Statistics, 28, 1601–1619.
Dawid, A. (1984). Present position and potential developments: Some personal
views, statistical theory, the prequential approach. Journal of the Royal
Statistical Society, Series A, 147(2), 278–292.
Domingos, P. (1999). The role of Occam’s razor in knowledge discovery. Data
Mining and Knowledge Discovery, 3 (4), 409–425.
Gács, P., J. Tromp, and P. Vitányi (2001). Algorithmic statistics. IEEE Trans-
actions on Information Theory, 47(6), 2464–2479.
Grünwald, P.D. (1996). A minimum description length approach to grammar in-
ference. In G. Scheler, S. Wermter, and E. Riloff (Eds.), Connectionist, Statis-
tical and Symbolic Approaches to Learning for Natural Language Processing,
Volume 1040 in Springer Lecture Notes in Artificial Intelligence, pp. 203–216.
Berlin: Springer-Verlag
Kolmogorov, A. (1965). Three approaches to the quantitative definition of infor-
mation. Problems of Information Transmission, 1 (1), 1–7.
Li, M., and P. Vitányi (1997). An Introduction to Kolmogorov Complexity and
Its Applications, 2nd edition. New York: Springer-Verlag.
Myung, I.J., V. Balasubramanian, and M.A. Pitt (2000). Counting probability
distributions: Differential geometry and model selection. Proceedings of the
National Academy of Sciences USA, 97, 11170–11175.
Quinlan, J., and R. Rivest (1989). Inferring decision trees using the minimum
description length principle. Information and Computation, 80, 227–248.
Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge, UK:
Cambridge University Press.
Rissanen, J. (1978). Modeling by the shortest data description. Automatica, 14,
465–471.
Rissanen, J. (1983). A universal prior for integers and estimation by minimum
description length. Annals of Statistics, 11, 416–431.
Rissanen, J. (1984). Universal coding, information, prediction and estimation.
IEEE Transactions on Information Theory, 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Annals of Statistics,
14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. Journal of the Royal Statistical Soci-
ety, Series B, 49, 223–239. Discussion: pp. 252–265.
Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. Singapore:
World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans-
actions on Information Theory, 42(1), 40–47.
Solomonoff, R. (1964). A formal theory of inductive inference, part 1 and part 2.
Information and Control, 7, 1–22, 224–254.
Solomonoff, R. (1978). Complexity-based induction systems: Comparisons and
convergence theorems. IEEE Transactions on Information Theory, 24, 422–
432.
Vapnik, V. (1998). Statistical Learning Theory. New York: John Wiley.
Wallace, C., and D. Boulton (1968). An information measure for classification.
Computer Journal, 11, 185–195.
Wallace, C., and D. Boulton (1975). An invariant Bayes method for point
estimation. Classification Society Bulletin, 3 (3), 11–34.
Wallace, C., and P. Freeman (1987). Estimation and inference by compact coding.
Journal of the Royal Statistical Society, Series B, 49, 240–251. Discussion: pp.
252–265.
Webb, G. (1996). Further experimental evidence against the utility of Occam’s
razor. Journal of Artificial Intelligence Research, 4, 397–417.
Zhang, T. (2004). On the convergence of MDL density estimation. In Proceedings
of the Seventeenth Annual Conference on Computational Learning Theory
(COLT’ 04). Berlin: Springer-Verlag.
... However, if a model has too many degrees of freedom, it can overfit to noise in the observed data which may not reflect the true distribution. In approaching this problem, the machine learning literature points consistently to the importance of compression: In order to build a system that effectively predicts the future, the best approach is to ensure that that system accounts for past observations in the most compact or economical way possible [28][29][30][31]. This Occam's Razor-like philosophy is formalized by the minimum description length (MDL) principle, which prescribes finding the shortest solution written in a general-purpose programming language which accurately reproduces the data, an idea rooted in Kolmogorov complexity [32]. ...
... L(M) here is the description length of the model, that is, the number of bits it would require to encode that model, a measure of its complexity [30]. L(D|M), meanwhile, is the description length of the data given the model, that is, an information measure indicating how much the data deviates from what is predicted by the model. ...
... First, a great deal depends on the exact nature of the computational bottleneck hypothesized. At the center of our account is a measure related to algorithmic complexity [24,30,33], a measure that differs from the mutual information constraint that has provided the usual focus for value-complexity tradeoff theories [23,89] (see S1 Methods). Second and still more important, the MDL-C framework does not anchor on the assumption of fixed and insuperable resource restrictions. ...
Article
Full-text available
Dual-process theories play a central role in both psychology and neuroscience, figuring prominently in domains ranging from executive control to reward-based learning to judgment and decision making. In each of these domains, two mechanisms appear to operate concurrently, one relatively high in computational complexity, the other relatively simple. Why is neural information processing organized in this way? We propose an answer to this question based on the notion of compression. The key insight is that dual-process structure can enhance adaptive behavior by allowing an agent to minimize the description length of its own behavior. We apply a single model based on this observation to findings from research on executive control, reward-based learning, and judgment and decision making, showing that seemingly diverse dual-process phenomena can be understood as domain-specific consequences of a single underlying set of computational principles.
... Rooted in information theory, the minimum description length (MDL) principle states that the optimal model is the one that can compress the data most [41,15]. Precisely, given the dataset D = (x n , y n ), the MDL-optimal model is defined as ...
... According to the Kraft's inequality [7], L(M ) can also be regarded as a prior probability distribution defined on the model class. Further, P M (.) is the so-called universal distribution [15]: as M parameterizes a family of probability distributions, denoted as P M,θ , P M (.) can be regarded as a "representative" distribution such that the maximum regret, defined as max y n {max θ log 2 P M,θ (y n |x n ) − log 2 P M (y n |x n )}, can be bounded by nϵ > 0, ∀ϵ > 0 as n → ∞ [15], in which the maximum is defined over all possible values for y n in the domain of the target variable. Intuitively, the regret is the difference between the log-likelihood of P M (y n |x n ) (which does not depend on θ) and the maximum log-likelihood max θ log 2 P M,θ (y n |x n ). ...
Preprint
Full-text available
Conditional density estimation (CDE) goes beyond regression by modeling the full conditional distribution, providing a richer understanding of the data than just the conditional mean in regression. This makes CDE particularly useful in critical application domains. However, interpretable CDE methods are understudied. Current methods typically employ kernel-based approaches, using kernel functions directly for kernel density estimation or as basis functions in linear models. In contrast, despite their conceptual simplicity and visualization suitability, tree-based methods -- which are arguably more comprehensible -- have been largely overlooked for CDE tasks. Thus, we propose the Conditional Density Tree (CDTree), a fully non-parametric model consisting of a decision tree in which each leaf is formed by a histogram model. Specifically, we formalize the problem of learning a CDTree using the minimum description length (MDL) principle, which eliminates the need for tuning the hyperparameter for regularization. Next, we propose an iterative algorithm that, although greedily, searches the optimal histogram for every possible node split. Our experiments demonstrate that, in comparison to existing interpretable CDE methods, CDTrees are both more accurate (as measured by the log-loss) and more robust against irrelevant features. Further, our approach leads to smaller tree sizes than existing tree-based models, which benefits interpretability.
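The "representative" distribution with bounded maximum regret described in the excerpt above has a simple closed form in toy cases. The sketch below is my own illustration (unrelated to the CDTree implementation): for the Bernoulli model class, the normalized maximum likelihood (NML) distribution attains the same regret, log2 COMP(n), on every sequence, so that value is also the maximum regret.

```python
from math import comb, log2

def bernoulli_nml_regret(n):
    """log2 of the NML normalizer COMP(n) = sum_k C(n, k) (k/n)^k ((n-k)/n)^(n-k)."""
    comp = sum(
        comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
        for k in range(n + 1)
    )
    return log2(comp)

# The regret grows roughly like 0.5 * log2(n) plus a constant.
print([round(bernoulli_nml_regret(n), 3) for n in (10, 100, 1000)])
```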
... This term is intractable to compute, as it requires an enumeration over all programs that output p, but the conditional complexity $K(D \mid p)$ can be easily computed. According to (Grünwald, 2007), if the dataset is sufficiently large and the generative model is close to the true data distribution, the optimal method for compressing a data point $d_i$ uses only $-\log_2 p(d_i)$ bits (e.g., using an arithmetic coding scheme, Witten et al., 1987), as in the case of Shannon information (Shannon, 2001). As such, we have $K(D \mid p) \simeq -\sum_{d_i \in D} \log_2 p(d_i)$, which is the negative log-likelihood of the data under $p(d)$, a commonly used objective function in ML. ...
... This encoding scheme on the RHS is referred to as a 2-part code (Grünwald, 2007). For large datasets, we have equality when the model's description length and error are jointly minimized, which occurs when the model $p_\theta(x)$ is equivalent to the true distribution $p(x)$: ...
Preprint
Full-text available
The goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best: a principle called Occam's razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam's razor and in-context learning: an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved. We make our code available at https://github.com/3rdCore/PrequentialCode.
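The equivalence between next-token prediction loss and prequential coding mentioned in this abstract is easy to verify in a minimal setting. The sketch below is my own illustration, not the paper's code: a Laplace-smoothed Bernoulli predictor encodes each symbol with -log2 of its predictive probability given the past, so the running total is at once a cumulative prediction loss in bits and a valid code length for the whole sequence.

```python
from math import log2

def prequential_code_length(bits):
    """Total code length (in bits) of a binary sequence under a Laplace predictor."""
    ones = zeros = 0
    total = 0.0
    for b in bits:
        p_one = (ones + 1) / (ones + zeros + 2)  # rule of succession
        total += -log2(p_one if b == 1 else 1.0 - p_one)
        ones += b
        zeros += 1 - b
    return total

print(prequential_code_length([1, 0, 1, 1, 1, 0, 1, 1]))
```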
... We adopt an information theoretic view of SAEs, inspired by Grünwald (2007), which views SAEs as explanatory tools that compress neural activations into communicable explanations. This view suggests that sparsity may appear as a special case of a larger objective: minimising the description length of the explanations. ...
... Second, we send them a decoder network Dec(·) that recompiles these activations back to (some close approximation of) the neural activations, x = Dec(z). This is closely analogous to two-part coding schemes (Grünwald, 2007) for transmitting a program via its source code and a compiler program that converts the source code into an executable format. Together the SAE activations and the decoder provide an explanation of the neural activations, based on the definition below. ...
Preprint
Full-text available
Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal representations of neural networks. However, naively optimising SAEs for reconstruction loss and sparsity results in a preference for SAEs that are extremely wide and sparse. We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms for communicating explanations of neural activations. We appeal to the Minimal Description Length (MDL) principle to motivate explanations of activations which are both accurate and concise. We further argue that interpretable SAEs require an additional property, "independent additivity": features should be able to be understood separately. We demonstrate an example of applying our MDL-inspired framework by training SAEs on MNIST handwritten digits and find that SAE features representing significant line segments are optimal, as opposed to SAEs with features for memorised digits from the dataset or small digit fragments. We argue that using MDL rather than sparsity may avoid potential pitfalls with naively maximising sparsity such as undesirable feature splitting and that this framework naturally suggests new hierarchical SAE architectures which provide more concise explanations.
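To make the description length of an SAE "explanation" concrete, here is a deliberately crude accounting sketch. The quantities counted, the fixed float precision, and all names are my own assumptions rather than the authors' metric: the sparse codes cost bits for which features fire plus bits for their magnitudes, and the decoder weights are a one-off cost.

```python
import numpy as np

def sae_description_length(z, decoder_weights, float_bits=16):
    """z: (n_samples, n_features) sparse codes; decoder_weights: (n_features, d)."""
    n_samples, n_features = z.shape
    nonzero = np.count_nonzero(z)
    index_bits = nonzero * np.log2(n_features)        # which features are active
    value_bits = nonzero * float_bits                 # their magnitudes
    decoder_bits = decoder_weights.size * float_bits  # one-off cost of Dec(.)
    return index_bits + value_bits + decoder_bits

rng = np.random.default_rng(0)
z = np.where(rng.random((100, 512)) < 0.02, rng.random((100, 512)), 0.0)
W = rng.standard_normal((512, 64))
print(sae_description_length(z, W))
```

Under any accounting of this kind, widening the SAE raises both the per-activation index cost and the decoder cost, which is the kind of trade-off an MDL objective makes visible while a bare sparsity penalty does not.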
... We will refer to [49] for a review of the subject including a long list of references. See [50] for a more information theoretic treatment of prior distributions. ...
Preprint
Full-text available
Since the seminal work of Kolmogorov, probability theory has been based on measure theory, where the central components are so-called probability measures, defined as measures with total mass equal to 1. In Kolmogorov’s theory, a probability measure is used to model an experiment with a single outcome that will belong to exactly one out of several disjoint sets. In this paper, we present a different basic model where an experiment results in a multiset, i.e. for each of the disjoint sets we get the number of observations in the set. This new framework is consistent with Kolmogorov’s theory, but the theory focuses on expected values rather than probabilities. We present examples from testing Goodness-of-Fit, Bayesian statistics, and quantum theory, where the shifted focus gives new insight or better performance. We also provide several new theorems that address some problems related to the change in focus.
... A particularly interesting perspective from which to approach real numbers is in terms of their information content, which can be gauged by minimum description lengths (e.g. [28,29]) in terms of digits. It follows from the discussion above that repeating decimals tend to be less complex than irrational numbers, because the former can be expressed in terms of two finite positive integer values, while irrational numbers require an infinite numeric description. ...
Preprint
Full-text available
Number theory is an interesting mathematical area focusing on integer values and arithmetic functions, especially on topics including prime factor decompositions and properties. Relatively fewer developments have been devoted to real numbers, including their rational and irrational subsets. In the present work, real numbers are approached from a particular number-theoretic perspective focusing on canonical representations. In particular, it is shown that canonical representations can be generalized to rational numbers by using np-sets, namely multisets involving non-negative and non-positive multiplicities. This np-set based approach not only allows effective representation and computation of products and divisions between rational numbers, but can also be directly related to the identification of the least common denominator and greatest common divisor of two rational numbers. In addition, it is also argued that the comparison between the canonical representations of two rational numbers can be performed in terms of the np-set Jaccard similarity index between those representations, actually corresponding to the quotient between the greatest common divisor and least common denominator. Also developed in this work is the possibility of characterizing the complexity of both rational and irrational numbers in terms of the Jaccard similarity between the number of prime factors and total multiplicity signatures obtained by successive decimal expansions of these types of real numbers.
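The np-set representation mentioned above can be sketched with nothing more than ordinary prime factorization. The example below reflects my own minimal reading of the idea, with hypothetical function names: a positive rational is stored as a signed multiset of prime exponents, so multiplication and division reduce to adding and subtracting exponents.

```python
from collections import Counter
from fractions import Fraction

def prime_exponents(n):
    """Prime-exponent multiset of a positive integer n."""
    exps, d = Counter(), 2
    while d * d <= n:
        while n % d == 0:
            exps[d] += 1
            n //= d
        d += 1
    if n > 1:
        exps[n] += 1
    return exps

def np_set(r: Fraction):
    """Signed prime-exponent representation of a positive rational r."""
    signed = Counter(prime_exponents(r.numerator))
    signed.subtract(prime_exponents(r.denominator))
    return {p: e for p, e in signed.items() if e != 0}

print(np_set(Fraction(6, 35)))  # {2: 1, 3: 1, 5: -1, 7: -1}
```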
... This encoding scheme on the RHS is referred to as a 2-part code (Grünwald, 2007). For large datasets, we have equality when the model's description length and error are jointly minimized, which occurs when the model $p_\theta(x)$ is equivalent to the true distribution $p(x)$: ...
Preprint
Full-text available
Compositionality is believed to be fundamental to intelligence. In humans, it underlies the structure of thought, language, and higher-level reasoning. In AI, compositional representations can enable a powerful form of out-of-distribution generalization, in which a model systematically adapts to novel combinations of known concepts. However, while we have strong intuitions about what compositionality is, there currently exists no formal definition for it that is measurable and mathematical. Here, we propose such a definition, which we call representational compositionality, that accounts for and extends our intuitions about compositionality. The definition is conceptually simple, quantitative, grounded in algorithmic information theory, and applicable to any representation. Intuitively, representational compositionality states that a compositional representation satisfies three properties. First, it must be expressive. Second, it must be possible to re-describe the representation as a function of discrete symbolic sequences with re-combinable parts, analogous to sentences in natural language. Third, the function that relates these symbolic sequences to the representation, analogous to semantics in natural language, must be simple. Through experiments on both synthetic and real world data, we validate our definition of compositionality and show how it unifies disparate intuitions from across the literature in both AI and cognitive science. We also show that representational compositionality, while theoretically intractable, can be readily estimated using standard deep learning tools. Our definition has the potential to inspire the design of novel, theoretically-driven models that better capture the mechanisms of compositional thought.
... If we choose the reparametrization-invariant minimax-optimal Jeffreys/Bernardo prior for $p_d(\theta)$, then even the $O(d)$ term coincides with MDL. Bayes and MDL only differ in $o(1)$ terms that vanish for $n \to \infty$ [Wal05, Grü07]. ...
Preprint
Full-text available
Machine Learning (ML) and Algorithmic Information Theory (AIT) address complexity from distinct yet complementary perspectives. This paper explores the synergy between AIT and ML, specifically examining how kernel method can effectively bridge these disciplines. By applying these methods to the problems of Clustering and Density Estimation-fundamental techniques in unsupervised learning-we demonstrate their potential to integrate AIT principles into ML algorithms. This investigation builds on our previous work, which introduced Sparse Kernel Flows from an AIT viewpoint, aiming to enhance the theoretical robustness of ML algorithms. We propose that kernel methods are not only versatile tools in ML but also pivotal in merging AIT concepts, thus providing a robust framework for refining and advancing machine learning techniques.
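The claim that Bayes with the Jeffreys prior and MDL agree up to o(1) refers to the standard asymptotic expansion for suitably regular d-dimensional models; stated here as a sketch from memory, under the usual regularity conditions and with prior $w(\theta) \propto \sqrt{\det I(\theta)}$:

```latex
-\log_2 P_{\mathrm{Bayes}}(x^n)
  = -\log_2 P_{\hat\theta(x^n)}(x^n)
  + \frac{d}{2}\log_2\frac{n}{2\pi}
  + \log_2\!\int_{\Theta}\sqrt{\det I(\theta)}\,d\theta
  + o(1)
```

The normalized maximum likelihood code length admits the same expansion, so the (d/2) log2 n term and the constant both match, leaving only o(1) discrepancies.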
Article
Intrinsically motivated information seeking is an expression of curiosity believed to be central to human nature. However, most curiosity research relies on small, Western convenience samples. Here, we analyze a naturalistic population of 482,760 readers using Wikipedia’s mobile app in 14 languages from 50 countries or territories. By measuring the structure of knowledge networks constructed by readers weaving a thread through articles in Wikipedia, we replicate two styles of curiosity previously identified in laboratory studies: the nomadic “busybody” and the targeted “hunter.” Further, we find evidence for another style—the “dancer”—which was previously predicted by a historico-philosophical examination of texts over two millennia and is characterized by creative modes of knowledge production. We identify associations, globally, between the structure of knowledge networks and population-level indicators of spatial navigation, education, mood, well-being, and inequality. These results advance our understanding of Wikipedia’s global readership and demonstrate how cultural and geographical properties of the digital environment relate to different styles of curiosity.
Article
Full-text available
Physical and functional constraints on biological networks lead to complex topological patterns across multiple scales in their organization. A particular type of higher-order network feature that has received considerable interest is network motifs, defined as statistically regular subgraphs. These may implement fundamental logical and computational circuits and are referred to as “building blocks of complex networks”. Their well-defined structures and small sizes also enable the testing of their functions in synthetic and natural biological experiments. Here, we develop a framework for motif mining based on lossless network compression using subgraph contractions. This provides an alternative definition of motif significance which allows us to compare different motifs and select the collectively most significant set of motifs as well as other prominent network features in terms of their combined compression of the network. Our approach inherently accounts for multiple testing and correlations between subgraphs and does not rely on a priori specification of an appropriate null model. It thus overcomes common problems in hypothesis testing-based motif analysis and guarantees robust statistical inference. We validate our methodology on numerical data and then apply it on synaptic-resolution biological neural networks, as a medium for comparative connectomics, by evaluating their respective compressibility and characterize their inferred circuit motifs.
Article
An attempt is made to carry out a program (outlined in a previous paper) for defining the concept of a random or patternless, finite binary sequence, and for subsequently defining a random or patternless, infinite binary sequence to be a sequence whose initial segments are all random or patternless finite binary sequences. A definition based on the bounded-transfer Turing machine is given detailed study, but insufficient understanding of this computing machine precludes a complete treatment. A computing machine is introduced which avoids these difficulties. Key words and phrases: computational complexity, sequences, random sequences, Turing machines. CR categories: 5.22, 5.5, 5.6. 1. Introduction: In this section a definition is presented of the concept of a random or patternless binary sequence based on 3-tape-symbol bounded-transfer Turing machines. These computing machines have been introduced and studied in [1], where a proposal to apply them in this manner...
Book
Ripley brings together two crucial ideas in pattern recognition: statistical methods and machine learning via neural networks. He brings unifying principles to the fore, and reviews the state of the subject. Ripley also includes many examples to illustrate real problems in pattern recognition and how to overcome them.
Article
This paper derives a measure of the goodness of a classification based on information theory. A classification is regarded as a method of economical statistical encoding of the available attribute information. The measure may be used to compare the relative goodness of classifications produced by different methods or as the basis of a classification procedure. A classification program, ‘SNOB’, has been written for the University of Sydney KDF 9 computer, and first tests show good agreement with conventional taxonomy.
Article
The systematic variation within a set of data, as represented by a usual statistical model, may be used to encode the data in a more compact form than would be possible if they were considered to be purely random. The encoded form has two parts. The first states the inferred estimates of the unknown parameters in the model, the second states the data using an optimal code based on the data probability distribution implied by those parameter estimates. Choosing the model and the estimates that give the most compact coding leads to an interesting general inference procedure. In its strict form it has great generality and several nice properties but is computationally infeasible. An approximate form is developed and its relation to other methods is explored.
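The economical-coding trade-off described in this abstract can be summarized by the classical precision argument, given here as a standard textbook sketch rather than the paper's own derivation. Encoding one parameter to precision δ costs about -log2 δ bits, while rounding the maximum-likelihood estimate to that precision inflates the data code length by roughly (1/2) n I(θ) δ² log2(e) bits; minimizing the sum over δ yields the familiar half-log-n price per parameter:

```latex
L(\delta) \approx -\log_2\delta + \tfrac{1}{2}\, n I(\theta)\,\delta^{2}\log_2 e,
\qquad
\frac{dL}{d\delta}=0 \;\Rightarrow\; \delta^{*}=\frac{1}{\sqrt{n I(\theta)}},
\qquad
L(\delta^{*}) = \tfrac{1}{2}\log_2\!\bigl(n I(\theta)\bigr) + \tfrac{1}{2}\log_2 e
```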