Universal Artificial Intelligence
Marcus Hutter
Canberra, ACT, 0200, Australia
http://www.hutter1.net/
ANU RSISE NICTA
Machine Learning Summer School
MLSS2010, 27 September – 6 October, Canberra
Marcus Hutter  2  Foundations of Intelligent Agents
Abstract
The dream of creating artiﬁcial devices that reach or outperform human
intelligence is many centuries old. This tutorial presents the elegant
parameter-free theory, developed in [Hut05], of an optimal reinforcement
learning agent embedded in an arbitrary unknown environment that
possesses essentially all aspects of rational intelligence. The theory
reduces all conceptual AI problems to pure computational questions.
How to perform inductive inference is closely related to the AI problem.
The tutorial covers Solomonoﬀ’s theory, elaborated on in [Hut07], which
solves the induction problem, at least from a philosophical and
statistical perspective.
Both theories are based on Occam’s razor quantiﬁed by Kolmogorov
complexity; Bayesian probability theory; and sequential decision theory.
TABLE OF CONTENTS
1. PHILOSOPHICAL ISSUES
2. BAYESIAN SEQUENCE PREDICTION
3. UNIVERSAL INDUCTIVE INFERENCE
4. UNIVERSAL ARTIFICIAL INTELLIGENCE
5. APPROXIMATIONS AND APPLICATIONS
6. FUTURE DIRECTION, WRAP UP, LITERATURE
PHILOSOPHICAL ISSUES
• Philosophical Problems
• What is (Artiﬁcial) Intelligence?
• How to do Inductive Inference?
• How to Predict (Number) Sequences?
• How to make Decisions in Unknown Environments?
• Occam’s Razor to the Rescue
• The Grue Emerald and Conﬁrmation Paradoxes
• What this Tutorial is (Not) About
Philosophical Issues: Abstract
I start by considering the philosophical problems concerning artiﬁcial
intelligence and machine learning in general and induction in particular.
I illustrate the problems and their intuitive solution on various (classical)
induction examples. The common principle to their solution is Occam’s
simplicity principle. Based on Occam’s and Epicurus’ principle, Bayesian
probability theory, and Turing’s universal machine, Solomonoﬀ
developed a formal theory of induction. I describe the sequential/online
setup considered in this tutorial and place it into the wider machine
learning context.
What is (Artificial) Intelligence?
Intelligence can have many faces ⇒ formal definition difficult
• reasoning • creativity • association • generalization
• pattern recognition • problem solving • memorization • planning
• achieving goals • learning • optimization • self-preservation
• vision • language processing • classification • induction • deduction • ...

What is AI?   | Thinking          | Acting
humanly       | Cognitive Science | Turing test, Behaviorism
rationally    | Laws of Thought   | Doing the Right Thing

Collection of 70+ definitions of intelligence:
http://www.vetta.org/definitions-of-intelligence/

The real world is nasty: partially unobservable, uncertain, unknown,
non-ergodic, reactive, vast, but luckily structured, ...
Informal Definition of (Artificial) Intelligence
Intelligence measures an agent's ability to achieve goals
in a wide range of environments. [S. Legg and M. Hutter]
Emergent: Features such as the ability to learn and adapt, or to
understand, are implicit in the above definition, as these capacities
enable an agent to succeed in a wide range of environments.
The science of Artificial Intelligence is concerned with the construction
of intelligent systems/artifacts/agents and their analysis.
What next? Substantiate all terms above: agent, ability, utility, goal,
success, learn, adapt, environment, ...
On the Foundations of Artiﬁcial Intelligence
• Example: Algorithm/complexity theory: The goal is to ﬁnd fast
algorithms solving problems and to show lower bounds on their
computation time. Everything is rigorously deﬁned: algorithm,
Turing machine, problem classes, computation time, ...
• Most disciplines start with an informal way of attacking a subject.
With time they get more and more formalized often to a point
where they are completely rigorous. Examples: set theory, logical
reasoning, proof theory, probability theory, inﬁnitesimal calculus,
energy, temperature, quantum ﬁeld theory, ...
• Artiﬁcial Intelligence: Tries to build and understand systems that
act intelligently, learn from experience, make good predictions, are
able to generalize, ... Many terms are only vaguely deﬁned or there
are many alternate deﬁnitions.
Induction→Prediction→Decision→Action
Induction infers general models from speciﬁc observations/facts/data,
usually exhibiting regularities or properties or relations in the latter.
Having or acquiring or learning or inducing a model of the environment
an agent interacts with allows the agent to make predictions and utilize
them in its decision process of ﬁnding a good next action.
Example
Induction: Find a model of the world economy.
Prediction: Use the model for predicting the future stock market.
Decision: Decide whether to invest assets in stocks or bonds.
Action: Trading large quantities of stocks inﬂuences the market.
Example 1: Probability of Sunrise Tomorrow
What is the probability p(1 | 1^d) that the sun will rise tomorrow?
(d = number of past days the sun rose, 1 = sun rises, 0 = sun does not rise)
• p is undefined, because there has never been an experiment that
tested the existence of the sun tomorrow (reference class problem).
• p = 1, because the sun rose in all past experiments.
• p = 1 − ε, where ε is the proportion of stars that explode per day.
• p = (d+1)/(d+2), which is Laplace's rule derived from Bayes' rule.
• Derive p from the type, age, size and temperature of the sun, even
though we never observed another star with those exact properties.
Conclusion: We predict that the sun will rise tomorrow with high
probability, independent of the justification.
Example 2: Digits of a Computable Number
• Extend 14159265358979323846264338327950288419716939937?
• Looks random?!
• Frequency estimate: n = length of sequence, k_i = number of
occurrences of digit i ⇒ probability of the next digit being i is k_i/n.
Asymptotically k_i/n → 1/10 (seems to be) true.
• But we have the strong feeling that (i.e. with high probability) the
next digit will be 5, because the previous digits were the expansion
of π.
• Conclusion: We prefer answer 5, since we see more structure in the
sequence than just random digits.
Example 3: Number Sequences
Sequence: x_1, x_2, x_3, x_4, x_5, ...
          1,   2,   3,   4,   ?,  ...
• x_5 = 5, since x_i = i for i = 1..4.
• x_5 = 29, since x_i = i^4 − 10i^3 + 35i^2 − 49i + 24.
Conclusion: We prefer 5, since the linear relation involves fewer
arbitrary parameters than the 4th-order polynomial.
Sequence: 2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,?
• 61, since this is the next prime.
• 60, since this is the order of the next simple group.
Conclusion: We prefer answer 61, since primes are a more familiar
concept than simple groups.
On-Line Encyclopedia of Integer Sequences:
http://www.research.att.com/~njas/sequences/
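Both candidate rules above can be checked numerically; this quick sketch evaluates the linear rule and the quartic from the slide on i = 1..5:

```python
# Two hypotheses consistent with the data 1, 2, 3, 4:
# the linear rule x_i = i, and the quartic below, which agrees
# on i = 1..4 but predicts x_5 = 29.

def linear(i: int) -> int:
    return i

def quartic(i: int) -> int:
    return i**4 - 10*i**3 + 35*i**2 - 49*i + 24

print([linear(i) for i in range(1, 6)])   # [1, 2, 3, 4, 5]
print([quartic(i) for i in range(1, 6)])  # [1, 2, 3, 4, 29]
```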
Occam’s Razor to the Rescue
• Is there a unique principle which allows us to formally arrive at a
prediction which
– coincides (always?) with our intuitive guess, or even better,
– is (in some sense) most likely the best or correct answer?
• Yes! Occam's razor: Use the simplest explanation consistent with
past data (and use it for prediction).
• Works! For the examples presented, and for many more.
• Actually Occam’s razor can serve as a foundation of machine
learning in general, and is even a fundamental principle (or maybe
even the mere deﬁnition) of science.
• Problem: Not a formal/mathematical objective principle.
What is simple for one may be complicated for another.
Grue Emerald Paradox
Hypothesis 1: All emeralds are green.
Hypothesis 2: All emeralds found until year 2020 are green;
thereafter all emeralds are blue.
• Which hypothesis is more plausible? H1! Justification?
• Occam's razor: take the simplest hypothesis consistent with the data.
It is the most important principle in machine learning and science.
Conﬁrmation Paradox
(i) R → B is confirmed by an R-instance with property B.
(ii) ¬B → ¬R is confirmed by a ¬B-instance with property ¬R.
(iii) Since R → B and ¬B → ¬R are logically equivalent,
R → B is also confirmed by a ¬B-instance with property ¬R.
Example: Hypothesis (o): All ravens are black (R = Raven, B = Black).
(i) Observing a black raven confirms hypothesis (o).
(iii) Observing a white sock also confirms that all ravens are black,
since a white sock is a non-raven which is non-black.
This conclusion sounds absurd! What's the problem?
What This Tutorial is (Not) About
Dichotomies in Artiﬁcial Intelligence & Machine Learning
scope of my tutorial ⇔ scope of other tutorials
(machine) learning ⇔ (GOFAI) knowledge-based
statistical ⇔ logic-based
decision ⇔ prediction ⇔ induction ⇔ action
classification ⇔ regression
sequential / non-iid ⇔ independent identically distributed
online learning ⇔ offline/batch learning
passive prediction ⇔ active learning
Bayes ⇔ MDL ⇔ Expert ⇔ Frequentist
uninformed / universal ⇔ informed / problem-specific
conceptual/mathematical issues ⇔ computational issues
exact/principled ⇔ heuristic
supervised learning ⇔ unsupervised learning ⇔ reinforcement learning
exploitation ⇔ exploration
BAYESIAN SEQUENCE PREDICTION
• Sequential/Online Prediction – Setup
• Uncertainty and Probability
• Frequency Interpretation: Counting
• Objective Uncertain Events & Subjective Degrees of Belief
• Bayes’ and Laplace’s Rules
• The Bayes-mixture distribution
• Predictive Convergence
• Sequential Decisions and Loss Bounds
• Summary
Bayesian Sequence Prediction: Abstract
The aim of probability theory is to describe uncertainty. There are
various sources and interpretations of uncertainty. I compare the
frequency, objective, and subjective probabilities, and show that they all
respect the same rules, and derive Bayes’ and Laplace’s famous and
fundamental rules. Then I concentrate on general sequence prediction
tasks. I deﬁne the Bayes mixture distribution and show that the
posterior converges rapidly to the true posterior by exploiting some
bounds on the relative entropy. Finally I show that the mixture predictor
is also optimal in a decisiontheoretic sense w.r.t. any bounded loss
function.
Sequential/Online Prediction – Setup
In sequential or online prediction, for times t = 1, 2, 3, ...,
our predictor p makes a prediction y_t^p ∈ Y
based on past observations x_1, ..., x_{t−1}.
Thereafter x_t ∈ X is observed and p suffers Loss(x_t, y_t^p).
The goal is to design predictors with small total or cumulative loss
Loss_{1:T}(p) := Σ_{t=1}^T Loss(x_t, y_t^p).
Applications are abundant, e.g. weather or stock market forecasting.
Example loss table:
Loss(x, y)     | x = sunny | x = rainy
y = umbrella   |    0.1    |    0.3
y = sunglasses |    0.0    |    1.0
The setup also includes classification and regression problems.
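The cumulative loss above can be sketched for the weather example; the observation sequence here is made up for illustration:

```python
# Cumulative loss Loss_{1:T}(p) = sum_t Loss(x_t, y_t) for the
# sunny/rainy example; LOSS encodes the table from the slide.

LOSS = {  # (observed weather x, chosen action y) -> loss
    ("sunny", "umbrella"): 0.1, ("rainy", "umbrella"): 0.3,
    ("sunny", "sunglasses"): 0.0, ("rainy", "sunglasses"): 1.0,
}

def cumulative_loss(observations, predictions):
    return sum(LOSS[(x, y)] for x, y in zip(observations, predictions))

obs = ["sunny", "rainy", "rainy", "sunny"]  # hypothetical sequence
print(cumulative_loss(obs, ["umbrella"] * 4))    # 0.1+0.3+0.3+0.1 = 0.8
print(cumulative_loss(obs, ["sunglasses"] * 4))  # 0.0+1.0+1.0+0.0 = 2.0
```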
Uncertainty and Probability
The aim of probability theory is to describe uncertainty.
Sources/interpretations for uncertainty:
• Frequentist: probabilities are relative frequencies.
(e.g. the relative frequency of tossing head.)
• Objectivist: probabilities are real aspects of the world.
(e.g. the probability that some atom decays in the next hour)
• Subjectivist: probabilities describe an agent's degree of belief.
(e.g. it is (im)plausible that extraterrestrials exist)
Frequency Interpretation: Counting
• The frequentist interprets probabilities as relative frequencies.
• If in a sequence of n independent identically distributed (i.i.d.)
experiments (trials) an event occurs k(n) times, the relative
frequency of the event is k(n)/n.
• The limit lim_{n→∞} k(n)/n is defined as the probability of the event.
• For instance, the probability of the event "head" in a sequence of
repeated tosses of a fair coin is 1/2.
• The frequentist position is the easiest to grasp, but it has several
shortcomings.
• Problems: the definition is circular, it is limited to i.i.d.
experiments, and it suffers from the reference class problem.
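The limiting-frequency idea can be illustrated by simulation; a minimal sketch (the seed and sample sizes are arbitrary choices):

```python
import random

# Relative frequency k(n)/n of heads in n fair coin tosses; by the
# law of large numbers it approaches 1/2, the frequentist probability.

def relative_frequency(n: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    k = sum(rng.random() < 0.5 for _ in range(n))  # k(n) = #heads
    return k / n

for n in (10, 1000, 100000):
    print(n, relative_frequency(n))
```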
Objective Interpretation: Uncertain Events
• For the objectivist probabilities are real aspects of the world.
• The outcome of an observation or an experiment is not
deterministic, but involves physical random processes.
• The set Ω of all possible outcomes is called the sample space.
• It is said that an event E ⊂ Ω occurred if the outcome is in E.
• In the case of i.i.d. experiments the probabilities p assigned to
events E should be interpretable as limiting frequencies, but the
application is not limited to this case.
• (Some) probability axioms:
p(Ω) = 1, p({}) = 0, and 0 ≤ p(E) ≤ 1.
p(A ∪ B) = p(A) + p(B) − p(A ∩ B).
p(B|A) = p(A ∩ B)/p(A) is the probability of B given that event A occurred.
Subjective Interpretation: Degrees of Belief
• The subjectivist uses probabilities to characterize an agent’s degree
of belief in something, rather than to characterize physical random
processes.
• This is the most relevant interpretation of probabilities in AI.
• We deﬁne the plausibility of an event as the degree of belief in the
event, or the subjective probability of the event.
• It is natural to assume that plausibilities/beliefs Bel(·|·) can be
represented by real numbers, that the rules qualitatively correspond to
common sense, and that the rules are mathematically consistent. ⇒
• Cox's theorem: Bel(·|A) is isomorphic to a probability function
p(·|·) that satisfies the axioms of (objective) probabilities.
• Conclusion: Beliefs follow the same rules as probabilities.
Bayes’ Famous Rule
Let D be some possible data (i.e. D is an event with p(D) > 0) and
{H_i}_{i∈I} be a countable complete class of mutually exclusive hypotheses
(i.e. the H_i are events with H_i ∩ H_j = {} ∀i ≠ j and ∪_{i∈I} H_i = Ω).
Given: p(H_i) = a priori plausibility of hypothesis H_i (subjective probability)
Given: p(D|H_i) = likelihood of data D under hypothesis H_i (objective probability)
Goal: p(H_i|D) = a posteriori plausibility of hypothesis H_i (subjective probability)
Solution: p(H_i|D) = p(D|H_i) p(H_i) / Σ_{i∈I} p(D|H_i) p(H_i)
Proof: From the definition of conditional probability and
Σ_{i∈I} p(H_i|...) = 1 it follows that
Σ_{i∈I} p(D|H_i) p(H_i) = Σ_{i∈I} p(H_i|D) p(D) = p(D).
Example: Bayes’ and Laplace’s Rule
Assume the data is generated by a biased coin with head probability θ, i.e.
H_θ := Bernoulli(θ) with θ ∈ Θ := [0, 1].
Finite sequence: x = x_1 x_2 ... x_n with n_1 ones and n_0 zeros.
Sampled infinite sequence: ω ∈ Ω = {0, 1}^∞.
Basic event: Γ_x = {ω : ω_1 = x_1, ..., ω_n = x_n} = set of all sequences
starting with x.
Data likelihood: p_θ(x) := p(Γ_x | H_θ) = θ^{n_1} (1 − θ)^{n_0}.
Bayes (1763): Uniform prior plausibility: p(θ) := p(H_θ) = 1
(∫_0^1 p(θ) dθ = 1 instead of Σ_{i∈I} p(H_i) = 1)
Evidence: p(x) = ∫_0^1 p_θ(x) p(θ) dθ = ∫_0^1 θ^{n_1} (1 − θ)^{n_0} dθ
= n_1! n_0! / (n_0 + n_1 + 1)!
Example: Bayes’ and Laplace’s Rule
Bayes: Posterior plausibility of θ after seeing x is:
p(θ|x) = p(x|θ) p(θ) / p(x) = (n+1)!/(n_1! n_0!) · θ^{n_1} (1 − θ)^{n_0}.
Laplace: What is the probability of seeing 1 after having observed x?
p(x_{n+1} = 1 | x_1 ... x_n) = p(x1)/p(x) = (n_1 + 1)/(n + 2)
Laplace believed that the sun had risen for 5000 years = 1'826'213 days,
so he concluded that the probability of doomsday tomorrow is 1/1826215.
Exercise: Envelope Paradox
• I offer you two closed envelopes; one of them contains twice the
amount of money of the other. You are allowed to pick one and open it.
Now you have two options: keep the money, or choose the other
envelope (which could double or halve your gain).
• Symmetry argument: It doesn't matter whether you switch; the
expected gain is the same.
• Refutation: With probability p = 1/2 the other envelope contains
twice/half the amount, i.e. if you switch, your expected gain
increases by a factor of 1.25 = (1/2)·2 + (1/2)·(1/2).
• Present a Bayesian solution.
The Bayes-Mixture Distribution ξ
• Assumption: The true (objective) environment µ is unknown.
• Bayesian approach: Replace the true probability distribution µ by a
Bayes-mixture ξ.
• Assumption: We know that the true environment µ is contained in
some known countable (in)finite set M of environments.
• The Bayes-mixture ξ is defined as
ξ(x) := Σ_{ν∈M} w_ν ν(x)  with  Σ_{ν∈M} w_ν = 1, w_ν > 0 ∀ν
• The weights w_ν may be interpreted as the prior degree of belief that
the true environment is ν, or k_ν = ln w_ν^{−1} as a complexity penalty
(prefix code length) of environment ν.
• Then ξ(x) can be interpreted as the prior subjective belief
probability of observing x.
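The mixture definition above is a one-liner in code; a minimal sketch over a toy class of three Bernoulli environments (the θ values and uniform weights are invented for illustration):

```python
# Bayes-mixture ξ(x) = Σ_ν w_ν ν(x) over a toy class M of
# Bernoulli(θ) environments, with uniform prior weights.

thetas = [0.2, 0.5, 0.8]   # class M
weights = [1/3, 1/3, 1/3]  # w_ν, summing to 1

def nu(theta: float, x: str) -> float:
    """Probability of binary string x under Bernoulli(theta)."""
    p = 1.0
    for bit in x:
        p *= theta if bit == "1" else 1 - theta
    return p

def xi(x: str) -> float:
    return sum(w * nu(t, x) for w, t in zip(weights, thetas))

print(xi("1"))   # (0.2 + 0.5 + 0.8)/3 = 0.5
print(xi("11"))  # (0.04 + 0.25 + 0.64)/3 = 0.31
```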
Convergence and Decisions
Goal: Given sequence x_{1:t−1} ≡ x_{<t} ≡ x_1 x_2 ... x_{t−1}, predict
the continuation x_t.
Expectation w.r.t. µ: E[f(ω_{1:n})] := Σ_{x∈X^n} µ(x) f(x)
KL-divergence: D_n(µ‖ξ) := E[ln(µ(ω_{1:n})/ξ(ω_{1:n}))] ≤ ln w_µ^{−1} ∀n
Hellinger distance: h_t(ω_{<t}) := Σ_{a∈X} (√ξ(a|ω_{<t}) − √µ(a|ω_{<t}))²
Rapid convergence: Σ_{t=1}^∞ E[h_t(ω_{<t})] ≤ D_∞ ≤ ln w_µ^{−1} < ∞ implies
ξ(x_t|ω_{<t}) → µ(x_t|ω_{<t}), i.e. ξ is a good substitute for the unknown µ.
Bayesian decisions: The Bayes-optimal predictor Λ_ξ suffers instantaneous
loss l_t^{Λ_ξ} ∈ [0, 1] at time t only slightly larger than the µ-optimal
predictor Λ_µ:
Σ_{t=1}^∞ E[(√(l_t^{Λ_ξ}) − √(l_t^{Λ_µ}))²] ≤ Σ_{t=1}^∞ 2 E[h_t] < ∞
implies rapid l_t^{Λ_ξ} → l_t^{Λ_µ}.
Pareto-optimality of Λ_ξ: Every predictor with loss smaller than Λ_ξ in
some environment µ ∈ M must be worse in another environment.
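The predictive convergence ξ(x_t|ω_{<t}) → µ(x_t|ω_{<t}) can be watched numerically; a sketch with a three-element Bernoulli class where the "true" environment is Bernoulli(0.8) (class, seed, and sample size are illustrative choices):

```python
import random

# Mixture prediction ξ(next bit = 1 | x) for a uniform mixture over
# Bernoulli(θ), θ ∈ thetas. As data from µ = Bernoulli(0.8) accrues,
# the posterior concentrates on θ = 0.8 and ξ converges to µ.

thetas = [0.2, 0.5, 0.8]

def predict(x: str) -> float:
    def nu(theta: float) -> float:
        p = 1.0
        for bit in x:
            p *= theta if bit == "1" else 1 - theta
        return p
    num = sum(nu(t) * t for t in thetas)  # ξ(x followed by 1), up to weights
    den = sum(nu(t) for t in thetas)      # ξ(x), up to weights
    return num / den

rng = random.Random(1)
x = "".join("1" if rng.random() < 0.8 else "0" for _ in range(200))
print(predict(""))  # prior prediction: (0.2+0.5+0.8)/3 = 0.5
print(predict(x))   # after 200 observations: very close to 0.8
```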
Bayesian Sequence Prediction: Summary
• The aim of probability theory is to describe uncertainty.
• Various sources and interpretations of uncertainty:
frequency, objective, and subjective probabilities.
• They all respect the same rules.
• General sequence prediction: Use the known (subjective) Bayes mixture
ξ = Σ_{ν∈M} w_ν ν in place of the unknown (objective) true distribution µ.
• Bound on the relative entropy between ξ and µ
⇒ the posterior of ξ converges rapidly to the true posterior of µ.
• ξ is also optimal in a decision-theoretic sense w.r.t. any bounded
loss function.
• No structural assumptions on M and ν ∈ M.
UNIVERSAL INDUCTIVE INFERENCE
• Foundations of Universal Induction
• Bayesian Sequence Prediction and Conﬁrmation
• Fast Convergence
• How to Choose the Prior – Universal
• Kolmogorov Complexity
• How to Choose the Model Class – Universal
• Universal is Better than Continuous Class
• Summary / Outlook / Literature
Universal Inductive Inference: Abstract
Solomonoﬀ completed the Bayesian framework by providing a rigorous,
unique, formal, and universal choice for the model class and the prior. I
will discuss in breadth how and in which sense universal (noni.i.d.)
sequence prediction solves various (philosophical) problems of traditional
Bayesian sequence prediction. I show that Solomonoﬀ’s model possesses
many desirable properties: Fast convergence, and in contrast to most
classical continuous prior densities has no zero p(oste)rior problem, i.e.
can conﬁrm universal hypotheses, is reparametrization and regrouping
invariant, and avoids the oldevidence and updating problem. It even
performs well (actually better) in noncomputable environments.
Induction Examples
Sequence prediction: Predict weather/stockquote/... tomorrow, based
on past sequence. Continue IQ test sequence like 1,4,9,16,?
Classiﬁcation: Predict whether email is spam.
Classiﬁcation can be reduced to sequence prediction.
Hypothesis testing/identiﬁcation: Does treatment X cure cancer?
Do observations of white swans conﬁrm that all ravens are black?
These are instances of the important problem of inductive inference or
timeseries forecasting or sequence prediction.
Problem: Finding prediction rules for every particular (new) problem is
possible but cumbersome and prone to disagreement or contradiction.
Goal: A single, formal, general, complete theory for prediction.
Beyond induction: active/reward learning, function optimization, game theory.
Foundations of Universal Induction
Ockham's razor (simplicity) principle
Entities should not be multiplied beyond necessity.
Epicurus’ principle of multiple explanations
If more than one theory is consistent with the observations, keep
all theories.
Bayes’ rule for conditional probabilities
Given the prior belief/probability one can predict all future prob
abilities.
Turing’s universal machine
Everything computable by a human using a ﬁxed procedure can
also be computed by a (universal) Turing machine.
Kolmogorov’s complexity
The complexity or information content of an object is the length
of its shortest description on a universal Turing machine.
Solomonoff's universal prior = Ockham + Epicurus + Bayes + Turing
Solves the question of how to choose the prior if nothing is known.
⇒ universal induction, formal Occam, AIT, MML, MDL, SRM, ...
Bayesian Sequence Prediction and Conﬁrmation
• Assumption: The sequence ω ∈ X^∞ is sampled from the "true"
probability measure µ, i.e. µ(x) := P[x|µ] is the µ-probability that
ω starts with x ∈ X^n.
• Model class: We assume that µ is unknown but known to belong to
a countable class of environments = models = measures
M = {ν_1, ν_2, ...}. [no i.i.d./ergodic/stationary assumption]
• Hypothesis class: {H_ν : ν ∈ M} forms a mutually exclusive and
complete class of hypotheses.
• Prior: w_ν := P[H_ν] is our prior belief in H_ν
⇒ Evidence: ξ(x) := P[x] = Σ_{ν∈M} P[x|H_ν] P[H_ν] = Σ_ν w_ν ν(x)
must be our (prior) belief in x.
⇒ Posterior: w_ν(x) := P[H_ν|x] = P[x|H_ν] P[H_ν] / P[x] is our
posterior belief in ν (Bayes' rule).
How to Choose the Prior?
• Subjective: quantifying personal prior belief (not further discussed)
• Objective: based on rational principles (agreed on by everyone)
• Indifference or symmetry principle: Choose w_ν = 1/|M| for finite M.
• Jeffreys' or Bernardo's prior: the analogue for compact parametric
spaces M.
• Problem: These principles typically provide good objective priors for
small discrete or compact spaces, but not for "large" model classes
like countably infinite, non-compact, and non-parametric M.
• Solution: Occam favors simplicity ⇒ assign high (low) prior to
simple (complex) hypotheses.
• Problem: We need a quantitative and universal measure of
simplicity/complexity.
Kolmogorov Complexity K(x)
The Kolmogorov complexity of a string x is the length of the shortest
(prefix) program producing x:
K(x) := min_p {l(p) : U(p) = x},  U = universal Turing machine
For non-string objects o (like numbers and functions) we define
K(o) := K(⟨o⟩), where ⟨o⟩ ∈ X* is some standard code for o.
+ Simple strings like 000...0 have small K;
irregular (e.g. random) strings have large K.
• The definition is nearly independent of the choice of U.
+ K satisfies most properties an information measure should satisfy.
+ K shares many properties with Shannon entropy but is superior.
− K(x) is not computable, but only semi-computable from above.
Conclusion: K is an excellent universal complexity measure,
suitable for quantifying Occam's razor.
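Since K itself is incomputable, any computation can only approximate the idea. As a purely illustrative stand-in (our choice, not part of the theory), the length of a zlib-compressed encoding behaves analogously: regular strings compress well, random-looking strings do not.

```python
import random
import zlib

# Crude proxy for "description length": compressed size in bytes.
# This is NOT Kolmogorov complexity, only an upper-bound-style
# illustration of "simple strings have short descriptions".

rng = random.Random(0)
simple = "0" * 1000                                   # like 000...0
irregular = "".join(rng.choice("01") for _ in range(1000))  # random-looking

len_simple = len(zlib.compress(simple.encode()))
len_irregular = len(zlib.compress(irregular.encode()))
print(len_simple, len_irregular)  # the simple string compresses far better
```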
Schematic Graph of Kolmogorov Complexity
Although K(x) is incomputable, we can draw a schematic graph
The Universal Prior
• Quantify the complexity of an environment ν or hypothesis H_ν by
its Kolmogorov complexity K(ν).
• Universal prior: w_ν = w_ν^U := 2^{−K(ν)} is a decreasing function of
the model's complexity, and sums to (less than) one.
⇒ D_n ≤ K(µ) ln 2, i.e. the number of ε-deviations of ξ from µ or of
l^{Λ_ξ} from l^{Λ_µ} is proportional to the complexity of the environment.
• No other semi-computable prior leads to better prediction (bounds).
• For continuous M, we can assign a (proper) universal prior (not
density) w_θ^U = 2^{−K(θ)} > 0 for computable θ, and 0 for uncomputable θ.
• This effectively reduces M to a discrete class {ν_θ ∈ M : w_θ^U > 0},
which is typically dense in M.
• This prior has many advantages over the classical prior (densities).
Universal Choice of Class M
• The larger M, the less restrictive is the assumption µ ∈ M.
• The class M_U of all (semi)computable (semi)measures, although
only countable, is pretty large, since it includes all valid physics
theories. Further, ξ_U is semi-computable [ZL70].
• Solomonoff's universal prior M(x) := probability that the output of
a universal TM U with random input starts with x.
• Formally: M(x) := Σ_{p : U(p)=x*} 2^{−l(p)}, where the sum is over all
(minimal) programs p for which U outputs a string starting with x.
• M may be regarded as a 2^{−l(p)}-weighted mixture over all
deterministic environments ν_p. (ν_p(x) = 1 if U(p) = x*, and 0 otherwise)
• M(x) coincides with ξ_U(x) to within an irrelevant multiplicative constant.
Universal is Better than a Continuous Class & Prior
• Problem of zero prior / confirmation of universal hypotheses:
P[All ravens black | n black ravens] ≡ 0 in the Bayes-Laplace model,
but → 1 fast for the universal prior w_θ^U.
• Reparametrization and regrouping invariance: w_θ^U = 2^{−K(θ)} always
exists and is invariant w.r.t. all computable reparametrizations f.
(The Jeffreys prior is invariant only w.r.t. bijections, and does not
always exist.)
• The problem of old evidence: No risk of biasing the prior towards
past data, since w_θ^U is fixed and independent of M.
• The problem of new theories: Updating of M is not necessary,
since M_U already includes everything.
• M predicts better than all other mixture predictors based on any
(continuous or discrete) model class and prior, even in
non-computable environments.
Convergence and Loss Bounds
(Here ≤× / =× denote inequality/equality up to a multiplicative constant,
and ≤+ up to an additive constant.)
• Total (loss) bounds: Σ_{n=1}^∞ E[h_n] ≤× K(µ) ln 2, where
h_t(ω_{<t}) := Σ_{a∈X} (√ξ(a|ω_{<t}) − √µ(a|ω_{<t}))².
• Instantaneous i.i.d. bounds: For i.i.d. M with continuous, discrete,
and universal prior, respectively:
E[h_n] ≤× (1/n) ln w(µ)^{−1} and E[h_n] ≤× (1/n) ln w_µ^{−1} = (1/n) K(µ) ln 2.
• Bounds for computable environments: Rapidly M(x_t|x_{<t}) → 1 on
every computable sequence x_{1:∞} (whichsoever, e.g. 1^∞ or the digits
of π or e), i.e. M quickly recognizes the structure of the sequence.
• Weak instantaneous bounds: valid for all n and x_{1:n} and x̄_n ≠ x_n:
2^{−K(n)} ≤× M(x̄_n|x_{<n}) ≤× 2^{2K(x_{1:n}*)−K(n)}
• Magic instance numbers: e.g. M(0|1^n) =× 2^{−K(n)} → 0, but spikes
up for simple n. M is cautious at magic instance numbers n.
• Future bounds / errors to come: If our past observations ω_{1:n}
contain a lot of information about µ, we make few errors in the future:
Σ_{t=n+1}^∞ E[h_t|ω_{1:n}] ≤+ [K(µ|ω_{1:n}) + K(n)] ln 2
Universal Inductive Inference: Summary
Universal Solomonoff prediction solves/avoids/ameliorates many problems
of (Bayesian) induction. We discussed:
+ general total bounds for generic class, prior, and loss,
+ the D_n bound for continuous classes,
+ the problem of zero p(oste)rior & confirmation of universal hypotheses,
+ reparametrization and regrouping invariance,
+ the problem of old evidence and updating,
+ that M works even in non-computable environments,
+ how to incorporate prior knowledge.
UNIVERSAL RATIONAL AGENTS
• Rational agents
• Sequential decision theory
• Reinforcement learning
• Value function
• Universal Bayes mixture and AIXI model
• Self-optimizing and Pareto-optimal policies
• Environmental Classes
Universal Rational Agents: Abstract
Sequential decision theory formally solves the problem of rational agents
in uncertain worlds if the true environmental prior probability distribution
is known. Solomonoﬀ’s theory of universal induction formally solves the
problem of sequence prediction for unknown prior distribution.
Here we combine both ideas and develop an elegant parameter-free
theory of an optimal reinforcement learning agent embedded in an
arbitrary unknown environment that possesses essentially all aspects of
rational intelligence. The theory reduces all conceptual AI problems to
pure computational ones.
There are strong arguments that the resulting AIXI model is the most
intelligent unbiased agent possible. Other discussed topics are relations
between problem classes.
The Agent Model
Most if not all AI problems can be formulated within the agent
framework: in each cycle k, the agent (policy p) outputs action y_k to
the environment q, and the environment returns perception x_k = (o_k, r_k),
consisting of observation o_k and reward r_k.
[Agent–environment interaction diagram omitted.]
Rational Agents in Deterministic Environments
– p : X* → Y* is the deterministic policy of the agent,
p(x_{<k}) = y_{1:k} with x_{<k} ≡ x_1 ... x_{k−1}.
– q : Y* → X* is the deterministic environment,
q(y_{1:k}) = x_{1:k} with y_{1:k} ≡ y_1 ... y_k.
– Input x_k ≡ r_k o_k consists of a regular informative part o_k
and a reward r_k ∈ [0..r_max].
– Value V_{km}^{pq} := r_k + ... + r_m,
optimal policy p^best := arg max_p V_{1m}^{pq},
lifespan or initial horizon m.
Agents in Probabilistic Environments
Given history y_{1:k} x_{<k}, the probability that the environment leads to
perception x_k in cycle k is (by definition) σ(x_k | y_{1:k} x_{<k}).
Abbreviation (chain rule):
σ(x_{1:m} | y_{1:m}) = σ(x_1 | y_1) · σ(x_2 | y_{1:2} x_1) · ... · σ(x_m | y_{1:m} x_{<m})
The average value of policy p with horizon m in environment σ is
defined as
V_σ^p := (1/m) Σ_{x_{1:m}} (r_1 + ... + r_m) σ(x_{1:m} | y_{1:m})
with y_{1:m} = p(x_{<m}).
The goal of the agent should be to maximize the value.
Optimal Policy and Value
The σ-optimal policy p^σ := arg max_p V_σ^p maximizes V_σ^p ≤ V_σ* := V_σ^{p^σ}.
Explicit expressions for the action y_k in cycle k of the σ-optimal policy
p^σ and its value V_σ* are:
y_k = arg max_{y_k} Σ_{x_k} max_{y_{k+1}} Σ_{x_{k+1}} ... max_{y_m} Σ_{x_m}
(r_k + ... + r_m) · σ(x_{k:m} | y_{1:m} x_{<k}),
V_σ* = (1/m) max_{y_1} Σ_{x_1} max_{y_2} Σ_{x_2} ... max_{y_m} Σ_{x_m}
(r_1 + ... + r_m) · σ(x_{1:m} | y_{1:m}).
Keyword: Expectimax tree/algorithm.
Expectimax Tree/Algorithm
[Expectimax tree diagram omitted.] The tree alternates two node types:
Max node: V_σ*(yx_{<k}) = max_{y_k} V_σ*(yx_{<k} y_k)
— choose the action y_k with maximal value.
Expectation node: V_σ*(yx_{<k} y_k) = Σ_{x_k} [r_k + V_σ*(yx_{1:k})] σ(x_k | yx_{<k} y_k)
— σ-expected reward r_k and observation o_k.
Recursion: V_σ*(yx_{1:k}) = max_{y_{k+1}} V_σ*(yx_{1:k} y_{k+1}),
and so on down to the horizon m.
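The expectimax recursion above can be sketched as a short Python function; the toy environment σ, its action/percept sets, and all probabilities below are invented for illustration:

```python
# Finite-horizon expectimax: alternate a max over actions y_k with a
# σ-expectation over percepts x_k = (o_k, r_k), as in the recursion
# on the slide. Exhaustive recursion; fine for tiny toy horizons.

ACTIONS = [0, 1]
PERCEPTS = [(0, 0.0), (1, 1.0)]  # (observation o_k, reward r_k)

def sigma(x, history, y):
    """Toy σ(x_k | history, y_k): action 1 makes reward 1 more likely."""
    _, r = x
    p_good = 0.8 if y == 1 else 0.3
    return p_good if r == 1.0 else 1 - p_good

def value(history, k, m):
    """V*_σ: maximal expected total reward r_k + ... + r_m."""
    if k > m:
        return 0.0
    best = float("-inf")
    for y in ACTIONS:  # max node over actions
        v = sum(sigma(x, history, y) * (x[1] + value(history + [(y, x)], k + 1, m))
                for x in PERCEPTS)  # expectation node over percepts
        best = max(best, v)
    return best

print(value([], 1, 3))  # horizon m = 3: always picking y = 1 gives 3 * 0.8 = 2.4
```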
Marcus Hutter  51  Foundations of Intelligent Agents
Known environment µ
• Assumption: µ is the true environment in which the agent operates.
• Then policy p^µ is optimal in the sense that no other policy for an agent leads to higher µ^AI-expected reward.
• Special choices of µ: deterministic environments, adversarial environments, Markov decision processes (MDPs).
• In principle there is no problem in computing the optimal action y_k as long as µ^AI is known and computable and X, Y and m are finite.
• Things drastically change if µ^AI is unknown ...
Marcus Hutter  52  Foundations of Intelligent Agents
The Bayesmixture distribution ξ
Assumption: The true environment µ is unknown.
Bayesian approach: The true probability distribution µ^AI is not learned directly, but is replaced by a Bayes-mixture ξ^AI.
Assumption: We know that the true environment µ is contained in some known (finite or countable) set M of environments.
The Bayes-mixture ξ is defined as
ξ(x_1:m | y_1:m) := Σ_{ν∈M} w_ν ν(x_1:m | y_1:m)  with  Σ_{ν∈M} w_ν = 1, w_ν > 0 ∀ν
The weights w_ν may be interpreted as the prior degree of belief that the true environment is ν.
Then ξ(x_1:m | y_1:m) could be interpreted as the prior subjective belief probability in observing x_1:m, given actions y_1:m.
Marcus Hutter  53  Foundations of Intelligent Agents
Questions of Interest
• It is natural to follow the policy p^ξ which maximizes V^p_ξ.
• If µ is the true environment, the expected reward when following policy p^ξ will be V^{p^ξ}_µ.
• The optimal (but infeasible) policy p^µ yields reward V^{p^µ}_µ ≡ V*_µ.
• Are there policies with uniformly larger value than V^{p^ξ}_µ?
• How close is V^{p^ξ}_µ to V*_µ?
• What are the most general class M and weights w_ν?
Marcus Hutter  54  Foundations of Intelligent Agents
A universal choice of ξ and M
• We have to assume the existence of some structure on the environment to avoid the No-Free-Lunch theorems [Wolpert 96].
• We can only unravel eﬀective structures which are describable by
(semi)computable probability distributions.
• So we may include all (semi)computable (semi)distributions in M.
• Occam’s razor and Epicurus’ principle of multiple explanations tell
us to assign high prior belief to simple environments.
• Using Kolmogorov's universal complexity measure K(ν) for environments ν, one should set w_ν ∼ 2^{−K(ν)}, where K(ν) is the length of the shortest program on a universal TM computing ν.
• The resulting AIXI model [Hutter:00] is a uniﬁcation of (Bellman’s)
sequential decision and Solomonoﬀ’s universal induction theory.
Marcus Hutter  55  Foundations of Intelligent Agents
The AIXI Model in one Line
complete & essentially unique & limit-computable
AIXI: a_k := arg max_{a_k} Σ_{o_k r_k} ... max_{a_m} Σ_{o_m r_m} [r_k + ... + r_m] Σ_{p : U(p, a_1..a_m) = o_1 r_1..o_m r_m} 2^{−ℓ(p)}
(a = action, r = reward, o = observation, U = universal TM, p = program, k = now)
AIXI is an elegant mathematical theory of AI.
Claim: AIXI is the most intelligent environment-independent, i.e. universally optimal, agent possible.
Proof: For formalizations, quantifications, and proofs, see [Hut05].
Applications: Strategic Games, Function Optimization, Supervised Learning, Sequence Prediction, Classification, ...
In the following we consider generic M and w_ν.
Marcus Hutter  56  Foundations of Intelligent Agents
Pareto-Optimality of p^ξ
Policy p^ξ is Pareto-optimal in the sense that there is no other policy p with V^p_ν ≥ V^{p^ξ}_ν for all ν ∈ M and strict inequality for at least one ν.
Self-Optimizing Policies
Under which circumstances does the value of the universal policy p^ξ converge to the optimum?
V^{p^ξ}_ν → V*_ν for horizon m → ∞ for all ν ∈ M.  (1)
The least we must demand from M, to have a chance that (1) is true, is that there exists some policy p̃ at all with this property, i.e.
∃p̃ : V^{p̃}_ν → V*_ν for horizon m → ∞ for all ν ∈ M.  (2)
Main result: (2) ⇒ (1): The necessary condition of the existence of a self-optimizing policy p̃ is also sufficient for p^ξ to be self-optimizing.
Marcus Hutter  57  Foundations of Intelligent Agents
Environments w. (Non)SelfOptimizing Policies
Marcus Hutter  58  Foundations of Intelligent Agents
Particularly Interesting Environments
• Sequence Prediction, e.g. weather or stock-market prediction.
Strong result: V*_µ − V^{p^ξ}_µ = O(√(K(µ)/m)), m = horizon.
• Strategic Games: Learn to play well (minimax) strategic zero-sum games (like chess) or even exploit limited capabilities of the opponent.
• Optimization: Find (approximate) minimum of function with as few
function calls as possible. Diﬃcult exploration versus exploitation
problem.
• Supervised learning: Learn functions by presenting (z, f(z)) pairs and ask for function values of z′ by presenting (z′, ?) pairs.
Supervised learning is much faster than reinforcement learning.
AIξ quickly learns to predict, play games, optimize, and learn supervised.
Marcus Hutter  59  Foundations of Intelligent Agents
Universal Rational Agents: Summary
• Setup: Agents acting in general probabilistic environments with
reinforcement feedback.
• Assumptions: True environment µ belongs to a known class of
environments M, but is otherwise unknown.
• Results: The Bayes-optimal policy p^ξ based on the Bayes-mixture ξ = Σ_{ν∈M} w_ν ν is Pareto-optimal and self-optimizing if M admits self-optimizing policies.
• We have reduced the AI problem to pure computational questions (which are addressed in the time-bounded AIXItl).
• AIξ incorporates all aspects of intelligence (apart from computation time).
• How to choose the horizon: use future value and universal discounting.
• ToDo: prove (optimality) properties, scale down, implement.
Marcus Hutter  60  Foundations of Intelligent Agents
APPROXIMATIONS & APPLICATIONS
• Universal Similarity Metric
• Universal Search
• Time-Bounded AIXI Model
• Brute-Force Approximation of AIXI
• A Monte-Carlo AIXI Approximation
• Feature Reinforcement Learning
• Comparison to other approaches
• Future directions, wrap-up, references.
Marcus Hutter  61  Foundations of Intelligent Agents
Approximations & Applications: Abstract
Many fundamental theories have to be approximated for practical use.
Since the core quantities of universal induction and universal intelligence
are incomputable, it is often hard, but not impossible, to approximate
them. In any case, having these "gold standards" to approximate
(top-down) or to aim at (bottom-up) is extremely helpful in building
truly intelligent systems. The most impressive direct approximation of
Kolmogorov complexity to date is via the universal similarity metric
applied to a variety of real-world clustering problems. A couple of
universal search algorithms ((adaptive) Levin search, FastPrg, OOPS,
Gödel machine, ...) that find short programs have been developed and
applied to a variety of toy problems. The AIXI model itself has been
approximated in a couple of ways (AIXItl, Brute Force, Monte Carlo,
Feature RL). Some recent applications will be presented. The tutorial
concludes by comparing various learning algorithms along various
dimensions, pointing to future directions, a wrap-up, and references.
Marcus Hutter  62  Foundations of Intelligent Agents
Conditional Kolmogorov Complexity
Question: When is object = string x similar to object = string y?
Universal solution: x similar to y ⇔ x can be easily (re)constructed from y
⇔ Kolmogorov complexity K(x|y) := min{ℓ(p) : U(p, y) = x} is small.
Examples:
1) x is very similar to itself (K(x|x) = 0, up to an additive constant).
2) A processed x is similar to x (K(f(x)|x) = 0 if K(f) = O(1)),
e.g. doubling, reverting, inverting, encrypting, partially deleting x.
3) A random string is with high probability not similar to any other
string (K(random|y) = length(random)).
The problem with K(x|y) as similarity = distance measure is that it is
neither symmetric nor normalized nor computable.
Marcus Hutter  63  Foundations of Intelligent Agents
The Universal Similarity Metric [CV’05]
• Symmetrization and normalization leads to a/the universal metric d:
0 ≤ d(x, y) := max{K(x|y), K(y|x)} / max{K(x), K(y)} ≤ 1
• Every effective similarity between x and y is detected by d.
• Use K(x|y) ≈ K(xy) − K(y) and K(x) ≡ K_U(x) ≈ K_T(x) (coding w.r.t. compressor T)
⇒ computable approximation, the normalized compression distance:
d(x, y) ≈ (K_T(xy) − min{K_T(x), K_T(y)}) / max{K_T(x), K_T(y)} ≲ 1
• For T choose a Lempel-Ziv or gzip or bzip(2) (de)compressor in the applications below.
• Theory: Lempel-Ziv compresses asymptotically better than any probabilistic finite state automaton predictor/compressor.
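A minimal sketch of the normalized compression distance, with zlib standing in for the compressor T (the choice of compressor and the test strings are illustrative assumptions):

```python
# Sketch: normalized compression distance with zlib as the compressor T.
import zlib

def C(data: bytes) -> int:
    """Compressed length: a crude stand-in for K_T."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """d(x, y) ~ (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)
```

For any string a, ncd(a, a) is close to 0, while ncd of two unrelated strings is close to 1 (it can slightly exceed 1 because real compressors are imperfect).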
Marcus Hutter  64  Foundations of Intelligent Agents
TreeBased Clustering [CV’05]
• If many objects x_1, ..., x_n need to be compared, determine the
similarity matrix: M_ij = d(x_i, x_j) for 1 ≤ i, j ≤ n
• Now cluster similar objects.
• There are various clustering techniques.
• Treebased clustering: Create a tree connecting similar objects,
• e.g. quartet method (for clustering)
• Applications: Phylogeny of 24 Mammal mtDNA,
50 Language Tree (based on declaration of human rights),
composers of music, authors of novels, SARS virus, fungi,
optical characters, galaxies, ... [Cilibrasi&Vitanyi’05]
Marcus Hutter  65  Foundations of Intelligent Agents
Genomics & Phylogeny: Mammals [CV’05]
Evolutionary tree built from complete mammalian mtDNA of 24 species:
[Tree figure; leaves grouped as: Carp (outgroup); Ferungulates: Cow, BlueWhale, FinbackWhale, Cat, BrownBear, PolarBear, GreySeal, HarborSeal, Horse, WhiteRhino; Primates: Gibbon, Gorilla, Human, Chimpanzee, PygmyChimp, Orangutan, SumatranOrangutan; Eutheria-Rodents: HouseMouse, Rat; Metatheria: Opossum, Wallaroo; Prototheria: Echidna, Platypus]
Language Tree (Re)construction
based on "The Universal Declaration of Human Rights" in 50 languages.
[Tree figure; the 50+ languages cluster into the ROMANCE, BALTIC, UGROFINNIC, CELTIC, GERMANIC, SLAVIC and ALTAIC families]
Marcus Hutter  67  Foundations of Intelligent Agents
Universal Search
• Levin search: Fastest algorithm for
inversion and optimization problems.
• Theoretical application:
Assume somebody found a nonconstructive
proof of P=NP, then Levinsearch is a polynomial
time algorithm for every NP (complete) problem.
• Practical (OOPS) applications (J. Schmidhuber)
Maze, towers of hanoi, robotics, ...
• FastPrg: The asymptotically fastest and shortest algorithm for all
welldeﬁned problems.
• AIXItl: Computable variant of AIXI.
• Human Knowledge Compression Prize: (50'000 €)
Marcus Hutter  68  Foundations of Intelligent Agents
The Time-Bounded AIXI Model (AIXItl)
An algorithm p^best has been constructed for which the following holds:
• Let p be any (extended chronological) policy
• with length ℓ(p) ≤ l̃ and computation time per cycle t(p) ≤ t̃
• for which there exists a proof of length ≤ l_P that p is a valid approximation.
• Then an algorithm p^best can be constructed, depending on l̃, t̃ and l_P but not on knowing p,
• which is effectively more or equally intelligent according to ≽^c than any such p.
• The size of p^best is ℓ(p^best) = O(ln(l̃ · t̃ · l_P)),
• the setup time is t_setup(p^best) = O(l_P² · 2^{l_P}),
• the computation time per cycle is t_cycle(p^best) = O(2^{l̃} · t̃).
Marcus Hutter  69  Foundations of Intelligent Agents
Brute-Force Approximation of AIXI
• Truncate the expectimax tree depth to a small fixed lookahead h.
The optimal action is computable in time |Y×X|^h × time to evaluate ξ.
• Consider a mixture over Markov Decision Processes (MDPs) only, i.e.
ξ(x_1:m | y_1:m) = Σ_{ν∈M} w_ν ∏_{t=1}^m ν(x_t | x_{t−1} y_t). Note: ξ is not an MDP.
• Choose a uniform prior over w_ν.
Then ξ(x_1:m | y_1:m) can be computed in linear time.
• Consider (approximately) Markov problems with very small action and perception spaces.
• Example application: 2×2 matrix games like Prisoner's Dilemma, Stag Hunt, Chicken, Battle of the Sexes, and Matching Pennies. [PH'06]
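The linear-time computation of the mixture can be sketched by updating the weights incrementally, one per-cycle likelihood factor at a time. The two-element environment class below is invented for the example:

```python
# Sketch: incremental (linear-time) mixture over a tiny set of MDP-like
# environments; each environment is a function nu(x, prev_x, y) giving the
# one-step probability of percept x after percept prev_x and action y.
class MixtureOverMDPs:
    def __init__(self, envs):
        self.envs = envs
        self.w = [1.0 / len(envs)] * len(envs)   # uniform prior
        self.prev_x = None

    def predict(self, x, y):
        """xi(x | history, y) = sum over nu of w_nu * nu(x | prev_x, y)."""
        return sum(w * nu(x, self.prev_x, y)
                   for w, nu in zip(self.w, self.envs))

    def update(self, x, y):
        """Absorb one cycle: multiply each weight by its one-step
        likelihood and renormalize (Bayes rule, one factor per cycle)."""
        z = self.predict(x, y)
        self.w = [w * nu(x, self.prev_x, y) / z
                  for w, nu in zip(self.w, self.envs)]
        self.prev_x = x
```

Feeding cycles in which the percept echoes the action drives the weight of an echoing environment towards 1.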
Marcus Hutter  70  Foundations of Intelligent Agents
AIXI Learns to Play 2×2 Matrix Games
• Repeated Prisoner's Dilemma. Loss matrix.
• Game unknown to AIXI. Must be learned as well.
• AIXI behaves appropriately.
[Plot: average per-round cooperation ratio over rounds t = 0..100 for AIXI vs. random, tit4tat, 2-tit4tat, 3-tit4tat, AIXI, and AIXI2]
Marcus Hutter  71  Foundations of Intelligent Agents
A Monte-Carlo AIXI Approximation
Consider the class of Variable-Order Markov Decision Processes.
The Context Tree Weighting (CTW) algorithm can efficiently mix
(exactly, in essentially linear time) all prediction suffix trees.
Monte-Carlo approximation of the expectimax tree via the
Upper Confidence Tree (UCT) algorithm:
• Sample observations from the CTW distribution.
• Select actions with the highest upper confidence bound.
• Expand the tree by one leaf node (per trajectory).
[Figure: search tree branching over actions a1–a3 and observations o1–o4, with future reward estimates at the leaves]
• Simulate from the leaf node further down using a (fixed) playout policy.
• Propagate the value estimates back for each node.
Repeat until timeout. [VNHS'09]
Guaranteed to converge to exact value.
Extension: Predicate CTW not based on raw obs. but features thereof.
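The action-selection step at the tree's max nodes can be sketched with a UCB1-style rule. This fragment is illustrative, not the MC-AIXI-CTW implementation, and the exploration constant is an arbitrary choice:

```python
import math

def ucb_action(node_visits, action_stats, exploration=2.0):
    """Pick the action with the highest upper confidence bound.
    action_stats maps action -> (visit_count, mean_value_estimate)."""
    best_a, best_score = None, float("-inf")
    for a, (n, mean) in action_stats.items():
        if n == 0:
            return a          # try every action at least once
        score = mean + exploration * math.sqrt(math.log(node_visits) / n)
        if score > best_score:
            best_a, best_score = a, score
    return best_a
```

With equal value estimates the rarely tried action wins (larger exploration bonus); with equal counts the higher estimate wins.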
Marcus Hutter  72  Foundations of Intelligent Agents
Monte-Carlo AIXI Applications
Normalized Learning Scalability
[Plot: normalized average reward per trial vs. experience (100 to 1,000,000 cycles); learning curves for Tiger, 4x4 Grid, 1d Maze, Extended Tiger, TicTacToe, Cheese Maze and Pocman* approach the optimum]
[Joel Veness et al. 2009]
Marcus Hutter  73  Foundations of Intelligent Agents
Feature Reinforcement Learning (FRL)
Goal: Develop eﬃcient general purpose intelligent agent.
State of the art: (a) AIXI: Incomputable theoretical solution.
(b) MDP: Efficient but limited problem class.
(c) POMDP: Notoriously difficult. (d) PSRs: Underdeveloped.
Idea: ΦMDP reduces the real problem to an MDP automatically by learning.
Accomplishments so far: (i) Criterion for evaluating the quality of a reduction.
(ii) Integration of the various parts into one learning algorithm.
(iii) Generalization to structured MDPs (DBNs).
ΦMDP is a promising path towards the grand goal & an alternative to (a)-(d).
Problem: Find the reduction Φ efficiently (generic optimization problem?)
Marcus Hutter  74  Foundations of Intelligent Agents
ΦMDP: Computational Flow
[Computational flow diagram: Environment → (reward r, observation o) → History h → Feature Vec. Φ̂ (via Cost(Φ|h) minimization) → Transition Pr. T̂ and Reward est. R̂ (frequency estimate) → T̂^e, R̂^e (exploration bonus) → (Q̂) Value (Bellman) → Best Policy p̂ → (implicit) action a → Environment]
Marcus Hutter  75  Foundations of Intelligent Agents
Intelligent Agents in Perspective
[Diagram: Universal AI (AIXI) at the top; Feature RL (ΦMDP/DBN/..) below it; both built on Information, Learning, Planning and Complexity, which in turn rest on Search – Optimization – Computation – Logic – KR]
Agents = General Framework, Interface = Robots, Vision, Language
Marcus Hutter  76  Foundations of Intelligent Agents
Properties of Learning Algorithms
Comparison of AIXI to Other Approaches
Algorithm \ Properties: time efficient | data efficient | exploration | convergence | global optimum | generalization | pomdp | learning | active
Value/Policy iteration yes/no yes – YES YES NO NO NO yes
TD w. func.approx. no/yes NO NO no/yes NO YES NO YES YES
Direct Policy Search no/yes YES NO no/yes NO YES no YES YES
Logic Planners yes/no YES yes YES YES no no YES yes
RL with Split Trees yes YES no YES NO yes YES YES YES
Pred.w. Expert Advice yes/no YES – YES yes/no yes NO YES NO
OOPS yes/no no – yes yes/no YES YES YES YES
Market/Economy RL yes/no no NO no no/yes yes yes/no YES YES
SPXI no YES – YES YES YES NO YES NO
AIXI NO YES YES yes YES YES YES YES YES
AIXItl no/yes YES YES YES yes YES YES YES YES
MCAIXICTW yes/no yes YES YES yes NO yes/no YES YES
Feature RL yes/no YES yes yes yes yes yes YES YES
Human yes yes yes no/yes NO YES YES YES YES
Marcus Hutter  77  Foundations of Intelligent Agents
Machine Intelligence Tests & Deﬁnitions
F = yes, · = no, • = debatable, ? = unknown.
Intelligence Test: Valid | Informative | Wide Range | General | Dynamic | Unbiased | Fundamental | Formal | Objective | Fully Defined | Universal | Practical | Test vs. Def.
Turing Test • · · · • · · · · • · • T
Total Turing Test • · · · • · · · · • · · T
Inverted Turing Test • • · · • · · · · • · • T
Toddler Turing Test • · · · • · · · · · · • T
Linguistic Complexity • F • · · · · • • · • • T
Text Compression Test • F F • · • • F F F • F T
Turing Ratio • F F F ? ? ? ? ? · ? ? T/D
Psychometric AI F F • F ? • · • • • · • T/D
Smith’s Test • F F • · ? F F F · ? • T/D
CTest • F F • · F F F F F F F T/D
AIXI F F F F F F F F F F F · D
Marcus Hutter  78  Foundations of Intelligent Agents
Next Steps
• Address the many open theoretical questions (see Hutter:05).
• Bridge the gap between (Universal) AI theory and AI practice.
• Explore what role logical reasoning, knowledge representation,
vision, language, etc. play in Universal AI.
• Determine the right discounting of future rewards.
• Develop the right nurturing environment for a learning agent.
• Consider embodied agents (e.g. internal↔external reward)
• Analyze AIXI in the multiagent setting.
Marcus Hutter  79  Foundations of Intelligent Agents
The Big Questions
• Is noncomputational physics relevant to AI? [Penrose]
• Could something like the number of wisdom Ω prevent a simple
solution to AI? [Chaitin]
• Do we need to understand consciousness before being able to
understand AI or construct AI systems?
• What if we succeed?
Marcus Hutter  80  Foundations of Intelligent Agents
Wrap Up
• Setup: Given (non)iid data D = (x_1, ..., x_n), predict x_{n+1}
• Ultimate goal is to maximize profit or minimize loss
• Consider models/hypotheses H_i ∈ M
• Max. Likelihood: H^best = arg max_i p(D|H_i) (overfits if M is large)
• Bayes: Posterior probability of H_i is p(H_i|D) ∝ p(D|H_i) · p(H_i)
• Bayes needs a prior(H_i)
• Occam+Epicurus: High prior for simple models.
• Kolmogorov/Solomonoff: Quantification of simplicity/complexity
• Bayes works if D is sampled from H_true ∈ M
• Bellman equations tell how to optimally act in known environments
• Universal AI = Universal Induction + Sequential Decision Theory
• Practice = approximate, restrict, search, optimize, knowledge
Marcus Hutter  81  Foundations of Intelligent Agents
Literature
[CV05] R. Cilibrasi and P. M. B. Vitányi. Clustering by compression. IEEE Trans. Information Theory, 51(4):1523–1545, 2005.
[Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.
[Hut07] M. Hutter. On universal prediction and Bayesian confirmation. Theoretical Computer Science, 384(1):33–48, 2007.
[LH07] S. Legg and M. Hutter. Universal intelligence: a definition of machine intelligence. Minds & Machines, 17(4):391–444, 2007.
[Hut09] M. Hutter. Feature reinforcement learning: Part I: Unstructured MDPs. Journal of Artificial General Intelligence, 1:3–24, 2009.
[VNHS09] J. Veness, K. S. Ng, M. Hutter, and D. Silver. A Monte Carlo AIXI approximation. Technical report, NICTA, Australia, 2009.
Marcus Hutter  82  Foundations of Intelligent Agents
Thanks! Questions? Details:
Jobs: PostDoc and PhD positions at RSISE and NICTA, Australia
Projects at http://www.hutter1.net/
A Unified View of Artificial Intelligence
= Decision Theory = Probability + Utility Theory
+ Universal Induction = Ockham + Bayes + Turing
Open research problems at www.hutter1.net/ai/uaibook.htm
Compression contest with 50'000 € prize at prize.hutter1.net