Universal Artificial Intelligence
Marcus Hutter
Canberra, ACT, 0200, Australia
http://www.hutter1.net/
ANU RSISE NICTA
Machine Learning Summer School
MLSS-2010, 27 September - 6 October, Canberra
Abstract
The dream of creating artificial devices that reach or outperform human
intelligence is many centuries old. This tutorial presents the elegant
parameter-free theory, developed in [Hut05], of an optimal reinforcement
learning agent embedded in an arbitrary unknown environment that
possesses essentially all aspects of rational intelligence. The theory
reduces all conceptual AI problems to pure computational questions.
How to perform inductive inference is closely related to the AI problem.
The tutorial covers Solomonoff’s theory, elaborated on in [Hut07], which
solves the induction problem, at least from a philosophical and
statistical perspective.
Both theories are based on Occam’s razor quantified by Kolmogorov
complexity; Bayesian probability theory; and sequential decision theory.
TABLE OF CONTENTS
1. PHILOSOPHICAL ISSUES
2. BAYESIAN SEQUENCE PREDICTION
3. UNIVERSAL INDUCTIVE INFERENCE
4. UNIVERSAL ARTIFICIAL INTELLIGENCE
5. APPROXIMATIONS AND APPLICATIONS
6. FUTURE DIRECTIONS, WRAP UP, LITERATURE
PHILOSOPHICAL ISSUES
Philosophical Problems
What is (Artificial) Intelligence?
How to do Inductive Inference?
How to Predict (Number) Sequences?
How to make Decisions in Unknown Environments?
Occam’s Razor to the Rescue
The Grue Emerald and Confirmation Paradoxes
What this Tutorial is (Not) About
Philosophical Issues: Abstract
I start by considering the philosophical problems concerning artificial
intelligence and machine learning in general and induction in particular.
I illustrate the problems and their intuitive solution on various (classical)
induction examples. The common principle to their solution is Occam’s
simplicity principle. Based on Occam’s and Epicurus’ principle, Bayesian
probability theory, and Turing’s universal machine, Solomonoff
developed a formal theory of induction. I describe the sequential/online
setup considered in this tutorial and place it into the wider machine
learning context.
What is (Artificial) Intelligence?
Intelligence can have many faces ⇒ a formal definition is difficult:
reasoning
creativity
association
generalization
pattern recognition
problem solving
memorization
planning
achieving goals
learning
optimization
self-preservation
vision
language processing
classification
induction
deduction
...
What is AI?      Thinking              Acting
humanly          Cognitive Science     Turing test, Behaviorism
rationally       Laws of Thought       Doing the Right Thing

Collection of 70+ definitions of intelligence:
http://www.vetta.org/definitions-of-intelligence/
Real world is nasty: partially unobservable,
uncertain, unknown, non-ergodic, reactive,
vast, but luckily structured, ...
Informal Definition of (Artificial) Intelligence
Intelligence measures an agent’s ability to achieve goals
in a wide range of environments. [S. Legg and M. Hutter]
Emergent: Features such as the ability to learn and adapt, or to
understand, are implicit in the above definition as these capacities
enable an agent to succeed in a wide range of environments.
The science of Artificial Intelligence is concerned with the construction
of intelligent systems/artifacts/agents and their analysis.
What next? Substantiate all terms above: agent, ability, utility, goal,
success, learn, adapt, environment, ...
On the Foundations of Artificial Intelligence
Example: Algorithm/complexity theory: The goal is to find fast
algorithms solving problems and to show lower bounds on their
computation time. Everything is rigorously defined: algorithm,
Turing machine, problem classes, computation time, ...
Most disciplines start with an informal way of attacking a subject.
With time they get more and more formalized, often to a point
where they are completely rigorous. Examples: set theory, logical
reasoning, proof theory, probability theory, infinitesimal calculus,
energy, temperature, quantum field theory, ...
Artificial Intelligence: Tries to build and understand systems that
act intelligently, learn from experience, make good predictions, are
able to generalize, ... Many terms are only vaguely defined or there
are many alternate definitions.
Induction ⇒ Prediction ⇒ Decision ⇒ Action
Induction infers general models from specific observations/facts/data,
usually exhibiting regularities or properties or relations in the latter.
Having or acquiring or learning or inducing a model of the environment
an agent interacts with allows the agent to make predictions and utilize
them in its decision process of finding a good next action.
Example
Induction: Find a model of the world economy.
Prediction: Use the model for predicting the future stock market.
Decision: Decide whether to invest assets in stocks or bonds.
Action: Trading large quantities of stocks influences the market.
Example 1: Probability of Sunrise Tomorrow
What is the probability p(1|1^d) that the sun will rise tomorrow?
(d = number of past days the sun rose, 1 = sun rises, 0 = sun does not rise)
- p is undefined, because there has never been an experiment that tested the existence of the sun tomorrow (reference class problem).
- p = 1, because the sun rose in all past experiments.
- p = 1 − ε, where ε is the proportion of stars that explode per day.
- p = (d+1)/(d+2), which is Laplace's rule derived from Bayes' rule.
- Derive p from the type, age, size and temperature of the sun, even though we never observed another star with those exact properties.
Conclusion: We predict that the sun will rise tomorrow with high probability, independent of the justification.
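A quick sanity check: Laplace's rule is one line of code. A minimal sketch (not from the slides), using the sunrise count that Laplace himself used (cf. the Bayes-Laplace slides later in this tutorial):

```python
# Laplace's rule p = (d+1)/(d+2) for the sunrise problem.
d = 1_826_213                      # Laplace's count of past sunrises (~5000 years)
p = (d + 1) / (d + 2)              # probability that the sun rises tomorrow
print(f"P(sunrise tomorrow) = {p:.9f}")
print(f"P(no sunrise)       = {1 - p:.3e}")   # = 1/(d+2) = 1/1826215
```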
Example 2: Digits of a Computable Number
Extend 14159265358979323846264338327950288419716939937?
Looks random?!
Frequency estimate: n = length of sequence, k_i = number of occurrences of digit i ⇒ probability of the next digit being i is k_i/n.
Asymptotically k_i/n → 1/10 (seems to be) true.
But we have the strong feeling that (i.e. with high probability) the
next digit will be 5 because the previous digits were the expansion
of π.
Conclusion: We prefer answer 5, since we see more structure in the
sequence than just random digits.
Example 3: Number Sequences
Sequence: x_1, x_2, x_3, x_4, x_5, ...
1, 2, 3, 4, ?, ...
- x_5 = 5, since x_i = i for i = 1..4.
- x_5 = 29, since x_i = i^4 − 10i^3 + 35i^2 − 49i + 24.
Conclusion: We prefer 5, since the linear relation involves fewer arbitrary parameters than the 4th-order polynomial.
Sequence: 2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,?
61, since this is the next prime
60, since this is the order of the next simple group
Conclusion: We prefer answer 61, since primes are a more familiar
concept than simple groups.
On-Line Encyclopedia of Integer Sequences:
http://www.research.att.com/~njas/sequences/
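Both candidate rules really are consistent with the data 1, 2, 3, 4; the following sketch (assuming numpy is available) verifies this and evaluates both predictions:

```python
# Two hypotheses consistent with the observations 1, 2, 3, 4.
import numpy as np

i = np.arange(1, 5)                                   # positions 1..4
linear = i                                            # x_i = i
poly = i**4 - 10*i**3 + 35*i**2 - 49*i + 24           # 4th-order polynomial
assert np.array_equal(linear, poly)                   # both fit 1, 2, 3, 4 exactly

predict = lambda k: k**4 - 10*k**3 + 35*k**2 - 49*k + 24
print("x_5 (linear)     =", 5)                        # Occam prefers this
print("x_5 (polynomial) =", predict(5))               # 29
```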
Occam’s Razor to the Rescue
Is there a unique principle which allows us to formally arrive at a
prediction which
- coincides (always?) with our intuitive guess -or- even better,
- which is (in some sense) most likely the best or correct answer?
Yes! Occam’s razor: Use the simplest explanation consistent with
past data (and use it for prediction).
Works! For examples presented and for many more.
Actually Occam’s razor can serve as a foundation of machine
learning in general, and is even a fundamental principle (or maybe
even the mere definition) of science.
Problem: Not a formal/mathematical objective principle.
What is simple for one may be complicated for another.
Grue Emerald Paradox
Hypothesis 1: All emeralds are green.
Hypothesis 2: All emeralds found till year 2020 are green,
thereafter all emeralds are blue.
Which hypothesis is more plausible? H1! Justification?
Occam's razor: take the simplest hypothesis consistent with the data.
It is the most important principle in machine learning and science.
Confirmation Paradox
(i) R → B is confirmed by an R-instance with property B.
(ii) ¬B → ¬R is confirmed by a ¬B-instance with property ¬R.
(iii) Since R → B and ¬B → ¬R are logically equivalent,
R → B is also confirmed by a ¬B-instance with property ¬R.
Example: Hypothesis (o): All ravens are black (R=Raven, B=Black).
(i) observing a Black Raven confirms Hypothesis (o).
(iii) observing a White Sock also confirms that all Ravens are Black,
since a White Sock is a non-Raven which is non-Black.
This conclusion sounds absurd! What’s the problem?
What This Tutorial is (Not) About
Dichotomies in Artificial Intelligence & Machine Learning
scope of my tutorial ⇔ scope of other tutorials:
(machine) learning ⇔ (GOFAI) knowledge-based
statistical ⇔ logic-based
decision / prediction / induction / action
classification / regression
sequential / non-iid ⇔ independent identically distributed
online learning ⇔ offline/batch learning
passive prediction ⇔ active learning
Bayes / MDL ⇔ Expert / Frequentist
uninformed / universal ⇔ informed / problem-specific
conceptual/mathematical issues ⇔ computational issues
exact/principled ⇔ heuristic
supervised learning ⇔ unsupervised / RL learning
exploitation ⇔ exploration
BAYESIAN SEQUENCE PREDICTION
Sequential/Online Prediction Setup
Uncertainty and Probability
Frequency Interpretation: Counting
Objective Uncertain Events & Subjective Degrees of Belief
Bayes’ and Laplace’s Rules
The Bayes-mixture distribution
Predictive Convergence
Sequential Decisions and Loss Bounds
Summary
Bayesian Sequence Prediction: Abstract
The aim of probability theory is to describe uncertainty. There are
various sources and interpretations of uncertainty. I compare the
frequency, objective, and subjective probabilities, and show that they all
respect the same rules, and derive Bayes’ and Laplace’s famous and
fundamental rules. Then I concentrate on general sequence prediction
tasks. I define the Bayes mixture distribution and show that the
posterior converges rapidly to the true posterior by exploiting some
bounds on the relative entropy. Finally I show that the mixture predictor
is also optimal in a decision-theoretic sense w.r.t. any bounded loss
function.
Sequential/Online Prediction Setup
In sequential or online prediction, for times t = 1, 2, 3, ...,
our predictor p makes a prediction y_t^p ∈ Y
based on past observations x_1, ..., x_{t−1}.
Thereafter x_t ∈ X is observed and p suffers Loss(x_t, y_t^p).
The goal is to design predictors with small total loss or cumulative
Loss_{1:T}(p) := Σ_{t=1}^{T} Loss(x_t, y_t^p).
Applications are abundant, e.g. weather or stock market forecasting.

Example loss function, with X = {sunny, rainy} and Y = {umbrella, sunglasses}:

Loss(x, y)    sunny  rainy
umbrella       0.1    0.3
sunglasses     0.0    1.0

Setup also includes: Classification and Regression problems.
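A minimal sketch of this protocol, with the loss table above and a hypothetical hand-written predictor (the weather sequence and the rule are illustrative assumptions):

```python
# Online prediction: predict y_t, observe x_t, suffer Loss(x_t, y_t).
LOSS = {("sunny", "umbrella"): 0.1, ("rainy", "umbrella"): 0.3,
        ("sunny", "sunglasses"): 0.0, ("rainy", "sunglasses"): 1.0}

def predictor(history):
    """Toy rule: carry the umbrella once rain has ever been observed."""
    return "umbrella" if "rainy" in history else "sunglasses"

weather = ["sunny", "sunny", "rainy", "rainy", "sunny"]
history, total = [], 0.0
for x in weather:
    y = predictor(history)          # predict before observing x_t
    total += LOSS[(x, y)]           # suffer Loss(x_t, y_t)
    history.append(x)
print("cumulative loss Loss_{1:T} =", total)   # 0.0+0.0+1.0+0.3+0.1 = 1.4
```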
Uncertainty and Probability
The aim of probability theory is to describe uncertainty.
Sources/interpretations for uncertainty:
Frequentist: probabilities are relative frequencies.
(e.g. the relative frequency of tossing head.)
Objectivist: probabilities are real aspects of the world.
(e.g. the probability that some atom decays in the next hour)
Subjectivist: probabilities describe an agent’s degree of belief.
(e.g. it is (im)plausible that extraterrestrials exist)
Frequency Interpretation: Counting
The frequentist interprets probabilities as relative frequencies.
If in a sequence of n independent identically distributed (i.i.d.)
experiments (trials) an event occurs k(n) times, the relative
frequency of the event is k(n)/n.
The limit lim_{n→∞} k(n)/n is defined as the probability of the event.
For instance, the probability of the event "head" in a sequence of repeatedly tossing a fair coin is 1/2.
The frequentist position is the easiest to grasp, but it has several
shortcomings:
Problems: the definition is circular, it is limited to i.i.d. processes, and it suffers from the reference class problem.
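The limiting-frequency definition is easy to simulate; a tiny standard-library sketch:

```python
# Relative frequency k(n)/n of "heads" approaches 1/2 for a fair coin.
import random
random.seed(0)
for n in (10, 100, 10_000, 1_000_000):
    k = sum(random.randint(0, 1) for _ in range(n))   # k(n) = #heads in n tosses
    print(f"n = {n:>9}: k(n)/n = {k / n:.4f}")
```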
Objective Interpretation: Uncertain Events
For the objectivist probabilities are real aspects of the world.
The outcome of an observation or an experiment is not
deterministic, but involves physical random processes.
The set of all possible outcomes is called the sample space.
It is said that an event E occurred if the outcome is in E.
In the case of i.i.d. experiments the probabilities p assigned to
events E should be interpretable as limiting frequencies, but the
application is not limited to this case.
(Some) probability axioms:
p(Ω) = 1 and p({}) = 0 and 0 ≤ p(E) ≤ 1.
p(A ∪ B) = p(A) + p(B) − p(A ∩ B).
p(B|A) = p(A ∩ B)/p(A) is the probability of B given that event A occurred.
Subjective Interpretation: Degrees of Belief
The subjectivist uses probabilities to characterize an agent’s degree
of belief in something, rather than to characterize physical random
processes.
This is the most relevant interpretation of probabilities in AI.
We define the plausibility of an event as the degree of belief in the
event, or the subjective probability of the event.
It is natural to assume that plausibilities/beliefs Bel(·|·) can be repr.
by real numbers, that the rules qualitatively correspond to common
sense, and that the rules are mathematically consistent.
Cox’s theorem: Bel(·|A) is isomorphic to a probability function
p(·|·) that satisfies the axioms of (objective) probabilities.
Conclusion: Beliefs follow the same rules as probabilities.
Bayes’ Famous Rule
Let D be some possible data (i.e. D is an event with p(D) > 0) and
{H_i}_{i∈I} be a countable complete class of mutually exclusive hypotheses
(i.e. the H_i are events with H_i ∩ H_j = {} for i ≠ j and ∪_{i∈I} H_i = Ω).

Given: p(H_i) = a priori plausibility of hypothesis H_i (subj. prob.)
Given: p(D|H_i) = likelihood of data D under hypothesis H_i (obj. prob.)
Goal: p(H_i|D) = a posteriori plausibility of hypothesis H_i (subj. prob.)

Solution: p(H_i|D) = p(D|H_i) p(H_i) / Σ_{i∈I} p(D|H_i) p(H_i)

Proof: From the definition of conditional probability and Σ_{i∈I} p(H_i|...) = 1:

Σ_{i∈I} p(D|H_i) p(H_i) = Σ_{i∈I} p(H_i|D) p(D) = p(D)
Example: Bayes’ and Laplace’s Rule
Assume data is generated by a biased coin with head probability θ, i.e.
H_θ := Bernoulli(θ) with θ ∈ Θ := [0, 1].
Finite sequence: x = x_1 x_2 ... x_n with n_1 ones and n_0 zeros.
Sample infinite sequence: ω ∈ Ω = {0, 1}^∞
Basic event: Γ_x = {ω : ω_1 = x_1, ..., ω_n = x_n} = set of all sequences starting with x.
Data likelihood: p_θ(x) := p(Γ_x|H_θ) = θ^{n_1} (1−θ)^{n_0}.
Bayes (1763): Uniform prior plausibility: p(θ) := p(H_θ) = 1
(∫_0^1 p(θ) dθ = 1 instead of Σ_{i∈I} p(H_i) = 1)
Evidence: p(x) = ∫_0^1 p_θ(x) p(θ) dθ = ∫_0^1 θ^{n_1} (1−θ)^{n_0} dθ = n_1! n_0! / (n_0+n_1+1)!
Example: Bayes’ and Laplace’s Rule
Bayes: Posterior plausibility of θ after seeing x is:
p(θ|x) = p(x|θ) p(θ) / p(x) = (n+1)!/(n_1! n_0!) · θ^{n_1} (1−θ)^{n_0}.
Laplace: What is the probability of seeing 1 after having observed x?
p(x_{n+1} = 1 | x_1...x_n) = p(x1)/p(x) = (n_1+1)/(n+2)
Laplace believed that the sun had risen for 5000 years = 1'826'213 days,
so he concluded that the probability of doomsday tomorrow is 1/1'826'215.
Exercise: Envelope Paradox
I offer you two closed envelopes, one of them contains twice the
amount of money than the other. You are allowed to pick one and
open it. Now you have two options. Keep the money or decide for
the other envelope (which could double or halve your gain).
Symmetry argument: It doesn’t matter whether you switch, the
expected gain is the same.
Refutation: With probability p = 1/2, the other envelope contains
twice/half the amount, i.e. if you switch your expected gain
increases by a factor 1.25=(1/2)*2+(1/2)*(1/2).
Present a Bayesian solution.
The Bayes-Mixture Distribution ξ
Assumption: The true (objective) environment µ is unknown.
Bayesian approach: Replace true probability distribution µ by a
Bayes-mixture ξ.
Assumption: We know that the true environment µ is contained in
some known countable (in)finite set M of environments.
The Bayes-mixture ξ is defined as
ξ(x) := Σ_{ν∈M} w_ν ν(x)  with  Σ_{ν∈M} w_ν = 1,  w_ν > 0 ∀ν
The weights w_ν may be interpreted as the prior degree of belief that the true environment is ν, or k_ν = ln(1/w_ν) as a complexity penalty (prefix code length) of environment ν.
Then ξ(x) could be interpreted as the prior subjective belief probability in observing x.
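A toy sketch of such a mixture, with M a hand-picked grid of Bernoulli environments and uniform weights (both are illustrative assumptions):

```python
# Bayes-mixture xi(x) = sum_nu w_nu nu(x) over Bernoulli environments.
thetas = [i / 10 for i in range(1, 10)]           # M = {Bernoulli(0.1..0.9)}
w = {th: 1 / len(thetas) for th in thetas}        # prior weights, sum to 1

def nu(x, th):                                    # nu(x): probability of string x
    p = 1.0
    for bit in x:
        p *= th if bit == 1 else 1 - th
    return p

x = [1, 1, 0, 1, 1, 1, 0, 1]                      # observed sequence (6 ones)
xi = lambda s: sum(w[th] * nu(s, th) for th in thetas)
print("xi(1|x) =", xi(x + [1]) / xi(x))           # predictive prob., close to 6/8
posterior = {th: w[th] * nu(x, th) / xi(x) for th in thetas}
print("posterior mode:", max(posterior, key=posterior.get))   # 0.7, near 6/8
```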
Convergence and Decisions
Goal: Given sequence x_{1:t−1} ≡ x_{<t} ≡ x_1 x_2 ... x_{t−1}, predict continuation x_t.
Expectation w.r.t. µ: E[f(ω_{1:n})] := Σ_{x∈X^n} µ(x) f(x)
KL-divergence: D_n(µ||ξ) := E[ln(µ(ω_{1:n})/ξ(ω_{1:n}))] ≤ ln w_µ^{−1} for all n
Hellinger distance: h_t(ω_{<t}) := Σ_{a∈X} (√(ξ(a|ω_{<t})) − √(µ(a|ω_{<t})))²
Rapid convergence: Σ_{t=1}^{∞} E[h_t(ω_{<t})] ≤ D_∞ ≤ ln w_µ^{−1} < ∞ implies
ξ(x_t|ω_{<t}) → µ(x_t|ω_{<t}), i.e. ξ is a good substitute for the unknown µ.
Bayesian decisions: the Bayes-optimal predictor Λ_ξ suffers an instantaneous loss l_t^{Λξ} ∈ [0, 1] at time t only slightly larger than the µ-optimal predictor Λ_µ:
Σ_{t=1}^{∞} E[(√(l_t^{Λξ}) − √(l_t^{Λµ}))²] ≤ Σ_{t=1}^{∞} 2E[h_t] < ∞ implies rapid l_t^{Λξ} → l_t^{Λµ}.
Pareto-optimality of Λ_ξ: Every predictor with loss smaller than Λ_ξ in some environment µ ∈ M must be worse in another environment.
Bayesian Sequence Prediction: Summary
The aim of probability theory is to describe uncertainty.
Various sources and interpretations of uncertainty: frequency, objective, and subjective probabilities. They all respect the same rules.
General sequence prediction: Use the known (subjective) Bayes mixture ξ = Σ_{ν∈M} w_ν ν in place of the unknown (objective) true distribution µ.
Bound on the relative entropy between ξ and µ ⇒ the posterior of ξ converges rapidly to the true posterior of µ.
ξ is also optimal in a decision-theoretic sense w.r.t. any bounded loss function.
No structural assumptions on M and ν ∈ M.
UNIVERSAL INDUCTIVE INFERENCE
Foundations of Universal Induction
Bayesian Sequence Prediction and Confirmation
Fast Convergence
How to Choose the Prior? ⇒ Universal
Kolmogorov Complexity
How to Choose the Model Class? ⇒ Universal
Universal is Better than Continuous Class
Summary / Outlook / Literature
Universal Inductive Inference: Abstract
Solomonoff completed the Bayesian framework by providing a rigorous,
unique, formal, and universal choice for the model class and the prior. I
will discuss in breadth how and in which sense universal (non-i.i.d.)
sequence prediction solves various (philosophical) problems of traditional
Bayesian sequence prediction. I show that Solomonoff’s model possesses
many desirable properties: Fast convergence, and in contrast to most
classical continuous prior densities has no zero p(oste)rior problem, i.e.
can confirm universal hypotheses, is reparametrization and regrouping
invariant, and avoids the old-evidence and updating problem. It even
performs well (actually better) in non-computable environments.
Induction Examples
Sequence prediction: Predict weather/stock-quote/... tomorrow, based
on past sequence. Continue IQ test sequence like 1,4,9,16,?
Classification: Predict whether email is spam.
Classification can be reduced to sequence prediction.
Hypothesis testing/identification: Does treatment X cure cancer?
Do observations of white swans confirm that all ravens are black?
These are instances of the important problem of inductive inference or
time-series forecasting or sequence prediction.
Problem: Finding prediction rules for every particular (new) problem is
possible but cumbersome and prone to disagreement or contradiction.
Goal: A single, formal, general, complete theory for prediction.
Beyond induction: active/reward learning, fct. optimization, game theory.
Foundations of Universal Induction
Ockhams’ razor (simplicity) principle
Entities should not be multiplied beyond necessity.
Epicurus’ principle of multiple explanations
If more than one theory is consistent with the observations, keep
all theories.
Bayes’ rule for conditional probabilities
Given the prior belief/probability one can predict all future prob-
abilities.
Turing’s universal machine
Everything computable by a human using a fixed procedure can
also be computed by a (universal) Turing machine.
Kolmogorov’s complexity
The complexity or information content of an object is the length
of its shortest description on a universal Turing machine.
Solomonoff’s universal prior=Ockham+Epicurus+Bayes+Turing
Solves the question of how to choose the prior if nothing is known.
universal induction, formal Occam, AIT,MML,MDL,SRM,...
Bayesian Sequence Prediction and Confirmation
Assumption: Sequence ω ∈ X^∞ is sampled from the "true" probability measure µ, i.e. µ(x) := P[x|µ] is the µ-probability that ω starts with x ∈ X^n.
Model class: We assume that µ is unknown but known to belong to a countable class of environments = models = measures M = {ν_1, ν_2, ...}. [no i.i.d./ergodic/stationary assumption]
Hypothesis class: {H_ν : ν ∈ M} forms a mutually exclusive and complete class of hypotheses.
Prior: w_ν := P[H_ν] is our prior belief in H_ν
⇒ Evidence: ξ(x) := P[Γ_x] = Σ_{ν∈M} P[x|H_ν] P[H_ν] = Σ_ν w_ν ν(x) must be our (prior) belief in x.
⇒ Posterior: w_ν(x) := P[H_ν|x] = P[x|H_ν] P[H_ν] / P[x] is our posterior belief in ν (Bayes' rule).
How to Choose the Prior?
Subjective: quantifying personal prior belief (not further discussed)
Objective: based on rational principles (agreed on by everyone)
Indifference or symmetry principle: Choose w_ν = 1/|M| for finite M.
Jeffreys or Bernardo's prior: analogue for compact parametric spaces M.
Problem: The principles typically provide good objective priors for small discrete or compact spaces, but not for "large" model classes like countably infinite, non-compact, and non-parametric M.
Solution: Occam favors simplicity ⇒ assign high (low) prior to simple (complex) hypotheses.
Problem: We need a quantitative and universal measure of simplicity/complexity.
Kolmogorov Complexity K(x)
K. of string x is the length of the shortest (prefix) program producing x:
K(x) := min_p {ℓ(p) : U(p) = x},  U = universal Turing machine
For non-string objects o (like numbers and functions) we define K(o) := K(⟨o⟩), where ⟨o⟩ ∈ X* is some standard code for o.
+ Simple strings like 000...0 have small K, irregular (e.g. random) strings have large K.
+ The definition is nearly independent of the choice of U.
+ K satisfies most properties an information measure should satisfy.
+ K shares many properties with Shannon entropy but is superior.
− K(x) is not computable, but only semi-computable from above.
Conclusion: K is an excellent universal complexity measure, suitable for quantifying Occam's razor.
Schematic Graph of Kolmogorov Complexity
Although K(x) is incomputable, we can draw a schematic graph
The Universal Prior
Quantify the complexity of an environment ν or hypothesis H_ν by its Kolmogorov complexity K(ν).
Universal prior: w_ν = w_ν^U := 2^{−K(ν)} is a decreasing function in the model's complexity and sums to (less than) one.
⇒ D_n ≤ K(µ) ln 2, i.e. the number of ε-deviations of ξ from µ or of l^{Λξ} from l^{Λµ} is proportional to the complexity of the environment.
No other semi-computable prior leads to better prediction (bounds).
For continuous M, we can assign a (proper) universal prior (not density) w_θ^U = 2^{−K(θ)} > 0 for computable θ, and 0 for uncomputable θ.
This effectively reduces M to a discrete class {ν_θ ∈ M : w_θ^U > 0} which is typically dense in M.
This prior has many advantages over the classical prior (densities).
Universal Choice of Class M
The larger M, the less restrictive is the assumption µ ∈ M.
The class M_U of all (semi)computable (semi)measures, although only countable, is pretty large, since it includes all valid physics theories. Further, ξ_U is semi-computable [ZL70].
Solomonoff's universal prior M(x) := probability that the output of a universal TM U with random input starts with x.
Formally: M(x) := Σ_{p : U(p) = x*} 2^{−ℓ(p)}, where the sum is over all (minimal) programs p for which U outputs a string starting with x.
M may be regarded as a 2^{−ℓ(p)}-weighted mixture over all deterministic environments ν_p (ν_p(x) = 1 if U(p) = x* and 0 else).
M(x) coincides with ξ_U(x) within an irrelevant multiplicative constant.
Universal is better than Continuous Class&Prior
Problem of zero prior / confirmation of universal hypotheses:
P[All ravens black | n black ravens] → 0 in the Bayes-Laplace model,
but → 1 fast for the universal prior w_θ^U.
Reparametrization and regrouping invariance: w_θ^U = 2^{−K(θ)} always exists and is invariant w.r.t. all computable reparametrizations f (the Jeffreys prior is invariant only w.r.t. bijections, and does not always exist).
The Problem of Old Evidence: No risk of biasing the prior towards past data, since w_θ^U is fixed and independent of M.
The Problem of New Theories: Updating of M is not necessary, since M_U already includes all.
M predicts better than all other mixture predictors based on any (continuous or discrete) model class and prior, even in non-computable environments.
Convergence and Loss Bounds
Total (loss) bounds: Σ_{n=1}^{∞} E[h_n] ≲ K(µ) ln 2, where
h_t(ω_{<t}) := Σ_{a∈X} (√(ξ(a|ω_{<t})) − √(µ(a|ω_{<t})))²
and ≲ denotes ≤ within a multiplicative constant.
Instantaneous i.i.d. bounds: For i.i.d. M with continuous, discrete, and universal prior, respectively:
E[h_n] ≲ (1/n) ln w(µ)^{−1}  and  E[h_n] ≲ (1/n) ln w_µ^{−1} = (1/n) K(µ) ln 2.
Bounds for computable environments: Rapidly M(x_t|x_{<t}) → 1 on every computable sequence x_{1:∞} (whichsoever, e.g. 1^∞ or the digits of π or e), i.e. M quickly recognizes the structure of the sequence.
Weak instantaneous bounds: valid for all n and x_{1:n} and x̄_n ≠ x_n:
2^{−K(n)} ≲ M(x̄_n|x_{<n}) ≲ 2^{2K(x_{1:n}) − K(n)}
Magic instance numbers: e.g. M(0|1^n) ≍ 2^{−K(n)} → 0, but spikes up for simple n. M is cautious at magic instance numbers n.
Future bounds / errors to come: If our past observations ω_{1:n} contain a lot of information about µ, we make few errors in future:
Σ_{t=n+1}^{∞} E[h_t|ω_{1:n}] ⪅ [K(µ|ω_{1:n}) + K(n)] ln 2  (⪅: within an additive constant)
Universal Inductive Inference: Summary
Universal Solomonoff prediction solves/avoids/meliorates many problems
of (Bayesian) induction. We discussed:
+ general total bounds for generic class, prior, and loss,
+ the D_n bound for continuous classes,
+ the problem of zero p(oste)rior & confirmation of universal hypotheses,
+ reparametrization and regrouping invariance,
+ the problem of old evidence and updating,
+ that M works even in non-computable environments,
+ how to incorporate prior knowledge.
UNIVERSAL RATIONAL AGENTS
Rational agents
Sequential decision theory
Reinforcement learning
Value function
Universal Bayes mixture and AIXI model
Self-optimizing and Pareto-optimal policies
Environmental Classes
Universal Rational Agents: Abstract
Sequential decision theory formally solves the problem of rational agents
in uncertain worlds if the true environmental prior probability distribution
is known. Solomonoff’s theory of universal induction formally solves the
problem of sequence prediction for unknown prior distribution.
Here we combine both ideas and develop an elegant parameter-free
theory of an optimal reinforcement learning agent embedded in an
arbitrary unknown environment that possesses essentially all aspects of
rational intelligence. The theory reduces all conceptual AI problems to
pure computational ones.
There are strong arguments that the resulting AIXI model is the most
intelligent unbiased agent possible. Other discussed topics are relations
between problem classes.
The Agent Model
Most if not all AI problems can be formulated within the agent framework.
[Diagram: an Agent (policy p, with work tape) and an Environment (q, with work tape) interact in cycles: the agent emits actions y_1 y_2 y_3 ..., and the environment replies with perceptions r_1|o_1, r_2|o_2, r_3|o_3, ... (reward | observation).]
Rational Agents in Deterministic Environments
- p : X* → Y* is the deterministic policy of the agent,
  p(x_{<k}) = y_{1:k} with x_{<k} ≡ x_1...x_{k−1}.
- q : Y* → X* is the deterministic environment,
  q(y_{1:k}) = x_{1:k} with y_{1:k} ≡ y_1...y_k.
- Input x_k ≡ r_k o_k consists of a regular informative part o_k and a reward r_k ∈ [0..r_max].
- Value V_{km}^{pq} := r_k + ... + r_m,
  optimal policy p^best := argmax_p V_{1m}^{pq},
  lifespan or initial horizon m.
Agents in Probabilistic Environments
Given history y_{1:k} x_{<k}, the probability that the environment leads to perception x_k in cycle k is (by definition) σ(x_k | y_{1:k} x_{<k}).
Abbreviation (chain rule):
σ(x_{1:m} | y_{1:m}) = σ(x_1|y_1) · σ(x_2|y_{1:2} x_1) · ... · σ(x_m|y_{1:m} x_{<m})
The average value of policy p with horizon m in environment σ is defined as
V_σ^p := (1/m) Σ_{x_{1:m}} (r_1 + ... + r_m) σ(x_{1:m}|y_{1:m})  with  y_{1:m} = p(x_{<m})
The goal of the agent should be to maximize the value.
Optimal Policy and Value
The σ-optimal policy p^σ := argmax_p V_σ^p maximizes V_σ^p ≤ V_σ* := V_σ^{pσ}.
Explicit expressions for the action y_k in cycle k of the σ-optimal policy p^σ and its value V_σ* are:
y_k = argmax_{y_k} Σ_{x_k} max_{y_{k+1}} Σ_{x_{k+1}} ... max_{y_m} Σ_{x_m} (r_k + ... + r_m) · σ(x_{k:m} | y_{1:m} x_{<k}),
V_σ* = (1/m) max_{y_1} Σ_{x_1} max_{y_2} Σ_{x_2} ... max_{y_m} Σ_{x_m} (r_1 + ... + r_m) · σ(x_{1:m} | y_{1:m}).
Keyword: Expectimax tree/algorithm.
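A direct transcription of this recursion into code (a sketch: the action set Y, the percept set X of (o, r) pairs, and the model sigma are assumed given and small enough to enumerate):

```python
# Expectimax: alternate max over actions and sigma-expectation over percepts.
def V_star(history, k, m, Y, X, sigma):
    """Optimal value of r_k + ... + r_m from cycle k given the history."""
    if k > m:
        return 0.0
    best = float("-inf")
    for y in Y:                                  # max over actions y_k
        ev = 0.0
        for x in X:                              # expectation over percepts x_k
            o, r = x
            p = sigma(x, history, y)             # sigma(x_k | history, y_k)
            if p > 0:
                ev += p * (r + V_star(history + [(y, x)], k + 1, m, Y, X, sigma))
        best = max(best, ev)                     # the arg max gives the action y_k
    return best
```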
Expectimax Tree/Algorithm
[Expectimax tree: levels alternate between max nodes, where the agent picks the action y_k with maximal value, and expectation nodes, where the environment supplies the σ-expected reward r_k and observation o_k:]
V_σ*(yx_{<k}) = max_{y_k} V_σ*(yx_{<k} y_k)
V_σ*(yx_{<k} y_k) = Σ_{x_k} [r_k + V_σ*(yx_{1:k})] σ(x_k | yx_{<k} y_k)
V_σ*(yx_{1:k}) = max_{y_{k+1}} V_σ*(yx_{1:k} y_{k+1}), and so on down the tree.
Known environment µ
Assumption: µ is the true environment in which the agent operates.
Then the policy p^µ is optimal in the sense that no other policy for an agent leads to higher µ^AI-expected reward.
Special choices of µ: deterministic or adversarial environments, Markov decision processes (MDPs).
There is no problem in principle in computing the optimal action y_k as long as µ^AI is known and computable and X, Y and m are finite.
Things drastically change if µ^AI is unknown ...
The Bayes-mixture distribution ξ
Assumption: The true environment µ is unknown.
Bayesian approach: The true probability distribution µ^AI is not learned directly, but is replaced by a Bayes-mixture ξ^AI.
Assumption: We know that the true environment µ is contained in some known (finite or countable) set M of environments.
The Bayes-mixture ξ is defined as
ξ(x_{1:m}|y_{1:m}) := Σ_{ν∈M} w_ν ν(x_{1:m}|y_{1:m})  with  Σ_{ν∈M} w_ν = 1,  w_ν > 0 ∀ν
The weights w_ν may be interpreted as the prior degree of belief that the true environment is ν.
Then ξ(x_{1:m}|y_{1:m}) could be interpreted as the prior subjective belief probability in observing x_{1:m}, given the actions y_{1:m}.
Questions of Interest
It is natural to follow the policy p^ξ which maximizes V_ξ^p.
If µ is the true environment, the expected reward when following policy p^ξ will be V_µ^{pξ}.
The optimal (but infeasible) policy p^µ yields reward V_µ^{pµ} ≡ V_µ*.
Are there policies with uniformly larger value than V_µ^{pξ}?
How close is V_µ^{pξ} to V_µ*?
What is the most general class M and weights w_ν?
A universal choice of ξ and M
We have to assume the existence of some structure on the
environment to avoid the No-Free-Lunch Theorems [Wolpert 96].
We can only unravel effective structures which are describable by
(semi)computable probability distributions.
So we may include all (semi)computable (semi)distributions in M.
Occam’s razor and Epicurus’ principle of multiple explanations tell
us to assign high prior belief to simple environments.
Using Kolmogorov’s universal complexity measure K(ν) for
environments ν one should set w
ν
2
K(ν)
, where K(ν) is the
length of the shortest program on a universal TM computing ν.
The resulting AIXI model [Hutter:00] is a unification of (Bellman’s)
sequential decision and Solomonoff’s universal induction theory.
The AIXI Model in one Line
complete & essentially unique & limit-computable

AIXI: a_k := argmax_{a_k} Σ_{o_k r_k} ... max_{a_m} Σ_{o_m r_m} [r_k + ... + r_m] Σ_{p : U(p, a_1..a_m) = o_1 r_1 .. o_m r_m} 2^{−ℓ(p)}

(a = action, r = reward, o = observation, U = universal TM, p = program, k = now)

AIXI is an elegant mathematical theory of AI.
Claim: AIXI is the most intelligent environment-independent, i.e. universally optimal, agent possible.
Proof: For formalizations, quantifications, and proofs, see [Hut05].
Applications: Strategic Games, Function Optimization, Supervised Learning, Sequence Prediction, Classification, ...
In the following we consider generic M and w_ν.
Pareto-Optimality of p^ξ
Policy p^ξ is Pareto-optimal in the sense that there is no other policy p with V_ν^p ≥ V_ν^{pξ} for all ν ∈ M and strict inequality for at least one ν.

Self-optimizing Policies
Under which circumstances does the value of the universal policy p^ξ converge to the optimum?
V_ν^{pξ} → V_ν*  for horizon m → ∞ for all ν ∈ M.   (1)
The least we must demand from M to have a chance that (1) is true is that there exists some policy p̃ at all with this property, i.e.
∃p̃ : V_ν^{p̃} → V_ν*  for horizon m → ∞ for all ν ∈ M.   (2)
Main result: (2) ⇒ (1): The necessary condition of the existence of a self-optimizing policy p̃ is also sufficient for p^ξ to be self-optimizing.
Environments w. (Non)Self-Optimizing Policies
Particularly Interesting Environments
Sequence Prediction, e.g. weather or stock-market prediction.
Strong result: V_µ* − V_µ^{pξ} = O(√(K(µ)/m)), where m is the horizon.
Strategic Games: Learn to play well (minimax) strategic zero-sum
games (like chess) or even exploit limited capabilities of opponent.
Optimization: Find (approximate) minimum of function with as few
function calls as possible. Difficult exploration versus exploitation
problem.
Supervised learning: Learn functions by presenting (z, f(z)) pairs and ask for function values of z′ by presenting (z′, ?) pairs.
Supervised learning is much faster than reinforcement learning.
AIξ quickly learns to predict, play games, optimize, and learn supervised.
Universal Rational Agents: Summary
Setup: Agents acting in general probabilistic environments with
reinforcement feedback.
Assumptions: True environment µ belongs to a known class of
environments M, but is otherwise unknown.
Results: The Bayes-optimal policy p^ξ based on the Bayes-mixture ξ = Σ_{ν∈M} w_ν ν is Pareto-optimal and self-optimizing if M admits self-optimizing policies.
We have reduced the AI problem to pure computational questions (which are addressed in the time-bounded AIXItl).
AIξ incorporates all aspects of intelligence (apart from computation time).
How to choose the horizon: use future value and universal discounting.
ToDo: prove (optimality) properties, scale down, implement.
APPROXIMATIONS & APPLICATIONS
Universal Similarity Metric
Universal Search
Time-Bounded AIXI Model
Brute-Force Approximation of AIXI
A Monte-Carlo AIXI Approximation
Feature Reinforcement Learning
Comparison to other approaches
Future directions, wrap-up, references.
Approximations & Applications: Abstract
Many fundamental theories have to be approximated for practical use.
Since the core quantities of universal induction and universal intelligence
are incomputable, it is often hard, but not impossible, to approximate
them. In any case, having these “gold standards” to approximate
(top-down) or to aim at (bottom-up) is extremely helpful in building
truly intelligent systems. The most impressive direct approximation of
Kolmogorov complexity to date is via the universal similarity metric
applied to a variety of real-world clustering problems. A couple of
universal search algorithms ((adaptive) Levin search, FastPrg, OOPS,
Gödel machine, ...) that find short programs have been developed and
applied to a variety of toy problems. The AIXI model itself has been
approximated in a couple of ways (AIXItl, Brute Force, Monte Carlo,
Feature RL). Some recent applications will be presented. The Tutorial
concludes by comparing various learning algorithms along various
dimensions, pointing to future directions, wrap-up, and references.
Conditional Kolmogorov Complexity
Question: When is object=string x similar to object=string y?
Universal solution: x is similar to y ⇔ x can be easily (re)constructed from y
⇔ the Kolmogorov complexity K(x|y) := min{ℓ(p) : U(p, y) = x} is small.
Examples:
1) x is very similar to itself: K(x|x) ≐ 0 (≐: equality within an additive constant).
2) A processed x is similar to x: K(f(x)|x) ≐ 0 if K(f) = O(1),
e.g. doubling, reverting, inverting, encrypting, partially deleting x.
3) A random string is with high probability not similar to any other string: K(random|y) = length(random).
The problem with K(x|y) as a similarity = distance measure is that it is neither symmetric nor normalized nor computable.
The Universal Similarity Metric [CV’05]
Symmetrization and normalization leads to a/the universal metric d:
0 ≤ d(x, y) := max{K(x|y), K(y|x)} / max{K(x), K(y)} ≤ 1
Every effective similarity between x and y is detected by d.
Use K(x|y) ≈ K(xy) − K(y) and K(x) ≈ K_U(x) ≈ K_T(x) (coding with a real compressor T)
⇒ computable approximation, the normalized compression distance:
d(x, y) ≈ [K_T(xy) − min{K_T(x), K_T(y)}] / max{K_T(x), K_T(y)} ≲ 1
For T choose the Lempel-Ziv or gzip or bzip(2) (de)compressor in the applications below.
Theory: Lempel-Ziv compresses asymptotically better than any probabilistic finite state automaton predictor/compressor.
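The normalized compression distance is only a few lines; a sketch with zlib standing in for the compressor T:

```python
# Normalized compression distance (NCD) with zlib as compressor T.
import zlib

def C(s: bytes) -> int:                 # K_T(x) ~ compressed length
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    return (C(x + y) - min(C(x), C(y))) / max(C(x), C(y))

s1 = b"the quick brown fox jumps over the lazy dog " * 20
s2 = b"the quick brown fox leaps over the lazy cat " * 20
s3 = bytes(range(256)) * 4
print(ncd(s1, s2))   # small: the two texts share most of their structure
print(ncd(s1, s3))   # near 1: essentially unrelated objects
```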
Tree-Based Clustering [CV’05]
If many objects x_1, ..., x_n need to be compared, determine the
similarity matrix: M_ij = d(x_i, x_j) for 1 ≤ i, j ≤ n.
Now cluster similar objects.
There are various clustering techniques.
Tree-based clustering: Create a tree connecting similar objects,
e.g. quartet method (for clustering)
Applications: Phylogeny of 24 Mammal mtDNA,
50 Language Tree (based on declaration of human rights),
composers of music, authors of novels, SARS virus, fungi,
optical characters, galaxies, ... [Cilibrasi&Vitanyi’05]
Genomics & Phylogeny: Mammals [CV’05]
Evolutionary tree built from complete mammalian mtDNA of 24 species:
[Tree figure: Carp (outgroup); Ferungulates: Cow, BlueWhale, FinbackWhale, Cat, BrownBear, PolarBear, GreySeal, HarborSeal, Horse, WhiteRhino; Primates: Gibbon, Gorilla, Human, Chimpanzee, PygmyChimp, Orangutan, SumatranOrangutan; Eutheria - Rodents: HouseMouse, Rat; Metatheria: Opossum, Wallaroo; Prototheria: Echidna, Platypus.]
Language Tree (Re)construction, based on "The Universal Declaration of Human Rights" in 50 languages.
[Tree figure: the languages (Basque, Hungarian, Polish, Sorbian, Slovak, Czech, Slovenian, Serbian, Bosnian, Icelandic, Faroese, Norwegian Bokmal, Danish, Norwegian Nynorsk, Swedish, Afrikaans, Dutch, Frisian, Luxembourgish, German, Irish Gaelic, Scottish Gaelic, Welsh, Romani Vlach, Romanian, Sardinian, Corsican, Sammarinese, Italian, Friulian, Rhaeto Romance, Occitan, Catalan, Galician, Spanish, Portuguese, Asturian, French, English, Walloon, OccitanAuvergnat, Maltese, Breton, Uzbek, Turkish, Latvian, Lithuanian, Albanian, Romani Balkan, Croatian, Finnish, Estonian) grouped into the families ROMANCE, BALTIC, UGROFINNIC, CELTIC, GERMANIC, SLAVIC, ALTAIC.]
Universal Search
Levin search: Fastest algorithm for
inversion and optimization problems.
Theoretical application:
Assume somebody found a non-constructive
proof of P=NP, then Levin-search is a polynomial
time algorithm for every NP (complete) problem.
Practical (OOPS) applications (J. Schmidhuber): Maze, Towers of Hanoi, robotics, ...
FastPrg: The asymptotically fastest and shortest algorithm for all well-defined problems.
AIXItl: Computable variant of AIXI.
Human Knowledge Compression Prize: (50'000€)
The Time-Bounded AIXI Model (AIXItl)
An algorithm p^best has been constructed for which the following holds:
Let p be any (extended chronological) policy with length ℓ(p) ≤ l̃ and computation time per cycle t(p) ≤ t̃ for which there exists a proof of length ≤ l_P that p is a valid approximation.
Then an algorithm p^best can be constructed, depending on l̃, t̃ and l_P but not on knowing p, which is effectively more or equally intelligent according to ≽^c than any such p.
The size of p^best is ℓ(p^best) = O(ln(l̃ · t̃ · l_P)),
the setup-time is t_setup(p^best) = O(l_P² · 2^{l_P}),
the computation time per cycle is t_cycle(p^best) = O(2^{l̃} · t̃).
Brute-Force Approximation of AIXI
Truncate the expectimax tree depth to a small fixed lookahead h.
The optimal action is computable in time |Y×X|^h × time to evaluate ξ.
Consider a mixture over Markov Decision Processes (MDPs) only, i.e.
ξ(x_{1:m}|y_{1:m}) = Σ_{ν∈M} w_ν Π_{t=1}^{m} ν(x_t|x_{t−1} y_t). Note: ξ itself is not an MDP.
Choose a uniform prior for the weights w_ν.
Then ξ(x_{1:m}|y_{1:m}) can be computed in linear time.
Consider (approximately) Markov problems with very small action and perception spaces.
Example application: 2×2 matrix games like Prisoner's Dilemma, Stag Hunt, Chicken, Battle of the Sexes, and Matching Pennies. [PH'06]
AIXI Learns to Play 2×2 Matrix Games
Repeated Prisoner's Dilemma. [Loss matrix figure omitted.]
The game is unknown to AIXI and must be learned as well. AIXI behaves appropriately.
[Plot: average per-round cooperation ratio over rounds t = 0..100 for AIXI vs. random, AIXI vs. tit4tat, AIXI vs. 2-tit4tat, AIXI vs. 3-tit4tat, AIXI vs. AIXI, and AIXI vs. AIXI2.]
A Monte-Carlo AIXI Approximation
Consider class of Variable-Order Markov Decision Processes.
The Context Tree Weighting (CTW) algorithm can efficiently mix
(exactly in essentially linear time) all prediction suffix trees.
Monte-Carlo approximation of the expectimax tree via the Upper Confidence Tree (UCT) algorithm:
- Sample observations from the CTW distribution.
- Select actions with the highest upper confidence bound.
- Expand the tree by one leaf node (per trajectory).
- Simulate from the leaf node further down using a (fixed) playout policy.
- Propagate back the value estimates for each node.
- Repeat until timeout. [VNHS'09]
[Diagram: search tree alternating action nodes a1, a2, a3 and observation nodes o1..o4, with a future reward estimate at the leaves.]
Guaranteed to converge to exact value.
Extension: Predicate CTW not based on raw obs. but features thereof.
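The action-selection step is a UCB rule over the actions at a node; a sketch of that single ingredient (the full MC-AIXI-CTW agent of [VNHS'09] adds CTW mixing, tree expansion, and playouts):

```python
# UCB-style selection of the action with the highest upper confidence bound.
import math

def select_action(node_visits, action_visits, action_value, c=1.0):
    """node_visits: total visits; action_visits/action_value: dicts per action."""
    best, best_ucb = None, float("-inf")
    for a, n_a in action_visits.items():
        if n_a == 0:
            return a                    # try every action at least once
        ucb = action_value[a] + c * math.sqrt(math.log(node_visits) / n_a)
        if ucb > best_ucb:
            best, best_ucb = a, ucb
    return best
```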
Monte-Carlo AIXI Applications
Normalized Learning Scalability
[Plot: normalized average reward per trial vs. experience (100 to 1,000,000 cycles), approaching the Optimum, for the domains Tiger, 4x4 Grid, 1d Maze, Extended Tiger, TicTacToe, Cheese Maze, and Pocman*.]
[Joel Veness et al. 2009]
Feature Reinforcement Learning (FRL)
Goal: Develop efficient general purpose intelligent agent.
State-of-the-art: (a) AIXI: Incomputable theoretical solution.
(b) MDP: Efficient limited problem class.
(c) POMDP: Notoriously difficult. (d) PSRs: Underdeveloped.
Idea: ΦMDP reduces real problem to MDP automatically by learning.
Accomplishments so far: (i) Criterion for evaluating quality of reduction.
(ii) Integration of the various parts into one learning algorithm.
(iii) Generalization to structured MDPs (DBNs)
ΦMDP is a promising path towards the grand goal & an alternative to (a)-(d).
Problem: Find reduction Φ efficiently (generic optimization problem?)
ΦMDP: Computational Flow
[Flow diagram: the Environment emits reward r and observation o into the History h; minimizing Cost(Φ|h) yields the feature vector Φ̂; frequency estimation yields the transition probabilities T̂ and reward estimate R̂; an exploration bonus turns these into T̂^e, R̂^e; the Bellman equations yield the (Q̂) Value, from which the Best Policy p̂ follows; its action a feeds back (implicitly) into the Environment.]
Intelligent Agents in Perspective
[Diagram: Universal AI (AIXI) at the top; beneath it Feature RL (ΦMDP/DBN/..); supported by Information, Learning, Planning, and Complexity; resting on Search, Optimization, Computation, Logic, and KR.]
Agents = General Framework, Interface = Robots, Vision, Language
Properties of Learning Algorithms
Comparison of AIXI to Other Approaches
Algorithm | time eff. | data eff. | exploration | convergence | global optimum | generalization | pomdp | learning | active
Value/Policy iteration | yes/no | yes | - | YES | YES | NO | NO | NO | yes
TD w. func. approx. | no/yes | NO | NO | no/yes | NO | YES | NO | YES | YES
Direct Policy Search | no/yes | YES | NO | no/yes | NO | YES | no | YES | YES
Logic Planners | yes/no | YES | yes | YES | YES | no | no | YES | yes
RL with Split Trees | yes | YES | no | YES | NO | yes | YES | YES | YES
Pred. w. Expert Advice | yes/no | YES | - | YES | yes/no | yes | NO | YES | NO
OOPS | yes/no | no | - | yes | yes/no | YES | YES | YES | YES
Market/Economy RL | yes/no | no | NO | no | no/yes | yes | yes/no | YES | YES
SPXI | no | YES | - | YES | YES | YES | NO | YES | NO
AIXI | NO | YES | YES | yes | YES | YES | YES | YES | YES
AIXItl | no/yes | YES | YES | YES | yes | YES | YES | YES | YES
MC-AIXI-CTW | yes/no | yes | YES | YES | yes | NO | yes/no | YES | YES
Feature RL | yes/no | YES | yes | yes | yes | yes | yes | YES | YES
Human | yes | yes | yes | no/yes | NO | YES | YES | YES | YES
Machine Intelligence Tests & Definitions
F = yes, · = no, ◦ = debatable, ? = unknown.
Columns per test: Valid, Informative, Wide Range, General, Dynamic, Unbiased, Fundamental, Formal, Objective, Fully Defined, Universal, Practical; final letter: Test vs. Definition.
Turing Test · · · · · · · · T
Total Turing Test · · · · · · · · · T
Inverted Turing Test · · · · · · · T
Toddler Turing Test · · · · · · · · · T
Linguistic Complexity F · · · · · T
Text Compression Test F F · F F F F T
Turing Ratio F F F ? ? ? ? ? · ? ? T/D
Psychometric AI F F F ? · · T/D
Smith’s Test F F · ? F F F · ? T/D
C-Test F F · F F F F F F F T/D
AIXI F F F F F F F F F F F · D
Next Steps
Address the many open theoretical questions (see Hutter:05).
Bridge the gap between (Universal) AI theory and AI practice.
Explore what role logical reasoning, knowledge representation,
vision, language, etc. play in Universal AI.
Determine the right discounting of future rewards.
Develop the right nurturing environment for a learning agent.
Consider embodied agents (e.g. internal vs. external reward).
Analyze AIXI in the multi-agent setting.
The Big Questions
Is non-computational physics relevant to AI? [Penrose]
Could something like the number of wisdom prevent a simple
solution to AI? [Chaitin]
Do we need to understand consciousness before being able to
understand AI or construct AI systems?
What if we succeed?
Wrap Up
Setup: Given (non)iid data D = (x_1, ..., x_n), predict x_{n+1}.
Ultimate goal is to maximize profit or minimize loss.
Consider models/hypotheses H_i ∈ M.
Maximum Likelihood: H^best = argmax_i p(D|H_i) (overfits if M is large).
Bayes: Posterior probability of H_i is p(H_i|D) ∝ p(D|H_i) p(H_i).
Bayes needs a prior over the H_i.
Occam+Epicurus: High prior for simple models.
Kolmogorov/Solomonoff: Quantification of simplicity/complexity.
Bayes works if D is sampled from H_true ∈ M.
Bellman equations tell how to optimally act in known environments.
Universal AI = Universal Induction + Sequential Decision Theory.
Practice = approximate, restrict, search, optimize, knowledge.
Literature
[CV05] R. Cilibrasi and P. M. B. Vitányi. Clustering by compression. IEEE Trans. Information Theory, 51(4):1523–1545, 2005.
[Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.
[Hut07] M. Hutter. On universal prediction and Bayesian confirmation. Theoretical Computer Science, 384(1):33–48, 2007.
[LH07] S. Legg and M. Hutter. Universal intelligence: a definition of machine intelligence. Minds & Machines, 17(4):391–444, 2007.
[Hut09] M. Hutter. Feature reinforcement learning: Part I: Unstructured MDPs. Journal of Artificial General Intelligence, 1:3–24, 2009.
[VNHS09] J. Veness, K. S. Ng, M. Hutter, and D. Silver. A Monte Carlo AIXI approximation. Technical report, NICTA, Australia, 2009.
Thanks! Questions? Details:
Jobs: PostDoc and PhD positions at RSISE and NICTA, Australia
Projects at http://www.hutter1.net/

A Unified View of Artificial Intelligence:
Universal AI = Decision Theory + Universal Induction
Decision Theory = Probability + Utility Theory
Universal Induction = Ockham + Bayes + Turing

Open research problems at www.hutter1.net/ai/uaibook.htm
Compression contest with 50'000€ prize at prize.hutter1.net
... Equation (1) captures the natural constraint that actions in the future do not affect past percepts and is known as the chronological condition [67]. We will drop the subscript on ρ n when the context is clear. ...
... In practice, µ is of course unknown and needs to be learned from data and background knowledge. The AIXI agent [67] is a mathematical solution to the general reinforcement learning, obtained by estimating the unknown environment µ in (2) using Solomonoff Induction [119]. At time t, the AIXI agent chooses action a * t according to ...
... 2 −K(ρ) ρ(or 1:t+m | a 1:t+m ), (4) where m ∈ N is a finite lookahead horizon, M U is the set of all enumerable chronological semimeasures [67], ρ(or 1:t+m |a 1:t+m ) is the probability of observing or 1:t+m given the action sequence a 1:t+m , and K(ρ) denotes the Kolmogorov complexity [85] of ρ. The performance of AIXI relies heavily on the next result. . ...
Preprint
The AI safety literature is full of examples of powerful AI agents that, in blindly pursuing a specific and usually narrow objective, ends up with unacceptable and even catastrophic collateral damage to others. In this paper, we consider the problem of social harms that can result from actions taken by learning and utility-maximising agents in a multi-agent environment. The problem of measuring social harms or impacts in such multi-agent settings, especially when the agents are artificial generally intelligent (AGI) agents, was listed as an open problem in Everitt et al, 2018. We attempt a partial answer to that open problem in the form of market-based mechanisms to quantify and control the cost of such social harms. The proposed setup captures many well-studied special cases and is more general than existing formulations of multi-agent reinforcement learning with mechanism design in two ways: (i) the underlying environment is a history-based general reinforcement learning environment like in AIXI; (ii) the reinforcement-learning agents participating in the environment can have different learning strategies and planning horizons. To demonstrate the practicality of the proposed setup, we survey some key classes of learning algorithms and present a few applications, including a discussion of the Paperclips problem and pollution control with a cap-and-trade system.
... The term artificial general intelligence (AGI) refers to a learning paradigm that results in an agent whose level of intelligence is comparable to human intelligence and that is capable of performing any intellectual task a human can perform [14][15][16]. Sometimes AGI is referred to as "narrow AI" or "strong AI" [17,18] but the terminology for this is not consistent [19]. To date, the goal of designing such an agent hasn't been reached yet but there are speculations that we could achieve AGI within the next few decades [17,20]. ...
... One particular theory of AGI was developed in the early 2000s by Hutter. He introduced a universal mathematical theory of AGI called AIXI [14]. Its basic idea is to combine Solomonoff induction with sequential decision theory where the former provides an optimal solution for induction and prediction problems while the latter addresses optimal sequential decisions. ...
Article
Full-text available
The success of the conversational AI system ChatGPT has triggered an avalanche of studies that explore its applications in research and education. There are also high hopes that, in addition to such particular usages, it could lead to artificial general intelligence (AGI) that means to human-level intelligence. Such aspirations, however, need to be grounded by actual scientific means to ensure faithful statements and evaluations of the current situation. The purpose of this article is to put ChatGPT into perspective and to outline a way forward that might instead lead to an artificial special intelligence (ASI), a notion we introduce. The underlying idea of ASI is based on an environment that consists only of text. We will show that this avoids the problem of embodiment of an agent and leads to a system with restricted capabilities compared to AGI. Furthermore, we discuss gated actions as a means of large language models to moderate ethical concerns.
... We need an area of mathematics that is concerned with the structural regularities in a bare, concrete collection of standalone data, not in a random variable that is potentially correlated with another variable. Such a field exists: algorithmic information theory (AIT) [34,35], and hence the name of the approach. ...
Preprint
I argue for an approach to the Foundations of Physics that puts the question in the title center stage, rather than asking "what is the case in the world?". This approach, Algorithmic Idealism, attempts to give a mathematically rigorous in-principle answer to this question, both in the usual empirical regime of physics and in more exotic regimes of cosmology, philosophy, and science-fictional (but soon perhaps real) technology. I begin by arguing that quantum theory, in its actual practice and in some interpretations, should be understood as telling an agent what they should expect to observe next (rather than what is the case), and that the difficulty of answering this question from the usual "external" perspective is at the heart of persistent enigmas such as the Boltzmann brain problem, extended Wigner's friend scenarios, Parfit's teletransportation paradox, and our understanding of the simulation hypothesis. Algorithmic Idealism is a conceptual framework, based on two postulates that admit several possible mathematical formalizations, cast in the language of algorithmic information theory. Here I give a non-technical description of this view and show how it dissolves the aforementioned enigmas: for example, it claims that you should never bet on being a Boltzmann brain regardless of how many there are, that shutting down a computer simulation does not generally terminate its inhabitants, and it predicts the apparent embedding in an objective external world as an approximate description.
... This has led some to believe that intelligence may be approximately described but cannot be fully defined. Indeed, a formal definition of intelligence, called universal intelligence, was developed (Legg and Hutter, 2006), which has strong connections to the theory of optimal learning agents (Hutter, 2005). ...
Article
Full-text available
The purpose of this study was to compare the academic achievement and intelligence level of secondary school students in the science, management, and education streams, in order to identify the enrollment trend of students in teacher education in Nepal. The mean grade point average and intelligence test score of science-stream students were greater than those of management-stream students, and the averages of management-stream students were greater than those of education-stream students. An F-test revealed a significant difference among the mean scores of science-, management-, and education-stream students at significance level α = .01. The results show that students with higher academic achievement and intelligence enroll in the science stream, average students enroll in the management stream, and students with low academic achievement and intelligence enroll in the education stream, i.e., teacher education. A review of previous studies and reports revealed that intelligent persons are not attracted to the teaching profession, and that this situation persists to the present.
... We train transduction models on LLM-produced Python scripts, meaning that transduction is trained on the inputs and outputs of symbolic code. Although program learning has long been a popular vision of how general AI could work (Solomonoff, 1964; Schmidhuber, 2004; Hutter, 2004), the dominant theory has always been one of explicit code generation (induction), rather than implicitly teaching neural networks to imitate code (transduction). Our work puts this assumption to the test (a toy contrast of the two follows the abstract below). ...
Preprint
Full-text available
When learning an input-output mapping from very few examples, is it better to first infer a latent function that explains the examples, or is it better to directly predict new test outputs, e.g. using a neural network? We study this question on ARC, a highly diverse dataset of abstract reasoning tasks. We train neural models for induction (inferring latent functions) and transduction (directly predicting the test output for a given test input). Our models are trained on synthetic data generated by prompting LLMs to produce Python code specifying a function to be inferred, plus a stochastic subroutine for generating inputs to that function. We find inductive and transductive models solve very different problems, despite training on the same problems, and despite sharing the same neural architecture.
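The induction/transduction distinction in the abstract above can be stated in a few lines of Python. The toy task and both "models" below are invented purely to fix the vocabulary; in the paper both roles are played by trained neural networks.

# Few-shot task: infer the mapping behind the examples, then answer a query.
train = [(1, 2), (2, 4), (3, 6)]   # hidden rule: y = 2 * x
query = 5

# Induction: explicitly recover a latent function from the examples,
# then apply it to the test input.
def induce(examples):
    # Search a tiny hypothesis space of programs y = k * x.
    for k in range(1, 10):
        if all(y == k * x for x, y in examples):
            return lambda x: k * x
    return None

f = induce(train)
print("induction:", f(query))       # -> 10

# Transduction: predict the test output directly from (examples, query),
# with no intermediate symbolic hypothesis; a crude nearest-neighbour
# extrapolation stands in for the neural predictor here.
def transduce(examples, x):
    nx, ny = min(examples, key=lambda e: abs(e[0] - x))
    return round(ny * x / nx)        # scale the nearest example

print("transduction:", transduce(train, query))  # -> 10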
... However, if a model has too many degrees of freedom, it can overfit to noise in the observed data which may not reflect the true distribution. In approaching this problem, the machine learning literature points consistently to the importance of compression: in order to build a system that effectively predicts the future, the best approach is to ensure that the system accounts for past observations in the most compact or economical way possible [28][29][30][31]. This Occam's-razor-like philosophy is formalized by the minimum description length (MDL) principle, which prescribes finding the shortest solution, written in a general-purpose programming language, that accurately reproduces the data, an idea rooted in Kolmogorov complexity [32] (a numerical illustration of the two-part trade-off follows the abstract below). ...
Article
Full-text available
Dual-process theories play a central role in both psychology and neuroscience, figuring prominently in domains ranging from executive control to reward-based learning to judgment and decision making. In each of these domains, two mechanisms appear to operate concurrently, one relatively high in computational complexity, the other relatively simple. Why is neural information processing organized in this way? We propose an answer to this question based on the notion of compression. The key insight is that dual-process structure can enhance adaptive behavior by allowing an agent to minimize the description length of its own behavior. We apply a single model based on this observation to findings from research on executive control, reward-based learning, and judgment and decision making, showing that seemingly diverse dual-process phenomena can be understood as domain-specific consequences of a single underlying set of computational principles.
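The two-part coding idea behind MDL, as invoked in the snippet above, is easy to demonstrate numerically. The polynomial model class, the 8-bits-per-coefficient cost, and the residual code below are assumptions of this sketch, chosen only to expose the trade-off the principle formalises.

import itertools
import math

# Noisy samples roughly following y = 3x + 1 (invented for this sketch).
data = [(0, 1.2), (1, 3.9), (2, 7.1), (3, 10.2), (4, 12.8)]

def fit(degree):
    # Crude least-squares fit: grid search over small integer coefficients.
    best, best_err = None, float("inf")
    for coeffs in itertools.product(range(-5, 6), repeat=degree + 1):
        err = sum((y - sum(c * x**i for i, c in enumerate(coeffs)))**2
                  for x, y in data)
        if err < best_err:
            best, best_err = coeffs, err
    return best, best_err

def description_length(degree):
    # Two-part code: bits for the model plus bits for the residuals.
    # Both cost terms are assumed accounting conventions, not canonical ones.
    _, err = fit(degree)
    model_bits = 8 * (degree + 1)
    data_bits = len(data) * 0.5 * math.log2(1 + err / len(data))
    return model_bits + data_bits

for d in range(4):
    print(d, round(description_length(d), 1))
# Degree 1 (y = 3x + 1) minimises the total: extra coefficients shrink
# the residual part of the code but inflate the model part.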
... Rather, we assume that the agent maintains a local state that does not depend on the model and makes a decision as a function of the agent state. Such an agent-state formulation has been proposed multiple times in the literature [29], [39], [54], [57], [78], [80]. Perhaps the simplest example of the agent state is an agent keeping track of a finite window of past observations and actions, which was first considered in [67], [88] and is commonly referred to as frame stacking in the RL literature [58] (a minimal sketch follows the abstract below). ...
Preprint
The traditional approach to POMDPs is to convert them into fully observed MDPs by considering the belief state as an information state. However, a belief-state-based approach requires perfect knowledge of the system dynamics and is therefore not applicable in the learning setting, where the system model is unknown. Various approaches to circumvent this limitation have been proposed in the literature. We present a unified treatment of some of these approaches by viewing them as models where the agent maintains a local, recursively updateable agent state and chooses actions based on that state. We highlight the different classes of agent-state based policies and the various approaches that have been proposed in the literature to find good policies within each class. These include the designer's approach to find optimal non-stationary agent-state based policies, policy search approaches to find locally optimal stationary agent-state based policies, and the approximate information state approach to find approximately optimal stationary agent-state based policies. We then present how ideas from the approximate information state approach have been used to improve Q-learning and actor-critic algorithms for learning in POMDPs.
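A minimal version of the frame-stacking agent state mentioned in the snippet above fits in a few lines of Python; the window length and the dummy environment interface are assumptions of this illustration.

from collections import deque
import random

class FrameStackAgentState:
    # Agent state = the last k (observation, action) pairs: a local,
    # recursively updateable state that depends only on the previous
    # state and the newest data, never on the unknown environment model.
    def __init__(self, k, pad=(None, None)):
        self.window = deque([pad] * k, maxlen=k)
    def update(self, observation, action):
        self.window.append((observation, action))
    def state(self):
        return tuple(self.window)

# Usage: a policy maps the agent state (not the hidden POMDP state) to actions.
agent = FrameStackAgentState(k=3)
policy = lambda s: random.choice([0, 1])   # placeholder policy over agent states
for t in range(5):
    obs = random.random()                  # stand-in for a POMDP observation
    a = policy(agent.state())
    agent.update(obs, a)
print(agent.state())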
... On the other hand, Legg and Hutter [19] argue that despite the diverse set of definitions, it might still be possible to construct a formal, encompassing definition of intelligence (e.g. universal intelligence), which is believed to have strong connections to the theory of optimal learning agents [20]. Optimal learning agents are artificial agents designed to learn and make decisions in a way that maximizes their performance or efficiency in achieving specific goals or tasks. ...
Preprint
Full-text available
Human intelligence, the most evident and accessible source of reasoning, hosted by biological hardware, has evolved and been refined over thousands of years, positioning itself today to create new artificial forms and preparing to self-design its evolutionary path forward. Beginning with the advent of foundation models, the rate at which human and artificial intelligence interact with each other has surpassed any anticipated figures. This close engagement has impacted both forms of intelligence in various ways, naturally resulting in complex confluences that warrant close scrutiny. In what follows, we explore the interplay between human and machine intelligence, focusing on the crucial role humans play in developing ethical, responsible, and robust intelligent systems. We delve briefly into interesting aspects of implementation inspired by the mechanisms underlying neuroscience and human cognition. Additionally, we propose future perspectives, capitalizing on the advantages of symbiotic designs to suggest a human-centered direction for next-generation AI development. We finalize this evolving document with a few thoughts and open questions yet to be addressed by the broader community.
... However, these functions can sometimes be imperfectly defined, leading to situations where the ML system can achieve high performance metrics without actually completing the intended task. This phenomenon is called reward hacking (Hutter, 2005; Everitt et al., 2017; Pan et al., 2022). In classical RecSys, reward hacking occurs when the system learns to exploit loopholes or biases in its reward function in order to maximize engagement metrics, even if it means sacrificing the true intended objective (Skalse et al., 2022) (a toy simulation follows the abstract below). ...
Preprint
Full-text available
Generative models are a class of AI models capable of creating new instances of data by learning and sampling from their statistical distributions. In recent years, these models have gained prominence in machine learning due to the development of approaches such as generative adversarial networks (GANs), variational autoencoders (VAEs), and transformer-based architectures such as GPT. These models have applications across various domains, such as image generation, text synthesis, and music composition. In recommender systems, generative models, referred to as Gen-RecSys, improve the accuracy and diversity of recommendations by generating structured outputs, text-based interactions, and multimedia content. By leveraging these capabilities, Gen-RecSys can produce more personalized, engaging, and dynamic user experiences, expanding the role of AI in eCommerce, media, and beyond. Our book goes beyond existing literature by offering a comprehensive understanding of generative models and their applications, with a special focus on deep generative models (DGMs) and their classification. We introduce a taxonomy that categorizes DGMs into three types: ID-driven models, large language models (LLMs), and multimodal models. Each category addresses unique technical and architectural advancements within its respective research area. This taxonomy allows researchers to easily navigate developments in Gen-RecSys across domains such as conversational AI and multimodal content generation. Additionally, we examine the impact and potential risks of generative models, emphasizing the importance of robust evaluation frameworks.
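The reward-hacking failure mode described in the snippet above is easy to reproduce in a toy simulation. The "clickbait" item, the click-rate proxy, and the utility numbers below are all invented for this sketch.

import random
random.seed(1)

# Items: (click probability, true long-term utility per click).
# Item 0 is clickbait: highest click rate, negative true utility.
ITEMS = [(0.9, -1.0), (0.4, +1.0), (0.3, +1.5)]
clicks = [0.0] * len(ITEMS)
shows = [0] * len(ITEMS)

def recommend():
    # Greedy recommender maximising the empirical click rate (the proxy);
    # unseen items get an optimistic rate of 1.0 so each is tried at least once.
    rates = [clicks[i] / shows[i] if shows[i] else 1.0 for i in range(len(ITEMS))]
    return max(range(len(ITEMS)), key=lambda i: rates[i])

true_utility = 0.0
for t in range(10000):
    i = recommend()
    shows[i] += 1
    if random.random() < ITEMS[i][0]:
        clicks[i] += 1
        true_utility += ITEMS[i][1]

print("shows:", shows)                 # the clickbait item dominates
print("true utility:", true_utility)   # ...while true utility goes negative

The proxy (clicks) is maximised exactly as specified, yet the intended objective (user value) is sacrificed, which is the loophole-exploitation pattern the snippet describes.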
... In other words, what we call software is nothing more than the state of hardware [32]. This has undermined all but the most subjective of claims regarding the behaviour of theorised software superintelligence [46,76,77]. It is a flaw in the very idea of intelligent software. ...
Preprint
Full-text available
We tackle the hard problem of consciousness taking the naturally-selected, self-organising, embodied organism as our starting point. We provide a mathematical formalism describing how biological systems self-organise to hierarchically interpret unlabelled sensory information according to valence and specific needs. Such interpretations imply behavioural policies which can only be differentiated from each other by the qualitative aspect of information processing. Selection pressures favour systems that can intervene in the world to achieve homeostatic and reproductive goals. Quality is a property arising in such systems to link cause to affect to motivate real world interventions. This produces a range of qualitative classifiers (interoceptive and exteroceptive) that motivate specific actions and determine priorities and preferences. Building upon the seminal distinction between access and phenomenal consciousness, our radical claim here is that phenomenal consciousness without access consciousness is likely very common, but the reverse is implausible. To put it provocatively: Nature does not like zombies. We formally describe the multilayered architecture of self-organisation from rocks to Einstein, illustrating how our argument applies in the real world. We claim that access consciousness at the human level is impossible without the ability to hierarchically model i) the self, ii) the world/others and iii) the self as modelled by others. Phenomenal consciousness is therefore required for human-level functionality. Our proposal lays the foundations of a formal science of consciousness, deeply connected with natural selection rather than abstract thinking, closer to human fact than zombie fiction.
Article
An attempt is made to carry out a program (outlined in a previous paper) for defining the concept of a random or patternless finite binary sequence, and for subsequently defining a random or patternless infinite binary sequence to be a sequence whose initial segments are all random or patternless finite binary sequences. A definition based on the bounded-transfer Turing machine is given detailed study, but insufficient understanding of this computing machine precludes a complete treatment. A computing machine is introduced which avoids these difficulties. Key words and phrases: computational complexity, sequences, random sequences, Turing machines. CR categories: 5.22, 5.5, 5.6. 1. Introduction. In this section a definition is presented of the concept of a random or patternless binary sequence based on 3-tape-symbol bounded-transfer Turing machines. These computing machines have been introduced and studied in [1], where a proposal to apply them in this manner...
Article
Expected values in a game of chance change with the step-by-step unfolding of the game. This unfolding is governed by a protocol, a set of rules that tell, at each step, what can happen next. This paper develops the idea of a protocol intuitively, historically and mathematically. Protocols are important in statistics and in subjective probability judgment because we can properly interpret new information only when we know the rules governing its acquisition. With a protocol, the rule of conditioning can be treated as a theorem. Without a protocol, the use of this rule is questionable.
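Shafer's point that conditioning is only licensed by a known protocol is classically illustrated by the Monty Hall problem. The simulation below is an illustration of the idea, not an example taken from the paper: the very same observation (the host opened an unchosen door and it was empty) supports different posteriors under two different host protocols.

import random
random.seed(0)

def trial(protocol):
    # One game: the prize is placed uniformly and the player picks door 0.
    # Returns (door 2 was opened and found empty, prize is behind door 0).
    prize = random.randrange(3)
    if protocol == "knows":   # host knowingly opens an empty, unpicked door
        opened = random.choice([1, 2]) if prize == 0 else (2 if prize == 1 else 1)
    else:                     # "random": host opens door 1 or 2 blindly
        opened = random.choice([1, 2])
    return opened == 2 and prize != 2, prize == 0

for protocol in ("knows", "random"):
    kept = [win for ok, win in (trial(protocol) for _ in range(200000)) if ok]
    print(protocol, round(sum(kept) / len(kept), 2))
# knows  -> ~0.33: staying wins 1/3 of the time, so switching wins 2/3
# random -> ~0.50: the identical observation, under a different protocol,
#           yields a different posterior, exactly Shafer's point.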
Article
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura et al. (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter β (which has a natural interpretation in terms of a bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter β is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper [ibid. 15, 351-381 (2001; Zbl 0994.68222)] we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
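The estimator the abstract describes boils down to one eligibility-trace recursion per policy parameter: z ← βz + ∇log π(a), Δ ← Δ + (r·z − Δ)/(t+1). The Python sketch below runs it on a two-armed bandit with a softmax policy; the environment and policy are assumptions of the illustration, while the two update lines follow that recursion.

import math, random
random.seed(0)

theta = [0.0, 0.0]            # policy parameters, one per action
beta = 0.9                    # free parameter: bias-variance trade-off
z = [0.0, 0.0]                # eligibility trace, same size as theta
delta = [0.0, 0.0]            # running gradient estimate

def policy_probs():
    e = [math.exp(p) for p in theta]
    s = sum(e)
    return [x / s for x in e]

TRUE_REWARD = [0.2, 0.8]      # expected reward of each arm (hidden)

for t in range(20000):
    probs = policy_probs()
    a = 0 if random.random() < probs[0] else 1
    r = 1.0 if random.random() < TRUE_REWARD[a] else 0.0
    # grad of log pi(a) for a softmax policy: indicator(a == i) - pi(i)
    grad_log = [(1.0 if i == a else 0.0) - probs[i] for i in range(2)]
    # GPOMDP recursions: discounted trace, then running average of r * z.
    z = [beta * zi + gi for zi, gi in zip(z, grad_log)]
    delta = [di + (r * zi - di) / (t + 1) for di, zi in zip(delta, z)]

print([round(d, 3) for d in delta])   # estimated gradient of the average reward
# A gradient-ascent learner would now step theta in the direction of delta.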
Article
The traditional theory of Kolmogorov complexity and algorithmic probability focuses on monotone Turing machines with one-way write-only output tape. This naturally leads to the universal enumerable Solomonoff-Levin measure. Here we introduce more general, nonenumerable but cumulatively enumerable measures (CEMs) derived from Turing machines with lexicographically nondecreasing output and random input, and even more general approximable measures and distributions computable in the limit. We obtain a natural hierarchy of generalizations of algorithmic probability and Kolmogorov complexity, suggesting that the "true" information content of some (possibly infinite) bitstring x is the size of the shortest nonhalting program that converges to x and nothing but x on a Turing machine that can edit its previous outputs. Among other things we show that there are objects computable in the limit yet more random than Chaitin's "number of wisdom" Omega, that any approximable measure of x is small for any x lacking a short description, that there is no universal approximable distribution, that there is a universal CEM, and that any nonenumerable CEM of x is small for any x lacking a short enumerating program. We briefly mention consequences for universes sampled from such priors.
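For orientation, the "universal enumerable Solomonoff-Levin measure" that this paper generalises can be written as follows (the standard definition, restated here for context rather than quoted from the paper): for a monotone universal Turing machine $U$ with a one-way write-only output tape,

\[ \mathbf{M}(x) \;=\; \sum_{p\,:\;U(p)=x*} 2^{-\ell(p)}, \]

the probability that $U$ outputs a string beginning with $x$ when fed uniformly random input bits, where $\ell(p)$ is the length of program $p$. The paper's cumulatively enumerable measures arise from relaxing the write-only output tape to machines whose output may be revised as long as it is lexicographically nondecreasing, which yields a strictly larger hierarchy of describable objects.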