
Toward tractable universal induction through recursive program learning

Arthur Franz

Independent researcher*

Abstract. Since universal induction is a central topic in artificial general intelligence (AGI), it is argued that compressing all sequences up to a complexity threshold should be the main thrust of AGI research. A measure for partial progress in AGI is suggested along these lines. By exhaustively executing all two- and three-state Turing machines, a benchmark for low-complexity universal induction is constructed. Given the resulting binary sequences, programs are induced by recursively constructing a network of functions. The construction is guided by a breadth-first search departing only from leaves of the lowest-entropy programs, making the detection of low-entropy ("short") programs efficient. This way, all sequences generated by two-state machines, and 80% of the sequences generated by three-state machines, could be compressed back to roughly the size defined by their Kolmogorov complexity.

1 Introduction

What is intelligence? After compiling a large set of definitions in the literature, Legg and Hutter [8] came up with a definition of intelligence that is consistent with most other definitions:

“Intelligence measures an agent’s ability to achieve goals in a wide range of

environments.”

Based on that definition, Marcus Hutter [5] has developed a mathematical formulation of, and theoretical solution to, the universal AGI problem, called AIXI. Although it is not computable, approximations may lead to tractable solutions. AIXI is in turn essentially based on Solomonoff's theory of universal induction [15], which assigns the following universal prior to any sequence x:

M(x) := \sum_{p\,:\,U(p)=x*} 2^{-l(p)} \qquad (1.1)

where p is a program of length l(p) executed on a universal monotone Turing machine U. U(p) = x* denotes that after executing program p, the machine U prints the sequence x without necessarily halting. Impressively, it can be shown [5] that after seeing the first t digits of any computable sequence, this universal prior is able to predict the next digit with a probability converging to certainty:

* e-mail: franz@fias.uni-frankfurt.de


lim_{t→∞} M(x_t | x_1, . . . , x_{t−1}) = 1. Since most probability weight is assigned to short programs (Occam's razor), this proves that compressed representations lead to successful predictions of any computable environment. This realization makes it especially promising to try to construct an efficient algorithm for universal induction as a milestone, even cornerstone, of AGI.

A general but brute force approach is universal search. For example, Levin

search [10] executes all possible programs, starting with the shortest, until one of

them generates the required data sequence. Although general, it is not surprising

that the approach is computationally costly and rarely applicable in practice.
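For illustration, here is a minimal Python sketch of Levin-style search: binary programs are enumerated in length-increasing order, with shorter programs receiving exponentially larger step budgets. The interpreter execute(program, max_steps) is a hypothetical placeholder, since the search is agnostic about the underlying machine.

from itertools import product

def levin_search(target, execute, max_len=16):
    # Enumerate binary programs from shortest to longest; shorter
    # programs receive exponentially larger step budgets.
    for length in range(max_len + 1):
        budget = 2 ** (max_len - length)
        for bits in product("01", repeat=length):
            program = "".join(bits)
            # execute() is an assumed interpreter returning the program's
            # output after at most `budget` steps (None if it runs over).
            if execute(program, budget) == target:
                return program
    return None  # no program up to max_len generates the target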

On the other side of the spectrum, there are non-general but computationally tractable approaches. Specifically, inductive programming techniques are used to induce programs from data [6], and there are some approaches within the context of AGI as well [16, 14, 12, 3]. However, the generalization of many algorithms is impeded by the curse of dimensionality, which all algorithms face at some point. Considering the (algorithmic) complexity and diversity of tasks solved by today's typical algorithms, we observe that most if not all are highly specific, and many are able to solve quite complex tasks (known as "narrow AI" [7]). Algorithms from the field of data compression are no exception. For example, the celebrated Lempel-Ziv compression algorithm (see e.g. [2]) handles stationary sequences but fails at compressing simple but non-stationary sequences efficiently. AI algorithms undoubtedly exhibit some intelligence, but when comparing them to humans, a striking difference comes to mind: the tasks solvable by humans seem to be much less complex albeit very diverse, while tasks solved by AI algorithms tend to be quite complex but narrowly defined (Fig. 1.1).

For this reason, we should not try to beat the curse of dimensionality mercilessly awaiting us at high complexities, but instead head for general algorithms at low complexity levels and fill the task cup from the bottom up.

2 A Measure for Partial Progress in AGI

One of the troubles of AGI research is the lack of a measure for partial progress. While the Turing test is widely accepted as a test for general intelligence, it is only able to give an all-or-none signal. In spite of all attempts, we do not yet have a way to tell whether we are half way or 10% of the way toward general intelligence. The reason for this disorientation is the fact that every algorithm that has achieved partially intelligent behavior has failed to generalize to a wider range of behaviors. Therefore, it is hard to tell whether research has progressed in the right direction or has been on the wrong track all along.

However, since making universal induction tractable seems to be a cornerstone for AGI, we can formalize partial progress toward AGI as the extent to which universal induction has been efficiently implemented. Additionally, if we start out with a provably general algorithm that works up to a complexity level, thereby solving all simple compression problems, the objection about its possible non-generalizability is countered. The measure for partial progress then simply becomes the complexity level up to which the algorithm can solve all problems.

Fig. 1.1. Approach to artificial general intelligence. Instead of trying to solve complex but narrow tasks, AGI research should head for solving all simple tasks and only then expand toward more complexity.

2.1 Related Work

This measure is reminiscent of existing intelligence tests based on algorithmic complexity. Hernandez-Orallo [4] has developed the C-test, which allows only sequences with unique induced explanations, whose prefixes lead to the same explanation, and which satisfy various other restrictions on the sequence set. However, since the pitfall of building yet another narrow AI system is lurking at every step, a measure of research progress in AGI (not so much of the intelligence of an agent) should make sure that all sequences below a complexity level are compressed successfully, and cannot afford to discard large subsets as is done in the C-test.

Legg and Veness [9] developed a measure that takes into account the performance of an agent in a reinforcement learning setting, which includes an Occam bias decreasing exponentially with the complexity of the environment. They are correct to note that the solution to the important exploration-exploitation dilemma is neglected in a purely complexity-based measure. In that sense, universal induction is a necessary albeit not sufficient condition for intelligence. For our purposes, it is important to set up a measure for universal induction alone, as it seems to be a simpler problem than that of building complete intelligent agents.


Text-based measures of intelligence follow the rationale that an agent can be considered intelligent if it is able to compress the information content of a text, such as humanity's knowledge in the form of Wikipedia [1, 13]. However, this kind of compression requires large amounts of information not present in the text itself, such as real-world experience acquired through the agent's senses. Therefore, either the task is ill-defined for agents not disposing of such external information, or the agent has to be provided with such information, extending texts to arbitrary data, which is equivalent to the compression of arbitrary sequences as proposed here.

2.2 Formalization

Suppose we run binary programs on a universal monotone Turing machine U. U's possible input programs p_i can be ordered in a length-increasing lexicographic way: "" (empty program), "0", "1", "00", "01", "10", "11", "000", etc., up to a maximal complexity level L. We run all those programs until they halt or for a maximum of t time steps, and read off their outputs x_i on the output tape. In contrast to Kolmogorov complexity¹, we use the time-bounded version – the Levin complexity – which is computable and includes a penalty term on computation time [11]:

Kt(x) = \min_p \{\, |p| + \log t : U(p) = x \text{ in } t \text{ steps} \,\} \qquad (2.1)

Saving all the generated strings paired with their optimal programs (x_i, p_i^o), with

p_i^o(x_i) = \arg\min_p \{\, |p| + \log t : U(p) = x_i \text{ in } t \text{ steps},\ |p| \le L \,\}

we have all we need for the progress measure. The goal of universal induction is to find all such optimal programs p_i^o for each of the x_i. If p_i is the actually found program, its performance can be measured by

r_i(L) = \frac{|p_i^o|}{|p_i|} \in (0, 1] \qquad (2.2)

If no such program is found, there is no time-bounded solution to the compression problem. The overall performance R at complexity level L could be used as a measure for partial progress in universal induction and be given by averaging:

R(L) = \langle r_i(L) \rangle \qquad (2.3)
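To make the measure concrete, a minimal Python sketch of the scoring follows. Treating failed inductions as r_i = 0 is our assumption; the text above leaves that case implicit.

from math import log2

def levin_complexity(program_bits, steps):
    # Time-bounded (Levin) complexity of eq. (2.1): |p| + log t.
    return program_bits + log2(steps)

def progress_R(benchmark, induced):
    # benchmark: maps each sequence x_i to its optimal program size |p_i^o|
    # induced:   maps x_i to the size of the program actually found, or None
    ratios = []
    for x, optimal_size in benchmark.items():
        found_size = induced.get(x)
        # r_i = |p_i^o| / |p_i| in (0, 1]; score 0 if induction failed
        # (an assumption -- the paper leaves the failure case open)
        ratios.append(optimal_size / found_size if found_size else 0.0)
    return sum(ratios) / len(ratios)  # R(L) = <r_i(L)>, eq. (2.3)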

One may object that the number of programs increases exponentially with their length, such that an enumeration quickly becomes intractable. This is a weighty argument if the task is universal search – a general procedure for inversion problems. However, we suggest that this procedure play the mere role of a benchmark for an efficient universal induction algorithm, which will use completely different methods than universal search and will be described in Section 3.

¹ The Kolmogorov complexity of a string is defined as the length of the shortest program able to generate that string on a Turing machine.


Therefore, using the set of simple programs as a benchmark may be enough to set the universal induction algorithm on the right track.

Note that only a small fraction of possible sequences can be generated this way. After all, it is well known that only exponentially few, O(2^{n−m}), sequences of length n can be compressed by m bits [11].

2.3 Implementation

Implementing this test does not require coding of a universal Turing machine (TM), since computers are already universal TMs. Instead, enumerating all transition functions of an n-state machine is sufficient. The machine used here has one bidirectional, two-way infinite work tape and a unidirectional, one-way infinite, write-only output tape. Two symbols are used, B = {0, 1}, with the states taken from Q = {0, . . . , n − 1}. The transition map is then:

Q \times B \to Q \times \{0, 1, L, R, N\} \times \{0, 1, N\} \qquad (2.4)

where L, R, and N denote left, right, and no motion of the head, respectively. The work tape can move in any direction, while the output tape either writes 0 or 1 and moves to the right, or does not move at all (N). No halting or accepting states were utilized. The machine starts with both tapes filled with zeros. A finite sequence x is considered as generated by machine T given transition function (program) p if it is at the left of the output head at some point; we write T(p) = x*. The transition table enumerates all possible combinations of state and work-tape content, which amount to |Q| · |B| = 2n rows. Therefore, there exist |Q| · 5 · 3 = 15n different instructions and consequently (15n)^{2n} different machines with n states. For n = 2, 3 this amounts to around 10^6 and 10^10 machines, respectively. All those machines (n = 1 machines are trivial) were executed until 50 symbols were written on the output tape or the maximum number of 400 time steps was reached. All unique outputs were stored, amounting to 210 and 43295 for n = 2, 3, respectively, and paired with their respective programs.
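A minimal Python sketch of this setup is given below. It follows the machine description above (transition map (2.4), all-zero tapes, 50-symbol and 400-step limits), but the concrete data layout and the names run_machine and all_programs are illustrative assumptions, not the original implementation.

from itertools import product
from collections import defaultdict

WORK = ("0", "1", "L", "R", "N")  # write 0/1, move left/right, or no-op
OUT = ("0", "1", "N")             # write 0/1 and move right, or no-op

def run_machine(program, max_symbols=50, max_steps=400):
    # program maps (state, read symbol) -> (next state, work action, out action)
    tape = defaultdict(int)        # two-way infinite work tape, all zeros
    state, head, output = 0, 0, []
    for _ in range(max_steps):
        state, work, out = program[(state, tape[head])]
        if work in ("0", "1"):
            tape[head] = int(work)
        elif work == "L":
            head -= 1
        elif work == "R":
            head += 1
        if out != "N":
            output.append(out)     # output head writes and moves right
            if len(output) >= max_symbols:
                break
    return "".join(output)

def all_programs(n):
    # Generator over all (15n)^(2n) transition tables of an n-state machine.
    rows = [(q, b) for q in range(n) for b in (0, 1)]  # 2n table rows
    instructions = list(product(range(n), WORK, OUT))  # 15n options per row
    for choice in product(instructions, repeat=len(rows)):
        yield dict(zip(rows, choice))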

Table 1 depicts a small sample of the outputs. It may be interjected that sequences generated by 2- and 3-state machines are not very "interesting". However, the present work is just the initial step. Moreover, it is interesting to note that even the 2-state machine shows non-repetitive patterns with an ever increasing number of 1's.

states  sample outputs
2       10101010101010101010101010101010101010101010101010
2       11011011011011011011011011011011011011011011011011
2       00010011001110011110011111001111110011111110011111
3       00101101001010001011010010100010110100101000101101
3       10111111001110111101001110101111010100111010101111
3       01011010110101110101101101011110101101101101011111

Table 1. Sample outputs of 2- and 3-state Turing machines

In the 3-state machines, patterns quickly become more involved and require "intelligence" to detect the regularities (try the last one!). Consistently with the reasoning in the introduction, could it be that the threshold complexity level of human intelligence is not far off from the sequence complexity of 3-state machines, especially when the data presentation is not comfortably tuned according to natural image statistics?

We suggest that these patterns, paired with their respective programs, constitute a benchmark for partial progress in artificial general intelligence. If an efficient algorithm can compress these patterns to small programs, then it can be claimed to be moderately intelligent. Modern compression algorithms, such as Lempel-Ziv (on which the famous Zip compression is based), fail at compressing those sequences, since the packed file size increases with sequence length (ergo r_i gets arbitrarily small), while the size of the TM transition table is always the same, independent of sequence length.
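This failure is easy to reproduce with an off-the-shelf compressor. The sketch below feeds increasing prefixes of a non-stationary pattern with ever-growing runs of 1's (similar in spirit to the third sequence in Table 1) to zlib, a Lempel-Ziv based library; the packed size keeps growing with sequence length, whereas a transition table of constant size suffices to generate the whole sequence.

import zlib

def pattern(k):
    # First k symbols of 0 1 0 11 0 111 0 1111 ... : a simple
    # non-stationary pattern with ever-increasing runs of 1's.
    s, run = "", 1
    while len(s) < k:
        s += "0" + "1" * run
        run += 1
    return s[:k]

for k in (50, 500, 5000):
    print(k, len(zlib.compress(pattern(k).encode())))
    # packed size grows with k; the generating program does not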

3 Universal Induction of Low-Complexity Sequences

3.1 Methods

Having generated all strings printed by two- and three-state programs, the task is to build an efficient algorithm compressing those strings back into a short representation – not necessarily the original one, but one of similar size in terms of entropy.

As exemplified in Fig. 3.1, the present algorithm induces a recursive network of function primitives using a sequence generated by a three-state Turing machine. Four function primitives were used that generate constant, alternating, or incremental sequences, or a single number:

C(s, n) = s, s, \dots, s \ (n \text{ times}), \quad s \in \mathbb{Z} \cup S,\ n \in \mathbb{N} \qquad (3.1)

A(a, b, n) = a, b, a, b, \dots, a, b \ (n \text{ times}), \quad a, b \in \mathbb{Z} \cup S,\ n \in \mathbb{N} \qquad (3.2)

I(s, d, n) = s + 0 \cdot d,\ s + 1 \cdot d,\ \dots,\ s + (n-1) \cdot d, \quad s, d \in \mathbb{Z},\ n \in \mathbb{N} \qquad (3.3)

R(s) = s, \quad s \in \mathbb{Z} \cup S \qquad (3.4)

where \mathbb{Z} is the set of integers, \mathbb{N} the set of non-negative integers, and S = {C, A, I, R} is the set of arbitrary symbols (here function names).
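In Python, the four primitives can be sketched as follows. In the actual algorithm the parameters are themselves sequences, one tuple per output segment, so a small concatenating helper is included for illustration; the name apply_segments is ours.

def C(s, n):
    return [s] * n                        # constant: s repeated n times

def A(a, b, n):
    return [a, b] * n                     # alternating: a, b, ..., n pairs

def I(s, d, n):
    return [s + i * d for i in range(n)]  # incremental: step d from s

def R(s):
    return [s]                            # a single number or symbol

def apply_segments(f, *param_seqs):
    # One output segment per parameter tuple, segments concatenated, e.g.
    # apply_segments(C, (3, 9, 6), (3, 4, 5)) -> [3,3,3,9,9,9,9,6,6,6,6,6]
    out = []
    for params in zip(*param_seqs):
        out.extend(f(*params))
    return out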

The entropy of a given function network is computed as follows. Let x_i ∈ \mathbb{Z} ∪ S denote the inputs to those functions without parent functions. The distribution p(n) = 2^{−|n|}/3 is imposed on integers n ∈ \mathbb{Z}. If x_i ∈ \mathbb{Z}, then its information content is given by H(x_i) = −log_2 p(x_i) = |x_i| + log_2(3) bits², which we simplify to |x_i| + 1 bits. If x_i ∈ S, then H(x_i) = log_2 |S| = 2 bits. The overall entropy of the network is the sum H_{tot} = \sum_i H(x_i).

² This coding is linear in the integer value. We could use Elias gamma or delta coding, which is logarithmic; however, the algorithm has turned out to perform better with linear coding. This is work in progress and this issue shall be investigated in future work.
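A direct transcription of this entropy computation (with the simplified |x| + 1 bit cost for integers) might look as follows:

def entropy_bits(x):
    # Integers cost |x| + 1 bits (simplification of |x| + log2(3));
    # function symbols from S = {C, A, I, R} cost log2|S| = 2 bits.
    if isinstance(x, int):
        return abs(x) + 1
    return 2

def total_entropy(inputs):
    # H_tot: sum over all inputs of functions without parent functions.
    return sum(entropy_bits(x) for x in inputs)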


Fig. 3.1. Exemplifying recursive compression. A sequence is recursively transformed by a network of functions to an increasingly smaller representation. The original sequence takes up 220 bits of information: 129 bits for encoding the 0's and 1's plus the length of the sequence (91 bits). At the zeroth recursion level the sequence is parsed using a constant function (C) that prints n times the number s. At level 1 the sequences of function inputs are shown that recreate the original sequence. The original sequence is thereby transformed to two sequences of function inputs. Subsequently, an alternating function (A) explains the first sequence and an incremental function (I) explains the second one. This is done recursively, until the entropy cannot be reduced any more. The bold inputs remain unexplained and amount to 96 bits. Note that the final number of inputs does not depend on the sequence length any more. If we remove those inputs that change with sequence length (bold and underlined), then the entropy encoding the sequence structure is only 27 bits (only bold).


It may be objected that, according to the Minimum Description Length principle, the information contained in the algorithm itself has to be taken into account as well. After all, for any sequence x it is possible to define a universal Turing machine U′ such that Kt_{U′}(x) = 0, thereby encoding all information about x in the design of U′, making U′ highly dependent on x. However, since both the present algorithm and the benchmark do not depend on x, their description length is a mere constant and can be neglected.

At each step of the algorithm a set of unexplained sequences is present, namely the sequences of inputs to those functions without parent functions. For each such input sequence its entropy can be computed and the sequences ordered by decreasing entropy. Looping through that set, starting with the sequence of highest entropy (requiring most explanation), the algorithm tries to generate a part of the sequence with one of the function primitives. For example, if the sequence q = 3, 3, 3, 9, 9, 9, 9, 6, 6, 6, 6, 6 is present, a sequence of inputs to the constant function is induced: C(s = (3, 9, 6), n = (3, 4, 5)). The entropy is reduced: in this case H(q) = 87 bits, while its explanation takes only H(s) + H(n) = 36 bits. For each function primitive, such an entropy change is computed. If the entropy has been reduced, the function is accepted and added to the network. Otherwise, it is accepted only if its child (the function that receives its outputs) has been entropy reducing, which allows the search to overcome local minima in the entropy landscape to some extent.
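The constant-function step of this example can be sketched as follows, reusing total_entropy from above; the run-length parse shown here is one plausible reading of the induction step, not the paper's exact code.

def induce_constant(seq):
    # Parse a sequence into maximal constant runs, yielding the two
    # input sequences s (values) and n (run lengths) of C.
    s, n = [], []
    for x in seq:
        if s and s[-1] == x:
            n[-1] += 1
        else:
            s.append(x)
            n.append(1)
    return s, n

q = [3, 3, 3, 9, 9, 9, 9, 6, 6, 6, 6, 6]
s, n = induce_constant(q)                    # s = [3, 9, 6], n = [3, 4, 5]
before = total_entropy(q)                    # H(q) = 87 bits
after = total_entropy(s) + total_entropy(n)  # H(s) + H(n) = 21 + 15 = 36 bits
accept = after < before                      # entropy reduced: C is accepted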

In this fashion a breadth-first search is performed, while pruning away the least promising tree branches. Those are defined as programs having a total entropy higher than 1.05 times that of the lowest-entropy program.³

3.2 Results

Since our fixed-size Turing machine programs can create sequences of arbitrary length, successful program induction is defined as induction of a program with a fixed number of inputs to the function network. Further, to establish a benchmark, the entropy of the Turing machine programs is computed as follows. There are (15n)^{2n} machines with n states, hence the amount of information needed to specify a TM program with n states is

H_{TM}(n) = 2n \log_2(15n) \qquad (3.5)

which results in a program size of around 20 and 33 bits for two- and three-state TMs, respectively. Since the induced programs encode the length l of the target sequence and the TM programs do not, the information contained in the length has to be subtracted from the induced program entropy (the bold and underlined numbers in Fig. 3.1).
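As a quick check of eq. (3.5):

from math import log2

def h_tm(n):
    return 2 * n * log2(15 * n)  # bits to specify an n-state table, eq. (3.5)

print(round(h_tm(2)), round(h_tm(3)))  # -> 20 33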

All sequences generated by all two-state machines could be compressed successfully. The average induced program size is μ_2 = 22 bits with a standard deviation of σ_2 = 23 bits. Because of the large number of three-state sequences,

³ Python code and string/program pairs are available upon request.


200 sequences were randomly sampled. This way, 80 ± 4% of three-state sequences could be compressed successfully, with μ_3 = 27 bits and σ_3 = 20 bits. However, "unsuccessful" sequences could be compressed to some extent as well, although the resulting program size was not independent of sequence length. With sequences of length l = 100, the entropy statistics of "unsuccessful" sequences are μ′_3 = 112 bits and σ′_3 = 28 bits. Given an average sequence entropy of 146 bits, this constitutes an average compression factor of 1.3.

It may seem surprising that the average entropy of the induced programs is even below the entropy of the TM programs (transition tables). However, since not all rows of a transition table are guaranteed to be used when executing a program, the actual shortest representation will not contain unused rows, leading to a smaller program size than 20 or 33 bits. The most important result is that very short programs, with a size roughly around the Kolmogorov complexity, have indeed been found for most sequences.

4 Discussion

The present approach has shown that it is possible both to sensibly define a measure for partial progress toward AGI, by measuring the complexity level up to which all sequences can be induced, and to build an algorithm actually performing universal induction for most low-complexity sequences. Our demonstrator has been able to compress all sequences generated by two-state Turing machines and 80% of the sequences generated by three-state Turing machines.

The current demonstrator presents work in progress, and it is already fairly clear how to improve the algorithm such that the remaining 20% are also covered. For example, there is no unique partition of a sequence into a set of concatenated primitives. The way those partitions are selected should also be guided by compressibility considerations; e.g., partition subsets of equal length should have a higher prior chance of being analyzed further. Currently, the partition is implemented in a non-principled way, which is one of the reasons the algorithm runs into dead ends. Remarkably, all reasons for stagnation seem to be those aspects of the algorithm that are not yet guided by the compression principle. This observation leads to the conjecture that the further extension and generalization of the algorithm may not require any additional class of measures, but a "mere" persistent application of the compression principle.

One may object that the function primitives are hard-coded and may therefore constitute an obstacle to generalizability. However, those primitives can also be resolved into a combination of elementary operations; e.g., the incremental function can be constructed by adding a fixed number to the previous sequence element, and hence itself be represented by a function network. Therefore, it is all a matter of flexible application and organization of the very same function network and thus lies within the scope of the present approach.

The hope of this approach is that it may lead us on a path finally scaling up universal induction to practically significant levels. It would be nice to back up this hope with a time complexity measure of the present algorithm, which is unfortunately not available at present, since this is work in progress. Further, it cannot be excluded that a narrow algorithm is also able to solve all low-complexity problems. In fact, the present algorithm is narrow as well, since there are numerous implicit assumptions about the composition of the sequence: e.g. the concatenation of outputs of several functions, no possibility to represent dependencies within a sequence or regularities between different inputs, etc. Nevertheless, since we represent general programs without specific a priori restrictions, this setup seems to be general enough to tackle such questions, which will hopefully result in a scalable system.

References

1. Hutter H-Prize. http://prize.hutter1.net, accessed: 2015-05-17
2. Cover, T.M., Thomas, J.A.: Elements of information theory. John Wiley & Sons (2012)
3. Friedlander, D., Franklin, S.: LIDA and a theory of mind. In: Artificial General Intelligence, 2008: Proceedings of the First AGI Conference. vol. 171, p. 137. IOS Press (2008)
4. Hernandez-Orallo, J.: Beyond the Turing test. Journal of Logic, Language and Information 9(4), 447–466 (2000)
5. Hutter, M.: Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin (2005), http://www.hutter1.net/ai/uaibook.htm, 300 pages
6. Kitzelmann, E.: Inductive programming: A survey of program synthesis techniques. In: Approaches and Applications of Inductive Programming, pp. 50–73. Springer (2010)
7. Kurzweil, R.: The singularity is near: When humans transcend biology. Penguin (2005)
8. Legg, S., Hutter, M.: A collection of definitions of intelligence. In: Goertzel, B., Wang, P. (eds.) Advances in Artificial General Intelligence: Concepts, Architectures and Algorithms. Frontiers in Artificial Intelligence and Applications, vol. 157, pp. 17–24. IOS Press, Amsterdam, NL (2007), http://arxiv.org/abs/0706.3639
9. Legg, S., Veness, J.: An approximation of the universal intelligence measure. In: Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence, pp. 236–249. Springer (2013)
10. Levin, L.A.: Universal sequential search problems. Problemy Peredachi Informatsii 9(3), 115–116 (1973)
11. Li, M., Vitányi, P.M.: An introduction to Kolmogorov complexity and its applications. Springer (2009)
12. Looks, M., Goertzel, B.: Program representation for general intelligence. In: Proc. of AGI. vol. 9 (2009)
13. Mahoney, M.V.: Text compression as a test for artificial intelligence. In: AAAI/IAAI. p. 970 (1999)
14. Potapov, A., Rodionov, S.: Universal induction with varying sets of combinators. In: Artificial General Intelligence, pp. 88–97. Springer (2013)
15. Solomonoff, R.J.: A formal theory of inductive inference. Part I. Information and Control 7(1), 1–22 (1964)
16. Veness, J., Ng, K.S., Hutter, M., Uther, W., Silver, D.: A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research 40(1), 95–142 (2011)