
Solving the 15-Puzzle Game

Using Local Value-Iteration

Bastian Bischoff1, Duy Nguyen-Tuong1, Heiner Markert1, and Alois Knoll2

1Robert Bosch GmbH, Corporate Research,

Robert-Bosch-Str. 2, 71701 Schwieberdingen, Germany

2TU Munich, Robotics and Embedded Systems,

Boltzmannstr. 3, 85748 Garching near Munich, Germany

Abstract. The 15-puzzle is a well-known game whose history stretches back to the 1870s. The goal of the game is to arrange a shuffled set of 15 numbered tiles in ascending order, by sliding tiles into the one vacant space on a 4×4 grid. In this paper, we study how Reinforcement Learning can be employed to solve the 15-puzzle problem. Mathematically, this problem can be described as a Markov Decision Process with the states being puzzle configurations. This leads to a large state space with approximately 10^13 elements. In order to deal with this large state space, we present a local variation of the Value-Iteration approach appropriate for solving the 15-puzzle starting from arbitrary configurations. Furthermore, we provide a theoretical analysis of the proposed strategy for solving the 15-puzzle problem. The feasibility of the approach and the plausibility of the analysis are additionally shown by simulation results.

1 Introduction

The 15-puzzle is a sliding puzzle invented by Samuel Loyd in the 1870s [4]. In this game, 15 tiles are arranged on a 4×4 grid with one vacant space. The tiles are numbered from 1 to 15. Figure 1 (left) shows a possible configuration of the puzzle. The state of the puzzle can be changed by sliding one of the numbered tiles adjacent to the vacant space into the vacant space. The action is denoted by the direction in which the numbered tile is moved. For each state, the set of possible actions A_s is therefore a subset of {up, down, left, right}. The goal is to bring the puzzle to the final state shown in Fig. 1 (right) by applying a sequence of actions. A puzzle configuration is considered solvable if there exists a sequence of actions which leads to the goal configuration. This holds true for exactly half of all possible puzzle configurations [10].

Solving the 15-puzzle problem has been thoroughly investigated in the optimization community [5]. Search algorithms, such as A* and IDA* [6], can be employed to find feasible solutions. The major difficulty of the game is the size of the state space: the 15-puzzle has 16! ≈ 2·10^13 different states. Even optimal solutions may take up to 80 moves to solve the puzzle. Because of the huge size of the state space, a complete search is difficult, and the 15-puzzle problem is one of the most popular benchmarks for heuristic search algorithms [5].

In this paper, we study how Reinforcement Learning (RL) [7, 13] can be employed to solve the 15-puzzle problem. Thus, we seek to learn a general strategy for finding solutions for all solvable puzzle configurations. It is well known that RL suffers from the problem of large state spaces [7]. It is therefore difficult to straightforwardly employ state-of-the-art RL algorithms for solving the 15-puzzle problem. However, in [1] Pizlo and Li analysed how humans deal with this enormous state space of the 15-puzzle. Their study shows that humans try to solve the puzzle locally, tile by tile. While this approach does not always provide optimal solutions, it significantly reduces the complexity of the problem. Inspired by the work in [1], we propose a local variation of the Value-Iteration approach appropriate for solving the 15-puzzle problem. Furthermore, we provide some theoretical analysis of the proposed strategy in this study.

The remainder of the paper is organized as follows: in the next section, we introduce a formal notation for the 15-puzzle problem. In Section 3, we briefly review the basic idea behind RL. In Section 4, we introduce the local Value-Iteration approach to solve the puzzle, followed by a section with an analysis of the puzzle and the proposed approach. In Section 6, simulation results are provided supporting the analysis and showing the feasibility of the proposed approach. A conclusion is given in Section 7.

2 Notation

Let Π_16 be the set of all possible permutations of the set {1, 2, …, 16}. If we map the vacant space of the 15-puzzle to 16 and every tile to its corresponding number, we can interpret a given puzzle configuration as a permutation π ∈ Π_16 by reading the tile numbers row-wise. For example, for the left puzzle in Fig. 1 we have π = (15, 10, 16, 13, 11, 4, 1, 12, 3, 7, 9, 8, 2, 14, 6, 5). As shown in [10], exactly half of the permutations correspond to solvable puzzles, i.e. puzzles that can be brought to ascending order by a sequence of sliding actions ([10] also provides a simple criterion to check for solvability). We define the state space of the 15-puzzle as the set S_15 ⊂ Π_16 of all solvable puzzles. The tile on position i, i.e. the i-th entry of the permutation s ∈ S_15, is denoted by s(i). The position of tile i is written as s^{-1}(i) (note that s^{-1}(16) gives the position of the vacant space). This implies s(4) = 13 and s^{-1}(7) = 10 for the example given in Fig. 1 on the left. The goal state of the 15-puzzle is defined as the state s_goal ∈ S_15 with s_goal(i) = i for all i = 1, …, 16, as shown in Fig. 1 on the right. Depending on the configuration of the puzzle, the set of possible actions is a subset of {up, down, left, right}. In the permutation s, each action corresponds to a transposition, i.e. a switch of two elements of the permutation. Formally, a transposition is defined as a permutation τ with τ(i) ≠ i for exactly two elements i, j, and is therefore denoted by τ = (i, j). Thus, the application of each action left, right, up, down (which describes the movement direction of the numbered tile into the vacant space) corresponds to the concatenation of the state permutation s with a corresponding transposition τ, i.e. s ∘ τ. Given a state s, we define the transpositions corresponding to each action and provide conditions under which each action is applicable.

Fig. 1. A possible start state (left) and the goal state (right) of the 15-puzzle. Left, row-wise: (15, 10, _, 13 / 11, 4, 1, 12 / 3, 7, 9, 8 / 2, 14, 6, 5); right: (1, 2, 3, 4 / 5, 6, 7, 8 / 9, 10, 11, 12 / 13, 14, 15, _).

– left: τ_ℓ = (s^{-1}(16), s^{-1}(16) + 1), applicable iff s^{-1}(16) ≢ 0 mod 4
– right: τ_r = (s^{-1}(16), s^{-1}(16) − 1), applicable iff s^{-1}(16) ≢ 1 mod 4
– up: τ_u = (s^{-1}(16), s^{-1}(16) + 4), applicable iff s^{-1}(16) ≤ 12
– down: τ_d = (s^{-1}(16), s^{-1}(16) − 4), applicable iff s^{-1}(16) ≥ 5

The elements of the set of possible actions a ∈ A_s are the transpositions τ that perform the corresponding switch of elements. Instead of writing s′ = s ∘ τ, the action corresponding to τ can also be given as f(s, a) = s′, where f(·) is the so-called transition function or dynamics model.
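To make the notation concrete, the following minimal sketch encodes a state as a 16-tuple over a 0-indexed grid, with an action given by the cell q of the tile that slides into the vacant space (equivalent to the transposition τ = (s^{-1}(16), q)). The function and variable names are our own illustration, not part of the paper.

# A state s is a 16-tuple: s[p] is the tile on cell p (0-indexed), 16 marks the blank.
GOAL = tuple(range(1, 17))

def neighbours(p):
    """Cells adjacent to cell p on the 4x4 grid."""
    r, c = divmod(p, 4)
    return [4 * (r + dr) + (c + dc)
            for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0))
            if 0 <= r + dr < 4 and 0 <= c + dc < 4]

def applicable_actions(s):
    """An action is the cell q of the tile sliding into the blank (at most four)."""
    return neighbours(s.index(16))

def apply_action(s, q):
    """The dynamics f(s, a): swap the blank with the tile on cell q."""
    b = s.index(16)
    t = list(s)
    t[b], t[q] = t[q], t[b]
    return tuple(t)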

3 Reinforcement Learning: A Brief Review

In this section, we briefly introduce RL; see [7, 13] for detailed introductions. Reinforcement Learning is a subfield of machine learning that deals with goal-directed learning. The idea is to use feedback from the environment about the desirability of states, and to let the agent learn a task solution while optimizing the overall gathered reward.

The state of a learning agent is given by s ∈ S; in each state, actions a ∈ A_s ⊆ A can be applied. A reward function r: S → ℝ defines the desirability of states and hence encodes the learning goal. The function f: S × A → S describes the dynamics of the considered system. The goal of the learning agent is to find a policy π: S → A which maximizes the long-term reward ∑_{i=0}^∞ γ^i r(s_i). Here, 0 < γ < 1 is a discount factor, s_0 is the start state, and s_i = f(s_{i−1}, π(s_{i−1})) holds. Many Reinforcement Learning algorithms are based on a value-function V^π(s_0) = ∑_{i=0}^∞ γ^i r(s_i) with s_i = f(s_{i−1}, π(s_{i−1})), which encodes the long-term reward when starting in state s_0 and acting according to policy π. The optimal value-function V is defined as V(s) = max_π {V^π(s)}; the optimal policy can be derived by π(s) = arg max_{a∈A_s} {V(s′) | s′ = f(s, a)}. The optimal value-function V(·) is not known in advance. In the literature, different ways to estimate V(·) can be found. A well-studied approach for the estimation of V(·) is Value-Iteration, which is used in various applications [14, 15]. It is given by the iterative application of

∀s ∈ S: V_{t+1}(s) ← r(s) + γ · max_{a∈A_s} {V_t(s′) | s′ = f(s, a)}

until convergence of V_t. It is guaranteed that V_t converges to the optimal value-function V for t → ∞ [7].

In the context of the 15-puzzle problem, the state space S corresponds to the puzzle configurations S_15, and an action a ∈ A_s is the sliding of a numbered tile into the vacant space, as explained in the previous section. The reward function is defined as

r(s) = 1 if s = s_goal, and r(s) = 0 otherwise.

With these definitions, Value-Iteration can be performed. The policy π(s): s ↦ a resulting from the learned value-function V(·) returns for every puzzle configuration s the sliding action a for reaching the goal state.
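As a minimal sketch of the update above, the following tabular Value-Iteration works for any finite MDP given as an explicit state list, an action set A_s, dynamics f and reward r; the interface and names are our own choice, not prescribed by the paper.

def value_iteration(states, actions, f, r, gamma=0.9, tol=1e-9):
    """Iterate V(s) <- r(s) + gamma * max_{a in A_s} V(f(s, a)) until convergence."""
    V = {s: 0.0 for s in states}
    delta = tol
    while delta >= tol:
        delta = 0.0
        for s in states:
            best = max((V[f(s, a)] for a in actions(s)), default=0.0)
            new = r(s) + gamma * best
            delta = max(delta, abs(new - V[s]))
            V[s] = new
    return V

def greedy_policy(V, actions, f):
    """Derive pi(s) = argmax_a V(f(s, a)) from a converged value-function."""
    return lambda s: max(actions(s), key=lambda a: V[f(s, a)])

Running this on all of S_15 directly is exactly what the following paragraph rules out; the same routine becomes practical once it is applied to the small local state spaces of Section 4.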

While this approach theoretically gives us optimal solutions to all solvable puzzles, it is practically intractable. The reason is the size of the state space S_15 with |S_15| ≈ 10^13. Value-Iteration iterates over all possible states, which is infeasible at this scale. In the next section, we describe how the computational effort can be reduced significantly by decomposing the problem into tractable subproblems.

4 Local Value-Iteration for Solving the 15-Puzzle

The work by Pizlo and Li [1] analysed how humans deal with the enormous state space of the 15-puzzle. The study shows that humans try to solve the puzzle locally, tile by tile. While this approach does not provide optimal solutions, it significantly reduces the complexity of the problem [1]. The basic principle behind the solution strategy is to hierarchically divide the global problem into small local problems (e.g. move each tile to its correct position sequentially). As we will see later, it is not always sufficient to consider only a single tile; in some scenarios, multiple tiles have to be considered at once (see Sect. 5). Inspired by the work in [1], we present an RL approach using local state space information, similar to the ideas given in [2, 3]. Thus, when learning a policy to move a given tile to its correct position, we consider only a limited number of possible puzzle configurations. Furthermore, the size of the local region can be set adaptively. After solving the local subproblems, the complete solution is obtained by sequential application of the local solutions.

A local subproblem can be defined by a set G = {i_1, …, i_k} ⊆ {1, …, 15} of k tiles which need to be moved to the distinct positions i_1, …, i_k without moving the tiles R = {j_1, …, j_ℓ}. For example, for G = {3} and R = {1, 2}, the G,R-subproblem is to bring tile 3 to its correct position without moving tiles 1 and 2. Here, R ∩ G = ∅ must hold, the free space must not be part of R, and we restrict the action sets A_s according to R (to prevent tiles from the set R being moved).

Algorithm 1 Value-Iteration to Solve the Local G,R-problem

function VALUE_ITERATION(S^{G,R}_15)
    V^{G,R}_0 ← 0, t ← 0                                   ▷ initial value-function
    repeat
        for all s ∈ S^{G,R}_15 do                           ▷ apply the Value-Iteration update step on all states
            V^{G,R}_{t+1}(s) ← r(s) + γ max_{a∈A_s} {V^{G,R}_t(s′) | s′ = f(s, a), s′ ∈ S^{G,R}_15}
        end for
        t ← t + 1
    until V^{G,R}_t has converged
    π^{G,R}(·) ← arg max_{a∈A_s} {V^{G,R}(s′) | s′ = f(·, a)}  ▷ derive the optimal policy π^{G,R}: S^{G,R}_15 → A
    return V^{G,R}_t, π^{G,R}
end function
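A local state only needs to track the cells of the tracked tiles and the blank, so Algorithm 1 can be run with the generic value_iteration sketch from Section 3. The following is a minimal construction of the local model, with names of our own choosing; neighbours() is the grid helper from Section 2. For simplicity it enumerates all placements of the tracked tiles, which is slightly more than the paper's S^{G,R}_15 (that set keeps only configurations arising from solvable puzzles); states from which the local goal is unreachable simply keep V = 0, which is exactly the test used by Algorithm 2 below.

import itertools

def local_states(G, R):
    """Local states: a tuple of the cells of the tiles in G plus the cell of the blank."""
    blocked = {t - 1 for t in R}          # solved tile t sits on its goal cell t-1
    free = [p for p in range(16) if p not in blocked]
    return list(itertools.permutations(free, len(G) + 1))

def local_actions(R):
    """Allowed actions: slide any neighbour of the blank that is not a fixed R-cell."""
    blocked = {t - 1 for t in R}
    def actions(s):
        return [q for q in neighbours(s[-1]) if q not in blocked]
    return actions

def local_f(s, q):
    """Local dynamics: the tile on cell q moves onto the blank's cell."""
    *gpos, blank = s
    return tuple(blank if p == q else p for p in gpos) + (q,)

def local_reward(G):
    """r(s) = 1 iff every tile of G is on its goal cell."""
    target = tuple(t - 1 for t in G)
    return lambda s: 1.0 if s[:len(G)] == target else 0.0

def train(G, R, gamma=0.9):
    """Algorithm 1 for the G,R-subproblem: the value-function over the local states."""
    return value_iteration(local_states(G, R), local_actions(R),
                           local_f, local_reward(G), gamma)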

When moving the tiles in the local set G to their correct positions, we do not need to keep track of the positions of all other tiles of the puzzle. Thus, the state of the G,R-subproblem has the form (i_1, …, i_k, j), i.e. the positions of the k tiles in G together with the position j of the free space. The number of elements of the state space S^{G,R}_15 is given by |S^{G,R}_15| = ½ (16 − ℓ)(16 − ℓ − 1) ⋯ (16 − ℓ − k) with ℓ = |R|. Note that the factor ½ is due to the fact that only half of all permutations are solvable [10]. In general, the state space S^{G,R}_15 is significantly smaller than S_15. For example, the state space for moving tiles 6 and 7 to their correct positions without moving tiles {1, …, 5} has only ½ · 11 · 10 · 9 = 495 elements (compared to about 10^13 elements in S_15).
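The size formula is easy to check programmatically; the helper below is our own illustration of it.

def local_space_size(k, l):
    """|S^{G,R}_15| = (16-l)(16-l-1)...(16-l-k) / 2 for |G| = k, |R| = l."""
    n = 1
    for j in range(k + 1):
        n *= 16 - l - j
    return n // 2

assert local_space_size(2, 5) == 495      # G = {6,7}, R = {1,...,5}, as in the text
assert local_space_size(5, 8) == 10080    # the worst case appearing in Theorem 1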

As the G,R-subproblems have significantly smaller state spaces, we can employ state-of-the-art RL algorithms, such as the Value-Iteration shown in Algorithm 1, for learning a policy to move a given tile to its desired position. Here, f is the dynamics function applied to elements s ∈ S^{G,R}_15, and the reward function is defined as given in Sect. 3. The algorithm returns the value-function V^{G,R} and the optimal local policy π^{G,R}. To solve the G,R-subproblem for a given puzzle configuration s, we subsequently apply actions a = π^{G,R}(s) until all tiles in G are at their correct positions.

Until now, we have described how Value-Iteration can be employed on local G,R-subproblems for learning an optimal local policy π^{G,R}. In the next step, we discuss an approach to determine the sets G and R for a given puzzle configuration. A naive approach would be to successively set G = {i} for i = 1, …, 15 and R = {1, …, i − 1}. That would mean solving the puzzle tile by tile, while fixing all lower-numbered tiles. However, this simple approach does not work for many puzzles, as we will show in the next section. Thus, we need to adaptively vary the local region G and possibly consider several tiles at once. A systematic approach is to successively move tiles from the set R to G until a solution is found. Thus, if no solution is found for G = {i} and R = {1, …, i − 1}, we first set G = G ∪ {i − 1} and R = R \ {i − 1}, and continue to increase G and decrease R by setting G = G ∪ {max(R)} and R = R \ {max(R)}. This procedure is repeated until a solution is found.

Algorithm 2 Local Value-Iteration for Solving the 15-Puzzle

function SOLVE_PUZZLE(start_state)
    s ← start_state
    for i ← 1, …, 15 do
        G_i ← {i}, R_i ← {1, …, i − 1}                  ▷ successively move tiles to their correct positions
        V^{G_i,R_i}, π^{G_i,R_i} ← VALUE_ITERATION(S^{G_i,R_i}_15)
        while V^{G_i,R_i}(s) = 0 do                      ▷ holds iff the G_i,R_i-problem cannot be solved for s
            G_i ← G_i ∪ {max(R_i)}                       ▷ move the highest-numbered tile from R_i to G_i
            R_i ← R_i \ {max(R_i)}
            V^{G_i,R_i}, π^{G_i,R_i} ← VALUE_ITERATION(S^{G_i,R_i}_15)
        end while
        while ∃ j ∈ G_i: s(j) ≠ j do                     ▷ while a tile of G_i is misplaced, solve the G_i,R_i-part
            a_best ← π^{G_i,R_i}(s)                      ▷ the policy returns the action to solve G_i,R_i
            s ← f(s, a_best)                             ▷ apply the sliding action
        end while
    end for
end function

If the puzzle is solvable, this approach is guaranteed to find a solution, because we finally end up with R = ∅. This idea of increasing the local region gives us the following technique to solve an arbitrary puzzle: given a solvable starting puzzle configuration s, we successively try to solve G = {i} for i = 1, …, 15 and R = {1, …, i − 1} while employing Value-Iteration on S^{G,R}_15, as given in Algorithm 1. We can move the tiles in G to their correct positions without moving tiles in the set R if and only if V^{G,R}(s) ≠ 0. In this case, we simply apply the corresponding policy π^{G,R} to bring the tiles to their correct positions. Otherwise, we increase the set G and decrease the set R, i.e. G = G ∪ {max(R)} and R = R \ {max(R)}. Subsequently, we go back to the Value-Iteration step with the new sets G, R. The resulting local Value-Iteration approach is summarized in Algorithm 2.
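Putting the pieces together, a compact sketch of Algorithm 2 on top of the earlier helpers might look as follows. The greedy step picks, among the currently allowed slides, the successor whose local value is largest, which amounts to the policy π^{G,R} derived from the converged V^{G,R}; project() is our own glue between global and local states.

def project(s, G):
    """Local view of a global state: cells of the G tiles plus the cell of the blank."""
    return tuple(s.index(t) for t in G) + (s.index(16),)

def solve_puzzle(start, gamma=0.9):
    s, moves = start, []
    for i in range(1, 16):
        G, R = [i], list(range(1, i))
        V = train(G, R, gamma)                 # Algorithm 1 on the G,R-subproblem
        while V[project(s, G)] == 0.0:         # V(s) = 0 iff G,R is unsolvable from s
            G.insert(0, R.pop())               # grow G by max(R), shrink R
            V = train(G, R, gamma)
        blocked = {t - 1 for t in R}
        while any(s[t - 1] != t for t in G):   # follow the local greedy policy
            candidates = [q for q in applicable_actions(s) if q not in blocked]
            q = max(candidates, key=lambda c: V[project(apply_action(s, c), G)])
            s = apply_action(s, q)
            moves.append(q)
    return s, moves

In practice the value-functions should be cached per (G, R)-subproblem, since the same subproblems recur across puzzles; this is what keeps the training time reported in Section 6 at a few minutes.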

5 Analysis of Local Value-Iteration for the 15-Puzzle

As the proposed RL method is based on the well-known Value-Iteration [7], it is guaranteed to find a solution for any solvable 15-puzzle configuration. However, the proposed local approach at times involves the consideration of several tiles at once and, thus, can increase the computational complexity. In this section, we analyse the proposed algorithm and develop a bound for the maximum number of tiles that are considered at each step. Furthermore, we provide the maximal cardinality of the involved state spaces and give bounds on the maximum number of actions needed to solve a given puzzle configuration.

Definition 1. Consider a random puzzle s ∈ S_15. Let G_i be the set used in Local Value-Iteration to bring tile i to its correct position (see Algorithm 2). The probability that |G_i| = j holds is denoted by τ^j_i. The subset of puzzles S_k ⊆ S_15 is defined by s ∈ S_k if and only if max(|G_1|, …, |G_15|) = k holds. Finally, Φ(S_k) gives an upper bound on the number of actions that are necessary to solve any puzzle s ∈ S_k using Local Value-Iteration.

In Definition 1, the factor τ^j_i describes the probability that the j tiles {i, …, i − j + 1} need to be considered to bring tile i to its correct position, given a random puzzle. For example, tile 2 can always be moved to position 2 without moving tile 1. Hence, τ^1_2 = 1 and τ^ℓ_2 = 0 for all ℓ > 1. As we will see later, tile 4 can be moved to its correct position without moving tile 3 in τ^1_4 = 1/12 of the cases. In the other cases, tile 3 needs to be moved in order to get tile 4 to the right position and, thus, τ^2_4 = 11/12. The proportions τ^1_i, τ^2_i, … always sum to 1. The classes S_k partition the set of all puzzle configurations. Informally, s ∈ S_k states that the puzzle s can be solved with Local Value-Iteration by considering at most k tiles at once.

Theorem 1. Given a puzzle s ∈ S_15, the state spaces involved in solving the puzzle s with Local Value-Iteration have at most 10080 elements. For i ∈ {1, 2, 3, 5, 6, 7, 9, 14, 15} it holds that τ^1_i = 1. For i ∈ {4, 8, 10, 11, 12, 13}, τ^j_i is given as

Tile i   τ^1_i       τ^2_i      τ^3_i    τ^4_i   τ^5_i   Max. state space size
4        1/12        11/12      –        –       –       |S^{{3,4},{1,2}}_15| = 1092
8        1/8         7/8        –        –       –       |S^{{7,8},{1,…,6}}_15| = 360
10       2100/2520   420/2520   –        –       –       |S^{{9,10},{1,…,8}}_15| = 168
11       216/360     72/360     72/360   –       –       |S^{{9,10,11},{1,…,8}}_15| = 840
12       15/60       –          30/60    15/60   –       |S^{{9,…,12},{1,…,8}}_15| = 3360
13       4/12        –          –        –       8/12    |S^{{9,…,13},{1,…,8}}_15| = 10080

Proof. In the following, we provide the calculation of the probabilities τ^j_i as well as the maximal cardinality of the state space involved for each tile i.

– Tiles 1, 2, 3, 5, 6, 7: It is easy to check that these tiles can be moved to their goal positions without moving lower-numbered tiles. This implies τ^1_i = 1 for all i ∈ {1, 2, 3, 5, 6, 7}. The state spaces considered have size |S^{{i},{1,…,i−1}}_15| = ½ (16 − i + 1)(16 − i), which is at most ½ · 16 · 15 = 120.

– Tiles 4, 8: In 1/13 (resp. 1/9) of all states, tile 4 (resp. 8) is already at its correct position. Tile 4 (8) can also be brought to its correct position if tile 4 (8) is at position 8 (12) and the vacant space is at position 4 (8). This applies to 1/13 · 1/12 (resp. 1/9 · 1/8) of all remaining puzzles with tiles 1 to 3 (resp. 1 to 7) fixed. For the rest of the cases, one can check that it is not possible to bring tile 4 (resp. 8) to its position without moving lower-numbered tiles. But it is sufficient to move only tile 3 (7): it is always possible to move tile 4 (8) below its correct position with the free space below tile 4 (8). The following sequence then brings tiles 3 and 4 (7 and 8) to their correct positions: Down, Down, Right, Up, Left, Up, Right, Down, Down, Left, Up. This implies

τ^1_4 = 1/13 + 1/13 · 1/12 = 1/12, τ^2_4 = 11/12, τ^1_8 = 1/9 + 1/9 · 1/8 = 1/8, τ^2_8 = 7/8.

The largest state space is S^{{3,4},{1,2}}_15 with |S^{{3,4},{1,2}}_15| = ½ · 14 · 13 · 12 = 1092.

– Tile 9: Tile 9 can be brought to its correct position by moving the vacant space circle-wise, hence τ^1_9 = 1 with |S^{{9},{1,…,8}}_15| = ½ · 8 · 7 = 28.

– Tiles 10, 11, 12, 13: If the puzzle is solved up to tile 9, the remaining number of solvable puzzles is ½ · 7! = 2520 (as mentioned in Sect. 2, only half of the states are solvable puzzles). We calculate the proportions τ by application of Algorithm 2 to all 2520 solvable puzzles.

– Tiles 14, 15: As can be checked, τ^1_14 = τ^1_15 = 1 holds; the state spaces have 3 resp. 2 elements.

This concludes the proof of Theorem 1. The theorem shows that Local Value-Iteration considers, in the worst case, a state space with 10080 elements, compared to about 10^13 elements for the original Value-Iteration. In the next step, we investigate the puzzle classes S_k introduced in Definition 1. Theorem 1 directly implies the following corollary.

Corollary 1. τ^j_i = 0 holds for all 1 ≤ i ≤ 15, j > 5. Hence S_k = ∅ for all k > 5, and S_1, S_2, S_3, S_4, S_5 partition S_15.

Thus, Local Value-Iteration considers at most 5 tiles at once for any given puzzle, and puzzles of a higher class S_k can be considered more difficult to solve. With the next theorem, we investigate how the puzzles in S_15 split up into the difficulty classes S_1 to S_5.

Theorem 2. Consider the classes S_1, S_2, …, S_5 introduced in Definition 1. Then |S_1| ≈ 0.04% |S_15|, |S_2| ≈ 6.62% |S_15|, |S_3| = 18.3% |S_15|, |S_4| = 8.3% |S_15|, and |S_5| = 66.6% |S_15| hold.

Proof. We compute the cardinalities using the proportions τ^j_i from Theorem 1:

|S_1| = τ^1_4 · τ^1_8 · τ^1_10 · τ^1_11 · τ^1_12 · τ^1_13 · |S_15| = 1/2304 |S_15| ≈ 0.04% |S_15|

|S_2| = [ τ^2_4 · (τ^1_8 + τ^2_8) · (τ^1_10 + τ^2_10) · (τ^1_11 + τ^2_11)
        + τ^1_4 · τ^2_8 · (τ^1_10 + τ^2_10) · (τ^1_11 + τ^2_11)
        + τ^1_4 · τ^1_8 · τ^2_10 · (τ^1_11 + τ^2_11)
        + τ^1_4 · τ^1_8 · τ^1_10 · τ^2_11 ] · τ^1_12 · τ^1_13 · |S_15|
      = 763/11520 |S_15| ≈ 6.62% |S_15|

|S_3| = [ τ^3_11 · (τ^1_12 + τ^3_12) + (τ^1_11 + τ^2_11) · τ^3_12 ] · τ^1_13 · |S_15| = 11/60 |S_15| = 18.3% |S_15|

|S_4| = τ^4_12 · τ^1_13 · |S_15| = 1/12 |S_15| = 8.3% |S_15|

|S_5| = τ^5_13 · |S_15| = 2/3 |S_15| = 66.6% |S_15|
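The computation above treats the per-tile proportions as independent, following the structure of the proof. The following verification script is our own addition; it checks that the stated fractions indeed follow from the table in Theorem 1 under that reading.

from fractions import Fraction as F

# tau[i][j] = probability that j tiles are needed to place tile i (Theorem 1)
tau = {4: {1: F(1, 12), 2: F(11, 12)},
       8: {1: F(1, 8), 2: F(7, 8)},
       10: {1: F(2100, 2520), 2: F(420, 2520)},
       11: {1: F(216, 360), 2: F(72, 360), 3: F(72, 360)},
       12: {1: F(15, 60), 3: F(30, 60), 4: F(15, 60)},
       13: {1: F(4, 12), 5: F(8, 12)}}

def p_max_le(m):
    """Probability that every tile needs at most m tiles at once."""
    prod = F(1)
    for probs in tau.values():
        prod *= sum(p for j, p in probs.items() if j <= m)
    return prod

fractions = [p_max_le(k) - p_max_le(k - 1) for k in range(1, 6)]
assert fractions == [F(1, 2304), F(763, 11520), F(11, 60), F(1, 12), F(2, 3)]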

According to Theorem 2, only 0.04% of all puzzles can be solved tile by tile without moving lower-numbered tiles. On the other hand, 66.6% of all puzzles involve a situation where 4 lower-numbered tiles need to be considered in addition to solve the puzzle. As already mentioned, the membership of a puzzle s ∈ S_15 in a class S_k describes its difficulty in terms of how many tiles need to be considered at once to solve the puzzle with Local Value-Iteration. Another possible measure for the difficulty of a puzzle is the number of sliding actions necessary to solve it. Our final step is therefore to give upper bounds Φ(S_k) on the maximal number of actions that need to be applied to solve any puzzle s ∈ S_k. As we will see, Φ(S_k) grows with k, meaning that puzzles from a higher class S_k also potentially need more actions to be solved.

Theorem 3. Given a puzzle s ∈ S_k, the maximal number of actions necessary to solve the puzzle s using Algorithm 2 is given by Φ(S_1) = 142, Φ(S_2) = 220, Φ(S_3) = 248, Φ(S_4) = 255, Φ(S_5) = 288.

Proof. We prove the statement by inspection of the value-function learned with Algorithm 1. An entry V(s) = ∑_{t=k}^∞ γ^t implies that k actions are necessary to reach the goal state from state s under an optimal policy. Let M_{G,R} denote the maximal number of actions necessary to solve G,R. The relevant subproblems G,R are given in Theorem 1. By application of Algorithm 1, the following values for M_{G,R} are obtained:

M_{{1},∅} = 21              M_{{2},{1}} = 17            M_{{3},{1,2}} = 20
M_{{4},{1,2,3}} = 1         M_{{4,3},{1,2}} = 32        M_{{5},{1,…,4}} = 17
M_{{6},{1,…,5}} = 13        M_{{7},{1,…,6}} = 18        M_{{8},{1,…,7}} = 1
M_{{8,7},{1,…,6}} = 29      M_{{9},{1,…,8}} = 15        M_{{10},{1,…,9}} = 9
M_{{10,9},{1,…,8}} = 20     M_{{11},{1,…,10}} = 6       M_{{11,10},{1,…,9}} = 14
M_{{11,10,9},{1,…,8}} = 23  M_{{12},{1,…,11}} = 1       M_{{12,11,10},{1,…,9}} = 20
M_{{12,…,9},{1,…,8}} = 27   M_{{13},{1,…,12}} = 1       M_{{13,…,9},{1,…,8}} = 34
M_{{14},{1,…,13}} = 1       M_{{15},{1,…,14}} = 1

Given these maximal numbers of actions to solve a subproblem G,R, we can now estimate the worst case, i.e. the maximal number of actions, to solve a puzzle s ∈ S_k. To this end, we compute

Φ(S_k) = ∑_{i=1}^{15} max { M_{{i},{1,…,i−1}}, …, M_{{i,…,i−k+1},{1,…,i−k}} },

where the max-operator considers only the entries M_{G,R} given above (other subproblems G,R are not relevant according to Theorem 1).
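For illustration, the bound is a straightforward sum of per-tile maxima; the snippet below (our own addition) recomputes Φ(S_1), …, Φ(S_5) from the M_{G,R} values listed above, with each subproblem keyed by its G-set and R implicit.

# M values from the proof above, keyed by the tiles in G (R = {1,...,min(G)-1}).
M = {(1,): 21, (2,): 17, (3,): 20, (4,): 1, (3, 4): 32, (5,): 17,
     (6,): 13, (7,): 18, (8,): 1, (7, 8): 29, (9,): 15, (10,): 9,
     (9, 10): 20, (11,): 6, (10, 11): 14, (9, 10, 11): 23, (12,): 1,
     (10, 11, 12): 20, (9, 10, 11, 12): 27, (13,): 1,
     (9, 10, 11, 12, 13): 34, (14,): 1, (15,): 1}

def phi(k):
    """Phi(S_k): sum over tiles i of the largest M over G = {i-j+1,...,i}, j <= k."""
    total = 0
    for i in range(1, 16):
        total += max(M[g] for j in range(1, k + 1)
                     if (g := tuple(range(i - j + 1, i + 1))) in M)
    return total

assert [phi(k) for k in range(1, 6)] == [142, 220, 248, 255, 288]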

6 Simulation Results

This section contains experimental results of applying Local Value-Iteration to the 15-puzzle. The experiments show the feasibility of the Local Value-Iteration approach and also support the analysis provided in the previous section.

We use the Local Value-Iteration Algorithm 2 to solve instances of the 15-puzzle. Here, subproblems G,R recur for different puzzles. Hence, we learn the value-function and the corresponding policy only once for any given subproblem G,R. On a standard desktop PC, the training for all involved G,R-problems took 155 seconds. After the training, arbitrary puzzles can be solved. We built a set S̃ of 100000 random puzzles; all puzzles s ∈ S̃ could successfully be solved with Algorithm 2, with an average of approximately 0.25 seconds to solve one puzzle.

In the analysis section, Theorem 2 states the proportions of the classes S_k relative to the set of all puzzles. These proportions can also be estimated experimentally: for each puzzle s ∈ S̃, we keep track of the maximum number of tiles k considered at once to solve the puzzle and subsequently sort it into the corresponding class S̃_k. The results after solving all 100000 random puzzles in S̃ are

|S̃_1| = 42 = 0.042% |S̃|, |S̃_2| = 6539 = 6.539% |S̃|, |S̃_3| = 18362 = 18.362% |S̃|, |S̃_4| = 8446 = 8.446% |S̃|, |S̃_5| = 66611 = 66.611% |S̃|.

These empirical proportions correspond to the values given in Theorem 2 up to errors of less than 0.15%. For each of the five classes S̃_k, we compute the average and maximum number of actions necessary to solve a puzzle. The results are given in Table 1 and correspond to the statements of Theorem 3. The results show that as the number of tiles which need to be considered increases, the number of actions to solve the puzzle also increases.

Table 1. We solved 100000 puzzles with Local Value-Iteration and classified them into the classes S_k. The table gives the average and maximum number of actions necessary to solve puzzles of the respective classes.

                      S_1     S_2      S_3      S_4      S_5      overall
average # actions     68.85   100.68   112.16   122.92   128.71   123.35
maximum # actions     107     156      165      174      202      202

So far, we start with one tile, i.e. G_i = {i}, and increase the local region if the subproblem cannot be solved. While Value-Iteration provides optimal solutions for the subproblems, the overall solution is in general not optimal (with respect to the number of actions necessary to solve a puzzle). On the other hand, solving the complete puzzle at once, i.e. G = {1, …, 15}, would give us the optimal solution, but is computationally intractable. In the next step, we vary between those two extremes and increase the initial size of the local region to find better solutions, in terms of fewer actions necessary to solve the puzzle, while accepting higher computational effort. For an initial local size ℓ, we define G_i = {ℓi − ℓ + 1, …, ℓi}. If we increase, for example, the size to ℓ = 2, this gives us G_1 = {1, 2}, G_2 = {3, 4}, G_3 = {5, 6} and so on. We solved the 100000 random puzzles in S̃ again, now with initial sizes from 1 to 5. Figure 2 shows on the left that the average as well as the maximal number of actions to solve a puzzle declines, as expected, when we increase the region size. The cost of this improvement is shown in Fig. 2 on the right: the computation time grows exponentially. The detailed results can be found in Table 2.

Fig. 2. The initial size of the local region G for Local Value-Iteration is varied from 1, i.e. initial G_i = {i}, up to 5, i.e. initial G_i = {5i − 4, …, 5i}. The left plot shows the average as well as the maximal number of actions to solve 100000 random puzzles (y-axis: # of actions to solve puzzle, x-axis: initial size of local region). The right plot shows the necessary overall training time in minutes for each initial local size.

Table 2. We adapted the initial local size from 1, i.e. G_i = {i} for i = 1, …, 15, up to 5, i.e. G_i = {5i − 4, …, 5i} for i = 1, …, 3. We solved 100000 puzzles with each initial local size; the table gives the average and maximum number of actions necessary to solve the puzzles. The last row states the necessary training time on a standard desktop PC.

initial size of local region   1        2       3       4        5
average # actions              123.35   99.36   93.69   78.07    73.82
maximum # actions              202      150     136     115      107
training time in minutes       2.58     7.10    57.87   109.03   312.33

7 Conclusion

In this study, we investigated the possibility of using RL to solve the popular 15-puzzle game. Due to the large state space of the problem, it is difficult to straightforwardly employ state-of-the-art RL algorithms. In order to deal with this large state space, we proposed a local variation of the well-known Value-Iteration appropriate for solving the 15-puzzle problem. Our algorithm is inspired by the insight that humans solve the 15-puzzle game locally, by sequentially moving tiles to their correct positions. Furthermore, we provided a theoretical analysis showing the feasibility of the proposed approach. The plausibility of the analysis is supported by several simulation results.

References

1. Pizlo, Z., Li, Z.: Solving Combinatorial Problems: The 15-Puzzle. Memory & Cognition 33 (2005) 1069–1084
2. Borga, M.: Reinforcement Learning Using Local Adaptive Models. ISY, Linköping University, ISBN 91-7871-590-3, LiU-Tek-Lic-1995:39 (1995)
3. Zhang, W., Zhang, N. L.: Restricted Value Iteration: Theory and Algorithms. Journal of Artificial Intelligence Research 23 (2005) 123–165
4. Archer, A. F.: A Modern Treatment of the 15 Puzzle. American Mathematical Monthly 106 (1999) 793–799
5. Burns, E. A., Hatem, M. J., Ruml, W.: Implementing Fast Heuristic Search Code. Symposium on Combinatorial Search (2012)
6. Korf, R. E.: Depth-First Iterative-Deepening: An Optimal Admissible Tree Search. Artificial Intelligence 27 (1985) 97–109
7. Sutton, R. S., Barto, A. G.: Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). The MIT Press (1998)
8. Surynek, P., Michalik, P.: An Improved Sub-optimal Algorithm for Solving the (N²−1)-Puzzle. Institute for Theoretical Computer Science, Charles University, Prague (2011)
9. Ratner, D., Warmuth, M. K.: The (n²−1)-Puzzle and Related Relocation Problems. Journal of Symbolic Computation 10 (1990) 111–138
10. Hordern, E.: Sliding Piece Puzzles. Oxford University Press (1986)
11. Parberry, I.: A Real-Time Algorithm for the (n²−1)-Puzzle. Information Processing Letters 56 (1995) 23–28
12. Korf, R. E., Schultze, P.: Large-Scale Parallel Breadth-First Search. Proceedings of the 20th National Conference on Artificial Intelligence 3 (2005) 1380–1385
13. Wiering, M., van Otterlo, M.: Reinforcement Learning: State-of-the-Art. Springer (2012)
14. Porta, J. M., Spaan, M. T. J., Vlassis, N.: Robot Planning in Partially Observable Continuous Domains. Robotics: Science and Systems, MIT Press (2005) 217–224
15. Asadian, A., Kermani, M. R., Patel, R. V.: Accelerated Needle Steering Using Partitioned Value Iteration. American Control Conference (ACC 2010), Baltimore, MD, USA (2010)