Solving the 15-Puzzle Game
Using Local Value-Iteration
Bastian Bischoff1, Duy Nguyen-Tuong1, Heiner Markert1, and Alois Knoll2
1Robert Bosch GmbH, Corporate Research,
Robert-Bosch-Str. 2, 71701 Schwieberdingen, Germany
2TU Munich, Robotics and Embedded Systems,
Boltzmannstr. 3, 85748 Garching at Munich, Germany
Abstract. The 15-puzzle is a well-known game with a long history stretching back to the 1870s. The goal of the game is to arrange a shuffled set of 15 numbered tiles in ascending order by sliding tiles into the one vacant space on a 4x4 grid. In this paper, we study how Reinforcement Learning can be employed to solve the 15-puzzle problem. Mathematically, this problem can be described as a Markov Decision Process with the states being puzzle configurations. This leads to a large state space with approximately $10^{13}$ elements. In order to deal with this large state space, we present a local variation of the Value-Iteration approach appropriate for solving the 15-puzzle starting from arbitrary configurations. Furthermore, we provide a theoretical analysis of the proposed strategy for solving the 15-puzzle problem. The feasibility of the approach and the plausibility of the analysis are additionally shown by simulation results.
1 Introduction
The 15-puzzle is a sliding puzzle invented by Samuel Loyd in the 1870s [4]. In this game, 15 tiles are arranged on a 4x4 grid with one vacant space. The tiles are numbered from 1 to 15. Figure 1 (left) shows a possible configuration of the puzzle. The state of the puzzle can be changed by sliding one of the numbered tiles adjacent to the vacant space into the vacant space. The action is denoted by the direction in which the numbered tile is moved. For each state, the set of possible actions $A_s$ is therefore a subset of $\{up, down, left, right\}$. The goal is to bring the puzzle to the final state shown in Fig. 1 (right) by applying a sequence of actions. A puzzle configuration is considered solvable if there exists a sequence of actions which leads to the goal configuration. This holds true for exactly half of all possible puzzle configurations [10].
Solving the 15-puzzle problem has been thoroughly investigated in the optimization community [5]. Search algorithms, such as A* and IDA* [6], can be employed to find feasible solutions. The major difficulty of the game is the size of the state space: the 15-puzzle has $16! \approx 2 \cdot 10^{13}$ different configurations. Even optimal solutions may take up to 80 moves to solve the puzzle. Because of the huge size of the state space, a complete search is difficult, and the 15-puzzle problem is one of the most popular benchmarks for heuristic search algorithms [5].
In this paper, we study how Reinforcement Learning (RL) [7, 13] can be employed to solve the 15-puzzle problem. Thus, we seek to learn a general strategy for finding solutions for all solvable puzzle configurations. It is well known that RL suffers from the problem of large state spaces [7]. It is therefore difficult to straightforwardly employ state-of-the-art RL algorithms for solving the 15-puzzle problem. However, in [1] Pizlo and Li analysed how humans deal with the enormous state space of the 15-puzzle. Their study shows that humans try to solve the puzzle locally, tile by tile. While this approach does not always provide optimal solutions, it significantly reduces the complexity of the problem. Inspired by the work in [1], we propose a local variation of the Value-Iteration approach appropriate for solving the 15-puzzle problem. Furthermore, we provide some theoretical analysis of the proposed strategy in this study.
The remainder of the paper is organized as follows: in the next section, we
introduce a formal notation for the 15-puzzle problem. In Section 3, we briefly
review the basic idea behind RL. In Section 4, we introduce the local Value-
Iteration approach to solve the puzzle, followed by a section with an analysis
of the puzzle and the proposed approach. In Section 6, simulation results are
provided supporting the analysis and showing the feasibility of the proposed
approach. A conclusion is given in Section 7.
2 Notation
Let $\Pi_{16}$ be the set of all possible permutations of the set $\{1,2,\ldots,16\}$. If we map the vacant space of the 15-puzzle to 16 and each tile to its corresponding number, we can interpret a given puzzle configuration as a permutation $\pi \in \Pi_{16}$ by reading the tile numbers row-wise. For example, for the left puzzle in Fig. 1 we have $\pi = (15,10,16,13,11,4,1,12,3,7,9,8,2,14,6,5)$. As shown in [10], exactly half of the permutations correspond to solvable puzzles, i.e. puzzles that can be brought to ascending order by a sequence of sliding actions ([10] also provides a simple criterion to check for solvability). We define the state space of the 15-puzzle as the set $S_{15} \subset \Pi_{16}$ of all solvable puzzles. The tile on position $i$, i.e. the $i$-th entry of the permutation $s \in S_{15}$, is denoted by $s(i)$. The position of tile $i$ is written as $s^{-1}(i)$ (note that $s^{-1}(16)$ gives the position of the vacant space). This implies $s(4) = 13$ and $s^{-1}(7) = 10$ for the example given in Fig. 1 on the left. The goal state of the 15-puzzle is defined as the state $s_{goal} \in S_{15}$ with $s_{goal}(i) = i$ for all $i = 1,\ldots,16$, as shown in Fig. 1 on the right. Depending on the configuration of the puzzle, the possible action set is a subset of $\{up, down, left, right\}$. In the permutation $s$, each action corresponds to a transposition, i.e. a switch of two elements of the permutation. Formally, a transposition is a permutation $\tau$ with $\tau(i) \neq i$ for exactly two elements $i, j$ and is therefore denoted by $\tau = (i, j)$. Thus, the application of each action left, right, up, down (which describes the movement direction of the numbered tile into the vacant space) corresponds to the concatenation of the state permutation $s$ with a corresponding transposition $\tau$, i.e. $s \circ \tau$. Given a state $s$, we define the transpositions corresponding to each action and provide conditions when each action is applicable.
[Figure 1 shows two 4x4 grids. Left, a possible start state (read row-wise): 15 10 _ 13 / 11 4 1 12 / 3 7 9 8 / 2 14 6 5. Right, the goal state: 1 2 3 4 / 5 6 7 8 / 9 10 11 12 / 13 14 15 _.]
Fig. 1. A possible start state (left) and the goal state (right) of the 15-puzzle.
left: $\tau_\ell = (s^{-1}(16),\, s^{-1}(16) + 1)$, applicable iff $s^{-1}(16) \not\equiv 0 \pmod 4$
right: $\tau_r = (s^{-1}(16),\, s^{-1}(16) - 1)$, applicable iff $s^{-1}(16) \not\equiv 1 \pmod 4$
up: $\tau_u = (s^{-1}(16),\, s^{-1}(16) + 4)$, applicable iff $s^{-1}(16) \le 12$
down: $\tau_d = (s^{-1}(16),\, s^{-1}(16) - 4)$, applicable iff $s^{-1}(16) \ge 5$
Elements of the set of possible actions $a \in A_s$ are the transpositions $\tau$ that perform the corresponding switch of elements. Instead of writing $s' = s \circ \tau$, the action corresponding to $\tau$ can also be written as $f(s, a) = s'$, where $f(\cdot)$ is the so-called transition function or dynamics model.
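To make the notation concrete, the following Python sketch (our own illustration, not code from the paper; names such as apply_action are made up) represents a state as a 16-tuple read row-wise, with the value 16 marking the vacant space, and implements the applicability conditions and transpositions listed above.

# Illustrative sketch of the state representation and dynamics model f(s, a).
# A state s is a 16-tuple read row-wise; the value 16 marks the vacant space.

GOAL = tuple(range(1, 17))                               # s_goal(i) = i for all i

MOVES = {"left": +1, "right": -1, "up": +4, "down": -4}  # offset of the transposed position

def applicable(blank, a):
    """Applicability conditions from above, for the vacant space at 1-based position `blank`."""
    return {"left": blank % 4 != 0, "right": blank % 4 != 1,
            "up": blank <= 12, "down": blank >= 5}[a]

def possible_actions(s):
    """The action set A_s of state s."""
    blank = s.index(16) + 1                              # position of the vacant space, s^-1(16)
    return [a for a in MOVES if applicable(blank, a)]

def apply_action(s, a):
    """Dynamics model f(s, a): concatenate s with the transposition tau of action a."""
    blank = s.index(16) + 1
    other = blank + MOVES[a]                             # second element of the transposition
    t = list(s)
    t[blank - 1], t[other - 1] = t[other - 1], t[blank - 1]
    return tuple(t)

# Example: the start configuration from Fig. 1
s = (15, 10, 16, 13, 11, 4, 1, 12, 3, 7, 9, 8, 2, 14, 6, 5)
print(possible_actions(s))          # blank at position 3: ['left', 'right', 'up']
print(apply_action(s, "up")[:4])    # tile 1 slides up: (15, 10, 1, 13)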
3 Reinforcement Learning: A Brief Review
In this section, we briefly introduce RL; see [7, 13] for detailed introductions. Reinforcement Learning is a subfield of machine learning that deals with goal-directed learning. The idea is to use environment feedback about the desirability of states and to let the agent learn to find a task solution while optimizing the overall gathered reward.
The state of a learning agent is given by $s \in S$; in each state, actions $a \in A_s \subseteq A$ can be applied. A reward function $r: s \mapsto \mathbb{R}$ defines the desirability of states and hence encodes the learning goal. The function $f: (s, a) \mapsto s'$ describes the dynamics of the considered system. The goal of the learning agent is to find a policy $\pi: s \mapsto a$ which maximizes the long-term reward $\sum_{i=0}^{\infty} \gamma^i r(s_i)$. Here, $0 < \gamma < 1$ is a discount factor, $s_0$ is the start state and $s_i = f(s_{i-1}, \pi(s_{i-1}))$ holds. Many Reinforcement Learning algorithms are based on a value function $V^\pi(s_0) = \sum_{i=0}^{\infty} \gamma^i r(s_i)$ with $s_i = f(s_{i-1}, \pi(s_{i-1}))$, which encodes the long-term reward obtained when starting in state $s_0$ and acting according to policy $\pi$. The optimal value function $V^*$ is defined as $V^*(s) = \max_\pi \{V^\pi(s)\}$; the optimal policy can be derived by $\pi^*(s) = \arg\max_{a \in A_s} \{V^*(s') \mid s' = f(s, a)\}$. The optimal value function $V^*(\cdot)$ is latent. In the literature, different ways to estimate $V^*(\cdot)$ can be found. A well-studied approach for the estimation of $V^*(\cdot)$ is Value-Iteration, which is used in various applications [14, 15]. It is given by the iterative application of
$\forall s \in S: \quad V_{t+1}(s) \leftarrow r(s) + \gamma \cdot \max_{a \in A_s} \{V_t(s') \mid s' = f(s, a)\}$
until convergence of $V_t$. It is guaranteed that $V_t$ converges to the optimal value function $V^*$ for $t \to \infty$ [7].
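As a minimal illustration of this update rule (again a sketch under our own conventions, not the authors' implementation), the following Python function performs tabular Value-Iteration over an explicitly enumerated state set; states, actions, step and reward stand for $S$, $A_s$, $f$ and $r$ from above.

def value_iteration(states, actions, step, reward, gamma=0.9, eps=1e-9):
    """Tabular Value-Iteration: V_{t+1}(s) <- r(s) + gamma * max_{a in A_s} V_t(f(s, a))."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # States without any applicable action simply keep the reward term.
            best = max((V[step(s, a)] for a in actions(s)), default=0.0)
            v_new = reward(s) + gamma * best
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < eps:      # V_t has converged
            return V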
In the context of the 15-puzzle problem, the state space $S$ corresponds to the puzzle configurations $S_{15}$, and an action $a \in A_s$ is the sliding of a numbered tile into the vacant space, as explained in the previous section. The reward function is defined as
$r(s) = \begin{cases} 1, & \text{if } s = s_{goal}, \\ 0, & \text{else.} \end{cases}$
With these definitions, Value-Iteration can be performed. The policy $\pi(s): s \mapsto a$ resulting from the learned value function $V(\cdot)$ returns for every puzzle configuration $s$ the sliding action $a$ for reaching the goal state.
While this approach theoretically gives us optimal solutions to all solvable puzzles, it is practically intractable. The reason is the size of the state space $S_{15}$ with $|S_{15}| \approx 10^{13}$. The Value-Iteration approach iterates over all possible states, which obviously cannot be performed. In the next section, we describe how the computational effort can be reduced significantly by decomposing the problem into tractable subproblems.
4 Local Value-Iteration for Solving the 15-Puzzle
The work by Pizlo and Li [1] analysed how humans deal with the enormous state space of the 15-puzzle. The study shows that humans try to solve the puzzle locally, tile by tile. While this approach does not provide optimal solutions, it significantly reduces the complexity of the problem [1]. The basic principle behind the solution strategy is to hierarchically divide the global problem into small local problems (e.g. move each tile to its correct position sequentially). As we will see later, it is not always sufficient to consider only one single tile; instead, multiple tiles have to be considered at once in some scenarios (see Sect. 5). Inspired by the work in [1], we present an RL approach using local state space information, similar to the ideas given in [2, 3]. Thus, when learning a policy to move a given tile to its correct position, we consider only a limited number of possible puzzle configurations. Furthermore, the size of the local region can be set adaptively. After solving the local subproblems, the complete solution is obtained by sequential application of the local solutions.
A local subproblem can be defined by a set $G = \{i_1, \ldots, i_k\} \subseteq \{1,\ldots,15\}$ of $k$ tiles which need to be moved to their distinct positions $i_1, \ldots, i_k$ without moving the tiles in $R = \{j_1, \ldots, j_\ell\}$. For example, for $G = \{3\}$ and $R = \{1,2\}$ the $(G,R)$-subproblem is to bring tile 3 to its correct position without moving tiles 1 and 2. Here, $R \cap G = \emptyset$ must hold, the free space must not be part of $R$, and we restrict the action sets $A_s$ according to $R$ (to prevent a tile from the set $R$ being moved).
Algorithm 1 Value-Iteration to Solve the Local $(G,R)$-Problem
function VALUE_ITERATION($S^{G,R}_{15}$)
    $V^{G,R}_0 \leftarrow 0$, $t \leftarrow 0$    ▷ Initial value function
    repeat
        for all $s \in S^{G,R}_{15}$ do    ▷ Apply the Value-Iteration update step on all states
            $V^{G,R}_{t+1}(s) \leftarrow r(s) + \gamma \max_{a \in A_s} \{ V^{G,R}_t(s') \mid s' = f(s,a),\, s' \in S^{G,R}_{15} \}$
        end for
        $t \leftarrow t + 1$
    until $V^{G,R}_t$ has converged
    $\pi^{G,R}(\cdot) \leftarrow \arg\max_{a \in A_s} \{ V^{G,R}(s') \mid s' = f(\cdot, a) \}$    ▷ Derive the optimal policy $\pi^{G,R}: S^{G,R}_{15} \to A$
    return $V^{G,R}_t$, $\pi^{G,R}$
end function
When moving the tiles in the local set $G$ to their correct positions, we do not need to keep track of the positions of all other tiles of the puzzle. Thus, a state of the $(G,R)$-subproblem has the form $(i_1, \ldots, i_k, j)$, i.e. the positions of the $k$ tiles in $G$ together with the position of the free space. The number of elements of the state space $S^{G,R}_{15}$ is given by $|S^{G,R}_{15}| = \frac{1}{2}(16-\ell)(16-\ell-1)\cdots(16-\ell-k)$ with $\ell = |R|$. Note that the factor $\frac{1}{2}$ is due to the fact that only half of all permutations are solvable [10]. In general, the state space $S^{G,R}_{15}$ is significantly smaller than $S_{15}$. For example, the state space for moving tiles 6 and 7 to their correct positions without moving tiles $\{1,\ldots,5\}$ has only 495 elements (compared to $10^{13}$ elements in $S_{15}$).
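The cardinality formula can be checked directly; the short snippet below (illustrative only) evaluates $|S^{G,R}_{15}|$ for the example above and for the subproblem $G = \{3,4\}$, $R = \{1,2\}$ used later in the analysis.

from math import prod

def local_state_space_size(k, l):
    """|S^{G,R}_15| = 1/2 * (16 - l) * (16 - l - 1) * ... * (16 - l - k), with l = |R| and k = |G|."""
    return prod(16 - l - j for j in range(k + 1)) // 2

print(local_state_space_size(k=2, l=5))   # tiles {6, 7} with R = {1,...,5}: 495
print(local_state_space_size(k=2, l=2))   # tiles {3, 4} with R = {1, 2}: 1092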
As the $(G,R)$-subproblems have significantly smaller state spaces, we can employ state-of-the-art RL algorithms, such as the Value-Iteration shown in Algorithm 1, for learning a policy to move a given tile to its desired position. Here, $f$ is the dynamics function applied to elements $s \in S^{G,R}_{15}$, and the reward function is defined as given in Sect. 3. The algorithm returns the value function $V^{G,R}$ and the optimal local policy $\pi^{G,R}$. To solve the $(G,R)$-subproblem for a given puzzle configuration $s$, we subsequently apply actions $a = \pi^{G,R}(s)$ until all tiles in $G$ are at their correct positions.
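Algorithm 1 can be sketched in Python as follows (an illustrative reading of the algorithm, not the authors' code). The sketch reuses MOVES and applicable from the Sect. 2 sketch and the generic value_iteration routine from the Sect. 3 sketch, and it assumes that the tiles in $R$ already occupy their goal positions, which is the case when the subproblems are generated as in the next paragraphs.

from itertools import permutations

def local_problem(G, R):
    """Builds the (G, R)-subproblem. A local state is (positions of the sorted G-tiles..., blank position);
    the tiles in R are assumed to already sit on their goal positions."""
    G = sorted(G)
    free = [p for p in range(1, 17) if p not in R]
    # All placements of the G-tiles and the blank; roughly twice |S^{G,R}_15|, since solvability
    # is not checked here (states from which the goal is unreachable simply keep value 0).
    states = list(permutations(free, len(G) + 1))

    def actions(state):
        """Applicable sliding actions that do not move a tile of R."""
        blank = state[-1]
        return [a for a in MOVES if applicable(blank, a) and blank + MOVES[a] not in R]

    def step(state, a):
        """Local dynamics: the tile adjacent to the blank slides into the gap."""
        *tiles, blank = state
        other = blank + MOVES[a]
        tiles = [blank if p == other else p for p in tiles]   # only a tracked G-tile can change position
        return (*tiles, other)

    def reward(state):
        """r(s) = 1 iff every tile of G sits on its goal position."""
        return 1.0 if all(p == g for p, g in zip(state, G)) else 0.0

    return states, actions, step, reward

def solve_local(G, R, gamma=0.9):
    """Algorithm 1 (sketch): Value-Iteration on the (G, R)-subproblem, returning V^{G,R} and pi^{G,R}."""
    states, actions, step, reward = local_problem(G, R)
    V = value_iteration(states, actions, step, reward, gamma)   # generic routine sketched in Sect. 3
    def policy(state):
        return max(actions(state), key=lambda a: V[step(state, a)])
    return V, policy

For example, solve_local({3, 4}, {1, 2}) operates on 14 * 13 * 12 = 2184 enumerated placements (1092 of which are solvable, cf. the cardinality formula above) and returns a policy that brings tiles 3 and 4 to their goal positions.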
Until now, we have described how Value-Iteration can be employed on local $(G,R)$-subproblems for learning an optimal local policy $\pi^{G,R}$. In the next step, we discuss an approach to determine the sets $G$ and $R$ for a given puzzle configuration. A naive approach would be to successively set $G = \{i\}$ for $i = 1,\ldots,15$ and $R = \{1,\ldots,i-1\}$. That would mean solving the puzzle tile by tile while fixing all lower-numbered tiles. However, this simple approach does not work for many puzzles, as we will show in the next section. Thus, we need to adaptively vary the local region $G$ and possibly consider several tiles at once. A systematic approach is to successively move tiles from the set $R$ to $G$ until a solution is found. Thus, if no solution is found for $G = \{i\}$ and $R = \{1,\ldots,i-1\}$, we first set $G = G \cup \{i-1\}$ and $R = R \setminus \{i-1\}$, and continue to increase $G$ and decrease $R$ by setting $G = G \cup \{\max(R)\}$ and $R = R \setminus \{\max(R)\}$. This procedure is repeated until a solution is found.
Algorithm 2 Local Value-Iteration for Solving the 15-Puzzle
function SOLVE_PUZZLE(start_state)
    $s \leftarrow$ start_state
    for $i \leftarrow 1,\ldots,15$ do
        $G_i \leftarrow \{i\}$, $R_i \leftarrow \{1,\ldots,i-1\}$    ▷ Successively move tiles to their correct positions
        $V^{G_i,R_i}, \pi^{G_i,R_i} \leftarrow$ VALUE_ITERATION($S^{G_i,R_i}_{15}$)
        while $V^{G_i,R_i}(s) = 0$ do    ▷ Holds iff it is not possible to solve $(G_i,R_i)$ for $s$
            $G_i \leftarrow G_i \cup \{\max(R_i)\}$    ▷ Move the tile with the highest number from $R_i$ to $G_i$
            $R_i \leftarrow R_i \setminus \{\max(R_i)\}$
            $V^{G_i,R_i}, \pi^{G_i,R_i} \leftarrow$ VALUE_ITERATION($S^{G_i,R_i}_{15}$)
        end while
        ▷ Now solve the part of the puzzle corresponding to $(G_i, R_i)$
        while $\exists\, j \in G_i : s(j) \neq j$ do    ▷ While at least one tile of $G_i$ is not at its goal position
            $a_{best} \leftarrow \pi^{G_i,R_i}(s)$    ▷ The policy returns the action to solve $(G_i,R_i)$
            $s \leftarrow f(s, a_{best})$    ▷ Apply the sliding action
        end while
    end for
end function
If the puzzle is solvable, this approach is guaranteed to find a solution, because we eventually end up with $R = \emptyset$. This idea of increasing the local region gives us the following technique to solve an arbitrary puzzle:
Given a solvable starting puzzle configuration $s$, we successively try to solve $G = \{i\}$ for $i = 1,\ldots,15$ and $R = \{1,\ldots,i-1\}$ while employing Value-Iteration on $S^{G,R}_{15}$, as given in Algorithm 1. We can move the tiles in $G$ to their correct positions without moving tiles in the set $R$ if and only if $V^{G,R}(s) \neq 0$. In this case, we simply apply the corresponding policy $\pi^{G,R}$ to bring the tiles to their correct positions. Otherwise, we increase the set $G$ and decrease the set $R$, i.e. $G = G \cup \{\max(R)\}$ and $R = R \setminus \{\max(R)\}$. Subsequently, we go back to the Value-Iteration step with the new sets $G, R$. The resulting local Value-Iteration approach is summarized in Algorithm 2.
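Putting the pieces together, Algorithm 2 might look as follows in Python (again a sketch built on the hypothetical apply_action and solve_local helpers from the earlier sketches). As described in Sect. 6, the learned local value functions and policies are cached and reused across puzzles.

def project(s, G):
    """Map a full 16-tuple state onto the local (G, R)-state: positions of the sorted G-tiles plus the blank."""
    return tuple(s.index(g) + 1 for g in sorted(G)) + (s.index(16) + 1,)

_CACHE = {}   # learned (V, policy) per (G, R)-subproblem, reused across puzzles (cf. Sect. 6)

def solve_puzzle(start_state):
    """Algorithm 2 (sketch): solve a 15-puzzle tile by tile, enlarging the local region when necessary."""
    s = start_state
    for i in range(1, 16):
        G, R = {i}, set(range(1, i))
        while True:
            key = (frozenset(G), frozenset(R))
            if key not in _CACHE:
                _CACHE[key] = solve_local(G, R)          # learn V^{G,R} and pi^{G,R} only once
            V, policy = _CACHE[key]
            if V[project(s, G)] > 0:                     # the (G, R)-subproblem is solvable from s
                break
            G.add(max(R)); R.remove(max(R))              # move the highest-numbered tile from R to G
        while any(s[g - 1] != g for g in G):             # until all G-tiles are at their goal positions
            s = apply_action(s, policy(project(s, G)))   # apply_action from the Sect. 2 sketch
    return s

# Example: solve_puzzle((15, 10, 16, 13, 11, 4, 1, 12, 3, 7, 9, 8, 2, 14, 6, 5)) returns the goal state.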
5 Analysis of Local Value-Iteration for the 15-Puzzle
As the proposed RL method is based on the well-known Value-Iteration [7], it is guaranteed to find a solution for any solvable 15-puzzle configuration. However, the proposed local approach at times involves the consideration of several tiles at once and can thus increase the computational complexity. In this section, we analyse the proposed algorithm and develop a bound for the maximum number of tiles that are considered at each step. Furthermore, we provide the maximal cardinality of the involved state spaces and give bounds on the maximum number of actions needed to solve a given puzzle configuration.
Definition 1. Consider a random puzzle $s \in S_{15}$. Let $G_i$ be the set used in Local Value-Iteration to bring tile $i$ to its correct position (see Algorithm 2). The probability that $|G_i| = j$ holds is denoted by $\tau^j_i$. The subset of puzzles $S_k \subseteq S_{15}$ is defined by $s \in S_k$ if and only if $\max(|G_1|, \ldots, |G_{15}|) = k$ holds. Finally, $\Phi(S_k)$ gives an upper bound on the number of actions that are necessary to solve any puzzle $s \in S_k$ using Local Value-Iteration.
In Definition 1, the factor $\tau^j_i$ describes the probability that $j$ tiles $\{i, \ldots, i-j+1\}$ need to be considered to bring tile $i$ to its correct position, given a random puzzle. For example, tile 2 can always be moved to position 2 without moving tile 1. Hence, $\tau^1_2 = 1$ and $\tau^\ell_2 = 0$ for all $\ell > 1$. As we will see later, tile 4 can be moved to its correct position without moving tile 3 in $\tau^1_4 = \frac{1}{12}$ of the cases. In the other cases, tile 3 needs to be moved in order to get tile 4 to the right position and, thus, $\tau^2_4 = \frac{11}{12}$. The proportions $\tau^1_i, \tau^2_i, \ldots$ always sum up to 1. The classes $S_k$ partition the set of all puzzle configurations. Informally, $s \in S_k$ states that the puzzle $s$ can be solved with Local Value-Iteration by considering $k$ tiles at once.
Theorem 1. Given a puzzle $s \in S_{15}$, the state spaces involved in solving the puzzle $s$ with Local Value-Iteration have at most 10080 elements. For $i \in \{1,2,3,5,6,7,9,14,15\}$ it holds that $\tau^1_i = 1$. For $i \in \{4,8,10,11,12,13\}$, $\tau^j_i$ is given as

Tile $i$ | $\tau^1_i$   | $\tau^2_i$ | $\tau^3_i$ | $\tau^4_i$ | $\tau^5_i$ | Max. state space size
4        | 1/12         | 11/12      | -          | -          | -          | $|S^{\{3,4\},\{1,2\}}_{15}| = 1092$
8        | 1/8          | 7/8        | -          | -          | -          | $|S^{\{7,8\},\{1,\ldots,6\}}_{15}| = 360$
10       | 2100/2520    | 420/2520   | -          | -          | -          | $|S^{\{9,10\},\{1,\ldots,8\}}_{15}| = 168$
11       | 216/360      | 72/360     | 72/360     | -          | -          | $|S^{\{9,10,11\},\{1,\ldots,8\}}_{15}| = 840$
12       | 15/60        | -          | 30/60      | 15/60      | -          | $|S^{\{9,\ldots,12\},\{1,\ldots,8\}}_{15}| = 3360$
13       | 4/12         | -          | -          | -          | 8/12       | $|S^{\{9,\ldots,13\},\{1,\ldots,8\}}_{15}| = 10080$
Proof. In the following, we provide the calculation of the probabilities $\tau^j_i$, as well as the maximal cardinality of the state space involved for each tile $i$.
Tiles 1, 2, 3, 5, 6, 7: It is easy to check that these tiles can be moved to their goal positions without moving lower-numbered tiles. This implies $\tau^1_i = 1$ for all $i \in \{1,2,3,5,6,7\}$. The state spaces considered have size $|S^{\{i\},\{1,\ldots,i-1\}}_{15}| = \frac{1}{2}(16-i+1)(16-i)$, which is at most $\frac{1}{2} \cdot 16 \cdot 15 = 120$.
Tiles 4, 8: In $\frac{1}{13}$ (resp. $\frac{1}{9}$) of all states, tile 4 (resp. 8) is already in its correct position. Tile 4 (8) can also be brought to its correct position if tile 4 (8) is in position 8 (12) and the vacant space is in position 4 (8). This applies to $\frac{1}{13} \cdot \frac{1}{12}$ (resp. $\frac{1}{9} \cdot \frac{1}{8}$) of all remaining puzzles with tiles 1 to 3 fixed. For the rest of the cases, one can check that it is not possible to bring tile 4 (resp. 8) to its position without moving lower-numbered tiles. But it is sufficient to move only tile 3 (7): It is always possible to move tile 4 (8) below its correct position with the free space below tile 4 (8). The following sequence then brings tiles 3 and 4 (7 and 8) to their correct positions: Down, Down, Right, Up, Left, Up, Right, Down, Down, Left, Up. This implies
$\tau^1_4 = \frac{1}{13} + \frac{1}{13}\cdot\frac{1}{12} = \frac{1}{12}$, $\quad \tau^2_4 = \frac{11}{12}$, $\quad \tau^1_8 = \frac{1}{9} + \frac{1}{9}\cdot\frac{1}{8} = \frac{1}{8}$, $\quad \tau^2_8 = \frac{7}{8}$.
The largest state space is $S^{\{3,4\},\{1,2\}}_{15}$ with $|S^{\{3,4\},\{1,2\}}_{15}| = \frac{1}{2} \cdot 14 \cdot 13 \cdot 12 = 1092$.
Tile 9: Tile 9 can be brought to its correct position by moving the vacant space circle-wise, hence $\tau^1_9 = 1$ with $|S^{\{9\},\{1,\ldots,8\}}_{15}| = \frac{1}{2} \cdot 8 \cdot 7 = 28$.
Tiles 10, 11, 12, 13: If the puzzle is solved up to tile 9, the remaining number of solvable puzzles is $\frac{1}{2} \cdot 7! = 2520$ (as mentioned in Sect. 2, only half of the states are solvable puzzles). We calculate the proportions $\tau$ by applying Algorithm 2 to all 2520 solvable puzzles.
Tiles 14, 15: As can be checked, $\tau^1_{14} = \tau^1_{15} = 1$ holds; the state spaces have 3 resp. 2 elements.
This concludes the proof of Theorem 1. The theorem shows that Local Value-Iteration considers, in the worst case, a state space with 10080 elements, compared to the $10^{13}$ elements of the original Value-Iteration. In the next step, we investigate the puzzle classes $S_k$ introduced in Definition 1. Theorem 1 directly implies the following corollary.
Corollary 1. It holds that $\tau^j_i = 0$ for all $1 \le i \le 15$ and $j > 5$. Hence $S_k = \emptyset$ holds for all $k > 5$, and $S_1, S_2, S_3, S_4, S_5$ partition $S_{15}$.
Thus, Local Value-Iteration considers at most 5 tiles at once for any given puzzle, and puzzles of the class $S_5$ can be considered the most difficult to solve. With the next theorem, we investigate how the puzzles in $S_{15}$ split up into the difficulty classes $S_1$ to $S_5$.
Theorem 2. Consider the classes $S_1, S_2, \ldots, S_5$ introduced in Definition 1. Then $|S_1| \approx 0.04\%\,|S_{15}|$, $|S_2| \approx 6.62\%\,|S_{15}|$, $|S_3| \approx 18.3\%\,|S_{15}|$, $|S_4| \approx 8.3\%\,|S_{15}|$, and $|S_5| \approx 66.6\%\,|S_{15}|$ holds.
Proof. We compute the cardinalities using the proportions $\tau^j_i$ from Theorem 1:
$|S_1| = \tau^1_4 \cdot \tau^1_8 \cdot \tau^1_{10} \cdot \tau^1_{11} \cdot \tau^1_{12} \cdot \tau^1_{13} \cdot |S_{15}| = \tfrac{1}{2304}\,|S_{15}| \approx 0.04\%\,|S_{15}|$
$|S_2| = \big[\tau^2_4 (\tau^1_8 + \tau^2_8)(\tau^1_{10} + \tau^2_{10})(\tau^1_{11} + \tau^2_{11}) + \tau^1_4\, \tau^2_8 (\tau^1_{10} + \tau^2_{10})(\tau^1_{11} + \tau^2_{11}) + \tau^1_4\, \tau^1_8\, \tau^2_{10} (\tau^1_{11} + \tau^2_{11}) + \tau^1_4\, \tau^1_8\, \tau^1_{10}\, \tau^2_{11}\big] \cdot \tau^1_{12} \cdot \tau^1_{13} \cdot |S_{15}| = \tfrac{763}{11520}\,|S_{15}| \approx 6.62\%\,|S_{15}|$
$|S_3| = \big[\tau^3_{11} (\tau^1_{12} + \tau^3_{12}) + (\tau^1_{11} + \tau^2_{11})\, \tau^3_{12}\big] \cdot \tau^1_{13} \cdot |S_{15}| = \tfrac{11}{60}\,|S_{15}| \approx 18.3\%\,|S_{15}|$
$|S_4| = \tau^4_{12} \cdot \tau^1_{13} \cdot |S_{15}| = \tfrac{1}{12}\,|S_{15}| \approx 8.3\%\,|S_{15}|$
$|S_5| = \tau^5_{13} \cdot |S_{15}| = \tfrac{2}{3}\,|S_{15}| \approx 66.6\%\,|S_{15}|$
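The stated proportions can be re-derived numerically from the values of Theorem 1. The snippet below (our own check, using exact fractions and multiplying the per-tile proportions as done in the proof) reproduces the percentages.

from fractions import Fraction as F

# tau[i][j] = probability that |G_i| = j, taken from Theorem 1 (tiles with tau^1_i = 1 are omitted).
tau = {
    4:  {1: F(1, 12), 2: F(11, 12)},
    8:  {1: F(1, 8), 2: F(7, 8)},
    10: {1: F(2100, 2520), 2: F(420, 2520)},
    11: {1: F(216, 360), 2: F(72, 360), 3: F(72, 360)},
    12: {1: F(15, 60), 3: F(30, 60), 4: F(15, 60)},
    13: {1: F(4, 12), 5: F(8, 12)},
}

def prob_class(k):
    """P(max_i |G_i| = k) = P(all |G_i| <= k) - P(all |G_i| <= k - 1), multiplying over tiles as in the proof."""
    def p_all_leq(m):
        p = F(1)
        for t in tau.values():
            p *= sum(q for j, q in t.items() if j <= m)
        return p
    return p_all_leq(k) - p_all_leq(k - 1)

for k in range(1, 6):
    print(k, prob_class(k), round(float(prob_class(k)) * 100, 3), "%")
# prints 1/2304 (~0.04%), 763/11520 (~6.62%), 11/60 (~18.33%), 1/12 (~8.33%), 2/3 (~66.67%)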
According to Theorem 2, only 0.04% of all puzzles can be solved tile by tile without moving lower-numbered tiles. On the other hand, 66.6% of all puzzles involve a situation where 4 lower-numbered tiles need to be considered to solve the puzzle. As already mentioned, the membership of a puzzle $s \in S_{15}$ in a class $S_k$ describes its difficulty in terms of how many tiles need to be considered at once to solve the puzzle with Local Value-Iteration. Another possible measure for the difficulty of a puzzle is the number of sliding actions necessary to solve it. Our final step is therefore to give upper bounds $\Phi(S_k)$ on the maximal number of actions that need to be applied to solve any puzzle $s \in S_k$. As we will see, $\Phi(S_k)$ increases with $k$, meaning that puzzles from a higher class $S_k$ potentially also need more actions to be solved.
Theorem 3. Given a puzzle $s \in S_k$, the maximal number of actions necessary to solve the puzzle $s$ using Algorithm 2 is given by $\Phi(S_1) = 142$, $\Phi(S_2) = 220$, $\Phi(S_3) = 248$, $\Phi(S_4) = 255$, $\Phi(S_5) = 288$.
Proof. We prove the statement by inspection of the value function learned with Algorithm 1. An entry $V(s) = \sum_{t=k}^{\infty} \gamma^t$ implies that $k$ actions are necessary to reach the goal state from state $s$ under an optimal policy. Let $M_{G,R}$ denote the maximal number of actions necessary to solve $(G,R)$. The relevant subproblems $(G,R)$ are given in Theorem 1. By application of Algorithm 1, the following values for $M_{G,R}$ can be given:
$M_{\{1\},\emptyset} = 21$, $M_{\{2\},\{1\}} = 17$, $M_{\{3\},\{1,2\}} = 20$, $M_{\{4\},\{1,2,3\}} = 1$, $M_{\{4,3\},\{1,2\}} = 32$,
$M_{\{5\},\{1,\ldots,4\}} = 17$, $M_{\{6\},\{1,\ldots,5\}} = 13$, $M_{\{7\},\{1,\ldots,6\}} = 18$, $M_{\{8\},\{1,\ldots,7\}} = 1$, $M_{\{8,7\},\{1,\ldots,6\}} = 29$,
$M_{\{9\},\{1,\ldots,8\}} = 15$, $M_{\{10\},\{1,\ldots,9\}} = 9$, $M_{\{10,9\},\{1,\ldots,8\}} = 20$, $M_{\{11\},\{1,\ldots,10\}} = 6$, $M_{\{11,10\},\{1,\ldots,9\}} = 14$,
$M_{\{11,10,9\},\{1,\ldots,8\}} = 23$, $M_{\{12\},\{1,\ldots,11\}} = 1$, $M_{\{12,11,10\},\{1,\ldots,9\}} = 20$, $M_{\{12,\ldots,9\},\{1,\ldots,8\}} = 27$, $M_{\{13\},\{1,\ldots,12\}} = 1$,
$M_{\{13,\ldots,9\},\{1,\ldots,8\}} = 34$, $M_{\{14\},\{1,\ldots,13\}} = 1$, $M_{\{15\},\{1,\ldots,14\}} = 1$.
Given these maximal numbers of actions to solve a subproblem $(G,R)$, we can now estimate the worst case, i.e. the maximal number of actions, to solve a puzzle $s \in S_k$. For this purpose, we compute
$\Phi(S_k) = \sum_{i=1}^{15} \max\big\{ M_{\{i\},\{1,\ldots,i-1\}},\, \ldots,\, M_{\{i,\ldots,i-k+1\},\{1,\ldots,i-k\}} \big\},$
where the max-operator only considers entries $M_{G,R}$ that are given above (other subproblems $(G,R)$ are not relevant according to Theorem 1).
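The bounds of Theorem 3 follow from the listed $M_{G,R}$ values via this formula; the snippet below (illustrative) reproduces them.

# M[(i, size)] = M_{G,R} for G = {i, ..., i - size + 1}, R = {1, ..., i - size}, values from the proof above.
M = {
    (1, 1): 21, (2, 1): 17, (3, 1): 20, (4, 1): 1, (4, 2): 32,
    (5, 1): 17, (6, 1): 13, (7, 1): 18, (8, 1): 1, (8, 2): 29,
    (9, 1): 15, (10, 1): 9, (10, 2): 20, (11, 1): 6, (11, 2): 14, (11, 3): 23,
    (12, 1): 1, (12, 3): 20, (12, 4): 27, (13, 1): 1, (13, 5): 34,
    (14, 1): 1, (15, 1): 1,
}

def phi(k):
    """Phi(S_k): sum over tiles of the largest listed M-value for local regions of size at most k."""
    return sum(max(v for (i, size), v in M.items() if i == tile and size <= k)
               for tile in range(1, 16))

print([phi(k) for k in range(1, 6)])   # [142, 220, 248, 255, 288]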
6 Simulation Results
This section contains experimental results obtained when Local Value-Iteration is applied to solve the 15-puzzle. The experiments show the feasibility of the Local Value-Iteration approach and also support the analysis provided in the previous section.
We use the Local Value-Iteration Algorithm 2 to solve instances of the 15-puzzle. Here, subproblems $(G,R)$ recur for different puzzles. Hence, we learn the value function and the corresponding policy only once for a given subproblem $(G,R)$. On a standard desktop PC, the training for all involved $(G,R)$-problems took 155 seconds. After the training, arbitrary puzzles can be solved. We built a set $\tilde{S}$ of 100000 random puzzles; all puzzles $s \in \tilde{S}$ could successfully be solved with Algorithm 2, with an average of approximately 0.25 seconds to solve one puzzle.
In the analysis section, Theorem 2 states the proportions of the classes $S_k$ relative to the set of all puzzles. These proportions can also be estimated experimentally: for each puzzle $s \in \tilde{S}$, we keep track of the maximum number of tiles $k$ considered at once to solve the puzzle and subsequently sort it into the corresponding class $\tilde{S}_k$. The results after solving all 100000 random puzzles in $\tilde{S}$ are $|\tilde{S}_1| = 42$ (0.042% of $|\tilde{S}|$), $|\tilde{S}_2| = 6539$ (6.539%), $|\tilde{S}_3| = 18362$ (18.362%), $|\tilde{S}_4| = 8446$ (8.446%), and $|\tilde{S}_5| = 66611$ (66.611%).
These empirical proportions correspond to the values given in Theorem 2 up to errors of less than 0.15%. For each of the five classes $\tilde{S}_k$, we compute the average and maximum number of actions necessary to solve a puzzle. The results are given in Table 1 and correspond to the statements of Theorem 3. The results show that as the number of tiles which need to be considered increases, the number of actions required to solve the puzzle also increases.
Table 1. We solved 100000 puzzles with Local Value-Iteration and classified them into the classes $S_k$. The table gives the average and maximum number of actions necessary to solve puzzles of the respective classes.

                     $S_1$    $S_2$     $S_3$     $S_4$     $S_5$     overall
average # actions    68.85    100.68    112.16    122.92    128.71    123.35
maximum # actions    107      156       165       174       202       202
So far, we started with one tile, i.e. $G_i = \{i\}$, and increased the local region if the subproblem could not be solved. While Value-Iteration provides optimal solutions for the subproblems, the overall solution is in general not optimal (with respect to the number of actions necessary to solve a puzzle). On the other hand, solving the complete puzzle at once, i.e. $G = \{1,\ldots,15\}$, would give us the optimal solution, but is computationally intractable. In the next step, we vary between those two extremes and increase the initial size of the local region to find better solutions (in terms of fewer actions necessary to solve the puzzle), while accepting higher computational effort. For an initial local size $\ell$, we define $G_i = \{\ell i - \ell + 1, \ldots, \ell i\}$. If we increase, for example, the size to $\ell = 2$, this gives us $G_1 = \{1,2\}$, $G_2 = \{3,4\}$, $G_3 = \{5,6\}$ and so on. We solved the 100000 random puzzles in $\tilde{S}$ again, now with initial sizes from 1 to 5. Figure 2 (left) shows that the average as well as the maximal number of actions to solve a puzzle declines as expected when we increase the region size. The cost of this improvement is shown in Fig. 2 (right): the computation time grows exponentially. The detailed results can be found in Table 2.
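For illustration, the grouping of tiles for an initial local size $\ell$ can be written as follows (a sketch; the handling of the last, possibly smaller group is our own assumption, as the paper does not specify it).

def initial_groups(l):
    """Partition tiles 1..15 into consecutive groups of size l (the last group may be smaller)."""
    return [set(range(i, min(i + l, 16))) for i in range(1, 16, l)]

print(initial_groups(2))  # [{1, 2}, {3, 4}, ..., {13, 14}, {15}]
print(initial_groups(5))  # [{1, ..., 5}, {6, ..., 10}, {11, ..., 15}]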
Fig. 2. The initial size of the local region $G$ for Local Value-Iteration is varied from 1, i.e. initial $G_i = \{i\}$, up to 5, i.e. initial $G_i = \{5i-4,\ldots,5i\}$. The left plot shows the average as well as the maximal number of actions to solve 100000 random puzzles. On the right, the necessary overall training time (in minutes) for each initial local size is shown.
Table 2. We adapted the initial local size from 1, i.e. $G_i = \{i\}$ for $i = 1,\ldots,15$, up to 5, i.e. $G_i = \{5i-4,\ldots,5i\}$ for $i = 1,\ldots,3$. We solved 100000 puzzles with each initial local size; the table gives the average and maximum number of actions necessary to solve the puzzles. The last row states the necessary training time on a standard desktop PC.

initial size of local region    1         2        3        4         5
average # actions               123.35    99.36    93.69    78.07     73.82
maximum # actions               202       150      136      115       107
training time in minutes        2.58      7.10     57.87    109.03    312.33
7 Conclusion
In this study, we investigate the possibility of using RL to solve the popular
15-puzzle game. Due to the high state space dimension of the problem, it is
difficult to straightforwardly employ state-of-the-art RL algorithms. In order to
deal with this large state space, we proposed a local variation of the well-known
Value-Iteration appropriate to solve the 15-puzzle problem. Our algorithm is in-
spired by the insight that humans use to solve the 15-puzzle game locally, by
sequentially moving tiles to their correct positions. Furthermore, we provide a
theoretical analysis showing the feasibility of the proposed approach. The plau-
sibility of the analysis is supported by several simulation results.
References
1. Pizlo, Z., Li, Z.: Solving Combinatorial Problems: The 15-Puzzle. Memory & Cognition 33 (2005) 1069–1084
2. Borga, M.: Reinforcement Learning Using Local Adaptive Models. ISY, Linköping University, ISBN 91-7871-590-3, LiU-Tek-Lic-1995:39 (1995)
3. Zhang, W., Zhang, N. L.: Restricted Value Iteration: Theory and Algorithms. Journal of Artificial Intelligence Research 23 (2005) 123–165
4. Archer, A. F.: A Modern Treatment of the 15 Puzzle. American Mathematical Monthly 106 (1999) 793–799
5. Burns, E. A., Hatem, M. J., Ruml, W.: Implementing Fast Heuristic Search Code. Symposium on Combinatorial Search (2012)
6. Korf, R. E.: Depth-First Iterative-Deepening: An Optimal Admissible Tree Search. Artificial Intelligence 27 (1985) 97–109
7. Sutton, R. S., Barto, A. G.: Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). The MIT Press (1998)
8. Surynek, P., Michalik, P.: An Improved Sub-optimal Algorithm for Solving the (N²−1)-Puzzle. Institute for Theoretical Computer Science, Charles University, Prague (2011)
9. Ratner, D., Warmuth, M. K.: The (n²−1)-Puzzle and Related Relocation Problems. Journal of Symbolic Computation 10 (1990) 111–138
10. Hordern, E.: Sliding Piece Puzzles. Oxford University Press (1986)
11. Parberry, I.: A Real-Time Algorithm for the (n²−1)-Puzzle. Information Processing Letters 56 (1995) 23–28
12. Korf, R. E., Schultze, P.: Large-Scale Parallel Breadth-First Search. Proceedings of the 20th National Conference on Artificial Intelligence 3 (2005) 1380–1385
13. Wiering, M., van Otterlo, M.: Reinforcement Learning: State-of-the-Art. Springer (2012)
14. Porta, J. M., Spaan, M. T. J., Vlassis, N.: Robot Planning in Partially Observable Continuous Domains. Robotics: Science and Systems, MIT Press (2005) 217–224
15. Asadian, A., Kermani, M. R., Patel, R. V.: Accelerated Needle Steering Using Partitioned Value Iteration. American Control Conference (ACC), Baltimore, MD, USA (2010)