Large-Scale Parallel Monte Carlo Tree Search on GPU

Kamil Rocki, Reiji Suda

Department of Computer Science

Graduate School of Information Science and Technology, The University of Tokyo

7-3-1, Hongo, Bunkyo-ku, 113-8654, Tokyo, Japan

Email: kamil.rocki, reiji@is.s.u-tokyo.ac.jp

Abstract—Monte Carlo Tree Search (MCTS) is a method for making optimal decisions in
artificial intelligence (AI) problems, typically move planning in combinatorial games.
It combines the generality of random simulation with the precision of tree search. This
work is motivated by emerging GPU-based systems and their high computational potential
combined with relatively low power usage compared to CPUs. As a problem to be solved, we
chose to develop a GPU (Graphics Processing Unit)-based AI agent for the game of Reversi
(Othello), which provides a sufficiently complex tree search problem with a non-uniform
structure and an average branching factor of over 8. We present an efficient parallel GPU
MCTS implementation based on the introduced 'block-parallelism' scheme, which combines
GPU SIMD thread groups and performs independent searches without any need for intra-GPU
or inter-GPU communication. We compare it with a simple leaf-parallel scheme, which
implies certain performance limitations. The obtained results show that, using our GPU
MCTS implementation on the TSUBAME 2.0 system, one GPU can be compared to 100-200 CPU
threads in terms of obtained results, depending on factors such as the search time and
other MCTS parameters. We propose and analyze simultaneous CPU/GPU execution, which
improves the overall result.

I. INTRODUCTION

Monte Carlo Tree Search (MCTS)[1][2] is a method for making optimal decisions in
artificial intelligence (AI) problems, typically move planning in combinatorial games.
It combines the generality of random simulation with the precision of tree search.

Research interest in MCTS has risen sharply due to its spectacular success with computer
Go and its potential application to a number of other difficult problems. Its application
extends beyond games[6][7][8][9]. The main advantages of the MCTS algorithm are that it
does not require any strategic or tactical knowledge about the given domain to make
reasonable decisions and that the algorithm can be halted at any time to return the
current best estimate. Another advantage of this approach is that the longer the
algorithm runs, the better the solution, so the quality of the decisions can be
controlled by specifying a time limit. It provides relatively good results in games like
Go or Chess where standard algorithms fail. So far, research has shown that the
algorithm can be parallelized on multiple CPUs.

This work is motivated by emerging GPU-based systems and their high computational
potential combined with relatively low power usage compared to CPUs. As a problem to be
solved, we chose to develop a GPU (Graphics Processing Unit)-based AI agent for the game
of Reversi (Othello), which provides a sufficiently complex tree search problem with a
non-uniform structure and an average branching factor of over 8. The importance of this
research is that if the MCTS algorithm can be efficiently parallelized on GPU(s), it can
also be applied to other similar problems on modern multi-CPU/GPU systems such as the
TSUBAME 2.0 supercomputer. Tree search algorithms are hard to parallelize, especially
on GPUs. Finding an algorithm that is suitable for GPUs is crucial if tree search is to
be performed on recent supercomputers. Conventional algorithms do not provide good
performance because of the limitations of the GPU architecture and programming scheme,
such as the boundaries on communication between threads. One of the problems is the SIMD
execution scheme within the GPU for a group of threads. It means that a standard CPU
parallel implementation such as root-parallelism[3] fails. So far we have been able to
successfully parallelize the algorithm and run it on thousands of CPU threads[4] using
root-parallelism.

We study an efficient parallel GPU MCTS implementation based on the introduced
block-parallelism scheme, which combines GPU SIMD thread groups and performs independent
searches without any need for intra-GPU or inter-GPU communication. We compare it with a
simple leaf-parallel scheme, which implies certain performance limitations. The obtained
results show that, using our GPU MCTS implementation on the TSUBAME 2.0 system, one
GPU's performance can be compared to 50-100 CPU threads[4] using root-parallelism,
depending on factors such as the search time and other MCTS parameters. The
block-parallel algorithm provides better results than the simple leaf-parallel scheme,
which fails to scale well beyond 1000 threads on a single GPU. The block-parallel
algorithm is approximately 4 times more efficient in terms of the number of CPU threads
needed to obtain results comparable with the GPU implementation. Additionally, we are
currently testing this algorithm on more than 100 GPUs to test its scalability limits.

2011 IEEE International Parallel & Distributed Processing Symposium

1530-2075/11 $26.00 © 2011 IEEE

DOI 10.1109/IPDPS.2011.370


One of the reasons why this problem has not been solved before is that this architecture
is quite new and applications for it are still being developed; so far there is no
related work. The scale of parallelism is extreme here (i.e. thousands of GPUs with
10000 threads each). The published work covers hundreds or thousands of CPU cores at
most. The existing parallel schemes[3] rely on algorithms that require either each
thread to execute the whole code (which does not work well, since a GPU is a SIMD
device) or synchronization/communication, which is also not applicable.

II. MONTE CARLO TREE SEARCH

A simulation is defined as a series of random moves which are performed until the end of
a game is reached (until neither of the players can move). The result of a simulation is
successful if it ends in a win, and unsuccessful otherwise. Let every node i in the tree
store the number of simulations t_i (visits) and the number of successful simulations
S_i. The general MCTS algorithm comprises 4 steps (Figure 1), which are repeated.

A. MCTS iteration steps

1) Selection: a node from the game tree is chosen based on the specified criteria. The
value of each node is calculated and the best one is selected. In this paper, the
formula used to calculate the node value is the Upper Confidence Bound (UCB).

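$$UCB_i = \frac{S_i}{t_i} + C \sqrt{\frac{\log T_i}{t_i}}$$

Where:

S_i - number of successful simulations for node i

t_i - number of simulations for node i

T_i - total number of simulations for the parent of node i

C - a parameter to be adjusted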

Suppose that some simulations have been performed for a node. The first term is the
average node value; the second term includes the total number of simulations for that
node and for its parent. The first term favors the best node found so far in the
analyzed tree (exploitation), while the second is responsible for tree exploration. This
means that a node which has been rarely visited is more likely to be chosen, because the
value of the second term is greater.
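The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the value C = 1.0 and the choice to try unvisited nodes first are assumptions made for the example.

```python
import math

def ucb(successes, visits, parent_visits, c=1.0):
    """UCB value of a node: average value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # assumption: unvisited nodes are always tried first
    exploitation = successes / visits                            # S_i / t_i
    exploration = c * math.sqrt(math.log(parent_visits) / visits)  # C * sqrt(log T_i / t_i)
    return exploitation + exploration

def select_child(children, parent_visits, c=1.0):
    """Return the index of the child with the highest UCB value.

    children is a list of (successes, visits) pairs.
    """
    return max(range(len(children)),
               key=lambda i: ucb(children[i][0], children[i][1], parent_visits, c))

# The first child has the higher average, the second has far fewer visits.
children = [(60, 100), (5, 10)]
explore_choice = select_child(children, parent_visits=110, c=1.0)
exploit_choice = select_child(children, parent_visits=110, c=0.0)
```

With C = 0 only the average matters, so the first child is chosen; with C = 1 the rarely visited second child receives the larger exploration bonus and is chosen instead.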

2) Expansion: one or more successors of the selected node are added to the tree,
depending on the strategy. This step is not strict: our implementation adds one node per
iteration, but the number can be different.

3) Simulation: for the added node(s), simulation(s) are performed and the node values
(successes, total) are updated. In the CPU implementation, one simulation per iteration
is performed. In the GPU implementations, the number of simulations depends on the
number of threads, the number of blocks and the method (leaf or block parallelism). For
example, the number of simulations is 1024 per iteration for a configuration of 4 blocks
of 256 threads using the leaf parallelization method.

Figure 1. A single MCTS algorithm iteration's steps: Selection, Expansion, Simulation, Backpropagation (repeated while time is left)

4) Backpropagation: the parents' values are updated up to the root node. The numbers are
added, so that the root node holds the total number of simulations and successes for all
of the nodes, and each node contains the sum of the values of all of its successors. For
the root/block parallel methods, the root node has to be updated by summing up the
results from all other trees processed in parallel.
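The backpropagation step and the summation over parallel trees can be sketched as follows. This is an illustrative sketch only: the `Node` class, `backpropagate` and `merge_root_stats` are hypothetical names, and the alternation of the winning side between tree levels, which a two-player implementation has to handle, is deliberately left out.

```python
class Node:
    """Minimal tree node holding the two counters described in the text."""
    def __init__(self, parent=None):
        self.parent = parent
        self.successes = 0   # S_i
        self.visits = 0      # t_i

def backpropagate(node, wins, simulations):
    """Add simulation results to the node and to every ancestor up to the root."""
    while node is not None:
        node.successes += wins
        node.visits += simulations
        node = node.parent

def merge_root_stats(per_tree_stats):
    """Sum (successes, visits) per move over independently built trees.

    per_tree_stats: list of dicts mapping a move to (successes, visits).
    """
    total = {}
    for stats in per_tree_stats:
        for move, (s, t) in stats.items():
            s0, t0 = total.get(move, (0, 0))
            total[move] = (s0 + s, t0 + t)
    return total

root = Node()
child = Node(parent=root)
leaf = Node(parent=child)
backpropagate(leaf, wins=3, simulations=4)    # e.g. a batch of 4 simulations
backpropagate(child, wins=1, simulations=1)   # e.g. a single CPU simulation

merged = merge_root_stats([{"a": (3, 5), "b": (1, 5)},
                           {"a": (2, 4), "c": (4, 6)}])
```

Passing a batch size to `backpropagate` covers both the CPU case (one simulation per iteration) and the GPU case (many simulations per iteration) with the same update.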

III. GPU IMPLEMENTATION

In the GPU implementation, 2 approaches are considered and discussed. The first one
(Figure 2a) is simple leaf parallelization, where one GPU is dedicated to one MCTS tree
and each GPU thread performs an independent simulation from the same node. Such a
parallelization should provide much better accuracy when a great number of GPU threads
is considered. The second approach (Figure 2c) is the block parallelization method
proposed in this paper. It combines leaf parallelism with root parallelism. Root
parallelism (Figure 2b) is an efficient method of parallelizing MCTS on CPUs. It is more
efficient than simple leaf parallelization[3][4], because building more trees diminishes
the effect of being stuck in a local extremum and increases the chances of finding the
true global maximum. Therefore, having n processors, it is more efficient to build n
trees than to perform n parallel simulations in the same node.

Given that a problem can have many local maxima, starting from one point and performing
a search might not be very accurate in the basic MCTS case. Leaf parallelism should
diminish this effect by taking more samples from a given point. In root parallelism, a
single tree has the same properties as the tree in the sequential approach, except that
there are many trees and the chance of finding the global maximum increases with the
number of trees. Our proposed algorithm combines those two, so each search should be
more accurate and less local at the same time.

5) Leaf-parallel scheme: This is the simplest parallelization method in terms of
implementation. Here the GPU receives a root node from the CPU controlling process and
performs n simulations, where n depends on the dimensions of the grid (block size and
number of blocks). Afterwards the results are written to an array in the GPU's memory
(0 = loss, 1 = victory) and the CPU reads the results back.

Figure 2. An illustration of considered schemes: (a) leaf parallelism, n simulations; (b) root parallelism, n trees; (c) block parallelism, n = blocks (trees) x threads (simulations at once)

Figure 3. Correspondence of the algorithm to hardware. GPU hardware: multiprocessors (number of MPs fixed). GPU program: blocks (number of blocks configurable) of threads (number of threads configurable), with threads grouped into SIMD warps of 32 threads (fixed, for current hardware). Labels: root parallelism, leaf parallelism, block parallelism

Based on that, the obtained result is the same as in the basic CPU version, except that
the number of simulations is greater and the accuracy is better.
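The data flow of the leaf-parallel scheme can be emulated on the CPU as a rough sketch. Everything here is an assumption made for illustration: the loop stands in for GPU threads that would run concurrently, `random_playout` is a hypothetical stand-in for a random Reversi game, and the state is reduced to a single win probability.

```python
import random

def random_playout(state, rng):
    """Hypothetical stand-in for one random game: 1 = victory, 0 = loss."""
    return 1 if rng.random() < state["win_probability"] else 0

def leaf_parallel_simulate(state, blocks, threads_per_block, seed=0):
    """Emulate one kernel launch in which every thread simulates from the same node."""
    n = blocks * threads_per_block            # n depends on the grid dimensions
    results = [0] * n                         # result array, one slot per thread
    for thread_id in range(n):                # on a GPU these iterations run concurrently
        rng = random.Random(seed * n + thread_id)   # independent stream per thread
        results[thread_id] = random_playout(state, rng)
    return results

# 4 blocks of 256 threads give 1024 simulations from the same node.
results = leaf_parallel_simulate({"win_probability": 0.5}, blocks=4, threads_per_block=256)
wins, total = sum(results), len(results)      # the CPU reads the array back and aggregates
```

The aggregated pair (wins, total) is then used in the same way as the outcome of a single CPU simulation, only with a larger sample.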

6) Block-parallel scheme: To maximize the GPU's simulation performance, some
modifications had to be introduced. In this approach the threads are grouped and a fixed
number of them is dedicated to one tree. This method is motivated by the hierarchical
GPU architecture, where threads form small SIMD groups called warps and these warps in
turn form blocks (Figure 3). It is crucial to find the best possible job division scheme
to achieve high GPU performance. The trees are still controlled by the CPU threads; the
GPU only simulates. This means that at each simulation step of the algorithm all the GPU
threads start and end simulating at the same time, and that there is a sequential part
of the algorithm which slightly decreases the number of simulations per second when the
number of blocks is higher. This is caused by the necessity of managing each tree on the
CPU. On the other hand, the more trees, the better the performance. In our experiments
the smallest number of threads used is 32, which corresponds to the warp size.
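One step of the block-parallel scheme can be sketched in the same CPU-emulated style. The names and the toy win-probability state are assumptions for illustration; on a real GPU the outer loop would correspond to blocks and the inner loop to the threads of one block.

```python
import random

def block_parallel_step(tree_leaves, threads_per_block, seed=0):
    """Emulate one kernel launch: block b runs threads_per_block playouts for tree b.

    tree_leaves: one hypothetical leaf state per tree, selected by the CPU.
    Returns a list of (wins, simulations) pairs, one per tree.
    """
    per_tree = []
    for block_id, leaf in enumerate(tree_leaves):       # one block per tree
        wins = 0
        for thread_id in range(threads_per_block):      # threads of a block share a leaf
            rng = random.Random(seed + block_id * threads_per_block + thread_id)
            wins += 1 if rng.random() < leaf["win_probability"] else 0
        per_tree.append((wins, threads_per_block))
    return per_tree

# 4 trees (blocks) x 32 threads (one warp each) = 128 simulations per step.
leaves = [{"win_probability": p} for p in (0.0, 1.0, 0.5, 0.5)]
stats = block_parallel_step(leaves, threads_per_block=32)
total_simulations = sum(t for _, t in stats)
```

After each such step the CPU would backpropagate every (wins, simulations) pair in its own tree, which is the sequential part that grows with the number of blocks.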

A. Hybrid CPU-GPU processing

We observed that the trees formed by our algorithm using GPUs are not as deep as the
trees formed when CPUs and root parallelism are used. This is caused by the time spent
on each GPU kernel execution. The CPU performs quick single simulations, whereas the GPU
needs more time but runs thousands of threads at once. This would mean that the results
are less accurate, since the CPU tree grows faster in the direction of the optimal
solution. As a solution, we experimented with a hybrid CPU-GPU algorithm (Figure 4). In
this approach, the GPU kernel is called asynchronously and control is given back to the
CPU. The CPU then operates on the same tree (in the case of leaf parallelism) or trees
(block parallelism) to increase their depth. This means that while the GPU processes
some data, the CPU repeats the MCTS iterative process and checks for the GPU kernel's
completion.

Figure 4. Hybrid CPU-GPU processing scheme: after the kernel execution call, control returns to the CPU, which can work during the GPU kernel execution time until the GPU ready event; the tree contains nodes processed by the GPU and nodes expanded by the CPU in the meantime

Figure 5. Block parallelism vs Leaf parallelism, speed. X-axis: GPU threads (1 to 14336); Y-axis: simulations/second (0 to 9 x 10^5). Series: leaf parallelism (block size = 64), block parallelism (block size = 32), block parallelism (block size = 128)
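The control flow of the hybrid scheme (asynchronous launch, CPU work in the meantime, polling for completion) can be sketched with a worker thread standing in for the GPU. This is only an analogy under stated assumptions: `fake_gpu_kernel` and `cpu_mcts_iteration` are hypothetical placeholders, and the fixed result (512, 1024) is made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_gpu_kernel(duration_s):
    """Hypothetical stand-in for a GPU kernel: waits, then returns (wins, simulations)."""
    time.sleep(duration_s)
    return (512, 1024)

def cpu_mcts_iteration(tree):
    """Hypothetical stand-in for one sequential MCTS iteration on the CPU."""
    tree["cpu_iterations"] += 1

def hybrid_step(tree, executor, kernel_duration_s=0.05):
    """Launch the kernel asynchronously and keep iterating on the CPU until it is done."""
    future = executor.submit(fake_gpu_kernel, kernel_duration_s)  # asynchronous call
    while True:
        cpu_mcts_iteration(tree)      # the CPU deepens the tree in the meantime
        if future.done():             # check for kernel completion
            break
    wins, simulations = future.result()
    tree["gpu_wins"] += wins
    tree["gpu_simulations"] += simulations
    return tree

with ThreadPoolExecutor(max_workers=1) as executor:
    tree = hybrid_step({"cpu_iterations": 0, "gpu_wins": 0, "gpu_simulations": 0},
                       executor)
```

The loop performs at least one CPU iteration per kernel launch, so the tree keeps growing in depth even when the kernel finishes quickly.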

The results are presented here. Our test platform is the TSUBAME 2.0 supercomputer,
equipped with NVIDIA Tesla C2050 GPUs and Intel Xeon X5670 CPUs. We compare the speed
(Figure 5) and results (Figure 6) of leaf parallelism and block parallelism using
different block sizes. The block size and the number of blocks correspond to the
hardware's properties. In those graphs a GPU player is playing against one CPU core
running sequential MCTS. The main observation is that, despite running fewer simulations
in a given amount of time, block parallelism gives much better results than leaf
parallelism, whose maximal winning ratio stops at around 0.75 for 1024 threads (16
blocks of 64 threads). The results are better when the block size is smaller (32), but
only when the number of threads is small (up to 4096, 128 blocks/trees); beyond that,
the larger block size (128) performs better. It can be observed in Figure 5 that as we
decrease the number of threads per block and at the same time increase the number of
trees, the number of simulations per second decreases. This is due to the CPU's
sequential part.

In Figures 7 and 8 we also show a different type of result, where the X-axis represents
the current game step and the Y-axis is the average point difference between the 2
players.

In Figure 7 we observe that one GPU outperforms 256 CPUs in terms of both intermediate
and final scores. We also see that the characteristics of the results using CPUs and the
GPU are slightly different: the GPU is stronger at the beginning. We believe that this
might be caused by the larger search space, and therefore we conclude that later the
parallel effect of the GPU is weaker, as the number of distinct samples decreases.
Another reason is the aforementioned depth of the tree, which is lower in the GPU case.
We present this in Figure 7.

We also show that using our hybrid CPU/GPU approach both the tree depth and the result
are improved as expected, especially in the last phase of the game.

Figure 6. Block parallelism vs Leaf parallelism, final result. X-axis: GPU threads (1 to 14336); Y-axis: win ratio (0.2 to 1). Series: leaf parallelism (block size = 64), block parallelism (block size = 32), block parallelism (block size = 128)

Figure 7. GPU vs root-parallel CPUs. X-axis: game step (1 to 61); Y-axis: point difference (our score - opponent's score). Series: 2, 4, 8, 16, 32, 64, 128 and 256 cpus; 1 GPU with block parallelism (block size = 128)

IV. CONCLUSION

We introduced an algorithm called block-parallelism, which makes it possible to run
Monte Carlo Tree Search efficiently on GPUs, achieving results comparable with a hundred
CPU cores (Figure 7). Block-parallelism is not flawless and not completely parallel:
since at most one CPU controls one GPU, a certain part of the algorithm has to be
processed sequentially, which decreases the performance. We show that block-parallelism
performs better than leaf-parallelism on a GPU and is probably the optimal solution
unless the hardware limitations change. We also show that using the CPU and the GPU at
the same time gives better results. There are challenges ahead, such as the unknown
scalability and universality of the algorithm. In Figure 9 we present preliminary
results on the scaling of multi-GPU configurations using MPI.

V. FUTURE WORK

• Application of the algorithm to other domains. A more general task can and should be
solved by the algorithm.

• Scalability analysis. This is a major challenge and requires analyzing a certain
number of parameters and their effect on the overall performance. We have implemented an
MPI-GPU version of the algorithm, but the results are inconclusive; there are several
reasons why the scalability can be limited, including Reversi itself.

Figure 8. Hybrid CPU/GPU vs GPU-only processing. Panels: points vs game step and tree depth vs game step. Series: GPU + CPU, GPU

Figure 9. Multi GPU Results - based on MPI communication scheme. X-axis: number of GPUs (1 to 32, 112 blocks x 64 threads); Y-axes: simulations/second and average point difference

ACKNOWLEDGEMENTS

This work is partially supported by Core Research of Evolutional Science and Technology
(CREST) project "ULP-HPC: Ultra Low-Power, High-Performance Computing via Modeling and
Optimization of Next Generation HPC Technologies" of Japan Science and Technology Agency
(JST) and Grant-in-Aid for Scientific Research of MEXT Japan.

REFERENCES

[1] Monte Carlo Tree Search (MCTS) research hub, http://www.mcts-hub.net/index.html

[2] Kocsis L., Szepesvari C.: Bandit based Monte-Carlo Planning, 15th European Conference on Machine Learning Proceedings, 2006

[3] Guillaume M.J-B. Chaslot, Mark H.M. Winands, and H. Jaap van den Herik: Parallel Monte-Carlo Tree Search, Computers and Games: 6th International Conference, 2008

[4] Rocki K., Suda R.: Massively Parallel Monte Carlo Tree Search, Proceedings of the 9th International Meeting High Performance Computing for Computational Science, 2010

[5] Coulom R.: Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search, 5th International Conference on Computer and Games, 2006

[6] Romaric Gaudel, Michèle Sebag: Feature Selection as a one-player game, 2010

[7] Guillaume Chaslot, Steven Jong, Jahn-Takeshi Saito, Jos Uiterwijk: Monte-Carlo Tree Search in Production Management Problems, 2006

[8] O. Teytaud et al.: High-dimensional planning with Monte-Carlo Tree Search, 2008

[9] Maarten P.D. Schadd, Mark H.M. Winands, H. Jaap van den Herik, Guillaume M.J-B. Chaslot, and Jos W.H.M (2008)
